Database Tutorial

Course Tutorial Site

Site Admin

Aggregation

without comments

Week #7: Articles

Learning Outcomes

  • Learn how to write queries that use aggregation.
  • Learn how to write queries that use aggregation for only some columns.

Queries that Aggregate

Aggregation means counting, adding, and grouping results of COUNT, SUM, AVERAGE, MIN, and MAX functions. Aggregation queries add one or two more clauses and return non-aggregated columns with the GROUP BY and they filter aggregated result sets with the HAVING clauses.

The GROUP BY clause must refer to all nonaggregated columns in the SELECT list, because they’re not unique, and there’s no sense in returning all the rows when you need only one row with the nonunique columns and the aggregated result. The GROUP BY clause instructs the database to do exactly that: return only distinct versions of nonunique columns with the aggregated result columns. The GROUP BY clause runs after the query has identified all rows and columns. The COUNT function takes an asterisk (*) as its single argument. The * represents an indirection operator that points to rows returned by the query. The * is equivalent to the ROWID pseudocolumn in Oracle. It counts rows whether a row contain any values or not.

After the database returns the aggregated result set, the HAVING clause filters the result set. In the example, it returns only those aggregated results that have two or more rows in the table. The ORDER BY clause then sorts the return set.

The following list qualifies the ANSI SQL pattern for processing a single table query with aggregation and a GROUP BY and HAVING clause:

  1. It finds a table in the FROM clause.
  2. It optionally assigns a table alias as a runtime placeholder for the table name.
  3. It gets the table definition from the data catalog to determine the valid column names.
  4. If a table alias is present (and it is), it optionally maps the alias to the table’s data catalog definition.
  5. It filters rows into the result set based on the value of columns in the WHERE clause.
  6. The list of columns in the SELECT clause filters the desired columns from the complete set of columns in a row.
  7. The aggregation function triggers a check for a GROUP BY clause when nonaggregated columns are returned in the SELECT-list and then aggregates results.
  8. The HAVING operator filters the result set from the aggregation or the GROUP BY aggregation.
  9. If an ORDER BY clause occurs in the query, rows are sorted by the designated columns.

We’ll work through the basic aggregation steps most developers use frequently. They cover the COUNT, SUM, AVERAGE, MAX, and MIN functions. The following discussions use two sets of ordinal and cardinal numbers (some values are not displayed to save space) that are stored in the ordinal table, like so:

        ID LIST_SET             LIST_NAME  LIST_VALUE
---------- -------------------- ---------- ----------
         1 Value Set A          Zero                0
         2 Value Set A          One                 1
         3 Value Set A          Two                 2
         4 Value Set A          Three               3
         5 Value Set A          Four                4
         6 Value Set A          Five                5
         7 Value Set A          Six                 6
         8 Value Set A          Seven               7
         9 Value Set A          Eight               8
        10 Value Set A          Nine                9
        11 Value Set A
        12 Value Set B          Zero                0 
        13 Value Set B          One                 1
        ...
        21 Value Set B          Nine                9
        22 Value Set B

You’ve been exposed to the data set to help you understand how the aggregation functions work in the following subsections.

Aggregate Columns Only

The COUNT function has two behaviors: counting by reference and counting by value. They differ on how they treat null values. You count the number of physical rows when you count by reference, and you count the physical values when you count by value.

The count by reference example counts the number of rows in the ordinal table, like this:

SQL> SELECT COUNT(*) AS number_of_rows FROM ordinal;

It returns the following:

NUMBER_OF_ROWS
--------------
            22

The count by value example counts the values in the list_value column. The list_ value column contains two null values. The column name is substituted for the asterisk, like this:

SQL> SELECT COUNT(list_value) AS number_of_values FROM ordinal;

It returns the following:

NUMBER_OF_VALUES
----------------
              20

The return set is two less than the number of rows because the COUNT function doesn’t count null values. You can also count all values (which is the default performed in the preceding example) or distinct values only. Both approaches exclude null values.

The following query demonstrates counting using the default, an explicit ALL, and DISTINCT number of values found in the list_name and list_value columns:

SQL> SELECT COUNT(list_name) AS default_number
  2  ,      COUNT(ALL list_name) AS explicit_number
  3  ,      COUNT(DISTINCT list_value) AS distinct_number
  4  FROM   ordinal;

Here are the results:

DEFAULT_NUMBER EXPLICIT_NUMBER DISTINCT_NUMBER
-------------- --------------- ---------------
            20              20              10

Notice that the COUNT function returns the same number with or without the ALL keyword. That’s because the default is ALL, which is provided when you don’t use it. It also counts the occurrences of strings or numbers. When ALL is specified, you count each individual element, not just the unique set of elements. The DISTINCT keyword forces a unique sort of the data set before counting the results.

The SUM, AVG, MAX, and MIN functions work only with numbers. The following demonstrates the SUM and AVG functions against the list_value column:

SQL> SELECT   SUM(ALL list_value) AS sum_all
  2  ,        SUM(DISTINCT list_value) AS sum_distinct
  3  ,        AVG(ALL list_value) AS avg_all
  4  ,        AVG(DISTINCT list_value) AS avg_distinct
  5  FROM     ordinal;

Here’s the result set:

   SUM_ALL SUM_DISTINCT    AVG_ALL AVG_DISTINCT
---------- ------------ ---------- ------------
        90           45        4.5          4.5

The sum of two sets of the ordinal numbers is 90, and the sum of one set is 45. The average of all or the distinct set is naturally the same.

The next example runs the MAX and MIN functions:

SQL> SELECT   MIN(ALL list_value) AS min_all
  2  ,        MIN(DISTINCT list_value) AS min_distinct
  3  ,        MAX(ALL list_value) AS max_all
  4  ,        MAX(DISTINCT list_value) AS max_distinct
  5  FROM     ordinal;

It produces these results:

MIN_ALL    MIN_DISTINCT MAX_ALL    MAX_DISTINCT
---------- ------------ ---------- ------------
         0            0          9            9

The minimum or maximum of two sets of the same group of numbers is always the same. The minimum is 0 and the maximum is 9 for ordinal numbers.

Aggregate and Nonaggregate Columns

The principal of returning aggregate and nonaggregate columns starts with understanding that you get only one row when you add a column of numbers. By extension, you get one row for every type of thing you count. A real-world example of that would be counting a bag of fruit. You separate the fruit into groups, such as apples, oranges, pears, and apricots. Then you count the number of each type of fruit.

The following counts the number of rows and values for each unique value in the list_set column:

SQL> SELECT   list_set AS grouping_by_column
  2  ,        COUNT(*)
  3  ,        COUNT(list_value)
  4  FROM     ordinal
  5  GROUP BY list_set;

And here are the results of this query:

GROUPING_BY_COLUMN     COUNT(*) COUNT(LIST_VALUE)
-------------------- ---------- -----------------
VALUE SET A                  11                10
VALUE SET B                  11                10

The results tells you that you have 11 rows in each group and only 10 values, which means each group has one row that contains a null value. You change the SELECT-list and the GROUP BY clause when you want to identify the rows with the null values.

The following query returns a 0 when the value is null and returns a 1 otherwise:

SQL> SELECT   list_set AS grouping_by_not_null
  2  ,        list_name AS group_by_null_too
  3  ,        COUNT(*)
  4  ,        COUNT(list_value)
  5  FROM     ordinal
  6  WHERE    list_set = 'Value Set A'
  7  GROUP BY list_set
  8  ,        list_name;

And here are the results from the query:

GROUPING_BY_NOT_NULL GROUP   COUNT(*) COUNT(LIST_VALUE)
-------------------- ----- ---------- -----------------
Value Set A          Zero           1                 1
Value Set A          Five           1                 1
Value Set A          Three          1                 1
Value Set A          Four           1                 1
Value Set A          One            1                 1
Value Set A          Two            1                 1
Value Set A          Eight          1                 1
Value Set A          Nine           1                 1
Value Set A          Seven          1                 1
Value Set A          Six            1                 1
Value Set A                         1                 0

The only problem with the return set is that the cardinal numbers aren’t in numeric order. That requires a special ORDER BY clause with a CASE statement. You could add the following to the last query to get them sorted into numeric order:

  9  ORDER BY CASE
 10             WHEN list_name = 'Zero'  THEN 0
 11             WHEN list_name = 'One'   THEN 1
 12             WHEN list_name = 'Two'   THEN 2
 13             WHEN list_name = 'Three' THEN 3
 14             WHEN list_name = 'Four'  THEN 4
 15             WHEN list_name = 'Five'  THEN 5
 16             WHEN list_name = 'Six'   THEN 6
 17             WHEN list_name = 'Seven' THEN 7
 18             WHEN list_name = 'Eight' THEN 8
 19             WHEN list_name = 'Nine'  THEN 9
20 END;

This type of ORDER BY clause lets you achieve numeric ordering without changing any of the data. Note that null values are always sorted last in an ascending sort and first in a descending sort.

The following query demonstrates the GROUP BY for the SUM, AVG, MAX, and MIN functions:

SQL> SELECT   list_set AS grouping_by_not_null
  2  ,        SUM(list_value) AS ordinal_sum
  3  ,        AVG(list_value) AS ordinal_avg
  4  ,        MIN(list_value) AS ordinal_min
  5  ,        MAX(list_value) AS ordinal_max
  6  FROM     ordinal
  7  GROUP BY list_set;

It displays the following:

GROUPING_BY_NOT_NULL ORDINAL_SUM ORDINAL_AVG ORDINAL_MIN ORDINAL_MAX
-------------------- ----------- ----------- ----------- -----------
Value Set A                   45         4.5           0           9
Value Set B                   45         4.5           0           9

This returns the expected result set from the functions. They naturally match for each set of ordinal numbers. If you were to alter the data set, you could get different results.

Written by michaelmclaughlin

August 13th, 2018 at 3:10 pm

Posted in