Aggregation
Week #7: Articles
Learning Outcomes
- Learn how to write queries that use aggregation.
- Learn how to write queries that use aggregation for only some columns.
Queries that Aggregate
Aggregation means counting, adding, and grouping results of COUNT
, SUM
, AVERAGE
, MIN
, and MAX
functions. Aggregation queries add one or two more clauses and return non-aggregated columns with the GROUP BY
and they filter aggregated result sets with the HAVING
clauses.
The GROUP BY
clause must refer to all nonaggregated columns in the SELECT
list, because they’re not unique, and there’s no sense in returning all the rows when you need only one row with the nonunique columns and the aggregated result. The GROUP BY
clause instructs the database to do exactly that: return only distinct versions of nonunique columns with the aggregated result columns. The GROUP BY
clause runs after the query has identified all rows and columns. The COUNT
function takes an asterisk (*) as its single argument. The * represents an indirection operator that points to rows returned by the query. The * is equivalent to the ROWID
pseudocolumn in Oracle. It counts rows whether a row contain any values or not.
After the database returns the aggregated result set, the HAVING
clause filters the result set. In the example, it returns only those aggregated results that have two or more rows in the table. The ORDER BY
clause then sorts the return set.
The following list qualifies the ANSI SQL pattern for processing a single table query with aggregation and a GROUP BY
and HAVING
clause:
- It finds a table in the
FROM
clause. - It optionally assigns a table alias as a runtime placeholder for the table name.
- It gets the table definition from the data catalog to determine the valid column names.
- If a table alias is present (and it is), it optionally maps the alias to the table’s data catalog definition.
- It filters rows into the result set based on the value of columns in the
WHERE
clause. - The list of columns in the
SELECT
clause filters the desired columns from the complete set of columns in a row. - The aggregation function triggers a check for a
GROUP BY
clause when nonaggregated columns are returned in theSELECT
-list and then aggregates results. - The
HAVING
operator filters the result set from the aggregation or theGROUP BY
aggregation. - If an
ORDER BY
clause occurs in the query, rows are sorted by the designated columns.
We’ll work through the basic aggregation steps most developers use frequently. They cover the COUNT
, SUM
, AVERAGE
, MAX
, and MIN
functions. The following discussions use two sets of ordinal and cardinal numbers (some values are not displayed to save space) that are stored in the ordinal table, like so:
ID LIST_SET LIST_NAME LIST_VALUE ---------- -------------------- ---------- ---------- 1 Value Set A Zero 0 2 Value Set A One 1 3 Value Set A Two 2 4 Value Set A Three 3 5 Value Set A Four 4 6 Value Set A Five 5 7 Value Set A Six 6 8 Value Set A Seven 7 9 Value Set A Eight 8 10 Value Set A Nine 9 11 Value Set A 12 Value Set B Zero 0 13 Value Set B One 1 ... 21 Value Set B Nine 9 22 Value Set B |
You’ve been exposed to the data set to help you understand how the aggregation functions work in the following subsections.
Aggregate Columns Only
The COUNT
function has two behaviors: counting by reference and counting by value. They differ on how they treat null values. You count the number of physical rows when you count by reference, and you count the physical values when you count by value.
The count by reference example counts the number of rows in the ordinal
table, like this:
SQL> SELECT COUNT(*) AS number_of_rows FROM ordinal; |
It returns the following:
NUMBER_OF_ROWS -------------- 22 |
The count by value example counts the values in the list_value column. The list_ value column contains two null values. The column name is substituted for the asterisk, like this:
SQL> SELECT COUNT(list_value) AS number_of_values FROM ordinal; |
It returns the following:
NUMBER_OF_VALUES ---------------- 20 |
The return set is two less than the number of rows because the COUNT
function doesn’t count null values. You can also count all values (which is the default performed in the preceding example) or distinct values only. Both approaches exclude null values.
The following query demonstrates counting using the default, an explicit ALL
, and DISTINCT
number of values found in the list_name
and list_value columns:
SQL> SELECT COUNT(list_name) AS default_number 2 , COUNT(ALL list_name) AS explicit_number 3 , COUNT(DISTINCT list_value) AS distinct_number 4 FROM ordinal; |
Here are the results:
DEFAULT_NUMBER EXPLICIT_NUMBER DISTINCT_NUMBER -------------- --------------- --------------- 20 20 10 |
Notice that the COUNT
function returns the same number with or without the ALL keyword. That’s because the default is ALL, which is provided when you don’t use it. It also counts the occurrences of strings or numbers. When ALL is specified, you count each individual element, not just the unique set of elements. The DISTINCT
keyword forces a unique sort of the data set before counting the results.
The SUM
, AVG
, MAX
, and MIN
functions work only with numbers. The following demonstrates the SUM
and AVG
functions against the list_value
column:
SQL> SELECT SUM(ALL list_value) AS sum_all 2 , SUM(DISTINCT list_value) AS sum_distinct 3 , AVG(ALL list_value) AS avg_all 4 , AVG(DISTINCT list_value) AS avg_distinct 5 FROM ordinal; |
Here’s the result set:
SUM_ALL SUM_DISTINCT AVG_ALL AVG_DISTINCT ---------- ------------ ---------- ------------ 90 45 4.5 4.5 |
The sum of two sets of the ordinal numbers is 90, and the sum of one set is 45. The average of all or the distinct set is naturally the same.
The next example runs the MAX
and MIN
functions:
SQL> SELECT MIN(ALL list_value) AS min_all 2 , MIN(DISTINCT list_value) AS min_distinct 3 , MAX(ALL list_value) AS max_all 4 , MAX(DISTINCT list_value) AS max_distinct 5 FROM ordinal; |
It produces these results:
MIN_ALL MIN_DISTINCT MAX_ALL MAX_DISTINCT ---------- ------------ ---------- ------------ 0 0 9 9 |
The minimum or maximum of two sets of the same group of numbers is always the same. The minimum is 0 and the maximum is 9 for ordinal numbers.
Aggregate and Nonaggregate Columns
The principal of returning aggregate and nonaggregate columns starts with understanding that you get only one row when you add a column of numbers. By extension, you get one row for every type of thing you count. A real-world example of that would be counting a bag of fruit. You separate the fruit into groups, such as apples, oranges, pears, and apricots. Then you count the number of each type of fruit.
The following counts the number of rows and values for each unique value in the list_set column:
SQL> SELECT list_set AS grouping_by_column 2 , COUNT(*) 3 , COUNT(list_value) 4 FROM ordinal 5 GROUP BY list_set; |
And here are the results of this query:
GROUPING_BY_COLUMN COUNT(*) COUNT(LIST_VALUE) -------------------- ---------- ----------------- VALUE SET A 11 10 VALUE SET B 11 10 |
The results tells you that you have 11 rows in each group and only 10 values, which means each group has one row that contains a null value. You change the SELECT
-list and the GROUP BY
clause when you want to identify the rows with the null values.
The following query returns a 0 when the value is null and returns a 1 otherwise:
SQL> SELECT list_set AS grouping_by_not_null 2 , list_name AS group_by_null_too 3 , COUNT(*) 4 , COUNT(list_value) 5 FROM ordinal 6 WHERE list_set = 'Value Set A' 7 GROUP BY list_set 8 , list_name; |
And here are the results from the query:
GROUPING_BY_NOT_NULL GROUP COUNT(*) COUNT(LIST_VALUE) -------------------- ----- ---------- ----------------- Value Set A Zero 1 1 Value Set A Five 1 1 Value Set A Three 1 1 Value Set A Four 1 1 Value Set A One 1 1 Value Set A Two 1 1 Value Set A Eight 1 1 Value Set A Nine 1 1 Value Set A Seven 1 1 Value Set A Six 1 1 Value Set A 1 0 |
The only problem with the return set is that the cardinal numbers aren’t in numeric order. That requires a special ORDER BY
clause with a CASE
statement. You could add the following to the last query to get them sorted into numeric order:
9 ORDER BY CASE 10 WHEN list_name = 'Zero' THEN 0 11 WHEN list_name = 'One' THEN 1 12 WHEN list_name = 'Two' THEN 2 13 WHEN list_name = 'Three' THEN 3 14 WHEN list_name = 'Four' THEN 4 15 WHEN list_name = 'Five' THEN 5 16 WHEN list_name = 'Six' THEN 6 17 WHEN list_name = 'Seven' THEN 7 18 WHEN list_name = 'Eight' THEN 8 19 WHEN list_name = 'Nine' THEN 9 20 END; |
This type of ORDER BY
clause lets you achieve numeric ordering without changing any of the data. Note that null values are always sorted last in an ascending sort and first in a descending sort.
The following query demonstrates the GROUP BY
for the SUM
, AVG
, MAX
, and MIN
functions:
SQL> SELECT list_set AS grouping_by_not_null 2 , SUM(list_value) AS ordinal_sum 3 , AVG(list_value) AS ordinal_avg 4 , MIN(list_value) AS ordinal_min 5 , MAX(list_value) AS ordinal_max 6 FROM ordinal 7 GROUP BY list_set; |
It displays the following:
GROUPING_BY_NOT_NULL ORDINAL_SUM ORDINAL_AVG ORDINAL_MIN ORDINAL_MAX -------------------- ----------- ----------- ----------- ----------- Value Set A 45 4.5 0 9 Value Set B 45 4.5 0 9 |
This returns the expected result set from the functions. They naturally match for each set of ordinal numbers. If you were to alter the data set, you could get different results.