Aggregate Functions
We’ve learned how to write queries to retrieve information from the database. Now, we are going to learn how to perform calculations using SQL.
Calculations performed on multiple rows of a table are called aggregates.
- Click on the following link to open the Online SQL Editor
- Click on File>Open DB and select the file
fake_apps.db
inClassesICT
. - In the editor, type the following:
What are the column names?
- id
- name
- category
- downloads
- price
- Type the following commands in the SQL editor to try and find out what they do:
COUNT
SUM
MAX
AVERAGE
ROUND
Group By
Based on criteria
Oftentimes, we will want to calculate an aggregate for data with certain characteristics.
For instance, if we go back to our movie
database, we might want to know the mean IMDb ratings for all movies each year. We could calculate each number by a series of queries with different WHERE
statements, like so:
SELECT AVG(imdb_rating)
FROM movie
WHERE year = 1999;
SELECT AVG(imdb_rating)
FROM movie
WHERE year = 2000;
SELECT AVG(imdb_rating)
FROM movie
WHERE year = 2001;
and so on.
Luckily, there’s a better way!
We can use GROUP BY
to do this in a single step:
GROUP BY
is a clause in SQL that is used with aggregate functions. It is used in collaboration with the SELECT
statement to arrange identical data into groups.
The GROUP BY
statement comes after any WHERE
statements, but before ORDER BY
or LIMIT
.
Based on calculation
Sometimes, we want to GROUP BY
a calculation done on a column.
For instance, we might want to know how many movies have IMDb ratings that round to 1, 2, 3, 4, 5. We could do this using the following syntax:
SELECT ROUND(imdb_rating),
COUNT(name)
FROM movie
GROUP BY ROUND(imdb_rating)
ORDER BY ROUND(imdb_rating);
However, this query may be time-consuming to write and more prone to error.
SQL lets us use column reference(s) in our GROUP BY
that will make our lives easier.
- 1 is the first column selected
- 2 is the second column selected
- 3 is the third column selected
and so on.
The following query is equivalent to the one above:
Here, the 1 refers to the first column in our SELECT
statement, ROUND(imdb_rating)
.
Having
In addition to being able to group data using GROUP BY
, SQL also allows you to filter which groups to include and which to exclude.
For instance, imagine that we want to see how many movies of different genres were produced each year, but we only care about years and genres with at least 10 movies.
We can’t use WHERE
here because we don’t want to filter the rows; we want to filter groups.
This is where HAVING
comes in.
HAVING
is very similar to WHERE
. In fact, all types of WHERE
clauses you learned about thus far can be used with HAVING
.
We can use the following for the problem:
HAVING
statement always comes after GROUP BY
, but before ORDER BY
and LIMIT
.
WHERE vs HAVING
- When we want to limit the results of a query based on values of the individual rows, use
WHERE
. - When we want to limit the results of a query based on an aggregate property, use
HAVING
.
Exercises
Exercise 1
- Using the
fake_apps
db, in the code editor, type:
Here, our aggregate function is COUNT()
and we arranged price into groups.
What do you expect the result to be?
-
In the previous query, add a
WHERE
clause to count the total number of apps that have been downloaded more than 20 000 times, at each price. -
Write a new query that calculates the total number of downloads for each category. Select
category
andSUM(downloads)
.
Exercise 2
Suppose we have the query below:
Write the exact query, but use column reference numbers instead of column names after GROUP BY
.
Exercise 3
Suppose we have the query below:
It returns the average downloads (rounded) and the number of apps – at each price point.
However, certain price points don’t have very many apps, so their average downloads are less meaningful.
Add a HAVING
clause to restrict the query to price points that have more than 10 apps.