Aggregate Functions

We’ve learned how to write queries to retrieve information from the database. Now, we are going to learn how to perform calculations using SQL.

Calculations performed on multiple rows of a table are called aggregates.

Click on the following link to open the Online SQL Editor
Click on File>Open DB and select the file fake_apps.db in ClassesICT.
In the editor, type the following:

SQL

SELECT * FROM fake_apps;

What are the column names?

id
name
category
downloads
price

Type the following commands in the SQL editor to try and find out what they do:

COUNT

SQL

SELECT COUNT(*)
FROM fake_apps;

SUM

SQL

SELECT SUM(downloads)
FROM fake_apps;

MAX

SQL

SELECT MAX(downloads)
FROM fake_apps;

AVERAGE

SQL

SELECT AVG(downloads)
FROM fake_apps;

ROUND

SQL

SELECT ROUND(price, 0)
FROM fake_apps;

Group By

Based on criteria

Oftentimes, we will want to calculate an aggregate for data with certain characteristics.

For instance, if we go back to our movie database, we might want to know the mean IMDb ratings for all movies each year. We could calculate each number by a series of queries with different WHERE statements, like so:

SQL

SELECT AVG(imdb_rating)
FROM movie
WHERE year = 1999;

SELECT AVG(imdb_rating)
FROM movie
WHERE year = 2000;

SELECT AVG(imdb_rating)
FROM movie
WHERE year = 2001;

and so on.

Luckily, there’s a better way!

We can use GROUP BY to do this in a single step:

SQL

SELECT year,
   AVG(imdb_rating)
FROM movie
GROUP BY year
ORDER BY year;

GROUP BY is a clause in SQL that is used with aggregate functions. It is used in collaboration with the SELECT statement to arrange identical data into groups.

The GROUP BY statement comes after any WHERE statements, but before ORDER BY or LIMIT.

Based on calculation

Sometimes, we want to GROUP BY a calculation done on a column.

For instance, we might want to know how many movies have IMDb ratings that round to 1, 2, 3, 4, 5. We could do this using the following syntax:

SQL

SELECT ROUND(imdb_rating),
   COUNT(name)
FROM movie
GROUP BY ROUND(imdb_rating)
ORDER BY ROUND(imdb_rating);

However, this query may be time-consuming to write and more prone to error.

SQL lets us use column reference(s) in our GROUP BY that will make our lives easier.

1 is the first column selected
2 is the second column selected
3 is the third column selected

and so on.

The following query is equivalent to the one above:

SQL

SELECT ROUND(imdb_rating),
   COUNT(name)
FROM movie
GROUP BY 1
ORDER BY 1;

Here, the 1 refers to the first column in our SELECT statement, ROUND(imdb_rating).

Having

In addition to being able to group data using GROUP BY, SQL also allows you to filter which groups to include and which to exclude.

For instance, imagine that we want to see how many movies of different genres were produced each year, but we only care about years and genres with at least 10 movies.

We can’t use WHERE here because we don’t want to filter the rows; we want to filter groups.

This is where HAVING comes in.

HAVING is very similar to WHERE. In fact, all types of WHERE clauses you learned about thus far can be used with HAVING.

We can use the following for the problem:

SQL

SELECT year,
   genre,
   COUNT(name)
FROM movie
GROUP BY 1, 2
HAVING COUNT(name) > 3;

HAVING statement always comes after GROUP BY, but before ORDER BY and LIMIT.

WHERE vs HAVING

When we want to limit the results of a query based on values of the individual rows, use WHERE.
When we want to limit the results of a query based on an aggregate property, use HAVING.

Exercises

Exercise 1

Using the fake_apps db, in the code editor, type:

SQL

SELECT price, COUNT(*) 
FROM fake_apps
GROUP BY price;

Here, our aggregate function is COUNT() and we arranged price into groups. What do you expect the result to be?

In the previous query, add a WHERE clause to count the total number of apps that have been downloaded more than 20 000 times, at each price.
Write a new query that calculates the total number of downloads for each category. Select category and SUM(downloads).

Exercise 2

Suppose we have the query below:

SQL

SELECT category, 
   price,
   AVG(downloads)
FROM fake_apps
GROUP BY category, price;

Write the exact query, but use column reference numbers instead of column names after GROUP BY.

Exercise 3

Suppose we have the query below:

SQL

SELECT price, 
   ROUND(AVG(downloads)),
   COUNT(*)
FROM fake_apps
GROUP BY price;

It returns the average downloads (rounded) and the number of apps – at each price point.

However, certain price points don’t have very many apps, so their average downloads are less meaningful.

Add a HAVING clause to restrict the query to price points that have more than 10 apps.