Project 6
Hacker News is a popular website run by Y Combinator. It’s widely known by people in the tech industry as a community site for sharing news, showing off projects, asking questions, among other things.
In this project, you will be working with a table named hacker_news
that contains stories from Hacker News since its launch in 2007.
This data was kindly made publicly available under the MIT license.
1. Click on the following link to open the Online SQL Editor.
2. Click on File>Open DB and select the file hackerNews.db
in ClassesICT
3. Start by getting a feel for the hacker_news table:
What are the column names?
- title: the title of the story
- user: the user who submitted the story
- score: the score of the story
- timestamp: the time of the story
- url: the link of the story
4. What are the top five stories with the highest scores?
Recent studies have found that online forums tend to be dominated by a small percentage of their users (1-9-90 Rule). Is this true of Hacker News? Is a small percentage of Hacker News submitters taking the majority of the points?
5. First, find the total score of all the stories.
6. Next, we need to pinpoint the users who have accumulated a lot of points across their stories. Find the individual users who have gotten combined scores of more than 200, and their combined scores.
GROUP BY
and HAVING
are needed!
Info
Now if we add these users' scores together and divide by the total we got in the previous query, we will know if Hacker News dominated by these users!
7. Oh no! While we are looking at the power users, some users are rickrolling — tricking readers into clicking on a link to a funny video and claiming that it links to information about coding. The url of the video is: https://www.youtube.com/watch?v=dQw4w9WgXcQ
How many times has each offending user posted this link?
Which sites feed Hacker News?
Hacker News stories are essentially links that take users to other websites. Which of these sites feed Hacker News the most: GitHub, Medium, or New York Times?
8. First, we want to categorize each story based on their source. We can do this using a CASE statement:
SELECT CASE
WHEN url LIKE '%github.com%' THEN 'GitHub'
-- WHEN statement here
-- WHEN statement here
-- ELSE statement here
END AS 'Source'
FROM hacker_news;
Fill in the other WHEN
statements and the ELSE
statement.
9. Next, build on the previous query:
- Add a column for the number of stories from each URL using
COUNT()
. - Also,
GROUP BY
theCASE
statement.
Remember that you can refer to a column in GROUP BY using a number.
What's the best time to post a story?
Every submitter wants their story to get a high score so that the story makes it to the front page, but… What’s the best time of the day to post a story on Hacker News?
10. Before we get started, let’s run this query and take a look at the timestamp column:
Notice that the values are formatted like:
2018-05-08T12:30:00Z
If you ignore the T and Z, the format is:
YYYY-MM-DD HH:MM:SS
11. SQLite comes with a strftime()
function - a very powerful function that allows you to return a formatted date.
It takes two arguments: strftime(format, column)
Let’s test this function out:
12. Okay, now we understand how strftime()
works.
Let’s write a query that returns three columns:
- The hours of the timestamp
- The average score for each hour
- The count of stories for each hour
13. Let’s edit a few things in the previous query:
- Round the average scores using
ROUND()
. - Rename the columns to make it more readable using
AS
. - Add a
WHERE
clause to filter out theNULL
values in timestamp.
Take a look at the result again: What are the best hours to post a story on Hacker News?