Project 6

Hacker News is a popular website run by Y Combinator. It’s widely known by people in the tech industry as a community site for sharing news, showing off projects, asking questions, among other things.

In this project, you will be working with a table named hacker_news that contains stories from Hacker News since its launch in 2007.

This data was kindly made publicly available under the MIT license.

1. Click on the following link to open the Online SQL Editor.

2. Click on File>Open DB and select the file hackerNews.db in ClassesICT

3. Start by getting a feel for the hacker_news table:

SQL
SELECT *
FROM hacker_news;
What are the column names?
  • title: the title of the story
  • user: the user who submitted the story
  • score: the score of the story
  • timestamp: the time of the story
  • url: the link of the story

4. What are the top five stories with the highest scores?


Recent studies have found that online forums tend to be dominated by a small percentage of their users (1-9-90 Rule). Is this true of Hacker News? Is a small percentage of Hacker News submitters taking the majority of the points?

5. First, find the total score of all the stories.

6. Next, we need to pinpoint the users who have accumulated a lot of points across their stories. Find the individual users who have gotten combined scores of more than 200, and their combined scores.

GROUP BY and HAVING are needed!

Info

Now if we add these users' scores together and divide by the total we got in the previous query, we will know if Hacker News dominated by these users!

7. Oh no! While we are looking at the power users, some users are rickrolling — tricking readers into clicking on a link to a funny video and claiming that it links to information about coding. The url of the video is: https://www.youtube.com/watch?v=dQw4w9WgXcQ

How many times has each offending user posted this link?


Which sites feed Hacker News?

Hacker News stories are essentially links that take users to other websites. Which of these sites feed Hacker News the most: GitHub, Medium, or New York Times?

8. First, we want to categorize each story based on their source. We can do this using a CASE statement:

SQL
SELECT CASE
   WHEN url LIKE '%github.com%' THEN 'GitHub'
   -- WHEN statement here
   -- WHEN statement here
   -- ELSE statement here
  END AS 'Source'
FROM hacker_news;

Fill in the other WHEN statements and the ELSE statement.

9. Next, build on the previous query:

  • Add a column for the number of stories from each URL using COUNT().
  • Also, GROUP BY the CASE statement.

Remember that you can refer to a column in GROUP BY using a number.


What's the best time to post a story?

Every submitter wants their story to get a high score so that the story makes it to the front page, but… What’s the best time of the day to post a story on Hacker News?

10. Before we get started, let’s run this query and take a look at the timestamp column:

SQL
SELECT timestamp
FROM hacker_news
LIMIT 10;

Notice that the values are formatted like: 2018-05-08T12:30:00Z If you ignore the T and Z, the format is: YYYY-MM-DD HH:MM:SS

11. SQLite comes with a strftime() function - a very powerful function that allows you to return a formatted date.

It takes two arguments: strftime(format, column)

Let’s test this function out:

SQL
SELECT timestamp,
   strftime('%H', timestamp)
FROM hacker_news
GROUP BY 1
LIMIT 20;

12. Okay, now we understand how strftime() works. Let’s write a query that returns three columns:

  • The hours of the timestamp
  • The average score for each hour
  • The count of stories for each hour

13. Let’s edit a few things in the previous query:

  • Round the average scores using ROUND().
  • Rename the columns to make it more readable using AS.
  • Add a WHERE clause to filter out the NULL values in timestamp.

Take a look at the result again: What are the best hours to post a story on Hacker News?