Improving the Hacker News ranking algorithm

Felix Dietze - 2021-02-07

This is a work in progress...

In my opinion, the Hacker News community is a diverse group of the smartest people on the planet. Together they decide which content appears on the front page and therefore gets the most attention. Anyone can submit new links or questions for the community to vote on. The front page usually shows very interesting, high-quality content, so everything seems to work as intended. Unfortunately, there is a weakness in the voting mechanism that prevents the community from properly evaluating every submission.

You may have noticed that many submissions barely receive any upvotes. This becomes clearer if you look at the newest submissions or submit something yourself. A low score does not mean that these submissions are of low quality. It simply means that the community could not evaluate them properly. In this article I will try to explain what exactly happens, why it happens and how the situation could be improved. My goal is to improve the quality of the HN frontpage without touching the user experience.

Overview

  1. First, I analyze all HN submissions to date and show that the scores do not necessarily reflect the opinion of the community. Almost half of all submissions do not receive a single upvote, which means they are not properly evaluated. Even submissions that reached very high scores were often submitted several times before they achieved that score. I factor out submission time and different titles to show that this effect occurs across all submissions.
  2. Then I explain that this happens because of a positive feedback loop between a submission's rank on the front page, its views, and its upvotes. This produces chaotic ranking behavior, which prevents most submissions from being evaluated properly.
  3. Finally, I present several approaches to break this positive feedback loop and transform it into a balancing one. I ran some frontpage simulations to evaluate my ideas. The most promising approach I have found so far is to slightly modify the HN ranking formula by dividing it by the number of views of that specific article, as sketched right after this list. I hope that this solution will be evaluated by the HN community.
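
To make the idea concrete: a commonly cited approximation of the HN ranking formula (the exact production formula is not published) is

    rank_score = (votes - 1) / (age_in_hours + 2)^1.8

The proposed modification divides this by the number of views the submission has received so far:

    rank_score = (votes - 1) / ((age_in_hours + 2)^1.8 * views)

In other words, submissions would be ranked by their upvote rate per view instead of their raw vote count, so a story that was barely shown to anyone is not punished for having few votes.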

I'd appreciate feedback. If you find any mistakes in my methods or interpretations, or simply want to talk about this topic, please contact me: felix.dietze@rwth-aachen.de

The Data

I use a Kaggle notebook to query the Hacker News dataset on BigQuery and create the visualizations. If you want to play with the queries and plots, you can fork my notebook.

The dataset contains all submissions - including comments - since the beginning of Hacker News. Here's an overview of the submitted stories over the years:

year submissions submitters total_votes subm_per_user votes_per_subm votes_quartiles
2020 313347 60078 4574869 5.22 14.60 [1, 1, 2, 3, 3816]
2019 357189 62524 5583918 5.71 15.63 [1, 1, 2, 3, 3287]
2018 355756 62104 4998940 5.73 14.05 [1, 1, 2, 3, 6015]
2017 372126 66627 4972573 5.59 13.36 [1, 1, 2, 3, 4107]
2016 363380 65847 4567656 5.52 12.57 [1, 1, 2, 3, 5771]
2015 327572 62358 3805106 5.25 11.62 [1, 1, 2, 4, 2228]
2014 290956 57453 3451749 5.06 11.86 [1, 1, 2, 3, 3086]
2013 307998 53090 3513977 5.80 11.41 [1, 1, 2, 3, 2744]
2012 311203 52841 2996594 5.89 9.63 [1, 1, 1, 3, 3531]
2011 292844 49971 2769304 5.86 9.46 [1, 1, 1, 3, 4338]
2010 179078 26243 1813635 6.82 10.13 [1, 1, 2, 4, 1297]
2009 111533 15462 980209 7.21 8.79 [1, 1, 2, 6, 928]
2008 70224 6118 440908 11.48 6.28 [1, 1, 2, 5, 532]
2007 22487 1945 114721 11.56 5.10 [1, 1, 2, 5, 262]
2006 50 16 299 3.12 5.98 [1, 3, 4, 7, 57]
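
For reference, a query along the following lines reproduces the table above. This is a sketch, not the exact notebook query; it assumes the `by` (submitter) and `time` (unix seconds) columns of the public dataset and treats a submission's score as its vote count:

SELECT
    EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) AS year,
    COUNT(*) AS submissions,
    COUNT(DISTINCT `by`) AS submitters,
    SUM(score) AS total_votes,
    ROUND(COUNT(*) / COUNT(DISTINCT `by`), 2) AS subm_per_user,
    ROUND(SUM(score) / COUNT(*), 2) AS votes_per_subm,
    APPROX_QUANTILES(score, 4) AS votes_quartiles, /* [min, q1, median, q3, max] */
FROM `bigquery-public-data.hacker_news.full`
WHERE `type` = 'story' AND score IS NOT NULL AND score > 0
GROUP BY year
ORDER BY year DESC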

The community grew a lot until 2011 and has roughly kept its size since then. In recent years there have been around 350k submissions by roughly 62k submitters per year. The total number of upvotes kept growing for much longer and peaked at about 5.6M in 2019. As a result, the average number of votes per submission grew as well and reached 15.6 in 2019. That seems like enough to evaluate the quality of a submission. But the quartiles indicate that the upvotes are distributed exponentially: very few submissions take almost all the votes, and most submissions (> 75%) receive 3 or fewer upvotes.

A scatter plot gives a better overview of the exponential distribution of the scores. The plot shows all submissions by submission date and the score they reached. Note that the score axis is on a log scale.

Scatterplot of all submissions with date on the x-axis and score on a logarithmic y-axis. Submission density and maximum scores slowly increase until around 2012 then stay on a similar level. Below score 10^1 the density is very high.

The region below a score of about 10 is very dense. It shows that almost all submissions get low scores. Since scores are integers and would otherwise appear only as horizontal lines on the log scale, I added some jitter to visualize this high density in the scatter plot.
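
A query along these lines produces the scatter data (a sketch; the jitter term is the same one used in the queries further below):

SELECT
    TIMESTAMP_SECONDS(time) AS submitted_at,
    /* exponential jitter: RAND() is uniform in [0,1), so POW(2, RAND()) - 1
       spreads each integer score over [score, score + 1); for score = 1 this
       fills the band uniformly on the log2 axis */
    score + POW(2, RAND()) - 1 AS score,
FROM `bigquery-public-data.hacker_news.full`
WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0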

"Almost all" is a bit vague, so let's put this into numbers. The following query counts the submissions for many different ranges of achieved scores:

WITH
    stories AS (
        SELECT *
        FROM `bigquery-public-data.hacker_news.full`
        WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0
    ),
    intervals AS (    
        SELECT
            min(score) as min_score,
            max(score) as max_score,
            COUNT(*) as submissions,
            SUM(score) as total_votes,
        FROM
            stories
        GROUP BY
            ceil(log(score)/log(2))
    ),
    totals AS (
        SELECT
            count(*) AS total,
            sum(score) AS total_score,
        FROM stories
    )
SELECT
    max_score,
    [min_score, max_score] as score_interval,
    submissions,
    submissions / totals.total as subm_fraction,
    (SELECT COUNT(*) FROM stories where score <= max_score) / totals.total as cumulative_subm_fraction,
    total_votes,
    total_votes / totals.total_score as votes_fraction,
FROM
    intervals,
    totals
ORDER BY
    min_score ASC

score_interval submissions subm_fraction cumulative_subm_fraction total_votes votes_fraction
[1, 1] 1650064 0.477735 0.477735 1650064 0.038771
[2, 2] 703537 0.203692 0.681427 1407074 0.033062
[3, 4] 455960 0.132012 0.813439 1512041 0.035528
[5, 8] 197334 0.057133 0.870572 1198615 0.028164
[9, 16] 113632 0.032899 0.903471 1338238 0.031444
[17, 32] 84617 0.024499 0.927970 1984564 0.046631
[33, 64] 87730 0.025400 0.953370 4097706 0.096284
[65, 128] 79369 0.022979 0.976349 7261423 0.170621
[129, 256] 53224 0.015410 0.991759 9493582 0.223070
[257, 512] 22127 0.006406 0.998165 7654516 0.179858
[513, 1024] 5425 0.001571 0.999736 3645849 0.085666
[1025, 2031] 828 0.000240 0.999976 1093717 0.025699
[2049, 3816] 80 0.000023 0.999999 201124 0.004726
[4107, 6015] 4 0.000001 1.000000 20231 0.000475

Almost half of all url submissions got zero upvotes (score = 1). Around 81% of all submissions reached a score of 4 or less. Fewer submissions fall into each higher range. I chose every range to be twice as big as the previous one to make the exponential distribution of scores visible; for example, all scores from 33 to 64 satisfy ceil(log2(score)) = 6 and land in the same bucket. Now you can see that almost all submissions fall into the low scoring ranges, while almost all upvotes went to the few submissions with high scores (the [129, 256] range alone collected 22% of all votes).

submission_and_votes_distribution_over_score_intervals.png

This might suggest that most submissions on Hacker News are bad, don't fit the community, or are spam. If they were better, they would have been upvoted, right? But how can we be sure?

Interestingly, Hacker News permits re-submissions of the same url. This allows us to do some deeper analysis, since we have multiple community evaluations for some submissions. So how often are urls resubmitted? Do resubmitted urls get the same scores?

I counted the number of submissions for every url/title combination with a subquery grouping by url and title. Then I grouped the results of that query again, counting the occurrence of each submission count. I take the title into account to avoid counting resubmissions under a different title (clickbait) as the same submission.

WITH
    stories AS (
        SELECT *
        FROM `bigquery-public-data.hacker_news.full`
        WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0
    ),
    counts AS (
        SELECT
            COUNT(*) as submission_count,
            max(score) as max_score,
        FROM stories
        GROUP BY url,title
    ),
    totals AS (
        SELECT
            count(*) AS total
        FROM counts
    )
SELECT
    counts.submission_count,
    COUNT(*) as urls,
    COUNT(*) / ANY_VALUE(totals.total) as fraction,
    pow(2,avg(log(counts.max_score)/log(2))) as max_score_log2_avg,
    approx_quantiles(counts.max_score, 2)[OFFSET(1)] as max_score_median,
FROM
     counts,
     totals
GROUP BY counts.submission_count
HAVING urls > 50 /* only show submission counts with enough samples */
ORDER BY counts.submission_count

submission_count urls fraction max_score_log2_avg max_score_median
1 3200032 0.96766 2.42 2.00
2 85619 0.02589 4.64 3.00
3 14234 0.00430 5.93 4.00
4 3861 0.00117 6.49 4.00
5 1469 0.00044 6.17 4.00
6 669 0.00020 6.43 4.00
7 357 0.00011 5.34 4.00
8 178 0.00005 4.70 3.00
9 149 0.00005 3.82 3.00
10 128 0.00004 4.37 3.00
11 70 0.00002 4.32 3.00

The results show that almost 97% of all submitted url/title combinations were submitted only once. Only about 2.6% were submitted twice.

mean_score_for_different_submission_counts.png

The average of the maximum score reached per url increases significantly when it is submitted more often: according to the plot above, it's 11 for urls submitted only once and 56 for urls submitted 5 times. (The table shows the geometric mean, pow(2, avg(log2(max_score))), which rises from 2.4 to 6.2 over the same range.) Submitting more than 5 times doesn't seem to be useful. The median shows the same trend: 2 for single submissions, rising to 4 for urls submitted 3 to 7 times.

But descriptive statistics like averages and medians don't reveal the full distribution. Let's look at it:

WITH
    stories AS (
        SELECT *
        FROM `bigquery-public-data.hacker_news.full`
        WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0
    ),
    best AS (
        SELECT
            url,
            title,
            max(score) as max_score,
        FROM stories
        GROUP BY url,title
        HAVING count(*) > 1
    )
SELECT
    /*best.url,*/
    submission.score + pow(2, rand())-1 as score, /* exponential jitter to make the lower values easier to interpret */
    best.max_score + pow(2, rand())-1 as max_score,
FROM best
JOIN stories as submission
    ON submission.url = best.url AND submission.title = best.title
ORDER BY best.max_score DESC

score_histograms_for_different_submission_counts.png

Here it is a bit more obvious that urls which have been submitted around 4 times are distributed more towards higher scores. Don't compare the heights of the lines in this plot: it's not a histogram but a kernel density estimate.

The smaller bump on the right represents the submissions that appeared on the frontpage. They got even more upvotes by being shown there. These submissions grew while the ones that didn't make it to the frontpage kept their score.

A common question is: WHEN is the right time to submit on Hacker News? I suggest that a more important question is: HOW OFTEN should you submit to Hacker News?

The answer to that question seems to be: 4 times if your submission won't make it to the frontpage and 8 times if it will.

More on submission time later.

The low median hints again at the exponential distribution of scores. That picture gets a bit clearer when we look at the quartiles. The maximum score is always extremely high.

We could interpret the results this way: you have to submit a url 5 times to unveil its real community score.

If we add up the fractions of urls that were submitted fewer than 5 times (0.96766 + 0.02589 + 0.00430 + 0.00117), we get 99.9%.

Therefore, almost all submissions might never have gotten a chance to reach their maximum possible score. That seems a bit strange, doesn't it?

But how about the stories that got a high number of upvotes? Are at least their scores consistent? Let's look at some examples of the best urls which have been submitted at least 5 times. I additionally limited the time span between first and last submission to a maximum of 2 weeks.

SELECT
    url,
    COUNT(*) as submissions,
    ARRAY_AGG(score order by time ASC) as scores_by_time,
    approx_quantiles(score, 4) as quartiles,
    DATE_DIFF(DATE(TIMESTAMP_SECONDS(MAX(time))),DATE(TIMESTAMP_SECONDS(MIN(time))), DAY) as days,
FROM `bigquery-public-data.hacker_news.full`
WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0
GROUP BY url
HAVING submissions >= 5 AND days <= 14
ORDER BY max(score) DESC
LIMIT 30

url submissions scores_by_time quartiles days
The Death of Microservice Madn... 8 [5, 2, 3, 5, 5, 2, 4, 993] [2, 2, 4, 5, 993] 9
React Native for Android... 5 [5, 2, 6, 907, 6] [2, 5, 6, 6, 907] 0
On Being a Principal Engineer... 5 [2, 4, 2, 1, 893] [1, 2, 2, 4, 893] 12
Google, what were you thinking... 15 [1, 1, 1, 1, 1, 1, 1, 1, 751, 1, 1, 1, 1, 1, 4] [1, 1, 1, 1, 751] 0
Why time management is ruining... 5 [552, 6, 4, 5, 8] [4, 5, 6, 8, 552] 2
The Internet of Beefs... 5 [2, 4, 3, 3, 513] [2, 3, 3, 4, 513] 3
Giving GPT-3 a Turing Test... 5 [3, 3, 12, 32, 453] [3, 3, 12, 32, 453] 13
Performance Matters... 5 [3, 2, 2, 401, 3] [2, 2, 3, 3, 401] 9
Static Analysis in GCC 10... 5 [8, 18, 9, 3, 394] [3, 8, 9, 18, 394] 2
The Cost of JavaScript in 2019... 5 [1, 2, 2, 2, 390] [1, 2, 2, 2, 390] 5
Deploys at Slack... 5 [2, 5, 4, 1, 360] [1, 2, 4, 5, 360] 10
Flawed Algorithms Are Grading ... 5 [2, 7, 3, 6, 358] [2, 3, 6, 7, 358] 9
Getting Started with Security ... 5 [12, 5, 2, 8, 346] [2, 5, 8, 12, 346] 10
The rise and fall of the PlayS... 6 [4, 3, 4, 2, 2, 283] [2, 2, 3, 4, 283] 4
Random Acts of Optimization... 5 [11, 3, 15, 2, 279] [2, 3, 11, 15, 279] 2
Minify Your SVGs... 5 [2, 3, 3, 1, 264] [1, 2, 3, 3, 264] 10
Roadmap to becoming a web deve... 5 [6, 2, 6, 4, 261] [2, 4, 6, 6, 261] 11
The New Wilderness... 5 [249, 12, 8, 6, 7] [6, 7, 8, 12, 249] 4
The 9.9 Percent Is the New Ame... 5 [4, 19, 9, 2, 248] [2, 4, 9, 19, 248] 11
The Debugging Mindset... 5 [3, 1, 2, 2, 246] [1, 2, 2, 3, 246] 11
IBM Watson Overpromised and Un... 5 [5, 3, 1, 2, 237] [1, 2, 3, 5, 237] 5
What they don’t tell you about... 5 [1, 2, 1, 2, 234] [1, 1, 2, 2, 234] 9
Supercharging the Git Commit G... 5 [4, 4, 3, 227, 8] [3, 4, 4, 8, 227] 3
Dollar General Hits a Gold Min... 5 [2, 2, 2, 5, 227] [2, 2, 2, 5, 227] 5
Legit-Looking iPhone Lightning... 5 [6, 215, 1, 16, 1] [1, 1, 6, 16, 215] 3
My Billion Dollar Mistake... 6 [2, 4, 8, 8, 4, 211] [2, 4, 4, 8, 211] 6
Why Did a Chinese Peroxide Com... 7 [2, 3, 2, 1, 1, 2, 192] [1, 1, 2, 3, 192] 6
He Was a Hacker for the NSA an... 7 [16, 9, 4, 7, 4, 4, 189] [4, 4, 7, 16, 189] 5
Crypto Tokens: A Breakthrough ... 5 [184, 1, 1, 3, 3] [1, 1, 3, 3, 184] 2
Sperm Count Zero... 5 [7, 3, 7, 7, 182] [3, 7, 7, 7, 182] 4

It's interesting to see that multiple submissions of a good url always include submissions that received only a few or no upvotes. In the quartiles you can easily see the minimum (left), median (center), and maximum (right) scores.

Note that the scores are sorted by time, and the highest score usually comes last. That suggests that these urls were resubmitted until they finally received a high score.
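
This could be quantified with a window query like the following sketch (untested against the notebook; it checks, for every url submitted more than once, whether the last submission is the one with the maximum score):

WITH ordered AS (
    SELECT
        score,
        ROW_NUMBER() OVER (PARTITION BY url ORDER BY time ASC) AS pos,
        COUNT(*) OVER (PARTITION BY url) AS submissions,
        MAX(score) OVER (PARTITION BY url) AS max_score,
    FROM `bigquery-public-data.hacker_news.full`
    WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0
)
SELECT
    COUNTIF(score = max_score) / COUNT(*) AS fraction_max_score_last
FROM ordered
WHERE submissions > 1 AND pos = submissions /* keep only the last submission of each resubmitted url */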

Furthermore, the Hacker News FAQ states:

Are reposts ok?
When a story has had significant attention in the last year or so, we bury reposts as duplicates. If not, a small number of reposts is ok.

That means we can only draw conclusions from submission sequences where the highest score comes last. In the other cases, the later submissions were probably buried as duplicates by the HN algorithm.

Are the scores in different scoring ranges consistent?

Is this inconsistency visible across all kinds of urls, from low to high maximum score? Let's look at all urls, which have been submitted multiple times.

WITH
    best AS (
        SELECT
            url,
            max(score) as max_score,
        FROM `bigquery-public-data.hacker_news.full`
        WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0 /* url filter added for consistency with the queries above; it keeps all self posts from being grouped as one empty url */
        GROUP BY url
        HAVING count(*) > 1
    )
SELECT
    /*best.url,*/
    submission.score + pow(2, rand())-1 as score, /* exponential jitter to make the lower values easier to interpret */
    best.max_score + pow(2, rand())-1 as max_score,
FROM best
JOIN `bigquery-public-data.hacker_news.full` as submission
    ON submission.url = best.url
WHERE `type` = 'story' AND score IS NOT NULL AND score > 0
ORDER BY best.max_score DESC

Scatter plot of the scores of multiple submissions of the same url on the y-axis against the url's maximum score on the x-axis. Both axes are on a logarithmic scale. The picture shows a triangle on the lower right that is very dense at the bottom left, where score and max_score are minimal. The whole plot is dense at the bottom, even for higher max_scores.

You can see that in all ranges, even for urls that reached very high scores, most of the submissions got only a few upvotes.

Did the higher score happen at a specific time of day?

One last thing I'd like to talk about is submission time. Do you have to submit at a specific time of day to get higher scores? I know of two articles that tried to answer that question but came to contradictory conclusions:

What is the best time to post to Hacker News? - by Michael A. Schaefer (2017):

Overall, it appears that the best time to post to Hacker News is on Monday or Wednesday between 5 PM and 6 PM.

The Best Time to Submit to Hacker News 2018-2019 - by David Chanin (2019):

Interestingly, this analysis comes up with the opposite answer compared with the article from 2017 - it’s best to post on weekends and times of low activity for Hacker News so as to minimize the competition that your post will face. Articles posted on Sunday, 6am UTC are 2.5x more likely to make it to the front page than posting on Wednesday, 9am UTC. Of course, these frontpage posts will likely get less views than frontpage posts at a more popular time, so it’s a tradeoff.
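
A sketch of the kind of aggregation behind such analyses (assuming hours in UTC derived from the `time` field; not necessarily the exact query behind the plots below):

SELECT
    EXTRACT(HOUR FROM TIMESTAMP_SECONDS(time)) AS hour_utc,
    COUNT(*) AS submissions,
    ROUND(AVG(score), 2) AS avg_score,
    APPROX_QUANTILES(score, 4) AS score_quartiles,
FROM `bigquery-public-data.hacker_news.full`
WHERE `type` = 'story' AND url != '' AND score IS NOT NULL AND score > 0
GROUP BY hour_utc
ORDER BY hour_utc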

scores_over_hour.png

histogram_of_best_submission_hours.png

There is more to come...

Frontpage Simulations to evaluate different ranking algorithms:
https://github.com/fdietze/downvote-scoring

I'm looking for ways to fund my research. If you can help, please contact me.
