hunter paulson
how to build the fastest wikipedia race solver on the internet
2026-04-30
Every[1] Wikipedia page has links to other Wikipedia pages, and by clicking on one of those links we end up on that page.
Because of this, Wikipedia can be thought of as a graph where each page is a node and each link is an edge[2].
We traverse the graph by clicking on a link to another page. If we click on the right sequence of links we can get from one page to almost any other page. (Footnote: Wikipedia is not a fully connected graph, so not every page is reachable from every other page.)
There is actually a semi-official game called Wikiracing where “players compete to navigate from one Wikipedia page to another (chosen manually or randomly) using only the internal links of the page”.
There are two variants of Wikiracing. One minimizes the number of clicks and the other minimizes the time to complete the race. I thought it would be a fun challenge to see if I could build a system that is the fastest on the internet at both.
Here is how I built the fastest Wikipedia race solver on the internet[3]. You can play with my solver here and try to overload it.
To start, we need an algorithm to find the shortest path between any two randomly selected pages on Wikipedia.
The only information we start with is the page we are initially on, the start page (S), and the page we want to get to, the target page (T).
I like to visualize this process as two expanding solar systems, centered on the start and target pages, that grow until they overlap. At every step of the algorithm we add a new, further-out orbit to one of the solar systems. As soon as the two outermost orbits intersect, we know that we have found the shortest path(s).
As you can see, the algorithm is pretty straightforward and brute force.
Initially we only know about the start and target pages, and we have no idea how far apart they are.
We begin by checking either all the outgoing links from the start page or all the incoming links to the target page. To guarantee we find the shortest path, we have no choice but to check every outgoing and incoming link for every page we encounter from here on out.
Since we got all the incoming links to the target page in the previous step, we now need to get all the outgoing links from the start page. Both of our solar systems now have their first orbital ring. We call the outer ring of pages that are all exactly K links away from the center node the frontier.
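Growing a frontier by one orbit can be sketched in a few lines of Python. This is my own illustration, not the actual solver; `get_links` is a hypothetical stand-in for fetching a page's outgoing (or incoming) links:

```python
# A sketch of growing one frontier by a single orbit.
# `get_links` is a hypothetical function returning a page's links.
def expand_frontier(frontier, visited, get_links):
    """Return the set of pages exactly one link beyond `frontier`."""
    next_frontier = set()
    for page in frontier:
        for neighbor in get_links(page):
            if neighbor not in visited:  # skip pages already in an inner orbit
                visited.add(neighbor)
                next_frontier.add(neighbor)
    return next_frontier
```

The `visited` set is what keeps each orbit strictly further out than the last: a page that already appears in an inner ring is never added again.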
Now we need to decide which frontier to grow next. Our goal is to minimize the number of pages we have to check, so we expand the frontier with the fewest pages[4]. Since our target frontier has 6 pages and our start frontier has 8, we expand the target frontier. Notice that this is exactly what we just did in step 3, when the start frontier had one page.
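The selection rule itself is tiny. A sketch (`pick_side` is a name I made up for illustration):

```python
# Always grow the side with fewer pages, since every page in the
# chosen frontier must be checked on the next expansion.
def pick_side(start_frontier, target_frontier):
    """Return which frontier to expand next: 'start' or 'target'."""
    if len(start_frontier) <= len(target_frontier):
        return "start"
    return "target"
```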
Repeat the previous step until the frontiers intersect, i.e. until an expansion encounters a page that is already part of the other solar system.
Once our frontiers intersect at one or more pages, we know that we have found the shortest path between the start and target pages.
As you can see we had to check a lot of pages that were not on the shortest path. This wasted computation is unavoidable since we have no way of knowing ex ante which pages will be on the shortest path.
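Putting the steps together, a minimal version of the whole bidirectional search can be sketched as follows. This is my own illustration under stated assumptions, not the production solver: `outgoing` and `incoming` are hypothetical link-lookup functions, and ties between equal-sized frontiers are broken toward the start side.

```python
def bidirectional_bfs(start, target, outgoing, incoming):
    """Return one shortest path from `start` to `target`, or None."""
    if start == target:
        return [start]
    # Parent pointers double as the visited set for each solar system.
    parents_s = {start: None}
    parents_t = {target: None}
    frontier_s, frontier_t = {start}, {target}

    def expand(frontier, parents, links):
        nxt = set()
        for page in frontier:
            for nb in links(page):
                if nb not in parents:
                    parents[nb] = page
                    nxt.add(nb)
        return nxt

    while frontier_s and frontier_t:
        # Always grow the smaller frontier to minimize pages checked.
        if len(frontier_s) <= len(frontier_t):
            frontier_s = expand(frontier_s, parents_s, outgoing)
            meet = frontier_s & parents_t.keys()
        else:
            frontier_t = expand(frontier_t, parents_t, incoming)
            meet = frontier_t & parents_s.keys()
        if meet:
            m = next(iter(meet))
            # Walk back to the start, then forward to the target.
            path = []
            node = m
            while node is not None:
                path.append(node)
                node = parents_s[node]
            path.reverse()
            node = parents_t[m]
            while node is not None:
                path.append(node)
                node = parents_t[node]
            return path
    return None  # start and target are in disconnected components
```

Note that expanding the target side uses `incoming` links: the target's solar system grows backwards, toward pages that can reach it.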
I first learned about this approach through Six Degrees of Wikipedia. They were an inspiration and baseline for this work. One of the reasons I built this was to measure if the six degrees of separation hypothesis is empirically true.
Unfortunately it turns out that, at least for Wikipedia, it is false. The diameter of Wikipedia is 41[5].