PageRank, Stochastic Matrices and the Power Iteration

In this post, we will revisit a popular algorithm called PageRank, which is used by Google to rank webpages for its search engine. Surprising to some but not so to others, PageRank is simple enough that only a level of first-year undergraduate linear algebra is required to understand it.

The Web as a Graph

Consider the web as a collection of pages, some of which are connected to each other using hyperlinks. For example, the Wikipedia article on general relativity contains a hyperlink to another article on Albert Einstein. By clicking the link, we move from the former webpage to the latter.

We can model this as a graph \(G = (V, E)\), where the set of nodes (or vertices) \(V\) contains the webpages and the set of edges \(E\) contains binary relations \((v_i, v_j)\), indicating that the page \(v_i \in V\) contains a hyperlink to \(v_j \in V\). Since \(v_i\) may lead to \(v_j\) but not the other way around, the edge \((v_i, v_j)\) may be in \(E\) while \((v_j, v_i)\) may not. In this case we call the graph \(G\) a directed graph.

The Importance of a Page

PageRank defines a score for each webpage where more “important” pages have high scores. This is particularly useful in information retrieval, where a system is asked to return pages relevant to a query. An assumption is that higher-ranked pages should be returned first, as they are more important and therefore have a higher chance of being what the user wants. Some other intuitions on building a ranking system are:

Let \(r \in \mathbb^n\) be the rank vector—that is, \(r_i\) is the numerical value denoting the importance of page \(i\). We will propose a method for finding \(r\) such that if \(r_i > r_j\), then page \(i\) is more important than page \(j\). Note that since the rankings are ordinal, we can just scale \(r\) by a positive number and the relative ordering of the pages based on importance will not change at all.

The importance matrix

To find the importance of every page, we will need to exploit the structure of the graph, specifically the in- and out-links of every node. Suppose that page \(i\) with importance \(r_i\) has \(d_i\) out-neighbors—that is, pages that \(i\) links to. In graph theory, \(d_i\) is also called the out-degree of \(i\). Based on the intuitions above, we want these out-neighbors to enjoy \(i\)’s importance. To do so, we assume that each out-neighbor of \(i\) will get an equal amount of importance from \(i\). In other words, each out-neighbor will get an amount \(\frac\) of importance from \(i\).

In this setting, the importance of a page \(j\) will be the sum of all importance flowing into it from its in-neighbors:

Notice that we have a recursive structure: Every page influences the pages it leads to. But the importance of that page is flows from the pages leading to it.

Define a matrix \(A\), called the importance matrix, where \(A_ = \frac\) if page \(i\) leads to page \(j\). In other words, each column of \(i\) of \(A\) is a vector containing either \(0\) (where there is no out-going edge) or \(\frac\) (when there is). Since the out-degree of \(i\) is exactly \(d_i\), it must be the case that every column of \(i\) sums to \(1\).

The product \(A r\) gives us the importance flowing into every page. To see why it is, consider the \(j\)th component of this product:

where we have the last inequality because \(A_\) is non-zero (and equal to \(\frac\)) when there is an edge from \(i\) to \(j\). This equation exactly matches \(\eqref\).

The random surfer

One can think of \(A\) as an adjacency matrix of \(G\), but instead of \(A_ = 1\) when there is an edge from \(i\) to \(j\), we have \(A_ = \frac\). There is a nice interpretation of \(A\) called the random surfer model.

Suppose we have a web surfer who is currently on page \(i\). To visit a new page, the surfer will randomly choose one of the out-neighbors of \(i\). Since the out-degree of \(i\) is \(d_i\), if we assume that all out-neighbors are equally likely to be chosen, the probability that the surfer will choose a neighbor is \(\frac\). This is exactly captured in the matrix \(A\).

PageRank as a fixed-point problem

Since \((A r)_j\) gives us the importance of page \(j\), which is also equal to \(r_j\), we have:

The solution \(r\) to this linear system is the vector containing the ranks of our webpages. Note that we can scale \(r\) by a positive number and it would still satisfy this equation, achieving our goal of preserving the order from positive scaling stated above.

Such an \(r\) satisfying \(\eqref\) is called a fixed point of \(A\), because applying \(A\) to \(r\) (that is, multiplying \(A\) by \(r\)) will not change the values of \(r\) at all. I have another post on solving for a fixed point in the context of machine learing, which can be found here. In this post, we will revisit a method to solve for \(r\).

Solving PageRank

If we look again at equation \(\eqref\), we can recognize that this is an eigenvector problem. Specifically, if \(\eqref\) holds, then \(r\) must be an eigenvector of \(A\) corresponding to an eigenvalue of \(1\). There are two important questions to answer.

First, is it guaranteed that \(A\) has \(1\) as an eigenvalue? After all, \(A\) is just a non-negative matrix with each column summing to \(1\). It turns out that this is true, and we will see the proof below.

Second, given that \(1\) is an eigenvalue, then we can solve \(A r = r\) using a row-reduction algorithm such as Gaussian elimination. Is that it? The answer is no, because Gaussian elimination has the time complexity of \(O(n^3)\), where \(n\) is the number of pages. This does not scale well with our page collection, as \(n\) could be in the billions, if not more. Therefore, we need to find another way to solve \(\eqref\).

Stochastic matrices

To answer the first question above, notice that the matrix \(A\) is an example of a stochastic matrix, which is a square matrix with non-negative entries and having every column sum to 1. In the context of PageRank, \(A\) is also called the stochastic adjacency matrix.

What is interesting about a stochastic matrix is that it accepts \(1\) as an eigenvalue, and all other eigenvalues (real or complex) of \(A\) are less than or equal to \(1\) in absolute value.

Proof.
Since $A$ is a square matrix, $A$ and $A^\top$ share the same eigenvalues. We need to prove that $1$ is an eigenvalue of $A^\top$. Because every row of $A^\top$ sums to $1$, we have $A^\top \mathbf_n = \mathbf_n$, where $\mathbf_n$ is a column vector of $n$ ones. So, $1$ is an eigenvalue of $A^\top$ and, therefore, of $A$.

To show why all other eigenvalues of $A$ are less than or equal to $1$ in absolute value, let $\lambda$ be an eigenvalue of $A$. So $\lambda$ is also an eigenvalue of $A^\top$, associated with an eigenvector $x = [x_1,\ldots,x_n]^\top$. In other words, $A^\top x = \lambda x$. Let $j$ be index of the largest element in absolute value of $x$, that is, $|x_i| \leq |x_j| ~ \text ~ i=1,\ldots,n$. We have $$ \begin |\lambda| |x_j| = |\lambda x_j| = \left| \sum_^ A_ x_i \right| \leq \sum_^ A_ |x_j| = |x_j| \sum_^ A_ = |x_j|, \end $$ where the first inequality uses the triangle inequality and the definition of $x_j$, and the last equality uses the fact the column $j$ of $A$ sums to 1. Since $x_j \neq 0$, this implies that $|\lambda| \leq 1$.

The fixed-point iteration

To answer the second question, we use the fact we just proved above, which is that \(1\) is the largest eigenvalue of \(A\) in absolute value. In linear algebra, it is also called the spectral radius of \(A\). As an alternative to Gaussian elimination, a popular algorithm to find the spectral radius and its corresponding eigenvector is the power iteration.

Procedure: Power Iteration
Input: A diagonalizable $n \times n$ matrix $A$
Let $b_0$ some non-zero vector
For $k = 0, \ldots, K-1$ do
Apply $A$ to $b_k$: $\tilde_ = A b_$
Normalize: $b_ = \frac<\tilde_><\lVert \tilde_ \rVert>$
Output: $b_$

The sequence \(\left(\frac<\lVert A b_k \rVert><\lVert b_k \rVert>\right)_k\) is guaranteed to converge to the spectral radius of \(A\) (which is \(1\) in our case), and the sequence \((b_k)_k\) converges to the corresponding eigenvector with unit norm.

How fast this sequence converges can be found here. In practice, one would run the power iteration until the difference between two iterates falls below some pre-defined tolerance \(\epsilon\). For example, we can run until \(\lVert b_ - b_ \rVert \leq 10^\).

Conclusion (temporary)

We have learned how to find the importance scores of webpages in order to rank them. First, we construct the importance matrix from the structure of the graph. Then, we use the power iteration to solve for the fixed point of this matrix, which is the eigenvector corresponding to the largest eigenvalue in absolute value. This solution \(r\) now contains the importance of the pages, and we are ready to use \(r\) to rank them!

However, there are two potential problems with this approach. We will explore it and propose a solution in the below.

Two Problems with PageRank

Problem 1: dead ends

In the previous section, we have learned to use the power iteration to solve for the importance vector \(r\). However, the power iteration works under an assumption that the matrix \(A\) is diagonalizable. This will not hold if a column of \(A\) contains all zeros. This case happens when a webpage has no outgoing links. In other words, the page is a dead end.

How do we solve this? Let’s go back to the random surfer model above. If the surfer is at a dead end, meaning there is no hyperlink on the page the surfer can click to go to, we will assume that they will randomly jump to any other page in our collection. In addition, all pages are assumed to be equally likely to be chosen. So, if a page \(i\) is a dead end, we will replace the all-zeros column for \(i\) with a column of all \(\frac\)’s, where \(n\) is the number of pages in our collection.

Therefore, we can transform the matrix \(A\) into one without dead ends. Let us call this matrix \(A'\). Every column of \(A'\) now sums to 1.

Problem 2: spider traps

The matrix \(A'\) is now guaranteed to be a stochastic matrix, and we are ready to use the power iteration to find its fixed point. However, the result might not be what we want. Consider the following scenario: In our web graph, there is a set of at least one node such that there are no links coming out of this set. There can be links between nodes in this set, but there are no links to any other outside node.

We call such a set of nodes a spider trap. But what is the problem? If we use the power iteration for a graph with a spider trap, the algorithm will cause all importance scores to be captured within the nodes in this spider trap, and the rest of the nodes will have zero importance. This kind of pages can be constructed intentionally or unintentionally, but their existence will cause PageRank to output an undesirable result.

So how do we deal with spider traps? Once the random surfer is in a spider trap, they will never be able to leave it. We will assume that, when the surfer is at page \(i\), they will flip a coin. If the coin comes up heads, the surfer will follow a link at random, and the probability of choosing a page is found by looking up the \(i\)th column of \(A'\). If the coin comes up tails, the surfer will jump to a page in our collection uniformly at random. So, if page \(i\) is in a spider trap, the surfer has a some chance of jumping outside the trap when the coin comes up tails.

To formalize this, let \(p\) be the probability of the coin coming up heads. The probability that the surfer, currently at page \(i\), will go to page \(j\) is

The Google matrix

In 1998, Larry Page and Sergey Brin, the founders of Google, proposed a matrix combining the solutions to these two problems. It is now widely called the Google matrix:

\[\begin \mathscr = pA' + (1-p) \frac \mathbf_n \mathbf_n^\top. \end\]

By using the power iteration on \(\mathscr\), we can find the importance scores of the pages in our collection. This is the algorithm that Google uses to rank webpages.

Resources

  1. Interactive Linear Algebra by Dan Margalit and Joseph Rabinoff. Specifically Chapter 5.
  2. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Specifically Chapter 5.
  3. CS224W - Machine Learning with Graphs by Jure Leskovec, Fall 2021 edition. Specifically Lecture 4.
  4. The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page.