where \(E\left(u\right)\) is some vector over the web pages that corresponds to a source of rank (see Section
6). Note that if \(E\) is all positive, \(c\) must be reduced to balance the equation. Therefore, this
technique corresponds to a decay factor. In matrix notation we have \(R'=c\left(AR'+E\right)\). Since \(\left|\left|R'\right|\right|_1=1\), we can rewrite this as \(R'=c\left(A+E\times\mathbf{1}\right)R'\) where \(\mathbf{1}\) is the vector consisting of all ones.
So, \(R'\) is an eigenvector of \(\left(A+E\times\mathbf{1}\right)\).
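For readers who want the intermediate step spelled out, the rewrite can be expanded as follows (this assumes only that \(R'\) is entrywise nonnegative, so that \(\left|\left|R'\right|\right|_1=\mathbf{1}R'\), where \(\mathbf{1}\) is the row vector of all ones):
\begin{align*}
R' &= c\left(AR'+E\right)\\
   &= c\left(AR'+E\,(\mathbf{1}R')\right) && \text{since } \mathbf{1}R'=\left|\left|R'\right|\right|_1=1\\
   &= c\left(A+E\times\mathbf{1}\right)R'.
\end{align*}
Equivalently, \(\left(A+E\times\mathbf{1}\right)R'=(1/c)R'\), so \(R'\) is an eigenvector with eigenvalue \(1/c\).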
2.5 Random Surfer Model
The definition of PageRank above has another intuitive basis in random walks on graphs. The
simplified version corresponds to the standing probability distribution of a random walk on the
graph of the Web. Intuitively, this can be thought of as modeling the behavior of a "random
surfer". The "random surfer" simply keeps clicking on successive links at random. However, if a
real Web surfer ever gets into a small loop of web pages, it is unlikely that the surfer will continue
in the loop forever. Instead, the surfer will jump to some other page. The additional factor \(E\) can
be viewed as a way of modeling this behavior: the surfer periodically "gets bored" and jumps to a random page chosen based on the distribution in \(E\).
So far we have left \(E\) as a user-defined parameter. In most tests we let \(E\) be uniform over all
web pages with value \(\alpha\). However, in Section 6 we show how different values of \(E\) can generate "customized" page ranks.
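As an illustration of the random surfer intuition (not the computation used in the paper, which iterates on the rank vector directly in Section 2.6), the following is a minimal Monte Carlo sketch of the random surfer in Python. The representation of the link structure (a map from each page to its list of outgoing links) and the parameter names are assumptions made for the example; the jump probability plays the role of a uniform \(E\).
\begin{verbatim}
import random

def simulate_random_surfer(links, num_steps=1000000, jump_prob=0.15, seed=0):
    """Estimate page visit frequencies by simulating a random surfer.

    links:     dict mapping each page to the list of pages it links to
               (hypothetical format; every linked page is assumed to
               appear as a key as well).
    jump_prob: chance of "getting bored" and jumping to a random page,
               playing the role of a uniform E.
    """
    rng = random.Random(seed)
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(num_steps):
        visits[current] += 1
        out = links[current]
        if not out or rng.random() < jump_prob:
            current = rng.choice(pages)   # bored, or stuck: jump anywhere
        else:
            current = rng.choice(out)     # follow a random outgoing link
    return {page: count / num_steps for page, count in visits.items()}
\end{verbatim}
With enough steps, the visit frequencies approximate the standing probability distribution that the iteration in Section 2.6 computes directly.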
2.6 Computing PageRank
The computation of PageRank is fairly straightforward if we ignore the issues of scale. Let \(S\) be
almost any vector over Web pages (for example \(E\)). Then PageRank may be computed as follows:
\(R_0\leftarrow S\)
loop:
\(R_{i+1}\leftarrow\ AR_i\)
\(d\leftarrow\left|\left|R_i\right|\right|_1-\left|\left|R_{i+1}\right|\right|_1\)
\(R_{i+1}\leftarrow R_{i+1}+dE\)
\(\delta\leftarrow\left|\left|R_{i+1}-R_i\right|\right|_1\)
while \(\delta>\epsilon\)
Note that the \(d\) factor increases the rate of convergence and maintains \(\left|\left|R\right|\right|_1\). An alternative
normalization is to multiply \(R\) by the appropriate factor. The use of \(d\) may have a small impact
on the influence of \(E\).
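For concreteness, the loop above can be transcribed directly. The sketch below (in Python) is an in-memory version only; the disk-based layout actually used is described in Section 3.1. The edge-list representation, the out-degree array, and the parameter names are assumptions made for the example, and \(S\) would typically just be a copy of \(E\).
\begin{verbatim}
import numpy as np

def pagerank(edges, out_degree, E, S, epsilon=1e-6):
    """Power iteration following the loop above (in-memory sketch).

    edges:      list of (u, v) index pairs, one per hyperlink u -> v
                (hypothetical edge-list format).
    out_degree: out_degree[u] = number of outgoing links of page u
                (at least 1 for every u that appears as a source).
    E, S:       rank-source vector and initial vector, same length.
    """
    R = np.asarray(S, dtype=np.float32).copy()
    E = np.asarray(E)
    while True:
        R_next = np.zeros_like(R)
        for u, v in edges:                       # R_{i+1} <- A R_i
            R_next[v] += R[u] / out_degree[u]
        d = np.abs(R).sum() - np.abs(R_next).sum()   # lost rank
        R_next += d * E                              # R_{i+1} <- R_{i+1} + d E
        delta = np.abs(R_next - R).sum()             # L1 change this step
        R = R_next
        if delta <= epsilon:                         # loop while delta > epsilon
            return R
\end{verbatim}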
2.7 Dangling Links
One issue with this model is dangling links. Dangling links are simply links that point to any page
with no outgoing links. They affect the model because it is not clear where their weight should
be distributed, and there are a large number of them. Often these dangling links are simply pages
that we have not downloaded yet, since it is hard to sample the entire web (in our 24 million pages
currently downloaded, we have 51 million URLs not downloaded yet, and hence dangling). Because
dangling links do not affect the ranking of any other page directly, we simply remove them from
the system until all the PageRanks are calculated. After all the PageRanks are calculated, they
can be added back in, without affecting things significantly. Note that the normalization of the other
links on the same page as a removed link will change slightly, but this should not have a large
effect.
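As a sketch of the removal step (a hypothetical helper, not the paper's implementation), dangling pages can be stripped iteratively; each pass can leave new pages dangling, which is why Section 3.1 notes that a few iterations are needed.
\begin{verbatim}
def remove_dangling(links):
    """Iteratively strip dangling pages from a link map.

    links: dict mapping each downloaded page to the set of pages it
           links to (hypothetical format).  Targets that are not keys
           are URLs that were never downloaded, hence dangling.
    """
    links = {page: set(targets) for page, targets in links.items()}
    while True:
        dangling = {page for page, targets in links.items() if not targets}
        dangling |= {t for targets in links.values() for t in targets
                     if t not in links}
        if not dangling:
            return links
        for page in dangling:
            links.pop(page, None)          # drop the dangling pages
        for targets in links.values():
            targets -= dangling            # drop links pointing to them
\end{verbatim}
After the ranks of the remaining pages have converged, the removed pages and the links to them are added back and the computation is iterated a few more times, as described in Section 3.1.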
3 Implementation
As part of the Stanford WebBase project \cite{Brin_1998}, we have built a complete crawling and indexing
system with a current repository of 24 million web pages. Any web crawler needs to keep a database
of URLs so it can discover all the URLs on the web. To implement PageRank, the web crawler
simply needs to build an index of links as it crawls. While a simple task, it is non-trivial because
of the huge volumes involved. For example, to index our current 24 million page database in about
five days, we need to process about 50 web pages per second. Since there are about 11 links on
an average page (depending on what you count as a link) we need to process 550 links per second.
Also, our database of 24 million pages references over 75 million unique URLs which each link must
be compared against.
Much time has been spent making the system resilient in the face of many deeply and intricately
flawed web artifacts. There exist infinitely large sites, pages, and even URLs. A large fraction of
web pages have incorrect HTML, making parser design difficult. Messy heuristics are used to help
the crawling process. For example, we do not crawl URLs with /cgi-bin/ in them. Of course it
is impossible to get a correct sample of the "entire web" since it is always changing. Sites are
sometimes down, and some people decide to not allow their sites to be indexed. Despite all this, we
believe we have a reasonable representation of the actual link structure of the publicly accessible Web.
3.1 PageRank Implementation
We convert each URL into a unique integer, and store each hyperlink in a database using the integer
IDs to identify pages. Details of our implementation are in \cite{Brin_1998}. In general, we have implemented
PageRank in the following manner. First we sort the link structure by Parent ID. Then dangling
links are removed from the link database for reasons discussed above (a few iterations remove the
vast majority of the dangling links). We need to make an initial assignment of the ranks. This
assignment can be made by one of several strategies. If the computation is going to iterate until convergence, in
general the initial values will not affect the final values, just the rate of convergence. But we can speed
up convergence by choosing a good initial assignment. We believe that careful choice of the initial
assignment and a small finite number of iterations may result in excellent or improved performance.
Memory is allocated for the weights for every page. Since we use single precision floating point
values at 4 bytes each, this amounts to 300 megabytes for our 75 million URLs. If insufficient
RAM is available to hold all the weights, multiple passes can be made (our implementation uses
half as much memory and two passes). The weights from the current time step are kept in memory, and the previous weights are accessed linearly on disk. Also, all the access to the link database, \(A\), is linear because it is sorted. Therefore, \(A\) can be kept on disk as well. Although these data
structures are very large, linear disk access allows each iteration to be completed in about 6 minutes
on a typical workstation. After the weights have converged, we add the dangling links back in and
recompute the rankings. Note that after adding the dangling links back in, we need to iterate as many
times as was required to remove the dangling links. Otherwise, some of the dangling links will have
a zero weight. This whole process takes about five hours in the current implementation. With less
strict convergence criteria, and more optimization, the calculation could be much faster. Or, more
efficient techniques for estimating eigenvectors could be used to improve performance. However, it
should be noted that the cost required to compute the PageRank is insignificant compared to the
cost required to build a full text index.
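A simplified sketch of one disk-based iteration may help make the data layout concrete. The binary file formats, the names, and the in-memory out-degree array are assumptions made for the example (the actual structures are described in \cite{Brin_1998}); the point is only that the link file is read sequentially and, because it is sorted by source ID, the previous weights are read in order as well.
\begin{verbatim}
import struct
import numpy as np

def one_iteration(links_path, old_ranks_path, out_degree, E):
    """One PageRank iteration using only sequential disk access (sketch).

    links_path:     binary file of (source_id, dest_id) uint32 pairs,
                    sorted by source_id (hypothetical format).
    old_ranks_path: binary file of float32 ranks from the previous step.
    out_degree:     per-page out-link counts, precomputed once (or stored
                    alongside the link structure).
    Only the new weights live in memory; the old weights and the link
    database are streamed, which keeps each pass linear on disk.
    """
    new_ranks = np.zeros(len(out_degree), dtype=np.float32)
    old_ranks = np.memmap(old_ranks_path, dtype=np.float32, mode="r")

    with open(links_path, "rb") as f:
        while pair := f.read(8):
            src, dst = struct.unpack("<II", pair)
            # sorted by src, so old_ranks[src] is accessed in order
            new_ranks[dst] += old_ranks[src] / out_degree[src]

    d = old_ranks.sum() - new_ranks.sum()    # restore rank lost in the multiply
    new_ranks += d * np.asarray(E)
    return new_ranks
\end{verbatim}
The two-pass, half-memory variant mentioned above can be obtained from the same pattern, for example by accumulating only half of the destination-ID range on each pass over the link file.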
4 Convergence Properties
As can be seen from the graph in Figure 4, PageRank on a large 322 million link database converges
to a reasonable tolerance in roughly 52 iterations. The convergence on half the data takes roughly
45 iterations. This graph suggests that PageRank will scale very well even for extremely large
collections as the scaling factor is roughly linear in log n.
One of the interesting ramifications of the fact that the PageRank calculation converges rapidly
is that the web is an expander-like graph. To understand this better, we give a brief overview of
the theory of random walks on graphs; refer to Motwani-Raghavan \cite{Motwani} for further details. A
random walk on a graph is a stochastic process where at any given time step we are at a particular
node of the graph and choose an outedge uniformly at random to determine the node to visit at
the next time step. A graph is said to be an expander if it is the case that every (not too large)
subset of nodes \(S\) has a neighborhood (set of vertices accessible via outedges emanating from nodes in \(S\)) that is larger than some factor \(\alpha\) times \(\left|S\right|\); here, \(\alpha\) is called the expansion factor. It is the
case that a graph has a good expansion factor if and only if the largest eigenvalue is sufficiently
larger than the second-largest eigenvalue. A random walk on a graph is said to be rapidly-mixing
if it quickly (time logarithmic in the size of the graph) converges to a limiting distribution on the
set of nodes in the graph. It is also the case that a random walk is rapidly-mixing on a graph if
and only if the graph is an expander or has an eigenvalue separation.
To relate all this to the PageRank computation, note that it is essentially the determination of
the limiting distribution of a random walk on the Web graph. The importance ranking of a node
is essentially the limiting probability that the random walk will be at that node after a sufficiently
large time. The fact that the PageRank computation terminates in logarithmic time is equivalent to
saying that the random walk is rapidly mixing or that the underlying graph has a good expansion
factor. Expander graphs have many desirable properties that we may be able to exploit in the
future in computations involving the Web graph.
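To make the link between eigenvalue separation and the observed iteration counts concrete: the procedure of Section 2.6 is essentially a power iteration on \(A+E\times\mathbf{1}\), and a standard fact about power iteration (not specific to this paper, and assuming a simple dominant eigenvalue and an initial vector that is not orthogonal to the dominant eigenvector) is that the error contracts geometrically in the ratio of the two largest eigenvalues:
\[
\left|\left|R_i-R^{*}\right|\right|_1=O\!\left(\left|\frac{\lambda_2}{\lambda_1}\right|^{i}\right),
\qquad\text{so roughly}\qquad
i=O\!\left(\frac{\log(1/\epsilon)}{\log\left|\lambda_1/\lambda_2\right|}\right)
\]
iterations suffice to reach tolerance \(\epsilon\). A good expansion factor keeps \(\left|\lambda_2/\lambda_1\right|\) bounded away from one, which is one way to read the slow growth of the iteration counts in Figure 4.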