that point, the disparity decreased. In Table 2, we show the resulting page rank percentiles for
an assortment of different pages. Pages related to computer science have a higher McCarthy-rank
than Netscape-rank and pages related to computer science at Stanford have a considerably higher
McCarthy-rank. For example, the Web page of another Stanford Computer Science Dept. faculty
member is more than six percentile points higher on the McCarthy-rank. Note that the page ranks
are displayed as percentiles. This has the effect of compressing large differences in PageRank at
the top of the range.
Such personalized page ranks may have a number of applications, including personal search
engines. These search engines could save users a great deal of trouble by efficiently guessing a
large part of their interests given simple input such as their bookmarks or home page. We show an
example of this in Appendix A with the "Mitchell" query. In this example, we demonstrate that
while there are many people on the web named Mitchell, the number one result is the home page
of a colleague of John McCarthy named John Mitchell.
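A personalized rank of this kind can be sketched as a power iteration in which the fraction \((1-d)\) of rank redistributed at each step follows a user-specific vector \(E\). Everything below is a minimal sketch under that assumption; the toy graph, page names, and weights are hypothetical, not the data used for the experiments above.

```python
def personalized_pagerank(links, e, d=0.85, iters=50):
    """links: dict page -> list of pages it links to.
    e: dict page -> personalization weight (weights sum to 1)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # (1 - d) of the rank is injected according to E each step
        new = {p: (1 - d) * e.get(p, 0.0) for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling page: recycle its rank through E
                for q in pages:
                    new[q] += d * rank[p] * e.get(q, 0.0)
        rank = new
    return rank

# Hypothetical toy graph: "home" stands in for the user's homepage.
links = {
    "home": ["cs", "news"],
    "cs": ["home", "lab"],
    "news": ["home"],
    "lab": ["cs"],
}
e = {"home": 1.0}  # all source-of-rank concentrated on the homepage
rank = personalized_pagerank(links, e)
```

Concentrating \(E\) on a user's homepage or bookmarks pulls rank toward the pages reachable from them, which is the effect behind the higher McCarthy-rank of Stanford Computer Science pages.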
6.1 Manipulation by Commercial Interests
These types of personalized PageRanks are virtually immune to manipulation by commercial interests.
For a page to receive a high PageRank, it must convince either an important page or many
unimportant pages to link to it. At worst, manipulation can take the form of buying
advertisements (links) on important sites, but this seems well under control since it costs money.
This immunity to manipulation is an extremely important property. Commercial manipulation
is causing search engines a great deal of trouble, and it makes features that would otherwise be
very valuable difficult to implement. For example, fast updating of documents is a very desirable
feature, but it is abused by people who want to manipulate the results of the search engine.
A compromise between the two extremes of uniform \(E\) and single page \(E\) is to let \(E\) consist of
all the root-level pages of all web servers. Notice that this allows some manipulation of
PageRanks: someone who wished to game the system could simply create a large number of
root-level servers, all pointing at a particular site.
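The root-level choice of \(E\) can be sketched as follows; the URLs and the uniform weighting are illustrative assumptions.

```python
from urllib.parse import urlparse

def root_page_e(urls):
    """Uniform E over URLs whose path is the server root."""
    roots = [u for u in urls if urlparse(u).path in ("", "/")]
    return {u: 1.0 / len(roots) for u in roots}

# Hypothetical URL list: two servers, each with one root page.
urls = [
    "http://www.stanford.edu/",
    "http://www.stanford.edu/class/cs101/",
    "http://www.example.com/",
    "http://www.example.com/about.html",
]
e = root_page_e(urls)
# Only the two root pages receive E weight, 0.5 each; creating many
# new root-level servers would dilute and redirect this weight, which
# is exactly the manipulation noted above.
```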
7 Applications
7.1 Estimating Web Traffic
Because PageRank roughly corresponds to a random web surfer (see Section 2.5), it is interesting
to see how PageRank corresponds to actual usage. We used the counts of web page accesses from
the NLANR \cite{NLANR} proxy cache and compared these to PageRank. The NLANR data came from several
national proxy caches over a period of several months and consisted of 11,817,665 unique URLs
with the highest hit count going to Altavista with 638,657 hits. There were 2.6 million pages in the
intersection of the cache data and our 75 million URL database. It is extremely difficult to compare
these datasets analytically for a number of different reasons. Many of the URLs in the cache access
data are people reading their personal mail on free email services. Duplicate server names and page
names are a serious problem. Incompleteness and bias are problems in both the PageRank data and
the usage data. However, we did see some interesting trends in the data. There seems to be a high
usage of pornographic sites in the cache data, but these sites generally had low PageRanks. We
believe this is because people do not want to link to pornographic sites from their own web pages.
Using this technique of looking for differences between PageRank and usage, it may be possible to
find things that people like to look at, but do not want to mention on their web pages. There are
some sites that have a very high usage, but low PageRank such as netscape.yahoo.com. We believe
there is probably an important backlink that is simply omitted from our database (we have only
a partial link structure of the web). It may be possible to use usage data as a start vector for
PageRank, and then iterate PageRank a few times. This might allow filling in holes in the usage
data. In any case, these types of comparisons are an interesting topic for future study.
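One simple way to carry out the comparison suggested above is a rank correlation between PageRank and hit counts over the URLs the two datasets share; large per-page rank gaps then flag pages that are viewed much more (or less) than they are linked. The data values below are hypothetical, and the statistic is a plain Spearman coefficient without tie correction, a sketch rather than the analysis performed on the NLANR data.

```python
def ranks(values):
    """Map each key to its rank (0 = smallest value)."""
    order = sorted(values, key=values.get)
    return {u: i for i, u in enumerate(order)}

def spearman(a, b):
    """Spearman rank correlation of two dicts over their common keys
    (no tie correction; adequate for a sketch)."""
    common = sorted(set(a) & set(b))
    ra = ranks({u: a[u] for u in common})
    rb = ranks({u: b[u] for u in common})
    n = len(common)
    d2 = sum((ra[u] - rb[u]) ** 2 for u in common)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical data: PageRank values and proxy-cache hit counts.
pagerank = {"a": 0.40, "b": 0.30, "c": 0.20, "d": 0.10}
hits = {"a": 9000, "b": 5000, "c": 6000, "d": 100, "e": 40}
rho = spearman(pagerank, hits)
# Per-page gaps |rank_by_pagerank - rank_by_hits| pick out pages that
# people visit but do not link to, as with the pornographic sites above.
```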
7.2 PageRank as Backlink Predictor
One justification for PageRank is that it is a predictor for backlinks. In \cite{Cho_1998} we explore the
issue of how to crawl the web efficiently, trying to crawl better documents first. We found on tests
of the Stanford web that PageRank is a better predictor of future citation counts than citation
counts themselves.
The experiment assumes that the system starts out with only a single URL and no other
information, and the goal is to try to crawl the pages in as close to the optimal order as possible.
The optimal order is to crawl pages in exactly the order of their rank according to an evaluation
function. For the purposes here, the evaluation function is simply the number of citations, given
complete information. The catch is that the information needed to calculate the evaluation
function is not available until after all the documents have been crawled. It turns out that,
using the incomplete data, PageRank orders the crawl more effectively than the number of known
citations does.
In other words, PageRank is a better predictor than citation counting even when the measure is
the number of citations! The explanation for this seems to be that PageRank avoids the local
maxima that citation counting gets stuck in. For example, citation counting tends to get stuck in
local collections like the Stanford CS web pages, taking a long time to branch out and find highly
cited pages in other areas. PageRank quickly finds that the Stanford homepage is important, and
gives preference to its children, resulting in an efficient, broad search.
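The crawl-ordering comparison can be sketched as follows: starting from a single seed, the next page to fetch is chosen either by its number of known citations or by its PageRank on the subgraph discovered so far. The toy graph, tie-breaking, and scoring details are illustrative assumptions, not the setup of \cite{Cho_1998}.

```python
def pagerank(links, d=0.85, iters=30):
    """Standard PageRank on a dict page -> outlinks (uniform E)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in links.items():
            outs = [q for q in outs if q in new]
            if outs:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / len(pages)
        rank = new
    return rank

def by_known_citations(known):
    scores = {p: 0 for p in known}
    for outs in known.values():
        for q in outs:
            scores[q] += 1
    return scores

def by_pagerank(known):
    return pagerank(known)

def crawl_order(graph, seed, scorer):
    """Greedy crawl: repeatedly fetch the best-scoring uncrawled URL,
    scoring only with the link structure seen so far."""
    crawled, frontier = [], {seed}
    while frontier:
        discovered = set(crawled) | frontier
        # only edges out of already-crawled pages are known
        known = {p: [q for q in graph[p] if q in discovered] for p in crawled}
        for p in frontier:
            known.setdefault(p, [])
        scores = scorer(known)
        best = max(frontier, key=lambda p: (scores[p], p))
        frontier.remove(best)
        crawled.append(best)
        frontier |= set(graph[best]) - set(crawled)
    return crawled

# Hypothetical graph: a local cluster ("local", "l1", "l2") next to a
# hub leading elsewhere, loosely mimicking the Stanford CS effect above.
graph = {
    "root": ["hub", "local"],
    "local": ["l1", "l2"],
    "l1": ["local", "l2"],
    "l2": ["local", "l1"],
    "hub": ["hot"],
    "hot": ["hub"],
}
order_cit = crawl_order(graph, "root", by_known_citations)
order_pr = crawl_order(graph, "root", by_pagerank)
```

Comparing either ordering against the ideal one (pages sorted by their complete-information citation counts) gives the kind of measurement reported in \cite{Cho_1998}.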
This ability to predict citation counts is a powerful justification for using PageRank. Since it
is very difficult to map the citation structure of the web completely, PageRank may even be a
better approximation of citation counts than the measured citation counts themselves.