5.2 Rank Merging
The reason that the title based PageRank system works so well is that the title match ensures
high precision, and the PageRank ensures high quality. When matching a query like "University"
on the web, recall is not very important because there is far more than a user can look at. For
more specific searches where recall is more important, the traditional information retrieval scores
over full-text and the PageRank should be combined. Our Google system does this type of rank
merging. Rank merging is known to be a very difficult problem, and we need to spend considerable
additional effort before we will be able to do a reasonable evaluation of these types of queries.
However, we do believe that using PageRank as a factor in these queries is quite beneficial.
5.3 Some Sample Results
We have experimented considerably with Google, a full-text search engine which uses PageRank.
While a full-scale user study is beyond the scope of this paper, we provide a sample query in
Appendix A. For more queries, we encourage the reader to test Google themselves \cite{Brin_Page}.
Table 1 shows the top 15 pages based on PageRank. This particular listing was generated in
July 1996. In a more recent calculation of PageRank, Microsoft has just edged out Netscape for
the highest PageRank.
5.4 Common Case
One of the design goals of PageRank was to handle the common case for queries well. For example,
a user searched for "wolverine", remembering that the University of Michigan system used for all
administrative functions by students was called something with a wolverine in it. Our PageRank
based title search system returned the answer "Wolverine Access" as the first result. This is sensible
since all the students regularly use the Wolverine Access system, and a random user is quite likely
to be looking for it given the query "wolverine". The fact that the Wolverine Access site is a good
common case is not contained in the HTML of the page. Even if there were a way of defining good meta-information of this form within a page, it would be problematic since a page author could
not be trusted with this kind of evaluation. Many web page authors would simply claim that their
pages were all the best and most used on the web.
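As a rough illustration of the common case behavior described above, the following Python sketch keeps only pages whose titles contain every query word and orders the matches by PageRank. The function name `title_search`, the page records, and the PageRank values are hypothetical and invented for illustration; they are not taken from the actual system or its data.

```python
def title_search(query, pages, top_k=10):
    """Return up to top_k pages whose title contains every query word,
    ordered by descending PageRank (hypothetical helper)."""
    words = query.lower().split()
    matches = [
        page for page in pages
        if all(word in page["title"].lower().split() for word in words)
    ]
    # Title matching supplies precision; PageRank decides the order.
    return sorted(matches, key=lambda page: page["pagerank"], reverse=True)[:top_k]


# Made-up records: titles and PageRank values are illustrative only.
pages = [
    {"title": "Wolverine animal facts", "pagerank": 0.0010},
    {"title": "Wolverine Access", "pagerank": 0.0040},
    {"title": "University of Michigan", "pagerank": 0.0100},
    {"title": "Wolverine comics fan page", "pagerank": 0.0005},
]

results = title_search("wolverine", pages)
for page in results:
    print(page["title"], page["pagerank"])
```

In this toy data the heavily used page comes first only because its assumed PageRank is the highest among the title matches; nothing in the page text itself marks it as the common case.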
It is important to note that the goal of finding a site that contains a great deal of information
about wolverines is a very different task than finding the common case wolverine site. There is an
interesting system \cite{Marchiori_1997} that attempts to find sites that discuss a topic in detail by propagating
the textual matching score through the link structure of the web. It then tries to return the page
on the most central path. This yields good results for queries like "flower"; the system will
return good navigation pages from sites that deal with the topic of flowers in detail. Contrast that
with the common case approach which might simply return a commonly used commercial site that
had little information except how to buy flowers. It is our opinion that both of these tasks are
important, and a general purpose web search engine should return results which fulfill the needs
of both of these tasks automatically. In this paper, we are concentrating only on the common case
approach.
5.5 Subcomponents of Common Case
It is instructive to consider what kind of common case scenarios PageRank can help represent.
Besides a page which has a high usage, like the Wolverine Access site, PageRank can also represent
a collaborative notion of authority or trust. For example, a user might prefer a news story simply
because it is linked directly from the New York Times home page. Of course such a story
will receive quite a high PageRank simply because it is mentioned by a very important page. This
seems to capture a kind of collaborative trust, since if a page was mentioned by a trustworthy
or authoritative source, it is more likely to be trustworthy or authoritative. Similarly, quality or
importance seems to fit within this kind of circular definition.
6 Personalized PageRank
An important component of the PageRank calculation is \(E\), a vector over the Web pages which
is used as a source of rank to make up for the rank sinks such as cycles with no outedges (see
Section 2.4). However, aside from solving the problem of rank sinks, \(E\) turns out to be a powerful
parameter to adjust the page ranks. Intuitively the \(E\) vector corresponds to the distribution of web
pages that a random surfer periodically jumps to. As we see below, it can be used to give broad
general views of the Web or views which are focussed and personalized to a particular individual.
We have performed most experiments with an \(E\) vector that is uniform over all web pages with \(\left|\left|E\right|\right|_1=0.15\). This corresponds to a random surfer periodically jumping to a random web page.
This is a very democratic choice for \(E\) since all web pages are valued simply because they exist.
Although this technique has been quite successful, there is an important problem with it. Some
Web pages with many related links receive an overly high ranking. Examples of these include
copyright warnings, disclaimers, and highly interlinked mailing list archives.
Another extreme is to have \(E\) consist entirely of a single web page. We tested two such \(E\)'s:
the Netscape home page, and the home page of a famous computer scientist, John McCarthy. For
the Netscape home page, we attempt to generate page ranks from the perspective of a novice user
who has Netscape set as the default home page. In the case of John McCarthy's home page we
want to calculate page ranks from the perspective of an individual who has given us considerable
contextual information based on the links on his home page.
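The effect of the two kinds of \(E\) can be sketched on a toy graph. The Python code below is a minimal illustration, not the implementation used in our experiments: it assumes the common random surfer formulation in which the surfer follows a link with probability 0.85 and otherwise jumps to a page drawn from \(E\) (treating \(\left|\left|E\right|\right|_1=0.15\) as the jump probability is an assumption of this sketch). The function name `pagerank`, its parameters, and the five-page graph, including the two mutually linked "archive" pages standing in for an interlinked mailing list archive, are all hypothetical.

```python
def pagerank(links, e, jump=0.15, iters=200, tol=1e-10):
    """Power iteration with a source-of-rank vector e (hypothetical helper).

    links: dict mapping each page to the list of pages it links to.
    e:     dict mapping pages to non-negative weights; normalized internally.
    jump:  assumed probability of jumping to a page drawn from e.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    total = sum(e.get(p, 0.0) for p in pages)
    if total <= 0:
        raise ValueError("e must give positive weight to at least one page")
    e_dist = {p: e.get(p, 0.0) / total for p in pages}

    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = (1.0 - jump) * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
        # Rank lost to jumps (and to pages with no outedges) re-enters via e.
        lost = 1.0 - sum(new.values())
        for p in pages:
            new[p] += lost * e_dist[p]
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Toy graph: a home page with two links, plus an isolated two-page cycle.
links = {
    "home": ["a", "b"],
    "a": ["home"],
    "b": ["a"],
    "archive1": ["archive2"],
    "archive2": ["archive1"],
}

uniform = pagerank(links, {p: 1.0 for p in links})  # uniform E
personal = pagerank(links, {"home": 1.0})           # E is a single page

for p in sorted(links):
    print(f"{p:9s} uniform={uniform[p]:.4f} personal={personal[p]:.4f}")
```

With the uniform \(E\), the isolated cycle keeps a sizable share of the rank because jumps keep feeding it. With \(E\) concentrated on `home`, the cycle receives no incoming rank and its share decays toward zero, while `home` ends up with the highest rank in this toy graph.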
In both cases, the mailing list problem mentioned above did not occur. And, in both cases, the
respective home page got the highest PageRank and was followed by its immediate links. From
Table 2: Page Ranks for Two Different Views: Netscape vs. John McCarthy