Loading...

An Introduction to the PageRank Algorithm

Presentation (Elaboration) 2012 10 Pages

Computer Science - Internet, New Technologies

Excerpt

An Introduction to the PageRank Algorithm

Haoyue Hu

1 Introduction

Google’s success in the past years is strongly connected with its web search delivering precise results to queries. Over 200 factors are taken into account before the resulting pages are listed according to an overall relevancy score. One of the factors is the PageRank, which is a numerical value used to establish an importance ranking between all the web pages.

The following quote from Google’s web page emphasizes the significance of PageRank:

“The heart of our software is PageRankTM, a system for ranking web pages developed by our founders Larry Page and Sergey Brin at Stanford University. And while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to provide the basis for all of our web search tools.”

This paper tries to give a brief overview of the PageRank algorithm and its related topics. In section 2 the basic concepts and a first definition of PageRank are introduced. Due to problems arising with the simple model, the random surfer model leading to the Google matrix is depicted in section 3. Section 4 deals with convergence and sensitivity issues of the PageRank vector. The computation of the PageRank vector using different methods is shown in section 5.

2 Basics of PageRank

2.1 The World Wide Web as a Directed Graph

A directed graph always consists of nodes and directed edges. In the context of the World Wide Web, each node represents a web page and each directed edge represents a hyperlink. We distinguish between inlinks pointing into nodes and outlinks pointing out from nodes. If a node has no outlinks, it is called a dangling node. An example of a tiny web graph consisting of 5 pages is shown in figure 1. In this example node 2 is a dangling node.

illustration not visible in this excerpt

Figure 1: An example web graph

2.2 The Basic Idea Behind PageRank

A link to a page can be seen as a recommendation to that page. A page is considered to be important if it is recommended by a lot of other important pages or in other words, if it is pointed to by other important pages. As a consequence, the more inlinks a page has, the more important is the page. Not only the amount of inlinks plays a role, but also the quality of the pages linking to a specific page. A weighting factor can be introduced to evaluate the importance of the inlinking pages: Inlinks from important pages carry more weight than inlinks from unimportant or even spam pages. If a page points to several pages (i.e. has several outlinks), its weight is evenly distributed among all the outlinking pages.

2.3 The Original Summation Formula for PageRank

With regard to the basic concepts given above, the PageRank is defined as follows:

The PageRank of a page i is the sum of the PageRanks of all pages pointing into page i:

illustration not visible in this excerpt

where

- r(Pj) is the PageRank of page j

- BPi is the set of pages pointing into page i

- |Pj | is the number of outlinks from page j

To calculate the PageRank of a page, the other PageRanks have to be known. This definition implies that the calculation is based on a recursive procedure. The (k + 1)-th iteration step is[Abbildung in dieser Leseprobe nicht enthalten]

In equation (2) the initial value r0(Pi) can be chosen arbitrarily. Usually an equal ranking to all the pages in the web graph is assumed at the beginning, so the initial value is set to[Abbildung in dieser Leseprobe nicht enthalten]

where n is the total number of web pages in the web graph.

To simplify the computation, equation (2) can be reformulated into a vector-matrix multiplication [Abbildung in dieser Leseprobe nicht enthalten]

with p being the PageRank vector of size (1 × n) which holds the PageRanks for all n pages and H being the hyperlink matrix of size (n × n). H represents the link structure and is defined as:[Abbildung in dieser Leseprobe nicht enthalten]

Each iteration involves one vector-matrix multiplication which has the computational effort of O(n2 ). It is estimated that a web page has an average of 10 outlinks[4]. Therefore H is a very sparse matrix and only minimal storage schemes are needed. The effort is reduced to O(nnz(H)), where nnz(H) ≈ 10n is the number of nonzeros in H, so the algorithm has O(n). Another observation concerns dangling nodes: Their respective rows in H only have zero entries, whereas the rows of nondangling nodes are stochastic (sum of all entries in the row is equal to 1).

Using this PageRank definition, two problems might occur: The first problem is the appearance of so called rank sinks which are pages that cannot share their PageRanks with other pages due to the lack of outlinks. In figure 2, the dangling node 2 is a rank sink.

illustration not visible in this excerpt

Figure 2: Graph with a dangling node acting as a rank sink

The second problem are cycles; the simplest case of a cycle is shown in figure [3]. Because page 1 only points to page 2 and vice versa, thus creating a cycle or in other words an infinite loop, no convergence can be achieved in the PageRank iterations. For example, consider [Abbildung in dieser Leseprobe nicht enthalten] as the starting vector. Then the PageRank vector jumps between [Abbildung in dieser Leseprobe nicht enthalten] in all further iteration steps.

illustration not visible in this excerpt

Figure 3: Graph with a cycle

Therefore, we need some modifications to the PageRank algorithm to overcome these problems.

3 Adjustments to the Basic Model - Random Surfer Model

Imagine a web surfer who successively clicks on links randomly. At dangling nodes (e.g. pdf files), he jumps to any page. When he gets bored by clicking on arbitrary links, he simply enters a new page in the browser‘s URL line. The proportion of time he spends on a page corresponds to its relative importance.

To model the random surfer in a mathematical context, the zero rows in H corresponding to dangling nodes are replaced by[Abbildung in dieser Leseprobe nicht enthalten].Weobtainthe[

illustration not visible in this excerpt

[...]


Details

Pages
10
Year
2012
ISBN (eBook)
9783656369288
File size
1.4 MB
Language
English
Catalog Number
v209302
Institution / College
Technical University of Munich
Grade
1,3
Tags
introduction pagerank algorithm

Author

Share

Previous

Title: An Introduction to the PageRank Algorithm