# Exploring Yelp Dataset with Neo4j — Part II: PageRank

Also:

- Exploring Yelp Dataset with Neo4j — Part I: From Raw data to Nodes and Relationships with Python

- Exploring Yelp Dataset with Neo4j — Part III: Louvain Community Detection Algorithm

Let’s first look at the schema. In Neo4j Browser, enter `CALL db.schema.visualization()`

:

Statistics can be *quickly* retrieved with`CALL apoc.meta.stats()`

. It is clear the Dataset downloaded from Yelp on 03.05.2020 (MD5 for *yelp_dataset.tar*: `7610af013edf610706021697190dab15`

) is different than the one used in the book. For example, the one in the book contained more `Country`

than that of the downloaded.

We are going to compute PageRank on `User`

nodes. One thing worth noticing is that PageRank implementation in Neo4j’s Graph Data Science Library respects *relationship direction*, meaning, for example, if `(a:Site)-[:LINK_TO]->(b:Site)`

, `a`

is linked to `b`

, but it is not necessary `b`

is also linked to `a`

.

In Yelp case, when a `User`

A is a friend of`User`

B, they are automatically both friends to each other. In our case, as by design, this friendship is reflected by using only one single connection `(A:User)-[:FRIENDS]->(B:User)`

, in order to simplify the relationship as well as save resources. Later, one can easily query to get all friends of a given users by *disregarding the direction*, e.g. `MATCH (n:User {name: 'Alice'})-[:FRIENDS]-(m:User) RETURN n, COUNT(m)`

.

Needless to say, we need to project the original graph into a new one where there are only `User`

nodes, and if there is a `FRIENDS`

relationship between a pair of `User`

nodes, we generate 2 relationships of type `FRIEND_OF`

in different directions. As an example, projecting that into a named graph:

`CALL gds.graph.create(`

'my-graph',

{

User: {

label: 'User'

}

},

{

FRIEND_OF: {

type: 'FRIENDS',

orientation: 'UNDIRECTED'

}

}

)

Before moving on, we take some time to discuss the PageRank formula. The original one is:

But the formula used for PageRank implementation in Neo4j’s Graph Data Science Library is:

As a result, the side effect, according to Wikipedia, is “*The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N.*” I prefer the second version, because in my opinion, manipulating very small numbers need extra attention, regarding the accuracy. As in Appendix [2], PageRank scores are found to be stored as Java's`Double`

, which should be decently good if not dividing by N (the total number of nodes) in PageRank score computation.

The next step is to run PageRank algorithm on the projected graph. There are some parameters that can be configured to affect how the algorithm computes the scores. Notably:

`dampingFactor`

: default`0.85`

.`tolerance`

: default`0.0000001`

.`maxIterations`

: default`20`

.

To my knowledge, Damping Factor `0.85`

is found to be generally good. The PageRank implementation follows *interative* approach, meaning it keeps computing iteratively until the `maxIterations`

is reached or the changes of PageRank scores in all nodes are less than `tolerance`

. If we want the computation to soon converge, then increase the `tolerance`

. If we want to find out *more stable* PageRank scores, increase the `maxIterations`

and decrease the `tolerance`

. But trade-off would be time-consuming and we might risk not finding expectedly stable scores for all nodes.

As an engineer, I focus on *actionable results* with *adequate efforts* while maintaining a *decent quality*. That said, let’s keep the `dampingFactor`

and `tolerance`

as default; then try `maxIterations`

with a variety of values, e.g. `20`

, `30`

, `40`

. In the computation results, if seeing `didCOnverge`

as `true`

, that would be great, we’ll stop there; if not, we extract the Top 10 highest scores for each `maxIteration`

in certain categories, and examine whether the scores in Top 10 jump around *slightly *or *wildly*. If slightly, great, we’ll stop there; if not, increase `maxIterations`

. *Note*: It is assumed that we are only interested in the most influential nodes, e.g. Top 10; and do not care about the rest.

The Cypher to compute PageRank:

`# Use `**anonymous** projected graph so no need to clean up later.

UNWIND [20, 30, 40] AS numIter

CALL gds.pageRank.write({

nodeProjection: 'User',

relationshipProjection: {

FRIEND_OF: {

type: 'FRIENDS',

orientation: 'UNDIRECTED'

}

},

writeProperty: 'pageRank' + toString(numIter),

maxIterations: numIter

})

YIELD ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration

RETURN numIter, ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration

Unfortunately, the flag`didConverge`

returned `false`

for *all* `maxIterations`

(link). Let’s examine the Top 10 PageRank scores for Categories `Hostels`

and `Hotels`

for `maxIterations `

of`20`

, `30`

, and `40`

.

`UNWIND ['Hotels', 'Hostels'] as cat_name`

MATCH (c:Category {category_id: cat_name})<-[:IN_CATEGORY]-(:Business)<-[:REVIEWS]-(:Review)<-[:WROTE]-(u:User)

WITH c, u

WITH c.category_id AS category_id, apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank20")[..10] AS userWithPageRank20, apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank30")[..10] AS userWithPageRank30, apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank40")[..10] AS userWithPageRank40, range(0, 9, 1) AS indexes

UNWIND indexes AS index

RETURN category_id, userWithPageRank20[index].name AS name20, userWithPageRank20[index].pageRank20 AS pageRank20, userWithPageRank30[index].name AS name30, userWithPageRank30[index].pageRank30 AS pageRank30, userWithPageRank40[index].name AS name40, userWithPageRank40[index].pageRank40 AS pageRank40

See Appendix [3] for an issue with `apoc.coll.sortNodes()`

. The result:

It looks like we have a *relatively stable* result: The `User`

in each `Category`

do not jump around wildly; when some switch places, looking carefully, we’ll see their PageRank scores are almost equal.

Beside, in the result (link), in `centralityDistribution`

column, we see `p1`

near `0.15`

. That’s due to `1-dampingFactor`

for pages where there is almost no in-link, i.e. very *unpopular* pages. Said in other word, `0.15`

is the minimum value of PageRank score.

In conclusion, given the dataset, the PageRank scores for `User`

with default `dampingFactor`

,`tolerance`

and `maxIterations`

being either `30`

or `40`

are sufficiently good for Category `Hostels`

and `Hotels`

. This may not be necessarily hold true in other Categories, e.g. `Grocery`

though.

# Appendix

*[1] Find out if the PageRank implementation respects relationship direction.*

Suppose an empty Neo4j database and it is stopped. Neo4j’s Graph Data Science source code. Download and add this code snippet into `graph-data-science\algo-common\src\main\java\org\neo4j\graphalgo\pagerank\PageRank.java`

, function `initializeSteps()`

:

`# degree: the number of out-links for a given node.`

var iter = graph.nodeIterator();

while(iter.hasNext()){

var node = iter.next();

var degree = graph.degree(node);

getProgressLogger().logMessage(String.format("*** node [%s]: has degree [%s]", node, degree));

}

Recompile the library with command: `gradlew packaging:shadownJar`

. Rename the JAR file in `graph-data-science\packaging\build\libs\`

into `gdsLibrary-1.2.1.jar`

and copy/overwrite the one in `plugins`

folder of the current Neo4j Installation.

Start Neo4j again. In Neo4j Browser, create a simple Graph:

`CREATE (a:User {name: "a"}), (b:User {name: "b"}), (c:User {name: "c"}), (a)-[:FRIEND_WITH]->(b), (c)-[:FRIEND_WITH]->(b)`

Then run PageRank algorithm on this graph:

`CALL gds.pageRank.write({`

nodeProjection: 'User',

relationshipProjection: 'FRIEND_WITH',

writeProperty: 'pageRank',

maxIterations: 20

})

YIELD ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration

Check log file in `logs`

folder of Neo4j Installation (Hint: searching for `***`

):

`*** node [0]: has degree [1]`

*** node [1]: has degree [1]

*** node [2]: has degree [0]

It shows that the PageRank implementation respects relationship direction.

*[2] Find out which data type to store PageRank.*

In `graph-data-science\algo-common\src\main\java\org\neo4j\graphalgo\pagerank\BaseComputeStep.java`

, function `initialize()`

, we see `this.pageRank`

which was declared as `double[]`

.

Manipulating very large or very small numbers with floating-point numbers might lead to accuracy problem. Another alternative is Java’s BigDecimal. However, there is no free lunch. `BigDecimal`

requires much bigger memory consumption, computation time and complexity, compared to `double`

. (link)

*[3] Inconsistency between **apoc.coll.sort()** and **apoc.coll.sortNodes()**:*

The first sorts ascendingly while the second descendingly. Later, if this issue is fixed, one can reverse the collection by `apoc.coll.reverse()`

. Another trick, `apoc.coll.sortNodes(..., "`

to switch the sorting order. (link)**^**propertyToSort")