Exploring Yelp Dataset with Neo4j — Part II: PageRank

TRAN Ngoc Thach
6 min readJun 13, 2020

Also:
- Exploring Yelp Dataset with Neo4j — Part I: From Raw data to Nodes and Relationships with Python
- Exploring Yelp Dataset with Neo4j — Part III: Louvain Community Detection Algorithm

Let’s first look at the schema. In Neo4j Browser, enter CALL db.schema.visualization():

Statistics can be quickly retrieved withCALL apoc.meta.stats(). It is clear the Dataset downloaded from Yelp on 03.05.2020 (MD5 for yelp_dataset.tar: 7610af013edf610706021697190dab15) is different than the one used in the book. For example, the one in the book contained more Country than that of the downloaded.

We are going to compute PageRank on User nodes. One thing worth noticing is that PageRank implementation in Neo4j’s Graph Data Science Library respects relationship direction, meaning, for example, if (a:Site)-[:LINK_TO]->(b:Site), a is linked to b, but it is not necessary b is also linked to a.

In Yelp case, when a User A is a friend ofUser B, they are automatically both friends to each other. In our case, as by design, this friendship is reflected by using only one single connection (A:User)-[:FRIENDS]->(B:User), in order to simplify the relationship as well as save resources. Later, one can easily query to get all friends of a given users by disregarding the direction, e.g. MATCH (n:User {name: 'Alice'})-[:FRIENDS]-(m:User) RETURN n, COUNT(m).

Needless to say, we need to project the original graph into a new one where there are only User nodes, and if there is a FRIENDS relationship between a pair of User nodes, we generate 2 relationships of type FRIEND_OF in different directions. As an example, projecting that into a named graph:

CALL gds.graph.create(
'my-graph',
{
User: {
label: 'User'
}
},
{
FRIEND_OF: {
type: 'FRIENDS',
orientation: 'UNDIRECTED'
}
}
)

Before moving on, we take some time to discuss the PageRank formula. The original one is:

PageRank formula (Original)

But the formula used for PageRank implementation in Neo4j’s Graph Data Science Library is:

PageRank formula (GDS’s Implementation)

As a result, the side effect, according to Wikipedia, is “The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N.” I prefer the second version, because in my opinion, manipulating very small numbers need extra attention, regarding the accuracy. As in Appendix [2], PageRank scores are found to be stored as Java'sDouble, which should be decently good if not dividing by N (the total number of nodes) in PageRank score computation.

The next step is to run PageRank algorithm on the projected graph. There are some parameters that can be configured to affect how the algorithm computes the scores. Notably:

  • dampingFactor: default 0.85.
  • tolerance: default 0.0000001.
  • maxIterations: default 20.

To my knowledge, Damping Factor 0.85 is found to be generally good. The PageRank implementation follows interative approach, meaning it keeps computing iteratively until the maxIterations is reached or the changes of PageRank scores in all nodes are less than tolerance. If we want the computation to soon converge, then increase the tolerance. If we want to find out more stable PageRank scores, increase the maxIterations and decrease the tolerance. But trade-off would be time-consuming and we might risk not finding expectedly stable scores for all nodes.

As an engineer, I focus on actionable results with adequate efforts while maintaining a decent quality. That said, let’s keep the dampingFactor and tolerance as default; then try maxIterations with a variety of values, e.g. 20, 30, 40. In the computation results, if seeing didCOnverge as true, that would be great, we’ll stop there; if not, we extract the Top 10 highest scores for each maxIteration in certain categories, and examine whether the scores in Top 10 jump around slightly or wildly. If slightly, great, we’ll stop there; if not, increase maxIterations. Note: It is assumed that we are only interested in the most influential nodes, e.g. Top 10; and do not care about the rest.

The Cypher to compute PageRank:

# Use anonymous projected graph so no need to clean up later.
UNWIND [20, 30, 40] AS numIter
CALL gds.pageRank.write({
nodeProjection: 'User',
relationshipProjection: {
FRIEND_OF: {
type: 'FRIENDS',
orientation: 'UNDIRECTED'
}
},
writeProperty: 'pageRank' + toString(numIter),
maxIterations: numIter
})
YIELD ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration
RETURN numIter, ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration

Unfortunately, the flagdidConverge returned false for all maxIterations (link). Let’s examine the Top 10 PageRank scores for Categories Hostels and Hotels for maxIterations of20, 30, and 40.

UNWIND ['Hotels', 'Hostels'] as cat_name
MATCH (c:Category {category_id: cat_name})<-[:IN_CATEGORY]-(:Business)<-[:REVIEWS]-(:Review)<-[:WROTE]-(u:User)
WITH c, u
WITH c.category_id AS category_id, apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank20")[..10] AS userWithPageRank20, apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank30")[..10] AS userWithPageRank30, apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank40")[..10] AS userWithPageRank40, range(0, 9, 1) AS indexes
UNWIND indexes AS index
RETURN category_id, userWithPageRank20[index].name AS name20, userWithPageRank20[index].pageRank20 AS pageRank20, userWithPageRank30[index].name AS name30, userWithPageRank30[index].pageRank30 AS pageRank30, userWithPageRank40[index].name AS name40, userWithPageRank40[index].pageRank40 AS pageRank40

See Appendix [3] for an issue with apoc.coll.sortNodes(). The result:

Top 10 PageRank scores for Category “Hotels” and “Hostels”

It looks like we have a relatively stable result: The User in each Category do not jump around wildly; when some switch places, looking carefully, we’ll see their PageRank scores are almost equal.

Beside, in the result (link), in centralityDistribution column, we see p1 near 0.15. That’s due to 1-dampingFactor for pages where there is almost no in-link, i.e. very unpopular pages. Said in other word, 0.15 is the minimum value of PageRank score.

In conclusion, given the dataset, the PageRank scores for User with default dampingFactor,tolerance and maxIterations being either 30or 40are sufficiently good for Category Hostels and Hotels. This may not be necessarily hold true in other Categories, e.g. Grocery though.

Appendix

[1] Find out if the PageRank implementation respects relationship direction.

Suppose an empty Neo4j database and it is stopped. Neo4j’s Graph Data Science source code. Download and add this code snippet into graph-data-science\algo-common\src\main\java\org\neo4j\graphalgo\pagerank\PageRank.java, function initializeSteps():

# degree: the number of out-links for a given node.
var iter = graph.nodeIterator();
while(iter.hasNext()){
var node = iter.next();
var degree = graph.degree(node);
getProgressLogger().logMessage(String.format("*** node [%s]: has degree [%s]", node, degree));
}

Recompile the library with command: gradlew packaging:shadownJar. Rename the JAR file in graph-data-science\packaging\build\libs\ into gdsLibrary-1.2.1.jar and copy/overwrite the one in plugins folder of the current Neo4j Installation.

Start Neo4j again. In Neo4j Browser, create a simple Graph:

CREATE (a:User {name: "a"}), (b:User {name: "b"}), (c:User {name: "c"}), (a)-[:FRIEND_WITH]->(b), (c)-[:FRIEND_WITH]->(b)

Then run PageRank algorithm on this graph:

CALL gds.pageRank.write({
nodeProjection: 'User',
relationshipProjection: 'FRIEND_WITH',
writeProperty: 'pageRank',
maxIterations: 20
})
YIELD ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration

Check log file in logs folder of Neo4j Installation (Hint: searching for ***):

*** node [0]: has degree [1]
*** node [1]: has degree [1]
*** node [2]: has degree [0]

It shows that the PageRank implementation respects relationship direction.

[2] Find out which data type to store PageRank.

In graph-data-science\algo-common\src\main\java\org\neo4j\graphalgo\pagerank\BaseComputeStep.java, function initialize(), we see this.pageRank which was declared as double[].

Manipulating very large or very small numbers with floating-point numbers might lead to accuracy problem. Another alternative is Java’s BigDecimal. However, there is no free lunch. BigDecimal requires much bigger memory consumption, computation time and complexity, compared to double. (link)

[3] Inconsistency between apoc.coll.sort() and apoc.coll.sortNodes():

The first sorts ascendingly while the second descendingly. Later, if this issue is fixed, one can reverse the collection by apoc.coll.reverse(). Another trick, apoc.coll.sortNodes(..., "^propertyToSort") to switch the sorting order. (link)

--

--