Exploring Yelp Dataset with Neo4j — Part II: PageRank
Also:
- Exploring Yelp Dataset with Neo4j — Part I: From Raw data to Nodes and Relationships with Python
- Exploring Yelp Dataset with Neo4j — Part III: Louvain Community Detection Algorithm
Let’s first look at the schema. In Neo4j Browser, enter CALL db.schema.visualization():
Statistics can be quickly retrieved with CALL apoc.meta.stats(). Clearly, the dataset downloaded from Yelp on 03.05.2020 (MD5 for yelp_dataset.tar: 7610af013edf610706021697190dab15) differs from the one used in the book. For example, the book’s dataset contained more Country nodes than the downloaded one.
We are going to compute PageRank on User nodes. One thing worth noticing is that the PageRank implementation in Neo4j’s Graph Data Science Library respects relationship direction: for example, given (a:Site)-[:LINK_TO]->(b:Site), a is linked to b, but it is not necessarily the case that b is also linked to a.
In Yelp’s case, when a User A is a friend of User B, they are automatically friends of each other. By design, this friendship is reflected by one single relationship (A:User)-[:FRIENDS]->(B:User), in order to simplify the model and save resources. Later, one can easily query all friends of a given user by disregarding the direction, e.g. MATCH (n:User {name: 'Alice'})-[:FRIENDS]-(m:User) RETURN n, COUNT(m).
Needless to say, we need to project the original graph into a new one that contains only User nodes; wherever there is a FRIENDS relationship between a pair of User nodes, we generate two FRIEND_OF relationships in opposite directions. As an example, projecting that into a named graph:
CALL gds.graph.create(
  'my-graph',
  {
    User: {
      label: 'User'
    }
  },
  {
    FRIEND_OF: {
      type: 'FRIENDS',
      orientation: 'UNDIRECTED'
    }
  }
)
Before moving on, we take some time to discuss the PageRank formula. The original one is:

PR(u) = (1 - d) / N + d * Σ_{v ∈ In(u)} PR(v) / |Out(v)|

But the formula used by the PageRank implementation in Neo4j’s Graph Data Science Library is:

PR(u) = (1 - d) + d * Σ_{v ∈ In(u)} PR(v) / |Out(v)|

where d is the damping factor, N the total number of nodes, In(u) the set of nodes linking to u, and |Out(v)| the number of out-links of v. As a result, the side effect, according to Wikipedia, is: “The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N.” I prefer the second version because, in my opinion, manipulating very small numbers needs extra attention regarding accuracy. As shown in Appendix [2], PageRank scores are stored as Java’s double, which should be decently accurate as long as we do not divide by N (the total number of nodes) in the score computation.
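To make the difference concrete, here is a minimal PageRank power iteration in Python — an illustrative sketch with a made-up three-node ring graph, not the GDS implementation — computing both variants. The first variant’s scores sum to 1, the second’s sum to N:

```python
def pagerank(out_links, damping=0.85, iterations=50, normalized=True):
    """Toy PageRank. normalized=True follows the original formula (scores
    sum to 1); normalized=False follows the GDS-style variant (scores sum
    to N, with a floor of 1 - damping)."""
    nodes = list(out_links)
    n = len(nodes)
    base = (1 - damping) / n if normalized else (1 - damping)
    scores = {v: (1 / n if normalized else 1.0) for v in nodes}
    for _ in range(iterations):
        new = {v: base for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * scores[v] / len(targets)
                for t in targets:
                    new[t] += share
        scores = new
    return scores

# Hypothetical ring graph a -> b -> c -> a (no dangling nodes, so the
# total score is preserved exactly across iterations).
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
sum_to_one = pagerank(graph, normalized=True)
sum_to_n = pagerank(graph, normalized=False)
```

In the unnormalized variant every score here is simply 1.0; the normalized variant yields 1/3 per node, illustrating how tiny the values become as N grows.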
The next step is to run PageRank algorithm on the projected graph. There are some parameters that can be configured to affect how the algorithm computes the scores. Notably:
- dampingFactor: default 0.85
- tolerance: default 0.0000001
- maxIterations: default 20
To my knowledge, a damping factor of 0.85 is generally found to be good. The PageRank implementation is iterative: it keeps computing until maxIterations is reached or the change in the PageRank score of every node is less than tolerance. If we want the computation to converge sooner, increase the tolerance. If we want more stable PageRank scores, increase maxIterations and decrease tolerance; the trade-off is a longer computation time, and we might still risk not reaching stable scores for all nodes.
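The stopping rule described above can be sketched as follows — a simplified Python illustration of the iterate-until-converged idea with hypothetical helper names, not the GDS code:

```python
def run_until_converged(scores, step, max_iterations=20, tolerance=1e-7):
    """Apply `step` repeatedly until every node's score changes by less
    than `tolerance`, or `max_iterations` is exhausted.
    Returns (scores, didConverge)."""
    for _ in range(max_iterations):
        new_scores = step(scores)
        if all(abs(new_scores[v] - scores[v]) < tolerance for v in scores):
            return new_scores, True
        scores = new_scores
    return scores, False

# Example: the update x -> 0.85*x + 0.15 (the shape of a single node's
# GDS-style PageRank update) converges toward 1.0.
final, did_converge = run_until_converged(
    {"a": 0.5},
    lambda s: {"a": 0.85 * s["a"] + 0.15},
    max_iterations=200,
)
```

With the default max_iterations=20 the same call would return didConverge=False, which mirrors the trade-off discussed above.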
As an engineer, I focus on actionable results with adequate effort while maintaining decent quality. That said, let’s keep dampingFactor and tolerance at their defaults, then try maxIterations with a variety of values, e.g. 20, 30, 40. In the computation results, if we see didConverge as true, that would be great and we’ll stop there; if not, we extract the Top 10 highest scores for each maxIterations value in certain categories and examine whether the Top 10 scores jump around slightly or wildly. If slightly, great, we’ll stop there; if not, increase maxIterations. Note: it is assumed that we are only interested in the most influential nodes, e.g. the Top 10, and do not care about the rest.
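The “jump around slightly or wildly” check can be made a bit more mechanical. A hedged sketch, assuming the per-run scores have been exported into plain Python dicts (names and values below are made up):

```python
def top_k(scores, k=10):
    """Names of the k highest-scoring users, best first."""
    return [name for name, _ in
            sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

def top_k_overlap(scores_a, scores_b, k=10):
    """Fraction of the Top-k shared between two runs; 1.0 = identical sets."""
    return len(set(top_k(scores_a, k)) & set(top_k(scores_b, k))) / k

# Hypothetical scores from two maxIterations settings:
run20 = {"alice": 3.2, "bob": 2.9, "carol": 1.1, "dave": 0.9}
run40 = {"alice": 3.4, "bob": 3.0, "carol": 1.2, "dave": 0.8}
```

An overlap close to 1.0 (possibly with adjacent ranks swapped) would correspond to the “jump around slightly” case; a low overlap would suggest increasing maxIterations.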
The Cypher to compute PageRank:
// Use an anonymous projected graph so there is no need to clean up later.
UNWIND [20, 30, 40] AS numIter
CALL gds.pageRank.write({
  nodeProjection: 'User',
  relationshipProjection: {
    FRIEND_OF: {
      type: 'FRIENDS',
      orientation: 'UNDIRECTED'
    }
  },
  writeProperty: 'pageRank' + toString(numIter),
  maxIterations: numIter
})
YIELD ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration
RETURN numIter, ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration
Unfortunately, the flag didConverge returned false for all maxIterations values (link). Let’s examine the Top 10 PageRank scores for the Categories Hostels and Hotels with maxIterations of 20, 30, and 40.
UNWIND ['Hotels', 'Hostels'] AS cat_name
MATCH (c:Category {category_id: cat_name})<-[:IN_CATEGORY]-(:Business)<-[:REVIEWS]-(:Review)<-[:WROTE]-(u:User)
WITH c, u
WITH c.category_id AS category_id,
     apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank20")[..10] AS userWithPageRank20,
     apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank30")[..10] AS userWithPageRank30,
     apoc.coll.sortNodes(COLLECT(DISTINCT u), "pageRank40")[..10] AS userWithPageRank40,
     range(0, 9, 1) AS indexes
UNWIND indexes AS index
RETURN category_id,
       userWithPageRank20[index].name AS name20, userWithPageRank20[index].pageRank20 AS pageRank20,
       userWithPageRank30[index].name AS name30, userWithPageRank30[index].pageRank30 AS pageRank30,
       userWithPageRank40[index].name AS name40, userWithPageRank40[index].pageRank40 AS pageRank40
See Appendix [3] for an issue with apoc.coll.sortNodes(). The result:
It looks like we have a relatively stable result: the Users in each Category do not jump around wildly, and when some switch places, a careful look shows that their PageRank scores are almost equal.
Besides, in the result (link), in the centralityDistribution column, we see p1 near 0.15. That is due to the 1 - dampingFactor term for pages with almost no in-links, i.e. very unpopular pages. In other words, 0.15 is the minimum value of the PageRank score.
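That floor follows directly from the unnormalized update: a node nobody links to receives zero contribution from in-links, so each iteration leaves it at exactly the baseline. A one-line sanity check:

```python
damping = 0.85
incoming = 0.0  # total contribution from in-links: zero for a node nobody links to
# GDS-style update: baseline (1 - damping) plus damped incoming contributions.
score = (1 - damping) + damping * incoming
```

So any node with no in-links sits at 1 - 0.85 = 0.15 forever, which is what the p1 percentile reflects.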
In conclusion, given the dataset, the PageRank scores for User nodes with default dampingFactor and tolerance, and maxIterations of either 30 or 40, are sufficiently good for the Categories Hostels and Hotels. This may not necessarily hold true in other Categories, e.g. Grocery, though.
Appendix
[1] Find out whether the PageRank implementation respects relationship direction.
Suppose an empty Neo4j database that is currently stopped. Download Neo4j’s Graph Data Science source code and add this code snippet into graph-data-science\algo-common\src\main\java\org\neo4j\graphalgo\pagerank\PageRank.java, function initializeSteps():
// degree: the number of out-links for a given node.
var iter = graph.nodeIterator();
while (iter.hasNext()) {
    var node = iter.next();
    var degree = graph.degree(node);
    getProgressLogger().logMessage(String.format("*** node [%s]: has degree [%s]", node, degree));
}
Recompile the library with the command gradlew packaging:shadowJar. Rename the JAR file in graph-data-science\packaging\build\libs\ to gdsLibrary-1.2.1.jar and copy it over the one in the plugins folder of the current Neo4j installation.
Start Neo4j again. In Neo4j Browser, create a simple Graph:
CREATE (a:User {name: "a"}), (b:User {name: "b"}), (c:User {name: "c"}), (a)-[:FRIEND_WITH]->(b), (c)-[:FRIEND_WITH]->(b)
Then run PageRank algorithm on this graph:
CALL gds.pageRank.write({
nodeProjection: 'User',
relationshipProjection: 'FRIEND_WITH',
writeProperty: 'pageRank',
maxIterations: 20
})
YIELD ranIterations, didConverge, createMillis, computeMillis, writeMillis, nodePropertiesWritten, centralityDistribution, configuration
Check the log file in the logs folder of the Neo4j installation (hint: search for ***):
*** node [0]: has degree [1]
*** node [1]: has degree [1]
*** node [2]: has degree [0]
It shows that the PageRank implementation respects relationship direction.
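The same check can be reproduced outside Neo4j; a small Python sketch of what those log lines verify:

```python
# The toy graph from the experiment: (a)-[:FRIEND_WITH]->(b), (c)-[:FRIEND_WITH]->(b).
edges = [("a", "b"), ("c", "b")]
out_degree = {n: sum(1 for src, _ in edges if src == n) for n in ["a", "b", "c"]}
# a and c each have out-degree 1 while b has 0 -- the same multiset of
# degrees [1, 1, 0] the log printed. If the projection had treated the
# relationships as undirected, b would have shown degree 2.
```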
[2] Find out which data type is used to store PageRank scores.
In graph-data-science\algo-common\src\main\java\org\neo4j\graphalgo\pagerank\BaseComputeStep.java, function initialize(), we see this.pageRank, which is declared as double[].
Manipulating very large or very small floating-point numbers might lead to accuracy problems. An alternative is Java’s BigDecimal. However, there is no free lunch: BigDecimal comes with much higher memory consumption, computation time and complexity compared to double. (link)
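A Python analogue of this trade-off (float is the counterpart of Java’s double, decimal.Decimal of BigDecimal): binary floating point cannot represent some decimal fractions exactly, while an arbitrary-precision decimal type can, at extra cost per operation:

```python
from decimal import Decimal

# Summing 0.1 ten times: float (like Java's double) accumulates a tiny
# binary representation error, while Decimal (like Java's BigDecimal)
# stays exact.
float_sum = sum(0.1 for _ in range(10))
decimal_sum = sum(Decimal("0.1") for _ in range(10))
```

float_sum lands a hair below 1.0, while decimal_sum equals exactly 1 — but each Decimal operation is far more expensive, which is why double is the pragmatic choice for millions of PageRank scores.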
[3] Inconsistency between apoc.coll.sort() and apoc.coll.sortNodes():
The first sorts ascendingly while the second sorts descendingly. If this issue is fixed later, one can reverse the collection with apoc.coll.reverse(). Another trick: apoc.coll.sortNodes(..., "^propertyToSort") switches the sorting order. (link)