Exploring Yelp Dataset with Neo4j — Part III: Louvain Community Detection Algorithm

TRAN Ngoc Thach
11 min read · Oct 2, 2020
TABLE OF CONTENTS
- Introduction
- (Parallel) Louvain algorithm
- Experiment
- Evaluation
- Conclusion

Also:
- Exploring Yelp Dataset with Neo4j — Part I: From Raw data to Nodes and Relationships with Python
- Exploring Yelp Dataset with Neo4j — Part II: PageRank

1. Introduction

If we pay attention, we will see that networks are ubiquitous in our lives, from the Internet of Things to entire societies. Modeling these physical entities with Graph Databases and extracting insights from them is increasingly popular. However, I would not dare say that this database type will soon supersede the traditional ones, i.e. Relational Databases, which have been mature, well-developed, and proven for decades. Nevertheless, when it comes to flexibility, e.g. the ever-changing real world, and to complex modeling of highly connected data, Graph Databases shine.

Apart from querying highly connected data more efficiently and effectively, as discussed in this article, Graph Databases usually support advanced algorithms, e.g. PageRank to find the most influential nodes, or Dijkstra's Shortest Path algorithm. More on Neo4j’s algorithms can be found here.

Necessity is the mother of invention. (Plato)

The world is organized into, let’s say, nations; each of those, in turn, is divided into regions; in each region there are states; then cities and districts at more granular levels. As another example, suppose there is a forum about Java: the users’ interests are broken down into a categorical hierarchy, say, Language and Library. Under Language there are Syntax, Data Structure, Design Pattern; under Library, I/O, Network, XML. The question is: how can we objectively decide that this is the only possible group hierarchy, and that we did not miss anything? I think one of the secrets to Google Search Engine’s dominance is that it indexed the Web automatically, at scale, based on objective criteria (e.g. PageRank).

By the same logic, it would be great if, given a huge amount of connected data, one could detect the communities inside it, as well as their hierarchical structures, automatically and in an acceptable time. Louvain Community Detection is a promising candidate that satisfies these requirements. Plus, it is readily available in Neo4j’s Graph Data Science (GDS) library — a must-have companion add-on. Although the algorithm is based on the original paper (“Fast unfolding of communities in large networks”, Vincent D. Blondel et al., 2008), its implementation in Neo4j actually follows another paper (“Parallel Heuristics for Scalable Community Detection”, Hao Lu et al., 2014), which is the parallel version of the Louvain algorithm. We will explore User communities in the Yelp Dataset with Louvain.

2. (Parallel) Louvain algorithm

Although I am not going to explain every detail of those papers, let’s go through a few key take-aways.

The original Louvain paper:

  • Modularity is a criterion, ranging from -1 to 1, used to measure the density of links inside communities as compared to links between communities. It was invented before the Louvain algorithm.
  • Besides serving as a criterion, modularity itself can be used as an objective function to find the best communities. Unfortunately, optimizing it is proven to be NP-complete. Thus, heuristic approaches are needed.
  • Blondel et al. figured out that the modularity gain when a node i is moved into a community C can be computed quickly, thanks to the formula ΔQ = [(Σin + k_i,in)/2m − ((Σtot + k_i)/2m)²] − [Σin/2m − (Σtot/2m)² − (k_i/2m)²], where Σin is the sum of link weights inside C, Σtot the sum of weights of links incident to C, k_i the weighted degree of i, k_i,in the sum of weights of links from i to nodes in C, and m the total link weight of the network. (Note: the proof can be found here.)
  • Among other things, this technique enables the Louvain algorithm to explore large networks efficiently in reasonable amounts of time, e.g. a web graph of 118 million nodes and 1 billion links took 152 minutes. In fact, it “outperforms all other known community detection methods in terms of computation time”.
  • The algorithm consists of two phases, which are repeated iteratively. Put simply, in the first phase, each node starts in its own community. The algorithm then tries to detach a node from its current community, attach it to a neighboring community, and calculate the modularity gain. After doing this for all of its neighboring communities, the node joins the one with the highest gain (and stays there!). All other nodes are treated in the same manner, sweeping over all nodes in multiple rounds until no further improvement can be achieved. Next, in the second phase, the identified communities become super-nodes, and the links between them are aggregated accordingly. Those super-nodes constitute a new network. (A level of the hierarchy is finished.) Then the algorithm starts again on this new network. The community hierarchy is built up during this process.
  • The authors claimed that the algorithm’s complexity is linear on typical, sparse data. The Modularity Resolution Limit is seemingly circumvented thanks to the multi-level approach of the algorithm.
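To make the two-phase description above concrete, here is a minimal Python sketch of phase 1 (the local-moving step) for an unweighted, undirected graph given as a dict of neighbor sets. For clarity, it naively recomputes modularity from scratch for every candidate move; the real algorithm relies on the fast incremental gain formula instead, which is exactly what makes it scale.

```python
def modularity(adj, community, m):
    """Standard modularity: Q = (1/2m) * sum over same-community pairs
    (i, j) of [A_ij - k_i * k_j / (2m)]."""
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / (2.0 * m)
    return q / (2.0 * m)

def louvain_phase1(adj):
    """Phase 1 (local moving): repeatedly move each node to the
    neighboring community with the largest modularity gain, until a
    full pass over all nodes yields no improvement."""
    m = sum(len(neighbors) for neighbors in adj.values()) / 2.0
    community = {v: v for v in adj}  # every node starts in its own community
    improved = True
    while improved:
        improved = False
        for v in adj:
            best_c = community[v]
            best_q = modularity(adj, community, m)
            for c in {community[u] for u in adj[v]}:  # neighboring communities
                old = community[v]
                community[v] = c       # tentatively move v into c
                q = modularity(adj, community, m)
                if q > best_q:
                    best_c, best_q = c, q
                community[v] = old     # undo the tentative move
            if best_c != community[v]:
                community[v] = best_c  # commit the best move
                improved = True
    return community
```

On a toy graph of two triangles joined by a single edge, phase 1 recovers the two triangles as communities; phase 2 would then collapse each community into a super-node and repeat the whole procedure on the smaller network.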

The parallel version of Louvain paper:

  • It is noticeable that the original Louvain algorithm is sequential in nature (nodes are processed one by one). The advantage is that, at any time, one has the latest information over the entire network and can act accordingly. This is true for any sequential computation, but not necessarily for parallel ones.
  • With the coming era of multi-core and distributed computing, one can potentially speed things up by parallelizing. However, parallelism may introduce a new set of problems, such as consistency. In the article, the authors identified a few: the modularity gain can be better than in the original (sequential) Louvain algorithm, but it can also be negative! A negative gain could, in the worst theoretical case, prevent the algorithm from terminating, or at least slow down convergence, thus requiring more iterations. Another problem is “swap and local maxima”, which can be overcome by the suggested label heuristics. The authors also leveraged K-1 Coloring (paper, implementation) so that no two adjacent nodes are processed concurrently, mitigating consistency problems. In other words, each core takes care of a different (ideally far-away) part of the network, so the cores cannot interfere with each other.
  • The authors also merge single-degree nodes into their (only) neighbor’s community, thus reducing computation. This heuristic was suggested by Blondel et al. in their Further Work.
  • Generally, the parallel version of Louvain is substantially faster: in one case, 16x faster using 32 threads, compared to the sequential version.
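The coloring idea can be sketched in a few lines of Python. This is only a toy greedy distance-1 coloring, not the actual K-1 Coloring implementation used by Neo4j; the point is that, once colored, all nodes sharing a color form an independent set and can safely be moved in parallel.

```python
def greedy_coloring(adj):
    """Greedy distance-1 coloring: every node receives the smallest
    color not already used by one of its neighbors, so no two adjacent
    nodes ever share a color."""
    colors = {}
    for v in adj:
        taken = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in taken:
            c += 1
        colors[v] = c
    return colors
```

Nodes of color 0 can then be processed concurrently as one batch, then nodes of color 1, and so on; within a batch, no thread ever touches a neighbor of another thread's node.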

3. Experiment

Let’s revisit the Yelp Dataset in my former article.

User nodes are the main focus; thus, in order to detect the communities in the User network, we project it before feeding it into the procedure gds.louvain:

CALL gds.graph.create(
  'usersGraph',
  'User',
  {
    FRIEND_OF: {
      type: 'FRIENDS',
      orientation: 'UNDIRECTED'
    }
  }
)

However, as a rule of thumb, one should check how connected a graph is. What if there are islands of nodes, separate from each other with absolutely no connection between them? What if there are standalone nodes, not connected to anyone? Let’s do some pre-processing exploration! In Neo4j’s GDS, these islands, namely Components, can be detected with the Weakly Connected Components (WCC) algorithm, as follows.

CALL gds.wcc.write(
  'usersGraph',
  {
    writeProperty: 'componentId'
  }
)

Each User node now has a componentId field, indicating which Component it belongs to. Group them by componentId and calculate the sizes:

MATCH (n:User)
RETURN n.componentId, COUNT(*) AS num
ORDER BY num DESC
LIMIT 100

Result:

+---------------+--------+
| n.componentId | num    |
+---------------+--------+
| 0             | 930882 |
| 101816        | 7      |
| 60948         | 5      |
| 172525        | 5      |
| ...           | ...    |
+---------------+--------+

In our case, the largest island of User nodes is considered; thus, component 0 is chosen. Now, drop the old projection (with gds.graph.drop('usersGraph')) and recreate it as a Cypher projection:

CALL gds.graph.create.cypher(
  'usersGraph',
  'MATCH (n:User) WHERE n.componentId = 0 RETURN id(n) AS id',
  'MATCH (n1:User)-[r:FRIENDS]-(n2:User) WHERE n1.componentId = 0 AND n2.componentId = 0 RETURN DISTINCT id(n1) AS source, id(n2) AS target'
)

We will write the results back permanently to the Yelp graph; therefore, gds.louvain.write() is used. Regarding the parameters, there are a few notable ones:

  • maxLevels (default 10): as discussed above, every two-phase round constitutes a new network, which is one Level of the community hierarchy. The higher the Level, the bigger the communities become, which might not be exactly what we want. This parameter specifies the maximum possible Level; the algorithm may return earlier, before reaching it.
  • maxIterations (default 10): within each two-phase round, the first phase keeps repeating until either the improvement in Modularity is negligible (see tolerance below) or this number of iterations is reached.
  • tolerance (default 0.0001): see above. The smaller the tolerance, the more Modularity can be squeezed out. (Modularity: the higher, the better; but more iterations may be needed.)
  • includeIntermediateCommunities (default false): typically, we only want the final communities; but if interested, the intermediate community structures, a.k.a. the community hierarchy, can be kept. For example, we may want the final communities to be nations, while the cities in each nation and the districts in each city are also desired.
  • concurrency (default 4): The number of concurrent threads used for running the algorithm.

The tolerance parameter is worth experimenting with, e.g. trying a series of values from 0.01 down to 0.0000000001. Based on the results, we can then reason about how it relates to the actual ranLevels, communityCount, and final modularity.

// Run the same query for the other values of tolerance.
CALL gds.louvain.write(
  'usersGraph',
  {
    tolerance: 0.01,
    writeProperty: 'communityId'
  }
)

Statistics:

+--------------+-----------+----------------+------------+
| tolerance    | ranLevels | communityCount | modularity |
+--------------+-----------+----------------+------------+
| 0.01         | 2         | 2805           | 0.62454623 |
| 0.001        | 3         | 505            | 0.62544985 |
| 0.0001       | 3         | 434            | 0.62679003 |
| 0.00001      | 3         | 554            | 0.62170562 |
| 0.000001     | 4         | 623            | 0.62755465 |
| 0.0000001    | 5         | 609            | 0.61980147 |
| 0.00000001   | 5         | 224            | 0.62771956 |
| 0.000000001  | 4         | 317            | 0.62822139 |
| 0.0000000001 | 4         | 206            | 0.62845032 |
+--------------+-----------+----------------+------------+

Note that by default the concurrency is 4. As stated in the parallel Louvain paper, due to the K-1 Coloring heuristic, variations in the results are expected, though negligible. If the concurrency is set to 1, I found that the result is, understandably, stable. In the table above, the result parameters (ranLevels, communityCount, modularity) flip-flop as the tolerance decreases, but the general trend is: fewer final communities are detected (communityCount), the community hierarchy gets deeper (ranLevels), and more modularity is squeezed out, as expected. The quick take-away: if you want a high-quality partitioning of the network, decrease the tolerance, but expect more computation time!

4. Evaluation

We choose tolerance 0.0000000001, which yields the highest Modularity. Let’s count the number of nodes in each community.

MATCH (n:User)
WHERE EXISTS(n.communityId)
RETURN n.communityId AS communityId, COUNT(*) AS size
ORDER BY size DESC

Result:

+-------------+--------+
| communityId | size   |
+-------------+--------+
| 765139      | 181228 |
| 242178      | 134309 |
| 603835      | 130053 |
| 449199      | 126311 |
| 227334      | 73328  |
| 144302      | 67628  |
| 852953      | 59848  |
| ...         | ...    |
+-------------+--------+

Its distribution:

More than 90% of the communities are small (fewer than 1,000 users). If I were an advertiser, to maximize the impact on potential customers, I would choose the largest communities for the advertising campaign. For example, given the largest community above, I can run PageRank against it to find its most influential User, then ask this user to write reviews for a business, which hopefully draws attention from many other Users in the same community.

Another sensible question arises. The word community itself implies there is something special that glues its members together (i.e. they talk to each other much more than to anyone outside), for instance, a community of football fans. To find out, let’s see which cities the members of this community write Reviews about:

MATCH (n:User)
WHERE n.communityId = 765139
WITH n
MATCH (n)-[:WROTE]->(:Review)-[:REVIEWS]->(:Business)-[:IN_CITY]->(c:City)
RETURN c.name, COUNT(*) AS total_reviews
ORDER BY total_reviews DESC

Result:

+-----------------+---------------+
| c.name          | total_reviews |
+-----------------+---------------+
| Paradise Valley | 118986        |
| Phoenix         | 112476        |
| Gilbert         | 68320         |
| Scottsdale      | 62291         |
| Glendale        | 52412         |
| Guadalupe       | 46955         |
| Peoria          | 36942         |
| Tempe           | 36565         |
| Chandler        | 33517         |
| Tempe Junction  | 31800         |
| ...             | ...           |
+-----------------+---------------+

Marking the top 20 cities on a map, we get:

Most of them are in the western United States, specifically around Phoenix.

Unfortunately, the Yelp Dataset does not expose the location of Users, likely due to privacy concerns. By the same logic, we could also figure out which Business Categories the largest community is interested in. One can argue that without Community Detection, we could just run PageRank on the whole graph and spend the advertising money on the top most influential Users. Yes, that’s one way! But the notion of community signifies that its members interact intensively with each other; one member’s act can greatly inspire others. Community also allows us to focus better, e.g. on a community of football fans, so that we can target advertising more effectively.

Models are only as good as their assumptions.

Our approach so far carries some caveats. The Users are connected by means of the FRIENDS relationship on Yelp. They may be friends, but they do not necessarily follow each other’s posts, or even interact. In addition, the dynamics of time are not captured in this modeling: over time, a community may gain more and more members, but their activity is not guaranteed to stay high-spirited (a dying community?), which an advertiser might want to avoid. All this is due to the limitations of the given dataset. If the dataset captured who views what, who sends messages to whom, and who “likes” which review at what time, we could factor in such criteria to measure how “lively” a community is.

5. Conclusion

Community Detection algorithms, of which Louvain is one, are precious tools for Data Analytics. The Louvain algorithm is well known for its extremely fast partitioning of large networks into communities, as well as for building a community hierarchy along the way. It is already fast, but it is sequential; the parallel version speeds up the computation substantially at the expense of more resource usage. No problem, as generally “time is precious while computing resources are cheap”.

In my opinion, at the end of the day, tools can be useful, but what determines success is the quality and quantity of the data, as well as the modeling methodology chosen to achieve certain goals. Imagining the possibilities of the Data Economy of the future keeps me awake at night, intrigued.
