Exploring Yelp Dataset with Neo4j — Part I: From Raw data to Nodes and Relationships with Python

TRAN Ngoc Thach
8 min read · May 11, 2020
TABLE OF CONTENTS
Data Understanding
Goals
Data Preparation for Import Tool
Ingestion automation from A-Z with Python
Install Neo4j Database service with Neo4j Desktop
Conclusion
Appendix — Developing Environment

Also:
- Exploring Yelp Dataset with Neo4j — Part II: PageRank
- Exploring Yelp Dataset with Neo4j — Part III: Louvain Community Detection Algorithm

Inspired by the book Graph Algorithms: Practical Examples in Apache Spark & Neo4j, specifically Chapter 7, Graph Algorithms in Practice, this article is a step-by-step guide to transforming the raw data provided by Yelp into a Graph in Neo4j. Once the data is in Neo4j, readers can move on to experimenting with the Graph Algorithms mentioned in that Chapter.

Why Graph? Arguably, it is among the best tools for modeling the real world, especially the human world, where connectedness reflects social interactions. Case in point: the Panama Papers investigation by ICIJ (International Consortium of Investigative Journalists) relied on Neo4j to analyze the highly connected networks of offshore tax structures. There are many other Graph Technologies, but according to DB-Engines, Neo4j is the most common and well-known; so whenever we run into difficulties, we should be able to find a solution quickly on Google or through Neo4j’s Community. Let’s get started!

Updated on 28.05.2020: Support Neo4j v4.0.4 and py2neo v5.0b1.

Data Understanding

Yelp publishes a subset of its Database, which is not fully up-to-date but is good enough for learning purposes. For example, as of 03.05.2020, the latest review dates back to 2019-12-13 15:51:19, and the latest timestamp in user is 2019-12-13 15:46:07. There is no such information in business. Although one could guess the approximate date a business first opened from the date field in the tip and checkin data, we are not interested in that date here, so we can safely ignore this missing piece of information.

The dataset consists of several json files:

  • yelp_academic_dataset_business.json: 145.8 MB
  • yelp_academic_dataset_checkin.json: 428.8 MB (ignored)
  • yelp_academic_dataset_review.json: 5.8 GB
  • yelp_academic_dataset_tip.json: 251.2 MB (ignored)
  • yelp_academic_dataset_user.json: 3 GB
  • photos.json + photos folder: 6.8 GB (ignored)

…whose names indicate their content. Each line of a file is a JSON string representing a single JSON object. We will ignore the checkin, tip and photos data.
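
Because every line is a standalone JSON object, the files can be read one record at a time instead of being loaded whole. The following is a minimal sketch of that idea, not the Notebook's actual code; the function name iter_json_lines is an assumption made for illustration.

```python
import json
from typing import Iterator


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file.

    Hypothetical helper: only one line is held in memory at a time,
    which matters for the multi-gigabyte review and user files.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, if any
                yield json.loads(line)
```

For example, `for business in iter_json_lines("data/yelp_academic_dataset_business.json"): ...` would walk through the business file record by record, assuming the file sits in a data folder.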

Although Yelp helps us understand the data, there are some issues regarding its quality and our expectations:

  • Ids are only unique within their own domain. E.g. in business, every business_id is unique; in user, every user_id is unique. However, the business_id oiAlXZPIFm2nBCt0DHLu_Q also occurs as a user_id. This is a potential problem because, when creating relationships in Neo4j, the START_ID and END_ID must each identify exactly one node.
  • In user, each user may have friends; but not all friends exist, e.g. Unknown friend oeMvJh94PiGQnx_6GlndPQ for User ntlvfPzc8eglqvk92iDIAw, which leads to the problem of relationships with non-existent nodes in Neo4j.
  • Some businesses contain duplicate categories, e.g. business_id HEGy1__jKyMhkhXRW3O1ZQ lists Gas Stations twice. This may end up as 2 separate relationships in Neo4j.
  • In user, the friends field is supposed to be an array of strings, but it turns out to be a single string of friend_ids joined by commas. The categories field in business has the same problem.
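
The four issues above can each be handled by a small, pure function. The sketch below is one possible approach under stated assumptions, not necessarily what the Notebook does: all helper names are hypothetical, and prefixing ids with their domain is only one way to avoid collisions (the Import Tool's ID spaces in the header, e.g. user_id:ID(User), would be another).

```python
from typing import Iterable, List, Optional, Set


def split_csv_field(raw: Optional[str]) -> List[str]:
    """Turn 'a, b, c' into ['a', 'b', 'c']; tolerate None and empty strings."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


def dedupe_keep_order(items: Iterable[str]) -> List[str]:
    """Drop repeated values while preserving first-seen order."""
    return list(dict.fromkeys(items))


def namespaced_id(prefix: str, raw_id: str) -> str:
    """Prefix an id with its domain so business and user ids cannot collide.

    The 'prefix-id' scheme is an assumption made for this illustration.
    """
    return f"{prefix}-{raw_id}"


def known_friends(friend_ids: Iterable[str], known_user_ids: Set[str]) -> List[str]:
    """Keep only friends that exist as users, so no relationship dangles."""
    return [friend for friend in friend_ids if friend in known_user_ids]
```

A business record's categories could then be cleaned with `dedupe_keep_order(split_csv_field(business["categories"]))`, and a user's friends with `known_friends(split_csv_field(user["friends"]), all_user_ids)`, where all_user_ids is a set collected in a first pass over the user file.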

All of these Data problems must be addressed before importing into Neo4j. Although Neo4j’s Import Tool has a flag to ignore “bad relationships”, we prefer to have “clean” data beforehand. This way, we keep active control over data quality.

Minor point: In user data, there are “useful”, “funny” and “cool” fields, which are votes that the user receives from others for their reviews. A list of “compliment_*” fields, e.g. “compliment_cute”, denotes the data captured as shown here. Nevertheless, we intentionally ignore such fields.

At this point, readers may wonder why we have made a variety of choices about the data, e.g. ignoring tip, checkin and photos data…

Goals

Given the data, we will reproduce the below Graph, similar to the one described in Chapter 7 of the book.

With the Graph, we can play around with Neo4j’s built-in Graph Algorithms, such as PageRank and Louvain Community Detection.

Data Preparation for Import Tool

In general, the LOAD CSV command is recommended for ingesting fewer than 10 million records. Given our data and schema, there are 26,645,898 relationships and 10,201,067 nodes, so the alternative method, the Import Tool, is preferable. Its secret to high-speed data ingestion is that it skips the transaction layer and builds the actual store files of the Neo4j Database directly. However, the Database must be empty and offline.

To make this ingestion scheme possible, the Import Tool expects CSV files in a specific format:

  • The list of nodes csv-files and relationships csv-files, e.g. user_nodes.csv, review_nodes.csv, relationships.csv.
  • The expected headers are, for example:
Nodes csv-file header
Relationships csv-file header
  • In nodes csv-file, one line means one to-be-created node. In relationship csv-file, one line means one to-be-created relationship.
  • The user_id:ID will end up as a property (user_id) of the nodes, but the :ID denotation will signify that the field is used to identify that node when making the related relationships.
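
To make the header convention concrete, here is a small sketch that writes a nodes file and a relationships file with Python's csv module. The :LABEL, :START_ID, :END_ID and :TYPE fields are header fields the Import Tool recognises; the name column, the label and relationship type values, and the helper name write_csv are illustrative assumptions, not the exact schema used in the Notebook.

```python
import csv
from typing import Iterable, List, Sequence

# Illustrative headers; only user_id:ID is taken from the text above.
USER_HEADER: List[str] = ["user_id:ID", "name", ":LABEL"]
FRIENDS_HEADER: List[str] = [":START_ID", ":END_ID", ":TYPE"]


def write_csv(path: str, header: Sequence[str], rows: Iterable[Sequence[str]]) -> None:
    """Write a header line followed by one line per node or relationship."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)  # rows may be a generator, so nothing is buffered
```

Calling `write_csv("user_nodes.csv", USER_HEADER, [["u1", "Alice", "User"]])` would produce a file whose first line is the header and whose second line describes one to-be-created node.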

Ingestion automation from A-Z with Python

A Jupyter Notebook (GitHub) automates the entire process: reading the raw data, fixing the identified Data problems, resetting the current Database, starting/stopping the Database service, invoking the Import Tool, and verifying that the ingestion succeeded.
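
As a rough illustration of the "invoking the Import Tool" step, the sketch below builds a neo4j-admin import command line and hands it to subprocess. It assumes the Neo4j 4.0 command syntax (import with --database, --nodes and --relationships options); the function names and the file names passed in are hypothetical, and the Notebook itself may assemble the call differently.

```python
import subprocess
from typing import List


def build_import_command(admin_tool: str, database: str,
                         node_files: List[str],
                         relationship_files: List[str]) -> List[str]:
    """Return the argument list for a bulk import; nothing is executed here."""
    command = [admin_tool, "import", f"--database={database}"]
    command += [f"--nodes={path}" for path in node_files]
    command += [f"--relationships={path}" for path in relationship_files]
    return command


def run_import(command: List[str]) -> None:
    """Run the import; the Database must be stopped and empty beforehand."""
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the exact command before running it against a real Database.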

However, please note that there are a few assumptions: the raw data from Yelp must be placed in a data folder relative to the Notebook, and the url/port and login credentials must match those of the Database service (the Notebook uses py2neo to communicate with it). The Neo4j version, as of now, is v4.0.4. Important Note: The Database must be a Service. See more info in the next section.

Initially, the script was written for speed: read everything into memory, transform the data there, then save the results to disk all at once. But this approach exhausted the memory (my computer: Core i7, 16 GB RAM, HDD), causing the OS to use the Hard Disk as additional RAM, which in turn slowed down the whole system and froze everything. To overcome this limitation, the Notebook was refactored to reduce the memory footprint as much as possible at the expense of a longer running time. For example,

  • Wrapping functionalities inside functions so that the allocated local variables are automatically released when they are out of scope at the moment the functions end.
  • Using selective del statements when it is certain that the variables will not be used again.
  • Saving the pieces of data onto Disk right away if there is no need for further transformation or later use.
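
The third point in particular can be sketched as a streaming rewrite: each record is read, transformed and written back to disk before the next one is touched, so memory use stays flat regardless of file size. This is a minimal illustration with a hypothetical function name, not the Notebook's actual implementation.

```python
import json
from typing import Callable, Optional


def stream_transform(src_path: str, dst_path: str,
                     transform: Callable[[dict], Optional[dict]]) -> int:
    """Rewrite a JSON Lines file record by record; return the count written."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = transform(json.loads(line))
            if record is not None:  # returning None means "drop this record"
                dst.write(json.dumps(record) + "\n")
                written += 1
    return written
```

The trade-off is the one described above: the data is read from disk again for every later step, which costs running time but avoids swapping.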

The Notebook is designed so that “Re-Run All” is safe at any step of the process. For instance, it checks whether the fixed_* raw files are available; if not, it produces them. It then checks whether the csv files for the Import Tool exist; if not, it re-creates them.
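
This "produce only if missing" behaviour can be sketched as a small guard around each step. The helper name ensure_file is hypothetical, and writing to a temporary file before renaming is an extra precaution added in this sketch (so an interrupted run does not leave a half-written file that looks complete), not something the Notebook is known to do.

```python
import os
from typing import Callable


def ensure_file(path: str, producer: Callable[[str], None]) -> bool:
    """Run producer only if path is missing; return True if it ran."""
    if os.path.exists(path):
        return False  # already produced by an earlier run
    tmp_path = path + ".tmp"
    producer(tmp_path)          # the producer writes to the temporary path
    os.replace(tmp_path, path)  # rename only once the output is complete
    return True
```

Each step of the pipeline then becomes a call such as `ensure_file("data/fixed_business.json", make_fixed_business)`, where both the file name and the producer function are placeholders for illustration.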

In brief, we start with the raw data from Yelp (specifically 3 files: yelp_academic_dataset_business.json, yelp_academic_dataset_review.json, yelp_academic_dataset_user.json); we will then end up with a nice Graph Database in Neo4j.

Install Neo4j Database service with Neo4j Desktop

Typically, for a Developer who wants to try out Neo4j, the Neo4j Desktop edition is the right choice. Other editions exist as well, e.g. Enterprise Server or Community Server.

After creating a Database inside Neo4j Desktop, its physical location can be pinpointed through:

Within this installation folder, one can find a BAT file allowing us to install this Database as an OS Service. Being a Service enables the Database to automatically run whenever the computer starts. In addition, it can operate without relying on a specific user.

Neo4j Tool — Install Database as an OS Service
Database as an OS Service

A simple way to start the just-installed Service is .\bin\neo4j.bat start. Now we’re all set to connect to that Database instance over Bolt, HTTP or HTTPS. Important note: Sufficient privileges are needed to start/stop/install Services, e.g. by being an Admin User (a UAC dialog may appear, asking for permission).
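
The same neo4j.bat script can be called from Python, which is one way to start and stop the Service programmatically. The sketch below is an assumption about how such a wrapper could look; the function names and the neo4j_home parameter are hypothetical, and the process running it needs the privileges mentioned above.

```python
import subprocess
from pathlib import Path
from typing import List

ALLOWED_ACTIONS = ("start", "stop")


def service_command(neo4j_home: str, action: str) -> List[str]:
    """Build the neo4j.bat call for the given installation folder."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return [str(Path(neo4j_home) / "bin" / "neo4j.bat"), action]


def control_service(neo4j_home: str, action: str) -> None:
    """Start or stop the Service; raises CalledProcessError on failure."""
    subprocess.run(service_command(neo4j_home, action), check=True)
```

With this in place, a script could call `control_service(home, "stop")` before an import and `control_service(home, "start")` afterwards, where home is the Database's installation folder.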

The reason why we need the Database as an OS Service, rather than pressing the Start button (picture below), is that, to my knowledge, there is no other way to start/stop the database programmatically. I dug into the documentation of py2neo and neo4j-driver (Python) and found no relevant API for such functionality. This may make sense, because maintaining Database Services is not supposed to be done remotely through a Driver, for safety reasons. Feel free to correct me!

If the Start button is pressed, a Database process is started that keeps running only while the current user is logged in, as indicated in Process Explorer. As soon as this user logs off, the process is terminated.

Database Service if pressing the Start button in Neo4j Desktop (not recommended)
Database Service if starting as Windows Service (recommended)

Conclusion

We have explored Yelp’s Dataset, defined the goals to be achieved, identified the relevant Data problems, and written a Python script that does what needs to be done automatically.

The Neo4j Graph Database, built from Yelp’s Raw Dataset, is now readily available for further investigation and learning purposes.

Enjoy learning Graph Algorithms and diving deep into Yelp’s Dataset!

Appendix — Developing Environment

  • Windows 8.1 x64.
  • Anaconda3-2020.02 x64 (Python v3.7.6).
  • Neo4j v4.0.4 (JDK v11.0.7 LTS x64), with dbms.memory.heap.max_size 4GB.
  • Yelp Dataset (03.05.2020). MD5 (yelp_dataset.tar): 7610af013edf610706021697190dab15.
  • The Neo4j Database should be a Windows Service. Neo4j Desktop and the Neo4j Python Driver (py2neo) connect to it through the url bolt://localhost:7687 with username/password neo4j/12345.
  • On Windows, users need sufficient privileges to start/stop/install Services. A UAC dialog may appear, asking for permission.
  • Additional 3rd party Python libraries: neo4j-driver, py2neo (v5.0b1), regex, reverse_geocoder.
