A glimpse of Graph Algorithms. What is practically possible?
Making sense of Connected Data is increasingly crucial in today world. An typical example is Social Network, where millions of users are connected in different ways, e.g. a friendship, a like, a view, a comment. Another highly relevant usecase at present is how China consolidated a huge amount of diverse data sources to have an unprecedented view on what was going on in the outbreak, thus effectively detecting, tracking and isolating the spread of Covid-19 epidemic. We are truly living in a Connected World! A hundred years ago, during Spanish flu epidemic (1918), nothing much could be done, beside crossing one’s finger and bracing for the worst. A hundred years later (2020), our hope in defeating diseases is not only coming from medicine and healthcare advances, but also from technologies: Big Data, AI and Graph.
Our intention is clear: model the world as a Graph. The remaining issue is which technology to be used in realizing this idea. Relational databases (RDBMS), e.g. SQL Server, came up first in my mind because they are so familiar. Indeed, RDBMSs are mature, well-documented, well-optimized, proven even in Banking industry. Recently, I was involved in a project, in which we visualized the connections of entities, e.g object A depends on object B, object B on object C; objects A and C managed by actor X. Some queries were “What are the longest Dependency Paths?”, “Who are the most influential Actors?”. I soon realized RDBMS with SQL queries would struggle in tackling those types of queries, which involved complex JOINs, let alone organizing and storing the ever-changing data. Worse, who will later maintain the lengthy, complex SQL scripts? An interesting example is showed here.
However, I can’t stress enough that there is no one-size-fit-all database technology. For storing digital documents? Check Document-oriented databases. For storing key-value pairs? Check Key-value databases. For storing and extracting insights of Connected Data? Graph Database is ideal. I heard of Neo4j a few times from my peers, and a quick search on the Internet confirmed my impression of Neo4j as the most popular choice for Graph Database (as of 2020). One may wonder what Graph Technology is used at Facebook? Being frustrated with traditional relational databases, in 2009, Facebook engineers created a custom Graph Database: TAO (Associations and Objects). Unfortunately, it is not publicly available as a standalone/ downloadable application. Another promising Graph-related technology is Apache Spark GraphX. However, to my knowledge, this is more of a Graph Processing Engine, rather than a storage. A screnario where GraphX can be useful is extracting insights, with Graph approach in mind, from very large data, without the need for transaction, modification of the original data, or real-time processing.
So far, my experience with Neo4j has inspired me in the following aspects:
- Graph can better reflect real-life connected data. No more normalization, or breaking data into tables then later assuming their connection through Primary Key/Foreign Key.
- Its Property Graph Model enables fast-changing data model, allowing business to be more flexible and agile.
- Its querying language, Cypher, is intuitive, expressive, concise and joyful to write. Complicated long SQL scripts now frighten me, really!
- It’s production-grade and enterprise-ready, especially from version 4, released in 01.2020. ACID, Scaling, Sharding, Security,… to name a few.
The technology is ready. Knowing what, why, how to take advantage of it is the next step. Disclaimer: I am for sure not at the forefront of Graph. So let’s focus on common and practical applications. Also be mindful that my finding is in no way an exhausted list.
To facilitate my work as well as to enrich our offerings, I did a survey on what is technically and practically feasible, regarding the Graph applications. Consider this as a glimpse of what Graph can empower us to achieve:
At first, Neo4j allowed me to model the data in a way that better reflects the real-world. Check whiteboard-friendly concept. It then enabled to query the data intuitively and concisely, thanks to Cypher, without being bogged down with complex JOINs like in SQL. After that, one can step up to advanced queries, e.g. “What is the most influential entity?” (answerable with PageRank), “What are the densely connected group?” (answerable with Louvain Community Detection), “Where are the brokers, aka. bridges?” (answerable with Betweenness Centrality). Is that all? Not yet. In Machine Learning, specifically Feature Engineering, if we think how an entity is connected influences its behaviors, e.g. a change in its classfication, then we can query the graph and make it a feature, e.g. one more column named “PageRank”. It is called Graphy Feature. Furthermore, Graph can also be useful in Feature Selection by modeling each feature as a node in graph.
From my experience, capturing the right pieces of data is often an incremental and continuous process. That said, the (needed) raw data is not always in one place for us to consume; the (needed) relationships might not be always straightforward. For instance, in an Excel file, one can extract “Last Modified By” and “Authors” fields. That File as well as the Users can be easily made as nodes in a graph. But we may miss something: people who view the File. Those people cannot be saved within the File itself; one must get back to the Server, logging who downloaded or viewed the file inside the browser. Then, consolidate the Excel files and the log information into one source of truth. In another hand, from above, with MODIFY_FILE and VIEW_FILE relationships, we cannot run PageRank over the graph, which requires a single type of node and a single type of relationship. Thus, we can generate FOLLOW relationship between user A, who views file X, and user B, who modifies file X. We can imagine: one type of node (User), one type of relationship (FOLLOW), then PageRank is applicable on such subgraph.
Note: I used Mindmup to render the Graph above. In “Examples” nodes, there is usually the original paper (attached for each child node), detailing the idea. Unfortunately, they cannot be exported as a whole with the Graph into PNG. So I exported them separately as followed:
BTW, if Neo4j is interesting to you and you want to expand your knowledge in the right direction, Neo4j Certified Professional may be the ideal short-term goal. I got it in 02.2020.
Resources
[1] Book: Graph Databases — New Opportunities For Connected Data (Ian Robinson, Jim Webber, Emil Eifrem, 06–2015)
[2] Book: Graph Algorithms — Practical Examples in Apache Spark & Neo4J (Mark Needham, Amy E. Hodler, 05–2019)
[3] Book: Neo4j Graph Data Modeling — Design efficient and flexible databases by optimizing the power of Neo4j (Mahesh Lal, 07.2015)
[4] YouTube Series: Intro to Graph Databases Series (by Neo4j): https://www.youtube.com/watch?v=5Tl8WcaqZoc&list=PL9Hl4pk2FsvWM9GWaguRhlCQ-pa-ERd4U
[5] YouTube Series: Introduction to SNA (by Leonid Zhukov): https://www.youtube.com/watch?v=wwam5UZO7os&list=PLriUvS7Iljvn0GYwsGSRA8PWSE9eEiEoE