
Knowledge Graphs for Enhanced Machine Reasoning at Forge.AI

Introduction

Natural Language Understanding at an industrial scale requires an efficient, high quality knowledge graph for tasks such as entity resolution and reasoning. Without the ability to reason about information semantically, natural language understanding systems are only capable of shallow understanding. As the requirements of machine reasoning and machine learning tasks become more complex, more advanced knowledge graphs are required. Indeed, it has been previously observed that knowledge graphs are capable of producing impressive results when used to augment and accelerate machine reasoning tasks at small scales, but struggle at large scale due to a mix of data integrity and performance issues. Solving this problem and enabling machine driven semantic reasoning at scale is one of the foundational technological challenges that we are addressing at Forge.AI.

To understand the complexity of this task, it’s necessary to define what a knowledge graph is. There are many academic definitions floating around, but most are replete with jargon and impenetrable. Put simply, a knowledge graph is a graph in which each vertex represents an entity and each directed edge represents a relationship between entities. Entities are typically proper nouns and concepts (e.g. Apple and Company, respectively), and edges represent verbs (e.g. Is A). Together, these form large networks that encode semantic information. For example, the fact that “Apple is a Company” is encoded in the knowledge graph by storing two vertices, one for “Apple” and one for “Company”, with a directed edge of type “isA” originating at Apple and pointing to Company. This is visualized in Figure 1:

Figure 1: Visualized simple knowledge graph representing the fact that "Apple is a Company"

A knowledge graph encodes many facts, each through the use of a directed edge. Each vertex can have many facts connected to it, making this ultimately a directed multigraph. This type of representation provides an intuitive way to reason about queries. For example, from the knowledge graph represented in Figure 1 we can reason about the question “Is apple a company?” by simply walking through the graph, starting at “Apple” and walking to “Company”, testing edges and concepts along the way. In production, knowledge graphs tend to be quite large and complex with millions or billions of edges. Such a large amount of knowledge allows us to use these graphs to easily reason about semantic connections for tasks such as enriching business relevant data and resolving entities. At Forge.AI, we perform these tasks as part of our NLP / NLU pipeline for extracting individual events from unstructured text into a machine-readable format.
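To make the representation concrete, here is a minimal sketch of the idea; the class and the entity and relation names are illustrative, not our production code. Facts are stored as directed, typed edges, and a simple isA query is answered by walking the graph:

from collections import defaultdict

class KnowledgeGraph:
    """Toy directed multigraph: each fact is a (subject, relation, object) edge."""
    def __init__(self):
        self.edges = defaultdict(set)  # subject -> {(relation, object), ...}

    def add_fact(self, subject, relation, obj):
        self.edges[subject].add((relation, obj))

    def is_a(self, entity, concept):
        # Walk outgoing isA edges, following chains like Apple -> Company -> Organization.
        frontier, seen = [entity], set()
        while frontier:
            current = frontier.pop()
            if current == concept:
                return True
            if current in seen:
                continue
            seen.add(current)
            frontier.extend(o for (r, o) in self.edges[current] if r == "isA")
        return False

kg = KnowledgeGraph()
kg.add_fact("Apple", "isA", "Company")
print(kg.is_a("Apple", "Company"))  # True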

With a working definition of a knowledge graph in hand, we will next explore some of the use cases that we’ve found for the knowledge graph here at Forge.AI. Then, we’ll explore the graph infrastructure to understand what powers these use-cases. Finally, we’ll discuss part of our road map to explore what’s next for Forge.AI and its knowledge graph.

Use Cases

It’s worth grounding our conversation of the knowledge graph in a few use-cases before we jump too deeply into a detailed discussion of the infrastructure, how it works, and where we’re going. In general, a knowledge graph can be used for a wide range of applications including entity resolution, dependency analysis, filtering, and machine reasoning. In the ensuing discussion, we will focus on entity disambiguation and dependency analysis, two of the many tasks that we use the knowledge graph for at Forge.AI.

Entity Disambiguation

While simple to state, entity disambiguation is one of the most frequent problems that we need to solve when reasoning about a document. The problem is fairly straightforward to handle when the relevant ontology is known and fully enumerated, but it can quickly become difficult when that is not the case. To explore how we handle this problem with the knowledge graph, let’s consider determining which “Apple” is being referenced in the quote below:

“Though the company doesn’t break out individual unit sales by model, Apple says it sold 77.3 million iPhones — a decrease from the 78.2 million iPhones it sold in the same period in 2017.”

Obviously, this is “Apple” the corporation, not “apple” the type of fruit. How did our brains determine this? We used contextual clues: we know that Apple the corporation sells the iPhone, and a fruit is incapable of selling anything. From these contextual clues alone, we perform the task nearly instantaneously. Machine learning systems do not work the way our brains do, however, and for a machine learning system to reason with context it needs a knowledge graph. The Forge.AI knowledge graph works in the same way: when we seek to disambiguate an entity, we provide the knowledge graph with a set of co-located entities that supply the appropriate context. The knowledge graph then searches for all versions of “Apple” in the full graph and constructs small graphs that include contextual information, as can be seen in Figures 2 and 3. Note that this is a noisy string search that can find versions of the initial search term that differ from the original string or contain it as a substring. We also keep a lookup table of known aliases for each of our entities, where aliases can be things like CIK codes or ticker symbols.

Figure 2: Visualized excerpt from the Knowledge Graph that pertains to the entity Apple the fruit.

Figure 3: Visualized excerpt from the Knowledge Graph that pertains to the entity Apple, the consumer electronics corporation.

With these small graphs in hand, the knowledge graph then uses machine reasoning to determine which of the entities is truly being referenced. There are many strategies for doing this, but we have found that a greedy algorithm that maximizes the overlap between the contextual entities passed in and the small graphs under consideration is effective.
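As a rough illustration of that greedy idea (a toy sketch, not our production implementation), each candidate subgraph can be reduced to the set of entities in its neighborhood and scored by its overlap with the co-located context entities; the highest-scoring candidate wins:

def disambiguate(candidate_subgraphs, context_entities):
    """Pick the candidate entity whose local subgraph overlaps most with the context.

    candidate_subgraphs: dict mapping a candidate entity id (e.g. "Apple (company)",
    "Apple (fruit)") to the set of entities in its neighborhood.
    context_entities: entities co-located with the mention in the document.
    """
    context = set(context_entities)
    best, best_score = None, -1
    for candidate, neighborhood in candidate_subgraphs.items():
        score = len(context & neighborhood)
        if score > best_score:
            best, best_score = candidate, score
    return best

candidates = {
    "Apple (company)": {"iPhone", "Tim Cook", "Cupertino"},
    "Apple (fruit)": {"Orchard", "Fruit", "Cider"},
}
print(disambiguate(candidates, ["iPhone", "Company"]))  # "Apple (company)"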

Dependency Analysis

Another major task that we’ve found the knowledge graph to be useful for is dependency analysis: determining the relationship between two or more entities. This is most useful when attempting to determine whether an extracted event is something that a customer would care about, given their stated interests. To make this concrete, let’s consider the following news story with regard to a customer that is interested in news events relating to Samsung:

“Russia’s Norilsk Nickel has teamed up with Russian Platinum to invest $4.4bn to develop mining projects in Siberia, which contains some of the world’s richest deposits of platinum and palladium. The two companies will form a joint venture to develop projects in the Taimyr Peninsula in Russia’s far north with an aim to become the world’s largest producer of the precious metals, they said Wednesday.”

It’s certainly not obvious to me how this story is connected to Samsung. The question at hand is whether this news event is related to Samsung and, if so, what the nature of that relation is, so we can determine whether or not to pass this event to our customer. We begin by constructing small graphs around each of the entities. With these graphs in hand, we then compute a path between the marked endpoints using Dijkstra’s algorithm. An example of such a path is given in Figure 4.

Figure 4: Visualized excerpt from the Knowledge Graph that pertains to the relationship between the Norilsk platinum group metals mine in Siberia, Russia and Samsung.

What we see in Figure 4 is that the knowledge graph believes that Iridium is a Platinum Group Metal, and that Platinum Group Metals are mined in Norilsk. We also see that the Knowledge Graph believes that Iridium is used in Organic Light Emitting Diodes (or OLEDs), which just happen to be used in Samsung phones. Therefore, this news event is likely relevant to our customer. In fact, this event is highly relevant to our customer’s interest in Samsung because Iridium is incredibly important to the production of OLED screens due to its ability to make a blue LED. Indeed, Samsung has even funded researchers at MIT and Harvard to explore alternatives to Iridium for OLED screens.
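To sketch how the path computation works, here is a hand-rolled Dijkstra over confidence-weighted edges in which an edge’s cost is taken to be -log(confidence), so the “shortest” path is the chain of facts with the highest joint confidence. The entities and confidences below are illustrative, not values from our graph:

import heapq
from math import log

def dijkstra_path(edges, source, target):
    """edges: dict mapping vertex -> list of (neighbor, confidence) pairs.
    Each edge's cost is -log(confidence), so the shortest path corresponds to
    the chain of facts with the highest joint confidence."""
    queue = [(0.0, source, [source])]
    seen = set()
    while queue:
        cost, vertex, path = heapq.heappop(queue)
        if vertex == target:
            return path
        if vertex in seen:
            continue
        seen.add(vertex)
        for neighbor, confidence in edges.get(vertex, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost - log(confidence), neighbor, path + [neighbor]))
    return None

edges = {
    "Norilsk": [("Platinum Group Metals", 0.9)],
    "Platinum Group Metals": [("Iridium", 0.95)],
    "Iridium": [("OLED", 0.8)],
    "OLED": [("Samsung", 0.85)],
}
print(dijkstra_path(edges, "Norilsk", "Samsung"))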

This type of dependency analysis is illustrative of the power of a well formed knowledge graph and it is critical for machine enabled semantic reasoning. It’s easy to imagine this type of dependency analysis having uses not only in the financial services industry, but also in work as wide ranging as supply chain risk assessment and nuclear nonproliferation applications – just to name a few.

Graph Infrastructure

In addition to standard graph features, we choose to endow each fact that is stored in the knowledge graph with the time at which the edge was added and a confidence for that edge. The time dependence intuitively follows from the observation that the totality of human knowledge grows and changes over time. Ultimately, this makes the graph dynamic, which is a natural feature of human knowledge itself.

There are a small number of facts that I’d be willing to bet my life on (something like Auston Matthews is a Toronto Maple Leaf) and a great many facts that I’d be willing to bet $20 on (for example, the Boston Massacre happened in 1770). Both are true but, due to the amount of information that I’ve recently read, I know considerably more about the former than the latter and, therefore, am more confident about it. Motivated by this, we have designed our knowledge graph such that each edge carries a weight, which we choose to interpret as a confidence. This enables us to capture the inherent uncertainty necessary to model a fast-changing world and to reason about the validity of queries. Because the graph is probabilistic, we are able to embrace true Bayesian reasoning as we evaluate a query, as well as provide query-specific priors to up- or down-weight an assertion based on its origin (e.g. a company’s own statements about a new product release should be up-weighted over Twitter rumors).
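A minimal sketch of what such a timestamped, confidence-weighted edge and a source-dependent prior might look like; the priors and source names here are illustrative assumptions rather than values from our system:

from dataclasses import dataclass
import time

# Illustrative priors: how much we trust a fact given where it came from.
SOURCE_PRIORS = {"company_press_release": 0.9, "twitter_rumor": 0.3}

@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    confidence: float   # belief in the fact on its own
    added_at: float     # when the edge entered the graph
    source: str

    def weighted_confidence(self):
        # Up- or down-weight the raw confidence by a prior on the source.
        return self.confidence * SOURCE_PRIORS.get(self.source, 0.5)

fact = Fact("Apple", "announces", "New iPhone", 0.8, time.time(), "twitter_rumor")
print(fact.weighted_confidence())  # roughly 0.24: the rumor gets discounted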

One of the most exciting engineering challenges of knowledge graphs is their size. It is not uncommon to have a knowledge graph with more than 1 billion facts and 50 million vertices; this can easily require hundreds of gigabytes of RAM. Even more concerning than the memory requirements is the computational cost of computing even basic graph properties such as the path length between vertices. We have taken two complementary approaches to ensure that our graph algorithms are as fast as possible. First, because our edges are interpreted as probabilities, we can set a probability cutoff below which we are not interested in graph connections. This lets us run graph algorithms over highly restricted subsets of the graph, which provides major algorithmic savings. Second, we have engineered the data structure to remain as cache coherent as possible by representing our knowledge graph as a sparse rank-three tensor, in an attempt to optimize the per-fact throughput through the CPU.
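As a rough sketch of these two ideas (a simplification of our actual layout), the graph can be stored as a rank-three sparse tensor in coordinate form, with one axis each for subject, relation, and object, and edges below a confidence cutoff can be dropped before any algorithm runs:

import numpy as np

# COO-style storage: parallel arrays of (subject, relation, object) indices
# plus a confidence per fact. Entities and relations are interned to integers.
subjects   = np.array([0, 0, 1, 2], dtype=np.int32)
relations  = np.array([0, 1, 0, 2], dtype=np.int32)
objects    = np.array([3, 4, 3, 5], dtype=np.int32)
confidence = np.array([0.95, 0.40, 0.05, 0.80])

def prune(cutoff):
    """Keep only facts at or above the confidence cutoff; graph algorithms then
    run over this much smaller slice of the tensor."""
    keep = confidence >= cutoff
    return subjects[keep], relations[keep], objects[keep], confidence[keep]

s, r, o, c = prune(0.5)
print(len(s), "facts survive the cutoff")  # 2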

We also have a clear route towards efficient parallelization by exploiting what we are terming the “galactic structure” of the graph. While this is not a general feature of all graphs, we have observed that there are highly connected clusters of vertices that are only weakly connected to one another. Intuitively, this makes sense. For example, consider domains such as the Toronto Maple Leafs and modern particle physics – there is little overlap between these fields and therefore no need to reason over a graph that contains both clusters of highly interconnected vertices when reasoning about Dave Keon, the Toronto Maple Leafs legend. This galactic structure provides us with a promising route towards efficient parallelization using commodity hardware.

Where are We Going?

We’ve just started to teach the knowledge graph and show it how to perform basic reasoning. The following are some of the many additional features that we are adding that will ensure the accuracy, robustness, and efficiency of the graph long into the future.

Probabilistic Reasoning

Giving the knowledge graph the ability to reason probabilistically about the validity of facts allows it to hold conflicting facts or hypotheses and evaluate them later in the presence of more evidence. It also lets the graph handle more nuanced queries. This can be achieved with techniques such as softening the axiomatic constraints that power the machine reasoning engine and building ontology-specific Bayesian models. We anticipate that these techniques will make our knowledge graph more resilient to internal errors.

Automatic Fact Checking

Of course, if we have a collection of facts that we intend to use as our internal source of truth to augment business data, we should ensure that this set of facts is correct. At our current knowledge graph size, we can perform this fact checking using a mix of manual spot checking and axiomatic constraint testing (e.g. a person can only be born in one country). This is the standard technique for evaluating the correctness of knowledge graphs. As with most machine learning tasks, it is incredibly labor intensive and, therefore, expensive, and it is difficult to scale to large graphs. To address these issues, we’re excited to explore techniques related to hinge-loss Markov random fields that are directionally aware. In addition to being efficient, this allows us to look at a fact such as “Florida namedAfter Flo Rida” and simply swap its directionality, instead of having to first infer that the edge should be deleted and then infer that the reverse edge should be present.
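As a small illustration of axiomatic constraint testing (a toy check, not the hinge-loss machinery mentioned above), a functional constraint like “a person is born in exactly one country” can be tested by grouping facts by subject and flagging any subject with more than one object:

from collections import defaultdict

def check_functional_constraint(facts, relation):
    """facts: iterable of (subject, relation, object) triples.
    Flags subjects that have more than one object for a relation that should be
    functional, e.g. bornIn."""
    objects_by_subject = defaultdict(set)
    for s, r, o in facts:
        if r == relation:
            objects_by_subject[s].add(o)
    return {s: objs for s, objs in objects_by_subject.items() if len(objs) > 1}

facts = [
    ("Alice", "bornIn", "Canada"),
    ("Alice", "bornIn", "France"),   # conflicting fact that should be flagged
    ("Bob", "bornIn", "Brazil"),
]
print(check_functional_constraint(facts, "bornIn"))  # {'Alice': {'Canada', 'France'}}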

Automatic Graph Enrichment

Because it’s simply not possible to have humans continually teach the knowledge graph, our system is being constructed to be capable of learning facts on its own. There are many ways to do this, including tracking unexplained queries, generalizing local and global graph features to infer new facts from patterns, and using semantic information. Intuitively, this might look like noticing the pattern “companies tend to have a CEO”, finding a company in our graph that does not currently have one, and then enriching that region of the graph with the missing CEO fact. To achieve this, we are actively exploring modifications of techniques such as the path ranking algorithm and graph embedding methods, as well as information retrieval from the internet and other sources. This is proving to be an exciting path of inquiry.
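A toy version of that intuition, purely illustrative rather than how the production system works: find every entity typed as a Company that lacks a hasCEO edge and queue it for enrichment.

def enrichment_candidates(facts):
    """facts: iterable of (subject, relation, object) triples.
    Returns companies that are missing a hasCEO edge, i.e. places where the
    pattern 'companies tend to have a CEO' suggests a gap to fill."""
    companies = {s for s, r, o in facts if r == "isA" and o == "Company"}
    have_ceo = {s for s, r, o in facts if r == "hasCEO"}
    return companies - have_ceo

facts = [
    ("Apple", "isA", "Company"),
    ("Apple", "hasCEO", "Tim Cook"),
    ("Acme Corp", "isA", "Company"),   # no CEO edge: an enrichment candidate
]
print(enrichment_candidates(facts))  # {'Acme Corp'}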

Graph Dynamics

Modeling the influence of specific edges on the connectivity of two marked vertices in a graph is fundamental to understanding network resilience. In the context of a knowledge graph, this tells us how influential a given fact is. Intuitively, if we imagine that the vertices in our graph are cities and the edges roads, with the edge weights corresponding to the width of those roads (e.g. 0.1 is a one-lane road and 1.0 is a six-lane superhighway), then the time to travel between two different cities indicates the strength of their connection. With many alternative routes and many wide highways, we can say that those cities are tightly connected. Mathematically, the problem can be thought of in terms of a two-point correlation function for a collection of random walks over the graph. These are discrete random walks whose dynamics can be modeled with a discrete Green’s function. By taking advantage of the connection between discrete Green’s functions on a graph of random topology and discrete Laplace equations, we’ve preliminarily found that it is possible to evaluate the influence of changing an edge. We’re excited to formalize and harden this connection and expose these measures to aid in producing more advanced models.
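One way to make that intuition concrete is a crude Monte Carlo sketch (not the Green’s function machinery itself): estimate how often a short, weight-biased random walk started at one vertex reaches the other, with and without a given edge, and read the difference as that edge’s influence.

import random

def hit_probability(edges, source, target, steps=10, walks=5000, seed=0):
    """edges: dict vertex -> list of (neighbor, weight). Estimates the probability
    that a weight-biased random walk of bounded length reaches target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        v = source
        for _ in range(steps):
            out = edges.get(v, [])
            if not out:
                break
            neighbors, weights = zip(*out)
            v = rng.choices(neighbors, weights=weights)[0]
            if v == target:
                hits += 1
                break
    return hits / walks

edges = {"A": [("B", 1.0), ("C", 0.1)], "B": [("D", 1.0)], "C": [("D", 0.1)]}
with_edge = hit_probability(edges, "A", "D")
edges["B"] = []  # knock out the B -> D edge
without_edge = hit_probability(edges, "A", "D")
print(with_edge - without_edge)  # how much that edge mattered to the A-D connection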

The knowledge graph at Forge.AI is a crucial element of our technology stack and it has exciting potential for further development. We look forward to sharing with you further insights in the coming months.

Note: This post was originally published on the Forge.AI blog: https://www.forge.ai/blog/knowledge-graphs-for-enhanced-machine-reasoning-at-forge.ai

Linking Python to C with CFFI

I hope it is uncontroversial to state that Python is a great language that can suffer from occasional performance issues. This is especially true when Python is used in heavy numerical computing environments like those at Gamalon. Gamalon is hardly the first to need Python for numerical work, and to meet this need libraries like NumPy, SciPy, and Pandas provide users with well-tested implementations of the most common numerical tasks. Most of these tasks, such as matrix multiplication or special function evaluation, have reference implementations in C or Fortran that are linked to Python through many layers of indirection.

For rapid prototyping, this turns out to be a significant time saver, but what are the costs of these indirection layers? This can be answered by exploring the callgraph for the following code that simply evaluates the log of the probability density function of the Beta distribution with distributional parameters 1 and 2:

from random import randint
from scipy.stats import beta

N = 1000000
for i in range(N):
    a = beta.logpdf(i, 1, 2)

This simple script has the following callgraph:

[Figure: callgraph for the beta.logpdf script above]

We clearly see from the callgraph that a significant portion of our compute time is spent manipulating array shape, constructing arrays, and performing other associated operations. In fact, the underlying math for the distribution doesn’t even make an appearance in the callgraph!

The story is certainly different if, instead of making repeated calls to the logpdf function with a scalar, we make a single call with a vector. In this situation, the call overhead of the array manipulation is swamped by the vectorized cost of the math itself. The overall runtimes are:

Method    Time
Scalar    45.7 s
Vector    0.147 s
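For reference, the vectorized version replaces the million scalar calls with a single array-valued call along these lines:

import numpy as np
from scipy.stats import beta

# One vectorized call evaluates all million points at once, amortizing the
# array-handling overhead seen in the callgraph above.
a = beta.logpdf(np.arange(1000000), 1, 2)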

Vectorization, then, is the key to writing performant NumPy and SciPy based code.

Unfortunately, our specific use-cases typically involve lots of scalar calls, and thus, we encountered significant overhead. Can this cost be optimized away?

Beyond just the overhead, what happens when the user wants to use PyPy or some other tracing JIT to optimize the rest of their Python code (e.g. dictionary operations and the like)? Many of these libraries are simply not compatible with PyPy given PyPy’s partial support of the C API. In fact, NumPy only recently passed all upstream tests when run under the PyPy interpreter, but PyPy was generally unable to provide any performance optimizations. Is it possible to still take advantage of the C and Fortran reference implementations in PyPy without rewriting them in Python?

Generally, yes. This linking can be done with a tool called CFFI, the C Foreign Function Interface, which provides lightweight, PyPy-compatible Python bindings for C code. In the remainder, we will explore how to write the bindings, how to link those bindings to Python, and what their associated performance impacts are.

Writing and Building the C

For the sake of comparison, we will implement the Beta logpdf function in C. This function is given by:

\log f(x;\, a, b) = \log\Gamma(a + b) - \log\Gamma(a) - \log\Gamma(b) + (a - 1)\log x + (b - 1)\log(1 - x)

which can be implemented in C like so:

double beta_logpdf(double a, double b, double x){
  double prefactor;
  prefactor = lgamma(a + b) - (lgamma(a) + lgamma(b));
  return prefactor + (a - 1.) * log(x) + (b - 1.) * log(1. - x);
}

Next, we can then start to explore writing a CFFI compatible build script. The first place to start is by constructing the associated header file, beta.h, for the above function. The entire contents of the header file are given by:

double beta_logpdf(double a, double b, double x);

This looks just like a normal header file, because it is. This header file is used to tell CFFI which functions you want to link to Python. Therefore, you should only add function prototypes for the functions you actually want to link.

With the header file and implementation in hand, we next turn our attention to the build script, which uses CFFI’s API-level, out-of-line mode. Generally, CFFI works by taking the desired C code, automatically generating a C extension for it, and then building that extension.

The script begins by importing cffi and creating a new ffibuilder. The ffibuilder is the object that takes care of the code generation and compilation.

import cffi
ffibuilder = cffi.FFI()

With the ffibuilder defined, we need to tell it where to find the header containing the function prototype for beta_logpdf and the file containing its implementation.

sourcefile = os.path.join('.', 'beta.h')
source = os.path.join('.', 'beta.c')

Next, we need to read in our files and build them. We first read in the header file and pass it to cdef, which tells CFFI which functions to expose. We then call set_source with any necessary extra information about how to perform the code generation and any extra compilation arguments. Because our example is plain, we only need to tell the ffibuilder what to name the resulting module (_beta in our case), where to find the module sources, and how to compile them. While not strictly necessary, I try to compile with -O3, -march=native, and -ffast-math whenever it provides performance benefits; in this case, the move from -O2 to -O3 halves the run time.

with open(sourcefile) as f:
    ffibuilder.cdef(f.read())

ffibuilder.set_source(
    '_beta',
    '#include "{0}"'.format(sourcefile),
    sources=[source],
    library_dirs=['.'],
    extra_compile_args=['-O3', '-march=native', '-ffast-math'])

Finally, we build it. I prefer to build with verbose=True to ensure that I have all the information necessary to fix any problems that could occur.

ffibuilder.compile(verbose=True)

Putting this all together, we have:

import os
import cffi

ffibuilder = cffi.FFI()
sourcefile = os.path.join('.', 'beta.h')
source = os.path.join('.', 'beta.c')
with open(sourcefile) as f:
    ffibuilder.cdef(f.read())

ffibuilder.set_source(
    '_beta',
    '#include "{0}"'.format(sourcefile),
    sources=[source],
    library_dirs=['.'],
    extra_compile_args=['-O3', '-march=native', '-ffast-math'])

ffibuilder.compile(verbose=True)

which, when run, reports a successful build and produces several files: _beta.c, the emitted C extension module; _beta.o, the compiled object file; and _beta.cpython-36m-darwin.so, the dynamic library that Python will actually load.

As a closing remark, it is possible to define the C inline as described in the CFFI documentation, but I have found that it gets tedious to do this for large C projects, like those that we use internally.

Linking with Python

Given the built module above, we can now import it into Python. The module has the name prescribed by the first argument to set_source, which in our case was _beta. Because we’re only interested in the library itself and not any of the other FFI periphery, such as memory allocation, we only need to import lib. Importing just the function that we actually use, our import statement is simply from _beta.lib import beta_logpdf. Note that the function keeps the same name in Python as it has in C.

Putting this all together, we can call the linked function with:

from _beta.lib import beta_logpdf

beta_logpdf(1, 2, 1)

Performance Tests

Running the scripts detailed in the Scripts section below, we see the following results.

Method                 Time [s]    Slowdown (vs. C reference)
Reference C            0.0302      -
SciPy Scalar           45.7        1510
SciPy Vector           0.147       4.87
CFFI in cPython 3.6    0.346       11.5
CFFI in PyPy3          0.0449      1.49

All timings were generated on my mid-2017 non-touchbar MacBook Pro with a 2.3 GHz Intel Core i5. The SciPy scalar and vector results are exactly the same as those reported in the introduction. The reference C results were computed using the beta_logpdf function reported above and timed using the system clock available in the C standard library; the C reference was compiled with GCC 7 with the flags -O3, -march=native, and -ffast-math. The reported CFFI numbers were computed using the exact build scripts shown above. We elected to test in both cPython 3.6 and PyPy3 to gauge the benefits that a tracing JIT can afford in super simplistic scenarios like this. cPython 3.6 and PyPy3 were both installed from Homebrew bottles.

Interestingly enough, we observe that CFFI provides a two orders of magnitude improvement in the overhead of calling the logpdf function within cPython. This is already a drastic speedup, but we can do even better by running under PyPy, which leaves only a factor of 2 slowdown versus the reference C.

What is the source of the remaining factor of two? We can explore the computational costs of an exceedingly simple function to further understand the overhead cost of the interpreter. To understand this, we will time a function that takes in three values and returns one:

double nop(double a, double b, double x){
    return x;
}

Linking it in using the method described above and timing under both PyPy3 and cPython 3.6, we observe the following timings:

Interpreter    Time [s]    Slowdown (vs. PyPy reference)
cPython 3.6    0.248       15.1
PyPy3          0.0164      -

Interestingly enough, this means that calling an empty function linked with CFFI from cPython costs about 248 ns on average (0.248 s spread over one million calls), while the equivalent call costs about 16 ns in PyPy. We can then see that the majority of the computational benefit we observed in moving from cPython to PyPy comes from PyPy’s more efficient C interface.

Closing Remarks

Motivated by Knuth’s oft-cited adage that “premature optimization is the root of all evil”, the general advice that is given in these situations is to write working code, profile it, and then optimize hotspots. For Python, one of the common approaches to such an optimization is to translate the hotspots to C and then link them to Python.

In the above, we have explored specific techniques for such an optimization pathway that also permits the user to use interpreters such as PyPy. This allows for using a JIT to optimize other aspects of the codebase, such as heavy dictionary access, or string computations.

Combining PyPy with a CFFI-linked module, it is possible to obtain performance within an order of magnitude of C while still operating within Python. Amazingly enough, PyPy + CFFI outperforms even vectorized SciPy in this situation! Granted, this is a sample size of 1, but it is at least encouraging.

Of course, CFFI is not a magic bullet. Performing optimizations of this kind can significantly limit the flexibility of certain aspects of your codebase and requires developers to be more deliberate about their use of interfaces. It also adds maintenance costs because developers will now need to support both C and Python.

Edits

Note: The timings were updated to reflect an updated version of the underlying C function that matches the scipy behaviour better. Thank you @mgeier for pointing out this error!

Scripts

The callgraph was generated with

python -m cProfile -o prof.out beta.py
gprof2dot.py -f pstats prof.out | dot -Tsvg -o callgraph.svg

Run on cPython 3.6

from _beta.lib import beta_logpdf
from scipy.stats import beta
import time

N = 1000000
start = time.time()
for i in range(N):
    a = beta.logpdf(i, 1, 2)
end = time.time()
print(end - start)

start = time.time()
a = beta.logpdf(list(range(N)), 1, 2)
end = time.time()
print(end - start)

start = time.time()
for i in range(N):
    a = beta_logpdf(1, 2, i)
end = time.time()
print(end - start)

Run only under PyPy3

from _beta.lib import beta_logpdf
import time

N = 1000000
start = time.time()
for i in range(N):
    a = beta_logpdf(1, 2, i)
end = time.time()
print(end - start)

Compiled with GCC 7 with the command gcc -O3 -march=native -ffast-math -finline-functions beta.c -lm

#include <math.h>
#include <stdio.h>
#include <time.h>

double nop(double a, double b, double x){
    return x;
}

double xlogy(double x, double y){
    if(x == 0.0 || y == 0.0){
        return 0.0;
    }
    return x * log(y);
}

double beta_logpdf(double a, double b, double x){
    double prefactor;
    prefactor = lgamma(a + b) - (lgamma(a) + lgamma(b));
    return prefactor + xlogy(a - 1.0, x) + xlogy(b - 1.0, 1.0 - x);
}

int main(void){
    int N;
    clock_t start, end;
    double diff;
    N = 1000000;
    start = clock();
    for(int i = 0; i < N; i++){
        beta_logpdf(1.0, 2.0, i);
    }
    end = clock();
    diff = (double)(end - start) / (CLOCKS_PER_SEC);
    printf("Time in seconds %f", diff);
    return 0;
}