Thomas Markovich Machine Learning Research Scientist I work on hard problems and write about some of them here.

Modeling Life Outcomes with Stochastic Methods


My wife and I got married a little less than a year ago, and we’re starting to consider the life that we can build together. Like most couples, we’ve been trying to think through activities such as starting a family, buying a home, and possible career changes; and how to make sure that we can achieve these things without working too hard, so that we can have time to enjoy the fruits of our labour. Naturally, forward projection is an incredibly difficult task with many different variables, many of which are difficult to anticipate. My standard way of trying to make sense of this all involves trying to reason through mean values and making a wide range of assumptions, many of which have turned out to be pretty poor.

Ultimately, it clicked that I might want to consider many trajectories, each the result of a stochastic process, such that I could form a statistical distribution. With this modeling technique in hand, I could start to answer questions such as “What size house can we afford to reliably buy, if we’re targeting a due date of fall 2019”, with an answer like “99.7% of the time, our cash on hand is projected to be remain positive with a purchase price of $500,000” [1]. The rest of this post is organized as follows: we start by discussing a simple stochastic process in python, and then we discuss building individual models for the major features of life that affect cash flow; before finally combining all of the models together and briefly discussing their interpretive power.

Simulating Some Simple Stochastic Processes

The Wiener Process

The Wiener process, known also as Brownian motion, is a simple continuous time stochastic process that in applications as different as path integrals in quantum field theory, control theory, and finance. The theory behind the process was originally developed to model the motion of a particle in a fluid medium, where the particle undergoes random motion due to the random fluctuations of the medium. This motion was first observed by Robert Brown, when he observed the motion of a particle in water through a microscope and observed strange random motion. While originally developed to study the motion of a particle in a randomly fluctuating medium, models of Brownian motion have proven to be successful in finance, where the “medium” at hand is the market or other minor fluctuations that are otherwise impossible to capture. Given that household finances will have many of these same random fluctuations, this is a good starting point. In this discussion we will denote the Wiener process as $W$, which has individual elements $W_t$ where the subscript $t$ indicates the time index. This process has four simple properties:

  1. $W_0 = 0$
  2. $W_{t + u} - W_t \sim \mathcal{N}(0, u) \sim \sqrt{u} \mathcal{N}(0, 1)$, where $\sim$ indicates “drawn from”.
  3. For $0 \le s<t<u<v \le T$, $W_t - W_s$ and $W_v - W_u$ are independent
  4. $W_t$ is continuous in $t$

Using property two, we can immediately discretize this process by choosing to simulate:

which we will represent compactly as

Then, to recover $W$, we simply use the cumulative sum:

Computationally, this can be done easily using python and numpy in the following way:

import numpy as np
N = 100000
dt = 1.
dW = np.random.normal(0., 1., int(N)) * np.sqrt(dt)
W = np.cumsum(dW)

I’ve plotted five different Wiener process trajectories in Figure 1

Figure 1: Plot of five different instantiations of the discretized Wiener process generated using the code above.

Geometric Brownian Motion

With the Wiener process in hand, we can next explore a slightly more complicated stochastic processes named Geometric Brownian motion (GBM). Like a Wiener process, GBM is a continuous time stochastic process. GBM, however, assumes a logarithmic dependence on an underlying Wiener process with drift. The equation for such a process is

where $S_t$ provides the time index for the value of the stochastic process, $\mu$ is the constant related to the drift, $\sigma$ represents the volatility of the process, and $W_t$ is the Weiner process. This model is simple enough to be solved analytically, with the solution being:

but this does not limit the applicability of GBM. In fact, GBM has seen applications in a wide range of fields including finance through the celebrated Black-Scholes model and other financial models, molecular dynamics, and even genetics. GBM is popular in all of these areas because it is intuitively easy to grasp, and easy to modify to include new physics such as stochastic jumps or non-constant volatility.

While an analytic solution is useful, we will compute trajectories of the GBM by discritizing it on a grid, and using the cumulative summation that we used above for the Wiener process. In code, this looks like:

import numpy as np

def GBM(N=10000, mu=1.0, sigma=0.8, s0 = 1.0):
    t = np.linspace(0.,1.,N+1)
    dt = 1./N
    dW = np.random.normal(0., 1., int(N + 1)) * np.sqrt(dt)
    W = np.cumsum(dW)
    term1 = (mu - sigma**2/2.0) * t
    term2 = sigma * W
    S = s0 * np.exp(term1 + term2)
    return t, W, S

Figure 2: Plot of three different instantiations of the discretized Geometric Brownian motion process and the underlying Wiener process generated using the code above. Matching colours have the same underlying stochastic processes.

To model real world data with a GBM we can compute the drift and volatility from the data at hand, and then project forward many trajectories to understand the likelihood of many future possibilities.

Building a model of Cash on Hand

With an understanding of some stochastic processes in hand, we next seek to actually model personal financial data. One of the most important values to track when evaluating financial health is cash on hand. Obviously you generally want to spend less than your income, but how much less? And what are the effects of activities like buying a house, increasing rent, raises, having kids, vacations, and other life events on cash on hand?

The most straight forward starting point is to model personal spending habits. Having analyzed my last 6 years of spending, they been constant with minor fluctuations once major life events such as paying for a wedding were subtracted. There also rarely appears to be any obvious memory from month to month. Because I mostly observed fluctuation of a calculable standard deviation around a calculable mean, with no historical dependence, we seem to meet most of the conditions for Wiener process model to hold.

To build out this model, let’s assume an after-tax salary of $1000/month and a year of spending data that looks like the following:

[ $1106.95   $819.1    $782.45   $629.3    $951.48   $795.21   $955.4   $1004.88 $1109.47  $1277.65  $1380.51  $1058.21]

We can then compute the mean and standard deviations to arrive at data driven values for our financial model. This can be simply done by:

mean = np.mean(spending) # $989.22
std = np.std(spending) # $206.37

We can then generate a single forward trajectory of cash on hand with these values using the following code, which is closely related to the previous code:

N = 60 # 5 Years
dW = salary - np.random.normal(mean, std, int(N))
W = np.cumsum(dW)

where $W$ now represents our time dependent cash on hand. We can generate a single trajectory, but this is hardly going to be a valuable exercise because the odds of accurately predicting future cash on hand is extraordinarily low. Instead, we can understand the result of the trajectories in distribution by running many trajectories and exploring their distribution, which we have plotted below in Figure 4:

Figure 3: On the left, a plot of projected cash on hand for our hypothetical financial scenario. The mean trajectory is plotted in white, and the standard deviation is plotted in blue. On the right, a plot of the fraction of trajectories with a negative cash on hand.

From simple subtraction, we knew that the hypothetical person in the above scenario would be struggling, because, on average, there was only this person could only expect to save $10.78/month. It’s safe to say that the above scenario isn’t financially healthy, but we now have a firm sense for how unhealthy this exact scenario is. With already an extremely simple model, we can already start to appreciate the depth of questions that we can ask. Beyond that, however, we can also start to appreciate the true randomness that underpins life; which I’ve found to be useful in distressing budgeting for the future in uncertain situations.

Adding Time Dependence

Let’s take this model and run with it! My spending habits tended to have natural cyclic variation, with higher spending around Christmas time (presents and plane tickets to visit family) and the summer due to vacations. The plan is to make the mean for our Wiener process time dependent. For the sake of this simulation, we will explore the financial situation around hypothetical person 2 (HP2), who we assume has a take-home salary of $3000/month, and we will assume the following means for our stochastic spending process:

Month Mean [$] Notes
Jan 2000  
Feb 2000  
Mar 2000  
Apr 2000  
May 2000  
Jun 4000 Vacation
Jul 2000  
Aug 2500 Back to School
Sep 2000  
Oct 2000  
Nov 3500 Christmas Plans Booked
Dec 3000 Christmas Gifts

With a simple modification, we can just plug these values right in:

dW = []
salary = 3000
for i in range(N):
    monthly_spending = np.random.normal(SpendingMeans[i % 12], 300., 1)[0]
    dW.append(salary - monthly_spending)
W = np.cumsum(dW)

We can simulate just like before, and we can study distributions in the way to understand our HP2’s potential envelope of futures. These distributions are plotted in Figure 5:

Figure 4: On the left, a plot of projected cash on hand for our hypothetical financial scenario with time dependent means. The mean trajectory is plotted in white, and the standard deviation is plotted in blue. On the right, a plot of the fraction of trajectories with a negative cash on hand.

Modeling Salaries With Stochastic Jump Process

Up until now we’ve provided a somewhat coarse model for household spending but we’ve ignored the general stochasticity of compensation that comes from raises and bonuses. Over a multiyear projection, ignoring these compensation increases can lead under-forecasting of cash on hand. For the sake of modeling, I’m going to assume that our hypothetical person 2 (HP2) is a salaried individual, who is making 3000/month after taxes and averages a 4% raise year over year with a standard deviation of 2%. This volatility can come from market conditions, HP2’s individual performance, or performance of his employer. Whatever the case, capturing this variability will have a drastic impact on the accuracy of our cash on hand simulations.

To do this we will start by assuming a yearly raise cycle, where the raise amount is drawn from a normal distribution that is constrained to be positive.

mean_raise = 4.0 # Can be computed
std_raise = 1.0 # Can be computed
base_salary = 3000 # per year
p = 1.0 # Base percentage

salary = np.array([base_salary] * N)
for i in range(len(salary)):
    if (i != 0) and (i % 12) == 0:
        p = p * (1. + np.abs(np.random.normal(mean_raise, std_raise))/100)
    salary[i] = salary[i] * p
Year Mean Yearly Salary
0 29400.0 $\pm$ 0.0
1 30577.0 $\pm$ 293.61
2 31802.77 $\pm$ 431.88
3 33075.85 $\pm$ 551.76
4 34397.68 $\pm$ 662.16

Figure 5: In Figure 5 we present the mean and standard deviations of 10000 simulations of the presented yearly raise model. We see the expected stair step pattern, with more uncertainty as we project further into the future.

With this simple model in hand, we will next assume stochasticity with respect the interval between raises. This distribution matches roughly the raise and promotion cycle that we generally experience in our jobs. That is, generally our raises correspond to yearly performance reviews, but promotions and raise cycles can come a touch earlier or a touch later. We will build this stochasticity in by drawing the periods between raises from a Poisson distribution with rate $\lambda$ and a linear shift. To provide an intuition for what the Poisson distribution looks like, I’ve plotted a histogram of 100000000 samples below in Figure 6.

def poisson_draw(lam = 2, shift = 0, truncate = False):
        val = shift + np.random.poisson(lam)
        if not truncate or (truncate and val >= 0):
    return val

Figure 6: In Figure 6 we present the poisson histogram that is generated by drawing 100000000 samples from the above function with lam = 3 and shift = 10. As expected, we see our most likely period as 12 months but we have some statistical support around a slightly early raise or a delayed raise, with the delay being more likely than the early raise.

With a duration model in hand, we can then modify our simple stochastic raise model to include a stochastic time dependence between raises. This can be done simply by drawing the raise period from a poisson distribution, as indicated below:

last_raise = 0
raise_period = poisson_draw(lam = 4, shift = 8)
for i in range(len(salary)):
    if (i != 0) and (i  - last_raise) == raise_period:
        pp = np.abs(np.random.normal(mean_raise, std_raise))/100.
        p = p * (1. + pp)
        last_raise = i
        raise_period = poisson_draw(lam = 4, shift = 8)

for i in range(len(salary)):
    salary[i] = salary[i] * percentage[i]

Figure 7: In Figure 7 we present the mean and standard deviations of 10000 simulations of the presented raise model with stochastic raise periods. We see the stair step pattern gets "smeared" out due to the added uncertainty, with more uncertainty as we project further into the future.

Year Mean Yearly Salary
0 29475.97 $\pm$ 108.38
1 30614.07 $\pm$ 369.22
2 31825.85 $\pm$ 533.17
3 33093.98 $\pm$ 671.24
4 34416.07 $\pm$ 803.32

The second major source of randomness in HP2’s compensation is the role of bonuses in total compensation. Bonuses typically fall within a range of percentage of salary, and are typically offered at the end of the year, which is how we will choose to model them in this case. Specifically, we will assume that HP2 has a 5% yearly bonus optionality, and that HP2 is bonused every December. This model is probably the simplest that we have to implement, and the code looks something like:

def bonus_model(bonus_mean, bonus_std, yearly_salary):
    N = len(yearly_salary)
    bonus_percentages = np.random.normal(bonus_mean, bonus_std, N)/100.
    bonus = []
    time = []
    i = 0
    for j in range(12*N):
        if (((j % 12) == 11) and j != 0):
            bonus.append(bonus_percentages[i] * yearly_salary[i])
            i = i + 1
    time = np.array(time)
    bonus = np.array(bonus)
    return bonus, time

Putting It All Together

Because our observable is cash, we can combine all of these models with through simple addition. This allows us to represent our total compensation model as:

salary, yearly_salary = stochastic_raise_salary_model(base_salary, mean_raise, std_raise, N)
bonus, months = bonus_model(mean_bonus, std_bonus, yearly_salary)
total_comp = (salary + bonus)

Putting this all together, we can combine the cyclic spending model with the above total compensation model to arrive at a richer set of simulations, which we present in figure 6.

Figure 8: In Figure 8 we present the cash on hand model that combines together the stochastic salary model with stochastic raise periods, stochastic bonus model, and cyclic stochastic spending model. Taken together we see that over the 5 year period our HP2 slowly becomes more financially stable, but still caries significant debt year after the holidays every year.


These models were extremely simple but can be combined and extended with more parameters to account for deterministic things such as event based tax benefits (e.g. the interest deduction from buying a house), stock market performance, credit card debt, and other important events. I’ve done precisely this and they’ve already helped me to understand my finances more completely. Beyond this, they’ve also helped to reduce my overall anxiety around finances because uncertainty is a first class citizen of this approach.

Of course, beyond just extending these models, we also want to evaluate them and then use real-world data to inform them so that we can plan using a true posterior rather than just trying to draw inferences from our prior. I’ve chosen to do this personally by likelihood weighting my trajectories. This is commonly known as Approximate Bayesian Computation (ABC), and a followup post will detail how I used ABC by implementing a very special purpose probabilistic programming language.

[1] Please also note that any dollar figures quoted here are purely hypothetical due to the highly private nature of personal finances. This does not affect our modeling techniques, however.

Croissants and Macarons

Not all problems in life can be solved by machine learning. Sometimes you want to bake something extremely unhealthy and extremely delicious; and I’m fortunate to have the free time to pursue these desires. Recently I learned how to make, but not pronounce, croissants and macarons through lots of trial and error and some faithful taste testing by my wife and my coworkers.

The croissants were surprisingly time consuming when compared to what I saw on the great british baking show. Even still, they turned out great and even my boss who was born and raised in France approved:

Macarons were quite a good deal more difficult and technique driven. There were so many places where you could go wrong, from splitting the ganache, to over working the batter that they took a bit of practice to get right. Before Christmas I brought a bunch of mint-white chocolate macarons into the Forge.AI, and Ivan was kind enough to take some pictures in exchange for a couple:

Photo credit to Ivan Aguilar and his mustache.

NeurIPS 2018 Recap by Forge.AI

With quick reflexes and a fortunate server error, I was lucky enough to get a ticket to the 2018 Neural Information Processing Systems Conference (NeurIPS). It was with great excitement that I attended to represent Forge.AI this year. NeurIPS provides its attendees with a week of talks, demonstrations, and incredible networking opportunities. I was able to catch up with old friends, and meet new friends and potential collaborators. For those of you who weren’t lucky enough to score a ticket, I thought it would be useful to provide a collection of highlights from the conference.

Scalable Bayesian Inference

On my first day at NeurIPS, I was fortunate to attend a tutorial by Professor David Dunson from Duke University.

Professor Dunson gave a beautiful overview of techniques for scaling markov chain monte carlo (MCMC) to both large datasets and high model complexity. Optimal transport methods based on Barycenter calculation in Wasserstein space were discussed at length, with scaling results that look extremely promising and relevant to some of the inference tasks Forge.AI is tackling.

Professor Dunson opened a discussion about the high model complexity limit, coarsened Bayes (c-bayes), modular Bayes, and other techniques. In particular, the idea of c-Bayes is both philosophically disconcerting and aesthetically beautiful. I’ve personally always considered Bayes’ theorem to be on the same footing as Kepler’s law, so making minor modifications out of modeling convenience feels strange particularly because Bayes’ theorem provides a mechanism to have statistical strength from the observed data dominate the model structure when the signal is strong enough; and this modification down-weights that mechanism.

Of course, that’s not to say that I feel like this is a bad idea. Ultimately, the use of this theory makes parameter estimation possible without having to necessarily worry about small amounts of data-noise. It appears particularly convenient, especially when the data noise model isn’t available or easy to infer. I will have to explore the technique more to understand the settings where it’s preferable to use c-Bayes rather than explicitly model data-set noise, but I have a hunch that c-Bayes will be useful in knowledge graph construction tasks where I have a small amount of string noise (typos and abbreviations) without having to provide explicit string noise models.

Causal and Counterfactual Inference Techniques

Later the same day, Professor Susan Athey from Stanford University gave a wonderful overview of causal and counterfactual inference techniques. In her presentation, she discussed many of the algorithms and applications with very specific example use-cases in mind. This really helped to ground a difficult-to-pin down talk concretely and succinctly.

The professor’s talk made it painfully obvious how Forge.AI can combine knowledge graphs with counterfactual inference to perform AI guided speculative analyses. For instance, automatically answering the question “what would happen to Tesla’s stock price if there was an uprising in the Democratic Republic of the Congo?”.

Other Highlights

The rest of the week was filled with interesting talks, posters, and conversations. For instance, I ran into the Alexandria team at the Microsoft booth. They’re focusing on applying probabilistic programming to high precision knowledge graph construction. Both are projects close to my heart, and I loved hearing about how they combined them together. It was particularly exciting to learn how their token-based string model combined character-based likelihoods with token and dictionary based likelihoods to automatically learn format models. Using these models to achieve a precision greater than 95% would represent a true step forward in automated knowledge graph construction, and I can’t wait to read the paper.

I also attended the workshop on challenges and applications of AI in the Financial Services space. It was an absolute treat to learn about how the researchers in the the finance sector envision bringing in ML techniques. It was incredibly useful to see how important fairness, privacy, and explainability are in making day-to-day algorithmic decisions. As a data-provider with a prominent vertical in the financial services industry, it was useful to understand precisely what was meant by the term explainability. On multiple occasions, both the panel speakers made and the invited speakers the point that explainability was mostly desirable due to regulatory constraints and audit protections.

Even though everyone was in the same industry, explainability meant different things to different parts of the industry. There are many situations where individual decision makers are personally liable, and being able to provide the analyst with the ability to explain a potential poor decision by diagnosing a tool is highly desirable. Explainability in the credit card applications space tends to focus on generating adverse action codes to explain a decision, to provide the end user with a view of how they can remedy any defects with their applications.

Additionally, it was useful to hear a repeated emphasis on uncertainty predictions and the usefulness of understanding how to leverage uncertainty in making business decisions whether those decisions are underwriting a mortgage, offering a credit card, or making a trade. I found this personally validating because Forge.AI has constantly pushed to keep track of confidences and transparently report them, to inform our customers and their models of any downstream uncertainties that we may have.

NeurIPS was an amazing experience this year, and I look forward to returning next year with a larger Forge.AI cohort. Hopefully we’ll even be presenting some of our interesting work. We’ll probably have to write a bot to sign us all up so that we all can actually get tickets, but that sounds like a perfect task for a tech company like us. Maybe we’ll even get mugs next year.

Note: This post was originally published on the Forge.AI blog:

Knowledge Graphs for Enhanced Machine Reasoning at Forge.AI


Natural Language Understanding at an industrial scale requires an efficient, high quality knowledge graph for tasks such as entity resolution and reasoning. Without the ability to reason about information semantically, natural language understanding systems are only capable of shallow understanding. As the requirements of machine reasoning and machine learning tasks become more complex, more advanced knowledge graphs are required. Indeed, it has been previously observed that knowledge graphs are capable of producing impressive results when used to augment and accelerate machine reasoning tasks at small scales, but struggle at large scale due to a mix of data integrity and performance issues. Solving this problem and enabling machine driven semantic reasoning at scale is one of the foundational technological challenges that we are addressing at Forge.AI.

To understand the complexity of this task, it’s necessary to define what a knowledge graph is. There are many academic definitions floating around, but most are replete with jargon and impenetrable. Simply said, a knowledge graph is a graph where each vertex represents an entity and each edge is directed and represents a relationship between entities. Entities are typically proper nouns and concepts (e.g. Apple and Company, respectively), with the edges representing verbs (e.g. Is A). Together, these form large networks that encode semantic information. For example, encoding the fact that “Apple is a Company” in the knowledge graph is done by storing two vertices, one for “Apple” and one for “Company”, with a directed edge originating with Apple and pointing to Company of type “isA”. This is visualized in Figure 1:

Figure 1: Visualized simple knowledge graph representing the fact that "Apple is a Company"

A knowledge graph encodes many facts, each through the use of a directed edge. Each vertex can have many facts connected to it, making this ultimately a directed multigraph. This type of representation provides an intuitive way to reason about queries. For example, from the knowledge graph represented in Figure 1 we can reason about the question “Is apple a company?” by simply walking through the graph, starting at “Apple” and walking to “Company”, testing edges and concepts along the way. In production, knowledge graphs tend to be quite large and complex with millions or billions of edges. Such a large amount of knowledge allows us to use these graphs to easily reason about semantic connections for tasks such as enriching business relevant data and resolving entities. At Forge.AI, we perform these tasks as part of our NLP / NLU pipeline for extracting individual events from unstructured text into a machine-readable format.

With a working definition of a knowledge graph in hand, we will next explore some of the use cases that we’ve found for the knowledge graph here at Forge.AI. Then, we’ll explore the graph infrastructure to understand what powers these use-cases. Finally, we’ll discuss part of our road map to explore what’s next for Forge.AI and its knowledge graph.

Use Cases

It’s worth grounding our conversation of the knowledge graph in a few use-cases before we jump too deeply into a detailed discussion of the infrastructure, how it works, and where we’re going. In general, a knowledge graph can be used for a wide range of applications including entity resolution, dependency analysis, filtering, and machine reasoning. In the ensuing discussion, we will focus on entity disambiguation and dependency analysis, two of the many tasks that we use the knowledge graph for at Forge.AI.

Entity Disambiguation

While simple to state, the problem of entity disambiguation is one of the most frequent problems that we need to solve when reasoning about a document. While this problem is fairly straightforward to handle in cases where the relevant ontology is known and fully enumerated, it can quickly become difficult when that is not the case. To explore how we can handle this problem with the knowledge base, let’s consider the problem of determining which “Apple” is being referenced in the quote below:

“Though the company doesn’t break out individual unit sales by model, Apple says it sold 77.3 million iPhones — a decrease from the 78.2 million iPhones it sold in the same period in 2017.”

Obviously, this is “Apple” the corporation, not “apple” the type of fruit. How did our brains determine this? We used contextual clues! We know that the Apple Corporation sells the iPhone because the type of fruit is incapable of selling anything. Based on these contextual clues alone, we are able to perform this task nearly instantaneously using our reasoning.The ForgeAI knowledge graph works in the same way: when we seek to disambiguate an entity, we provide the knowledge graph with a set of co-located entities that provide the graph with the appropriate context. However, machine learning systems do not work like our brains do and for a machine learning system to reason with context, we need a knowledge graph. Our knowledge graph then searches for all versions of “Apple” on the full graph and constructs small graphs that include contextual information as can be seen in Figures 2 and 3. Note, this is a noisy string search that is capable of finding versions of the initial search term that may differ from the original string or contain the search string as a substring. We also keep a look up table of known aliases for each of our entities, where aliases can be things like CIK codes or ticker symbols.

Figure 2: Visualized excerpt from the Knowledge Graph that pertains to the entity Apple the fruit.

Figure 3: Visualized excerpt from the Knowledge Graph that pertains to the entity Apple, the consumer electronics corporation.

With these small graphs in hand, the knowledge graph then uses machine reasoning to determine which of the entities is truly being referenced. There are many strategies to doing this but we have found that a greedy algorithm which seeks to maximize the overlap between the contextual entities passed in and the small graphs under consideration is effective.

Dependency Analysis

Another major task that we’ve found the knowledge graph to be useful for is dependency analysis. That is, to determine the relationship between two or more entities. This is most useful when attempting to determine whether an extracted event is something that a customer would care about, given their stated interests. To make this concrete, let’s consider the following news story in regards to a customer that is interested in news events relating to Samsung:

“Russia’s Norilsk Nickel has teamed up with Russian Platinum to invest $4.4bn to develop mining projects in Siberia, which contains some of the world’s richest deposits of platinum and palladium. The two companies will form a joint venture to develop projects in the Taimyr Peninsula in Russia’s far north with an aim to become the world’s largest producer of the precious metals, they said Wednesday.”

It’s certainly not obvious to me how this story is connected to Samsung. The question at hand is to determine whether this news event is related to Samsung and, if so, the nature of that relation so we can determine whether or not to pass this event to our customer. We begin by constructing small graphs around each of the entities. With these graphs in hand, we then compute a path given Dijkstra’s algorithm between each of the marked endpoints. An example of such a path is given in Figure 4.

Figure 4: Visualized excerpt from the Knowledge Graph that pertains to the relationship between the Norilsk platinum group metals mine in Siberia, Russia and Samsung.

What we see in Figure 4 is that the knowledge graph believes that Iridium is a Platinum Group Metal, and that Platinum Group Metals are mined in Norilsk. We also see that the Knowledge Graph believes that Iridium is used in Organic Light Emitting Diodes (or OLEDs), which just happen to be used in Samsung phones. Therefore, this news event is likely relevant to our customer. In fact, this event is highly relevant to our customer’s interest in Samsung because Iridium is incredibly important to the production of OLED screens due to its ability to make a blue LED. Indeed, Samsung has even funded researchers at MIT and Harvard to explore alternatives to Iridium for OLED screens.

This type of dependency analysis is illustrative of the power of a well formed knowledge graph and it is critical for machine enabled semantic reasoning. It’s easy to imagine this type of dependency analysis having uses not only in the financial services industry, but also in work as wide ranging as supply chain risk assessment and nuclear nonproliferation applications – just to name a few.

Graph Infrastructure

In addition to standard graph features, we choose to endow each fact that is stored in the knowledge graph with the time at which the edge was added and a confidence for that edge. The time dependence intuitively follows from the observation that the totality of human knowledge grows and changes over time. Ultimately, this makes the graph dynamic, which is a natural feature of human knowledge itself.

There are a small number of facts that I’d be willing to bet my life on – something like Auston Matthews is a Toronto Maple Leaf – and a great many facts that I’d be willing to bet $20 dollars on – for example, the Boston Massacre happened in 1770. Both are true but, due to the amount of information that I’ve recently read, I know considerably more about the former than the latter and, therefore, am more confident about it. Motivated by this, we have designed our knowledge graph such that each edge has weights which we choose to interpret as confidences. This data enables us to capture the inherent uncertainty necessary to model a fast changing world and to reason about the validity of queries. By virtue of the graph being probabilistic, we are able to embrace true Bayesian reasoning as we attempt to evaluate a query, as well as provide query specific priors to up or down weight an assertion based on the origin (e.g. a company’s own statements about a new product release should be up-weighted over twitter rumors).

One of the most exciting engineering challenges of knowledge graphs is their size. It is not uncommon to have a knowledge graph with more than 1 billion facts and 50 million vertices; this can easily require hundreds of gigabytes of RAM. Even more concerning than the memory requirements is the computational cost of computing even basic graph properties such as the path length between vertices. We have taken two complementary approaches to ensure that our graph algorithms are as quick as possible. First, because our edges are interpreted as probabilities, it is possible to set a probability cutoff beyond which we are not interested in graph connections. This allows us to only consider graph algorithms over highly restricted subsets of the graph, which provides us with major algorithmic improvements. Second, we have engineered the data structure to remain as cache coherent as possible by representing our knowledge graph as a sparse three rank tensor in an attempt to optimize the per-fact throughput through the CPU.

We also have a clear route towards efficient parallelization by exploiting what we are terming the “galactic structure” of the graph. While this is not a general feature of all graphs, we have observed that there are highly connected clusters of vertices that are only weakly connected to one another. Intuitively, this makes sense. For example, consider domains such as the Toronto Maple Leafs and modern particle physics – there is little overlap between these fields and therefore no need to reason over a graph that contains both clusters of highly interconnected vertices when reasoning about Dave Keon, the Toronto Maple Leafs legend. This galactic structure provides us with a promising route towards efficient parallelization using commodity hardware.

Where are We Going?

We’ve just started to teach the knowledge graph and show it how to perform basic reasoning. The following are some of the many additional features that we are adding that will ensure the accuracy, robustness, and efficiency of the graph long into the future.

Probabilistic Reasoning

Giving the knowledge graph the ability to reason probabilistically about the validity of facts allows it to hold conflicting facts or hypotheses and evaluate them later in the presence of more evidence. This can additionally be used to evaluate nuance of queries. This can be achieved by using techniques such as softening the axiomatic constraints that power the machine reasoning engine and building ontology-specific Bayesian models. We anticipate that using these techniques should make our knowledge graph more resilient to internal errors.

Automatic Fact Checking

Of course, if we have a collection of facts that we intend to use as our internal source of truth to augment business data, we should ensure that this set of facts is correct. With our current knowledge graph size, we can perform this fact checking using a mix of manual spot checking and axiomatic constraint testing (e.g. a person can only be born in one country). This is the standard technique for evaluating the correctness of knowledge graphs. As with most machine learning tasks, this is incredibly person intensive and, therefore, expensive. Additionally, it’s difficult to scale this technique to large graphs. To address these issues, we’re excited to explore techniques related to hinge-loss Markov random fields that are directionally aware. In addition to being efficient, this allows us to look at a fact such as “Florida namedAfter Flo Rida” and swap the directionality, instead of having to first infer that we need to delete this edge and then infer that the reverse edge should be present.

Automatic Graph Enrichment

Because it’s simply not possible to have humans continually teach the knowledge graph, our system is being constructed to be capable of learning facts on its own. There are many ways to do this including: tracking unexplained queries, generalizing local and global graph features to infer new facts from patterns, and using semantic information. Intuitively, this might look like finding patterns such as “Companies tend to have a CEO” and one of the companies in our graph does not currently have a CEO. Therefore, we should enrich this region of the graph specifically relating to the specific company and the existence of the CEO. To achieve this, we are actively exploring modifications of techniques such as the path rank algorithm and graph embedding methods as well as information retrieval techniques from the internet and other sources. This is proving to be an exciting path of inquiry.

Graph Dynamics

Modeling the influence of specific edges on the connectivity of two marked vertices in a graph is fundamental to understanding network resilience. In the context of a knowledge graph, this provides us with information about the influence of this fact. Intuitively, if we imagine that the vertices in our graph are cities and the edges roads, with the edge weights corresponding to the width of those roads (e.g. 0.1 is a one lane road and 1.0 is a 6 lane super highway), then the time to travel between two different cities indicates the strength of their connection. With many alternative routes and many wide highways, we can say that those cities are tightly connected. Mathematically, the problem can be thought about in terms of a two point correlation function for a collection of random walks over the graph. These are discrete random walks whose dynamics can be modeled with a discrete Green’s function. By taking advantage of the connection between discrete Green’s functions on a graph of random topology and discrete Laplace equations, we’ve preliminarily found that it is possible to evaluate the influence of changing an edge. We’re excited to formalize and harden this connection and expose these measures to aid in producing more advanced models.

The knowledge graph at Forge.AI is a crucial element of our technology stack and it has exciting potential for further development. We look forward to sharing with you further insights in the coming months.

Note: This post was originally published on the Forge.AI blog:

Linking Python to C with CFFI

I hope that it is uncontroversial to state that Python is a great language that can suffer from occasional performance issues. This is especially true if Python is being used in heavy numerical computing environments like those at Gamalon. Gamalon is not the first to require using Python for numerical tasks. To meet this need, libraries like NumPy, SciPy, Pandas and others provide users with well tested implementations of most common numerical tasks. Most of these numerical tasks, such as matrix multiplication or special function evaluation among others, have reference implementations in C or Fortran that are linked to Python through many layers of indirection.

For rapid prototyping, this turns out to be a significant time saver, but what are the costs of these indirection layers? This can be answered by exploring the callgraph for the following code that simply evaluates the log of the probability density function of the Beta distribution with distributional parameters 1 and 2:

from random import randint
from scipy.stats import beta

N = 1000000
for i in range(N):
    a = beta.logpdf(i, 1, 2)

This simple script has the following callgraph (click to zoom in):


We clearly see from the callgraph that a significant portion of our compute time is spent manipulating array shape, constructing arrays, and performing other associated operations. In fact, the underlying math for the distribution doesn’t even make an appearance in the callgraph!

The story is certainly different if, instead of making repeated calls to the logpdf function with a scalar, we make a single call with the vector. In this situation, the call overhead of the array manipulation is swamped by the vectorized cost of the math itself. Overall runtime is reflected by:

Scalar Vector
45.7 s 0.147 s

Vectorization, then, is the key to writing performant NumPy and SciPy based code.

Unfortunately, our specific use-cases typically involve lots of scalar calls, and thus, we encountered significant overhead. Can this cost be optimized away?

Beyond just the overhead, what happens when the user wants to use PyPy or some other tracing JIT to optimize the rest of their Python code (e.g. dictionary operations and the like)? Many of these libraries are simply not compatible with PyPy given PyPy’s partial support of the C API. In fact, NumPy only recently passed all upstream tests when run under the PyPy interpreter, but PyPy was generally unable to provide any performance optimizations. Is it possible to still take advantage of the C and Fortran reference implementations in PyPy without rewriting them in Python?

Generally yes. This linking can be done with a tool called CFFI, or the C Foreign Function Interface, which provides light weight, PyPy compatible, Python bindings for C code. In the remainder, we will explore how to write the bindings, how to link those bindings to Python, and what their associated performance impacts are.

Writing and Building the C

For the sake of comparison, we will implement the Beta logpdf function in C. This function is given by:

which can be implemented in C like so:

double beta_logpdf(double a, double b, double x){
  double prefactor;
  prefactor = lgamma(a + b) - (lgamma(a) + lgamma(b));
  return prefactor + (a - 1.) * log(x) + (b - 1.) * log(x - 1.);

Next, we can then start to explore writing a CFFI compatible build script. The first place to start is by constructing the associated header file, beta.h, for the above function. The entire contents of the header file are given by:

double beta_logpdf(double a, double b, double x);

This looks just like a normal header file, because it is. This header file is used to tell CFFI which functions you want to link to Python. Therefore, you should only add function prototypes for the functions you actually want to link.

With the header file and implementation in hand, we next turn our attention to implementing the build script that links with the API-level out-of-line linking paradigm. Generally, the CFFI works by taking the desired C code, automatically generating a C extension for the code, and then building that.

The script begins by import CFFI and creating a new ffibuilder. The ffibuilder is the object that will take care of the code generation and compilation.

import cffi
ffibuilder = cffi.FFI()

With the ffibuilder defined, we need to tell the ffibuilder where to find the file containing the method prototype for beta_logpdf and its implementation.

sourcefile = os.path.join('.', 'beta.h')
source = os.path.join('.', 'beta.c')

Next, we need to read in our files and build them. We first read in the header file and use the header file to inform the set_source method which functions to import. We also need to provide any necessary extra information about how to perform the code generation and provide any extra compilation arguments. Because our example is plain, we only need to tell the ffibuilder what to name the resulting module, _beta in our case, where to find the module and sources, and how to compile it. While not expressly necessary, I try and compile with ‘-O3’, ‘-march=native’, ‘-ffast-math’ whenever it provides performance benefits. In this case, the move from -O2 to -O3 halves the run time.

with open(sourcefile) as f:

    '#include "{0}"'.format(sourcefile),
    extra_compile_args=['-O3', '-march=native', '-ffast-math'])

Finally, we build it. I prefer to build with verbose=True to ensure that I have all the information necessary to fix any problems that could occur.


Putting this all together, we have:

import os
import cffi

ffibuilder = cffi.FFI()
sourcefile = os.path.join('.', 'beta.h')
source = os.path.join('.', 'beta.c')
with open(sourcefile) as f:

    '#include "{0}"'.format(sourcefile),
    extra_compile_args=['-O3', '-march=native', '-ffast-math'])


which, when run, reports a successful build. The successful build gives us multiple files; _beta.c which is the emitted C extension module, the compiled object library, _beta.o, and `` which is the dynamic library that Python will actually load.

As a closing remark, it is possible to define the C inline as described in the CFFI documentation, but I have found that it gets tedious to do this for large C projects, like those that we use internally.

Linking with Python

Given the built module above, we can now import it into Python. The module will have the name prescribed by the first argument to the set_source method, which in our case was _beta. Because we’re only interested in the library itself and not any of the other FFI periphery, such as memory allocation, we only need to import lib. Importing only the function that we actually used, our import statement is simply from _beta.lib import beta_logpdf. Note, the name of the function in C is the same as that in the module.

Putting this all together, we can call the linked function with:

from _beta.lib import beta_logpdf

beta_logpdf(1, 2, 1)

Performance Tests

Running the scripts detailed in the Scripts section below, we see the following results.

Method Time [s] Slowdown (C Reference)
Reference C 0.0302 s -
SciPy Scalar 45.7 s 1510
SciPy Vector 0.147 s 4.87
CFFI in cPython 3.6 0.346 s 11.5
CFFI in PyPy3 0.0449 s 1.49

Where all timings were generated on my Mid-2017 Non-touchbar MacBook Pro with 2.3 GHz Intel Core i5. The SciPy scalar and vector results are exactly the same as those reported in the introduction. The reference C results were computed by using the beta_logpdf function reported above and timed using the system clock available in the C standard library. The C reference was compiled with GCC 7 with compiler flags -O3, -march=native, and -ffast-math. The reported CFFI numbers were computed using the exact build scripts linked above. We elected to test in both cPython 3.6 and PyPy3 to test the possible benefits that a tracing JIT can afford us in super simplistic scenarios like this. cPython 3.6 and PyPy3 are both from the bottle poured with Homebrew.

Interestingly enough, we observe that CFFI provides two order of magnitude improvement in the overhead for calling the logpdf function within cPython. This is already a drastic speedup, but we can do even better by running under PyPy, which provides only a factor of 2 slow-down versus the reference C.

What is the source of the remaining factor of two? We can explore the computational costs of an exceedingly simple function to further understand the overhead cost of the interpreter. To understand this, we will time a function that takes in three values and returns one:

double nop(double a, double b, double x){
    return x;

Linking in the method described above and then timing under both PyPy3 and cPython 3.6, observe the following timings:

Interpreter Time [s] Slowdown (PyPy Reference)
cPython 3.6 0.248 s 15.1
PyPy3 0.0164 s -

Interestingly enough, that means that, calling an empty function linked with CFFI from cPython costs on average 248 ns, while the equivalent tasks costs 16 ns in PyPy. We can then see that the majority of the computational benefit that we have observed from moving from cPython to PyPy is the more efficient C interface.

Closing Remarks

Motivated by Knuth’s oft-cited adage that “premature optimization is the root of all evil”, the general advice that is given in these situations is to write working code, profile it, and then optimize hotspots. For Python, one of the common approaches to such an optimization is to translate the hotspots to C and then link them to Python.

In the above, we have explored specific techniques for such an optimization pathway that also permits the user to use interpreters such as PyPy. This allows for using a JIT to optimize other aspects of the codebase, such as heavy dictionary access, or string computations.

Combining PyPy with a CFFI linked module, it is possible to obtain performance within an order of magnitude to C while still operating within Python. Amazingly enough, PyPy + CFFI out performs even vectorized SciPy in this situation! Granted, this is a sample size of 1 but it is at least encouraging.

Of course, CFFI is not a magic bullet. Performing optimizations of this kind can significantly limit the flexibility of certain aspects of your code-base, and require developers to be more deliberate about their use of interfaces. It also adds maintenance costs because developers will now need to be able to support both C and Python.


Note: The timings were updated to reflect an updated version of the underlying C function that matches the scipy behaviour better. Thank you @mgeier for pointing out this error!


The callgraph was generated with

python -m cProfile -o prof.out -f pstats prof.out | dot -Tsvg -o callgraph.svg

Run on cPython 3.6

from _beta.lib import beta_logpdf
from scipy.stats import beta
import time

N = 1000000
start = time.time()
for i in range(N):
    a = beta.logpdf(i, 1, 2)
end = time.time()
print(end - start)

start = time.time()
a = beta.logpdf(list(range(N)), 1, 2)
end = time.time()
print(end - start)

start = time.time()
for i in range(N):
    a = beta_logpdf(1, 2, i)
end = time.time()
print(end - start)

Run only under PyPy3

from _beta.lib import beta_logpdf
import time

start = time.time()
for i in range(N):
    a = beta_logpdf(1, 2, i)
end = time.time()
print(end - start)

Compiled with GCC 7 with command gcc -O3 -march=native -ffast-math -finline-functions beta.c -lm ```c #include #include #include

double nop(double a, double b, double x){ return x; }

double xlogy(double x, double y){ if(x == 0.0 || y == 0.0){ return 0.0; } return x * log(y); }

double beta_logpdf(double a, double b, double x){ double prefactor; prefactor = lgamma(a + b) - (lgamma(a) + lgamma(b)); return prefactor + xlogy(a - 1.0, x) + xlogy(b - 1.0, 1.0 - x); }

int main(void){ int N; clock_t start, end; double diff; N = 1000000; start = clock(); for(int i = 0; i < N; i++){ beta_logpdf(1.0, 2.0, i); } end = clock(); diff = (double)(end - start) / (CLOCKS_PER_SEC); printf(“Time in seconds %f”, diff); return 0; }```