Thomas Markovich Machine Learning Research Scientist I work on hard problems and write about some of them here.

Rethinking My Relationship With Technology

I’ve been recently struggling with many little questions that have ultimately culminated in one, big, question. It’s been deeply frustrating, because the only answer that I keep coming to is “I don’t know.”

I’ve been reflecting a lot in the last few years on my relationships with technology and the influence that it’s had on my life. I’ve tried to answer questions like “Does having a subreddit for every esoteric interest make life better?” or “How valuable is twitter, actually?” The spirit of these questions is certainly not new, but it seems that I’ve yet to find a satisfying answer. While I originally thought that these questions had some value; the imminent arrival of my son has given these questions a new significance.

These are certainly not new questions. I remember my father musing about “kids these days” when he saw AOL Instant Messenger, email, and text messages dominate my social life; but he’s also an engineer who has seen technology repeatedly improve our lives and society. He raised me to be technologically optimistic. Some of my fondest memories include building lego mindstorms robots together, or building computers together.

Every day questions, like how in the world a car could even be picked up by a magnet were met with an experiment where we constructed an electromagnet just to see how many nails we could pick up. Christmas always brought a new amazing technology that fundamentally changed the way that I viewed the world. I remember the iPod that I received in 2003. That I could fit so much music on it was remarkable! And the user interface was just… stunning. Novel gadgets were coveted because of the hope that they would be good, and they were good. They were always good.

As a naturally curious person, I was taught by example that technology was a force multiplier that allowed use to answer any question imaginable. Indeed, that’s still the case. The internet still provides us with the ability read about any topic that I want. I’ve yet to encounter a question that wasn’t answerable through diligent research or connecting with experts.

So then what’s the problem? We all know. While technology has increased our access to experts, it has also increased our ability to seek out echo chambers and led to further misinformation and political polarization. We have unmatched tools for staying in contact with friends near and far, and yet we’ve never felt more isolated and lonely. We have tools that have been promised to drastically increase our productivity and yet, all they do is interrupt dinner with notifications about messages that don’t matter. Addressing all of these issues in isolation feels difficult given that they’re each a facet of the broader question: How can we take the best of technology while ignoring the worst?

Trying to answer this question has left me to reconsider my relationship with technology and the influence that it has on my life. As I’ve been on this journey, I’ve sought out books, articles, and conversations with close friends but many of these have inspired cynicism towards technology. I feel that this cynicism is myopic because technology has always inspired some form of hope in my life. My life as I know it wouldn’t be possible with it! Working in the technology sector has provided me with my livelihood. I even believe in the technologies that we continue to build. I truly believe that AI, if worked on thoughtfully, can empower untold societal change. I see the possible benefits of the systems that we’re building and I see no reason not to build them, but this is the response of an optimist.

Unfortunately, there is a growing body of literature that shows that many of our modern day technologies are ruining our ability to read, reason, and focus. Technologies such as social networks, streaming platforms, and smartphones provide us with the specifically designed intellectual junk food that allows us to disengage from the world around us. It’s just so easy for me to pick up my phone and scroll through reddit any time I get bored, possible even during conversations with my wife – and for what? So that I can skim through yet another poorly informed comments section filled with people I don’t know intent trying to score karma with dank-memes and hive-mind claims. When stated in such a manner, it’s obvious that this is a poor trade off and yet, I do it. It’d be easy to spend the rest of this ranting about all the ways in which modern day technologies intellectually hijack me, but that’s both boring and unoptimistic. Instead, it’s worth focusing on the distinct benefits that technology provide and how to minimize the downsides.

One of the things that I noticed most frequently was this issue of notification anxiety, where I always felt the need to keep my notification tray empty, which resulted in more than 100 unlocks per day, on average. I’ve found that by keeping my phone on do not disturb for weeks at a time I’ve been able to quell the notification anxiety while also keeping that same notification tray empty. Importantly, I sit down and choose to process emails and that’s the only time I view them. I’ve not yet removed the gmail app from my phone because the searching feature is far too useful, but I’ve found that without the constant notification anxiety, I’ve stopped checking the emails as well. While this felt revelatory at the moment, it’s obvious now. The question then becomes, what is the point of email notifications? Or the amount of email that we send and receive? Do we actually feel any more connected, or do we just end up feeling more distracted? I’m not actually sure.

The attention capturing might be the most severe in the professional world where I am expected to respond to emails, slack messages, instant messages, in person chats, and phone calls. There are days where I do nothing but respond briefly to a range of emails and slacks for 8 hours and then head home, exhausted, and entirely unproductive. Given the types of technologies that I’m building, I certainly don’t have the time to just waste entire days keeping up with all of these communications media. I have math to do, code to write, and models to train. All of those tasks require deep thought, and getting distracted by communications media, my phone, or the news actively prevents me from building the things that I want to build. I have found that by cutting these distractions out, I have been able to more easily keep up with the literature in my field. Because this is my profession, understanding the gory details and nuances of how an algorithm or model develops over time is literally a professional responsibility. I’m not constantly worried about the racist things that Steve King says so there is the mental energy to engage with a new and interesting knowledge graph paper without issue. Is it actually good that I’ve decided to become a less informed citizen? Or have I actually become less informed? Is it even a problem? And what do I want to model for my child?

In my personal life, it’s been easy to experiment with possible actions, but how can I make sure that I’m actually picking the most healthy solutions? Do we even know what the most healthy solutions are? What lessons do I want to teach my son about technology? Is guarded optimism even an option? Given that many of the above observations highlighted on the anxiety introduced by notifications and other frequent technological refocusing actions, how do I help my son avoid falling victim to these same traps? Is it even a problem that he could be or am I making a mountain out of a mole-hill?

Furthermore, I’m broadly part of the problem given the industry that I work in. If after deep examination of my life and my relationship with technology, I’ve resolved to drastically cut back on my technological interactions, my question is: how can I take these personal, introspective, learnings and generalize them to be professionally responsible? It’s certainly plausible that sectors such as ad-tech, algorithmic targeting, and gameification should be avoided. I’ve been able to avoid such projects thus far, but what about Forge.AI and the fintech/news processing technologies that we’re building? What are their downstream ramifications?

What do we as a profession owe the people who have used our work product? Is an apology sufficient? Is it even warranted? Are we as a profession just responding to consumer desires? We can attempt this argument, but we’ve all dog-fooded our own systems. I know many people in tech that refuse to own smartphones for the reasons identified above, and yet they still make these technologies. Are they just assuming that the end user is educated enough to make the right decision? Is that really ever the case?

Do we as a community even care, or is this only a luxury afforded to those who are already able to support their families? And who do I mean by we anyway? It feels like I’m maybe using we to avoid blaming myself; but maybe it’s acknowledgment that these problems are bigger than myself? Maybe.

So what are we to do? What am I to do? I’ll probably continue to think about this, but continue to disarm how uncomfortable this conversation can be my making jokes. I’ll tell myself that it’s likely that by attempting to think through this from time to time, and talking about it over drinks with friends, I’ll be doing my part. Maybe I will be, or maybe it’s all just a day late and a dollar sort. Maybe.

Modeling Life Outcomes with Stochastic Methods


My wife and I got married a little less than a year ago, and we’re starting to consider the life that we can build together. Like most couples, we’ve been trying to think through activities such as starting a family, buying a home, and possible career changes; and how to make sure that we can achieve these things without working too hard, so that we can have time to enjoy the fruits of our labour. Naturally, forward projection is an incredibly difficult task with many different variables, many of which are difficult to anticipate. My standard way of trying to make sense of this all involves trying to reason through mean values and making a wide range of assumptions, many of which have turned out to be pretty poor.

Ultimately, it clicked that I might want to consider many trajectories, each the result of a stochastic process, such that I could form a statistical distribution. With this modeling technique in hand, I could start to answer questions such as “What size house can we afford to reliably buy, if we’re targeting a due date of fall 2019”, with an answer like “99.7% of the time, our cash on hand is projected to be remain positive with a purchase price of $500,000” [1]. The rest of this post is organized as follows: we start by discussing a simple stochastic process in python, and then we discuss building individual models for the major features of life that affect cash flow; before finally combining all of the models together and briefly discussing their interpretive power.

Simulating Some Simple Stochastic Processes

The Wiener Process

The Wiener process, known also as Brownian motion, is a simple continuous time stochastic process that in applications as different as path integrals in quantum field theory, control theory, and finance. The theory behind the process was originally developed to model the motion of a particle in a fluid medium, where the particle undergoes random motion due to the random fluctuations of the medium. This motion was first observed by Robert Brown, when he observed the motion of a particle in water through a microscope and observed strange random motion. While originally developed to study the motion of a particle in a randomly fluctuating medium, models of Brownian motion have proven to be successful in finance, where the “medium” at hand is the market or other minor fluctuations that are otherwise impossible to capture. Given that household finances will have many of these same random fluctuations, this is a good starting point. In this discussion we will denote the Wiener process as $W$, which has individual elements $W_t$ where the subscript $t$ indicates the time index. This process has four simple properties:

  1. $W_0 = 0$
  2. $W_{t + u} - W_t \sim \mathcal{N}(0, u) \sim \sqrt{u} \mathcal{N}(0, 1)$, where $\sim$ indicates “drawn from”.
  3. For $0 \le s<t<u<v \le T$, $W_t - W_s$ and $W_v - W_u$ are independent
  4. $W_t$ is continuous in $t$

Using property two, we can immediately discretize this process by choosing to simulate:

\[W_{t + dt} - W_{t} \sim \sqrt{dt} \mathcal{N}(0, 1)\]

which we will represent compactly as

\[dW_t \sim \sqrt{t} \mathcal{N}(0, 1).\]

Then, to recover $W$, we simply use the cumulative sum:

\[W_t = \sum_{s \le t} dW_s.\]

Computationally, this can be done easily using python and numpy in the following way:

import numpy as np
N = 100000
dt = 1.
dW = np.random.normal(0., 1., int(N)) * np.sqrt(dt)
W = np.cumsum(dW)

I’ve plotted five different Wiener process trajectories in Figure 1

Figure 1: Plot of five different instantiations of the discretized Wiener process generated using the code above.

Geometric Brownian Motion

With the Wiener process in hand, we can next explore a slightly more complicated stochastic processes named Geometric Brownian motion (GBM). Like a Wiener process, GBM is a continuous time stochastic process. GBM, however, assumes a logarithmic dependence on an underlying Wiener process with drift. The equation for such a process is

\[dS_t = \mu S_t dt + \sigma S_t dW_t,\]

where $S_t$ provides the time index for the value of the stochastic process, $\mu$ is the constant related to the drift, $\sigma$ represents the volatility of the process, and $W_t$ is the Weiner process. This model is simple enough to be solved analytically, with the solution being:

\[S_t = S_0 e^{\left(\mu - \frac{\sigma^2}{2}\right) t + \sigma W_t}\]

but this does not limit the applicability of GBM. In fact, GBM has seen applications in a wide range of fields including finance through the celebrated Black-Scholes model and other financial models, molecular dynamics, and even genetics. GBM is popular in all of these areas because it is intuitively easy to grasp, and easy to modify to include new physics such as stochastic jumps or non-constant volatility.

While an analytic solution is useful, we will compute trajectories of the GBM by discritizing it on a grid, and using the cumulative summation that we used above for the Wiener process. In code, this looks like:

import numpy as np

def GBM(N=10000, mu=1.0, sigma=0.8, s0 = 1.0):
    t = np.linspace(0.,1.,N+1)
    dt = 1./N
    dW = np.random.normal(0., 1., int(N + 1)) * np.sqrt(dt)
    W = np.cumsum(dW)
    term1 = (mu - sigma**2/2.0) * t
    term2 = sigma * W
    S = s0 * np.exp(term1 + term2)
    return t, W, S

Figure 2: Plot of three different instantiations of the discretized Geometric Brownian motion process and the underlying Wiener process generated using the code above. Matching colours have the same underlying stochastic processes.

To model real world data with a GBM we can compute the drift and volatility from the data at hand, and then project forward many trajectories to understand the likelihood of many future possibilities.

Building a model of Cash on Hand

With an understanding of some stochastic processes in hand, we next seek to actually model personal financial data. One of the most important values to track when evaluating financial health is cash on hand. Obviously you generally want to spend less than your income, but how much less? And what are the effects of activities like buying a house, increasing rent, raises, having kids, vacations, and other life events on cash on hand?

The most straight forward starting point is to model personal spending habits. Having analyzed my last 6 years of spending, they been constant with minor fluctuations once major life events such as paying for a wedding were subtracted. There also rarely appears to be any obvious memory from month to month. Because I mostly observed fluctuation of a calculable standard deviation around a calculable mean, with no historical dependence, we seem to meet most of the conditions for Wiener process model to hold.

To build out this model, let’s assume an after-tax salary of $1000/month and a year of spending data that looks like the following:

[ $1106.95   $819.1    $782.45   $629.3    $951.48   $795.21   $955.4   $1004.88 $1109.47  $1277.65  $1380.51  $1058.21]

We can then compute the mean and standard deviations to arrive at data driven values for our financial model. This can be simply done by:

mean = np.mean(spending) # $989.22
std = np.std(spending) # $206.37

We can then generate a single forward trajectory of cash on hand with these values using the following code, which is closely related to the previous code:

N = 60 # 5 Years
dW = salary - np.random.normal(mean, std, int(N))
W = np.cumsum(dW)

where $W$ now represents our time dependent cash on hand. We can generate a single trajectory, but this is hardly going to be a valuable exercise because the odds of accurately predicting future cash on hand is extraordinarily low. Instead, we can understand the result of the trajectories in distribution by running many trajectories and exploring their distribution, which we have plotted below in Figure 4:

Figure 3: On the left, a plot of projected cash on hand for our hypothetical financial scenario. The mean trajectory is plotted in white, and the standard deviation is plotted in blue. On the right, a plot of the fraction of trajectories with a negative cash on hand.

From simple subtraction, we knew that the hypothetical person in the above scenario would be struggling, because, on average, there was only this person could only expect to save $10.78/month. It’s safe to say that the above scenario isn’t financially healthy, but we now have a firm sense for how unhealthy this exact scenario is. With already an extremely simple model, we can already start to appreciate the depth of questions that we can ask. Beyond that, however, we can also start to appreciate the true randomness that underpins life; which I’ve found to be useful in distressing budgeting for the future in uncertain situations.

Adding Time Dependence

Let’s take this model and run with it! My spending habits tended to have natural cyclic variation, with higher spending around Christmas time (presents and plane tickets to visit family) and the summer due to vacations. The plan is to make the mean for our Wiener process time dependent. For the sake of this simulation, we will explore the financial situation around hypothetical person 2 (HP2), who we assume has a take-home salary of $3000/month, and we will assume the following means for our stochastic spending process:

Month Mean [$] Notes
Jan 2000  
Feb 2000  
Mar 2000  
Apr 2000  
May 2000  
Jun 4000 Vacation
Jul 2000  
Aug 2500 Back to School
Sep 2000  
Oct 2000  
Nov 3500 Christmas Plans Booked
Dec 3000 Christmas Gifts

With a simple modification, we can just plug these values right in:

dW = []
salary = 3000
for i in range(N):
    monthly_spending = np.random.normal(SpendingMeans[i % 12], 300., 1)[0]
    dW.append(salary - monthly_spending)
W = np.cumsum(dW)

We can simulate just like before, and we can study distributions in the way to understand our HP2’s potential envelope of futures. These distributions are plotted in Figure 5:

Figure 4: On the left, a plot of projected cash on hand for our hypothetical financial scenario with time dependent means. The mean trajectory is plotted in white, and the standard deviation is plotted in blue. On the right, a plot of the fraction of trajectories with a negative cash on hand.

Modeling Salaries With Stochastic Jump Process

Up until now we’ve provided a somewhat coarse model for household spending but we’ve ignored the general stochasticity of compensation that comes from raises and bonuses. Over a multiyear projection, ignoring these compensation increases can lead under-forecasting of cash on hand. For the sake of modeling, I’m going to assume that our hypothetical person 2 (HP2) is a salaried individual, who is making 3000/month after taxes and averages a 4% raise year over year with a standard deviation of 2%. This volatility can come from market conditions, HP2’s individual performance, or performance of his employer. Whatever the case, capturing this variability will have a drastic impact on the accuracy of our cash on hand simulations.

To do this we will start by assuming a yearly raise cycle, where the raise amount is drawn from a normal distribution that is constrained to be positive.

mean_raise = 4.0 # Can be computed
std_raise = 1.0 # Can be computed
base_salary = 3000 # per year
p = 1.0 # Base percentage

salary = np.array([base_salary] * N)
for i in range(len(salary)):
    if (i != 0) and (i % 12) == 0:
        p = p * (1. + np.abs(np.random.normal(mean_raise, std_raise))/100)
    salary[i] = salary[i] * p
Year Mean Yearly Salary
0 29400.0 $\pm$ 0.0
1 30577.0 $\pm$ 293.61
2 31802.77 $\pm$ 431.88
3 33075.85 $\pm$ 551.76
4 34397.68 $\pm$ 662.16

Figure 5: In Figure 5 we present the mean and standard deviations of 10000 simulations of the presented yearly raise model. We see the expected stair step pattern, with more uncertainty as we project further into the future.

With this simple model in hand, we will next assume stochasticity with respect the interval between raises. This distribution matches roughly the raise and promotion cycle that we generally experience in our jobs. That is, generally our raises correspond to yearly performance reviews, but promotions and raise cycles can come a touch earlier or a touch later. We will build this stochasticity in by drawing the periods between raises from a Poisson distribution with rate $\lambda$ and a linear shift. To provide an intuition for what the Poisson distribution looks like, I’ve plotted a histogram of 100000000 samples below in Figure 6.

def poisson_draw(lam = 2, shift = 0, truncate = False):
        val = shift + np.random.poisson(lam)
        if not truncate or (truncate and val >= 0):
    return val

Figure 6: In Figure 6 we present the poisson histogram that is generated by drawing 100000000 samples from the above function with lam = 3 and shift = 10. As expected, we see our most likely period as 12 months but we have some statistical support around a slightly early raise or a delayed raise, with the delay being more likely than the early raise.

With a duration model in hand, we can then modify our simple stochastic raise model to include a stochastic time dependence between raises. This can be done simply by drawing the raise period from a poisson distribution, as indicated below:

last_raise = 0
raise_period = poisson_draw(lam = 4, shift = 8)
for i in range(len(salary)):
    if (i != 0) and (i  - last_raise) == raise_period:
        pp = np.abs(np.random.normal(mean_raise, std_raise))/100.
        p = p * (1. + pp)
        last_raise = i
        raise_period = poisson_draw(lam = 4, shift = 8)

for i in range(len(salary)):
    salary[i] = salary[i] * percentage[i]

Figure 7: In Figure 7 we present the mean and standard deviations of 10000 simulations of the presented raise model with stochastic raise periods. We see the stair step pattern gets "smeared" out due to the added uncertainty, with more uncertainty as we project further into the future.

Year Mean Yearly Salary
0 29475.97 $\pm$ 108.38
1 30614.07 $\pm$ 369.22
2 31825.85 $\pm$ 533.17
3 33093.98 $\pm$ 671.24
4 34416.07 $\pm$ 803.32

The second major source of randomness in HP2’s compensation is the role of bonuses in total compensation. Bonuses typically fall within a range of percentage of salary, and are typically offered at the end of the year, which is how we will choose to model them in this case. Specifically, we will assume that HP2 has a 5% yearly bonus optionality, and that HP2 is bonused every December. This model is probably the simplest that we have to implement, and the code looks something like:

def bonus_model(bonus_mean, bonus_std, yearly_salary):
    N = len(yearly_salary)
    bonus_percentages = np.random.normal(bonus_mean, bonus_std, N)/100.
    bonus = []
    time = []
    i = 0
    for j in range(12*N):
        if (((j % 12) == 11) and j != 0):
            bonus.append(bonus_percentages[i] * yearly_salary[i])
            i = i + 1
    time = np.array(time)
    bonus = np.array(bonus)
    return bonus, time

Putting It All Together

Because our observable is cash, we can combine all of these models with through simple addition. This allows us to represent our total compensation model as:

salary, yearly_salary = stochastic_raise_salary_model(base_salary, mean_raise, std_raise, N)
bonus, months = bonus_model(mean_bonus, std_bonus, yearly_salary)
total_comp = (salary + bonus)

Putting this all together, we can combine the cyclic spending model with the above total compensation model to arrive at a richer set of simulations, which we present in figure 6.

Figure 8: In Figure 8 we present the cash on hand model that combines together the stochastic salary model with stochastic raise periods, stochastic bonus model, and cyclic stochastic spending model. Taken together we see that over the 5 year period our HP2 slowly becomes more financially stable, but still caries significant debt year after the holidays every year.


These models were extremely simple but can be combined and extended with more parameters to account for deterministic things such as event based tax benefits (e.g. the interest deduction from buying a house), stock market performance, credit card debt, and other important events. I’ve done precisely this and they’ve already helped me to understand my finances more completely. Beyond this, they’ve also helped to reduce my overall anxiety around finances because uncertainty is a first class citizen of this approach.

Of course, beyond just extending these models, we also want to evaluate them and then use real-world data to inform them so that we can plan using a true posterior rather than just trying to draw inferences from our prior. I’ve chosen to do this personally by likelihood weighting my trajectories. This is commonly known as Approximate Bayesian Computation (ABC), and a followup post will detail how I used ABC by implementing a very special purpose probabilistic programming language.

[1] Please also note that any dollar figures quoted here are purely hypothetical due to the highly private nature of personal finances. This does not affect our modeling techniques, however.

Croissants and Macarons

Not all problems in life can be solved by machine learning. Sometimes you want to bake something extremely unhealthy and extremely delicious; and I’m fortunate to have the free time to pursue these desires. Recently I learned how to make, but not pronounce, croissants and macarons through lots of trial and error and some faithful taste testing by my wife and my coworkers.

The croissants were surprisingly time consuming when compared to what I saw on the great british baking show. Even still, they turned out great and even my boss who was born and raised in France approved:

Macarons were quite a good deal more difficult and technique driven. There were so many places where you could go wrong, from splitting the ganache, to over working the batter that they took a bit of practice to get right. Before Christmas I brought a bunch of mint-white chocolate macarons into the Forge.AI, and Ivan was kind enough to take some pictures in exchange for a couple:

Photo credit to Ivan Aguilar and his mustache.

NeurIPS 2018 Recap by Forge.AI

With quick reflexes and a fortunate server error, I was lucky enough to get a ticket to the 2018 Neural Information Processing Systems Conference (NeurIPS). It was with great excitement that I attended to represent Forge.AI this year. NeurIPS provides its attendees with a week of talks, demonstrations, and incredible networking opportunities. I was able to catch up with old friends, and meet new friends and potential collaborators. For those of you who weren’t lucky enough to score a ticket, I thought it would be useful to provide a collection of highlights from the conference.

Scalable Bayesian Inference

On my first day at NeurIPS, I was fortunate to attend a tutorial by Professor David Dunson from Duke University.

Professor Dunson gave a beautiful overview of techniques for scaling markov chain monte carlo (MCMC) to both large datasets and high model complexity. Optimal transport methods based on Barycenter calculation in Wasserstein space were discussed at length, with scaling results that look extremely promising and relevant to some of the inference tasks Forge.AI is tackling.

Professor Dunson opened a discussion about the high model complexity limit, coarsened Bayes (c-bayes), modular Bayes, and other techniques. In particular, the idea of c-Bayes is both philosophically disconcerting and aesthetically beautiful. I’ve personally always considered Bayes’ theorem to be on the same footing as Kepler’s law, so making minor modifications out of modeling convenience feels strange particularly because Bayes’ theorem provides a mechanism to have statistical strength from the observed data dominate the model structure when the signal is strong enough; and this modification down-weights that mechanism.

Of course, that’s not to say that I feel like this is a bad idea. Ultimately, the use of this theory makes parameter estimation possible without having to necessarily worry about small amounts of data-noise. It appears particularly convenient, especially when the data noise model isn’t available or easy to infer. I will have to explore the technique more to understand the settings where it’s preferable to use c-Bayes rather than explicitly model data-set noise, but I have a hunch that c-Bayes will be useful in knowledge graph construction tasks where I have a small amount of string noise (typos and abbreviations) without having to provide explicit string noise models.

Causal and Counterfactual Inference Techniques

Later the same day, Professor Susan Athey from Stanford University gave a wonderful overview of causal and counterfactual inference techniques. In her presentation, she discussed many of the algorithms and applications with very specific example use-cases in mind. This really helped to ground a difficult-to-pin down talk concretely and succinctly.

The professor’s talk made it painfully obvious how Forge.AI can combine knowledge graphs with counterfactual inference to perform AI guided speculative analyses. For instance, automatically answering the question “what would happen to Tesla’s stock price if there was an uprising in the Democratic Republic of the Congo?”.

Other Highlights

The rest of the week was filled with interesting talks, posters, and conversations. For instance, I ran into the Alexandria team at the Microsoft booth. They’re focusing on applying probabilistic programming to high precision knowledge graph construction. Both are projects close to my heart, and I loved hearing about how they combined them together. It was particularly exciting to learn how their token-based string model combined character-based likelihoods with token and dictionary based likelihoods to automatically learn format models. Using these models to achieve a precision greater than 95% would represent a true step forward in automated knowledge graph construction, and I can’t wait to read the paper.

I also attended the workshop on challenges and applications of AI in the Financial Services space. It was an absolute treat to learn about how the researchers in the the finance sector envision bringing in ML techniques. It was incredibly useful to see how important fairness, privacy, and explainability are in making day-to-day algorithmic decisions. As a data-provider with a prominent vertical in the financial services industry, it was useful to understand precisely what was meant by the term explainability. On multiple occasions, both the panel speakers made and the invited speakers the point that explainability was mostly desirable due to regulatory constraints and audit protections.

Even though everyone was in the same industry, explainability meant different things to different parts of the industry. There are many situations where individual decision makers are personally liable, and being able to provide the analyst with the ability to explain a potential poor decision by diagnosing a tool is highly desirable. Explainability in the credit card applications space tends to focus on generating adverse action codes to explain a decision, to provide the end user with a view of how they can remedy any defects with their applications.

Additionally, it was useful to hear a repeated emphasis on uncertainty predictions and the usefulness of understanding how to leverage uncertainty in making business decisions whether those decisions are underwriting a mortgage, offering a credit card, or making a trade. I found this personally validating because Forge.AI has constantly pushed to keep track of confidences and transparently report them, to inform our customers and their models of any downstream uncertainties that we may have.

NeurIPS was an amazing experience this year, and I look forward to returning next year with a larger Forge.AI cohort. Hopefully we’ll even be presenting some of our interesting work. We’ll probably have to write a bot to sign us all up so that we all can actually get tickets, but that sounds like a perfect task for a tech company like us. Maybe we’ll even get mugs next year.

Note: This post was originally published on the Forge.AI blog:

Knowledge Graphs for Enhanced Machine Reasoning at Forge.AI


Natural Language Understanding at an industrial scale requires an efficient, high quality knowledge graph for tasks such as entity resolution and reasoning. Without the ability to reason about information semantically, natural language understanding systems are only capable of shallow understanding. As the requirements of machine reasoning and machine learning tasks become more complex, more advanced knowledge graphs are required. Indeed, it has been previously observed that knowledge graphs are capable of producing impressive results when used to augment and accelerate machine reasoning tasks at small scales, but struggle at large scale due to a mix of data integrity and performance issues. Solving this problem and enabling machine driven semantic reasoning at scale is one of the foundational technological challenges that we are addressing at Forge.AI.

To understand the complexity of this task, it’s necessary to define what a knowledge graph is. There are many academic definitions floating around, but most are replete with jargon and impenetrable. Simply said, a knowledge graph is a graph where each vertex represents an entity and each edge is directed and represents a relationship between entities. Entities are typically proper nouns and concepts (e.g. Apple and Company, respectively), with the edges representing verbs (e.g. Is A). Together, these form large networks that encode semantic information. For example, encoding the fact that “Apple is a Company” in the knowledge graph is done by storing two vertices, one for “Apple” and one for “Company”, with a directed edge originating with Apple and pointing to Company of type “isA”. This is visualized in Figure 1:

Figure 1: Visualized simple knowledge graph representing the fact that "Apple is a Company"

A knowledge graph encodes many facts, each through the use of a directed edge. Each vertex can have many facts connected to it, making this ultimately a directed multigraph. This type of representation provides an intuitive way to reason about queries. For example, from the knowledge graph represented in Figure 1 we can reason about the question “Is apple a company?” by simply walking through the graph, starting at “Apple” and walking to “Company”, testing edges and concepts along the way. In production, knowledge graphs tend to be quite large and complex with millions or billions of edges. Such a large amount of knowledge allows us to use these graphs to easily reason about semantic connections for tasks such as enriching business relevant data and resolving entities. At Forge.AI, we perform these tasks as part of our NLP / NLU pipeline for extracting individual events from unstructured text into a machine-readable format.

With a working definition of a knowledge graph in hand, we will next explore some of the use cases that we’ve found for the knowledge graph here at Forge.AI. Then, we’ll explore the graph infrastructure to understand what powers these use-cases. Finally, we’ll discuss part of our road map to explore what’s next for Forge.AI and its knowledge graph.

Use Cases

It’s worth grounding our conversation of the knowledge graph in a few use-cases before we jump too deeply into a detailed discussion of the infrastructure, how it works, and where we’re going. In general, a knowledge graph can be used for a wide range of applications including entity resolution, dependency analysis, filtering, and machine reasoning. In the ensuing discussion, we will focus on entity disambiguation and dependency analysis, two of the many tasks that we use the knowledge graph for at Forge.AI.

Entity Disambiguation

While simple to state, the problem of entity disambiguation is one of the most frequent problems that we need to solve when reasoning about a document. While this problem is fairly straightforward to handle in cases where the relevant ontology is known and fully enumerated, it can quickly become difficult when that is not the case. To explore how we can handle this problem with the knowledge base, let’s consider the problem of determining which “Apple” is being referenced in the quote below:

“Though the company doesn’t break out individual unit sales by model, Apple says it sold 77.3 million iPhones — a decrease from the 78.2 million iPhones it sold in the same period in 2017.”

Obviously, this is “Apple” the corporation, not “apple” the type of fruit. How did our brains determine this? We used contextual clues! We know that the Apple Corporation sells the iPhone because the type of fruit is incapable of selling anything. Based on these contextual clues alone, we are able to perform this task nearly instantaneously using our reasoning.The ForgeAI knowledge graph works in the same way: when we seek to disambiguate an entity, we provide the knowledge graph with a set of co-located entities that provide the graph with the appropriate context. However, machine learning systems do not work like our brains do and for a machine learning system to reason with context, we need a knowledge graph. Our knowledge graph then searches for all versions of “Apple” on the full graph and constructs small graphs that include contextual information as can be seen in Figures 2 and 3. Note, this is a noisy string search that is capable of finding versions of the initial search term that may differ from the original string or contain the search string as a substring. We also keep a look up table of known aliases for each of our entities, where aliases can be things like CIK codes or ticker symbols.

Figure 2: Visualized excerpt from the Knowledge Graph that pertains to the entity Apple the fruit.

Figure 3: Visualized excerpt from the Knowledge Graph that pertains to the entity Apple, the consumer electronics corporation.

With these small graphs in hand, the knowledge graph then uses machine reasoning to determine which of the entities is truly being referenced. There are many strategies to doing this but we have found that a greedy algorithm which seeks to maximize the overlap between the contextual entities passed in and the small graphs under consideration is effective.

Dependency Analysis

Another major task that we’ve found the knowledge graph to be useful for is dependency analysis. That is, to determine the relationship between two or more entities. This is most useful when attempting to determine whether an extracted event is something that a customer would care about, given their stated interests. To make this concrete, let’s consider the following news story in regards to a customer that is interested in news events relating to Samsung:

“Russia’s Norilsk Nickel has teamed up with Russian Platinum to invest $4.4bn to develop mining projects in Siberia, which contains some of the world’s richest deposits of platinum and palladium. The two companies will form a joint venture to develop projects in the Taimyr Peninsula in Russia’s far north with an aim to become the world’s largest producer of the precious metals, they said Wednesday.”

It’s certainly not obvious to me how this story is connected to Samsung. The question at hand is to determine whether this news event is related to Samsung and, if so, the nature of that relation so we can determine whether or not to pass this event to our customer. We begin by constructing small graphs around each of the entities. With these graphs in hand, we then compute a path given Dijkstra’s algorithm between each of the marked endpoints. An example of such a path is given in Figure 4.

Figure 4: Visualized excerpt from the Knowledge Graph that pertains to the relationship between the Norilsk platinum group metals mine in Siberia, Russia and Samsung.

What we see in Figure 4 is that the knowledge graph believes that Iridium is a Platinum Group Metal, and that Platinum Group Metals are mined in Norilsk. We also see that the Knowledge Graph believes that Iridium is used in Organic Light Emitting Diodes (or OLEDs), which just happen to be used in Samsung phones. Therefore, this news event is likely relevant to our customer. In fact, this event is highly relevant to our customer’s interest in Samsung because Iridium is incredibly important to the production of OLED screens due to its ability to make a blue LED. Indeed, Samsung has even funded researchers at MIT and Harvard to explore alternatives to Iridium for OLED screens.

This type of dependency analysis is illustrative of the power of a well formed knowledge graph and it is critical for machine enabled semantic reasoning. It’s easy to imagine this type of dependency analysis having uses not only in the financial services industry, but also in work as wide ranging as supply chain risk assessment and nuclear nonproliferation applications – just to name a few.

Graph Infrastructure

In addition to standard graph features, we choose to endow each fact that is stored in the knowledge graph with the time at which the edge was added and a confidence for that edge. The time dependence intuitively follows from the observation that the totality of human knowledge grows and changes over time. Ultimately, this makes the graph dynamic, which is a natural feature of human knowledge itself.

There are a small number of facts that I’d be willing to bet my life on – something like Auston Matthews is a Toronto Maple Leaf – and a great many facts that I’d be willing to bet $20 dollars on – for example, the Boston Massacre happened in 1770. Both are true but, due to the amount of information that I’ve recently read, I know considerably more about the former than the latter and, therefore, am more confident about it. Motivated by this, we have designed our knowledge graph such that each edge has weights which we choose to interpret as confidences. This data enables us to capture the inherent uncertainty necessary to model a fast changing world and to reason about the validity of queries. By virtue of the graph being probabilistic, we are able to embrace true Bayesian reasoning as we attempt to evaluate a query, as well as provide query specific priors to up or down weight an assertion based on the origin (e.g. a company’s own statements about a new product release should be up-weighted over twitter rumors).

One of the most exciting engineering challenges of knowledge graphs is their size. It is not uncommon to have a knowledge graph with more than 1 billion facts and 50 million vertices; this can easily require hundreds of gigabytes of RAM. Even more concerning than the memory requirements is the computational cost of computing even basic graph properties such as the path length between vertices. We have taken two complementary approaches to ensure that our graph algorithms are as quick as possible. First, because our edges are interpreted as probabilities, it is possible to set a probability cutoff beyond which we are not interested in graph connections. This allows us to only consider graph algorithms over highly restricted subsets of the graph, which provides us with major algorithmic improvements. Second, we have engineered the data structure to remain as cache coherent as possible by representing our knowledge graph as a sparse three rank tensor in an attempt to optimize the per-fact throughput through the CPU.

We also have a clear route towards efficient parallelization by exploiting what we are terming the “galactic structure” of the graph. While this is not a general feature of all graphs, we have observed that there are highly connected clusters of vertices that are only weakly connected to one another. Intuitively, this makes sense. For example, consider domains such as the Toronto Maple Leafs and modern particle physics – there is little overlap between these fields and therefore no need to reason over a graph that contains both clusters of highly interconnected vertices when reasoning about Dave Keon, the Toronto Maple Leafs legend. This galactic structure provides us with a promising route towards efficient parallelization using commodity hardware.

Where are We Going?

We’ve just started to teach the knowledge graph and show it how to perform basic reasoning. The following are some of the many additional features that we are adding that will ensure the accuracy, robustness, and efficiency of the graph long into the future.

Probabilistic Reasoning

Giving the knowledge graph the ability to reason probabilistically about the validity of facts allows it to hold conflicting facts or hypotheses and evaluate them later in the presence of more evidence. This can additionally be used to evaluate nuance of queries. This can be achieved by using techniques such as softening the axiomatic constraints that power the machine reasoning engine and building ontology-specific Bayesian models. We anticipate that using these techniques should make our knowledge graph more resilient to internal errors.

Automatic Fact Checking

Of course, if we have a collection of facts that we intend to use as our internal source of truth to augment business data, we should ensure that this set of facts is correct. With our current knowledge graph size, we can perform this fact checking using a mix of manual spot checking and axiomatic constraint testing (e.g. a person can only be born in one country). This is the standard technique for evaluating the correctness of knowledge graphs. As with most machine learning tasks, this is incredibly person intensive and, therefore, expensive. Additionally, it’s difficult to scale this technique to large graphs. To address these issues, we’re excited to explore techniques related to hinge-loss Markov random fields that are directionally aware. In addition to being efficient, this allows us to look at a fact such as “Florida namedAfter Flo Rida” and swap the directionality, instead of having to first infer that we need to delete this edge and then infer that the reverse edge should be present.

Automatic Graph Enrichment

Because it’s simply not possible to have humans continually teach the knowledge graph, our system is being constructed to be capable of learning facts on its own. There are many ways to do this including: tracking unexplained queries, generalizing local and global graph features to infer new facts from patterns, and using semantic information. Intuitively, this might look like finding patterns such as “Companies tend to have a CEO” and one of the companies in our graph does not currently have a CEO. Therefore, we should enrich this region of the graph specifically relating to the specific company and the existence of the CEO. To achieve this, we are actively exploring modifications of techniques such as the path rank algorithm and graph embedding methods as well as information retrieval techniques from the internet and other sources. This is proving to be an exciting path of inquiry.

Graph Dynamics

Modeling the influence of specific edges on the connectivity of two marked vertices in a graph is fundamental to understanding network resilience. In the context of a knowledge graph, this provides us with information about the influence of this fact. Intuitively, if we imagine that the vertices in our graph are cities and the edges roads, with the edge weights corresponding to the width of those roads (e.g. 0.1 is a one lane road and 1.0 is a 6 lane super highway), then the time to travel between two different cities indicates the strength of their connection. With many alternative routes and many wide highways, we can say that those cities are tightly connected. Mathematically, the problem can be thought about in terms of a two point correlation function for a collection of random walks over the graph. These are discrete random walks whose dynamics can be modeled with a discrete Green’s function. By taking advantage of the connection between discrete Green’s functions on a graph of random topology and discrete Laplace equations, we’ve preliminarily found that it is possible to evaluate the influence of changing an edge. We’re excited to formalize and harden this connection and expose these measures to aid in producing more advanced models.

The knowledge graph at Forge.AI is a crucial element of our technology stack and it has exciting potential for further development. We look forward to sharing with you further insights in the coming months.

Note: This post was originally published on the Forge.AI blog:

Linking Python to C with CFFI

I hope that it is uncontroversial to state that Python is a great language that can suffer from occasional performance issues. This is especially true if Python is being used in heavy numerical computing environments like those at Gamalon. Gamalon is not the first to require using Python for numerical tasks. To meet this need, libraries like NumPy, SciPy, Pandas and others provide users with well tested implementations of most common numerical tasks. Most of these numerical tasks, such as matrix multiplication or special function evaluation among others, have reference implementations in C or Fortran that are linked to Python through many layers of indirection.

For rapid prototyping, this turns out to be a significant time saver, but what are the costs of these indirection layers? This can be answered by exploring the callgraph for the following code that simply evaluates the log of the probability density function of the Beta distribution with distributional parameters 1 and 2:

from random import randint
from scipy.stats import beta

N = 1000000
for i in range(N):
    a = beta.logpdf(i, 1, 2)

This simple script has the following callgraph (click to zoom in):


We clearly see from the callgraph that a significant portion of our compute time is spent manipulating array shape, constructing arrays, and performing other associated operations. In fact, the underlying math for the distribution doesn’t even make an appearance in the callgraph!

The story is certainly different if, instead of making repeated calls to the logpdf function with a scalar, we make a single call with the vector. In this situation, the call overhead of the array manipulation is swamped by the vectorized cost of the math itself. Overall runtime is reflected by:

Scalar Vector
45.7 s 0.147 s

Vectorization, then, is the key to writing performant NumPy and SciPy based code.

Unfortunately, our specific use-cases typically involve lots of scalar calls, and thus, we encountered significant overhead. Can this cost be optimized away?

Beyond just the overhead, what happens when the user wants to use PyPy or some other tracing JIT to optimize the rest of their Python code (e.g. dictionary operations and the like)? Many of these libraries are simply not compatible with PyPy given PyPy’s partial support of the C API. In fact, NumPy only recently passed all upstream tests when run under the PyPy interpreter, but PyPy was generally unable to provide any performance optimizations. Is it possible to still take advantage of the C and Fortran reference implementations in PyPy without rewriting them in Python?

Generally yes. This linking can be done with a tool called CFFI, or the C Foreign Function Interface, which provides light weight, PyPy compatible, Python bindings for C code. In the remainder, we will explore how to write the bindings, how to link those bindings to Python, and what their associated performance impacts are.

Writing and Building the C

For the sake of comparison, we will implement the Beta logpdf function in C. This function is given by:

\[logpdf(x; \alpha, \beta) = \log\left(\Gamma(\alpha + \beta)\right) - \log\left(\Gamma(\alpha)\right) - \log\left(\Gamma{\beta}\right) + (\alpha - 1) \log(x) + (\beta - 1) \log(1 - x)\]

which can be implemented in C like so:

double beta_logpdf(double a, double b, double x){
  double prefactor;
  prefactor = lgamma(a + b) - (lgamma(a) + lgamma(b));
  return prefactor + (a - 1.) * log(x) + (b - 1.) * log(x - 1.);

Next, we can then start to explore writing a CFFI compatible build script. The first place to start is by constructing the associated header file, beta.h, for the above function. The entire contents of the header file are given by:

double beta_logpdf(double a, double b, double x);

This looks just like a normal header file, because it is. This header file is used to tell CFFI which functions you want to link to Python. Therefore, you should only add function prototypes for the functions you actually want to link.

With the header file and implementation in hand, we next turn our attention to implementing the build script that links with the API-level out-of-line linking paradigm. Generally, the CFFI works by taking the desired C code, automatically generating a C extension for the code, and then building that.

The script begins by import CFFI and creating a new ffibuilder. The ffibuilder is the object that will take care of the code generation and compilation.

import cffi
ffibuilder = cffi.FFI()

With the ffibuilder defined, we need to tell the ffibuilder where to find the file containing the method prototype for beta_logpdf and its implementation.

sourcefile = os.path.join('.', 'beta.h')
source = os.path.join('.', 'beta.c')

Next, we need to read in our files and build them. We first read in the header file and use the header file to inform the set_source method which functions to import. We also need to provide any necessary extra information about how to perform the code generation and provide any extra compilation arguments. Because our example is plain, we only need to tell the ffibuilder what to name the resulting module, _beta in our case, where to find the module and sources, and how to compile it. While not expressly necessary, I try and compile with ‘-O3’, ‘-march=native’, ‘-ffast-math’ whenever it provides performance benefits. In this case, the move from -O2 to -O3 halves the run time.

with open(sourcefile) as f:

    '#include "{0}"'.format(sourcefile),
    extra_compile_args=['-O3', '-march=native', '-ffast-math'])

Finally, we build it. I prefer to build with verbose=True to ensure that I have all the information necessary to fix any problems that could occur.


Putting this all together, we have:

import os
import cffi

ffibuilder = cffi.FFI()
sourcefile = os.path.join('.', 'beta.h')
source = os.path.join('.', 'beta.c')
with open(sourcefile) as f:

    '#include "{0}"'.format(sourcefile),
    extra_compile_args=['-O3', '-march=native', '-ffast-math'])


which, when run, reports a successful build. The successful build gives us multiple files; _beta.c which is the emitted C extension module, the compiled object library, _beta.o, and `` which is the dynamic library that Python will actually load.

As a closing remark, it is possible to define the C inline as described in the CFFI documentation, but I have found that it gets tedious to do this for large C projects, like those that we use internally.

Linking with Python

Given the built module above, we can now import it into Python. The module will have the name prescribed by the first argument to the set_source method, which in our case was _beta. Because we’re only interested in the library itself and not any of the other FFI periphery, such as memory allocation, we only need to import lib. Importing only the function that we actually used, our import statement is simply from _beta.lib import beta_logpdf. Note, the name of the function in C is the same as that in the module.

Putting this all together, we can call the linked function with:

from _beta.lib import beta_logpdf

beta_logpdf(1, 2, 1)

Performance Tests

Running the scripts detailed in the Scripts section below, we see the following results.

Method Time [s] Slowdown (C Reference)
Reference C 0.0302 s -
SciPy Scalar 45.7 s 1510
SciPy Vector 0.147 s 4.87
CFFI in cPython 3.6 0.346 s 11.5
CFFI in PyPy3 0.0449 s 1.49

Where all timings were generated on my Mid-2017 Non-touchbar MacBook Pro with 2.3 GHz Intel Core i5. The SciPy scalar and vector results are exactly the same as those reported in the introduction. The reference C results were computed by using the beta_logpdf function reported above and timed using the system clock available in the C standard library. The C reference was compiled with GCC 7 with compiler flags -O3, -march=native, and -ffast-math. The reported CFFI numbers were computed using the exact build scripts linked above. We elected to test in both cPython 3.6 and PyPy3 to test the possible benefits that a tracing JIT can afford us in super simplistic scenarios like this. cPython 3.6 and PyPy3 are both from the bottle poured with Homebrew.

Interestingly enough, we observe that CFFI provides two order of magnitude improvement in the overhead for calling the logpdf function within cPython. This is already a drastic speedup, but we can do even better by running under PyPy, which provides only a factor of 2 slow-down versus the reference C.

What is the source of the remaining factor of two? We can explore the computational costs of an exceedingly simple function to further understand the overhead cost of the interpreter. To understand this, we will time a function that takes in three values and returns one:

double nop(double a, double b, double x){
    return x;

Linking in the method described above and then timing under both PyPy3 and cPython 3.6, observe the following timings:

Interpreter Time [s] Slowdown (PyPy Reference)
cPython 3.6 0.248 s 15.1
PyPy3 0.0164 s -

Interestingly enough, that means that, calling an empty function linked with CFFI from cPython costs on average 248 ns, while the equivalent tasks costs 16 ns in PyPy. We can then see that the majority of the computational benefit that we have observed from moving from cPython to PyPy is the more efficient C interface.

Closing Remarks

Motivated by Knuth’s oft-cited adage that “premature optimization is the root of all evil”, the general advice that is given in these situations is to write working code, profile it, and then optimize hotspots. For Python, one of the common approaches to such an optimization is to translate the hotspots to C and then link them to Python.

In the above, we have explored specific techniques for such an optimization pathway that also permits the user to use interpreters such as PyPy. This allows for using a JIT to optimize other aspects of the codebase, such as heavy dictionary access, or string computations.

Combining PyPy with a CFFI linked module, it is possible to obtain performance within an order of magnitude to C while still operating within Python. Amazingly enough, PyPy + CFFI out performs even vectorized SciPy in this situation! Granted, this is a sample size of 1 but it is at least encouraging.

Of course, CFFI is not a magic bullet. Performing optimizations of this kind can significantly limit the flexibility of certain aspects of your code-base, and require developers to be more deliberate about their use of interfaces. It also adds maintenance costs because developers will now need to be able to support both C and Python.


Note: The timings were updated to reflect an updated version of the underlying C function that matches the scipy behaviour better. Thank you @mgeier for pointing out this error!


The callgraph was generated with

python -m cProfile -o prof.out -f pstats prof.out | dot -Tsvg -o callgraph.svg

Run on cPython 3.6

from _beta.lib import beta_logpdf
from scipy.stats import beta
import time

N = 1000000
start = time.time()
for i in range(N):
    a = beta.logpdf(i, 1, 2)
end = time.time()
print(end - start)

start = time.time()
a = beta.logpdf(list(range(N)), 1, 2)
end = time.time()
print(end - start)

start = time.time()
for i in range(N):
    a = beta_logpdf(1, 2, i)
end = time.time()
print(end - start)

Run only under PyPy3

from _beta.lib import beta_logpdf
import time

start = time.time()
for i in range(N):
    a = beta_logpdf(1, 2, i)
end = time.time()
print(end - start)

Compiled with GCC 7 with command gcc -O3 -march=native -ffast-math -finline-functions beta.c -lm ```c #include #include #include

double nop(double a, double b, double x){ return x; }

double xlogy(double x, double y){ if(x == 0.0 || y == 0.0){ return 0.0; } return x * log(y); }

double beta_logpdf(double a, double b, double x){ double prefactor; prefactor = lgamma(a + b) - (lgamma(a) + lgamma(b)); return prefactor + xlogy(a - 1.0, x) + xlogy(b - 1.0, 1.0 - x); }

int main(void){ int N; clock_t start, end; double diff; N = 1000000; start = clock(); for(int i = 0; i < N; i++){ beta_logpdf(1.0, 2.0, i); } end = clock(); diff = (double)(end - start) / (CLOCKS_PER_SEC); printf(“Time in seconds %f”, diff); return 0; }```