500 Lines or Less
Dagoba: An In-Memory Graph Database

Dann enjoys building things, like programming languages, databases, distributed systems, communities of smart friendly humans, and pony castles with his two-year-old.

Prologue

A long time ago, when the world was still young, all data walked happily in single file. If you wanted your data to jump over a fence, you just set the fence down in its path and each datum jumped it in turn. Punch cards in, punch cards out. Life was easy and programming was a breeze.

Then came the random access revolution, and data grazed freely across the hillside. Herding data became a serious concern: if you can access any piece of data at any time, how do you know which one to pick next? Techniques were developed for corralling the data by forming links between items¹, marshaling groups of units into formation through their linking assemblage. Questioning data meant picking a sheep and pulling along everything connected to it.

Later programmers departed from this tradition, imposing a set of rules on how data would be aggregated². Rather than tying disparate data directly together they would cluster by content, decomposing data into bite-sized pieces, collected in pens and collared with name tags. Questions were posed declaratively, resulting in accumulating pieces of partially decomposed data (a state the relationalists refer to as "normal") into a frankencollection returned to the programmer.

For much of recorded history this relational model reigned supreme. Its dominance went unchallenged through two major language wars and countless skirmishes. It offered everything you could ask for in a model, for the small price of inefficiency, clumsiness and lack of scalability. For eons that was a price programmers were willing to pay. Then the internet happened.

The distributed revolution changed everything, again. Data broke free of spatial constraints and roamed from machine to machine. CAP-wielding theorists busted the relational monopoly, opening the door to new herding techniques—some of which hark back to the earliest attempts to domesticate random-access data. We're going to look at one of these, a style known as the graph database.

Take One

Within this chapter we're going to build a graph database³. As we build it we're going to explore the problem space, generate multiple solutions for our design decisions, compare those solutions to understand the tradeoffs between them, and finally choose the right solution for our system. A higher-than-usual priority is put on code compactness, but the process will otherwise mirror that used by software professionals since time immemorial. The purpose of this chapter is to teach this process. And to build a graph database⁴.

Using a graph database will allow us to solve some interesting problems in an elegant fashion. Graphs are a very natural data structure for exploring connections between things. A graph in this sense is a set of vertices and a set of edges; in other words, it's a bunch of dots connected by lines. And a database? A "data base" is like a fort for data. You put data in it and get data back out of it.

So what kinds of problems can we solve with a graph database? Well, suppose that you enjoy tracking ancestral trees: parents, grandparents, cousins twice removed, that kind of thing. You'd like to develop a system that allows you to make natural and elegant queries like "Who are Thor's second cousins once removed?" or "What is Freyja's connection to the Valkyries?"

A reasonable schema for this data structure would be to have a table of entities and a table of relationships. A query for Thor's parents might look like

But how do we extend that to grandparents? We need to do a subquery, or use some other type of vendor-specific extension to SQL. And by the time we get to second cousins once removed we're going to have a lot of SQL.

What would we like to write? Something both concise and flexible; something that models our query in a natural way and extends to other queries like it. second_cousins('Thor') is concise, but it doesn't give us any flexibility. The SQL above is flexible, but lacks concision.

Something like Thor.parents.parents.parents.children.children.children strikes a reasonably good balance. The primitives give us flexibility to ask many similar questions, but the query is concise and natural. This particular phrasing gives us too many results, as it includes first cousins and siblings, but we're going for gestalt here.

What's the simplest thing we can build that gives us this kind of interface? We could make a list of vertices and a list of edges, just like the relational schema, and then build some helper functions. It might look something like this:
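A minimal sketch of such helpers, assuming a small sample family tree in which each edge is a [parent, child] pair of scalar vertex values. The names V, E and parents, and the data itself, are illustrative assumptions:

```javascript
// Sketch only: the sample data and the names V, E and parents are assumptions.
// Each edge is a [parent, child] pair of scalar vertex values.
var V = [1, 2, 3, 4, 5, 6, 7];
var E = [[1, 2], [1, 3], [2, 4], [2, 5], [3, 6], [3, 7]];

// Collect the parent of every edge whose child is in the input list.
function parents(vertices) {
  var accumulator = [];
  for (var i = 0; i < E.length; i++) {
    var edge = E[i];
    if (vertices.indexOf(edge[1]) !== -1) {
      accumulator.push(edge[0]);
    }
  }
  return accumulator;
}
```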

The essence of the above function is to iterate over a list, evaluating some code for each item and building up an accumulator of results. That's not quite as clear as it could be, though, because the looping construct introduces some unnecessary complexity.

It'd be nice if there was a more specific looping construct designed for this purpose. As it happens, the reduce function does exactly that: given a list and a function, it evaluates the function for each element of the list, while threading the accumulator through each evaluation pass.

Given a list of vertices we reduce over the edges, adding an edge's parent to the accumulator if the edge's child is in our input list. The children function is identical, but examines the edge's parent to determine whether to add the edge's child.
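The reduce-based selectors just described might be sketched as follows. The edge data, the [parent, child] layout and the use of arrow functions, destructuring and includes are assumptions made for illustration:

```javascript
// Sketch only: E and the [parent, child] edge layout are assumptions.
var E = [[1, 2], [1, 3], [2, 4], [2, 5], [3, 6], [3, 7]];

// Thread an accumulator through every edge, keeping the edge's parent
// when its child is one of the input vertices.
var parents = (vertices) => E.reduce(
  (acc, [parent, child]) => vertices.includes(child) ? acc.concat(parent) : acc,
  []);

// Same shape, but examine the parent and keep the child.
var children = (vertices) => E.reduce(
  (acc, [parent, child]) => vertices.includes(parent) ? acc.concat(child) : acc,
  []);

// Grandchildren of vertex 1, written as nested calls.
var grandchildren = children(children([1]));
```

Composing the selectors means nesting the calls, so the innermost call is the first step of the walk.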

Those functions are valid JavaScript, but use a few features which browsers haven't implemented as of this writing. This translated version will work today:

It reads backwards and gets us lost in silly parens, but is otherwise pretty close to what we wanted. Take a minute to look at the code. Can you see any ways to improve it?

We're treating the edges as a global variable, which means we can only ever have one database at a time using these helper functions. That's pretty limiting.

We're also not using the vertices at all. What does that tell us? It implies that everything we need is in the edges array, which in this case is true: the vertex values are scalars, so they exist independently in the edges array. If we want to answer questions like "What is Freyja's connection to the Valkyries?" we'll need to add more data to the vertices, which means making them compound values, which means the edges array should reference vertices instead of copying their value.

The same holds true for our edges: they contain an "in" vertex and an "out" vertex⁵, but no elegant way to incorporate additional information. We'll need that to answer questions like "How many stepparents did Loki have?" or "How many children did Odin have before Thor was born?"

You don't have to squint very hard to tell that the code for our two selectors looks very similar, which suggests there may be a deeper abstraction from which they spring.

Build a Better Graph

Let's solve a few of the problems we've discovered. Having our vertices and edges be global constructs limits us to one graph at a time, but we'd like to have more. To solve this we'll need some structure. Let's start with a namespace.

We'll use an object as our namespace. An object in JavaScript is mostly just an unordered set of key/value pairs. We only have four basic data structures to choose from in JavaScript, so we'll be using this one a lot. (A fun question to ask people at parties is "What are the four basic data structures in JavaScript?")

Now we need some graphs. We can build these using a classic OOP pattern, but JavaScript offers us prototypal inheritance, which means we can build up a prototype object—we'll call it Dagoba.G—and then instantiate copies of that using a factory function. An advantage of this approach is that we can return different types of objects from the factory, instead of binding the creation process to a single class constructor. So we get some extra flexibility for free.

We'll accept two optional arguments: a list of vertices and a list of edges. JavaScript is rather lax about parameters, so all named parameters are optional and default to undefined if not supplied⁶. We will often have the vertices and edges before building the graph and use the V and E parameters, but it's also common to not have those at creation time and to build the graph up programmatically⁷.

Then we create a new object that has all of our prototype's strengths and none of its weaknesses. We build a brand new array (one of the other basic JS data structures) for our edges, another for the vertices, a new object called vertexIndex and an ID counter—more on those latter two later. (Think: Why can't we just put these in the prototype?)

Then we call addVertices and addEdges from inside our factory, so let's define those now.
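One way the namespace, prototype and factory could fit together is sketched below. Dagoba.G, vertexIndex, addVertices and addEdges are named in the text; Dagoba.graph, the autoid field name and the placeholder bodies of addVertex and addEdge are assumptions so that the sketch runs on its own:

```javascript
// Sketch only. Dagoba.graph and the placeholder addVertex/addEdge are assumptions.
var Dagoba = {};                          // the namespace
Dagoba.G = {};                            // the prototype

Dagoba.graph = function (V, E) {          // the factory
  var graph = Object.create(Dagoba.G);
  graph.edges = [];                       // fresh per-graph state
  graph.vertices = [];
  graph.vertexIndex = {};                 // lookup by _id
  graph.autoid = 1;                       // ID counter
  if (Array.isArray(V)) graph.addVertices(V);
  if (Array.isArray(E)) graph.addEdges(E);
  return graph;
};

// The plural forms just delegate to the singular ones.
Dagoba.G.addVertices = function (vs) { vs.forEach(this.addVertex.bind(this)); };
Dagoba.G.addEdges = function (es) { es.forEach(this.addEdge.bind(this)); };

// Placeholders only; they skip all validation.
Dagoba.G.addVertex = function (vertex) { this.vertices.push(vertex); };
Dagoba.G.addEdge = function (edge) { this.edges.push(edge); };
```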

Okay, that was too easy—we're just passing off the work to addVertex and addEdge. We should define those now too.

If the vertex doesn't already have an _id property we assign it one using our autoid.⁸ If the _id already exists on a vertex in our graph then we reject the new vertex. Wait, when would that happen? And what exactly is a vertex?

In a traditional object-oriented system we would expect to find a vertex class, which all vertices would be an instance of. We're going to take a different approach and consider as a vertex any object containing the three properties _id, _in and _out. Why is that? Ultimately, it comes down to giving Dagoba control over which data is shared with the host application.

If we create some Dagoba.Vertex instance inside the addVertex function, our internal data will never be shared with the host application. If we accept a Dagoba.Vertex instance as the argument to our addVertex function, the host application could retain a pointer to that vertex object and manipulate it at runtime, breaking our invariants.

So if we create a vertex instance object, we're forced to decide up front whether we will always copy the provided data into a new object—potentially doubling our space usage—or allow the host application unfettered access to the database objects. There's a tension here between performance and protection, and the right balance depends on your specific use case.

Duck typing on the vertex's properties allows us to make that decision at run time, by either deep copying⁹ the incoming data or using it directly as a vertex¹⁰. We don't always want to put the responsibility for balancing safety and performance in the hands of the user, but because these two sets of use cases diverge so widely the extra flexibility is important.

Now that we've got our new vertex we'll add it to our graph's list of vertices, add it to the vertexIndex for efficient lookup by _id, and add two additional properties to it: _out and _in, which will both become lists of edges¹¹.

First we find both vertices which the edge connects, then reject the edge if it's missing either vertex. We'll use a helper function to log an error on rejection. All errors flow through this helper function, so we can override its behavior on a per-application basis. We could later extend this to allow onError handlers to be registered, so the host application could link in its own callbacks without overwriting the helper. We might allow such handlers to be registered per-graph, per-application, or both, depending on the level of flexibility required.

Then we'll add our new edge to both vertices' edge lists: the edge's out vertex's list of out-side edges, and the in vertex's list of in-side edges.
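The vertex and edge insertion steps described above can be sketched together. The property names _id, _in, _out, vertexIndex and the error helper follow the text; findVertexById, the error wording and the simplified factory are assumptions:

```javascript
// Sketch only. findVertexById and the error messages are assumptions.
var Dagoba = { G: {} };
Dagoba.error = function (msg) { console.log(msg); return false; };

Dagoba.graph = function () {
  var graph = Object.create(Dagoba.G);
  graph.edges = []; graph.vertices = []; graph.vertexIndex = {}; graph.autoid = 1;
  return graph;
};

Dagoba.G.findVertexById = function (id) { return this.vertexIndex[id]; };

Dagoba.G.addVertex = function (vertex) {
  if (!vertex._id) {
    vertex._id = this.autoid++;                  // assign an ID if missing
  } else if (this.findVertexById(vertex._id)) {
    return Dagoba.error('A vertex with that ID already exists');
  }
  this.vertices.push(vertex);
  this.vertexIndex[vertex._id] = vertex;         // index for lookup by _id
  vertex._out = []; vertex._in = [];             // edge lists
  return vertex._id;
};

Dagoba.G.addEdge = function (edge) {
  edge._in = this.findVertexById(edge._in);      // swap IDs for vertex references
  edge._out = this.findVertexById(edge._out);
  if (!(edge._in && edge._out)) {
    return Dagoba.error("That edge's " + (edge._in ? 'out' : 'in') + " vertex wasn't found");
  }
  edge._out._out.push(edge);                     // out vertex's out-side edges
  edge._in._in.push(edge);                       // in vertex's in-side edges
  this.edges.push(edge);
};
```

Because vertices are duck typed, the host application can pass plain objects and decide for itself whether to copy them first.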

Enter the Query

There are really only two parts to this system: the part that holds the graph and the part that answers questions about the graph. The part that holds the graph is pretty simple, as we've seen. The query part is a little trickier.

A program is a series of steps. Each step is like a pipe in a pipeline—a piece of data comes in one end, is transformed in some fashion, and goes out the other end. Our pipeline doesn't quite work like that, but it's a good first approximation.

Each step in our program can have state, and query.state is a list of per-step states whose indexes correlate with the list of steps in query.program.

A gremlin is a creature that travels through the graph doing our bidding. A gremlin might be a surprising thing to find in a database, but they trace their heritage back to Tinkerpop's Blueprints, and the Gremlin and Pacer query languages. They remember where they've been and allow us to find answers to interesting questions.

Remember that question we wanted to answer about Thor's second cousins once removed? We decided Thor.parents.parents.parents.children.children.children was a pretty good way of expressing that. Each parents or children instance is a step in our program. Each of those steps contains a reference to its pipetype, which is the function that performs that step's operation.

Each of the steps is a function call, and so they can take arguments. The interpreter passes the step's arguments to the step's pipetype function, so in the query g.v('Thor').out(2, 3) the out pipetype function would receive [2, 3] as its list of arguments.

Each step is a composite entity, combining the pipetype function with the arguments to apply to that function. We could combine the two into a partially applied function at this stage, instead of using a tuple¹², but then we'd lose some introspective power that will prove helpful later.

We'll use a small set of query initializers that generate a new query from a graph. Here's one that starts most of our examples: the v method. It builds a new query, then uses our add helper to populate the initial query program. This makes use of the vertex pipetype, which we'll look at soon.

Note that [].slice.call(arguments) is JS parlance for "please pass me an array of this function's arguments". You would be forgiven for supposing that arguments is already an array, since it behaves like one in many situations, but it is lacking much of the functionality we utilize in modern JavaScript arrays.
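A minimal sketch of the query object, the add helper and the v initializer might look like this. query.program, query.state, add and v come from the text; Dagoba.Q, Dagoba.query and storing the pipetype by the name 'vertex' are assumptions:

```javascript
// Sketch only. Dagoba.Q, Dagoba.query and the 'vertex' step name are assumptions.
var Dagoba = { G: {}, Q: {} };

Dagoba.query = function (graph) {
  var query = Object.create(Dagoba.Q);
  query.graph = graph;       // the graph being queried
  query.state = [];          // state for each step, indexed like program
  query.program = [];        // the list of steps to take
  return query;
};

// A step is a [pipetype, args] tuple, kept separate for introspection.
Dagoba.Q.add = function (pipetype, args) {
  this.program.push([pipetype, args]);
  return this;               // returning the query allows chaining
};

// Query initializer: build a query and seed it with a vertex step.
Dagoba.G.v = function () {
  var query = Dagoba.query(this);
  query.add('vertex', [].slice.call(arguments));   // arguments -> real array
  return query;
};
```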

The Problem with Being Eager

Before we look at the pipetypes themselves we're going to take a diversion into the exciting world of execution strategy. There are two main schools of thought: the Call By Value clan, also known as eager beavers, are strict in their insistence that all arguments be evaluated before the function is applied. Their opposing faction, the Call By Needians, are content to procrastinate until the last possible moment before doing anything—they are, in a word, lazy.

JavaScript, being a strict language, will process each of our steps as they are called. We would then expect the evaluation of g.v('Thor').out().in() to first find the Thor vertex, then find all vertices connected to it by outgoing edges, and from each of those vertices finally return all vertices they are connected to by inbound edges.

In a non-strict language we would get the same result—the execution strategy doesn't make much difference here. But what if we added a few additional calls? Given how well-connected Thor is, our g.v('Thor').out().out().out().in().in().in() query may produce many results—in fact, because we're not limiting our vertex list to unique results, it may produce many more results than we have vertices in our total graph.

We're probably only interested in getting a few unique results out, so we'll change the query a bit: g.v('Thor').out().out().out().in().in().in().unique().take(10). Now our query produces at most 10 results. What happens if we evaluate this eagerly, though? We're still going to have to build up septillions of results before returning only the first 10.
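To make the cost difference concrete, here is an illustration that uses generators as a stand-in for lazy pipes. This is not Dagoba's mechanism; the toy graph (every vertex n has neighbours 2n and 2n+1) and all function names are assumptions:

```javascript
// Illustration only: generators stand in for lazy pipes.
var work = 0;                                      // counts vertex expansions
function expand(n) { work++; return [n * 2, n * 2 + 1]; }

// Eager: build every level in full, then keep the first few results.
function eagerTake(start, depth, limit) {
  var level = [start];
  for (var d = 0; d < depth; d++) {
    level = level.reduce(function (acc, n) { return acc.concat(expand(n)); }, []);
  }
  return level.slice(0, limit);
}

// Lazy: produce one result at a time, expanding only when asked.
function* lazyWalk(n, depth) {
  if (depth === 0) { yield n; return; }
  var next = expand(n);
  for (var i = 0; i < next.length; i++) yield* lazyWalk(next[i], depth - 1);
}

function lazyTake(start, depth, limit) {
  var out = [];
  if (limit <= 0) return out;
  for (var n of lazyWalk(start, depth)) {
    out.push(n);
    if (out.length >= limit) break;                // stop pulling once satisfied
  }
  return out;
}
```

Both functions return the same values, but the lazy one expands only the vertices on the way to the results it actually hands back.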

All graph databases have to support a mechanism for doing as little work as possible, and most choose some form of non-strict evaluation to do so. Since we're building our own interpreter, the lazy evaluation of our program is possible, but we may have to contend with some consequences.

Ramifications of Evaluation Strategy on our Mental Model

We would like to retain the eager, step-by-step model for our users, because it's easier to reason about, but as we've seen we can no longer use that model for the implementation. Having users think in a model that differs from the actual implementation is a source of much pain. A leaky abstraction is a small-scale version of this; in the large it can lead to frustration, cognitive dissonance and ragequits.

Our case is nearly optimal for this deception, though: the answer to any query will be the same, regardless of execution model. The only difference is the performance. The tradeoff is between having all users learn a more complicated model prior to using the system, or forcing a subset of users to transfer from the simple model to the complicated model in order to better reason about query performance.

In our case this tradeoff makes sense. For most uses queries will return results fast enough that users needn't be concerned with optimizing their query structure or learning the deeper model. Those who do need to are the users writing advanced queries over large datasets, and they are also likely the users best equipped to transition to a new model. Additionally, our hope is that there is only a small increase in difficulty imposed by using the simple model before learning the more complex one.

We'll go into more detail on this new model soon, but in the meantime here are some highlights to keep in mind during the next section:

Pipetypes

Pipetypes make up the core of our system. Once we understand how each one works, we'll have a better basis for understanding how they're invoked and sequenced together in the interpreter.

The pipetype's function is added to the list of pipetypes, and then a new method is added to the query object. Every pipetype must have a corresponding query method. That method adds a new step to the query program, along with its arguments.

When we evaluate g.v('Thor').out('parent').in('parent') the v call returns a query object, the out call adds a new step and returns the query object, and the in call does the same. This is what enables our method-chaining API.

Note that adding a new pipetype with the same name replaces the existing one, which allows runtime modification of existing pipetypes. What's the cost of this decision? What are the alternatives?

If we can't find a pipetype, we generate an error and return the default pipetype, which acts like an empty conduit: if a message comes in one side, it gets passed out the other.

See those underscores? We use those to label params that won't be used in our function. Most other pipetypes will use all three parameters, and have all three parameter names. This allows us to distinguish at a glance which parameters a particular pipetype relies on.
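The registration and lookup behaviour described in this section could be sketched as below. The text does not name these functions, so Dagoba.addPipetype, Dagoba.getPipetype, Dagoba.fauxPipetype and Dagoba.Pipetypes are all assumed names:

```javascript
// Sketch only. All function names here are assumptions; the text names none of them.
var Dagoba = { Q: {}, Pipetypes: {} };
Dagoba.error = function (msg) { console.log(msg); return false; };
Dagoba.Q.add = function (pipetype, args) {
  this.program.push([pipetype, args]);
  return this;
};

Dagoba.addPipetype = function (name, fun) {
  Dagoba.Pipetypes[name] = fun;              // same name replaces the old one
  Dagoba.Q[name] = function () {             // matching query method
    return this.add(name, [].slice.call(arguments));
  };
};

Dagoba.getPipetype = function (name) {
  var pipetype = Dagoba.Pipetypes[name];
  if (!pipetype) Dagoba.error('Unrecognized pipetype: ' + name);
  return pipetype || Dagoba.fauxPipetype;    // fall back to the empty conduit
};

// Empty conduit: pass a gremlin through, or ask upstream for one.
// The underscore names mark the unused graph and args parameters.
Dagoba.fauxPipetype = function (_, __, maybe_gremlin) {
  return maybe_gremlin || 'pull';
};
```

The two unused parameters get distinct underscore names here so the function also parses in strict mode.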

Vertex

Most pipetypes we meet will take a gremlin and produce more gremlins, but this particular pipetype generates gremlins from just a string. Given a vertex ID it returns a single new gremlin. Given a query it will find all matching vertices, and yield one new gremlin at a time until it has worked through them.

We first check to see if we've already gathered matching vertices, otherwise we try to find some. If there are any vertices, we'll pop one off and return a new gremlin sitting on that vertex. Each gremlin can carry around its own state, like a journal of where it's been and what interesting things it has seen on its journey through the graph. If we receive a gremlin as input to this step we'll copy its journal for the exiting gremlin.
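A sketch of a vertex pipetype along these lines, assuming the (graph, args, gremlin, state) parameter order and a graph that exposes findVertices. The names vertexPipetype and makeGremlin are assumptions:

```javascript
// Sketch only. makeGremlin and vertexPipetype are assumed names.
function makeGremlin(vertex, state) {
  return { vertex: vertex, state: state || {} };
}

function vertexPipetype(graph, args, gremlin, state) {
  if (!state.vertices) {
    state.vertices = graph.findVertices(args);   // state initialization
  }
  if (!state.vertices.length) {
    return 'done';                               // nothing left to emit
  }
  var vertex = state.vertices.pop();             // mutates state in place
  // Hand the incoming gremlin's state (if any) to the new gremlin.
  return makeGremlin(vertex, (gremlin || {}).state);
}
```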

Note that we're directly mutating the state argument here, and not passing it back. An alternative would be to return an object instead of a gremlin or signal, and pass state back that way. That complicates our return value, and creates some additional garbage¹³. If JS allowed multiple return values it would make this option more elegant.

We would still need to find a way to deal with the mutations, though, as the call site maintains a reference to the original variable. What if we had some way to determine whether a particular reference is "unique"—that it is the only reference to that object?

If we know a reference is unique then we can get the benefits of immutability while avoiding expensive copy-on-write schemes or complicated persistent data structures. With only one reference we can't tell whether the object has been mutated or a new object has been returned with the changes we requested: "observed immutability" is maintained¹⁴.

There are a couple of common ways of determining this: in a statically typed system we might make use of uniqueness types¹⁵ to guarantee at compile time that each object has only one reference. If we had a reference counter¹⁶—even just a cheap two-bit sticky counter—we could know at runtime that an object only has one reference and use that knowledge to our advantage.

JavaScript doesn't have either of these facilities, but we can get almost the same effect if we're really, really disciplined. Which we will be. For now.

In-N-Out

Walking the graph is as easy as ordering a burger. These two lines set up the in and out pipetypes for us.

The simpleTraversal function returns a pipetype handler that accepts a gremlin as its input, and spawns a new gremlin each time it's queried. Once those gremlins are gone, it sends back a 'pull' request to get a new gremlin from its predecessor.

The first couple of lines handle the differences between the in version and the out version. Then we're ready to return our pipetype function, which looks quite a bit like the vertex pipetype we just saw. That's a little surprising, since this one takes in a gremlin whereas the vertex pipetype creates gremlins ex nihilo.

Yet we can see the same beats being hit here, with the addition of a query initialization step. If there's no gremlin and we're out of available edges then we pull. If we have a gremlin but haven't yet set state then we find any edges going the appropriate direction and add them to our state. If there's a gremlin but its current vertex has no appropriate edges then we pull. And finally we pop off an edge and return a freshly cloned gremlin on the vertex to which it points.
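The four beats above can be sketched as follows. simpleTraversal and state.edges come from the text; makeGremlin, gotoVertex, the reduced filterEdges, the _label property and the graph's findOutEdges/findInEdges methods are assumptions:

```javascript
// Sketch only. Helper names and the graph's find*Edges methods are assumptions.
function makeGremlin(vertex, state) { return { vertex: vertex, state: state || {} }; }
function gotoVertex(gremlin, vertex) { return makeGremlin(vertex, gremlin.state); }
function filterEdges(filter) {
  return function (edge) { return !filter || edge._label === filter; };
}

function simpleTraversal(dir) {
  var findMethod = dir === 'out' ? 'findOutEdges' : 'findInEdges';
  var edgeList = dir === 'out' ? '_in' : '_out';   // vertex at the edge's far end

  return function (graph, args, gremlin, state) {
    if (!gremlin && (!state.edges || !state.edges.length)) {
      return 'pull';                               // query initialization
    }
    if (!state.edges || !state.edges.length) {     // state initialization
      state.gremlin = gremlin;
      state.edges = graph[findMethod](gremlin.vertex).filter(filterEdges(args[0]));
    }
    if (!state.edges.length) {
      return 'pull';                               // no suitable edges here
    }
    var vertex = state.edges.pop()[edgeList];      // follow one edge
    return gotoVertex(state.gremlin, vertex);      // clone the gremlin onto it
  };
}
```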

Glancing at this code we see !state.edges.length repeated in each of the three clauses. It's tempting to refactor this to reduce the complexity of those conditionals. There are two issues keeping us from doing so.

One is relatively minor: the third !state.edges.length means something different from the first two, since state.edges has been changed between the second and third conditional. This actually encourages us to refactor, because having the same label mean two different things inside a single function usually isn't ideal.

The second is more serious. This isn't the only pipetype function we're writing, and we'll see these ideas of query initialization and/or state initialization repeated over and over. When writing code, there's always a balancing act between structured qualities and unstructured qualities. Too much structure and you pay a high cost in boilerplate and abstraction complexity. Too little structure and you'll have to keep all the plumbing minutiae in your head.

In this case, with a dozen or so pipetypes, the right choice seems to be to style each of the pipetype functions as similarly as possible, and label the constituent pieces with comments. So we resist our impulse to refactor this particular pipetype, because doing so would reduce uniformity, but we also resist the urge to engineer a formal structural abstraction for query initialization, state initialization, and the like. If there were hundreds of pipetypes that latter choice would probably be the right one: the complexity cost of the abstraction is constant, while the benefit accrues linearly with the number of units. When handling that many moving pieces, anything you can do to enforce regularity among them is helpful.

Property

Let's pause for a moment to consider an example query based on the three pipetypes we've seen. We can ask for Thor's grandparents like this¹⁷:

But this is a common enough operation that we'd prefer to write something more like:

Plus this way the property pipe is an integral part of the query, instead of something appended after. This has some interesting benefits, as we'll soon see.

Our query initialization here is trivial: if there's no gremlin, we pull. If there is a gremlin, we'll set its result to the property's value. Then the gremlin can continue onward. If it makes it through the last pipe its result will be collected and returned from the query. Not all gremlins have a result property. Those that don't return their most recently visited vertex.

Note that if the property doesn't exist we return false instead of the gremlin, so property pipes also act as a type of filter. Can you think of a use for this? What are the tradeoffs in this design decision?
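A sketch of a property pipetype with this behaviour, assuming the (graph, args, gremlin, state) parameter order. The name propertyPipetype is an assumption:

```javascript
// Sketch only. propertyPipetype is an assumed name.
function propertyPipetype(graph, args, gremlin, state) {
  if (!gremlin) return 'pull';                       // query initialization
  gremlin.result = gremlin.vertex[args[0]];          // record the property value
  return gremlin.result == null ? false : gremlin;   // false filters the gremlin out
}
```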

Unique

If we want to collect all Thor's grandparents' grandchildren—his cousins, his siblings, and himself—we could do a query like this: g.v('Thor').out().out().in().in().run(). That would give us many duplicates, however. In fact there would be at least four copies of Thor himself. (Can you think of a time when there might be more?)

To resolve this we introduce a new pipetype called 'unique'. Our new query produces output in one-to-one correspondence with the grandchildren:

A unique pipe is purely a filter: it either passes the gremlin through unchanged or it tries to pull a new gremlin from the previous pipe.

We initialize by trying to collect a gremlin. If the gremlin's current vertex is in our cache, then we've seen it before so we try to collect a new one. Otherwise, we add the gremlin's current vertex to our cache and pass it along. Easy peasy.
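A sketch of such a filter, using the step's state object as the cache and keying it by the vertex's _id. The name uniquePipetype and the choice of key are assumptions:

```javascript
// Sketch only. uniquePipetype and keying the cache by _id are assumptions.
function uniquePipetype(graph, args, gremlin, state) {
  if (!gremlin) return 'pull';                     // query initialization
  if (state[gremlin.vertex._id]) return 'pull';    // seen before: ask for another
  state[gremlin.vertex._id] = true;                // remember this vertex
  return gremlin;
}
```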

Filter

We've seen two simplistic ways of filtering, but sometimes we need more elaborate constraints. What if we want to find all of Thor's siblings whose weight is greater than their height¹⁸? This query would give us our answer:

If we want to know which of Thor's siblings survive Ragnarök we can pass filter an object:

If the filter's first argument is not an object or function then we trigger an error, and pass the gremlin along. Pause for a minute, and consider the alternatives. Why would we decide to continue the query once an error is encountered?
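A sketch of a filter pipetype that accepts either a function or an object, and that reports a bad filter but keeps going. objectFilter and Dagoba.error are named in the text; their bodies here, the name filterPipetype and the error wording are assumptions:

```javascript
// Sketch only. Helper bodies and filterPipetype are assumptions.
var Dagoba = {};
Dagoba.error = function (msg) { console.log(msg); return false; };

// True when every key in filter matches the same key on thing.
function objectFilter(thing, filter) {
  for (var key in filter) {
    if (thing[key] !== filter[key]) return false;
  }
  return true;
}

function filterPipetype(graph, args, gremlin, state) {
  if (!gremlin) return 'pull';                     // query initialization
  var filter = args[0];
  if (filter && typeof filter === 'object') {      // object: match properties
    return objectFilter(gremlin.vertex, filter) ? gremlin : 'pull';
  }
  if (typeof filter !== 'function') {              // bad filter: report, keep going
    Dagoba.error('Filter is not a function or object: ' + filter);
    return gremlin;
  }
  return filter(gremlin.vertex, gremlin) ? gremlin : 'pull';
}
```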

There are two reasons this error might arise. The first involves a programmer typing in a query, either in a REPL or directly in code. When run, that query will produce results, and also generate a programmer-observable error. The programmer then corrects the error to further filter the set of results produced. Alternatively, the system could display only the error and produce no results, and fixing all errors would allow results to be displayed.

The second possibility is that the filter is being applied dynamically at run time. This is a much more important case, because the person invoking the query is not necessarily the author of the query code. Because this is on the web, our default rule is to always show results, and to never break things. It is usually preferable to soldier on in the face of trouble rather than succumb to our wounds and present the user with a grisly error message.

For those occasions when showing too few results is better than showing too many, Dagoba.error can be overridden to throw an error, thereby circumventing the natural control flow.

Take

We don't always want all the results at once. Sometimes we only need a handful of results; say we want a dozen of Thor's contemporaries, so we walk all the way back to the primeval cow Auðumbla:

Without the take pipe that query could take quite a while to run, but thanks to our lazy evaluation strategy the query with the take pipe is very efficient.

Sometimes we just want one at a time: we'll process the result, work with it, and then come back for another one. This pipetype allows us to do that as well.

Our query can function in an asynchronous environment, allowing us to collect more results as needed. When we run out, an empty array is returned.

We initialize state.taken to zero if it doesn't already exist. JavaScript has implicit coercion, but coerces undefined into NaN, so we have to be explicit here¹⁹.

Then when state.taken reaches args[0] we return 'done', sealing off the pipes before us. We also reset the state.taken counter, allowing us to repeat the query later.

We do those two steps before query initialization to handle the cases of take(0) and take()²⁰. Then we increment our counter and return the gremlin.
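The order of operations described above can be sketched like this, assuming the (graph, args, gremlin, state) parameter order. The name takePipetype is an assumption:

```javascript
// Sketch only. takePipetype is an assumed name.
function takePipetype(graph, args, gremlin, state) {
  state.taken = state.taken || 0;          // undefined + 1 would be NaN
  if (state.taken === args[0]) {
    state.taken = 0;                       // reset so the query can be rerun
    return 'done';                         // seal off the pipes before us
  }
  if (!gremlin) return 'pull';             // query initialization
  state.taken++;
  return gremlin;
}
```

In this sketch a step with no argument never equals the counter, so it passes every gremlin through.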

As

These next four pipetypes work as a group to allow more advanced queries. This one just allows you to label the current vertex. We'll use that label with the next two pipetypes.

After initializing the query, we ensure the gremlin's local state has an as parameter. Then we set a property of that parameter to the gremlin's current vertex.

Merge

Once we've labeled vertices we can extract them using merge. If we want Thor's parents, grandparents and great-grandparents we can do something like this:

We map over each argument, looking for it in the gremlin's list of labeled vertices. If we find it, we clone the gremlin to that vertex. Note that only gremlins that make it to this pipe are included in the merge—if Thor's mother's parents aren't in the graph, she won't be in the result set.
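Labeling and merging might be sketched together as follows. The gremlin's as label map comes from the text; asPipetype, mergePipetype and makeGremlin are assumed names, and remembering the gremlin in state.gremlin is a choice made for this sketch so that later calls without a gremlin can still emit the remaining vertices:

```javascript
// Sketch only. Function names and state.gremlin are assumptions.
function makeGremlin(vertex, state) { return { vertex: vertex, state: state || {} }; }

function asPipetype(graph, args, gremlin, state) {
  if (!gremlin) return 'pull';                     // query initialization
  gremlin.state.as = gremlin.state.as || {};       // make sure the label map exists
  gremlin.state.as[args[0]] = gremlin.vertex;      // label the current vertex
  return gremlin;
}

function mergePipetype(graph, args, gremlin, state) {
  var pending = state.vertices && state.vertices.length;
  if (!gremlin && !pending) return 'pull';         // query initialization
  if (!pending) {                                  // state initialization
    var labeled = gremlin.state.as || {};
    state.gremlin = gremlin;
    state.vertices = args
      .map(function (label) { return labeled[label]; })
      .filter(Boolean);                            // drop labels never set
  }
  if (!state.vertices.length) return 'pull';       // nothing labeled on this path
  return makeGremlin(state.vertices.pop(), state.gremlin.state);
}
```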

Except

We've already seen cases where we would like to say "Give me all Thor's siblings who are not Thor". We can do that with a filter:

But there are also queries that would be difficult to try to filter. What if we wanted Thor's uncles and aunts? How would we filter out his parents? It's easy with as and except²¹:

Here we're checking whether the current vertex is equal to the one we stored previously. If it is, we skip it.
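That check could be sketched as below, assuming labels are stored in the gremlin's as map as in the previous sections. The name exceptPipetype is an assumption:

```javascript
// Sketch only. exceptPipetype is an assumed name.
function exceptPipetype(graph, args, gremlin, state) {
  if (!gremlin) return 'pull';                     // query initialization
  var labeled = gremlin.state.as || {};
  if (gremlin.vertex === labeled[args[0]]) {
    return 'pull';                                 // same vertex as the label: skip
  }
  return gremlin;
}
```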

Back

Some of the questions we might ask involve checking further into the graph, only to return later to our point of origin if the answer is in the affirmative. Suppose we wanted to know which of Fjörgynn's daughters had children with one of Bestla's sons?

We're using the Dagoba.gotoVertex helper function to do all real work here. Let's take a look at that and some other helpers now.

Helpers

The pipetypes above rely on a few helpers to do their jobs. Let's take a quick look at those before diving into the interpreter.

Gremlins

Gremlins are simple creatures: they have a current vertex, and some local state. So to make a new one we just need to make an object with those two things.

Any object that has a vertex property and a state property is a gremlin by this definition, so we could just inline the constructor, but wrapping it in a function allows us to add new properties to all gremlins in a single place.

We can also take an existing gremlin and send it to a new vertex, as we saw in the back pipetype and the simpleTraversal function.

Note that this function actually returns a brand new gremlin: a clone of the old one, sent to our desired destination. That means a gremlin can sit on a vertex while its clones are sent out to explore many other vertices. This is exactly what happens in simpleTraversal.
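The two gremlin helpers, plus a back pipetype that leans on them, might look like this. Dagoba.gotoVertex is named in the text; Dagoba.makeGremlin and backPipetype are assumed names:

```javascript
// Sketch only. Dagoba.makeGremlin and backPipetype are assumed names.
var Dagoba = {};

// A gremlin is just a current vertex plus some local state.
Dagoba.makeGremlin = function (vertex, state) {
  return { vertex: vertex, state: state || {} };
};

// Returns a new gremlin on the target vertex; the original stays put.
// In this sketch the clone shares the original's state object.
Dagoba.gotoVertex = function (gremlin, vertex) {
  return Dagoba.makeGremlin(vertex, gremlin.state);
};

// Send the gremlin back to a vertex it labeled earlier.
function backPipetype(graph, args, gremlin, state) {
  if (!gremlin) return 'pull';                     // query initialization
  return Dagoba.gotoVertex(gremlin, gremlin.state.as[args[0]]);
}
```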

As an example of possible enhancements, we could add a bit of state to keep track of every vertex the gremlin visits, and add new pipetypes to take advantage of those paths.

Finding

The vertex pipetype uses the findVertices function to collect a set of initial vertices from which to begin our query.

This function receives its arguments as a list. If the first one is an object it passes it to searchVertices, allowing queries like:

Otherwise, if there are arguments it gets passed to findVerticesByIds, which handles queries like g.v('Thor', 'Odin').run().

If there are no arguments at all, then our query looks like g.v().run(). This isn't something you'll want to do frequently with large graphs, especially since we're slicing the vertex list before returning it. We slice because some call sites manipulate the returned list directly by popping items off as they work through them. We could optimize this use case by cloning at the call site, or by avoiding those manipulations. (We could keep a counter in state instead of popping.)

Note the use of vertexIndex here. Without that index we'd have to go through each vertex in our list one at a time to decide if it matched the ID—turning a constant time operation into a linear time one, and any \(O(n)\) operations that directly rely on it into \(O(n^2)\) operations.
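The three cases can be sketched on a tiny stand-in graph. The graph shape (a vertices array plus a vertexIndex object keyed by ID), the species field, and the inline property matching are assumptions of this sketch; the real searchVertices delegates to objectFilter.

```javascript
// A tiny stand-in graph; vertexIndex maps ID -> vertex for constant-time lookup.
const vertices = [
  { _id: 'Odin', species: 'Aesir' },
  { _id: 'Thor', species: 'Aesir' },
  { _id: 'Ymir', species: 'Jotunn' },
];
const graph = {
  vertices: vertices,
  vertexIndex: Object.fromEntries(vertices.map(v => [v._id, v])),
};

// Property search: keep vertices whose fields match every key in the filter.
function searchVertices(graph, filter) {
  return graph.vertices.filter(v =>
    Object.keys(filter).every(key => v[key] === filter[key]));
}

// ID lookup: one index hit per ID; unknown IDs are dropped.
function findVerticesByIds(graph, ids) {
  return ids.map(id => graph.vertexIndex[id]).filter(Boolean);
}

// Dispatch on the argument list, as the text describes.
function findVertices(graph, args) {
  if (typeof args[0] === 'object' && args[0] !== null)
    return searchVertices(graph, args[0]);
  if (args.length === 0)
    return graph.vertices.slice();          // copy: callers may pop from it
  return findVerticesByIds(graph, args);
}

const byId = findVertices(graph, ['Thor', 'Odin']);
const byProperty = findVertices(graph, [{ species: 'Aesir' }]);
const everything = findVertices(graph, []);
```

Here byId holds Thor and Odin in the order requested, byProperty holds the two Aesir, and everything is a copy that can be popped without disturbing graph.vertices.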

The searchVertices function uses the objectFilter helper on every vertex in the graph. We'll look at objectFilter in the next section, but in the meantime, can you think of a way to search through the vertices lazily?

Filtering

We saw that simpleTraversal uses a filtering function on the edges it encounters. It's a simple function, but powerful enough for our purposes.

The first case is no filter at all: g.v('Odin').in().run() traverses all edges to Odin.

The second case filters on the edge's label: g.v('Odin').in('parent').run() traverses those edges with a label of 'parent'.

The third case accepts an array of labels: g.v('Odin').in(['parent', 'spouse']).run() traverses both parent and spouse edges.
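The three cases above can be sketched as a function that builds an edge predicate. The name filterEdges and the _label property are assumptions here, and any other kind of filter is simply rejected in this sketch.

```javascript
// Hypothetical edge filter covering the three cases above.
// Edges are assumed to carry their label in a _label property.
function filterEdges(filter) {
  return function (edge) {
    if (!filter) return true;                        // no filter: every edge passes
    if (typeof filter === 'string')
      return edge._label === filter;                 // single label must match
    if (Array.isArray(filter))
      return filter.includes(edge._label);           // any of several labels
    return false;                                    // other filters: out of scope here
  };
}

const edges = [{ _label: 'parent' }, { _label: 'spouse' }, { _label: 'rival' }];
const all = edges.filter(filterEdges());
const parents = edges.filter(filterEdges('parent'));
const family = edges.filter(filterEdges(['parent', 'spouse']));
```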

The Interpreter's Nature

We've arrived at the top of the narrative mountain, ready to receive our prize: the interpreter. The code is actually fairly compact, but the model has a bit of subtlety.

We compared programs to pipelines earlier, and that's a good mental model for writing queries. As we saw, though, we need a different model for the actual implementation. That model is more like a Turing machine than a pipeline: there's a read/write head that sits over a particular step. It "reads" the step, changes its "state", and then moves either right or left.

Reading the step means evaluating the pipetype function. As we saw above, each of those functions accepts as input the entire graph, its own arguments, maybe a gremlin, and its own local state. As output it provides a gremlin, false, or a signal of 'pull' or 'done'. This output is what our quasi-Turing machine reads in order to change the machine's state.

That state comprises just two variables: one to record steps that are 'done', and another to record the results of the query. Those are updated, and then either the machine head moves or the query finishes and the result is returned.

We've now described all the state in our machine. We'll have a list of results that starts empty:

We need a place to store the most recent step's output, which might be a gremlin—or it might be nothing—so we'll call it maybe_gremlin:

And finally we'll need a program counter to indicate the position of the read/write head.

Except... wait a second. How are we going to get lazy ²²? The traditional way of building a lazy system out of an eager one is to store parameters to function calls as "thunks" instead of evaluating them. You can think of a thunk as an unevaluated expression. In JS, which has first-class functions and closures, we can create a thunk by wrapping a function and its arguments in a new anonymous function which takes no arguments:
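A small illustration of the idea, with hypothetical names (makeThunk, sum). The call counter is there only to show that nothing runs until the thunk is forced.

```javascript
// Wrap a function and its arguments in a zero-argument function (a thunk).
function makeThunk(fun, ...args) {
  return function () { return fun(...args); };
}

let calls = 0;
function sum(a, b, c) { calls++; return a + b + c; }

const thunk = makeThunk(sum, 1, 2, 3);   // nothing is evaluated yet
const callsBefore = calls;               // sum has not run
const value = thunk();                   // forcing the thunk runs sum(1, 2, 3)
```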

None of the thunks are invoked until one is actually needed, which usually implies some type of output is required: in our case the result of a query. Each time the interpreter encounters a new function call, we wrap it in a thunk. Recall our original formulation of a query: children(children(children(parents(parents(parents([8])))))). Each of those layers would be a thunk, wrapped up like an onion.

There are a couple of tradeoffs with this approach: one is that spatial performance becomes more difficult to reason about, because of the potentially vast thunk graphs that can be created. Another is that our program is now expressed as a single thunk, and we can't do much with it at that point.

This second point isn't usually an issue, because of the phase separation between when our compiler runs its optimizations and when all the thunking occurs at runtime. In our case we don't have that advantage. Because we're using method chaining to implement a fluent interface ²³, if we also used thunks to achieve laziness we would thunk each new method as it is called, which means that by the time we get to run() we would have only a thunk as our input, and no way to optimize our query.

Interestingly, our fluent interface hides another difference between our query language and regular programming languages. The query g.v('Thor').in().out().run() could be rewritten as run(out(in(v(g, 'Thor')))) if we weren't using method chaining. In JS we would first process g and 'Thor', then v, then in, out and run, working from the inside out. In a language with non-strict semantics we would work from the outside in, processing each consecutive nested layer of arguments only as needed.

So if we start evaluating our query at the end of the statement, with run, and work our way back to v('Thor'), calculating results only as needed, then we've effectively achieved non-strictness. The secret is in the linearity of our queries. Branches complicate the process graph and also introduce opportunities for duplicate calls, which require memoization to avoid wasted work. The simplicity of our query language means we can implement an equally simple interpreter based on our linear read/write head model.

In addition to allowing runtime optimizations, this style has many other benefits related to the ease of instrumentation: history, reversibility, stepwise debugging, query statistics. All these are easy to add dynamically because we control the interpreter and have left it as a virtual machine evaluator instead of reducing the program to a single thunk.

Interpreter, Unveiled

Here max is just a constant, and step, state, and pipetype cache information about the current step. We've entered the driver loop, and we won't stop until the last step is done.

To handle the 'pull' case we first set maybe_gremlin ²⁴ to false. We're overloading our 'maybe' here by using it as a channel to pass the 'pull' and 'done' signals, but once one of those signals is sucked out we go back to thinking of this as a proper 'maybe'.

If the step before us isn't 'done' ²⁵ we'll move the head backward and try again. Otherwise, we mark ourselves as 'done' and let the head naturally fall forward.

Handling the 'done' case is even easier: set maybe_gremlin to false and mark this step as 'done'.

We're done with the current step, and we've moved the head to the next one. If we're at the end of the program and maybe_gremlin contains a gremlin, we'll add it to the results, set maybe_gremlin to false and move the head back to the last step in the program.

This is also the initialization state, since pc starts as max. So we start here and work our way back, and end up here again at least once for each final result the query returns.

We're out of the driver loop now: the query has ended, the results are in, and we just need to process and return them. If any gremlin has its result set we'll return that, otherwise we'll return the gremlin's final vertex. Are there other things we might want to return? What are the tradeoffs here?
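The walkthrough above can be condensed into a runnable sketch of the driver loop. This is a simplified reading of the prose, not the chapter's listing: the two toy pipetypes (source and even), the standalone run function, and the decision to return only vertices are all assumptions made to keep the example self-contained.

```javascript
// Toy pipetypes (hypothetical stand-ins for Dagoba's). Each takes
// (graph, args, maybeGremlin, state) and returns a gremlin, 'pull', or 'done'.
const pipetypes = {
  // Emits one gremlin per item in args, then signals 'done'.
  source(graph, args, gremlin, state) {
    if (!state.items) state.items = args.slice();
    if (!state.items.length) return 'done';
    return { vertex: state.items.shift(), state: {} };
  },
  // Passes along gremlins on even "vertices"; asks for more input otherwise.
  even(graph, args, gremlin) {
    if (!gremlin) return 'pull';
    return gremlin.vertex % 2 === 0 ? gremlin : 'pull';
  },
};

function run(program, graph) {
  const max = program.length - 1;   // index of the last step
  const states = [];                // per-step local state
  const results = [];
  let maybeGremlin = false;         // a gremlin, a signal string, or false
  let done = -1;                    // steps at or before this index are finished
  let pc = max;                     // the head starts at the END of the program

  while (done < max) {
    const [name, args] = program[pc];
    const state = (states[pc] = states[pc] || {});
    maybeGremlin = pipetypes[name](graph, args, maybeGremlin, state);

    if (maybeGremlin === 'pull') {  // this step wants more input
      maybeGremlin = false;
      if (pc - 1 > done) { pc--; continue; }  // ask the previous step
      done = pc;                    // previous step is done, so this one is too
    }
    if (maybeGremlin === 'done') {  // this step has nothing more to give
      maybeGremlin = false;
      done = pc;
    }

    pc++;                           // move the head forward
    if (pc > max) {                 // fell off the end of the program
      if (maybeGremlin) results.push(maybeGremlin);
      maybeGremlin = false;
      pc--;                         // step back onto the last step
    }
  }
  return results.map(gremlin => gremlin.vertex);
}

const evens = run([['source', [1, 2, 3, 4]], ['even', []]], null);
// evens is [2, 4]
```

Tracing the head by hand is a useful exercise: it starts on even, is sent back to source by a 'pull', and returns to the end of the program once for each result.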

Query Transformers

Now we have a nice compact interpreter for our query programs, but we're still missing something. Every modern DBMS comes with a query optimizer as an essential part of the system. For non-relational databases, optimizing our query plan rarely yields the exponential speedups seen in their relational cousins ²⁶, but it's still an important aspect of database design.

What's the simplest thing we could do that could reasonably be called a query optimizer? Well, we could write little functions for transforming our query programs before we run them. We'll pass a program in as input and get a different program back out as output.

Now we can add query transformers to our system. A query transformer is a function that accepts a program and returns a program, plus a priority level. Higher priority transformers are placed closer to the front of the list. We're ensuring fun is a function, because we're going to evaluate it later ²⁷.

We'll assume there won't be an enormous number of transformer additions, and walk the list linearly to add a new one. We'll leave a note in case this assumption turns out to be false—a binary search is much more time-optimal for long lists, but adds a little complexity and doesn't really speed up short lists.

To run these transformers we're going to inject a single line of code into the top of our interpreter:

We'll use that to call this function, which just passes our program through each transformer in turn:
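A minimal sketch of the transformer registry and the pass that applies it. The names (transformers, addTransformer, transform) and the use of a thrown TypeError are assumptions of this sketch; programs are assumed to be lists of [pipetypeName, args] pairs.

```javascript
// Hypothetical registry of transformers, kept sorted by descending priority.
const transformers = [];

function addTransformer(fun, priority) {
  if (typeof fun !== 'function')
    throw new TypeError('Invalid transformer function');
  let i = 0;
  // Linear walk to the insertion point; a binary search would suit long lists.
  while (i < transformers.length && !(priority > transformers[i].priority)) i++;
  transformers.splice(i, 0, { priority: priority, fun: fun });
}

// Pass the program through each transformer in turn, highest priority first.
function transform(program) {
  return transformers.reduce((acc, t) => t.fun(acc), program);
}

addTransformer(program => program.concat([['low', []]]), 1);
addTransformer(program => program.concat([['high', []]]), 10);
const transformed = transform([]);
// transformed is [['high', []], ['low', []]]: the priority-10 transformer ran first
```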

Up until this point, our engine has traded performance for simplicity, but one of the nice things about this strategy is that it leaves doors open for global optimizations that may have been unavailable if we had opted to optimize locally as we designed the system.

Optimizing a program can often increase complexity and reduce the elegance of the system, making it harder to reason about and maintain. Breaking abstraction barriers for performance gains is one of the more egregious forms of optimization, but even something seemingly innocuous like embedding performance-oriented code into business logic makes maintenance more difficult.

In light of that, this type of "orthogonal optimization" is particularly appealing. We can add optimizers in modules or even user code, instead of having them tightly coupled to the engine. We can test them in isolation, or in groups, and with the addition of generative testing we could even automate that process, ensuring that our available optimizers play nicely together.

We can also use this transformer system to add new functionality unrelated to optimization. Let's look at a case of that now.

Aliases

Making a query like g.v('Thor').out().in() is quite compact, but is this Thor's siblings or his mates? Neither interpretation is fully satisfying. It'd be nicer to say what we mean: either g.v('Thor').parents().children() or g.v('Thor').children().parents().

We can use query transformers to make aliases with just a couple of extra helper functions:

We're adding a new name for an existing step, so we'll need to create a query transformer that converts the new name to the old name whenever it's encountered. We'll also need to add the new name as a method on the main query object, so it can be pulled into the query program.

If we could capture missing method calls and route them to a handler function then we might be able to run this transformer with a lower priority, but there's currently no way to do that. Instead we will run it with a high priority of 100 so the aliased methods are added before they are invoked.

We call another helper to merge the incoming step's arguments with the alias's default arguments. If the incoming step is missing an argument then we'll use the alias's argument for that slot.

We can also start to specialize our data model a little more, by labeling each edge between a parent and child as a 'parent' edge. Then our aliases would look like this:
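A sketch of how such an alias could be expressed as a query transformer. The helper names (mergeDefaults, aliasTransformer) are hypothetical, and registering the new method on the query object is left out; the sketch only shows the rewriting and the merging of default arguments.

```javascript
// Fill empty argument slots in the incoming step from the alias's defaults.
function mergeDefaults(args, defaults) {
  const merged = args.slice();
  defaults.forEach((value, i) => {
    if (typeof merged[i] === 'undefined') merged[i] = value;
  });
  return merged;
}

// Build a transformer that rewrites steps named newname into oldname steps.
// A program is assumed to be a list of [pipetypeName, args] pairs.
function aliasTransformer(newname, oldname, defaults) {
  defaults = defaults || [];
  return function (program) {
    return program.map(step =>
      step[0] === newname
        ? [oldname, mergeDefaults(step[1], defaults)]
        : step);
  };
}

// 'parents' follows outgoing edges labeled 'parent'; 'children' follows incoming ones.
const expandParents = aliasTransformer('parents', 'out', ['parent']);
const expandChildren = aliasTransformer('children', 'in', ['parent']);

const program = [['vertex', ['Thor']], ['parents', []], ['children', []]];
const rewritten = expandChildren(expandParents(program));
// rewritten is [['vertex', ['Thor']], ['out', ['parent']], ['in', ['parent']]]
```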

Now we can add edges for spouses, step-parents, or even jilted ex-lovers. If we enhance our addAlias function we can introduce new aliases for grandparents, siblings, or even cousins:

That cousins alias is kind of cumbersome. Maybe we could expand our addAlias function to allow ourselves to use other aliases in our aliases, and call it like this:

We've introduced a bit of a pickle, though: while our addAlias function is resolving an alias it also has to resolve other aliases. What if parents called some other alias, and while we were resolving cousins we had to stop to resolve parents and then resolve its aliases and so on? What if one of parents' aliases ultimately called cousins?

This brings us into the realm of dependency resolution²⁸, a core component of modern package managers. There are a lot of fancy tricks for choosing ideal versions, tree shaking, general optimizations and the like, but the basic idea is fairly simple. We're going to make a graph of all the dependencies and their relationships, and then try to find a way to line up the vertices while making all the arrows go from left to right. If we can, then this particular sorting of the vertices is called a 'topological ordering', and we've proven that our dependency graph has no cycles: it is a Directed Acyclic Graph (DAG). If we fail to do so then our graph has at least one cycle.

On the other hand, we expect that our queries will generally be rather short (100 steps would be a very long query) and that we'll have a reasonably low number of transformers. Instead of fiddling around with DAGs and dependency management we could return 'true' from the transform function if anything changed, and then run it until it stops being productive. This requires each transformer to be idempotent, but that's a useful property for transformers to have. What are the pros and cons of these two pathways?
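The "run it until it stops being productive" option might look like the sketch below. Everything here is an assumption made for illustration: transformers return an object with the new program and a changed flag (rather than a bare true), and a pass limit stands in for cycle detection.

```javascript
// Apply every transformer repeatedly until a full pass changes nothing.
// maxPasses guards against aliases that refer to each other in a cycle.
function transformUntilStable(program, transformers, maxPasses) {
  for (let pass = 0; pass < maxPasses; pass++) {
    let changed = false;
    for (const transformer of transformers) {
      const result = transformer(program);
      program = result.program;
      changed = changed || result.changed;
    }
    if (!changed) return program;
  }
  throw new Error('Transformers did not settle; is there an alias cycle?');
}

// Builds a renaming transformer that reports whether it changed anything.
function rename(from, to) {
  return function (program) {
    let changed = false;
    const next = program.map(step => {
      if (step[0] !== from) return step;
      changed = true;
      return [to, step[1]];
    });
    return { program: next, changed: changed };
  };
}

// 'folks' is an alias for 'parents', which is itself an alias for 'out'.
// The order is deliberately awkward, so a second pass is needed.
const aliases = [rename('parents', 'out'), rename('folks', 'parents')];
const settled = transformUntilStable([['folks', []]], aliases, 10);
// settled is [['out', []]]
```

Two aliases that point at each other never settle, so the pass limit turns an infinite loop into an error.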

Performance

All production graph databases share a particular performance characteristic: graph traversal queries are constant time with respect to total graph size ²⁹. In a non-graph database, asking for the list of someone's friends can require time proportional to the number of entries, because in the naive worst case you have to look at every entry. This means that if a query over ten entries takes a millisecond, then a query over ten billion entries will take more than eleven days. Your friend list would arrive faster if sent by Pony Express ³⁰!

To alleviate this dismal performance most databases index over oft-queried fields, which turns an \(O(n)\) search into an \(O(\log n)\) search. This gives considerably better search performance, but at the cost of some write performance and a lot of space—indices can easily double the size of a database. Careful balancing of the space/time tradeoffs of indices is part of the perpetual tuning process for most databases.

Graph databases sidestep this issue by making direct connections between vertices and edges, so graph traversals are just pointer jumps; no need to scan through every item, no need for indices, no extra work at all. Now finding your friends has the same price regardless of the total number of people in the graph, with no additional space cost or write time cost. One downside to this approach is that the pointers work best when the whole graph is in memory on the same machine. Effectively sharding a graph database across multiple machines is still an active area of research ³¹.

We can see this at work in the microcosm of Dagoba if we replace the functions for finding edges. Here's a naive version that searches through all the edges in linear time. It's similar to our very first implementation, but uses all the structures we've since built.

We can add an index for edges, which gets us most of the way there with small graphs but has all the classic indexing issues for large ones.
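To make the contrast concrete, here is a sketch of the naive scan next to the pointer-following version on a tiny hand-built graph. The vertex and edge shapes (_in, _out, _label) and the helper names are assumptions of this sketch.

```javascript
// A tiny hand-built graph: each vertex keeps lists of its own edges.
const odin = { _id: 'Odin', _in: [], _out: [] };
const thor = { _id: 'Thor', _in: [], _out: [] };
const baldr = { _id: 'Baldr', _in: [], _out: [] };
const edges = [];

// Link both endpoints directly to the edge, so traversal needs no search.
function addEdge(outVertex, inVertex, label) {
  const edge = { _out: outVertex, _in: inVertex, _label: label };
  outVertex._out.push(edge);
  inVertex._in.push(edge);
  edges.push(edge);
}
addEdge(thor, odin, 'parent');
addEdge(baldr, odin, 'parent');

// Naive: scan every edge in the graph, so cost grows with total edge count.
function findInEdgesNaive(vertex) {
  return edges.filter(edge => edge._in._id === vertex._id);
}

// Index-free adjacency: follow the pointers already stored on the vertex.
function findInEdgesDirect(vertex) {
  return vertex._in;
}

const naive = findInEdgesNaive(odin);
const direct = findInEdgesDirect(odin);
```

Both calls find the same two edges into Odin; only the naive one has to look at edges that have nothing to do with him.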

Serialization

Having a graph in memory is great, but how do we get it there in the first place? We saw that our graph constructor can take a list of vertices and edges and create a graph for us, but once the graph has been built how do we get the vertices and edges back out?

Our natural inclination is to do something like JSON.stringify(graph), which produces the terribly helpful error "TypeError: Converting circular structure to JSON". During the graph construction process the vertices were linked to their edges, and the edges are all linked to their vertices, so now everything refers to everything else. So how can we extract our nice neat lists again? JSON replacer functions to the rescue.

The JSON.stringify function takes a value to stringify, but it also takes two additional parameters: a replacer function and a whitespace number ³³. The replacer allows you to customize how the stringification proceeds.

We need to treat the vertices and edges a bit differently, so we're going to manually merge the two sides into a single JSON string.

The only difference between them is what they do when a cycle is about to be formed: for vertices, we skip the edge list entirely. For edges, we replace each vertex with its ID. That gets rid of all the cycles we created while building the graph.
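The two replacers and the manual merge can be sketched as follows. The function names, the _in/_out property names, and the "V"/"E" keys in the output are assumptions of this sketch.

```javascript
// Build a two-vertex graph whose vertices and edge refer to each other.
const odin = { _id: 'Odin', _in: [], _out: [] };
const thor = { _id: 'Thor', _in: [], _out: [] };
const edge = { _out: thor, _in: odin, _label: 'parent' };
thor._out.push(edge);
odin._in.push(edge);
const graph = { vertices: [odin, thor], edges: [edge] };

// For vertices: drop the edge lists entirely.
function cleanVertex(key, value) {
  return key === '_in' || key === '_out' ? undefined : value;
}

// For edges: replace each endpoint vertex with its ID.
function cleanEdge(key, value) {
  return key === '_in' || key === '_out' ? value._id : value;
}

// Stringify each side with its own replacer, then join the pieces by hand.
function jsonify(graph) {
  return '{"V":' + JSON.stringify(graph.vertices, cleanVertex) +
         ',"E":' + JSON.stringify(graph.edges, cleanEdge) + '}';
}

const json = jsonify(graph);
// {"V":[{"_id":"Odin"},{"_id":"Thor"}],"E":[{"_out":"Thor","_in":"Odin","_label":"parent"}]}
```

Calling JSON.stringify(graph) on the same object without a replacer throws the circular-structure TypeError.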

We're manually manipulating JSON in Dagoba.jsonify, which generally isn't recommended as the JSON format is rather persnickety. Even in a dose this small it's easy to miss something and hard to visually confirm correctness.

We could merge the two replacer functions into a single function, and use that new replacer function over the whole graph by doing JSON.stringify(graph, my_cool_replacer). This frees us from having to manually massage the JSON output, but the resulting code may be quite a bit messier. Try it yourself and see if you can come up with a well-factored solution that avoids hand-coded JSON. (Bonus points if it fits in a tweet.)

Persistence

Persistence is usually one of the trickier parts of a database: disks are relatively safe but slow. Batching writes, making them atomic, journaling—these are difficult to make both fast and correct.

Fortunately, we're building an in-memory database, so we don't have to worry about any of that! We may, though, occasionally want to save a copy of the database locally for fast restart on page load. We can use the serializer we just built to do exactly that. First let's wrap it in a helper function:

In JavaScript an object's toString function is called whenever that object is coerced into a string. So if g is a graph, then g+'' will be the graph's serialized JSON string.
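A small demonstration of that coercion. The object g below is a stand-in rather than a real Dagoba graph, and its serialize method is a placeholder for the serializer.

```javascript
// A stand-in graph object; serialize() is a placeholder for the real serializer.
const g = {
  vertices: [{ _id: 'Thor' }],
  serialize: function () { return JSON.stringify({ V: this.vertices, E: [] }); },
  toString: function () { return this.serialize(); },
};

const viaConcat = g + '';       // string concatenation calls toString
const viaTemplate = `${g}`;     // so does template-literal interpolation
// both are '{"V":[{"_id":"Thor"}],"E":[]}'
```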

The fromString function isn't part of the language specification, but it's handy to have around.

Now we'll use those in our persistence functions. The toString function is hiding—can you spot it?

We preface the name with a faux namespace to avoid polluting the localStorage properties of the domain, as it can get quite crowded in there. There's also usually a low storage limit, so for larger graphs we'd probably want to use a Blob of some sort.
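A sketch of the persistence pair, written against any object that offers the Web Storage getItem and setItem methods so it can run outside a browser. The function signatures, the 'DAGOBA::' prefix, the default name, and the in-memory stand-in are all assumptions of this sketch.

```javascript
// Save a graph under a namespaced key. The storage coerces the graph to a
// string, which is where the graph's toString method gets called.
function persist(graph, name, storage) {
  storage.setItem('DAGOBA::' + (name || 'graph'), graph);
}

// Fetch the serialized graph back; returns null if nothing was saved.
function depersist(name, storage) {
  return storage.getItem('DAGOBA::' + (name || 'graph'));
}

// In-memory stand-in that coerces values to strings as localStorage does.
function makeFakeStorage() {
  const items = new Map();
  return {
    setItem: (key, value) => { items.set(key, String(value)); },
    getItem: key => (items.has(key) ? items.get(key) : null),
  };
}

const storage = makeFakeStorage();
const graph = { toString: () => '{"V":[],"E":[]}' };   // placeholder graph
persist(graph, 'family', storage);
const restored = depersist('family', storage);          // '{"V":[],"E":[]}'
```

In a browser the same functions could be handed window.localStorage, and the string returned by depersist would then go through fromString to rebuild the graph.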

There are also potential issues if multiple browser windows from the same domain are persisting and depersisting simultaneously. The localStorage space is shared between those windows, and they're potentially on different event loops, so there's the possibility of one carelessly overwriting the work of another. The spec says there should be a mutex required for read/write access to localStorage, but it's inconsistently implemented between different browsers, and even with it a simple implementation like ours could still encounter issues.

If we wanted our persistence implementation to be multi-window–concurrency aware, then we could make use of the storage events that are fired when localStorage is changed to update our local graph accordingly.

Updates

Our out pipetype copies the vertex's out-going edges and pops one off each time it needs one. Building that new data structure takes time and space, and pushes more work onto the memory manager. We could have instead used the vertex's out-going edge list directly, keeping track of our place with a counter variable. Can you think of a problem with that approach?

If someone deletes an edge we've visited while we're in the middle of a query, that would change the size of our edge list, and we'd then skip an edge because our counter would be off. To solve this we could lock the vertices involved in our query, but then we'd either lose our capacity to regularly update the graph, or the ability to have long-lived query objects responding to requests for more results on-demand. Even though we're in a single-threaded event loop, our queries can span multiple asynchronous re-entries, which means concurrency concerns like this are a very real problem.
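The skipped edge is easy to reproduce. In this sketch the edges are plain strings and the deletion happens inline in the middle of the walk, standing in for an update that arrives between two asynchronous re-entries of a query; the function names are hypothetical.

```javascript
// Walk a live edge list with a counter, deleting an already-visited edge midway.
function walkLive(edges) {
  const seen = [];
  for (let i = 0; i < edges.length; i++) {
    seen.push(edges[i]);
    if (edges[i] === 'b') edges.splice(0, 1);   // someone deletes visited edge 'a'
  }
  return seen;
}

// Walk a private copy instead; later deletions cannot shift our position.
function walkCopy(edges) {
  const seen = [];
  const copy = edges.slice();
  while (copy.length) {
    const edge = copy.shift();
    seen.push(edge);
    if (edge === 'b') edges.splice(0, 1);       // same deletion, on the original list
  }
  return seen;
}

const live = walkLive(['a', 'b', 'c', 'd']);    // ['a', 'b', 'd']: 'c' was skipped
const copied = walkCopy(['a', 'b', 'c', 'd']);  // ['a', 'b', 'c', 'd']
```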

So we'll pay the performance price to copy the edge list. There's still a problem, though, in that long-lived queries may not see a completely consistent chronology. We will traverse every edge belonging to a vertex at the moment we visit it, but we visit vertices at different clock times during our query. Suppose we save a query like var q = g.v('Odin').children().children().take(2) and then call q.run() to gather two of Odin's grandchildren. Some time later we need to pull another two grandchildren, so we call q.run() again. If Odin has had a new grandchild in the intervening time, we may or may not see it, depending on whether the parent vertex was visited the first time we ran the query.

One way to fix this non-determinism is to change the update handlers to add versioning to the data. We'll then change the driver loop to pass the graph's current version in to the query, so we're always seeing a consistent view of the world as it existed when the query was first initialized. Adding versioning to our database also opens the door to true transactions, and automated rollback/retries in an STM-like fashion.

Future Directions

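Gathering several generations of ancestors by chaining out steps, labeling each one with as, and merging the labels at the end is pretty clumsy, and doesn't scale well: what if we wanted six layers of ancestors? Or to look through an arbitrary number of ancestors until we found what we wanted?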

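Ideally we could write a short query using all and times, and have it expand into the equivalent as/merge query after the query transformers have all run. We could run the times transformer first, to produce: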

Then run the all transformer and have it transform each all into a uniquely labeled as, and put a merge after the last as.

There are a few problems with this, though. For one, this as/merge technique only works if every pathway is present in the graph: if we're missing an entry for one of Thor's great-grandparents then we will skip valid entries. For another, what happens if we want to do this to just part of a query and not the whole thing? What if there are multiple alls?

To solve that first problem we're going to have to treat alls as something more than just as/merge. We need each parent gremlin to actually skip the intervening steps. We can think of this as a kind of teleportation—jumping from one part of the pipeline directly to another—or we can think of it as a certain kind of branching pipeline, but either way it complicates our model somewhat. Another approach would be to think of the gremlin as passing through the intervening pipes in a sort of suspended animation, until awoken by a special pipe. Scoping the suspending/unsuspending pipes may be tricky, however.

The next two problems are easier. To modify just part of a query we'll wrap that portion in special start/end steps, like g.v('Thor').out().start().in().out().end().times(4).run(). Actually, if the interpreter knows about these special pipetypes we don't need the end step, because the end of a sequence is always a special pipetype. We'll call these special pipetypes "adverbs", because they modify regular pipetypes like adverbs modify verbs.

To handle multiple alls we need to run all all transformers twice: once before times, to mark all alls uniquely, and again after times to re-mark all marked alls uniquely.

There's still the issue of searching through an unbounded number of ancestors—for example, how do we find out which of Ymir's descendants are scheduled to survive Ragnarök? We could make individual queries like g.v('Ymir').in().filter({survives: true}) and g.v('Ymir').in().in().in().in().filter({survives: true}), and manually collect the results ourselves, but that's pretty awful.

What we'd like instead is a single step that works like all+times but without enforcing a limit. We may want to impose a particular strategy on the traversal, though, like a stolid BFS or YOLO DFS, so g.v('Ymir').in().filter({survives: true}).bfs() would be more flexible. Phrasing it this way allows us to state complicated queries like "check for Ragnarök survivors, skipping every other generation" in a straightforward fashion: g.v('Ymir').in().filter({survives: true}).in().bfs().

Wrapping Up

So what have we learned? Graph databases are great for storing interconnected ³⁴ data that you plan to query via graph traversals. Adding non-strict semantics allows for a fluent interface over queries you could never express in an eager system for performance reasons, and allows you to cross async boundaries. Time makes things complicated, and time from multiple perspectives (i.e., concurrency) makes things very complicated, so whenever we can avoid introducing a temporal dependency (e.g., state, observable effects, etc.) we make reasoning about our system easier. Building in a simple, decoupled and painfully unoptimized style leaves the door open for global optimizations later on, and using a driver loop allows for orthogonal optimizations—each without introducing the brittleness and complexity that is the hallmark of most optimization techniques.

That last point can't be overstated: keep it simple. Eschew optimization in favor of simplicity. Work hard to achieve simplicity by finding the right model. Explore many possibilities. The chapters in this book provide ample evidence that highly non-trivial applications can have a small, tight kernel. Once you find that kernel for the application you are building, fight to keep complexity from polluting it. Build hooks for attaching additional functionality, and maintain your abstraction barriers at all costs. Using these techniques well is not easy, but they can give you leverage over otherwise intractable problems.

Acknowledgements

Many thanks are due to Amy Brown, Michael DiBernardo, Colin Lupton, Scott Rostrup, Michael Russo, Erin Toliver, and Leo Zovic for their invaluable contributions to this chapter.

One of the very first database designs was the hierarchical model, which grouped items into tree-shaped hierarchies and is still used as the basis of IBM's IMS product, a high-speed transaction processing system. Its influence can also be seen in XML, file systems and geographic information storage. The network model, invented by Charles Bachman and standardized by CODASYL, generalized the hierarchical model by allowing multiple parents, forming a DAG instead of a tree. These navigational database models came into vogue in the 1960s and continued their dominance until performance gains made relational databases usable in the 1980s.↩
Edgar F. Codd developed relational database theory while working at IBM, but Big Blue feared that a relational database would cannibalize the sales of IMS. While IBM eventually built a research prototype called System R, it was based around a new non-relational language called SEQUEL, instead of Codd's original Alpha language. The SEQUEL language was copied by Larry Ellison in his Oracle Database based on pre-launch conference papers, and the name changed to SQL to avoid trademark disputes.↩
This database started life as a library for managing Directed Acyclic Graphs, or DAGs. Its name "Dagoba" was originally intended to come with a silent 'h' at the end, an homage to the swampy fictional planet, but reading the back of a chocolate bar one day we discovered the sans-h version refers to a place for silently contemplating the connections between things, which seems even more fitting.↩
The two purposes of this chapter are to teach this process, to build a graph database, and to have fun.↩
Notice that we're modeling edges as a pair of vertices. Also notice that those pairs are ordered, because we're using arrays. That means we're modeling a directed graph, where every edge has a starting vertex and an ending vertex. Our "dots and lines" visual model becomes a "dots and arrows" model. This adds complexity to our model, because we have to keep track of the direction of edges, but it also allows us to ask more interesting questions, like "which vertices point to vertex 3?" or "which vertex has the most outgoing edges?" If we need to model an undirected graph we could add a reversed edge for each existing edge in our directed graph. It can be cumbersome to go the other direction: simulating a directed graph from an undirected one. Can you think of a way to do it?↩
It's also lax in the other direction: all functions are variadic, and all arguments are available by position via the arguments object, which is almost like an array but not quite. ("Variadic" is a fancy way of saying a function has indefinite arity. "A function has indefinite arity" is a fancy way of saying it takes a variable number of variables.)↩
The Array.isArray checks here are to distinguish our two different use cases, but in general we won't be doing many of the validations one would expect of production code, in order to focus on the architecture instead of the trash bins.↩
Why can't we just use this.vertices.length here?↩
Often when faced with space leaks due to deep copying the solution is to use a path-copying persistent data structure, which allows mutation-free changes for only \(\log N\) extra space. But the problem remains: if the host application retains a pointer to the vertex data then it can mutate that data any time, regardless of what strictures we impose in our database. The only practical solution is deep copying vertices, which doubles our space usage. Dagoba's original use case involves vertices that are treated as immutable by the host application, which allows us to avoid this issue, but requires a certain amount of discipline on the part of the user.↩
We could make this decision based on a Dagoba-level configuration parameter, a graph-specific configuration, or possibly some type of heuristic.↩
We use the term list to refer to the abstract data structure requiring push and iterate operations. We use JavaScript's "array" concrete data structure to fulfill the API required by the list abstraction. Technically both "list of edges" and "array of edges" are correct, so which we use at a given moment depends on context: if we are relying on the specific details of JavaScript arrays, like the .length property, we will say "array of edges". Otherwise we say "list of edges", as an indication that any list implementation would suffice.↩
A tuple is another abstract data structure—one that is more constrained than a list. In particular a tuple has a fixed size: in this case we're using a 2-tuple (also known as a "pair" in the technical jargon of data structure researchers). Using the term for the most constrained abstract data structure required is a nicety for future implementors.↩
Very short lived garbage though, which is the second best kind.↩
Two references to the same mutable data structure act like a pair of walkie-talkies, allowing whoever holds them to communicate directly. Those walkie-talkies can be passed around from function to function, and cloned to create a whole lot of walkie-talkies. This completely subverts the natural communication channels your code already possesses. In a system with no concurrency you can sometimes get away with it, but introduce multithreading or asynchronous behavior and all that walkie-talkie squawking can become a real drag.↩
Uniqueness types were dusted off in the Clean language, and have a non-linear relationship with linear types, which are themselves a subtype of substructural types.↩
Most modern JS runtimes employ generational garbage collectors, and the language is intentionally kept at arm's length from the engine's memory management to curtail a source of programmatic non-determinism.↩
The run() at the end of the query invokes the interpreter and returns results.↩
With weight in skippund and height in fathoms, naturally. Depending on the density of Asgardian flesh this may return many results, or none at all. (Or just Volstagg, if we're allowing Shakespeare by way of Jack Kirby into our pantheon.)↩
Some would argue it's best to be explicit all the time. Others would argue that a good system for implicits makes for more concise, readable code, with less boilerplate and a smaller surface area for bugs. One thing we can all agree on is that making effective use of JavaScript's implicit coercion requires memorizing a lot of non-intuitive special cases, making it a minefield for the uninitiated.↩
What would you expect each of those to return? What do they actually return?↩
There are certain conditions under which this particular query might yield unexpected results. Can you think of any? How could you modify it to handle those cases?↩
Technically we need to implement an interpreter with non-strict semantics, which means it will only evaluate when forced to do so. Lazy evaluation is a technique used for implementing non-strictness. It's a bit lazy of us to conflate the two, so we will only disambiguate when forced to do so.↩
Method chaining lets us write g.v('Thor').in().out().run() instead of the six lines of non-fluent JS required to accomplish the same thing.↩
We call it maybe_gremlin to remind ourselves that it could be a gremlin, or it could be something else. Also because originally it was either a gremlin or Nothing.↩
Recall that done starts at -1, so the first step's predecessor is always done.↩
Or, more pointedly, a poorly phrased query is less likely to yield exponential slowdowns. As an end-user of an RDBMS the aesthetics of query quality can often be quite opaque.↩
Note that we're keeping the domain of the priority parameter open, so it can be an integer, a rational, a negative number, or even things like Infinity or NaN.↩
You can learn more about dependency resolution in the Contingent chapter of this book.↩
The fancy term for this is "index-free adjacency".↩
Though only in operation for 18 months due to the arrival of the transcontinental telegraph and the outbreak of the American Civil War, the Pony Express is still remembered today for delivering mail coast to coast in just ten days.↩
Sharding a graph database requires partitioning the graph. Optimal graph partitioning is NP-hard, even for simple graphs like trees and grids, and good approximations also have exponential asymptotic complexity.↩
In modern JavaScript engines filtering a list is quite fast—for small graphs the naive version can actually be faster than the index-free version due to the underlying data structures and the way the code is JIT compiled. Try it with different sizes of graphs to see how the two approaches scale.↩
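Pro tip: Given a deep tree deep_tree, running JSON.stringify(deep_tree, null, 2) in the JS console is a quick way to make it human readable. The second argument is the replacer (null means "no replacer") and the third is the number of spaces to indent by.↩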
Not too interconnected, though—you'd like the number of edges to grow in direct proportion to the number of vertices. In other words, the average number of edges connected to a vertex shouldn't vary with the size of the graph. Most systems we'd consider putting in a graph database already have this property: if Loki had 100,000 additional grandchildren the degree of the Thor vertex wouldn't increase.↩
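As a small illustration of the pretty-printing tip in the deep_tree footnote above, here is a minimal sketch. The contents of deep_tree are a hypothetical stand-in, not data from the chapter. Only the standard JSON.stringify(value, replacer, space) call is assumed.

```javascript
// Hypothetical nested structure standing in for deep_tree.
const deep_tree = {
  name: 'Odin',
  children: [{ name: 'Thor', children: [] }]
};

// The second argument is the replacer; a value that is neither a function
// nor an array (null here) is ignored. The third argument is the number of
// spaces used for each level of indentation.
const pretty = JSON.stringify(deep_tree, null, 2);

console.log(pretty);
```

Because the output is still valid JSON, JSON.parse(pretty) yields a structure equal to the original, so the indented form is safe to copy back into code or a test fixture.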