Datomic Queries and Rules

Datomic's query and rules system is an extended form of Datalog. Datalog is a deductive query system, typically consisting of:

  • A database of facts
  • A set of rules for deriving new facts from existing facts
  • a query processor that, given some partial specification of a fact or rule:
    • finds all instances of that specification implied by the database and rules
    • i.e. all the matching facts

Typically a Datalog system would have a global fact database and set of rules. Datomic's query engine instead takes databases (and in fact, many other data sources) and rule sets as inputs.

There is a video introducing Datalog queries.

Examples

Example Data

The examples in this document use the mbrainz 1968-1973 sample set. Download and untar this file:

wget http://s3.amazonaws.com/mbrainz/datomic-mbrainz-1968-1973-backup-2014-10-15.tar -O mbrainz.tar
tar -xvf mbrainz.tar
bin/datomic restore-db file:///path/to/mbrainz-backup datomic:dev://localhost:4334/mbrainz-1968-1973

edn

Queries, sample data, and results in this document are written in the Extensible Data Notation (edn), which is programming language neutral. In your own programs, you can create data programmatically out of basic language data types, e.g. Java Strings, Lists, and Maps. Alternatively, you can pass the pattern argument as a serialized edn string.

The ellipsis is used in query results to shows that a large result set has been elided for brevity.

Why Datalog?

Simplicity

Datalog is simple. The basic component of Datalog is a clause, which is simply a list that either begins with the name of a rule, or is a data pattern. These clauses can contain variables (symbols beginning with a '?'). The query engine simply finds all combinations of values of the variables that satisfy all of the clauses. There is no complex syntax to learn.

Declarative

Like SQL and other good query languages, Datalog is declarative. That is, you specify what you want to know and not how to find it. This kind of declarative programming is very powerful, and it is a shame that it has been relegated to the database servers and not available to application programmers. Declarative programs are:

  • More evident - it is easier to tell what their purpose is, both for programmers and stakeholders.
  • More readily optimized - the query engine is free to reorder and parallelize operations to a degree not normally taken on by application programs.
  • Simpler - and thus, more robust.

Logic-based

Even SQL, while fundamentally declarative, still includes many operations that go beyond the query itself, like specifying joins explicitly. Because Datalog is based upon logical implication, joins are implicit, and the query engine figures out when they are needed.

Embedded

Datomic's queries are further simplified by the fact that its query engine, and the data, are made available locally. Query languages like SQL are oriented around a client-server model where, in a single conversation, you are going to have to both:

  • Answer your fundamental question, e.g. who bought socks this month.
  • Recover any additional information required for reporting and processing, e.g. what are their names and email addresses.

The latter is not really a query, it is just a mechanical navigation to related information. With Datomic, you don't have to combine decisions about how to render the answers with finding them, leading to simpler queries. Given an entity id found in a query, you can at any time later quickly navigate to any related information, freeing yourself from the complex queries forced by a client-server model.

The Database of Facts

The first obvious input for a query is a Datomic database. It ends up that the data sources processed and returned by Datalog are in fact relations, i.e. sets of tuples. A Datomic database is as a universal relation of datoms, i.e. 5-tuples of the form:

[entity attribute value transaction added?]

Datomic's Datalog is of course able to process these tuples, but is not limited to processing 5-tuples. Queries and rules output relations with tuples of varying arity, and Datomic's query engine can accept as inputs relation-like data in arbitrary collections.

Query Grammar

Syntax Used in Grammar

'' literal
"" string
[] = list or vector
{} = map {k1 v1 ...}
() grouping
| choice
? zero or one
+ one or more

Query

query                      = [find-spec with-clause? inputs? where-clauses?]
find-spec                  = ':find' (find-rel | find-coll | find-tuple | find-scalar)
find-rel                   = find-elem+
find-coll                  = [find-elem '...']
find-scalar                = find-elem '.'
find-tuple                 = [find-elem+]
find-elem                  = (variable | pull-expr | aggregate)
pull-expr                  = ["pull" variable pattern]
pattern                    = (input-name | pattern-data-literal)
aggregate                  = [aggregate-fn-name fn-arg+]
fn-arg                     = (variable | constant | db-var)
with-clause                = ':with' variable+
where-clauses              = ':where' clause+
inputs                     = ':in' (db-var | variable | pattern-var | rule-var)+
db-var                     = symbol starting with "$"
variable                   = symbol starting with "?"
rule-var                   = the symbol "%"
plain-symbol               = symbol that does not begin with "$" or "?"
pattern-var                = plain-symbol
clause                     = [ (data-pattern | pred-expr | fn-expr | rule-expr)+ ]
data-pattern               = [ db-var? (variable | constant)+ ]
constant                   = any non-variable data literal
pred-expr                  = [ [pred fn-arg+] ]
fn-expr                    = [ [fn fn-arg]+ binding]
binding                    = (bind-scalar | bind-tuple | bind-coll | bind-rel)
bind-scalar                = variable
bind-tuple                 = [variable+]
bind-coll                  = [variable '...']
bind-rel                   = [ [variable+] ]

See pattern grammar for the description of the pattern-data-literal rule.

Rules

Note that the rule grammar reuses some terms from the query grammar above.

rule                       = [ [rule-head clause+]+ ]
rule-head                  = [rule-name variable+]
rule-name                  = plain-symbol
rule-expr                  = [rule-name variable+]

Queries

Basics

The basic job of query is, given a set of variables and a set of clauses, find (the set of) all of the (tuples of) variables that satisfy the clauses. The shape of the most basic query looks like this:

[:find variables 
 :where clauses]

Given the following data tuples:

[[sally :age 21] 
 [fred :age 42] 
 [ethel :age 42]
 [fred :likes pizza] 
 [sally :likes opera] 
 [ethel :likes sushi]]

We could perform the query:

[:find ?e 
 :where [?e :age 42]]

And get this result:

[[fred], [ethel]]

Invoking a query takes this basic form:

Peer.q(query, inputs...);

The query above has one variable ?e, and will take one input, a collection of tuples with at least three components. This clause [?e :age 42] is called a data clause. A data clause consists of constants and/or variables, and a tuple satisfies a clause if its constants match. Variabls in the data pattern are then bound to the corresponding part of the matching tuple. All of this matching happens by position.

Unification

Unification occurs when a variable appears in more than one data pattern. In the following query, ?e appears twice:

;;which 42-year-olds like what?
[:find ?e ?x 
 :where [?e :age 42] [?e :likes ?x]]

Matches for the variable ?e must unify, i.e. represent the same value in every clause in order to satisfy the set of clauses. So a matching ?e must have both :age 42 and :likes for some ?x:

[[fred pizza], [ethel sushi]]

Blanks

Sometimes we don't care about certain components of the tuples in a query, but must put something in the clause in order to get to the positions we care about. The underscore symbol _ is blank placeholder, matching anything without binding or unifying.

The query below finds anything that is liked, without caring who does the liking:

;; query
[:find ?x 
 :where [_ :likes ?x]]

;; result
[[fred] [ethel [sally]]

Do not use a dummy variable instead of the blank. This will make the query engine do extra work, tracking binding an unification for a variable that you never intent to use. It will also make human readers do extra work, puzzling out that variable is intentionally not used.

Querying a database

To query a database, we must first get a value of the database from the connection:

Database db = conn.db();

A Database is an immutable value, and will never change. If we use db for several queries we will know the answers are based upon exactly the same data from a single point in time. As we said, the database itself acts as a relation of 5-tuples of

[entity attribute value transaction added?]

The following query finds all release names in the database:

;; query
[:find ?release-name
 :where [_ :release/name ?release-name]

;; inputs
db

;; result 
#{["Osmium"]
  ["Hela roept de akela"] 
  ["Ali Baba"] 
  ["The Power of the True Love Knot"]
  ...}

Implicit Blanks

Notice that while the release names query targets a database, its data pattern contains only three elements, not five. In data patterns, you can always elide any trailing components you don't care about, rather than explicitly padding with blanks:

;; unnecessary trailing blanks
[_ :release/name ?release-name _ _]

Inputs

By default, queries expect a single input, a database whose name is the dollar sign $. Also by default, data patterns refer to a database named $.

Queries can choose to explictly name their inputs via an :in clause. A fully explicit form of the release names query would look like:

;; query
[:find ?release-name
 :in $
 :where [$ _ :release/name ?release-name]]

;; inputs & result same as previous

Here you can see that the one and only :in argument names the one and only input db, and that the data pattern uses a leading $ to choose the database it matches against.

Multiple inputs

Most real-world queries are parameterized at runtime with variable bindings. For example, the following query takes two inputs: a database and a scalar variable binding to limit releases to those perfomed by John Lennon.

;; query
[:find ?release-name
 :in $ ?artist-name
 :where [?artist :artist/name ?artist-name]
        [?release :release/artists ?artist]
        [?release :release/name ?release-name]]

;; inputs
db, "John Lennon"

;; result
#{["Power to the People"] 
  ["Unfinished Music No. 2: Life With the Lions"] 
  ["Live Peace in Toronto 1969"] 
  ["Live Jam"]
  ...}

Because /?artist-name/ appears second in the /:in/ clause, it is bound
to the second input "John Lennon".

Bindings

A variable name like ?artist-name is the simplest kind of binding, to a single scalar. Other input shapes can be bound as follows:

Binding FormBinds
?ascalar
[?a ?b]tuple
[?a …]collection
[ [?a ?b ] ]relation

Tuple Binding

A tuple binding binds a set of variables to a single value each, passed in as a collection. The query below binds both artist name and release name to find the entity ids for releases of John Lennon's Mind Games:

;; query
[:find ?release
 :in $ [?artist-name ?release-name]
 :where [?artist :artist/name ?artist-name]
        [?release :release/artists ?artist]
        [?release :release/name ?release-name]]

;; inputs
db, ["John Lennon" "Mind Games"]

;; result
#{[17592186157686] 
  [17592186157672] 
  [17592186157690] 
  [17592186157658]}

Collection Binding

A collection binding binds a single variable to multiple values passed in as a collection. This can be used to ask "or" questions, i.e. what releases are associated with either Paul McCartney or George Harrison?

;; query
[:find ?release-name
 :in $ [?artist-name ...]
 :where [?artist :artist/name ?artist-name]
        [?release :release/artists ?artist]
        [?release :release/name ?release-name]]

;; inputs
db, ["Paul McCartney" "George Harrison"]

;; result
#{["My Sweet Lord"] 
  ["Electronic Sound"]
  ["Give Me Love (Give Me Peace on Earth)"] 
  ["All Things Must Pass"]
  ...}

Relation Binding

A relation binding is fully general, binding multiple variables positionally to a relation (collection of tuples) passed in. This can be used to ask "or" questions involving multiple variables. For example, what releases are associated with either John Lennon's Mind Games or Paul McCartney's Ram?

;; query
[:find ?release
 :in $ [[?artist-name ?release-name]]
 :where [?artist :artist/name ?artist-name]
        [?release :release/artists ?artist]
        [?release :release/name ?release-name]]

;; inputs
db,  [["John Lennon" "Mind Games"] 
      ["Paul McCartney" "Ram"]]

;; result
#{[17592186157686] 
  [17592186157672] 
  [17592186157690] 
  [17592186157658] 
  [17592186063566]}

Find Specifications

Where bindings control inputs, find specifications control results.

Find SpecReturnsJava Type Returned
:find ?a ?brelationCollection of Lists
:find [?a …]collectionCollection
:find [?a ?b]single tupleList
:find ?a .single scalarScalar Value

The relation find spec is the most common, and the most general. It will return a tuple for each result, with values in each tuple matching the named variables. All of the examples so far have used the relation find spec. The example below finds a relation spec with two variables and returns a relation of 2-tuples:

;; query
[:find ?artist-name ?release-name
 :where [?release :release/name ?release-name]
        [?release :release/artists ?artist]
        [?artist :artist/name ?artist-name]]

;; inputs
db

;; result
#{["George Jones" "With Love"] 
  ["Shocking Blue" "Hello Darkness / Pickin' Tomatoes"] 
  ["Junipher Greene" "Friendship"]
  ...}

The collection find spec is useful when you are only interested in a single variable. The form [?release-name …] below returns values for ?release-name, not wrapped in a one-tuple:

;; query
[:find [?release-name ...]
 :in $ ?artist-name
 :where [?artist :artist/name ?artist-name]
        [?release :release/artists ?artist]
        [?release :release/name ?release-name]]

;; inputs
db

;; result
["Power to the People" 
 "Unfinished Music No. 2: Life With the Lions" 
 "Live Peace in Toronto 1969" 
 "Live Jam"
 ...]

The single tuple find spec is useful when you are interested in multiple variables, but expect only a single result. The form [?year ?month ?day] below returns a single triple, not wrapped in a relation.

;; query
[:find [?year ?month ?day]
 :in $ ?name
 :where [?artist :artist/name ?name]
        [?artist :artist/startDay ?day]
        [?artist :artist/startMonth ?month]
        [?artist :artist/startYear ?year]] 

;; inputs
db

;; result
[1940 10 9]

The scalar find spec is useful when you want to return a single value of a single variable. The form ?year below returns a single scalar value:

;; query
[:find ?year .
 :in $ ?name
 :where [?artist :artist/name ?name]
        [?artist :artist/startYear ?year]]

;; inputs
db

;; result
1940

Note that the single tuple find spec and the scalar find spec will return only a single value from the query result, even if the result itself has more than one value. These find specs are typically used only when you know in advance that a query will have exactly one result.

Expression Clauses

Expression clauses allow arbitrary Java or Clojure functions to be used inside of Datalog queries. Any functions or methods you use in expression clauses must be pure, i.e. they must be free of side effects and always return the same thing given the same arguments. Expression clauses have one of two basic shapes:

[(predicate ...)]
[(function ...) bindings]

The first item in an expression clause is a list designating a function or method call.

Predicate Expressions

If no bindings are provided, the function is presumed to be a predicate returning a truth value: null and false are treated as false, anything else is treated as true.

In the example below, the built-in expression predicate < limits the results to artists who started before 1600:

;; query
[:find ?name ?year
 :where [?artist :artist/name ?name]
        [?artist :artist/startYear ?year]
        [(< ?year 1600)]]

;; inputs
db

;; result
#{["Choir of King's College, Cambridge" 1441] 
  ["Heinrich Schütz" 1585]}

Function Expressions

Functions behave similarly, except that their return values are used not as predicates, but to bind other variables. In the example below, the built-in expression function quot converts track lengths from milliseconds to minutes:

[:find ?track-name ?minutes
 :in $ ?artist-name
 :where [?artist :artist/name ?artist-name]
        [?track :track/artists ?artist]
        [?track :track/duration ?millis]
        [(quot ?millis 60000) ?minutes]
        [?track :track/name ?track-name]]

;; inputs 
db, "John Lennon"

;; result
#{["Crippled Inside" 3] 
  ["Working Class Hero" 3] 
  ["Sisters, O Sisters" 3] 
  ["Only People" 3] 
  ...}

Expression clauses do not nest:

;; this query will not work!!!
[:find ?celsius .
 :in ?fahrenheit
 :where [(/ (- ?fahrenheit 32) 1.8) ?celsius]]

Instead, multi-step calculations must be performed with separate expressions:

;; query
[:find ?celsius .
 :in ?fahrenheit
 :where [(- ?fahrenheit 32) ?f-32]
        [(/ ?f-32 1.8) ?celsius]]

;; inputs
212

;; result
100.0

Built-in Expression Functions and Predicates

Datomic provides the following built-in expression functions and predicates:

  • Two argument comparison predicates !=, <, <=, >, and >=.
  • Two-argument mathematical operators +, -, *, and /.
  • All of the functions from the clojure.core namespace of Clojure, except eval.
  • A set of functions and predicates that are aware of Datomic data structures, documented below:

get-else

[(get-else db-var ent attr default) ?val-or-default]

The get-else function takes a database, an entity identifier, a cardinality-one attribute, and a default value. It returns that entity's value for the attribute, or the default value if entity does not have a value.

The query below reports "N/A" whenever an artist's startYear is not in the database:

;; query
[:find ?artist-name ?year
 :in $ [?artist-name ...]
 :where [?artist :artist/name ?artist-name]
        [(get-else $ ?artist :artist/startYear "N/A") ?year]]

;; inputs
db, ["Crosby, Stills & Nash" "Crosby & Nash"]

;; result
#{["Crosby, Stills & Nash" 1968] 
  ["Crosby & Nash" "N/A"]}

get-some

[(get-some db-var ent attr+) [?attr ?val]]

The get-some function takes a database, an entity identifier, and one or more cardinality-one attributes, returning a tuple of the entity id and value for the first attribute possessed by the entity.

The query below tries to find a :country/name for an entity, and then falls back to :artist/name:

;; query
[:find [?e ?attr ?name]
 :in $ ?e
 :where [(get-some $ ?e :country/name :artist/name) [?attr ?name]]]

;; inputs
db, :country/US

;; result
[:country/US 84 "United States"]

ground

[(ground const) binding]

The ground function takes a single argument, which must be a constant, and returns that same argument. Programs that know information at query time should prefer ground over e.g. identity, as the former can be used inside the query engine to enable optimizations.

[(ground [:a :e :i :o :u]) [?vowel ...]]

fulltext

[(fulltext db-var attr search) [[?ent ?val ?tx ?score]]]

The fulltext function takes a database, an attribute, and a search expression, and returns a relation of four-tuples: entity, value, transaction, and score.

The following query finds all the artists whose name includes "Jane":

;; query
[:find ?entity ?name ?tx ?score
 :in $ ?search
 :where [(fulltext $ :artist/name ?search) [[?entity ?name ?tx ?score]]]]

;; inputs
db, "Jane"

;; result
#{[17592186047274 "Jane Birkin" 2839 0.625] 
  [17592186046687 "Jane" 2267 1.0] 
  [17592186047500 "Mary Jane Hooper" 3073 0.5]}

missing?

[(missing? db-var ent attr)]

The missing? predicate takes a database, entity, and attribute, and returns true if the entity has no value for attribute in the database.

The following query finds all artists whose start year is not recorded in the database.

;; query
[:find ?name
 :where [?artist :artist/name ?name]
        [(missing? $ ?artist :artist/startYear)]]

;; inputs
db

;; result
#{["Sigmund Snopek III"] ["De Labanda's"] ["Baby Whale"] ...}

tx-ids

[(tx-ids ?log ?start ?end) [?tx ...]]

Given a database log, start, and end, tx-ids returns a collection of transaction ids. Start and end can be specified as database t, transaction id, or instant in time, and can be nil.

The following query finds transactions from time t 1000 through 1050:

;; query
[:find [?tx ...]
 :in ?log
 :where [(tx-ids ?log 1000 1050) [?tx ...]]]

;; inputs
log

;; result
[13194139534340 13194139534312 13194139534313 13194139534314]

tx-ids is often used in conjunction with tx-data, to first locate transactions and then the data within those transactions.

tx-data

[(tx-data ?log ?tx) [[?e ?a ?v _ ?op]]]

Given a log and a database, tx-data returns a collection of the datoms added by a transaction. You should not bind the transaction position of the result, as the transaction is already bound on input.

The following query finds the entities referenced by transaction id

;; query
[:find [?e ...]
 :in ?log ?tx
 :where [(tx-data ?log ?tx) [[?e]]]]

;; inputs
log, 13194139534312

;; result
[13194139534312 63 0 64 65 66 67 68 69 70 71 ...]

Calling Java Methods

Java methods can be used as query expression functions and predicates, and can be type hinted for performance. Java code used in this way must be on the Java process classpath.

Calling Static Methods

Java static methods can be called with the (ClassName/methodName …) form. For example, the following code calls System.getProperties, binding property names to ?k and property values to ?v.

;; query
[:find ?k ?v
 :where [(System/getProperties) [[?k ?v]]]]

;; no inputs

;; result
#{["java.vendor.url.bug" "http://bugreport.sun.com/bugreport/"] 
  ["sun.cpu.isalist" ""] 
  ["sun.jnu.encoding" "UTF-8"]
  ...}

Calling Instance Methods

Java instance methods can be called with the (.methodName obj …) form. For example, the following code calls String.endsWith:

[(.endsWith ?k "path")]

and could be used to extend the previous example like this:

;; query
[:find ?k ?v
 :where [(System/getProperties) [[?k ?v]]]
        [(.endsWith ?k "version")]]

;; no inputs

;; result
#{["java.class.version" "52.0"] 
  ["java.runtime.version" "1.8.0_20-b26"] 
  ["java.version" "1.8.0_20"]
  ...}

Type Hinting for Performance

The current version of Datomic performs reflective lookup for Java interop. You can significantly improve performance by type hinting objects, allowing the query engine to make direct method invocations. Type hints take the form of ^ClassName preceding an argument, so the previous example becomes

[(.endsWith ^String ?k "path")]

Note that type hints outside java.lang will need to be fully qualified, and that complex method signatures may require more than one hint to be unambiguous.

Calling Clojure Functions

Clojure functions can be used as query expression functions and predicates. Clojure code used in this way must be on the Clojure process classpath. The example below uses subs as an expression function to extract prefixes of words

;; query
'[:find [?prefix ...]
  :in [?word ...]
  :where [(subs ?word 0 5) ?prefix]]

;; inputs
["hello" "antidisestablishmentarianism"]

;; result
["hello" "antid"]

Function names outside clojure.core need to be fully qualified, and their namespaces must be loaded before use in query.

The implicit data source - $

Often you will have only a single, or primary, data source (usually a database). In this case you can call that data source $, and elide it in the data clauses:

[:find ?e :in $ ?age :where [?e :age ?age]]
;;same as
[:find ?e :in $data ?age :where [$data ?e :age ?age]]

Rules

Datomic datalog allows you to package up sets of :where clauses into named rules. These rules make query logic reusable, and also composable, meaning that you can bind portions of a query's logic at query time.

A rule is a named group of clauses that can be plugged into the :where section of your query. For example, here is a rule from the Seattle example dataset that tests whether a community is a twitter feed:

[(twitter? ?c)
 [?c :community/type :community.type/twitter]]

As with transactions and queries, rules are described using data structures. A rule is a list of lists. The first list in the rule is the head. It names the rule and specifies its parameters. The rest of the lists are clauses that make up the body of the rule. In this rule, the name is "twitter", the variable ?c is an input argument, and the body is single data clause testing whether the :community/type attribute of the entity ?c has the value :community.type/twitter.

This rule has no output argument - it is a predicate rule that will evaluate to true or false, indicating whether ?c matches the specified criteria. However, rules with more than one argument can be used to bind output variables that can be subsequently used elsewhere in the query.

[(community-type ?c ?t)
 [?c :community/type ?t]]

In the rule above, we could bind either ?c or ?t at invocation time, and the other variable would be bound to the output of the rule.

Individual rule definitions are combined into a set of rules. A set of rules is simply another list containing some number of rule definitions:

[[(twitter ?c)
  [?c :community/type :community.type/twitter]]]

You have to do two things to use a rule set in a query. First, you have to pass the rule set as an input source and reference it in the :in section of your query using the '%' symbol. Second, you have to invoke one or more rules from the :where section of your query. You do this by adding a rule invocation clause. Rule invocations have this structure:

(rule-name rule-arg*)

A rule invocation is a list containing a rule-name and one or more arguments, either variables or constants, as defined in the rule head. It's idiomatic to use parenthesis instead of square brackets to represent a rule invocation in literal form, because it makes it easier to differentiate from a data clause. However, this is not a requirement.

As with other where clauses, you may specify a database before the rule-name to scope the rule to that database. Databases cannot be used as arguments in a rule.

($db rule-name rule-arg*)

Rules also make it possible to define different logical paths to the same conclusion (i.e. logical OR). Here's a rule, again from the Seattle example, which identifies communities that are "social-media".

[[(social-media ?c)
  [?c :community/type :community.type/twitter]]
 [(social-media ?c)
  [?c :community/type :community.type/facebook-page]]]

The social-media rule has two definitions, one testing whether a community's type is :community.type/twitter and the other testing whether a community's type is :community.type/facebook-page. When a given community value is tested, the social-media rule will be true if either of the definitions is true. In other words, using rules, we can implement logical OR in queries.

In all the examples above, the body of each rule is made up solely of data clauses. However, rules can contain any type of clause: data, expression, or even other rule invocations.

Aggregates

Datomic's aggregate syntax is incorporated in the :find clause:

[:find ?a (min ?b) (max ?b) ?c (sample 12 ?d)
 :where ...]

The list expressions are aggregate expressions. Query variables not in aggregate expressions will group the results and appear intact in the result. Thus, the above query binds ?a ?b ?c ?d, then groups by ?a and ?c, and produces a result for each aggregate expression for each group, yielding 5-tuples.

Control Grouping via :with

Unless otherwise specified, Datomic's datalog returns sets, and you will not see duplicate values. This is often undesirable when producing aggregates. Consider the following query, which attempts to return the total number of heads possessed by a set of mythological monsters:

;; incorrect query
[:find (sum ?heads) .
 :in [[_ ?heads]]]

;; inputs
[["Cerberus" 3]
 ["Medusa" 1]
 ["Cyclops" 1]
 ["Chimera" 1]]

;; result
4

The monsters clearly have six total heads, but set logic coalesces Medusa, the Cyclops, and the Chimera together, since each has one head.

The solution to this problem is the :with clause, which considers additional variables when forming the basis set for the query result. The :with variables are then removed, leaving a bag (not a set!) of values available for aggregation.

;; query
[:find (sum ?heads) .
 :with ?monster
 :in [[?monster ?heads]]]

;; inputs
[["Cerberus" 3]
 ["Medusa" 1]
 ["Cyclops" 1]
 ["Chimera" 1]]

;; result
6

Aggregates Returning a Single Value

(min ?xs)
(max ?xs)
(count ?xs)
(count-distinct ?xs)
(sum ?xs)
(avg ?xs)
(median ?xs)
(variance ?xs)
(stddev ?xs)

The aggregation functions that return a single value are listed above, and all behave as their names suggest.

  • Min and Max

    The following query finds the smallest and largest track lengths:

    ;; query 
    [:find [(min ?dur) (max ?dur)]
     :where [_ :track/duration ?dur]]
    
    ;; inputs
    db
    
    ;;result 
    [3000 3894000]
    

    min and max support all database types (via comparators), not just numbers.

  • Sum

    The following query uses sum to find the total number of tracks on all media in the database.

    ;; query
    (d/q '[:find (sum ?count) .
           :with ?medium
           :where [?medium :medium/trackCount ?count]]
         db)
    ;; inputs
    db
    
    ;; result
    100759
    
  • Counts

    More than one artist can have the same name. The following query uses count to report the total number of artist names, and count-distinct to report the total number of unique artist names.

    ;; query
    [:find (count ?name) (count-distinct ?name)
     :with ?artist
     :where [?artist :artist/name ?name]]
    
    ;; inputs
    db
    
    ;; result
    [4601 4588]
    

    Note the use of :with so that equal names do not coalesce.

  • Statistics

    Are musicians becoming more verbose when naming songs? The following query reports the median, avg, and stddev of song title lengths (in characters), and includes year in the find set to break out the results by year.

    ;; query
    [:find ?year (median ?namelen) (avg ?namelen) (stddev ?namelen)
     :with ?track
     :where [?track :track/name ?name]
            [(count ?name) ?namelen]
            [?medium :medium/tracks ?track]
            [?release :release/media ?medium]
            [?release :release/year ?year]]
    
    ;; inputs 
    db
    
    ;; result
    [[1968 16 18.92181098534824 12.898760656290333] 
     [1969 16 18.147895557287608 11.263945894977244] 
     [1970 15 18.007481296758105 12.076103750401026] 
     [1971 15 18.203682039283294 13.715552693168124] 
     [1972 15 17.907170949841063 11.712941060399375] 
     [1973 16 18.19300100438759 12.656827911058622]]
    

Aggregates Returning Collections

(distinct ?xs)
(min n ?xs)
(max n ?xs)
(rand n ?xs)
(sample n ?xs)

Where n is specified, fewer than n items may be returned if not enough items are available.

  • Distinct

    The distinct aggregate returns the set of distinct values in the collection.

    ;; query
    [:find (distinct ?v) .
     :in [?v ...]] 
    
    ;; inputs
    [1 1 2 2 2 3]
    
    ;; result
    #{1 3 2}
    
  • Min N /Max N

    The min n and max n aggregates return up to n least/greatest items. The following query returns the five shortest and five longest track lengths in the database.

    ;; query
    [:find [(min 5 ?millis) (max 5 ?millis)]
     :where [?track :track/duration ?millis]]
    
    ;; inputs 
    db
    
    ;; result
    [[3000 4000 5000 6000 7000] 
     [3894000 3407000 2928000 2802000 2775000]]
    
  • Rand N / Sample N

    The rand n aggregate selects exactly n items with potential for duplicates. and the sample n aggregate returns up to n distinct items.

    The following query returns two random and two sampled artist names.

    ;; query
    [:find [(rand 2 ?name) (sample 2 ?name)]
     :where [_ :artist/name ?name]]
    
    ;; inputs
    db
    
    ;; result
    [("Four Tops" "Ethel McCoy") 
     ["Gábor Szabó" "Zapata"]]
    

Custom Aggregates

You may call an arbitrary Clojure function as an aggregation function as follows:

  • Use the fully qualified name of the function.
  • Load the namespace before using the function.
  • The one and only aggregated variable must be the last argument to the function.
  • Other arguments to the function must be constants in the query.

The aggregated variable will be passed as a partial implementation of java.util.List - only size(), iterator(), and get(i) are implemented.

For example, you might implement your own mode function to calculate the mode as follows:

(defn mode
  [vals]
  (->> (frequencies vals)
       (sort-by (comp - second))
       ffirst))

With mode in hand, you can answer the question "What is the most common release medium length, in tracks?"

;; query
[:find (user/mode ?track-count) .
 :with ?media
 :where [?media :medium/trackCount ?track-count]]

;; inputs
db

;; result
2

I was initially surprised by this result until I recalled the time period of the sample data included a huge number of vinyl singles, which by definition have two tracks

Pull Expressions

Pull expressions can used in a :find clause. A pull expression takes the form

(pull ?entity-var pattern)

and adds a value to the result set by applying the Pull API to the entities named by ?entity-var.

For example, the following query returns the :release/name for all of Led Zeppelin's releases:

;; query
[:find (pull ?e [:release/name])
 :in $ ?artist
 :where [?e :release/artists ?artist]]

;; args
db, led-zeppelin

;; results
[[{:release/name "Immigrant Song / Hey Hey What Can I Do"}]
 [{:release/name "Heartbreaker / Bring It On Home"}]
 [{:release/name "Led Zeppelin III"}]
 ...]

The pull expression pattern can also be bound dynamically as an :in parameter to query:

;; query
[:find (pull ?e pattern)
 :in $ ?artist pattern
 :where [?e :release/artists ?artist]]

;; args
db, led-zeppelin, [:release/name]

;; results elided, same as previous example

Building Queries Programmatically

List form vs. Map form

Queries written by humans typically are a list, and the various keyword arguments are inferred by position. For example, in the query

[:find ?e
 :in $ ?fname ?lname
 :where [?e :user/firstName ?fname]
        [?e :user/lastName ?lname]]

there is one :find argument, three :in arguments, and two :where arguments.

While most people find the positional syntax easy to read, it makes extra work for programmatic readers and writers, which have to keep track of what keyword is currently "active" and interpret tokens accordingly. For such cases, queries can be specified more simply as maps. The query above becomes:

{:find [?e]
 :in [$ ?fname ?lname]
 :where [[?e :user/firstName ?fname]
         [?e :user/lastName ?lname]]}

Work with Data Structures, Not Strings

Two features of Datalog queries make them immune to many of the SQL-injection style attacks to which many other DBMSs are vulnerable:

  • Datalog queries are composed of data structures, rather than strings, which obviates the need to do string interpolation, sanitation, escaping, etc.
  • The query API is parameterized with data sources. In many cases, this feature obviates the need to include user-provided data in the query itself. Instead, you can pass user data to a parameterized query as its own data source.

You should avoid building queries by reading in a string that has been built up by concatenation or interpolation. Doing so gives up the security and simplicity of working with native data structures.

Performance

Clause order

To minimize the amount work the query engine must do, query authors should put the most restrictive or narrowing :where clauses first, and then proceed on to less restrictive clauses.

As an example, consider the following two queries looking for Paul McCartney's releases. The first :where clause begins with a data pattern ([?release :release/name ?name]) that is not at all selective, forcing the query engine to consider all the releases in the system:

;; query
[:find [?name ...]
 :in $ ?artist
 :where [?release :release/name ?name]
        [?release :release/artists ?artist]]

;; inputs
db, mccartney

;; result
["McCartney" "Another Day / Oh Woman Oh Why" "Ram"]

The following equivalent query reorders the :where clause, leading with a much more selective pattern ([?release :release/artists ?artist]) that is limited in this context to the single ?artist passed in.

;; query
[:find [?name ...]
 :in $ ?artist
 :where [?release :release/artists ?artist]
        [?release :release/name ?name]]

;; inputs and result same as above

The second query runs 50 times faster.

Queries and Peer Memory

Since queries run within the Peer with application-local memory, application designers need to consider the memory requirements for their queries.

Datomic's Datalog is set-oriented and eager, and does not spill large results to disk. Queries are designed to be able to run over datasets much larger than memory. However, each intermediate representation step of a query must fit into local memory. Datomic doesn't spool intermediate representations to disk like some server-based RDBMS's.

If you need to work with result sets larger than can fit in memory, you will need to break up the query into smaller pieces whose results can fit in memory. The datoms and index-range APIs are often useful for this.

Query Caching

Datomic processes maintain an in-memory cache of parsed query representations. Caching is based on equality (in the Java .equals sense) of the first argument to q. To take advantage of caching, programs should

  • Use parameterized queries (i.e. queries with multiple inputs) instead of building dynamic queries.
  • When building dynamic queries, use a canonical approach to naming and ordering such that equivalent queries will be Java .equals.

Limitations

Resolving Entity Identifiers in V Position

Datomic performs automatic resolution of entity identifiers, so that you can generally use entity ids, idents, and lookup refs interchangeably.

For example, the following family of queries all locate Belgium and return the same results:

;; query
[:find ?artist-name
 :in $ ?country
 :where [?artist :artist/name ?artist-name]
        [?artist :artist/country ?country]]

;; input option 1: lookup ref
db, [:country/name "Belgium"]

;; input option 2: ident
db, :country/BE

;; input option 3: entity id
db, 17592186045516

;; result
#{["Wallace Collection"] 
  ["André Brasseur"] 
  ["Arthur Grumiaux"]
  ...}

Highly dynamic queries inhibit Datomic's resolution of entity identifiers in value position. The following query makes the reference attribute a dynamic ?reference input to query.

;; query
'[:find [?artist-name ...]
  :in $ ?country [?reference ...]
  :where [?artist :artist/name ?artist-name]
         [?artist ?reference ?country]]

;; inputs 
db, :country/BE, [:artist/country]

;; result
[]

Since the attribute itself is dynamic, Datomic does not know that the the variable ?reference is guaranteed to refer to a reference attribute, and will not perform entity identifier resolution for ?country. Unable to resolve :country/BE, the query returns no results.

There are two options for dealing with dynamic queries such as these.

  • Where possible, make the attribute specification static, as in the first example in this section.
  • Where attributes truly need to be dynamically specified, resolve the entity id yourself. The query below introduces a call to entid to resolve the entity:
;; query
[:find [?artist-name ...]
 :in $ ?country [?reference ...]
 :where [(datomic.api/entid $ ?country) ?country-id]
        [?artist :artist/name ?artist-name]
        [?artist ?reference ?country-id]]

;; inputs
db, :country/BE, [:artist/country]

;; result
#{["Wallace Collection"] 
  ["André Brasseur"] 
  ["Arthur Grumiaux"]
  ...}

Note that this ambiguity occurs only with the value component of a datom, which might be a reference or a scalar. Entity identifier resolution is always available for entity, attribute, and transaction, since those components are known to always be entities.