Datomic Best Practices

Plan for Accretion

Programs should not assume that an entity will be limited to the set of attributes that it has at a given point in time. The Schema Growth principle provides a means for entites to develop over time, and tying programmatic usage to a fixed set of attributes can lead to breakage in the face of growth.

For example, usage should be structured to handle a given set of attributes for an entity, not applying arbitrary usage to all possible attributes:

 (select-keys some-entity keys-i-understand))
;; not (do-side-effecty-thing-with-each some-entity)

Model Relationships in One Direction Only

When you create a reference type, Datomic automatically indexes the reference in both directions: from entity to value and from value to entity. The indices are equally efficient, so it is entirely redundant to manage relationships in both directions.

For example, the mbrainz sample schema connects artists and releases via the :release/artists attribute. There is no need for a separate :artist/releases attribute. The data clause below can find ?artist from ?release and/or ?release from ?artist.

[?release :release/artists ?artist]

Use Idents for Enumerated Types

Enumerated types should be represented by a reference pointing to an entity that has an ident. This is very efficient in memory and storage, as Datomic stores the ident name only once and allows any number of datoms to reference it.

The mbrainz sample schema demonstrates this with :artist/country. Note that transactions can use an ident directly as a reference value, eliding the indirection. The transaction data below creates a reference between artist-id and the entity whose ident is :country/GB

[artist-id :artist/country :country/GB]

Use Unique Identities for External Keys

If an attribute is used as a external key, you should set that attribute to be :db/unique :db.unique/identity. Unique identities can be domain specific identifiers, such as an account number or an email address.

For example, you could represent an ISO 3166-1 compliant country code external key with the following attribute definition:

{:db/ident :country/code
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/identity
 :db/doc "ISO 3166-1 alpha-3 country code"}

If you need to generate an ID that will be both unique and friendly to Datomic's indexing, use Datomic's SQUUID generator to create ids for an attribute of type :db.type/uuid.

Note that all entities have an internal key - the entity ID.

Use NoHistory for High-churn Attributes

For high churn attributes, such as a counter or version incrementer, the cost of storing history is frequently not worth the impact on database size or indexing performance. If you have a high-churn attribute that you don't expect to use in historical queries, you should set :db/noHistory to true.

Grow Schema and Never Break It

In a production system, one or more codebases depend on your data. In schema terms, growth is providing more schema while breakage is removing schema or changing the meaning of existing schema.

Growth migrations are suitable for production, and breakage migrations are, at best, a dev-only convenience.

Further details on growth vs. breakage can be found in Rich Hickey's 2016 Clojure/conj Keynote, Spec-ulation.

Datomic provides lightweight, flexible schema that can be grown as necessary to support new information about your system or domain. Growth is always additive, i.e. adding new attributes, adding new 'types', adding relationships between 'types':

;; Given some initial schema
{:db/ident :user/id
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one}

;; Grow system by adding schema:
{:db/ident :user/name
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one}

Although this recommendation may seem difficult, the following three sections provide approaches to facilitate growth without breakage.

Never Remove or Reuse Names

Removing a named schema component at any level is a breaking change for programs that depend on that name. Never remove a name. The meaning of a name is established when the name is first introduced. Reusing that name to mean something substantially different breaks programs that depend on that meaning. This can be even worse than removing the name, as the breakage may not be as immediately obvious.

Use Aliases

Instead of removing or reusing names, you should use aliases to allow multiple names to refer to a single schema element. Datomic allows multiple :db/idents to refer to a single entity ID.

For example, to create an alias, :user/primary-email that refers to the same schema element as an existing ident (:user/id) transact the following assertion:

[:db/add :user/id :db/ident :user/primary-email]

This new alias allows new programs to use the new :user/primary-email name, while adhering to the the prior section ensures that old programs that require :user/id will continue to function.

Annotate Schema

Because Datomic Schema is stored as data, you can and should annotate your schema elements with useful information that can both help users/readers understand the system and provide an understanding of how the schema has grown over time.

For example, in the case of a preferred newer schema option, you can add a :schema/see-instead flag and a :db/doc on the older schema element to point users at the new convention:

{:db/ident :user
 :schema/see-instead :user2
 :db/doc "prefer the user2 namespace for new development"}

Add Facts About the Transaction Entity

Most entities in a system model the "what" of your domain. Transactions provide a place to model "when", "who", "where", and "why".

As a part of every transaction, Datomic creates an entity to represent the transaction itself. This reified transaction automatically includes a "when" datom, :db/txInstant, which records the wall clock time that the transaction was recorded.

Transactions are ordinary entities: you can create attributes that are about transactions and query them just like any other datoms in Datomic.

For example, the following datom uses a :data/src attribute to report the source of data being imported from another system:


Use Lookup Refs to Specify Existing Entities

Database updates often have a two-step structure:

  • Query for database ids using an externally unique identifier
  • Use those database ids as part of a transaction

Lookup refs flatten this into a single step. With a lookup ref, you can specify an entity directly via an external identifier.

So instead of:

(def e (d/q [:find ?e .
             :where [?e :user/email "jdoe@example.com"]]))

;;then use
(d/transact conn [[:db/add 
                   "doc about John"]])

you can simply:

(d/transact conn [[:db/add
                   [:user/email "jdoe@example.com"]
                   "doc about John"]])

Use DbAfter to see Result of a Transaction

The transact function returns a future. On a successfully committed transaction, the value of this future is a map which can be used to retrieve information about the outcome of the transaction. This map contains the DbAfter value of the database immediately after the transaction is applied. This value can be retrieved as DB_AFTER in Java and :db-after in Clojure.

If you want to determine the results of applying a transaction to a database, using :db-after ensures that you will see only the impact of the transaction and no other changes to the database that may have been made in the time since.

Because :db-after can also be retrieved from with, you can examine the results of a prospective or real transaction identically by using :db-after.

The following example uses :db.fn/cas and :db-after in conjunction to verify that a deposit only increases the account by the correct amount:

(let [db (d/db conn)
      deposit 850.00
      acc-lookup [:account/number "123-45-6789"]
      balance-before (-> (d/pull db '[:account/balance] acc-lookup)
      tx-result (deref
                  (d/transact conn [[:db.fn/cas acc-lookup
                                     :account/balance balance-before
                                                      (+ balance-before deposit)]]))
      db-after (:db-after tx-result)
      balance-after (-> (d/pull db-after '[:account/balance] acc-lookup)
  [balance-before deposit balance-after])

In this case, the constraint of cas with db-after ensures that we're only seeing the impact of our transaction on this particular account balance. If we had instead used (d/db conn) to retrieve the latest version of the database in place of :db-after, it's possible the balance would have been changed again by the time we looked at it.

Set txInstant on Imports

When importing values (e.g. from another datastore), note that you can override Datomic's default timestamping behavior by setting :db/txInstant explicitly. In the typical case you will override this by using the original transaction time logged by the database from which you're importing. When you do this, note that you must choose a :db/txInstant value that is not older than any existing transaction's :db/txInstant value and not newer than the transactor's clock time as Datomic's :db/txInstant values must increase monotonically.

To write your own :db/txInstant value, add a map to the transaction with the special :db/id value "datomic.tx" and a supplied value for the :db/txInstant attribute.

{:db/id "datomic.tx"
 :db/txInstant #inst "2014-02-28"}]

Pipeline Transactions for Higher Throughput

Data imports will run significatly faster if you pipeline transactions using transact-async to maintain several transactions in-flight at the same time.

This can be accomplished using pipeline in core.async:

(ns your.namespace
  (:require [clojure.core.async :as a :refer (>!! <! >! go-loop)]
            [datomic.api :as d]))

(defn tx-pipeline
  "Transacts data from from-ch. Returns a map with:
     :result, a return channel getting {:error t} or {:completed n}
     :stop, a fn you can use to terminate early."
  [conn conc from-ch]
  (let [to-ch (a/chan 100)
        done-ch (a/chan)
        transact-data (fn [data]
                          @(d/transact-async conn data)
                        ; if exception in a transaction
                        ; will close channels and put error
                        ; on done channel.
                        (catch Throwable t
                          (.printStackTrace t)
                          (a/close! from-ch)
                          (a/close! to-ch)
                          (>!! done-ch {:error t}))))]

   ; go block prints a '.' after every 1000 transactions, puts completed
   ; report on done channel when no value left to be taken.
   (go-loop [total 0]
     (when (zero? (mod total 1000))
       (print ".") (flush))
     (if-let [c (<! to-ch)]
       (recur (inc total))
       (>! done-ch {:completed total})))

   ; pipeline that uses transducer form of map to transact data taken from
   ; from-ch and puts results on to-ch
   (a/pipeline-blocking conc to-ch (map transact-data) from-ch)

   ; returns done channel and a function that you can use
   ; for early termination.
   {:result done-ch
    :stop (fn [] (a/close! to-ch))}))

Additionally, an example of using a Java LinkedBlockingQueue for batch import is provided in the day of datomic repository.

Put the Most Selective Clause First in Query

The :where clauses of Datomic queries are executed in order. To minimize the work performed by the query engine, the most restrictive clauses should come before the less restrictive clauses, i.e.:

:where [?e :only/matches ?few]
       [?e :will/match ?many]

The query documentation provides an example illustrating this technique.

Prefer Query over Raw Index Access

Datomic's Datalog query is simple and declarative. Queries written in datalog are evident, readily optimized (in ways that may improve over time), and logic based. As such, Datomic query decouples logical specification from lookup implementation.

Leveraging Datalog's simple and declarative nature allows for the easy decomposition of queries. With little system knowledge you can troubleshoot query performance. For more details see the decomposing a query example in our day-of-datomic tutorial examples.

A common anti-pattern is to use the Datoms API for raw index access pre-emptively in order to chase a perceived performance gain. For the reasons stated above, Query should always be preferred and datoms should only be employed when raw index access is (1) iterating through part or all of a database in sequence, as in raw data export or paging through datoms sorted in AVET, and/or (2) a concrete performance gain can be made using datoms that can be directly linked to an application requirement.

Pass Collections As Inputs

You can use a collection binding in query to match several values specified in a collection. This behaves as a logical or, that is, it returns a union of the results for each item in the collection. When you want all results that match any of a given collection of entities or values, you should prefer a collection binding in a parameterized query over running multiple queries.

The following query uses a collection binding to return all names for artists based in Japan or Canada:

(d/q '[:find (pull ?a [:artist/name])
       :in $ [?c ...]
       :where [?a :artist/country ?country]
              [?country :country/name ?c]]
     db ["Canada" "Japan"])

;; Results
[[{:artist/name "The Guess Who"}]
 [{:artist/name "Whiskey Howl"}]
 [{:artist/name "Yoko Ono"}]
 [{:artist/name "Asakawa Maki"}]]

Use Pull to Retrieve Attribute Values

pull is usually the simplest way to retrieve a value for a specified attribute for a given entity. Using an entity identifier, calling pull is very straight forward - see the example below that retrieves the documentation for :db/ident:

(d/pull db [:db/doc] :db/ident)

It returns a map:

{:db/doc "Attribute used to uniquely name an entity."}

Inside of a query, you should use the :where clauses to identify entites of interest and prefer a pull expression to obtain attribute values for those entities. An example:

[:find [(pull ?e [:artist/name]) ...]
 :where [?e :artist/country :country/JP]]

Finds all artists from Japan then uses pull to list their names. Note that the results are now self documenting:

[{:artist/name "Flower Travellin' Band"}
 {:artist/name "The Tigers"}
 {:artist/name "Jacks"}

Note also that pull will not limit results that are missing attributes:

[:find [(pull ?e [:artist/name :artist/gender]) ...]
 :where [?e :artist/country :country/JP]]

Returns maps for artists that have no gender listed:

[{:artist/gender {:db/id 17592186045591} :artist/name "Yoko Ono"}
 {:artist/name "Far Out"}
 {:artist/gender {:db/id 17592186045591} :artist/name "Asakawa Maki"} ...]

Things to note in using pull instead of :where in query.

  • pull does not unify on attributes. While a :where clause will omit an entity that lacks a specified attribute, pull will return nil for a requested attribute if the entity does not have a value for it, or omit attributes with missing values in the case of a pull that specifies multiple attributes.

Things to note in using pull instead of entity.

  • pull eagerly returns raw data, i.e. maps and vectors.
  • entity returns a map-like, lazily realized associative data structure.
  • pull returns :db/ident values as maps, whereas entity returns them as keywords.

Put Blanks in Data Patterns

Blanks can be used as placeholders that match anything but do not bind or unify. This mbrainz example finds all countries which have an artist in the mbrainz database, using a blank to match any artist.

[:find [?c ...]
 :where [_ :artist/country ?c]]

Using a variable rather than the blank is considered an antipattern for these reasons:

  • blanks allow the query engine to avoid extra work tracking binding and unification for a dummy variable.
  • it is clear that a _ is meant to match everything - the same is not true for a dummy variable.

Use Query Inputs to Parameterize Queries and Leverage Caching

Datomic will cache queries, so long as the query (first) argument data structures evaluate as equal. As a result, reusing parameterized queries is much more efficient than building different query data structures. If you need to build data structures at run time for query, you should do so using a standard process so that equivalent queries will evaluate as equal.

The following query finds the names of all bands starting with "B", of type group, with a startyear of 1970:

[:find [?name ...]
 :where [?a :artist/type :artist.type/group]
        [?a :artist/name ?name]
        [?a :artist/startYear 1970]
        [(.startsWith ^String ?name "B")]]

A common antipattern is to make repeated adjustments to the query's data structure to change what facts it will match, e.g.:

[:find [?name ...]
 :where [?a :artist/type :artist.type/group]
        [?a :artist/name ?name]
        [?a :artist/startYear 1971]
        [(.startsWith ^String ?name "C")]]

Instead, we should parameterize the query so that we can call it multiple times with different inputs:

[:find [?name ...]
 :in $ ?year ?letter
 :where [?a :artist/type :artist.type/group]
        [?a :artist/name ?name]
        [?a :artist/startYear ?year]
        [(.startsWith ^String ?name ?letter)]]

;; inputs
mbrainz-db 1971 "C"

Use a Consistent Db Value for a Unit of Work

A database value is immutable and calls to query, pull, or entity on the same database value will return consistent results. The result of calling the database function on a connection will vary over time, and therefore calling the function repeatedly in sequence can retrieve facts that do not share a common time basis.

You should use a single database value for a unit of work in order to maintain consistency and to reap the resulting benefits for testing, predictability, and composability.

Specify t Instead of txInstant for Precise asOf Locations

Datomic's own t time value exactly orders transactions in monotonically ascending order, with each transaction receiving its own t value. By contrast, wall clock times specicified by txInstant are imprecise as more than one transaction can be recorded in the same millisecond.

For filtering databases by time, establishing the relative order of events in a historical database, or any other time based operation requiring exact precision you should always prefer the t value.

For example, if we're interested in the wall clock time in which The Rolling Stones were added to the mbrainz database, we might use a query like this:

'[:find ?txInstant .
  :where [?a :artist/name "The Rolling Stones" ?tx]
         [?tx :db/txInstant ?txInstant]]

But if we want to use the Log API to inspect the other data transacted along with this fact, we should get the transaction ID itself:

[:find ?tx .
 :where [?a :artist/name "The Rolling Stones" ?tx]]

This transaction id can be converted to a t value with exact precision, or itself used as an argument to e.g. txRange where it can be used to retrieve the datoms in the transaction.

(-> (d/tx-range log tx (inc tx))

Use the History Filter for Audit Trail Queries

Use the history filter when trying to answer audit trail questions such as "What did we know about this customer when we confirmed this purchase?" or "In which of these four time ranges was this customer active?" Without the history filter, your database will be limited to a particular time view and you will not see all historical datoms, thereby missing facts that have been since retracted.

The following example uses a history database to retrieve all purchases ever entered for a user as well as the time the purchase was logged by the system. Since it uses a history database, it includes facts that were once asserted but have been later retracted.

(d/q '[:find ?p ?inst ?added
       :in $ ?userid
       :where [?u :user/purchase ?p ?tx ?added]
              [?u :user/id ?userid]
              [?tx :db/txInstant ?inst]]
     (d/history db) userid)

This will bind ?added to true if the datom is an assertion and false if the datom is a retraction, returning results like:

[["Bread" #inst "2014-11-18T03:41:59.301-00:00" false]
 ["Bread" #inst "2014-11-18T03:41:37.031-00:00" true]
 ["Soda" #inst "2014-11-18T03:41:39.617-00:00" true]
 ["French Fries" #inst "2014-11-18T03:41:56.115-00:00" true]]

We can see a cancelled (retracted) order and explain why we sent a loaf of bread to shipping at 3:41:43 even though there was no longer a bread order in the shipping queue when the loaf arrived at 3:43:15.

Pass Multiple Points-in-time to a Single Query

When using filters, you will often need to use multiple points in time for a database. A common mistake encountered with using the since filter, for example, is to omit the unfiltered (or differently filtered) database that is the basis for the facts about the identity of the entity for which the query is being made.

An example of using two points in time is provided in the day-of-datomic sample repository. This query:

[:find ?count .
 :in $ $since ?id
 :where [$ ?e :item/id ?id]
        [$since ?e :item/count ?count]]

Requires two databases as input. The item's id was transacted before the time range was defined by the since filter, so we look up the entity id from its domain id using the unfiltered database. We then join that to the :item/count from the filtered database.

Use the Log API if Time is Your Most Selective Criterion

If you are most interested in retrieving values by or at specific times, the Log API is the appropriate (and most efficient) way to do so. The log is ordered by transaction time and provides fast lookup by T.

Transactions can be retrieved directly by tx-range in the Clojure API or txRange in the Java API. The log can also be used as an input to query.

The Log API provides access to the transaction id and data for each transaction. These can be retrieved from the map returned by tx-range as :data and :t, or retrieved in query using the helper functions tx-ids and tx-data.