Datomic Deployment

This guide includes advice for managing Datomic deployments:

Getting Connected

A Datomic deployment consists of a storage service, an active transactor, a standby transactor, and one or more peers. These processes discover each other as follows:

  • A transactor process discovers storage via configuration specified in the transactor properties file.
  • The active transactor then writes its location (an IP address) into storage as part of heartbeat.
  • A second transactor will stay in standby mode unless the active misses two heartbeats.
  • Peers connect to storage using the URI or map specified in their call to connect, read the location of the active transactor from storage, then connect to the transactor using HornetQ.

To provision a Datomic system and get it up and running, follow the steps in the storage services topic.

Getting Reconnected

Processes may become unavailable for various reasons:

  • A process or machine crashes.
  • A process loses network connectivity to other processes.
  • A process stops operating in a timely manner.
  • An operator kills a process deliberately, e.g. as part of an upgrade.

Datomic handles all these causes of unavailability in a unified way, described below.

Transactor Failover

Run a standby transactor to minimize loss of availability. Instructions for running a standby transactor can be found in the High Availability section.

Transactor processes are designed to fail fast and allow a (hopefully healthier) process to take over, see High Availability. A transactor failover is not a bug, but frequent failovers indicate an operational problem.

Monitoring Heartbeat

If you are using CloudWatch to monitor transactors, you will see a HeartbeatMsec metric for every heartbeat from an active transactor, and you will see a HeartMonitorMsec metric for every check made by a standby transactor. Note that you can use CloudWatch metrics a la carte, regardless of which storage you are using, even if no other part of your application is running in the cloud.

Peer Failover

The peer library is designed to survive and automatically reconnect in the event of either storage or transactor failure, see Peer Failover.

When a transactor or storage becomes unavailable, peers will automatically seek to reconnect, with no special action required by application code. During a period of unavailability, calls to Connection.db() will return the most recent database value available locally in the Peer, so that peers can continue to read and query during transactor outage. Such queries have a sound point-in-time basis, accessible through Database.basisT() and the :db/txInstant property of the most recent transaction. However during such a transactor outage, Peers might not have access to the most recent database value. The database future returned by Connection.sync() will block until the transactor becomes available, should a Peer absolutely require the latest database value. These techniques allow peers to choose to continue reading during transactor failover, or to wait until the transactor becomes available.

Peer application code should be prepared for the possibility of transient write failure in the event of transactor unavailability and implement appropriate retry strategies. Datomic will not automatically retry arbitrary transactions. Not all transactions are idempotent, and transactions may have succeeded, even if the communication back to the peer failed.

Given that a peer must discover on reconnect whether or not a transaction has succeeded, you can handle retry logic with a variety of strategies. In general you will want to query the database to determine what happened from a time basis after the transaction failure/success. Some example strategies:

Troubleshooting

When in doubt, please ask on the mailing list or open a ticket. We are here to help. Use logs and metrics to monitor your system and gather the information necessary to characterize the problem you're seeing. Below is a list of common configuration problems and their typical solutions.

Do You Have Enough Memory?

Datomic uses timeouts in several places:

  • An active transactor that is too slow to write its heartbeat will self destruct and be replaced by a standby transactor.
  • If the peer is too slow to ping the transactor (see datomic.peerConnectionTTLMsec), the peer will renegotiate the connection with the transactor.

If timeouts occur even though the network and storage layers seem healthy, the most likely cause is long garbage-collection (GC pauses) caused by a process not having enough memory. When looking for the culprit, keep the following points in mind.

  • Peers run arbitrary application code, and are far more likely than transactors to encounter memory issues.
  • Transactor memory issues are usually traceable to oversized transactions or to badly-behaved transaction functions.

If a process has a memory problem that you do not understand, do not conceal it by configuring the process with more memory. This will not fix an actual leak, it will just change the failure mode, leaving you a much larger memory dump to study when symptoms manifest.

Are Your Processes Isolated?

Datomic is a distributed system. Storage services, transactors, and peers are designed so that load on one process has minimal impact on other processes (while delivering on Datomic's semantic promises).

As a convenience for development, you can 'undistribute' Datomic by running more than one process on the same virtual or physical hardware, e.g. running storage, transactor, and peer on a developer laptop. This 'development mode' cannot deliver the reliability or performance of a production deployment, as processes are competing for a single resource (the machine), and a single machine failure brings all processes down.

System tests and production should be performed against a properly distributed installation, with each process running on its own (possibly virtual) hardware. The Datomic team does not support troubleshooting performance or reliability problems on non-distributed installations of Datomic.

Transactor Fails to Start

If a transactor process fails to start, it is likely misconfigured. Check the logs for a WARN, ERROR, or FAIL level message, which should point to the problem.

If the transactor that fails to start is a CloudFormation transactor on AWS that you generated with Datomic's ensure-transactor and create-cf-template process (i.e. it uses the Datomic AMI we provide) and it fails to rotate logs, make sure:

  • That your memory settings are valid (e.g. heap fits within instance size)
  • That your license key is valid (test transactor launch locally)
  • That you are using a supported AWS instance type (e.g. Datomic requires a writable file system).
    • Supported instance types can be found in a Datomic-generated cloud formation template under the key "AWSInstanceType2Arch".

If the transactor fails to write heartbeat at launch, verify that the storage connection information Datomic has can be used to reach the storage, e.g. by using to invoke connectins with psql, cqlsh, etc.

Transactor Starts then Fails Later

If a transactor fails after running successfully for some time, the most likely causes are storage unavailability or a pause/failure of the transactor process.

In the case of storage unavailability, root causes could be:

  • A network partition/reachability issue.
    • Between transactor and storage.
    • Between nodes in a distributed storage.
  • Storage latency.
    • Use StoragePutMsec and StorageGetMsec metrics to diagnose.
    • Back pressure from exceeding DynamoDB provisioning.
  • Latency/pauses on storage introduced by:
    • Virtualization or GC pauses on storage instance(s)
    • Storage ops, such as log replication, backup, or VACUUM events.
  • Failure of an eventually consistent storage to provide a transactional or consistent write with sufficiently low latency to Datomic when required.

In the case of a pause on the transactor, likely root causes are:

  • JVM Garbage Collection
    • Most common cause is a transactor configuration which has had its defaults overriden by a custom transactor launch script or arguments passed to the transactor other than Xmx or Xms. Make sure in these cases to include the default -XX:+UseG1GC -XX:MaxGCPauseMillis=50 JVM GC related args.
  • Virtualization/VM level pause
    • These can usually be discerned from pauses in logs that are also correlated with pauses in other system logging, but are otherwise difficult to diagnose directly.
    • These may turn out to be an unalterable reality of your data center and need to be accomodated by higher values for heartbeat-interval-msec.
  • Interactions with other processes on the same box (not a supported config)

Peer Fails to Connect

Peers must connect to storage then the transactor. Failures to connect to storage will appear as errors thrown by the storage driver for that particular storage, e.g. an org.postgresql.util.PSQLException on Postgres. Failures to connect to the transactor will be HornetQ errors.

If a peer can't connect to a transactor, the most likely culprit is transactor configuration, not peer configuration. The transactor writes the values of host and alt-host to storage as part of heartbeat. The peer will attempt to connect to host first, then alt-host. A peer must be able to reach the transactor at one of these to connect to it.

The first step is to verify that the transactor machine can be reached from the machine the peer runs on using one the value it sets for host and alt-host. If you run in a containerized or dynamic environment, you may have to generate the transactor's host and alt-host values at launch, e.g. with scripts or command line tools that can pipe in some discoverable values for the host or alt-host setting.

If you've ruled out the transactor and storage, other likely suspects are:

  • Peer network configuration or permissions to connect to the transactor.
  • Peer connection will exceed process limit of license.

Peer Connects then Fails Later

If a peer connects to the transactor and runs for some time and the connection to the transactor fails later, you should take the following steps to diagnose.

  • Correlate with transactor logs or metrics. Was there a transactor failover, a large number of transactions runs, an increase in transactor load, etc.
  • Look at peer and transactor memory use - HornetQ failures can be caused by memory pressure.
  • Be suspicious of GC pauses on both the transactor and the peers. As peers run application code, they're more likely to experience GC pauses. We recommend troubleshootinhg suspected GC pauses by collecting logs using these JVM args:
-Xloggc:{file} -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCCause
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCDateStamps
  • If you identify garbage collection as a culprit, you can look to either:
    • Tune your application and/or its JVM GC settings so as to lengthy garbage collection pauses.
    • Increase heartbeat-interval-msec and datomic.peerConnectionTTLMsec to make the peer to transactor connection more tolerance of these pauses.

Common Problems

Unsupported protocol

Attempting to connect to a db without the proper dependencies will throw an error indicating the protocol is unsupported. The datomic-pro package is not in maven central. Instructions for including datomic pro and downloading datomic can be found here.

Unsupported protocol error:

CompilerException java.lang.IllegalArgumentException: :db.error/unsupported-protocol Unsupported protocol :dev, compiling:(core.clj:36:11) 

Misconfigured data directory

Transactors use the data-dir specified in the transactor properties file as the storage location for the dev and free storage protocols as well as for temporary storage with all storage protocols. The data-dir path must be writable by the Datomic process and the appropriate filesystem permissions must be granted to allow Datomic to create additional temporary directories within the data-dir.

Insufficient write and/or create-directory permissions can result in the error:

WARN  default    datomic.index - {:message "merge-db failed", :pid 11316, :tid 22}
java.io.IOException: No such file or directory
    at java.io.UnixFileSystem.createFileExclusively(Native Method) ~[na:1.7.0_101]

Storage-Level Problems

If the storage service fails to read or write data, Datomic will throw an exception on any data access that needs to read a segment that the underlying storage cannot return. In particular, warnings in the transactor log of this type:

WARN default datomic.index - {:tid 645868, :pid 20809, :message "merge-one-index failed"} 
java.lang.Exception: Key not found: 

indicate that the key provided in the error cannot be returned by the underlying storage. For eventually consistent backend storages (i.e. Cassandra, DynamoDB), the loss may be temporary and self-resolving (this is frequently the case during node rebalancing).

Peers Fail to Connect to Transactor

Datomic uses storage to coordinate transactor location. The network location of the transactor is specified to the transactor via the host and alt-host values in the transactor properties file. Peers connect to storage, lookup the transactor endpoint, then connect to the transactor.

Peers must be able to reach the transactor at the address specified by either the host or the alt-host property. The lack of proper entries for host and alt-host in the transactor properties file will result in the following error upon attempting to connect a peer:

Caused by: clojure.lang.ExceptionInfo: Error communicating with HOST 0.0.0.0 on PORT 4334 {:alt-host nil, :peer-version 2, :password "...", :username "...", :port 4334, :host "0.0.0.0", :version "0.9.5130", :timestamp 1449708475587, :encrypt-channel true}
   at clojure.core$ex_info.invoke(core.clj:4593)

   Caused by: HornetQException[errorType=NOT_CONNECTED message=HQ119007: Cannot connect to server(s). Tried with all available servers.] 

Further description of the peer connection process is available here and here.

Queries Fail With Thread Pool Shutdown

Calling the Datomic API shutdown or Clojure shutdown-agents function will shut down the Peer library. If either of these functions is called prior to an attempt to run a Datomic query, the query will fail with the exception:

Task java.util.concurrent.FutureTask@7ff48e13 rejected from java.util.concurrent.ThreadPoolExecutor@3407eed0[Shutting down, pool size = 7, active threads = 7, queued tasks = 0, completed tasks = 7]
        datalog.clj:1454 datomic.datalog/eval-rule[fn]
        datalog.clj:1434 datomic.datalog/eval-rule

Upgrading a Live System

It is possible to update both peers and transactors live in a running system, subject to the constraints described below. If the update you are performing is a compatible one, you can update the transactor and the peers piecemeal, in any order you choose, without ever taking down the entire system.

You should always test a new release of Datomic in a staging environment before upgrading your production environment. This is discussed separately in the next section.

To upgrade the transactor, start a new transactor (or pair of transactors) based on the release of Datomic you are upgrading to. Once these processes are up and monitoring the storage heartbeat, kill the old transactors and the new ones will automatically take over.

To upgrade peer applications, simply start new peers and take down old ones. You can stagger the ups and downs to maintain availability during the upgrade.

Testing Upgrades in a Staging Environment

It it always a good idea to test new releases of Datomic in a staging environment before upgrading to production, as follows:

  • Make your staging system as similar to production as you feasibly can.
  • Make sure that your staging and production environments are fully separated.
    • You should not connect to both the staging and production system from the same peer (and you should enforce this with firewalls, AWS roles or security groups, etc.).
    • You should not use the same storage instance/cluster for staging and production.

With a suitable staging configuration, follow these steps to test a Datomic upgrade:

  • Backup the production system.
  • Restore it into the storage used by your staging system.
  • Start your staging system including transactor and peers.
    • Use the same upgrade procedure as you plan to on production, e.g. rolling upgrades of peers.
  • verify that your application(s) work correctly.

Monitoring Upgrades

When upgrading a Datomic system's transactors, you should monitor metrics to determine when you can safely failover to the new transactors. In particular, watch for HeartMonitorMsec sum and count (data samples) metrics to be reported as any new transactors will launch as standby.

For example, with heartbeat-interval-msec defaults of 5000, you will see a count of 12 and sum of around 60000 for both HeartbeatMsec (on the active transactor) and HeartMonitorMsec (on the standby transactor). If you launch two additional transactors with the ugpraded version of Datomic you will see the HeartMonitorMsec count change to ~36 and sum increase to ~/18000/ - assuming that metrics are reported to an aggregator in the same namespace. Otherwise, e.g. if you monitor transactors individually, you will see HeartMonitorMsec metrics reported by each transactor when it is up and healthy.

When you see HeartMonitorMsec reported by the newly launched transactor(s), you are safe to kill the old transactor processes/VMs and failover to the upgraded transactor(s). When the upgraded transactor becomes active it will begin reporting HeartbeatMsec metrics. In the failover window, you may see Transactor unavailable error messages on the peers which will resolve when the new transactor has taken over. These should go away when the new active transactor begins heartbeating.

Compatibility Across Datomic Releases

It is a design objective of Datomic to preserve compatibility across releases to the greatest extent possible. Compatible releases can be used together in a single system, i.e. two (or more) distinct releases of the software running in different processes.

The Datomic downloads page links to a changes file for every release. The vast majority of these changes are compatible changes, and incompatible changes are highlighted in the text. When migrating between releases, you should always review all changes between the two releases.

As of January 2016, there have been three compatibility-breaking releases that require that you shut down all Datomic processes on upgrade:

  • 0.9.4609 (2014-03-13)
  • 0.9.4470 (2014-02-17)
  • 0.8.3705 (2013-01-09)

See the release notices page for documentation of all critical release notices.