High Availability

High Availability (HA) is a Datomic Cloud feature ensuring the availability of Datomic for reads and writes in the event of node instance failures.

Datomic is designed to have no single point of failure anywhere in the database stack. In addition, Datomic's extensive caching means that even cold nodes typically become warm quickly, without having to wait on the high-latency storage of record (S3).

Enabling HA

The Production Topology is always highly available, with no configuration required. It mandates a minimum of two nodes and allows more nodes for scaling.

The Solo Topology is never highly available. The remainder of this document is about the Production Topology.

How HA Works

At any point in time, a database has a preferred node for transactions. In normal operation, all transactions for a database flow through that node. If for any reason (e.g. a temporary network partition) that node cannot be reached, any node can and will handle transactions; consistency is ensured by compare-and-swap (CAS) at the DynamoDB level. This fallback increases contention on DynamoDB and decreases throughput, so if the condition persists (or the preferred node disappears entirely), a different node becomes preferred. Failover is immediate; there are no transfer or recovery intervals. It is therefore unlike the mastership transfer and failover of Datomic On-Prem (and many other databases), but neither should it be confused with a parallel multi-writer design (à la Cassandra).
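Conceptually, this is ordinary compare-and-swap: concurrent writers race on a conditional write, exactly one wins, and the loser must re-read and retry. The following is an illustration of CAS semantics only, using Clojure's compare-and-set! on an atom; it is not Datomic's DynamoDB implementation, and the names are invented for the example:

    ;; Illustration of CAS semantics only -- not Datomic's internals.
    ;; Two "writers" race to advance a shared log pointer from the same
    ;; observed value; exactly one compare-and-set! succeeds.
    (def log-tail (atom {:t 100}))

    (let [observed @log-tail
          a (future (compare-and-set! log-tail observed {:t 101}))
          b (future (compare-and-set! log-tail observed {:t 101}))]
      [@a @b])
    ;;=> [true false] or [false true] -- the loser must re-read and
    ;; retry, which is the extra contention described above.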

In the event of node failures, Datomic will remain available for transactions (albeit with reduced capacity) as long as even a single primary compute node stays healthy!

All nodes implement the entire Datomic API, so HA covers all Datomic functionality: query, transactions, and database administration.
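As a quick illustration, a single client handle suffices for all three; the sketch below (with a hypothetical region and system name) exercises transactions, query, and administration through whichever node the ALB routes each request to:

    (require '[datomic.client.api :as d])

    ;; Region and system name are hypothetical placeholders.
    (def client (d/client {:server-type :cloud
                           :region "us-east-1"
                           :system "my-system"}))

    (def conn (d/connect client {:db-name "movies"}))

    ;; Transactions, query, and administration all go through the same
    ;; client; any healthy node can serve any of them.
    (d/transact conn {:tx-data [{:db/doc "hello from any node"}]})
    (d/q '[:find ?e :where [?e :db/doc]] (d/db conn))
    (d/list-databases client {})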

Nodes are accessed via an Application Load Balancer (ALB), and their health is managed via Auto Scaling Group (ASG) health checks. If either the ASG or the node itself decides that a node is unhealthy, the ASG will take the following steps (a sketch for observing them via the AWS API appears after the list):

  • start a replacement node
  • remove the ailing node from the ALB
  • terminate the ailing node
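These health transitions are visible through the normal AWS APIs if you want to observe them from outside the system. A minimal sketch using Cognitect's aws-api library; the group name "my-system-compute" is a hypothetical placeholder (look up the actual ASG name for your system in the AWS console):

    (require '[cognitect.aws.client.api :as aws])

    ;; Assumes the com.cognitect.aws/autoscaling dependency is on the
    ;; classpath and credentials come from the default provider chain.
    (def autoscaling (aws/client {:api :autoscaling}))

    ;; Report (instance-id, health, lifecycle) for each node in the group.
    (->> (aws/invoke autoscaling
                     {:op :DescribeAutoScalingGroups
                      :request {:AutoScalingGroupNames ["my-system-compute"]}})
         :AutoScalingGroups
         (mapcat :Instances)
         (map (juxt :InstanceId :HealthStatus :LifecycleState)))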

Programs Should Be Ready for Transient Anomalies

When a node becomes unhealthy for any reason, client requests that are routed from the ALB to that node may experience slow responses or transient unavailable anomalies. (Client requests that reach one of the other, healthy nodes during this time will experience normal behavior and performance.)

Programs should be implemented to detect and handle transient anomalies in a manner appropriate to the program's needs, for example by retrying retryable operations with bounded backoff, as in the sketch below.
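A minimal sketch of such handling, assuming the Datomic Client API, an existing connection conn, and the anomaly categories conventionally considered retryable (busy, unavailable, interrupted). The names with-retry, max-tries, and sleep-ms are illustrative, not part of any API:

    (require '[datomic.client.api :as d])

    (def retryable?
      "Anomaly categories that indicate a transient condition worth retrying."
      #{:cognitect.anomalies/busy
        :cognitect.anomalies/unavailable
        :cognitect.anomalies/interrupted})

    (defn with-retry
      "Calls thunk f, retrying with exponential backoff when it throws
      an exception whose ex-data carries a retryable anomaly category."
      [f & {:keys [max-tries sleep-ms] :or {max-tries 5 sleep-ms 100}}]
      (loop [tries 1, sleep sleep-ms]
        (let [result (try
                       {:ok (f)}
                       (catch Exception e
                         (if (and (< tries max-tries)
                                  (retryable? (:cognitect.anomalies/category
                                               (ex-data e))))
                           ::retry
                           (throw e))))]
          (if (map? result)
            (:ok result)
            (do (Thread/sleep sleep)
                (recur (inc tries) (* 2 sleep)))))))

    ;; Usage -- conn is assumed to be an existing Datomic Cloud connection:
    (with-retry #(d/transact conn {:tx-data [{:db/doc "hello"}]}))

Blind retry is only safe when the operation is idempotent. Queries are always safe to retry; for transactions, consider keying asserted entities by a unique identity so that a duplicated retry is harmless.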