High Availability
High Availability (HA) is a Datomic Cloud feature ensuring the availability of Datomic for reads and writes in the event of node instance failures.
Datomic is designed to have no single point of failure anywhere in the database stack. In addition, Datomic's extensive caching means that even cold nodes will typically become warm without having to wait on the high-latency storage of record (S3).
Enabling HA
To enable HA, use the Production Topology.
The Production Topology is always highly available, with no configuration required. It mandates a minimum of two nodes, and allows more nodes for scaling purposes.
The Solo Topology is never highly available. The remainder of this document is about the Production Topology.
How HA Works
At any point in time a database has a preferred node for transactions. In normal operation, all transactions for a database flow to and through that node. If for any reason (e.g. a temporary network partition) the preferred node can't be reached, any node can and will handle transactions. Consistency is ensured by compare-and-swap (CAS) at the DynamoDB level. This situation increases contention on DynamoDB and decreases throughput, so if the condition persists (or the preferred node disappears), a different node becomes preferred. This is all immediate; there are no transfer or recovery intervals. It is therefore not like the mastership transfer and failover of Datomic On-Prem (and many other databases), but neither should it be confused with parallel multi-writer systems (à la Cassandra).
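The coordination primitive here is a conditional write. The sketch below illustrates that idea only; the table, key, and attribute names are hypothetical and are not Datomic's actual internal layout. A DynamoDB conditional PutItem succeeds for exactly one writer when two nodes race to record the same log entry:

```clojure
(require '[cognitect.aws.client.api :as aws])

;; Illustration of compare-and-swap via a DynamoDB conditional write.
;; Table, key, and attribute names are hypothetical, not Datomic internals.
(def ddb (aws/client {:api :dynamodb}))

(defn try-claim-entry
  "Attempt to write log entry number `t`, succeeding only if no other
   node has already written it. Returns true if this writer won the race."
  [t data]
  (let [result (aws/invoke ddb
                 {:op :PutItem
                  :request {:TableName "example-log"
                            :Item {"t"    {:N (str t)}
                                   "data" {:S data}}
                            ;; Rejected if an item with this key already exists.
                            :ConditionExpression "attribute_not_exists(#t)"
                            :ExpressionAttributeNames {"#t" "t"}}})]
    ;; aws-api signals failure by returning an anomaly map.
    (not (:cognitect.anomalies/category result))))
```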
In the event of node failures, Datomic will remain available for transactions (albeit with reduced capacity) as long as even a single primary compute node stays healthy!
All nodes implement the entire Datomic API, so HA covers all Datomic functionality: query, transactions, and database administration.
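For example, the same client and connection can be used for administration, transactions, and query, regardless of which node behind the NLB happens to serve each request. A minimal Clojure sketch using the Datomic Cloud client API; the region, system name, endpoint, and database name are placeholders:

```clojure
(require '[datomic.client.api :as d])

;; The client talks to the compute group's endpoint (the NLB);
;; any healthy node behind it can serve any of these calls.
(def client
  (d/client {:server-type :cloud
             :region      "us-east-1"
             :system      "my-system"
             :endpoint    "https://entry.my-system.us-east-1.datomic.net:8182/"}))

;; Database administration...
(d/create-database client {:db-name "movies"})

(def conn (d/connect client {:db-name "movies"}))

;; ...transactions...
(d/transact conn {:tx-data [{:db/ident       :movie/title
                             :db/valueType   :db.type/string
                             :db/cardinality :db.cardinality/one}]})

;; ...and query all go through the same connection.
(d/q '[:find ?title
       :where [_ :movie/title ?title]]
     (d/db conn))
```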
Nodes are accessed via a Network Load Balancer (NLB), and their health is managed via Auto Scaling Group (ASG) health checks. If either the ASG or the node itself decides that a node is unhealthy, the ASG will:
- start a replacement node
- remove the ailing node from the NLB
- terminate the ailing node
Programs Should Be Ready for Transient Anomalies
When a node becomes unhealthy for any reason, client requests routed from the NLB to that node may experience slow responses or transient unavailable anomalies. (Client requests that reach one of the other, healthy nodes during this time will experience normal behavior and performance.)
Programs should be implemented to detect and handle transient anomalies in a manner appropriate to the program's needs.
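One common pattern, sketched below in Clojure, is to retry operations that fail with a retryable anomaly category (such as unavailable) after a short backoff, and to let all other errors propagate. The specific retry policy here (attempt count, linear backoff) is an assumption to adapt to the program's needs:

```clojure
(require '[datomic.client.api :as d])

;; Anomaly categories generally considered safe to retry.
;; (Adjust to your application's idempotency requirements.)
(def retryable?
  #{:cognitect.anomalies/unavailable
    :cognitect.anomalies/interrupted
    :cognitect.anomalies/busy})

(defn with-retry
  "Call (f) and return its result, retrying up to `max-attempts` times
   with linear backoff when a retryable anomaly is thrown."
  [f & {:keys [max-attempts backoff-ms] :or {max-attempts 3 backoff-ms 500}}]
  (loop [attempt 1]
    (let [result (try
                   {:ok (f)}
                   (catch clojure.lang.ExceptionInfo e
                     (if (and (< attempt max-attempts)
                              (retryable? (:cognitect.anomalies/category (ex-data e))))
                       {:retry e}
                       (throw e))))]
      (if (contains? result :ok)
        (:ok result)
        (do (Thread/sleep (* attempt backoff-ms))
            (recur (inc attempt)))))))

;; Usage: wrap a transaction so a brief node replacement does not fail the caller.
;; `conn` and the tx-data are placeholders.
(comment
  (with-retry #(d/transact conn {:tx-data [{:db/doc "example"}]})))
```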