Monitoring and Performance

This page covers tuning tips, custom monitoring callbacks, CloudWatch configuration and metrics, and alarms.

Metric names are identified in the text by surrounding square brackets, e.g. [TransactionMsec]. Each reported value of a metric can include up to five different statistics, which are capitalized where they appear in the text: Samples, Average, Minimum, Maximum, and Sum.

Tuning Tips

Give your peers and transactor enough memory

You can set the maximum memory available to a JVM process with the -Xmx flag to java (or to bin/transactor). The more memory you give a process, the more database index it will be able to cache locally in memory, avoiding trips to memcached or storage.
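To confirm that an -Xmx setting actually reached a process, you can ask the JVM for its heap ceiling. The following is a minimal sketch; the class name HeapReport is hypothetical and not part of Datomic.

```java
// Hypothetical helper (not part of Datomic): reports the heap ceiling of the running JVM.
final class HeapReport {
    /** Maximum heap in MB as seen by this JVM; reflects -Xmx when that flag is set. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this process: " + maxHeapMb() + " MB");
    }
}
```

Running it with different -Xmx values (for example java -Xmx4g HeapReport.java on a recent JDK) should show the reported ceiling change accordingly.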

Use memcached

Memcached is a great fit for Datomic. Read more about configuring memcached in the Caching documentation.

Give memcached enough memory

Memcached can deliver data to peers with very low latency (on the order of 1 millisecond for a segment). If you can fit your entire database (or even the commonly-used working set) in memcached, your read and query performance will be quite good.

Diagnostic: If your [Memcache] Average is substantially less than 1.0, then you can serve more reads from memcache by increasing the size of your memcached cluster.

Add more peers

Datomic reads and queries scale horizontally. If you can't serve as many simultaneous queries as you would like, add more peers.

Add more cores

Datomic's entire public API is threadsafe, and is designed to automatically take advantage of multiple cores where available. Having more cores will help any Datomic process, transactor or peer.

Don't benchmark against the memory database

The memory database is a convenience for development, testing, and temporary data. It is a poor fit for storage-sized datasets, because memory isn't storage-sized.

Provision enough DynamoDB capacity

See the DynamoDB section of Capacity Planning.

Run large imports on cheap storage, then backup/restore

Restoring a database consumes significantly less write capacity than building the database via transactions. So if your target system has metered storage pricing (e.g. DynamoDB), it may make sense to do large import jobs on less expensive storage, backup, and then restore to the target storage.

Use metrics to analyze write activity

The speed with which a Datomic transactor can write data is measured with the [TransactionMsec] metric. The minimum, average, and maximum values of this metric, combined with the size of your transactions, characterize the write latency of the system from the transactor's perspective. (To measure write latency from a peer or client's perspective, instrument your code to measure the time from when transact is called until the transaction future is realized.)
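The peer-side measurement described in the parenthetical above could be sketched as follows. This is an illustration under assumptions, not Datomic's API: WriteLatency, Timed, and the submit supplier are hypothetical names, and the supplier stands in for whatever call returns your transaction future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical helper: times a call from submission until its future is realized.
final class WriteLatency {
    /** The realized value together with the elapsed wall-clock time. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;
        Timed(T value, long elapsedNanos) { this.value = value; this.elapsedNanos = elapsedNanos; }
        double elapsedMsec() { return elapsedNanos / 1_000_000.0; }
    }

    /**
     * Starts the clock, invokes submit (assumed to wrap a transact call),
     * blocks until the returned future is realized, and reports the elapsed time.
     */
    static <T> Timed<T> time(Supplier<? extends Future<T>> submit)
            throws InterruptedException, ExecutionException {
        long start = System.nanoTime();
        Future<T> future = submit.get();
        T value = future.get();
        return new Timed<>(value, System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real transaction: a future completed on another thread.
        Supplier<Future<String>> submit = () -> CompletableFuture.supplyAsync(() -> "tx-result");
        Timed<String> t = time(submit);
        System.out.printf("realized %s in %.3f msec%n", t.value, t.elapsedMsec());
    }
}
```

Because the clock starts before the call is made, the result includes queueing and network time that [TransactionMsec] does not see.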

The [TransactionBytes] Samples metric, again combined with the size of your transactions, characterizes the write throughput of the system. Note that to achieve maximum throughput you will need to pipeline transactions, i.e. arrange to have multiple transactions in flight at the same time.
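One way to keep several transactions in flight is to bound the number of unfinished submissions with a semaphore. The sketch below assumes an asynchronous submit function that returns a CompletableFuture; Pipeline, submitAll, and submitAsync are hypothetical names, and adapting your transaction future to CompletableFuture is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical helper: submits batches with a bounded number in flight.
final class Pipeline {
    /**
     * Submits every batch through submitAsync (a stand-in for an asynchronous
     * transact call), allowing at most maxInFlight unfinished submissions,
     * then waits for all of them. Results are returned in submission order.
     */
    static <B, R> List<R> submitAll(List<B> batches,
                                    Function<B, CompletableFuture<R>> submitAsync,
                                    int maxInFlight) throws InterruptedException {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be at least 1");
        }
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<R>> pending = new ArrayList<>();
        for (B batch : batches) {
            permits.acquire(); // blocks while the pipeline is full
            CompletableFuture<R> f;
            try {
                f = submitAsync.apply(batch);
            } catch (RuntimeException e) {
                permits.release(); // submission never started, so free the slot
                throw e;
            }
            f.whenComplete((result, error) -> permits.release());
            pending.add(f);
        }
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> f : pending) {
            results.add(f.join()); // a failed submission surfaces as CompletionException
        }
        return results;
    }
}
```

A suitable value for maxInFlight depends on your transaction sizes and storage, so treat it as something to tune while watching the write metrics rather than a fixed recommendation.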

Custom Monitoring

You can integrate custom monitoring in both transactors and peers by registering a callback. The callback must be on the JVM process's classpath, so you will need to modify your deployment, e.g. by placing any necessary JARs and resources in the lib directory of the standard transactor.

Callbacks are registered via the datomic.metricsCallback Java system property in the peer or transactor, or via the metrics-callback setting in the transactor properties file.

Register callback on the transactor

Register the callback for the transactor in the properties file:

metrics-callback=my.ns/handler

Register callback on the peer

Register the callback for the peer with the command line argument:

-Ddatomic.metricsCallback=my.ns/handler

Java callbacks

Java callbacks must be static public methods, taking a single Object argument, named as my.pkg.CallbackClass.handler, e.g.

metrics-callback=my.pkg.CallbackClass.handler
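A minimal sketch of such a callback is shown below. It assumes the argument arrives as a java.util.Map and that a void return type is acceptable; neither is guaranteed, since the argument format is alpha. The describe helper is a hypothetical addition for readability. The package declaration and the public modifier on the class are omitted so the snippet compiles as a single file; for deployment, declare the class public in package my.pkg so it matches the name in the property above.

```java
import java.util.Map;

// Sketch only. For deployment, make this class public and place it in package my.pkg
// (file my/pkg/CallbackClass.java) to match metrics-callback=my.pkg.CallbackClass.handler.
final class CallbackClass {
    /** Entry point: a static public method taking a single Object argument. */
    public static void handler(Object metrics) {
        System.out.println(describe(metrics));
    }

    /** Renders one metric per line; anything that is not a Map is rendered as-is. */
    static String describe(Object metrics) {
        if (!(metrics instanceof Map)) {
            return String.valueOf(metrics);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<?, ?> e : ((Map<?, ?>) metrics).entrySet()) {
            out.append(e.getKey()).append(" = ").append(e.getValue()).append('\n');
        }
        return out.toString();
    }
}
```

Keeping the handler defensive about the argument type limits the damage if the format changes in a later release.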

Clojure callbacks

Clojure callbacks must be functions that can take a single argument, named as my.ns/handler, e.g.

metrics-callback=my.ns/handler

Callback argument format

The callback argument format is alpha, and is subject to change. Currently, the argument is a map with keyword keys, and either number or map values. If the value is itself a map, it will contain :lo, :hi, :sum, and :count components.

{:HeartbeatMsec {:lo 5000, :hi 5001, :sum 55009, :count 11}, :AvailableMB 921.0}

The metrics are the same as those reported to CloudWatch, and are enumerated below.

Configuring CloudWatch Monitoring

The Datomic Pro transactor can be configured to write metrics to CloudWatch. You can find instructions for setting up CloudWatch here.

Use CloudWatch from Anywhere!

Note that you can use CloudWatch with any Datomic Pro transactor, regardless of which storage you are using, and regardless of where your transactor is located. Instructions for configuring CloudWatch for storages other than DynamoDB can be found here.

Available CloudWatch Metrics

The Datomic Pro transactor can create the following CloudWatch metrics:

| Metric | Units | Statistic | Description |
| --- | --- | --- | --- |
| Alarm | count | sum | a problem has occurred (see alarms table below) |
| AvailableMB | MB | minimum | unused RAM on transactor |
| ClusterCreateFS | msec | maximum | time to create a "file system" in the storage |
| CreateEntireIndexMsec | msec | maximum | time to create index, reported at end of indexing job |
| CreateFulltextIndexMsec | msec | maximum | time to create fulltext portion of index |
| Datoms | count | maximum | number of unique datoms in the index |
| HeartbeatMsec | msec | average | time spacing of heartbeats written by active transactor |
| HeartbeatMsec | msec | samples | number of heartbeats written by active transactor |
| HeartbeatMsec | msec | maximum | longest spacing between heartbeats written by active transactor |
| HeartMonitorMsec | msec | min, avg, max | time spacing of heartbeats observed by standby transactor |
| IndexDatoms | count | maximum | number of datoms stored by the index, all sorts |
| IndexSegments | count | maximum | total number of segments in the index |
| IndexWrites | count | sum | number of segments written by indexing job, reported at end |
| LogIngestBytes | bytes | maximum | in-memory size of log when a database starts |
| LogIngestMsec | msec | maximum | time to play log back when a database starts |
| Memcache | count | average | memcache hit ratio, from 0 (no hits) to 1 (all hits) |
| Memcache | count | sum | total number of cache hits for memcache |
| Memcache | count | samples | total number of requests to memcache |
| MemoryIndexMB | MB | maximum | MB of RAM consumed by memory index |
| ObjectCache | count | average | object cache hit ratio, from 0 (no hits) to 1 (all hits) |
| ObjectCache | count | sum | total number of requests to object cache |
| ReaderMemcachedPutMusec | µsec | min, avg, max | distribution of successful memcache put latencies when reading from storage |
| ReaderMemcachedPutFailedMusec | µsec | min, avg, max | distribution of failed memcache put latencies when reading from storage |
| RemotePeers | count | maximum | number of remote peers connected |
| StorageBackoff | msec | maximum | worst case time spent in backoff/retry around calls to storage |
| StorageBackoff | msec | samples | number of times a storage operation was retried |
| Storage{Get,Put}Bytes | bytes | sum | throughput of storage operations |
| Storage{Get,Put}Bytes | bytes | samples | count of storage operations |
| Storage{Get,Put}Msec | msec | min, avg, max | distribution of storage latencies |
| TransactionBytes | bytes | sum | volume of transaction data to log, peers |
| TransactionBytes | bytes | samples | number of transactions |
| TransactionBytes | bytes | min, avg, max | distribution of transaction sizes |
| TransactionMsec | msec | min, avg, max | distribution of transaction latencies |
| WriterMemcachedPutMusec | µsec | min, avg, max | distribution of successful memcache put latencies when writing to storage |
| WriterMemcachedPutFailedMusec | µsec | min, avg, max | distribution of failed memcache put latencies when writing to storage |

The Datomic Pro peer can create the following CloudWatch metrics:

| Metric | Units | Statistic | Description |
| --- | --- | --- | --- |
| AvailableMB | MB | minimum | unused RAM on peer |
| LogIngestBytes | bytes | sum | sum of bytes ingested by peer process |
| LogIngestMsec | msec | min, avg, max | time to ingest log by peer |
| ObjectCache | count | average | object cache hit ratio, from 0 (no hits) to 1 (all hits) |
| ObjectCache | count | sum | total number of requests to object cache |
| PeerAcceptNewMsec | msec | min, avg, max | time for the peer to accept a new index |
| StorageGetBytes | bytes | sum | total number of bytes retrieved from storage |
| StorageGetMsec | msec | min, avg, max | time to retrieve bytes from storage |

All Datomic metrics show up in the Datomic namespace, with a dimension whose name is Transactor and whose value is specified by the aws-cloudwatch-dimension-value property in your transactor properties file.

You should choose a different aws-cloudwatch-dimension-value for each Datomic system you are testing, so you can isolate the metrics coming from each system.

Metrics outside of AWS

If you're analyzing metrics through your own custom callback, or manually calculating statistics from the log, you may need to take additional steps to interpret metrics. The most common case is producing your own average by dividing a sum value for a metric by its count value.

Example using memcache

If you're looking at memcache metrics outside of AWS, interpret them as follows:

  • count is the number of requests to memcached.
  • sum is the total number of hits.
  • You would manually calculate misses as count minus sum.
  • You would manually calculate your hit rate as sum divided by count, producing, e.g., a value like 0.95.
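The arithmetic in the list above can be written out as a small helper. This is a sketch: HitStats is a hypothetical class, and its two inputs are assumed to be the count and sum components of the memcache metric.

```java
// Hypothetical helper: derives misses and hit rate from a metric's count and sum.
final class HitStats {
    final long requests; // the metric's count: requests to memcached
    final long hits;     // the metric's sum: total number of hits

    HitStats(long requests, long hits) {
        if (requests < 0 || hits < 0 || hits > requests) {
            throw new IllegalArgumentException("expected 0 <= hits <= requests");
        }
        this.requests = requests;
        this.hits = hits;
    }

    /** Misses are count minus sum. */
    long misses() {
        return requests - hits;
    }

    /** Hit rate is sum divided by count; NaN when there were no requests. */
    double hitRate() {
        return requests == 0 ? Double.NaN : (double) hits / requests;
    }
}
```

For example, 200 requests with 190 hits give 10 misses and a hit rate of 0.95.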

Alarms

CloudWatch allows you to create alarms when a CloudWatch metric's value is outside of a predetermined range for a defined time period. Alarms can then send a notification to a pager or email, or trigger an Auto Scaling action to expand or contract some resource.

Datomic transactors have a metric named [Alarm]. This metric indicates that something has gone wrong on a transactor, and that human intervention is likely to be necessary. When deploying Datomic in production, you should create a CloudWatch alarm that notifies you if the Datomic [Alarm] count is ever nonzero.

Each Datomic [Alarm] is accompanied by a more specific metric whose name is prefixed with [Alarm]. You can use these more specific alarms to plan your response to the problem, as shown in the table below.

| Alarm | Recommended Action |
| --- | --- |
| AlarmIndexingJobFailed | check for storage problems |
| AlarmBackPressure | let import finish, increase capacity |
| AlarmUnhandledException | contact Datomic support |
| Alarm{AnythingElse} | contact Datomic support |
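If you handle metrics in your own code, the table above can be expressed as a lookup. The sketch below is illustrative: AlarmActions is a hypothetical class, and stripping a leading colon assumes metric names may arrive as keyword-style strings such as :AlarmBackPressure.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: maps a specific alarm metric name to the action in the table above.
final class AlarmActions {
    private static final Map<String, String> KNOWN = new LinkedHashMap<>();
    static {
        KNOWN.put("AlarmIndexingJobFailed", "check for storage problems");
        KNOWN.put("AlarmBackPressure", "let import finish, increase capacity");
        KNOWN.put("AlarmUnhandledException", "contact Datomic support");
    }

    /**
     * Returns the recommended action for a specific alarm metric. Empty for names
     * that are not alarms, and for the general Alarm count, which is accompanied
     * by a more specific Alarm-prefixed metric.
     */
    static Optional<String> recommendedAction(String metricName) {
        // Assumption: names may be keyword-style strings with a leading colon.
        String name = metricName.startsWith(":") ? metricName.substring(1) : metricName;
        if (!name.startsWith("Alarm") || name.equals("Alarm")) {
            return Optional.empty();
        }
        // Any alarm not listed explicitly falls under Alarm{AnythingElse}.
        return Optional.of(KNOWN.getOrDefault(name, "contact Datomic support"));
    }
}
```

A caller might log the returned action next to the alarm name so that whoever is paged sees the first step immediately.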