Monitoring and Performance

This page covers

general tuning tips for Datomic
integrating custom monitoring
configuring CloudWatch monitoring
a list of available metrics
responding to alarms
internal monitoring

Metric names are identified in the text by surrounding square brackets, e.g. [TransactionMsec]. Each reported value of a metric can include up to five different statistics, which are capitalized where they appear in the text: Samples, Average, Minimum, Maximum, and Sum.

Tuning Tips

Give your peers and transactor enough memory

You can set the maximum memory available to a JVM process with the -Xmx flag to java (or to bin/transactor). The more memory you give a process, the more database index it will be able to cache locally in memory, avoiding trips to memcached or storage.

Use memcached

Memcached is a great fit for Datomic. Read more about configuring memcached in the Caching documentation.

Give memcached enough memory

Memcached can deliver data to peers with very low latency (on the order of 1 millisecond for a segment). If you can fit your entire database (or even the commonly-used working set) in memcached, your read and query performance will be quite good.

Diagnostic: If your [Memcache] Average is substantially less than 1.0, then you can serve more reads from memcache by increasing the size of your memcached cluster.

Add more peers

Datomic reads and queries are horizontally scalable. If you can't serve as many simultaneous queries as you would like, add more peers.

Add more cores

Datomic's entire public API is threadsafe, and is designed to automatically take advantage of multiple cores where available. Having more cores will help any Datomic process, transactor or peer.

Don't benchmark against the memory database

The memory database is a convenience for development, testing, and temp data. Memory is not good for storage-sized datasets, as memory isn't storage-sized.

Provision enough DynamoDB capacity

See the DynamoDB section of Capacity Planning.

Run large imports on cheap storage, then backup/restore

Restoring a database consumes significantly less write capacity than building the database via transactions. So if your target system has metered storage pricing (e.g. DynamoDB), it may make sense to do large import jobs on less expensive storage, back up the database, and then restore it to the target storage.

Use metrics to analyze write activity

The speed with which a Datomic transactor can write data is measured with the [TransactionMsec] metric. The minimum, average, and maximum values of this metric, combined with the size of your transactions, characterize the write latency of the system from the transactor's perspective. (To measure write latency from a peer or client's perspective, instrument your code to measure the time from when transact is called until the transaction future is realized.)
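The peer-side measurement described in the parenthesis can be sketched in Java as follows. This is a minimal illustration, not part of the Datomic API: `TransactLatency` and `timeUntilRealized` are hypothetical names, and the `submit` argument stands in for your own call to transact, assumed here to return a `java.util.concurrent.Future`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class TransactLatency {

    /**
     * Calls submit (a stand-in for your own call to transact), waits for the
     * returned future to be realized, and returns the elapsed time in msec.
     */
    public static long timeUntilRealized(Supplier<? extends Future<?>> submit) throws Exception {
        long start = System.nanoTime();   // monotonic clock, suitable for intervals
        Future<?> result = submit.get();  // the transaction is submitted here
        result.get();                     // blocks until the future is realized
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a transaction: a task that takes about 50 msec.
        long msec = timeUntilRealized(() -> CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        System.out.println("peer-side write latency: " + msec + " msec");
    }
}
```

Comparing these peer-side timings with [TransactionMsec] gives a rough sense of how much latency is added between the peer and the transactor.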

The [TransactionBytes] Samples metric, again combined with the size of your transactions, characterizes the write throughput of the system. Note that to achieve maximum throughput you will need to pipeline transactions, i.e. arrange to have multiple transactions in flight at the same time.
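One way to pipeline is to cap the number of unfinished transactions with a semaphore. The Java sketch below assumes an asynchronous submit function that returns a `CompletableFuture`; `TransactionPipeline`, `runPipelined`, and `submitAsync` are hypothetical names, and the right value for `maxInFlight` is something to find by experiment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

public class TransactionPipeline {

    /**
     * Submits every batch through submitAsync (a stand-in for your own
     * asynchronous transact call) while allowing at most maxInFlight
     * unfinished submissions. Returns the results in submission order
     * once all submissions have completed.
     */
    public static <T, R> List<R> runPipelined(List<T> batches,
                                              Function<T, CompletableFuture<R>> submitAsync,
                                              int maxInFlight) throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<R>> pending = new ArrayList<>();
        for (T batch : batches) {
            permits.acquire();  // wait until fewer than maxInFlight are unfinished
            CompletableFuture<R> future = submitAsync.apply(batch);
            // Free the slot whether the submission succeeded or failed.
            future.whenComplete((result, error) -> permits.release());
            pending.add(future);
        }
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> future : pending) {
            results.add(future.join());  // throws CompletionException on failure
        }
        return results;
    }
}
```

With `maxInFlight` set to 1 this degenerates to one transaction at a time, which is a useful baseline when measuring what pipelining gains.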

Custom Monitoring

You can integrate custom monitoring in both transactors and peers by registering a callback. The callback must be on the JVM process's classpath, so you will need to modify your deployment, e.g. by including any necessary JARs and resources in the lib directory of the standard transactor.

Callbacks are registered via the datomic.metricsCallback Java system property in the peer or transactor, or via the metrics-callback setting in the transactor properties file.

Register callback on the transactor

metrics-callback=my.ns/handler

Register callback on the peer

-Ddatomic.metricsCallback=my.ns/handler

Java callbacks

Java callbacks must be static public methods, taking a single Object argument, named as my.pkg.CallbackClass.handler, e.g.

metrics-callback=my.pkg.CallbackClass.handler
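A minimal sketch of such a class is shown below. The `void` return type, the `lastReport` accessor, and the choice to print the argument are assumptions made for illustration; only the public static method taking a single Object comes from the requirements above.

```java
// package my.pkg;  // uncomment so the class resolves as my.pkg.CallbackClass

public class CallbackClass {

    // Most recent metrics argument, kept so other code can inspect it.
    private static volatile Object lastReport;

    /** Entry point named by metrics-callback=my.pkg.CallbackClass.handler. */
    public static void handler(Object metrics) {
        lastReport = metrics;
        System.out.println("datomic metrics: " + metrics);
    }

    /** Hypothetical accessor, not required by Datomic. */
    public static Object lastReport() {
        return lastReport;
    }
}
```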

Clojure callbacks

Clojure callbacks must be functions that can take a single argument, named as my.ns/handler, e.g.

metrics-callback=my.ns/handler

Callback argument format

The callback argument format is alpha, and is subject to change. Currently, the argument is a map with keyword keys, and either number or map values. If the value is itself a map, it will contain :lo, :hi, :sum, and :count components, all with numeric values.

The example below shows :HeartbeatMsec with a map value, and :AvailableMB with a numeric value.

{:HeartbeatMsec {:lo 5000, :hi 5001, :sum 55009, :count 11}, :AvailableMB 921.0}

The metrics are the same as those reported to CloudWatch. Note that the set of metrics is dynamic, open, and subject to change over time. A correct callback implementation should be tolerant of

metric names it has never seen before
metric values in either of the shapes documented above, regardless of what shapes it has seen previously
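The two tolerance requirements above can be sketched in Java as follows. The sketch assumes the argument is visible to Java as a `java.util.Map` and that keyword keys print with a leading colon (for example `:lo`), so keys are compared by their string form. `TolerantMetrics`, `Stat`, and `normalize` are hypothetical names, and treating a plain number as a one-sample summary is a design choice of this example, not something Datomic prescribes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TolerantMetrics {

    /** Summary of one metric value. */
    public static final class Stat {
        public final double lo, hi, sum, count;

        Stat(double lo, double hi, double sum, double count) {
            this.lo = lo;
            this.hi = hi;
            this.sum = sum;
            this.count = count;
        }

        public double average() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    /**
     * Converts the callback argument into metric name -> Stat. Metric names
     * are never filtered, so unfamiliar ones pass through. Values that are
     * neither a number nor a complete :lo/:hi/:sum/:count map are skipped.
     */
    public static Map<String, Stat> normalize(Object metrics) {
        Map<String, Stat> out = new LinkedHashMap<>();
        if (!(metrics instanceof Map)) {
            return out;
        }
        for (Map.Entry<?, ?> e : ((Map<?, ?>) metrics).entrySet()) {
            String name = String.valueOf(e.getKey());
            Object value = e.getValue();
            if (value instanceof Number) {
                double d = ((Number) value).doubleValue();
                out.put(name, new Stat(d, d, d, 1));  // one-sample summary
            } else if (value instanceof Map) {
                Stat s = fromMap((Map<?, ?>) value);
                if (s != null) {
                    out.put(name, s);
                }
            }
        }
        return out;
    }

    private static Stat fromMap(Map<?, ?> m) {
        Double lo = null, hi = null, sum = null, count = null;
        for (Map.Entry<?, ?> e : m.entrySet()) {
            if (!(e.getValue() instanceof Number)) {
                continue;
            }
            double d = ((Number) e.getValue()).doubleValue();
            switch (String.valueOf(e.getKey())) {
                case ":lo": lo = d; break;
                case ":hi": hi = d; break;
                case ":sum": sum = d; break;
                case ":count": count = d; break;
                default: break;  // ignore components this sketch does not know
            }
        }
        if (lo == null || hi == null || sum == null || count == null) {
            return null;
        }
        return new Stat(lo, hi, sum, count);
    }
}
```

Because the shape is checked on every call, a metric that arrived as a number in one report and as a map in the next is handled without special cases.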

Configuring CloudWatch Monitoring

The Datomic Pro transactor can be configured to write metrics to CloudWatch. You can find instructions for setting up CloudWatch here.

Use CloudWatch from Anywhere!

Note that you can use CloudWatch with any Datomic Pro transactor, regardless of which storage you are using, and regardless of where your transactor is located. Instructions for configuring CloudWatch for storages other than DynamoDB can be found here.

Available CloudWatch Metrics

Transactor metrics

The Datomic Pro transactor can create at least the following CloudWatch metrics, and may create others as well.

Metric	Units	Statistic	Description
Alarm	count	sum	a problem has occurred (see alarms table below)
AvailableMB	MB	minimum	unused RAM on transactor
ClusterCreateFS	msec	maximum	time to create a "file system" in the storage
CreateEntireIndexMsec	msec	maximum	time to create index, reported at end of indexing job
CreateFulltextIndexMsec	msec	maximum	time to create fulltext portion of index
Datoms	count	maximum	number of unique datoms in the index
DBAddFulltextMsec	msec	sum	time per transaction, to add fulltext
FulltextSegments	count	sum	total number of fulltext segments in the index, per index job
GarbageSegments	count	sum	total number of garbage segments created
GcPauseMsec	msec	min, avg, max	duration of garbage collector induced pauses
HeartbeatMsec	msec	avg	time spacing of heartbeats written by active transactor
HeartbeatMsec	msec	samples	number of heartbeats written by active transactor
HeartbeatMsec	msec	maximum	longest spacing between heartbeats written by active transactor
HeartMonitorMsec	msec	min, avg, max	time spacing of heartbeats observed by standby transactor
IndexDatoms	count	maximum	number of datoms stored by the index, all sorts
IndexSegments	count	maximum	total number of segments in the index
IndexWrites	count	sum	number of segments written by indexing job, reported at end
IndexWriteMsec	msec	sum	time per index segment write
LogIngestBytes	bytes	maximum	in-memory size of log when a database starts
LogIngestMsec	msec	maximum	time to play log back when a database starts
LogWriteMsec	msec	sum	the time to write to log per transaction batch
Memcache	count	avg	memcache hit ratio, from 0 (no hits) to 1 (all hits)
Memcache	count	sum	total number of cache hits for memcache
Memcache	count	samples	total number of requests to memcache
MemoryIndexMB	MB	maximum	MB of RAM consumed by memory index
MetricReport	count	sum	Count of successful metrics report writes over a 1 min period
ObjectCache	count	avg	object cache hit ratio, from 0 (no hits) to 1 (all hits)
ObjectCache	count	sum	total number of requests to object cache
MemcachedPutMsec	msec	min, avg, max	distribution of successful memcache put latencies
MemcachedPutFailedMsec	msec	min, avg, max	distribution of failed memcache put latencies
RemotePeers	count	maximum	number of remote peers connected
StorageBackoff	msec	maximum	worst case time spent in backoff/retry around calls to storage
StorageBackoff	msec	samples	number of times a storage operation was retried
Storage{Get,Put}Bytes	bytes	sum	throughput of storage operations
Storage{Get,Put}Bytes	bytes	samples	count of storage operations
Storage{Get,Put}Msec	msec	min, avg, max	distribution of storage latencies
TransactionBatch	count	sum	transactions per batch written to storage
TransactionBytes	bytes	sum	volume of transaction data to log, peers
TransactionBytes	bytes	samples	number of transactions
TransactionBytes	bytes	min, avg, max	distribution of transaction sizes
TransactionDatoms	count	sum	datoms per transaction
TransactionApplyMsec	msec	min, avg, max	time for the transactor to apply the transaction data to its in-memory representation of the db
TransactionMsec	msec	min, avg, max	transaction time, includes the time to write to durable storage
Valcache	count	avg	Valcache hit ratio, from 0 (no hits) to 1 (all hits)
Valcache	count	sum	total number of cache hits for Valcache
Valcache	count	samples	total number of requests to Valcache
ValcachePutMsec	msec	min, avg, max	distribution of successful valcache put latencies
ValcachePutFailedMsec	msec	min, avg, max	distribution of failed valcache put latencies
WriterMemcached

Metrics outside of AWS

If you're analyzing metrics through your own custom callback, or manually calculating statistics from the log, you may need to take additional steps to interpret metrics. The most common case is producing your own average by dividing a sum value for a metric by its count value.

Example using memcache

If you're looking at memcache metrics outside of AWS, you would interpret it as follows:

count is the number of requests to memcached.
sum is the total number of hits.
You would manually calculate misses as count minus sum.
You would manually calculate your hit rate as sum divided by count, producing, e.g., a value like 0.95.
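The arithmetic above can be sketched in Java as follows. `CacheStats` and its methods are hypothetical names, and the sample numbers are made up for illustration.

```java
public class CacheStats {

    /** Misses are the requests (count) that were not hits (sum). */
    public static long misses(long count, long sum) {
        return count - sum;
    }

    /** Hit rate is hits (sum) divided by requests (count); NaN when there were no requests. */
    public static double hitRate(long count, long sum) {
        return count == 0 ? Double.NaN : (double) sum / count;
    }

    public static void main(String[] args) {
        long count = 2000;  // requests to memcached
        long sum = 1900;    // cache hits
        System.out.println("misses: " + misses(count, sum));     // prints "misses: 100"
        System.out.println("hit rate: " + hitRate(count, sum));  // prints "hit rate: 0.95"
    }
}
```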

Alarms

CloudWatch allows you to create alarms when a CloudWatch metric's value is outside of a predetermined range for a defined time period. Alarms can then send a notification to a pager or email, or trigger an Auto Scaling action to expand or contract some resource.

Datomic transactors have a metric named [Alarm]. This metric indicates that something has gone wrong on a transactor, and that human intervention is likely to be necessary. When deploying Datomic in production, you should create a CloudWatch alarm that notifies you if the Datomic [Alarm] count is ever nonzero.

Each Datomic [Alarm] is accompanied by a more specific metric whose name is prefixed with [Alarm]. You can use these more specific alarms to plan your response to the problem, as shown in the table below.

Alarm	Recommended Action
AlarmIndexingJobFailed	check for storage problems
AlarmBackPressure	let import finish, increase capacity
AlarmUnhandledException	contact Datomic support
Alarm{AnythingElse}	contact Datomic support
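If you route alarms through your own tooling, the table above can be encoded as a simple lookup. The Java sketch below is illustrative only: `AlarmActions` and `recommendedAction` are hypothetical names, and the response for the bare [Alarm] metric is an assumption based on the note that each alarm is accompanied by a more specific metric.

```java
import java.util.HashMap;
import java.util.Map;

public class AlarmActions {

    private static final String DEFAULT_ACTION = "contact Datomic support";
    private static final Map<String, String> ACTIONS = new HashMap<>();

    static {
        ACTIONS.put("AlarmIndexingJobFailed", "check for storage problems");
        ACTIONS.put("AlarmBackPressure", "let import finish, increase capacity");
        ACTIONS.put("AlarmUnhandledException", DEFAULT_ACTION);
    }

    /**
     * Returns the recommended action for an alarm metric name, or null if
     * the name is not an alarm metric at all.
     */
    public static String recommendedAction(String metricName) {
        if (metricName == null || !metricName.startsWith("Alarm")) {
            return null;
        }
        if (metricName.equals("Alarm")) {
            // Assumption: the generic metric only signals that a specific one exists.
            return "look for the accompanying specific alarm metric";
        }
        // Any alarm not listed in the table falls back to the default action.
        return ACTIONS.getOrDefault(metricName, DEFAULT_ACTION);
    }
}
```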

Internal Monitoring

Metrics and events not documented above are internal to Datomic. Internal monitoring is implementation-specific, and subject to change at any time. You should not build systems that rely on internal monitoring. Please ask at the knowledgebase and we will help you find an alternative approach.

The descriptions of internal metrics and events below are provided for informational purposes only.

AlarmStorageSemaphoreTimeout is an internal metric that occurs after a bounding timeout while reading or writing storage. AlarmStorageSemaphoreTimeout can occur if Datomic is waiting tens of seconds to hear back from a storage call, and will likely correlate with spikes in documented metrics such as StorageGetMsec or StoragePutMsec. After an AlarmStorageSemaphoreTimeout, check for problems in storage or in the network between a Datomic process and storage.