AEP Engine Statistics
Overview
An operational AEP engine and its underlying components such as its HA Store and Bus Bindings capture many metrics and statistics during the course of its operation. These metrics are periodically collected by the XVM in which the engine is running and reported in XVM heartbeats which can then be traced, logged in a binary format or emitted to monitoring applications. This document describes the statistics that are collected and the format in which statistics are traced.
Configuring AEP Engine Statistics
Most engine metrics are low overhead and their collection cannot be disabled. Other types of statistics such as Latency Statistics, Per Message Type statistics, and Per Transaction Statistics can impact application performance and are disabled by default.
For complete configuration details including global settings, per-engine settings, and output thread configuration, see:
Configuring Monitoring - Complete statistics configuration guide
This page focuses on understanding and interpreting the statistics once they are collected.
Metrics Collected
An AEP engine collects the following raw metrics during the course of its operation.
NumFlows
Total number of message flows active in the engine.
NumMsgsRcvdBestEffort
Total number of messages received by the engine on best-effort channels.
NumMsgsRcvdGuaranteed
Total number of messages received by the engine on guaranteed channels.
NumMsgsSourced
[EventSourcing Only] Total number of messages sourced from the recovery log or primary agent in a cluster during agent initialization.
NumMsgsFiltered
The number of messages that were filtered.
NumDupMsgsRcvd
Total number of duplicate messages received and discarded by an engine. This metric will always be 0 if duplicate detection has been disabled via the nv.aep.duplicate.checking configuration property.
NumMsgsSentBestEffort
Total number of messages sent by the engine on best-effort channels.
NumMsgsSentGuaranteed
Total number of messages sent by the engine on guaranteed channels.
NumMsgsResent
Total number of messages retransmitted by an engine. When a backup agent assumes the role of a primary on failover, it retransmits all in-doubt messages as part of its first transaction. In-doubt messages are those messages for which positive acknowledgements have not been received from the downstream messaging components. This metric records the number of such retransmitted messages.
NumEventsRcvd
Total number of events received by an engine. Events received by an engine include message events and non-message events such as 'channel up' and 'channel down' events.
NumFlowEventsRcvd
Total number of flow events received by the engine. Flow events are synonymous with message events. These include message events received on message channels and message events injected for processing into the engine by the microservice.
NumFlowEventsProcSuccess
Total number of successfully processed flow events. A flow event is considered successfully processed if the microservice did not throw an exception while processing the message associated with the event.
NumFlowEventsProcFail
Total number of failed flow events. A flow event is considered to have failed processing if the microservice threw an unchecked exception while processing the message associated with the event.
NumFlowEventsProcComplete
Total number of flow events whose transactions have completed. Each successfully processed flow event participates in an AEP transaction. This metric counts the number of successful flow events whose transactions have completed.
NumTransactions
Total number of transactions that have been committed or rolled back. Note: This metric is equal to NumCommitsStarted + NumRollbacks
NumCommitsStarted
Total number of transactions whose commits have been started. Note: Transactions are committed in a pipelined manner. Therefore, there can be multiple transactions in the commit pipeline. This metric counts the number of transactions that have entered the commit pipeline. See NumCommitsCompleted
NumCommitsCompleted
Total number of transactions whose commits have completed. This metric counts the number of transactions that have completed and exited the transaction commit pipeline. The difference between this and the NumCommitsStarted metric is the number of transactions in flight at any point in time. See NumCommitsStarted
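As a worked example, the number of transactions currently in the commit pipeline can be derived directly from these two counters. A minimal sketch, using hypothetical sample counter values:

```java
/**
 * Illustration only: deriving the in-flight transaction count from the
 * NumCommitsStarted and NumCommitsCompleted counters. The counter values
 * below are hypothetical sample values.
 */
public class CommitPipelineDepth {
    public static void main(String[] args) {
        long numCommitsStarted = 120_457;   // sample value of NumCommitsStarted
        long numCommitsCompleted = 120_441; // sample value of NumCommitsCompleted
        long inFlight = numCommitsStarted - numCommitsCompleted;
        System.out.println("Transactions in commit pipeline: " + inFlight); // 16
    }
}
```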
NumSendCommitsStarted
Total number of transactions whose send commits have been started. A transaction commit pipeline is comprised of a send commit pipeline (the commit of outbound messages in the transaction) and a store commit pipeline (commit to store changes). This metric counts the number of transactions in the send commit pipeline.
NumSendCommitsCompleted
Total number of transactions whose send commits have completed. This metric counts the number of transactions whose send commits have completed and exited the send commit pipeline.
SendCommitCompletionQueueSize
Number of transactions in the send commit completion queue. Transactions that complete the send portion of the commit are passed through a send commit completion queue for sequencing before entering into the next phase of a transaction commit. This metric counts the number of transactions held in the post send commit sequencing queue at any point in time.
NumStoreCommitsStarted
Total number of transactions whose store commits have been started. A transaction commit pipeline is comprised of a send commit pipeline (the commit of outbound messages in the transaction) and a store commit pipeline (commit to store changes). This metric counts the number of transactions in the store commit pipeline.
NumStoreCommitsCompleted
Total number of transactions whose store commits have completed. This metric counts the number of transactions whose store commits have completed and exited the store commit pipeline.
StoreCommitCompletionQueueSize
Number of transactions in the store commit completion queue. Transactions that complete the store portion of the commit are passed through a store commit completion queue for sequencing before entering into the next phase of a transaction commit. This metric counts the number of transactions held in the post store commit sequencing queue at any point in time.
NumRollbacks
Number of transactions that have been rolled back.
BackupOutboundQueueSize
[EventSourcing Only] Number of messages in a backup's outbound queue. A backup outbound queue holds outbound messages sent during concurrent processing by a backup agent in an Event Sourced cluster. The messages are held in the queue as in-doubt messages until notifications are received from the primary 'acknowledging' the receipt of the messages by downstream messaging components/agents. The messages are flushed from the queue upon receipt of such notifications. This metric counts the number of messages in a backup's outbound queue. Note: The notifications are piggy-backed on other replication traffic to avoid extra network traffic. Therefore, in the absence of replication traffic, messages can remain in the backup's outbound queue, even though acknowledged by the downstream messaging system, until replication traffic occurs to carry the downstream ack notification.
BackupOutboundLogQueueSize
[EventSourcing Only] Number of messages in a backup's outbound log queue. A backup's outbound log queue holds outbound messages for the course of a single transaction. The queue is flushed to the outbound message log upon completion of the transaction. This metric holds the number of messages in the outbound log queue.
OutboundSno
The current outbound sequence number in use by the engine. Each engine instance maintains a sequence number space for messages outbound by the engine. This metric holds the current outbound sequence number.
OutboundStableSno
The current 'stable' outbound sequence number. An outbound message is considered stable when a positive acknowledgement is received for the message from the downstream messaging system. This metric holds the sequence number of the last stable outbound message.
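As an illustration, the gap between these two sequence numbers approximates the number of outbound messages not yet acknowledged downstream. This is an inferred relationship and the sample values below are hypothetical:

```java
/**
 * Sketch (inferred relationship): the difference between OutboundSno and
 * OutboundStableSno approximates outbound messages that are not yet stable.
 */
public class OutboundStabilityGap {
    public static void main(String[] args) {
        long outboundSno = 52_340;       // sample value of OutboundSno
        long outboundStableSno = 52_331; // sample value of OutboundStableSno
        System.out.println("Unstabilized outbound messages: "
                + (outboundSno - outboundStableSno)); // 9
    }
}
```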
Message Type Specific Statistics
The engine also (optionally, driven by configuration) maintains statistics for each of the different message types flowing through the engine. The following metrics are collected for each message type:
- NumMsgsRcvdBestEffort
- NumMsgsRcvdGuaranteed
- NumMsgsSourced
- NumDupMsgsRcvd
- NumMsgsSentBestEffort
- NumMsgsSentGuaranteed
- NumMsgsResent
The semantic meaning of these metrics is identical to that of the metrics with the same names described above, except that each is local to its message type.
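Conceptually, per-type bookkeeping amounts to keeping a separate set of counters keyed by message type. The sketch below is illustrative only and is not the engine's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Illustration of per-message-type counters: one LongAdder per message
 * type for a given metric. Not the engine's implementation.
 */
public class PerTypeCounters {
    private final Map<String, LongAdder> rcvdGuaranteed = new ConcurrentHashMap<>();

    void onGuaranteedReceive(String msgType) {
        // Lazily create a counter for each newly seen message type.
        rcvdGuaranteed.computeIfAbsent(msgType, t -> new LongAdder()).increment();
    }

    long numMsgsRcvdGuaranteed(String msgType) {
        LongAdder adder = rcvdGuaranteed.get(msgType);
        return adder == null ? 0 : adder.sum();
    }

    public static void main(String[] args) {
        PerTypeCounters counters = new PerTypeCounters();
        counters.onGuaranteedReceive("NewOrder"); // hypothetical message type
        counters.onGuaranteedReceive("NewOrder");
        System.out.println(counters.numMsgsRcvdGuaranteed("NewOrder")); // 2
    }
}
```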
Transaction Latencies
Transaction latencies traced by an engine stats thread include summary statistics for various phases within the transaction processing pipeline. The meaning of these summary statistics is as follows:
mpproc
Records the time (in microseconds) spent by the engine dispatching the message to a microservice.
mproc
Records latencies for application message process times (in an EventHandler).
mfilt
Records latencies for application message filtering times (by a message filter).
msend
Time spent in AepEngine.sendMessage(), i.e. the time in the AepEngine's send call. This latency is a subset of mproc for solicited sends, and it includes msendc.
msendc
Time spent in the AepEngine's core send logic. This leg includes enqueuing the message for delivery in the corresponding bus manager.
cstart
Time spent from the point the first message of a transaction is received to the time the transaction's commit is started.
cprolo
Time spent from the point where transaction commit is started to send or store commit, whichever occurs first. This latency measures the time taken in any bookkeeping done by the engine prior to committing the transaction to the store (or, for an engine without a store, until outbound messages are released for delivery).
csend
The send commit latency: i.e. the time from when send commit is initiated to receipt of the send completion event. This latency represents the time from when outbound messages for a transaction are released to the time that all acknowledgements for the messages are received. Because this latency includes acknowledgement time, a high value for csend does not necessarily indicate that downstream latency will be affected. The Message Latencies listed below allow this value to be decomposed further.
ctrans
Time spent from the point the store commit completes to the beginning of the send commit which releases a transaction's outbound messages for delivery. If the engine doesn't have a store, then this statistic is not captured as messages are released immediately.
cstore
The store commit latency: i.e. the time from when store commit is initiated to receipt of the store completion event. This latency includes the time spent serializing transaction contents, persisting to the store's transaction log, intercluster replication, and replication to backup members including the replication ack. 💡 High values in cstore will impact downstream message latencies because the store commit must complete before outbound messages are released for delivery. The cstore latency is further broken down in the Store Latencies listed below.
cepilo
Time spent from the point the store or the send commit completes, whichever is last, to commit completion.
cfull
Time spent from the time the first message of a transaction is received to commit completion.
tleg1
Records latencies for the first transaction processing leg. Transaction Leg One includes time spent from the point where the first message of a transaction is received to submission of the send/store commit. It includes message processing and any overhead associated with transactional bookkeeping done by the engine. 💡 Each transaction leg is a portion of the overall commit time that is processed on the AEP engine's thread. The sum of the transaction leg stats is important in that it determines the overall throughput that a microservice can sustain in terms of transactions per second (see the throughput sketch after this list).
tleg2
Records latencies for the second transaction processing leg. Transaction Leg Two includes time spent from the point where the send/store commit completion is received to the submission of store/send commit.
tleg3
Records latencies for the third transaction processing leg. Transaction Leg Three includes time spent from the point where the store/send commit completion is received to the completion of the transaction commit.
inout
Records latencies for receipt of a message to transmission of the last outbound message.
inack
Records latencies for receipt of a message to stabilization (and upstream acknowledgement for Guaranteed).
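Because the three transaction legs all execute on the engine's thread, their sum bounds the transaction rate the engine thread can sustain. A minimal sketch of that estimate, using hypothetical mean leg latencies:

```java
/**
 * Sketch: estimating sustainable engine-thread throughput from the
 * transaction leg latencies (tleg1 + tleg2 + tleg3). The leg values
 * below are hypothetical sample means in microseconds.
 */
public class LegThroughputEstimate {
    public static void main(String[] args) {
        double tleg1Micros = 45.0; // sample mean tleg1
        double tleg2Micros = 12.0; // sample mean tleg2
        double tleg3Micros = 8.0;  // sample mean tleg3

        // Engine-thread time consumed per transaction:
        double microsPerTxn = tleg1Micros + tleg2Micros + tleg3Micros;

        // Upper bound on transactions/sec on a single engine thread:
        double maxTxnPerSecond = 1_000_000.0 / microsPerTxn;
        System.out.printf("~%.0f transactions/sec sustainable on the engine thread%n",
                maxTxnPerSecond);
    }
}
```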
Message Type Specific Stats
By default, an AEP engine does not collect message type specific stats. For configuration details on enabling message type stats, see Configuring Monitoring.
Bus Connection Stats
The engine stats thread can also trace summary statistics for each of the message bus connections managed by the engine. Each bus connection is handled by a bus manager, which handles bus connection establishment, reconnect handling, and message I/O through the underlying connection. From the perspective of an engine's bus manager, a bus connection is synonymous with an SMA bus connection. Each bus manager maintains its statistics across bus binding reconnects, so stats are continuous across reconnects. The following sections break these statistics down in more detail.
Message and Transaction Statistics
Immediately following the Bus Manager "header" are statistics that relate to message volumes and rates across the managed bus:
The raw metrics from which these statistics are computed are as follows:
NumMsgsRcvd
The number of messages received by the bus.
NumMsgsInBatches
The number of messages received by the bus that were part of a batch.
NumMsgBatchesRcvd
The number of message batches received by the bus.
NumPacketsRcvd
The number of raw packets received by the bus.
NumMsgsEnqueued
The total number of batch messages enqueued for delivery by this bus.
NumAcksSent
The total number of acknowledgements sent upstream for received messages by this bus.
NumStabilityRcvd
The number of stability events (acks) received by this bus.
NumStabilityBatchesRcvd
The number of batched stability events received by this bus.
NumMsgsSent
The total number of enqueued messages that were actually sent by the bus.
NumFlushesSync
The number of times this bus has been synchronously flushed.
NumFlushesAsync
The number of times this bus has been asynchronously flushed.
NumMsgsFlushedSync
The number of messages flushed by synchronous flushes.
NumMsgsFlushedAsync
The number of messages flushed by asynchronous flushes.
NumAsyncFlushCompletions
The number of asynchronous flushes for this bus that have completed.
NumCommits
The number of transactions committed by the bus.
NumRollbacks
The number of transactions rolled back by the bus.
Bus Disruptor Latencies
When a bus manager is configured for detached send (aka detached commit), a transaction's outbound messages are sent on the underlying SMA connection by the bus manager's I/O thread. The bus manager uses a disruptor to manage the handoff of messages to its I/O thread. The "Offer to Poll" latency measures the time that outbound messages are latent in the disruptor queue. High o2p latencies in a bus manager may indicate that messages are being released for send faster than they can actually be sent through the underlying SMA connection.
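The sketch below illustrates what an offer-to-poll measurement captures, using a plain BlockingQueue in place of the bus manager's disruptor; the names and structure are hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of an "offer to poll" (o2p) measurement: each element is stamped
 * when offered, and the consumer computes o2p when it takes the element.
 * A BlockingQueue stands in for the bus manager's disruptor here.
 */
public class OfferToPollSketch {
    record Stamped(long offerNanos, String payload) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Stamped> queue = new ArrayBlockingQueue<>(1024);
        Thread ioThread = new Thread(() -> {
            try {
                Stamped s = queue.take(); // the "poll" side (I/O thread)
                long o2pMicros = (System.nanoTime() - s.offerNanos) / 1_000;
                System.out.println("o2p = " + o2pMicros + "us for " + s.payload);
            } catch (InterruptedException ignored) {
            }
        });
        ioThread.start();
        // The "offer" side: stamp the message as it enters the queue.
        queue.put(new Stamped(System.nanoTime(), "outbound message"));
        ioThread.join();
    }
}
```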
Disruptor statistics follow the message and transaction statistics in the bus manager statistics trace:
Bus Clients, Channels and Fails
After the disruptor statistics are counters indicating the number of connected clients, active channels and binding failures:
The raw metrics from which these statistics are computed are as follows:
NumClients
The number of connected clients (if applicable).
NumChannels
The number of channels brought up by this bus.
NumFails
The number of binding failures that have occurred for this bus.
Messaging Latencies
Messaging latencies follow the clients, channels and fails output. The following latency statistics relate to the bus manager's message handling pipeline:
c2o
The create to send latencies in microseconds, the time in microseconds from message creation to when send was called for it. 💡 Note, this statistic is for outbound messages sent through a bus and is different from the c2o statistic captured for an AepEngine which tracks the create to offer times for received/injected messages offered to the microservice's input queue.
o2s
The send to serialize latencies in microseconds: the time from when the message was sent until it was serialized in preparation for transmission on the wire. This includes the replication hop (if the engine has a store) and the time through the bus manager's disruptor (if detached commit is enabled for the bus manager).
s
The serialize latencies in microseconds, the time spent serializing the MessageView to its transport encoding.
s2w
The serialize to wire latencies in microseconds, the time post serialize to just before the message is written to the wire.
w
The wire latencies in microseconds, the time an inbound message spent on the wire: from when the message was written to the wire by the sender to the time it was read off the wire by the receiver. Note that this metric is subject to clock skew when the sending and receiving sides are on different hosts.
w2d
The time from when the serialized form was received from the wire to deserialization.
d
The time (in microseconds) spent deserializing the message and wrapping it in a MessageView.
d2i
The time (in microseconds) from when the message was deserialized to when it is received by the engine. This measures the time from when the message has been deserialized by the bus to when the engine picks it up from its input queue, before dispatching it to a microservice handler (it includes the o2p time of the engine's disruptor). Additional time spent by the engine dispatching the message to the microservice handler is covered by mpproc (see the Transaction Latencies table).
o2i
The origin to receive latencies in microseconds. The time from when a message was originally created to when it was received by the binding.
w2w
The wire to wire latencies in microseconds, for outbound messages the time from when the corresponding inbound message was received off the wire to when the outbound message was written to the wire.
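Taken together, these legs chain into the end-to-end message path. The sketch below sums hypothetical sample values to show how the legs compose; the boundaries are approximate and the w leg is subject to clock skew across hosts:

```java
/**
 * Sketch: rough composition of the message path from the messaging
 * latency legs. All values are hypothetical sample means in microseconds;
 * leg boundaries are approximate and 'w' may include clock skew.
 */
public class MessagingLegSum {
    public static void main(String[] args) {
        double c2o = 3, o2s = 20, s = 2, s2w = 4; // sender-side legs
        double w = 55;                            // wire time (skew-prone)
        double w2d = 3, d = 2, d2i = 6;           // receiver-side legs
        double createToEngineReceipt = c2o + o2s + s + s2w + w + w2d + d + d2i;
        System.out.printf("~%.0fus from message create to engine receipt%n",
                createToEngineReceipt);
    }
}
```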
Event Multiplexer (Input Queue) Statistics
The Event Multiplexer reports statistics describing the latency between:
The point at which an event is offered to the multiplexer by a thread and the point at which it is actually enqueued for processing – i.e. the o2p statistics traced per Feeder Queue.
The time between when an event is offered to the multiplexer for processing and the time at which it is returned from a poll by the engine's event multiplexer thread for processing – i.e. the o2p statistics traced for the Event Multiplexer's Disruptor.
Store Statistics
The stats thread also traces summary statistics related to an engine's store, when the engine has one. The following sections break these statistics down in more detail.
Inbound and Outbound Commit Statistics
The raw metrics from which these statistics are computed are as follows:
NumCommitsRcvd
The number of committed transactions replicated to this store.
PacketsRcvd
The number of committed entries replicated to this store.
NumCommitsSent
The number of committed transactions replicated by the store.
PacketsSent
The number of committed entries replicated by the store.
CommitCompletionsRcvd
The number of commit acknowledgements received by this store member from followers.
CommitCompletionsSent
The number of commit acknowledgements sent by this store member to the leader.
Store Latencies
cqs
The number of entries committed per commit. ⚠️ While this statistic is reported with latencies, it is not actually a latency statistic but rather captures the number of entries included in a commit. For Event Sourcing this will reflect the number of input events batched into a transaction; for State Replication, it will reflect the number of state entries and outbound messages in the transaction.
s
The amount of time spent serializing transaction entries in preparation for replication and persistence.
s2w
The time between serializing transaction entries to the last entry being written to the wire (but not stabilized) for replication. 💡 A high value here can indicate that there is network pushback from backup members.
s2p
The time between serializing transaction entries to the time that entries have been passed to the persister for write to disk (but not yet synced). 💡 A high value here can indicate that there is disk pushback or that there is overhead in preparing entries for persistence or ICR.
w
The commit wire time. For a store in the primary role, wire latency captures the time from the last commit entry being replicated until the last commit ack is received for the transaction: in other words, the round trip time. For a store in the backup role, wire latency captures the amount of time from when the primary wrote the last commit entry to the wire to the time it was received by the backup. When primary and backup are on different hosts, this statistic is subject to clock skew and could even be negative. 💡 A high value for 'w' on a primary member may indicate a slow replication connection or a bottleneck in a backup member; if s2w is low, the latter is more likely and backup stats should be consulted. A high value for 'w' is not necessarily a problem for applications that are more concerned with throughput, provided there is enough bandwidth, but it can impact downstream message latency. For a backup member (discounting clock skew), a high value indicates latency in the replication connection.
w2d
The time between receipt of a commit packet to the point at which deserialization of the entries has started (by a backup store member).
d
The time spent deserializing a transaction's entries (by a backup store member).
per
The amount of time spent persisting transaction entries to disk. 💡 A high value here indicates that disk is a bottleneck.
icr
The amount of time spent sending transaction entries via an ICR Sender. 💡 A high value here indicates that intercluster replication is a bottleneck.
idx
The index latency records the time spent indexing records during commit.
c
The commit latency records the latency from commit start to the commit being stabilized to follower members and/or to disk.
Computed Statistics
A started engine statistics thread computes the following at the output frequency at which it was configured (programmatically or administratively). A sketch of the overall vs. delta rate computation follows the lists below.
Message Receipt
Overall Message Receive Rate (since engine start)
Delta Message Receive Rate (since last compute time)
Overall Best-Effort Message Receive Rate (since engine start)
Delta Best-Effort Message Receive Rate (since last compute time)
Overall Guaranteed Message Receive Rate (since engine start)
Delta Guaranteed Message Receive Rate (since last compute time)
Overall Message Source Rate (since engine start)
Delta Message Source Rate (since last compute time)
Message Send
Overall Message Send Rate (since engine start)
Delta Message Send Rate (since last compute time)
Overall Best-Effort Message Send Rate (since engine start)
Delta Best-Effort Message Send Rate (since last compute time)
Overall Guaranteed Message Send Rate (since engine start)
Delta Guaranteed Message Send Rate (since last compute time)
Event Receipt
Overall Event Receive Rate (since engine start)
Delta Event Receive Rate (since last compute time)
Overall Flow Event Receive Rate (since engine start)
Delta Flow Event Receive Rate (since last compute time)
Transactions
Overall Transaction Rate (since engine start)
Delta Transaction Rate (since last compute time)
Average Flow Events Per Transaction
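A minimal sketch of how overall and delta rates are derived from a monotonically increasing counter; the names, sample counts, and timestamps below are hypothetical:

```java
/**
 * Sketch: overall vs. delta rate computation as performed at each
 * stats compute interval. Overall rates are measured since engine
 * start; delta rates since the last compute time.
 */
public class RateComputation {
    public static void main(String[] args) {
        long engineStartMillis = 0;          // hypothetical engine start time
        long lastComputeMillis = 9_000;      // previous compute time
        long nowMillis = 10_000;             // current compute time
        long totalMsgsRcvd = 1_000_000;      // counter value now
        long msgsRcvdAtLastCompute = 905_000; // counter value at last compute

        double overallRate = totalMsgsRcvd * 1000.0 / (nowMillis - engineStartMillis);
        double deltaRate = (totalMsgsRcvd - msgsRcvdAtLastCompute) * 1000.0
                / (nowMillis - lastComputeMillis);
        System.out.printf("overall=%.0f msgs/sec, delta=%.0f msgs/sec%n",
                overallRate, deltaRate); // overall=100000, delta=95000
    }
}
```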
Message Type Specific
Message receipt and send statistics described above per message type
Histograms for application processing and filtering times.
Appendix A - Statistics Output Threads
Statistics output threads can be enabled to trace individual types of statistics for development and testing. For details on configuring statistics output threads, see Configuring Monitoring - Statistics Output Threads.
Appendix B – Statistics Output Format
The following sections break down the output of engine statistics. The format below is used when the stats are traced by an XVM, the Stats Dump Tool or the Statistics threads described in Appendix A.
This output is comprised of three sections:
The trace header
The overall stats
The message type specific stats
Trace Header
The trace header is the standard header output by the native X trace logger. When using SLF4J, the header will be appropriate to the configuration of the concrete logger bound to SLF4J.
Overall Stats
The next part of the output is the raw and computed statistics output by the statistics thread. The next sections explain the different sections of this output:
Flow Count
Overall Messages Received & Receipt Rate
Best Effort Messages Received & Receipt Rate
Guaranteed Messages Received & Receipt Rate
Messages Sourced & Sourcing Rate
Messages Filtered
Duplicate Messages Received
Overall Messages Sent & Send Rate
Best Effort Messages Sent & Send Rate
Guaranteed Messages Sent & Send Rate
Outbound Sequence Numbers
Outbound Queue Sizes
Events Received & Receipt Rate
Flow Events Received, Success vs Fail & Receipt Rate
Transactions Processed, Outstanding Commits & Transaction Rate
The number of outstanding commits represents the transactions currently in flight i.e. (NumCommitsStarted - NumCommitsCompleted)
Engine Store Size
Transaction Latencies
Event Multiplexer
Store
Bus Manager
Related Topics
Configuring Monitoring - Configure statistics collection
XVM Stats and Heartbeats - Server-level statistics and heartbeat output
Per Transaction Stats - Detailed per-transaction statistics logging
Stats Dump Tool - Command-line tool for analyzing statistics
Exposing Application Stats - Define custom application metrics
Next Steps
Review the metrics collected by default and determine if additional latency statistics are needed
Configure global and per-engine statistics settings based on performance requirements
Enable XVM heartbeats to collect and emit statistics
Set up monitoring tools to consume and visualize statistics
Tune statistics collection based on performance impact vs operational visibility needs