Exposing Application Stats
Overview
The platform allows users to define their own application-specific stats. These user-defined app stats can be registered with the AepEngine, which allows them to be traced along with AEP Engine Stats and included in XVM heartbeats. Applications can register stats with the AepEngine programmatically or, when running in a Talon XVM, have them discovered via annotations.
This article describes the usage of the following types of statistics that applications can expose:
Gauge
A Gauge samples and reports a value on each stats collection interval. Gauges can be exposed simply by annotating a field or method of interest.
Counter
A Counter stat captures a monotonically increasing value, and can be used to derive rates based on deltas between intervals.
Series
A Series stat allows recording of a series of data points upon which histographical computations can be reported.
Latencies
A special case of a Series stat used to collect timing-related data points. Latencies are used extensively within Core X to provide visibility into transaction processing pipeline times.
Gauges
A Gauge captures an instantaneous value at the time of a statistic collection. Gauges can be of the following types:
boolean, byte, short, int, long, float, double, char, String, or XString
Important considerations regarding gauge collection
Gauge values are collected on one or more stats collection threads that are separate from the business logic thread. Consequently:
Gauge field values must be declared as volatile to ensure changes to them are visible to collection threads.
Additionally, for method gauges:
Be sure that the computation cost is not so high that it skews statistics collection. Consider using a background thread for computing gauge values that are computationally expensive.
It is possible that more than one stats collection thread will be collecting and reporting on stats concurrently, so method gauges should be threadsafe.
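The two method gauge considerations above can be sketched in plain Java. This is a minimal illustration rather than platform code: the class `CachedGauge` and its methods are hypothetical, and the annotation that would expose `getValue()` as a gauge is omitted. The expensive computation runs on a background scheduler and publishes its result to a volatile field, so the accessor called by the collection threads is cheap and safe to call concurrently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical example class; not part of the platform API.
final class CachedGauge implements AutoCloseable {
    // volatile so a value written by the refresher thread is visible to collection threads
    private volatile long cachedValue;
    private final LongSupplier expensiveComputation;
    private final ScheduledExecutorService refresher;

    CachedGauge(LongSupplier expensiveComputation, long refreshMillis) {
        this.expensiveComputation = expensiveComputation;
        this.cachedValue = expensiveComputation.getAsLong(); // initial value
        this.refresher = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "gauge-refresher");
            t.setDaemon(true); // do not keep the JVM alive for the refresher
            return t;
        });
        this.refresher.scheduleAtFixedRate(this::refresh, refreshMillis, refreshMillis, TimeUnit.MILLISECONDS);
    }

    // Runs the costly computation off the collection threads and publishes the result.
    void refresh() {
        cachedValue = expensiveComputation.getAsLong();
    }

    // The accessor that would be exposed as a method gauge: a single volatile read.
    long getValue() {
        return cachedValue;
    }

    @Override
    public void close() {
        refresher.shutdownNow();
    }
}
```

The trade-off is staleness: a collected value can be up to one refresh period old, which is usually acceptable when the refresh period is shorter than the stats collection interval.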
Field Gauges
When running in a Talon XVM, it is possible to annotate a field as a gauge:
Method Gauges
When running in a Talon XVM, it is possible to annotate a method as a gauge accessor:
Method accessor gauges for primitive types are not Zero Garbage. This is because the platform invokes the annotated accessor (for example, getHasOrderErrors) via reflection, which generates autoboxing garbage. A better approach, if your application is sensitive to garbage, is to use a Gauge subclass which can directly return the primitive type.
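The boxing behavior can be observed with standard Java reflection alone. The sketch below does not use any platform classes; `ReflectionBoxingDemo` and `getOpenOrders` are hypothetical names. It shows that `Method.invoke` returns an `Object`, so a primitive `long` result comes back wrapped in a `java.lang.Long`.

```java
import java.lang.reflect.Method;

// Hypothetical demo class; it only illustrates standard Java reflection behavior.
final class ReflectionBoxingDemo {
    private volatile long openOrders = 12345L;

    public long getOpenOrders() {
        return openOrders;
    }

    // Mimics a reflective gauge read. Method.invoke returns Object, so the
    // primitive long returned by the accessor is wrapped in a java.lang.Long.
    static Object readReflectively(Object target, String methodName) {
        try {
            Method accessor = target.getClass().getMethod(methodName);
            return accessor.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("gauge accessor could not be invoked", e);
        }
    }

    public static void main(String[] args) {
        Object boxed = readReflectively(new ReflectionBoxingDemo(), "getOpenOrders");
        System.out.println(boxed.getClass().getName()); // prints java.lang.Long
    }
}
```

Values outside the small `Long` cache range (-128 to 127) generally require a new wrapper object on each call, which is the garbage that a gauge returning the primitive directly avoids.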
Gauge Subclass Field
You can subclass one of the XXXGauge implementations to avoid garbage associated with an annotated method. This is useful if your Gauge needs to be calculated or you are not running in a Talon XVM and you need to programmatically register a Gauge instance with the AepEngine.
Note that in the above case the 'name' attribute is omitted on the AppStat annotation because it is provided directly when creating the Gauge.
Gauges on Server Heartbeats
Gauges can be read programmatically on XVM heartbeats:
yields:
Gauges in Aep Engine Stats Trace
Threading Considerations for Gauges
Note that in the above examples, gauge fields are declared as volatile. This is because gauge values are collected by the statistics thread that is emitting XVM heartbeats, not the microservice's business logic thread.
Counters
A Counter is useful for recording a monotonically increasing value over time. Sampled periodically, it can be used to derive a rate. For example, a counter could be used to record the number of messages received. By sampling it over time, it can be used to derive a received message rate.
If Aep engine stats tracing is enabled, the above stat will be printed along with the rest of engine stats in the format:
<overallCount> <lastIntervalCount> (<overallRate> <lastIntervalRate>):
From the above, we can see that there were 9 invalid orders in the lifetime of the app, 1 invalid order in the last interval, and that the app is receiving a little over 1 invalid order / sec.
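How the two counts and two rates relate to successive counter samples can be sketched as follows. This is an illustration of the arithmetic only: `CounterRates` is a hypothetical class, and the two-decimal formatting is an assumption rather than the platform's actual trace format.

```java
import java.util.Locale;

// Hypothetical helper showing how interval counts and rates derive from counter samples.
final class CounterRates {
    private final long startNanos;
    private long lastCount;
    private long lastSampleNanos;

    CounterRates(long startNanos) {
        this.startNanos = startNanos;
        this.lastSampleNanos = startNanos;
    }

    // Returns "<overallCount> <lastIntervalCount> (<overallRate> <lastIntervalRate>)",
    // with both rates expressed per second.
    String sample(long count, long nowNanos) {
        long intervalCount = count - lastCount; // delta since the previous sample
        double overallSeconds = (nowNanos - startNanos) / 1e9;
        double intervalSeconds = (nowNanos - lastSampleNanos) / 1e9;
        double overallRate = overallSeconds > 0 ? count / overallSeconds : 0.0;
        double intervalRate = intervalSeconds > 0 ? intervalCount / intervalSeconds : 0.0;
        lastCount = count;
        lastSampleNanos = nowNanos;
        return String.format(Locale.ROOT, "%d %d (%.2f %.2f)", count, intervalCount, overallRate, intervalRate);
    }
}
```

For instance, sampling a count of 8 at 8 seconds and then a count of 9 at 9 seconds reports 1 event in the last interval and a rate of 1 per second for both the lifetime and the interval.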
User stats are also included in Server Heartbeats. The following code iterates through all user Counter stats and prints them out.
Series
Series stats allow capture of a series of datapoints and allow reporting of histographical statistics based on that series.
A common use case for a Series statistic is collecting latency timing data, where one would like to observe the median, minimum, maximum, and 99.99th percentile of message processing times to ensure that SLAs are being met. However, non-lossy collection and reporting of histographical latency statistics is a challenging problem in low latency systems because of the number of data points that need to be retained, computed, and serialized.
For example, imagine an application that records latency statistics for messages arriving at a rate of 10k/sec. To accurately compute and report percentiles with a collection period of 10 seconds, the application needs to retain at least 100,000 data points per statistic to perform histographical analysis for just one interval. Assuming that the values are double or long values (8 bytes each), that is ~800KB per statistic collected. Collecting and computing on such data is hard on processor memory caches and can have a disruptive impact on application processing times.
Furthermore, to perform longer term histographical analysis (across multiple collection periods) without losing any data, each set of interval results needs to be stored so that computation can be performed. Persisting such data to disk or emitting it in XVM heartbeats is also problematic because of the volume of data involved, which puts a strain on disk space and disk bandwidth or, in the case of heartbeats, on network bandwidth when emitted over the messaging fabric.
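The footprint estimate in the example can be reproduced with a few lines of arithmetic. The helper below is a hypothetical sketch; it counts only the raw values and ignores any per-object or buffer overhead.

```java
// Hypothetical helper reproducing the retained-data estimate for one collection interval.
final class SeriesFootprint {
    private SeriesFootprint() {}

    // Bytes needed to retain every data point captured during one collection period.
    static long retainedBytes(long pointsPerSecond, long collectionPeriodSeconds, int bytesPerPoint) {
        long points = Math.multiplyExact(pointsPerSecond, collectionPeriodSeconds);
        return Math.multiplyExact(points, (long) bytesPerPoint);
    }

    public static void main(String[] args) {
        // 10k messages/sec, 10 second collection period, 8-byte long values
        System.out.println(retainedBytes(10_000, 10, Long.BYTES)); // prints 800000
    }
}
```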
Loss-less series stats collection
Talon supports loss-less series capture by allowing all collected latency timing datapoints to be emitted in heartbeats. Provided that the number of data points captured during a collection period does not exceed the number the runtime retains, every datapoint can be emitted in heartbeats (which can be logged to disk or emitted over an SMA channel). However, this approach should be used sparingly as it is quite expensive.
Histogram (HDR) collection
As an alternative to reporting all captured data points, 3.1 introduces computed histogram reporting based on HDRHistogram which significantly reduces the size of heartbeats by maintaining a running computation of latency statistics. At each collection interval the captured latencies are fed into both a running histogram and an interval histogram.
An HDRHistogram compromises on precision of the captured latencies in favor of cheaper computation and storage of results while still maintaining a predictable precision. The documentation on HDR histogram provides details on the level of precision that is achieved. Practically speaking, however, for latency data points in the 100s of microseconds the precision that is guaranteed for collected percentiles is in the order of +/- 1us, which is acceptable for most applications (for tail values, say in the range of 1 minute, the value is guaranteed to be correct within +/- 60ms).
Creating a Series Stat
When Aep Engine Statistics are enabled, the statistic would then be traced:
In the above we can see that in the last interval, one new customer registered and their age was 21. Over the last 8 intervals, the average new customer age is 23 with the oldest being 29 and the youngest being 21.
Series Data in Server Heartbeats
Series data for user stats are exposed in the Server Monitoring Heartbeat using the SrvMonUserSeriesStat object:
SrvMonUserSeriesStat
Reports an application defined series statistic.
name
String
The name of the series statistic.
seriesType
SrvMonSeriesType
The type of the series data.
Currently only Integer Data series are supported. The types BYTE, SHORT, LONG, FLOAT and DOUBLE are reserved for future use. Processors of heartbeats should ensure that they check the data type here for future proofing.
intSeries
SrvMonIntSeries
The collected int series data for an INT series. This field should only be set when the series type is set to SrvMonSeriesType.INT.
SrvMonIntSeries
Latency statistics are reported in a SrvMonIntSeries object.
SrvMonIntSeries reports interval and running histogram data for a series of integer data points. It may also be used to report the captured datapoints, but because reporting the raw data is costly (both in terms of collection and size/bandwidth), the captured values are typically not reported.
SrvMonIntSeries is frequently used to capture measured latency timings, but can also be used to capture any integer data series.
dataPoints
int[]
When the XVM is configured to include the captured data points for the statistic, the returned array will include the values collected during this interval. This allows monitoring tools to perform non-lossy calculation of percentiles, provided no data points were skipped due to under sampling or a missed heartbeat. The number of valid values in the returned array is dictated by numDataPoints; if the length of the values array is longer than numDataPoints, subsequent values in the array should be ignored.
lastSequenceNumber
long
Sequence numbers for collected data points start at 1; a value of 0 indicates that no data points have been collected. The sequence number always indicates the number of data points that have been collected since the statistic was created or last reset. If the statistic is reset then this value will reset to 0.
numDataPoints
int
Indicates the number of data points collected in this interval. If no data points were collected, numDataPoints will be 0. Because sequence numbers start at 1, the sequence number of the first value collected in this interval is lastSequenceNumber - numDataPoints + 1; equivalently, lastSequenceNumber - numDataPoints should equal the lastSequenceNumber reported in the previous heartbeat. This can be used to determine whether data points were skipped between two consecutive heartbeats due to under sampling or a missing heartbeat.
skippedDataPoints
long
The runtime only holds on to a fixed number of data points for any particular Latency statistic. If the sampling interval is too long, some datapoints may be skipped. For example, suppose Latency stats are configured to hold a sample size of 1000 datapoints. If 2000 data points are captured per second and the stats collection interval is 1 second, then 1000 datapoints will be missed on each collection, which will skew results. The skipped data points counter thus indicates how many data points have been missed in the reported runningStats. If the count grows between two successive heartbeats, the values in intervalStats don't reflect all the activity since the last interval. The skipped data points counter is a running counter: it tracks the total number of data points that have been skipped since the underlying statistic was last reset.
intervalStats
SrvMonIntHistogram
Holds computed results for the datapoints captured for this heartbeat (e.g. for the numDataPoints captured). This field may not be set if numDataPoints is 0 or if interval computations are not done on the XVM.
runningStats
SrvMonIntHistogram
Holds computed results for the datapoints over the lifetime of this statistic (i.e. since seqNo 1). If the underlying statistic is reset then the running stats are correspondingly reset.
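A heartbeat consumer might apply the field semantics above along the following lines. This is a sketch under stated assumptions: `IntSeriesChecks` is a hypothetical helper, its parameters simply mirror the dataPoints, numDataPoints, lastSequenceNumber, and skippedDataPoints fields, and it does not use the actual SrvMonIntSeries accessors.

```java
import java.util.Arrays;

// Hypothetical helper for a heartbeat consumer. The parameters mirror the
// SrvMonIntSeries fields described above; this is not a platform class.
final class IntSeriesChecks {
    private IntSeriesChecks() {}

    // Only the first numDataPoints entries of the dataPoints array are valid.
    static int[] validDataPoints(int[] dataPoints, int numDataPoints) {
        if (dataPoints == null || numDataPoints <= 0) {
            return new int[0];
        }
        return Arrays.copyOf(dataPoints, Math.min(numDataPoints, dataPoints.length));
    }

    // Number of data points collected between two consecutive heartbeats but
    // reported by neither. Zero means the heartbeats are contiguous.
    static long missingBetween(long previousLastSequenceNumber, long lastSequenceNumber, int numDataPoints) {
        long expectedPrevious = lastSequenceNumber - numDataPoints;
        // Clamp at 0 so a statistic reset (sequence numbers restarting) is not reported as a gap.
        return Math.max(0L, expectedPrevious - previousLastSequenceNumber);
    }

    // Data points lost in one interval when more are captured than the sample size can hold.
    static long skippedPerInterval(long capturedInInterval, int sampleSize) {
        return Math.max(0L, capturedInInterval - sampleSize);
    }

    // skippedDataPoints is a running counter, so growth between heartbeats
    // means the latest interval results do not cover every data point.
    static boolean intervalStatsIncomplete(long previousSkipped, long currentSkipped) {
        return currentSkipped > previousSkipped;
    }
}
```

With the numbers from the skippedDataPoints description, `skippedPerInterval(2000, 1000)` gives 1000 missed data points per collection.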
SrvMonIntHistogram
Holds calculated statistics of a range of integer datapoints. The values are computed using an HDRHistogram.
sampleSize
long
The number of datapoints over which results were calculated (possibly 0 if no data points were collected).
minimum
int
The minimum value recorded in the sample set. The value is not set if the sample size is 0.
maximum
int
The maximum value recorded in the sample set. The value is not set if the sample size is 0.
mean
int
The mean for the values recorded in the sample set. The value is not set if the sample size is 0.
median
int
The median for the values recorded in the sample set. The value is not set if the sample size is 0.
pct75
int
The 75th percentile for the values recorded in the sample set. The value is not set if the sample size is 0.
pct90
int
The 90th percentile for the values recorded in the sample set. The value is not set if the sample size is 0.
pct99
int
The 99th percentile for the values recorded in the sample set. The value is not set if the sample size is 0.
pct999
int
The 99.9th percentile for the values recorded in the sample set. The value is not set if the sample size is 0.
pct9999
int
The 99.99th percentile for the values recorded in the sample set. The value is not set if the sample size is 0.
samplesOverMax
long
The number of samples that exceeded the maximum recordable value for the histogram. When computing latency percentiles using an HDRHistogram, it is possible that a recorded value will exceed the maximum value allowable. In this case, the datapoint is downsampled to the maximum recordable value, which skews the percentile calculations lower. SamplesOverMax allows detection of how frequently this is occurring.
samplesUnderMin
long
The number of samples captured that were below the recordable value for the histogram. When computing latency percentiles using an HDRHistogram, it is possible that a recorded value will be below 0 in cases where clock skew is possible. In such cases, the value will be upsampled to 0, which can skew the histogram results. SamplesUnderMin allows detection of how frequently this is happening.
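The handling of out-of-range values described for samplesOverMax and samplesUnderMin amounts to clamping with bookkeeping. The sketch below is hypothetical and is neither the platform's nor HDRHistogram's implementation; `ClampingRecorder` and its fields are illustrative names.

```java
// Hypothetical sketch of range clamping with over/under counters.
final class ClampingRecorder {
    private final int maxRecordable;
    private long sampleSize;
    private long samplesOverMax;
    private long samplesUnderMin;

    ClampingRecorder(int maxRecordable) {
        this.maxRecordable = maxRecordable;
    }

    // Returns the value that would actually be recorded in the histogram.
    int record(int value) {
        int recorded = value;
        if (value > maxRecordable) {
            recorded = maxRecordable; // clamp down; skews percentiles lower
            samplesOverMax++;
        } else if (value < 0) {
            recorded = 0; // clamp up, e.g. a negative latency caused by clock skew
            samplesUnderMin++;
        }
        sampleSize++;
        return recorded;
    }

    long sampleSize() { return sampleSize; }
    long samplesOverMax() { return samplesOverMax; }
    long samplesUnderMin() { return samplesUnderMin; }
}
```

A monitoring tool can compare samplesOverMax and samplesUnderMin against sampleSize to judge how much the reported percentiles may be distorted.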
Latencies
The Latencies stat is an extension of the Series stat. When capturing latency or timing data, it is good practice to use Latencies instead of Series.
User Defined Statistic Discovery
@AppStat Annotation
The AppStat annotation can be used to annotate user defined statistics in the microservice to allow those statistics to be discovered by a Talon XVM. The Talon XVM will register each statistic it finds with the microservice's AepEngine. AppStat annotations are only introspected once: just after the microservice's AepEngine is injected. If the microservice replaces a stat instance after microservice initialization, the new stat instance won't be discovered by the XVM.
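The consequence of one-time introspection can be illustrated with standard Java reflection. Everything in this sketch is a stand-in: `DemoStat` is a locally declared annotation used only so the example compiles on its own, and `DemoCounter`, `DemoService`, and `OneTimeDiscovery` are hypothetical. It is not how the Talon XVM is implemented, only a model of the behavior described above.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Local stand-in for a stat annotation; declared here only to keep the sketch self-contained.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface DemoStat {}

// Hypothetical stat type.
final class DemoCounter {
    long count;
}

// Hypothetical microservice class holding an annotated stat field.
final class DemoService {
    @DemoStat
    DemoCounter ordersReceived = new DemoCounter();
}

final class OneTimeDiscovery {
    private OneTimeDiscovery() {}

    // Captures the stat instances referenced by annotated fields at the moment of the scan.
    static List<Object> discover(Object container) {
        List<Object> found = new ArrayList<>();
        for (Field field : container.getClass().getDeclaredFields()) {
            if (field.isAnnotationPresent(DemoStat.class)) {
                field.setAccessible(true);
                try {
                    found.add(field.get(container));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return found;
    }
}
```

Because the scan stores references to the instances that existed at scan time, reassigning `ordersReceived` afterwards leaves the registered list pointing at the old instance. Declaring stat fields final avoids this pitfall.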
@AppStatContainersAccessor
Any @AppStat annotated field in the main microservice class will be discovered by the Talon XVM. If additional classes in your microservice contain user defined stats, they can be exposed to the XVM using the AppStatContainersAccessor annotation.
AppStat Discovery in Hornet
For Topic Oriented Applications, any @Managed object will be introspected for user defined stats. See ManagedObjectLocator. The DefaultManagedObjectLocator for Hornet calls TopicOrientedApplication.addAppStatContainers(Set), so unless your application provides its own managed object locator, additional user defined stats containers can be added by overriding addAppStatContainers:
Programmatically Registering Stats
When running in a Talon XVM, the XVM registers discovered App Stats with the AepEngine. When not running in a Talon XVM, user defined stats may be registered programmatically with the AepEngine by calling the appropriate register method:
Counter
registerCounterStat(IStats.Counter counter)
Gauge
registerGaugeStat(IStats.Gauge gauge)
Series and Latencies
registerSeriesStat(IStats.Series series)
If not registered with the engine, app stats will not be collected with other engine stats when engine stats are enabled.
Registration of User Defined stats is only supported prior to engine startup.
Related Topics
AEP Engine Statistics - Engine-level statistics
XVM Stats and Heartbeats - XVM heartbeat configuration
Next Steps
Determine which stat types best fit your application metrics
Annotate fields or methods with @AppStat
Enable XVM heartbeats to collect statistics
Monitor stats via heartbeat handlers or Admin tools
Optimize for zero-garbage if needed using Gauge subclasses