Test Description
This page describes the complete methodology for the canonical performance benchmark used to measure X Platform performance across releases.
Test Program
The benchmark uses the ESProcessor (Event Sourcing Processor) from the X Platform Performance Benchmark Suite. The ESProcessor exercises the complete Receive-Process-Send flow of a clustered microservice using Event Sourcing HA policy.
Documentation: See the AEP Module documentation for complete details on the test program, parameters, and configuration options.
Test Flow
The benchmark exercises a complete message flow through a clustered microservice consisting of a primary and backup instance:
Primary Microservice
The primary microservice executes the following steps:
1. Decode Inbound Message - Deserialize the incoming message from wire format
2. Dispatch to Handler - Route the message to the appropriate business logic handler
3. Read All Fields From Message - Business logic accesses the message data
4. Create and Send Message - Business logic creates the response message
5.1. Replicate - Replicate the state change to the backup
5.2. Persist - (Concurrent) Persist the state change to disk on the primary
6. Consensus ACK - Receive the acknowledgment from the backup
7. Encode Outbound Message - Serialize the response message to wire format
Backup Microservice
The backup microservice maintains consistency through the following steps, which run concurrently with the primary's step 5.2:
5.1. Replicate - Receive replicated state from primary
5.3. Persist - Persist replicated state to disk
5.4. Dispatch to Handler - Process replicated message in business logic
5.5. Replay business logic - Execute business logic for consistency
5.6. Consensus ACK - Send acknowledgment back to primary
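The fork-and-join around step 5 can be sketched as follows. This is a hypothetical illustration, not X Platform code (the class and method names are invented): the primary submits replication and local persistence concurrently, then waits for the consensus ACK before releasing the response.

```java
import java.util.concurrent.*;

// Hypothetical sketch of the primary's step 5: replication to the backup
// and local persistence run concurrently; the primary waits for the
// backup's consensus ACK (and its own persist) before encoding the response.
public class PrimaryFlowSketch {
    static String process(String inbound, ExecutorService pool) throws Exception {
        // Steps 1-4: decode, dispatch, read fields, create response (collapsed here)
        String response = "response-to-" + inbound;

        // 5.1: replicate the state change to the backup (yields the consensus ACK)
        Future<Boolean> ack = pool.submit(() -> replicateToBackup(response));
        // 5.2: persist the state change to disk, concurrent with replication
        Future<Boolean> persisted = pool.submit(() -> persistLocally(response));

        // Step 6: wait for the consensus ACK (and local persist) before sending
        if (ack.get() && persisted.get()) {
            return response; // step 7: encode and send (collapsed)
        }
        throw new IllegalStateException("consensus failed");
    }

    static boolean replicateToBackup(String state) { return true; } // stand-in
    static boolean persistLocally(String state)    { return true; } // stand-in

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        System.out.println(process("msg-1", pool));
        pool.shutdown();
    }
}
```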
Test Message
Message Characteristics
Type: Full-featured message exercising the complete X Platform data model
Serialized Size: ~200 bytes
Encoding: Xbuf2 (X Platform's high-performance binary encoding)
Structure: Contains all standard data types (primitives, strings, nested entities, arrays)
Code Paths Exercised
The benchmark tests the following X Platform capabilities:
✅ Exercised Paths:
Message serialization/deserialization
Handler dispatch
Persistence
Cluster replication
Threading
Consensus protocol
❌ Not Exercised:
Message logging
ICR (Inter-Cluster Replication)
Primary Metric: Wire-to-Wire (w2w) Latency
The w2w metric measures the time from when an inbound message is received ("post-wire") to when the corresponding outbound message is sent ("pre-wire").
What is Included
The w2w latency encompasses:
Inbound message deserialization (wire format to POJO)
Message handoff to business logic thread
Handler dispatch
Message data access by business logic
State persistence
Cluster replication to backup
Replication acknowledgment from backup
Outbound message creation
Outbound message serialization (POJO to wire format)
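A minimal sketch of how a w2w measurement brackets these steps (illustrative only; the benchmark's actual instrumentation is not shown in this document): the clock starts after the inbound bytes arrive ("post-wire") and stops just before the outbound bytes are written ("pre-wire"), so network transit is excluded while all processing steps are included.

```java
// Hypothetical sketch of a wire-to-wire (w2w) measurement window.
public class W2wSketch {
    public static void main(String[] args) {
        long postWire = System.nanoTime();   // inbound message received
        // ... deserialize, dispatch, persist, replicate, create response ...
        long preWire = System.nanoTime();    // about to write outbound bytes
        long w2wMicros = (preWire - postWire) / 1_000;
        System.out.println(w2wMicros >= 0);  // elapsed w2w time in µs
    }
}
```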
Latency Percentiles
Results are reported as:
50th percentile (median) - Typical latency
99th percentile - Tail latency under normal conditions
99.9th percentile - Worst-case latency for high-percentile SLAs
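Assuming the common nearest-rank method (this document does not specify the suite's exact percentile algorithm), the reported percentiles can be read off the sorted sample set like this:

```java
import java.util.Arrays;

// Hypothetical sketch of percentile reporting: sort the recorded w2w
// samples and read the value at the given rank (nearest-rank method).
public class PercentileSketch {
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, rank)];
    }

    public static void main(String[] args) {
        long[] samples = {12, 11, 10, 14, 95, 13, 12, 11, 13, 12}; // µs, invented
        Arrays.sort(samples);
        System.out.println(percentile(samples, 50.0)); // median
        System.out.println(percentile(samples, 99.0)); // tail
    }
}
```

Note how a single 95µs outlier leaves the median untouched but dominates the 99th percentile; this is why the tail percentiles are the consistency signal.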
Test Variables
The benchmark measures performance across multiple configuration dimensions:
Runtime Optimization Mode
Latency - XVM optimized for lowest latency
Throughput - XVM optimized for highest throughput
Message Population/Extraction Method
Indirect - Message data accessed via POJO setter/getter methods; standard object-oriented access
Direct - Message data accessed via serializer/deserializer objects; zero-copy access, higher performance
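The difference between the two extraction methods can be sketched as follows; `OrderPojo` and `OrderReader` are invented names for illustration, not X Platform types:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch contrasting the two extraction methods. Indirect
// access copies field values into a POJO during deserialization and reads
// them through getters; direct access decodes fields straight from the
// serialized buffer via a reader object (zero-copy).
public class AccessSketch {
    // Indirect: a POJO populated from the wire buffer up front
    static final class OrderPojo {
        private final int qty;
        OrderPojo(int qty) { this.qty = qty; }
        int getQty() { return qty; }            // standard getter access
    }

    // Direct: a reader that decodes fields lazily from the wire buffer
    static final class OrderReader {
        private final ByteBuffer buf;
        OrderReader(ByteBuffer buf) { this.buf = buf; }
        int qty() { return buf.getInt(0); }     // zero-copy field read
    }

    public static void main(String[] args) {
        ByteBuffer wire = ByteBuffer.allocate(4).putInt(0, 42);
        System.out.println(new OrderPojo(wire.getInt(0)).getQty()); // indirect
        System.out.println(new OrderReader(wire).qty());            // direct
    }
}
```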
CPU Configuration
The # CPUs value represents the number of system CPUs actually utilized by the test configuration.
MinCPU - 1 CPU - Business logic thread (affinitized, hot) + cluster replication reader (affinitized, not hot). Minimal CPU footprint: only the business logic thread runs "hot" (spinning); the replication reader is affinitized but consumes minimal CPU time, so total utilization is closer to 1 CPU than 2.
Default - 4 CPUs - X decides thread allocation. Balanced configuration: X automatically determines the optimal thread count and affinitization.
MaxCPU - 6 CPUs - Default threads + detached sender (affinitized, hot) + detached dispatcher (affinitized, hot). Maximum parallelization, with additional hot threads for sending and dispatching.
Note on "Hot" threads: A "hot" thread runs in a tight spin loop, continuously consuming a full CPU core for maximum responsiveness. Non-hot threads are affinitized (pinned to specific cores) but block when idle, consuming minimal CPU.
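A hot thread can be sketched as a busy-polling loop (illustrative only; core pinning itself requires OS-level or JNI support and is omitted here):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a "hot" thread: it spins in a tight loop polling
// for work instead of blocking, trading a fully consumed core for the
// lowest possible wakeup latency.
public class HotThreadSketch {
    public static void main(String[] args) throws Exception {
        ConcurrentLinkedQueue<String> inbox = new ConcurrentLinkedQueue<>();
        AtomicBoolean running = new AtomicBoolean(true);

        Thread hot = new Thread(() -> {
            while (running.get()) {
                String msg = inbox.poll();   // busy-poll: never blocks
                if (msg != null) {
                    System.out.println("handled " + msg);
                    running.set(false);      // stop after one message (demo only)
                }
                Thread.onSpinWait();         // hint to the CPU that we are spinning
            }
        });
        hot.start();
        inbox.offer("msg-1");
        hot.join();
    }
}
```

A non-hot thread would instead block on the queue (e.g. a `BlockingQueue.take()`), freeing the core while idle at the cost of wakeup latency.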
Test Hardware
Servers
Model: Supermicro SYS-110P-WTR
CPU: 1 x Intel Xeon Gold 6334 (8-Core, 3.6 GHz)
Memory: 128GB (4 x 32GB)
Network: NVIDIA/Mellanox ConnectX-6 InfiniBand dual-port
Storage: NVME M.2 2TB
Network
Switch: NVIDIA Quantum InfiniBand Switch
Configuration: Standard TCP/IP (VMA not enabled, unoptimized)
Round-trip wire latency: ~23µs (unoptimized network)
Hardware Tuning
The servers are configured for low-latency operation:
✅ Applied:
Dynamic power management = OFF
Hyperthreading = OFF
Linux performance profile = latency-performance
❌ Not applied:
VMA (Mellanox kernel bypass) = OFF
RDMA (Remote Direct Memory Access) = OFF
Notes:
VMA is Mellanox's kernel-bypass library, the equivalent of Solarflare's OpenOnload
RDMA is not supported in X Platform 3.16
Software Configuration
CPU affinitization: ON (threads pinned to specific cores)
Test driver: Custom in-process messaging driver (zero network overhead)
Test Execution
Latency Tests
Message Rate: 10,000 messages/second (sustained)
Duration: Sufficient for statistical significance
Measurement: Percentile latencies (50th, 99th, 99.9th)
Throughput Tests
Message Rate: As fast as possible (saturated load)
Duration: Sufficient to reach steady state
Measurement: Messages processed per second
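The sustained 10,000 msgs/sec injection rate of the latency tests implies one send every 100µs. A sketch of deadline-based pacing (illustrative; not the actual benchmark driver): each send is scheduled against an absolute deadline so that bursts and scheduler drift do not pollute the latency measurement.

```java
// Hypothetical sketch of fixed-rate message injection at 10,000 msgs/sec.
public class PacedSenderSketch {
    public static void main(String[] args) {
        final long intervalNanos = 1_000_000_000L / 10_000; // 100µs between sends
        long next = System.nanoTime();
        int sent = 0;
        while (sent < 100) {                  // 100 messages ≈ 10ms of load (demo)
            while (System.nanoTime() < next) { /* spin until the deadline */ }
            sent++;                            // send() would go here
            next += intervalNanos;             // absolute schedule: no drift accumulates
        }
        System.out.println(sent);
    }
}
```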
Interpreting Results
Latency Results
All latency numbers are in microseconds (µs)
Round-trip wire latency (~23µs on unoptimized network) is included in all results
Lower numbers indicate better performance
Tail latencies (99th, 99.9th percentile) indicate consistency
Throughput Results
Measured in messages per second
Higher numbers indicate better performance
Represents maximum sustained throughput under saturation
Configuration Trade-offs
MinCPU: Lowest resource usage, may limit throughput
Default: Balanced latency and throughput
MaxCPU: Highest parallelization, may increase coordination overhead
Direct access: Best performance, requires more careful coding
Indirect access: Easier to use, slightly lower performance
Next Steps
Test Results - View performance results by release
Canonical Benchmark Overview - Return to canonical benchmark overview
Benchmark Suite - Full benchmark suite documentation