Test Description
This page describes the complete methodology for the canonical performance benchmark used to measure X Platform performance across releases.
Test Program
The benchmark uses the ESProcessor (Event Sourcing Processor) from the X Platform Performance Benchmark Suite. The ESProcessor exercises the complete Receive-Process-Send flow of a clustered microservice using Event Sourcing HA policy.
Documentation: See the AEP Module documentation for complete details on the test program, parameters, and configuration options.
Test Flow
The benchmark exercises a complete message flow through a clustered microservice consisting of a primary and backup instance:
Primary Microservice
The primary microservice executes the following steps:
1. Decode Inbound Message - Deserialize the incoming message from wire format
2. Dispatch to Handler - Route the message to the appropriate business logic handler
3. Read All Fields From Message - Business logic accesses the message data
4. Create and Send Message - Business logic creates the response message
5.1. Replicate - Replicate the state change to the backup
5.2. Persist - (Concurrent) Persist the state change to disk on the primary
6. Consensus ACK - Receive the acknowledgment from the backup
7. Encode Outbound Message - Serialize the response message to wire format
Backup Microservice
The backup microservice maintains consistency through the following steps, which run concurrently with the primary's step 5.2:
5.1. Replicate - Receive replicated state from primary
5.3. Persist - Persist replicated state to disk
5.4. Dispatch to Handler - Process replicated message in business logic
5.5. Replay business logic - Execute business logic for consistency
5.6. Consensus ACK - Send acknowledgment back to primary
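The fork-and-join around step 5 can be sketched as follows. This is a hypothetical illustration, not X Platform code (the class and method names are invented): the primary submits replication and local persistence concurrently, then waits for the consensus ACK before releasing the response.

```java
import java.util.concurrent.*;

// Hypothetical sketch of the primary's step 5: replication to the backup
// and local persistence run concurrently; the primary waits for the
// backup's consensus ACK (and its own persist) before encoding the response.
public class PrimaryFlowSketch {
    static String process(String inbound, ExecutorService pool) throws Exception {
        // Steps 1-4: decode, dispatch, read fields, create response (collapsed here)
        String response = "response-to-" + inbound;

        // 5.1: replicate the state change to the backup (yields the consensus ACK)
        Future<Boolean> ack = pool.submit(() -> replicateToBackup(response));
        // 5.2: persist the state change to disk, concurrent with replication
        Future<Boolean> persisted = pool.submit(() -> persistLocally(response));

        // Step 6: wait for the consensus ACK (and local persist) before sending
        if (ack.get() && persisted.get()) {
            return response; // step 7: encode and send (collapsed)
        }
        throw new IllegalStateException("consensus failed");
    }

    static boolean replicateToBackup(String state) { return true; } // stand-in
    static boolean persistLocally(String state)    { return true; } // stand-in

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        System.out.println(process("msg-1", pool));
        pool.shutdown();
    }
}
```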
Test Message
Message Characteristics
Type: Full-featured message exercising the complete X Platform data model
Serialized Size: ~200 bytes
Encoding: Xbuf2 (X Platform's high-performance binary encoding)
Structure: Contains all standard data types (primitives, strings, nested entities, arrays)
Code Paths Exercised
The benchmark tests the following X Platform capabilities:
✅ Exercised Paths:
Message serialization/deserialization
Handler dispatch
Persistence
Cluster replication
Threading
Consensus protocol
❌ Not Exercised:
Message logging
ICR (Inter-Cluster Replication)
Primary Metric: Wire-to-Wire (w2w) Latency
The w2w metric measures the time from when an inbound message is received ("post-wire") to when the corresponding outbound message is sent ("pre-wire").
What is Included
The w2w latency encompasses:
Inbound message deserialization (wire format to POJO)
Message handoff to business logic thread
Handler dispatch
Message data access by business logic
State persistence
Cluster replication to backup
Replication acknowledgment from backup
Outbound message creation
Outbound message serialization (POJO to wire format)
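A minimal sketch of how a w2w measurement brackets these steps (illustrative only; the benchmark's actual instrumentation is not shown in this document): the clock starts after the inbound bytes arrive ("post-wire") and stops just before the outbound bytes are written ("pre-wire"), so network transit is excluded while all processing steps are included.

```java
// Hypothetical sketch of a wire-to-wire (w2w) measurement window.
public class W2wSketch {
    public static void main(String[] args) {
        long postWire = System.nanoTime();   // inbound message received
        // ... deserialize, dispatch, persist, replicate, create response ...
        long preWire = System.nanoTime();    // about to write outbound bytes
        long w2wMicros = (preWire - postWire) / 1_000;
        System.out.println(w2wMicros >= 0);  // elapsed w2w time in µs
    }
}
```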
Latency Percentiles
Results are reported as:
50th percentile (median) - Typical latency
99th percentile - Tail latency under normal conditions
99.9th percentile - Worst-case latency for high-percentile SLAs
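Assuming the common nearest-rank method (this document does not specify the suite's exact percentile algorithm), the reported percentiles can be read off the sorted sample set like this:

```java
import java.util.Arrays;

// Hypothetical sketch of percentile reporting: sort the recorded w2w
// samples and read the value at the given rank (nearest-rank method).
public class PercentileSketch {
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, rank)];
    }

    public static void main(String[] args) {
        long[] samples = {12, 11, 10, 14, 95, 13, 12, 11, 13, 12}; // µs, invented
        Arrays.sort(samples);
        System.out.println(percentile(samples, 50.0)); // median
        System.out.println(percentile(samples, 99.0)); // tail
    }
}
```

Note how a single 95µs outlier leaves the median untouched but dominates the 99th percentile; this is why the tail percentiles are the consistency signal.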
Test Variables
The benchmark measures performance across multiple configuration dimensions:
Runtime Optimization Mode
Latency - XVM optimized for lowest latency
Throughput - XVM optimized for highest throughput
Message Population/Extraction Method
Indirect - Message data accessed via POJO setter/getter methods; standard object-oriented access
Direct - Message data accessed via serializer/deserializer objects; zero-copy access, higher performance
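The difference between the two extraction methods can be sketched as follows; `OrderPojo` and `OrderReader` are invented names for illustration, not X Platform types:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch contrasting the two extraction methods. Indirect
// access copies field values into a POJO during deserialization and reads
// them through getters; direct access decodes fields straight from the
// serialized buffer via a reader object (zero-copy).
public class AccessSketch {
    // Indirect: a POJO populated from the wire buffer up front
    static final class OrderPojo {
        private final int qty;
        OrderPojo(int qty) { this.qty = qty; }
        int getQty() { return qty; }            // standard getter access
    }

    // Direct: a reader that decodes fields lazily from the wire buffer
    static final class OrderReader {
        private final ByteBuffer buf;
        OrderReader(ByteBuffer buf) { this.buf = buf; }
        int qty() { return buf.getInt(0); }     // zero-copy field read
    }

    public static void main(String[] args) {
        ByteBuffer wire = ByteBuffer.allocate(4).putInt(0, 42);
        System.out.println(new OrderPojo(wire.getInt(0)).getQty()); // indirect
        System.out.println(new OrderReader(wire).qty());            // direct
    }
}
```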
CPU Configuration
The # CPUs value represents the number of system CPUs actually utilized by the test configuration.
MinCPU - 1 CPU - Business logic thread (affinitized, hot) + cluster replication reader (affinitized, not hot). Minimal CPU footprint: only the business logic thread runs "hot" (spinning); the replication reader is affinitized but consumes minimal CPU time, so total utilization is closer to 1 CPU than 2.
Default - 4 CPUs - X decides thread allocation. Balanced configuration: X automatically determines the optimal thread count and affinitization.
MaxCPU - 6 CPUs - Default threads + detached sender (affinitized, hot) + detached dispatcher (affinitized, hot). Maximum parallelization, with additional hot threads for sending and dispatching.
Note on "Hot" threads: A "hot" thread runs in a tight spin loop, continuously consuming a full CPU core for maximum responsiveness. Non-hot threads are affinitized (pinned to specific cores) but block when idle, consuming minimal CPU.
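A hot thread can be sketched as a busy-polling loop (illustrative only; core pinning itself requires OS-level or JNI support and is omitted here):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a "hot" thread: it spins in a tight loop polling
// for work instead of blocking, trading a fully consumed core for the
// lowest possible wakeup latency.
public class HotThreadSketch {
    public static void main(String[] args) throws Exception {
        ConcurrentLinkedQueue<String> inbox = new ConcurrentLinkedQueue<>();
        AtomicBoolean running = new AtomicBoolean(true);

        Thread hot = new Thread(() -> {
            while (running.get()) {
                String msg = inbox.poll();   // busy-poll: never blocks
                if (msg != null) {
                    System.out.println("handled " + msg);
                    running.set(false);      // stop after one message (demo only)
                }
                Thread.onSpinWait();         // hint to the CPU that we are spinning
            }
        });
        hot.start();
        inbox.offer("msg-1");
        hot.join();
    }
}
```

A non-hot thread would instead block on the queue (e.g. a `BlockingQueue.take()`), freeing the core while idle at the cost of wakeup latency.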
Test Hardware
Servers
Model: Supermicro SYS-110P-WTR
CPU: 1 x Intel Xeon Gold 6334 (8-Core, 3.6 GHz)
Memory: 128GB (4 x 32GB)
Network: NVIDIA/Mellanox ConnectX-6 InfiniBand dual-port
Storage: NVME M.2 2TB
Network
Switch: NVIDIA Quantum InfiniBand Switch
Configuration: Standard TCP/IP (VMA not enabled, unoptimized)
Round-trip wire latency: ~23µs (unoptimized network)
Hardware Tuning
The servers are configured for low-latency operation:
✅ Applied:
Dynamic power management = OFF
Hyperthreading = OFF
Linux performance profile = latency-performance
❌ Not applied:
VMA (Mellanox kernel bypass) = OFF
RDMA (Remote Direct Memory Access) = OFF
Notes:
VMA is Mellanox's kernel-bypass library, the equivalent of Solarflare's OpenOnload
RDMA is not supported in X Platform 3.16
Software Configuration
CPU affinitization: ON (threads pinned to specific cores)
Test driver: Custom in-process messaging driver (zero network overhead)
Test Execution
Latency Tests
Message Rate: 10,000 messages/second (sustained)
Duration: Sufficient for statistical significance
Measurement: Percentile latencies (50th, 99th, 99.9th)
Throughput Tests
Message Rate: As fast as possible (saturated load)
Duration: Sufficient to reach steady state
Measurement: Messages processed per second
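The sustained 10,000 msgs/sec injection rate of the latency tests implies one send every 100µs. A sketch of deadline-based pacing (illustrative; not the actual benchmark driver): each send is scheduled against an absolute deadline so that bursts and scheduler drift do not pollute the latency measurement.

```java
// Hypothetical sketch of fixed-rate message injection at 10,000 msgs/sec.
public class PacedSenderSketch {
    public static void main(String[] args) {
        final long intervalNanos = 1_000_000_000L / 10_000; // 100µs between sends
        long next = System.nanoTime();
        int sent = 0;
        while (sent < 100) {                  // 100 messages ≈ 10ms of load (demo)
            while (System.nanoTime() < next) { /* spin until the deadline */ }
            sent++;                            // send() would go here
            next += intervalNanos;             // absolute schedule: no drift accumulates
        }
        System.out.println(sent);
    }
}
```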
Interpreting Results
Latency Results
All latency numbers are in microseconds (µs)
Round-trip wire latency (~23µs on unoptimized network) is included in all results
Lower numbers indicate better performance
Tail latencies (99th, 99.9th percentile) indicate consistency
Throughput Results
Measured in messages per second
Higher numbers indicate better performance
Represents maximum sustained throughput under saturation
Configuration Trade-offs
MinCPU: Lowest resource usage, may limit throughput
Default: Balanced latency and throughput
MaxCPU: Highest parallelization, may increase coordination overhead
Direct access: Best performance, requires more careful coding
Indirect access: Easier to use, slightly lower performance
Next Steps
Test Results - View performance results by release
Canonical Benchmark Overview - Return to canonical benchmark overview
Benchmark Suite - Full benchmark suite documentation