Test Description

This page describes the complete methodology for the canonical performance benchmark used to measure X Platform performance across releases.

Test Program

The benchmark uses the ESProcessor (Event Sourcing Processor) from the X Platform Performance Benchmark Suite. The ESProcessor exercises the complete Receive-Process-Send flow of a clustered microservice using the Event Sourcing HA policy.

Documentation: See the AEP Module documentation for complete details on the test program, parameters, and configuration options.

Test Flow

The benchmark exercises a complete message flow through a clustered microservice consisting of a primary and backup instance:

Primary Microservice

The primary microservice executes the following steps:

  1. Decode Inbound Message - Deserialize incoming message from wire format

  2. Dispatch to Handler - Route message to appropriate business logic handler

  3. Read all fields from message - Business logic accesses message data

  4. Create and send message - Business logic creates response message

  5. Replicate - Replicate state change to backup

    • 5.2. Persist - (Concurrent) Persist state change to disk on primary

  6. Consensus ACK - Receive acknowledgment from backup

  7. Encode Outbound Message - Serialize response message to wire format
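The primary-side steps above can be sketched in miniature. All class and method names here are hypothetical; the real X Platform APIs for decoding, dispatch, replication, and encoding differ, and the replication/persistence/consensus steps (5-6) are only marked as comments.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the primary's Receive-Process-Send flow.
public class PrimaryFlowSketch {
    // Step 1: decode inbound message from wire format (here: plain UTF-8 bytes)
    static String decode(byte[] wire) {
        return new String(wire, StandardCharsets.UTF_8);
    }

    // Steps 2-4: dispatch to handler, read message data, create the response
    static String handle(String request) {
        return "ack:" + request;
    }

    // Step 7: encode outbound message back to wire format
    static byte[] encode(String response) {
        return response.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] inbound = "order-42".getBytes(StandardCharsets.UTF_8);
        String request = decode(inbound);   // 1. decode inbound message
        String response = handle(request);  // 2-4. dispatch + business logic
        // 5-6. replicate to backup, persist, await consensus ACK (omitted)
        byte[] outbound = encode(response); // 7. encode outbound message
        System.out.println(new String(outbound, StandardCharsets.UTF_8));
    }
}
```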

Backup Microservice

The backup microservice maintains consistency through the following steps, which run concurrently with the primary's step 5.2:

  • 5.1. Replicate - Receive replicated state from primary

  • 5.3. Persist - Persist replicated state to disk

  • 5.4. Dispatch to Handler - Process replicated message in business logic

  • 5.5. Replay business logic - Execute business logic for consistency

  • 5.6. Consensus ACK - Send acknowledgment back to primary

Test Message

Message Characteristics

  • Type: Full-featured message exercising the complete X Platform data model

  • Serialized Size: ~200 bytes

  • Encoding: Xbuf2 (X Platform's high-performance binary encoding)

  • Structure: Contains all standard data types (primitives, strings, nested entities, arrays)
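The field mix can be pictured with a plain POJO. This is purely illustrative: the real test message is an Xbuf2-encoded type generated by X Platform tooling, and none of these field names come from the benchmark.

```java
import java.util.List;

// Hypothetical shape of the ~200-byte test message, showing the mix of
// field types the benchmark exercises (primitives, strings, nested
// entities, arrays). Not the actual Xbuf2 message definition.
public class TestMessageSketch {
    static class NestedEntity {  // nested entity
        long id;
        String label;
    }

    long sequenceNumber;         // primitive
    double price;                // primitive
    String symbol;               // string
    NestedEntity details;        // nested entity
    List<Long> fills;            // array/collection

    public static void main(String[] args) {
        TestMessageSketch m = new TestMessageSketch();
        m.sequenceNumber = 1;
        m.symbol = "XYZ";
        System.out.println(m.symbol + "#" + m.sequenceNumber);
    }
}
```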

Code Paths Exercised

The benchmark tests the following X Platform capabilities:

Exercised Paths:

  • Message serialization/deserialization

  • Handler dispatch

  • Persistence

  • Cluster replication

  • Threading

  • Consensus protocol

Not Exercised:

  • Message logging

  • ICR (Inter-Cluster Replication)

Primary Metric: Wire-to-Wire (w2w) Latency

The w2w metric measures the time from when an inbound message is received ("post-wire") to when the corresponding outbound message is sent ("pre-wire").

What is Included

The w2w latency encompasses:

  • Inbound message deserialization (wire format to POJO)

  • Message handoff to business logic thread

  • Handler dispatch

  • Message data access by business logic

  • State persistence

  • Cluster replication to backup

  • Replication acknowledgment from backup

  • Outbound message creation

  • Outbound message serialization (POJO to wire format)
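Conceptually, w2w is the difference between two timestamps: one taken as soon as the inbound message leaves the wire, one taken just before the outbound message hits the wire. A minimal sketch, with hypothetical names (the platform records these timestamps internally):

```java
// Illustrative w2w measurement: timestamp at post-wire receive and at
// pre-wire send, then report the difference in microseconds.
public class W2wSketch {
    static long processAndMeasure(Runnable businessLogic) {
        long postWire = System.nanoTime();   // inbound message just received
        businessLogic.run();                 // decode, dispatch, replicate, encode
        long preWire = System.nanoTime();    // outbound message about to be sent
        return (preWire - postWire) / 1_000; // wire-to-wire latency in µs
    }

    public static void main(String[] args) {
        long w2wMicros = processAndMeasure(() -> { /* simulated work */ });
        System.out.println("w2w = " + w2wMicros + "µs");
    }
}
```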

Latency Percentiles

Results are reported as:

  • 50th percentile (median) - Typical latency

  • 99th percentile - Tail latency under normal conditions

  • 99.9th percentile - Worst-case latency for high-percentile SLAs
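These percentiles can be derived from a sorted sample of w2w latencies. The sketch below uses the nearest-rank method; the benchmark's exact estimator may differ.

```java
import java.util.Arrays;

// Nearest-rank percentile over a sorted latency sample (values in µs).
public class PercentileSketch {
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] latencies = {12, 14, 13, 15, 90, 13, 14, 12, 13, 14}; // µs
        Arrays.sort(latencies);
        System.out.println("p50   = " + percentile(latencies, 50));   // median
        System.out.println("p99   = " + percentile(latencies, 99));   // tail
        System.out.println("p99.9 = " + percentile(latencies, 99.9)); // far tail
    }
}
```

With a small sample like this, one outlier (the 90µs value here) dominates the 99th and 99.9th percentiles, which is why the real test runs long enough for statistical significance.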

Test Variables

The benchmark measures performance across multiple configuration dimensions:

Runtime Optimization Mode

  • Latency - XVM optimized for lowest latency

  • Throughput - XVM optimized for highest throughput

Message Population/Extraction Method

  • Indirect - Message data accessed via POJO setter/getter methods. Standard object-oriented access.

  • Direct - Message data accessed via serializer/deserializer objects. Zero-copy access, higher performance.
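The distinction can be sketched with hypothetical types: Indirect goes through a POJO's getters/setters, while Direct reads the field straight out of the encoded buffer. Neither class below is the actual X Platform serializer API.

```java
import java.nio.ByteBuffer;

// Contrast of the two population/extraction styles.
public class AccessSketch {
    // Indirect: conventional POJO access via setter/getter
    static class OrderPojo {
        private long quantity;
        long getQuantity() { return quantity; }
        void setQuantity(long q) { quantity = q; }
    }

    // Direct: field read straight from the encoded buffer (zero-copy)
    static long readQuantityDirect(ByteBuffer encoded, int offset) {
        return encoded.getLong(offset);
    }

    public static void main(String[] args) {
        OrderPojo pojo = new OrderPojo();
        pojo.setQuantity(7);                    // indirect population
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(0, pojo.getQuantity());     // serialize into the buffer
        System.out.println(readQuantityDirect(buf, 0)); // direct extraction
    }
}
```

Direct access avoids materializing intermediate objects, which is where its performance edge comes from, at the cost of working against the encoded layout rather than a plain object.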

CPU Configuration

The CPU count (# CPUs) listed for each value is the number of system CPUs actually utilized by that test configuration.

  • MinCPU - 1 CPU. Threads: business logic thread (affinitized, hot) + cluster replication reader (affinitized, not hot). Minimal CPU footprint: only the business logic thread runs "hot" (spinning); the replication reader is affinitized but consumes minimal CPU time, so total utilization is closer to 1 CPU than 2.

  • Default - 4 CPUs. Threads: X decides thread allocation. Balanced configuration: X automatically determines the optimal thread count and affinitization.

  • MaxCPU - 6 CPUs. Threads: default threads + detached sender (affinitized, hot) + detached dispatcher (affinitized, hot). Maximum parallelization, with additional hot threads for sending and dispatching.

Note on "Hot" threads: A "hot" thread runs in a tight spin loop, continuously consuming a full CPU core for maximum responsiveness. Non-hot threads are affinitized (pinned to specific cores) but block when idle, consuming minimal CPU.
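The hot-versus-non-hot distinction can be illustrated with two wait strategies. This is a generic sketch, not the X Platform's own wait-strategy implementation:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// "Hot" (busy-spin) vs non-hot (blocking) waiting, in miniature.
public class WaitStrategySketch {
    // Hot: tight spin loop, burns a full core, lowest wake-up latency.
    static void hotWait(AtomicBoolean ready) {
        while (!ready.get()) {
            Thread.onSpinWait(); // CPU hint; the thread stays fully "hot"
        }
    }

    // Non-hot: parks when idle, near-zero CPU, higher wake-up latency.
    static void coldWait(AtomicBoolean ready) {
        while (!ready.get()) {
            LockSupport.parkNanos(100_000); // sleep ~100µs between checks
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean ready = new AtomicBoolean(false);
        Thread hot = new Thread(() -> hotWait(ready));
        hot.start();
        ready.set(true); // flag flips; the spinning thread notices immediately
        hot.join();
        System.out.println("hot thread observed the flag");
    }
}
```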

Test Hardware

Servers

  • Model: Supermicro SYS-110P-WTR

  • CPU: 1 x Intel Xeon Gold 6334 (8-Core, 3.6 GHz)

  • Memory: 128GB (4 x 32GB)

  • Network: NVIDIA/Mellanox ConnectX-6 InfiniBand dual-port

  • Storage: NVME M.2 2TB

Network

  • Switch: NVIDIA Quantum InfiniBand Switch

  • Configuration: Standard TCP/IP (VMA not enabled, unoptimized)

  • Round-trip wire latency: ~23µs (unoptimized network)

Hardware Tuning

The servers are configured for low-latency operation:

Applied:

  • Dynamic power management = OFF

  • Hyperthreading = OFF

  • Linux performance profile = latency-performance

Not applied:

  • VMA (Mellanox kernel bypass) = OFF

  • RDMA (Remote Direct Memory Access) = OFF

Notes:

  • VMA is Mellanox's equivalent of Solarflare onloading

  • RDMA is not supported in X Platform 3.16

Software Configuration

  • CPU affinitization: ON (threads pinned to specific cores)

  • Test driver: Custom in-process messaging driver (zero network overhead)

Test Execution

Latency Tests

  • Message Rate: 10,000 messages/second (sustained)

  • Duration: Sufficient for statistical significance

  • Measurement: Percentile latencies (50th, 99th, 99.9th)
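A sustained 10,000 msg/s rate works out to one message every 100µs. One common way to pace a sender at that rate, sketched here with hypothetical names, is to schedule against absolute time so that slow iterations do not accumulate drift:

```java
// Sketch of pacing a sender at 10,000 msg/s (one message per 100µs).
public class PacedSenderSketch {
    static final long INTERVAL_NANOS = 100_000; // 10,000 msg/s -> 100µs apart

    static long sendPaced(int count) {
        long sent = 0;
        long next = System.nanoTime();
        for (int i = 0; i < count; i++) {
            while (System.nanoTime() < next) { /* spin until this slot */ }
            sent++;                 // the actual send would happen here
            next += INTERVAL_NANOS; // absolute schedule, no cumulative drift
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println("sent " + sendPaced(100) + " messages");
    }
}
```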

Throughput Tests

  • Message Rate: As fast as possible (saturated load)

  • Duration: Sufficient to reach steady state

  • Measurement: Messages processed per second
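The throughput figure itself is a simple ratio: messages processed over elapsed wall-clock time, sampled after the system reaches steady state. A minimal sketch:

```java
// Sketch of the throughput measurement: process as fast as possible for
// a fixed window and report messages/second.
public class ThroughputSketch {
    static double messagesPerSecond(long processed, long elapsedNanos) {
        return processed / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long processed = 0;
        while (System.nanoTime() - start < 10_000_000) { // 10ms window
            processed++; // a real test would process a message here
        }
        System.out.printf("%.0f msg/s%n",
            messagesPerSecond(processed, System.nanoTime() - start));
    }
}
```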

Interpreting Results

Latency Results

  • All latency numbers are in microseconds (µs)

  • Round-trip wire latency (~23µs on unoptimized network) is included in all results

  • Lower numbers indicate better performance

  • Tail latencies (99th, 99.9th percentile) indicate consistency

Throughput Results

  • Measured in messages per second

  • Higher numbers indicate better performance

  • Represents maximum sustained throughput under saturation

Configuration Trade-offs

  • MinCPU: Lowest resource usage, may limit throughput

  • Default: Balanced latency and throughput

  • MaxCPU: Highest parallelization, may increase coordination overhead

  • Direct access: Best performance, requires more careful coding

  • Indirect access: Easier to use, slightly lower performance
