Threading Model

Understanding Talon's threading architecture is essential to understanding how the platform achieves its extreme performance characteristics. This page explains the design principles, architectural decisions, and concepts that underpin Talon's threading model.

Overview

Talon's threading model is built on two fundamental architectural principles that work together to achieve extreme performance levels. First, write access to the microservice store is single-threaded, eliminating the costly overhead of managing concurrent access to shared data. Second, work flows through a pipeline where critical processing steps can be offloaded to dedicated threads that hand work forward without blocking, keeping the main business logic thread focused on executing handlers.

These principles address a fundamental challenge in high-performance computing: when multiple threads contend for the same piece of data or resource, scalability suffers. The cost isn't limited to locks and synchronization primitives; even with all state in main memory, multiple threads operating on the same data pay a significant penalty at the processor cache level. Talon's architecture eliminates these costs entirely for microservice state.

The Single Writer Principle

The single writer principle holds that the single biggest limitation on a system's scalability is multiple writers contending for the same item of data or resource. This principle motivates architectural patterns like the actor model and microservices, and it's fundamental to Talon's design.

Talon's microservice architecture makes all state private to each microservice, reducing write contention by partitioning data. By bringing all microservice data into memory, Talon further reduces the cost of updating data by keeping it as close to the business logic operating on it as possible. But even with all state in main memory, there are significant costs when multiple threads operate on the same piece of data—processor caches must be synchronized between cores, cache lines are invalidated, and memory has to be reloaded from higher levels of cache or main memory.
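
The cost of contended writes is easy to observe outside of Talon. The following standalone sketch (not Talon code; names and iteration counts are illustrative, and it's a rough demonstration rather than a rigorous benchmark) has two threads increment either one shared counter or two independent counters. On most multi-core machines the shared-counter run is noticeably slower because the cache line holding the counter bounces between cores on every write.

```java
import java.util.concurrent.atomic.AtomicLong;

// Rough demo: two threads incrementing a shared counter vs. private counters.
// The shared case forces the counter's cache line to ping-pong between cores.
public class WriteContentionDemo {
    static final long ITERATIONS = 50_000_000L;

    static long timeMillis(Runnable first, Runnable second) throws InterruptedException {
        Thread t1 = new Thread(first), t2 = new Thread(second);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong shared = new AtomicLong();
        AtomicLong left = new AtomicLong(), right = new AtomicLong();

        Runnable sharedWriter = () -> { for (long i = 0; i < ITERATIONS; i++) shared.incrementAndGet(); };
        long contended = timeMillis(sharedWriter, sharedWriter);

        long uncontended = timeMillis(
                () -> { for (long i = 0; i < ITERATIONS; i++) left.incrementAndGet(); },
                () -> { for (long i = 0; i < ITERATIONS; i++) right.incrementAndGet(); });

        System.out.println("two writers, one counter:  " + contended + " ms");
        System.out.println("one writer per counter:    " + uncontended + " ms");
    }
}
```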

Every Talon microservice is backed by an AEP Engine with a single input multiplexer thread—the dispatcher thread—that consumes events and messages coming in from message buses and dispatches them to handler code. This thread serves as the single writer for the microservice's state. Handler code executes on this same thread, so all state modifications happen serially without any synchronization primitives. Application developers don't need to concern themselves with locks, mutexes, atomic operations, or thread-safe data structures.
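
Because the dispatcher is the only writer, handler code can mutate microservice state directly. The sketch below is purely illustrative; the class and method names are hypothetical and not the actual Talon handler API, but it shows the shape of the model: plain fields, plain collections, and no locks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical order-tracking state and handler. Everything here runs on the
// single dispatcher thread, so ordinary (non-concurrent) data structures are
// safe and no synchronization primitives are needed.
public class OrderService {
    private final Map<String, Long> openQuantityByOrderId = new HashMap<>();
    private long totalOrdersProcessed;

    // Invoked by the dispatcher thread for each inbound new-order message.
    public void onNewOrder(String orderId, long quantity) {
        openQuantityByOrderId.put(orderId, quantity); // direct mutation, no lock
        totalOrdersProcessed++;                       // plain field, no AtomicLong
        // ... prepare an outbound acknowledgement; sending and replication are
        // handed off to detached threads by the engine.
    }
}
```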

As with most architectures, horizontal scalability can still be achieved by partitioning state across multiple microservice instances, whether in the same JVM, on the same machine, or across multiple machines. But the single writer architecture reduces the need for sharding by eliminating the hardware inefficiency of managing inter-thread contention. A single microservice instance can process far more transactions per second when it's not wasting processor and memory resources on synchronization.

Detached Threads and Pipelining

While the microservice programming model is single-threaded, it's desirable to keep the dispatcher thread busy performing application logic rather than spending cycles on infrastructural concerns. The platform provides the ability to do much of the non-functional heavy lifting—like replication, persistence, and message I/O—in background threads that are "detached" from the business logic thread.

Work that can be offloaded to detached threads includes:

  • Store replication - Sending state updates to backup instances (detached store sender)

  • Store replication dispatch - Deserializing received replication traffic (detached store dispatcher)

  • Persistence - Writing recovery logs to disk (detached persister)

  • Inter-cluster replication - Sending state across clusters (detached ICR sender)

  • Message logging - Audit logging of inbound/outbound messages (detached message loggers)

  • Bus I/O - Serializing and sending outbound messages (detached bus sender)

This creates a processing pipeline. The dispatcher thread executes a handler, which modifies state and prepares outbound messages. Instead of blocking to replicate that state or serialize and send those messages, the dispatcher immediately hands that work forward to a detached thread and begins processing the next transaction. The detached threads work in parallel, each focused on their specific task—one thread replicates to backups, another persists to disk, another sends messages to the bus. The dispatcher just keeps executing handlers.

For optimal latency and throughput, these detached threads can be affinitized to particular CPU cores, reducing the performance impact of thread context switching. Affinitization is explored further in the Thread Affinitization and NUMA section below.

Disruptors: Inter-Thread Communication

Effective pipelining between threads depends on efficient inter-thread communication. Talon uses the LMAX Disruptor, a high-performance inter-thread messaging library, to pass data between critical threads in the processing pipeline.

The Disruptor implements a ring buffer with configurable wait strategies. When the dispatcher thread has work to hand forward—say, state updates that need to be replicated—it writes those updates into a ring buffer. The detached replication thread reads from that same buffer. The ring buffer is sized as a power of two, typically 1024 entries, so that sequence numbers map to slots with a cheap bitwise mask; it is large enough to absorb spikes in traffic without blocking the offering thread, but small enough to keep active data within CPU caches.

The wait strategy determines how a thread waiting for work behaves. For ultra-low latency, threads can busy spin—continuously checking for new work without yielding to the operating system. This keeps the CPU's instruction pipeline hot and avoids context switch jitter, but it requires dedicating a full CPU core to that thread. For better CPU utilization with slightly higher latency, threads can yield to the OS or even block, allowing other threads to use that core. The platform provides controls to configure these trade-offs.
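
The snippet below, written against the open-source LMAX Disruptor library (3.x-style API), pulls these ingredients together: a 1024-slot ring buffer, a single producer playing the role of the dispatcher, a busy-spin wait strategy, and a consumer standing in for a detached replication thread. It is a conceptual sketch of the mechanism, not Talon's internal code, and the event and method names are made up.

```java
import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class ReplicationPipelineSketch {
    // Entry reused in place by the ring buffer; no per-event allocation.
    static class StateUpdateEvent {
        byte[] payload;
    }

    public static void main(String[] args) {
        Disruptor<StateUpdateEvent> disruptor = new Disruptor<>(
                StateUpdateEvent::new,
                1024,                          // power-of-two ring size
                DaemonThreadFactory.INSTANCE,
                ProducerType.SINGLE,           // single writer: the dispatcher thread
                new BusySpinWaitStrategy());   // lowest latency; dedicates a core

        // The "detached replication thread": consumes entries as they are published.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                sendToBackup(event.payload));

        RingBuffer<StateUpdateEvent> ring = disruptor.start();

        // On the dispatcher thread: claim a slot, fill it, publish, move on.
        long seq = ring.next();
        try {
            ring.get(seq).payload = new byte[] {1, 2, 3};
        } finally {
            ring.publish(seq);
        }
    }

    static void sendToBackup(byte[] payload) {
        // placeholder for replication I/O
    }
}
```

Swapping in the library's YieldingWaitStrategy or BlockingWaitStrategy is the code-level analogue of the CPU-versus-latency trade-off described above.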

Thread Affinitization and NUMA

Modern server hardware adds another layer of complexity that Talon's threading model addresses. Contemporary CPUs typically have multiple cores—often 10 to 20 physical cores per socket. Servers often have multiple CPU sockets. And many CPUs support hyper-threading, where each physical core appears as two logical CPUs to the operating system.

Multiple sockets give the machine a NUMA (Non-Uniform Memory Access) architecture. Each CPU socket has its own bank of RAM, called a NUMA node. A thread running on socket 0 can access memory on socket 0's NUMA node quickly, but accessing memory on socket 1's node requires crossing the inter-socket link, which is significantly slower. For an in-memory computing platform like Talon, where memory is the primary storage mechanism, this remote-access penalty directly compounds the von Neumann bottleneck and matters enormously.

Thread affinitization, or pinning threads to specific CPU cores, addresses these challenges. When you pin the dispatcher thread to a specific core, pin all the detached threads to cores on the same socket, and ensure the process's memory is allocated on that socket's NUMA node, you achieve several benefits (a sketch of the underlying mechanism follows the list below):

  • Cache locality: The dispatcher and its detached threads share the same L3 cache, minimizing the time to pass work between them

  • Memory locality: All threads access memory on the local NUMA node, minimizing memory access latency

  • Scheduling stability: The OS won't migrate pinned threads to other cores, so processor caches stay hot
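
Talon exposes affinitization through configuration rather than application code, but the mechanism is easy to see with the open-source OpenHFT Java-Thread-Affinity library. The standalone sketch below (the CPU number is an arbitrary example) pins the current thread to a specific core for the duration of a hot loop.

```java
import net.openhft.affinity.AffinityLock;

public class PinnedDispatcherSketch {
    public static void main(String[] args) {
        // Pin the current thread to CPU 2 (an arbitrary example core).
        // Choosing cores on the same socket for all critical threads keeps
        // them behind the same L3 cache and local NUMA node.
        AffinityLock lock = AffinityLock.acquireLock(2);
        try {
            runHotLoop(); // e.g. a busy-spinning dispatch or consumer loop
        } finally {
            lock.release(); // free the core when the work is done
        }
    }

    static void runHotLoop() {
        // placeholder for latency-critical work
    }
}
```

Memory affinity is typically handled at process launch rather than in code, for example by starting the JVM under numactl so that heap allocations come from the local NUMA node.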

Hyper-threading presents a trade-off. With hyper-threading enabled, two logical CPUs share the same physical core's caches, so each logical CPU has less cache space to itself and effective memory access times rise. For Talon applications optimized for latency, it's generally best to disable hyper-threading when possible to maximize the cache available to each thread.

Goals of Affinitization

With the above concepts in mind, optimal performance for a Talon microservice is achieved when:

  • All critical threads are affinitized to the same processor socket, sharing the same L3 cache

  • Process memory is affinitized to that same socket's NUMA node, avoiding remote NUMA access

  • Critical threads are pinned to their own CPU cores and set to busy spin, avoiding context switches

  • Hyper-threading is disabled, preventing threads from being scheduled onto the same physical core as a busy-spinning thread

This level of tuning isn't necessary for all deployments—the benefits are most pronounced in ultra-low-latency applications where every microsecond matters. But understanding these concepts helps explain why Talon can achieve such extreme performance characteristics when properly configured.

Configuration

Threading behavior is configured through DDL and system properties. See Configuring Threading for detailed configuration guidance including:

  • Disruptor configuration (queue depth, wait strategies)

  • Thread affinitization (basic and advanced approaches)

  • NUMA topology optimization

  • Per-thread affinity masks

Programming Implications

The single-threaded model profoundly simplifies application code. Handler code runs on the dispatcher thread, and since that's the only thread that modifies state, developers don't need thread synchronization. There's no need for locks, mutexes, volatile fields, or concurrent collections. State can be accessed directly without any thread-safety concerns.

This does impose a discipline: handler code must be deterministic and non-blocking. Blocking the dispatcher thread blocks the entire microservice, because no other handlers can execute until the current one completes. That means avoiding blocking I/O, long-running computations, and synchronous calls to external services. The programming model trades the complexity of concurrent programming for the discipline of writing fast, focused handlers, as the sketch below illustrates.
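
The comparison below uses hypothetical names, not the Talon API, and stands in for any handler that needs data from an external system: the blocking version stalls the dispatcher, while the non-blocking version does its in-memory work, emits a request message, and lets the reply arrive later as a separate event.

```java
// Hypothetical sketch: the same business step written two ways.
public class RiskCheckHandlerSketch {

    // Anti-pattern: a synchronous call blocks the dispatcher thread, so no
    // other handler in this microservice can run until the call returns.
    void onEnrichOrderBlocking(Order order) throws Exception {
        String rating = callRemoteRiskService(order); // blocking network I/O: avoid
        order.riskRating = rating;
    }

    // Preferred shape: do the fast in-memory work, emit a request message,
    // and return. The reply arrives later as a separate event on the
    // dispatcher thread, where a second handler completes the workflow.
    void onEnrichOrder(Order order) {
        order.status = "PENDING_RISK_CHECK";            // fast, in-memory update
        sendMessage(new RiskCheckRequest(order.id));    // non-blocking hand-off
    }

    // --- illustrative stubs ---
    static class Order { String id; String status; String riskRating; }
    static class RiskCheckRequest {
        final String orderId;
        RiskCheckRequest(String orderId) { this.orderId = orderId; }
    }
    String callRemoteRiskService(Order order) throws Exception { return "LOW"; }
    void sendMessage(Object message) { /* hand off to the engine's outbound path */ }
}
```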

See Programming Fundamentals for detailed coding guidelines.

Next Steps

  1. Review Configuring Threading to understand configuration options

  2. Understand programming implications in Programming Fundamentals

  3. Learn about the Runtime Architecture to see how threads interact with other components
