Operating Model

This page describes Talon's operating model—the architecture and mechanisms for administering, monitoring, and troubleshooting running applications.

Overview

Operating Talon applications in production requires three complementary capabilities working in concert. Administration provides the control plane for managing running microservices, allowing operators to issue commands, query state, and manage lifecycle operations. Monitoring observes runtime behavior by collecting and delivering telemetry that reveals how the system is performing. Analysis and troubleshooting capabilities diagnose issues by examining logs and historical data when problems occur.

These three capabilities are deeply integrated through a common infrastructure. Discovery enables operational tools to locate running XVMs. Messaging channels carry both administrative commands and monitoring data. Transaction logs capture detailed history for post-mortem analysis. Together, they form a cohesive operating environment that scales from development through production.

Administration

Administrative Architecture

Talon recognizes that different operational contexts demand different administrative approaches, and provides two distinct administration modes to accommodate these varying needs.

In direct administration mode, XVMs advertise an admin acceptor through the discovery system. Admin clients discover these XVMs and establish direct TCP connections to them. This approach is particularly well-suited for development and diagnostic scenarios where low latency and direct access are valuable. The direct connection provides immediate, responsive interaction with individual XVMs without requiring additional infrastructure.

Admin over SMA takes a messaging-based approach instead: XVMs emit administrative messages over configured messaging channels, and admin clients subscribe to these channels through the message bus. This mode excels in production environments where you need to manage many distributed XVMs. It supports passive monitoring—observers can watch administrative traffic without requiring direct connections to XVMs. This architecture scales naturally as you add more XVMs, since all communication flows through the existing messaging infrastructure.

Command and Control

Administration in Talon follows a command-and-control pattern where operators issue commands to running XVMs and receive responses. The system provides built-in XVM commands for common lifecycle and diagnostic operations like thread dumps, statistics queries, and configuration changes. Microservice developers can extend this by defining application-specific commands using @AppCommandHandler annotations, allowing custom operational procedures to be exposed through the same administrative interface.
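
To make this concrete, here is a minimal sketch of an application-defined command. The @AppCommandHandler annotation is the extension point described above, but the import path, annotation attribute, and handler signature shown here are illustrative assumptions rather than the platform's exact contract:

```java
// Hypothetical example: the import path, the "command" attribute, and the
// handler signature are assumptions, not the platform's documented contract.
import com.neeve.server.app.annotations.AppCommandHandler;

public class OrderServiceApp {

    @AppCommandHandler(command = "flushOrderCache")
    public String onFlushOrderCache(String[] args) {
        int flushed = flushCache(); // application-specific work
        return "Flushed " + flushed + " cached orders";
    }

    private int flushCache() {
        return 0; // placeholder for real cache logic
    }
}
```

An operator can then invoke flushOrderCache through the same administrative interface as the built-in XVM commands, with no extra tooling on the application side.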

Commands flow through request channels to their target XVMs, which process them and return responses via response channels. Discovery plays a crucial role here—commands target specific XVMs by name, and the discovery provider resolves these names to actual XVM locations. This decoupling means administrative tools don't need to know where XVMs are running; they simply address commands to the appropriate name and let discovery handle the routing.

Design Rationale

The dual administration modes reflect different operational priorities. TCP mode provides low-overhead access ideal for interactive debugging and development, where latency matters and you're working with a small number of local XVMs. SMA mode scales to production deployments with hundreds of XVMs distributed across data centers, where establishing direct TCP connections would be impractical and where you want multiple monitoring tools observing the same administrative traffic.

Discovery decouples administrative clients from XVM locations, allowing XVMs to move, restart, or scale without reconfiguring administrative tools. The channel-based model in SMA mode means adding more monitoring tools or XVMs doesn't require connection management—everything flows through the established messaging infrastructure.

Monitoring

Monitoring Architecture

Monitoring in Talon is built around a simple but powerful concept: each XVM runs a background stats collection thread that periodically gathers metrics and emits them as heartbeats. This thread wakes up at regular intervals—typically every one to five seconds, though this is configurable—and walks through a hierarchy of statistics collectors. It starts at the XVM level, collecting JVM and system metrics, then descends into each engine to gather message processing statistics, and finally reaches into the application layer for custom metrics. As it collects, it also performs higher-level computations like calculating rates, averages, and percentiles from raw counters.

The result of this collection cycle is a heartbeat—a periodic message that serves two purposes. First, it's a "proof of life" signal showing that the XVM is running and healthy. Second, it carries a snapshot of all collected statistics, timestamped and tagged with the XVM's identity. These heartbeats form the foundation of Talon's monitoring capability.
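
The collection loop itself is straightforward. The sketch below is a generic illustration of the pattern rather than Talon's internal code: a single scheduled thread walks a list of collectors in hierarchy order and assembles the results into a timestamped snapshot.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Generic illustration of the collection loop described above; not Talon's
// internal code, just the shape of the technique.
interface StatsCollector {
    void collectInto(Map<String, Object> snapshot);
}

public final class HeartbeatEmitter {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final List<StatsCollector> collectors; // XVM, then engines, then app
    private final String xvmName;

    HeartbeatEmitter(String xvmName, List<StatsCollector> collectors) {
        this.xvmName = xvmName;
        this.collectors = collectors;
    }

    void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(this::emitHeartbeat,
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void emitHeartbeat() {
        Map<String, Object> snapshot = new HashMap<>();
        snapshot.put("xvm", xvmName);              // identity tag
        snapshot.put("timestamp", Instant.now());  // proof-of-life timestamp
        for (StatsCollector c : collectors) {
            c.collectInto(snapshot); // walk the hierarchy, top down
        }
        deliver(snapshot);
    }

    private void deliver(Map<String, Object> heartbeat) {
        System.out.println(heartbeat); // stand-in for the real delivery paths
    }
}
```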

Heartbeats can be delivered through three different mechanisms, each suited to different operational contexts. Trace output writes human-readable statistics to log files, which is convenient during development and diagnostics when you want to quickly scan metrics in a text editor. However, this approach creates garbage and isn't suitable for latency-sensitive production environments.

Binary transaction logs provide a zero-garbage alternative. Statistics are written in compact binary format to sequential log files, where they can be stored efficiently and analyzed offline using dedicated tools. This approach works well in production when you need to maintain ultra-low latency—writing to binary logs adds minimal overhead, and you can perform analysis later without impacting the running system.

The third mechanism, SMA channels, emits heartbeats over the messaging infrastructure where they can be consumed by remote monitoring applications. This enables real-time dashboards and alerting, and it scales naturally to distributed deployments. Monitoring tools simply subscribe to the heartbeat channels and receive statistics as they're emitted, without needing direct connections to XVMs.
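
From the consumer side, the pattern looks like a plain subscription. The sketch below assumes a hypothetical MessageBus interface, defined inline so the example is self-contained; Talon's actual SMA subscription API will differ.

```java
import java.util.function.Consumer;

// Hypothetical bus interface, defined here for illustration only; Talon's
// actual SMA subscription API differs.
interface MessageBus {
    void subscribe(String channel, Consumer<byte[]> handler);
}

public final class HeartbeatDashboardFeed {
    static void attach(MessageBus bus) {
        // "xvm-heartbeat" is the heartbeat channel named later on this page.
        bus.subscribe("xvm-heartbeat", payload -> {
            // decode the heartbeat snapshot and refresh dashboard widgets
            System.out.println("heartbeat received: " + payload.length + " bytes");
        });
    }
}
```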

Statistics Hierarchy

The statistics collected at each heartbeat cycle form a natural hierarchy that mirrors Talon's architecture. At the top level, XVM statistics capture JVM health metrics like heap usage, garbage collection activity, and thread counts, along with system-level metrics such as CPU usage and load average. These process health indicators tell you whether the XVM itself is healthy.

Descending into the engine level, you find metrics that reveal how messages are being processed. Message throughput and rates show how much work is flowing through the system. Transaction counts and latencies indicate how long it takes to process work. Queue depths and backlog metrics reveal whether the system is keeping up with load. Consensus performance statistics show how well the cluster is coordinating. Store replication and persistence metrics track data durability operations.

At the application level, microservice developers define custom statistics using @AppStat annotations. These application-specific metrics might track business operations—orders processed, trades executed, inventories updated—or operational characteristics like cache hit rates or queue sizes. The ability to define custom statistics lets developers expose exactly the metrics that matter for their specific application.
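
As a sketch, an application-level statistic might be declared like this. The @AppStat annotation is the mechanism described above; the factory and counter types are assumptions about the supporting API, and the metric names are invented for illustration.

```java
// The factory and counter types below are assumed supporting APIs; the
// metric names are invented for this example.
import com.neeve.server.app.annotations.AppStat;
import com.neeve.stats.IStats.Counter;
import com.neeve.stats.StatsFactory;

public class OrderProcessingApp {
    // Picked up by the heartbeat collection cycle alongside platform stats.
    @AppStat
    private final Counter ordersProcessed =
            StatsFactory.createCounterStat("Orders Processed");

    void onOrder() {
        ordersProcessed.increment(); // count a business operation
    }
}
```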

Latency Collection

Latency statistics deserve special attention because they're more expensive to collect than simple counters or gauges. Tracking latencies requires capturing timestamps at multiple points in message processing pipelines and computing differences, which adds overhead. Talon provides granular control over which latencies to collect, allowing operators to balance visibility against performance impact.

Message latencies track individual messages through their lifecycle—ingestion from the bus, queuing in the engine, processing by handlers, and transmission back out. Transaction latencies break down the transaction pipeline into stages, measuring time spent in handler code, persisting to the store, and committing outbound sends. Store latencies focus specifically on persistence and replication timing. For deeper investigation, you can enable per-message-type latencies, which provide detailed breakdowns for each message type, though this carries higher overhead. The most detailed option, per-transaction stats, captures a complete trace of every transaction, but this level of visibility comes with very high overhead suitable only for diagnostic scenarios.
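
Conceptually, each latency statistic is a pair of timestamps captured at stage boundaries, with percentiles computed from the recorded differences. The sketch below shows that generic technique, not Talon's actual collectors.

```java
import java.util.Arrays;

// Generic sketch of pipeline latency capture: timestamp each stage boundary,
// record the differences, and summarize with percentiles.
public final class StageLatencyRecorder {
    private final long[] samplesMicros;
    private int count;

    StageLatencyRecorder(int capacity) {
        samplesMicros = new long[capacity];
    }

    // Called with System.nanoTime() captured at stage entry and exit.
    void record(long entryNanos, long exitNanos) {
        if (count < samplesMicros.length) {
            samplesMicros[count++] = (exitNanos - entryNanos) / 1_000;
        }
    }

    long percentileMicros(double pct) {
        if (count == 0) return 0;
        long[] sorted = Arrays.copyOf(samplesMicros, count);
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * count) - 1;
        return sorted[Math.max(0, Math.min(idx, count - 1))];
    }
}
```

A handler-stage measurement would then wrap the handler call: capture System.nanoTime() before and after the invocation and pass both to record(). The overhead discussed above comes from doing exactly this at many points in the pipeline.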

Operators configure these latency collection options based on their needs. Development environments might enable everything to understand system behavior. Production environments typically enable basic latencies and selectively enable detailed collection when investigating specific issues.

Design Rationale

Periodic collection strikes a balance between overhead and visibility—you get regular snapshots of system behavior without the cost of continuous instrumentation. The multiple delivery mechanisms reflect real operational needs: trace output for quick debugging, binary logs for production monitoring without garbage, and SMA channels for real-time dashboards. The hierarchical organization of statistics provides appropriate granularity at each level—you don't need application details to diagnose a JVM problem, but you do need them to understand business logic issues. Making expensive statistics opt-in lets operators tune the overhead/visibility trade-off for their specific situation.

Analysis & Troubleshooting

Trace Logging

When monitoring shows that something is wrong, trace logging helps you understand what's happening inside your microservices. Talon's trace logging system provides configurable runtime diagnostics through a three-layer architecture. At the bottom, tracers are objects embedded throughout Talon's code that can emit diagnostic messages. These tracers don't write directly to output; instead, they send their messages to loggers—named entities organized in a hierarchical namespace like nv.aep, nv.ods, and nv.sma. Each logger can be configured with its own trace level, allowing fine-grained control over what gets logged. Finally, handlers receive trace output from loggers and route it to destinations like stdout, stderr, files, network sockets, or memory buffers.

This architecture provides tremendous flexibility. Handlers can be daisy-chained to route trace to multiple destinations simultaneously. You can dynamically adjust trace levels at runtime without restarting microservices. The system integrates with both Talon's native logging and standard frameworks like SLF4J, letting you work with familiar tools.

One particularly clever feature is the memory handler, which buffers trace output in memory but only writes it out when triggered by a severe error. This captures the context leading up to problems—you get to see what was happening in the moments before a failure occurred—while minimizing trace overhead during normal operation. It's like having a flight recorder that's always running but only preserves data when something goes wrong.
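
The JDK's own logging API offers an analogous construct, which makes the pattern easy to demonstrate. The example below uses java.util.logging's MemoryHandler as a stand-in; Talon's native memory handler is configured through its own trace system, not this API.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.MemoryHandler;

// Flight-recorder pattern using the JDK's logging API as an analogue to the
// memory handler described above.
public final class FlightRecorderLogging {
    public static void main(String[] args) {
        Logger logger = Logger.getLogger("nv.aep"); // hierarchical logger name
        logger.setLevel(Level.FINE);
        logger.setUseParentHandlers(false);

        ConsoleHandler target = new ConsoleHandler();
        target.setLevel(Level.ALL);

        // Buffer the last 1000 records in memory; flush them to the target
        // only when a SEVERE record arrives.
        MemoryHandler recorder = new MemoryHandler(target, 1000, Level.SEVERE);
        recorder.setLevel(Level.ALL);
        logger.addHandler(recorder);

        logger.fine("routine detail, buffered silently");
        logger.severe("failure! the buffered context above is flushed now");
    }
}
```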

Transaction Logs

While trace logging shows you what code is executing, transaction logs provide a complete record of the data flowing through your system. These logs are the foundation for deep analysis and troubleshooting.

Talon uses several types of transaction logs, each serving a specific purpose. The recovery transaction log is fundamental to Talon's fault tolerance. In Event Sourcing mode, it records inbound messages so they can be replayed after a failure. In State Replication mode, it captures state tree updates—the PUTs, UPDATEs, and REMOVEs that modify your microservice's state. When a microservice recovers from a failure, it reads this log to restore itself to the correct state.

Beyond recovery, you can enable additional logs for audit and diagnostic purposes. An inbound message log records every message your microservice receives, regardless of whether it's needed for recovery. This creates a complete audit trail of incoming requests. Similarly, an outbound message log captures every message your microservice sends, enabling you to trace message paths through complex multi-hop flows. For the deepest level of diagnostics, per-transaction stats logs record detailed statistics for every transaction—complete latency breakdowns and all metrics—though this comes with high overhead and is typically reserved for diagnostic sessions.

All these logs use a compact binary format that enables zero-garbage writes, which is crucial for maintaining low latency in production. The format supports fast sequential access and indexing, making offline analysis efficient even with large log files.
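
The essence of a zero-garbage write path is a preallocated buffer reused for every entry, so steady-state appends allocate nothing. The sketch below illustrates the idea with plain NIO; the entry layout is invented for illustration and is not Talon's actual on-disk format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Generic zero-garbage append: one direct buffer, reused for every entry.
// The (timestamp, type, length, payload) layout is illustrative only, and
// entries are assumed to fit within the 64 KiB buffer.
public final class BinaryLogWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

    public BinaryLogWriter(Path path) throws IOException {
        channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public void append(long timestampNanos, short type, ByteBuffer payload)
            throws IOException {
        buffer.clear();
        buffer.putLong(timestampNanos);
        buffer.putShort(type);
        buffer.putInt(payload.remaining());
        buffer.put(payload);
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer); // sequential append, no per-entry allocation
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```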

Query and Analysis Tools

Transaction logs would have limited value if you could only read them sequentially, but Talon provides powerful tools that turn these logs into queryable databases. The Transaction Log Tool provides both interactive browsing and XPQL—a SQL-like query language specifically designed for transaction logs.

The XPQL query engine maps transaction log entries to a relational table model. Each log becomes a table, each entry becomes a row, and you get columns for entry metadata plus typed columns for every message and entity type in your schema. You can build indexes on any field, enabling efficient queries even in logs containing millions of entries. This turns transaction logs into a powerful analytical database—you can run complex queries to find patterns, track specific messages through your system, or analyze behavior over time.

For statistics analysis, the Stats Dump Tool reads binary heartbeat logs and outputs human-readable statistics. This enables offline analysis without any impact on running systems. You can apply date range filters to focus on specific time periods, making it easy to investigate historical incidents or analyze performance trends.

Design Rationale

The binary log format enables rich post-mortem analysis without impacting production performance—writes are zero-garbage and fast, while reads happen offline. Multiple log types serve different purposes, so you can enable exactly the logging you need without paying for capabilities you don't use. The query capability transforms logs from sequential records into a queryable database, dramatically expanding what you can learn from them. Having all analysis tools work offline means you can investigate issues as deeply as needed without affecting running systems.

Integration: How It All Works Together

Discovery as the Foundation

Discovery is the glue that binds all of Talon's operational capabilities together. XVMs advertise themselves through the configured discovery provider, making their presence known to the operational infrastructure. Both admin clients and monitoring tools use discovery to find XVMs—they don't need hardcoded addresses or complex configuration. The choice of discovery provider (multicast, UDP, SMA-based) determines your operational topology, and changing discovery providers changes how your operational tools connect without requiring changes to the tools themselves.

Admin Channels Carry Monitoring Data

When you enable Admin over SMA, something elegant happens: the same messaging infrastructure that carries administrative commands also carries monitoring data. The xvm-heartbeat channel delivers statistics for monitoring. The xvm-trace channel carries trace output. The xvm-event channel broadcasts lifecycle events and alerts. The xvm-request and xvm-response channels handle administrative commands. This unified approach means you're not building and maintaining separate networks for administration versus monitoring—one messaging infrastructure serves both purposes.

Transaction Logs Enable Analysis

Real-time monitoring through heartbeats and SMA channels shows you what's happening right now, but transaction logs capture history. While your monitoring dashboard displays current throughput and latency, transaction logs are quietly recording every message and transaction to disk. Later, when you need to understand what happened during an incident, offline query tools let you analyze past behavior without any impact on the running system. You can treat these logs like a database, running complex queries to understand what occurred.

The recovery transaction log serves double duty in this model. Its primary purpose is enabling failure recovery—when a microservice crashes and restarts, it reads this log to restore state. But this same log also enables replay for testing, message path analysis across microservice boundaries, and tracking how state evolved over time. What you write for fault tolerance also becomes a powerful analytical resource.

Stats Flow Example

Consider the lifecycle of a statistics heartbeat. Every five seconds, the XVM's stats collection thread wakes up and gathers metrics from all collectors. It assembles these into a heartbeat message and then delivers it through every enabled path simultaneously. The heartbeat gets traced to a log file in human-readable format, written to a binary heartbeat log in compact format, and emitted over the SMA heartbeat channel. A monitoring tool subscribed to that channel receives it immediately and updates its real-time dashboard. Days later, when someone needs to analyze that period, the Stats Dump Tool reads the binary log and produces detailed reports—all without touching the running system.

Operational Trade-offs

Performance vs Visibility

Every operational capability comes with a performance cost, and Talon's design gives you the controls to balance visibility against overhead. The conservative default configuration collects minimal statistics with low overhead—enough to know your microservices are healthy without impacting latency. In development, you'll typically enable detailed trace logging and per-transaction stats because understanding behavior matters more than performance. Production deployments usually balance binary logging for post-mortem analysis with selective SMA emission of critical metrics for real-time dashboards. When troubleshooting production issues, you can temporarily enable detailed statistics for specific microservices or message types, gathering the visibility you need without permanently increasing overhead across the entire system.

Real-time vs Historical

Real-time monitoring through SMA channels shows you what's happening right now. Your dashboards update continuously, alerts fire immediately when thresholds are breached, and operators have instant visibility into system behavior. This comes at the cost of requiring messaging infrastructure and continuous processing by monitoring applications.

Historical analysis through binary logs takes a different approach. Transaction logs and heartbeat logs capture everything to disk with minimal overhead, then offline analysis tools read those logs later without any impact on running systems. This requires log retention and storage but gives you the ability to perform arbitrarily expensive analysis after the fact.

Many deployments use a hybrid approach: log everything to binary logs for complete historical record, analyze offline when investigating issues, but emit critical metrics over SMA channels for real-time dashboards. This gives you both immediate visibility into key metrics and the ability to deeply analyze historical data when needed.

Direct vs Messaging-based Admin

The choice between direct TCP administration and Admin over SMA reflects different operational contexts. Direct TCP provides lower latency and simpler setup—admin tools connect directly to XVMs and get immediate responses. This works well during development or when troubleshooting specific services, but it requires network access to each XVM and doesn't scale well when managing hundreds of distributed services.

Admin over SMA scales naturally to large deployments. Adding more XVMs or monitoring tools doesn't require connection management since everything flows through the messaging infrastructure. It supports passive monitoring where observers can watch administrative traffic without active connections. However, it requires the messaging infrastructure to be running and adds the latency of routing through message buses.

Next Steps

  1. Understand Discovery Model for operational infrastructure

  2. Review Administration options

  3. Configure Monitoring appropriate for your environment

  4. Set up Logging for troubleshooting
