OpenTelemetry at Scale: Controlling your Fleet

As you scale your usage of OpenTelemetry, you'll find yourself wanting (or needing) a standardised way to manage all of the entities involved (e.g. containers emitting telemetry, OpenTelemetry collectors, etc.). For the remainder of this post I shall collectively call these "agents". Moreover, I'd be willing to bet that OpenTelemetry-based "agents" aren't the only things you'd like to control. What about security software, log collection agents or any other software that runs persistently and could benefit from some element of remote management?
In the case of an agent being an OpenTelemetry collector, a server could push an update to the collector configuration and request a restart. A server could also push an entirely different version of the collector to securely upgrade or downgrade it.
Agents also emit their own telemetry which servers can visualise in context.
An agent in this context is anything involved in the gathering or processing of OpenTelemetry data.
What do you mean by "Control"?
- Status reports from agents
- Remotely updating agent configuration
- Securely upgrading / downgrading agents
- Restarting agents
- Receiving agent heartbeats
- Connection credential management (e.g. rotation or revocation of TLS certificates)
Open Agent Management Protocol
This is where OpAMP comes to the rescue. OpAMP is an open, vendor-neutral protocol for managing fleets of agents.
How OpAMP works
OpAMP is a protocol describing how agents (i.e. clients) communicate with server(s), and vice versa.
This means you can connect n agents to a server and manage those n agents from a central location.
OpAMP works over WebSockets or plain HTTP requests, normally secured with TLS. Clients and servers communicate via binary-serialised protobuf messages. Clients send AgentToServer messages and servers send back (no surprises here) ServerToAgent messages.
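To make the request / response flow a little more concrete, here is a minimal in-memory sketch in Python. The two dataclasses are simplified stand-ins that carry only a handful of fields (instance_uid, sequence_num and capabilities); they are not the real generated protobuf classes, and ToyServer is a hypothetical name invented purely for illustration. No network transport or serialisation is involved.

```python
from dataclasses import dataclass


@dataclass
class AgentToServer:
    """Simplified stand-in for the message an agent sends (assumption: subset of fields)."""
    instance_uid: bytes    # identifies this agent instance
    sequence_num: int      # incremented by the agent for every message it sends
    capabilities: int = 0  # bitmask of agent capabilities


@dataclass
class ServerToAgent:
    """Simplified stand-in for the server's reply (assumption: subset of fields)."""
    instance_uid: bytes    # echoes the agent the reply is addressed to
    capabilities: int = 0  # bitmask of server capabilities


class ToyServer:
    """Hypothetical server that only remembers the last message seen per agent."""

    def __init__(self) -> None:
        self.last_seen: dict[bytes, int] = {}

    def handle(self, msg: AgentToServer) -> ServerToAgent:
        # Record the latest sequence number, then answer the same agent.
        self.last_seen[msg.instance_uid] = msg.sequence_num
        return ServerToAgent(instance_uid=msg.instance_uid, capabilities=0x1)


server = ToyServer()
agent_uid = b"agent-0001"

# The agent sends three status reports; the server replies to each one.
for seq in range(1, 4):
    reply = server.handle(AgentToServer(agent_uid, seq, capabilities=0x1))

print(server.last_seen)
```

In a real deployment each of these objects would be a protobuf message sent over the wire, but the shape of the conversation is the same: the agent speaks first and the server answers.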
Capabilities
Both agents and servers can declare their own capabilities.
Here are the current capabilities that an agent can advertise itself as supporting:
enum AgentCapabilities {
    // The capabilities field is unspecified.
    UnspecifiedAgentCapability = 0;
    // The Agent can report status. This bit MUST be set, since all Agents MUST
    // report status.
    ReportsStatus = 0x00000001;
    // The Agent can accept remote configuration from the Server.
    AcceptsRemoteConfig = 0x00000002;
    // The Agent will report EffectiveConfig in AgentToServer.
    ReportsEffectiveConfig = 0x00000004;
    // The Agent can accept package offers.
    // Status: [Beta]
    AcceptsPackages = 0x00000008;
    // The Agent can report package status.
    // Status: [Beta]
    ReportsPackageStatuses = 0x00000010;
    // The Agent can report own traces to the destination specified by
    // the Server via ConnectionSettingsOffers.own_traces field.
    // Status: [Beta]
    ReportsOwnTraces = 0x00000020;
    // The Agent can report own metrics to the destination specified by
    // the Server via ConnectionSettingsOffers.own_metrics field.
    // Status: [Beta]
    ReportsOwnMetrics = 0x00000040;
    // The Agent can report own logs to the destination specified by
    // the Server via ConnectionSettingsOffers.own_logs field.
    // Status: [Beta]
    ReportsOwnLogs = 0x00000080;
    // The Agent can accept connection settings for OpAMP via
    // ConnectionSettingsOffers.opamp field.
    // Status: [Beta]
    AcceptsOpAMPConnectionSettings = 0x00000100;
    // The Agent can accept connection settings for other destinations via
    // ConnectionSettingsOffers.other_connections field.
    // Status: [Beta]
    AcceptsOtherConnectionSettings = 0x00000200;
    // The Agent can accept restart requests.
    // Status: [Beta]
    AcceptsRestartCommand = 0x00000400;
    // The Agent will report Health via AgentToServer.health field.
    ReportsHealth = 0x00000800;
    // The Agent will report RemoteConfig status via AgentToServer.remote_config_status field.
    ReportsRemoteConfig = 0x00001000;
    // The Agent can report heartbeats.
    // This is specified by the ServerToAgent.OpAMPConnectionSettings.heartbeat_interval_seconds field.
    // If this capability is true, but the Server does not set a heartbeat_interval_seconds field, the
    // Agent should use its own configured interval, which by default will be 30s. The Server may not
    // know the configured interval and should not make assumptions about it.
    // Status: [Development]
    ReportsHeartbeat = 0x00002000;
}
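Because every capability is a distinct bit, an agent advertises what it supports by OR-ing the relevant values into a single integer. The sketch below mirrors a handful of the values above in a Python IntFlag. The supports helper and the collector_caps example are illustrative assumptions, not part of any OpAMP library.

```python
from enum import IntFlag


class AgentCapabilities(IntFlag):
    """A subset of the enum above, mirrored by hand for illustration."""
    ReportsStatus = 0x00000001
    AcceptsRemoteConfig = 0x00000002
    ReportsEffectiveConfig = 0x00000004
    AcceptsRestartCommand = 0x00000400
    ReportsHealth = 0x00000800


def supports(mask: int, capability: AgentCapabilities) -> bool:
    """Return True if the capability bit is set in the bitmask."""
    return (mask & capability) == capability


# Hypothetical collector that reports status and health and accepts remote config.
collector_caps = (
    AgentCapabilities.ReportsStatus
    | AgentCapabilities.AcceptsRemoteConfig
    | AgentCapabilities.ReportsHealth
)

print(hex(collector_caps))  # 0x803
print(supports(collector_caps, AgentCapabilities.AcceptsRemoteConfig))    # True
print(supports(collector_caps, AgentCapabilities.AcceptsRestartCommand))  # False
```

A server receiving the value 0x803 can therefore tell that this agent will accept a new configuration but should not be sent a restart command.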
Similarly, the server can advertise its own capabilities:
enum ServerCapabilities {
    // The capabilities field is unspecified.
    UnspecifiedServerCapability = 0;
    // The Server can accept status reports. This bit MUST be set, since all Servers
    // MUST be able to accept status reports.
    AcceptsStatus = 0x00000001;
    // The Server can offer remote configuration to the Agent.
    OffersRemoteConfig = 0x00000002;
    // The Server can accept EffectiveConfig in AgentToServer.
    AcceptsEffectiveConfig = 0x00000004;
    // The Server can offer Packages.
    OffersPackages = 0x00000008;
    // The Server can accept Packages status.
    // Status: [Beta]
    AcceptsPackagesStatus = 0x00000010;
    // The Server can offer connection settings.
    // Status: [Beta]
    OffersConnectionSettings = 0x00000020;
    // The Server can accept ConnectionSettingsRequest and respond with an offer.
    // Status: [Development]
    AcceptsConnectionSettingsRequest = 0x00000040;
}
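One way to read the two enums together is that a feature is only usable when both sides advertise their half of it: remote configuration, for example, needs an agent that accepts it and a server that offers it. The Python sketch below works under that assumption. The pairing table and the usable_features function are my own illustration rather than something defined by the protocol, and the constant values are copied from the enums above.

```python
# Agent-side bits (from AgentCapabilities).
AGENT_ACCEPTS_REMOTE_CONFIG = 0x00000002
AGENT_REPORTS_EFFECTIVE_CONFIG = 0x00000004
AGENT_ACCEPTS_PACKAGES = 0x00000008

# Server-side bits (from ServerCapabilities). The numbers happen to match the
# agent-side ones here, but they belong to a separate enum.
SERVER_OFFERS_REMOTE_CONFIG = 0x00000002
SERVER_ACCEPTS_EFFECTIVE_CONFIG = 0x00000004
SERVER_OFFERS_PACKAGES = 0x00000008

# Assumed pairing of (agent bit, server bit) per feature, for illustration only.
FEATURE_PAIRS = {
    "remote_config": (AGENT_ACCEPTS_REMOTE_CONFIG, SERVER_OFFERS_REMOTE_CONFIG),
    "effective_config": (AGENT_REPORTS_EFFECTIVE_CONFIG, SERVER_ACCEPTS_EFFECTIVE_CONFIG),
    "packages": (AGENT_ACCEPTS_PACKAGES, SERVER_OFFERS_PACKAGES),
}


def usable_features(agent_caps: int, server_caps: int) -> list[str]:
    """List the features for which both the agent and the server set their bit."""
    return sorted(
        name
        for name, (agent_bit, server_bit) in FEATURE_PAIRS.items()
        if agent_caps & agent_bit and server_caps & server_bit
    )


agent_caps = 0x1 | AGENT_ACCEPTS_REMOTE_CONFIG | AGENT_REPORTS_EFFECTIVE_CONFIG
server_caps = 0x1 | SERVER_OFFERS_REMOTE_CONFIG | SERVER_OFFERS_PACKAGES

print(usable_features(agent_caps, server_caps))  # ['remote_config']
```

Here the agent reports its effective config but the server does not accept it, and the server offers packages that the agent cannot take, so remote configuration is the only feature both sides can use.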
Let me Experiment - Open Source Server Implementations
There are a few open source server implementations available if you'd like to try this out:
- Python OpAMP Server, which I'm currently writing. I invite you to try it out, raise bugs and contribute (if you know Python) to make it better!
- Elixir OpAMP.
- Bindplane is a commercial entity offering a freemium OpAMP server.
Summary
I believe OpAMP will become the de facto standard for remote agent fleet management, just like OpenTelemetry is now the standard for telemetry. I see only upside for vendors: they no longer have to maintain their own implementations, specifications and protocols.
The benefits for end users are huge. Imagine all your antivirus, EDR tooling, observability tooling and AI agents speaking the same protocol and being managed from a single central dashboard. It would make life so much easier for enterprise-level sysadmins.
The OpAMP protocol is defined in depth in the OpAMP specification. Let me know in the comments section what you'd like me to cover on this subject - I'm happy to create any additional content.