Enhance OpenTelemetry gRPC With a Consistent Hash Load Balancer
This article demonstrates how to use Envoy's consistent-hash load balancing for OpenTelemetry OTLP gRPC traffic.
The use case
The OpenTelemetry collector (OTel collector) is deployed as an agent alongside the application on remote servers. It sends telemetry data (logs, traces, metrics) from the application and the host to central storage through a gateway deployed on a Kubernetes cluster.
The OTel collector gateway is deployed using the OpenTelemetry operator Helm chart, with a Kubernetes HPA scaling the replicas based on CPU load. The traffic is routed through a headless service because the standard Kubernetes service is not a good fit for gRPC, as described in this article. With this setup, however, there is no load balancing on the Kubernetes side, as that article also notes.
Because there is no load balancing and the OTel agents send data in batches, data from the same remote host is forwarded through randomly chosen OTel collector gateway replicas. Each replica adds a label holding its own identity, so the same data is stored under a different label value for every replica it passes through, and the amount written is multiplied by the number of replicas. This drastically increases storage usage, and queries must aggregate across the replica label.
Let’s show it in an example.
Take one of the OTel agent metrics called otelcol_process_uptime, which has a label added by the OTel gateway called otelcol_replica, holding the name of the replica. The OTel gateway has four replicas; let’s query the metric using PromQL on the storage side:
avg by (otelcol_replica)(otelcol_process_uptime{hostname="xxxxxx"})
{otelcol_replica="opentelemetry-collector-5fc9f8g5sj5"} 2502046.749352578
{otelcol_replica="opentelemetry-collector-5fc9f8pfmvh"} 2502096.74889717
{otelcol_replica="opentelemetry-collector-5fc9f8rzkh4"} 2502156.749325255
{otelcol_replica="opentelemetry-collector-5fc9f8xj95v"} 2502136.749453457
As demonstrated, the data coming from the remote host are written four times into the storage.
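The multiplication effect can be reproduced with a small simulation. The following is a minimal sketch, not the actual gateway behaviour: the replica names, the hostname, and the batch count are hypothetical, and a seeded random choice stands in for the unbalanced routing described above. It counts how many distinct (hostname, replica) label combinations end up in storage.

```python
import random

# Hypothetical names, used only for illustration.
REPLICAS = [f"otel-gateway-{i}" for i in range(4)]
HOSTNAME = "remote-host-1"


def count_series(route, batches=1000):
    """Count distinct (hostname, replica) label sets produced by a routing function."""
    series = set()
    for batch_number in range(batches):
        replica = route(batch_number)
        # The gateway stamps its own identity on every batch it forwards.
        series.add((HOSTNAME, replica))
    return len(series)


rng = random.Random(0)

# Each batch lands on an arbitrary replica (no consistent routing).
random_series = count_series(lambda _n: rng.choice(REPLICAS))

# Every batch from this host is pinned to one replica.
pinned_series = count_series(lambda _n: REPLICAS[0])

print("series with random routing:", random_series)
print("series with pinned routing:", pinned_series)
```

With random routing, the series count for a single host grows to the number of replicas, while a pinned route keeps it at one.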
The solution is a load balancing mechanism that consistently routes data from the same remote source through the same OTel collector replica. Envoy proxy is a good candidate here because it offers load balancers based on consistent hashing.
The solution
Envoy proxy is deployed with two replicas and a headless service, and sits between the ingress and the OTel collector gateway.
It is configured with a ring-hash load balancer that hashes on the X-Forwarded-For HTTP header, with HTTP/2 enabled for the upstream cluster.
...
route:
  cluster: "opentelemetry-collector-cluster"
  hash_policy:
  - header:
      header_name: x-forwarded-for
...
clusters:
- name: opentelemetry-collector-cluster
  connect_timeout: 0.25s
  type: STRICT_DNS
  dns_lookup_family: V4_ONLY
  lb_policy: RING_HASH
  http2_protocol_options: {}
...
This configuration ensures that data from the same source IP flows through the same OTel gateway replica for as long as that replica exists. With this consistent route, only one copy of the data from each remote host is written into storage.
If the replica fails, Envoy proxy redirects the data flow to the next member of the hash ring. For a short period, two copies of the data will therefore exist in storage, because the label holding the identity of the OTel collector replica changes value.
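The routing and failover behaviour described above can be illustrated with a small hash ring. This is a simplified sketch of the general consistent-hashing technique, not Envoy's implementation: the hash function (MD5 here, chosen only for stable output), the number of virtual nodes, the replica names, and the source IPs are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is an assumption for this sketch; it only needs to be stable and well spread.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def build_ring(replicas, vnodes=100):
    """Place each replica on the ring at `vnodes` pseudo-random points."""
    return sorted((_hash(f"{name}#{i}"), name) for name in replicas for i in range(vnodes))


def pick(ring, key):
    """Return the replica owning the first ring point after the key's hash."""
    points = [point for point, _ in ring]
    index = bisect.bisect_right(points, _hash(key)) % len(ring)  # wrap around
    return ring[index][1]


# Hypothetical gateway replicas and source IPs (the X-Forwarded-For values).
replicas = [f"otel-gateway-{i}" for i in range(4)]
sources = [f"10.0.0.{i}" for i in range(1, 51)]

ring = build_ring(replicas)
before = {ip: pick(ring, ip) for ip in sources}

# Simulate the failure of the replica that serves the first source.
failed = before[sources[0]]
ring_after = build_ring([name for name in replicas if name != failed])
after = {ip: pick(ring_after, ip) for ip in sources}

moved = [ip for ip in sources if before[ip] != after[ip]]
print(f"{len(moved)} of {len(sources)} sources moved after {failed} failed")
```

Only the sources that were served by the failed replica are remapped; every other source keeps its replica, which is why the duplicate data stays limited to the affected hosts.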
Conclusion
Consider a high-load environment where the number of the OTel gateway replicas could be scaled to quite a high number. How much storage capacity could be saved with a reliable data flow from remote sources?
Author
Gabriel Illés
Senior DevOps Engineer
Dedicated professional with experience in managing cloud infrastructure and system administration, integrating cloud-based infrastructure components, and developing automation and data engineering solutions. Good at troubleshooting problems and building successful solutions. Excellent verbal and written communicator with a strong background in cultivating positive relationships and exceeding goals.