Tracking a Java Memory Leak: How Daemon Threads in Pyroscope Nearly Crashed Our Service

Introduction
Every hotel search on Agoda starts with one thing: delivering the right content to our customers at the right time. Behind the scenes, we rely on a central service that powers the content users see: images, descriptions, reviews, location details, facilities, room information, policies, and more. Internally, we call this the content-api service. It sits at the heart of Agoda’s content ecosystem and plays a crucial role in helping users make confident booking decisions.
Given the importance of this service, we rely on profiling tools to monitor performance and identify patterns in application behavior. But what happens when the profiler itself becomes the problem?
One day, while monitoring our content-api service, we noticed an unusual trend: memory usage and thread count were steadily rising, even though CPU and heap memory remained stable. The increases weren’t drastic, but over time, they edged closer to the out-of-memory (OOM) threshold.
What began as a subtle anomaly led us down a path that revealed a surprising root cause: orphaned daemon threads in a third-party Java agent. In this post, we’ll walk through how we approached the issue, what we discovered, and how we helped resolve the problem upstream.
Tracing the Anomaly
Our first step was to determine when the unusual behavior began. Was it linked to a recent deployment, a change in configuration, or something more subtle that had escaped notice?
To understand the scope of the issue, we analyzed metrics across recent releases. We saw a gradual but consistent rise in memory and thread usage. The service wasn’t failing outright; regular deployments and pod restarts were temporarily resetting the state, preventing an out-of-memory (OOM) error. But the trend was clear and unsustainable.
To begin diagnosing the root cause, we started collecting thread and heap snapshots. These would help us understand what was being created during runtime, how resources were accumulating, and what wasn’t being cleaned up. With that, the investigation was underway.
Capturing Runtime Evidence
We began our investigation by collecting heap and thread snapshots from the application. Since JMX RMI was already enabled, we first set up port forwarding to allow local access to the pod. This gave us the ability to connect VisualVM directly to the application for deeper inspection.
Why VisualVM? We chose it for its open-source nature, user-friendly interface, and live thread and memory profiling, which gave us immediate, real-time insight into the running application.
Once connected, we started capturing periodic snapshots of both heap and thread states. This wasn't a one-off exercise: because thread counts and memory usage were increasing gradually over time, taking snapshots at regular intervals was key. By comparing the deltas between snapshots, we could pinpoint which objects and threads were accumulating without being cleaned up.
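To make that delta comparison concrete, here is a minimal sketch of the idea using the standard ThreadMXBean API. This is not the tooling we actually used (VisualVM did this for us), just an illustration of counting live threads per state and diffing two snapshots taken an interval apart:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSnapshot {

    // Counts live threads per Thread.State (RUNNABLE, WAITING, TIMED_WAITING, ...).
    static Map<Thread.State, Integer> snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Thread.State, Integer> before = snapshot();
        Thread.sleep(60_000); // wait one interval between snapshots
        Map<Thread.State, Integer> after = snapshot();

        // Print the per-state delta; a consistently positive WAITING delta
        // across intervals is the kind of signal we were looking for.
        after.forEach((state, count) ->
                System.out.printf("%s: %+d%n", state, count - before.getOrDefault(state, 0)));
    }
}
```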
Heap dumps were generated directly on the remote pod. To analyze them locally, we used the following command to copy the files:
kubectl cp <remote_pod>:<remote_file_path> <local_file_path>
In addition to dumps, we reviewed application logs for supporting evidence. Logs can often reveal patterns or errors that correlate with runtime behavior; in this case, they did.
After analyzing the snapshots, we noticed a growing number of threads in the WAITING state. They all originated from the same class, and the count increased consistently over time. The logs from that same class also began surfacing at the exact point where the issue seemed to begin.
All signs pointed to a single component: Pyroscope, the profiling agent integrated into the service.
Confirming the Root Cause
With strong indicators pointing to Pyroscope, we moved to validate the hypothesis by isolating its impact. We disabled Pyroscope in one of our non-production environments, while keeping it active in the others, and monitored memory and thread metrics across all instances.
The results were unambiguous. The environment without Pyroscope remained stable, while the others continued to show a steady increase in thread count and memory usage.
This confirmed that Pyroscope’s behavior was directly contributing to the issue.
Fixing the Issue
After confirming that Pyroscope was the source of the problem, we raised the issue with our Observability team. They quickly helped verify the behavior and worked with us to trace it back to the agent’s internal thread management.
To understand the issue more clearly, it’s important to look at how Pyroscope works.
Pyroscope is integrated into our application as a Java agent and can be toggled on or off through our internal developer portal. Once enabled, the agent begins collecting profiling metrics, such as CPU and memory usage, and temporarily stores this data in a buffer. At regular intervals, it pushes the buffered data to a central Pyroscope server.
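For readers unfamiliar with the library, enabling the agent programmatically looks roughly like the sketch below. Class and method names follow pyroscope-java's public documentation and may differ between versions, and the application name and server address are placeholders; in our case the toggle goes through the internal developer portal rather than code like this.

```java
import io.pyroscope.javaagent.PyroscopeAgent;
import io.pyroscope.javaagent.config.Config;

public class ProfilingToggle {

    // Starts the profiling agent: it samples the JVM, buffers the results,
    // and periodically uploads them to the central Pyroscope server.
    static void enableProfiling() {
        PyroscopeAgent.start(
                new Config.Builder()
                        .setApplicationName("content-api")                 // placeholder
                        .setServerAddress("http://pyroscope-server:4040")  // placeholder
                        .build());
    }
}
```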
This export process is handled by a component called QueuedExporter, which spawns a background daemon thread to transmit the profiling data.
The problem occurs when profiling is restarted. Instead of reusing the existing export thread, Pyroscope spawns a new one. However, the previous thread is never explicitly stopped. Over time, these orphaned threads accumulate, remaining in memory even though they no longer serve any function.
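To make the failure mode concrete, here is a deliberately simplified sketch of the pattern, using a hypothetical class rather than the actual Pyroscope source: each restart starts a fresh daemon thread, and because nothing stops the previous one, the old threads stay parked in WAITING forever.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical, simplified exporter used only to illustrate the leak pattern.
class LeakyExporter {

    private final BlockingQueue<byte[]> buffer = new LinkedBlockingQueue<>();
    private Thread exportThread;

    void start() {
        // A new daemon thread is created on every (re)start ...
        exportThread = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    byte[] batch = buffer.take(); // parks the thread in WAITING
                    upload(batch);
                }
            } catch (InterruptedException ignored) {
                // exit the loop when interrupted
            }
        }, "profiling-exporter");
        exportThread.setDaemon(true);
        exportThread.start();
        // ... but the previous thread is never interrupted or joined,
        // so every restart leaves one more thread parked on buffer.take().
    }

    private void upload(byte[] batch) {
        // send the buffered profiling data to the server (omitted)
    }
}
```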
The root cause came down to how these daemon threads were managed: they were never explicitly terminated, even after profiling was restarted. Over time, the accumulating orphaned threads drove the steady growth in thread count and memory consumption we had observed.
Since pyroscope-java is an open-source library maintained by Grafana, our Observability team worked with the maintainers to address the issue. A fix was proposed and merged promptly: updating the agent’s stop() method to shut down properly and clean up the exporter thread. (GitHub Issue #169)
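Conceptually, the fix is the standard shutdown pattern for such a thread, something like the sketch below, shown as an addition to the simplified LeakyExporter above. This illustrates the idea only; it is not the actual patch.

```java
// Conceptual stop() for the exporter sketch above: signal the export thread
// to exit and wait for it before a subsequent start() creates a replacement.
void stop() throws InterruptedException {
    if (exportThread != null) {
        exportThread.interrupt();   // unblocks buffer.take() via InterruptedException
        exportThread.join(5_000);   // wait up to 5s for a clean exit
        exportThread = null;
    }
}
```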
With the updated agent deployed, the resource leak was resolved. Memory and thread usage stabilized, and content-api returned to normal operating conditions.
One question remains: why was this issue isolated to content-api and not observed in other services using the same profiling setup? We suspect it may relate to how frequently profiling is restarted in this particular service, but the exact trigger is still under review.
Some debugging stories end with a fix. Others leave behind open questions and valuable lessons.
Conclusion / Final Thoughts
This incident reminded us that performance issues don’t always stem from our code, and that even trusted tools can misbehave in subtle ways. Observability isn’t just about monitoring services; it’s about closely monitoring the entire ecosystem, including third-party agents and dependencies.
By staying curious, using the right tools, and working closely with both internal and open-source communities, we were able to fix the issue and improve the ecosystem for everyone else using the same tools.
