Fix a deadlock due to signal interactions with prometheus client

The prometheus client uses a threading.Lock() to prevent shared access to
certain metric state. This lock is taken as part of doing collection, as well
as during metric.labels().

We hit a deadlock where our stack sampler signal arrived during a collection,
while the lock was held. The signal handler runs on top of the interrupted
code, so its call to flamegraph.labels() blocked forever waiting for a lock
that could not be released until the handler returned. This hung all metrics
collection.
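
A minimal sketch of that failure mode, using only the standard library. It is
not the sampler or prometheus_client code: metric_lock and sample() are
hypothetical stand-ins, SIGALRM is an arbitrary choice of signal, and
signal.raise_signal assumes Python 3.8+ on a Unix-like system. The handler
probes the lock with a non-blocking acquire so the example terminates; a
plain acquire() at that point is what would hang.

```python
import signal
import threading
import time

# Hypothetical stand-in for the metric's internal threading.Lock().
metric_lock = threading.Lock()
results = []


def sample(signum, frame):
    # A plain metric_lock.acquire() would block forever if the interrupted
    # code already holds the lock, so this sketch only probes it.
    acquired = metric_lock.acquire(blocking=False)
    results.append(acquired)
    if acquired:
        metric_lock.release()


signal.signal(signal.SIGALRM, sample)

# Case 1: the signal arrives "mid-collection", while the lock is held.
with metric_lock:
    signal.raise_signal(signal.SIGALRM)
    time.sleep(0.01)  # give the handler a chance to run before releasing

# Case 2: the signal arrives while nothing holds the lock.
signal.raise_signal(signal.SIGALRM)
time.sleep(0.01)

print(results)
```

The first probe fails because the handler interrupts the very code that holds
the lock; the second succeeds because the lock is free.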

Our solution is a hack: we reach into the internals of our metric object and
replace its lock with a dummy one. This is reasonably safe, but only as long as
the prometheus_client internal structure doesn't change significantly.
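
The lock swap can be sketched with a stand-in metric class, again using only
the standard library. FakeCounter, NoOpLock and the during_copy hook are all
hypothetical names invented for illustration; the real change assigns
gevent.lock.DummySemaphore() to the metric's _lock attribute, and the real
metric's internals may differ from this model.

```python
import signal
import threading


class NoOpLock:
    """Offers the `with lock:` interface but never blocks (stand-in only)."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


class FakeCounter:
    """Hypothetical labelled counter; this is not prometheus_client code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def inc(self, label, amount=1):
        with self._lock:
            self._values[label] = self._values.get(label, 0) + amount

    def collect(self, during_copy=None):
        with self._lock:
            if during_copy is not None:
                during_copy()  # test hook: simulate a signal arriving here
            return dict(self._values)  # only a copy happens under the lock


metric = FakeCounter()
# The hack: swap the real lock for one that never blocks. With the original
# threading.Lock in place, the inc() call in the handler below would hang.
metric._lock = NoOpLock()


def sample(signum, frame):
    metric.inc("some;stack")


signal.signal(signal.SIGALRM, sample)

# Deliver the signal while collect() is inside its "locked" section.
snapshot = metric.collect(
    during_copy=lambda: signal.raise_signal(signal.SIGALRM)
)
print(snapshot, metric.collect())
```

Whether the first snapshot already contains the new label depends on when the
handler runs relative to the copy. Either outcome is acceptable, which is the
property the hack relies on.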
pull/31/head
Mike Lang 6 years ago committed by Mike Lang
parent c9cc8a73a7
commit 10cca18922

@@ -6,6 +6,8 @@ import os
 import signal
 import sys
+import gevent.lock
 import prometheus_client as prom
 from monotonic import monotonic
@@ -173,12 +175,6 @@ class PromLogCountsHandler(logging.Handler):
 root_logger.addHandler(cls())
-flamegraph = prom.Counter(
-"flamegraph",
-"Approx time consumed by each unique stack trace seen by sampling the stack",
-["stack"]
-)
 def install_stacksampler(interval=0.005):
 """Samples the stack every INTERVAL seconds of user time.
 We could use user+sys time but that leads to interrupting syscalls,
@@ -191,6 +187,23 @@ def install_stacksampler(interval=0.005):
 # 2. Avoid biasing the results by effectively not including the time taken to do the actual
 # stack sampling.
+flamegraph = prom.Counter(
+"flamegraph",
+"Approx time consumed by each unique stack trace seen by sampling the stack",
+["stack"]
+)
+# HACK: It's possible to deadlock if we handle a signal during a prometheus collect
+# operation that locks our flamegraph metric. We then try to take the lock when recording the
+# metric, but can't.
+# As a hacky work around, we replace the lock with a dummy lock that doesn't actually lock anything.
+# This is reasonably safe. We know that only one copy of sample() will ever run at once,
+# and nothing else but sample() and collect() will touch the metric, leaving two possibilities:
+# 1. Multiple collects happen at once: Safe. They only do read operations.
+# 2. A sample during a collect: Safe. The collect only does a copy inside the locked part,
+# so it just means it'll either get a copy with the new label set, or without it.
+# This presumes the implementation doesn't change to make that different, however.
+flamegraph._lock = gevent.lock.DummySemaphore()
 def sample(signum, frame):
 stack = []
 while frame is not None:

@@ -5,6 +5,7 @@ setup(
 version = "0.0.0",
 packages = find_packages(),
 install_requires = [
+"gevent",
 "monotonic",
 "prometheus-client",
 "python-dateutil",
