This solves the problem of rows that don't need a full cut video,
but where we'd still like to link to an image or a short gif or clip of it.
It is a sheet input that is only used in the output sheet, so it doesn't affect the wubloader itself.
* Checks for the SCTE35-OUT/SCTE35-IN marks in the HLS stream that indicate an ad start/end
* Ignores those segments completely
* Doesn't mark the StreamWorker as up until it sees the first non-ad segment
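As a rough sketch of the behaviour described above (how exactly the SCTE35-OUT/SCTE35-IN marks show up in the playlist is an assumption here; the real parser works on parsed playlist objects, not raw lines):

```python
# Minimal sketch of the ad-skipping idea, assuming we're scanning raw media
# playlist lines and the marks appear on EXT-X-DATERANGE tags.
def non_ad_segment_uris(playlist_lines):
    """Yield segment URIs, skipping anything between SCTE35-OUT and SCTE35-IN marks."""
    in_ad = False
    for line in playlist_lines:
        if line.startswith('#EXT-X-DATERANGE'):
            if 'SCTE35-OUT' in line:
                in_ad = True   # ad break started, ignore following segments
            elif 'SCTE35-IN' in line:
                in_ad = False  # ad break over, segments are content again
        elif line and not line.startswith('#'):
            if not in_ad:
                yield line
```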
Some other operational notes:
* The main risk this adds is that re-connecting / refreshing the master playlist takes longer.
If all downloaders are doing this at the same time (i.e. because the stream only just came up,
or during a deployment rollout), all downloaders might be waiting for ads to finish and
you'll miss segments.
* We should run more downloaders to compensate. This also increases the chance at least one of
them won't get any ads, so we get everything right from stream-up.
* The other mitigation is to have geographically diverse downloaders. This decreases the risk
that they all get served an ad, and at least at the time of writing it seems that no in-stream ads
are served outside of these regions:
> US, Canada, Germany, France, Sweden, Belgium, Poland, Norway, Finland, Denmark, Netherlands, Italy, Spain, Switzerland, Austria, Portugal, UK, Australia, New Zealand
Split UPLOADED into TRANSCODING and DONE, to represent the time after upload
that youtube is transcoding the video and it's not yet viewable.
Any cutter can poll for the state of a transcoding video and mark it as done.
Add some extra sheet input columns.
I fully expect the exact list of sheet inputs, edit inputs and outputs to change.
The important thing I wanted to codify here was the state machine and the behaviour of the cutters.
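As a rough illustration of the TRANSCODING -> DONE part of that state machine (the helper and its name are my own sketch, not the real code):

```python
def poll_transcoding_videos(youtube, transcoding_video_ids):
    """Any cutter can poll videos in TRANSCODING and mark finished ones DONE."""
    done = []
    for video_id in transcoding_video_ids:
        # youtube.is_processed() is an assumed helper wrapping the processing
        # status check; once it reports finished, the row can move to DONE.
        if youtube.is_processed(video_id):
            done.append(video_id)
    return done
```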
In resource contention scenarios, all calls can start failing because we can't
read the response in a timely manner.
This means SegmentGetters never stop retrying, leading to further contention
and a feedback loop.
We attempt to put at least some cap on this scenario by giving up
once enough time has elapsed that we know our URL can't still be valid.
Since we don't actually know how long segment URLs are valid, we are very conservative about
this time, for now setting it to 20min.
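In code terms the cap is roughly this (constant and function names are illustrative):

```python
import time

# We don't actually know how long segment URLs stay valid, so be very
# conservative: give up entirely after 20 minutes.
MAX_SEGMENT_URL_AGE = 20 * 60

def should_give_up(url_obtained_at):
    """True once the segment URL can't possibly be valid any more."""
    return time.monotonic() - url_obtained_at > MAX_SEGMENT_URL_AGE
```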
When we're under CPU or disk contention, doing other work
can become very slow. We want to avoid spurious errors in this situation
as this causes further retries and further contention.
One easy way to do this is to increase the time we allow for fetching the response headers.
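For example, assuming the fetch goes through requests (the values here are illustrative, not the real config):

```python
import requests

CONNECT_TIMEOUT = 5    # connecting should still be fast
HEADERS_TIMEOUT = 60   # generous, so contention doesn't cause spurious failures

def fetch_segment(session, url):
    # With stream=True the get() call returns once the response headers arrive,
    # so the read timeout here mostly governs how long we wait for the headers
    # (it also applies to each later read of the body).
    return session.get(url, stream=True, timeout=(CONNECT_TIMEOUT, HEADERS_TIMEOUT))
```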
get_best_segments can sometimes take a very long time,
and we don't want to stop other work from happening while it's ongoing.
So we ask gevent to run other things until there's no other work to do,
then we do one hour, then check back with gevent again.
In combination with the performance improvements, this should mean we don't block
other things from running for more than a few hundred ms at most.
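The pattern is roughly this (the per-hour helper is hypothetical; the real get_best_segments signature differs):

```python
import gevent

def get_best_segments(hours):
    results = []
    for hour in hours:
        # Let any other ready greenlets run first; idle() returns once the
        # event loop has nothing else to do.
        gevent.idle()
        results.extend(best_segments_for_hour(hour))  # hypothetical per-hour work
    return results
```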
strptime is much faster, but can't handle as wide a variety of formats.
But in this case we fully control the format, so there's no reason not to use it.
Profiling suggests we spend about 80% of our time in get_best_segments just parsing dates,
so this is a significant performance gain.
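For instance, since we control the timestamp format (the exact format string below is an assumption), the parse is a single strptime call:

```python
from datetime import datetime

def parse_segment_timestamp(s):
    # We fully control this format, so strptime is safe and much faster than a
    # flexible parser like dateutil.
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f")
```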
The prometheus client uses a threading.Lock() to prevent shared access to
certain metric state. This lock is taken as part of doing collection, as well
as during metric.labels().
We hit a deadlock where our stack sampler signal arrived during a collection,
when the lock was held. This meant that flamegraph.labels() blocked forever,
and the lock was never released, hanging all metrics collection.
Our solution is a hack, which is to reach into the internals of our metric object
and replace its lock with a dummy one. This is reasonably safe, but only as long as
the prometheus_client internal structure doesn't change significantly.
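The hack looks roughly like this (the `_lock` attribute name is an assumption about prometheus_client internals, which is exactly why it's fragile):

```python
class _NoopLock(object):
    """Context-manager stand-in for threading.Lock that never blocks."""
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def acquire(self, *args, **kwargs):
        return True
    def release(self):
        pass

def remove_metric_lock(metric):
    # Reach into the metric's internals and swap its lock for a no-op, so a
    # stack sample taken mid-collection can't deadlock on metric.labels().
    metric._lock = _NoopLock()
```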
To preserve independence between workers and ensure that a
retry (a worker re-create) actually starts from scratch, we only pool connections
on a per-worker basis.
Furthermore, for the same reason, we only let SegmentGetters use the worker's
pool on their first attempt. After that, they create a new pool to ensure they have a clean retry.
Despite this, the result should be that we're almost always re-using an existing connection
when getting segments or media playlists, unless something goes wrong.
SSL connection setup was measured as almost half the CPU time used by the process,
so this change should result in a significant CPU usage reduction.
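A sketch of the pooling rule, assuming requests.Session is the pool (class and function names here are illustrative):

```python
import requests

class StreamWorker(object):
    def __init__(self):
        # Connections are pooled only within this worker, so a re-created
        # worker always starts from scratch with fresh connections.
        self.session = requests.Session()

def get_segment(worker, url, attempt):
    # First attempt re-uses the worker's pool; any retry gets a brand new pool
    # so a bad cached connection can't poison the retry.
    session = worker.session if attempt == 0 else requests.Session()
    return session.get(url, stream=True)
```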
Backdoor allows the operator to telnet into the given port, and get a python shell
running inside the process, from which you can debug, modify state (eg. set the log level),
or whatever. This is extremely useful for debugging weird states that you encounter randomly
but can't easily reproduce, without restarting the process and needing to wait until it happens again.
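Assuming this is gevent's backdoor module (which matches the description), usage looks roughly like:

```python
import gevent
from gevent.backdoor import BackdoorServer

def start_backdoor(port, locals_dict=None):
    # Telnet to this port to get a python REPL running inside the process.
    server = BackdoorServer(('127.0.0.1', port), locals=locals_dict)
    gevent.spawn(server.serve_forever)
    return server
```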
This meant that old workers would never shut down, causing us to fetch the same media playlist
and same segments multiple times for no reason, and to never give up in the face of (non-403/404) errors
even once we have something else working.
Also fix some logging.
When we're out of touch with twitch for long enough, our segment URL will get
so old that twitch stops returning 403 because our token is expired,
and starts returning 404s, presumably because the underlying resource has gone away.
We want to treat these the same.
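In code terms this is just a matter of widening the check (sketch only):

```python
def segment_url_is_dead(status_code):
    # 403: our access token has expired. 404: the underlying resource is gone.
    # Either way the URL will never work again, so treat them the same.
    return status_code in (403, 404)
```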
The function is quite customizable and therefore quite complex, but it allows us to
easily annotate a function to be timed with labels based on input and output,
as well as to normalize results by the amount of work done, giving a better
picture of the actual time taken per unit of work.
This will help us monitor for performance issues.
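A very stripped-down sketch of the idea (the real helper is far more configurable, and these parameter names are mine):

```python
import functools
import time

def timed(metric, labels_fn, normalize_fn=None):
    """Time a function, label the observation from its inputs, and optionally
    normalize by units of work (e.g. seconds per segment rather than per call)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            elapsed = time.monotonic() - start
            if normalize_fn is not None:
                elapsed /= normalize_fn(result)
            # labels_fn returns a dict of label values derived from the call
            metric.labels(**labels_fn(*args, **kwargs)).observe(elapsed)
            return result
        return wrapper
    return decorator
```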
Prom client doesn't like you creating two stats with the same name,
even though they have different labels and this makes perfect sense.
I feel like I just need to re-write the prom client at some point - it doesn't actually
do all that much except get in your way, apart from the actual text encoding which I
can steal.
Anyway, in the meantime, we get around this by breaking up metrics into two names,
a "foo_all" and a "foo_ENDPOINT". The foo_all lacks the detailed labels,
but is still labelled by endpoint and can be used more easily.
The foo_ENDPOINT metrics carry more detail but require messier PromQL, as you need to
match on a name regex if you want to look at more than one specific endpoint.
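For example (metric and label names here are illustrative, not the real ones):

```python
import prometheus_client as prom

# One coarse metric shared across all endpoints...
latency_all = prom.Histogram(
    'api_latency_all', 'Latency of all API calls', ['endpoint'],
)
# ...plus a per-endpoint metric carrying the extra labels, at the cost of a
# name-regex match in PromQL if you want to query several endpoints at once.
latency_get_segment = prom.Histogram(
    'api_latency_get_segment', 'Latency of get_segment calls',
    ['endpoint', 'channel', 'quality'],
)
```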