due to not including the map data in the hash calculation.
This is only relevant for streams with map data, which does not include twitch or youtube URLs.
since they can't do any other quality, but we still want to be able to set other qualities
for twitch streams.
Really qualities should be per-channel but I'm being lazy.
Adds very simple youtube stream support where we only ever use the "best" quality,
which we call "source" for consistency with twitch.
We use yt-dlp to do the heavy lifting of getting the playlist url out of youtube.
In some formats, most notably DASH, there is a "initialization data" that is required
in order to play the segment. The data is common to all segments so it is served as a seperate URL
under EXT-X-MAP. However, redundant copies of this data are benign and it's very small, so
we just put it in front of EVERY segment so that we can play every one independently (but
concatenating them still works).
We use a very simple cache to avoid downloading it again for every segment.
Rarely, we find ourselves needing to explicitly delete some data, eg. something that shouldn't
have been public and should be removed from all records.
It would also be nice if we could "clean up" bad versions of the same segment,
which occasionally come up when downloaders have issues.
With our distributed segment database, this is actually rather difficult as deleting the data
from any one server would cause it to be restored from the others. It was only possible
by stopping all backfill, deleting the data on all servers, then starting backfill again.
Here we introduce a more practical approach. An operator creates an empty flag file
with the same name as the segment to be deleted, but with a `.tombstone` extension.
eg. to delete a file `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.ts`,
you would create a tombstone `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.tombstone`.
These tombstone files do two important things:
* They hide the segment from being listed, which both means:
* It can't be restreamed or put into a video
* It can't be backfilled to other nodes
* The tombstone files themselves do get backfilled to other nodes, so you only need to mark them on one server.
Once the tombstone has propagated to all nodes, the segment file can be deleted independently on each one.
We chose not to have a tombstone automatically trigger a segment deletion for safety reasons.
When pushed, this tells github to associate the ghcr.io repo that was pushed to
with the github repo specified (the owner needs to match).
This does a few things.
Most importantly, this automatically gives github actions credentials to push to these
repositories when run in the context of the wubloader repo.
In python 3, file.write() may do a partial write and returns the number of characters written.
In order to not lose data, we need to wrap every instance of file.write() with our new
common.writeall() wrapper that loops until the data is actually written.
Check that open() calls for reading and writing use binary modes
Use alpine version with py3-pip package
Use python3 in Dockerfile CMD
Remove sys.setdefaultencoding() "hack"
Simplify ensure_directory() in common.common package
Twitch removed their old access token endpoint and now use a GraphQL endpoint.
The old endpoint would just always return 404, which we sadly interpreted as "stream not up".
Thankfully streamlink has already done the reverse engineering work so I was able to
update it to work again fairly easily, it's just a bit more convoluted.
The intended behaviour was to log a warning message and retry next time,
but still allow workers to be started for any streams found.
However, due to a missing continue, we fall through to attempting to start a worker
for a non-existent quality which causes a KeyError when looking up
`self.latest_urls[quality]`. This exception means we don't run through the other qualities,
so we never start any other quality.
We've noticed that when nodes have connection problems, they get full segments
with different hashes. Inspection of these segments shows that
they all have identical data up to a point.
Segments that fetched normally will then have the remainder of the data.
Segments that had issues will have a slightly corrupted end.
The data is still valid, and no errors are raised. It just doesn't have all the data.
We noticed that these corrupted segments all were cut off exactly 60sec after their requests
began. We believe this is a server-side timeout on the request that returns whatever data
it has, then closes the container file cleanly before returning successfully.
We detect segments that take > 59 seconds to recieve, and label them as "suspect".
Suspect segments are treated identically to partial segments, except they are always preferred
over partials.
Not only is this redundant, but it creates a race condition where
the worker fails before the latest_worker = workers[-1] check,
and we get an IndexError.
By carefully ensuring most of our dockerfiles are identical in their first few layers,
we only need to build those layers once instead of every time.
In particular, we move installing gevent to before installing common,
so that even when common changes gevent doesn't need to be reinstalled.
This is important because gevent takes ages to install.
Also fixes segment_coverage, which wasn't being installed.
In our usage, we have one channel where we really care / want to know if it's down,
but also a bunch of other channels where they're expected to not be streaming most/all of the time.
To prevent these extra channels making a ton of noise, we introduce the concept of an "important"
channel, indicated by appending a '!' to the channel name in the command line.
So for example, you might specify channels as "foo! foo_backup foo_behindthescenes".
Important channels have the same behaviour as previously.
Non-important channels:
* Have a 20-second retry on a master playlist fetch failure, instead of 5
* Log at debug when the stream is down, instead of info.
This gives us a "stream delay" metric.
Prom doesn't have any native way to check the current value of a metric,
in order to take max(). It only offers increment and set.
We reach into some internals to do this in a hacky way,
but the cleaner way would be to track the value ourselves and have a prom callback
that gets the value.
Sigh, I hate this prom library. I might write my own that's less dumb.
We wrap direct dateutil calls to handle two distinct cases:
* `common.dateutil.parse()`: We want to handle arbitrary timestamps including tz info,
then convert them to UTC.
This is used in HLS parsing, and for command line input for backfiller
* `common.dateutil.parse_utc_only()`: We want to only handle UTC timestamps,
but datetime.strptime isn't flexible enough (eg. can't handle missing fractional component).
This is used for restreamer request params.
* Checks for the SCTE35-OUT/SCTE35-IN marks in the HLS stream that indicate an ad start/end
* Ignores those segments completely
* Doesn't mark the StreamWorker as up until it sees the first non-ad segment
Some other operational notes:
* The main risk this adds is that re-connecting / refreshing master playlist takes longer.
If all downloaders are doing this at the same time (ie. because the stream only just came up,
or during a deployment rollout), all downloaders might be waiting for ads to finish and
you'll miss segments.
* We should run more downloaders to compensate. This also increases the chance at least one of
them won't get any ads, so we get everything right from stream-up.
* The other mitigation we can do is have geographically diverse downloaders. This decreases the risk
that they all get served an ad, and at least at time of writing it seems that no in-stream ads
are served outside of these regions:
> US, Canada, Germany, France, Sweden, Belgium, Poland, Norway, Finland, Denmark, Netherlands, Italy, Spain, Switzerland, Austria, Portugal, UK, Australia, New Zealand
In resource contention scenarios, all calls can start failing due to
not being able to read the response in a timely manner.
This means SegmentGetters never stop retrying, leading to further contention
and a feedback loop.
We attempt to put at least some cap on this scenario by giving up
if an amount of time has elapsed to the point that we know our URL couldn't be valid anymore.
Since we don't actually know how long segment URLs are valid, we are very conservative about
this time, for now setting it to 20min.
When we're under CPU or disk contention, doing other work
can become very slow. We want to avoid spurious errors in this situation
as this causes further retries and further contention.
One easy way to do this is to increase the time we have to finish fetching headers.
To preserve independence between workers and ensure that a
retry (a worker re-create) actually starts from scratch, we only pool connections
on a per-worker basis.
Furthermore, for the same reason, we only let SegmentGetters use the worker's
pool on their first attempt. After that, they create a new pool to ensure they have a clean retry.
Despite this, the result should be that we're almost always re-using an existing connection
when getting segments or media playlists, unless something goes wrong.
SSL connection setup was measured as almost half the CPU time used by the process,
so this change should result in a signifigant CPU usage reduction.