Commit Graph

52 Commits (b9cd76b1a2d5e005a4d9d04c7c35fc2d9de722ff)

Author SHA1 Message Date
Mike Lang 21d5548980 Add new segment type "suspect"
We've noticed that when nodes have connection problems, they get full segments
with different hashes. Inspection of these segments shows that
they all have identical data up to a point.

Segments that fetched normally will then have the remainder of the data.
Segments that had issues will have a slightly corrupted end.
The data is still valid, and no errors are raised. It just doesn't have all the data.

We noticed that these corrupted segments all were cut off exactly 60sec after their requests
began. We believe this is a server-side timeout on the request that returns whatever data
it has, then closes the container file cleanly before returning successfully.

We detect segments that take > 59 seconds to recieve, and label them as "suspect".

Suspect segments are treated identically to partial segments, except they are always preferred
over partials.
5 years ago
Mike Lang 72003f28d0 downloader: Don't check the age of a worker we just spawned
Not only is this redundant, but it creates a race condition where
the worker fails before the latest_worker = workers[-1] check,
and we get an IndexError.
5 years ago
Mike Lang 94d81d708f Downloader: Change access_token call to match website
It stopped working, these changes bring it back in line with the website
so it works.
5 years ago
Christopher Usher abb9193705 fixed outdated "stream", "variant" in metric 5 years ago
Mike Lang 6c6c1ae637 downloader: Make a few things quieter for non-important channels 5 years ago
Mike Lang b2a07ef114
Merge pull request #140 from ekimekim/mike/build-improvements
Refactor dockerfiles for more shared layers
5 years ago
Mike Lang 731ef9e2d0 Refactor dockerfiles for more shared layers
By carefully ensuring most of our dockerfiles are identical in their first few layers,
we only need to build those layers once instead of every time.

In particular, we move installing gevent to before installing common,
so that even when common changes gevent doesn't need to be reinstalled.

This is important because gevent takes ages to install.

Also fixes segment_coverage, which wasn't being installed.
5 years ago
Mike Lang 83750da37b downloader: Create concept of an "important" channel
In our usage, we have one channel where we really care / want to know if it's down,
but also a bunch of other channels where they're expected to not be streaming most/all of the time.

To prevent these extra channels making a ton of noise, we introduce the concept of an "important"
channel, indicated by appending a '!' to the channel name in the command line.

So for example, you might specify channels as "foo! foo_backup foo_behindthescenes".

Important channels have the same behaviour as previously.

Non-important channels:
* Have a 20-second retry on a master playlist fetch failure, instead of 5
* Log at debug when the stream is down, instead of info.
5 years ago
Mike Lang 8ad61e9870 downloader: Collect metrics on http calls 5 years ago
Mike Lang b4655f18c6 downloader: Track total duration of downloaded segments 5 years ago
Mike Lang 04ef0d3823 fix a few remaining usages of StreamWorker.stream instead of .quality 5 years ago
Christopher Usher 361e577474 fixes based on ekimekims suggestions 5 years ago
Christopher Usher 3564643613 refactoring downloader 5 years ago
Mike Lang 73d5941e05 downloader: Track timestamp of latest segment
This gives us a "stream delay" metric.

Prom doesn't have any native way to check the current value of a metric,
in order to take max(). It only offers increment and set.

We reach into some internals to do this in a hacky way,
but the cleaner way would be to track the value ourselves and have a prom callback
that gets the value.

Sigh, I hate this prom library. I might write my own that's less dumb.
5 years ago
Mike Lang f8d10dacdf Audit and fix all usage of dateutil
We wrap direct dateutil calls to handle two distinct cases:

* `common.dateutil.parse()`: We want to handle arbitrary timestamps including tz info,
then convert them to UTC.

This is used in HLS parsing, and for command line input for backfiller

* `common.dateutil.parse_utc_only()`: We want to only handle UTC timestamps,
but datetime.strptime isn't flexible enough (eg. can't handle missing fractional component).

This is used for restreamer request params.
5 years ago
Mike Lang df66553b38 downloader: Start backdoor later so workers is in locals 6 years ago
Mike Lang 86da9d9fe8 downloader: Support watching multiple channels
This is useful eg. for watching db_admin or other testing channels in addition to the main channel.
6 years ago
Mike Lang f0d9aa82c2 Ignore segments that are marked as ads
* Checks for the SCTE35-OUT/SCTE35-IN marks in the HLS stream that indicate an ad start/end
* Ignores those segments completely
* Doesn't mark the StreamWorker as up until it sees the first non-ad segment

Some other operational notes:
* The main risk this adds is that re-connecting / refreshing master playlist takes longer.
  If all downloaders are doing this at the same time (ie. because the stream only just came up,
  or during a deployment rollout), all downloaders might be waiting for ads to finish and
  you'll miss segments.
* We should run more downloaders to compensate. This also increases the chance at least one of
  them won't get any ads, so we get everything right from stream-up.
* The other mitigation we can do is have geographically diverse downloaders. This decreases the risk
  that they all get served an ad, and at least at time of writing it seems that no in-stream ads
  are served outside of these regions:

> US, Canada, Germany, France, Sweden, Belgium, Poland, Norway, Finland, Denmark, Netherlands, Italy, Spain, Switzerland, Austria, Portugal, UK, Australia, New Zealand
6 years ago
Mike Lang c0f94059aa downloader: Stop retrying in SegmentGetter after a long timeout
In resource contention scenarios, all calls can start failing due to
not being able to read the response in a timely manner.
This means SegmentGetters never stop retrying, leading to further contention
and a feedback loop.
We attempt to put at least some cap on this scenario by giving up
if an amount of time has elapsed to the point that we know our URL couldn't be valid anymore.
Since we don't actually know how long segment URLs are valid, we are very conservative about
this time, for now setting it to 20min.
6 years ago
Mike Lang 81aee0ee1e Increase hard timeout for getting segment headers
When we're under CPU or disk contention, doing other work
can become very slow. We want to avoid spurious errors in this situation
as this causes further retries and further contention.

One easy way to do this is to increase the time we have to finish fetching headers.
6 years ago
Mike Lang b75b9a9b00 Add stacksampler to all services 6 years ago
Mike Lang a5213ccb3b downloader: Pool connections when we can
To preserve independence between workers and ensure that a
retry (a worker re-create) actually starts from scratch, we only pool connections
on a per-worker basis.
Furthermore, for the same reason, we only let SegmentGetters use the worker's
pool on their first attempt. After that, they create a new pool to ensure they have a clean retry.

Despite this, the result should be that we're almost always re-using an existing connection
when getting segments or media playlists, unless something goes wrong.

SSL connection setup was measured as almost half the CPU time used by the process,
so this change should result in a signifigant CPU usage reduction.
6 years ago
Mike Lang 901cda4814 Enable backdoor in all services, and add telnet to containers 6 years ago
Mike Lang 9af7795f34 Add gevent.backdoor as an optional arg to all services
Backdoor allows the operator to telnet into the given port, and get a python shell
running inside the process, from which you can debug, modify state (eg. set the log level),
or whatever. This is extremely useful for debugging weird states that you encounter randomly
but can't easily reproduce, without restarting the process and needing to wait until it happens again.
6 years ago
Mike Lang 47ff92b155 downloader: Fix bug where mark_working wasn't called
This meant that old workers would never shut down, causing us to fetch the same media playlist
and same segments multiple times for no reason, and to never give up in face of (non-403/404) errors
even once we have something else working.
6 years ago
Mike Lang 3042d00516 downloader: Give up on 404 in addition to 403
Also fix some logging.

When we're out of touch with twitch for long enough, our segment URL will get
so old that twitch stops returning 403 because our token is expired,
and start returning 404s, presumebly becasue the underlying resource has gone away.

We want to treat these the same.
6 years ago
Mike Lang 7f9a1dbe45 downloader: Remove implicit source quality arg
This brings it in line with backfiller, is more flexible and less surprising
6 years ago
Mike Lang 0d627715f3 downloader: Track number of downloaded segments
This is the most important metric, we can add more later.
6 years ago
Mike Lang b4b315b6bc Expose prometheus metrics for backfiller and downloader 6 years ago
Mike Lang b0ded641c3 Add a logging handler which counts logs for prometheus stats
This isn't as good as having a full centralised logging system, but should
suffice to know if anything funny is happening.
6 years ago
Mike Lang 17972b87aa Allow setting of log level via WUBLOADER_LOG_LEVEL env var
By using an env var, it is universal and happens prior to arg parsing,
at the same point we do other logging setup.
6 years ago
Mike Lang c0357680cf downloader: Use caller's logger inside soft_hard_timeout 6 years ago
Mike Lang a628676e74 downloader: Log to subloggers instead of the root logger
This gives us some context when logging, and is best practice.
6 years ago
Mike Lang 6815924097 Fix some bugs and linter errors introduced by backfiller
I ran `pyflakes` on the repo and found these bugs:

```
./common/common.py:289: undefined name 'random'
./downloader/downloader/main.py:7: 'random' imported but unused
./backfiller/backfiller/main.py:150: undefined name 'variant'
./backfiller/backfiller/main.py:158: undefined name 'timedelta'
./backfiller/backfiller/main.py:171: undefined name 'sort'
./backfiller/backfiller/main.py:173: undefined name 'sort'
```
(ok, the "imported but unused" one isn't a bug, but the rest are)

This fixes those, as well as a further issue I saw with sorting of hours.

Iterables are not sortable. As an obvious example, what if your iterable was infinite?
As a result, any attempt to sort an iterable that is not already a friendly type like a list
or tuple will result in an error. We avoid this by coercing to list, fully realising the iterable
and putting it into a form that python will let us sort. It also avoids the nasty side-effect
of mutating the list that gets passed into us, which the caller may not expect. Consider this example:

```
>>> my_hours = ["one", "two", "three"]
>>> print my_hours
["one", "two", "three"]
>>> backfill_node(base_dir, node, stream, variants, hours=my_hours, order='forward')
>>> print my_hours
["one", "three", "two"]
```

Also, one of the linter errors was non-trivial to fix - we were trying to get a list of hours
(which is an api call for a particular variant), but at a time when we weren't dealing with a single
variant. My solution was to get a list of hours for ALL variants, and take the union.
6 years ago
Christopher Usher fec0975d18 fixed white space and the like 6 years ago
Christopher Usher 3cdfaad664 moved rename, ensure_directory and jitter to common
Move a few useful functions in downloader used in the backfiller to common
6 years ago
Mike Lang 1dce14bf77 downloader: Fix and improve the stop mechanism, stop on SIGTERM
Allows for graceful shutdown
6 years ago
Mike Lang 7b10429846 downloader: Dockerfile fixes to make it work 6 years ago
Mike Lang 6c3501db6f downloader: Fix dateutil lib, which is actually called python-dateutil 6 years ago
Mike Lang 7257fb9b73 downloader: Include channel name in path, instead of assuming it's already in base_dir
Previously, downloader would put files under BASE_DIR/VARIANT/HOUR/FILE.ts
now, it will put files under BASE_DIR/STREAM/VARIANT/HOUR/FILE.ts

This brings downloader in line with restreamer's concept of base_dir
6 years ago
Christopher Usher efe30c1942 Added a comment to highlight recursion 6 years ago
Mike Lang 75c9793eac Remove central config file as it's more trouble than it's worth
Simpler and easier for testing to stick to configuration via CLI args.
We'll worry about deployment later.
6 years ago
Mike Lang 031dd60897 downloader: Fix some typos around the max age calculation 6 years ago
Mike Lang 6377db2aa2 downloader: Bug fixes and improvements
* Fix bug where soft timeout is not cancelled if an exception occurs
* Various logging tweaks
* Prevent master playlist wait time from going negative
* Stop gracefully if stream worker detects end of stream
* Don't treat master playlist 404 as an error, it just means the stream isn't up
6 years ago
Mike Lang 6e0dcd5e22 downloader: Fix bugs and missing bits in initial implementation
* Set a reasonable log format
* Make soft timeouts not always fire
* Change soft_hard_timeout signature slightly for ease-of-use
* Make renames not fail if file already exists
* Misc typos
6 years ago
Mike Lang f193bd0f54 Re-write downloader to be resilient to failures as much as possible
This makes the code crazy complicated and messy, but means we can be persistent about
not giving up, while still retrying at the same time, and trying multiple urls at once
until we find one that works.

See docstrings for a full discussion on some of the failures we're trying to work around.
6 years ago
Mike Lang 10241b6190 Downloader: Implement a very basic proof of concept version
Missing a LOT of tankiness, ways to configure, conversion to bustime, etc.
6 years ago
Mike Lang 8993773a22 downloader.twitch: Deals with twitch specifics of playlist management 6 years ago
Mike Lang 961712b919 downloader: Import hls_playlist from streamlink
This is a useful library and we might as well use it.
Copying it over and slightly modifying it to work was easier than importing all of streamlink.

The original version may be found at 30043408c7/src/streamlink/stream/hls_playlist.py
6 years ago
Mike Lang 30612f00ad downloader: basic startup path 6 years ago