wubloader

Commit Graph

Author	SHA1	Message	Date
Mike Lang	1add3c5c22	Implement tombstoning to allow for segment deletion Rarely, we find ourselves needing to explicitly delete some data, eg. something that shouldn't have been public and should be removed from all records. It would also be nice if we could "clean up" bad versions of the same segment, which occasionally come up when downloaders have issues. With our distributed segment database, this is actually rather difficult as deleting the data from any one server would cause it to be restored from the others. It was only possible by stopping all backfill, deleting the data on all servers, then starting backfill again. Here we introduce a more practical approach. An operator creates an empty flag file with the same name as the segment to be deleted, but with a `.tombstone` extension. eg. to delete a file `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.ts`, you would create a tombstone `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.tombstone`. These tombstone files do two important things: * They hide the segment from being listed, which both means: * It can't be restreamed or put into a video * It can't be backfilled to other nodes * The tombstone files themselves do get backfilled to other nodes, so you only need to mark them on one server. Once the tombstone has propagated to all nodes, the segment file can be deleted independently on each one. We chose not to have a tombstone automatically trigger a segment deletion for safety reasons.	3 years ago
Mike Lang	a47c29fff4	Link images to github repo by adding a LABEL When pushed, this tells github to associate the ghcr.io repo that was pushed to with the github repo specified (the owner needs to match). This does a few things. Most importantly, this automatically gives github actions credentials to push to these repositories when run in the context of the wubloader repo.	3 years ago
Mike Lang	62bd6539ea	Unpin gevent as that was a workaround for a py2 issue	3 years ago
Mike Lang	21856c68aa	Fix all instances of file.write() for py3 In python 3, file.write() may do a partial write and returns the number of characters written. In order to not lose data, we need to wrap every instance of file.write() with our new common.writeall() wrapper that loops until the data is actually written.	3 years ago
Mike Lang	a56f6859bb	more py3 fixes	3 years ago
Mike Lang	f2a8007bf7	Fix build dependency issues	3 years ago
Mike Lang	6a98addac8	py3 fixes for downloader	3 years ago
HubbeKing	6d790a1b36	Do a first naive pass for py3 compatibility Check that open() calls for reading and writing use binary modes Use alpine version with py3-pip package Use python3 in Dockerfile CMD Remove sys.setdefaultencoding() "hack" Simplify ensure_directory() in common.common package	3 years ago
Mike Lang	f0546e2ee3	Pin gevent to 1.5a2 to avoid https://github.com/gevent/gevent/issues/1711	4 years ago
Mike Lang	32138bbd43	downloader: Update to work with twitch's new access token API Twitch removed their old access token endpoint and now use a GraphQL endpoint. The old endpoint would just always return 404, which we sadly interpreted as "stream not up". Thankfully streamlink has already done the reverse engineering work so I was able to update it to work again fairly easily, it's just a bit more convoluted.	4 years ago
Mike Lang	6d55f01de6	downloader: Fix a bug when we can't find a particular stream quality The intended behaviour was to log a warning message and retry next time, but still allow workers to be started for any streams found. However, due to a missing continue, we fall through to attempting to start a worker for a non-existent quality which causes a KeyError when looking up `self.latest_urls[quality]`. This exception means we don't run through the other qualities, so we never start any other quality.	4 years ago
HubbeKing	86f7823348	Replace calls to gevent.signal() with gevent.signal_handler() gevent.signal() was removed in gevent 1.5a4, see http://www.gevent.org/api/gevent.signal.html Removed on Feb 5th, see https://github.com/gevent/gevent/pull/1530	4 years ago
Mike Lang	a53786dc2d	Add file and make as build dependencies gevent now requires these to build. I'm not sure when this changed.	4 years ago
Mike Lang	5ccb5afff1	Track number of ignored ads	4 years ago
Mike Lang	50770651ce	Detect new style of twitch in-stream ads New ad segments always have a title like "Amazon\|...". This is the same fix as used by streamlink at time of writing.	4 years ago
Mike Lang	21d5548980	Add new segment type "suspect" We've noticed that when nodes have connection problems, they get full segments with different hashes. Inspection of these segments shows that they all have identical data up to a point. Segments that fetched normally will then have the remainder of the data. Segments that had issues will have a slightly corrupted end. The data is still valid, and no errors are raised. It just doesn't have all the data. We noticed that these corrupted segments all were cut off exactly 60sec after their requests began. We believe this is a server-side timeout on the request that returns whatever data it has, then closes the container file cleanly before returning successfully. We detect segments that take > 59 seconds to receive, and label them as "suspect". Suspect segments are treated identically to partial segments, except they are always preferred over partials.	5 years ago
Mike Lang	72003f28d0	downloader: Don't check the age of a worker we just spawned Not only is this redundant, but it creates a race condition where the worker fails before the latest_worker = workers[-1] check, and we get an IndexError.	5 years ago
Mike Lang	94d81d708f	Downloader: Change access_token call to match website It stopped working, these changes bring it back in line with the website so it works.	5 years ago
Christopher Usher	abb9193705	fixed outdated "stream", "variant" in metric	5 years ago
Mike Lang	6c6c1ae637	downloader: Make a few things quieter for non-important channels	5 years ago
Mike Lang	b2a07ef114	Merge pull request #140 from ekimekim/mike/build-improvements Refactor dockerfiles for more shared layers	5 years ago
Mike Lang	731ef9e2d0	Refactor dockerfiles for more shared layers By carefully ensuring most of our dockerfiles are identical in their first few layers, we only need to build those layers once instead of every time. In particular, we move installing gevent to before installing common, so that even when common changes gevent doesn't need to be reinstalled. This is important because gevent takes ages to install. Also fixes segment_coverage, which wasn't being installed.	5 years ago
Mike Lang	83750da37b	downloader: Create concept of an "important" channel In our usage, we have one channel where we really care / want to know if it's down, but also a bunch of other channels where they're expected to not be streaming most/all of the time. To prevent these extra channels making a ton of noise, we introduce the concept of an "important" channel, indicated by appending a '!' to the channel name in the command line. So for example, you might specify channels as "foo! foo_backup foo_behindthescenes". Important channels have the same behaviour as previously. Non-important channels: * Have a 20-second retry on a master playlist fetch failure, instead of 5 * Log at debug when the stream is down, instead of info.	5 years ago
Mike Lang	8ad61e9870	downloader: Collect metrics on http calls	5 years ago
Mike Lang	b4655f18c6	downloader: Track total duration of downloaded segments	5 years ago
Mike Lang	04ef0d3823	fix a few remaining usages of StreamWorker.stream instead of .quality	6 years ago
Christopher Usher	361e577474	fixes based on ekimekims suggestions	6 years ago
Christopher Usher	3564643613	refactoring downloader	6 years ago
Mike Lang	73d5941e05	downloader: Track timestamp of latest segment This gives us a "stream delay" metric. Prom doesn't have any native way to check the current value of a metric, in order to take max(). It only offers increment and set. We reach into some internals to do this in a hacky way, but the cleaner way would be to track the value ourselves and have a prom callback that gets the value. Sigh, I hate this prom library. I might write my own that's less dumb.	6 years ago
Mike Lang	f8d10dacdf	Audit and fix all usage of dateutil We wrap direct dateutil calls to handle two distinct cases: * `common.dateutil.parse()`: We want to handle arbitrary timestamps including tz info, then convert them to UTC. This is used in HLS parsing, and for command line input for backfiller * `common.dateutil.parse_utc_only()`: We want to only handle UTC timestamps, but datetime.strptime isn't flexible enough (eg. can't handle missing fractional component). This is used for restreamer request params.	6 years ago
Mike Lang	df66553b38	downloader: Start backdoor later so workers is in locals	6 years ago
Mike Lang	86da9d9fe8	downloader: Support watching multiple channels This is useful eg. for watching db_admin or other testing channels in addition to the main channel.	6 years ago
Mike Lang	f0d9aa82c2	Ignore segments that are marked as ads * Checks for the SCTE35-OUT/SCTE35-IN marks in the HLS stream that indicate an ad start/end * Ignores those segments completely * Doesn't mark the StreamWorker as up until it sees the first non-ad segment Some other operational notes: * The main risk this adds is that re-connecting / refreshing master playlist takes longer. If all downloaders are doing this at the same time (ie. because the stream only just came up, or during a deployment rollout), all downloaders might be waiting for ads to finish and you'll miss segments. * We should run more downloaders to compensate. This also increases the chance at least one of them won't get any ads, so we get everything right from stream-up. * The other mitigation we can do is have geographically diverse downloaders. This decreases the risk that they all get served an ad, and at least at time of writing it seems that no in-stream ads are served outside of these regions: > US, Canada, Germany, France, Sweden, Belgium, Poland, Norway, Finland, Denmark, Netherlands, Italy, Spain, Switzerland, Austria, Portugal, UK, Australia, New Zealand	6 years ago
Mike Lang	c0f94059aa	downloader: Stop retrying in SegmentGetter after a long timeout In resource contention scenarios, all calls can start failing due to not being able to read the response in a timely manner. This means SegmentGetters never stop retrying, leading to further contention and a feedback loop. We attempt to put at least some cap on this scenario by giving up if an amount of time has elapsed to the point that we know our URL couldn't be valid anymore. Since we don't actually know how long segment URLs are valid, we are very conservative about this time, for now setting it to 20min.	6 years ago
Mike Lang	81aee0ee1e	Increase hard timeout for getting segment headers When we're under CPU or disk contention, doing other work can become very slow. We want to avoid spurious errors in this situation as this causes further retries and further contention. One easy way to do this is to increase the time we have to finish fetching headers.	6 years ago
Mike Lang	b75b9a9b00	Add stacksampler to all services	6 years ago
Mike Lang	a5213ccb3b	downloader: Pool connections when we can To preserve independence between workers and ensure that a retry (a worker re-create) actually starts from scratch, we only pool connections on a per-worker basis. Furthermore, for the same reason, we only let SegmentGetters use the worker's pool on their first attempt. After that, they create a new pool to ensure they have a clean retry. Despite this, the result should be that we're almost always re-using an existing connection when getting segments or media playlists, unless something goes wrong. SSL connection setup was measured as almost half the CPU time used by the process, so this change should result in a significant CPU usage reduction.	6 years ago
Mike Lang	901cda4814	Enable backdoor in all services, and add telnet to containers	6 years ago
Mike Lang	9af7795f34	Add gevent.backdoor as an optional arg to all services Backdoor allows the operator to telnet into the given port, and get a python shell running inside the process, from which you can debug, modify state (eg. set the log level), or whatever. This is extremely useful for debugging weird states that you encounter randomly but can't easily reproduce, without restarting the process and needing to wait until it happens again.	6 years ago
Mike Lang	47ff92b155	downloader: Fix bug where mark_working wasn't called This meant that old workers would never shut down, causing us to fetch the same media playlist and same segments multiple times for no reason, and to never give up in face of (non-403/404) errors even once we have something else working.	6 years ago
Mike Lang	3042d00516	downloader: Give up on 404 in addition to 403 Also fix some logging. When we're out of touch with twitch for long enough, our segment URL will get so old that twitch stops returning 403 because our token is expired, and starts returning 404s, presumably because the underlying resource has gone away. We want to treat these the same.	6 years ago
Mike Lang	7f9a1dbe45	downloader: Remove implicit source quality arg This brings it in line with backfiller, is more flexible and less surprising	6 years ago
Mike Lang	0d627715f3	downloader: Track number of downloaded segments This is the most important metric, we can add more later.	6 years ago
Mike Lang	b4b315b6bc	Expose prometheus metrics for backfiller and downloader	6 years ago
Mike Lang	b0ded641c3	Add a logging handler which counts logs for prometheus stats This isn't as good as having a full centralised logging system, but should suffice to know if anything funny is happening.	6 years ago
Mike Lang	17972b87aa	Allow setting of log level via WUBLOADER_LOG_LEVEL env var By using an env var, it is universal and happens prior to arg parsing, at the same point we do other logging setup.	6 years ago
Mike Lang	c0357680cf	downloader: Use caller's logger inside soft_hard_timeout	6 years ago
Mike Lang	a628676e74	downloader: Log to subloggers instead of the root logger This gives us some context when logging, and is best practice.	6 years ago
Mike Lang	6815924097	Fix some bugs and linter errors introduced by backfiller I ran `pyflakes` on the repo and found these bugs: ``` ./common/common.py:289: undefined name 'random' ./downloader/downloader/main.py:7: 'random' imported but unused ./backfiller/backfiller/main.py:150: undefined name 'variant' ./backfiller/backfiller/main.py:158: undefined name 'timedelta' ./backfiller/backfiller/main.py:171: undefined name 'sort' ./backfiller/backfiller/main.py:173: undefined name 'sort' ``` (ok, the "imported but unused" one isn't a bug, but the rest are) This fixes those, as well as a further issue I saw with sorting of hours. Iterables are not sortable. As an obvious example, what if your iterable was infinite? As a result, any attempt to sort an iterable that is not already a friendly type like a list or tuple will result in an error. We avoid this by coercing to list, fully realising the iterable and putting it into a form that python will let us sort. It also avoids the nasty side-effect of mutating the list that gets passed into us, which the caller may not expect. Consider this example: ``` >>> my_hours = ["one", "two", "three"] >>> print my_hours ["one", "two", "three"] >>> backfill_node(base_dir, node, stream, variants, hours=my_hours, order='forward') >>> print my_hours ["one", "three", "two"] ``` Also, one of the linter errors was non-trivial to fix - we were trying to get a list of hours (which is an api call for a particular variant), but at a time when we weren't dealing with a single variant. My solution was to get a list of hours for ALL variants, and take the union.	6 years ago
Christopher Usher	fec0975d18	fixed white space and the like	6 years ago
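
The tombstone behaviour described in commit 1add3c5c22 (a `.tombstone` flag file hides the segment of the same name from listings) could look roughly like the sketch below. This is an illustration only, not the repository's code: the function name `list_visible_segments` and the choice to match on the filename minus its extension are assumptions.

```python
import os


def list_visible_segments(hour_dir):
    """Return the .ts segment names in hour_dir that have no matching tombstone.

    Hypothetical helper: the real listing code in wubloader may differ.
    """
    names = os.listdir(hour_dir)
    # A tombstone shares the segment's name, with a .tombstone extension
    # in place of .ts, so compare names with the extension stripped.
    tombstoned = {
        os.path.splitext(name)[0]
        for name in names
        if name.endswith(".tombstone")
    }
    return sorted(
        name
        for name in names
        if name.endswith(".ts") and os.path.splitext(name)[0] not in tombstoned
    )
```

Because the segment file itself stays on disk, a listing like this hides it from restreaming and backfill without deleting anything, which matches the commit's stated choice to leave deletion as a separate manual step.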


67 Commits (ab985cf1b037b6df05dc6edcbf430c42b163d52d)
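
Commit 21856c68aa describes wrapping `file.write()` in a `common.writeall()` helper that loops until all data is written. A minimal sketch of that idea follows. The signature shown here (taking the bound `write` method and the value) is an assumption, not the repository's actual API, and treating a `None` return as a complete write is likewise a simplification of this sketch.

```python
def writeall(write, value):
    """Call write(value) repeatedly until every byte/character has been written.

    Hypothetical signature; the real common.writeall() may differ.
    """
    while value:
        n = write(value)
        if n is None:
            # Some writers return None rather than a count; this sketch
            # assumes that means the whole value was written.
            break
        # Drop the part that was written and retry with the remainder.
        value = value[n:]
```

Usage would be along the lines of `writeall(f.write, data)` in place of `f.write(data)`, where `f` is any open file object.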