Commit Graph

336 Commits (ccb7f3c684a4717de38a73baf7bbd838491d943c)
 

Author SHA1 Message Date
Christopher Usher 7d9a5b4626 added workers and a worker manager 6 years ago
Christopher Usher be8d40d1ba Move the code for calculating hours outside the code that backfills 6 years ago
Chris Usher ed58b6e44d reintroduced a start time for the backfiller; more logging 6 years ago
Mike Lang 292188ad7c database: Remove retry_on_conflict helper and default to autocommit
All our usage was of a single query anyway, so autocommit is easier to handle.
You can still opt into a longer transaction using the transaction() helper.
6 years ago
Mike Lang 73640ed4ab database: Add column video_id for storing upload-location-specific metadata for identifying video
ie. for youtube, the video id.
6 years ago
Mike Lang dc2eb6ed74 Add some common database code
This code manages the database connections, setting their isolation level correctly
and ensuring the idempotent schema is applied before they're used.

Applying the schema on startup means we don't need to deal with the database's state,
setting it up before running, running migrations etc. However, it does put constraints on
the changes we can safely make.

Our use of serializable isolation means that all transactions can be treated as fully
independent - the server must behave as though they'd been run separately in some valid order.
This will give us the least surprising results when multiple connections try to modify the same
data, though we'll need to deal with occasional transaction commit failures due to conflicts.
6 years ago
Mike Lang cea66a4bbf database: Rename start/end to event_start/end, add channel and quality
* Had to rename `end` as `end` is a reserved word in postgres SQL.
`event_end` is more consistent with `video_end` anyway. Updated `start` to match.

* Added ability to specify channel and stream quality in the editor, which may prove useful
if we have issues with a particular stream quality, or if content needs to be captured from
other channels.
6 years ago
MasterGunner 7423f8c4ef Update DATABASE.md
Changed upload_location to be edit input.
6 years ago
MasterGunner d89458c27d Update DATABASE.md
Changed allow_holes and uploader_whitelist to be edit inputs - there's no need for them to come from the sheet; and we'll have an admin dashboard for modifying them if needed.
6 years ago
Mike Lang 437d38e646 DATABASE.md: Add image_links column
This solves the problem of rows which don't need a full cut video,
but we'd like to link to an image or a short gif or clip of it.
It is a sheet input that is only used in the output sheet, so it doesn't affect the wubloader itself.
6 years ago
Mike Lang df66553b38 downloader: Start backdoor later so workers is in locals 6 years ago
Mike Lang 86da9d9fe8 downloader: Support watching multiple channels
This is useful eg. for watching db_admin or other testing channels in addition to the main channel.
6 years ago
Mike Lang f0d9aa82c2 Ignore segments that are marked as ads
* Checks for the SCTE35-OUT/SCTE35-IN marks in the HLS stream that indicate an ad start/end
* Ignores those segments completely
* Doesn't mark the StreamWorker as up until it sees the first non-ad segment

Some other operational notes:
* The main risk this adds is that re-connecting / refreshing master playlist takes longer.
  If all downloaders are doing this at the same time (ie. because the stream only just came up,
  or during a deployment rollout), all downloaders might be waiting for ads to finish and
  you'll miss segments.
* We should run more downloaders to compensate. This also increases the chance at least one of
  them won't get any ads, so we get everything right from stream-up.
* The other mitigation we can do is have geographically diverse downloaders. This decreases the risk
  that they all get served an ad, and at least at time of writing it seems that no in-stream ads
  are served outside of these regions:

> US, Canada, Germany, France, Sweden, Belgium, Poland, Norway, Finland, Denmark, Netherlands, Italy, Spain, Switzerland, Austria, Portugal, UK, Australia, New Zealand
6 years ago
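
The ad-skipping behaviour in the commit above could be sketched roughly as follows. This is a minimal illustration, not the downloader's actual parser: the function name, the sample playlist, and the exact tag spelling (`#EXT-X-SCTE35-OUT` / `#EXT-X-SCTE35-IN`) are assumptions; the commit only says that SCTE35-OUT and SCTE35-IN marks delimit an ad.

```python
def non_ad_segments(playlist_lines):
    """Return segment URIs that fall outside SCTE35-OUT ... SCTE35-IN ranges."""
    in_ad = False
    uris = []
    for line in playlist_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Assumed marker spelling: any tag line mentioning SCTE35-OUT / SCTE35-IN.
            if "SCTE35-OUT" in line:
                in_ad = True
            elif "SCTE35-IN" in line:
                in_ad = False
            continue
        # Non-tag lines are segment URIs; keep them only outside an ad range.
        if not in_ad:
            uris.append(line)
    return uris


# Hypothetical playlist used only to exercise the sketch.
playlist = [
    "#EXTM3U",
    "#EXTINF:2.0,",
    "a.ts",
    "#EXT-X-SCTE35-OUT",
    "#EXTINF:2.0,",
    "ad1.ts",
    "#EXT-X-SCTE35-IN",
    "#EXTINF:2.0,",
    "b.ts",
]
kept = non_ad_segments(playlist)
```

Here `kept` contains only `a.ts` and `b.ts`; the segment between the two marks is dropped.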
Mike Lang ef0a78fce3 Updates to database states and columns
Split UPLOADED into TRANSCODING and DONE, to represent the time after upload
that youtube is transcoding the video and it's not viewable.

Any cutter can poll for the state of a transcoding video and mark it as done.

Add some extra sheet input columns.
6 years ago
Mike Lang 6b3a0fea9f Add a doc covering how the database is used
I fully expect the exact list of sheet inputs, edit inputs and outputs to change.
The important thing I wanted to codify here was the state machine and the behaviour of the cutters.
6 years ago
Mike Lang c0f94059aa downloader: Stop retrying in SegmentGetter after a long timeout
In resource contention scenarios, all calls can start failing due to
not being able to read the response in a timely manner.
This means SegmentGetters never stop retrying, leading to further contention
and a feedback loop.
We attempt to put at least some cap on this scenario by giving up
if an amount of time has elapsed to the point that we know our URL couldn't be valid anymore.
Since we don't actually know how long segment URLs are valid, we are very conservative about
this time, for now setting it to 20min.
6 years ago
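
One way to picture the cap described in the commit above is a retry loop with an overall deadline. This is only a sketch under stated assumptions: `fetch_with_deadline`, its parameters, and the policy of treating every exception as retryable are hypothetical; only the 20 minute figure comes from the commit message.

```python
import time

# 20 minutes, matching the conservative cap mentioned in the commit message.
GIVE_UP_AFTER = 20 * 60


def fetch_with_deadline(fetch, give_up_after=GIVE_UP_AFTER, retry_delay=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it succeeds or give_up_after seconds have elapsed.

    Returns fetch()'s result, or None if the deadline passed first.
    """
    start = clock()
    while True:
        try:
            return fetch()
        except Exception:
            # Hypothetical policy: every failure is retryable until the deadline.
            if clock() - start >= give_up_after:
                return None
            sleep(retry_delay)
```

The `clock` and `sleep` parameters exist so the loop can be exercised without real waiting; a monotonic clock is used so wall-clock adjustments cannot shorten or extend the deadline.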
Mike Lang 81aee0ee1e Increase hard timeout for getting segment headers
When we're under CPU or disk contention, doing other work
can become very slow. We want to avoid spurious errors in this situation
as this causes further retries and further contention.

One easy way to do this is to increase the time we have to finish fetching headers.
6 years ago
Mike Lang 787b9002ab restreamer: Use correct name for dateutil 6 years ago
Mike Lang 3a1e4b0aef restreamer: Fix missing dependency
This was hidden because common included it
6 years ago
Mike Lang 997c1242b2 get_best_segments: Let other things run
get_best_segments can sometimes take a very long time, and
we don't want to stop other work from happening while it's ongoing.
So we ask gevent to run other things until there's no other work to do,
then we do one hour, then check back with gevent again.

In combination with the performance improvements, this should mean we don't block
other things from running for more than a few hundred ms at most.
6 years ago
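
The "do one hour, then check back" pattern from the commit above might look something like this. The names `process_hours`, `handle_hour` and `yield_to_others` are hypothetical. In a gevent program the callback could plausibly be `gevent.idle`, which waits until the event loop has nothing else to run, but that mapping is an assumption rather than something the commit message states.

```python
def process_hours(hours, handle_hour, yield_to_others=lambda: None):
    """Handle each hour in turn, giving other work a chance to run before each one."""
    results = []
    for hour in hours:
        # Let any pending work run first; a no-op by default.
        yield_to_others()
        results.append(handle_hour(hour))
    return results
```

Because control is handed back once per hour, the longest uninterrupted stretch is bounded by the cost of handling a single hour.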
Mike Lang bf08aa29b8 parse_segment_path: Use datetime.strptime instead of dateutil.parser
strptime is much faster but can't handle as varied formats.
But in this case we fully control the format, so there's no reason not to use it.

Profiling suggests we spend about 80% of our time in get_best_segments just parsing dates,
so this is a significant performance gain.
6 years ago
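
As a rough illustration of the trade-off in the commit above, a fixed format string lets `datetime.strptime` parse a timestamp directly, at the cost of rejecting anything that deviates from it. The format string and the sample value are assumptions for illustration, not the project's actual segment path format.

```python
from datetime import datetime

# Hypothetical fixed timestamp format; the real segment path layout is not shown here.
SEGMENT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_segment_time(value):
    """Parse a timestamp written in SEGMENT_TIME_FORMAT.

    Raises ValueError if the string does not match the format exactly.
    """
    return datetime.strptime(value, SEGMENT_TIME_FORMAT)


parsed = parse_segment_time("2019-01-02T03:04:05.250000")
```

A string in any other layout raises `ValueError` instead of being guessed at, which is acceptable when the writer and the reader of the format are the same code.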
Mike Lang bcdb268ce8 Also need to replace locks on the counter float values to prevent deadlocks
See comment for full details
6 years ago
Mike Lang 10cca18922 Fix a deadlock due to signal interactions with prometheus client
The prometheus client uses a threading.Lock() to prevent shared access to
certain metric state. This lock is taken as part of doing collection, as well
as during metric.labels().

We hit a deadlock where our stack sampler signal arrived during a collection,
when the lock was held. This meant that flamegraph.labels() blocked forever,
and the lock was never released, hanging all metrics collection.

Our solution is a hack, which is to reach into the internals of our metric object
and replace its lock with a dummy one. This is reasonably safe, but only as long as
the prometheus_client internal structure doesn't change significantly.
6 years ago
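
The lock replacement described in the commit above can be sketched in isolation. `FakeMetric` is a hypothetical stand-in and deliberately does not mirror the real prometheus_client classes or attribute names; the point is only to show a lock-shaped object that never blocks being swapped in for a real `threading.Lock`.

```python
import threading


class DummyLock:
    """A stand-in for threading.Lock that never blocks."""

    def acquire(self, blocking=True, timeout=-1):
        return True

    def release(self):
        pass

    def __enter__(self):
        return True

    def __exit__(self, exc_type, exc, tb):
        return False


class FakeMetric:
    """Hypothetical metric object; not the real prometheus_client class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def inc(self):
        with self._lock:
            self._value += 1


metric = FakeMetric()
# Swap the real lock for the dummy, so a signal handler calling inc()
# while the lock is "held" cannot block forever.
metric._lock = DummyLock()

with metric._lock:   # simulate code that currently holds the lock
    metric.inc()     # would deadlock with a real, non-reentrant Lock
```

The sketch trades mutual exclusion for never blocking, so it only makes sense where unsynchronised access to the protected state is tolerable.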
Mike Lang c9cc8a73a7 generate-flamegraph: Script to create a flamegraph by querying prometheus 6 years ago
Mike Lang b75b9a9b00 Add stacksampler to all services 6 years ago
Mike Lang b9c2921242 common.stats: Add a stacksampler that records sampled stacks to prometheus
This can then be used to generate flamegraphs
6 years ago
Mike Lang a5213ccb3b downloader: Pool connections when we can
To preserve independence between workers and ensure that a
retry (a worker re-create) actually starts from scratch, we only pool connections
on a per-worker basis.
Furthermore, for the same reason, we only let SegmentGetters use the worker's
pool on their first attempt. After that, they create a new pool to ensure they have a clean retry.

Despite this, the result should be that we're almost always re-using an existing connection
when getting segments or media playlists, unless something goes wrong.

SSL connection setup was measured as almost half the CPU time used by the process,
so this change should result in a significant CPU usage reduction.
6 years ago
Mike Lang 5175b099af common: Split segment-related stuff into its own module
We still import them into __init__.py so they're accessible externally just the same
6 years ago
Mike Lang 6f84a23ba6 common: Split stats-related stuff into its own module
We still import them into __init__.py so they're accessible externally just the same
6 years ago
Mike Lang 8fe2fec958 common: convert from module to package 6 years ago
MasterGunner 96c1566d21
Merge pull request #34 from ekimekim/gunner/restreamer/additional-routes
Added additional routes for listing available streams and variants.
6 years ago
MasterGunner a9569d9e96 Removed unneeded '@has_path_args'. 6 years ago
MasterGunner 306ac53d08 Added additional routes for listing available streams and variants. 6 years ago
Mike Lang 901cda4814 Enable backdoor in all services, and add telnet to containers 6 years ago
Mike Lang 9af7795f34 Add gevent.backdoor as an optional arg to all services
Backdoor allows the operator to telnet into the given port, and get a python shell
running inside the process, from which you can debug, modify state (eg. set the log level),
or whatever. This is extremely useful for debugging weird states that you encounter randomly
but can't easily reproduce, without restarting the process and needing to wait until it happens again.
6 years ago
Mike Lang 47ff92b155 downloader: Fix bug where mark_working wasn't called
This meant that old workers would never shut down, causing us to fetch the same media playlist
and same segments multiple times for no reason, and to never give up in the face of (non-403/404) errors
even once we have something else working.
6 years ago
Mike Lang 3042d00516 downloader: Give up on 404 in addition to 403
Also fix some logging.

When we're out of touch with twitch for long enough, our segment URL will get
so old that twitch stops returning 403s because our token is expired,
and starts returning 404s, presumably because the underlying resource has gone away.

We want to treat these the same.
6 years ago
Mike Lang 7f9a1dbe45 downloader: Remove implicit source quality arg
This brings it in line with backfiller, is more flexible and less surprising
6 years ago
Mike Lang 89d6b3a6be docker-compose: Add list of peers to backfill from 6 years ago
Mike Lang 0d627715f3 downloader: Track number of downloaded segments
This is the most important metric, we can add more later.
6 years ago
Mike Lang 90ccc6d827 backfiller: Track number of successful backfills
Other stats can come later, but this one is important as it tells us if
a downloader hasn't been doing its job.
6 years ago
Mike Lang c59892e148 backfiller: Add ability to set nodes as CLI arg 6 years ago
Mike Lang bdcb217d20 docker-compose: Expose metrics ports for other services 6 years ago
Mike Lang b4b315b6bc Expose prometheus metrics for backfiller and downloader 6 years ago
Mike Lang d90f01b8ce common: Create general function for timing things, and use it to time get_best_segments
The function is quite customizable and therefore quite complex, but it allows us to
easily annotate a function to be timed with labels based on input and output,
as well as normalize results based on amount of work done to get a better
picture of the actual amount of time taken per unit of work.
This will help us monitor for performance issues.
6 years ago
Mike Lang b0ded641c3 Add a logging handler which counts logs for prometheus stats
This isn't as good as having a full centralised logging system, but should
suffice to know if anything funny is happening.
6 years ago
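
A log-counting handler of the kind described in the commit above can be sketched with the standard library alone. This version keeps counts in a `collections.Counter` rather than a prometheus metric, which is a deliberate substitution to keep the example self-contained; the class name, logger name and choice of count keys are assumptions.

```python
import logging
from collections import Counter


class CountingHandler(logging.Handler):
    """Count emitted log records by (logger name, level name)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        self.counts[(record.name, record.levelname)] += 1


logger = logging.getLogger("example.counting")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's records out of the root logger
handler = CountingHandler()
logger.addHandler(handler)

logger.info("hello")
logger.warning("careful")
logger.warning("again")
logger.debug("dropped by the logger's level")
```

Only records that pass the logger's level reach the handler, so the debug message above is not counted.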
Mike Lang c9d02b3318 restreamer: Prevent prom client blowing up after two different endpoints are hit
Prom client doesn't like you creating two stats with the same name,
even though they have different labels and this makes perfect sense.

I feel like I just need to re-write the prom client at some point - it doesn't actually
do all that much except get in your way, apart from the actual text encoding which I
can steal.

Anyway, in the meantime, we get around this by breaking up metrics into two names,
a "foo_all" and a "foo_ENDPOINT". The foo_all lacks the detailed labels,
but is still labelled by endpoint and can be used more easily.
The foo_ENDPOINT labels have more information but require messier PromQL as you need to
match on a name regex if you want to look at more than one specific endpoint.
6 years ago
Mike Lang 30c4bbec1d restreamer: return the actual response from after_request even if untracked
otherwise any untracked endpoints don't work
6 years ago
Christopher Usher 96e6904c85 Added monotonic to restreamer setup.py 6 years ago
Christopher Usher 225288980a Added the backfiller to docker-compose 6 years ago