Rarely, we find ourselves needing to explicitly delete some data, eg. something that shouldn't
have been public and should be removed from all records.
It would also be nice if we could "clean up" bad versions of the same segment,
which occasionally come up when downloaders have issues.
With our distributed segment database, this is actually rather difficult as deleting the data
from any one server would cause it to be restored from the others. It was only possible
by stopping all backfill, deleting the data on all servers, then starting backfill again.
Here we introduce a more practical approach. An operator creates an empty flag file
with the same name as the segment to be deleted, but with a `.tombstone` extension.
eg. to delete a file `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.ts`,
you would create a tombstone `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.tombstone`.
These tombstone files do two important things:
* They hide the segment from being listed, which both means:
* It can't be restreamed or put into a video
* It can't be backfilled to other nodes
* The tombstone files themselves do get backfilled to other nodes, so you only need to mark them on one server.
Once the tombstone has propagated to all nodes, the segment file can be deleted independently on each one.
We chose not to have a tombstone automatically trigger a segment deletion for safety reasons.
The restreamer spends most of its time iterating through segments (parsing them, determining the best one for each start time)
to serve large time ranges. Since this only depends on the list of filenames read from disk,
we can cache it for a given hour as long as that list is identical.
This is a little trickier than it sounds because best_segments_by_start is an iterator
and in most cases it won't be fully consumed. So we introduce a `CachedIterator` abstraction
that will both remember the previously yielded values, and keep track of the live iterator
so it can be resumed again if a previous invocation only partially consumed it.
This also has the nice side effect of merging simultaneous operations - if two requests come in
for the same hour at the same time, they'll share one iterator and both consume the results
as they come in.
Previously both restreamer and thrimshim had some complex logic for dealing with
graceful shutdown, in different ways, that was still prone to race conditions.
We replace this with a common method that does it properly.
Fixes#226
This is the simplest case as we can just cut each range like we already do,
then concat the results.
We still allow for the full design in the database and cutter, but error out if transitions
is ever anything but hard cuts or if it's a full cut.
We also update the restreamer to allow accepting ranges, however for usability we still allow
the old "just one start and end" args.
Note this changes the thrimshim API to give and take the new "video_ranges" and "video_transitions" columns.
In python 3, file.write() may do a partial write and returns the number of characters written.
In order to not lose data, we need to wrap every instance of file.write() with our new
common.writeall() wrapper that loops until the data is actually written.
Check that open() calls for reading and writing use binary modes
Use alpine version with py3-pip package
Use python3 in Dockerfile CMD
Remove sys.setdefaultencoding() "hack"
Simplify ensure_directory() in common.common package
float() is inaccurate and Decimal() is very slow (~3x the cpu usage)
so instead we right-pad with 0s (eg. so 1.2345 -> 1.234500) then convert to int microsec directly.
Floating point error leads to 1us differences in parsed times,
which causes false positives in the overlapping segments check.
By using a Decimal, we get the exact digits from the filepath.
strptime is very slow. In terms of pure get_best_segments() speed, this change
more than doubles the throughput.
In particular for segment_coverage, this halves the run time for each check.
It causes problems due to the sheer number of unique metrics emitted, which makes
the prometheus endpoint be very expensive / fail a lot.
The data is not useful enough to justify the cost.
We've noticed that when nodes have connection problems, they get full segments
with different hashes. Inspection of these segments shows that
they all have identical data up to a point.
Segments that fetched normally will then have the remainder of the data.
Segments that had issues will have a slightly corrupted end.
The data is still valid, and no errors are raised. It just doesn't have all the data.
We noticed that these corrupted segments all were cut off exactly 60sec after their requests
began. We believe this is a server-side timeout on the request that returns whatever data
it has, then closes the container file cleanly before returning successfully.
We detect segments that take > 59 seconds to recieve, and label them as "suspect".
Suspect segments are treated identically to partial segments, except they are always preferred
over partials.
We occasionally see corrupted segments that are slightly shorter in size
but report the same metadata as the full segments. Prefer the largest version
as it's likely the least corrupt.
Despite our best efforts, this was causing chunked responses to be fully
buffered into memory as a side effect.
This is really bad because responses can be VERY large.
Since these tend to happen around stream endings, etc,
we don't want them to be crazy noisy and cause us to disregard real problems.
We can use the segment coverage to see in metrics if there are overlaps.
This gives an easy way to do so across all services without adding new options.
Reasons to do so might be to avoid overheads or because your prometheus metrics grow too large.
To be clear, this is an awful hack.
It means that any implicit str/unicode coersion will use the utf-8 encoding,
which is basically always what you want.
However, it is possible that some badly-written libraries might be relying
on the default encoding being ascii, and will do weird things as a result.
Finally, it's especially hacky to be doing this as part of importing a library.
Normally you're meant to do this as part of a sitecustomize.py in your python system directory,
and the function is deleted before passing control to normal code (this is why we need
to reload() to get it back).
The caller can pick depending on the needs of the output format.
This reverses most of 80d829b83b,
re-introducing streaming full cuts but keeping non-streaming as an option.
Most formats like mp4 require ffmpeg to make changes at the start of the file
throughout writing.
Unfortunately, this prevents us from streaming the upload as we cut it.
Instead, we spool to a temporary file until ffmpeg exits,
then upload that all at once.
In a fast cut, we edit the first and last segments then concatenate them all.
However, this leads to some tiny but perciptible artifacting around the border
of the first and second (and second-last and last) segments.
A full cut is much slower, but re-encodes the video into the desired format
and is more reliable.
We want both options to be available.
With this commit, we only add the option, we don't use it in restreamer or cutter.
To make this work, we make type a proper segment field.
We also tell get_best_segments to ignore temp segments, since they might go away
before we can actually use them.
This allows manual uploads to work without needing to fill all the edit fields
with junk.
We also set a constraint on uploader asserting that any videos from claimed onwards have a known uploader.
Again, an exception is made for DONE to allow manual uploads.
These can happen if a downloader or backfiller dies suddenly.
We treat it similarly to partial but lacking any hash.
At some point in the future we should probably have something
to find any temp segments, hash them and rename them to partials.
We wrap direct dateutil calls to handle two distinct cases:
* `common.dateutil.parse()`: We want to handle arbitrary timestamps including tz info,
then convert them to UTC.
This is used in HLS parsing, and for command line input for backfiller
* `common.dateutil.parse_utc_only()`: We want to only handle UTC timestamps,
but datetime.strptime isn't flexible enough (eg. can't handle missing fractional component).
This is used for restreamer request params.
Note this moves over the 'experimental' cutter and deletes the original cutter
that concatenates entire videos before cutting.
We may eventually want to revive that method if the experimental cutter turns out
to introduce too many issues.
We move most of the code over verbatim, but adjust it such that it acts
as a generic iterator that can be used in a variety of contexts.
Some other changes made during the move include telling ffmpeg to be quieter
(don't output version info and junk, only log if something goes wrong),
and avoiding errors during cleanup.
This is a performance optimization, allowing us to fail out early (potentially avoiding a LOT
of work) if we know we're going to reject any result that contains holes.
We add a new exception ContainsHoles that is raised in this condition.
This should help prevent changing state to EDITED with any of these fields unset,
which would blow up the cutter.
We also fix up upload_location, which was set up as a sheet input (NOT NULL DEFAULT ''),
and add a similar constraint saying any DONE columns must have non-NULL video link.
All our usage was of a single query anyway, so autocommit is easier to handle.
You can still opt into a longer transaction using the transaction() helper.
This code manages the database connections, setting their isolation level correctly
and ensuring the idempotent schema is applied before they're used.
Applying the schema on startup means we don't need to deal with the database's state,
setting it up before running, running migrations etc. However, it does put constraints on
the changes we can safely make.
Our use of seralizable isolation means that all transactions can be treated as fully
independent - the server must behave as though they'd been run seperately in some valid order.
This will give us the least surprising results when multiple connections try to modify the same
data, though we'll need to deal with occasional transaction commit failures due to conficts.
get_best_segments can sometimes take a very long time,
we don't want to stop other work from happening while it's ongoing.
So we ask gevent to run other things until there's no other work to do,
then we do one hour, then check back with gevent again.
In combination with the performance improvements, this should mean we don't block
other things from running for more than a few hundred ms at most.
strptime is much faster but can't handle as varied formats.
But in this case we fully control the format, so there's no reason not to use it.
Profiling suggests we spend about 80% of our time in get_best_segments just parsing dates,
so this is a signifigant performance gain.
The prometheus client uses a threading.Lock() to prevent shared access to
certain metric state. This lock is taken as part of doing collection, as well
as during metric.labels().
We hit a deadlock where our stack sampler signal arrived during a collection,
when the lock was held. This meant that flamegraph.labels() blocked forever,
and the lock was never released, hanging all metrics collection.
Our solution is a hack, which is to reach into the internals of our metric object
and replace its lock with a dummy one. This is reasonably safe, but only as long as
the prometheus_client internal structure doesn't change signfigiantly.
The function is quite customizable and therefore quite complex, but it allows us to
easily annotate a function to be timed with labels based on input and output,
as well as normalize results based on amount of work done to get a better
picture of the actual amount of time taken per unit of work.
This will help us monitor for performance issues.
I ran `pyflakes` on the repo and found these bugs:
```
./common/common.py:289: undefined name 'random'
./downloader/downloader/main.py:7: 'random' imported but unused
./backfiller/backfiller/main.py:150: undefined name 'variant'
./backfiller/backfiller/main.py:158: undefined name 'timedelta'
./backfiller/backfiller/main.py:171: undefined name 'sort'
./backfiller/backfiller/main.py:173: undefined name 'sort'
```
(ok, the "imported but unused" one isn't a bug, but the rest are)
This fixes those, as well as a further issue I saw with sorting of hours.
Iterables are not sortable. As an obvious example, what if your iterable was infinite?
As a result, any attempt to sort an iterable that is not already a friendly type like a list
or tuple will result in an error. We avoid this by coercing to list, fully realising the iterable
and putting it into a form that python will let us sort. It also avoids the nasty side-effect
of mutating the list that gets passed into us, which the caller may not expect. Consider this example:
```
>>> my_hours = ["one", "two", "three"]
>>> print my_hours
["one", "two", "three"]
>>> backfill_node(base_dir, node, stream, variants, hours=my_hours, order='forward')
>>> print my_hours
["one", "three", "two"]
```
Also, one of the linter errors was non-trivial to fix - we were trying to get a list of hours
(which is an api call for a particular variant), but at a time when we weren't dealing with a single
variant. My solution was to get a list of hours for ALL variants, and take the union.