This is a more featureful wrapper around ffmpeg with notable differences:
- It's used as a context manager, and so can manage its own cleanup
- It takes care of input feeding
- It can handle multiple inputs (via pipes), instead of one (via stdin)
This drastically reduces the setup and cleanup code required for most basic usage,
and the multi-input support will be used in followup changes.
Of 4 users of this function, all but one set them to None.
We're about to replace that one usage with something else, so it makes more sense
to not have them as options at all and just have the user add to the encode args manually.
- Move the sheets API into the common dir, since it now has multiple users
- Download the schedule live from Google Sheets using Config
- Fall back on the old schedule if the new one can't be downloaded for some reason
Segments in the wild (particularly on YouTube) are sometimes not timed perfectly, so allow up to 10ms of gap or overlap
to be counted as "equal" for purposes of finding the best segment.
This should hopefully result in frames on the edge of timestamps being extracted
from a combination of the neighboring segment and the naive one,
so that we don't get errors extracting a frame.
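The tolerance check can be sketched as follows. This is an illustration only: the constant and function names are assumptions, and only the 10ms figure comes from the description above.

```python
from datetime import datetime, timedelta

# Hypothetical name; the 10ms value is the only part taken from the commit message.
ALLOWABLE_DELTA = timedelta(milliseconds=10)


def times_match(a, b, tolerance=ALLOWABLE_DELTA):
    """Return True if two timestamps differ by no more than the tolerance.

    A gap (b later than a) and an overlap (b earlier than a) are treated alike.
    """
    return abs(a - b) <= tolerance
```

With this, a segment that starts 5ms before or after the previous segment's end is treated as starting exactly at that end.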
Rarely, we find ourselves needing to explicitly delete some data, eg. something that shouldn't
have been public and should be removed from all records.
It would also be nice if we could "clean up" bad versions of the same segment,
which occasionally come up when downloaders have issues.
With our distributed segment database, this is actually rather difficult as deleting the data
from any one server would cause it to be restored from the others. It was only possible
by stopping all backfill, deleting the data on all servers, then starting backfill again.
Here we introduce a more practical approach. An operator creates an empty flag file
with the same name as the segment to be deleted, but with a `.tombstone` extension.
eg. to delete a file `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.ts`,
you would create a tombstone `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.tombstone`.
These tombstone files do two important things:
* They hide the segment from being listed, which means both that:
  * It can't be restreamed or put into a video
  * It can't be backfilled to other nodes
* The tombstone files themselves do get backfilled to other nodes, so you only need to create them on one server.
Once the tombstone has propagated to all nodes, the segment file can be deleted independently on each one.
We chose not to have a tombstone automatically trigger a segment deletion for safety reasons.
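As a rough sketch of the naming rule and the hiding behaviour described above (the helper names `tombstone_path` and `visible_segments` are hypothetical, and the real listing code may be structured differently):

```python
import os


def tombstone_path(segment_path):
    """Return the tombstone path for a segment: same name, `.tombstone` extension."""
    root, _ext = os.path.splitext(segment_path)
    return root + ".tombstone"


def visible_segments(filenames):
    """Filter a directory listing down to segments that have no tombstone.

    Assumption: a tombstone hides every file sharing its name minus the extension.
    Tombstone files themselves are left out of the returned segment list.
    """
    names = set(filenames)
    tombstoned = {
        os.path.splitext(name)[0] for name in names if name.endswith(".tombstone")
    }
    return sorted(
        name
        for name in names
        if not name.endswith(".tombstone")
        and os.path.splitext(name)[0] not in tombstoned
    )
```

An operator could then create the flag file with something like `open(tombstone_path(path), "x").close()`, which fails if the tombstone already exists.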
The restreamer spends most of its time iterating through segments (parsing them, determining the best one for each start time)
to serve large time ranges. Since this only depends on the list of filenames read from disk,
we can cache it for a given hour as long as that list is identical.
This is a little trickier than it sounds because best_segments_by_start is an iterator
and in most cases it won't be fully consumed. So we introduce a `CachedIterator` abstraction
that will both remember the previously yielded values, and keep track of the live iterator
so it can be resumed again if a previous invocation only partially consumed it.
This also has the nice side effect of merging simultaneous operations - if two requests come in
for the same hour at the same time, they'll share one iterator and both consume the results
as they come in.
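A minimal sketch of the idea follows. It is not the actual implementation: it assumes a `threading.Lock` for synchronisation, and only the class name `CachedIterator` is taken from the description above.

```python
import threading


class CachedIterator:
    """Wrap an iterator so that it can be iterated many times, possibly concurrently.

    Values already produced are replayed from a list. The underlying iterator is
    only advanced when a consumer runs past the end of that list.
    """

    def __init__(self, iterator):
        self._iterator = iter(iterator)
        self._items = []  # values yielded so far, in order
        self._done = False  # True once the underlying iterator is exhausted
        self._lock = threading.Lock()

    def __iter__(self):
        index = 0
        while True:
            with self._lock:
                if index >= len(self._items):
                    if self._done:
                        return
                    try:
                        # Resume the live iterator where the last consumer left off.
                        self._items.append(next(self._iterator))
                    except StopIteration:
                        self._done = True
                        self._iterator = None  # drop the reference once exhausted
                        return
                item = self._items[index]
            index += 1
            # Yield outside the lock so a slow consumer does not block others.
            yield item
```

Because each call to `__iter__` keeps its own position but shares the cached list, a request that stops early leaves the remaining work for whichever request needs it next.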
Previously both restreamer and thrimshim had some complex logic for dealing with
graceful shutdown, in different ways, that was still prone to race conditions.
We replace this with a common method that does it properly.
Fixes #226
This is the simplest case as we can just cut each range like we already do,
then concat the results.
We still allow for the full design in the database and cutter, but error out if transitions
is ever anything but hard cuts or if it's a full cut.
We also update the restreamer to accept ranges, but for usability we still allow
the old "just one start and end" args.
Note this changes the thrimshim API to give and take the new "video_ranges" and "video_transitions" columns.
In Python 3, file.write() may do a partial write, and it returns the number of characters actually written.
In order to not lose data, we need to wrap every instance of file.write() with our new
common.writeall() wrapper that loops until the data is actually written.
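The loop could look roughly like this. It is a sketch under assumptions: the signature (taking the bound `write` function rather than the file) and the handling of `None` and `0` return values are illustrative choices, not confirmed details of common.writeall().

```python
def writeall(write, value):
    """Call write(value) repeatedly until all of value has been written.

    `write` is any callable with file.write() semantics: it accepts data and
    returns the number of characters (or bytes) it actually wrote.
    """
    while value:
        n = write(value)
        if n is None:
            # Assumption: a writer that returns None is a blocking one that does
            # not report a count, so we treat the whole value as written.
            break
        if n == 0:
            # Without this check a writer that never makes progress would spin forever.
            raise IOError("write() made no progress with {} items left".format(len(value)))
        value = value[n:]
```

A call site would then change from `f.write(data)` to `writeall(f.write, data)`.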
Check that open() calls for reading and writing use binary modes
Use alpine version with py3-pip package
Use python3 in Dockerfile CMD
Remove sys.setdefaultencoding() "hack"
Simplify ensure_directory() in common.common package
float() is inaccurate and Decimal() is very slow (~3x the cpu usage)
so instead we right-pad with 0s (eg. so 1.2345 -> 1.234500) then convert to int microsec directly.
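The padding trick can be illustrated with a small sketch. The function name is hypothetical, and it assumes a non-negative decimal string with at most six fractional digits.

```python
def seconds_to_micros(text):
    """Convert a decimal seconds string such as "1.2345" to integer microseconds.

    Avoids float() (inexact) and Decimal() (slow) by right-padding the
    fractional digits to exactly six places and parsing them as an integer.
    """
    whole, _, frac = text.partition(".")
    if len(frac) > 6:
        raise ValueError("more than microsecond precision: {!r}".format(text))
    frac = frac.ljust(6, "0")  # eg. "2345" -> "234500"
    return int(whole or "0") * 1000000 + int(frac)
```

For example, `seconds_to_micros("1.2345")` pads the fraction to `234500` and returns `1234500`.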
Floating point error leads to 1us differences in parsed times,
which causes false positives in the overlapping segments check.
By using a Decimal, we get the exact digits from the filepath.
strptime is very slow. In terms of pure get_best_segments() speed, this change
more than doubles the throughput.
In particular for segment_coverage, this halves the run time for each check.
It causes problems due to the sheer number of unique metrics emitted, which makes
the Prometheus endpoint very expensive to serve and prone to failing.
The data is not useful enough to justify the cost.