Rarely, we find ourselves needing to explicitly delete some data, eg. something that shouldn't
have been public and should be removed from all records.
It would also be nice if we could "clean up" bad versions of the same segment,
which occasionally come up when downloaders have issues.
With our distributed segment database, this is actually rather difficult as deleting the data
from any one server would cause it to be restored from the others. It was only possible
by stopping all backfill, deleting the data on all servers, then starting backfill again.
Here we introduce a more practical approach. An operator creates an empty flag file
with the same name as the segment to be deleted, but with a `.tombstone` extension.
eg. to delete a file `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.ts`,
you would create a tombstone `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.tombstone`.
These tombstone files do two important things:
* They hide the segment from being listed, which both means:
* It can't be restreamed or put into a video
* It can't be backfilled to other nodes
* The tombstone files themselves do get backfilled to other nodes, so you only need to mark them on one server.
Once the tombstone has propagated to all nodes, the segment file can be deleted independently on each one.
We chose not to have a tombstone automatically trigger a segment deletion for safety reasons.
In python 3, file.write() may do a partial write and returns the number of characters written.
In order to not lose data, we need to wrap every instance of file.write() with our new
common.writeall() wrapper that loops until the data is actually written.
Check that open() calls for reading and writing use binary modes
Use alpine version with py3-pip package
Use python3 in Dockerfile CMD
Remove sys.setdefaultencoding() "hack"
Simplify ensure_directory() in common.common package
Then treat backfilling each channel just like backfilling each quality.
This is conceptually simpler (only one kind of thing, a (channel, quality))
and has better behaviour when a node is down (we only have one lot of error handling around it).
It also means we aren't asking the database for the same info once per channel,
and cuts down on logging noise.
We move all connection handling into get_nodes().
This means that problems connecting won't cause further errors
and cause the application to completely crash.
In turn, this means that the behaviour if the database goes down becomes
"continue backfilling from the nodes we know about" instead of crashing.
In testing, GDQ's stream delay went up over 1min, which caused backfillers to backfill
segments at the same time they were downloaded. We increase the window for now,
and also make it configurable.
This will signifigantly increase throughput when downloading
large ranges of segments.
The max concurrency is exposed as a cli arg.
We also slightly modify the logged info, so it reports segments downloaded,
not just number of missing segments (which we might skip downloading for various reasons).
To make this work, we make type a proper segment field.
We also tell get_best_segments to ignore temp segments, since they might go away
before we can actually use them.
We wrap direct dateutil calls to handle two distinct cases:
* `common.dateutil.parse()`: We want to handle arbitrary timestamps including tz info,
then convert them to UTC.
This is used in HLS parsing, and for command line input for backfiller
* `common.dateutil.parse_utc_only()`: We want to only handle UTC timestamps,
but datetime.strptime isn't flexible enough (eg. can't handle missing fractional component).
This is used for restreamer request params.