The database changes rarely and is disruptive to re-deploy,
so we track its image version separately; that way it only needs to be upgraded
when there's actually a change.
Note this version is much simpler than the docker-compose setup
and has some major limitations:
* It relies on hostPath and a nodeSelector to put all the components on a shared storage node
* It only supports use as a replication node (downloader, restreamer, backfiller, segment_coverage)
* It uses the k8s Ingress instead of the built-in nginx for HTTP routing.
We've noticed that when nodes have connection problems, they get full segments
with different hashes. Inspection of these segments shows that
they all have identical data up to a point.
Segments that were fetched normally then have the remainder of the data,
while segments that had issues have a slightly corrupted end.
The file is still valid and no errors are raised; it just doesn't contain all the data.
We noticed that these corrupted segments were all cut off exactly 60 seconds after their requests
began. We believe this is a server-side timeout on the request, which returns whatever data
it has, then closes the container file cleanly before returning successfully.
We detect segments that take > 59 seconds to receive, and label them as "suspect".
Suspect segments are treated identically to partial segments, except they are always preferred
over partials.
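A minimal sketch of that check, assuming a requests-based downloader (the function name,
threshold constant and chunked-write loop are illustrative, not the actual implementation):

```python
import time
import requests

SUSPECT_THRESHOLD = 59  # seconds; anything slower likely hit the server's 60s cutoff

def download_segment(url, path):
    """Download a segment and classify it as "full", "suspect" or "partial"."""
    start = time.monotonic()
    try:
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            with open(path, "wb") as f:
                for chunk in resp.iter_content(chunk_size=16 * 1024):
                    f.write(chunk)
    except Exception:
        return "partial"  # download died partway; keep whatever we already wrote
    # The request "succeeded", but if it took longer than the threshold the
    # server probably timed out and closed the file early, so mark it suspect.
    if time.monotonic() - start > SUSPECT_THRESHOLD:
        return "suspect"
    return "full"
```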
We occasionally see corrupted segments that are slightly smaller
but report the same metadata as the full segments. Prefer the largest version,
as it's likely the least corrupt.
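A hedged sketch of how these preferences might combine; the full > suspect > partial ranking
is assumed from the notes above, and the (type, size, path) tuples are purely illustrative:

```python
# full beats suspect, suspect beats partial; within the same type the
# largest copy wins, since it's likely the least corrupt.
TYPE_RANK = {"full": 2, "suspect": 1, "partial": 0}

def best_segment(candidates):
    """Pick the preferred copy from a list of (type, size_in_bytes, path) tuples."""
    return max(candidates, key=lambda seg: (TYPE_RANK[seg[0]], seg[1]))

# Two "full" copies reporting the same metadata but different sizes: the larger wins.
assert best_segment([("full", 1000, "a.ts"), ("full", 1200, "b.ts")])[2] == "b.ts"
```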
Not only is this redundant, but it creates a race condition: if the worker fails
before the latest_worker = workers[-1] check, the workers list is left empty
and we get an IndexError.
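A minimal sketch of a guard for that case (the workers list and the restart callback are
illustrative names, not the real ones):

```python
def latest_or_restart(workers, start_new_worker):
    """Return the most recent worker, starting a fresh one if none are left.

    Checking for the empty list first means a worker that fails and removes
    itself just before this point can't cause an IndexError.
    """
    if not workers:
        return start_new_worker()
    return workers[-1]
```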
Despite our best efforts, this was causing chunked responses to be fully
buffered into memory as a side effect.
This is really bad because responses can be VERY large.
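As an illustration of the behaviour we want instead (the Flask-style streaming endpoint below
is an assumption about the surrounding code, not the actual route):

```python
import flask

app = flask.Flask(__name__)

def produce_chunks():
    # Hypothetical chunk source, standing in for e.g. reading segments off disk.
    for _ in range(1000):
        yield b"x" * 64 * 1024

@app.route("/big")
def big_response():
    # Passing a generator to Response lets the WSGI server send each chunk as
    # it's produced. Materialising it first (e.g. b"".join(produce_chunks()))
    # would buffer the entire body in memory, which is the failure mode above.
    return flask.Response(produce_chunks(), mimetype="application/octet-stream")
```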