We've noticed that when nodes have connection problems, they get full segments
with different hashes. Inspection of these segments shows that
they all have identical data up to a point.
Segments that fetched normally then contain the remainder of the data, while segments
that had issues have a slightly corrupted end.
The data is still valid and no errors are raised; the segment just doesn't contain all of it.
We noticed that these corrupted segments were all cut off exactly 60 seconds after their requests
began. We believe this is a server-side timeout on the request that returns whatever data
it has, then closes the container file cleanly before returning successfully.
We detect segments that take more than 59 seconds to receive and label them as "suspect".
Suspect segments are treated identically to partial segments, except they are always preferred
over partials.
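A minimal sketch of what that detection might look like, assuming a monotonic timer is started when the request begins; the function name and threshold constant are illustrative, not the real ones:

    import time

    SUSPECT_THRESHOLD = 59  # seconds; anything slower is assumed to have hit the 60s server cutoff

    def classify_download(start_time, finished_cleanly):
        # start_time is a time.monotonic() reading taken when the request began.
        elapsed = time.monotonic() - start_time
        if not finished_cleanly:
            return "partial"
        if elapsed > SUSPECT_THRESHOLD:
            # Completed without error, but took suspiciously long: the server
            # may have cut it off at 60s and closed the file cleanly.
            return "suspect"
        return "full"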
We occasionally see corrupted segments that are slightly shorter in size
but report the same metadata as the full segments. Prefer the largest version
as it's likely the least corrupt.
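As a rough sketch, picking the least-corrupt copy can be as simple as taking the largest file; `size` here is assumed to be the segment's on-disk byte length:

    def best_copy(candidates):
        # All candidates report the same metadata; the largest is assumed to be
        # the least truncated, so prefer it.
        return max(candidates, key=lambda segment: segment.size)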
Not only is this redundant, but it creates a race condition where
the worker fails before the latest_worker = workers[-1] check,
and we get an IndexError.
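One way to avoid both the redundant check and the race is to guard the indexing itself; a sketch, assuming workers is a list another thread may empty at any time (the function name is made up):

    def latest_worker_or_none(workers):
        # A worker may remove itself between any "is it non-empty?" check and
        # the indexing, so catch the IndexError instead of pre-checking.
        try:
            return workers[-1]
        except IndexError:
            return None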
Despite our best efforts, this was causing chunked responses to be fully
buffered into memory as a side effect.
This is really bad because responses can be VERY large.
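For contrast, a minimal sketch of keeping a chunked response streamed rather than buffered, using the requests library (the URL, chunk size, and timeout are placeholders):

    import requests

    def stream_to_file(url, out_file, chunk_size=16 * 1024):
        # stream=True defers fetching the body; iter_content() then yields it in
        # chunks, so the full response never sits in memory at once.
        with requests.get(url, stream=True, timeout=30) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size=chunk_size):
                out_file.write(chunk)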
This is intended mainly for Travis CI, because by default it doesn't cache any layers
between builds.
By pulling likely-reusable builds (all parents of the current commit),
we take a fixed cost slowdown but in many cases should see a dramatic speed increase
overall, since we won't need to re-build anything that hasn't changed.
This isn't needed for local builds, where Docker will do this on its own
with any previously-built images.
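A rough sketch of the pull step, assuming images are tagged with the short commit sha (the repository name is a placeholder):

    import subprocess

    def pull_parent_images(repo="example/image"):
        # HEAD^@ expands to all parents of the current commit.
        parents = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD^@"],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        for sha in parents:
            # A missing image just means there's nothing to reuse, so ignore pull failures.
            subprocess.run(["docker", "pull", "{}:{}".format(repo, sha)], check=False)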
Since these tend to happen around stream endings, etc., we don't want them to be
so noisy that they cause us to disregard real problems.
We can use the segment coverage metrics to see if there are overlaps.
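One way to quantify that, as a sketch: treat each segment as a (start, end) pair of timestamps and total up the redundantly-covered time.

    def redundant_coverage(segments):
        # segments: iterable of (start, end) timestamps. Any span covered by more
        # than one segment counts once per extra covering segment.
        total = 0.0
        covered_until = None
        for start, end in sorted(segments):
            if covered_until is not None and start < covered_until:
                total += min(end, covered_until) - start
            covered_until = end if covered_until is None else max(covered_until, end)
        return total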
This lets us view a number of useful graphs in dashboards, e.g. rows by state,
errored rows, rows by day, rows by category, meltdowns per day, fraction of
events that are poster moments by category.
Sheetsync was the natural place to do this since it was already periodically scanning
the entire events table.
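A hedged sketch of exporting those counts with prometheus_client; the gauge name and label set are assumptions, and rows is whatever the periodic sheetsync scan already iterates over:

    from collections import Counter
    from prometheus_client import Gauge

    event_counts = Gauge(
        "event_counts",
        "Number of rows in the events table, by state and category",
        ["state", "category"],
    )

    def update_event_metrics(rows):
        # Piggybacks on the existing periodic scan, which already walks every row.
        counts = Counter((row.state, row.category) for row in rows)
        for (state, category), count in counts.items():
            event_counts.labels(state=state, category=category).set(count)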