wubloader

Commit Graph

Author	SHA1	Message	Date
Mike Lang	c493869b9a	Have list_segment_files also list chat archives Otherwise backfilling of chat doesn't work	3 years ago
Mike Lang	a3e16a2686	thumbnails: Take crop/scaling info from a json file next to the image file	3 years ago
Mike Lang	45c46df8bb	Add thumbnail templating code	3 years ago
Mike Lang	08257386e2	Add restreamer endpoint for viewing chat messages	3 years ago
Mike Lang	1add3c5c22	Implement tombstoning to allow for segment deletion Rarely, we find ourselves needing to explicitly delete some data, eg. something that shouldn't have been public and should be removed from all records. It would also be nice if we could "clean up" bad versions of the same segment, which occasionally come up when downloaders have issues. With our distributed segment database, this is actually rather difficult as deleting the data from any one server would cause it to be restored from the others. It was only possible by stopping all backfill, deleting the data on all servers, then starting backfill again. Here we introduce a more practical approach. An operator creates an empty flag file with the same name as the segment to be deleted, but with a `.tombstone` extension. eg. to delete a file `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.ts`, you would create a tombstone `/segments/desertbus/source/2019-11-13T02/45:51.608000-2.0-full-7IS92rssMzoSBQDIevHStbTNy-URRV3Vw-jzZ6pwOZM.tombstone`. These tombstone files do two important things: * They hide the segment from being listed, which both means: * It can't be restreamed or put into a video * It can't be backfilled to other nodes * The tombstone files themselves do get backfilled to other nodes, so you only need to mark them on one server. Once the tombstone has propagated to all nodes, the segment file can be deleted independently on each one. We chose not to have a tombstone automatically trigger a segment deletion for safety reasons.	3 years ago
Mike Lang	44d0c0269a	cache results of common.segments.best_segments_by_start The restreamer spends most of its time iterating through segments (parsing them, determining the best one for each start time) to serve large time ranges. Since this only depends on the list of filenames read from disk, we can cache it for a given hour as long as that list is identical. This is a little trickier than it sounds because best_segments_by_start is an iterator and in most cases it won't be fully consumed. So we introduce a `CachedIterator` abstraction that will both remember the previously yielded values, and keep track of the live iterator so it can be resumed again if a previous invocation only partially consumed it. This also has the nice side effect of merging simultaneous operations - if two requests come in for the same hour at the same time, they'll share one iterator and both consume the results as they come in.	3 years ago
Mike Lang	9f9ef66a85	Add endpoint to get a given frame of video	4 years ago
Mike Lang	d1ba4bc4eb	Downgrade overlapping segments from warning to info They were causing too much log noise	4 years ago
Mike Lang	7649a4e840	Improve WSGIServer graceful shutdown handling Previously both restreamer and thrimshim had some complex logic for dealing with graceful shutdown, in different ways, that was still prone to race conditions. We replace this with a common method that does it properly. Fixes #226	4 years ago
Mike Lang	aab8cf2f0f	Set up plumbing for multi-range videos and implement no-transition fast cut videos only This is the simplest case as we can just cut each range like we already do, then concat the results. We still allow for the full design in the database and cutter, but error out if transitions is ever anything but hard cuts or if it's a full cut. We also update the restreamer to allow accepting ranges, however for usability we still allow the old "just one start and end" args. Note this changes the thrimshim API to give and take the new "video_ranges" and "video_transitions" columns.	4 years ago
Mike Lang	3de44d6731	Add ability to render waveforms in restreamer	4 years ago
Mike Lang	7599681b6d	yet another py3 map() issue "hey i know lets make everything return an iterable but not update anything else to accept them"	4 years ago
Mike Lang	62bd6539ea	Unpin gevent as that was a workaround for a py2 issue	4 years ago
Mike Lang	21856c68aa	Fix all instances of file.write() for py3 In python 3, file.write() may do a partial write and returns the number of characters written. In order to not lose data, we need to wrap every instance of file.write() with our new common.writeall() wrapper that loops until the data is actually written.	4 years ago
Mike Lang	a56f6859bb	more py3 fixes	4 years ago
Mike Lang	3e69000058	py3 fixes for common	4 years ago
Mike Lang	d03ae49eec	Remove defunct "smart cut" method This was an alternate way of doing a cut that turned out to work exactly the same as a fast cut, just with a more complex implementation.	4 years ago
HubbeKing	6d790a1b36	Do a first naive pass for py3 compatibility Check that open() calls for reading and writing use binary modes Use alpine version with py3-pip package Use python3 in Dockerfile CMD Remove sys.setdefaultencoding() "hack" Simplify ensure_directory() in common.common package	4 years ago
Mike Lang	f0546e2ee3	Pin gevent to 1.5a2 to avoid https://github.com/gevent/gevent/issues/1711	4 years ago
Mike Lang	9d8c47377f	segment parsing: Hand-roll microsecond parsing float() is inaccurate and Decimal() is very slow (~3x the cpu usage) so instead we right-pad with 0s (eg. so 1.2345 -> 1.234500) then convert to int microsec directly.	5 years ago
Mike Lang	66669cd4e4	common: When parsing segment timestamps, use decimal instead of float Floating point error leads to 1us differences in parsed times, which causes false positives in the overlapping segments check. By using a Decimal, we get the exact digits from the filepath.	5 years ago
Mike Lang	13a228070a	common.segments: Speed up segment parsing by rolling our own time parsing strptime is very slow. In terms of pure get_best_segments() speed, this change more than doubles the throughput. In particular for segment_coverage, this halves the run time for each check.	5 years ago
Mike Lang	b029250c1c	Disable stacksampler by default It causes problems due to the sheer number of unique metrics emitted, which makes the prometheus endpoint be very expensive / fail a lot. The data is not useful enough to justify the cost.	5 years ago
Mike Lang	1b12c05e0e	make smart cut work, only to discover it doesn't actually have any advantage over fast	5 years ago
Mike Lang	2dbd1132fe	common.googleapis: Fix a bug in retrying failed access token get Seems that this was never fixed when the code was moved.	6 years ago
Mike Lang	7dcd844e16	add logging to help debug smart cut	6 years ago
Mike Lang	c294fa82b8	smart cut: Fix output format	6 years ago
Mike Lang	c6172ce37f	smart cut: More typos	6 years ago
Mike Lang	82346a55ca	smart cut: Fix int in ffmpeg args	6 years ago
Mike Lang	b39e844c1e	restreamer: Fix missing import of smart cut	6 years ago
Mike Lang	21d5548980	Add new segment type "suspect" We've noticed that when nodes have connection problems, they get full segments with different hashes. Inspection of these segments shows that they all have identical data up to a point. Segments that fetched normally will then have the remainder of the data. Segments that had issues will have a slightly corrupted end. The data is still valid, and no errors are raised. It just doesn't have all the data. We noticed that these corrupted segments all were cut off exactly 60sec after their requests began. We believe this is a server-side timeout on the request that returns whatever data it has, then closes the container file cleanly before returning successfully. We detect segments that take > 59 seconds to receive, and label them as "suspect". Suspect segments are treated identically to partial segments, except they are always preferred over partials.	6 years ago
Mike Lang	bb05e37ae4	segments: Use longest segment in bytes if duration is the same We occasionally see corrupted segments that are slightly shorter in size but report the same metadata as the full segments. Prefer the largest version as it's likely the least corrupt.	6 years ago
Mike Lang	b516917e62	Add new "smart" cut technique	6 years ago
Mike Lang	eba5fc498a	Remove flask response size tracking Despite our best efforts, this was causing chunked responses to be fully buffered into memory as a side effect. This is really bad because responses can be VERY large.	6 years ago
Mike Lang	2efe1d6218	Fix a bad logging line when handling errors	6 years ago
Mike Lang	59ee5cf5c0	Only log at INFO about multiple versions of a segment Since these tend to happen around stream endings, etc, we don't want them to be crazy noisy and cause us to disregard real problems. We can use the segment coverage to see in metrics if there are overlaps.	6 years ago
Mike Lang	249e32583b	get_best_segments: Don't error if the only segments that exist for time are temp	6 years ago
Mike Lang	6b602592f5	Allow disabling of stacksampling with an env var This gives an easy way to do so across all services without adding new options. Reasons to do so might be to avoid overheads or because your prometheus metrics grow too large.	6 years ago
Mike Lang	4d3aa94a71	Automatically set default encoding to utf-8 when common is imported To be clear, this is an awful hack. It means that any implicit str/unicode coercion will use the utf-8 encoding, which is basically always what you want. However, it is possible that some badly-written libraries might be relying on the default encoding being ascii, and will do weird things as a result. Finally, it's especially hacky to be doing this as part of importing a library. Normally you're meant to do this as part of a sitecustomize.py in your python system directory, and the function is deleted before passing control to normal code (this is why we need to reload() to get it back).	6 years ago
Mike Lang	4d5157cdb5	Fix a mistake with allowing reuse of name in @timed()	6 years ago
Mike Lang	426b1328be	Fix mistakes in common.requests	6 years ago
Mike Lang	a7f5d1c545	Fix issues with metrics gathering for cut functions * Need to allow timed() to have multiple callers with same name * "type" label is reserved, use "cut_type" instead	6 years ago
Mike Lang	eb4fb5a9e1	restreamer: Add more options for fetching cuts Split full cut into two types - an mpegts one and an mp4 one. Add "rough" cut which is just a concat of the segments.	6 years ago
Mike Lang	b27e06d068	Fix typo in common/common/segments.py	6 years ago
Mike Lang	4d52b18b04	cutter,restreamer: Set stream=True for full cuts when appropriate And also default to a new ffmpeg encoding setting for high-quality mpegts (ie. still streamable) that is encoded very quickly.	6 years ago
Mike Lang	9afcc7b399	full cut: Optionally use seekable file OR directly stream The caller can pick depending on the needs of the output format. This reverses most of `80d829b83b`, re-introducing streaming full cuts but keeping non-streaming as an option.	6 years ago
Mike Lang	4f900c5925	Collect metrics around cutting time	6 years ago
Mike Lang	52e6c4ad41	sheetsync, cutter: Collect metrics on http calls In particular, to google apis.	6 years ago
Mike Lang	a2edb38a85	Add an InstrumentedSession wrapper that automatically gathers metrics on http calls	6 years ago
Mike Lang	fc791e03d4	DBManager: Don't test connection on start This gives the individual services more freedom in how to handle a failing connection.	6 years ago
Mike Lang	e435abf72e	Merge pull request #114 from ekimekim/mike/fixes Grab-bag of cutter fixes	6 years ago
Mike Lang	80d829b83b	full cut: ffmpeg requires a seekable output file Most formats like mp4 require ffmpeg to make changes at the start of the file throughout writing. Unfortunately, this prevents us from streaming the upload as we cut it. Instead, we spool to a temporary file until ffmpeg exits, then upload that all at once.	6 years ago
Christopher Usher	f4cd3f546e	removed comments no longer needed	6 years ago
Christopher Usher	51e4520826	replaced warnings.warn with logger.warn	6 years ago
Christopher Usher	7d85eb7272	warn about and ignore files that don't parse as segments	6 years ago
Mike Lang	3a9543a4b5	Suppress less ffmpeg output when cutting The "fatal" level was causing some useful errors to be suppressed.	6 years ago
Mike Lang	12decf015e	Fix multiple typos and mistakes with full cuts	6 years ago
Mike Lang	c970677a76	full cut: Fix a typo	6 years ago
Mike Lang	d3e1d6b4fc	Resurrect non-experimental cut, now dubbed "full" (vs "fast") cut In a fast cut, we edit the first and last segments then concatenate them all. However, this leads to some tiny but perceptible artifacting around the border of the first and second (and second-last and last) segments. A full cut is much slower, but re-encodes the video into the desired format and is more reliable. We want both options to be available. With this commit, we only add the option, we don't use it in restreamer or cutter.	6 years ago
Christopher Usher	928f9733d2	horrible bug with negative times fixed	6 years ago
Christopher Usher	a6303c38ce	fixed parse_segment_path to allow just a filename to be parsed	6 years ago
Christopher Usher	632c5fae2f	added a default timeout to database connections	6 years ago
Christopher Usher	ff5c1f8ecd	fixes based on ekim's suggestions	6 years ago
Christopher Usher	f75f3e61e8	Removed schema from common/database.py	6 years ago
Christopher Usher	86477fae13	fixes for ekim's comments	6 years ago
Christopher Usher	23e3cfce20	Added editor, edit_time and upload_time to thrimshim and cutter updates of the database	6 years ago
Christopher Usher	75cafdabb7	database changes to keep track of editors and edit times	6 years ago
Christopher Usher	67100c4126	comments	6 years ago
Christopher Usher	76bc629720	moved flask monitoring to its own module	6 years ago
Christopher Usher	6c633df3ee	move restreamer.stats to common.stats	6 years ago
Christopher Usher	b959853593	refactored to channel and quality	6 years ago
Mike Lang	7179fcacec	Backfiller: ignore temp segments To make this work, we make type a proper segment field. We also tell get_best_segments to ignore temp segments, since they might go away before we can actually use them.	6 years ago
Mike Lang	499e486b0b	Merge pull request #54 from ekimekim/mike/sheet-sync/initial sheet sync	6 years ago
Christopher Usher	4b9fbcb7d2	backfiller database code	6 years ago
Mike Lang	9762f308a0	Implement main part of sheet sync	6 years ago
Mike Lang	3647d091f8	Move common google api auth functionality into common So we can reuse it for google sheets	6 years ago
Mike Lang	3ccace2a73	database: Update constraints to allow null edit inputs in state DONE This allows manual uploads to work without needing to fill all the edit fields with junk. We also set a constraint on uploader asserting that any videos from claimed onwards have a known uploader. Again, an exception is made for DONE to allow manual uploads.	6 years ago
Mike Lang	cca4d52b7d	Don't error when encountering a temp-type segment These can happen if a downloader or backfiller dies suddenly. We treat it similarly to partial but lacking any hash. At some point in the future we should probably have something to find any temp segments, hash them and rename them to partials.	6 years ago
Mike Lang	f8d10dacdf	Audit and fix all usage of dateutil We wrap direct dateutil calls to handle two distinct cases: * `common.dateutil.parse()`: We want to handle arbitrary timestamps including tz info, then convert them to UTC. This is used in HLS parsing, and for command line input for backfiller * `common.dateutil.parse_utc_only()`: We want to only handle UTC timestamps, but datetime.strptime isn't flexible enough (eg. can't handle missing fractional component). This is used for restreamer request params.	6 years ago
Mike Lang	dfc64481a6	Port existing cutting code from restreamer into common Note this moves over the 'experimental' cutter and deletes the original cutter that concatenates entire videos before cutting. We may eventually want to revive that method if the experimental cutter turns out to introduce too many issues. We move most of the code over verbatim, but adjust it such that it acts as a generic iterator that can be used in a variety of contexts. Some other changes made during the move include telling ffmpeg to be quieter (don't output version info and junk, only log if something goes wrong), and avoiding errors during cleanup.	6 years ago
Mike Lang	3d9ba77745	common: add allow_holes option to get_best_segments() to abort early if holes found This is a performance optimization, allowing us to fail out early (potentially avoiding a LOT of work) if we know we're going to reject any result that contains holes. We add a new exception ContainsHoles that is raised in this condition.	6 years ago
Mike Lang	e383613954	database: Add constraints on edit inputs that they must be non-NULL if state != UNEDITED This should help prevent changing state to EDITED with any of these fields unset, which would blow up the cutter. We also fix up upload_location, which was set up as a sheet input (NOT NULL DEFAULT ''), and add a similar constraint saying any DONE columns must have non-NULL video link.	6 years ago
Mike Lang	292188ad7c	database: Remove retry_on_conflict helper and default to autocommit All our usage was of a single query anyway, so autocommit is easier to handle. You can still opt into a longer transaction using the transaction() helper.	6 years ago
Mike Lang	73640ed4ab	database: Add column video_id for storing upload-location-specific metadata for identifying video ie. for youtube, the video id.	6 years ago
Mike Lang	dc2eb6ed74	Add some common database code This code manages the database connections, setting their isolation level correctly and ensuring the idempotent schema is applied before they're used. Applying the schema on startup means we don't need to deal with the database's state, setting it up before running, running migrations etc. However, it does put constraints on the changes we can safely make. Our use of serializable isolation means that all transactions can be treated as fully independent - the server must behave as though they'd been run separately in some valid order. This will give us the least surprising results when multiple connections try to modify the same data, though we'll need to deal with occasional transaction commit failures due to conflicts.	6 years ago
Mike Lang	997c1242b2	get_best_segments: Let other things run get_best_segments can sometimes take a very long time, we don't want to stop other work from happening while it's ongoing. So we ask gevent to run other things until there's no other work to do, then we do one hour, then check back with gevent again. In combination with the performance improvements, this should mean we don't block other things from running for more than a few hundred ms at most.	7 years ago
Mike Lang	bf08aa29b8	parse_segment_path: Use datetime.strptime instead of dateutil.parser strptime is much faster but can't handle as varied formats. But in this case we fully control the format, so there's no reason not to use it. Profiling suggests we spend about 80% of our time in get_best_segments just parsing dates, so this is a significant performance gain.	7 years ago
Mike Lang	bcdb268ce8	Also need to replace locks on the counter float values to prevent deadlocks See comment for full details	7 years ago
Mike Lang	10cca18922	Fix a deadlock due to signal interactions with prometheus client The prometheus client uses a threading.Lock() to prevent shared access to certain metric state. This lock is taken as part of doing collection, as well as during metric.labels(). We hit a deadlock where our stack sampler signal arrived during a collection, when the lock was held. This meant that flamegraph.labels() blocked forever, and the lock was never released, hanging all metrics collection. Our solution is a hack, which is to reach into the internals of our metric object and replace its lock with a dummy one. This is reasonably safe, but only as long as the prometheus_client internal structure doesn't change significantly.	7 years ago
Mike Lang	b9c2921242	common.stats: Add a stacksampler that records sampled stacks to prometheus This can then be used to generate flamegraphs	7 years ago
Mike Lang	5175b099af	common: Split segment-related stuff into its own module We still import them into __init__.py so they're accessible externally just the same	7 years ago
Mike Lang	6f84a23ba6	common: Split stats-related stuff into its own module We still import them into __init__.py so they're accessible externally just the same	7 years ago
Mike Lang	8fe2fec958	common: convert from module to package	7 years ago
Mike Lang	d90f01b8ce	common: Create general function for timing things, and use it to time get_best_segments The function is quite customizable and therefore quite complex, but it allows us to easily annotate a function to be timed with labels based on input and output, as well as normalize results based on amount of work done to get a better picture of the actual amount of time taken per unit of work. This will help us monitor for performance issues.	7 years ago
Mike Lang	b0ded641c3	Add a logging handler which counts logs for prometheus stats This isn't as good as having a full centralised logging system, but should suffice to know if anything funny is happening.	7 years ago
Christopher Usher	3fcd374449	Moved encode_strings to common	7 years ago
Mike Lang	6815924097	Fix some bugs and linter errors introduced by backfiller I ran `pyflakes` on the repo and found these bugs: ``` ./common/common.py:289: undefined name 'random' ./downloader/downloader/main.py:7: 'random' imported but unused ./backfiller/backfiller/main.py:150: undefined name 'variant' ./backfiller/backfiller/main.py:158: undefined name 'timedelta' ./backfiller/backfiller/main.py:171: undefined name 'sort' ./backfiller/backfiller/main.py:173: undefined name 'sort' ``` (ok, the "imported but unused" one isn't a bug, but the rest are) This fixes those, as well as a further issue I saw with sorting of hours. Iterables are not sortable. As an obvious example, what if your iterable was infinite? As a result, any attempt to sort an iterable that is not already a friendly type like a list or tuple will result in an error. We avoid this by coercing to list, fully realising the iterable and putting it into a form that python will let us sort. It also avoids the nasty side-effect of mutating the list that gets passed into us, which the caller may not expect. Consider this example: ``` >>> my_hours = ["one", "two", "three"] >>> print my_hours ["one", "two", "three"] >>> backfill_node(base_dir, node, stream, variants, hours=my_hours, order='forward') >>> print my_hours ["one", "three", "two"] ``` Also, one of the linter errors was non-trivial to fix - we were trying to get a list of hours (which is an api call for a particular variant), but at a time when we weren't dealing with a single variant. My solution was to get a list of hours for ALL variants, and take the union.	7 years ago
Christopher Usher	fec0975d18	fixed white space and the like	7 years ago
Christopher Usher	3cdfaad664	moved rename, ensure_directory and jitter to common Move a few useful functions in downloader used in the backfiller to common	7 years ago
Mike Lang	6fa74608fb	common: Improve some docs to note types of things that are ambiguous	7 years ago
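
The tombstone commit (1add3c5c22) in the listing above says that a `.tombstone` flag file with the same base name as a segment hides that segment from being listed. A minimal sketch of that listing rule follows. It is not the repository's implementation: the function name `list_visible_segments` is hypothetical, and the sketch assumes segments end in `.ts` and that each hour is a flat directory.

```python
import os


def list_visible_segments(hour_dir):
    """Return the segment filenames in hour_dir that have no tombstone.

    Hypothetical helper for illustration only; assumes segments use a
    .ts extension and tombstones use the same base name plus .tombstone.
    """
    names = os.listdir(hour_dir)
    # Collect the base names (extension stripped) of every tombstone file.
    tombstoned = {
        os.path.splitext(name)[0]
        for name in names
        if name.endswith(".tombstone")
    }
    # Keep only segments whose base name has no matching tombstone.
    return sorted(
        name
        for name in names
        if name.endswith(".ts")
        and os.path.splitext(name)[0] not in tombstoned
    )
```

Because the tombstone files are ordinary files in the same directory, they can be copied between nodes like any other file, which matches the commit's note that the tombstones themselves get backfilled while the hidden segment does not.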

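The segment parsing commit (9d8c47377f) describes right-padding the fractional seconds with zeros and converting straight to integer microseconds, avoiding both `float()` and `Decimal()`. The sketch below illustrates that idea under stated assumptions: the function name `seconds_to_microseconds` is hypothetical, input is a non-negative decimal string, and digits beyond the sixth are truncated (the commit message does not say how extra digits are handled).

```python
def seconds_to_microseconds(text):
    """Convert a decimal seconds string such as "1.2345" to integer microseconds.

    Hypothetical helper for illustration; assumes a non-negative value.
    """
    whole, _, frac = text.partition(".")
    # Right-pad the fraction to exactly 6 digits, eg. "2345" -> "234500".
    # Truncating past 6 digits is an assumption of this sketch.
    frac = frac.ljust(6, "0")[:6]
    return int(whole) * 1_000_000 + int(frac)
```

Working only with the digits as written keeps the result exact, which is the property the earlier Decimal-based commit (66669cd4e4) was after when it fixed the 1us differences seen with floats.
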

159 Commits (HubbeKing-multiplatform-builds)