wubloader

Commit Graph

Author	SHA1	Message	Date
Mike Lang	e383613954	database: Add constraints on edit inputs that they must be non-NULL if state != UNEDITED This should help prevent changing state to EDITED with any of these fields unset, which would blow up the cutter. We also fix up upload_location, which was set up as a sheet input (NOT NULL DEFAULT ''), and add a similar constraint saying any DONE columns must have non-NULL video link.	6 years ago
Mike Lang	292188ad7c	database: Remove retry_on_conflict helper and default to autocommit All our usage was of a single query anyway, so autocommit is easier to handle. You can still opt into a longer transaction using the transaction() helper.	6 years ago
Mike Lang	73640ed4ab	database: Add column video_id for storing upload-location-specific metadata for identifying video ie. for youtube, the video id.	6 years ago
Mike Lang	dc2eb6ed74	Add some common database code This code manages the database connections, setting their isolation level correctly and ensuring the idempotent schema is applied before they're used. Applying the schema on startup means we don't need to deal with the database's state, setting it up before running, running migrations etc. However, it does put constraints on the changes we can safely make. Our use of serializable isolation means that all transactions can be treated as fully independent - the server must behave as though they'd been run separately in some valid order. This will give us the least surprising results when multiple connections try to modify the same data, though we'll need to deal with occasional transaction commit failures due to conflicts.	6 years ago
Mike Lang	997c1242b2	get_best_segments: Let other things run get_best_segments can sometimes take a very long time, we don't want to stop other work from happening while it's ongoing. So we ask gevent to run other things until there's no other work to do, then we do one hour, then check back with gevent again. In combination with the performance improvements, this should mean we don't block other things from running for more than a few hundred ms at most.	6 years ago
Mike Lang	bf08aa29b8	parse_segment_path: Use datetime.strptime instead of dateutil.parser strptime is much faster but can't handle as varied formats. But in this case we fully control the format, so there's no reason not to use it. Profiling suggests we spend about 80% of our time in get_best_segments just parsing dates, so this is a significant performance gain.	6 years ago
Mike Lang	bcdb268ce8	Also need to replace locks on the counter float values to prevent deadlocks See comment for full details	6 years ago
Mike Lang	10cca18922	Fix a deadlock due to signal interactions with prometheus client The prometheus client uses a threading.Lock() to prevent shared access to certain metric state. This lock is taken as part of doing collection, as well as during metric.labels(). We hit a deadlock where our stack sampler signal arrived during a collection, when the lock was held. This meant that flamegraph.labels() blocked forever, and the lock was never released, hanging all metrics collection. Our solution is a hack, which is to reach into the internals of our metric object and replace its lock with a dummy one. This is reasonably safe, but only as long as the prometheus_client internal structure doesn't change significantly.	6 years ago
Mike Lang	b9c2921242	common.stats: Add a stacksampler that records sampled stacks to prometheus This can then be used to generate flamegraphs	6 years ago
Mike Lang	5175b099af	common: Split segment-related stuff into its own module We still import them into __init__.py so they're accessible externally just the same	6 years ago
Mike Lang	6f84a23ba6	common: Split stats-related stuff into its own module We still import them into __init__.py so they're accessible externally just the same	6 years ago
Mike Lang	8fe2fec958	common: convert from module to package	6 years ago
Mike Lang	d90f01b8ce	common: Create general function for timing things, and use it to time get_best_segments The function is quite customizable and therefore quite complex, but it allows us to easily annotate a function to be timed with labels based on input and output, as well as normalize results based on amount of work done to get a better picture of the actual amount of time taken per unit of work. This will help us monitor for performance issues.	6 years ago
Mike Lang	b0ded641c3	Add a logging handler which counts logs for prometheus stats This isn't as good as having a full centralised logging system, but should suffice to know if anything funny is happening.	6 years ago
Christopher Usher	3fcd374449	Moved encode_strings to common	6 years ago
Mike Lang	6815924097	Fix some bugs and linter errors introduced by backfiller I ran `pyflakes` on the repo and found these bugs: ``` ./common/common.py:289: undefined name 'random' ./downloader/downloader/main.py:7: 'random' imported but unused ./backfiller/backfiller/main.py:150: undefined name 'variant' ./backfiller/backfiller/main.py:158: undefined name 'timedelta' ./backfiller/backfiller/main.py:171: undefined name 'sort' ./backfiller/backfiller/main.py:173: undefined name 'sort' ``` (ok, the "imported but unused" one isn't a bug, but the rest are) This fixes those, as well as a further issue I saw with sorting of hours. Iterables are not sortable. As an obvious example, what if your iterable was infinite? As a result, any attempt to sort an iterable that is not already a friendly type like a list or tuple will result in an error. We avoid this by coercing to list, fully realising the iterable and putting it into a form that python will let us sort. It also avoids the nasty side-effect of mutating the list that gets passed into us, which the caller may not expect. Consider this example: ``` >>> my_hours = ["one", "two", "three"] >>> print my_hours ["one", "two", "three"] >>> backfill_node(base_dir, node, stream, variants, hours=my_hours, order='forward') >>> print my_hours ["one", "three", "two"] ``` Also, one of the linter errors was non-trivial to fix - we were trying to get a list of hours (which is an api call for a particular variant), but at a time when we weren't dealing with a single variant. My solution was to get a list of hours for ALL variants, and take the union.	6 years ago
Christopher Usher	fec0975d18	fixed white space and the like	6 years ago
Christopher Usher	3cdfaad664	moved rename, ensure_directory and jitter to common Move a few useful functions in downloader used in the backfiller to common	6 years ago
Mike Lang	6fa74608fb	common: Improve some docs to note types of things that are ambiguous	6 years ago
Mike Lang	3bbe1ed32d	Prefer longer duration on multiple segments	6 years ago
Christopher Usher	4981c6521b	Merge pull request #5 from ekimekim/mike/restreamer/initial Initial work on restreamer	6 years ago
Mike Lang	75c9793eac	Remove central config file as it's more trouble than it's worth Simpler and easier for testing to stick to configuration via CLI args. We'll worry about deployment later.	6 years ago
Mike Lang	0df8288013	common: Implement code for parsing paths and picking the best sequence of segments This is needed by both the restreamer and the cutter, hence its inclusion in common. The algorithm is pretty simple - it takes the 'best' segment per start time by full first, then length of partial. All the other complexity is mainly just around detecting and reporting holes, and being inclusive of start/end points.	6 years ago
Christopher Usher	8f462f5926	Fixed format_bustime docstring	6 years ago
Christopher Usher	4c22edf2e6	Fixed negative times in format_bustime	6 years ago
Mike Lang	d7641aecf5	common: Fix bugs and issues with bustime utils	6 years ago
Mike Lang	048277b003	common: Basic config and bustime code	6 years ago
Mike Lang	a361ab7a63	Add a common package for common bits in multiple components	6 years ago
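
The selection rule described in commit 0df8288013 (best segment per start time: full first, then length of partial) and in 3bbe1ed32d (prefer longer duration) can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the repository's code: the `Segment` record, its field names, and `pick_best_per_start` are hypothetical, and the hole detection and inclusive start/end handling mentioned in the commit message are left out.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical segment record; the field names are assumptions for illustration.
Segment = namedtuple("Segment", ["start", "duration", "is_full"])


def pick_best_per_start(segments):
    """Return one segment per start time: full beats partial, then longer wins."""
    # Coerce to a list before sorting so any iterable is accepted and the
    # caller's own list is never mutated.
    ordered = sorted(list(segments), key=lambda s: s.start)
    best = []
    for _, group in groupby(ordered, key=lambda s: s.start):
        # Tuples compare element-wise: is_full first, then duration.
        best.append(max(group, key=lambda s: (s.is_full, s.duration)))
    return best
```

Sorting a copy rather than the argument also reflects the point made in 6815924097 about not mutating a list the caller passed in.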


128 Commits (e7a839c6cda02d57422a4a12b6ce99e84fceb58a)