Some extra documentation on chat_archiver

3 years ago · 9320251de7
parent d8a9b5ddf0
commit 9320251de7
2 changed files with 26 additions and 4 deletions
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@ but a brief overview of the components:
 * `thrimbletrimmer` is a browser based video editor.
 * `segment_coverage` regularly checks whether there is complete segment coverage for each hour. 
 * `playlist_manager` adds videos to youtube playlists depending on tags.
 * `chat_archiver` records twitch chat messages and merges them with records from other nodes.
 * `database` hosts a Postgres database to store events to be edited.
 * `nginx` provides a webserver through which the other components are exposed to the outside world.
 * `common` provides code shared between the other components.
--- a/chat_archiver/README
+++ b/chat_archiver/README
@@ -3,12 +3,28 @@ in a way that preserves as much context as possible, and allows multiple indepen
 streams to be combined to ensure nothing was missed.
 We store messages in newline-delimited JSON files, in timestamp order.
 Files are stored under the segments path, under /CHANNEL/chat/HOUR/.
 Files are named by their timestamp and hash.
 Each file covers one minute of messages.
 These files are named by their timestamp + hash and merged with other files via a CRDT model.
 	CRDT means you have a merge operation (.) such that
 		(A.B).C == A.(B.C) (associative)
 		A.B == B.A (commutative)
 		A.A == A (idempotent)
 	This means that it doesn't matter what order files are merged in, or if the same file is merged
 	multiple times. We will always get an identical final result.
 The backfiller copies message files from other nodes to the local node. The chat_archiver
 periodically scans and merges any files for the same minute.
 So a typical interaction with two nodes will look like this:
 	Node 1 records file A
 	Node 2 records file B
 	File A is backfilled to node 2
 	File B is backfilled to node 1
 	Node 1 merges A + B -> C
 	Node 2 merges B + A -> C (identical on both servers)
 	Since C is identical, it will have the same hash on both servers, so they won't need
 	to copy it to each other.
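The merge properties above can be checked with a small sketch. This is a simplified illustration, not the archiver's actual merge: it assumes every message carries a unique `id` and a `time` field (both names are hypothetical), so exact equality is enough to detect duplicates. Messages without unique ids need additional matching rules that this sketch leaves out.

```python
import json


def canonical(message):
    # Assumption: sorted keys and compact separators serve as the canonical form.
    return json.dumps(message, sort_keys=True, separators=(",", ":"))


def merge(a, b):
    """Merge two lists of message dicts into one sorted, de-duplicated list."""
    # Identical messages serialise to the same string, so the set keeps one copy.
    unique = {canonical(m) for m in a + b}
    merged = [json.loads(line) for line in unique]
    # Sort by (time, id) so the output order never depends on the input order.
    return sorted(merged, key=lambda m: (m["time"], m["id"]))


A = [{"id": "a1", "time": 1.0, "text": "hello"}]
B = [{"id": "b1", "time": 2.0, "text": "world"}]
C = [{"id": "a1", "time": 1.0, "text": "hello"}, {"id": "c1", "time": 3.0, "text": "!"}]

assert merge(merge(A, B), C) == merge(A, merge(B, C))  # (A.B).C == A.(B.C)
assert merge(A, B) == merge(B, A)                      # A.B == B.A
assert merge(A, A) == A                                # A.A == A
```

Because the result is the same whichever node performs the merge, both nodes in the interaction above end up with the same file C.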
 We have a few different kinds of messages to deal with:
 	Messages with a unique id and timestamp
@@ -26,11 +42,11 @@ We have a few different kinds of messages to deal with:
 		This also governs which file they are in if their range overlaps two files.
 	Messages known to be out of order
 		This is specific to JOINs and PARTs.
-		Twitch documents that these may be delayed by up to 10sec.
+		Twitch documents that these may be delayed by an unknown amount. We have observed up to 30sec.
 		So we follow the rules as per messages with implied ordering,
-		except we pad the possible start time by 10 seconds.
+		except we pad the possible start time by 45 seconds.
-How to merge two files
+How we merge two files
 	In general, if the same message (all non-receiver fields identical) is present in both files,
 		it is included once in the output. For messages with unique ids, this is all that's needed.
 	For messages without unique ids, we face the question of "is this the same message".
@@ -43,7 +59,7 @@ How to merge two files
 	Literal edge cases: Timestamp ranges that span two files
 		It may be the case that we can match a message whose timestamp range overhangs file A
 		with a message near the start of file B. So whenever we are merging files, we need to
-		also consider the file before and the file after on our side.
+		also consider the file before and the file after.
 	In all cases when we merge two messages, we should merge the receiver timestamp field which maps
 		the receiver id to the timestamp it received the message. This preserves message provenance.
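A minimal sketch of merging the receiver timestamp mapping when two copies of a message are combined. The function name is hypothetical, and keeping the earlier timestamp when both sides list the same receiver is an assumption made here so that the result does not depend on merge order; the README does not say how that case is handled.

```python
def merge_receivers(a, b):
    """Combine two {receiver id: receive timestamp} mappings."""
    merged = dict(a)
    for receiver, timestamp in b.items():
        # Assumption: on conflict keep the earlier timestamp, which is order-independent.
        merged[receiver] = min(timestamp, merged.get(receiver, timestamp))
    return merged


node1 = {"node1": 1600000000.25}
node2 = {"node2": 1600000000.31}
assert merge_receivers(node1, node2) == {"node1": 1600000000.25, "node2": 1600000000.31}
assert merge_receivers(node1, node2) == merge_receivers(node2, node1)
```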
@@ -51,3 +67,8 @@ All files are stored in newline-delimited, canonicalised JSON so we can use hash
 It should always be true that merging B into A and merging A into B produce identical files
 with the same hash (effects of neighboring files notwithstanding - that just means more merges will
 be needed in order to stabilize).
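To illustrate why canonicalised JSON makes hashes comparable, here is a rough sketch. The canonical form (sorted keys, compact separators) and the use of SHA-256 are assumptions for the example; the README does not specify either, and the function names are hypothetical.

```python
import hashlib
import json


def serialize(messages):
    """Encode messages as newline-delimited canonical JSON bytes."""
    # Assumption: sorted keys plus compact separators give one encoding per message.
    lines = [json.dumps(m, sort_keys=True, separators=(",", ":")) for m in messages]
    return "".join(line + "\n" for line in lines).encode("utf-8")


def content_hash(messages):
    # Assumption: SHA-256 stands in for whatever hash the archiver actually uses.
    return hashlib.sha256(serialize(messages)).hexdigest()


# The same message built with a different key order hashes identically.
first = [{"id": "a1", "text": "hello"}]
second = [{"text": "hello", "id": "a1"}]
assert content_hash(first) == content_hash(second)
```

With a stable encoding, two nodes that arrive at the same merged content also arrive at the same file hash, so neither needs to copy the file from the other.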
 A note on the IRC library we're using - this library is called girc and was written by ekimekim.
 It has some twitch-specific handling and is built around gevent. It was much easier to pull it in
 rather than writing our own custom message handling, or trying to adapt another third party client
 to our setup. It is included in this repo via submodule.