|
|
@ -3,12 +3,28 @@ in a way that preserves as much context as possible, and allows multiple indepen
|
|
|
|
streams to be combined to ensure nothing was missed.
|
|
|
|
streams to be combined to ensure nothing was missed.
|
|
|
|
|
|
|
|
|
|
|
|
We store messages in newline-delimited JSON files, in timestamp order.
|
|
|
|
We store messages in newline-delimited JSON files, in timestamp order.
|
|
|
|
|
|
|
|
Files are stored under the segments path, under /CHANNEL/chat/HOUR/.
|
|
|
|
|
|
|
|
Files are named by their timestamp and hash.
|
|
|
|
Each file covers one minute of messages.
|
|
|
|
Each file covers one minute of messages.
|
|
|
|
These files are named by their timestamp + hash and merged with other files via a CRDT model.
|
|
|
|
These files are named by their timestamp + hash and merged with other files via a CRDT model.
|
|
|
|
CRDT means you have a merge operation (.) such that
|
|
|
|
CRDT means you have a merge operation (.) such that
|
|
|
|
(A.B).C == A.(B.C) (associative)
|
|
|
|
(A.B).C == A.(B.C) (associative)
|
|
|
|
A.B == B.A (commutitive)
|
|
|
|
A.B == B.A (commutitive)
|
|
|
|
A.A == A (reflexive)
|
|
|
|
A.A == A (reflexive)
|
|
|
|
|
|
|
|
This means that it doesn't matter what order files are merged in, or if the same file is merged
|
|
|
|
|
|
|
|
multiple times. We will always get an identical final result.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The backfiller copies message files from other nodes to the local node. The chat_archiver
|
|
|
|
|
|
|
|
perioidically scans and merges any files for the same minute.
|
|
|
|
|
|
|
|
So a typical interaction with two nodes will look like this:
|
|
|
|
|
|
|
|
Node 1 records file A
|
|
|
|
|
|
|
|
Node 2 records file B
|
|
|
|
|
|
|
|
File A is backfilled to node 2
|
|
|
|
|
|
|
|
File B is backfilled to node 1
|
|
|
|
|
|
|
|
Node 1 merges A + B -> C
|
|
|
|
|
|
|
|
Node 2 merges B + A -> C (identical on both servers)
|
|
|
|
|
|
|
|
Since C is identical, it will have the same hash on both servers, so they won't need
|
|
|
|
|
|
|
|
to copy it to each other.
|
|
|
|
|
|
|
|
|
|
|
|
We have a few different kinds of messages to deal with:
|
|
|
|
We have a few different kinds of messages to deal with:
|
|
|
|
Messages with a unique id and timestamp
|
|
|
|
Messages with a unique id and timestamp
|
|
|
@ -26,11 +42,11 @@ We have a few different kinds of messages to deal with:
|
|
|
|
This also governs which file they are in if their range overlaps two files.
|
|
|
|
This also governs which file they are in if their range overlaps two files.
|
|
|
|
Messages known to be out of order
|
|
|
|
Messages known to be out of order
|
|
|
|
This is specific to JOINs and PARTs.
|
|
|
|
This is specific to JOINs and PARTs.
|
|
|
|
Twitch documents that these may be delayed by up to 10sec.
|
|
|
|
Twitch documents that these may be delayed by an unknown amount. We have observed up to 30sec.
|
|
|
|
So we follow the rules as per messages with implied ordering,
|
|
|
|
So we follow the rules as per messages with implied ordering,
|
|
|
|
except we pad the possible start time by 10 seconds.
|
|
|
|
except we pad the possible start time by 45 seconds.
|
|
|
|
|
|
|
|
|
|
|
|
How to merge two files
|
|
|
|
How we merge two files
|
|
|
|
In general, if the same message (all non-receiver fields identical) is present in both files,
|
|
|
|
In general, if the same message (all non-receiver fields identical) is present in both files,
|
|
|
|
it is included once in the output. For messages with unique ids, this is all that's needed.
|
|
|
|
it is included once in the output. For messages with unique ids, this is all that's needed.
|
|
|
|
For messages without unique ids, we face the question of "is this the same message".
|
|
|
|
For messages without unique ids, we face the question of "is this the same message".
|
|
|
@ -43,7 +59,7 @@ How to merge two files
|
|
|
|
Literal edge cases: Timestamp ranges that span two files
|
|
|
|
Literal edge cases: Timestamp ranges that span two files
|
|
|
|
It may be the case that we can match a message whose timestamp range overhangs file A
|
|
|
|
It may be the case that we can match a message whose timestamp range overhangs file A
|
|
|
|
with a message near the start of file B. So whenever we are merging files, we need to
|
|
|
|
with a message near the start of file B. So whenever we are merging files, we need to
|
|
|
|
also consider the file before and the file after on our side.
|
|
|
|
also consider the file before and the file after.
|
|
|
|
In all cases when we merge two messages, we should merge the receiver timestamp field which maps
|
|
|
|
In all cases when we merge two messages, we should merge the receiver timestamp field which maps
|
|
|
|
the receiver id to the timestamp it received the message. This preserves message provedence.
|
|
|
|
the receiver id to the timestamp it received the message. This preserves message provedence.
|
|
|
|
|
|
|
|
|
|
|
@ -51,3 +67,8 @@ All files are stored in newline-delimited, canonicalised JSON so we can use hash
|
|
|
|
It should always be true that merging B into A and merging A into B should produce identical files
|
|
|
|
It should always be true that merging B into A and merging A into B should produce identical files
|
|
|
|
with the same hash (effects of neighboring files notwithstanding - that just means more merges will
|
|
|
|
with the same hash (effects of neighboring files notwithstanding - that just means more merges will
|
|
|
|
be needed in order to stabilize).
|
|
|
|
be needed in order to stabilize).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A note on the IRC library we're using - this library is called girc and was written by ekimekim.
|
|
|
|
|
|
|
|
It has some twitch-specific handling and is built around gevent. It was much easier to pull it in
|
|
|
|
|
|
|
|
rather than writing our own custom message handling, or trying to adapt another third party client
|
|
|
|
|
|
|
|
to our setup. It is included in this repo via submodule.
|
|
|
|