mirror of https://github.com/ekimekim/wubloader
Add documentation on all the kinds of data we store in segments dir
parent 7ce736b82e
commit 5e0e470821
A list of all the info we may potentially save, and what format it is in.

All filepaths are relative to the "base directory" used for storage.
A unix filesystem (case-sensitive, no special characters except `/`) is assumed.

When not otherwise specified, a "hash" of content is a sha256 hash encoded to url-safe base64 without padding.
For example, the empty string hashes to `47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU`
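
The hash definition above can be sketched in a few lines of Python. This is an illustration only, not the project's own helper: the function name `content_hash` is hypothetical, and it assumes "url-safe base64" means the standard alphabet from RFC 4648 (`-` and `_` in place of `+` and `/`).

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """Return the sha256 of data as url-safe base64 with '=' padding removed.

    Hypothetical helper name; the encoding follows the description in this
    document (sha256, url-safe base64, no padding).
    """
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
```

A sha256 digest is 32 bytes, so every hash produced this way is exactly 43 characters long.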

## Video segments

Stream video data is saved in MPEG-TS segments, with filepaths:
`CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts`
Where:
- HOUR is `%Y-%m-%dT%H`
- TIME is `%M-%S`
- DURATION is a non-negative float value
- TYPE is one of:
  - `full` - A normal segment
  - `suspect` - A segment which we suspect to not be fully correct
  - `partial` - A segment which we know to be incomplete

It is assumed that any set of segments can be concatenated to produce a playable video
(once timestamps have been corrected).
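
As a rough sketch of how the segment path format could be taken apart, the following parser splits a path into its fields. The names `SegmentInfo` and `parse_segment_path` are hypothetical, the day-of-month is assumed to be written with `%d`, and the timestamps are left timezone-naive because this document does not state a timezone. Since the hash may itself contain `-`, only the first four hyphens of the filename are treated as separators.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

SEGMENT_TYPES = ("full", "suspect", "partial")


class SegmentInfo(NamedTuple):
    channel: str
    quality: str
    start: datetime          # timezone-naive; the timezone is not specified here
    duration: timedelta
    segment_type: str
    content_hash: str


def parse_segment_path(path: str) -> SegmentInfo:
    """Parse CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts (hypothetical helper)."""
    channel, quality, hour, filename = path.split("/")
    if not filename.endswith(".ts"):
        raise ValueError(f"not a segment file: {filename!r}")
    stem = filename[: -len(".ts")]
    # TIME contributes two fields (minute, second); the hash keeps any
    # remaining hyphens because we split at most four times.
    minute, second, duration, segment_type, content_hash = stem.split("-", 4)
    if segment_type not in SEGMENT_TYPES:
        raise ValueError(f"unknown segment type: {segment_type!r}")
    start = datetime.strptime(hour, "%Y-%m-%dT%H") + timedelta(
        minutes=int(minute), seconds=float(second)
    )
    return SegmentInfo(
        channel=channel,
        quality=quality,
        start=start,
        duration=timedelta(seconds=float(duration)),
        segment_type=segment_type,
        content_hash=content_hash,
    )
```

Adding `start` and `duration` gives the nominal end time of a segment, which is one way to look for gaps or overlaps between neighbouring segments.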

## Chat logs

Chat logs are saved in "batch" files with filepaths:
`CHANNEL/chat/HOUR/TIME-HASH.json`
Where HOUR and TIME are as per segment files.

Each batch file is newline-delimited json containing chat log entries.
Each entry corresponds to an IRC PRIVMSG or other command like a JOIN, CLEARCHAT or ROOMSTATE.
An entry has a `time` and a `time_range` field, with the estimated "true" time of the message
being within the range `[time, time + time_range]`.
The `receivers` field contains each unique chat archiver instance that observed this message,
and the timestamp at which they received it. This information is primarily for debugging purposes.

Batches from multiple machines will be merged periodically,
so while it may be possible to observe two batches for the same time, this should be temporary.
Messages may be present in both batches.
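
To illustrate the `time` / `time_range` fields, here is a minimal sketch that reads the text of one batch file and yields the window in which each message's true time is estimated to fall. The function name `message_windows` is hypothetical, and it assumes both fields are plain numbers in the same unit (for example seconds); check chat-archiver for the actual representation.

```python
import json
from typing import Iterator, Tuple


def message_windows(batch_text: str) -> Iterator[Tuple[float, float, dict]]:
    """Yield (earliest, latest, entry) for each entry in a batch file's text.

    Assumes `time` and `time_range` are numeric; this is not confirmed by
    the document above.
    """
    for line in batch_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        entry = json.loads(line)
        earliest = entry["time"]
        latest = earliest + entry["time_range"]
        yield earliest, latest, entry
```

Because two unmerged batches may contain the same message, a consumer reading several batches for the same period should expect duplicates.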

## Blog posts

Website blog posts are captured by blogbot with filepaths:
`blogs/ID-HASH.json`
Multiple files with the same ID represent edits of the same blog post.
Each JSON file contains an object with the html content plus some other metadata.
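
One way to collect the edits of each post is to group filenames by ID. The sketch below is an assumption-laden illustration: `group_blog_revisions` is a hypothetical name, and because this document does not say whether an ID can contain `-`, it relies on the hash always being the last 43 characters of the name (the length of an unpadded base64 sha256) rather than splitting on the first hyphen.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

HASH_LENGTH = 43  # unpadded url-safe base64 of a 32-byte sha256 digest


def group_blog_revisions(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Map each blog post ID to the hashes of its saved revisions."""
    revisions = defaultdict(list)
    for name in filenames:
        if not name.endswith(".json"):
            continue
        stem = name[: -len(".json")]
        # Expect ID + "-" + HASH, with a non-empty ID.
        if len(stem) <= HASH_LENGTH + 1 or stem[-HASH_LENGTH - 1] != "-":
            continue
        post_id = stem[: -HASH_LENGTH - 1]
        revisions[post_id].append(stem[-HASH_LENGTH:])
    return dict(revisions)
```

The filenames alone do not say which revision is newest; that would have to come from the metadata inside each file.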

## Coverage maps

`segment-coverage` generates coverage map images with filepath:
`coverage-maps/CHANNEL_QUALITY_coverage.png`
along with an HTML file of the same name which shows the image with a periodic refresh.

The image files show one pixel per 2 seconds, with the color depending on the coverage state at that time.
See `segment-coverage` for more detail.
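
As a small worked example of the 2-seconds-per-pixel scale, the sketch below converts a time offset into a pixel index. The name `pixel_index` is hypothetical, and how pixels are arranged into rows and columns is not described here, so the result is only a linear index from the start of the covered period.

```python
SECONDS_PER_PIXEL = 2


def pixel_index(offset_seconds: float) -> int:
    """Linear index of the pixel covering a given offset from the start.

    The 2D layout of the image is an implementation detail of
    segment-coverage and is deliberately not modelled here.
    """
    if offset_seconds < 0:
        raise ValueError("offset must be non-negative")
    return int(offset_seconds // SECONDS_PER_PIXEL)
```

At this scale one hour of stream corresponds to 1800 pixels.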

## Emotes

Emote data is saved by chat-archiver for each unique emote seen in chat, with filepath:
`emotes/ID/{light,dark}-{1.0,2.0,3.0}`
These 6 files per emote represent all the variants of the emote image that twitch provides.

Each file is either a PNG or a GIF - consult file magic values (the first 4 bytes) to determine which.
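
The following sketch expands the brace pattern into the six paths and checks the first four bytes of a file's content. The function names are hypothetical. The magic values are the standard ones: PNG files begin with `\x89PNG`, and GIF files begin with `GIF8` (covering both `GIF87a` and `GIF89a`).

```python
from typing import List, Optional

PNG_MAGIC = b"\x89PNG"
GIF_MAGIC = b"GIF8"


def emote_variant_paths(emote_id: str) -> List[str]:
    """List the six variant paths saved for one emote ID."""
    return [
        f"emotes/{emote_id}/{theme}-{scale}"
        for theme in ("light", "dark")
        for scale in ("1.0", "2.0", "3.0")
    ]


def sniff_image_type(first_bytes: bytes) -> Optional[str]:
    """Return "png" or "gif" based on the first 4 bytes, or None if neither."""
    magic = first_bytes[:4]
    if magic == PNG_MAGIC:
        return "png"
    if magic == GIF_MAGIC:
        return "gif"
    return None
```

In practice `sniff_image_type` would be given the result of reading the first few bytes of one of the paths, opened in binary mode relative to the base directory.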

## Downloaded media

Several components will download arbitrary media when they see a URL:
- in a blog or social media post
- in the media links column of the sheet
- in chat

Because these links are potentially untrusted, we exercise a high degree of caution in fetching them.
See `common/common/media.py` for details.

These files are saved in:
`media/URLHASH/FILEHASH.EXT`
Where URLHASH and FILEHASH are hashed in the normal way, but of the request URL and response content respectively.
EXT is guessed based on the content-type and may not be correct.

Note the URL hash will include any query string, etc.
A file may be retrieved multiple times. If this results in different content, then multiple files will be present
under the same URL hash.
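
The path layout can be sketched as below. This is not the code in `common/common/media.py`; `content_hash` and `media_path` are hypothetical names, and encoding the URL as UTF-8 before hashing is an assumption. No fetching is shown, since the safety checks mentioned above are the important part of the real implementation.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """sha256 as url-safe base64 without padding (hypothetical helper)."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def media_path(url: str, content: bytes, ext: str) -> str:
    """Build media/URLHASH/FILEHASH.EXT for an already-fetched response.

    Assumes the URL is hashed as its UTF-8 bytes, including any query string.
    """
    url_hash = content_hash(url.encode("utf-8"))
    file_hash = content_hash(content)
    return f"media/{url_hash}/{file_hash}.{ext}"
```

Two URLs that differ only in their query string land in different directories, while two fetches of the same URL that return different bytes land side by side in the same directory.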

## Pubnub data

`pubbot` watches known PubNub streams and saves an event log `pubnub-log.json`.
This is a newline-delimited json file containing messages which can be distinguished by the `type` field:
- `startup`: Records that pubbot just started. May be used to imply there may have been missed messages preceding it.
- `total`: An update to the donation total
- `prize`: An update to the highest bid on a prize
- `unknown`: An unrecognized pubnub message
- `error`: Something went wrong while handling the message

The details of what is contained in each type depend on pubnub - you should read the pubbot code.
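
As a minimal sketch of consuming the log, the function below counts entries by their `type` field. The name `count_message_types` is hypothetical, and nothing is assumed about the other fields of each entry, since those depend on pubnub.

```python
import json
from collections import Counter


def count_message_types(log_text: str) -> Counter:
    """Count entries of each `type` in the text of pubnub-log.json."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        counts[entry["type"]] += 1
    return counts
```

A `startup` count above one suggests pubbot restarted during the period, so messages may be missing around each restart.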

## Mastodon toots

`tootbot` watches the Desert Bus and VST mastodon accounts for updates.
It writes these update events to `tootbot.json`. The content of these depends on the mastodon API;
see `tootbot` for details.