mirror of https://github.com/ekimekim/wubloader
A list of all the info we may potentially save, and what format it is in.

All filepaths are relative to the "base directory" used for storage.
A unix filesystem (case-sensitive, no special characters except `/`) is assumed.

When not otherwise specified, a "hash" of content is a sha256 hash encoded to url-safe base64 without padding.
For example, the empty string hashes to `47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU`

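As an illustration of the hash definition above, here is a minimal Python sketch. The function name `content_hash` is hypothetical and not taken from the codebase; it only uses the standard library's `hashlib` and `base64` modules.

```python
import base64
import hashlib


def content_hash(content: bytes) -> str:
    """Hash content as sha256, encoded to url-safe base64 without padding.

    `content_hash` is an illustrative name, not the project's own helper.
    """
    digest = hashlib.sha256(content).digest()
    # urlsafe_b64encode uses '-' and '_' in place of '+' and '/'.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    # A 32-byte digest always encodes with one trailing '=', which we strip.
    return encoded.rstrip("=")
```

A sha256 digest is 32 bytes, so the resulting string is always 43 characters long and is safe to use as a filename component.
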
## Video segments

Stream video data is saved in MPEG-TS segments, with filepaths:
`CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts`
Where:
- HOUR is `%Y-%m-%dT%H`
- TIME is `%M-%S`
- DURATION is a non-negative float value
- TYPE is one of:
  - `full` - A normal segment
  - `suspect` - A segment which we suspect to not be fully correct
  - `partial` - A segment which we know to be incomplete

It is assumed that any set of segments can be concatenated to produce a playable video
(once timestamps have been corrected).

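The segment path layout above can be parsed roughly as follows. This is a sketch under stated assumptions, not the project's own parser: `SegmentInfo` and `parse_segment_path` are hypothetical names, TIME is assumed to hold whole seconds exactly as `%M-%S` implies, and DURATION is assumed to be written in plain decimal form (no `-` in it). Because a url-safe base64 hash may itself contain `-`, the filename is split at most four times so the hash stays intact.

```python
import datetime
from typing import NamedTuple

SEGMENT_TYPES = ("full", "suspect", "partial")


class SegmentInfo(NamedTuple):
    channel: str
    quality: str
    start: datetime.datetime
    duration: float
    type: str
    hash: str


def parse_segment_path(path: str) -> SegmentInfo:
    """Parse CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts (illustrative)."""
    channel, quality, hour, filename = path.split("/")
    if not filename.endswith(".ts"):
        raise ValueError(f"not a segment file: {filename!r}")
    stem = filename[: -len(".ts")]
    # TIME is MM-SS, so the first two parts are minute and second.
    # Limit the split to 4 so dashes inside the hash are preserved.
    minute, second, duration, type_, hash_ = stem.split("-", 4)
    if type_ not in SEGMENT_TYPES:
        raise ValueError(f"unknown segment type: {type_!r}")
    start = datetime.datetime.strptime(
        f"{hour}:{minute}:{second}", "%Y-%m-%dT%H:%M:%S"
    )
    return SegmentInfo(channel, quality, start, float(duration), type_, hash_)
```

For example, `parse_segment_path("somechannel/source/2023-11-12T03/15-42-2.0-full-HASH.ts")` would give a start time of 03:15:42 on 2023-11-12 and a duration of 2.0 seconds; the channel and quality names here are made up.
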
## Chat logs

Chat logs are saved in "batch" files with filepaths:
`CHANNEL/chat/HOUR/TIME-HASH.json`
Where HOUR and TIME are as per segment files.

Each batch file is newline-delimited json containing chat log entries.
Each entry corresponds to an IRC PRIVMSG or other command like a JOIN, CLEARCHAT or ROOMSTATE.
An entry has a `time` and a `time_range` field, with the estimated "true" time of the message
being within the range `[time, time + time_range]`.
The `receivers` field contains each unique chat archiver instance that observed this message,
and the timestamp at which they received it. This information is primarily for debugging purposes.

Batches from multiple machines will be merged periodically,
so while it may be possible to observe two batches for the same time, this should be temporary.
Messages may be present in both batches.

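To show how the `time` and `time_range` fields might be used, here is a small sketch that loads a batch and selects entries whose possible time range overlaps a window. It assumes both fields are numbers in seconds, which this document does not state, and the function names `load_batch` and `entries_possibly_in` are hypothetical.

```python
import json


def load_batch(text: str) -> list:
    """Parse newline-delimited JSON into a list of entries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def entries_possibly_in(entries: list, start: float, end: float) -> list:
    """Return entries whose true time may fall within [start, end].

    An entry's true time lies in [time, time + time_range], so it can only
    be in the window if that interval overlaps [start, end].
    Assumes `time` and `time_range` are numeric seconds (an assumption).
    """
    return [
        entry
        for entry in entries
        if entry["time"] <= end and entry["time"] + entry["time_range"] >= start
    ]
```

Since the same message may appear in two batches before they are merged, a consumer reading several batches for the same period should expect duplicates.
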
## Blog posts

Website blog posts are captured by blogbot with filepaths:
`blogs/ID-HASH.json`
Multiple files with the same ID represent edits of the same blog post.
Each JSON file contains an object with the html content plus some other metadata.

## Coverage maps

`segment-coverage` generates coverage map images with filepath:
`coverage-maps/CHANNEL_QUALITY_coverage.png`
along with an HTML file of the same name which shows the image with a periodic refresh.

The image files show one pixel per 2 seconds, with the color depending on the coverage state at that time.
See `segment-coverage` for more detail.

## Emotes

Emote data is saved by chat-archiver for each unique emote seen in chat, with filepath:
`emotes/ID/{light,dark}-{1.0,2.0,3.0}`
These 6 files per emote represent all the variants of the emote image that twitch provides.

Each file is either a PNG or a GIF - consult file magic values (the first 4 bytes) to determine which.

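As a sketch of the magic-value check described above: PNG files begin with the bytes `\x89PNG` and GIF files begin with `GIF8` (either `GIF87a` or `GIF89a`). The helper names `emote_paths` and `emote_image_type` are hypothetical.

```python
def emote_paths(emote_id: str) -> list:
    """List the 6 variant paths for one emote, relative to the base directory."""
    return [
        f"emotes/{emote_id}/{theme}-{scale}"
        for theme in ("light", "dark")
        for scale in ("1.0", "2.0", "3.0")
    ]


def emote_image_type(data: bytes) -> str:
    """Return "png" or "gif" based on the first 4 bytes of the file content."""
    magic = data[:4]
    if magic == b"\x89PNG":
        return "png"
    if magic == b"GIF8":
        return "gif"
    raise ValueError(f"unrecognized magic bytes: {magic!r}")
```

In practice only the first 4 bytes of each file need to be read, for example with `open(path, "rb").read(4)`.
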
## Downloaded media

Several components will download arbitrary media when they see a URL:
- in a blog or social media post
- in the media links column of the sheet
- in chat

Because these links are potentially untrusted, we exercise a high degree of caution in fetching them.
See `common/common/media.py` for details.

These files are saved in:
`media/URLHASH/FILEHASH.EXT`
Where URLHASH and FILEHASH are hashed in the normal way, but of the request URL and response content respectively.
EXT is guessed based on the content-type and may not be correct.

Note the URL hash will include any query string, etc.
A file may be retrieved multiple times. If this results in different content, then multiple files will be present
under the same URL hash.

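The path construction above can be sketched as follows. This is illustrative only and is not the code in `common/common/media.py`: the names `content_hash` and `media_path` are hypothetical, the URL is assumed to be encoded as UTF-8 before hashing, and the extension is passed in rather than guessed from a content-type.

```python
import base64
import hashlib


def content_hash(content: bytes) -> str:
    """sha256, url-safe base64, no padding (illustrative helper)."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def media_path(url: str, content: bytes, ext: str) -> str:
    """Build media/URLHASH/FILEHASH.EXT for a fetched URL and its response body.

    Assumes the full request URL (including any query string) is hashed
    as UTF-8 bytes; that encoding choice is an assumption.
    """
    url_hash = content_hash(url.encode("utf-8"))
    file_hash = content_hash(content)
    return f"media/{url_hash}/{file_hash}.{ext}"
```

Two URLs that differ only in their query string land in different URLHASH directories, while two different responses for the same URL share a directory and differ in FILEHASH.
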
## Pubnub data

`pubbot` watches known PubNub streams and saves an event log `pubnub-log.json`.
This is a newline-delimited json file containing messages which can be distinguished by the `type` field:
- `startup`: Records that pubbot just started. May be used to imply there may have been missed messages preceding it.
- `total`: An update to the donation total
- `prize`: An update to the highest bid on a prize
- `unknown`: An unrecognized pubnub message
- `error`: Something went wrong while handling the message

The details of what is contained in each type depend on pubnub - you should read the pubbot code.

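A reader of the event log might split it by `type` along these lines. This is a minimal sketch: only the `type` field is taken from the description above, the function names are hypothetical, and no other fields are assumed.

```python
import json
from collections import defaultdict


def group_by_type(log_text: str) -> dict:
    """Group newline-delimited JSON messages by their `type` field."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        message = json.loads(line)
        groups[message["type"]].append(message)
    return dict(groups)


def startup_positions(log_text: str) -> list:
    """Return the indexes of `startup` messages among all messages.

    Each one marks a point before which messages may have been missed.
    """
    messages = [json.loads(l) for l in log_text.splitlines() if l.strip()]
    return [i for i, m in enumerate(messages) if m["type"] == "startup"]
```
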
## Mastodon toots

`tootbot` watches the Desert Bus and VST mastodon accounts for updates.
It writes these update events to `tootbot.json`. The content of these depends on the mastodon API;
see `tootbot` for details.