wubloader/docs/DATA_FORMATS.md


A list of all the info we may potentially save, and what format it is in.

All filepaths are relative to the "base directory" used for storage.
A unix filesystem (case-sensitive, no special characters except `/`) is assumed.

When not otherwise specified, a "hash" of content is a sha256 hash encoded to url-safe base64 without padding.
For example, the empty string hashes to `47DEQpj8HBSa_-TImW_5JCeuQeRkm5NMpJWZG3hSuFU`.

## Video segments

Stream video data is saved in MPEG-TS segments, with filepaths:
`CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts`
Where:
- HOUR is `%Y-%m-%dT%H`
- TIME is `%M-%S`
- DURATION is a non-negative float value
- TYPE is one of:
    - `full` - A normal segment
    - `suspect` - A segment which we suspect to not be fully correct
    - `partial` - A segment which we know to be incomplete
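
The path layout above could be parsed roughly as follows. This is a sketch under stated assumptions, not the project's own parser: it assumes the day component of HOUR is the day of the month (`%d`), that DURATION is written in plain decimal form (no exponent), and the names `SegmentInfo` and `parse_segment_path` are hypothetical. Because the hash may itself contain `-`, the filename is split a fixed number of times from the left.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

SEGMENT_TYPES = ("full", "suspect", "partial")


class SegmentInfo(NamedTuple):
    channel: str
    quality: str
    start: datetime
    duration: timedelta
    type: str
    hash: str


def parse_segment_path(path: str) -> SegmentInfo:
    """Parse CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts into its parts."""
    channel, quality, hour, filename = path.split("/")
    if not filename.endswith(".ts"):
        raise ValueError(f"not a segment file: {filename!r}")
    stem = filename[: -len(".ts")]
    # TIME is MM-SS, so the first four dashes are separators;
    # any later dashes belong to the url-safe base64 hash.
    minute, second, duration, type_, hash_ = stem.split("-", 4)
    if type_ not in SEGMENT_TYPES:
        raise ValueError(f"unknown segment type: {type_!r}")
    start = datetime.strptime(f"{hour}-{minute}-{second}", "%Y-%m-%dT%H-%M-%S")
    return SegmentInfo(
        channel, quality, start, timedelta(seconds=float(duration)), type_, hash_
    )
```

For example, `parse_segment_path("somechannel/source/2023-11-10T14/05-30-2.0-full-47DEQpj8HBSa_-TImW_5JCeuQeRkm5NMpJWZG3hSuFU.ts")` would yield a segment starting at 14:05:30 on 2023-11-10 with a duration of 2 seconds (the channel and quality names here are made up).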

It is assumed that any set of segments can be concatenated to produce a playable video
(once timestamps have been corrected).

## Chat logs

Chat logs are saved in "batch" files with filepaths:
`CHANNEL/chat/HOUR/TIME-HASH.json`
Where HOUR and TIME are as per segment files.

Each batch file is newline-delimited json containing chat log entries.
Each entry corresponds to an IRC PRIVMSG or other command like a JOIN, CLEARCHAT or ROOMSTATE.
An entry has a `time` and a `time_range` field, with the estimated "true" time of the message
being within the range `[time, time + time_range]`.
The `receivers` field contains each unique chat archiver instance that observed this message,
and the timestamp at which they received it. This information is primarily for debugging purposes.
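
A batch file could be read along these lines. This is only a sketch: it assumes `time` and `time_range` are plain numbers in the same unit (the exact representation is not specified here), and the helper names are hypothetical.

```python
import json
from typing import Iterator, Tuple


def iter_entries(batch_text: str) -> Iterator[dict]:
    """Yield one decoded entry per non-empty line of a batch file."""
    for line in batch_text.splitlines():
        if line.strip():
            yield json.loads(line)


def time_bounds(entry: dict) -> Tuple[float, float]:
    """Return the [earliest, latest] estimate of the message's true time.

    Assumes `time` and `time_range` are numeric and share a unit.
    """
    return entry["time"], entry["time"] + entry["time_range"]
```

An entry with `time` of 100 and `time_range` of 2.5 would then be bounded by `(100, 102.5)`.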

Batches from multiple machines will be merged periodically,
so while it may be possible to observe two batches for the same time, this should be temporary.
Messages may be present in both batches.

## Blog posts

Website blog posts are captured by blogbot with filepaths:
`blogs/ID-HASH.json`
Multiple files with the same ID represent edits of the same blog post.
Each JSON file contains an object with the html content plus some other metadata.
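
Since both ID and HASH might contain `-`, one way to split a blog filename is to rely on the hash having a fixed length of 43 characters (an unpadded base64 sha256). The sketch below groups filenames by post ID; the function names are hypothetical and the approach is an assumption rather than how blogbot itself reads these files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

HASH_LEN = 43  # 32-byte sha256 digest as unpadded base64


def split_blog_filename(filename: str) -> Tuple[str, str]:
    """Split ID-HASH.json into (ID, HASH), taking the hash from the right."""
    if not filename.endswith(".json"):
        raise ValueError(f"not a blog file: {filename!r}")
    stem = filename[: -len(".json")]
    post_id = stem[: -HASH_LEN - 1]
    separator = stem[-HASH_LEN - 1 : -HASH_LEN]
    hash_ = stem[-HASH_LEN:]
    if separator != "-" or not post_id:
        raise ValueError(f"unexpected blog filename: {filename!r}")
    return post_id, hash_


def group_revisions(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Map each post ID to the hashes of all saved versions of that post."""
    revisions: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        post_id, hash_ = split_blog_filename(name)
        revisions[post_id].append(hash_)
    return dict(revisions)
```

Note that the filenames alone do not say in which order the edits happened; any ordering would have to come from metadata inside the files.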

## Coverage maps

`segment-coverage` generates coverage map images with filepath:
`coverage-maps/CHANNEL_QUALITY_coverage.png`
along with an HTML file of the same name which shows the image with a periodic refresh.

The image files show one pixel per 2 seconds, with the color depending on the coverage state at that time.
See `segment-coverage` for more detail.

## Emotes

Emote data is saved by chat-archiver for each unique emote seen in chat, with filepath:
`emotes/ID/{light,dark}-{1.0,2.0,3.0}`
These 6 files per emote represent all the variants of the emote image that Twitch provides.

Each file is either a PNG or a GIF - consult file magic values (the first 4 bytes) to determine which.
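
A minimal sketch of that check, with a hypothetical function name. PNG files begin with the bytes `\x89PNG` and GIF files with `GIF8` (as part of `GIF87a` or `GIF89a`):

```python
def emote_image_type(first_bytes: bytes) -> str:
    """Classify emote data as 'png' or 'gif' from its first 4 bytes."""
    magic = first_bytes[:4]
    if magic == b"\x89PNG":
        return "png"
    if magic == b"GIF8":
        return "gif"
    raise ValueError(f"unrecognized magic bytes: {magic!r}")
```

It can be called with the full file contents or with just the first few bytes read from disk.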

## Downloaded media

Several components will download arbitrary media when they see a URL:
- in a blog or social media post
- in the media links column of the sheet
- in chat

Because these links are potentially untrusted, we exercise a high degree of caution in fetching them.
See `common/common/media.py` for details.

These files are saved in:
`media/URLHASH/FILEHASH.EXT`
Where URLHASH and FILEHASH are hashed in the normal way, but of the request URL and response content respectively.
EXT is guessed based on the content-type and may not be correct.

Note the URL hash will include any query string, etc.
A file may be retrieved multiple times. If this results in different content, then multiple files will be present
under the same URL hash.

## Pubnub data

`pubbot` watches known PubNub streams and saves an event log `pubnub-log.json`.
This is a newline-delimited JSON file containing messages which can be distinguished by the `type` field:
- `startup`: Records that pubbot just started. May be used to infer that messages preceding it may have been missed.
- `total`: An update to the donation total
- `prize`: An update to the highest bid on a prize
- `unknown`: An unrecognized pubnub message
- `error`: Something went wrong while handling the message

The details of what is contained in each type depend on pubnub - you should read the pubbot code.
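
As a rough sketch of working with this log, the following counts messages by `type` and records the line numbers of `startup` events, which mark points where earlier messages may have been missed. Only the `type` field is relied on; the function name is hypothetical and nothing here reflects pubbot's own code.

```python
import json
from collections import Counter
from typing import List, Tuple


def summarize_log(log_text: str) -> Tuple[Counter, List[int]]:
    """Return (message count per type, 1-based line numbers of startup events)."""
    counts: Counter = Counter()
    startup_lines: List[int] = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        if not line.strip():
            continue
        message = json.loads(line)
        # Messages without a type field are counted under a placeholder key.
        message_type = message.get("type", "<missing>")
        counts[message_type] += 1
        if message_type == "startup":
            startup_lines.append(line_number)
    return counts, startup_lines
```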

## Mastodon toots

`tootbot` watches the Desert Bus and VST mastodon accounts for updates.
It writes these update events to `tootbot.json`. The content of these depends on the mastodon API;
see `tootbot` for details.