wubloader/docs/DATA_FORMATS.md

A list of all the info we may potentially save, and what format it is in.

All filepaths are relative to the "base directory" used for storage. A unix filesystem (case-sensitive, no special characters except /) is assumed.

When not otherwise specified, a "hash" of content is a SHA-256 hash encoded as url-safe base64 without padding. For example, the empty string hashes to 47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU
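
As a minimal sketch of this hashing scheme, using only the Python standard library (the function name content_hash is hypothetical and not taken from the codebase):

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 of data as url-safe base64 with padding stripped."""
    digest = hashlib.sha256(data).digest()
    # urlsafe_b64encode uses '-' and '_' in place of '+' and '/'.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    # A 32-byte digest always encodes with one trailing '=', which we drop.
    return encoded.rstrip("=")
```

A SHA-256 digest is 32 bytes, so every hash produced this way is 43 characters long and may itself contain dashes and underscores.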

Video segments

Stream video data is saved in MPEG-TS segments, with filepaths:

CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts

Where:

  • HOUR is %Y-%m-%dT%H
  • TIME is %M-%S
  • DURATION is a non-negative float value
  • TYPE is one of:
    • full - A normal segment
    • suspect - A segment which we suspect to not be fully correct
    • partial - A segment which we know to be incomplete
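
The segment filepath layout can be parsed along the following lines. This is a sketch, not the project's own parser: the names SegmentInfo and parse_segment_path are hypothetical, HOUR is assumed to use the day-of-month directive %d, and seconds are read as a float in case they carry a fractional part. Because TIME contains a dash and HASH may contain dashes too, the filename is split at most four times from the left.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import PurePosixPath

SEGMENT_TYPES = {"full", "suspect", "partial"}


@dataclass
class SegmentInfo:
    channel: str
    quality: str
    start: datetime
    duration: timedelta
    type: str
    hash: str


def parse_segment_path(path: str) -> SegmentInfo:
    """Parse CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts (relative path)."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected 4 path components: {path!r}")
    channel, quality, hour, filename = parts
    if not filename.endswith(".ts"):
        raise ValueError(f"not a .ts segment: {path!r}")
    stem = filename[: -len(".ts")]
    # Fields: minute, second, duration, type, hash. The hash is last, so any
    # dashes inside it are preserved by limiting the number of splits.
    fields = stem.split("-", 4)
    if len(fields) != 5:
        raise ValueError(f"malformed segment filename: {filename!r}")
    minute, second, duration, type_, hash_ = fields
    if type_ not in SEGMENT_TYPES:
        raise ValueError(f"unknown segment type: {type_!r}")
    start = datetime.strptime(hour, "%Y-%m-%dT%H") + timedelta(
        minutes=int(minute), seconds=float(second)
    )
    return SegmentInfo(
        channel=channel,
        quality=quality,
        start=start,
        duration=timedelta(seconds=float(duration)),
        type=type_,
        hash=hash_,
    )
```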

It is assumed that any set of segments can be concatenated to produce a playable video (once timestamps have been corrected).

Chat logs

Chat logs are saved in "batch" files with filepaths:

CHANNEL/chat/HOUR/TIME-HASH.json

Where HOUR and TIME are as per segment files.

Each batch file is newline-delimited JSON containing chat log entries. Each entry corresponds to an IRC PRIVMSG or other command like a JOIN, CLEARCHAT or ROOMSTATE. An entry has a time and a time_range field, with the estimated "true" time of the message being within the range [time, time + time_range]. The receivers field contains each unique chat archiver instance that observed this message, and the timestamp at which it received it. This information is primarily for debugging purposes.
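
To illustrate, a batch file can be read and each entry's time bounds computed roughly as follows. This is a sketch under assumptions: the helper names read_batch and time_bounds are hypothetical, and time and time_range are assumed to be numbers in the same unit, which this document does not specify.

```python
import json


def read_batch(text: str) -> list:
    """Parse the contents of a batch file: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def time_bounds(entry: dict) -> tuple:
    """Return (earliest, latest) possible true time for a chat log entry."""
    # Assumption: both fields are numeric and share a unit.
    earliest = entry["time"]
    return earliest, earliest + entry["time_range"]
```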

Batches from multiple machines will be merged periodically, so while it may be possible to observe two batches for the same time, this should be temporary. Messages may be present in both batches.

Blog posts

Website blog posts are captured by blogbot with filepaths:

blogs/ID-HASH.json

Multiple files with the same ID represent edits of the same blog post. Each JSON file contains an object with the html content plus some other metadata.

Coverage maps

segment-coverage generates coverage map images with filepath:

coverage-maps/CHANNEL_QUALITY_coverage.png

along with an html file of the same name which shows the image with a periodic refresh.

The image files show one pixel per 2 seconds, with the color depending on the coverage state at that time. See segment-coverage for more detail.

Emotes

Emote data is saved by chat-archiver for each unique emote seen in chat, with filepath:

emotes/ID/{light,dark}-{1.0,2.0,3.0}

These 6 files per emote represent all the variants of the emote image that twitch provides.

Each file is either a PNG or a GIF - consult file magic values (the first 4 bytes) to determine which.
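
A sketch of that magic-value check, with a hypothetical helper name. PNG files begin with the bytes 89 50 4E 47 and GIF files begin with the ASCII text GIF8 (as part of GIF87a or GIF89a), so the first 4 bytes are enough to tell the two apart.

```python
def emote_image_type(first_bytes: bytes) -> str:
    """Classify an emote file as 'png' or 'gif' from its first 4 bytes."""
    magic = first_bytes[:4]
    if magic == b"\x89PNG":
        return "png"
    if magic == b"GIF8":
        return "gif"
    raise ValueError(f"unrecognized image magic: {magic!r}")
```

In practice this would be called with the result of reading the first few bytes of one of the emote files, for example open(path, "rb").read(4).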

Downloaded media

Several components will download arbitrary media when they see a URL:

  • in a blog or social media post
  • in the media links column of the sheet
  • in chat

Because these links are potentially untrusted, we exercise a high degree of caution in fetching them. See common/common/media.py for details.

These files are saved in:

media/URLHASH/FILEHASH.EXT

Where URLHASH and FILEHASH are hashed in the normal way, but of the request URL and response content respectively. EXT is guessed based on the content-type and may not be correct.

Note that the URL hash covers the full request URL, including any query string. A file may be retrieved multiple times; if this results in different content, multiple files will be present under the same URL hash.
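
The path construction can be sketched as follows. This is illustrative only and is not the code in common/common/media.py: the function names are hypothetical, the URL is assumed to be hashed as its UTF-8 bytes, and the extension is passed in by the caller rather than guessed from a content-type.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 as url-safe base64 without padding (the 'normal' hash)."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def media_path(url: str, content: bytes, ext: str) -> str:
    """Build media/URLHASH/FILEHASH.EXT for a fetched URL and its response body."""
    # Assumption: the URL is hashed exactly as requested, encoded as UTF-8,
    # so differing query strings yield different URLHASH directories.
    url_hash = content_hash(url.encode("utf-8"))
    file_hash = content_hash(content)
    return f"media/{url_hash}/{file_hash}.{ext}"
```

Two fetches of the same URL that return different bodies share the URLHASH directory but produce different FILEHASH names, which matches the behaviour described above.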

Pubnub data

pubbot watches known PubNub streams and saves an event log to pubnub-log.json. This is a newline-delimited JSON file containing messages which can be distinguished by the type field:

  • startup: Records that pubbot just started. Messages immediately preceding it may have been missed.
  • total: An update to the donation total
  • prize: An update to the highest bid on a prize
  • unknown: An unrecognized pubnub message
  • error: Something went wrong while handling the message

The details of what is contained in each type depend on pubnub - you should read the pubbot code.
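
As a rough illustration of working with this log, the sketch below tallies messages by their type field. The function name is hypothetical, and nothing is assumed about the other fields in each message.

```python
import json
from collections import Counter


def count_by_type(lines) -> Counter:
    """Count pubnub log messages per 'type', given an iterable of text lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        message = json.loads(line)
        counts[message["type"]] += 1
    return counts
```

An open file object can be passed directly, since iterating over it yields one line at a time. A count of startup messages greater than one suggests pubbot restarted during the period the log covers.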

Mastodon toots

tootbot watches the Desert Bus and VST Mastodon accounts for updates. It writes these update events to tootbot.json. The content of these events depends on the Mastodon API; see tootbot for details.