From 5e0e470821a31feab1afbc3f93bc7733d5e7c15d Mon Sep 17 00:00:00 2001
From: Mike Lang <mikelang3000@gmail.com>
Date: Sun, 21 Sep 2025 14:32:11 +1000
Subject: [PATCH] Add documentation on all the kinds of data we store in
 segments dir

---
 docs/DATA_FORMATS.md | 102 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)
 create mode 100644 docs/DATA_FORMATS.md

diff --git a/docs/DATA_FORMATS.md b/docs/DATA_FORMATS.md
new file mode 100644
index 0000000..ac5976c
--- /dev/null
+++ b/docs/DATA_FORMATS.md
@@ -0,0 +1,102 @@
+
+A list of all the info we may potentially save, and what format it is in.
+
+All filepaths are relative to the "base directory" used for storage.
+A unix filesystem (case-sensitive, no special characters except `/`) is assumed.
+
+When not otherwise specified, a "hash" of content is a sha256 hash encoded to url-safe base64 without padding.
+For example, the empty string hashes to `47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU`
+
+## Video segments
+
+Stream video data is saved in MPEG-TS segments, with filepaths:
+`CHANNEL/QUALITY/HOUR/TIME-DURATION-TYPE-HASH.ts`
+Where:
+- HOUR is `%Y-%m-%dT%H`
+- TIME is `%M-%S`
+- DURATION is a non-negative float value
+- TYPE is one of:
+    - `full` - A normal segment
+    - `suspect` - A segment which we suspect to not be fully correct
+    - `partial` - A segment which we know to be incomplete
+
+It is assumed that any set of segments can be concatenated to produce a playable video
+(once timestamps have been corrected).
+
+## Chat logs
+
+Chat logs are saved in "batch" files with filepaths:
+`CHANNEL/chat/HOUR/TIME-HASH.json`
+Where HOUR and TIME are as per segment files.
+
+Each batch file is newline-delimited json containing chat log entries.
+Each entry corresponds to an IRC PRIVMSG or other command like a JOIN, CLEARCHAT or ROOMSTATE.
+An entry has a `time` and a `time_range` field, with the estimated "true" time of the message
+being within the range `[time, time + time_range]`.
+The `receivers` field contains each unique chat archiver instance that observed this message,
+and the timestamp at which they received it. This information is primarily for debugging purposes.
+
+Batches from multiple machines will be merged periodically,
+so while it may be possible to observe two batches for the same time, this should be temporary.
+Messages may be present in both batches.
+
+## Blog posts
+
+Website blog posts are captured by blogbot with filepaths:
+`blogs/ID-HASH.json`
+Multiple files with the same ID represent edits of the same blog post.
+Each JSON file contains an object with the html content plus some other metadata.
+
+## Coverage maps
+
+`segment-coverage` generates coverage map images with filepath:
+`coverage-maps/CHANNEL_QUALITY_coverage.png`
+along with an HTML file of the same name which shows the image with a periodic refresh.
+
+The image files show one pixel per 2 seconds, with the color depending on the coverage state at that time.
+See `segment-coverage` for more detail.
+
+## Emotes
+
+Emote data is saved by chat-archiver for each unique emote seen in chat, with filepath:
+`emotes/ID/{light,dark}-{1.0,2.0,3.0}`
+These 6 files per emote represent all the variants of the emote image that Twitch provides.
+
+Each file is either a PNG or a GIF - consult file magic values (the first 4 bytes) to determine which.
+
+## Downloaded media
+
+Several components will download arbitrary media when they see a URL:
+- in a blog or social media post
+- in the media links column of the sheet
+- in chat
+
+Because these links are potentially untrusted, we exercise a high degree of caution in fetching them.
+See `common/common/media.py` for details.
+
+These files are saved in:
+`media/URLHASH/FILEHASH.EXT`
+Where URLHASH and FILEHASH are hashes (as defined above) of the request URL and the response content respectively.
+EXT is guessed based on the content-type and may not be correct.
+
+Note that the full request URL is hashed, including any query string.
+A URL may be fetched multiple times; if this results in different content, multiple files will be present
+under the same URL hash.
+
+## Pubnub data
+
+`pubbot` watches known PubNub streams and saves an event log `pubnub-log.json`.
+This is a newline-delimited json file containing messages which can be distinguished by the `type` field:
+- `startup`: Records that pubbot just started. Implies that messages preceding it may have been missed.
+- `total`: An update to the donation total
+- `prize`: An update to the highest bid on a prize
+- `unknown`: An unrecognized PubNub message
+- `error`: Something went wrong while handling the message
+
+The exact contents of each type depend on PubNub; see the pubbot code for details.
+
+## Mastodon toots
+
+`tootbot` watches the Desert Bus and VST Mastodon accounts for updates.
+It writes these update events to `tootbot.json`. The content of these events depends on the Mastodon API;
+see `tootbot` for details.