Your Data is not Your Data

Hello, wildcats.

Using Google Takeout, you can export your Google data.

I use this specifically to export just my YouTube watch history.

I frequently find myself in situations where I am doing data science on my own activity history because some brainworm tells me “hey I'd like to revisit this thing I once visited” even though it was years ago and it will be a pain in the ass to find it again.

A screenshot of me, 6 years ago, posting a YouTube link on Discord and lamenting that I cannot find another meme video which uses it as source material – https://youtu.be/ZKxhI4I5kq8

To export your YouTube history as JSON, follow these steps.

  1. Visit https://takeout.google.com/
  2. Top right, profile switcher, switch to your brand account (my YouTube account is separate from my Google account)
  3. Deselect all
  4. Scroll to the bottom, YouTube > Enable
  5. “Multiple formats” > switch to JSON
  6. “All YouTube data included” > Deselect all, check history
  7. Next step > File type=.zip, File size=50 GB
  8. Create export

Congrats. You now have, locally, a slice of your watch history, instead of being beholden to the YouTube interface, which is rarely sufficient for querying purposes.

What does the data look like?

{
  "header": "YouTube",
  "title": "Watched The monkey is furiously knocking at the door - Обезьяна неистово стучит в дверь - 猴子是疯狂地在敲门",
  "titleUrl": "https://www.youtube.com/watch?v\u003d3-_OIDRL91c",
  "subtitles": [{
    "name": "Seen that! Видал, чо!",
    "url": "https://www.youtube.com/channel/UCnEelfUE8SE_rZtwaRzUzyQ"
  }],
  "time": "2020-04-19T03:08:27.981Z",
  "products": ["YouTube"],
  "activityControls": ["YouTube watch history"]
},
{
  "header": "YouTube",
  "title": "Watched https://www.youtube.com/watch?v\u003dnmcuoaqdJ9w",
  "titleUrl": "https://www.youtube.com/watch?v\u003dnmcuoaqdJ9w",
  "time": "2020-04-17T18:22:47.173Z",
  "products": ["YouTube"],
  "activityControls": ["YouTube watch history"]
}

The URL and the timestamp are present. Great!

The video title is inconsistently present. Less great!
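A quick way to see how bad it is: load the file and count the entries whose title is just "Watched <url>". A minimal sketch, assuming the default Takeout layout of Takeout/YouTube and YouTube Music/history/watch-history.json (the folder name varies by account locale):

import json
from pathlib import Path

# Path inside the extracted Takeout zip; the "YouTube and YouTube Music" folder name varies by locale.
history_path = Path("Takeout/YouTube and YouTube Music/history/watch-history.json")
entries = json.loads(history_path.read_text(encoding="utf-8"))

# Entries whose title could not be resolved just say "Watched <url>".
untitled = [e for e in entries if e.get("title", "").startswith("Watched https://")]

print(f"{len(entries)} entries total, {len(untitled)} with no usable title")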

This helpful StackOverflow comment tells us that we can use the following YouTube endpoint to get some metadata:

// https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v=nmcuoaqdJ9w
{
    "title": "Weird Al SHREDS!!!",
    "author_name": "alyankovic",
    "author_url": "https://www.youtube.com/@alyankovic",
    "type": "video",
    "height": 113,
    "width": 200,
    "version": "1.0",
    "provider_name": "YouTube",
    "provider_url": "https://www.youtube.com/",
    "thumbnail_height": 360,
    "thumbnail_width": 480,
    "thumbnail_url": "https://i.ytimg.com/vi/nmcuoaqdJ9w/hqdefault.jpg",
    "html": "<iframe width=\"200\" height=\"113\" src=\"https://www.youtube.com/embed/nmcuoaqdJ9w?feature=oembed%5C#34; frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen title=\"Weird Al SHREDS!!!\"></iframe>"
}

So I guess that would be a fairly straightforward way to enrich the data.
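Something like the following would do it. This is only a sketch: the polite sleep, the output filename, and the error handling are my own choices, and the oembed endpoint returns 4xx for private or deleted videos.

import json
import time
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

history_path = Path("Takeout/YouTube and YouTube Music/history/watch-history.json")
entries = json.loads(history_path.read_text(encoding="utf-8"))

def oembed(video_url: str) -> dict | None:
    # Ask YouTube's oembed endpoint for title/channel metadata; None for private/deleted videos.
    query = urllib.parse.urlencode({"url": video_url, "format": "json"})
    try:
        with urllib.request.urlopen(f"https://www.youtube.com/oembed?{query}") as resp:
            return json.load(resp)
    except urllib.error.HTTPError:
        return None

for entry in entries:
    if entry.get("title", "").startswith("Watched https://") and "titleUrl" in entry:
        meta = oembed(entry["titleUrl"])
        if meta is not None:
            entry["title"] = f"Watched {meta['title']}"
            entry.setdefault("subtitles", [{"name": meta["author_name"], "url": meta["author_url"]}])
        time.sleep(0.5)  # be gentle with the endpoint

Path("watch-history.enriched.json").write_text(json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8")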

That's not what I'm deep in right now though.

The Takeout service responds in a matter of minutes when we have scoped the export to just our YouTube watch history and nothing else.

It is still a manual process, and each export quickly goes stale given how frequently I watch videos.

I find myself having multiple exports, each with a different slice of my history.

To free up disk space, is it truly safe to simply delete the oldest export?

Using ChatGPT (conversation link), I whipped up a quick validation program that takes the search and watch history JSON files from the latest export and an older export and checks some assumptions (the first check is sketched below the list).

  1. The newest export MUST contain every entry in the older export.
  2. The newest export MUST NOT contain any entry that is older than the newest entry in the older export but absent from the older export.
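For reference, here is roughly what check #1 boils down to. My actual script came out of that ChatGPT conversation; keying entries on titleUrl plus timestamp, and the export folder names, are assumptions I'm making in this sketch.

import json
from pathlib import Path

def load_keys(path: str) -> set[tuple[str, str]]:
    # Key each entry by (titleUrl, time); entries without a URL fall back to their title text.
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    return {(e.get("titleUrl", e.get("title", "")), e["time"]) for e in entries}

old = load_keys("takeout-2024-10-30/watch-history.json")
new = load_keys("takeout-2024-12-07/watch-history.json")

# Assumption 1: everything in the older export should also be in the newer one.
missing = old - new
print(f"Summary: {len(missing)} total missing entries in the Watch History file.")
for url, when in sorted(missing, key=lambda k: k[1]):
    print(f"  {when}  {url}")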

I didn't get to number 2 because number 1 was so thoroughly disproven.

THERE IS MISSING DATA BETWEEN EXPORTS.

The exports are from 2024-10-30 and 2024-12-07.

Summary: 993 total missing entries in the Watch History file.
Summary: 42 total missing entries in the Search History file.

This is not surprising, just disappointing.

Thankfully, using ChatGPT I was able to build a tool to identify the problem quite easily.

Banana Loof – NSA Releases Internal 1982 Lecture by Computing Pioneer Rear Admiral Grace Hopper

00:08:30

“No work, no research has been done on the value of information. We've completely failed to look at it. And yet it's going to make a tremendous difference in how we run our computer systems of the future. Because if there are two things that are dead sure, I don't even have to call them predictions. One is that the amount of data and the amount of information will continue to increase, and it's more than linear. And the other is the demand for instant access to that information will increase, and those two are in conflict. We've got to know something about the value of the information being processed. Everybody wants their information online.”

I think about that video a lot.

My browser extension + local server tool, Onboarder, lets me take notes in a text area it adds below the video player. The notes then get synced to a plaintext file on the disk.

https://github.com/TeamDman/Onboarder

I can use ripgrep to search through my notes incredibly efficiently.

I also made a program that lets me easily capture my system audio output to a .wav file, toggled on and off by hitting enter in the terminal.

https://github.com/TeamDman/audio-capture.git

I also have WhisperX running, which can transcribe a 1-hour video in 1 minute with incredible fidelity.

https://github.com/TeamDman/voice2text
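The transcription step is roughly the stock WhisperX usage from its README; the model size, batch size, and the CUDA assumption are just what I'd reach for on my machine.

import whisperx

device = "cuda"             # assumes an NVIDIA GPU; "cpu" works but is far slower
audio_file = "capture.wav"  # output from the audio-capture tool

# Load the ASR model, then transcribe the captured system audio in batches.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print(f"[{segment['start']:.1f} - {segment['end']:.1f}] {segment['text']}")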

The process of finding that Grace Hopper video, capturing her saying that sentence, and transcribing it was a collaboration between several disjoint tools I have added to my arsenal.

We've all heard of Big Data.

I want my own Big Data that works for me.

Storage is cheap, and I want a copy of all my data so that when I say “computer, find me the meme from within the last 4 years matching XYZ criteria” it can do so.

The problem with building a grandiose system like this is not the work it will take, but charting the course.

How do I want to structure the data so that all these tools can play nice together?

The answer is probably Postgres.

It has support for vector embeddings (via pgvector), JSON columns, and generally all the stuff I'd need to proceed.
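Nothing is decided yet, but the kind of table I'm imagining looks something like this. Every name, the embedding dimension, and the connection string are placeholders, and it assumes a local Postgres with the pgvector extension available.

import psycopg2

# Placeholder connection string; assumes a local Postgres with pgvector installed.
conn = psycopg2.connect("dbname=bigdata user=teamdman")

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS watch_events (
            id          bigserial PRIMARY KEY,
            video_url   text NOT NULL,
            title       text,
            channel     text,
            watched_at  timestamptz NOT NULL,
            raw         jsonb NOT NULL,   -- the untouched Takeout entry
            embedding   vector(384),      -- pgvector column for semantic search
            UNIQUE (video_url, watched_at)
        );
    """)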

However, not everything can or should live in the database.

I should probably get building, or at least go to bed lol