I wrote a little pair of scripts today to download and archive my "Saved" media on Instagram. I first looked for an official API to do this, but it turns out there isn't one (at least, none I could find in a few minutes). So I decided to just scrape via internal APIs. The full scripts are here on GitHub Gist, though they may stop working at any time, obviously.
My final system ended up being in two parts:
- a "frontend" JavaScript snippet that runs in the browser console on the instagram.com domain, using the browser's stored credentials, to ping Instagram's internal APIs and generate a list of all the image URLs
- a "backend" Oak snippet that runs on my computer, locally, and downloads each image from the list of URLs to a unique filename.
Some interesting notes:
- They don't have a rate limit on their internal API (or it's very high, such that my nonstop sequential requests for many minutes never hit it).
- They have an extra layer of request authentication beyond cookies and CSRF: headers like `x-ig-www-claim` (an HMAC digest?) and `x-asbd-id`. They don't seem like message signatures, because I could vary the message without changing these values.
- Their primary GraphQL API is quite nice. Queries are referenced by build-time generated hashes, and responses support easy cursor-based pagination.
- Their internal API for media (by carousel, resolution, codec, etc.) is, just like Reddit's, kind of a mess, with field names like `image_versions2`. I'm guessing there's been lots of API churn?
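The cursor-based pagination mentioned above follows the usual GraphQL connection shape: each response carries a `page_info` object with `has_next_page` and `end_cursor`, and you feed `end_cursor` back as the next request's cursor. A minimal sketch of that loop, with the page fetcher injected so it stays self-contained (in the real script, `fetchPage` would hit the internal endpoint with the query hash; the field names here are assumptions based on the common connection convention, not a transcript of Instagram's responses):

```javascript
// Generic cursor-pagination loop in the GraphQL connection style.
// fetchPage(cursor) returns one page: { edges: [...], page_info: {...} }.
async function collectAll(fetchPage) {
  const items = [];
  let cursor = null;
  while (true) {
    const page = await fetchPage(cursor);      // one API round-trip
    items.push(...page.edges.map((e) => e.node));
    if (!page.page_info.has_next_page) break;  // no more pages
    cursor = page.page_info.end_cursor;        // feed cursor forward
  }
  return items;
}

// Example with a mock two-page fetcher:
const pages = [
  { edges: [{ node: "a" }, { node: "b" }], page_info: { has_next_page: true, end_cursor: "c1" } },
  { edges: [{ node: "c" }], page_info: { has_next_page: false, end_cursor: null } },
];
collectAll((cursor) => Promise.resolve(cursor === null ? pages[0] : pages[1]))
  .then((items) => console.log(items)); // items: ["a", "b", "c"]
```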