you can't actually stop scrapers
as you've no doubt heard, twitter's API access was paywalled a few months ago. for about a decade now, I've relied on that API to essentially scrape tweets for archival: dumping the status response to a JSON and downloading any attached media. it's how I made this neat little seaborn heatmap showing the timestamps from tweets I scraped, mostly from japanese accounts posting art. while I was starting to think I had fallen through the cracks since it kept working well after their supposed sunset date, my downloader's app token was finally revoked about a month ago. pretty much every third-party tool has been on the chopping block because developers have, rightfully, decided not to pay the bird's ridiculously prohibitive asking price for access. some poor suckers even paid up, only to have the read endpoints they relied on for retrieving followers and following lists removed without any warning, which is how I knew a few days ago that they were making sisyphean attempts to limit scraping. for sickos like me, though, the party has never stopped, because sickos don't play by the same rules as everyone else.
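for the curious, here's a rough sketch of how that heatmap falls out of the archive, assuming a directory of v1 status dumps with twitter's classic created_at strings; the paths and file layout are my own stand-ins:

```python
# sketch: hour-of-day x weekday heatmap from archived v1 status JSONs.
# assumes each dump carries twitter's classic created_at string, e.g.
# "Wed Oct 10 20:19:24 +0000 2018"; directory and file names are stand-ins.
import json
from datetime import datetime
from pathlib import Path

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

FMT = "%a %b %d %H:%M:%S %z %Y"  # v1.1 created_at format

stamps = []
for path in Path("archive").glob("*.json"):
    status = json.loads(path.read_text())
    stamps.append(datetime.strptime(status["created_at"], FMT))

df = pd.DataFrame({"ts": stamps})
df["hour"] = df.ts.dt.hour
df["weekday"] = df.ts.dt.day_name()

# pivot into weekday rows x hour-of-day columns, counting tweets per cell
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
grid = (df.pivot_table(index="weekday", columns="hour",
                       aggfunc="size", fill_value=0)
          .reindex(order))

sns.heatmap(grid, cmap="mako")
plt.show()
```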
twitter already has a storied history of trying to prevent scrapers, as does any other large site that wants to maintain its garden walls. while I'm only really familiar with the efforts aimed at taking content out, it's undoubtedly been a focal issue for twitter, since they've never managed to fully stem the tide of bot accounts in replies. some years back they made javascript a requirement for access and started fingerprinting headless browsers to prevent people from pilfering data that way. cutting off public read-only access and rate limiting statuses is only the latest of their scraper hysterics, and is better characterized as a blind shot to their own foot. even after all of these changes and blowing up developer API access, though, one way to take their data out has remained stable and trivially accessible: their internal graphQL API. this is how your web browser and phone apps communicate with twitter, and unlike the versioned developer APIs, it gives you access to everything within the same bounds as a normal web client. unless someone rewrites the entire client and application experience yet again (the time is ripe for new^6 twitter), it's probably going to be exploitable by sickos for a long time.
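to make that concrete, here's roughly what the logged-out graphQL flow has historically looked like; the bearer token and per-operation query id below are placeholders for values lifted from the web client's javascript bundle, and they rotate over time, so treat this as a sketch rather than a stable recipe:

```python
# sketch of the unauthenticated graphQL flow as it has historically worked:
# grab a guest token using the web client's public bearer, then call an
# internal query. BEARER and QUERY_ID are placeholders; the real values
# come from the web client's js bundle and change with client releases.
import json
import requests

BEARER = "AAAA..."     # placeholder: the web client's public bearer token
QUERY_ID = "XXXXXXXX"  # placeholder: per-operation id scraped from the bundle

session = requests.Session()
session.headers["Authorization"] = f"Bearer {BEARER}"

# guest token activation, the same call the web client makes when logged out
guest = session.post("https://api.twitter.com/1.1/guest/activate.json").json()
session.headers["x-guest-token"] = guest["guest_token"]

# graphQL operations take url-encoded JSON blobs for variables and features
variables = {"focalTweetId": "20", "includePromotedContent": False}
features = {}  # placeholder: the real client sends a long dict of flags

resp = session.get(
    f"https://twitter.com/i/api/graphql/{QUERY_ID}/TweetDetail",
    params={"variables": json.dumps(variables),
            "features": json.dumps(features)},
)
print(resp.status_code, resp.json())
```

the features dict is the annoying part in practice: the client ships dozens of flags, and the server tends to reject requests missing whichever ones it has decided to require that week.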
reviewing github turns up plenty of historical examples using this level of access, and prominent projects like nitter lean on it to essentially mirror the entire site. so of course, my downloader now hooks into graphQL the same way as everyone else, and can even hit endpoints that keep returning the same v1 status responses I've been hoovering up for a decade, because those endpoints are used to embed tweets. I'm guessing that specifically will be axed soon given today's changes, but I wouldn't call it inevitable considering you hilariously still can only upload tweets with media through a v1 endpoint¹.
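the embed path I'm talking about is the syndication CDN; a minimal sketch follows, with the caveat that the exact host and parameters have shifted over the years and may demand extra tokens by the time you read this:

```python
# sketch: fetching a tweet through the embed/syndication path, which has
# historically returned v1-style status JSON without any auth. the host
# and params here are assumptions based on what has worked in the past.
import requests

def fetch_embedded(tweet_id: str) -> dict:
    resp = requests.get(
        "https://cdn.syndication.twimg.com/tweet-result",
        params={"id": tweet_id, "lang": "en"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

status = fetch_embedded("20")
print(status.get("created_at"), status.get("text"))
```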
so, what's my point here? you can't actually stop someone from sucking up your site's content. search engines might have to obey robots directives, and the CFAA might make it scary to test what sort of access you can get away with, but you can't stop people from reviewing your content as long as your strongest gate is requiring an account. scrapers can throw enough time and money at anything to turn obfuscation into a cat and mouse game that never resolves, the same way DRM is always a lock that someone wants to break. metrics like account age will always be useful against spam, and sites are right to employ them, but it's a foolish exercise to think you can limit review of read-only content. if you have an internal API, someone dedicated enough will hammer against it until they figure out which parameters and access flows return useful results. any sane site that takes in user generated content realized this long ago, either by staying publicly indexable or by courting developers to build things on top of their core experience rather than passively stripping it for whatever it was worth, as scrapers undoubtedly do. twitter failed to understand this when they handicapped third-party clients, when they banned them outright, and again now that they've scorched their end-user experience over a problem they will never effectively beat.
it's sad to watch this happen as someone who used and liked twitter for a long time, however many years ago that actually was now. when the art tweets dry up, I'll finally be free. the scrapers will haunt them forever.
-
1. the twitter dev trello is still live and preserved in a very funny state from when it was last updated in october 2022: https://trello.com/b/myf7rKwV/twitter-developer-platform-roadmap