Sunday, April 10, 2016

Scraping for Cache, Or: It's Not Piracy If You Left It Out In The Open



Proudly included in the Spring 2016 edition of 2600 magazine!

As a student, I love having digital copies of my textbooks available. Ease of reference, portability, and minimal back strain are three reasons why finding digital copies of my books is hugely important to me. So I'm understandably annoyed when textbooks, especially ones that are several years old, can't be found online or are only available in online stores at exorbitant prices. Absolute madness!

A recent example of this heinous lack of electronic books is my APUSH textbook. It isn't terribly old (it was published in 2012), and an online edition is available from McGraw Hill. In theory, an online edition should mean the end of my accessibility problem (and back pain). However, to access the online edition, I'd need an access code from my school, something that they had not provided and could not provide to me. With all legitimate means of digital access exhausted, I would have to resort to other methods of enabling my laziness...

Enter Google.

As Google's web spiders crawl the web, they not only index web pages, but they also cache copies of pages, usually for a month or so, during which Google will host its own copy of the resource. In practice, this means that there's often a good chance that content on the web that has been deleted by a site owner is retrievable, so long as it has been indexed by Google. In fact, there are several organizations that exist to do exactly this sort of thing, most notably Archive.org, though Google is usually better about getting into the smaller cracks and crevices of the internet where my books are more likely to be stored. 

Research.

To start, I searched for exact strings taken from my textbook, which usually turns up a PDF scan of a section that's clear enough for Google to have run OCR on it. What I found, however, was something much, much better: Google had indexed pages directly from McGraw Hill's development servers, with cached copies that spanned the entire book!

And not only my APUSH textbook, but many, many others as well.


A few quick observations about the URLs led me to a way to programmatically download the book:
  • Each URL followed the same format, http://dev6.mhhe.com/textflowdev/genhtml/<ISBN>/<chapter>.<section>.htm
  • Cached versions of webpages could be easily retrieved from Google with the following URI format: https://webcache.googleusercontent.com/search?q=cache:<full URL of resource>.
  • My book has no more than 32 chapters, with no more than 7 sections per chapter, which means there are at most 32 × 7 = 224 pages to potentially retrieve.
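To make those observations concrete, here is a minimal Python sketch that only builds the list of candidate cache URLs. It is not the script from this post and makes no network requests. The function name `candidate_urls`, the placeholder ISBN, and the assumption that chapters and sections are numbered from 1 are all mine; the two URL formats and the 32 and 7 limits come from the list above.

```python
DEV_BASE = "http://dev6.mhhe.com/textflowdev/genhtml"
CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def candidate_urls(isbn, max_chapters=32, max_sections=7):
    """Yield (chapter, section, cache_url) for every page that might exist.

    Assumes chapters and sections are numbered starting from 1.
    """
    for chapter in range(1, max_chapters + 1):
        for section in range(1, max_sections + 1):
            page = f"{DEV_BASE}/{isbn}/{chapter}.{section}.htm"
            yield chapter, section, CACHE_PREFIX + page


# "0000000000" is a placeholder, not a real ISBN.
urls = list(candidate_urls("0000000000"))
print(len(urls))  # 224 candidates (32 chapters x 7 sections)
```

Not every candidate will correspond to a real page, since most chapters presumably have fewer than 7 sections, so anything that consumes this list would need to tolerate misses.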

The Script.

That said, I whipped up the following script, which, though simple, was able to completely retrieve my textbook from Google's cache and compile all of the downloaded HTML files into a single PDF:


Which left me with a complete, text-only copy of my book! Excellent!

*Note: The book has since been removed from Google's cache, rendering the script unusable in its current state.

So, what have we learned? 

As a company in an industry where piracy is a major concern, McGraw Hill should take extra precautions to ensure that all of its content, especially content that it's keen to monetize, is kept strictly under its control. This means securing any channel where this content could be exposed, which, in this case, was its dev servers. Even when a resource is deleted from a site, there are usually cached copies available somewhere, and once it's out on the web, it's out of your control.

Another thing that webmasters can gather from this is that all web content you host, even content hidden under several layers of obfuscation, may as well be considered wide open to the web unless some kind of authentication is implemented. If it exists on your server, accessible to anyone, then anyone will access it. 
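As a rough illustration of what "some kind of authentication" could look like at its simplest, here is a sketch of a check for an HTTP Basic `Authorization` header in Python. The function name `is_authorized` and the example credentials are hypothetical, and a real deployment would more likely rely on the web server's or framework's built-in authentication, served over HTTPS.

```python
import base64
import hmac


def is_authorized(auth_header, user, password):
    """Return True if an HTTP Basic Authorization header matches user/password."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(auth_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:
        # Covers malformed base64 and invalid UTF-8.
        return False
    expected = f"{user}:{password}"
    # Constant-time comparison avoids leaking how much of the secret matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


# Hypothetical credentials, for illustration only.
header = "Basic " + base64.b64encode(b"dev:s3cret").decode("ascii")
print(is_authorized(header, "dev", "s3cret"))
```

A crawler that sends no credentials fails this check, so pages behind it never get indexed or cached in the first place.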

With the above in mind, I hope that you've learned something about keeping your content, and channels that lead to your content, in check and under your control. 

Happy hacking!