
Monday, May 16, 2016

We've Migrated to Medium

Hello and welcome to my now defunct Blogger instance!

This blog is no longer active and all content has been moved to a shiny new Medium publication (which now sits at http://blog.ctis.me).

There are numerous reasons for this, the chief of these being a much improved UI/UX, more powerful creation/publishing features, greater security (Medium serves content over HTTPS), and better integration with social services.

Direct link to Medium publication: https://medium.com/charltons-blog

Thank you, and enjoy!


-Charlton Trezevant

Sunday, April 10, 2016

Scraping for Cache, Or: It's Not Piracy If You Left It Out In The Open



Proudly included in the Spring 2016 edition of 2600 magazine!

As a student, I love to have digital copies of my textbooks available. Ease of reference, portability, and minimal back strain are three reasons why finding digital copies of my books is hugely important to me. Therefore, I'm understandably annoyed when textbooks, especially ones that are several years old, can't be found online or are available in online stores at exorbitant prices. Absolute madness!

A recent example of this heinous lack of electronic books would be my APUSH textbook. It isn't terribly old (in fact, it was published in 2012), and an online edition is available from McGraw Hill. In theory, an online edition should mean the end of my accessibility problem (and back pain); however, in order to access the online edition, I'd need an access code from my school, something that they had not and could not provide to me. With all legitimate means of digital access exhausted, I would have to resort to other methods of enabling my laziness...

Enter Google.

As Google's web spiders crawl the web, they not only index pages but also cache copies of them, usually for a month or so, during which Google hosts its own copy of the resource. In practice, this means there's often a good chance that content deleted by a site owner is still retrievable, so long as it has been indexed by Google. In fact, several organizations exist to do exactly this sort of thing, most notably Archive.org, though Google is usually better about getting into the smaller cracks and crevices of the internet where my books are more likely to be stored.

Research.

To start off, I searched for exact strings taken from my textbook, which usually turns up a PDF scan of a section that's clear enough for Google to run OCR on. What I found, however, was something much, much better: Google had indexed pages directly from McGraw Hill's development servers, with cached copies that spanned the entire book!

And not only my APUSH textbook, but many, many others as well: 


A few quick observations about these URLs led me to a way to programmatically download the book:
  • Each URL followed the same format, http://dev6.mhhe.com/textflowdev/genhtml/<ISBN>/<chapter>.<section>.htm
  • Cached versions of webpages could be easily retrieved from Google with the following URI format: https://webcache.googleusercontent.com/search?q=cache:<full URL of resource>.
  • My book has no more than 32 chapters, with no more than 7 sections per chapter, which means there are 224 pages to potentially retrieve in total.

The Script.

With those observations in mind, I whipped up the following script, which, though simple, was able to completely retrieve my textbook from Google's cache and compile all of the downloaded HTML files into a single PDF:
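
A rough sketch of the approach in Python (a sketch, not the original script; the ISBN, output filenames, and two-second delay below are placeholders) would look something like this:

    #!/usr/bin/env python3
    """Sketch: pull cached textbook pages out of Google's cache.

    Assumes the dev6.mhhe.com URL format and the webcache.googleusercontent.com
    cache format noted above. The ISBN below is a placeholder.
    """
    import time
    import urllib.error
    import urllib.request

    ISBN = "XXXXXXXXXX"  # placeholder: the book's ISBN as it appears in the URL
    SOURCE = "http://dev6.mhhe.com/textflowdev/genhtml/{isbn}/{ch}.{sec}.htm"
    CACHE = "https://webcache.googleusercontent.com/search?q=cache:"

    for ch in range(1, 33):      # no more than 32 chapters...
        for sec in range(1, 8):  # ...with no more than 7 sections each
            url = CACHE + SOURCE.format(isbn=ISBN, ch=ch, sec=sec)
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            try:
                html = urllib.request.urlopen(req).read()
            except urllib.error.HTTPError as err:
                # a 404 from the cache just means that section doesn't exist
                print(f"skipping {ch}.{sec}: {err.code}")
                continue
            with open(f"{ch}.{sec}.htm", "wb") as out:
                out.write(html)
            print(f"saved {ch}.{sec}")
            time.sleep(2)  # be gentle; hammering the cache invites a CAPTCHA

From there, the downloaded HTML files can be stitched into a single PDF with a converter such as wkhtmltopdf.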


Which left me with a complete, text-only copy of my book! Excellent!

*Note: The book has since been removed from Google's cache, rendering the script unusable in its current state.

So, what have we learned? 

As a company operating in an industry where piracy is a major concern, McGraw Hill should take extra precautions to ensure that all of its content, especially content it's keen to monetize, is kept strictly under its control. That means securing any channel where this content could be exposed, which, in this case, was their dev servers. Even when a resource is deleted from a site, there are usually cached copies available somewhere, and once it's out on the web, it's out of your control.

Another thing that webmasters can gather from this is that all web content you host, even content hidden under several layers of obfuscation, may as well be considered wide open to the web unless some kind of authentication is implemented. If it exists on your server, accessible to anyone, then anyone will access it. 

With the above in mind, I hope that you've learned something about keeping your content, and channels that lead to your content, in check and under your control. 

Happy hacking!

Tuesday, March 15, 2016

Simple, Encrypted Storage on Cloud Services

The Problem

I often find myself at a crossroads of privacy versus convenience, where I might have to trade certain amounts of one for the other. In the case of cloud storage, however, I've found Google Drive to be an incredibly useful tool for keeping everything in a central, synchronized place.

That said, there are certain things I like to have accessible that I wouldn't necessarily want lying around unencrypted (VPN configurations, password databases, etc.). Until recently, this presented a bit of a problem: I couldn't upload these files to Drive, but I also wanted access to them in the cases where I needed them most. In addition, I needed something simple and stable that worked out of the box on my machine and others' (in lieu of a full FUSE/EncFS stack).

Enter Disk Utility

Thankfully, OS X's Disk Utility application allows us to quickly and easily create encrypted disk images and volumes. Storing files within an encrypted image will ensure that they are kept secure and out of the reach of prying eyes, despite being kept on public cloud storage services. 

To create an encrypted disk image, simply go to File > New Image > Blank Image. This will pull up the blank image creation menu:


Here, you can specify the name of the image, its size (note that the size is fixed, not dynamically allocated: a 100MB disk image will stay 100MB even if it contains no files), format, and partition map.

I've filled in these settings with what I recommend, though you can tweak them to your liking. I do recommend using AES-256, as it is more secure than 128-bit and more or less the industry standard.

Another thing to note is that the Image Format field must be set to "read/write disk image" if you want the ability to copy files to/from the image freely. 

Once you've set a password for the image and clicked Save, Disk Utility will prepare your image, and you'll be off to the races.
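
If you'd rather script this than click through Disk Utility, the same kind of image can be created from the command line with hdiutil. Here's a rough Python sketch that simply shells out to hdiutil with settings comparable to the ones above (the image name, size, and volume label are examples, not requirements):

    #!/usr/bin/env python3
    """Sketch: create an AES-256 encrypted, fixed-size disk image via hdiutil."""
    import getpass
    import subprocess

    passphrase = getpass.getpass("Passphrase for the new image: ")

    subprocess.run(
        [
            "hdiutil", "create",
            "-size", "100m",           # fixed size, just like the GUI image
            "-fs", "HFS+J",            # journaled HFS+
            "-volname", "Vault",       # name the volume will mount under
            "-encryption", "AES-256",  # AES-256, as recommended above
            "-stdinpass",              # read the passphrase from stdin
            "Vault.dmg",
        ],
        input=passphrase.encode(),     # no trailing newline, or it becomes part of the passphrase
        check=True,
    )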

Using the Disk Image

After the image has been made, move it to the synced folder for your cloud service. To mount, double click the newly minted .dmg and enter the password you set when the image was created. The image will then be mounted like any normal disk, and you can copy files to/from it. 

When you're finished with your files, unmount the disk so that changes can be synced. Rinse and repeat!
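
Mounting and unmounting can be scripted the same way, which is handy if you want to automate the sync step. A short sketch, reusing the hypothetical Vault.dmg example from above:

    #!/usr/bin/env python3
    """Sketch: mount the encrypted image, then unmount it so it can sync."""
    import getpass
    import subprocess

    passphrase = getpass.getpass("Image passphrase: ")

    # Attach: the volume appears under /Volumes/<volume name>.
    subprocess.run(
        ["hdiutil", "attach", "Vault.dmg", "-stdinpass"],
        input=passphrase.encode(),
        check=True,
    )

    # ... copy files to/from /Volumes/Vault here ...

    # Detach so the .dmg stops changing and the sync client can upload it.
    subprocess.run(["hdiutil", "detach", "/Volumes/Vault"], check=True)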


Caveat

The only downside to this process that I've found so far is that after any change has been made to the image, the entire thing has to be reuploaded. This isn't a huge problem, as I tend to keep these images on the smaller side of things (1-200MB), but for larger images or slower connections, the upload time could be something of an issue. I suppose this is the price you pay for the added security ¯\_(ツ)_/¯

Monday, January 18, 2016

Cooling Down the Hotlinks: Taming Bandwidth Usage with No Compromises

Last year, I noticed an alarming increase in the outbound traffic on charltontrezevant.com:  I was serving nearly 5GB of photos per day. For a static site with <500 page views per week, it was immediately apparent that something had gone terribly wrong.

As it turns out, I had been receiving enormous amounts of traffic from internet forums that had been hotlinking images from my site, most of which were from the weed page.

Luckily, it was incredibly easy to put a stop to this. Returning once again to the powerful sorcery of Lighttpd, we can easily reject certain types of traffic (in this case, images requested from domains other than mine) while allowing legitimate requests to pass through:
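
A sketch of how such a rule might look in lighttpd's configuration (illustrative only, not the original snippet; the domain regex, whitelisted user-agents, and replacement image path are examples):

    # Illustrative sketch only. Image requests that arrive with a foreign
    # Referer (and not from a whitelisted crawler) get redirected to a single
    # replacement image. Requires mod_redirect.
    $HTTP["referer"] !~ "^($|https?://(www\.)?charltontrezevant\.com)" {
      $HTTP["useragent"] !~ "(Googlebot|Skype)" {
        # the replacement lives at a path the regex doesn't match, so the
        # redirect can't loop
        url.redirect = ( "\.(jpe?g|png|gif)$" => "http://charltontrezevant.com/nope" )
      }
    }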

Note that this snippet allows certain referrers and user-agents through (such as Googlebot, Skype Web Preview, etc.) so that legitimate uses are enabled.

This snippet will redirect all requests from invalid domains to this image, thus returning my bandwidth stats to sane levels while also expressing my mild irritation.

The aftermath scattered my face across the net, which was incredibly funny.