
Monday, May 16, 2016

We've Migrated to Medium

Hello and welcome to my now defunct Blogger instance!

This blog is no longer active and all content has been moved to a shiny new Medium publication (which now sits at http://blog.ctis.me).

There are numerous reasons for this, chief among them a much-improved UI/UX, more powerful creation and publishing features, greater security (Medium serves content over HTTPS), and better integration with social services.

Direct link to Medium publication: https://medium.com/charltons-blog

Thank you, and enjoy!


-Charlton Trezevant

Sunday, April 10, 2016

Scraping for Cache, Or: It's Not Piracy If You Left It Out In The Open



Proudly included in the Spring 2016 edition of 2600 magazine!

As a student, I love to have digital copies of my textbooks available. Ease of reference, portability, and minimal back strain are three reasons why finding digital copies of my books is hugely important to me. Therefore, I'm understandably annoyed when textbooks, especially ones that are several years old, can't be found online or are available in online stores at exorbitant prices. Absolute madness!

A recent example of this heinous lack of electronic books would be my APUSH textbook. It isn't terribly old (it was published in 2012), and an online edition is available from McGraw Hill. In theory, an online edition should mean the end of my accessibility problem (and back pain); however, in order to access the online edition, I'd need an access code from my school, something they hadn't provided and couldn't provide to me. With all legitimate means of digital access exhausted, I would have to resort to other methods of enabling my laziness...

Enter Google.

As Google's web spiders crawl the web, they not only index web pages but also cache copies of them, usually for a month or so, during which Google will host its own copy of the resource. In practice, this means there's a good chance that content deleted by a site owner is still retrievable, so long as it has been indexed by Google. In fact, several organizations exist to do exactly this sort of thing, most notably Archive.org, though Google is usually better at getting into the smaller cracks and crevices of the internet where my books are more likely to be stored.

Research.

To start off my search, I searched for exact strings taken from my textbook, which usually leads to a PDF scan of a section that's clear enough for Google to run OCR on. What I found, however, was something much, much better: Google had indexed pages directly from McGraw Hill's development servers, with cached copies that spanned the entire book!

And not only my APUSH textbook, but many, many others as well.


A few quick observations about these URLs led me to a way to programmatically download the book:
  • Each URL followed the same format, http://dev6.mhhe.com/textflowdev/genhtml/<ISBN>/<chapter>.<section>.htm
  • Cached versions of webpages could be easily retrieved from Google with the following URI format: https://webcache.googleusercontent.com/search?q=cache:<full URL of resource>.
  • My book has no more than 32 chapters, with no more than 7 sections per chapter, which means there are 224 pages to potentially retrieve in total.

The Script.

With that in mind, I whipped up a quick script which, though simple, was able to completely retrieve my textbook from Google's cache and compile all of the downloaded HTML files into a single PDF.
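
A shell sketch of the approach, for illustration only (it assumes curl and wkhtmltopdf are installed, and the ISBN value below is a placeholder for the book's actual identifier):

  #!/usr/bin/env bash
  # Pull every cached section of the book out of Google's cache, then stitch the
  # retrieved HTML files into a single PDF with wkhtmltopdf.

  ISBN="XXXXXXXXXX"   # placeholder for the ISBN segment seen in the dev6.mhhe.com URLs
  BASE="http://dev6.mhhe.com/textflowdev/genhtml/${ISBN}"
  CACHE="https://webcache.googleusercontent.com/search?q=cache:"
  UA="Mozilla/5.0"    # an arbitrary browser-like user agent

  mkdir -p book
  files=()

  for chapter in $(seq 1 32); do     # no more than 32 chapters...
    for section in $(seq 1 7); do    # ...with no more than 7 sections each
      page="${chapter}.${section}.htm"
      # -f treats HTTP errors (sections that don't exist) as failures, so they're skipped
      if curl -sf -A "$UA" "${CACHE}${BASE}/${page}" -o "book/${page}"; then
        files+=("book/${page}")
      fi
      sleep 2   # be polite; hammering the cache is a quick way to get rate limited
    done
  done

  # Everything that was retrieved, in chapter/section order, becomes one PDF.
  wkhtmltopdf "${files[@]}" textbook.pdf

Requesting all 224 chapter/section combinations and keeping whatever comes back is simpler than trying to discover the book's real table of contents first.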


Which left me with a complete, text-only copy of my book! Excellent!

*Note: The book has since been removed from Google's cache, rendering the script unusable in its current state.

So, what have we learned? 

As a company dealing in an industry where piracy is a major concern, McGraw Hill should take extra precautions to ensure that all of its content, especially content it's keen to monetize, is kept strictly under its control. This means securing any channel through which this content could be exposed, which, in this case, was its development servers. Even when a resource is deleted from a site, there are usually cached copies available somewhere, and once something is out on the web, it's out of your control.

Another thing that webmasters can gather from this is that all web content you host, even content hidden under several layers of obfuscation, may as well be considered wide open to the web unless some kind of authentication is implemented. If it exists on your server, accessible to anyone, then anyone will access it. 

With the above in mind, I hope that you've learned something about keeping your content, and channels that lead to your content, in check and under your control. 

Happy hacking!

Tuesday, March 15, 2016

Simple, Encrypted Storage on Cloud Services

The Problem

I often find myself at a crossroads of privacy versus convenience, where I might have to trade certain amounts of one for the other. In the case of cloud storage, however, I've found Google Drive to be an incredibly useful tool for keeping everything in a central, synchronized place.

That said, there are certain things I like to have accessible that I wouldn't necessarily want lying around unencrypted (VPN configurations, password databases, etc.). Until recently, this presented a bit of a problem: I couldn't upload these files to Drive, but I also wanted access to them in the cases where I needed them most. In addition, I needed something simple and stable that worked out of the box on my machine and others' (in lieu of a full FUSE/EncFS stack).

Enter Disk Utility

Thankfully, OS X's Disk Utility application allows us to quickly and easily create encrypted disk images and volumes. Storing files within an encrypted image will ensure that they are kept secure and out of the reach of prying eyes, despite being kept on public cloud storage services. 

To create an encrypted disk image, simply go to File > New Image > Blank Image. This will pull up the blank image creation menu.


Here, you can specify the name of the image, its size (note that the size is fixed, not dynamically allocated: a 100MB disk image will stay 100MB even if it contains no files), its format, and its partition map.

I've filled in these settings with my recommendations, though you can tweak them to your liking. I do recommend using AES-256, as it is more secure than AES-128 and more or less the industry standard.

Another thing to note is that the Image Format field must be set to "read/write disk image" if you want the ability to copy files to/from the image freely. 

Once you've set a password for the image and clicked Save, Disk Utility will prepare your image, and you'll be off to the races.
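
If you'd rather script this than click through the dialog, hdiutil can do the same job from the command line. A quick sketch with example names and the settings suggested above:

  # Roughly the Disk Utility recipe above, from the command line.
  # "Vault" and 100m are example values; -format UDRW is the read/write image format,
  # and hdiutil prompts for the image's passphrase because -encryption is set.
  hdiutil create -size 100m -fs "HFS+" -volname "Vault" \
      -encryption AES-256 -format UDRW Vault.dmg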

Using the Disk Image

After the image has been made, move it to the synced folder for your cloud service. To mount it, double-click the newly minted .dmg and enter the password you set when the image was created. The image will then mount like any normal disk, and you can copy files to and from it.

When you're finished with your files, unmount the disk so that changes can be synced. Rinse and repeat!
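
The mount/work/unmount cycle can be scripted the same way; another sketch, with example paths:

  # Mount the image (you'll be prompted for its passphrase), work with the files,
  # then detach it so the .dmg is in a consistent state before it syncs.
  # The paths here are examples only.
  hdiutil attach "$HOME/Google Drive/Vault.dmg"
  cp ~/Documents/passwords.kdbx /Volumes/Vault/
  hdiutil detach /Volumes/Vault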


Caveat

The only downside to this process that I've found so far is that after any change is made to the image, the entire thing has to be re-uploaded. This isn't a huge problem, as I tend to keep these images on the smaller side of things (1-200MB), but for larger images or slower connections, the upload time could be something of an issue. I suppose this is the price you pay for the added security ¯\_(ツ)_/¯

Monday, January 18, 2016

Cooling Down the Hotlinks: Taming Bandwidth Usage with No Compromises

Last year, I noticed an alarming increase in the outbound traffic on charltontrezevant.com:  I was serving nearly 5GB of photos per day. For a static site with <500 page views per week, it was immediately apparent that something had gone terribly wrong.

As it turns out, I had been receiving enormous amounts of traffic from internet forums that had been hotlinking images from my site, most of which were from the weed page.

Luckily, it was incredibly easy to put a stop to this. Returning once again to the powerful sorcery of Lighttpd, we can easily reject certain types of traffic (in this case, images requested from domains other than mine) while allowing legitimate requests to pass through:
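
The gist of the rule looks something like the following sketch (it assumes mod_redirect is loaded; the whitelisted user agents and the replacement image URL shown here are placeholders rather than the exact production config):

  # lighttpd.conf sketch: send hotlinked image requests elsewhere (requires mod_redirect).
  # Empty referrers (direct visits) and my own domain are allowed through.
  $HTTP["referer"] !~ "^($|https?://([^/]*\.)?charltontrezevant\.com)" {
      $HTTP["useragent"] !~ "(Googlebot|Skype)" {
          $HTTP["url"] =~ "\.(jpe?g|png|gif)$" {
              url.redirect = ( ".*" => "https://example.com/replacement.png" )
          }
      }
  }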

Note that this snippet allows certain referrers and user-agents through (such as Googlebot, Skype Web Preview, etc) so that legitimate uses are enabled.

This snippet will redirect all image requests with disallowed referrers to this image, thus returning my bandwidth stats to sane levels while also expressing my mild irritation.

The aftermath scattered my face across the net, which was incredibly funny.

Thursday, December 31, 2015

Live Picture Frames: A Curious Perspective

Over the past week, I converted an old Kobo ereader into a live photo frame that displays a new photo from the Curiosity rover every day. The project turned out exceptionally well!

As always, source and instructions are on GitHub.

Case removed, picture frame being prepared.

Framed and hung!

Thursday, December 3, 2015

Archiving your Pocket list with Ruby


I've been seeking a more powerful and extensible alternative to Bash, so I've recently begun experimenting with Ruby. For my first "real" test of the language, I decided to solve a problem I had been seeking an answer to for some time: since the web is constantly changing, how could I go through my entire reading list and ensure that I had backup copies of the articles I've saved? As it turns out, there was a fairly simple solution: only 35 lines of Ruby!

The script itself uses the Curb and Nokogiri libraries to follow URL shorteners and parse HTML, ensuring that the third main component, wkhtmltopdf (a personal favorite of mine), gets the best possible data for each link. To get your Pocket data into the script, you simply use Pocket's nifty HTML export tool, which produces a webpage full of links to all of your saved articles.

Using the script is extraordinarily simple: Once dependencies are installed (see the top of the script for more information on that), you simply run ruby pocket_export.rb ~/Downloads/ril_export.html and you're off! The script creates the directory pocket_export_data to store the PDFs it generates and pocket_export_errors.log to keep track of any links it has trouble with. 
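
Concretely, a typical setup-and-run session looks something like this (the gem names correspond to the libraries mentioned above; the script's header remains the authoritative dependency list):

  # Install the script's library dependencies (Curb and Nokogiri);
  # wkhtmltopdf must also be installed and on your PATH.
  gem install curb nokogiri

  # Point the script at Pocket's HTML export file.
  ruby pocket_export.rb ~/Downloads/ril_export.html

  # Results:
  #   pocket_export_data/        one PDF per saved article
  #   pocket_export_errors.log   links the script couldn't handle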

Enjoy!

Tuesday, November 17, 2015

Language Speeds: A Quick and Dirty Test


Some time ago I decided to run a quick test to determine the performance of several popular programming languages by measuring the execution time of an algorithm to compute the greatest prime factor of a number.
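
A typical way to compute this is plain trial division; here's a rough shell sketch of the idea (the repository holds the actual per-language implementations used in the test, which may differ):

  #!/usr/bin/env bash
  # Greatest prime factor by trial division: divide out each factor as it is found;
  # whatever remains at the end (if > 1) is the largest prime factor.
  n=${1:?usage: gpf.sh N}
  d=2
  largest=1
  while (( d * d <= n )); do
    if (( n % d == 0 )); then
      largest=$d          # d divides n, so d is prime (smaller factors were divided out)
      n=$(( n / d ))
    else
      d=$(( d + 1 ))
    fi
  done
  (( n > 1 )) && largest=$n   # the leftover n is prime and larger than any factor found so far
  echo "$largest"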

The test itself was easy enough to set up and run, being no more than an implementation of a simple algorithm in several languages combined with a bash script to run them all in sequence. My findings, however, were interesting. 

I tested the following languages on several systems I had handy: 
  • C
  • Java
  • Python
  • Ruby
  • JavaScript
Unsurprisingly, C won out for speed; its execution times were the lowest across the board, since it's compiled ahead of time to native machine code.

In second place, surprisingly, came JavaScript. Looks like V8 really is all it's cracked up to be.

Following that, Python and Ruby were fairly close in speed, and Java was the slowest across the board (likely due in part to JVM startup time being included in the measurement).

I encourage you to run your own tests and post the results! The source code (along with my initial test) is on GitHub.