
Sunday, May 24, 2015

Stopping Google Analytics Spammers

Any observant user of Google Analytics will have undoubtedly noticed a large amount of traffic coming from curious-looking referrers, like so:


This is worrying! Not only does it completely destroy any chance of getting actual, valid traffic data from Analytics, but it also frustratingly puts your website's data in the hands of internet spammers. Ouch!

So here's a quick tip for blocking the (unfortunately) prevalent Google Analytics spammers from your site.

First, delete and re-add your website to Analytics to get a new property ID, and retrieve your new tracking code.

Next, obfuscate your JS tracking code to keep bots from finding your site's analytics property ID.  I've been using this method for several months now and can confirm that it has no effect on the functionality of analytics and doesn't break anything.
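The post doesn't tie you to one particular obfuscation tool, but here's a minimal sketch of the idea: build the property ID at runtime so the literal "UA-" string never appears in your page source for scrapers to grep. (The ID and split points below are made up; use your own new property ID, and keep Google's standard analytics.js loader snippet in place.)

    <script>
    // ...Google's standard analytics.js loader snippet goes here, unchanged...

    // Hypothetical example: build the property ID from pieces at runtime
    // instead of embedding "UA-12345678-1" as a plain literal.
    var pid = ['UA-', '1234', '5678', '-1'].join('');
    ga('create', pid, 'auto');
    ga('send', 'pageview');
    </script>

A proper JS obfuscator goes further than this, but even simple string-splitting should be enough to hide the new ID from naive scrapers that just grep pages for property IDs.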

Monday, May 11, 2015

Blocking Bad Bots in 3 Easy Steps with Lighttpd.

I figured that a good follow-up to my last post would be a new one about bot-stopping best practices.

In this tutorial, we'll be looking at best practices for Lighttpd, but equivalent configuration options are available in all major web servers (Apache, Nginx, IIS, etc.).

The internet is a scary place, and anyone who's even taken a cursory glance at their web server's access logs will certainly have noticed the number of clearly malicious or spammy requests made to it. Internet scrapers and spam referrers are just some of the nasty things you're sure to see strewn about your access logs.

These bots are typically searching for exploits (an automated, brute-force process), though many others are looking for personal information such as email addresses and phone numbers to send spam to. There may also be search engines that crawl your site too frequently, making large numbers of requests at close intervals (and eating bandwidth and system resources).

There are a few ways to keep this behavior under control, at least for the good robots. Any good webmaster will undoubtedly have created a strong robots.txt to keep out unwanted crawlers, but a shockingly large number of bots disregard it entirely.

A good example of this is Baidu, who, after being blocked in both my robots.txt and my web server configuration, actually went so far as to change the user agent of their web spiders to circumvent the blocks!

Because such nasty programs exist, we have to "break out the big guns" and block them in our web server config, and thankfully, Lighttpd provides plenty of options for us to do just that :)

Step one.

Grab the latest copy of my anti-spam configuration and copy it to your clipboard.

Step two.

On your server, cd to /etc/lighttpd and touch the file spam.conf. Open that file in your favorite text editor and paste in the contents of the anti-spam config. (note: you may need to su to root or use sudo to create and edit these files)
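In practice that boils down to something like the following (assuming sudo and nano; use whatever editor you prefer):

    cd /etc/lighttpd
    sudo touch spam.conf
    sudo nano spam.conf    # paste the anti-spam config here and save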

Open your lighttpd configuration file (usually /etc/lighttpd/lighttpd.conf) and add the line:

    include "spam.conf"

Step three.

Reload your web server's configuration (or simply restart it):

    /etc/init.d/lighttpd reload
    /etc/init.d/lighttpd restart
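
If the reload fails, a quick syntax check of the configuration will usually point to the problem (the -t flag tells lighttpd to parse the config file and exit):

    lighttpd -t -f /etc/lighttpd/lighttpd.conf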

And just like that, a large portion of the most common offenders is blocked from your site!

Of course, this tutorial wouldn't be complete without a section on how to add your own custom rules, so I'll explain that, too!

Let's look at the three different conditional statements that you would be using to block bots and referrers from accessing your site. They are more or less self-explanatory, so here they are:
  1. $HTTP["referer"]
  2. $HTTP["useragent"]
  3. $HTTP["remoteip"]

These conditionals, when used in conjunction with one of lighttpd's conditional operators and regular expressions, give you powerful, granular control over who (or what) can access your site.
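
As a hypothetical example of how these fit together (the domains, user agent patterns, and IP address below are placeholders; substitute whatever actually shows up in your own logs), each conditional wraps a url.access-deny rule, which requires mod_access to be loaded:

    # Deny requests whose Referer header matches known spam domains.
    $HTTP["referer"] =~ "(semalt\.com|buttons-for-website\.com)" {
        url.access-deny = ( "" )
    }

    # Deny requests from scrapers announcing themselves with these user agents.
    $HTTP["useragent"] =~ "BadBot|EvilScraper" {
        url.access-deny = ( "" )
    }

    # Deny everything coming from a single abusive IP address.
    $HTTP["remoteip"] == "203.0.113.42" {
        url.access-deny = ( "" )
    }

The empty string in url.access-deny matches every URL, so anything caught by one of these conditionals gets a 403. The =~ operator does a regular expression match, while == is an exact match.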