Business website design and development | Established 2004 | Based in Manchester UK
Call Us Today: 0161 408 0087

Our Blog

How to solve a problem like appspot (proxies)

The usage of web proxies isn’t something new; they have been around for years as a way for web surfers to get around firewall rules in places such as schools and workplaces. The way a web proxy works is quite simple: if you wanted to visit Facebook at work but found that it was blocked by the IT department, you could visit a web proxy and then visit Facebook through it. The company firewall would only see a connection to the proxy, not to Facebook, and would let you use it.

Now using a web proxy is fraught with danger. Some of them are perfectly fine, but how do you know that the Facebook email address and password you just entered into the proxy version of Facebook haven’t been recorded or stored? Anyway, that’s a different story for another day… The problem we are looking at today is about SEO, and how some people are now using web proxies hosted on Google’s appspot platform to sabotage legitimate websites in the Google ranking results.

The problem of duplicate content and web proxies.

Google has a bit of a bee in its bonnet about “duplicate content”, that is, content that has been copied or stolen from one web page and included in another. Its policy is that the original page should be credited with being the “owner” of the content and the copied content should be dropped. Now this is fine and dandy (well done Google), but what happens when it’s not only the content of a page that is copied but the entire source code of that page? That is what a web proxy does…

It’s quite easy for Google to spot copied content (text that has been copied from website A to website B), but it seems to have a huge problem dealing with exact code clones of websites. I don’t know why this should be, although I have a theory that it has something to do with data being synced across Google’s data-centers. There may be some sort of conflict between the data-centers as to which site is the original and which is the copy (this is pure guesswork).

Now this shouldn’t be a problem as web proxies are not normally indexed by Google – in order to get a web proxy indexed you would need to link to the proxied page from somewhere on the web – and this is what seems to be happening.

A real example.

I was recently contacted by a client whose SEO company had discovered what they called a “scraper” website. A scraper website copies website content and displays it as its own… In this case the SEO company was wrong: it wasn’t a scraper site, it was a proxy, but the problem was still there.

I don’t want to give away the client’s details so we will call the site “siteA.com” – their rankings for certain keywords were fluctuating widely and the SEO company involved were convinced it was the scraper / proxy site causing the problem.

How to tell if you have been “proxied”

This is quite easy, because the proxy copies everything. Copy some text from your homepage and paste it into Google inside quotation marks. So if your homepage says “Welcome to Bobs Widgets – the largest selection of widgets available in the UK”, search for “Bobs Widgets – the largest selection of widgets available in the UK”, and when the results come up make sure you click the “additional results” link at the bottom of the page.

If you have been proxied then you may see something like http://proxyname.appspot.com/www.yourwebsite.com listed with the same title and description as your own.
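The check above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are made up for this example, the search URL is just the ordinary Google query string with the phrase in quotation marks, and the proxy test simply looks for your domain in the path of an appspot.com address, as in the example URL above.

```python
from urllib.parse import quote_plus, urlparse


def exact_phrase_search_url(phrase: str) -> str:
    """Build a Google search URL with the phrase wrapped in quotation marks."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')


def looks_like_appspot_proxy(result_url: str, your_domain: str) -> bool:
    """True if the URL is hosted on appspot.com and has your domain in its path."""
    parsed = urlparse(result_url)
    host = (parsed.hostname or "").lower()
    on_appspot = host == "appspot.com" or host.endswith(".appspot.com")
    return on_appspot and your_domain.lower() in parsed.path.lower()


if __name__ == "__main__":
    print(exact_phrase_search_url("Bobs Widgets"))
    print(looks_like_appspot_proxy(
        "http://proxyname.appspot.com/www.yourwebsite.com",
        "www.yourwebsite.com",
    ))
```

You would still need to paste the search URL into a browser and read the results yourself; the second function only helps you flag suspicious entries in a list of result URLs you have already collected.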

I have even seen examples whilst reading up on this of proxied websites outranking the originals for certain search terms!

How do these proxy sites get into the rankings?

Good question. In theory they shouldn’t, as they are generated on the fly to proxy any website as the request comes in, so they only exist at the exact moment someone uses them. The answer to this HAS TO BE that someone is actively linking to these proxied pages from other websites. This means that getting the proxied pages listed is intentional.

I can only surmise that this is some sort of Black Hat SEO technique, used by some SEO companies in an attempt to sabotage other websites that are in competition with their clients’ websites.

How to fix the appspot proxy problem

Good question. The traditional defence against ‘scraper’ websites (the rel="canonical" tag) won’t work, as the tag is rewritten by the proxy. Banning the proxy by its IP range isn’t an option either: the appspot domain is hosted on Google’s cloud environment, and because Google uses the same IP ranges for some of its other products (Google Translate for example, and possibly even its web crawler) you could in theory ban Google from your own website!

The answer is in the server logs. When the proxy site is accessed it goes to the legitimate website and grabs a copy, and when it visits it identifies itself with a distinctive USER AGENT. The USER AGENT is the string that any browser or spider sends to identify itself when it visits your website.

The appspot proxies use different USER AGENTS, but they all contain the word “AppEngine”.
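To see which of these USER AGENTS are hitting your own site, you can scan your access log for the word “AppEngine”. The sketch below is an assumption-laden example rather than a finished tool: it assumes the common Apache “combined” log format, where the user agent is the last quoted field on each line, and the function name is invented for this post.

```python
import re
from collections import Counter

# In the Apache "combined" log format each line ends with: "referer" "user-agent"
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def appengine_user_agents(log_lines):
    """Count each distinct user agent that contains "AppEngine" (any case)."""
    counts = Counter()
    for line in log_lines:
        match = _LAST_QUOTED_FIELD.search(line)
        if match and "appengine" in match.group(1).lower():
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # The path to the log file is an assumption; use wherever your host keeps it.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for agent, hits in appengine_user_agents(log).most_common():
            print(hits, agent)
```

If the list comes back empty, either nobody is proxying your site through appspot or your log format is different from the one assumed here.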

Now we know this we can formulate a defence…

The simplest way is to use .htaccess – a simple Rewrite Rule can be used…

# Return 403 Forbidden to any request whose USER AGENT contains "AppEngine"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} AppEngine [NC]
RewriteRule .* - [F]

The above code detects the AppEngine USER AGENT and blocks it. The result is that anyone trying to visit your website through an appspot proxy will get a “Forbidden” error message. This means that the results in Google for your website through the proxy now also return a “Forbidden” message. It will take Google some time to realise this and drop the proxy website from its index, but it will drop it eventually.
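If your site doesn’t run on Apache, the same idea can be applied in the application itself. As a rough sketch only, here is the equivalent check written as a small piece of Python WSGI middleware; the function name block_appengine is made up for this example and I haven’t used this version on a client site.

```python
def block_appengine(app):
    """Wrap a WSGI app so any request with "AppEngine" in its user agent gets a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        # Case-insensitive match, mirroring the [NC] flag in the .htaccess rule
        if "appengine" in user_agent.lower():
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is passed through to the real site untouched
        return app(environ, start_response)
    return middleware
```

You would wrap your existing WSGI application once at start-up, for example application = block_appengine(application), so that every request goes through the check first.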

I thought about using JavaScript to detect the proxy but I was unsure if this would / could be seen as “cloaking” so decided not to!

If anyone has any other suggestions on how to block or even redirect traffic from a web proxy back to the original website using a 301 redirect then I would love to hear them… I did try using .htaccess to 301 redirect but it didn’t work on the proxy site. I am presuming that the proxy was rewriting the URL in the redirect so that it pointed back at the proxy and looped?


WordPress vulnerability and the easy fix

The popular blogging platform WordPress has been fairly secure for quite some time now, but a very serious vulnerability has been found that allows a hacker to gain root access to your hosting account and alter any file at will. The affected file is called timthumb.php and is in itself a fairly harmless file – it allows image thumbnail generation from a chosen list of remote websites.


Buy Facebook fans and Twitter followers

Buying Facebook fans and Twitter followers is a very touchy subject. Some people say that paying for fans and followers is a waste of time as the new profiles who follow you are not “targeted”. In other words, they have not decided to follow you because they are interested in your products and services; they have just followed you because they are paid to do so.

Whilst this is a valid point and 100% true, it’s not fair to dismiss the purchasing of friends and followers outright.


Free Google+ icons for your website

Our next bunch of social networking icons are for the “new kid” on the block… Google+

At the time of writing Google+ is in “beta”, which means that you need an invite to join. Seeing as most people searching for these icons will already have a Google+ account this won’t be a problem (although we have quite a few Google+ invites left if you need one).

Copyright Total Web Development 2011 | Email: info@totalwebdevelopment.co.uk
Web design and development in Manchester, Oldham, Rochdale and the surrounding areas.