
How to solve a problem like appspot (proxies)


Web proxies are nothing new. They have been around for years as a way for web surfers to get around firewall rules in places such as schools and workplaces. The way a web proxy works is quite simple: if you wanted to visit Facebook at work but found that it was blocked by the IT department, you could visit a web proxy and then visit Facebook through it. The company firewall only sees a connection to the proxy, not to Facebook, so it lets the traffic through.

Now, using a web proxy is fraught with danger. Some of them are perfectly fine, but how do you know that the Facebook email address and password you just entered into the proxy version of Facebook haven’t been recorded or stored? Anyway, that’s a different story for another day… The problem we are looking at today is about SEO, and how some people are now using web proxies hosted on Google’s appspot platform to sabotage legitimate websites in Google’s search rankings.

The problem of duplicate content and web proxies.

Google has a bit of a bee in its bonnet about “duplicate content”, that is, content that has been copied or stolen from one web page and included in another. Its policy is that the original page should be credited with being the “owner” of the content and the copied content should be dropped. Now this is fine and dandy (well done Google), but what happens when it’s not only the content of a page that is copied but the entire source code of that page? That is what a web proxy does…

It’s quite easy for Google to spot copied content (text that has been copied from website A to website B), but it seems to have a huge problem dealing with exact code clones of websites. I don’t know why this should be, although I have a theory that it has something to do with data being synced across Google’s data centers: there may be some sort of conflict between them as to which site is the original and which is the copy. (This is pure guesswork.)

Now this shouldn’t be a problem as web proxies are not normally indexed by Google – in order to get a web proxy indexed you would need to link to the proxied page from somewhere on the web – and this is what seems to be happening.

A real example.

I was recently contacted by a client whose SEO company had discovered what they called a “scraper” website. A scraper website copies website content and displays it as its own… In this case the SEO company was wrong: it wasn’t a scraper site, it was a proxy, but the problem was still there.

I don’t want to give away the client’s details so we will call the site “siteA.com” – their rankings for certain keywords were fluctuating widely and the SEO company involved were convinced it was the scraper / proxy site causing the problem.

How to tell if you have been “proxied”

This is quite easy, because the proxy copies everything. Copy some text from your homepage and paste it into Google inside quotation marks. So if your homepage says “Welcome to Bobs Widgets – the largest selection of widgets available in the UK”, then search for “Bobs Widgets – the largest selection of widgets available in the UK”, and when the results come up make sure you click the “additional results” link at the bottom of the page.

If you have been proxied then you may see something like http://proxyname.appspot.com/www.yourwebsite.com listed with the same title and description as your own.
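The check above can be partly scripted. What follows is only a rough sketch in Python: the function names `quoted_query` and `is_appspot_proxy_url` are made up for illustration, and it assumes the proxied URL has the shape shown above (an appspot.com host with your own domain embedded in the rest of the URL). Other proxies may encode the target differently.

```python
from urllib.parse import quote_plus, urlparse


def quoted_query(snippet: str) -> str:
    """Build a Google search URL for an exact-phrase match of the snippet."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')


def is_appspot_proxy_url(result_url: str, your_domain: str) -> bool:
    """Return True if a search result looks like an appspot proxy of your site.

    Assumption: the proxy embeds the original host in the path or query
    string, as in http://proxyname.appspot.com/www.yourwebsite.com
    """
    parsed = urlparse(result_url)
    host = (parsed.hostname or "").lower()
    if not host.endswith(".appspot.com"):
        return False
    rest = (parsed.path + "?" + parsed.query).lower()
    return your_domain.lower() in rest
```

You would still run the search by hand and paste any suspicious result URLs into `is_appspot_proxy_url`; the sketch does not query Google itself.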

I have even seen examples whilst reading up on this of proxied websites outranking the originals for certain search terms!

How do these proxy sites get into the rankings?

Good question. In theory they shouldn’t, as they are generated on the fly to proxy any website as the request comes in, so they only exist at the exact moment someone uses them. The answer to this HAS TO BE that someone is actively linking to these proxied pages from other websites. This means that getting the proxied pages listed is intentional.

I can only surmise that this is some sort of Black Hat SEO technique, used by some SEO companies in an attempt to sabotage other websites that are in competition with their client’s website.

How to fix the appspot proxy problem

Good question. The traditional defence against ‘scraper’ websites (the rel="canonical" tag) won’t work, as the tag is rewritten by the proxy. Banning the proxy by its IP range isn’t an option either: the appspot domain is hosted on Google’s cloud environment, and because Google uses the same IP ranges for some of its other products (Google Translate for example, and possibly even its web crawler) you could in theory ban Google from your own website!

The answer is in the server logs. When the proxy site is accessed it goes to the legitimate website and grabs a copy, and when it visits it sends a distinctive USER AGENT. The USER AGENT is a string that any browser or spider sends when it visits your website to identify itself.

The appspot proxies use different USER AGENTS but they all contain the word “AppEngine”.
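To see whether this is happening on your own site, you can search the access log for that word. Below is a minimal sketch in Python. It assumes Apache’s “combined” log format, where the user agent is the last quoted field on each line, and the function name `appengine_agents` is made up for illustration.

```python
import re
from collections import Counter

# In the "combined" log format the user agent is the last quoted field.
_LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def appengine_agents(log_lines):
    """Count user agent strings that contain "AppEngine" (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        match = _LAST_QUOTED.search(line)
        if match and "appengine" in match.group(1).lower():
            counts[match.group(1)] += 1
    return counts
```

Pass it an open log file (for example `appengine_agents(open("access.log"))`, where the path is whatever your server uses) and it returns each matching user agent with the number of requests it made.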

Now we know this we can formulate a defence…

The simplest way is to use .htaccess (on an Apache server with mod_rewrite enabled). A simple Rewrite Rule can be used…

# Send 403 Forbidden to any request whose user agent contains "AppEngine"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} AppEngine [NC]
RewriteRule .* - [F]

The above code detects the AppEngine USER AGENT and blocks it. The result is that anyone trying to visit your website through an appspot proxy will get a “Forbidden” error message. This means that the results in Google for your website through the proxy now also return a “Forbidden” message. It will take Google some time to realise this and drop the proxy website from its index, but it will drop it eventually.
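If your site doesn’t run on Apache, the same idea can be applied in application code. As an illustration only, here is how it might look as a small piece of Python WSGI middleware. The name `block_appengine` is hypothetical, and this is a sketch of the technique rather than something I have run against a live proxy.

```python
def block_appengine(app):
    """Wrap a WSGI app so that AppEngine user agents receive 403 Forbidden."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        # Case-insensitive match, mirroring the [NC] flag in the rewrite rule.
        if "appengine" in agent.lower():
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)

    return middleware
```

You would wrap your existing application object once at startup, for example `application = block_appengine(application)`.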

I thought about using JavaScript to detect the proxy but I was unsure if this would / could be seen as “cloaking” so decided not to!

If anyone has any other suggestions on how to block, or even redirect, traffic from a web proxy back to the original website using a 301 redirect then I would love to hear them… I did try using .htaccess to 301 redirect but it didn’t work on the proxy site. I am presuming that the proxy was rewriting the URL in the redirect (it never sees the .htaccess file itself, only the server’s response) so that it was looping?
