SEO - Search engine optimization

Google's new structure - Bigdaddy.

An attempt to describe the new structure and its problems.

Description of the new structure.

A short reminder of what Bigdaddy is:

The update known as Bigdaddy wasn't just another update in which only the algorithm changed, as we have seen many times before. No, this update changed the structure of the datacenters and the way data is organized in them.

The new structure had to ensure Google's ability to provide searchers all over the world with good, fast results, now and in the future. To maintain and strengthen its position as the largest provider of information, Google needed a new and better way to handle, sort and store the data it receives from its bots. The new database structure is said to include a switch to 64-bit processors.

To deliver good results fast, much depends on how the received data is handled. Datacenters are filled with new information in an endless stream of newly found pages, at a pace one can hardly imagine. Information that is already stored has to be sorted over and over again to keep delivering high-quality results. It isn't hard to see that speed is essential, and a few smart people have figured out that the new structure could well handle twice the speed of the old one.

Part of the Bigdaddy update is a new robot.

Along with the update a new robot was introduced. It can be recognized by its user agent (UA) string: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This string can be found in the access log of your website.
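
As an illustration, the short Python sketch below counts how often that user agent appears in an Apache-style access log, grouped by client IP. The log path and the 'combined' log format are assumptions; adjust both to your own server.

```python
# Sketch: count visits by the new Googlebot in an Apache-style access log,
# grouped by client IP. The log path and the "combined" log format are
# assumptions -- adjust them to your own server.
import re
from collections import Counter

LOG_FILE = "access.log"            # hypothetical path to your access log
GOOGLEBOT_UA = "Googlebot/2.1"     # part of the user agent string quoted above

hits_per_ip = Counter()

with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        if GOOGLEBOT_UA in line:
            # In the combined log format the client IP is the first field.
            match = re.match(r"^(\S+)", line)
            if match:
                hits_per_ip[match.group(1)] += 1

for ip, hits in hits_per_ip.most_common():
    print(f"{ip}: {hits} requests")
```

With the new structure in place, a count like this should show only one, or at most a few, distinct IP addresses.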

There used to be many robots, sent from different datacenters, collecting the information. Now a single datacenter is assigned to collect the information from a given website. With Bigdaddy fully operational, you will see just one IP address for all the different robots that Google sends to your website.

What does this mean for Google's structure?

New and altered data will be collected by one datacenter and distributed from there to all the other centers. This means that the traffic Google's robots generate to your site and other sites will be reduced. All seems perfect and ideal, but it carries a great risk. Let's assume your site is visited by a robot coming from a datacenter that isn't working properly. The consequences for your site's positions in the SERPs would be severe. In the old situation, such a problem was easily overcome by the other bots coming from different IPs.

Search engine problems - Solved?

304 status - document changed or not

Implemented in Google's robot during Bigdaddy is the use of the status code (304) that indicates whether a document has been modified or not. If handled properly, this would be a next step in cutting back the traffic to your site. Pages that return this status don't have to be requested over and over again, and a page that keeps returning it for a longer period could even be visited less frequently. Statistically not shocking, but what if the robot doesn't handle the status correctly? It could take days, even weeks or months, before new information is indexed!
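
To make concrete what the bot is supposed to do, here is a minimal Python sketch of such a conditional request: the client sends an If-Modified-Since header and, on a 304, keeps its cached copy instead of downloading the page again. The host, path and date are placeholders.

```python
# Sketch: a conditional GET, the mechanism behind the 304 status.
# Host, path and the stored Last-Modified date are placeholders.
import http.client

HOST = "www.example.com"
PATH = "/page.html"
LAST_MODIFIED = "Sat, 01 Apr 2006 12:00:00 GMT"   # value saved from a previous fetch

conn = http.client.HTTPConnection(HOST)
conn.request("GET", PATH, headers={"If-Modified-Since": LAST_MODIFIED})
response = conn.getresponse()

if response.status == 304:
    # Nothing changed: keep the cached copy, no need to download it again.
    print("304 - not modified, the cached copy is still valid")
elif response.status == 200:
    # The document changed: read the new body and remember the new date
    # for the next visit.
    body = response.read()
    print("200 - document changed,", len(body), "bytes fetched")
    print("new Last-Modified:", response.getheader("Last-Modified"))
else:
    print("unexpected status:", response.status, response.reason)

conn.close()
```

A bot that gets this right only downloads a page again when the server says it has changed; a bot that gets it wrong behaves exactly as described below.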

The data contained in the search engine, and the results it delivers, would be out of date!

The new robot did not handle this status correctly! At least it didn't while I was writing this article.

After requesting the headers to check whether a document had changed, the bot got a 304 if it hadn't changed. The result: the bot disappeared for weeks. On the other hand, documents that had not changed for a long time got a '200 OK' status, and the robot requested them over and over again. So something had to be wrong, something had to be broken. And my site, among numerous others, became the victim of a broken bot. Pages that actually were altered didn't get re-indexed and, even worse, new pages were not indexed at all.

A software problem?

Canonical problem / 301

Among all the other problems, it appeared that Google still had a problem with canonical URLs. Because of that, Google saw two versions of one site for quite a few domains: a website could be indexed both with 'www.' and without 'www.'. For the websites this happens to, it means an almost certain penalty, or even worse, a complete ban from the index.

For our site, a search for webontwerp delivered the following results:

"http://vision2form.nl/webontwerp/"
and
"http://www.vision2form.nl/webontwerp/"

Because many websites have tried to get better results by placing many duplicate sites on the net, a penalty on duplicate content would be just. But not in this case. So, perhaps without even knowing it, your site might have had, or still has, a penalty for duplicate content that in fact only Google sees!

A solution to this problem is to show Google where the real content is and block the way to the possible duplicate. The only proper way to do that is to serve a 301 permanent redirect on all false requests.

Unfortunately, the 301 wasn't always handled correctly by the bot, and the search engine still showed the unwanted URL.
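
In practice the redirect itself is usually configured in the web server (Apache's 'Redirect permanent', for example). To check that your server really answers the unwanted hostname with a 301, a quick sketch like the following can help; the hostnames are placeholders for your own domain.

```python
# Sketch: verify that the non-canonical hostname answers with a 301 that
# points at the canonical (www) version. The hostnames are placeholders.
import http.client

WRONG_HOST = "example.com"               # the variant you want redirected away
CANONICAL = "http://www.example.com"     # where the 301 should point

conn = http.client.HTTPConnection(WRONG_HOST)
conn.request("GET", "/")
response = conn.getresponse()
location = response.getheader("Location")

print("status:", response.status, response.reason)
print("Location:", location)

if response.status == 301 and location and location.startswith(CANONICAL):
    print("OK: permanent redirect to the canonical hostname")
else:
    print("check your server: a 301 to the www version was expected here")

conn.close()
```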

302 hijack

Another problem Google had, or still has, is that the 302 redirect could, not simply but still, be used to 'hijack' a website. By returning a '302 redirect' to the bot when it visits a malicious website, it is possible to make the robot believe that the document it is redirected to belongs to the domain that issued the redirect.

If a page has been hijacked, all the traffic to the original resource dries up and the page is lost forever. No method is known that reverses the process with 100% certainty.

A true nightmare for every webmaster. What a 302 hijack is and how it might be removed is explained at: Clsc.net - 302 page hijack.
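
If you suspect that some URL redirects to one of your pages, you can inspect that redirect without following it and see which status code it uses. A small Python sketch, with a placeholder URL:

```python
# Sketch: look at a redirect without following it, to see whether the URL
# answers with a temporary 302 (the status abused in a hijack) or a 301.
# The URL below is a placeholder.
import http.client
from urllib.parse import urlsplit

SUSPECT_URL = "http://example.org/some-redirecting-page"

parts = urlsplit(SUSPECT_URL)
conn = http.client.HTTPConnection(parts.netloc)
conn.request("GET", parts.path or "/")
response = conn.getresponse()

print("status:", response.status, response.reason)
print("Location:", response.getheader("Location"))

if response.status == 302:
    print("temporary redirect - the kind of response a 302 hijack relies on")
elif response.status == 301:
    print("permanent redirect")

conn.close()
```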

The new structure should prevent a 302 hijack, but does it?

Duplicate content: Wikipedia and DMOZ clones

Webmasters all over the world try to fill their sites with as much content as possible, in the shortest possible time and with as little effort as possible. The easiest way to do that is to use content from Wikipedia and links from DMOZ. The content built by volunteers on Wikipedia is available under the GNU Free Documentation License, and the DMOZ Open Directory has a similar license that allows anyone who agrees to its terms to place the content on their own websites.

Normally speaking this would not be a huge problem, but for the visitors of Google's search engine, and therefore for Google, it is. The results get littered with pages that are nothing more than clones of Wikipedia and DMOZ. You have most certainly seen them, and perhaps they are starting to irritate you.

Results that are of no interest to a search engine's users are of no interest to the search engine. You might lose your confidence and use another one to get the results you want. This problem (the clones) should be taken care of in the new update.

Search engine results bound by region

The next problem is somewhat different and has been around for a few years now; it appears to have become worse since Bigdaddy. More and more, Google tries to deliver results that are of use to you. It therefore gathers all kinds of information about you: the way you search, how you use its products, and even where you do that. Google tries to give you results related to that information, building a profile of you and what you might want. That could be great, but it isn't. It means you get the information Google thinks is good for you. Results are already manipulated by Google based on your physical location.

Google tries to calculate your position, but fails, again and again. Take for example our own IP address, which Google uses in trying to locate me. I live in Gelderland, but Google places me in Alkmaar and sometimes in Haarlem, a good 200 km away from my real location.

If Google does not get your location right, you get results it thinks are in your vicinity. You might, for example, get results for Groningen while you should have had results related to Zeeland.

Google has always stated that it would 'filter' the results when you search through the portal built for a specific country or language. But it now does so even if you use the google.com version. A development that I personally think is not that good. Or, as a concerned webmaster from India once put it, it looks like 'fishing in the same little pond surrounding us'.

A view on the new structure

The new structure most certainly has great advantages, but it also brought serious problems. If Google wants to survive as the leading search engine, these problems have to be solved.

Footnote:

This article reflects my personal view on this development and on the results Google gave and still gives, based on reading many articles on several blogs and on my experience with the way Google handles my site and places it in the results. Bear in mind that I am not employed by Google.

Corrections, discussion, or any meaningful additions or points of view are welcome; please get in contact with me!


Search engine optimization - SEO FAQ's

  • SEO advice
    Giving reliable SEO advice.
  • SEO faq
    An answer to frequently asked questions and unraveling of myths about SEO, search engines, rankings, website promotion, etc.
  • SEO Expert
    What is an SEO expert and does he exist?
  • SEO ethics
    Optimizing websites, is it all about tricks?
  • SEO basics
    Getting back to basics to understand how the internet works and how we can use that to optimise our websites for search engines.
  • SEO links
    Links on SEO and search engine optimization information.
  • Flash and being found
    What to do to get your Flash website found?
  • Google update Bigdaddy
    Or, something went wrong: Google's implementation of a new structure for its datacenters has some problems.