Nuke Cops :: View topic - Google are changing their index policy
Imago
Captain
Joined: Jan 17, 2003
Posts: 629
Location: Europe
Posted: Thu Sep 11, 2003 12:11 am

Today I was surprised to find Googlebot (64.68.88.xxx) indexing the following dynamic pages:
Articles
http://www.orientalgate.org/modules.php?name=News&file=article&sid=290
http://www.orientalgate.org/modules.php?name=News&file=article&sid=130
http://www.orientalgate.org/modules.php?name=News&file=article&sid=95
http://www.orientalgate.org/modules.php?name=News&file=article&sid=348
http://www.orientalgate.org/modules.php?name=News&file=article&sid=118
http://www.orientalgate.org/modules.php?name=News&file=article&sid=85
http://www.orientalgate.org/modules.php?name=News&file=article&sid=91
http://www.orientalgate.org/modules.php?name=News&file=article&sid=98
http://www.orientalgate.org/modules.php?name=News&file=article&sid=128
http://www.orientalgate.org/modules.php?name=News&file=article&sid=404

Forums
http://www.orientalgate.org/modules.php?name=Forums&file=viewtopic&p=8136

I am going to disallow modules.php in robots.txt.

MikeMiles
Lieutenant
Joined: May 29, 2003
Posts: 231
Posted: Thu Sep 11, 2003 1:27 am

The bot has been following URLs with one or two variables after the ? for quite a while now. I've seen him do three before, but to a much lesser extent. I've also seen him index a page with multiple variables if the link is found on another site. He will definitely index things you don't want if you don't guide him with robots.txt.

Here's a good example where he caught a session id and spidered the same page thousands and thousands of times:
http://www.google.com/search?hl=en&lr=&ie=ISO-8859-1&q=site%3Ainvisionboard.com+-zyxzyxzyxzyx
That one used to show over 30,000 results for the exact same page, each with only a different session id. Definitely watch the good bots so they don't get lost or go off spidering things you don't want.

Imago
Captain
Joined: Jan 17, 2003
Posts: 629
Location: Europe
Posted: Thu Sep 11, 2003 2:39 am

I don't need my short URLs and my dynamic URLs being indexed in parallel. If this is a stable tendency with Google, perhaps I will disable the short URLs, because there are SEs that are reluctant to index them (I mean the Russian Yandex).

_________________
www.vdsp.net | www.indopedia.org | www.orientalia.org | www.indology.net | www.yogadarsana.org | www.husserl.info | www.medicum.net

Imago
Captain
Joined: Jan 17, 2003
Posts: 629
Location: Europe
Posted: Thu Sep 11, 2003 7:10 am

Googlebot has gone mad. Now it is indexing a non-existent encyclopedia

http://www.orientalgate.org/modules.php?name=Encyclopedia&op=list_content&eid=135

and has started indexing all topics of the Forum
http://www.orientalgate.org/modules.php?name=Forums&file=viewtopic&t=99

despite this ban in robots.txt:

Disallow: modules.php?name*$

What I am afraid to say is that our mod_rewrite techniques have become obsolete. Google no longer needs short URLs to index your site. Another example, from the encyclopedia terms:

Googlebot takes this URL
http://www.orientalgate.org/modules.php?name=Encyclopedia&op=content&tid=220

and ignores this one
http://www.orientalgate.org/term220.html

Soon it will learn to ignore all rewritten URLs.

NovemberRain
Sergeant
Joined: Jan 30, 2003
Posts: 96
Location: Istanbul
Posted: Thu Sep 11, 2003 5:22 pm

So what?
Uninstall Googlifier?

There are links to my short URLs all over the web.

manunkind
Sergeant
Joined: Aug 16, 2003
Posts: 75
Posted: Thu Sep 11, 2003 5:54 pm

I've been seeing this also. :?

Imago
Captain
Joined: Jan 17, 2003
Posts: 629
Location: Europe
Posted: Thu Sep 11, 2003 10:13 pm

// there are links to my short urls around web everywhere

This is the main problem. Short links are very convenient. Let us see whether Google will replace the short links with the dynamic ones in its Search Index.

Please share your observations.
Install "Who is Where" to see the IP of the bot and the URL it is indexing.

MikeMiles
Lieutenant
Joined: May 29, 2003
Posts: 231
Posted: Fri Sep 12, 2003 1:49 am

Your robots.txt file currently shows you're allowing only two bots (plamenbot and yandex) to spider your site and have banned all the rest. Yandex is a Russian SE. Who is plamenbot? I don't see that one listed at all using a couple of SEs. Is that a Russian user agent Googlebot is using? If not, it shouldn't be indexing your site at all right now with the blanket ban you have. From what I've read, only Googlebot recognizes wildcards in the robots.txt file, not the other bots.

There are also people who fake the Googlebot user agent. What is the IP of the one indexing your dynamic urls?
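
One way to tell a faked Googlebot user agent from the real crawler is a reverse DNS lookup on the visiting IP followed by a forward lookup on the returned hostname. The Python sketch below is only an illustration: the function names are made up, and the accepted suffixes (googlebot.com and google.com) are an assumption based on the crawl hostnames reported in this thread.

```python
import socket

# Assumption: genuine Google crawlers resolve to hosts under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def looks_like_google_host(hostname: str) -> bool:
    """Return True if the hostname sits under an accepted Google domain."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES)


def is_real_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse DNS record, so it cannot be confirmed
    if not looks_like_google_host(hostname):
        return False
    try:
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    # The forward lookup must lead back to the IP we started from.
    return ip in forward_ips
```

The suffix check alone can be tried without any network access, for example looks_like_google_host("crawl34.googlebot.com"). The full is_real_googlebot check needs working DNS and depends on the records that exist at the time of the lookup.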

Imago
Captain
Joined: Jan 17, 2003
Posts: 629
Location: Europe
Posted: Fri Sep 12, 2003 4:34 am

// Who is plamenbot?

:)))))))))

It is Googlebot. I renamed its entry in robots.txt to plamenbot to be sure that no bot will index my pages (Plamen is my name).

Yet Google continued to index my site for quite a while after the ban. Now it is out. I will keep it out every time I find it indexing modules.php dynamic URLs.

The IP is 64.68.88.xxx - usually there are 30 or 40 bots.

As for the wildcards, you are right. Yet the guys from Yandex told me the robots.txt was fine with them. The problem is that they do not index any short URL, yet they index any dynamic URL.
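
The renaming trick can be sketched with Python's standard robots.txt parser. The file contents below are a hypothetical reconstruction, since the actual robots.txt is not quoted in this thread; only the agent names plamenbot and yandex come from the discussion. Because no crawler announces itself as plamenbot, every bot other than Yandex falls through to the blanket ban.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the real file is not quoted in the thread.
ROBOTS_TXT = """\
User-agent: plamenbot
Disallow:

User-agent: yandex
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.orientalgate.org/modules.php?name=News&file=article&sid=290"

# An empty Disallow allows everything; "Disallow: /" bans everything.
for agent in ("plamenbot", "yandex", "Googlebot"):
    print(agent, parser.can_fetch(agent, url))
```

This only shows how a compliant parser reads such a file. It says nothing about how quickly a crawler picks up a changed robots.txt.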

MikeMiles
Lieutenant
Joined: May 29, 2003
Posts: 231
Posted: Fri Sep 12, 2003 6:27 am

oooh, okay. Sounds like a pretty cool bot's name to me. lol

Yup, that's a Google IP. Their bot usually follows robots.txt but sometimes takes a little while when you change it because it caches the file. I recommend you change these to have a leading slash. Not sure if Googlebot will take it w/o the slash but the standard says to use it.

Disallow: /admin.php
Disallow: /modules.php?name*$
Disallow: /odp.php
Disallow: /websearch.php
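
To see what the wildcard rule above would cover, here is a small Python sketch of wildcard matching in which * matches any run of characters and a trailing $ anchors the end of the URL. It illustrates those extension semantics as I understand them and is not Googlebot's actual code; the helper names are made up. Under these rules /modules.php?name*$ behaves the same as the plain prefix /modules.php?name, and a rule without the leading slash never matches, because URL paths always start with a slash.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow value with * and a trailing $ into a regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and let each * match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)


def is_disallowed(rule: str, path: str) -> bool:
    """Return True if the rule matches the path starting at its first character."""
    return rule_to_regex(rule).match(path) is not None


article = "/modules.php?name=News&file=article&sid=290"
print(is_disallowed("/modules.php?name*$", article))          # rule with leading slash
print(is_disallowed("modules.php?name*$", article))           # rule without the slash
print(is_disallowed("/modules.php?name*$", "/term220.html"))  # short URL
```

Whether the crawler of the time tolerated a rule without the leading slash is not something this sketch can answer. It only shows why the slash matters under strict matching.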

Imago
Captain
Joined: Jan 17, 2003
Posts: 629
Location: Europe
Posted: Fri Sep 12, 2003 8:13 am

Thanks, I'll try with slashes.

NovemberRain
Sergeant
Joined: Jan 30, 2003
Posts: 96
Location: Istanbul
Posted: Fri Sep 12, 2003 7:16 pm

At the moment

64.68.88.146 crawl34.googlebot.com is indexing ftopic376.html

64.68.88.31 crawl32.googlebot.com is indexing forum1.html

Also, one Googlebot is indexing news with long URLs while another one is indexing the short ones.

Both long URLs and short ones are being indexed. I think this means twice the bandwidth for Google and two different types of URLs in the search results. We have to wait and see.

Also, 216.88.158.142 crawlers.looksmart.com is always at my site, although all other bots except Google are disallowed.


Last edited by NovemberRain on Fri Sep 12, 2003 7:34 pm; edited 1 time in total

nickweb
Nuke Soldier
Joined: Aug 27, 2003
Posts: 34
Posted: Fri Sep 12, 2003 7:22 pm

I have Who is Where and IP Tracker installed at my site (www.nick-web.co.uk) and the Googlebots went absolutely crazy! I think they started eating up my bandwidth, so I banned all Googlebot IP ranges. :D

At last, a Who is Where without 10 visitors from Google. lol

Stylee
Sergeant
Joined: Jun 22, 2003
Posts: 140
Location: USA
Posted: Fri Sep 12, 2003 7:43 pm

I don't have Googlifier installed on my site, and those same IPs have been there all day long, between 4 and 10 of them at a time.

Can someone shoot me a link for Who is Where?

MikeMiles
Lieutenant
Joined: May 29, 2003
Posts: 231
Posted: Fri Sep 12, 2003 8:10 pm

Googlebot always brings over a bunch of his buddies for serious indexing. They use distributed processing, which actually makes indexing a site much faster.

All bots eat up bandwidth, but a person has to decide whether it is worth it for the potential visitors the SE may bring to the site. Some hardly bring any in return, but I think for most people Googlebot pays back what it eats during indexing. Like the rest, you have to watch him to make sure he behaves.