| Author |
Message |
Imago
Captain


Joined: Jan 17, 2003
Posts: 629
Location: Europe
|
Posted:
Thu Sep 11, 2003 12:11 am |
  |
|
    |
 |
MikeMiles
Lieutenant


Joined: May 29, 2003
Posts: 231
|
Posted:
Thu Sep 11, 2003 1:27 am |
  |
The bot has been doing one and two variables after the ? for quite a while now. I've seen him do three before, but to a much, much lesser extent. I've also seen him index a page with multiple variables if the link is found on another site. He will definitely index things you don't want if you don't guide him with robots.txt.
Here's a good example where he caught a session id and spidered the same page thousands and thousands of times:
http://www.google.com/search?hl=en&lr=&ie=ISO-8859-1&q=site%3Ainvisionboard.com+-zyxzyxzyxzyx That one used to show over 30,000 results, all for the exact same page with only a different session id. For sure, definitely watch the good bots so they don't get lost or go off spidering things you don't want. |
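To illustrate the session id problem described above, here is a minimal Python sketch of collapsing URLs that differ only by a session parameter. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions for illustration only; check what your own forum software actually appends.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names (hypothetical); adjust to your software.
SESSION_PARAMS = {"s", "sid", "phpsessid"}

def strip_session_id(url):
    """Return url with any session-id query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Two visits to the same page, each with a different session id.
a = strip_session_id("http://example.com/index.php?s=abc123&showtopic=42")
b = strip_session_id("http://example.com/index.php?s=def456&showtopic=42")
print(a)
print(a == b)
```

A spider that sees the raw URLs treats every session id as a new page, which is how one page can turn into thousands of results. Comparing the stripped form in your logs is one way to see how many distinct pages were really fetched.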
|
|
   |
 |
Imago
Captain


Joined: Jan 17, 2003
Posts: 629
Location: Europe
|
Posted:
Thu Sep 11, 2003 2:39 am |
  |
|
    |
 |
Imago
Captain


Joined: Jan 17, 2003
Posts: 629
Location: Europe
|
Posted:
Thu Sep 11, 2003 7:10 am |
  |
|
    |
 |
NovemberRain
Sergeant


Joined: Jan 30, 2003
Posts: 96
Location: Istanbul
|
Posted:
Thu Sep 11, 2003 5:22 pm |
  |
So what?
Uninstall Googlifier?
There are links to my short URLs all around the web. |
|
|
    |
 |
manunkind
Sergeant


Joined: Aug 16, 2003
Posts: 75
|
Posted:
Thu Sep 11, 2003 5:54 pm |
  |
I've been seeing this also.  |
|
|
   |
 |
Imago
Captain


Joined: Jan 17, 2003
Posts: 629
Location: Europe
|
Posted:
Thu Sep 11, 2003 10:13 pm |
  |
|
    |
 |
MikeMiles
Lieutenant


Joined: May 29, 2003
Posts: 231
|
Posted:
Fri Sep 12, 2003 1:49 am |
  |
Your robots.txt file currently shows you're only allowing two bots (plamenbot and yandex) to spider your site and have banned all the rest. Yandex is a Russian SE. Who is plamenbot? I don't see that one listed at all using a couple of SEs. Is that a Russian user agent Googlebot is using? If not, it shouldn't even be indexing your site at all right now with the blanket ban you have. From what I've read, only Googlebot recognizes wildcards in the robots.txt file, not the other bots.
There are also people who fake the Googlebot user agent. What is the IP of the one indexing your dynamic urls? |
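One common way to tell a real Googlebot from a faked user agent is a reverse DNS lookup on the IP followed by a forward lookup on the resulting hostname. The sketch below assumes that genuine crawler hostnames end in googlebot.com or google.com; the function name, the injectable lookup parameters, and the sample tables are hypothetical and exist only so the example runs without network access.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the hostname, then confirm it resolves back."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must lead back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as "not verified".
        return False

# Offline stand-ins for DNS, so the sketch runs without a network.
fake_ptr = {
    "64.68.88.146": "crawl34.googlebot.com",
    "203.0.113.9": "fake.example.net",
}
fake_a = {"crawl34.googlebot.com": ["64.68.88.146"]}

print(is_real_googlebot("64.68.88.146", fake_ptr.__getitem__, fake_a.__getitem__))
print(is_real_googlebot("203.0.113.9", fake_ptr.__getitem__, fake_a.__getitem__))
```

Called with only the IP, the function falls back to the real socket lookups. The user agent string alone proves nothing, since anyone can send it.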
|
|
   |
 |
Imago
Captain


Joined: Jan 17, 2003
Posts: 629
Location: Europe
|
Posted:
Fri Sep 12, 2003 4:34 am |
  |
// Who is plamenbot?
)))))))))
This is Googlebot. I have renamed it to plamenbot to be sure that no bot will index my pages (Plamen is my name).
Yet Google continued to index my site quite a while after the ban. Now it is out. I will keep it out every time I find it indexing modules.php dynamic URLs.
The IP is 64.68.88.xxx - usually there are 30 or 40 bots.
As for the wildcards, you are right. Yet the guys from Yandex told me the robots.txt was fine with them. The problem is that they do not index any short URL, but they index every dynamic URL. |
|
|
    |
 |
MikeMiles
Lieutenant


Joined: May 29, 2003
Posts: 231
|
Posted:
Fri Sep 12, 2003 6:27 am |
  |
oooh, okay. Sounds like a pretty cool bot's name to me. lol
Yup, that's a Google IP. Their bot usually follows robots.txt but sometimes takes a little while when you change it because it caches the file. I recommend you change these to have a leading slash. Not sure if Googlebot will take it w/o the slash but the standard says to use it.
Disallow: /admin.php
Disallow: /modules.php?name*$
Disallow: /odp.php
Disallow: /websearch.php |
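To try rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser. That parser does plain prefix matching and does not understand the * and $ wildcards, so the modules.php rule is written as a simple prefix here; the example.com URLs and the allowed helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, url, agent="Googlebot"):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

# Rules with the leading slash, as the standard describes.
WITH_SLASH = """User-agent: *
Disallow: /admin.php
Disallow: /modules.php
Disallow: /odp.php
Disallow: /websearch.php
"""

# The same kind of rule without the leading slash.
NO_SLASH = """User-agent: *
Disallow: admin.php
"""

for url in (
    "http://example.com/admin.php",
    "http://example.com/modules.php?name=News",
    "http://example.com/ftopic376.html",
):
    print(url, allowed(WITH_SLASH, url))

# This prefix-matching parser never matches a rule that lacks the slash.
print(allowed(NO_SLASH, "http://example.com/admin.php"))
```

In this parser, a rule without the leading slash blocks nothing, because URL paths always begin with a slash. Other crawlers may be more forgiving, but that is one more reason to write the rules the standard way.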
|
|
   |
 |
Imago
Captain


Joined: Jan 17, 2003
Posts: 629
Location: Europe
|
Posted:
Fri Sep 12, 2003 8:13 am |
  |
|
    |
 |
NovemberRain
Sergeant


Joined: Jan 30, 2003
Posts: 96
Location: Istanbul
|
Posted:
Fri Sep 12, 2003 7:16 pm |
  |
At the moment
64.68.88.146 crawl34.googlebot.com is indexing ftopic376.html
64.68.88.31 crawl32.googlebot.com is indexing forum1.html
Also, one Googlebot is indexing news with long URLs while another one is indexing the short ones.
Both the long URLs and the short ones are being indexed. I think this means twice the bandwidth for Google and two different types of URLs in the search results. We have to wait and see.
Also, 216.88.158.142 crawlers.looksmart.com is always at my site, although all bots other than Google are disallowed.
Last edited by NovemberRain on Fri Sep 12, 2003 7:34 pm; edited 1 time in total |
|
    |
 |
nickweb
Nuke Soldier


Joined: Aug 27, 2003
Posts: 34
|
Posted:
Fri Sep 12, 2003 7:22 pm |
  |
I have Who Is Where and IP Tracker installed at my site (www.nick-web.co.uk) and the Googlebots went absolutely crazy! I think they started eating up my bandwidth, so I banned all Googlebot IP ranges.
At last, a Who Is Where without 10 visitors from Google. lol |
|
|
   |
 |
Stylee
Sergeant


Joined: Jun 22, 2003
Posts: 140
Location: USA
|
Posted:
Fri Sep 12, 2003 7:43 pm |
  |
I don't have Googlifier installed on my site, and those same IPs have been there all day long, between 4 and 10 of them at a time.
Can someone shoot me a link for who is where? |
|
|
     |
 |
MikeMiles
Lieutenant


Joined: May 29, 2003
Posts: 231
|
Posted:
Fri Sep 12, 2003 8:10 pm |
  |
Googlebot always brings over a bunch of his buddies for serious indexing. They use distributed processing which actually makes indexing a site much faster.
All bots eat up bandwidth, but a person has to decide whether it is worth it for the potential visitors their SE may bring to his site. Some hardly bring any in return, but I think for most people Googlebot pays back what it eats during indexing. Like the rest, you have to watch him to make sure he behaves. |
|
|
   |
 |
|
|