Over the past couple of weeks, I have been noticing unusual activity on my version 7.4 site. News articles are suddenly getting far more reads than they normally do. While at first you might think it's good to get more traffic, it's the patterns that smell funny.
Specifically, the numbers of reads for all articles on the main page go up by exactly the same increment. For example, each one goes up by 10 or 20 (usually an even number) once or twice a day. I know all my articles are not equally popular, and seeing each one of them getting precisely equal additional reads just seems out of line.
So far, the increased traffic is not so substantial as to constitute a denial of service attack, but it's strange nonetheless.
tkevans Nuke Soldier
Joined: Jul 14, 2003
Posts: 29
Location: Baltimore, MD, USA
Posted:
Mon Oct 04, 2004 5:19 am
Just to follow up my own posting here:
This is a rude search engine (news.allresearch.com) repeatedly spidering the site, following all links every few hours. This would account for the pretty much exactly equal numbers of reads of all news articles.
wandering_goliard Nuke Cadet
Joined: Jul 19, 2003
Posts: 4
Posted:
Sat Oct 16, 2004 7:38 pm
I'll back this up, as well...they started banging the crap out of my site on 17 September, every day, same thing. I finally sent them a strongly worded email tonight, as well as banning the domain .allresearch.com from my sites.
tkevans Nuke Soldier
Joined: Jul 14, 2003
Posts: 29
Location: Baltimore, MD, USA
Posted:
Sun Oct 17, 2004 7:49 am
They seem to start out with the backend.php RSS feed, which would be fine if that's all they did. In fact, however, they open each item in the RSS feed, then follow all the links on the page, which includes links to the home/news module, your topics links, your old news links, and all the rest.
So all the links on that page, on the home/news page, and on all the other pages get followed. Then they hit the next RSS entry and do it all over again.
So, then, every few hours, this site scans pretty much everything on your site.
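To make the arithmetic of that pattern concrete, here is a rough sketch in Python. It is only a toy model of the behavior described above, not the crawler's actual code: the function name `simulate_pass`, the page names, and the idea that every article page carries the same block of links are all assumptions for illustration.

```python
from collections import Counter

def simulate_pass(feed_items, links_on_page, follow_page_links):
    """Count page loads for one pass over an RSS feed (toy model)."""
    loads = Counter()
    for item in feed_items:
        loads[item] += 1  # the article linked from the feed
        if follow_page_links:
            # Also request every link found on that article page.
            for link in links_on_page.get(item, ()):
                loads[link] += 1
    return loads

# Hypothetical site: three articles, each showing the same common blocks.
articles = ["article1", "article2", "article3"]
common_links = ["home", "topics"] + articles
pages = {a: common_links for a in articles}

with_links = simulate_pass(articles, pages, follow_page_links=True)
feed_only = simulate_pass(articles, pages, follow_page_links=False)
```

In this model every article is loaded once from the feed plus once per article page that links to it, so all of them go up by the same amount on each pass, which would match the identical increments in the read counters.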
Noah977 Nuke Cadet
Joined: Oct 19, 2004
Posts: 2
Posted:
Tue Oct 19, 2004 12:17 am
Hi,
I'm one of the authors of the system in question.
We're attempting to develop an RSS search engine. We DON'T spider any sites with this code.
The logic is simple.
1) Fetch an RSS feed.
2) Visit the links listed in the RSS feed.
3) Look for "timing clues" in the RSS feed that will tell us when to re-visit. The "clues" understood by our software include updateFrequency, updatePeriod, updateBase, ttl, skipDays, skipHours, the “e-tag” HTTP header, and the “Last-Modified” HTTP header.
If there are no timing clues at all, then we will re-visit in one hour.
Our goal is not to cause any kind of DOS or problem, but to be good netizens and follow published protocol. It is our understanding of the RSS protocol that a feed page will tell you when to re-visit for new content.
If there is a bug somewhere in our code, and our software is not correctly picking up a timing clue, then please let me know. We would be very appreciative and happy to fix it.
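For anyone who wants to check what their own feed advertises, the timing clues listed above can be read roughly like this. This is a minimal Python sketch and not the code of the system in question: the function name `revisit_seconds` is made up, it only looks at `ttl` (minutes, per RSS 2.0) and the syndication module's `updatePeriod`/`updateFrequency`, and it ignores `updateBase`, `skipDays`, `skipHours`, and the HTTP headers.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS 1.0 syndication module (sy:updatePeriod etc.).
SY_NS = "http://purl.org/rss/1.0/modules/syndication/"
PERIOD_SECONDS = {
    "hourly": 3600,
    "daily": 86400,
    "weekly": 604800,
    "monthly": 2592000,   # 30 days, an approximation
    "yearly": 31536000,   # 365 days
}
DEFAULT_SECONDS = 3600    # the one-hour fallback mentioned in the post

def revisit_seconds(feed_xml):
    """Return a suggested re-visit delay in seconds for an RSS document."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        channel = root

    # RSS 2.0 <ttl> is expressed in minutes.
    ttl = channel.findtext("ttl")
    if ttl and ttl.strip().isdigit():
        return int(ttl.strip()) * 60

    # updateFrequency = number of updates per updatePeriod.
    period = channel.findtext("{%s}updatePeriod" % SY_NS)
    if period and period.strip().lower() in PERIOD_SECONDS:
        freq_text = channel.findtext("{%s}updateFrequency" % SY_NS)
        freq = 1
        if freq_text and freq_text.strip().isdigit():
            freq = max(int(freq_text.strip()), 1)
        return PERIOD_SECONDS[period.strip().lower()] // freq

    return DEFAULT_SECONDS
```

If a feed carries none of these elements, the sketch falls back to one hour, which would be consistent with the hourly visits people are reporting.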
tkevans Nuke Soldier
Joined: Jul 14, 2003
Posts: 29
Location: Baltimore, MD, USA
Posted:
Tue Oct 19, 2004 4:14 am
As the original poster on this, I can assure you my site logfiles show repeated loads of not only the RSS feed, but every link on every page.
You need to familiarize yourself with how PHP-Nuke works, with its common blocks that appear on every page. You load every page in the RSS feed, which is fine, but you then follow every link on every page. This ends up with you loading every link on every page every time you scan the site--and that means repeated loads of the same pages.
I can show you multiple (maybe 10-15 of them) loads of the same pages within just a minute or two, pretty much hourly.
In any event, my ISP has blocked you altogether (not only for my site, but all others they host), since you're clearly using up excessive bandwidth.
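If you want to pull this kind of evidence out of your own logs, something along these lines may help. It is a sketch in Python that assumes Apache common/combined log format; the function name `repeated_loads`, the two-minute window, and the threshold of three are arbitrary choices for illustration, not anything from my ISP's tooling.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches: client IP, timestamp in [...], and the requested path.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|HEAD) (\S+)[^"]*"')

def repeated_loads(lines, window=timedelta(minutes=2), threshold=3):
    """Map (ip, path) to the largest number of loads seen inside any window."""
    hits = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        ip, stamp, path = m.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits[(ip, path)].append(when)

    report = {}
    for key, times in hits.items():
        times.sort()
        best, start = 0, 0
        # Sliding window over the sorted timestamps.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            best = max(best, end - start + 1)
        if best >= threshold:
            report[key] = best
    return report
```

Feed it the lines of an access log (for example `repeated_loads(open("access_log"))`, where the file name is whatever your host uses) and it reports which client requested which page at least `threshold` times inside one window.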
Noah977 Nuke Cadet
Joined: Oct 19, 2004
Posts: 2
Posted:
Fri Oct 22, 2004 8:14 am
Hi,
I'm not trying to be argumentative, but what you are describing is simply impossible. I KNOW what the code does. It simply DOESN'T spider. We wrote it from scratch, so we are very aware of what it does and doesn't do. Spidering a site involves a loop that strips links from each page visited and appends them to a data structure. We don't do that.
If you want to send me some log entries, I'll happily look at them.
Thanks,
-N
jfesler Nuke Cadet
Joined: Oct 24, 2004
Posts: 2
Posted:
Sun Oct 24, 2004 7:53 am
Alas, their crawler doesn't send If-Modified-Since or otherwise try to rely on its own cache. I can confirm their crawler's rudeness. Additionally, it never requests /robots.txt, let alone follows the robots.txt standard.
Between what I've seen on the web of this outfit (non-caching *hourly* crawling of destination links, no robots.txt handling, they service commercial accounts only and provide nothing to the general public) and what I've seen in my logs, I've simply null-routed them.
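For reference, the two courtesies I'm talking about look roughly like this from the client side. This is a small Python sketch using only the standard library; the helper names `allowed_by_robots` and `conditional_request` and the `ExampleFeedBot` user agent are invented for the example, and it only builds the request rather than sending it.

```python
from email.utils import formatdate
from urllib import robotparser
from urllib.request import Request

def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a site's /robots.txt."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def conditional_request(url, user_agent, last_fetch_epoch=None, etag=None):
    """Build a GET that lets the server answer 304 Not Modified."""
    headers = {"User-Agent": user_agent}
    if last_fetch_epoch is not None:
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = formatdate(last_fetch_epoch, usegmt=True)
    if etag:
        headers["If-None-Match"] = etag
    return Request(url, headers=headers)

robots = "User-agent: *\nDisallow: /admin/\n"
ok = allowed_by_robots(robots, "ExampleFeedBot", "http://example.com/backend.php")
req = conditional_request("http://example.com/backend.php", "ExampleFeedBot",
                          last_fetch_epoch=0)
```

A crawler that does both asks for /robots.txt before anything else, and on repeat visits mostly gets short 304 responses instead of the full page.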
oprime2001 Lieutenant
Joined: Jul 13, 2003
Posts: 165
Posted:
Sun Oct 24, 2004 10:07 am
What user agent shows up in your logs -- just in case I want to add it to my list of banned bots?
jfesler Nuke Cadet
Joined: Oct 24, 2004
Posts: 2
Posted:
Sun Oct 24, 2004 10:26 am
Looks like it is behind NAT. The one with the rude behavior is: "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" coming from 38.144.36.16.
The one with what looks like more reasonable behavior is: "Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)"; the Mac appears to be doing a proper If-Modified-Since (I *am* seeing 304's, and the links to .txt files are not being followed by the Mac).
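Since the rude client identifies itself with an ordinary browser string, banning on user agent alone would also lock out real visitors using that browser. A safer check pairs the user agent with the source address. Here is a sketch in Python; the function name `is_rude_crawler` is made up, and the address and string are simply the ones from my logs above, which may not hold for other sites or later dates.

```python
# Values taken from the log entries quoted in this post.
RUDE_IP = "38.144.36.16"
RUDE_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

def is_rude_crawler(remote_addr, user_agent):
    """True only when both the address and the user agent match."""
    return remote_addr == RUDE_IP and user_agent == RUDE_UA
```

The same pairing can be expressed in whatever ban mechanism your site already uses; the point is only that the user agent by itself is not specific enough.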