Hi folks, if you have any questions or comments feel free to reply.
I've received some questions lately as to what Google Tap is all about. In order to answer I'd like to start off with "why googletap"...
I started up http://computercops.biz over 1.5 years ago and wanted to have the site crawled by Google, but I was not very successful. The same thing happened with this site, http://nukecops.com.
This is where Google Tap comes into play... I've been told that someone originally wrote the mod_rewrite code for an old php-nuke, and if you know his name please feel free to comment on it. This original code slice was introduced to me by our same member who eventually released his googlifier. I've credited him in the GT release package.
With the history aside, the original code was ported into the new php-nuke CMS, and a complete rewrite of the forums and other links was done. A mod_rewrite overhaul, if you will.
This recode removed the &SID from the news article URLs, and also from the forum URLs. Google sees the SID as a per-session ID, much like a session cookie, and as such ignores the URLs that carry it.
That is why the php-nuke news and forums are not indexed.
Now I know many of you have wonderful data in both of these modules, and I for one wanted to share my sites via Google and other crawlers.
From there, the forums have been opened up and so has the news. And GoogleTap has taken on a life of its own shortening links to other modules.
This is how it works...
There is a block of PHP code within the header.php file. As the PHP engine executes it, the code finds all matching URLs in the page and translates them into shorter URLs, as defined in an array in header.php.
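As a rough illustration of that translation step, here is a minimal sketch in Python. The real code lives in header.php and is PHP. The patterns, the short URL shapes, and the function name `shorten_links` are assumptions made up for this example, not GoogleTap's actual arrays. The sketch applies a list of regular expressions to the page HTML and swaps each long link for a shorter one.

```python
import re

# Hypothetical patterns and replacements, kept as two parallel lists
# (illustrative only; these are not GoogleTap's real arrays).
LONG_PATTERNS = [
    r'modules\.php\?name=News&file=article&sid=(\d+)',
    r'modules\.php\?name=Forums&file=viewtopic&t=(\d+)',
]
SHORT_REPLACEMENTS = [
    r'article\1.html',
    r'ftopic\1.html',
]

def shorten_links(html):
    """Replace every long link found in the page with its short form."""
    for pattern, replacement in zip(LONG_PATTERNS, SHORT_REPLACEMENTS):
        html = re.sub(pattern, replacement, html)
    return html

page = '<a href="modules.php?name=News&file=article&sid=123">Story</a>'
print(shorten_links(page))  # prints <a href="article123.html">Story</a>
```

Links that match none of the patterns pass through untouched, so only the modules listed in the arrays get short URLs.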
When you click on the shorter links, the web server (Apache) can't find them because the files don't exist. This is where mod_rewrite comes into play. It maps the shorter link back to the real php-nuke long link, all behind the scenes. Transparent, if you will.
Now the problem with this is that mod_rewrite can be a major resource hog and typically makes Apache processes run away. It can effectively bring an entire server to a halt thanks to memory leaks.
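To make the reverse step concrete, the sketch below imitates in Python what a pair of rewrite rules would do on the server. It is only a model of the idea. The short URL shapes and the name `resolve` are assumptions for illustration, and the real mapping is done by mod_rewrite rules in .htaccess, not by Python.

```python
import re

# Hypothetical short-to-long rules, playing the role of RewriteRule lines.
REWRITE_RULES = [
    (r'^article(\d+)\.html$', r'modules.php?name=News&file=article&sid=\1'),
    (r'^ftopic(\d+)\.html$', r'modules.php?name=Forums&file=viewtopic&t=\1'),
]

def resolve(path):
    """Return the real php-nuke URL for a short path, or the path unchanged."""
    for pattern, target in REWRITE_RULES:
        match = re.match(pattern, path)
        if match:
            # Fill the captured ID into the long URL template.
            return match.expand(target)
    return path

print(resolve('article123.html'))  # prints modules.php?name=News&file=article&sid=123
```

A path that matches no rule is returned as-is, which mirrors how a request for a file that really exists would simply be served.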
For the longest time I had to run a cron job that monitored the server load average. I had it set to "5", and when reached, the cron job would restart the apache server, forcing it to fall below that "5" threshold. Anything above that started to cause a noticeable delay in page loads.
However, since the server was upgraded to a dual CPU system, I no longer have high server load issues, and apache no longer runs away. So this is truly the case of "buy better hardware to solve the software hog problem".
Google Tap has effectively enabled php-nuke sites to be reached by millions of Internet users. The key is to stay on top of its development, keeping future code in line with php-nuke code changes and add-ons.
At the same time I try new techniques to optimize the mod_rewrite code.
As a quick synopsis...
Within header.php, the first array holds the regular expressions that match the php-nuke URLs. The second array is used to take each matching URL and craft it into the new shorter URL.
Hence the header.php does this:
long url --> short url
Then .htaccess while using mod_rewrite does this:
short url --> long url
Now remember these for mod_rewrite to work:
- mod_rewrite needs to be compiled and loaded into Apache
- AllowOverride needs to be set to "All" for the directory your site resides in
- RewriteEngine needs to be turned "On" in the .htaccess file
I can get into more details if you like, so please let me know.
Thanks for your time.
For historical purposes, GT grew out of this thread:
_________________ Paul Laudanski, Microsoft MVP Windows-Security
CastleCops: [de] [en] [wiki]
sting Site Admin
Joined: Jul 24, 2003
Posts: 1985
Location: Apparently ALWAYS Online. . .
Posted:
Fri Nov 07, 2003 8:54 am
Could you post an example of your cron job to handle the runaway processes?
Thanks,
-sting
_________________ Is it paranoia if they are really out to get you?
-------------------------------------------------------
sting usually hangs out at nukehaven.net
Zhen-Xjell Nuke Cops Founder
Joined: Nov 14, 2002
Posts: 5939
Posted:
Fri Nov 07, 2003 8:56 am
filename: loadavg.pl
Code:
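#!/usr/bin/perl
# Restart Apache when the 1-minute load average exceeds the threshold.
use strict;
use warnings;

my $threshold = 5;

# The first field of /proc/loadavg is the 1-minute load average.
open(my $fh, '<', '/proc/loadavg') or die "couldn't open /proc/loadavg: $!\n";
my $line = <$fh>;
close($fh);
my @load = split(' ', $line);

if ($load[0] > $threshold) {
    system('/sbin/service', 'httpd', 'restart');
}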
KaTXi Nuke Soldier
Joined: Jul 02, 2003
Posts: 13
Posted:
Tue Nov 11, 2003 11:11 pm
mod_rewrite is killing my web site.
Is there any way to use mod_rewrite only when selected bots are visiting our sites, or the first time that someone comes from a "rewritten" URL (after searching on Google, for example)?
That should reduce the impact of mod_rewrite.
kjcdude Captain
Joined: Jun 10, 2003
Posts: 441
Location: Southern California
Posted:
Tue Nov 11, 2003 11:19 pm
I have also found that many problems arise from mod_rewrite, but I am still going to use GT because it gets me more pages cached.
Quote:
Is there any way to use mod_rewrite only when selected bots are visiting our sites
Certainly. Suppose the bot has the IP address 192.168.0.0. To tell mod_rewrite that a rule applies to that IP address only, place the line
Code:
RewriteCond %{REMOTE_ADDR} ^192\.168\.0\.0$
immediately before the RewriteRule line that you want it to apply to. REMOTE_ADDR holds the address of the client making the request.
For the IP addresses of the Google bots:
google deepbot (IP 216.x.x.x)
google fresh bot (IP 64.x.x.x)
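To show how a condition pattern like the one above selects clients, here is a small Python sketch. It assumes the condition is tested against the client's address and that Python's `re` behaves like mod_rewrite's regular expressions for patterns this simple. The prefix pattern and the name `condition_matches` are hypothetical, added to illustrate how a whole range of addresses could be covered by anchoring only the start.

```python
import re

# Exact-address pattern from the post, plus a hypothetical prefix pattern
# that would cover every address starting with 192.168.
EXACT = re.compile(r'^192\.168\.0\.0$')
PREFIX = re.compile(r'^192\.168\.')

def condition_matches(pattern, client_ip):
    """True if a rule guarded by this condition would apply to client_ip."""
    return pattern.search(client_ip) is not None

for ip in ('192.168.0.0', '192.168.7.9', '10.0.0.1'):
    print(ip, condition_matches(EXACT, ip), condition_matches(PREFIX, ip))
```

The anchored pattern matches one address only, while the prefix pattern matches any address in the range and nothing outside it.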