Note
Refer to the submitt
It appears that crawlers need to be slowed down somewhat, to prevent DRbServers from being swamped with requests.
Crawler Patterns
SBSM::Request::CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i
Pattern matching (sbsm/lib/sbsm/request.rb#is_crawler?)
!!CRAWLER_PATTERN.match(@cgi.user_agent)
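To see which user agents the pattern flags, the check can be tried outside SBSM. The following is a minimal sketch: the pattern is copied from above, while the method name crawler? and the Googlebot user-agent string are illustrative assumptions, not part of sbsm.

```ruby
# Same regular expression as SBSM::Request::CRAWLER_PATTERN,
# copied here so the example runs without sbsm.
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

# Hypothetical stand-alone version of is_crawler?; it takes the
# user agent as an argument instead of reading @cgi.user_agent.
def crawler?(user_agent)
  # to_s guards against a missing User-Agent header (nil)
  !!CRAWLER_PATTERN.match(user_agent.to_s)
end

puts crawler?("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
puts crawler?("Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0")
puts crawler?(nil)
```

The pattern is case-insensitive and unanchored, so any user agent containing one of the substrings (for example "bot" inside "Googlebot") counts as a crawler.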
Experiment
def is_crawler?
  warn "@cgi.user_agent = #{@cgi.user_agent}"
  !!CRAWLER_PATTERN.match(@cgi.user_agent)
end
Local test
Result
127.0.0.1 - - [20/Feb/2012:08:05:33 +0100] "GET /de/gcc/search/zone/drugs/search_query/inderal/search_type/st_oddb HTTP/1.1" 200 76393
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/javascript/bit.ly.js HTTP/1.1" 304 -
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/icon_twitter.gif HTTP/1.1" 200 664
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/mail.gif HTTP/1.1" 200 961
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/sponsor/gcc_default_Rectangle_Blisterkreuz_300x250_d.swf HTTP/1.1" 200 18540
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
Note
Check script
Result
masa@masa ~/work $ ruby check_crawler.rb access_log.20120120
total lines: 131901, crawler_pattern: 73715 (55.89 %)
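The check script itself is not reproduced on this page. A counting script along the following lines would produce output in the same format; this is only a sketch under assumptions, not the original check_crawler.rb. The helper names count_crawler_lines and report are hypothetical, and matching the pattern against the whole log line (rather than only the user-agent field) is an assumption that may slightly overcount, for example when a URL contains "bot".

```ruby
# Pattern copied from SBSM::Request::CRAWLER_PATTERN
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

# Hypothetical helper: returns [total, crawler] for an enumerable of log lines.
# Assumption: the pattern is matched against the whole line.
def count_crawler_lines(lines)
  total = 0
  crawler = 0
  lines.each do |line|
    total += 1
    # scrub replaces invalid byte sequences so the match cannot raise
    crawler += 1 if CRAWLER_PATTERN.match(line.scrub)
  end
  [total, crawler]
end

# Hypothetical helper: formats the summary line.
def report(total, crawler)
  percent = total.zero? ? 0.0 : crawler * 100.0 / total
  format("total lines: %d, crawler_pattern: %d (%.2f %%)", total, crawler, percent)
end

# Run only when called as a script with a log file argument
if __FILE__ == $0 && ARGV[0]
  total, crawler = File.open(ARGV[0]) { |f| count_crawler_lines(f.each_line) }
  puts report(total, crawler)
end
```

With the figures from the result above, 73715 of 131901 lines is 55.89 %, so more than half of the logged requests on that day matched the crawler pattern.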
Strategy
Algorithm design
Note
Thread creation point
def drb_process
args = {
'database_manager' => CGI::Session::DRbSession,
'drbsession_uri' => @drb_uri,
'session_path' => '/',
}
if(is_crawler?)
sleep 2.0
sid = [ENV['DEFAULT_FLAVOR'], @cgi.params['language'], @cgi.user_agent].join('-')
args.store('session_id', sid)
end
@session = CGI::Session.new(@cgi, args)
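The fixed session_id appears intended to make every request from the same crawler (same flavor, language and user agent) land in one shared session instead of creating a new session per request. The sketch below only illustrates how the id is composed; crawler_session_id is a hypothetical helper, and the flavor and user-agent values are made-up examples. Note that CGI#params returns an Array per key, which Array#join flattens.

```ruby
# Hypothetical helper mirroring the join used in drb_process.
# language is passed as an Array because CGI#params returns Arrays.
def crawler_session_id(flavor, language, user_agent)
  [flavor, language, user_agent].join('-')
end

first  = crawler_session_id('gcc', ['de'], 'Googlebot/2.1')
second = crawler_session_id('gcc', ['de'], 'Googlebot/2.1')
other  = crawler_session_id('gcc', ['fr'], 'Googlebot/2.1')

puts first            # id built from flavor, language and user agent
puts first == second  # identical requests share one session id
puts first == other   # a different language yields a different id
```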
Analyze access log
How to use
ruby check_log.rb access_log
Scripts
How to use
sh batch.sh
Note
Algorithm (oddbd application side)
def process(request)
  @request_path = request.unparsed_uri
  @process_start = Time.now
  super
  if(!is_crawler? && self.lookandfeel.enabled?(:query_limit))
    limit_queries
  end
  # Check the number of threads
  # If it is over the limit, sleep for a while
  check_threads
  '' ## return empty string across the drb-border
end
def check_threads
  if th = Thread.list.length and th > 100
    sleep(th / 10)
  end
end
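Because th is an Integer, th / 10 is integer division, so the delay grows in one-second steps and starts at 10 seconds as soon as the limit is exceeded. The following sketch computes the delay without actually sleeping; THREAD_LIMIT and throttle_seconds are hypothetical names introduced for illustration.

```ruby
THREAD_LIMIT = 100 # same threshold as in check_threads

# Hypothetical helper: seconds check_threads would sleep
# for a given number of live threads.
def throttle_seconds(thread_count)
  thread_count > THREAD_LIMIT ? thread_count / 10 : 0
end

[80, 100, 101, 150, 209].each do |n|
  puts "#{n} threads -> #{throttle_seconds(n)} s"
end
```

At exactly 100 threads nothing happens; at 101 threads the request already waits 10 seconds, and at 209 threads it waits 20 seconds.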
Note