20120220-update-crawler_pattern-drop-pointer-link-sbsm-test-heavy-access-oddb_org

Check crawler patterns
Control sending a request to oddb server
Test oddb.org with heavy access
Make a limit for the number of threads oddb.org

Commits

Note

It seems that these updates have not yet been applied to the online Apache server; to be continued tomorrow

Check crawler patterns

Refer to the commit message: "It appears that Crawlers need to be slowed down somewhat, to prevent DRbServers from being swamped with requests"

Crawler Patterns

 SBSM::Request::CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

Pattern matching (sbsm/lib/sbsm/request.rb#is_crawler?)

 !!CRAWLER_PATTERN.match(@cgi.user_agent)
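
A minimal sketch of how this check behaves, using the pattern quoted above. The crawler? helper and the sample user agent strings are illustrative assumptions and are not part of sbsm.

```ruby
# Pattern copied from SBSM::Request::CRAWLER_PATTERN as quoted above.
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

# Hypothetical stand-alone helper mirroring is_crawler?.
# to_s guards against a missing (nil) user agent.
def crawler?(user_agent)
  !!CRAWLER_PATTERN.match(user_agent.to_s)
end

samples = [
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  'Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0',
]
samples.each { |ua| puts "#{crawler?(ua)}  #{ua}" }
```

The pattern is a plain substring match, so any user agent that happens to contain "bot" or "spider" is treated as a crawler.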

Experiment

/usr/local/lib/ruby/gems/1.9.1/gems/sbsm-1.0.8/lib/sbsm/request.rb

    def is_crawler?
      warn "@cgi.user_agent = #{@cgi.user_agent}"
      !!CRAWLER_PATTERN.match(@cgi.user_agent)
    end

Local test

Access http://oddb.masa.org/de/gcc/search/zone/drugs/search_query/inderal/search_type/st_oddb#best_result

Result

/var/apache/access_log

127.0.0.1 - - [20/Feb/2012:08:05:33 +0100] "GET /de/gcc/search/zone/drugs/search_query/inderal/search_type/st_oddb HTTP/1.1" 200 76393
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/javascript/bit.ly.js HTTP/1.1" 304 -
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/icon_twitter.gif HTTP/1.1" 200 664
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/mail.gif HTTP/1.1" 200 961
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/sponsor/gcc_default_Rectangle_Blisterkreuz_300x250_d.swf HTTP/1.1" 200 18540

/var/apache/error_log

@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0

Note

Neither the IP address nor the URL is included in @cgi.user_agent

Check script

Attach:check_crawlers.rb.20120220.txt

Result

masa@masa ~/work $ ruby check_crawler.rb access_log.20120120 
total lines: 131901, crawler_pattern: 73715 (55.89 %)
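
The attached script is not reproduced on this page. The following is only a rough sketch of a counter that prints the same kind of summary. It assumes that each log line carries the user agent as its last double-quoted field (Apache "combined" format); the helper names are hypothetical.

```ruby
# Pattern copied from SBSM::Request::CRAWLER_PATTERN.
LOG_CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

# Assumption: the user agent is the last double-quoted field of the line.
# Returns nil when the line has no quoted field at all.
def user_agent_of(line)
  line.scan(/"([^"]*)"/).flatten.last
end

# Returns [total_lines, crawler_lines] for any enumerable of log lines.
def count_crawlers(lines)
  total = hits = 0
  lines.each do |line|
    total += 1
    hits += 1 if LOG_CRAWLER_PATTERN.match(user_agent_of(line).to_s)
  end
  [total, hits]
end

# Formats the counts in the same layout as the result shown above.
def summary(total, hits)
  percent = total.zero? ? 0.0 : hits * 100.0 / total
  format('total lines: %d, crawler_pattern: %d (%.2f %%)', total, hits, percent)
end

# Usage with a real file:
#   puts summary(*count_crawlers(File.foreach('access_log')))
```

If the log has no user agent column, the last quoted field is the request line itself, and a request such as /robots.txt would be counted as a crawler hit.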

Control sending a request to oddb server

Strategy

Add new crawler patterns based on an analysis of the access_log
Control requests with the following algorithm

Algorithm design

Check the request type (e.g. session ID, IP address or @cgi.user_agent)
Check the number of requests of that type within a few seconds
If there are too many, do not send the requests to the oddbd server
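
The steps above can be sketched as a small sliding-window counter. This is only an illustration: the class name RequestThrottle, the limit of 3 requests and the 2 second window are assumptions and do not exist in SBSM.

```ruby
# Hypothetical sliding-window counter, keyed by whatever identifies the client.
class RequestThrottle
  def initialize(max_requests, window_seconds)
    @max_requests = max_requests
    @window_seconds = window_seconds
    @seen = Hash.new { |hash, key| hash[key] = [] }
  end

  # key: session id, IP address or user agent.
  # Returns true if the request may be passed on to the server,
  # false if it should be held back.
  def allow?(key, now = Time.now)
    stamps = @seen[key]
    # Forget requests that have fallen out of the time window.
    stamps.reject! { |t| now - t >= @window_seconds }
    return false if stamps.size >= @max_requests
    stamps << now
    true
  end
end

throttle = RequestThrottle.new(3, 2.0)
start = Time.at(0)
# Five requests from one address, 0.1 seconds apart.
p 5.times.map { |i| throttle.allow?('203.0.113.7', start + i * 0.1) }
# => [true, true, true, false, false]
```

The counters live in the memory of one process. If Apache runs several processes, each would count on its own, so shared storage would be needed for an exact limit.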

Note

This processing should be done in the SBSM::Request class (on Apache with mod_ruby)
The SBSM::Request instance is created in doc/index.rbx and runs on mod_ruby

Thread creation point

SBSM::Request#drb_process

    def drb_process
      args = {
        'database_manager'  =>  CGI::Session::DRbSession,
        'drbsession_uri'    =>  @drb_uri,
        'session_path'      =>  '/',
      }
      if(is_crawler?)
        sleep 2.0
        sid = [ENV['DEFAULT_FLAVOR'], @cgi.params['language'], @cgi.user_agent].join('-')
        args.store('session_id', sid)
      end
      @session = CGI::Session.new(@cgi, args)

At this point, a new thread is created on the oddbd server side for each request if many requests arrive at the same time
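
The session id stored for crawlers depends only on the flavor, the language and the user agent, so repeated requests from the same crawler get the same id and presumably end up in the same session. A small sketch of that join outside of SBSM; the helper name and the sample values are assumptions.

```ruby
# Hypothetical helper reproducing the join used in drb_process.
def crawler_session_id(flavor, language_params, user_agent)
  [flavor, language_params, user_agent].join('-')
end

# CGI#params returns an array of values per key;
# Array#join flattens nested arrays with the same separator.
a = crawler_session_id('gcc', ['de'], 'Googlebot/2.1')
b = crawler_session_id('gcc', ['de'], 'Googlebot/2.1')
puts a       # gcc-de-Googlebot/2.1
puts a == b  # true
```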

Analyze access log

Attach:check_log.rb.20120220.txt

How to use

 ruby check_log.rb access_log

Test oddb.org with heavy access

Scripts

How to use

 sh batch.sh

Note

This batch script starts 100 access_test.rb processes that access http://oddb.masa.org at the same time
The url and user_agent of each request are picked at random from the access_log file
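
The scripts themselves are not reproduced on this page. The following is only a sketch of how one request could be picked from the log and prepared. It assumes a log in combined format (request line first, user agent last among the quoted fields); all helper names are hypothetical.

```ruby
require 'net/http'

# Returns [path, user_agent] for one log line, or nil if the line does not parse.
def parse_log_line(line)
  quoted = line.scan(/"([^"]*)"/).flatten
  return nil if quoted.size < 2
  path = quoted.first.split(' ')[1]  # "GET /path HTTP/1.1" -> "/path"
  path && [path, quoted.last]
end

# Picks one random [path, user_agent] pair; returns nil if nothing parses.
def pick_request(lines, random = Random.new)
  lines.map { |line| parse_log_line(line) }.compact.sample(random: random)
end

# Builds (but does not send) a GET request carrying the recorded user agent.
def build_request(path, user_agent)
  Net::HTTP::Get.new(path, { 'User-Agent' => user_agent })
end

# Sending it to the local test host would look like this:
#   path, ua = pick_request(File.readlines('access_log'))
#   Net::HTTP.start('oddb.masa.org', 80) do |http|
#     puts http.request(build_request(path, ua)).code
#   end
```

Replaying recorded user agents means that the crawler check on the server sees the same mix of crawlers and browsers as in production.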

Make a limit for the number of threads oddb.org

Algorithm (oddbd application side)

If the number of threads in the oddbd application goes over a limit (e.g. 100):
1. check whether the remote IP address is the same as before
2. sleep for a while (e.g. 10 seconds)

src/util/session.rb

    def process(request)
      @request_path = request.unparsed_uri
      @process_start = Time.now
      super
      if(!is_crawler? && self.lookandfeel.enabled?(:query_limit))
        limit_queries
      end

      # Check the number of threads
      # If it is over the limit, sleep for a while
      check_threads
      '' ## return empty string across the drb-border
    end

    def check_threads
      # Above 100 threads, sleep one second per 10 threads (integer division)
      th = Thread.list.length
      sleep(th / 10) if th > 100
    end
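
To make the arithmetic of check_threads concrete, here is the same calculation as a pure function. The name throttle_delay is hypothetical; the limit of 100 and the division by 10 come from the code above.

```ruby
# Hypothetical pure function with the same arithmetic as check_threads.
# Returns the number of seconds the request would sleep.
def throttle_delay(thread_count, limit = 100)
  thread_count > limit ? thread_count / 10 : 0
end

[100, 101, 250].each do |count|
  puts "#{count} threads -> sleep #{throttle_delay(count)} s"
end
# 100 threads -> sleep 0 s
# 101 threads -> sleep 10 s
# 250 threads -> sleep 25 s
```

A thread that is sleeping is still alive, so it still appears in Thread.list while it waits.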

Note

It does not help

ywesee Developer-Wiki
This wiki is intended for all ywesee developers