
20120220-update-crawler_pattern-drop-pointer-link-sbsm-test-heavy-access-oddb_org

<< Masa.20120221-update-indexRbx-sbsm-fix-ssl-err-interaction-shutdown-over80Threads-oddbOrg-update-gridRb-htmlgrid | Index | Masa.20120217-fix-errors-check-threads-problem-oddb_org >>


  1. Check crawler patterns
  2. Control sending a request to oddb server
  3. Test oddb.org with heavy access
  4. Make a limit for the number of threads oddb.org

Commits
  1. Added a new crawler pattern, windows, to SBSM::Request::CRAWLER_PATTERN (sbsm)
  2. Drop any request that contains 'pointer' (sbsm)

Note

  • It seems that these updates are not yet applied to Apache online -> continue tomorrow

Check crawler patterns

It appears that crawlers need to be slowed down somewhat, to prevent the DRbServers from being swamped with requests.

Crawler Patterns

 SBSM::Request::CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

Pattern matching (sbsm/lib/sbsm/request.rb#is_crawler?)

 !!CRAWLER_PATTERN.match(@cgi.user_agent)
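For illustration, the pattern can be exercised against a few sample user agents (a minimal sketch; the sample user-agent strings are invented, only the pattern itself is taken from sbsm):

```ruby
# Minimal sketch of sbsm's crawler detection, using the pattern quoted above.
# The sample user-agent strings below are invented for illustration.
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

def is_crawler?(user_agent)
  !!CRAWLER_PATTERN.match(user_agent)
end

puts is_crawler?("Googlebot/2.1 (+http://www.google.com/bot.html)")  # true ("bot")
puts is_crawler?("Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0")  # false
```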

Experiment

  • /usr/local/lib/ruby/gems/1.9.1/gems/sbsm-1.0.8/lib/sbsm/request.rb
    def is_crawler?
      warn "@cgi.user_agent = #{@cgi.user_agent}"
      !!CRAWLER_PATTERN.match(@cgi.user_agent)
    end

Local test

Result

  • /var/apache/access_log
127.0.0.1 - - [20/Feb/2012:08:05:33 +0100] "GET /de/gcc/search/zone/drugs/search_query/inderal/search_type/st_oddb HTTP/1.1" 200 76393
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/javascript/bit.ly.js HTTP/1.1" 304 -
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/icon_twitter.gif HTTP/1.1" 200 664
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/mail.gif HTTP/1.1" 200 961
127.0.0.1 - - [20/Feb/2012:08:05:34 +0100] "GET /resources/sponsor/gcc_default_Rectangle_Blisterkreuz_300x250_d.swf HTTP/1.1" 200 18540
  • /var/apache/error_log
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0
@cgi.user_agent = Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0

Note

  • Neither the IP address nor the URL is included in @cgi.user_agent

Check script

Result

masa@masa ~/work $ ruby check_crawler.rb access_log.20120120 
total lines: 131901, crawler_pattern: 73715 (55.89 %)
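The contents of check_crawler.rb are not shown here; a minimal sketch that reproduces the reported output format could look like the following (the script name and the output line are taken from the run above, the implementation itself is an assumption):

```ruby
# check_crawler.rb (sketch): count access_log lines that match
# SBSM::Request::CRAWLER_PATTERN. Only the output format is taken from the
# run above; the implementation is an assumption.
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

def count_crawlers(lines)
  total    = lines.length
  crawlers = lines.count { |line| CRAWLER_PATTERN.match(line) }
  [total, crawlers]
end

if __FILE__ == $0 and ARGV[0]
  total, crawlers = count_crawlers(File.readlines(ARGV[0]))
  printf("total lines: %d, crawler_pattern: %d (%.2f %%)\n",
         total, crawlers, 100.0 * crawlers / total)
end
```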

Control sending a request to oddb server

Strategy

  1. Add a new crawler pattern based on analysis of the access_log
  2. Control requests by the following algorithm

Algorithm design

  1. Check the request type (e.g. session ID, IP address, or @cgi.user_agent)
  2. Check the number of requests within a few seconds
  3. If over the limit, do not forward the requests to the oddbd server
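The three steps above could be sketched as a small in-memory rate limiter (a sketch only; the class and method names are invented, and sbsm does not necessarily implement it this way):

```ruby
# Sketch of the algorithm above: key each request by session ID, IP address
# or user agent, count the requests seen within a short window, and reject
# the request instead of forwarding it to the oddbd server once a limit is
# exceeded. Class and method names are invented for illustration.
class RequestLimiter
  def initialize(window: 2.0, limit: 5)
    @window = window   # seconds to look back
    @limit  = limit    # max requests per key within the window
    @seen   = Hash.new { |h, k| h[k] = [] }
  end

  # returns true if the request may be forwarded to the oddbd server
  def allow?(key, now = Time.now.to_f)
    stamps = @seen[key]
    stamps.reject! { |t| now - t > @window }  # forget requests outside the window
    stamps << now
    stamps.length <= @limit
  end
end
```

A limiter like this would have to live in the SBSM::Request layer on Apache (mod_ruby), as noted below, so that rejected requests never reach the DRb server.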

Note

  • These processes should be done in the SBSM::Request class (on Apache (mod_ruby))
  • The SBSM::Request instance is created in doc/index.rbx and runs on mod_ruby

Thread creation point

  • SBSM::Request#drb_process
     def drb_process
       args = {
         'database_manager' => CGI::Session::DRbSession,
         'drbsession_uri'   => @drb_uri,
         'session_path'     => '/',
       }
       if(is_crawler?)
         sleep 2.0
         sid = [ENV['DEFAULT_FLAVOR'], @cgi.params['language'], @cgi.user_agent].join('-')
         args.store('session_id', sid)
       end
       @session = CGI::Session.new(@cgi, args)
  • At this point, a new thread is created on the oddbd server side if many accesses come in at the same time

Analyze access log

How to use

 ruby check_log.rb access_log

Test oddb.org with heavy access

Scripts

How to use

 sh batch.sh

Note

  • This batch script launches 100 access_test.rb processes that access http://oddb.masa.org at the same time
  • The url and user_agent are taken randomly from an access_log file
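The contents of access_test.rb are not shown here; a minimal sketch of the behaviour described above (pick a random url and user_agent from an access_log and request it) could look like this. Only the script name, the target host, and the "random url/user_agent from access_log" behaviour come from the note above; the parsing and request code are assumptions:

```ruby
# access_test.rb (sketch): pick a random URL and user agent from an Apache
# access_log and request it from the test host. The implementation is an
# assumption; only the described behaviour is taken from the note above.
require 'net/http'

# Extract (path, user_agent) pairs from access_log lines. The user-agent
# field only exists in combined-format logs; fall back to a fixed string
# for common-format lines like the ones shown earlier on this page.
def sample_requests(lines)
  lines.map { |line|
    if line =~ /"GET (\S+) HTTP\/1\.\d" \d+ \S+(?: "([^"]*)" "([^"]*)")?/
      [$1, $3 || 'Mozilla/5.0']
    end
  }.compact
end

if __FILE__ == $0 and ARGV[0]
  path, agent = sample_requests(File.readlines(ARGV[0])).sample
  uri = URI("http://oddb.masa.org#{path}")
  Net::HTTP.start(uri.host, uri.port) do |http|
    res = http.get(uri.path, 'User-Agent' => agent)
    puts "#{res.code} #{path}"
  end
end
```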

Make a limit for the number of threads oddb.org

Algorithm (oddbd application side)

  • If the number of threads in the oddbd application goes over a limit (e.g. 100)
    1. check the remote IP address; if it is the same as before
    2. sleep for a while (e.g. 10 seconds)
  • src/util/session.rb
    def process(request)
      @request_path = request.unparsed_uri
      @process_start = Time.now
      super
      if(!is_crawler? && self.lookandfeel.enabled?(:query_limit))
        limit_queries
      end

      # Check the number of threads
      # If it is over the limit, sleep for a while
      check_threads
      '' ## return empty string across the drb-border
    end
    def check_threads
      th = Thread.list.length
      if th > 100
        sleep(th / 10)
      end
    end

Note

  • It does not help
Page last modified on February 20, 2012, at 05:25 PM