Port to use rack

Must fix the select_all error when running the siege test against santesuisse.

Adding a mutex around the initialisation of sorted_fachinfos, sorted_minifis and sorted_feedbacks in oddbapp.rb solved this problem. But maybe it would be easier just to initialize these items correctly.
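The fix can be sketched as follows. The accessor names come from oddbapp.rb, but the class skeleton and the lazy-initialisation body are assumptions; one shared mutex ensures concurrent siege requests cannot initialise the cached collections twice:

```ruby
# Sketch of the mutex fix described above. Mutex is built into Ruby's core,
# no require needed.
class OddbApp
  def initialize
    @cache_lock = Mutex.new
  end

  def sorted_fachinfos
    @cache_lock.synchronize do
      # ||= is not atomic on its own; the lock makes it safe
      @sorted_fachinfos ||= load_sorted_fachinfos
    end
  end

  private

  # stand-in for the real, expensive initialisation
  def load_sorted_fachinfos
    sleep 0.01
    [:fachinfo_a, :fachinfo_b]
  end
end
```

The alternative mentioned above, eagerly initialising the collections in initialize, would avoid taking the lock on every access.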

Also added a rewrite.log, as I am unsure whether some problems were related to wrong rewrites.

Looked at why we got so many errors like FEHLER: Relation »object« existiert bereits (ERROR: relation »object« already exists). With PostgreSQL >= 9.1 (which introduced IF NOT EXISTS for CREATE TABLE) we could add a line definition.sub!('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ') in lib/odba/storage.rb @ line 526 to fix this problem.
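The idea of the one-line change, sketched on a sample definition string (the string itself is an assumption, not the real table definition from storage.rb):

```ruby
# Make the CREATE TABLE idempotent so the "relation already exists"
# error disappears. Sample definition string is an assumption.
definition = "CREATE TABLE object (odba_id INTEGER PRIMARY KEY, content TEXT);"
definition.sub!('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ')
puts definition
# => CREATE TABLE IF NOT EXISTS object (odba_id INTEGER PRIMARY KEY, content TEXT);
```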

Pushed a commit "Fix concurrency for running siege". Changed the new apache config to use the new directory /var/www/ on thinpower. This makes it possible to activate/deactivate the new version by just replacing the apache config and reloading it.

Activating the changes on thinpower via

  • cd /var/www/
  • git pull rack
  • /usr/local/bin/bundle-240 install
  • sudo svc -h /service/ch.oddb.rack/
  • sudo tail -f /service/ch.oddb.rack/log/main/current # to check whether it starts correctly
  • cp /home/ywesee/ /etc/apache2/vhosts.d/
  • mkdir doc/sl_errors
  • chown apache doc/sl_errors
  • /etc/init.d/apache2 reload

We have one problem with missing status pages, and Zeno thinks that we are slower than with the pre-rack version. Reverted to the old version with:

  • cp /home/ywesee/ /etc/apache2/vhosts.d/
  • /etc/init.d/apache2 reload

The missing status pages are:

Pushed commits

Also fixed the rewrite rules in apache.conf for

Committed "Use SBSM 1.4.8".

Tried to remove the @cache_lock in method_missing of src/util/oddbapp.rb, but this led rapidly to the select_all error when running siege against

The status pages for crawler and google_crawler do not get created correctly; this must be fixed. Reverted on thinpower to the old mod_ruby-based version.

We must rethink how to handle Google crawlers. Up to now we used a method is_crawler? in SBSM::Session. This method was implemented in the pre-rack version of sbsm in lib/sbsm/request.rb as follows:

  def is_crawler?
      crawler_pattern = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

It was removed when converting to rack, as the unit test for is_crawler still worked.

On the other hand, I think we have two choices:

  • use the same pattern inside the apache conf and, on a match, proxy to a different port than the normal rack instance
  • fix the SBSM session and redirect there, which means that the crawlers can still affect normal users

We will use the first solution. Also, Zeno would like to have separate log files for the crawlers. And I want to remove the legacy references to is_crawler from SBSM and the code. No, we cannot remove it completely, as it is used in the global state for download.

Pushed commit

Must remove bin/crawler and make it possible to set the port number and app name. Must change service/run to something like:

  exec sudo -u apache bundle-240 exec rackup -p 8112 -e "APPNAME='crawler'"

Fixed locally and the services/run to make them start. But now I have to adapt the apache conf and test it with something like curl -v --user-agent 'Google Crawler Test'

Noticed that apache conf regular expressions are case-sensitive by default. Found the following lines in the logs:

==> /service/ch.oddb-google_crawler/log/main/current <==
@40000000595263be3a87352c - - [27/Jun/2017:15:55:00 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 8.6771

==> /var/www/ <==
 - - [27/Jun/2017:15:54:52 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 "-" "Google crawler Test"

==> /var/log/apache2/rewrite.log <==
 - - [27/Jun/2017:15:54:52 +0200] [][rid#25f0dd0/initial] (2) init rewrite engine with requested uri /de/gcc/fachinfo/reg/45674
 - - [27/Jun/2017:15:54:52 +0200] [][rid#25f0dd0/initial] (3) applying pattern '^/(.*)$' to uri '/de/gcc/fachinfo/reg/45674'
 - - [27/Jun/2017:15:54:52 +0200] [][rid#25f0dd0/initial] (2) rewrite '/de/gcc/fachinfo/reg/45674' -> 'http://localhost:8112/de/gcc/fachinfo/reg/45674'
 - - [27/Jun/2017:15:54:52 +0200] [][rid#25f0dd0/initial] (2) forcing proxy-throughput with http://localhost:8112/de/gcc/fachinfo/reg/45674
 - - [27/Jun/2017:15:54:52 +0200] [][rid#25f0dd0/initial] (1) go-ahead with proxy request proxy:http://localhost:8112/de/gcc/fachinfo/reg/45674 [OK]

This works fine. But how do I distinguish between the Google crawler and the other ones?
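One possible answer, sketched in Ruby (a sketch only, not what was deployed): check a Google-specific pattern before the generic crawler pattern. The generic pattern is the one quoted above from lib/sbsm/request.rb; the Google pattern and the classify helper are assumptions:

```ruby
# Classify a User-Agent string as :google_crawler, :crawler or :user.
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i
GOOGLE_PATTERN  = /googlebot/i  # assumption: match the Googlebot token

def classify(user_agent)
  return :google_crawler if GOOGLE_PATTERN.match?(user_agent)
  return :crawler        if CRAWLER_PATTERN.match?(user_agent)
  :user
end
```

The same order-sensitive idea applies in the apache conf: the more specific pattern has to be tested first.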

Looking for different crawlers on thinpower

thinpower # egrep -i "archiver|slurp|bot|crawler|jeeves|spider|\.{6}"  /var/www/*/access_log   | cut -d ' ' -f12-20 | sort | uniq
"Googlebot/2.1 (+"
"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML,
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
"Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; psscanapp; rv:11.0) like
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv: Gecko/2009073022
"Mozilla/5.0 (compatible; AhrefsBot/5.2; +"
"Mozilla/5.0 (compatible; BLEXBot/1.0; +"
"Mozilla/5.0 (compatible; Baiduspider/2.0; +"
"Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +"
"Mozilla/5.0 (compatible; Exabot/3.0; +"
"Mozilla/5.0 (compatible; Googlebot/2.1; +"
"Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +"
"Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.7;"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.8;"
"Mozilla/5.0 (compatible; SEOkicks-Robot; +"
"Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +"
"Mozilla/5.0 (compatible; Yahoo! Slurp;"
"Mozilla/5.0 (compatible; YandexBot/3.0; +"
"Mozilla/5.0 (compatible; bingbot/2.0; +"
"Mozilla/5.0 (compatible; linkdexbot/2.2; +"
"Mozilla/5.0 (compatible; proximic; +"
"Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS
"Sogou web spider/4.0(+"

Must create different log files per process (user, crawler, google_crawler). Done with commits
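The per-process log files can be sketched like this (not the committed code; the helper name and path layout are assumptions, keyed by an app name like the APPNAME constant passed via rackup above):

```ruby
require 'logger'
require 'tmpdir'

# One log file per process, named after the app (user, crawler,
# google_crawler).
def log_for(app_name, dir)
  Logger.new(File.join(dir, "#{app_name}.log"))
end

Dir.mktmpdir do |dir|
  %w[user crawler google_crawler].each do |name|
    log_for(name, dir).info("#{name} started")
  end
  # each process now writes to its own file in dir
end
```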

Adapted the apache config for and added the following rewrite rules to and generika.

  # ports must be kept in sync between apache.conf and /service/ch.oddb-*crawler/run  
  RewriteRule ^/(.*)$ http://localhost:8112/$1 [P,L]
  RewriteCond %{HTTP_USER_AGENT} "archiver|slurp|bot|crawler|jeeves|spider|\.{6}"
  RewriteRule ^/(.*)$ http://localhost:8212/$1 [P,L]  

(Modifier P => proxy, L => last).
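Note that as listed above, the unconditional rule to port 8112 carries [L] and fires first, so the crawler condition below it can never match; and since the match is case-sensitive by default (see the remark above), user agents like "BLEXBot" would slip through anyway. A sketch of a corrected order, assuming crawler traffic is meant to reach port 8212 ([NC] makes the condition case-insensitive):

```apache
# crawler traffic first: if the User-Agent matches, proxy to the crawler instance
RewriteCond %{HTTP_USER_AGENT} "archiver|slurp|bot|crawler|jeeves|spider|\.{6}" [NC]
RewriteRule ^/(.*)$ http://localhost:8212/$1 [P,L]
# everything else goes to the normal rack instance
RewriteRule ^/(.*)$ http://localhost:8112/$1 [P,L]
```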

Page last modified on June 27, 2017, at 05:37 PM