

20170627-oddb-org-rack

Summary

Commits

Index

Port oddb.org to use rack

Must fix the select_all error when running the siege test against santesuisse.

Adding a mutex around the initialisation of sorted_fachinfos, sorted_minifis and sorted_feedbacks in oddbapp.rb solved this problem. But maybe it would be easier just to initialize these items correctly.
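A minimal sketch of the mutex approach, assuming a lazily built sorted list. The class AppSketch, its constructor argument and @init_lock are hypothetical and are not taken from oddbapp.rb; only the name sorted_fachinfos comes from the notes above.

```ruby
# Hypothetical stand-in for the application object; not the real oddbapp.rb code.
class AppSketch
  def initialize(fachinfos)
    @fachinfos = fachinfos
    @sorted_fachinfos = nil
    @init_lock = Mutex.new
  end

  # Builds the sorted list at most once, even when several threads
  # call this method at the same time.
  def sorted_fachinfos
    @init_lock.synchronize do
      @sorted_fachinfos ||= @fachinfos.sort
    end
  end
end

app = AppSketch.new(%w[c a b])
# Eight concurrent readers, similar in spirit to what siege produces.
results = Array.new(8) { Thread.new { app.sorted_fachinfos } }.map(&:value)
puts results.uniq.inspect
```

The alternative mentioned above would be to build the sorted lists eagerly in the constructor, so that no lock is needed on the read path.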

Also added a rewrite.log for ch.oddb.org, santesuisse.oddb.org and www.oddb.org, as I am unsure whether some of the problems are caused by wrong rewrites.

Looked at why we got so many errors like FEHLER: Relation »object« existiert bereits (ERROR: relation "object" already exists). With PostgreSQL >= 9.1 we could add a line definition.sub!('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ') in lib/odba/storage.rb @ line 526 to fix this problem.
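The substitution can be tried in isolation. This is only a sketch: the helper name create_if_missing and the sample table definition are hypothetical, and the real statement is built in lib/odba/storage.rb. It assumes the PostgreSQL spelling IF NOT EXISTS, which CREATE TABLE accepts since 9.1.

```ruby
# Hypothetical helper; returns a copy of the statement that PostgreSQL
# will not reject when the table already exists.
def create_if_missing(definition)
  # Leave statements alone that already carry the clause.
  return definition if definition =~ /\ACREATE TABLE IF NOT EXISTS /i
  definition.sub(/\ACREATE TABLE /i, 'CREATE TABLE IF NOT EXISTS ')
end

# Hypothetical definition, for illustration only.
sample = 'CREATE TABLE object (odba_id INTEGER NOT NULL PRIMARY KEY, content TEXT);'
puts create_if_missing(sample)
```

Anchoring the pattern at the start of the string keeps other statements, such as CREATE INDEX, untouched.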

Pushed a commit Fix concurrency for running siege. Changed the new apache config to use the new directory /var/www/oddb.org.rack/ on thinpower. This makes it possible to activate or deactivate the new version just by replacing the apache config and reloading it.

Activating the changes on thinpower via

  • cd /var/www/oddb.org.rack
  • git pull https://github.com/ngiger/oddb.org.git rack
  • /usr/local/bin/bundle-240 install
  • sudo svc -h /service/ch.oddb.rack/
  • sudo tail -f /service/ch.oddb.rack/log/main/current # to check whether it starts correctly
  • cp /home/ywesee/20_oddb.org.conf.rack /etc/apache2/vhosts.d/20_oddb.org.conf
  • mkdir doc/sl_errors
  • chown apache doc/sl_errors
  • /etc/init.d/apache2 reload

We have one problem with missing status pages, and Zeno thinks that we are slower than with the pre-rack version. Reverting to the old version with

  • cp /home/ywesee/20_oddb.org.conf.pre_rack /etc/apache2/vhosts.d/20_oddb.org.conf
  • /etc/init.d/apache2 reload

The missing status pages are:

Pushed commits

Also fixed the rewrite rules in apache.conf for i.ch.oddb.org.

Committed in oddb.org Use SBSM 1.4.8

Tried to remove the @cache_lock in method_missing of src/util/oddbapp.rb, but this rapidly led to the select_all error when running siege against http://santesuisse.oddb-ci2.dyndns.org.

The status pages for crawler and google_crawler do not get created correctly. This must be fixed. Reverted on thinpower to use the old mod_ruby based version.

We must rethink how to handle Google crawlers. Up to now we used a method is_crawler? in SBSM::Session. This method was implemented in the pre-rack version of sbsm in lib/sbsm/request.rb as follows

  def is_crawler?
    crawler_pattern = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i
    !!crawler_pattern.match(@cgi.user_agent)
  end

It was removed when converting to rack, as the unit test for is_crawler still worked.
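Under rack the same check could be done directly on the request environment. This is a sketch under assumptions, not the SBSM implementation: the constant CRAWLER_PATTERN and the method crawler? are hypothetical names. It relies only on the Rack convention that the User-Agent header is available as env['HTTP_USER_AGENT'].

```ruby
# Same pattern as in the pre-rack is_crawler? method.
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

# env is a Rack environment hash. A missing header is treated as "not a crawler".
def crawler?(env)
  !!(CRAWLER_PATTERN =~ env['HTTP_USER_AGENT'].to_s)
end

googlebot = { 'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' }
firefox   = { 'HTTP_USER_AGENT' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0' }

puts crawler?(googlebot)
puts crawler?(firefox)
puts crawler?({})
```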

On the other hand, I think we have two choices:

  • Use the same pattern inside the apache conf and, on a match, redirect to a different port than the normal rack process.
  • Fix the SBSM session and redirect, which means that crawlers can still affect normal users.

We will use the first solution. Also Zeno would like to have separate log files for the crawlers. And I want to remove the legacy references to is_crawler from the SBSM and oddb.org code. No, we cannot remove it completely from oddb.org, as it is used in the global state for download.

Pushed commit

Must remove bin/crawler and correct config.ru to be able to set the port number and app name. Must change service/run to something like exec sudo -u apache bundle-240 exec rackup -p 8112 -e "APPNAME='crawler'"

Fixed config.ru and services/run locally to make them start. But now I have to adapt the apache conf and test it with something like curl -v --user-agent 'Google Crawler Test' http://oddb-ci2.dyndns.org/de/gcc/fachinfo/reg/45674

Noticed that the apache conf regular expression is case sensitive by default. Found the following lines in the logs

==> /service/ch.oddb-google_crawler/log/main/current <==
@40000000595263be3a87352c 172.25.1.75 - - [27/Jun/2017:15:55:00 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 8.6771

==> /var/www/oddb.org.rack/log/oddb/access_log <==
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 "-" "Google crawler Test"

==> /var/log/apache2/rewrite.log <==
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) init rewrite engine with requested uri /de/gcc/fachinfo/reg/45674
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (3) applying pattern '^/(.*)$' to uri '/de/gcc/fachinfo/reg/45674'
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) rewrite '/de/gcc/fachinfo/reg/45674' -> 'http://localhost:8112/de/gcc/fachinfo/reg/45674'
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) forcing proxy-throughput with http://localhost:8112/de/gcc/fachinfo/reg/45674
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (1) go-ahead with proxy request proxy:http://localhost:8112/de/gcc/fachinfo/reg/45674 [OK]

This works fine. But how do I distinguish between the google-crawler and the other ones?

Looking for different crawlers on thinpower

thinpower oddb.org # egrep -i "archiver|slurp|bot|crawler|jeeves|spider|\.{6}"  /var/www/oddb.org/log/generika/2017/06/*/access_log   | cut -d ' ' -f12-20 | sort | uniq
"Googlebot-Image/1.0"
"Googlebot/2.1 (+http://www.google.com/bot.html)"
"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML,
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
"Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; psscanapp; rv:11.0) like
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022
"Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
"Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)"
"Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)"
"Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)"
"Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
"Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
"Mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)"
"Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
"Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS
"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
<...>

Must create different log files per process (user, crawler, google_crawler). Done with commits
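One way to tell the three processes apart by user agent is to test for Googlebot first and fall back to the general crawler pattern. This is only a sketch: GOOGLE_PATTERN, ANY_CRAWLER and process_for are hypothetical names, and matching on the substring "googlebot" is an assumption based on the user agents listed above, not a rule taken from the apache config.

```ruby
GOOGLE_PATTERN = /googlebot/i
ANY_CRAWLER    = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

# Returns the name of the process that should serve (and log) the request.
# The order matters: Googlebot also matches the general "bot" pattern.
def process_for(user_agent)
  ua = user_agent.to_s
  return :google_crawler if ua =~ GOOGLE_PATTERN
  return :crawler if ua =~ ANY_CRAWLER
  :user
end

agents = [
  'Googlebot-Image/1.0',
  'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
  'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0'
]
agents.each { |ua| puts "#{process_for(ua)}\t#{ua}" }
```

Note that a user agent is supplied by the client, so a check like this only separates traffic and cannot verify that a request really comes from Google.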

Adapted the apache config for ch.oddb.org and generika.oddb.org. Added the following rewrite rules to both.

  # ports must be kept in sync between apache.conf and /service/ch.oddb-*crawler/run  
  RewriteRule ^/(.*)$ http://localhost:8112/$1 [P,L]
  RewriteCond %{HTTP_USER_AGENT} "archiver|slurp|bot|crawler|jeeves|spider|\.{6}"
  RewriteRule ^/(.*)$ http://localhost:8212/$1 [P,L]  

(Modifier P => proxy, L => last).

Page last modified on June 27, 2017, at 05:37 PM