Must fix the select_all error when running the siege test against santesuisse.
Adding a mutex around the initialisation of sorted_fachinfos, sorted_minifis and sorted_feedbacks in oddbapp.rb solved this problem. But maybe it would be easier just to initialize these items correctly.
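A minimal sketch of the mutex approach (only sorted_fachinfos shown; the @fachinfos source collection and the sort key are assumptions, not the actual oddbapp.rb code):

  SORTED_FACHINFOS_MUTEX = Mutex.new
  def sorted_fachinfos
    # serialize the first, expensive build so that two concurrent siege
    # requests cannot each observe a half-initialized array
    SORTED_FACHINFOS_MUTEX.synchronize do
      @sorted_fachinfos ||= @fachinfos.values.sort_by { |fi| fi.name }
    end
  end

The alternative would be to build the three arrays eagerly in the constructor, before the first request can reach them.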
Also added a rewrite.log for ch.oddb.org, santesuisse.oddb.org and www.oddb.org, as I am unsure whether some of the problems were caused by wrong rewrites.
Looked at why we got so many errors like FEHLER: Relation »object« existiert bereits (German for: ERROR: relation »object« already exists). With PostgreSQL >= 9.1 we could add a line definition.sub!('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ') in lib/odba/storage.rb @ line 526 to fix this problem.
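In context, the proposed patch would make table creation idempotent (a sketch, assuming definition holds the CREATE TABLE statement at that point in lib/odba/storage.rb):

  # PostgreSQL >= 9.1 understands IF NOT EXISTS, so rerunning the setup
  # no longer raises "relation ... already exists"
  definition.sub!('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ')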
Pushed a commit: Fix concurrency for running siege. Changed the new apache config to use the new directory /var/www/oddb.org.rack/ on thinpower. This makes it possible to activate/deactivate the new version by just replacing the apache config and reloading it.
Activating the changes on thinpower via
We have one problem with missing status pages, and Zeno thinks that we are slower than with the pre-rack version. Reverting to the old version with
The missing status pages are:
Pushed commits
Also fixed the rewrite rules in apache.conf for i.ch.oddb.org.
Committed in oddb.org: Use SBSM 1.4.8
Tried to remove the @cache_lock in method_missing of src/util/oddbapp.rb, but this led rapidly to the select_all error when running siege against http://santesuisse.oddb-ci2.dyndns.org.
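Roughly, the lock looks like this (a sketch of the delegation pattern, not the actual oddbapp.rb code; @system and the DBI layer underneath it are the assumed shared state):

  def method_missing(meth, *args, &block)
    # the underlying DBI/ODBA connection is not thread-safe, so all
    # delegated calls are serialized through one mutex
    @cache_lock.synchronize do
      @system.send(meth, *args, &block)
    end
  end

Removing it lets two siege requests drive the same database connection concurrently, which would explain the select_all error.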
The status pages for crawler and google_crawler do not get created correctly. This must be fixed. Reverted on thinpower to use the old mod_ruby-based version.
We must rethink how to handle Google crawlers. Up to now we used a method is_crawler? in SBSM::Session. This method was implemented in the pre-rack version of sbsm in lib/sbsm/request.rb as follows:
def is_crawler?
  crawler_pattern = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i
  !!crawler_pattern.match(@cgi.user_agent)
end
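For reference, the pattern classifies typical user agents like this (a quick check against the reconstructed pattern above):

  pattern = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i
  pattern.match('Mozilla/5.0 (compatible; Googlebot/2.1)') # matches 'bot'
  pattern.match('Google Crawler Test')                     # matches 'Crawler' thanks to /i
  pattern.match('Mozilla/5.0 (Windows NT 10.0; Win64)')    # nil: a normal browser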
It was removed when converting to rack; this went unnoticed because the unit test for is_crawler still passed.
On the other hand, I think we have two choices:
We will use the first solution. Also, Zeno would like to have separate log files for the crawlers. And I want to remove the legacy references to is_crawler from the SBSM and oddb.org code. No, we cannot remove it completely from oddb.org, as it is used in the global state for download.
Pushed commit
Must remove bin/crawler and correct config.ru to be able to set the port number and app name. Must change service/run to something like exec sudo -u apache bundle-240 exec rackup -p 8112 -e "APPNAME='crawler'"
Fixed config.ru and the services/run locally to make them start. But now I have to adapt the apache conf and test it with something like curl -v --user-agent 'Google Crawler Test' http://oddb-ci2.dyndns.org/de/gcc/fachinfo/reg/45674
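A sketch of how config.ru can pick the app name up (APPNAME comes from the -e option in the service run script shown above; everything else here, including the entry point and the log path, is hypothetical, not the actual oddb.org code):

  # config.ru
  require './src/util/oddbapp'   # hypothetical entry point
  app_name = defined?(APPNAME) ? APPNAME : 'user'
  # one access log per process, so user, crawler and google_crawler
  # traffic end up in separate files
  log = File.open("log/oddb/#{app_name}_access_log", 'a')
  log.sync = true
  use Rack::CommonLogger, log
  run ODDB::App.new              # hypothetical application class

The port itself does not belong in config.ru; rackup -p sets it per service.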
Noticed that apache conf regular expressions are case-sensitive by default. Found the following lines in the logs:
==> /service/ch.oddb-google_crawler/log/main/current <==
@40000000595263be3a87352c 172.25.1.75 - - [27/Jun/2017:15:55:00 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 8.6771

==> /var/www/oddb.org.rack/log/oddb/access_log <==
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 "-" "Google crawler Test"

==> /var/log/apache2/rewrite.log <==
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) init rewrite engine with requested uri /de/gcc/fachinfo/reg/45674
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (3) applying pattern '^/(.*)$' to uri '/de/gcc/fachinfo/reg/45674'
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) rewrite '/de/gcc/fachinfo/reg/45674' -> 'http://localhost:8112/de/gcc/fachinfo/reg/45674'
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) forcing proxy-throughput with http://localhost:8112/de/gcc/fachinfo/reg/45674
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (1) go-ahead with proxy request proxy:http://localhost:8112/de/gcc/fachinfo/reg/45674 [OK]
This works fine. But how do I distinguish the Google crawler from the other ones? (See the sketch after the list below.)
Looking for different crawlers on thinpower
thinpower oddb.org # egrep -i "archiver|slurp|bot|crawler|jeeves|spider|\.{6}" /var/www/oddb.org/log/generika/2017/06/*/access_log | cut -d ' ' -f12-20 | sort | uniq
"Googlebot-Image/1.0"
"Googlebot/2.1 (+http://www.google.com/bot.html)"
"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML,
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
"Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; psscanapp; rv:11.0) like
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022
"Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
"Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)"
"Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)"
"Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)"
"Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
"Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
"Mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)"
"Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
"Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS
"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
<...>
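In apache terms the answer is: match Google user agents first, then the generic crawler pattern, and let everything else fall through to the normal user process. As a Ruby sketch of the intended routing (the google pattern and the default port are assumptions; 8112 and 8212 are the ports from the rewrite rules below):

  GOOGLE_PATTERN  = /google/i
  CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

  def backend_port(user_agent)
    return 8112 if GOOGLE_PATTERN.match(user_agent)   # google_crawler service
    return 8212 if CRAWLER_PATTERN.match(user_agent)  # generic crawler service
    10002                                             # hypothetical default user port
  end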
Must create different log files per process (user, crawler, google_crawler). Done with commits
Adapted the apache config for ch.oddb.org and generika.oddb.org by adding the following rewrite rules:
# ports must be kept in sync between apache.conf and /service/ch.oddb-*crawler/run
# assumption: a RewriteCond like the following must precede the first rule;
# without it every request would be proxied to port 8112 and [L] would stop
# all further rewriting. [NC] is needed because apache regular expressions
# are case-sensitive by default (see the remark above).
RewriteCond %{HTTP_USER_AGENT} "google" [NC]
RewriteRule ^/(.*)$ http://localhost:8112/$1 [P,L]
RewriteCond %{HTTP_USER_AGENT} "archiver|slurp|bot|crawler|jeeves|spider|\.{6}" [NC]
RewriteRule ^/(.*)$ http://localhost:8212/$1 [P,L]
(Modifier P => proxy, L => last).