Must fix the select_all error when running the siege test against santesuisse.
Adding a mutex around the initialisation of sorted_fachinfos, sorted_minifis and sorted_feedbacks in oddbapp.rb solved this problem. But maybe it would be easier to just initialize these items correctly.
Also added a rewrite.log for ch.oddb.org, santesuisse.oddb.org and www.oddb.org, as I am unsure whether some of the problems were caused by wrong rewrites.
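A minimal sketch of the mutex-guarded lazy initialisation described above; the class name and the fachinfos helper are assumptions for this sketch, only the sorted_* accessor idea comes from the diary entry:

```ruby
class OddbApp
  SORT_MUTEX = Mutex.new

  # Guard the lazy initialisation so concurrent siege requests cannot
  # observe a half-built array (the suspected cause of the select_all error).
  def sorted_fachinfos
    SORT_MUTEX.synchronize do
      @sorted_fachinfos ||= fachinfos.sort_by { |fi| fi.to_s }
    end
  end

  private

  # Stand-in for the real accessor; an assumption for this sketch.
  def fachinfos
    @fachinfos ||= []
  end
end
```

The same pattern would apply to sorted_minifis and sorted_feedbacks.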
Looked at why we got so many errors like FEHLER: Relation »object« existiert bereits (ERROR: relation »object« already exists). With PostgreSQL >= 9.1, which supports CREATE TABLE IF NOT EXISTS, we could fix this problem by adding the line definition.sub!('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ') in lib/odba/storage.rb @ line 526.
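The idea as a hedged sketch; the method name is an assumption, in lib/odba/storage.rb the sub! would be applied to the table definition in place:

```ruby
# Make a CREATE TABLE statement idempotent for PostgreSQL >= 9.1,
# which understands CREATE TABLE IF NOT EXISTS. Re-running the
# statement then no longer raises "relation ... already exists".
def idempotent_create(definition)
  definition.sub('CREATE TABLE ', 'CREATE TABLE IF NOT EXISTS ')
end
```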
Pushed a commit: Fix concurrency for running siege. Changed the new apache config to use the new directory /var/www/oddb.org.rack/ on thinpower. This makes it possible to activate/deactivate the new version simply by replacing the apache config and reloading it.
Activating the changes on thinpower via
We have one problem with missing status pages, and Zeno thinks that we are slower than with the pre-rack version. Reverting to the old version with
The missing status pages are:
Pushed commits
Also fixed the rewrite rules in apache.conf for i.ch.oddb.org.
Committed in oddb.org: Use SBSM 1.4.8
Tried to remove the @cache_lock in method_missing of src/util/oddbapp.rb, but this rapidly led to the select_all error when running siege against http://santesuisse.oddb-ci2.dyndns.org.
The status pages for crawler and google_crawler do not get created correctly. This must be fixed. Reverted on thinpower to use the old mod_ruby based version.
We must rethink how to handle Google crawlers. Up to now we used a method is_crawler? in SBSM::Session. This method was implemented in the pre-rack version of SBSM in lib/sbsm/request.rb as follows:
def is_crawler?
  crawler_pattern = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i
  !!crawler_pattern.match(@cgi.user_agent)
end
It was removed when converting to rack, as the unit test for is_crawler? still worked.
On the other hand I think we have two choices:
We will use the first solution. Also, Zeno would like to have separate log files for the crawlers. And I want to remove the legacy references to is_crawler? from the SBSM and oddb.org code. No, we cannot remove it completely from oddb.org, as it is used in the global state for download.
Pushed commit
Must remove bin/crawler and correct config.ru to be able to set the port number and app name. Must change service/run to something like exec sudo -u apache bundle-240 exec rackup -p 8112 -e "APPNAME='crawler'"
Fixed config.ru and the services/run locally to make them start. But now I have to adapt the apache conf and test it with something like curl -v --user-agent 'Google Crawler Test' http://oddb-ci2.dyndns.org/de/gcc/fachinfo/reg/45674
Noticed that the apache conf regular expressions are case sensitive by default. Found the following lines in the logs:
==> /service/ch.oddb-google_crawler/log/main/current <==
@40000000595263be3a87352c 172.25.1.75 - - [27/Jun/2017:15:55:00 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 8.6771

==> /var/www/oddb.org.rack/log/oddb/access_log <==
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] "GET /de/gcc/fachinfo/reg/45674 HTTP/1.1" 200 75609 "-" "Google crawler Test"

==> /var/log/apache2/rewrite.log <==
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) init rewrite engine with requested uri /de/gcc/fachinfo/reg/45674
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (3) applying pattern '^/(.*)$' to uri '/de/gcc/fachinfo/reg/45674'
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) rewrite '/de/gcc/fachinfo/reg/45674' -> 'http://localhost:8112/de/gcc/fachinfo/reg/45674'
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (2) forcing proxy-throughput with http://localhost:8112/de/gcc/fachinfo/reg/45674
172.25.1.75 - - [27/Jun/2017:15:54:52 +0200] [oddb-ci2.dyndns.org/sid#23d7130][rid#25f0dd0/initial] (1) go-ahead with proxy request proxy:http://localhost:8112/de/gcc/fachinfo/reg/45674 [OK]
This works fine. But how do I distinguish between the google-crawler and the other ones?
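One way to distinguish the Google crawler from the other bots is to match its user agent first and only then fall back to the generic crawler pattern; a sketch, where the Google pattern and the method name are assumptions (only the generic pattern comes from the old is_crawler?):

```ruby
GOOGLE_PATTERN  = /googlebot/i
CRAWLER_PATTERN = /archiver|slurp|bot|crawler|jeeves|spider|\.{6}/i

# Classify a user agent so requests can be proxied to the right
# backend (google_crawler, crawler, or the normal user app).
# Order matters: Googlebot also matches the generic "bot" pattern.
def classify(user_agent)
  return :google_crawler if GOOGLE_PATTERN.match(user_agent)
  return :crawler        if CRAWLER_PATTERN.match(user_agent)
  :user
end
```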
Looking for different crawlers on thinpower
thinpower oddb.org # egrep -i "archiver|slurp|bot|crawler|jeeves|spider|\.{6}" /var/www/oddb.org/log/generika/2017/06/*/access_log | cut -d ' ' -f12-20 | sort | uniq
"Googlebot-Image/1.0"
"Googlebot/2.1 (+http://www.google.com/bot.html)"
"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML,
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0"
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
"Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; psscanapp; rv:11.0) like
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022
"Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
"Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)"
"Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)"
"Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)"
"Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
"Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
"Mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)"
"Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
"Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS
"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
<...>
Must create different log files per process (user, crawler, google_crawler). Done with commits
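A per-process log file can be derived from the app name the process is started with; a sketch, where the method name, the base path and the fallback to 'user' are assumptions:

```ruby
# Derive a separate access_log path per process (user, crawler,
# google_crawler), so crawler traffic can be analysed on its own.
def access_log_path(app_name, base = '/var/www/oddb.org.rack/log/oddb')
  name = %w[crawler google_crawler].include?(app_name) ? app_name : 'user'
  File.join(base, name, 'access_log')
end
```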
Adapted the apache config for ch.oddb.org and generika.oddb.org. Added the following rewrite rules to ch.oddb.org and generika.oddb.org:
# ports must be kept in sync between apache.conf and /service/ch.oddb-*crawler/run
RewriteRule ^/(.*)$ http://localhost:8112/$1 [P,L]
RewriteCond %{HTTP_USER_AGENT} "archiver|slurp|bot|crawler|jeeves|spider|\.{6}"
RewriteRule ^/(.*)$ http://localhost:8212/$1 [P,L]
(Modifiers: P => proxy, L => last.)