Checking ext/fiparse/test/data/html/de/fi_58106.de.html showed that it contains 165 occurrences of </p but only 163 of <p.
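A count like this can be reproduced with a few lines of Ruby. The following is only a sketch: the helper name count_p_tags and the sample string are made up for illustration. Unlike a plain substring count of <p, it uses a word boundary so that tags such as <pre> are not counted, which means the numbers may differ slightly from the ones quoted above.

```ruby
# Count opening and closing <p> tags in an HTML string.
# count_p_tags is a hypothetical helper, not part of oddb.org.
# The \b word boundary keeps <pre>, <param> etc. from being counted as <p>.
def count_p_tags(html)
  {
    open:  html.scan(/<p\b/i).size,
    close: html.scan(/<\/p\b/i).size
  }
end

# Made-up sample with one surplus closing tag.
sample = '<p class="s4">one</p><p>two</p></p><pre>x</pre>'
counts = count_p_tags(sample)
puts "open=#{counts[:open]} close=#{counts[:close]}"   # prints "open=2 close=3"
```

For the real file one would pass File.read of the path above instead of the sample string.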
The following commit fixes the problem for me. It does not even require reparsing, only a restart of the oddbd service.
The method has_italic? in textinfo_hpricot.rb returns true whenever it encounters the style s8 in a swissmedic-info document.
I think that to parse the content element inside a medicalInformation element, one must consider the styles defined in the style element!
Here is the style from 58106 that creates the problem: s8 is defined as .s8{font-family:Arial;font-size:11pt;}, with no italic at all.
<style>p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;}table{border-spacing:0pt;border-collapse:collapse;} table td{vertical-align:top;}.s2{font-family:Arial;font-size:14pt;font-weight:bold;}.s3{font-family:Arial;font-size:11.2pt;font-weight:bold;}.s4{line-height:150%;margin-right:113.3pt;}.s5{font-size:11pt;line-height:150%;margin-right:113.3pt;}.s6{font-family:Arial;font-size:11pt;font-weight:bold;}.s7{font-family:Arial;font-size:11pt;font-style:italic;}.s8{font-family:Arial;font-size:11pt;}.s9{font-family:Arial;font-size:8.8pt;}.s10{font-family:Arial;font-size:11pt;font-style:italic;text-decoration:line-through;}.s11{line-height:150%;margin-right:113.4pt;}.s12{font-size:11pt;line-height:150%;margin-right:113.4pt;}.s13{font-family:Arial;font-size:11pt;color:#000000;}.s14{font-family:Arial;font-size:9.5pt;}.s15{font-family:Arial;font-size:11pt;line-height:150%;margin-right:56.7pt;}.s16{line-height:150%;margin-right:56.7pt;}.s17{font-family:Times New Roman;font-size:8.8pt;}.s18{font-family:Arial;font-size:11pt;line-height:150%;margin-right:113.4pt;}</style>
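To see which classes in such a style element really are italic, the rules can be split into a hash of properties. This is a rough sketch that assumes the flat ".class{prop:value;...}" form shown above. The helper names class_styles and italic_classes are hypothetical, and the CSS string is just three rules copied from the 58106 style.

```ruby
# Parse simple ".class{prop:value;...}" rules into { "class" => { prop => value } }.
# Hypothetical helper; it only handles the flat rule form seen in these files.
def class_styles(css)
  rules = {}
  css.scan(/\.([A-Za-z][\w-]*)\s*\{([^}]*)\}/) do |name, body|
    props = {}
    body.split(';').each do |decl|
      key, value = decl.split(':', 2)
      next if value.nil?          # skip fragments without a colon
      props[key.strip] = value.strip
    end
    rules[name] = props
  end
  rules
end

# Names of all classes whose rule declares font-style:italic.
def italic_classes(css)
  class_styles(css).select { |_, props| props['font-style'] == 'italic' }.keys
end

css = '.s7{font-family:Arial;font-size:11pt;font-style:italic;}' \
      '.s8{font-family:Arial;font-size:11pt;}' \
      '.s10{font-family:Arial;font-size:11pt;font-style:italic;text-decoration:line-through;}'
p italic_classes(css)   # => ["s7", "s10"]
```

With these rules s8 does not show up in the result, which is why treating s8 as italic is wrong for 58106.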
And here is the style from 58107, which has no problem because it has no style definition for s8 and therefore never references it.
<style>.h1{font-size:22pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-weight:bolder;color:black;}.LabelText{font-family:Arial,"Helvetica Black","LB Helvetica Black","Univers Black",Zurich,"sans-serif";font-size:smaller;font-weight:bolder;color:Black;}.ErrorMessage{font-family:Arial,"Helvetica Black","LB Helvetica Black","Univers Black",Zurich,"sans-serif";font-size:small;color:Red;}.ErrorMessageSmall{font-family:Arial,"Helvetica Black","LB Helvetica Black","Univers Black",Zurich,"sans-serif";font-size:smaller;color:Red;}.ListText{font-family:"Arial Black","Helvetica Black","LB Helvetica Black","Univers Black","Zurich Blk BT","sans-serif";font-size:smaller;}.EmptyGridText{font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";color:Red;}.CompanyDetail{font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-size:smaller;}.ContentTitle{font-family:"Arial Black","Helvetica Black","LB Helvetica Black","Univers Black","Zurich Blk BT","sans-serif";color:#003366;}.CopyrightText{font-family:"Arial Black","Helvetica Black","LB Helvetica Black","Univers Black","Zurich Blk BT","sans-serif";font-size:smaller;color:#003366;}.BekanntmachungenText{font-family:Arial,"Helvetica Black","LB Helvetica Black","Univers Black",Zurich,"sans-serif";font-size:smaller;color:Black;}.BekanntmachungenTitel{font-family:Arial,"Helvetica Black","LB Helvetica Black","Univers Black",Zurich,"sans-serif";font-size:smaller;font-weight:bolder;color:Black;}.CookieWarning{font-family:Arial,"Helvetica Black","LB Helvetica Black","Univers Black",Zurich,"sans-serif";font-size:medium;font-weight:bolder;color:Red;}.HelpText{font-size:12pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";color:black;}.HelpTextBold{font-size:11pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";color:black;font-weight:bold;}.HelpTextSmall{font-size:11pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";color:black;}.HelpTextSmallBold{font-size:11pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";color:black;font-weight:bold;}.Zwischentitel{font-size:12pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-weight:bolder;color:black;}.AdobeAcrobatReaderText{font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-size:x-small;}.Hyperlinks{font-family:Arial,"Helvetica Black","LB Helvetica Black","Univers Black",Zurich,"sans-serif";font-size:smaller;color:Blue;}#menue{z-index:1;position:fixed;left:0px;top:0px;}#monographie{margin-left:30px;margin-right:30px;margin-bottom:10px;font-family:"Verdana","Arial","Tahoma","Helvetica","Geneva","sans-serif";color:black;}#monographie .MonTitle{font-size:1.5em;font-weight:bold;margin-bottom:0.2em;}#monographie .absTitle{font-size:0.9em;font-weight:bold;font-style:italic;margin-bottom:0.0em;}#monographie .untertitel{font-size:0.9em;font-weight:normal;margin-top:0.5em;margin-bottom:0.0em;}#monographie .untertitel1{font-size:0.9em;font-weight:normal;font-style:italic;margin-top:0.2em;margin-bottom:0.0em;}#monographie .header{font-size:0.85em;font-weight:normal;color:#999999;text-align:left;margin-bottom:2em;visibility:hidden;display:none;}#monographie .footer{font-size:0.85em;font-weight:normal;color:#999999;margin-top:2.0em;border-top:#999999 1px solid;padding-top:0.5em;}#monographie div p{font-size:0.9em;}#monographie .paragraph{font-weight:normal;font-style:normal;margin-top:0.8em;}.noSpacing{margin-top:0em;margin-bottom:0em;}.spacing1{margin-top:0em;margin-bottom:0.25em;}.spacing2{margin-top:0em;margin-bottom:0.5em;}#monographie .ownerCompany{font-size:1em;font-style:italic;font-weight:bold;text-align:right;margin-bottom:1.0em;border-top:black 1px solid;border-bottom:black 1px solid;padding-top:0.2em;padding-bottom:0.2em;}#monographie .titleAdd{font-size:0.9em;font-weight:bold;font-style:italic;}#monographie .shortCharacteristic{font-size:1.1em;font-style:italic;}#monographie .indention1{margin-left:5em;}#monographie .indention2{margin-left:10em;}#monographie .indention3{margin-left:15em;}#monographie .box{font-size:.9em;font-weight:normal;font-style:normal;margin-top:5px;margin-bottom:5px;padding-top:5px;padding-bottom:6px;padding-left:5px;padding-right:5px;border-width:1px;border-color:Black;border-style:solid;}#monographie .image{margin-top:20px;margin-bottom:20px;}#monographie table{font-family:"Courier New","sans-serif";font-size:1.0em;margin-top:1.0em;margin-bottom:1.0em;border-top:solid 1px black;border-bottom:solid 1px black;}#monographie td{font-family:"Courier New","sans-serif";font-size:1.0em;}.rowSepBelow{border-bottom:solid 1px black;}.goUp{float:right;margin-right:-40px;}.tblArticles{border:solid 1pt #E5E7E8;vertical-align:top;text-align:left;border-spacing:0;width:100%;}.tblArticles .product{width:37%;font-size:small;vertical-align:top;border-top:solid 1pt #E5E7E8;border-right:solid 1pt #E5E7E8;}.tblArticles .productEmpty{width:37%;}.tblArticles .normal-right{border-right:solid 1pt #E5E7E8;border-top:solid 1pt #E5E7E8;width:15%;text-align:right;vertical-align:top;font-size:small;}.tblArticles .normal-center{border-right:solid 1pt #E5E7E8;border-top:solid 1pt #E5E7E8;width:10%;text-align:center;vertical-align:top;font-size:small;}.tblArticles .picture{width:15%;text-align:center;border-top:solid 1pt #E5E7E8;}.tblArticles .pictureEmpty{width:15%;border-top:solid 1pt #E5E7E8;}</style>
I think the style element comes from the embedded Word document. If we want correct results, we must adapt the parser to take these style definitions into account. Therefore FachinfoHpricot.extract must be extended to receive the HTML, the style and the title. I will create test cases.
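What a style-aware call taking the HTML, the style and the title might look like can be sketched as follows. This is not the real FachinfoHpricot.extract: the method name extract_italic_spans, the return hash and the sample data are all invented for illustration, and plain regular expressions stand in for Hpricot so that the snippet runs without extra gems. It also assumes that the text is marked up as span elements with a class attribute and that spans are not nested.

```ruby
# Hypothetical stand-in for a style-aware extract(html, styles, title).
# Regexes replace Hpricot here; nested spans are not handled.
def extract_italic_spans(html, styles, title)
  # Classes whose rule body contains font-style:italic.
  italic = styles.scan(/\.([A-Za-z][\w-]*)\s*\{[^}]*font-style:\s*italic/).flatten
  # Pairs of [class attribute, inner text] for every span.
  spans = html.scan(/<span\s+class="([^"]*)"\s*>(.*?)<\/span>/m)
  texts = spans.select { |classes, _| (classes.split & italic).any? }
               .map { |_, text| text }
  { title: title, italic_classes: italic, italic_texts: texts }
end

# Made-up input: s7 is italic, s8 is not.
styles = '.s7{font-size:11pt;font-style:italic;}.s8{font-family:Arial;font-size:11pt;}'
html   = '<p><span class="s7">Zusammensetzung</span><span class="s8">Wirkstoff</span></p>'
result = extract_italic_spans(html, styles, 'Example')
p result[:italic_texts]   # => ["Zusammensetzung"]
```

The point of the design is that the italic decision depends on the per-document style element and not on a fixed class name such as s8.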
Now running jobs/update_textinfo_swissmedicinfo --target=fi --no-download --reparse 58107 58106 78656 62111 62439 62223 62728
After running the update, 58106 looks okay and so does 58107, but the others still have the italic problem.
I also got the following funny error with http://oddb.niklaus.org/de/gcc/fachinfo/reg/62728
ODBA::Stub was unable to replace ODDB::Text::Chapter#29962902 from ODDB::FachinfoDocument2001:#29962897
ODBA::Stub was unable to replace ODDB::Text::Chapter#29962902 from ODDB::FachinfoDocument2001:#29962897
ODBA::Stub was unable to replace ODDB::Text::Chapter#29962902 from ODDB::FachinfoDocument2001:#29962897
ODBA::Stub was unable to replace ODDB::Text::Chapter#29962902 from ODDB::FachinfoDocument2001:#29962897
error in SBSM::Session#process: /de/gcc
NoMethodError
undefined method `to_a' for "62728":String
/var/www/oddb.org/src/plugin/text_info.rb:550:in `import_fulltext'
/var/www/oddb.org/src/state/admin/registration.rb:73:in `get_fachinfo'
/var/www/oddb.org/src/state/admin/registration.rb:143:in `do_update'
/var/www/oddb.org/src/state/admin/registration.rb:228:in `update'
/usr/local/lib64/ruby/gems/1.9.1/gems/sbsm-1.2.3/lib/sbsm/state.rb:203:in `_trigger'
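The NoMethodError fits the fact that String#to_a was removed in Ruby 1.9, the version shown in the gem path of the trace. Assuming that import_fulltext receives either a single registration number or a list of them (the source of text_info.rb is not shown here, so this is a guess), Kernel#Array is a common way to accept both. The helper name normalize_iksnrs is hypothetical.

```ruby
# Turn a single registration number, a list of them, or nil into an array.
# Hypothetical helper; Kernel#Array wraps a String instead of calling to_a on it.
def normalize_iksnrs(iksnrs)
  Array(iksnrs).map(&:to_s)
end

p normalize_iksnrs("62728")              # => ["62728"]
p normalize_iksnrs(["58106", "58107"])   # => ["58106", "58107"]
p normalize_iksnrs(nil)                  # => []
```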
I added some debug output to the job, which gives me:
opts == {:target=>:fi, :reparse=>true, :iksnrs=>["58107", "58106", "78656", "62111", "62439", "62223", "62728"], :companies=>[], :download=>false}
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/de/Finasterid_Mepha__5_swissmedicinfo.html, nil, name Finasterid-Mepha® 5, styles .h1{font-size:22pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-weight:bolder;color:black;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/fr/Finast_ride_Mepha__5_swissmedicinfo.html, nil, name Finastéride-Mepha® 5, styles .h1{font-size:22pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-weight:bolder;color:black;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/de/Finasterid_Streuli__5_swissmedicinfo.html, nil, name Finasterid Streuli® 5, styles .h1{font-size:22pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-weight:bolder;color:black;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/fr/Finast_ride_Streuli__5_swissmedicinfo.html, nil, name Finastéride Streuli® 5, styles .h1{font-size:22pt;font-family:Arial,Helvetica,Univers,"Zurich BT","sans-serif";font-weight:bolder;color:black;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/de/Bisoprolol_Axapharm_swissmedicinfo.html, nil, name Bisoprolol Axapharm, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/fr/Bisoprolol_Axapharm_swissmedicinfo.html, nil, name Bisoprolol Axapharm, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/de/Xalos_Duo_swissmedicinfo.html, nil, name Xalos-Duo, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/fr/Xalos_Duo_swissmedicinfo.html, nil, name Xalos-Duo, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/de/Olanpax__Filmtabletten_Schmelztabletten_swissmedicinfo.html, nil, name Olanpax® Filmtabletten/Schmelztabletten, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/fr/Olanpax__Comprim_s_pellicul_s_Comprim_s_orodispersibles_swissmedicinfo.html, nil, name Olanpax® Comprimés pelliculés/Comprimés orodispersibles, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/de/Diclo_Acino_retard_rektale_Kapseln__Film__Retardtabletten_swissmedicinfo.html, nil, name Diclo-Acino retard/rektale Kapseln, Film-/Retardtabletten, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
parse_and_update: calls parse_fachinfo, /var/www/oddb.org/data/html/fachinfo/fr/Diclo_Acino_comprim_s_pellicul_s___liberation_prolong_e__capsules_rectales___lib_ration_prolong_e_swissmedicinfo.html, nil, name Diclo-Acino comprimés pelliculés/à liberation prolongée, capsules rectales/à libération prolongée, styles p{margin-top:0pt;margin-right:0pt;margin-bottom:0pt;margin-left:0pt;
Here is a sample of how the styles are defined:
extract: swissmedicinfo type pi name Finasterid-Mepha® 5 styles ["s4", "s6", "s11"]
extract: swissmedicinfo type pi name Finastéride-Mepha® 5 styles ["s4", "s6", "s11"]
extract: swissmedicinfo type pi name Finasterid Streuli® 5, Filmtabletten styles ["s5", "s13", "s14"]
extract: swissmedicinfo type pi name Finastéride Streuli® 5, comprimés filmés styles ["s5", "s14", "s15", "s16"]
extract: swissmedicinfo type pi name Timoptic® styles ["s2", "s7"]
extract: swissmedicinfo type pi name Timoptic® styles ["s2", "s7", "s17"]
extract: swissmedicinfo type pi name Xalos-Duo styles ["s9", "s11"]
extract: swissmedicinfo type pi name Xalos-Duo styles ["s9", "s13", "s14"]
extract: swissmedicinfo type pi name Olanpax Filmtabletten und Schmelztabletten styles ["s10", "s11", "s12"]
extract: swissmedicinfo type pi name Olanpax comprimés pelliculés et comprimés orodispersibles styles ["s9", "s10", "s11"]
extract: swissmedicinfo type pi name Diclo-Acino Filmtabletten, rektale Kapseln styles ["s12", "s13"]
extract: swissmedicinfo type pi name Diclo-Acino Comprimés pelliculés/Capsules rectales styles ["s10", "s11"]
extract: swissmedicinfo type fi name Finasterid Streuli® 5 styles ["s7", "s10"]
extract: swissmedicinfo type fi name Finastéride Streuli® 5 styles ["s7", "s23"]
extract: swissmedicinfo type fi name Bisoprolol Axapharm styles ["s7", "s9", "s12"]
extract: swissmedicinfo type fi name Bisoprolol Axapharm styles ["s6", "s10"]
extract: swissmedicinfo type fi name Xalos-Duo styles ["s6", "s7", "s11"]
extract: swissmedicinfo type fi name Xalos-Duo styles ["s5", "s6"]
extract: swissmedicinfo type fi name Olanpax® Filmtabletten/Schmelztabletten styles ["s7", "s10"]
extract: swissmedicinfo type fi name Olanpax® Comprimés pelliculés/Comprimés orodispersibles styles ["s8"]
extract: swissmedicinfo type fi name Diclo-Acino retard/rektale Kapseln, Film-/Retardtabletten styles ["s7"]