20150420-oddb2xml-with-parslet

Summary

  • Switch parsing composition for --calc in oddb2xml to use parslet

Commits

Index

Keep in Mind for work to do
  • Fix dojo error http://www.sitepen.com/blog/2012/10/31/debugging-dojo-common-error-messages/#forgot-dom-ready
  • On May-27 I removed the tests for fix_registrations, fix_sequences, fix_compositions and fix_packages from test/test_plugin/swissmedic.rb, as I could not find any references to them in the source code. Did I erroneously remove stuff when cleaning up the swissmedic import earlier?
  • The whole test for older/newer Packages must be adapted to xlsx. One must compare the rows (e.g. by creating csv files) and do the same stuff in xlsx!
  • Create gem: task: input = file with EAN codes, standard output shows EAN codes + ATC code. Source is Swissmedic Packungen.xlsx or XML.
  • Import via data/medreg_companies.yaml
  • Fix problem with radioactivatum 99m-technetio when parsing Wirkstoffe

Switch parsing composition for --calc in oddb2xml to use parslet

Stuff to do today includes:

  • Adapt parsing packages-XLSX to use the new library
  • Fix up/lowercase issues for substance names
  • Fix 20 failing unit tests
  • Fix the 380 lines that cannot be parsed
  • Fix various issues in code and spec tests marked TODO
  • Document design and decisions (2 or 3 pages in textile format, giving IKSNR/names for various examples)
    • 'ut' -> salts
    • How to handle entries like ratio
    • What is the meaning of DER?
    • Shall excipiens and friends be treated as normal substances, even though their "pro" quantity is used for various measures?
    • Join the various lines to form correct parts for a composition (using the indexes, Solvens, etc)
    • How to handle stuff like Praeparatio cryodesiccata
  • Selling units for 7680611860045 should be 5 and not 12500

Adapting lib/oddb2xml/calc.rb to use the new library was easy. Had to add some lines to catch errors when the parser could not recognise a line. There were 653 such lines.
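The error catching can be sketched roughly as follows. This is only an illustration and not the code in calc.rb: parse_composition is a hypothetical stand-in for the real parser (it accepts only simple "name quantity unit" items), and parse_all and the sample lines are made up for the example. The point is that a line which cannot be parsed is collected and counted instead of aborting the whole run.

```ruby
# Hypothetical stand-in for the real parser: accepts only comma-separated
# "name quantity unit" items and raises ArgumentError for anything else.
ITEM = /\A[a-z][a-z ]*? \d+(?:\.\d+)? (?:mg|g|ml)\z/

def parse_composition(line)
  line.split(/,\s*/).map do |item|
    raise ArgumentError, "cannot parse #{item.inspect}" unless item.match?(ITEM)
    item
  end
end

# Parse every line; collect the lines the parser rejects instead of aborting.
def parse_all(lines)
  parsed = []
  failed = []
  lines.each do |line|
    begin
      parsed << parse_composition(line)
    rescue ArgumentError
      failed << line
    end
  end
  [parsed, failed]
end

lines = [
  "acidum citricum 5 mg, natrii chloridum 9 mg",
  "color.: E 127, excipiens pro capsula"
]
parsed, failed = parse_all(lines)
puts "#{failed.size} of #{lines.size} lines could not be parsed"
```

Keeping the rejected lines in a separate list makes it easy to print them afterwards and to track how their number changes from commit to commit.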

Must fix adding the label is_active_agents for each substance. Added a few lines.

Running oddb2xml --calc --skip-download now takes about 430 seconds (over 7 minutes), compared with about 2 minutes before. This is probably the price we have to pay for the better parsing.

Running spec/calc_spec.rb signals 53 errors. Here I list the most important ones.

669: failed parsing ==>  haemagglutininum influenzae A (H1N1) (Virus-Stamm A/California/7/2009 (H1N1)-like: reassortant virus NYMC X-179A) 15 µg, haemagglutininum influenzae A (H3N2) (Virus-Stamm A/Texas/50/2012 (H3N2)-like: reassortant virus NYMC X-223A) 15 µg, haemagglutininum influenzae B (Virus-Stamm B/Massachusetts/2/2012-like: B/Massachusetts/2/2012) 15 µg, natrii chloridum, kalii chloridum, dinatrii phosphas dihydricus, kalii dihydrogenophosphas, residui: formaldehydum max. 100 µg, octoxinolum-9 max. 500 µg, ovalbuminum max. 0.05 µg, saccharum nihil, neomycinum nihil, aqua ad iniectabilia q.s. ad suspensionem pro 0.5 ml
669: failed parsing ==>  lactobacillus acidophilus cryodesiccatus min. 10^9 CFU, bifidobacterium infantis min. 10^9 CFU, color.: E 127, E 132, E 104, excipiens pro capsula
669: failed parsing ==>  haemagglutininum influenzae A (H1N1) (Virus-Stamm A/California/7/2009 (H1N1)-like: reassortant virus NYMC X-179A) 15 µg, haemagglutininum influenzae A (H3N2) (Virus-Stamm A/Texas/50/2012 (H3N2)-like: reassortant virus NYMC X-223A) 15 µg, haemagglutininum influenzae B (Virus-Stamm B/Massachusetts/2/2012-like: B/Massachusetts/2/2012) 15 µg, natrii chloridum, kalii chloridum, dinatrii phosphas dihydricus, kalii dihydrogenophosphas, residui: formaldehydum max. 100 µg, octoxinolum-9 max. 500 µg, ovalbuminum max. 0.05 µg, saccharum nihil, neomycinum nihil, aqua ad iniectabilia q.s. ad suspensionem pro 0.5 ml
669: failed parsing ==>  I) et II) et III) corresp.: aminoacida 48 g/l, carbohydrata 150 g/l, materia crassa 50 g/l, in emulsione recenter mixta 1250 ml
669: failed parsing ==>  lactobacillus acidophilus cryodesiccatus min. 10^9 CFU, bifidobacterium infantis min. 10^9 CFU, color.: E 127, E 132, E 104, excipiens pro capsula
669: failed parsing ==>  lactobacillus acidophilus cryodesiccatus min. 10^9 CFU, bifidobacterium infantis min. 10^9 CFU, color.: E 127, E 132, E 104, excipiens pro capsula
669: failed parsing ==>  I) et II) et III) corresp.: aminoacida 32 g/l, acetas 32 mmol/l, acidum citricum monohydricum, in emulsione recenter mixta 1250 ml

       expected: "Toxoidum Diphtheriae"
            got: "Toxoidum Diphtheriae 30 U.i., Toxoidum Tetani 40 U.i., Toxoidum Pertussis 25 µg Et Haemagglutininum 

       expected: "U.I/ml"
            got: "U."

       expected: "Viscum Album (mali) Recens"
            got: "Extractum Aquosum Liquidum Fermentatum 0.05 Mg Ex Viscum Album (mali)"

Pushed commit Use parslet for --calc. Still many failing unit tests, although the generated oddb_calc.xml file looks fine.

First try to fix all the errors where the excipiens contains something like "pro" followed by a dose, as this dose must then be used as the unit. Done with commit Fixed handling unit if pro ml is given in excipiens
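A minimal sketch of the idea, using a plain regular expression instead of the parslet grammar. The pattern PRO and the helper pro_dose are hypothetical names for this example, and defaulting the quantity to 1 when only a dosage form is given (e.g. "pro capsula") is my assumption, not something the real code is known to do.

```ruby
# Hypothetical pattern: "pro", an optional quantity, then a unit or dosage
# form at the end of the composition line.
PRO = /\bpro\s+(?:(?<qty>\d+(?:\.\d+)?)\s*)?(?<unit>[[:alpha:]]+)\s*\.?\z/

# Returns the quantity and unit given after "pro", or nil if there is none.
# Assumption: a missing quantity means 1 (e.g. "pro capsula" = per 1 capsule).
def pro_dose(composition)
  m = PRO.match(composition.strip)
  return nil unless m
  { qty: (m[:qty] || "1").to_f, unit: m[:unit] }
end

p pro_dose("aqua ad iniectabilia q.s. ad suspensionem pro 0.5 ml")
p pro_dose("excipiens pro capsula")
p pro_dose("natrii chloridum, kalii chloridum")
```

The two sample lines with "pro" are taken from the failing compositions listed above; the third has no "pro" part and yields nil.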

Pushed commit Many small fixes (label, qty, mineralia, residui). Still 567 lines fail parsing. 156 of them contain the ratio keyword.

The substance name for 7680656280013 is Vipera Aspis > 1000 Ld50 Mus and must be corrected to Vipera Aspis > 1000 Ld50 Mus

Also for 7680616310026 the description for the label A is wrong. To be fixed tomorrow.

Page last modified on April 20, 2015, at 08:13 PM