20150415-oddb2xml-with-parslet
Summary
- Switch parsing composition for --calc in oddb2xml to use parslet
Commits
Index
- Keep in Mind for work to do
- Fix dojo error http://www.sitepen.com/blog/2012/10/31/debugging-dojo-common-error-messages/#forgot-dom-ready
- On May 27 I removed the tests for fix_registrations, fix_sequences, fix_compositions and fix_packages from test/test_plugin/swissmedic.rb, as I could not find any references to them in the source code. Did I erroneously remove stuff when cleaning up the swissmedic import earlier?
- The whole test for older/newer Packages must be adapted to xlsx. One must compare the rows (e.g. by creating csv files) and do the same stuff in xlsx!
- Create gem: task: input = file with EAN codes; standard output shows EAN codes + ATC code. Source is Swissmedic Packungen.xlsx or XML.
- Import via data/medreg_companies.yaml
- Fix problem with radioactivatum 99m-technetio when parsing Wirkstoffe
- Pending work with parslet/oddb2xml
- Adapt parsing packages-XLSX to use the new library
- Fix up/lowercase issues for substance names
- Fix 20 failing unit tests
- Fix the 380 lines that cannot be parsed
- Fix various issues in code and spec tests marked TODO
- document design and decisions (2 or 3 pages in textile format, giving IKSNR/names for various examples)
- 'ut' -> salts
- how to handle entries like ratio
- what is the meaning of DER?
- shall excipiens and friends be a normal substance? But their quantity "pro" is used for various measures?
- Join the various lines to form correct parts for a composition (using the indexes, Solvens, etc)
- How to handle stuff like Praeparatio cryodesiccata
Switch parsing composition for --calc in oddb2xml to use parslet
Decided to add an error-correction pass that runs before the string is handed to the Parslet parser, so that obvious errors can be fixed first. This also lets us apply fixes in our daily operation before Swissmedic has corrected the source data. E.g. http://ch.oddb.org/de/gcc/drug/reg/62432/seq/01, where we find the string hepar sulfuris D6 2,2 mg hypericum perforatum D2 0,66 mg
which lacks a comma and should be hepar sulfuris D6 2,2 mg, hypericum perforatum D2 0,66 mg
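A minimal sketch of such a pre-pass in plain Ruby, assuming the corrections are kept as a list of pattern/replacement pairs. The names `CORRECTIONS` and `apply_corrections` are hypothetical and not taken from the oddb2xml source. Only the example string comes from the registration above.

```ruby
# Hypothetical correction table: each entry is [pattern, replacement].
# The single rule below only covers the example from IKSNR 62432.
CORRECTIONS = [
  # Insert the comma missing between the first quantity and the next substance.
  [/(hepar sulfuris D6 2,2 mg) (hypericum perforatum)/, '\1, \2'],
].freeze

# Apply every rule in order and return the corrected line.
# Lines that match no rule are returned unchanged.
def apply_corrections(line, rules = CORRECTIONS)
  rules.reduce(line) do |text, (pattern, replacement)|
    text.gsub(pattern, replacement)
  end
end

raw = 'hepar sulfuris D6 2,2 mg hypericum perforatum D2 0,66 mg'
puts apply_corrections(raw)
# prints: hepar sulfuris D6 2,2 mg, hypericum perforatum D2 0,66 mg
```

Keeping the rules in a table means a new Swissmedic typo needs only one extra entry, and the grammar itself stays untouched.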
Stuff I am planning to resolve today:
- Add a handler for Swissmedic errors. See commit Added HandleSwissmedicErrors
- Make the 16 failing unit tests pass again after yesterday's rework of the low-level details. See commit Reworked parsing of parenthesis in names
- "I) DTPa-IPV-Komponente (Suspension): toxoidum diphtheriae 30 U.I., toxoidum pertussis 25 µg et haemagglutininum filamentosum 25 µg"
which is a nice, but not too complicated, composition. Done with commit Fix handling several substances
- Handle substances having 'et', which are called chemical_substance in oddb.org. See commit Added support to recognize preaparatio. Fix 'et'
- Handle substances having 'ut'. This is a new feature and will be named 'salts', as one active agent may be transported using several different salts, e.g. calcii gluconas 100 mg et calcii lactas pentahydricus 25 mg et calcii hydrogenophosphas anhydricus 300 mg
- Fix problems with praeparations. See commit Added support to recognize preaparatio. Fix 'et'
- Add IKSNR 42847 Nutriflex special, Infusionslösung as a unit test
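To illustrate what splitting a composition on 'et' involves, here is a rough plain-Ruby sketch. It uses regular expressions as a stand-in for the Parslet grammar, so it is not how oddb2xml does it. `Substance`, `SUBSTANCE_RE` and `split_substances` are hypothetical names, and the sketch assumes each part has the shape "name quantity unit", without labels such as "I) DTPa-IPV-Komponente (Suspension):".

```ruby
# Hypothetical value object for one parsed substance.
Substance = Struct.new(:name, :qty, :unit)

# Assumed shape of one part: name, then a quantity (decimal comma or point
# allowed), then a unit token without spaces.
SUBSTANCE_RE = /\A(?<name>.+?)\s+(?<qty>\d+(?:[.,]\d+)?)\s+(?<unit>\S+)\z/

# Split on ", " and on " et ", then parse each part.
# A decimal comma such as "2,2" is not followed by whitespace,
# so it does not trigger a split.
def split_substances(composition)
  composition.split(/,\s+|\s+et\s+/).map do |part|
    m = SUBSTANCE_RE.match(part.strip)
    raise ArgumentError, "cannot parse: #{part}" unless m

    Substance.new(m[:name], m[:qty], m[:unit])
  end
end

line = 'calcii gluconas 100 mg et calcii lactas pentahydricus 25 mg ' \
       'et calcii hydrogenophosphas anhydricus 300 mg'
split_substances(line).each { |s| puts "#{s.name}: #{s.qty} #{s.unit}" }
```

A regular expression like this breaks down quickly on nested parentheses and on the 'ut' cases, which is the kind of input a proper grammar handles better.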
Parsing all composition texts now takes two minutes, but the number of lines I am unable to parse dropped below 1000. Output was: Parsed 8937 lines with 964 errors in 122 seconds
The composition of Nutriflex special, Infusionslösung I already handle more or less. Had to add the units 'kJ' and 'mmol'. Added a unit test for it. Line 3 still cannot be parsed.
More work done with the commits:
Running all unit tests takes about 5 minutes. Result is
# Testing whether 8937 composition lines can be parsed. Found 380 errors in 293 seconds
# 520 examples, 20 failures, 1 pending
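A summary like the one above can be produced by a small harness along these lines. This is a sketch under assumptions: `parse_all` is a hypothetical helper, and `parser` stands for any callable that raises when it cannot parse a line (a Parslet parser raises `Parslet::ParseFailed`, which is a `StandardError`). The demo parser here is a trivial stand-in, not the real grammar.

```ruby
require 'benchmark'

# Run the parser over every line, collecting failures instead of aborting.
# Returns the line count, the failing lines with their messages, and the
# elapsed wall-clock time in seconds.
def parse_all(lines, parser)
  errors = []
  seconds = Benchmark.realtime do
    lines.each do |line|
      begin
        parser.call(line)
      rescue StandardError => e
        errors << [line, e.message]
      end
    end
  end
  { lines: lines.size, errors: errors, seconds: seconds }
end

# Stand-in parser for the demo: rejects any line without a digit.
demo_parser = lambda do |line|
  raise ArgumentError, 'no quantity' unless line =~ /\d/

  line
end

result = parse_all(['calcii gluconas 100 mg', 'excipiens'], demo_parser)
puts format('Parsed %d lines with %d errors in %d seconds',
            result[:lines], result[:errors].size, result[:seconds])
```

Keeping the failing lines together with their messages makes it easier to work through the remaining errors in batches.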