
20110512-debug-update-function-de_oddb-debug-rpdf2txt



  1. Check import_gkv again
  2. Debug import_gkv
  3. A new command to replace the company name
  4. Debug rpdf2txt

Goal/Estimate/Evaluation
  • Debug import de.oddb / 70% / 50%
Milestones
  • Review gkv error
  • Debug error
Summary

:Commits


Check import_gkv again

Run

  • de.oddb/bin/oddbd
 masa@masa ~/ywesee/de.oddb.org $ ruby -I lib bin/oddbd
  • import_gkv
 masa@masa ~/ywesee/de.oddb.org $ ruby -I lib jobs/import_gkv 

Result

Fri May 13 07:39:46 2011: de.oddb.org ODDB::Import::Gkv#import
NoMethodError
undefined method `type=' for #<ODDB::Util::Money:0x7f508fc71460>
./lib/oddb/drugs/package.rb:68:in `_price_exfactory'
./lib/oddb/import/gkv.rb:265:in `import_package'
./lib/oddb/import/gkv.rb:111:in `import_row'
./lib/oddb/import/gkv.rb:458:in `process_page'
./lib/oddb/import/gkv.rb:457:in `each'
./lib/oddb/import/gkv.rb:457:in `process_page'
./lib/oddb/import/gkv.rb:40:in `call'
./lib/oddb/import/gkv.rb:40:in `send_page'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:43:in `extract_text'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:493:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:452:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:493:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:492:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:477:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:41:in `extract_text'
./lib/oddb/import/gkv.rb:100:in `import'
/usr/lib64/ruby/site_ruby/1.8/oddb/util/updater.rb:96:in `reported_import'
/usr/lib64/ruby/site_ruby/1.8/oddb/util/updater.rb:103:in `call'
/usr/lib64/ruby/site_ruby/1.8/oddb/util/updater.rb:103:in `_reported_import'
/usr/lib64/ruby/site_ruby/1.8/oddb/util/updater.rb:96:in `reported_import'
/usr/lib64/ruby/site_ruby/1.8/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
./lib/oddb/import/gkv.rb:77:in `download_latest'
/usr/lib64/ruby/site_ruby/1.8/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
./lib/oddb/util/job.rb:16:in `call'
./lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported     1 Zubef-Entries on 13.05.2011:
Visited      0 existing Zubef-Entries
Visited      0 existing Companies
Visited      0 existing Substances
Created      0 new Zubef-Entries
Created      0 new Products
Created      0 new Sequences
Created      0 new Companies
Created      0 new Substances
Assigned     0 Chemical Equivalences
Assigned     0 Companies
Created      0 Incomplete Packages:
Created      0 Product(s) without a name (missing product name):

Debug import_gkv

Experiment (lib/oddb/drugs/package.rb)

      def _price_exfactory(country='DE')
        if((price = price(:public, country)) \
           && (code = code(:prescription)) && code.value)
          c = ODDB.config
          efp = (price - c.pharmacy_premium) * 100 /
            (100.0 + c.vat + c.pharmacy_percentage)
p efp
p efp.class
          efp.type = :exfactory
          efp.country = country
          efp
        end
      end

Result

#<ODDB::Util::Money:0x7f7c23a94010 @credits=99008, @amount=990.081967213115, @valid_from=Fri May 13 07:48:53 +0200 2011>
ODDB::Util::Money

Note

  • efp is definitely an ODDB::Util::Money instance
  • The Money class source does define a 'type=' method, yet the call still raises NoMethodError
  • attr_reader :type is also defined
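The notes do not say why the setter is missing at run time. The backtrace mixes files from ./lib with files from /usr/lib64/ruby/site_ruby/1.8, so one possibility (an assumption, not confirmed here) is that a different copy of the Money class was loaded. A minimal sketch of how to check what the object in memory really supports, using a hypothetical stand-in class Demo::Money rather than the real ODDB::Util::Money:

```ruby
# Hypothetical stand-in for a copy of the class that only has the reader.
module Demo
  class Money
    attr_reader :type
  end
end

efp = Demo::Money.new
# respond_to? reports whether the setter exists on the object actually in memory.
has_setter_before = efp.respond_to?(:type=)

# Reopening the class (which is what loading a newer file would do) adds the writer.
module Demo
  class Money
    attr_writer :type
  end
end
has_setter_after = efp.respond_to?(:type=)

puts "setter before: #{has_setter_before}, after: #{has_setter_after}"
```

In a live session, $LOADED_FEATURES.grep(/money/) lists the files that were actually required, which shows whether the copy under ./lib or the installed one was picked up.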

Experiment (lib/oddb/drugs/package.rb)

require 'oddb/util/money'

Run again

  • de.oddb/bin/oddbd
  • import_gkv

Result

  • Email

Note

A new command to replace the company name

Test (output all the company names)

 masa@masa ~/ywesee/de.oddb.org $ bin/admin
 de.oddb> ODDB::Business::Company.all.length
 -> 20013
 de.oddb> ODDB::Business::Company.all[0].name
 -> Tbhmg. Id. Rahsiuqapnsa
 de.oddb> open('/home/masa/work/company_names.dat','w'){|f| count = 0; f.print ODDB::Business::Company.all.map{|c| c.name.to_s}.sort.map{|y| (count+=1).to_s + "\t" + y}.join("\n")}
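
The one-liner above is hard to read. A sketch of the same numbered, tab-separated dump as a small method, assuming only that each company responds to name. Company here is a hypothetical Struct standing in for ODDB::Business::Company, and dump_company_names is an illustrative name:

```ruby
require 'stringio'

# Hypothetical stand-in for ODDB::Business::Company; only #name is used.
Company = Struct.new(:name)

# Writes "index<TAB>name" for every company, sorted by name, to the given IO.
def dump_company_names(companies, io)
  names = companies.map { |c| c.name.to_s }.sort
  names.each_with_index do |name, idx|
    io.puts "#{idx + 1}\t#{name}"
  end
end

out = StringIO.new
companies = [
  Company.new('Tbhmg. Id. Rahsiuqapnsa'),
  Company.new('1rphabhmgama'),
  Company.new(nil) # a missing name becomes an empty string, as with to_s above
]
dump_company_names(companies, out)
print out.string
```

Unlike the join("\n") version, puts also ends the last line with a newline. In bin/admin the IO would be a File opened for writing instead of a StringIO.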

Result

Test (rename a company: in practice, reassign the company property of a product instance)

 $ bin/admin
 de.oddb> ODDB::Business::Company.find_by_name('1rphabhmgama').products.length
 -> 0
 de.oddb> ODDB::Business::Company.find_by_name('1mapharbhmga').products.length
 -> 0
 de.oddb> ODDB::Business::Company.find_by_name('1hpaarbhmgma').products.length
 -> 1
 de.oddb> ODDB::Business::Company.find_by_name('1hpaarbhmgma').name
 -> 1hpaarbhmgma
 de.oddb> ODDB::Business::Company.find_by_name('1hpaarbhmgma').odba_id
 -> 4128540
 de.oddb> ODDB::Business::Company.find_by_name('1hpaarbhmgma').products[0].name
 -> Ramipril 1a
 de.oddb> ODDB::Business::Company.find_by_name('1hpaarbhmgma').products[0].odba_id
 -> 47143
 de.oddb> ODDB::Business::Company.find_by_name('1hpaarbhmgma').products[0].company = ODDB::Business::Company.find_by_name('1rphabhmgama') 
 -> 1rphabhmgama
 de.oddb> ODDB::Business::Company.find_by_name('1hpaarbhmgma').products.length
 -> 0
 de.oddb> ODDB::Business::Company.find_by_name('1rphabhmgama').products.length
 -> 1
 de.oddb> ODDB::Business::Company.find_by_name('1rphabhmgama').products[0].name
 -> Ramipril 1a
 de.oddb> ODDB::Business::Company.find_by_name('1rphabhmgama').products[0].odba_id
 -> 47143
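
A sketch of what the planned command could look like, based only on the behaviour seen in the session above: assigning product.company moves the product from one company's products list to the other. Company, Product and reassign_products are hypothetical stand-ins, not the de.oddb.org classes, and persistence is left out entirely.

```ruby
# Hypothetical stand-in for ODDB::Business::Company.
class Company
  attr_reader :name, :products
  def initialize(name)
    @name = name
    @products = []
  end
end

# Hypothetical stand-in for the product class.
class Product
  attr_reader :name, :company
  def initialize(name, company)
    @name = name
    self.company = company
  end

  # Keeps both sides in sync, mirroring what the session output suggests.
  def company=(new_company)
    @company.products.delete(self) if @company
    @company = new_company
    new_company.products << self if new_company
  end
end

# Moves every product from one company to another and returns the new count.
def reassign_products(from, to)
  # Iterate over a copy, because each assignment shrinks from.products.
  from.products.dup.each { |product| product.company = to }
  to.products.length
end

old_co = Company.new('1hpaarbhmgma')
new_co = Company.new('1rphabhmgama')
Product.new('Ramipril 1a', old_co)
moved = reassign_products(old_co, new_co)
puts "#{moved} product(s) now belong to #{new_co.name}"
```

Iterating over a copy matters: walking from.products directly while it is being emptied would skip entries.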

Debug rpdf2txt

Experiment (lib/rpdf2txt/parser.rb#extract_text)

    def extract_text(callback_handler = SimpleHandler.new)
      page_tree.each_with_index { |node, i|
        node.text(callback_handler)
exit
        callback_handler.send_page
      }
      callback_handler.send_eof
    end

Run

 masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef.pdf

Result

...
ACC200                                                                               3867219            EAAGXHL                                                                             
ycsyteteinAlc                                   200      mg                     50                St                 
Brausentteeatbl                                                                                             12,72

Note

  • This is the first data line
  • The correct 'Darreichungsform' (dosage form) is 'Brausetabletten', but rpdf2txt outputs 'Brausentteeatbl'
  • Something goes wrong in this part of the extraction
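
A quick check on the two strings quoted above suggests that no characters are lost or replaced; they only come out in the wrong order. This is a small sketch, and same_letters? is an illustrative helper name:

```ruby
# True when both strings contain exactly the same characters, in any order.
def same_letters?(a, b)
  a.chars.sort == b.chars.sort
end

expected = 'Brausetabletten'
actual   = 'Brausentteeatbl'

reordered_only = same_letters?(expected, actual)
# Index of the first position where the two strings differ.
first_diff = (0...expected.length).find { |i| expected[i] != actual[i] }

puts "same letters: #{reordered_only}, first difference at index #{first_diff}"
```

If the characters are merely reordered, the problem is more likely in how text fragments are positioned and joined than in how the glyphs are decoded.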

Check old pdf files

 masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef20100909.pdf   # OK
 masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef20101216.pdf   # OK
 masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef20110114.pdf   # FAIL

Note

  • The PDF format seems to have changed at the start of 2011
  • rpdf2txt can no longer read the data correctly
  • The PDF version of zubef20110114.pdf is 1.4
 masa@masa ~/ywesee/rpdf2txt $ head zubef20110114.pdf 
 %PDF-1.4
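
To compare the header version of several files without running head on each one, the first line can be read from Ruby. A minimal sketch; pdf_version is an illustrative name, and a StringIO stands in for a real file so the example runs on its own:

```ruby
require 'stringio'

# Returns the version string from a "%PDF-x.y" header line, or nil if absent.
def pdf_version(io)
  header = io.gets.to_s
  match = header.match(/\A%PDF-(\d+\.\d+)/)
  match && match[1]
end

version = pdf_version(StringIO.new("%PDF-1.4\nrest of file\n"))
puts version
```

For a real file, the same method can be called as File.open('zubef20110114.pdf', 'rb') { |f| pdf_version(f) }.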

Experiment (lib/rpdf2txt/object.rb#text)

    def text(callback_handler)
p "getin text"
      concat_stream = Stream.new('')
      if(@contents.size == 1 && @contents.first.is_a?(ReferenceArray))
        @contents.first.build_stream(concat_stream)
      else
        @contents.each { |stream|
          concat_stream.append(stream.decoded_stream)
        }
      end
      @text_state.media_box = self.media_box
      text_snippets = concat_stream.extract_text_objects(self, @text_state)
p text_snippets.class
p text_snippets.length
p text_snippets.map{|x| x.class}.uniq.join("\n")
p text_snippets[0].txt
p text_snippets[1].txt
p text_snippets[2].txt
exit
      join_snippets(text_snippets, callback_handler)
    end

Run with an old pdf

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef20101216.pdf 
"getin text"
Array
265
"Rpdf2txt::TextState"
"Zuzahlungsbefreite Arzneimittel nach \302\247 31 Abs. 3 Satz 4 SGB V"
"PZN"
"Arzneimit"

Run with the latest pdf

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_latest.pdf 
"getin text"
Array
698
"Rpdf2txt::TextState"
"Zu"
"z"
"a"

Note

  • The first TextState instance holds only the text 'Zu'; the line appears to be split into many small fragments, one per TextState instance
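
If a line really arrives as many small fragments, one way to rebuild it is to group the fragments by vertical position and sort each group from left to right before joining. The sketch below only illustrates that idea. Fragment is a hypothetical Struct, the coordinates are made up, and the real Rpdf2txt::TextState attributes and join_snippets logic are not shown in these notes.

```ruby
# Hypothetical fragment type with a position and a piece of text.
Fragment = Struct.new(:x, :y, :txt)

# Groups fragments into lines by y, orders each line by x, and joins the text.
def join_fragments(fragments)
  lines = fragments.group_by { |f| f.y }
  # In default PDF user space y grows upward, so larger y means higher on the page.
  lines.keys.sort.reverse.map do |y|
    lines[y].sort_by { |f| f.x }.map { |f| f.txt }.join
  end
end

# Made-up positions; the texts are the first three snippets from the latest PDF.
frags = [
  Fragment.new(20, 700, 'z'),
  Fragment.new(10, 700, 'Zu'),
  Fragment.new(25, 700, 'a')
]
joined = join_fragments(frags)
puts joined
```

Exact equality on y is a simplification; real fragments on one line may need a tolerance.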

Tracing the parsing part

lib/rpdf2txt/parser.rb

  1. extract_text
  2. page_tree
  3. build_page_tree
  4. page_tree_root -> NodeCatalogue#build_tree
  5. object_catalogue -> trailer_dictionary, rebuild_object_catalogue
  6. build_object_catalogue
  7. build_object
    • iterates over all the PDF data (@src)
    • creates objects and saves them into '@catalogue'

Experiment

    def build_page_tree
      page_tree_root.build_tree(object_catalogue)
print object_catalogue.values.map{|x| x.class.to_s}.sort.uniq.join("\n")
exit

Result

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef20101216.pdf 
Rpdf2txt::CatalogNode
Rpdf2txt::Font
Rpdf2txt::FontDescriptor
Rpdf2txt::ObjStream
Rpdf2txt::PageLeaf
Rpdf2txt::PageNode
Rpdf2txt::PdfHash
Rpdf2txt::Resource
Rpdf2txt::Stream
Rpdf2txt::TrailerDictionary

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_latest.pdf 
Rpdf2txt::CatalogNode
Rpdf2txt::Font
Rpdf2txt::FontDescriptor
Rpdf2txt::PageLeaf
Rpdf2txt::PageNode
Rpdf2txt::PdfHash
Rpdf2txt::Stream
Rpdf2txt::Unknown
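
The difference between the two class lists is easier to see when computed. A small sketch using the class names printed above:

```ruby
# Class names printed for zubef20101216.pdf (parses correctly).
old_classes = %w[
  Rpdf2txt::CatalogNode Rpdf2txt::Font Rpdf2txt::FontDescriptor
  Rpdf2txt::ObjStream Rpdf2txt::PageLeaf Rpdf2txt::PageNode
  Rpdf2txt::PdfHash Rpdf2txt::Resource Rpdf2txt::Stream
  Rpdf2txt::TrailerDictionary
]

# Class names printed for zubef_latest.pdf (parses incorrectly).
latest_classes = %w[
  Rpdf2txt::CatalogNode Rpdf2txt::Font Rpdf2txt::FontDescriptor
  Rpdf2txt::PageLeaf Rpdf2txt::PageNode Rpdf2txt::PdfHash
  Rpdf2txt::Stream Rpdf2txt::Unknown
]

only_old    = old_classes - latest_classes
only_latest = latest_classes - old_classes

puts "only in old PDF:    #{only_old.join(', ')}"
puts "only in latest PDF: #{only_latest.join(', ')}"
```

The latest PDF yields no ObjStream, Resource or TrailerDictionary objects and instead produces Unknown objects, which makes those Unknown objects a natural next thing to inspect.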

Page last modified on May 13, 2011, at 04:58 PM