suspend
lib/oddb/html/state/global.rb#grant_download
lib/oddb/html/view/download.rb#to_html
suspend
Tue Nov 16 14:40:39 2010: de.oddb.org Zubef (PDF)
Tue Nov 16 14:40:35 2010: de.oddb.org ODDB::Import::Gkv#import
NoMethodError
undefined method `[]' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:146:in `scan_object_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:137:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:47:in `object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:160:in `page_tree_root'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:142:in `build_page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:50:in `page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:40:in `extract_text'
/var/www/de.oddb.org/lib/oddb/import/gkv.rb:96:in `import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:117:in `call'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:117:in `_reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
/var/www/de.oddb.org/lib/oddb/import/gkv.rb:77:in `download_latest'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported 0 Zubef-Entries on 16.11.2010:
Visited 0 existing Zubef-Entries
Visited 0 existing Companies
Visited 0 existing Substances
Created 0 new Zubef-Entries
Created 0 new Products
Created 0 new Sequences
Created 0 new Companies
Created 0 new Substances
Assigned 0 Chemical Equivalences
Assigned 0 Companies
Created 0 Incomplete Packages:
Created 0 Product(s) without a name (missing product name):
Run de.oddb.org/bin/oddbd
Run jobs/import_gkv
Result
Tue Nov 23 08:19:43 2010: de.oddb.org ODDB::Import::Gkv#import
NoMethodError
undefined method `[]' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:148:in `scan_object_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:137:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:47:in `object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:163:in `page_tree_root'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:142:in `build_page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:50:in `page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:40:in `extract_text'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:96:in `import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:117:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:117:in `_reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:77:in `download_latest'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported 0 Zubef-Entries on 23.11.2010:
Visited 0 existing Zubef-Entries
Visited 0 existing Companies
Visited 0 existing Substances
Created 0 new Zubef-Entries
Created 0 new Products
Created 0 new Sequences
Created 0 new Companies
Created 0 new Substances
Assigned 0 Chemical Equivalences
Assigned 0 Companies
Created 0 Incomplete Packages:
Created 0 Product(s) without a name (missing product name):
Note
Experiment with the last pdf
masa@masa ~/ywesee/de.oddb.org $ jobs/import_gkv pdf=/home/masa/work/Zuzahlungsbefreit_sort_Name_101101_14842.pdf
Result
Tue Nov 23 08:31:54 2010: de.oddb.org ODDB::Import::Gkv#import
Imported 6521 Zubef-Entries on 23.11.2010:
Visited 6521 existing Zubef-Entries
Visited 6521 existing Companies
Visited 1030 existing Substances
Created 0 new Zubef-Entries
Created 0 new Products
Created 0 new Sequences
Created 0 new Companies
Created 0 new Substances
Assigned 0 Chemical Equivalences
Assigned 0 Companies
Created 0 Incomplete Packages:
Created 1 Product(s) without a name (missing product name):
http://de.oddb.org/de/drugs/product/uid/3480899
Notes
Experiment
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb
def scan_object_stream src, catalogue
  p src
Run jobs/import_gkv
Result
'incorrect header check' when filtering with /FlateDecode
(followed by the dump of the undecoded stream printed by "p src": raw binary data, omitted here)
Note
Experiment
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb
def build_object_catalogue
  p "getin build_object_catalogue"
  startobj=0
  endobj=0
  catalogue = {}
  @src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
    obj = build_object(match.to_s)
    catalogue.store(obj.oid, obj)
  end
  print "catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length="
  p catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length
  exit
Run jobs/import_gkv
Result
"getin build_object_catalogue"
catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length=621
Run jobs/import_gkv pdf=/home/masa/work/Zuzahlungsbefreit_sort_Name_101101_14842.pdf
"getin build_object_catalogue"
catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length=0
Notes
Check PDF version
masa@masa ~/work $ file pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf
pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
masa@masa ~/work $ file pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf
pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6
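The same check can be scripted so that it runs inside the import job. The following is only a sketch: `pdf_version` is a hypothetical helper, not part of rpdf2txt, and it reads nothing but the `%PDF-x.y` header at the start of the file, which is the same information `file` reports above.

```ruby
# Hypothetical helper (not part of rpdf2txt): read the version number
# from the "%PDF-x.y" header at the very start of a file.
def pdf_version(path)
  header = File.open(path, "rb") { |io| io.read(16) }
  # Returns e.g. "1.6", or nil if the file does not start with a PDF header.
  header && header[/\A%PDF-(\d+\.\d+)/, 1]
end

# Usage: ruby pdf_version.rb pdf/*.pdf
ARGV.each do |path|
  puts "#{path}: PDF version #{pdf_version(path).inspect}"
end
```

Only the header is inspected here; a PDF may also declare a version in its document catalog, which this sketch ignores.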
Note
But
masa@masa ~/work $ file pdf/*
pdf/2009.09.07-Zuzahlungsbefreit_sort_Name_090901_8703.pdf: PDF document, version 1.4
pdf/2009.09.08-Zuzahlungsbefreit_sort_Name_090901_8703.pdf: PDF document, version 1.4
pdf/2009.09.16-Zuzahlungsbefreit_sort_Name_090915_8987.pdf: PDF document, version 1.4
pdf/2009.10.03-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.04-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.05-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.06-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.07-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.08-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.09-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.10-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.11.10-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.11-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.12-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.13-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.14-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.15-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.16-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.17-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.18-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.19-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.20-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.21-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.22-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.23-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.24-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.25-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.26-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.12.02-Zuzahlungsbefreit_sort_Name_091201_10326.pdf: PDF document, version 1.4
pdf/2009.12.16-Zuzahlungsbefreit_sort_Name_091215_10697.pdf: PDF document, version 1.4
pdf/2010.01.15-Zuzahlungsbefreit_sort_Name_100101_10925.pdf: PDF document, version 1.4
pdf/2010.01.19-Zuzahlungsbefreit_sort_Name_100115_11232.pdf: PDF document, version 1.4
pdf/2010.02.20-Zuzahlungsbefreit_sort_Name_100215_11862.pdf: PDF document, version 1.4
pdf/2010.03.01-Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/2010.03.02-Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/2010.03.16-Zuzahlungsbefreit_sort_Name_100315_12454.pdf: PDF document, version 1.4
pdf/2010.04.02-Zuzahlungsbefreit_sort_Name_100401_12872.pdf: PDF document, version 1.4
pdf/2010.04.16-Zuzahlungsbefreit_sort_Name_100415_13077.pdf: PDF document, version 1.4
pdf/2010.09.09-Zuzahlungsbefreit_sort_Name_100901_14383.pdf: PDF document, version 1.4
pdf/2010.10.05-Zuzahlungsbefreit_sort_Name_101001_14562.pdf: PDF document, version 1.4
pdf/2010.10.18-Zuzahlungsbefreit_sort_Name_101015_14671.pdf: PDF document, version 1.4
pdf/2010.11.12-Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
pdf/2010.11.16-Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6
pdf/Zuzahlungsbefreit_sort_Name_090901_8703.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_090915_8987.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091201_10326.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091215_10697.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100101_10925.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100115_11232.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100215_11862.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100315_12454.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100401_12872.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100415_13077.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100901_14383.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101001_14562.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101015_14671.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6
Notes
Experiment
Run jobs/import_gkv with another PDF that is also version 1.6
masa@masa ~/ywesee/de.oddb.org $ jobs/import_gkv pdf=/home/masa/work/pdf/2009.11.10-Zuzahlungsbefreit_sort_Name_091101_9883.pdf
Result
"getin build_object_catalogue"
'incorrect header check' when filtering with /FlateDecode
Conclusion
References
Check test
masa@masa ~/ywesee/rpdf2txt $ ruby test/suite.rb
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.ruby: symbol lookup error: /usr/lib64/ruby/gems/1.8/gems/rmagick-2.9.0/lib/RMagick2.so: undefined symbol: DestroyConstitute
Note
Check tests one by one
masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_object.rb
Loaded suite test/test_pdf_object
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.ruby: symbol lookup error: /usr/lib64/ruby/gems/1.8/gems/rmagick-2.9.0/lib/RMagick2.so: undefined symbol: DestroyConstitute

masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_parser.rb
Loaded suite test/test_pdf_parser
Started
.."getin build_object_catalogue"
."getin build_object_catalogue"
F........F."getin build_object_catalogue"
."getin build_object_catalogue"
...E"getin build_object_catalogue"
."getin build_object_catalogue"
..
Finished in 3.691601 seconds.

1) Failure:
test_encrypt(TestParser) [test/test_pdf_parser.rb:1319]:
<395> expected but was
<nil>.

2) Failure:
test_join_snippets__hex_chars(TestParser) [test/test_pdf_parser.rb:316]:
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\344t f\374r a1-, a2- und b-Adrenozeptoren sowie f\374r\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n"> expected but was
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\344t f\374r a1-, a2- und b-Adrenozeptoren sowie f\374r\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n">.

3) Error:
test_trailer_dictionary(TestParser):
NoMethodError: undefined method `values' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:54:in `build_trailer_dictionary'
test/test_pdf_parser.rb:1292:in `test_trailer_dictionary'

22 tests, 59 assertions, 2 failures, 1 errors

masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_text.rb
Loaded suite test/test_pdf_text
Started
................................
Finished in 1.109014 seconds.
32 tests, 41 assertions, 0 failures, 0 errors

masa@masa ~/ywesee/rpdf2txt $ ruby test/test_space_bug_05_2004.rb
Loaded suite test/test_space_bug_05_2004
Started
.
Finished in 0.370815 seconds.
1 tests, 1 assertions, 0 failures, 0 errors

masa@masa ~/ywesee/rpdf2txt $ ruby test/test_stream.rb
Loaded suite test/test_stream
Started
..........
Finished in 0.735466 seconds.
10 tests, 14 assertions, 0 failures, 0 errors

masa@masa ~/ywesee/rpdf2txt $ ruby test/test_text_state.rb
Loaded suite test/test_text_state
Started
..............
Finished in 1.611227 seconds.
14 tests, 75 assertions, 0 failures, 0 errors
Notes
Pass
Pass
Pass
Pass
Important parts
handler = Rpdf2txt::SimpleHandler.new(STDOUT)
parser = Rpdf2txt::Parser.new(File.read(ARGV[0]), 'utf8')
parser.extract_text(handler)
Notes
1. lib/rpdf2txt/parser.rb#extract_text
def extract_text(callback_handler = SimpleHandler.new)
  page_tree.each { |node|
    node.text(callback_handler)
    callback_handler.send_page
  }
  callback_handler.send_eof
end
2. page_tree
def page_tree
  @page_tree ||= build_page_tree()
end
Note
def build_page_tree
  page_tree_root.build_tree(object_catalogue)
end

def page_tree_root
  object_catalogue[trailer_dictionary.root_id]
end

def object_catalogue
  @object_catalogue ||= build_object_catalogue()
end

def build_object_catalogue
  startobj=0
  endobj=0
  catalogue = {}
  @src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
    obj = build_object(match.to_s)
    catalogue.store(obj.oid, obj)
  end
  catalogue.values.select do |obj|
    obj.is_a?(ObjStream)
  end.each do |obj|
    scan_object_stream obj.decoded_stream, catalogue
  end
  catalogue
end
Notes
Read line by line
catalogue = {}
@src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
  obj = build_object(match.to_s)
  catalogue.store(obj.oid, obj)
end
22 0 obj
<<
/Creator (BBEdit)
/Producer (Mac OS X 10.2.2 Quartz PDFContext)
/CreationDate (D:20021114130430Z00'00')
/ModDate (D:20021114130430Z00'00')
>>
endobj
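To see what the scan step collects, the regular expression can be tried on a small hand-written source. This is a sketch under assumptions: `scan_objects` is a hypothetical stand-in that stores the raw matched text under its object number, whereas the real parser stores the object returned by `build_object` under `obj.oid`. The `n` flag of the original pattern is left out because the sample is plain ASCII.

```ruby
# Same pattern as in build_object_catalogue, minus the "n" (binary) flag.
OBJECT_RE = /(?:\d+ ){2}obj\b.*?\bendobj\b/m

# Hypothetical stand-in for the catalogue loop: maps object number => raw source.
def scan_objects(src)
  catalogue = {}
  src.scan(OBJECT_RE) do |match|
    # Without capture groups, scan yields the whole matched string.
    oid = match[/\A(\d+) /, 1].to_i
    catalogue.store(oid, match)
  end
  catalogue
end

sample = "22 0 obj\n<< /Creator (BBEdit) >>\nendobj\n" \
         "23 0 obj\n[ 1 2 3 ]\nendobj\n"

scan_objects(sample).each do |oid, raw|
  puts "#{oid}: #{raw.inspect}"
end
```

Objects that live inside a compressed object stream are not visible to this pattern until the stream has been decoded, which is why `build_object_catalogue` makes a second pass over the `ObjStream` entries.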
Check PDF version 1.6
masa@masa ~/work $ cat Zuzahlungsbefreit_sort_Name_101115_14972.pdf |more
%PDF-1.6
<</Filter/FlateDecode/First 7/Length 272/N 1/Type/ObjStm>>stream
(compressed binary stream data, omitted)
<</Filter/FlateDecode/First 688/Length 2117/N 70/Type/ObjStm>>stream
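For reference, the /N and /First entries in these dictionaries describe the layout of the stream once it has been decoded: the data starts with N pairs of integers (object number, byte offset), and the objects themselves start at byte /First, with each offset counted from there. The sketch below illustrates that layout on a hand-made, already-decoded string. `split_object_stream` is a hypothetical helper and is not the `scan_object_stream` method of rpdf2txt.

```ruby
# Hypothetical helper: split the *decoded* body of an object stream into
# [object_number, object_source] pairs, using the /N and /First values
# from the stream dictionary.
def split_object_stream(decoded, n, first)
  # The header is N pairs of integers: object number, offset relative to /First.
  header = decoded[0, first].split.map { |tok| Integer(tok) }
  pairs = header.each_slice(2).first(n)
  body = decoded[first..-1]
  pairs.each_with_index.map do |(num, offset), idx|
    # Each object runs up to the offset of the next one (or the end of the body).
    stop = idx + 1 < pairs.size ? pairs[idx + 1][1] : body.size
    [num, body[offset...stop].strip]
  end
end

header  = "11 0 12 14 "                 # objects 11 and 12 at offsets 0 and 14
objects = "<</Type/Font>> [1 2 3]"
decoded = header + objects

split_object_stream(decoded, 2, header.size).each do |num, src|
  puts "#{num}: #{src}"
end
```

This only works on data that has already passed through the /FlateDecode filter, so it cannot run on the failing file until the decoding error is resolved.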
Notes
7. build_object
def build_object(src)
  case src
  when /\/Type\s*\/Catalog\b/n
    CatalogNode.new(src, @target_encoding)
  when /\/Type\s*\/Pages\b/n
    PageNode.new(src, @target_encoding)
  when /\/Type\s*\/Page\b/n
    PageLeaf.new(src, @target_encoding)
  when /\/Type\s*\/Font\b/n
    Font.new(src, @target_encoding)
  when /\/Type\s*\/FontDescriptor\b/n
    FontDescriptor.new(src, @target_encoding)
  when /\/Type\s*\/Encoding\b/n
    Encoding.new(src, @target_encoding)
  when /\/Type\s*\/ObjStm\b/n
    ObjStream.new(src, @target_encoding)
  when /\/Type\s*\/XRef\b/n
    TrailerDictionary.new(src, @target_encoding)
  when %r!/Subtype\s*/Image!n
    Image.new(src, @target_encoding)
  when /\bstream\b/n, %r{/ToUnicode\b}n
    Stream.new(src, @target_encoding)
  when /\/Font\s*<</mn
    Resource.new(src, @target_encoding)
  when /^(?:\d+\s+){2}obj\s*\[\s*(?:(\d+\s+){2}R\s*)*\]\s+endobj/mn
    ReferenceArray.new(src, @target_encoding)
  when /^(?:\d+\s+){2}obj\s*\[\s*(?:(\d+\s*))*\]\s+endobj/mn
    PdfArray.new(src, @target_encoding)
  when /obj\s*<</mn
    PdfHash.new(src, @target_encoding)
  else
    Unknown.new(src, @target_encoding)
  end
end
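The dispatch above can be tried without the rest of the parser by returning a symbol instead of constructing a class. This is a reduced sketch: `classify` is a hypothetical helper that keeps only some of the branches, in the same order, and drops the `n` flag because the samples are ASCII.

```ruby
# Hypothetical, reduced version of build_object: same regular expressions
# and the same branch order, but returning a symbol instead of an object.
def classify(src)
  case src
  when /\/Type\s*\/Catalog\b/ then :catalog
  when /\/Type\s*\/Pages\b/   then :pages
  when /\/Type\s*\/Page\b/    then :page
  when /\/Type\s*\/ObjStm\b/  then :obj_stream
  when /\/Type\s*\/XRef\b/    then :xref
  when /\bstream\b/           then :stream
  when /obj\s*<</m            then :hash
  else :unknown
  end
end

samples = [
  "1 0 obj\n<</Filter/FlateDecode/First 7/Length 272/N 1/Type/ObjStm>>stream\n...\nendstream\nendobj",
  "3 0 obj\n<< /Type /Pages /Kids [] >>\nendobj",
  "5 0 obj\n<< /Length 3 >>\nstream\nabc\nendstream\nendobj",
  "22 0 obj\n<< /Creator (BBEdit) >>\nendobj",
]
samples.each { |src| puts "#{classify(src)}: #{src[0, 30].inspect}" }
```

The order matters: an object stream also contains the word `stream`, so the `/Type /ObjStm` branch has to come before the generic stream branch, as it does in the original method.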
Notes
Question
Check PDF parser (Ruby)
Reference
Experiment
def decode_raw_stream
  @decrypted_stream = raw_stream
  unless(@decoder.nil?)
    @decrypted_stream = @decoder.decrypt(self)
  end
  stream = @decrypted_stream
  [@attributes[:filter]].flatten.compact.each { |filter|
    begin
      stream = case filter
               when "/FlateDecode"
                 flate_decode stream
               when "/LZWDecode"
                 lzw_decode stream
               else
                 raise "Unimplemented filter: #{filter}"
               end
      p "done decode"
    rescue StandardError => err
      warn "'#{err.message}' when filtering with #{filter}"
    end
  }
  stream
  exit
end

def flate_decode(data)
  p "getin flate_decode"
  Zlib::Inflate.inflate(data)
end
Result
masa@masa ~/work/rpdf2txt $ ruby -I lib bin/rpdf2txt Zuzahlungsbefreit_sort_Name_101115_14972.pdf
Rpdf2txt::ObjStream
"getin flate_decode"
'incorrect header check' when filtering with /FlateDecode
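The message comes from zlib itself, and it can be reproduced outside rpdf2txt. The sketch below shows one way to provoke an inflate error: the bytes handed to `Zlib::Inflate.inflate` do not begin with a valid zlib header, here because two stray bytes precede the compressed data. Whether this is what happens with the version 1.6 files is an assumption that still has to be checked; `try_inflate` is a hypothetical helper.

```ruby
require "zlib"

# Hypothetical helper: inflate and report the outcome instead of raising.
def try_inflate(data)
  [:ok, Zlib::Inflate.inflate(data)]
rescue Zlib::Error => err
  [:error, err.message]
end

compressed = Zlib::Deflate.deflate("BT /F1 12 Tf (Zubef) Tj ET")

# Clean zlib data inflates back to the original text.
p try_inflate(compressed)

# The same data with two stray bytes in front no longer starts
# with a zlib header, so inflate fails with a zlib error.
p try_inflate("\r\n" + compressed)
```

If the extracted `raw_stream` turns out to carry extra bytes before the zlib header, comparing its first two bytes with those of a stream from a working version 1.4 file would be a quick way to confirm it.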
Notes
Reference