suspend
lib/oddb/html/state/global.rb#grant_download
lib/oddb/html/view/download.rb#to_html
suspend
Tue Nov 16 14:40:39 2010: de.oddb.org Zubef (PDF)
Tue Nov 16 14:40:35 2010: de.oddb.org ODDB::Import::Gkv#import
NoMethodError
undefined method `[]' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:146:in `scan_object_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:137:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:47:in `object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:160:in `page_tree_root'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:142:in `build_page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:50:in `page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:40:in `extract_text'
/var/www/de.oddb.org/lib/oddb/import/gkv.rb:96:in `import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:117:in `call'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:117:in `_reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
/var/www/de.oddb.org/lib/oddb/import/gkv.rb:77:in `download_latest'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported 0 Zubef-Entries on 16.11.2010:
Visited 0 existing Zubef-Entries
Visited 0 existing Companies
Visited 0 existing Substances
Created 0 new Zubef-Entries
Created 0 new Products
Created 0 new Sequences
Created 0 new Companies
Created 0 new Substances
Assigned 0 Chemical Equivalences
Assigned 0 Companies
Created 0 Incomplete Packages:
Created 0 Product(s) without a name (missing product name):
Run de.oddb.org/bin/oddbd
Run jobs/import_gkv
Result
Tue Nov 23 08:19:43 2010: de.oddb.org ODDB::Import::Gkv#import
NoMethodError
undefined method `[]' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:148:in `scan_object_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:137:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:47:in `object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:163:in `page_tree_root'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:142:in `build_page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:50:in `page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:40:in `extract_text'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:96:in `import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:117:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:117:in `_reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:77:in `download_latest'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported 0 Zubef-Entries on 23.11.2010:
Visited 0 existing Zubef-Entries
Visited 0 existing Companies
Visited 0 existing Substances
Created 0 new Zubef-Entries
Created 0 new Products
Created 0 new Sequences
Created 0 new Companies
Created 0 new Substances
Assigned 0 Chemical Equivalences
Assigned 0 Companies
Created 0 Incomplete Packages:
Created 0 Product(s) without a name (missing product name):
Note
Experiment with the last pdf
masa@masa ~/ywesee/de.oddb.org $ jobs/import_gkv pdf=/home/masa/work/Zuzahlungsbefreit_sort_Name_101101_14842.pdf
Result
Tue Nov 23 08:31:54 2010: de.oddb.org ODDB::Import::Gkv#import
Imported 6521 Zubef-Entries on 23.11.2010:
Visited 6521 existing Zubef-Entries
Visited 6521 existing Companies
Visited 1030 existing Substances
Created 0 new Zubef-Entries
Created 0 new Products
Created 0 new Sequences
Created 0 new Companies
Created 0 new Substances
Assigned 0 Chemical Equivalences
Assigned 0 Companies
Created 0 Incomplete Packages:
Created 1 Product(s) without a name (missing product name):
http://de.oddb.org/de/drugs/product/uid/3480899
Notes
Experiment
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb
def scan_object_stream src, catalogue
  p src
Run jobs/import_gkv
Result
'incorrect header check' when filtering with /FlateDecode
"w.\267\022̭Q�B P\021ˊIz\274V<�|�u�w\t\203\230\206\376f\177kO\031k4@�K\242F:��\225�rd\212H!\251c\200\3778j{Co\006Ap\022.�<\247���R�\r\n"
Note
Experiment
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb
def build_object_catalogue
  p "getin build_object_catalogue"
  startobj=0
  endobj=0
  catalogue = {}
  @src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
    obj = build_object(match.to_s)
    catalogue.store(obj.oid, obj)
  end
  print "catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length="
  p catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length
  exit
Run jobs/import_gkv
Result
"getin build_object_catalogue"
catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length=621
Run jobs/import_gkv pdf=/home/masa/work/Zuzahlungsbefreit_sort_Name_101101_14842.pdf
"getin build_object_catalogue"
catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length=0
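The two counts above (621 object streams in the failing PDF, 0 in the PDF that imports cleanly) can also be cross-checked without patching parser.rb. The following is only a rough sketch, assuming a modern Ruby rather than the 1.8 used in this log; `count_object_streams` is a hypothetical helper, not part of rpdf2txt, and it simply counts `/Type /ObjStm` markers in the raw bytes.

```ruby
# Hypothetical helper (not part of rpdf2txt): count the dictionaries that
# declare themselves as object streams in the raw bytes of a PDF.
def count_object_streams(data)
  # Work on raw bytes so compressed stream data cannot raise encoding errors.
  bytes = data.dup.force_encoding(Encoding::BINARY)
  bytes.scan(%r{/Type\s*/ObjStm\b}n).length
end

# Tiny hand-written sample, not a real PDF.
sample = "1 0 obj <</Type/ObjStm/N 1/First 7>> endobj\n" \
         "2 0 obj <</Type /Catalog>> endobj\n"
puts count_object_streams(sample)
```

For a real file this would be called as `count_object_streams(File.binread(path))`. The number may differ slightly from what the parser reports, because the parser only counts objects matched by its own `obj ... endobj` scan.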
Notes
Check PDF version
masa@masa ~/work $ file pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf
pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
masa@masa ~/work $ file pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf
pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6
Note
But
masa@masa ~/work $ file pdf/*
pdf/2009.09.07-Zuzahlungsbefreit_sort_Name_090901_8703.pdf: PDF document, version 1.4
pdf/2009.09.08-Zuzahlungsbefreit_sort_Name_090901_8703.pdf: PDF document, version 1.4
pdf/2009.09.16-Zuzahlungsbefreit_sort_Name_090915_8987.pdf: PDF document, version 1.4
pdf/2009.10.03-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.04-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.05-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.06-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.07-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.08-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.09-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.10.10-Zuzahlungsbefreit_sort_Name_091001_9270.pdf: PDF document, version 1.4
pdf/2009.11.10-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.11-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.12-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.13-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.14-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.15-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.16-Zuzahlungsbefreit_sort_Name_091101_9883.pdf: PDF document, version 1.6
pdf/2009.11.17-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.18-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.19-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.20-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.21-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.22-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.23-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.24-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.25-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.26-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.12.02-Zuzahlungsbefreit_sort_Name_091201_10326.pdf: PDF document, version 1.4
pdf/2009.12.16-Zuzahlungsbefreit_sort_Name_091215_10697.pdf: PDF document, version 1.4
pdf/2010.01.15-Zuzahlungsbefreit_sort_Name_100101_10925.pdf: PDF document, version 1.4
pdf/2010.01.19-Zuzahlungsbefreit_sort_Name_100115_11232.pdf: PDF document, version 1.4
pdf/2010.02.20-Zuzahlungsbefreit_sort_Name_100215_11862.pdf: PDF document, version 1.4
pdf/2010.03.01-Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/2010.03.02-Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/2010.03.16-Zuzahlungsbefreit_sort_Name_100315_12454.pdf: PDF document, version 1.4
pdf/2010.04.02-Zuzahlungsbefreit_sort_Name_100401_12872.pdf: PDF document, version 1.4
pdf/2010.04.16-Zuzahlungsbefreit_sort_Name_100415_13077.pdf: PDF document, version 1.4
pdf/2010.09.09-Zuzahlungsbefreit_sort_Name_100901_14383.pdf: PDF document, version 1.4
pdf/2010.10.05-Zuzahlungsbefreit_sort_Name_101001_14562.pdf: PDF document, version 1.4
pdf/2010.10.18-Zuzahlungsbefreit_sort_Name_101015_14671.pdf: PDF document, version 1.4
pdf/2010.11.12-Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
pdf/2010.11.16-Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6
pdf/Zuzahlungsbefreit_sort_Name_090901_8703.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_090915_8987.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091201_10326.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091215_10697.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100101_10925.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100115_11232.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100215_11862.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100315_12454.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100401_12872.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100415_13077.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100901_14383.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101001_14562.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101015_14671.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6
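The version check done here with `file` could also be done from Ruby, for example to log the version before an import starts. This is a minimal sketch, assuming a modern Ruby; `pdf_version` is a hypothetical helper that only reads the `%PDF-x.y` header comment and does not look at anything else in the file.

```ruby
# Hypothetical helper: read the version from the "%PDF-x.y" header comment.
# Returns the version as a String (e.g. "1.6"), or nil if no header is found.
def pdf_version(data)
  # The header should be at the very start; look only at the first 1024 bytes.
  head = data.byteslice(0, 1024).to_s.dup.force_encoding(Encoding::BINARY)
  head[/%PDF-(\d+\.\d+)/n, 1]
end

puts pdf_version("%PDF-1.6\n1 0 obj << >> endobj\n")
puts pdf_version("%PDF-1.4\n").inspect
puts pdf_version("not a pdf").inspect
```

With a real file: `pdf_version(File.binread("pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf"))`. A check like this only reports the version; it does not by itself tell whether the file uses compressed object streams.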
Notes
Experiment
Run jobs/import_gkv with the other pdf in version 1.6
masa@masa ~/ywesee/de.oddb.org $ jobs/import_gkv pdf=/home/masa/work/pdf/2009.11.10-Zuzahlungsbefreit_sort_Name_091101_9883.pdf
Result
"getin build_object_catalogue" 'incorrect header check' when filtering with /FlateDecode
Conclusion
References
Check test
masa@masa ~/ywesee/rpdf2txt $ ruby test/suite.rb
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.ruby: symbol lookup error: /usr/lib64/ruby/gems/1.8/gems/rmagick-2.9.0/lib/RMagick2.so: undefined symbol: DestroyConstitute
Note
Check tests one by one
masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_object.rb
Loaded suite test/test_pdf_object
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.ruby: symbol lookup error: /usr/lib64/ruby/gems/1.8/gems/rmagick-2.9.0/lib/RMagick2.so: undefined symbol: DestroyConstitute
masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_parser.rb
Loaded suite test/test_pdf_parser
Started
.."getin build_object_catalogue"
."getin build_object_catalogue"
F........F."getin build_object_catalogue"
."getin build_object_catalogue"
...E"getin build_object_catalogue"
."getin build_object_catalogue"
..
Finished in 3.691601 seconds.
1) Failure:
test_encrypt(TestParser) [test/test_pdf_parser.rb:1319]:
<395> expected but was
<nil>.
2) Failure:
test_join_snippets__hex_chars(TestParser) [test/test_pdf_parser.rb:316]:
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\344t f\374r a1-, a2- und b-Adrenozeptoren sowie f\374r\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n"> expected but was
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\344t f\374r a1-, a2- und b-Adrenozeptoren sowie f\374r\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n">.
3) Error:
test_trailer_dictionary(TestParser):
NoMethodError: undefined method `values' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:54:in `build_trailer_dictionary'
test/test_pdf_parser.rb:1292:in `test_trailer_dictionary'
22 tests, 59 assertions, 2 failures, 1 errors
masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_text.rb
Loaded suite test/test_pdf_text
Started
................................
Finished in 1.109014 seconds.
32 tests, 41 assertions, 0 failures, 0 errors
masa@masa ~/ywesee/rpdf2txt $ ruby test/test_space_bug_05_2004.rb
Loaded suite test/test_space_bug_05_2004
Started
.
Finished in 0.370815 seconds.
1 tests, 1 assertions, 0 failures, 0 errors
masa@masa ~/ywesee/rpdf2txt $ ruby test/test_stream.rb
Loaded suite test/test_stream
Started
..........
Finished in 0.735466 seconds.
10 tests, 14 assertions, 0 failures, 0 errors
masa@masa ~/ywesee/rpdf2txt $ ruby test/test_text_state.rb
Loaded suite test/test_text_state
Started
..............
Finished in 1.611227 seconds.
14 tests, 75 assertions, 0 failures, 0 errors
Notes
Pass
Pass
Pass
Pass
Important parts
handler = Rpdf2txt::SimpleHandler.new(STDOUT)
parser = Rpdf2txt::Parser.new(File.read(ARGV[0]), 'utf8')
parser.extract_text(handler)
Notes
1. lib/rpdf2txt/parser.rb#extract_text
def extract_text(callback_handler = SimpleHandler.new)
  page_tree.each { |node|
    node.text(callback_handler)
    callback_handler.send_page
  }
  callback_handler.send_eof
end
2. page_tree
def page_tree
  @page_tree ||= build_page_tree()
end
Note
def build_page_tree
  page_tree_root.build_tree(object_catalogue)
end
def page_tree_root
  object_catalogue[trailer_dictionary.root_id]
end
def object_catalogue
  @object_catalogue ||= build_object_catalogue()
end
def build_object_catalogue
  startobj=0
  endobj=0
  catalogue = {}
  @src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
    obj = build_object(match.to_s)
    catalogue.store(obj.oid, obj)
  end
  catalogue.values.select do |obj|
    obj.is_a?(ObjStream)
  end.each do |obj|
    scan_object_stream obj.decoded_stream, catalogue
  end
  catalogue
end
Notes
Read line by line
catalogue = {}
@src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
  obj = build_object(match.to_s)
  catalogue.store(obj.oid, obj)
end
22 0 obj << /Creator (BBEdit) /Producer (Mac OS X 10.2.2 Quartz PDFContext) /CreationDate (D:20021114130430Z00'00') /ModDate (D:20021114130430Z00'00') >> endobj
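To see what the scan above yields for an object like the one just shown, the regular expression can be tried on its own. This is a small sketch, assuming a modern Ruby; `scan_objects` and `OBJECT_PATTERN` are hypothetical names, and the object id is taken from the leading number instead of going through `build_object`.

```ruby
# Same pattern as in build_object_catalogue: two integers, "obj",
# then everything up to the next "endobj" (non-greedy, across newlines).
OBJECT_PATTERN = /(?:\d+ ){2}obj\b.*?\bendobj\b/mn

# Hypothetical helper: return [object_number, raw_source] pairs.
def scan_objects(src)
  bytes = src.dup.force_encoding(Encoding::BINARY)
  bytes.scan(OBJECT_PATTERN).map do |raw|
    [raw[/\A(\d+) /n, 1].to_i, raw]
  end
end

# Hand-written sample with two uncompressed objects.
sample = "22 0 obj << /Creator (BBEdit) /Producer (Mac OS X 10.2.2 Quartz PDFContext) >> endobj\n" \
         "23 0 obj [ 1 2 3 ] endobj\n"
scan_objects(sample).each { |oid, raw| puts "#{oid}: #{raw.bytesize} bytes" }
```

The scan only finds objects that are written in the clear as `N G obj ... endobj`. Objects packed inside a compressed `/Type/ObjStm` stream are not visible to it, which is why `build_object_catalogue` has the second pass through `scan_object_stream`.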
Check PDF version 1.6
masa@masa ~/work $ cat Zuzahlungsbefreit_sort_Name_101115_14972.pdf |more
%PDF-1.6
<</Filter/FlateDecode/First 7/Length 272/N 1/Type/ObjStm>>stream
@�y�zk#�������y����8�>x��Shv�����F��E�ud��R�G�������k��"��������cL5r-����FF���%-�Jv~���I5XJ�N ����~a����:��N.�4=r��J�� ������N*��2�
|�at���qfm?��T����?��?�*��4�. j������Y�I�
<</Filter/FlateDecode/First 688/Length 2117/N 70/Type/ObjStm>>stream
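The `/N` and `/First` entries in these headers describe how an object stream is laid out once it has been inflated: the decoded data begins with `/N` pairs of integers (object number, byte offset), and the objects themselves start at byte `/First`, with each offset counted from there. The sketch below illustrates that layout as defined for PDF 1.5 and later; it assumes a modern Ruby, `split_object_stream` is a hypothetical helper, and it is not the `scan_object_stream` from rpdf2txt, whose source is not shown in this log.

```ruby
require 'zlib'

# Hypothetical helper: split an already inflated object stream into
# { object_number => object_source }, given its /N and /First values.
def split_object_stream(decoded, n, first)
  # Index section: N pairs of "object_number offset".
  numbers = decoded.byteslice(0, first).split.map { |s| Integer(s, 10) }
  pairs = numbers.each_slice(2).first(n)
  # Object section: offsets are relative to /First.
  body = decoded.byteslice(first, decoded.bytesize - first)
  pairs.each_with_index.map { |(oid, offset), idx|
    stop = idx + 1 < pairs.length ? pairs[idx + 1][1] : body.bytesize
    [oid, body.byteslice(offset, stop - offset).strip]
  }.to_h
end

# Build a small object stream by hand, compress it, then read it back.
objects = { 10 => "<</Type/Font>>", 11 => "[1 2 3]" }
body = String.new
index = []
objects.each do |oid, src|
  index << oid << body.bytesize
  body << src << " "
end
header = index.join(" ") + " "
compressed = Zlib::Deflate.deflate(header + body)

result = split_object_stream(Zlib::Inflate.inflate(compressed),
                             objects.size, header.bytesize)
result.each { |oid, src| puts "#{oid}: #{src}" }
```

If the inflate step fails, as it does in this log, there is no index section to read, which is consistent with `scan_object_stream` ending up with nil.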
Notes
7. build_object
def build_object(src)
  case src
  when /\/Type\s*\/Catalog\b/n
    CatalogNode.new(src, @target_encoding)
  when /\/Type\s*\/Pages\b/n
    PageNode.new(src, @target_encoding)
  when /\/Type\s*\/Page\b/n
    PageLeaf.new(src, @target_encoding)
  when /\/Type\s*\/Font\b/n
    Font.new(src, @target_encoding)
  when /\/Type\s*\/FontDescriptor\b/n
    FontDescriptor.new(src, @target_encoding)
  when /\/Type\s*\/Encoding\b/n
    Encoding.new(src, @target_encoding)
  when /\/Type\s*\/ObjStm\b/n
    ObjStream.new(src, @target_encoding)
  when /\/Type\s*\/XRef\b/n
    TrailerDictionary.new(src, @target_encoding)
  when %r!/Subtype\s*/Image!n
    Image.new(src, @target_encoding)
  when /\bstream\b/n, %r{/ToUnicode\b}n
    Stream.new(src, @target_encoding)
  when /\/Font\s*<</mn
    Resource.new(src, @target_encoding)
  when /^(?:\d+\s+){2}obj\s*\[\s*(?:(\d+\s+){2}R\s*)*\]\s+endobj/mn
    ReferenceArray.new(src, @target_encoding)
  when /^(?:\d+\s+){2}obj\s*\[\s*(?:(\d+\s*))*\]\s+endobj/mn
    PdfArray.new(src, @target_encoding)
  when /obj\s*<</mn
    PdfHash.new(src, @target_encoding)
  else
    Unknown.new(src, @target_encoding)
  end
end
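Because `build_object` is a `case` over regular expressions, the first matching branch wins, so the order of the branches matters. An object stream, for example, also contains the keyword `stream` and would be classified as a plain Stream if that branch came first. The reduced sketch below shows this with a few of the same patterns; `classify` is a hypothetical helper that returns symbols in place of the rpdf2txt classes, and the sample objects are written by hand.

```ruby
# Hypothetical reduced version of the dispatch: same patterns, but
# returning symbols instead of instantiating the rpdf2txt classes.
def classify(src)
  case src
  when /\/Type\s*\/Pages\b/n  then :page_node
  when /\/Type\s*\/Page\b/n   then :page_leaf
  when /\/Type\s*\/ObjStm\b/n then :obj_stream
  when /\bstream\b/n          then :stream
  when /obj\s*<</mn           then :pdf_hash
  else :unknown
  end
end

samples = [
  "3 0 obj <</Type /Pages /Kids [4 0 R]>> endobj",
  "4 0 obj <</Type /Page /Parent 3 0 R>> endobj",
  "5 0 obj <</Type/ObjStm/N 1/First 7>>stream\nxx\nendstream endobj",
  "6 0 obj <</Length 2>>stream\nxx\nendstream endobj",
  "7 0 obj << /Creator (BBEdit) >> endobj",
  "8 0 obj 42 endobj"
]
samples.each { |src| puts "#{classify(src)}: #{src[0, 30]}" }
```

The `\b` after `Page` is what keeps a `/Pages` node out of the `/Page` branch even if the two branches were swapped.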
Notes
Question
Check PDF parser (Ruby)
Reference
Experiment
def decode_raw_stream
  @decrypted_stream = raw_stream
  unless(@decoder.nil?)
    @decrypted_stream = @decoder.decrypt(self)
  end
  stream = @decrypted_stream
  [@attributes[:filter]].flatten.compact.each { |filter|
    begin
      stream = case filter
               when "/FlateDecode"
                 flate_decode stream
               when "/LZWDecode"
                 lzw_decode stream
               else
                 raise "Unimplemented filter: #{filter}"
               end
      p "done decode"
    rescue StandardError => err
      warn "'#{err.message}' when filtering with #{filter}"
    end
  }
  stream
  exit
end
def flate_decode(data)
  p "getin flate_decode"
  Zlib::Inflate.inflate(data)
end
Result
masa@masa ~/work/rpdf2txt $ ruby -I lib bin/rpdf2txt Zuzahlungsbefreit_sort_Name_101115_14972.pdf
Rpdf2txt::ObjStream
"getin flate_decode"
'incorrect header check' when filtering with /FlateDecode
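The message comes from zlib: `Zlib::Inflate.inflate` raises `Zlib::DataError` when the data handed to it does not start with a valid zlib header. The sketch below reproduces that kind of failure in isolation, assuming a modern Ruby; `try_inflate` is a hypothetical helper. Prepending a stray byte is only one way to provoke the error. The log does not establish which cause applies to the GKV PDF (wrong stream boundaries, data that is still encrypted, or something else).

```ruby
require 'zlib'

# Hypothetical helper: inflate and report failures instead of raising.
def try_inflate(data)
  [:ok, Zlib::Inflate.inflate(data)]
rescue Zlib::Error => err
  [:error, "#{err.class}: #{err.message}"]
end

payload = "10 0 <</Type/Font>>"
good = Zlib::Deflate.deflate(payload)

# The untouched data inflates back to the payload.
p try_inflate(good)

# One stray byte in front of the zlib header is enough to make it fail.
p try_inflate("\n" + good)
```

A useful next experiment would be to print the first few bytes of the data passed to `flate_decode` and compare them with the start of a stream from a PDF that imports without errors.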
Notes
Reference