
20101123-debug-import_gkv update-rpdf2txt


  1. Think about tests by using Selenium suspend
  2. Confirm the error of import_gkv
  3. Trace data
  4. Update rpdf2txt (Check testcases)
  5. Read code rpdf2txt
  6. Read PDF Reference version 1.6

Goal
  • Debug import_gkv / 60%
Milestones
  1. Confirm error
  2. Update rpdf2txt
    • PDF version reference
    • Test check 11:00
    • Read code
Summary
Commits
ToDo Tomorrow
Keep in Mind
  1. Testcases of lib/oddb/html/state/global.rb#grant_download, lib/oddb/html/view/download.rb#to_html
  2. Debug testcases in test/export/test_server.rb de.oddb.org
  3. A bug import_gkv Tue Nov 16 02:00:10 2010: de.oddb.org Zubef (PDF)
  4. Compression (refer to lib/oddb/export/server.rb), Test cases (grant_download, Logging, Reporting)
  5. Log Error: on production server, de.oddb.org/log/import_dimdi, import_pharmnet
  6. On Ice
  7. emerge --sync

Think about tests by using Selenium

lib/oddb/html/state/global.rb#grant_download

  • This method is called when a user accesses a specific URL produced by the bin/admin grant_download command
  • It may be possible to test only this method independently without Selenium

lib/oddb/html/view/download.rb#to_html

suspend

Confirm the error of import_gkv

Tue Nov 16 14:40:39 2010: de.oddb.org Zubef (PDF)

Tue Nov 16 14:40:35 2010: de.oddb.org ODDB::Import::Gkv#import
NoMethodError
undefined method `[]' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:146:in `scan_object_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:137:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:47:in `object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:160:in `page_tree_root'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:142:in `build_page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:50:in `page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:40:in `extract_text'
/var/www/de.oddb.org/lib/oddb/import/gkv.rb:96:in `import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:117:in `call'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:117:in `_reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
/var/www/de.oddb.org/lib/oddb/import/gkv.rb:77:in `download_latest'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported     0 Zubef-Entries on 16.11.2010:
Visited      0 existing Zubef-Entries
Visited      0 existing Companies
Visited      0 existing Substances
Created      0 new Zubef-Entries
Created      0 new Products
Created      0 new Sequences
Created      0 new Companies
Created      0 new Substances
Assigned     0 Chemical Equivalences
Assigned     0 Companies
Created      0 Incomplete Packages:
Created      0 Product(s) without a name (missing product name):

Run de.oddb.org/bin/oddbd

Run jobs/import_gkv

Result

Tue Nov 23 08:19:43 2010: de.oddb.org ODDB::Import::Gkv#import
NoMethodError
undefined method `[]' for nil:NilClass
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:148:in `scan_object_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:137:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:134:in `build_object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:47:in `object_catalogue'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:163:in `page_tree_root'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:142:in `build_page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:50:in `page_tree'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:40:in `extract_text'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:96:in `import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:117:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:117:in `_reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:110:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:77:in `download_latest'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported     0 Zubef-Entries on 23.11.2010:
Visited      0 existing Zubef-Entries
Visited      0 existing Companies
Visited      0 existing Substances
Created      0 new Zubef-Entries
Created      0 new Products
Created      0 new Sequences
Created      0 new Companies
Created      0 new Substances
Assigned     0 Chemical Equivalences
Assigned     0 Companies
Created      0 Incomplete Packages:
Created      0 Product(s) without a name (missing product name):

Note

  • Same error

Experiment with the last pdf

masa@masa ~/ywesee/de.oddb.org $ jobs/import_gkv pdf=/home/masa/work/Zuzahlungsbefreit_sort_Name_101101_14842.pdf 

Result

Tue Nov 23 08:31:54 2010: de.oddb.org ODDB::Import::Gkv#import
Imported  6521 Zubef-Entries on 23.11.2010:
Visited   6521 existing Zubef-Entries
Visited   6521 existing Companies
Visited   1030 existing Substances
Created      0 new Zubef-Entries
Created      0 new Products
Created      0 new Sequences
Created      0 new Companies
Created      0 new Substances
Assigned     0 Chemical Equivalences
Assigned     0 Companies
Created      0 Incomplete Packages:
Created      1 Product(s) without a name (missing product name):
http://de.oddb.org/de/drugs/product/uid/3480899

Notes

  • The previous PDF imports cleanly, so the error is definitely caused by the new data file

Trace data

Experiment

/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb

    def scan_object_stream src, catalogue
p src

Run jobs/import_gkv

Result

'incorrect header check' when filtering with /FlateDecode
"w.\267\022&#813;Q&#65533;B P\021&#714;Iz\274V<&#65533;|&#65533;u&#65533;w\t\203\230\206\376f\177kO\031k4@&#65533;K\242F:&#65533;&#65533;\225&#65533;rd\212H!\251c\200\3778j{Co\006Ap\022.&#65533;<\247&#65533;&#65533;&#65533;R&#65533;\r\n"

Note

  • With the previous PDF file, this is not output

Experiment

/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb

        def build_object_catalogue
p "getin build_object_catalogue"
            startobj=0
            endobj=0
      catalogue = {}
      @src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
        obj = build_object(match.to_s)
        catalogue.store(obj.oid, obj)
      end
print "catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length="
p catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length
exit

Run jobs/import_gkv

Result

"getin build_object_catalogue"
catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length=621

Run jobs/import_gkv pdf=/home/masa/work/Zuzahlungsbefreit_sort_Name_101101_14842.pdf

"getin build_object_catalogue"
catalogue.values.select{|obj| obj.is_a?(ObjStream)}.length=0

Notes

  • What does this difference mean?

Check PDF version

masa@masa ~/work $ file pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf 
pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
masa@masa ~/work $ file pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf 
pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6

Note

  • The previous one (101101) is version 1.4
  • The new one (101115) is version 1.6
  • This is probably the cause
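
The same check can be done from Ruby by reading the file header, which starts with "%PDF-" followed by the version. This is only a minimal sketch: the helper names pdf_version and pdf_version_of_file are made up for this note, and it is written for a current Ruby (String#b), not the 1.8 used in this log.

```ruby
# Return the version string from a PDF header such as "%PDF-1.6", or nil.
# The header is treated as raw bytes because binary data usually follows it.
def pdf_version(header_bytes)
  match = /\A%PDF-(\d+\.\d+)/.match(header_bytes.b)
  match && match[1]
end

# Hypothetical convenience helper: read only the first bytes of a file on disk.
def pdf_version_of_file(path)
  File.open(path, 'rb') { |io| pdf_version(io.read(16).to_s) }
end
```

Calling pdf_version_of_file on each file in the pdf directory should give the same versions as the file command.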

But

masa@masa ~/work $ file pdf/*
pdf/2009.09.07-Zuzahlungsbefreit_sort_Name_090901_8703.pdf:  PDF document, version 1.4
pdf/2009.09.08-Zuzahlungsbefreit_sort_Name_090901_8703.pdf:  PDF document, version 1.4
pdf/2009.09.16-Zuzahlungsbefreit_sort_Name_090915_8987.pdf:  PDF document, version 1.4
pdf/2009.10.03-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.10.04-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.10.05-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.10.06-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.10.07-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.10.08-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.10.09-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.10.10-Zuzahlungsbefreit_sort_Name_091001_9270.pdf:  PDF document, version 1.4
pdf/2009.11.10-Zuzahlungsbefreit_sort_Name_091101_9883.pdf:  PDF document, version 1.6
pdf/2009.11.11-Zuzahlungsbefreit_sort_Name_091101_9883.pdf:  PDF document, version 1.6
pdf/2009.11.12-Zuzahlungsbefreit_sort_Name_091101_9883.pdf:  PDF document, version 1.6
pdf/2009.11.13-Zuzahlungsbefreit_sort_Name_091101_9883.pdf:  PDF document, version 1.6
pdf/2009.11.14-Zuzahlungsbefreit_sort_Name_091101_9883.pdf:  PDF document, version 1.6
pdf/2009.11.15-Zuzahlungsbefreit_sort_Name_091101_9883.pdf:  PDF document, version 1.6
pdf/2009.11.16-Zuzahlungsbefreit_sort_Name_091101_9883.pdf:  PDF document, version 1.6
pdf/2009.11.17-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.18-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.19-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.20-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.21-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.22-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.23-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.24-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.25-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.11.26-Zuzahlungsbefreit_sort_Name_091115_10082.pdf: PDF document, version 1.4
pdf/2009.12.02-Zuzahlungsbefreit_sort_Name_091201_10326.pdf: PDF document, version 1.4
pdf/2009.12.16-Zuzahlungsbefreit_sort_Name_091215_10697.pdf: PDF document, version 1.4
pdf/2010.01.15-Zuzahlungsbefreit_sort_Name_100101_10925.pdf: PDF document, version 1.4
pdf/2010.01.19-Zuzahlungsbefreit_sort_Name_100115_11232.pdf: PDF document, version 1.4
pdf/2010.02.20-Zuzahlungsbefreit_sort_Name_100215_11862.pdf: PDF document, version 1.4
pdf/2010.03.01-Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/2010.03.02-Zuzahlungsbefreit_sort_Name_100301_12152.pdf: PDF document, version 1.4
pdf/2010.03.16-Zuzahlungsbefreit_sort_Name_100315_12454.pdf: PDF document, version 1.4
pdf/2010.04.02-Zuzahlungsbefreit_sort_Name_100401_12872.pdf: PDF document, version 1.4
pdf/2010.04.16-Zuzahlungsbefreit_sort_Name_100415_13077.pdf: PDF document, version 1.4
pdf/2010.09.09-Zuzahlungsbefreit_sort_Name_100901_14383.pdf: PDF document, version 1.4
pdf/2010.10.05-Zuzahlungsbefreit_sort_Name_101001_14562.pdf: PDF document, version 1.4
pdf/2010.10.18-Zuzahlungsbefreit_sort_Name_101015_14671.pdf: PDF document, version 1.4
pdf/2010.11.12-Zuzahlungsbefreit_sort_Name_101101_14842.pdf: PDF document, version 1.4
pdf/2010.11.16-Zuzahlungsbefreit_sort_Name_101115_14972.pdf: PDF document, version 1.6
pdf/Zuzahlungsbefreit_sort_Name_090901_8703.pdf:             PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_090915_8987.pdf:             PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091115_10082.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091201_10326.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_091215_10697.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100101_10925.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100115_11232.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100215_11862.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100301_12152.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100315_12454.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100401_12872.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100415_13077.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_100901_14383.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101001_14562.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101015_14671.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101101_14842.pdf:            PDF document, version 1.4
pdf/Zuzahlungsbefreit_sort_Name_101115_14972.pdf:            PDF document, version 1.6

Notes

  • There was also a version 1.6 PDF file in the past (091101_9883)

Experiment

Run jobs/import_gkv with the other pdf in version 1.6

masa@masa ~/ywesee/de.oddb.org $ jobs/import_gkv pdf=/home/masa/work/pdf/2009.11.10-Zuzahlungsbefreit_sort_Name_091101_9883.pdf

Result

  • The same error
"getin build_object_catalogue"
'incorrect header check' when filtering with /FlateDecode

Conclusion

  • It seems that the rpdf2txt library cannot handle PDF version 1.6

Update rpdf2txt

References

Check test

masa@masa ~/ywesee/rpdf2txt $ ruby test/suite.rb 
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.ruby: symbol lookup error: /usr/lib64/ruby/gems/1.8/gems/rmagick-2.9.0/lib/RMagick2.so: undefined symbol: DestroyConstitute

Note

  • Error

Check tests one by one

masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_object.rb 
Loaded suite test/test_pdf_object
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.ruby: symbol lookup error: /usr/lib64/ruby/gems/1.8/gems/rmagick-2.9.0/lib/RMagick2.so: undefined symbol: DestroyConstitute


masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_parser.rb 
Loaded suite test/test_pdf_parser
Started
.."getin build_object_catalogue"
."getin build_object_catalogue"
F........F."getin build_object_catalogue"
."getin build_object_catalogue"
...E"getin build_object_catalogue"
."getin build_object_catalogue"
..
Finished in 3.691601 seconds.

  1) Failure:
test_encrypt(TestParser) [test/test_pdf_parser.rb:1319]:
<395> expected but was
<nil>.

  2) Failure:
test_join_snippets__hex_chars(TestParser) [test/test_pdf_parser.rb:316]:
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\344t f\374r a1-, a2- und b-Adrenozeptoren sowie f\374r\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n"> expected but was
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\344t f\374r  a1-, a2- und b-Adrenozeptoren sowie f\374r\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n">.

  3) Error:
test_trailer_dictionary(TestParser):
NoMethodError: undefined method `values' for nil:NilClass
    /usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:54:in `build_trailer_dictionary'
    test/test_pdf_parser.rb:1292:in `test_trailer_dictionary'

22 tests, 59 assertions, 2 failures, 1 errors

masa@masa ~/ywesee/rpdf2txt $ ruby test/test_pdf_text.rb 
Loaded suite test/test_pdf_text
Started
................................
Finished in 1.109014 seconds.

32 tests, 41 assertions, 0 failures, 0 errors


masa@masa ~/ywesee/rpdf2txt $ ruby test/test_space_bug_05_2004.rb 
Loaded suite test/test_space_bug_05_2004
Started
.
Finished in 0.370815 seconds.

1 tests, 1 assertions, 0 failures, 0 errors


masa@masa ~/ywesee/rpdf2txt $ ruby test/test_stream.rb 
Loaded suite test/test_stream
Started
..........
Finished in 0.735466 seconds.

10 tests, 14 assertions, 0 failures, 0 errors


masa@masa ~/ywesee/rpdf2txt $ ruby test/test_text_state.rb 
Loaded suite test/test_text_state
Started
..............
Finished in 1.611227 seconds.

14 tests, 75 assertions, 0 failures, 0 errors

Notes

  • test_pdf_object.rb Error
  • test_pdf_parser.rb 2 failures, 1 errors
  • test_pdf_text.rb Pass
  • test_space_bug_05_2004.rb Pass
  • test_stream.rb Pass
  • test_text_state.rb Pass

Read code rpdf2txt

bin/rpdf2txt

Important parts

handler =  Rpdf2txt::SimpleHandler.new(STDOUT)
parser = Rpdf2txt::Parser.new(File.read(ARGV[0]), 'utf8')
parser.extract_text(handler)

Notes

  • The Parser and Handler classes are related
  • The main method is Rpdf2txt::Parser#extract_text
  • The Handler class has an output stream object
    • I guess the Handler class is in charge of outputting data
    • I guess the Parser class calls methods of the Handler class, and every Handler class has the same set of methods
    • I guess that if we swap the Handler class, the output will be different
  • The Parser class has the input data, so the Parser class is in charge of reading the PDF data
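
The guessed Parser/Handler relationship can be sketched with toy classes. Everything here is hypothetical except the callback names send_page and send_eof, which appear in extract_text; send_text and the Toy* classes are invented for illustration and are not rpdf2txt's API.

```ruby
require 'stringio'

# Hypothetical handler that writes to an output stream (like STDOUT).
class ToyStreamHandler
  def initialize(out)
    @out = out
  end

  def send_text(text) # hypothetical callback name
    @out << text
  end

  def send_page
    @out << "\n--- page break ---\n"
  end

  def send_eof
    @out << "[EOF]\n"
  end
end

# Hypothetical handler with the same methods that collects pages in memory.
class ToyCollectingHandler
  attr_reader :pages

  def initialize
    @pages = []
    @current = String.new
  end

  def send_text(text)
    @current << text
  end

  def send_page
    @pages << @current
    @current = String.new
  end

  def send_eof; end
end

# Toy stand-in for the parser: it only knows the handler's method names.
class ToyParser
  def initialize(pages)
    @pages = pages
  end

  def extract_text(handler)
    @pages.each do |page|
      handler.send_text(page)
      handler.send_page
    end
    handler.send_eof
  end
end
```

Passing a different handler to the same ToyParser changes where the text ends up, which is the behaviour guessed in the notes above.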

Trace forward from extract_text method

1. lib/rpdf2txt/parser.rb#extract_text

        def extract_text(callback_handler = SimpleHandler.new)
            page_tree.each { |node|
                node.text(callback_handler)
                callback_handler.send_page
            }
            callback_handler.send_eof
        end

2. page_tree

        def page_tree
            @page_tree ||= build_page_tree()
        end

Note

  • The page structure of a PDF looks like a tree

3. build_page_tree

        def build_page_tree
            page_tree_root.build_tree(object_catalogue)
        end

4. page_tree_root

        def page_tree_root
            object_catalogue[trailer_dictionary.root_id]
        end

5. object_catalogue

        def object_catalogue
            @object_catalogue ||= build_object_catalogue()
        end

6. build_object_catalogue

        def build_object_catalogue
            startobj=0
            endobj=0
            catalogue = {}
            @src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
                obj = build_object(match.to_s)
                catalogue.store(obj.oid, obj)
            end
            catalogue.values.select do |obj|
                obj.is_a?(ObjStream)
            end.each do |obj|
                scan_object_stream obj.decoded_stream, catalogue
            end
            catalogue
        end

Notes

  • @src (File.read('pdf file')) -> build_object_catalogue (@object_catalogue) -> page_tree_root -> @page_tree
  • This method looks like one of the core methods of the Parser class
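
The chain above is lazy: each step is built on first access and cached with ||=. A toy sketch of that pattern follows; LazyToyParser and its word-splitting "catalogue" are made up for illustration, and only the method names mirror parser.rb.

```ruby
# Toy parser showing the lazy, memoized chain:
# @src -> object_catalogue -> page_tree
class LazyToyParser
  attr_reader :build_count

  def initialize(src)
    @src = src
    @build_count = 0
  end

  def object_catalogue
    @object_catalogue ||= build_object_catalogue
  end

  def page_tree
    @page_tree ||= build_page_tree
  end

  private

  # Stand-in for the real scan: every word becomes one "object".
  def build_object_catalogue
    @build_count += 1
    catalogue = {}
    @src.split.each_with_index { |word, oid| catalogue.store(oid, word) }
    catalogue
  end

  def build_page_tree
    object_catalogue.keys.sort.map { |oid| object_catalogue[oid] }
  end
end
```

However often page_tree or object_catalogue is called, the catalogue is built only once per parser instance.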

Read line by line

      catalogue = {}
  • catalogue is a Hash
      @src.scan(/(?:\d+ ){2}obj\b.*?\bendobj\b/mn) do |match|
        obj = build_object(match.to_s)
        catalogue.store(obj.oid, obj)
      end
  • This is the core step that picks the objects out of the PDF source (@src)
  • /(?:\d+ ){2}obj\b.*?\bendobj\b/mn
    • For example,
22 0 obj
<< /Creator (BBEdit) /Producer (Mac OS X 10.2.2 Quartz PDFContext)
/CreationDate (D:20021114130430Z00'00') /ModDate (D:20021114130430Z00'00')
>>
endobj
  • One object: starts with 2 numbers and 'obj', ends with 'endobj'
  • 'm' lets '.' match newlines, so a match can span several lines; 'n' means $KCODE='NONE' (bytes, no encoding)
  • The build_object method creates an object from the matched string
  • key=obj.oid, value=obj
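
The regular expression can be tried on a small sample in isolation. This is a sketch: the second object (23 0) is made-up sample data, and the heredoc syntax needs a newer Ruby than 1.8.

```ruby
# Same pattern as in build_object_catalogue.
OBJECT_PATTERN = /(?:\d+ ){2}obj\b.*?\bendobj\b/mn

# Two plain-text objects; .b makes the sample a byte string like a PDF read from disk.
sample = <<~PDF.b
  22 0 obj
  << /Creator (BBEdit) /Producer (Mac OS X 10.2.2 Quartz PDFContext) >>
  endobj
  23 0 obj
  [ 1 2 3 ]
  endobj
PDF

# Without capture groups, scan returns each whole match as a string.
objects = sample.scan(OBJECT_PATTERN)

# The first number of each match is the object number.
ids = objects.map { |text| text[/\A(\d+) (\d+) obj/n, 1].to_i }
```

Here objects holds two strings, one per 'obj ... endobj' block, and ids holds their object numbers.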

Check PDF version 1.6

masa@masa ~/work $ cat Zuzahlungsbefreit_sort_Name_101115_14972.pdf |more
%&#65533;&#65533;F-1.6
<</Filter/FlateDecode/First 7/Length 272/N 1/Type/ObjStm>>stream
@&#65533;y&#65533;zk#&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;y&#65533;&#65533;&#65533;&#65533;8&#65533;>x&#65533;&#65533;Shv&#65533;&#65533;&#65533;&#65533;&#65533;F&#65533;&#65533;E&#65533;ud&#65533;&#65533;R&#65533;G&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;k&#65533;&#65533;"&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;cL5r-&#65533;&#65533;&#65533;&#65533;FF&#65533;&#65533;&#65533;%-&#65533;Jv~&#65533;&#65533;&#65533;I5XJ&#65533;N &#65533;&#65533;&#65533;&#65533;~a&#65533;&#65533;&#65533;&#65533;:&#65533;&#65533;N.&#65533;4=r&#65533;&#65533;J&#65533;&#65533; &#65533;&#65533;&#65533;&#65533;&#65533;&#65533;N*&#65533;&#65533;2&#65533;
                                                                                                                                    |&#65533;at&#65533;&#65533;&#65533;qfm?&#65533;&#65533;T&#65533;&#65533;&#65533;&#65533;?&#65533;&#65533;?&#65533;*&#65533;&#65533;4&#65533;.     j&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;Y&#65533;I&#65533;
<</Filter/FlateDecode/First 688/Length 2117/N 70/Type/ObjStm>>stream

Notes

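  • Namely, the objects of this version 1.6 file are packed into compressed object streams (/Type/ObjStm), so the regular expression cannot find them directly in the raw source.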
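
According to the PDF Reference, a decoded object stream starts with /N pairs of integers (object number, byte offset), and the objects themselves begin at byte /First, without the 'obj' and 'endobj' keywords. A minimal sketch of splitting such a stream follows; split_object_stream and the sample data are hypothetical and this is not rpdf2txt's scan_object_stream.

```ruby
require 'zlib'

# Split a decoded object stream into { object_number => object_text }.
# n and first correspond to the /N and /First entries of the stream dictionary.
def split_object_stream(decoded, n, first)
  numbers = decoded[0, first].split.map { |token| Integer(token, 10) }
  pairs = numbers.each_slice(2).first(n)
  body = decoded[first..-1]
  result = {}
  pairs.each_with_index do |(oid, offset), index|
    # An object ends where the next one starts (or at the end of the data).
    stop = index + 1 < pairs.length ? pairs[index + 1][1] : body.length
    result[oid] = body[offset...stop].strip
  end
  result
end

# Build a tiny, made-up object stream payload to exercise the helper.
sources = { 11 => '<< /Type /Font >>', 12 => '(hello)' }
body = String.new
index_entries = sources.map do |oid, text|
  entry = "#{oid} #{body.length}"
  body << text << "\n"
  entry
end
header = index_entries.join(' ') + "\n"
compressed = Zlib::Deflate.deflate(header + body)

# Decode as /FlateDecode would, then split using /N and /First.
decoded = Zlib::Inflate.inflate(compressed)
objects = split_object_stream(decoded, sources.length, header.length)
```

The compressed bytes contain no 'N G obj ... endobj' text, which is why the objects inside are invisible to a scan of the raw file.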

7. build_object

        def build_object(src)
            case src
            when /\/Type\s*\/Catalog\b/n
                CatalogNode.new(src, @target_encoding)
            when /\/Type\s*\/Pages\b/n
                PageNode.new(src, @target_encoding)
            when /\/Type\s*\/Page\b/n
                PageLeaf.new(src, @target_encoding)
            when /\/Type\s*\/Font\b/n
                Font.new(src, @target_encoding)
            when /\/Type\s*\/FontDescriptor\b/n
                FontDescriptor.new(src, @target_encoding)
            when /\/Type\s*\/Encoding\b/n
                Encoding.new(src, @target_encoding)
            when /\/Type\s*\/ObjStm\b/n
                ObjStream.new(src, @target_encoding)
            when /\/Type\s*\/XRef\b/n
                TrailerDictionary.new(src, @target_encoding)
            when %r!/Subtype\s*/Image!n
                Image.new(src, @target_encoding)
            when /\bstream\b/n, %r{/ToUnicode\b}n
                Stream.new(src, @target_encoding)
            when /\/Font\s*<</mn
                Resource.new(src, @target_encoding)
            when /^(?:\d+\s+){2}obj\s*\[\s*(?:(\d+\s+){2}R\s*)*\]\s+endobj/mn
                ReferenceArray.new(src, @target_encoding)
            when /^(?:\d+\s+){2}obj\s*\[\s*(?:(\d+\s*))*\]\s+endobj/mn
                PdfArray.new(src, @target_encoding)
            when /obj\s*<</mn
                PdfHash.new(src, @target_encoding)
            else
                Unknown.new(src, @target_encoding)
            end
        end

Notes

  • This is the second core part: it recognizes the type of each object from its ASCII string pattern
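
The dispatch can be tried out with a cut-down version that returns symbols instead of rpdf2txt's classes. This is only a sketch: classify is a hypothetical helper covering a subset of the patterns above.

```ruby
# Simplified stand-in for build_object: the first matching pattern wins.
def classify(src)
  case src
  when %r{/Type\s*/Catalog\b}n then :catalog
  when %r{/Type\s*/Pages\b}n   then :pages
  when %r{/Type\s*/Page\b}n    then :page
  when %r{/Type\s*/ObjStm\b}n  then :object_stream
  when /\bstream\b/n           then :stream
  when /obj\s*<</mn            then :hash
  else :unknown
  end
end

objstm = "4 0 obj\n<</Filter/FlateDecode/First 7/Length 272/N 1/Type/ObjStm>>stream\nxx\nendstream\nendobj"
plain  = "5 0 obj\n<< /Length 3 >>\nstream\nabc\nendstream\nendobj"
kinds  = [classify(objstm), classify(plain)]
```

The order of the when clauses matters: an object stream also contains the word 'stream', so the /Type/ObjStm test has to come before the generic stream test.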

Question

  • How do I pick up the objects?
  • I have to read the PDF Reference (1.6, 1.7)

Check PDF parser (Ruby)

Read PDF Reference version 1.6

Reference

Experiment

lib/rpdf2txt/object.rb

        def decode_raw_stream
            @decrypted_stream = raw_stream
            unless(@decoder.nil?)
                @decrypted_stream = @decoder.decrypt(self)
            end
            stream = @decrypted_stream
            [@attributes[:filter]].flatten.compact.each { |filter|
        begin
          stream = case filter
                   when "/FlateDecode"
                     flate_decode stream
                   when "/LZWDecode"
                     lzw_decode stream
                   else
                     raise "Unimplemented filter: #{filter}"
                   end
p "done decode"
        rescue StandardError => err
          warn "'#{err.message}' when filtering with #{filter}"
        end
            }
            stream
exit
        end
    def flate_decode(data)
p "getin flate_decode"
      Zlib::Inflate.inflate(data)
    end

Result

masa@masa ~/work/rpdf2txt $ ruby -I lib bin/rpdf2txt Zuzahlungsbefreit_sort_Name_101115_14972.pdf 
Rpdf2txt::ObjStream
"getin flate_decode"
'incorrect header check' when filtering with /FlateDecode

Notes

  • Zlib::Inflate.inflate raises an error ('incorrect header check').
  • This indicates that the data extracted as the 'Object Stream' is wrong: it does not start with a valid zlib header
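
The behaviour of Zlib::Inflate.inflate can be checked outside rpdf2txt. A minimal sketch, assuming a hypothetical helper try_inflate: it returns the decoded data, or the zlib error message when the bytes are not valid zlib data.

```ruby
require 'zlib'

# Try to inflate data; report the zlib error message instead of raising.
def try_inflate(data)
  [:ok, Zlib::Inflate.inflate(data)]
rescue Zlib::Error => e
  [:error, e.message]
end

good = try_inflate(Zlib::Deflate.deflate('hello'))
bad  = try_inflate('not zlib data')
```

Data that really was deflated comes back intact, while arbitrary bytes fail in the same way as the stream in the result above. So the next thing to check is which bytes decode_raw_stream actually hands to flate_decode.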

Reference

Page last modified on November 23, 2010, at 04:48 PM