
20110527-report-pdf-writer-rpdf2txt



  1. Read PDF reference Meta Data section

Goal/Estimate/Evaluation
  • Report pdf writer (producer) info / 70% / 50%
Milestones
  • Check PDF Reference, Meta Data section
Summary
Commits

Read PDF reference Meta Data section

Reference

Note

  • PDF metadata can be stored in two places:
    1. Document information dictionary
    2. Metadata stream
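
A quick way to see which of the two containers a file uses is to scan the raw bytes for their key names. The sketch below is only a heuristic: the helper name `metadata_locations` is made up for illustration, and it does not parse PDF objects, so it will miss entries stored inside compressed object streams.

```ruby
# Hypothetical helper: report which metadata containers appear in raw PDF bytes.
# Heuristic only; it matches key names and does not parse the object structure.
def metadata_locations(raw)
  data = raw.b # binary copy, so high bytes never raise encoding errors
  found = []
  # The trailer's /Info entry is an indirect reference to the
  # document information dictionary.
  found << :info_dictionary if data =~ %r{/Info\s+\d+\s+\d+\s+R}
  # A metadata stream is a stream whose dictionary contains /Type /Metadata.
  found << :metadata_stream if data =~ %r{/Type\s*/Metadata}
  found
end
```

Usage would be something like `p metadata_locations(File.binread('zubef2011.pdf'))`.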

zubef2011.pdf

%PDF-1.4
%????^M
1 0 obj
<<
/Title (?&#65199;8Z???^^???u???~K?^@^X?\\~?  ?)
/Author (?&#1645;?S???^P???a???n`)
/Creator (?&#622;^_S???^K???d???ez?\n^R?.W?^F??b!)
/Producer (?&#622;^_S???^K???=?&#547;"C?^A^N?tC?7??p!??$???x0M&#497;?o?Z?????Y^Y???S&#1127;?e^S?/^Q)
/CreationDate (&#727;?i^C???H???!?&#564;!$?HZ?$)
>>
endobj

Note

  • The values look encoded rather than stored as plain text
  • zubef PDF files from 2011 onward have this metadata object at the top of the file; it is probably the 'Document information dictionary'

zubef2010.pdf

<</Length 3674/Subtype/XML/Type/Metadata>>stream
...
endstream^Mendobj^M
2536 0 obj^M<</EncryptMetadata false/Filter/Standard/Length 128/O(?\r???i8?h?}^Q^W??3^Fn?Q>??^B?d?b????)
...

Note

  • This is the 'Metadata stream' type, which contains the metadata (Zubef PDF files include this object up to 2010)
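
Because the encryption dictionary above sets /EncryptMetadata false, the metadata stream should be readable without decryption. A minimal sketch for pulling the producer out of it follows, assuming the stream is uncompressed and the XMP packet uses the element form `<pdf:Producer>`. The helper name `xmp_producer` is hypothetical, and the regular expressions are a shortcut, not a real PDF or XML parser.

```ruby
# Hypothetical helper: read pdf:Producer from an uncompressed XMP metadata stream.
# Assumes the element form <pdf:Producer>...</pdf:Producer>; returns nil otherwise.
def xmp_producer(raw)
  data = raw.b # binary copy of the file contents
  # The stream body sits between the "stream" keyword (followed by an
  # end-of-line marker) and "endstream".
  packet = data[%r{/Type\s*/Metadata.*?stream\r?\n(.*?)endstream}m, 1]
  return nil unless packet
  packet[%r{<pdf:Producer>(.*?)</pdf:Producer>}m, 1]
end
```

Usage would be something like `p xmp_producer(File.binread('zubef2010.pdf'))`.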

Experiment (lib/rpdf2txt/parser.rb)

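    # Debug helper: print the class of every parsed object and,
    # where one exists, its :producer entry.
    def check_producer
      object_catalogue.values.each do |v|
        p v.class
        begin
          producer = v.contents[:producer]
          p producer if producer
        rescue StandardError
          # Some object types have no hash-like contents; skip them.
        end
      end
    end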

bin/rpdf2txt

#parser.extract_text(handler)
parser.check_producer

Run

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef2011.pdf > test.dat

Result

...
Rpdf2txt::PageLeaf
Rpdf2txt::Unknown
Rpdf2txt::PdfHash
"\377\311\256\037S\371\251\335\v\210\364\363=\277\310\243\"C\333\001\016\365tC\3027\263\335p!\272\203$\373?\315x0M\307\261\340o\333Z\215\372\323\370\341Y\031\372\241\320S\321\247\271e\023\265/\021"
Rpdf2txt::Unknown
Rpdf2txt::Unknown
Rpdf2txt::Stream
...

Note

  • The producer information is kept in an instance of PdfHash
  • More precisely, in PdfHash.contents[:producer]
  • But printing it with p shows only raw bytes, not readable text
  • I guess the value is encrypted
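
The guess can be checked without decrypting anything. The sketch below tests two things: whether a string consists only of printable ASCII, and whether the raw file contains an /Encrypt entry, which an encrypted PDF has in its trailer. Both helper names are made up for illustration, and the /Encrypt scan is a byte-level heuristic, not a real trailer parse.

```ruby
# Hypothetical helper: true if the string holds only printable ASCII bytes.
def printable_ascii?(str)
  str.b.match?(/\A[\x20-\x7E]*\z/)
end

# Hypothetical helper: true if the raw PDF bytes contain an /Encrypt entry,
# either as an indirect reference or as an inline dictionary.
# /EncryptMetadata does not match, because a reference or "<<" must follow.
def encrypt_entry?(raw)
  !!(raw.b =~ %r{/Encrypt\s*(\d+\s+\d+\s+R|<<)})
end
```

Usage would be something like `p encrypt_entry?(File.binread('zubef2011.pdf'))`, plus `printable_ascii?` applied to the producer value printed above.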

Experiment

test.rb

require 'origami'
include Origami

pdf = PDF.read('zubef2011.pdf', {:verbosity => Parser::VERBOSE_QUIET})
docinfo = pdf.get_document_info
pro = Origami::Name.new('Producer')
p docinfo[pro]

Result

"pdfFactory 3.25 (Windows Server 2003 R2 Standard Edition German)"
Page last modified on May 27, 2011, at 04:41 PM