
20110516-trace-rpdf2txt



  1. Review the problem
  2. Trace the snippets joining part
  3. Trace the snippets sorting part
  4. Trace the snipping part

Goal/Estimate/Evaluation
  • Update rpdf2txt / 50% / 40%
Milestones
  • Trace parser process
  • Trace decryption process
  • Trace snippets joining process
Summary
Commits

Review the problem

Problem

  • rpdf2txt has been unable to handle the PDF files published since 2011

The pdf versions and writers (producers)

2010.01.15 %PDF-1.4 
2010.01.19 %PDF-1.4
2010.02.20 %PDF-1.4
2010.03.01 %PDF-1.4
2010.03.02 %PDF-1.4
2010.03.16 %PDF-1.4
2010.04.02 %PDF-1.4
2010.04.16 %PDF-1.4
2010.09.09 %PDF-1.4
2010.10.05 %PDF-1.4
2010.10.18 %PDF-1.4
2010.11.12 %PDF-1.4 (Acrobat 5.x) (Acrobat Distiller 7.0 Windows)
2010.11.16 %PDF-1.6
2010.11.26 %PDF-1.6
2010.12.02 %PDF-1.6
2010.12.16 %PDF-1.6 (Acrobat 7.x) (Acrobat Distiller 9.0 Windows)
2011.01.14 %PDF-1.4 (Acrobat 5.x) (pdfFactory 3.25 (Windows Server 2003 R2 Standard Edition German))
2011.01.18 %PDF-1.4
2011.02.09 %PDF-1.4
2011.02.16 %PDF-1.4
2011.03.14 %PDF-1.4
2011.04.02 %PDF-1.4
2011.04.19 %PDF-1.4
2011.05.04 %PDF-1.4
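
The version strings in the list above are the first bytes of each file. A minimal sketch of how such a header could be read, assuming a hypothetical helper named pdf_version (it is not part of rpdf2txt):

```ruby
require 'stringio'

# Hypothetical helper (not part of rpdf2txt): returns the version from a
# "%PDF-x.y" header at the start of the stream, or nil if there is none.
def pdf_version(io)
  header = io.read(16).to_s
  header[/\A%PDF-(\d+\.\d+)/, 1]
end

# Usage with an in-memory stand-in for a real file
puts pdf_version(StringIO.new("%PDF-1.4\n"))   # prints 1.4
```

For a real file the same helper could be called as File.open('zubef_old.pdf', 'rb') { |f| pdf_version(f) }.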

Note

  • Snippets differ between the old and new PDFs

Experiment (lib/rpdf2txt/object.rb#text)

    def text(callback_handler)
p "getin text"
      concat_stream = Stream.new('')
      if(@contents.size == 1 && @contents.first.is_a?(ReferenceArray))
        @contents.first.build_stream(concat_stream)
      else
        @contents.each { |stream|
          concat_stream.append(stream.decoded_stream)
        }
      end
      @text_state.media_box = self.media_box

      text_snippets = concat_stream.extract_text_objects(self, @text_state)
p text_snippets.class
p text_snippets.length
p text_snippets.map{|x| x.class}.uniq.join("\n")
puts
10.times do |i|
print i, "\t", text_snippets[i].txt, "\n"
end
10.times do |i|
print i, "\t", text_snippets[-1-i].txt, "\n"
end
exit

Result

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_old.pdf 
"getin text"
Array
265
"Rpdf2txt::TextState"

0       Zuzahlungsbefreite Arzneimittel nach § 31 Abs. 3 Satz 4 SGB V
1       PZN
2       Arzneimit
3       t
4       e
5       lname
6       Darreichungsf
7       o
8       rm
9       Hersteller
0       Seite 1 von 618
1       mg
2       600
3       n
4       i
5       e
6       st
7       y
8       c
9       l


masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_latest.pdf 
"getin text"
Array
698
"Rpdf2txt::TextState"

0       Zu
1       z
2       a
3       h
4       l
5       ung
6       s
7       b
8       ef
9       r
0       638
1        
2       n
3       o
4       v
5        
6       1
7        
8       e
9       t

Note

  • The points at which the text is split into snippets differ: the latest PDF yields many more, much shorter snippets (698 vs. 265)

Trace the snippets joining part

Default run

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_latest.pdf 
638nov1eSiet
 600CCAABST                                 0434230            XAGLEHA                                                     teinAscytylce                                                                                                                                                                                                                                                                                                                            600      mg                          20                  t S       Tableentt                                                                             1216,
ACC600ABST                                 0434224            XHELAAG                                                  cetsyineyAtcl      
....

Experiment (lib/rpdf2txt/object.rb#join_snippets)

    def join_snippets(text_snippets, callback_handler)
#      text_snippets.sort!

Result

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_latest.pdf
ZuzahlungsbefreiteArzneimittel nach§31Abs. 3Satz4SGB V
PZNArzneimittelname                                                                                                                                                                      DarreichungsformHersteller                                                                                                                                                                                                                                                                                                                                                                Apothekenverkaufspreis
 inkl.MwSt
Packungs-
größe
Wirkstoff(e)                                                                        Wirkstärke(n)
Produktstand
sortiertnachArzneimittelname
01 . 05  . 2011
3867219ACC200                                                                                                                                                                                                                                                                                                                                                       Brausetabletten                                                                                                                                                                                                                                                                                                                                                                                                                 12,72HEXALAG                                                                                                                                 50                St Acetylcystein                                                                      200      mg
3867225ACC200                                                                                      
...

Note

  • The output is somewhat better formatted, but still not correct

Consideration

  • I can say, at least, that sorting the snippets (Array#sort!, which relies on Rpdf2txt::TextState#<=>) does not work correctly
  • The '<=>' method is defined in lib/rpdf2txt/text_state.rb
  • It depends on the @x and @y instance variables
  • @x and @y are set in the 'update!' method

Check @x and @y meanings

Experiment (lib/rpdf2txt/object.rb#join_snippets)

    def join_snippets(text_snippets, callback_handler)
p "getin join_snippets"
      text_snippets.sort!
      columns = []
      if(callback_handler.identify_columns?)
        columns = identify_columns(text_snippets,
                                   :width => callback_handler.column_width,
                                   :count => callback_handler.column_count)
        columns.shift #throw away the first colum - we'll use the left media-edge
      end

      next_column = nil
      working_set = []
count = 0
      each_pair(text_snippets) { |last_text_state, text_state|
print count, "\t",  text_state.txt, "\t", text_state.x, "\t", text_state.y, "\n"
...

Result

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_old.pdf
"getin join_snippets"
0       Zuzahlungsbefreite Arzneimittel nach § 31 Abs. 3 Satz 4 SGB V   43.4305 59.64
Zuzahlungsbefreite Arzneimittel nach § 31 Abs. 3 Satz 4 SGB V
1       Produktstand    43.4305 79.5

Produktstand
2       15      145.39037171    79.5
   15
3       .       162.41985007    79.5
.
4       1       169.37920036    79.5
1
5       2       177.17434468    79.5
2
6       .       185.44968557    79.5
.
7       2010    189.35005771    79.5
2010
8       sortiert nach Arzneimittelname  43.4305 96.48702837

sortiert nach Arzneimittelname
9       Arzneimit       43.4827 117.0

Arzneimit
10      t       80.8894 117.0
t
11      e       83.4103 117.0
e
12      lname   88.3873 117.0
lname
13      PZN     176.6305        117.0
                          PZN
14      Hersteller      221.9725        117.0
           Hersteller
15      Wirkstoff       343.9795        117.0
                                  Wirkstoff
16      (       379.03  117.0
(
17      e       382.0288        117.0
e
18      )       386.9455        117.0
)
19      Wirkstärke      440.3245        117.0
                    Wirkstärke
masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_latest.pdf 
"getin join_snippets"
0       638     5656.8  3132.0
638
1       n       5656.8  3132.0
n
2       o       5656.8  3132.0
o
3       v       5656.8  3132.0
v
4       1       5656.8  3132.0
1
5       e       5656.8  3132.0
e
6       S       5656.8  3132.0
S
7       i       5656.8  3132.0
i
8       e       5656.8  3132.0
e
9       t       5656.8  3132.0
t

Note

  • @x is the horizontal position within the line and @y is the vertical position of the line
  • The snippets should be sorted first by @y and then by @x
  • This means that snippets with a smaller @y come first in the text_snippets Array
  • In the case of the latest PDF file, however, the sorting looks wrong
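
The intended ordering (first by @y, then by @x) can be sketched with a simplified stand-in for the snippet class. Snippet and sorted_text are names invented for this illustration; the real Rpdf2txt::TextState#<=> also handles same_line and @cmyscale. The coordinates are the ones logged for the old PDF.

```ruby
# Simplified stand-in for Rpdf2txt::TextState; only txt, x and y are modelled.
Snippet = Struct.new(:txt, :x, :y) do
  # Array#sort and Array#sort! call <=> on pairs of elements.
  def <=>(other)
    y == other.y ? x <=> other.x : y <=> other.y
  end
end

def sorted_text(snippets)
  snippets.sort.map(&:txt).join(' ')
end

snippets = [
  Snippet.new('Hersteller', 221.9725, 117.0),
  Snippet.new('Produktstand', 43.4305, 79.5),
  Snippet.new('PZN', 176.6305, 117.0),
]
puts sorted_text(snippets)   # Produktstand PZN Hersteller
```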

Experiment (lib/rpdf2txt/object.rb#join_snippets)

    def join_snippets(text_snippets, callback_handler)
p "getin join_snippets"
      text_snippets.sort!

Result

...
165     A       43.44   450.48
A
166     CC      43.44   450.48
CC
167     200     43.44   450.48
200
168     Br      582.24  450.48
                                                                                                                                                                                                                                                                                                                                                       Br
169     a       582.24  450.48
a
170     u       582.24  450.48
u
171     s       582.24  450.48
s
172     e       582.24  450.48
e
173     t       582.24  450.48
...

Note

  • The way @x is set also looks wrong
  • Many consecutive snippets share the same @x position

Consideration

  • It seems that neither the 'sorting' nor the 'setting' of snippets works correctly, in particular with regard to @x and @y
  • The garbled active-ingredient (Wirkstoff) and manufacturer (Hersteller) names are caused by identical @x values: snippets with the same @x cannot be put into the right order by sorting.
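
To illustrate the second point: Ruby does not guarantee that Array#sort is stable, so snippets with identical coordinates may come out in any order. A minimal sketch, assuming a made-up Piece struct and stable_join helper, that uses the extraction order as a final tie-breaker. This would only hide the symptom; the identical @x values themselves remain wrong.

```ruby
# Simplified stand-in for a text snippet (not the real TextState class).
Piece = Struct.new(:txt, :x, :y)

# Sort by y, then x, then original stream position, so snippets that share
# the same coordinates keep the order in which they were extracted.
def stable_join(pieces)
  pieces.each_with_index
        .sort_by { |piece, index| [piece.y, piece.x, index] }
        .map { |piece, _| piece.txt }
        .join
end

# Coordinates as logged for the latest PDF: every snippet reports the same x.
pieces = %w[A CC 200].map { |txt| Piece.new(txt, 43.44, 450.48) }
puts stable_join(pieces)   # ACC200
```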

Trace the snippets sorting part

Experiment (lib/rpdf2txt/text_state.rb#<=>)

    def <=> (other)
      if(same_line(other))
        @x <=> other.x
      elsif(other.is_a?(self.class))
p @cmyscale
        # @cmyscale may be negative, reversing the sort-order
        (@y <=> other.y) \
          * (@cmyscale == 0 ? 1 : @cmyscale)
      else
        @y <=> other.y
      end
    end

Result

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_old.pdf
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
...
masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt zubef_latest.pdf
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0

Note

  • @cmyscale differs: 0.0 for the old PDF, -1.0 for the latest PDF
  • If @cmyscale is negative, the sort order by @y is reversed
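
The effect of the sign can be checked in isolation. The sketch below restates only the arithmetic of the middle branch of '<=>'; compare_y is a name made up for this example.

```ruby
# Mirrors the arithmetic in the <=> excerpt above: the y comparison is
# multiplied by the scale, with 0 treated as 1.
def compare_y(y, other_y, cmyscale)
  (y <=> other_y) * (cmyscale == 0 ? 1 : cmyscale)
end

puts compare_y(79.5, 117.0, 0.0)    # -1   (old PDF: smaller y sorts first)
puts compare_y(79.5, 117.0, -1.0)   # 1.0  (latest PDF: order reversed)
```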

Trace the snipping part

Reference

The snippets are extracted from lib/rpdf2txt/object.rb#extract_text_objects

    def extract_text_objects(page, text_state)
      @page, @text_state = page, text_state
      stack = []
      result = []
      startpoint = decoded_stream.index(BT_PATTERN)
      endpoint = decoded_stream.index(ET_PATTERN)
      while FAIL_PTRN.match(decoded_stream[0..(endpoint+2)])
        endpoint = decoded_stream.index(ET_PATTERN, endpoint.next)
      end
      unless(startpoint && endpoint && (startpoint < endpoint))
        startpoint = 0
      end
      rotation = (page && Math::PI * page.attributes[:rotate].to_f / 180) || 0
      dmatrix = Matrix[[Math.cos(rotation),Math.sin(rotation),0],
                       [Math.sin(rotation),-Math.cos(rotation),0],
                       [0,0,1]]

      dm_src = decoded_stream[0...startpoint]
      while(endpoint && startpoint)
        ### pick out the bits in between Text that are relevant to 
        ### text positioning (such as the device-transformation-matrix)
        ### NOTE: as far as I understand, the device matrix should 
        ###       not be used to position text. However it is used 
        ###       by some PDF-Creators and therefore we have to include
        ###       it in our calculations.
        dmatrix = extract_nontext_objects(dm_src, dmatrix, stack, result)
        extract_horizontal_rules(dm_src, dmatrix, result)
        tsrc = decoded_stream[startpoint..(endpoint+2)]
        while FAIL_PTRN.match(tsrc)
          endpoint = decoded_stream.index(ET_PATTERN, endpoint + 2) || -1
          tsrc = decoded_stream[startpoint..(endpoint+2)]
        end
        text = Text.new(tsrc, @target_encoding, dmatrix)
        text.current_page = page
        text.text_state = text_state
        # here @x and @y of TextState instance are set
        result.concat text.scan
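
For orientation, the initial dmatrix from the excerpt above can be evaluated for two page rotations. This sketch uses plain nested arrays instead of the Matrix class, and device_matrix is a name invented here.

```ruby
# Same entries as the dmatrix in the excerpt above, for a rotation given in
# degrees (as stored in the page's :rotate attribute).
def device_matrix(rotate_degrees)
  rotation = Math::PI * rotate_degrees.to_f / 180
  [[Math.cos(rotation),  Math.sin(rotation), 0],
   [Math.sin(rotation), -Math.cos(rotation), 0],
   [0, 0, 1]]
end

[0, 90].each do |deg|
  rows = device_matrix(deg).map { |row| row.map { |v| v.round(6) } }
  puts "#{deg} degrees: #{rows.inspect}"
end
```

Without rotation the matrix only flips the sign of y; at 90 degrees it swaps x and y.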

lib/rpdf2txt/text.rb#scan

    def scan
      @snippets = []
      ast = Rpdf2txt.text_parser.parse(@src)
      scan_tree(ast)
      @snippets
    rescue Exception
      puts @src
      raise
    end

lib/rpdf2txt/text.rb#scan_tree

    def scan_tree(ast)
      ast.values.each { |node|
        if(node.name == 'Array') \
          && (node.values.first.children_names.first == 'kerning')
          ## If the case [ 34 (foo) ] crops up, the first operation 
          ## executed on @text_state is advance_x. This results in 
          ## the width of the last text-snipped being calculated twice.
          ## This here is a workaround that resets the snippet to an 
          ## empty string if we are encountering a [ ??? ] construct
          ## (an array).
          ## TODO: find a more general solution
          @text_state.set_txt('')
        end
        node.children_names.each { |child_name|
          case child_name
          when 'alpha'
            @text_state.tmalpha = node.alpha.value.to_f
          when 'beta'
            @text_state.tmbeta = -node.beta.value.to_f
            skew = node.beta.value.to_f > 0.1
            if(@current_font && @current_font.skewed != skew)
              @current_font = @current_font.dup
              @current_font.skewed = skew
              @text_state.set_font(@current_font)
            end
          when 'xscale'
            @text_state.set_xscale(node.xscale.value)
          when 'yscale'
            @text_state.set_yscale(node.yscale.value)
          when 'charspace'
           @text_state.set_char_spacing(node.charspace.value)
          when 'kerning'
            @text_state.advance_x(node.kerning.value.to_f)
          when 'tdleadx'
            @text_state.update_x(node.tdleadx.value.to_f)
          when 'tdleady'
            lead = node.tdleady.value.to_f
            @text_state.set_lead(lead)
            @text_state.update_y(lead)
          when 'xpos'
            @text_state.update_x(node.xpos.value.to_f)
          when 'ypos'
            @text_state.update_y(node.ypos.value.to_f)
          when 'fontname'
            @current_font = get_font(node.fontname.value)
            @text_state.set_font(@current_font)
            @text_state.set_font_size(node.fontsize.value)
          when 'tmx'
            @text_state.set_x(node.tmx.value.to_f)
          when 'tmy'
            @text_state.set_y(node.tmy.value)
          when 'render'
            val = node.render.value
            if(@current_font && @current_font.rendering_mode != val)
              @current_font = @current_font.dup
              @current_font.rendering_mode = val
              @text_state.set_font(@current_font)
            end
          when 'wordspace'
            @text_state.set_word_spacing(node.wordspace.value)
          when 'values'
            scan_tree(node)
          when 'snippet'
            snip(node.snippet.value)
          when 'aposnippet'
            @text_state.step
            snip(node.aposnippet.value)
          when 'linebreak'
            @text_state.step
          when 'textrise'
            #add functionality for textrise p 387 pdf manual
          when 'hexsnippet'
            hex_bytes = node.hexsnippet.value
            char = ''
            hex_bytes.scan(/.{2,4}/n) { |pair|
              dec_byte = pair.hex
              char << (mapped_ascii(dec_byte) || '?')
            }
            _snip(char)
          end
        }
      }
    end

Note

  • The node.children_names are determined by Rpdf2txt.text_parser.parse(@src)
  • Rpdf2txt.text_parser.parse(@src) uses the parser in lib/rpdf2txt/data/pdftext.rb
  • This file is generated automatically by the Rockit library and is complicated.
  • The 'scan_tree' method is recursive: it calls itself whenever a child name of the node is 'values'
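
The recursion described in the last point can be sketched on a much simplified tree. Node and collect_snippets are hypothetical names; the real AST comes from the Rockit-generated parser and carries many more child names.

```ruby
# Hypothetical, much simplified node: either a leaf holding a snippet or a
# 'values' node holding child nodes.
Node = Struct.new(:name, :snippet, :children)

def collect_snippets(node, out = [])
  case node.name
  when 'values'
    node.children.each { |child| collect_snippets(child, out) }  # recurse
  when 'snippet'
    out << node.snippet
  end
  out
end

tree = Node.new('values', nil, [
  Node.new('snippet', '(PZN)', nil),
  Node.new('values', nil, [Node.new('snippet', '(Hersteller)', nil)]),
])
p collect_snippets(tree)   # ["(PZN)", "(Hersteller)"]
```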

Check pdf binary data (lib/rpdf2txt/text.rb#scan)

    def scan
p "getin scan"
      @snippets = []
      ast = Rpdf2txt.text_parser.parse(@src)
exit
      scan_tree(ast)
      @snippets
    rescue Exception
      puts
      puts @src
      raise
    end

Result

  • @src (in the case of old pdf) (from BT to ET)
BT
/TT2 1 Tf
0 14.0053 -13.9999 0 59.64 43.4305 Tm
0 g
-.0002 Tc
.0008 Tw
(Zuzahlungsbefreite Arzneimittel nach § 31 Abs. 3 Satz 4 SGB V)Tj
/TT4 1 Tf
0 9.0035 -9 0 117 176.6305 Tm
.0009 Tc
0 Tw
(PZN)Tj
-14.7942 0 TD
-.0016 Tc
[(Arzneimit)-3.7(t)-3.7(e)1.4(lname)]TJ
59.8165 0 TD
[(Darreichungsf)-3.7(o)1.4(rm)]TJ
-39.9843 0 TD
(Hersteller)Tj
52.2661 0 TD
.0001 Tc
[(Apo)9.8(t)-8.6(h)9.8(e)-3.5(ke)9.8(nverka)9.8(ufspre)9.8(is)]TJ
3.1321 -1.14 TD
-.0006 Tc
.0027 Tw
[( in)-4.2(kl)8.3(.)-9.3(M)-.6(w)21.8(S)0(t)]TJ
ET
  • @src (the latest pdf)
BT
/F1 13.92 Tf
43.44 535.68 TD -0.10512 Tc (Zu) Tj
0 Tc (z) Tj
-0.05952 Tc (a) Tj
-0.10512 Tc (h) Tj
-0.02976 Tc (l) Tj
-0.10512 Tc (ung) Tj
-0.05952 Tc (s) Tj
-0.10512 Tc (b) Tj
-0.05952 Tc (ef) Tj
0.10512 Tc (r) Tj
-0.05952 Tc (e) Tj
-0.02976 Tc (i) Tj
-0.07536 Tc (te) Tj
-0.02976 Tc ( ) Tj
-0.21024 Tc (A) Tj
0.10512 Tc (r) Tj
0 Tc (z) Tj
-0.10512 Tc (n) Tj
-0.05952 Tc (e) Tj
-0.02976 Tc (i) Tj
0.10512 Tc (m) Tj
-0.02976 Tc (i) Tj
-0.07536 Tc (tte) Tj
-0.02976 Tc (l ) Tj
-0.10512 Tc (n) Tj
-0.05952 Tc (ac) Tj
-0.10512 Tc (h) Tj
-0.02976 Tc ( ) Tj
-0.05952 Tc (\247) Tj
-0.02976 Tc ( ) Tj
-0.05952 Tc (31) Tj
-0.02976 Tc ( ) Tj
-0.21024 Tc (A) Tj
-0.10512 Tc (b) Tj
-0.05952 Tc (s) Tj
-0.02976 Tc (. ) Tj
-0.05952 Tc (3) Tj
-0.02976 Tc ( ) Tj
0.07536 Tc (S) Tj
-0.05952 Tc (at) Tj
0 Tc (z) Tj
-0.02976 Tc ( ) Tj
-0.05952 Tc (4) Tj
-0.02976 Tc ( ) Tj
0.07536 Tc (S) Tj
-0.02976 Tc (G) Tj
0.02976 Tc (B) Tj
-0.02976 Tc ( V) Tj
/F2 8.88 Tf
ET
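
The old PDF writes several pieces of a word in one TJ array, with kerning numbers between the strings, whereas the latest PDF emits a separate Tj for almost every character. A naive tokenizer for such an array, as a sketch only: tj_tokens is a made-up name, the real parsing is done by the Rockit-generated grammar, and escaped parentheses are not handled here.

```ruby
# Splits a TJ array into string snippets and kerning numbers.
def tj_tokens(array_src)
  array_src.scan(/\((.*?)\)|(-?\d*\.?\d+)/).map do |str, num|
    str ? [:snippet, str] : [:kerning, num.to_f]
  end
end

p tj_tokens('[(Arzneimit)-3.7(t)-3.7(e)1.4(lname)]TJ')
```

In scan_tree, each 'kerning' value is passed to advance_x and each 'snippet' to snip.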

Note

  • The 'scan' method uses the Rockit-generated parser, Rpdf2txt.text_parser.parse(@src), to build a hierarchical structure from the PDF data
  • The 'scan' method is called once for each part from 'BT' to 'ET'.
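
A rough way to look at the BT ... ET parts outside of rpdf2txt is sketched below. The patterns are illustrative assumptions only; rpdf2txt uses its own BT_PATTERN and ET_PATTERN, and text_blocks and tj_strings are names invented for this example. The sample lines are taken from the two @src dumps above.

```ruby
# Returns every part of the stream from a line starting with BT up to the
# next line starting with ET.
def text_blocks(stream)
  stream.scan(/^BT\b.*?^ET\b/m)
end

# Returns the literal strings shown with Tj (no escaped parentheses handled).
def tj_strings(block)
  block.scan(/\((.*?)\)\s*Tj/).flatten
end

STREAM = <<~PDF
  BT
  /F1 13.92 Tf
  43.44 535.68 TD -0.10512 Tc (Zu) Tj
  0 Tc (z) Tj
  -0.05952 Tc (a) Tj
  ET
  BT
  /TT4 1 Tf
  (PZN)Tj
  ET
PDF

blocks = text_blocks(STREAM)
puts blocks.length            # 2
p tj_strings(blocks.first)    # ["Zu", "z", "a"]
```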

Check each snippet position (@x, @y) (lib/rpdf2txt/text.rb#_snip)

    def snip(snippet)
      snippet_text = snippet[1..-2].gsub(/\\[nrt]/n, " ")
      snippet_text.gsub!(/\\([()])/n, '\1')
      snippet_text.gsub!(/./n) { |char|
        self.mapped_ascii(char.unpack('C*')[0]) || char
      }
print snippet_text, "\t"
      _snip(snippet_text)
    end
    def _snip(snippet_text)
      @text_state.set_txt(snippet_text)
      @text_state.update!(@current_page ? @current_page.attributes[:rotate] : 0)
print "x = ", @text_state.x, "\ty = ", @text_state.y, "\n"
      @snippets.push(@text_state.dup).last
    end
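
The two substitutions in 'snip' can be tried on their own. The sketch below leaves out the font mapping (mapped_ascii), unescape_literal is a name made up here, and the first input is an invented sample. The second input suggests why the latest PDF prints \247 literally in the result: an octal escape is not covered by either substitution.

```ruby
# Re-statement of the two substitutions in snip, without mapped_ascii.
def unescape_literal(snippet)
  text = snippet[1..-2].gsub(/\\[nrt]/, ' ')   # drop parens; \n \r \t become spaces
  text.gsub(/\\([()])/, '\1')                  # \( and \) become ( and )
end

puts unescape_literal('(Wirkstoff\(e\))')   # Wirkstoff(e)
puts unescape_literal('(\247)')             # \247
```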

Result

  • an old pdf
Zuzahlungsbefreite Arzneimittel nach § 31 Abs. 3 Satz 4 SGB V   x = 43.4305     y = 59.64
PZN     x = 176.6305    y = 117.0
"getin scan_tree"
Arzneimit       x = 43.4827     y = 117.0
t       x = 80.8894     y = 117.0
e       x = 83.4103     y = 117.0
lname   x = 88.3873     y = 117.0
"getin scan_tree"
Darreichungsf   x = 581.8312    y = 117.0
o       x = 637.6933    y = 117.0
rm      x = 642.6703    y = 117.0
Hersteller      x = 221.9725    y = 117.0
"getin scan_tree"
Apo     x = 692.3674    y = 117.0
t       x = 708.2929    y = 117.0
h       x = 710.8732    y = 117.0
e       x = 715.7899    y = 117.0
ke      x = 720.8263    y = 117.0
nverka  x = 730.2439    y = 117.0
ufspre  x = 757.1701    y = 117.0
is      x = 782.0983    y = 117.0
"getin scan_tree"
 in     x = 720.5563    y = 127.26399
kl      x = 730.1062    y = 127.26399
.       x = 736.5187    y = 127.26399
M       x = 739.099     y = 127.26399
w       x = 746.596     y = 127.26399
S       x = 752.8924    y = 127.26399
t       x = 758.89      y = 127.26399
  • the latest pdf
Zu      x = 43.44       y = 535.68
z       x = 43.44       y = 535.68
a       x = 43.44       y = 535.68
h       x = 43.44       y = 535.68
l       x = 43.44       y = 535.68
ung     x = 43.44       y = 535.68
s       x = 43.44       y = 535.68
b       x = 43.44       y = 535.68
ef      x = 43.44       y = 535.68
r       x = 43.44       y = 535.68
e       x = 43.44       y = 535.68
i       x = 43.44       y = 535.68
te      x = 43.44       y = 535.68
        x = 43.44       y = 535.68
A       x = 43.44       y = 535.68
r       x = 43.44       y = 535.68
z       x = 43.44       y = 535.68
n       x = 43.44       y = 535.68
e       x = 43.44       y = 535.68
i       x = 43.44       y = 535.68
m       x = 43.44       y = 535.68
i       x = 43.44       y = 535.68
tte     x = 43.44       y = 535.68
l       x = 43.44       y = 535.68
n       x = 43.44       y = 535.68
ac      x = 43.44       y = 535.68
h       x = 43.44       y = 535.68
        x = 43.44       y = 535.68
\247    x = 43.44       y = 535.68
        x = 43.44       y = 535.68
31      x = 43.44       y = 535.68
        x = 43.44       y = 535.68
A       x = 43.44       y = 535.68
b       x = 43.44       y = 535.68
s       x = 43.44       y = 535.68
.       x = 43.44       y = 535.68
3       x = 43.44       y = 535.68
        x = 43.44       y = 535.68
S       x = 43.44       y = 535.68
at      x = 43.44       y = 535.68
z       x = 43.44       y = 535.68
        x = 43.44       y = 535.68
4       x = 43.44       y = 535.68
        x = 43.44       y = 535.68
S       x = 43.44       y = 535.68
G       x = 43.44       y = 535.68
B       x = 43.44       y = 535.68
 V      x = 43.44       y = 535.68

Note

  • The x position should differ from snippet to snippet, but in the latest PDF every snippet in the line gets the same x
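
What one would expect instead is that each snippet starts where the previous one ended. A hypothetical illustration, not the rpdf2txt implementation: advancing_positions is an invented name, character spacing (Tc) is ignored, and the widths are the @w values that the update! trace reports for "Zu" and "z" in the latest PDF.

```ruby
# Each snippet starts where the previous one ended, i.e. x advances by the
# width of the snippet that was just shown.
def advancing_positions(start_x, widths)
  x = start_x
  widths.map do |w|
    current = x
    x += w
    current
  end
end

positions = advancing_positions(43.44, [14.0836992, 6.96])
p positions.map { |v| v.round(4) }   # [43.44, 57.5237]
```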

Next

  • How is the x position determined?

Experiment (lib/rpdf2txt/text_state.rb#update!)

    def update!(rotation=0)
      orientation = (rotation.to_f.round / 90) % 2
print "orientation= ", orientation, "\n"
print "@tmxoffset = ", @tmxoffset, "\n"
print "@cmxoffset = ", @cmxoffset, "\n"
print "@tmx       = ", @tmx, "\n"
print "@dtmx      = ", @dtmx, "\n"
print "@tmxscale  = ", @tmxscale, "\n"
print "@tmyoffset = ", @tmyoffset, "\n"
print "@cmyoffset = ", @cmyoffset, "\n"
print "@tmy       = ", @tmy, "\n"
print "@dtmy      = ", @dtmy, "\n"
print "@tmyscale  = ", @tmyscale, "\n"
print "@w         = ", @w, "\n"
print "@font_size = ", @font_size, "\n"

      x, y, x2, y2, bx, by = nil
      if orientation == 1
        x = @tmxoffset + @tmy * @tmalpha
        y = @tmyoffset + (@tmx + @dtmx) * @tmbeta
        x2 = bx = x + @font_size * @tmalpha
        y2 = y + @w * @tmbeta
        by = y + @boxwidth * @tmbeta
        @x = y + @cmxoffset
        @y = x + @cmyoffset
        @x2 = y2 + @cmxoffset
        @y2 = x2 + @cmyoffset
        @right_edge = by + @cmxoffset
      else
        x = @tmxoffset + (@tmx + @dtmx) * @tmxscale
        y = @tmyoffset - @tmy * @tmyscale
        x2 = x + @w * @tmxscale
        y2 = by = y - @font_size * @tmyscale
        bx = x + @boxwidth * @tmxscale
        @x = x + @cmxoffset
        @y = y + @cmyoffset
        @x2 = x2 + @cmxoffset
        @y2 = y2 + @cmyoffset
        @right_edge = bx + @cmxoffset
      end
puts
print "@x         = ", @x, "\n"
print "@x2        = ", @x2, "\n"
print "@y         = ", @y, "\n"
print "@y2        = ", @y2, "\n"
print "@right_edge= ", @right_edge, "\n"
    end
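
The first line of 'update!' decides which branch is taken. A small sketch of that expression for the usual page rotations (the method name orientation is chosen here for illustration):

```ruby
# Same expression as the first line of update!.
def orientation(rotation)
  (rotation.to_f.round / 90) % 2
end

[0, 90, 180, 270].each { |r| puts "#{r} -> #{orientation(r)}" }
# 0 -> 0
# 90 -> 1
# 180 -> 0
# 270 -> 1
```

So the rotated branch (orientation == 1) is taken for pages rotated by 90 or 270 degrees.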

Result

  • an old pdf
snippet_text = Zuzahlungsbefreite Arzneimittel nach § 31 Abs. 3 Satz 4 SGB V
orientation= 1
@tmxoffset = 59.64
@cmxoffset = 0
@tmx       = 0
@dtmx      = 0
@tmxscale  = 0.0
@tmyoffset = 43.4305
@cmyoffset = 0
@tmy       = 0
@dtmy      = nil
@tmyscale  = 0.0
@w         = 29.8378
@font_size = 1.0

@x         = 43.4305
@x2        = 461.15671622
@y         = 59.64
@y2        = 73.6453
@right_edge= 461.15671622
x = 43.4305     y = 59.64
==================================================
snippet_text = PZN
orientation= 1
@tmxoffset = 117.0
@cmxoffset = 0
@tmx       = 0
@dtmx      = 0
@tmxscale  = 0.0
@tmyoffset = 176.6305
@cmyoffset = 0
@tmy       = 0
@dtmy      = nil
@tmyscale  = 0.0
@w         = 2.0027
@font_size = 1.0

@x         = 176.6305
@x2        = 194.6548
@y         = 117.0
@y2        = 126.0035
@right_edge= 194.6548
x = 176.6305    y = 117.0
...
  • the latest pdf
snippet_text = Zu
orientation= 0
@tmxoffset = 0.0
@cmxoffset = 0
@tmx       = 43.44
@dtmx      = 0
@tmxscale  = 1.0
@tmyoffset = 0.0
@cmyoffset = 0
@tmy       = -535.68
@dtmy      = nil
@tmyscale  = 1.0
@w         = 14.0836992
@font_size = 13.92

@x         = 43.44
@x2        = 57.5236992
@y         = 535.68
@y2        = 521.76
@right_edge= 57.5236992
x = 43.44       y = 535.68
==================================================
snippet_text = z
orientation= 0
@tmxoffset = 0.0
@cmxoffset = 0
@tmx       = 43.44
@dtmx      = 0
@tmxscale  = 1.0
@tmyoffset = 0.0
@cmyoffset = 0
@tmy       = -535.68
@dtmy      = nil
@tmyscale  = 1.0
@w         = 6.96
@font_size = 13.92

@x         = 43.44
@x2        = 50.4
@y         = 535.68
@y2        = 521.76
@right_edge= 50.4
x = 43.44       y = 535.68
...
  • Focus only on the x position
    def update!(rotation=0)
      orientation = (rotation.to_f.round / 90) % 2
print "orientation= ", orientation, "\n"
if orientation == 0
print "@tmxoffset = ", @tmxoffset, "\n"
print "@tmx       = ", @tmx, "\n"
print "@dtmx      = ", @dtmx, "\n"
print "@tmxscale  = ", @tmxscale, "\n"
print "@cmxoffset = ", @cmxoffset, "\n"
print "x   = @tmxoffset + (@tmx + @dtmx) * @tmxscale = ", @tmxoffset + (@tmx + @dtmx) * @tmxscale, "\n"
print "@x  = x + @cmxoffset                          = ", @tmxoffset + (@tmx + @dtmx) * @tmxscale + @cmxoffset, "\n"
elsif orientation == 1
print "@tmxoffset = ", @tmxoffset, "\n"
print "@tmy       = ", @tmy, "\n"
print "@tmalpha   = ", @tmalpha, "\n"
print "x = @tmxoffset + @tmy * @tmalpha = ", @tmxoffset + @tmy * @tmalpha, "\n"
print "@tmyoffset = ", @tmyoffset, "\n"
print "@tmx       = ", @tmx, "\n"
print "@dtmx      = ", @dtmx, "\n"
print "@tmbeta    = ", @tmbeta, "\n"
print "y = @tmyoffset + (@tmx + @dtmx) * @tmbeta = ", @tmyoffset + (@tmx + @dtmx) * @tmbeta, "\n"
print "@cmxoffset = ", @cmxoffset, "\n"
print "@x = y + @cmxoffset = ", @tmyoffset + (@tmx + @dtmx) * @tmbeta + @cmxoffset, "\n"
end

Result

  • an old pdf
snippet_text = Zuzahlungsbefreite Arzneimittel nach § 31 Abs. 3 Satz 4 SGB V
orientation= 1
@tmxoffset = 59.64
@tmy       = 0
@tmalpha   = 14.0053
x = @tmxoffset + @tmy * @tmalpha = 59.64
@tmyoffset = 43.4305
@tmx       = 0
@dtmx      = 0
@tmbeta    = 13.9999
y = @tmyoffset + (@tmx + @dtmx) * @tmbeta = 43.4305
@cmxoffset = 0
@x = y + @cmxoffset = 43.4305

@x         = 43.4305
x = 43.4305     y = 59.64
==================================================
snippet_text = PZN
orientation= 1
@tmxoffset = 117.0
@tmy       = 0
@tmalpha   = 9.0035
x = @tmxoffset + @tmy * @tmalpha = 117.0
@tmyoffset = 176.6305
@tmx       = 0
@dtmx      = 0
@tmbeta    = 9.0
y = @tmyoffset + (@tmx + @dtmx) * @tmbeta = 176.6305
@cmxoffset = 0
@x = y + @cmxoffset = 176.6305

@x         = 176.6305
x = 176.6305    y = 117.0
==================================================
"getin scan_tree"
snippet_text = Arzneimit
orientation= 1
@tmxoffset = 117.0
@tmy       = 0.0
@tmalpha   = 9.0035
x = @tmxoffset + @tmy * @tmalpha = 117.0
@tmyoffset = 176.6305
@tmx       = -14.7942
@dtmx      = 0
@tmbeta    = 9.0
y = @tmyoffset + (@tmx + @dtmx) * @tmbeta = 43.4827
@cmxoffset = 0
@x = y + @cmxoffset = 43.4827

@x         = 43.4827
x = 43.4827     y = 117.0
==================================================
snippet_text = t
orientation= 1
@tmxoffset = 117.0
@tmy       = 0.0
@tmalpha   = 9.0035
x = @tmxoffset + @tmy * @tmalpha = 117.0
@tmyoffset = 176.6305
@tmx       = -14.7942
@dtmx      = 4.1563
@tmbeta    = 9.0
y = @tmyoffset + (@tmx + @dtmx) * @tmbeta = 80.8894
@cmxoffset = 0
@x = y + @cmxoffset = 80.8894

@x         = 80.8894
x = 80.8894     y = 117.0
==================================================
snippet_text = e
orientation= 1
@tmxoffset = 117.0
@tmy       = 0.0
@tmalpha   = 9.0035
x = @tmxoffset + @tmy * @tmalpha = 117.0
@tmyoffset = 176.6305
@tmx       = -14.7942
@dtmx      = 4.4364
@tmbeta    = 9.0
y = @tmyoffset + (@tmx + @dtmx) * @tmbeta = 83.4103
@cmxoffset = 0
@x = y + @cmxoffset = 83.4103

@x         = 83.4103
x = 83.4103     y = 117.0
==================================================
snippet_text = lname
orientation= 1
@tmxoffset = 117.0
@tmy       = 0.0
@tmalpha   = 9.0035
x = @tmxoffset + @tmy * @tmalpha = 117.0
@tmyoffset = 176.6305
@tmx       = -14.7942
@dtmx      = 4.9894
@tmbeta    = 9.0
y = @tmyoffset + (@tmx + @dtmx) * @tmbeta = 88.3873
@cmxoffset = 0
@x = y + @cmxoffset = 88.3873

@x         = 88.3873
x = 88.3873     y = 117.0
==================================================
....
  • the latest pdf
snippet_text = Zu
orientation= 0
@tmxoffset = 0.0
@tmx       = 43.44
@dtmx      = 0
@tmxscale  = 1.0
@cmxoffset = 0
x   = @tmxoffset + (@tmx + @dtmx) * @tmxscale = 43.44
@x  = x + @cmxoffset                          = 43.44

@x         = 43.44
x = 43.44       y = 535.68
==================================================
snippet_text = z
orientation= 0
@tmxoffset = 0.0
@tmx       = 43.44
@dtmx      = 0
@tmxscale  = 1.0
@cmxoffset = 0
x   = @tmxoffset + (@tmx + @dtmx) * @tmxscale = 43.44
@x  = x + @cmxoffset                          = 43.44

@x         = 43.44
x = 43.44       y = 535.68
==================================================
snippet_text = a
orientation= 0
@tmxoffset = 0.0
@tmx       = 43.44
@dtmx      = 0
@tmxscale  = 1.0
@cmxoffset = 0
x   = @tmxoffset + (@tmx + @dtmx) * @tmxscale = 43.44
@x  = x + @cmxoffset                          = 43.44

@x         = 43.44
x = 43.44       y = 535.68
==================================================
snippet_text = h
orientation= 0
@tmxoffset = 0.0
@tmx       = 43.44
@dtmx      = 0
@tmxscale  = 1.0
@cmxoffset = 0
x   = @tmxoffset + (@tmx + @dtmx) * @tmxscale = 43.44
@x  = x + @cmxoffset                          = 43.44

@x         = 43.44
x = 43.44       y = 535.68
==================================================
...
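
The traced values can be reproduced with the two x formulas from 'update!' written as plain functions. x_unrotated and x_rotated are names made up for this sketch; the inputs are copied from the results above.

```ruby
# x coordinate for orientation == 0
def x_unrotated(tmxoffset, tmx, dtmx, tmxscale, cmxoffset = 0)
  tmxoffset + (tmx + dtmx) * tmxscale + cmxoffset
end

# x coordinate for orientation == 1
def x_rotated(tmyoffset, tmx, dtmx, tmbeta, cmxoffset = 0)
  tmyoffset + (tmx + dtmx) * tmbeta + cmxoffset
end

# Old PDF, snippets "Arzneimit" and "t": @dtmx grows, so x moves right.
puts x_rotated(176.6305, -14.7942, 0, 9.0).round(4)        # 43.4827
puts x_rotated(176.6305, -14.7942, 4.1563, 9.0).round(4)   # 80.8894

# Latest PDF, snippets "Zu" and "z": @tmx and @dtmx are identical for both.
puts x_unrotated(0.0, 43.44, 0, 1.0).round(4)              # 43.44
```

In the old PDF the difference between snippets comes entirely from @dtmx, while in the latest PDF @tmx stays at 43.44 and @dtmx stays at 0 for every snippet.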

Next

  • I have to understand each variable, in particular how and where it is set
Page last modified on May 17, 2011, at 07:28 AM