
20110119-update-rpdf2txt


  1. Test-cases rpdf2txt

Goal
  • Update rpdf2txt for Ruby 1.9 / 90%
  • Set up Ramaze version oddb / 30%
Milestones
  • Test-cases rpdf2txt
  • Local test importer oddb.org
  • Commit rpdf2txt
  • Set up ramaze version oddb
Summary
Commits
ToDo Tomorrow
Keep in Mind
  1. Encoding woes (from Davatz-san)
  2. Feedback: This option indicates that the regular expression is parsed as 'UTF8' (from Davatz-san)
  3. pg on Ubuntu - see http://dev.ywesee.com/wiki.php/Gem/Pg (from Davatz-san)
  4. On Ice
  5. emerge --sync

Test-cases rpdf2txt

Check the current state

Ruby 1.8 with the latest libraries

masa@masa ~/ywesee/rpdf2txt $ ruby test/suite.rb 
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
...................................................................unknown encoding 370 0 R
.........#<Rpdf2txt::CMap:0x7f9d3ecb0b78 @target_encoding="utf8", @decoded_stream="", @decrypted_stream="", @src="<< >>", @raw_stream="", @map={}, @attributes={}>
.......................
Finished in 9.064189 seconds.

121 tests, 277 assertions, 0 failures, 0 errors

Ruby 1.8 with the updated libraries

masa@masa ~/ywesee/rpdf2txt $ ruby18 -I lib test/suite.rb 
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
...................................................................unknown encoding 370 0 R
.........E......................
Finished in 9.428983 seconds.

  1) Error:
test_join_snippets__hex_chars(TestParser):
NoMethodError: undefined method `[]' for nil:NilClass
    ./lib/rpdf2txt/object.rb:787:in `raw_stream'
    ./lib/rpdf2txt/object.rb:790:in `decode_raw_stream'
    ./lib/rpdf2txt/object.rb:682:in `decoded_stream'
    ./lib/rpdf2txt/object.rb:1050:in `extract_bfchar'
    ./lib/rpdf2txt/object.rb:1069:in `parse_cmap'
    ./lib/rpdf2txt/object.rb:1008:in `initialize'
    /home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:313:in `new'
    /home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:313:in `test_join_snippets__hex_chars'

121 tests, 277 assertions, 0 failures, 1 errors
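
The backtrace above ends in raw_stream with `[]' called on nil. One way this can happen, assuming the method indexes into the result of String#scan (as the commented-out variant later in this log suggests), is that scan returns an empty Array when nothing matches, so [0] is nil and the second [0] fails. The sketch below reproduces that with stand-in helpers; first_stream and first_stream_unguarded are hypothetical names, not rpdf2txt methods, and the pattern omits the original /n flag.

```ruby
# Stand-in pattern, modelled on the one used in raw_stream (without the /n flag).
STREAM_PATTERN = /stream[\r\n]{1,2}(.*)endstream/m

# Unguarded variant: [][0] is nil, and nil[0] raises NoMethodError.
def first_stream_unguarded(src)
  src.scan(STREAM_PATTERN)[0][0]
end

# Guarded variant: String#scan returns an Array, never nil,
# so checking for emptiness is enough.
def first_stream(src)
  matches = src.scan(STREAM_PATTERN)
  return '' if matches.empty?
  matches[0][0]
end

puts first_stream("<< >>").inspect
puts first_stream("<< >>\nstream\nabc\nendstream").inspect
```

An object whose source is only "<< >>" contains no stream section, so the guarded variant returns an empty string for it.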

Ruby 1.9

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/suite.rb 
test/suite.rb:26: warning: variable $KCODE is no longer effective; ignored
/home/masa/ywesee/rpdf2txt/test/test_pdf_object.rb:26: warning: variable $KCODE is no longer effective; ignored
/home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.........................................................
..........unknown encoding 370 0 R
.........E........F.............

  1) Error:
test_join_snippets__hex_chars(TestParser):
NoMethodError: undefined method `each' for nil:NilClass
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:107:in `extract_attributes'
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:88:in `parse_attributes'
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:44:in `initialize'
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:1007:in `initialize'
    /home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:313:in `new'
    /home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:313:in `test_join_snippets__hex_chars'

  2) Failure:
test_char_width(TestTextState) [/home/masa/ywesee/rpdf2txt/test/test_text_state.rb:303]:
<0.313> expected but was
<0.301>

diff:
? 0.3013

Finished in 7.396021768 seconds.

121 tests, 277 assertions, 1 failures, 1 errors, 0 pendings, 0 omissions, 0 notifications
98.3471% passed
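
The three $KCODE warnings at the top of this run come from assignments in the test files; Ruby 1.9 ignores the variable and declares the source encoding per file instead (for example with a "# encoding: utf-8" magic comment). A minimal sketch of a version guard that keeps the assignment for 1.8 only follows. kcode_effective? is a hypothetical helper, not part of rpdf2txt, and the assumption is that the warning is printed when the assignment is executed.

```ruby
# Returns true when the interpreter still honours $KCODE (Ruby 1.8 and older).
# Plain string comparison is enough for version strings such as
# "1.8.7", "1.9.2" or "3.2.0".
def kcode_effective?(version = RUBY_VERSION)
  version < '1.9'
end

# Only assign $KCODE where it has an effect; on 1.9 and later the
# assignment is skipped.
$KCODE = 'u' if kcode_effective?

puts kcode_effective?('1.8.7')
puts kcode_effective?('1.9.2')
```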

Focus on the test_join_snippets__hex_chars error

test/test_pdf_parser.rb#test_join_snippets__hex_chars

  1) Error:
test_join_snippets__hex_chars(TestParser):
NoMethodError: undefined method `each' for nil:NilClass
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:107:in `extract_attributes'
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:88:in `parse_attributes'
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:44:in `initialize'
    /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:1007:in `initialize'
    /home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:313:in `new'
    /home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:313:in `test_join_snippets__hex_chars'

Experiment

lib/rpdf2txt/object.rb#extract_attributes

        def extract_attributes(ast)
            if(ast.children_names.include?('value'))
                pdf_unescape(ast.value)
            elsif(ast.children_names.include?('text'))
                pdf_unescape(ast.text.value[1...-1])
            elsif(ast.children_names.include?('values'))
                ast.values.collect { |child| extract_attributes(child) }
            elsif(ast.children_names.include?('pairs'))
                result = {}
print "ast="
p ast
print "ast.pairs="
p ast.pairs
print "ast.pairs.class="
p ast.pairs.class
puts
                ast.pairs.each { |pair|
                    k, v = pair
                    keystr = k.value.strip.tr('/','')
                    unless(keystr.empty?)
                        result.store(keystr.downcase.intern, extract_attributes(v))
                    end
                }
                result
            else
                value = ast
            end
        end

Result

Ruby 1.8 with the latest libraries

masa@masa ~/ywesee/rpdf2txt $ ruby18 test/test_pdf_parser.rb 
...
ast=Hash:[_ArrayNode]
ast.pairs=_ArrayNode
ast.pairs.class=ArrayNode

Ruby 1.9 with the updated libraries

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
...
ast=Hash:[nil]
ast.pairs=nil
ast.pairs.class=NilClass

Note

  • I should check how 'ast' is created

Experiment

lib/rpdf2txt-rockit/glr_parser.rb#actor

  def actor(stack)
print "stack="
p stack
    #puts "actor(#{stack.state}) @stacks_to_act_on = #{@stacks_to_act_on.map{|s| s.state}.inspect}, @active_stacks = #{@active_stacks.map{|s| s.state}.inspect}"
    tokens = stack.lexer.peek
        #print "tokens = #{tokens.inspect}, "
print "tokens="
p tokens
    tokens.each do |token|
      #print "state = #{stack.state.inspect}, "
print "@parse_table="
p @parse_table
exit

Result

Ruby 1.8 with the latest libraries

Actions: 
0:         ,s13 ,s1 ,s9 ,s11 ,   ,   ,s5 ,s10 ,   ,s14 ,   ,s3 ,   ,   ,s2 ,|  ,8,6, , , ,7, ,12,4,
1:      r5 ,r5 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r5 ,   ,   ,   ,   ,|  , , , , , , , , , ,
2:      r28 ,r28 ,r28 ,r28 ,r28 ,   ,   ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,|  , , , , , , , , , ,
3:         ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s15 ,   ,   ,|  , , , , , , , , , ,
4:      r6 ,r6 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r6 ,   ,   ,   ,   ,|  , , , , , , , , , ,
5:      r9 ,r9 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r9 ,   ,   ,   ,   ,|  , , , , , , , , , ,
6:      r3 ,r3 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r3 ,   ,   ,   ,   ,|  , , , , , , , , , ,
7:      r2 ,r2 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r2 ,   ,   ,   ,   ,|  , , , , , , , , , ,
8:       a ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
9:      r4 ,r4 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r4 ,   ,   ,   ,   ,|  , , , , , , , , , ,
10:        ,s23 ,s16 ,s21 ,s22 ,   ,   ,s18 ,s10 ,s26 ,s14 ,   ,   ,   ,   ,s2 ,|  , ,20,27,25,24,19, , ,17,
11:     r8 ,r8 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r8 ,   ,   ,   ,   ,|  , , , , , , , , , ,
12:     r7 ,r7 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r7 ,   ,   ,   ,   ,|  , , , , , , , , , ,
13:     r1 ,r1 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r1 ,   ,   ,   ,   ,|  , , , , , , , , , ,
14:        ,s29 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s28 ,   ,   ,   ,   ,|  , , , , , , ,30, , ,
15:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s31 ,   ,|  , , , , , , , , , ,
16:        ,r17 ,r17 ,r17 ,r17 ,   ,   ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,|  , , , , , , , , , ,
17:        ,r22 ,r22 ,r22 ,r22 ,   ,   ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,|  , , , , , , , , , ,
18:        ,r21 ,r21 ,r21 ,r21 ,   ,   ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,|  , , , , , , , , , ,
19:        ,r16 ,r16 ,r16 ,r16 ,   ,   ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,|  , , , , , , , , , ,
20:        ,r15 ,r15 ,r15 ,r15 ,   ,   ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,|  , , , , , , , , , ,
21:        ,r19 ,r19 ,r19 ,r19 ,   ,   ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,|  , , , , , , , , , ,
22:        ,r20 ,r20 ,r20 ,r20 ,   ,   ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,|  , , , , , , , , , ,
23:        ,r18 ,r18 ,r18 ,r18 ,   ,   ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,|  , , , , , , , , , ,
24:        ,r14 ,r14 ,r14 ,r14 ,   ,   ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,|  , , , , , , , , , ,
25:        ,s23 ,s16 ,s21 ,s22 ,   ,   ,s18 ,s10 ,r12 ,s14 ,   ,   ,   ,   ,s2 ,|  , ,20, , ,32,19, , ,17,
26:     r11 ,r11 ,r11 ,r11 ,r11 ,   ,   ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,|  , , , , , , , , , ,
27:        ,   ,   ,   ,   ,   ,   ,   ,   ,s33 ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
28:     r24 ,r24 ,r24 ,r24 ,r24 ,   ,   ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,|  , , , , , , , , , ,
29:        ,s13 ,s1 ,s9 ,s11 ,   ,   ,s5 ,s10 ,   ,s14 ,   ,s3 ,   ,   ,s2 ,|  ,34,6, , , ,7, ,12,4,
30:        ,s36 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s35 ,   ,   ,   ,   ,|  , , , , , , , , , ,
31:     r27 ,r27 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r27 ,   ,   ,   ,   ,|  , , , , , , , , , ,
32:        ,r13 ,r13 ,r13 ,r13 ,   ,   ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,|  , , , , , , , , , ,
33:     r10 ,r10 ,r10 ,r10 ,r10 ,   ,   ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,|  , , , , , , , , , ,
34:        ,r26 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r26 ,   ,   ,   ,   ,|  , , , , , , , , , ,
35:     r23 ,r23 ,r23 ,r23 ,r23 ,   ,   ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,|  , , , , , , , , , ,
36:        ,s13 ,s1 ,s9 ,s11 ,   ,   ,s5 ,s10 ,   ,s14 ,   ,s3 ,   ,   ,s2 ,|  ,37,6, , , ,7, ,12,4,
37:        ,r25 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r25 ,   ,   ,   ,   ,|  , , , , , , , , , ,

Ruby 1.8 with the updated libraries

Actions: 
0:         ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,8,6, , , ,7, ,11,4,
1:      r8 ,r8 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r8 ,   ,   ,   ,   ,|  , , , , , , , , , ,
2:      r9 ,r9 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r9 ,   ,   ,   ,   ,|  , , , , , , , , , ,
3:         ,s17 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s16 ,   ,   ,   ,   ,|  , , , , , , ,15, , ,
4:      r6 ,r6 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r6 ,   ,   ,   ,   ,|  , , , , , , , , , ,
5:      r5 ,r5 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r5 ,   ,   ,   ,   ,|  , , , , , , , , , ,
6:      r3 ,r3 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r3 ,   ,   ,   ,   ,|  , , , , , , , , , ,
7:      r2 ,r2 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r2 ,   ,   ,   ,   ,|  , , , , , , , , , ,
8:       a ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
9:      r28 ,r28 ,r28 ,r28 ,r28 ,   ,   ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,|  , , , , , , , , , ,
10:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s18 ,   ,   ,|  , , , , , , , , , ,
11:     r7 ,r7 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r7 ,   ,   ,   ,   ,|  , , , , , , , , , ,
12:     r4 ,r4 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r4 ,   ,   ,   ,   ,|  , , , , , , , , , ,
13:        ,s30 ,s24 ,s28 ,s19 ,   ,   ,s21 ,s13 ,s23 ,s3 ,   ,   ,   ,   ,s9 ,|  , ,26,29,20,27,25, , ,22,
14:     r1 ,r1 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r1 ,   ,   ,   ,   ,|  , , , , , , , , , ,
15:        ,s32 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s31 ,   ,   ,   ,   ,|  , , , , , , , , , ,
16:     r24 ,r24 ,r24 ,r24 ,r24 ,   ,   ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,|  , , , , , , , , , ,
17:        ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,33,6, , , ,7, ,11,4,
18:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s34 ,   ,|  , , , , , , , , , ,
19:        ,r20 ,r20 ,r20 ,r20 ,   ,   ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,|  , , , , , , , , , ,
20:        ,s30 ,s24 ,s28 ,s19 ,   ,   ,s21 ,s13 ,r12 ,s3 ,   ,   ,   ,   ,s9 ,|  , ,26, , ,35,25, , ,22,
21:        ,r21 ,r21 ,r21 ,r21 ,   ,   ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,|  , , , , , , , , , ,
22:        ,r22 ,r22 ,r22 ,r22 ,   ,   ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,|  , , , , , , , , , ,
23:     r11 ,r11 ,r11 ,r11 ,r11 ,   ,   ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,|  , , , , , , , , , ,
24:        ,r17 ,r17 ,r17 ,r17 ,   ,   ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,|  , , , , , , , , , ,
25:        ,r16 ,r16 ,r16 ,r16 ,   ,   ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,|  , , , , , , , , , ,
26:        ,r15 ,r15 ,r15 ,r15 ,   ,   ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,|  , , , , , , , , , ,
27:        ,r14 ,r14 ,r14 ,r14 ,   ,   ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,|  , , , , , , , , , ,
28:        ,r19 ,r19 ,r19 ,r19 ,   ,   ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,|  , , , , , , , , , ,
29:        ,   ,   ,   ,   ,   ,   ,   ,   ,s36 ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
30:        ,r18 ,r18 ,r18 ,r18 ,   ,   ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,|  , , , , , , , , , ,
31:     r23 ,r23 ,r23 ,r23 ,r23 ,   ,   ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,|  , , , , , , , , , ,
32:        ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,37,6, , , ,7, ,11,4,
33:        ,r26 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r26 ,   ,   ,   ,   ,|  , , , , , , , , , ,
34:     r27 ,r27 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r27 ,   ,   ,   ,   ,|  , , , , , , , , , ,
35:        ,r13 ,r13 ,r13 ,r13 ,   ,   ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,|  , , , , , , , , , ,
36:     r10 ,r10 ,r10 ,r10 ,r10 ,   ,   ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,|  , , , , , , , , , ,
37:        ,r25 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r25 ,   ,   ,   ,   ,|  , , , , , , , , , ,

Ruby 1.9 with the updated libraries

Actions: 
0:         ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,8,6, , , ,7, ,11,4,
1:      r8 ,r8 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r8 ,   ,   ,   ,   ,|  , , , , , , , , , ,
2:      r9 ,r9 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r9 ,   ,   ,   ,   ,|  , , , , , , , , , ,
3:         ,s17 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s16 ,   ,   ,   ,   ,|  , , , , , , ,15, , ,
4:      r6 ,r6 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r6 ,   ,   ,   ,   ,|  , , , , , , , , , ,
5:      r5 ,r5 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r5 ,   ,   ,   ,   ,|  , , , , , , , , , ,
6:      r3 ,r3 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r3 ,   ,   ,   ,   ,|  , , , , , , , , , ,
7:      r2 ,r2 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r2 ,   ,   ,   ,   ,|  , , , , , , , , , ,
8:       a ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
9:      r28 ,r28 ,r28 ,r28 ,r28 ,   ,   ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,|  , , , , , , , , , ,
10:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s18 ,   ,   ,|  , , , , , , , , , ,
11:     r7 ,r7 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r7 ,   ,   ,   ,   ,|  , , , , , , , , , ,
12:     r4 ,r4 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r4 ,   ,   ,   ,   ,|  , , , , , , , , , ,
13:        ,s30 ,s24 ,s28 ,s19 ,   ,   ,s21 ,s13 ,s23 ,s3 ,   ,   ,   ,   ,s9 ,|  , ,26,29,20,27,25, , ,22,
14:     r1 ,r1 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r1 ,   ,   ,   ,   ,|  , , , , , , , , , ,
15:        ,s32 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s31 ,   ,   ,   ,   ,|  , , , , , , , , , ,
16:     r24 ,r24 ,r24 ,r24 ,r24 ,   ,   ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,|  , , , , , , , , , ,
17:        ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,33,6, , , ,7, ,11,4,
18:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s34 ,   ,|  , , , , , , , , , ,
19:        ,r20 ,r20 ,r20 ,r20 ,   ,   ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,|  , , , , , , , , , ,
20:        ,s30 ,s24 ,s28 ,s19 ,   ,   ,s21 ,s13 ,r12 ,s3 ,   ,   ,   ,   ,s9 ,|  , ,26, , ,35,25, , ,22,
21:        ,r21 ,r21 ,r21 ,r21 ,   ,   ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,|  , , , , , , , , , ,
22:        ,r22 ,r22 ,r22 ,r22 ,   ,   ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,|  , , , , , , , , , ,
23:     r11 ,r11 ,r11 ,r11 ,r11 ,   ,   ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,|  , , , , , , , , , ,
24:        ,r17 ,r17 ,r17 ,r17 ,   ,   ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,|  , , , , , , , , , ,
25:        ,r16 ,r16 ,r16 ,r16 ,   ,   ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,|  , , , , , , , , , ,
26:        ,r15 ,r15 ,r15 ,r15 ,   ,   ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,|  , , , , , , , , , ,
27:        ,r14 ,r14 ,r14 ,r14 ,   ,   ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,|  , , , , , , , , , ,
28:        ,r19 ,r19 ,r19 ,r19 ,   ,   ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,|  , , , , , , , , , ,
29:        ,   ,   ,   ,   ,   ,   ,   ,   ,s36 ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
30:        ,r18 ,r18 ,r18 ,r18 ,   ,   ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,|  , , , , , , , , , ,
31:     r23 ,r23 ,r23 ,r23 ,r23 ,   ,   ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,|  , , , , , , , , , ,
32:        ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,37,6, , , ,7, ,11,4,
33:        ,r26 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r26 ,   ,   ,   ,   ,|  , , , , , , , , , ,
34:     r27 ,r27 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r27 ,   ,   ,   ,   ,|  , , , , , , , , , ,
35:        ,r13 ,r13 ,r13 ,r13 ,   ,   ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,|  , , , , , , , , , ,
36:     r10 ,r10 ,r10 ,r10 ,r10 ,   ,   ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,|  , , , , , , , , , ,
37:        ,r25 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r25 ,   ,   ,   ,   ,|  , , , , , , , , , ,

Note

  • @parse_table differs between the latest and the updated libraries
  • With the updated libraries, Ruby 1.8 and Ruby 1.9 produce the same table
  • So the difference comes from the library update, not from the Ruby version
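
Comparing the dumps by eye is error-prone, so a small script can do it. The sketch below is an illustration only: differing_rows is a hypothetical helper, and the sample rows are shortened stand-ins rather than the full tables. It reports which rows differ between two dumps; a row-by-row comparison like this also flags states that were merely renumbered.

```ruby
# Returns the 0-based indices of the rows that differ between two table dumps.
# Trailing whitespace is ignored; a row missing on one side counts as different.
def differing_rows(dump_a, dump_b)
  rows_a = dump_a.lines.map { |line| line.rstrip }
  rows_b = dump_b.lines.map { |line| line.rstrip }
  count = [rows_a.size, rows_b.size].max
  (0...count).select { |i| rows_a[i] != rows_b[i] }
end

# Shortened sample rows (not the complete dumps from this page).
latest  = "0: ,s13 ,s1\n1: r5 ,r5\n8: a\n"
updated = "0: ,s14 ,s5\n1: r8 ,r8\n8: a\n"

puts differing_rows(latest, updated).inspect
puts differing_rows(updated, updated).inspect
```

Saving the output of "p @parse_table" from each interpreter to a file and passing the two file contents to the helper gives the same comparison for the real tables.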

Next

  • I have to check how @parse_table is built

Experiment

lib/rpdf2txt/data/pdfattributes.rb#_attr_parser

  def Rpdf2txt._attr_parser
print "@@parse_table70010113197280="
p @@parse_table70010113197280
exit

Result

Ruby 1.8 with the latest libraries

Actions: 
0:         ,s13 ,s1 ,s9 ,s11 ,   ,   ,s5 ,s10 ,   ,s14 ,   ,s3 ,   ,   ,s2 ,|  ,8,6, , , ,7, ,12,4,
1:      r5 ,r5 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r5 ,   ,   ,   ,   ,|  , , , , , , , , , ,
2:      r28 ,r28 ,r28 ,r28 ,r28 ,   ,   ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,|  , , , , , , , , , ,
3:         ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s15 ,   ,   ,|  , , , , , , , , , ,
4:      r6 ,r6 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r6 ,   ,   ,   ,   ,|  , , , , , , , , , ,
5:      r9 ,r9 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r9 ,   ,   ,   ,   ,|  , , , , , , , , , ,
6:      r3 ,r3 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r3 ,   ,   ,   ,   ,|  , , , , , , , , , ,
7:      r2 ,r2 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r2 ,   ,   ,   ,   ,|  , , , , , , , , , ,
8:       a ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
9:      r4 ,r4 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r4 ,   ,   ,   ,   ,|  , , , , , , , , , ,
10:        ,s23 ,s16 ,s21 ,s22 ,   ,   ,s18 ,s10 ,s26 ,s14 ,   ,   ,   ,   ,s2 ,|  , ,20,27,25,24,19, , ,17,
11:     r8 ,r8 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r8 ,   ,   ,   ,   ,|  , , , , , , , , , ,
12:     r7 ,r7 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r7 ,   ,   ,   ,   ,|  , , , , , , , , , ,
13:     r1 ,r1 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r1 ,   ,   ,   ,   ,|  , , , , , , , , , ,
14:        ,s29 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s28 ,   ,   ,   ,   ,|  , , , , , , ,30, , ,
15:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s31 ,   ,|  , , , , , , , , , ,
16:        ,r17 ,r17 ,r17 ,r17 ,   ,   ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,|  , , , , , , , , , ,
17:        ,r22 ,r22 ,r22 ,r22 ,   ,   ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,|  , , , , , , , , , ,
18:        ,r21 ,r21 ,r21 ,r21 ,   ,   ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,|  , , , , , , , , , ,
19:        ,r16 ,r16 ,r16 ,r16 ,   ,   ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,|  , , , , , , , , , ,
20:        ,r15 ,r15 ,r15 ,r15 ,   ,   ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,|  , , , , , , , , , ,
21:        ,r19 ,r19 ,r19 ,r19 ,   ,   ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,|  , , , , , , , , , ,
22:        ,r20 ,r20 ,r20 ,r20 ,   ,   ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,|  , , , , , , , , , ,
23:        ,r18 ,r18 ,r18 ,r18 ,   ,   ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,|  , , , , , , , , , ,
24:        ,r14 ,r14 ,r14 ,r14 ,   ,   ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,|  , , , , , , , , , ,
25:        ,s23 ,s16 ,s21 ,s22 ,   ,   ,s18 ,s10 ,r12 ,s14 ,   ,   ,   ,   ,s2 ,|  , ,20, , ,32,19, , ,17,
26:     r11 ,r11 ,r11 ,r11 ,r11 ,   ,   ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,|  , , , , , , , , , ,
27:        ,   ,   ,   ,   ,   ,   ,   ,   ,s33 ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
28:     r24 ,r24 ,r24 ,r24 ,r24 ,   ,   ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,|  , , , , , , , , , ,
29:        ,s13 ,s1 ,s9 ,s11 ,   ,   ,s5 ,s10 ,   ,s14 ,   ,s3 ,   ,   ,s2 ,|  ,34,6, , , ,7, ,12,4,
30:        ,s36 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s35 ,   ,   ,   ,   ,|  , , , , , , , , , ,
31:     r27 ,r27 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r27 ,   ,   ,   ,   ,|  , , , , , , , , , ,
32:        ,r13 ,r13 ,r13 ,r13 ,   ,   ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,|  , , , , , , , , , ,
33:     r10 ,r10 ,r10 ,r10 ,r10 ,   ,   ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,|  , , , , , , , , , ,
34:        ,r26 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r26 ,   ,   ,   ,   ,|  , , , , , , , , , ,
35:     r23 ,r23 ,r23 ,r23 ,r23 ,   ,   ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,|  , , , , , , , , , ,
36:        ,s13 ,s1 ,s9 ,s11 ,   ,   ,s5 ,s10 ,   ,s14 ,   ,s3 ,   ,   ,s2 ,|  ,37,6, , , ,7, ,12,4,
37:        ,r25 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r25 ,   ,   ,   ,   ,|  , , , , , , , , , ,

Ruby 1.9 with the updated libraries

Actions: 
0:         ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,8,6, , , ,7, ,11,4,
1:      r8 ,r8 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r8 ,   ,   ,   ,   ,|  , , , , , , , , , ,
2:      r9 ,r9 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r9 ,   ,   ,   ,   ,|  , , , , , , , , , ,
3:         ,s17 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s16 ,   ,   ,   ,   ,|  , , , , , , ,15, , ,
4:      r6 ,r6 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r6 ,   ,   ,   ,   ,|  , , , , , , , , , ,
5:      r5 ,r5 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r5 ,   ,   ,   ,   ,|  , , , , , , , , , ,
6:      r3 ,r3 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r3 ,   ,   ,   ,   ,|  , , , , , , , , , ,
7:      r2 ,r2 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r2 ,   ,   ,   ,   ,|  , , , , , , , , , ,
8:       a ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
9:      r28 ,r28 ,r28 ,r28 ,r28 ,   ,   ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,r28 ,|  , , , , , , , , , ,
10:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s18 ,   ,   ,|  , , , , , , , , , ,
11:     r7 ,r7 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r7 ,   ,   ,   ,   ,|  , , , , , , , , , ,
12:     r4 ,r4 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r4 ,   ,   ,   ,   ,|  , , , , , , , , , ,
13:        ,s30 ,s24 ,s28 ,s19 ,   ,   ,s21 ,s13 ,s23 ,s3 ,   ,   ,   ,   ,s9 ,|  , ,26,29,20,27,25, , ,22,
14:     r1 ,r1 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r1 ,   ,   ,   ,   ,|  , , , , , , , , , ,
15:        ,s32 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s31 ,   ,   ,   ,   ,|  , , , , , , , , , ,
16:     r24 ,r24 ,r24 ,r24 ,r24 ,   ,   ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,r24 ,|  , , , , , , , , , ,
17:        ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,33,6, , , ,7, ,11,4,
18:        ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,   ,s34 ,   ,|  , , , , , , , , , ,
19:        ,r20 ,r20 ,r20 ,r20 ,   ,   ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,r20 ,|  , , , , , , , , , ,
20:        ,s30 ,s24 ,s28 ,s19 ,   ,   ,s21 ,s13 ,r12 ,s3 ,   ,   ,   ,   ,s9 ,|  , ,26, , ,35,25, , ,22,
21:        ,r21 ,r21 ,r21 ,r21 ,   ,   ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,r21 ,|  , , , , , , , , , ,
22:        ,r22 ,r22 ,r22 ,r22 ,   ,   ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,r22 ,|  , , , , , , , , , ,
23:     r11 ,r11 ,r11 ,r11 ,r11 ,   ,   ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,r11 ,|  , , , , , , , , , ,
24:        ,r17 ,r17 ,r17 ,r17 ,   ,   ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,r17 ,|  , , , , , , , , , ,
25:        ,r16 ,r16 ,r16 ,r16 ,   ,   ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,r16 ,|  , , , , , , , , , ,
26:        ,r15 ,r15 ,r15 ,r15 ,   ,   ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,r15 ,|  , , , , , , , , , ,
27:        ,r14 ,r14 ,r14 ,r14 ,   ,   ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,r14 ,|  , , , , , , , , , ,
28:        ,r19 ,r19 ,r19 ,r19 ,   ,   ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,r19 ,|  , , , , , , , , , ,
29:        ,   ,   ,   ,   ,   ,   ,   ,   ,s36 ,   ,   ,   ,   ,   ,   ,|  , , , , , , , , , ,
30:        ,r18 ,r18 ,r18 ,r18 ,   ,   ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,r18 ,|  , , , , , , , , , ,
31:     r23 ,r23 ,r23 ,r23 ,r23 ,   ,   ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,r23 ,|  , , , , , , , , , ,
32:        ,s14 ,s5 ,s12 ,s1 ,   ,   ,s2 ,s13 ,   ,s3 ,   ,s10 ,   ,   ,s9 ,|  ,37,6, , , ,7, ,11,4,
33:        ,r26 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r26 ,   ,   ,   ,   ,|  , , , , , , , , , ,
34:     r27 ,r27 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r27 ,   ,   ,   ,   ,|  , , , , , , , , , ,
35:        ,r13 ,r13 ,r13 ,r13 ,   ,   ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,r13 ,|  , , , , , , , , , ,
36:     r10 ,r10 ,r10 ,r10 ,r10 ,   ,   ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,r10 ,|  , , , , , , , , , ,
37:        ,r25 ,   ,   ,   ,   ,   ,   ,   ,   ,   ,r25 ,   ,   ,   ,   ,|  , , , , , , , , , ,

Note

  • Next, I should check where @@parse_table70010113197280 is set

IMPORTANT

  • The Ruby script lib/rpdf2txt/data/pdfattributes.rb is generated!!
  • So the local copy differs from the original
  • The file is regenerated the first time rpdf2txt is run after downloading it
  • This makes debugging difficult

The latest libraries

  # Parser for PdfAttributes
  # created by Rockit version 0.3.8 on Fri Nov 26 11:17:14 +0100 2010
  # Rockit is copyright (c) 2001 Robert Feldt, feldt@ce.chalmers.se
  # and licensed under GPL
  # but this parser is under LGPL

The updated libraries

  # Parser for PdfAttributes
  # created by Rockit version 0.3.8 on Tue Jan 18 07:42:18 +0100 2011
  # Rockit is copyright (c) 2001 Robert Feldt, feldt@ce.chalmers.se
  # and licensed under GPL
  # but this parser is under LGPL
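
Since the two headers differ only in the generation timestamp, that line can be used to tell which copy of the generated file is loaded, and it can be dropped before diffing two copies. A minimal sketch, assuming the header always has the "created by Rockit version ... on ..." form shown above; generation_info and strip_generation_line are hypothetical helpers.

```ruby
# Matches the "created by" line of the generated parser header.
HEADER_PATTERN = /created by Rockit version (\S+) on (.+)$/

# Extracts the Rockit version and the generation timestamp, or nil if absent.
def generation_info(source)
  m = HEADER_PATTERN.match(source)
  return nil unless m
  { :version => m[1], :generated_at => m[2].strip }
end

# Removes the timestamp line so that two generated files can be compared
# without the header always showing up as a difference.
def strip_generation_line(source)
  source.lines.reject { |line| line =~ HEADER_PATTERN }.join
end

header = "# Parser for PdfAttributes\n" \
         "# created by Rockit version 0.3.8 on Tue Jan 18 07:42:18 +0100 2011\n"
puts generation_info(header).inspect
```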

Next

  • Reconsider the error in lib/rpdf2txt/object.rb

Check lib/rpdf2txt/object.rb again

lib/rpdf2txt/object.rb#extract_attributes

        def extract_attributes(ast)
            if(ast.children_names.include?('value'))
                pdf_unescape(ast.value)
            elsif(ast.children_names.include?('text'))
                pdf_unescape(ast.text.value[1...-1])
            elsif(ast.children_names.include?('values'))
                ast.values.collect { |child| extract_attributes(child) }
            elsif(ast.children_names.include?('pairs'))
                result = {}
print "ast="
p ast
print "ast.pairs="
p ast.pairs
print "ast.pairs.class="
p ast.pairs.class
puts
                ast.pairs.each { |pair|
                    k, v = pair
                    keystr = k.value.strip.tr('/','')
                    unless(keystr.empty?)
                        result.store(keystr.downcase.intern, extract_attributes(v))
                    end
                }
                result
            else
                value = ast
            end
        end

Result

The latest libraries

masa@masa ~/ywesee/rpdf2txt $ ruby18 -I ~/work/rpdf2txt/lib test/test_pdf_parser.rb 
...
ast=Hash:[_ArrayNode]
ast.pairs=_ArrayNode
ast.pairs.class=ArrayNode

The updated libraries

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
...
ast=Hash:[nil]
ast.pairs=nil
ast.pairs.class=NilClass

Note

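  • In other words, the difference is the content of the Hash node: ast.pairs is an ArrayNode with the latest libraries and nil with the updated libraries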

Experiment

lib/rpdf2txt/object.rb

        def extract_attributes(ast)
            if(ast.children_names.include?('value'))
                pdf_unescape(ast.value)
            elsif(ast.children_names.include?('text'))
                pdf_unescape(ast.text.value[1...-1])
            elsif(ast.children_names.include?('values'))
                ast.values.collect { |child| extract_attributes(child) }
            elsif(ast.children_names.include?('pairs'))
                result = {}
                # ast.pairs can be nil (seen for src "<< >>"), so guard the loop
                if(ast_pairs = ast.pairs)
                    # was: ast.pairs.each { |pair|
                    ast_pairs.each { |pair|
                        k, v = pair
                        keystr = k.value.strip.tr('/','')
                        unless(keystr.empty?)
                            result.store(keystr.downcase.intern, extract_attributes(v))
                        end
                    }
                end
                result
            else
                value = ast
            end
        end
...
        def raw_stream
            #@raw_stream ||= @src.scan(/stream[\r\n]{1,2}(.*)endstream/mn).to_s
            #@raw_stream ||= @src.scan(/stream[\r\n]{1,2}(.*)endstream/mn)[0][0]
            unless(@raw_stream)
              # String#scan returns an Array (never nil); it is empty when nothing matches
              src_scan = @src.scan(/stream[\r\n]{1,2}(.*)endstream/mn)
              if(src_scan.empty?)
                # Array#to_s gives "[]" on Ruby 1.9, so use an empty string directly
                @raw_stream = ''
              else
                @raw_stream = src_scan[0][0]
              end
            end
            return @raw_stream
        end

Result

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/test_pdf_parser
Started
.

Finished in 0.084548164 seconds.

1 tests, 0 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed

Note

  • I do not understand why 'ast.pairs' becomes nil
  • But it becomes nil only when src == "<< >>"
  • So I guess this change causes no problem
  • Let's see how rpdf2txt actually behaves on oddb
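
The nil guard can be checked in isolation without the Rockit parser. The sketch below uses stand-in structs (FakeDict and FakeName are hypothetical and are not the real AST classes) and a simplified, non-recursive version of the pairs branch to show that an empty dictionary yields an empty Hash instead of raising.

```ruby
# Stand-ins for a parsed dictionary node and a key node.
FakeDict = Struct.new(:pairs)
FakeName = Struct.new(:value)

# Simplified version of the 'pairs' branch: treat a nil pairs list as empty.
def attributes_from(node)
  result = {}
  (node.pairs || []).each do |key, value|
    name = key.value.strip.tr('/', '')
    result[name.downcase.to_sym] = value unless name.empty?
  end
  result
end

empty_dict = FakeDict.new(nil)   # what "<< >>" produced with the updated libraries
font_dict  = FakeDict.new([[FakeName.new('/Type '), 'Font']])

puts attributes_from(empty_dict).inspect
puts attributes_from(font_dict).inspect
```

Using "(node.pairs || [])" is just a more compact way to write the same guard as the "if(ast_pairs = ast.pairs)" form used above.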

Check all the tests (test_pdf_parser.rb)

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/test_pdf_parser
Started
....F........

  1) Failure:
test_join_snippets__hex_chars(TestParser) [test/test_pdf_parser.rb:334]:
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\xE4t f\xFCr  a1-, a2- und b-Adrenozeptoren sowie f\xFCr\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n"> expected but was
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit\xE4t f\xFCr a1-, a2- und b-Adrenozeptoren sowie f\xFCr\nDopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n">

diff:
  Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu
? trizyklischen Antidepressiva, eine geringe Affinit&#65533; f&#65533;  a1-, a2- und b-Adrenozeptoren sowie f&#65533;
  Dopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer

Finished in 0.421212833 seconds.

13 tests, 56 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
92.3077% passed
  • Fix: delete the extra space after 'f\xFCr' in the expected string of the test

Result

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/test_pdf_parser
Started
.............

Finished in 0.411339522 seconds.

13 tests, 56 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed

Next

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/test_pdf_parser
Started
........F..............

  1) Failure:
test_join_snippets6(TestParser) [test/test_pdf_parser.rb:481]:
<"In Studie 1 evaluierte man 271 Patienten mit einer m\xE4ssigen bis schweren aktiven rheumatoiden \nArthritis, die \xB318 Jahre alt waren, bei denen die Therapie mit mindestens einem, aber mit nicht mehr \n"> expected but was
<"In Studie 1 evaluierte man 271 Patienten mit einer m\xE4ssigen bis schweren aktiven rheumatoiden \nArthritis, die $18 Jahre alt waren, bei denen die Therapie mit mindestens einem, aber mit nicht mehr \n">

diff:
  In Studie 1 evaluierte man 271 Patienten mit einer m&#65533;sigen bis schweren aktiven rheumatoiden 
? Arthritis, die &#65533;18 Jahre alt waren, bei denen die Therapie mit mindestens einem, aber mit nicht mehr 
?                $                                                                                     

Finished in 2.576134358 seconds.

23 tests, 69 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
95.6522% passed

Note

  • This failure does not occur in Ruby 1.8
  • It is probably a character-encoding problem in Ruby 1.9

Suspended for now

First, focus on this failure

  1) Failure:
test_char_width(TestTextState) [test/test_text_state.rb:303]:
<0.313> expected but was
<0.301>

Experiment

lib/rpdf2txt/text_state.rb#char_width

        def char_width(char)
            if(char.is_a? String)
                char = char[0]
            end
            w = 0.0
            if(@font && (width = @font.width(char)))
                w = width
            elsif(@font && (avg = @font.attributes[:avgwidth]))
                w = avg
            end
print "w="
p w
print "char="
p char
      w = 300.0 if w == 0
            w += @char_spacing
            if(char==32)
                w += @word_spacing
            end
            w * @font_size / USER_SPACE
        end

Result

Ruby 1.8 with the latest libraries

w=278
char=32
w=278
char=32
w=278
char=32

Ruby 1.9 with the updated libraries

w=278
char=" "
w=278
char=" "
w=278
char=" "

Note

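  • In Ruby 1.9, 'String#[0]' returns a one-character String, not an ASCII code (Ruby 1.8 returns the code)
  • So in Ruby 1.9 the check 'char==32' is presumably never true and '@word_spacing' is not added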

Experiment

lib/rpdf2txt/text_state.rb#char_width

        def char_width(char)
            if(char.is_a? String)
                #char = char[0]
                char = char.unpack('C*')[0]
            end

Reference

  • 'String#bytes.to_a[0]' also returns the ASCII code, but 'String#bytes' is not available in Ruby 1.8.6 and earlier (it was added in 1.8.7)
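
As a hedged illustration of the point above, the following sketch compares ways of getting the byte value of the first character. The helper name first_byte is hypothetical and not part of rpdf2txt. The results in the comments are for Ruby 1.9 and later unless noted.

```ruby
# Hypothetical helper (not part of rpdf2txt): byte value of the first
# character, written without relying on what String#[] returns.
def first_byte(str)
  str.unpack('C*')[0]
end

space = " "
indexed  = space[0]           # " " on Ruby 1.9 and later (Ruby 1.8 returned 32)
unpacked = first_byte(space)  # 32 on both Ruby 1.8 and Ruby 1.9
is_space = (unpacked == 32)   # true, so a check such as char==32 works again
```

Because unpack('C*') behaves the same way on both versions, the same source can serve Ruby 1.8 and Ruby 1.9.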

Result

masa@masa ~/ywesee/rpdf2txt $ ruby18 -I lib test/test_text_state.rb 
Loaded suite test/test_text_state
Started
.
Finished in 0.113862 seconds.

1 tests, 3 assertions, 0 failures, 0 errors

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_text_state.rb 
Loaded suite test/test_text_state
Started
.

Finished in 0.077112895 seconds.

1 tests, 3 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed

Note

  • Good

The last failure

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/test_pdf_parser
Started
........F..............

  1) Failure:
test_join_snippets6(TestParser) [test/test_pdf_parser.rb:482]:
<"In Studie 1 evaluierte man 271 Patienten mit einer m\xE4ssigen bis schweren aktiven rheumatoiden \nArthritis, die \xB318 Jahre alt waren, bei denen die Therapie mit mindestens einem, aber mit nicht mehr \n"> expected but was
<"In Studie 1 evaluierte man 271 Patienten mit einer m\xE4ssigen bis schweren aktiven rheumatoiden \nArthritis, die $18 Jahre alt waren, bei denen die Therapie mit mindestens einem, aber mit nicht mehr \n">

diff:
  In Studie 1 evaluierte man 271 Patienten mit einer m&#65533;sigen bis schweren aktiven rheumatoiden 
? Arthritis, die &#65533;18 Jahre alt waren, bei denen die Therapie mit mindestens einem, aber mit nicht mehr 
?                $                                                                                     

Finished in 2.577773336 seconds.

23 tests, 69 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
95.6522% passed

Experiment

lib/rpdf2txt/text.rb#mapped_ascii

        def mapped_ascii(ascii)
            if(@current_font)
print "@current_font.cmap=", @current_font.cmap.inspect, "\n" if @current_font.cmap
print "cmap.map=", @current_font.cmap.map.inspect, "\n" if @current_font.cmap
print "ascii=", ascii, "\n" if @current_font.cmap
        if((cmap = @current_font.cmap) && (map = cmap.map) \
           && (unicode_bytes = map[ascii]) \
           && (ascii = SymbolMap::SYMBOL_ENTITIES[unicode_bytes]))
print "ascii.chr="
p ascii.chr

Result

Ruby 1.8

masa@masa ~/ywesee/rpdf2txt $ ruby18 -I lib test/test_pdf_parser.rb 
Loaded suite test/test_pdf_parser
Started
@current_font.cmap=#<Rpdf2txt::CMap:0x7fdfca1b7298 @map={36=>8805}, @target_encoding="utf8", @decrypted_stream="", @decoded_stream="", @attributes={}, @src="<< >>", @raw_stream="">
cmap.map={36=>8805}
ascii=36
ascii.chr="\263"

Ruby 1.9

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/test_pdf_parser
Started
@current_font.cmap=#<Rpdf2txt::CMap:0x00000000c6c780 @map={36=>8805}, @attributes={}, @src="<< >>", @target_encoding="utf8", @raw_stream="[]", @decrypted_stream="[]", @decoded_stream="[]">
cmap.map={36=>8805}
ascii=$

Note

  • The cause is the type of the argument: an ASCII code (Integer) in Ruby 1.8, but a one-character String in Ruby 1.9
  • I should check where mapped_ascii is called
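
A minimal sketch of why the lookup misses in Ruby 1.9, using the same mapping that appears in the debug output ({36=>8805}, where 8805 is U+2265, "greater-than or equal to"). The constant name CMAP_SAMPLE is an assumption made only for this example.

```ruby
# Same mapping as in the debug output: byte 36 ("$") maps to 8805 (U+2265).
# CMAP_SAMPLE is a hypothetical constant name used only in this sketch.
CMAP_SAMPLE = { 36 => 8805 }

char = "$"
by_char = CMAP_SAMPLE[char]                   # nil: the keys are Integers, not Strings
by_byte = CMAP_SAMPLE[char.unpack('C*')[0]]   # 8805: lookup by byte value
```

With a String key the map returns nil, so the "$" would pass through unchanged, which matches the "$18" seen in the failing output.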

Experiment

lib/rpdf2txt/text.rb#snip

        def snip(snippet)
            snippet_text = snippet[1..-2].gsub(/\\[nrt]/n, " ")
      snippet_text.gsub!(/\\([()])/n, '\1')
            snippet_text.gsub!(/./n) { |char|
        #self.mapped_ascii(char[0]) || char
        self.mapped_ascii(char.unpack('C*')[0]) || char

Result

Ruby 1.8

masa@masa ~/ywesee/rpdf2txt $ ruby18 -I lib test/test_pdf_parser.rb 
Loaded suite test/test_pdf_parser
Started
.
Finished in 0.286884 seconds.

1 tests, 1 assertions, 0 failures, 0 errors

Ruby 1.9

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/test_pdf_parser.rb 
test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/test_pdf_parser
Started
.

Finished in 0.198423306 seconds.

1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed

Note

  • Very good!!

Final check

Ruby 1.8 with the updated libraries

masa@masa ~/ywesee/rpdf2txt $ ruby18 -I lib test/suite.rb 
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
...................................................................unknown encoding 370 0 R
.............................................
Finished in 12.640062 seconds.

134 tests, 295 assertions, 0 failures, 0 errors

Ruby 1.9 with the updated libraries

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib test/suite.rb 
test/suite.rb:26: warning: variable $KCODE is no longer effective; ignored
/home/masa/ywesee/rpdf2txt/test/test_pdf_object.rb:26: warning: variable $KCODE is no longer effective; ignored
/home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:28: warning: variable $KCODE is no longer effective; ignored
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
.........................................................
..........unknown encoding 370 0 R
.............................................

Finished in 9.930899752 seconds.

134 tests, 295 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed

Note

  • Good

Ruby 1.8 with the latest libraries

masa@masa ~/ywesee/rpdf2txt $ ruby18 test/suite.rb 
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
...................................................................unknown encoding 370 0 R
....................F........................
Finished in 12.225387 seconds.

  1) Failure:
test_join_snippets__hex_chars(TestParser) [/home/masa/ywesee/rpdf2txt/test/test_pdf_parser.rb:335]:
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit&#65533; f&#65533; a1-, a2- und b-Adrenozeptoren sowie f&#65533;
Dopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n"> expected but was
<"Paroxetin besitzt eine selektive Wirkung; in-vitro Studien haben gezeigt, dass es, im Gegensatz zu\ntrizyklischen Antidepressiva, eine geringe Affinit&#65533; f&#65533;  a1-, a2- und b-Adrenozeptoren sowie f&#65533;
Dopamin (D2)-, 5-HT1-artige, 5-HT2 und Histamin (H1)-Rezeptoren aufweist. Das Fehlen einer\n">.

Note

  • This is probably also a character-encoding (ASCII code) problem

Experiment

lib/rpdf2txt/text.rb#snip

        def snip(snippet)
            snippet_text = snippet[1..-2].gsub(/\\[nrt]/n, " ")
print "snippet_text1="
p snippet_text
      snippet_text.gsub!(/\\([()])/n, '\1')
print "snippet_text2="
p snippet_text
            snippet_text.gsub!(/./n) { |char|
        #self.mapped_ascii(char[0]) || char
        self.mapped_ascii(char.unpack('C*')[0]) || char
            }
print "snippet_text3="
p snippet_text
            _snip(snippet_text)
        end

Result

Ruby 1.8 with the latest libraries

snippet_text1="trizyklischen Antidepressiva, eine geringe Affinit&#65533; f&#65533; &#65533;"
snippet_text2="trizyklischen Antidepressiva, eine geringe Affinit&#65533; f&#65533; &#65533;"
snippet_text3="trizyklischen Antidepressiva, eine geringe Affinit&#65533; f&#65533; "

Ruby 1.9 with the updated libraries

snippet_text1="trizyklischen Antidepressiva, eine geringe Affinit\xE4t f\xFCr "
snippet_text2="trizyklischen Antidepressiva, eine geringe Affinit\xE4t f\xFCr "
snippet_text3="trizyklischen Antidepressiva, eine geringe Affinit\xE4t f\xFCr "

Note

  • In Ruby 1.9, all three outputs are the same
  • In Ruby 1.8, outputs 2 and 3 differ
  • This means the following step changes the result
            snippet_text.gsub!(/./n) { |char|
        #self.mapped_ascii(char[0]) || char
        self.mapped_ascii(char.unpack('C*')[0]) || char
            }

Experiment

        def snip(snippet)
            snippet_text = snippet[1..-2].gsub(/\\[nrt]/n, " ")
print "snippet_text1="
p snippet_text
      snippet_text.gsub!(/\\([()])/n, '\1')
print "snippet_text2="
p snippet_text
print "snippet_text.scan(/./n)="
p snippet_text.scan(/./n)
            snippet_text.gsub!(/./n) { |char|
        #self.mapped_ascii(char[0]) || char
        self.mapped_ascii(char.unpack('C*')[0]) || char
            }
print "snippet_text3="
p snippet_text
            _snip(snippet_text)
        end

Result

Ruby 1.8 with the latest libraries

snippet_text2="trizyklischen Antidepressiva, eine geringe Affinit&#65533; f&#65533; "
snippet_text.scan(/./n)=["t", "r", "i", "z", "y", "k", "l", "i", "s", "c", "h", "e", "n", " ", "A", "n", "t", "i", "d", "e", "p", "r", "e", "s", "s", "i", "v", "a", ",", " ", "e", "i", "n", "e", " ", "g", "e", "r", "i", "n", "g", "e", " ", "A", "f", "f", "i", "n", "i", "t", "\344", "t", " ", "f", "\374", "r", " "]

Ruby 1.9 with the updated libraries

snippet_text2="trizyklischen Antidepressiva, eine geringe Affinit\xE4t f\xFCr "
snippet_text.scan(/./n)=["t", "r", "i", "z", "y", "k", "l", "i", "s", "c", "h", "e", "n", " ", "A", "n", "t", "i", "d", "e", "p", "r", "e", "s", "s", "i", "v", "a", ",", " ", "e", "i", "n", "e", " ", "g", "e", "r", "i", "n", "g", "e", " ", "A", "f", "f", "i", "n", "i", "t", "\xE4", "t", " ", "f", "\xFC", "r", " "]

Note

  • The regular expression '/./n' appears to work the same way in both Ruby 1.8 and 1.9

So the problem must be in the following step

  self.mapped_ascii(char.unpack('C*')[0]) || char
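
One way to narrow this step down, sketched here under assumptions, is to record for each byte what the block would return. The helper trace_mapping and the sample mapping are hypothetical and not part of rpdf2txt. The mapping argument only stands in for Text#mapped_ascii, whose real return values are not shown here.

```ruby
# Hypothetical debugging helper (not part of rpdf2txt). For every byte of
# text it returns [byte value, original character, replacement], where the
# replacement falls back to the original when the mapping has no entry,
# in the same way as "mapped_ascii(...) || char".
def trace_mapping(text, mapping)
  text.unpack('C*').map { |byte|
    original = [byte].pack('C')
    [byte, original, mapping[byte] || original]
  }
end

# Made-up mapping for illustration: byte 36 ("$") is replaced by ">=".
trace_mapping("a$ ", { 36 => ">=" }).each { |row| p row }
```

Rows whose second and third elements differ are the bytes that the block replaces, which should make a difference between the two library versions easier to spot.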
Page last modified on January 20, 2011, at 07:27 AM