
20110117-patch-spreadsheet-update-rpdf2txt



  1. Debug import pharma24 de.oddb.org suspend
  2. Patch spreadsheet
  3. Update rpdf2txt for Ruby1.9
  4. Check non-ascii characters

:Goal:
  • Update rpdf2txt for Ruby1.9 / 60%
Milestones
  1. Debug de.oddb.org ODDB::Import::Pharma24 suspend
  2. Patch spreadsheet 10:15
  3. Update rpdf2txt
    • Check non-ascii characters
    • Understand how it is used
Summary
Commits
ToDo Tomorrow
Keep in Mind
  1. Encoding woes (from Davatz-san)
  2. Feedback: This option indicates that the regular expression is parsed as 'UTF8' (from Davatz-san)
  3. pg on Ubuntu - see http://dev.ywesee.com/wiki.php/Gem/Pg (from Davatz-san)
  4. emerge portage warning It works today
  5. On Ice
  6. emerge --sync

Debug import pharma24 de.oddb.org

Email Sat Jan 15 02:03:17 2011: de.oddb.org ODDB::Import::Pharma24

Sat Jan 15 02:03:17 2011: de.oddb.org ODDB::Import::Pharma24#import
Mechanize::RedirectLimitReachedError
Maximum redirect limit (20) reached
/usr/lib64/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:605:in `fetch_page'
/usr/lib64/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:611:in `fetch_page'
/usr/lib64/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in `get'
/var/www/de.oddb.org/lib/oddb/import/pharma24.rb:132:in `search'
/var/www/de.oddb.org/lib/oddb/import/pharma24.rb:160:in `update_package'
/var/www/de.oddb.org/lib/oddb/import/pharma24.rb:20:in `import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:154:in `update_prices'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:113:in `call'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:113:in `_reported_import'
/var/www/de.oddb.org/lib/oddb/util/updater.rb:153:in `update_prices'
/var/www/de.oddb.org/jobs/import_pharma24:12
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/var/www/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
/var/www/de.oddb.org/jobs/import_pharma24:11
Checked     1 Packages
Updated     0 Packages
Created     0 Companies

suspend

  • This is low priority

Patch spreadsheet

We got a patch from Moo-san.

Read the fixed points

Commit

Update rpdf2txt for Ruby1.9

References

  1. Encoding woes (from Davatz-san)
  2. Feedback: This option indicates that the regular expression is parsed as 'UTF8' (from Davatz-san)

Check test-cases with Ruby1.8

masa@masa ~/ywesee/rpdf2txt $ ruby test/suite.rb 
Loaded suite test/suite
Started
......................'invalid literal/lengths set' when filtering with /FlateDecode
...................................................................unknown encoding 370 0 R
.............................................
Finished in 12.683721 seconds.

134 tests, 295 assertions, 0 failures, 0 errors

Note

  • They pass anyway, but there are some strange messages

Check test-cases with Ruby1.9

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 test/suite.rb 
test/suite.rb:26: warning: variable $KCODE is no longer effective; ignored
test/suite.rb:29:in `require': /home/masa/ywesee/rpdf2txt/test/test_pdf_object.rb:177: invalid multibyte char (US-ASCII) (SyntaxError)
/home/masa/ywesee/rpdf2txt/test/test_pdf_object.rb:174: Invalid char `\x0F' in expression
/home/masa/ywesee/rpdf2txt/test/test_pdf_object.rb:174: invalid multibyte char (US-ASCII)
/home/masa/ywesee/rpdf2txt/test/test_pdf_object.rb:174: syntax error, unexpected $end, expecting keyword_end
/Title (&#65533;&#65533;&#65533;)&#65533;&#65533;\\&#65533;&#65533;&#65533;#/&#65533;-&&#65533;&#65533;;S&#65533;&#65533;A)
           ^
        from test/suite.rb:29:in `block in <main>'
        from test/suite.rb:28:in `foreach'
        from test/suite.rb:28:in `<main>'

Note

  • This is definitely due to the character encoding of the source file

My guess

  • The binary data is directly embedded into the source code

test/test_pdf_object.rb#test_tree_node4

        def test_tree_node4
            src = '
400 0 obj
<< 
/Title (^O\)\\<9e>PT#/-&<9f>;S<93>OA)    #<= here
/Parent 399 0 R 
/A 436 0 R 
/Next 433 0 R 
>> 
endobj
            '
            node = Rpdf2txt::TreeNode.new(src)
            assert_equal(400, node.oid)
            assert_equal('433 0 R', node.attributes[:next])
        end

Brainstorming

  • Do the test-cases later
    • I should move the binary data out of the source code into an external file
  • First, check the non-ascii characters in the source code
  • Understand how they are used
  • Think about replacing them with something simpler
  • Check it with Ruby1.8 and then Ruby1.9
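
One way the first brainstorming item could look: a minimal sketch, assuming the fixture is kept in a separate file and read in binary mode. The helper names `read_fixture` and `write_sample_fixture`, the file name, and the shortened /Title bytes are made up for illustration; they are not part of rpdf2txt.

```ruby
require 'tmpdir'

# Hypothetical helper: return the fixture's raw bytes. Mode 'rb' keeps
# Ruby 1.9 from tagging the data with the default external encoding.
def read_fixture(path)
  File.open(path, 'rb') { |f| f.read }
end

# Hypothetical helper: create a stand-in fixture. The bytes are written as
# escapes so that this source file itself stays pure ASCII.
def write_sample_fixture(path)
  File.open(path, 'wb') do |f|
    f.write("400 0 obj\n<< /Title (\x0F\x9EPT\x9F) /Next 433 0 R >>\nendobj\n")
  end
end

# Write the fixture into a temporary directory and read it back.
def fixture_demo
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'tree_node4.txt') # hypothetical file name
    write_sample_fixture(path)
    read_fixture(path)
  end
end

src = fixture_demo
puts src[/\/Next (\d+ \d+ R)/n, 1]   # 433 0 R
```

With this layout the test source contains no raw bytes, so it should load under Ruby 1.8 and 1.9 alike; only the fixture file is binary.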

Check non-ascii characters

Experiment

masa@masa ~/ywesee/rpdf2txt $ ruby -I lib bin/rpdf2txt test/data/test.pdf 

untitled text                                                                        Page 1 of 1
Printed: Donnerstag, 14. November 2002 14:04:29 Uhr
testpdf
masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib bin/rpdf2txt test/data/test.pdf 
/home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/grammar.rb:1:in `require': /home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/token.rb:138: invalid multibyte char (US-ASCII) (SyntaxError)
/home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/token.rb:138: syntax error, unexpected '~', expecting ')'
    super("EOF", "&#65533;~~&#65533;&#65533;~^^~" + rand(1e10).inspect)
                    ^
/home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/token.rb:138: invalid multibyte char (US-ASCII)
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/grammar.rb:1:in `<top (required)>'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/lalr_parsetable_generator.rb:1:in `require'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/lalr_parsetable_generator.rb:1:in `<top (required)>'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/rockit.rb:2:in `require'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt-rockit/rockit.rb:2:in `<top (required)>'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/textparser.rb:25:in `require'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/textparser.rb:25:in `<top (required)>'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/text.rb:26:in `require'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/text.rb:26:in `<top (required)>'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:26:in `require'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/object.rb:26:in `<top (required)>'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:26:in `require'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:26:in `<top (required)>'
        from bin/rpdf2txt:25:in `require'
        from bin/rpdf2txt:25:in `<main>'

Experiment

lib/rpdf2txt-rockit/token.rb

class EofToken < Token
  def initialize(*args)
    # Shouldn't match anything but since I'm not sure how to do a regexp
    # with that chareacteristic we use a highly unlikely string in the mean 
    # time.
# super("EOF", "~~~^^~" + rand(1e10).inspect)  (delete)
  end

...
class EpsilonToken < Token
  def initialize
    # Shouldn't match anything but since I'm not sure how to do a regexp
    # with that chareacteristic we use a highly unlikely string in the mean 
    # time.
# super("epsilon", "~~~^^~" + rand(1e10).inspect) (delete)
  end
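
The comment in both classes asks for a regexp that matches nothing. A minimal sketch of one candidate, assuming an empty negative lookahead is acceptable to Rockit (this is not verified against the Rockit sources); `NEVER_MATCH` is a name invented here. Ruby's own `Regexp.union` with no arguments returns the same pattern.

```ruby
# Hypothetical constant: an empty negative lookahead fails at every
# position, so the pattern can never match, and it is pure ASCII.
NEVER_MATCH = /(?!)/

['', 'EOF', '~~~^^~5348086680'].each do |sample|
  # Prints nil for every sample, including the empty string.
  puts "#{sample.inspect}: #{NEVER_MATCH.match(sample).inspect}"
end
```

Unlike the random marker string, such a pattern needs no non-ascii characters in the source file.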

lib/rpdf2txt-rockit/rockit_grammars_parser.rb

require 'rpdf2txt-rockit/rockit'
module Parse
  # Parser for RockitGrammar
  # created by Rockit version 0.3.8 on Mon Dec 02 20:05:20 CET 2002
  # Rockit is copyright (c) 2001 Robert Feldt, feldt@ce.chalmers.se
  # and licensed under GPL
  # but this parser is under LGPL
  tokens = [
# t1 = EofToken.new("EOF",/^(~~~^^~5348086680)/n), (delete)
    t2 = Token.new("Blank",/^(\s+)/n,:Skip),

lib/rpdf2txt/object.rb

#require 'md5'
require 'digest/md5'

lib/rpdf2txt/parser.rb

#require 'md5'
require 'digest/md5'
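
For reference, a minimal sketch of the `digest/md5` API that takes the place of the old `md5` library. How rpdf2txt actually calls MD5 is not shown on this page, so the inputs below are made up.

```ruby
require 'digest/md5'

# One-shot digest of a string, in hex form.
puts Digest::MD5.hexdigest('')   # d41d8cd98f00b204e9800998ecf8427e

# Incremental form, for data that arrives in pieces.
md5 = Digest::MD5.new
md5.update('rpdf')
md5 << '2txt'
puts md5.hexdigest == Digest::MD5.hexdigest('rpdf2txt')   # true
```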

Result

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib bin/rpdf2txt test/data/test.pdf 
/home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:131:in `scan': invalid byte sequence in UTF-8 (ArgumentError)
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:131:in `build_object_catalogue'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:48:in `object_catalogue'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:163:in `page_tree_root'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:145:in `build_page_tree'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:51:in `page_tree'
        from /home/masa/ywesee/rpdf2txt/lib/rpdf2txt/parser.rb:41:in `extract_text'
        from bin/rpdf2txt:58:in `<main>'
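
The error above appears when a regexp is applied to a string that is tagged UTF-8 but holds bytes that are not valid UTF-8, which is normal for PDF data. A minimal sketch of the difference, assuming the data can simply be tagged as ASCII-8BIT before scanning; the sample bytes and the helper name `object_ids` are invented and not taken from parser.rb.

```ruby
# Hypothetical sample: a tiny PDF-like string containing non-UTF-8 bytes.
RAW = "%PDF-1.3\n%\xE2\xE3\xCF\xD3\n1 0 obj\nendobj\n12 0 obj\nendobj\n"

# Hypothetical helper: tag a copy as binary, then scan with a /n regexp.
def object_ids(data)
  binary = data.dup.force_encoding('ASCII-8BIT')
  binary.scan(/^(\d+) (\d+) obj/n).map { |oid, _gen| oid.to_i }
end

as_utf8 = RAW.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?   # false: this is the kind of string scan rejects
p object_ids(RAW)              # [1, 12]
```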

Investigate the differences in the String class between Ruby 1.8 and 1.9

References

Experiment

test1.rb

open( "test.dat", "wb" ) do |f|
  a = 999
  # pack("U") writes the code point 999 as UTF-8 bytes
  f.write( [a].pack("U") )
end

test2.rb

open( "test.dat", "rb" ) do |f|
  a = f.read
  p a.unpack("U")
end

Result

masa@masa ~/work $ ruby test1.rb 
masa@masa ~/work $ ruby test2.rb 
[999]

test3.rb

a = "ϧ"
p a.unpack("U")

How to make test3.rb

masa@masa ~/work $ mv test.dat test3.rb
masa@masa ~/work $ vim test3.rb 
a="
(ESC), $, a
"
p a.unpack("U")

Result (Ruby 1.8)

masa@masa ~/work $ ruby test3.rb 
[999]

Result (Ruby 1.9)

masa@masa ~/work $ ruby1.9 test3.rb 
test3.rb:1: invalid multibyte char (US-ASCII)
test3.rb:1: invalid multibyte char (US-ASCII)

Experiment

masa@masa ~/work $ cat test4.rb 
# encoding: utf-8
masa@masa ~/work $ cat test3.rb >> test4.rb 
masa@masa ~/work $ cat test4.rb 
# encoding: utf-8
a="ϧ"
p a.unpack("U")
masa@masa ~/work $ ruby1.9 test4.rb 
[999]

Notes

  • In Ruby 1.8, binary data (raw bytes) can be embedded directly in source code, and the String class treats it as plain bytes
  • Ruby 1.9 does not accept it as is: without a magic comment the source is read as US-ASCII, so the same bytes are rejected as an invalid multibyte char
  • I have to check how to handle binary data in Ruby 1.9 (a String in the default encoding may not be the right container for binary data)
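
The experiments so far can be condensed into one script. A minimal sketch, assuming Ruby 1.9 or later: the same two bytes count as one character when tagged UTF-8 and as two when tagged ASCII-8BIT, while unpack("U") decodes the bytes either way.

```ruby
text  = [999].pack('U')                         # U+03E7 encoded as UTF-8
bytes = text.dup.force_encoding('ASCII-8BIT')   # same bytes, tagged as binary

p text.bytes.to_a                      # [207, 167]
p text.length                          # 1
p bytes.length                         # 2
p bytes.unpack('U')                    # [999]
p text.bytes.to_a == bytes.bytes.to_a  # true
```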

Reference

Experiment

test4.rb

# encoding: ascii-8bit
a="ϧ"
p a.unpack("U")
masa@masa ~/work $ ruby1.9 test4.rb 

Result

masa@masa ~/work $ ruby1.9 test4.rb 
[999, 999]

Notes

  • It works
  • If we use binary data in Ruby 1.9, the encoding should be 'ascii-8bit'
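
A minimal sketch of what the magic comment changes, assuming Ruby 1.9 or later. It writes two throwaway files containing the same raw bytes in a string literal, one with `# encoding: ascii-8bit` and one with `# encoding: utf-8`, loads them, and compares the resulting strings. The helper `literal_under`, the file names, and the global `$literal` are invented for the demo.

```ruby
require 'tmpdir'

# Hypothetical helper: write a one-literal script with the given magic
# comment, load it, and return the string it assigned to $literal.
def literal_under(magic, dir)
  path = File.join(dir, "literal_#{magic.tr('-', '_')}.rb")
  File.open(path, 'wb') do |f|
    # The generated file contains the raw bytes CF A7 between the quotes.
    f.write("# encoding: #{magic}\n$literal = \"\xCF\xA7\"\n")
  end
  load path
  $literal
end

Dir.mktmpdir do |dir|
  binary = literal_under('ascii-8bit', dir)
  utf8   = literal_under('utf-8', dir)
  p binary.length                         # 2 (bytes)
  p utf8.length                           # 1 (character)
  p binary.unpack('U')                    # [999]
  p binary.bytes.to_a == utf8.bytes.to_a  # true
end
```

The bytes are identical in both cases; only the encoding attached to the literal differs.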

Experiment (add magic comments 'encoding: ascii-8bit' and force_encoding)

Result

masa@masa ~/ywesee/rpdf2txt $ ruby1.9 -I lib bin/rpdf2txt test/data/test.pdf 
masa@masa ~/ywesee/rpdf2txt $ 

Note

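  • It runs without errors, but the result is not the expected one: nothing is printed, whereas the Ruby 1.8 run printed the text of test.pdf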

Important

Page last modified on January 17, 2011, at 04:54 PM