
20110209-debug-import_gkv-rpdf2txt



  1. Debug import_gkv rpdf2txt de.oddb.org
  2. Study the encryption of pdf
  3. Update test-case test_util

Goal/Estimate
  • Debug import_gkv rpdf2txt / 70%
  • (All tests pass oddb.org / 90%)
Milestones
  1. Debug import_gkv de.oddb.org
    • Understand decryption process
    • Update decryption process
  2. Run import_gkv online 15:00
    • Commit again after the run online
  3. Update test_util
  4. Update test_view
Summary
Commits
ToDo Tomorrow
Keep in Mind
  1. On Ice

Debug import_gkv de.oddb.org

Confirm the error in local environment

Run

  • de.oddb.org/bin/oddbd
  • jobs/import_gkv

Result

Wed Feb  9 07:48:44 2011: de.oddb.org ODDB::Import::Gkv#import
Rpdf2txt::PdfEncrypt::DecryptionError
test-key did not match user-key ('"\003=3DK\222\263\\nC\177\\nw\001Z"' / '"\003=3DK\222\263\nC\177\nw\001Z"')
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:206:in `encryption_key'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:177:in `decrypt_key'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:162:in `decrypt'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:783:in `decode_raw_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:673:in `decoded_stream'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:538:in `text'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:537:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:537:in `text'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:42:in `extract_text'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:489:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:448:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:489:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:488:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/object.rb:473:in `each'
/usr/lib64/ruby/site_ruby/1.8/rpdf2txt/parser.rb:41:in `extract_text'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:100:in `import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:106:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:113:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:113:in `_reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:106:in `reported_import'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:58:in `import_gkv'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
/usr/lib64/ruby/1.8/open-uri.rb:32:in `open'
/home/masa/ywesee/de.oddb.org/lib/oddb/import/gkv.rb:77:in `download_latest'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/updater.rb:57:in `import_gkv'
jobs/import_gkv:16
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `call'
/home/masa/ywesee/de.oddb.org/lib/oddb/util/job.rb:16:in `run'
jobs/import_gkv:15
Imported     0 Zubef-Entries on 09.02.2011:
Visited      0 existing Zubef-Entries
Visited      0 existing Companies
Visited      0 existing Substances
Created      0 new Zubef-Entries
Created      0 new Products
Created      0 new Sequences
Created      0 new Companies
Created      0 new Substances
Assigned     0 Chemical Equivalences
Assigned     0 Companies
Created      0 Incomplete Packages:
Created      0 Product(s) without a name (missing product name):

Next

  • Trace back the error message

Memo

Confirmed

  • These keys are usually the same

Next

    def encryption_key
      input_string = PADDING.dup
      ## we don't support a user-password. if we did, it would have to replace
      #  the first [n..32] bytes of the padding string here.
      input_string << owner_key
      input_string << permission_flag
      input_string << file_id
      ## revision >= 4: add 0xffffffff if document metadata is not encrypted
      digest = Digest::MD5.digest(input_string)
      uk = user_key
      if revision >= 3
        50.times do digest = Digest::MD5.digest(digest[0,keylength]) end
        uk = uk[0,16]
      end
      encryption_key = digest[0,keylength]
      test_key = compute_user_key encryption_key
      if(test_key != uk)
        raise DecryptionError, "test-key did not match user-key ('#{test_key.inspect}' / '#{uk.inspect}')"
      end
      encryption_key
    end

Experiment

rpdf2txt/object.rb#encryption_key

    def encryption_key
      input_string = PADDING.dup
      ## we don't support a user-password. if we did, it would have to replace
      #  the first [n..32] bytes of the padding string here.
      input_string << owner_key
      input_string << permission_flag
      input_string << file_id
      ## revision >= 4: add 0xffffffff if document metadata is not encrypted
      digest = Digest::MD5.digest(input_string)
      uk = user_key
      if revision >= 3
        50.times do digest = Digest::MD5.digest(digest[0,keylength]) end
        uk = uk[0,16]
      end
      encryption_key = digest[0,keylength]
      test_key = compute_user_key encryption_key
 #      if(test_key != uk)
 #        raise DecryptionError, "test-key did not match user-key ('#{test_key.inspect}' / '#{uk.inspect}')"
 #      end
      encryption_key
    end

Delete downloaded files

 masa@masa ~/ywesee/de.oddb.org $ rm var/pdf/gkv/*.pdf

Reboot

  • de.oddb.org/bin/oddbd
  • jobs/import_gkv

Result

Wed Feb  9 09:24:37 2011: de.oddb.org ODDB::Import::Gkv#import
Imported  6301 Zubef-Entries on 09.02.2011:
Visited   6034 existing Zubef-Entries
Visited   4155 existing Companies
Visited   1003 existing Substances
Created    267 new Zubef-Entries
Created      1 new Products
Created      1 new Sequences
Created   2146 new Companies
Created   2300 new Substances
Assigned     2 Chemical Equivalences
Assigned    10 Companies
Created      1 Incomplete Packages:
http://de.oddb.org/de/drugs/package/pzn/7388792
Created      1 Product(s) without a name (missing product name):
http://de.oddb.org/de/drugs/product/uid/3480899

Note

  • Looks fine
  • If this is fine, it indicates that
    • 'encryption_key' is correct and 'test_key' is wrong
    • something is wrong in the 'user_key' checking process (probably in the 'pdf_escape' method)

Next

  • Update the user_key checking process (decryption process)

Study the encryption of pdf

Check decryption process of user_key

Reference (PDF Reference)

Keywords

  • 'user key'
  • 'encryption key'

Focus on

Notes

  • PDF's standard encryption methods use the MD5 message-digest algorithm
    and a proprietary encryption algorithm known as RC4.
  • RC4 is a symmetric stream cipher:
    the same algorithm is used for both encryption and decryption,
    and the algorithm does not change the length of the data.
  • Algorithm 3.1 Encryption of data using an encryption key:
    Decryption of strings (other than those in the encryption dictionary) is done
    after escape-sequence processing and hexadecimal decoding as appropriate to the
    string representation described in Section 3.2.3, String Objects.
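
Sketch (RC4 symmetry)

The two RC4 properties noted above (the same routine encrypts and decrypts, and the length of the data does not change) can be checked with a small stand-alone sketch. The method name rc4_sketch and the sample key and text are made up for illustration; this is not the arc4 method of rpdf2txt.

```ruby
# Minimal RC4 sketch: key scheduling followed by keystream XOR.
# rc4_sketch is a hypothetical name, not part of rpdf2txt.
def rc4_sketch(key, data)
  key_bytes = key.unpack('C*')
  s = (0..255).to_a
  j = 0
  # Key-scheduling algorithm: permute s according to the key
  256.times do |i|
    j = (j + s[i] + key_bytes[i % key_bytes.size]) % 256
    s[i], s[j] = s[j], s[i]
  end
  # Keystream generation: XOR one keystream byte with each input byte
  i = j = 0
  data.unpack('C*').map do |byte|
    i = (i + 1) % 256
    j = (j + s[i]) % 256
    s[i], s[j] = s[j], s[i]
    byte ^ s[(s[i] + s[j]) % 256]
  end.pack('C*')
end

plain     = "Hello, PDF"                    # sample text (assumption)
cipher    = rc4_sketch("secret", plain)     # sample key (assumption)
decrypted = rc4_sketch("secret", cipher)    # same routine, same key

same_length = cipher.size == plain.size
round_trip  = decrypted.unpack('C*') == plain.unpack('C*')
```

Because the keystream depends only on the key, applying the routine twice with the same key restores the input.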

Experiment (check crypt)

rpdf2txt/lib/rpdf2txt/object.rb#encryption_key

    def encryption_key
      input_string = PADDING.dup
      ## we don't support a user-password. if we did, it would have to replace
      #  the first [n..32] bytes of the padding string here.
      input_string << owner_key
      input_string << permission_flag
      input_string << file_id
      ## revision >= 4: add 0xffffffff if document metadata is not encrypted
      digest = Digest::MD5.digest(input_string)
print "user_key="
p user_key
      uk = user_key
      if revision >= 3
        50.times do digest = Digest::MD5.digest(digest[0,keylength]) end
        uk = uk[0,16]
      end
      encryption_key = digest[0,keylength]
      test_key = compute_user_key encryption_key
print "test_key="
p test_key
      if(test_key != uk)
        raise DecryptionError, "test-key did not match user-key ('#{test_key.inspect}' / '#{uk.inspect}')"
      end
      encryption_key
    end

rpdf2txt/lib/rpdf2txt/object.rb#compute_user_key

    def compute_user_key encryption_key
      if revision < 3
        pdf_escape arc4(encryption_key, PADDING)
      else
        crypt = Digest::MD5.digest PADDING + file_id
        20.times do |xor|
          key = encryption_key.unpack('C*').collect! do |byte|
            byte ^ xor
          end.pack('C*')
          crypt = arc4(key, crypt)
        end
print "crypt="
p crypt
        pdf_escape crypt
      end
    end

Delete downloaded files

 masa@masa ~/ywesee/de.oddb.org $ rm var/pdf/gkv/*.pdf

Reboot

  • de.oddb.org/bin/oddbd
  • jobs/import_gkv

Result

Note

  • Namely, 'pdf_escape' makes 'crypt' different
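
Sketch (why escaping breaks the comparison)

In the error message from the first run, the test key contains '\\n' where the user key contains '\n'. A minimal sketch of that effect, assuming an escaping step that rewrites a newline byte as a backslash followed by 'n'. The method escape_sketch and the sample bytes are hypothetical; the real 'pdf_escape' may handle more characters.

```ruby
# Hypothetical stand-in for an escaping step; NOT the rpdf2txt implementation.
def escape_sketch(str)
  # Replace each newline byte with the two characters backslash + 'n'
  str.gsub("\n") { "\\n" }
end

stored_key   = "\003K\nC\177\nw\001Z"   # illustrative bytes with two newlines
computed_key = stored_key.dup           # pretend the computation was correct

escaped = escape_sketch(computed_key)

keys_match_raw     = computed_key == stored_key
keys_match_escaped = escaped == stored_key
extra_bytes        = escaped.size - stored_key.size
```

Under this assumption, a correctly computed key still fails the check whenever it happens to contain a byte that the escaping step rewrites.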

Commit

Update git bare repository online

 ~/git/rpdf2txt $ git checkout -f

Run jobs/import_gkv on the online server

 $ su
 # cd */de.oddb.org/var/pdf/gkv
 # rm [latest pdf files]
 # exit
 $ cd */de.oddb.org
 $ screen -S masa
 $ sudo -u apache jobs/import_gkv
 $ (C+a, C+d) (detach)
 $ exit
 $ exit

Result

Wed Feb  9 13:47:22 2011: de.oddb.org ODDB::Import::Gkv#import
Imported  6301 Zubef-Entries on 09.02.2011:
Visited   6034 existing Zubef-Entries
Visited   4155 existing Companies
Visited   1003 existing Substances
Created    267 new Zubef-Entries
Created      1 new Products
Created      1 new Sequences
Created   2146 new Companies
Created   2300 new Substances
Assigned     2 Chemical Equivalences
Assigned    10 Companies
Created      1 Incomplete Packages:
http://de.oddb.org/de/drugs/package/pzn/7388792
Created      1 Product(s) without a name (missing product name):
http://de.oddb.org/de/drugs/product/uid/3480899

Commit

Update test-case test_util

First

  • resolve exporter.rb (sleep problem)

Problem

  • The redefinitions of the 'sleep' method and the 'Exporter' class in exporter.rb affect the other test scripts

Updated

  • exporter.rb

Result

masa@masa ~/ywesee/oddb.org/test/test_util $ ruby suite.rb
...
251 tests, 490 assertions, 0 failures, 0 errors

Note

  • The 'sleep' redefinition is replaced by a flexstub of Object (the Object#sleep method is stubbed)
  • The '@@today' class variable is replaced by a flexmock of Object (the Object#today method is stubbed, src/util/schedule.rb)
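
Sketch (stubbing sleep on one object only)

The idea behind the change, stubbing 'sleep' per object instead of redefining it globally, can be sketched in plain Ruby without flexmock. The class ExporterSketch and its run method are hypothetical, and define_singleton_method is used here as a stand-in for what flexstub does; this is not the code in exporter.rb.

```ruby
# Hypothetical class standing in for an exporter that pauses between steps.
class ExporterSketch
  def run
    sleep 60   # would block a test for a minute if not stubbed
    :done
  end
end

exporter = ExporterSketch.new
calls = []

# Override sleep on this one object only; Kernel#sleep itself is untouched,
# so other test scripts loaded into the same process are not affected.
exporter.define_singleton_method(:sleep) { |*args| calls << args; 0 }

result = exporter.run

# A second instance has no singleton override and would call the real sleep
untouched = ExporterSketch.new.singleton_methods.empty?
```

The stub lives on a single object, so it disappears with that object and cannot leak into other test files the way a top-level redefinition of 'sleep' does.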

Commit

Page last modified on February 09, 2011, at 05:05 PM