
20120321-patinfo2csv



summary

  • Created the patinfo2csv script.
    • Extracted the target chapters.
    • Tried to solve an encoding problem.


create patinfo.yaml converter

extract chapters

target chapters

  • Anwendungsgebiete (:effects)
  • Kontraindikationen (:contra-indications)
  • Schwangerschaft & Stillzeit (:pregnancy)
  • Anwendungsempfehlung (:usage)
  • Zusammensetzung (:composition)

Only these chapters are extracted.
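
The selection step could be sketched as follows. This is only an illustration: the helper name `patinfo_to_csv_row` and the assumption that one patinfo entry is available as a Hash of chapter name => text are hypothetical, and the sample values are made up. The real ODDB objects are not shown on this page.

```ruby
require "csv"

# Target chapters in output order (names taken from the list above).
TARGET_CHAPTERS = %w[effects contra-indications pregnancy usage composition]

# Hypothetical helper: `patinfo` is assumed to be a Hash of
# chapter name => chapter text. Chapters not in the list are ignored,
# missing chapters become empty strings.
def patinfo_to_csv_row(patinfo)
  TARGET_CHAPTERS.map { |name| patinfo[name].to_s }
end

# Made-up sample entry, for illustration only.
sample = {
  "effects"     => "Wound care",
  "composition" => "Zinc oxide",
  "packages"    => "ignored",
}

row  = patinfo_to_csv_row(sample)
line = CSV.generate_line(row)
puts line
```

Keeping the chapter list in one constant fixes the CSV column order in a single place.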

encoding problem

I wrote a converter script for patinfo.yaml (to CSV).
But I ran into an encoding problem (escaped backslashes).

exported patinfo.yaml

...
  effects: !oddb.org,2003/ODDB::Text::Chapter 
      heading: "Propri\xC3\xA9t\xC3\xA9s/Emploi th\xC3\xA9rapeutique"
      sections: 
      - !oddb.org,2003/ODDB::Text::Section 
        subheading: "Qu'est-ce que Multilind P\xC3\xA2te curative et quand est-elle utilis\xC3\xA9e?\n"
        paragraphs: 
        - !oddb.org,2003/ODDB::Text::Paragraph 
          formats: 
          - !oddb.org,2003/ODDB::Text::Format 
            values: []
...

patinfo.yaml is ASCII-encoded.
But ASCII has no codes for some characters (umlauts etc.), so they appear as escaped byte sequences such as \xC3\xA9.

$ file patinfo.yaml
patinfo.yaml: ASCII text, with very long lines

e.g.

irb(main):029:0> str = "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
=> "    name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):030:0> str = "    name: \"Multilind\xC2\xAE Heilpaste\"\n"
=> "    name: "Multilind® Heilpaste"\n"

YAML.load_documents forcibly converts documents to UTF-8.

I tried to replace the escaped backslash in various ways.

irb(main):072:0> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\x5C")
=> "    name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):073:0> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\&")
=> "    name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):074:0> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\\")
=> "    name: "Multilind\\xC2\\xAE Heilpaste"\n"
  • StringScanner
  • Regular expression

But in this case, \\ is a single (escaped) backslash character.
I could not replace it or cut it in half.
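
A small check of why the gsub attempts above change nothing, as far as I understand it: in Ruby source, "\\" is one backslash character, so the escaped text holds literal backslash + "x" + hex digits rather than the bytes themselves. The variable names are only for illustration.

```ruby
escaped = "\\xC2\\xAE"   # what the exported YAML text contains
real    = "\xC2\xAE"     # the two bytes we actually want

puts escaped.length      # 8 characters: \ x C 2 \ x A E
puts real.bytesize       # 2 bytes

# /\\/ matches one backslash and "\\" is one backslash,
# so this gsub replaces each backslash with itself.
same = escaped.gsub(/\\/, "\\")
puts same == escaped
```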

We need xC2 and xAE to be interpreted as backslash escapes (\xC2, \xAE), i.e. as bytes.

irb(main):018:0> "\xC3\xAB"
=> "ë"
irb(main):019:0> "\xC2\xAB"
=> "«"

Next, I tried to get the codepoints of this "\".
But I could not come up with a good idea.

irb(main):075:0> "\\xC2\\xAE".codepoints.to_a
=> [92, 120, 67, 50, 92, 120, 65, 69]
irb(main):077:0> "\xC2\xAE".codepoints.to_a
=> [174]
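
One possible generic approach, sketched here as an alternative and not what the script currently does: match each run of literal \xHH sequences, turn the hex digits into bytes with Array#pack, and label the result as UTF-8. The helper name `unescape_hex` is hypothetical. It assumes the escaped bytes form valid UTF-8 and that the text contains no genuine backslash followed by "x".

```ruby
# Hypothetical helper: turn runs of literal "\xHH" into real bytes
# and tag the result as UTF-8.
def unescape_hex(str)
  str.gsub(/(?:\\x[0-9A-Fa-f]{2})+/) do |run|
    # Collect the hex digit pairs of this run and convert them to integers.
    bytes = run.scan(/\\x([0-9A-Fa-f]{2})/).flatten.map { |h| h.hex }
    # Pack the integers into a byte string, then label it as UTF-8.
    bytes.pack("C*").force_encoding("UTF-8")
  end
end

line    = "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
decoded = unescape_hex(line)
puts decoded
puts decoded.valid_encoding?
```

Whole runs are handled at once so that a multi-byte character such as \xC2\xAE is decoded as one unit.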

Finally, I created a replacement character map table as a temporary solution.

e.g.

 ESCAPED_STR_CODE_MAP = {
    "\\x24"      => "$", 
...
    "\\xC2\\xBE" => "¾",
    "\\xC3\\x82" => "Â",
    "\\xC3\\x83" => "Ã",
    "\\xC3\\x84" => "Ä",
    "\\xC3\\x85" => "Å",
    "\\xC3\\x86" => "Æ",
...
 }
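
A minimal sketch of how such a map table could be applied with gsub. Only two entries of the table are repeated here, and the names `PATTERN` and `replace_escaped` are hypothetical. The block form of gsub is used so that the replacement text is never interpreted as a backslash pattern.

```ruby
# encoding: utf-8

# Excerpt of the map above; the real table has many more entries.
ESCAPED_STR_CODE_MAP = {
  "\\x24"      => "$",
  "\\xC3\\x84" => "Ä",
}

# Try longer keys first so a two-byte sequence wins over a one-byte key.
# Regexp.union escapes the backslashes in the keys.
PATTERN = Regexp.union(ESCAPED_STR_CODE_MAP.keys.sort_by { |k| -k.length })

# Hypothetical helper: replace every known escaped sequence via the map.
def replace_escaped(str)
  str.gsub(PATTERN) { |m| ESCAPED_STR_CODE_MAP[m] }
end

puts replace_escaped("\\xC3\\x84rzte, 5 \\x24")
```

Sequences that are missing from the table are left untouched, which makes gaps in the map easy to spot in the output.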

NOTE

  • Open irb with a specified encoding:

$ irb -E iso-8859-1
irb(main):001:0> Encoding.default_external
=> #<Encoding:ISO-8859-1> 

Page last modified on March 22, 2012, at 08:07 AM