
20120321-patinfo2csv


summary

Created the patinfo2csv script.
- Extracted the target chapters.
- Tried to solve an encoding problem.

index

create patinfo.yaml converter
- extract chapters
- encoding problem

create patinfo.yaml converter

extract chapters

target chapters

Anwendungsgebiete (:effects)
Kontraindikationen (:contra-indications)
Schwangerschaft & Stillzeit (:pregnancy)
Anwendungsempfehlung (:usage)
Zusammensetzung (:composition)

Only these chapters are extracted.
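
As a rough illustration of the extraction step, the sketch below picks only the target chapter keys listed above and writes them as one CSV row. It is not the actual patinfo2csv script: the helper name patinfo_to_row, the plain Hash input, and the sample texts are assumptions made for the example (the real data consists of ODDB::Text::Chapter objects).

```ruby
require "csv"

# Chapter keys from the list above, in CSV column order.
TARGET_CHAPTERS = %w[effects contra-indications pregnancy usage composition].freeze

# Hypothetical helper: `patinfo` is assumed to be a plain Hash of
# chapter key => chapter text. Missing chapters become empty CSV fields (nil).
def patinfo_to_row(patinfo)
  TARGET_CHAPTERS.map { |key| patinfo[key] }
end

# Placeholder data, not taken from a real patinfo document.
sample = {
  "effects"  => "text of the effects chapter",
  "usage"    => "text of the usage chapter",
  "packages" => "not a target chapter, so it is dropped",
}

csv_text = CSV.generate do |csv|
  csv << TARGET_CHAPTERS          # header row
  csv << patinfo_to_row(sample)   # one row per patinfo document
end
puts csv_text
```

Chapters that are not in TARGET_CHAPTERS never reach the CSV, and a missing target chapter simply leaves its column empty.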

encoding problem

I wrote a converter script for patinfo.yaml (to CSV).
But I ran into an encoding problem (escaped backslashes).

exported patinfo.yaml

...
  effects: !oddb.org,2003/ODDB::Text::Chapter 
      heading: "Propri\xC3\xA9t\xC3\xA9s/Emploi th\xC3\xA9rapeutique"
      sections: 
      - !oddb.org,2003/ODDB::Text::Section 
        subheading: "Qu'est-ce que Multilind P\xC3\xA2te curative et quand est-elle utilis\xC3\xA9e?\n"
        paragraphs: 
        - !oddb.org,2003/ODDB::Text::Paragraph 
          formats: 
          - !oddb.org,2003/ODDB::Text::Format 
            values: []
...

patinfo.yaml is ASCII-encoded.
But ASCII has no codes for some characters (umlauts etc.), so they appear as escaped byte sequences such as \xC3\xA9.

$ file patinfo.yaml
patinfo.yaml: ASCII text, with very long lines

e.g.

irb(main):029:0> str = "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
=> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
irb(main):030:0> str = "    name: \"Multilind\xC2\xAE Heilpaste\"\n"
=> "    name: \"Multilind® Heilpaste\"\n"

YAML.load_documents forcibly converts the documents to UTF-8.

I tried to replace the escaped backslash in various ways.

irb(main):072:0> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\x5C")
=> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
irb(main):073:0> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\&")
=> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
irb(main):074:0> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\\")
=> "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"

I also tried StringScanner and regular expressions.

But in this case, \\ is a single escaped backslash, i.e. one character.
I could not replace it or cut it in half.

What we need is xC2 and xAE each escaped with a backslash (\xC2\xAE).
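
The following small check may make this clearer. It is only a sketch using the Multilind bytes from above; the variable names are made up for illustration. It shows that "\\" is one character, and that replacing one backslash with one backslash cannot change the string.

```ruby
# "\\" in a double-quoted literal is ONE character: a backslash (code point 92).
escaped = "\\xC2\\xAE"                             # 8 ASCII characters: \ x C 2 \ x A E
raw     = "\xC2\xAE".dup.force_encoding("UTF-8")   # 2 bytes that form 1 character

puts "\\".length      # 1
puts escaped.length   # 8
puts raw.bytesize     # 2
puts raw.length       # 1

# Replacing each backslash with a backslash leaves the text as it was.
unchanged = escaped.gsub(/\\/, "\\")
puts unchanged == escaped   # true
```

So the escaped text contains the characters "\", "x", "C", "2", and no substitution of the backslash alone turns them into the bytes 0xC2 0xAE; the hex digits themselves have to be converted.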

irb(main):018:0> "\xC3\xAB"
=> "ë"
irb(main):019:0> "\xC2\xAB"
=> "«"

Next, I tried to get the codepoints of this "\".
But that did not give me a good idea either.

irb(main):075:0> "\\xC2\\xAE".codepoints.to_a
=> [92, 120, 67, 50, 92, 120, 65, 69]
irb(main):077:0> "\xC2\xAE".codepoints.to_a
=> [174]

Finally, I created a replacement character map table as a temporary solution.

e.g.

 ESCAPED_STR_CODE_MAP = {
    "\\x24"      => "$", 
...
    "\\xC2\\xBE" => "¾",
    "\\xC3\\x82" => "Â",
    "\\xC3\\x83" => "Ã",
    "\\xC3\\x84" => "Ä",
    "\\xC3\\x85" => "Å",
    "\\xC3\\x86" => "Æ",
...
 }
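
A generic alternative to a hand-written table might look like the sketch below. It is not what the script currently does. It assumes that every run of literal \xNN escapes stands for UTF-8 bytes, and the names ESCAPED_BYTES and unescape_hex_bytes are hypothetical.

```ruby
# One or more consecutive literal "\xNN" escapes (backslash, x, two hex digits).
ESCAPED_BYTES = /(?:\\x\h\h)+/

# Hypothetical helper: turns literal \xNN runs into the characters they encode,
# assuming the bytes are UTF-8.
def unescape_hex_bytes(text)
  text.gsub(ESCAPED_BYTES) do |run|
    bytes   = run.scan(/\\x(\h\h)/).flatten.map { |hex| hex.to_i(16) }
    decoded = bytes.pack("C*").force_encoding(Encoding::UTF_8)
    # Keep the original text if the bytes are not valid UTF-8.
    decoded.valid_encoding? ? decoded : run
  end
end

line = "    name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
puts unescape_hex_bytes(line)
```

Grouping consecutive escapes into one run matters because a multi-byte character such as \xC2\xAE is only valid UTF-8 when both bytes are decoded together.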

NOTE

Open irb with a specified encoding:

$ irb -E iso-8859-1
irb(main):001:0> Encoding.default_external
=> #<Encoding:ISO-8859-1>

ywesee Developer-Wiki
This wiki is intended for all ywesee developers

