target chapters
Only these chapters are extracted.
I wrote a converter script for patinfo.yaml (to CSV).
But I have an encoding problem with escaped backslashes.
exported patinfo.yaml
...
effects: !oddb.org,2003/ODDB::Text::Chapter
  heading: "Propri\xC3\xA9t\xC3\xA9s/Emploi th\xC3\xA9rapeutique"
  sections:
  - !oddb.org,2003/ODDB::Text::Section
    subheading: "Qu'est-ce que Multilind P\xC3\xA2te curative et quand est-elle utilis\xC3\xA9e?\n"
    paragraphs:
    - !oddb.org,2003/ODDB::Text::Paragraph
      formats:
      - !oddb.org,2003/ODDB::Text::Format
        values: []
...
patinfo.yaml is ASCII-encoded.
But ASCII has no codes for some characters (umlauts etc.).
$ file patinfo.yaml
patinfo.yaml: ASCII text, with very long lines
For example:
irb(main):029:0> str = " name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):030:0> str = " name: \"Multilind\xC2\xAE Heilpaste\"\n"
=> " name: "Multilind® Heilpaste"\n"
YAML.load_documents forcibly converts the documents to UTF-8.
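One possible reading of this behaviour, sketched under the assumption that the file is parsed by a libyaml-based loader such as Psych (the text above does not say which parser is in use): inside a YAML double-quoted scalar, \xC2 names the code point U+00C2, not the byte 0xC2, so every escaped byte comes back as a character of its own. Code points up to U+00FF map one-to-one onto ISO-8859-1 bytes, so the original bytes can be recovered by encoding to ISO-8859-1 and relabelling the result as UTF-8. The helper name repair_bytes is hypothetical.

```ruby
require 'yaml'

# Hypothetical helper: undo the "one character per escaped byte" reading.
# Strings that cannot be mapped back (characters above U+00FF, or bytes
# that do not form valid UTF-8) are returned unchanged.
def repair_bytes(text)
  candidate = text.encode('ISO-8859-1').force_encoding('UTF-8')
  candidate.valid_encoding? ? candidate : text
rescue EncodingError
  text
end

# The same kind of line as in patinfo.yaml: literal backslash, x, C, 2, ...
doc    = YAML.load("name: \"Multilind\\xC2\\xAE Heilpaste\"\n")
loaded = doc['name']
fixed  = repair_bytes(loaded)

p loaded.codepoints.include?(0xC2) # the escaped byte arrived as U+00C2
puts fixed
```

Pure ASCII values pass through this helper unchanged, so it could be applied to every string field of the loaded documents.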
I tried to replace the escaped backslash in various ways.
irb(main):072:0> " name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\x5C")
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):073:0> " name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\&")
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):074:0> " name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\\")
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
But in this case, \\ is a single (escaped) backslash character.
I could not replace it or cut it in half.
To get the real character, xC2 and xAE must each be escaped with a backslash.
irb(main):018:0> "\xC3\xAB"
=> "ë"
irb(main):019:0> "\xC2\xAB"
=> "«"
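The failed replacements can be reproduced in a small script. This is only an illustrative sketch (the variable names are not from the converter): the backslash that irb prints as \\ is one ordinary character, so replacing it with a backslash is a no-op, and the text still holds the four characters \, x, C, 2 instead of one byte.

```ruby
# Single quotes keep the backslashes literal, like a line read from patinfo.yaml.
escaped = 'Multilind\xC2\xAE Heilpaste'
# Double quotes turn \xC2\xAE into two raw bytes; label them as UTF-8 explicitly.
real = "Multilind\xC2\xAE Heilpaste".dup.force_encoding('UTF-8')

backslash = "\\"
puts backslash.length                 # 1
puts backslash.ord                    # 92

# Every backslash is swapped for a backslash, so nothing changes.
unchanged = escaped.gsub(/\\/, "\\")
puts unchanged == escaped             # true

# The escaped form spends 8 characters on what is 1 character (2 bytes) in UTF-8.
puts escaped.length
puts real.length
puts real.bytesize
```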
Next, I tried to get the codepoints of this "\".
But I could not come up with a good idea.
irb(main):075:0> "\\xC2\\xAE".codepoints.to_a
=> [92, 120, 67, 50, 92, 120, 65, 69]
irb(main):077:0> "\xC2\xAE".codepoints.to_a
=> [174]
Finally, I created a replacement character map table as a temporary solution.
For example:
ESCAPED_STR_CODE_MAP = {
  "\\x24" => "$",
  ...
  "\\xC2\\xBE" => "¾",
  "\\xC3\\x82" => "Â",
  "\\xC3\\x83" => "Ã",
  "\\xC3\\x84" => "Ä",
  "\\xC3\\x85" => "Å",
  "\\xC3\\x86" => "Æ",
  ...
}
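As a possible alternative to maintaining the table by hand, the escapes could be decoded generically. The following is a minimal sketch, not the converter's actual code: the method name unescape_hex_bytes is hypothetical, and it assumes every run of \xNN escapes in the text stands for the bytes of UTF-8 encoded characters. Each run is turned into raw bytes with Array#pack and relabelled as UTF-8; runs that do not form valid UTF-8 are left as they are.

```ruby
# Hypothetical helper: decode runs of literal "\xNN" escapes into UTF-8 text.
def unescape_hex_bytes(text)
  text.gsub(/(?:\\x[0-9A-Fa-f]{2})+/) do |run|
    # "\xC2\xAE" (8 characters) -> [0xC2, 0xAE]
    bytes   = run.scan(/\\x([0-9A-Fa-f]{2})/).flatten.map { |hex| hex.to_i(16) }
    decoded = bytes.pack('C*').force_encoding('UTF-8')
    # Keep the original escapes if the bytes are not valid UTF-8.
    decoded.valid_encoding? ? decoded : run
  end
end

# Single-quoted literals keep the backslashes, as in the exported file.
puts unescape_hex_bytes(' name: "Multilind\xC2\xAE Heilpaste"')
puts unescape_hex_bytes('heading: "Propri\xC3\xA9t\xC3\xA9s/Emploi th\xC3\xA9rapeutique"')
```

Single-byte entries such as "\\x24" are covered by the same pattern, since a lone ASCII byte is valid UTF-8 on its own.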
$ irb -E iso-8859-1
irb(main):001:0> Encoding.default_external
=> #<Encoding:ISO-8859-1>