Target chapters
Only these chapters are extracted.
I wrote a converter script that turns patinfo.yaml into CSV.
But I ran into an encoding problem with escaped backslashes.
exported patinfo.yaml
...
effects: !oddb.org,2003/ODDB::Text::Chapter
  heading: "Propri\xC3\xA9t\xC3\xA9s/Emploi th\xC3\xA9rapeutique"
  sections:
    - !oddb.org,2003/ODDB::Text::Section
      subheading: "Qu'est-ce que Multilind P\xC3\xA2te curative et quand est-elle utilis\xC3\xA9e?\n"
      paragraphs:
        - !oddb.org,2003/ODDB::Text::Paragraph
          formats:
            - !oddb.org,2003/ODDB::Text::Format
              values: []
...
patinfo.yaml is ASCII-encoded.
But ASCII has no codes for some characters (umlauts etc.), so they are written as escaped byte sequences.
$ file patinfo.yaml
patinfo.yaml: ASCII text, with very long lines
For example:
irb(main):029:0> str = " name: \"Multilind\\xC2\\xAE Heilpaste\"\n"
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):030:0> str = " name: \"Multilind\xC2\xAE Heilpaste\"\n"
=> " name: "Multilind® Heilpaste"\n"
YAML.load_documents
forces the loaded documents to UTF-8.
I tried to replace the escaped backslash in various ways.
irb(main):072:0> " name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\x5C")
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):073:0> " name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\&")
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
irb(main):074:0> " name: \"Multilind\\xC2\\xAE Heilpaste\"\n".gsub(/\\/, "\\")
=> " name: "Multilind\\xC2\\xAE Heilpaste"\n"
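But in this case \\ is a single backslash character (the first backslash only escapes the second).
So there is no "half" of it that I could replace or cut away.
What we need is for the backslash to act as the escape for xC2 and xAE, so that \xC2 and \xAE are read as bytes.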
irb(main):018:0> "\xC3\xAB"
=> "ë"
irb(main):019:0> "\xC2\xAB"
=> "«"
Next, I looked at the codepoints of this "\" to get an idea.
But that did not lead to a good solution either.
irb(main):075:0> "\\xC2\\xAE".codepoints.to_a
=> [92, 120, 67, 50, 92, 120, 65, 69]
irb(main):077:0> "\xC2\xAE".codepoints.to_a
=> [174]
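The codepoints show that the escaped form is eight ordinary ASCII characters (backslash, x, two hex digits, twice), while the unescaped form is one character built from two bytes. One possible way to bridge the two, sketched here as an assumption and not as the converter's actual code, is to match each run of literal \xHH sequences, turn the hex digits into bytes, and tag the result as UTF-8. The method name unescape_hex_bytes is hypothetical.

```ruby
# Sketch only: decode literal "\xHH" sequences (backslash, "x", two hex
# digits) into the raw bytes they stand for. The method name is hypothetical.
def unescape_hex_bytes(text)
  # Match a whole run of escapes so that multi-byte characters stay together.
  text.gsub(/(?:\\x\h\h)+/) do |run|
    bytes = run.scan(/\\x(\h\h)/).flatten.map(&:hex)
    # pack("C*") builds a binary string; relabel it as UTF-8 afterwards.
    bytes.pack("C*").force_encoding("UTF-8")
  end
end

# Single quotes keep the backslashes literal, like a line read from the file.
line = ' name: "Multilind\xC2\xAE Heilpaste"'
puts unescape_hex_bytes(line)
```

With the sample line this prints the name with the registered sign (®) in place of the eight escaped characters. The sketch assumes the escaped bytes really are UTF-8; a run that is not valid UTF-8 would yield a string whose valid_encoding? is false.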
Finally, I created a replacement character map table as a temporary solution.
For example:
ESCAPED_STR_CODE_MAP = {
  "\\x24" => "$",
  ...
  "\\xC2\\xBE" => "¾",
  "\\xC3\\x82" => "Â",
  "\\xC3\\x83" => "Ã",
  "\\xC3\\x84" => "Ä",
  "\\xC3\\x85" => "Å",
  "\\xC3\\x86" => "Æ",
  ...
}
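How such a table might be applied is sketched below. This is an illustration under assumptions, not the converter's real code: SAMPLE_ESCAPE_MAP holds only three of the entries quoted above (the elided ones are left out), and the names SAMPLE_ESCAPE_MAP, ESCAPE_PATTERN and replace_escapes are hypothetical.

```ruby
# Sketch only: a small excerpt of a replacement table, using \u escapes so
# the source itself stays ASCII. All names here are hypothetical.
SAMPLE_ESCAPE_MAP = {
  "\\x24"      => "$",
  "\\xC2\\xBE" => "\u00BE", # vulgar fraction three quarters
  "\\xC3\\x84" => "\u00C4", # capital A with diaeresis
}.freeze

# Regexp.union escapes the keys, so each one matches literally.
# Longest keys first, so a two-byte sequence is never cut by a shorter key.
ESCAPE_PATTERN = Regexp.union(SAMPLE_ESCAPE_MAP.keys.sort_by { |key| -key.length })

def replace_escapes(text)
  # A single pass over the text; every match is looked up in the table.
  text.gsub(ESCAPE_PATTERN) { |match| SAMPLE_ESCAPE_MAP.fetch(match) }
end

puts replace_escapes('heading: "\xC3\x84rzte, \x24 5"')
```

Sequences that are missing from the table are left untouched, which is why a table like this only works as a stopgap: every escaped character that occurs in the data has to be listed.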
$ irb -E iso-8859-1
irb(main):001:0> Encoding.default_external
=> #<Encoding:ISO-8859-1>