How to convert escaped sequences to literal characters - python

I have to deal with RTF files where Cyrillic characters are converted to escaped sequences:
{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}
I want to convert the Cyrillic symbols but leave the RTF tags unchanged. Is there a Pythonic way to do it without third-party apps (like OpenOffice)?

We can first make a list of the hex codes using a regex, then create a bytes object with these values, which we can decode. It appears that your data was encoded using "cp1251".
data = r"pg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
hex_codes = re.findall(r"(?<=')[0-9A-F]{2}", data)
encoded = bytes(int(hcode, 16) for hcode in hex_codes)
# or, as rightly suggested by #Henry Tjhia:
# encoded = bytes.fromhex(''.join(hex_codes))
text = encoded.decode('cp1251')
print(text)
# Список документов

Although @Thierry Lathuille's answer didn't solve the initial problem (I need the RTF tags unchanged), it solved the most difficult part. So here is the solution to the initial problem:
string = "{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
hex_codes = re.findall("(?<=')[0-9A-F]{2}", string)
d = {"\\\'" + code: bytes.fromhex(code).decode("cp1251") for code in hex_codes}
for byte, char in d.items():
string = string.replace(byte, char)
print(string)
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}
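A single-pass alternative (a sketch on my part, assuming the same cp1251 encoding) is to let re.sub decode each escape via a callback, which avoids building the intermediate dict. Decoding byte by byte is safe here because cp1251 is a single-byte encoding:
import re

string = r"{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
# Replace every \'HH escape with the cp1251 character for that byte.
decoded = re.sub(r"\\'([0-9A-Fa-f]{2})",
                 lambda m: bytes.fromhex(m.group(1)).decode("cp1251"),
                 string)
print(decoded)
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}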

Related

How to deal with variations in apostrophe ’ and ' [duplicate]

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy&paste it into my site's editor (a simple textarea, no WYSIWYG etc.).
Those texts usually contain "nice" quotes instead of the plain ASCII ones ("). They also sometimes contain longer dashes like – instead of -.
Now I want to replace all those characters with their ASCII counterparts. However, I do not want to remove umlauts and other non-ASCII characters. I'd also highly prefer a proper solution that does not involve creating a mapping dict for all those characters.
All my strings are unicode objects.
What about this?
It creates the translation table first, but honestly I don't think you can do this without one.
transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--")])
with open("a.txt", "w", encoding="utf-8") as f_out:
    a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes” "
    print(" a_str = " + a_str, file=f_out)
    fixed_str = a_str.translate(transl_table)
    print(" fixed_str = " + fixed_str, file=f_out)
I wasn't able to get this to print to the console (on Windows), so I had to write to a text file.
The output in the a.txt file looks as follows:
a_str =  ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
fixed_str =  'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"
By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the differences in handling Unicode strings between the two versions of the language.
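As an aside, in Python 3 the same table can be built more idiomatically with str.maketrans; a minimal equivalent sketch (my addition, not from the answer above):
# str.maketrans maps each character of the first string to the character
# at the same position in the second.
transl_table = str.maketrans(u"‘’´“”–-", u"'''\"\"--")
print(u"‘nice’ – “quotes”".translate(transl_table))
# 'nice' - "quotes"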
There is no such "proper" solution, because for any given Unicode character there is no "ASCII counterpart" defined.
For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to the names:
#!/usr/bin/env python3
import unicodedata

def unicode_character_name(char):
    try:
        return unicodedata.name(char)
    except ValueError:
        return None

# Generate all Unicode characters with their names
all_unicode_characters = []
for n in range(0x110000):  # Unicode planes 0-16
    char = chr(n)  # Python 3
    # char = unichr(n)  # Python 2
    name = unicode_character_name(char)
    if name:
        all_unicode_characters.append((char, name))
# Find all Unicode quotation marks
print(' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
# " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 " 🙶 🙷 🙸
# Find all Unicode hyphens
print(' '.join([char for char, name in all_unicode_characters if 'HYPHEN' in name]))
# - ­ ֊ ᐀ ᠆ ‐ ‑ ‧ ⁃ ⸗ ⸚ ⹀ ゠ ﹣ - 󠀭
# Find all Unicode dashes
print(' '.join([char for char, name in all_unicode_characters if 'DASH' in name and 'DASHED' not in name]))
# ‒ – — ⁓ ⊝ ⑈ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ╌ ╍ ╎ ╏ ⤌ ⤍ ⤎ ⤏ ⤐ ⥪ ⥫ ⥬ ⥭ ⩜ ⩝ ⫘ ⫦ ⬷ ⸺ ⸻ ⹃ 〜 〰 ︱ ︲ ﹘ 💨
As you can see, as easy as this example is, there are many problems. There are many quotation marks in Unicode that don't look anything like the quotation marks in US-ASCII and there are many hyphens in Unicode that don't look anything like the hyphen-minus sign in US-ASCII.
And there are many questions. For example:
should the "SWUNG DASH" (⁓) symbol be replaced with an ASCII hyphen (-) or a tilde (~)?
should the "CANADIAN SYLLABICS HYPHEN" (᐀) be replaced with an ASCII hyphen (-) or an equals sign (=)?
should the "SINGLE LEFT-POINTING ANGLE QUOTATION MARK" (‹) be replaces with an ASCII quotation mark ("), an apostrophe (') or a less-than sign (<)?
To establish a "correct" ASCII counterpart, somebody needs to answer these questions based on the use context. That's why all the solutions to your problem are based on a mapping dictionary in one way or another. And all these solutions will provide different results.
You can build on top of the unidecode package.
This is pretty slow, since we first normalize all the Unicode to the combined (NFC) form, then see what unidecode turns each character into. If the result matches a Latin letter, we keep the original NFC character; otherwise we yield whatever degarbling unidecode has suggested. This leaves accented letters alone but will convert everything else.
import unidecode
import unicodedata
import re

def char_filter(string):
    latin = re.compile('[a-zA-Z]+')
    for char in unicodedata.normalize('NFC', string):
        decoded = unidecode.unidecode(char)
        if latin.match(decoded):
            yield char
        else:
            yield decoded

def clean_string(string):
    return "".join(char_filter(string))

print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). However, read the documentation related to Unicode: there the translation table has another form, mapping a Unicode ordinal number to a Unicode string (usually a single character) or None.
Well, it does require the dict. You have to capture the replacements somewhere anyway; how would you do that without any table or arrays? You could use str.replace() for the single characters, but this would be inefficient.
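A minimal sketch of that ordinal-to-string mapping (my example, not from the answer above):
# Keys are Unicode ordinals, values are replacement strings (or None to delete).
table = {
    ord(u'‘'): u"'",
    ord(u'’'): u"'",
    ord(u'“'): u'"',
    ord(u'”'): u'"',
    ord(u'–'): u'-',
}
print(u'“Don’t” – really'.translate(table))
# "Don't" - really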
This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html
-S, --smart
    Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.” (Note: This option is significant only when the input format is markdown or textile. It is selected automatically when the input format is textile or the output format is latex or context.)
It's Haskell, so you'd have to figure out the interface.

JSON contains incorrect UTF-8 \u00ce\u00b2 instead of Unicode \u03b2, how to fix in Python?

First note that the symbol β (Greek beta) has the hex representation CE B2 in UTF-8.
I have legacy source code in Python 2.7 that uses json strings:
u'{"something":"text \\u00ce\\u00b2 text..."}'
It then calls json.loads(string) or json.loads(string, 'utf-8'), but the result is a Unicode string containing UTF-8 byte values:
u'text \xce\xb2 text'
What I want is a normal Python Unicode (UTF-16?) string:
u'text β text'
If I call:
text = text.decode('unicode_escape')
before json.loads, then I get the correct Unicode β symbol, but it also breaks the JSON by replacing all the new lines, \n.
The question is: how do I convert only the "\\u00ce\\u00b2" part without affecting other JSON special characters?
(I am new to Python, and it is not my source code, so I have no idea how this is supposed to work. I suspect that the code only works with ASCII characters)
Something like this, perhaps. This is limited to 2-byte UTF-8 characters.
import re

j = u'{"something":"text \\u00ce\\u00b2 text..."}'

def decodeu(match):
    u = '%c%c' % (int(match.group(1), 16), int(match.group(2), 16))
    return repr(u.decode('utf-8'))[2:8]

j = re.sub(r'\\u00([cd][0-9a-f])\\u00([89ab][0-9a-f])', decodeu, j)
print(j)
returns {"something":"text \u03b2 text..."} for your sample. At this point, you can import it as regular JSON and get the final string you want.
result = json.loads(j)
Here's a string-fixer that works after loading the JSON. It handles any length UTF-8-like sequence and ignores escape sequences that don't look like UTF-8 sequences.
Example:
import json
import re

def fix(bad):
    return re.sub(ur'[\xc2-\xf4][\x80-\xbf]+',
                  lambda m: m.group(0).encode('latin1').decode('utf8'),
                  bad)

# 2- and 3-byte UTF-8-like sequences and one correct escape code.
json_text = '''\
{
"something":"text \\u00ce\\u00b2 text \\u00e4\\u00bd\\u00a0\\u597d..."
}
'''
data = json.loads(json_text)
bad_str = data[u'something']
good_str = fix(bad_str)
print bad_str
print good_str
Output:
text β text ä½ 好...
text β text 你好...
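For reference, a Python 3 adaptation of the same idea (a sketch on my part; the ur'' literal above is Python 2 only):
import re

def fix(bad):
    # Runs of characters in the UTF-8 lead/continuation byte ranges are
    # round-tripped through latin-1 and re-decoded as UTF-8.
    return re.sub('[\xc2-\xf4][\x80-\xbf]+',
                  lambda m: m.group(0).encode('latin-1').decode('utf-8'),
                  bad)

print(fix('text \u00ce\u00b2 text'))
# text β text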

How to convert byte to hex string starting with '%' in python

I would like to convert Unicode characters encoded as UTF-8 into a hex string in which each byte is prefixed with '%', because the API server only recognizes this form.
For example, if I need to send the Unicode character '※' to the server, I should convert this character to the string '%E2%80%BB'. (It does not matter whether the hex digits are upper- or lowercase.)
I found the way to convert unicode character to bytes and convert bytes to hex string in https://stackoverflow.com/a/35599781.
>>> print('※'.encode('utf-8'))
b'\xe2\x80\xbb'
>>> print('※'.encode('utf-8').hex())
e280bb
But I need the form starting with '%', like '%E2%80%BB' or '%e2%80%bb'.
Is there any concise way to implement this, or do I need to write a function that adds '%' before each hex byte?
There are two ways to do this:
The preferred solution. Use urllib.parse.urlencode and specify multiple parameters and encode all at once:
urllib.parse.urlencode({'parameter': '※', 'test': 'True'})
# parameter=%E2%80%BB&test=True
Or, you can manually convert this into chunks of two symbols, then join with the % symbol:
def encode_symbol(symbol):
    symbol_hex = symbol.encode('utf-8').hex()
    symbol_groups = [symbol_hex[i:i + 2].upper() for i in range(0, len(symbol_hex), 2)]
    symbol_groups.insert(0, '')
    return '%'.join(symbol_groups)
encode_symbol('※')
# %E2%80%BB
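For a single value rather than a whole query string, urllib.parse.quote also produces exactly this form (non-ASCII characters are UTF-8-encoded and percent-escaped with uppercase hex by default):
from urllib.parse import quote

print(quote('※'))
# %E2%80%BB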

Encode binary data so that \n is escaped

I'm trying to work out a way to encode/decode binary data in such a way that the new line character is not part of the encoded string.
It seems to be a recursive problem, but I can't seem to work out a solution.
e.g. A naive implementation:
>>> original = 'binary\ndata'
>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata'
What happens if there is already a =n in the original string?
>>> original = 'binary\ndata=n'
>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata\n' # wrong
Try to escape existing =n's, but then what happens if there is already an escaped =n?
>>> original = '++nbinary\ndata=n'
>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'
How can I get around this recursive problem?
Solution
original = 'binary\ndata \\n'
# encoded = original.encode('string_escape') # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n') # escape \n and \\
decoded = encoded.decode('string_escape')
Verified:
>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n
The solution is from How do I un-escape a backslash-escaped string in python?
Edit: I also wrote it with your ad-hoc economical encoding. The original "string_escape" codec escapes the backslash, the apostrophe and everything below chr(32) or above chr(126). Decoding is the same for both.
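Note that the string_escape codec is Python 2 only; under Python 3 a similar round trip can be sketched with unicode_escape (my adaptation, an assumption rather than part of the original answer):
original = 'binary\ndata \\n'
# Escape backslashes first, then newlines, exactly as above.
encoded = original.replace('\\', '\\\\').replace('\n', '\\n')
# unicode_escape undoes both escapes in a single pass.
decoded = encoded.encode('latin-1').decode('unicode_escape')
assert decoded == original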
The way to encode strings that might contain the "escape" character is to escape the escape character as well. In python, the escape character is a backslash, but you could use anything you want. Your cost is one character for every occurrence of newline or the escape.
To avoid confusing you, I'll use forward slash:
# original
>>> print "slashes / and /newline/\nhere"
slashes / and /newline/
here
# encoding
>>> print "slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n")
slashes // and //newline///nhere
This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():
# decoding
>>> import re
>>> def decode(c):
...     # Expand this into a real mapping if you have more substitutions
...     return '\n' if c == '/n' else c[0]
...
>>> print "".join(decode(c) for c in re.findall(r"(/.|.)",
...                                             "slashes // and //newline///nhere"))
slashes / and /newline/
here
Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.
If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?
I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?
In [40]: original = 'binary\ndata\nmorestuff'
In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']
In [42]: encoded = original.replace('\n', '')
In [43]: encoded
Out[43]: 'binarydatamorestuff'
In [44]: decoded = list(encoded)
In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]
In [46]: decoded = ''.join(decoded)
In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'
Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.
If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.
The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.
Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.
Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).
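If a larger output alphabet is acceptable, an escapeless encoding such as base64 sidesteps the problem entirely; a minimal sketch (my addition, not from the answer above):
import base64

original = b'binary\ndata=n'
encoded = base64.b64encode(original)  # b64encode never emits newlines
decoded = base64.b64decode(encoded)
assert decoded == original
print(encoded)
# b'YmluYXJ5CmRhdGE9bg=='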
How about:
In [8]: import urllib
In [9]: original = 'binary\ndata'
In [10]: encoded = urllib.quote(original)
In [11]: encoded
Out[11]: 'binary%0Adata'
In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'
The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.

Unescape/unquote binary strings in (extended) url encoding in python

For analysis I have to unescape URL-encoded binary strings (most likely containing non-printable characters). The strings sadly come in the extended URL-encoding form, e.g. "%u616f". I want to store them in a file that then contains the raw binary values, e.g. 0x61 0x6f here.
How do I get this into binary data in Python? (urllib.unquote only handles the "%HH" form.)
The strings sadly come in the extended URL-encoding form, e.g. "%u616f"
Incidentally, that has nothing to do with URL encoding. It's an arbitrary made-up format produced by the JavaScript escape() function and pretty much nothing else. If you can, the best thing to do would be to change the JavaScript to use the encodeURIComponent function instead. This will give you a proper, standard URL-encoded UTF-8 string.
e.g. "%u616f". I want to store them in a file that then contains the raw binary values, eg. 0x61 0x6f here.
Are you sure 0x61 0x6f (the letters "ao") is the byte stream you want to store? That would imply UTF-16BE encoding; are you treating all your strings that way?
Normally you'd want to turn the input into Unicode then write it out using an appropriate encoding, such as UTF-8 or UTF-16LE. Here's a quick way of doing it, relying on the hack of making Python read '%u1234' as the string-escaped format u'\u1234':
>>> ex= 'hello %e9 %u616f'
>>> ex.replace('%u', r'\u').replace('%', r'\x').decode('unicode-escape')
u'hello \xe9 \u616f'
>>> print _
hello é 慯
>>> _.encode('utf-8')
'hello \xc3\xa9 \xe6\x85\xaf'
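Under Python 3 the same hack can be sketched like this (my adaptation, an assumption; the escape decoding has to go through bytes):
ex = 'hello %e9 %u616f'
decoded = (ex.replace('%u', '\\u').replace('%', '\\x')
             .encode('latin-1').decode('unicode-escape'))
print(decoded)
# hello é 慯
print(decoded.encode('utf-8'))
# b'hello \xc3\xa9 \xe6\x85\xaf'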
I guess you will have to write the decoder function by yourself. Here is an implementation to get you started:
def decode(file):
    while True:
        c = file.read(1)
        if c == "":
            # End of file
            break
        if c != "%":
            # Not an escape sequence
            yield c
            continue
        c = file.read(1)
        if c != "u":
            # One hex-byte
            yield chr(int(c + file.read(1), 16))
            continue
        # Two hex-bytes
        yield chr(int(file.read(2), 16))
        yield chr(int(file.read(2), 16))
Usage:
input = open("/path/to/input-file", "r")
output = open("/path/to/output-file", "wb")
output.writelines(decode(input))
output.close()
input.close()
Here is a regex-based approach:
import re

# the replace function concatenates the two matches after
# converting them from hex to ascii
repfunc = lambda m: chr(int(m.group(1), 16)) + chr(int(m.group(2), 16))
# the last parameter is the text you want to convert
result = re.sub('%u(..)(..)', repfunc, '%u616f')
print result
gives
ao
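The sketch above is Python 2; a Python 3 variant (my assumption) would work on bytes so that the result stays raw binary:
import re

# Emit the two hex pairs as raw bytes rather than text characters.
repfunc = lambda m: bytes([int(m.group(1), 16), int(m.group(2), 16)])
result = re.sub(b'%u(..)(..)', repfunc, b'%u616f')
print(result)
# b'ao'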
