I have a database of badly formatted strings. The data looks like this:
"street"=>"\"\\u4e2d\\u534e\\u8def\""
when it should be like this:
"street"=>"中华路"
The problem is that when these doubly escaped strings come from the database, they are not decoded to the Chinese characters as they should be. Suppose I have the variable street="\"\\u4e2d\\u534e\\u8def\"". If I print it with print(street), the result is a string of escaped code points: "\u4e2d\u534e\u8def"
What can I do at this point to convert "\u4e2d\u534e\u8def" to actual Unicode characters?
First encode this string as utf8 and then decode it with unicode-escape which will handle the \\ for you:
>>> line = "\"\\u4e2d\\u534e\\u8def\""
>>> line.encode('utf8').decode('unicode-escape')
'"中华路"'
You can then strip the " if necessary
You could remove the quotation marks with strip and split at every '\\u'. This would give you the characters as strings representing hex numbers. Then for each string you could convert it to int and back to string with chr:
>>> street = "\"\\u4e2d\\u534e\\u8def\""
>>> ''.join(chr(int(x, 16)) for x in street.strip('"').split('\\u') if x)
'中华路'
Based on what you wrote, the database appears to be storing an eval-able ASCII representation of a string with non-ASCII characters.
>>> eval("\"\\u4e2d\\u534e\\u8def\"")
'中华路'
Python has a built-in function for this.
>>> ascii('中华路')
"'\\u4e2d\\u534e\\u8def'"
The only difference is the use of \" instead of ' for the needed internal quote.
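Because eval runs whatever expression the database happens to contain, a narrower parser may be preferable. The following is a minimal Python 3 sketch, assuming every stored value is a double-quoted literal shaped like the sample in the question; the variable names are illustrative only.

```python
import ast
import json

# Sample value as it arrives from the database (taken from the question).
street = "\"\\u4e2d\\u534e\\u8def\""

# The value is a valid JSON string literal, so json.loads decodes the
# \uXXXX escapes and drops the surrounding quotes in one step.
decoded_json = json.loads(street)

# ast.literal_eval only accepts Python literals; unlike eval it does not
# execute arbitrary expressions.
decoded_literal = ast.literal_eval(street)

print(decoded_json)
print(decoded_literal)
```

Both calls raise an exception on malformed input instead of executing it, which is usually the behaviour you want for data of unknown quality.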
I would like to convert Unicode characters, encoded as UTF-8, to a hex string in which each byte is prefixed with '%', because the API server only recognizes this form.
For example, if I need to send the Unicode character '※' to the server, I should convert it to the string '%E2%80%BB'. (Upper or lower case does not matter.)
I found a way to convert a Unicode character to bytes, and the bytes to a hex string, in https://stackoverflow.com/a/35599781.
>>> print('※'.encode('utf-8'))
b'\xe2\x80\xbb'
>>> print('※'.encode('utf-8').hex())
e280bb
But I need a form where each byte starts with '%', like '%E2%80%BB' or '%e2%80%bb'.
Is there a concise way to implement this? Or do I need to write a function that adds '%' before each hex byte?
There are two ways to do this:
The preferred solution: use urllib.parse.urlencode, which lets you specify multiple parameters and encode them all at once:
import urllib.parse
urllib.parse.urlencode({'parameter': '※', 'test': 'True'})
# parameter=%E2%80%BB&test=True
Or, you can manually convert this into chunks of two symbols, then join with the % symbol:
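def encode_symbol(symbol):
    # Hex digits of the UTF-8 bytes, e.g. 'e280bb'
    symbol_hex = symbol.encode('utf-8').hex()
    # Split into two-digit groups, one per byte
    symbol_groups = [symbol_hex[i:i + 2].upper() for i in range(0, len(symbol_hex), 2)]
    # The leading empty string makes join() emit a '%' before the first group
    symbol_groups.insert(0, '')
    return '%'.join(symbol_groups)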
encode_symbol('※')
# %E2%80%BB
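For a single value rather than a whole query string, the standard library's urllib.parse.quote does the same percent-encoding. The sketch below shows it next to a shorter manual variant; percent_encode is a hypothetical helper name, and it differs from quote in that it encodes every byte, including plain ASCII letters.

```python
from urllib.parse import quote


def percent_encode(text):
    """Percent-encode every UTF-8 byte of text, including ASCII ones."""
    return ''.join('%{:02X}'.format(byte) for byte in text.encode('utf-8'))


# quote() leaves unreserved characters (letters, digits, '_.-~') alone;
# safe='' additionally makes it encode '/'.
print(quote('※', safe=''))
print(percent_encode('※'))
```

Which one fits depends on whether the server expects ASCII characters to be escaped as well; that detail is not stated in the question.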
I am trying to remove certain characters from a string in Python. I have a list of characters or ranges of characters that I need removed, represented in hexadecimal like so:
- "0x00:0x20"
- "0x7F:0xA0"
- "0x1680"
- "0x180E"
- "0x2000:0x200A"
I am turning this list into a regular expression that looks like this:
re.sub(u'[\x00-\x20 \x7F-\xA0 \x1680 \x180E \x2000-\x200A]', ' ', my_str)
However, I am getting an error when I have \x2000-\x200A in there.
I have found that Python does not actually interpret u'\x2000' as a character:
>>> '\x2000'
' 00'
It is treating it like '\x20' (a space) followed by whatever comes after it:
>>> '\x20blah'
' blah'
U+2000 is a valid Unicode character:
http://www.unicodemap.org/details/0x2000/index.html
I would like Python to treat it that way so I can use re to remove it from strings.
As an alternative, I would like to know of another way to remove these characters from strings.
I appreciate any help. Thanks!
In a Unicode string, code points above 0xFF must be written as \uNNNN rather than \xNNNN, because \x takes exactly two hex digits. The following works:
>>> import re
>>> my_str=u'\u2000abc'
>>> re.sub(u'[\x00-\x20 \x7F-\xA0 \u1680 \u180E \u2000-\u200A]', ' ', my_str)
' abc'
From the docs (https://docs.python.org/2/howto/unicode.html):
Unicode literals can also use the same escape sequences as 8-bit
strings, including \x, but \x only takes two hex digits so it can’t
express an arbitrary code point. Octal escapes can go up to U+01ff,
which is octal 777.
>>> s = u"a\xac\u1234\u20ac\U00008000"
... #      ^^^^ two-digit hex escape
... #          ^^^^^^ four-digit Unicode escape
... #                      ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
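Since the question starts from a list of hex specs, the character class can also be generated rather than typed by hand. This is a minimal Python 3 sketch; build_char_class and SPECS are hypothetical names, and it assumes every code point fits in four hex digits (anything above U+FFFF would need the eight-digit \U form instead).

```python
import re

# Range specs in the format shown in the question.
SPECS = ["0x00:0x20", "0x7F:0xA0", "0x1680", "0x180E", "0x2000:0x200A"]


def build_char_class(specs):
    """Turn 'lo:hi' or single hex specs into a regex character class."""
    parts = []
    for spec in specs:
        # Emit \uXXXX text; the re module interprets these escapes itself,
        # so no Python string-literal escaping is involved.
        bounds = ["\\u{:04X}".format(int(h, 16)) for h in spec.split(":")]
        parts.append("-".join(bounds))
    return "[" + "".join(parts) + "]"


pattern = re.compile(build_char_class(SPECS))
cleaned = pattern.sub(" ", "a\u2000b\x7fc\u1680d")
print(repr(cleaned))
```

Formatting each bound as \uXXXX sidesteps the two-digit limit of \x, because every code point is written with the same fixed-width escape.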
In Python 3.3, I am trying to rebuild unicode characters from truncated unicode values,
and then print the character to console.
For example, from '4E00' I want to form the string '\u4E00'. I have tried:
base = '4E00'
uni = r'\u' + base
print(uni) # getting '\u4E00', want: '一'
print(repr(uni)) # '\\u4E00'
Is there a way to form an unescaped string like '\u4E00' in this situation?
Keep in mind that \u followed by a Unicode character code is only a thing in string literals. r'\u' + '4E00' has no special meaning as a Unicode character because it's not all in one literal; it's just a six-character string.
So you're trying to take a Unicode escape code as it would appear in a Python string literal, then decode that into a Unicode character. You can do that:
base = '4E00'
uni = str(bytes(r'\u' + base, encoding="ascii"), encoding="unicode_escape")
But it's the long way around (especially since you have to convert it to bytes first since it's already Unicode). Your Unicode character spec is in hexadecimal. So convert it directly to an integer and then use chr() to turn it into a Unicode character.
base = '4E00'
uni = chr(int(base, 16))
Use:
chr(int(base, 16))
to turn a hex value into a Unicode character.
The \u escape sequence only works in string literals. You could use:
(br'\u' + base.encode('ascii')).decode('unicode_escape')
but that's much more verbose than this needs to be.
Demo:
>>> base = '4E00'
>>> chr(int(base, 16))
'一'
>>> (br'\u' + base.encode('ascii')).decode('unicode_escape')
'一'
In Python 2.x, how can I convert a unicode string (e.g., u'\xe0') to a string (here I need it to be '\u00E0')?
To make it clearer: I would like to have '\u00E0', a string of length 6. That is, \u is treated as 2 characters instead of one escape sequence.
\u doesn't exist as a string escape sequence in Python 2.
You might mean a JSON-encoded string:
>>> s = u'\xe0'
>>> import json
>>> json.dumps(s)
'"\\u00e0"'
or a UTF-16 (big-endian)-encoded string:
>>> s.encode("utf-16-be")
'\x00\xe0'
but your original request is not fulfillable.
As an aside, note that u'\u00e0' is identical to u'\xe0', but '\u00e0' doesn't exist:
>>> u'\u00e0'
u'\xe0'
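If the goal is simply six characters of text (a backslash, the letter u and four hex digits), that text can be built by formatting each code point. The sketch below is written for Python 3 rather than the Python 2 of the question; to_u_escapes is a hypothetical helper, and it assumes all characters are in the Basic Multilingual Plane so that four hex digits suffice.

```python
def to_u_escapes(text):
    """Spell out each character as a literal backslash-u escape."""
    # '\\u' in the format string is a real backslash followed by 'u',
    # so the result is ordinary text, not an escape sequence.
    return ''.join('\\u{:04x}'.format(ord(ch)) for ch in text)


escaped = to_u_escapes('\xe0')
print(escaped)
print(len(escaped))
```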
I'm trying to work out a way to encode/decode binary data in such a way that the new line character is not part of the encoded string.
It seems to be a recursive problem, but I can't seem to work out a solution.
e.g. A naive implementation:
>>> original = 'binary\ndata'
>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata'
What happens if there is already a =n in the original string?
>>> original = 'binary\ndata=n'
>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = encoded.replace('=n', '\n')
'binary\ndata\n' # wrong
Try to escape existing =n's, but then what happens if there is already an escaped =n?
>>> original = '++nbinary\ndata=n'
>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'
How can I get around this recursive problem?
Solution
original = 'binary\ndata \\n'
# encoded = original.encode('string_escape') # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n') # escape \n and \\
decoded = encoded.decode('string_escape')
verified
>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n
The solution is from How do I un-escape a backslash-escaped string in python?
Edit: I wrote it also with your ad-hoc economic encoding. The original "string_escape" codec escapes backslash, apostrophe and everything below chr(32) and above chr(126). Decoding is the same for both.
The way to encode strings that might contain the "escape" character is to escape the escape character as well. In Python, the escape character is a backslash, but you could use anything you want. The cost is one extra character for every occurrence of a newline or of the escape character.
To avoid confusing you, I'll use forward slash:
# original
>>> print "slashes / and /newline/\nhere"
slashes / and /newline/
here
# encoding
>>> print "slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n")
slashes // and //newline///nhere
This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():
# decoding
>>> import re
>>> def decode(c):
...     # Expand this into a real mapping if you have more substitutions
...     return '\n' if c == '/n' else c[0]
...
>>> print "".join( decode(c) for c in re.findall(r"(/.|.)",
...     "slashes // and //newline///nhere"))
slashes / and /newline/
here
Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.
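The same scheme can be written as a round-trip pair in Python 3. This is a minimal sketch, assuming the forward slash as the escape character as in the example above; encode and decode are illustrative names, and re.sub is used so that decoding happens in one left-to-right pass.

```python
import re


def encode(text):
    # Double the escape character first, then replace real newlines.
    return text.replace("/", "//").replace("\n", "/n")


def decode(text):
    def expand(match):
        # The character after the escape decides what the pair means.
        ch = match.group(1)
        return "\n" if ch == "n" else ch

    # One pass over the text: each '/x' pair is consumed exactly once.
    return re.sub(r"/(.)", expand, text, flags=re.DOTALL)


original = "slashes / and /newline/\nhere"
encoded = encode(original)
print(encoded)
print(decode(encoded) == original)
```

The order of the two replace calls in encode matters: doubling the slashes after inserting /n would corrupt the newline markers.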
If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?
I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?
In [40]: original = 'binary\ndata\nmorestuff'
In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']
In [42]: encoded = original.replace('\n', '')
In [43]: encoded
Out[43]: 'binarydatamorestuff'
In [44]: decoded = list(encoded)
In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]
In [46]: decoded = ''.join(decoded)
In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'
Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.
If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.
The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.
Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.
Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).
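The size trade-off can be checked with a small experiment. The sketch below compares backslash escaping with Base64 on an all-newline input; Base64 is my own choice of a fixed-ratio, newline-free encoding for illustration and is not named in the answer, and escape_newlines is a hypothetical helper.

```python
import base64


def escape_newlines(data):
    # Escape the escape character first, then the newline itself.
    return data.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")


worst_case = b"\n" * 9               # nothing but escaped characters
escaped = escape_newlines(worst_case)
b64 = base64.b64encode(worst_case)   # output contains no newline bytes

print(len(worst_case), len(escaped), len(b64))
```

Escaping doubles this worst-case input but leaves newline-free data unchanged, while Base64 always grows the data by about a third regardless of content.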
How about:
In [8]: import urllib
In [9]: original = 'binary\ndata'
In [10]: encoded = urllib.quote(original)
In [11]: encoded
Out[11]: 'binary%0Adata'
In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'
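The session above is Python 2. In Python 3 the same functions live in urllib.parse, and for binary data the byte-oriented variants seem the closest match. A minimal sketch, with a sample value that also contains a literal %0A to show that existing percent signs are themselves escaped:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Contains a real newline and the literal text '%0A'.
original = b'binary\ndata%0A'

# '%' is encoded as %25, so a pre-existing '%0A' cannot be confused
# with an encoded newline.
encoded = quote_from_bytes(original)
decoded = unquote_to_bytes(encoded)

print(encoded)
print(decoded == original)
```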
The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.