How to get python to accept unicode character 0x2000 (and others) - python

I am trying to remove certain characters from a string in Python. I have a list of characters or ranges of characters that I need removed, represented in hexadecimal like so:
- "0x00:0x20"
- "0x7F:0xA0"
- "0x1680"
- "0x180E"
- "0x2000:0x200A"
I am turning this list into a regular expression that looks like this:
re.sub(u'[\x00-\x20 \x7F-\xA0 \x1680 \x180E \x2000-\x200A]', ' ', my_str)
However, I am getting an error when I have \x2000-\x200A in there.
I have found that Python does not actually interpret u'\x2000' as a character:
>>> '\x2000'
' 00'
It is treating it like 'x20' (a space) and whatever else is after it:
>>> '\x20blah'
' blah'
x2000 is a valid unicode character:
http://www.unicodemap.org/details/0x2000/index.html
I would like Python to treat it that way so I can use re to remove it from strings.
As an alternative, I would like to know of another way to remove these characters from strings.
I appreciate any help. Thanks!

In a unicode string, code points above 0xFF must be written with the \uNNNN escape, not \xNNNN, because \x consumes only two hex digits. The following works:
>>> import re
>>> my_str=u'\u2000abc'
>>> re.sub(u'[\x00-\x20 \x7F-\xA0 \u1680 \u180E \u2000-\u200A]', ' ', my_str)
' abc'
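In Python 3, where every str is Unicode, the character class can also be generated directly from the list of hex specs in the question. The following is a minimal sketch: the helper name build_cleaner and the choice to emit \UXXXXXXXX escapes are assumptions of this example, not part of the original answer.

```python
import re

# The specs from the question: a single code point or an inclusive "low:high" range.
SPECS = ["0x00:0x20", "0x7F:0xA0", "0x1680", "0x180E", "0x2000:0x200A"]


def build_cleaner(specs):
    """Compile a character class matching every code point listed in specs."""
    parts = []
    for spec in specs:
        # int(..., 16) accepts the "0x" prefix used in the question.
        bounds = [int(bound, 16) for bound in spec.split(":")]
        # The re module understands \UXXXXXXXX itself, so no special
        # character has to appear literally inside the pattern.
        parts.append("-".join("\\U{:08x}".format(b) for b in bounds))
    return re.compile("[" + "".join(parts) + "]")


cleaner = build_cleaner(SPECS)
cleaned = cleaner.sub(" ", "\u2000abc\u1680def")
print(repr(cleaned))
```

Building the pattern from integers avoids the \x versus \u pitfall entirely, because the escape width no longer depends on how the literal was typed.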

From the docs (https://docs.python.org/2/howto/unicode.html):
Unicode literals can also use the same escape sequences as 8-bit
strings, including \x, but \x only takes two hex digits so it can’t
express an arbitrary code point. Octal escapes can go up to U+01ff,
which is octal 777.
>>> s = u"a\xac\u1234\u20ac\U00008000"
... # ^^^^ two-digit hex escape
... # ^^^^^^ four-digit Unicode escape
... # ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
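The documentation excerpt above is written for Python 2. As a rough Python 3 equivalent (an illustration added here, not taken from the docs), the same escapes work in ordinary string literals without the u prefix:

```python
# In Python 3 every str literal is Unicode, so no u prefix is needed.
s = "a\xac\u1234\u20ac\U00008000"
code_points = [ord(c) for c in s]
print(code_points)

# \x consumes exactly two hex digits, so "\x2000" is a space followed by
# the two literal characters "00", while "\u2000" is a single character.
too_short = "\x2000"
print(len(too_short), len("\u2000"))
```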

Related

Python: How do string variables prevent escape?

>>>m = "\frac{7x+5}{1+y^2}"
>>>print(m)
rac{7x+5}{1+y^2}
>>>print(r""+m)
rac{7x+5}{1+y^2}
>>>print(r"{}".format(m))
rac{7x+5}{1+y^2}
>>>print(repr(m))
'\x0crac{7x+5}{1+y^2}'
I want the result:"\frac{7x+5}{1+y^2}"
Must be a string variable!!!
You need the string literal that contains the slash to be a raw string.
m = r"\frac{7x+5}{1+y^2}"
Raw strings are just another way of writing string literals; they are not a different type. For example, r"" is exactly the same as "" because there are no characters to escape. It does not produce some kind of raw empty string, so adding it to another string changes nothing.
Another option is to escape the backslash with a second backslash so that it is stored as a literal backslash:
m = "\\frac{7x+5}{1+y^2}"
print(m)
print(r""+m)
print(r"{}".format(m))
print(repr(m))
A good place to start is to read the docs here. So you can use either the escape character "\" as here
>>> m = "\\frac{7x+5}{1+y^2}"
>>> print(m)
\frac{7x+5}{1+y^2}
or use a raw string literal, which takes the string as is
>>> m = r"\frac{7x+5}{1+y^2}"
>>> print(m)
\frac{7x+5}{1+y^2}
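The point of the answers above can be checked directly: the escape is processed when the literal is parsed, so nothing done to the variable afterwards brings the backslash back. A small sketch follows; the replace-based repair at the end is an assumption of this example and is only safe when the text can never legitimately contain a form feed.

```python
escaped = "\\frac{7x+5}{1+y^2}"  # backslash escaped by hand
raw = r"\frac{7x+5}{1+y^2}"      # raw literal, backslash kept as is
broken = "\frac{7x+5}{1+y^2}"    # "\f" was already parsed as a form feed

# Both spellings produce the very same string object contents.
print(escaped == raw)

# The first character of the broken string is the form feed, not a backslash.
print(repr(broken[0]))

# Hypothetical repair: map the form feed back to backslash + "f".
repaired = broken.replace("\f", "\\f")
print(repaired)
```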

Dealing with doubly escaped unicode string

I have a database of badly formatted strings. The data looks like this:
"street"=>"\"\\u4e2d\\u534e\\u8def\""
when it should be like this:
"street"=>"中华路"
The problem is that when these doubly escaped strings come from the database, they are not decoded to the Chinese characters they should be. Suppose I have the variable street="\"\\u4e2d\\u534e\\u8def\"". If I print it with print(street), the result is a string of code point escapes: "\u4e2d\u534e\u8def"
What can I do at this point to convert "\u4e2d\u534e\u8def" to actual unicode characters?
First encode this string as utf8 and then decode it with unicode-escape which will handle the \\ for you:
>>> line = "\"\\u4e2d\\u534e\\u8def\""
>>> line.encode('utf8').decode('unicode-escape')
'"中华路"'
You can then strip the " characters if necessary.
You could remove the quotation marks with strip and split at every '\\u'. This would give you the characters as strings representing hex numbers. Then for each string you could convert it to int and back to string with chr:
>>> street = "\"\\u4e2d\\u534e\\u8def\""
>>> ''.join(chr(int(x, 16)) for x in street.strip('"').split('\\u') if x)
'中华路'
Based on what you wrote, the database appears to be storing an eval-uable ASCII representation of a string with non-ASCII chars.
>>> eval("\"\\u4e2d\\u534e\\u8def\"")
'中华路'
Python has a built-in function for this.
>>> ascii('中华路')
"'\\u4e2d\\u534e\\u8def'"
The only difference is the use of \" instead of ' for the needed internal quote.
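Since the stored value is a quoted string containing only \uXXXX escapes, it also happens to be a valid JSON string and a valid Python string literal. Assuming that holds for every row (an assumption about the data, not something the question confirms), the standard library can decode it without calling eval on database content:

```python
import ast
import json

street = "\"\\u4e2d\\u534e\\u8def\""

# Treat the value as a JSON string: quotes are removed and \uXXXX is decoded.
decoded_json = json.loads(street)

# Treat the value as a Python literal: literal_eval only accepts literals,
# so it cannot execute arbitrary code the way eval can.
decoded_literal = ast.literal_eval(street)

print(decoded_json, decoded_literal)
```

Both calls raise an exception on malformed input, which makes badly formatted rows easy to spot.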

Hexadecimal file is loading with 2 back slashes before each byte instead of one

I have a hex file in this format: \xda\xd8\xb8\x7d
When I load the file with Python, it loads with two back slashes instead of one.
with open('shellcode.txt', 'r') as file:
    shellcode = file.read().replace('\n', '')
Like this: \\xda\\xd8\\xb8\\x7d
I've tried using hex.replace("\\", "\"), but I'm getting an error
EOL while scanning string literal
What is the proper way to replace \\ with \?
Here is an example
>>> h = "\\x123"
>>> h
'\\x123'
>>> print h
\x123
>>>
The two backslashes are needed because \ is an escape character, and so it needs to be escaped. When you print h, it shows what you want.
Backslash (\) is an escape character. It is used for changing the meaning of the character(s) following it.
For example, if you want to create a string which contains a quote, you have to escape it:
s = "abc\"def"
print s # prints: abc"def
If there was no backslash, the first quote would be interpreted as the end of the string.
Now, if you really wanted that backslash in the string, you would have to escape the backslash using another backslash:
s = "abc\\def"
print s # prints: abc\def
However, if you look at the representation of the string, it will be shown with the escape characters:
print repr(s) # prints: 'abc\\def'
Therefore, this line should include escapes for each backslash:
hex.replace("\\", "\") # wrong
hex.replace("\\\\", "\\") # correct
But that is not the solution to the problem!
There is no way that file.read().replace('\n', '') introduced additional backslashes. What probably happened is that OP printed the representation of the string with backslashes (\) which ended up printing escaped backslashes (\\).
You can make a bytes object with a utf-8 encoding, and then decode as unicode-escape.
>>> x = "\\x61\\x62\\x63"
>>> y = bytes(x, "utf-8").decode("unicode-escape")
>>> print(x)
\x61\x62\x63
>>> print(y)
abc
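If the goal is the raw bytes rather than text, another option in Python 3 is to strip the \x markers and parse the remaining hex digits. This is only a sketch and assumes the file contains nothing but \xNN sequences; the variable names are illustrative.

```python
# What file.read() would return for the file in the question:
# 16 characters, each backslash being a single character.
text = r"\xda\xd8\xb8\x7d"

print(text)        # displays the string itself, with single backslashes
print(repr(text))  # displays the representation, with doubled backslashes

# Drop every "\x" marker and interpret the remaining digits as hex bytes.
data = bytes.fromhex(text.replace("\\x", ""))
print(data)
```

bytes.fromhex raises ValueError if anything other than hex digits and whitespace is left over, so unexpected file content is reported rather than silently converted.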

Leave only alphanumeric symbols in string in Python?

I am using Python 2.7. On SO I found the following regexp for removing non-word characters:
pat = re.compile('[\W]+', re.UNICODE)
I wrote the next function:
def leave_only_alphanumeric(string):
    pat = re.compile('[\W]+', re.UNICODE)
    return re.sub(pat,' ',string)
Though on the following string:
kr\xc3\xa9m
it produces the wrong result:
kr\xc3 m
\xa9 was deleted from the string, but should not have been.
You are confusing unicode codepoints and the utf-8 encoding.
The letter you are trying to handle is é, code point U+00E9.
It is encoded in utf-8 as two bytes, 0xc3 and 0xa9.
Try:
>>> "kr\xc3\xa9m".decode('utf-8')
u'kr\xe9m'
>>> print("kr\xc3\xa9m")
krém
>>> print(u"kr\xe9m")
krém
With u"" you must use the actual code points, while with a plain "" Python just sees a sequence of bytes.
Note that the second line only works because my terminal's encoding is utf-8, otherwise I'd see garbled output.
As a result, your string is not what you think:
>>> print(u"kr\xc3\xa9m")
krÃ©m
You actually entered two characters, with code points U+00C3 and U+00A9. The former is Ã, which is an alphabetic character; the latter is ©, which is not, and that is why your code removes it.
Now playing with your code:
>>> def leave_only_alphanumeric(string):
...     pat = re.compile('[\W]+', re.UNICODE)
...     return re.sub(pat,' ',string)
...
>>> leave_only_alphanumeric(u"kr\xe9m")
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m") # this is not unicode
'kr\xc3 m' # -> thus the wrong result
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8'))
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8')).encode('utf-8')
'kr\xc3\xa9m'
>>>
I believe regex might be a bit of an overkill here.
def leave_only_alphanumeric(string):
    return ''.join(ch if ch.isalnum() else ' ' for ch in string)
EDIT: Your title says "alphanumeric" but your code removes digits as well. So there is a bit of unclarity.
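The same effect can be reproduced in Python 3, where bytes and str are separate types and \W is Unicode-aware for str patterns by default. The sketch below is an illustration added here, not code from the answers; decoding the bytes as Latin-1 stands in for the mistake of treating each UTF-8 byte as its own character.

```python
import re


def leave_only_alphanumeric(text):
    """Replace each run of non-word characters with a single space."""
    return re.sub(r"\W+", " ", text)


raw = b"kr\xc3\xa9m"  # UTF-8 encoding of "krém"

# Decoded correctly, é is one word character and survives.
good = leave_only_alphanumeric(raw.decode("utf-8"))

# Decoded byte by byte, 0xA9 becomes the copyright sign, which is not a
# word character, so it is replaced with a space.
bad = leave_only_alphanumeric(raw.decode("latin-1"))

print(repr(good), repr(bad))
```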

I want one backslash - not two

I have a string that after print is like this: \x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71
But I want to change this string to "\x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71", which is not printable (I need to write it to a serial port). I know that the problem is with the '\'. How can I replace these printable backslashes with unprintable ones?
If you want to decode your string, use decode() with 'string_escape' as parameter which will interpret the literals in your variable as python literal string (as if it were typed as constant string in your code).
mystr.decode('string_escape')
Use decode():
>>> st = r'\x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71'
>>> print st
\x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71
>>> print st.decode('string-escape')
MÿýfHq
That last garbage is what my Python prints when trying to print your unprintable string.
You are confusing the printable representation of a string literal with the string itself:
>>> c = '\x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71'
>>> c
'M\xff\xfd\x00\x02\x8f\x0e\x80fHq'
>>> len(c)
11
>>> len('\x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71')
11
>>> len(r'\x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71')
44
your_string.decode('string_escape')
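The 'string_escape' codec used in the answers above exists only in Python 2. In Python 3, one way to get the same bytes is to go through 'unicode_escape' and then Latin-1, which maps code points 0 to 255 back to single bytes. This is a sketch that assumes the input is pure ASCII text made of \xNN escapes, as in the question.

```python
text = r"\x4d\xff\xfd\x00\x02\x8f\x0e\x80\x66\x48\x71"  # 44 printable characters

# Step 1: ASCII bytes -> str, interpreting each \xNN as one character.
# Step 2: str -> bytes, with Latin-1 mapping each character to one byte.
payload = text.encode("ascii").decode("unicode_escape").encode("latin-1")

print(len(text), len(payload))
print(payload)
```

The resulting bytes object is what a binary write, for example to a serial port opened in binary mode, would expect.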
