python string replace to \u - python

I have a string.
m = 'I have two element. <U+2F3E> and <U+2F8F>'
I want to m replace to:
m = 'I have two element. \u2F3E and \u2F8F' # utf-8
my code:
import re
p1 = re.compile('<U+\+') # start "<"
p2 = re.compile('>+') # end ">"
m = 'I have two element. <U+2F3E> and <U+2F8F>'
out = pattern.sub('\\u', m) # like: 'I have two element. \u2F3E> and \u2F8F>'
but I get this error message:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
How can I fix it. thanks.

import re
m = 'I have two element. <U+2F3E> and <U+2F8F>'
print(re.sub(r'<U\+(\w+)>', r"\\u\1", m))
# I have two element. \u2F3E and \u2F8F

You can use a single regular expression to find the strings and pull out the part you want to use in the replacement.
The reason you get an error is that '\\u' passes the literal string \u to the regex engine, which tries to parse it as a Unicode character, and fails; the \u needs to be followed by exactly four hex digits to form a valid Unicode code point. But you are still approaching this as if you wanted to replace with a literal string, which as per your clarifying comment is wrong.
import re
m = re.sub(r'<U\+([0-9a-fA-F]{4})>', lambda x: chr(int(x.group(1), 16)), m)
The lambda receives the match object as its argument; x.group(1) pulls out the first parenthesized group, and chr(int(that, 16)) produces the corresponding literal character.
If you actually want to produce the UTF-8 encoding of that, that's easy, too:
>>> re.sub(r'<U\+([0-9a-fA-F]{4})>', lambda x: chr(int(x.group(1), 16)), 'I have two element. <U+2F3E> and <U+2F8F>')
'I have two element. ⼾ and ⾏'
>>> re.sub(r'<U\+([0-9a-fA-F]{4})>', lambda x: chr(int(x.group(1), 16)), 'I have two element. <U+2F3E> and <U+2F8F>').encode('utf-8')
b'I have two element. \xe2\xbc\xbe and \xe2\xbe\x8f'
As you can see, the UTF-8 encoding is a sequence of bytes which do not correspond to printable characters at all. (Well, they could be printed in some other encodings; but then that's just mojibake.)

m.replace('<U+', '\u')
m.replace('>',' ')

Please use the below code, this might help
m = m.replace('<U+', '\\u').replace('>',' ')

Related

Python: How do string variables prevent escape?

>>>m = "\frac{7x+5}{1+y^2}"
>>>print(m)
rac{7x+5}{1+y^2}
>>>print(r""+m)
rac{7x+5}{1+y^2}
>>>print(r"{}".format(m))
rac{7x+5}{1+y^2}
>>>print(repr(m))
'\x0crac{7x+5}{1+y^2}'
I want the result:"\frac{7x+5}{1+y^2}"
Must be a string variable!!!
You need the string literal that contains the slash to be a raw string.
m = r"\frac{7x+5}{1+y^2}"
Raw strings are just another way of writing strings. They aren't a different type. For example r"" is exactly the same as "" because there are no characters to escape, it doesn't produce some kind of raw empty string and adding it to another string changes nothing.
Another option is to add the escape sign to the escape sign to signify that it is a string literal
m = "\\frac{7x+5}{1+y^2}"
print(m)
print(r""+m)
print(r"{}".format(m))
print(repr(m))
A good place to start is to read the docs here. So you can use either the escape character "\" as here
>>> m = "\\frac{7x+5}{1+y^2}"
>>> print(m)
\frac{7x+5}{1+y^2}
or use string literals, which takes the string to be as is
>>> m = r"\frac{7x+5}{1+y^2}"
>>> print(m)
\frac{7x+5}{1+y^2}

Leave only alphanumeric symbols in string in Python?

I am using Python 2.7. On SO I found the following regexp for removing non-word characters:
pat = re.compile('[\W]+', re.UNICODE)
I wrote the next function:
def leave_only_alphanumeric(string):
pat = re.compile('[\W]+', re.UNICODE)
return re.sub(pat,' ',string)
Though on the following string:
kr\xc3\xa9m
it produces the wrong result:
kr\xc3 m
\xa9 was deleted from the string, but should not have been.
You are confusing unicode codepoints and the utf-8 encoding.
The letter you are trying to handle is é, code point u00e9.
It is encoded in utf-8 as two bytes, 0xc3 and 0xa9.
Try:
>>> "kr\xc3\xa9m".decode('utf-8')
u'kr\xe9m'
>>> print("kr\xc3\xa9m")
krém
>>> print(u"kr\xe9m")
krém
With u"" you must use the actual code points. While with raw "", python just sees a chain of bytes.
Note that the second line only works because my terminal's encoding is utf-8, otherwise I'd see garbled output.
As a result, your string is not what you think:
>>> print(u"kr\xc3\xa9m")
krém
You actually entered two characters, with codepoint u00c3 and u00a9. The former is Ã, which is an alpha character and second is ©, which is not and is why your code removes it.
Now playing with your code:
>>> def leave_only_alphanumeric(string):
... pat = re.compile('[\W]+', re.UNICODE)
... return re.sub(pat,' ',string)
...
>>> leave_only_alphanumeric(u"kr\xe9m")
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m") # this is not unicode
'kr\xc3 m' # -> thus the wrong result
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8'))
u'kr\xe9m'
>>> leave_only_alphanumeric("kr\xc3\xa9m".decode('utf-8')).encode('utf-8')
'kr\xc3\xa9m'
>>>
I believe regex might be a bit of an overkill here.
def leave_only_alphanumeric(string):
return ''.join(ch if ch.isalnum() else ' ' for ch in string)
EDIT: Your title says "alphanumeric" but your code removes digits as well. So there is a bit of unclarity.

How to clean the tweets having a specific but varying length pattern?

I pulled out some tweets for analysis. When I separate the words in tweets I can see a lot of following expressions in my output:
\xe3\x81\x86\xe3\x81\xa1
I want to use regular expressions to replace these patterns with nothing. I am not very good with regex. I tried using solution in some similar questions but nothing worked for me. They are replacing characters like "xt" from "extra".
I am looking for something that will replace \x?? with nothing, considering ?? can be either a-f or 0-9 but word must be 4 letter and starting with \x.
Also i would like to add replacement for anything other than alphabets. Like:
"Hi!! my number is (7097868709809)."
after replacement should yield
"Hi my number is."
Input:
\xe3\x81\x86\xe3Extra
Output required:
Extra
What you are seeing is Unicode characters that can't directly be printed, expressed as pairs of hexadecimal digits. So for a more printable example:
>>> ord('a')
97
>>> hex(97)
'0x61'
>>> "\x61"
'a'
Note that what appears to be a sequence of four characters '\x61' evaluates to a single character, 'a'. Therefore:
?? can't "be anything" - they can be '0'-'9' or 'a'-'f'; and
Although e.g. r'\\x[0-9a-f]{2}' would match the sequence you see, that's not what the regex would parse - each "word" is really a single character.
You can remove the characters "other than alphabets" using e.g. string.printable:
>>> s = "foo\xe3\x81"
>>> s
'foo\xe3\x81'
>>> import string
>>> valid_chars = set(string.printable)
>>> "".join([c for c in s if c in valid_chars])
'foo'
Note that e.g. '\xe3' can be directly printed in Python 3 (it's 'ã'), but isn't included in string.printable. For more on Unicode in Python, see the docs.

Easiest way to remove unicode representations from a string in python 3?

I have a string in python 3 that has several unicode representations in it, for example:
t = 'R\\u00f3is\\u00edn'
and I want to convert t so that it has the proper representation when I print it, ie:
>>> print(t)
Róisín
However I just get the original string back. I've tried re.sub and some others, but I can't seem to find a way that will change these characters without having to iterate over each one.
What would be the easiest way to do so?
You want to use the built-in codec unicode_escape.
If t is already a bytes (an 8-bit string), it's as simple as this:
>>> print(t.decode('unicode_escape'))
Róisín
If t has already been decoded to Unicode, you can to encode it back to a bytes and then decode it this way. If you're sure that all of your Unicode characters have been escaped, it actually doesn't matter what codec you use to do the encode. Otherwise, you could try to get your original byte string back, but it's simpler, and probably safer, to just force any non-encoded characters to get encoded, and then they'll get decoded along with the already-encoded ones:
>>> print(t.encode('unicode_escape').decode('unicode_escape')
Róisín
In case you want to know how to do this kind of thing with regular expressions in the future, note that sub lets you pass a function instead of a pattern for the repl. And you can convert any hex string into an integer by calling int(hexstring, 16), and any integer into the corresponding Unicode character with chr (note that this is the one bit that's different in Python 2—you need unichr instead). So:
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', lambda matchobj: chr(int(matchobj.group(0)[2:], 16)), t)
Róisín
Or, making it a bit more clear:
>>> def unescapematch(matchobj):
... escapesequence = matchobj.group(0)
... digits = escapesequence[2:]
... ordinal = int(digits, 16)
... char = chr(ordinal)
... return char
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, t)
Róisín
The unicode_escape codec actually handles \U, \x, \X, octal (\066), and special-character (\n) sequences as well as just \u, and it implements the proper rules for reading only the appropriate max number of digits (4 for \u, 8 for \U, etc., so r'\\u22222' decodes to '∢2' rather than '𢈢'), and probably more things I haven't thought of. But this should give you the idea.
First of all, it is rather confused what you what to convert to.
Just imagine that you may want to convert to 'o' and 'i'. In this case you can just make a map:
mp = {u'\u00f3':'o', u'\u00ed':'i'}
Than you may apply the replacement like:
t = u'R\u00f3is\u00edn'
for i in range(len(t)):
if t[i] in mp: t[i]=mp[t[i]]
print t
I apologize for posting as a second answer, I don't have the reputation to comment on abarnert's solution.
After using his function to process approximately 50K android strings I noticed that there is yet another small improvement possible for certain use-cases.
I changed the + to {1,4} to deal with the case where valid hex characters follow a 4-digit escape.
I also changed int(escapesequence) to read int(digits)
>>> def unescapematch(matchobj):
... escapesequence = matchobj.group(0)
... digits = escapesequence[2:]
... ordinal = int(digits, 16)
... char = unichr(ordinal)
... return char
>>> print re.sub(r'(\\u[0-9A-Fa-f]{1,4})', unescapematch, "Wi\u2011Fi")
Wi‑Fi
>>> print re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, "Wi\u2011Fi")
Traceback (most recent call last):
File "<pyshell#102>", line 1, in <module>
print re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, "Wi\u2011Fi")
File "C:\Python27\lib\re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "<pyshell#99>", line 5, in unescapematch
char = unichr(ordinal)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Encode binary data so that \n is escaped

I'm trying to work out a way to encode/decode binary data in such a way that the new line character is not part of the encoded string.
It seems to be a recursive problem, but I can't seem to work out a solution.
e.g. A naive implementation:
>>> original = 'binary\ndata'
>>> encoded = original.replace('\n', '=n')
'binary=ndata'
>>> decoded = original.replace('=n', '\n')
'binary\ndata'
What happens if there is already a =n in the original string?
>>> original = 'binary\ndata=n'
>>> encoded = original.replace('\n', '=n')
'binary=ndata=n'
>>> decoded = original.replace('=n', '\n')
'binary\ndata\n' # wrong
Try to escape existing =n's, but then what happens if there is already an escaped =n?
>>> original = '++nbinary\ndata=n'
>>> encoded = original.replace('=n', '++n').replace('\n', '=n')
'++nbinary=ndata++n'
How can I get around this recursive problem?
Solution
original = 'binary\ndata \\n'
# encoded = original.encode('string_escape') # escape many chr
encoded = original.replace('\\', '\\\\').replace('\n', '\\n') # escape \n and \\
decoded = encoded.decode('string_escape')
verified
>>> print encoded
binary\ndata \\n
>>> print decoded
binary
data \n
The solution is from How do I un-escape a backslash-escaped string in python?
Edit: I wrote it also with your ad-hoc economic encoding. The original "string_escape" codec escapes backslash, apostrophe and everything below chr(32) and above chr(126). Decoding is the same for both.
The way to encode strings that might contain the "escape" character is to escape the escape character as well. In python, the escape character is a backslash, but you could use anything you want. Your cost is one character for every occurrence of newline or the escape.
To avoid confusing you, I'll use forward slash:
# original
>>> print "slashes / and /newline/\nhere"
slashes / and /newline/
here
# encoding
>>> print "slashes / and /newline/\nhere".replace("/", "//").replace("\n", "/n")
slashes // and //newline///nhere
This encoding is unambiguous, since all real slashes are doubled; but it must be decoded in a single pass, so you can't just use two successive calls to replace():
# decoding
>>> def decode(c):
# Expand this into a real mapping if you have more substitutions
return '\n' if c == '/n' else c[0]
>>> print "".join( decode(c) for c in re.findall(r"(/.|.)",
"slashes // and //newline///nhere"))
slashes / and /newline/
here
Note that there is an actual /n in the input (and another slash before the newline): it all works correctly anyway.
If you encoded the entire string systematically, would you not end up escaping it? Say for every character you do chr(ord(char) + 1) or something trivial like that?
I don't have a great deal of experience with binary data, so this may be completely off/inefficient/both, but would this get around your issue?
In [40]: original = 'binary\ndata\nmorestuff'
In [41]: nlines = [index for index, i in enumerate(original) if i == '\n']
In [42]: encoded = original.replace('\n', '')
In [43]: encoded
Out[43]: 'binarydatamorestuff'
In [44]: decoded = list(encoded)
In [45]: map(lambda x: decoded.insert(x, '\n'), nlines)
Out[45]: [None, None]
In [46]: decoded = ''.join(decoded)
In [47]: decoded
Out[47]: 'binary\ndata\nmorestuff'
Again, I am sure there is a much better/more accurate way - this is just from a novice perspective.
If you are encoding an alphabet of n symbols (e.g. ASCII) into a smaller set of m symbols (e.g. ASCII except newline) you must allow the encoded string to be longer than the original string.
The typical way of doing this is to define one character as an "escape" character; the character following the "escape" represents an encoded character. This technique has been used since the 1940s in teletypewriters; that's where the "Esc" key you see on your keyboard came from.
Python (and other languages) already provide this in strings with the backslash character. Newlines are encoded as '\n' (or '\r\n'). Backslashes escape themselves, so the literal string '\r\n' would be encoded '\\r\\n'.
Note that the encoded length of a string that includes only the escaped character will be double that of the original string. If that is not acceptable you will have to use an encoding that uses a larger alphabet to avoid the escape characters (which may be longer than the original string) or compress it (which may also be longer than the original string).
How about:
In [8]: import urllib
In [9]: original = 'binary\ndata'
In [10]: encoded = urllib.quote(original)
In [11]: encoded
Out[11]: 'binary%0Adata'
In [12]: urllib.unquote(encoded)
Out[12]: 'binary\ndata'
The escapeless encodings are specifically designed to trim off certain characters from binary data. In your case of removing just the \n character, the overhead will be less than 0.4%.

Categories

Resources