Python: Replace "\\uxxxx" with "\uxxxx" [duplicate]

This question already has an answer here:
Python-encoding and decoding using codecs,unicode_escape()
(1 answer)
Closed 4 years ago.
I'm scraping a web page, and I get Unicode characters as raw text.
Instead of getting the "ó" character, I get \u00f3.
It is the same as writing:
>>> print("\\u00f3")
I want to convert "\\u00f3" into "\u00f3" for every Unicode character, i.e.:
"\\uxxxx" -> "\uxxxx"
But if I try to replace \\ with \, the following characters are interpreted as escape characters.
How can I do it?
Applying the following code, I can convert some of the characters:
import re

def raw_to_utf8(matcher):
    string2convert = matcher.group(0)
    return chr(int(string2convert[2:], base=16))

def decode_utf8(text_raw):
    text_raw_re = re.compile(r"\\u[0-9a-ce-z]\w{0,3}")
    return text_raw_re.sub(raw_to_utf8, text_raw)

text_fixed = decode_utf8(text_raw)
As you can see in the regular expression pattern, I have skipped the 'd' character. That is because \udxxx characters can't be converted to UTF-8 by this method or any other. They aren't important characters for me, so it is not a problem.
Thanks for your help.
************************** Solved ********************************
The best solution was solved previously:
Python-encoding and decoding using codecs,unicode_escape()
Thanks for your help.
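For reference, a minimal sketch of that codecs-style unicode_escape approach (assuming the scraped text contains literal backslash-u sequences, as in the question):

```python
# Scraped text containing literal "\uXXXX" sequences (note the doubled backslashes)
raw = "Espa\\u00f1ol \\u00f3"

# unicode_escape operates on bytes, so encode first; backslashreplace keeps
# any real non-ASCII characters representable during the round trip
fixed = raw.encode("latin-1", "backslashreplace").decode("unicode_escape")
print(fixed)  # Español ó
```

One caveat: unicode_escape treats the non-escape bytes as latin-1, so input that already contains multi-byte UTF-8 characters can come out mangled; it is safest on ASCII input.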

First: maybe you are not decoding the webpage with the correct charset. If the web server does not supply the charset, you might have to find it in the meta tags or make an educated guess. Maybe try a couple of common charsets and compare the results.
Second: I played around with strings and decoding for a while and it's really frustrating, but I found a possible solution in format():
s = "\\u00f3"
print('{:c}'.format(int(s[2:], 16)))
Formatting the extracted hex value as a Unicode character seems to work.

You cannot replace '\\' with '\' because '\' is not a valid string literal (the backslash escapes the closing quote).
Convert the hexadecimal expression into a number, then find the corresponding character:
original = '\\u00f3'
char = chr(int(original[2:], base=16))
You can check that this gives the desired result:
assert char == '\u00f3'


How to get the Unicode character from a code point variable? [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 4 years ago.
I have a variable which stores the string "u05e2" (The value is constantly changing because I set it within a loop). I want to print the Hebrew letter with that Unicode value. I tried the following but it didn't work:
>>> a = 'u05e2'
>>> print(u'\{}'.format(a))
I got \u05e2 instead of ע (in this case).
I also tried to do:
>>> a = 'u05e2'
>>> b = '\\' + a
>>> print(u'{}'.format(b))
Neither one worked. How can I fix this?
Thanks in advance!
This seems like an X-Y Problem. If you want the Unicode character for a code point, use an integer variable and the function chr (or unichr on Python 2) instead of trying to format an escape code:
>>> for a in range(0x5e0,0x5eb):
... print(hex(a),chr(a))
...
0x5e0 נ
0x5e1 ס
0x5e2 ע
0x5e3 ף
0x5e4 פ
0x5e5 ץ
0x5e6 צ
0x5e7 ק
0x5e8 ר
0x5e9 ש
0x5ea ת
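If the code point arrives as text without the backslash, as in the question, the same chr/int approach applies directly (a sketch, assuming the 'u05e2' form from the question):

```python
a = 'u05e2'                # code point as plain text, no backslash
ch = chr(int(a[1:], 16))   # strip the 'u', parse the hex digits
print(ch)  # ע
```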
All you need is a \ before u05e2. To print a Unicode character, you must provide a Unicode format string.
a = '\u05e2'
print(u'{}'.format(a))
#Output
ע
When you instead add the \ inside the print() call's format string, the escape is never processed: escape sequences are interpreted only when a string literal is parsed, not at format time, so you get a literal backslash in the output.
a = 'u05e2'
print(u'\{}'.format(a))
#Output
\u05e2
A way to verify the validity of Unicode strings is the ord() built-in function in the Python standard library. It returns the Unicode code point (an integer) of the character passed to it. This function expects a single character, i.e. a string of length one.
a = '\u05e2'
print(ord(a)) #1506, the Unicode code point for the Unicode string stored in a
To print the Unicode character for the above Unicode code point (1506), use the character type formatting with 'c'. This is explained in the Python docs.
print('{0:c}'.format(1506))
#Output
ע
If we pass a plain string like 'u05e2' to ord(), we get an error, because that string is five characters long and does not represent a single character.
a = 'u05e2'
print(ord(a))
#Error
TypeError: ord() expected a character, but string of length 5 found
This is happening because you have to add the u prefix outside of the string (and the backslash inside it).
a = u'\u05e2'
print(a)
ע
Hope this helps you.

Same character, different length and bytes [duplicate]

This question already has answers here:
python different length for same unicode
(2 answers)
Closed 5 years ago.
When downloading files from Korean websites, the filenames are often wrongly encoded/decoded and end up all jumbled. I found out that by encoding with 'iso-8859-1' and decoding with 'euc-kr', I can fix this problem. However, I have a new problem where the same-looking character is, in fact, different. Check out the Python shell below:
>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>
Encoding the first string with 'iso-8859-1' is possible. The latter is not. So the question:
What is the difference between these two strings?
Why would downloads from the same website have the same character in varying format? (If that's what the difference is.)
And how can I fix this? (e.g. convert second_string to the likeness of first_string)
Thank you.
An easy way to find out exactly what a character is is to ask vim. Put the cursor over a character and type ga to get info on it.
The first one is:
<â> 226, Hex 00e2, Octal 342
And the second:
<a> 97, Hex 61, Octal 141 < ̂> 770, Hex 0302, Octal 1402
In other words, the first is a complete "a with circumflex" character, and the second is a regular a followed by a circumflex combining character.
Ask the website operators. How would we know?!
You need something which turns combining characters into regular characters. A Google search yielded this question, for example.
As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
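A minimal Python 3 sketch of that normalization:

```python
import unicodedata

decomposed = 'a\u0302'   # 'a' + COMBINING CIRCUMFLEX ACCENT, length 2
composed = unicodedata.normalize('NFC', decomposed)
print(len(composed), composed)  # 1 â
```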
There are different representations for accents and diaereses in Unicode. There is a single character at code point U+00E2, and there is the sequence of a regular 'a' followed by COMBINING CIRCUMFLEX ACCENT (U+0302), created by u'a\u0302' in Python 2.7; it consists of two characters: the 'a' and the circumflex.
A possible reason for the different representations is that the creator of the website copied the texts from different sources. For example, PDF documents often represent umlauts and accented letters as a base letter plus a combining character, while typing these characters on a keyboard generally produces single-character representations.
You may use unicodedata.normalize to convert the combining characters into single characters, e.g.:
from unicodedata import normalize
s = u'a\u0302'
print s, len(s), len(normalize("NFC", s))
will output â 2 1.

How to add the 'u' prefix to a list or string?

I've been having some unicode issues and realized (a bit too late, admittedly) that adding the 'u' prefix to a string did the trick:
print (u'No\xebl')
Noël
However, I am working with a lot of strings and lists of strings, so I need to add that prefix to each one (say, I want to add "u" to "string", with string = 'No\xebl'). I've tried different ways:
print "u"+"'"+string
print unicode(string)
print "u" + string
print repr(unicode(m)) #Doing so does add the prefix 'u', but adds an extra "\" to the string and no longer fixes the problem
u'No\xebl'
The list goes on but you get the gist. Basically, I was wondering if there was a way to do exactly the same as print (u'No\xebl'), but with any variable string without having to actually write the string down.
Any suggestion would be greatly appreciated!
\xeb encodes ë in ISO 8859-1. To convert from bytes to a Unicode string, use the .decode() method.
string.decode('iso-8859-1')
That being said: where is this data coming from? Do you know it's always ISO 8859-1, or may it be encoded differently? Why is it in bytes instead of a Unicode string already? Answers to these questions may allow for better solutions.
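A sketch of that decode step, written with an explicit bytes literal so it behaves the same under Python 2 and 3:

```python
raw = b'No\xebl'                 # bytes as received, ISO 8859-1 encoded
text = raw.decode('iso-8859-1')  # now a proper Unicode string
print(text)  # Noël
```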
Do this:
tempvariabel = u''
realvariabel = tempvariabel + somevariabelcontainUnicode
It raises an error when you try to print it, but it's ready to write to a file or database.

replace or delete specific unicode characters in python

There seem to be a lot of posts about doing this in other languages, but I can't seem to figure out how in Python (I'm using 2.7).
To be clear, I would ideally like to keep the string in unicode, just be able to replace certain specific characters.
For instance:
thisToken = u'tandh\u2013bm'
print(thisToken)
prints the word with the en dash (U+2013) in the middle. I would just like to delete that dash (but not using indexing, because I want to be able to do this anywhere I find these specific characters).
I try using replace like you would with any other character:
newToke = thisToken.replace('\u2013','')
print(newToke)
but it just doesn't work. Any help is much appreciated.
Seth
The string you're searching for to replace must also be a Unicode string. Try:
newToke = thisToken.replace(u'\u2013','')
You can see the answer in this post: How to replace unicode characters in string with something else python?
Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "")
Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "").encode("utf-8")
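The same chain in a runnable Python 3 sketch (using an explicit bytes literal, and a variable name other than str, which shadows the built-in):

```python
raw_bytes = b'caf\xc3\xa9 \xe2\x80\xa2 bar'  # UTF-8 bytes containing é and • (U+2022)

# decode to str, drop the bullet, re-encode to UTF-8 bytes
cleaned = raw_bytes.decode('utf-8').replace('\u2022', '').encode('utf-8')
print(cleaned.decode('utf-8'))  # café  bar
```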

Python get ASCII characters [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 8 years ago.
I'm retrieving data from the internet and I want it to convert it to ASCII. But I can't get it fixed. (Python 2.7)
When I use decode('utf-8') on the strings I get for example Yalçınkaya. I want this however converted to Yalcinkaya. Raw data was Yalçınkaya.
Anyone who can help me?
Thanks.
Edit: I've tried the suggestion that was made by the user who marked this question as duplicate (What is the best way to remove accents in a Python unicode string?) but that did not solve my problem.
That post mainly talks about removing the special characters, and it did not solve my problem of replacing the Turkish characters (Yalçınkaya) with their ASCII equivalents (Yalcinkaya).
# Printing the raw string in Python results in "Yalçınkaya".
# When applying unicode to utf8 the string changes to 'Yalçınkaya'.
# HTMLParser is used to revert special characters such as commas.
# NFKD normalize is used, which converts the string to 'Yalçınkaya'.
# Applying ASCII encoding results in 'Yalcnkaya', missing the original Turkish 'i', which is not what I wanted.
import unicodedata
import HTMLParser

name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' % name).encode('ascii', 'ignore')
Let's check - first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it's more like breakage. Think about trying to parse numbers, but since you don't understand the digit "9", you decide just to skip it.)
That said - you can tell Python to "decode" a string and replace the characters unknown to the chosen encoding with a proper "unknown" character (u"\ufffd") - you can then replace that character before re-encoding it to your preferred output: raw_data.decode("ASCII", errors="replace"). If you choose to break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will just be suppressed. Remember you get a "unicode" object after decoding - you have to apply the "encode" method to it before outputting that data anywhere (printing, writing to a file, etc.) - please read the article linked above.
Now - checking your specific data - the particular Yalçınkaya is exactly raw UTF-8 text that looks as though it were encoded in latin-1. Just decode it from utf-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "world text" from the Internet may contain all kinds of characters - you should not rely on everything being convertible to ASCII. I have to say it again: read that article, and rethink your practices.
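To illustrate that last point with the example name: NFKD plus encode('ascii', 'ignore') handles ç, but the Turkish dotless ı (U+0131) has no Unicode decomposition and would simply be dropped - the 'Yalcnkaya' problem from the question. A hedged sketch (Python 3 syntax; the explicit ı→i mapping is my own addition, not part of the answer above):

```python
import unicodedata

name = 'Yal\u00e7\u0131nkaya'   # 'Yalçınkaya'

# Map the dotless i by hand first; it has no decomposition,
# so NFKD + ASCII 'ignore' would silently drop it otherwise.
name = name.translate({ord('\u0131'): 'i'})

ascii_name = (unicodedata.normalize('NFKD', name)
              .encode('ascii', 'ignore')
              .decode('ascii'))
print(ascii_name)  # Yalcinkaya
```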
