Python - Unicode & double backslashes [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I scraped a webpage with BeautifulSoup.
I got great output except parts of the list look like this after getting the text:
list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
My question now is how to get rid of these double backslashes and replace them with the special characters they represent.
If I print the first element of the example list, the output looks like
print list[0]
that\u2019s
I have already read a lot of other questions/threads about this topic, but I ended up even more confused, as I am a beginner when it comes to unicode/encoding/decoding.
I hope that someone could help me with this issue.
Thanks!
MG

Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method, using the special codec "unicode_escape". It "sees" the "physical" backslashes and decodes those sequences into proper unicode characters:
data = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
result = [part.decode('unicode_escape') for part in data]
Anyone getting here using Python 3: in that version you cannot apply the "decode" method to the str objects delivered by BeautifulSoup; one has to first re-encode those to byte-string objects, and then decode with the unicode_escape codec. For these purposes it is useful to use the latin1 codec as the transparent encoding: all bytes in the str object are preserved in the new bytes object:
result = [part.encode('latin1').decode('unicode_escape') for part in data]
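For example, on Python 3 the same list from the question round-trips like this (a quick sketch, not part of the original answer):
data = ['that\\u2019s', 'it\\u2019ll', 'It\\u2019s']
result = [part.encode('latin1').decode('unicode_escape') for part in data]
print(result[0])  # prints: that’s  (the \u2019 escape is now a real ’ character)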

The problem here is that the site ended up double-encoding those unicode characters; just do the following:
ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
ls = map(lambda x: x.decode('unicode-escape'), ls)
Now you have a list of properly decoded unicode strings:
for a in ls:
    print a

Related

Python - Decode string already in bytes [duplicate]

This question already has answers here:
How to determine the encoding of text
(16 answers)
Closed 3 years ago.
I'm trying to make a Python program read a word produced by another Python program, which was encoded to UTF-8 and saved in a txt file.
For example, the string it gets might be:
b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
but this is a normal string, as if it had been written like this:
word_string = "b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'"
How do I make the script see this is a bytes string and not a normal string? I know this can be done like
word_bytes = b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
but if I have the content of that variable 'word_bytes' already written in a file, how can I read it and make the program understand that it just has to decode it? When I try to decode it, I get an error saying it's a string and can't be decoded. Any help?
Thanks in advance!
UPDATE: Just to note here for anyone who reads the string from a file, at least on Windows (I'm using Windows 7): with tripleee's answer, encoding will put double backslashes on the bytes part, and decoding will just remove one of the backslashes, putting it back as it was before. So the way to get it from a file and decode it is the following:
# the escape text between the quotes was read from the file with open(file, "r") in my case
s = '\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'.encode().decode('unicode_escape')
s.encode('latin-1').decode('utf-8')  # or ISO-8859-1, as it seems it's the same thing
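Put together, reading the escape text from a file looks roughly like this (a sketch assuming Python 3 and a made-up word.txt filename):
# word.txt contains the characters \xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc as plain text
with open("word.txt", "r") as f:
    raw = f.read().strip()
s = raw.encode().decode('unicode_escape')    # turn the \xNN escape text into characters
print(s.encode('latin-1').decode('utf-8'))   # prints: форум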
EDIT: tripleee's answer is almost what I wanted to know (about half is still missing), but it already gives me a way, so thank you! But how could I do it without knowing the encoding (in this case I didn't know the encoding was latin-1, and I can't try every encoding)? Ideally I would do it the way I would by just putting a b before the bytes string, as in the 'word_bytes' variable (possibly it would pick the right encoding automatically?) - I wanted to do that too, but with a function applied to a variable that already holds the bytes part.
If you have the bytes in a variable already, you are all set. If you have the bytes in a string, I'm assuming you basically have a sequence of characters where the code point value of each is equivalent to the byte value it's supposed to hold. This happens to be the definition of the Latin-1 encoding - it feels a bit dirty, but the trick is to encode your string as Latin-1, then decode back as UTF-8.
>>> s = '\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
>>> s.encode('latin-1').decode('utf-8')
'форум'
You can identify whether the value is an ordinary string or a bytes string using:
def identifystring(string):
    if isinstance(string, str):
        print("ordinary string")
    elif isinstance(string, bytes):   # on Python 3; a Python 2 version would check `unicode` here
        print("bytes string")
    else:
        print("no string")

Python get ASCII characters [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 8 years ago.
I'm retrieving data from the internet and I want to convert it to ASCII, but I can't get it fixed. (Python 2.7)
When I use decode('utf-8') on the strings, I get for example Yalçınkaya. However, I want this converted to Yalcinkaya. The raw data was Yalçınkaya.
Anyone who can help me?
Thanks.
Edit: I've tried the suggestion that was made by the user who marked this question as duplicate (What is the best way to remove accents in a Python unicode string?) but that did not solve my problem.
That post mainly talks about removing special characters, and it did not solve my problem of replacing the Turkish characters (Yalçınkaya) with their ASCII counterparts (Yalcinkaya).
# Printing the raw string in Python results in "Yalçınkaya".
# When applying unicode to utf8 the string changes to 'Yalçınkaya'.
# HTMLParser is used to revert special characters such as commas
# NFKD normalize is used, which converts the string to 'Yalçınkaya'.
# Applying ASCII encoding results in 'Yalcnkaya', missing the original Turkish 'ı', which is not what I wanted.
name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' %name).encode('ascii', 'ignore')
Let's check: first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it's more of a botch. Think of trying to parse numbers but, since you don't understand the digit "9", deciding just to skip it.)
That said, you can tell Python to "decode" a string and just replace the characters that are unknown in the chosen encoding with a proper "unknown" character (u"\ufffd"); you can then replace that character before re-encoding it for your preferred output: raw_data.decode("ASCII", "replace"). If you choose to break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will simply be suppressed. Remember that you get a "unicode" object after decoding - you have to apply the "encode" method to it before outputting that data anywhere (printing, writing to a file, etc.) - please read the article linked above.
Now, checking your specific data: the particular "Yalçınkaya" is raw UTF-8 text that has been displayed as if it were latin-1. Just decode it from utf-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "world text" from the Internet may contain all kinds of characters - you should not rely on everything being convertible to ASCII. I have to say it again: read that article and rethink your practices.
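A minimal Python 2 sketch of that recipe, using the name from the question; note how the dotless ı is simply dropped, exactly as the caveat above warns:
# -*- coding: utf-8 -*-
import unicodedata

raw = 'Yal\xc3\xa7\xc4\xb1nkaya'    # the UTF-8 bytes for u'Yalçınkaya'
name = raw.decode('utf-8')          # a proper unicode object
ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore')
print ascii_name                    # prints Yalcnkaya - the dotless ı has no ASCII decomposition, so 'ignore' drops it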

What does 'u' before a string in python mean? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What does the ‘u’ symbol mean in front of string values?
I have this json text file:
{
"EASW_KEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"EASF_SESS":" aaaaaaaaaaaaaaaaaaaaaaa",
"PHISHKEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"XSID":" aaaaaaaaaaaaaaaaaaaaaaa"
}
If I decode it like this:
import json
data = open('data.txt').read()
data = json.loads(data)
print data['EASW_KEY']
I get this:
u' aaaaaaaaaaaaaaaaaaaaaaa'
instead of simply:
aaaaaaaaaaaaaaaaaaaaaaa
What does this u mean?
What does 'u' before a string in python mean?
The letter prefixing a Python string literal is a string prefix. The available prefixes are described with the string and bytes literals in the lexical analysis chapter of the reference: https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
The u stands for unicode. Alternative letters in that slot can be r'foobar' for raw string and b'foobar' for byte string.
Some help on what Python considers unicode:
http://docs.python.org/howto/unicode.html
Then to explore the nature of this, run this command:
type(u'abc')
returns:
<type 'unicode'>
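And, similarly, for the other prefixes mentioned above (a quick Python 2 sketch, not part of the original answer):
print type(u'foobar')   # <type 'unicode'> - unicode string
print type(b'foobar')   # <type 'str'>     - byte string (in Python 2, b'' is just a plain str)
print r'foo\nbar'       # foo\nbar         - raw string: the backslash is kept literally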
If you use unicode when you should be using ascii, or ascii when you should be using unicode, you are very likely going to run into errors that bubble up from deep in the implementation once users start "exploring the unicode space" at runtime.
For example, if you pass a unicode string to facebook's api.fql(...) function, which says it accepts unicode, it will happily process your request and then later return undefined results as if the call had succeeded, while it pukes all over the carpet.
As defined in this post:
FQL multiquery from python fails with unicode query
Some unicode characters are going to cause your code in production to have a seizure and then puke all over the floor, so make sure you, and your customers, can tolerate that.
The u prefix in the repr of a unicode object indicates that it's a unicode object.
You should not be seeing this if you're really just printing the unicode object itself, and not, say, a list containing it. (and, in fact, running your code doesn't actually print it like this.)
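For instance, with the question's data.txt (a Python 2 sketch; containers display the repr of their items, which is where the u'' comes from):
import json

data = json.loads(open('data.txt').read())
print data['EASW_KEY']          #  aaaaaaaaaaaaaaaaaaaaaaa  (plain text, no u prefix)
print repr(data['EASW_KEY'])    # u' aaaaaaaaaaaaaaaaaaaaaaa'
print [data['EASW_KEY']]        # [u' aaaaaaaaaaaaaaaaaaaaaaa']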

Interpret "plain text" as utf-8 text in python [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I have a text file with text that should have been interpreted as utf-8 but wasn't (it was given to me this way).
Here is an example of a typical line of the file:
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
which should have been:
ロンドン在住
Now, I can do it manually on python by typing the following in the command line:
>>> h1 = u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'
>>> print h1
ロンドン在住
which gives me what I want. Is there a way that I can do this automatically? I've tried doing stuff like this
>>> f = codecs.open('testfile.txt', encoding='utf-8')
>>> h = f.next()
>>> print h
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
I've also tried with the 'encode' and 'decode' functions, any ideas?
Thanks!
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f is not UTF-8; it's using the Python unicode-escape format. Decode it with the unicode_escape codec instead:
>>> print '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape')
ロンドン在住
Here is the UTF-8 encoding of the above phrase, for comparison:
>>> '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape').encode('utf-8')
'\xe3\x83\xad\xe3\x83\xb3\xe3\x83\x89\xe3\x83\xb3\xe5\x9c\xa8\xe4\xbd\x8f'
Note that the data decoded with unicode_escape are treated as Latin-1 for anything that's not a recognised Python escape sequence.
Be careful however; it may be you are really looking at JSON-encoded data, which uses the same notation for specifying character escapes. Use json.loads() to decode actual JSON data; JSON strings with such escapes are delimited with " quotes and are usually part of larger structures (such as JSON lists or objects).
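To do this automatically for the whole file (a sketch assuming Python 2 and the testfile.txt from the question), decode each line with the same codec:
h = open('testfile.txt')
for line in h:
    print line.rstrip('\n').decode('unicode_escape')   # each \uXXXX escape is expanded to its character
h.close()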

Unicode HTML Conversion to ASCII in Python [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Unescaping Characters in a String with Python
I have a string of unicode HTML in Python which begins with: \u003ctable>\u003ctr
I need to convert this to ascii so I can then parse it with BeautifulSoup. However, Python's encode and decode functions seem to have no effect; I get the original string no matter what I try. I'm new to Python and unicode in general, so help would be much appreciated.
Use
s.decode("unicode-escape")
to decode the html data first (no idea where this mangled data comes from).
I have no clue what you're talking about. I suspect that I'm not the only one.
>>> s = BeautifulSoup.BeautifulSoup(u'<html><body>\u003ctable>\u003ctr</body></html>')
>>> s
<html><body><table><tr></tr></table></body></html>
