Unicode HTML Conversion to ASCII in Python [duplicate] - python

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Unescaping Characters in a String with Python
I have a string of unicode HTML in Python which begins with: \u003ctable>\u003ctr
I need to convert this to ascii so I can then parse it with BeautifulSoup. However, Python's encode and decode functions seem to have no effect; I get the original string no matter what I try. I'm new to Python and unicode in general, so help would be much appreciated.

Use
s.decode("unicode-escape")
to decode the HTML data first (it's unclear how the data ended up escaped like this in the first place).

I have no clue what you're talking about. I suspect that I'm not the only one.
>>> s = BeautifulSoup.BeautifulSoup(u'<html><body>\u003ctable>\u003ctr</body></html>')
>>> s
<html><body><table><tr></tr></table></body></html>
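In Python 3, where str objects no longer have a .decode method, the same unescaping can be sketched via the codecs module (a minimal sketch; the sample string is taken from the question):

```python
import codecs

# A string containing literal \uXXXX escape sequences, as in the question.
raw = "\\u003ctable>\\u003ctr"

# Round-trip through bytes so the unicode_escape codec can process it.
unescaped = codecs.decode(raw.encode("latin1"), "unicode_escape")
print(unescaped)  # <table><tr
```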

Related

Python regex and Unicode [duplicate]

This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 5 years ago.
I am currently trying to figure out how to use Unicode in a regex in Python.
The regex I want to get to work is the following:
r"([A-ZÜÖÄß]+\s)+"
This should match all occurrences of runs of capitalized words that may or may not contain umlauts. Funnily enough, it does nearly what I want, but it still ignores umlauts.
For example, in FUßBALL AND MORE only BALL AND MORE are detected.
I already tried simply using the Unicode escape representations (Ü becomes \u00DC etc.), as advised in another thread, but that does not work either. I could try the "regex" library instead of "re", but I'd rather understand what I am doing wrong right now.
If you are able to enlighten me, please feel free to do so.
Use Unicode strings. Make sure your source is saved in the declared encoding:
#coding:utf8
import re
for s in re.finditer(ur"[A-ZÜÖÄß]+",u"FUßBALL AND MORE"):
    print s.group()
Output:
FUßBALL
AND
MORE
Without Unicode strings, your byte strings are in the encoding of your source file. If that is UTF-8, non-ASCII characters are multi-byte. You will still have problems with Unicode strings in a narrow Python build, but only if you use Unicode codepoints above U+FFFF (such as emoji), as they will be stored as UTF-16 surrogate pairs (two code units). In that case, switch to a recent Python 3.x, where the problem was solved and every codepoint has a length of 1.
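A minimal Python 3 check of the same idea (Python 3 strings are Unicode by default, so no u prefix or coding declaration is needed):

```python
import re

# Character class listing the German umlauts explicitly, as in the question.
pattern = re.compile(r"[A-ZÜÖÄß]+")
words = pattern.findall("FUßBALL AND MORE")
print(words)  # ['FUßBALL', 'AND', 'MORE']
```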

Python - Unicode & double backslashes [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I scraped a webpage with BeautifulSoup.
I got great output except parts of the list look like this after getting the text:
list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
My question now is how to get rid or replace these double backslashes with the special characters they are.
If I print the first element of the example list, the output looks like this:
print list[0]
that\u2019s
I already read a lot of other questions / threads about this topic but I ended up being even more confused, as I am a beginner considering unicode / encoding / decoding.
I hope that someone could help me with this issue.
Thanks!
MG
Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method - using the special codec "unicode_escape". It "sees" the "physical" backslashes and decodes those sequences into proper unicode characters:
data = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
result = [part.decode('unicode_escape') for part in data]
Anyone getting here using Python 3: in that version you cannot apply the "decode" method to the str objects delivered by BeautifulSoup - you have to first re-encode those to byte-string objects, and then decode them with the unicode_escape codec. For this purpose it is useful to use the latin1 codec as the transparent encoding: all code points below 256 in the str object are preserved as the corresponding bytes in the new bytes object:
result = [part.encode('latin1').decode('unicode_escape') for part in data]
The problem here is that the site ended up double-encoding those unicode escapes; just do the following:
ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
ls = map(lambda x: x.decode('unicode-escape'), ls)
now you have a list of properly decoded unicode strings:
for a in ls:
    print a
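A runnable Python 3 version of the latin1 round-trip mentioned in the earlier answer (the sample data is taken from the question):

```python
data = ['that\\u2019s', 'it\\u2019ll', 'don\\u2019t']

# Re-encode each str to bytes via latin1 (which preserves every code point
# below 256), then let unicode_escape interpret the \uXXXX sequences.
result = [part.encode('latin1').decode('unicode_escape') for part in data]
print(result)  # curly apostrophes: that’s, it’ll, don’t
```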

Python get ASCII characters [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 8 years ago.
I'm retrieving data from the internet and I want it to convert it to ASCII. But I can't get it fixed. (Python 2.7)
When I use decode('utf-8') on the strings I get for example Yalçınkaya. I want this however converted to Yalcinkaya. Raw data was Yalçınkaya.
Anyone who can help me?
Thanks.
Edit: I've tried the suggestion that was made by the user who marked this question as duplicate (What is the best way to remove accents in a Python unicode string?) but that did not solve my problem.
That post mainly talks about removing the special characters, and it did not solve my problem of replacing the Turkish characters in Yalçınkaya with their ASCII equivalents (Yalcinkaya).
# Printing the raw string in Python results in "Yalçınkaya".
# When applying unicode to utf8 the string changes to 'Yalçınkaya'.
# HTMLParser is used to revert special characters such as commas
# NFKD normalize is used, which converts the string to 'Yalçınkaya'.
# Applying ASCII encoding results in 'Yalcnkaya', with the Turkish dotless 'ı' dropped, which is not what I wanted.
name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' %name).encode('ascii', 'ignore')
Let's check - first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it's more like breakage. Think about trying to parse numbers, but since you don't understand the digit "9", you decide just to skip it.)
That said - you can tell Python to "decode" a string and replace the characters that are unknown in the chosen encoding with a proper "unknown" character (u"\ufffd") - you can then replace that character before re-encoding it to your preferred output: raw_data.decode("ASCII", errors="replace"). If you choose to break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will just be suppressed. Remember you get a "unicode" object after decoding - you have to apply the "encode" method to it before outputting that data anywhere (printing, recording to a file, etc.) - please read the article linked above.
Now - checking your specific data - the mojibake Yalçınkaya is exactly what you get when raw UTF-8 bytes are misread as latin-1. Just decode it from utf-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "world text" from the Internet may contain all kinds of characters - you should not rely on everything being convertible to ASCII. I have to say again: read that article, and rethink your practices.
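A Python 3 sketch of the NFKD recipe this answer describes. Note that the Turkish dotless ı has no Unicode decomposition, so NFKD plus 'ignore' silently drops it (which is exactly the OP's 'Yalcnkaya' problem); the small fallback table below is my own illustrative assumption, not part of the answer:

```python
import unicodedata

def to_ascii(text):
    # Handle characters that have no decomposition; this tiny mapping
    # is an assumption for illustration - extend it as needed.
    text = text.translate(str.maketrans({"ı": "i", "ß": "ss"}))
    # Decompose accented letters (ç -> c + combining cedilla), then drop
    # the combining marks via the ascii/'ignore' round trip.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Yalçınkaya"))  # Yalcinkaya
```

A library such as unidecode covers far more of these no-decomposition cases than any hand-written table.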

How to find right encoding in python? [duplicate]

This question already has answers here:
How to determine the encoding of text
(16 answers)
Closed 5 years ago.
I'm trying to get rid of diacritics in my text file. I converted a PDF to text with a tool not made by myself, and I wasn't able to find out which encoding it uses. The text is written in Nahuatl, orthographically similar to Spanish.
I transformed the text into a list of strings. Now I'm trying to do the following:
# check whether there is a non-ascii character in the item
def is_ascii(word):
    check = string.ascii_letters + "."
    if word not in check:
        return False
    return True

# if there is a non-ascii character, encode the string
def to_ascii(word):
    if is_ascii(word) == False:
        newWord = word.encode("utf8")
        return newWord
    return word
What I want to get is a unicode version of my string. It doesn't work so far, and I tried several encodings like latin1, cp1252 and iso-8859-1. Can anybody tell me what I did wrong?
How can I find out the right encoding?
Thank you!
EDIT:
I wrote to the people that developed the converter (pdf-txt) and they said they were using unicode already. So John Machin was right with (1) in his answer.
As I wrote in some comment, that wasn't clear to me, because in the Eclipse debugger the list itself showed some characters as unicode escapes and others not. And if I looked at the items separately they were all decoded in some way, so that I actually saw unicode.
Thank you for your help!
Edit your question to show the version of Python you are using. Guessing the version from your code is not possible. Whether you are using Python 3.X or 2.X matters a lot. Following remarks assume Python 2.x.
You already seem to have determined that you have UTF-8 encoded text. Try the_text.decode('utf8'). Note decode, NOT encode.
If decoding with UTF-8 does not raise UnicodeDecodeError and your text is not trivially short, then it is very close to certain that UTF-8 is the correct encoding.
If the above does not work, show us the result of print repr(the_text).
Note that it is counter-productive trying to check whether the file is encoded in ASCII -- ASCII is a subset of UTF-8. Leaving some data as str objects and others as unicode is messy in Python 2.x and won't work in Python 3.X.
In any case, your first function doesn't do what you think it does; it returns False for any input string whose length is 2 or more. Please consider unit-testing functions as you write them; it makes debugging much faster later on.
Note that latin1 and iso-8859-1 are the same encoding. Since latin1 encodes the first 256 codepoints of Unicode in the same order, it is impossible for text.decode('latin1') to raise UnicodeDecodeError. "No error" in this case has exactly zero diagnostic value.
Update in response to this comment from OP:
I use Python 2.7. If I use text.decode("utf8") it raises the following
error: UnicodeEncodeError: 'latin-1' codec can't encode character
u'\u2014' in position 0: ordinal not in range(256).
That can happen two ways:
(1) In a single statement like foo = text.decode('utf8'), text is already a unicode object so Python 2.X tries to encode it using the default encoding (latin-1 ???).
(2) Possibly in two different statements, first foo = text.decode('utf8') where text is an str object encoded in UTF-8, and this statement doesn't raise an error, followed by something like print foo and your sys.stdout.encoding is latin-1 (???).
I can't imagine why you have "ticked" my answer as correct. Nobody knows what the question is yet!
Please edit your question to show your code (insert print repr(text) just before the text.decode("utf8") line), and the result of running it. Show the repr() result and the full traceback (so that we can determine which line is causing the error).
I ask again: can you make your file available for analysis?
By the way, u'\u2014' is an "EM DASH" and is a valid character in cp1252 (but not in latin-1, as you have seen from the error message). What version of what operating system are you using?
And to answer your last question, NO, you must NOT attempt to decode your text using every codec in the known universe. You are ALREADY getting plausible Unicode; something (your code?) is decoding something somehow -- the presence of u'\u2014' is enough evidence of that. Just show us your code and its results.
If you have read some bytes and want to interpret them as a unicode string, then you have to use .decode() rather than .encode().
Like #delnan said in the comment, I hope you know the encoding. If not, the guesswork should go easy once you fix the function used.
BTW even if there are only ASCII characters in that word, why not .decode() it too? You'd have the same data type (unicode) everywhere, which will make your program simpler.
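The asymmetry these answers describe (UTF-8 validates its input, latin-1 never fails) can be demonstrated in Python 3:

```python
data = "café".encode("utf-8")   # b'caf\xc3\xa9'

# UTF-8 is self-validating: decoding bytes that are not valid UTF-8 raises.
print(data.decode("utf-8"))     # café

# latin-1 maps every byte 0x00-0xFF to some code point, so it never raises;
# it silently produces mojibake for text that was really UTF-8.
print(data.decode("latin-1"))   # cafÃ©

try:
    b"\xff\xfe".decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

This is why a successful latin1 decode proves nothing, while a successful UTF-8 decode of non-trivial text is strong evidence of the encoding.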

What does 'u' before a string in python mean? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What does the ‘u’ symbol mean in front of string values?
I have this json text file:
{
"EASW_KEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"EASF_SESS":" aaaaaaaaaaaaaaaaaaaaaaa",
"PHISHKEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"XSID":" aaaaaaaaaaaaaaaaaaaaaaa"
}
If I decode it like this:
import json
data = open('data.txt').read()
data = json.loads(data)
print data['EASW_KEY']
I get this:
u' aaaaaaaaaaaaaaaaaaaaaaa'
instead of simply:
aaaaaaaaaaaaaaaaaaaaaaa
What does this u mean?
What does 'u' before a string in python mean?
The character prefixing a python string literal is called a string prefix. You can read all about them in the language reference: https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
The u stands for unicode. Alternative letters in that slot can be r'foobar' for raw string and b'foobar' for byte string.
Some help on what Python considers unicode:
http://docs.python.org/howto/unicode.html
Then to explore the nature of this, run this command:
type(u'abc')
returns:
<type 'unicode'>
If you use unicode when you should be using ascii, or ascii when you should be using unicode, you are very likely going to encounter bubbleup implementationitis when users start "exploring the unicode space" at runtime.
For example, if you pass a unicode string to facebook's api.fql(...) function, which says it accepts unicode, it will happily process your request, and then later on return undefined results as if the function succeeded as it pukes all over the carpet.
As defined in this post:
FQL multiquery from python fails with unicode query
Some unicode characters are going to cause your code in production to have a seizure then puke all over the floor. So make sure you're okay and customers can tolerate that.
The u prefix in the repr of a unicode object indicates that it's a unicode object.
You should not be seeing this if you're really just printing the unicode object itself, and not, say, a list containing it. (and, in fact, running your code doesn't actually print it like this.)
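The repr-vs-print distinction this answer describes can be seen directly in Python 3, where the u prefix is still accepted but redundant:

```python
s = u'aaa'                 # in Python 3, u'...' is the same type as '...'
print(type(s).__name__)    # str
print(repr(s))             # 'aaa'  (Python 2's repr would show u'aaa')
print(s)                   # aaa   (no quotes, no prefix)
print(repr([s]))           # ['aaa']  (containers always show the repr)
```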
