How to find accented characters in a string in Python?

I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.
I've tried my best to accomplish this, but have had no luck getting it right. Below is one of the solutions I tried, but it clearly gave the wrong answer.
sentence = '¿Qué tipo es el?' # str, as read with the standard open() file method
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False
I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work.
I'm out of ideas here, any suggestions?
@icktoofay provided the solution. I ended up keeping the decoding of the file (using latin-1), but then using a Python Unicode literal for the characters (u'é'). This required me to declare the source encoding at the top of the script. The final step was to use the unicodedata.normalize method to normalize both strings, then compare accordingly. Thank you guys for the prompt and great support.

Use unicodedata.normalize on the string before checking.
Explanation
Unicode offers multiple ways to write some characters. For example, á could be represented as a single character, á, or as two characters: a, then 'put a ´ on top of that'. Normalizing the string forces it into one representation or the other. (Which representation it normalizes to depends on what you pass as the form parameter.)
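A minimal sketch of the difference (the code points are standard Unicode; the variable names are illustrative):
import unicodedata
composed = u'\xe1'          # U+00E1: 'á' as a single code point
decomposed = u'a\u0301'     # 'a' followed by COMBINING ACUTE ACCENT
print composed == decomposed                                  # False
print unicodedata.normalize('NFC', decomposed) == composed    # True
print unicodedata.normalize('NFD', composed) == decomposed    # True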

I suspect your terminal is using UTF-8, so 'é'.decode('latin-1') produces the wrong characters. Just use a Unicode literal instead: u'é'.
To handle Unicode correctly in a script, declare the script and data-file encodings, decode incoming data, encode outgoing data, and use Unicode strings for all text inside the script.
Example (save script in UTF-8):
# coding: utf8
import codecs
with codecs.open('input.txt', encoding='latin-1') as f:
    sentence = f.readline()
if u'é' in sentence:
    print u'Found é'
Note that print implicitly encodes the output in the terminal encoding.
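The same applies when writing: encode explicitly, or let codecs do it for you. A minimal sketch (the output file name is illustrative):
import codecs
# assume `sentence` is the Unicode string read above
with codecs.open('output.txt', 'w', encoding='utf-8') as out:
    out.write(sentence)   # codecs encodes the Unicode string on the way out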

Related

Why doesn't 'encode("utf-8", 'ignore').decode("utf-8")' strip non-UTF8 chars in Python 3?

I'm using Python 3.7 and Django 2.0. I want to strip out non-UTF-8 characters from a string that I'm obtaining by reading this CSV file. I tried this ...
web_site = row['website'].strip().encode("utf-8", 'ignore').decode("utf-8")
but this doesn't seem to be doing the job, since I have a resulting string that looks like ...
web_site: "wbez.org<200e>"
Whatever this "<200e>" thing is, it is evidently a non-UTF-8 string, because when I try to insert this into a MySQL database (deployed as a Docker image), I get the following error ...
web_1 | django.db.utils.OperationalError: Problem installing fixture '/app/maps/fixtures/seed_data.yaml': Could not load maps.Coop(pk=191): (1366, "Incorrect string value: '\\xE2\\x80\\x8E' for column 'web_site' at row 1")
Your row['website'] is already a Unicode string. UTF-8 can represent all valid Unicode code points, so .encode('utf8', 'ignore') doesn't typically ignore anything; it encodes the entire string as UTF-8, and .decode('utf8') turns it back into a Unicode string again.
If you simply want to strip non-ASCII characters, use the following to filter only ASCII characters and ignore the rest.
row['website'].encode('ascii','ignore').decode('ascii')
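For instance, with the value from the question (the <200e> is U+200E LEFT-TO-RIGHT MARK; the literal below is a reconstruction of that data):
web_site = 'wbez.org\u200e'
cleaned = web_site.encode('ascii', 'ignore').decode('ascii')
print(cleaned)   # wbez.org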
I think you are confusing the encodings.
Python has a standard character set: Unicode
UTF-8 is just an encoding of Unicode. All characters in Unicode can be encoded in UTF-8, and all valid UTF-8 byte sequences decode to Unicode characters.
So you are just encoding and then decoding Unicode strings, and the code should do nothing. (There are some exceptional cases: Python strings are really a superset of Unicode, so your code would remove non-Unicode characters; see surrogateescape. You will usually encounter such characters only when reading sys.argv or os.environ, so this is extremely rare.)
In any case, I think you are approaching this the wrong way. Search this site for the general question (e.g. "remove non-ASCII characters"). It is often better to decompose (with NFKD, the compatibility form), then remove the accents, then remove the remaining non-ASCII characters, so that more characters get translated. There are various functions for creating slugs that do a better job, and there is also a library that translates many more characters into "nearly equivalent" ASCII characters (Unicode has various representations of LETTER A, and you may want to translate Alpha and Aleph into A as well; that is better than discarding them, especially for foreign-language text, where otherwise you might discard almost everything).
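One such library is the third-party Unidecode package (an assumption about which library is meant here; pip install Unidecode):
from unidecode import unidecode
print(unidecode('café'))   # cafe
print(unidecode('α'))      # a  (Greek alpha folded to a Latin letter)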

Python get ASCII characters [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 8 years ago.
I'm retrieving data from the internet and I want to convert it to ASCII, but I can't get it fixed. (Python 2.7)
When I use decode('utf-8') on the strings I get for example Yalçınkaya. I want this however converted to Yalcinkaya. Raw data was Yalçınkaya.
Anyone who can help me?
Thanks.
Edit: I've tried the suggestion that was made by the user who marked this question as a duplicate (What is the best way to remove accents in a Python unicode string?), but that did not solve my problem.
That post mainly talks about removing the special characters, which did not solve my problem of converting the Turkish characters (Yalçınkaya) to their ASCII equivalents (Yalcinkaya).
# Printing the raw string in Python results in "Yalçınkaya".
# When applying unicode to utf8 the string changes to 'Yalçınkaya'.
# HTMLParser is used to revert special characters such as commas
# NFKD normalize is used, which converts the string to 'Yalçınkaya'.
# Applying ASCII encoding results in 'Yalcnkaya', missing the original Turkish 'ı', which is not what I wanted.
import unicodedata
import HTMLParser

name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' % name).encode('ascii', 'ignore')
Let's check - first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it's more like breakage. Think about trying to parse numbers, but since you don't understand the digit "9", you decide just to skip it.)
That said - you can tell Python to "decode" a string and just replace the characters the chosen encoding does not know with a proper "unknown" character (u"\ufffd"); you can then replace that character before re-encoding the data for your preferred output: raw_data.decode("ascii", "replace"). If you would rather break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will simply be suppressed. Remember that you get a "unicode" object after decoding - you have to apply the "encode" method to it before outputting that data anywhere (printing, writing to a file, etc.) - please read the article linked above.
Now - checking your specific data: the particular Yalçınkaya is exactly the raw UTF-8 encoding of Yalçınkaya read back as though it were Latin-1. Just decode it from UTF-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "world text" from the internet may contain all kinds of characters; you should not rely on everything being convertible to ASCII. I have to say again: read that article, and rethink your practices.
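That said, here is a sketch of the decode-then-strip recipe (Python 2). Since the Turkish dotless ı (U+0131) has no decomposition, it is mapped by hand first; that replacement table is an illustrative assumption:
import unicodedata

raw = 'Yal\xc3\xa7\xc4\xb1nkaya'             # the mojibake bytes from the question
text = raw.decode('utf-8')                   # u'Yal\xe7\u0131nkaya'
text = text.replace(u'\u0131', u'i').replace(u'\u0130', u'I')
print unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')  # Yalcinkaya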

ASCII as default encoding in python instead of utf-8

I only code in English but I have to deal with python unicode all the time.
Sometimes it's hard to remove Unicode characters from a dict.
How can I change Python's default character encoding to ASCII?
That would be the wrong thing to do. As in very wrong. To start with, it would only give you a UnicodeDecodeError instead of removing the characters. Learn proper encoding and decoding to/from Unicode so that you can filter out the values using rules like errors="ignore".
You can't just ignore the characters that are part of your data just because you "dislike" them. It is text, and in an interconnected world, text is not composed of only 26 glyphs.
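If you do need ASCII-only output, here is a minimal sketch of the explicit approach (Python 2; the byte string is illustrative):
data = 'caf\xc3\xa9'                         # UTF-8 bytes read from somewhere
text = data.decode('utf-8')                  # u'caf\xe9'
ascii_only = text.encode('ascii', 'ignore')  # 'caf' - the accented char is dropped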
I'd suggest you get started by reading this document: http://www.joelonsoftware.com/articles/Unicode.html

How to transliterate Cyrillic to Latin using Python 2.7? - not correct translation output

I am trying to transliterate Cyrillic to Latin from an excel file. I am working from the bottom up and can not figure out why this isn't working.
When I try to translate a simple text string, Python outputs "EEEEEE EEE" instead of the correct transliteration. How can I fix this to get the right output? I have been trying to figure this out all day!
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'Добрый Ден'
print text.translate(tr)
>>EEEEEE EEE
I appreciate the help!
Your source input is wrong. However you entered your symbols and text literals, Python did not read the right Unicode code points.
Instead, I strongly suspect something like the PYTHONIOENCODING variable has been set with the error handler set to replace. This causes Python to replace all code points that it does not recognize with question marks, and all of the Cyrillic input went unrecognized.
As a result, the only codepoint in your translation map is 63, the question mark, mapped to the last character in symbols[1] (which is expected behaviour for the dictionary comprehension with only one unique key):
>>> unichr(63)
u'?'
>>> unichr(69)
u'E'
The same problem applies to your text unicode string; it too consists of only question marks. The translation mapping replaces each with the letter E:
>>> u'?????? ???'.translate({63: 69})
u'EEEEEE EEE'
You need to either avoid entering Cyrillic literal characters or fix your input method.
In the terminal, this is a function of the codec your terminal (or Windows console) supports. Configure the correct codepage (Windows) or locale (POSIX systems) to input and output an encoding that supports Cyrillic (UTF-8 would be best).
In a Python source file, tell Python about the encoding used for string literals with a coding comment at the top of the file (e.g. # -*- coding: utf-8 -*-).
Avoiding literals means using Unicode escape sequences:
symbols = (
u'\u0430\u0431\u0432\u0433\u0434\u0435\u0451\u0437\u0438\u0439\u043a\u043b\u043c'
u'\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u044a\u044b\u044c\u044d'
u'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0417\u0418\u0419\u041a\u041b\u041c'
u'\u041d\u041e\u041f\u0420\u0421\u0422\u0423\u0424\u0425\u042a\u042b\u042c\u042d',
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"
)
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d'
print text.translate(tr)
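With the corrected input, this should print Dobryj Den.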

How do I convert a file's format from Unicode to ASCII using Python?

I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format.
What is the best way to convert the entire file format using Python?
You can convert the file easily enough just using the unicode function, but you'll run into problems with Unicode characters that have no straight ASCII equivalent.
This blog recommends the unicodedata module, which seems to take care of roughly converting characters that have no directly corresponding ASCII values, e.g.
>>> title = u"Klüft skräms inför på fédéral électoral große"
is typically converted to
Klft skrms infr p fdral lectoral groe
which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
I think this is a deeper issue than you realize. Simply changing the file from Unicode into ASCII is easy; however, getting all of the Unicode characters to translate into reasonable ASCII counterparts (many letters are not available in both encodings) is another matter entirely.
This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html
Here's a useful quote from the site:
Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:
>>> unicode('hello')
u'hello'
>>> unicode('hello', 'ascii')
u'hello'
>>> unicode('hello', 'iso-8859-1')
u'hello'
All three of these return the same thing, since the characters in 'Hello' are common to all three encodings.
Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.
>>> a = unicode('André', 'latin-1')
>>> a
u'Andr\202'
If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.
Unicode supports all the common operations such as iteration and splitting. We won't run over them here.
By the way, there is a Linux command, iconv, to do this kind of job.
iconv -f utf8 -t ascii <input.txt >output.txt
Here's some simple (and stupid) code to do encoding translation. I'm assuming (but you shouldn't) that the input file is in UTF-16 (Windows calls this simply 'Unicode').
input_codec = 'UTF-16'
output_codec = 'ASCII'
unicode_file = open('filename', 'rb')
unicode_data = unicode_file.read().decode(input_codec)
ascii_file = open('new filename', 'w')
ascii_file.write(unicode_data.encode(output_codec))
Note that this will not work if there are any characters in the Unicode file that are not also ASCII characters. You can do the following to turn unrecognized characters into '?'s:
ascii_file.write(unicode_data.encode(output_codec, 'replace'))
Check out the docs for more simple choices. If you need to do anything more sophisticated, you may wish to check out The UNICODE Hammer at the Python Cookbook.
Like this:
uc = open(filename).read().decode('utf8')
ascii = uc.encode('ascii')
Note, however, that this will fail with a UnicodeEncodeError exception if there are any characters that can't be converted to ASCII.
EDIT: As Pete Karl just pointed out, there is no one-to-one mapping from Unicode to ASCII. So some characters simply can't be converted in an information-preserving way. Moreover, standard ASCII is more or less a subset of UTF-8, so you don't really even need to do any decoding.
For my problem, where I just wanted to skip the non-ASCII characters and output only ASCII, the solution below worked really well:
import unicodedata
input = open(filename).read().decode('UTF-16')
output = unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore')
It's important to note that there is no 'Unicode' file format. Unicode can be encoded to bytes in several different ways. Most commonly UTF-8 or UTF-16. You'll need to know which one your 3rd-party tool is outputting. Once you know that, converting between different encodings is pretty easy:
in_file = open("myfile.txt", "rb")
out_file = open("mynewfile.txt", "wb")
in_byte_string = in_file.read()
unicode_string = in_byte_string.decode('UTF-16')
out_byte_string = unicode_string.encode('ASCII')
out_file.write(out_byte_string)
out_file.close()
As noted in the other replies, you're probably going to want to supply an error handler to the encode method. Using 'replace' as the error handler is simple, but will mangle your text if it contains characters that cannot be represented in ASCII.
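For example, reusing the names from the snippet above:
out_byte_string = unicode_string.encode('ASCII', 'replace')  # unrepresentable characters become '?'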
As other posters have noted, ASCII is a subset of Unicode.
However if you:
have a legacy app
don't control the code for that app
are sure your input falls into the ASCII subset
then the example below shows how to do it:
>>> mystring = u'bar'
>>> type(mystring)
<type 'unicode'>
>>> myasciistring = mystring.encode('ASCII')
>>> type(myasciistring)
<type 'str'>
