Problems with non-ASCII characters in path (Python)

I'm writing a python program that among other things is suppoused to move some files around. One of the directories in the path has a name containing a non-ASCII character:
src=os.path.normpath(u'C:\users\Mårten\python\nonpython')
dest=os.path.normpath(u'C:\users\Mårten\python\target')
files=getspc(src)
for mfile in files:
    print "In the loop"
    oldpath=os.path.normpath(os.path.join(src,mfile))
    print "oldpath: ", oldpath
    newpath=os.path.normpath(os.path.join(dest,mfile))
    print "newpath", newpath
    os.rename(oldpath,newpath)
    with dbcon:
        cur.execute("INSERT INTO spectra VALUES (?, CURRENT_DATE)",[newpath])
(Excerpt)
This makes the program crash, claiming that no encoding is declared. How does one declare encoding?

src=os.path.normpath(u'C:\users\Mårten\python\nonpython')
This isn't valid string syntax. Backslashes have special meaning in string literals, so if you want to use a literal backslash you need to escape it:
src=os.path.normpath(u'C:\\users\\Mårten\\python\\nonpython')
(Unfortunately, ‘raw string’ r'' literals aren't usable here because of the design decision that \u is still special in raw Unicode strings. Boo.)
Also, as @user58697 said, if you want to use a non-ASCII character in your source code itself you must include a # encoding: something line at the top. The something should be the encoding you tell your text editor to save the file in: I suggest UTF-8. Unicode is not an encoding. (Except to some Windows editors, which misleadingly use “Unicode” to mean UTF-16LE. You don't want to save as UTF-16, as it's not ASCII-compatible.)
Alternatively you can avoid the problem by using the aforementioned backslash-escapes to name the non-ASCII characters:
src=os.path.normpath(u'C:\\users\\M\u00E5rten\\python\\nonpython')

You need a magic comment. See PEP 0263

Related

Why doesn't 'encode("utf-8", 'ignore').decode("utf-8")' strip non-UTF8 chars in Python 3?

I'm using Python 3.7 and Django 2.0. I want to strip out non-UTF-8 characters from a string, that I'm obtaining by reading this CSV file. I tried this ...
web_site = row['website'].strip().encode("utf-8", 'ignore').decode("utf-8")
but this doesn't seem to be doing the job, since I have a resulting string that looks like ...
web_site: "wbez.org<200e>"
Whatever this "<200e>" thing is, is evidently non-UTF-8 string, because when I try and insert this into a MySQL database (deployed as a docker image), I get the following error ...
web_1 | django.db.utils.OperationalError: Problem installing fixture '/app/maps/fixtures/seed_data.yaml': Could not load maps.Coop(pk=191): (1366, "Incorrect string value: '\\xE2\\x80\\x8E' for column 'web_site' at row 1")
Your row['website'] is already a Unicode string. UTF-8 can support all valid Unicode code points, so .encode('utf8','ignore') doesn't typically ignore anything and encodes the entire string in UTF-8, and .decode('utf8') changes it back to a Unicode string again.
If you simply want to strip non-ASCII characters, use the following to filter only ASCII characters and ignore the rest.
row['website'].encode('ascii','ignore').decode('ascii')
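For instance, the stray character in the question is U+200E (LEFT-TO-RIGHT MARK); ASCII-encoding with 'ignore' drops it. A small sketch, not the asker's actual code:

```python
web_site = u'wbez.org\u200e'  # trailing LEFT-TO-RIGHT MARK, as in the question
# U+200E is not ASCII, so 'ignore' silently drops it during encoding.
cleaned = web_site.encode('ascii', 'ignore').decode('ascii')
print(cleaned)  # wbez.org
```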
I think you are confusing the encodings.
Python has a standard character set: Unicode
UTF-8 is just an encoding of Unicode. All Unicode characters can be encoded in UTF-8, and all valid UTF-8 byte sequences can be decoded back into Unicode characters.
So you are just encoding and then decoding Unicode strings, and the code should do nothing. (There are a few exceptional cases: Python strings are really a superset of Unicode, so your code would strip lone surrogates; see surrogateescape. You will usually encounter those only when reading sys.argv or os.environ.)
In any case, I think you are going about this the wrong way. Search this site for the general question (e.g. "remove non-ascii characters"). It is often better to decompose first (NFKD, where K means compatibility), then remove the accents, and only then drop the remaining non-ASCII characters, so that more characters survive the translation. There are various functions for creating slugs which do a better job, and there are libraries which transliterate many more characters into "nearly equivalent" ASCII ones. Unicode has several representations of the LETTER A, and you may want to translate Alpha and Aleph into A as well; that is better than discarding them, especially for foreign-language text, where otherwise you might discard everything.
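A minimal sketch of that decompose-then-strip approach (the function name slugify_ascii is made up for illustration):

```python
import unicodedata

def slugify_ascii(text):
    # NFKD splits accented letters into base letter + combining mark;
    # the combining marks are non-ASCII and get dropped by 'ignore'.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(slugify_ascii(u'M\u00e5rten'))  # Marten
```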

Python 3: is there any need of using unicode_escape encoding?

This link lists out some python specific encodings.
One of the encoding is "unicode_escape".
I am just trying to figure out, is this special encoding really needed?
>>> l = r'C:\Users\userx\toot'
>>> l
'C:\\Users\\userx\\toot'
>>> l.encode('unicode_escape').decode()
'C:\\\\Users\\\\userx\\\\toot'
As you can see above, 'l', which is a unicode object, has already taken care of escaping the backslashes. Converting it with the "unicode_escape" encoding adds one more set of escaped backslashes, which doesn't make any sense to me.
Questions:
Is "unicode_escape" encoding really needed?
Why did "unicode_escape" add one more set of backslashes above?
Quoting the document you linked:
Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.
Thus, print(l.encode('unicode_escape').decode()) does something almost exactly equivalent to print(repr(l)), except that it doesn't add quotes on the outside and escape quotes on the inside of the string.
When you leave off the print(), the REPL does a default repr(), so you get backslashes escaped twice -- exactly the same thing as happens when you run >>> repr(l).
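One place unicode_escape genuinely is needed is the other direction: decoding literal backslash escapes that arrive as data, e.g. read from a config file (the raw value here is made up):

```python
# A nine-character string containing a literal backslash-u sequence,
# as it might be read from a plain-text file.
raw = 'caf\\u00e9'
# unicode_escape interprets the \u00e9 escape and yields the real character.
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded)  # café
```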

How to transliterate Cyrillic to Latin using Python 2.7? - not correct translation output

I am trying to transliterate Cyrillic to Latin from an excel file. I am working from the bottom up and can not figure out why this isn't working.
When I try to translate a simple text string, Python outputs "EEEEE EEE" instead of the correct translation. How can I fix this to give me the right translation?? I have been trying to figure this out all day!
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'Добрый Ден'
print text.translate(tr)
>>EEEEEE EEE
I appreciate the help!
Your source input is wrong. However you entered your source and text literals, Python did not read the right unicode codepoints.
Instead, I strongly suspect something like the PYTHONIOENCODING variable has been set with the error handler set to replace. This causes Python to replace all codepoints that it does not recognize with question marks. All cyrillic input is treated as not-recognized.
As a result, the only codepoint in your translation map is 63, the question mark, mapped to the last character in symbols[1] (which is expected behaviour for the dictionary comprehension with only one unique key):
>>> unichr(63)
u'?'
>>> unichr(69)
u'E'
The same problem applies to your text unicode string; it too consists of only question marks. The translation mapping replaces each with the letter E:
>>> u'?????? ???'.translate({63: 69})
u'EEEEEE EEE'
You need to either avoid entering Cyrillic literal characters or fix your input method.
In the terminal, this is a function of the codec your terminal (or windows console) supports. Configure the correct codepage (windows) or locale (POSIX systems) to input and output an encoding that supports Cyrillic (UTF-8 would be best).
In a Python source file, tell Python about the encoding used for string literals with a codec comment at the top of the file.
Avoiding literals means using Unicode escape sequences:
symbols = (
u'\u0430\u0431\u0432\u0433\u0434\u0435\u0451\u0437\u0438\u0439\u043a\u043b\u043c'
u'\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u044a\u044b\u044c\u044d'
u'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0417\u0418\u0419\u041a\u041b\u041c'
u'\u041d\u041e\u041f\u0420\u0421\u0422\u0423\u0424\u0425\u042a\u042b\u042c\u042d',
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"
)
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d'
print text.translate(tr)
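With the escaped literals the table works. A trimmed-down, self-contained check (using the print() form so it runs on Python 2 or 3):

```python
# Subset of the full table above: just the letters needed for the sample.
symbols = (u'\u0414\u043e\u0431\u0440\u044b\u0439\u0435\u043d',  # ДобрыйЕн letters
           u'Dobryjen')
tr = {ord(a): ord(b) for a, b in zip(*symbols)}
text = u'\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d'  # Добрый Ден
print(text.translate(tr))  # Dobryj Den
```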

How to find accented characters in a string in Python?

I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.
I've tried my best to accomplish this, but have had no luck in getting it right. Below is one of the solutions I tried, but clearly gave the wrong answer.
sentence = '¿Qué tipo es el?'  # in str format, received from standard open file method
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False
I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work.
I'm out of ideas here, any suggestions?
@icktoofay provided the solution. I ended up keeping the decoding of the file (using latin-1), but then using Python Unicode literals for the characters (u'é'). This required me to set the Python source encoding at the top of the script. The final step was to use the unicodedata.normalize method to normalize both strings, then compare accordingly. Thank you guys for the prompt and great support.
Use unicodedata.normalize on the string before checking.
Explanation
Unicode offers multiple forms to create some characters. For example, á could be represented with a single character, á, or two characters: a, then 'put a ´ on top of that'. Normalizing the string will force it to one or the other of the representations. (which representation it normalizes to depends on what you pass as the form parameter)
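A quick sketch of the two forms and how normalize reconciles them:

```python
import unicodedata

composed = u'\u00e1'     # á as a single code point
decomposed = u'a\u0301'  # a + COMBINING ACUTE ACCENT
print(composed == decomposed)  # False: same glyph, different code points
# NFC composes, NFD decomposes; either way, normalized strings compare equal.
print(unicodedata.normalize('NFC', decomposed) == composed)    # True
print(unicodedata.normalize('NFD', composed) == decomposed)    # True
```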
I suspect your terminal is using UTF-8, so 'é'.decode('latin-1') is incorrect. Just use a Unicode constant instead u'é'.
To handle Unicode correctly in a script, declare the script and data-file encodings, decode incoming data, and encode outgoing data. Use Unicode strings for all text inside the script.
Example (save script in UTF-8):
# coding: utf8
import codecs
with codecs.open('input.txt', encoding='latin-1') as f:
    sentence = f.readline()
    if u'é' in sentence:
        print u'Found é'
Note that print implicitly encodes the output in the terminal encoding.

using extended ascii characters for wikimedia api

I am writing a simple search algorithm for wikipedia. I am having trouble when I send a query with characters that have accents and other characters that are not seen in regular english. Queries that return in error are:
http://en.wikipedia.org/w/api.php?action=query&titles=Albrecht%20Dürer&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Ancien%20Régime&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Feigenbaum-Cvitanović&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Banach–Tarski%20paradox&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Grundzüge%20der%20Mengenlehre&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Grundzüge%20einer%20Theorie%20der%20geordneten%20Mengen&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Karl%20Bögel&prop=links&pllimit=33&format=xml
But the query works fine if there are simple character such as "Fractals". How should I change the format of the query to make this work?
My code is open sourced at: http://code.google.com/p/wikipediafoundation/source/browse/. Please look at hg/src/list.py.
I don't see any trace in your Python source of how you're encoding the non-ASCII characters you're sending in the query. For URLs (including their query strings) that use anything beyond ASCII, you need to make the string unicode if it isn't already, then encode it in UTF-8 and percent-escape the result. For the percent-escaping, use the function urllib.quote_plus from the standard library module urllib; for the encoding, use the unicode string's .encode('utf8') method. If you need to make a unicode string from a differently encoded byte string, use the byte string's .decode('latin-1') method, or whatever the name of its actual encoding is, of course ;-).
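A sketch of that recipe applied to the first failing title (the URL layout is taken from the question; the import dance covers both Python versions):

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus        # Python 2

title = u'Albrecht D\u00fcrer'
# Encode to UTF-8 first, then percent-escape the resulting bytes.
encoded = quote_plus(title.encode('utf-8'))
url = ('http://en.wikipedia.org/w/api.php?action=query&titles=' + encoded +
       '&prop=links&pllimit=33&format=xml')
print(encoded)  # Albrecht+D%C3%BCrer
```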

Categories

Resources