Best way to convert unicode in csv to plain text? - python

I have a large csv file that contains unicode characters which are causing errors in a Python script I am trying to run. My process for removing them so far has been quite tedious. I run my script and as soon as it hits a unicode character, I get an error:
'ascii' codec can't encode character u'\xef' in position 197: ordinal not in range(128)
Then I Google u'\xef' and try to figure out what the character actually is (Does anyone know of a website with a list of these definitions?). I'm using that information to build a dictionary and I have a second Python script that converts the unicode characters to regular text:
import csv
import glob
import os

unicode_dict = {"\xb0":"deg", "\xa0":" ", "\xbd":"1/2", "\xbc":"1/4", "\xb2":"^2", "\xbe":"3/4"}
for f in glob.glob(r"C:\Folder1\*.csv"):
    in_csv = f
    out_csv = f.replace(".csv", "_2.csv")
    write_f = open(out_csv, "wb")
    writer = csv.writer(write_f)
    with open(in_csv, 'rb') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            new_row = []
            for s in row:
                for k, v in unicode_dict.iteritems():
                    s = s.replace(k, v)
                new_row.append(s)
            writer.writerow(new_row)
    write_f.close()
    os.remove(in_csv)
    os.rename(out_csv, in_csv)
Then I have to run the code again, get another error, and look up the next unicode character on Google. There must be a better way, right?

Read http://www.joelonsoftware.com/articles/Unicode.html . Carefully.
Then, you'll understand that you need to know which encoding your file is in. If you've been able to find out what \xbd means, maybe the place where you found it also mentions which encoding the file uses.
Then, use io.open(in_csv, 'r', encoding='yourencodinghere') instead of the vanilla open call.
Then, apparently the csv module doesn't handle Unicode, sigh. Use something from SBillion's answer (e.g. the examples at https://docs.python.org/2/library/csv.html#csv-examples) to work around it.
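A minimal sketch of that workaround, under the assumption that the file is actually cp1252 (Windows) encoded; the path is a placeholder. The csv module reads plain bytes, so decode each cell yourself, and the standard unicodedata module will tell you what an unfamiliar character is instead of a web search:
import csv
import unicodedata

ENCODING = "cp1252"   # assumption: substitute the file's real encoding

with open(r"C:\Folder1\example.csv", "rb") as f:   # placeholder path
    for row in csv.reader(f):
        cells = [cell.decode(ENCODING) for cell in row]
        for cell in cells:
            for ch in cell:
                if ord(ch) > 127:
                    # e.g. u'\xef' -> LATIN SMALL LETTER I WITH DIAERESIS
                    print repr(ch), unicodedata.name(ch, "<unnamed>")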

You should have a look at this for a way to handle Unicode via utf-8 in csv files with the standard python library:
https://docs.python.org/2/library/csv.html#csv-examples
But if you prefer, you can use this external unicode-compliant module: https://pypi.python.org/pypi/unicodecsv/0.9.0
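If you go the unicodecsv route, usage mirrors the standard module; a minimal sketch (the filenames are placeholders, and the utf-8 assumption is mine; pass whatever encoding the file really uses):
import unicodecsv

with open("in.csv", "rb") as f:
    rows = list(unicodecsv.reader(f, encoding="utf-8"))   # cells come back as unicode objects
with open("out.csv", "wb") as f:
    writer = unicodecsv.writer(f, encoding="utf-8")
    for row in rows:
        writer.writerow(row)                              # unicode cells are encoded on the way out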

Related

How to filter Unicode characters [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to Lennart Regebro's answer:
If you can't tell what encoding your file uses, the solution above doesn't work (it's not utf8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from a comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
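If you would rather guess programmatically than with an online tool or an editor, the third-party chardet package makes a statistical guess; a small sketch (the filename is a placeholder and the guess is not guaranteed to be right):
import chardet

with open("mystery_file.txt", "rb") as f:
    raw = f.read()
guess = chardet.detect(raw)              # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
print(guess)
text = raw.decode(guess["encoding"])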
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it with that codec. If the file contains byte values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may not be handled by Python (e.g. cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then simply dropped, but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
Stop wasting your time, just add encoding="cp437" and errors='ignore' to your code for both reading and writing:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, opening with the utf16 encoding worked:
file = open('filename.csv', encoding="utf16")
For those working in Anaconda on Windows: I had the same problem, and Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding" go to "Character sets" and there, with patience, look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090), and then consider removing it from the file.
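If you decide to remove it, a small sketch of doing that programmatically rather than by hand (the filename is a placeholder and cp1252 is just a best-guess codec):
with open("data.txt", "rb") as f:
    raw = f.read()
print(raw.count(b"\x90"))               # how often the offending byte occurs
cleaned = raw.replace(b"\x90", b"")     # drop it, then decode with your best-guess codec
text = cleaned.decode("cp1252")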
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
Or, for writing text back out:
def write_files(text, file_path):
    # open in binary write mode; characters that cannot be encoded are dropped
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations and, in the Configuration tab, set the Interpreter options field to -Xutf8.
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
For me, changing the MySQL character encoding to match my code helped sort out the problem: photo = open('pic3.png', encoding='latin1')

Python 2.7 JSON dump UnicodeEncodeError

I have a file where each line is a json object like so:
{"name": "John", ...}
{...}
I am trying to create a new file with the same objects, but with certain properties removed from all of them.
When I do this, I get a UnicodeEncodeError. Strangely, if I instead loop over range(n) (for some number n) and use infile.next(), it works just as I want it to.
Why so? How do I get this to work by iterating over infile? I tried using dumps() instead of dump(), but that just makes a bunch of empty lines in the outfile.
import json

# filename and propsToRemove are defined earlier in the script
with open(filename, 'r') as infile:
    with open('_{}'.format(filename), 'w') as outfile:
        for comment in infile:
            decodedComment = json.loads(comment)
            for prop in propsToRemove:
                # use pop to avoid exception handling
                decodedComment.pop(prop, None)
            json.dump(decodedComment, outfile, ensure_ascii = False)
            outfile.write('\n')
Here is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f47d' in position 1: ordinal not in range(128)
Thanks for the help!
The problem you are facing is that the standard file.write() function (called by the json.dump() function) does not handle unicode strings: it tries to encode them as ASCII. From the error message, it turns out that your string contains the Unicode character \U0001f47d (which turns out to code for the character EXTRATERRESTRIAL ALIEN, who knew?), and possibly other non-ASCII characters. To handle these characters, either you can let json escape them into plain ASCII (they'll show up in your output file as \uXXXX escapes), or you need to use a file writer that can handle unicode.
To do the first option, simply drop ensure_ascii=False so that json.dump escapes everything non-ASCII itself:
json.dump(decodedComment, outfile)
The second option is likely more what you want, and an easy option is to use the codecs module. Import it, and change your second line to:
with codecs.open('_{}'.format(filename), 'w', encoding="utf-8") as outfile:
Then, you'll be able to save the special characters in their original form.
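Putting the second option together, a sketch of the whole loop (filename and propsToRemove are still the ones from the question):
import codecs
import json

with open(filename, 'r') as infile:
    with codecs.open('_{}'.format(filename), 'w', encoding="utf-8") as outfile:
        for comment in infile:
            decodedComment = json.loads(comment)
            for prop in propsToRemove:
                decodedComment.pop(prop, None)
            # outfile now accepts unicode and writes it out as UTF-8
            json.dump(decodedComment, outfile, ensure_ascii=False)
            outfile.write(u'\n')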

International characters in Python

I'm currently working on a Python script that takes a list of log files (from a search engine) and produces a file with all the queries within these, for later analysis.
Another feature of the script is that it removes the most common words, which I've also implemented, but I've run into a problem I can't seem to overcome. The removal of words works as intended as long as the queries do not contain special characters. As the search logs are in Danish, the characters æ, ø and å appear regularly.
Searching on the topic I'm now aware that I need to encode these into UTF-8, which I'm doing when obtaining the query:
tmp = t_query.encode("UTF-8").lower().split()
t_query is the query and I split it up to later compare each word with my list of forbidden words. If I do not use the encoding I'll get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 1: ordinal not in range(128)
Edit: I also tried using the decode instead, but get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa7' in position 3: ordinal not in range(128)
I loop through the words like this:
for i in tmp:
    if i in words_to_filter:
        tmp.remove(i)
As said this works perfectly for words not including special characters. I've tried to print the i along with the current forbidden word and will get e.g:
færdelsloven - færdelsloven
Where the first word is the ith element in tmp. The last word is the one from the forbidden words. Obviously something has gone wrong, but I just can't manage to find a solution. I've tried many suggestions found on Google and on here, but nothing has worked so far.
Edit 2: if it makes a difference, I've tried loading the log files both with and without the use of codec:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
I'm running Python 2.7.2 from a Windows environment, if it matters. The script should be able to run on other platforms (namely Linux and Mac OS).
I would really appreciate it if one of you is able to help me out.
Best regards
Casper
If you are reading files, you want to decode them.
tmp = t_query.decode("UTF-8").lower().split()
Given a utf-8 file with one json object per line, you could read all objects:
with open(filename) as file:
    jlogs = [json.loads(line) for line in file]
Except for the treatment of embedded newlines, the above code should produce the same result as yours:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
At this point all strings in jlogs are Unicode; you don't need to do anything else to handle "special" characters. Just make sure you are not mixing bytes and Unicode text in your code.
to get Unicode text from bytes: some_bytes.decode(character_encoding)
to get bytes from Unicode text: some_text.encode(character_encoding)
Don't encode bytes/decode Unicode.
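Applied to the filtering loop in the question, a minimal sketch of those two rules (the assumptions here: words_to_filter might still hold utf-8 byte strings, and t_query is already Unicode because it came from json.loads):
words_to_filter = set(w.decode("utf-8") if isinstance(w, str) else w
                      for w in words_to_filter)        # normalise the forbidden words to unicode once
tmp = t_query.lower().split()                          # no .encode() needed, t_query is already unicode
filtered = [w for w in tmp if w not in words_to_filter]  # compare unicode with unicode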
If encoding is right and you just want to ignore unexpected characters you could use errors='ignore' or errors='replace' parameter passed to codecs.open function.
with codecs.open(file_name, encoding='utf-8', mode='r', errors='ignore') as f:
    jlogs = map(json.loads, f.readlines())
Details in docs:
http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
I've finally solved it. As Lattyware said, Python 3.x seems to handle this much better. After changing versions and saving the Python file as UTF-8, it works as intended.

Why do all these unicode commands work correctly in Python? They all print my characters correctly, no matter what I do

I probably don't completely understand this, so can you take a look at the code examples and tell me what I should do to be sure it will work?
I tried it in Eclipse with PyDev. I use Python 2.6.6 (because of some library that does not support Python 2.7).
First, without using the codecs module:
# -*- coding: utf-8 -*-
file1 = open("samoloty1.txt", "w")
file2 = open("samoloty2.txt", "w")
file3 = open("samoloty3.txt", "w")
file4 = open("samoloty4.txt", "w")
file5 = open("samoloty5.txt", "w")
file6 = open("samoloty6.txt", "w")
# I know that this is weird, but it shows that whatever I do, it doesn't ruin anything...
print u"ą✈✈"
file1.write(u"ą✈✈")
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈".decode("utf-8")
file3.write("ą✈✈".decode("utf-8"))
print "ą✈✈".encode("utf-8")
file4.write("ą✈✈".encode("utf-8"))
print u"ą✈✈".decode("utf-8")
file5.write(u"ą✈✈".decode("utf-8"))
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
file1 = open("samoloty1.txt", "r")
file2 = open("samoloty2.txt", "r")
file3 = open("samoloty3.txt", "r")
file4 = open("samoloty4.txt", "r")
file5 = open("samoloty5.txt", "r")
file6 = open("samoloty6.txt", "r")
print file1.read()
print file2.read()
print file3.read()
print file4.read()
print file5.read()
print file6.read()
Each of those prints works correctly and I don't get any funny characters.
I also tried this: I deleted all the files made in the previous test and changed only lines like this:
file1 = open("samoloty1.txt", "w")
to this:
file1 = codecs.open("samoloty1.txt", "w", encoding='utf-8')
and again everything works...
Can anyone give some examples of what works and what doesn't?
Should this be a separate question?
I am downloading web pages, through this:
content = urllib.urlopen(some_url).read()
ucontent = unicode(content, encoding) # i get encoding from headers
Is this correct and enough? What should I do next with it to store it in a utf-8 file? (I ask because whatever I did before just works...)
** UPDATE **
Probably everything works OK because PyDev (or just Eclipse) has its terminal encoded in UTF-8. So for testing I used cmd on Windows 7 and got some errors; now everything was crashing as expected. :D Here is what I changed to get it working again (all of these changes are reasonable to me and agree with what I learned from the answers and from the Python documentation).
print u"ą✈✈".encode("utf-8") # added encode
file1.write(u"ą✈✈".encode("utf-8")) # added encode
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈" # removed .decode("utf-8")
file3.write("ą✈✈") # removed .decode("utf-8"))
print "ą✈✈" # removed .encode("utf-8")
file4.write("ą✈✈") # removed .encode("utf-8"))
print u"ą✈✈".encode("utf-8") # changed from .decode("utf-8")
file5.write(u"ą✈✈".encode("utf-8")) # changed from .decode("utf-8")
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
And as someone said, when I use codecs I don't need to call encode() every time before writing to the file. :)
The question is, which answer should be marked as correct?
You are just lucky that the encoding of your console is utf-8 by default.
If you pass a unicode object to the write method of a file object (such as sys.stdout), it is implicitly encoded using that file's encoding attribute.
Those who work on Windows are not so lucky: How to workaround Python "WindowsError messages are not properly encoded" problem?
All those write exercises in the code snippet actually boil down to two situations:
when you write a byte string to the file
when you try to write a unicode string to the file
Let's call the byte string s and the unicode string u.
Then fileN.write(s) makes sense, and fileN.write(u) doesn't. I don't know about your setup (maybe you have made some changes to your site's Python), but the following breaks here, as expected:
# -*- coding: utf-8 -*-
ff = open("ff.txt", "w")
ff.write(u"ą✈✈")
ff.close()
with:
Traceback (most recent call last):
File "ex.py", line 5, in <module>
ff.write(u"ą✈✈")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
It means that the unicode string should be converted to a byte string before writing it to the file. And your file6 example shows how to do it:
u"ą✈✈".encode("utf-8")
The magic string -*- coding: utf-8 -*- is the one which enables you to write unicode string literals in a WYSIWYG way (u"ą✈✈"); it doesn't help you determine your encoding in any other situation.
Thus, do not give the .write() method in Python 2.6 any unicode string. Good practice is to work with unicode strings in your code but convert to/from a concrete encoding at the input/output borders.
The codecs example is good, and so is the urllib approach.
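A sketch of that border discipline applied to the web-page question (the URL and output filename are placeholders; the charset is taken from the HTTP headers, falling back to utf-8 when missing):
import codecs
import urllib

response = urllib.urlopen("http://example.com/page")           # placeholder URL
encoding = response.headers.getparam("charset") or "utf-8"     # encoding from the headers
ucontent = response.read().decode(encoding)                    # bytes -> unicode at the input border
with codecs.open("page.txt", "w", encoding="utf-8") as out:
    out.write(ucontent)                                        # unicode -> utf-8 bytes at the output border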
What you are doing is correct. See this Python unicode howto for more info.
The general principles are:
When binary data comes into your application (e.g., open(), urllib.urlopen()), use the decode() method to get a unicode string.
If the byte string is invalid for the supplied encoding, you may get UnicodeDecodeError. In this case do one of the following:
Use the second argument to decode to either replace or ignore bad characters
try harder to find out what the real encoding is
fix the input if it really is mangled.
For files, you can use the codecs.open wrappers to do this transparently for you.
Network data you must generally decode by hand, but sometimes the payload declares its own encoding (e.g., html, XML), and sometimes it doesn't match the header!
For database data, usually the database driver will have some method of doing encoding/decoding transparently for you and always give you unicode strings. Otherwise you will need to encode/decode by hand.
Use unicode strings in your application.
Right before the binary data leaves your application, use encode() on the string to encode to your desired encoding.
If your target encoding cannot represent some of your unicode characters, you may get UnicodeEncodeError. In this case do one of the following:
Use the second argument to encode() to ignore or replace characters that can't be represented in the target encoding;
Don't generate these characters in your application.
Find an alternate way of representing them. E.g., in XML, you can use a numeric character entity (see the sketch after this list).
For files, you may use the codecs.open wrapper to do encoding for you transparently.
For database connections, the driver will often have an option to accept unicode strings and encode for you.
For network connections, you must generally encode by hand. Sometimes the payload will be generated by a library that will encode properly for you (e.g., writing XML).
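A minimal sketch of those error-handler options, using only the standard encode()/decode() arguments mentioned above (Python 2):
# encoding errors: replace, drop, or turn unencodable characters into XML entities
s = u"ą✈✈"
print s.encode("ascii", "replace")            # '???'
print s.encode("ascii", "ignore")             # '' (all three characters dropped)
print s.encode("ascii", "xmlcharrefreplace")  # '&#261;&#9992;&#9992;'
# decoding errors: ignore bad bytes instead of raising UnicodeDecodeError
print "caf\xc3\xa9".decode("ascii", "ignore") # u'caf'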
Because you are correctly using the magic "coding comment," everything works as expected.

How to parse unicode strings with minidom?

I'm trying to parse a bunch of xml files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?
Try to decode it:
>>> print u'abcdé'.encode('utf-8')
abcdé
>>> print u'abcdé'.encode('utf-8').decode('utf-8')
abcdé
In case your string is 'str':
xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))
This worked for me.
Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
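A short sketch of both routes (the filename is a placeholder, and the utf-8 assumption is mine):
from xml.dom import minidom

# Route 1: hand the parser a file and let it read the raw bytes itself
doc = minidom.parse("some_file.xml")

# Route 2: if you already hold the content as a unicode string,
# encode it back to bytes before handing it to parseString()
with open("some_file.xml", "rb") as f:
    data = f.read()
utext = data.decode("utf-8")                       # assumed encoding
doc = minidom.parseString(utext.encode("utf-8"))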
I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only affects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")
This will wrap the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
I encountered this error a few times, and my hacky way of dealing with it is just this:
def getCleanString(word):
    clean = ""
    for character in word:
        try:
            clean += str(character)
        except UnicodeEncodeError:
            pass  # this happens if the character is non-ASCII unicode; skip it
    return clean
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
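For what it's worth, a one-line sketch that does the same filtering with the error handler built into encode() (assuming word is a unicode object):
clean = word.encode('ascii', 'ignore')   # non-ASCII characters are simply dropped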
