I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding; which one, you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore") if you want to strip the undecodable characters (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
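If you are unsure which of the common encodings applies, one simple approach is to read the raw bytes once and try decoding with a few candidates. A minimal sketch (the candidate list and file name are just illustrations, not part of the original answer):

candidates = ["utf-8", "cp1252", "latin-1"]

with open("somefile.txt", "rb") as f:
    raw = f.read()

for enc in candidates:
    try:
        raw.decode(enc)
        print(enc, "decodes without errors")
    except UnicodeDecodeError as e:
        print(enc, "fails:", e)

Note that latin-1 maps every byte to some character, so it never raises; a clean decode with it is not proof that it is the right encoding.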
As an extension to @LennartRegebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you found yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding, if it has been set:
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
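If you would rather guess locally than with an online tool, the third-party chardet package does the same kind of statistical detection. A sketch, assuming chardet is installed (pip install chardet):

import chardet

with open("somefile.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess)

# guess['encoding'] can be None if detection fails, so check before decoding
if guess["encoding"]:
    text = raw.decode(guess["encoding"])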
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to its internal Unicode representation. If the file contains characters whose values are not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them with the Unicode replacement character (often rendered as a question mark in a box), with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The remaining characters are left intact (the offending ones are simply dropped), but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
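Because cp437 defines all 256 byte values, decoding with it is also lossless: re-encoding always gives back the original bytes. A small sketch of the round trip (the file name is a placeholder):

with open("somefile.txt", "rb") as f:
    raw = f.read()

text = raw.decode("cp437")          # never raises: every byte is defined
assert text.encode("cp437") == raw  # nothing is lost or masked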
Stop wasting your time: just add encoding="cp437" and errors='ignore' to your code for both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, opening with the utf16 encoding worked:
file = open('filename.csv', encoding="utf16")
For those working in Anaconda on Windows: I had the same problem, and Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding" go to "Character sets" and patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090),
and then consider removing it from the file.
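You can also do this lookup locally with the standard unicodedata module; a small sketch (0x90 itself is a control character, so it has no name):

import unicodedata

print(unicodedata.category("\x90"))           # 'Cc' = control character
print(unicodedata.name("\x90", "<no name>"))  # control characters have no name
print(unicodedata.name("\xbd"))               # 'VULGAR FRACTION ONE HALF'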
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND), for writing:
def write_files(text, file_path):
    # open for binary writing ('wb'), since we write encoded bytes
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit configurations (in the Configuration tab, set the Interpreter options field to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
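You can check from inside Python whether UTF-8 mode is actually active; a small sketch (requires Python 3.7+):

import sys
import locale

print(sys.flags.utf8_mode)            # 1 when -Xutf8 or PYTHONUTF8=1 is in effect
print(locale.getpreferredencoding())  # reports UTF-8 when the mode is on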
For me, changing the MySQL character encoding to match my code helped sort out the solution: photo = open('pic3.png', encoding='latin1')
I'm working with an Arabic text file which is a corpus.
What should I do to be able to import the file into Python, so I can easily access and analyze it instead of copying and pasting the content into the interpreter every time? It's an Arabic file, not English.
The most important thing when reading and writing plain text is to know and specify the encoding. You shouldn't let Python guess the encoding for you, especially in a real-world program (the encoding should either be configurable or you should ask the user for it).
Many people don't have an issue with English text because ASCII is a subset of most encodings. The issue is there and they will run into it as soon as the program tries to read or write texts in different encodings.
Most Arabic texts are encoded in (ordered by popularity1) Windows-1256, UTF-8, CP720, or ISO 8859-6. You should know ahead of time which encoding your plain text is using; most text editors, for example, allow you to select the encoding when you save the file.
I have three files with your name طارق, but in 3 different encodings. Reading the files as raw binary data shows you how different these files are, though it's the same text:
>>> f = open('file-utf8.txt', 'rb')
>>> f.read()
b'\xd8\xb7\xd8\xa7\xd8\xb1\xd9\x82'
>>>
>>> f = open('file-cp720.txt', 'rb')
>>> f.read()
b'\xe1\x9f\xa9\xe7'
>>>
>>> f = open('file-windows1256.txt', 'rb')
>>> f.read()
b'\xd8\xc7\xd1\xde'
>>>
The right way to read these files is by telling Python what encoding it should use so it decodes it to its internal Unicode representation (Using the mapping tables in /Python33/Lib/encodings/):
>>> f = open('file-utf8.txt', encoding='utf-8')
>>> f.read()
'طارق'
>>>
>>> f = open('file-cp720.txt', encoding='cp720')
>>> f.read()
'طارق'
>>>
>>> f = open('file-windows1256.txt', encoding='windows-1256')
>>> f.read()
'طارق'
>>>
The issue of encoding is not only related to files. Whenever you read text from an external source (e.g., a file, the console, a network socket), you must know the encoding. Also, when you write to an external source, you have to encode the text to the right encoding.
The encoding has to be consistent: if your console is using Latin-1 and you try to write to it (i.e., print), you will get some meaningless word or, if you are lucky, a UnicodeEncodeError exception.
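For example, trying to encode the same Arabic text to Latin-1 fails, because Latin-1 simply has no Arabic letters:

>>> 'طارق'.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)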
There are ways of guessing the encoding, but I won't bother to use them, as they only mask the problem; it will surface sooner or later.
1 If it's up to you, always go with UTF-8 because it's well supported.
The right encodings to read an Arabic text file are utf_8 and utf_16. But you have to try both and see which one is the right encoding for your file. You can do this by using the codecs package and setting the right encoding argument.
import codecs, sys
# pass your file as a command-line argument
# try "utf-16" encoding if this does not work
for line in codecs.open(sys.argv[1], encoding="utf_8"):
    print(line.strip())
Arabic is generally represented in Unicode.
Generally, you can read the file in and then convert to Unicode:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
For more information, refer to https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
File = open("Infixes.txt",encoding = "utf-16")
print(File.read())
This works for me on Windows 8.1 with a 64-bit processor.
Use this for Urdu file reading in Python:
File = open("Infixes.txt",encoding = "utf-8")
print(File.read())
I have a large csv file that contains unicode characters which are causing errors in a Python script I am trying to run. My process for removing them so far has been quite tedious. I run my script, and as soon as it hits a unicode character, I get an error:
'ascii' codec can't encode character u'\xef' in position 197: ordinal not in range(128)
Then I Google u'\xef' and try to figure out what the character actually is (does anyone know of a website with a list of these definitions?). I use that information to build a dictionary, and I have a second Python script that converts the unicode characters to regular text:
unicode_dict = {"\xb0":"deg", "\xa0":" ", "\xbd":"1/2", "\xbc":"1/4", "\xb2":"^2", "\xbe":"3/4"}
for f in glob.glob(r"C:\Folder1\*.csv"):
in_csv = f
out_csv = f.replace(".csv", "_2.csv")
write_f=open(out_csv, "wb")
writer = csv.writer(write_f)
with open(in_csv,'rb') as csvfile:
reader = csv.reader(csvfile)
for row in reader:
new_row = []
for s in row:
for k, v in unicode_dict.iteritems():
s = s.replace(k, v)
new_row.append(s)
writer.writerow(new_row)
write_f.close()
os.remove(in_csv)
os.rename(out_csv, in_csv)
Then I have to run the code again, get another error, and look up the next unicode character on Google. There must be a better way, right?
Read http://www.joelonsoftware.com/articles/Unicode.html . Carefully.
Then you'll understand that you need to know which encoding your file is in. If you've been able to find out what \xbd means, maybe the same place mentions which encoding it is.
Then, use io.open(in_csv, 'r', encoding='yourencodinghere') instead of the vanilla open call (an encoding argument is only valid in text mode, not 'rb').
Then, apparently the csv module doesn't handle Unicode, sigh. Use something from SBillion's answer below (e.g. the examples at https://docs.python.org/2/library/csv.html#csv-examples) to work around it.
You should have a look at this for a way to handle Unicode via utf-8 in csv files with the standard python library:
https://docs.python.org/2/library/csv.html#csv-examples
But if you prefer, you can use this external unicode-compliant module: https://pypi.python.org/pypi/unicodecsv/0.9.0
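If moving to Python 3 is an option, note that its csv module handles Unicode natively, so neither recipe is needed there; a sketch (the path is a placeholder):

import csv

# In Python 3 the csv module works on text; pass the encoding to open().
with open(r"C:\Folder1\data.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)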
My Python script creates an XML file under Windows XP, but that file doesn't get the right encoding for Spanish characters such as 'ñ' or some accented letters.
First of all, the filename is read from an Excel cell with the following code (I use the xlrd library to read the Excel file):
filename = excelsheet.cell_value(rowx=first_row, colx=5)
Then I tried some encodings, without success, to generate the file with the right encoding:
filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")
Using "windows-1252" I get a bad encoding with letter 'ñ', 'í' and 'é'. For example, I got BAJO ARAGÓN_Alcañiz.xml instead of BAJO ARAGÓN_Alcañiz.xml
Thanks in advance for your help
You should use unicode strings for your filenames. In general operating systems support filenames that contain arbitrary Unicode characters. So if you do:
fn = u'ma\u00f1o' # maño
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()
it should work just fine. A different thing is what you see in your terminal when you list the contents of the directory where that file lives. If the encoding of the terminal is UTF-8 you will see the filename maño, but if the encoding is for instance iso-8859-1 you will see maÃ±o. But even if you see these strange characters you should be able to open the file from Python as described above.
In summary, do not encode the output of
filename = excelsheet.cell_value(rowx=first_row, colx=5)
instead make sure it is a unicode string.
Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.
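A short Python 2 sketch of the round trip described above (create a file with a unicode name, then list it back; passing a unicode argument to os.listdir makes it return unicode names; the name is just an illustration):

# -*- coding: utf-8 -*-
import os

fn = u'ma\u00f1o'      # maño
open(fn, 'w').close()  # the OS stores the unicode name; no manual encoding needed

# with a unicode argument, os.listdir returns unicode filenames
for name in os.listdir(u'.'):
    print repr(name)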
Trying your answers, I found a fast solution: port my script from Python 2.7 to Python 3.3. The reason is that Python 3 works in Unicode by default.
I had to make some small changes to my code, starting with the import of the xlrd library (previously I had to install xlrd3):
import xlrd3 as xlrd
Also, I had to convert the content from 'bytes' to 'string' using str instead of encode()
filename = str(filename[:-1])
Now my script works perfectly and generates the files on Windows XP without strange characters.
First,
if you haven't, please read http://www.joelonsoftware.com/articles/Unicode.html -
Now, "latin-1" should work for Spanish encoding under Windows. There are two hypotheses here: the strings you are trying to "encode" to either encoding are not Unicode strings, but are already in some encoding. That, however, would more likely give you a UnicodeDecodeError than strange characters, though it might happen in some corner case.
The more likely case is that you are checking your files using the Windows prompt, AKA "cmd" -
Well, for some reason, Microsoft Windows uses two different encodings for the system: one for "native" Windows programs, which should be compatible with latin1, and another for legacy DOS programs, a category which includes the command prompt. For Portuguese, this second encoding is "cp852" (looking around, cp852 does not define "ñ" - but cp850 does).
So, this happens:
>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>>
So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "cp850"; if you want them to look right from Windows programs, encode them using "cp1252" (or "latin1" or "iso-8859-15" - they are almost the same, give or take the "€" symbol).
Of course, instead of trying to guess and picking one that looks good (and will fail if someone runs your program in Norway, Russia, or on a POSIX system), you should just do:
import sys
encoding = sys.getfilesystemencoding()
(This should return one of the above for you - again, the filename will look right if seen from a Windows program, not from a DOS shell.)
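As an illustration, a Python 2 sketch that encodes a unicode filename only at the OS border, using whatever the filesystem encoding happens to be (the filename is hypothetical):

# -*- coding: utf-8 -*-
import sys

encoding = sys.getfilesystemencoding()
name = u'Alca\xf1iz.xml'                  # unicode string: Alcañiz.xml
open(name.encode(encoding), 'w').close()  # encode only when handing it to the OS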
In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.
# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
    f.write('test')
Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.
Probably I completely misunderstand it, so can you take a look at the code examples and tell me what I should do to be sure it will work?
I tried it in Eclipse with PyDev. I use Python 2.6.6 (because of some library that does not support Python 2.7).
First, without using codecs module
# -*- coding: utf-8 -*-
file1 = open("samoloty1.txt", "w")
file2 = open("samoloty2.txt", "w")
file3 = open("samoloty3.txt", "w")
file4 = open("samoloty4.txt", "w")
file5 = open("samoloty5.txt", "w")
file6 = open("samoloty6.txt", "w")
# I know that this is weird, but it shows that whatever I do, it doesn't ruin anything...
print u"ą✈✈"
file1.write(u"ą✈✈")
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈".decode("utf-8")
file3.write("ą✈✈".decode("utf-8"))
print "ą✈✈".encode("utf-8")
file4.write("ą✈✈".encode("utf-8"))
print u"ą✈✈".decode("utf-8")
file5.write(u"ą✈✈".decode("utf-8"))
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
file1 = open("samoloty1.txt", "r")
file2 = open("samoloty2.txt", "r")
file3 = open("samoloty3.txt", "r")
file4 = open("samoloty4.txt", "r")
file5 = open("samoloty5.txt", "r")
file6 = open("samoloty6.txt", "r")
print file1.read()
print file2.read()
print file3.read()
print file4.read()
print file5.read()
print file6.read()
Each of those prints works correctly and I don't get any funny characters.
I also tried this: I deleted all the files made in the previous test and changed only lines like this one:
file1 = open("samoloty1.txt", "w")
to those:
file1 = codecs.open("samoloty1.txt", "w", encoding='utf-8')
and again everything works...
Can anyone make some examples what works, and what not?
Should this be a separate question?
I am downloading web pages, through this:
content = urllib.urlopen(some_url).read()
ucontent = unicode(content, encoding) # i get encoding from headers
Is this correct and enough? What should I do next with it to store it in a utf-8 file? (I ask because whatever I did before, it just works...)
** UPDATE **
Probably everything worked because PyDev (or just Eclipse) has its terminal encoded in UTF-8. So for tests I used cmd on Windows 7, and I got some errors; now everything was crashing as expected. :D Here I am showing what I changed to get it working again (all of those changes are reasonable to me, and they agree with what I learned from the answers and from the Python documentation).
print u"ą✈✈".encode("utf-8") # added encode
file1.write(u"ą✈✈".encode("utf-8")) # added encode
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈" # removed .decode("utf-8")
file3.write("ą✈✈") # removed .decode("utf-8"))
print "ą✈✈" # removed .encode("utf-8")
file4.write("ą✈✈") # removed .encode("utf-8"))
print u"ą✈✈".encode("utf-8") # changed from .decode("utf-8")
file5.write(u"ą✈✈".encode("utf-8")) # changed from .decode("utf-8")
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
And like someone said, when I use codecs, I don't need to use encode() every time before writing to a file. :)
The question is, which answer should be marked as correct?
You are just lucky that the encoding of your console is utf-8 by default.
If you pass a unicode object to the write method of a file object (such as sys.stdout), it is implicitly encoded using the file's encoding attribute.
Those who work on Windows are not so lucky: How to workaround Python "WindowsError messages are not properly encoded" problem?
All those write exercises in the code snippet actually boil down to two situations:
when you write a byte string to the file
when you try to write a unicode string to the file
Let's call the byte string s and the unicode string u.
Then fileN.write(s) makes sense, and fileN.write(u) doesn't. I don't know about your setup (maybe you have made some changes to your site's Python), but the following expectedly breaks here:
# -*- coding: utf-8 -*-
ff = open("ff.txt", "w")
ff.write(u"ą✈✈")
ff.close()
with:
Traceback (most recent call last):
File "ex.py", line 5, in <module>
ff.write(u"ą✈✈")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
It means that a unicode string should be converted to a byte string before being written to the file. Your file6 example shows how to do it:
u"ą✈✈".encode("utf-8")
The magic comment -*- coding: utf-8 -*- is what enables you to write unicode string literals in a WYSIWYG way: u"ą✈✈"; it doesn't help you determine your encoding in any other situation.
Thus, do not give the .write() method in Python 2.6 any unicode string. The good practice is to work with unicode strings in your code, but convert from/to a concrete encoding at the input/output borders.
The codecs example is good, as well as urllib.
What you are doing is correct. See this Python unicode howto for more info.
The general principles are:
When binary data comes into your application (e.g., open(), urllib.urlopen()), use the decode() method to get a unicode string.
If the byte string is invalid for the supplied encoding, you may get UnicodeDecodeError. In this case do one of the following:
Use the second argument to decode to either replace or ignore bad characters
try harder to find out what the real encoding is
fix the input if it really is mangled.
For files, you can use the codecs.open wrappers to do this transparently for you.
Network data you must generally decode by hand, but sometimes the payload declares its own encoding (e.g., html, XML), and sometimes it doesn't match the header!
For database data, usually the database driver will have some method of doing encoding/decoding transparently for you and always give you unicode strings. Otherwise you will need to encode/decode by hand.
Use unicode strings in your application.
Right before the binary data leaves your application, use encode() on the string to encode to your desired encoding.
If your target encoding cannot represent some of your unicode characters, you may get UnicodeEncodeError. In this case do one of the following:
Use the second argument to encode() to ignore or replace characters that can't be represented in the target encoding;
Don't generate these characters in your application.
Find an alternate way of representing them. E.g., in XML, you can use a numeric character entity.
For files, you may use the codecs.open wrapper to do encoding for you transparently.
For database connections, the driver will often have an option to accept unicode strings and encode for you.
For network connections, you must generally encode by hand. Sometimes the payload will be generated by a library that will encode properly for you (e.g., writing XML).
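Putting the two borders together, a minimal Python 2 sketch of the whole pattern (decode on the way in, work with unicode, encode on the way out; the file names are placeholders):

# -*- coding: utf-8 -*-
import codecs

# input border: codecs.open decodes bytes to unicode transparently
with codecs.open('in.txt', 'r', encoding='utf-8') as f:
    text = f.read()              # unicode object

text = text.replace(u'a', u'b')  # work with unicode inside the application

# output border: codecs.open encodes unicode back to bytes transparently
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text)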
Because you are correctly using the magic "coding comment", everything works as expected.