I am trying to read some unicode files that I have locally. How do I read unicode files while using a list? I've read the python docs, and a ton of stackoverflow Q&A's, which have answered a lot of other questions I had, but I can't find the answer to this one.
Any help is appreciated.
Edit: Sorry, my files are in utf-8.
You can open UTF-8-encoded files by using
import codecs
with codecs.open("myutf8file.txt", encoding="utf-8-sig") as infile:
    for line in infile:
        # do something with line
Be aware that codecs.open() does not translate \r\n to \n, so if you're working with Windows files, you need to take that into account.
The utf-8-sig codec will read UTF-8 files with or without a BOM (Byte Order Mark) (and strip it if it's there). On writing, you should use utf-8 as a codec because the Unicode standard recommends against writing a BOM in UTF-8 files.
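If you also want the lines collected in a list, as the question asks, a minimal sketch using io.open() instead (the file name is the same placeholder as above) handles the \r\n translation for you:
import io

keywords = []
with io.open("myutf8file.txt", encoding="utf-8-sig") as infile:
    for line in infile:
        # io.open() uses universal newlines, so \r\n has already become \n here
        keywords.append(line.rstrip(u"\n"))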
Preface:
It's a cold, rainy day in mid-2016, and a developer is still having encoding issues with Python because he isn't using Python 3. Will the great S.O. community help him? I don't know, we will have to wait and see.
Scope:
I have a UTF-8 encoded file that contains words with accentuation, such as CURRÍCULO and NÓS. For some reason I can't grasp, I can't manage to read them properly using Python 2.7.
Code Snippet:
import codecs

keywords = []
f_reader = codecs.open('PATH_TO_FILE/Data/Input/kw.txt', 'r', encoding='utf-8')
for line in f_reader:
    keywords.append(line.strip().upper())
    print line
The output I get is:
TRABALHE CONOSCO
ENVIE SEU CURRICULO
ENVIE SEU CURRÍCULO
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 14: ordinal not in range(128)
Encoding, Encoding, Encoding:
I have used Notepad++ to convert the file both to regular UTF-8 and to UTF-8 without the Byte Order Mark, and it shows the characters just fine, without any issue. I'm using Windows, by the way, which creates files as ANSI by default.
Question:
What should I do to be able to read this file properly, including the í, ó and other accented characters?
Just to make it clearer: I want to keep the accents on the strings I hold in memory.
Update:
Here's the list of keywords, in memory, read from the file using the code above.
The problem seems not to be in the reading, but in the printing. You said:
I'm using Windows, by the way, which will create files as ANSI by default.
I think that includes printing to stdout. Try changing the sys.stdout codec:
import sys, codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
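Putting the two together, a rough Python 2 sketch (the file path and list name are taken from the question) would be:
import sys
import codecs

# wrap stdout so unicode strings are encoded as UTF-8 when printed
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)

keywords = []
with codecs.open('PATH_TO_FILE/Data/Input/kw.txt', 'r', encoding='utf-8') as f_reader:
    for line in f_reader:
        keywords.append(line.strip().upper())
        print line.strip()
Whether the accents then display correctly still depends on the console's code page.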
I want to read all files from a folder (with os.walk) and convert them to one encoding (UTF-8). The problem is those files don't have same encoding. They could be UTF-8, UTF-8 with BOM, UTF-16.
Is there any way to read those files without knowing their encoding in advance?
You can read those files in binary mode. There is also the chardet module; with it you can detect the encoding of your files and decode the data you get, though the module has its limitations.
As an example:
from chardet import detect

with open('your_file.txt', 'rb') as ef:
    result = detect(ef.read())  # a dict like {'encoding': ..., 'confidence': ...}
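From there, a small sketch of my own (the output file name is a placeholder) that re-encodes the data as UTF-8, assuming chardet's guess is usable:
from chardet import detect

with open('your_file.txt', 'rb') as ef:
    raw = ef.read()

guess = detect(raw)['encoding'] or 'utf-8'  # chardet reports None when it cannot tell
with open('your_file_utf8.txt', 'wb') as out:
    out.write(raw.decode(guess).encode('utf-8'))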
If it is indeed always one of these three, then it is easy: if you can read the file using UTF-8, it is probably UTF-8; otherwise it will be UTF-16. Python can also automatically discard the BOM if present.
You can use a try ... except block to try both:
try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')
If other encodings are present as well (like ISO-8859-1) then forget it, there is no 100% reliable way of figuring out the encoding. But you can guess—see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?
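A fuller sketch of the whole conversion (the folder walk from the question plus the try/except fallback above; the folder name and helper are placeholders of mine, and files are rewritten in place):
import io
import os

def convert_to_utf8(path, encodings=('utf-8-sig', 'utf-16')):
    # try each candidate encoding in turn; the first one that decodes wins
    for enc in encodings:
        try:
            with io.open(path, encoding=enc) as infile:
                text = infile.read()
            break
        except UnicodeError:
            continue
    else:
        raise ValueError('could not decode %s' % path)
    with io.open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(text)

for root, dirs, files in os.walk('some_folder'):
    for name in files:
        convert_to_utf8(os.path.join(root, name))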
I've noticed recently that Python behaves in a rather non-obvious way when appending to a file using the utf-8-sig encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up in the file:
<BOM>123
<BOM>123
Isn't that a bug? It seems so illogical.
Could anyone explain to me why it was done this way?
Why isn't the BOM prepended only when the file doesn't exist and needs to be created?
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.
Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with codecs.open()) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io

with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')
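Equivalently, if you prefer to keep utf-8-sig, a small variant of my own (not from the answer) is to pick the codec based on whether the file already exists; note it has the same blind spot for pre-created empty files mentioned above:
import io
import os

# use utf-8-sig only when the file is being created, plain utf-8 when appending
codec = 'utf-8' if os.path.exists(filename) else 'utf-8-sig'
with io.open(filename, 'a', encoding=codec) as outfh:
    outfh.write(u'appended line\n')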
I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and is more robust than codecs for handling encoded files, in my experience.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so a byte order mark has nothing to signal there. UTF-16 and UTF-32, on the other hand, can be written with either of two distinct byte orders, which is why a BOM is needed.
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (i.e. to recognise that it is not one of the legacy code pages).
With Python 2.7 I am reading as unicode and writing as utf-16-le. Most characters are correctly interpreted, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write correctly:
import codecs

with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))
But either of these changes, when substituted above, makes the code work:
unichr(33033) and unichr(33035) work correctly.
'utf-8' encoding (without BOM, byte-order mark).
How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?
You are opening the file in text mode, which means that line-break characters/bytes will be translated to the local convention. Unfortunately the character you are trying to write includes a byte, 0A, that is interpreted as a line break and does not make it to the file correctly.
Open the file in binary mode instead:
open('temp.txt','wb')
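Applied to the snippet from the question, a minimal sketch of that fix (Python 2) would be:
import codecs

# 'wb' stops the 0A byte in the UTF-16-LE encoding of u'\u810a'
# from being rewritten as a Windows line ending
with open('temp.txt', 'wb') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    temp.write(u'\u810a'.encode('utf-16-le'))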
@Joni's answer gets to the root of the problem, but if you use codecs.open instead, it always opens the file in binary mode, even if binary mode is not specified. Using the utf16 codec also writes the BOM automatically, using native endianness:
import codecs

with codecs.open('temp.txt', 'w', 'utf16') as temp:
    temp.write(u'\u810a')
Hex dump of temp.txt:
FF FE 0A 81
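If you want to verify the output yourself, one way (a snippet of my own, not part of the answer) to produce such a dump from Python 2 is:
with open('temp.txt', 'rb') as f:
    print ' '.join('%02X' % ord(b) for b in f.read())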
Reference: codecs.open
You're already using the codecs library. When working with that file, swap open() for codecs.open() so the encoding is handled transparently:
import codecs

with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))
If you have a problem after that, you might have an issue with your viewer, not your Python script.
I am using Python 2.7.2 on Mac OS X 10.8.2. I need to write a .csv file which often contains several umlauts such as ä, ö and ü. When I write the .csv file, Numbers and OpenOffice can both read it correctly and display the umlauts without any problems.
But if I read it with Microsoft Excel 2004, the words are displayed like this:
TuÃàrlersee
I know Excel has problems dealing with UTF-8. I read that Excel versions below 2007 are not able to read UTF-8 files properly, even if you have set the UTF-8 BOM (Byte Order Mark). I'm setting the UTF-8 BOM with the following line:
e.write(codecs.BOM_UTF8)
So my next step, instead of exporting a UTF-8 file, was to set the character encoding to mac-roman. With the following line I decode the value from UTF-8 and re-encode it as mac-roman:
projectName = projectDict['ProjectName'].decode('utf-8').encode('mac-roman')
But then I receive the following error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0308' in position 6: character maps to <undefined>
How can I export this data into a .csv that Excel can read with the umlauts intact? I thought Python handled everything internally in UTF-8, but maybe I'm not understanding the decoding/encoding correctly. In Python 3 the whole encoding/decoding model was reworked, but I need to stay on version 2.7.2.
I am using the DictWriter like this:
w = csv.DictWriter(e, fieldnames=fieldnames, extrasaction='ignore', delimiter=';', quotechar='\"', quoting=csv.QUOTE_NONNUMERIC)
w.writeheader()
The \u0308 is a combining diaeresis; you'll need to normalize your unicode string before encoding to mac-roman:
import unicodedata
unicodedata.normalize('NFC', projectDict['ProjectName'].decode('utf-8')).encode('mac-roman')
Demo, encoding an ä given in decomposed form (a plus combining diaeresis) to mac-roman after normalization to composed form:
>>> unicodedata.normalize('NFC', u'a\u0308').encode('mac-roman')
'\x8a'
I've used this technique in the past to produce CSV for Excel for specific clients where their platform encoding was known upfront (Excel will interpret the file in the current Windows encoding, IIRC). In that case I encoded to windows-1252.
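Tied back to the DictWriter from the question, a rough Python 2 sketch (the row data and output file name are placeholders of mine) could look like this:
import csv
import unicodedata

def to_mac_roman(value):
    # compose any combining marks first, then encode for older Excel versions
    return unicodedata.normalize('NFC', value).encode('mac-roman')

fieldnames = ['ProjectName']
rows = [{'ProjectName': u'T\u00fcrlersee'}]  # placeholder data

with open('export.csv', 'wb') as e:
    w = csv.DictWriter(e, fieldnames=fieldnames, extrasaction='ignore',
                       delimiter=';', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    w.writeheader()
    for row in rows:
        w.writerow({k: to_mac_roman(v) for k, v in row.items()})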
CSV files are really only meant to be in ASCII - if what you're doing is just writing out data for import into Excel later, then I'd write it as an Excel workbook to start with which would avoid having to muck about with this kind of stuff.
Check http://www.python-excel.org/ for the xlwt module.
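For reference, a minimal xlwt sketch (assuming xlwt is installed; the sheet name, cell values and file name are arbitrary):
import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet('Export')
# unicode values can be written directly; no manual encoding is needed
ws.write(0, 0, u'ProjectName')
ws.write(1, 0, u'T\u00fcrlersee')
wb.save('export.xls')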