I want to read all files from a folder (with os.walk) and convert them to one encoding (UTF-8). The problem is that the files don't all have the same encoding. They could be UTF-8, UTF-8 with BOM, or UTF-16.
Is there any way to read those files without knowing their encoding?
You can read those files in binary mode and use the chardet module. With it you can detect the encoding of your files and decode the data accordingly, though the module has its limitations.
As an example:
from chardet import detect

with open('your_file.txt', 'rb') as ef:
    detected = detect(ef.read())  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
If it is indeed always one of these three, then it is easy. If you can decode the file as UTF-8, it is probably UTF-8; otherwise it will be UTF-16. Python can also discard the BOM automatically if one is present (the utf-8-sig codec does this).
You can use a try ... except block to try both:
try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')  # handles UTF-8 with or without a BOM
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')
If other encodings are present as well (like ISO-8859-1) then forget it; there is no 100% reliable way of figuring out the encoding. But you can guess: see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?
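Putting the pieces together, here is a minimal sketch of the whole conversion pass; the folder name and the in-place rewrite are illustrative, and it assumes the files really are limited to UTF-8 (with or without BOM) and UTF-16:

import os

def read_any(path):
    # Read the raw bytes once, then try the known encodings in order.
    # utf-8-sig also handles plain UTF-8, since the BOM is optional.
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in ('utf-8-sig', 'utf-16'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    raise ValueError('unknown encoding: %s' % path)

for root, dirs, files in os.walk('input_files'):  # folder name is illustrative
    for name in files:
        path = os.path.join(root, name)
        text = read_any(path)
        with open(path, 'w', encoding='utf-8') as out:  # rewrite in place as UTF-8
            out.write(text)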
Related
I am consuming a text response from a third-party API. This text is in an encoding which is unknown to me. I consume the text in Python 3 and want to convert it to UTF-8.
This is an example of the contents I get:
Danke
"Träume groß"
🙌ðŸ¼
Super Idee!!!
I was able to make the messed-up characters readable by doing the following manually:
1. Open a new document in Notepad++.
2. Via the Encoding menu, switch the encoding of the document to ANSI.
3. Paste the contents.
4. Again use the Encoding menu, this time switching to UTF-8.
Now the text is properly legible, like below.
Correct content:
Danke
"Träume groß"
🙌🏼
Super Idee!!!
I want to repeat this process in Python 3, but I am struggling to do so. From the Notepad++ workflow I gather that the encoding shouldn't be converted; rather, the existing characters should be interpreted with a different encoding. (If I select Convert to UTF-8 in the Encoding menu instead, it doesn't work.)
From what I have read on SO, there are the encode and decode methods to do that. Also, ANSI isn't really an encoding but rather refers to the standard encoding the current machine uses, which would most likely be cp1252 on my Windows machine. I have messed around with all combinations of cp1252 and utf-8 as source and/or target, but to no avail. I always end up with a UnicodeEncodeError.
I have also tried using the chardet module to determine the encoding of my input string, but it requires bytes as input and b'🙌ðŸ¼' is rejected with SyntaxError: bytes can only contain ASCII literal characters.
"Träume groß" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.
A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:
print('"Träume groß"'.encode('cp1252').decode('utf8'))
gives as expected:
"Träume groß"
But this is only a workaround. The correct solution is to find where the original bytes were read as cp1252 and decode them as UTF-8 directly there.
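For instance, if the text comes over HTTP, a minimal sketch (the URL is a placeholder) is to read the raw bytes and decode them yourself, rather than letting some layer guess cp1252:

import urllib.request

# Placeholder URL; the point is to decode the raw bytes yourself.
with urllib.request.urlopen('http://example.com/api/messages') as resp:
    raw = resp.read()        # bytes, untouched by any guessed codec
text = raw.decode('utf-8')   # decode once, with the real encoding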
You can use bytes() to convert a string to bytes, and then decode it with .decode()
>>> bytes("Träume groß", "cp1252").decode("utf-8")
'Träume groß'
chardet could probably be useful here. Quoting straight from the docs:
>>> import urllib.request
>>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
When a text file is open for reading using (say) UTF-8 encoding, is it possible to change encoding during the reading?
Motivation: It happens that you need to read a text file that was written using a non-default encoding. The text format may contain information about the encoding used; take an HTML file as an example, or XML, ASCIIDOC, and many others. In such cases, the lines above the encoding information are allowed to contain only ASCII or some default encoding.
In Python, it is possible to read the file in binary mode and translate the lines from bytes to str on your own. When the information about the encoding is found on some line, you just switch the encoding used when converting the lines to Unicode strings.
In Python 3, text files are implemented using TextIOBase, which also defines the encoding attribute, the underlying buffer, and other things.
Is there any nice way to change the encoding information (used for decoding the bytes) so that the next lines would be decoded in the wanted way?
Classic usage is:
1. Open the file in binary mode (bytes strings).
2. Read a chunk and guess the encoding (for instance with a simple scan, or using a regex).
3. Then either:
- close the file and re-open it in text mode with the found encoding, or
- move to the beginning with seek(0), read the whole content as a bytes string, and decode it using the found encoding (see the sketch below).
See this example: Detect character encoding in an XML file (Python recipe)
Note: the code is a little old, but still useful.
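As a minimal sketch of the second variant (seek(0), then decode the whole content); the regex only covers a simple XML-style declaration, and the function name is illustrative:

import re

def read_with_declared_encoding(path, default='ascii'):
    with open(path, 'rb') as f:
        # Pass 1: scan the first chunk for an encoding declaration.
        head = f.read(1024)
        m = re.search(rb'encoding=["\']([A-Za-z0-9_.-]+)["\']', head)
        encoding = m.group(1).decode('ascii') if m else default
        # Pass 2: rewind and decode the whole content with the found encoding.
        f.seek(0)
        return f.read().decode(encoding)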
I've got a script that basically aggregates students' code files into one file for plagiarism detection. It walks through a tree of files, copying all file contents into one file.
I've run the script on the exact same files on my Mac and my PC. On my PC, it works fine. On my Mac, it encounters 27 UnicodeDecodeErrors (probably 0.1% of all files I'm testing).
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
If relevant, the code is:
originalFile = open(originalFilename, "r")
newFile = open(newFilename, "a")
newFile.write(originalFile.read())
Figure out what encoding was used when saving that file. A safe bet is loading the file as 'utf-8'. If that succeeds then it's likely to be the correct encoding.
# try utf-8. If this fails, all bets are off.
open(originalFilename, "r", encoding="utf-8")
Now, if students are sending you these files, it's likely they just use the default encoding on their system. It is not possible to reliably guess the encoding. If they were using an 8-bit codec, like one of the ISO-8859 character sets, it will be almost impossible to guess which one was used. What to do then depends on what kind of files you're processing.
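As a sketch of the try-UTF-8-first approach, with ISO-8859-1 as a last-resort fallback (every byte maps to some character in it, so decoding never raises, though the result may be wrong); the function name is illustrative:

def read_student_file(path):
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        # Fallback guess: decodes anything, but may give mojibake.
        with open(path, 'r', encoding='iso-8859-1') as f:
            return f.read()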
It is incorrect to read Python source files using open(originalFilename, "r") on Python 3: open() uses locale.getpreferredencoding(False) by default, while a Python source file may use a different character encoding. In the best case this causes a UnicodeDecodeError; usually, you just get mojibake silently.
To read Python source taking the encoding declaration (# -*- coding: ... -*-) into account, use tokenize.open(filename). If it fails, the input is not valid Python 3 source code.
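A minimal example (the filename is a placeholder):

import tokenize

# tokenize.open() honors a coding declaration (and a BOM)
# when choosing the decoder, unlike the built-in open().
with tokenize.open('student_submission.py') as f:
    source = f.read()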
What could cause a UnicodeDecodeError on a Mac, but not on a PC?
locale.getpreferredencoding(False) is likely to be utf-8 on a Mac, and utf-8 does not accept an arbitrary sequence of bytes as utf-8-encoded text. The PC is likely to use an 8-bit character encoding that corrupts the input and silently produces mojibake instead of raising an error about the mismatched character encoding.
To read a text file, you should know its character encoding. If you don't know the character encoding then either read the file as a sequence of bytes ('rb' mode) or you could try to guess the encoding using chardet Python module (it would be only a guess but it might be good enough depending on your task).
I got the exact same problem. There seemed to be some characters in the file that raised a UnicodeDecodeError during readlines().
This only happened on my macbook, but not on a PC.
I solve the problem by simply skipping these characters:
with open(file_to_extract, errors='ignore') as f:
    reader = f.readlines()
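If you'd rather keep a visible marker where the bad bytes were instead of silently dropping them, errors='replace' substitutes U+FFFD; the same snippet with that change:

with open(file_to_extract, errors='replace') as f:
    reader = f.readlines()  # undecodable bytes become U+FFFD instead of vanishing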
I've noticed recently that Python behaves in a rather non-obvious way when appending to a file using the utf-8-sig encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up in the file:
<BOM>123
<BOM>123
Isn't that a bug? This is so not logical.
Could anyone explain to me why it was done so?
Why doesn't it prepend the BOM only when the file doesn't exist and needs to be created?
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.
Then there are other use cases where the codec is used on a stream or bytestring (i.e. not with codecs.open()) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io
with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file, so write the BOM first
        outfh.write(u'\ufeff')
I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and is more robust than codecs for handling encoded files, in my experience.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is no byte order to mark; UTF-16 and UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed there.
The UTF-8 BOM is mostly used by Microsoft products as a signature to auto-detect the encoding of a file (i.e. to tell it apart from the legacy code pages).
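For what it's worth, the UTF-8 BOM is just the three bytes EF BB BF, and the utf-8-sig codec strips it on decoding:

>>> b'\xef\xbb\xbf123'.decode('utf-8-sig')
'123'
>>> b'\xef\xbb\xbf123'.decode('utf-8')
'\ufeff123'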
I am trying to read some Unicode files that I have locally. How do I read Unicode files while using a list? I've read the Python docs and a ton of Stack Overflow Q&As, which have answered a lot of other questions I had, but I can't find the answer to this one.
Any help is appreciated.
Edit: Sorry, my files are in utf-8.
You can open UTF-8-encoded files by using
import codecs
with codecs.open("myutf8file.txt", encoding="utf-8-sig") as infile:
    for line in infile:
        print(line, end='')  # do something with each line
Be aware that codecs.open() does not translate \r\n to \n, so if you're working with Windows files, you need to take that into account.
The utf-8-sig codec will read UTF-8 files with or without a BOM (Byte Order Mark) (and strip it if it's there). On writing, you should use utf-8 as a codec because the Unicode standard recommends against writing a BOM in UTF-8 files.
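If the newline translation mentioned above matters to you, the built-in open() (from the io framework) gives you universal newlines plus the same BOM handling; a minimal sketch:

with open("myutf8file.txt", encoding="utf-8-sig") as infile:
    for line in infile:
        print(line, end='')  # \r\n is already translated to \n on reading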