I am reading the headers of CSV files from a folder.
Code:
from os import listdir

# mypath = folder containing the csv files
for each_file in listdir(mypath):
    with open(mypath + "//" + each_file) as f:
        first_line = f.readline().strip().split(",")
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
Environment:
Spyder, Python 3
I don't understand the encoding error, since I have not done any encoding myself.
Try specifying an encoding when opening the file in the with statement. I tried the code below and it worked fine for me. Please try different encodings and see if any of them works:
for each_file in listdir(path):
    with open(path + "//" + each_file, encoding='utf-8') as f:
        first_line = f.readline().strip().split(",")
        print(each_file, ' --> ', first_line)
Also, check this link about detecting the encoding of a CSV file. Hope it helps:
How to check encoding of CSV file
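If you prefer to check from Python, a minimal sketch using the third-party chardet package (the package and the sample size below are my assumptions, not something from the linked question) could be dropped into the same loop:
import chardet

# Guess the encoding from the first few KB of each file (a heuristic,
# not a guarantee), then reopen the file with the detected encoding.
with open(path + "/" + each_file, "rb") as f:
    guess = chardet.detect(f.read(4096))
print(each_file, guess["encoding"], guess["confidence"])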
Happy Coding :)
Try using a single slash '/':
with open(mypath +"/"+each_file) as f:
Another possibility is that the CSV file is in a different Unicode encoding (for example UTF-16), not UTF-8. It would be easier to help if you posted a sample of the CSV file too.
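If you are not sure which encoding it is, one approach (a sketch of my own, not from the answer above) is to try a few candidate encodings on the header line and keep the first one that decodes:
# latin-1 is listed last because it never raises, so it acts as a fallback.
for enc in ('utf-8', 'utf-16', 'cp1252', 'latin-1'):
    try:
        with open(mypath + "/" + each_file, encoding=enc) as f:
            first_line = f.readline().strip().split(",")
        print(each_file, enc, first_line)
        break
    except UnicodeDecodeError:
        continue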
The built-in os.path.join provides a convenient way to join two or more path components without worrying about platform-specific separators ('/' or '\').
import os

files = os.listdir(path)
for file in files:
    with open(os.path.join(path, file), encoding='utf-8') as f:
        first_line = f.readline().strip().split(",")
        print(file, ' --> ', first_line)
Related
I'm trying to read a CSV over SFTP using pysftp/Paramiko. My code looks like this:
import csv
import pysftp

input_conn = pysftp.Connection(hostname, username, password)
file = input_conn.open("Data.csv")
file_contents = list(csv.reader(file))
But when I do this, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 23: invalid start byte
I know that this means the file is expected to be in UTF-8 encoding but isn't. The strange thing is, if I download the file and then use my code to open the file, I can specify the encoding as "macroman" and get no error:
with open("Data.csv", "r", encoding="macroman") as csvfile:
file_contents = list(csv.reader(csvfile))
The Paramiko docs say that the encoding of a file is meaningless over SFTP because it treats all files as bytes – but then, how can I get Python's CSV module to recognize the encoding if I use Paramiko to open the file?
If the file is not huge, so it is not a problem to have it loaded into memory twice, you can download and convert the contents in memory:
import csv
import io

with io.BytesIO() as bio:
    input_conn.getfo("Data.csv", bio)
    bio.seek(0)
    with io.TextIOWrapper(bio, encoding='macroman') as f:
        file_contents = list(csv.reader(f))
Partially based on Convert io.BytesIO to io.StringIO to parse HTML page.
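An equivalent sketch that decodes the downloaded bytes explicitly instead of wrapping the buffer (same idea, just written differently; input_conn is the connection from the question):
import csv
import io

# Download the raw bytes, decode them as Mac Roman, then parse the text.
with io.BytesIO() as bio:
    input_conn.getfo("Data.csv", bio)
    text = bio.getvalue().decode("macroman")

file_contents = list(csv.reader(io.StringIO(text)))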
I have an HKSCS dataset that I am trying to read in Python 3. Below is the code:
encoding = 'big5hkscs'
lines = []
num_errors = 0
for line in open('file.txt'):
    try:
        lines.append(line.decode(encoding))
    except UnicodeDecodeError as e:
        num_errors += 1
It throws the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte. It seems there is a non-UTF-8 character in the dataset that the code is not able to decode.
I tried adding errors='ignore' on this line:
lines.append(line.decode(encoding, errors='ignore'))
But that does not solve the problem.
Can anyone please suggest a fix?
If a text file contains text encoded with an encoding that is not the default encoding, the encoding must be specified when opening the file to avoid decoding errors:
encoding = 'big5hkscs'
path = 'file.txt'

with open(path, 'r', encoding=encoding) as f:
    for line in f:
        ...  # do something with line
Alternatively, the file may be opened in binary mode, and text decoded afterwards:
encoding = 'big5hkscs'
path = 'file.txt'

with open(path, 'rb') as f:
    for line in f:
        decoded = line.decode(encoding)
        # do something with the decoded text
In the question, the file is opened without specifying an encoding, so its contents are automatically decoded with the default encoding - apparently UTF-8 in this case.
Looks like if I do NOT add the except clause (except UnicodeDecodeError as e), it works fine:
encoding = 'big5hkscs'
lines = []
path = 'file.txt'

with open(path, encoding=encoding, errors='ignore') as f:
    for line in f:
        line = '\t' + line
        lines.append(line)
The answer of snakecharmerb is correct, but perhaps you need an explanation.
You didn't write it in the original question, but I assume the error is raised on the for line statement. On that line the file is being decoded from UTF-8 (probably the default in your environment, so presumably not Windows), and then you try to decode it a second time. So the error is not about decoding big5hkscs, but about opening the file as text in the first place.
As in the second part of snakecharmerb's good answer, you should open the file in binary mode, so without decoding text, and then decode each line yourself with line.decode(encoding). (Note: I'm not sure you can read lines from a binary file.) This way you can still catch the errors, e.g. to write a message. Otherwise the normal way is to decode at open() time, but then you lose the ability to fall back and to give users a better error message (e.g. with a line number).
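A minimal sketch of that binary-mode approach, reporting the line number when a line fails to decode (the file name and message format are just placeholders):
encoding = 'big5hkscs'
lines = []

# Iterating a file opened in 'rb' mode yields byte strings split at b'\n';
# decode each one explicitly so failures can be reported per line.
with open('file.txt', 'rb') as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            lines.append(raw.decode(encoding))
        except UnicodeDecodeError as exc:
            print('line', lineno, 'could not be decoded:', exc)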
I want to merge multiple files using Python 3. All the files are in a single folder, with .txt as the extension.
In the folder there are files whose names start with special characters like a dot (.) or braces (), etc. The code and dataset are in separate folders. Please help.
What I have tried is as follows:
#encoding:utf-8
import os
from pprint import pprint

path = 'F:/abc/intt'
file_names = os.listdir(path)

with open('F://hi//Datasets//megef_fil.txt', 'w', encoding='utf-8') as outfile:
    for fname in file_names:
        with open(os.path.join(path, fname)) as infile:
            for line in infile:
                outfile.write(line)
The traceback I am getting is something like this:
File "C:\Users\Shanky\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 31: character maps to <undefined>
I am absolutely not having any clue where I am going wrong. Please help me in fixing this issue. Any help is appreciated.
It seems that you are reading your infile without specifying the proper encoding. Try:
with open(os.path.join(path,fname), mode="r", encoding="utf-8") as infile:
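For completeness, the whole merge loop with that change applied might look like the sketch below; errors='replace' is an extra assumption on my part so that a stray undecodable byte shows up as a replacement character instead of aborting the merge:
import os

path = 'F:/abc/intt'

# Assumes the source .txt files are UTF-8; substitute the real encoding
# (or keep errors='replace') if some of them are not.
with open('F://hi//Datasets//megef_fil.txt', 'w', encoding='utf-8') as outfile:
    for fname in os.listdir(path):
        with open(os.path.join(path, fname), mode='r', encoding='utf-8',
                  errors='replace') as infile:
            for line in infile:
                outfile.write(line)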
I'm trying to use a Python script to download files from a Chinese service provider (I'm not from China myself). The provider is giving me a .zip file which contains a file which seems to have Chinese characters in its name. This seems to be causing the zipfile module to barf.
Code:
import zipfile

f = "/path/to/zip_file.zip"
if zipfile.is_zipfile(f):
    fz = zipfile.ZipFile(f, 'r')
The zip file's own name doesn't contain any non-ASCII characters, but the name of the file inside it does. When I run the above script I get the following exception:
Traceback (most recent call last):
  File "./temp.py", line 9, in <module>
    fz = zipfile.ZipFile(f, 'r')
  File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 859, in _RealGetContents
    x.filename = x._decodeFilename()
  File "/usr/lib/python2.7/zipfile.py", line 379, in _decodeFilename
    return self.filename.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbd in position 30: invalid start byte
I've tried looking through the answers to many similar questions:
Read file with Chinese Characters
Extract zip files with non-unicode filenames
Extract files with invalid characters
Please correct me if I'm wrong, but it looks like an open issue with the zipfile module.
How do I get around this? Is there any alternative module for dealing with zipfiles that I should use? Or any other solution?
TIA.
Edit:
I can access/unzip the same file perfectly with the Linux command-line utility "unzip".
Python 2.x (2.7) and Python 3.x handle non-UTF-8 filenames in the zipfile module a bit differently.
First, both check the file's ZipInfo.flag_bits; if ZipInfo.flag_bits & 0x800 is set, the filename is decoded as UTF-8.
If that check is False, in Python 2.x the byte string of the name is returned; in Python 3.x the module decodes the filename as cp437 and returns the decoded result. Of course, in both Python versions the module does not know the true encoding of the filename.
So, suppose you have got a filename from a ZipInfo object or the ZipFile.namelist method, and you already know the filename is encoded with encoding XXX. These are the ways to get the correct Unicode filename:
# in python 2.x
filename = filename.decode('XXX')
# in python 3.x
filename = filename.encode('cp437').decode('XXX')
I recently ran into the same problem. Here is my solution; I hope it is useful for you.
import shutil
import zipfile

f = zipfile.ZipFile('/path/to/zip_file.zip', 'r')
for fileinfo in f.infolist():
    filename = fileinfo.filename.encode('cp437').decode('gbk')
    outputfile = open(filename, "wb")
    shutil.copyfileobj(f.open(fileinfo.filename), outputfile)
    outputfile.close()
f.close()
UPDATE: You can use the following simpler solution with pathlib:
from pathlib import Path
import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    for fn in f.namelist():
        extracted_path = Path(f.extract(fn))
        extracted_path.rename(fn.encode('cp437').decode('gbk'))
The ZIP file is invalid. It has a flag that signals that filenames inside it are encoded as UTF-8, but they're actually not; they contain byte sequences that aren't valid as UTF-8. Maybe they're GBK? Maybe something else? Maybe some unholy inconsistent mixture? ZIP tools in the wild are unfortunately very very poor at handling non-ASCII filenames consistently.
A quick workaround might be to replace the library function that decodes the filenames. This is a monkey-patch as there isn't a simple way to inject your own ZipInfo class into ZipFile, but:
zipfile.ZipInfo._decodeFilename = lambda self: self.filename
would disable the attempt to decode the filename, and always return a ZipInfo with a byte string filename property that you can proceed to decode/handle manually in whatever way is appropriate.
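For illustration only, usage under Python 2.7 (where ZipInfo._decodeFilename exists) might look like this sketch; the 'gbk' encoding is a guess at what the archive really uses:
import zipfile

# Monkey-patch: return the raw byte-string filename instead of decoding it.
zipfile.ZipInfo._decodeFilename = lambda self: self.filename

zf = zipfile.ZipFile('/path/to/zip_file.zip', 'r')
for info in zf.infolist():
    # info.filename is now a byte string; decode it manually with the
    # encoding the archive actually uses ('gbk' here is an assumption).
    real_name = info.filename.decode('gbk')
    zf.extract(info)  # extracted under the raw name; rename afterwards if needed
zf.close()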
This is almost 6 years late, but this was finally fixed in Python 3.11 with the addition of the metadata_encoding parameter. I posted this answer here anyway to help other people with similar issues.
import zipfile

f = "your/zip/file.zip"
t = "the/dir/where/you/want/to/extract/it/all"

with zipfile.ZipFile(f, "r", metadata_encoding="utf-8") as zf:
    zf.extractall(t)
What about this code?
import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    zipInfo = f.infolist()
    for member in zipInfo:
        member.filename = member.filename.encode('cp437').decode('gbk')
        f.extract(member)
@Mr.Ham's solution perfectly solved my problem. I'm using the Chinese version of Windows 10, where the default encoding of the file system is GBK.
I think for users of other languages, just changing the decode from GBK to their system's default encoding will also work. Python can get the default system encoding automatically.
So the patched code looks like this:
import zipfile
import locale

default_encoding = locale.getpreferredencoding()

with zipfile.ZipFile("/path/to/zip_file.zip") as f:
    zipinfo = f.infolist()
    for member in zipinfo:
        member.filename = member.filename.encode('cp437').decode(default_encoding)
        # The second argument extracts the files next to the zip file;
        # leave it out to extract into the working directory.
        f.extract(member, "/path/to/zip_file")
Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't really see any good examples of its usage. Would this be the best way to handle this?
Source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without knowing them explicitly (I've seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?
Edit 1: proposed solution from below (thanks!):
fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)
This gives me the following error:
IOError: [Errno 9] Bad file descriptor
Newsflash
I'm being told in the comments that the mistake is that I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()
It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content out to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
As for guessing the encoding, you can simply loop through the encodings from most to least specific:
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1; this will always work, since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
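A hypothetical usage sketch of the decode() helper, converting a file in place to UTF-8 without a BOM (the filename is just the one from the question):
# Read the raw bytes, guess the decoding, and write the text back as UTF-8.
with open('brh-m-157.json', 'rb') as fp:
    data = fp.read()

with open('brh-m-157.json', 'wb') as fp:
    fp.write(decode(data).encode('utf-8'))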
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:
s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
import codecs
import shutil
import sys

# Read the first three bytes; if they are not a UTF-8 BOM, pass them through.
s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)
# Copy the rest of the input unchanged.
shutil.copyfileobj(sys.stdin, sys.stdout)
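Presumably this is meant to be run as a shell filter, e.g. python strip_bom.py < with_bom.txt > without_bom.txt (the script name is hypothetical). Note that as written it relies on Python 2 semantics; on Python 3 you would read from sys.stdin.buffer and write to sys.stdout.buffer so that the comparison with codecs.BOM_UTF8 is done on bytes.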
I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.
For those who are looking for a way to remove the header so that ConfigParser can open the config file instead of reporting the error File contains no section headers, please open the file like the following:
configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")
This could save you tons of effort by making removal of the file's BOM header unnecessary.
(I know this sounds unrelated, but hopefully this could help people struggling like me.)
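A minimal self-contained sketch of that approach (the file name 'settings.ini' is just a placeholder):
import configparser

# utf-8-sig transparently skips a leading UTF-8 BOM if one is present,
# and behaves like plain utf-8 if it is not.
config = configparser.ConfigParser()
config.read('settings.ini', encoding='utf-8-sig')
print(config.sections())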
This is my implementation for converting any kind of encoding to UTF-8 without a BOM, replacing Windows newlines with the universal format:
import codecs
import os

import chardet  # third-party package


def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
        by default convert endlines to universal format.
    '''
    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))
    # Read from file
    file_open = open(file_path)
    raw = file_open.read()
    file_open.close()
    # Decode
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove windows end line
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, '', 1)
    # Write to file
    file_open = open(file_path, 'w')
    file_open.write(raw)
    file_open.close()
    return 0
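Usage would presumably be as simple as calling the function on each file, e.g.:
# Convert a single file in place; the path is only an example.
utf8_converter('~/some_file.txt')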
You can use codecs.
import codecs

with open("test.txt", 'r') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print content.decode("utf-8")
In Python 3 you should add encoding='utf-8-sig':
with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)
That's it.