Python reading from file encoding problem - python

When I read some of the files like this:
list_of_files = glob.glob('./*.txt') # create the list of files
for file_name in list_of_files:
    FI = open(file_name, 'r', encoding='cp1252')
I get this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1260: character maps to <undefined>
When I switch to this:
list_of_files = glob.glob('./*.txt') # create the list of files
for file_name in list_of_files:
    FI = open(file_name, 'r', encoding="utf-8")
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1459: invalid start byte
And I have read that I should open this as a binary file. But I'm not sure how to do this. Here is my function:
def readingAndAddToList():
    list_of_files = glob.glob('./*.txt') # create the list of files
    for file_name in list_of_files:
        FI = open(file_name, 'r', encoding="utf-8")
        stext = textProcessing(FI.read())
        # split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s)
        secondaryWord_list = stext.split()
        word_list.extend(secondaryWord_list) # Add words to main list
        # The message below is Romanian for: "The length of file <name> is <n> characters"
        print("Lungimea fisierului ",FI.name," este de", len(secondaryWord_list), "caractere")
        sortingAndNumberOfApparitions(secondaryWord_list)
        FI.close()
Only the beginning of my function matters, because I get the error at the reading step.

If you are on Windows, open the file in Notepad and save it with the desired encoding.
On Linux, do the same in a text editor.
Hopefully your program then runs.
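The question also asks how to open the file in binary mode. A minimal sketch of that idea is below: read the raw bytes first, then try a few encodings in order. The helper name `read_text_with_fallback` and the candidate list are assumptions made for illustration, not part of the original post. Note that `latin-1` accepts every byte value, so it acts as a last resort and may produce the wrong characters rather than an error.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    raw = Path(path).read_bytes()  # binary read: no decoding happens here
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding; try the next
    raise ValueError(f"none of {encodings} could decode {path}")
```

In the loop from the question, `FI.read()` could then be replaced by something like `text, used = read_text_with_fallback(file_name)`, which also tells you which encoding each file ended up needing.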

Related

Transposing files from columns to rows for multiple files

I have approximately 200 files (plus more in the future) that I need to transpose data from columns into rows. I'm a microbiologist, so coding isn't my forte (have worked with Linux and R in the past). One of my computer science friends was trying to help me write code in Python, but I have never used it before today.
The files are in .lvm format, and I'm working on a Mac. Items with 2 stars on either side are paths that I've hidden to protect my privacy.
The for loop is where I've been getting the error, but I'm not sure if that's where my problem lies or if it's something else.
This is the Python code I've been working on:
import os
lvm_directory = "/Users/**path**"
output_file = "/Users/**path**/Transposed.lvm"
newFile = True
output_delim = "\t"
for filename in os.listdir(lvm_directory):
    header = []
    data = []
    f = open(lvm_directory + "/" + filename)
    for l in f:
        sl = l.split()
        if (newFile):
            header += [sl[1]]
    f.close()
This is the error message I've been getting and I can't figure out how to work through it:
File "<pyshell#97>", line 5, in <module>
for l in f:
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 345: invalid continuation byte
The rest of the code after this error is as follows, but I haven't worked through it yet due to the above error:
f = open(output_file, 'w')
f.write(output_delim.join(header))
newFile = False
else:
f = open(output_file, 'a')
f.write("\n"+output_delim.join(data))
f.close()
It looks like your files use an encoding other than the default UTF-8. It cannot be plain ASCII, because byte 0xd6 is outside the ASCII range (and any pure-ASCII file is also valid UTF-8). A single-byte encoding such as Latin-1 or cp1252 is a more likely candidate. You'd use something like:
with open(lvm_directory + "/" + filename, encoding="latin-1") as f:
    for l in f:
        # rest of your code here
It's generally more "pythonic" to use a with statement to handle resource management (i.e. opening and closing a file), hence the with approach shown above. If that encoding produces the wrong characters, see if another one works. Tools like chardet (which also installs a chardetect command) can help you identify a file's encoding.
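One way to narrow down candidate encodings is to decode the offending bytes under several of them and compare the results. The sketch below is illustrative only: the helper name `try_encodings` and the candidate list are assumptions, and the single byte 0xd6 from the traceback stands in for a real slice of the file read in binary mode.

```python
def try_encodings(raw, encodings=("ascii", "utf-8", "latin-1", "cp1252", "mac_roman")):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None  # these bytes are not valid in this encoding
    return results


sample = b"\xd6"  # the byte reported in the traceback
for encoding, text in try_encodings(sample).items():
    # ascii() keeps the output printable on any console
    print(f"{encoding:10} -> {ascii(text)}")
```

Encodings that return `None` can be ruled out. Among the ones that succeed, the right one is whichever produces text that makes sense for the data.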

pandas reading csv file encoding error

I have an ISO-8859-9 encoded CSV file and am trying to read it into a DataFrame.
Here is the code and the error I got.
iller = pd.read_csv('/Users/me/Documents/Works/map/dist.csv' ,sep=';',encoding='iso-8859-9')
iller.head()
and the error is
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 250: ordinal not in range(128)
The code below works without error.
import codecs
myfile = codecs.open('/Users/me/Documents/Works/map/dist.csv', "r",encoding='iso-8859-9')
for a in myfile:
    print(a)
My question is: why is pandas not reading my correctly encoded file, and is there any way to make it read it?
It's not possible to see what could be off with your data, of course, but if you can read the data without issues using codecs, then one idea would be to write the file back out in UTF-8 encoding:
import codecs
filename = '/Users/me/Documents/Works/map/dist.csv'
target_filename = '/Users/me/Documents/Works/map/dist-utf-8.csv'
myfile = codecs.open(filename, "r",encoding='iso-8859-9')
f_contents = myfile.read()
or
import codecs
with codecs.open(filename, 'r', encoding='iso-8859-9') as fh:
    f_contents = fh.read()
# write out in UTF-8
with codecs.open(target_filename, 'w', encoding = 'utf-8') as fh:
    fh.write(f_contents)
I hope this helps!
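On Python 3 the same re-encoding step can be sketched with the built-in open, streaming line by line so that large files need not fit in memory. The function name `transcode` and its parameters are assumptions for illustration, not part of the original answer.

```python
def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Copy src_path to dst_path, re-encoding the text line by line."""
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

For the file in the question this would be called roughly as `transcode(filename, target_filename, "iso-8859-9")`, after which the target file can be read as UTF-8.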

encoding issue when reading CSV file with python

I have hit a roadblock when trying to read a CSV file with Python.
UPDATE:
If you just want to skip the offending characters, you can open the file like this:
with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file:
So far I have tried:
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r') as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                print (row)
The error I am getting is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
I have also tried:
with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file:
Error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined>
Now if I just print the data_file it says they are cp1252 encoded but if I try
with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file:
The error I get is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
I also tried the recommended package.
The error I get is:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>
The line I am trying to parse is:
2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT #WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None
Any thoughts or help are appreciated.
I would use csvkit, which automatically detects the appropriate encoding and decodes accordingly, e.g.
import csvkit
reader = csvkit.reader(data_file)
As discussed in the chat, the solution is:
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
                print (data)
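Note that the errors in this question are UnicodeEncodeError, which is raised when the text is written out (here by print), not when the file is read. The accepted approach drops every non-ASCII character. A gentler variant, sketched below under the assumption that the console encoding is the limiting factor, replaces only the characters the console cannot show. The helper name `safe_for_console` is hypothetical.

```python
import sys


def safe_for_console(text, console_encoding=None):
    """Replace characters the console encoding cannot represent with '?'."""
    encoding = console_encoding or sys.stdout.encoding or "ascii"
    # Round-trip through the console encoding; unencodable characters become '?'.
    return text.encode(encoding, errors="replace").decode(encoding)


row = ["RT \u2026 The tallest man in the world", "None"]
print([safe_for_console(cell) for cell in row])
```

Unlike `encode('ascii', 'ignore')`, this keeps any accented characters the console can display and leaves a visible `?` where something was lost.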

I'm trying to encode csv file to utf8 using python

I'm using Python to read many files and re-encode them as UTF-8. I tried it with the code below:
import os
from os import listdir
def find_csv_filenames(path_to_dir, suffix=".csv" ):
    path_to_dir = os.path.normpath(path_to_dir)
    filenames = listdir(path_to_dir)
    #Check *csv directory
    fp = lambda f: not os.path.isdir(path_to_dir+"/"+f) and f.endswith(suffix)
    return [path_to_dir+"/"+fname for fname in filenames if fp(fname)]
def convert_files(files, ascii, to="utf-8"):
    count = 0
    lineno = 0
    for name in files:
        lineno = lineno+1
        with open(name) as f:
            file_target = open(name, mode='r', encoding='latin-1')
            file_content = file_target.read()
            file_target.close()
        print(lineno)
        file_source = open("./csv/data{}.csv".format(lineno), mode='w', encoding='utf-8')
        file_source.write(file_content)
csv_files = find_csv_filenames('./csv', ".csv")
convert_files(csv_files, "cp866")
The problem is that after I read the data and write it to the other files with the encoding set to UTF-8, it still does not work.
Before you open a file whose encoding is unclear, you could use chardet to detect the encoding rather than guessing one. chardet.detect takes the file's bytes, not its path, so read the file in binary mode first:
>>> import chardet
>>> with open('PATH/TO/FILE', 'rb') as f:
...     encoding = chardet.detect(f.read())['encoding']
Then open the file with the detected encoding and write the contents into a file opened with 'utf-8' encoding.
If you're not sure whether the file was actually converted to 'utf-8', you could use enca in a Linux shell to see whether its encoding is 'ASCII' or 'utf-8':
$ enca FILENAME
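If enca is not available, a rough standard-library check for whether a converted file really is UTF-8 can be sketched as follows. The helper name `is_valid_utf8` is an assumption for illustration. It only tells you that the bytes form valid UTF-8, not that the text was decoded from the correct source encoding in the first place.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode as UTF-8 without errors."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure-ASCII files also pass this check, since ASCII is a subset of UTF-8.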

Delete all files in directory with non utf-8 symbols

I have a set of data files, but I need to work only with UTF-8 data, so I need to delete all files that contain non-UTF-8 symbols.
When I try to work with these files, I receive:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined>
and
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte
My code:
class Corpus:
    def __init__(self,path_to_dir=None):
        self.path_to_dir = path_to_dir if path_to_dir else []
    def emails_as_string(self):
        for file_name in os.listdir(self.path_to_dir):
            if not file_name.startswith("!"):
                with io.open(self.add_slash(self.path_to_dir)+file_name,'r', encoding ='utf-8') as body:
                    yield[file_name,body.read()]
    def add_slash(self, path):
        if path.endswith("/"): return path
        return path + "/"
I receive the error here: yield[file_name,body.read()] and here: list_of_emails = mailsrch.findall(text), but when I work with UTF-8 files everything works fine.
I suspect you want to use the errors='ignore' argument on bytes.decode. See http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode for more info.
Edit:
Here's an example showing a good way to do this:
for file_name in os.listdir(self.path_to_dir):
    if not file_name.startswith("!"):
        fullpath = os.path.join(self.path_to_dir, file_name)
        with open(fullpath, 'r', encoding ='utf-8', errors='ignore') as body:
            yield [file_name, body.read()]
Using os.path.join, you can eliminate your add_slash method, and ensure that it works cross-platform.
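To see what errors='ignore' actually does to the data, the sketch below decodes the same bytes under the three common error handlers. The helper name `decode_modes` and the sample bytes are made up for illustration. The byte 0xc1 is the one from the traceback in the question and can never occur in valid UTF-8.

```python
def decode_modes(raw, encoding="utf-8"):
    """Decode the same bytes under each error handler, for comparison."""
    out = {}
    for mode in ("strict", "ignore", "replace"):
        try:
            out[mode] = raw.decode(encoding, errors=mode)
        except UnicodeDecodeError as exc:
            out[mode] = f"error: {exc.reason}"
    return out


raw = b"caf\xc1e"  # sample bytes containing one invalid UTF-8 byte
for mode, text in decode_modes(raw).items():
    # ascii() keeps the output printable on any console
    print(mode, ascii(text))
```

With 'ignore' the bad byte silently disappears, while 'replace' substitutes U+FFFD so the damage stays visible. If the goal is to exclude such files entirely, catching the 'strict' error and skipping the file may be closer to what the question asks for.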
