How to provide encoding while reading multiple files? - python

I'm reading multiple CSV files from a folder. While reading them I receive: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 21: invalid start byte
When I read the files one by one, I pass encoding="ISO-8859-1" to pandas.read_csv. My final objective is to append all files into a single data frame. Following is the code I'm using for that purpose.
import glob
import pandas as pd

files = glob.glob('/path_name/*.csv')
df = None
for i, f in enumerate(files):
    if i == 0:
        df = pd.read_csv(f)
        df['fname'] = f
    else:
        tmp = pd.read_csv(f)
        tmp['fname'] = f
        df = df.append(tmp)
df.head()

Try adding errors='ignore'; then everything works, but you will lose a couple of characters.
with open(path, encoding="utf8", errors='ignore') as f:
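Applied to the loop above, here is a minimal sketch that passes the encoding to every read and concatenates once at the end (assuming ISO-8859-1 is right for all the files; pd.concat replaces the now-deprecated df.append):

import glob
import pandas as pd

files = glob.glob('/path_name/*.csv')
frames = []
for f in files:
    # Same encoding that works when reading the files one by one
    tmp = pd.read_csv(f, encoding="ISO-8859-1")
    tmp['fname'] = f
    frames.append(tmp)
df = pd.concat(frames, ignore_index=True)
df.head()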

Cannot replace text in a CSV using Python

I have a CSV file which has two columns, 'title' and 'description'. The description column has HTML elements. I am trying to replace 'InterviewNotification' with 'InterviewAlert'.
(screenshot of the CSV file omitted)
This is the code I wrote:
text = open("data.csv", "r")
text = ''.join([i for i in text]).replace("InterviewNotification", "InterviewAlert")
x = open("output.csv","w")
x.writelines(text)
x.close()
But I'm getting this error:
File "C:\Users\Zed\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5786: character maps to <undefined>
I also used pandas; here is the code:
import pandas as pd

dataframe = pd.read_csv("data.csv")
# using the replace() method
dataframe.replace(to_replace="InterviewNotification", value="InterviewAlert", inplace=True)
Still no luck. Help, please.
Have you tried specifying the encoding as "utf-8" in your first line? For example:
text = open("data.csv", encoding="utf8")
It seems that your issue may be related to this question.
You open the file but you do not read it. To get the text itself do this:
textFile = open("data.csv", "r")
text = textFile.read()
textFile.close()
Or, to improve the code, use a context manager:
with open("data.csv", "r") as textFile:
text = textFile.read()
This ensures that the file is properly closed even if the intermediate code raises an exception.
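Putting the pieces together, a minimal sketch of the whole replacement with an explicit encoding (assuming the file really is UTF-8; if it is not, try encoding="cp1252" or add errors="ignore"):

# Read with an explicit encoding, replace, and write back out
with open("data.csv", "r", encoding="utf-8", errors="ignore") as src:
    text = src.read()
text = text.replace("InterviewNotification", "InterviewAlert")
with open("output.csv", "w", encoding="utf-8") as dst:
    dst.write(text)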

UnicodeEncodeError: 'charmap' codec can't encode characters in position 9-10: character maps to <undefined>

I am trying to read a .mat file and save all its arrays in .txt format. I have been trying to write them to a .txt file, but after some time I get this error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 9-10: character maps to <undefined>
I tried to find the reason and discovered that this .mat file contains an array like this:
array([array(['丰田Avalon'], dtype='<U8')], dtype=object)
and I'm pretty sure the error is because of it. But I can't figure out how to write it to .txt format without this error. My code is:
import os
import scipy.io as cio

mat = cio.loadmat("D:/compCarsThesisData/data/misc/make_model_name.mat")
model_names = mat['model_names']
path = "D:/compCarsThesisData/data/image/"
count = 0
for root, _, files in os.walk(path):
    cdp = os.path.abspath(root)
    for f in files:
        name, ext = os.path.splitext(f)
        if ext == ".jpg":
            cip = os.path.join(cdp, f)
            # print(model_names[int(cip.split('\\')[5])])
            # print("Folder:", cip.split('\\')[4])
            # print("Folder Inside:", cip.split('\\')[5])
            f = open("car_modelss.txt", "a")
            model_names[1369][0][0].encode('utf-8')  # this is the specific array I tried to convert, hardcoded
            f.write(str(model_names[int(cip.split('\\')[5]) - 1]))
            f.write("Folder: %d\r\n" % (int(cip.split('\\')[4])))
            f.write("Folder Inside: %d\r\n" % (int(cip.split('\\')[5])))
            count = count + 1
            print(count)
            f.close()
Please help.
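The 'charmap' codec in the traceback is the Windows default encoding (cp1252), which cannot represent characters such as 丰田. A minimal sketch of the usual fix, reusing the variable names from the loop above, is to open the output file with an explicit UTF-8 encoding:

# Open the log with an explicit encoding so non-ASCII model names
# such as '丰田Avalon' can be written without a UnicodeEncodeError
f = open("car_modelss.txt", "a", encoding="utf-8")
f.write(str(model_names[int(cip.split('\\')[5]) - 1]))
f.close()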

Pandas cannot load data, csv encoding mystery

I am trying to load a dataset into pandas and cannot seem to get past step 1. I am new, so please forgive me if this is obvious; I have searched previous topics and not found an answer. The data is mostly Chinese characters, which may be the issue.
The .csv is very large, and can be found here: http://weiboscope.jmsc.hku.hk/datazip/
I am trying on week 1.
In my code below I show the three decodings I attempted, including an attempt to detect which encoding was used.
import pandas
import chardet
import os

# This is what I tried to start:
data = pandas.read_csv('week1.csv', encoding="utf-8")
# Spits out: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 69: invalid start byte

# Code to check the encoding -- this spits out ascii
bytes = min(32, os.path.getsize('week1.csv'))
raw = open('week1.csv', 'rb').read(bytes)
chardet.detect(raw)

# So I tried this! It also fails, which isn't that surprising,
# since I don't know how you'd do Chinese chars in ascii anyway:
data = pandas.read_csv('week1.csv', encoding="ascii")
# Spits out: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

# For god knows what reason this allows me to load data into pandas, but it's
# definitely not the correct encoding, because the first 5 lines print as
# gibberish instead of Chinese chars:
data = pandas.read_csv('week1.csv', encoding="latin1")
Any help would be greatly appreciated!
EDIT: The answer provided by @Kristof does in fact work, as does the program a colleague of mine put together yesterday:
import csv
import pandas as pd

def clean_weiboscope(file, nrows=0):
    res = []
    with open(file, 'r', encoding='utf-8', errors='ignore') as f:
        reader = csv.reader(f)
        for i, row in enumerate(f):
            row = row.replace('\n', '')
            if nrows > 0 and i > nrows:
                break
            if i == 0:
                headers = row.split(',')
            else:
                res.append(tuple(row.split(',')))
    df = pd.DataFrame(res)
    return df

my_df = clean_weiboscope('week1.csv', nrows=0)
I also wanted to add for future searchers that this is the Weiboscope open data for 2012.
It seems that there's something very wrong with the input file. There are encoding errors throughout.
One thing you could do is to read the CSV file as binary, decode the binary string and replace the erroneous characters.
Example (source for the chunk-reading code):
in_filename = 'week1.csv'
out_filename = 'repaired.csv'

from functools import partial

chunksize = 100*1024*1024  # read 100 MB at a time

# Decode as UTF-8 and substitute the Unicode replacement character for bad bytes
with open(in_filename, 'rb') as in_file:
    with open(out_filename, 'w', encoding='utf-8') as out_file:
        for byte_fragment in iter(partial(in_file.read, chunksize), b''):
            out_file.write(byte_fragment.decode(encoding='utf_8', errors='replace'))

# Now read the repaired file into a dataframe
import pandas as pd

df = pd.read_csv(out_filename)
df.shape
>> (4790108, 11)
df.head()
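For what it's worth, recent pandas versions (1.3 and later) expose the same behaviour directly through read_csv's encoding_errors argument, which skips the separate repair pass; a one-line sketch:

import pandas as pd

# encoding_errors='replace' substitutes undecodable bytes instead of raising
df = pd.read_csv('week1.csv', encoding='utf-8', encoding_errors='replace')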

How to remove non-UTF-8 characters and save as a CSV file in Python

I have some Amazon review data and I have converted it from text format to CSV format successfully. Now the problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove the non-UTF-8 content and save to another CSV file?
Thank you!
EDIT1:
Here is the code I used to convert the text to CSV:
import csv
import string

INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]
f = open(INPUT_FILE_NAME, encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME, "w")
outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the review text won't be split across many columns
    line = line.replace(',', '')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":", 1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()
EDIT2:
Thanks to all of you for trying to help me out.
I solved it by modifying how the output file is opened in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
If the input file is not UTF-8 encoded, it is probably not a good idea to try to read it as UTF-8...
You basically have two ways to deal with decode errors:
- use a charset that will accept any byte, such as ISO-8859-15, also known as latin9
- if the output should be UTF-8 but contains errors, use errors='ignore' (silently removes non-UTF-8 characters) or errors='replace' (replaces non-UTF-8 characters with a replacement marker, usually ?)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
If you are using Python 3, it provides built-in support for Unicode content -
f = open('file.csv', encoding="utf-8")
If you still want to remove all non-ASCII data from the file, you can read it as a normal text file and strip out the non-ASCII content:
import re

def remove_unicode(string_data):
    """ (str|bytes) -> str
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # encode/decode round-trip drops any non-ASCII characters
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

with open('file.csv', 'r+', encoding="utf-8", errors="ignore") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()  # drop anything left over past the rewritten content
Now you can read it without any unicode data issues.

Struggling with unicode in Python

I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)
How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.
i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
for i in range(len(filenames)):
    pathname = filenames[i]
    fin = open(pathname, 'r')
    with codecs.open(('Assets' + '.log'), mode='w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find("Brand Title")
        brand_end = lines.find("/>", brand_start)
        brand = lines[brand_start + 47:brand_end - 2]
        f.write(u'{}|{}\n'.format(pathname[4:35], brand))
flog.close()
I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.
You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing it with Unicode here:
f.write(u'{}|{}\n'.format(pathname[4:35],brand))
brand is a bytestring, interpolated into a Unicode format string. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files:
with io.open('Assets.log', 'w', encoding='utf-8') as f,\
        io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
You also appear to be parsing out an XML file by hand; perhaps you want to use the ElementTree API instead to parse out those values. In that case, you'd open the file without io.open(), so producing byte strings, so that the XML parser can correctly decode the information to Unicode values for you.
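A hypothetical sketch of that approach with the standard library's ElementTree; the tag and attribute names are placeholders, since the real XML schema isn't shown in the question:

import xml.etree.ElementTree as ET

# The parser honours the file's own encoding declaration, so no manual
# decode/encode handling is needed
tree = ET.parse(pathname)
root = tree.getroot()
# 'Brand' and 'Title' are made-up names; substitute the real ones
for brand_el in root.iter('Brand'):
    brand = brand_el.get('Title', 'Missing')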
This is my final code, using the guidance from above. It's not pretty, but it solves the problem. I'll look at getting it all working using lxml at a later date (as this is something I've encountered before when working with different, larger xml files):
import lxml
import io
import os
from lxml import etree
from glob import glob

nsmap = {'xmlns': 'thisnamespace'}
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')
    for i in range(len(filenames)):
        pathname = filenames[i]
        parser = lxml.etree.XMLParser()
        tree = lxml.etree.parse(pathname, parser)
        root = tree.getroot()
        with io.open(pathname, encoding='utf-8') as fin:
            for info in root.xpath('//somepath'):
                series_x = info.find('./somemorepath')
                series = series_x.get('Asset_Name') if series_x is not None else 'Missing'
                lines = fin.read()
                brand_start = lines.find(u"sometext")
                brand_end = lines.find(u"/>", brand_start)
                brand = lines[brand_start:brand_end - 2]
                brand = brand[(brand.rfind("/")) + 1:]
                f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))
Someone will now come along and do it all in one line!
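Not quite one line, but here is a compact hypothetical sketch of the same job with lxml pulling the attributes via XPath (the paths and attribute names are the same placeholders used above):

import io
import os
import lxml.etree
from glob import glob

with io.open('Assets.log', 'w', encoding='utf-8') as log:
    log.write(u'File Path|Series|Brand\n')
    for path in (y for x in os.walk('Distributor') for y in glob(os.path.join(x[0], '*.xml'))):
        root = lxml.etree.parse(path).getroot()
        # xpath() returns a list of matching attribute values; take the first or a default
        series = next(iter(root.xpath('//somepath/somemorepath/@Asset_Name')), 'Missing')
        brand = next(iter(root.xpath('//somepath/@Brand_Title')), 'Missing')
        log.write(u'{}|{}|{}\n'.format(path[5:42], series, brand))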
