Cannot replace text in a CSV using Python

I have a CSV file with two columns, 'title' and 'description'. The description column contains HTML elements. I am trying to replace 'InterviewNotification' with 'InterviewAlert'.
[screenshot of the CSV file]
This is the code I wrote:
text = open("data.csv", "r")
text = ''.join([i for i in text]).replace("InterviewNotification", "InterviewAlert")
x = open("output.csv","w")
x.writelines(text)
x.close()
But I'm getting this error:
File "C:\Users\Zed\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5786: character maps to <undefined>
I also tried pandas; here is the code:
import pandas as pd

dataframe = pd.read_csv("data.csv")
# using the replace() method
dataframe.replace(to_replace="InterviewNotification", value="InterviewAlert", inplace=True)
Still no luck. Help, please.

Have you tried specifying the encoding as "utf-8" in your first line? For example:
text = open("data.csv", encoding="utf8")
It seems that your issue may be related to this question.
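A minimal sketch of how that fits the rest of the question's code (assuming the file is genuinely UTF-8; if your editor reports a different encoding, pass that instead):

text = open("data.csv", "r", encoding="utf-8").read()  # decode explicitly instead of relying on the cp1252 default
text = text.replace("InterviewNotification", "InterviewAlert")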

You open the file but never explicitly read it into a string; iterating the open file object does pull in the lines, but it is clearer to read the text directly:
textFile = open("data.csv", "r")
text = textFile.read()
textFile.close()
Or, to improve the code, use a context manager:
with open("data.csv", "r") as textFile:
text = textFile.read()
This ensures that the file is properly closed even if the intermediate code raises an exception.
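Putting the two answers together, a minimal sketch of the whole read-replace-write flow (still assuming UTF-8 input) might look like:

# Read with an explicit encoding, replace, then write the result back out.
with open("data.csv", "r", encoding="utf-8") as infile:
    text = infile.read()

text = text.replace("InterviewNotification", "InterviewAlert")

with open("output.csv", "w", encoding="utf-8") as outfile:
    outfile.write(text)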

Python can't parse CSV as list (UTF-8 BOM) [duplicate]

Edit: the suggested duplicate, Convert UTF-8 with BOM to UTF-8 with no BOM in Python, only covers plain text files and does not solve my issue with CSV files.
I have two CSV files:
rtc_csv_file="csv_migration\\rtc-test.csv"
ads_csv_file="csv_migration\\ads-test.csv"
Here is the ads-test.csv file (which is causing issues):
https://easyupload.io/bk1krp
VS Code's status bar reports the file as UTF-8 with BOM when I open the CSV.
I am trying to write a Python function to read in every row and convert it to a dict object.
My function works fine for the first file, rtc-test.csv, but for the second file, ads-test.csv, I get the error "UTF-16 stream does not start with BOM" when I use utf-16. I have also tried utf-8 and utf-8-sig, but then each line is read in as a single string with commas separating the values. I can't just split on commas, because some column values themselves contain commas.
My Python code correctly reads rtc-test.csv as a list of values. How can I read ads-test.csv the same way when the CSV is encoded as UTF-8 with BOM?
Code:
rtc_csv_file="csv_migration\\rtc-test.csv"
ads_csv_file="csv_migration\\ads-test.csv"
from csv import reader
import csv
# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                if csv_cols is None:
                    csv_cols = row
                    dict['csv_cols']=csv_cols
                    print('csv_cols=',csv_cols)
                else:
                    row_id_val = row[csv_cols.index(str(id_format))]
                    print('row_id_val=',row_id_val)
                    dict['rows'][row_id_val] = row
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}
rtc_dict = read_csv_as_map(rtc_csv_file, 'Id', 'utf-16')
ads_dict = read_csv_as_map(ads_csv_file, 'ID', 'utf-16')
Console output:
filename: csv_migration\rtc-test.csv, id_format: Id, encoding: utf-16
csv_cols= ['Summary', 'Status', 'Type', 'Id', '12NC']
row_id_val= 262998
done
filename: csv_migration\ads-test.csv, id_format: ID, encoding: utf-16
err= UTF-16 stream does not start with BOM
If I try utf-16-le instead, I get a different error: 'utf-16-le' codec can't decode byte 0x22 in position 0: truncated data.
With utf-16-be, I get: 'utf-16-be' codec can't decode byte 0x22 in position 0: truncated data.
Why can't my Python code read this CSV file?
Your CSV is encoded with UTF-8 (the default) instead of UTF-16, so pass that as the encoding:
ads_csv_file="ads-test.csv"
from csv import reader
# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            # the reference CSV below is comma-delimited, so drop the
            # delimiter='\t' that was used for the tab-delimited rtc-test.csv
            csv_reader = reader(read_obj)
            csv_cols = None
            for row in csv_reader:
                if csv_cols is None:
                    csv_cols = row
                    dict['csv_cols']=csv_cols
                    print('csv_cols=',csv_cols)
                else:
                    row_id_val = row[csv_cols.index(str(id_format))]
                    print('row_id_val=',row_id_val)
                    dict['rows'][row_id_val] = row
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(ads_csv_file, 'ID', 'utf-8') # <- updated here
Here's the CSV for reference:
Title,State,Work Item Type,ID,12NC
"453560751251 TOOL, SQ-59 CORNER CLAMP","To Do","FRUPS","6034","453560751251"

Pandas cannot load data, CSV encoding mystery

I am trying to load a dataset into pandas and cannot seem to get past step 1. I am new, so please forgive me if this is obvious; I have searched previous topics and not found an answer. The data is mostly Chinese characters, which may be the issue.
The .csv is very large, and can be found here: http://weiboscope.jmsc.hku.hk/datazip/
I am trying on week 1.
In my code below, I show the three decoding approaches I attempted, including an attempt to detect which encoding was used:
import pandas
import chardet
import os
#this is what I tried to start
data = pandas.read_csv('week1.csv', encoding="utf-8")
#spits out error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 69: invalid start byte
#code to check encoding -- this spits out ascii
bytes = min(32, os.path.getsize('week1.csv'))
raw = open('week1.csv', 'rb').read(bytes)
chardet.detect(raw)
#so I tried this! It also fails, which isn't that surprising since I don't know how you'd do Chinese chars in ASCII anyway
data = pandas.read_csv('week1.csv', encoding="ascii")
#spits out error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
#for God knows what reason this allows me to load data into pandas, but it's definitely not the correct encoding, because the first 5 lines print as gibberish instead of Chinese chars
data = pandas.read_csv('week1.csv', encoding="latin1")
Any help would be greatly appreciated!
EDIT: The answer provided by @Kristof does in fact work, as does the program a colleague of mine put together yesterday:
import csv
import pandas as pd
def clean_weiboscope(file, nrows=0):
    res = []
    with open(file, 'r', encoding='utf-8', errors='ignore') as f:
        reader = csv.reader(f)  # note: unused below; the loop iterates the raw file lines
        for i, row in enumerate(f):
            row = row.replace('\n', '')
            if nrows > 0 and i > nrows:
                break
            if i == 0:
                headers = row.split(',')
            else:
                res.append(tuple(row.split(',')))
    df = pd.DataFrame(res)
    return df
my_df = clean_weiboscope('week1.csv', nrows=0)
I also wanted to add for future searchers that this is the Weiboscope open data for 2012.
It seems that there's something very wrong with the input file. There are encoding errors throughout.
One thing you could do is read the CSV file as binary, decode the byte string, and replace the erroneous characters.
Example (source for the chunk-reading code):
in_filename = 'week1.csv'
out_filename = 'repaired.csv'
from functools import partial
chunksize = 100*1024*1024 # read 100MB at a time
# Decode with UTF-8 and replace errors with "?"
with open(in_filename, 'rb') as in_file:
    # write UTF-8 explicitly so the U+FFFD replacement markers always encode
    with open(out_filename, 'w', encoding='utf-8') as out_file:
        for byte_fragment in iter(partial(in_file.read, chunksize), b''):
            out_file.write(byte_fragment.decode(encoding='utf_8', errors='replace'))
# Now read the repaired file into a dataframe
import pandas as pd
df = pd.read_csv(out_filename)
df.shape
>> (4790108, 11)
df.head()
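If rewriting the file is undesirable, recent pandas can also do the replacement inline. A sketch, with the caveat that the encoding_errors keyword requires pandas 1.3 or newer:

import pandas as pd

# Undecodable bytes are replaced during the read instead of via a repaired copy.
df = pd.read_csv('week1.csv', encoding='utf-8', encoding_errors='replace')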

Unicode error when extracting an XML file in Python

import os, csv, io
from xml.etree import ElementTree
file_name = "example.xml"
full_file = os.path.abspath(os.path.join("xml", file_name))
dom = ElementTree.parse(full_file)
Fruit = dom.findall("Fruit")
with io.open('test.csv','w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=',')
    for f in Fruit:
        Explanation = f.findtext("Explanation")
        Types = f.findall("Type")
        for t in Types:
            Type = t.text
            a.writerow([Type, Explanation])
I am extracting data from an XML file and putting it into a CSV file. I am getting the error message below, probably because the extracted data contains a Fahrenheit sign. How can I get rid of these Unicode errors without manually fixing the XML file?
For the last line of my code I get this error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 1267: ordinal not in range(128)
<Fruits>
    <Fruit>
        <Family>Citrus</Family>
        <Explanation>They cannot grow at a temperature below 32 °F</Explanation>
        <Type>Orange</Type>
        <Type>Lemon</Type>
        <Type>Lime</Type>
        <Type>Grapefruit</Type>
    </Fruit>
</Fruits>
You didn't write where the error occurs; probably in the last line. You have to encode the strings yourself:
with open('test.csv','w') as fp:
    a = csv.writer(fp, delimiter=',')
    for f in Fruit:
        explanation = f.findtext("Explanation")
        types = f.findall("Type")
        for t in types:
            a.writerow([t.text.encode('utf8'), explanation.encode('utf8')])
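For what it's worth, the u'...' prefix in the traceback suggests Python 2. On Python 3 the csv module handles Unicode text natively, so no manual encoding is needed; a sketch under that assumption:

import csv
from xml.etree import ElementTree

dom = ElementTree.parse("example.xml")

# Python 3: open the CSV in text mode with an explicit encoding.
# newline='' lets the csv module manage line endings itself.
with open('test.csv', 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp, delimiter=',')
    for fruit in dom.findall("Fruit"):
        explanation = fruit.findtext("Explanation")
        for t in fruit.findall("Type"):
            writer.writerow([t.text, explanation])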

How to remove non-UTF-8 characters and save as a CSV file in Python

I have some Amazon review data that I have successfully converted from text format to CSV format. The problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove the non-UTF-8 content and save to another CSV file?
Thank you!
EDIT1:
Here is the code I used to convert the text to CSV:
import csv
import string

INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"

header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]

f = open(INPUT_FILE_NAME,encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME,"w")
outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the comment review text won't be in many columns
    line = line.replace(',','')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":",1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()
EDIT2:
Thanks to all of you for trying to help me out.
I solved it by modifying the output encoding in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
If the input file is not UTF-8 encoded, it is probably not a good idea to try to read it as UTF-8...
You have basically two ways to deal with decode errors:

- use a charset that will accept any byte, such as iso-8859-15 (also known as latin9)
- if the output should be UTF-8 but contains errors, use errors='ignore' (silently removes non-UTF-8 characters) or errors='replace' (replaces non-UTF-8 characters with a replacement marker, usually ?)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
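To directly address the "save to another CSV file" part, here is a minimal end-to-end sketch; the output filename is made up for illustration, and it assumes you want the cleaned copy in UTF-8:

# Undecodable bytes become the U+FFFD replacement marker on the way in,
# and the cleaned text is written back out as UTF-8.
with open("small-movies1.csv", encoding="utf-8", errors="replace") as src:
    cleaned = src.read()
with open("small-movies1-clean.csv", "w", encoding="utf-8") as dst:
    dst.write(cleaned)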
If you are using Python 3, it provides built-in support for Unicode content:
f = open('file.csv', encoding="utf-8")
If you still want to strip all non-ASCII data from it, you can read it as a normal text file and remove that content:
import re

def remove_unicode(string_data):
    """ (str|bytes) -> str
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        # decode() already returns str; wrapping it in bytes() would raise a TypeError
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # the encode/decode round-trip drops anything outside ASCII
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

with open('file.csv', 'r+', encoding="utf-8") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()  # drop leftover bytes if the cleaned text is shorter
Now you can read it without any unicode data issues.
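A quick usage check of the helper, on a hypothetical input:

# Accented characters are dropped rather than replaced.
print(remove_unicode('naïve café'))  # -> 'nave caf'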

Python error "Ordinal not in range" with accents

I'm scraping a table from the Internet and saving it as a CSV file. There are characters with French accents in the text, resulting in a Unicode error on save:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-6: ordinal not in range(128)
I'd like to find an elegant solution for saving accented characters that I can apply to any situation. I've sometimes used the following:
encode('ascii','ignore')
but it doesn't work this time, for reasons unknown. I'm also trying to replace the <sup> tags in a cell, so I'm converting it using str() first.
Here's the pertinent part of my code:
data = [
    str(td[0]).split('<sup')[0].split('>')[1].split('<')[0],
    td[1].getText()
]
output.append(data)
csv_file = csv.writer(open('savedFile.csv', 'w'), delimiter=',')
for line in output:
    csv_file.writerow(line)
If td[0] is u"a<sup>b</sup>c":
td[0].split('<sup')[0] is u"a".
td[0].partition('>')[2].split('<')[0] is u"b".
td[0][td[0].rindex('>') + 1:] is u"c".
If this kind of string indexing and matching is too simplistic, you might consider creating a regular expression and matching it against the text in the HTML tag:
import re
r = re.compile("[^<]*<sup>([^<]*)</sup>")
m = r.match("some<sup>text</sup>")
print(m.groups()[0])
In Python 2, csv.reader() and csv.writer() require the files to be opened in binary mode. You should also close the file at the end. Therefore, you should write it like:
f = open('output.csv', 'wb')
writer = csv.writer(f, delimiter=',')
for row in output:
    writer.writerow(row)
f.close()
Or you can use the with construct when using newer versions of Python:
with open('output.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',')
    for row in output:
        writer.writerow(row)
... and the file will be closed automatically.
Anyway, csv.writer() in Python 2 expects rows composed of byte strings (not Unicode strings). If you have Unicode strings, convert them using .encode('utf-8'):
for row in output:
    encoded_row = [s.encode('utf-8') for s in row]
    writer.writerow(encoded_row)
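A closing note for Python 3 readers (everything above is Python 2 advice): open the file in text mode with an explicit encoding and newline='', and write the Unicode strings directly; a sketch:

import csv

# Python 3: no manual .encode() needed; the file object does the encoding,
# and newline='' lets the csv module handle line endings itself.
with open('savedFile.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for row in output:  # 'output' as built earlier in the question
        writer.writerow(row)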
