Strange character while reading a CSV file - python

I am trying to read a CSV file in Python, but the first element in the first row is read with a strange character in front of the 0, even though that character isn't in the file; it's just a simple 0. Here is the code I used:
import csv
matriceDist = []
file = csv.reader(open("distanceComm.csv", "r"), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)

I had this same issue. Save your Excel file as CSV (MS-DOS) rather than CSV UTF-8, and those odd characters should be gone.

Specifying the byte order mark when opening the file as follows solved my issue:
open('inputfilename.csv', 'r', encoding='utf-8-sig')
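To see why `utf-8-sig` helps, here is a minimal sketch. It writes a small semicolon-separated file that starts with a UTF-8 byte order mark and reads it back with both codecs. The file name `bom_demo.csv` and the sample values are made up for the illustration.

```python
import csv
import os
import tempfile

def first_cell(path, encoding):
    """Return the first cell of the first row, decoded with the given codec."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return next(csv.reader(f, delimiter=";"))[0]

# Hypothetical sample file: UTF-8 BOM (EF BB BF) followed by two rows.
demo_path = os.path.join(tempfile.mkdtemp(), "bom_demo.csv")
with open(demo_path, "wb") as f:
    f.write(b"\xef\xbb\xbf0;12;7\r\n12;0;5\r\n")

with_plain_utf8 = first_cell(demo_path, "utf-8")      # BOM survives as U+FEFF
with_utf8_sig = first_cell(demo_path, "utf-8-sig")    # BOM is stripped

print(repr(with_plain_utf8))
print(repr(with_utf8_sig))
```

If no encoding is given and the platform default is cp1252, the same three bytes are decoded as the characters ï»¿ instead, which is one common form of the "strange character".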

Using pandas with an explicit encoding (utf-8, for example) is going to be easier:
import pandas as pd
df = pd.read_csv('distanceComm.csv', header=None, encoding = 'utf8', delimiter=';')
print(df)

I don't know what your input file is, but since it has a byte order mark for UTF-8, you can use something like this (the utf-8-sig codec strips the mark; plain utf-8 would leave it in the first cell):
import codecs
import csv
matriceDist = []
file = csv.reader(codecs.open('distanceComm.csv', encoding='utf-8-sig'), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)

Related

Trouble reading CSV file using pandas

I'm working on a data analysis project and wanted to read data from CSV files using pandas. I read the first CSV file and it was fine, but the second one gave me a UTF-8 encoding error. I exported the file to CSV, encoded as UTF-8, from the Numbers spreadsheet app. However, the data frame is not in the expected format. Any idea why?
[Image: the original CSV file in Numbers]
It looks like your file is semicolon-separated, not comma-separated.
To fix this, you need to pass the sep=';' parameter to the pd.read_csv function.
pd.read_csv("mitb.csv", sep=';')
Try adding the correct delimiter, in this case ";", to read the csv.
mitb = pd.read_csv('mitb.csv', sep=";")
The file is semicolon-separated, and the decimal mark is a comma, not a dot:
df = pd.read_csv('mitb.csv', sep=';', decimal=',')
And please do not upload images of code/data/errors.
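For intuition about what the separator and the decimal mark change, the same two settings can be imitated with the standard `csv` module. This is only a sketch with invented sample data; it is not what pandas does internally.

```python
import csv
import io

# Invented sample in the same shape: ';' between fields, ',' as decimal mark.
sample = "name;price\nwidget;1,50\ngadget;2,25\n"

# With the default delimiter (','), each line is split at the decimal comma instead.
wrong = list(csv.reader(io.StringIO(sample)))

# With delimiter=';' the columns come out as intended.
right = list(csv.reader(io.StringIO(sample), delimiter=";"))

# Convert the decimal comma by hand, which is the job decimal=',' hands to pandas.
prices = [float(row[1].replace(",", ".")) for row in right[1:]]

print(wrong[1])
print(right[1])
print(prices)
```

The first result shows the symptom from the question: the header collapses into a single column and the numbers are cut in two.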

What is the simplest way to fix an existing csv unicode utf-8 without BOM file not displaying correctly in excel?

I have the task of converting a UTF-8 CSV file to an Excel file, but it is not read properly in Excel because there is no byte order mark (BOM) at the beginning of the file.
I saw this approach:
https://stackoverflow.com/a/38025106/6102332
import csv
with open('test.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f)
    # Write Unicode strings.
    w.writerow([u'English', u'Chinese'])
    w.writerow([u'American', u'美国人'])
    w.writerow([u'Chinese', u'中国人'])
But that seems to work only for brand-new files, not for my file, which already has data.
Is there an easy way to do this?
Is there any other way than this one? https://stackoverflow.com/a/6488070/6102332
1. Save the exported file as a csv
2. Open Excel
3. Import the data using Data-->Import External Data --> Import Data
4. Select the file type of "csv" and browse to your file
5. In the import wizard change the File_Origin to "65001 UTF" (or choose correct language character identifier)
6. Change the Delimiter to comma
7. Select where to import to and Finish
Read the file in and write it back out with the encoding desired:
with open('input.csv','r',encoding='utf-8-sig') as fin:
    with open('output.csv','w',encoding='utf-8-sig') as fout:
        fout.write(fin.read())
utf-8-sig codec will remove BOM if present on read, and will add BOM on write, so the above can safely run on files with or without BOM originally.
You can convert in place by doing:
file = 'test.csv'
with open(file,'r',encoding='utf-8-sig') as f:
    data = f.read()
with open(file,'w',encoding='utf-8-sig') as f:
    f.write(data)
Note also that utf16 works as well. Some older Excels don't handle UTF-8 correctly.
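The claim that this is safe for files with or without a BOM can be checked with a small sketch. The helper name `rewrite_with_bom`, the file names and the sample rows are assumptions made for the demonstration.

```python
import codecs
import os
import tempfile

def rewrite_with_bom(path):
    """Rewrite a UTF-8 file in place so that it starts with exactly one BOM."""
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        data = f.read()          # a leading BOM, if any, is dropped here
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(data)            # one BOM is written before the data

workdir = tempfile.mkdtemp()
payload = "English,Chinese\r\nAmerican,美国人\r\n".encode("utf-8")
results = {}
for name, raw in [("no_bom.csv", payload), ("has_bom.csv", codecs.BOM_UTF8 + payload)]:
    path = os.path.join(workdir, name)
    with open(path, "wb") as f:
        f.write(raw)
    rewrite_with_bom(path)
    rewrite_with_bom(path)       # running it twice must not stack BOMs
    with open(path, "rb") as f:
        results[name] = f.read()

print(results["no_bom.csv"] == results["has_bom.csv"])
```

Passing `newline=""` keeps the original line endings untouched, so the only bytes that change are the three BOM bytes at the start.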
Thank You!
I have found a way to automatically handle the missing BOM utf-8 signature.
Besides the missing BOM signature, there is another problem: duplicate BOM signatures can be mixed into the file data. Excel does not show them, so the values look identical but give wrong results when compared or used in calculations, e.g.:
data -> Excel
Chinese -> Chinese
12 -> 12
If you compare them, Chinese followed by a BOM will obviously not be equal to Chinese.
Python code to solve the problem:
import codecs
bom_utf8 = codecs.BOM_UTF8
def fix_duplicate_bom_utf8(file, bom=bom_utf8):
    with open(file, 'rb') as f:
        data_f = f.read()
    data_finish = bom + data_f.replace(bom, b'')
    with open(file, 'wb') as f:
        f.write(data_finish)
    return
# Use:
file_csv = r"D:\data\d20200114.csv" # American, 美国人
fix_duplicate_bom_utf8(file_csv)
# file_csv -> American, 美国人

Import Data from scraping into CSV

I'm using pycharm and Python 3.7.
I would like to write data to a CSV file, but my code writes just the first line of my data to the file. Does anyone know why?
This is my code:
from pytrends.request import TrendReq
import csv
pytrend = TrendReq()
pytrend.build_payload(kw_list=['auto model A',
'auto model C'])
# Interest Over Time
interest_over_time_df = pytrend.interest_over_time()
print(interest_over_time_df.head(100))
writer=csv.writer(open("C:\\Users\\
Desktop\\Data\\c.csv", 'w', encoding='utf-8'))
writer.writerow(interest_over_time_df)
Try using pandas:
import pandas as pd
interest_over_time_df.to_csv("file.csv")
I once encountered the same problem and solved it as below:
with open("file.csv", "rb") as fh:
Details:
r = read mode
b = the mode specifier in open() that says the file shall be treated as binary, so the contents remain bytes. No decoding is attempted this way, which is also why open() accepts no encoding argument in binary mode.
As we know, Python tries to convert a byte array (bytes that it assumes to be a UTF-8-encoded string) to a Unicode string (str). This is decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in UTF-8-encoded strings (namely this 0xff at position 0).
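You could try something like:
import csv
# output_csv_path is a placeholder for the path to your output CSV
with open(output_csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    # Iterating over a DataFrame yields only its column names,
    # so write the header once and then each row
    writer.writerow([interest_over_time_df.index.name, *interest_over_time_df.columns])
    for line in interest_over_time_df.itertuples():
        writer.writerow(line)
Read more here: https://www.pythonforbeginners.com/files/with-statement-in-python
You need to loop over the data and write it line by line.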
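The one-line output can be reproduced without pytrends. `writerow` writes exactly one row from whatever iterable it is given, and iterating over a pandas DataFrame yields its column labels, so passing the whole frame produces only a header line. A minimal sketch with the standard library, using invented rows in place of the real trend data:

```python
import csv
import io

# Invented stand-in for the scraped data: a header plus two data rows.
header = ["date", "auto model A", "auto model C"]
rows = [
    ["2020-01-05", 42, 17],
    ["2020-01-12", 40, 19],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)   # one call, one line
writer.writerows(rows)    # one line per inner list

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of a `StringIO`, opening it with `newline=""` avoids blank lines between rows on Windows.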

Decoding UTF8 literals in a CSV file

Question:
Does anyone know how I could transform this b"it\\xe2\\x80\\x99s time to eat" into this: it's time to eat?
More details & my code:
Hello everyone,
I'm currently working with a CSV file which is full of rows with UTF-8 literals in them, for example:
b"it\xe2\x80\x99s time to eat"
The end goal is to get something like this:
it's time to eat
To achieve this I have tried using the following code:
import pandas as pd
file_open = pd.read_csv("/Users/Downloads/tweets.csv")
file_open["text"]=file_open["text"].str.replace("b\'", "")
file_open["text"]=file_open["text"].str.encode('ascii').astype(str)
file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]
print(file_open["text"])
After running the code the row that I took as an example is printed out as:
it\xe2\x80\x99s time to eat
I have tried solving this issue using the following code to open the CSV file:
file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")
which printed out the example row in the following manner:
it\xe2\x80\x99s time to eat
and I have also tried decoding the rows using this:
file_open["text"]=file_open["text"].str.decode('utf-8')
Which gave me the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Thank you very much in advance for your help.
b"it\\xe2\\x80\\x99s time to eat" sounds like your file contains an escaped encoding.
In general, you can convert this to a proper Python3 string with something like:
x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x) # it’s time to eat
(Use of .encode('latin1') explained here)
So, if after you use pd.read_csv(..., encoding="utf8") you still have escaped strings, you can do something like:
pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
# itâs time to eat
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val) # it’s time to eat
But I think it's probably better to do this to the whole file instead of to each value individually, for example with StringIO (if the file isn't too big):
from io import StringIO
import pandas as pd
# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0) # Reset file pointer to the beginning
# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")
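A different route, if each cell holds the text of a Python bytes literal (the `b"..."` wrapper included), is to parse the literal and then decode the resulting bytes. This swaps in `ast.literal_eval` from the standard library, which is not part of the answer above; the helper name `decode_bytes_literal` is made up for this sketch.

```python
import ast

def decode_bytes_literal(cell):
    """Parse the text of a Python bytes literal and decode it as UTF-8."""
    value = ast.literal_eval(cell)      # evaluates literals only, never code
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)

# The cell as it appears in the CSV: the backslashes are literal characters.
cell = 'b"it\\xe2\\x80\\x99s time to eat"'
decoded = decode_bytes_literal(cell)
print(decoded)
```

Assuming every value in the column is a well-formed literal, the helper could be applied with something like `file_open["text"].map(decode_bytes_literal)`; a malformed cell would raise an exception, so real data may need a fallback.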

How to resolve an encoding issue?

I need to read the content of a csv file using Python. However when I run this code:
with(open(self.path, 'r')) as csv_file:
    csv_reader = csv.reader(csv_file, dialect=csv.excel, delimiter=';')
    self.data = [[cell for cell in row] for row in csv_reader]
I get this error:
File "C:\Python36\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1137: character maps to <undefined>
My understanding is that this file was not encoded in cp-1252, and that I need to find out what encoding was used. I tried a bunch of things, but nothing worked for now.
About the file:
It is sent by an external company, I can't have more information about it.
It comes with other similar files, with which I don't have any issue when I run the same code
It has an .xls extension, but is more a csv file delimited with semicolons
When I open it with Excel it opens in Compatibility mode. But I don't see any sort of encoding issue: everything displays right.
What I already tried:
Saving it under a different file format to get rid of the compatibility mode
Adding an encoding in the first line of my code: (I tried more or less randomly some encodings that I know of)
with(open(self.path, 'r', encoding = 'utf8')) as csv_file:
Copy-pasting the content of the file into a new file, or deleting the whole content of the file. It still does not work. This one really bugs me because I feel like it means the problem is not in the content of the file but in the file itself.
Searching a lot everywhere how to solve this kind of issue.
I recommend using the pandas library (as well as numpy); it is very handy for data manipulation. This function imports the data from an xlsx or csv file.
/!\ change datapath according to your needs /!\
import os
import pandas as pd
def GetData(directory, dataUse, format):
    dataPath = os.getcwd() + "\\Data\\" + directory + "\\" + dataUse + "Set." + format
    if format == "xlsx":
        # sheet_name replaces the older sheetname keyword in current pandas
        dataSet = pd.read_excel(dataPath, sheet_name='Sheet1')
    elif format == "csv":
        dataSet = pd.read_csv(dataPath)
    else:
        raise ValueError("format must be 'xlsx' or 'csv'")
    return dataSet
I finally found some sort of solution:
1. Open the file with Excel
2. Display the file properly using the "Text to Columns" feature
3. Save the file to csv format
4. Run the code
This does not quite satisfy me, but it works.
I still don't understand what the problem actually is, and why this solved it, so I am interested in any additional information !
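One way to narrow down the encoding is to read the file as raw bytes and try a few candidate codecs in turn. The sketch below is only a heuristic: the candidate list and the helper name `first_working_encoding` are assumptions, and a decode that raises no error does not prove the codec is the right one. `latin-1` accepts every byte, so it goes last as a catch-all.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "cp850", "latin-1")):
    """Return the first candidate codec that decodes raw without an error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# Byte 0x81 is undefined in Python's cp1252 table, as in the traceback above.
sample = b"gr\x81n;1;2\r\n"
guess = first_working_encoding(sample)
print(guess)
print(sample.decode(guess))
```

For a real file, the bytes would come from `open(path, "rb").read()`, and the decoded text should still be checked by eye, since several codecs can accept the same bytes and produce different characters.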
