I'm working on a data analysis project and I want to read data from CSV files using pandas. I read the first CSV file and it was fine, but the second one gave me a UTF-8 encoding error. I exported the file to CSV and encoded it as UTF-8 in the Numbers spreadsheet app. However, the data frame is not in the expected format. Any idea why?
The original CSV file in Numbers:
It looks like your file is semicolon-separated, not comma-separated.
To fix this, pass the sep=';' parameter to the pd.read_csv function:
pd.read_csv("mitb.csv", sep=';')
Try adding the correct delimiter, in this case ";", to read the CSV:
mitb = pd.read_csv('mitb.csv', sep=";")
The file is semicolon-separated, and the decimal separator is a comma, not a dot:
df = pd.read_csv('mitb.csv', sep=';', decimal=',')
And please do not upload images of code/data/errors.
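As a quick check (a sketch only; mitb.csv is the file name from this thread, and encoding='utf-8' matches the re-exported file), you can confirm that the delimiter and decimal settings were picked up by inspecting the parsed frame:

import pandas as pd

# Read the semicolon-separated file with comma decimals.
df = pd.read_csv('mitb.csv', sep=';', decimal=',', encoding='utf-8')

# If the parse worked, the frame has separate columns instead of one wide column,
# and the numeric columns have float/int dtypes rather than object.
print(df.head())
print(df.dtypes)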
I read CSV files which include HTML entities / HTML character encoding, like &amp; for the ampersand symbol (just an example); there are other such characters as well.
In the end, my goal is to read multiple CSV files, combine them into a single CSV, and change the encoding to UTF-8 so that such symbols are gone.
Currently I do it like this:
import pandas as pd

list_ = []
for file in files:  # 'files' is the list of CSV paths to combine
    df = pd.read_csv(file, sep=';', index_col=None, header=0, encoding='utf-8-sig')
    list_.append(df)
df_total = pd.concat(list_)
df_total.to_csv('test.csv', sep=';', encoding='utf-8-sig', index=False)
This is very slow. Worse, it does not seem to change the encoding to UTF-8.
So, a) is there a quick way to get rid of these characters,
and b) is there a better way to concatenate CSV files, maybe with the built-in csv library, AND get rid of the unwanted characters / change their encoding?
Thank you in advance
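One possible approach (a sketch only, not from the original thread): symbols like &amp; are HTML entities, and Python's built-in html.unescape can decode them before writing the combined file. Assuming the input files are semicolon-separated as in the question:

import html
import pandas as pd

frames = []
for file in files:  # 'files' is the list of CSV paths, as in the question
    df = pd.read_csv(file, sep=';', header=0, encoding='utf-8-sig')
    # Decode HTML entities such as &amp; in every string column.
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].map(lambda v: html.unescape(v) if isinstance(v, str) else v)
    frames.append(df)

pd.concat(frames, ignore_index=True).to_csv('test.csv', sep=';', encoding='utf-8', index=False)

Note that html.unescape only rewrites the entity text inside the values; the encoding of the output file itself is whatever you pass to to_csv.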
I have tried writing the header and footer sequences with the Python ExcelWriter and then converting the result to CSV, but it does not work. Can anyone suggest a piece of Python code?
Open your output file and simply write to it, then let pandas' to_csv write to the open file object:
with open("myoutput.csv", "w") as file:
# output your first line
print("1890000123", file=file)
# continue to add the csv
df.to_csv(file, ... other options here)
print("178AD...", file=file)
When I read and print the CSV file I downloaded, I get the following result.
As you can see, the result is printed in a weird format.
If I want to print a specific column, here is the error message I get.
I believe the format of the winedata.csv file is wrong, because my code works for other CSV files. How do I convert my CSV file to the right format?
Your winedata.csv file is separated by semi-colons, rather than commas. Therefore, you need to provide the sep option to pd.read_csv as follows:
wine_data = pd.read_csv("winedata.csv", sep=";")
You will then be able to access your pH column as:
wine_data["pH"]
I'm trying to read a CSV file in Python, but the first element in the first row is read with a strange character in front of it (something like ï»¿0), even though that character isn't in the file; it's just a simple 0. Here is the code I used:
import csv

matriceDist = []
file = csv.reader(open("distanceComm.csv", "r"), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)
I had this same issue. Save your Excel file as CSV (MS-DOS) instead of UTF-8 and those odd characters should be gone.
Specifying an encoding that handles the byte order mark when opening the file, as follows, solved my issue:
open('inputfilename.csv', 'r', encoding='utf-8-sig')
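Applied to the snippet from the question (a sketch; the file name and delimiter are taken from there), utf-8-sig removes the BOM so the first cell comes back as a plain 0:

import csv

matriceDist = []
# utf-8-sig strips the leading byte order mark, unlike plain utf-8.
with open('distanceComm.csv', 'r', encoding='utf-8-sig', newline='') as f:
    for row in csv.reader(f, delimiter=';'):
        matriceDist.append(row)
print(matriceDist)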
Just using pandas together with an encoding (utf-8, for example) is going to be easier:
import pandas as pd
df = pd.read_csv('distanceComm.csv', header=None, encoding = 'utf8', delimiter=';')
print(df)
I don't know what your input file is. But since it has a Byte Order Mark for UTF-8, you can use something like this:
import csv
import codecs

matriceDist = []
file = csv.reader(codecs.open('distanceComm.csv', encoding='utf-8-sig'), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)
I'm trying to read a large and complex CSV file with pandas.read_csv.
The exact command is
pd.read_csv(filename, quotechar='"', low_memory=True, dtype=data_types, usecols=columns, true_values=['T'], false_values=['F'])
I am pretty sure that the data types are correct. I can read the first 16 million lines (setting nrows=16000000) without problems, but somewhere after this I get the following error:
ValueError: could not convert string to float: '1,123'
It seems that, for some reason, pandas thinks two columns are one.
What could be the problem? How can I fix it?
I found the mistake. The problem was a thousands separator.
When writing the CSV file, most numbers were below a thousand and were written to the CSV file correctly. However, this one value was greater than a thousand and was written as "1,123", which pandas did not recognize as a number but as a string.
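Under that assumption, a sketch of the fix (the filename, data_types, and columns variables are the ones from the question): pd.read_csv accepts a thousands argument, so quoted values like "1,123" are parsed as numbers rather than strings.

import pandas as pd

# thousands=',' tells pandas to parse "1,123" as the number 1123
# instead of failing to convert the string to float.
df = pd.read_csv(filename, quotechar='"', low_memory=True, dtype=data_types,
                 usecols=columns, true_values=['T'], false_values=['F'],
                 thousands=',')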