I'm trying to loop through a folder of CSVs, read each into a dataframe, and convert certain columns to integers before passing them to a Django model. Here is my code:
import glob
import pandas as pd
path = 'DIV1FCS_2017/*/*'
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    df['Number'].apply(pd.to_numeric)
I am receiving the following error: ValueError: Unable to parse string
Does anybody know if I can convert a column of strings into integers using pd.to_numeric from within a loop? Outside of the loop it seems to work properly.
I think you probably have some non-numeric data stored in your dataframe, and that's what's causing the error.
You can examine your data and make sure everything's fine. In the meantime, you can also call pd.to_numeric(df['Number'], errors="ignore") to skip the error for now (errors="coerce" would instead turn unparseable values into NaN, which makes the bad rows easy to find).
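A minimal sketch of the coerce approach, using made-up data with one bad value in a hypothetical "Number" column:

```python
import pandas as pd

# Hypothetical data: the "Number" column contains a stray non-numeric string.
df = pd.DataFrame({"Number": ["1", "2", "abc", "4"]})

# errors="coerce" turns unparseable values into NaN instead of raising,
# and the result must be assigned back for the conversion to stick.
df["Number"] = pd.to_numeric(df["Number"], errors="coerce")

# The rows that failed to parse can now be located via the NaNs.
bad_rows = df[df["Number"].isna()]
```

Once you know which rows are bad, you can clean them up or drop them before looping over the whole folder.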
I have a comma-separated string. When I export it to a CSV file and then load the resulting CSV in pandas, I get my desired result. But when I try to load the string directly into a pandas dataframe, I get the error Filename too big. Please help me solve it.
I found the error. It was actually showing that behaviour for large strings.
I used
import io
df = pd.read_csv(io.StringIO(str), sep=',', engine='python')
It solved the issue. (str here is the name of the string variable.)
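A self-contained sketch of the same idea, using a hypothetical csv_text variable instead of str (which shadows the Python built-in):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV content; in the question this is a long
# comma-separated string rather than a filename.
csv_text = "a,b,c\n1,2,3\n4,5,6"

# Wrapping the string in StringIO makes read_csv treat it as a file-like
# object, so pandas never tries to interpret the string as a filename.
df = pd.read_csv(io.StringIO(csv_text), sep=",", engine="python")
```

Without StringIO, read_csv treats its first argument as a path, which is why a very long string fails with a filename-related error.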
I have been trying to rename a column in a CSV file I have been working on in Google Colab. But the same line of code works for one column name and not for the other.
import pandas as pd
import numpy as np
data = pd.read_csv("Daily Bike Sharing.csv",
                   index_col="dteday",
                   parse_dates=True)
dataset = data.loc[:, ["cnt", "holiday", "workingday", "weathersit",
                       "temp", "atemp", "hum", "windspeed"]]
dataset = dataset.rename(columns={'cnt': 'y'})
dataset = dataset.rename(columns={"dteday": 'ds'})
dataset.head(1)
The image below is the dataframe called data.
The image below is dataset.
This image is the final output I get when I try to rename the dataframe.
The column name "dteday" is not getting renamed, but "cnt" is getting replaced by "y" with the same code. Can someone help me out? I have been racking my brain on this for some time now.
That's because you're setting dteday as your index upon reading in the csv, whereas cnt is simply a column, and rename(columns=...) only touches columns. Avoid the index_col argument in read_csv and instead perform dataset = dataset.set_index('ds') after renaming.
An alternative in which only your penultimate line (trying to rename the index) would need to be changed:
dataset.index.names = ['ds']
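A small sketch of both points, using a made-up stand-in for the bike-sharing data:

```python
import pandas as pd

# Hypothetical stand-in for the bike-sharing data: "dteday" is the index,
# "cnt" is an ordinary column.
data = pd.DataFrame(
    {"cnt": [16, 40]},
    index=pd.Index(["2011-01-01", "2011-01-02"], name="dteday"),
)

# rename(columns=...) only touches columns, so "cnt" changes but the
# index name "dteday" does not.
dataset = data.rename(columns={"cnt": "y"})

# Renaming the index requires a separate call.
dataset.index.names = ["ds"]
```

This makes the asymmetry visible: cnt renames because it is a column, dteday doesn't because it is the index.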
You can remove the index_col in the read statement, include 'dteday' in your dataset, and then change the column name. You can make the column the index later using df.set_index.
I am reading a CSV file using this piece of code:
import pandas as pd
import os
# Open document (.csv format)
path = os.getcwd()
database = pd.read_csv(path + "/mydocument.csv", delimiter=';', header=0)

# Date in the required format
size = len(database.Date)
I get the following error: 'DataFrame' object has no attribute 'Date'
As you can see in the image, the first column of mydocument.csv is called Date. This is weird because I used this same procedure to work with this document before, and it worked.
Try using delimiter=',' . It must be a comma.
(Can't post comments yet, but think I can help).
The explanation for your problem is, simply, that you don't have a column called Date. Has pandas interpreted Date as an index column?
If your Date is spelled correctly (no trailing whitespace or anything else that might confuse things), then try this:
pd.read_csv(path+"/mydocument.csv",delimiter=';',header=0, index_col=False)
or if that fails this:
pd.read_csv(path+"/mydocument.csv",delimiter=';',header=0, index_col=None)
In my experience, I've had pandas unexpectedly infer an index_col.
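A quick way to see what pandas actually parsed is to print database.columns. The sketch below assumes (hypothetically) that the header in mydocument.csv has a trailing space, which would break attribute access:

```python
import io
import pandas as pd

# Hypothetical stand-in for mydocument.csv; note the trailing space
# after "Date" in the header, which makes database.Date fail.
csv_text = "Date ;Value\n2020-01-01;1\n2020-01-02;2"

database = pd.read_csv(io.StringIO(csv_text), delimiter=";", header=0)

# Inspecting the columns reveals the real names pandas saw;
# stripping whitespace restores attribute-style access.
database.columns = database.columns.str.strip()
size = len(database.Date)
```

If the column names look clean and it still fails, the index_col suggestions above are the next thing to check.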
So I'm trying to create a python script that allows me to perform SQL manipulations on a dataframe (masterfile) I created using pandas. The dataframe draws its contents from the csv files found in a specific folder.
I was able to successfully create everything else, but I am having trouble with the SQL manipulation part. I am trying to use the dataframe as the "database" from which I will pull data using my SQL query, but I am getting an "AttributeError: 'DataFrame' object has no attribute 'cursor'" error.
I'm not really seeing a lot of examples for pandas.read_sql_query(), so I am having a difficult time understanding how I would use my dataframe with it.
import os
import glob
import pandas
os.chdir("SOMECENSOREDDIRECTORY")
all_csv = [i for i in glob.glob('*.{}'.format('csv')) if i != 'Masterfile.csv']
edited_files = []
for i in all_csv:
    df = pandas.read_csv(i)
    df["file_name"] = i.split('.')[0]
    edited_files.append(df)
masterfile = pandas.concat(edited_files, sort=False)
print("Data fields are as shown below:")
print(masterfile.iloc[0])
sql_query = "SELECT Country, file_name as Year, Happiness_Score FROM masterfile WHERE Country = 'Switzerland'"
output = pandas.read_sql_query(sql_query, masterfile)
output.to_csv('data_pull')
I know this part is wrong, but this is the concept I am trying to get to work but don't know how:
output = pandas.read_sql_query(sql_query, masterfile)
I appreciate any help I can get! I am a self-taught python programmer by the way, so I might be missing some general rule or something. Thanks!
Edit: replaced "slice" with "manipulate" because I realized I didn't want to just slice it. Also fixed some alignment issues on my code block.
It is possible to slice a dataframe created through pandas without SQL: you can use the loc function of pandas to slice the dataframe.
df.loc[rows, columns]
I want to read this csv file in pandas as a DataFrame. Then I would like to split the resulting strings from colons.
I import using:
df_r = pd.read_csv("report.csv", sep=";|,", engine="python")
Then split using:
for c in df_r:
    if df_r[c].dtype == "object":
        df_r[c] = df_r[c].str.split(':')
But I get the following error:
ValueError: could not convert string to float: '"\x001\x000\x00'
Any idea what I am doing wrong?
Edit:
The error actually shows up when I try to convert one of the strings to a float:
print(float(df_r["Laptime"].iloc[0][2]))
I ran your code and everything works fine. You can try catching the error, printing the row that shows that strange behaviour, and inspecting it manually.
Is that the entire dump you are using? I saw that you are assigning the csv to the variable a but using df_r afterwards, so I think you are doing something else in between.
If the csv file is complete, be aware that the last line is empty and creates a row full of NaNs. You want to read the csv with skipfooter=1.
a = pd.read_csv('report.csv', sep=";|,", engine="python", skipfooter=1)
Edit:
You can convert it to float like this:
print(float(df_r["Laptime"].iloc[0][2].replace('\x00', '')))
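A self-contained sketch of that cleanup, using a hypothetical lap-time value shaped like the one in the error (interleaved NUL characters like '\x00' often indicate a UTF-16-encoded file read as 8-bit text):

```python
import pandas as pd

# Hypothetical stand-in for the report data: the seconds field contains
# stray NUL characters interleaved with the digits, as in the error message.
df_r = pd.DataFrame({"Laptime": ["1:23:1\x000\x00.\x005\x00"]})

# Split each lap time on colons, as in the question.
df_r["Laptime"] = df_r["Laptime"].str.split(':')

# Stripping the NUL characters before converting makes float() succeed.
value = float(df_r["Laptime"].iloc[0][2].replace('\x00', ''))
```

If the whole file shows this pattern, re-reading it with an explicit encoding (for example encoding="utf-16") may fix it at the source rather than per value.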