Pandas apply function issue - python

I have a dataframe (data) of numeric variables and I want to analyse the distribution of each column by using the Shapiro test from scipy.
from scipy import stats
data.apply(stats.shapiro, axis=0)
But I keep getting the following error message:
ValueError: ('could not convert string to float: M', u'occurred at index 0')
I've checked the documentation and it says the first argument of the apply function should be a function, which stats.shapiro is (as far as I'm aware).
What am I doing wrong, and how can I fix it?

Found the problem. There was a column of type object, which resulted in the error message above. Applying the function only to the numeric columns solved the issue.
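For example, a minimal sketch of that fix (assuming data is the DataFrame from the question):
import numpy as np
from scipy import stats
# keep only the numeric columns before running the test
numeric_data = data.select_dtypes(include=[np.number])
results = numeric_data.apply(stats.shapiro, axis=0)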

pandas dataframe to_csv() with get_handle() error [duplicate]

I had a big table which I sliced into many smaller tables based on their dates:
dfs = {}
for fecha in fechas:
    dfs[fecha] = df[df['date'] == fecha].set_index('Hour')
# now I can access the tables like this:
dfs['2019-06-23'].head()
I have made some modifications to the specific table dfs['2019-06-23'] and now I would like to save it on my computer. I have tried to do this in two ways:
#first try:
dfs['2019-06-23'].to_csv('specific/path/file.csv')
#second try:
test=dfs['2019-06-23']
test.to_csv('test.csv')
Both of them raised this error:
TypeError: get_handle() got an unexpected keyword argument 'errors'
I don't know why I get this error and haven't found any reason for it. I have saved many files this way and never ran into this before.
My goal: to save this dataframe as a CSV after my modifications.
If you are getting this error, there are two things to check (both are sketched below):
Whether the object is actually a Series rather than a DataFrame - see (Pandas: to_csv() got an unexpected keyword argument)
Your numpy version. For me, updating to numpy==1.20.1 with pandas==1.2.2 fixed the problem. If you are using Jupyter notebooks, remember to restart the kernel afterwards.
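A quick way to run both checks (a sketch, reusing the dfs dict from the question):
import numpy as np
import pandas as pd
# 1) make sure the object really is a DataFrame, not a Series
print(type(dfs['2019-06-23']))
# 2) check for a pandas/numpy version mismatch
print(pd.__version__, np.__version__)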
In the end, what worked was to wrap the table in pd.DataFrame and then export it as follows:
to_export=pd.DataFrame(dfs['2019-06-23'])
to_export.to_csv('my_table.csv')
That surprised me, because when I checked the type of the table at the time of the error, it was a DataFrame. However, this way it works.

Why does dask throw an error when setting a String column as an index?

I'm reading a large CSV with dask, setting the dtype of one column to string, and then setting that column as the index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in this comment on a dask bug report for an unrelated issue: https://github.com/dask/dask/issues/7206#issuecomment-797221227
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
Currently, changing the column type from string to object works around the issue, though it's unclear whether this is a bug or expected behavior:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
dataframe.set_index("colName")
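One side note: set_index is not in-place in dask, so the result should be assigned back, e.g.:
dataframe = dataframe.set_index("colName")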

How to convert object to float... ValueError: could not convert string to float: ''

I need help with this code. I've tried to change an object column of my dataframe to float using the code below, and it shows ValueError: could not convert string to float: '2685,040344'. That is the first value of df1['Puissance']. I have other object columns and the program shows the same problem.
I then want to create a new column like this:
df1['Torque'] = float(df1['Puissance']) // float(df1['Vitesse'])
but it still shows the problem. I can't change the object columns to float.
Is this a problem with pd.read_csv?
How can I resolve it?
I've already tried df1['Puissance'] = pd.to_numeric(df1['Puissance'], errors='ignore', downcast='float') and it shows the same problem.
import pandas as pd
import numpy as np
df1=pd.read_csv("C:\\Users\\FGFGJ\\Documents\\Écorçage 2018\\Volet_1\\past5\\2A_.txt", sep='\t')
df1['Puissance']=df1.Puissance.astype(float, errors='raise')
#df1.dtypes
You have several options when using pandas.read_csv. The simplest, in your case, would probably be to pass decimal="," as a parameter. That way pandas will recognize your Puissance column as numeric.
EDIT
For more information on the possible parameters, see the pandas.read_csv documentation. If you need to convert multiple columns to different dtypes you can use dtype={<column>: <dtype>}.
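For instance, a sketch of the decimal="," approach (the file path is shortened here, and Puissance and Vitesse are assumed to be comma-decimal numeric columns):
import pandas as pd
# treat commas as decimal separators while reading the tab-separated file
df1 = pd.read_csv("2A_.txt", sep='\t', decimal=",")
# both columns are now floats, so the division works element-wise
# (use // instead of / if floor division was actually intended)
df1['Torque'] = df1['Puissance'] / df1['Vitesse']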

Pandas Get_Value throwing error: '[xxxx]' is an invalid key

I am trying to use pandas DataFrame.Get_Value(Index, ColumnName) to get the value of a column, and it keeps throwing the following error:
"'[10004]' is an invalid key", where 10004 is the index value.
This is how the DataFrame looks: (screenshot not reproduced here)
I have successfully used get_value before. I don't know what's wrong with this dataframe.
First, pandas.DataFrame.get_value is deprecated (and the method is spelled get_value, not Get_Value). It's better to use a non-deprecated accessor such as .loc or .at instead:
df.loc[10004, 'Column_Name']
# Or:
df.at[10004, 'Column_Name']
Your issue might be that you have 10004 stored as a string instead of an integer. Try surrounding the index with quotes (df.loc['10004', 'Column_Name']). You can check this easily by inspecting df.index.dtype and seeing whether it returns dtype('O').
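A short sketch of that check (Column_Name is the placeholder from above):
print(df.index.dtype)  # dtype('O') suggests string labels
value = df.loc['10004', 'Column_Name']  # quote the key if the labels are strings
# or convert the index to integers once and keep using integer keys
df.index = df.index.astype(int)
value = df.at[10004, 'Column_Name']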

Error passing Pandas Dataframe to Scikit Learn

I get the following error when passing a pandas dataframe to scikit-learn algorithms:
invalid literal for float(): 2.,3
How do I find the row or column with the problem in order to fix or eliminate it? Is there something like df[df.isnull().any(axis=1)] for a specific value (in my case I guess 2.,3)?
If you know which column it is, you can use
df[df.your_column == 2.,3]
then you'll get all rows where the specified column has the value 2.,3.
Since 2.,3 is not a valid numeric literal, you will most likely have to compare against the string instead:
df[df.your_column == '2.,3']
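If you don't know which column holds the bad value, one option (a sketch using pd.to_numeric) is to coerce every object column and report the values that fail to parse:
import pandas as pd
for col in df.select_dtypes(include='object'):
    converted = pd.to_numeric(df[col], errors='coerce')
    bad = df.loc[converted.isna() & df[col].notna(), col]
    if not bad.empty:
        print(col, bad.to_dict())  # offending rows and values per column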
