Error passing Pandas Dataframe to Scikit Learn

I get the following error when passing a pandas dataframe to scikit-learn algorithms:
invalid literal for float(): 2.,3
How do I find the row or column with the problem in order to fix or eliminate it? Is there something like df[df.isnull().any(axis=1)] for a specific value (in my case I guess 2.,3)?

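If you know which column it is, you can use
df[df.your_column == '2.,3']
to get all rows where the specified column has the value 2.,3.
The quotes are needed: the value is stored as a string (which is why float() cannot parse it), and Python reads the unquoted form
df[df.your_column == 2.,3]
as a tuple (a comparison with 2. followed by 3), not as a comparison with a single value.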
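If the column is not known in advance, one possible approach is to try converting every column to numbers and look at the cells that fail. The sketch below assumes a small made-up dataframe (the column names a and b and all values are hypothetical) and uses pd.to_numeric with errors="coerce", which turns unparseable values into NaN.

```python
import pandas as pd

# Hypothetical data: one cell holds the malformed string "2.,3".
df = pd.DataFrame({
    "a": ["1.0", "2.,3", "3.5"],
    "b": ["4", "5", "6"],
})

# Coerce every column to numbers; values that cannot be parsed become NaN.
converted = df.apply(pd.to_numeric, errors="coerce")

# A cell is "bad" if it was present originally but failed to convert.
bad_mask = converted.isna() & df.notna()

# Rows and columns that contain at least one bad cell.
bad_rows = df[bad_mask.any(axis=1)]
bad_columns = bad_mask.columns[bad_mask.any()].tolist()

print(bad_rows)
print(bad_columns)
```

This mirrors the df[df.isnull().any(axis=1)] idiom from the question, except that the mask marks values that are present but not numeric.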

Related

Is there a function in python for renaming values in one column based on the values in another?

Thanks in advance for the advice. I have a pandas dataframe. What I would like to do is label a column (adata.obs['new_annotation']) as 'pDC' where another column has the value 10 (adata.obs['leiden_scVI'] == 10).
It seems to me that loc is probably the best way to go about this. Therefore I have tried:
adata.obs.loc[adata.obs['leiden_scVI']== '10', 'new_annotation'] = 'pDC'
But this generates a value error:
ValueError: Cannot setitem on a Categorical with a new category, set the categories first.
I've tried appending .astype('category') but this does not seem to solve the problem.
Is there another way of overcoming this please?
Many thanks.
ADDENDUM
Now solved: I just needed to convert the columns with .astype(str).
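Converting to str works because plain string columns accept any new value. If the column should stay categorical, a possible alternative is to register the new label as a category before assigning it. The sketch below uses a plain pandas dataframe named obs as a hypothetical stand-in for adata.obs, with made-up labels.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs: both columns are categorical.
obs = pd.DataFrame({
    "leiden_scVI": pd.Categorical(["9", "10", "11", "10"]),
    "new_annotation": pd.Categorical(["B cell", "T cell", "B cell", "T cell"]),
})

# Register the new label first, so the assignment no longer raises ValueError.
obs["new_annotation"] = obs["new_annotation"].cat.add_categories(["pDC"])
obs.loc[obs["leiden_scVI"] == "10", "new_annotation"] = "pDC"

# Optionally drop labels that are no longer used by any row.
obs["new_annotation"] = obs["new_annotation"].cat.remove_unused_categories()

print(obs)
```

Note that the comparison uses the string '10', as in the question, because the cluster labels are assumed to be stored as strings.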

Sort dataframe by absolute value without changing value or adding column

I have a dataframe that's the result of importing a csv and then performing a few operations and adding a column that's the difference between two other columns (column 10 - column 9 let's say). I am trying to sort the dataframe by the absolute value of that difference column, without changing its value or adding another column.
I have seen this syntax over and over all over the internet, with indications that it was a success (accepted answers, comments saying "thanks, that worked", etc.). However, I get the error you see below:
df.sort_values(by='Difference', ascending=False, inplace=True, key=abs)
Error:
TypeError: sort_values() got an unexpected keyword argument 'key'
I'm not sure why the syntax that I see working for other people is not working for me. I have a lot more going on with the code and other dataframes, so I don't think it's a pandas import problem.
I have moved on and just made a new column that is the absolute value of the difference column and sorted by that, and exclude that column from my export to worksheet, but I really would like to know how to get it to work the other way. Any help is appreciated.
I'm using Python 3

df.loc[(df.c - df.b).sort_values(ascending = False).index]
This sorts by the difference between "c" and "b" without creating a new column.
I hope this is what you were looking for.

key is an optional argument of sort_values.
It receives the column being sorted as a Series, not the whole dataframe, so maybe that is where it went wrong for you.
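For context, the key argument of sort_values was added in pandas 1.1.0, so an older installation raises exactly this TypeError. The sketch below shows both routes on a made-up Difference column: the key form for pandas 1.1 or later, and a fallback that sorts the absolute values separately and reorders the rows by the resulting index. Neither adds a column or changes the stored values.

```python
import pandas as pd

# Hypothetical data with a signed "Difference" column.
df = pd.DataFrame({"Difference": [-5, 2, -1, 4]})

# pandas >= 1.1: key receives the column as a Series and must return a Series.
sorted_with_key = df.sort_values(
    by="Difference", key=lambda s: s.abs(), ascending=False
)

# Older pandas: sort the absolute values, then reorder rows by that index.
order = df["Difference"].abs().sort_values(ascending=False).index
sorted_without_key = df.loc[order]

print(sorted_with_key)
print(sorted_without_key)
```

Checking pd.__version__ should show which of the two forms applies to a given installation.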

replace a string in entire dataframe from excel with value

I have this kind of data from excel
dminerals=pd.read_excel(datafile)
print(dminerals.head(5))
Then I replace the 'Tr' and NaN values with a for loop:
for key, value in dminerals.iteritems():
    dminerals[key] = dminerals[key].replace(to_replace='Tr', value=int(1))
    dminerals[key] = dminerals[key].replace(to_replace=np.nan, value=int(0))
Then I print it again. The replacement seems to work, but when I print the column's dtype it shows the object data type.
print(dminerals.head(5))
print(dminerals['C'].dtypes)
I tried using .astype to convert one of the columns, ['C'], to integer, but the result is a ValueError:
dminerals['C'].astype(int)
ValueError: invalid literal for int() with base 10: 'tr'
I thought I had already changed the 'Tr' values in the dataframe into integers. Is there anything I missed in the process above? Please help, thank you in advance!

You are replacing Tr with 1, but there is a tr that is not being replaced (this is what your ValueError is saying). Remember that Python is case sensitive. Also, looping over the columns is inefficient; you might want to try the following line of code instead:
dminerals = dminerals.replace({'Tr':1,'tr':1}).fillna(0)
I'm using fillna(), which is a better way to fill the null values with a specified value (0 in this case) than using replace.
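If other spellings such as 'TR' or ' tr ' might also occur, a case-insensitive variant may be more robust than listing each form. The sketch below is one possible way to do it, assuming made-up data in place of the Excel file; the helper trace_to_one is hypothetical, and the mapping of trace values to 1 and missing values to 0 follows the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel data: "Tr"/"tr" mark trace amounts.
dminerals = pd.DataFrame({
    "C": [12, "Tr", "tr", np.nan],
    "Si": ["TR", 3, np.nan, 7],
})

def trace_to_one(value):
    # Treat any capitalisation of "tr" (with stray spaces) as a trace amount.
    if isinstance(value, str) and value.strip().lower() == "tr":
        return 1
    return value

# Map every cell, fill the remaining missing values, then convert to integers.
cleaned = (
    dminerals.apply(lambda column: column.map(trace_to_one))
    .fillna(0)
    .astype(int)
)

print(cleaned)
print(cleaned.dtypes)
```

The final astype(int) only succeeds once no strings are left, so it doubles as a check that every non-numeric marker was handled.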

Dropping a problematic column from a dask dataframe

I have a dask dataframe with one problematic column that (I believe) is the source of a particular error that is thrown every time I try to do anything with the dataframe (be it head, or to_csv, or even when I try to subset using a different column). The error is probably owing to a data type mismatch and shows up like this:
ValueError: invalid literal for int() with base 10: 'FIPS'
So I decided to drop that column ('FIPS') using
df = df.drop('FIPS', axis=1)
Now when I do df.columns, I don't see 'FIPS' any longer which I take to mean that it has indeed been dropped. But when I try to write a different column to a file
df.column_a.to_csv('example.csv')
I keep getting the same error
ValueError: invalid literal for int() with base 10: 'FIPS'
I assume it has something to do with dask's lazy evaluation, which delays the drop, but any work-around would be very helpful.
Basically, I just need to extract a single column (column_a) from df.

Try converting to a pandas dataframe after the drop
df = df.compute()
and only then write to csv.

What's the easiest way to replace categorical columns of data with codes in Pandas?

I have a table of data in .dta format which I have read into python using Pandas. The data is mostly in the categorical data type and I want to replace the columns with numerical data that can be used with machine learning, such as boolean (1/0) or codes. The trouble is that I can't directly replace the data because it won't let me change the categories, unless I add them.
I have tried using pd.get_dummies(), but it keeps returning an error:
TypeError: 'columns' is an invalid keyword argument for this function
print(pd.get_dummies(feature).head(), columns=['smkevr', 'cignow', 'dnnow',
'dnever', 'complst'])
Is there a simple way to replace this data with numerical codes based on the value (for example 'Not applicable' = 0)?

I do it the following way:
df_dumm = pd.get_dummies(feature).head()
df_dumm.columns = ['smkevr', 'cignow', 'dnnow',
'dnever', 'complst']
print (df_dumm.head())
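The TypeError in the question arises because columns= was passed to print rather than to pd.get_dummies. A minimal sketch of two options follows, assuming a small made-up dataframe in place of the .dta data (only two of the question's column names are used, with hypothetical values): passing columns= to get_dummies for 1/0 indicator columns, or replacing each categorical column with its integer codes via .cat.codes.

```python
import pandas as pd

# Hypothetical stand-in for the .dta data: two categorical columns.
feature = pd.DataFrame({
    "smkevr": pd.Categorical(["Yes", "No", "Not applicable"]),
    "cignow": pd.Categorical(["No", "No", "Yes"]),
})

# columns= belongs to get_dummies, not to print; it selects what to encode.
dummies = pd.get_dummies(feature, columns=["smkevr", "cignow"])

# Alternative: replace each categorical column with its integer codes.
codes = feature.apply(lambda column: column.cat.codes)

print(dummies.head())
print(codes.head())
```

The codes follow the order of each column's categories, so if a particular value such as 'Not applicable' should map to 0, the categories would need to be reordered first, for example with .cat.reorder_categories.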
