I'm having some problems categorizing a column using pandas and OrdinalEncoder.
Basically, I need to convert a column to a categorical type (so that I can use OrdinalEncoder afterwards), but everything I try either doesn't work or returns NaN.
What I tried is the following:
df['Education'] is a column containing the degree obtained by each person in the sample.
from pandas.api.types import CategoricalDtype
ordcat = CategoricalDtype(categories = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad',
'Prof-school', 'Assoc-acdm', 'Assoc-voc', 'Some-college', 'Bachelors', 'Masters',
'Doctorate'], ordered = True)
df['Education'] = df['Education'].astype(ordcat)
print(df['Education'])
The output is the following:
0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ...
32556    NaN
32557    NaN
32558    NaN
32559    NaN
32560    NaN
Name: Education, Length: 32561, dtype: category
Which isn't what I need.
I also tried a few other approaches, but they only gave me errors about Series being mutable, or NaNs again.
It's been like 4 days and I can't figure things out, do you have any idea of what I'm missing?
Thanks in advance for your help.
Edit_0: The dataset I'm using is from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data), which I already stripped of whitespace and in which I replaced '?' with np.nan.
How do you declare df? Maybe you can try creating an empty Series first and naming it Education, then execute df['Education'] = df['Education'].astype(ordcat).
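One more thing worth checking, though it is only a guess based on the dataset mentioned in the edit: astype with a CategoricalDtype turns every value that is not an exact match for one of the declared categories into NaN, and the raw adult.data file pads each value with a leading space (' Bachelors' instead of 'Bachelors'), so any value that wasn't fully stripped will convert to NaN. A minimal sketch of loading the file so the values match (the column names are my own, since the file has no header row):
import pandas as pd
from pandas.api.types import CategoricalDtype

# skipinitialspace=True removes the leading space after each comma,
# and na_values='?' does the ?-to-NaN replacement in one step.
cols = ['Age', 'Workclass', 'Fnlwgt', 'Education', 'Education-num', 'Marital-status',
        'Occupation', 'Relationship', 'Race', 'Sex', 'Capital-gain', 'Capital-loss',
        'Hours-per-week', 'Native-country', 'Income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 header=None, names=cols, skipinitialspace=True, na_values='?')

ordcat = CategoricalDtype(categories=['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th',
                                      '10th', '11th', '12th', 'HS-grad', 'Prof-school',
                                      'Assoc-acdm', 'Assoc-voc', 'Some-college', 'Bachelors',
                                      'Masters', 'Doctorate'], ordered=True)
df['Education'] = df['Education'].astype(ordcat)
print(df['Education'].isna().sum())   # 0 if every value matched a declared category
Before converting, set(df['Education'].unique()) - set(ordcat.categories) will show exactly which values would be turned into NaN.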
I am trying to replace NaN value in my 'price' column of my dataset, I tried using:
avg_price = car.groupby('make')['price'].agg(np.mean) # calculating average value of the price of each car company model
new_price = car['price'].fillna(avg_price, inplace=True)
car['price'] = new_price
The code runs without any error, but on checking I can still see the NaN values in the dataset (a snapshot of the dataset was attached to the original post).
Are you trying to fill the NaN with a grouped (by make) average? Will this work?
car.loc[car.price.isnull(), 'price'] = car.groupby('make').price.transform('mean')
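For comparison, here is a sketch of the same idea written with fillna and transform; the tiny frame is made up just to show the pattern, only the make and price column names come from the question:
import numpy as np
import pandas as pd

# Made-up example data standing in for the real dataset.
car = pd.DataFrame({
    'make': ['audi', 'audi', 'bmw', 'bmw'],
    'price': [10000, np.nan, 20000, np.nan],
})

# transform('mean') returns a Series aligned to car's index, so fillna can use it directly.
# Note that fillna(..., inplace=True) returns None, which is why new_price ended up empty
# in the original code.
car['price'] = car['price'].fillna(car.groupby('make')['price'].transform('mean'))
print(car)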
Good evening everyone!
I have a problem with NaN values in python with pandas.
I am working on a database with information on different countries. I cannot get rid of all of my NaN values altogether or I would lose too much data.
I wish to replace the NaN values based on some condition.
The dataframe I am working on (screenshot omitted).
What I would like is to create a new column that would take the existing values of a column (Here: OECDSTInterbkRate) and replace all its NaN values based on a specific condition.
For example, I want to replace the NaN corresponding to Australia with the moving average of the values I already have for Australia.
Same thing for every other country for which I am missing values (replace the NaN observations in this column for France with the moving average of the values I already have for France, etc.).
What piece of code do you think I could use?
Thank you very much for your help!
Maybe you can try something like this: df.fillna(df.mean(), inplace=True)
Replace df.mean() with your mean values.
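If the goal really is a per-country moving average rather than one global mean, a sketch along these lines may be closer to it; the Country column name and the window size of 3 are assumptions, only OECDSTInterbkRate comes from the question:
import numpy as np
import pandas as pd

# Toy frame standing in for the real one.
df = pd.DataFrame({
    'Country': ['Australia', 'Australia', 'Australia', 'France', 'France', 'France'],
    'OECDSTInterbkRate': [1.5, np.nan, 1.7, 0.8, np.nan, 0.6],
})

# Rolling mean computed within each country, then used only where the value is missing.
rolling_avg = (
    df.groupby('Country')['OECDSTInterbkRate']
      .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
df['OECDSTInterbkRate'] = df['OECDSTInterbkRate'].fillna(rolling_avg)
print(df)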
I have a pandas DataFrame
ID Unique_Countries
0 123 [Japan]
1 124 [nan]
2 125 [US,Brazil]
.
.
.
I got the Unique_Countries column by aggregating over unique countries from each ID group. There were many IDs with only 'NaN' values in the original country column. They are now displayed as what you see in row 1. I would like to filter on these but can't seem to. When I type
df.Unique_Countries[1]
I get
array([nan], dtype=object)
I have tried several methods, including isnull() and isnan(), but they get messed up because the cell value is a numpy array.
If the NaN is not necessarily in the first position of the cell, try using explode and groupby(...).all():
df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
OR
df[df.Unique_Countries.explode().notna().all(level=0)]
Let's try
df.Unique_Countries.str[0].isna() #'nan' is True
df.Unique_Countries.str[0].notna() #'nan' is False
To pick only the rows whose first element is not nan, just use the mask above:
df[df.Unique_Countries.str[0].notna()]
I believe that answers based on the string method contains would fail if a country name contains the substring 'nan'.
In my opinion the solution should be this:
df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
This code drops nan from your dataframe and returns the dataset in the original form.
I am not sure from your question whether you want to drop the NaNs or just find the IDs of the records that have nan in the Unique_Countries column. For the latter, you can use something like this:
long_ss = df.set_index('ID').squeeze().explode()
long_ss[long_ss.isna()]
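To make the explode/notna approach concrete, here is how it behaves on a small frame rebuilt from the three rows shown in the question:
import numpy as np
import pandas as pd

# Rebuild the example: one row's list contains only nan.
df = pd.DataFrame({
    'ID': [123, 124, 125],
    'Unique_Countries': [['Japan'], [np.nan], ['US', 'Brazil']],
})

# explode() gives one row per country, notna() flags real values, and
# groupby(level=0).all() collapses that back to one True/False per original row.
mask = df.Unique_Countries.explode().notna().groupby(level=0).all()
print(df[mask])   # keeps IDs 123 and 125, drops the all-NaN row 124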
First time posting here - have decided to try and learn how to use python whilst on Covid-19 forced holidays.
I'm trying to summarise some data from a pretty simple database and have been using the value_counts function.
Rather than running it on every column individually, I'd like to loop over each one and return a summary table. I can do this using df.apply(pd.value_counts) but can't work out how to pass parameters to value_counts, as I want dropna=False.
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
# create list of winners and runnerup
data = [['john', 'barry'], ['john','barry'], [np.nan,'barry'], ['barry','john'],['john',np.nan],['linda','frank']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['winner', 'runnerup'])
# print dataframe.
df
How I was doing the value counts for each column:
#Who won the most?
df['winner'].value_counts(dropna=False)
Output:
john 3
linda 1
barry 1
NaN 1
Name: winner, dtype: int64
How can I pass dropna=False when using the apply function? I like the table it outputs below but want the NaN to appear in the list.
#value counts table
df.apply(pd.value_counts)
winner runnerup
barry 1.0 3.0
frank NaN 1.0
john 3.0 1.0
linda 1.0 NaN
#value that is missing from list
#NaN 1.0 1.0
Any help would be appreciated!!
You can use df.apply, like this:
df.apply(pd.value_counts, dropna=False)
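For the sample frame above, that should give something like the table below, with the NaN row included; an equivalent spelling that calls the Series method directly is df.apply(lambda s: s.value_counts(dropna=False)).
       winner  runnerup
barry     1.0       3.0
frank     NaN       1.0
john      3.0       1.0
linda     1.0       NaN
NaN       1.0       1.0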
In pandas' apply, if the function takes a single parameter, you simply do:
.apply(func_name)
For DataFrame.apply the parameter passed to the function is the column (a Series); for Series.apply it is the individual cell value.
This works exactly the same way for pandas built-in functions and user-defined functions (UDFs).
For a UDF that takes more than one parameter:
.apply(func_name, args=(arg1, arg2, arg3, ...))
See: this link
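A tiny example of the args form, with a made-up UDF:
import pandas as pd

# A toy function with two extra parameters, passed through apply's args tuple.
def clip_between(value, low, high):
    return min(max(value, low), high)

s = pd.Series([1, 5, 12])
print(s.apply(clip_between, args=(2, 10)))   # 2, 5, 10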
I'm learning how to use the pandas library in python3 and I've run into an issue with dataframe.corr()
Here's an example of my dataset
Date,Gender,Age at Booking,Current Age
2015-12-23,M,21,22
2015-12-23,M,25,25
2015-12-23,M,37,37
2015-12-23,F,39,40
2015-12-23,M,24,24
And here is how I attempt to load it/transform it
crime_data = pd.read_csv(crime_data_s)
print(crime_data.head())
print(crime_data['Date'])
correlated_data = crime_data.corr()
print(correlated_data)
Printing the crime data head shows the 4 columns with their data, and accessing the 'Date' column and printing its values works just as expected. However, when crime_data.corr() is called and I print the result, it has dropped everything except "Age at Booking" and "Current Age", leaving a 2x2 matrix.
Calling the dataframe.info() method, I can see that the Date and Gender columns are labeled as objects rather than numeric data. What can be done to fix this so that I can run a correlation on the data?
data['Gender']=data['Gender'].astype('category').cat.codes
data['Date']=data['Date'].astype('category').cat.codes
data.corr()
Output
Date Gender Age curage
Date NaN NaN NaN NaN
Gender NaN 1.000000 0.162804 -0.703474
Age NaN -0.162804 1.000000 0.814425
curage NaN -0.703474 0.814425 1.000000
That is because .corr() works only with numeric columns. You need to replace the values M and F, for instance:
crime_data['Gender'] = crime_data['Gender'].replace('M',1).replace('F',0)
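A self-contained sketch of one way to get all four columns into corr(); the sample rows are taken from the question, and turning Date into "days since the earliest date" is just one possible numeric encoding:
import io
import pandas as pd

# Rebuild the sample from the question so the snippet runs on its own.
csv = io.StringIO(
    "Date,Gender,Age at Booking,Current Age\n"
    "2015-12-23,M,21,22\n"
    "2015-12-23,M,25,25\n"
    "2015-12-23,M,37,37\n"
    "2015-12-23,F,39,40\n"
    "2015-12-23,M,24,24\n"
)
crime_data = pd.read_csv(csv)

# Encode Gender numerically and turn Date into days since the earliest date.
crime_data['Gender'] = crime_data['Gender'].map({'M': 1, 'F': 0})
crime_data['Date'] = pd.to_datetime(crime_data['Date'])
crime_data['Date'] = (crime_data['Date'] - crime_data['Date'].min()).dt.days

print(crime_data.corr())
In these five sample rows every Date is identical, so its variance is zero and its correlations come out as NaN, which matches the NaN Date row and column in the output above.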