I am trying to replace the NaN values in the 'price' column of my dataset. I tried:
avg_price = car.groupby('make')['price'].agg(np.mean) # calculating average value of the price of each car company model
new_price= car['price'].fillna(avg_price,inplace=True)
car['price']=new_price
The code runs without any error, but when I check, I can still see the NaN values in the dataset. A snapshot of the dataset is attached below:
Are you trying to fill the NaN values with the average price per make? If so, does this work?
car.loc[car.price.isnull(), 'price'] = car.groupby('make').price.transform('mean')
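As far as I can tell, two things go wrong in the question's snippet. First, avg_price is indexed by make, so fillna cannot line it up with the row index of car. Second, fillna(..., inplace=True) returns None in the pandas versions I know, so the later car['price'] = new_price overwrites the column. Below is a minimal sketch of the group-mean fill without inplace. The sample rows are made up for illustration and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's `car` DataFrame.
car = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw", "bmw"],
    "price": [10000.0, np.nan, 20000.0, 30000.0, np.nan],
})

# transform("mean") returns one value per row (the mean of that row's make),
# indexed like `car`, so it aligns with the 'price' column.
make_mean = car.groupby("make")["price"].transform("mean")

# fillna returns a new Series; assign it back rather than using inplace=True.
car["price"] = car["price"].fillna(make_mean)

print(car)
```

Rows that already have a price are left unchanged; only the missing ones take their make's mean.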
I'm trying to do the following:
Step 1. Get the minimum incidence of malaria for each country.
Step 2. If a country has a NaN value in the 'IncidenceOfMalaria' column, fill it with the minimum value of that column for that country, not the minimum of the entire column.
My attempt:
malaria_data = pd.read_csv('DatasetAfricaMalaria.csv')
malaria_data["IncidenceOfMalaria"].groupby(malaria_data['CountryName']).min().sort_values()
I get a Series of per-country minimums, but I'm stuck at this point. How can I proceed, or what would you do differently?
A better approach would be something like this:
malaria_data.groupby('CountryName')['IncidenceOfMalaria'].apply(lambda gp: gp.fillna(gp.min()))
This will probably give you what you want. I didn't test it because there is no sample data, so please tell me if an error occurs.
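A variant that may be easier to assign back to the DataFrame is to use transform instead of apply, since transform keeps the original row index. This is a sketch on made-up rows (the real CSV is not available here), so the country names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for DatasetAfricaMalaria.csv.
malaria_data = pd.DataFrame({
    "CountryName": ["Kenya", "Kenya", "Kenya", "Ghana", "Ghana"],
    "IncidenceOfMalaria": [5.0, np.nan, 3.0, np.nan, 7.0],
})

# Per-row minimum of the row's own country (NaN values are skipped by min).
country_min = (
    malaria_data.groupby("CountryName")["IncidenceOfMalaria"].transform("min")
)

# Fill each gap with its country's minimum, not the column-wide minimum.
malaria_data["IncidenceOfMalaria"] = (
    malaria_data["IncidenceOfMalaria"].fillna(country_min)
)

print(malaria_data)
```

A country whose values are all NaN has no minimum, so its rows would stay NaN.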
Good evening everyone!
I have a problem with NaN values in Python with pandas.
I am working on a database with information on different countries. I cannot drop all of my NaN values, or I would lose too much data.
I wish to replace the NaN values based on some condition.
The dataframe I am working on
What I would like is to create a new column that would take the existing values of a column (Here: OECDSTInterbkRate) and replace all its NaN values based on a specific condition.
For example, I want to replace the NaN corresponding to Australia with the moving average of the values I already have for Australia.
Same thing for every other country for which I am missing values (Replace NaN observations in this column for France by the moving average of the values I already have for France, etc.).
What piece of code do you think I could use?
Thank you very much for your help!
Maybe you can try something like this: df.fillna(df.mean(numeric_only=True), inplace=True)
Replace df.mean(numeric_only=True) with your own mean values.
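Since the question asks for a per-country fill rather than a column-wide one, here is a minimal sketch of one way to do that. It assumes a country column named "Country" and a 3-observation window, neither of which is stated in the question, and it falls back to the country's overall mean where the window holds no data.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "Country" and the values are assumptions for illustration.
df = pd.DataFrame({
    "Country": ["Australia"] * 4 + ["France"] * 4,
    "OECDSTInterbkRate": [1.0, 2.0, np.nan, 4.0, np.nan, 3.0, 5.0, np.nan],
})

rate = df["OECDSTInterbkRate"]
by_country = rate.groupby(df["Country"])

# Trailing moving average within each country; NaN entries in the window
# are skipped, and min_periods=1 allows a result from a single observation.
moving_avg = by_country.transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Fallback for gaps with no data in the window: the country's overall mean.
country_mean = by_country.transform("mean")

# New column: original values, with gaps filled per country.
df["OECDSTInterbkRate_filled"] = rate.fillna(moving_avg).fillna(country_mean)

print(df)
```

The original column is left untouched, so the filled values can be compared against the raw ones.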
I'm learning how to use the pandas library in python3 and I've run into an issue with dataframe.corr()
Here's an example of my dataset
Date,Gender,Age at Booking,Current Age
2015-12-23,M,21,22
2015-12-23,M,25,25
2015-12-23,M,37,37
2015-12-23,F,39,40
2015-12-23,M,24,24
And here is how I attempt to load it/transform it
crime_data = pd.read_csv(crime_data_s)
print(crime_data.head())
print(crime_data['Date'])
correlated_data = crime_data.corr()
print(correlated_data)
Printing crime_data.head() shows the 4 columns with some associated data, and accessing the 'Date' column and printing its values works as expected. However, when I call crime_data.corr() and print the result, everything except "Age at Booking" and "Current Age" has been dropped, leaving a 2x2 matrix.
Calling dataframe.info(), I can see that the Date and Gender columns are typed as object rather than as a numeric type. What can be done to fix this so that I can run a correlation on the data?
data['Gender']=data['Gender'].astype('category').cat.codes
data['Date']=data['Date'].astype('category').cat.codes
data.corr()
Output
         Date    Gender       Age    curage
Date      NaN       NaN       NaN       NaN
Gender    NaN  1.000000 -0.162804 -0.703474
Age       NaN -0.162804  1.000000  0.814425
curage    NaN -0.703474  0.814425  1.000000
This happens because .corr() only works with numeric columns. You need to replace the values M and F with numbers, for instance:
crime_data['Gender'] = crime_data['Gender'].replace('M',1).replace('F',0)
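Here is a minimal sketch of that idea on the five sample rows from the question. It uses map with an explicit dictionary instead of chained replace calls, which is a substitution on my part, and it selects the numeric columns explicitly before calling corr().

```python
import io
import pandas as pd

# The sample rows from the question, inlined so the snippet is self-contained.
csv_text = """Date,Gender,Age at Booking,Current Age
2015-12-23,M,21,22
2015-12-23,M,25,25
2015-12-23,M,37,37
2015-12-23,F,39,40
2015-12-23,M,24,24
"""
crime_data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Encode Gender as numbers so it can take part in the correlation.
crime_data["Gender"] = crime_data["Gender"].map({"M": 1, "F": 0})

# Correlate only the numeric columns; Date is left out on purpose.
numeric_cols = ["Gender", "Age at Booking", "Current Age"]
correlated_data = crime_data[numeric_cols].corr()

print(correlated_data)
```

Date is excluded here because every sample row has the same date, and a constant column has no defined correlation.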
I have a DataFrame, from which I am creating a new calculated column. I then use df.fillna(0.0) to make sure I have no NaN values.
df = pd.read_csv("my_data.csv")
df['units_per_month'] = df['units'] / df['months_since_first_order']
df = df.fillna(0.0)
I then group the DataFrame by category with df_grp = df.groupby(['segments']) and attempt to calculate the standard deviation with std_units_month = df_grp['units_per_month'].std()
This works perfectly fine for 8 of my 11 categories, but for 3 of them the std is returned as NaN.
I know that I have all valid values and that all NaN values were filled because df[['segments','units_per_month']][df['units_per_month'].isnull()] returns an empty DataFrame.
I've also downloaded all the data and confirmed that nothing is awry. Excel can calculate all the stdevs.
Any thoughts on where I could have gone wrong?
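Two things that might be worth checking, offered as guesses rather than a diagnosis: a group with a single row has no sample standard deviation (pandas uses ddof=1 by default), and a division by zero months produces inf, which fillna does not replace. The sketch below reproduces both cases on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: segment "b" has one row, segment "c" has a zero divisor.
df = pd.DataFrame({
    "segments": ["a", "a", "b", "c", "c"],
    "units": [10.0, 20.0, 5.0, 8.0, 4.0],
    "months_since_first_order": [2.0, 4.0, 1.0, 0.0, 2.0],
})
df["units_per_month"] = df["units"] / df["months_since_first_order"]
df = df.fillna(0.0)  # replaces NaN only; inf values are left as they are

std_units_month = df.groupby("segments")["units_per_month"].std()

# Diagnostics: rows per group, and whether any group contains inf.
group_sizes = df.groupby("segments").size()
has_inf = np.isinf(df["units_per_month"]).groupby(df["segments"]).any()

print(std_units_month)
print(group_sizes)
print(has_inf)
```

If either check flags the three problem segments, std(ddof=0) or replacing inf with NaN before filling may be the relevant fix.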
Hi, I have a function that plots time series data for a given argument (in my case it's a country name). Some of the columns have NaN values, and when I try to plot them I can't because of those NaN values. How can I solve this problem?
This is the code that builds the DataFrame, along with the function I'm using:
import io
import requests
import pandas as pd
import matplotlib.pyplot as plt

url2='https://spreadsheets.google.com/pub?key=phAwcNAVuyj1jiMAkmq1iMg&output=xls'
source=io.BytesIO(requests.get(url2).content)
income=pd.read_excel(source)
income.head()
income.set_index("GDP per capita", inplace=True)
def gdpchange(country):
    dfff=income.loc[country]
    dfff.T.plot(kind='line')
    plt.legend([country])
Now if I want to plot all of them on one graph, it gives an error because of NaN values in some columns. Any suggestions?
for ctr in income.index.values:
    gdpchange(ctr)
You can drop all NaN values with DataFrame.dropna():
income.dropna(inplace=True)
This statement drops every row in the income DataFrame that contains any NaN value.
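One caveat is that dropping whole rows can remove every country that is missing even a single year. A possible alternative, sketched below on made-up numbers, is to drop the NaN entries per country so that each line keeps the years it does have. The helper name country_series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `income` table: countries as rows, years as columns.
income = pd.DataFrame(
    {1800: [600.0, np.nan], 1900: [1200.0, 900.0], 2000: [np.nan, 5000.0]},
    index=pd.Index(["Albania", "Brazil"], name="GDP per capita"),
)

# Dropping rows with any NaN removes both countries in this example.
all_complete = income.dropna()

def country_series(country):
    """Return one country's values with only its own missing years removed."""
    return income.loc[country].dropna()

print(len(all_complete))
print(country_series("Albania"))
print(country_series("Brazil"))
```

Inside gdpchange, plotting country_series(country) instead of income.loc[country] would then skip only that country's gaps.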