How can I fill null values with a mean using Pandas? [duplicate] - python

This question already has answers here:
Pandas: filling missing values by mean in each group
(12 answers)
Closed 18 days ago.
Having a hard time understanding why the apply function isn't working here. I'm trying to fill the null values of SalePrice with the mean sale price of the corresponding quality rating (OverallQual).
I expected the function to iterate through each row and return the mean SalePrice for the corresponding OverallQual value where SalePrice is null, and otherwise return the original SalePrice.
sale_price_by_qual = df.groupby('OverallQual').mean()['SalePrice']
def fill_sales_price(SalePrice, OverallQual):
    if np.isnan(SalePrice):
        return sale_price_by_qual[SalePrice]
    else:
        return SalePrice

df['SalePrice'] = df.apply(lambda x: fill_sales_price(x['SalePrice'], x['OverallQual']), axis=1)
KeyError: nan

Could you maybe save the mean value into a variable and then use .fillna()?
x = df['SalePrice'].mean()  # your mean value
df['SalePrice'] = df['SalePrice'].fillna(x)

Try this:
def fill_sales_price(SalePrice, OverallQual):
    if np.isnan(SalePrice):
        # index the lookup Series by OverallQual, not by the NaN SalePrice
        return sale_price_by_qual[OverallQual]
    else:
        return SalePrice

df['SalePrice'] = df.apply(lambda x: fill_sales_price(x['SalePrice'], x['OverallQual']), axis=1)

Related

Python: sort a column conditionally by putting special chars on top [duplicate]

This question already has answers here:
Custom sorting in pandas dataframe
(5 answers)
Closed 11 months ago.
I am working on my dataset. I need to sort one of its columns from smallest to largest, like:
However, when I use:
count20 = count20.sort_values(by=['Month Year', 'Age'])
I got:
Can anyone help me with this? Thank you very much!
Define a function like this:
def fn(x):
    # keep only the alphanumeric characters of each value so the
    # special characters do not disturb the comparison
    return x.map(lambda value: ''.join(e for e in str(value) if e.isalnum()))
and pass this function as key while sorting values.
count20 = count20.sort_values(by = ['Month Year', 'Age'], key= fn)
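As a toy check of how the key behaves (the Age values here are hypothetical, since the original screenshots are missing):
import pandas as pd

toy = pd.DataFrame({'Age': ['20-29', '<20', '30-39']})
# stripping the non-alphanumeric characters turns '<20' into '20',
# which sorts before '2029' ('20-29') and '3039' ('30-39')
print(toy.sort_values(by='Age', key=lambda col: col.map(lambda v: ''.join(e for e in v if e.isalnum()))))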

Exclude rows which have NA value for a column [duplicate]

This question already has answers here:
How to drop rows of Pandas DataFrame whose value in a certain column is NaN
(15 answers)
Closed 1 year ago.
This is a sample of my data
I have written this code, which removes all categorical columns (e.g. MSZoning). However, some non-categorical columns have NA values. How can I exclude them from my data set?
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def main():
    print('Starting program execution')
    iowa_train_prices_file_path = 'C:\\...\\programs\\python\\kaggle_competition_iowa_house_prices_train.csv'
    iowa_file_data = pd.read_csv(iowa_train_prices_file_path)
    print('Read file')
    model_random_forest = RandomForestRegressor(random_state=1)
    features = ['MSSubClass', 'MSZoning', ...]
    y = iowa_file_data.SalePrice
    # every column except SalePrice
    X = iowa_file_data.drop('SalePrice', axis=1)
    # the object dtype indicates a column holds text (a hint that it is categorical)
    X_dropped = X.select_dtypes(exclude=['object'])
    print("fitting model")
    model_random_forest.fit(X_dropped, y)
    print("MAE of dropped categorical approach")

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
main()
When I run the program, I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float32'), which I believe is due to the NA value at Id=8.
Question 1 - How do I remove such rows entirely?
Question 2 - What is the dtype of such columns, which are mostly numbers but have text in between? I tried print("X types", type(X.columns)) but that doesn't give the result.
To remove NaNs, you can replace them with another value; it is common practice to use zeros.
iowa_file_data = iowa_file_data.fillna(0)
If you still want to remove the whole column, use
iowa_file_data = iowa_file_data.dropna(axis='columns')
And if you want to remove the entire row, use
iowa_file_data = iowa_file_data.dropna()
For your second question, from what I understand, you might want to read up on the pandas object dtype.
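A quick way to see which columns pandas parsed as text is to inspect the dtypes directly; a mostly-numeric column that contains stray text ends up as object:
print(X.dtypes)  # dtype of every column
print(X.select_dtypes(include=['object']).columns)  # the columns pandas read as text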

Column in DataFrame in Pandas with value 0

I am trying to create 2 new columns in a DataFrame in pandas. The first column, aa, which shows the average temperature, is correct; nevertheless, the second column, bb, which should show the temperature in a City minus the average temperature in all cities, displays the value 0.
Where is the problem? Did I use lambda correctly? Could you give me the solution? Thank you very much!
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.groupby(['City'])["Temperature"].transform(lambda x: x - np.mean(x))
display(file.head(10))
EDIT: Updated according to gereleth's comment. You can simplify it even more!
file['bb'] = file.Temperature - file.aa
Since we've already calculated the mean value in the aa column, we can simply reuse it to calculate the difference between the Temperature and aa columns of each row, using the pandas apply method as below:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.apply(lambda row: row['Temperature'] - row['aa'], axis=1)
display(file.sample(10))
If you are looking to subtract the average of all cities' temperatures, you can take the mean of the aa column instead:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
avg_all_cities = file['aa'].mean()
file["bb"] = file.apply(lambda row: row['Temperature'] - avg_all_cities, axis=1)
display(file.sample(10))
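As an aside, the apply in the last snippet isn't strictly needed; plain column arithmetic performs the same subtraction and is faster:
file["bb"] = file["Temperature"] - avg_all_cities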

How to get the second largest value in Pandas Python [duplicate]

This question already has answers here:
Get first and second highest values in pandas columns
(7 answers)
Closed 4 years ago.
This is my code:
maxData = all_data.groupby(['Id'])[features].agg('max')
all_data = pd.merge(all_data, maxData.reset_index(), suffixes=["", "_max"], how='left', on=['Id'])
Now, instead of getting the max value, how can I fetch the second largest value in the above code (grouped by Id)?
Try using nlargest:
# nlargest(2) keeps the two highest values per group; iloc[-1] takes the second
maxData = all_data.groupby(['Id'])[features].apply(lambda x: x.nlargest(2).iloc[-1]).reset_index(drop=True)
You can use the nth method just after sorting the values:
maxData = all_data.sort_values(features, ascending=False).groupby(['Id']).nth(1)
Avoid the apply method where you can, as it degrades performance.
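As a toy check of the nlargest idea (hypothetical data, a single value column for simplicity):
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 1, 2, 2], 'value': [10, 30, 20, 5, 7]})
# the two highest values per Id are kept, and iloc[-1] takes the smaller of them
second_max = df.groupby('Id')['value'].apply(lambda s: s.nlargest(2).iloc[-1])
print(second_max)  # Id 1 -> 20, Id 2 -> 5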

Pandas get percent value of groupby [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 4 years ago.
I have a pandas groupby that I've done
grouped = df.groupby(['name','type'])['count'].count().reset_index()
Looks like this:
name  type  count
x     a        32
x     b      1111
x     c      4214
What I need to do is take this and generate percentages, so I would get something like this (I realize the percentages are incorrect):
name  type  count
x     a       1%
x     b      49%
x     c      50%
I can think of some pseudocode that might make sense, but I haven't been able to get anything that actually works. Something like:
def getPercentage(df):
    for name in df:
        total = 0
        # pseudocode: select the rows where df['name'] == name
        where df['name'] = name:
            total = total + df['count']
        type_percent = (df['count'] / total) * 100
        return type_percent

df.apply(getPercentage)
Is there a good way to do this with pandas?
Try:
# each count as a share of its name's total
grouped['count'] = grouped['count'] / grouped.groupby('name')['count'].transform('sum') * 100
Using crosstab + normalize:
pd.crosstab(df.name, df.type, normalize='index').stack().reset_index()
Any Series can be normalized by passing the argument normalize=True, as follows (it's cleaner than dividing by the count):
Series.value_counts(normalize=True, sort=True, ascending=False)
So, it will be something like (which is a series, not a dataframe):
df['type'].value_counts(normalize=True) * 100
or, if you use groupby, you can simply do:
total = grouped['count'].sum()
grouped['count'] = grouped['count']/total * 100
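As a quick check of the groupby version against the counts from the question (transform('sum') stands in for the flat total so it also works with several names):
import pandas as pd

grouped = pd.DataFrame({'name': ['x', 'x', 'x'],
                        'type': ['a', 'b', 'c'],
                        'count': [32, 1111, 4214]})
grouped['count'] = grouped['count'] / grouped.groupby('name')['count'].transform('sum') * 100
print(grouped.round(1))  # a -> 0.6, b -> 20.7, c -> 78.7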
