Adding a summarised column back into a dataframe in Python

As part of some data cleansing, I want to add the mean of a variable back into a dataframe, to use when the variable is missing for a particular observation. I've calculated my averages as follows:
avg = all_data2.groupby("portfolio")["sales"].mean().reset_index(name="sales_mean")
Now I want to add that back into my original dataframe using a left join, but it doesn't appear to be working. What format is my avg now? I thought it would be a dataframe, but is it something else?

UPDATED
This is probably the most succinct way to do it:
all_data2.sales = all_data2.sales.fillna(all_data2.groupby('portfolio').sales.transform('mean'))
This is another way to do it:
all_data2['sales'] = all_data2[['portfolio', 'sales']].groupby('portfolio').transform(lambda x: x.fillna(x.mean()))
Input:
  portfolio  sales
0         1   10.0
1         1   20.0
2         2   30.0
3         2   40.0
4         3   50.0
5         3   60.0
6         3    NaN
Output:
  portfolio  sales
0         1   10.0
1         1   20.0
2         2   30.0
3         2   40.0
4         3   50.0
5         3   60.0
6         3   55.0
To answer the part of your question that reads "what format is my avg now? I thought it would be a dataframe but is it something else?": avg is indeed a dataframe, but using it may not be the most direct way to update missing data in the original dataframe. The dataframe avg looks like this for the sample input data above:
  portfolio  sales_mean
0         1        15.0
1         2        35.0
2         3        55.0
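For completeness, here is a sketch of the merge-based route you were attempting, using the sample data above; the left join keeps every row of all_data2, and the per-portfolio mean is then used to fill only the gaps:
import numpy as np
import pandas as pd

all_data2 = pd.DataFrame({'portfolio': [1, 1, 2, 2, 3, 3, 3],
                          'sales': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, np.nan]})

avg = all_data2.groupby('portfolio')['sales'].mean().reset_index(name='sales_mean')

# Left-join the per-portfolio means back on, then fill the missing sales.
merged = all_data2.merge(avg, on='portfolio', how='left')
all_data2['sales'] = merged['sales'].fillna(merged['sales_mean'])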

If you want to add a new column holding the row-wise mean of two existing columns, you can use this code:
df['sales_mean'] = df[['sales_1', 'sales_2']].mean(axis=1)
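Note that this is a different operation from the per-group mean above: mean(axis=1) averages across columns within each row. A minimal sketch (the sales_1 and sales_2 column names are placeholders):
import pandas as pd

df = pd.DataFrame({'sales_1': [10, 20], 'sales_2': [30, 40]})
df['sales_mean'] = df[['sales_1', 'sales_2']].mean(axis=1)
print(df)  # sales_mean is 20.0 and 30.0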

Related

Delete rows between NaN and a change in the column value

I am stuck on a problem that looks simple but for which I cannot find a proper solution.
Consider a Pandas dataframe df composed of multiple columns A1, A2, etc., and let Ai be one of its columns, filled for example as follows:
Ai
25
30
30
NaN
12
15
15
NaN
I would like to delete all the rows in df for which the Ai values lie between a NaN and the next change in value, so that my output (for column Ai) would be:
Ai
25
NaN
12
NaN
Any idea on how to do so would be very much appreciated. Thank you very much in advance.
update
Similar to the original answer below, but with a per-group filter to keep the early duplicates (the output below is for a sample with more duplicates):
m = df['Ai'].isna()

# Mark NaN rows and the row immediately after each NaN, then keep every
# run of consecutive identical Ai values that contains at least one mark.
df.loc[(m | m.shift(fill_value=True))
       .groupby(df['Ai'].ne(df['Ai'].shift()).cumsum())
       .filter(lambda d: d.sum() > 0)
       .index]
output:
Ai
0 25.0
1 25.0
2 25.0
5 NaN
6 30.0
7 30.0
9 NaN
original answer
This is equivalent to selecting the NaN rows and the row immediately below each NaN. You could use a boolean mask:
m = df['Ai'].isna()
df[m | m.shift(fill_value=True)]
Output:
Ai
0 25.0
3 NaN
4 12.0
7 NaN
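For reproducibility, here is a minimal setup for the sample column above (the default integer index is assumed):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Ai': [25, 30, 30, np.nan, 12, 15, 15, np.nan]})

m = df['Ai'].isna()
print(df[m | m.shift(fill_value=True)])  # rows 0, 3, 4, 7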

How to sum multiple columns to get a "groupby" type of output with Python?

The database I'm working with is a little strange, but I basically just want to get a sum of each column and output it into a dataframe similar to a groupby's layout. My current dataframe is like this:
Diet  Kin  LLFC  MH  Nursing
   1        1     3     1
   1   1    1     1     1
   1   2    1     3     1
   1   2    1     6     1
I'd essentially like to sum the columns and output something like this:
Diet 4
Kin 5
LLFC 4
MH 13
Nursing 4
I tried to do a groupby, but I got a ValueError saying that the grouper and axis must be the same length. I added a total row that sums every column up, but I'm still having trouble outputting a small summary dataframe like the one above. Can someone help?
To achieve this you can simply use df.sum() which will give you:
Diet 4.0
Kin 5.0
LLFC 4.0
MH 13.0
Nursing 4.0
dtype: float64
To get the results in transposed form:
df_sum = pd.DataFrame(df.sum()).T
   Diet  Kin  LLFC    MH  Nursing
0   4.0  5.0   4.0  13.0      4.0
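If you want the summary as a small two-column dataframe rather than a Series, one sketch (assuming the blank Kin cell in the sample above is read in as NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Diet': [1, 1, 1, 1],
                   'Kin': [np.nan, 1, 2, 2],
                   'LLFC': [1, 1, 1, 1],
                   'MH': [3, 1, 3, 6],
                   'Nursing': [1, 1, 1, 1]})

# Sum each column, then turn the resulting Series into a two-column frame.
df_sum = df.sum().rename_axis('column').reset_index(name='total')
print(df_sum)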

Trying to fill NaNs with fillna() and groupby()

So I basically have an Airbnb data set with a few columns. Several of them correspond to ratings of different parameters (cleanliness, location, etc.). For those columns I have a bunch of NaNs that I want to fill.
As some of those NaNs correspond to listings from the same owner, I wanted to fill some of the NaNs with the corresponding hosts' rating average for each of those columns.
For example, let's say that for host X, the average value for review_scores_location is 7. What I want to do is, in the review_scores_location column, fill all the NaN values, that correspond to the host X, with 7.
I've tried the following code:
cols = ['reviews_per_month', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']

for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].mean())
Although it runs and doesn't return any error, it does not fill the NaN values; when I check whether there are still any NaNs, the amount hasn't changed.
What am I doing wrong?
Thanks for taking the time to read this!
The problem here is that when you pass the series airbnb.groupby('host_id')[i].mean() to fillna, pandas tries to align indexes. The index of airbnb.groupby('host_id')[i].mean() consists of the host_id values, not the original index values of airbnb, so the fillna does not work as you expect. Several options can do the job; one way is to use transform after the groupby, which aligns the per-group mean back to the original index values, so the fillna then works as expected:
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].transform('mean'))
You can even use this method without a loop:
airbnb = airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean'))
With an example:
import numpy as np
import pandas as pd

airbnb = pd.DataFrame({'host_id': [1, 1, 1, 2, 2, 2],
                       'reviews_per_month': [4, 5, np.nan, 9, 3, 5],
                       'review_scores_rating': [3, np.nan, np.nan, np.nan, 7, 8]})
print(airbnb)
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 NaN 5.0
2 1 NaN NaN
3 2 NaN 9.0
4 2 7.0 3.0
5 2 8.0 5.0
and you get:
cols = ['reviews_per_month', 'review_scores_rating']  # would work with all your columns
print(airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean')))
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 3.0 5.0
2 1 3.0 4.5
3 2 7.5 9.0
4 2 7.0 3.0
5 2 8.0 5.0
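For contrast, a sketch of why the original non-transform version fills less than you expect with this sample data (the index-alignment issue described above):
means = airbnb.groupby('host_id')['review_scores_rating'].mean()
print(means)
# host_id
# 1    3.0
# 2    7.5

# fillna aligns on the index: the positional labels 0..5 are matched
# against the host_id labels 1 and 2, so only rows whose index value
# happens to equal a host_id get filled; with real (large) host ids,
# nothing matches and no NaN is filled.
print(airbnb['review_scores_rating'].fillna(means))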

Trying to truncate decimal values in all the cells of dataframe, but not working

The dataframe consists of a table whose format is shown in the attached image. I apologize for not being able to type the format here; while trying to type it, it was getting messed up due to the long decimal values, so I attached a snapshot instead.
Country names are the index of the dataframe and the cell values consist of the corresponding GDP values. The intent is to calculate the average of all the rows for each country. When np.average was applied -
# name of DataFrame - GDP
def function_average():
    GDP['Average'] = np.average(GDP.iloc[:, 0:])
    return GDP

function_average()
The new column got created, but all its values are NaN. I assumed it's probably due to inappropriately formatted cell values, so I tried truncating them with the following code -
GDP = np.round(GDP, decimals=2)
And yet there was no change in the values. The code ran successfully, though, and there was no error.
Please advise how to proceed in this case: should I try to make the change in the spreadsheet itself, or attempt to format the cell values in the dataframe?
Please let me know if any other detail is required.
The problem is that you need axis=1 to compute the mean per row, and you should change the function to numpy.nanmean or DataFrame.mean, both of which skip NaNs:
Sample:
import numpy as np
import pandas as pd

np.random.seed(100)
GDP = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
GDP.loc[0, 'A'] = np.nan

GDP['Average1'] = np.average(GDP.iloc[:, 0:], axis=1)
GDP['Average2'] = np.nanmean(GDP.iloc[:, 0:], axis=1)
GDP['Average3'] = GDP.iloc[:, 0:].mean(axis=1)
print(GDP)
A B C D E Average1 Average2 Average3
0 NaN 8 3 7 7 NaN 6.25 6.25
1 0.0 4 2 5 2 2.6 2.60 2.60
2 2.0 2 1 0 8 2.6 2.60 2.60
3 4.0 0 9 6 2 4.2 4.20 4.20
4 4.0 1 5 3 4 3.4 3.40 3.40
You get NaN because there is at least one NaN in the data; np.average without an axis reduces the whole frame to a single scalar:
print (np.average(GDP.iloc[:,0:]))
nan
GDP['Average'] = np.average(GDP.iloc[:,0:])
print (GDP)
A B C D E Average
0 NaN 8 3 7 7 NaN
1 0.0 4 2 5 2 NaN
2 2.0 2 1 0 8 NaN
3 4.0 0 9 6 2 NaN
4 4.0 1 5 3 4 NaN

Using df.merge to populate a new column in df gives strange matches

I just found the two issues causing this; see the solution below.
I want to create a new column in my dataframe (df) based on another dataframe. Basically, df1 contains updated information that I want to plug into df.
In order to replicate my real case (>1m lines), I will just populate two random dataframes with simple columns.
I use pandas.merge() to do this, but it is giving me strange results.
Here is a typical example. Let's create df randomly, and create df1 with a simple relationship: "New Type" = "Type" + 1. I create this simple relationship so that we can easily check the output. In my real application I don't have such an easy relationship, of course.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), columns=["Type"])
df.head()
Type
0 45
1 3
2 89
3 6
4 39
df1 = pd.DataFrame({"Type": range(1, 100)})
df1["New Type"] = df1["Type"] + 1
print(df1.head())
Type New Type
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
Now let's say I want to update df's "Type" based on the "New Type" in df1:
df["Type2"] = df.merge(df1, on="Type")["New Type"]
print(df.head())
I get this strange output, where we can clearly see that it does not work:
Type Type2
0 45 46.0
1 3 4.0
2 89 4.0
3 6 4.0
4 39 90.0
I would expect the output to be:
Type Type2
0 45 46.0
1 3 4.0
2 89 90.0
3 6 7.0
4 39 40.0
Only the first line is properly matched. Do you know what I've missed?
Solution
1. I need to merge with how="left", otherwise the default choice is "inner", which produces another table with a different dimension than df.
2. I also need to pass sort=False to my merge function; otherwise the merge result is sorted before being applied to df.
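A minimal sketch of the corrected call, combining the two fixes above (df and df1 as built earlier):
# how="left" keeps one output row per row of df, in df's original order,
# so the result aligns correctly when assigned back.
df["Type2"] = df.merge(df1, on="Type", how="left", sort=False)["New Type"]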
One way you could do this using map, set_index, and squeeze:
df['Type2'] = df['Type'].map(df1.set_index('Type').squeeze())
Output:
Type Type2
0 22 23.0
1 56 57.0
2 63 64.0
3 33 34.0
4 25 26.0
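The squeeze() call collapses the one-column frame left after set_index into a Series, so that map can use it as a lookup table. An equivalent sketch that selects the column explicitly instead:
df['Type2'] = df['Type'].map(df1.set_index('Type')['New Type'])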
First, I'd construct a Series of New Type indexed by the old Type from df1:
new_vals = df1.set_index('Type')['New Type']
Then it's simply:
df.replace(new_vals)
That will leave values which aren't mapped intact. If you instead want the output to be NaN (null) where a value isn't mapped, use reindex (plain label-based lookup with missing keys raises a KeyError in recent pandas):
new_vals.reindex(df.Type)
