pandas groupby mean with nan - python

I have the following dataframe:
date id cars
2012 1 4
2013 1 6
2014 1 NaN
2012 2 10
2013 2 20
2014 2 NaN
Now, I want to get the mean of cars over the years for each id, ignoring the NaNs. The result should look like this:
date id cars result
2012 1 4 5
2013 1 6 5
2014 1 NaN 5
2012 2 10 15
2013 2 20 15
2014 2 NaN 15
I have the following command:
df["result"]=df.groupby("id")["cars"].mean()
The command runs without errors, but the result column only has NaN's.
What did I do wrong?

Use transform, which returns a Series the same size as the original:
df["result"]=df.groupby("id")["cars"].transform('mean')
print (df)
date id cars result
0 2012 1 4.0 5.0
1 2013 1 6.0 5.0
2 2014 1 NaN 5.0
3 2012 2 10.0 15.0
4 2013 2 20.0 15.0
5 2014 2 NaN 15.0
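For intuition: plain mean() returns one value per group, indexed by id, so assigning it to a column aligns on the frame's 0..5 row index rather than on id, which is why the column came out all NaN. A minimal sketch of this, plus a map-based equivalent of transform:
import numpy as np
import pandas as pd
df = pd.DataFrame({"date": [2012, 2013, 2014, 2012, 2013, 2014],
                   "id":   [1, 1, 1, 2, 2, 2],
                   "cars": [4, 6, np.nan, 10, 20, np.nan]})
# One value per group, indexed by id (1 and 2), not by row label
means = df.groupby("id")["cars"].mean()
print(means)
# Broadcast the per-id means back to every row, like transform('mean')
df["result"] = df["id"].map(means)
print(df)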

Hello, good old 2017 question. This is just another way, with a lot of overhead. You write about getting only NaN values in the result column with df["result"]=df.groupby("id")["cars"].mean(). In 2023 I did not run into this problem; perhaps it has been fixed in later versions. Anyway, if you face this again in whatever time and space, you might first want to know how to get the mean per id without NaN outweighing everything:
import numpy as np
np.seterr(divide='ignore', invalid='ignore')
# Unlike the pandas mean, np.average does not skip NaN, hence the dropna()
df.groupby(['id']).apply(lambda x: np.average(x['cars'].dropna()))
After this, join on the ids; a sketch of that join follows below. This answer carries a lot of overhead for the question at hand and should not be put to work as-is, but it may help those who search for a way to get the means without NaNs in the first place.
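If you do go that route, the join might look like this (a sketch on the question's df; note that pandas' own groupby mean already skips NaN by default, so the numpy detour is optional):
# Per-id means, renamed so the joined column is called 'result'
means = df.groupby("id")["cars"].mean().rename("result")
# Join the per-id means back onto the original frame via the id column
df = df.join(means, on="id")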

Related

Getting sum data for smoothly shifting groups of 3 months of monthly data in Pandas

I have a time series data of the following form:
Item 2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 A 0 1 2 3 4 5
1 B 5 4 3 2 1 0
This is monthly data, but I want to get quarterly data out of it. Normal quarterly data would be calculated by summing up Jan-Mar and Apr-Jun and would look like this:
Item 2020 Q1 2020 Q2
0 A 3 12
1 B 12 3
I want to get smoother quarterly data so it would shift by only 1 month for each new data item, not 3 months. So it would have Jan-Mar, then Feb-Apr, then Mar-May, and Apr-Jun. So the resulting data would look like this:
Item 2020 Q1 2020 Q1 2020 Q1 2020 Q2
0 A 3 6 9 12
1 B 12 9 6 3
I believe this is similar to cumsum which can be used as follows:
df_dates = df.iloc[:,1:]
df_dates.cumsum(axis=1)
which leads to the following result:
2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 0 1 3 6 10 15
1 5 9 12 14 15 15
but I want, instead of the sum over the whole period so far, the sum of only the nearest 3 months (a quarter).
I do not know what this version of cumsum is called, but I have seen it in many places, so I believe there might be a library function for it.
Let us solve this in steps:
Set the index to Item column
Parse the date like columns to quarterly period
Calculate the rolling sum with window of size 3
Shift the calculated rolling sum 2 units along the columns axis and get rid of the last two columns
s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M').strftime('%Y Q%q')
s = s.rolling(3, axis=1).sum().shift(-2, axis=1).iloc[:, :-2]
print(s)
2020 Q1 2020 Q1 2020 Q1 2020 Q2
Item
A 3.0 6.0 9.0 12.0
B 12.0 9.0 6.0 3.0
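Note that rolling(..., axis=1) is deprecated in recent pandas (2.1+); an equivalent sketch that transposes instead:
# Same result without axis=1: transpose, roll down the rows, transpose back
s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M').strftime('%Y Q%q')
s = s.T.rolling(3).sum().shift(-2).T.iloc[:, :-2]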
Try with column wise groupby with axis=1:
>>> df.iloc[:, [0]].join(df.iloc[:, 1:].groupby(pd.to_datetime(df.columns[1:], format='%Y %b').quarter, axis=1).sum().add_prefix('Q'))
Item Q1 Q2
0 A 3 12
1 B 12 3
>>>
Edit:
I misread your question; to do what you want, try a rolling sum:
>>> x = df.rolling(3, axis=1).sum().dropna(axis='columns')
>>> df.iloc[:, [0]].join(x.set_axis('Q' + pd.to_datetime(df.columns[1:], format='%Y %b').quarter.astype(str)[:len(x.T)], axis=1))
Item Q1 Q1 Q1 Q2
0 A 3.0 6.0 9.0 12.0
1 B 12.0 9.0 6.0 3.0
>>>

Sum based on criteria in row and column conditions [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 3 years ago.
I have a dataframe in Pandas that looks something like this:
Year Type Money
2012 A 2
2012 A 3
2012 B 4
2012 B 5
2012 C 7
2013 A 6
2013 A 4
2013 B 3
2013 B 2
2013 C 1
2014 A 3
2014 A 4
2014 B 5
I want to sum it up as such:
A B C
2012 5 9 7
2013 10 5 1
2014 7 5 0
For instance, the first entry of 5 is a sum of all entries in the data from year 2012 and with Type A.
Is there a simple way to go about doing this? I know how to go about this using SUMIFS in Excel but want to avoid that if possible.
Try:
df.groupby(['Year','Type']).Money.sum().unstack(level=1).fillna(0)
Output:
Type A B C
Year
2012 5.0 9.0 7.0
2013 10.0 5.0 1.0
2014 7.0 5.0 0.0
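The same reshape also works as a single pivot_table call (a sketch; aggfunc='sum' collapses the duplicate Year/Type pairs and fill_value=0 covers the missing 2014 C cell):
df.pivot_table(index='Year', columns='Type', values='Money', aggfunc='sum', fill_value=0)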

Reformatting a dataframe into a new output format

I have the output from a pivot table in a dataframe (df) that looks like:
Year Month sum
2005 10 -1.596817e+05
11 -2.521054e+05
12 5.981900e+05
2006 1 8.686413e+05
2 1.673673e+06
3 1.218341e+06
4 4.131970e+05
5 1.090499e+05
6 1.495985e+06
7 1.736795e+06
8 1.155071e+05
...
9 7.847369e+05
10 -5.564139e+04
11 -7.435682e+05
12 1.073361e+05
2017 1 3.427652e+05
2 3.574432e+05
3 5.026018e+04
Is there a way to reformat the dataframe so the output to console would look like:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
All the values would be populated in the new table as well.
Use unstack:
In [18]: df['sum'].unstack('Month')
Out[18]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN -159681.70 -252105.4 598190.0
2006.0 868641.3 1673673.0 1218341.00 413197.0 109049.9 1495985.0 1736795.0 115507.1 784736.9 -55641.39 -743568.2 107336.1
2017.0 342765.2 357443.2 50260.18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Try df.pivot(index='year', columns='month', values='sum').
To fill your (possibly empty) year column, use df.fillna(method='ffill') before the above.
Reading the answer above, it should be mentioned that my suggestion works in cases where year and month aren't the index.
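To make the two cases concrete, a minimal self-contained sketch (values taken from the question):
import pandas as pd
flat = pd.DataFrame({'Year': [2005, 2005, 2006, 2006],
                     'Month': [11, 12, 1, 2],
                     'sum': [-252105.4, 598190.0, 868641.3, 1673673.0]})
# Year/Month as ordinary columns: pivot does the reshape
wide = flat.pivot(index='Year', columns='Month', values='sum')
# Year/Month already in the index: unstack the Month level instead
wide2 = flat.set_index(['Year', 'Month'])['sum'].unstack('Month')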

Panel data: mean, groupby and with a condition

I want to calculate, first, the mean of jobs whenever entry == 1 and, second, the mean of jobs by year_of_life.
id year entry cohort jobs year_of_life
1 2009 0 NaN 10 NaN
1 2012 1 2012 12 0
1 2013 0 2012 12 1
1 2014 0 2012 13 2
2 2010 1 2010 2 0
2 2011 0 2010 3 1
2 2012 0 2010 3 2
3 2007 0 NaN 4 Nan
3 2008 0 NaN 4 Nan
3 2012 1 2012 5 0
3 2013 0 2012 5 1
Thank you very much
Addressing your first requirement -
df.query('entry == 1').jobs.mean()
6.333333333333333
Addressing your second requirement - here, I exclude jobs where entry is 1 (mask replaces them with NaN before grouping).
df.assign(jobs=df.jobs.mask(df.entry == 1)).groupby('year_of_life').jobs.mean()
year_of_life
0 NaN
1 6.666667
2 8.000000
Nan 4.000000
Name: jobs, dtype: float64
If you just want mean by year_of_life, a simple groupby will suffice.
df.groupby('year_of_life').jobs.mean()
year_of_life
0 6.333333
1 6.666667
2 8.000000
Nan 4.000000
Name: jobs, dtype: float64
Note that this is different from what the other answer is suggesting, which I think isn't what you're looking for:
df.query('entry == 1').groupby('year_of_life').jobs.mean()
year_of_life
0 6.333333
Name: jobs, dtype: float64
For the first, you can use boolean indexing to filter the dataframe for rows where the condition is True and then take the mean: df[df.entry == 1].mean(). For the second, group by year_of_life and take the mean of each group: df.groupby('year_of_life').mean(). If you want both conditions satisfied, combine the filter with the grouping, as sketched below.
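All three, spelled out for the jobs column:
# Mean over rows where entry == 1
df[df.entry == 1].jobs.mean()
# Mean per year_of_life group
df.groupby('year_of_life').jobs.mean()
# Both: filter first, then group
df[df.entry == 1].groupby('year_of_life').jobs.mean()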

Expand panel data python

I am trying to expand the following data. I am a Stata user, and my problem can be fixed by the command "fillin" in Stata; now I am trying to rewrite this command in Python and couldn't find anything that works.
For example, transform this data frame
(my dataframe is bigger than the example given; the example is just to illustrate what I want to do)
id year X Y
1 2008 10 20
1 2010 15 25
2 2011 2 4
2 2012 3 6
to this one
id year X Y
1 2008 10 20
1 2009 . .
1 2010 15 25
1 2011 . .
1 2012 . .
2 2008 . .
2 2009 . .
2 2010 . .
2 2011 2 4
2 2012 3 6
Thank you, and sorry for my English.
This can be done by using .loc[] (or, in newer pandas, reindex):
from itertools import product
import pandas as pd
df = pd.DataFrame([[1,2008,10,20],[1,2010,15,25],[2,2011,2,4],[2,2012,3,6]],columns=['id','year','X','Y'])
df = df.set_index(['id','year'])
# All combinations of the index; note that df.index.levels would miss 2009,
# since it only contains observed values, hence the explicit ranges
#idx = list(product(df.index.levels[0], df.index.levels[1]))
idx = list(product(range(1,3), range(2008,2013)))
# Older pandas fills missing labels with NaN via df.loc[idx];
# newer versions raise KeyError there, so reindex is the safe spelling
df.reindex(idx)
Create a new multi-index from the dataframe and then reindex:
import numpy as np
# Repeat the full year range once per id (the 2 hardcodes the number of ids)
years = np.tile(np.arange(df.year.min(), df.year.max()+1,1) ,2)
# ...and each id once per year
ids = np.repeat(df.id.unique(), df.year.max()-df.year.min()+1)
arrays = [ids.tolist(), years.tolist()]
new_idx = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['id', 'year'])
df = df.set_index(['id', 'year'])
df.reindex(new_idx).reset_index()
id year X Y
0 1 2008 10.0 20.0
1 1 2009 NaN NaN
2 1 2010 15.0 25.0
3 1 2011 NaN NaN
4 1 2012 NaN NaN
5 2 2008 NaN NaN
6 2 2009 NaN NaN
7 2 2010 NaN NaN
8 2 2011 2.0 4.0
9 2 2012 3.0 6.0
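A possibly tidier alternative (a sketch): build the full id-by-year grid with MultiIndex.from_product so nothing is hardcoded, then reindex against it:
import pandas as pd
df = pd.DataFrame([[1,2008,10,20],[1,2010,15,25],[2,2011,2,4],[2,2012,3,6]],
                  columns=['id','year','X','Y'])
# Every id crossed with every year in the observed range
full_idx = pd.MultiIndex.from_product(
    [df['id'].unique(), range(df['year'].min(), df['year'].max() + 1)],
    names=['id', 'year'])
df.set_index(['id', 'year']).reindex(full_idx).reset_index()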
