Panel data: mean with a condition and groupby mean - Python

I want to calculate, first, the mean of jobs whenever entry == 1 and, second, the mean of jobs by year_of_life.
id year entry cohort jobs year_of_life
1 2009 0 NaN 10 NaN
1 2012 1 2012 12 0
1 2013 0 2012 12 1
1 2014 0 2012 13 2
2 2010 1 2010 2 0
2 2011 0 2010 3 1
2 2012 0 2010 3 2
3 2007 0 NaN 4 NaN
3 2008 0 NaN 4 NaN
3 2012 1 2012 5 0
3 2013 0 2012 5 1
Thank you very much

Addressing your first requirement -
df.query('entry == 1').jobs.mean()
6.333333333333333
Addressing your second requirement - here, I mask out the jobs where entry is 1, so those rows are excluded from the group means.
df.assign(jobs=df.jobs.mask(df.entry == 1)).groupby('year_of_life').jobs.mean()
year_of_life
0 NaN
1 6.666667
2 8.000000
NaN 4.000000
Name: jobs, dtype: float64
If you just want the mean by year_of_life, a simple groupby will suffice.
df.groupby('year_of_life').jobs.mean()
year_of_life
0 6.333333
1 6.666667
2 8.000000
NaN 4.000000
Name: jobs, dtype: float64
Note that this is different from what the other answer is suggesting, which I think isn't what you're looking for:
df.query('entry == 1').groupby('year_of_life').jobs.mean()
year_of_life
0 6.333333
Name: jobs, dtype: float64

For the first, you can use boolean indexing to filter the dataframe to the rows where the condition is True, then take the mean: df[df.entry == 1].jobs.mean(). For the second, group by year_of_life, then take the mean of each group: df.groupby('year_of_life').jobs.mean(). If you want both conditions satisfied, filter first and then group: df[df.entry == 1].groupby('year_of_life').jobs.mean().
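For a runnable starting point, here is a minimal sketch that reconstructs the example data and runs both computations (the cohort column is omitted since neither computation uses it, and NaN is assumed for the missing values):
import numpy as np
import pandas as pd

# Reconstruction of the example panel
df = pd.DataFrame({
    'id':           [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'year':         [2009, 2012, 2013, 2014, 2010, 2011, 2012,
                     2007, 2008, 2012, 2013],
    'entry':        [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    'jobs':         [10, 12, 12, 13, 2, 3, 3, 4, 4, 5, 5],
    'year_of_life': [np.nan, 0, 1, 2, 0, 1, 2, np.nan, np.nan, 0, 1],
})

# Mean of jobs where entry == 1
print(df.loc[df.entry == 1, 'jobs'].mean())       # 6.333...

# Mean of jobs by year_of_life; NaN keys are dropped by default
# (in pandas >= 1.1 you can pass dropna=False to keep them as a group)
print(df.groupby('year_of_life')['jobs'].mean())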

Related

Count the number of unique values of a column that have at least one non-null response

This is what my dataframe looks like:
Year  State  Var1  Var2
2018  1      1     3
2018  1      2     NaN
2018  1      NaN   1
2018  2      NaN   1
2018  2      NaN   2
2018  3      3     NaN
2019  1      1     NaN
2019  1      3     NaN
2019  1      2     NaN
2019  1      NaN   NaN
2019  2      NaN   NaN
2019  2      3     NaN
2020  1      1     NaN
2020  2      NaN   1
2020  2      NaN   3
2020  3      3     NaN
2020  4      NaN   NaN
2020  4      1     NaN
Desired Output
Year 2018 2019 2020
Var1 Num of States w/ non-null 2 2 3
Var2 Num of States w/ non-null 2 0 1
I want to count, for each variable and year, the number of unique values of State that have at least one non-null response.
IIUC you are looking for:
out = pd.concat([
    df.dropna(subset=['Var1']).pivot_table(columns='Year',
                                           values='State',
                                           aggfunc='nunique'),
    df.dropna(subset=['Var2']).pivot_table(columns='Year',
                                           values='State',
                                           aggfunc='nunique')
]).fillna(0).astype(int)
out.index = ['Var1 Num of States w/ non-null', 'Var2 Num of States w/ non-null']
print(out)
Year                            2018  2019  2020
Var1 Num of States w/ non-null     2     2     3
Var2 Num of States w/ non-null     2     0     1
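If there are many Var columns, one melt can replace the per-column pivots; a sketch under the same data assumptions:
m = df.melt(id_vars=['Year', 'State'], value_vars=['Var1', 'Var2'])
out = (m.dropna(subset=['value'])
        .groupby(['variable', 'Year'])['State'].nunique()
        .unstack('Year', fill_value=0))
print(out)
Here fill_value=0 covers variable/year combinations with no non-null rows at all (e.g. Var2 in 2019).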

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
[Picture of the head of the DataFrame]
I would like to create a pandas Series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average over every 7 consecutive instances in the column and save it into a Series.
[Picture of how the result should look]
As I am a complete newbie to Python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform with a grouper array created by floor-dividing numpy.arange by 7; this is a general solution that also works with any index (e.g. with a DatetimeIndex):
import numpy as np
import pandas as pd

np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
                   'ClosingPrice': np.random.randint(4, size=20)})

# Every 7 consecutive rows share a group label: 0,0,...,0,1,1,...
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print(df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
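If Date is set as a DatetimeIndex, resample is an alternative worth knowing, though it bins by calendar days rather than by row count, so it only matches the every-7-rows grouping when there is exactly one row per day:
# Calendar-based weekly means: 7-day bins anchored at the earliest date
weekly = (df.set_index('Date')['ClosingPrice']
            .sort_index()
            .resample('7D').mean())
print(weekly)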

Expand panel data python

I am trying to expand the following data. I am a Stata user, and my problem can be fixed by the command "fillin" in Stata; now I am trying to reproduce that command in Python and couldn't find anything that works.
For example, transform this data frame
(my dataframe is bigger than the example given; the example is just to illustrate what I want to do):
id year X Y
1 2008 10 20
1 2010 15 25
2 2011 2 4
2 2012 3 6
to this one
id year X Y
1 2008 10 20
1 2009 . .
1 2010 15 25
1 2011 . .
1 2012 . .
2 2008 . .
2 2009 . .
2 2010 . .
2 2011 2 4
2 2012 3 6
Thank you, and sorry for my English.
This can be done by building all id/year combinations with itertools.product and then reindexing (in modern pandas, .loc with labels missing from the index raises a KeyError, so use reindex):
from itertools import product
import pandas as pd

df = pd.DataFrame([[1,2008,10,20],[1,2010,15,25],[2,2011,2,4],[2,2012,3,6]],
                  columns=['id','year','X','Y'])
df = df.set_index(['id','year'])
# All combinations of id and year; the ranges are hardcoded for this example
# (df.index.levels would miss years absent from the data, such as 2009)
idx = list(product(range(1,3), range(2008,2013)))
df.reindex(idx)
Create a new MultiIndex from the dataframe and then reindex:
import numpy as np
import pandas as pd

# Repeat the full year range once per id (2 ids here), and each id once per year
years = np.tile(np.arange(df.year.min(), df.year.max() + 1), 2)
ids = np.repeat(df.id.unique(), df.year.max() - df.year.min() + 1)
new_idx = pd.MultiIndex.from_tuples(list(zip(ids, years)), names=['id', 'year'])
df = df.set_index(['id', 'year'])
df.reindex(new_idx).reset_index()
id year X Y
0 1 2008 10.0 20.0
1 1 2009 NaN NaN
2 1 2010 15.0 25.0
3 1 2011 NaN NaN
4 1 2012 NaN NaN
5 2 2008 NaN NaN
6 2 2009 NaN NaN
7 2 2010 NaN NaN
8 2 2011 2.0 4.0
9 2 2012 3.0 6.0
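For what it's worth, pd.MultiIndex.from_product builds the same full grid more directly and is probably the closest one-liner to Stata's fillin; a sketch assuming the original flat df from above:
# Cartesian product of every observed id with every year from min to max
full = pd.MultiIndex.from_product(
    [df['id'].unique(), range(df['year'].min(), df['year'].max() + 1)],
    names=['id', 'year'])
df.set_index(['id', 'year']).reindex(full).reset_index()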

pandas groupby mean with nan

I have the following dataframe:
date id cars
2012 1 4
2013 1 6
2014 1 NaN
2012 2 10
2013 2 20
2014 2 NaN
Now, I want to get the mean of cars over the years for each id ignoring the NaN's. The result should be like this:
date id cars result
2012 1 4 5
2013 1 6 5
2014 1 NaN 5
2012 2 10 15
2013 2 20 15
2014 2 NaN 15
I have the following command:
df["result"]=df.groupby("id")["cars"].mean()
The command runs without errors, but the result column only has NaN's.
What did I do wrong?
Use transform; this returns a Series the same size as the original:
df["result"]=df.groupby("id")["cars"].transform('mean')
print (df)
date id cars result
0 2012 1 4.0 5.0
1 2013 1 6.0 5.0
2 2014 1 NaN 5.0
3 2012 2 10.0 15.0
4 2013 2 20.0 15.0
5 2014 2 NaN 15.0
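As a side note on why the original assignment produced NaNs: without transform, groupby(...).mean() returns a Series indexed by id, and column assignment aligns on index labels, so values only land on rows whose positional label happens to equal an id; everything else becomes NaN. A sketch of the mismatch:
means = df.groupby("id")["cars"].mean()
print(means)
# id
# 1     5.0
# 2    15.0
# Name: cars, dtype: float64
# Assigning this to df["result"] aligns the id labels (1, 2) against
# df's RangeIndex (0..5), so most rows find no match and become NaN.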
Hello, good old 2017 question. This is just another way, with a lot of overhead. You write that you got only NaN values as the mean with df["result"]=df.groupby("id")["cars"].mean(). Note that mean() already skips NaNs by default, so the transform above handles them for you. Still, for anyone who wants the per-id means with the NaNs dropped explicitly:
import numpy as np

np.seterr(divide='ignore', invalid='ignore')  # silence warnings for empty groups
df.groupby(['id']).apply(lambda x: np.average(x['cars'].dropna()))
After this, join the result back on the id column. I won't show that step, since this approach carries a lot of overhead for the question at hand and shouldn't be put to work; it is only here for those searching for a way to get the means without NaNs in the first place.

Update dataframe with hierarchical index

I have a data series that looks like this
Component Date Sev Counts
PS 2009 3 4
4 1
2010 1 2
3 2
4 1
2011 2 3
3 5
4 1
2012 1 1
2 5
3 7
2013 2 4
3 9
2014 1 2
2 3
3 4
2015 1 2
2 100
3 31
4 31
2016 1 44
2 27
3 45
Name: Alarm Name, dtype: int64
And I have a vector that gives a certain quantity per year:
Number
Date
2009-12-31 8.0
2010-12-31 3.0
2011-12-31 13.0
2012-12-31 2.0
2013-12-31 3.0
2014-12-31 4.0
2015-12-31 6.0
2016-12-31 71.0
I want to divide the Counts in the series by my vector (Counts / Number). I also want to get back my original dataframe with the updated numbers.
This is my code
count=0
for i in df3.index.year:
df2.ix['PS'].ix[i].apply(lambda x: x /float(df3.iloc[count]))
count = count + 1
But my dataframe df2 has not changed. Any hints would be appreciated. Thanks.
I think you need to divide by the Number column using div, but first convert the index of df to years:
df.index = df.index.year
s = s.div(df.Number, level=1)
print (s)
Component Date Sev Counts
PS 2009 3 0.500000
4 0.125000
2010 1 0.666667
3 0.666667
4 0.333333
2011 2 0.230769
3 0.384615
4 0.076923
2012 1 0.500000
2 2.500000
3 3.500000
2013 2 1.333333
3 3.000000
2014 1 0.500000
2 0.750000
3 1.000000
2015 1 0.333333
2 16.666667
3 5.166667
4 5.166667
2016 1 0.619718
2 0.380282
3 0.633803
dtype: float64
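For completeness, the original loop fails for two reasons: .ix was removed from pandas (in 1.0), and apply returns a new object that is never assigned back, so df2 is left untouched. A minimal, self-contained sketch of the level-aligned division, with values abbreviated from the example above:
import pandas as pd

# Counts with a three-level index; Date sits at level 1
s = pd.Series(
    [4, 1, 2, 2, 1],
    index=pd.MultiIndex.from_tuples(
        [('PS', 2009, 3), ('PS', 2009, 4),
         ('PS', 2010, 1), ('PS', 2010, 3), ('PS', 2010, 4)],
        names=['Component', 'Date', 'Sev']))

# Per-year divisor, indexed by year
number = pd.Series([8.0, 3.0], index=pd.Index([2009, 2010], name='Date'))

# level=1 aligns the divisor against the Date level of s's index
print(s.div(number, level=1))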
