Update dataframe with hierarchical index - python

I have a data series that looks like this
Component  Date  Sev  Counts
PS         2009  3         4
                 4         1
           2010  1         2
                 3         2
                 4         1
           2011  2         3
                 3         5
                 4         1
           2012  1         1
                 2         5
                 3         7
           2013  2         4
                 3         9
           2014  1         2
                 2         3
                 3         4
           2015  1         2
                 2       100
                 3        31
                 4        31
           2016  1        44
                 2        27
                 3        45
Name: Alarm Name, dtype: int64
And I have a vector that gives a certain quantity per year
Number
Date
2009-12-31 8.0
2010-12-31 3.0
2011-12-31 13.0
2012-12-31 2.0
2013-12-31 3.0
2014-12-31 4.0
2015-12-31 6.0
2016-12-31 71.0
I want to divide my counts in the series by my vector, i.e. Counts / Number. I also want to obtain my original dataframe with the updated numbers.
This is my code
count = 0
for i in df3.index.year:
    df2.ix['PS'].ix[i].apply(lambda x: x / float(df3.iloc[count]))
    count = count + 1
But my dataframe df2 has not changed. Any hints would be appreciated. Thanks.

I think you need to divide by the column Number using div, but first convert the index of df to years:
df.index = df.index.year
s = s.div(df.Number, level=1)
print (s)
Component  Date  Sev      Counts
PS         2009  3      0.500000
                 4      0.125000
           2010  1      0.666667
                 3      0.666667
                 4      0.333333
           2011  2      0.230769
                 3      0.384615
                 4      0.076923
           2012  1      0.500000
                 2      2.500000
                 3      3.500000
           2013  2      1.333333
                 3      3.000000
           2014  1      0.500000
                 2      0.750000
                 3      1.000000
           2015  1      0.333333
                 2     16.666667
                 3      5.166667
                 4      5.166667
           2016  1      0.619718
                 2      0.380282
                 3      0.633803
dtype: float64
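
To make this reproducible end to end, here is a minimal sketch with a truncated version of the data shown above (the variable names s and df and the small three-row index are assumptions based on the displays in the question):

import pandas as pd

# Hypothetical reconstruction of the first few rows of the counts series
idx = pd.MultiIndex.from_tuples(
    [('PS', 2009, 3), ('PS', 2009, 4), ('PS', 2010, 1)],
    names=['Component', 'Date', 'Sev'])
s = pd.Series([4, 1, 2], index=idx, name='Alarm Name')

# Yearly divisor with a year-end DatetimeIndex, as in the question
df = pd.DataFrame({'Number': [8.0, 3.0]},
                  index=pd.to_datetime(['2009-12-31', '2010-12-31']))
df.index.name = 'Date'

# Convert the DatetimeIndex to plain years so it lines up with the 'Date' level
df.index = df.index.year

# Divide, aligning on level 1 ('Date') of the MultiIndex
result = s.div(df['Number'], level=1)
print(result)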

Related

Panel Data - dealing with missing year when creating lead and lag variables

I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general look of panel data is as follows:
df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001,2002,2004,2005] + [2000,2002,2003] + [2001,2002,2003,2005],
                   'val1': [1,2,3,4,5,6,7,8,9,10,11],
                   'val2': [2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down by 1 row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years by:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then use the normal shift. However, my data is quite large, and this approach often triples the data size, which is not very efficient. I am looking for a better approach for this scenario.
One solution is to create check columns that flag whether the year is consecutive for the lag and the lead. Set the check column to 1.0 or np.nan, then multiply it with your normal groupby shift:
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
You can compare this against the merge method from the other answer to see which is more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
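
Put together, a runnable sketch of this indicator-column approach might look like the following (np.nan is used in place of None, which behaves the same in a float column):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001,2002,2004,2005] + [2000,2002,2003] + [2001,2002,2003,2005],
                   'val1': [1,2,3,4,5,6,7,8,9,10,11],
                   'val2': [2,5,7,11,13,17,19,23,29,31,37]})

# 1.0 where the previous/next row within the group is exactly one year away, NaN otherwise
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1)) * 1.0
df.loc[df['yearlag'] == 0.0, 'yearlag'] = np.nan
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1)) * 1.0
df.loc[df['yearlead'] == 0.0, 'yearlead'] = np.nan

# Multiply the plain shifted values by the indicators so non-consecutive years become NaN
df['val1_lag'] = df.groupby('name')['val1'].shift(1) * df['yearlag']
df['val1_lead'] = df.groupby('name')['val1'].shift(-1) * df['yearlead']
print(df[['name', 'year', 'val1', 'val1_lag', 'val1_lead']])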
Don't use shift; instead merge on the year ± 1. Adding 1 to year in the right-hand frame aligns each row with the previous year's value, and subtracting 1 aligns it with the next year's value:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN

Balancing a panel data for regression

I have a dataframe:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'] ,"year": [2008, 2009, 2010, 2008, 2010,2009], "value": [10,20,30,10,20,30]})
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30
I want to create balanced data such that:
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2009 NaN
5 2 def10 2010 20
6 3 ghk 2008 NaN
7 3 ghk 2009 30
8 3 ghk 2010 NaN
If I use the following code:
df = df.set_index('id')
balanced = (df.set_index('year', append=True)
              .reindex(pd.MultiIndex.from_product([df.index, range(df.year.min(), df.year.max()+1)],
                                                  names=['frs_id', 'year']))
              .reset_index(level=1))
This gives me following error:
cannot handle a non-unique multi-index!
You are close to the solution. You can amend your code slightly as follows:
idx = pd.MultiIndex.from_product([df['id'].unique(),range(df.year.min(),df.year.max()+1)],names=['id','year'])
df2 = df.set_index(['id', 'year']).reindex(idx).reset_index()
df2['city'] = df2.groupby('id')['city'].ffill().bfill()
Changes to your code:
Create the MultiIndex using the unique values of id instead of the index
Set the index on both id and year before reindex()
Fill in the NaN values of the city column using the non-NaN entries of the same id
Result:
print(df2)
id year city value
0 1 2008 abc 10.0
1 1 2009 abc 20.0
2 1 2010 abc 30.0
3 2 2008 def10 10.0
4 2 2009 def10 NaN
5 2 2010 def10 20.0
6 3 2008 ghk NaN
7 3 2009 ghk 30.0
8 3 2010 ghk NaN
Optionally, you can re-arrange the column sequence, if you like:
df2.insert(2, 'year', df2.pop('year'))
print(df2)
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit
You can also do it using stack() and unstack() without using reindex(), as follows:
(df.set_index(['id', 'city', 'year'], append=True)
   .unstack()
   .groupby(level=[1, 2]).max()
   .stack(dropna=False)
).reset_index()
Output:
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Pivot the table and stack year without dropping NaN values:
>>> df.pivot(["id", "city"], "year", "value") \
.stack(dropna=False) \
.rename("value") \
.reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit: case of duplicate entries
I slightly modified your original dataframe:
df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})
>>> df
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30 # The problem is here
6 3 ghk 2009 40 # same (id, city, year)
You need to make a decision: do you want to keep row 5 or row 6, or apply an aggregation function (mean, sum, ...)? Imagine you want the mean for (3, ghk, 2009):
>>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
        .stack(dropna=False) \
        .rename("value") \
        .reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 35.0 # <- mean of (30, 40)
8 3 ghk 2010 NaN

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
Picture of head of Dataframe
I would like to create a pandas series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average of every 7 consecutive instances in the column and save it into a series.
Picture of how result should look like
As I am a complete newbie to Python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform with integer division of a numpy array created by numpy.arange, for a general solution that also works with any index (e.g. a DatetimeIndex):
import numpy as np
import pandas as pd

np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
                   'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
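
As an aside not in the original answer: if the Date column is set as the index, a resample-based sketch can compute a similar average over 7-day windows (note this bins by calendar dates rather than by exactly 7 rows, so the two approaches only coincide for gap-free daily data):

# Group the prices into consecutive 7-day bins and average each bin
weekly = (df.set_index('Date')
            .sort_index()['ClosingPrice']
            .resample('7D')
            .mean())
print(weekly)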

Read values from multiple rows and combine them in another row in pandas dataframe

I have the following dataframe:
item_id bytes value_id value
1 0 2.0 year 2017
2 0 1.0 month 04
3 0 1.0 day 12
4 0 1.0 time 07
5 0 1.0 minute 13
6 1 2.0 year 2017
7 1 1.0 month 12
8 1 1.0 day 19
9 1 1.0 time 09
10 1 1.0 minute 32
11 2 2.0 year 2017
12 2 1.0 month 04
13 2 1.0 day 17
14 2 1.0 time 14
15 2 1.0 minute 24
I want to be able to calculate the time for each item_id. How do I use group by here or anything else to achieve the following?
item_id time
0 2017/04/12 07:13
1 2017/12/19 09:32
2 2017/04/17 14:24
Use pivot + to_datetime
pd.to_datetime(
    df.drop('bytes', 1)
      .pivot('item_id', 'value_id', 'value')
      .rename(columns={'time': 'hour'})
).reset_index(name='time')
item_id time
0 0 2017-04-12 07:13:00
1 1 2017-12-19 09:32:00
2 2 2017-04-17 14:24:00
You can drop the bytes column before pivoting, it doesn't seem like you need it.
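
For a self-contained run, a sketch reconstructing the question's frame could look like this (the construction is an assumption based on the table above, and drop/pivot are called with keyword arguments, which newer pandas versions require):

import pandas as pd

# Hypothetical reconstruction of the question's data
df = pd.DataFrame({
    'item_id': [0]*5 + [1]*5 + [2]*5,
    'bytes': [2.0, 1.0, 1.0, 1.0, 1.0] * 3,
    'value_id': ['year', 'month', 'day', 'time', 'minute'] * 3,
    'value': ['2017', '04', '12', '07', '13',
              '2017', '12', '19', '09', '32',
              '2017', '04', '17', '14', '24'],
})

# Reshape to one row per item_id with year/month/day/hour/minute columns
wide = (df.drop(columns='bytes')
          .pivot(index='item_id', columns='value_id', values='value')
          .rename(columns={'time': 'hour'}))

# pd.to_datetime assembles a datetime from year/month/day/hour/minute columns
result = pd.to_datetime(wide).reset_index(name='time')
print(result)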
set_index + unstack also works; pd.to_datetime can be passed a dataframe, you only need to name your columns correctly:
pd.to_datetime(df.set_index(['item_id','value_id']).value.unstack().rename(columns={'time' :'hour'}))
Out[537]:
item_id
0 2017-04-12 07:13:00
1 2017-12-19 09:32:00
2 2017-04-17 14:24:00
dtype: datetime64[ns]

Panel data: mean, groupby and with a condition

I want to calculate, first, the mean of jobs whenever entry == 1 and, second, the mean of jobs by year_of_life.
id year entry cohort jobs year_of_life
1 2009 0 NaN 10 NaN
1 2012 1 2012 12 0
1 2013 0 2012 12 1
1 2014 0 2012 13 2
2 2010 1 2010 2 0
2 2011 0 2010 3 1
2 2012 0 2010 3 2
3 2007 0 NaN 4 Nan
3 2008 0 NaN 4 Nan
3 2012 1 2012 5 0
3 2013 0 2012 5 1
Thank you very much
Addressing your first requirement -
df.query('entry == 1').jobs.mean()
6.333333333333333
Addressing your second requirement - here, I exclude jobs where entry is 1 (they are masked out before taking the mean).
df.assign(jobs=df.jobs.mask(df.entry == 1)).groupby('year_of_life').jobs.mean()
year_of_life
0 NaN
1 6.666667
2 8.000000
Nan 4.000000
Name: jobs, dtype: float64
If you just want mean by year_of_life, a simple groupby will suffice.
df.groupby('year_of_life').jobs.mean()
year_of_life
0 6.333333
1 6.666667
2 8.000000
Nan 4.000000
Name: jobs, dtype: float64
Note that this is different from what the other answer is suggesting, which I think isn't what you're looking for:
df.query('entry == 1').groupby('year_of_life').jobs.mean()
year_of_life
0 6.333333
Name: jobs, dtype: float64
For the first, you can use boolean indexing to filter the dataframe for rows where the condition is True and then take the mean: df[df.entry == 1].mean(). For the second, group by year_of_life and then take the mean of each group: df.groupby('year_of_life').mean(). If you want both conditions to be satisfied, filter first and then group: df[df.entry == 1].groupby('year_of_life').mean().
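
A short sketch of those three expressions, restricted to the jobs column (assuming the question's table is loaded as df):

# Mean of jobs over rows where entry == 1 (boolean indexing)
mean_entry = df[df['entry'] == 1]['jobs'].mean()

# Mean of jobs per year_of_life
mean_by_year = df.groupby('year_of_life')['jobs'].mean()

# Both conditions combined: filter on entry == 1, then group by year_of_life
mean_both = df[df['entry'] == 1].groupby('year_of_life')['jobs'].mean()

print(mean_entry, mean_by_year, mean_both, sep='\n')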
