I have a Pandas dataframe, df. Here are the first five rows:
Id StartDate EndDate
0 0 2015-08-11 2018-07-13
1 1 2014-02-15 2016-01-25
2 2 2014-12-20 NaT
3 3 2015-01-09 2015-01-14
4 4 2014-07-20 NaT
I want to construct a new dataframe, df2. df2 should have a row for each month between StartDate and EndDate, inclusive, for each Id in df. For example, since the first row of df has StartDate in August 2015 and EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df has no EndDate, we will take it to be June 2019.
I would like df2 to use a multiindex with the first level being the corresponding Id in df, the second level being the year, and the third level being the month. For example, if the above five rows were all of df, then df2 should look like:
Id Year Month
0 2015 8
9
10
11
12
2016 1
2
3
4
5
6
7
8
9
10
11
12
2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
... ... ...
4 2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
2
3
4
5
6
7
8
9
10
11
12
2019 1
2
3
4
5
6
The following code does the trick, but takes about 20 seconds on my decent laptop for 10k Ids. Can I be more efficient somehow?
import numpy as np
import pandas as pd

def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
    # Given id_ and start/end dates,
    # returns a 2d array to be converted to a multiindex.
    # Each row of the returned array represents a month/year
    # between enroll date and cancel date inclusive.
    year = enroll_year
    month = enroll_month
    multiindex_array = [[], [], []]
    while (month != cancel_month) or (year != cancel_year):
        multiindex_array[0].append(id_)
        multiindex_array[1].append(year)
        multiindex_array[2].append(month)
        month += 1
        if month == 13:
            month = 1
            year += 1
    multiindex_array[0].append(id_)
    multiindex_array[1].append(year)
    multiindex_array[2].append(month)
    return np.array(multiindex_array)

# Begin by constructing the array for the first id.
array_for_multiindex = build_multiindex_for_id_(0, 8, 2015, 7, 2018)

# Append the rest of the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
    current_id_array = build_multiindex_for_id_(
        row['Id'],
        row['StartDate'].month,
        row['StartDate'].year,
        row['EndDate'].month,
        row['EndDate'].year)
    array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)

df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id', 'Year', 'Month'])
df2 = pd.DataFrame(index=df2_index)
Here's my approach after some trial and error:
(df.melt(id_vars='Id')
   .fillna(pd.to_datetime('June 2019'))
   .set_index('value')
   .groupby('Id').apply(lambda x: x.asfreq('M').ffill())
   .reset_index('value')
   .assign(year=lambda x: x['value'].dt.year,
           month=lambda x: x['value'].dt.month)
   .set_index(['year', 'month'], append=True)
)
Output:
value Id variable
Id year month
0 2015 8 2015-08-31 NaN NaN
9 2015-09-30 NaN NaN
10 2015-10-31 NaN NaN
11 2015-11-30 NaN NaN
12 2015-12-31 NaN NaN
2016 1 2016-01-31 NaN NaN
2 2016-02-29 NaN NaN
3 2016-03-31 NaN NaN
4 2016-04-30 NaN NaN
5 2016-05-31 NaN NaN
6 2016-06-30 NaN NaN
7 2016-07-31 NaN NaN
8 2016-08-31 NaN NaN
9 2016-09-30 NaN NaN
10 2016-10-31 NaN NaN
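If that's still slow, a vectorized alternative is to build one month range per row and explode it, which avoids both iterrows and the repeated np.append copies (the copies are what make the original approach quadratic). A minimal sketch, assuming pandas >= 0.25 for DataFrame.explode, with the June 2019 fallback taken from the question:
import pandas as pd

end = df['EndDate'].fillna(pd.Timestamp('2019-06-30'))
# One PeriodIndex of months per row, StartDate through EndDate inclusive.
periods = [pd.period_range(s, e, freq='M')
           for s, e in zip(df['StartDate'], end)]
long = df.assign(Period=periods).explode('Period')
pidx = pd.PeriodIndex(long['Period'])
df2 = pd.DataFrame(index=pd.MultiIndex.from_arrays(
    [long['Id'], pidx.year, pidx.month],
    names=['Id', 'Year', 'Month']))
The list comprehension is still a per-row loop, but each iteration is cheap and nothing gets copied over and over.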
I have data which looks like this:
month day
1 1 NaN
2 NaN
3 39.529999
4 40.570000
5 40.099998
...
12 27 NaN
28 NaN
29 NaN
30 NaN
31 39.049999
df55.iloc[df55.index.get_level_values('month') == 3]
month day
3 1 37.099998
2 38.060001
3 37.939999
4 37.230000
5 NaN
6 NaN
7 35.869999
8 35.660000
9 36.970001
10 36.660000
11 36.400002
12 NaN
13 NaN
14 36.860001
15 37.380001
16 38.430000
17 38.910000
18 39.000000
19 NaN
20 NaN
21 38.810001
22 39.439999
23 38.709999
24 39.020000
25 39.520000
26 NaN
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
I want to interpolate() the missing data, but only from month 1 day 1 up to today (month 3, day 26), and leave all the other NaNs as they are. Could you please advise how I can interpolate() only the data in that range?
Your idea to use iloc is good, but you can use dayofyear to slice your dataframe, since I assume your dataframe is ordered by month and day.
today = pd.to_datetime('today')
df.iloc[:today.dayofyear] = df.iloc[:today.dayofyear].interpolate()
It seems easiest to temporarily reset the index so you can use a query:
today = pd.to_datetime('today')
idx = df.reset_index().query('month in [1, 2] or (month == @today.month and day < @today.day)').index.max()
df.iloc[:idx + 1] = df.iloc[:idx + 1].interpolate()
Now the NaNs from 1-1 (inclusive) to 3-25 (inclusive) will be interpolated, and everything after is left as is.
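For reference, a self-contained sketch of the query-based cutoff on dummy data shaped like the question's (month, day) index, with the hard-coded month in [1,2] generalized to month < @today.month:
import numpy as np
import pandas as pd

# Dummy (month, day) frame; the real data has one value column.
idx = pd.MultiIndex.from_product([range(1, 13), range(1, 32)],
                                 names=['month', 'day'])
df = pd.DataFrame({'price': np.linspace(35.0, 40.0, len(idx))}, index=idx)
df.iloc[::7] = np.nan  # scatter some gaps throughout the year

today = pd.to_datetime('today')
pos = (df.reset_index()
         .query('month < @today.month or (month == @today.month and day < @today.day)')
         .index.max())
# Interpolate only up to (and including) yesterday's row; NaNs after it stay NaN.
df.iloc[:pos + 1] = df.iloc[:pos + 1].interpolate()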
I have a dataframe (snippet below) with an index column in the format YYYYMM and several columns of values, including one called "month", into which I've extracted the MM part of the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method:
# get year from the YYYYMM index column
df['year'] = (df['index'].astype(int) / 100).astype(int)

def get_stavg(df, year, month):
    # mean of 'st' over the five years before `year`, for the same month
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()

df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas-only solution, you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data and only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition for the first 4 years the result is NaN.
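Note that rolling(5) includes the current year's value, while the question asks for the average of the five previous years for that month. A hedged tweak (the column name n_mean_prev is just illustrative) is to shift within each month group before rolling:
# Shift within each month group so the 5-wide window covers the
# five prior same-month values, excluding the current year.
df["n_mean_prev"] = (df.groupby(df["date"].dt.month)["n"]
                       .transform(lambda s: s.shift().rolling(5).mean()))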
Update
For your particular case
import pandas as pd

index = [f"{y}01" for y in range(2010, 2020)] + \
        [f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index": index})
df["st"] = df.index + 1

# dates/index should be sorted
df = df.sort_values("index").reset_index(drop=True)

# extract month
df["month"] = df["index"].str[-2:]

df["st_mean"] = df.groupby("month")["st"]\
                  .rolling(5).mean()\
                  .reset_index(0, drop=True)
I'd like to produce a summary dataframe after grouping by date. I want a column that shows the mean of a given column as is, and the mean of that same column after filtering for instances that are greater than 0. I figured out how to do this (below), but it requires two separate groupby calls, renaming the columns, and then joining them back together. I feel like one should be able to do this all in one call. I tried to use eval for this, but kept getting an error telling me that I couldn't use eval on a groupby object and should use apply instead.
Code which gets me what I want but doesn't seem very efficient:
# Sample data
data = pd.DataFrame(
{"year" : [2013, 2013, 2013, 2014, 2014, 2014],
"month" : [1, 2, 3, 1, 2, 3],
"day": [1, 1, 1, 1, 1, 1],
"delay": [0, -4, 50, -60, 9, 10]})
subset = (data
.groupby(['year', 'month', 'day'])['delay']
.mean()
.reset_index()
.rename(columns = {'delay' : 'avg_delay'})
)
subset_1 = (data[data.delay > 0]
.groupby(['year', 'month', 'day'])['delay']
.mean()
.reset_index()
.rename(columns = {'delay' : 'avg_delay_pos'})
)
combined = pd.merge(subset, subset_1, how='left', on=['year', 'month', 'day'])
combined
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0
IIUC, you could use the following code:
>>> data['avg_delay'] = data.pop('delay')
>>> data['avg_delay_pos'] = data.loc[data['avg_delay'].gt(0), 'avg_delay']
>>> data
day month year avg_delay avg_delay_pos
0 1 1 2013 0 NaN
1 1 2 2013 -4 NaN
2 1 3 2013 50 50.0
3 1 1 2014 -60 NaN
4 1 2 2014 9 9.0
5 1 3 2014 10 10.0
>>>
Explanation:
I first pop the delay column and assign it to the new name avg_delay, which effectively renames delay to avg_delay.
Then I create a new column called avg_delay_pos, using loc to select the values greater than zero. Since the index is preserved, the rows with positive values receive the corresponding avg_delay value, while the others get no assignment and are therefore NaN, as you expected.
The solution is specific to your problem, but you can do this using a single groupby call. To get "avg_delay_pos", you just have to remove negative (and zero) values.
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
(df.filter(like='delay')
.groupby(pd.to_datetime(df[['year', 'month', 'day']]))
.mean()
.add_prefix('avg_'))
avg_delay avg_delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Breakdown
where is used to mask values that are not positive.
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
# df['delay'].where(df['delay'] > 0)
0 NaN
1 NaN
2 50.0
3 NaN
4 9.0
5 10.0
Name: delay, dtype: float64
Next, extract the delay columns we want to group on,
df.filter(like='delay')
delay delay_pos
0 0 NaN
1 -4 NaN
2 50 50.0
3 -60 NaN
4 9 9.0
5 10 10.0
Then perform a groupby on the date,
_.groupby(pd.to_datetime(df[['year', 'month', 'day']])).mean()
delay delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Here pd.to_datetime is used to convert the year/month/day columns into a single datetime column; it's more efficient to group on a single column than on multiple ones.
pd.to_datetime(df[['year', 'month', 'day']])
0 2013-01-01
1 2013-02-01
2 2013-03-01
3 2014-01-01
4 2014-02-01
5 2014-03-01
dtype: datetime64[ns]
The final .add_prefix('avg_') adds the prefix "avg_" to the result.
An alternative way to do this if you want separate year/month/day columns would be
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
df.groupby(['year', 'month', 'day']).mean().add_prefix('avg_').reset_index()
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0
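If you want to skip the intermediate column assignment entirely, a hedged one-chain alternative (assuming pandas >= 0.25 for named aggregation; out is just an illustrative name):
out = (data.assign(delay_pos=data['delay'].where(data['delay'] > 0))
           .groupby(['year', 'month', 'day'])
           .agg(avg_delay=('delay', 'mean'),
                avg_delay_pos=('delay_pos', 'mean'))
           .reset_index())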
I have the output from a pivot table in a dataframe (df) that looks like:
Year Month sum
2005 10 -1.596817e+05
11 -2.521054e+05
12 5.981900e+05
2006 1 8.686413e+05
2 1.673673e+06
3 1.218341e+06
4 4.131970e+05
5 1.090499e+05
6 1.495985e+06
7 1.736795e+06
8 1.155071e+05
...
9 7.847369e+05
10 -5.564139e+04
11 -7.435682e+05
12 1.073361e+05
2017 1 3.427652e+05
2 3.574432e+05
3 5.026018e+04
Is there a way to reformat the dataframe so the output to console would look like:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
All the values would be populated in the new table as well.
Use unstack:
In [18]: df['sum'].unstack('Month')
Out[18]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN -159681.70 -252105.4 598190.0
2006.0 868641.3 1673673.0 1218341.00 413197.0 109049.9 1495985.0 1736795.0 115507.1 784736.9 -55641.39 -743568.2 107336.1
2017.0 342765.2 357443.2 50260.18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Try df.pivot(index='Year', columns='Month', values='sum').
To fill your empty Year cells (if any), use df.fillna(method='ffill') before the above.
Having read the answer above, I should mention that my suggestion works in cases where Year and Month aren't the index.
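Combining the two suggestions for the question's frame, where Year and Month form the index, a minimal sketch (out is just an illustrative name):
# Year/Month live in the index in the question, so reset them first.
out = (df.reset_index()
         .pivot(index='Year', columns='Month', values='sum'))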
I have a dataframe df1, and I have a list which contains names of several columns of df1.
df1:
User_id month day Age year CVI ZIP sex wgt
0 1 7 16 1977 2 NA M NaN
1 2 7 16 1977 3 NA M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
list= [u"User_id", u"day", u"ZIP", u"sex"]
I want to make a new dataframe df2 which will contain only those columns which are in the list, and a dataframe df3 which will contain the columns which are not in the list.
Here I found that I need to do:
df2=df1[df1[df1.columns[1]].isin(list)]
But as a result I get:
Empty DataFrame
Columns: []
Index: []
[0 rows x 9 columns]
What am I doing wrong, and how can I get the needed result? Why does it say "9 columns" if it is supposed to be 4?
Your code filters rows, not columns: df1[df1.columns[1]].isin(list) tests whether the values of the second column appear in the list, so you get an empty frame that still reports all 9 columns. To split the columns instead, here is a solution with Index.difference:
L = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[L]
df3 = df1[df1.columns.difference(df2.columns)]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Or:
df2 = df1[L]
df3 = df1[df1.columns.difference(pd.Index(L))]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Never name a list "list" (it shadows the built-in type):
my_list= [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[df1.keys()[df1.keys().isin(my_list)]]
or
df2 = df1[df1.columns[df1.columns.isin(my_list)]]
You can try:
df2 = df1[list]  # a projection onto the columns contained in the list
df3 = df1[[col for col in df1.columns if col not in list]]