Get range from sparse datetimeindex - python

I have this kind of pandas DataFrame for each user in a large database.
Each row is a period [start_date, end_date], but sometimes two consecutive rows are in fact the same period: the end_date is equal to the following row's start_date. Sometimes periods even overlap by more than one date.
I would like to get the "real periods" by combining rows which correspond to the same period.
What I have tried
def split_range(name):
    df_user = de_201512_echant[de_201512_echant.name == name]
    # -- Create a date_range covering [min_start_date, max_start_date]
    t_date = pd.DataFrame(index=pd.date_range("2005-01-01", "2015-12-12").date)
    for row in range(0, df_user.shape[0]):
        start_date = df_user.iloc[row].start_date
        end_date = df_user.iloc[row].end_date
        if not pd.isnull(start_date) and not pd.isnull(end_date):
            # -- Mark every day of this row's period with a 1
            t = pd.DataFrame(index=pd.date_range(start_date, end_date))
            t["period_%s" % row] = 1
            t_date = pd.merge(t_date, t, right_index=True, left_index=True, how="left")
    return t_date
which yields a DataFrame where each column is a period (1 if in the range, NaN if not):
t_date
Out[29]:
period_0 period_1 period_2 period_3 period_4 period_5 \
2005-01-01 NaN NaN NaN NaN NaN NaN
2005-01-02 NaN NaN NaN NaN NaN NaN
2005-01-03 NaN NaN NaN NaN NaN NaN
2005-01-04 NaN NaN NaN NaN NaN NaN
2005-01-05 NaN NaN NaN NaN NaN NaN
2005-01-06 NaN NaN NaN NaN NaN NaN
2005-01-07 NaN NaN NaN NaN NaN NaN
2005-01-08 NaN NaN NaN NaN NaN NaN
2005-01-09 NaN NaN NaN NaN NaN NaN
2005-01-10 NaN NaN NaN NaN NaN NaN
2005-01-11 NaN NaN NaN NaN NaN NaN
Then if I sum all the columns (periods), I get almost exactly what I want:
full_spell = t_date.sum(axis=1)
full_spell.loc[full_spell == 1]
Out[31]:
2005-11-14 1.0
2005-11-15 1.0
2005-11-16 1.0
2005-11-17 1.0
2005-11-18 1.0
2005-11-19 1.0
2005-11-20 1.0
2005-11-21 1.0
2005-11-22 1.0
2005-11-23 1.0
2005-11-24 1.0
2005-11-25 1.0
2005-11-26 1.0
2005-11-27 1.0
2005-11-28 1.0
2005-11-29 1.0
2005-11-30 1.0
2006-01-16 1.0
2006-01-17 1.0
2006-01-18 1.0
2006-01-19 1.0
2006-01-20 1.0
2006-01-21 1.0
2006-01-22 1.0
2006-01-23 1.0
2006-01-24 1.0
2006-01-25 1.0
2006-01-26 1.0
2006-01-27 1.0
2006-01-28 1.0
2015-07-06 1.0
2015-07-07 1.0
2015-07-08 1.0
2015-07-09 1.0
2015-07-10 1.0
2015-07-11 1.0
2015-07-12 1.0
2015-07-13 1.0
2015-07-14 1.0
2015-07-15 1.0
2015-07-16 1.0
2015-07-17 1.0
2015-07-18 1.0
2015-07-19 1.0
2015-08-02 1.0
2015-08-03 1.0
2015-08-04 1.0
2015-08-05 1.0
2015-08-06 1.0
2015-08-07 1.0
2015-08-08 1.0
2015-08-09 1.0
2015-08-10 1.0
2015-08-11 1.0
2015-08-12 1.0
2015-08-13 1.0
2015-08-14 1.0
2015-08-15 1.0
2015-08-16 1.0
2015-08-17 1.0
dtype: float64
But I could not find a way to slice this sparse datetime index into contiguous ranges to finally get my desired output: a dataframe containing the "real" periods.
It might not be the most efficient way to do this, so if you have alternatives, do not hesitate!
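One way to slice that sparse index directly, as a minimal sketch (assuming full_spell is the summed Series above, where days covered by any period have a value > 0): a new run starts wherever the gap to the previous date exceeds one day, and cumsum() turns those breaks into run labels.
spell_days = full_spell[full_spell > 0]
dates = pd.to_datetime(pd.Series(spell_days.index)).sort_values()
# -- True at each date that starts a new contiguous run
new_run = dates.diff() > pd.Timedelta(days=1)
# -- Label each run and take its first and last day as a "real" period
real_periods = dates.groupby(new_run.cumsum()).agg(start_date="min", end_date="max")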

I found a much more efficient way to do this by using apply:
def get_range(row):
    '''Returns a DataFrame containing the day range between a row's
    "start_date" and "end_date".'''
    start_date = row["start_date"]
    end_date = row["end_date"]
    period = pd.date_range(start_date, end_date, freq="1D")
    return pd.DataFrame(period, columns=["days_in_period"])
# -- Apply get_range() to each row of the initial df and stack the results
t_all = pd.concat(df.apply(get_range, axis=1).tolist(), ignore_index=True)
# -- Drop overlapping dates
t_all.drop_duplicates(inplace=True)
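As a hedged alternative sketch that avoids the row-wise apply entirely (assuming pandas >= 0.25 for DataFrame.explode and no null start/end dates; the days_in_period name is kept from above):
# -- Build each row's day range in one pass, then flatten to one row per day
df["days_in_period"] = [pd.date_range(s, e, freq="D")
                        for s, e in zip(df["start_date"], df["end_date"])]
t_all = df.explode("days_in_period")[["days_in_period"]].drop_duplicates()
The same gap-and-cumsum trick shown above can then turn the de-duplicated days back into [start_date, end_date] rows.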

Related

combine two pd df by index and column

The data looks like this:
df1 = 456089.0 456091.0 456093.0
5428709.0 1.0 1.0 NaN
5428711.0 1.0 1.0 NaN
5428713.0 NaN NaN 1.0
df2 = 456093.0 456095.0 456097.0
5428711.0 2.0 NaN NaN
5428713.0 NaN 2.0 NaN
5428715.0 NaN NaN 2.0
I would like to have this output:
df3 = 456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0
I tried several combinations with pd.merge, pd.join, pd.concat, but nothing worked the way I want, since I need to combine the data by both index and column.
Does anyone have an idea how to do this? Thanks in advance!
Let us try sum with concat:
out = pd.concat([df1,df2]).sum(axis=1,level=0,min_count=1).sum(axis=0,level=0,min_count=1)
Out[150]:
456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0
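On recent pandas versions DataFrame.sum(level=...) has been removed, so the same idea can be sketched with groupby instead (min_count=1 keeps all-NaN groups as NaN rather than 0):
out = pd.concat([df1, df2])
out = out.groupby(level=0).sum(min_count=1)        # combine duplicate row labels
out = out.T.groupby(level=0).sum(min_count=1).T    # combine duplicate column labels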

Convert two pandas rows into one

I want to convert the dataframe below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows for A and B should be merged into one row, as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID', columns='TYPE', values=['A', 'B'])
# -- Flatten the MultiIndex columns into "TYPE_value" names, e.g. "MISSING_A"
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df = df.reset_index()
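For reference, a minimal sketch reconstructing the question's input to test the pivot above:
import pandas as pd
df = pd.DataFrame({
    "ID":   [1, 2, 2, 3, 4, 5, 5],
    "TYPE": ["MISSING", "1T", "2T", "MISSING", "2T", "CBN", "DSV"],
    "A":    [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    "B":    [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})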

Calculating YoY growth for columns

I am trying to calculate the YoY change between columns. Let's say I have the df below:
datedf = pd.DataFrame({'ID':list('12345'),'1/1/2019':[1,2,3,4,5],'2/1/2019':[1,2,3,4,5],'3/1/2019':[1,2,3,4,5],'1/1/2020':[2,4,6,8,10],'2/1/2020':[2,4,6,8,10],'3/1/2020':[2,4,6,8,10]})
What transformation would I have to do to get the result below, showing a 100% YoY gain?
endingdf = pd.DataFrame({'ID':list('12345'),'1/1/2020':[1,1,1,1,1],'2/1/2020':[1,1,1,1,1],'3/1/2020':[1,1,1,1,1]})
This is the code I have tried, but it does not work. The real data I am working with spans multiple years.
just_dates = datedf.loc[:,'1/1/2019':]
just_dates.columns = pd.to_datetime(just_dates.columns)
just_dates.groupby(pd.Grouper(level=0,freq='M',axis=1),axis=1).pct_change()
Try this:
result = datedf.set_index('ID')
result.columns = pd.to_datetime(result.columns)
result = result.pct_change(periods=12, freq='MS', axis=1)
Result:
2019-01-01 2019-02-01 2019-03-01 2020-01-01 2020-02-01 2020-03-01
ID
1 NaN NaN NaN 1.0 1.0 1.0
2 NaN NaN NaN 1.0 1.0 1.0
3 NaN NaN NaN 1.0 1.0 1.0
4 NaN NaN NaN 1.0 1.0 1.0
5 NaN NaN NaN 1.0 1.0 1.0
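To drop the first year's all-NaN columns and move back toward the question's layout, one extra hedged step (the strftime format is an assumption about the desired headers):
result = result.dropna(axis=1, how='all')
result.columns = result.columns.strftime('%m/%d/%Y')
result = result.reset_index()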

Pandas: datetime indexed series to time indexed date columns dataframe

I have a datetime indexed series like this:
2018-08-27 17:45:01 1
2018-08-27 16:01:12 1
2018-08-27 13:48:47 1
2018-08-26 22:26:40 2
2018-08-26 20:10:42 1
2018-08-26 18:20:32 1
2018-08-25 23:07:51 1
2018-08-25 01:46:08 1
2018-09-18 14:08:23 1
2018-09-17 19:38:38 1
2018-09-15 22:40:45 1
What is an elegant way to reformat this into a time indexed dataframe whose columns are dates? For example:
2018-10-24 2018-06-28 2018-10-23
15:16:41 1.0 NaN NaN
15:18:16 1.0 NaN NaN
15:21:42 1.0 NaN NaN
23:35:00 NaN NaN 1.0
23:53:13 NaN 1.0 NaN
Current approach:
time_date_dict = defaultdict(partial(defaultdict, int))
for i in series.iteritems():
    datetime = i[0]
    value = i[1]
    time_date_dict[datetime.time()][datetime.date()] = value
time_date_df = pd.DataFrame.from_dict(time_date_dict, orient='index')
Use pivot:
df1 = pd.pivot(s.index.time, s.index.date, s)
#if want strings index and columns names
#df1 = pd.pivot(s.index.strftime('%H:%M:%S'), s.index.strftime('%Y-%m-%d'), s)
print (df1)
date 2018-08-25 2018-08-26 2018-08-27 2018-09-15 2018-09-17 \
date
01:46:08 1.0 NaN NaN NaN NaN
13:48:47 NaN NaN 1.0 NaN NaN
14:08:23 NaN NaN NaN NaN NaN
16:01:12 NaN NaN 1.0 NaN NaN
17:45:01 NaN NaN 1.0 NaN NaN
18:20:32 NaN 1.0 NaN NaN NaN
19:38:38 NaN NaN NaN NaN 1.0
20:10:42 NaN 1.0 NaN NaN NaN
22:26:40 NaN 2.0 NaN NaN NaN
22:40:45 NaN NaN NaN 1.0 NaN
23:07:51 1.0 NaN NaN NaN NaN
date 2018-09-18
date
01:46:08 NaN
13:48:47 NaN
14:08:23 1.0
16:01:12 NaN
17:45:01 NaN
18:20:32 NaN
19:38:38 NaN
20:10:42 NaN
22:26:40 NaN
22:40:45 NaN
23:07:51 NaN
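On current pandas, pd.pivot no longer accepts bare arrays positionally, so an equivalent hedged sketch goes through a helper DataFrame first (the value/time/date column names are mine):
df = s.rename('value').to_frame()
df['time'] = s.index.time
df['date'] = s.index.date
df1 = df.pivot(index='time', columns='date', values='value')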

Mean of a grouped-by pandas dataframe with flexible aggregation period

As in my earlier question, I need to calculate the mean of the columns duration and km for the rows with value == 1 and value == 0.
This time I would like the aggregation period to be flexible.
df
Out[20]:
Date duration km value
0 2015-03-28 09:07:00.800001 0 0 0
1 2015-03-28 09:36:01.819998 1 2 1
2 2015-03-30 09:36:06.839997 1 3 1
3 2015-03-30 09:37:27.659997 nan 5 0
4 2015-04-22 09:51:40.440003 3 7 0
5 2015-04-23 10:15:25.080002 0 nan 1
For an aggregation period of 1 day I can use the solution suggested before:
ndf = df.pivot_table(values=['duration','km'], columns=['value'],
                     index=df['Date'].dt.date, aggfunc='mean')
ndf.columns = [i[0]+str(i[1]) for i in ndf.columns]
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
However, I do not know how to change the aggregation period if, for example, I want to pass it as an argument of a function...
For this reason an approach with pd.Grouper(freq=freq_aggregation), where freq_aggregation is 'D' or '60s', would be preferred...
You can pass a Grouper as the index of the pivot table. Hope this is what you are looking for, i.e.
ndf = df.pivot_table(values=['duration','km'],columns=['value'],index=pd.Grouper(key='Date', freq='60s'),aggfunc='mean')
ndf.columns = [i[0]+str(i[1]) for i in ndf.columns]
Output:
duration0 duration1 km0 km1
Date
2015-03-28 09:07:00 0.0 NaN 0.0 NaN
2015-03-28 09:36:00 NaN 1.0 NaN 2.0
2015-03-30 09:36:00 NaN 1.0 NaN 3.0
2015-03-30 09:37:00 NaN NaN 5.0 NaN
2015-04-22 09:51:00 3.0 NaN 7.0 NaN
2015-04-23 10:15:00 NaN 0.0 NaN NaN
If the frequency is 'D', then:
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
Let's use pd.Grouper, unstack, and columns map:
freq_str = '60s'
df_out = df.groupby([pd.Grouper(freq=freq_str, key='Date'),'value'])[['duration','km']].agg('mean').unstack()
df_out.columns = df_out.columns.map('{0[0]}{0[1]}'.format)
df_out
Output:
duration0 duration1 km0 km1
Date
2015-03-28 09:07:00 0.0 NaN 0.0 NaN
2015-03-28 09:36:00 NaN 1.0 NaN 2.0
2015-03-30 09:36:00 NaN 1.0 NaN 3.0
2015-03-30 09:37:00 NaN NaN 5.0 NaN
2015-04-22 09:51:00 3.0 NaN 7.0 NaN
2015-04-23 10:15:00 NaN 0.0 NaN NaN
Now, let's change freq_str to 'D':
freq_str = 'D'
df_out = df.groupby([pd.Grouper(freq=freq_str, key='Date'),'value'])[['duration','km']].agg('mean').unstack()
df_out.columns = df_out.columns.map('{0[0]}{0[1]}'.format)
print(df_out)
Output:
duration0 duration1 km0 km1
Date
2015-03-28 0.0 1.0 0.0 2.0
2015-03-30 NaN 1.0 5.0 3.0
2015-04-22 3.0 NaN 7.0 NaN
2015-04-23 NaN 0.0 NaN NaN
Use groupby:
df = df.set_index('Date')
df.groupby([pd.TimeGrouper('D'), 'value']).mean()
duration km
Date value
2017-10-11 0 1.500000 4.0
1 0.666667 2.5
df.groupby([pd.TimeGrouper('60s'), 'value']).mean()
duration km
Date value
2017-10-11 09:07:00 0 0.0 0.0
2017-10-11 09:36:00 1 1.0 2.5
2017-10-11 09:37:00 0 NaN 5.0
2017-10-11 09:51:00 0 3.0 7.0
2017-10-11 10:15:00 1 0.0 NaN
If you want it unstacked, then unstack it:
df.groupby([pd.TimeGrouper('D'), 'value']).mean().unstack()
duration km
value 0 1 0 1
Date
2017-10-11 1.50 0.67 4.00 2.50
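Since the original ask was to pass the period as a function argument, here is a minimal wrapper sketch tying these answers together (pd.TimeGrouper is deprecated on recent pandas; pd.Grouper is its replacement):
def mean_by_period(df, freq):
    # -- freq can be 'D', '60s', or any other pandas offset alias
    out = (df.groupby([pd.Grouper(key='Date', freq=freq), 'value'])[['duration', 'km']]
             .mean()
             .unstack())
    out.columns = [col + str(val) for col, val in out.columns]
    return out
mean_by_period(df, 'D')
mean_by_period(df, '60s')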
