What I have:
A dataframe, df, consisting of 3 columns (Id, Item and Timestamp). Each subject has a unique Id with a recorded Item on a particular date and time (Timestamp). The second dataframe, df_ref, consists of date-time range references for slicing df: a Start and an End for each subject Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date-time range(s) given for each Id (groupby Id) in df_ref and concatenate the sliced data into a new dataframe. Note that a subject can have more than one date-time range (in this example, Id=3 has 2 date-time ranges).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code. I modified the code since it does not have the groupby element that I need.
My code:
import pandas as pd

df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M:%S')

x = pd.DataFrame()
for pid in df_ref.Id.unique():
    selection = df[(df['Id'] == pid) & (df['Timestamp'] >= df_ref['Start']) & (df['Timestamp'] <= df_ref['End'])]
    x = x.append(selection)
The above code gives this error:
ValueError: Can only compare identically-labeled Series objects
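The error occurs because df['Timestamp'] and df_ref['Start'] are Series with different indexes, so pandas cannot align them for element-wise comparison. A minimal fix for the loop is to iterate over the rows of df_ref so that Start and End become scalars; a sketch, where pd.concat replaces the deprecated DataFrame.append:

import pandas as pd

parts = []
for _, row in df_ref.iterrows():
    # row['Start'] and row['End'] are scalars, so the comparison broadcasts cleanly
    # (ideally convert df_ref['Start'] / df_ref['End'] with pd.to_datetime as well)
    mask = (df['Id'] == row['Id']) & df['Timestamp'].between(row['Start'], row['End'])
    parts.append(df[mask])
x = pd.concat(parts, ignore_index=True)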
A vectorized approach avoids the loop entirely. First use merge with the default inner join, which also creates all combinations for duplicated Id values. Then filter with Series.between, using DataFrame.loc to apply the condition and select the original df.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
Related
I have the following data frame in Pandas:
df = pd.DataFrame({
'ID': [1,2,1,1,2,3,1,3,3,3,2],
'date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28']
})
I want to produce a column, time since first occur, that is the time passed in days since each ID's first occurrence.
Here is what I did:
df['date'] = pd.to_datetime(df['date'])  # dates are ISO-formatted, so no dayfirst flag is needed
df.sort_values(by=['ID', 'date'], ascending=[True, False], inplace=True)
and I got the sorted data frame
ID date
6 1 2022-01-12
3 1 2021-11-28
0 1 2021-04-28
2 1 2011-03-01
1 2 2022-05-21
10 2 2009-05-28
4 2 1992-12-01
9 3 2022-01-01
7 3 2019-02-28
8 3 2001-03-28
5 3 1999-10-28
so the output should look like
ID date time since first occur
6 1 2022-01-12 3970
3 1 2021-11-28 3925
0 1 2021-04-28 3711
2 1 2011-03-01 0
1 2 2022-05-21 10763
10 2 2009-05-28 6022
4 2 1992-12-01 0
9 3 2022-01-01 8101
7 3 2019-02-28 7063
8 3 2001-03-28 517
5 3 1999-10-28 0
Thanks in advance for helping.
After sorting the dataframe, you can take the difference between each date and the minimal date in its group:
df['time since first occur'] = (df['date'] - df.groupby('ID')['date'].transform('min')).dt.days
print(df)
ID date time since first occur
6 1 2022-01-12 3970
3 1 2021-11-28 3925
0 1 2021-04-28 3711
2 1 2011-03-01 0
1 2 2022-05-21 10763
10 2 2009-05-28 6022
4 2 1992-12-01 0
9 3 2022-01-01 8101
7 3 2019-02-28 7063
8 3 2001-03-28 517
5 3 1999-10-28 0
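Note that transform('min') broadcasts each group's earliest date back onto every row, so the subtraction stays index-aligned and no prior sorting is actually required; a minimal self-contained sketch with hypothetical data:

import pandas as pd

demo = pd.DataFrame({
    'ID': [1, 1, 2],
    'date': pd.to_datetime(['2021-04-28', '2011-03-01', '2022-05-21']),
})
# transform('min') returns a Series aligned with demo's index
demo['time since first occur'] = (
    demo['date'] - demo.groupby('ID')['date'].transform('min')
).dt.days
print(demo)  # ID 1 rows: 3711 and 0; ID 2 row: 0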
I have a pandas dataframe with several columns, and for each row I would like to know the number of date columns whose value is after 2016-12-31. Here is an example:
   ID  Bill      Date 1      Date 2      Date 3      Date 4  Bill 2
0   4     6  2000-10-04  2000-11-05  1999-12-05  2001-05-04       8
1   6     8  2016-05-03  2017-08-09  2018-07-14  2015-09-12      17
2  12    14  2016-11-16  2017-05-04  2017-07-04  2018-07-04      35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
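If the Date columns are already datetime64 dtype, an alternative sketch selects them by dtype rather than by name (this assumes pd.to_datetime was applied to them beforehand):

import pandas as pd

# Select all datetime columns, compare against the cutoff, and count per row
date_cols = df.select_dtypes(include='datetime64[ns]')
df['Count'] = (date_cols > pd.Timestamp('2016-12-31')).sum(axis=1)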
I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-28
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs on each possible day, so basically count the number of overlapping time periods.
What I did to calculate this was creating a new data frame c_df with the columns date and count. The first column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment by one the corresponding count cell in c_df.
All these loops though are not very efficient for big data sets and look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough that performance is not a concern, create a date range for each row, then explode them and count how many times each date appears in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
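If a tidy two-column frame is preferred, the same chain can be extended; a sketch where rename_axis and reset_index merely label the pieces:

counts = (df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1)
            .explode()
            .value_counts()
            .sort_index()
            .rename_axis('Date')          # the index holds the dates
            .reset_index(name='Count'))   # lift it into a column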
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Work with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
'Date': dates,
'Count': mask.sum(axis=0)
})
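The mask here is a len(df) × len(dates) boolean matrix, so this approach trades memory for speed. A quick usage sketch, with a hypothetical lookup date:

# Number of active IDs on one particular day
print(result.loc[result['Date'] == pd.Timestamp('2017-01-20'), 'Count'])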
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval (note: I made a smaller sample to test this solution on).
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
I have a data frame available with date column like below.
df = pd.DataFrame({'Date':pd.date_range('2018-10-01', periods=14)})
I want to append week number column based on date, so it will look like
so the 2018-10-01 will be week 1 and after 7 days 2018-10-08 would be week 2 and so on.
Any help how can I perform this?
Use weekofyear with factorize, adding 1 so the group numbers start from 1:
df['Week'] = pd.factorize(df['Date'].dt.weekofyear)[0] + 1
print (df)
Date Week
0 2018-10-01 1
1 2018-10-02 1
2 2018-10-03 1
3 2018-10-04 1
4 2018-10-05 1
5 2018-10-06 1
6 2018-10-07 1
7 2018-10-08 2
8 2018-10-09 2
9 2018-10-10 2
10 2018-10-11 2
11 2018-10-12 2
12 2018-10-13 2
13 2018-10-14 2
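Note that weekofyear resets at year boundaries (and is deprecated in recent pandas in favor of Date.dt.isocalendar().week); a sketch that instead counts 7-day blocks from the earliest date avoids both issues:

# Week number as 7-day blocks anchored at the first date (assumes Date is datetime)
df['Week'] = (df['Date'] - df['Date'].min()).dt.days // 7 + 1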
I have this pandas dataframe with daily asset prices:
Picture of head of Dataframe
I would like to create a pandas series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average over every 7 consecutive rows in the column and save it into a series.
Picture of how the result should look
As I am a complete newbie to python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform with integer division of a numpy array created by numpy.arange, for a general solution that also works with any index (e.g. a DatetimeIndex):
np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
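If the frame carries a proper date column, a resampling sketch is an alternative, assuming exactly one row per calendar day; note that it bins by date rather than by row position, so the blocks are anchored at the earliest date instead of the latest:

# 7-day bins starting at the earliest date; mean closing price per bin
weekly = (df.sort_values('Date')
            .set_index('Date')['ClosingPrice']
            .resample('7D')
            .mean())
print(weekly)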