Merge 2 dataframes with same values in a column - python

I have 2 dataframes. One is in this form:
df1:
date revenue
0 2016-11-17 385.943800
1 2016-11-18 1074.160340
2 2016-11-19 2980.857860
3 2016-11-20 1919.723960
4 2016-11-21 884.279340
5 2016-11-22 869.071070
6 2016-11-23 760.289260
7 2016-11-24 2481.689270
8 2016-11-25 2745.990070
9 2016-11-26 2273.413250
10 2016-11-27 2630.414900
The other one is in this form:
df2:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 11 9 7 100 85 63
1 2016-11-18 9 6 3 93 83 66
2 2016-11-19 8 6 4 93 87 76
3 2016-11-20 10 7 4 93 84 81
4 2016-11-21 14 10 7 100 89 77
5 2016-11-22 13 10 7 93 79 63
6 2016-11-23 11 8 5 100 91 82
7 2016-11-24 9 7 4 93 80 66
8 2016-11-25 7 4 1 87 74 57
9 2016-11-26 7 3 -1 100 88 61
10 2016-11-27 10 7 4 100 81 66
Both dataframes have more rows than shown, and the number of rows increases every day.
I want to combine these two dataframes so that whenever the same date appears in df1['date'] and df2['CET'], an extra column is added to df2 holding the revenue value for that date. So I want to create this:
df2:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity revenue
0 2016-11-17 11 9 7 100 85 63 385.943800
1 2016-11-18 9 6 3 93 83 66 1074.160340
2 2016-11-19 8 6 4 93 87 76 2980.857860
3 2016-11-20 10 7 4 93 84 81 1919.723960
4 2016-11-21 14 10 7 100 89 77 884.279340
5 2016-11-22 13 10 7 93 79 63 869.071070
6 2016-11-23 11 8 5 100 91 82 760.289260
7 2016-11-24 9 7 4 93 80 66 2481.689270
8 2016-11-25 7 4 1 87 74 57 2745.990070
9 2016-11-26 7 3 -1 100 88 61 2273.413250
10 2016-11-27 10 7 4 100 81 66 2630.414900
Can someone show me how to do that?

I think you can use map:
df2['revenue'] = df2.CET.map(df1.set_index('date')['revenue'])
You can also convert the Series to a dict, which is a bit faster for a large df:
df2['revenue'] = df2.CET.map(df1.set_index('date')['revenue'].to_dict())
print (df2)
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity \
0 2016-11-17 11 9 7 100 85
1 2016-11-18 9 6 3 93 83
2 2016-11-19 8 6 4 93 87
3 2016-11-20 10 7 4 93 84
4 2016-11-21 14 10 7 100 89
5 2016-11-22 13 10 7 93 79
6 2016-11-23 11 8 5 100 91
7 2016-11-24 9 7 4 93 80
8 2016-11-25 7 4 1 87 74
9 2016-11-26 7 3 -1 100 88
10 2016-11-27 10 7 4 100 81
MinHumidity revenue
0 63 385.94380
1 66 1074.16034
2 76 2980.85786
3 81 1919.72396
4 77 884.27934
5 63 869.07107
6 82 760.28926
7 66 2481.68927
8 57 2745.99007
9 61 2273.41325
10 66 2630.41490
If all output values are NaN, the problem is that the CET and date columns have different dtypes:
print (df1.date.dtypes)
object
print (df2.CET.dtype)
datetime64[ns]
The solution is to convert the string column with to_datetime:
df1.date = pd.to_datetime(df1.date)
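Putting both steps together, a minimal sketch (assuming the two frames exactly as shown in the question) would be:
import pandas as pd

# align the key dtypes first: both columns should be datetime64[ns]
df1['date'] = pd.to_datetime(df1['date'])
df2['CET'] = pd.to_datetime(df2['CET'])

# attach the revenue via map ...
df2['revenue'] = df2['CET'].map(df1.set_index('date')['revenue'])

# ... or, equivalently, via a left merge that keeps every row of df2
# df2 = df2.merge(df1.rename(columns={'date': 'CET'}), on='CET', how='left')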

The .map() solution will work only if you have exactly the same values in the date and CET columns.
If the values differ slightly, you can use the pd.merge_asof() method:
In [17]: pd.merge_asof(df1, df2, left_on='date', right_on='CET', tolerance=pd.Timedelta('2 hours'))
Out[17]:
date revenue CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 385.94380 2016-11-17 11 9 7 100 85 63
1 2016-11-18 1074.16034 2016-11-18 9 6 3 93 83 66
2 2016-11-19 2980.85786 2016-11-19 8 6 4 93 87 76
3 2016-11-20 1919.72396 2016-11-20 10 7 4 93 84 81
4 2016-11-21 884.27934 2016-11-21 14 10 7 100 89 77
5 2016-11-22 869.07107 2016-11-22 13 10 7 93 79 63
6 2016-11-23 760.28926 2016-11-23 11 8 5 100 91 82
7 2016-11-24 2481.68927 2016-11-24 9 7 4 93 80 66
8 2016-11-25 2745.99007 2016-11-25 7 4 1 87 74 57
9 2016-11-26 2273.41325 2016-11-26 7 3 -1 100 88 61
10 2016-11-27 2630.41490 2016-11-27 10 7 4 100 81 66
NOTE: the merge_asof() function was added in pandas 0.19.0 (i.e. it's not available in older versions).
Demo:
In [191]: df2
Out[191]:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 01:39:00 11 9 7 100 85 63
1 2016-11-18 01:39:00 9 6 3 93 83 66
2 2016-11-19 01:39:00 8 6 4 93 87 76
3 2016-11-20 01:39:00 10 7 4 93 84 81
4 2016-11-21 01:39:00 14 10 7 100 89 77
5 2016-11-22 01:39:00 13 10 7 93 79 63
6 2016-11-23 01:39:00 11 8 5 100 91 82
7 2016-11-24 01:39:00 9 7 4 93 80 66
8 2016-11-25 01:39:00 7 4 1 87 74 57
9 2016-11-26 01:39:00 7 3 -1 100 88 61
10 2016-11-27 01:39:00 10 7 4 100 81 66
In [192]: df1
Out[192]:
date revenue
0 2016-11-17 385.94380
1 2016-11-18 1074.16034
2 2016-11-19 2980.85786
3 2016-11-20 1919.72396
4 2016-11-21 884.27934
5 2016-11-22 869.07107
6 2016-11-23 760.28926
7 2016-11-24 2481.68927
8 2016-11-25 2745.99007
9 2016-11-26 2273.41325
10 2016-11-27 2630.41490
In [193]: pd.merge_asof(df2, df1, left_on='CET', right_on='date')
Out[193]:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity date revenue
0 2016-11-17 01:39:00 11 9 7 100 85 63 2016-11-17 385.94380
1 2016-11-18 01:39:00 9 6 3 93 83 66 2016-11-18 1074.16034
2 2016-11-19 01:39:00 8 6 4 93 87 76 2016-11-19 2980.85786
3 2016-11-20 01:39:00 10 7 4 93 84 81 2016-11-20 1919.72396
4 2016-11-21 01:39:00 14 10 7 100 89 77 2016-11-21 884.27934
5 2016-11-22 01:39:00 13 10 7 93 79 63 2016-11-22 869.07107
6 2016-11-23 01:39:00 11 8 5 100 91 82 2016-11-23 760.28926
7 2016-11-24 01:39:00 9 7 4 93 80 66 2016-11-24 2481.68927
8 2016-11-25 01:39:00 7 4 1 87 74 57 2016-11-25 2745.99007
9 2016-11-26 01:39:00 7 3 -1 100 88 61 2016-11-26 2273.41325
10 2016-11-27 01:39:00 10 7 4 100 81 66 2016-11-27 2630.41490
Using the .map() method (all values are NaN here because the CET timestamps don't exactly match the date values):
In [194]: df2.CET.map(df1.set_index('date')['revenue'])
Out[194]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
Name: CET, dtype: float64
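If, as in this demo, CET carries a time component but you only want to match on the calendar date, one hedged workaround is to normalize the timestamps before mapping:
import pandas as pd

df1['date'] = pd.to_datetime(df1['date'])
# dt.normalize() drops the time-of-day, so the keys line up with df1['date']
df2['revenue'] = df2['CET'].dt.normalize().map(df1.set_index('date')['revenue'])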

Related

New column based on last time row value equals some numbers in Pandas dataframe

I have a dataframe, sorted by date in descending order, that records the Rank of students in a class and the Predicted_Score.
Date Student_ID Rank Predicted_Score
4/7/2021 33 2 87
13/6/2021 33 4 88
31/3/2021 33 7 88
28/2/2021 33 2 86
14/2/2021 33 10 86
31/1/2021 33 8 86
23/12/2020 33 1 81
8/11/2020 33 3 80
21/10/2020 33 3 80
23/9/2020 33 4 80
20/5/2020 33 3 80
29/4/2020 33 4 80
15/4/2020 33 2 79
26/2/2020 33 3 79
12/2/2020 33 5 79
29/1/2020 33 1 70
I want to create a column called Recent_Predicted_Score that records the last Predicted_Score from a date when that student actually ranked in the top 3. So the desired outcome looks like:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
4/7/2021 33 2 87 86
13/6/2021 33 4 88 86
31/3/2021 33 7 88 86
28/2/2021 33 2 86 81
14/2/2021 33 10 86 81
31/1/2021 33 8 86 81
23/12/2020 33 1 81 80
8/11/2020 33 3 80 80
21/10/2020 33 3 80 80
23/9/2020 33 4 80 80
20/5/2020 33 3 80 79
29/4/2020 33 4 80 79
15/4/2020 33 2 79 79
26/2/2020 33 3 79 70
12/2/2020 33 5 79 70
29/1/2020 33 1 70
Here's what I have tried, but it doesn't quite work; I'm not sure if I am on the right track:
df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))
Thanks in advance.
Try:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'])
df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
print(df)
Prints:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-06-13 33 4 88 86.0
2 2021-03-31 33 7 88 86.0
3 2021-02-28 33 2 86 81.0
4 2021-02-14 33 10 86 81.0
5 2021-01-31 33 8 86 81.0
6 2020-12-23 33 1 81 80.0
7 2020-11-08 33 3 80 80.0
8 2020-10-21 33 3 80 80.0
9 2020-09-23 33 4 80 80.0
10 2020-05-20 33 3 80 79.0
11 2020-04-29 33 4 80 79.0
12 2020-04-15 33 2 79 79.0
13 2020-02-26 33 3 79 70.0
14 2020-02-12 33 5 79 70.0
15 2020-01-29 33 1 70
Mask the scores where Rank is greater than 3, then group the masked column by Student_ID, shift by -1 and backward fill to propagate the most recent prior top-3 predicted score:
c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'].gt(3))
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())
Result
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 4/7/2021 33 2 87 86.0
1 13/6/2021 33 4 88 86.0
2 31/3/2021 33 7 88 86.0
3 28/2/2021 33 2 86 81.0
4 14/2/2021 33 10 86 81.0
5 31/1/2021 33 8 86 81.0
6 23/12/2020 33 1 81 80.0
7 8/11/2020 33 3 80 80.0
8 21/10/2020 33 3 80 80.0
9 23/9/2020 33 4 80 80.0
10 20/5/2020 33 3 80 79.0
11 29/4/2020 33 4 80 79.0
12 15/4/2020 33 2 79 79.0
13 26/2/2020 33 3 79 70.0
14 12/2/2020 33 5 79 70.0
15 29/1/2020 33 1 70 NaN
Note: Make sure your dataframe is sorted on Date in descending order.
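For completeness, mirroring the sort used in the first answer, one way to enforce that ordering beforehand is (a sketch, assuming Date is still a day-first string):
import pandas as pd

# parse the day-first dates and sort newest-first within each student
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'], ascending=[True, False])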
Let's assume:
there may be more than one unique Student_ID
the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
we want to preserve the index of the original dataframe
Subject to these assumptions, here's a way to do what your question asks:
df['Recent_Predicted_Score'] = df.loc[df.Rank <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = (df
    .groupby('Student_ID', sort=False)
    .apply(lambda group: group.shift(-1).bfill())
    ['Recent_Predicted_Score'])
Explanation:
create a new column Recent_Predicted_Score containing the Predicted_Score where Rank is in the top 3 and NaN otherwise
use groupby() on Student_ID with sort=False for better performance (note that groupby() preserves the order of rows within each group, so the existing descending order by Date is not disturbed)
within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score.
Sample input (with two distinct Student_ID values):
Date Student_ID Rank Predicted_Score
0 2021-07-04 33 2 87
1 2021-07-04 66 2 87
2 2021-06-13 33 4 88
3 2021-06-13 66 4 88
4 2021-03-31 33 7 88
5 2021-03-31 66 7 88
6 2021-02-28 33 2 86
7 2021-02-28 66 2 86
8 2021-02-14 33 10 86
9 2021-02-14 66 10 86
10 2021-01-31 33 8 86
11 2021-01-31 66 8 86
12 2020-12-23 33 1 81
13 2020-12-23 66 1 81
14 2020-11-08 33 3 80
15 2020-11-08 66 3 80
16 2020-10-21 33 3 80
17 2020-10-21 66 3 80
18 2020-09-23 33 4 80
19 2020-09-23 66 4 80
20 2020-05-20 33 3 80
21 2020-05-20 66 3 80
22 2020-04-29 33 4 80
23 2020-04-29 66 4 80
24 2020-04-15 33 2 79
25 2020-04-15 66 2 79
26 2020-02-26 33 3 79
27 2020-02-26 66 3 79
28 2020-02-12 33 5 79
29 2020-02-12 66 5 79
30 2020-01-29 33 1 70
31 2020-01-29 66 1 70
Output:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-07-04 66 2 87 86.0
2 2021-06-13 33 4 88 86.0
3 2021-06-13 66 4 88 86.0
4 2021-03-31 33 7 88 86.0
5 2021-03-31 66 7 88 86.0
6 2021-02-28 33 2 86 81.0
7 2021-02-28 66 2 86 81.0
8 2021-02-14 33 10 86 81.0
9 2021-02-14 66 10 86 81.0
10 2021-01-31 33 8 86 81.0
11 2021-01-31 66 8 86 81.0
12 2020-12-23 33 1 81 80.0
13 2020-12-23 66 1 81 80.0
14 2020-11-08 33 3 80 80.0
15 2020-11-08 66 3 80 80.0
16 2020-10-21 33 3 80 80.0
17 2020-10-21 66 3 80 80.0
18 2020-09-23 33 4 80 80.0
19 2020-09-23 66 4 80 80.0
20 2020-05-20 33 3 80 79.0
21 2020-05-20 66 3 80 79.0
22 2020-04-29 33 4 80 79.0
23 2020-04-29 66 4 80 79.0
24 2020-04-15 33 2 79 79.0
25 2020-04-15 66 2 79 79.0
26 2020-02-26 33 3 79 70.0
27 2020-02-26 66 3 79 70.0
28 2020-02-12 33 5 79 70.0
29 2020-02-12 66 5 79 70.0
30 2020-01-29 33 1 70 NaN
31 2020-01-29 66 1 70 NaN
Output sorted by Student_ID, Date for easier inspection:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
2 2021-06-13 33 4 88 86.0
4 2021-03-31 33 7 88 86.0
6 2021-02-28 33 2 86 81.0
8 2021-02-14 33 10 86 81.0
10 2021-01-31 33 8 86 81.0
12 2020-12-23 33 1 81 80.0
14 2020-11-08 33 3 80 80.0
16 2020-10-21 33 3 80 80.0
18 2020-09-23 33 4 80 80.0
20 2020-05-20 33 3 80 79.0
22 2020-04-29 33 4 80 79.0
24 2020-04-15 33 2 79 79.0
26 2020-02-26 33 3 79 70.0
28 2020-02-12 33 5 79 70.0
30 2020-01-29 33 1 70 NaN
1 2021-07-04 66 2 87 86.0
3 2021-06-13 66 4 88 86.0
5 2021-03-31 66 7 88 86.0
7 2021-02-28 66 2 86 81.0
9 2021-02-14 66 10 86 81.0
11 2021-01-31 66 8 86 81.0
13 2020-12-23 66 1 81 80.0
15 2020-11-08 66 3 80 80.0
17 2020-10-21 66 3 80 80.0
19 2020-09-23 66 4 80 80.0
21 2020-05-20 66 3 80 79.0
23 2020-04-29 66 4 80 79.0
25 2020-04-15 66 2 79 79.0
27 2020-02-26 66 3 79 70.0
29 2020-02-12 66 5 79 70.0
31 2020-01-29 66 1 70 NaN

How to specify the week number for a given date using pandas?

I have a dataframe using
year_start = '2020-03-29'
year_end = '2021-04-10'
week_end_sat = pd.DataFrame(pd.date_range(year_start, year_end, freq=f'W-SAT'), columns=['a'])
How can I make another column specifying the week number, treating 2020-03-29 as the first day of the calendar, since I am trying to make a 4-4-5 calendar that always ends on a Saturday?
Final df that I want is,
a | count
2020-04-04 | 1
2020-04-11 | 2
.
.
.
2021-04-03 | 53 #since 2020 is a leap year there are 53 weeks otherwise it will be 52 weeks
2021-04-10 | 1
2021-04-17 | 2
.
2022-03-02 | 52
2022-04-09 | 1
I think you can create a baseline date range starting from the first day of the year of your given year_start.
first_day_of_year = week_end_sat.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(week_end_sat), freq=f'W-SAT'))
The baseline's week of year is what you want.
week_end_sat['count'] = baseline.dt.isocalendar().week
# print(week_end_sat)
a count
0 2020-04-04 1
1 2020-04-11 2
2 2020-04-18 3
3 2020-04-25 4
4 2020-05-02 5
5 2020-05-09 6
6 2020-05-16 7
7 2020-05-23 8
8 2020-05-30 9
9 2020-06-06 10
10 2020-06-13 11
11 2020-06-20 12
12 2020-06-27 13
13 2020-07-04 14
14 2020-07-11 15
15 2020-07-18 16
16 2020-07-25 17
17 2020-08-01 18
18 2020-08-08 19
19 2020-08-15 20
20 2020-08-22 21
21 2020-08-29 22
...
43 2021-01-30 44
44 2021-02-06 45
45 2021-02-13 46
46 2021-02-20 47
47 2021-02-27 48
48 2021-03-06 49
49 2021-03-13 50
50 2021-03-20 51
51 2021-03-27 52
52 2021-04-03 53
53 2021-04-10 1
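Putting the pieces above together, a minimal end-to-end sketch (pandas 1.1+ for isocalendar(), variable names taken from the question) might look like this; note the counter follows the ISO year, so it only reaches 53 in years with 53 ISO weeks:
import pandas as pd

year_start = '2020-03-29'
year_end = '2021-04-10'
week_end_sat = pd.DataFrame(pd.date_range(year_start, year_end, freq='W-SAT'), columns=['a'])

# baseline of week-ending Saturdays starting from 1 January of the same year
first_day_of_year = week_end_sat.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(week_end_sat), freq='W-SAT'))

# the ISO week number of each baseline Saturday is the desired 1..52/53 counter
week_end_sat['count'] = baseline.dt.isocalendar().week
print(week_end_sat)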
I calculated the week number using a W-SAT frequency and the isocalendar API. I then created a baseline starting from the first day of the year and assigned its week number to baseline_week_number, so each week-ending date has an associated baseline week number.
import datetime
import pandas as pd

year_start = '2020-03-29'
year_end = '2021-04-10'
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='W-SAT'), columns=['week_date'])
df['week_number'] = df['week_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
first_day_of_year = df.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(df), freq='W-SAT'))
df['baseline_date'] = baseline
df['baseline_week_number'] = df['baseline_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
print(df)
output:
week_date week_number baseline_date baseline_week_number
0 2020-04-04 14 2020-01-04 1
1 2020-04-11 15 2020-01-11 2
2 2020-04-18 16 2020-01-18 3
3 2020-04-25 17 2020-01-25 4
4 2020-05-02 18 2020-02-01 5
5 2020-05-09 19 2020-02-08 6
6 2020-05-16 20 2020-02-15 7
7 2020-05-23 21 2020-02-22 8
8 2020-05-30 22 2020-02-29 9
9 2020-06-06 23 2020-03-07 10
10 2020-06-13 24 2020-03-14 11
11 2020-06-20 25 2020-03-21 12
12 2020-06-27 26 2020-03-28 13
13 2020-07-04 27 2020-04-04 14
14 2020-07-11 28 2020-04-11 15
15 2020-07-18 29 2020-04-18 16
16 2020-07-25 30 2020-04-25 17
17 2020-08-01 31 2020-05-02 18
18 2020-08-08 32 2020-05-09 19
19 2020-08-15 33 2020-05-16 20
20 2020-08-22 34 2020-05-23 21
21 2020-08-29 35 2020-05-30 22
22 2020-09-05 36 2020-06-06 23
23 2020-09-12 37 2020-06-13 24
24 2020-09-19 38 2020-06-20 25
25 2020-09-26 39 2020-06-27 26
26 2020-10-03 40 2020-07-04 27
27 2020-10-10 41 2020-07-11 28
28 2020-10-17 42 2020-07-18 29
29 2020-10-24 43 2020-07-25 30
30 2020-10-31 44 2020-08-01 31
31 2020-11-07 45 2020-08-08 32
32 2020-11-14 46 2020-08-15 33
33 2020-11-21 47 2020-08-22 34
34 2020-11-28 48 2020-08-29 35
35 2020-12-05 49 2020-09-05 36
36 2020-12-12 50 2020-09-12 37
37 2020-12-19 51 2020-09-19 38
38 2020-12-26 52 2020-09-26 39
39 2021-01-02 53 2020-10-03 40
40 2021-01-09 1 2020-10-10 41
41 2021-01-16 2 2020-10-17 42
42 2021-01-23 3 2020-10-24 43
43 2021-01-30 4 2020-10-31 44
44 2021-02-06 5 2020-11-07 45
45 2021-02-13 6 2020-11-14 46
46 2021-02-20 7 2020-11-21 47
47 2021-02-27 8 2020-11-28 48
48 2021-03-06 9 2020-12-05 49
49 2021-03-13 10 2020-12-12 50
50 2021-03-20 11 2020-12-19 51
51 2021-03-27 12 2020-12-26 52
52 2021-04-03 13 2021-01-02 53
53 2021-04-10 14 2021-01-09 1

Grouped time difference when a condition is met

I am working with log data structured as follows (here is a pastebin snippet of mock data for easy tinkering):
import pandas as pd
df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 123 32 0
1 123 2020-01-02 2 43 0
2 123 2020-01-03 43 4 1
3 123 2020-01-04 43 4 0
4 123 2020-01-05 43 4 0
5 123 2020-01-06 43 4 0
6 123 2020-01-07 43 4 1
7 123 2020-01-08 43 4 0
8 232 2020-01-04 56 4 0
9 232 2020-01-05 97 1 0
10 232 2020-01-06 23 74 0
11 232 2020-01-07 91 85 1
12 232 2020-01-08 91 85 0
13 232 2020-01-09 91 85 0
14 232 2020-01-10 91 85 1
Variables are pretty straightforward:
id: the id of the observed machine
date: observation date
info_a_cnt: counts of a specific kind of info event
info_b_cnt: same as above for a different event type
has_err: whether or not the machine logged any errors
Now, I'd like to group the dataframe by id to create a variable storing the number of days left before an error event. The desired dataframe should look like:
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 232 2020-01-04 56 4 0 3
8 232 2020-01-05 97 1 0 2
9 232 2020-01-06 23 74 0 1
10 232 2020-01-07 91 85 1 0
11 232 2020-01-08 91 85 0 2
12 232 2020-01-09 91 85 0 1
13 232 2020-01-10 91 85 1 0
I am having a hard time figuring out the correct implementation with the right grouping functions.
Edit:
All the answers below work really well when dealing with dates at a daily granularity. I am wondering how to adapt @jezrael's solution below to a dataframe containing timestamps (logs will be batched at 15-minute intervals):
df:
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz")
print(df)
id date info_a_cnt info_b_cnt has_err
0 123 2020-01-01 12:00:00 123 32 0
1 123 2020-01-01 12:15:00 2 43 0
2 123 2020-01-01 12:30:00 43 4 1
3 123 2020-01-01 12:45:00 43 4 0
4 123 2020-01-01 13:00:00 43 4 0
5 123 2020-01-01 13:15:00 43 4 0
6 123 2020-01-01 13:30:00 43 4 1
7 123 2020-01-01 13:45:00 43 4 0
8 232 2020-01-04 17:00:00 56 4 0
9 232 2020-01-05 17:15:00 97 1 0
10 232 2020-01-06 17:30:00 23 74 0
11 232 2020-01-07 17:45:00 91 85 1
12 232 2020-01-08 18:00:00 91 85 0
13 232 2020-01-09 18:15:00 91 85 0
14 232 2020-01-10 18:30:00 91 85 1
I am wondering how to adapt @jezrael's answer in order to land on something like:
id date info_a_cnt info_b_cnt has_err mins_to_err
0 123 2020-01-01 12:00:00 123 32 0 30
1 123 2020-01-01 12:15:00 2 43 0 15
2 123 2020-01-01 12:30:00 43 4 1 0
3 123 2020-01-01 12:45:00 43 4 0 45
4 123 2020-01-01 13:00:00 43 4 0 30
5 123 2020-01-01 13:15:00 43 4 0 15
6 123 2020-01-01 13:30:00 43 4 1 0
7 123 2020-01-01 13:45:00 43 4 0 60
8 232 2020-01-04 17:00:00 56 4 0 45
9 232 2020-01-05 17:15:00 97 1 0 30
10 232 2020-01-06 17:30:00 23 74 0 15
11 232 2020-01-07 17:45:00 91 85 1 0
12 232 2020-01-08 18:00:00 91 85 0 30
13 232 2020-01-09 18:15:00 91 85 0 15
14 232 2020-01-10 18:30:00 91 85 1 0
Use GroupBy.cumcount with ascending=False, grouping by the id column and a helper Series built with Series.cumsum computed from the back, hence the reversed indexing with Series.iloc:
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
EDIT: To compute the cumulative sum of date differences, use a custom lambda function with GroupBy.transform:
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.days.cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2.0
1 123 2020-01-02 2 43 0 1.0
2 123 2020-01-03 43 4 1 0.0
3 123 2020-01-04 43 4 0 3.0
4 123 2020-01-05 43 4 0 2.0
5 123 2020-01-06 43 4 0 1.0
6 123 2020-01-07 43 4 1 0.0
7 123 2020-01-08 43 4 0 0.0
8 232 2020-01-04 56 4 0 3.0
9 232 2020-01-05 97 1 0 2.0
10 232 2020-01-06 23 74 0 1.0
11 232 2020-01-07 91 85 1 0.0
12 232 2020-01-08 91 85 0 2.0
13 232 2020-01-09 91 85 0 1.0
14 232 2020-01-10 91 85 1 0.0
EDIT1: Use Series.dt.total_seconds and divide by 60:
#some data sample cleaning
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
.transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
.fillna(0)
.to_numpy()[::-1])
print(df)
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 12:00:00 123 32 0 30.0
1 123 2020-01-01 12:15:00 2 43 0 15.0
2 123 2020-01-01 12:30:00 43 4 1 0.0
3 123 2020-01-01 12:45:00 43 4 0 45.0
4 123 2020-01-01 13:00:00 43 4 0 30.0
5 123 2020-01-01 13:15:00 43 4 0 15.0
6 123 2020-01-01 13:30:00 43 4 1 0.0
7 123 2020-01-01 13:45:00 43 4 0 0.0
8 232 2020-01-01 17:00:00 56 4 0 45.0
9 232 2020-01-01 17:15:00 97 1 0 30.0
10 232 2020-01-01 17:30:00 23 74 0 15.0
11 232 2020-01-01 17:45:00 91 85 1 0.0
12 232 2020-01-01 18:00:00 91 85 0 30.0
13 232 2020-01-01 18:15:00 91 85 0 15.0
14 232 2020-01-01 18:30:00 91 85 1 0.0
Use:
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()
id date info_a_cnt info_b_cnt has_err days_to_err
0 123 2020-01-01 123 32 0 2
1 123 2020-01-02 2 43 0 1
2 123 2020-01-03 43 4 1 0
3 123 2020-01-04 43 4 0 3
4 123 2020-01-05 43 4 0 2
5 123 2020-01-06 43 4 0 1
6 123 2020-01-07 43 4 1 0
7 123 2020-01-08 43 4 0 0
8 232 2020-01-04 56 4 0 3
9 232 2020-01-05 97 1 0 2
10 232 2020-01-06 23 74 0 1
11 232 2020-01-07 91 85 1 0
12 232 2020-01-08 91 85 0 2
13 232 2020-01-09 91 85 0 1
14 232 2020-01-10 91 85 1 0
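If you want minutes rather than a row count, a hedged variation of the same reversed-grouping idea is sketched below (it reuses the mock-data cleanup from the answer above, and rows after a machine's last error end up with the distance to the last observation, just like in the earlier answers):
import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
# same mock-data cleanup as above, so all timestamps fall on a single day
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))

rev = df[::-1]                           # walk backwards in time
err_grp = rev['has_err'].eq(1).cumsum()  # a new group starts at each error, seen from the end
# within each (id, group) the chronologically last row is the error row,
# i.e. the first row of the reversed frame
err_time = rev.groupby(['id', err_grp])['date'].transform('first')
df['mins_to_err'] = (err_time - df['date']).dt.total_seconds().div(60)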

DataFrame GroupBy into VolumeBars

I need to consolidate market data into volume bars (blocks with the same volume).
As input data I have minute bars (possibly with gaps), with the following columns: Time, OHLC (Open, High, Low, Close) and Volume.
Currently, I'm trying it this way:
bar_volume_size = 100
df = hg
df['BarNo'] = (df["Volume"].cumsum() // bar_volume_size) + 1
df['over'] = (df["Volume"].cumsum() % bar_volume_size)
print(df.head(40))
Results of this operation looks like this:
Open High Low Close Volume BarNo over
2018-12-30 18:00:00 2.6780 2.6875 2.6755 2.6840 83 1 83
2018-12-30 18:01:00 2.6835 2.6875 2.6825 2.6875 40 2 23
2018-12-30 18:02:00 2.6875 2.6920 2.6875 2.6915 58 2 81
2018-12-30 18:03:00 2.6915 2.6945 2.6910 2.6920 36 3 17
2018-12-30 18:04:00 2.6910 2.6925 2.6910 2.6920 14 3 31
2018-12-30 18:05:00 2.6920 2.6920 2.6900 2.6900 16 3 47
2018-12-30 18:06:00 2.6905 2.6905 2.6880 2.6880 12 3 59
2018-12-30 18:07:00 2.6885 2.6890 2.6880 2.6880 5 3 64
2018-12-30 18:08:00 2.6885 2.6885 2.6880 2.6885 3 3 67
2018-12-30 18:09:00 2.6875 2.6875 2.6875 2.6875 1 3 68
2018-12-30 18:10:00 2.6875 2.6890 2.6875 2.6890 9 3 77
2018-12-30 18:11:00 2.6895 2.6895 2.6895 2.6895 4 3 81
2018-12-30 18:12:00 2.6900 2.6900 2.6890 2.6895 13 3 94
2018-12-30 18:13:00 2.6895 2.6895 2.6890 2.6890 3 3 97
2018-12-30 18:14:00 2.6890 2.6895 2.6890 2.6895 10 4 7
2018-12-30 18:15:00 2.6895 2.6895 2.6895 2.6895 0 4 7
2018-12-30 18:16:00 2.6895 2.6900 2.6895 2.6900 4 4 11
2018-12-30 18:17:00 2.6890 2.6895 2.6855 2.6870 31 4 42
2018-12-30 18:18:00 2.6875 2.6875 2.6875 2.6875 8 4 50
2018-12-30 18:19:00 2.6875 2.6885 2.6875 2.6885 5 4 55
2018-12-30 18:20:00 2.6890 2.6905 2.6890 2.6905 4 4 59
2018-12-30 18:21:00 2.6910 2.6910 2.6910 2.6910 2 4 61
2018-12-30 18:22:00 2.6910 2.6910 2.6910 2.6910 0 4 61
2018-12-30 18:23:00 2.6910 2.6910 2.6910 2.6910 0 4 61
2018-12-30 18:24:00 2.6910 2.6910 2.6910 2.6910 0 4 61
2018-12-30 18:25:00 2.6905 2.6905 2.6905 2.6905 1 4 62
2018-12-30 18:26:00 2.6890 2.6890 2.6890 2.6890 1 4 63
2018-12-30 18:27:00 2.6890 2.6890 2.6890 2.6890 1 4 64
2018-12-30 18:28:00 2.6890 2.6890 2.6890 2.6890 2 4 66
2018-12-30 18:29:00 2.6890 2.6890 2.6890 2.6890 0 4 66
2018-12-30 18:30:00 2.6895 2.6900 2.6890 2.6890 6 4 72
2018-12-30 18:31:00 2.6890 2.6890 2.6890 2.6890 1 4 73
2018-12-30 18:32:00 2.6890 2.6890 2.6890 2.6890 0 4 73
2018-12-30 18:33:00 2.6900 2.6900 2.6865 2.6890 14 4 87
2018-12-30 18:34:00 2.6870 2.6870 2.6865 2.6865 10 4 97
2018-12-30 18:35:00 2.6865 2.6865 2.6865 2.6865 0 4 97
2018-12-30 18:36:00 2.6860 2.6860 2.6850 2.6860 21 5 18
2018-12-30 18:37:00 2.6870 2.6875 2.6870 2.6875 4 5 22
2018-12-30 18:38:00 2.6865 2.6865 2.6865 2.6865 1 5 23
2018-12-30 18:39:00 2.6865 2.6865 2.6865 2.6865 1 5 24
In the BarNo column I have the volume-bar number. I think it should be possible to group the dataframe by the BarNo column; for example, for BarNo == 3:
Open High Low Close Volume BarNo over
2018-12-30 18:03:00 2.6915 2.6945 2.6910 2.6920 36 3 17
2018-12-30 18:04:00 2.6910 2.6925 2.6910 2.6920 14 3 31
2018-12-30 18:05:00 2.6920 2.6920 2.6900 2.6900 16 3 47
2018-12-30 18:06:00 2.6905 2.6905 2.6880 2.6880 12 3 59
2018-12-30 18:07:00 2.6885 2.6890 2.6880 2.6880 5 3 64
2018-12-30 18:08:00 2.6885 2.6885 2.6880 2.6885 3 3 67
2018-12-30 18:09:00 2.6875 2.6875 2.6875 2.6875 1 3 68
2018-12-30 18:10:00 2.6875 2.6890 2.6875 2.6890 9 3 77
2018-12-30 18:11:00 2.6895 2.6895 2.6895 2.6895 4 3 81
2018-12-30 18:12:00 2.6900 2.6900 2.6890 2.6895 13 3 94
2018-12-30 18:13:00 2.6895 2.6895 2.6890 2.6890 3 3 97
Take the first element of the "Open" column of this group, the max of the "High" column, the min of the "Low" column and the last element of the "Close" column, take exactly bar_volume_size as the "Volume", and put all this data into another DataFrame (or maybe into this one).
You can use groupby with aggregation like this.
group_rules = {'Open':'first', 'High':'max', 'Low':'min'}
df.groupby(['BarNo']).agg(group_rules).reset_index()
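To also carry Close and Volume as the question describes, here is a hedged sketch using named aggregation (pandas 0.25+). Note that the summed Volume can slightly exceed bar_volume_size because of the carry-over tracked in the 'over' column:
bars = (df.groupby('BarNo')
          .agg(Open=('Open', 'first'),
               High=('High', 'max'),
               Low=('Low', 'min'),
               Close=('Close', 'last'),
               Volume=('Volume', 'sum'))
          .reset_index())
print(bars.head())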

Pandas: Perform operation on various columns and create, rename new columns

We have a dataframe 'A' with 5 columns. If we want to add the rolling mean of each column, we could do:
A = pd.DataFrame(np.random.randint(100, size=(5, 5)))
for i in range(0, 5):
    A[i + 6] = A[i].rolling(3).mean()
If however 'A' has columns named 'A', 'B', ..., 'E':
A = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns=['A', 'B', 'C', 'D', 'E'])
How could we neatly add 5 columns with the rolling means, with the names ['A_mean', 'B_mean', ..., 'E_mean']?
Try this:
for col in A:
    A[col + '_mean'] = A[col].rolling(3).mean()
Output with your way:
0 1 2 3 4 6 7 8 9 10
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
and Output with mine:
A B C D E A_mean B_mean C_mean D_mean E_mean
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
Without loops:
pd.concat([A, A.apply(lambda x:x.rolling(3).mean()).rename(
columns={col: str(col) + '_mean' for col in A})], axis=1)
A B C D E A_mean B_mean C_mean D_mean E_mean
0 67 54 85 61 62 NaN NaN NaN NaN NaN
1 44 53 30 80 58 NaN NaN NaN NaN NaN
2 10 59 14 39 12 40.333333 55.333333 43.0 60.000000 44.000000
3 47 25 58 93 38 33.666667 45.666667 34.0 70.666667 36.000000
4 73 80 30 51 77 43.333333 54.666667 34.0 61.000000 42.333333
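Another loop-free variant, a hedged sketch relying on DataFrame.rolling and add_suffix, produces the same columns:
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns=list('ABCDE'))

# rolling mean of every column at once, with '_mean' appended to each name
result = A.join(A.rolling(3).mean().add_suffix('_mean'))
print(result)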
