How to specify the week number for a given date using pandas?

I have a dataframe built with:
import pandas as pd

year_start = '2020-03-29'
year_end = '2021-04-10'
week_end_sat = pd.DataFrame(pd.date_range(year_start, year_end, freq='W-SAT'), columns=['a'])
How can I add another column with the week number, treating 2020-03-29 as the first day of the calendar? I am trying to build a 4-4-5 calendar, which always ends on a Saturday.
Final df that I want is,
a | count
2020-04-04 | 1
2020-04-11 | 2
.
.
.
2021-04-03 | 53 #since 2020 is a leap year there are 53 weeks otherwise it will be 52 weeks
2021-04-10 | 1
2021-04-17 | 2
.
2022-04-02 | 52
2022-04-09 | 1

I think you can create a baseline date range starting from the first day of the calendar year of your given year_start:
first_day_of_year = week_end_sat.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(week_end_sat), freq='W-SAT'))
The baseline's ISO week of year is then exactly the count you want:
week_end_sat['count'] = baseline.dt.isocalendar().week
# print(week_end_sat)
a count
0 2020-04-04 1
1 2020-04-11 2
2 2020-04-18 3
3 2020-04-25 4
4 2020-05-02 5
5 2020-05-09 6
6 2020-05-16 7
7 2020-05-23 8
8 2020-05-30 9
9 2020-06-06 10
10 2020-06-13 11
11 2020-06-20 12
12 2020-06-27 13
13 2020-07-04 14
14 2020-07-11 15
15 2020-07-18 16
16 2020-07-25 17
17 2020-08-01 18
18 2020-08-08 19
19 2020-08-15 20
20 2020-08-22 21
21 2020-08-29 22
...
43 2021-01-30 44
44 2021-02-06 45
45 2021-02-13 46
46 2021-02-20 47
47 2021-02-27 48
48 2021-03-06 49
49 2021-03-13 50
50 2021-03-20 51
51 2021-03-27 52
52 2021-04-03 53
53 2021-04-10 1
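As a quick check on why this works: for a W-SAT range anchored at the start of a calendar year, the ISO week of each Saturday simply counts 1, 2, 3, ... and reaches 53 in 53-week ISO years. A minimal sketch, assuming pandas 1.1+ for the .dt.isocalendar() accessor:
import pandas as pd

s = pd.Series(pd.to_datetime(['2020-01-04', '2020-01-11', '2021-01-02']))
# The first two Saturdays of 2020 are ISO weeks 1 and 2; the Saturday
# 2021-01-02 still belongs to ISO week 53 of 2020.
print(s.dt.isocalendar().week.tolist())  # [1, 2, 53]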

I calculated the week number using a W-SAT frequency and the isocalendar API. I then created a baseline starting from the first day of the year and assigned its week number to baseline_week_number, so each week now has an associated baseline week number.
import datetime
import pandas as pd

year_start = '2020-03-29'
year_end = '2021-04-10'
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='W-SAT'), columns=['week_date'])
df['week_number'] = df['week_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
first_day_of_year = df.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(df), freq='W-SAT'))
df['baseline_date'] = baseline
df['baseline_week_number'] = df['baseline_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
print(df)
output:
week_date week_number baseline_date baseline_week_number
0 2020-04-04 14 2020-01-04 1
1 2020-04-11 15 2020-01-11 2
2 2020-04-18 16 2020-01-18 3
3 2020-04-25 17 2020-01-25 4
4 2020-05-02 18 2020-02-01 5
5 2020-05-09 19 2020-02-08 6
6 2020-05-16 20 2020-02-15 7
7 2020-05-23 21 2020-02-22 8
8 2020-05-30 22 2020-02-29 9
9 2020-06-06 23 2020-03-07 10
10 2020-06-13 24 2020-03-14 11
11 2020-06-20 25 2020-03-21 12
12 2020-06-27 26 2020-03-28 13
13 2020-07-04 27 2020-04-04 14
14 2020-07-11 28 2020-04-11 15
15 2020-07-18 29 2020-04-18 16
16 2020-07-25 30 2020-04-25 17
17 2020-08-01 31 2020-05-02 18
18 2020-08-08 32 2020-05-09 19
19 2020-08-15 33 2020-05-16 20
20 2020-08-22 34 2020-05-23 21
21 2020-08-29 35 2020-05-30 22
22 2020-09-05 36 2020-06-06 23
23 2020-09-12 37 2020-06-13 24
24 2020-09-19 38 2020-06-20 25
25 2020-09-26 39 2020-06-27 26
26 2020-10-03 40 2020-07-04 27
27 2020-10-10 41 2020-07-11 28
28 2020-10-17 42 2020-07-18 29
29 2020-10-24 43 2020-07-25 30
30 2020-10-31 44 2020-08-01 31
31 2020-11-07 45 2020-08-08 32
32 2020-11-14 46 2020-08-15 33
33 2020-11-21 47 2020-08-22 34
34 2020-11-28 48 2020-08-29 35
35 2020-12-05 49 2020-09-05 36
36 2020-12-12 50 2020-09-12 37
37 2020-12-19 51 2020-09-19 38
38 2020-12-26 52 2020-09-26 39
39 2021-01-02 53 2020-10-03 40
40 2021-01-09 1 2020-10-10 41
41 2021-01-16 2 2020-10-17 42
42 2021-01-23 3 2020-10-24 43
43 2021-01-30 4 2020-10-31 44
44 2021-02-06 5 2020-11-07 45
45 2021-02-13 6 2020-11-14 46
46 2021-02-20 7 2020-11-21 47
47 2021-02-27 8 2020-11-28 48
48 2021-03-06 9 2020-12-05 49
49 2021-03-13 10 2020-12-12 50
50 2021-03-20 11 2020-12-19 51
51 2021-03-27 12 2020-12-26 52
52 2021-04-03 13 2021-01-02 53
53 2021-04-10 14 2021-01-09 1
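Side note: the two apply calls above can be replaced by the vectorized accessor. A minimal sketch, assuming pandas 1.1+ where Series.dt.isocalendar() exists:
import pandas as pd

df = pd.DataFrame(pd.date_range('2020-03-29', '2021-04-10', freq='W-SAT'),
                  columns=['week_date'])
# Vectorized equivalent of datetime.date(...).isocalendar()[1] per row.
df['week_number'] = df['week_date'].dt.isocalendar().week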

Related

New column based on last time row value equals some numbers in Pandas dataframe

I have a dataframe, sorted by date in descending order, that records the Rank of students in a class and the predicted score.
Date Student_ID Rank Predicted_Score
4/7/2021 33 2 87
13/6/2021 33 4 88
31/3/2021 33 7 88
28/2/2021 33 2 86
14/2/2021 33 10 86
31/1/2021 33 8 86
23/12/2020 33 1 81
8/11/2020 33 3 80
21/10/2020 33 3 80
23/9/2020 33 4 80
20/5/2020 33 3 80
29/4/2020 33 4 80
15/4/2020 33 2 79
26/2/2020 33 3 79
12/2/2020 33 5 79
29/1/2020 33 1 70
I want to create a column called Recent_Predicted_Score that records the last Predicted_Score where that student actually ranked in the top 3. So the desired outcome looks like:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
4/7/2021 33 2 87 86
13/6/2021 33 4 88 86
31/3/2021 33 7 88 86
28/2/2021 33 2 86 81
14/2/2021 33 10 86 81
31/1/2021 33 8 86 81
23/12/2020 33 1 81 80
8/11/2020 33 3 80 80
21/10/2020 33 3 80 80
23/9/2020 33 4 80 80
20/5/2020 33 3 80 79
29/4/2020 33 4 80 79
15/4/2020 33 2 79 79
26/2/2020 33 3 79 70
12/2/2020 33 5 79 70
29/1/2020 33 1 70
Here's what I have tried, but it doesn't quite work; I'm not sure if I am on the right track:
df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))
Thanks in advance.
Try:
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'])
df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
df = df.sort_values(['Student_ID', 'Date'], ascending=[True, False])
print(df)
Prints:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-06-13 33 4 88 86.0
2 2021-03-31 33 7 88 86.0
3 2021-02-28 33 2 86 81.0
4 2021-02-14 33 10 86 81.0
5 2021-01-31 33 8 86 81.0
6 2020-12-23 33 1 81 80.0
7 2020-11-08 33 3 80 80.0
8 2020-10-21 33 3 80 80.0
9 2020-09-23 33 4 80 80.0
10 2020-05-20 33 3 80 79.0
11 2020-04-29 33 4 80 79.0
12 2020-04-15 33 2 79 79.0
13 2020-02-26 33 3 79 70.0
14 2020-02-12 33 5 79 70.0
15 2020-01-29 33 1 70
Mask the scores where Rank is greater than 3, then group the masked column by Student_ID, shift one row up, and backward fill to propagate the most recent top-3 predicted score:
c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'].gt(3))
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())
Result
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 4/7/2021 33 2 87 86.0
1 13/6/2021 33 4 88 86.0
2 31/3/2021 33 7 88 86.0
3 28/2/2021 33 2 86 81.0
4 14/2/2021 33 10 86 81.0
5 31/1/2021 33 8 86 81.0
6 23/12/2020 33 1 81 80.0
7 8/11/2020 33 3 80 80.0
8 21/10/2020 33 3 80 80.0
9 23/9/2020 33 4 80 80.0
10 20/5/2020 33 3 80 79.0
11 29/4/2020 33 4 80 79.0
12 15/4/2020 33 2 79 79.0
13 26/2/2020 33 3 79 70.0
14 12/2/2020 33 5 79 70.0
15 29/1/2020 33 1 70 NaN
Note: Make sure your dataframe is sorted on Date in descending order.
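To see in isolation why mask plus shift(-1) plus bfill picks up the next top-3 score further down the descending-date frame, here is a minimal sketch with made-up values (not the question's data):
import pandas as pd

score = pd.Series([10, 20, 30, 40])
rank = pd.Series([1, 5, 2, 7])
masked = score.mask(rank.gt(3))  # scores with rank worse than 3 become NaN
# shift(-1) looks one row ahead (an earlier date); bfill propagates it upward.
print(masked.shift(-1).bfill().tolist())  # [30.0, 30.0, nan, nan]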
Let's assume:
there may be more than one unique Student_ID
the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
we want to preserve the index of the original dataframe
Subject to these assumptions, here's a way to do what your question asks:
df['Recent_Predicted_Score'] = df.loc[df.Rank <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = (
    df.groupby('Student_ID', sort=False)
      .apply(lambda group: group.shift(-1).bfill())
      ['Recent_Predicted_Score']
)
Explanation:
create a new column Recent_Predicted_Score containing Predicted_Score where Rank is in the top 3 and NaN otherwise
use groupby() on Student_ID with the sort argument set to False for better performance; note that groupby() preserves the order of rows within each group, so the existing descending order by Date is untouched (see the sketch after this list)
within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score.
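The order-preservation point is easy to verify. A minimal sketch with hypothetical data:
import pandas as pd

df = pd.DataFrame({'Student_ID': [33, 66, 33, 66], 'Rank': [2, 4, 1, 3]})
# Each group keeps its rows in the original (here: descending-Date) order.
print(df.groupby('Student_ID', sort=False).apply(lambda g: g.index.tolist()))
# Student_ID
# 33    [0, 2]
# 66    [1, 3]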
Sample input (with two distinct Student_ID values):
Date Student_ID Rank Predicted_Score
0 2021-07-04 33 2 87
1 2021-07-04 66 2 87
2 2021-06-13 33 4 88
3 2021-06-13 66 4 88
4 2021-03-31 33 7 88
5 2021-03-31 66 7 88
6 2021-02-28 33 2 86
7 2021-02-28 66 2 86
8 2021-02-14 33 10 86
9 2021-02-14 66 10 86
10 2021-01-31 33 8 86
11 2021-01-31 66 8 86
12 2020-12-23 33 1 81
13 2020-12-23 66 1 81
14 2020-11-08 33 3 80
15 2020-11-08 66 3 80
16 2020-10-21 33 3 80
17 2020-10-21 66 3 80
18 2020-09-23 33 4 80
19 2020-09-23 66 4 80
20 2020-05-20 33 3 80
21 2020-05-20 66 3 80
22 2020-04-29 33 4 80
23 2020-04-29 66 4 80
24 2020-04-15 33 2 79
25 2020-04-15 66 2 79
26 2020-02-26 33 3 79
27 2020-02-26 66 3 79
28 2020-02-12 33 5 79
29 2020-02-12 66 5 79
30 2020-01-29 33 1 70
31 2020-01-29 66 1 70
Output:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-07-04 66 2 87 86.0
2 2021-06-13 33 4 88 86.0
3 2021-06-13 66 4 88 86.0
4 2021-03-31 33 7 88 86.0
5 2021-03-31 66 7 88 86.0
6 2021-02-28 33 2 86 81.0
7 2021-02-28 66 2 86 81.0
8 2021-02-14 33 10 86 81.0
9 2021-02-14 66 10 86 81.0
10 2021-01-31 33 8 86 81.0
11 2021-01-31 66 8 86 81.0
12 2020-12-23 33 1 81 80.0
13 2020-12-23 66 1 81 80.0
14 2020-11-08 33 3 80 80.0
15 2020-11-08 66 3 80 80.0
16 2020-10-21 33 3 80 80.0
17 2020-10-21 66 3 80 80.0
18 2020-09-23 33 4 80 80.0
19 2020-09-23 66 4 80 80.0
20 2020-05-20 33 3 80 79.0
21 2020-05-20 66 3 80 79.0
22 2020-04-29 33 4 80 79.0
23 2020-04-29 66 4 80 79.0
24 2020-04-15 33 2 79 79.0
25 2020-04-15 66 2 79 79.0
26 2020-02-26 33 3 79 70.0
27 2020-02-26 66 3 79 70.0
28 2020-02-12 33 5 79 70.0
29 2020-02-12 66 5 79 70.0
30 2020-01-29 33 1 70 NaN
31 2020-01-29 66 1 70 NaN
Output sorted by Student_ID, Date for easier inspection:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
2 2021-06-13 33 4 88 86.0
4 2021-03-31 33 7 88 86.0
6 2021-02-28 33 2 86 81.0
8 2021-02-14 33 10 86 81.0
10 2021-01-31 33 8 86 81.0
12 2020-12-23 33 1 81 80.0
14 2020-11-08 33 3 80 80.0
16 2020-10-21 33 3 80 80.0
18 2020-09-23 33 4 80 80.0
20 2020-05-20 33 3 80 79.0
22 2020-04-29 33 4 80 79.0
24 2020-04-15 33 2 79 79.0
26 2020-02-26 33 3 79 70.0
28 2020-02-12 33 5 79 70.0
30 2020-01-29 33 1 70 NaN
1 2021-07-04 66 2 87 86.0
3 2021-06-13 66 4 88 86.0
5 2021-03-31 66 7 88 86.0
7 2021-02-28 66 2 86 81.0
9 2021-02-14 66 10 86 81.0
11 2021-01-31 66 8 86 81.0
13 2020-12-23 66 1 81 80.0
15 2020-11-08 66 3 80 80.0
17 2020-10-21 66 3 80 80.0
19 2020-09-23 66 4 80 80.0
21 2020-05-20 66 3 80 79.0
23 2020-04-29 66 4 80 79.0
25 2020-04-15 66 2 79 79.0
27 2020-02-26 66 3 79 70.0
29 2020-02-12 66 5 79 70.0
31 2020-01-29 66 1 70 NaN

Assign values from a dataframe to a new column in another dataframe based on date

I have a DataFrame with a date column, which I resample monthly:
(T1)
date_gr p
0 2017-03 24122.818182
1 2017-04 29696.000000
2 2017-05 37135.500000
3 2017-06 42871.555556
4 2017-07 46941.600000
5 2017-08 46963.750000
6 2017-09 40710.714286
7 2017-10 31212.200000
8 2017-11 28834.750000
9 2017-12 29319.666667
10 2018-01 28833.250000
11 2018-02 29657.800000
12 2018-03 28773.071429
13 2018-04 30049.142857
14 2018-05 34283.750000
15 2018-06 43694.222222
16 2018-07 51136.500000
17 2018-08 45297.250000
18 2018-09 39780.833333
19 2018-10 32073.600000
20 2018-11 28176.000000
21 2018-12 28315.250000
22 2019-01 28213.500000
23 2019-02 28886.500000
24 2019-03 26971.428571
25 2019-04 27644.875000
26 2019-05 38581.500000
27 2019-06 46501.857143
28 2019-07 50121.333333
29 2019-08 48226.250000
30 2019-09 42919.800000
31 2019-10 34589.571429
32 2019-11 29877.000000
33 2019-12 30223.000000
34 2020-01 30932.666667
35 2020-02 31630.800000
36 2020-03 27894.000000
37 2020-04 29523.000000
38 2020-05 40462.400000
39 2020-06 50798.428571
40 2020-07 51814.200000
41 2020-08 48111.714286
42 2020-09 46026.750000
43 2020-10 35544.000000
Now I need to create a new column and assign each of the monthly values above to it, based on the month. That is, if a value is for 2019-10, the new column holds that 2019-10 value for all October days from 1 to 31.
For example, we have:
(T2)
date_gr p_ins
0 2019-10-01 2122.818182
1 2019-10-02 2696.000000
2 2019-10-03 3135.500000
3 2019-10-04 4871.555556
4 2019-10-05 4941.600000
5 2019-10-06 4963.750000
6 2019-10-07 4710.714286
7 2019-10-08 3212.200000
8 2019-10-09 2834.750000
9 2019-10-10 2319.666667
10 2019-10-11 2833.250000
11 2019-10-12 2657.800000
12 2019-10-13 2773.071429
13 2019-10-14 3049.142857
14 2019-10-15 3283.750000
15 2019-10-16 4694.222222
16 2019-10-17 5136.500000
17 2019-10-18 4297.250000
18 2019-10-19 3780.833333
19 2019-10-20 3073.600000
20 2019-11-01 2176.000000
21 2019-11-02 2315.250000
22 2019-11-03 2213.500000
23 2019-11-04 2886.500000
24 2019-11-05 2971.428571
25 2019-11-06 2644.875000
26 2019-11-07 3581.500000
27 2019-11-08 4501.857143
28 2019-11-09 5121.333333
29 2019-11-10 4226.250000
30 2019-11-11 4919.800000
31 2019-11-12 3589.571429
32 2019-11-13 2877.000000
33 2019-11-14 3223.000000
34 2019-11-15 3932.666667
35 2019-11-16 3630.800000
36 2019-11-17 2894.000000
37 2019-11-18 2523.000000
38 2019-11-19 4462.400000
39 2019-11-20 5798.428571
For every row in (T2), we need to find the (T1) row whose month matches and assign its value to every day of that month. We must do this for every month and day.
output:
date_gr p_ins p
0 2019-10-01 2122.818182 34589.571429
1 2019-10-02 2696.000000 34589.571429
2 2019-10-03 3135.500000 34589.571429
3 2019-10-04 4871.555556 34589.571429
4 2019-10-05 4941.600000 34589.571429
5 2019-10-06 4963.750000 34589.571429
6 2019-10-07 4710.714286 34589.571429
7 2019-10-08 3212.200000 34589.571429
8 2019-10-09 2834.750000 34589.571429
9 2019-10-10 2319.666667 34589.571429
10 2019-10-11 2833.250000 34589.571429
11 2019-10-12 2657.800000 34589.571429
12 2019-10-13 2773.071429 34589.571429
13 2019-10-14 3049.142857 34589.571429
14 2019-10-15 3283.750000 34589.571429
15 2019-10-16 4694.222222 34589.571429
16 2019-10-17 5136.500000 34589.571429
17 2019-10-18 4297.250000 34589.571429
18 2019-10-19 3780.833333 34589.571429
19 2019-10-20 3073.600000 34589.571429
20 2019-11-01 2176.000000 29877.000000
21 2019-11-02 2315.250000 29877.000000
22 2019-11-03 2213.500000 29877.000000
23 2019-11-04 2886.500000 29877.000000
24 2019-11-05 2971.428571 29877.000000
25 2019-11-06 2644.875000 29877.000000
26 2019-11-07 3581.500000 29877.000000
27 2019-11-08 4501.857143 29877.000000
28 2019-11-09 5121.333333 29877.000000
29 2019-11-10 4226.250000 29877.000000
30 2019-11-11 4919.800000 29877.000000
31 2019-11-12 3589.571429 29877.000000
32 2019-11-13 2877.000000 29877.000000
33 2019-11-14 3223.000000 29877.000000
34 2019-11-15 3932.666667 29877.000000
35 2019-11-16 3630.800000 29877.000000
36 2019-11-17 2894.000000 29877.000000
37 2019-11-18 2523.000000 29877.000000
38 2019-11-19 4462.400000 29877.000000
39 2019-11-20 5798.428571 29877.000000
How can I do that in pandas? Thank you in advance for your help.
There are several approaches. One thing you could do is take your monthly data and convert each date to the start of the month. Then you can use merge_asof to join the two data frames.
# Add day one to each monthly value
t1.date_gr = t1.date_gr + '-01'
# Convert to datetime objects
t1.date_gr = pd.to_datetime(t1.date_gr)
t2.date_gr = pd.to_datetime(t2.date_gr)
# merge t2 to closest previous date in t1
pd.merge_asof(t2, t1, on='date_gr', direction='backward')
Output
date_gr p_ins p
0 2019-10-01 2122.818182 34589.571429
1 2019-10-02 2696.000000 34589.571429
2 2019-10-03 3135.500000 34589.571429
3 2019-10-04 4871.555556 34589.571429
4 2019-10-05 4941.600000 34589.571429
5 2019-10-06 4963.750000 34589.571429
6 2019-10-07 4710.714286 34589.571429
7 2019-10-08 3212.200000 34589.571429
8 2019-10-09 2834.750000 34589.571429
9 2019-10-10 2319.666667 34589.571429
10 2019-10-11 2833.250000 34589.571429
11 2019-10-12 2657.800000 34589.571429
12 2019-10-13 2773.071429 34589.571429
13 2019-10-14 3049.142857 34589.571429
14 2019-10-15 3283.750000 34589.571429
15 2019-10-16 4694.222222 34589.571429
16 2019-10-17 5136.500000 34589.571429
17 2019-10-18 4297.250000 34589.571429
18 2019-10-19 3780.833333 34589.571429
19 2019-10-20 3073.600000 34589.571429
20 2019-11-01 2176.000000 29877.000000
21 2019-11-02 2315.250000 29877.000000
22 2019-11-03 2213.500000 29877.000000
23 2019-11-04 2886.500000 29877.000000
24 2019-11-05 2971.428571 29877.000000
25 2019-11-06 2644.875000 29877.000000
26 2019-11-07 3581.500000 29877.000000
27 2019-11-08 4501.857143 29877.000000
28 2019-11-09 5121.333333 29877.000000
29 2019-11-10 4226.250000 29877.000000
30 2019-11-11 4919.800000 29877.000000
31 2019-11-12 3589.571429 29877.000000
32 2019-11-13 2877.000000 29877.000000
33 2019-11-14 3223.000000 29877.000000
34 2019-11-15 3932.666667 29877.000000
35 2019-11-16 3630.800000 29877.000000
36 2019-11-17 2894.000000 29877.000000
37 2019-11-18 2523.000000 29877.000000
38 2019-11-19 4462.400000 29877.000000
39 2019-11-20 5798.428571 29877.000000
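Note that merge_asof requires both frames to be sorted on the join key. A sketch of an alternative that sidesteps that requirement: derive a '%Y-%m' month key from each daily date and look up the monthly value with map (column names as in the question, data abbreviated):
import pandas as pd

t1 = pd.DataFrame({'date_gr': ['2019-10', '2019-11'],
                   'p': [34589.571429, 29877.0]})
t2 = pd.DataFrame({'date_gr': pd.to_datetime(['2019-10-01', '2019-11-20']),
                   'p_ins': [2122.818182, 5798.428571]})

# Format each daily date as its month string, then map to the monthly value.
t2['p'] = t2['date_gr'].dt.strftime('%Y-%m').map(t1.set_index('date_gr')['p'])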

How to conditionally aggregate values of previous rows of Pandas DataFrame?

I have the following example Pandas DataFrame:
UserID Total Date
1 20 2019-01-01
1 18 2019-01-04
1 22 2019-01-05
1 16 2019-01-07
1 17 2019-01-09
1 26 2019-01-11
1 30 2019-01-12
1 28 2019-01-13
1 28 2019-01-15
1 28 2019-01-16
2 22 2019-01-06
2 11 2019-01-07
2 23 2019-01-09
2 14 2019-01-13
2 19 2019-01-14
2 29 2019-01-15
2 21 2019-01-16
2 22 2019-01-18
2 30 2019-01-22
2 16 2019-01-23
3 27 2019-01-01
3 13 2019-01-04
3 12 2019-01-05
3 27 2019-01-06
3 26 2019-01-09
3 26 2019-01-10
3 30 2019-01-11
3 19 2019-01-12
3 27 2019-01-13
3 29 2019-01-14
4 29 2019-01-07
4 12 2019-01-09
4 25 2019-01-10
4 11 2019-01-11
4 19 2019-01-13
4 20 2019-01-14
4 33 2019-01-15
4 24 2019-01-18
4 22 2019-01-19
4 24 2019-01-21
My goal is to add a column named TotalPrev10Days, which is the sum of Total over the previous 10 days (for each UserID).
I did a basic implementation using nested loops and comparing the current date with a timedelta.
Here's my code:
from datetime import timedelta

users = set(df.UserID)  # get set of all unique user IDs
TotalPrev10Days = []
delta = timedelta(days=10)  # 10-day time delta to subtract from each row's date
for user in users:  # looping over all user IDs
    user_df = df[df["UserID"] == user]  # dataframe with only this user's rows
    for row_index in user_df.index:  # looping over each row of the user dataframe
        row_date = user_df["Date"][row_index]
        row_date_minus_10 = row_date - delta  # subtracting 10 days
        sum_prev_10_days = user_df[(user_df["Date"] < row_date) & (user_df["Date"] >= row_date_minus_10)]["Total"].sum()
        TotalPrev10Days.append(sum_prev_10_days)  # appending total to a list
df["TotalPrev10Days"] = TotalPrev10Days  # assigning list to new DataFrame column
While it works perfectly, it's very slow for large datasets.
Is there a faster, more Pandas-native approach to this problem?
IIUC, try:
df["TotalPrev10Days"] = df.groupby("UserID") \
.rolling("9D", on="Date") \
.sum() \
.shift() \
.fillna(0)["Total"] \
.droplevel(0)
>>> df
UserID Total Date TotalPrev10Days
0 1 20 2019-01-01 0.0
1 1 18 2019-01-04 20.0
2 1 22 2019-01-05 38.0
3 1 16 2019-01-07 60.0
4 1 17 2019-01-09 76.0
5 1 26 2019-01-11 93.0
6 1 30 2019-01-12 99.0
7 1 28 2019-01-13 129.0
8 1 28 2019-01-15 139.0
9 1 28 2019-01-16 145.0
10 2 22 2019-01-06 0.0
11 2 11 2019-01-07 22.0
12 2 23 2019-01-09 33.0
13 2 14 2019-01-13 56.0
14 2 19 2019-01-14 70.0
15 2 29 2019-01-15 89.0
16 2 21 2019-01-16 96.0
17 2 22 2019-01-18 106.0
18 2 30 2019-01-22 105.0
19 2 16 2019-01-23 121.0
20 3 27 2019-01-01 0.0
21 3 13 2019-01-04 27.0
22 3 12 2019-01-05 40.0
23 3 27 2019-01-06 52.0
24 3 26 2019-01-09 79.0
25 3 26 2019-01-10 105.0
26 3 30 2019-01-11 104.0
27 3 19 2019-01-12 134.0
28 3 27 2019-01-13 153.0
29 3 29 2019-01-14 167.0
30 4 29 2019-01-07 0.0
31 4 12 2019-01-09 29.0
32 4 25 2019-01-10 41.0
33 4 11 2019-01-11 66.0
34 4 19 2019-01-13 77.0
35 4 20 2019-01-14 96.0
36 4 33 2019-01-15 116.0
37 4 24 2019-01-18 149.0
38 4 22 2019-01-19 132.0
39 4 24 2019-01-21 129.0
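For comparison, a sketch that states the loop's window explicitly: a 10-day rolling window with closed='left' covers [date - 10 days, date), i.e. it includes the boundary day and excludes the current row, matching the original nested-loop bounds (data abbreviated to a few rows):
import pandas as pd

df = pd.DataFrame({
    'UserID': [1, 1, 1, 2, 2],
    'Total': [20, 18, 22, 22, 11],
    'Date': pd.to_datetime(['2019-01-01', '2019-01-04', '2019-01-05',
                            '2019-01-06', '2019-01-07']),
})

# Per user: rolling 10-day sum that excludes the current row (closed='left').
df['TotalPrev10Days'] = (
    df.sort_values('Date')
      .groupby('UserID', group_keys=False)
      .apply(lambda g: g.rolling('10D', on='Date', closed='left')['Total']
                        .sum().fillna(0))
)
print(df)  # user 1: 0, 20, 38; user 2: 0, 22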

Reshape data in pandas

I have a CSV file which, when read with pandas, produces the dataframe in the format below:
0 1 2 3 4 5 6
Day Time 2020-05-01 00:00 2020-05-02 00:00 2020-05-03 00:00 2020-05-04 00:00 2020-05-05 00:00
Night 23:00:00 33 45 33 23 19
Night 1900-01-01 00:00 33 45 33 23 19
Night 1900-01-01 01:00 33 45 33 23 19
Night 1900-01-01 02:00 33 45 33 23 19
Night 1900-01-01 03:00 33 41 23 23 19
Night 1900-01-01 04:00 33 41 23 23 19
Is there a way to convert the first row into a new Date column in pandas, so that the output looks like this?
0 1 2 3 4 5 6
Day Time Date
Night 23:00 2020-05-01 33 45 33 23 19
Night 00:00 2020-05-02 33 45 33 23 19
Night 01:00 2020-05-03 33 45 33 23 19
Night 02:00 2020-05-04 33 45 33 23 19
Night 03:00 2020-05-05 33 41 23 23 19
Night 04:00 2020-05-06 33 41 23 23 19
The first step is to get the column names from the second row:
df = pd.read_csv(file, header=[1])
Then split the Time column and normalize it with replace:
df['Time'] = df['Time'].str.split().str[-1].str.replace(':00:00', ':00')
Insert a new Date column at the third position with DataFrame.insert:
df.insert(2, 'Date', pd.date_range(df.columns[2], periods=len(df)))
Finally, set the new column names:
import numpy as np
df.columns = df.columns[:3].tolist() + np.arange(3, len(df.columns)).tolist()
print (df)
Day Time Date 3 4 5 6 7
0 Night 23:00 2020-05-01 33 45 33 23 19
1 Night 00:00 2020-05-02 33 45 33 23 19
2 Night 01:00 2020-05-03 33 45 33 23 19
3 Night 02:00 2020-05-04 33 45 33 23 19
4 Night 03:00 2020-05-05 33 41 23 23 19
5 Night 04:00 2020-05-06 33 41 23 23 19
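Putting the steps together as one runnable sketch (the filename is hypothetical, and header=[1] assumes the real header sits on the second physical row, as above):
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', header=[1])  # hypothetical filename
df['Time'] = df['Time'].str.split().str[-1].str.replace(':00:00', ':00')
df.insert(2, 'Date', pd.date_range(df.columns[2], periods=len(df)))
df.columns = df.columns[:3].tolist() + np.arange(3, len(df.columns)).tolist()
print(df)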

Merge 2 dataframes with same values in a column

I have 2 dataframes. One is in this form:
df1:
date revenue
0 2016-11-17 385.943800
1 2016-11-18 1074.160340
2 2016-11-19 2980.857860
3 2016-11-20 1919.723960
4 2016-11-21 884.279340
5 2016-11-22 869.071070
6 2016-11-23 760.289260
7 2016-11-24 2481.689270
8 2016-11-25 2745.990070
9 2016-11-26 2273.413250
10 2016-11-27 2630.414900
The other one is in this form:
df2:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 11 9 7 100 85 63
1 2016-11-18 9 6 3 93 83 66
2 2016-11-19 8 6 4 93 87 76
3 2016-11-20 10 7 4 93 84 81
4 2016-11-21 14 10 7 100 89 77
5 2016-11-22 13 10 7 93 79 63
6 2016-11-23 11 8 5 100 91 82
7 2016-11-24 9 7 4 93 80 66
8 2016-11-25 7 4 1 87 74 57
9 2016-11-26 7 3 -1 100 88 61
10 2016-11-27 10 7 4 100 81 66
Both dataframes have more rows, and the number of rows increases every day.
I want to combine these 2 dataframes so that whenever the same date appears in df1['date'] and df2['CET'], an extra column in df2 receives the revenue value for that date. So I want to create this:
df2:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity revenue
0 2016-11-17 11 9 7 100 85 63 385.943800
1 2016-11-18 9 6 3 93 83 66 1074.160340
2 2016-11-19 8 6 4 93 87 76 2980.857860
3 2016-11-20 10 7 4 93 84 81 1919.723960
4 2016-11-21 14 10 7 100 89 77 884.279340
5 2016-11-22 13 10 7 93 79 63 869.071070
6 2016-11-23 11 8 5 100 91 82 760.289260
7 2016-11-24 9 7 4 93 80 66 2481.689270
8 2016-11-25 7 4 1 87 74 57 2745.990070
9 2016-11-26 7 3 -1 100 88 61 2273.413250
10 2016-11-27 10 7 4 100 81 66 2630.414900
Can someone show me how to do that?
I think you can use map:
df2['revenue'] = df2.CET.map(df1.set_index('date')['revenue'])
You can also convert the Series to a dict; that is a bit faster on a large df:
df2['revenue'] = df2.CET.map(df1.set_index('date')['revenue'].to_dict())
print (df2)
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity \
0 2016-11-17 11 9 7 100 85
1 2016-11-18 9 6 3 93 83
2 2016-11-19 8 6 4 93 87
3 2016-11-20 10 7 4 93 84
4 2016-11-21 14 10 7 100 89
5 2016-11-22 13 10 7 93 79
6 2016-11-23 11 8 5 100 91
7 2016-11-24 9 7 4 93 80
8 2016-11-25 7 4 1 87 74
9 2016-11-26 7 3 -1 100 88
10 2016-11-27 10 7 4 100 81
MinHumidity revenue
0 63 385.94380
1 66 1074.16034
2 76 2980.85786
3 81 1919.72396
4 77 884.27934
5 63 869.07107
6 82 760.28926
7 66 2481.68927
8 57 2745.99007
9 61 2273.41325
10 66 2630.41490
If all output values are NaN, the problem is a dtype mismatch between the CET and date columns:
print (df1.date.dtypes)
object
print (df2.CET.dtype)
datetime64[ns]
The solution is to convert the string column with to_datetime:
df1.date = pd.to_datetime(df1.date)
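A plain left merge is an equally standard alternative to map here. A minimal sketch with the question's column names (data abbreviated):
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2016-11-17', '2016-11-18']),
                    'revenue': [385.9438, 1074.16034]})
df2 = pd.DataFrame({'CET': pd.to_datetime(['2016-11-17', '2016-11-18']),
                    'MaxTemp': [11, 9]})

# Rename the key so both frames share it, then left-join revenue onto df2.
out = df2.merge(df1.rename(columns={'date': 'CET'}), on='CET', how='left')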
The .map() solution will only work if the values in the date and CET columns match exactly.
If the values differ slightly, you can use the pd.merge_asof() method:
In [17]: pd.merge_asof(df1, df2, left_on='date', right_on='CET', tolerance=pd.Timedelta('2 hours'))
Out[17]:
date revenue CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 385.94380 2016-11-17 11 9 7 100 85 63
1 2016-11-18 1074.16034 2016-11-18 9 6 3 93 83 66
2 2016-11-19 2980.85786 2016-11-19 8 6 4 93 87 76
3 2016-11-20 1919.72396 2016-11-20 10 7 4 93 84 81
4 2016-11-21 884.27934 2016-11-21 14 10 7 100 89 77
5 2016-11-22 869.07107 2016-11-22 13 10 7 93 79 63
6 2016-11-23 760.28926 2016-11-23 11 8 5 100 91 82
7 2016-11-24 2481.68927 2016-11-24 9 7 4 93 80 66
8 2016-11-25 2745.99007 2016-11-25 7 4 1 87 74 57
9 2016-11-26 2273.41325 2016-11-26 7 3 -1 100 88 61
10 2016-11-27 2630.41490 2016-11-27 10 7 4 100 81 66
NOTE: merge_asof() function has been added in Pandas 0.19.0 (i.e. it's not available in older versions)
Demo:
In [191]: df2
Out[191]:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 01:39:00 11 9 7 100 85 63
1 2016-11-18 01:39:00 9 6 3 93 83 66
2 2016-11-19 01:39:00 8 6 4 93 87 76
3 2016-11-20 01:39:00 10 7 4 93 84 81
4 2016-11-21 01:39:00 14 10 7 100 89 77
5 2016-11-22 01:39:00 13 10 7 93 79 63
6 2016-11-23 01:39:00 11 8 5 100 91 82
7 2016-11-24 01:39:00 9 7 4 93 80 66
8 2016-11-25 01:39:00 7 4 1 87 74 57
9 2016-11-26 01:39:00 7 3 -1 100 88 61
10 2016-11-27 01:39:00 10 7 4 100 81 66
In [192]: df1
Out[192]:
date revenue
0 2016-11-17 385.94380
1 2016-11-18 1074.16034
2 2016-11-19 2980.85786
3 2016-11-20 1919.72396
4 2016-11-21 884.27934
5 2016-11-22 869.07107
6 2016-11-23 760.28926
7 2016-11-24 2481.68927
8 2016-11-25 2745.99007
9 2016-11-26 2273.41325
10 2016-11-27 2630.41490
In [193]: pd.merge_asof(df2, df1, left_on='CET', right_on='date')
Out[193]:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity date revenue
0 2016-11-17 01:39:00 11 9 7 100 85 63 2016-11-17 385.94380
1 2016-11-18 01:39:00 9 6 3 93 83 66 2016-11-18 1074.16034
2 2016-11-19 01:39:00 8 6 4 93 87 76 2016-11-19 2980.85786
3 2016-11-20 01:39:00 10 7 4 93 84 81 2016-11-20 1919.72396
4 2016-11-21 01:39:00 14 10 7 100 89 77 2016-11-21 884.27934
5 2016-11-22 01:39:00 13 10 7 93 79 63 2016-11-22 869.07107
6 2016-11-23 01:39:00 11 8 5 100 91 82 2016-11-23 760.28926
7 2016-11-24 01:39:00 9 7 4 93 80 66 2016-11-24 2481.68927
8 2016-11-25 01:39:00 7 4 1 87 74 57 2016-11-25 2745.99007
9 2016-11-26 01:39:00 7 3 -1 100 88 61 2016-11-26 2273.41325
10 2016-11-27 01:39:00 10 7 4 100 81 66 2016-11-27 2630.41490
Using the .map() method:
In [194]: df2.CET.map(df1.set_index('date')['revenue'])
Out[194]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
Name: CET, dtype: float64
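The all-NaN result above comes from the 01:39:00 time component on CET. If that extra precision is just noise, a sketch of a normalization fix:
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2016-11-17']),
                    'revenue': [385.9438]})
df2 = pd.DataFrame({'CET': pd.to_datetime(['2016-11-17 01:39:00'])})

# dt.normalize() zeroes the time of day, so the lookup keys match exactly.
df2['revenue'] = df2['CET'].dt.normalize().map(df1.set_index('date')['revenue'])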
