New column based on the last time a row's value equals certain numbers in a Pandas dataframe - python

I have a dataframe, sorted by date in descending order, that records students' Rank in class and their Predicted_Score.
Date Student_ID Rank Predicted_Score
4/7/2021 33 2 87
13/6/2021 33 4 88
31/3/2021 33 7 88
28/2/2021 33 2 86
14/2/2021 33 10 86
31/1/2021 33 8 86
23/12/2020 33 1 81
8/11/2020 33 3 80
21/10/2020 33 3 80
23/9/2020 33 4 80
20/5/2020 33 3 80
29/4/2020 33 4 80
15/4/2020 33 2 79
26/2/2020 33 3 79
12/2/2020 33 5 79
29/1/2020 33 1 70
I want to create a column called Recent_Predicted_Score that records the last Predicted_Score from an earlier date on which that student actually ranked in the top 3. So the desired outcome looks like:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
4/7/2021 33 2 87 86
13/6/2021 33 4 88 86
31/3/2021 33 7 88 86
28/2/2021 33 2 86 81
14/2/2021 33 10 86 81
31/1/2021 33 8 86 81
23/12/2020 33 1 81 80
8/11/2020 33 3 80 80
21/10/2020 33 3 80 80
23/9/2020 33 4 80 80
20/5/2020 33 3 80 79
29/4/2020 33 4 80 79
15/4/2020 33 2 79 79
26/2/2020 33 3 79 70
12/2/2020 33 5 79 70
29/1/2020 33 1 70
Here's what I have tried, but it doesn't quite work; I'm not sure if I am on the right track:
df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))
Thanks in advance.
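For reference, a minimal sketch that rebuilds the sample frame used by the answers below (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'Date': ['4/7/2021', '13/6/2021', '31/3/2021', '28/2/2021', '14/2/2021',
             '31/1/2021', '23/12/2020', '8/11/2020', '21/10/2020', '23/9/2020',
             '20/5/2020', '29/4/2020', '15/4/2020', '26/2/2020', '12/2/2020',
             '29/1/2020'],
    'Student_ID': [33] * 16,
    'Rank': [2, 4, 7, 2, 10, 8, 1, 3, 3, 4, 3, 4, 2, 3, 5, 1],
    'Predicted_Score': [87, 88, 88, 86, 86, 86, 81, 80, 80, 80,
                        80, 80, 79, 79, 79, 70],
})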

Try:
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'])
df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
df = df.sort_values(['Student_ID', 'Date'], ascending=[True, False])
print(df)
Prints:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-06-13 33 4 88 86.0
2 2021-03-31 33 7 88 86.0
3 2021-02-28 33 2 86 81.0
4 2021-02-14 33 10 86 81.0
5 2021-01-31 33 8 86 81.0
6 2020-12-23 33 1 81 80.0
7 2020-11-08 33 3 80 80.0
8 2020-10-21 33 3 80 80.0
9 2020-09-23 33 4 80 80.0
10 2020-05-20 33 3 80 79.0
11 2020-04-29 33 4 80 79.0
12 2020-04-15 33 2 79 79.0
13 2020-02-26 33 3 79 70.0
14 2020-02-12 33 5 79 70.0
15 2020-01-29 33 1 70
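The key move is ffill() followed by shift(): after the ascending sort, forward-filling carries each top-3 score onto all later rows, and shift() then pushes everything one row down so no row ever sees its own score. A toy sketch of just that step:
import pandas as pd

s = pd.Series([70.0, None, 79.0, None, None])  # top-3 scores, oldest first
print(s.ffill().shift())
# 0     NaN
# 1    70.0
# 2    70.0
# 3    79.0
# 4    79.0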

Mask the scores where Rank is greater than 3, then group the masked column by Student_ID, shift, and backward-fill to propagate the last top-3 predicted score:
c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'].gt(3))
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())
Result
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 4/7/2021 33 2 87 86.0
1 13/6/2021 33 4 88 86.0
2 31/3/2021 33 7 88 86.0
3 28/2/2021 33 2 86 81.0
4 14/2/2021 33 10 86 81.0
5 31/1/2021 33 8 86 81.0
6 23/12/2020 33 1 81 80.0
7 8/11/2020 33 3 80 80.0
8 21/10/2020 33 3 80 80.0
9 23/9/2020 33 4 80 80.0
10 20/5/2020 33 3 80 79.0
11 29/4/2020 33 4 80 79.0
12 15/4/2020 33 2 79 79.0
13 26/2/2020 33 3 79 70.0
14 12/2/2020 33 5 79 70.0
15 29/1/2020 33 1 70 NaN
Note: Make sure your dataframe is sorted on Date in descending order.
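If the frame might not already be in that order, a small sketch to enforce it (assuming Date is a day-first string, as in the question):
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # parse day-first dates
df = df.sort_values(['Student_ID', 'Date'], ascending=[True, False])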

Let's assume:
there may be more than one unique Student_ID
the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
we want to preserve the index of the original dataframe
Subject to these assumptions, here's a way to do what your question asks:
df['Recent_Predicted_Score'] = df.loc[df.Rank <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = (
    df.groupby('Student_ID', sort=False)
      .apply(lambda group: group.shift(-1).bfill())
      ['Recent_Predicted_Score']
)
Explanation:
create a new column Recent_Predicted_Score containing the Predicted_Score where Rank is in the top 3 and NaN otherwise
use groupby() on Student_ID with the sort argument set to False for better performance (note that groupby() preserves the order of rows within each group, so the existing descending order by Date is not disturbed)
within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score (an equivalent without apply is sketched below).
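A sketch of the same logic without apply (my equivalent, not part of the original answer), grouping the masked score Series directly:
s = df['Predicted_Score'].where(df['Rank'] <= 3)  # NaN outside the top 3
df['Recent_Predicted_Score'] = (
    s.groupby(df['Student_ID']).shift(-1)  # look strictly past the current row
     .groupby(df['Student_ID']).bfill()    # pull up the next top-3 score
)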
Sample input (with two distinct Student_ID values):
Date Student_ID Rank Predicted_Score
0 2021-07-04 33 2 87
1 2021-07-04 66 2 87
2 2021-06-13 33 4 88
3 2021-06-13 66 4 88
4 2021-03-31 33 7 88
5 2021-03-31 66 7 88
6 2021-02-28 33 2 86
7 2021-02-28 66 2 86
8 2021-02-14 33 10 86
9 2021-02-14 66 10 86
10 2021-01-31 33 8 86
11 2021-01-31 66 8 86
12 2020-12-23 33 1 81
13 2020-12-23 66 1 81
14 2020-11-08 33 3 80
15 2020-11-08 66 3 80
16 2020-10-21 33 3 80
17 2020-10-21 66 3 80
18 2020-09-23 33 4 80
19 2020-09-23 66 4 80
20 2020-05-20 33 3 80
21 2020-05-20 66 3 80
22 2020-04-29 33 4 80
23 2020-04-29 66 4 80
24 2020-04-15 33 2 79
25 2020-04-15 66 2 79
26 2020-02-26 33 3 79
27 2020-02-26 66 3 79
28 2020-02-12 33 5 79
29 2020-02-12 66 5 79
30 2020-01-29 33 1 70
31 2020-01-29 66 1 70
Output:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-07-04 66 2 87 86.0
2 2021-06-13 33 4 88 86.0
3 2021-06-13 66 4 88 86.0
4 2021-03-31 33 7 88 86.0
5 2021-03-31 66 7 88 86.0
6 2021-02-28 33 2 86 81.0
7 2021-02-28 66 2 86 81.0
8 2021-02-14 33 10 86 81.0
9 2021-02-14 66 10 86 81.0
10 2021-01-31 33 8 86 81.0
11 2021-01-31 66 8 86 81.0
12 2020-12-23 33 1 81 80.0
13 2020-12-23 66 1 81 80.0
14 2020-11-08 33 3 80 80.0
15 2020-11-08 66 3 80 80.0
16 2020-10-21 33 3 80 80.0
17 2020-10-21 66 3 80 80.0
18 2020-09-23 33 4 80 80.0
19 2020-09-23 66 4 80 80.0
20 2020-05-20 33 3 80 79.0
21 2020-05-20 66 3 80 79.0
22 2020-04-29 33 4 80 79.0
23 2020-04-29 66 4 80 79.0
24 2020-04-15 33 2 79 79.0
25 2020-04-15 66 2 79 79.0
26 2020-02-26 33 3 79 70.0
27 2020-02-26 66 3 79 70.0
28 2020-02-12 33 5 79 70.0
29 2020-02-12 66 5 79 70.0
30 2020-01-29 33 1 70 NaN
31 2020-01-29 66 1 70 NaN
Output sorted by Student_ID, Date for easier inspection:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
2 2021-06-13 33 4 88 86.0
4 2021-03-31 33 7 88 86.0
6 2021-02-28 33 2 86 81.0
8 2021-02-14 33 10 86 81.0
10 2021-01-31 33 8 86 81.0
12 2020-12-23 33 1 81 80.0
14 2020-11-08 33 3 80 80.0
16 2020-10-21 33 3 80 80.0
18 2020-09-23 33 4 80 80.0
20 2020-05-20 33 3 80 79.0
22 2020-04-29 33 4 80 79.0
24 2020-04-15 33 2 79 79.0
26 2020-02-26 33 3 79 70.0
28 2020-02-12 33 5 79 70.0
30 2020-01-29 33 1 70 NaN
1 2021-07-04 66 2 87 86.0
3 2021-06-13 66 4 88 86.0
5 2021-03-31 66 7 88 86.0
7 2021-02-28 66 2 86 81.0
9 2021-02-14 66 10 86 81.0
11 2021-01-31 66 8 86 81.0
13 2020-12-23 66 1 81 80.0
15 2020-11-08 66 3 80 80.0
17 2020-10-21 66 3 80 80.0
19 2020-09-23 66 4 80 80.0
21 2020-05-20 66 3 80 79.0
23 2020-04-29 66 4 80 79.0
25 2020-04-15 66 2 79 79.0
27 2020-02-26 66 3 79 70.0
29 2020-02-12 66 5 79 70.0
31 2020-01-29 66 1 70 NaN

Related

Reading multiple DataFrames from a given input

I have a couple of data frames given this way :
38 47 7 20 35
45 76 63 96 24
98 53 2 87 80
83 86 92 48 1
73 60 26 94 6
80 50 29 53 92
66 90 79 98 46
40 21 58 38 60
35 13 72 28 6
48 76 51 96 12
79 80 24 37 51
86 70 1 22 71
52 69 10 83 13
12 40 3 0 30
46 50 48 76 5
Could you please tell me how it is possible to add them to a list of dataframes?
Thanks a lot!
First read all values into one DataFrame, with missing-value separator rows (converted from the blank lines; note the whitespace separator):
df = pd.read_csv(file, header=None, sep=r'\s+', skip_blank_lines=False)
print (df)
0 1 2 3 4
0 38.0 47.0 7.0 20.0 35.0
1 45.0 76.0 63.0 96.0 24.0
2 98.0 53.0 2.0 87.0 80.0
3 83.0 86.0 92.0 48.0 1.0
4 73.0 60.0 26.0 94.0 6.0
5 NaN NaN NaN NaN NaN
6 80.0 50.0 29.0 53.0 92.0
7 66.0 90.0 79.0 98.0 46.0
8 40.0 21.0 58.0 38.0 60.0
9 35.0 13.0 72.0 28.0 6.0
10 48.0 76.0 51.0 96.0 12.0
11 NaN NaN NaN NaN NaN
12 79.0 80.0 24.0 37.0 51.0
13 86.0 70.0 1.0 22.0 71.0
14 52.0 69.0 10.0 83.0 13.0
15 12.0 40.0 3.0 0.0 30.0
16 46.0 50.0 48.0 76.0 5.0
And then use a list comprehension to create the smaller DataFrames, grouping on a cumulative count of the separator rows (dropna() removes the separator row from each chunk, and also works for the first chunk, which has none):
dfs = [g.dropna().astype(int).reset_index(drop=True)
       for _, g in df.groupby(df[0].isna().cumsum())]
print (dfs[1])
0 1 2 3 4
0 80 50 29 53 92
1 66 90 79 98 46
2 40 21 58 38 60
3 35 13 72 28 6
4 48 76 51 96 12
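An alternative sketch that avoids the NaN bookkeeping entirely, assuming the blocks live in a plain-text file (the file name here is a placeholder): split the raw text on blank lines and parse each chunk on its own.
import io
import pandas as pd

with open('data.txt') as fh:  # placeholder path
    blocks = fh.read().strip().split('\n\n')
dfs = [pd.read_csv(io.StringIO(b), sep=r'\s+', header=None) for b in blocks]
print(dfs[1])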

How to specify the week number for a given date using pandas?

I have a dataframe using
year_start = '2020-03-29'
year_end = '2021-04-10'
week_end_sat = pd.DataFrame(pd.date_range(year_start, year_end, freq=f'W-SAT'), columns=['a'])
How can I make another column specifying the week number, treating 2020-03-29 as the first day of the calendar? I am trying to build a 4-4-5 calendar, which always ends on a Saturday.
Final df that I want is,
a | count
2020-04-04 | 1
2020-04-11 | 2
.
.
.
2021-04-03 | 53 #since 2020 is a leap year there are 53 weeks otherwise it will be 52 weeks
2021-04-10 | 1
2021-04-17 | 2
.
2022-03-02 | 52
2022-04-09 | 1
I think you can create a baseline date range that starts from the first day of your given year_start.
first_day_of_year = week_end_sat.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(week_end_sat), freq='W-SAT'))
The baseline's week of year is what you want (baseline is a Series, so use .dt on it directly):
week_end_sat['count'] = baseline.dt.isocalendar().week
# print(week_end_sat)
a count
0 2020-04-04 1
1 2020-04-11 2
2 2020-04-18 3
3 2020-04-25 4
4 2020-05-02 5
5 2020-05-09 6
6 2020-05-16 7
7 2020-05-23 8
8 2020-05-30 9
9 2020-06-06 10
10 2020-06-13 11
11 2020-06-20 12
12 2020-06-27 13
13 2020-07-04 14
14 2020-07-11 15
15 2020-07-18 16
16 2020-07-25 17
17 2020-08-01 18
18 2020-08-08 19
19 2020-08-15 20
20 2020-08-22 21
21 2020-08-29 22
...
43 2021-01-30 44
44 2021-02-06 45
45 2021-02-13 46
46 2021-02-20 47
47 2021-02-27 48
48 2021-03-06 49
49 2021-03-13 50
50 2021-03-20 51
51 2021-03-27 52
52 2021-04-03 53
53 2021-04-10 1
I calculated the week number using a W-SAT frequency and the isocalendar API. I then create a baseline starting from the first day of the year and assign its week number to baseline_week_number, so each week has an associated baseline week number.
import datetime

year_start = '2020-03-29'
year_end = '2021-04-10'
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='W-SAT'), columns=['week_date'])
df['week_number'] = df['week_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
first_day_of_year = df.iloc[0, 0].replace(day=1, month=1)
baseline = pd.Series(pd.date_range(first_day_of_year, periods=len(df), freq='W-SAT'))
df['baseline_date'] = baseline
df['baseline_week_number'] = df['baseline_date'].apply(lambda row: datetime.date(row.year, row.month, row.day).isocalendar()[1])
print(df)
output:
week_date week_number baseline_date baseline_week_number
0 2020-04-04 14 2020-01-04 1
1 2020-04-11 15 2020-01-11 2
2 2020-04-18 16 2020-01-18 3
3 2020-04-25 17 2020-01-25 4
4 2020-05-02 18 2020-02-01 5
5 2020-05-09 19 2020-02-08 6
6 2020-05-16 20 2020-02-15 7
7 2020-05-23 21 2020-02-22 8
8 2020-05-30 22 2020-02-29 9
9 2020-06-06 23 2020-03-07 10
10 2020-06-13 24 2020-03-14 11
11 2020-06-20 25 2020-03-21 12
12 2020-06-27 26 2020-03-28 13
13 2020-07-04 27 2020-04-04 14
14 2020-07-11 28 2020-04-11 15
15 2020-07-18 29 2020-04-18 16
16 2020-07-25 30 2020-04-25 17
17 2020-08-01 31 2020-05-02 18
18 2020-08-08 32 2020-05-09 19
19 2020-08-15 33 2020-05-16 20
20 2020-08-22 34 2020-05-23 21
21 2020-08-29 35 2020-05-30 22
22 2020-09-05 36 2020-06-06 23
23 2020-09-12 37 2020-06-13 24
24 2020-09-19 38 2020-06-20 25
25 2020-09-26 39 2020-06-27 26
26 2020-10-03 40 2020-07-04 27
27 2020-10-10 41 2020-07-11 28
28 2020-10-17 42 2020-07-18 29
29 2020-10-24 43 2020-07-25 30
30 2020-10-31 44 2020-08-01 31
31 2020-11-07 45 2020-08-08 32
32 2020-11-14 46 2020-08-15 33
33 2020-11-21 47 2020-08-22 34
34 2020-11-28 48 2020-08-29 35
35 2020-12-05 49 2020-09-05 36
36 2020-12-12 50 2020-09-12 37
37 2020-12-19 51 2020-09-19 38
38 2020-12-26 52 2020-09-26 39
39 2021-01-02 53 2020-10-03 40
40 2021-01-09 1 2020-10-10 41
41 2021-01-16 2 2020-10-17 42
42 2021-01-23 3 2020-10-24 43
43 2021-01-30 4 2020-10-31 44
44 2021-02-06 5 2020-11-07 45
45 2021-02-13 6 2020-11-14 46
46 2021-02-20 7 2020-11-21 47
47 2021-02-27 8 2020-11-28 48
48 2021-03-06 9 2020-12-05 49
49 2021-03-13 10 2020-12-12 50
50 2021-03-20 11 2020-12-19 51
51 2021-03-27 12 2020-12-26 52
52 2021-04-03 13 2021-01-02 53
53 2021-04-10 14 2021-01-09 1
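If you only need the final count column, a compact sketch of the same idea (baseline_week_number above is exactly that count; assumes Timestamp.isocalendar() is available):
import pandas as pd

year_start, year_end = '2020-03-29', '2021-04-10'
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='W-SAT'),
                  columns=['week_date'])
first_day_of_year = df.iloc[0, 0].replace(day=1, month=1)
baseline = pd.date_range(first_day_of_year, periods=len(df), freq='W-SAT')
df['count'] = [ts.isocalendar()[1] for ts in baseline]  # baseline week numbers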

Pandas: Perform operation on various columns and create, rename new columns

We have a dataframe 'A' with 5 columns, and we want to add the rolling mean of each column, we could do:
A = pd.DataFrame(np.random.randint(100, size=(5, 5)))
for i in range(0, 5):
    A[i + 6] = A[i].rolling(3).mean()
If however 'A' has column named 'A', 'B'...'E':
A = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                 columns=['A', 'B', 'C', 'D', 'E'])
How could we neatly add 5 columns with the rolling mean, and each name being ['A_mean', 'B_mean', ....'E_mean']?
Try this (looping over A, the frame in question):
for col in A:
    A[col + '_mean'] = A[col].rolling(3).mean()
Output with your way:
0 1 2 3 4 6 7 8 9 10
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
And the output with mine:
A B C D E A_mean B_mean C_mean D_mean E_mean
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
Without loops:
pd.concat([A, A.apply(lambda x: x.rolling(3).mean())
              .rename(columns={col: str(col) + '_mean' for col in A})], axis=1)
A B C D E A_mean B_mean C_mean D_mean E_mean
0 67 54 85 61 62 NaN NaN NaN NaN NaN
1 44 53 30 80 58 NaN NaN NaN NaN NaN
2 10 59 14 39 12 40.333333 55.333333 43.0 60.000000 44.000000
3 47 25 58 93 38 33.666667 45.666667 34.0 70.666667 36.000000
4 73 80 30 51 77 43.333333 54.666667 34.0 61.000000 42.333333
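An even shorter loop-free sketch (my variant, not from the answer above): rolling() already works column-wise on the whole frame, and add_suffix() renames every result column in one call:
out = pd.concat([A, A.rolling(3).mean().add_suffix('_mean')], axis=1)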

Merge 2 dataframes with same values in a column

I have 2 dataframes. One is in this form:
df1:
date revenue
0 2016-11-17 385.943800
1 2016-11-18 1074.160340
2 2016-11-19 2980.857860
3 2016-11-20 1919.723960
4 2016-11-21 884.279340
5 2016-11-22 869.071070
6 2016-11-23 760.289260
7 2016-11-24 2481.689270
8 2016-11-25 2745.990070
9 2016-11-26 2273.413250
10 2016-11-27 2630.414900
The other one is in this form:
df2:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 11 9 7 100 85 63
1 2016-11-18 9 6 3 93 83 66
2 2016-11-19 8 6 4 93 87 76
3 2016-11-20 10 7 4 93 84 81
4 2016-11-21 14 10 7 100 89 77
5 2016-11-22 13 10 7 93 79 63
6 2016-11-23 11 8 5 100 91 82
7 2016-11-24 9 7 4 93 80 66
8 2016-11-25 7 4 1 87 74 57
9 2016-11-26 7 3 -1 100 88 61
10 2016-11-27 10 7 4 100 81 66
Both dataframes have more rows and the number of rows will be increasing every day.
I want to combine these 2 dataframes so that every time we see the same date in df1['date'] and df2['CET'], we add an extra column to df2 holding the revenue value for that date. So I want to create this:
df2:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity revenue
0 2016-11-17 11 9 7 100 85 63 385.943800
1 2016-11-18 9 6 3 93 83 66 1074.160340
2 2016-11-19 8 6 4 93 87 76 2980.857860
3 2016-11-20 10 7 4 93 84 81 1919.723960
4 2016-11-21 14 10 7 100 89 77 884.279340
5 2016-11-22 13 10 7 93 79 63 869.071070
6 2016-11-23 11 8 5 100 91 82 760.289260
7 2016-11-24 9 7 4 93 80 66 2481.689270
8 2016-11-25 7 4 1 87 74 57 2745.990070
9 2016-11-26 7 3 -1 100 88 61 2273.413250
10 2016-11-27 10 7 4 100 81 66 2630.414900
Can someone show me how to do that?
I think you can use map:
df2['revenue'] = df2.CET.map(df1.set_index('date')['revenue'])
You can also convert the Series to a dict, which is a bit faster on large DataFrames:
df2['revenue'] = df2.CET.map(df1.set_index('date')['revenue'].to_dict())
print (df2)
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity \
0 2016-11-17 11 9 7 100 85
1 2016-11-18 9 6 3 93 83
2 2016-11-19 8 6 4 93 87
3 2016-11-20 10 7 4 93 84
4 2016-11-21 14 10 7 100 89
5 2016-11-22 13 10 7 93 79
6 2016-11-23 11 8 5 100 91
7 2016-11-24 9 7 4 93 80
8 2016-11-25 7 4 1 87 74
9 2016-11-26 7 3 -1 100 88
10 2016-11-27 10 7 4 100 81
MinHumidity revenue
0 63 385.94380
1 66 1074.16034
2 76 2980.85786
3 81 1919.72396
4 77 884.27934
5 63 869.07107
6 82 760.28926
7 66 2481.68927
8 57 2745.99007
9 61 2273.41325
10 66 2630.41490
If all output values are NaN, the problem is different dtypes of the CET and date columns:
print (df1.date.dtypes)
object
print (df2.CET.dtype)
datetime64[ns]
The solution is to convert the string column with to_datetime:
df1.date = pd.to_datetime(df1.date)
The .map() solution will work only if you have exactly the same values in the date and CET columns.
If you have slightly different values, you can use the pd.merge_asof() method:
In [17]: pd.merge_asof(df1, df2, left_on='date', right_on='CET', tolerance=pd.Timedelta('2 hours'))
Out[17]:
date revenue CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 385.94380 2016-11-17 11 9 7 100 85 63
1 2016-11-18 1074.16034 2016-11-18 9 6 3 93 83 66
2 2016-11-19 2980.85786 2016-11-19 8 6 4 93 87 76
3 2016-11-20 1919.72396 2016-11-20 10 7 4 93 84 81
4 2016-11-21 884.27934 2016-11-21 14 10 7 100 89 77
5 2016-11-22 869.07107 2016-11-22 13 10 7 93 79 63
6 2016-11-23 760.28926 2016-11-23 11 8 5 100 91 82
7 2016-11-24 2481.68927 2016-11-24 9 7 4 93 80 66
8 2016-11-25 2745.99007 2016-11-25 7 4 1 87 74 57
9 2016-11-26 2273.41325 2016-11-26 7 3 -1 100 88 61
10 2016-11-27 2630.41490 2016-11-27 10 7 4 100 81 66
NOTE: the merge_asof() function was added in pandas 0.19.0 (i.e. it's not available in older versions).
Demo:
In [191]: df2
Out[191]:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity
0 2016-11-17 01:39:00 11 9 7 100 85 63
1 2016-11-18 01:39:00 9 6 3 93 83 66
2 2016-11-19 01:39:00 8 6 4 93 87 76
3 2016-11-20 01:39:00 10 7 4 93 84 81
4 2016-11-21 01:39:00 14 10 7 100 89 77
5 2016-11-22 01:39:00 13 10 7 93 79 63
6 2016-11-23 01:39:00 11 8 5 100 91 82
7 2016-11-24 01:39:00 9 7 4 93 80 66
8 2016-11-25 01:39:00 7 4 1 87 74 57
9 2016-11-26 01:39:00 7 3 -1 100 88 61
10 2016-11-27 01:39:00 10 7 4 100 81 66
In [192]: df1
Out[192]:
date revenue
0 2016-11-17 385.94380
1 2016-11-18 1074.16034
2 2016-11-19 2980.85786
3 2016-11-20 1919.72396
4 2016-11-21 884.27934
5 2016-11-22 869.07107
6 2016-11-23 760.28926
7 2016-11-24 2481.68927
8 2016-11-25 2745.99007
9 2016-11-26 2273.41325
10 2016-11-27 2630.41490
In [193]: pd.merge_asof(df2, df1, left_on='CET', right_on='date')
Out[193]:
CET MaxTemp MeanTemp MinTemp MaxHumidity MeanHumidity MinHumidity date revenue
0 2016-11-17 01:39:00 11 9 7 100 85 63 2016-11-17 385.94380
1 2016-11-18 01:39:00 9 6 3 93 83 66 2016-11-18 1074.16034
2 2016-11-19 01:39:00 8 6 4 93 87 76 2016-11-19 2980.85786
3 2016-11-20 01:39:00 10 7 4 93 84 81 2016-11-20 1919.72396
4 2016-11-21 01:39:00 14 10 7 100 89 77 2016-11-21 884.27934
5 2016-11-22 01:39:00 13 10 7 93 79 63 2016-11-22 869.07107
6 2016-11-23 01:39:00 11 8 5 100 91 82 2016-11-23 760.28926
7 2016-11-24 01:39:00 9 7 4 93 80 66 2016-11-24 2481.68927
8 2016-11-25 01:39:00 7 4 1 87 74 57 2016-11-25 2745.99007
9 2016-11-26 01:39:00 7 3 -1 100 88 61 2016-11-26 2273.41325
10 2016-11-27 01:39:00 10 7 4 100 81 66 2016-11-27 2630.41490
Using the .map() method (it fails here, because the timestamps do not match exactly):
In [194]: df2.CET.map(df1.set_index('date')['revenue'])
Out[194]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
Name: CET, dtype: float64
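When the keys match exactly, a plain left merge is another option (my sketch, not from the answers above; the same dtype caveat applies):
df2 = df2.merge(df1.rename(columns={'date': 'CET'}), on='CET', how='left')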

Appending or Adding Rows in Pandas Dataframe

In the following DataFrame I would like to add rows whenever a group in column A has fewer rows than the previous group.
For example, in the following table group 60 in column A appears 12 times, whereas group 61 appears only 9 times. I would like to add rows after the last record of group 61 and copy the values in columns B, C, D from the corresponding rows of group 60. Similar operation for group 62, and so on.
A B C D
0 60 0.235 4 7.86
1 60 1.235 5 8.86
2 60 2.235 6 9.86
3 60 3.235 7 10.86
4 60 4.235 8 11.86
5 60 5.235 9 12.86
6 60 6.235 10 13.86
7 60 7.235 11 14.86
8 60 8.235 12 15.86
9 60 9.235 13 16.86
10 60 10.235 14 17.86
11 60 11.235 15 18.86
12 61 12.235 16 19.86
13 61 13.235 17 20.86
14 61 14.235 18 21.86
15 61 15.235 19 22.86
16 61 16.235 20 23.86
17 61 17.235 21 24.86
18 61 18.235 22 25.86
19 61 19.235 23 26.86
20 61 20.235 24 27.86
21 62 20.235 24 28.86
22 62 20.235 24 29.86
23 62 20.235 24 30.86
24 62 20.235 24 31.86
25 62 20.235 24 32.86
You can use:
# cumulative count per group
df['G'] = df.groupby('A').cumcount()
df = (df.groupby(['A', 'G'])
        .first()                           # aggregate first value per (A, G)
        .unstack()                         # reshape the DataFrame
        .ffill()                           # same as fillna(method='ffill')
        .stack()                           # get the original shape back
        .reset_index(drop=True, level=1)   # remove level G from the index
        .reset_index())
print (df)
A B C D
0 60 0.235 4.0 7.86
1 60 1.235 5.0 8.86
2 60 2.235 6.0 9.86
3 60 3.235 7.0 10.86
4 60 4.235 8.0 11.86
5 60 5.235 9.0 12.86
6 60 6.235 10.0 13.86
7 60 7.235 11.0 14.86
8 60 8.235 12.0 15.86
9 60 9.235 13.0 16.86
10 60 10.235 14.0 17.86
11 60 11.235 15.0 18.86
12 61 12.235 16.0 19.86
13 61 13.235 17.0 20.86
14 61 14.235 18.0 21.86
15 61 15.235 19.0 22.86
16 61 16.235 20.0 23.86
17 61 17.235 21.0 24.86
18 61 18.235 22.0 25.86
19 61 19.235 23.0 26.86
20 61 20.235 24.0 27.86
21 61 9.235 13.0 16.86
22 61 10.235 14.0 17.86
23 61 11.235 15.0 18.86
24 62 20.235 24.0 28.86
25 62 20.235 24.0 29.86
26 62 20.235 24.0 30.86
27 62 20.235 24.0 31.86
28 62 20.235 24.0 32.86
29 62 17.235 21.0 24.86
30 62 18.235 22.0 25.86
31 62 19.235 23.0 26.86
32 62 20.235 24.0 27.86
33 62 9.235 13.0 16.86
34 62 10.235 14.0 17.86
35 62 11.235 15.0 18.86
Another solution with pivot_table:
df['G'] = df.groupby('A').cumcount()
df = (df.pivot_table(index='A', columns='G')
        .ffill()
        .stack()
        .reset_index(drop=True, level=1)
        .reset_index())
print (df)
A B C D
0 60 0.235 4.0 7.86
1 60 1.235 5.0 8.86
2 60 2.235 6.0 9.86
3 60 3.235 7.0 10.86
4 60 4.235 8.0 11.86
5 60 5.235 9.0 12.86
6 60 6.235 10.0 13.86
7 60 7.235 11.0 14.86
8 60 8.235 12.0 15.86
9 60 9.235 13.0 16.86
10 60 10.235 14.0 17.86
11 60 11.235 15.0 18.86
12 61 12.235 16.0 19.86
13 61 13.235 17.0 20.86
14 61 14.235 18.0 21.86
15 61 15.235 19.0 22.86
16 61 16.235 20.0 23.86
17 61 17.235 21.0 24.86
18 61 18.235 22.0 25.86
19 61 19.235 23.0 26.86
20 61 20.235 24.0 27.86
21 61 9.235 13.0 16.86
22 61 10.235 14.0 17.86
23 61 11.235 15.0 18.86
24 62 20.235 24.0 28.86
25 62 20.235 24.0 29.86
26 62 20.235 24.0 30.86
27 62 20.235 24.0 31.86
28 62 20.235 24.0 32.86
29 62 17.235 21.0 24.86
30 62 18.235 22.0 25.86
31 62 19.235 23.0 26.86
32 62 20.235 24.0 27.86
33 62 9.235 13.0 16.86
34 62 10.235 14.0 17.86
35 62 11.235 15.0 18.86
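A quick sanity check (a sketch, not part of the original answer) confirms that every group now has the same number of rows:
print(df.groupby('A').size())
# A
# 60    12
# 61    12
# 62    12
# dtype: int64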
