This is actually an extension of my previous question, but I was asked to post it as a separate question.
Rolling average on previous dates per group
I have the following dataset:
Name Loc Site Date Total
Alex Italy A 12.31.2020 30
Alex Italy B 12.31.2020 20
Alex Italy B 12.30.2020 100
Alex Italy B 12.28.2020 40
Alex Italy A 12.23.2020 80
Alex France A 12.28.2020 10
Alex France B 12.28.2020 20
Alex France B 12.23.2020 10
Alex France A 12.23.2020 100
Alex France B 12.21.2020 25
For each row, I want to add the average of Total over an arbitrary time frame before the Date, per Name, Loc and Date.
This is the outcome I'm looking for with the previous 5 days (excluding Date itself):
Name Loc Site Date Total Prv_Avg
Alex Italy A 12.31.2020 30 70
Alex Italy B 12.31.2020 20 70
Alex Italy B 12.30.2020 100 40
Alex Italy B 12.28.2020 40 80
Alex Italy A 12.23.2020 80 NaN
Alex France A 12.28.2020 10 55
Alex France B 12.28.2020 20 55
Alex France B 12.23.2020 10 25
Alex France A 12.23.2020 100 25
Alex France B 12.21.2020 25 NaN
The NaNs are for rows that have no data in the previous 5 days.
Use a custom function in GroupBy.transform that replaces the values outside the window with NaN and computes the averages with numpy.nanmean:
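import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

def f(x):
    # x is the Total column of one (Name, Loc) group, indexed by Date
    arr = x.index.to_numpy()
    s = x.to_numpy()
    prev = arr - pd.Timedelta(5, 'day')
    # keep values dated in [Date - 5 days, Date); everything else becomes NaN
    return np.nanmean(np.where((arr[:, None] > arr) &
                               (arr >= prev[:, None]), s, np.nan), axis=1)

df['Prv_Avg'] = (df.set_index('Date')
                   .groupby(['Name','Loc'])['Total']
                   .transform(f)
                   .to_numpy())
print(df)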
Name Loc Site Date Total Prv_Avg
0 Alex Italy A 2020-12-31 30 70.0
1 Alex Italy B 2020-12-31 20 70.0
2 Alex Italy B 2020-12-30 100 40.0
3 Alex Italy B 2020-12-28 40 80.0
4 Alex Italy A 2020-12-23 80 NaN
5 Alex France A 2020-12-28 10 55.0
6 Alex France B 2020-12-28 20 55.0
7 Alex France B 2020-12-23 10 25.0
8 Alex France A 2020-12-23 100 25.0
9 Alex France B 2020-12-21 25 NaN
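Since the question asks for an arbitrary time frame, here is a minimal sketch of the same broadcasting idea with the window length as a parameter. The helper name prev_window_mean and its days argument are made up for this illustration, and the frame is rebuilt from the sample data in the question. It avoids nanmean so that rows with an empty window become NaN without a warning.

```python
import numpy as np
import pandas as pd


def prev_window_mean(dates, values, days):
    """Mean of `values` dated in [date - days, date) for every row (hypothetical helper)."""
    d = dates.to_numpy()
    v = values.to_numpy(dtype=float)
    lo = d - np.timedelta64(days, "D")
    # mask[i, j] is True when row j falls inside the look-back window of row i
    mask = (d[None, :] < d[:, None]) & (d[None, :] >= lo[:, None])
    sums = np.where(mask, v[None, :], 0.0).sum(axis=1)
    counts = mask.sum(axis=1)
    # rows with an empty window get NaN; the inner where avoids dividing by zero
    return np.where(counts > 0, sums / np.where(counts > 0, counts, 1), np.nan)


df = pd.DataFrame({
    "Name": ["Alex"] * 10,
    "Loc": ["Italy"] * 5 + ["France"] * 5,
    "Site": list("ABBBAABBAB"),
    "Date": pd.to_datetime(
        ["12.31.2020", "12.31.2020", "12.30.2020", "12.28.2020", "12.23.2020",
         "12.28.2020", "12.28.2020", "12.23.2020", "12.23.2020", "12.21.2020"],
        format="%m.%d.%Y"),
    "Total": [30, 20, 100, 40, 80, 10, 20, 10, 100, 25],
})

df["Prv_Avg"] = np.nan
for _, idx in df.groupby(["Name", "Loc"]).groups.items():
    g = df.loc[idx]
    df.loc[idx, "Prv_Avg"] = prev_window_mean(g["Date"], g["Total"], days=5)

print(df)
```

The mask is quadratic in the group size, so this is only reasonable when each (Name, Loc) group is fairly small.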
How to fill the values of column ["state"] with another column ["country"] only in NaN values?
Like in this Pandas DataFrame:
state country sum
0 NaN China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 NaN India 5
5 NaN Srilanka 6
6 NaN Malaysia 7
7 NaN Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 NaN US 12
12 NaN Canada 13
What code should I use to fill the state column with the country column only where state is NaN, like this:
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
I can use this code:
df.loc[df['state'].isnull(), 'state'] = df[df['state'].isnull()]['country'].replace(df['country'])
But on a very large dataset with 300K rows, it computes for 5-6 minutes and crashes every time, because it is replacing one value at a time.
Can anyone help me with efficient code for this, please?
Perhaps using fillna without checking for isnull() and replace():
df['state'] = df['state'].fillna(df['country'])
print(df)
Output
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
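To see why this is fast, note that fillna with a Series aligns on the index and fills every missing entry in one vectorized pass. A minimal sketch on a shortened version of the sample frame (the rows are a subset picked for illustration), with Series.where as an equivalent alternative:

```python
import numpy as np
import pandas as pd

# Shortened sample: a few rows from the question's frame
df = pd.DataFrame({
    "state": [np.nan, "Assam", np.nan, "Texas", np.nan],
    "country": ["China", "India", "India", "US", "Canada"],
    "sum": [1, 2, 5, 10, 13],
})

# Option 1: fill missing states from the aligned country column
filled = df["state"].fillna(df["country"])

# Option 2: keep state where it is present, otherwise take country
via_where = df["state"].where(df["state"].notna(), df["country"])

# Assign the result back instead of relying on inplace=True
df["state"] = filled
print(df)
```

Assigning the result back to the column avoids depending on in-place modification of a column selection, which newer pandas versions discourage.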
I have a dataframe with "close_date", "open_date", "amount", "sales_rep".
sales_rep  open_date(MM/DD/YYYY)  close_date  amount
Jim        1/01/2021              2/05/2021   3
Jim        1/15/2021              4/06/2021   26
Jim        2/01/2021              2/06/2021   7
Jim        2/15/2021              3/14/2021   12
Jim        3/01/2021              4/22/2021   13
Jim        3/15/2021              3/29/2021   5
Jim        4/01/2021              4/20/2021   17
Bob        1/01/2021              1/12/2021   23
Bob        1/15/2021              2/16/2021   12
Bob        2/01/2021              3/04/2021   4
Bob        2/15/2021              4/05/2021   23
Bob        3/01/2021              3/24/2021   12
Bob        3/15/2021              4/15/2021   7
Bob        4/01/2021              5/01/2021   20
I want to create a column that tells me the open amount. If we take the second row, we can see that the opp was closed on 4/06/2021. I want to know the total amount of the opps that were open as of that date. So I would check whether the open date for row 5 was before the close date of 4/06/2021 and whether the close date for row 5 is also after 4/06/2021. In this case it is, so I would add that row's amount to the sum. I also want the current row's value to be included in the sum. This should be done for each sales rep in the dataframe. I have filled in the table with the expected values below.
sales_rep  open_date(MM/DD/YYYY)  close_date  amount  open_amount_sum
Jim        1/01/2021              2/05/2021   3       36
Jim        1/15/2021              4/06/2021   26      56
Jim        2/01/2021              2/06/2021   7       33
Jim        2/15/2021              3/14/2021   12      51
Jim        3/01/2021              4/22/2021   13      13
Jim        3/15/2021              3/29/2021   5       44
Jim        4/01/2021              4/20/2021   17      30
Bob        1/01/2021              1/12/2021   23      23
Bob        1/15/2021              2/16/2021   12      39
Bob        2/01/2021              3/04/2021   4       39
Bob        2/15/2021              4/05/2021   23      50
Bob        3/01/2021              3/24/2021   12      42
Bob        3/15/2021              4/15/2021   7       27
Bob        4/01/2021              5/01/2021   20      20
(I got the 36 in the first row by adding 3, 26, and 7: 26 and 7 are the only two other values that fit the condition, and 3 is the value for that row itself.)
Edit: RJ's solution from the comments is better. Here it is, formatted slightly differently:
df['open_amount_sum'] = df.apply(
    lambda x: df[
        df['sales_rep'].eq(x['sales_rep']) &
        df['open_date'].le(x['close_date']) &
        df['close_date'].ge(x['close_date'])
    ]['amount'].sum(),
    axis=1,
)
Here is a solution, but it is slow and kind of ugly. It can definitely be improved.
import pandas as pd
import io
# parse the date columns so the comparisons below are chronological, not string-based
df = pd.read_csv(io.StringIO(
"""
sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
), parse_dates=['open_date', 'close_date'])
# for every close date, sum the amounts of the same rep's opps that were open on that date
sum_df = df.groupby('sales_rep').apply(
    lambda g:
        g['close_date'].apply(
            lambda close:
                g.loc[
                    g['open_date'].le(close) & g['close_date'].ge(close),
                    'amount'
                ].sum())
).reset_index(level=0)
df['close_sum'] = sum_df['close_date']
df
Merge the dataframe onto itself, then filter, before grouping:
(df
 .merge(df, on='sales_rep')
 .query('open_date_y <= close_date_x <= close_date_y')
 .loc(axis=1)['sales_rep', 'open_date_x', 'close_date_x', 'amount_x', 'amount_y']
 .rename(columns=lambda col: col.removesuffix('_x'))
 .rename(columns={'amount_y': 'open_sum_amount'})
 .groupby(['sales_rep', 'open_date', 'close_date', 'amount'],
          sort=False,
          as_index=False)
 .sum()
)
sales_rep open_date close_date amount open_sum_amount
0 Jim 2021-01-01 2021-02-05 3 36
1 Jim 2021-01-15 2021-04-06 26 56
2 Jim 2021-02-01 2021-02-06 7 33
3 Jim 2021-02-15 2021-03-14 12 51
4 Jim 2021-03-01 2021-04-22 13 13
5 Jim 2021-03-15 2021-03-29 5 44
6 Jim 2021-04-01 2021-04-20 17 30
7 Bob 2021-01-01 2021-01-12 23 23
8 Bob 2021-01-15 2021-02-16 12 39
9 Bob 2021-02-01 2021-03-04 4 39
10 Bob 2021-02-15 2021-04-05 23 50
11 Bob 2021-03-01 2021-03-24 12 42
12 Bob 2021-03-15 2021-04-15 7 27
13 Bob 2021-04-01 2021-05-01 20 20
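For larger frames, the same condition can also be evaluated with a NumPy mask per sales rep instead of a row-wise apply or a self-merge. This is only a sketch under the assumption that each rep's rows fit comfortably in a square boolean matrix; the helper name open_amount_sum is made up here, and the data is the sample from the question.

```python
import io

import numpy as np
import pandas as pd

csv = """sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
df = pd.read_csv(io.StringIO(csv))
for col in ["open_date", "close_date"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")


def open_amount_sum(g):
    """Per-rep sum of amounts that were open on each row's close date (hypothetical helper)."""
    opened = g["open_date"].to_numpy()
    closed = g["close_date"].to_numpy()
    amount = g["amount"].to_numpy()
    # mask[i, j]: opp j was opened on or before row i's close date and closed on or after it
    mask = (opened[None, :] <= closed[:, None]) & (closed[None, :] >= closed[:, None])
    return pd.Series(np.where(mask, amount[None, :], 0).sum(axis=1), index=g.index)


parts = [open_amount_sum(g) for _, g in df.groupby("sales_rep", sort=False)]
# the pieces keep the original row labels, so the assignment aligns by index
df["open_amount_sum"] = pd.concat(parts)
print(df)
```

Each row is always inside its own mask, so the current row's amount is included in the sum, as the question requires.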
Hope you can help me with this.
The df looks like this.
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
For every row (ID/index), I want to get the count of metro and country columns that have data within each region, and store that count in a new column.
Regards,
RJ
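You may want to try the following (the level argument of sum was removed in pandas 2.0; there, df.T.groupby(level=0).sum().T gives the same per-region sums):
df['new'] = df.sum(level=0, axis=1)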
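Note that a sum adds up the cell values, while the question asks how many columns have data. A minimal sketch of a per-region count of non-empty cells follows, assuming the column levels are named region, country and metro and that empty cells are NaN. The tiny frame below is invented for illustration (the EMEA column in particular is not in the original data).

```python
import numpy as np
import pandas as pd

# Invented miniature of the frame: three metro columns under two regions
cols = pd.MultiIndex.from_tuples(
    [("AMER", "Brazil", "Sao Paulo"),
     ("AMER", "Canada", "Toronto"),
     ("EMEA", "France", "Paris")],  # hypothetical second region
    names=["region", "country", "metro"],
)
df = pd.DataFrame(
    [[2, np.nan, 1],
     [np.nan, np.nan, 3],
     [4, 1, np.nan]],
    index=pd.Index([321321, 23213, 231], name="ID"),
    columns=cols,
)

# Transpose so the column levels become the row index, count non-NaN cells
# per region, then transpose back to one row per ID and one column per region
counts = df.T.groupby(level="region").count().T

# Total number of non-empty columns per ID, across all regions
total = df.count(axis=1)
print(counts)
print(total)
```

The counts frame can then be joined back or assigned column by column, depending on how the new column should sit next to the MultiIndex columns.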
I have two dataframes, df and df1. I want to join both dataframes and get the output in two different ways.
df
City Date Wind Temperature
London 5/11/2019 14 5
London 6/11/2019 28 6
London 7/11/2019 10 5
Berlin 5/11/2019 23 12
Berlin 6/11/2019 24 12
Berlin 7/11/2019 16 16
Munich 5/11/2019 12 10
Munich 6/11/2019 33 11
Munich 7/11/2019 44 13
Paris 5/11/2019 27 6
Paris 6/11/2019 16 7
Paris 7/11/2019 14 8
Paris 8/11/2019 10 6
df1
ID City Delivery_Date Provider
1456223 London 7/11/2019 Amazon
1456345 London 6/11/2019 Amazon
2345623 Paris 8/11/2019 Walmart
1287456 Paris 7/11/2019 Amazon
4568971 Munich 7/11/2019 Amazon
3456789 Berlin 6/11/2019 Walmart
Output1
ID City Delivery_Date Wind Temperature
1456223 London 7/11/2019 10 5
1456345 London 6/11/2019 28 6
2345623 Paris 8/11/2019 10 6
1287456 Paris 7/11/2019 14 8
4568971 Munich 7/11/2019 44 13
Output 2
Here the weather details of the item should be displayed for every date up to its delivery date.
ID City Delivery_Date Wind Temperature
1456223 London 5/11/2019 14 5
1456223 London 6/11/2019 28 6
1456223 London 7/11/2019 10 5
1287456 Paris 5/11/2019 27 6
1287456 Paris 6/11/2019 16 7
1287456 Paris 7/11/2019 14 8
How can this be done?
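Considering df and df1 as the data frames you described:
import pandas as pd
# assumes Date and Delivery_Date were already converted with pd.to_datetime
# Output 1: weather on the delivery date of each item
output1 = pd.merge(df1, df, left_on=['City', 'Delivery_Date'], right_on=['City', 'Date'], how='inner')
# Output 2: weather rows up to the latest delivery date per city
res1 = df1.groupby('City')[['Delivery_Date']].max()
result1 = pd.merge(df, res1, on='City')
output2 = result1[result1['Date'] <= result1['Delivery_Date']]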
You can use df.merge:
import pandas as pd
df.merge(df1[['City', 'Delivery_Date', 'ID']], left_on=['City', 'Date'], right_on=['City', 'Delivery_Date'], how='inner')
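Both outputs can be sketched end to end as follows. This uses only the London and Paris rows of the sample data and assumes the dates are day/month/year, which the question does not state. The names weather and orders stand in for df and df1. For the second output, each order is joined to all weather rows of its city and then filtered to dates up to the delivery date, so the history is kept per ID.

```python
import pandas as pd

# Subset of the question's data; the day/month/year date format is an assumption
weather = pd.DataFrame({
    "City": ["London"] * 3 + ["Paris"] * 4,
    "Date": pd.to_datetime(
        ["5/11/2019", "6/11/2019", "7/11/2019",
         "5/11/2019", "6/11/2019", "7/11/2019", "8/11/2019"],
        format="%d/%m/%Y"),
    "Wind": [14, 28, 10, 27, 16, 14, 10],
    "Temperature": [5, 6, 5, 6, 7, 8, 6],
})
orders = pd.DataFrame({
    "ID": [1456223, 1456345, 2345623, 1287456],
    "City": ["London", "London", "Paris", "Paris"],
    "Delivery_Date": pd.to_datetime(
        ["7/11/2019", "6/11/2019", "8/11/2019", "7/11/2019"], format="%d/%m/%Y"),
    "Provider": ["Amazon", "Amazon", "Walmart", "Amazon"],
})

# Output 1: weather on the delivery date of each order
output1 = (orders
           .merge(weather, left_on=["City", "Delivery_Date"],
                  right_on=["City", "Date"], how="inner")
           .drop(columns=["Date", "Provider"]))

# Output 2: every weather row of the order's city up to its delivery date
joined = orders.merge(weather, on="City")
output2 = (joined[joined["Date"] <= joined["Delivery_Date"]]
           [["ID", "City", "Date", "Wind", "Temperature"]]
           .reset_index(drop=True))

print(output1)
print(output2)
```

The join on City alone multiplies rows before the filter is applied, so for large inputs it may be worth restricting the weather table to the relevant date range first.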