Merge with groupby and where in Pandas (Python)

I have two tables like this:
Customr Issue Date_Issue
1 1 01/01/2019
1 2 03/06/2019
1 3 04/07/2019
1 4 13/09/2019
2 5 01/02/2019
2 6 16/03/2019
2 7 20/08/2019
2 8 30/08/2019
2 9 01/09/2019
3 10 01/02/2019
3 11 03/02/2019
3 12 05/03/2019
3 13 20/04/2019
3 14 25/04/2019
3 15 13/05/2019
3 16 20/05/2019
3 17 25/05/2019
3 18 01/06/2019
3 19 03/07/2019
3 20 20/08/2019
Customr Date_Survey df_Score
1 06/04/2019 10
2 10/06/2019 9
3 01/08/2019 3
And I need to obtain the number of issues for each customer in the three months before the date of the survey.
But I cannot get this query to work in Pandas.
#first table (issues)
import pandas as pd

df1 = pd.DataFrame({
    "Customr": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "Issue": list(range(1, 21)),
    "Date_Issue": ["01/01/2019", "03/06/2019", "04/07/2019", "13/09/2019",
                   "01/02/2019", "16/03/2019", "20/08/2019", "30/08/2019", "01/09/2019",
                   "01/02/2019", "03/02/2019", "05/03/2019", "20/04/2019", "25/04/2019",
                   "13/05/2019", "20/05/2019", "25/05/2019", "01/06/2019", "03/07/2019", "20/08/2019"],
})

#second table (survey)
df2 = pd.DataFrame({
    "Customr": [1, 2, 3],
    "Date_Survey": ["06/04/2019", "10/06/2019", "01/08/2019"],
    "df_Score": [10, 9, 3],
})
I expect this result:
Customr Date_Survey df_Score Count_issues
1 06/04/2019 10 0
2 10/06/2019 9 1
3 01/08/2019 3 5

Use:
#convert columns to datetimes
df1['Date_Issue'] = pd.to_datetime(df1['Date_Issue'], dayfirst=True)
df2['Date_Survey'] = pd.to_datetime(df2['Date_Survey'], dayfirst=True)
#create datetimes for 3 months before
df2['Date1'] = df2['Date_Survey'] - pd.offsets.DateOffset(months=3)
#merge together
df = df1.merge(df2, on='Customr')
#filter by between, select only Customr and get counts
s = df.loc[df['Date_Issue'].between(df['Date1'], df['Date_Survey']), 'Customr'].value_counts()
#map to new column and replace NaNs to 0
df2['Count_issues'] = df2['Customr'].map(s).fillna(0, downcast='int')
print (df2)
Customr Date_Survey df_Score Date1 Count_issues
0 1 2019-04-06 10 2019-01-06 0
1 2 2019-06-10 9 2019-03-10 1
2 3 2019-08-01 3 2019-05-01 5
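An equivalent variant (a sketch reusing the merged frame df from above): count the in-window rows with groupby + size instead of value_counts + map.
#count issues inside each customer's window and map back to df2
in_window = df['Date_Issue'].between(df['Date1'], df['Date_Survey'])
counts = df.loc[in_window].groupby('Customr').size()
df2['Count_issues'] = df2['Customr'].map(counts).fillna(0).astype(int)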


Pandas Merge Columns with Priority

My input dataframe:
   MinA  MinB  MaxA  MaxB
0     1     2     5     7
1     1     0     8     6
2     2   NaN    15    15
3   NaN     3   NaN   NaN
4   NaN   NaN   NaN    10
I want to merge the "Min" and "Max" columns amongst themselves with priority (the A columns have priority over the B columns).
If both columns are null, they should get default values: 0 for Min and 100 for Max.
Desired output is:
   MinA  MinB  MaxA  MaxB  Min  Max
0     1     2     5     7    1    5
1     1     0     8     6    1    8
2     2   NaN    15    15    2   15
3   NaN     3   NaN   NaN    3  100
4   NaN   NaN   NaN    10    0   10
Could you please help me with this?
This can be accomplished using mask. With your data that would look like the following:
df = pd.DataFrame({
'MinA': [1,1,2,None,None],
'MinB': [2,0,None,3,None],
'MaxA': [5,8,15,None,None],
'MaxB': [7,6,15,None,10],
})
# Create the new column, using A as the base; where it is NaN, use B.
# Then do the same again with the specified default values.
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 NaN 15 15 2 15
3 NaN 3 NaN NaN 3 100
4 NaN NaN NaN 10 0 10
Just using fillna() will be fine:
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
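If you prefer to spell the priority out, Series.combine_first gives the same "A, else B, else default" result (a sketch):
df['Min'] = df['MinA'].combine_first(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].combine_first(df['MaxB']).fillna(100)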

How to Transpose multiple columns into multiple rows but retain primary keys as is using Pandas

I have a dataframe which can be generated from the code given below:
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
                   'date1derived': [0, 0, 0],
                   'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'],
                   'date2derived': [0, 0, 0],
                   'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'],
                   'date3derived': [0, 0, 0],
                   'val3': [7, 9, 11]})
The dataframe is wide: each person has date1/val1, date2/val2 and date3/val3 columns (the original post showed it as a screenshot).
I would like each person's date/value pairs as separate rows rather than as columns. In addition, I want the date1derived, date2derived and date3derived columns to be dropped.
I tried the approaches below, but they didn't produce the expected output:
1) df.set_index(['person_id']).stack()/unstack
2) df.set_index(['person_id','date1','date2','date3']).stack()/unstack()
3) df.set_index('person_id').unstack()/stack
How can I get an output like this? I have more than 600 columns, so I don't think writing the column names out manually would help me.
This is a wide_to_long problem:
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
date val
person_id grp
1 1 12/31/2007 2
2 12/31/2017 1
3 12/31/2027 7
2 1 11/25/2009 4
2 11/25/2019 3
3 11/25/2029 9
3 1 10/06/2005 6
2 10/06/2015 5
3 10/06/2025 11
To match your expected output:
df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
df = df.reset_index(level=1, drop=True).reset_index()
person_id date val
0 1 12/31/2007 2
1 1 12/31/2017 1
2 1 12/31/2027 7
3 2 11/25/2009 4
4 2 11/25/2019 3
5 2 11/25/2029 9
6 3 10/06/2005 6
7 3 10/06/2015 5
8 3 10/06/2025 11
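The *derived columns the question asks to drop can be removed up front, before reshaping (a minimal sketch):
#drop every column whose name ends with 'derived'
df = df.drop(columns=[c for c in df.columns if c.endswith('derived')])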
You can do it without wide_to_long(), using just append():
df2 = pd.DataFrame()
for i in range(1, 4):
    new_df = df[['person_id', f'date{i}', f'val{i}']]
    new_df.columns = ['person_id', 'date', 'val']
    df2 = df2.append(new_df)
df2.sort_values('person_id').reset_index(drop=True)
output:
person_id date val
0 1 12/31/2007 2
1 1 12/31/2017 1
2 1 12/31/2027 7
3 2 11/25/2009 4
4 2 11/25/2019 3
5 2 11/25/2029 9
6 3 10/06/2005 6
7 3 10/06/2015 5
8 3 10/06/2025 11
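Note: DataFrame.append was removed in pandas 2.0, so on current versions the same loop can be written with pd.concat (a sketch under that assumption):
frames = []
for i in range(1, 4):
    part = df[['person_id', f'date{i}', f'val{i}']].copy()
    part.columns = ['person_id', 'date', 'val']
    frames.append(part)
df2 = pd.concat(frames, ignore_index=True)
df2.sort_values('person_id').reset_index(drop=True)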

Subtracting Rows based on ID Column - Pandas

I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find the number of days the user left as a gap, so I want a column computed for each row for each user, and my dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (First gap will be 0 for all users)
1 2010-10-01 8 15 (11th Sept+5=16th Sept; 1st Oct - 16th Sept=15days)
1 2010-10-28 1 9
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May+12 days=18th May)
3 2010-08-09 10 0
3 2010-09-25 4 36
3 2010-10-01 2 2
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
Here is one approach using groupby + shift:
# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])
# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')
# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))
# convert to days and then integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)
print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
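The same gaps can also be computed with groupby + diff instead of rebuilding the start dates, a sketch assuming the frame is already sorted as above:
# difference between consecutive watch dates per user, minus the previous Days_not_watch
watch_diff = df.groupby('UserId')['Date_watched'].diff()
not_watched = pd.to_timedelta(df.groupby('UserId')['Days_not_watch'].shift(), unit='d')
df['Gap'] = (watch_diff - not_watched).dt.days.fillna(0).astype(int)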

Expanding mean grouped by multiple columns in pandas

I have a dataframe for which I'd like to calculate an expanding mean over one column (quiz_score), but I need to group by two different columns (userid and week). The data looks like this:
data = {"userid": ['1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2'],
        "week": [1, 1, 2, 2, 3, 3, 4, 4, 1, 2, 2, 3, 3, 4, 4, 5],
        "quiz_score": [12, 14, 14, 15, 9, 15, 11, 14, 15, 14, 15, 13, 15, 10, 14, 14]}
>>> df = pd.DataFrame(data, columns = ['userid', 'week', 'quiz_score'])
>>> df
userid week quiz_score
0 1 1 12
1 1 1 14
2 1 2 14
3 1 2 15
4 1 3 9
5 1 3 15
6 1 4 11
7 1 4 14
8 2 1 15
9 2 2 14
10 2 2 15
11 2 3 13
12 2 3 15
13 2 4 10
14 2 4 14
15 2 5 14
I need to calculate expanding means by userid over each week: for each user in each week, I need their average quiz score over all preceding weeks. I know that a solution will involve shift() and pd.expanding_mean() or .expanding().mean() in some form, but I've been unable to get the grouping and shifting correct. Even when I try without shifting, the results aren't grouped properly and seem to be just an expanding mean across the rows, as if there were no grouping at all:
df.groupby(['userid', 'week']).apply(pd.expanding_mean).reset_index()
To be clear, the correct result would look like this:
userid week expanding_mean_quiz_score
0 1 1 NA
1 1 2 13
2 1 3 13.75
3 1 4 13.166666
4 1 5 13
5 1 6 13
6 2 1 NA
7 2 2 15
8 2 3 14.666666
9 2 4 14.4
10 2 5 13.714
11 2 6 13.75
Note that the expanding_mean_quiz_score for each user/week is the mean of the scores for that user across all previous weeks.
Thanks for your help, I've never used expanding_mean() and am stumped here.
You can group by 'userid' and 'week' and keep track of the total score and count for those groupings. Then take the cumulative sum within each user and shift it by one week, so only preceding weeks are accumulated. Finally, get the desired column by dividing the two accumulations.
a=df.groupby(['userid', 'week'])['quiz_score'].agg(('sum', 'count'))
a = a.reindex(pd.MultiIndex.from_product([['1', '2'], range(1,7)], names=['userid', 'week']))
b = a.groupby(level=0).cumsum().groupby(level=0).shift(1)
b['em_quiz_score'] = b['sum'] / b['count']
c = b.reset_index().drop(['count', 'sum'], axis=1)
d = c.groupby('userid').fillna(method='ffill')
d['userid'] = c['userid']
d = d[['userid', 'week', 'em_quiz_score']]
userid week em_quiz_score
0 1 1 NaN
1 1 2 13.000000
2 1 3 13.750000
3 1 4 13.166667
4 1 5 13.000000
5 1 6 13.000000
6 2 1 NaN
7 2 2 15.000000
8 2 3 14.666667
9 2 4 14.400000
10 2 5 13.714286
11 2 6 13.750000
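A slightly more compact variant of the same idea (a sketch): fill the missing weeks with zeros when reindexing, so no forward-fill is needed at the end.
weekly = df.groupby(['userid', 'week'])['quiz_score'].agg(['sum', 'count'])
weekly = weekly.reindex(pd.MultiIndex.from_product([['1', '2'], range(1, 7)],
                                                   names=['userid', 'week']), fill_value=0)
cum = weekly.groupby(level='userid').cumsum().groupby(level='userid').shift()
result = (cum['sum'] / cum['count']).rename('expanding_mean_quiz_score').reset_index()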

Select values in Pandas groupby dataframe that are present in n previous groups

I have a Pandas dataframe groupby object which looks like the following:
ID
2014-11-30 1
2
3
2014-12-31 1
2
3
4
2015-01-31 2
3
4
2015-02-28 1
3
4
5
2015-03-31 1
2
4
5
6
2015-04-30 3
4
5
6
What I want to do is create another dataframe where the values for group date x are the values that appear in each of the n previous group dates, x-1 through x-n. So, for instance, if n=1 and the group date is '2015-04-30', you would check against '2015-03-31'. If n=2 and the group date is '2015-02-28', you would check against the group dates ['2015-01-31', '2014-12-31'].
The resulting dataframe from the above would look like this for n=1:
ID
2014-12-31 1
2
3
2015-01-31 2
3
4
2015-02-28 3
4
2015-03-31 1
4
5
2015-04-30 4
5
6
The resulting dataframe for n=2 would be:
2015-01-31 2
3
2015-02-28 3
4
2015-03-31 4
2015-04-30 4
5
Looking forward to some pythonic solutions!
This would seem to work:
def filter_unique(df, n):
    data_by_date = df.groupby('date')['ID'].apply(lambda x: x.tolist())
    filtered_data = {}
    previous = []
    for i, (date, data) in enumerate(data_by_date.items()):
        if i >= n:
            if len(previous) == 1:
                filtered_data[date] = list(set(previous[i-n]).intersection(data))
            else:
                filtered_data[date] = list(set.intersection(*[set(x) for x in previous[i-n:]]).intersection(data))
        else:
            filtered_data[date] = data
        previous.append(data)
    result = pd.DataFrame.from_dict(filtered_data, orient='index').stack()
    result.index = result.index.droplevel(1)
    return result

filter_unique(df, 2)
1/31/15 2
1/31/15 3
1/31/15 4
11/30/14 1
11/30/14 2
11/30/14 3
12/31/14 2
12/31/14 3
2/28/15 1
2/28/15 3
3/31/15 1
3/31/15 4
4/30/15 4
4/30/15 5
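For reference, a minimal construction of the sample data that filter_unique assumes (the column names 'date' and 'ID' are assumptions, since the question only shows a groupby display); parsing the dates with pd.to_datetime keeps the groupby keys in chronological rather than string order:
import pandas as pd

dates = (['2014-11-30'] * 3 + ['2014-12-31'] * 4 + ['2015-01-31'] * 3 +
         ['2015-02-28'] * 4 + ['2015-03-31'] * 5 + ['2015-04-30'] * 4)
ids = [1, 2, 3,  1, 2, 3, 4,  2, 3, 4,  1, 3, 4, 5,  1, 2, 4, 5, 6,  3, 4, 5, 6]
df = pd.DataFrame({'date': pd.to_datetime(dates), 'ID': ids})
print(filter_unique(df, 1))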
