Merge pandas DataFrames based on conditions - python

I have two files which contain information about transactions on products
Operations of type 1
d_op_1 = pd.DataFrame({'id': [1,1,1,2,2,2,3,3],
                       'cost': [10,20,20,20,10,20,20,20],
                       'date': [2000,2006,2012,2000,2009,2009,2002,2006]})
Operations of type 2
d_op_2 = pd.DataFrame({'id': [1,1,2,2,3,4,5,5],
                       'cost': [3000,3100,3200,4000,4200,3400,2000,2500],
                       'date': [2010,2015,2008,2010,2006,2010,1990,2000]})
I want to keep only those records where there has been an operation of type 1 between two operations of type 2.
E.g. for the product with the id "1" there was an operation of type 1 (2012) between two operations of type 2 (2010, 2015), so I want to keep that record.
The desired output could be either of the two layouts shown in the original post (the images are not reproduced here).
Using pd.merge() I got a different result (also shown as an image).
How can I filter this to get the desired output?
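The merge attempt was presumably something along these lines (a guess, reusing the d_op_1 and d_op_2 frames defined above), which pairs every type-1 row with every type-2 row that shares the same id:
# hypothetical reconstruction of the attempted merge
merged = pd.merge(d_op_1, d_op_2, on='id', suffixes=('_1', '_2'))
print(merged)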

You can use:
#concat DataFrames together
df4 = pd.concat([d_op_1.rename(columns={'cost':'cost1'}),
                 d_op_2.rename(columns={'cost':'cost2'})]).fillna(0).astype(int)
#print (df4)
#find max and min dates per groups
#(named aggregation; the older dict-renaming form agg({'start':'min','end':'max'}) was removed in pandas 1.0)
df3 = d_op_2.groupby('id')['date'].agg(start='min', end='max')
#print (df3)
#join max and min dates to concated df
df = df4.join(df3, on='id')
df = df[(df.date > df.start) & (df.date < df.end)]
#reshape df for min, max and dates between them
df = pd.melt(df,
             id_vars=['id','cost1'],
             value_vars=['date','start','end'],
             value_name='date')
#remove columns
df = df.drop(['cost1','variable'], axis=1) \
       .drop_duplicates()
#merge to original, sorting
df = pd.merge(df, df4, on=['id', 'date']) \
       .sort_values(['id','date']).reset_index(drop=True)
#reorder columns
df = df[['id','cost1','cost2','date']]
print (df)
   id  cost1  cost2  date
0   1      0   3000  2010
1   1     20      0  2012
2   1      0   3100  2015
3   2      0   3200  2008
4   2     10      0  2009
5   2     20      0  2009
6   2      0   4000  2010
#if need lists for duplicates
df = df.groupby(['id','cost2', 'date'])['cost1'] \
       .apply(lambda x: list(x) if len(x) > 1 else x.values[0]) \
       .reset_index()
df = df[['id','cost1','cost2','date']]
print (df)
   id     cost1  cost2  date
0   1        20      0  2012
1   1         0   3000  2010
2   1         0   3100  2015
3   2  [10, 20]      0  2009
4   2         0   3200  2008
5   2         0   4000  2010
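If you only need the qualifying type-1 rows themselves, rather than interleaving them with the surrounding type-2 rows as above, here is a shorter sketch of the same idea (again using d_op_1 and d_op_2 from the question):
# first and last type-2 operation per id
bounds = d_op_2.groupby('id')['date'].agg(start='min', end='max')
# keep type-1 rows whose date falls strictly between the two type-2 dates
tmp = d_op_1.join(bounds, on='id')
kept = tmp[(tmp['date'] > tmp['start']) & (tmp['date'] < tmp['end'])]
print(kept[['id', 'cost', 'date']])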

Related

(Pandas) How to replace certain values of a column from a different dataset but leave other values in the dataset untouched?

Let's say I have the dataset:
df1 = pd.DataFrame()
df1['number'] = [0,0,0,0,0]
df1["decade"] = ["1970", "1980", "1990", "2000", "2010"]`
print(df1)
#output:
number decade
0 0 1970
1 0 1980
2 0 1990
3 0 2000
4 0 2010
and I want to merge it with another dataset:
df2 = pd.DataFrame()
df2['number'] = [1,1]
df2["decade"] = ["1990", "2010"]
print(df2)
#output:
number decade
0 1 1990
1 1 2010
such that it gets values only for the decades that appear in df2 and leaves the others untouched, yielding:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
how must one go about doing that in pandas? I've tried stuff like join, merge, and concat but they all seem to either not give the desired result or not work because of the different dimensions of the 2 datasets. Any suggestions regarding which function I should be looking at?
Thank you so much!
You can use pandas.DataFrame.merge with pandas.DataFrame.fillna:
out = (
    df1[["decade"]]
    .merge(df2, on="decade", how="left")
    .fillna({"number": df1["number"]}, downcast="infer")
)
# Output :
print(out)
decade number
0 1970 0
1 1980 0
2 1990 1
3 2000 0
4 2010 1
What about using apply?
First, create a function:
def validation(previous, latest):
    if pd.isna(latest):
        return previous
    else:
        return latest
Then use DataFrame.apply to compare the data in df1 with df2:
df1['number'] = df1.apply(lambda row: validation(row['number'], df2.loc[df2['decade'] == row.decade].number.max()), axis=1)
Your result:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
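Another short option, as a sketch (it assumes each decade appears at most once in df2), is to build a lookup Series from df2 and fall back to df1's original numbers:
# map each decade to its number in df2; decades missing from df2 become NaN,
# which are then filled with the original values from df1
lookup = df2.set_index("decade")["number"]
df1["number"] = df1["decade"].map(lookup).fillna(df1["number"]).astype(int)
print(df1)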

Aggregating the counts on a certain day of the week in python

I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is fixed for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this and it does work, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while(j < len(df_id)):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if(j < len(df_id)):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
   id level_1     date  count
0   1   35 22   9/3/22     11
1   1   36 22   9/9/22     28
2   2   35 22   9/3/22      7
3   2   36 22   9/9/22     13
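A Grouper-based alternative, as a sketch: it assumes the date column has been parsed with pd.to_datetime, and uses the 'W-SAT' frequency so that each week ends on Saturday, matching the Sunday-to-Saturday weeks described in the question:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
# 'W-SAT' buckets each date into a week ending on Saturday and labels the
# group with that Saturday, so the sums land on 9/3/22 and 9/10/22
out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())
print(out)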
I tried to understand as much as I can :)
Here is my process:
# reading data ('data' holds the table from the question as text)
from io import StringIO
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                 date
0   1    35     11  2022-09-03 00:00:00
1   1    36     28  2022-09-10 00:00:00
2   2    35      7  2022-09-03 00:00:00
3   2    36     13  2022-09-10 00:00:00

How to get most recent date based on a given date using python?

Consider the following two dataframes (posted as images in the original question; the data is reproduced in the answer below):
Dataframe1 contains a list of users and stop_dates
Dataframe2 contains a history of user transactions and dates
I want to get the last transaction date before the stop date for all users in Dataframe1 (some users in Dataframe1 have multiple stop dates)
I want the output to look like the following (also shown as an image in the original post):
Please always provide data in a form that makes it easy to use as samples (i.e. as text, not as images - see here).
You could try:
df1["Stop_Date"] = pd.to_datetime(df1["Stop_Date"], format="%m/%d/%y")
df2["Transaction_Date"] = pd.to_datetime(df2["Transaction_Date"], format="%m/%d/%y")
df = (
df1.merge(df2, on="UserID", how="left")
.loc[lambda df: df["Stop_Date"] >= df["Transaction_Date"]]
.groupby(["UserID", "Stop_Date"])["Transaction_Date"].max()
.to_frame().reset_index().drop(columns="Stop_Date")
)
Make datetimes out of the date columns.
Merge df2 on df1 along UserID.
Remove the rows which have a Transaction_Date greater than Stop_Date.
Group the result by UserID and Stop_Date, and fetch the maximum Transaction_Date.
Bring the result in shape.
Result for
df1:
UserID Stop_Date
0 1 2/2/22
1 2 6/9/22
2 3 7/25/22
3 3 9/14/22
df2:
UserID Transaction_Date
0 1 1/2/22
1 1 2/1/22
2 1 2/3/22
3 2 1/24/22
4 2 3/22/22
5 3 6/25/22
6 3 7/20/22
7 3 9/13/22
8 3 9/14/22
9 4 2/2/22
is
UserID Transaction_Date
0 1 2022-02-01
1 2 2022-03-22
2 3 2022-07-20
3 3 2022-09-14
If you don't want to permanently change the dtype to datetime, and also want the result as string, similarly formatted as the input (with padding), then you could try:
df = (
    df1
    .assign(Stop_Date=pd.to_datetime(df1["Stop_Date"], format="%m/%d/%y"))
    .merge(
        df2.assign(Transaction_Date=pd.to_datetime(df2["Transaction_Date"], format="%m/%d/%y")),
        on="UserID", how="left"
    )
    .loc[lambda df: df["Stop_Date"] >= df["Transaction_Date"]]
    .groupby(["UserID", "Stop_Date"])["Transaction_Date"].max()
    .to_frame().reset_index().drop(columns="Stop_Date")
    .assign(Transaction_Date=lambda df: df["Transaction_Date"].dt.strftime("%m/%d/%y"))
)
Result:
UserID Transaction_Date
0 1 02/01/22
1 2 03/22/22
2 3 07/20/22
3 3 09/14/22
Here is one way to accomplish this (make sure both date columns are already datetime):
df = pd.merge(df1, df2, on="UserID")
# latest Transaction_Date on or before each row's Stop_Date, restricted to the same user
df["Last_Before_Stop"] = df.apply(lambda r: df.loc[(df["UserID"] == r["UserID"]) & (df["Transaction_Date"] <= r["Stop_Date"]), "Transaction_Date"].max(), axis=1)

Python DF - apply same procedure to multiple columns using multiple parameters

I have a dataframe with id, v1 and v2. For v1, first, for each id I pick the first 4 rows (lookback, or lb=4). Then within these 4 rows, I identify the top 3 values of v1 and create a table for this top 3. I do the same thing but this time I use the first 5 rows (lb=5).
For v2, I apply the same procedure as for v1.
Finally, I combine all top-3 results together.
The code below yields exactly what I want. However, my real work has multiple columns v1, v2, ... and requires multiple lookbacks as well, so I wonder if you could guide me towards more efficient code. Many thanks.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,4], [1,6,7], [1,39,9],[1,30,8],[1,40,6],[1,140,0], [2,2,1], [2,1,99], [2,20,88], [2,15,25], [2,99,25], [2,9,0]], columns=['id', 'v1','v2'])
print(df)
# PART 1: WORK ON Value 1 ************************************************************************************************************************
# lookback the first 4 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:4]);df_temp.reset_index(drop=True, inplace=True)
#Filtering the largest 3 values
xx = (df_temp.groupby('id')['v1'] .apply(lambda x: x.nlargest(3)) .reset_index(level=1, drop=True) .to_frame('v1'))
#Transposing using unstack
v1_top3_lb4 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v1'].unstack().add_prefix('v1_lb4_top');
v1_top3_lb4 = v1_top3_lb4.reset_index();
# lookback the first 5 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:5]);df_temp.reset_index(drop=True, inplace=True)
#Filtering the largest 3 values
xx = (df_temp.groupby('id')['v1'] .apply(lambda x: x.nlargest(3)) .reset_index(level=1, drop=True) .to_frame('v1'))
#Transposing using unstack
v1_top3_lb5 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v1'].unstack().add_prefix('v1_lb5_top');
v1_top3_lb5 = v1_top3_lb5.reset_index();
# PART 2: WORK ON Value 2 ************************************************************************************************************************
# lookback the first 4 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:4]);df_temp.reset_index(drop=True, inplace=True)
#Filtering the largest 3 values
xx = (df_temp.groupby('id')['v2'] .apply(lambda x: x.nlargest(3)) .reset_index(level=1, drop=True) .to_frame('v2'))
#Transposing using unstack
v2_top3_lb4 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v2'].unstack().add_prefix('v2_lb4_top');
v2_top3_lb4 = v2_top3_lb4.reset_index();
# lookback the first 5 rows
df_temp = df.groupby('id').apply(lambda t: t.iloc[0:5]);df_temp.reset_index(drop=True, inplace=True)
#Filtering the largest 3 values
xx = (df_temp.groupby('id')['v2'] .apply(lambda x: x.nlargest(3)) .reset_index(level=1, drop=True) .to_frame('v2'))
#Transposing using unstack
v2_top3_lb5 = xx.set_index(np.arange(len(xx)) % 3, append=True)['v2'].unstack().add_prefix('v2_lb5_top');
v2_top3_lb5 = v2_top3_lb5.reset_index();
# PART 3: Merge all ************************************************************************************************************************
combine = pd.merge(v1_top3_lb4, v1_top3_lb5, on='id')
combine = pd.merge(combine, v2_top3_lb4, on='id')
combine = pd.merge(combine, v2_top3_lb5, on='id')
combine
id v1 v2
0 1 1 4
1 1 6 7
2 1 39 9
3 1 30 8
4 1 40 6
5 1 140 0
6 2 2 1
7 2 1 99
8 2 20 88
9 2 15 25
10 2 99 25
11 2 9 0
id v1_lb4_top0 v1_lb4_top1 v1_lb4_top2 v1_lb5_top0 v1_lb5_top1 v1_lb5_top2 v2_lb4_top0 v2_lb4_top1 v2_lb4_top2 v2_lb5_top0 v2_lb5_top1 v2_lb5_top2
0 1 39 30 6 40 39 30 9 8 7 9 8 7
1 2 20 15 2 99 20 15 99 88 25 99 88 25
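No answer is included for this question in this excerpt. One way to avoid the copy-and-paste, as a sketch using the names from the question (it assumes the rows of df are already in the intended order within each id), is to loop over the value columns and lookback windows and concatenate the pieces:
value_cols = ['v1', 'v2']
lookbacks = [4, 5]
pieces = []
for col in value_cols:
    for lb in lookbacks:
        top3 = (df.groupby('id').head(lb)              # first lb rows per id
                  .groupby('id')[col]
                  .apply(lambda s: s.nlargest(3).reset_index(drop=True))
                  .unstack()                           # one column per rank
                  .add_prefix(f'{col}_lb{lb}_top'))
        pieces.append(top3)
combine = pd.concat(pieces, axis=1).reset_index()
print(combine)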

How to duplicate entries in a dataframe

I have a dataframe of the form:
df2 = pd.DataFrame({'Date': np.array([2018,2017,2016,2015]),
                    'Rev': np.array([4000,5000,6000,7000]),
                    'Other': np.array([0,0,0,0]),
                    'High': np.array([75.11,70.93,48.63,43.59]),
                    'Low': np.array([60.42,45.74,34.15,33.12]),
                    'Mean': np.array([67.765,58.335,41.390,39.355])  # mean of high/low columns
                    })
This looks like the table built above (shown as an image in the original post).
I want to convert this dataframe to a long format (also shown as an image; the answers below reproduce it).
Basically you copy each row two more times, move the high, low and mean values column-wise into a single 'price' column, and add a new 'category' column that keeps track of which value came from high/low/mean (0 meaning high, 1 meaning low, and 2 meaning mean).
This is a simple melt (wide to long) problem:
# convert df2 from wide to long, melting the High, Low and Mean cols
df3 = df2.melt(df2.columns.difference(['High', 'Low', 'Mean']).tolist(),
               var_name='category',
               value_name='price')
# remap "category" to integer
df3['category'] = pd.factorize(df3['category'])[0]
# sort and display
df3.sort_values('Date', ascending=False)
Date Other Rev category price
0 2018 0 4000 0 75.110
4 2018 0 4000 1 60.420
8 2018 0 4000 2 67.765
1 2017 0 5000 0 70.930
5 2017 0 5000 1 45.740
9 2017 0 5000 2 58.335
2 2016 0 6000 0 48.630
6 2016 0 6000 1 34.150
10 2016 0 6000 2 41.390
3 2015 0 7000 0 43.590
7 2015 0 7000 1 33.120
11 2015 0 7000 2 39.355
Instead of melt, you can use stack, which saves you the sort_values:
new_df = (df2.set_index(['Date','Rev', 'Other'])
             .stack()
             .to_frame(name='price')
             .reset_index()
          )
output:
Date Rev Other level_3 price
0 2018 4000 0 High 75.110
1 2018 4000 0 Low 60.420
2 2018 4000 0 Mean 67.765
3 2017 5000 0 High 70.930
4 2017 5000 0 Low 45.740
5 2017 5000 0 Mean 58.335
6 2016 6000 0 High 48.630
7 2016 6000 0 Low 34.150
8 2016 6000 0 Mean 41.390
9 2015 7000 0 High 43.590
10 2015 7000 0 Low 33.120
11 2015 7000 0 Mean 39.355
And if you want the category column:
new_df['category'] = new_df['level_3'].map({'High': 0, 'Low': 1, 'Mean': 2})
Here's another version:
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'Date': np.array([2018,2017,2016,2015]),
                    'Rev': np.array([4000,5000,6000,7000]),
                    'Other': np.array([0,0,0,0]),
                    'High': np.array([75.11,70.93,48.63,43.59]),
                    'Low': np.array([60.42,45.74,34.15,33.12]),
                    'Mean': np.array([67.765,58.335,41.390,39.355])  # mean of high/low columns
                    })
#create one dataframe per category
df_high = df2[['Date', 'Other', 'Rev', 'High']]
df_mean = df2[['Date', 'Other', 'Rev', 'Mean']]
df_low = df2[['Date', 'Other', 'Rev', 'Low']]
#rename the category column to price
df_high = df_high.rename(index = str, columns = {'High': 'price'})
df_mean = df_mean.rename(index = str, columns = {'Mean': 'price'})
df_low = df_low.rename(index = str, columns = {'Low': 'price'})
#create new category column
df_high['category'] = 0
df_mean['category'] = 2
df_low['category'] = 1
#concatenate the dataframes together
frames = [df_high, df_mean, df_low]
df_concat = pd.concat(frames)
#sort values per example
df_concat = df_concat.sort_values(by = ['Date', 'category'], ascending = [False, True])
#print result
print(df_concat)
Result:
Date Other Rev price category
0 2018 0 4000 75.110 0
0 2018 0 4000 60.420 1
0 2018 0 4000 67.765 2
1 2017 0 5000 70.930 0
1 2017 0 5000 45.740 1
1 2017 0 5000 58.335 2
2 2016 0 6000 48.630 0
2 2016 0 6000 34.150 1
2 2016 0 6000 41.390 2
3 2015 0 7000 43.590 0
3 2015 0 7000 33.120 1
3 2015 0 7000 39.355 2
