I have three dataframes:
df1 :
Date ID Number ID2 info_df1
2021-12-11 1 34 36 60
2021-12-10 2 33 35 57
2021-12-09 3 32 34 80
2021-12-08 4 31 33 55
df2:
Date ID Number ID2 info_df2
2021-12-10 2 18 20 50
2021-12-11 1 34 36 89
2021-12-10 2 33 35 40
2021-12-09 3 32 34 92
df3:
Date ID Number ID2 info_df3
2021-12-10 2 18 20 57
2021-12-10 2 18 20 63
2021-12-11 1 34 36 52
2021-12-10 2 33 35 33
I need a dataframe with the info columns from df1, df2 and df3, and with Date, ID, Number, ID2 as the index. The merged dataframe should have these columns:
Date ID Number ID2 info_df1 info_df2 info_df3
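For reference, the three frames can be built like this (Date kept as plain strings for brevity):
import pandas as pd

df1 = pd.DataFrame({'Date': ['2021-12-11', '2021-12-10', '2021-12-09', '2021-12-08'],
                    'ID': [1, 2, 3, 4], 'Number': [34, 33, 32, 31],
                    'ID2': [36, 35, 34, 33], 'info_df1': [60, 57, 80, 55]})
df2 = pd.DataFrame({'Date': ['2021-12-10', '2021-12-11', '2021-12-10', '2021-12-09'],
                    'ID': [2, 1, 2, 3], 'Number': [18, 34, 33, 32],
                    'ID2': [20, 36, 35, 34], 'info_df2': [50, 89, 40, 92]})
df3 = pd.DataFrame({'Date': ['2021-12-10', '2021-12-10', '2021-12-11', '2021-12-10'],
                    'ID': [2, 2, 1, 2], 'Number': [18, 18, 34, 33],
                    'ID2': [20, 20, 36, 35], 'info_df3': [57, 63, 52, 33]})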
If you are trying to merge the dataframes based on Date alone, what you need is the merge function:
mergedDf = df1.merge(df2, on="Date").merge(df3, on="Date")
mergedDf.set_index("ID2", inplace=True)
But if you are trying to merge the dataframes based on multiple columns, you can pass a list of column names to the on argument. Here, merging on all four key columns (including Number, so it is not duplicated with suffixes) and then setting them all as the index:
mergedDf = df1.merge(df2, on=["Date", "ID", "Number", "ID2"]).merge(df3, on=["Date", "ID", "Number", "ID2"])
mergedDf.set_index(["Date", "ID", "Number", "ID2"], inplace=True)
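Note that merge defaults to how='inner', so only keys present in all three frames survive; pass how='outer' if you want to keep every row, as in the reduce-based answer below.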
Two steps:
first, pandas.concat(<DFs-list>) all those DFs into a df;
then, define a multi-index with df.set_index(<col-names-list>).
That will do it. Sure, you have to read some docs (linked below), but those two steps should be about it; a minimal sketch follows the links.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.set_levels.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html
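A minimal sketch of those two steps (assuming the frames share the key columns shown in the question):
import pandas as pd

# Step 1: stack the three frames vertically
df = pd.concat([df1, df2, df3])
# Step 2: promote the shared key columns to a MultiIndex
df = df.set_index(['Date', 'ID', 'Number', 'ID2'])
Note that rows sharing the same key are not collapsed into one row; combining the info_df* columns per key still takes a merge or a groupby.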
As others have mentioned, you need to merge the dataframes together. With functools.reduce from the standard library, we can do this dynamically and easily, for any number of dataframes:
import functools as ft

dfs = [df1, df2, df3]

# Outer-merge each frame into the accumulated result on the shared key columns
new_df = ft.reduce(
    lambda left, right: left.merge(right, on=['Date', 'ID', 'Number', 'ID2'], how='outer'),
    dfs
)
Output:
>>> new_df
         Date  ID  Number  ID2  info_df1  info_df2  info_df3
0  2021-12-11   1      34   36      60.0      89.0      52.0
1  2021-12-10   2      33   35      57.0      40.0      33.0
2  2021-12-09   3      32   34      80.0      92.0       NaN
3  2021-12-08   4      31   33      55.0       NaN       NaN
4  2021-12-10   2      18   20       NaN      50.0      57.0
5  2021-12-10   2      18   20       NaN      50.0      63.0
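To finish with Date, ID, Number and ID2 as the index, as the question asks:
new_df = new_df.set_index(['Date', 'ID', 'Number', 'ID2'])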
I have a Pandas dataframe with a DatetimeIndex, a categorical column, and a numerical column. I'd like to apply a complex function to the numerical column, restricted to rows where the categorical column matches the current row, over a short (ten-day) window lagging the current row (non-inclusive).
As a contrived example:
import pandas as pd

names = ['steve', 'bob', 'harry', 'jeff'] * 5
df = pd.DataFrame(
    index=pd.date_range(start='2018-10-10', end='2018-10-29', freq='D'),
    data={'value': list(range(20)),
          'name': names}
)
produces a simple dataframe, to which I'd like to add another column (result) that calculates the number of rows in the window multiplied by the sum of their values in 'value' (or something like that; just a formula that there's no Pandas built-in function for). So for the dataframe above, I'd like the following:
value name result
2018-10-10 0 steve NaN
2018-10-11 1 bob NaN
2018-10-12 2 harry NaN
2018-10-13 3 jeff NaN
2018-10-14 4 steve 0
2018-10-15 5 bob 1
2018-10-16 6 harry 2
2018-10-17 7 jeff 3
2018-10-18 8 steve 8
2018-10-19 9 bob 12
2018-10-20 10 harry 16
2018-10-21 11 jeff 20
2018-10-22 12 steve 24
2018-10-23 13 bob 28
2018-10-24 14 harry 32
2018-10-25 15 jeff 36
2018-10-26 16 steve 40
2018-10-27 17 bob 44
2018-10-28 18 harry 48
2018-10-29 19 jeff 52
I can write my own function for this and use it in pandas.apply:
from datetime import timedelta

def rolling_apply(df, time, window_size=timedelta(days=10)):
    event_time = time
    event_name = df[df.index == time]['name'].iloc[0]
    return df[
        (df['name'] == event_name) &
        (df.index < event_time) &
        (df.index >= event_time - window_size)
    ]

# x.name inside apply is the row's index label (the timestamp)
df['result'] = df.apply(
    lambda x: rolling_apply(df, x.name)['value'].sum() * len(rolling_apply(df, x.name)),
    axis=1
)
but performance gets pretty terrible pretty quickly as my data grows. Rolling.apply seems sort of appropriate, but I can't quite make it fit what I want to do.
Any suggestions or help would be very much appreciated!
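One direction that avoids the row-wise apply (a sketch, assuming a pandas version where time-based rolling windows accept the closed= argument): group by name, then use a '10D' window closed on the left, so the current row is excluded and rows exactly ten days back are included:
# Per-name lagging 10-day window; closed='left' gives [t - 10 days, t)
df = df.sort_index()
df['result'] = (
    df.groupby('name')['value']
      .transform(lambda s: s.rolling('10D', closed='left')
                            .apply(lambda w: w.sum() * len(w), raw=True))
)
Empty windows (the first occurrence of each name) come back as NaN, matching the expected output above.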
I have the following example Python 3.4 script. It does the following:
creates a dataframe,
converts the date variable to datetime64 format,
creates a groupby object based on two categorical variables,
produces a dataframe that contains a count of the number of items in each group,
merges the count dataframe back with the original dataframe to create a column containing the number of rows in each group,
creates a column containing the difference in dates between sequential rows.
Here is the script:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
This script produces the following output:
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 243 days 06:18:00
2 old 2015-06-04 12:34:00 female 2 3 NaT
3 old 2015-09-04 23:03:00 female 3 3 92 days 10:29:00
4 old 2015-04-21 12:59:00 female 6 3 -137 days +13:56:00
5 old 2015-12-04 01:00:00 male 4 6 NaT
6 old 2015-04-15 07:12:00 male 5 6 -233 days +06:12:00
7 old 2015-06-05 11:12:00 male 9 6 51 days 04:00:00
8 old 2015-05-19 19:22:00 male 12 6 -17 days +08:10:00
9 old 2015-04-06 12:57:00 male 15 6 -44 days +17:35:00
10 old 2015-06-15 03:23:00 male 17 6 69 days 14:26:00
11 young 2015-12-05 14:19:00 female 11 4 NaT
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 163 days 18:28:00
And this is exactly what I'd expect. However, it seems to rely on creating the groupby object twice (in exactly the same way). If the second groupby definition is commented out, it seems to lead to a very different output in the diff column:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
# ****** THIS TIME THE FOLLOWING GROUPBY DEFINITION IS COMMENTED OUT *****
# tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
And this time the output is very different (and NOT what I wanted at all):
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 NaT
2 old 2015-06-04 12:34:00 female 2 3 92 days 10:29:00
3 old 2015-09-04 23:03:00 female 3 3 NaT
4 old 2015-04-21 12:59:00 female 6 3 -233 days +06:12:00
5 old 2015-12-04 01:00:00 male 4 6 -137 days +13:56:00
6 old 2015-04-15 07:12:00 male 5 6 NaT
7 old 2015-06-05 11:12:00 male 9 6 NaT
8 old 2015-05-19 19:22:00 male 12 6 51 days 04:00:00
9 old 2015-04-06 12:57:00 male 15 6 243 days 06:18:00
10 old 2015-06-15 03:23:00 male 17 6 NaT
11 young 2015-12-05 14:19:00 female 11 4 -17 days +08:10:00
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 -44 days +17:35:00
(In my real-life script the results seem to be a little erratic; sometimes it works and sometimes it doesn't. In the above script, though, the different outputs occur consistently.)
Why is it necessary to recreate the groupby object on what is, essentially, the same dataframe (albeit with an additional column added) immediately before using the .diff() function? This seems very dangerous to me.
It is not the same dataframe: the merge has changed the index. For example:
tempDF.loc[1].id  # before the merge
2
tempDF.loc[1].id  # after the merge
10
So if you compute tempGroupby with the old tempDF, and the merge then changes the indexes in tempDF, when you do this:
tempDF['diff'] = tempGroupby['date'].diff()
the indexes no longer match the way you expect: each row is assigned the difference belonging to whichever row had that index in the old tempDF.
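One way to sidestep the problem entirely (a sketch): compute the count with transform instead of a merge, so the index never changes and a single groupby object stays valid throughout. Note that, unlike the merge, this keeps the rows whose gender or age is NaN (they simply get NaN count and diff):
tempDF = tempDF.sort_values(['gender', 'age', 'id'])
tempGroupby = tempDF.groupby(['gender', 'age'])
# transform('count') broadcasts each group's size back onto its rows
tempDF['count'] = tempGroupby['id'].transform('count')
# diff() is index-aligned with tempDF, so no second groupby is needed
tempDF['diff'] = tempGroupby['date'].diff()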