I'm trying to merge two dataframes with different datetime frequencies and fill the missing values by duplicating the hourly values.
Dataframe df1 with minute frequency:
time
0 2017-06-01 00:00:00
1 2017-06-01 00:01:00
2 2017-06-01 00:02:00
3 2017-06-01 00:03:00
4 2017-06-01 00:04:00
Dataframe df2 with hourly frequency:
time2 temp hum
1 2017-06-01 00:00:00 13.5 90.0
2 2017-06-01 01:00:00 12.2 95.0
3 2017-06-01 02:00:00 11.7 96.0
4 2017-06-01 03:00:00 11.5 96.0
5 2017-06-01 04:00:00 11.1 97.0
So far I have merged these dataframes, but I get NaNs:
m2o_merge = df1.merge(df2, left_on= 'time', right_on= 'time2', how='outer')
m2o_merge.head()
time time2 temp hum
0 2017-06-01 00:00:00 2017-06-01 13.5 90.0
1 2017-06-01 00:01:00 NaT NaN NaN
2 2017-06-01 00:02:00 NaT NaN NaN
3 2017-06-01 00:03:00 NaT NaN NaN
4 2017-06-01 00:04:00 NaT NaN NaN
My desired dataframe should look like this (NaNs filled with the hourly values from df2):
time temp hum
0 2017-06-01 00:00:00 13.5 90.0
1 2017-06-01 00:01:00 13.5 90.0
2 2017-06-01 00:02:00 13.5 90.0
3 2017-06-01 00:03:00 13.5 90.0
4 2017-06-01 00:04:00 13.5 90.0
...
So far I have found this solution: merge series/dataframe with different time frequencies in python, but the datetime column is not my index. Does anyone know how to get there?
As suggested by Ben Pap, I did the following two steps as a solution:
import pandas as pd
data1 = {'time':pd.date_range('2017-06-01 00:00:00','2017-06-01 00:09:00', freq='T')}
data2 = {'time2':pd.date_range('2017-06-01 00:00:00','2017-06-01 04:00:00', freq='H'), 'temp':[13.5,12.2,11.7,11.5,11.1], 'hum':[90.0,95.0,96.0,96.0,97.0]}
# Create DataFrame
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
m2o_merge = df1.merge(df2, left_on= 'time', right_on= 'time2', how='outer')
m2o_merge.head()
m2o_merge.fillna(method='ffill', inplace=True)
filled_df = m2o_merge.drop(['time2'], axis=1)
filled_df.head()
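An alternative that avoids the intermediate NaNs altogether is pd.merge_asof, which matches each minute row to the most recent hourly row. A minimal sketch, assuming df1 and df2 are built as above and both are sorted by their time columns:
asof_merge = pd.merge_asof(df1, df2, left_on='time', right_on='time2',
                           direction='backward')  # each minute gets the last hourly value at or before it
filled_df = asof_merge.drop(columns=['time2'])
This gives the same result as the merge + ffill combination, without having to fill anything afterwards.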
I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is my dataframe I want to use to fill the NaN values
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
With fillna you usually want to specify a strategy, such as filling with the previous/next value or the column mean. What we can do is something like this:
nulls_index = df['rain_gauge_value'].isnull()  # remember where the gaps were
df = df.fillna(method='ffill')  # use ffill as an example
nulls_after_fill = df[nulls_index]  # inspect the rows that were filled
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
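For reference, here is a minimal sketch of a few common fill strategies on a numeric column (the column name 'rain_gauge_value' is borrowed from the question; the data is made up):
import pandas as pd
import numpy as np

s = pd.Series([0.0, np.nan, np.nan, 0.4], name='rain_gauge_value')
s.ffill()           # carry the previous observation forward
s.bfill()           # use the next observation
s.fillna(s.mean())  # replace with the column mean
s.interpolate()     # linear interpolation between neighbours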
You need to tell pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
import pandas as pd
import numpy as np
from datetime import date, time

raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0), time(0,0,1)], temp=[1., np.nan], rain=[4., np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
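Applied to the frames in the question, a sketch along the same lines might look like this (assuming df and rain_df both carry 'date' and 'time' columns as shown above):
patched = (df.set_index(['date', 'time'])
             .fillna(rain_df.set_index(['date', 'time']))  # align on (date, time) and fill the gaps
             .reset_index())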
I'm trying to create a new column with values that meet specific conditions. Below I have set out code which goes some way toward explaining the logic but does not produce the correct output:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00' ,'2019-08-09 16:00:00'],
'type': [0, 1, np.nan, 1, np.nan, np.nan, 0 ,0],
'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00' ,'2019-08-11 16:00:00'] })
max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
(df['type'] == 1) & (df['colour'] == 'red')]
max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]
df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)
Incorrect output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 12.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 6000.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is incorrect:
Value 12.0 in index row 1 for column df['pixelLimit'] is incorrect because this value is from df['minPixel'] index row 6, which has a df['date'] datetime of 2019-08-08 17:00:00, which is greater than the 2019-08-08 16:00:00 df['colourDate'] datetime contained in index row 1.
Value 6000.0 in index row 3 for column df['pixelLimit'] is incorrect because this value is from df['maxPixel'] index row 7, which has a df['date'] datetime of 2019-08-09 16:00:00, which is greater than the 2019-08-06 22:00:00 df['colourDate'] datetime contained in index row 3.
Correct output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is correct:
Value 14.0 in index row 1 for column df['pixelLimit'] is correct because we are looking for the minimum value in column df['minPixel'] which has a datetime in column df['date'] less than the datetime in index row 1 for column df['colourDate'] and greater or equal to the datetime in index row 1 for column df['date']
Value 5184.0 in index row 3 for column df['pixelLimit'] is correct because we are looking for the maximum value in column df['maxPixel'] which has a datetime in column df['date'] less than the datetime in index row 3 for column df['colourDate'] and greater or equal to the datetime in index row 3 for column df['date']
Considerations:
Maybe np.select is not best suited for this task and some sort of function might serve the task better?
Also, maybe I need to create some sort of dynamic len to use as a starting point for each row?
Request
Please can anyone help me amend my code to achieve the correct output?
For matching problems like this, one possibility is to do the complete merge (a cross join), then subset, using a Boolean Series, to the rows that satisfy your condition (for that row), and find the max or min among all the possible matches. Since this requires slightly different columns and different functions, I split the operation into two very similar pieces of code, one for 1/blue and the other for 1/red.
First some housekeeping: make the date columns proper datetimes.
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])
Calculate the min pixel for 1/red between the times for each row
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# If pd.version < 1.2 instead use:
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')
# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
.groupby('index')['minPixel_y'].min()
.rename('pixel_limit'))
#index
#1 14
#Name: pixel_limit, dtype: int64
# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')
smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
.groupby('index')['maxPixel_y'].max()
.rename('pixel_limit'))
Finally, because the above groups over the original index (i.e. 'index'), we can simply assign back to align with the original DataFrame.
df['pixel_limit'] = pd.concat([smin, smax])
date type colour maxPixel minPixel colourDate pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
If you need to bring along a lot of other information for the row with the min/max pixel, then instead of a groupby min/max we sort_values and then use groupby + head or tail to get the min or max pixel row. For the min this would look like (slight renaming of suffixes):
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
# .merge(df[['date', 'minPixel']].reset_index().assign(t=1),
# on='t', suffixes=['', '_match']))
# Only keep rows between the dates, then among those find the min minPixel row.
# A bunch of renaming.
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
.sort_values('minPixel_match', ascending=True)
.groupby('index').head(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'minPixel_match': 'pixel_limit'}))
The max would then be similar, using .tail:
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmax.colourDate)]
.sort_values('maxPixel_match', ascending=True)
.groupby('index').tail(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'maxPixel_match': 'pixel_limit'}))
And finally we concat along axis=1 now that we need to join multiple columns to the original:
result = pd.concat([df, pd.concat([smin, smax])], axis=1)
date type colour maxPixel minPixel colourDate index_match date_match pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN NaN NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 2.0 2019-08-06 18:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN NaN NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 3.0 2019-08-06 21:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN NaN NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN NaN NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN NaN NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN NaN NaN
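If the extra match columns are not needed in the final output, they can simply be dropped afterwards; a small usage sketch (column names taken from the output above):
result = result.drop(columns=['index_match', 'date_match'])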
I have a pandas Series object consisting of a DatetimeIndex and some values; it looks like the following:
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 00:30:00 35.6
2020-01-01 00:45:00 39.2
2020-01-01 01:00:00 56.7
...
2020-12-31 23:45:00 56.3
I am adding some values to this series with .append(). Since the result is not sorted, I sort its index via .sort_index(). However, what I would like to achieve is to sort only a given day.
So, for example, I add some values to the day 2020-01-01, and since the added values come after the end of that day, I just need to sort the first day of the year, NOT the whole series.
Here is an example, NaN value is added with .append():
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
...
2020-01-01 23:45:00 34.3
2020-01-01 15:00:00 NaN
...
2020-12-31 23:45:00 56.3
Now I cannot simply call df.sort_index(), because it breaks the other days. That is why I just want to apply .sort_index() to the day 2020-01-01. How do I do that?
WHAT I TRIED SO FAR AND DOES NOT WORK:
df.loc['2020-01-01'] = df.loc['2020-01-01'].sort_index()
Filter the rows for the day 2020-01-01, sort them, and join them back with the rows that did not match:
mask = df.index.normalize() == '2020-01-01'
df = pd.concat([df[mask].sort_index(), df[~mask]])
print (df)
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 15:00:00 NaN
2020-01-01 23:45:00 34.3
2020-12-31 23:45:00 56.3
Name: a, dtype: float64
Another idea:
df1 = df['2020-01-01'].sort_index()
df = pd.concat([df1, df.drop(df1.index)])
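A minimal end-to-end sketch of the mask approach, assuming a Series with a DatetimeIndex like the one in the question (the values are made up):
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2020-01-01 00:00', '2020-01-01 00:15',
                      '2020-01-01 23:45', '2020-01-01 15:00',
                      '2020-12-31 23:45'])
s = pd.Series([39.6, 35.6, 34.3, np.nan, 56.3], index=idx, name='a')

mask = s.index.normalize() == '2020-01-01'          # rows belonging to that day
s = pd.concat([s[mask].sort_index(), s[~mask]])      # sort only those, keep the rest as-is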
I want to compute running differences with a step of 24 hours. For a dataframe with no missing values, this would be as easy as df.diff(periods=24, axis=0). But how is it possible to tie the calculation to the index values?
Reproducible dataframe - Code:
# Imports
import pandas as pd
import numpy as np
# A dataframe with two variables, random numbers and hourly time series
np.random.seed(123)
rows = 36
rng = pd.date_range('1/1/2017', periods=rows, freq='H')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['A', 'B'])
df = df.set_index(rng)
Desired output - Code:
# Running difference step = 24
df = df.diff(periods=24, axis=0)
df = df.dropna(axis=0, how='all')
The real challenge
The problem is that my real-world examples are full of missing values.
So I'll have to connect the difference intervals with the index values, and I have no idea how. I've tried a few solutions that fill in the missing hours in the index first and then run the differences as before, but it's not very elegant.
Thank you for any suggestions!
Edit: As requested in the comments, here's my best attempt for a somewhat longer time period:
df_missing = df.drop(df.index[[2,3]])
newIndex = pd.date_range(start = '1/1/2017', end = '1/3/2017', freq='H')
df_missing = df_missing.reindex(newIndex, fill_value = np.nan)
df_refilled = df_missing.diff(periods=24, axis=0)
Compared to the other suggestions, I would say that this is not very elegant =)
I think maybe you can use groupby:
df.groupby(df.index.hour).diff().dropna()
Out[784]:
A B
2017-01-02 00:00:00 -3.0 3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00 -4.0 -7.0
2017-01-02 03:00:00 3.0 -29.0
2017-01-02 04:00:00 -4.0 3.0
2017-01-02 05:00:00 -17.0 -6.0
2017-01-02 06:00:00 -20.0 35.0
2017-01-02 07:00:00 -2.0 -40.0
2017-01-02 08:00:00 13.0 -21.0
2017-01-02 09:00:00 -9.0 -13.0
2017-01-02 10:00:00 0.0 3.0
2017-01-02 11:00:00 -21.0 -9.0
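To see how this behaves with missing hours, here is a quick sketch on the df_missing frame defined in the question's edit; the two dropped rows simply have no 24-hour counterpart and produce no difference for those hours:
df_missing = df.drop(df.index[[2, 3]])   # as in the question's edit
df_missing.groupby(df_missing.index.hour).diff().dropna()
Note that with longer gaps this computes the difference to the previous available row for that hour, which may lie more than 24 hours back.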
You can snap your dataframe to an hourly frequency using asfreq, and then use diff:
df.asfreq('1H').diff(periods=24, axis=0).dropna()
Or, use shift and then subtract (instead of diff):
v = df.asfreq('1h')
(v - v.shift(periods=24)).dropna()
A B
2017-01-02 00:00:00 -3.0 3.0
2017-01-02 01:00:00 -28.0 -23.0
2017-01-02 02:00:00 -4.0 -7.0
2017-01-02 03:00:00 3.0 -29.0
2017-01-02 04:00:00 -4.0 3.0
2017-01-02 05:00:00 -17.0 -6.0
2017-01-02 06:00:00 -20.0 35.0
2017-01-02 07:00:00 -2.0 -40.0
2017-01-02 08:00:00 13.0 -21.0
2017-01-02 09:00:00 -9.0 -13.0
2017-01-02 10:00:00 0.0 3.0
2017-01-02 11:00:00 -21.0 -9.0
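The asfreq variant applied to the same df_missing frame from the question's edit (a sketch; asfreq reinserts the dropped hours as NaN rows, so the 24-period shift lines up with the clock again):
filled = df_missing.asfreq('1H')                     # missing hours come back as NaN rows
filled.diff(periods=24, axis=0).dropna(how='all')    # rows with no counterpart are dropped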
I have 4 dataframes, which look like the ones below.
df1
_id bs ds as pf
0 2017-05-01 00:00:00 0.982218 0.906662 0.614119 0.999471
1 2017-05-01 00:05:00 0.983751 0.913266 0.585237 0.999571
2 2017-05-01 00:10:00 0.983012 0.914875 0.592698 0.999631
3 2017-05-01 00:15:00 0.981884 0.910922 0.589013 0.999536
4 2017-05-01 00:20:00 0.982611 0.913082 0.601056 0.999556
5 2017-05-01 00:25:00 0.982386 0.912358 0.598856 0.999650
df2
_id avg_time_serve
0 2017-05-01 00:00:00 0.520681
1 2017-05-01 00:05:00 0.521580
2 2017-05-01 00:10:00 0.517993
3 2017-05-01 00:15:00 0.520662
4 2017-05-01 00:20:00 0.514146
5 2017-05-01 00:25:00 0.513723
df3
_id total_distinct_ips
0 2017-05-01 00:00:00 291094.0
1 2017-05-01 00:05:00 287922.0
2 2017-05-01 00:10:00 292103.0
3 2017-05-01 00:15:00 295675.0
4 2017-05-01 00:20:00 297813.0
5 2017-05-01 00:25:00 302406.0
df4
_id total_40x total_50x
0 2017-05-01 00:00:00 162034 0
1 2017-05-01 00:05:00 162497 0
2 2017-05-01 00:10:00 161079 0
3 2017-05-01 00:15:00 163338 0
4 2017-05-01 00:20:00 167901 0
5 2017-05-01 00:25:00 164394 0
I'm trying to combine them on the '_id' column. The '_id' column is in timestamp format.
I tried using the below approaches:
Approach 1
from functools import reduce
dfs = [df1, df2, df3, df4]
final_df = reduce(lambda left,right: pd.merge(left, right, on='_id',
how='outer'), dfs)
Approach 2
final_df = pd.DataFrame()
for df in dfs:
if final_df.empty:
final_df = df
else:
final_df = pd.merge(final_df, df, how='outer', on='_id')
Both approaches give below result:
_id bs ds as pf \
0 2017-05-01 00:00:00 0.982218 0.906662 0.614119 0.999471
1 2017-05-01 00:00:00 NaN NaN NaN NaN
2 2017-05-01 00:05:00 0.983751 0.913266 0.585237 0.999571
3 2017-05-01 00:05:00 NaN NaN NaN NaN
4 2017-05-01 00:10:00 0.983012 0.914875 0.592698 0.999631
5 2017-05-01 00:10:00 NaN NaN NaN NaN
avg_time_serve total_distinct_ips total_40x total_50x
0 NaN 291094.0 162034 0
1 0.520681 291094.0 162034 0
2 NaN 287922.0 162497 0
3 0.521580 287922.0 162497 0
4 NaN 292103.0 161079 0
5 0.517993 292103.0 161079 0
Approach 3
I took 'df1' out of the dfs list and added a join:
from functools import reduce
dfs = [df2, df3, df4]
final_df = reduce(lambda left,right: pd.merge(left, right, on='_id',
how='outer'), dfs)
final_df = final_df.join(df1.set_index('_id'), on='_id')
and finally got the right result:
_id avg_time_serve total_distinct_ips total_40x
0 2017-05-01 00:00:00 0.520681 291094.0 162034
1 2017-05-01 00:05:00 0.521580 287922.0 162497
2 2017-05-01 00:10:00 0.517993 292103.0 161079
3 2017-05-01 00:15:00 0.520662 295675.0 163338
4 2017-05-01 00:20:00 0.514146 297813.0 167901
5 2017-05-01 00:25:00 0.513723 302406.0 164394
total_50x bs ds as pf
0 0 0.982218 0.906662 0.614119 0.999471
1 0 0.983751 0.913266 0.585237 0.999571
2 0 0.983012 0.914875 0.592698 0.999631
3 0 0.981884 0.910922 0.589013 0.999536
4 0 0.982611 0.913082 0.601056 0.999556
5 0 0.982386 0.912358 0.598856 0.999650
Question:
Shouldn't approaches #1 and #2 work for any number of dataframes merged together?
Why did approaches 1 and 2 create duplicates of '_id' and insert NaN values?
You can also use pd.concat with set_index:
pd.concat([df1.set_index('_id'), df2.set_index('_id'),
           df3.set_index('_id'), df4.set_index('_id')], axis=1).reset_index()
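The same idea can be written more compactly with a list comprehension (a sketch, assuming dfs = [df1, df2, df3, df4] as defined above and that every '_id' column has the same dtype):
final_df = pd.concat([d.set_index('_id') for d in dfs], axis=1).reset_index()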