Merging intervals and timestamps dataframes - Python

I have a table which contains intervals
dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
I have another table which contains timestamps and values
dfb = pd.DataFrame({'Timestamp': [102, 145, 113], 'ValueA': [1, 2, 21],
                    'ValueB': [1, 2, 21]})
I need to create a dataframe of the same size as dfa, with added columns that contain the result of some aggregation of ValueA/ValueB over all the rows in dfb whose Timestamp lies between Start and Stop.
So if I define my aggregation as
{'ValueA': [np.nanmean, np.nanmin],
 'ValueB': [np.nanmax]}
my desired output would be:
 ValueA  ValueA  ValueB
nanmean  nanmin  nanmax  Start  Stop
    nan     nan     nan      0   100
      8       1      21    101   200
    nan     nan     nan    666  1000

Use merge with a cross join, with helper columns created by assign:
d = {'ValueA': [np.nanmean, np.nanmin],
     'ValueB': [np.nanmax]}
df = dfa.assign(A=1).merge(dfb.assign(A=1), on='A', how='outer')
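As a side note, on pandas 1.2+ the helper column can be avoided entirely with the dedicated cross merge (a sketch, assuming that version is available):
df = dfa.merge(dfb, how='cross')  # same cross join, no helper column needed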
Then filter by Start and Stop and aggregate by dictionary:
df = (df[(df.Timestamp >= df.Start) & (df.Timestamp <= df.Stop)]
        .groupby(['Start','Stop']).agg(d))
Flatten the MultiIndex columns by map with join:
df.columns = df.columns.map('_'.join)
print (df)
ValueA_nanmean ValueA_nanmin ValueB_nanmax
Start Stop
101 200 8 1 21
And finally join to the original:
df = dfa.join(df, on=['Start','Stop'])
print (df)
Start Stop ValueA_nanmean ValueA_nanmin ValueB_nanmax
0 0 100 NaN NaN NaN
1 101 200 8.0 1.0 21.0
2 666 1000 NaN NaN NaN
EDIT:
Solution with cut:
d = {'ValueA': [np.nanmean, np.nanmin],
     'ValueB': [np.nanmax]}
# if the index is not the default RangeIndex, create it
dfa = dfa.reset_index(drop=True)
print (dfa)
Start Stop
0 0 100
1 101 200
2 666 1000
# prepend the first Start value to the Stop values to form the bin edges
bins = np.insert(dfa['Stop'].values, 0, dfa.loc[0, 'Start'])
print (bins)
[ 0 100 200 1000]
#binning
dfb['id'] = pd.cut(dfb['Timestamp'], bins=bins, labels = dfa.index)
print (dfb)
Timestamp ValueA ValueB id
0 102 1 1 1
1 145 2 2 1
2 113 21 21 1
#aggregate and flatten
df = dfb.groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
#add to dfa
df = pd.concat([dfa, df], axis=1)
print (df)
Start Stop ValueA_nanmean ValueA_nanmin ValueB_nanmax
0 0 100 NaN NaN NaN
1 101 200 8.0 1.0 21.0
2 666 1000 NaN NaN NaN
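Note that the bins approach assumes the intervals are contiguous and right-closed: with these data any Timestamp in the gap 201-665 would be assigned to the last interval, and a Timestamp equal to the very first Start (0) would fall outside the first bin. A sketch that handles gaps and inclusive endpoints directly, assuming pandas 0.25+ and non-overlapping intervals:
# build an IntervalIndex from dfa, closed on both endpoints
intervals = pd.IntervalIndex.from_arrays(dfa['Start'], dfa['Stop'], closed='both')
# get_indexer returns the position of the matching interval, -1 if no match
dfb['id'] = intervals.get_indexer(dfb['Timestamp'])
df = dfb[dfb['id'] != -1].groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
df = pd.concat([dfa, df], axis=1)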

Related

How to merge two dataframes without generating extra rows in result?

I am doing the following with two dataframes, but it generates duplicates and the result is not ordered like the first dataframe.
import pandas as pd
dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
result = pd.merge(df1, df2, on="time", how="left")
This generates a result with 8 rows! I am removing the last 4 characters (the millisecond suffix) from the time column in df1 to match the time in df2.
time value counts growth
0 15:09 10 fg 1.0
1 15:09 10 mn 3.0
2 15:09 20 fg 1.0
3 15:09 20 mn 3.0
4 15:10 30 gl 6.0
5 15:11 40 NaN NaN
6 15:12 50 NaN NaN
7 15:12 60 NaN NaN
There are duplicated rows due to the join.
Is it possible to join the dataframes based on the time column in df1 so that events keep their order and finer time granularity? Is there a way to partially match the time column values of the two dataframes and merge? The ideal result would look like the following:
time value counts growth
0 15:09.123 10 fg 1.0
1 15:09.234 20 mn 3.0
2 15:10.123 30 gl 6.0
3 15:11.123 40 NaN NaN
4 15:12.123 50 NaN NaN
5 15:12.987 60 NaN NaN
Here is one way to do it.
Assumption: the number of rows for a given time (without the millisecond part) is the same in df1 and df2.
# create time without the millisecond part
df1['time2'] = df1['time'].str[:-4]
# add a sequence when there are multiple rows for any time in df1
df1['seq'] = df1.groupby('time2')['time2'].cumcount()
# add the same sequence for df2
df2['seq'] = df2.groupby('time').cumcount()
# merge on the stripped time in df1 and the sequence
pd.merge(df1,
         df2,
         left_on=['time2', 'seq'],
         right_on=['time', 'seq'],
         how='left',
         suffixes=(None, '_y')).drop(columns=['time2', 'seq'])
time value time_y counts growth
0 15:09.123 10 15:09 fg 1.0
1 15:09.234 20 15:09 mn 3.0
2 15:10.123 30 15:10 gl 6.0
3 15:11.123 40 NaN NaN NaN
4 15:12.123 50 NaN NaN NaN
5 15:12.987 60 NaN NaN NaN
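The assumption can be verified up front; a small sanity check (a sketch, using the helper columns created above):
# per-time row counts; any time where df2 has more rows than df1
# would silently lose df2 rows in the left join
c1 = df1.groupby('time2').size()
c2 = df2.groupby('time').size()
print(c2[c2.gt(c1.reindex(c2.index, fill_value=0))])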
Merge on column 'time' with preserved order
Assumption: Data from df1 and df2 are in order of occurrence
import pandas as pd
dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
    "time": ["15:09", "15:09", "15:11"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
df1_keys = df1["time"].unique()
df_list = list()
for key in df1_keys:
    tmp_df1 = df1[df1["time"] == key]
    tmp_df1 = tmp_df1.reset_index(drop=True)
    tmp_df2 = df2[df2["time"] == key]
    tmp_df2 = tmp_df2.reset_index(drop=True)
    df_list.append(pd.merge(tmp_df1, tmp_df2, left_index=True, right_index=True, how="left"))
print(pd.concat(df_list, axis=0))
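One caveat: df1["time"] is overwritten with the truncated strings, so the concatenated result shows "15:09" rather than "15:09.123". A small variant (a sketch, with a hypothetical helper column "key") that keeps the original values, in the spirit of the first answer:
df1 = pd.DataFrame(dict1)
df1["key"] = df1["time"].str[:-4]  # helper key; the original 'time' is kept
df_list = [pd.merge(df1[df1["key"] == k].reset_index(drop=True),
                    df2[df2["time"] == k].drop(columns="time").reset_index(drop=True),
                    left_index=True, right_index=True, how="left")
           for k in df1["key"].unique()]
print(pd.concat(df_list, axis=0).drop(columns="key"))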

Insert/replace/merge values from one dataframe to another

I have two dataframes like this:
df1 = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
                    'ID2':['0','10','80','0','0','0']})
df2 = pd.DataFrame({'ID1':['A','D','E','F'],
                    'ID2':['50','30','90','50'],
                    'aa':['1','2','3','4']})
I want to insert the ID2 values from df2 into ID2 in df1, and at the same time bring aa into df1, matching on ID1, to obtain a new dataframe like this:
df_result = pd.DataFrame({'ID1':['A','B','C','D','E','F'],
                          'ID2':['50','10','80','30','90','50'],
                          'aa':['1','NaN','NaN','2','3','4']})
I've tried to use merge, but it didn't work.
You can use combine_first on the DataFrame after setting the index to ID1:
(df2.set_index('ID1') # values of df2 have priority in case of overlap
.combine_first(df1.set_index('ID1')) # add missing values from df1
.reset_index() # reset ID1 as column
)
output:
ID1 ID2 aa
0 A 50 1
1 B 10 NaN
2 C 80 NaN
3 D 30 2
4 E 90 3
5 F 50 4
Try this (note the design choice: the string '0' in df1 is treated as a missing-value placeholder and replaced with NaN first, so the backfill can prefer df2's value):
new_df = (df1.assign(ID2=df1['ID2'].replace('0', np.nan))
             .merge(df2, on='ID1', how='left')
             .pipe(lambda g: g.assign(ID2=g.filter(like='ID2').bfill(axis=1).iloc[:, 0])
                              .drop(['ID2_x', 'ID2_y'], axis=1)))
Output:
>>> new_df
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Use df.merge with Series.combine_first:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [571]: x['ID2'] = x.ID2_y.combine_first(x.ID2_x)
In [574]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)
In [575]: x
Out[575]:
ID1 aa ID2
0 A 1 50
1 B NaN 10
2 C NaN 80
3 D 2 30
4 E 3 90
5 F 4 50
Or use df.filter with df.ffill:
In [568]: x = df1.merge(df2, on='ID1', how='left')
In [597]: x['ID2'] = x.filter(like='ID2').ffill(axis=1)['ID2_y']
In [599]: x.drop(['ID2_x', 'ID2_y'], axis=1, inplace=True)

Replace values in a dataframe with values from another dataframe when common value is found in specific column

I am trying to replace hours in df with hours from replacements for project IDs that exist in both dataframes:
import pandas as pd
df = pd.DataFrame({
    'project_ids': [1, 2, 3, 4, 5],
    'hours': [111, 222, 333, 444, 555],
    'else': ['a', 'b', 'c', 'd', 'e']
})
replacements = pd.DataFrame({
    'project_ids': [2, 5, 3],
    'hours': [666, 999, 1000],
})
for project in replacements['project_ids']:
    df.loc[df['project_ids'] == project, 'hours'] = replacements.loc[replacements['project_ids'] == project, 'hours']
print(df)
However, only the project ID 3 gets correct assignment (1000), but both 2 and 5 get NaN:
project_ids hours else
0 1 111.0 a
1 2 NaN b
2 3 1000.0 c
3 4 444.0 d
4 5 NaN e
How can I fix it?
Is there a better way to do this?
Use Series.map with a lookup Series created from replacements via DataFrame.set_index:
s = replacements.set_index('project_ids')['hours']
df['hours'] = df['project_ids'].map(s).fillna(df['hours'])
print(df)
project_ids hours else
0 1 111.0 a
1 2 666.0 b
2 3 1000.0 c
3 4 444.0 d
4 5 999.0 e
Another way is df.update(), which matches on index labels and modifies in place, hence the set_index first:
m = df.set_index('project_ids')
m.update(replacements.set_index('project_ids')['hours'])
print(m.reset_index())
project_ids hours else
0 1 111.0 a
1 2 666.0 b
2 3 1000.0 c
3 4 444.0 d
4 5 999.0 e
Another solution would be to use pandas.merge and then use fillna:
df_new = pd.merge(df, replacements, on='project_ids', how='left', suffixes=['_1', ''])
df_new['hours'].fillna(df_new['hours_1'], inplace=True)
df_new.drop('hours_1', axis=1, inplace=True)
print(df_new)
project_ids else hours
0 1 a 111.0
1 2 b 666.0
2 3 c 1000.0
3 4 d 444.0
4 5 e 999.0
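For completeness, the reason the original loop produced NaN: the right-hand side of the .loc assignment is a Series, and pandas aligns it on index labels. Only project 3 happens to sit at the same index position (2) in both frames; for projects 2 and 5 the labels differ, so NaN is assigned. A minimal fix (a sketch, assuming each project appears at most once in replacements) is to strip the index away:
for project in replacements['project_ids']:
    # .values[0] extracts a plain scalar, bypassing index alignment
    df.loc[df['project_ids'] == project, 'hours'] = (
        replacements.loc[replacements['project_ids'] == project, 'hours'].values[0]
    )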

Pandas, replace NaNs with values from MultiIndex DataFrame

Problem
I have a dataframe with some NaNs that I am trying to fill intelligently based off values from another dataframe. I have not found an efficient way to do this but I suspect there is a way with pandas.
Minimal Example
index1 = [1, 1, 1, 2, 2, 2]
index2 = ['a', 'b', 'a', 'b', 'a', 'b']
# dataframe to fillna
df = pd.DataFrame(
    np.asarray([[np.nan, 90, 90, 100, 100, np.nan], index1, index2]).T,
    columns=['data', 'index1', 'index2']
)
# dataframe to lookup fill values from
multi_index = pd.MultiIndex.from_product([sorted(list(set(index1))), sorted(list(set(index2)))])
fill_val_lookup = pd.DataFrame([89, 91, 99, 101], index=multi_index,
                               columns=['fill_vals'])
Starting data (df):
data index1 index2
0 nan 1 a
1 90 1 b
2 90 1 a
3 100 2 b
4 100 2 a
5 nan 2 b
Lookup table to find values to fill NaNs:
fill_vals
1 a 89
b 91
2 a 99
b 101
Desired output:
data index1 index2
0 89 1 a
1 90 1 b
2 90 1 a
3 100 2 b
4 100 2 a
5 101 2 b
Ideas
The closest post I have found is about filling NaNs with values from one level of a multiindex.
I've also tried setting the index of df to be a multiindex using columns index1 and index2 and then using df.fillna, however this does not work.
combine_first is the function that you need. But first, update the index names of the other dataframe.
fill_val_lookup.index.names = ["index1", "index2"]
fill_val_lookup.columns = ["data"]
# the np.asarray construction above made every column object dtype, so restore dtypes
df.index1 = df.index1.astype(int)
df.data = df.data.astype(float)
df.set_index(["index1","index2"]).combine_first(fill_val_lookup)\
.reset_index()
# index1 index2 data
#0 1 a 89.0
#1 1 a 90.0
#2 1 b 90.0
#3 2 a 100.0
#4 2 b 100.0
#5 2 b 101.0
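As to why the asker's df.fillna attempt did not work: filling a DataFrame from another DataFrame aligns on both index and column names, and 'data' versus 'fill_vals' never match. Filling the single column as a Series aligned on the MultiIndex does work (a sketch, reusing the renamed lookup and dtype fixes from above):
tmp = df.set_index(['index1', 'index2'])
# Series.fillna aligns on the (Multi)Index values
tmp['data'] = tmp['data'].fillna(fill_val_lookup['data'])
print(tmp.reset_index())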

Find different rows between 2 dataframes of different size with Pandas

I have 2 dataframes df1 and df2 of different size.
df1 = pd.DataFrame({'A':[np.nan, np.nan, np.nan, 'AAA','SSS','DDD'], 'B':[np.nan,np.nan,'ciao',np.nan,np.nan,np.nan]})
df2 = pd.DataFrame({'C':[np.nan, np.nan, np.nan, 'SSS','FFF','KKK','AAA'], 'D':[np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan]})
My goal is to identify the elements of df1 which do not appear in df2.
I was able to achieve my goal using the following lines of code.
df = pd.DataFrame({})
for i, row1 in df1.iterrows():
    found = False
    for j, row2 in df2.iterrows():
        if row1['A'] == row2['C']:
            found = True
            print(row1.to_frame().T)
    if found == False and pd.isnull(row1['A']) == False:
        df = pd.concat([df, row1.to_frame().T], axis=0)
df.reset_index(drop=True)
Is there a more elegant and efficient way to achieve my goal?
Note: the solution is
A B
0 DDD NaN
You need isin with boolean indexing.
To also omit the NaN rows, chain an additional isnull condition:
# df2 changed so that there are no NaNs in column C
df2 = pd.DataFrame({'C':[4, 5, 5, 'SSS','FFF','KKK','AAA'],
'D':[np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan]})
print (df2)
C D
0 4 NaN
1 5 NaN
2 5 NaN
3 SSS 1.0
4 FFF NaN
5 KKK NaN
6 AAA NaN
df = df1[~(df1['A'].isin(df2['C']) | (df1['A'].isnull()))]
print (df)
A B
5 DDD NaN
If omitting NaNs is not necessary (or none exist in column C):
df = df1[~df1['A'].isin(df2['C'])]
print (df)
A B
0 NaN NaN
1 NaN NaN
2 NaN ciao
5 DDD NaN
If NaNs exist in both columns, the second solution is sufficient (input DataFrames are those from the question), because isin matches NaN against NaN, so the negation drops those rows too:
df = df1[~df1['A'].isin(df2['C'])]
print (df)
A B
5 DDD NaN
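For reference, an alternative with merge and its indicator argument (a sketch; drop_duplicates guards against row multiplication, and pandas merge, like isin, matches NaN keys against NaN keys):
# rows of df1 that find no partner in df2['C'] are tagged 'left_only'
m = df1.merge(df2[['C']].drop_duplicates(), left_on='A', right_on='C',
              how='left', indicator=True)
df = m.loc[m['_merge'] == 'left_only', df1.columns.tolist()]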
