Combining two dataframes - python

I've tried merging two dataframes, but I can't get it to work: each time I merge, the rows where I expect values are all 0. DataFrame df1 already has some data in it, with some rows left blank. DataFrame df2 should populate those blank rows in df1 where the column names match, at each ("TempBin", "Month") pair in df1.
EDIT:
Both dataframes are inside a for loop. df1 acts as my "storage"; df2 changes on each location iteration. So if df2 contained the results for LocationZP, I would also want that data inserted into the matching df1 rows. If I use df1 = df1.append(df2) in the loop, all of the rows from df2 keep getting appended to the very end of df1 on each iteration.
df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 0 0 0
13 1 0 0 0
13 2 0 0 0
13 3 0 0 0
13 4 0 0 0
13 5 0 0 0
df2:
Month TempBin LocationAA
13 0 11
13 1 22
13 2 33
13 3 44
13 4 55
13 5 66
desired output in df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 11 0 0
13 1 22 0 0
13 2 33 0 0
13 3 44 0 0
13 4 55 0 0
13 5 66 0 0
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0, 1, 2, 3, 4, 5]*2,
                    'LocationAA': [7, 98, 12, 3, 7, 1, 0, 0, 0, 0, 0, 0],
                    'LocationXA': [1, 0, 23, 14, 9, 8, 0, 0, 0, 0, 0, 0],
                    'LocationZP': [2, 89, 38, 17, 14, 99, 0, 0, 0, 0, 0, 0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0, 1, 2, 3, 4, 5],
                    'LocationAA': [11, 22, 33, 44, 55, 66]})
df1 = pd.merge(df1, df2, on=["Month", "TempBin", "LocationAA"], how="left")
result:
Month TempBin LocationAA LocationXA LocationZP
1 0 7.0 1.0 2.0
1 1 98.0 0.0 89.0
1 2 12.0 23.0 38.0
1 3 3.0 14.0 17.0
1 4 7.0 9.0 14.0
1 5 1.0 8.0 99.0
13 0 NaN NaN NaN
13 1 NaN NaN NaN
13 2 NaN NaN NaN
13 3 NaN NaN NaN
13 4 NaN NaN NaN
13 5 NaN NaN NaN

Here's some code that works for me. Note that fillna(0) has to run after the two LocationAA columns are combined; if the NaNs are filled first, the null check never fires and the Month-1 values get overwritten with 0:
# Merge the two dataframes on the columns "TempBin" and "Month", then fill NaN values with 0.
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0, 1, 2, 3, 4, 5]*2,
                    'LocationAA': [7, 98, 12, 3, 7, 1, 0, 0, 0, 0, 0, 0],
                    'LocationXA': [1, 0, 23, 14, 9, 8, 0, 0, 0, 0, 0, 0],
                    'LocationZP': [2, 89, 38, 17, 14, 99, 0, 0, 0, 0, 0, 0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0, 1, 2, 3, 4, 5],
                    'LocationAA': [11, 22, 33, 44, 55, 66]})
df_merge = pd.merge(df1, df2, how='left', on=['TempBin', 'Month'])
# add column LocationAA, filled with the non-null value from LocationAA_x and LocationAA_y
df_merge['LocationAA'] = df_merge.apply(
    lambda x: x['LocationAA_x'] if pd.isnull(x['LocationAA_y']) else x['LocationAA_y'],
    axis=1)
# remove the helper columns LocationAA_x and LocationAA_y
df_merge.drop(['LocationAA_x', 'LocationAA_y'], axis=1, inplace=True)
# fill any remaining NaN values with 0 (must happen after the null check above)
df_merge.fillna(0, inplace=True)
print(df_merge)
Output:
Month TempBin LocationXA LocationZP LocationAA
0 1 0 1 2 7.0
1 1 1 0 89 98.0
2 1 2 23 38 12.0
3 1 3 14 17 3.0
4 1 4 9 14 7.0
5 1 5 8 99 1.0
6 13 0 0 0 11.0
7 13 1 0 0 22.0
8 13 2 0 0 33.0
9 13 3 0 0 44.0
10 13 4 0 0 55.0
11 13 5 0 0 66.0
Let me know in the comments if there's something you don't understand :)
PS: Sorry for the extra comments, but I left them in for some more explanation.

You need to append the rows of df2 to get the desired output. DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so use pd.concat:
df1 = pd.concat([df1, df2])
and if you want to replace the nulls with zeros, add:
df1 = df1.fillna(0)

Here is another way, using combine_first() (df2's non-null values take priority; df1 fills the gaps):
i = ['Month','TempBin']
df1 = df2.set_index(i).combine_first(df1.set_index(i)).reset_index()
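For reference, a self-contained run of the combine_first approach on the question's frames:

```python
import pandas as pd

df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0, 1, 2, 3, 4, 5]*2,
                    'LocationAA': [7, 98, 12, 3, 7, 1, 0, 0, 0, 0, 0, 0],
                    'LocationXA': [1, 0, 23, 14, 9, 8, 0, 0, 0, 0, 0, 0],
                    'LocationZP': [2, 89, 38, 17, 14, 99, 0, 0, 0, 0, 0, 0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0, 1, 2, 3, 4, 5],
                    'LocationAA': [11, 22, 33, 44, 55, 66]})

i = ['Month', 'TempBin']
# df2's non-null values win at each (Month, TempBin); df1 fills everything else
df1 = df2.set_index(i).combine_first(df1.set_index(i)).reset_index()
print(df1)
```

Note that the zeros in df1's Month-13 rows are not null, so they don't "win" over df2's values here; combine_first only cares about NaN, which is why df2 goes on the left.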

Related

Replace columns with the same value between two dataframes

I have two dataframes
df1
Date RPM
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
and df2
Date RPM
0 0 0
1 2 2
2 4 4
3 6 6
I want to replace the RPM in df1 with the RPM from df2 where they have the same Date.
I tried with replace, but it didn't work out.
Use Series.map with a Series created from df2, then replace the missing values with the original column via Series.fillna:
df1['RPM'] = df1['Date'].map(df2.set_index('Date')['RPM']).fillna(df1['RPM'])
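A runnable sketch of this map approach, with the question's Date/RPM data built explicitly:

```python
import pandas as pd

df1 = pd.DataFrame({'Date': range(8), 'RPM': [0]*8})
df2 = pd.DataFrame({'Date': [0, 2, 4, 6], 'RPM': [0, 2, 4, 6]})

# look up each Date in df2's Date -> RPM mapping; keep the old RPM where there is no match
df1['RPM'] = df1['Date'].map(df2.set_index('Date')['RPM']).fillna(df1['RPM'])
print(df1)
```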
You could merge() the two frames on the Date column to get the new RPM against the corresponding date row:
df = df1.merge(df2, on='Date', how='left', suffixes=[None, ' new'])
Date RPM RPM new
0 0 0 0.0
1 1 0 NaN
2 2 0 2.0
3 3 0 NaN
4 4 0 4.0
5 5 0 NaN
6 6 0 6.0
7 7 0 NaN
You can then fill in the nans in RPM new using .fillna() to get the RPM column:
df['RPM'] = df['RPM new'].fillna(df['RPM'])
Date RPM RPM new
0 0 0.0 0.0
1 1 0.0 NaN
2 2 2.0 2.0
3 3 0.0 NaN
4 4 4.0 4.0
5 5 0.0 NaN
6 6 6.0 6.0
7 7 0.0 NaN
Then drop the RPM new column:
df = df.drop('RPM new', axis=1)
Date RPM
0 0 0.0
1 1 0.0
2 2 2.0
3 3 0.0
4 4 4.0
5 5 0.0
6 6 6.0
7 7 0.0
Full code:
df = df1.merge(df2, on='Date', how='left', suffixes=[None, ' new'])
df['RPM'] = df['RPM new'].fillna(df['RPM'])
df = df.drop('RPM new', axis=1)

Convert a python df which is in pivot format to a proper row column format

I have the following dataframe:
id a_1_1 a_1_2 a_1_3 a_1_4 b_1_1 b_1_2 b_1_3 c_1_1 c_1_2 c_1_3
1 10 20 30 40 90 80 70 NaN NaN NaN
2 33 34 35 36 NaN NaN NaN 11 12 13
and I want my result to be as follows:
id col_name 1 2 3
1 a 10 20 30
1 b 90 80 70
2 a 33 34 35
2 c 11 12 13
I am trying to use the pd.melt function, but it isn't yielding the correct result.
IIUC, you can reshape using an intermediate MultiIndex after extracting the letter and last digit from the original column names:
(df.set_index('id')
.pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
d.columns.str.extract(r'^([^_]+).*(\d+)'),
names=['col_name', None]
), axis=1))
.stack('col_name')
.dropna(axis=1) # assuming you don't want columns with NaNs
.reset_index()
)
Variant using janitor's pivot_longer:
# pip install janitor
import janitor
(df
.pivot_longer(index='id', names_to=('col name', '.value'),
names_pattern=r'([^_]+).*(\d+)')
.pipe(lambda d: d.dropna(thresh=d.shape[1]-2))
.dropna(axis=1)
)
output:
id col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
Code:
df = df1.melt(id_vars=["id"],
              var_name="Col_name",
              value_name="Value").dropna()
df['Num'] = df['Col_name'].apply(lambda x: x[-1])
df['Col_name'] = df['Col_name'].apply(lambda x: x[0])
df = (df.pivot(index=['id', 'Col_name'], columns='Num', values='Value')
        .reset_index()
        .dropna(axis=1))
df
Output:
Num id Col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
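As a self-contained check, here is a runnable version of the melt/pivot code above, with the question's sample frame built explicitly (the `Nan`/`nan` entries read as missing values):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2],
                    'a_1_1': [10, 33], 'a_1_2': [20, 34], 'a_1_3': [30, 35], 'a_1_4': [40, 36],
                    'b_1_1': [90, np.nan], 'b_1_2': [80, np.nan], 'b_1_3': [70, np.nan],
                    'c_1_1': [np.nan, 11], 'c_1_2': [np.nan, 12], 'c_1_3': [np.nan, 13]})

# long format: one row per (id, original column) pair, NaNs dropped
df = df1.melt(id_vars=["id"], var_name="Col_name", value_name="Value").dropna()
df['Num'] = df['Col_name'].str[-1]       # trailing digit of the column name
df['Col_name'] = df['Col_name'].str[0]   # leading letter of the column name
# back to wide: one row per (id, letter); dropna(axis=1) removes the sparse '4' column
df = (df.pivot(index=['id', 'Col_name'], columns='Num', values='Value')
        .reset_index()
        .dropna(axis=1))
print(df)
```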

Repeat the dataframe and increment the timestamps

I have a dataframe like this
Temp time[s]
0 20 0
1 21 1
2 21.5 2
I want to repeat the dataframe but increment the timestamps; my output should look like this:
Temp time[s]
0 20 0
1 21 1
2 21.5 2
3 20 3
4 21 4
5 21.5 5
I tried something like this, but it didn't work for me:
df2 = pd.concat([df]*2, ignore_index=True)
df2['time[s]'] = df2.groupby(level=0).cumcount() * 1
Can anyone help me, please?
For a generic approach, you can use:
import numpy as np

N = 2 # number of repetitions
df2 = pd.concat([df]*N, ignore_index=True)
df2['time[s]'] += np.repeat(np.arange(N), len(df))*len(df)
# or
# df2['time[s]'] += np.arange(len(df2))//len(df)*len(df)
output:
Temp time[s]
0 20.0 0
1 21.0 1
2 21.5 2
3 20.0 3
4 21.0 4
5 21.5 5
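A self-contained run of this generic approach on the question's frame, with N=2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Temp': [20, 21, 21.5], 'time[s]': [0, 1, 2]})

N = 2  # number of repetitions
df2 = pd.concat([df]*N, ignore_index=True)
# shift each copy's timestamps by one full block length per repetition:
# np.repeat(np.arange(2), 3) -> [0, 0, 0, 1, 1, 1], times len(df) -> [0, 0, 0, 3, 3, 3]
df2['time[s]'] += np.repeat(np.arange(N), len(df)) * len(df)
print(df2)
```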
Try this for your exact use case with one repetition:
df_x = pd.concat([df, df]).reset_index(drop=True)
df_x["time[s]"] = df_x.index
Print out:
Temp time[s]
0 20.0 0
1 21.0 1
2 21.5 2
3 20.0 3
4 21.0 4
5 21.5 5

How to slice a row with duplicate column names and stack that rows in order

I have a dataframe as shown below, and I want to convert it into multiple rows without changing the order.
RESP HR SPO2 PULSE
1 46 122 0 0
2 46 122 0 0
3
4
One possible solution is to use reshape; the only requirement is that the total number of values is a multiple of the column count (so all of the data can be converted to a 4-column DataFrame):
df1 = pd.DataFrame(df.values.reshape(-1, 4), columns=['RESP','HR','SPO2','PULSE'])
df1['RESP1'] = df1['RESP'].shift(-1)
General data solution:
import numpy as np
import pandas as pd

a = '46 122 0 0 46 122 0 0 45 122 0 0 45 122 0'.split()
df = pd.DataFrame([a]).astype(int)
print(df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 46 122 0 0 46 122 0 0 45 122 0 0 45 122 0
#flatten values
a = df.values.ravel()
#number of new columns
N = 4
#array filled by NaNs for possible add NaNs to end of last row
arr = np.full(((len(a) - 1)//N + 1)*N, np.nan)
#fill array by flatten values
arr[:len(a)] = a
#reshape to new DataFrame (last value is NaN)
df1 = pd.DataFrame(arr.reshape((-1, N)), columns=['RESP','HR','SPO2','PULSE'])
#new column with shifting first col
df1['RESP1'] = df1['RESP'].shift(-1)
print(df1)
RESP HR SPO2 PULSE RESP1
0 46.0 122.0 0.0 0.0 46.0
1 46.0 122.0 0.0 0.0 45.0
2 45.0 122.0 0.0 0.0 45.0
3 45.0 122.0 0.0 NaN NaN
Here's another way with groupby:
df = pd.DataFrame(np.arange(12).reshape(1, -1), columns=list('abcd'*3))
new_df = pd.concat((x.stack().reset_index(drop=True).rename(k)
                    for k, x in df.groupby(df.columns, axis=1)),
                   axis=1)
new_df = (new_df.assign(a1=lambda x: x['a'].shift(-1))
                .rename(columns={'a1': 'a'}))
Output:
a b c d a
0 0 1 2 3 4.0
1 4 5 6 7 8.0
2 8 9 10 11 NaN

Add Series to DataFrame with additional index values

I have a DataFrame which looks like this:
Value
1 23
2 12
3 4
And a Series which looks like this:
1 24
2 12
4 34
Is there a way to add the Series to the DataFrame to obtain a result which looks like this:
Value New
1 23 24
2 12 12
3 4 0
4 0 34
Using concat(..., axis=1) and .fillna():
import pandas as pd
df = pd.DataFrame([23,12,4], columns=["Value"], index=[1,2,3])
s = pd.Series([24,12,34],index=[1,2,4], name="New")
df = pd.concat([df,s],axis=1)
print(df)
df = df.fillna(0) # or df.fillna(0, inplace=True)
print(df)
Output:
Value New
1 23.0 24.0
2 12.0 12.0
3 4.0 NaN
4 NaN 34.0
# After replacing NaNs with 0:
Value New
1 23.0 24.0
2 12.0 12.0
3 4.0 0.0
4 0.0 34.0
You can use join between a series and a dataframe:
my_df.join(my_series, how='outer').fillna(0)
Example:
>>> df
Value
1 23
2 12
3 4
>>> s
0
1 24
2 12
4 34
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> type(s)
<class 'pandas.core.series.Series'>
>>> df.join(s, how='outer').fillna(0)
Value 1
1 23.0 24.0
2 12.0 12.0
3 4.0 0.0
4 0.0 34.0
