Combine two data frames by matching index - python

I have a data frame that has the form:
index predicted
1 1
2 1
3 0
4 0
5 1
And another that has the form:
index actual
2 1
4 0
I want the data frame:
index predicted actual
1 1 nan
2 1 1
3 0 nan
4 0 0
5 1 nan
I've tried pd.concat([df1,df2], on="index", how="left") and pd.merge(df1, df2, axis=1)
Both give the dataframe:
index predicted actual
1 1 1
2 1 0
3 0 nan
4 0 nan
5 1 nan
How can I get the data frame I need? Thanks in advance.

You can use pd.merge(), setting the parameters left_index=True and right_index=True:
import pandas as pd
df1 = pd.DataFrame({'predicted': [1,1,0,0,1]}, index = (1,2,3,4,5))
df2 = pd.DataFrame({'actual': [1,0]}, index = (2,4))
pd.merge(df1, df2, how = 'left', left_index=True, right_index=True)
This merges the two dataframes on the index and produces the required result:
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
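Note that the NaN values force actual to a float dtype (hence 1.0/0.0). If you would rather keep whole numbers, a small optional tweak (not part of the original answer) is to cast to pandas' nullable Int64 dtype:
merged = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
# Nullable integer dtype keeps whole numbers and shows <NA> for missing values
merged['actual'] = merged['actual'].astype('Int64')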

If you make sure that your index column is actually the df.index, pd.concat should work:
import pandas as pd
left = pd.DataFrame({"predicted": [1, 1, 0, 0, 1]}, index=[1, 2, 3, 4, 5])
right = pd.DataFrame({"actual": [1, 0]}, index=[2, 4])
out = pd.concat([left, right], axis=1)
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
If the index values are instead ordinary columns, as in:
left = left.reset_index(names="index")
right = right.reset_index(names="index")
then you can use:
left.merge(right, on="index", how="left")
index predicted actual
0 1 1 NaN
1 2 1 1.0
2 3 0 NaN
3 4 0 0.0
4 5 1 NaN

Create the index as a temporary column, left join on that column, then set it back as the index:
predict_df = pd.DataFrame({'predicted': [1,1,0,0,1]}, index=range(1,6))
actual_df = pd.DataFrame({'actual': [1,0]}, index=[2,4])
pd.merge(
    left=predict_df.reset_index(),
    right=actual_df.reset_index(),
    how='left',
    on='index'
).set_index('index')
predicted actual
index
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
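A merge-free alternative (a sketch of mine using reindex, not part of the answers above): align the smaller frame's column to the larger frame's index and assign, which fills the gaps with NaN automatically:
# reindex returns NaN for index labels missing from actual_df
predict_df['actual'] = actual_df['actual'].reindex(predict_df.index)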

DataFrame.join aligns on the index and performs a left join by default, so a one-liner also works:
df1.join(df2)
Output:
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
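join also accepts a list of frames, which is convenient when attaching several index-aligned columns at once. A minimal sketch, assuming a third frame df3 with non-overlapping column names:
df3 = pd.DataFrame({'weight': [0.5, 0.7]}, index=[1, 3])
out = df1.join([df2, df3])  # still a left join on df1's index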

Related

Setting the last n non-NaN values per group to NaN

I have a DataFrame with (several) grouping variables and (several) value variables. My goal is to set the last n non-NaN values to NaN. Let's take a simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'value': [1, 2, np.nan, 9, 8]})
df
Out[1]:
id value
0 1 1.0
1 1 2.0
2 1 NaN
3 2 9.0
4 2 8.0
The desired result for n=1 would look like the following:
Out[53]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
Use groupby().cumcount() together with the group sizes:
N = 1
# Rank each non-NaN row within its group and get the non-NaN count per group
groups = df.loc[df['value'].notna()].groupby('id')
enum = groups.cumcount()
sizes = groups['value'].transform('size')
# Keep a value only if it is not among the last N non-NaN values of its group
df['value'] = df['value'].where(enum < sizes - N)
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
You can take a reversed cumsum after groupby to count, for each row, how many non-NaN values remain from that row to the end of its group, and keep a value only while more than N remain:
df['value'].where(df['value'].notna().iloc[::-1].groupby(df['id']).cumsum()>1,inplace=True)
df
Out[86]:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
One option: create a reversed cumcount on the non-NA values:
N = 1
m = (df
     .loc[df['value'].notna()]
     .groupby('id')
     .cumcount(ascending=False)
     .lt(N)
)
df.loc[m[m].index, 'value'] = np.nan
Similar approach with boolean masking:
m = df['value'].notna()
df['value'] = df['value'].mask(m[::-1].groupby(df['id']).cumsum().le(N))
Output:
id value
0 1 1.0
1 1 NaN
2 1 NaN
3 2 9.0
4 2 NaN
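All three answers share the same idea: number the non-NaN rows from the end within each group and mask the last N. A minimal sketch wrapping that as a reusable function (the function name is mine, not from the answers):
import numpy as np
import pandas as pd

def mask_last_n(df, group_col, value_col, n=1):
    # Reverse-rank the non-NaN rows within each group (0 = last non-NaN value)
    rev_rank = (df.loc[df[value_col].notna()]
                  .groupby(group_col)
                  .cumcount(ascending=False))
    out = df.copy()
    # Rows whose reverse rank is below n are the last n non-NaN values
    out.loc[rev_rank[rev_rank < n].index, value_col] = np.nan
    return out

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'value': [1, 2, np.nan, 9, 8]})
print(mask_last_n(df, 'id', 'value', n=1))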

combining two dataframes giving NaN values [duplicate]

I am curious why a simple concatenation of two dataframes in pandas:
initId.shape # (66441, 1)
initId.isnull().sum() # 0
ypred.shape # (66441, 1)
ypred.isnull().sum() # 0
of the same shape and both without NaN values
foo = pd.concat([initId, ypred], join='outer', axis=1)
foo.shape # (83384, 2)
foo.isnull().sum() # 16943
can result in a lot of NaN values if joined.
How can I fix this problem and prevent NaN values being introduced?
Trying to reproduce it like this:
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
pd.concat([aaa, bbb], axis=1)
failed, i.e. it worked just fine and no NaN values were introduced.
I think the problem is different index values; where concat cannot align the indexes, you get NaN:
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
print(aaa)
prediction
4 0
5 1
8 0
7 1
10 0
12 0
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 NaN 0.0
1 NaN 0.0
2 NaN 1.0
3 NaN 0.0
4 0.0 1.0
5 1.0 1.0
7 1.0 NaN
8 0.0 NaN
10 0.0 NaN
12 0.0 NaN
The solution is reset_index, if the original index values are not needed:
aaa.reset_index(drop=True, inplace=True)
bbb.reset_index(drop=True, inplace=True)
print(aaa)
prediction
0 0
1 1
2 0
3 1
4 0
5 0
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 0 0
1 1 0
2 0 1
3 1 0
4 0 1
5 0 1
EDIT: If you need the same index as aaa and the DataFrames have the same length, use:
bbb.index = aaa.index
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
4 0 0
5 1 0
8 0 1
7 1 0
10 0 1
12 0 1
To concatenate multiple DataFrames, keep the column names, and avoid NaN, you can do something like this:
concatenated_dataframes = pd.concat(
    [
        dataframe_1.reset_index(drop=True),
        dataframe_2.reset_index(drop=True),
        dataframe_3.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)
concatenated_dataframes_columns = [
    list(dataframe_1.columns),
    list(dataframe_2.columns),
    list(dataframe_3.columns),
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
As jezrael pointed out, this is due to different index labels. concat matches on the index, so if the labels differ, this problem occurs. For a straightforward horizontal concatenation, you must coerce the index labels to be the same. One way is the set_axis method, which makes the second dataframe's index the same as the first's:
joined_df = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
Or just reset the index of both frames:
joined_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
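A quick pre-check before any horizontal concat can catch this early (a suggestion of mine, not from the answers above): compare the two indexes and realign only if they differ:
if not initId.index.equals(ypred.index):
    # set_axis requires equal lengths; it reuses initId's labels for ypred
    ypred = ypred.set_axis(initId.index)
foo = pd.concat([initId, ypred], axis=1)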

How to fill a NaN column with natural numbers in increasing order?

I have a data-frame
Columns
0 NaN
1 NaN
2 NaN
3 NaN
I want to fill all the NaN values here with natural numbers, starting from 1 and increasing down the column.
Expected Output
Columns
0 1
1 2
2 3
3 4
Any suggestions to do this?
df['Columns'] = df['Columns'].fillna(??????????)
If you need to replace only the missing values, use DataFrame.loc with Series.cumsum on the boolean mask; cumsum treats True as 1, so the k-th missing row gets the value k:
m = df['Columns'].isna()
#nice solution from #Ch3steR, thank you
df.loc[m, 'Columns'] = m.cumsum()
#alternative
#df.loc[m, 'Columns'] = range(1, m.sum() + 1)
print (df)
Columns
0 1
1 2
2 3
3 4
Testing with different data:
print (df)
Columns
0 NaN
1 NaN
2 100.0
3 NaN
m = df['Columns'].isna()
df.loc[m, 'Columns'] = m.cumsum()
print (df)
Columns
0 1.0
1 2.0
2 100.0
3 3.0
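The trick here is that cumsum over a boolean mask counts the True values, so the k-th missing row receives k. A tiny illustration:
m = pd.Series([True, True, False, True])
print(m.cumsum().tolist())  # [1, 2, 2, 3]; only the True positions are used above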
If you need to set the values by a range instead, overwriting the original column values, use:
df['Columns'] = range(1, len(df) + 1)
print (df)
Columns
0 1
1 2
2 3
3 4

Python Pandas creating column on condition with dynamic amount of columns

I create a new dataframe based on a user parameter, say a = 2, so my dataframe df shrinks to 4 (a x 2) columns, called df_new. For example:
df_new = pd.DataFrame(data = {'col_01_01': [float('nan'),float('nan'),1,2,float('nan')], 'col_02_01': [float('nan'),float('nan'),1,2,float('nan')],'col_01_02': [0,0,0,0,1],'col_02_02': [1,0,0,1,1],'output':[1,0,1,1,1]})
To be more precise about the output column, look at the first row, (nan, nan, 0, 1): apply notna() to the first two entries and the comparison == 1 to the third and fourth. This gives (False, False, False, True); combining these with OR yields the desired result True -> 1.
In the second row we find (nan, nan, 0, 0), so the output is 0, since there is no valid value in the first two columns and 0 in the last two.
For a parameter a=3 we would have 6 columns.
The result looks like this:
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
You can use vectorised operations with notnull and eq:
null_cols = ['col_01_01', 'col_02_01']
int_cols = ['col_01_02', 'col_02_02']
df['output'] = (df[null_cols].notnull().any(axis=1) | df[int_cols].eq(1).any(axis=1)).astype(int)
print(df)
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
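Since the number of columns depends on the user parameter a, the two column lists can be built dynamically instead of hardcoded. A sketch, assuming the col_<i>_01 / col_<i>_02 naming pattern from the question:
a = 2
null_cols = [f'col_{i:02d}_01' for i in range(1, a + 1)]  # ['col_01_01', 'col_02_01']
int_cols = [f'col_{i:02d}_02' for i in range(1, a + 1)]   # ['col_01_02', 'col_02_02']
df['output'] = (df[null_cols].notnull().any(axis=1)
                | df[int_cols].eq(1).any(axis=1)).astype(int)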

Merge two dataframes in pandas

I have a logic problem I can't quite solve. I have two dataframes.
Dataframe_one has the following columns:
[Id, workflowprofile_A, workflow_profile_B, important_string_info]
Dataframe_two has the following columns:
[workflowprofile, option, workflow]
My problem is that workflowprofile from Dataframe_two can match workflowprofile_A or workflow_profile_B (or both) from Dataframe_one. How can I get a merged dataframe whose columns look like this?
dataframe_three:
[Id, workflowprofile_A, workflowprofile_fromA, option_fromA, workflow_fromA, important_string_info_fromA, workflow_profile_B, workflowprofile_fromB, option_fromB, workflow_fromB, important_string_info_fromB]
You can create a new column with fillna or combine_first (one of the two values is always NaN) and then merge on this column:
df1['workflowprofile'] = df1['workflowprofile_A'].fillna(df1['workflow_profile_B'])
#alternative
#df1['workflowprofile'] = df1['workflowprofile_A'].combine_first(df1['workflow_profile_B'])
df3 = pd.merge(df1, df2, on='workflowprofile')
Sample:
print (df1)
Id workflowprofile_A workflow_profile_B important_string_info
0 1 7.0 NaN 8
1 2 NaN 5.0 1
print (df2)
workflowprofile option workflow
0 7 0 0
1 5 9 0
2 7 0 0
3 4 1 2
df1['workflowprofile'] = df1['workflowprofile_A'].fillna(df1['workflow_profile_B'])
df3 = pd.merge(df1, df2, on='workflowprofile')
print (df3)
Id workflowprofile_A workflow_profile_B important_string_info \
0 1 7.0 NaN 8
1 1 7.0 NaN 8
2 2 NaN 5.0 1
workflowprofile option workflow
0 7 0 0
1 7 0 0
2 5 9 0
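If you really need the separate _fromA and _fromB column sets from the desired schema in the question, one option (a sketch of mine, not from the answer above) is to merge twice, suffixing df2's columns per side:
df3 = (df1
       .merge(df2.add_suffix('_fromA'), how='left',
              left_on='workflowprofile_A', right_on='workflowprofile_fromA')
       .merge(df2.add_suffix('_fromB'), how='left',
              left_on='workflow_profile_B', right_on='workflowprofile_fromB'))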
