combining two dataframes giving NaN values [duplicate] - python

I am curious why a simple concatenation of two dataframes in pandas:
initId.shape # (66441, 1)
initId.isnull().sum() # 0
ypred.shape # (66441, 1)
ypred.isnull().sum() # 0
of the same shape and both without NaN values
foo = pd.concat([initId, ypred], join='outer', axis=1)
foo.shape # (83384, 2)
foo.isnull().sum() # 16943
can result in a lot of NaN values if joined.
How can I fix this problem and prevent NaN values being introduced?
Trying to reproduce it like
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
pd.concat([aaa, bbb], axis=1)
failed, i.e. it worked just fine and no NaN values were introduced.

I think the problem is different index values; where concat cannot align the indexes, it produces NaN:
aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
print(aaa)
prediction
4 0
5 1
8 0
7 1
10 0
12 0
bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 NaN 0.0
1 NaN 0.0
2 NaN 1.0
3 NaN 0.0
4 0.0 1.0
5 1.0 1.0
7 1.0 NaN
8 0.0 NaN
10 0.0 NaN
12 0.0 NaN
The solution is reset_index, if the index values are not needed:
aaa.reset_index(drop=True, inplace=True)
bbb.reset_index(drop=True, inplace=True)
print(aaa)
prediction
0 0
1 1
2 0
3 1
4 0
5 0
print(bbb)
groundTruth
0 0
1 0
2 1
3 0
4 1
5 1
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
0 0 0
1 1 0
2 0 1
3 1 0
4 0 1
5 0 1
EDIT: If you need the same index as aaa and both DataFrames have the same length, use:
bbb.index = aaa.index
print (pd.concat([aaa, bbb], axis=1))
prediction groundTruth
4 0 0
5 1 0
8 0 1
7 1 0
10 0 1
12 0 1
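A quick way to confirm that misaligned indexes are the cause in the original question (a sketch; initId and ypred are the asker's frames) is to compare the two indexes directly:
print(initId.index.equals(ypred.index))             # False when the labels differ
print(len(initId.index.intersection(ypred.index)))  # how many rows actually align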

To concatenate multiple DataFrames while keeping the column names and avoiding NaN, you can do something like this:
concatenated_dataframes = pd.concat(
    [
        dataframe_1.reset_index(drop=True),
        dataframe_2.reset_index(drop=True),
        dataframe_3.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)
concatenated_dataframes_columns = [
    list(dataframe_1.columns),
    list(dataframe_2.columns),
    list(dataframe_3.columns),
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
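If the original column names are what you want anyway, a shorter route to the same result (a sketch, using the same hypothetical dataframe_1 through dataframe_3) is to let reset_index alone align the rows; concat along axis=1 then keeps the column names automatically:
frames = [dataframe_1, dataframe_2, dataframe_3]
combined = pd.concat([df.reset_index(drop=True) for df in frames], axis=1)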

As jezrael pointed out, this is due to different index labels. concat matches on the index, so if the labels are not the same, this problem will occur. For a straightforward horizontal concatenation, you must "coerce" the index labels to be the same. One way is via the set_axis method, which makes the second dataframe's index the same as the first's:
joined_df = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
or just reset the index of both frames:
joined_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)

Related

Ignore nan elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save in tdf['b'] the values from df2 at each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2.iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this using the .loc indexer like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, the indices get messed up (for example, index 0 of tdf['b'] should be NaN but it would have a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
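To get the tdf layout the question asks for (a sketch; the column names 'a' and 'b' are the question's, and s, df1, df2 are the objects above):
tdf = pd.DataFrame({'a': s, 'b': s.map(df2['a'])})  # rows where the rolling window is incomplete stay NaN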

Combine two data frames by matching index

I have a data frame that has the form:
index predicted
1 1
2 1
3 0
4 0
5 1
And another that has the form:
index actual
2 1
4 0
I want the data frame:
index predicted actual
1 1 nan
2 1 1
3 0 nan
4 0 0
5 1 nan
I've tried pd.concat([df1,df2], on="index", how="left") and pd.merge(df1, df2, axis=1)
Both give the dataframe:
index predicted actual
1 1 1
2 1 0
3 0 nan
4 0 nan
5 1 nan
How can I get the data frame I need? Thanks in advance.
You can use pd.merge(), setting the parameters left_index=True and right_index=True:
import pandas as pd
df1 = pd.DataFrame({'predicted': [1,1,0,0,1]}, index = (1,2,3,4,5))
df2 = pd.DataFrame({'actual': [1,0]}, index = (2,4))
pd.merge(df1, df2, how = 'left', left_index=True, right_index=True)
This merges the two dataframes on their indexes and produces the intended result:
index predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
If you make sure that your index column is actually the df.index, pd.concat should work:
import pandas as pd
left = pd.DataFrame({"predicted": [1, 1, 0, 0, 1]}, index=[1, 2, 3, 4, 5])
right = pd.DataFrame({"actual": [1, 0]}, index=[2, 4])
out = pd.concat([left, right], axis=1)
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
If they're just columns, such as the following:
left = left.reset_index(names="index")
right = right.reset_index(names="index")
then you can use:
left.merge(right, on="index", how="left")
index predicted actual
0 1 1 NaN
1 2 1 1.0
2 3 0 NaN
3 4 0 0.0
4 5 1 NaN
Create the index as a temporary column, left join using that column, then set it back as the index:
predict_df = pd.DataFrame({'predicted': [1,1,0,0,1]}, index=range(1,6))
actual_df = pd.DataFrame({'actual': [1,0]}, index=[2,4])
pd.merge(
    left=predict_df.reset_index(),
    right=actual_df.reset_index(),
    how='left',
    on='index'
).set_index('index')
predicted actual
index
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN
Or simply use join, which aligns on the index and defaults to a left join:
df1.join(df2)
Output:
predicted actual
1 1 NaN
2 1 1.0
3 0 NaN
4 0 0.0
5 1 NaN

Python Pandas creating column on condition with dynamic number of columns

I create a new dataframe df_new based on a user parameter, say a = 2, so my dataframe df shrinks to 4 (a×2) columns. For example:
df_new = pd.DataFrame(data = {'col_01_01': [float('nan'),float('nan'),1,2,float('nan')], 'col_02_01': [float('nan'),float('nan'),1,2,float('nan')],'col_01_02': [0,0,0,0,1],'col_02_02': [1,0,0,1,1],'output':[1,0,1,1,1]})
To be more precise about the output column, let's look at the first row, (nan, nan, 0, 1): apply notna() to the first two entries and the comparison == 1 to the third and fourth entries. This gives (False, False, False, True); combining these with an OR expression yields the desired result True -> 1.
In the second row we find (nan, nan, 0, 0), so the output is 0, since there is no valid value in the first two columns and no 1 in the last two.
For a parameter a = 3 we would find 6 columns.
The result looks like this:
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
You can use vectorised operations with notnull and eq:
null_cols = ['col_01_01', 'col_02_01']
int_cols = ['col_01_02', 'col_02_02']
df['output'] = (df[null_cols].notnull().any(axis=1) | df[int_cols].eq(1).any(axis=1)).astype(int)
print(df)
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
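Since the number of column pairs depends on the user parameter a, the two lists can also be built programmatically rather than hard-coded (a sketch, assuming the col_XX_01/col_XX_02 naming pattern from the question):
a = 2  # user parameter from the question
null_cols = [f'col_{i:02d}_01' for i in range(1, a + 1)]  # ['col_01_01', 'col_02_01']
int_cols = [f'col_{i:02d}_02' for i in range(1, a + 1)]   # ['col_01_02', 'col_02_02']
df['output'] = (df[null_cols].notnull().any(axis=1) | df[int_cols].eq(1).any(axis=1)).astype(int)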

Merge two dataframes in pandas

I have a logic problem I can't solve. I have two dataframes.
Dataframe_one has the following columns:
[Id, workflowprofile_A, workflow_profile_B, important_string_info]
Dataframe_two has the following columns:
[workflowprofile, option, workflow]
My problem is that workflowprofile from Dataframe_two can match workflowprofile_A or workflow_profile_B (or both) from Dataframe_one. How can I get a merged dataframe whose columns look like this:
dataframe_three:
[Id, workflowprofile_A,workflowprofile_fromA, option_fromA, workflow_fromA,important_string_info_fromA workflow_profile_B, workflowprofile_fromB, option_fromB, workflow_fromB, important_string_info_fromB]
You can create a new column with fillna or combine_first, because one of the two values is always NaN, and then merge on this column:
df1['workflowprofile'] = df1['workflowprofile_A'].fillna(df1['workflow_profile_B'])
#alternative
#df1['workflowprofile'] = df1['workflowprofile_A'].combine_first(df1['workflow_profile_B'])
df3 = pd.merge(df1, df2, on='workflowprofile')
Sample:
print (df1)
Id workflowprofile_A workflow_profile_B important_string_info
0 1 7.0 NaN 8
1 2 NaN 5.0 1
print (df2)
workflowprofile option workflow
0 7 0 0
1 5 9 0
2 7 0 0
3 4 1 2
df1['workflowprofile'] = df1['workflowprofile_A'].fillna(df1['workflow_profile_B'])
df3 = pd.merge(df1, df2, on='workflowprofile')
print (df3)
Id workflowprofile_A workflow_profile_B important_string_info \
0 1 7.0 NaN 8
1 1 7.0 NaN 8
2 2 NaN 5.0 1
workflowprofile option workflow
0 7 0 0
1 7 0 0
2 5 9 0
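If you need the separate _fromA and _fromB columns from the desired dataframe_three, one option is to merge df2 onto df1 twice, once per profile column, and combine the results (a sketch; the suffixes argument labels the source of each overlapping column):
merged_a = pd.merge(df1, df2, left_on='workflowprofile_A', right_on='workflowprofile', how='left')
merged_b = pd.merge(df1, df2, left_on='workflow_profile_B', right_on='workflowprofile', how='left')
df3 = pd.merge(merged_a, merged_b, on='Id', suffixes=('_fromA', '_fromB'))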

Use Pandas dataframe to add lag feature from MultiIndex Series

I have a MultiIndex Series (3 indices) that looks like this:
Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
...
I also have a dataframe df which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe df that contains the value matching the ID_1 and ID_2 and the Week - 2 from the Series.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This however is very slow and memory hungry and I was wondering what are some better ways to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2:
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
To make the merged median column 0 when there is no match, use fillna to change the NaNs to 0 (this assumes the Target column from saved_groupby has been renamed to 'Median', as in the full example below):
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20, 5)),
                  columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
Week ID_1 ID_2 Target Foo Median
0 3 2 3 4 2 0.0
1 3 3 0 3 4 0.0
2 4 3 0 1 2 0.0
3 3 4 1 1 1 0.0
4 2 4 2 0 3 2.0
5 1 0 1 4 4 0.0
6 2 3 4 0 0 0.0
7 4 0 0 2 3 0.0
8 3 4 3 2 2 0.0
9 2 2 4 0 1 0.0
10 2 0 4 4 2 0.0
11 1 1 3 0 0 0.0
12 0 1 0 2 0 0.0
13 4 0 4 0 3 4.0
14 1 2 1 3 1 0.0
15 3 0 1 3 4 2.0
16 0 4 2 2 4 0.0
17 1 1 4 4 2 0.0
18 4 1 0 3 0 0.0
19 1 0 1 0 0 0.0
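Alternatively, you can keep the original MultiIndex Series and look the lagged keys up directly with reindex, avoiding both reset_index and the merge (a sketch; saved_groupby here is the un-reset groupby median from the question):
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
lag_keys = pd.MultiIndex.from_arrays([df['Week'] - 2, df['ID_1'], df['ID_2']])
df['Median'] = saved_groupby.reindex(lag_keys).fillna(0).to_numpy()  # missing keys become 0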
