Compare two pandas dataframes and replace value based on condition - python

I have the following two pandas dataframes:
df1
A B C
0 1 2 1
1 7 3 6
2 3 10 11
df2
A B C
0 2 0 2
1 8 4 7
where A, B and C are the column headings of both dataframes.
I am trying to compare the columns of df1 to the columns of df2 such that the first row in df2 is the lower bound and the second row is the upper bound. Any values in df1 outside the lower and upper bound (column-wise) need to be replaced with NaN.
So in this example the output should be:
A B C
0 NaN 2 NaN
1 7 3 6
2 3 NaN NaN
As a first attempt I tried df1[df1 < df2] = np.nan, but this does not work, since the two frames are not identically labeled and the comparison raises an error. I have also tried .where() without success.
Would appreciate some help here, thanks.
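For reference, a minimal setup reproducing the two frames above (assuming the usual imports):
import numpy as np
import pandas as pd

# df2 holds the per-column bounds: row 0 is the lower bound, row 1 the upper
df1 = pd.DataFrame({'A': [1, 7, 3], 'B': [2, 3, 10], 'C': [1, 6, 11]})
df2 = pd.DataFrame({'A': [2, 8], 'B': [0, 4], 'C': [2, 7]})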

IIUC, keep the values inside the bounds and mask everything else (le on the upper bound so values equal to it count as inside):
df = df1.where(df1.ge(df2.iloc[0]) & df1.le(df2.iloc[1]))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
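Here df2.iloc[0] and df2.iloc[1] are Series indexed by the column names A, B, C, so ge and le align them against each column of df1 and the whole comparison stays vectorized.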

You could do something like:
lower = df1 < df2.iloc[0, :]
upper = df1 > df2.iloc[1, :]
df1[lower | upper] = np.nan
print(df1)
Output
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN

Here is one with DataFrame.clip and mask:
df1.mask(df1.ne(df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
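The idea: clip forces every value into the [lower, upper] range for its column, so any cell that clip changed (detected with ne) must have been out of range and gets masked.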

A slightly different approach using between (in pandas 1.3+ the inclusive argument takes a string):
df1.apply(lambda x: x.where(x.between(*df2.values, inclusive='neither')), axis=1)
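For the frames above this yields the same masked result as the other answers.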

Related

Python Dataframe Duplicated Columns while Merging multiple times

I have a main dataframe and a sub dataframe. I want to merge each column of the sub dataframe into the main dataframe, with the main dataframe column as the reference. I have successfully arrived at my desired answer, except that I see duplicated columns of the main dataframe. Below are my present and expected answers.
Present solution:
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df =
Ref A Ref Z
0 1 NaN 1 1.0
1 2 2.0 2 2.0
2 3 3.0 3 NaN
3 4 NaN 4 NaN
Expected Answer:
df =
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Update
Use duplicated:
>>> df.loc[:, ~df.columns.duplicated()]
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
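For reference, duplicated() marks repeated column labels after their first occurrence, so negating it keeps only the first Ref column:
>>> df.columns.duplicated()
array([False, False,  True, False])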
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to clean up your dataframe
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
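Note that with the duplicated labels, df.pop('Ref') returns a two-column DataFrame (both Ref columns) rather than a Series, which is why .iloc[:, 0] is needed before re-inserting it.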
What about setting the 'Ref' column as the index while building the dataframe list (and then resetting the index so that you get Ref back as a column)?
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df = df.reset_index()
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
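Because each merged frame consumes Ref as its index before the concat, the frames align on it and Ref comes back exactly once after reset_index.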
This is a reduction process. Instead of the list comprehension, use a for-loop, or even functools.reduce:
from functools import reduce
reduce(lambda x, y : x.merge(df1[y],left_on='Ref',right_on=y,how='left'), df1.columns, df)
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
The above is similar to:
for y in df1.columns:
    df = df.merge(df1[y], left_on='Ref', right_on=y, how='left')
df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN

How to append two dataframes with different column names and avoid columns with NaN values

xyarr= [[0,1,2],[1,1,3],[2,1,2]]
df1 = pd.DataFrame(xyarr, columns=['a', 'b','c'])
df2 = pd.DataFrame([['text','text2']], columns=['x','y'])
df3 = pd.concat([df1,df2],axis=0, ignore_index=True)
df3 will have NaN values from the empty columns a, b, c.
a b c x y
0 0.0 1.0 2.0 NaN NaN
1 1.0 1.0 3.0 NaN NaN
2 2.0 1.0 2.0 NaN NaN
3 NaN NaN NaN text text2
I want to save df3 to a csv, but without the extra commas. Any suggestions?
As pd.concat is an outer join by default, you get the NaN values from the empty columns a, b, c. If you use another pandas function, e.g. .join(), which is a left join by default, you can get around the problem here.
You can try using .join(), as follows:
df3 = df1.join(df2)
Result:
print(df3)
a b c x y
0 0 1 2 text text2
1 1 1 3 NaN NaN
2 2 1 2 NaN NaN
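.join() aligns on the index by default, so df2's single row (index 0) attaches to row 0 of df1 and no extra all-NaN row is appended.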

create new pandas column based on if and else rule

I have this dataframe and I want to create column e:
df
a b c d
1 2 1 2
NaN NaN 3 1
NaN NaN NaN 5
4 5 0 2
I want to create a new column based on these criteria:
Take the highest of column a vs column b.
If there is no value in columns a and b, then look at column c.
If there is no value in column c, then look at column d.
df
a b c d e
1 2 1 2 2
NaN NaN 3 1 3
NaN NaN NaN 5 5
4 5 0 2 5
My idea only gets as far as step 2:
def e(x):
    if x['a'] >= x['b']:
        return x['a']
    elif x['a'] <= x['b']:
        return x['b']
    else:
        return x['c']

df['e'] = df.apply(e, axis=1)
IIUC, use pandas.DataFrame.bfill:
df["e"] = df.bfill(axis=1)[["a", "b"]].max(axis=1)
print(df)
Output:
a b c d e
0 1 2 1 2 2.0
1 NaN NaN 3 1 3.0
2 NaN NaN NaN 5 5.0
3 4 5 0 2 5.0
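bfill(axis=1) fills each row's missing values from the columns to its right, so where a and b are NaN they pick up the value from c (or d), and taking the row-wise max of a and b afterwards gives the priority order asked for.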
You can always use np.where():
df['e'] = df['d']
df['e'] = np.where((df['a'].isna()) & (df['b'].isna()) & (df['c'].notnull()), df['c'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['a'] > df['b']), df['a'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['b'] > df['a']), df['b'], df['e'])
df
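One caveat: if a equals b, none of the comparisons fires and e keeps the value from d; if ties should resolve to a, change one of the comparisons to >=.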
First get the maximum of the a and b values and assign it to column a, then back fill missing values along each row and select the first column; this prioritizes c and then d:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 1 3.0
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0
If you want to test only the a, b, c, d columns (and there may possibly be some other columns):
df['e'] = df[['a','b']].max(axis=1).fillna(df.c).fillna(df.d)
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 1 3.0
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0
If the second row is changed to c=3, d=5, the output is:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
a b c d e
0 1.0 2.0 1.0 2 2.0
1 NaN NaN 3.0 5 3.0 <- changed d=5
2 NaN NaN NaN 5 5.0
3 4.0 5.0 0.0 2 5.0

Creating two shifted columns in a grouped pandas DataFrame

I have looked all over and I still can't find an example of how to create two shifted columns in a pandas DataFrame within its groups.
I have done it with one column as follows:
data_frame['previous_category'] = data_frame.groupby('id')['category'].shift()
But I have to do it with 2 columns, shifting one upwards and the other downwards.
Any ideas?
It is possible with a custom function and GroupBy.apply, because one column needs shifting down and the second shifting up:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'F': list('aaabbb')
})

def f(x):
    x['B'] = x['B'].shift()
    x['C'] = x['C'].shift(-1)
    return x

df = df.groupby('F').apply(f)
print (df)
B C F
0 NaN 8.0 a
1 4.0 9.0 a
2 5.0 NaN a
3 NaN 2.0 b
4 5.0 3.0 b
5 5.0 NaN b
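If you prefer to avoid apply, a sketch of the same logic starting from the original df, shifting each column separately within its group:
# shift B down and C up within each 'F' group
df['B'] = df.groupby('F')['B'].shift()
df['C'] = df.groupby('F')['C'].shift(-1)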
If you want to shift both the same way, just specify all the columns in a list:
df[['B','C']] = df.groupby('F')[['B','C']].shift()
print (df)
B C F
0 NaN NaN a
1 4.0 7.0 a
2 5.0 8.0 a
3 NaN NaN b
4 5.0 4.0 b
5 5.0 2.0 b

How to search all data frame rows for values outside a defined range of numbers?

So I have a data frame of 50 columns and 400 rows consisting entirely of numbers. I'm trying to display only the columns that have values falling outside a pre-defined range (i.e. only show values that aren't between -1 and +3).
So far I have:
df[(df.T > 3).all()]
to display values greater than 3; I can change the integer to the other number of interest, but how can I write something to display the numbers that fall outside a range (i.e. display all columns that have values outside the range -1 to +3)?
You can use pd.DataFrame.mask:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(-2, 4, (5, 3)), columns=list('abc'))
print(df)
a b c
0 -2 1 0
1 1 0 0
2 3 1 3
3 0 1 -2
4 0 -2 -2
mask turns the cells where the condition is True into NaN:
df.mask(df.ge(3) | df.le(-1))
a b c
0 NaN 1.0 0.0
1 1.0 0.0 0.0
2 NaN 1.0 NaN
3 0.0 1.0 NaN
4 0.0 NaN NaN
Or the opposite
df.mask(df.lt(3) & df.gt(-1))
a b c
0 -2.0 NaN NaN
1 NaN NaN NaN
2 3.0 NaN 3.0
3 NaN NaN -2.0
4 NaN -2.0 -2.0
You could call stack to stack all columns, use between to generate the mask for the range, invert the mask using ~, and then call dropna(axis=1):
In [193]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[193]:
a b c
0 0.088639 0.275458 0.837952
1 1.395237 -0.582110 0.614160
2 -1.114384 -2.774358 2.119473
3 1.050008 -1.195167 -0.343875
4 -0.006156 -2.028601 -0.071448
In [198]:
df[~df.stack().between(0.1,1).unstack()].dropna(axis=1)
Out[198]:
a
0 0.088639
1 1.395237
2 -1.114384
3 1.050008
4 -0.006156
So here only column 'a' has all of its values outside the range 0.1 to 1.
Prior to the dropna you can see that the other columns don't meet this criterion, so they contain NaN:
In [199]:
df[~df.stack().between(0.1,1).unstack()]
Out[199]:
a b c
0 0.088639 NaN NaN
1 1.395237 -0.582110 NaN
2 -1.114384 -2.774358 2.119473
3 1.050008 -1.195167 -0.343875
4 -0.006156 -2.028601 -0.071448
By default the left and right values are included; if this isn't required, pass inclusive=False to between (in pandas 1.3+, pass the string inclusive='neither' instead).
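An alternative sketch without the stack/unstack round trip, assuming the same 0.1 to 1 range: build the out-of-range mask directly and keep the columns where it holds everywhere:
# keep columns whose values are ALL outside [0.1, 1]
out_of_range = (df < 0.1) | (df > 1)
print(df.loc[:, out_of_range.all()])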
