Compare rows in two columns and write a result conditionally - python

I've been searching for quite a while but not getting anywhere close to what I want to do...
I have a pandas dataframe in which I want to compare the value of column A to B and write a 1 or 0 in a new column if A and B are equal.
I could write an ugly for loop, but I know this is not very Pythonic.
I'm pretty sure there is a way to do this with apply() but I'm not getting anywhere.
I'd like to be able to compare columns that contain integers as well as columns containing strings.
Thanks in advance for your help.

If df is a Pandas DataFrame, then
df['newcol'] = (df['A'] == df['B']).astype('int')
For example,
In [20]: df = pd.DataFrame({'A': [1,2,'foo'], 'B': [1,99,'foo']})
In [21]: df
Out[21]:
     A    B
0    1    1
1    2   99
2  foo  foo
In [22]: df['newcol'] = (df['A'] == df['B']).astype('int')
In [23]: df
Out[23]:
     A    B  newcol
0    1    1       1
1    2   99       0
2  foo  foo       1
df['A'] == df['B'] returns a boolean Series:
In [24]: df['A'] == df['B']
Out[24]:
0 True
1 False
2 True
dtype: bool
astype('int') converts the True/False values to integers -- 0 for False and 1 for True.
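For completeness, the same result can be had by spelling out both outcomes with numpy.where; a minimal runnable sketch (an alternative to the astype approach, not part of the answer above):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 'foo'], 'B': [1, 99, 'foo']})

# np.where picks 1 where the condition holds and 0 elsewhere;
# the comparison works for integer and string columns alike
df['newcol'] = np.where(df['A'] == df['B'], 1, 0)
print(df)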

Related

How to conditionally change pandas DataFrame values into f-strings?

I have a pandas DataFrame whose values I want to conditionally change into strings without looping over every value.
Example input:
In [1]: df = pd.DataFrame(data = [[1,2], [4,5]], columns = ['a', 'b'])
Out[2]:
   a  b
0  1  2
1  4  5
This is my best attempt, which doesn't work properly:
df['a'] = np.where(df['a'] < 3, f'string-{df["a"]}', df['a'])
In [1]: df
Out[2]:
a b
0 string0 1\n1 4\nName: a, dtype: int64 2
1 4 5
Desired output:
Out[2]:
          a  b
0  string-1  2
1         4  5
I am using np.where() since looping is not feasible due to the size of the actual DataFrame. The actual f-string I am using is also more complex and has two variables that include column names, but the problem is the same.
Are there other ways to conditionally change pandas values into f-strings without looping over each value?
You can use .map() together with an f-string, as follows:
df['a'] = df['a'].map(lambda x: f'string-{x}' if x < 3 else x)
Alternatively, you can also use .loc together with string concatenation, as follows:
df.loc[df['a'] < 3, 'a'] = 'string-' + df['a'].astype(str)
# OR
df['a'] = np.where(df['a'] < 3, 'string-' + df['a'].astype(str), df['a'])
Result:
print(df)
          a  b
0  string-1  2
1         4  5
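One more vectorized variant along the same lines, using Series.where (a minimal sketch, not taken from the answer above):

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [4, 5]], columns=['a', 'b'])

# Build the strings for every row, then keep them only where the
# condition holds; elsewhere fall back to the original values
df['a'] = ('string-' + df['a'].astype(str)).where(df['a'] < 3, df['a'])
print(df)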

Python Pandas make calculation in single cell

I have a TYPE column
and a VOLUME column
What I'm looking to do is first check whether the TYPE column == 'var1'.
If so, I would like to make a calculation in the VOLUME column.
So far I have something like this:
data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 2
data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 4
This seems to set every row that meets the condition to the last calculation, so I end up with just two distinct values.
Out:
4
4
4
4
8
8
8
Another option:
data['VOLUME'] = data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 2
This works for the first condition but shows NaN for the second condition.
Then when I run:
data['VOLUME'] = data.loc[data['TYPE'] == 'var2', ['VOLUME']] * 4
the whole column shows as NaN.
Consider a simple example which demonstrates what is happening.
df = pd.DataFrame({'A': [1, 2, 3]})
df
A
0 1
1 2
2 3
Now, only values below 2 in column "A" are to be modified. So, try something like
df.loc[df.A < 2, 'A'] * 2
0 2
Name: A, dtype: int64
This series only has 1 row at index 0. If you try assigning this back, the implicit assumption is that the other index values are to be reset to NaN.
df.assign(A=df.loc[df.A < 2, 'A'] * 2)
A
0 2.0
1 NaN
2 NaN
What we want to do is to modify only the rows we're interested in. This is best done with the in-place modification arithmetic operator *=:
df.loc[df.A < 2, 'A'] *= 2
In your case, it is
data.loc[data['TYPE'] == 'var1', 'VOLUME'] *= 2
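Putting it together for the question's two conditions, a minimal runnable sketch (the TYPE/VOLUME values here are hypothetical, since the original data isn't shown):

import pandas as pd

data = pd.DataFrame({'TYPE': ['var1', 'var2', 'var1', 'var2'],
                     'VOLUME': [2, 2, 3, 4]})

# Each *= touches only the rows matching its own condition;
# all other rows keep their current values
data.loc[data['TYPE'] == 'var1', 'VOLUME'] *= 2
data.loc[data['TYPE'] == 'var2', 'VOLUME'] *= 4
print(data)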
You are really close. The problem is in how you are storing the result. This should work:
data.loc[data['TYPE'] == 'var1', ['VOLUME']] = data['VOLUME'] * 2
You can use *= with loc:
In [11]: df = pd.DataFrame([[1], [2]], columns=["A"])
In [12]: df
Out[12]:
A
0 1
1 2
In [13]: df.loc[df.A == 1, "A"] *= 3
In [14]: df
Out[14]:
A
0 3
1 2

Matching of columns between two pandas DataFrames

import pandas as pd
import numpy as np

temp1 = pd.DataFrame(index=np.arange(10), columns=['a', 'b'])
temp1['a'] = [1, 2, 2, 3, 3, 4, 4, 4, 9, 11]
temp1['b'] = 'B'
temp2 = pd.DataFrame(index=np.arange(10), columns=['a', 'b'])
temp2['a'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
temp2['b'] = 'B'
Given the script above, I want to pick up the rows from temp1 whose column a values are not seen in temp2. In R I could do this easily with %in%; how can I do it in pandas?
update 01
the output should be the one row whose column a is 11 and column b is B
You can use isin to perform boolean indexing. isin produces a boolean mask:
In [95]:
temp1.a.isin(temp2.a)
Out[95]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: a, dtype: bool
This can then be used as a mask in the final output:
In [94]:
# note the ~ this negates the result so equivalent of NOT
temp1[~temp1.a.isin(temp2.a)]
Out[94]:
a b
9 11 B
You can use isin to get the indices that are seen, and then negate the boolean indices:
temp1[~temp1.a.isin(temp2.a)]
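For reference, a left merge with indicator=True is one more way to flag the rows of temp1 whose a values never appear in temp2 (a sketch, not from the answers above):

import pandas as pd

temp1 = pd.DataFrame({'a': [1, 2, 2, 3, 3, 4, 4, 4, 9, 11], 'b': 'B'})
temp2 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'b': 'B'})

# indicator=True adds a '_merge' column; 'left_only' marks the rows
# of temp1 that found no match in temp2
merged = temp1.merge(temp2[['a']].drop_duplicates(), on='a',
                     how='left', indicator=True)
print(merged.loc[merged['_merge'] == 'left_only', ['a', 'b']])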

How to check whether a pandas DataFrame is empty?

How do you check whether a pandas DataFrame is empty? In my case I want to print a message in the terminal if the DataFrame is empty.
You can use the attribute df.empty to check whether it's empty or not:
if df.empty:
    print('DataFrame is empty!')
Source: Pandas Documentation
I use the len function. It's much faster than empty. len(df.index) is even faster.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def empty(df):
    return df.empty

def lenz(df):
    return len(df) == 0

def lenzi(df):
    return len(df.index) == 0
'''
%timeit empty(df)
%timeit lenz(df)
%timeit lenzi(df)
10000 loops, best of 3: 13.9 µs per loop
100000 loops, best of 3: 2.34 µs per loop
1000000 loops, best of 3: 695 ns per loop
len on index seems to be faster
'''
To see if a dataframe is empty, I argue that one should test for the length of a dataframe's columns index:
if len(df.columns) == 0: ...
Reason:
According to the Pandas Reference API, there is a distinction between:
an empty dataframe with 0 rows and 0 columns
an empty dataframe with rows containing NaN, hence at least 1 column
Arguably, they are not the same. The other answers are imprecise in that df.empty, len(df), and len(df.index) make no distinction: they return a length of 0 and empty == True in both cases.
Examples
Example 1: An empty dataframe with 0 rows and 0 columns
In [1]: import pandas as pd
df1 = pd.DataFrame()
df1
Out[1]: Empty DataFrame
Columns: []
Index: []
In [2]: len(df1.index) # or len(df1)
Out[2]: 0
In [3]: df1.empty
Out[3]: True
Example 2: A dataframe which is emptied to 0 rows but still retains n columns
In [4]: df2 = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df2
Out[4]:
   AA  BB
0   1  11
1   2  22
2   3  33
In [5]: df2 = df2[df2['AA'] == 5]
df2
Out[5]: Empty DataFrame
Columns: [AA, BB]
Index: []
In [6]: len(df2.index) # or len(df2)
Out[6]: 0
In [7]: df2.empty
Out[7]: True
Now, building on the previous examples, in both of which the index length is 0 and empty is True: reading the length of the columns index for the first loaded dataframe df1 returns 0 columns, proving that it is indeed empty.
In [8]: len(df1.columns)
Out[8]: 0
In [9]: len(df2.columns)
Out[9]: 2
Critically, while the second dataframe df2 contains no data, it is not completely empty, because it still reports the number of empty columns that persist.
Why it matters
Let's add a new column to these dataframes to understand the implications:
# As expected, the empty column displays 1 series
In [10]: df1['CC'] = [111, 222, 333]
df1
Out[10]:
    CC
0  111
1  222
2  333
In [11]: len(df1.columns)
Out[11]: 1
# Note the persisting series with rows containing `NaN` values in df2
In [12]: df2['CC'] = [111, 222, 333]
df2
Out[12]:
    AA  BB   CC
0  NaN NaN  111
1  NaN NaN  222
2  NaN NaN  333
In [13]: len(df2.columns)
Out[13]: 3
It is evident that the original columns in df2 have re-surfaced. Therefore, it is prudent to instead read the length of the columns index with len(df.columns) to see whether a dataframe is empty.
Practical solution
# New dataframe df
In [1]: df = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df
Out[1]:
   AA  BB
0   1  11
1   2  22
2   3  33
# This data manipulation approach results in an empty df
# because of a subset of values that are not available (`NaN`)
In [2]: df = df[df['AA'] == 5]
df
Out[2]: Empty DataFrame
Columns: [AA, BB]
Index: []
# NOTE: the df is empty, BUT the columns are persistent
In [3]: len(df.columns)
Out[3]: 2
# And accordingly, the other answers on this page
In [4]: len(df.index) # or len(df)
Out[4]: 0
In [5]: df.empty
Out[5]: True
# SOLUTION: conditionally check for empty columns
In [6]: if len(df.columns) != 0:  # <--- here
            # Do something, e.g.
            # drop any columns containing rows with `NaN`
            # to make the df really empty
            df = df.dropna(how='all', axis=1)
        df
Out[6]: Empty DataFrame
Columns: []
Index: []
# Testing shows it is indeed empty now
In [7]: len(df.columns)
Out[7]: 0
Adding a new data series works as expected, without the re-surfacing of empty columns (that is, without any series that contained rows with only NaN):
In [8]: df['CC'] = [111, 222, 333]
df
Out[8]:
    CC
0  111
1  222
2  333
In [9]: len(df.columns)
Out[9]: 1
I prefer going the long route. These are the checks I follow to avoid using a try-except clause:
check if the variable is not None
then check if it's a DataFrame, and
make sure it's not empty
Here, DATA is the suspect variable -
DATA is not None and isinstance(DATA, pd.DataFrame) and not DATA.empty
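Wrapped into a small reusable helper (the name has_data is mine, not from the answer):

import pandas as pd

def has_data(obj):
    # True only for a real, non-empty DataFrame
    return obj is not None and isinstance(obj, pd.DataFrame) and not obj.empty

print(has_data(None))                       # False
print(has_data(pd.DataFrame()))             # False
print(has_data(pd.DataFrame({'A': [1]})))   # True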
If a DataFrame has both NaN and non-null values and you want to find whether the DataFrame
is empty or not, then try this code.
When can this situation happen?
This situation happens when a single function is used to plot more than one DataFrame,
each passed as a parameter. In such a situation the function tries to plot the data even
when a DataFrame is empty, and thus plots an empty figure!
It makes more sense to simply display a 'DataFrame has no data' message.
Why?
If a DataFrame is empty (i.e. contains no data at all; mind you, a DataFrame with only NaN
values is still considered non-empty), then it is desirable not to plot, but to put out a message:
Suppose we have two DataFrames df1 and df2.
The function myfunc takes any DataFrame (df1 and df2 in this case) and prints a message
if the DataFrame is empty (instead of plotting):
df1                     df2
   col1  col2              col1  col2
0   NaN     2           0   NaN   NaN
1     2   NaN           1   NaN   NaN
and the function:
def myfunc(df):
    if df.count().sum() > 0:  # count the total number of non-NaN values; 0 if the DataFrame is empty
        print('not empty')
        df.plot(kind='barh')
    else:
        # display a message instead of plotting if it is empty
        print('empty')
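For reference, constructing the two frames from the example above and calling the function (a sketch; the plot call assumes a working matplotlib backend):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [np.nan, 2], 'col2': [2, np.nan]})
df2 = pd.DataFrame({'col1': [np.nan, np.nan], 'col2': [np.nan, np.nan]})

myfunc(df1)  # prints 'not empty' and plots, since df1.count().sum() == 2
myfunc(df2)  # prints 'empty', since df2 holds only NaN values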

Pandas DataFrames with NaNs equality comparison

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:
ipdb> expect
                             1    2
2012-01-01 00:00:00+00:00  NaN    3
2013-05-14 12:00:00+00:00    3  NaN
ipdb> df
identifier                   1    2
timestamp
2012-01-01 00:00:00+00:00  NaN    3
2013-05-14 12:00:00+00:00    3  NaN
ipdb> df[1][0]
nan
ipdb> df[1][0], expect[1][0]
(nan, nan)
ipdb> df[1][0] == expect[1][0]
False
ipdb> df[1][1] == expect[1][1]
True
ipdb> type(df[1][0])
<type 'numpy.float64'>
ipdb> type(expect[1][0])
<type 'numpy.float64'>
ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])
ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False
Given that I'm trying to test the entire of expect against the entire of df, including NaN positions, what am I doing wrong?
What is the simplest way to compare equality of Series/DataFrames including NaNs?
You can use assert_frame_equal with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:
In [11]: from pandas.testing import assert_frame_equal
In [12]: assert_frame_equal(df, expected, check_names=False)
You can wrap this in a function with something like:
try:
    assert_frame_equal(df, expected, check_names=False)
    return True
except AssertionError:
    return False
In more recent pandas this functionality has been added as .equals:
df.equals(expected)
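A quick sanity check that .equals treats NaNs in matching positions as equal, which is exactly what the element-wise == comparison does not (a minimal sketch, not from the answer above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan]})
expected = pd.DataFrame({'A': [1.0, np.nan]})

print((df == expected).values.all())  # False: NaN == NaN is False element-wise
print(df.equals(expected))            # True: aligned NaNs count as equal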
One of the properties of NaN is that NaN != NaN is True.
Check out this answer for a nice way to do this using numexpr.
(a == b) | ((a != a) & (b != b))
says this (in pseudocode):
a == b or (isnan(a) and isnan(b))
So, either a equals b, or both a and b are NaN.
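Spelled out in runnable form, using a toy frame (a minimal sketch of the pseudocode above):

import numpy as np
import pandas as pd

a = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])
b = a.copy()

# True where the values are equal, or where both sides are NaN
# (x != x is only True for NaN, which is how the NaNs are detected)
mask = (a == b) | ((a != a) & (b != b))
print(mask.values.all())  # True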
If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.
In [1]: df = DataFrame(rand(int(1e7), 15))
In [2]: df = df[df > 0.5]
In [3]: df2 = df.copy()
In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)
In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop
timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:
In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
Like @PhillipCloud's answer, but more written out
In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
In [27]: df2 = df1.copy()
They really are equivalent
In [28]: result = df1 == df2
In [29]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [30]: result
Out[30]:
0 1
0 True True
1 True True
A nan in df2 that doesn't exist in df1
In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
In [32]: result = df1 == df2
In [33]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [34]: result
Out[34]:
0 1
0 True True
1 False True
You can also fill with a value you know not to be in the frame
In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]:
0 1
0 True True
1 True True
Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.
Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.
To be safe, do the following:
Example a) Compare two dataframes with NaN values
bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()
Example b) Filter rows in df1 that do not match with df2
bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]
(Note: using == instead of & here, i.e. bools[pd.isnull(df1) == pd.isnull(df2)] = False, is wrong.)
df.fillna(0) == df2.fillna(0)
You can use fillna(). Documentation here.
from pandas import DataFrame
# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df
# comparison fails!
print(df == df2)
# all is well
print(df.fillna(0) == df2.fillna(0))
