import pandas as pd
temp1 = pd.DataFrame(index=arange(10), columns=['a','b'])
temp1['a'] = [1,2,2,3,3,4,4,4,9,11]
temp1['b'] = 'B'
temp2 = pd.DataFrame(index=arange(10), columns=['a','b'])
temp2['a'] = [1,2,3,4,5,6,7,8,9,10]
temp2['b'] = 'B'
As the script above, I want to pickup rows from temp1 that column a was not seen at temp2. I can use %in% in R to do it easily, how can I do it in pandas?
update 01
the output should be one row which column a is 11 and column b is B
You can use isin to perform boolean indexing:
isin will produce a boolean index:
In [95]:
temp1.a.isin(temp2.a)
Out[95]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
Name: a, dtype: bool
This can then be used as a mask in the final output:
In [94]:
# note the ~ this negates the result so equivalent of NOT
temp1[~temp1.a.isin(temp2.a)]
Out[94]:
a b
9 11 B
You can use isin to get the indices that are seen, and then negate the boolean indices:
temp1[~temp1.a.isin(temp2.a)]
Related
I am trying for a while to solve this problem:
I have a daraframe like this:
import pandas as pd
df=pd.DataFrame(np.array([['A', 2, 3], ['B', 5, 6], ['C', 8, 9]]),columns=['a', 'b', 'c'])
j=[0,2]
But then when i try to select just a part of it filtering by a list of index and a condition on a column I get error...
df[df.loc[j]['a']=='A']
There is somenting wrong, but i don't get what is the problem here. Can you help me?
This is the error message:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
There is filtered DataFrame compared by original, so indices are different, so error is raised.
You need compare filtered DataFrame:
df1 = df.loc[j]
print (df1)
a b c
0 A 2 3
2 C 8 9
out = df1[df1['a']=='A']
print(out)
a b c
0 A 2 3
Your solution is possible use with convert ndices of filtered mask by original indices by Series.reindex:
out = df[(df.loc[j, 'a']=='A').reindex(df.index, fill_value=False)]
print(out)
a b c
0 A 2 3
Or nicer solution:
out = df[(df['a'] == 'A') & (df.index.isin(j))]
print(out)
a b c
0 A 2 3
A boolean array and the dataframe should be the same length. here your df length is 3 but the boolean array df.loc[j]['a']=='A' length is 2
You should do:
>>> df.loc[j][df.loc[j]['a']=='A']
a b c
0 A 2 3
I have two dataframes say df and thresh_df. The shape of df is say 1000*200 and thresh_df is 1*200.
I need to compare the thresh_df row with each row of df element wise respectively and I have to fetch the corresponding column number whose values are less than the values of thresh_df.
I tried the following
compared_df = df.apply(lambda x : np.where(x < thresh_df.values))
But I get an empty dataframe! If question is unclear and need any explanations,please let me know in the comments.
I think apply is not necessary only compare one row DataFrame converted to Series by selecting first row:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
})
thresh_df = pd.DataFrame({
'B':[4],
'C':[7],
'D':[4],
'E':[5],
})
compared_df = df < thresh_df.iloc[0]
print (compared_df)
B C D E
0 False False True False
1 False False True True
2 False False False False
3 False True False False
4 False True True True
5 False True True True
Then use DataFrame.any for filter at least one True per row and filter index values:
idx = df.index[compared_df.any(axis=1)]
print (idx)
Int64Index([0, 1, 3, 4, 5], dtype='int64')
Detail:
print (compared_df.any(axis=1))
0 True
1 True
2 False
3 True
4 True
5 True
dtype: bool
I want to check a big DataFrame for constant columns and make a 2 list. The first for the columnnames with only zeros the second with the columnnames of constant values (excluding 0)
I found a solution (A in code) at Link but I dont understand it. A is making what i want but i dont know how and how i can get the list.
import numpy as np
import pandas as pd
data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
A =df.loc[:, (df != df.iloc[0]).any()]
Use:
m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a = df.columns[m1].tolist()
b = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']
Explanation:
First compare all values by 0:
print (df == 0)
A B C
0 True False False
1 True False False
2 True False False
Then test if all values are Trues by DataFrame.all:
print ((df == 0).all())
A True
B False
C False
dtype: bool
Then compare first values of row by DataFrame.iloc:
print (df == df.iloc[0])
A B C
0 True True True
1 True True False
2 True True False
And test again by all:
print ((df == df.iloc[0]).all())
A True
B True
C False
dtype: bool
because exclude 0 chain inverted first mask by ~ with & for bitwise AND:
print (~m1 & m2)
A False
B True
C False
dtype: bool
This seems like a clean way to do what you want:
m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1
m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))
m1 gives you a boolean of columns that all have zeros:
m1
A True
B False
C False
dtype: bool
m2 gives you all columns with unique values, but not zeros (second condition re-uses the first)
m2
A False
B True
C False
dtype: bool
Deriving your lists is trivial from these masks.
I've been searching for quite a while not not getting anywhere close to what I wanted to do...
I have a pandas dataframe in which I want to compare the value of column A to B and write a 1 or 0 in a new column if A and B are equal.
I could write an ugly for loop but I know this is not very pythony.
I'm pretty sure there is a way to do this with apply() but I'm not getting anywhere.
I'd like to be able to compare columns that contain integers as well as columns containing strings.
Thanks in advance for your help.
If df is a Pandas DataFrame, then
df['newcol'] = (df['A'] == df['B']).astype('int')
For example,
In [20]: df = pd.DataFrame({'A': [1,2,'foo'], 'B': [1,99,'foo']})
In [21]: df
Out[21]:
A B
0 1 1
1 2 99
2 foo foo
In [22]: df['newcol'] = (df['A'] == df['B']).astype('int')
In [23]: df
Out[23]:
A B newcol
0 1 1 1
1 2 99 0
2 foo foo 1
df['A'] == df['B'] returns a boolean Series:
In [24]: df['A'] == df['B']
Out[24]:
0 True
1 False
2 True
dtype: bool
astype('int') converts the True/False values to integers -- 0 for False and 1 for True.
In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:
ipdb> expect
1 2
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df
identifier 1 2
timestamp
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df[1][0]
nan
ipdb> df[1][0], expect[1][0]
(nan, nan)
ipdb> df[1][0] == expect[1][0]
False
ipdb> df[1][1] == expect[1][1]
True
ipdb> type(df[1][0])
<type 'numpy.float64'>
ipdb> type(expect[1][0])
<type 'numpy.float64'>
ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])
ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False
Given that I'm trying to test the entire of expect against the entire of df, including NaN positions, what am I doing wrong?
What is the simplest way to compare equality of Series/DataFrames including NaNs?
You can use assert_frame_equals with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:
In [11]: from pandas.testing import assert_frame_equal
In [12]: assert_frame_equal(df, expected, check_names=False)
You can wrap this in a function with something like:
try:
assert_frame_equal(df, expected, check_names=False)
return True
except AssertionError:
return False
In more recent pandas this functionality has been added as .equals:
df.equals(expected)
One of the properties of NaN is that NaN != NaN is True.
Check out this answer for a nice way to do this using numexpr.
(a == b) | ((a != a) & (b != b))
says this (in pseudocode):
a == b or (isnan(a) and isnan(b))
So, either a equals b, or both a and b are NaN.
If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.
In [1]: df = DataFrame(rand(1e7, 15))
In [2]: df = df[df > 0.5]
In [3]: df2 = df.copy()
In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)
In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop
timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:
In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
Like #PhillipCloud answer, but more written out
In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
In [27]: df2 = df1.copy()
They really are equivalent
In [28]: result = df1 == df2
In [29]: result[pd.isnull(df1) == pd.isnull(df2)] = True
In [30]: result
Out[30]:
0 1
0 True True
1 True True
A nan in df2 that doesn't exist in df1
In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
In [32]: result = df1 == df2
In [33]: result[pd.isnull(df1) == pd.isnull(df2)] = True
In [34]: result
Out[34]:
0 1
0 True True
1 False True
You can also fill with a value you know not to be in the frame
In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]:
0 1
0 True True
1 True True
Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.
Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.
To be safe, do the following:
Example a) Compare two dataframes with NaN values
bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()
Example b) Filter rows in df1 that do not match with df2
bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]
(Note: this is wrong - bools[pd.isnull(df1) == pd.isnull(df2)] = False)
df.fillna(0) == df2.fillna(0)
You can use fillna(). Documenation here.
from pandas import DataFrame
# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df
# comparison fails!
print df == df2
# all is well
print df.fillna(0) == df2.fillna(0)