Filter pandas df by boolean series - python

I have a dataframe foo and a True/False series bar:
import pandas as pd

foo = pd.DataFrame([['a', 1], ['b', 2], ['a', 3]],
                   index=[0, 1, 2], columns=['col1', 'col2'])
bar = pd.Series({'a': True, 'b': False})
I want to filter foo on col1 based on the truthiness of bar. Here are some approaches that work:
foo[foo['col1'].isin(bar.where(bar == True).dropna().index)]
foo[foo['col1'].isin([k for k, v in bar.to_dict().items() if v])]
# desired result
col1 col2
0 a 1
2 a 3
However, I think both approaches are a bit messy and not very intuitive to read. I was wondering whether I am missing any basic pandas filtering concepts that would allow for a simpler approach.

Use Series.map and index with the result:
foo[foo.col1.map(bar)]
col1 col2
0 a 1
2 a 3
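A caveat worth noting (my addition, not part of the answer above): if col1 contains a value that is missing from bar, map returns NaN for that row, and a mask containing NaN cannot be used for boolean indexing. A minimal sketch of one way to handle this, treating unmapped values as False (the extra 'c' row is hypothetical, just for illustration):
foo2 = pd.DataFrame([['a', 1], ['b', 2], ['a', 3], ['c', 4]], columns=['col1', 'col2'])
# 'c' has no entry in bar, so map() yields NaN for that row;
# fillna(False) + astype(bool) drops unmapped rows instead of raising
mask = foo2['col1'].map(bar).fillna(False).astype(bool)
foo2[mask]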

Related

Identify the columns which contain zero and output their locations

Suppose I have a dataframe where some columns contain a zero value as one of their elements (or potentially more than one zero). I don't specifically want to retrieve these columns or discard them (I know how to do that) - I just want to locate them. For instance: if there are zeros somewhere in the 4th, 6th and the 23rd columns, I want a list with the output [4, 6, 23].
You could iterate over the columns, checking whether 0 occurs in each column's values:
[i for i, c in enumerate(df.columns) if 0 in df[c].values]
Use any() for the fastest, vectorized approach.
For instance,
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0, 100, 200],
                   'col3': ['a', 'b', 'c']})
Then,
>>> s = df.eq(0).any()
>>> s
col1    False
col2     True
col3    False
dtype: bool
From here, it's easy to get the indexes. For example,
>>> s[s].tolist()
['col2']
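If the positional indices asked for in the question are wanted rather than the column names, one option (my addition, a sketch reusing df and s from above) is to translate the names with Index.get_loc, which gives zero-based positions:
>>> [df.columns.get_loc(c) for c in s[s].index]
[1]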
There are many ways to retrieve the indexes from a pd.Series of booleans.
Here is an approach that leverages a couple of lambda functions:
import numpy as np
import pandas as pd

d = {'a': np.random.randint(10, size=100),
     'b': np.random.randint(1, 10, size=100),
     'c': np.random.randint(10, size=100),
     'd': np.random.randint(1, 10, size=100)}
df = pd.DataFrame(d)
df.apply(lambda x: (x == 0).any()).reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Another idea based on @rafaelc's slick answer (but returning relative locations of the columns instead of column names):
df.eq(0).any().reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Or with the column names instead of locations:
df.apply(lambda x: (x==0).any())[lambda x: x].index.to_list()
['a', 'c']
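For completeness (my addition, not part of the answer above), NumPy's flatnonzero gives the same positional result without the reset_index trick; a small sketch assuming the df built above:
# zero-based positions of the columns whose "contains a zero" flag is True
np.flatnonzero(df.eq(0).any().to_numpy()).tolist()
[0, 2]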

Combine 2 dataframes when the dataframes have different sizes

I have 2 dataframes:
df1 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df2 = {'col_1': [3, 2, 1, 3]}
I want the result as follows
df3 = {'col_1': [3, 2, 1, 3], 'col_2': ['a', 'b', 'c', 'a']}
col_2 of the new df should take the value from df1's col_2 that corresponds to the matching col_1 value.
Add the new column by mapping the values from df1 after setting its first column as index:
df3 = df2.copy()
df3['col_2'] = df2['col_1'].map(df1.set_index('col_1')['col_2'])
output:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
You can do it with merge after converting the dicts to df with pd.DataFrame():
output = pd.DataFrame(df2)
output = output.merge(pd.DataFrame(df1), on='col_1', how='left')
Or in a one-liner:
output = pd.DataFrame(df2).merge(pd.DataFrame(df1), on='col_1', how='left')
Outputs:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
This could be a simple way of doing it.
# use df1 to create a lookup dictionary
lookup = df1.set_index("col_1").to_dict()["col_2"]
# look up each value from df2's "col_1" in the lookup dict
df2["col_2"] = df2["col_1"].apply(lambda d: lookup[d])

Pandas: how to write a groupby plus an aggregation that can group by one or many columns?

How can I use this groupby plus aggregation operation in such a way that it can flexibly handle one or more groupby columns?
# some data
df = pd.DataFrame({'col1': [1, 5, 1, 2, 2, 2], 'col2': [2, 2, 2, 3, 3, 3],
                   'col3': [999, 999, 999, 999, 999, 999],
                   'time': ['2020-01-25 12:24:33', '2020-01-25 14:24:33', '2020-01-25 18:24:33',
                            '2020-01-25 09:24:33', '2020-01-25 10:24:33', '2020-01-25 11:24:33']})
# convert time
df['time'] = pd.to_datetime(df['time'])
# groupby with one col, works
df.groupby(['col1', df['time'].dt.floor('d')]).tail(1)
# how to use this structure while being flexibly able to group by one or more cols?
two_cols = ['col1', 'col2']
df.groupby([two_cols, df['time'].dt.floor('d')]).tail(1)
The expected output is the same for both operations:
col1 col2 col3 time
5 2 999 2020-01-25 14:24:33
1 2 999 2020-01-25 18:24:33
2 3 999 2020-01-25 11:24:33
Pandas is looking for a list of labels for the groupby() function, so we need to make sure that we give it a list. I believe this works.
df.groupby(two_cols + [df['time'].dt.floor('d')]).tail(1)
You can see that the parameter passed to groupby() is our list two_cols + another list (in the []) that contains just the df['time']... series. Thus, we are combining two lists into a new list object, and that is what groupby() will run on.
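If the same call should work whether one column name or several are supplied, a small wrapper can normalise the argument to a list first (my addition, a sketch built on the answer above using the df from the question):
def group_tail(df, cols, n=1):
    # accept a single column name or a list of names
    cols = [cols] if isinstance(cols, str) else list(cols)
    return df.groupby(cols + [df['time'].dt.floor('d')]).tail(n)

group_tail(df, 'col1')             # one grouping column
group_tail(df, ['col1', 'col2'])   # several grouping columns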

How to list highest correlation pairs (one spec. column with all others) in pandas?

To find all top correlations you can use the following code, according to List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?:
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [7,3]}
df = pd.DataFrame(data=d)
df.corr().unstack().sort_values().drop_duplicates()
How do I have to change the above line in order to compare just one specific column with all others?
I do not want to compare col2 to col3. Just the correlation of col1 to col2 and col1 to col3 is important to me.
You can first compute the full correlation matrix just using df.corr().
After that you can select the row of the correlation matrix returned by df.corr() that you are interested in.
Say you are interested in the correlation between col1 and the others:
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [7,3]}
df = pd.DataFrame(data=d)
df.corr().loc['col1']
# col1 1.0
# col2 1.0
# col3 -1.0
# Name: col1, dtype: float64
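If, in addition, the self-correlation should be dropped and the remaining pairs sorted by strength, a small extension (my addition, not part of the answer above) could look like this:
# correlations of col1 with all other columns, self-correlation removed,
# sorted by absolute value (Series.sort_values(key=...) needs pandas >= 1.1)
df.corr()['col1'].drop('col1').sort_values(key=abs, ascending=False)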

How to compare two dataframes ignoring column names?

Suppose I want to compare the content of two dataframes, but not the column names (or index names). Is it possible to achieve this without renaming the columns?
For example:
df = pd.DataFrame({'A': [1,2], 'B':[3,4]})
df_equal = pd.DataFrame({'a': [1,2], 'b':[3,4]})
df_diff = pd.DataFrame({'A': [1,2], 'B':[3,5]})
In this case, df equals df_equal but differs from df_diff, because the values in df_equal have the same content as df, while the ones in df_diff do not. Notice that the column names in df_equal are different, but I still want to get a true value.
I have tried the following:
equals:
# Returns false because of the column names
df.equals(df_equal)
eq:
# doesn't work as it compares four columns (A, B, a, b), assuming nulls for the ones that don't exist
df.eq(df_equal).all().all()
pandas.testing.assert_frame_equal:
# same as equals
pd.testing.assert_frame_equal(df, df_equal, check_names=False)
I thought that it was going to be possible to use the assert_frame_equal, but none of the parameters seem to work to ignore column names.
pd.DataFrame is built around pd.Series, so it's unlikely you will be able to perform comparisons without column names.
But the most efficient way would be to drop down to numpy:
assert_equal = (df.values == df_equal.values).all()
To deal with np.nan, you can use np.testing.assert_equal and catch AssertionError, as suggested by @Avaris:
import numpy as np
def nan_equal(a, b):
    try:
        np.testing.assert_equal(a, b)
    except AssertionError:
        return False
    return True
assert_equal = nan_equal(df.values, df_equal.values)
I just needed to get the values (numpy array) from the data frame, so the column names won't be considered.
df.eq(df_equal.values).all().all()
I would still like to see a parameter on equals, or assert_frame_equal. Maybe I am missing something.
An advantage of this compared to @jpp's answer is that I can see which columns do not match by calling all() only once:
df.eq(df_diff.values).all()
Out[24]:
A True
B False
dtype: bool
One problem is that with eq, np.nan is not equal to np.nan; in that case the following expression would serve well:
(df.eq(df_equal.values) | (df.isnull().values & df_equal.isnull().values)).all().all()
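Wrapped up as a small helper (my addition, a sketch wrapping the expression above):
def frames_equal_ignore_names(left, right):
    # element-wise equality that treats NaN == NaN as equal and ignores column names
    if left.shape != right.shape:
        return False
    same = left.eq(right.values) | (left.isnull().values & right.isnull().values)
    return bool(same.all().all())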
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        print(df1.iloc[i, j] == df2.iloc[i, j])
Will return:
True
True
True
True
Same thing for:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
One obvious issue is that column names matter in pandas, because it may sort the dataframe columns by name. For example:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'B': [3, 4]})
print(df1)
print(df2)
renders as ('B' is before 'a' in df2):
a b
0 1 3
1 2 4
B a
0 3 1
1 4 2
