Index of all columns whose particular rows satisfy a given condition - pandas / python

Suppose we have a dataframe of N rows and m columns. For each column, I want to find the first index for which a condition is satisfied.
df = pd.DataFrame(np.random.random((50,5)), columns=['A', 'B', 'C', 'D', 'E'])
For each column, I can do
df[df['A']<0.5].index[0]
to find the first row for that column. I was wondering if there is a way to get it for all columns without a for loop?

You can use numpy.argmax:
>>> (df.to_numpy() < 0.5).argmax(axis=0)
array([0, 0, 1, 0, 1])
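One caveat: argmax returns 0 both when the first row already satisfies the condition and when no row does, so columns that never meet the condition need a separate check. A small sketch of mapping the positions back to index labels and masking such columns (it reuses the random df and the < 0.5 condition from the question):
mask = df.to_numpy() < 0.5
pos = mask.argmax(axis=0)                           # position of the first True per column
first_idx = pd.Series(df.index[pos], index=df.columns)
first_idx[~mask.any(axis=0)] = None                 # columns where the condition never holds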

Related

Identify the columns which contain zero and output their locations

Suppose I have a dataframe where some columns contain a zero value as one of their elements (or potentially more than one zero). I don't specifically want to retrieve these columns or discard them (I know how to do that) - I just want to locate them. For instance: if there are zeros somewhere in the 4th, 6th and 23rd columns, I want a list with the output [4, 6, 23].
You could iterate over the columns, checking whether 0 occurs in each column's values:
[i for i, c in enumerate(df.columns) if 0 in df[c].values]
Use any() for the fastest, vectorized approach.
For instance,
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0, 100, 200],
                   'col3': ['a', 'b', 'c']})
Then,
>>> s = df.eq(0).any()
col1    False
col2     True
col3    False
dtype: bool
From here, it's easy to get the indexes. For example,
>>> s[s].index.tolist()
['col2']
There are many ways to retrieve the indexes from a pd.Series of booleans.
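For instance, a couple of equivalent options for the boolean Series s above; the first returns column labels, the second positional locations:
>>> s.index[s].tolist()
['col2']
>>> np.flatnonzero(s).tolist()
[1]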
Here is an approach that leverages a couple of lambda functions:
d = {'a': np.random.randint(10, size=100),
     'b': np.random.randint(1, 10, size=100),
     'c': np.random.randint(10, size=100),
     'd': np.random.randint(1, 10, size=100)}
df = pd.DataFrame(d)
df.apply(lambda x: (x==0).any()).reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Another idea based on @rafaelc's slick answer (but returning the relative locations of the columns instead of the column names):
df.eq(0).any().reset_index()[lambda x: x[0]].index.to_list()
[0, 2]
Or with the column names instead of locations:
df.apply(lambda x: (x==0).any())[lambda x: x].index.to_list()
['a', 'c']

How to index a dataframe using a condition on a column that is a column of numpy arrays?

I currently have a pandas dataframe that has a column of values that are numpy arrays. I am trying to get the rows of the dataframe where the value of the column is an empty numpy array but I can't index using the pandas method.
Here is an example dataframe.
data = {'Name': ['A', 'B', 'C', 'D'], 'stats': [np.array([1,1,1]), np.array([]), np.array([2,2,2]), np.array([])]}
df = pd.DataFrame(data)
I am trying to just get the rows where 'stats' is None, but when I try df[df['stats'] is None] I just get a KeyError: False.
How can I filter by rows that contain an empty list?
Additionally, how can I filter by row where the numpy array is something specific? i.e. get all rows of df where df['stats'] == np.array([1, 1, 1])
Thanks
You can check the length with Series.str.len, because it works with all iterables:
print (df['stats'].str.len())
0    3
1    0
2    3
3    0
Name: stats, dtype: int64
And then filter, e.g. rows with len=0:
df = df[df['stats'].str.len().eq(0)]
#alternative
#df = df[df['stats'].apply(len).eq(0)]
print (df)
  Name stats
1    B    []
3    D    []
If you need to test for a specific array, it is possible to use tuples:
df = df[df['stats'].apply(tuple) == tuple(np.array([1, 1, 1]))]
print (df)
  Name      stats
0    A  [1, 1, 1]
For this question:
"Additionally, how can I filter by row where the numpy array is something specific? i.e. get all rows of df where df['stats'] == np.array([1, 1, 1])"
data = {'Name': ['A', 'B', 'C', 'D'], 'stats': [np.array([1,1,1]), np.array([]), np.array([2,2,2]), np.array([])]}
df = pd.DataFrame(data)
df = df[df['stats'].apply(lambda x: np.array_equal(x, np.array([1,1,1])))]
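For completeness, the same apply-based pattern also answers the empty-array part of the question; a minimal sketch, assuming df is rebuilt from the data dictionary above:
df = pd.DataFrame(data)
empty_rows = df[df['stats'].apply(lambda x: x.size == 0)]   # rows where the array is empty
print(empty_rows)
#   Name stats
# 1    B    []
# 3    D    []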

Subset data by group within pandas dataframe

I need to subset a dataframe using groups and three conditional rules. If within a group all values in the Value column are None, I need to retain the first row for that group. If within a group all values in the Value column are not None, I need to retain all the rows. If within a group some of the values in the Value column are None and others are not, I need to drop all rows where the value is None. Columns Region and ID together define a unique group within the dataframe.
My first approach was to separate the dataframe into two chunks. The first chunk is rows where for a group there are all nulls. The second chunk is everything else. For the chunk of data where rows for a group contained all nulls, I would create a rownumber using a cumulative count of rows by group and query rows where the cumulative count = 1. For the second chunk, I would drop all rows where Value is null. Then I would append the dataframes.
Sample source dataframe
dfInput = pd.DataFrame({
    'Region': [1, 1, 2, 2, 2, 2, 2],
    'ID': ['A', 'A', 'B', 'B', 'B', 'A', 'A'],
    'Value': [0, 1, 1, None, 2, None, None],
})
Desired output dataframe:
dfOutput = pd.DataFrame({
    'Region': [1, 1, 2, 2, 2],
    'ID': ['A', 'A', 'B', 'B', 'A'],
    'Value': [0, 1, 1, 2, None],
})
Just follow your logic using groupby:
dfInput.groupby(['Region','ID']).Value.apply(lambda x : x.head(1) if x.isnull().all() else x.dropna()).\
    reset_index(level=[0,1]).sort_index()
Out[86]:
   Region ID  Value
0       1  A    0.0
1       1  A    1.0
2       2  B    1.0
4       2  B    2.0
5       2  A    NaN
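An equivalent mask-based sketch without apply, assuming the same dfInput as above; transform('count') counts non-null values per group, so a zero count marks groups whose values are all null:
non_null_count = dfInput.groupby(['Region', 'ID'])['Value'].transform('count')
all_null = non_null_count.eq(0)                               # groups whose values are all null
first_in_group = ~dfInput.duplicated(subset=['Region', 'ID']) # first row of each (Region, ID) group
dfOutput = dfInput[dfInput['Value'].notnull() | (all_null & first_in_group)]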

How to find all occurrences for each value in column A that is also in column B

Using Pandas, I'm trying to find the most recent overlapping occurrence of some value in Column A that also happens to be in Column B (though, not necessarily occurring in the same row); This is to be done for all rows in column A.
I've accomplished something close with an n^2 solution (by creating a list of each column and iterating through with a nested for-loop), but I would like to use something faster if possible, as this needs to be implemented in a table with tens of thousands of entries. (So, a vectorized solution would be ideal, but I am mostly looking for the "right" way to do this.)
df['idx'] = range(0, len(df.index))
A = list(df['r_A'])
B = list(df['r_B'])
A_B_Dict = {}
for i in range(0, len(B)-1):
    for j in range(0, len(A)-1):
        if B[i] == A[j]:
            A_search = df.loc[df['r_A'] == A[j]].index
            A_B_Dict[B[i]] = A_search
Given some df like so:
data = [[1, 'A', 'A'],
        [2, 'B', 'D'],
        [3, 'C', 'B'],
        [4, 'D', 'D']]
df = pd.DataFrame(data, columns=['idx', 'A', 'B'])
It should give back something like:
A_B_Dict = {'A': 1, 'B': 3, 'C': None, 'D': 4}
Such that the most recent occurrence (or all occurrences, for that matter) of each value from Column A that also appears in Column B is stored as the value of A_B_Dict, where the key of A_B_Dict is the original value observed in Column A.
IIUC
d=dict(zip(df.B,df.idx))
dict(zip(df.A,df.A.map(d)))
{'A': 1.0, 'B': 3.0, 'C': nan, 'D': 4.0}
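One detail worth noting: dict(zip(df.B, df.idx)) keeps the last idx for repeated values in B, which is exactly the "most recent" occurrence the question asks for. The map introduces floats and NaN for missing keys; a sketch to get plain ints and None instead (it reuses d from above):
result = {k: (None if pd.isna(v) else int(v)) for k, v in zip(df.A, df.A.map(d))}
# {'A': 1, 'B': 3, 'C': None, 'D': 4}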

Count and sort pandas dataframe

I have a dataframe with column 'code' which I have sorted based on frequency.
In order to see what each code means, there is also a column 'note'.
For each count/group of the 'code' column, I display the first note attached to that 'code':
df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
Now my question is, how do I display only those rows that have frequency of e.g. >= 30?
Add a query call before you sort. Also, if you only want those rows EQUALing < insert frequency here >, sort_values isn't needed (right?!).
df.groupby('code')['note'].agg(['count', 'first']).query('count == 30')
If the question is for all groups with AT LEAST < insert frequency here >, then
(
df.groupby('code')
.note.agg(['count', 'first'])
.query('count >= 30')
.sort_values('count', ascending=False)
)
Why do I use query? It's a lot easier to pipe and chain with it.
You can just filter your aggregated result accordingly (here grp is the result of the groupby/agg above):
grp = grp[grp['count'] >= 30]
Example with data
import pandas as pd
df = pd.DataFrame({'code': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
                   'note': ['A', 'B', 'A', 'A', 'C', 'C', 'C', 'A', 'A',
                            'B', 'B', 'C', 'A', 'B']})
res = df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
#       count first
# code
# 2         5     C
# 3         5     B
# 1         4     A
res2 = res[res['count'] >= 5]
#       count first
# code
# 2         5     C
# 3         5     B
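If you prefer to avoid query, the same chain can be written with .loc and a callable; a sketch using the example data above:
res3 = (df.groupby('code')['note']
          .agg(['count', 'first'])
          .loc[lambda x: x['count'] >= 5]
          .sort_values('count', ascending=False))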
