Compare consecutive columns in a for loop in Pandas - python

I have the following dataframe, where I want to determine whether column A is greater than column B, and whether column B is greater than column C. Whenever a value is smaller than the one in the previous column, I want to replace that value with 0.
import numpy as np
import pandas as pd
d = {'A': [6, 8, 10, 1, 3], 'B': [4, 9, 12, 0, 2], 'C': [3, 14, 11, 4, 9]}
df = pd.DataFrame(data=d)
df
I have tried this with np.where and it works:
df['B'] = np.where(df['A'] > df['B'], 0, df['B'])
df['C'] = np.where(df['B'] > df['C'], 0, df['C'])
However, I have a huge number of columns, and I want to know if there is any way to do this without writing each comparison separately, for example with a for loop.
Thanks

A solution with a different output, because it compares the original columns using DataFrame.diff and sets the values with a negative difference to 0 using DataFrame.mask:
df1 = df.mask(df.diff(axis=1).lt(0), 0)
print (df1)
A B C
0 6 0 0
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
If you instead loop over pairs of consecutive column names built with zip, the output is different, because each comparison uses the already reassigned columns B, C, ...:
for a, b in zip(df.columns, df.columns[1:]):
    df[b] = np.where(df[a] > df[b], 0, df[b])
print (df)
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9

For a vectorized approach, you cannot simply use diff, because the condition depends on whether the previous value was itself replaced by 0. Two consecutive replacements in the same row therefore cannot happen.
For this data, you can achieve a correct vectorized replacement using a shifted mask:
m1 = df.diff(axis=1).lt(0) # value is smaller than the one in the previous column
m2 = ~m1.shift(axis=1, fill_value=False) # and the previous value was not itself replaced
df2 = df.mask(m1&m2, 0)
output:
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
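As a rough cross-check of the two answers above, the loop can be wrapped in a small helper that works on a copy, and its result compared with the shifted-mask version. The helper name `zero_if_below_previous` is made up for this sketch, and the comparison only covers the sample data from the question.

```python
import numpy as np
import pandas as pd

def zero_if_below_previous(frame):
    """Loop version: set a value to 0 when the (already updated) previous column is larger."""
    out = frame.copy()  # leave the caller's frame untouched
    for prev, cur in zip(out.columns, out.columns[1:]):
        out[cur] = np.where(out[prev] > out[cur], 0, out[cur])
    return out

df = pd.DataFrame({'A': [6, 8, 10, 1, 3],
                   'B': [4, 9, 12, 0, 2],
                   'C': [3, 14, 11, 4, 9]})

looped = zero_if_below_previous(df)

# Shifted-mask version from the answer above
m1 = df.diff(axis=1).lt(0)
m2 = ~m1.shift(axis=1, fill_value=False)
masked = df.mask(m1 & m2, 0)

print((looped.to_numpy() == masked.to_numpy()).all())
```

With more than three columns the two can differ. For a strictly decreasing row of four positive values, the loop zeroes the second and fourth values while the shifted mask zeroes only the second, so the loop is the safer reference for many columns.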

Related

Get the column names for 2nd largest value for each row in a Pandas dataframe

Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(row):
    return row.nlargest(2).min()
print(df.apply(second_largest, axis = 1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, that is:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for positions of second largest values:
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should work, but it is slow:
def second_largest(row):
    return row.nlargest(2).idxmin()
print(df.apply(second_largest, axis = 1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas approach (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
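The argsort idea above extends to the n-th largest value per row. The following is a minimal sketch, assuming all columns are numeric and there are no ties within a row. The function name `nth_largest_column` is hypothetical.

```python
import numpy as np
import pandas as pd

def nth_largest_column(frame, n=2):
    """Return, for each row, the label of the column holding the n-th largest value."""
    order = np.argsort(frame.to_numpy(), axis=1)  # column positions in ascending order of value
    return pd.Series(frame.columns.to_numpy()[order[:, -n]], index=frame.index)

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

second = nth_largest_column(df, n=2)
print(second)
```

With n=1 the helper should agree with df.idxmax(axis=1) on this data.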

Delete rows in apply() function or depending on apply() result

Here I have a working solution, but my question focuses on how to do this the Pandas way. I assume Pandas offers better solutions for this.
I use groupby() and then apply() to compare the values in the rows of each group. While doing this, I decide which row to delete.
The rule doesn't matter! In this example the rule is that when values in column A differ only by 1 (the values are "near"), the second one is deleted. How the decision is made is not part of the question. There could also be a list of color names, where I would say that darkblue and marineblue are "near" and one of them should be deleted.
This is the initial data frame.
X A B
0 A 9 0 <--- DELETE
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
7 B 11 7 <--- DELETE
8 B 30 8
Row index 0 should be deleted because its value 9 is near the value 8 in row index 2. The same goes for row index 7: its value 11 is "near" 10 in row index 5.
This is the code:
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame(
    {
        'X': list('AAAAABBBB'),
        'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
        'B': range(9)
    }
)
print(df)
def mark_near_neighbors(group):
    # I snip the decision process here.
    # Delete 9 because it is "near" 8.
    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )
    if group.X.iloc[0] == 'A':
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True
    return default_result
result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)
df = df.loc[~result]
print(df)
So in the end I use a "boolean indexing thing" to solve this
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
Initialize the dataframe
df = pd.DataFrame([
    ['A', 9, 0],
    ['A', 14, 1],
    ['A', 8, 2],
    ['A', 1, 3],
    ['A', 18, 4],
    ['B', 10, 5],
    ['B', 20, 6],
    ['B', 11, 7],
    ['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe by column A
df = df.sort_values('A')
Find the difference between consecutive values within each group
df["diff"] = df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the extra column and sort by index to restore the initial order
result = result.drop("diff", axis=1)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
IIUC, you can use numpy broadcasting to compare all values within a group. I kept everything inside apply here, since that seems to be what you want:
def mark_near_neighbors(group, thresh=1):
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)
    b = a[idx]  # values in ascending order
    d = abs(b-b[:,None])  # pairwise distances
    d[np.triu_indices(d.shape[0])] = thresh+1  # only compare against smaller values
    # keep a value if it is not "near" any smaller one, restoring the original order
    return pd.Series((d>thresh).all(1)[np.argsort(idx)], index=group.index)
out = df[df.groupby('X', group_keys=False)['A'].apply(mark_near_neighbors)]
output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
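The sort-and-diff idea from the first answer can also be written without adding a helper column to the frame. This is only a sketch under the question's example rule (a value is "near" when it exceeds the next smaller value in its group by at most `thresh`). The function name `drop_near` and its parameters are invented for illustration.

```python
import pandas as pd

def drop_near(frame, group_col, value_col, thresh=1):
    """Drop rows whose value is within `thresh` of the next smaller value in the same group."""
    ordered = frame.sort_values(value_col)
    gap = ordered.groupby(group_col)[value_col].diff()  # distance to the next smaller value
    keep = gap.isna() | (gap > thresh)                  # smallest value per group has no gap
    return frame[keep.reindex(frame.index)]             # back to the original row order

df = pd.DataFrame({
    'X': list('AAAAABBBB'),
    'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
    'B': range(9)
})

out = drop_near(df, 'X', 'A')
print(out)
```

Because the next smaller value in a sorted group is always the closest smaller one, checking only that gap is enough for this particular rule. A different notion of "near" (such as color names) would need its own comparison.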

Python Pandas Dataframe, remove all rows where 'None' is the value in any column

I have a large dataframe. When it was created, 'None' was used as the value where a number could not be calculated (instead of 'nan').
How can I delete all rows that have 'None' in any of their columns? I thought I could use df.dropna and set the value of na, but I can't seem to be able to.
Thanks
I think this is a good representation of the dataframe:
temp = pd.DataFrame(data=[['str1','str2',2,3,5,6,76,8],['str3','str4',2,3,'None',6,76,8]])
Setup
Borrowed @MaxU's df
df = pd.DataFrame([
[1, 2, 3],
[4, None, 6],
[None, 7, 8],
[9, 10, 11]
], dtype=object)
Solution
You can just use pd.DataFrame.dropna as is
df.dropna()
0 1 2
0 1 2 3
3 9 10 11
Supposing you have 'None' strings, as in this df
df = pd.DataFrame([
[1, 2, 3],
[4, 'None', 6],
['None', 7, 8],
[9, 10, 11]
], dtype=object)
Then combine dropna with mask
df.mask(df.eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
You can make sure that the entire dataframe has dtype object before you compare.
df.mask(df.astype(object).eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
Thanks for all your help. In the end I was able to get
df = df.replace(to_replace='None', value=np.nan).dropna()
to work. I'm not sure why your suggestions didn't work for me.
UPDATE:
In [70]: temp[temp.astype(str).ne('None').all(axis=1)]
Out[70]:
0 1 2 3 4 5 6 7
0 str1 str2 2 3 5 6 76 8
Old answer:
In [35]: x
Out[35]:
a b c
0 1 2 3
1 4 None 6
2 None 7 8
3 9 10 11
In [36]: x = x[~x.astype(str).eq('None').any(axis=1)]
In [37]: x
Out[37]:
a b c
0 1 2 3
3 9 10 11
or a slightly nicer variant from @roganjosh:
In [47]: x = x[x.astype(str).ne('None').all(axis=1)]
In [48]: x
Out[48]:
a b c
0 1 2 3
3 9 10 11
I'm a bit late to the party, but this is probably the simplest method:
df.dropna(axis=0, how='any')
Parameters:
axis='index/columns' how='any/all'
axis=0 drops rows (most common), and axis=1 drops columns instead.
The how parameter drops a row/column if it contains any None values (how='any'),
or only if all of its values are None (how='all').
If None is still not removed, we can do
df = df.replace(to_replace='None', value=np.nan).dropna()
The above solution only partially worked for me: None was converted to NaN, but the rows were not removed (thanks to the above answer, as it helped me move further).
So I added one more line of code that converts the particular column to strings:
df['column'] = df['column'].apply(lambda x : str(x))
This changed NaN to the string 'nan'.
Now remove the 'nan' rows:
df = df[df['column'] != 'nan']
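Another way to combine the ideas above is to build one boolean mask that flags both real missing values and the literal string 'None', then keep only the rows without a flag. This is a sketch on the sample frame from the question. The names `bad` and `clean` are arbitrary.

```python
import pandas as pd

temp = pd.DataFrame(data=[['str1', 'str2', 2, 3, 5, 6, 76, 8],
                          ['str3', 'str4', 2, 3, 'None', 6, 76, 8]])

# True where a cell is missing (None/NaN) or holds the string 'None'
bad = temp.isna() | temp.eq('None')

# Keep only the rows where no cell is flagged
clean = temp[~bad.any(axis=1)]
print(clean)
```

This leaves the column dtypes alone, since nothing is replaced or cast to string before filtering.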

Use a list to conditionally fill a new column based on values in multiple columns

I am trying to populate a new column within a pandas dataframe by using values from several columns. The original columns contain either 0 or 1, with exactly one 1 per row. The new column would correspond to df[['A','B','C','D']] by taking its values from new_col = [1, 3, 7, 10] as shown below. (A 1 in A means new_col = 1; if B = 1, new_col = 3, etc.)
df
A B C D
1 1 0 0 0
2 0 0 1 0
3 0 0 0 1
4 0 1 0 0
The new df should look like this.
df
A B C D new_col
1 1 0 0 0 1
2 0 0 1 0 7
3 0 0 0 1 10
4 0 1 0 0 3
I've tried to use map, loc, and where but can't seem to formulate an efficient way to get it done. The problem seems very close to another question, and I've looked at a couple of other posts, but none of them show how to use multiple columns conditionally to fill a new column based on a list.
I can think of a few ways, mostly involving argmax or idxmax, to get either an ndarray or a Series which we can use to fill the column.
We could drop down to numpy, find the maximum locations (where the 1s are) and use those to index into an array version of new_col:
In [148]: np.take(new_col,np.argmax(df.values,1))
Out[148]: array([ 1, 7, 10, 3])
We could make a Series with new_col as the values and the columns as the index, and index into that with idxmax:
In [116]: pd.Series(new_col, index=df.columns).loc[df.idxmax(1)].values
Out[116]: array([ 1, 7, 10, 3])
We could use get_indexer to turn the column idxmax results into integer offsets we can use with new_col:
In [117]: np.array(new_col)[df.columns.get_indexer(df.idxmax(axis=1))]
Out[117]: array([ 1, 7, 10, 3])
Or (and this seems very wasteful) we could make a new frame with the new columns and use idxmax directly:
In [118]: pd.DataFrame(df.values, columns=new_col).idxmax(1)
Out[118]:
0 1
1 7
2 10
3 3
dtype: int64
It's not the most elegant solution, but for me it beats the if/elif/elif loop:
d = {'A': 1, 'B': 3, 'C': 7, 'D': 10}
def new_col(row):
    k = row[row == 1].index.tolist()[0]
    return d[k]
df['new_col'] = df.apply(new_col, axis=1)
Output:
A B C D new_col
1 1 0 0 0 1
2 0 0 1 0 7
3 0 0 0 1 10
4 0 1 0 0 3
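Since each row holds exactly one 1, a matrix product of the indicator columns with the list of target values also yields the new column. This is a sketch that relies on that one-hot assumption, which the code checks before assigning. The variable `flags` is introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0, 0],
                   'B': [0, 0, 0, 1],
                   'C': [0, 1, 0, 0],
                   'D': [0, 0, 1, 0]}, index=[1, 2, 3, 4])
new_col = [1, 3, 7, 10]

flags = df[['A', 'B', 'C', 'D']].to_numpy()

# The approach is only valid if every row contains exactly one 1
assert (flags.sum(axis=1) == 1).all()

# Each row selects the single entry of new_col where its flag is 1
df['new_col'] = flags @ np.asarray(new_col)
print(df)
```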

pandas assign value based on mean

Let's say I have a dataframe column. I want to create a new column where the value for a given observation is 1 if the corresponding value in the old column is above average, and 0 if the value in the old column is average or below.
What's the fastest way of doing this?
Say you have the following DataFrame:
df = pd.DataFrame({'A': [1, 4, 6, 2, 8, 3, 7, 1, 5]})
df['A'].mean()
Out: 4.111111111111111
Comparison against the mean will get you a boolean vector. You can cast that to integer:
df['B'] = (df['A'] > df['A'].mean()).astype(int)
or use np.where:
df['B'] = np.where(df['A'] > df['A'].mean(), 1, 0)
df
Out:
A B
0 1 0
1 4 0
2 6 1
3 2 0
4 8 1
5 3 0
6 7 1
7 1 0
8 5 1
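As a quick sanity check, the two variants above can be computed from the same boolean comparison and compared with each other. The names `above`, `by_cast` and `by_where` are chosen just for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 6, 2, 8, 3, 7, 1, 5]})

# Boolean vector: True where the value is strictly above the column mean
above = df['A'] > df['A'].mean()

by_cast = above.astype(int)                                   # cast booleans to 0/1
by_where = pd.Series(np.where(above, 1, 0), index=df.index)   # explicit 1/0 choice

print((by_cast == by_where).all())
```

Computing the mean once and reusing the boolean vector avoids repeating the comparison when both forms are needed.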
