Delete rows in apply() function or depending on apply() result - python

I have a working solution here, but my question focuses on how to do this the Pandas way. I assume Pandas offers better solutions for this.
I use groupby() and then apply(axis=1) to compare the values in the rows of the groups. While doing this I decide which row to delete.
The rule itself doesn't matter! In this example the rule is that when values in column A differ only by 1 (the values are "near"), the second one is deleted. How the decision is made is not part of the question. There could also be a list of color names, and I could say that darkblue and marineblue are "near", so one of them should be deleted.
This is the initial data frame:
X A B
0 A 9 0 <--- DELETE
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
7 B 11 7 <--- DELETE
8 B 30 8
Row index 0 should be deleted because its value 9 is near the value 8 in row index 2. The same goes for row index 7: its value 11 is "near" 10 in row index 5.
This is the code:
#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame(
    {
        'X': list('AAAAABBBB'),
        'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
        'B': range(9)
    }
)
print(df)

def mark_near_neighbors(group):
    # I snip the decision process here.
    # Delete 9 because it is "near" 8.
    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )
    if group.X.iloc[0] == 'A':  # '==', not 'is': 'is' tests identity, not equality
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True
    return default_result

result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)

df = df.loc[~result]
print(df)
So in the end I use boolean indexing to solve this:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
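For reference, the boolean-indexing step at the end works like this in isolation (a minimal sketch with made-up values):

```python
import pandas as pd

# A boolean Series aligned with the frame's index selects rows;
# negating a "delete" mask keeps the rows NOT marked for deletion.
df = pd.DataFrame({'A': [9, 14, 8]})
mask = pd.Series([True, False, False], index=df.index)  # True = delete
kept = df.loc[~mask]
print(kept)  # rows with A == 14 and A == 8 remain
```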

Initialize the dataframe
df = pd.DataFrame([
['A', 9, 0],
['A', 14, 1],
['A', 8, 2],
['A', 1, 3],
['B', 18, 4],
['B', 10, 5],
['B', 20, 6],
['B', 11, 7],
['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe based on A column
df = df.sort_values('A')
Find the difference between values
df["diff"] = df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the extra column and sort by index to get the initial dataframe
result = result.drop("diff", axis=1)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 B 18 4
5 B 10 5
6 B 20 6
8 B 30 8

IIUC, you can use NumPy broadcasting to compare all values within a group. Keeping everything with apply here, as that seems to be what is wanted:
import numpy as np

def mark_near_neighbors(group, thresh=1):
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)
    b = a[idx]
    d = abs(b - b[:, None])
    # mask the diagonal and upper triangle so each value is only
    # compared against strictly smaller sorted values
    d[np.triu_indices(d.shape[0])] = thresh + 1
    return pd.Series((d > thresh).all(1)[np.argsort(idx)], index=group.index)
out = df[df.groupby('X')['A'].apply(mark_near_neighbors)]
output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
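To see what the broadcasting does, here is the same computation stepped through on a bare NumPy array (the values of group A above), outside of apply:

```python
import numpy as np

a = np.array([9, 14, 8, 1, 18], dtype=float)
idx = np.argsort(a)
b = a[idx]                       # sorted: [1, 8, 9, 14, 18]
d = np.abs(b - b[:, None])       # pairwise absolute differences
# mask the diagonal and upper triangle so each value is only
# compared against strictly smaller sorted values
thresh = 1
d[np.triu_indices(len(d))] = thresh + 1
keep_sorted = (d > thresh).all(1)     # value 9 is within 1 of 8 -> False
keep = keep_sorted[np.argsort(idx)]   # back to original row order
print(keep)  # [False  True  True  True  True]
```

Only the 9 is dropped, because it sits within the threshold of the smaller 8; the 8 itself survives, since it is only compared against smaller values.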

Related

Compare two rows in a for loop in Pandas

I have the following dataframe where I want to determine whether column A is greater than column B, and whether column B is greater than column C. When a value is smaller than the one in the previous column, I want to change it to 0.
d = {'A': [6, 8, 10, 1, 3], 'B': [4, 9, 12, 0, 2], 'C': [3, 14, 11, 4, 9] }
df = pd.DataFrame(data=d)
df
I have tried this with the np.where and it is working:
df['B'] = np.where(df['A'] > df['B'], 0, df['B'])
df['C'] = np.where(df['B'] > df['C'], 0, df['C'])
However, I have a huge number of columns and I want to know if there is any way to do this without writing each comparison separately, for example with a for loop.
Thanks
A solution with different output, because the original columns are compared with DataFrame.diff and values less than the previous column are set to 0 by DataFrame.mask:
df1 = df.mask(df.diff(axis=1).lt(0), 0)
print (df1)
A B C
0 6 0 0
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
If you use a loop with zip over shifted column names, the output is different, because the comparison uses the already reassigned columns B, C, ...:
for a, b in zip(df.columns, df.columns[1:]):
    df[b] = np.where(df[a] > df[b], 0, df[b])
print (df)
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
To use a vectorial approach, you cannot simply use diff, as the condition depends on whether the previous value was itself replaced by 0. Thus two consecutive replacements cannot both happen.
You can achieve a correct vectorial replacement using a shifted mask:
m1 = df.diff(axis=1).lt(0) # check if < than previous
m2 = ~m1.shift(axis=1, fill_value=False) # and this didn't happen twice
df2 = df.mask(m1&m2, 0)
output:
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
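To illustrate why the shift is needed, consider a hypothetical row where three values decrease in a chain (10 > 8 > 6): a plain diff mask would zero both B and C, while the sequential logic keeps C, because it is compared against the already-zeroed B:

```python
import pandas as pd

# Hypothetical chain: 10 > 8 > 6
df = pd.DataFrame({'A': [10], 'B': [8], 'C': [6]})

m1 = df.diff(axis=1).lt(0)                # drops vs the previous column
m2 = ~m1.shift(axis=1, fill_value=False)  # previous column was NOT replaced
out = df.mask(m1 & m2, 0)
print(out)  # B is zeroed; C stays 6, since 6 > the new B value of 0
```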

Python: Creating new column names in a for loop

I am trying to make custom column header names for the dataframe using a for loop. Currently I am using two for loops to iterate through a dataframe, but don't know how to put new column headers in without hardcoding them. I have
import pandas

df = pandas.DataFrame({
    'A': [5, 3, 6, 9, 2, 4],
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0]})
result = []
for i in range(len(df.columns)):
    SelectedCol = df.iloc[:, i]
    for c in range(i + 1, len(df.columns)):
        result.append((SelectedCol + 1) / (df.iloc[:, c] + 1))
df1 = pandas.DataFrame(result)
df1 = df1.transpose()
In df, the first column is taken and paired with the second, third, and fourth. Then the code takes the second and pairs it with the third and fourth, and the loop continues, so the output columns are
'AB' , 'AC', 'AD', 'BC', 'BD', and 'CD'.
What could I add to my for loop to extract the column names so each column name of df1 can be 'Long A, Short B' , 'Long A, Short C'.... and finally 'Long C, Short D'
Thanks for your help
from itertools import combinations
for x, y in combinations(df.columns, 2):
    df['Long ' + x + ' Short ' + y] = df[x] * df[y]
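If you want the exact "Long A, Short B" naming from the question together with its ratio formula, a sketch combining the itertools.combinations idea with the question's (x+1)/(y+1) computation (small made-up frame for illustration):

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    'A': [5, 3, 6],
    'B': [4, 5, 4],
    'C': [7, 8, 9],
})
# Build the ratio columns from the question with descriptive names.
df1 = pd.DataFrame({
    f'Long {x}, Short {y}': (df[x] + 1) / (df[y] + 1)
    for x, y in combinations(df.columns, 2)
})
print(df1.columns.tolist())
# ['Long A, Short B', 'Long A, Short C', 'Long B, Short C']
```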
import pandas
from itertools import combinations

df = pandas.DataFrame({
    'A': [5, 3, 6, 9, 2, 4],
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0]})
# get all column names (items() replaces the deprecated iteritems())
for index, row in df.items():
    print(index)
# get all combinations
result = combinations(df.items(), 2)
# calc
for name, data in result:
    _name = name[0] + data[0]
    _data = name[1] * data[1]
    df[_name] = _data
print(df)
A
B
C
D
A B C D AB AC AD BC BD CD
0 5 4 7 1 20 35 5 28 4 7
1 3 5 8 3 15 24 9 40 15 24
2 6 4 9 5 24 54 30 36 20 45
3 9 5 4 7 45 36 63 20 35 28
4 2 5 2 1 10 4 2 10 5 2
5 4 4 3 0 16 12 0 12 0 0

How to use a Series to filter a DataFrame

I have a pandas Series with the following content.
import pandas as pd

s = pd.Series(
    data=[True, False, True, True],
    index=['A', 'B', 'C', 'D']
)
s.index.name = 'my_id'
print(s)
my_id
A True
B False
C True
D True
dtype: bool
and a DataFrame like this.
df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9]
})
print(df)
A B C D
0 1 9 10 43
1 2 6 91 12
2 9 7 32 7
3 4 8 13 9
s has A, B, C, and D as its indices. df also has A, B, C, and D as its column names.
True in s means that the corresponding column in df will be preserved. False in s means that the corresponding column in df will be removed.
How can I generate another DataFrame with column B removed using s?
I mean I want to create the following DataFrame using s and df.
A C D
0 1 10 43
1 2 91 12
2 9 32 7
3 4 13 9
Use boolean indexing with DataFrame.loc. The : means select all rows; the columns are filtered by the boolean Series used as a mask:
df1 = df.loc[:, s]
print (df1)
A C D
0 1 10 43
1 2 91 12
2 9 32 7
3 4 13 9
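An equivalent spelling, if you prefer explicit column labels (a minor variation, not from the answer above): indexing the mask with itself keeps only the True labels, which can then be used as a plain column list.

```python
import pandas as pd

s = pd.Series([True, False, True, True], index=['A', 'B', 'C', 'D'])
df = pd.DataFrame({
    'A': [1, 2], 'B': [9, 6], 'C': [10, 91], 'D': [43, 12]
})
# s[s] keeps only the True entries; .index gives their labels
df1 = df[s[s].index]
print(df1.columns.tolist())  # ['A', 'C', 'D']
```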

Axis error when dropping specific columns Pandas

I have identified specific columns I want to select as my predictors for my model based on some analysis. I have captured those column numbers and stored them in a list. I have roughly 80 columns and want to loop through and drop the columns not in this specific list. X_train is the dataframe in which I want to do this. Here is my code:
cols_selected = [24, 4, 7, 50, 2, 60, 46, 53, 48, 61]
cols_drop = []
for x in range(len(X_train.columns)):
    if x in cols_selected:
        pass
    else:
        X_train.drop([x])
When running this, I am faced with the following error while highlighting the code: X_train.drop([x]):
KeyError: '[3] not found in axis'
I am sure it is something very simple that I am missing. I tried including the inplace=True or axis=1 arguments along with this, and all of them gave the same error message (while the value inside the [] changed with those errors).
Any help would be great!
Edit: Here is the addition to get this working:
cols_selected = [24, 4, 7, 50, 2, 60, 46, 53, 48, 61]
cols_drop = []
for x in range(len(X_train.columns)):
    if x in cols_selected:
        pass
    else:
        cols_drop.append(x)
X_train = X_train.drop(X_train.columns[cols_drop], axis=1)
According to the documentation of drop:
Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly index or column names
You cannot drop columns simply by using the index of the column; you need the column names. Also, the axis parameter has to be set to 1 or 'columns'. Replace X_train.drop([x]) with X_train = X_train.drop(X_train.columns[x], axis='columns') to make your example work.
I am just assuming as per the question title:
Example DataFrame:
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Dropping Specific columns B & C:
>>> df.drop(['B', 'C'], axis=1)
# df.drop(['B', 'C'], axis=1, inplace=True) <-- to make the change the df itself , use inplace=True
A D
0 0 3
1 4 7
2 8 11
If you are trying to drop them by column numbers (Dropping by index) then try like below:
>>> df.drop(df.columns[[1, 2]], axis=1)
A D
0 0 3
1 4 7
2 8 11
OR
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Also, in addition to #pygo pointing out that df.drop takes a keyword arg to designate the axis, try this:
X_train = X_train[[col for col in X_train.columns if col in cols_selected]]
Here is an example:
>>> import numpy as np
>>> import pandas as pd
>>> cols_selected = ['a', 'c', 'e']
>>> X_train = pd.DataFrame(np.random.randint(low=0, high=10, size=(20, 5)), columns=['a', 'b', 'c', 'd', 'e'])
>>> X_train
a b c d e
0 4 0 3 5 9
1 8 8 6 7 2
2 1 0 2 0 2
3 3 8 0 5 9
4 5 9 7 8 0
5 1 9 3 5 9 ...
>>> X_train = X_train[[col for col in X_train.columns if col in cols_selected]]
>>> X_train
a c e
0 4 3 9
1 8 6 2
2 1 2 2
3 3 0 9
4 5 7 0
5 1 3 9 ...

Python Pandas Dataframe, remove all rows where 'None' is the value in any column

I have a large dataframe. When it was created 'None' was used as the value where a number could not be calculated (instead of 'nan')
How can I delete all rows that have 'None' in any of their columns? I thought I could use df.dropna and set the value of na, but I can't seem to manage it.
Thanks
I think this is a good representation of the dataframe:
temp = pd.DataFrame(data=[['str1','str2',2,3,5,6,76,8],['str3','str4',2,3,'None',6,76,8]])
Setup
Borrowed #MaxU's df
df = pd.DataFrame([
[1, 2, 3],
[4, None, 6],
[None, 7, 8],
[9, 10, 11]
], dtype=object)
Solution
You can just use pd.DataFrame.dropna as is
df.dropna()
0 1 2
0 1 2 3
3 9 10 11
Supposing you have None strings like in this df
df = pd.DataFrame([
[1, 2, 3],
[4, 'None', 6],
['None', 7, 8],
[9, 10, 11]
], dtype=object)
Then combine dropna with mask
df.mask(df.eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
You can ensure that the entire dataframe is treated as object dtype when you compare:
df.mask(df.astype(object).eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
Thanks for all your help. In the end I was able to get
df = df.replace(to_replace='None', value=np.nan).dropna()
to work. I'm not sure why your suggestions didn't work for me.
UPDATE:
In [70]: temp[temp.astype(str).ne('None').all(1)]
Out[70]:
0 1 2 3 4 5 6 7
0 str1 str2 2 3 5 6 76 8
Old answer:
In [35]: x
Out[35]:
a b c
0 1 2 3
1 4 None 6
2 None 7 8
3 9 10 11
In [36]: x = x[~x.astype(str).eq('None').any(1)]
In [37]: x
Out[37]:
a b c
0 1 2 3
3 9 10 11
or bit nicer variant from #roganjosh:
In [47]: x = x[x.astype(str).ne('None').all(1)]
In [48]: x
Out[48]:
a b c
0 1 2 3
3 9 10 11
I'm a bit late to the party, but this is probably the simplest method:
df.dropna(axis=0, how='any')
Parameters:
axis='index'/'columns', how='any'/'all'
axis=0 drops rows (the most common case) and axis=1 drops columns instead.
The how parameter drops the row/column if 'any' of its values are missing,
or only if 'all' of them are missing (how='all').
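A quick sketch of the difference between the two how settings (made-up frame with real NaN values, not 'None' strings, which dropna does not see):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan],
    'b': [2, 5, np.nan],
})
any_dropped = df.dropna(axis=0, how='any')  # drop rows with ANY NaN
all_dropped = df.dropna(axis=0, how='all')  # drop only fully-NaN rows
print(len(any_dropped), len(all_dropped))  # 1 2
```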
If None is still not removed, we can do
df = df.replace(to_replace='None', value=np.nan).dropna()
The above solution worked only partially for me: the None was converted to NaN but not removed (thanks to the answer above, as it helped me move further).
So I added one more line of code to take the particular column
df['column'] = df['column'].apply(lambda x: str(x))
This changed the NaN to the string 'nan'. Now remove the nan:
df = df[df['column'] != 'nan']
