Pandas: Trying to drop rows based on for loop? - python

I have a dataframe consisting of multiple columns, two of which, x and y, are both filled with numbers ranging from 1 to 3. I want to drop all rows where the number in x is less than the number in y. For example, if in one row x = 1 and y = 3, I want to drop that entire row. This is the code I've written so far:
for num1 in df.x:
    for num2 in df.y:
        if num1 < num2:
            df.drop(df.iloc[num1], inplace=True)
but I keep getting the error:
labels ['new' 'active' 1 '1'] not contained in axis
Any help is greatly appreciated. Thanks!

I would avoid loops in your scenario and just use .drop (the error occurs because drop expects index labels, while df.iloc[num1] passes it the row's values):
df.drop(df[df['x'] < df['y']].index, inplace=True)
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randint(1, 4, 5), 'y': np.random.randint(1, 4, 5)})
>>> df
x y
0 1 2
1 2 1
2 3 1
3 2 1
4 1 3
df.drop(df[df['x'] < df['y']].index, inplace = True)
>>> df
x y
1 2 1
2 3 1
3 2 1
[EDIT]: Or, more simply, without using drop:
df = df[~(df['x'] < df['y'])]
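One aside worth noting (not from the original answer): the two filters differ when either column contains NaN, because every comparison with NaN is False:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, np.nan, 3], 'y': [2, 1, 1]})

df[df['x'] >= df['y']]    # drops the NaN row: NaN >= 1 is False
df[~(df['x'] < df['y'])]  # keeps the NaN row: NaN < 1 is False, so its negation is True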

Writing two for loops is very inefficient; instead, you can just compare the two columns:
df['x'] >= df['y']
This returns a boolean Series, which you can use to filter the dataframe:
df[df['x'] >= df['y']]

I think it is better to use boolean indexing or query, flipping the condition to >=:
df[df['x'] >= df['y']]
Or:
df = df.query('x >= y')
Sample:
df = pd.DataFrame({'x':[1,2,3,2], 'y':[0,4,5,1]})
print (df)
x y
0 1 0
1 2 4
2 3 5
3 2 1
df = df[df['x'] >= df['y']]
print (df)
x y
0 1 0
3 2 1

Related

How to conditionally replace a value with a value from the same row in a different column using pandas?

Say I have a data frame:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
I want column Z to equal column X, unless column X is equal to 0. In that case, I would want column Z to equal column Y. Is there a way to do this without a for loop using pandas?
Use a conditional with numpy.where:
df['Z'] = np.where(df['X'].eq(0), df['Y'], df['X'])
Or Series.where, which keeps the values of df['Y'] where the condition is True and takes df['X'] elsewhere:
df['Z'] = df['Y'].where(df['X'].eq(0), df['X'])
Output:
ID X Y Z
1 3 5 3
2 1 4 1
3 5 3 5
4 0 7 7
You can try using np.where():
df['Z'] = np.where(df['X'] == 0,df['Y'],df['X'])
Basically this translates to: if X equals 0, then use the corresponding value from column Y; otherwise (non-zero), use column X.
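For completeness, a minimal runnable sketch of the np.where approach (the frame below is reconstructed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'X': [3, 1, 5, 0],
                   'Y': [5, 4, 3, 7]})

# If X equals 0 take the value from Y, otherwise keep X
df['Z'] = np.where(df['X'] == 0, df['Y'], df['X'])
print(df['Z'].tolist())  # [3, 1, 5, 7]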

custom groupby function pandas python

I have the following dataframe:
I would like to group by id and add a flag column which contains Y if Y has occurred against that id at any point; the resultant DF would look like the following:
Here is my approach, which is too time-consuming, and I am not sure of its correctness:
temp = pd.DataFrame()
j = 'flag'
for i in df['id'].unique():
    test = df[df['id'] == i]
    test[j] = np.where(np.any(test[j] == 'Y'), 'Y', test[j])
    temp = temp.append(test)
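The original post showed the input and output frames as images; as a stand-in, here is a hypothetical input that is consistent with the outputs in the answers below:
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 2, 2, 3, 4, 5, 5],
                   'flag': ['N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N']})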
You can do groupby + max, since 'Y' > 'N' in string comparison:
df.groupby('id', as_index=False)['flag'].max()
id flag
0 1 Y
1 2 Y
2 3 N
3 4 N
4 5 Y
Compare flag to Y, group by id, and use any:
new_df = (df['flag'] == 'Y').groupby(df['id']).any().map({True:'Y', False:'N'}).reset_index()
Output:
>>> new_df
id flag
0 1 Y
1 2 Y
2 3 N
3 4 N
4 5 Y
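Step by step, the same chain decomposed (a sketch, using the hypothetical input above):
is_y = df['flag'] == 'Y'                 # True where the flag is Y
per_id = is_y.groupby(df['id']).any()    # True if any row of the id group had Y
new_df = per_id.map({True: 'Y', False: 'N'}).reset_index()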

Replacing values from a pandas dataframe

I have a data frame which has columns with strings and integers.
df = pd.DataFrame([['Manila', 5, 12, 0], ['NY', 9, 0, 14], ['Berlin', 8, 10, 6]], columns=['a', 'b', 'c', 'd'])
I want to change all the values greater than 1 to 1, while the zeros remain the same.
So I tried apply(lambda x: 1 if x > 1 else 0), but it raises an error that the truth value of a Series is ambiguous.
Then I tried to write a function separately as follow:
def find_value(x):
    try:
        x = int(x)
        print(x)
        if x > 1:
            x = 1
        else:
            x = 0
    except:
        return x
    return x
and then apply it:
df = df.apply(find_value, axis=1)
But the output does not change and the df remains as it was.
I think there should be some apply function which can be applied to all of the eligible columns (those which hold numerical values), but I am missing the point somehow. Can anyone please enlighten me on how to solve it (with or without the "map" function)?
Use DataFrame.select_dtypes to get the numeric columns, compare them for greater than 1, and then map True/False to 1/0 by casting to integers; to change the data in the original frame, use DataFrame.update:
df.update(df.select_dtypes(np.number).gt(1).astype(int))
print (df)
a b c d
0 Manila 1 1 0
1 NY 1 0 1
2 Berlin 1 1 1
Or use DataFrame.clip if all values are integers and there are no negative numbers:
df.update(df.select_dtypes(np.number).clip(upper=1))
print (df)
a b c d
0 Manila 1 1 0
1 NY 1 0 1
2 Berlin 1 1 1
EDIT:
Your solution works with DataFrame.applymap, which applies the function to every scalar; with apply(axis=1), int(x) received a whole row, raised an exception, and the bare except returned the row unchanged:
df = df.applymap(find_value)
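A minimal runnable check (reusing find_value from the question):
import pandas as pd

df = pd.DataFrame([['Manila', 5, 12, 0], ['NY', 9, 0, 14], ['Berlin', 8, 10, 6]],
                  columns=['a', 'b', 'c', 'd'])

# applymap calls find_value once per scalar, so int(x) succeeds on the numbers
# and the except branch returns the strings in column 'a' untouched
df = df.applymap(find_value)
print(df)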

Logical AND between elementwise and rowwise conditions in pandas (boolean AND between matrix and column)

Edited:
How does one combine logical masks of different sizes in pandas?
For example, find all elements that satisfy an elementwise condition while the row they are in also satisfies another condition.
df = pandas.DataFrame({'name': 'x y T Bob Banana'.split(), 'value': [-100, 0, -1, 100, -33], 'value_too': [1, 2, 2, 2, -11]})
name value value_too
x -100 1
y 0 2
T -1 2
Bob 100 2
Banana -33 -11
Imagine that in the table above I need to change the negative values in the rows with T, Bob, Banana.
Checking the name gives a series of size 5:
c1 = df.name.isin({'Banana','Bob','T'})
Checking if a value is a negative number gives a dataframe 5 by 3:
c2 = df.applymap(lambda x: x < 0 if isinstance(x,(int,float)) else False)
In MATLAB, doing a binary operation on a vector of size m and a matrix of size (m, n) would broadcast the operation across the n columns, and the result would also be of size (m, n). In Python, c1 & c2 aligns the index of the Series against the columns of the DataFrame and produces a 5 by 8 table filled with False.
How are conditions of this type combined into a 5 by 3 mask that can be fed into df.where(cond)? (In the table above it should point to the values -1, -33 and -11.)
mask = ?????(c1,c2)
df = df.where(mask,0)
dtypes of the data are irrelevant; in the real problem all entries are strings, but numbers make the example simpler.
Original text below.
Suppose I have pandas dataframe:
df = pandas.DataFrame({'a':[1,1,3],'b':[1,3,5]})
a b
0 1 1
1 1 3
2 3 5
I can get a boolean mask to find elements greater than a particular value:
q = df > 1
a b
0 0 0
1 0 1
2 1 1
I can get a boolean vector to find rows where column satisfies a condition:
df.b == 3
0
1
0
What is the idiomatic way of finding elements that satisfy a combination of these conditions?
??????
a b
0 0 0
1 0 1
2 0 0
edit: actually the expected output was
a b
0 0 0
1 1 1
2 0 0
(value greater than 1 and in a row that has b == 3)
Try via the assign() method:
out = df.gt(1).astype(int).assign(b=df['b'].eq(3).astype(int))
Output of out:
a b
0 0 0
1 0 1
2 1 0
Or via the where() method:
out = df.gt(1).astype(int).where(df['b'] == 3, 0)
Output of out:
a b
0 0 0
1 0 1
2 0 0
Note: choose either of the above methods as per your need.
If you need the result as True/False, then use:
d = {0: False, 1: True}
out = out.replace(d)
Your first gt condition seems unnecessary:
df[['b']].eq(3).reindex(df.columns,axis=1,fill_value=False)
Out[111]:
a b
0 False False
1 False True
2 False False
I've settled on explicitly combining the columns.
This selects elements that satisfy c2 and are in a row that satisfies c1:
pad = c2.apply(lambda x: x & c1)
# size (5, 3)
This selects rows which satisfy c1 or contain an element satisfying c2:
pad2 = c2.agg(any, axis=1) | c1
# size (5,)
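As an aside (a sketch, not from the original answers), the same row-wise alignment can be written with mul(axis=0), which aligns the Series along the index and, for boolean inputs, acts as an AND:
mask = c2.mul(c1, axis=0).astype(bool)   # same result as c2.apply(lambda x: x & c1)
df.where(~mask, 0)                       # zeroes out the flagged cells: -1, -33 and -11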

Remove all groups with more than N observations

If a value occurs more than two times in a column I want to drop every row that it occurs in.
The input df would look like:
Name Num
X 1
X 2
Y 3
Y 4
X 5
The output df would look like:
Name Num
Y 3
Y 4
I know it is possible to remove duplicates, but that only works if I want to remove the first or last duplicate that is found, not the nth duplicate.
df = df.drop_duplicates(subset = ['Name'], drop='third')
This code is completely wrong but it helps explain what I was trying to do.
Using head:
df.groupby('Name').head(2)
Out[375]:
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
s = df.groupby('Name').size() <= 2
df.loc[df.Name.isin(s[s].index)]
Out[380]:
Name Num
2 Y 3
3 Y 4
Use GroupBy.cumcount as a counter and keep the rows where the counter is less than 2:
df1 = df[df.groupby('Name').cumcount() < 2]
print (df1)
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Detail:
print (df.groupby('Name').cumcount())
0 0
1 1
2 0
3 1
4 2
dtype: int64
EDIT
To drop whole groups with more than 2 rows (the output the question actually asks for), filter by GroupBy.transform with size:
df1 = df[df.groupby('Name')['Num'].transform('size') < 3]
print (df1)
Name Num
2 Y 3
3 Y 4
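Another common idiom for this (equivalent to the transform approach above, though typically slower) is GroupBy.filter, which drops whole groups that fail a predicate:
import pandas as pd

df = pd.DataFrame({'Name': ['X', 'X', 'Y', 'Y', 'X'],
                   'Num': [1, 2, 3, 4, 5]})

# keep only the groups with at most 2 rows
df1 = df.groupby('Name').filter(lambda g: len(g) <= 2)
print(df1)
#   Name  Num
# 2    Y    3
# 3    Y    4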
