Removing rows from pandas dataframe based on several columns - python

From a pandas dataframe, I want to remove the "rois" where, in half or more of the rows, any of the columns s, b1 or b2 has a value below 50.
Here is an example dataframe:
roi s b1 b2
4 40 60 70
4 60 40 80
4 80 70 60
5 60 40 60
5 60 60 60
5 60 60 60
Only the three rows corresponding to roi 5 should be left over (roi 4 has 2 out of 3 rows
where at least one of the values of s, b1, b2 is below 50).
I have this implemented already, but wonder if there is a shorter (i.e. faster and
cleaner) way to do this:
for roi in data.roi.unique():
    subdata = data[data['roi'] == roi]
    subdatas = subdata[subdata['s'] >= 50]
    subdatab1 = subdatas[subdatas['b1'] >= 50]
    subdatab2 = subdatab1[subdatab1['b2'] >= 50]
    if (subdatab2.size / 10) / (subdata.size / 10) < 0.5:
        data = data[data['roi'] != roi]

You can do this with transform:
s = (data.set_index('roi')   # move `roi` out of the comparison
         .lt(50).any(axis=1) # flag rows where any of s, b1, b2 is below 50
         .groupby('roi')     # group the flags by roi
         .transform('mean')  # fraction of flagged rows per roi
         .lt(0.5)            # True where fewer than half the rows are flagged
         .values
)
data[s]
Output:
roi s b1 b2
3 5 60 40 60
4 5 60 60 60
5 5 60 60 60
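For comparison, the same rule can be written with groupby.filter, which keeps or drops whole groups; a minimal sketch, rebuilding the sample frame from the question (the variable name out is mine):

import pandas as pd

data = pd.DataFrame({'roi': [4, 4, 4, 5, 5, 5],
                     's':   [40, 60, 80, 60, 60, 60],
                     'b1':  [60, 40, 70, 40, 60, 60],
                     'b2':  [70, 80, 60, 60, 60, 60]})

# keep a roi only if fewer than half of its rows have any of s, b1, b2 below 50
out = data.groupby('roi').filter(
    lambda g: g[['s', 'b1', 'b2']].lt(50).any(axis=1).mean() < 0.5
)
print(out)  # the three roi 5 rows

Note that filter calls the lambda once per group, so the transform version above is usually faster when there are many rois.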

You can use multiple filter conditions all at once to avoid creating intermediate dataframes (better space efficiency), for example:
for roi in data.roi.unique():
    subdata2 = data[(data['roi'] == roi) &
                    (data['s'] >= 50) &
                    (data['b1'] >= 50) &
                    (data['b2'] >= 50)]
    if (subdata2.size / 10) / (data[data['roi'] == roi].size / 10) < 0.5:
        data = data[data['roi'] != roi]

Related

How to name the column when using value_count function in pandas?

I was counting the number of occurrences of angle and dist with the code below:
g = new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
the output:
current_angle current_dist 0
-50 30 1
-50 40 2
-50 41 6
-50 45 4
Try 1: g.columns = ['angle','Distance','count','Percentage Missed'] resulted in no change to the column names.
Try 2: when I printed the columns using print(g.columns), it ended with the error AttributeError: 'Series' object has no attribute 'columns'.
I want to rename column 0 to count and add a new column percent missed to the dataframe g, calculated as 100 minus the value in column 0.
Expected output
current_angle current_dist count percent missed
-50 30 1 99
-50 40 2 98
-50 41 6 94
-50 45 4 96
1. How should the code be modified? I mean, instead of value_counts, is there any other function that can give the expected output?
2. How do I get the expected output with the current method?
EDIT 1 (exceptional case)
data:
angle  distance  velocity
0      124       -3
50     24        -25
50     34        25
expected output (count is calculated based on distance):
angle  distance  velocity  count  percent missed
0      124       -3        1      99
50     24        -25       1      99
50     34        25        1      99
First add Series.reset_index, because DataFrame.value_counts returns a Series; its name parameter renames the 0 column to count. Then subtract the new column from 100 with Series.rsub, which subtracts from the right side, i.e. 100 - df['count']:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
            .reset_index(name='count')
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or, if you also need to set new column names, use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
            .reset_index(name='count')
            .set_axis(['angle','Distance','count'], axis=1)
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If you need to assign new column names, here is an alternative solution:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
            .reset_index())
df.columns = ['angle','Distance','count']
df['percent missed'] = df['count'].rsub(100)
Assuming a DataFrame as input (if not, reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
alternative: using groupby.size:
(new_df
.groupby(['current_angle','current_dist']).size()
.reset_index(name='count')
.assign(**{'percent missed': lambda d: 100-d['count']})
)
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
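For the per-row case in EDIT 1 (count based on distance, all rows kept), transform avoids collapsing the frame; a sketch, assuming the edited data sits in new_df with columns angle, distance and velocity:

out = (new_df
       .assign(count=new_df.groupby('distance')['distance'].transform('size'))
       .assign(**{'percent missed': lambda d: 100 - d['count']}))

Each distance in the edited data occurs once, so count is 1 and percent missed is 99 on every row, matching the expected output.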

Cumulative difference of numbers starting from an initial value

I have a Pandas dataframe containing a series of numbers:
df = pd.DataFrame({'deduction':[10,60,70,50,60,10,10,60,60,20,50,20,10,90,60,70,30,50,40,60]})
deduction
0 10
1 60
2 70
3 50
4 60
5 10
6 10
7 60
8 60
9 20
10 50
11 20
12 10
13 90
14 60
15 70
16 30
17 50
18 40
19 60
I would like to compute the cumulative difference of these numbers, starting from a larger number (i.e. <base_number> - 10 - 60 - 70 - 50 - ...).
My current solution is to negate all the numbers, prepend the (positive) larger number to the dataframe, and then call cumsum():
# Compact:
(-df['deduction'][::-1]).append(pd.Series([start_value], index=[-1]))[::-1].cumsum().reset_index(drop=True)
# Expanded:
total_series = (
    # Negate
    (-df['deduction']
    # Reverse
    [::-1])
    # Add the base value to the end
    .append(pd.Series([start_value]))
    # Reverse again (to put the base value at the beginning)
    [::-1]
    # Calculate cumulative sum (all values except the first are negative, so this works)
    .cumsum()
    # Clean up
    .reset_index(drop=True)
)
But I was wondering if there might be a shorter solution that doesn't append to the series (I hear that's bad practice).
(It doesn't need to be put in a dataframe; a series, as above, is fine.)
df['total'] = start_value - df["deduction"].cumsum()
If you need the start value at the beginning of the series then shift and insert (there's a few ways to do it, and this is one of them):
df['total'] = -df["deduction"].shift(1, fill_value=-start_value).cumsum()
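A quick check of the shift trick, with a hypothetical start_value of 500 (not from the original post):

import pandas as pd

df = pd.DataFrame({'deduction': [10, 60, 70, 50]})
start_value = 500  # hypothetical starting balance
total = -df['deduction'].shift(1, fill_value=-start_value).cumsum()
# total is 500, 490, 430, 360: the start value first, then the running differences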

Determine the range of a value using a lookup table

I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
    50,
    65,
    75,
    85,
    90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
    columns=['range','range_min','range_max'],
    data=[
        ['A', 90, 100],
        ['B', 85, 95],
        ['C', 70, 80]
    ]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note the vanilla dataframe above has 3 ranges, however this dataframe gets generated dynamically. It could have from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number','detected_range'], data=[
    [50, 'out_of_range'],
    [65, 'out_of_range'],
    [75, 'C'],
    [85, 'B'],
    [90, 'overlap']  # could be A or B
])
I solved this with a for loop, but it doesn't scale well to the big dataset I am using. The code is also long and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row1.number >= row2.range_min and row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        elif (other cases...):
            ...and so on...
How could I do this?
You can use a few numpy vectorized operations to generate masks, and use them to select your labels:
import numpy as np

a = numbers['number'].values   # numpy array of numbers
r = ranges.set_index('range')  # dataframe of min/max with labels as index
m1 = (a >= r['range_min'].values[:, None]).T  # is number at or above each min
m2 = (a <= r['range_max'].values[:, None]).T  # is number at or below each max (limits are inclusive)
m3 = (m1 & m2)                                # combine both conditions
# NB. the two operations could be done without the intermediate variables m1/m2
m4 = m3.sum(1)  # how many matches?
# 0  -> out_of_range
# 2+ -> overlap
# 1  -> get the matching label
# now we select the label according to the conditions
numbers['detected_range'] = np.select(
    [m4 == 0, m4 > 1],  # out_of_range and overlap
    ['out_of_range', 'overlap'],
    # otherwise take the label of the single match
    default=np.take(r.index, m3.argmax(1))
)
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Edit: it works with any number of intervals in ranges.
Example output with an extra row ['D',50,51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Pandas IntervalIndex fits in here; however, since your data has overlapping points, a for loop is the approach I'll use here (for unique, non-overlapping indices, pd.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')

numbers.assign(detected_range=box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
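For completeness, the fast non-overlapping path alluded to above could look like this (a sketch only: IntervalIndex.get_indexer raises on overlapping intervals, so it does not apply to this exact data):

import numpy as np

idx = intervals.get_indexer(numbers.number)  # -1 where no interval contains the value
numbers['detected_range'] = np.where(idx == -1,
                                     'out_of_range',
                                     ranges['range'].to_numpy()[idx])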
First, explode the ranges (range_max + 1 keeps the inclusive upper limit):
df1 = ranges.assign(col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max + 1), axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
...
Second, judge whether each number in the first df has a match:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]

numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
The logic is simple and clear.
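Since function1 re-filters df1 on every call, the dispatch can also be precomputed once; a sketch building on df1 from above (counts and lookup are my names):

counts = df1.groupby('col1')['range'].agg(['size', 'first'])

def lookup(x):
    if x not in counts.index:
        return 'out_of_range'
    size, first = counts.loc[x]
    return 'overlap' if size > 1 else first

numbers.assign(col2=numbers.number.map(lookup))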

Python Dataframe Sum/Subtract Many Columns to Many Columns in one line

I want to add or subtract many columns to or from many other columns in a data frame.
My code:
df =
   A1  B1  A2  B2
0  15  30  50  70
1  25  40  60  80
# I have many columns like this. I want to do something like A1-A2, B1-B2, etc.
# My approach is:
first_cols = ['A1', 'B1']
sec_cols = ['A2', 'B2']
# New column names
sub_cols = ['A_sub', 'B_sub']
df[sub_cols] = df[first_cols] - df[sec_cols]
Present output:
ValueError: Wrong number of items passed, placement implies 1
Expected output:
df =
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
I think what you are trying to do is similar to this post. In DataFrames, arithmetic operations are generally aligned on column and row indices. Since you are trying to subtract columns with different names, pandas doesn't carry out the operation, so df[sub_cols] = df[first_cols] - df[second_cols] won't work.
However, if you use the underlying numpy array on the right-hand side, pandas carries out the operation elementwise. So df[sub_cols] = df[first_cols] - df[second_cols].values will work and give the expected result.
import pandas as pd
df = {"A1":[15,25], "B1": [30, 40], "A2":[50,60], "B2": [70, 80]}
df = pd.DataFrame(df)
first_cols = ["A1", "B1"]
second_cols = ["A2", "B2"]
sub_cols = ["A_sub","B_sub"]
df[sub_cols] = df[first_cols] - df[second_cols].values
print(df)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
You could also pull it off with a groupby on the columns:
import numpy as np

subtraction = (df.groupby(df.columns.str[0], axis=1)
                 .agg(np.subtract.reduce, axis=1)
                 .add_suffix("_sub")
              )
df.assign(**subtraction)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
It's not quite clear what you want. If you want one column that's A1-A2 and another that's B1-B2, you can do df[['A1','B1']].sub(df[['A2','B2']].values).
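Spelled out against the sample frame, reusing the .values idea from the first answer (a sketch):

df[['A_sub', 'B_sub']] = df[['A1', 'B1']].sub(df[['A2', 'B2']].values)
# A_sub and B_sub come out as -35 and -40 in both rows, i.e. A1-A2 and B1-B2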

How to extract mean and fluctuation by equal index?

I have a CSV file like the one below (after sorting the dataframe by iy):
iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25
I'm trying to compute the mean and the fluctuation when iy are equal. For example, for the CSV above, what I want is something like this:
iy,u,U,u'
1,80,70,10
1,90,70,20
1,70,70,0
1,50,70,-20
1,60,70,-10
2,20,25,-5
2,30,25,5
2,35,25,10
2,15,25,-10
2,25,25,0
Where U is the average of u when iy are equal, and u' is simply u-U, the fluctuation. I know that there's a function called groupby.mean() in pandas, but I don't want to group the dataframe, just take the mean, put the values in a new column, and then calculate the fluctuation.
How can I proceed?
Use groupby with transform to calculate the mean of u for each group and assign it to a new column 'U', then use plain pandas to subtract the two columns:
df['U'] = df.groupby('iy')['u'].transform('mean')
df["u'"] = df['u'] - df['U']
df
Output:
iy u U u'
0 1 80 70 10
1 1 90 70 20
2 1 70 70 0
3 1 50 70 -20
4 1 60 70 -10
5 2 20 25 -5
6 2 30 25 5
7 2 35 25 10
8 2 15 25 -10
9 2 25 25 0
You could get fancy and do it in one line:
df.assign(U=df.groupby('iy')['u'].transform('mean')).eval("u_prime = u - U")
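Note that eval cannot assign to a name containing a quote, hence u_prime; if the column really must be called u', rename it afterwards (a sketch):

out = (df.assign(U=df.groupby('iy')['u'].transform('mean'))
         .eval("u_prime = u - U")
         .rename(columns={'u_prime': "u'"}))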
