How to select the rows with same absolute value in a column - python

I want to select rows 0, 1, 3, and 4, i.e. all rows whose values share the same absolute value with another row. Note that we don't know the values in advance (there could be -25, 25, -2356, 2356, etc.).
test = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'quantity': [20, 30, 40, -30, -20]})
id quantity
0 1 20
1 2 30
2 3 40
3 4 -30
4 5 -20
.....
What is the best way of doing this?

IIUC, you want to keep the rows whose value, in absolute form, occurs at least twice. You could use groupby on the absolute value:
test[test.groupby(test['quantity'].abs())['quantity'].transform('size').ge(2)]
If you want to ensure that you have both the negative and positive value, make it a set and check that there are 2 elements (the positive and negative):
test[test.groupby(test['quantity'].abs())['quantity'].transform(lambda g: len(set(g))==2)]
output:
id quantity
0 1 20
1 2 30
3 4 -30
4 5 -20
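For reference, this is what the intermediate grouping key and group size look like on the test frame above:
test['quantity'].abs()                                              # 20, 30, 40, 30, 20
test.groupby(test['quantity'].abs())['quantity'].transform('size')  # 2, 2, 1, 2, 2
Rows 0, 1, 3 and 4 have a size of 2 and pass the .ge(2) filter; row 2 is dropped because 40 has no matching -40.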

Related

Drop the entire row if a particular sub-row does not fulfill condition

I have a pandas df with subentries. I would like to impose a condition on a particular subentry and, if the condition is not fulfilled, drop the entire row so I can update the df.
For example, I would like to check subentry 0 of every entry, and if pt < 120 there, drop the entire entry.
pt
entry subentry
0 0 100
1 200
2 300
1 0 200
1 300
2 0 80
1 300
3 400
4 300
... ... ...
So, entries 0 and 2 (with all their subentries) should be deleted.
pt
entry subentry
1 0 200
1 300
... ... ...
I tried using:
df.loc[(slice(None), 0), :]["pt"]>100
but it creates a new series and I cannot pass it to the original df because it does not match the entries/subentries. Thank you.
Try this:
# Count the number of invalid `pt` per `entry`
invalid = df['pt'].lt(120).groupby(df['entry']).sum()
# Valid `entry` are those whose `invalid` count is 0
df[df['entry'].isin(invalid[invalid == 0].index)]
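If entry and subentry actually form a MultiIndex (as the question's display suggests) rather than regular columns, a groupby on the index level is a possible sketch:
# keep only the entries whose subentries all satisfy pt >= 120
df.groupby(level='entry').filter(lambda g: (g['pt'] >= 120).all())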
One solution is to group by "entry" and use transform to broadcast each group's minimum, which can then be used with loc to index the correct rows.
df = pd.DataFrame({'entry': [0, 0, 1, 1, 2, 2],
                   'subentry': [1, 2, 1, 2, 1, 2],
                   'pt': [100, 300, 200, 300, 80, 300]})
Initial df:
entry subentry pt
0 0 1 100
1 0 2 300
2 1 1 200
3 1 2 300
4 2 1 80
5 2 2 300
Use loc to select only the rows matching the conditional:
df.loc[df.groupby('entry')['pt'].transform('min') > 120]
Output:
entry subentry pt
2 1 1 200
3 1 2 300

How to find smallest positive integer in data frame row

I have looked everywhere for this answer which must exist. I am trying to find the smallest positive integer per row in a data frame.
Imagine a dataframe:
df = pd.DataFrame({'lat': [-120, -90, -100, -100],
                   'long': [20, 21, 19, 18],
                   'dist1': [2, 6, 8, 1],
                   'dist2': [1, 3, 10, 5]})
The following gives me the row minimum, but it includes negatives, i.e. the df['lat'] column:
df.min(axis = 1)
Obviously, I could drop the lat column, or convert to string or something, but I will need it later. The lat column is the only column with negative values. I am trying to return a new column such as
df['min_dist'] = [1,3,8,1]
I hope this all makes sense. Thanks in advance for any help.
In general you can use DataFrame.where to mark negative values as null and exclude them from min calculation:
df['min_dist'] = df.where(df > 0).min(1)
df
lat long dist1 dist2 min_dist
0 -120 20 2 1 1.0
1 -90 21 6 3 3.0
2 -100 19 8 10 8.0
3 -100 18 1 5 1.0
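Note that min_dist comes back as float because where introduces NaNs into the frame. If integer output is needed, a possible follow-up (a sketch; safe here since every row has at least one positive value):
df['min_dist'] = df.where(df > 0).min(axis=1).astype(int)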
Filter for just the dist columns and take the row-wise minimum:
df.assign(min_dist = df.iloc[:, -2:].min(1))
Out[205]:
lat long dist1 dist2 min_dist
0 -120 20 2 1 1
1 -90 21 6 3 3
2 -100 19 8 10 8
3 -100 18 1 5 1
Just use:
df['min_dist'] = df[df > 0].min(1)

Conditional Rolling count

Here is my dataframe:
score
1
62
7
15
167
73
25
24
2
76
I want to compare a score with the previous 4 scores and count the number of scores higher than the current one.
This is my expected output:
score count
1
62
7
15
167 0
73 1 (looking at the 4 previous scores 167, 15, 7, 62, only 167 > 73, so the count is 1)
25 2
24 3
2 4
76 0
If somebody has an idea on how to do that, you are welcome
Thanks!
I do not think your expected output matches your question. However, if you do look only at the previous 4 elements, then you could implement the following:
scores = [1, 62, 7, 15, 167, 73, 25, 24, 2, 76]
highers = []
for index, score in enumerate(scores[4:]):
    higher = len([s for s in scores[index:index + 4] if score < s])
    print(higher)
    highers.append(higher)
print(highers)
# [0, 1, 2, 3, 4, 0]
Then, you could just add this highers list as a pandas column:
df['output'] = [0]*4 + highers
Note that I pad the output here by assigning zeros to the first four values.
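A vectorized pandas alternative (a sketch; it compares each score against its previous 4 via shifted columns and blanks out the first 4 rows the way the expected output does):
import pandas as pd

df = pd.DataFrame({'score': [1, 62, 7, 15, 167, 73, 25, 24, 2, 76]})
s = df['score']
# count, among the previous 4 rows, how many scores exceed the current one
df['count'] = sum((s.shift(k) > s).astype(int) for k in range(1, 5))
# the first 4 rows have no full window of 4 predecessors, mirroring the blanks above
df.loc[:3, 'count'] = None
# resulting counts for rows 4..9: 0, 1, 2, 3, 4, 0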

Pandas complex filtering

I have a pandas.DataFrame() object like below
start, end
5, 9
6, 11
13, 11
14, 11
15, 17
16, 17
18, 17
19, 17
20, 24
22, 26
"end" has to always be > "start"
So I need to filter it from the row where the "end" value becomes < "start" until the next row where they are back to normal.
In above example, I need:
1.
13,11
15,17
2.
18,17
20,24
Edit: (updated)
Think of these as timestamps in seconds. So I can find that it took 2 seconds in both scenario to recover back.
I can do this by iterating over the data, but does Pandas have a better way?
You could use pandas' boolean indexing to find the rows where start < end. If you then reset the index, you can take the difference of the original indices, which gives the delta across each run of rows where start > end.
For example you could do something like the following:
# A = starts, B = ends
df = pd.DataFrame({'B': [9, 11, 11, 11, 17, 17, 17, 17, 24, 26],
                   'A': [5, 6, 13, 14, 15, 16, 18, 19, 20, 22]})
# use boolean indexing
df = df[df['A'] < df['B']].reset_index()
# calculate the difference of each row's "old" index to determine delta
diffs = df['index'].diff()
# create a column to show deltas
df['delta'] = diffs
print(diffs)
print(df)
The diffs data frame looks like:
0 NaN
1 1
2 3
3 1
4 3
5 1
Name: index, dtype: float64
Notice the NaN value: the diff() method subtracts the previous row from the current row, and the first row has no previous row. If the first n rows had starts > ends, you would instead look at the first value of the index column to compute that delta.
The fully augmented data frame would then look like:
index A B delta
0 0 5 9 NaN
1 1 6 11 1
2 4 15 17 3
3 5 16 17 1
4 8 20 24 3
5 9 22 26 1
If you wish to delete the extraneous index column you can use del like so:
del df['index']
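If you also need the actual violation-to-recovery row pairs from the question, one possible sketch (assuming the question's start/end column names; it pairs the first row of each bad run with the first valid row after it):
bad = df['end'] <= df['start']                                # invariant violated
run_starts = df.index[bad & ~bad.shift(1, fill_value=False)]  # first row of each bad run
run_ends = df.index[~bad & bad.shift(1, fill_value=False)]    # first valid row after it
for s, e in zip(run_starts, run_ends):
    print(df.loc[[s, e]])  # (13, 11) with (15, 17), then (18, 17) with (20, 24)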

Python Multiply two arrays of unequal size

I have an amplitude curve from x = 2000 to 5000 in 3000 steps and a data curve from x = 0 to 10000 in 50000 steps. Now I want to normalize the data (multiply with the amplitude curve), but as you can see the two arrays are of unequal length and have different start points.
Is there any way of doing this without resizing one of the two? (all values outside the amplitude range can be zero)
You can normalize two arrays of unequal size, but you have to make a decision or two about what makes sense for your application.
Example code:
a1 = [1,2,3,4]
a2 = [20,30]
If I want to scale the values in a1 by a2, how should I do it?
- pairwise by indices, discarding the extra length
- make copies of indices in a2 to pad its length
- pad values in a2 with fixed values
- interpolate values in a2 to create new data points, while adding to its length
Do what makes sense for your data; the interpolation option is sketched below.
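For instance, the interpolation option could look like this with numpy (a sketch; np.interp resamples a2 onto a1's relative positions):
import numpy as np

a1 = np.array([1, 2, 3, 4])
a2 = np.array([20, 30])

# place both arrays on a common 0..1 axis, then resample a2 at a1's positions
x_a1 = np.linspace(0, 1, len(a1))
x_a2 = np.linspace(0, 1, len(a2))
a2_resampled = np.interp(x_a1, x_a2, a2)  # [20.0, 23.33, 26.67, 30.0]
scaled = a1 * a2_resampled                # [20.0, 46.67, 80.0, 120.0]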
You said you don't want to resize the lists, so you'll probably just have to iterate over both lists with a while loop, keeping track of an index for each array. Stop looping when you reach the end of one of the ranges.
You could also use the zip and map functions (wrapping map in list for Python 3) to do something like:
>>> b = [2, 4, 6, 8]
>>> c = [1, 3, 5, 7, 9]
>>> list(map(lambda x: x[0] * x[1], zip(b, c[1:])))
[6, 20, 42, 72]
but I am not sure if that's something you "can" do.
You can kind of do this with pandas if you're smart about how you define your row and column labels. When you multiply the dataframes, pandas will align the data where the column and row labels match. Values where the labels do not match will be set to NaN. Consider the following example:
import numpy as np
import pandas

# every other step
df1 = pandas.DataFrame(
    data=np.arange(1, 10).reshape(3, 3),
    columns=[1, 3, 5],
    index=[0, 2, 4]
)
print(df1)
1 3 5
0 1 2 3
2 4 5 6
4 7 8 9
# every step
df2 = pandas.DataFrame(
    data=np.arange(0, 25).reshape(5, 5),
    columns=[1, 2, 3, 4, 5],
    index=[0, 1, 2, 3, 4]
)
print(df2)
1 2 3 4 5
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
print(df1 * df2)
       1   2      3   4      5
0    0.0 NaN    4.0 NaN   12.0   # <-- labels match
1    NaN NaN    NaN NaN    NaN
2   40.0 NaN   60.0 NaN   84.0   # <-- labels match
3    NaN NaN    NaN NaN    NaN
4  140.0 NaN  176.0 NaN  216.0   # <-- labels match
Rows 1 and 3 and columns 2 and 4 are NaN throughout because those labels exist only in df2.
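Since the question says values outside the amplitude range can be zero, a possible finishing step (a sketch) is to fill the unmatched cells:
result = (df1 * df2).fillna(0)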
