I have a pandas.DataFrame() object like below
start, end
5, 9
6, 11
13, 11
14, 11
15, 17
16, 17
18, 17
19, 17
20, 24
22, 26
"end" has to always be > "start"
So, I need to filter it from when the "end" values becomes < "start" till the next row where they are again are back to normal.
In above example, I need:
1.
13,11
15,17
2.
18,17
20,24
Edit: (updated)
Think of these as timestamps in seconds. So I can find that it took 2 seconds in both scenarios to recover.
I can do this in iterating the data, but does Pandas have a better way ?
You could use pandas' boolean indexing to find the rows where start < end. Then, if you reset the index, you can calculate the difference between the original indices, which act as the bounds around each stretch of rows where start > end.
For example you could do something like the following:
# A = starts, B = ends
df = pd.DataFrame({'B': [9, 11, 11, 11, 17, 17, 17, 17, 24, 26],
                   'A': [5, 6, 13, 14, 15, 16, 18, 19, 20, 22]})
# use boolean indexing
df = df[df['A'] < df['B']].reset_index()
# calculate the difference of each row's "old" index to determine delta
diffs = df['index'].diff()
# create a column to show deltas
df['delta'] = diffs
print(diffs)
print(df)
The diffs series looks like:
0 NaN
1 1
2 3
3 1
4 3
5 1
Name: index, dtype: float64
Notice the NaN value: the diff() method subtracts the previous row from the current row, and since the first row has no previous row it yields NaN. If some number n of initial rows had start > end, you would only need to look at the first value of the index column to recover that leading delta.
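For instance, in this example (my illustration, not part of the original answer):
# the first surviving "index" value tells you how many leading rows were dropped
first_gap = df['index'].iloc[0]  # 0 here, since row 0 already had A < B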
The fully augmented data frame would then look like:
index A B delta
0 0 5 9 NaN
1 1 6 11 1
2 4 15 17 3
3 5 16 17 1
4 8 20 24 3
5 9 22 26 1
If you wish to delete any of the extraneous columns, you can use the del statement on each column by name, like so:
del df['index']
del df['delta']
I want to know if the pandas applymap function always goes through the frame from top to bottom and left to right (iterating through each row on a per-column basis).
Mainly, I'm using applymap with a dictionary to count the number of items in the list in each cell, BUT I have to handle a value differently once it has been seen for the first time. So if applymap always works consistently, I can use it, but if there is some weird potential for race conditions, then I can't.
import numpy as np
import pandas as pd
vals = np.arange(25).reshape([5,5])
df = pd.DataFrame(vals)
print(df)
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
l = []
_ = df.applymap(lambda x: l.append(x))
print(l)
[ 0, 5, 10, 15, 20,
1, 6, 11, 16, 21,
2, 7, 12, 17, 22,
3, 8, 13, 18, 23,
4, 9, 14, 19, 24]
I believe this will always be consistent, as apply also works column-by-column by default.
I found a comment here on Stack Overflow to that effect (emphasis mine):
strictly speaking, applymap internally is implemented via apply with a little wrap-up over the passed function parameter (roughly speaking, replacing func with lambda x: [func(y) for y in x], and applying column-wise)
In the source code, applymap uses apply, which works column-by-column by default.
The order seems consistent, even on a shuffled array:
import numpy as np
import pandas as pd
from itertools import count
df = pd.DataFrame(np.zeros((5,5)))
c = count()
df.sample(frac=1).sample(frac=1, axis=1).applymap(lambda x: next(c))
output:
1 3 2 0 4
0 0 5 10 15 20
4 1 6 11 16 21
3 2 7 12 17 22
1 3 8 13 18 23
2 4 9 14 19 24
Now, I think the real question is, "is this behavior stable or is it just an implementation detail that could change in the future?"
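If you'd rather not rely on that implementation detail at all, a minimal alternative (my sketch, not from the answers) is to make the column-major traversal explicit:
l = []
for col in df.columns:   # left to right
    for x in df[col]:    # top to bottom within each column
        l.append(x)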
I want to select rows 0, 1, 3, and 4, i.e. the rows whose quantities share the same absolute value. Note that we should assume we don't know the values in advance (there could be -25, 25, -2356, 2356, etc.).
test = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'quantity': [20, 30, 40, -30, -20]})
id quantity
0 1 20
1 2 30
2 3 40
3 4 -30
4 5 -20
.....
What is the best way of doing this?
IIUC, you want to keep the rows whose absolute value occurs at least twice. You could use groupby on the absolute value:
test[test.groupby(test['quantity'].abs())['quantity'].transform('size').ge(2)]
If you want to ensure that you have both the negative and positive value, make it a set and check that there are 2 elements (the positive and negative):
test[test.groupby(test['quantity'].abs())['quantity'].transform(lambda g: len(set(g))==2)]
output:
id quantity
0 1 20
1 2 30
3 4 -30
4 5 -20
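Note that the two filters differ on same-sign duplicates: a value that appears twice with the same sign passes the size check but fails the set check. A quick illustration (my example, not from the answer):
dup = pd.DataFrame({'id': [1, 2], 'quantity': [20, 20]})
dup[dup.groupby(dup['quantity'].abs())['quantity'].transform('size').ge(2)]  # keeps both rows
dup[dup.groupby(dup['quantity'].abs())['quantity'].transform(lambda g: len(set(g)) == 2)]  # keeps neither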
We are trying to select values from a matrix into pairs, picking the values diagonally, but my code doesn't work as it should.
You can see the sequence in the example below. The values are selected sequentially in a cross form: it starts at the first value of the penultimate row and pairs it with the second value of the last row. It then moves one row up and continues in the same way.
In the 1st example the principle is that it takes the cross values 21 -> 32, then 11 -> 22, 11 -> 33, 22 -> 33, 12 -> 23, and so on for the whole matrix. The same goes for the second example.
code:
import numpy as np
a = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])
w, h = a.shape

for y0 in range(1, h):
    y = h - y0 - 1
    for x in range(h - y - 1):
        print(a[y+x, x], a[y+x+1, x+1])

for x in range(1, w-1):
    for y in range(w-x-1):
        print(a[y, x+y], a[y+1, x+y+1])
my output:
21 32
11 22
22 33
12 23
required output
21 32
11 22
11 33
22 33
12 23
However, if I use this matrix, for example, it will throw me an error.
a = np.array([[11, 12, 13, 14, 15, 16],
              [21, 22, 23, 24, 25, 26],
              [31, 32, 33, 34, 35, 36]])
required output
21 32
11 22
11 33
22 33
12 23
12 34
23 34
13 24
13 35
24 35
14 25
14 36
25 36
15 26
my output
error
File "C:\Users\Pifkoooo\dp\skuska.py", line 24, in <module>
print( a[y+x,x], a[y+x+1,x+1] )
IndexError: index 2 is out of bounds for axis 0 with size 2
Can anyone advise me how to solve this problem and generalize it to work on all matrices with different shapes? Or if there is another way to approach this task?
Let's look for patterns (like here, but simpler)! Say you have an array of shape (M, N), with M=4 and N=5. First, note the linear indices of the elements:
i =
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
Once you have identified the first element in a pair, the linear index of the next element is just i + N + 1.
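For example, with N = 5 the element at i = 6 (row 1, column 1) pairs with i + 6 = 12 (row 2, column 2), its lower-right neighbour.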
Now let's try to establish the path of the first element using the example in the linked question. First, look at the column indices and the row indices:
x =
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
y =
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
Now take the difference, and add a factor to account for the shape:
x - y + 2M - N =
3 4 5 6 7
2 3 4 5 6
1 2 3 4 5
0 1 2 3 4
The first element follows the order of the diagonals except at the bottom row and rightmost column. If you stably argsort this array (np.argsort supports kind='stable', which uses timsort) and then apply that index to the linear indices, you have the path taken by the first element of every pair for any matrix at all. The first observation will then yield the second element.
So it all boils down to this:
M, N = a.shape
path = (np.arange(N - 1) - np.arange(M - 1)[:, None] + 2 * M - N).argsort(None, kind='stable')
indices = np.arange(M * N).reshape(M, N)[:-1, :-1].ravel()[path]
Now you have a couple of different options going forward:
Apply linear indices to the raveled a:
result = a.ravel()[indices[:, None] + [0, N + 1]]
Preserve the shape of a and use np.unravel_index to transform indices and indices + N + 1 into a 2D index:
result = a[np.unravel_index(indices[:, None] + [0, N + 1], a.shape)]
Moral of the story: this is all black magic!
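For reference, here is the whole recipe gathered into one runnable snippet (my consolidation of the pieces above; note it produces the sliding-pair variant this answer describes, not the anchored pattern from the question's required output):
import numpy as np

a = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])
M, N = a.shape
# path of the first element of each pair, ordered by diagonal
path = (np.arange(N - 1) - np.arange(M - 1)[:, None] + 2 * M - N).argsort(None, kind='stable')
indices = np.arange(M * N).reshape(M, N)[:-1, :-1].ravel()[path]
# each first element pairs with its lower-right neighbour (linear offset N + 1)
print(a.ravel()[indices[:, None] + [0, N + 1]])
# [[21 32]
#  [11 22]
#  [22 33]
#  [12 23]]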
Probably not the best performance, but it gets the job done if order does not matter: iterate over all elements and try to access each of their diagonal partners. If a diagonal partner does not exist, catch the raised IndexError and continue with the next element.
def print_diagonal_pairs(a):
    rows, cols = a.shape
    for row in range(rows):
        for col in range(cols):
            max_shift_amount = min(rows, cols) - min(row, col)
            for shift_amount in range(1, max_shift_amount+1):
                try:
                    print(a[row, col], a[row+shift_amount, col+shift_amount])
                except IndexError:
                    continue
a = np.array([
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
])
print_diagonal_pairs(a)
# Output:
11 22
11 33
12 23
21 32
22 33
b = np.array([
    [11, 12, 13, 14, 15, 16],
    [21, 22, 23, 24, 25, 26],
    [31, 32, 33, 34, 35, 36],
])
print_diagonal_pairs(b)
# Output:
11 22
11 33
12 23
12 34
13 24
13 35
14 25
14 36
15 26
21 32
22 33
23 34
24 35
25 36
Not a full solution, but I think you can use fancy indexing for this task. In the code snippet below I am selecting the indices x = [[0,1], [0,2], [1,2]] along the first axis. These indices will be broadcast against the indices in y along the first dimension.
from itertools import combinations
a = np.array([[11, 12, 13, 14, 15, 16],
              [21, 22, 23, 24, 25, 26],
              [31, 32, 33, 34, 35, 36]])
x = np.array(list(combinations(range(a.shape[0]), 2)))
y = x + np.arange(a.shape[1]-2)[:,None,None]
a[x,y].reshape(-1,2)
Output:
array([[11, 22],
[11, 33],
[22, 33],
[12, 23],
[12, 34],
[23, 34],
[13, 24],
[13, 35],
[24, 35],
[14, 25],
[14, 36],
[25, 36]])
This will select all correct values except for the start and end values for the second example. There is probably a smart way to include these edge values and select all values in one sweep, but I cannot think of a solution for this atm.
I thought the pattern was to select combinations of size 2 along each diagonal, but apparently not - so this solution will not give the correct "middle" values in your first example.
EDIT
You could extend the selection range and modify the two edge values:
x = np.array(list(combinations(range(a.shape[0]), 2)))
y = x + np.arange(-1,a.shape[1]-1)[:,None,None]
# assign edge values
y[0] = y[1][0]
y[-1] = y[-2][-1]
a[x,y].reshape(-1,2)[2:-2]
Output:
array([[21, 32],
[11, 22],
[11, 33],
[22, 33],
[12, 23],
[12, 34],
[23, 34],
[13, 24],
[13, 35],
[24, 35],
[14, 25],
[14, 36],
[25, 36],
[15, 26]])
My original answer was for the case in the original question where the pairs slid along the diagonals rather than spreading across them with the first point staying anchored. While the solution is not exactly the same, the concept of computing indices in a vectorized manner applies here too.
Start with the matrix of column minus row, which gives the diagonals as before:
diag = np.arange(1, N) - np.arange(1, M)[:, None] + 2 * M - N
This shows that the second element is given by
second = a[1:, 1:].ravel()[diag.argsort(None, kind='stable')]
The heads of the diagonals are the first column in reverse and the first row. If you index them correctly, you get the first element of each pair:
head = np.r_[a[::-1, 0], a[0, 1:]]
first = head[np.sort(diag, axis=None)]
Now you can just concatenate the result:
result = np.stack((first, second), axis=-1)
See: black magic! And totally vectorized.
Hi, I have a dataframe df containing a set of events (rows).
df = pd.DataFrame(data=[[1, 2, 7, 10],
                        [10, 22, 1, 30],
                        [30, 42, 2, 10],
                        [100, 142, 22, 1],
                        [143, 152, 2, 10],
                        [160, 162, 12, 11]],
                  columns=['Start', 'End', 'Value1', 'Value2'])
df
Out[15]:
Start End Value1 Value2
0 1 2 7 10
1 10 22 1 30
2 30 42 2 10
3 100 142 22 1
4 143 152 2 10
5 160 162 12 11
If 2 (or more) consecutive events are <= 10 apart, I would like to merge them (i.e. use the start of the first event, the end of the last, and sum the values in Value1 and Value2).
In the example above df becomes:
df
Out[15]:
Start End Value1 Value2
0 1 42 10 50
1 100 162 36 22
That's totally possible:
df.groupby(((df.Start - df.End.shift(1)) > 10).cumsum()).agg({'Start':min, 'End':max, 'Value1':sum, 'Value2': sum})
Explanation:
start_end_differences = df.Start - df.End.shift(1) #shift moves the series down
threshold_selector = start_end_differences > 10 # boolean series where True marks a point where the difference is more than 10
groups = threshold_selector.cumsum() # sums up the Trues (1) to create an integer group-id series starting at 0
df.groupby(groups).agg({'Start':min}) # the aggregation is self-explanatory
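For the example frame, the intermediate values work out like this (my illustration):
# start_end_differences: [NaN, 8, 8, 58, 1, 8]
# threshold_selector:    [False, False, False, True, False, False]
# groups:                [0, 0, 0, 1, 1, 1]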
Here is a generalized solution that remains agnostic of the other columns:
cols = df.columns.difference(['Start', 'End'])
grps = df.Start.sub(df.End.shift()).gt(10).cumsum()
gpby = df.groupby(grps)
gpby.agg(dict(Start='min', End='max')).join(gpby[cols].sum())
Start End Value1 Value2
0 1 42 10 50
1 100 162 36 22
I have a list/array that looks something like this:
[ 0 1 2 3 4 5 6 7 3 9 10 11 13 13 14 15 16 17 18 19 4 16 22 5 3
2 10 17 34 5 11 18 27 14 11 15 29 2 11 10 19 32 8 27 1 32 6 2 0]
This list is supposed to be monotonic (strictly increasing).
It is not, but you can see that it is mostly increasing.
The values that do not fit into this pattern can be considered noise,
and I want them removed.
So I want to extract the largest possible subset of this list which will
be a strictly increasing sequence of numbers.
There are many possible monotonic sequences here,
but the point is to find the largest possible one.
It is important that I get the indices of the values to be removed,
as I need to know the exact position of the remaining numbers
(so instead of removing numbers we can replace them with
f.ex. None, nan, or -1).
I cannot change the order of any number,
just remove the ones that do not fit in.
The remaining list has to be strictly increasing,
so if we have f.ex. [11 13 13 14], both of the 13s have to be removed.
If there are several possible solutions that are equally large,
we cannot use any of them and must choose a solution with 1 number less.
F.ex. in [27 29 30 34 32] we have to throw away both 34 and 32,
because we cannot choose one over the other.
If we have [27 29 34 15 32] there is no possible solution,
because we cannot choose between [27 29], [27 34], [29 34], or [15 32].
The best possible solution to the list presented above would be this:
[ 0 1 2 3 4 5 6 7 -1 9 10 11 -1 -1 14 15 16 17 18 19 -1 -1 22 -1 -1
-1 -1 -1 -1 -1 -1 -1 27 -1 -1 -1 29 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
Can anyone think of an algorithm that would do this specifc job?
If you can bring me a part on the way that would also be appreciated.
My only idea so far is a loop for n in range(N, 0, -1),
where N is the size of the list.
The loop would first try to find solutions of size n=N,
and then for n=N-1, n=N-2, etc.
When it finds exactly 1 solution for a specific n it stops and
returns that solution. I'm not sure what should be inside the loop yet.
UPDATE:
Another SO question provides a Python algorithm for finding the longest
subsequence of a list. This is almost what I want to do, but not quite.
I have copied that function (see below) and added a little extra code at the end which
changes the output if fullsize=True.
Then the original sequence with its original shape is rebuilt,
but the numbers which are not part of the increasing sequence are replaced
by nans. And then I check if any number occurs more than once,
and if so, replace all occurrences of that number with nans.
The original algorithm must still be changed since it does not provide
unique solutions.
For example:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32,
18, 19, 20, 16, 35, 35, 33, 32, 1, 35, 13, 5, 32, 8, 35, 29, 19,
35, 19, 28, 32, 18, 31, 13, 3, 32, 33, 35, 31, 0, 21]
print subsequence(a)
gives
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 32. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan]
Instead of ending with .. 16 32 nan .. it should have ended with
... 16 nan ... nan 31 nan nan 32 33 35 nan nan nan],
as far as I can see.
Simpler example:
a = [0,1,2,3,4,1,2,3,4,5]
print subsequence(a)
gives
[ 0. 1. 2. 3. nan nan nan nan nan 5.]
but it should only have given
[0 nan ... nan 5]
because 1 2 3 4 appears two times and is not unique.
Here comes the current semi-working version of the code
(which was used for my example runs):
import numpy as np

def subsequence(seq, fullsize=True):
    """
    Credit:
    http://stackoverflow.com/questions/3992697/longest-increasing-subsequence
    """
    M = [None] * len(seq)  # offset by 1 (j -> j-1)
    P = [None] * len(seq)

    # Since we have at least one element in our list, we can start by
    # knowing that there's at least an increasing subsequence of length one:
    # the first element.
    L = 1
    M[0] = 0

    # Looping over the sequence starting from the second element
    for i in range(1, len(seq)):
        # Binary search: we want the largest j <= L
        # such that seq[M[j]] < seq[i] (default j = 0),
        # hence we want the lower bound at the end of the search process.
        lower = 0
        upper = L

        # Since the binary search will not look at the upper bound value,
        # we'll have to check that manually
        if seq[M[upper-1]] < seq[i]:
            j = upper
        else:
            # actual binary search loop
            while upper - lower > 1:
                mid = (upper + lower) // 2
                if seq[M[mid-1]] < seq[i]:
                    lower = mid
                else:
                    upper = mid
            j = lower  # this will also set the default value to 0

        P[i] = M[j-1]
        if j == L or seq[i] < seq[M[j]]:
            M[j] = i
            L = max(L, j+1)

    # Building the result: [seq[M[L-1]], seq[P[M[L-1]]], seq[P[P[M[L-1]]]], ...]
    result = []
    pos = M[L-1]
    for _ in range(L):
        result.append(seq[pos])
        pos = P[pos]
    result = np.array(result[::-1])  # reversing

    if not fullsize:
        return result  # Original return from other SO question.

    # This was written by me, PaulMag:
    # Rebuild original sequence
    subseq = np.zeros(len(seq)) * np.nan
    for a in result:
        for i, b in enumerate(seq):
            if a == b:
                subseq[i] = a
            elif b > a:
                break
        if np.sum(subseq[np.where(subseq == a)].size) > 1:  # Remove duplicates.
            subseq[np.where(subseq == a)] = np.nan

    return subseq  # Alternative return made by me, PaulMag.
It's a classical dynamic programming problem.
You store for every element the length of the longest increasing sequence that ends at that element.
For the first element the value is 1 (just take that element). For the rest you take max(1, 1 + the value assigned to some previous element that is less than your current element).
You can implement it with 2 loops (O(N^2)). There are probably some optimizations you can do if your data is really large, or, since your sequence is mostly good, you could only check the previous X elements.
To fix your data, start with one of the elements assigned the maximum value (that value is the length of the longest increasing sequence). Replace everything after it with -1, then go backward through the list looking for the previous element of the sequence: it should be smaller than the current one, and its assigned value should be one less than the current element's. While you don't find such a match, the elements you pass don't belong and get -1. When you find a match, take it as the current element and continue backwards until you reach an element assigned the value 1 (that's the first one).
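Here is a minimal sketch of that procedure (my code, not the answerer's; it reconstructs one longest strictly increasing run and marks everything else -1, without enforcing the uniqueness rule from the question):
def filter_increasing(seq):
    # lengths[i] = length of the longest strictly increasing run ending at i
    n = len(seq)
    lengths = [1] * n
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i]:
                lengths[i] = max(lengths[i], lengths[j] + 1)
    # Walk backwards, keeping only elements that chain together into one
    # longest run; everything else becomes -1.
    out = [-1] * n
    need = max(lengths)
    bound = float('inf')
    for i in range(n - 1, -1, -1):
        if lengths[i] == need and seq[i] < bound:
            out[i] = seq[i]
            bound = seq[i]
            need -= 1
    return out

print(filter_increasing([0, 1, 2, 3, 4, 1, 2, 3, 4, 5]))
# [0, -1, -1, -1, -1, 1, 2, 3, 4, 5]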