Longest increasing unique subsequence - Python

I have a list/array that looks something like this:
[ 0 1 2 3 4 5 6 7 3 9 10 11 13 13 14 15 16 17 18 19 4 16 22 5 3
2 10 17 34 5 11 18 27 14 11 15 29 2 11 10 19 32 8 27 1 32 6 2 0]
This list is supposed to be monotonic (strictly increasing).
It is not, but you can see that it is mostly increasing.
The values that do not fit this pattern can be considered noise,
and I want them removed.
So I want to extract the largest possible subset of this list which will
be a strictly increasing sequence of numbers.
There are many possible monotonic sequences here,
but the point is to find the largest possible one.
It is important that I get the indices of the values to be removed,
as I need to know the exact position of the remaining numbers
(so instead of removing numbers we can replace them with,
for example, None, nan, or -1).
I cannot change the order of any number;
I can only remove the ones that do not fit in.
The remaining list has to be strictly increasing,
so if we have, for example, [11 13 13 14], both of the 13s have to be removed.
If there are several possible solutions that are equally large,
we cannot use any of them and must choose a solution with one number less.
For example, in [27 29 30 34 32] we have to throw away both 34 and 32,
because we cannot choose one over the other.
If we have [27 29 34 15 32] there is no possible solution,
because we cannot choose between [27 29], [27 34], [29 34], or [15 32].
The best possible solution to the list presented above would be this:
[ 0 1 2 3 4 5 6 7 -1 9 10 11 -1 -1 14 15 16 17 18 19 -1 -1 22 -1 -1
-1 -1 -1 -1 -1 -1 -1 27 -1 -1 -1 29 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
Can anyone think of an algorithm that would do this specific job?
Even a partial solution would be appreciated.
My only idea so far is a loop for n in range(N, 0, -1):
where N is the size of the list.
The loop would first try to find solutions of size n=N,
and then for n=N-1, n=N-2, etc.
When it finds exactly one solution for a specific n, it stops and
returns that solution. I'm not sure what should be inside the loop yet.
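A building block for that loop (a sketch of mine, not a full solution to the uniqueness rules above) could be the standard O(N^2) dynamic program for the longest increasing subsequence, extended to count how many distinct index-subsequences reach the maximum length; a count above 1 means that length is ambiguous:

def count_longest_increasing(seq):
    """Return (length, count): the length of the longest strictly increasing
    subsequence of seq, and how many distinct index-subsequences reach it."""
    n = len(seq)
    best = [1] * n  # best[i]: LIS length ending at seq[i]
    ways = [1] * n  # ways[i]: number of such subsequences ending at seq[i]
    for i in range(n):
        for k in range(i):
            if seq[k] < seq[i]:
                if best[k] + 1 > best[i]:
                    best[i] = best[k] + 1
                    ways[i] = ways[k]
                elif best[k] + 1 == best[i]:
                    ways[i] += ways[k]
    length = max(best)
    return length, sum(w for b, w in zip(best, ways) if b == length)

For example, count_longest_increasing([0, 1, 2, 3, 4, 1, 2, 3, 4, 5]) returns (6, 5), so a solution of length 6 would be rejected as ambiguous.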
UPDATE:
Another SO question provides a Python algorithm for finding the longest
subsequence of a list. This is almost what I want to do, but not quite.
I have copied that function (see below) and added a little extra code at the end
which changes the output if fullsize=True.
Then the original sequence with its original shape is rebuilt,
but the numbers which are not part of the increasing sequence are replaced
by nans. And then I check if any number occurs more than once,
and if so, replace all occurrences of that number with nans.
The original algorithm must still be changed since it does not provide
unique solutions.
For example:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32,
     18, 19, 20, 16, 35, 35, 33, 32, 1, 35, 13, 5, 32, 8, 35, 29, 19,
     35, 19, 28, 32, 18, 31, 13, 3, 32, 33, 35, 31, 0, 21]
print(subsequence(a))
gives
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 32. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan]
Instead of ending with .. 16 32 nan .. it should have ended with
... 16 nan ... nan 31 nan nan 32 33 35 nan nan nan],
as far as I can see.
Simpler example:
a = [0,1,2,3,4,1,2,3,4,5]
print(subsequence(a))
gives
[ 0. 1. 2. 3. nan nan nan nan nan 5.]
but it should only have given
[0 nan ... nan 5]
because 1 2 3 4 appears two times and is not unique.
Here comes the current semi-working version of the code
(which was used for my example runs):
import numpy as np

def subsequence(seq, fullsize=True):
    """
    Credit:
    http://stackoverflow.com/questions/3992697/longest-increasing-subsequence
    """
    M = [None] * len(seq)  # offset by 1 (j -> j-1)
    P = [None] * len(seq)

    # Since we have at least one element in our list, we can start by
    # knowing that there's at least an increasing subsequence of length one:
    # the first element.
    L = 1
    M[0] = 0

    # Looping over the sequence starting from the second element
    for i in range(1, len(seq)):
        # Binary search: we want the largest j <= L
        # such that seq[M[j]] < seq[i] (default j = 0),
        # hence we want the lower bound at the end of the search process.
        lower = 0
        upper = L

        # Since the binary search will not look at the upper bound value,
        # we'll have to check that manually
        if seq[M[upper-1]] < seq[i]:
            j = upper
        else:
            # actual binary search loop
            while upper - lower > 1:
                mid = (upper + lower) // 2
                if seq[M[mid-1]] < seq[i]:
                    lower = mid
                else:
                    upper = mid
            j = lower  # this will also set the default value to 0

        P[i] = M[j-1]
        if j == L or seq[i] < seq[M[j]]:
            M[j] = i
            L = max(L, j+1)

    # Building the result: [seq[M[L-1]], seq[P[M[L-1]]], seq[P[P[M[L-1]]]], ...]
    result = []
    pos = M[L-1]
    for _ in range(L):
        result.append(seq[pos])
        pos = P[pos]
    result = np.array(result[::-1])  # reversing

    if not fullsize:
        return result  # Original return from the other SO question.

    # This was written by me, PaulMag:
    # Rebuild the original sequence; numbers that are not part of the
    # increasing subsequence are replaced with nans.
    subseq = np.zeros(len(seq)) * np.nan
    for a in result:
        for i, b in enumerate(seq):
            if a == b:
                subseq[i] = a
            elif b > a:
                break
        if np.sum(subseq[np.where(subseq == a)].size) > 1:  # Remove duplicates.
            subseq[np.where(subseq == a)] = np.nan

    return subseq  # Alternative return made by me, PaulMag.

It's a classical dynamic programming problem.
You store for every element the length of the largest sequence that ends at that element.
For the first element the value is 1 (just take that element). For the rest, take max(1, 1 + the value assigned to some previous element that is less than the current element).
You can implement this with two loops (O(N^2)). If your data is really large there are optimizations you can make; for instance, knowing your sequence is mostly good, you could check only the previous X elements.
To fix your data, start at one of the elements assigned the maximum value (that value is the length of the longest increasing subsequence) and replace everything after it with -1. Then walk backward through the list looking for the predecessor in the sequence: it should be less than the current element, and its assigned value should be one less than the current element's assigned value. While you don't find a match, the elements you pass don't belong. When you find a match, take it as the current element and continue backwards until you reach an element assigned the value 1 (that's the first one).
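A minimal sketch of that procedure (my own illustration; it recovers one longest strictly increasing subsequence and marks every other position as None, but it does not enforce the uniqueness rules from the question):

def lis_with_positions(seq):
    # best[i]: length of the longest strictly increasing subsequence
    # ending at seq[i]; prev[i] links back to its predecessor.
    n = len(seq)
    best = [1] * n
    prev = [None] * n
    for i in range(n):
        for k in range(i):
            if seq[k] < seq[i] and best[k] + 1 > best[i]:
                best[i] = best[k] + 1
                prev[i] = k

    # Walk the predecessor links backward from the end of the longest chain.
    out = [None] * n  # None marks the "noise" positions
    i = max(range(n), key=best.__getitem__)
    while i is not None:
        out[i] = seq[i]
        i = prev[i]
    return out

print(lis_with_positions([0, 1, 2, 3, 4, 1, 2, 3, 4, 5]))
# [0, 1, 2, 3, 4, None, None, None, None, 5]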

Related

How to select pairs of values in an array according to a given sequence for all matrix shapes?

We are trying to select values from matrices into pairs, following a procedure where the values are selected diagonally. My code doesn't work as it should.
You can see the sequence in the example below. The values are selected sequentially in a cross form: it starts at the first value of the penultimate row and pairs it with the second value of the last row. It then moves one row up and continues in the same way.
In the first example, the principle is that it takes cross values: 21 -> 32, then 11 -> 22, 11 -> 33, 22 -> 33, 12 -> 23, and so on for all matrices. The same goes for the second example.
code:
import numpy as np

a = np.array([[11,12,13],
              [21,22,23],
              [31,32,33]])

w, h = a.shape
for y0 in range(1, h):
    y = h - y0 - 1
    for x in range(h - y - 1):
        print( a[y+x,x], a[y+x+1,x+1] )
for x in range(1, w-1):
    for y in range(w - x - 1):
        print( a[y,x+y], a[y+1,x+y+1] )
My output:
21 32
11 22
22 33
12 23
required output
21 32
11 22
11 33
22 33
12 23
However, if I use this matrix, for example, it will throw me an error.
a = np.array([[11,12,13,14,15,16],
              [21,22,23,24,25,26],
              [31,32,33,34,35,36]])
required output
21 32
11 22
11 33
22 33
12 23
12 34
23 34
13 24
13 35
24 35
14 25
14 36
25 36
15 26
My output is an error:
File "C:\Users\Pifkoooo\dp\skuska.py", line 24, in <module>
print( a[y+x,x], a[y+x+1,x+1] )
IndexError: index 2 is out of bounds for axis 0 with size 2
Can anyone advise me how to solve this problem and generalize it to work on all matrices with different shapes? Or if there is another way to approach this task?
Let's look for patterns (like here, but simpler)! Say you have an array of shape (M, N), with M=4 and N=5. Start by noting the linear indices of the elements:
i =
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
Once you have identified the first element in a pair, the linear index of the next element is just i + N + 1.
Now let's try to establish the path of the first element using the example in the linked question. First, look at the column indices and the row indices:
x =
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
y =
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
Now take the difference, and add a factor to account for the shape:
x - y + 2M - N =
3 4 5 6 7
2 3 4 5 6
1 2 3 4 5
0 1 2 3 4
The first element follows the index of the diagonals, except at the bottom row and rightmost column. If you stably argsort this array (np.argsort supports kind='stable', which uses timsort) and apply that index to the linear indices, you have the path taken by the first element of every pair for any matrix at all. The first observation will then yield the second element.
So it all boils down to this:
M, N = a.shape
path = (np.arange(N - 1) - np.arange(M - 1)[:, None] + 2 * M - N).argsort(None)
indices = np.arange(M * N).reshape(M, N)[:-1, :-1].ravel()[path]
Now you have a couple of different options going forward:
Apply linear indices to the raveled a:
result = a.ravel()[indices[:, None] + [0, N + 1]]
Preserve the shape of a and use np.unravel_index to transform indices and indices + N + 1 into a 2D index:
result = a[np.unravel_index(indices[:, None] + [0, N + 1], a.shape)]
Moral of the story: this is all black magic!
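Putting those pieces together into one runnable snippet (my own assembly of this answer's fragments, using the 3x3 matrix from the question; note that, as explained at the end of this answer, this computes pairs that slide along the diagonals rather than the anchored spread pairs the question asks for):

import numpy as np

a = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])

M, N = a.shape
# Diagonal labels for every candidate first element (all but the last row/column).
path = (np.arange(N - 1) - np.arange(M - 1)[:, None] + 2 * M - N).argsort(None, kind='stable')
# Linear indices of the first elements, visited diagonal by diagonal.
indices = np.arange(M * N).reshape(M, N)[:-1, :-1].ravel()[path]
# Each pair is (first, first + N + 1): one step down-right in the raveled array.
print(a.ravel()[indices[:, None] + [0, N + 1]])
# [[21 32]
#  [11 22]
#  [22 33]
#  [12 23]]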
Probably not the best performance, but it gets the job done if order does not matter. Iterate over all elements and try to access each of its diagonal partners. If a diagonal partner does not exist, catch the raised IndexError and continue with the next element.
def print_diagonal_pairs(a):
    rows, cols = a.shape
    for row in range(rows):
        for col in range(cols):
            max_shift_amount = min(rows, cols) - min(row, col)
            for shift_amount in range(1, max_shift_amount+1):
                try:
                    print(a[row, col], a[row+shift_amount, col+shift_amount])
                except IndexError:
                    continue

a = np.array([
    [11,12,13],
    [21,22,23],
    [31,32,33],
])
print_diagonal_pairs(a)
# Output:
11 22
11 33
12 23
21 32
22 33
b = np.array([
[11,12,13,14,15,16],
[21,22,23,24,25,26],
[31,32,33,34,35,36]
])
print_diagonal_pairs(b)
# Output:
11 22
11 33
12 23
12 34
13 24
13 35
14 25
14 36
15 26
21 32
22 33
23 34
24 35
25 36
Not a solution, but I think you can use fancy indexing for this task. In the code snippet below I am selecting the indices x = [[0,1], [0,2], [1,2]] along the first axis. These indices will be broadcast against the indices in y along the first dimension.
from itertools import combinations

a = np.array([[11,12,13,14,15,16],
              [21,22,23,24,25,26],
              [31,32,33,34,35,36]])

x = np.array(list(combinations(range(a.shape[0]), 2)))
y = x + np.arange(a.shape[1]-2)[:,None,None]
a[x,y].reshape(-1,2)
Output:
array([[11, 22],
[11, 33],
[22, 33],
[12, 23],
[12, 34],
[23, 34],
[13, 24],
[13, 35],
[24, 35],
[14, 25],
[14, 36],
[25, 36]])
This will select all correct values except for the start and end values in the second example. There is probably a smart way to include these edge values and select all values in one sweep, but I cannot think of a solution for this at the moment.
I thought the pattern was to select combinations of size 2 along each diagonal, but apparently not - so this solution will not give the correct "middle" values in your first example.
EDIT
You could extend the selection range and modify the two edge values:
x = np.array(list(combinations(range(a.shape[0]), 2)))
y = x + np.arange(-1,a.shape[1]-1)[:,None,None]
# assign edge values
y[0] = y[1][0]
y[-1] = y[-2][-1]
a[x,y].reshape(-1,2)[2:-2]
Output:
array([[21, 32],
[11, 22],
[11, 33],
[22, 33],
[12, 23],
[12, 34],
[23, 34],
[13, 24],
[13, 35],
[24, 35],
[14, 25],
[14, 36],
[25, 36],
[15, 26]])
My original answer was for the case in the original question where the pairs slid along the diagonals rather than spreading across them with the first point staying anchored. While the solution is not exactly the same, the concept of computing indices in a vectorized manner applies here too.
Start with the matrix of row minus column which gives diagonals as before:
diag = np.arange(1, N) - np.arange(1, M)[:, None] + 2 * M - N
This shows that the second element is given by
second = a[1:, 1:].ravel()[diag.argsort(None, kind='stable')]
The heads of the diagonals are the first column in reverse and the first row. If you index them correctly, you get the first element of each pair:
head = np.r_[a[::-1, 0], a[0, 1:]]
first = head[np.sort(diag, axis=None)]
Now you can just concatenate the result:
result = np.stack((first, second), axis=-1)
See: black magic! And totally vectorized.

Is there a pandas function to sum a set number of previous row elements in a dataframe?

I'm trying to create a function which can look at previous rows in a DataFrame and sum them, based on a set number of rows to look back over. Here I have used 3, but ideally I would like to scale it up to look back over more rows. My solution works but doesn't seem very efficient. The other criterion is that each time the function hits a new team, the count must start again, so the first row for each new team is always 0. The data will be ordered by team, but a solution that also works when the data isn't in team order would be incredible.
Is there a function in Pandas which could help with this?
So far I've tried the code below and tried googling the issue; the closest example I could find is here, but it groups the index, and I'm unsure how to apply that when the value has to keep resetting each time a new team appears, as it wouldn't distinguish each new team.
np.random.seed(0)
data = {'team': ['a','a','a','a','a','a','a','a','b','b',
                 'b','b','b','b','b','b','c','c','c','c','c','c','c','c'],
        'teamPoints': np.random.randint(0, 4, 24)}
df = pd.DataFrame.from_dict(data)
df.reset_index(inplace=True)

def find_sum_last_3(x):
    if x == 0:
        return 0
    elif x == 1:
        return df['teamPoints'][x-1]
    elif x == 2:
        return df['teamPoints'][x-1] + df['teamPoints'][x-2]
    elif df['team'][x] != df['team'][x-1]:
        return 0
    elif df['team'][x] != df['team'][x-2]:
        return df['teamPoints'][x-1]
    elif df['team'][x] != df['team'][x-3]:
        return df['teamPoints'][x-1] + df['teamPoints'][x-2]
    else:
        return (df['teamPoints'][x-1] + df['teamPoints'][x-2] +
                df['teamPoints'][x-3])

df['team_form_3games'] = df['index'].apply(lambda x: find_sum_last_3(x))
The first part of the function addresses the edge cases where a sum of 3 isn't possible because there are fewer than 3 elements.
The second part of the function addresses the problem of the 'team' changing. When the team changes, the sum needs to start again, so each 'team' is considered separately.
The final part simply looks at the previous 3 elements of the DataFrame and sums them together.
This example works as expected and gives a new column with expected output as follows:
0, 0, 3, 4, 4, 4, 6, 9, 0, 1, 4, 5, 6, 3, 5, 5, 0, 0, 0, 2, 3, 5, 6, 8
1st element is 0 as it is edge case, 2nd is 0 because the sum of the first element is 0. 3rd is 3 as the sum of the 1st and 2nd elements are 3. 4th is the sum of 1st,2nd,3rd. 5th is sum of 2nd,3rd,4th. 6th is sum of 3rd,4th,5th
However, this approach is very inefficient, which makes it difficult to scale up to windows of 10 or 15. It is also inelegant: a new function needs to be written for each different length of sum.
I think you are looking for GroupBy.apply + rolling:
r3=df.groupby('team')['teamPoints'].apply(lambda x: x.rolling(3).sum().shift())
r2=df.groupby('team')['teamPoints'].apply(lambda x: x.rolling(2).sum().shift())
r1=df.groupby('team')['teamPoints'].apply(lambda x: x.shift())
df['team_form_3games'] = r3.fillna(r2.fillna(r1).fillna(0))
print(df)
Output:
index team teamPoints team_form_3games
0 0 a 0 0.0
1 1 a 3 0.0
2 2 a 1 3.0
3 3 a 0 4.0
4 4 a 3 4.0
5 5 a 3 4.0
6 6 a 3 6.0
7 7 a 3 9.0
8 8 b 1 0.0
9 9 b 3 1.0
10 10 b 1 4.0
11 11 b 2 5.0
12 12 b 0 6.0
13 13 b 3 3.0
14 14 b 2 5.0
15 15 b 0 5.0
16 16 c 0 0.0
17 17 c 0 0.0
18 18 c 2 0.0
19 19 c 1 2.0
20 20 c 2 3.0
21 21 c 3 5.0
22 22 c 3 6.0
23 23 c 2 8.0
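To address the scaling concern from the question, here is a generalization of mine (an assumption layered on this answer, not part of it): shift within each group first, then take a rolling sum with min_periods=1 so the short leading windows are handled automatically, for any window length n:

def team_form(df, n):
    # Sum of up to the previous n games, restarting for each team.
    shifted = df.groupby('team')['teamPoints'].shift()
    return (shifted.groupby(df['team'])
                   .rolling(n, min_periods=1).sum()
                   .reset_index(level=0, drop=True)
                   .fillna(0))

df['team_form_3games'] = team_form(df, 3)    # matches the column above
df['team_form_10games'] = team_form(df, 10)  # scales without new code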

How to efficiently subtract values from each column with numpy

I have a 2D array of shape (50,50). I need to subtract a value from each column of this array (skipping the first), which is calculated based on the index of the column. For example, using a for loop it would look something like this:
for idx in range(1, A[0, :].shape[0]):
    A[0, idx] -= idx * (...)  # simple calculations with idx
Now, of course this works fine, but it's very slow and performance is critical for my application. I've tried computing the values to be subtracted using np.fromfunction() and then subtracting them from the original array, but the results differ from those obtained by the iterative for-loop subtraction:
func = lambda i, j: j * (...) #some simple calculations
subtraction_matrix = np.fromfunction(np.vectorize(func), (1,50))
A[0, 1:] -= subtraction_matrix
What am I doing wrong? Or is there some other method that would be better? Any help is appreciated!
All your code snippets indicate that you require the subtraction to happen only in the first row of A (though you've not explicitly mentioned that). So, I'm proceeding with that understanding.
Referring to your use of np.fromfunction(), you can use the subtraction_matrix as below:
A[0,1:] -= subtraction_matrix[1:]
Testing it out (assuming shape (5,5) instead of (50,50)):
import numpy as np
A = np.arange(25).reshape(5,5)
print (A)
func = lambda j: j * 10 #some simple calculations
subtraction_matrix = np.fromfunction(np.vectorize(func), (5,), dtype=A.dtype)
A[0,1:] -= subtraction_matrix[1:]
print (A)
Output:
[[ 0 1 2 3 4] # print(A), before subtraction
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
[[ 0 -9 -18 -27 -36] # print(A), after subtraction
[ 5 6 7 8 9]
[ 10 11 12 13 14]
[ 15 16 17 18 19]
[ 20 21 22 23 24]]
If you want the subtraction to happen in all the rows of A, you just need to use the line A[:,1:] -= subtraction_matrix[1:], instead of the line A[0,1:] -= subtraction_matrix[1:]
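As a side note (my own suggestion, not part of the answer above): when the subtracted value is a simple function of the column index, you can skip np.fromfunction and np.vectorize entirely and build the offsets with plain arange arithmetic, which stays fully vectorized:

import numpy as np

A = np.arange(25).reshape(5, 5)
idx = np.arange(1, A.shape[1])  # column indices 1..N-1
A[0, 1:] -= idx * 10            # same result as the fromfunction version above
print(A)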

Pandas complex filtering

I have a pandas.DataFrame() object like below
start, end
5, 9
6, 11
13, 11
14, 11
15, 17
16, 17
18, 17
19, 17
20, 24
22, 26
"end" has to always be > "start"
So, I need to filter it from when the "end" values becomes < "start" till the next row where they are again are back to normal.
In above example, I need:
1.
13,11
15,17
2.
18,17
20,24
Edit: (updated)
Think of these as timestamps in seconds. So I can find that it took 2 seconds to recover in both scenarios.
I can do this by iterating over the data, but does Pandas have a better way?
You could use pandas' boolean indexing to find the rows where start < end. Then, if you reset the index, the difference between each row's original index and the previous surviving row's original index gives the gap spanning each stretch where start > end.
For example you could do something like the following:
# A = starts, B = ends
df = pd.DataFrame({'B': [9, 11, 11, 11, 17, 17, 17, 17, 24, 26],
                   'A': [5, 6, 13, 14, 15, 16, 18, 19, 20, 22]})

# use boolean indexing
df = df[df['A'] < df['B']].reset_index()

# calculate the difference of each row's "old" index to determine the delta
diffs = df['index'].diff()

# create a column to show deltas
df['delta'] = diffs

print(diffs)
print(df)
The diffs data frame looks like:
0 NaN
1 1
2 3
3 1
4 3
5 1
Name: index, dtype: float64
Notice the NaN value: the diff() method subtracts the previous row from the current row, and the first row has no previous row, so it is marked NaN. If the first n rows had start > end, you would only need to look at the first value of the index column to calculate that delta.
The fully augmented data frame would then look like:
index A B delta
0 0 5 9 NaN
1 1 6 11 1
2 4 15 17 3
3 5 16 17 1
4 8 20 24 3
5 9 22 26 1
If you wish to delete any of the extraneous columns you can use del, like so:
del df['index']
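If you also need the actual rows (each violating row paired with the row where the values recover), one possible sketch building on the same boolean mask (my own addition, using the start/end column names from the question):

import pandas as pd

df = pd.DataFrame({'start': [5, 6, 13, 14, 15, 16, 18, 19, 20, 22],
                   'end':   [9, 11, 11, 11, 17, 17, 17, 17, 24, 26]})

bad = df['start'] >= df['end']
# A bad stretch begins at a bad row whose previous row was good.
starts = df.index[bad & ~bad.shift(fill_value=False)]
for s in starts:
    e = s
    while e < len(df) and bad[e]:  # walk forward to the first good row
        e += 1
    if e < len(df):
        print(df.loc[[s, e]])      # e.g. rows (13, 11) and (15, 17)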

Python random function to select a new item from a list of values

I need to fetch random numbers from a list of values in Python. I tried the random.choice() function, but it sometimes returns the same value consecutively. I want a new random value from the list each time. Is there any function in Python that allows me to do this?
Create a copy of the list, shuffle it, then pop items from that one by one as you need a new random value:
shuffled = origlist[:]
random.shuffle(shuffled)

def produce_random_value():
    return shuffled.pop()
This is guaranteed to not repeat elements. You can, however, run out of numbers to pick, at which point you could copy again and re-shuffle.
To do this continuously, you could make this a generator function:
def produce_randomly_from(items):
    while True:
        shuffled = list(items)
        random.shuffle(shuffled)
        while shuffled:
            yield shuffled.pop()
then use this in a loop or grab a new value with the next() function:
random_items = produce_randomly_from(inputsequence)
# grab one random value from the sequence
random_item = next(random_items)
Here is an example:
>>> random.sample(range(10), 10)
[9, 5, 2, 0, 6, 3, 1, 8, 7, 4]
Just replace the sequence given by range with the one you want to choose from. The second number is how many samples, and should be the length of the input sequence.
If you just want to avoid consecutive random values, you can try this:
import random

def nonrepeating_rand(n):
    '''Generate random numbers in [0, n) such that no two consecutive numbers are equal.'''
    k = random.randrange(n)
    while 1:
        yield k
        k2 = random.randrange(n-1)
        if k2 >= k:  # Skip over the previous number
            k2 += 1
        k = k2
Test:
for i, j in zip(range(25), nonrepeating_rand(3)):
    print(i, j)
prints (for example)
0 1
1 0
2 2
3 0
4 2
5 0
6 2
7 1
8 0
9 1
10 0
11 2
12 0
13 1
14 0
15 2
16 1
17 0
18 2
19 1
20 0
21 2
22 1
23 2
24 0
You can use nonrepeating_rand(len(your_list)) to get random indices for your list.
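For instance, a small usage sketch (my own illustration) mapping the generated indices back to list values:

items = ['red', 'green', 'blue']
gen = nonrepeating_rand(len(items))
picks = [items[next(gen)] for _ in range(6)]
print(picks)  # no two consecutive picks are the same item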
