order a sequence based on the sum of the window values


I know the title is not very clear, so here is a more concrete example.
I have some values, one per minute of the day, for example:
import numpy as np
x = np.arange(1440)
values = np.sin(2*np.pi/1440*2*x - np.pi/2)+1
I want the sums of values per hour (60-minute window), ordered from lowest to highest. I don't want any value to be excluded, except possibly the highest remaining ones. In this case I would get 24 ordered sums (my hours), or rather 23: the window with the minimum sum could be anywhere in the series, so some values at the beginning and at the end will be left in windows shorter than 60 minutes, except in a very particular case.
I don't know whether I should apply a boolean mask in a while loop (which I could manage to do), or whether numpy or some other package already has functions that could help me solve the problem. Thanks.
Also, if possible, I would like to get the sorted non-subsequent sums, continuing until no intervals of the right (window) length remain. In this particular case that means my result will have between 12 and 24 window sums. Thanks again.
So, as a small general example (I have no space to insert the minute data): if my values are [1,2,3,4,5,6,0,0,0,7,8,9] and I need to group them in windows of 3 elements (in the real case this is my 60-minute window), my result will be:
[[0,0,0],[1,2,3],[4,5,6],[7,8,9]].
Or, to be more general, if they are [0,1,2,3,4,5,6,0,0,0,7,9,1,9,6], in the first case I would get [[0,0,0],[1,2,3],[4,5,6],[7,9,1]] (because I take subsequent windows), and in the second case I would get [[0,0,0],[0,1,2],[3,4,5],[1,9,6]] (because I only focus on the minimum sorted sums).

I am not sure whether you want a rolling window or sums over full hours. Here is a first try, which reshapes the values into a 24 x 60 array to get the sums for the full hours.
import numpy as np
x = np.arange(60*24)
values = np.sin(2*np.pi/1440*2*x - np.pi/2)+1
x_values = values.reshape((24, 60))
sums = x_values.sum(axis=-1)
print(np.argsort(sums))
# array([12, 0, 11, 23, 13, 1, 10, 22, 14, 2, 9, 21,
#        15, 3, 8, 20, 16, 4, 7, 19, 17, 5, 18, 6])
# if some values are missing / not yet available you can simply
# fill them with zeros in the reshaped array (before computing the sums)
x_values[3, 16:] = 0 # no data available for 03:16 - 04:00
x_values[3, :] = 0 # or ignore this hour completely
EDIT
Some remarks after the clarifications:
As always, it depends on your use case which of the two options is better for you; maybe you even want to allow some overlap between the hours...
It seems a little strange to me to order everything after the first minimum you find. I would prefer the option where the intervals can be placed arbitrarily over the day, but note that you can easily end up with fewer than 23 valid intervals in this case.
inf = 1e9
m = 3
a = np.array([1,0,1,2,3,4,5,6,0,0,0,7,9,1,9,6])
for _ in range(len(a)//m):
    b = np.convolve(a, v=np.ones(m), mode='full')
    i = np.argmin(b[m-1:-m+1])
    if inf in a[i:i+m]:
        break
    print(i, a[i:i+m])
    a[i:i+m] = inf
# This gives you the intervals with the corresponding starting index
# 8 [0 0 0]
# 0 [1 0 1]
# 3 [2 3 4]
# 13 [1 9 6]
For completeness, here is also the second option (applied to the original array a, not to a copy that has already been overwritten with inf):
# get interval with minimal sum
b = np.convolve(a, v=np.ones(m), mode='full')
i = np.argmin(b[m - 1:-m + 1])
# reshape values and clip boundaries according to found minimum i
# (an explicit end index avoids the empty slice a[...:-0] when nothing needs clipping)
end = len(a) - (len(a) - i) % m
a = a[i % m:end]
a = a.reshape(-1, m)
# order the intervals and print their respective indices in the initial array
i_list = np.argsort(a.sum(axis=-1))
print(a[i_list])
print(i%m + i_list*m)
# [[0 0 0]
#  [1 2 3]
#  [4 5 6]
#  [7 9 1]]
# [ 8 2 5 11]
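The first option (repeatedly taking the window with the smallest sum that does not overlap an earlier pick) could also be wrapped in a function that leaves the input untouched. This is only a sketch: the name greedy_min_windows is made up for illustration, and it assumes the input has at least m elements. Instead of overwriting values with a large number, it marks every overlapping window sum as infinite.

```python
import numpy as np

def greedy_min_windows(values, m):
    """Repeatedly pick the length-m window with the smallest sum that
    does not overlap any window picked earlier (hypothetical helper)."""
    a = np.asarray(values, dtype=float)
    # sums[i] is the sum of a[i:i+m]
    sums = np.convolve(a, np.ones(m), mode="valid")
    picked = []
    while np.isfinite(sums).any():
        i = int(np.argmin(sums))
        picked.append((i, list(values[i:i + m])))
        # invalidate every window that shares an element with a[i:i+m]
        lo = max(i - m + 1, 0)
        sums[lo:i + m] = np.inf
    return picked

result = greedy_min_windows([1, 0, 1, 2, 3, 4, 5, 6, 0, 0, 0, 7, 9, 1, 9, 6], 3)
for start, window in result:
    print(start, window)
```

On the example array this selects the same four windows, in the same order, as the loop above.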

Related

Looking for a better way to handle periodic boundary condition on numpy array or list in python

I have a large dataset (a 2-dimensional matrix) of about 5 to 100 rows and 5000 to 25000 columns. I was told to extract a strip out of each row; the strip length is given. For each row, the strip is filled starting from a random position on the row and moving upwards. If the position goes beyond the length of the row, it picks the entries from the beginning, like a periodic boundary. For example, assume a row has 10 elements,
row = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
the position picked is 8 and the strip length is 4. The strip will then be [9, 10, 1, 2].
I first tried to do the computation with NumPy:
import random
import time
import numpy as np
A = np.ones((5, 8000), order='F')
L = (4,3,3,3,4) # length for each of the 5 strips
starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0,len(row)-1)
        end = start+L[c]
        if end>len(row)-1:
            # the strip runs past the end of the row: wrap around with modulo
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k-start] = row[k%len(row)]
        else:
            sce = row[start:end]
        B.append(sce)
print(time.process_time() - starttime)
I don't have a good way to handle the boundary condition, so I just split it into two cases: one where the whole strip is within the row and one where part of the strip extends beyond the row. This code works and takes about 1.5 seconds to run. I then tried using a list instead:
A = [[1]*8000]*5
starttime = time.process_time()
for i in range(80000):
    B = []
    for c, row in enumerate(A):
        start = random.randint(0,len(row)-1)
        end = start+L[c]
        if end>len(row)-1:
            sce = np.zeros(L[c])
            for k in range(start, end):
                sce[k-start] = row[k%len(row)]
        else:
            sce = row[start:end]
        B.append(sce)
print(time.process_time() - starttime)
This one is about 0.5 seconds faster, which surprised me: I expected NumPy to be faster. Both versions are fine for a small matrix and a small number of iterations, but in the real project I am going to deal with a very large matrix and many more iterations, so I wonder whether there are any suggestions for improving the efficiency. Also, is there a neater and more efficient way to handle the periodic boundary condition?

Considering that you create the array A before starting the timer, you would expect both solutions to be about equally fast, because you are just iterating over the array. I am actually not sure why the pure Python solution is quicker; maybe collection-based iterators (enumerate) are better suited to primitive Python types?
Looking at the example with one row, you want to take a range of elements from the row and wrap around the out-of-bounds indices. For this I would suggest doing:
row = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
start = 8
L = 4
np.take(row, np.arange(start, start+L), mode='wrap')
output:
array([ 9, 10, 1, 2])
This behavior can then be extended to 2 dimensions by specifying the axis keyword. But working with uneven lengths in L does make it a bit trickier, because with non-homogeneous arrays you lose most of the benefits of using numpy. The workaround is to partition L so that equal lengths are grouped together.
If I understand the whole task correctly, you are given some start value and you want to extract each corresponding strip length along the second axis of A.
A = np.arange(5*8000).reshape(5,8000) # using arange makes it easier to verify output
L = (4,3,3,3,4) # length for each of the 5 strips
parts = ((0,4), (1,2,3)) # partition L (too lazy to implement this myself atm)
start = 7998 # arbitrary start position
for part in parts:
    ranges = np.arange(start, start+L[part[0]])
    out = np.take(A[part,:], ranges, axis=-1, mode='wrap')
    print(f'Output for rows {part} with length {L[part[0]]}:\n\n{out}\n')
Output:
Output for rows (0, 4) with length 4:
[[ 7998 7999 0 1]
[39998 39999 32000 32001]]
Output for rows (1, 2, 3) with length 3:
[[15998 15999 8000]
[23998 23999 16000]
[31998 31999 24000]]
Although, it looks like you want a random starting position for each row?
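If each row does need its own random start, one possible sketch is to build the wrapped indices per row with a modulo instead of partitioning L. The function name wrapped_strips and the use of np.random.default_rng are my own choices for illustration, not something from the question; the loop is still per row, but each strip is extracted with a single indexing operation and no special case for the boundary.

```python
import numpy as np

def wrapped_strips(A, lengths, starts):
    """Return one strip per row, wrapping around the end of the row
    (hypothetical helper; lengths and starts hold one entry per row)."""
    n_cols = A.shape[1]
    strips = []
    for row, length, start in zip(A, lengths, starts):
        # indices past the end of the row wrap back to the beginning
        idx = (start + np.arange(length)) % n_cols
        strips.append(row[idx])
    return strips

A = np.arange(5 * 8000).reshape(5, 8000)
L = (4, 3, 3, 3, 4)

# fixed start, to compare with the output above
fixed = wrapped_strips(A, L, [7998] * 5)

# one random start per row
rng = np.random.default_rng(0)
starts = rng.integers(0, A.shape[1], size=A.shape[0])
random_strips = wrapped_strips(A, L, starts)
```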

Sort values of x-axis in order to fill the plot f(x) evenly while computing f(x)

Global task for general understanding: I need to plot a result of my function f(x). Simple task, but there are two problems:
For each value of x, f(x) takes a long time to compute (tens of minutes or even around an hour).
I don't know the estimated shape of f(x), therefore I don't know how many x values I need within the predefined x limits to represent the function correctly.
I want to update the plot of f(x) each time I get a new f(x) value. I don't want to evaluate f(x) sequentially; I want to gradually increase the level of detail, so that every time I look at the plot, I see it over my whole (x_min, x_max) range, slowly updating within this range.
Therefore the question is: I need a function, which provides a list of x in proper order.
I came up with the following algorithm, inspired by binary search:
The list a of x values contains only unique values and is sorted.
def dissort(a):
    step = len(a) - 1
    picked = [False for x in a]
    out = []
    while False in picked and step > 0:
        for k in range(0, len(a), step):
            if not picked[k]:
                out.append(a[k])
                picked[k] = True
        step = step // 2
    return out
# `in` is a reserved word in Python, so the test input is named `data` here
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
expected = [1, 9, 5, 3, 7, 2, 4, 6, 8]
assert dissort(data) == expected
I see some flaws here: picked array may be unnecessary, and picked values are unnecessarily checked every time level of detail increases. For now I'm happy with the performance, but in the future I can use it for much larger lists.
Is there a way to make it more performant? Is there an implementation in some python package already? I couldn't find it.

If your input-size is a power of 2, you could get the same order as with your algorithm like this:
To know where to place the n'th value in your output array, take the binary representation of n, reverse the order of the bits, and use the result as the index in your output array:
Example
n | bin | rev | out-index
0 = 000 -> 000 = 0
1 = 001 -> 100 = 4
2 = 010 -> 010 = 2
3 = 011 -> 110 = 6
4 = 100 -> 001 = 1
5 = 101 -> 101 = 5
6 = 110 -> 011 = 3
7 = 111 -> 111 = 7
So IN: [A,B,C,D,E,F,G,H] -> OUT: [A,E,C,G,B,F,D,H]
This takes O(n) time.
For how to reverse the order of the bits, see "Reverse bits in number".
Optimized ways: https://stackoverflow.com/a/746203/1921273
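A minimal sketch of the bit-reversal placement described above, assuming the input length is a power of two. The function name bit_reversal_order is made up for illustration, and the bits are reversed through a string round-trip for readability rather than with one of the optimized tricks from the link.

```python
def bit_reversal_order(items):
    """Place the n-th item at the index obtained by reversing the bits of n
    (hypothetical helper; requires a power-of-two length)."""
    n = len(items)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1  # number of bits needed for indices 0..n-1
    out = [None] * n
    for i, item in enumerate(items):
        # e.g. with bits == 3: 1 -> '001' -> '100' -> 4
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = item
    return out

print(bit_reversal_order(list("ABCDEFGH")))
# ['A', 'E', 'C', 'G', 'B', 'F', 'D', 'H']
```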

Is a random order of the x values OK?
If so:
import random
xvals = random.sample(range(10),10)
result:
[9, 5, 1, 7, 6, 2, 3, 0, 4, 8]
The numbers will not repeat. Of course the result will differ each time you call it. And you can generate any number of x values:
random.sample(range(20),20)
[10, 5, 0, 18, 13, 14, 19, 3, 1, 2, 17, 6, 4, 11, 15, 7, 9, 8, 12, 16]
I believe this is also O(n).
The only issue with this, possibly, is this: On each iteration of increasing the number of x-values, do you also then not want any repeats from the prior iteration? Or is the above adequate?

Why make this so complicated?
Why not store your x and f(x) values in a dict, and sort the dict keys:
data = {}
# every time you get a new x and f(x) value, store it in the dict:
data[x] = f(x)
# then when you want to plot:
xvals = sorted(data.keys())
yvals = [data[x] for x in xvals]
# Now you have a x values and y values, sorted by x value, so just:
plot(xvals,yvals)
Does something like that work for you?
Also, just as an aside: it's understandable that you want something performant, as a general rule. But your algorithm takes 10 minutes to an hour to converge on each value of f(x), so re-sorting all of the existing results whenever a new value comes in, even with an O(n*ln(n)) sort, is going to be very much faster than the wait for the next value. (Python's sorted can sort 10,000 numbers in less than 2.5 milliseconds. The point is that, compared to a 10-minute algorithm, shaving another 0.5 to 1.0 milliseconds off is not going to make any difference to your overall process.)

median of five points python

My data frame (df) has a list of numbers (total 1773) and I am trying to find a median for five numbers (e.g. the median of 3rd number = median (1st,2nd,3rd,4th,5th))
num
10
20
15
34
...
...
...
def median(a, b, c,d,e):
    I=[a,b,c,d,e]
    return I[2]
num_median = [num[0]]
for i in range(1, len(num)):
    num_median = median(num[i - 1], num[i-2], num[i],num[i+1],num[i+2])
df['num_median']=num_median
IndexError: index 1773 is out of bounds for axis 0 with size 1773
Where did it go wrong, and is there any other method to compute the median?

An example which will help:
a = [0, 1, 2, 3]
print('Length={}'.format(len(a)))
print(a[4]) # IndexError: the last valid index is len(a) - 1 = 3
You are trying the same thing. The index of an element is one lower than its position in the list. Keep in mind that your exception shows you exactly where your problem is.
You need to modify the loop bounds:
for i in range(2, len(num) - 2):
    num_median = median(num[i - 1], num[i-2], num[i],num[i+1],num[i+2])
so that the largest index you access, i + 2, is not too large (starting at 2 likewise keeps i - 2 from going negative and wrapping around to the end of the array). Otherwise, when you are near the end of your array (last valid index = 1772), you will be trying to access an index which doesn't exist (up to 1772 + 2).
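Note also that the median function in the question returns the third argument without sorting, and that num_median is overwritten on every iteration. As for another method, here is a minimal sketch using statistics.median from the standard library. The name rolling_median5 is made up for illustration, and leaving None at the four edge positions that have no full window is my assumption about what is wanted there.

```python
from statistics import median

def rolling_median5(nums):
    """Centered 5-point median; positions without a full window get None
    (hypothetical helper)."""
    half = 2
    out = [None] * len(nums)
    for i in range(half, len(nums) - half):
        # window covers nums[i-2] .. nums[i+2]
        out[i] = median(nums[i - half:i + half + 1])
    return out

print(rolling_median5([10, 20, 15, 34, 5, 8]))
# [None, None, 15, 15, None, None]
```

With a data frame, df['num'].rolling(5, center=True).median() should give the same centered median directly, with NaN at the edges.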

how do you add values from a list separately if one variable has two possible outcomes

This assignment calls another function:
def getPoints(n):
    n = (n-1) % 13 + 1
    if n == 1:
        return [1] + [11]
    if 2 <= n <= 10:
        return [n]
    if 11 <= n <= 13:
        return [10]
So my assignment wants me to add the sum of all possible points from numbers from a list of 52 numbers. This is my code so far.
def getPointTotal(aList):
    points = []
    for i in aList:
        points += getPoints(i)
        total = sum(points)
    return aList, points, total
However the issue is that the integer 1 has two possible point values, 1 or 11. When I do the sum of the points it does everything correctly, but it adds 1 and 11 together whereas I need it to compute the sum if the integer is 1 and if the integer is 11.
So for example:
>>>getPointTotal([1,26, 12]) # 10-13 are worth 10 points( and every 13th number that equates to 10-13 using n % 13.
>>>[21,31] # 21 if the value is 1, 31 if the value is 11.
Another example:
>>>getPointTotal([1,14]) # 14 is just 14 % 13 = 1 so, 1 and 1.
>>>[2, 12, 22] # 1+1=2, 1+11=12, 11+11=22
My output is:
>>>getPointTotal([1,14])
>>>[24] #It's adding all of the numbers 1+1+11+11 = 24.
So my question is: how do I make it add the value 1 separately from the value 11, and vice versa? That way, when the list contains a 1, I would get one total that counts it as 1 and another total that counts it as 11.

You make a mistake in storing all the values returned from getPoints(). You should store only the possible totals for the points returned so far. You can store all those in a set, and update them with the all the possible values returned from getPoints(). A set will automatically remove duplicate scores, such as 1+11 and 11+1. You can change the set to a sorted list at the end. Here is my code:
def getPointTotal(aList):
    totals = {0}
    for i in aList:
        totals = {p + t for p in getPoints(i) for t in totals}
    return sorted(list(totals))
I get these results:
>>> print(getPointTotal([1,26, 12]))
[21, 31]
>>> print(getPointTotal([1,14]))
[2, 12, 22]

Rory Daulton's answer is a good one, and it efficiently gives you the different totals that are possible. I want to offer another approach which is not necessarily better than that one, just a bit different. The benefit to my approach is that you can see the sequence of scores that lead to a given total, not only the totals at the end.
import itertools
def getPointTotal(cards):
    card_scores = [getPoints(card) for card in cards] # will be a list of lists
    # tuple(...) because a list (what sorted returns) cannot be stored in a set
    combos = {tuple(sorted(scores)) for scores in itertools.product(*card_scores)}
    return [(scores, sum(scores)) for scores in combos]
The key piece of this code is the call to itertools.product(*card_scores). This takes the lists you got from getPoints for each of the cards in the input list and gets all the combinations. So product([1, 11], [1, 11], [10]) will give (1, 1, 10), (1, 11, 10), (11, 1, 10), and (11, 11, 10).
This is probably a bit overkill for blackjack scoring where there are not going to be many variations on the scores for a given set of cards. But for a different problem (i.e. a different implementation of the getPoints function), it could be very interesting.

Rounding Numbers that fall within variable number of ranges in Python

I have an input list of numbers:
lst = [3.253, -11.348, 6.576, 2.145, -11.559, 7.733, 5.825]
I am trying to think of a way to replace each number in a list with a given number if it falls into a range. I want to create multiple ranges based on min and max of input list and a input number that will control how many ranges there is.
Example, if i said i want 3 ranges equally divided between min and max.
numRanges = 3
lstMin = min(lst)
lstMax = max(lst)
step = (lstMax - lstMin) / numRanges
range1 = range(lstMin, lstMin + step)
range2 = range(range1 + step)
range3 = range(range2 + step)
Right away here, is there a way to make the number of ranges be driven by the numRanges variable?
Later i want to take the input list and for example if:
for i in lst:
    if i in range1:
        finalLst.append(1) #1 comes from range1 and will be growing if more ranges
    elif i in range2:
        finalLst.append(2) #2 comes from range2 and will be growing if more ranges
    elif i in range3:
        finalLst.append(3) #3 comes from range3 and will be growing if more ranges
The way i see this now it is all "manual" and I am not sure how to make it a little more flexible where i can just specify how many ranges and a list of numbers and let the code do the rest. Thank you for help in advance.
finalLst = [3, 1, 3, 3, 1, 3, 3]

This is easy to do with basic mathematical operations in a list comprehension:
numRanges = 3
lstMin = min(lst)
lstMax = max(lst) + 1e-12 # small value added to avoid floating point rounding issues
step = (lstMax - lstMin) / numRanges
range_numbers = [int((x-lstMin) / step) for x in lst]
This will give an integer for each value in the original list, with 0 indicating that the value falls in the first range, 1 being the second, and so on. It's almost the same as your code, but the numbers start at 0 rather than 1 (you could stick a + 1 in the calculation if you really want 1-indexing).
The small value I've added to lstMax is there for two reasons. The first is to make sure that floating point rounding issues don't make the largest value in the list yield numRanges as its range index rather than numRanges-1 (indicating the numRanges-th range). The other reason is to avoid a division by zero error if the list only contains a single value (possibly repeated multiple times), such that min(lst) and max(lst) return the same thing.

Python has a very nice tool for doing exactly this kind of work, called bisect. Let's say your range list is defined as such:
from bisect import bisect
ranges = [-15, -10, -5, 5, 10, 15]
For your input list, you simply call bisect, like so:
lst = [3.253, -11.348, 6.576, 2.145, -11.559, 7.733, 5.825]
results = [ranges[bisect(ranges, element)] for element in lst]
Which results in
>>>[5, -10, 10, 5, -10, 10, 10]
You can then extend this to any arbitrary list of ranges using ranges = range(start,stop,step) in python 2.7 or ranges = list(range(start,stop,step)) in python 3.X
Update
Reread your question, and this is probably closer to what you're looking for (still using bisect):
from numpy import linspace
from bisect import bisect_left
def find_range(numbers, segments):
    mx = max(numbers)
    mn = min(numbers)
    ranges = linspace(mn, mx, segments)
    return [bisect_left(ranges, element)+1 for element in numbers]
>>> find_range(lst, 3)
[3, 2, 3, 3, 1, 3, 3]
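Note that linspace(mn, mx, segments) produces segments points, which is why -11.348 lands in range 2 here rather than in range 1 as in the question's expected finalLst. If equal-width ranges between min and max are what is wanted, a variant along the following lines might be closer. It is only a sketch: the name find_range_equal_width is made up, and it uses only the segments - 1 interior cut points, so the maximum always falls into the last range without any tolerance term.

```python
from bisect import bisect_right

def find_range_equal_width(numbers, segments):
    """Assign each number a 1-based index among `segments` equal-width
    ranges spanning min(numbers)..max(numbers) (hypothetical helper)."""
    mn, mx = min(numbers), max(numbers)
    step = (mx - mn) / segments
    # interior edges only: segments - 1 cut points between mn and mx
    edges = [mn + k * step for k in range(1, segments)]
    return [bisect_right(edges, x) + 1 for x in numbers]

lst = [3.253, -11.348, 6.576, 2.145, -11.559, 7.733, 5.825]
print(find_range_equal_width(lst, 3))
# [3, 1, 3, 3, 1, 3, 3]
```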
