Consecutively split an array by the next max value - python

Suppose I have an array (the elements can be floats also):
D = np.array([0,0,600,160,0,1200,1800,0,1800,900,900,300,1400,1500,320,0,0,250])
The goal is, starting from the beginning of the array, to find the max value (the last one if there are several equal ones), cut off the part of the array up to and including that maximum as the first piece, and then repeat this procedure on the remainder until the end of the array. So the expected result would be:
[[0,0,600,160,0,1200,1800,0,1800],
[900,900,300,1400,1500],
[320],
[0,0,250]]
I managed to find the last max value:
D_rev = D[::-1]
last_max_index = len(D_rev) - np.argmax(D_rev) - 1
i.e. I can get the first subarray of the desired answer. And then I can use a loop to get the rest.
My question is: is there a numpy way to do it without looping?

IIUC, you can take the cumulative maximum of the reversed D (see np.maximum.accumulate) to form groups, then split with itertools.groupby:
D = np.array([0,0,600,160,0,1200,1800,0,1800,900,900,300,1400,1500,320,0,0,250])
groups = np.maximum.accumulate(D[::-1])[::-1]
# array([1800, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 1500, 1500,
# 1500, 1500, 1500, 320, 250, 250, 250])
from itertools import groupby
out = [list(list(zip(*g))[0]) for _, g in groupby(zip(D, groups), lambda x: x[1])]
# [[0, 0, 600, 160, 0, 1200, 1800, 0, 1800],
# [900, 900, 300, 1400, 1500],
# [320],
# [0, 0, 250]]

Related

How to find a distribution function from the max, min and average of a sample

Given that I know the max, min and average of a sample (I don't have access to the sample itself), I would like to write a generic function to generate a sample with the same characteristics. From this answer I gather that this is no simple task, since many distributions can be found with the same characteristics.
max, min, average = [411, 1, 20.98]
I'm trying to use scipy.stats.norm but unsuccessfully. I can't seem to understand whether I can pass the arguments mentioned above or whether they are just values returned from an already generated function. I'm pretty new to Python stats, so this might be something quite easy to solve.
A triangular distribution should perform your desired task, since it takes three parameters (min, mode, max) that match your criteria. You could consider other distributions, such as the standard normal or uniform; however, their parameters either lack or only partially cover the three quantities you mention. If I were in your position, I would consider the triangular distribution, because even the partial exclusion of a single parameter can incur information loss.
import numpy as np
import matplotlib.pyplot as plt
h = plt.hist(np.random.triangular(-3, 0, 8, 100000), bins=200, density=True)
plt.show()
Numpy - Triangular Distribution
As noted here:
There are an infinite number of possible distributions that would be
consistent with those sample quantities.
But you can introduce additional assumptions to find some solutions:
Use only a fixed list of popular distributions
Add constraints on the parameters of a distribution
You can think of this as an optimization problem: find the distribution and its parameters that best fit the specified min/max/avg statistics. In pseudo-code, the solution would look something like this:
candidates = []
for distribution in distributions:
    best_parameters, score = find_best_parameters(distribution, target_statistics)
    candidates.append((distribution, best_parameters, score))
best_distribution = sorted(candidates, key=lambda x: x[2])[0]
Using this procedure you can find that a powerlaw distribution can produce statistics similar to the desired ones:
s = stats.powerlaw(a=5.0909e-2, loc=1.00382, scale=4.122466e+2)
sample = s.rvs(size=100_000)
print(np.max(sample), np.min(sample), np.mean(sample))
Max/Min/Avg:
411.02946481216634 0.994030016 20.943683603008324
Full code:
import numpy as np
from scipy import stats
import cma
from matplotlib import pyplot as plt
distributions_and_bounds = [
    (stats.cauchy, {'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.chi2, {'loc': [0, 1000], 'scale': [0, None]}),
    (stats.expon, {'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.exponpow, {'b': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.gamma, {'a': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.lognorm, {'s': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.norm, {'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.powerlaw, {'a': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.rayleigh, {'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.uniform, {'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.alpha, {'a': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.anglit, {'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.arcsine, {'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.burr, {'c': [0, None], 'd': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.argus, {'chi': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
    (stats.beta, {'a': [0, None], 'b': [0, None], 'loc': [-1000, 1000], 'scale': [0, None]}),
]
target_params = np.array([411, 1, 20.98])
candidates = []
for distribution, bounds in distributions_and_bounds:
    def objective(params):
        sample = distribution(*params).rvs(size=1_000)
        pred_params = np.array([np.max(sample), np.min(sample), np.mean(sample)])
        mse = (np.abs(target_params - pred_params) ** 2).mean()
        return mse

    x0 = np.ones(len(bounds))
    lower_bounds = [bound[0] for bound in bounds.values()]
    upper_bounds = [bound[1] for bound in bounds.values()]
    best_params, es = cma.fmin2(objective, x0, 1, {'bounds': [lower_bounds, upper_bounds]}, restarts=4)
    score = objective(best_params)
    candidates.append((score, distribution, best_params))
best_distribution = list(sorted(candidates, key=lambda x: x[0]))[0]
print(best_distribution)
Here the CMA-ES optimizer from the pycma package was used for simplicity.
Let's try the following function:
import numpy as np
import random
def re_sample(min_v, max_v, mean_v, size):
    """
    Parameters
    ----------
    min_v : Minimum value of the original population
    max_v : Maximum value of the original population
    mean_v : Mean value of the original population
    size : Number of observations we want to generate

    Returns
    -------
    sample : List of simulated values
    """
    s_min_to_mean = int(((max_v - mean_v) / (max_v - min_v)) * size)
    sample_1 = [random.uniform(min_v, mean_v) for i in range(s_min_to_mean)]
    sample_2 = [random.uniform(mean_v, max_v) for i in range(size - s_min_to_mean)]
    sample = sample_1 + sample_2
    sample = random.sample(sample, len(sample))  # shuffle the combined sample
    sample = [round(x, 2) for x in sample]
    return sample
When I test this function as follows:
sample = re_sample(1, 411, 20.98, 200)
print(np.mean(sample))
print(np.min(sample))
print(np.max(sample))
print(type(sample))
print(len(sample))
print(sample)
I get the following output:
>>> 19.8997
>>> 1.0
>>> 307.8
>>> <class 'list'>
>>> 200
>>> [20.55, 7.87, 3.48, 5.23, 18.54, 268.06, 1.71,....
Quick edit with elaboration (I realized this later): you can apply the balancing trick below to ANY distribution.
The pain with many of the proposed solutions is that the chance of hitting EXACT values for MIN, MAX and AVERAGE using floats is basically ZERO.
Knowing this means that the MIN and MAX values need to be added manually, but adding values messes with the generated distribution.
A naive approach would be something like the following, where you generate a distribution, add your MIN and MAX, and balance them out to hit the mean:
1. Set MIN and MAX.
2. Calculate the mean.
3. Add points to compensate for the deviation from the desired mean (depends on how asymmetrically MIN and MAX are placed around the desired mean).
4. Create a random distribution that will still fit between the desired mean and the nearest boundary condition after you shift the mean.
5. Shift the mean of the distribution to your desired TRUE mean.
6. Add the generated symmetrical distribution to the data available before point 4.
The initial 3 steps make sure that the boundary conditions (MIN, MAX) do not mess up your AVERAGE.
Steps 4-5 create some data that is guaranteed to have the exact desired AVERAGE, and will fall between the MIN and MAX.
Step 6 combines the data into the desired result.
import math
import numpy as np
MAX, MIN, AVERAGE = [411, 3, 20.98]
data = [3, 411]
left = AVERAGE - MIN
right = MAX - AVERAGE
ratio = max(left, right)/min(left,right)
n = math.ceil(ratio) - 1
dx = math.ceil(ratio) - ratio # this checks overcompensation due to working with integer numbers
data = data + [MIN]*(n) + [AVERAGE + left*dx] # the second part compensates the overcompensation again :)
print(np.mean(data))
print(min(data))
print(max(data))
N = 1000
width = min(MAX-AVERAGE, AVERAGE-MIN)
print(width)
dist = np.random.normal(AVERAGE, width/3, N)
delta1 = np.mean(dist) - AVERAGE
dist = [x for x in dist if x > (MIN + delta1) and x < (MAX - delta1)]
delta2 = np.mean(dist) - AVERAGE
dist = [x - delta2 for x in dist]
full = data + dist
print(np.mean(full))
print(min(full))
print(max(full))
A probability distribution isn't sufficiently defined by only its min, avg and max values. There are (literally) an unlimited number of probability distributions that meet those conditions.
To demonstrate this point, a probability distribution that gives the minimum value with a probability of (max - avg) / (max - min) and the maximum value with a probability of (avg - min) / (max - min) already satisfies those characteristics.
This can be easily verified:
The minimum and maximum values are trivial.
The average = probability of minimum * minimum + probability of maximum * maximum = { min * (max - avg) + max * (avg - min) } / (max - min) = (- min * avg + max * avg) / (max - min) = (max - min) * avg / (max - min) = avg.
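As a quick sanity check (a minimal sketch, not part of the original answer), you can draw from exactly this two-point distribution and confirm the sample statistics:
import numpy as np
mn, mx, avg = 1, 411, 20.98
p_min = (mx - avg) / (mx - mn)  # probability of drawing the minimum
p_max = (avg - mn) / (mx - mn)  # probability of drawing the maximum
sample = np.random.choice([mn, mx], size=100_000, p=[p_min, p_max])
print(sample.min(), sample.max(), sample.mean())  # ~1 411 ~20.98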
Also, a Normal distribution is symmetrical and not limited in observed values (i.e. it has no minimum or maximum value).

Batch-wise loop over array Python to create new array without overwriting

I want to iterate over a 3d array (sequences) with shape (1134500, 1, 50)
array([[[1000, 1000, 1000, ..., 1005, 1005, 1005]],
[[1000, 1000, 1000, ..., 1004, 1005, 1004]],
[[1000, 1000, 1000, ..., 1004, 1005, 1004]],
...,
[[1000, 1000, 1000, ..., 1005, 1005, 1004]],
[[1000, 1000, 1000, ..., 1005, 1005, 1005]],
[[1000, 1000, 1000, ..., 1004, 1005, 1004]]], dtype=int32)
To do this, I use the following for loop, which works well except that it overwrites the results from the previous batch:
batchsize = 500
for i in range(0, sequences.shape[0], batchsize):
    batch = sequences[i:i+batchsize]
    relevances = lrp_model.lrp(batch)
As a result, I want an array (relevances) with shape (1134500, 1, 50), but I get one with shape (500, 1, 50)
Can someone tell me what's going wrong?
In case you want to save the relevances, maybe
batchsize = 500
relevances = np.zeros(sequences.shape)
for i in range(0, sequences.shape[0], batchsize):
    batch = sequences[i:i+batchsize]
    relevances[i:i+batchsize, :, :] = lrp_model.lrp(batch)
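An alternative sketch (assuming, as in the question, that lrp_model.lrp returns one output row per input row) is to collect each batch's result in a list and stack them once at the end, which also works when the output shape isn't known up front:
batchsize = 500
chunks = [lrp_model.lrp(sequences[i:i+batchsize])
          for i in range(0, sequences.shape[0], batchsize)]
relevances = np.concatenate(chunks, axis=0)  # shape (1134500, 1, 50)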

Create an array of size n with initialized value

In Python, it's possible to create a list of size n with [None] * n, or even initialize a fixed value for all items with [0] * n.
Is it possible to do something similar but initialize each item to 500 times its index?
For example, creating an array of size 10 would result in this:
[0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
I can easily achieve this with a variable and a for loop as shown below, but I would like to do it in one go to pass into Robot Framework.
arr = [0] * 10
for i, v in enumerate(arr):
    arr[i] = 500 * i
Use a simple comprehension:
[i*500 for i in range(n)]
The other answer gives you one way; here's another:
list(range(0, 500*n, 500))
It's always a good idea to use numpy arrays: they have more functionality and are very fast (vectorized operations and compiled C code under the hood). Your example with numpy:
import numpy as np
nx = 10
a = 500*np.arange(nx)
a
gives:
array([ 0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500])

Merge sorting a 2d array

I'm stuck again on trying to make this merge sort work.
Currently, I have a 2d array with a Unix timecode (fig 1) and I am merge sorting it using the code in fig 2. I am trying to check the first value in each row, i.e. array[x][0], and then move the whole row depending on the value of array[x][0]; however, the merge sort creates duplicates of some data and deletes other data (fig 3). My question is: what am I doing wrong? I know it's the merge sort but I can't see the fix.
fig 1
[[1422403200 100]
[1462834800 150]
[1458000000 25]
[1540681200 150]
[1498863600 300]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]]
fig 2
import numpy as np
def sort(data):
    if len(data) > 1:
        Mid = len(data) // 2
        l = data[:Mid]
        r = data[Mid:]
        sort(l)
        sort(r)
        z = 0
        x = 0
        c = 0
        while z < len(l) and x < len(r):
            if l[z][0] < r[x][0]:
                data[c] = l[z]
                z += 1
            else:
                data[c] = r[x]
                x += 1
            c += 1
        while z < len(l):
            data[c] = l[z]
            z += 1
            c += 1
        while x < len(r):
            data[c] = r[x]
            x += 1
            c += 1
        print(data, 'done')
unixdate = [1422403200, 1462834800, 1458000000, 1540681200, 1498863600, 1540771200, 1540771200,1540771200, 1540771200, 1540771200]
price=[100, 150, 25, 150, 300, 100, 100, 100, 100, 100]
array = np.column_stack((unixdate, price))
sort(array)
print(array, 'sorted')
fig 3
[[1422403200 100]
[1458000000 25]
[1458000000 25]
[1498863600 300]
[1498863600 300]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]]
I couldn't spot any mistake in your code.
I have tried your code and I can tell that the problem does not happen, at least with regular Python lists: the function doesn't change the number of occurrences of any element in the list.
data = [
    [1422403200, 100],
    [1462834800, 150],
    [1458000000, 25],
    [1540681200, 150],
    [1498863600, 300],
    [1540771200, 100],
    [1540771200, 100],
    [1540771200, 100],
    [1540771200, 100],
    [1540771200, 100],
]
sort(data)
from pprint import pprint
pprint(data)
Output:
[[1422403200, 100],
[1458000000, 25],
[1462834800, 150],
[1498863600, 300],
[1540681200, 150],
[1540771200, 100],
[1540771200, 100],
[1540771200, 100],
[1540771200, 100],
[1540771200, 100]]
Edit, taking into account the numpy context and the use of np.column_stack
I initially expected that np.column_stack creates a view mapping over the two arrays, and that to get a real array rather than a link to your existing arrays, you should copy that array:
array = np.column_stack((unixdate, price)).copy()
(This turned out not to be the cause; see Edit 2 below.)
Edit 2, taking into account the numpy context
This behavior actually has nothing to do with np.column_stack; np.column_stack already performs a copy.
The reason your code doesn't work is that slicing behaves differently with numpy arrays than with Python lists. Slicing a numpy array creates a view which maps onto the original array's memory.
The erroneous lines are:
l = data[:Mid]
r = data[Mid:]
Since l and r just map to two pieces of the memory held by data, they are modified when data is. This is why the lines data[c] = l[z] and data[c] = r[x] overwrite values and create duplicates as values are moved.
If data is a numpy array, we want l and r to be copies of data, not just views. This can be achieved using the copy method.
l = data[:Mid]
r = data[Mid:]
if isinstance(data, np.ndarray):
    l = l.copy()
    r = r.copy()
With this change (which I tested), the sort works correctly.
Note
If you wanted to sort the data using python lists rather than numpy arrays, the equivalent of np.column_stack in vanilla python is zip:
z = zip([10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000])
z
# <zip at 0x7f6ef80ce8c8>
# `zip` creates an iterator, which is ready to give us our entries.
# Iterators can only be walked once, which is not the case of lists.
list(z)
# [(10, 100, 1000), (20, 200, 2000), (30, 300, 3000), (40, 400, 4000)]
The entries are (immutable) tuples. If you need the entries to be editable, map list over them:
z = zip([10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000])
li = list(map(list, z))
# [[10, 100, 1000], [20, 200, 2000], [30, 300, 3000], [40, 400, 4000]]
To transpose a matrix, use zip(*matrix):
def transpose(matrix):
    return list(map(list, zip(*matrix)))

transpose(li)
# [[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
You can also sort a Python list li in place using li.sort(), or sort any iterable (lists are iterable) using sorted(li).
Here, I would use (tested):
sorted(zip(unixdate, price))
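If the goal is simply to order the rows by timestamp and the hand-written merge sort isn't a requirement, numpy can also do this directly (a sketch, not part of the original answers):
import numpy as np
array = np.column_stack((unixdate, price))
sorted_array = array[array[:, 0].argsort(kind='stable')]  # sort rows by the timestamp column
print(sorted_array)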

Subtract 3 lists in a tuple from 1 tuple

What is the best way to do this? I'm looking to take the difference, but not in this horrible way. Each element of A, B and C is subtracted from the corresponding element of subtract_from.
A = [500, 500, 500, 500, 5000]
B = [100, 100, 540, 550, 1200]
C = [540, 300, 300, 100, 10]
triples= [tuple(A),tuple(B), tuple(C)]
subtract_from = tuple([1234,4321,1234,4321,5555])
diff = []
for main in subtract_from:
    for i in range(len(triples)):
        for t in triples[i]:
            diff[i].append(main-t)
Try something like this:
all_lists = [A, B, C]
[[i-j for i,j in zip(subtract_from,l)] for l in all_lists]
[
[734, 3821, 734, 3821, 555],
[1134, 4221, 694, 3771, 4355],
[694, 4021, 934, 4221, 5545]
]
This is a clean, idiomatic way of doing it: no need to import any library, just use builtins.
You could try using map and operator:
import operator
A = [500, 500, 500, 500, 5000]
B = [100, 100, 540, 550, 1200]
C = [540, 300, 300, 100, 10]
l = [A, B, C]
subtract_from = [1234,4321,1234,4321,5555]
diff = list((list(map(operator.sub, subtract_from , i)) for i in l))
print(diff)
# [[734, 3821, 734, 3821, 555], [1134, 4221, 694, 3771, 4355], [694, 4021, 934, 4221, 5545]]
First of all, if you want tuples, use tuples explicitly without converting lists. That being said, you should write something like this:
a = 500, 500, 500, 500, 5000
b = 100, 100, 540, 550, 1200
c = 540, 300, 300, 100, 10
vectors = a, b, c
data = 1234, 4321, 1234, 4321, 5555
diff = [
    [de - ve for de, ve in zip(data, vec)]
    for vec in vectors
]
If you want a list of tuples, use tuple(de - ve for de, ve in zip(data, vec)) instead of [de - ve for de, ve in zip(data, vec)].
I think everyone else has nailed it with list comprehensions already, so here's an odd one: if you are working with mutable lists and reusing them in place in an imperative style is acceptable, the following works:
A = [500, 500, 500, 500, 5000]
B = [100, 100, 540, 550, 1200]
C = [540, 300, 300, 100, 10]
subtract_from = (1234,4321,1234,4321,5555)
for i, x in enumerate(subtract_from):
    A[i], B[i], C[i] = x-A[i], x-B[i], x-C[i]
# also with map
# for i, x in enumerate(zip(subtract_from, A, B, C)):
#     A[i], B[i], C[i] = map(x[0].__sub__, x[1:])
diff = [A,B,C]
It's less elegant but possibly more efficient (I have not benchmarked that claim).
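Since the rest of the thread leans on numpy anyway, a broadcasting sketch (not one of the original answers) does the same subtraction in one expression:
import numpy as np
A = [500, 500, 500, 500, 5000]
B = [100, 100, 540, 550, 1200]
C = [540, 300, 300, 100, 10]
subtract_from = [1234, 4321, 1234, 4321, 5555]
diff = np.array(subtract_from) - np.array([A, B, C])  # broadcasts subtract_from across each row
print(diff.tolist())
# [[734, 3821, 734, 3821, 555], [1134, 4221, 694, 3771, 4355], [694, 4021, 934, 4221, 5545]]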
