How do I quickly decimate a numpy array?

I need a function that decimates a numpy array, i.e. removes m in n samples. For example, to remove 1 in 2, or to remove 2 in 3. So an array which is:
[7, 4, 3, 5, 9, 2, 4, 1, 6, 8]
decimated by 1:2 would become:
[7, 3, 9, 4, 6]
I wonder if it is possible to reshape the array from a 1-D array of length N to a 2-D array of shape (N/2, 2), then drop the extra dimension?
Ideally, rather than just dropping the decimated samples, I would like to find the maximum value across each set (in this example, each pair) of values. For example:
[7, 5, 9, 4, 8]
Is there a way to find the maximum value across each set rather than just dropping the extra samples?
The added challenge is that the point here is to plot the values.
The decimation is required because plotting every value takes too long, so I have to reduce the size of the array before plotting it, and I need to do that quickly too; for or while loops would take too long.

A quick and dirty way is
import numpy as np

k, N = 3, 18
a = np.random.randint(0, 10, N)  # e.g. [9, 6, 6, 6, 8, 4, 1, 4, 8, 1, 2, 6, 1, 8, 9, 8, 2, 8]
a = a[:-k:k]                     # [9, 6, 1, 1, 1]
This works whether or not k divides into N.
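Note that the a[:-k:k] slice always drops the final group. If you want to keep every k-th sample right through to the end, plain a[::k] reproduces the question's 1:2 example; a quick check of the difference:
arr = np.array([7, 4, 3, 5, 9, 2, 4, 1, 6, 8])
print(arr[::2])    # [7 3 9 4 6] -- the question's 1:2 example
print(arr[:-2:2])  # [7 3 9 4]   -- the slice above drops the final group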

Be careful about simply throwing samples away, because significant readings can be lost that way.
For the task you describe, it is worth using proper decimation.
Unfortunately it is not in numpy, but it is in scipy.
In the code below I give an example where discarding samples leads to an error.
As you can see, the original data (blue) has a peak, and manual thinning can simply skip it (green).
If you apply decimation from the library, the peak is kept in the result (orange).
from scipy import signal
import matplotlib.pyplot as plt
import numpy as np
downsampling_factor = 2
t = np.linspace(0, 1, 50)
y = list(np.random.randint(0,10,int(len(t)/2))) + [50] + list(np.random.randint(0,10,int(len(t)/2-1)))
ydem = signal.decimate(y, downsampling_factor)
t_new = np.linspace(0, 1, len(ydem))
manual_decimation = y[:-downsampling_factor:downsampling_factor]
t_manual_decimation = np.linspace(0, 1, len(manual_decimation))
plt.plot(t, y, '.-', t_new, ydem, 'o-', t_manual_decimation, manual_decimation, 'x-')
plt.legend(['data', 'scipy decimate', 'manual decimate'], loc='best')
plt.show()
In general this is not such a trivial task, so please be careful.
UPD: note that the length of the vector must be greater than 27 (presumably because signal.decimate's default order-8 IIR filter is applied with filtfilt, whose default padding requires a longer input).

To find the maximum:
1) k divides N:
k,N = 3,18
a = np.random.randint(0,10,N)
a
# array([0, 6, 6, 3, 7, 0, 9, 2, 3, 2, 5, 4, 2, 6, 9, 6, 3, 2])
a.reshape(-1,k).max(1)
# array([6, 7, 9, 5, 9, 6])
2) k does not divide N:
k,N = 4,21
a = np.random.randint(0,10,N)
a
# array([4, 4, 6, 0, 0, 1, 7, 8, 2, 3, 0, 5, 7, 1, 1, 5, 7, 8, 3, 1, 7])
np.maximum.reduceat(a, np.arange(0,N,k))
# array([6, 8, 5, 7, 8, 7])
Option 2) should always work, but I suspect 1) is faster where applicable.
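For the plotting use case in the question, the two cases can be combined: trim so k divides the length, reshape and take the per-block max, and only then hand the (much smaller) array to matplotlib. A sketch, where block_max and the array sizes are just illustrative:
import numpy as np
import matplotlib.pyplot as plt

def block_max(a, k):
    # per-block maximum; also handles the case where k does not divide len(a)
    n = len(a) - (len(a) % k)      # largest multiple of k
    head = a[:n].reshape(-1, k).max(axis=1)
    if n < len(a):                 # fold any leftover tail in as one extra block
        head = np.append(head, a[n:].max())
    return head

a = np.random.randint(0, 10, 1_000_003)
plt.plot(block_max(a, 1000))       # ~1000 points instead of a million
plt.show()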

Related

Why aren't these 2 loops giving me the same result and how to reduce time complexity?

I want to solve this problem, but the problem itself isn't my issue; I only give it as context.
"You are given an integer array height of length n. There are n vertical lines drawn such that the two endpoints of the ith line are (i, 0) and (i, height[i]).
Find two lines that together with the x-axis form a container, such that the container contains the most water.
Return the maximum amount of water a container can store."
The vertical lines above are represented by the array [1,8,6,2,5,4,8,3,7]. In this case, the maximum amount of water the container can contain is 49.
I made a simple nested for loop to solve this problem:
maxim = 0
for i in range(0, len(height)):
    for j in range(0, len(height)):
        maxim = max(min(height[i], height[j]) * abs(j - i), maxim)
But this solution takes too long for a bigger array. So I tried to do it with a list comprehension:
mxm = [min(height[i], height[j] * abs(j - i)) for i in range(0, len(height)) for j in range(0, len(height))]
maxim = max(mxm)
The problem is, I have two different outputs: the nested for loop works (it returns 49) but the second one returns 8. (The mxm array has these elements: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 6, 4, 8, 8, 8, 8, 8, 2, 6, 0, 2, 6, 6, 6, 6, 6, 2, 2, 2, 0, 2, 2, 2, 2, 2, 4, 5, 5, 2, 0, 4, 5, 5, 5, 4, 4, 4, 4, 4, 0, 4, 4, 4, 6, 8, 8, 6, 8, 4, 0, 3, 8, 3, 3, 3, 3, 3, 3, 3, 0, 3, 7, 7, 7, 7, 7, 7, 7, 3, 0])
Why are they different? And how can I make my solution faster?
In the first example you're applying the min function to just the height values:
min(height[i], height[j])
In the second, the closing parenthesis moved, so the absolute distance between index positions is included inside the min call as well; it always multiplies height[j] instead of the actual minimum:
min(height[i], height[j] * abs(j - i))
Also, regarding performance: I believe I've seen this problem before. I think what you're looking for is a two-pointer ("sliding window") approach, as sketched below.
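A minimal sketch of that idea: keep one pointer at each end and move the pointer at the shorter line inward, since that is the only move that can possibly increase the area. This makes it O(n) instead of O(n^2):
def max_area(height):
    left, right = 0, len(height) - 1
    best = 0
    while left < right:
        best = max(best, min(height[left], height[right]) * (right - left))
        if height[left] < height[right]:
            left += 1   # the shorter line limits the area; move it inward
        else:
            right -= 1
    return best

print(max_area([1, 8, 6, 2, 5, 4, 8, 3, 7]))  # 49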

How to plot two lists of data with varying lengths in Python (embedding of a time-series in phase space)?

My aim is to transform a one-dimensional time-series into a two-dimensional phase space. Since the time-series is one-dimensional, the phase space will be a pseudo (lag) phase space.
One theoretical approach to transform a time-series into pseudo phase space is as follows:
The original list of data is the full-length time-series x(t).
A subseries of data is the "lagged" version of the original time-series, starting with the second value of the time-series (instead of with its first one, as the original time-series does), i.e. x(t+1).
Consequently, the subseries will always have one value less in its list. For a 3D phase space, a second subseries would have two values less in its list.
This is where my code-related problem comes in, since matplotlib does not allow me to plot a two-dimensional plane when the lengths of the two lists are not equal.
Here is my current code:
import numpy as np
import matplotlib.pyplot as plt
# Example time-series
Data = [924, -5, 24, 1, 0, 242, -5, 42, 5, 1, -9, 50, 3, 432, 0, -5, 4, 1, 2, 3]
# Embedding (time-series to phase space)
x_list = Data[:-1]
y_list = Data[1:]
# Plot
plt.plot(x_list, y_list, c="blue", linewidth=0.5)
plt.show()
This code uses the whole time-series except its last value (x_list = Data[:-1]) for the x-axis, and the whole time-series except its very first value (Data[1:]) for the y-axis.
While this code works, its result is not a real embedding of a time-series into its two-dimensional phase space, since x_list = Data[:-1] does not include the last value of the time-series.
What would be a proper way for coding and plotting the phase space of subseries that increasingly diminish in length compared to the original time-series data?
A simple approach is to use pandas and its shift method:
Data = [924, -5, 24, 1, 0, 242, -5, 42, 5, 1, -9, 50, 3, 432, 0, -5, 4, 1, 2, 3]
import pandas as pd
import matplotlib.pyplot as plt
timeseries = pd.Series(Data)
plt.plot(timeseries, timeseries.shift(), c='blue', linewidth=0.5)
For a lag of 2, use shift(2).
(figure: phase-space plot of the series against its shifted copy)
NB. you can also shift with numpy, but it is less elegant IMO
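For completeness, a sketch of that numpy version: pad the lagged copy with a leading NaN so the lengths stay equal (matplotlib simply skips the NaN point):
import numpy as np
y = np.asarray(Data, dtype=float)
y_lagged = np.concatenate(([np.nan], y[:-1]))  # numpy equivalent of Series.shift(1)
plt.plot(y, y_lagged, c='blue', linewidth=0.5)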
autocorrelation
I am not sure what your end goal is, but in case you are trying to determine whether there is a period, or to perform autocorrelation analysis, you can use pandas.plotting.autocorrelation_plot:
pd.plotting.autocorrelation_plot(timeseries)
(figure: autocorrelation plot)
For a wrap around solution you could use a list comp to shift the data:
Data = list(range(10))
d = 5
multi = [Data[dims:] + Data[:dims] for dims in range(d)]
print(*multi, sep="\n")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
[2, 3, 4, 5, 6, 7, 8, 9, 0, 1]
[3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
[4, 5, 6, 7, 8, 9, 0, 1, 2, 3]
If you do not want to wrap around, fix it like so:
d = 5
multi = [(Data[dims:]+Data[:dims])[:-d+1] for dims in range(d)]
to get
[0, 1, 2, 3, 4, 5]
[1, 2, 3, 4, 5, 6]
[2, 3, 4, 5, 6, 7]
[3, 4, 5, 6, 7, 8]
[4, 5, 6, 7, 8, 9]
If you want a hypothetical last value, you would have to do some extrapolation of the series you have; whether that makes more sense than cutting it short is up to you.
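As an aside, the same wrap-around shifts can be built with numpy's roll; a sketch equivalent to the list comprehension above:
import numpy as np

data = np.arange(10)
multi = [np.roll(data, -dims) for dims in range(5)]  # same rows as the wrap-around output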

Conditional Numpy shuffling

Problem
Assume you have a structured np.array as such:
first_array = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9])
and you would like to create a new np.array of the same size, but shuffled.
E.g.
second_array = first_array.copy()
np.random.shuffle(second_array)  # shuffle works in place and returns None
second_array
# np.array([3, 2, 9, 5, 6, 1, 1, 6, 9, 7, 8, 5, 2, 7, 8, 3, 4, 4])
So far, so good. However, random shuffling leads to some duplicates being very close to each other, which is something I'm trying to avoid.
Question
How do I shuffle this array so that the integer order is pseudo-random, but each duplicate has a high probability of being far from its twin? Is there a more specific term for this problem?
This is more of an algorithmic problem than a numpy one. A naive approach is to split the array into as many subarrays as the minimum target spacing (spacing_condition), i.e. the distance that duplicates should at least be kept apart.
import numpy as np
first_array = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9])
spacing_condition = 3
subarrays = np.split(first_array, spacing_condition)
The next step is to choose sequentially from the subarrays in order; this guarantees the spacing condition, removing each choice from its subarray as we go.
However, for these last two steps, naive loops will be slow for large arrays. A naive implementation follows; the seed is there only for reproducibility.
np.random.seed(42)

def choose_over_spacing(subarrays):
    choices = []
    new_subarrays_ = []
    subarray_indices = np.arange(len(subarrays[0]))
    for subarray in subarrays:
        index_to_choose = np.random.choice(subarray_indices, 1)[0]
        number_choice = subarray[index_to_choose]
        choices.append(number_choice)
        new_subarray = np.delete(subarray, index_to_choose)
        new_subarrays_.append(new_subarray)
    return choices, new_subarrays_

all_choices = []
for _ in np.arange(len(subarrays[0])):
    choices, subarrays = choose_over_spacing(subarrays)
    all_choices = all_choices + choices
Inspecting the result, we see that duplicated numbers are guaranteed to be at least 3 apart, as required by spacing_condition; one could choose a different spacing condition, as long as the initial split works.
[2, 6, 8, 3, 6, 7, 2, 5, 9, 3, 4, 9, 1, 4, 8, 1, 5, 7]
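The split-and-rotate scheme above can also be vectorized, avoiding the Python loops entirely. A sketch, under the same assumption that duplicates start out grouped and the split is even:
import numpy as np

rng = np.random.default_rng(42)
first_array = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9])
spacing_condition = 3

groups = first_array.reshape(spacing_condition, -1)  # same rows as np.split above
keys = rng.random(groups.shape).argsort(axis=1)      # one random permutation per row
shuffled = np.take_along_axis(groups, keys, axis=1)  # shuffle within each group
result = shuffled.T.ravel()                          # round-robin: one pick per group per round
Consecutive picks from the same group end up exactly spacing_condition positions apart in the output, so duplicates keep the required distance.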

Get the relative extrema from 1D numpy array

I'm writing code that includes an algorithm to find local maximum/minimum values in an array, but I failed to find the proper function.
At first, I used argrelextrema in scipy.signal.
import numpy as np
import scipy.signal

b = [6, 1, 3, 5, 5, 3, 1, 2, 2, 3, 2, 1, 1, 9, 10, 10, 9, 8, 7, 7, 13, 10]
scipy.signal.argrelextrema(np.array(b), np.greater)
scipy.signal.argrelextrema(np.array(b), np.greater_equal)
scipy.signal.argrelextrema(np.array(b), np.greater_equal, order=2)
The result is
(array([ 9, 20], dtype=int64),)
(array([ 0, 3, 4, 7, 9, 14, 15, 20], dtype=int64),)
(array([ 0, 3, 4, 9, 14, 15, 20], dtype=int64),)
The first one didn't catch b[3] (or b[4]). So I modified it to the second one, using np.greater_equal. However, in this case the first value b[0] is also treated as a local maximum, and the value 2 at b[7] is included. With the third one, I could throw away b[7], but order=2 still has problems when the data is like [1, 3, 1, 4, 1] (it can't catch the 3).
My expected result is
[3(or 4), 9, 14(or 15), 20]
I want to catch only one of b[3] and b[4] (same value), and I want the problems with argrelextrema mentioned above to be solved. The code below succeeded:
scipy.signal.find_peaks(b)
The result is [3, 9, 14, 20].
The code I'm writing treats pairs of local maxima and local minima, so I want to find the local minima in the same way. Is there any function like scipy.signal.find_peaks for finding local minima?
You could simply apply find_peaks to the negated version of your array:
from scipy.signal import find_peaks
min_idx, _ = find_peaks([-x for x in b])  # find_peaks returns (indices, properties)
Even more convenient when using numpy arrays:
import numpy as np
b = np.array(b)
min_idx, _ = find_peaks(-b)
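Putting the two together on the question's data (note that flat peaks and troughs each report a single index):
max_idx, _ = find_peaks(b)   # [3, 9, 14, 20], as in the question
min_idx, _ = find_peaks(-b)  # the corresponding local minima
print(max_idx, min_idx)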

Create Poisson-like distribution with N random numbers whose sum is a constant (C)

I want to generate Poisson-distributed random numbers where the sum of the generated numbers is 1000 and each value is bounded between 3 and 30.
I can use numpy to generate random number:
In [2]: np.random.poisson(5, 150)
array([ 4, 4, 6, 4, 8, 6, 4, 2, 6, 8, 8, 8, 1, 4, 3, 4, 1,
3, 7, 6, 7, 4, 5, 5, 7, 6, 5, 3, 3, 5, 4, 6, 2, 0,
3, 5, 6, 2, 5, 2, 4, 7, 4, 7, 8, 5, 6, 1, 4, 4, 7,
4, 7, 2, 7, 4, 3, 8, 10, 2, 5, 7, 6, 3, 5, 7, 8, 5,
4, 7, 8, 8, 2, 2, 10, 6, 3, 5, 2, 5, 5, 6, 4, 6, 4,
0, 4, 3, 5, 8, 6, 7, 4, 4, 4, 3, 3, 4, 4, 6, 7, 6,
3, 9, 7, 7, 4, 5, 2, 4, 3, 6, 5, 6, 3, 6, 8, 9, 6,
3, 4, 4, 7, 3, 9, 12, 4, 5, 5, 7, 6, 5, 2, 10, 1, 3,
4, 4, 6, 5, 4, 4, 7, 5, 6, 5, 7, 2, 5, 5])
But I want to add something more to it:
- Each random number should have a minimum of 3 and a maximum of 30.
- The sum of the generated random numbers should be 1000.
I know I may not be creating an exact Poisson distribution if I manipulate it, but I want something Poisson-like with the suggested controls.
Here's yet another alternative, based on pre-allocating the minimum per bin, calculating how many observations remain, and dialing in a Poisson rate for each remaining bin determined by how many observations and how many bins remain, subject to acceptance/rejection based on the upper bound per bin.
Since a Poisson value is a count of how many observations fell in an interval, any observations not allocated by the initial stage are randomly allocated one-by-one to bins with remaining capacity.
Here it is:
import numpy as np

def make_poissonish(n, num_bins):
    if n > 30 * num_bins:
        print("requested n exceeds 30 / bin")
        exit(-1)
    if n < 3 * num_bins:
        print("requested n cannot fill 3 / bin")
        exit(-1)

    # Disperse minimum quantity per bin in all bins, then determine remainder
    lst = [3 for _ in range(num_bins)]
    number_remaining = n - num_bins * 3

    # Allocate counts to all bins using a truncated Poisson
    for i in range(num_bins):
        # dial the rate up or down depending on whether we're falling
        # behind or getting ahead in allocating observations to bins
        rate = number_remaining / float(num_bins - i)  # avg per remaining bin
        # keep generating until we meet the constraint requirement (acceptance/rejection)
        while True:
            x = np.random.poisson(rate)
            if x <= 27 and x <= number_remaining:
                break
        # Found an acceptable count, put it in this bin and move on
        lst[i] += x
        number_remaining -= x

    # If there are still observations remaining, disperse them
    # randomly across bins that have remaining capacity
    while number_remaining > 0:
        i = np.random.randint(0, num_bins)
        if lst[i] >= 30:  # not this one, it's already full!
            continue
        lst[i] += 1
        number_remaining -= 1

    return lst
Sample output:
result = make_poissonish(150, 10)
print(result) # => [16, 19, 11, 16, 21, 18, 12, 17, 8, 12]
print(sum(result)) # => 150
result = make_poissonish(50, 10)
print(result) # => [3, 5, 5, 4, 3, 3, 15, 3, 6, 3]
print(sum(result)) # => 50
Let me write something which may or may not work; we'll see.
A property of the Poisson distribution is that a single parameter, λ, is the mean and the variance at the same time. Let's try another distribution which really does sum to 1000 and is close enough to Poisson: the multinomial distribution.
Consider sampling 200 numbers from a multinomial. We will shift each sampled number by 3, so the minimum-bound condition is satisfied. That means the multinomial's sum (its n parameter) must equal 1000 - 3*200 = 400, and the probabilities p_i are all set to 1/200.
Thus the multinomial mean is E[x_i] = n*p_i = 400/200 = 2. The multinomial variance is n*p_i*(1 - p_i), and because p_i is very small, the (1 - p_i) term is very close to 1, so the sampled integers resemble a Poisson with mean equal to variance. The problem is that after the shift the mean becomes 5, but the variance stays at about 2.
Anyway, some code.
import numpy as np
N = 200
shift = 3
n = 1000 - N*shift
p = [1.0 / float(N)] * N
q = np.random.multinomial(n, p, size=1)
print(np.sum(q))
print(np.mean(q))
print(np.var(q))
result = q + shift
print(np.sum(result))
print(np.mean(result))
print(np.var(result))
You can easily do it using a while loop and the random module, and it will do the job:
from random import randint

nums_sum = 0
nums_lst = []
while nums_sum < 1000:
    remaining = 1000 - nums_sum
    if remaining > 30:       # plenty of room left: draw a normal sample
        n = randint(3, 30)   # randint is inclusive on both ends
    else:                    # close out with the exact remainder
        n = remaining        # note: this last value can fall below 3
    nums_sum += n
    nums_lst.append(n)

print(nums_sum)   # 1000
print(nums_lst)
So simple.
