Quantile/Median/2D binning in Python

Do you know a quick/elegant Python/Scipy/Numpy solution for the following problem?
You have a set of x, y coordinates with associated values w (all 1D arrays). Now bin x and y onto a 2D grid (size BINS x BINS) and calculate quantiles (like the median) of the w values for each bin, so that you end up with a BINS x BINS 2D array holding the required quantiles.
This is easy to do with some nested loops, but I am sure there is a more elegant solution.
Thanks,
Mark

This is what I came up with; I hope it's useful. It's not necessarily cleaner or better than using a loop, but maybe it'll get you started toward something better.
import numpy as np
bins_x, bins_y = 1., 1.
x = np.array([1,1,2,2,3,3,3])
y = np.array([1,1,2,2,3,3,3])
w = np.array([1,2,3,4,5,6,7], 'float')
# You can get a bin number for each point like this
x = (x // bins_x).astype('int')
y = (y // bins_y).astype('int')
shape = [x.max()+1, y.max()+1]
bin = np.ravel_multi_index([x, y], shape)
# You could get the mean by doing something like this
# (bins with no points give 0/0 -> nan plus a runtime warning):
mean = np.bincount(bin, w) / np.bincount(bin)
# Median is a bit harder
order = bin.argsort()
bin = bin[order]
w = w[order]
edges = (bin[1:] != bin[:-1]).nonzero()[0] + 1
med_index = (np.r_[0, edges] + np.r_[edges, len(w)]) // 2
median = w[med_index]
# But that's not quite right, so maybe
median2 = [np.median(i) for i in np.split(w, edges)]
Also take a look at numpy.histogram2d

I'm just trying to do this myself, and it sounds like you want scipy.stats.binned_statistic_2d, with which you can find the mean, median, standard deviation, or any user-defined function of the third parameter, given the bins.
I realise this question has already been answered, but I believe this is a good built-in solution.
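A minimal sketch of that built-in route (the data here is made up for illustration):
import numpy as np
from scipy import stats

x = np.random.uniform(0, 10, 1000)
y = np.random.uniform(0, 10, 1000)
w = np.random.uniform(0, 1, 1000)
BINS = 10

# statistic accepts 'mean', 'median', 'std', 'count', ... or any callable,
# e.g. lambda v: np.percentile(v, 75) for other quantiles.
res = stats.binned_statistic_2d(x, y, w, statistic='median',
                                bins=BINS, range=[[0, 10], [0, 10]])
medians = res.statistic  # BINS x BINS array of per-bin medians of w
Empty bins come back as NaN, which maps neatly onto the original question.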

Thanks a lot for your code. Based on it I found the following solution to my problem (only a minor modification of your code):
import numpy as np
BINS=10
boxsize=10.0
bins_x, bins_y = boxsize/BINS, boxsize/BINS
x = np.array([0,0,0,1,1,1,2,2,2,3,3,3])
y = np.array([0,0,0,1,1,1,2,2,2,3,3,3])
w = np.array([0,1,2,0,1,2,0,1,2,0,1,2], 'float')
# You can get a bin number for each point like this
x = (x // bins_x).astype('int')
y = (y // bins_y).astype('int')
shape = [BINS, BINS]
bin = np.ravel_multi_index([x, y], shape)
# Median
order = bin.argsort()
bin = bin[order]
w = w[order]
edges = (bin[1:] != bin[:-1]).nonzero()[0] + 1
median = [np.median(i) for i in np.split(w, edges)]
#construct BINSxBINS matrix with median values
binvals=np.unique(bin)
medvals=np.zeros([BINS*BINS])
medvals[binvals]=median
medvals=medvals.reshape([BINS,BINS])
print(medvals)
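One small caveat with this: bins that receive no points stay 0 in medvals, which is indistinguishable from a genuine median of 0. A hedged refinement is to initialize with NaN so empty bins are explicit:
medvals = np.full(BINS * BINS, np.nan)  # NaN marks bins with no points
medvals[binvals] = median
medvals = medvals.reshape(BINS, BINS)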

With numpy/scipy it goes like this:
import numpy as np
import scipy.stats as stats
x = np.random.uniform(0,200,100)
y = np.random.uniform(0,200,100)
w = np.random.uniform(1,10,100)
h = np.histogram2d(x, y, bins=[10, 10], weights=w, range=[[0, 200], [0, 200]])
hist, bins_x, bins_y = h
q = stats.mstats.mquantiles(hist,prob=[0.25, 0.5, 0.75])
>>> q.round(2)
array([ 512.8 , 555.41, 592.73])
q1 = np.where(hist<q[0],1,0)
q2 = np.where(np.logical_and(q[0]<=hist,hist<q[1]),2,0)
q3 = np.where(np.logical_and(q[1]<=hist,hist<=q[2]),3,0)
q4 = np.where(q[2]<hist,4,0)
>>> q1 + q2 + q3 + q4
array([[4, 3, 4, 3, 1, 1, 4, 3, 1, 2],
[1, 1, 4, 4, 2, 3, 1, 3, 3, 3],
[2, 3, 3, 2, 2, 2, 3, 2, 4, 2],
[2, 2, 3, 3, 3, 1, 2, 2, 1, 4],
[1, 3, 1, 4, 2, 1, 3, 1, 1, 3],
[4, 2, 2, 1, 2, 1, 3, 2, 1, 1],
[4, 1, 1, 3, 1, 3, 4, 3, 2, 1],
[4, 3, 1, 4, 4, 4, 1, 1, 2, 4],
[2, 4, 4, 4, 3, 4, 2, 2, 2, 4],
[2, 2, 4, 4, 3, 3, 1, 3, 4, 4]])
prob=[0.25, 0.5, 0.75] is the default value for the quantile settings; you can change it or leave it out.

Related

What is the best way to implement 1D-Convolution in python?

I am trying to implement 1D-convolution for signals.
It should have the same output as:
ary1 = np.array([1, 1, 2, 2, 1])
ary2 = np.array([1, 1, 1, 3])
conv_ary = np.convolve(ary2, ary1, 'full')
# -> [1 2 4 8 8 9 7 3]
I came up with this approach:
def convolve_1d(signal, kernel):
    n_sig = signal.size
    n_ker = kernel.size
    n_conv = n_sig - n_ker + 1
    # Reverse the kernel once up front instead of re-slicing it in the loop.
    rev_kernel = kernel[::-1].copy()
    result = np.zeros(n_conv, dtype=np.double)
    for i in range(n_conv):
        result[i] = np.dot(signal[i: i + n_ker], rev_kernel)
    return result
But my result is [8, 8]. I might have to zero-pad my array instead and change its indexing.
Is there a smoother way to achieve the desired outcome?
Here is a possible solution:
def convolve_1d(signal, kernel):
    kernel = kernel[::-1]
    return [
        np.dot(
            signal[max(0, i): min(i + len(kernel), len(signal))],
            kernel[max(-i, 0): len(signal) - i * (len(signal) - len(kernel) < i)],
        )
        for i in range(1 - len(kernel), len(signal))
    ]
Here is an example:
>>> convolve_1d([1, 1, 2, 2, 1], [1, 1, 1, 3])
[1, 2, 4, 8, 8, 9, 7, 3]
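Alternatively, here is a hedged sketch of the zero-padding idea the asker mentioned: pad the signal with len(kernel) - 1 zeros on both sides and keep the original 'valid'-style loop, which then yields the 'full' result:
import numpy as np

def convolve_1d_full(signal, kernel):
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    padded = np.pad(signal, kernel.size - 1)   # zeros on both ends
    rev_kernel = kernel[::-1]
    n_conv = padded.size - kernel.size + 1     # == len(signal) + len(kernel) - 1
    return np.array([np.dot(padded[i:i + kernel.size], rev_kernel)
                     for i in range(n_conv)])

print(convolve_1d_full([1, 1, 2, 2, 1], [1, 1, 1, 3]))
# [1. 2. 4. 8. 8. 9. 7. 3.]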

Compare numpy array to every row of matrix to count similar items (vectorized)

I'm playing bingo and I choose 15 balls out of 50 possible balls (without replacement).
Then 30 balls get drawn (without replacement), and if my 15 balls are all in this set of 30 drawn balls, I win a prize.
I wanted to estimate the probability of winning a prize by simulating this many times, preferably vectorized.
So here's my code until now:
import numpy as np
my_chosen_balls = np.random.choice(range(1,51), 15)
samples_30_balls = np.random.choice(range(1,51), (1_000_000, 30))
How do I compare the 15 balls that I chose with each of these 30-ball samples and see whether all my balls were picked?
So, compare my 15 balls to every one of the samples separately.
Here is a smaller example to visualize with:
my_chosen_balls = np.array([7, 4, 3])
sample_5_balls = np.array([[5, 5, 5, 6, 6, 4, 1, 1, 1, 8, 2, 3, 2, 8, 8],
[1, 9, 1, 3, 4, 8, 5, 4, 7, 2, 8, 6, 5, 6, 4],
[7, 3, 6, 9, 8, 3, 6, 9, 3, 1, 6, 5, 3, 1, 7],
[8, 4, 3, 2, 9, 5, 3, 8, 4, 6, 9, 2, 6, 5, 9],
[3, 2, 8, 5, 1, 9, 2, 5, 8, 4, 5, 1, 7, 4, 6]])
There are a couple of ways of doing this. Since you have only a single selection of 15, you can use np.isin:
mask = np.isin(sample_5_balls, my_chosen_balls).sum(0) == my_chosen_balls.size
If you want the percentage of successes:
np.count_nonzero(mask) / sample_5_balls.shape[1]
The problem is that you can't easily generate an array like samples_30_balls or sample_5_balls using tools like np.random.choice or np.random.Generator.choice. There are some solutions available, like Numpy random choice, replacement only along one axis, but they only work for a small number of items.
Instead, you can use sorting and slicing to get what you want, as shown here and here:
sample_30_balls = np.random.rand(50, 100000).argsort(0)[:30, :]
You will want to add 1 to the numbers for display, but it will be much easier to go zero-based for the remainder of the answer.
If your population size stays at 64 or under, you can use bit twiddling to make everything work much faster. First convert the data to a single array of numbers:
sample_30_bits = (1 << sample_30_balls).sum(axis=0)
These two operations are equivalent to
sample_30_bits = np.bitwise_or.reduce((2**sample_30_balls), axis=0)
A single sample is a single integer with this scheme:
my_chosen_bits = (1 << np.random.rand(50).argsort()[:15]).sum()
np.isin is now infinitely simpler: it's just bitwise AND (&). You can use the fast bit_count function I wrote here (copied verbatim):
def bit_count(arr):
    # Make the values type-agnostic (as long as it's integers)
    t = arr.dtype.type
    mask = t(-1)
    s55 = t(0x5555555555555555 & mask)  # Add more digits for 128bit support
    s33 = t(0x3333333333333333 & mask)
    s0F = t(0x0F0F0F0F0F0F0F0F & mask)
    s01 = t(0x0101010101010101 & mask)
    arr = arr - ((arr >> 1) & s55)
    arr = (arr & s33) + ((arr >> 2) & s33)
    arr = (arr + (arr >> 4)) & s0F
    return (arr * s01) >> (8 * (arr.itemsize - 1))
pct = (bit_count(my_chosen_bits & sample_30_bits) == 15).sum() / sample_30_bits.size
But there's more: now you can generate a large number of samples not just for the 30 balls, but for the 15 as well. One alternative is to generate identical numbers of samples, and compare them 1-to-1:
N = 100000
sample_15_bits = (1 << np.random.rand(50, N).argsort(0)[:15, :]).sum(0)
sample_30_bits = (1 << np.random.rand(50, N).argsort(0)[:30, :]).sum(0)
pct = (bit_count(sample_15_bits & sample_30_bits) == 15).sum() / N
Another alternative is to generate potentially different arrays of samples for each quantity, and compare all of them against each other. This will require a lot more space in the result, so I will show it for smaller inputs:
M = 100
N = 5000
sample_15_bits = (1 << np.random.rand(50, M).argsort(0)[:15, :]).sum(0)
sample_30_bits = (1 << np.random.rand(50, N).argsort(0)[:30, :]).sum(0)
pct = (bit_count(sample_15_bits[:, None] & sample_30_bits) == 15).sum() / (M * N)
If you need to optimize for space (e.g., using truly large sample sizes), keep in mind that all the operations here use ufuncs except np.random.rand and argsort. You can therefore do most of the work in-place without creating temporary arrays. That will be left as an exercise for the reader.
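As a hedged illustration of that in-place idea (reusing the argsort output as the shift buffer, so the bit-shift step allocates no temporary):
N = 100000
idx = np.random.rand(50, N).argsort(0)[:30, :]   # ball indices 0..49 (int64)
np.left_shift(1, idx, out=idx)                   # 1 << idx, computed in place
sample_30_bits = idx.sum(0)                      # one integer bitmask per sample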
Also, I recommend that you draw histograms of bit_count(sample_15_bits & sample_30_bits) to adjust your expectations. Here is a histogram of the counts for the last example above:
import matplotlib.pyplot as plt

y = np.bincount(bit_count(sample_15_bits[:, None] & sample_30_bits).ravel())
x = np.arange(y.size)
plt.bar(x, y)
Notice how tiny the bar at 15 is. I've seen values of pct around 7e-5 while writing this answer, but am too lazy to figure out the theoretical value.
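For reference, the exact probability is a quick hypergeometric count: of the C(50, 30) possible draws, the ones containing a fixed set of 15 balls choose their remaining 15 from the other 35. It agrees with the simulated pct:
from math import comb  # Python 3.8+

p = comb(35, 15) / comb(50, 30)
print(p)  # ~6.89e-05, matching the simulated pct of about 7e-5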
With isin, count the intersecting values and compare the count with 15. I changed the data generation to sample without replacement.
import numpy as np
np.random.seed(10)
my_chosen_balls = np.random.choice(range(0,50), 15, replace=False)
samples_30_balls = np.random.rand(1_000_000,50).argsort(1)[:,:30]
(np.isin(samples_30_balls, my_chosen_balls).sum(1) == 15).sum()
Output
74
So about 0.007% chance.
How generating a data sample without replacement works
Generate random values in [0, 1) with shape (samples, range). Here, 10 samples from [0, 1, 2, 3, 4]:
np.random.rand(10,5)
Out
array([[0.37216438, 0.16884495, 0.05393551, 0.68189535, 0.30378455],
[0.63428637, 0.6566772 , 0.16162259, 0.16176099, 0.74568611],
[0.81452942, 0.10470267, 0.89547322, 0.60099124, 0.22604322],
[0.16562083, 0.89936513, 0.89291548, 0.95578207, 0.90790727],
[0.11326867, 0.18230934, 0.44912596, 0.65437732, 0.78308136],
[0.72693801, 0.22425798, 0.78157525, 0.93485338, 0.84097546],
[0.96751432, 0.57735756, 0.48147214, 0.22441829, 0.53388467],
[0.95415338, 0.07746658, 0.93875458, 0.21384035, 0.26350969],
[0.39937711, 0.35182801, 0.74707871, 0.07335893, 0.27553172],
[0.80749372, 0.40559599, 0.33654045, 0.14802479, 0.71198915]])
'Convert' to integers with argsort
np.random.rand(10,5).argsort(1)
Out
array([[4, 2, 1, 0, 3],
[0, 1, 3, 2, 4],
[1, 3, 2, 4, 0],
[4, 0, 2, 3, 1],
[2, 3, 0, 1, 4],
[1, 4, 3, 2, 0],
[4, 3, 2, 0, 1],
[1, 0, 2, 3, 4],
[4, 1, 2, 3, 0],
[1, 4, 0, 2, 3]])
Slice to the desired sample size
np.random.rand(10,5).argsort(1)[:,:3]
Out
array([[2, 3, 4],
[0, 4, 3],
[3, 0, 4],
[2, 0, 3],
[2, 3, 4],
[3, 4, 2],
[2, 0, 1],
[0, 4, 3],
[0, 2, 3],
[2, 3, 4]])
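On newer NumPy (1.20+), a hedged alternative to the rand/argsort trick is Generator.permuted, which shuffles each row independently:
import numpy as np

rng = np.random.default_rng(10)
# One row of 0..49 per sample, each row shuffled independently, then sliced
pool = np.tile(np.arange(50), (1_000_000, 1))
samples_30_balls = rng.permuted(pool, axis=1)[:, :30]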

Special shuffling of the array

I want to shuffle my numpy array a = [2, 2, 2, 1, 1] in this way: a = [2, 1, 2, 1, 2], so that identical elements do not stand side by side if possible. I know about numpy.random.shuffle, but it generates all possible permutations uniformly. Therefore, a = [2, 1, 2, 1, 2] and a = [2, 2, 2, 1, 1] appear with the same probability. Is there a vectorised solution for more difficult arrays? For example, for this array: b = np.hstack([np.ones(101), np.ones(50) * 2, np.ones(20) * 3]).
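No answer appears in this thread, but here is a minimal vectorized heuristic sketch (not a uniform sampler over valid arrangements, and only best-effort when one value dominates): group equal values, order the groups from most to least frequent, and interleave the two halves so the most frequent value lands on every other index:
import numpy as np

def spread_shuffle(a):
    # Heuristic: interleave so equal values avoid neighbors when possible.
    a = np.asarray(a)
    vals, counts = np.unique(a, return_counts=True)
    order = np.argsort(-counts)                  # most frequent group first
    grouped = np.repeat(vals[order], counts[order])
    out = np.empty_like(a)
    half = (len(a) + 1) // 2
    out[::2] = grouped[:half]                    # dominant values on even slots
    out[1::2] = grouped[half:]
    return out

print(spread_shuffle(np.array([2, 2, 2, 1, 1])))  # [2 1 2 1 2]
The result is deterministic; to randomize ties you could shuffle a once before grouping. If one value fills more than half the array (as with the 101 ones in b), some adjacency is unavoidable no matter the method.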

Calculate pairwise distance of multiple trajectories using numpy

Given an arbitrary number of 3D trajectories with N points (timesteps) each, I would like to compute the distance between each point for a given timestep.
Let's say we'll look at timestep 3 and have four trajectories t_0 ... t_3. The point of the third timestep of trajectory 0 is given as t_0(3). I want to calculate the distances as follows:
d_0 = norm(t_0(3) - t_1(3))
d_1 = norm(t_1(3) - t_2(3))
d_2 = norm(t_2(3) - t_3(3))
d_3 = norm(t_3(3) - t_0(3))
As you can see, there is a kind of circular behavior in it (the last one calculates the distance to the first one), but that is not strictly necessary.
I know how to write some for-loops to calculate what I want. What I am looking for is a concept, or maybe an implementation in numpy (or a combination of np functions), which can perform this logic just using the right axis and other numpy magic.
Here are some example trajectories:
import numpy as np
TIMESTEP_COUNT = 70
origin = np.array([0, 0, 0])
run1_direction = np.array([1, 0, 0]) / np.linalg.norm([1, 0, 0])
run2_direction = np.array([0, 1, 0]) / np.linalg.norm([0, 1, 0])
run3_direction = np.array([0, 0, 1]) / np.linalg.norm([0, 0, 1])
run4_direction = np.array([1, 1, 0]) / np.linalg.norm([1, 1, 0])
run1_trajectory = [origin]
run2_trajectory = [origin]
run3_trajectory = [origin]
run4_trajectory = [origin]
for t in range(TIMESTEP_COUNT - 1):
    run1_trajectory.append(run1_trajectory[-1] + run1_direction)
    run2_trajectory.append(run2_trajectory[-1] + run2_direction)
    run3_trajectory.append(run3_trajectory[-1] + run3_direction)
    run4_trajectory.append(run4_trajectory[-1] + run4_direction)
run1_trajectory = np.array(run1_trajectory)
run2_trajectory = np.array(run2_trajectory)
run3_trajectory = np.array(run3_trajectory)
run4_trajectory = np.array(run4_trajectory)
... which results in a plot of the four example trajectories radiating from the origin (image not reproduced here).
Thank you in advance!!
EDIT:
My question is different from the suggested answer below because I don't want to calculate a full distance matrix. My algorithm only needs the distances between consecutive runs.
I think you can stack them vertically to get an array of shape 4 x n_timesteps, and then use np.roll to take the difference at each timestep, namely:
r = np.vstack([t0,t1,t2,t3])
r - np.roll(r,shift=-1,axis=0)
Numeric example:
t0,t1,t2,t3 = np.random.randint(1,10, 5), np.random.randint(1,10, 5), np.random.randint(1,10, 5), np.random.randint(1,10, 5)
r = np.vstack([t0,t1,t2,t3])
r
array([[1, 7, 7, 6, 2],
[9, 1, 2, 3, 6],
[1, 1, 6, 8, 1],
[2, 9, 5, 9, 3]])
r - np.roll(r,shift=-1,axis=0)
array([[-8, 6, 5, 3, -4],
[ 8, 0, -4, -5, 5],
[-1, -8, 1, -1, -2],
[ 1, 2, -2, 3, 1]])
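Since the question asks for distances rather than raw differences, and the trajectories are 3-D (so the stacked array gains a coordinate axis), a hedged completion takes the norm over that last axis:
# r has shape (4, TIMESTEP_COUNT, 3): four runs, N timesteps, xyz
r = np.stack([run1_trajectory, run2_trajectory, run3_trajectory, run4_trajectory])
diffs = r - np.roll(r, shift=-1, axis=0)   # consecutive (circular) differences
dists = np.linalg.norm(diffs, axis=-1)     # shape (4, TIMESTEP_COUNT)
d_timestep_3 = dists[:, 3]                 # d_0 ... d_3 at timestep 3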

cosine similarity between a vector and a pandas column (a linear vector)

I have a pandas data frame containing a list of wines with their respective wine attributes.
Then I made a new column, vector, that contains numpy vectors built from these attributes.
def get_wine_profile(id):
    wine = wines[wines['exclusiviId'] == id]
    wine_vector = np.array(wine[wine_attrs].values.tolist()).flatten()
    return wine_vector
wines['vector'] = wines.exclusiviId.apply(get_wine_profile)
Hence the vector column looks something like this:
vector
[1, 1, 1, 2, 2, 2, 2, 1, 1, 1]
[3, 1, 2, 1, 2, 2, 2, 0, 1, 3]
[1, 1, 2, 1, 3, 3, 3, 0, 1, 1]
.
.
Now I want to compute the cosine similarity between this column and another vector, the one resulting from the user input.
This is what I have tried so far:
from scipy.spatial.distance import cosine
cos_vec = wines.apply(lambda x: (1 - cosine(wines["vector"], [1, 1, 1, 2, 2, 2, 2, 1, 1, 1])), axis=1)
print(cos_vec)
This throws an error:
ValueError: ('operands could not be broadcast together with shapes (63,) (10,) ', 'occurred at index 0')
I also tried using sklearn, but it has the same problem with the array shape.
What I want as the final output is a column holding the match score between this column and the user input.
A better solution IMO is to use cdist with the cosine metric. You are effectively computing pairwise distances between the n points in your DataFrame and the 1 point from your user input, i.e. n pairs in total.
If you handle more than one user at a time, this would be even more efficient.
from scipy.spatial.distance import cdist

# make the user input into a 1x10 array
user_input = np.array([1, 1, 1, 2, 2, 2, 2, 1, 1, 1])[None]
# cdist returns an (n, 1) array; ravel it before assigning to a column
df["cos_dist"] = cdist(np.stack(df.vector), user_input, metric="cosine").ravel()
# vector cos_dist
# 0 [1, 1, 1, 2, 2, 2, 2, 1, 1, 1] 0.00000
# 1 [3, 1, 2, 1, 2, 2, 2, 0, 1, 3] 0.15880
# 2 [1, 1, 2, 1, 3, 3, 3, 0, 1, 1] 0.07613
By the way, it looks like you are using native Python lists. I would switch everything to numpy arrays. A conversion to np.array is happening under the hood anyway when you call cosine.
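As a follow-up, cdist returns a cosine distance; the match score the asker wants is simply its complement:
df["score"] = 1 - df["cos_dist"]  # cosine similarity; 1.0 means identical direction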
Well, I made my own function to do this, and yes, it works:
import math

def cosine_similarity(v1, v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2) / (||v1|| * ||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy / math.sqrt(sumxx*sumyy)

def get_similarity(id):
    vec1 = result_vector
    vec2 = get_wine_profile(id)
    similarity = cosine_similarity(vec1, vec2)
    return similarity

wines['score'] = wines.exclusiviId.apply(get_similarity)
display(wines.head())
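For completeness, a hedged vectorized equivalent of that hand-rolled loop (assuming wines['vector'] holds equal-length numeric arrays and result_vector is a 1-D array):
import numpy as np

V = np.stack(wines['vector'])        # shape (n_wines, n_attrs)
u = np.asarray(result_vector, dtype=float)
wines['score'] = (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u))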
