Numpy/Torch insert smallest value in case of collision - python

I have an empty numpy array, a list of indices, and list of values associated with the indices. The issue is that there may be duplicates in the indices. In all these "collision" cases, I'd like the smallest value to be picked. Just wondering what is the best way to go about it.
Eg:
array = [0,0,0,0,0,0,0]
indices = [0, 0, 2, 3, 2, 4]
values = [1.0, 3.0, 3.5, 1.5, 2.5, 8.0]
Result:
out = [1.0, 0, 2.5, 1.5, 8.0, 0.0, 0.0]

You can always implement something manually like:
import numpy as np
def index_reduce(arr, indices, out, reducer=min):
    touched = np.zeros_like(out, dtype=np.bool_)
    for i, x in enumerate(indices):
        if not touched[x]:
            out[x] = arr[i]
            touched[x] = True
        else:
            out[x] = reducer(out[x], arr[i])
    return out
which essentially loops through the indices and assigns the values of arr to out if the position has not already been touched (keeping track of this with the touched array), and otherwise reduces the stored output with the specified reducer.
NOTE: The reducer must be a function whose result depends only on the current value and the previously accumulated value (as min and max do).
The usage of this would be:
indices = [0, 0, 2, 3, 2, 4]
values = [1.0, 3.0, 3.5, 1.5, 2.5, 8.0]
array = np.zeros(7)
index_reduce(values, indices, array)
# array([1. , 0. , 2.5, 1.5, 8. , 0. , 0. ])
If performance is a concern, you can also accelerate the above code with Numba using a simple decoration, provided that the values and indices inputs are also NumPy arrays:
import numba as nb
index_reduce_nb = nb.njit(index_reduce)
indices = np.array([0, 0, 2, 3, 2, 4])
values = np.array([1.0, 3.0, 3.5, 1.5, 2.5, 8.0])
array = np.zeros(7)
index_reduce_nb(values, indices, array)
# array([1. , 0. , 2.5, 1.5, 8. , 0. , 0. ])
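As a loop-free variant in plain NumPy, the unbuffered ufunc method np.minimum.at also handles repeated indices. The following is only a minimal sketch and is not one of the solutions benchmarked below; the helper name index_min and its fill parameter are hypothetical, and it assumes that untouched positions should end up as 0 and that no input value is infinite:

```python
import numpy as np

def index_min(values, indices, size, fill=0.0):
    # Start from +inf so that any finite value wins the first comparison.
    out = np.full(size, np.inf)
    # Unbuffered in-place minimum: every occurrence of a repeated index is applied.
    np.minimum.at(out, indices, values)
    # Positions that were never written still hold +inf; reset them (assumed fill value).
    out[np.isinf(out)] = fill
    return out

indices = np.array([0, 0, 2, 3, 2, 4])
values = np.array([1.0, 3.0, 3.5, 1.5, 2.5, 8.0])
result = index_min(values, indices, 7)
print(result)
```

Starting from +inf rather than from the zero-filled array matters here: taking the minimum against an initial 0 would discard every positive value.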
Benchmarks
The above solutions can be compared to a Torch-based solution (reworked from @Shai's answer):
import torch
def index_reduce_torch(arr, indices, out, reduce_="amin"):
    arr = torch.from_numpy(arr)
    indices = torch.from_numpy(indices)
    out = torch.from_numpy(out)
    return out.index_reduce_(dim=0, index=indices, source=arr, reduce=reduce_, include_self=False).numpy()
or, with additional skipping of Torch gradients:
index_reduce_torch_ng = torch.no_grad()(index_reduce_torch)
index_reduce_torch_ng.__name__ = "index_reduce_torch_ng"
and a Pandas-based solution (reworked from @bpfrd's answer):
import pandas as pd
def index_reduce_pd(arr, indices, out, reducer=min):
    df = pd.DataFrame(data=zip(indices, arr))
    df1 = df.groupby(0, as_index=False).agg(reducer)
    out[df1[0]] = df1[1]
    return out
using the following code:
funcs = index_reduce, index_reduce_nb, index_reduce_pd, index_reduce_torch, index_reduce_torch_ng
timings = {}
for i in range(4, 18):
    n = 2 ** i
    print(f"n = {n}, i = {i}")
    extrema = 0, 2 * n
    indices = np.random.randint(*extrema, n)
    values = np.random.random(n)
    out = np.zeros(extrema[1] + 1)
    timings[n] = []
    base = funcs[0](values, indices, out)
    for func in funcs:
        res = func(values, indices, out)
        is_good = np.allclose(base, res)
        # %timeit is an IPython magic, so this loop must run in IPython/Jupyter
        timed = %timeit -r 16 -n 16 -q -o func(values, indices, out)
        timing = timed.best * 1e6
        timings[n].append(timing if is_good else None)
        print(f"{func.__name__:>24} {is_good} {timing:10.3f} µs")
to produce with the additional lines:
import matplotlib.pyplot as plt
df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', figsize=(6, 4))
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', ylim=[0, 500], figsize=(6, 4))
fig = plt.gcf()
fig.patch.set_facecolor('white')
these plots:
(the second is a zoomed-in version of the first).
These indicate that the Numba-accelerated solution is likely the fastest, closely followed by the Torch-based solution, while the Pandas approach is likely the slowest, even slower than the explicit loop without acceleration.

You are looking for index_reduce_, which was introduced in PyTorch 1.12.
import torch
array = torch.zeros(7)
indices = torch.tensor([0, 0, 2, 3, 2, 4])
values = torch.tensor([1.0, 3.0, 3.5, 1.5, 2.5, 8.0])
out = array.index_reduce_(dim=0, index=indices, source=values, reduce='amin', include_self=False)
You'll get your desired output:
tensor([1.0000, 0.0000, 2.5000, 1.5000, 8.0000, 0.0000, 0.0000])
Note that this method is in "beta" and its API may change in future PyTorch versions.

You can use pandas groupby with agg as follows:
import pandas as pd
indices = [0, 0, 2, 3, 2, 4]
values = [1.0, 3.0, 3.5, 1.5, 2.5, 8.0]
array = [0,0,0,0,0,0,0]
df = pd.DataFrame(zip(indices, values), columns=['indices','values'])
df1 = df.groupby('indices', as_index=False).agg(values=('values', min))
for i,j in zip(df1['indices'].tolist(), df1['values'].tolist()):
    array[i] = j
output:
array
>[1.0, 0, 2.5, 1.5, 8.0, 0, 0]

Related

Find float numbers that have at least two multiples in given list

I need a function to find all float numbers that have at least two multiples in a given list.
Do you know if an already existing and efficient function exists in pandas, scipy or numpy for this purpose?
Example of expected behavior
Given the list [3.3, 3.4, 4.4, 5.1], I want a function that returns [.2, .3, 1.1, 1.7]
You can do something like:
import itertools
from itertools import chain
from math import sqrt
import numpy as np
l = [3.3, 3.4, 4.4, 5.1]
def divisors(n):
    # Find all divisors
    return set(chain.from_iterable((i,n//i) for i in range(1,int(sqrt(n))+1) if n%i == 0))
# Multiply all numbers by 10, round to the nearest integer (guards against float error), concatenate all divisors from all numbers
divisors_all = list(itertools.chain(*[list(divisors(int(round(x*10)))) for x in l]))
# Keep divisors with at least 2 multiples, then divide by 10 to scale back
div, counts = np.unique(divisors_all, return_counts=True)
result = div[counts > 1]/10
Output:
array([0.1, 0.2, 0.3, 1.1, 1.7])
This assumes that all numbers in the original set have at most one decimal place.
It keeps 0.1, since 1 divides every integer, but that entry can be removed easily.
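The same idea, with the trivial divisor removed, could be sketched as follows. The function name shared_divisors and the scale parameter are hypothetical; the sketch assumes positive inputs with at most one decimal place (for the default scale=10) and rounds before converting to integers to guard against floating-point error:

```python
from collections import Counter
from math import isqrt

def shared_divisors(values, scale=10):
    counts = Counter()
    for v in values:
        n = round(v * scale)  # integer number of tenths for scale=10
        divs = set()
        for i in range(1, isqrt(n) + 1):
            if n % i == 0:
                divs.update((i, n // i))
        # Each input contributes every one of its divisors exactly once.
        counts.update(divs)
    # Keep divisors shared by at least two inputs, dropping the trivial divisor 1.
    return sorted(d / scale for d, c in counts.items() if c >= 2 and d > 1)

result = shared_divisors([3.3, 3.4, 4.4, 5.1])
print(result)  # [0.2, 0.3, 1.1, 1.7]
```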
I think numpy.gcd() can be used to do what your question asks subject to the following clarifying constraints:
the input numbers will be examined to 1 decimal precision
inputs must be > 0
results must be > 1.0
import numpy as np
a = [3.3, 3.4, 4.4, 5.1]
b = [int(10*x) for x in a]
res = {np.gcd(x, y) for i, x in enumerate(b) for j, y in enumerate(b) if i != j}
res = [x/10 for x in res if x > 10]
Output:
[1.1, 1.7]
UPDATE:
To exactly match the results in the question after edit by OP (namely: [.2, .3, 1.1, 1.7]), we can do this:
import numpy as np
a = [3.3, 3.4, 4.4, 5.1]
b = [int(10*x) for x in a]
res = sorted({np.gcd(x, y) / 10 for i, x in enumerate(b) for j, y in enumerate(b) if i != j} - {1 / 10})
Output:
[0.2, 0.3, 1.1, 1.7]

Pandas groupby throws an error when using sum()

I am trying to calculate the between cluster scatter matrix. In order to do that, for each cluster (named "group" in the example below), I need to perform an operation which results in a matrix and subsequently perform an element-wise addition of the matrices from each cluster.
To do this I try the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = list(df.columns)
features.remove('group')
def g(x, mu):
    y = np.array([np.mean(x) - mu])
    print((y.T @ y)*len(x))
    print("")
    return (y.T @ y)*len(x)
m = len(df.index)
mu = df.groupby('group')[features].apply(lambda x: (np.multiply(x.count(), np.mean(x)))/m).sum()
print("mu:")
print(mu)
Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()
This example throws a TypeError: Series.name must be a hashable type error on the last line. The print statement in the g function shows the result as expected, see below, so I believe the error is due to the .sum() operation.
[[0.2553 0.1458]
[0.1458 0.0833]]
[[1.5052 1.0625]
[1.0625 0.75 ]]
[[0.7812 0.625 ]
[0.625 0.5 ]]
The result I was expecting by adding the .sum() operation was the element-wise addition of the three matrices above.
The expected output is:
[[2.5416 1.8333]
[1.8333 1.3333]]
Any ideas why this is giving me an error and what I can do to correct it?
Update 1:
Using:
Sb = df.groupby('group').apply(g, mu=(mu)).sum()
instead of
Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()
gives the correct matrix, padded with nans. Why does features cause an error?
Have you tried this?
sb=df.groupby('group').apply(g, mu=(mu)).sum()
it gives the following result:
[[2.54166667 1.83333333 nan]
[1.83333333 1.33333333 nan]
[ nan nan nan]]
Is that what you want?
You still have to deal with the NaNs, though.
Edit to answer your comments:
To address the problem raised in the comments, you could change your function as below:
def g(x, mu):
    x=x[["A","B"]] #or x=x[features]
    y = np.array([np.mean(x) - mu])
    print((y.T @ y)*(len(x)))
    print("")
    return (y.T @ y)*(len(x))
and then:
sb=df.groupby(['group']).apply(g, mu=(mu)).sum()
print(sb)
which gives:
[[2.54166667 1.83333333]
[1.83333333 1.33333333]]
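For comparison, the between-cluster scatter matrix can also be accumulated with an explicit loop over the groups, which sidesteps summing matrices through groupby(...).apply(...). This is only a sketch under the assumption that mu is the overall feature mean (which is what the count-weighted average of group means in the question evaluates to); the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
                   'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
                   'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = ['A', 'B']

# Overall mean of each feature (assumed equal to the weighted mean of group means).
mu = df[features].mean().to_numpy()

Sb = np.zeros((len(features), len(features)))
for _, grp in df.groupby('group'):
    # Deviation of this group's mean from the overall mean.
    d = grp[features].mean().to_numpy() - mu
    # Add the outer product, weighted by the group size.
    Sb += len(grp) * np.outer(d, d)

print(Sb)
```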

Merge histograms with different ranges

Is there a fast way to merge two numpy histograms with different bin ranges and numbers of bins?
For example:
x = [1,2,2,3]
y = [4,5,5,6]
a = np.histogram(x, bins=10)
# a[0] = [1, 0, 0, 0, 0, 2, 0, 0, 0, 1]
# a[1] = [ 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ]
b = np.histogram(y, bins=5)
# b[0] = [1, 0, 2, 0, 1]
# b[1] = [ 4. , 4.4, 4.8, 5.2, 5.6, 6. ]
Now I want to have some function like this:
def merge(a, b):
    # some actions here #
    return merged_a_b_values, merged_a_b_bins
Actually, I do not have x and y; only a and b are known.
But the result of merge(a, b) must be equal to np.histogram(x+y, bins=10):
m = merge(a, b)
# m[0] = [1, 0, 2, 0, 1, 0, 1, 0, 2, 1]
# m[1] = [ 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ]
I'd actually have added a comment to dangom's answer, but I lack the reputation required.
I'm a little confused by your example. You're plotting the histogram of the histogram bins if I'm not mistaken. It should rather be this, right?
plt.figure()
plt.plot(a[1][:-1], a[0], marker='.', label='a')
plt.plot(b[1][:-1], b[0], marker='.', label='b')
plt.plot(c[1][:-1], c[0], marker='.', label='c')
plt.legend()
plt.show()
Also, a note on your suggestion for combining the histograms. You are of course right that there is no unique solution, as you simply don't know where the samples would have been in the finer grid you use for the combination. When the two histograms have significantly different bin widths, the suggested merging function may result in a sparse and artificial-looking histogram.
I tried combining the histograms by interpolation (assuming the samples within the count bin were distributed uniformly in the original bin - which is of course also only an assumption).
This leads however to a more natural looking result, at least for data sampled from distributions I typically encounter.
import numpy as np
def merge_hist(a, b):
    edgesa = a[1]
    edgesb = b[1]
    da = edgesa[1]-edgesa[0]
    db = edgesb[1]-edgesb[0]
    dint = np.min([da, db])
    # lo/hi instead of min/max to avoid shadowing the builtins
    lo = np.min(np.hstack([edgesa, edgesb]))
    hi = np.max(np.hstack([edgesa, edgesb]))
    edgesc = np.arange(lo, hi, dint)
    def interpolate_hist(edgesint, edges, hist):
        cumhist = np.hstack([0, np.cumsum(hist)])
        cumhistint = np.interp(edgesint, edges, cumhist)
        histint = np.diff(cumhistint)
        return histint
    histaint = interpolate_hist(edgesc, edgesa, a[0])
    histbint = interpolate_hist(edgesc, edgesb, b[0])
    c = histaint + histbint
    return c, edgesc
An example for two gaussian distributions:
import numpy as np
import matplotlib.pyplot as plt
a = 5 + 1*np.random.randn(100)
b = 10 + 2*np.random.randn(100)
hista, edgesa = np.histogram(a, bins=10)
histb, edgesb = np.histogram(b, bins=5)
histc, edgesc = merge_hist([hista, edgesa], [histb, edgesb])
plt.figure()
width = edgesa[1]-edgesa[0]
plt.bar(edgesa[:-1], hista, width=width)
width = edgesb[1]-edgesb[0]
plt.bar(edgesb[:-1], histb, width=width)
plt.figure()
width = edgesc[1]-edgesc[0]
plt.bar(edgesc[:-1], histc, width=width)
plt.show()
I, however, am no statistician, so please let me know if the suggested approach is viable.
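A variant of the same interpolation idea can be sketched with a fixed number of output bins, as in the question's np.histogram(x+y, bins=10). The names rebin and merge_on_grid are hypothetical, and the sketch keeps the assumption that samples are spread uniformly within each original bin. It builds the new edges with np.linspace so that the last edge is included and the total count is preserved:

```python
import numpy as np

def rebin(counts, edges, new_edges):
    # Cumulative counts at the original edges, starting from 0.
    cum = np.concatenate([[0.0], np.cumsum(counts)])
    # Interpolate the cumulative counts onto the new edges; np.interp clamps
    # outside the original range, so nothing is counted there.
    return np.diff(np.interp(new_edges, edges, cum))

def merge_on_grid(a, b, bins=10):
    (counts_a, edges_a), (counts_b, edges_b) = a, b
    lo = min(edges_a[0], edges_b[0])
    hi = max(edges_a[-1], edges_b[-1])
    new_edges = np.linspace(lo, hi, bins + 1)
    merged = rebin(counts_a, edges_a, new_edges) + rebin(counts_b, edges_b, new_edges)
    return merged, new_edges

a = np.histogram([1, 2, 2, 3], bins=10)
b = np.histogram([4, 5, 5, 6], bins=5)
merged, new_edges = merge_on_grid(a, b, bins=10)
print(merged.sum())  # total number of samples is preserved
```

Because the original sample positions are unknown, the merged counts are generally fractional and need not match the histogram of the raw data.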
There is no unique solution to the problem of merging two different histograms. I propose here a simple and quick solution based on two design assumptions necessary to deal with the loss of information inherent in binning sequences:
1. Recovered values are represented by the start of the bin they belong to.
2. The merge shall keep the highest bin resolution to avoid further loss of information and shall completely encompass the intervals of the children histograms.
Here's the code:
import numpy as np
def merge(a, b):
    def extract_vals(hist):
        # Recover values based on assumption 1.
        values = [[y]*x for x, y in zip(hist[0], hist[1])]
        # Return flattened list.
        return [z for s in values for z in s]
    def extract_bin_resolution(hist):
        return hist[1][1] - hist[1][0]
    def generate_num_bins(minval, maxval, bin_resolution):
        # Generate number of bins necessary to satisfy assumption 2
        return int(np.ceil((maxval - minval) / bin_resolution))
    vals = extract_vals(a) + extract_vals(b)
    bin_resolution = min(map(extract_bin_resolution, [a, b]))
    num_bins = generate_num_bins(min(vals), max(vals), bin_resolution)
    return np.histogram(vals, bins=num_bins)
Here's the example code:
import matplotlib.pyplot as plt
x = [1,2,2,3]
y = [4,5,5,6]
a = np.histogram(x, bins=10)
# a[0] = [1, 0, 0, 0, 0, 2, 0, 0, 0, 1]
# a[1] = [ 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ]
b = np.histogram(y, bins=5)
# b[0] = [1, 0, 2, 0, 1]
# b[1] = [ 4. , 4.4, 4.8, 5.2, 5.6, 6. ]
# Merge and plot results
c = merge(a, b)
c_num_bins = c[1].size - 1
plt.hist(a[0], bins=5, label='a')
plt.hist(b[0], bins=10, label='b')
plt.hist(c[0], bins=c_num_bins, label='c')
plt.legend()
plt.show()

Mean value of each element in multiple lists - Python

If I have two lists
a = [2,5,1,9]
b = [4,9,5,10]
How can I find the mean value of each element, so that the resultant list would be:
[3,7,3,9.5]
>>> a = [2,5,1,9]
>>> b = [4,9,5,10]
>>> [(g + h) / 2 for g, h in zip(a, b)]
[3.0, 7.0, 3.0, 9.5]
Referring to your title of the question, you can achieve this simply with:
import numpy as np
multiple_lists = [[2,5,1,9], [4,9,5,10]]
arrays = [np.array(x) for x in multiple_lists]
[np.mean(k) for k in zip(*arrays)]
The above script will handle multiple lists, not just two. If you want to compare the performance of the two approaches, try:
%%time
import random
import statistics
random.seed(33)
multiple_list = []
for seed in random.sample(range(100), 100):
    random.seed(seed)
    multiple_list.append(random.sample(range(100), 100))
result = [statistics.mean(k) for k in zip(*multiple_list)]
or alternatively:
%%time
import random
import numpy as np
random.seed(33)
multiple_list = []
for seed in random.sample(range(100), 100):
    random.seed(seed)
    multiple_list.append(np.array(random.sample(range(100), 100)))
result = [np.mean(k) for k in zip(*multiple_list)]
In my experience, the numpy approach is much faster.
What you want is the mean of two arrays (or vectors in math).
Since Python 3.4, there is a statistics module which provides a mean() function:
statistics.mean(data)
Return the sample arithmetic mean of data, a sequence or iterator of real-valued numbers.
You can use it like this:
import statistics
a = [2, 5, 1, 9]
b = [4, 9, 5, 10]
result = [statistics.mean(k) for k in zip(a, b)]
# -> [3.0, 7.0, 3.0, 9.5]
Note: this solution can be used for more than two lists, because zip() accepts any number of arguments.
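For instance, with three lists (the third list c is made up for illustration), unpacking a list of lists into zip() gives the element-wise mean across all of them:

```python
import statistics

a = [2, 5, 1, 9]
b = [4, 9, 5, 10]
c = [6, 1, 0, 2]  # hypothetical third list

lists = [a, b, c]
# zip(*lists) yields one tuple per position: (2, 4, 6), (5, 9, 1), ...
result = [statistics.mean(col) for col in zip(*lists)]
print(result)
```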
An alternative to using a list and for loop would be to use a numpy array.
import numpy as np
# an array can perform element wise calculations unlike lists.
a, b = np.array([2,5,1,9]), np.array([4,9,5,10])
mean = (a + b)/2; print(mean)
>>>[ 3. 7. 3. 9.5]
Put the two lists into a numpy array using vstack and then take the mean (using 'tolist' to get back from the numpy array):
import numpy as np
a = [2,5,1,9]
b = [4,9,5,10]
np.mean(np.vstack([a,b]), axis=0).tolist()
[3.0, 7.0, 3.0, 9.5]
It seems you are looking for an element-wise mean value. Setting axis=0 in np.mean is what you need.
>>> import numpy as np
>>> a = [2,5,1,9]
>>> b = [4,9,5,10]
Create a list containing all your lists
>>> a_b = [a,b]
>>> a_b
[[2, 5, 1, 9], [4, 9, 5, 10]]
Use np.mean and set the axis to 0
>>> np.mean(a_b, axis=0)
array([3. , 7. , 3. , 9.5])

Calculate moving average in numpy array with NaNs

I am trying to calculate the moving average in a large numpy array that contains NaNs. Currently I am using:
import numpy as np
def moving_average(a,n=5):
    ret = np.cumsum(a,dtype=float)
    ret[n:] = ret[n:]-ret[:-n]
    return ret[n-1:]/n
When calculating with a masked array:
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx).filled(np.nan)
print(y)
>>> array([3.8,3.8,3.6,nan,nan,nan,2,2.4,nan,nan,nan,2.8,2.6])
The result I am looking for (below) should ideally have NaNs only in the place where the original array, x, had NaNs and the averaging should be done over the number of non-NaN elements in the grouping (I need some way to change the size of n in the function.)
y = array([4.75,4.75,nan,4.4,3.75,2.33,3.33,4,nan,nan,3,3.5,nan,3.25,4,4.5,3])
I could loop over the entire array and check index by index but the array I am using is very large and that would take a long time. Is there a numpythonic way to do this?
Pandas has a lot of really nice functionality for this. For example:
import numpy as np
import pandas as pd
x = np.array([np.nan, np.nan, 3, 3, 3, np.nan, 5, 7, 7])
# requires three valid values in a row or the resulting value is null
print(pd.Series(x).rolling(3).mean())
#output
nan,nan,nan, nan, 3, nan, nan, nan, 6.333
# only requires 2 valid values out of three for size=3 window
print(pd.Series(x).rolling(3, min_periods=2).mean())
#output
nan, nan, nan, 3, 3, 3, 4, 6, 6.3333
You can play around with the windows/min_periods and consider filling-in nulls all in one chained line of code.
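Applied to the array from the question, one possible sketch is shown below. It assumes the desired window starts at each element and looks forward over up to n values (which is what the expected output in the question suggests), so the series is reversed before and after the rolling call; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

x = np.array([1., 3, np.nan, 7, 8, 1, 2, 4, np.nan, np.nan, 4, 4, np.nan, 1, 3, 6, 3])
n = 5

# rolling() looks backward, so reverse the data to emulate a forward-looking window.
rev = pd.Series(x[::-1])
# min_periods=1: average over however many non-NaN values the window contains.
fwd = rev.rolling(n, min_periods=1).mean().to_numpy()[::-1].copy()
# Restore NaN wherever the original array had NaN.
fwd[np.isnan(x)] = np.nan
print(fwd)
```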
To add to the great earlier answers: you can still use cumsum to achieve this:
import numpy as np
def moving_average(a, n=5):
    ret = np.cumsum(a.filled(0))
    ret[n:] = ret[n:] - ret[:-n]
    counts = np.cumsum(~a.mask)
    counts[n:] = counts[n:] - counts[:-n]
    ret[~a.mask] /= counts[~a.mask]
    ret[a.mask] = np.nan
    return ret
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx)
You could create a temporary array and use np.nanmean() (new in version 1.8 if I'm not mistaken):
import numpy as np
temp = np.vstack([x[i:-(5-i)] for i in range(5)]) # stacks vertically the strided arrays
means = np.nanmean(temp, axis=0)
and put the original NaNs back in place with means[np.isnan(x[:-5])] = np.nan
However, this looks redundant both in terms of memory (stacking the same array strided 5 times) and computation.
If I understand correctly, you want to create a moving average and then populate the resulting elements as nan if their index in the original array was nan.
import numpy as np
>>> inc = 5 #the moving avg increment
>>> x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
>>> mov_avg = np.array([np.nanmean(x[idx:idx+inc]) for idx in range(len(x))])
# Determine indices in x that are nans
>>> nan_idxs = np.where(np.isnan(x))[0]
# Populate output array with nans
>>> mov_avg[nan_idxs] = np.nan
>>> mov_avg
array([ 4.75, 4.75, nan, 4.4, 3.75, 2.33333333, 3.33333333, 4., nan, nan, 3., 3.5, nan, 3.25, 4., 4.5, 3.])
Here's an approach using strides -
w = 5 # Window size
n = x.strides[0]
avgs = np.nanmean(np.lib.stride_tricks.as_strided(x,
                  shape=(x.size-w+1,w), strides=(n,n)),1)
x_rem = np.append(x[-w+1:],np.full(w-1,np.nan))
avgs_rem = np.nanmean(np.lib.stride_tricks.as_strided(x_rem,
                      shape=(w-1,w), strides=(n,n)),1)
avgs = np.append(avgs,avgs_rem)
avgs[np.isnan(x)] = np.nan
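On NumPy 1.20 or later, the same windows can be built with sliding_window_view, which avoids computing strides by hand. This is a sketch of that swapped-in technique rather than the code above; it pads the end of the array with NaNs so that every element gets a (possibly shorter) forward-looking window:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.array([1., 3, np.nan, 7, 8, 1, 2, 4, np.nan, np.nan, 4, 4, np.nan, 1, 3, 6, 3])
w = 5  # window size

# Pad with w-1 NaNs so the last elements still get a window of length w.
padded = np.concatenate([x, np.full(w - 1, np.nan)])
windows = sliding_window_view(padded, w)  # shape (x.size, w), read-only view
# nanmean ignores the NaNs inside each window.
avgs = np.nanmean(windows, axis=1)
# Restore NaN wherever the original array had NaN.
avgs[np.isnan(x)] = np.nan
print(avgs)
```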
Currently, the bottleneck package should do the trick quite reliably and quickly. Here is a slightly adjusted example from https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.move_mean:
>>> import bottleneck as bn
>>> a = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
>>> bn.move_mean(a, window=2)
array([ nan, 1.5, 2.5, nan, nan])
>>> bn.move_mean(a, window=2, min_count=1)
array([ 1. , 1.5, 2.5, 3. , 5. ])
Note that the resulting means correspond to the last index of the window.
The package is available from the Ubuntu repos, pip, etc. It can operate over an arbitrary axis of a NumPy array. Besides that, it is claimed to be faster than a plain NumPy implementation in many cases.
