I have a list object, and I want to know how many numbers fall in a particular interval. The code is as follows:
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3
a = list(map(lambda x: int(x / interval), a))
for i in range(min(a), max(a) + 1):
    print(i * interval, (i + 1) * interval, ':', a.count(i))
Output
0 3 : 2
3 6 : 4
6 9 : 5
9 12 : 1
12 15 : 1
15 18 : 1
18 21 : 0
21 24 : 0
24 27 : 0
27 30 : 1
Is there a simple way to get this information? The simpler the better.
Now that we're talking about performance, I'd like to offer my numpy solution using bincount:
import numpy as np
interval = 3
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
l = max(a) // interval + 1
b = np.bincount(a, minlength=l*interval).reshape((l,interval)).sum(axis=1)
(minlength is necessary just to be able to reshape when max(a) + 1 isn't a multiple of interval)
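To make the reshape step concrete, here is a small sketch (my own illustration) of the intermediate arrays for the example input:
import numpy as np

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3
l = max(a) // interval + 1                 # 10 blocks of width 3 cover 0..29

counts = np.bincount(a, minlength=l * interval)
print(counts.shape)                        # (30,); without minlength it would be (29,)
print(counts.reshape((l, interval)).sum(axis=1))
# [2 4 5 1 1 1 0 0 0 1]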
With the labels taken from Erfan's answer we get:
rnge = range(0, max(a) + interval + 1, interval)
labels = [f'[{i}-{j})' for i, j in zip(rnge[:-1], rnge[1:])]
for lab, cnt in zip(labels, b):
    print(lab, cnt)
[0-3) 2
[3-6) 4
[6-9) 5
[9-12) 1
[12-15) 1
[15-18) 1
[18-21) 0
[21-24) 0
[24-27) 0
[27-30) 1
This is much faster than the pandas solution.
Performance and scaling comparison
In order to assess the scaling behaviour, I just replaced the input with a = [1, ..., 28] * n and timed the execution (without imports and printing) for n = 1, 10, 100, 1K, 10K and 100K (Python 3.7.3 on win32 / pandas 0.24.2 / numpy 1.16.2).
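For reference, a minimal sketch of the kind of harness described (assumed, not the exact code used above; bincount_bins is just an illustrative name):
import timeit
import numpy as np

base = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3

def bincount_bins(a):
    l = max(a) // interval + 1
    return np.bincount(a, minlength=l * interval).reshape((l, interval)).sum(axis=1)

for n in (1, 10, 100, 1000):
    a = base * n
    t = timeit.timeit(lambda: bincount_bins(a), number=100)
    print(f'n={n}: {t / 100:.6f} s per run')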
Pandas solution with pd.cut and groupby
s = pd.Series(a)
bins = pd.cut(s, range(0, s.max() + interval, interval), right=False)
s.groupby(bins).count()
[0, 3) 2
[3, 6) 4
[6, 9) 5
[9, 12) 1
[12, 15) 1
[15, 18) 1
[18, 21) 0
[21, 24) 0
[24, 27) 0
[27, 30) 1
dtype: int64
To get cleaner bin labels, we can use this method from the linked answer:
s = pd.Series(a)
rnge = range(0, s.max() + interval, interval)
labels = [f'{i}-{j}' for i, j in zip(rnge[:-1], rnge[1:])]
bins = pd.cut(s, range(0, s.max() + interval, interval), right=False, labels=labels)
s.groupby(bins).count()
0-3 2
3-6 4
6-9 5
9-12 1
12-15 1
15-18 1
18-21 0
21-24 0
24-27 0
27-30 1
dtype: int64
You can do it in one line using a dictionary comprehension:
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
{"[{};{}[".format(x, x + 3): len([y for y in a if x <= y < x + 3])
 for x in range(min(a), max(a), 3)}
Output:
{'[1;4[': 3,
'[4;7[': 4,
'[7;10[': 5,
'[10;13[': 1,
'[13;16[': 0,
'[16;19[': 1,
'[19;22[': 0,
'[22;25[': 0,
'[25;28[': 0}
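Note that because the range starts at min(a) and stops before max(a), the bins above begin at 1 and the value 28 itself falls outside the last bin (the counts sum to 14, not 15). A variant anchored at 0 (my own tweak, aligned with the bins used in the other answers) would be:
{"[{};{}[".format(x, x + 3): len([y for y in a if x <= y < x + 3])
 for x in range(0, max(a) + 1, 3)}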
Performance comparison:
Pandas solution with pd.cut and groupby : 8.51 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Dictionary comprehension : 19.7 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using np.bincount : 22.4 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In the following I time 10_000_000 checks of whether 10 is in {0, ..., 9}.
In the first check I use an intermediate variable and in the second one I use a literal.
import timeit
x = 10
s = set(range(x))
number = 10 ** 7
stmt = f'my_set = {s} ; {x} in my_set'
print(f'eval "{stmt}"')
print(timeit.timeit(stmt=stmt, number=number))
stmt = f'{x} in {s}'
print(f'eval "{stmt}"')
print(timeit.timeit(stmt=stmt, number=number))
Output:
eval "my_set = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} ; 10 in my_set"
1.2576093
eval "10 in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}"
0.20336140000000036
How is it that the second one is way faster (by a factor of roughly 5 to 6)? Is there some runtime optimisation performed by Python, e.g., when the membership check is made on a literal? Or is it maybe due to garbage collection (since it is a literal, Python garbage-collects it right after use)?
You aren't timing the same things. In the first test, you're also timing the set construction, an assignment, and extra name lookups in addition to the membership test:
In [1]: import dis
...: x = 10
...: s = set(range(x))
In [2]: dis.dis("x in s")
1 0 LOAD_NAME 0 (x)
2 LOAD_NAME 1 (s)
4 CONTAINS_OP 0
6 RETURN_VALUE
In [3]: dis.dis("my_set = s; x in my_set")
1 0 LOAD_NAME 0 (s)
2 STORE_NAME 1 (my_set)
4 LOAD_NAME 2 (x)
6 LOAD_NAME 1 (my_set)
8 CONTAINS_OP 0
10 POP_TOP
12 LOAD_CONST 0 (None)
14 RETURN_VALUE
# By request
In [4]: dis.dis("s = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; 10 in s")
1 0 BUILD_SET 0
2 LOAD_CONST 0 (frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}))
4 SET_UPDATE 1
6 STORE_NAME 0 (s)
8 LOAD_CONST 1 (10)
10 LOAD_NAME 0 (s)
12 CONTAINS_OP 0
14 POP_TOP
16 LOAD_CONST 2 (None)
18 RETURN_VALUE
The actual difference between using a literal and x in s is that the latter needs to perform lookups in globals, i.e., the difference is LOAD_NAME vs LOAD_CONST:
In [5]: dis.dis("10 in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}")
1 0 LOAD_CONST 0 (10)
2 LOAD_CONST 1 (frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}))
4 CONTAINS_OP 0
6 RETURN_VALUE
Times:
In [6]: %timeit x in s
28.5 ns ± 0.792 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [7]: %timeit 10 in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
20.3 ns ± 0.384 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
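For an apples-to-apples comparison (my own sketch, not part of the answer above), the set can be built once in timeit's setup, so the statement measures only a local name lookup plus the membership test:
import timeit

number = 10 ** 7
# set built once in setup; the statement only does the lookup each iteration
print(timeit.timeit('10 in s', setup='s = set(range(10))', number=number))
# literal version, compiled to a frozenset constant as shown by dis above
print(timeit.timeit('10 in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}', number=number))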
I have a DataFrame that can be produced using this Python code:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': [t for t in range(6)] * 2,
                   'observations': [o for o in range(12)]})
The following code enables me to reformat the data as desired:
dflist = []
for v_ in df.visit.unique():
    for t_ in df.time[df.visit == v_]:
        dflist.append([df[(df.visit == v_) & (df.time <= t_)].groupby('visit')['observations'].apply(list)])
pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
However this is extremely slow.
I have tried using .expanding(); however, this only returns scalars, whereas I would like lists (or numpy arrays).
I would appreciate any help in vectorizing or otherwise optimizing this procedure.
Thanks
Fortunately, in pandas 1.1.0 and newer, expanding produces an iterable which can be used to take advantage of the faster grouping while producing non-scalar data like lists:
new_df = pd.DataFrame({
    'observations':
        [list(x) for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])
new_df:
observations
visit
1 [0]
1 [0, 1]
1 [0, 1, 2]
1 [0, 1, 2, 3]
1 [0, 1, 2, 3, 4]
1 [0, 1, 2, 3, 4, 5]
2 [6]
2 [6, 7]
2 [6, 7, 8]
2 [6, 7, 8, 9]
2 [6, 7, 8, 9, 10]
2 [6, 7, 8, 9, 10, 11]
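A hedged peek (assuming pandas >= 1.1.0) at what the iterable yields: each element is a Series holding one expanding window within its group, which is why list(x) gives the prefix lists shown above.
windows = list(df.groupby('visit')['observations'].expanding())
print(list(windows[2]))   # [0, 1, 2]
print(list(windows[7]))   # [6, 7]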
Timing via %timeit:
Setup:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': [t for t in range(6)] * 2,
                   'observations': [o for o in range(12)]})
Original:
def fn():
    dflist = []
    for v_ in df.visit.unique():
        for t_ in df.time[df.visit == v_]:
            dflist.append([
                df[(df.visit == v_) & (df.time <= t_)]
                .groupby('visit')['observations'].apply(list)
            ])
    return pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
%timeit fn()
13 ms ± 692 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehension with expanding (~13x faster on this sample):
def fn2():
    return pd.DataFrame({
        'observations':
            [list(x) for x in df.groupby('visit')['observations'].expanding()]
    }, index=df['visit'])
%timeit fn2()
967 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sanity Check:
fn().eq(fn2()).all(axis=None) # True
The double apply approach by Quixotic22 (~3.4x faster than the original, ~3.9x slower than comprehension + expanding):
def fn3():
    return (df.
            set_index('visit')['observations'].
            apply(lambda x: [x]).
            reset_index().groupby('visit')['observations'].
            apply(lambda x: x.cumsum()))
%timeit fn3()
3.78 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
*Note: this approach only produces a series of observations and does not include the visit as the index.
fn().eq(fn3()).all(axis=None) # False
Looks like a good solution has been provided but dropping this here as a viable alternative.
(df.
 set_index('visit')['observations'].
 apply(lambda x: [x]).
 reset_index().groupby('visit')['observations'].
 apply(lambda x: x.cumsum())
)
So, I simply want to make this faster:
for x in range(matrix.shape[0]):
    for y in range(matrix.shape[1]):
        if matrix[x][y] == 2 or matrix[x][y] == 3 or matrix[x][y] == 4 or matrix[x][y] == 5 or matrix[x][y] == 6:
            if x not in heights:
                heights.append(x)
It simply iterates over a 2D matrix (usually around 18x18 or 22x22) and records the row index x whenever one of those values is found. But it's kind of slow, and I wonder what the fastest way to do this is.
Thank you very much!
For a numpy based approach, you can do:
np.flatnonzero(((a>=2) & (a<=6)).any(1))
# array([1, 2, 6], dtype=int64)
Where:
a = np.random.randint(0,30,(7,7))
print(a)
array([[25, 27, 28, 21, 18, 7, 26],
[ 2, 18, 21, 13, 27, 26, 2],
[23, 27, 18, 7, 4, 6, 13],
[25, 20, 19, 15, 8, 22, 0],
[27, 23, 18, 22, 25, 17, 15],
[19, 12, 12, 9, 29, 23, 21],
[16, 27, 22, 23, 8, 3, 11]])
Timings on a larger array:
a = np.random.randint(0,30, (1000,1000))
%%timeit
heights = []
for x in range(a.shape[0]):
    for y in range(a.shape[1]):
        if a[x][y] == 2 or a[x][y] == 3 or a[x][y] == 4 or a[x][y] == 5 or a[x][y] == 6:
            if x not in heights:
                heights.append(x)
# 3.17 s ± 59.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
yatu = np.flatnonzero(((a>=2) & (a<=6)).any(1))
# 965 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.allclose(yatu, heights)
# True
Vectorizing with numpy yields roughly a 3200x speedup.
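For reference, the one-liner decomposes into three steps (the intermediate names are mine):
in_range = (a >= 2) & (a <= 6)           # boolean matrix, True where the value is in 2..6
row_has_hit = in_range.any(axis=1)       # one boolean per row
heights = np.flatnonzero(row_has_hit)    # indices of the rows containing such a value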
It looks like you want to find if 2, 3, 4, 5 or 6 appear in the matrix.
You can use np.isin() to create a matrix of true/false values, then use that as an indexer:
>>> arr = np.array([1,2,3,4,4,0]).reshape(2,3)
>>> arr[np.isin(arr, [2,3,4,5,6])]
array([2, 3, 4, 4])
Optionally, turn that into a plain Python set() for faster in lookups and no duplicates.
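For example (a small sketch with the same arr):
>>> present = set(arr[np.isin(arr, [2, 3, 4, 5, 6])].tolist())
>>> present
{2, 3, 4}
>>> 3 in present
True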
To get the positions in the array where those numbers appear, use argwhere:
>>> np.argwhere(np.isin(arr, [2,3,4,5,6]))
array([[0, 1],
[0, 2],
[1, 0],
[1, 1]])
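And if what you ultimately need are the row indices from the original loop (the heights list), one option (my own sketch) is to keep the unique first column of argwhere:
>>> np.unique(np.argwhere(np.isin(arr, [2, 3, 4, 5, 6]))[:, 0])
array([0, 1])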
I have an array and need the max of a rolling difference with a dynamic window.
a = np.array([8, 18, 5,15,12])
print (a)
[ 8 18 5 15 12]
So first I create the matrix of pairwise differences of the array with itself:
b = a - a[:, None]
print (b)
[[ 0 10 -3 7 4]
[-10 0 -13 -3 -6]
[ 3 13 0 10 7]
[ -7 3 -10 0 -3]
[ -4 6 -7 3 0]]
Then I set the upper triangle to 0:
c = np.tril(b)
print (c)
[[ 0 0 0 0 0]
[-10 0 0 0 0]
[ 3 13 0 0 0]
[ -7 3 -10 0 0]
[ -4 6 -7 3 0]]
Last, I need the max value per diagonal, which means:
max([0,0,0,0,0]) = 0
max([-10,13,-10,3]) = 13
max([3,3,-7]) = 3
max([-7,6]) = 6
max([-4]) = -4
So expected output is:
[0, 13, 3, 6, -4]
Is there a nice vectorized solution? Or is another approach possible to get the expected output?
Use ndarray.diagonal
v = [max(c.diagonal(-i)) for i in range(b.shape[0])]
print(v) # [0, 13, 3, 6, -4]
Not sure exactly how efficient this is considering the advanced indexing involved, but this is one way to do that:
import numpy as np
a = np.array([8, 18, 5, 15, 12])
b = a[:, None] - a
# Fill lower triangle with largest negative
b[np.tril_indices(len(a))] = np.iinfo(b.dtype).min # np.finfo for float
# Put diagonals as rows
s = b.strides[1]
diags = np.ndarray((len(a) - 1, len(a) - 1), b.dtype, b, offset=s, strides=(s, (len(a) + 1) * s))
# Get maximum from each row and add initial zero
c = np.r_[0, diags.max(1)]
print(c)
# [ 0 13 3 6 -4]
EDIT:
Another alternative, which may not be what you were looking for though, is just using Numba, for example like this:
import numpy as np
import numba as nb
def max_window_diffs_jdehesa(a):
    a = np.asarray(a)
    dtinf = np.iinfo(a.dtype) if np.issubdtype(a.dtype, np.integer) else np.finfo(a.dtype)
    out = np.full_like(a, dtinf.min)
    _pwise_diffs(a, out)
    return out

@nb.njit(parallel=True)
def _pwise_diffs(a, out):
    out[0] = 0
    for w in nb.prange(1, len(a)):
        for i in range(len(a) - w):
            out[w] = max(a[i] - a[i + w], out[w])

a = np.array([8, 18, 5, 15, 12])
print(max_window_diffs_jdehesa(a))
# [ 0 13 3 6 -4]
Comparing these methods to the original:
import numpy as np
import numba as nb
def max_window_diffs_orig(a):
    a = np.asarray(a)
    b = a - a[:, None]
    out = np.zeros(len(a), b.dtype)
    out[-1] = b[-1, 0]
    for i in range(1, len(a) - 1):
        out[i] = np.diag(b, -i).max()
    return out

def max_window_diffs_jdehesa_np(a):
    a = np.asarray(a)
    b = a[:, None] - a
    dtinf = np.iinfo(b.dtype) if np.issubdtype(b.dtype, np.integer) else np.finfo(b.dtype)
    b[np.tril_indices(len(a))] = dtinf.min
    s = b.strides[1]
    diags = np.ndarray((len(a) - 1, len(a) - 1), b.dtype, b, offset=s, strides=(s, (len(a) + 1) * s))
    return np.concatenate([[0], diags.max(1)])

def max_window_diffs_jdehesa_nb(a):
    a = np.asarray(a)
    dtinf = np.iinfo(a.dtype) if np.issubdtype(a.dtype, np.integer) else np.finfo(a.dtype)
    out = np.full_like(a, dtinf.min)
    _pwise_diffs(a, out)
    return out

@nb.njit(parallel=True)
def _pwise_diffs(a, out):
    out[0] = 0
    for w in nb.prange(1, len(a)):
        for i in range(len(a) - w):
            out[w] = max(a[i] - a[i + w], out[w])
np.random.seed(0)
a = np.random.randint(0, 100, size=100)
r = max_window_diffs_orig(a)
print((max_window_diffs_jdehesa_np(a) == r).all())
# True
print((max_window_diffs_jdehesa_nb(a) == r).all())
# True
%timeit max_window_diffs_orig(a)
# 348 µs ± 986 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit max_window_diffs_jdehesa_np(a)
# 91.7 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit max_window_diffs_jdehesa_nb(a)
# 19.7 µs ± 88.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.random.seed(0)
a = np.random.randint(0, 100, size=10000)
%timeit max_window_diffs_orig(a)
# 651 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit max_window_diffs_jdehesa_np(a)
# 1.61 s ± 6.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit max_window_diffs_jdehesa_nb(a)
# 22 ms ± 967 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The first one may be a bit better for smaller arrays, but doesn't work well for bigger ones. Numba on the other hand is pretty good in all cases.
You can use numpy.diagonal:
a = np.array([8, 18, 5, 15, 12])
b = a - a[:, None]
c = np.tril(b)
for i in range(b.shape[0]):
    print(max(c.diagonal(-i)))
Output:
0
13
3
6
-4
Here's a vectorized solution with strides -
from skimage.util import view_as_windows
n = len(a)
z = np.zeros(n-1,dtype=a.dtype)
p = np.concatenate((a,z))
s = view_as_windows(p,n)
mask = np.tri(n,k=-1,dtype=bool)[:,::-1]
v = s[0]-s
out = np.where(mask,v.min()-1,v).max(1)
With one loop, for memory efficiency -
n = len(a)
out = [max(a[:-i+n]-a[i:]) for i in range(n)]
Use np.max in place of max for better use of array-memory.
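For example (same loop, just with np.max; a sketch):
n = len(a)
out = [np.max(a[:n - i] - a[i:]) for i in range(n)]
# [0, 13, 3, 6, -4]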
You can abuse the fact that reshaping non-square arrays of shape (N+1, N) to (N, N+1) will make diagonals appear as columns
from scipy.linalg import toeplitz
a = toeplitz([1,2,3,4], [1,4,3])
# array([[1, 4, 3],
# [2, 1, 4],
# [3, 2, 1],
# [4, 3, 2]])
a.reshape(3, 4)
# array([[1, 4, 3, 2],
# [1, 4, 3, 2],
# [1, 4, 3, 2]])
Which you can then use like this (note that I've swapped the sign and filled the strict lower triangle with a very small value so it doesn't affect the max):
smallv = -10000 # replace this with np.nan if you have floats
a = np.array([8, 18, 5,15,12])
b = a[:, None] - a
b[np.tril_indices(len(b), -1)] = smallv
d = np.vstack((b, np.full(len(b), smallv)))
d.reshape(len(d) - 1, -1).max(0)[:-1]
# array([ 0, 13, 3, 6, -4])
I want to sum columns of a 2d array dat by row index idx. The following example works but is slow for large arrays. Any idea to speed it up?
import numpy as np
dat = np.arange(18).reshape(6, 3, order = 'F')
idx = np.array([0, 1, 1, 1, 2, 2])
for i in np.unique(idx):
    print(np.sum(dat[idx == i], axis=0))
Output
[ 0 6 12]
[ 6 24 42]
[ 9 21 33]
Approach #1
We can leverage matrix-multiplication with np.dot -
In [56]: mask = idx[:,None] == np.unique(idx)
In [57]: mask.T.dot(dat)
Out[57]:
array([[ 0, 6, 12],
[ 6, 24, 42],
[ 9, 21, 33]])
Approach #2
For the case with idx already sorted, we can use np.add.reduceat -
In [52]: p = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
In [53]: np.add.reduceat(dat, p, axis=0)
Out[53]:
array([[ 0, 6, 12],
[ 6, 24, 42],
[ 9, 21, 33]])
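If idx were not already sorted, a sketch (mine, not from the answer above) would sort first and then apply the same reduceat:
order = np.argsort(idx, kind='stable')    # bring equal group indices together
idx_s, dat_s = idx[order], dat[order]
p = np.flatnonzero(np.r_[True, idx_s[:-1] != idx_s[1:]])
out = np.add.reduceat(dat_s, p, axis=0)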
A bit faster approach with a set object and the ndarray.sum() method:
In [216]: for i in set(idx):
...: print(dat[idx == i].sum(axis=0))
...:
[ 0 6 12]
[ 6 24 42]
[ 9 21 33]
Execution time comparison:
In [217]: %timeit for i in np.unique(idx): r = np.sum(dat[idx==i], axis = 0)
109 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [218]: %timeit for i in set(idx): r = dat[idx == i].sum(axis=0)
71.1 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)