How to obtain the ranks in a 2D NumPy array - Python

I'm trying to obtain the ranks in a 2D array, along axis=1, with no repeated ranks.
Suppose I have the array below:
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])
I would expect the following result, for a 'hi-lo' rank:
array([[ 2.,  1.,  3.,  4.],
       [ 3., nan,  1.,  2.],
       [nan,  1.,  2.,  3.]])
And the following result, for a 'lo-hi' rank:
array([[ 2.,  4.,  3.,  1.],
       [ 1., nan,  3.,  2.],
       [nan,  3.,  1.,  2.]])
I've been using scipy.stats.rankdata, but this solution is very time-consuming for large arrays. Also, the code I'm using (shown below) relies on np.apply_along_axis, which I know is not very efficient. scipy.stats.rankdata does accept an axis argument, but under the hood it uses exactly np.apply_along_axis (see here).
import numpy as np
from scipy.stats import rankdata

def f(array, order='hi-lo'):
    array = np.asarray(array)
    lo_hi_rank = np.apply_along_axis(rankdata, 1, array, 'ordinal')
    lo_hi_rank = lo_hi_rank.astype(float)
    lo_hi_rank[np.isnan(array)] = np.nan
    if order == 'lo-hi':
        return lo_hi_rank
    else:
        return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1
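For reference, calling f on the example array reproduces the expected outputs shown above (a quick usage sketch, with the example array bound to a):
a = np.array([[4.32, 6.43, 4.32, 2.21],
              [0.65, np.nan, 8.12, 6.43],
              [np.nan, 4.32, 1.23, 1.23]])

f(a, order='hi-lo')  # descending ranks, as in the first expected result
f(a, order='lo-hi')  # ascending ranks, as in the second expected result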
Does anyone know a faster implementation?
Update
I've compared the execution time of all the options suggested so far.
Option 1 below is an explicit-loop version of the code I suggested above (which is repeated below as Option 2).
def option1(a, order='ascending'):
    ranks = np.empty_like(a)
    for row in range(ranks.shape[0]):
        lo_hi_rank = rankdata(a[row], method='ordinal')
        lo_hi_rank = lo_hi_rank.astype(float)
        lo_hi_rank[np.isnan(a[row])] = np.nan
        if order == 'ascending':
            ranks[row] = lo_hi_rank.copy()
        else:
            ranks[row] = np.nanmax(lo_hi_rank) - lo_hi_rank + 1
    return ranks

def option2(a, order='ascending'):
    a = np.asarray(a)
    lo_hi_rank = np.apply_along_axis(rankdata, 1, a, 'ordinal')
    lo_hi_rank = lo_hi_rank.astype(float)
    lo_hi_rank[np.isnan(a)] = np.nan
    if order == 'ascending':
        return lo_hi_rank
    else:
        return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1
Options 3-6 were suggested by Divakar:
def option3(a, order='ascending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(na, -np.inf, -a)
    else:
        b = np.where(na, -np.inf, a)
    out = b.argsort(1, 'stable').argsort(1) + 1. - sm
    out[out <= 0] = np.nan
    return out

def option4(a, order='ascending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(na, -np.inf, -a)
    else:
        b = np.where(na, -np.inf, a)
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    sidx = np.empty((m, n), dtype=float)
    np.put_along_axis(sidx, idx, np.arange(1, n + 1), axis=1)
    out = sidx - sm
    out[out <= 0] = np.nan
    return out

def option5(a, order='descending'):
    b = -a if order == 'descending' else a
    out = b.argsort(1, 'stable').argsort(1) + 1.
    return np.where(np.isnan(a), np.nan, out)

def option6(a, order='descending'):
    b = -a if order == 'descending' else a
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    out = np.empty((m, n), dtype=float)
    np.put_along_axis(out, idx, np.arange(1, n + 1), axis=1)
    return np.where(np.isnan(a), np.nan, out)
Option 6 seems to be the cleanest and is indeed the fastest (~40% faster than Option 2). See below the average execution times over 100 iterations, with array.shape=(5348, 1225):
>> TIME COMPARISON
>> 100 iterations | array.shape=(5348, 1225)
>> Option1: 0.4838 seconds
>> Option2: 0.3404 seconds
>> Option3: 0.3355 seconds
>> Option4: 0.2331 seconds
>> Option5: 0.3145 seconds
>> Option6: 0.2114 seconds
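For reference, the numbers above can be reproduced with a minimal harness along these lines (a sketch; the exact timing code and the NaN-sprinkled random input are my assumptions, not the original setup):
import time
import numpy as np

# Hypothetical benchmark input: random data with ~5% NaNs injected.
a = np.random.rand(5348, 1225)
a[np.random.rand(*a.shape) < 0.05] = np.nan

for func in (option1, option2, option3, option4, option5, option6):
    start = time.perf_counter()
    for _ in range(100):
        func(a, order='descending')
    elapsed = (time.perf_counter() - start) / 100
    print('{}: {:.4f} seconds'.format(func.__name__, elapsed))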
It can also be extended to a generic axis and a generic n-dim array, as proposed by Divakar. However, it is still too time-consuming for what I'm trying to achieve (I'll have to run this function millions of times within a loop). Is there a faster alternative? Or have we reached the limit of what's feasible in Python?

Method #1
Here's one way -
def rank_with_nans(a, order='descending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(na, -np.inf, -a)
    else:
        b = np.where(na, -np.inf, a)
    out = b.argsort(1, 'stable').argsort(1) + 1. - sm
    out[out <= 0] = np.nan
    return out
We can optimize on the double argsort part with a variation based on this post, shown below -
def rank_with_nans_v2(a, order='descending'):
    na = np.isnan(a)
    sm = na.sum(1, keepdims=True)
    if order == 'descending':
        b = np.where(na, -np.inf, -a)
    else:
        b = np.where(na, -np.inf, a)
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    sidx = np.empty((m, n), dtype=float)
    np.put_along_axis(sidx, idx, np.arange(1, n + 1), axis=1)
    out = sidx - sm
    out[out <= 0] = np.nan
    return out
Sample runs -
In [338]: a
Out[338]:
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])

In [339]: rank_with_nans(a, order='descending')
Out[339]:
array([[ 2.,  1.,  3.,  4.],
       [ 3., nan,  1.,  2.],
       [nan,  1.,  2.,  3.]])

In [340]: rank_with_nans(a, order='ascending')
Out[340]:
array([[ 2.,  4.,  3.,  1.],
       [ 1., nan,  3.,  2.],
       [nan,  3.,  1.,  2.]])
Method #2
Without the inf conversion, here's a version with double argsort -
def rank_with_nans_v3(a, order='descending'):
    b = -a if order == 'descending' else a
    out = b.argsort(1, 'stable').argsort(1) + 1.
    return np.where(np.isnan(a), np.nan, out)
Again, with the argsort-skip trick -
def rank_with_nans_v4(a, order='descending'):
    b = -a if order == 'descending' else a
    idx = b.argsort(1, 'stable')
    m, n = idx.shape
    out = np.empty((m, n), dtype=float)
    np.put_along_axis(out, idx, np.arange(1, n + 1), axis=1)
    return np.where(np.isnan(a), np.nan, out)
Bonus : Extend to generic axis and generic n-dim array
We can extend the proposed solutions to incorporate an axis argument, so that the ranking is applied along that axis. The last solution, v4, seems to be the most efficient one, so let's use it as the basis for the generic version -
def rank_with_nans_along_axis(a, order='descending', axis=-1):
    axis = axis % a.ndim  # normalize a negative axis so the indexer below works
    b = -a if order == 'descending' else a
    idx = b.argsort(axis=axis, kind='stable')
    out = np.empty(idx.shape, dtype=float)
    indexer = tuple([None if i != axis else Ellipsis for i in range(a.ndim)])
    np.put_along_axis(out, idx, np.arange(1, a.shape[axis] + 1, dtype=float)[indexer], axis=axis)
    return np.where(np.isnan(a), np.nan, out)
Sample run -
In [227]: a
Out[227]:
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])

In [228]: rank_with_nans_along_axis(a, order='descending', axis=0)
Out[228]:
array([[ 1.,  1.,  2.,  2.],
       [ 2., nan,  1.,  1.],
       [nan,  2.,  3.,  3.]])

In [229]: rank_with_nans_along_axis(a, order='ascending', axis=0)
Out[229]:
array([[ 2.,  2.,  2.,  2.],
       [ 1., nan,  3.,  3.],
       [nan,  1.,  1.,  1.]])
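To gain confidence in the argsort-based variants, a quick equivalence check against each other can be run (my own sketch, assuming the functions above are defined; the test data mirrors the example):
import numpy as np

a = np.array([[4.32, 6.43, 4.32, 2.21],
              [0.65, np.nan, 8.12, 6.43],
              [np.nan, 4.32, 1.23, 1.23]])

expected = rank_with_nans(a, order='descending')
result = rank_with_nans_v4(a, order='descending')
assert np.allclose(expected, result, equal_nan=True)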

Related

Finding determinant with torch.det doesn't return 0?

I'm trying to find the determinant of a matrix using torch.det. However, it seems like I'm either not doing it right or the function is not working properly (the results should be 0 rather than a small number).
a = torch.tensor([1.0, 1.0])
b = torch.tensor([3.0, 3.0])
c = torch.stack([a, b], dim=1)
print(c)
torch.det(d)
>>> tensor([[1., 3.],
            [1., 3.]])
tensor(1.2517e-06)
Another example:
a = torch.tensor([2, -1, 1]).float()
b = torch.tensor([3, -4, -2]).float()
c = torch.tensor([5, -10, -8]).float()
d = torch.stack([a, b, c], dim=1)
print(d)
print(torch.det(d))
>>>
tensor([[  2.,   3.,   5.],
        [ -1.,  -4., -10.],
        [  1.,  -2.,  -8.]])
tensor(1.2517e-06)
Update 1:
I think I had a typo in the first example (I restarted everything and reran it):
import torch

a = torch.tensor([1.0, 1.0])
b = torch.tensor([3.0, 3.0])
c = torch.stack([a, b], dim=1)
print(c)
torch.det(c)
>>> tensor([[1., 3.],
            [1., 3.]])
tensor(0.)
Though I believe the second example should also be 0.
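A quick check (my own sketch, not part of the original question): the second matrix is indeed singular (its third column is -2 times the first plus 3 times the second), so the exact determinant is 0; the small nonzero result comes from float32 round-off in the factorization torch.det performs. Running the same computation in float64 should shrink the error dramatically:
import torch

d = torch.tensor([[2., 3., 5.],
                  [-1., -4., -10.],
                  [1., -2., -8.]])

print(torch.det(d))           # float32: small nonzero value, e.g. ~1e-6
print(torch.det(d.double()))  # float64: much closer to exactly 0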

How to remove na and count values NxK arrays in numpy in a vectorized way

My situation: I have a pandas DataFrame where, for each row, I have to compute the following.
1) Get the first value, NaNs excluded (df.apply(lambda x: x.dropna().iloc[0]))
2) Get the last value, NaNs excluded (df.apply(lambda x: x.dropna().iloc[-1]))
3) Count the non-NaN values (df.apply(lambda x: len(x.dropna())))
Sample case and expected output:
x = np.array([[1, 2, np.nan], [4, 5, 6], [np.nan, 8, 9]])
1) [1, 4, 8]
2) [2, 6, 9]
3) [2, 3, 2]
And I need to keep it optimized, so I turned to numpy and looked for a way to apply y = x[~numpy.isnan(x)] on an NxK array as a first step. Then, I would use what was shown here (Vectorized way of accessing row specific elements in a numpy array) for 1) and 2), but I am still empty-handed for 3).
Here's one way -
In [756]: x
Out[756]:
array([[  1.,   2.,  nan],
       [  4.,   5.,   6.],
       [ nan,   8.,   9.]])

In [768]: m = ~np.isnan(x)

In [769]: first_idx = m.argmax(1)

In [770]: last_idx = m.shape[1] - m[:,::-1].argmax(1) - 1

In [771]: x[np.arange(len(first_idx)), first_idx]
Out[771]: array([ 1.,  4.,  8.])

In [772]: x[np.arange(len(last_idx)), last_idx]
Out[772]: array([ 2.,  6.,  9.])

In [773]: m.sum(1)
Out[773]: array([2, 3, 2])

Alternatively, we could make use of cumulative summation to get those indices, like so -

In [787]: c = m.cumsum(1)

In [788]: first_idx = (c==1).argmax(1)

In [789]: last_idx = c.argmax(1)
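Wrapped up as a single helper (a sketch based on the steps above; the function name is my own):
import numpy as np

def first_last_count(x):
    # Return (first non-NaN, last non-NaN, non-NaN count) per row.
    m = ~np.isnan(x)
    first_idx = m.argmax(1)
    last_idx = m.shape[1] - m[:, ::-1].argmax(1) - 1
    rows = np.arange(x.shape[0])
    return x[rows, first_idx], x[rows, last_idx], m.sum(1)

x = np.array([[1, 2, np.nan], [4, 5, 6], [np.nan, 8, 9]])
print(first_last_count(x))  # (array([1., 4., 8.]), array([2., 6., 9.]), array([2, 3, 2]))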

Efficient way to compare the values of 3 lists in Python?

I have 3 lists with similar float values in a1, a2, a3 (whose lengths are equal).
for i in length(a1, a2, a3):
    Find the increasing / decreasing order of a1[i], a2[i], a3[i]
    Rank the values based on the order
Is there a simple/efficient way to do this? Rather than writing blocks of if-else statements?
I am trying to calculate the Friedman test ranks in Python. Though there is a scipy.stats.friedmanchisquare function, it doesn't return the ranks (see The Friedman test).
EDIT
I have data like this in Image 1:
a1 has week 1,
a2 has week 2, and
a3 has week 3.
I want to rank the values like in Image 2.
I tried comparing the values using nested if statements like this:
for i in range(0, 10):
    if acc1[i] > acc2[i]:
        if acc1[i] > acc3[i]:
            rank1[i] = 1
            if acc2[i] > acc3[i]:
                rank2[i] = 2
                rank3[i] = 3
friedmanchisquare uses scipy.stats.rankdata. Here's one way you could use rankdata with your three lists. It creates a list called ranks, where ranks[i] is an array containing the ranking of [a1[i], a2[i], a3[i]].
In [41]: a1
Out[41]: [1.0, 2.4, 5.0, 6]

In [42]: a2
Out[42]: [9.0, 5.0, 4, 5.0]

In [43]: a3
Out[43]: [5.0, 6.0, 7.0, 2.0]

In [44]: from scipy.stats import rankdata

In [45]: ranks = [rankdata(row) for row in zip(a1, a2, a3)]

In [46]: ranks
Out[46]:
[array([ 1.,  3.,  2.]),
 array([ 1.,  2.,  3.]),
 array([ 2.,  1.,  3.]),
 array([ 3.,  2.,  1.])]
If you convert that to a single numpy array, you can then easily work with either the rows or columns of ranks:
In [47]: ranks = np.array(ranks)

In [48]: ranks
Out[48]:
array([[ 1.,  3.,  2.],
       [ 1.,  2.,  3.],
       [ 2.,  1.,  3.],
       [ 3.,  2.,  1.]])

In [49]: ranks.sum(axis=0)
Out[49]: array([ 7.,  8.,  9.])
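As a side note, newer SciPy versions (1.4+) let rankdata operate along an axis directly, which avoids the Python-level loop (a sketch, assuming such a SciPy version is available):
import numpy as np
from scipy.stats import rankdata

a1 = [1.0, 2.4, 5.0, 6]
a2 = [9.0, 5.0, 4, 5.0]
a3 = [5.0, 6.0, 7.0, 2.0]

# Rank across the three lists, row by row, in one vectorized call.
ranks = rankdata(np.column_stack([a1, a2, a3]), axis=1)
print(ranks.sum(axis=0))  # array([7., 8., 9.])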
You could define a simple function that returns the sort order of the three values:
def sort3(a, b, c):
    if a >= b:
        if b >= c:
            return (1, 2, 3)
        elif a >= c:
            return (1, 3, 2)
        else:
            return (3, 1, 2)
    elif b >= c:
        if c >= a:
            return (2, 3, 1)
        else:
            return (2, 1, 3)
    else:
        return (3, 2, 1)
Or consider using this https://stackoverflow.com/a/3382369/3224664
def argsort(seq):
    # http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
    return sorted(range(len(seq)), key=seq.__getitem__)

a = [1, 3, 5, 7]
b = [2, 2, 2, 6]
c = [3, 1, 4, 8]
for i in range(len(a)):
    print(argsort([a[i], b[i], c[i]]))

x distance between two lines of points

I have two 1D numpy arrays A and B of size (n, ) and (m, ) respectively which correspond to the x positions of points on a line. I want to calculate the distance between every point in A to every point in B. I then need to use these distances at a set y distance, d, to work out the potential at each point in A.
I'm currently using the following:
V = numpy.zeros(n)
for i in range(n):
    xdist = A[i] - B
    r = numpy.sqrt(xdist**2 + d**2)
    dV = 1/r
    V[i] = numpy.sum(dV)
This works, but for large data sets it can take a while, so I would like to use a function similar to scipy.spatial.distance.cdist, which doesn't work for 1D arrays, and I don't want to add another dimension to the arrays as they become too large.
Vectorized approach
One vectorized approach is to extend A to 2D by introducing a new axis with np.newaxis/None and then make use of broadcasting -
(1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)
Hybrid approach for large arrays
Now, for large arrays, we might have to divide the data into chunks.
Thus, with BSZ as the block size, we would have a hybrid approach, like so -
dsq = d**2
V = np.zeros((n//BSZ, BSZ))
for i in range(n//BSZ):
    V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ, None] - B)**2 + dsq))).sum(1)
Runtime test
Approaches -
def original_app(A, B, d):
    V = np.zeros(n)
    for i in range(n):
        xdist = A[i] - B
        r = np.sqrt(xdist**2 + d**2)
        dV = 1/r
        V[i] = np.sum(dV)
    return V

def vectorized_app1(A, B, d):
    return (1/(np.sqrt((A[:, None] - B)**2 + d**2))).sum(1)

def vectorized_app2(A, B, d, BSZ=100):
    dsq = d**2
    V = np.zeros((n//BSZ, BSZ))
    for i in range(n//BSZ):
        V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ, None] - B)**2 + dsq))).sum(1)
    return V.ravel()
Timings and verification -
In [203]: # Setup inputs
     ...: n, m = 10000, 2000
     ...: A = np.random.rand(n)
     ...: B = np.random.rand(m)
     ...: d = 10

In [204]: out1 = original_app(A, B, d)
     ...: out2 = vectorized_app1(A, B, d)
     ...: out3 = vectorized_app2(A, B, d, BSZ=100)
     ...: print np.allclose(out1, out2)
     ...: print np.allclose(out1, out3)
True
True
In [205]: %timeit original_app(A,B,d)
10 loops, best of 3: 133 ms per loop
In [206]: %timeit vectorized_app1(A,B,d)
10 loops, best of 3: 138 ms per loop
In [207]: %timeit vectorized_app2(A,B,d, BSZ = 100)
10 loops, best of 3: 65.2 ms per loop
We can play around with the parameter block size BSZ -
In [208]: %timeit vectorized_app2(A,B,d, BSZ = 200)
10 loops, best of 3: 74.5 ms per loop
In [209]: %timeit vectorized_app2(A,B,d, BSZ = 50)
10 loops, best of 3: 67.4 ms per loop
Thus, the best variant gives about a 2x speedup over the original, with a block size of 100 at my end.
EDIT: My answer turned out to be nearly identical to Divakar's after a closer look. However, you can save some memory by doing the operations in place. Taking the sum along the second axis is more efficient than along the first.
import numpy

a = numpy.random.randint(0, 10, 10) * 1.
b = numpy.random.randint(0, 10, 10) * 1.
d = 10.  # the fixed y-separation from the question (the value here is illustrative)

xdist = a[:, None] - b
xdist **= 2
xdist += d**2
xdist **= -1
V = numpy.sum(xdist, axis=1)
which gives the same solution as your code.
I would like to use a function similar to scipy.spatial.distance.cdist which doesn't work for 1D arrays and I don't want to add another dimension to the arrays as they become too large.
cdist works fine, you just have to reshape the arrays to have shape (n, 1) instead of (n,). You can add another dimension to a one-dimensional array A without copying the underlying data by using A[:, None] or A.reshape(-1, 1).
For example,
In [56]: from scipy.spatial.distance import cdist

In [57]: A
Out[57]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [58]: B
Out[58]: array([0, 2, 4, 6, 8])

In [59]: A[:, None]
Out[59]:
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [60]: cdist(A[:, None], B[:, None])
Out[60]:
array([[ 0.,  2.,  4.,  6.,  8.],
       [ 1.,  1.,  3.,  5.,  7.],
       [ 2.,  0.,  2.,  4.,  6.],
       [ 3.,  1.,  1.,  3.,  5.],
       [ 4.,  2.,  0.,  2.,  4.],
       [ 5.,  3.,  1.,  1.,  3.],
       [ 6.,  4.,  2.,  0.,  2.],
       [ 7.,  5.,  3.,  1.,  1.],
       [ 8.,  6.,  4.,  2.,  0.],
       [ 9.,  7.,  5.,  3.,  1.]])
To compute V as shown in your code, you can use cdist with metric='sqeuclidean', as follows:
In [72]: d = 3.
In [73]: r = np.sqrt(cdist(A[:,None], B[:,None], metric='sqeuclidean') + d**2)
In [74]: V = (1/r).sum(axis=1)
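As a quick sanity check (my own sketch), the cdist route should agree with the broadcasting one-liner from the other answer:
import numpy as np
from scipy.spatial.distance import cdist

A = np.random.rand(100)
B = np.random.rand(50)
d = 3.

V_cdist = (1 / np.sqrt(cdist(A[:, None], B[:, None], metric='sqeuclidean') + d**2)).sum(axis=1)
V_bcast = (1 / np.sqrt((A[:, None] - B)**2 + d**2)).sum(1)
assert np.allclose(V_cdist, V_bcast)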

Python: convert numpy array of signs to int and back

I'm trying to convert from a numpy array of signs (i.e., a numpy array whose entries are either 1. or -1.) to an integer and back through a binary representation. I have something that works, but it's not Pythonic, and I expect it'll be slow.
def sign2int(s):
    s[s==-1.] = 0.
    bstr = ''
    for i in range(len(s)):
        bstr = bstr + str(int(s[i]))
    return int(bstr, 2)

def int2sign(i, m):
    bstr = bin(i)[2:].zfill(m)
    s = []
    for d in bstr:
        s.append(float(d))
    s = np.array(s)
    s[s==0.] = -1.
    return s
Then
>>> m = 4
>>> s0 = np.array([1., -1., 1., 1.])
>>> i = sign2int(s0)
>>> print i
11
>>> s = int2sign(i, m)
>>> print s
[ 1. -1. 1. 1.]
I'm concerned about (1) the for loops in each and (2) having to build an intermediate representation as a string.
Ultimately, I will want something that works with a 2-d numpy array, too---e.g.,
>>> s = np.array([[1., -1., 1.], [1., 1., 1.]])
>>> print sign2int(s)
[5, 7]
For 1d arrays you can use this one-liner NumPythonic approach, using np.packbits:
>>> np.packbits(np.pad((s0+1).astype(bool).astype(int), (8-s0.size, 0), 'constant'))
array([11], dtype=uint8)
And for reversing:
>>> unpack = (np.unpackbits(np.array([11], dtype=np.uint8))[-4:]).astype(float)
>>> unpack[unpack==0] = -1
>>> unpack
array([ 1., -1., 1., 1.])
And for 2d array:
>>> x, y = s.shape
>>> np.packbits(np.pad((s+1).astype(bool).astype(int), (8-y, 0), 'constant')[-2:])
array([5, 7], dtype=uint8)
And for reversing:
>>> unpack = (np.unpackbits(np.array([5, 7], dtype='uint8'))).astype(float).reshape(x, 8)[:,-y:]
>>> unpack[unpack==0] = -1
>>> unpack
array([[ 1., -1., 1.],
[ 1., 1., 1.]])
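A round-trip check for the 2d case (my own sketch, generalizing the hard-coded [-2:] above to [-x:]):
import numpy as np

s = np.array([[1., -1., 1.], [1., 1., 1.]])
x, y = s.shape
packed = np.packbits(np.pad((s + 1).astype(bool).astype(int), (8 - y, 0), 'constant')[-x:])
unpacked = np.unpackbits(packed).astype(float).reshape(x, 8)[:, -y:]
unpacked[unpacked == 0] = -1
assert (unpacked == s).all()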
I'll start with sign2int: convert from a sign representation to binary.
>>> a
array([ 1., -1., 1., -1.])
>>> (a + 1) / 2
array([ 1., 0., 1., 0.])
>>>
Then you can simply create an array of powers of two, multiply it by the binary and sum.
>>> powers = np.arange(a.shape[-1])[::-1]
>>> np.power(2, powers)
array([8, 4, 2, 1])
>>> a = (a + 1) / 2
>>> powers = np.power(2, powers)
>>> a * powers
array([ 8., 0., 2., 0.])
>>> np.sum(a * powers)
10.0
>>>
Then make it operate on rows by adding axis information and rely on broadcasting.
def sign2int(a):
    # powers of two
    powers = np.arange(a.shape[-1])[::-1]
    np.power(2, powers, powers)
    # sign to "binary" - add one and divide by two
    np.add(a, 1, a)
    np.divide(a, 2, a)
    # scale by powers of two and sum
    np.multiply(a, powers, a)
    return np.sum(a, axis=-1)
>>> b = np.array([a, a, a, a, a])
>>> sign2int(b)
array([ 11., 11., 11., 11., 11.])
>>>
I tried it on a 5 x 400 bit array and it seemed fast
>>> a = a.repeat(100)
>>> b = np.array([a, a, a, a, a])
>>> b
array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
>>> sign2int(b)
array([  2.58224988e+120,   2.58224988e+120,   2.58224988e+120,
         2.58224988e+120,   2.58224988e+120])
>>>
I'll add the reverse if I can figure it out. The best I could do relies on some plain Python without any numpy vectorization magic, and I haven't figured out how to make it work with a sequence of ints other than to iterate over them and convert them one at a time - but the time still seems acceptable.
def foo(n):
    '''yields bits in increasing powers of two
    bit sequence from lsb --> msb
    '''
    while n > 0:
        n, r = divmod(n, 2)
        yield r

def int2sign(n):
    n = int(n)
    a = np.fromiter(foo(n), dtype=np.int8, count=n.bit_length())
    np.multiply(a, 2, a)
    np.subtract(a, 1, a)
    return a[::-1]
Works on 1324:
>>> bin(1324)
'0b10100101100'
>>> a = int2sign(1324)
>>> a
array([ 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, -1], dtype=int8)
Seems to work with 1.2e305:
>>> n = int(1.2e305)
>>> n.bit_length()
1014
>>> a = int2sign(n)
>>> a.shape
(1014,)
>>> s = bin(n)
>>> s = s[2:]
>>> all(2 * int(x) -1 == y for x, y in zip(s, a))
True
>>>
Here are some vectorized versions of your functions:
def sign2int(s):
    return int(''.join(np.where(s == -1., 0, s).astype(int).astype(str)), 2)

def int2sign(i, m):
    tmp = np.array(list(bin(i)[2:].zfill(m)))
    return np.where(tmp == "0", "-1", tmp).astype(int)
s0 = np.array([1., -1., 1., 1.])
sign2int(s0)
# 11
int2sign(11, 5)
# array([-1, 1, -1, 1, 1])
To use your functions on 2-d arrays, you can use the map function:
s = np.array([[1., -1., 1.], [1., 1., 1.]])
map(sign2int, s)
# [5, 7]
map(lambda x: int2sign(x, 4), [5, 7])
# [array([-1, 1, -1, 1]), array([-1, 1, 1, 1])]
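One caveat (my addition): in Python 3, map returns a lazy iterator rather than a list, so wrap it in list() to get the same output:
list(map(sign2int, s))
# [5, 7]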
After a bit of testing, the NumPythonic approach of @wwii that doesn't use strings seems to fit what I need best. For int2sign, I used a for-loop over the exponents with a standard algorithm for the conversion, which will have at most 64 iterations for 64-bit integers. Numpy's broadcasting happens across each integer very efficiently.
packbits and unpackbits are restricted to 8-bit integers; otherwise, I suspect that would've been the best (though I didn't try).
Here are the specific implementations I tested that follow the suggestions in the other answers (thanks to everyone!):
def _sign2int_str(s):
    return int(''.join(np.where(s == -1., 0, s).astype(int).astype(str)), 2)

def sign2int_str(s):
    return np.array(map(_sign2int_str, s))

def _int2sign_str(i, m):
    tmp = np.array(list(bin(i)[2:])).astype(int)
    return np.pad(np.where(tmp == 0, -1, tmp), (m - len(tmp), 0), "constant", constant_values=-1)

def int2sign_str(i, m):
    return np.array(map(lambda x: _int2sign_str(x, m), i.astype(int).tolist())).transpose()

def sign2int_np(s):
    p = np.arange(s.shape[-1])[::-1]
    # map -1/1 signs to 0/1 bits, then weight by powers of two
    b = (s + 1) / 2
    return np.sum(b * np.power(2, p), axis=-1).astype(int)

def int2sign_np(i, m):
    N = i.shape[-1]
    S = np.zeros((m, N))
    for k in range(m):
        b = np.power(2, m - 1 - k).astype(int)
        S[k, :] = np.divide(i.astype(int), b).astype(float)
        i = np.mod(i, b)
    S[S == 0.] = -1.
    return S
And here is my test:
import time
import numpy as np

X = np.sign(np.random.normal(size=(5000, 20)))
N = 100

t = time.time()
for i in range(N):
    S = sign2int_np(X)
print 'sign2int_np: \t{:10.8f} sec'.format((time.time() - t)/N)

t = time.time()
for i in range(N):
    S = sign2int_str(X)
print 'sign2int_str: \t{:10.8f} sec'.format((time.time() - t)/N)

m = 20
S = np.random.randint(0, high=np.power(2, m), size=(5000,))

t = time.time()
for i in range(N):
    X = int2sign_np(S, m)
print 'int2sign_np: \t{:10.8f} sec'.format((time.time() - t)/N)

t = time.time()
for i in range(N):
    X = int2sign_str(S, m)
print 'int2sign_str: \t{:10.8f} sec'.format((time.time() - t)/N)
This produced the following results:
sign2int_np: 0.00165325 sec
sign2int_str: 0.04121902 sec
int2sign_np: 0.00318024 sec
int2sign_str: 0.24846984 sec
I think numpy.packbits is worth another look. Given a real-valued sign array a, you can use numpy.packbits(a > 0). Decompression is done by numpy.unpackbits. This implicitly flattens multi-dimensional arrays so you'll need to reshape after unpackbits if you have a multi-dimensional array.
Note that you can combine bit packing with conventional compression (e.g., zlib or lzma). If there is a pattern or bias to your data, you may get a useful compression factor, but for unbiased random data, you'll typically see a moderate size increase.
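For illustration, a minimal sketch of that combination (my own example; results will vary with how biased the data is):
import zlib
import numpy as np

a = np.sign(np.random.normal(size=1000))  # random +/-1 signs
packed = np.packbits(a > 0)               # 8 signs per byte
compressed = zlib.compress(packed.tobytes())

# For unbiased random signs, compression adds overhead rather than helping.
print(len(packed), len(compressed))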
