2 Column Array Rank with tiebreaker and Save Python numpy

2 Column Array Rank with tiebreaker and Save Python numpy - python

I need to be able to rank an array based on a single column and then again with using a second column as basically a tie breaker and then save those two ranks into the database
Array:
array = np.array(
[(70,3,100),
(72,3,101),
(70,2,102)], dtype=[
('score','int8'),
('tiebreaker','int8'),
('row_id','int8')])
array['score'] = array([70, 72, 70], dtype=int8)
First Rank using only the 'score' column would return
(1,3,1)
Then the second Rank rankings using 'score' and 'tiebreaker' columns
(2,3,1)
Then I want to save those two ranks to the database for example:
result1 = Result.objects.get(id=array[0]['row_id'])
result1.relative_rank = 1
result1.absolute_rank = 2
results.save()

You can use scipy.stats.rankdata, as follows:
In [10]: a
Out[10]:
array([(70, 3, 100), (72, 3, 101), (70, 2, 102)],
dtype=[('score', 'i1'), ('tiebreaker', 'i1'), ('row_id', 'i1')])
In [11]: from scipy.stats import rankdata
First rank:
In [12]: rankdata(a['score'], method='min').astype(int)
Out[12]: array([1, 3, 1])
Second rank:
In [13]: rankdata(256*a['score'] + a['tiebreaker'], method='min').astype(int)
Out[13]: array([2, 3, 1])
The value used in the second rank (256*a['score'] + a['tiebreaker']) relies on the data having type int8.
Check the docstring to see if a different method would be more appropriate for the second rank. If you know there will be no ties in the second rank, the method doesn't matter.

Related

Reorder numpy array to new bitlength elements without loop

If I have a numpy array with elements each representing e.g. a 9-bit integer, is there an easy way (maybe without a loop) to reorder it in a way that the resulting array elements each represent a 8-bit integer with the "lost bits" at the end of the previous element getting added at the beginning of the next element?
for example to get the following
np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100]) # initial array in binarys
# convert to
np.array([0b10011100, 0b01001011, 0b00110011, 0b10011001, 0b01000000]) # resulting array
I hope it is understandable what I want to archive.
Additional info, I don't know if this makes any difference:
All of my 9-bit numbers start with the msb beeing 1 (they are bigger than 255) and the last two bits are always 0, like in the example above.
The arrays I want to process are much bigger with thousands of elements.
Thanks for your help in advance!
edit:
my current (complicated) solution is the following:
import numpy as np
def get_bits(data, offset, leng):
data = (data % (1 << (offset + leng))) >> offset
return data
data1 = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100])
i = 1
part1 = []
part2 = []
for el in data1:
if i == 1:
part2.append(0)
part1.append(get_bits(el, i, 8))
part2.append(get_bits(el, 0, i)<<(8-i))
if i == 8:
i = 1
part1.append(0)
else:
i += 1
if i != 1:
part1.append(0)
res = np.array(part1) + np.array(part2)

It's been bugging me that np.packbits and np.unpackbits are inefficient, so I came up with a bit twiddling answer.
The general idea is to work it like any resampler: you make an output array, and figure out where each piece of the output comes from in the input. You have N elements of 9 bits each, so the output is:
data = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100])
result = np.empty(np.ceil(data.size * 9 / 8).astype(int), dtype=np.uint8)
Every nine output bytes have the following pattern relative to the corresponding eight input bytes. I use {...} to indicate the (inclusive) bits in each input integer:
result[0] = data[0]{8:1}
result[1] = data[0]{0:0} data[1]{8:2}
result[2] = data[1]{1:0} data[2]{8:3}
result[3] = data[2]{2:0} data[3]{8:4}
result[4] = data[3]{3:0} data[4]{8:5}
result[5] = data[4]{4:0} data[5]{8:6}
result[6] = data[5]{5:0} data[6]{8:7}
result[7] = data[6]{6:0} data[7]{8:8}
result[8] = data[7]{7:0}
The index of result (call it i) is really given modulo 9. The index into data is therefore offset by 8 * (i // 9). The lower portion of the byte is given by data[...] >> (i + 1). The upper portion is given by data[...] & ((1 << i) - 1), shifted left by 8 - i bits.
That makes it pretty easy to come up with a vectorized solution:
i = np.arange(result.size)
j = i[:-1]
result[i] = (data[8 * (i // 9) + (i % 9) - 1] & ((1 << i % 9) - 1)) << (8 - i % 9)
result[j] |= (data[8 * (j // 9) + (j % 9)] >> (j % 9 + 1)).astype(np.uint8)
You need to clip the index of the low portion because it may go out of bounds. You don't need to clip the high portion because -1 is a perfectly valid index, and you don't care which element it accesses. And of course numpy won't let you OR or add int elements to a uint8 array, so you have to cast.
>>> [bin(x) for x in result]
['0b10011100', '0b1001011', '0b110011', '0b10011001', '0b1000000']
This solution should be scalable to arrays of any size, and I wrote it so that you can work out different combinations of shifts, not just 9-to-8.

You can do it in two steps with np.unpackbits and np.packbits. First turn your array into a little-endian column vector:
>>> z = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100], dtype='<u2').reshape(-1, 1)
>>> z.view(np.uint8)
array([[ 56, 1],
[ 44, 1],
[156, 1],
[148, 1]], dtype=uint8)
You can convert this into an array of bits directly by unpacking. In fact, at some point (PR #10855) I added the count parameter to chop of the high zeros for you:
>>> np.unpackbits(z.view(np.uint8), axis=1, bitorder='l', count=9)
array([[0, 0, 0, 1, 1, 1, 0, 0, 1],
[0, 0, 1, 1, 0, 1, 0, 0, 1],
[0, 0, 1, 1, 1, 0, 0, 1, 1],
[0, 0, 1, 0, 1, 0, 0, 1, 1]], dtype=uint8)
Now you can just repack the reversed raveled array:
>>> u = np.unpackbits(z.view(np.uint8), axis=1, bitorder='l', count=9)[:, ::-1].ravel()
>>> result = np.packbits(u)
>>> result.dtype
dtype('uint8')
>>> [bin(x) for x in result]
['0b10011100', '0b1001011', '0b110011', '0b10011001', '0b1000000']
If your machine is native little endian (e.g., most intel architectures), you can do this in a one-liner:
z = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100])
result = np.packbits(np.unpackbits(z.view(np.uint8), axis=1, bitorder='l', count=9)[:, ::-1].ravel())
Otherwise, you can do z.byteswap().view(np.uint8) to get the right starting order (still one liner though, I suppose).

I think I understood most of what you want, and given that you can do bit operation with numpy arrays in which case you get the desire bit operation element wise if do it with two array (or the same for all if it is an array vs a number), then you need to construct the appropriate arrays to do the thing, so something like this
>>> import numpy as np
>>> x = np.array([0b100111000, 0b100101100, 0b110011100, 0b110010100])
>>> goal=np.array([0b10011100, 0b01001011, 0b00110011, 0b10011001, 0b01000000])
>>> x
array([312, 300, 412, 404])
>>> goal
array([156, 75, 51, 153, 64])
>>> shift1 = np.array(range(1,1+len(x)))
>>> shift1
array([1, 2, 3, 4])
>>> mask1 = np.array([2**n -1 for n in range(1,1+len(x))])
>>> mask1
array([ 1, 3, 7, 15])
>>> res=((x>>shift1)|((x&mask1)<<(9-shift1)))&0b11111111
>>> res
array([156, 75, 51, 153], dtype=int32)
>>> goal
array([156, 75, 51, 153, 64])
>>>
I don't understand why your goal array have one extra element, but the above operation give the others numbers, and adding one extra is not complicated, so adjust as necessary.
Now for explaining the ((x>>shift1)|((x&mask1)<<(9-shift1)))&0b11111111
First I notice you do a bigger shift by element, that is simple
>>> x>>shift1
array([156, 75, 51, 25], dtype=int32)
>>>
>>> list(map(bin,x>>shift1))
['0b10011100', '0b1001011', '0b110011', '0b11001']
>>>
We also want to catch the bits that would be lose with the shift, with an and with an appropriate mask we get those
>>> x&mask1
array([0, 0, 4, 4], dtype=int32)
>>> list(map(bin,mask1))
['0b1', '0b11', '0b111', '0b1111']
>>> list(map(bin,x&mask1))
['0b0', '0b0', '0b100', '0b100']
>>>
then we right shift that result by the complementary amount
>>> 9-shift1
array([8, 7, 6, 5])
>>> ((x&mask1)<<(9-shift1))
array([ 0, 0, 256, 128], dtype=int32)
>>> list(map(bin,_))
['0b0', '0b0', '0b100000000', '0b10000000']
>>>
then we or both together
>>> (x>>shift1) | ((x&mask1)<<(9-shift1))
array([156, 75, 307, 153], dtype=int32)
>>> list(map(bin,_))
['0b10011100', '0b1001011', '0b100110011', '0b10011001']
>>>
and finally we and that with 0b11111111 to keep only the 8 bit we want
Additionally, you mention that the last 2 bit are always zero, then a more simple solution is simple shift it by 2, and to recover the original just shift in back in the other direction
>>> x
array([312, 300, 412, 404])
>>> y = x>>2
>>> y
array([ 78, 75, 103, 101], dtype=int32)
>>> y<<2
array([312, 300, 412, 404], dtype=int32)
>>>

Numpy apply the same random permutation to two different ndarray with the same shape

Suppose I have two NumPy arrays, say, each of shape (N,1). The first array is an Array called Age and the second called income. Suppose These attributes are samples of different people, so the ith index refers to the ith person in the sample and by knowing i I can retrieve both his age and income.
Now suppose I want to permutate both arrays randomly (or deterministically) so that both undergo the same permutation? I mean, After the permutation, the index j of both arrays refer to the attribute of the same person?
I know one way of doing this is defining objects of a person with two attributes: age and income, but I want the Numpy way of doing so.
Thanks.

You could first create a permutation of indices, then access both arrays with the same permutation of indices. This could be done using numpy.random.permutation, for instance.
Example:
>>> age = np.random.randint(0,100,10)
>>> income = np.random.randint(0,10000,10)
>>> age
array([38, 4, 70, 16, 8, 29, 1, 41, 54, 60])
>>> income
array([4797, 5884, 8005, 5696, 7577, 6386, 3314, 3574, 5422, 409])
>>> permutation_indices = np.random.permutation(10)
>>> permutation_indices
array([9, 1, 8, 0, 7, 3, 2, 6, 5, 4])
>>> age[permutation_indices]
array([60, 4, 54, 38, 41, 16, 70, 1, 29, 8])
>>> income[permutation_indices]
array([ 409, 5884, 5422, 4797, 3574, 5696, 8005, 3314, 6386, 7577])

Loss of data type when converting numpy.ndarry.tolist?

So I have an ndarray similar to this example:
dates = ['3/1/2020','4/15/2020','7/21/2020']
darray = np.asarray([dateutil.parser.parse(d) for d in dates], dtype='datetime64[ns]')
>>> array(['2020-03-01T00:00:00.000000000', '2020-04-15T00:00:00.000000000',
'2020-07-21T00:00:00.000000000'], dtype='datetime64[ns]')
darray.tolist()
>>> [1583020800000000000, 1586908800000000000, 1595289600000000000]
So I'm assuming it's converting to number of nanoseconds since the POSIX origin (1970-01-01). Is there a way to avoid this loss of data type?

Is this what you're looking for?
>>> list(darray)
[numpy.datetime64('2020-03-01T00:00:00.000000000'),
numpy.datetime64('2020-04-15T00:00:00.000000000'),
numpy.datetime64('2020-07-21T00:00:00.000000000')]
The difference is that np.ndarray.tolist() converts values to Python types, whereas list(...) leaves the objects as they are. Internally of course they both contain 64-bit integers either way. If you want to convert to Python datetime objects, have a look at this question. Unfortunately it's not as convenient as it could be.

Normally tolist is better than list. It works all-the-way down, and is faster. list just iterates on the first dimension. But the conversion to native Python types depends on dtype. In this case the time units make a difference.
In [559]: arr = np.array(['2020-03-01T00:00:00.000000000', '2020-04-15T00:00:00.000000000',
...: '2020-07-21T00:00:00.000000000'], dtype='datetime64[ns]')
In [560]: arr.shape
Out[560]: (3,)
In [561]: arr.dtype
Out[561]: dtype('<M8[ns]')
list is the equivalent of [x for x in arr], iteration on the first dimension:
In [562]: list(arr)
Out[562]:
[numpy.datetime64('2020-03-01T00:00:00.000000000'),
numpy.datetime64('2020-04-15T00:00:00.000000000'),
numpy.datetime64('2020-07-21T00:00:00.000000000')]
tolist converts it to Python objects - all the way down:
In [563]: arr.tolist()
Out[563]: [1583020800000000000, 1586908800000000000, 1595289600000000000]
While ns gives an integer, Other time units give different results:
In [564]: arr.astype('datetime64[D]')
Out[564]: array(['2020-03-01', '2020-04-15', '2020-07-21'], dtype='datetime64[D]')
In [565]: arr.astype('datetime64[D]').tolist()
Out[565]:
[datetime.date(2020, 3, 1),
datetime.date(2020, 4, 15),
datetime.date(2020, 7, 21)]
In [566]: arr.astype('datetime64[s]').tolist()
Out[566]:
[datetime.datetime(2020, 3, 1, 0, 0),
datetime.datetime(2020, 4, 15, 0, 0),
datetime.datetime(2020, 7, 21, 0, 0)]

Python create numpy array with different dtypes

I want to create a numpy array (size ~65000 rows x 17 columns). The first column contains complex numbers and the rest contains unsigned integers.
I first create a numpy.zeros array of the desired size and after that I want to fill it with complex numbers and uints as described above. I have looked at the dtypes option and therein should lie the solution I think, but I can't get it to work.
After that I want to save the whole array to a text file as CSV as follows:
0.25+0.30j,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
0.30+0.40j,0,1,0,0,0,0,0,0,1,0,1,1,1,1,1,1
etc...
I tried this amongst others, but later it gives me the following error:
TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and
'numpy.ndarray'
m = 16
dt = numpy.dtype([('comp', numpy.complex), ('f0', numpy.int64), ('f1', numpy.int64),
('f2', numpy.int64), ('f3', numpy.int64), ('f4', numpy.int64), ('f5', numpy.int64),
('f6', numpy.int64), ('f7', numpy.int64), ('f8', numpy.int64), ('f9', numpy.int64),
('f10', numpy.int64), ('f11', numpy.int64), ('f12', numpy.int64), ('f13', numpy.int64),
('f14', numpy.int64), ('f15', numpy.int64)])
fields = numpy.zeros((2**m, m+1), dtype=dt)
for i in range(0, m):
fields[:,0] = fields[:,0] + 1 # for example I add only 1 here

Maybe this does what you want:
Edit: Flattened the structure, so it is now closer to what you originally had in mind, and you can save it using savetxt.
import numpy
m = 15
rows = 5
integers = [('f'+str(i), numpy.int64) for i in range(m)]
dt = numpy.dtype([('comp', numpy.complex)] + integers)
fields = numpy.zeros(rows, dtype=dt)
fields['comp'] += 1j
fmt = '%s ' + m*' %u'
numpy.savetxt('fields.txt', fields, fmt=fmt)
Note: the array is now basically a vector of elements of the type dt. You can access the complex number with fields[row][0], and fields[row][1] will return the "subarray" of integers. That means to change a specific integer, you'll need to do something like this: fields[row][1][5] = 7.

np.savetxt does not handle fields with different number of values very well. A complex field has 2 values per row, an int just one. Or 15 in Psirus's version.
The basic operation in savetxt is:
for row in X:
fh.write(asbytes(format % tuple(row) + newline))
But the row tuple for your dtype is something like (for just 2 int fields)
In [306]: tuple(X[1])
Out[306]: ((1+4j), 0, 0)
And for Psirus's dtype:
In [307]: tuple(fields[1])
Out[307]: ((1+4j), array([2, 3], dtype=int64))
It's hard to come up with a format string that works without resorting to a generic %s for at least the complex value. It is harder still to come up with one that passes savetxt error checking.
It may be best to write your own save routine, one that formats that tuple exactly as you want it.
The savetxt code is readily available to read and copy. The asbyte business is for Python3 compatibility.
It might easier to skip the complex dtype, and work with a plain 2d array, Here's a simple example of writing a complex 'field' plus a couple of ints, without resorting to a structured dtype. The 'complex' magic resides in the fmt string.
In [320]: Y = np.zeros((5,4),dtype=int)
In [321]: Y[:,0]=np.arange(5)
In [322]: Y[:,1]=np.arange(5,0,-1)
In [323]: Y[:,2]=np.arange(5,0,-1)
In [324]: Y[:,3]=np.arange(10,15)
In [325]: Y
Out[325]:
array([[ 0, 5, 5, 10],
[ 1, 4, 4, 11],
[ 2, 3, 3, 12],
[ 3, 2, 2, 13],
[ 4, 1, 1, 14]])
In [326]: np.savetxt('mypy/temp.txt',Y,fmt='%3d+%dj, %3d, %3d')
In [327]: cat mypy/temp.txt
0+5j, 5, 10
1+4j, 4, 11
2+3j, 3, 12
3+2j, 2, 13
4+1j, 1, 14

Find span where condition is True using NumPy

Imagine I have a numpy array and I need to find the spans/ranges where that condition is True. For example, I have the following array in which I'm trying to find spans where items are greater than 1:
[0, 0, 0, 2, 2, 0, 2, 2, 2, 0]
I would need to find indices (start, stop):
(3, 5)
(6, 9)
The fastest thing I've been able to implement is making a boolean array of:
truth = data > threshold
and then looping through the array using numpy.argmin and numpy.argmax to find start and end positions.
pos = 0
truth = container[RATIO,:] > threshold
while pos < len(truth):
start = numpy.argmax(truth[pos:]) + pos + offset
end = numpy.argmin(truth[start:]) + start + offset
if not truth[start]:#nothing more
break
if start == end:#goes to the end
end = len(truth)
pos = end
But this has been too slow for the billions of positions in my arrays and the fact that the spans I'm finding are usually just a few positions in a row. Does anyone know a faster way to find these spans?

How's one way. First take the boolean array you have:
In [11]: a
Out[11]: array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
In [12]: a1 = a > 1
Shift it one to the left (to get the next state at each index) using roll:
In [13]: a1_rshifted = np.roll(a1, 1)
In [14]: starts = a1 & ~a1_rshifted # it's True but the previous isn't
In [15]: ends = ~a1 & a1_rshifted
Where this is non-zero is the start of each True batch (or, respectively, end batch):
In [16]: np.nonzero(starts)[0], np.nonzero(ends)[0]
Out[16]: (array([3, 6]), array([5, 9]))
And zipping these together:
In [17]: zip(np.nonzero(starts)[0], np.nonzero(ends)[0])
Out[17]: [(3, 5), (6, 9)]

If you have access to the scipy library:
You can use scipy.ndimage.measurements.label to identify any regions of non zero value. it returns an array where the value of each element is the id of a span or range in the original array.
You can then use scipy.ndimage.measurements.find_objects to return the slices you would need to extract those ranges. You can access the start / end values directly from those slices.
In your example:
import numpy
from scipy.ndimage.measurements import label, find_objects
data = numpy.array([0, 0, 0, 2, 2, 0, 2, 2, 2, 0])
labels, number_of_regions = label(data)
ranges = find_objects(labels)
for identified_range in ranges:
print(identified_range[0].start, identified_range[0].stop)
You should see:
3 5
6 9
Hope this helps!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

2 Column Array Rank with tiebreaker and Save Python numpy - python

Related

Reorder numpy array to new bitlength elements without loop

Numpy apply the same random permutation to two different ndarray with the same shape

Loss of data type when converting numpy.ndarry.tolist?

Python create numpy array with different dtypes

Find span where condition is True using NumPy

Categories

Resources