I asked this question before and it was heavily downvoted. Since almost no one looks at a triple-downvoted question again, I am reposting it to make clear that I am interested in the actual answer (if there is one).
Problem statement:
I need the arbitrary-precision feature of pure Python integers. At some point in my code I have a NumPy array of booleans, something like:
arr
array([ True, False, False, False, True, True, True, False, True,
True, False, False, True, True, True, False, True, False,
False, True, False, True, True, True, True, True, False,
True, False, True, True, False, True, True, False, True,
False, False, True, False, True, True, False, True, False,
True, True, False, True, True, True, False, False, False,
True, False, False, True, True, True, True, False, True,
False])
which I convert to numpy.int64 using arr.astype(int) so that I can do arithmetic on it.
But when I used the code below to convert it to a single integer, it overflowed (and produced negative numbers, which I don't want).
The code uses this function (which is pure Python and has no integer overflow issue by itself):
def bool2int(x):
    y = 0
    for i, j in enumerate(x):
        y += j << i
    return y
If I run it directly on the NumPy array (whether or not it has been converted to int makes no difference):
bool2int(arr)
-2393826705255337647
bool2int(arr.astype(int))
-2393826705255337647
whereas I need a positive integer. So I used a list comprehension:
bool2int([int(x) for x in arr])
16052917368454213969
Obviously, the number represented by arr exceeds the capacity of a fixed-precision 64-bit integer (i.e. 2**63 - 1), so it cannot be used directly.
Is there a more direct way to achieve this than the list comprehension?
Edit:
For the theory of integer overflow in Python I used this source.
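A minimal sketch of one direct route, assuming the negative results come from the shift being carried out on NumPy's fixed-width scalars rather than on Python integers: coerce each element with int() inside the loop, so the shift and the sum stay in arbitrary precision. The name bool2int_py is made up for this illustration.

```python
import numpy as np


def bool2int_py(bits):
    """Little-endian bits -> Python int (bits[0] is the least significant bit)."""
    # int(b) turns a NumPy bool (or a 0/1 NumPy integer) into a Python int first,
    # so the shift below is done in arbitrary precision and cannot wrap around.
    return sum(int(b) << i for i, b in enumerate(bits))


# 70 set bits do not fit in 64 bits, yet the result stays positive and exact.
wide = np.ones(70, dtype=bool)
print(bool2int_py(wide) == 2**70 - 1)
```

This accepts the boolean array as is, its astype(int) version, or a plain list, so no separate conversion step is needed before the call.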
Using astype(int) seems to be working fine; the following code:
import numpy as np
test = np.array([True, False, False, False, True, True, True, False, True, True, False, False, True, True, True, False, True, False, False, True, False, True, True, True, True, True, False, True, False, True, True, False, True, True, False, True, False, False, True, False, True, True, False, True, False, True, True, False, True, True, True, False, False, False, True, False, False, True, True, True, True, False, True, False])
test_int = test.astype(int)
print(test_int)
print(test_int.sum())
Returns:
[1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 0 1 1 0
1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0]
37
The overflow you describe seems unlikely to come from this step, so I would look at it again; the error may be somewhere else.
Edit
If you want to get a Python type instead of a numpy object just do:
test.astype(int).tolist()
One way of getting native Python type elements is .tolist(). Note that we can do this directly on the boolean array. Your code works fine with native Python bools.
>>> x = np.random.randint(0, 2, (100,)).astype(bool)
>>> x
array([ True, True, False, True, False, True, False, False, True,
False, False, True, True, False, False, False, True, False,
False, True, False, True, False, False, True, True, True,
True, True, True, True, False, False, False, False, False,
True, True, True, True, False, False, True, False, False,
False, False, True, False, True, True, False, False, True,
False, True, True, True, False, True, True, True, False,
True, True, True, True, False, True, True, True, False,
True, False, True, False, True, False, True, True, True,
False, False, True, True, True, True, True, False, False,
True, False, False, False, True, True, True, False, False, True], dtype=bool)
>>> bool2int(x)
-4925102932063228254
>>> bool2int(x.tolist())
774014555155191751582008547627L
As an added bonus it's actually faster.
>>> timeit(lambda:bool2int(x), number=1000)
0.24346303939819336
>>> timeit(lambda:bool2int(x.tolist()), number=1000)
0.010725975036621094
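Another possible route, sketched here under the assumption that NumPy 1.17 or later is available (for the bitorder argument of np.packbits): pack the booleans into bytes and let Python's int.from_bytes build the arbitrary-precision integer in one call. The function name bool_array_to_int is hypothetical.

```python
import numpy as np


def bool_array_to_int(bits):
    """Interpret bits[0] as the least significant bit and return a Python int."""
    bits = np.asarray(bits, dtype=bool)
    # bitorder="little" puts bits[0] in the lowest bit of the first byte;
    # the last byte is zero-padded if len(bits) is not a multiple of 8.
    packed = np.packbits(bits, bitorder="little")
    # The bytes are then read least-significant byte first.
    return int.from_bytes(packed.tobytes(), byteorder="little")


print(bool_array_to_int([True, False, True]))  # 1 + 4 = 5
```

The loop over elements happens inside NumPy and int.from_bytes, so no per-element Python arithmetic is involved; whether that is faster than .tolist() for a given array size would have to be measured.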
Related
I implemented a regression model using
import statsmodels.formula.api as smf
# Adjacent string literals inside parentheses are joined into one formula string
formula = ("cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + "
           "risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + "
           "duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)")
model_a = smf.ols(formula=formula, data=train).fit()
model_a.summary()
After fitting the regression model, I ran a Bonferroni correction using
smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni', is_sorted=False,
                  returnsorted=False)
And I get the following result:
(array([ True, False, True, True, True, True, True, False, True,
True, True, False, True, True, True, True, False, False,
False, False, True, False, True, True, True, True, True,
True, True, False, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, False, False, False, False,
True, True, True, True, True, False, True, False, True,
False, True, True, True, True]),
array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
0.0007540287301109894,
0.0007352941176470588)
I want to use these arrays to remove the features in model_a that are False and create a new model 'train_simplified'.
I'm using the following manual approach, but I want to know if there's a more efficient way to do it.
train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38,
41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)
You could use Pandas loc to select only the features in model_a that are True.
.loc[] is primarily label based, but may also be used with a boolean array.
import numpy as np
import pandas as pd
train = pd.DataFrame(np.random.rand(5, 68))
0 1 2 3 ... 63 64 65 66 67
0 0.637557 0.887213 0.472215 0.119594 ... 0.908266 0.239562 0.144895 0.489453 0.985650
1 0.242055 0.672136 0.761620 0.237638 ... 0.649633 0.849223 0.657613 0.568309 0.093675
2 0.367716 0.265202 0.243990 0.973011 ... 0.465598 0.542645 0.286541 0.590833 0.030500
3 0.037348 0.822601 0.360191 0.127061 ... 0.070569 0.642419 0.026511 0.585776 0.940230
4 0.575474 0.388170 0.643288 0.458253 ... 0.091206 0.494420 0.057559 0.549529 0.441531
[5 rows x 68 columns]
keep_columns = np.array([ # array from smt.multipletests
True, False, True, True, True, True, True, False, True,
True, True, False, True, True, True, True, False, False,
False, False, True, False, True, True, True, True, True,
True, True, False, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, False, False, False, False,
True, True, True, True, True, False, True, False, True,
False, True, True, True, True])
np.sum(keep_columns) # 47 (keep 47 columns)
train_simplified = train.loc[:,keep_columns]
Output from train_simplified
0 2 3 4 ... 62 64 65 66 67
0 0.637557 0.472215 0.119594 0.713245 ... 0.278646 0.239562 0.144895 0.489453 0.985650
1 0.242055 0.761620 0.237638 0.728216 ... 0.746491 0.849223 0.657613 0.568309 0.093675
2 0.367716 0.243990 0.973011 0.393098 ... 0.035942 0.542645 0.286541 0.590833 0.030500
3 0.037348 0.360191 0.127061 0.522243 ... 0.162934 0.642419 0.026511 0.585776 0.940230
4 0.575474 0.643288 0.458253 0.545617 ... 0.789618 0.494420 0.057559 0.549529 0.441531
[5 rows x 47 columns]
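If positional indices are still wanted, as in the manual drop() call from the question, they can be derived from the mask instead of being typed by hand. A small sketch with a made-up five-element mask standing in for the first array returned by multipletests:

```python
import numpy as np

# Hypothetical, shortened mask; the real one comes from smt.multipletests(...)[0].
keep = np.array([True, False, True, True, False])

keep_positions = np.flatnonzero(keep)    # positions flagged True  -> [0, 2, 3]
drop_positions = np.flatnonzero(~keep)   # positions flagged False -> [1, 4]

print(keep_positions, drop_positions)
```

With a DataFrame whose columns line up one-to-one with the mask, train.drop(train.columns[drop_positions], axis=1) should then give the same result as train.loc[:, keep]. Note that model_a.pvalues is indexed by model term rather than by DataFrame column, so it is worth checking that the mask and the columns really do align before using either form.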
I have a conditional statement for an array A (assume it is A > 10) and I get the following boolean result.
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, False, False, False, False,
False, False, False, False, False, False, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
Now I am finding the indices where the values are True. I get the following array.
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90])
What I need to do is to find the start index and end index of continuous indices. For example, in the above array the start index of the first of the continuous indices is 20 and the end index is 49. Similarly, the start index of the second set of continuous indices is 60 and the end index is 90.
So to summarize, my output should be :
start_indices = array([20,60])
end_indices = array([49,90])
How to do this?
Here is a solution using groupby and accumulate from itertools:
from itertools import groupby, accumulate
## input array
#a = array([False, False, ..., True, ..., False])
indices = list(accumulate(len(list(g)) for i,g in groupby(a)))
starts = indices[:len(indices)//2*2:2]
stops = [i-1 for i in indices[1::2]]
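NB. It works with any iterable, not only NumPy arrays, but it assumes the input starts with a run of False values (as in the question); if it starts with True, the even and odd positions in indices swap roles.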
output:
>>> starts
[20, 60]
>>> stops
[49, 90]
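A variant sketch of the same groupby idea that looks at the key of each run instead of relying on the runs alternating from False, so it should also cope with input that starts or ends with True. The name true_runs is made up for this example.

```python
from itertools import groupby


def true_runs(values):
    """Return (start, end) index pairs, end inclusive, for each run of truthy values."""
    runs = []
    position = 0
    for key, group in groupby(values):
        length = sum(1 for _ in group)   # consume the group to count its items
        if key:
            runs.append((position, position + length - 1))
        position += length
    return runs


# Same layout as the question: True at 20..49 and 60..90.
a = [False] * 20 + [True] * 30 + [False] * 10 + [True] * 31 + [False] * 109
print(true_runs(a))
```

Splitting the pairs into separate start and end lists is then a matter of zip(*true_runs(a)) when at least one run exists.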
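import numpy as np
# With A as the original array.
# np.diff on a boolean array is True wherever the value flips;
# [0] takes the index array out of the tuple returned by np.where.
# This assumes the mask both starts and ends with False.
changes = np.where(np.diff(A > 10))[0]
start = changes[::2] + 1  # first True of each run
end = changes[1::2]       # last True of each run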
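The slicing above pairs flips correctly only when the mask begins and ends with False. One way to relax that, sketched here with a hypothetical helper named run_bounds, is to pad the mask with False on both sides before looking for flips:

```python
import numpy as np


def run_bounds(mask):
    """Start and (inclusive) end indices of each run of True in a 1-D boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    # Padding guarantees every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts = edges[::2]       # rising edges: index of the first True in a run
    ends = edges[1::2] - 1    # falling edges: one past the last True, hence -1
    return starts, ends


mask = np.zeros(200, dtype=bool)
mask[20:50] = True
mask[60:91] = True
starts, ends = run_bounds(mask)
print(starts, ends)
```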
I am looping through a list of 3 items, something like:
for i in range(3):
and trying to produce the following lists on each respective iteration:
[True, True, False, False, False, False]
[False, False, True, True, False, False]
[False, False, False, False, True, True]
What would be a good way in python to do this?
Here's one way:
>>> for i in range(3):
... print([(x // 2) == i for x in range(6)])
...
[True, True, False, False, False, False]
[False, False, True, True, False, False]
[False, False, False, False, True, True]
Try like this:
k = 0
for i in range(3):
    # Other tasks
    myList = [False for x in range(4)]
    myList[k:k] = [True, True]  # insert the two True values at position k
    print(myList)
    k += 2
L = [False, False, False, False, True, True]
for _ in range(3):
    L = L[-2:] + L[:4]  # rotate right by two positions
    print(L)
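If NumPy is an option, all the rows can also be built at once. A rough sketch, with block_masks as a made-up helper name: start from a boolean identity matrix and repeat each column so that every True becomes a block of the desired width.

```python
import numpy as np


def block_masks(n_blocks, width):
    """Row i is True exactly in columns i*width .. (i+1)*width - 1."""
    return np.repeat(np.eye(n_blocks, dtype=bool), width, axis=1)


masks = block_masks(3, 2)
for i in range(3):
    print(masks[i].tolist())
```

Each masks[i] is a NumPy array; .tolist() converts it to a plain Python list when that is what the rest of the loop expects.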
What is the fastest way to find the indexes of given numbers in a NumPy array in Python?
Suppose we have a list of numbers from 0 to 20, and we want to know the indexes of the numbers 2 and 5.
The canonical way would be to use NumPy's where function:
a = np.array(range(20))
np.where((a == 2) | (a == 5))
Note that in order to combine the two terms (a == 2) and (a == 5) we need the bitwise or operator |. The reason is that both (a == 2) and (a == 5) return a NumPy array of dtype('bool'), and Python's or keyword does not work element-wise on such arrays:
>>> a == 2
array([False, False, True, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
>>> (a == 5)
array([False, False, False, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
>>> (a == 2) | (a==5)
array([False, False, True, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
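When more than a couple of values are being looked up, chaining == comparisons with | gets unwieldy. A sketch of an alternative using np.isin, which tests membership against a whole list of targets at once (the variable names are chosen for this example only):

```python
import numpy as np

a = np.array(range(20))
targets = [2, 5]

mask = np.isin(a, targets)        # True where a holds one of the targets
indexes = np.flatnonzero(mask)    # positions of the True entries

print(indexes)
```

For comparison, (a == 2) or (a == 5) raises a ValueError here, because the truth value of an array with more than one element is ambiguous. Which approach is actually fastest depends on the array size and the number of targets, so timing both on real data is the safest guide.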
I have a 1D (NumPy) array with boolean values. For example:
x = [True, True, False, False, False, True, False, True, True, True, False, True, True, False]
The array contains 8 True values. I would like to keep, for example, exactly 3 (must be less than 8 in this case) as True values randomly from the 8 that exist. In other words I would like to randomly set 5 of those 8 True values as False.
A possible result can be:
x = [True, True, False, False, False, False, False, False, False, False, False, False, True, False]
How to implement it?
One approach would be -
# Get the indices of True values
idx = np.flatnonzero(x)
# Randomly pick len(idx)-3 distinct True indices (all but 3 of them) and
# set those in x to False
x[np.random.choice(idx, len(idx)-3, replace=False)] = False
Sample run -
# Input array
In [79]: x
Out[79]:
array([ True, True, False, False, False, True, False, True, True,
True, False, True, True, False], dtype=bool)
# Get indices
In [80]: idx = np.flatnonzero(x)
# Set all but 3 of the True indices to False
In [81]: x[np.random.choice(idx, len(idx)-3, replace=False)] = False
# Verify output to have exactly three True values
In [82]: x
Out[82]:
array([ True, False, False, False, False, False, False, True, False,
False, False, True, False, False], dtype=bool)
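The same idea can be wrapped in a function that leaves the input untouched and takes an optional random generator for reproducibility. This is only a sketch: keep_n_true and its parameters are hypothetical names, and the seed used below is arbitrary.

```python
import numpy as np


def keep_n_true(mask, n_keep, rng=None):
    """Return a copy of mask in which only n_keep randomly chosen True values remain."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.asarray(mask, dtype=bool)
    true_idx = np.flatnonzero(mask)
    if n_keep > true_idx.size:
        raise ValueError("n_keep exceeds the number of True values")
    out = np.zeros_like(mask)
    # Choose the survivors directly instead of choosing which ones to clear.
    out[rng.choice(true_idx, size=n_keep, replace=False)] = True
    return out


x = np.array([True, True, False, False, False, True, False,
              True, True, True, False, True, True, False])
result = keep_n_true(x, 3, rng=np.random.default_rng(0))
print(result.sum())  # number of True values left
```

Selecting the values to keep rather than the values to drop gives the same distribution over outcomes; it just avoids the len(idx)-3 arithmetic.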
Build an array with the number of desired True and False, then just shuffle it
import random
def buildRandomArray(size, numberOfTrues):
    res = [False]*(size-numberOfTrues) + [True]*numberOfTrues
    random.shuffle(res)
    return res