I have a conditional statement for an array A (assume it is A > 10), and I get the following boolean result.
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, False, False, False, False,
False, False, False, False, False, False, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
Now I am finding the indices where the values are True. I get the following array.
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90])
What I need to do is to find the start index and end index of continuous indices. For example, in the above array the start index of the first of the continuous indices is 20 and the end index is 49. Similarly, the start index of the second set of continuous indices is 60 and the end index is 90.
To summarize, my output should be:
start_indices = array([20,60])
end_indices = array([49,90])
How to do this?
Here is a solution using groupby and accumulate from itertools:
from itertools import groupby, accumulate
## input array
#a = array([False, False, ..., True, ..., False])
indices = list(accumulate(len(list(g)) for i,g in groupby(a)))
starts = indices[:len(indices)//2*2:2]
stops = [i-1 for i in indices[1::2]]
NB. This works with any iterable, not only NumPy arrays, but the slicing assumes the input starts with a run of False values.
output:
>>> starts
[20, 60]
>>> stops
[49, 90]
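The slicing above pairs the run boundaries correctly only when the input begins with False. Here is a minimal sketch of a variant that inspects each group's key instead, so it should not depend on how the input begins or ends. The name `true_runs_groupby` is made up for illustration.

```python
from itertools import groupby

def true_runs_groupby(values):
    """Return (start, end) pairs, end inclusive, for each run of truthy values."""
    runs = []
    pos = 0  # index of the first element of the current group
    for key, group in groupby(values):
        length = sum(1 for _ in group)
        if key:
            runs.append((pos, pos + length - 1))
        pos += length
    return runs

# Same layout as the mask in the question: True at 20..49 and 60..90.
mask = [False] * 20 + [True] * 30 + [False] * 10 + [True] * 31 + [False] * 118
print(true_runs_groupby(mask))  # [(20, 49), (60, 90)]
```

Like the original, this accepts any iterable, at the cost of a Python-level loop over the groups.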
import numpy as np
# With A as the original array
changes = np.where(np.diff(A > 10))[0] # Gets the actual array out of a tuple
start = changes[::2] + 1
end = changes[1::2]
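The np.diff approach pairs the transitions correctly only when A > 10 is False at both ends of the array. Here is a hedged sketch that pads the mask with False on each side, so runs touching an edge are still reported. The function name `true_runs` is an illustrative choice, not part of the answer above.

```python
import numpy as np

def true_runs(mask):
    """Return (starts, ends) of each run of True values; ends are inclusive."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    # Position k is an edge when mask[k] differs from mask[k - 1].
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts = edges[::2]      # rising edges: first True of each run
    ends = edges[1::2] - 1   # falling edges point one past the last True
    return starts, ends

# Stand-in data with the same layout as the question: > 10 at 20..49 and 60..90.
A = np.zeros(209)
A[20:50] = 11
A[60:91] = 11

start, end = true_runs(A > 10)
print(start.tolist(), end.tolist())  # [20, 60] [49, 90]
```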
I implemented a regression model using
formula = ("cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + "
           "risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + "
           "duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)")
model_a = smf.ols(formula = formula, data = train).fit()
model_a.summary()
After fitting the regression model, I ran a Bonferroni correction using
smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni', is_sorted=False,
returnsorted=False)
And I get the following result:
(array([ True, False, True, True, True, True, True, False, True,
True, True, False, True, True, True, True, False, False,
False, False, True, False, True, True, True, True, True,
True, True, False, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, False, False, False, False,
True, True, True, True, True, False, True, False, True,
False, True, True, True, True]),
array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
0.0007540287301109894,
0.0007352941176470588)
I want to use these arrays to remove the features in model_a that are False and create a new model 'train_simplified'.
I'm using the following manual approach, but I want to know if there's a more efficient way to do it.
train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38,
41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)
You could use Pandas loc to select only the features in model_a that are True.
.loc[] is primarily label based, but may also be used with a boolean array.
train = pd.DataFrame(np.random.rand(5,68))
0 1 2 3 ... 63 64 65 66 67
0 0.637557 0.887213 0.472215 0.119594 ... 0.908266 0.239562 0.144895 0.489453 0.985650
1 0.242055 0.672136 0.761620 0.237638 ... 0.649633 0.849223 0.657613 0.568309 0.093675
2 0.367716 0.265202 0.243990 0.973011 ... 0.465598 0.542645 0.286541 0.590833 0.030500
3 0.037348 0.822601 0.360191 0.127061 ... 0.070569 0.642419 0.026511 0.585776 0.940230
4 0.575474 0.388170 0.643288 0.458253 ... 0.091206 0.494420 0.057559 0.549529 0.441531
[5 rows x 68 columns]
keep_columns = np.array([ # array from smt.multipletests
True, False, True, True, True, True, True, False, True,
True, True, False, True, True, True, True, False, False,
False, False, True, False, True, True, True, True, True,
True, True, False, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, False, False, False, False,
True, True, True, True, True, False, True, False, True,
False, True, True, True, True])
np.sum(keep_columns) # 47 (keep 47 columns)
train_simplified = train.loc[:,keep_columns]
Output from train_simplified
0 2 3 4 ... 62 64 65 66 67
0 0.637557 0.472215 0.119594 0.713245 ... 0.278646 0.239562 0.144895 0.489453 0.985650
1 0.242055 0.761620 0.237638 0.728216 ... 0.746491 0.849223 0.657613 0.568309 0.093675
2 0.367716 0.243990 0.973011 0.393098 ... 0.035942 0.542645 0.286541 0.590833 0.030500
3 0.037348 0.360191 0.127061 0.522243 ... 0.162934 0.642419 0.026511 0.585776 0.940230
4 0.575474 0.643288 0.458253 0.545617 ... 0.789618 0.494420 0.057559 0.549529 0.441531
[5 rows x 47 columns]
How can I get a one-dimensional boolean output for the values < 40 in the array given below? Since there are three values < 40, the output should be: array([ True, True, True])
x = np.array([[40, 37, 70],[62, 61, 98],[65, 89, 22],[95, 98, 81],[44, 32, 79]])
You can do it like this:
import numpy as np
x = np.array([[40, 37, 70],[62, 61, 98],[65, 89, 22],[95, 98, 81],[44, 32, 79]])
x<40
Output:
array([[False, True, False],
[False, False, False],
[False, False, True],
[False, False, False],
[False, True, False]])
Or if you want a 1d result, you can use .flatten():
y = x.flatten()
y<40
Output:
array([False, True, False, False, False, False, False, False, True,
False, False, False, False, True, False])
If you want a 1d list like [True]*n where n is the number of values <40, you can do:
np.array([i for i in x.flatten()<40 if i])
Output:
array([True, True, True])
This could be solved in many ways; one is:
x[x<40]<40
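Since the requested output only encodes how many values fall below the threshold, counting them directly may be simpler. Here is a small sketch using np.count_nonzero on the boolean comparison; the variable names are illustrative.

```python
import numpy as np

x = np.array([[40, 37, 70], [62, 61, 98], [65, 89, 22], [95, 98, 81], [44, 32, 79]])

n_below = np.count_nonzero(x < 40)     # number of entries below 40
result = np.ones(n_below, dtype=bool)  # one True per matching entry

print(n_below)          # 3
print(result.tolist())  # [True, True, True]
```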
What is the fastest way to find the indexes of given numbers in a NumPy array in Python?
Suppose we have a list of numbers from 0 to 20, and we want to know the indexes of the values 2 and 5.
The canonical way would be to use numpy's where method:
a = np.array(range(20))
np.where((a == 2) | (a == 5))
Note that in order to combine the two terms (a == 2) and (a == 5) we need the bitwise or operator |. The reason is that both (a == 2) and (a == 5) return a numpy array of dtype('bool'):
>>> a == 2
array([False, False, True, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
>>> (a == 5)
array([False, False, False, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
>>> (a == 2) | (a==5)
array([False, False, True, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
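When there are more than a couple of target values, chaining == with | becomes unwieldy. One alternative, sketched here, is np.isin, which tests membership against a whole list at once. Whether it is actually faster than the chained comparison would need to be measured on your own data.

```python
import numpy as np

a = np.array(range(20))
targets = [2, 5]

mask = np.isin(a, targets)      # boolean array, True where a is 2 or 5
indexes = np.flatnonzero(mask)  # positions of the True entries

print(indexes.tolist())  # [2, 5]
```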
I asked this question before and it was heavily downvoted. Since hardly anyone looks at a triple-downvoted question again, I am reposting it to make clear that I am interested in an actual answer (if there is one).
Problem statement:
I need the arbitrary-precision feature of pure Python integers. At some point in my code I have a NumPy array of booleans, something like:
arr
array([ True, False, False, False, True, True, True, False, True,
True, False, False, True, True, True, False, True, False,
False, True, False, True, True, True, True, True, False,
True, False, True, True, False, True, True, False, True,
False, False, True, False, True, True, False, True, False,
True, True, False, True, True, True, False, False, False,
True, False, False, True, True, True, True, False, True,
False])
which I convert to numpy.int64 using arr.astype(int) so that I can do arithmetic on it.
But when I used the code below to convert it to a single integer, it overflowed (and produced negative numbers, which I don't want).
The code uses this function (which is pure Python and has no integer overflow issue by itself):
def bool2int(x):
    y = 0
    for i,j in enumerate(x):
        y += j<<i
    return y
If I run the code directly on np.array (converted to int or not does not matter):
bool2int(arr)
-2393826705255337647
bool2int(arr.astype(int))
-2393826705255337647
whereas I need a positive integer. So I used a list comprehension:
bool2int([int(x) for x in arr])
16052917368454213969
Obviously, the number represented by arr exceeds the capacity of fixed-precision integers (i.e. 2^63 - 1), so they cannot be used directly.
Is there a more direct way to achieve this than a list comprehension?
Edit:
For the theory of integer overflow in Python I used this source.
Using astype(int) seems to be working fine; the following code:
import numpy as np
test = np.array([True, False, False, False, True, True, True, False, True, True, False, False, True, True, True, False, True, False, False, True, False, True, True, True, True, True, False, True, False, True, True, False, True, True, False, True, False, False, True, False, True, True, False, True, False, True, True, False, True, True, True, False, False, False, True, False, False, True, True, True, True, False, True, False])
test_int = test.astype(int)
print(test_int)
print(test_int.sum())
Returns:
[1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 0 1 1 0
1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0]
37
The overflow you are describing seems unlikely here, so I would look into that again; maybe the error is somewhere else.
Edit
If you want to get a Python type instead of a numpy object just do:
test.astype(int).tolist()
One way of getting native Python type elements is .tolist(). Note that we can do this directly on the boolean array. Your code works fine with native Python bools.
>>> x = np.random.randint(0, 2, (100,)).astype(bool)
>>> x
array([ True, True, False, True, False, True, False, False, True,
False, False, True, True, False, False, False, True, False,
False, True, False, True, False, False, True, True, True,
True, True, True, True, False, False, False, False, False,
True, True, True, True, False, False, True, False, False,
False, False, True, False, True, True, False, False, True,
False, True, True, True, False, True, True, True, False,
True, True, True, True, False, True, True, True, False,
True, False, True, False, True, False, True, True, True,
False, False, True, True, True, True, True, False, False,
True, False, False, False, True, True, True, False, False, True], dtype=bool)
>>> bool2int(x)
-4925102932063228254
>>> bool2int(x.tolist())
774014555155191751582008547627L
As an added bonus it's actually faster.
>>> timeit(lambda:bool2int(x), number=1000)
0.24346303939819336
>>> timeit(lambda:bool2int(x.tolist()), number=1000)
0.010725975036621094
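A likely explanation for the negative results is that iterating over a NumPy array yields NumPy scalars, so j<<i is evaluated in fixed-width integers. In contrast, .tolist() yields Python bools and the shift then uses arbitrary-precision ints. As an alternative sketch that avoids the Python-level loop, the bits can be packed into bytes with np.packbits and converted with int.from_bytes. This assumes the bitorder argument is available (it was added in NumPy 1.17), and `bool2int_packed` is a name invented here.

```python
import numpy as np

def bool2int(x):
    # Reference implementation from the question; element i contributes 2**i.
    y = 0
    for i, j in enumerate(x):
        y += j << i
    return y

def bool2int_packed(arr):
    # Pack 8 booleans per byte, first element in the least significant bit,
    # then read the bytes as one little-endian Python int.
    packed = np.packbits(np.asarray(arr, dtype=bool), bitorder="little")
    return int.from_bytes(packed.tobytes(), "little")

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, 100).astype(bool)

# Both routes should agree when the reference runs on native Python bools.
print(bool2int_packed(bits) == bool2int(bits.tolist()))  # True
```

The result is an ordinary Python int, so it is not limited to 64 bits.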
Given an arbitrary one-dimensional mask:
In [1]: import numpy as np
...: mask = np.array(np.random.randint(0, 2, 20), dtype=bool)
...: mask
Out[1]:
array([ True, False, True, False, False, True, False, True, True,
False, True, False, True, False, False, True, True, False,
True, True], dtype=bool)
We can obtain an array of the indices of the True elements of mask using np.flatnonzero:
In[2]: np.flatnonzero(mask)
Out[2]: array([ 0, 2, 5, 7, 8, 10, 12, 15, 16, 18, 19], dtype=int64)
But now how do I reverse this process and go from _2 to a mask?
Create an all-false mask and then use numpy's index array functionality to assign the True entries for the mask.
In[3]: new_mask = np.zeros(20, dtype=bool)
...: new_mask
Out[3]:
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False], dtype=bool)
In[4]: new_mask[_2] = True
...: new_mask
Out[4]:
array([ True, False, True, False, False, True, False, True, True,
False, True, False, True, False, False, True, True, False,
True, True], dtype=bool)
As a check we see that:
In[5]: np.flatnonzero(new_mask)
Out[5]: array([ 0, 2, 5, 7, 8, 10, 12, 15, 16, 18, 19], dtype=int64)
As expected, _5 == _2:
In[6]: np.all(_5 == _2)
Out[6]: True
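The same idea can be wrapped as a small helper, shown here as a sketch; the name `indices_to_mask` is hypothetical. Note that the mask length has to be supplied separately, because the index array alone does not record how many trailing False entries there were.

```python
import numpy as np

def indices_to_mask(indices, length):
    """Build a boolean mask of the given length that is True at `indices`."""
    mask = np.zeros(length, dtype=bool)
    mask[indices] = True  # index-array assignment sets all listed positions
    return mask

idx = np.array([0, 2, 5, 7, 8, 10, 12, 15, 16, 18, 19])
mask = indices_to_mask(idx, 20)

# Round trip: recovering the indices from the mask gives back idx.
print(np.array_equal(np.flatnonzero(mask), idx))  # True
```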
You could use np.bincount:
In [304]: mask = np.random.binomial(1, 0.5, size=10).astype(bool); mask
Out[304]: array([ True, True, False, True, False, False, False, True, False, True], dtype=bool)
In [305]: idx = np.flatnonzero(mask); idx
Out[305]: array([0, 1, 3, 7, 9])
In [306]: np.bincount(idx, minlength=len(mask)).astype(bool)
Out[306]: array([ True, True, False, True, False, False, False, True, False, True], dtype=bool)