elementwise boolean testing in a matrix - python

I need to test a boolean condition in an array and, based on the answer, apply an elementwise operation on a matrix. I seem to be getting the boolean answer for the whole ROW and not for the individual element itself. How do I test and get the answer for each individual element?
I have a matrix of probabilities:
probs = np.array([[0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.7, 0.2, 0.6, 0.1, 0.0]])
and a matrix of test arrays:
tst = ([False, False, True, True, False],
       [True, False, True, False, False],
       )
t = np.asarray(tst).astype('bool')
and this segment of code I have written, which outputs an answer, but obviously tests the entire row, since the comparison always comes out FALSE:
for row in tst:
    mat = []
    for row1 in probs:
        temp = []
        if row == True:
            temp.append(row1)
        else:
            temp.append(row1 - 1)
        mat.append(temp)
mat
Out[42]:
[[array([-0.9, -0.8, -0.7, -0.7, -0.3])],
[array([-0.9, -0.8, -0.7, -0.7, -0.3])],
[array([-0.3, -0.8, -0.4, -0.9, -1. ])]]
I need the new matrix to be
[[-0.9, -0.8, 0.3, 0.3, -0.3],
 [-0.9, -0.8, 0.3, 0.3, -0.3],
 [-0.3, -0.8, 0.6, 0.1, -1]]
for the 1st array in tst. Thanks very much for any assistance!

You need to keep the values as-is if the test is True and subtract 1 otherwise.
Your loop doesn't work because you're comparing a list with a boolean. After that you're appending the whole row minus 1 (subtracting 1 from all elements).
My solution: subtract the boolean row from the values row, but with True and False inverted (if True, don't subtract; if False, subtract):
for row in tst:
    mat = []
    for row1 in probs:
        mat.append(row1 - [not v for v in row])
    print(np.asarray(mat))
This prints, for each iteration (note that you get 2 results, since you're combining 2 truth tables with your matrix):
[[-0.9 -0.8 0.3 0.3 -0.3]
[-0.9 -0.8 0.3 0.3 -0.3]
[-0.3 -0.8 0.6 0.1 -1. ]]
[[ 0.1 -0.8 0.3 -0.7 -0.3]
[ 0.1 -0.8 0.3 -0.7 -0.3]
[ 0.7 -0.8 0.6 -0.9 -1. ]]
(I'm not a numpy expert at all, sorry if this is clumsy, comments welcome)
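For what it's worth, the same idea can be written as a single broadcast subtraction, since numpy treats booleans as 0/1 and broadcasts a 1-D row against a 2-D matrix. A minimal sketch using the first test row, with the names from the question:
import numpy as np

probs = np.array([[0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.7, 0.2, 0.6, 0.1, 0.0]])
t = np.asarray([False, False, True, True, False])

# ~t is True (i.e. 1) exactly where the test is False,
# so exactly those elements have 1 subtracted
mat = probs - ~t
print(mat)
# [[-0.9 -0.8  0.3  0.3 -0.3]
#  [-0.9 -0.8  0.3  0.3 -0.3]
#  [-0.3 -0.8  0.6  0.1 -1. ]]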

You don't need a loop here. You have an array and a corresponding mask array.
probs[np.invert(tst)] -= 1.
The mask will give you back the True values. You want the False values, so invert the tst array.
# This would be a longer version, if you are not familiar with the syntax above
probs[np.invert(tst)] = probs[np.invert(tst)] - 1.
If you want to create a new numpy array (your code created a list of numpy arrays), it works for example this way:
# copy the numpy array
mat = np.copy(probs)
mat[np.invert(tst)] = probs[np.invert(tst)] - 1
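One caveat with the mask approach: boolean mask indexing needs the mask to have the same shape as the array, and in the question tst is 2x5 while probs is 3x5. A sketch of one way to handle that, assuming each row of tst should be applied to all of probs in turn:
import numpy as np

probs = np.array([[0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.7, 0.2, 0.6, 0.1, 0.0]])
t = np.array([[False, False, True, True, False],
              [True, False, True, False, False]])

for row in t:
    mat = np.copy(probs)
    # expand the 1-D row mask to the shape of probs before indexing
    mat[np.broadcast_to(~row, probs.shape)] -= 1
    print(mat)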
I would recommend taking a look at a few beginner tutorials first; programming will be much easier if you know, for example, the difference between lists and numpy arrays and how to handle them.
https://www.scipy.org/scipylib/faq.html#what-advantages-do-numpy-arrays-offer-over-nested-python-lists
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
or a short explanation
Python List vs. Array - when to use?

Related

How can I remove the rows of my dataset using pandas?

Here's the dataset I'm dealing with (called depo_dataset):
Some entries starting from the second column (0.0) might be 0.000... . The goal is, for each column starting from the 2nd one, to generate a separate array of Energy, with the 0.0... entries in that column and the associated Energy values removed. I'm trying to use a mask in pandas. Here's what I tried:
for column in depo_dataset.columns[1:]:
    e = depo_dataset['Energy'].copy()
    mask = depo_dataset[column] == 0
Then I don't know how to drop the 0 entries (assuming there are any) and the corresponding elements in e.
For instance, suppose depo_dataset['0.0'] is 0.4, 0.0, 0.4, 0.1 and depo_dataset['Energy'] is 0.82, 0.85, 0.87, 0.90. I want to drop the 0.0 entry in depo_dataset['0.0'] and the corresponding 0.85 in depo_dataset['Energy'].
Thanks for the help!
You can just use .loc on the DataFrame to filter out some rows.
Here is a little example:
import pandas as pd

df = pd.DataFrame({
    'Energy': [0.82, 0.85, 0.87, 0.90],
    0.0: [0.4, 0.0, 0.4, 0.1],
    0.1: [0.0, 0.3, 0.4, 0.1]
})

energies = {}
for column in df.columns[1:]:
    energies[column] = df.loc[df[column] != 0, ['Energy', column]]

energies[0.0]
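For the toy frame above, energies[0.0] should then hold only the rows where the 0.0 column is nonzero, roughly (sketched output; exact pandas formatting may differ):
   Energy  0.0
0    0.82  0.4
2    0.87  0.4
3    0.90  0.1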
You can use .loc:
depo_dataset = pd.DataFrame({'Energy': [0.82, 0.85, 0.87, 0.90],
                             '0.0': [0.4, 0.0, 0.4, 0.1],
                             '0.1': [1, 2, 3, 4]})
dataset_no_zeroes = depo_dataset.loc[(depo_dataset.iloc[:, 1:] != 0).all(axis=1), :]
Explanation:
(depo_dataset.iloc[:, 1:] != 0)
makes a dataframe from all columns beginning with the second one, with bool values indicating whether each cell is nonzero.
.all(axis=1)
reduces each row (axis=1) and only returns True if all values of the row are True.
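To make the two steps concrete, this is roughly what the intermediate results look like for the toy depo_dataset above (sketched output):
print(depo_dataset.iloc[:, 1:] != 0)
#      0.0   0.1
# 0   True  True
# 1  False  True
# 2   True  True
# 3   True  True
print((depo_dataset.iloc[:, 1:] != 0).all(axis=1))
# 0     True
# 1    False
# 2     True
# 3     True
# dtype: bool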

Add a scalar to a numpy matrix based on the indices in a different numpy array

I'm sorry if this question isn't framed well; I would rather explain with an example.
I have a numpy matrix:
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
And another numpy array as shown:
b = np.array([1, 0, 2, 2])
With the given conditions that the values in b will be in range(a.shape[1]) and that b.shape[0] == a.shape[0], this is the operation I need to perform:
For every row index i of a and the corresponding index i of b, I need to subtract 1 from element j of a[i], where j == b[i].
So in my example, a[0] == [0.5, 0.8, 0.1] and b[0] == 1. Therefore I need to subtract 1 from a[0][b[0]] so that a[0] = [0.5, -0.2, 0.1]. This has to be done for all rows of a. Is there a direct solution without me having to iterate through all rows or columns one by one?
Thanks.
Use numpy integer array (fancy) indexing:
import numpy as np
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
b = np.array([1, 0, 2, 2])
a[np.arange(a.shape[0]), b] -= 1
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
As an alternative, use np.subtract.at:
np.subtract.at(a, (np.arange(a.shape[0]), b), 1)
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
The main idea is that:
np.arange(a.shape[0])  # shape[0] equals the number of rows
generates the indices of the rows:
[0 1 2 3]
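A caveat, though it makes no difference here because every (row, column) pair is unique: the two variants behave differently when indices repeat. Fancy-indexed -= applies the subtraction only once per duplicated index (the operation is buffered), while np.subtract.at accumulates. A minimal sketch on a 1-D array:
import numpy as np

x = np.zeros(3)
idx = np.array([0, 0, 1])

y = x.copy()
y[idx] -= 1                # the duplicated index 0 is subtracted only once
print(y)                   # [-1. -1.  0.]

z = x.copy()
np.subtract.at(z, idx, 1)  # duplicates accumulate: index 0 is subtracted twice
print(z)                   # [-2. -1.  0.]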

Efficient way for calculating selected differences in array

I have two arrays as output from a simulation script, where one contains IDs and the other times, i.e. something like:
ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
These arrays are always of the same size. Now I need to calculate the differences of times, but only for those times with the same ids. Of course, I can simply loop over the different ids and do
for id in np.unique(ids):
    diffs = np.diff(times[ids == id])
    print diffs
    # do stuff with diffs
However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?
You can use array.argsort() and ignore the values corresponding to changes in ids:
>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2,  0.3,  0.6, -1.1,  1.2])
To see which values you need to discard, you could use a Counter to count the number of times per id (from collections import Counter), or just sort ids and see where their diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant:
times_diffs[np.diff(ids[id_ind]) == 0]  # ids[id_ind] is the sorted ids sequence
and finally you can split this array with np.split and np.where:
np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])
As you mentioned in your comment, argsort()'s default algorithm (quicksort) might not preserve order between equal times, so the argsort(kind='mergesort') option must be used.
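Putting the pieces of this answer together on the sample arrays (a sketch; note that after np.split, the first diff of every group except the first straddles an id boundary and has to be dropped):
import numpy as np

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

id_ind = ids.argsort(kind='mergesort')
times_diffs = np.diff(times[id_ind])
change = np.where(np.diff(ids[id_ind]) != 0)[0]

groups = np.split(times_diffs, change)
# drop the cross-boundary diff that starts each group after the first
clean = [groups[0]] + [g[1:] for g in groups[1:]]
print(clean)  # [array([ 0.2]), array([ 0.3,  0.6]), array([ 1.2])]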
Say you np.argsort by ids:
inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])
Now sort times by this, np.diff, and prepend a nan:
diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan, 0.2, -0.2, 0.3, 0.6, -1.1, 1.2])
These differences are correct except for the boundaries. Let's calculate those
boundaries = np.concatenate(([False], ids[inds][1: ] == ids[inds][: -1]))
>>> boundaries
array([False, True, False, True, True, False, True], dtype=bool)
Now we can just do
diffs[~boundaries] = np.nan
Let's see what we got:
>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])
>>> times[inds]
array([ 0.3, 0.5, 0.3, 0.6, 1.2, 0.1, 1.3])
>>> diffs
array([ nan, 0.2, nan, 0.3, 0.6, nan, 1.2])
I'm adding another answer, since, even though these things are possible in numpy, I think that the higher-level pandas is much more natural for them.
In pandas, you could do this in one step, after creating a DataFrame:
import pandas as pd

df = pd.DataFrame({'ids': ids, 'times': times})
df['diffs'] = df.groupby(df.ids).transform(pd.Series.diff)
This gives:
>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for this kind of grouping operation:
import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)
Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.

python; counting elements of vectors

I would like to count the number of elements of an array that are greater than a certain value t, and save the counts in a vector a. I want to do this for different values of t.
e.g. my vector:
c = [0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
I would like to count the number of elements of c that are greater than t=0.9, then t=0.8, then t=0.7, etc. I then want to save the counts for each different value of t in a vector.
My code (not working):
for t in range(0,10,1):
    for j in range(0, len(c)):
        if c[j]>t/10:
            a.append(sum(c[j]>t))
My vector a should be of dimension 10, but it isn't! Can anybody help me out?
I made a function that loops over the array and counts whenever the value is greater than the supplied threshold:
c = [0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]

def num_bigger(threshold):
    count = 0
    for num in c:
        if num > threshold:
            count += 1
    return count

thresholds = [x/10.0 for x in range(10)]
for thresh in thresholds:
    print thresh, num_bigger(thresh)
Note that the function checks for strictly greater, which is why, for example, the result is 0 when the threshold is .9.
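Run under Python 2 (hence the print statement syntax), this should produce something like:
0.0 12
0.1 10
0.2 8
0.3 5
0.4 5
0.5 3
0.6 2
0.7 1
0.8 1
0.9 0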
There are a few things wrong with your code.
my vector a should be of dimension 10, but it isn't!
That's because you don't append only 10 elements to your list. Look at your logic:
for t in range(0,10,1):
    for j in range(0, len(c)):
        if c[j]>t/10:
            a.append(sum(c[j]>t))
For each threshold, t, you iterate over all 12 items in c one at a time and you append something to the list. Overall, you get 120 items. What you should have been doing instead is (in pseudocode):
for each threshold:
    count = how many elements in c are greater than threshold
    a.append(count)
numpy.where() gives you the indices in an array where a condition is satisfied, so you just have to count how many indices you get each time. We'll get to the full solution in a moment.
Another potential error is t/10, which in Python 2 is integer division and will return 0 for all thresholds. The correct way would be to force float division with t/10. (note the trailing dot). If you're on Python 3, you get float division by default, so this might not be a problem. Notice also that you append sum(c[j] > t), where t is the integer between 0 and 9, not the threshold. Overall, your c[j] > t logic is wrong. You want to use a counter for all elements, like other answers have shown you, or collapse it all down to a one-liner list comprehension.
Finally, here's a solution fully utilising numpy.
import numpy as np

c = np.array([0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1])
thresh = np.arange(0, 1, 0.1)
counts = np.empty(thresh.shape, dtype=int)
for i, t in enumerate(thresh):
    counts[i] = len(np.where(c > t)[0])
print counts
Output:
[12 10 8 5 5 3 2 1 1 0]
Letting numpy take care of the loops under the hood is faster than Python-level loops. For demonstration:
import timeit

head = """
import numpy as np
c = np.array([0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1])
thresh = np.arange(0, 1, 0.1)
"""
numpy_where = """
for t in thresh:
    len(np.where(c > t)[0])
"""
python_loop = """
for t in thresh:
    len([element for element in c if element > t])
"""
n = 10000
for test in [numpy_where, python_loop]:
    print timeit.timeit(test, setup=head, number=n)
Which on my computer results in the following timings:
0.231292377372
0.321743753994
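As a side note, len(np.where(c > t)[0]) can be replaced by a direct reduction on the boolean mask, which is shorter and avoids materializing the index array; the loop over thresholds can even be folded into a single broadcast comparison (a sketch):
import numpy as np

c = np.array([0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1])
thresh = np.arange(0, 1, 0.1)

# one boolean reduction per threshold
counts = np.array([(c > t).sum() for t in thresh])

# or a single (10, 12) comparison, counted along each row
counts = np.count_nonzero(c[None, :] > thresh[:, None], axis=1)
print(counts)  # [12 10  8  5  5  3  2  1  1  0]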
Your problem is here:
if c[j]>t/10:
Notice that both t and 10 are integers and so you perform integer division.
The easiest solution with the least changes is to change it to:
if c[j]>float(t)/10:
to force float division
So the whole code would look something like this:
a = []
c = [0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
for i in range(10):  # thresholds 0.0 through 0.9
    count = 0
    cutoff = float(i)/10
    for ele in c:
        if ele > cutoff:
            count += 1
    a.append(count)
print(len(a))  # prints 10, one count per cutoff from 0.0 to 0.9
print(a)       # the counts of elements greater than each cutoff
You have to divide t / 10.0 so the result is a decimal; the result of t / 10 (in Python 2) is an integer:
a = []
c = [0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
for t in range(0, 10, 1):
    count = 0
    for j in range(0, len(c)):
        if c[j] > t/10.0:
            count = count + 1
    a.append(count)
for t in range(0, 10, 1):
    print(str(a[t]) + ' elements in c are bigger than ' + str(t/10.0))
Output:
12 elements in c are bigger than 0.0
10 elements in c are bigger than 0.1
8 elements in c are bigger than 0.2
5 elements in c are bigger than 0.3
5 elements in c are bigger than 0.4
3 elements in c are bigger than 0.5
2 elements in c are bigger than 0.6
1 elements in c are bigger than 0.7
1 elements in c are bigger than 0.8
0 elements in c are bigger than 0.9
If you simplify your code, bugs won't have places to hide!
c = [0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
a = []
for t in [x/10 for x in range(10)]:
    a.append((t, len([x for x in c if x > t])))
a
[(0.0, 12),
(0.1, 10),
(0.2, 8),
(0.3, 5),
(0.4, 5),
(0.5, 3),
(0.6, 2),
(0.7, 1),
(0.8, 1),
(0.9, 0)]
or even this one-liner
[(r/10,len([x for x in c if x>r/10])) for r in range(10)]
It depends on the sizes of your arrays, but your current solution has O(m*n) complexity, m being the number of values to test and n the size of your array. You may be better off with O((m+n)*log(n)) by first sorting your array in O(n*log(n)) and then using binary search to find the m values in O(m*log(n)). Using numpy and your sample c list, this would be something like:
>>> c
[0.3, 0.2, 0.3, 0.6, 0.9, 0.1, 0.2, 0.5, 0.3, 0.5, 0.7, 0.1]
>>> thresholds = np.linspace(0, 1, 10, endpoint=False)
>>> thresholds
array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
>>> len(c) - np.sort(c).searchsorted(thresholds, side='right')
array([12, 10, 8, 5, 5, 3, 2, 1, 1, 0])

Python - Converting an array to a list causes values to change

>>> import numpy as np
>>> a=np.arange(0,2,0.2)
>>> a
array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8])
>>> a=a.tolist()
>>> a
[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2000000000000002, 1.4000000000000001, 1.6, 1.8]
>>> a.index(0.6)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: 0.6 is not in list
It appears that some values in the list have changed, and I can't find them with index(). How can I fix that?
0.6 hasn't changed; it was never there:
>>> import numpy as np
>>> a = np.arange(0, 2, 0.2)
>>> a
array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8])
>>> 0.0 in a
True # yep!
>>> 0.6 in a
False # what?
>>> 0.6000000000000001 in a
True # oh...
The numbers in the array are rounded for display purposes, but the array really contains the value you subsequently see in the list: 0.6000000000000001. 0.6 cannot be precisely represented as a float, so it is unwise to rely on floating-point numbers comparing precisely equal!
One way to find the index is to use a tolerance approach:
def float_index(seq, f):
    # return the index of the first element within tolerance of f
    # (implicitly returns None if nothing matches)
    for i, x in enumerate(seq):
        if abs(x - f) < 0.0001:
            return i
which will work on the array too:
>>> float_index(a, 0.6)
3
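Since numpy is already in play here, np.isclose can do the tolerance comparison in one vectorized step (a sketch, relying on np.isclose's default tolerances, which are comfortably tight enough for this case):
import numpy as np

a = np.arange(0, 2, 0.2)
matches = np.where(np.isclose(a, 0.6))[0]
print(matches)  # [3]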
