Numpy Divide Arrays With Multiple Out Conditions - python

I have two two-dimensional arrays (say arrayA & arrayB) that are exactly the same size (2500X, 1500Y). I am interested in dividing arrayA by arrayB, but there are three conditions that I would like to exclude from the division and instead replace with a specific value. These conditions are:
If arrayB contains zero at point (Bx,By), replace output (Cx,Cy) with (arrayA*arrayA)
If arrayA contains zero at point (Ax,Ay), replace output (Cx,Cy) with 0.50
If both arrayA & B at overlapping points (Ax,Ay & Bx,By) contain 0, replace output (Cx,Cy) with 1
I've found that numpy.divide parameters out and where allow me to define each of these individually, so I've taken the first condition and arranged it as follows:
arrayC = np.divide(arrayA, arrayB, out=(arrayA*arrayA), where=arrayB!=0)
My question is how can I combine the other two conditions and their desired outputs within this operation?

One solution, though I'm not sure it is the fastest:
za = A == 0
zb = B == 0
case0 = (~za) & ~zb            # both non-zero: normal division
case1 = zb & ~za               # only B is zero: use A*A
case2 = za & ~zb               # only A is zero: use 0.5
case3 = za & zb                # both are zero: use 1
C = case3*1 + case2*0.5 + case1*A*A    # cases 3, 2 and 1
C[case0] = A[case0] / B[case0]         # case 0: plain division
Could be more compact with fewer intermediate values, but I've chosen clarity.
You could also use a cascade of np.where
zb = B == 0
C = np.where(A==0, np.where(zb, 1, 0.5), np.where(zb, A*A, A/B))   # A/B is evaluated everywhere, so this warns on division by zero
Edit: better version (but still not perfect)
zb = B == 0
za = A == 0
C = np.where(za, np.where(zb, 1, 0.5), A*A)    # fill the three special cases first
np.divide(A, B, out=C, where=(~zb) & ~za)      # divide in place only where A and B are both non-zero
It combines np.where with the where= argument of np.divide from your attempt.
It is as fast as the previous solution.
And it does not complain about division by zero, since the division occurs only where it is needed.
Nevertheless, it computes the first version of C (the one before np.divide), and in particular A*A, everywhere, even where it is not needed, since those values will be overwritten.
So it could probably still be improved.
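As a small self-contained check of this combined approach (a sketch using toy 2x2 arrays in place of the 2500x1500 ones from the question):
import numpy as np
A = np.array([[4.0, 0.0], [6.0, 0.0]])
B = np.array([[2.0, 3.0], [0.0, 0.0]])
za, zb = A == 0, B == 0
C = np.where(za, np.where(zb, 1, 0.5), A*A)   # both zero -> 1, only A zero -> 0.5, only B zero -> A*A
np.divide(A, B, out=C, where=(~za) & ~zb)     # plain division everywhere else
print(C)                                      # 2.0 and 0.5 in the first row, 36.0 and 1.0 in the second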

Related

Ordering a two-dimensional array relative to the main diagonal

I am given a two-dimensional array T of size NxN, filled with various natural numbers (they are not sorted in any way, as in the example below). My task is to write a program that transforms the array in such a way that all elements lying above the main diagonal are larger than each element lying on the diagonal, and all elements lying below the main diagonal are smaller than each element on the diagonal.
For example:
T looks like this:
[ 2,  3,  5]
[ 7, 11, 13]
[17, 19, 23]
and one of the possible solutions is:
[13, 19, 23]
[ 3,  7, 17]
[ 5,  2, 11]
I have no clue how to do this. Would anyone have an idea what algorithm should be used here?
Let's say the matrix is NxN.
Put all N² values inside an array.
Sort the array with whatever method you prefer (ascending order).
In your final array, the (N²-N)/2 first values go below the diagonal, the following N values go to the diagonal, and the final (N²-N)/2 values go above the diagonal.
The following pseudo-code should do the job:
mat <- array[N][N]          // To be initialized; mat[i][j] is row i, column j.
vec <- array[N*N]
// Copy the matrix into a flat vector.
for i : 0 to (N-1)
    for j : 0 to (N-1)
        vec[i*N+j] <- mat[i][j]
    next j
next i
sort(vec)                   // ascending order
p_below <- 0                // smallest values: strictly below the diagonal
p_diag  <- (N*N-N)/2        // middle N values: the diagonal
p_above <- (N*N+N)/2        // largest values: strictly above the diagonal
// Refill the matrix partition by partition.
for i : 0 to (N-1)
    for j : 0 to (N-1)
        if (i>j)            // below the diagonal
            mat[i][j] <- vec[p_below]
            p_below <- p_below + 1
        endif
        if (i<j)            // above the diagonal
            mat[i][j] <- vec[p_above]
            p_above <- p_above + 1
        endif
        if (i=j)            // on the diagonal
            mat[i][j] <- vec[p_diag]
            p_diag <- p_diag + 1
        endif
    next j
next i
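For reference, a rough NumPy equivalent of this pseudo-code (just a sketch, assuming mat is a square numpy array):
import numpy as np

def rearrange(mat):
    # Reorder mat so that everything above the diagonal is larger than every
    # diagonal element, and everything below is smaller than every diagonal element.
    n = mat.shape[0]
    vec = np.sort(mat.ravel())
    below, diag, above = np.split(vec, [(n*n - n) // 2, (n*n + n) // 2])
    out = np.empty_like(mat)
    out[np.tril_indices(n, -1)] = below   # smallest values strictly below the diagonal
    out[np.diag_indices(n)] = diag        # middle values on the diagonal
    out[np.triu_indices(n, 1)] = above    # largest values strictly above the diagonal
    return out

T = np.array([[2, 3, 5], [7, 11, 13], [17, 19, 23]])
print(rearrange(T))   # one valid arrangement: diagonal gets 7, 11, 13; above gets 17, 19, 23; below gets 2, 3, 5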
Code can be heavily optimized by sorting the matrix directly, using a (quite complex) custom sort operator, so it can be sorted "in place". Technically, you'll build a bijection between the matrix indices and a partitioned set of indices representing "below diagonal", "diagonal" and "above diagonal" indices.
But I'm unsure that it can be considered an algorithm in itself, because it will be highly dependent on the language used AND on how your matrix is stored internally (and on how iterators/indices are used). I could write one in C++, but I lack the knowledge to give you such an operator in Python.
Obviously, if you can't use a standard sorting function (because it can't work on anything other than an array), then you can write your own with the tricky comparison built into the algorithm.
For such small matrices, even a bubble sort can work properly, but obviously implementing at least a quicksort would be better.
Some points about optimizing:
First, we speak about the trivial bijection from matrix coordinates [x][y] to [i]: i = x + y*N. The inverse is obviously x = i mod N and y = floor(i/N). Then, you can parse the matrix as a vector.
This is already what I do in the first part initializing vec, BTW.
With matrix coordinates, it's easy:
Diagonal is all cells where x=y.
The "below" partition is everywhere x<y.
The "above" partition is everywhere x>y.
Look at the coordinates in the 3x3 matrix below; it's quite evident once you see it.
0,0 1,0 2,0
0,1 1,1 2,1
0,2 1,2 2,2
We already know that the ordered vector will be composed of three parts: first the "below" partition, then the "diagonal" partition, then the "above" partition.
The next bijection is way more tricky, since it requires either a piecewise linear function OR a look-up table. The first requires no additional memory but uses more CPU power; the second uses as much memory as the matrix but requires less CPU power.
As always, optimizing for speed often costs memory. If memory is scarce because you use huge matrices, then you'll prefer the function.
To keep this short, I'll explain only the "below" partition. In the vector, the first (N-1) elements are the ones belonging to the first column. Then we have (N-2) elements for the 2nd column, (N-3) for the third, and so on until only 1 element remains for the (N-1)th column. You can see the scheme: the number of elements plus the column's zero-based index is always (N-1).
I won't write the function, because it's quite complex and, honestly, it won't help so much to understand. Simply know that converting from matrix indices to the vector is "quite easy".
The opposite is more tricky and CPU-intensive, and it SHOULD use an (N-1)-element vector storing where each column starts within the vector to GREATLY speed up the process. Conveniently, this vector can also be used (from end to beginning) for the "above" partition, so it won't burn too much memory.
Now, you can sort your "vector" normally, simply by chaining the two bijections together with the vector index, and you'll get a matrix cell instead. As long as the sorting algorithm is stable (that's usually the case), it will work and will sort your matrix "in place", at the expense of a lot of mathematical computing to "route" the linear indices to matrix indices.
Please note that, although we speak about bijections, we need ONLY the "vector to matrix" formulas. The "matrix to vector" formulas are important - it MUST be a bijection! - but you won't use them, since you sort the (virtual) vector from 0 to N²-1 directly.
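Purely for illustration (the answer above deliberately skips writing it out), one possible sketch of the "vector index to matrix coordinates" mapping for the "below" partition, using x for the column and y for the row as in the 3x3 example:
def below_to_xy(k, n):
    # Map index k (0 <= k < (n*n - n) // 2) within the "below" partition,
    # stored column by column, to matrix coordinates (x, y) with x < y.
    x, remaining = 0, n - 1        # column 0 holds n-1 elements, column 1 holds n-2, ...
    while k >= remaining:
        k -= remaining
        x += 1
        remaining -= 1
    return x, x + 1 + k            # within column x, rows run from y = x+1 to n-1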

Python np.sqrt(x-a)*np.heaviside(x,a)

I am trying to implement a calculation from a research paper. In that calculation, the value of a function is supposed to be
0, for x<a
sqrt(x-a)*SOMETHING_ELSE, for x>=a
In my module, x and a are 1D numpy-arrays (of the same length). In my first attempt I implemented the function as
f = np.sqrt(x-a)*SOMETHING*np.heaviside(x,a)
But for x<a, np.sqrt() returns NaN and even though the heaviside function returns 0 in that case, in Python 0*NaN = NaN.
I could also replace all NaNs in my resulting array with 0s afterwards, but that would lead to warning outputs from numpy.sqrt() used on negative values that I would need to suppress. Another solution is to treat the argument of the square root as an imaginary number by adding 0j and taking the real part afterwards:
f = np.real(np.sqrt(x-a+0j)*SOMETHING*np.heaviside(x,a))
But I feel like both solutions are not really elegant and the second solution is unnecessarily complicated to read. Is there a more elegant way to do this in Python that I am missing here?
You can cheat with np.maximum in this case to not compute the square root of negative numbers.
Moreover, please note that np.heaviside does not use a as a threshold but 0 (the second argument is the value returned where the first argument is exactly 0). You can use np.where instead.
Here is an example:
f = np.where(x<a, 0, np.sqrt(np.maximum(x-a, 0))*SOMETHING)
Note that in this specific case, the expression can be simplified and np.where is not even needed (because np.sqrt(np.maximum(x-a, 0)) already gives 0 where x < a). Thus, you can simply write:
f = np.sqrt(np.maximum(x-a, 0))*SOMETHING
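A quick sanity check with made-up values (SOMETHING here simply stands in for the paper's other factor):
import numpy as np

x = np.linspace(0.0, 2.0, 5)
a = np.full_like(x, 1.0)
SOMETHING = 2.0

f = np.sqrt(np.maximum(x - a, 0)) * SOMETHING
print(f)   # zeros where x < a, 2*sqrt(x - a) elsewhere, with no NaNs or warnings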

Float rounding error with Numpy isin function

I'm trying to use the isin() function from Numpy library to find elements that are common in two arrays.
Seems pretty basic, but one of those arrays is created using linspace() and the other I just put hard values in.
But it seems like isin() is using == for its comparisons, and so the result returned by the method is missing one of the numbers.
Is there a way I can work around this, either by defining my arrays differently or by using a method other than isin() ?
thetas = np.array(np.linspace(.25, .50, 51))
known_thetas = [.3, .35, .39, .41, .45]
unknown_thetas = thetas[np.isin(thetas, known_thetas, assume_unique = True, invert = True)]
Printing the three arrays, I find that .41 is still in the third array, because when printing them one by one, my value in the first array is actually 0.41000000000000003, which means == comparison returns False. What is the best way of working around this ?
We could make use of np.isclose after extending one of those arrays to 2D for an outer closeness comparison, then do an ANY reduction along that axis to get a 1D boolean array that can be used to mask the relevant input array -
thetas[~np.isclose(thetas[:,None],known_thetas).any(1)]
To customize the level of tolerance for matches, we could feed in custom relative and absolute tolerance values to np.isclose.
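For example, with an absolute tolerance only (values chosen for illustration):
import numpy as np

thetas = np.linspace(.25, .50, 51)
known_thetas = np.array([.3, .35, .39, .41, .45])

# keep only the thetas that are not within 1e-8 of any known value
mask = np.isclose(thetas[:, None], known_thetas, rtol=0, atol=1e-8).any(axis=1)
unknown_thetas = thetas[~mask]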
If you are looking for performance on large arrays, we could optimize on memory and hence performance too with a NumPy implementation of np.isin with a tolerance arg for floating point numbers, based on np.searchsorted (the isin_tolerance helper is not reproduced here) -
thetas[~isin_tolerance(thetas,known_thetas,tol=0.001)]
Feed in your tolerance value in tol arg.
If you have a fixed absolute tolerance, you can use np.around to round the values before comparing:
unknown_thetas = thetas[np.isin(np.around(thetas, 5), known_thetas, assume_unique = True, invert = True)]
This rounds thetas to 5 decimal digits, but it's up to you to decide how close the numbers need to be for you to consider them equal.

Checking validity of permutations in python

Using python, I would like to generate all possible permutations of 10 labels (for simplicity, I'll call them a, b, c, ...), and return all permutations that satisfy a list of conditions. These conditions have to do with the ordering of the different labels - for example, let's say I want to return all permutations in which a comes before b and when d comes after e. Notably, none of the conditions pertain to any details of the labels themselves, only their relative orderings. I would like to know what the most suitable data structure and general approach is for dealing with these sorts of problems. For example, I can generate all possible permutations of elements within a list, but I can't see a simple way to verify whether a given permutation satisfies the conditions I want.
"The most suitable data structure and general approach" varies, depending on the actual problem. I can outline three basic approaches to the problem you give (generate all permutations of 10 labels a, b, c, etc. in which a comes before b and d comes after e).
First, generate all permutations of the labels, using itertools.permutations, and remove/skip the ones where a comes after b or d comes before e. Given a particular permutation p (represented as a Python tuple) you can check for
p.index("a") < p.index("b") and p.index("d") > p.index("e")
This has the disadvantage that you reject three-fourths of the permutations that are initially generated, and that expression involves four passes through the tuple. But this is simple and short and most of the work is done in the fast code inside Python.
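A minimal sketch of this first approach (not from the original answer):
from itertools import permutations

labels = "abcdefghij"
valid = (p for p in permutations(labels)
         if p.index("a") < p.index("b") and p.index("d") > p.index("e"))
# exactly one quarter of the 10! = 3628800 permutations pass, i.e. 907200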
Second, generate all permutations of the positions 0 through 9. Consider these to represent the inverses of your desired permutations. In other words, the number at position 0 is not what will go to position 0 in the permutation but rather shows where label a will go in the permutation. Then you can quickly and easily check for your requirements:
p[0] < p[1] and p[3] > p[4]
since a is the 0'th label, etc. If the permutation passes this test, then find the inverse permutation of this and apply it to your labels. Finding the inverse involves one or two passes through the tuple, so it makes fewer passes than the first method. However, this is more complicated and does more work outside the innards of Python, so it is very doubtful that this will be faster than the first method.
Third, generate only the permutations you need. This can be done with these steps.
3a. Note that there are four special positions in the permutations (those for a, b, d, and e). So use itertools.combinations to choose 4 positions out of the 10 total positions. Note I said positions, not labels, so choose 4 integers between 0 and 9.
3b. Use itertools.combinations again to choose 2 of those positions out of the 4 already chosen in step 3a. Place a in the first (smaller) of those 2 positions and b in the other. Place e in the first of the other 2 positions chosen in step 3a and place d in the other.
3c. Use itertools.permutations to choose the order of the other 6 labels.
3d. Interleave all that into one permutation. There are several ways to do that. You could make one pass through, placing everything as needed, or you could use slices to concatenate the various segments of the final permutation.
That third method generates only what you need, but the time involved in constructing each permutation is sizable. I do not know which of the methods would be fastest--you could test with smaller sizes of permutations. There are multiple possible variations for each of the methods, of course.
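Putting the third method together might look roughly like this (a sketch following steps 3a-3d above, not code from the answer):
from itertools import combinations, permutations

def constrained_perms(labels="abcdefghij"):
    others = [x for x in labels if x not in "abde"]
    n = len(labels)
    for special in combinations(range(n), 4):         # 3a: positions for a, b, d, e
        for ab in combinations(special, 2):           # 3b: a goes to the smaller of these, b to the larger
            de = [p for p in special if p not in ab]  #     e goes to the smaller remaining one, d to the larger
            for rest in permutations(others):         # 3c: order the remaining 6 labels
                perm = [None] * n                     # 3d: interleave everything into one permutation
                perm[ab[0]], perm[ab[1]] = "a", "b"
                perm[de[0]], perm[de[1]] = "e", "d"
                it = iter(rest)
                for i in range(n):
                    if perm[i] is None:
                        perm[i] = next(it)
                yield tuple(perm)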

Python Grouping Data

I have a set of data:
(1438672131.185164, 377961152)
(1438672132.264816, 377961421)
(1438672133.333846, 377961690)
(1438672134.388937, 377961954)
(1438672135.449144, 377962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 377963285)
(1438672139.44115, 377963388)
What I need to figure out is how to group them. Until now I've used a super-duper simple approach, just diffing the second elements of two consecutive tuples, and if the diff was bigger than a certain pre-defined threshold I'd put them into different groups. But it has yielded only unsatisfactory results.
But theoretically I imagine that it should be possible to determine whether a value of the second part of the tuple belongs to the same group or not by fitting them to one or multiple lines, because I know that the first part of the tuple is strictly monotonic (it's a timestamp from time.time()) and I know that all resulting data sets will be close to linear. Let's say the tuple is (y, x). There are only three options:
Either all data fits the same equation y = mx + c
Or there is only a differing offset c or
there is an offset c and a different m
The above set would be one group only. The following set would resolve into three groups:
(1438672131.185164, 377961152)
(1438672132.264816, 961421)
(1438672133.333846, 477961690)
(1438672134.388937, 377961954)
(1438672135.449144, 962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 477963285)
(1438672139.44115, 963388)
group1:
(1438672131.185164, 377961152)
(1438672134.388937, 377961954)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
group2:
(1438672132.264816, 961421)
(1438672135.449144, 962220)
(1438672139.44115, 963388)
group3:
(1438672133.333846, 477961690)
(1438672139.069998, 477963285)
Is there a module or otherwise simple solution that will solve this problem? I've found least-squares in numpy and scipy, but I'm not quite sure how to properly use or apply them. If there is another way besides linear functions I'm happy to hear about them as well!
EDIT 2
It is a two dimensional problem unfortunately, not one-dimensional. For example
(1439005464, 477961152)
should (assuming for this data a relationship of approximately 1:300) still be in the first group.
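As an illustration of the "differing offset c" idea above, a rough sketch (with a made-up threshold, and assuming all groups share approximately the same slope m): fit a single least-squares line to all points, then split wherever the sorted residuals jump by more than the threshold.
import numpy as np

data = np.array([
    (1438672131.185164, 377961152),
    (1438672132.264816, 961421),
    (1438672133.333846, 477961690),
    (1438672134.388937, 377961954),
    (1438672135.449144, 962220),
    (1438672136.540044, 377962483),
])
t, x = data[:, 0], data[:, 1]

t0 = t[0]                            # center the timestamps for a better-conditioned fit
m, c = np.polyfit(t - t0, x, 1)      # one least-squares line over all points
residuals = x - (m * (t - t0) + c)   # per-point offset from that common line

order = np.argsort(residuals)
threshold = 1e6                      # tune to the expected separation between offsets
groups, current = [], [order[0]]
for prev, idx in zip(order, order[1:]):
    if residuals[idx] - residuals[prev] > threshold:
        groups.append(current)
        current = []
    current.append(idx)
groups.append(current)
print(groups)                        # indices of the points, grouped by offset level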
