I have the following NumPy array:
example = np.array([[1.5525672727035909, 0.9550488599348534, 0.04495114006514658, -4757.845003575899, -4747.172432255857, 1],
[1.3050643768242065, 0.962214983713355, 0.03778501628664495, -5024.418466943938, -5013.745895623896, 2],
[1.3950687447554788, 0.9596091205211726, 0.040390879478827364, -4922.047207088476, -4911.374635768434, 3],
[1.2375603195101852, 0.9641693811074918, 0.035830618892508145, -5105.942048800849, -5095.269477480807, 4],
[1.2375603195101852, 0.9641693811074918, 0.035830618892508145, -5105.942048800849, -5095.269477480807, 5],
[1.2375597985998075, 0.9641693811074918, 0.035830618892508145, -5105.942048800849, -5095.269477480807, 6],
[1.2375597985998072, 0.9641693811074918, 0.035830618892508145, -5105.942048800849, -5095.269477480807, 7],
[1.215059487982556, 0.9648208469055375, 0.03517915309446254, -5134.107976656531, -5123.435405336489, 8],
[1.1250535573201497, 0.9674267100977199, 0.03257328990228013, -5252.243174800487, -5241.570603480445, 9],
[1.1250551200512835, 0.9674267100977199, 0.03257328990228013, -5252.243174800487, -5241.570603480445, 10]])
As you can see, it consists of an array of different parameters across 10 rows (the last column is an index of the rows). What I am trying to do is get the index of the row that best meets several criteria from different columns.
For instance: as close to zero as possible in column 1, as close to 100 as possible in column 2, and as far from zero as possible in columns 3 and 4. In other words, I would like an optimisation that returns the index of the row that best satisfies all the criteria together.
So far I have only managed to select rows based on the individual column criteria, but I have not got to the point where it takes all the considerations into account together.
Thanks in advance.
As mentioned in the comment, you need to be more specific about your cost function.
In general a quadratic error function is a good idea, for reasons I won't go into here. In that case it's simply a matter of doing the following:
goal = np.array([0, 100, 0, 0, 0, 0])    # target value for each column
weight = np.array([1, 1, -1, -1, 0, 0])  # positive: attract, negative: repel, zero: ignore
costs = np.sum(((example - goal)**2) * weight, axis=1)
print(example[np.argmin(costs), -1])     # prints 9.0 for the example matrix
goal defines the target points.
weight indicates whether we want to be close to them (positive) or far from them (negative).
So to compute the cost you subtract the goal from example, which leaves you with distances. Next you square them (this gets rid of negative signs; alternatively you can take the absolute value) and multiply them by the weights. Finally you sum up the individual column costs to get the overall row cost and select the "cheapest" row. You can then pick out the row index (the last column).
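One caveat: the columns in your example differ in scale by several orders of magnitude, so the squared distances in columns 3 and 4 dominate the raw cost. A minimal sketch of one way to handle that, assuming min-max scaling suits your data (mapping the "close to 100" target to 1, the top of the scaled axis, is purely illustrative):
col_min = example.min(axis=0)
col_max = example.max(axis=0)
span = np.where(col_max > col_min, col_max - col_min, 1)  # avoid division by zero
scaled = (example - col_min) / span                       # every column now in [0, 1]
goal = np.array([0, 1, 0, 0, 0, 0])                       # targets on the scaled axes
weight = np.array([1, 1, -1, -1, 0, 0])
costs = np.sum(((scaled - goal)**2) * weight, axis=1)
print(example[np.argmin(costs), -1])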
Related
I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as true. I'm open to different solutions, including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is in [3, 5], my positive column gets a +1 in that row; all other rows are 0 in that column. Likewise, when the third column is in [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. 0s should be ignored, and the things I've tried so far only seem to create new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the 1 and -1 transitions fine, but it also flags the zeros as sign changes. I'm not sure if I should change how the combined column is made, change the logic for the creation of the component columns, or both. The big thing is I need to vectorize this code for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg. Then you can try something like the following:
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0)*1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) > 0)*(-1)
You can then combine your two switch columns.
Explanations
np.diff gives you the row-by-row difference: for the pos column it yields 1 for a 0-to-1 transition and -1 for a 1-to-0 transition. Considering your desired output, you only want to keep the 0-to-1 transitions, which is why you keep only the greater-than-zero output. The neg column mirrors this: a 0-to--1 transition yields -1, so there you keep only the less-than-zero output.
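Putting it together, a minimal sketch on the data from the question (assuming the combined column is simply the sum of the two switch columns):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "neg": [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})
df["switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1     # mark 0 -> 1 transitions
df["switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)  # mark 0 -> -1 transitions
combined = df.switch_pos + df.switch_neg
print(combined.tolist())
# [1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, 1]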
Say I have the following matrix:
A = [[7, 5, 1, 2],
     [10, 1, 3, 8],
     [2, 2, 2, 3]]
I need to extract the row with elements closest to 0 compared to all other rows, i.e. the row with the minimal elements. So I need [2,2,2,3].
I have tried a number of things: np.min, np.amin, np.argmin.
But all are giving me the minimum values of each column, for example:
[2,1,1,2]
This is not what I'm looking for.
If someone knows the right function, could you point me to its documentation?
Thank you.
It depends on how you define distance when you say closest. I'm guessing you are looking for the Euclidean distance, i.e. the L2 norm, here. In that case, you can just find the row with the minimum sum of squares:
A[(A ** 2).sum(1).argmin()]
# array([2, 2, 2, 3])
You can also find the closest row by the L1 norm, i.e. the sum of absolute differences from 0:
A[np.abs(A).sum(1).argmin()]
# array([2, 2, 2, 3])
In this dummy example, the two methods give the same result, but they could be different depending on the actual data.
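A quick illustration, with made-up data, of where the two norms can disagree:
import numpy as np

B = np.array([[0, 5],
              [3, 3]])
print(B[(B ** 2).sum(1).argmin()])   # [3 3] -- L2 penalises the single large 5 heavily
print(B[np.abs(B).sum(1).argmin()])  # [0 5] -- L1 only compares the totals, 5 vs 6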
import numpy as np
A = np.array([[7, 5, 1, 2],
              [10, 1, 3, 8],
              [2, 2, 2, 3]])
print(A[np.argmin(A.sum(axis=1))])
# [2 2 2 3]
Sum the rows, then find the row index of the minimum value, and finally find the row.
The first way that comes to mind is to find the minimum of each row, then find the argmin of that array.
row_mins = A.min(axis=1)
row_with_minimum = row_mins.argmin()
Then to get the row with the minimum element, do
A[row_with_minimum, :]
Note that this picks the row containing the single smallest element (here [7,5,1,2], since row 0 contains a 1), which is a different criterion from summing the rows as above.
I have two very long time series. I have to check if Series B is present (in the given order) in Series A.
Series A: 1,2,3,4,5,6,5,4,3.
Series B: 3,4,5.
Result: True, with the index where the first element of the smaller series is found. Here, index 2 (as 3 first appears at index 2 in Series A).
Note: the two series are quite big. Let's say A contains 50000 elements and B contains 350.
A very slow solution is to convert both series to lists and check whether the sub-list appears, in order, within the main list:
def is_series_a_subseries_in_order(main, sub):
    n = len(sub)
    main = main.tolist()
    sub = sub.tolist()
    return any((main[i:i+n] == sub) for i in range(len(main) - n + 1))
This will return True or False.
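For instance, with the series from the question:
import pandas as pd

ser_a = pd.Series([1, 2, 3, 4, 5, 6, 5, 4, 3])
ser_b = pd.Series([3, 4, 5])
print(is_series_a_subseries_in_order(ser_a, ser_b))  # True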
A naive approach is to look for B(1) in A. In your example B(1) = A(3) (using 1-based indices), so next you check whether B(2) = A(4), and you continue until the end of your substring... If any comparison fails, you start again from A(4) and continue to the end.
A better way to search for a substring is to apply Knuth-Morris-Pratt's algorithm. I'll let you search for more information about it!
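For reference, a minimal sketch of Knuth-Morris-Pratt in Python (my own illustration, not a tuned implementation; it runs in O(len(A) + len(B))):
def kmp_find(haystack, needle):
    """Return the start index of the first occurrence of needle in haystack, or -1."""
    haystack, needle = list(haystack), list(needle)
    if not needle:
        return 0
    # failure[i] = length of the longest proper prefix of needle[:i+1]
    # that is also a suffix of it
    failure = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = failure[k - 1]
        if needle[i] == needle[k]:
            k += 1
        failure[i] = k
    # scan the haystack, falling back via the failure table on mismatches
    k = 0
    for i, x in enumerate(haystack):
        while k > 0 and x != needle[k]:
            k = failure[k - 1]
        if x == needle[k]:
            k += 1
        if k == len(needle):
            return i - k + 1
    return -1

print(kmp_find([1, 2, 3, 4, 5, 6, 5, 4, 3], [3, 4, 5]))  # 2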
Unfortunately the rolling method of pandas cannot be used as an iterator, even though implementation is planned in #11704.
Thus we have to implement a rolling window for subset checking on our own.
import pandas as pd

ser_a = pd.Series(data=[1, 2, 3, 4, 5, 6, 5, 4, 3])
ser_b = pd.Series(data=[3, 4, 5])

slider_df = pd.concat(
    [ser_a.shift(-i)[:ser_b.size] for i in range(ser_a.size - ser_b.size + 1)],
    axis=1).astype(ser_a.dtype).T
sub_series = (ser_b == slider_df).all(axis=1)
# if you want, you can extract only the indices where a subseries was found:
sub_series_startindex = sub_series.index[sub_series]
What I am doing here:
[ser_a.shift(-i)[:ser_b.size] for i in range(ser_a.size - ser_b.size + 1)]: Create a "rolling window" by increased shifting of ser_a, limited to the size of the sub series ser_b to check for. Since shifts at the end will yield NaN, these are excluded in the range.
pd.concat(..., axis=1): Concatenate shifted Series, so that slider_df contains all shifts in the columns.
.astype(ser_a.dtype): is strictly optional. For large Series this may improve performance, for small Series it may degrade performance.
.T: transpose the DataFrame, so that the sub-series indices are aligned along axis 0.
sub_series = (ser_b == slider_df).all(axis=1): Find where ser_b matches sub-series.
sub_series.index[sub_series]: extract the indices, where a matching sub-series was found.
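For the example series above, the only match starts at index 2:
print(sub_series_startindex.tolist())  # [2]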
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
j = 0
for j in range(0, len(TestIDs)):
    TestID = str(TestIDs[j])
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    inputfile = open(directory, 'r')
    reader = csv.reader(inputfile)
    m = 0
    for row in reader:
        if m < 9:
            if row[0] != 'TestID':
                ANOVAInputMatrixValuesArray[(j-1), m] = row[2]
                m += 1
    inputfile.close()

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:, ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) produces a boolean array of the same shape as the original a that contains True for all non-zero elements.
The .sum(axis) call sums the elements over the given axis, and the sum of True/False elements is the number of True elements.
The variables columns and rows now contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications to your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j, :] = loadtxt(csvfile, comments='TestId', delimiter=';', usecols=(2,))[:9]

nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
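If the answer is "treat such a row/column as having mean 0" (one possible convention, assumed here purely for illustration), a guarded sketch:
counts = (a != 0).sum(0)
# columns with no non-zero entries get mean 0; np.maximum avoids dividing by zero
colMean = np.where(counts > 0, a.sum(0) / np.maximum(counts, 1), 0)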
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix gives the offsets into the data array at which each row's entries begin and end. So taking the difference between consecutive entries gives the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) as in Marat's and Finn's solutions.
If you need both row and column nonzero counts and, say, m is already CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])  # minlength covers trailing all-zero columns
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
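A small self-contained check, with a made-up matrix:
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 0, 3],
                                [4, 5, 0]]))
print(np.diff(m.indptr))                             # [2 1 2], non-zeros per row
print(np.bincount(m.indices, minlength=m.shape[1]))  # [2 1 2], non-zeros per column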
A faster way is to clone your matrix with ones instead of the real values, then just sum up by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones(X_clone.data.shape)
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
This ran 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53).
edit:
Perhaps you will need to convert NumNonZeroElementsByColumn into a 1-dimensional array via
np.array(NumNonZeroElementsByColumn)[0]
(the sparse sum returns a 2-D np.matrix).
For sparse matrices, use the getnnz() function supported by CSR/CSC matrices.
E.g.
import scipy.sparse

a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
# array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i, j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.
I have an array like:
a = array([[1,2,3],[3,4,5],[4,5,6]])
What's the most efficient way to slice out a 3x2 array from this that has only the last two columns of a?
i.e.
array([[2,3],[4,5],[5,6]]) in this case.
Two-dimensional numpy arrays are indexed using a[i,j] (not a[i][j]), and you can use the same slicing notation with numpy arrays and matrices as you can with ordinary Python lists, just with the slices for both dimensions inside a single [] separated by a comma:
>>> from numpy import array
>>> a = array([[1,2,3],[3,4,5],[4,5,6]])
>>> a[:,1:]
array([[2, 3],
[4, 5],
[5, 6]])
Is this what you're looking for?
a[:,1:]
To quote the documentation, the basic slice syntax is i:j:k, where i is the starting index, j is the stopping index, and k is the step (when k > 0).
Now if i is not given, it defaults to 0 for k > 0. Otherwise i defaults to n - 1 for k < 0 (where n is the length of the array).
If j is not given, it defaults to n (the length of the array).
That's for a one dimensional array.
Now a two dimensional array is a different beast. The slicing syntax for that is a[rowrange, columnrange].
So if you want all the rows, but just the last two columns, like in your case, you do:
a[0:3, 1:3]
Here, "[0:3]" means all the rows from 0 to 3. and "[1:3]" means all columns from column 1 to column 3.
Now as you may be wondering, even though you have only 3 columns and the numbering starts from 1, it must return 3 columns right? i.e: column 1, column 2, column 3
That is the tricky part of this syntax. The first column is actually column 0. So when you say "[1:3]", you are actually saying give me column 1 and column 2. Which are the last two columns you want. (There actually is no column 3.)
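For the array from the question, this gives:
>>> a[0:3, 1:3]
array([[2, 3],
       [4, 5],
       [5, 6]])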
Now if you don't know how long your matrix is or if you want all the rows, you can just leave that part empty.
i.e.
a[:, 1:3]
The same goes for rows: if you wanted, say, all the columns but just the first row, you would write
a[0:1, :]
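Which, for the example array, gives:
>>> a[0:1, :]
array([[1, 2, 3]])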
Now, the reason the answer above, a[:,1:], works is that "[1:]" for the columns means: give me everything from column 1 to the end of the columns. An empty bound means "until the end" (or, on the left, "from the start").
By now you may realize that anything on either side of the comma is just a subset of the one-dimensional case I described above. For example, if you want to specify your rows using step sizes, you can write
a[::2, 1:]
Which in your case would return
array([[2, 3],
       [5, 6]])
i.e. a[::2, 1:] reads as: give me every other row, starting with the top one, and give me every column from the 2nd onward.
This took me some time to figure out. So pasting it here, just in case it helps someone.