I'm new to Python, so this is really a two-part question. First, I don't understand what this code means: is DESCR supposed to stand for "description"? And I don't understand what the split parts produce:
datasets = [ds.DESCR.split()[0] for ds in datasets]
clf_name = [str(clf).split('(')[0][:12] for clf in clfs]
Second, when do I use np.ones or np.zeros? I know they generate an array of ones or zeros, but what I mean is: when, specifically in data science, do I need to initialize an array with ones or zeros?
This code creates two lists using list comprehensions.
The ds.DESCR and other expressions here can mean anything, depending on the context.
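To make the mechanics concrete, here is a minimal sketch using stand-in objects; the `MockDataset` and `MockClassifier` classes below are hypothetical mock-ups that only mimic the attributes the comprehensions rely on, not the real scikit-learn classes:

```python
# Mock objects that mimic the attributes the comprehensions use.
class MockDataset:
    def __init__(self, descr):
        self.DESCR = descr  # in scikit-learn datasets, DESCR holds a long description string

class MockClassifier:
    def __repr__(self):
        return "LogisticRegression(C=1.0, max_iter=100)"

datasets = [MockDataset("Iris plants dataset ..."), MockDataset("Wine recognition dataset ...")]
clfs = [MockClassifier()]

# .split() with no argument splits on whitespace; [0] keeps the first word.
names = [ds.DESCR.split()[0] for ds in datasets]
print(names)       # ['Iris', 'Wine']

# str(clf) renders the repr; split('(')[0] keeps the class name; [:12] truncates to 12 chars.
clf_names = [str(clf).split('(')[0][:12] for clf in clfs]
print(clf_names)   # ['LogisticRegr']
```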
As for your second sub-question, I'd advise asking something more specific.
If you need ones, you use np.ones, if you need zeros, you use np.zeros. That's it.
np.zeros is great if, for example, you gradually update your matrix with values: every entry your algorithm does not update stays zero.
In application, this could be a matrix that marks edges in a picture: you create a matrix the size of the picture filled with zeros, then slide a kernel over the picture that detects edges, and for every edge you detect you increase the value in the matrix at that position.
A matrix or vector of ones is useful for certain matrix multiplications. For example, multiplying a vector of shape (n,1) by a ones vector of shape (1,n) expands it to a matrix of shape (n,n). This and similar cases can make a vector or matrix of ones necessary.
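Both patterns above can be sketched in a few lines (the edge positions and numbers here are made up for illustration):

```python
import numpy as np

# Pattern 1: np.zeros as an accumulator -- entries never touched stay zero.
edge_map = np.zeros((5, 5))
detected_edges = [(1, 2), (3, 3), (1, 2)]  # hypothetical detector output
for r, c in detected_edges:
    edge_map[r, c] += 1
print(edge_map[1, 2])   # 2.0 -- that position was hit twice

# Pattern 2: an outer product with a ones vector expands (n,1) x (1,n) to (n,n).
v = np.arange(3).reshape(3, 1)   # shape (3, 1)
ones_row = np.ones((1, 3))       # shape (1, 3)
M = v @ ones_row                 # shape (3, 3): each column is a copy of v
print(M.shape)                   # (3, 3)
```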
Let's say I have a matrix M of size 10x5, and a set of indices ix of size 4x3. I want to do tf.reduce_sum(tf.gather(M,ix),axis=1) which would give me a result of size 4x5. However, to do this, it creates an intermediate gather matrix of size 4x3x5. While at these small sizes this isn't a problem, if these sizes grow large enough, I get an OOM error. However, since I'm simply doing a sum over the 1st dimension, I never need to calculate the full matrix. So my question is, is there a way to calculate the end 4x5 matrix without going through the intermediate 4x3x5 matrix?
I think you can just multiply by a sparse matrix -- I was searching for whether the two are internally equivalent when I landed on your post.
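For reference, here is a sketch of that sparse-matrix trick in NumPy/SciPy, using the shapes from the question (in TensorFlow the same idea would presumably go through tf.sparse.sparse_dense_matmul): summing gathered rows is equivalent to multiplying by a sparse indicator matrix, which never materializes the 4x3x5 intermediate.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
M = rng.standard_normal((10, 5))        # the 10x5 matrix
ix = rng.integers(0, 10, size=(4, 3))   # the 4x3 index set

# Dense reference: gather -> (4, 3, 5) intermediate -> sum over axis 1.
dense_result = M[ix].sum(axis=1)        # shape (4, 5)

# Sparse equivalent: a 4x10 indicator matrix S with S[i, ix[i, j]] += 1,
# so S @ M sums the selected rows without the (4, 3, 5) intermediate.
rows = np.repeat(np.arange(4), 3)
cols = ix.ravel()
data = np.ones(rows.shape)
S = sparse.csr_matrix((data, (rows, cols)), shape=(4, 10))  # duplicates are summed
sparse_result = S @ M                   # shape (4, 5)

print(np.allclose(dense_result, sparse_result))  # True
```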
I would like to perform non-trivial operations on a 2D array using a sliding window in Python.
Let me be more precise with an example. Suppose we have a 10x10 matrix and a 3x3 sliding window starting from the very first element (1,1). I would like to create a new matrix of the same dimensions where each element holds the result of some operation (a percentile, a more complex operation, and so on) over all the elements covered by the window. I can do this with np.lib.stride_tricks.as_strided, but for big arrays it gives a memory error.
Does anyone know a better solution?
Do you mean to create a new matrix with the same values as the window so you can alter its items without modifying the main matrix? If so, you can use the copy method to avoid modifying the main matrix.
Numpy Copy method
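As a side note on the original memory problem: one way to compute a windowed statistic with a same-shape output, without building the huge strided intermediate yourself, is scipy.ndimage.generic_filter, which applies a function to each window in turn. A minimal sketch (the array and window size are from the question; the percentile choice is just an example):

```python
import numpy as np
from scipy import ndimage

a = np.arange(100, dtype=float).reshape(10, 10)

# generic_filter applies the function to each 3x3 window (passed in flattened),
# producing an output of the same shape; edges are handled by mode='reflect'.
result = ndimage.generic_filter(a, lambda w: np.percentile(w, 50), size=3)
print(result.shape)   # (10, 10)
```

It is slow for very large arrays (one Python call per window), but it avoids the memory blow-up of as_strided-based approaches.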
I am trying to generate a few very large arrays, and at least one is ending up being singular, which is made obvious by this familiar error message:
File "C:\Anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
Of course I do not want my array to be singular, but I am more interested in determining WHY my array is singular. What I mean by this is that I would like to have a way to answer the following questions without manually checking each entry:
Is the array square? (I believe a non-square array raises a separate error message, which is convenient, but I'll include it as a singularity property anyway.)
Are any rows populated only by zeros?
Are any columns populated only by zeros?
Are any rows not linearly independent of all other rows?
For relatively small arrays, the first two conditions are easily answered by visual inspection. However, because my arrays are substantially large, I do not want to have to go in and manually check each array element to see if any of those conditions are met.
I tried pulling up the linalg.py script to see how it determines that a matrix is singular, but I could not work it out.
I also tried searching for info online, and nothing seemed to help. Most topics only answered some form of the following questions: 1) "I want Python to tell me if my matrix is singular" or 2) "Why is Python giving me this error message?" Because I already know that my matrices are singular, neither question is of importance to me.
Again, I am not looking for an answer along the lines of "Oh, well this particular matrix is singular because . . .". I am looking for a method I can apply immediately to ANY singular matrix (especially a large one) to determine what is causing the singularity.
Is there a built-in Python function that does this, or is there some other relatively simple way to do this before I try to create a function that will do this for me?
Singular matrices have at least one eigenvalue equal to zero. You can create a diagonalizable singular matrix by starting from its eigenvalue decomposition:
A = V D V^{-1}
D is the diagonal matrix of eigenvalues. So create any invertible matrix V and a diagonal matrix D with at least one zero on the diagonal, and the resulting A will be singular.
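A small sketch of that construction (V here is an arbitrary random matrix, which is almost surely invertible; the eigenvalues are made up, with one zero to force singularity):

```python
import numpy as np

rng = np.random.default_rng(42)
V = rng.standard_normal((4, 4))       # arbitrary, almost surely invertible
D = np.diag([3.0, 1.0, 0.5, 0.0])     # one zero eigenvalue -> singular

A = V @ D @ np.linalg.inv(V)          # A = V D V^{-1}

print(np.linalg.matrix_rank(A))       # 3, i.e. rank-deficient
print(abs(np.linalg.det(A)) < 1e-9)   # True: determinant is (numerically) zero
```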
The traditional way of checking is by computing an SVD. This is what the function numpy.linalg.matrix_rank uses to compute the rank, and you can then check if matrix_rank(M) == M.shape[0] (assuming a square matrix).
For more information, check out this excellent answer to a similar question for Matlab users.
The rank of the matrix will tell you how many rows aren't zero or linear combinations of other rows, but not which ones specifically. It's a relatively fast operation, so it might be useful as a first-pass check.
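The specific checks listed in the question can each be sketched in a line or two (the matrix below is a hypothetical example built to trip all of them):

```python
import numpy as np

M = np.array([[1., 2., 3.],
              [0., 0., 0.],    # an all-zero row
              [2., 4., 6.]])   # 2x the first row

print(M.shape[0] == M.shape[1])      # square? -> True
print(np.where(~M.any(axis=1))[0])   # indices of all-zero rows -> [1]
print(np.where(~M.any(axis=0))[0])   # indices of all-zero columns -> []

# rank < n means some rows are linearly dependent (here row 2 = 2 * row 0)
print(np.linalg.matrix_rank(M))      # 1
```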
For a 2D array, array.cumsum(0).cumsum(1) gives the integral image of the array.
What happens if I compute array.cumsum(0).cumsum(1).cumsum(2) over a 3D array?
Do I get a 3D extension of the integral image, i.e., an integral volume over the array?
It's hard to visualize what happens in the 3D case.
I have gone through this discussion.
3D variant for summed area table (SAT)
This gives a recursive way to compute the integral volume. If I instead use cumsum along the three axes, will it give me the same thing?
Will it be more efficient than the recursive method?
Yes, the formula you give, array.cumsum(0).cumsum(1).cumsum(2), will work.
What the formula does is compute partial sums so that every element is summed exactly once: no element is skipped and no element is counted twice. Convincing yourself of this (is any element skipped or counted twice?) is a good way to verify that it works. And also run a small test:
import numpy as np
x = np.ones((20, 20, 20)).cumsum(0).cumsum(1).cumsum(2)
print(x[2, 6, 10])  # 231.0
print(3 * 7 * 11)   # 231
Of course, with all ones there could be two errors that cancel each other out, but this wouldn't happen everywhere, so it's a reasonable test.
As for efficiency, I would guess the single-pass recursive approach is probably faster, but not by a lot. The above could also be sped up by using an output array, e.g., cumsum(n, out=temp), as otherwise three intermediate arrays are created for this calculation. The best way to know is to test (but only if you need to).
I need to form a 2D matrix with total size 2,886 x 2,003,817. I tried using numpy.zeros to make a 2D zero matrix and then calculate and assign each element (most of them are zero, so I only need to replace a few).
But when I try numpy.zeros to initialize my matrix, I get a memory error:
C = numpy.zeros((2886, 2003817))
MemoryError
I also tried to form the matrix without initialization: I calculate the elements of each row in each iteration of my algorithm and then do
C = numpy.concatenate((C, [A]), axis=0)
where C is my final matrix and A is the row calculated in the current iteration. But this method takes a lot of time; I am guessing it is because of numpy.concatenate(?)
Could you please let me know if there is a way to avoid the memory error when initializing my matrix, or if there is a better method to form a matrix of this size?
Thanks,
Amir
If your data has a lot of zeros in it, you should use a scipy.sparse matrix.
It is a special data structure designed to save memory for matrices that have a lot of zeros. However, if your matrix is not that sparse, sparse matrices start to take up more memory than dense ones. There are many kinds of sparse matrices, and each is efficient at some operations and inefficient at others, so be careful with what you choose.
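A sketch of how the rows could be filled in (the positions and values below are made up for illustration): lil_matrix is efficient for incremental assignment, and converting to csr_matrix afterwards gives fast arithmetic.

```python
import numpy as np
from scipy import sparse

n_rows, n_cols = 2886, 2003817
C = sparse.lil_matrix((n_rows, n_cols))   # takes almost no memory up front

# Fill in the few nonzero entries per row as the algorithm produces them.
for i in range(3):          # only a few rows for the demo
    C[i, 10 * i] = 1.0      # hypothetical nonzero positions/values
    C[i, 10 * i + 5] = 2.5

C = C.tocsr()   # convert to CSR for fast arithmetic once filled
print(C.nnz)    # 6 nonzeros stored, instead of ~5.8 billion dense entries
```

A dense float64 array of this shape would need roughly 46 GB, which explains the MemoryError; the sparse version stores only the nonzero entries.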