Unexpected behaviour from scipy.sparse.csr_matrix data - python

Something's odd with the data here.
If I create a scipy.sparse.csr_matrix with the data property containing only 0s and 1s, and then ask it to print the data property, sometimes there are 2s in the output (other times not).
You can see this behaviour here:
from scipy.sparse import csr_matrix
import numpy as np
from collections import OrderedDict

# Generate some fake data.
# This makes an OrderedDict of 10 scipy.sparse.csr_matrix objects,
# with 3 rows and 3 columns and binary (0/1) values.
od = OrderedDict()
for i in range(10):
    row = np.random.randint(3, size=3)
    col = np.random.randint(3, size=3)
    data = np.random.randint(2, size=3)
    print('data is: ', data)
    sp_matrix = csr_matrix((data, (row, col)), shape=(3, 3))
    od[i] = sp_matrix

# Print the data in each scipy sparse matrix
for i in range(10):
    print('data stored in sparse matrix: ', od[i].data)
It'll print something like this:
data is: [1 0 1]
data is: [0 0 1]
data is: [0 0 0]
data is: [0 0 0]
data is: [1 1 1]
data is: [0 0 0]
data is: [1 1 0]
data is: [1 0 1]
data is: [0 0 0]
data is: [0 0 1]
data stored in sparse matrix: [1 1 0]
data stored in sparse matrix: [0 0 1]
data stored in sparse matrix: [0 0]
data stored in sparse matrix: [0 0 0]
data stored in sparse matrix: [2 1]
data stored in sparse matrix: [0 0 0]
data stored in sparse matrix: [1 1 0]
data stored in sparse matrix: [1 1 0]
data stored in sparse matrix: [0 0 0]
data stored in sparse matrix: [1 0 0]
Why does the data stored in the sparse matrix not reflect the data originally put there (there were no 2s in the original data)?

I'm assuming that your way of creating the matrix:
sp_matrix = csr_matrix((data, (row, col)), shape=(3, 3))
uses coo_matrix under the hood (I had not found the relevant source at first; see the bottom).
In this case, the docs say (for COO):
By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
Your random-matrix routine does not check for duplicate entries.
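A minimal sketch of the summing behaviour, using hand-picked coordinates (an illustration, not values from the question's random run) in which the position (0, 1) appears twice:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hand-picked illustration: the coordinate (0, 1) appears twice.
row = np.array([0, 0, 1])
col = np.array([1, 1, 2])
data = np.array([1, 1, 1])

m = csr_matrix((data, (row, col)), shape=(3, 3))

# The two entries at (0, 1) are added together during construction,
# so only two values are stored and one of them is 2.
print(m.data)
print(m.toarray())
```

This matches the pattern in the question's output, where an input of [1 1 1] ended up stored as [2 1].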
Edit: OK, I think I found the code.
csr_matrix has no constructor code of its own; it inherits from _cs_matrix.
compressed.py: _cs_matrix
and there:
else:
    if len(arg1) == 2:
        # (data, ij) format
        from .coo import coo_matrix
        other = self.__class__(coo_matrix(arg1, shape=shape))
        self._set_self(other)
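If the summing is unwanted, one option is to generate coordinates that cannot repeat. The sketch below is only one way to do that and is not part of the original answer; the use of rng.choice with replace=False and np.unravel_index is my own assumption about what fits the question's 3x3 setup.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Draw 3 distinct cells out of the 9 cells of a 3x3 matrix, so no
# (row, col) pair can repeat and nothing gets summed.
flat = rng.choice(9, size=3, replace=False)
row, col = np.unravel_index(flat, (3, 3))
data = rng.integers(0, 2, size=3)

m = csr_matrix((data, (row, col)), shape=(3, 3))
print(data, m.data)
```

The stored values may come back in a different order than the input, because CSR orders its entries by row.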

Related

Memory efficient majority voting cell wise for large matrix in Numpy

Given a list of arrays, I would like to extract the most frequent element in every cell.
For example, for 3 arrays
arr 1
0,0,0
0,4,1
0,1,4
arr 2
0,0,0
0,7,1
0,1,1
arr 3
5,0,0
0,4,1
0,1,1
The most frequent element for each cell would be
0 0 0
0 4 1
0 1 1
May I know how to achieve this with NumPy? In the actual case, the list of arrays can be up to 10k in size.
The list of arrays is defined as below
import numpy as np
arr=np.array([[0,0,0],[0,4,1],[0,1,4]])
arr2=np.array([[0,0,0],[0,7,1],[0,1,1]])
arr3=np.array([[5,0,0],[0,4,1],[0,1,1]])
arr = np.stack([arr,arr2,arr3], axis=0)
You can stack the arrays into a large matrix and then use scipy.stats.mode along the axis of interest:
import numpy as np
import scipy.stats

arr1 = [[0,0,0],
        [0,4,1],
        [0,1,4]]
arr2 = [[0,0,0],
        [0,7,1],
        [0,1,1]]
arr3 = [[5,0,0],
        [0,4,1],
        [0,1,1]]
arr = np.stack((arr1, arr2, arr3), axis=0)
# The trailing [0] drops the length-1 axis that older SciPy releases keep.
# Newer releases default to keepdims=False and already return the 2D result,
# so omit the [0] there.
output = scipy.stats.mode(arr, axis=0).mode[0]
print(output)
# [[0 0 0]
#  [0 4 1]
#  [0 1 1]]
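Since the question asks about memory, here is a hedged alternative sketch that avoids SciPy and loops over the distinct values instead. The function name cellwise_mode is hypothetical, and the approach only pays off when the number of distinct values is small; each pass still builds one temporary boolean array the size of the stack.

```python
import numpy as np

def cellwise_mode(stack):
    """Most frequent value per cell along axis 0 (hypothetical helper).

    Ties go to the smallest value, because candidates are visited in
    ascending order and only a strictly larger count replaces the best.
    """
    values = np.unique(stack)
    best_count = np.zeros(stack.shape[1:], dtype=np.int64)
    best_value = np.full(stack.shape[1:], values[0], dtype=stack.dtype)
    for v in values:
        count = (stack == v).sum(axis=0)   # votes for v in every cell
        better = count > best_count
        best_value[better] = v
        best_count[better] = count[better]
    return best_value

arr1 = np.array([[0, 0, 0], [0, 4, 1], [0, 1, 4]])
arr2 = np.array([[0, 0, 0], [0, 7, 1], [0, 1, 1]])
arr3 = np.array([[5, 0, 0], [0, 4, 1], [0, 1, 1]])
result = cellwise_mode(np.stack((arr1, arr2, arr3), axis=0))
print(result)
```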

Remove repeated rows in 2D numpy array, maintaining first instance and ordering

I have a 2-dimensional NumPy array where some rows are not unique, i.e., when I do:
import numpy as np
data.shape #number of rows X columns in data
# (75000, 8)
np.unique(data, axis=0).shape #number of unique rows is fewer than above
# (74801, 8)
Starting with the first row of data, I would like to remove any row that is a duplicate of a previous row, maintaining the original order of the rows. In the above example, the final shape of the new Numpy array should be (74801, 8).
E.g., given the below data array
data = np.array([[1,2,1],[2,2,3],[3,3,2],[2,2,3],[1,1,2],[0,0,0],[3,3,2]])
print(data)
[[1 2 1]
 [2 2 3]
 [3 3 2]
 [2 2 3]
 [1 1 2]
 [0 0 0]
 [3 3 2]]
I'd like to have the unique rows in their original order, i.e.,
[[1 2 1]
 [2 2 3]
 [3 3 2]
 [1 1 2]
 [0 0 0]]
Any tips on an efficient solution would be greatly appreciated!
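Try numpy.unique with the return_index parameter:
data[np.sort(np.unique(data, axis = 0, return_index = True)[1])]
As its name indicates, return_index makes np.unique return a tuple holding the unique rows and the indices of their first occurrences, in that order (that's why there's a [1] at the end). Sorting those indices and indexing data with them restores the original row order.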
You can also use pandas:
import pandas as pd
pd.DataFrame(data).drop_duplicates().values
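As a quick check, the NumPy one-liner can be run on the sample array from the question. The intermediate name first_idx is only introduced here for readability:

```python
import numpy as np

data = np.array([[1, 2, 1], [2, 2, 3], [3, 3, 2], [2, 2, 3],
                 [1, 1, 2], [0, 0, 0], [3, 3, 2]])

# Index of the first occurrence of every unique row
# (returned in the sorted order of the unique rows).
first_idx = np.unique(data, axis=0, return_index=True)[1]

# Sorting the indices puts the surviving rows back in their original order.
deduped = data[np.sort(first_idx)]
print(deduped)
# [[1 2 1]
#  [2 2 3]
#  [3 3 2]
#  [1 1 2]
#  [0 0 0]]
```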

How to unpack a uint32 array with np.unpackbits

I used a piece of code to create a 2D binary valued array to cover all possible scenarios of an event. For the first round, I tested it with 2 members.
Here is my code:
number_of_members = 2
n = number_of_members
values = np.arange(2**n, dtype=np.uint8).reshape(-1, 1)
print('$$$ ===> ', values)
bin_array = np.unpackbits(values, axis=1)[:, -n:]
print('*** ===> ', bin_array)
And the result is this:
$$$ ===> [[0]
[1]
[2]
[3]]
*** ===> [[0 0]
[0 1]
[1 0]
[1 1]]
As you can see, it correctly provided my 2D binary array.
The problem begins when I intended to use number_of_members = 20. If I assign 20 to number_of_members python shows this as result:
$$$ ===> [[ 0]
[ 1]
[ 2]
...
[253]
[254]
[255]]
*** ===> [[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 1]
[0 0 0 ... 0 1 0]
...
[1 1 1 ... 1 0 1]
[1 1 1 ... 1 1 0]
[1 1 1 ... 1 1 1]]
The result has 8 columns, but I expected an array of 32 columns. How can I unpack a uint32 array?
As you noted correctly, np.unpackbits operates only on uint8 arrays. The nice thing is that you can view any type as uint8. Assuming values is created with dtype=np.uint32 (uint8 cannot hold all 2**20 numbers), you can create a uint8 view into your uint32 data like this:
view = values.view(np.uint8)
On my machine, this is little-endian, which makes trimming easier. You can force little-endian order conditionally across all systems:
import sys
if values.dtype.byteorder == '>' or (values.dtype.byteorder == '=' and sys.byteorder == 'big'):
    view = view[:, ::-1]
Now you can unpack the bits. In fact, unpackbits has a nice feature that I personally added, the count parameter. It allows you to make your output be exactly 20 bits long instead of the full 32, without subsetting. Since the output will be mixed big-endian bits and little-endian bytes, I recommend displaying the bits in little-endian order too, and flipping the entire result:
bin_array = np.unpackbits(view, axis=1, count=20, bitorder='little')[:, ::-1]
The result is a (1<<20, 20) array with the exact values you want.
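Putting the steps together, a minimal end-to-end sketch might look like the following. The wrapper name binary_table is hypothetical, and the sketch assumes n is at most 32 and that the array is created in native byte order (which np.arange does by default), so only sys.byteorder needs checking.

```python
import sys
import numpy as np

def binary_table(n):
    """All 2**n bit patterns as rows, most significant bit first."""
    values = np.arange(2**n, dtype=np.uint32).reshape(-1, 1)
    view = values.view(np.uint8)          # shape (2**n, 4): one column per byte
    if sys.byteorder == 'big':
        view = view[:, ::-1]              # force least significant byte first
    # Unpack least significant bit first, keep n bits, then flip the row.
    return np.unpackbits(view, axis=1, count=n, bitorder='little')[:, ::-1]

print(binary_table(2))
print(binary_table(20).shape)
```

The count and bitorder arguments are not available in older NumPy releases, so this needs a reasonably recent version.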

How do I subtract two columns from the same array and put the value in their own single column array with numpy?

Let's say I have a single 3x4 array (3 rows, 4 columns), for example
import numpy as np
data = [[0,5,0,1], [0,5,0,1], [0,5,0,1]]
data = np.array(data)
print(data)
[[0 5 0 1]
[0 5 0 1]
[0 5 0 1]]
and I want to subtract column 4 from column 2 and have the values in their own named 3x1 array, like this
print(subtraction)
[[4]
[4]
[4]]
How would I go about this in NumPy?
result = (data[:, 1] - data[:, 3]).reshape((3, 1))
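An equivalent variant, shown here as a sketch rather than a replacement for the line above, avoids hard-coding the row count: indexing with a one-element list keeps the column axis, so no reshape is needed.

```python
import numpy as np

data = np.array([[0, 5, 0, 1], [0, 5, 0, 1], [0, 5, 0, 1]])

# data[:, [1]] has shape (3, 1), whereas data[:, 1] has shape (3,).
subtraction = data[:, [1]] - data[:, [3]]
print(subtraction)
# [[4]
#  [4]
#  [4]]
```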

How to create selective patches from an image based on the objects identified

I have created a numpy matrix with all elements initialized to zeros as shown:
[[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
...
This is to resemble an image of the screenshot of a webpage which is of the size 1200 X 1000.
I have identified a few rectangular regions of interest for different HTML objects such as Radiobutton, Textbox and dropdown within the screenshot image, and assigned them fixed values like 1, 2 and 3 for the respective object regions in the NumPy matrix created.
So the resultant matrix almost looks like :
[[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
...,
[[1 1 1 1]
[1 1 1 1]
[1 1 1 1]
...,
[0 0 0 0]
[2 2 2 2]
[0 0 0 0]
...,
I now wish to prepare a data set for a convolutional neural network using patches from the screenshot image. To improve the quality of the data supplied to the CNN, I wish to filter the patches and pass on only those that contain the objects detected earlier, i.e. Textbox, Radiobutton etc. (Radiobutton and dropdown selections should be fully inside the patch, and at least 50% of a button's region should be included in the patch). Any ideas how this can be realized in Python?
A very naive approach:
import numpy as np
maxY, maxX = np.shape(theMatrix)
for curY in range(0, maxY):
    for curX in range(0, maxX):
        # print the values of one row side by side
        print(theMatrix[curY, curX], end=" ")
    # end the row
    print(" ")
You can just use the plot function to plot a 2D-array. Take for example:
import numpy as np
import matplotlib.pyplot as pyplot
x = np.random.rand(3, 2)
which will yield us
array([[ 0.53255518, 0.04687357],
[ 0.4600085 , 0.73059902],
[ 0.7153942 , 0.68812506]])
If you call pyplot.plot(x, 'ro'), it will plot the figure given below.
The row numbers are put on the x-axis and the values are plotted on the y-axis. But from the nature of your problem, I suspect you need the column numbers on the x-axis and the values on the y-axis. To do so, you can simply transpose your matrix.
pyplot.plot(x.T,'ro')
pyplot.show()
which now yields (for the same array) the figure given below.
