How to efficiently build a large sparse matrix in Python?

1. I am trying to build a NumPy array with shape (6962341, 268148) and dtype np.uint8.
2. My data consists of lists such as [x1,x2,x3,x4], [x2,x1], [x4,x5,x3], ...
3. For every pair within a list I want to increment the matching cell: array[x1,x2] += 1, array[x1,x3] += 1, array[x1,x4] += 1, array[x2,x3] += 1, ...
4. So I have tried a function with the following structure:
import numpy as np
from itertools import combinations
# row_size, col_size and data are defined elsewhere
base_array = np.zeros((row_size, col_size), dtype=np.uint8)
for each_list in data:
    for x, y in combinations(each_list, 2):
        # always write into the upper triangle
        if x > y:
            base_array[y, x] += 1
        else:
            base_array[x, y] += 1
This basically computes the upper triangle of a matrix, and I will only use the upper-triangle values. You can think of it as building the base matrix A for a co-occurrence matrix. But this function is too slow, and I think it should be possible to make it faster.
What should I do?

Assuming your data are integers (since they represent rows and columns), or that you can map x1, x2, ... to the integers 1, 2, ..., here is a fast solution:
import numpy as np
from itertools import combinations
# list of pairwise combinations in your data
comb_list = []
for each_list in data:
    comb_list += list(combinations(each_list, 2))
# convert the 1-based values to 0-based indices (NumPy indexing is 0-based)
comb_list = np.array(comb_list) - 1
# convert each (row, col) pair to a flat index
flat = np.ravel_multi_index((comb_list[:, 0], comb_list[:, 1]), (row_size, col_size))
# count how often each flat index occurs using np.bincount
base_array = np.bincount(flat, minlength=row_size * col_size).reshape((row_size, col_size)).astype(np.uint8)
Sample data:
[[1, 2, 3, 4], [2, 1], [4, 5, 3, 4]]
Corresponding output:
[[0 1 1 1 0]
 [1 0 1 1 0]
 [0 0 0 2 0]
 [0 0 1 1 1]
 [0 0 1 1 0]]
EDIT: in response to the explanation in the comments:
data = [[1, 2, 3, 4], [2, 1], [4, 5, 3, 4]]
# data is ragged, so take the maximum with plain Python rather than np.amax
base_array = np.zeros((len(data), max(max(row) for row in data)), dtype=np.uint8)
for i, each_list in enumerate(data):
    for j in each_list:
        base_array[i, j - 1] = 1
Output:
[[1 1 1 1 0]
 [1 1 0 0 0]
 [0 0 1 1 1]]
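For a matrix of the size mentioned in the question, a dense array (and a bin count over row_size * col_size cells) may not fit in memory. A minimal sketch of one alternative, which is not taken from the answer above: count only the pairs that actually occur and keep them as (row, col, count) triplets. The function name `cooccurrence_triplets` is hypothetical, and the sketch assumes the values are already 0-based integer indices.

```python
from itertools import combinations

import numpy as np


def cooccurrence_triplets(data):
    """Count upper-triangle co-occurrences without allocating a dense matrix.

    Assumes every value in `data` is already a 0-based integer index.
    """
    pairs = [
        (min(x, y), max(x, y))  # order each pair so that row <= col
        for each_list in data
        for x, y in combinations(each_list, 2)
    ]
    if not pairs:
        empty = np.empty(0, dtype=np.int64)
        return empty, empty, empty
    pairs = np.asarray(pairs, dtype=np.int64)
    # unique (row, col) pairs and how many times each one occurs
    unique_pairs, counts = np.unique(pairs, axis=0, return_counts=True)
    return unique_pairs[:, 0], unique_pairs[:, 1], counts


rows, cols, counts = cooccurrence_triplets([[0, 1, 2, 3], [1, 0], [3, 4, 2]])
for r, c, n in zip(rows, cols, counts):
    print(r, c, n)
```

If SciPy is available, triplets like these can be handed to scipy.sparse.coo_matrix((counts, (rows, cols)), shape=(row_size, col_size)) to obtain an actual sparse matrix.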

Related

replacing values with zeros

I have a NumPy array, and I want to replace all values with zeros except those in a certain range of indices.
1
2
3
4
5
I tried:
import numpy as np
data = np.loadtxt('data.txt')
print(data)
Expected output:
0
0
3
0
0

You can traverse the array with a for loop and check whether each element is in a list of values to keep:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
nums = [3]
for i in range(len(a)):
    # zero out every element that is not in the list of values to keep
    if a[i] not in nums:
        a[i] = 0
print(a)
Output:
[0 0 3 0 0]

Since you're working with a NumPy array, use vectorized methods.
Here, np.isin forms a boolean mask for the replacement:
import numpy as np
data = np.array([1, 2, 3, 4, 5])
keep = [3]
data[~np.isin(data, keep)] = 0
data
# array([0, 0, 3, 0, 0])
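The question asks about keeping a range of indices, whereas both answers above select by value. A minimal sketch of the index-based variant, assuming the range to keep is the half-open slice start:stop (both names are hypothetical):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
start, stop = 2, 3  # assumed range: keep only index 2

mask = np.zeros(data.shape, dtype=bool)
mask[start:stop] = True            # True where the value should survive
result = np.where(mask, data, 0)   # zero out everything else; data itself is unchanged
print(result)                      # [0 0 3 0 0]
```

Using np.where returns a new array, which may be preferable when the loaded data is still needed afterwards.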

Turn 2D matrix with indexes (tuples) to 2D Boolean matrix (in different shape) - NumPy

I'm looking for an efficient way to turn an index matrix into a Boolean matrix without a loop (with NumPy).
The index matrix is built from tuples that represent indices. From it I need to build a Boolean matrix (of a different, known size) that is 1 at every index listed in the index matrix and 0 everywhere else. For example, if x is an array with shape (5, 3, 2):
x = np.array([[[0, 0], [0, 1], [0, 3]],
              [[1, 0], [1, 3], [1, 4]],
              [[2, 2], [2, 3], [2, 4]],
              [[3, 1], [3, 3], [3, 4]],
              [[4, 2], [4, 3], [4, 4]]])
the desired output has shape (5, 5):
[[1 1 0 1 0]
 [1 0 0 1 1]
 [0 0 1 1 1]
 [0 1 0 1 1]
 [0 0 1 1 1]]
In the first row the given indices are (0,0), (0,1), (0,3), so the first row of the Boolean matrix is 11010 (1 where an index exists and 0 otherwise).
In the next row the indices are (1,0), (1,3), (1,4), so that row of the Boolean matrix is 10011.
And so on.
I wrote a function that does this with loops (attached), but it is too slow. I'm looking for a much more efficient way with NumPy.
Thanks to everyone who helps!

The row and column ids can be taken from x by indexing. Then we can create a zero-filled NumPy array of the desired shape, where the number of columns is taken from the maximum value in x and the number of rows is the same as for x:
row_ids = x[:, :, 0]
# [[0 0 0]
#  [1 1 1]
#  [2 2 2]
#  [3 3 3]
#  [4 4 4]]
cols_ids = x[:, :, 1]
# [[0 1 3]
#  [0 3 4]
#  [2 3 4]
#  [1 3 4]
#  [2 3 4]]
B = np.zeros((x.shape[0], x.max() + 1), dtype=np.int64)
# [[0 0 0 0 0]
#  [0 0 0 0 0]
#  [0 0 0 0 0]
#  [0 0 0 0 0]
#  [0 0 0 0 0]]
Now we can fill the B array with 1 using indexing:
B[row_ids, cols_ids] = 1
# [[1 1 0 1 0]
#  [1 0 0 1 1]
#  [0 0 1 1 1]
#  [0 1 0 1 1]
#  [0 0 1 1 1]]
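Putting these steps together gives a minimal, self-contained sketch. The explicit `shape` argument and the function name `indices_to_bool` are assumptions added here: the question says the output size is known, so passing it in avoids inferring it from x.max().

```python
import numpy as np


def indices_to_bool(x, shape):
    """Return a boolean array of `shape` that is True at every (row, col) pair in `x`.

    `x` has shape (..., 2); the last axis holds (row, col).
    """
    out = np.zeros(shape, dtype=bool)
    out[x[..., 0], x[..., 1]] = True  # fancy indexing sets all listed cells at once
    return out


x = np.array([[[0, 0], [0, 1], [0, 3]],
              [[1, 0], [1, 3], [1, 4]],
              [[2, 2], [2, 3], [2, 4]],
              [[3, 1], [3, 3], [3, 4]],
              [[4, 2], [4, 3], [4, 4]]])
B = indices_to_bool(x, (5, 5))
print(B.astype(int))
```

The result has a true boolean dtype; astype(int) is used only to print it as 0s and 1s.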

How do I extract a 2D NumPy sub-array from a 2D NumPy array based on patterns?

I have a 2D NumPy array which looks like this:
Array=
[
[0,0,0,0,0,0,0,2,2,2],
[0,0,0,0,0,0,0,2,2,2],
[0,0,1,1,1,0,0,2,2,2],
[0,0,1,1,1,0,0,2,2,2],
[0,0,1,1,1,0,0,1,1,1],
[0,0,0,0,0,0,0,1,1,1]
]
I need to display the arrays of non-zero elements as:
Array1:
[
[1,1,1],
[1,1,1],
[1,1,1]
]
Array2:
[
[2,2,2],
[2,2,2],
[2,2,2],
[2,2,2]
]
Array3:
[
[1,1,1],
[1,1,1]
]
Could someone please help me with the logic I could use to achieve the output above? I can't use fixed indices (like array[a:b, c:d]), since the logic I create should work for any NumPy array with a similar pattern.

This uses scipy.ndimage.label to recursively identify disconnected sub-arrays.
import numpy as np
from scipy.ndimage import label
array = np.array(
[[0,0,0,0,0,0,0,2,2,2,3,3,3],
[0,0,0,0,0,0,0,2,2,2,0,0,1],
[0,0,1,1,1,0,0,2,2,2,0,2,1],
[0,0,1,1,1,0,0,2,2,2,0,2,0],
[0,0,1,1,1,0,0,1,1,1,0,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0,0]])
# initialize list to collect sub-arrays
arr_list = []
def append_subarrays(arr, val, val_0):
    '''
    arr : 2D array
    val : the value used for filtering
    val_0 : the original value, which we want to preserve
    '''
    # remove everything that's not the current val
    arr[arr != val] = 0
    if 0 in arr:  # <-- not a single rectangle yet
        # get relevant indices as well as their minima and maxima
        x_ind, y_ind = np.where(arr != 0)
        min_x, max_x, min_y, max_y = min(x_ind), max(x_ind) + 1, min(y_ind), max(y_ind) + 1
        # cut subarray (everything corresponding to val)
        arr = arr[min_x:max_x, min_y:max_y]
        # use the label function to assign different values to disconnected regions
        labeled_arr = label(arr)[0]
        # recursively apply append_subarrays to each disconnected region
        for sub_val in np.unique(labeled_arr[labeled_arr != 0]):
            append_subarrays(labeled_arr.copy(), sub_val, val_0)
    else:  # <-- we only have a single rectangle left ==> append
        arr_list.append(arr * val_0)
for i in np.unique(array[array > 0]):
    append_subarrays(array.copy(), i, i)
for arr in arr_list:
    print(arr, end='\n'*2)
Output (note: modified example array):
[[1]
 [1]]
[[1 1 1]
 [1 1 1]
 [1 1 1]]
[[1 1 1]
 [1 1 1]]
[[2 2 2]
 [2 2 2]
 [2 2 2]
 [2 2 2]]
[[2]
 [2]]
[[3 3 3]]

This sounds like a floodfill problem, so skimage.measure.label is a good approach:
import numpy as np
from skimage.measure import label
Array = np.array([[0,0,0,0,0,0,0,2,2,2],
                  [0,0,0,0,0,0,0,2,2,2],
                  [0,0,1,1,1,0,0,2,2,2],
                  [0,0,1,1,1,0,0,2,2,2],
                  [0,0,1,1,1,0,0,1,1,1],
                  [0,0,0,0,0,0,0,1,1,1]])
labels = label(Array, connectivity=1)
# use a loop variable that does not shadow the imported label function
for lbl in range(1, labels.max() + 1):
    xs, ys = np.where(labels == lbl)
    shape = (len(np.unique(xs)), len(np.unique(ys)))
    print(Array[xs, ys].reshape(shape))
Output:
[[2 2 2]
 [2 2 2]
 [2 2 2]
 [2 2 2]]
[[1 1 1]
 [1 1 1]
 [1 1 1]]
[[1 1 1]
 [1 1 1]]

# start and end indices of the current sub-array
startRowIndex = 0
endRowIndex = 0
startColumnIndex = 0
endColumnIndex = 0
tmpI = 0  # temporary indices for scanning inside the i, j loops
tmpJ = 0
value = 0  # the number we are currently looking for in the array
for i in range(array.shape[0]):  # array.shape[0] is the number of rows, shape[1] the number of columns
    for j in range(array[i].size):  # for all elements in a row
        if array[i, j] != 0:  # found the top-left corner of a new block
            startRowIndex = i
            startColumnIndex = j
            tmpI = i
            tmpJ = j  # the loop indices must not be changed, so use temporary ones
            value = array[i, j]  # the number that fills this sub-array (for example 2)
            while array[tmpI, tmpJ] != 0 and array[tmpI, tmpJ] == value:  # scan right along the row
                tmpJ += 1
                if tmpJ == array.shape[1]:  # reached the last column
                    break
            # tmpJ is now one past the last matching column; a slice a[start:stop]
            # covers [start, stop), so it can be used directly as the end index
            endColumnIndex = tmpJ
            tmpI = i
            tmpJ = j
            while array[tmpI, tmpJ] != 0 and array[tmpI, tmpJ] == value:  # scan down the column
                tmpI += 1
                if tmpI == array.shape[0]:  # reached the last row
                    break
            # likewise, tmpI is now one past the last matching row
            endRowIndex = tmpI
            print(array[startRowIndex:endRowIndex, startColumnIndex:endColumnIndex])
            # zero out the block that was just printed so it is not found again
            array[startRowIndex:endRowIndex, startColumnIndex:endColumnIndex] = 0
This one is kind of brute-force, but it works the way you want. It doesn't use any external library other than NumPy. Note that it overwrites the input array with zeros as it goes.

Here's my pure Python (no NumPy) solution. I took advantage of the fact that the contiguous regions are always rectangular.
The algorithm scans from top-left to bottom-right; when it finds the corner of a region, it scans to find the top-right and bottom-left corners. The dictionary skip is populated so that later scans can skip horizontally past any rectangle which has already been found.
The time complexity is O(nm) for a grid with n rows and m columns, which is optimal for this problem.
def find_rectangles(grid):
    width, height = len(grid[0]), len(grid)
    skip = dict()
    for y in range(height):
        x = 0
        while x < width:
            if (x, y) in skip:
                x = skip[x, y]
            elif not grid[y][x]:
                x += 1
            else:
                v = grid[y][x]
                x2 = x + 1
                while x2 < width and grid[y][x2] == v:
                    x2 += 1
                y2 = y + 1
                while y2 < height and grid[y2][x] == v:
                    skip[x, y2] = x2
                    y2 += 1
                yield [ row[x:x2] for row in grid[y:y2] ]
                x = x2
Example:
>>> for r in find_rectangles(grid1): # example from the question
... print(r)
...
[[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
[[1, 1, 1], [1, 1, 1], [1, 1, 1]]
[[1, 1, 1], [1, 1, 1]]
>>> for r in find_rectangles(grid2): # example from mcsoini's answer
... print(r)
...
[[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
[[3, 3, 3]]
[[1], [1]]
[[1, 1, 1], [1, 1, 1], [1, 1, 1]]
[[2], [2]]
[[1, 1, 1], [1, 1, 1]]

We can do this using scipy.ndimage.label and scipy.ndimage.find_objects:
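import numpy as np
from scipy.ndimage import label, find_objects
Array = np.array(Array)
# outer find_objects: bounding boxes of the connected non-zero regions;
# inner find_objects: bounding boxes per value inside each region
[Array[j][i] for j in find_objects(*label(Array)) for i in find_objects(Array[j])]
# [array([[1, 1, 1],
#         [1, 1, 1]]), array([[2, 2, 2],
#         [2, 2, 2],
#         [2, 2, 2],
#         [2, 2, 2]]), array([[1, 1, 1],
#         [1, 1, 1],
#         [1, 1, 1]])]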
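The inner find_objects call in the one-liner treats the raw array values as labels, and find_objects returns None for any label number that is missing, so a region that does not contain every value from 1 up to its maximum would need extra handling. A more explicit sketch, under the assumption that each block is a 4-connected region of a single non-zero value, labels one value at a time. The function name `extract_blocks` is hypothetical.

```python
import numpy as np
from scipy.ndimage import label, find_objects


def extract_blocks(arr):
    """Return the bounding-box sub-array of every connected region of equal non-zero value."""
    arr = np.asarray(arr)
    blocks = []
    for value in np.unique(arr[arr != 0]):
        # label the mask of this value only; the default structure is 4-connectivity in 2D
        labeled, _ = label(arr == value)
        for region in find_objects(labeled):
            blocks.append(arr[region])
    return blocks


example = np.array([[0, 0, 0, 0, 0, 0, 0, 2, 2, 2],
                    [0, 0, 0, 0, 0, 0, 0, 2, 2, 2],
                    [0, 0, 1, 1, 1, 0, 0, 2, 2, 2],
                    [0, 0, 1, 1, 1, 0, 0, 2, 2, 2],
                    [0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
                    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])
for block in extract_blocks(example):
    print(block)
```

Blocks come out grouped by value (all regions of 1 first, then 2), which differs from the ordering of the one-liner above.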

Sorting a list using argsort in Numpy?

I have a list in Python whose first elements are [[29.046875, 1], [33.65625, 1], [18.359375, 1], [11.296875, 1], [36.671875, 1], [23.578125, 1],.........,[34.5625, 1]]
The list above is named listNumber. I'm trying to use numpy.argsort to sort it by the float elements:
listNumber = np.array(listNumber)
print(np.argsort(listNumber))
But this gives me the following, and I'm not sure why:
[[1 0]
 [1 0]
 [1 0]
 ...
 [1 0]
 [1 0]
 [1 0]]
Why does it return this? Is there another way to approach this?

OK, so I think there are two things going on here:
1. Your list is a list of lists.
2. According to the documentation, the argsort function "returns the indices that would sort an array."
So what happens is that the function goes through each item of the list, which is itself a list (argsort sorts along the last axis by default). Say index 0 is:
[29.046875, 1]
It then says: this is another list, so let me sort it and return the index each element would come from in sorted order:
[29.046875, 1] -> [1, 0]
because 1 comes before 29 when sorted in ascending order.
It does this for every nested list and then gives you a final array containing all these 1's and 0's.
This answers the first question. Another user was able to answer the second :)
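To make the behaviour concrete, here is a small sketch using the first four rows from the question. It shows that np.argsort sorts along the last axis by default (axis=-1), and how argsort on the first column alone can be used to reorder whole rows. The variable names are illustrative only.

```python
import numpy as np

list_number = np.array([[29.046875, 1], [33.65625, 1], [18.359375, 1], [11.296875, 1]])

# Default axis=-1: each row [float, 1] is sorted on its own, so every row gives [1, 0]
per_row = np.argsort(list_number)
print(per_row)

# Sort whole rows by their first column instead
order = np.argsort(list_number[:, 0])
sorted_rows = list_number[order]
print(order)        # [3 2 0 1]
print(sorted_rows)
```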

You must set the axis, like this:
import numpy as np
l = [[29.046875, 1], [33.65625, 1], [18.359375, 1], [11.296875, 1], [36.671875, 1], [23.578125, 1],[34.5625, 1]]
l = np.argsort(l, axis=0) # sorts along first axis (down); returns indices, not the sorted values
print(l)
Output:
[[3 0]
 [2 1]
 [5 2]
 [0 3]
 [1 4]
 [6 5]
 [4 6]]

Try this (listNumber must already be a NumPy array):
sortedList = listNumber[listNumber[:,0].argsort(axis=0)]
print(sortedList)

I don't know why people prefer ready-made functions over writing their own algorithm. Anyway, you are using argsort the wrong way: argsort returns an array containing the INDICES of your elements. Here are two examples:
Code 1:
import numpy as geek
# input array
in_arr = geek.array([2, 0, 1, 5, 4, 1, 9])
print("Input unsorted array : ", in_arr)
out_arr = geek.argsort(in_arr)
print("Output sorted array indices : ", out_arr)
print("Output sorted array : ", in_arr[out_arr])
Output:
Input unsorted array : [2 0 1 5 4 1 9]
Output sorted array indices : [1 2 5 0 4 3 6]
Output sorted array : [0 1 1 2 4 5 9]
Code 2:
# Python program demonstrating
# argsort() along different axes
import numpy as geek
# input 2d array
in_arr = geek.array([[2, 0, 1], [5, 4, 3]])
print("Input array : ", in_arr)
# output sorted array indices
out_arr1 = geek.argsort(in_arr, kind='mergesort', axis=0)
print("Output sorted array indices along axis 0: ", out_arr1)
out_arr2 = geek.argsort(in_arr, kind='heapsort', axis=1)
print("Output sorted array indices along axis 1: ", out_arr2)
Output:
Input array : [[2 0 1]
 [5 4 3]]
Output sorted array indices along axis 0: [[0 0 0]
 [1 1 1]]
Output sorted array indices along axis 1: [[1 2 0]
 [2 1 0]]
I am assuming that your data is stored in listnumber:
import numpy as np
listnumber = np.asarray(listnumber)  # column slicing below needs an array, not a list
new_listnumber = listnumber[:, 0]
index_array = np.argsort(new_listnumber, axis=0)
New_val = listnumber[index_array]
print(New_val)

Vector pair ordering in numpy

I am looking to order a pair of vectors by the first unequal element. Example:
[0, 1, 2] < [0, 2, 1]
because 0 == 0, so we look at the next index, where 1 < 2.
Is there a simple way to do this in NumPy? Right now I am using this to find the difference between the "greater" and the "lesser" vector, which led to my first attempt:
(x - y) * np.sign((x - y)[np.nonzero(x - y)[0][0]])

You can use tuples: (0,1,2) < (0,2,1). So a function like
def cmp(v1, v2):
    return tuple(v1) < tuple(v2)
should suffice.

np.lexsort is probably the most efficient way to do this:
import numpy as np
# an (N, k) array of N k-dimensional vectors
data = np.array([[0, 2, 3], [0, 1, 2], [0, 1, 3], [0, 2, 1]])
print(data)
# [[0 2 3]
#  [0 1 2]
#  [0 1 3]
#  [0 2 1]]
# lexsort assumes (k, N), so transpose data first. We also need to reverse the
# order of the columns, since lexsort sorts by the last column first
idx = np.lexsort(data[:, ::-1].T)
print(data[idx])
# [[0 1 2]
#  [0 1 3]
#  [0 2 1]
#  [0 2 3]]

I would bet it will be way faster to do a simple loop through both arrays:
def comparison(a, b):
    for i in range(len(a)):  # assuming they have the same length
        if a[i] < b[i]:
            return True
        elif a[i] > b[i]:
            return False
    return False
For the 3-element vectors you posted, the iteration is 7x faster on my machine. For long enough stretches of identical initial elements the iteration will become slower, but make sure that is the case before you go vectorizing.
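For comparison, a vectorized version of the same pairwise test can be sketched with np.flatnonzero, which locates the first position where the two vectors differ. The function name `lex_less` is hypothetical, and the sketch assumes both vectors have the same length.

```python
import numpy as np


def lex_less(x, y):
    """Return True if x orders before y at the first position where they differ."""
    x, y = np.asarray(x), np.asarray(y)
    differing = np.flatnonzero(x != y)  # indices where the vectors disagree
    if differing.size == 0:
        return False                    # identical vectors: neither is smaller
    first = differing[0]
    return bool(x[first] < y[first])


print(lex_less([0, 1, 2], [0, 2, 1]))  # True
print(lex_less([0, 2, 1], [0, 1, 2]))  # False
print(lex_less([0, 1, 2], [0, 1, 2]))  # False
```

Whether this beats the plain loop depends on vector length, so it is worth timing both on representative data.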
