Sorting a sparse matrix row-by-row - Python

I have a sparse matrix. I need to sort this matrix row-by-row and create another [sparse] matrix.
Code may explain it better:
# For the `rand` function, you need a newer version of scipy.
from scipy.sparse import *
m = rand(6,6, density=0.6)
d = m.getrow(0)
print d
Output1
(0, 5) 0.874881629788
(0, 4) 0.352559852239
(0, 2) 0.504791645463
(0, 1) 0.885898140175
I have this matrix m. I want to create a new matrix that is a sorted version of m, with each row sorted by value in descending order. The new matrix's 0th row looks like this:
new_d = new_m.getrow(0)
print new_d
Output2
(0, 1) 0.885898140175
(0, 5) 0.874881629788
(0, 2) 0.504791645463
(0, 4) 0.352559852239
So I can see which columns hold the larger values:
print new_d.indices
Output3
array([1, 5, 2, 4])
Of course, every row should be sorted independently, as above.
I have one solution for this problem, but it is not elegant.

If you're willing to ignore the zero-value elements of the matrix, the code below should work. It is also much faster than implementations that use the getrow method, which is rather slow.
from itertools import izip

def sort_coo(m):
    tuples = izip(m.row, m.col, m.data)
    return sorted(tuples, key=lambda x: (x[0], x[2]))
For example:
>>> from numpy.random import rand
>>> from scipy.sparse import coo_matrix
>>>
>>> d = rand(10, 20)
>>> d[d > .05] = 0
>>> s = coo_matrix(d)
>>> sort_coo(s)
[(0, 2, 0.004775589084940246),
(3, 12, 0.029941507166614145),
(5, 19, 0.015030386789436245),
(7, 0, 0.0075044957259399192),
(8, 3, 0.047994403933129481),
(8, 5, 0.049401058471327031),
(9, 15, 0.040011608000125043),
(9, 8, 0.048541825332137023)]
Depending on your needs, you may want to tweak the sort keys in the lambda or further process the output. If you want everything in a row-indexed dictionary, you could do:
from collections import defaultdict

sorted_rows = defaultdict(list)
for i in sort_coo(m):
    sorted_rows[i[0]].append((i[1], i[2]))
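For instance, to match the question's descending order within each row, one possible tweak (a sketch, not part of the original answer) is to negate the value in the sort key:
def sort_coo_desc(m):
    # Sort by row ascending, then by value descending within each row.
    return sorted(izip(m.row, m.col, m.data), key=lambda x: (x[0], -x[2]))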

My bad solution is like this:
from scipy.sparse import coo_matrix
import numpy as np

a = []
for i in xrange(m.shape[0]):  # assume m is a square matrix
    d = m.getrow(i)
    n = len(d.indices)
    s = zip([i]*n, d.indices, d.data)
    sorted_s = sorted(s, key=lambda v: v[2], reverse=True)
    a.extend(sorted_s)
a = np.array(a)
# row/col must be integer indices; np.array(a) above is all-float
new_m = coo_matrix((a[:,2], (a[:,0].astype(int), a[:,1].astype(int))), shape=m.shape)
There may be some simple mistakes above because I have not checked it yet, but the idea should be intuitive. Is there any good solution?
Edit
Creating the new matrix this way may be useless, because calling the getrow method breaks the order again; only coo_matrix.col keeps the sorted order.
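A small sketch of this caveat (assuming new_m was built by the code above):
print new_m.col[:4]            # e.g. [1 5 2 4] -- insertion (sorted-by-value) order
print new_m.getrow(0).indices  # e.g. [1 2 4 5] -- re-sorted by column index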
Another Solution
This one is not an exact solution, but it may be helpful:
def sortSparseMatrix(m, rev=True, only_indices=True):
    """Sort a sparse matrix and return a column-index dictionary."""
    col_dict = dict()
    for i in xrange(m.shape[0]):  # assume m is a square matrix
        d = m.getrow(i)
        s = zip(d.indices, d.data)
        sorted_s = sorted(s, key=lambda v: v[1], reverse=rev)
        if only_indices:
            col_dict[i] = [element[0] for element in sorted_s]
        else:
            col_dict[i] = sorted_s
    return col_dict
>>> print sortSparseMatrix(m)
{0: [5, 1, 0],
1: [1, 3, 5],
2: [1, 2, 3, 4],
3: [1, 5, 2, 4],
4: [0, 3, 5, 1],
5: [3, 4, 2]}

Related

How to get index of 2d list after finding maximum value from list items

I have a 2D list, where I find the maximum value of each column by comparing a11 with b11 and c11, and so on. For example, given
[[2,3,4,5],[3,4,1,6],[7,1,2,10]]
the output is:
[[7,4,4,10]]
Now I want the index of each maximum value, as in: [[c11,b12,a13,c14]]
My original code is:
import glob
import cv2
import numpy as np

img = [cv2.imread(file, 0) for file in glob.glob("resized/*.jpg")]
X = []
for im in img:
    arr = np.asarray(im)
    arr = np.split(arr, 20)
    arr = np.array([np.split(x, 20, 1) for x in arr])
    mat = [arr[i][j].mean() for i in range(20) for j in range(20)]
    X.append(mat)
a = max(X, key=lambda item: item[0])
For an input of [[2,3,4,5],[3,4,1,6],[7,1,2,10]], if you are looking for the indices of [7, 4, 4, 10] (which consists of the max value of each column of the 2D array), that can be done this way:
a = [[2,3,4,5],[3,4,1,6],[7,1,2,10]]
tr_a = list(zip(*a))
result = [(row.index(max(row)),index) for index, row in enumerate(tr_a)]
print(result)
Execution:
>>> a = [[2,3,4,5],[3,4,1,6],[7,1,2,10]]
>>> tr_a = list(zip(*a))
>>> result = [(row.index(max(row)),index) for index, row in enumerate(tr_a)]
>>> print(result)
[(2, 0), (1, 1), (0, 2), (2, 3)]
Here is a solution using numpy's argmax().
import numpy as np
import itertools as itt

def get_col_max_inds(arr):
    # argmax over axis 0 gives the row index of each column's maximum;
    # pairing it with a counter yields (row, column) tuples.
    return list(zip(np.argmax(arr, axis=0), itt.count()))
On numpy arrays it seems to be 10 times faster than the accepted solution. I also think it's more straightforward.
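Applied to the question's sample input, it produces the same pairs as the accepted answer (my own check):
>>> a = np.array([[2, 3, 4, 5], [3, 4, 1, 6], [7, 1, 2, 10]])
>>> get_col_max_inds(a)
[(2, 0), (1, 1), (0, 2), (2, 3)]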

Fast removal of consecutive duplicates in a list and corresponding items from another list

My question is similar to this previous SO question.
I have two very large lists of data (almost 20 million data points) that contain numerous consecutive duplicates. I would like to remove the consecutive duplicate as follows:
list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2] # This is 20M long!
list2 = ... # another list of size len(list1), also 20M long!
i = 0
while i < len(list1) - 1:
    if list1[i] == list1[i+1]:
        del list1[i]
        del list2[i]
    else:
        i = i + 1
And the output should be [1, 2, 3, 4, 5, 1, 2] for the first list.
Unfortunately, this is very slow since deleting an element from a list is itself a slow operation. Is there any way I can speed up this process? Please note that, as shown in the above code snippet, I also need to keep track of the index i so that I can remove the corresponding element in list2.
Python has groupby in the standard library for exactly this:
>>> list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [k for k,_ in groupby(list1)]
[1, 2, 3, 4, 5, 1, 2]
You can tweak it using the keyfunc argument, to also process the second list at the same time.
>>> list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> list2 = [9,9,9,8,8,8,7,7,7,6,6,6,5]
>>> from operator import itemgetter
>>> keyfunc = itemgetter(0)
>>> [next(g) for k,g in groupby(zip(list1, list2), keyfunc)]
[(1, 9), (2, 7), (3, 7), (4, 7), (5, 6), (1, 6), (2, 5)]
If you want to split those pairs back into separate sequences again:
>>> zip(*_) # "unzip" them
[(1, 2, 3, 4, 5, 1, 2), (9, 7, 7, 7, 6, 6, 5)]
You can use collections.deque and its maxlen argument to set a window size of 2. Then just compare the two entries in the window, and append to the results if they differ.
from collections import deque

def remove_adj_dups(x):
    """
    :param x: an iterable such as a string, a list, or a generator,
              e.g. [1, 1, 2, 3, 3]
    :return: [1, 2, 3] as a list
    """
    result = []
    # Seed the window with object(), which compares equal only to itself
    # (kudos to Trey Hunner for the object() trick).
    d = deque([object()], maxlen=2)
    for i in x:
        d.append(i)
        a, b = d
        if a != b:
            result.append(b)
    return result
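For example, on the question's first list:
>>> remove_adj_dups([1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 5, 1, 2])
[1, 2, 3, 4, 5, 1, 2]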
I generated a random list with duplicates: 2 million numbers between 0 and 9.
import random

def random_nums_with_dups(number_range=None, range_len=None):
    """
    :param number_range: draw from the numbers 0 to number_range - 1;
                         the smaller this is, the more duplicates
    :param range_len: length of the generated sequence
    :return: a generator
    Note: if number_range == 2, random binary digits are returned.
    """
    return (random.choice(range(number_range)) for i in range(range_len))
I then tested with:
range_len = 2000000

def mytest():
    return [remove_adj_dups(random_nums_with_dups(number_range=10, range_len=range_len))]

big_result = mytest()[0]
print(len(big_result))
The resulting len was 1800197 (consecutive duplicates removed), in under 5 seconds, which includes the random list generator spinning up.
I lack the experience/know-how to say whether it is memory efficient as well; could someone comment, please?

numba @njit to update a big dict

I am trying to use numba for a function that needs to do lookups in a very big (~10e6 entries) dict with (int, int) tuples as keys.
import numpy as np
from numba import njit

myarray = np.array([[0, 0],
                    [0, 1],
                    [1, 1],
                    [1, 2],
                    [2, 2],
                    [1, 3]])  # a lot of these, with shape ~(10e6, 2)

dict_with_tuples_key = {(0, 1): 1,
                        (3, 7): 1}  # ~10e6 keys
A simplified version looks like this:
# @njit
def update_dict(dict_with_tuples_key, myarray):
    for line in myarray:
        i, j = line
        if (i, j) in dict_with_tuples_key:
            dict_with_tuples_key[(i, j)] += 1
        else:
            dict_with_tuples_key[(i, j)] = 1
    return dict_with_tuples_key

new_dict = update_dict(dict_with_tuples_key, myarray)
print new_dict
# {(0, 1): 2,  # +1, was already in dict_with_tuples_key
#  (0, 0): 1,  # diag
#  (1, 1): 1,  # diag
#  (2, 2): 1,  # diag
#  (1, 2): 1,  # new from myarray
#  (1, 3): 1,  # new from myarray
#  (3, 7): 1}
It would appear that @njit does not accept a dict as a function argument?
I'm wondering how to rewrite this, especially the if (i, j) in dict_with_tuples_key part that does the lookup.
njit means that the function is compiled in nopython mode. A dict, a list, and a tuple are Python objects and are therefore not supported: not as arguments, and not inside the function.
If your dict keys are all different I would consider using a 2D numpy array where the first axis represents the first index of the dict-key-tuple and the second axis the second index. Then you could rewrite it as:
from numba import njit
import numpy as np

@njit
def update_array(array, myarray):
    elements = myarray.shape[0]
    for i in range(elements):
        array[myarray[i][0]][myarray[i][1]] += 1
    return array

myarray = np.array([[0, 0], [0, 1], [1, 1],
                    [1, 2], [2, 2], [1, 3]])

# Calculate the size of the numpy array that replaces the dict:
lens = np.max(myarray, axis=0)  # maximum index in each column
array = np.zeros((lens[0] + 1, lens[1] + 1))  # big enough for all indexes in myarray
update_array(array, myarray)
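Note that this starts every count at zero. To reproduce the question's expected output you would first seed the array from the existing dict, something like this hypothetical step (the array must also be sized to cover the dict keys, since (3, 7) lies outside the bounds computed from myarray alone):
keys = np.array(list(dict_with_tuples_key.keys()))
lens = np.maximum(np.max(myarray, axis=0), np.max(keys, axis=0))
array = np.zeros((lens[0] + 1, lens[1] + 1))
for (i, j), count in dict_with_tuples_key.items():
    array[i, j] += count  # counts already present in the dict
update_array(array, myarray)
print(array[0, 1])  # 2.0 -> one from the dict plus one from myarray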
Since you already indexed your dictionary with tuples, the transition to indexing an array should not be a big change.
As an alternative, you can check whether this is fast enough:
from collections import Counter
c2 = Counter(dict_with_tuples_key)
c1 = Counter(tuple(x) for x in myarray)
new_dict = dict(c1 + c2)
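A quick check against the expected output from the question (my own verification):
>>> new_dict[(0, 1)]
2
>>> new_dict[(1, 3)]
1
>>> new_dict[(3, 7)]
1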

Python random list

I'm new to Python, and have some problems with creating random lists.
I'm using random.sample(range(x, y), z).
I want to get 4 lists with unique numbers, from 1-4, so I have been using this
a = random.sample(range(1, 5), 4)
b = random.sample(range(1, 5), 4)
c = random.sample(range(1, 5), 4)
d = random.sample(range(1, 5), 4)
So I get for example
a = 1, 3, 2, 4
b = 1, 4, 3, 2
c = 2, 3, 1, 4
d = 4, 2, 3, 1
How can I make it so that the columns are also unique?
Absent a clear mathematical theory, I distrust anything other than a somewhat hit-and-miss approach. In particular, backtracking approaches can introduce a subtle bias:
from random import shuffle

def isLatin(square):
    # Assumes that square is an n x n list
    # where each row is a permutation of 1..n.
    n = len(square[0])
    return all(len(set(col)) == n for col in zip(*square))

def randSquare(n):
    row = [i for i in range(1, 1 + n)]
    square = []
    for i in range(n):
        shuffle(row)
        square.append(row[:])
    return square

def randLatin(n):
    # Uses a hit-and-miss approach.
    while True:
        square = randSquare(n)
        if isLatin(square):
            return square
Typical output:
>>> s = randLatin(4)
>>> for r in s: print(r)
[4, 1, 3, 2]
[2, 3, 4, 1]
[1, 4, 2, 3]
[3, 2, 1, 4]
Totally random then:
import random

def gen_matrix():
    first_row = random.sample(range(1, 5), 4)
    tmp = first_row + first_row
    rows = []
    for i in range(4):
        rows.append(tmp[i:i+4])  # cyclic shifts of the first row
    return random.sample(rows, 4)
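Since the rows are cyclic shifts of a single random permutation, every column automatically contains distinct values as well; a quick sanity check (my addition, not in the original answer):
>>> m = gen_matrix()
>>> all(len(set(col)) == 4 for col in zip(*m))
True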
Create a list of all the elements and, as you fill the line, remove each used element.
import random

def fill_line(length):
    my_list = list(range(length))
    to_return = []
    for i in range(length):
        x = random.choice(my_list)
        to_return.append(x)
        my_list.remove(x)
    return to_return

x = [fill_line(4) for i in range(4)]
print(x)
Probably the simplest way is to create a valid matrix, and then shuffle the rows, and then shuffle the columns:
import random

def random_square(U):
    U = list(U)
    rows = [U[i:] + U[:i] for i in range(len(U))]
    random.shuffle(rows)
    rows_t = [list(i) for i in zip(*rows)]
    random.shuffle(rows_t)
    return rows_t
Usage:
>>> random_square(range(1, 1+4))
[[2, 3, 4, 1], [4, 1, 2, 3], [3, 4, 1, 2], [1, 2, 3, 4]]
I originally thought this could create any valid matrix with equal probability, but after doing some reading it seems it still has bias, although I don't fully comprehend why yet.
I would build a random Latin square by 1) starting with a single random permutation, 2) populating the rows with rotations, 3) shuffling the rows, 4) transposing the square, and 5) shuffling the rows again:
from random import shuffle

def random_latin_square(elements):
    elements = list(elements)
    shuffle(elements)
    square = []
    for i in range(len(elements)):
        square.append(list(elements))
        elements = elements[1:] + [elements[0]]  # rotate by one
    shuffle(square)
    square[:] = zip(*square)  # transpose
    shuffle(square)
    return square

if __name__ == '__main__':
    from pprint import pprint
    square = random_latin_square('ABCD')
    pprint(square)

find the "overlap" between 2 python lists

Given 2 lists:
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
I want to find the "overlap":
c = [3,4,5,5,6]
I'd also like it if i could extract the "remainder" the part of a and b that's not in c.
a_remainder = [5,]
b_remainder = [1,4,7,]
Note:
a has three 5's in it and b has two.
b has two 4's in it and a has one.
The resultant list c should have two 5's (limited by list b) and one 4 (limited by list a).
This gives me what i want, but I can't help but think there's a much better way.
import copy

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
c = []
for elem in copy.deepcopy(a):
    if elem in b:
        a.pop(a.index(elem))
        c.append(b.pop(b.index(elem)))
# Now a and b both contain the "remainders" and c contains the "overlap".
On another note, what is a more accurate name for what I'm asking for than "overlap" and "remainder"?
collections.Counter, available since Python 2.7, can be used to implement multisets that do exactly what you want.
import collections

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
a_multiset = collections.Counter(a)
b_multiset = collections.Counter(b)
overlap = list((a_multiset & b_multiset).elements())
a_remainder = list((a_multiset - b_multiset).elements())
b_remainder = list((b_multiset - a_multiset).elements())
print overlap, a_remainder, b_remainder
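This prints the desired overlap and remainders (the order of elements within each list is not guaranteed by elements(), but the multiset contents match):
[3, 4, 5, 5, 6] [5] [1, 4, 7]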
Use Python sets:
intersection = set(a) & set(b)
a_remainder = set(a) - set(b)
b_remainder = set(b) - set(a)
In the language of sets, overlap is 'intersection' and remainder is 'set difference'. If your items were distinct, you wouldn't have to do these operations yourself; check out http://docs.python.org/library/sets.html if you're interested.
Since we're not working with distinct elements, your approach is reasonable. If you wanted this to run faster, you could create a dictionary for each list mapping each number to how many times it occurs (e.g., in a: 3->1, 4->1, 5->3, etc.). You would then iterate through a, check whether each element exists in b's map with a positive count, decrement that count, and add the element to the overlap list.
Untested code, but this is the idea:
def add_or_update(counts, value):
    if value in counts:
        counts[value] += 1
    else:
        counts[value] = 1

b_dict = dict()
for b_elem in b:
    add_or_update(b_dict, b_elem)

intersect = []
a_remainder = []
for a_elem in a:
    if a_elem in b_dict and b_dict[a_elem] > 0:
        b_dict[a_elem] -= 1          # consume one occurrence from b
        intersect.append(a_elem)
    else:
        a_remainder.append(a_elem)

# Whatever counts are left over in b_dict form b's remainder.
b_remainder = [k for k, v in b_dict.items() for _ in range(v)]
OK, verbose, but kind of cool (similar in spirit to the collections.Counter idea, but more home-made):
import itertools as it

flatten = it.chain.from_iterable

sorted(
    v for u, v in
    set(flatten(enumerate(g) for k, g in it.groupby(a))).intersection(
        set(flatten(enumerate(g) for k, g in it.groupby(b))))
)
The basic idea is to turn each list into a new list that attaches an occurrence counter to each element, numbered to account for duplicates, so that you can then use set operations on the resulting tuples. (Note that groupby only groups consecutive equal elements, so this assumes the lists are sorted, as they are here.)
To be slightly less verbose:
aa = set(flatten(enumerate(g) for k, g in it.groupby(a)))
bb = set(flatten(enumerate(g) for k, g in it.groupby(b)))
# aa = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5), (2, 5)])
# bb = set([(0, 1), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 4), (1, 5)])
cc = aa.intersection(bb)
# cc = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5)])
c = sorted(v for u,v in cc)
# c = [3, 4, 5, 5, 6]
groupby -- produces a group of identical elements for each run of equal values
(but because of the syntax you need the g for k, g in it.groupby(a) to extract each group)
enumerate -- attaches an occurrence counter to each element of each group
flatten -- chains everything into a single sequence
set -- converts it to a set of (counter, value) tuples
intersection -- finds the common elements
sorted(v for u, v in cc) -- gets rid of the counters and sorts the result
Finally, the remainders are simply the set differences aa-cc and bb-cc:
sorted(v for u,v in aa-cc)
# [5]
sorted(v for u,v in bb-cc)
# [1, 4, 7]
A response from kerio in #python on freenode:
from collections import Counter
import itertools

c = [i for i in itertools.chain.from_iterable(
        [k] * v for k, v in (Counter(a) & Counter(b)).iteritems())]
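In Python 2.7+ the same multiset intersection can, to the best of my knowledge, be spelled more directly with Counter.elements():
from collections import Counter

c = list((Counter(a) & Counter(b)).elements())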
Try difflib.SequenceMatcher(), "a flexible class for comparing pairs of sequences of any type"...
A quick try:
import difflib

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
sm = difflib.SequenceMatcher(None, a, b)
c = []
a_remainder = []
b_remainder = []
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'replace':
        a_remainder.extend(a[i1:i2])
        b_remainder.extend(b[j1:j2])
    elif tag == 'delete':
        a_remainder.extend(a[i1:i2])
    elif tag == 'insert':
        b_remainder.extend(b[j1:j2])
    elif tag == 'equal':
        c.extend(a[i1:i2])
And now...
>>> print c
[3, 4, 5, 5, 6]
>>> print a_remainder
[5]
>>> print b_remainder
[1, 4, 7]
a_set = set(a)
b_set = set(b)
a_remainder = a_set.difference(b_set)
b_remainder = b_set.difference(a_set)
c = a_set.intersection(b_set)
But if you need c to keep duplicates, and order is important for you,
you may want to look at the Longest common subsequence problem.
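For reference, a minimal dynamic-programming LCS sketch (my addition, not part of the original answer); on the question's lists it recovers c with order and duplicates preserved:
def lcs(xs, ys):
    # Classic O(len(xs) * len(ys)) dynamic-programming table.
    n, m = len(xs), len(ys)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if xs[i] == ys[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover one LCS.
    out = []
    i, j = n, m
    while i and j:
        if xs[i - 1] == ys[j - 1]:
            out.append(xs[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

print(lcs([3, 4, 5, 5, 5, 6], [1, 3, 4, 4, 5, 5, 6, 7]))  # [3, 4, 5, 5, 6]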
I don't think you should actually use this solution, but I took this opportunity to practice with lambda functions and here is what I came up with :)
a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
dedup = lambda x: [set(x)] if len(set(x)) == len(x) else [set(x)] + dedup([x[i] for i in range(1, len(x)) if x[i] == x[i-1]])
default_set = lambda x: (set() if x[0] is None else x[0], set() if x[1] is None else x[1])
deduped = map(default_set, map(None, dedup(a), dedup(b)))
get_result = lambda f: reduce(lambda x, y: list(x) + list(y), map(lambda x: f(x[0], x[1]), deduped))
c = get_result(lambda x, y: x.intersection(y)) # [3, 4, 5, 6, 5]
a_remainder = get_result(lambda x, y: x.difference(y)) # [5]
b_remainder = get_result(lambda x, y: y.difference(x)) # [1, 7, 4]
I'm pretty sure izip_longest would have simplified this a bit (wouldn't have needed the default_set lambda), but I was testing this with Python 2.5.
Here are some of the intermediate values used in the calculation in case anyone wants to understand this:
dedup(a) = [set([3, 4, 5, 6]), set([5]), set([5])]
dedup(b) = [set([1, 3, 4, 5, 6, 7]), set([4, 5])]
deduped = [(set([3, 4, 5, 6]), set([1, 3, 4, 5, 6, 7])), (set([5]), set([4, 5])), (set([5]), set([]))]
