Accessing column in sparse CSR matrix - python

Having some issues with accessing the last column in the sparse CSR matrix. Ideally, I would like to convert the last column into some sort of array that can be used as my label set. My CSR matrix looks like this:
(0, 1976) 1
(0, 2916) 1
(0, 3871) 1
(0, 4437) 1
(0, 8202) 1
(0, 9458) 1
(0, 10597) 1
(1, 4801) 1
(1, 6903) 1
(1, 7525) 1
(2, 873) 1
(2, 1017) 1
(2, 1740) 1
(2, 1925) 1
(3, 1976) 1
(3, 5606) 1
(3, 6898) 1
I want to access the last column, which contains all the '1'. Is there a way in which I can do this?

A CSR matrix has indices and indptr properties. The example below converts the matrix to a list of strings using these properties:
from scipy.sparse import csr_matrix

def sparse_to_string_list(matrix: csr_matrix):
    res = []
    indptr = matrix.indptr
    indices = matrix.indices
    for row in range(matrix.shape[0]):
        # column indices of the entries stored in this row
        arr = sorted(indices[indptr[row]: indptr[row + 1]])
        res.append(' '.join(str(k) for k in arr))
    return res
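Note that in the printed form of a sparse matrix, the trailing number on each line is the stored value, not a matrix column. A minimal sketch of both readings of "last column", assuming a small made-up matrix (the names `m`, `stored_values` and `last_col` are only illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small illustrative matrix: 3 rows, 4 columns, every stored value is 1
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 2, 3])
vals = np.array([1, 1, 1, 1])
m = csr_matrix((vals, (rows, cols)), shape=(3, 4))

# The values shown at the end of each printed line live in .data
stored_values = m.data

# The actual last matrix column, converted to a dense 1-D array
last_col = m[:, m.shape[1] - 1].toarray().ravel()

print(stored_values)
print(last_col)
```

If the labels really are stored in the final matrix column, `last_col` is probably what you want; if you only need the stored values, `m.data` is enough.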

Related

How to random assign 0 or 1 for x rows depend upon y column value in excel

I'm trying to generate the sample data below in Excel. There are 3 columns, and I want output like the IsShade column. I've tried =RANDARRAY(20,1,0,1,TRUE), but it doesn't do exactly what I need.
Within each group of NoOfCells rows, I want random '1' values, and the number of 1s should equal the Shading value.
NoOfCells Shading IsShade(o/p)
5 2 0
5 2 0
5 2 1
5 2 0
5 2 1
--------------------
4 3 1
4 3 1
4 3 0
4 3 1
--------------------
4 1 0
4 1 0
4 1 0
4 1 1
I'd appreciate it if anyone can help me out. Python code will also work, since I will read the Excel data as CSV and try to generate the IsShade output column. Thank you!!
A small snippet of Python that writes your data to a CSV file that Excel can open. This code does not use Pandas or NumPy, only the standard library, to keep it simple if you want to use Python with Excel.
import random
import itertools
import csv

cols = ['NoOfCells', 'Shading', 'IsShade(o/p)']
data = [(5, 2), (4, 3), (4, 1)]  # (c, s)
lst = []
for c, s in data:  # c=5, s=2
    l = [0]*(c-s) + [1]*s  # 3x[0], 2x[1] -> [0, 0, 0, 1, 1]
    random.shuffle(l)  # shuffle -> [1, 0, 0, 0, 1]
    lst.append(zip([c]*c, [s]*c, l))
# flatten the list
lst = list(itertools.chain(*lst))
with open('shade.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(cols)
    writer.writerows(lst)
>>> lst
[(5, 2, 1),
(5, 2, 0),
(5, 2, 0),
(5, 2, 0),
(5, 2, 1),
(4, 3, 1),
(4, 3, 0),
(4, 3, 1),
(4, 3, 1),
(4, 1, 0),
(4, 1, 0),
(4, 1, 1),
(4, 1, 0)]
$ cat shade.csv
NoOfCells,Shading,IsShade(o/p)
5,2,0
5,2,0
5,2,1
5,2,0
5,2,1
4,3,1
4,3,1
4,3,1
4,3,0
4,1,0
4,1,1
4,1,0
4,1,0
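If the NoOfCells and Shading columns are already present (for example, read from a CSV), the same shuffle idea can be applied per group. The following is only a sketch: the function name `fill_is_shade` is hypothetical, the seed is an arbitrary choice for repeatability, and it assumes that rows of one group are adjacent and that neighbouring groups differ in at least one of the two values.

```python
import itertools
import random

def fill_is_shade(rows, rng):
    """rows: list of (NoOfCells, Shading) tuples, grouped consecutively."""
    out = []
    for (cells, shading), grp in itertools.groupby(rows):
        n = len(list(grp))                 # number of rows in this group
        ones = min(shading, n)             # never more 1s than rows
        flags = [1] * ones + [0] * (n - ones)
        rng.shuffle(flags)                 # random placement of the 1s
        out.extend((cells, shading, f) for f in flags)
    return out

rows = [(5, 2)] * 5 + [(4, 3)] * 4 + [(4, 1)] * 4
result = fill_is_shade(rows, random.Random(0))
for r in result:
    print(r)
```

Each group then contains exactly as many 1s as its Shading value, in random positions.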
You can count the number of rows for RANDARRAY to return using COUNTA. Also, to exclude the dividing lines, test for ISNUMBER:
=LET(Data,FILTER(B:B,(B:B<>"")*(ROW(B:B)>1)),IF(ISNUMBER(Data),RANDARRAY(COUNTA(Data),1,0,1,TRUE),""))

"'int' object is not iterable" when using itertools and applying a function to each row

I have the following dataset:
index REWARD
(1,1,1) 0
(1,2,3) 0
(1,1,3) 0
I want to set REWARD = 2 if index contains a pair of equal numbers, so the output should look like
index REWARD
(1,1,1) 0
(1,2,3) 0
(1,1,3) 2
when I use this code
def set_reward(final):
    for i in final['index']:
        tempCount=[]
        for item,count in collections.Counter((i)).items():
            tempCount.append(count)
        if tempCount==[2, 1] or tempCount==[1, 2]:
            final['REWARD']=2
    return final['REWARD']
final['REWARD']=final.apply(set_reward,axis=1)
It says "'int' object is not iterable".
Is there any way to resolve it?
You can achieve the desired result without explicit for loops and conditional logic. Try something like this:
from collections import Counter
import pandas as pd
# Example data
df = pd.DataFrame({'index': [(1, 1, 1), (1, 2, 3), (1, 1, 3)],
                   'REWARD': [0, 0, 0]})
# Select any row whose index contains at least one pair of values
mask = df['index'].apply(lambda x: 2 in Counter(x).values())
df.loc[mask, 'REWARD'] = 2
df
index REWARD
0 (1, 1, 1) 0
1 (1, 2, 3) 0
2 (1, 1, 3) 2
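As for the error itself: with axis=1, apply passes one row at a time, so final['index'] inside the function is a single tuple. Looping over it yields plain integers, and Counter on an integer raises "'int' object is not iterable". A minimal sketch of a row-wise version that counts the tuple directly (the helper name `reward_for` is hypothetical):

```python
from collections import Counter

import pandas as pd

def reward_for(values):
    """Return 2 if some value occurs exactly twice in the tuple, else 0."""
    return 2 if 2 in Counter(values).values() else 0

final = pd.DataFrame({'index': [(1, 1, 1), (1, 2, 3), (1, 1, 3)],
                      'REWARD': [0, 0, 0]})

# Apply to the column of tuples, so each call receives one whole tuple
final['REWARD'] = final['index'].apply(reward_for)
print(final)
```

Applying the function to the column rather than to whole rows avoids iterating over the tuple's elements by accident.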

slice a csr matrix by a list of indexes - python

I'm struggling to understand the behaviour of slicing a sparse matrix.
I have this CSR matrix, say M:
(0, 4136) 1
(0, 5553) 1
(0, 9089) 1
(0, 24104) 3
(0, 28061) 2
I have extracted the (column) indexes and I want to slice the matrix by them.
From that matrix I want one matrix
(0, 4136) 1
(0, 5553) 1
(0, 9089) 1
(0, 24104) 3
and
(0, 28061) 2
Now if I do
M[0, training_set_index]
where training_set_index=[4136,5553,9089, 24104], I get
(0, 3) 3
(0, 2) 1
(0, 1) 1
(0, 0) 1
I just want a copy of the original matrix (preserving the indexes) that keeps only the columns listed in training_set_index. Is it possible? What's wrong?
Thanks
When I hear sparse matrix, the first thing that comes to my mind is a lot of zeros :)
One approach is to convert the sparse matrix to a NumPy array, do some fancy slicing, and turn the result back into a sparse matrix:
import numpy as np
from scipy.sparse import csr_matrix
# create sparse matrix
training_set_index = [5553, 24104]  # for example, I need these indexes
row = np.array([0, 0, 0, 0, 0])
col = np.array([4136, 5553, 9089, 24104, 28061])
data = np.array([1, 1, 1, 3, 2])
dim_0 = (1, 28062)
S = csr_matrix((data, (row, col)), shape=dim_0)
print(S)
#(0, 4136) 1
#(0, 5553) 1
#(0, 9089) 1
#(0, 24104) 3
#(0, 28061) 2
# convert to numpy array
M = S.toarray()
# create zero arrays as wide as the matrix and fill them from the selected columns
x = np.zeros((28062, ), dtype=int)
y = np.zeros((28062, ), dtype=int)
np.add.at(x, training_set_index, M[0, training_set_index])
np.add.at(y, [28061], M[0, 28061])  # the remaining column, kept separately
# new sparse matrix
S_training = csr_matrix(x)
print(S_training)
#(0, 5553) 1
#(0, 24104) 3
Have a nice slicing!
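Converting to a dense array can be costly for wide matrices. A possible alternative, sketched here under the assumption that the matrix is built as in the question, is to filter the stored entries in COO form, which keeps the original column indexes and never densifies (the names `S_training` and `S_rest` are only illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

training_set_index = [4136, 5553, 9089, 24104]
row = np.array([0, 0, 0, 0, 0])
col = np.array([4136, 5553, 9089, 24104, 28061])
data = np.array([1, 1, 1, 3, 2])
S = csr_matrix((data, (row, col)), shape=(1, 28062))

# COO exposes parallel row/col/data arrays that are easy to mask
coo = S.tocoo()
mask = np.isin(coo.col, training_set_index)

# Entries whose column is in the training set, same shape as S
S_training = csr_matrix(
    (coo.data[mask], (coo.row[mask], coo.col[mask])), shape=S.shape)
# Everything else
S_rest = csr_matrix(
    (coo.data[~mask], (coo.row[~mask], coo.col[~mask])), shape=S.shape)

print(S_training)
print(S_rest)
```

Both results keep the shape of the original matrix, so the column numbers still refer to the same features.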

How to get the position of the highest value in a row of a matrix

I need to create a program that, given a matrix, asks for a row in the matrix and gives the position of the highest value in that row.
For example with the matrix:
D = [(0 1 3),
(1 0 4),
(3 4 0)]
when given the value 2 should return 1 (because in the row[2] the highest value is 4 which is in position 1 in that row).
Right now I'm trying:
def farthest(matrix, row, point):
    maxdist = 0
    matrix2 = []
    for i in range(0, len(matrix)):
        if i == int(point):
            matrix2.append(i)
        if i != int(point):
            pass
    for j in range(0, len(matrix2)):
        if j < maxdist :
            pass
        if j > maxdist:
            maxdist = maxdist + j
    print(matrix2)
    print(maxdist)
    return matrix2
I should find a solution using loops.
The current output that I get is [2] and [0].
You can use numpy and argmax for axis 1 for this, to get the position of the maximum value per row.
import numpy as np
D = [(0, 1, 3),
(1, 0, 4),
(3, 4, 0)]
t = np.array(D).argmax(axis=1)
t
Which will give you
array([2, 2, 1])
You can then index with the row you are after
t[2]
1
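Since the question asks for a solution using loops, here is a minimal loop-only sketch without numpy. The two-argument signature is an assumption (the original function also took a point argument), and ties are resolved in favour of the first maximum:

```python
def farthest(matrix, row):
    """Return the position of the highest value in matrix[row]."""
    values = matrix[row]
    best_pos = 0
    for j in range(1, len(values)):
        # Keep the position of the largest value seen so far
        if values[j] > values[best_pos]:
            best_pos = j
    return best_pos

D = [(0, 1, 3),
     (1, 0, 4),
     (3, 4, 0)]

print(farthest(D, 2))
```

The loop compares each value with the current best and stores a position, not a running sum, which is where the attempt in the question goes wrong.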

Iterate over sparse matrix and concatenate data and indices for each row

I have a scenario where I have a dataframe and a vocabulary file that I am trying to fit to the dataframe's string columns. I am using scikit-learn's CountVectorizer, which produces a sparse matrix. I need to take the output of the sparse matrix and merge it with the corresponding rows of the dataframe.
code:-
from sklearn.feature_extraction.text import CountVectorizer
docs = ["You can catch more flies with honey than you can with vinegar.",
"You can lead a horse to water, but you can't make him drink.",
"search not cleaning up on hard delete",
"updating firmware version failed",
"increase not service topology s memory",
"Nothing Matching Here"
]
vocabulary = ["catch more","lead a horse", "increase service", "updating" , "search", "vinegar", "drink", "failed", "not"]
vectorizer = CountVectorizer(analyzer=u'word', vocabulary=vocabulary,lowercase=True,ngram_range=(0,19))
SpraseMatrix = vectorizer.fit_transform(docs)
Below is sparse matrix output -
(0, 0) 1
(0, 5) 1
(1, 6) 1
(2, 4) 1
(2, 8) 1
(3, 3) 1
(3, 7) 1
(4, 8) 1
Now, what I am looking to do is build a string for each row of the sparse matrix and add it to the corresponding document.
For example, for doc 3 ("updating firmware version failed"), I am looking to get "3:1 7:1" from the sparse matrix (i.e. the column indexes of "updating" and "failed" and their frequencies) and add this to row 3 of the data frame.
I tried the code below, but it produces flattened output, whereas I want to get the submatrix for each row index, loop through it, build a concatenated string such as "3:1 7:1" for each row, and finally add this string as a new column to the data frame for each corresponding row.
cx = SpraseMatrix.tocoo()
for i,j,v in zip(cx.row, cx.col, cx.data):
    print((i,j,v))
(0, 0, 1)
(0, 5, 1)
(1, 6, 1)
(2, 4, 1)
(2, 8, 1)
(3, 3, 1)
(3, 7, 1)
(4, 8, 1)
I'm not entirely following what you want, but maybe the lil format will be easier to work with:
In [1122]: M = sparse.coo_matrix(([1,1,1,1,1,1,1,1],([0,0,1,2,2,3,3,4],[0,5,6,4,
...: 8,3,7,8])))
In [1123]: M
Out[1123]:
<5x9 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in COOrdinate format>
In [1124]: print(M)
(0, 0) 1
(0, 5) 1
(1, 6) 1
(2, 4) 1
(2, 8) 1
(3, 3) 1
(3, 7) 1
(4, 8) 1
In [1125]: Ml = M.tolil()
In [1126]: Ml.data
Out[1126]: array([list([1, 1]), list([1]), list([1, 1]), list([1, 1]), list([1])], dtype=object)
In [1127]: Ml.rows
Out[1127]: array([list([0, 5]), list([6]), list([4, 8]), list([3, 7]), list([8])], dtype=object)
Its attributes are organized by row, which appears to be how you want it.
In [1130]: Ml.rows[3]
Out[1130]: [3, 7]
In [1135]: for i,(rd) in enumerate(zip(Ml.rows, Ml.data)):
...: print(' '.join(['%s:%s'%ij for ij in zip(*rd)]))
...:
0:1 5:1
6:1
4:1 8:1
3:1 7:1
8:1
You can also iterate through the rows of the csr format, but that requires a bit more math with the .indptr attribute.
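For reference, a sketch of that CSR iteration: indptr[r] and indptr[r + 1] bound the slice of indices and data that belongs to row r. The matrix below is rebuilt by hand from the printed output in the question, with six rows so that the document with no matches is included; `row_strings` is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse import csr_matrix

rows = [0, 0, 1, 2, 2, 3, 3, 4]
cols = [0, 5, 6, 4, 8, 3, 7, 8]
M = csr_matrix((np.ones(8, dtype=int), (rows, cols)), shape=(6, 9))

def row_strings(m):
    """Build one 'col:value' string per row of a CSR matrix."""
    out = []
    for r in range(m.shape[0]):
        start, end = m.indptr[r], m.indptr[r + 1]
        pairs = zip(m.indices[start:end], m.data[start:end])
        out.append(' '.join(f'{c}:{v}' for c, v in pairs))
    return out

strings = row_strings(M)
print(strings)
```

A row without stored entries yields an empty string, and the resulting list has one item per document, so it could be assigned directly as a new dataframe column.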
