How to rearrange matrix elements vertically in Python

I'm trying to build a basic game-like program in which I need to rearrange a given matrix vertically. In this case the matrix contains only 0s and 1s, where 0 represents lighter objects and 1 heavier ones. When the function runs, all the 1s should fall down vertically and all the 0s should rise up. The result must contain exactly the same number of 0s and 1s as the original matrix. Example:
If I give the following matrix:
[1,0,1,1,0,1,0],
[0,0,0,1,0,0,0],
[1,0,1,1,1,1,1],
[0,1,1,0,1,1,0],
[1,1,0,1,0,0,1]
It should rearrange it to:
[0,0,0,0,0,0,0],
[0,0,0,1,0,0,0],
[1,0,1,1,0,1,0],
[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1]
Any help or suggestions will be highly appreciated.

Consider using numpy for your matrices. You can then use np.sort to do what you want:
np.sort(matrix, axis=0)
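For instance, a minimal sketch assuming the matrix from the question has been converted to a NumPy array (the variable names here are mine):
import numpy as np

matrix = np.array([[1, 0, 1, 1, 0, 1, 0],
                   [0, 0, 0, 1, 0, 0, 0],
                   [1, 0, 1, 1, 1, 1, 1],
                   [0, 1, 1, 0, 1, 1, 0],
                   [1, 1, 0, 1, 0, 0, 1]])
# Sorting each column in ascending order pushes the 0s to the top
# and lets the 1s sink to the bottom, keeping the per-column counts unchanged.
result = np.sort(matrix, axis=0)
print(result)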

Not as readable as the numpy approach, but if you want to stick with plain lists you can:
Transpose the matrix by using the zip(*matrix) approach.
Sort the resulting rows (which are columns of the original matrix)
Transpose back.
You can do it in one line:
[row for row in zip(*[sorted(column) for column in zip(*matrix)])]
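Applied to the matrix from the question (a small usage sketch; note that zip yields tuples, so wrap each row in list() if you need lists back):
matrix = [[1, 0, 1, 1, 0, 1, 0],
          [0, 0, 0, 1, 0, 0, 0],
          [1, 0, 1, 1, 1, 1, 1],
          [0, 1, 1, 0, 1, 1, 0],
          [1, 1, 0, 1, 0, 0, 1]]
rearranged = [list(row) for row in zip(*[sorted(column) for column in zip(*matrix)])]
# rearranged[0] is now [0, 0, 0, 0, 0, 0, 0] and rearranged[-1] is [1, 1, 1, 1, 1, 1, 1]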

If you didn't want to use numpy (though you should), you could do:
from collections import Counter

test = [[1, 0, 1, 1, 0, 1, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [1, 0, 1, 1, 1, 1, 1],
        [0, 1, 1, 0, 1, 1, 0],
        [1, 1, 0, 1, 0, 0, 1]]

new_version = [[] for _ in test]  # one empty list per output row
for count, item in enumerate(test[0]):  # iterate over column indices (assumes all rows have equal length)
    frequency = Counter([x[count] for x in test])  # frequency count of 0s and 1s in this column
    for count_inside, item_inside in enumerate(test):
        # place as many 0s at the top of the column as the column contains, then fill with 1s
        value = 0 if 0 in frequency and count_inside < frequency[0] else 1
        new_version[count_inside].append(value)
print(new_version)
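Running this on the sample matrix from the question reproduces the expected output: the 0s stacked at the top of each column and the 1s at the bottom.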

Related

Function eliminating rows and columns if any of the two are all equal to 0 AND index removed (for sparse matrices) in Python

I am trying to come up with a function which does the following:
Take an N x N array (preferably a scipy.sparse csr_matrix)
Find which rows and/or columns have all-zero entries
Remove both the nth row and the nth column if either of the two (or both) has all-zero entries
Return the new, reduced array (or sparse matrix), with no all-zero rows and/or columns, together with the indices of the removed rows/columns.
I manage to return the correct array, but the returned indices of the removed rows and columns are not correct (there are too few of them): since I remove rows and columns iteratively, rows/columns that were not all zero before can become all zero after a removal.
Take for example the array [0,1,0,0],[0,0,0,0],[0,1,0,0],[1,0,0,0]. I shall remove the 2nd row and the 2nd column. Now the new array [0,0,0],[0,0,0],[1,1,1] shall have both its 1st and 2nd rows and columns removed, which is okay, but it's easy to see that the returned indices are in a sense "scaled", i.e. they no longer refer to the original matrix.
This is the function I have created for the moment:
def remove_zero_rows(X):
    # X is a scipy sparse matrix. We want to remove all zero rows/columns from it
    creat_list = list(range(0, X.shape[1]))
    nonzero_row_indice, nonzero_col_indice = X.nonzero()
    unique_nonzero_indice = np.unique(nonzero_row_indice)
    row_ind = np.array(list(set(creat_list).difference(unique_nonzero_indice)))  # set of all-zero rows
    nonzero_col_indice = np.unique(nonzero_col_indice)
    col_ind = np.array(list(set(creat_list).difference(nonzero_col_indice)))  # set of all-zero columns
    merge_two = list(set(row_ind) | set(col_ind))  # indices of all-zero rows/columns
    # Create the new matrix
    for i in range(X.shape[1]):
        if (X.shape[1] - np.unique(X.nonzero()[0]).size > 0
                or X.shape[1] - np.unique(X.nonzero()[1]).size > 0):
            X = X[np.unique(X.nonzero()[0])][:, np.unique(X.nonzero()[0])]
            X = X[np.unique(X.nonzero()[1])][:, np.unique(X.nonzero()[1])]
            # print(i)
        else:
            break
    return X, row_ind, col_ind, merge_two
Thank you!
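One way to fix the index bookkeeping is to carry along the original index of every surviving row/column while looping. This is a minimal sketch, not the asker's code: the function name remove_zero_rows_cols and the keep/orig arrays are my own, and it assumes a square scipy.sparse matrix (or dense array) as input.
import numpy as np
from scipy.sparse import csr_matrix

def remove_zero_rows_cols(X):
    # Iteratively drop index i whenever row i or column i is all zero,
    # and report the removed indices in the ORIGINAL numbering.
    n0 = X.shape[0]
    orig = np.arange(n0)  # original index of each surviving row/column
    while True:
        row_nnz = np.asarray(abs(X).sum(axis=1)).ravel()
        col_nnz = np.asarray(abs(X).sum(axis=0)).ravel()
        keep = (row_nnz > 0) & (col_nnz > 0)
        if keep.all():
            break
        X = X[keep][:, keep]
        orig = orig[keep]
    removed = np.setdiff1d(np.arange(n0), orig)
    return X, removed

A = csr_matrix([[0, 1, 0, 0],
                [0, 0, 0, 0],
                [0, 1, 0, 0],
                [1, 0, 0, 0]])
X_new, removed = remove_zero_rows_cols(A)
print(removed)  # for this example the cascade ends up removing every index: [0 1 2 3]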

Understanding np.ix_

Code:
import numpy as np
ray = [1,22,33,42,51], [61,71,812,92,103], [113,121,132,143,151], [16,172,183,19,201]
ray = np.asarray(ray)
type(ray)
ray[np.ix_([-2:],[3:4])]
I'd like to use index slicing to get a subarray consisting of the last two rows and the 3rd/4th columns. My current code produces a syntax error.
I'd also like to sum each column. What am I doing wrong?
So you want to make a slice of an array. The most straightforward way to do it is... slicing:
slice = ray[-2:,3:]
or if you want it explicitly
slice = ray[-2:,3:5]
See it explained in Understanding slicing
But if you do want to use np.ix_ for some reason, you need
slice = ray[np.ix_([-2,-1],[3,4])]
You can't use : here, because [] here doesn't make a slice; it constructs a list, and you have to specify explicitly every row number and every column number you want in the result. If there are too many consecutive indices to write out, you can use range:
slice = ray[np.ix_(range(-2, 0),range(3, 5))]
And to sum each column:
slice.sum(0)
0 means you want to reduce the 0th dimension (rows) by summation and keep other dimensions (columns in this case).
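Putting it together on the array from the question (a small runnable check; the variable name sub is mine):
import numpy as np

ray = np.asarray([[1, 22, 33, 42, 51],
                  [61, 71, 812, 92, 103],
                  [113, 121, 132, 143, 151],
                  [16, 172, 183, 19, 201]])
sub = ray[np.ix_([-2, -1], [3, 4])]  # last two rows, columns 3 and 4
print(sub)         # [[143 151]
                   #  [ 19 201]]
print(sub.sum(0))  # [162 352]  -> sum of each column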

Can you extract indexes of data over a threshold from numpy array or pandas dataframe

I am using the following to compare several strings to each other. It's the fastest method I've been able to devise, but it results in a very large 2D array, which I can look at to find what I want. Ideally, I would like to set a threshold and pull the index (or indices) of each value over that number. To make matters more complicated, I don't want the index comparing a string to itself, and it's possible a string is duplicated elsewhere, in which case I'd want to know that, so I can't just ignore 1s.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
texts = sql.get_corpus()
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
similarity = cosine_similarity(vectors)
sql.get_corpus() returns a list of strings, currently around 1600 of them.
Is what I want possible? I've tried comparing each of the 1.4M combinations to each other using Levenshtein, which works, but it takes 2.5 hours versus the half hour the approach above takes. I've also tried vectors with spaCy, which takes days.
I'm not entirely sure I read your post correctly, but I believe this should get you started:
import numpy as np
# randomly distributed data we want to filter
data = np.random.rand(5, 5)
# get index of all values above a threshold
threshold = 0.5
above_threshold = data > threshold
# I am assuming your matrix has all string comparisons to
# itself on the diagonal
not_ident = np.identity(5) == 0.
# [edit: to prevent duplicate comparisons, use this instead of not_ident]
#upper_only = np.triu(np.ones((5,5)) - np.identity(5))
# 2D array, True when criteria met
result = above_threshold * not_ident
print(result)
# original shape, but 0 in place of all values not matching above criteria
values_orig_shape = data * result
print(values_orig_shape)
# all values that meet criteria, as a 1D array
values = data[result]
print(values)
# indices of all values that meet criteria (in same order as values array)
indices = [index for index,value in np.ndenumerate(result) if value]
print(indices)
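A small alternative for the last step (my own note, not part of the original answer): np.argwhere returns the same (row, column) pairs, in the same order, as an integer array.
# same pairs as `indices`, but as an (n_matches, 2) NumPy array
indices_arr = np.argwhere(result)
print(indices_arr)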

How to iterate through a 3D array and calculate the mean of each cell

I want to create a loop for the following line in Python (I use PyCharm):
mean_diff = np.mean(np.array([diff_list[0].values, diff_list[1].values, diff_list[2].values, diff_list[3].values, ..., diff_list[100].values]), axis=0)
With this I get the mean of each individual cell across the different arrays (raster change over time).
I tried the following:
for x in range(100):
    mean_diff = np.mean(np.array([diff_list[x].values]), axis=0)
But what happens here is that each pass only looks at a single array, so mean_diff just gets overwritten on every iteration, instead of everything being added up first and the mean of the total being calculated afterwards. One idea was to first create a "sum array" with all the diff_list values in it, but I failed to do that too. diff_list is a list containing data frames (one array per entry), so effectively a 3D structure (the picture in the original post showed the structure of the list).
You need to populate the array, not do the computation, within the loop. Python list comprehensions are perfect for this:
Your first program is the equivalent of:
mean_diff = np.mean(np.array([a.values for a in diff_list[:101]]), axis=0)
Or if you prefer:
x = []
for a in diff_list[:101]:
    x.append(a.values)
mean_diff = np.mean(np.array(x), axis=0)
If you are using the whole list instead of its first 101 elements you can drop the "[:101]".
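An equivalent alternative (my own sketch, not part of the original answer; it assumes every element of diff_list exposes a .values array of the same shape) is to stack everything into one 3-D array and reduce over the first axis:
import numpy as np

stacked = np.stack([a.values for a in diff_list[:101]])  # shape: (n_arrays, rows, cols)
mean_diff = stacked.mean(axis=0)                         # one mean per cell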

Vectorize an operation in Numpy

I am trying to do the following on Numpy without using a loop :
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np

N = 3
d = 3
K = 2
X = np.eye(N)
y = np.random.randint(1, K+1, N)
M = np.zeros((K, d))
for i in np.arange(0, K):
    line = X[y == i+1, :]
    if line.size == 0:
        M[i, :] = np.zeros(d)
    else:
        M[i, :] = np.mean(line, 0)
Thank you in advance.
The code is basically collecting specific rows of X and adding them up, for which NumPy has a builtin: np.add.reduceat. With that in focus, the steps to solve it in a vectorized way could be as listed next:
# Get sort indices of y
sidx = y.argsort()
# Collect rows off X based on their IDs so that they come in consecutive order
Xr = X[np.arange(N)[sidx]]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq,startidx,counts = np.unique((y-1)[sidx],return_index=True,return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr,startidx,axis=0),counts[:,None])
# Setup output array and set row summed values into it at unique IDs row positions
out = np.zeros((K,d))
out[unq] = vals
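As a quick sanity check (my own addition, reusing X, y, N, K and d from the question's loop version), the vectorized result should match the per-class means computed directly:
ref = np.array([X[y == i + 1].mean(0) if (y == i + 1).any() else np.zeros(d)
                for i in range(K)])
print(np.allclose(out, ref))  # expected: True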
This also solves the question, but it creates an intermediate K×N boolean matrix and doesn't use the built-in mean function, which may lead to worse performance or worse numerical stability in some cases. I let the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np

# Define constants
K, N, d = 10, 1000, 3
# Sample data
Y = np.random.randint(0, K-1, N)  # K-1 so that one class has no examples (tests the empty-class case)
X = np.random.randn(N, d)
# Calculate means for each class, vectorized
# Map samples to labels by taking a logical "outer product"
mark = Y[None, :] == np.arange(0, K)[:, None]
# Count the number of examples in each class
count = mark.sum(1)
# Avoid divide-by-zero for classes with no examples
count += count == 0
# Sum within each class and normalize
M = (np.dot(mark, X).T / count).T
print(M, M.shape, mark.shape)
