Vectorize an operation in Numpy - python

I am trying to do the following in NumPy without using a loop:
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np
N=3
d=3
K=2
X=np.eye(N)
y=np.random.randint(1,K+1,N)
M=np.zeros((K,d))
for i in np.arange(0,K):
    line=X[y==i+1,:]
    if line.size==0:
        M[i,:]=np.zeros(d)
    else:
        M[i,:]=np.mean(line,0)
Thank you in advance.

The code is basically collecting specific rows of X and summing them, and NumPy has a builtin for exactly that: np.add.reduceat. With that in focus, the steps to solve it in a vectorized way are listed next -
# Get sort indices of y
sidx = y.argsort()
# Collect rows off X based on their IDs so that they come in consecutive order
Xr = X[np.arange(N)[sidx]]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq,startidx,counts = np.unique((y-1)[sidx],return_index=True,return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr,startidx,axis=0),counts[:,None])
# Setup output array and set row summed values into it at unique IDs row positions
out = np.zeros((K,d))
out[unq] = vals
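As a quick sanity check, the vectorized out can be compared against the loop from the question (a small sketch that assumes the same N, d, K, X, and y as in the question code above):
M = np.zeros((K,d))
for i in np.arange(0,K):
    line = X[y==i+1,:]
    if line.size:
        M[i,:] = np.mean(line,0)
assert np.allclose(M, out)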

The following also solves the question, but it creates an intermediate K×N boolean matrix and doesn't use the built-in mean function, which may lead to worse performance or worse numerical stability in some cases. Here the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np

# Define constants
K,N,d = 10,1000,3
# Sample data
Y = np.random.randint(0,K-1,N)  # K-1 to omit one class to test the no-examples case
X = np.random.randn(N,d)
# Calculate means for each class, vectorized
# Map samples to labels by taking a logical "outer product"
mark = Y[None,:]==np.arange(0,K)[:,None]
# Count number of examples in each class
count = mark.sum(1)
# Avoid divide by zero if no examples
count += count==0
# Sum within each class and normalize
M = (np.dot(mark,X).T/count).T
print(M, M.shape, mark.shape)
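As a similar sanity check, a plain loop over the 0 to K-1 labels should reproduce M (a sketch assuming the X, Y, K, and d defined above; classes with no examples stay as zero rows):
M_loop = np.zeros((K,d))
for k in range(K):
    rows = X[Y==k]
    if rows.size:
        M_loop[k] = rows.mean(axis=0)
assert np.allclose(M, M_loop)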

Related

How to rearrange matrix elements vertically on python

I'm trying to build a basic game-like program where I need to rearrange a given matrix, but vertically. In this case, I only have 0s and 1s, 0 being lighter objects and 1 being heavier. When the function runs, all the 1s should fall down vertically and the 0s should go up. The result needs to have exactly the same number of 0s and 1s as the original matrix. Example:
If I give the following matrix:
[1,0,1,1,0,1,0],
[0,0,0,1,0,0,0],
[1,0,1,1,1,1,1],
[0,1,1,0,1,1,0],
[1,1,0,1,0,0,1]
It should rearrange it to:
[0,0,0,0,0,0,0],
[0,0,0,1,0,0,0],
[1,0,1,1,0,1,0],
[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1]
Any help or suggestions will be highly appreciated.
Consider using numpy for your matrices. You can then use np.sort to do what you want:
np.sort(matrix, axis=0)
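For example, applied to the matrix from the question (a minimal sketch):
import numpy as np

matrix = np.array([[1,0,1,1,0,1,0],
                   [0,0,0,1,0,0,0],
                   [1,0,1,1,1,1,1],
                   [0,1,1,0,1,1,0],
                   [1,1,0,1,0,0,1]])
# axis=0 sorts each column independently, so the 0s rise to the top and the 1s sink
print(np.sort(matrix, axis=0))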
Not as readable as the numpy approach, but if you want to use the list approach you could:
1. Transpose the matrix by using the zip(*matrix) approach.
2. Sort the resulting rows (which are columns of the original matrix).
3. Transpose back.
You can do it in one line:
[row for row in zip(*[sorted(column) for column in zip(*matrix)])]
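Note that zip yields tuples, so each row of the result is a tuple; wrapping rows in list() gives plain lists again (a small illustration on the matrix from the question):
matrix = [[1,0,1,1,0,1,0],
          [0,0,0,1,0,0,0],
          [1,0,1,1,1,1,1],
          [0,1,1,0,1,1,0],
          [1,1,0,1,0,0,1]]
rearranged = [list(row) for row in zip(*[sorted(column) for column in zip(*matrix)])]
for row in rearranged:
    print(row)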
If you didn't want to use numpy (though you should), you could do:
from collections import Counter
test = [[1,0,1,1,0,1,0],
[0,0,0,1,0,0,0],
[1,0,1,1,1,1,1],
[0,1,1,0,1,1,0],
[1,1,0,1,0,0,1] ]
new_version = [[] for _ in test]  # create an empty list of rows to append data to
for count, item in enumerate(test[0]):  # iterate over column indices (assuming all rows have equal length)
    frequency = Counter([x[count] for x in test])  # get frequency count for the column
    for count_inside, item_inside in enumerate(test):
        # add the values depending on their frequency distribution in the column:
        # the first frequency[0] entries of the column become 0, the rest become 1
        value = 0 if 0 in frequency and count_inside < frequency[0] else 1
        new_version[count_inside].append(value)
print(new_version)

Can you extract indexes of data over a threshold from numpy array or pandas dataframe

I am using the following to compare several strings to each other. It's the fastest method I've been able to devise, but it results in a very large 2D array, which I can look through to find what I want. Ideally, I would like to set a threshold and pull the index(es) for each value over that number. To make matters more complicated, I don't want the index comparing the string to itself, and it's possible the string might be duplicated elsewhere, so I would want to know if that's the case, so I can't just ignore 1's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
texts = sql.get_corpus()
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
similarity = cosine_similarity(vectors)
sql.get_corpus() returns a list of strings, currently 1600ish strings.
Is what I want possible? I've tried comparing each of the 1.4M combinations to each other using Levenshtein, which works, but it takes 2.5 hours vs half above. I've also tried vectors with spaCy, which takes days.
I'm not entirely sure I read your post correctly, but I believe this should get you started:
import numpy as np
# randomly distributed data we want to filter
data = np.random.rand(5, 5)
# get index of all values above a threshold
threshold = 0.5
above_threshold = data > threshold
# I am assuming your matrix has all string comparisons to
# itself on the diagonal
not_ident = np.identity(5) == 0.
# [edit: to prevent duplicate comparisons, use this instead of not_ident]
#upper_only = np.triu(np.ones((5,5)) - np.identity(5))
# 2D array, True when criteria met
result = above_threshold * not_ident
print(result)
# original shape, but 0 in place of all values not matching above criteria
values_orig_shape = data * result
print(values_orig_shape)
# all values that meet criteria, as a 1D array
values = data[result]
print(values)
# indices of all values that meet criteria (in same order as values array)
indices = [index for index,value in np.ndenumerate(result) if value]
print(indices)
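As a side note, np.argwhere gives the same (row, column) index pairs as the comprehension above, but fully vectorized:
indices_arr = np.argwhere(result)  # one (row, col) pair per True entry in result
print(indices_arr)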

Python: Only exponentiate array if values is not equal to zero

I have the below function in which I run several regressions. Some estimated coefficients are outputted as '0s' and naturally when they're exponentiated they turn into '1s'.
Ideally, I would have sm.OLS() output 'blanks' rather than 'zeros' in those cases where the estimated coefficient is zero. But I've tried and this doesn't seem possible.
So, alternatively, I would prefer to keep zeros rather than 1s. This would require not exponentiating the zeros in this line of the code: exp_coefficients=np.exp(results.params)
How could I do this?
import numpy as np
import pandas as pd
import statsmodels.api as sm

df_index = []
coef_mtr = []  # start with an empty list
for x in df_main.project_x.unique():
    df_holder = df_main[df_main.project_x == x]
    X = df_holder.drop(['unneeded1', 'unneeded2', 'unneeded3'], axis=1)
    X['constant'] = 1
    Y = df_holder['sales']
    eq = sm.OLS(Y, X)
    results = eq.fit()
    exp_coefficients = np.exp(results.params)
    # print(exp_coefficients)
    coef_mtr.append(exp_coefficients)
    df_index.append(x)
coef_mtr = np.array(coef_mtr)
# create a dataframe with this data
df_columns = [f'coef_{n}' for n in range(coef_mtr.shape[1])]
df_matrix = pd.DataFrame(data=coef_mtr, index=df_index, columns=df_columns)
The cleanest would probably be using the where keyword (not the function) as in
out = np.exp(in_,where=in_!=0)
This will skip all zero values. But skip really means skip: the corresponding values in out are left uninitialized, so we need to preset them to zero:
out = np.zeros_like(in_)
np.exp(in_,where=in_!=0,out=out)
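A small self-contained illustration (the in_ values here are just made-up stand-ins for results.params):
import numpy as np

in_ = np.array([0.0, 0.5, -1.2, 0.0, 2.0])  # hypothetical coefficients, including zeros
out = np.zeros_like(in_)
np.exp(in_, where=in_ != 0, out=out)
print(out)  # zeros stay zero; the other entries are exponentiated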

fastest way to get max value of each masked np.array for many masks?

I have two numpy arrays of the same shape. One contains information that I am interested in, and the other contains a bunch of integers that can be used as mask values.
In essence, I want to loop through each unique integer to get its mask for the array, then filter the main array using this mask and find the max value of the filtered array.
For simplicity, lets say the arrays are:
arr1 = np.random.rand(10000,10000)
arr2 = np.random.randint(low=0, high=1000, size=(10000,10000))
right now I'm doing this:
maxes = {}
ids = np.unique(arr2)
for id in ids:
    max_val = arr1[np.equal(arr2, id)].max()
    maxes[id] = max_val
My arrays are a lot bigger and this is painfully slow; I am struggling to find a quicker way of doing this. Maybe there's some kind of creative method I'm not aware of; I would really appreciate any help.
EDIT
let's say the majority of arr2 is actually 0 and I don't care about the 0 id; is it possible to speed it up by dropping this entire chunk from the search?
i.e.
arr2[:, 0:4000] = 0
and just return the maxes for ids > 0?
Much appreciated.
Generic bin-based reduction strategies
Listed below are a few approaches to tackle such scenarios where we need to perform bin-based reduction operations. Essentially, we are given two arrays; we use one as the bins and the other for values, and reduce the values within each bin.
Approach #1: One strategy would be to sort arr1 based on arr2. Once we have them both sorted in that same order, we find the group start indices and then, with the appropriate ufunc.reduceat, we do our slice-based reduction operation. That's all there is!
Here's the implementation -
import numpy as np

def binmax(bins, values, reduceat_func):
    ''' Get binned statistic from two 1D arrays '''
    sidx = bins.argsort()
    bins_sorted = bins[sidx]
    # Flag the start of each group of equal bin values
    grpidx = np.flatnonzero(np.r_[True, bins_sorted[:-1] != bins_sorted[1:]])
    max_per_group = reduceat_func(values[sidx], grpidx)
    out = dict(zip(bins_sorted[grpidx], max_per_group))
    return out
out = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.maximum.reduceat)
It's applicable across ufuncs that have their corresponding ufunc.reduceat methods.
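For example, swapping in a different reduceat gives other per-bin statistics with the same helper (a sketch reusing arr1 and arr2 from the question):
# Per-bin sums instead of maxima
sums_out = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.add.reduceat)
# Per-bin minima
mins_out = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.minimum.reduceat)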
Approach #2: We can also leverage scipy.stats.binned_statistic, which is basically a generic utility to do some of the common reduction operations based on binned array values -
from scipy.stats import binned_statistic
def binmax_v2(bins, values, statistic):
    ''' Get binned statistic from two 1D arrays '''
    num_labels = bins.max() + 1
    R = np.arange(num_labels + 1)  # bin edges, one bin per integer label
    Mx = binned_statistic(bins, values, statistic=statistic, bins=R)[0]
    idx = np.flatnonzero(~np.isnan(Mx))  # labels that actually occur
    out = dict(zip(idx, Mx[idx]))
    return out
out = binmax_v2(arr2.ravel(), arr1.ravel(), statistic='max')
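Either helper returns a plain dict keyed by the ids actually present in arr2, so lookups are straightforward (a small illustration, assuming the arr1/arr2 from the question):
print(out[0])    # max of arr1 where arr2 == 0
print(len(out))  # number of distinct ids present in arr2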

Thresholding a python list with multiple values

Okay, so I have a 1000x100 array with random numbers. I want to threshold this array with a list of multiple numbers; these numbers go from 3 to 9. If they are higher than the threshold, I want the sum of the row appended to a list.
I have tried many ways, including a triple-nested for loop with a conditional. Right now, I have found a way to compare an array to a list of numbers, but each time that happens I get new random numbers for the array again.
import numpy as np

xpatient=5
sd_healthy=2
xhealthy=7
sd_patient=2
thresholdvalue1=(xpatient-sd_healthy)*10
thresholdvalue2=(((xhealthy+sd_patient))*10)
thresholdlist=[]
x1=[]
Ahealthy=np.random.randint(10,size=(1000,100))
Apatient=np.random.randint(10,size=(1000,100))
TParray=np.random.randint(10,size=(1,61))
def thresholding(A,B):
    for i in range(A,B):
        thresholdlist.append(i)
        i+=1
thresholding(thresholdvalue1,thresholdvalue2+1)
thresholdarray=np.asarray(thresholdlist)
thedivisor=10
newthreshold=(thresholdarray/thedivisor)
for x in range(61):
    Apatient=np.random.randint(10,size=(1000,100))
    Apatient=[Apatient>=newthreshold[x]]*Apatient
    x1.append([sum(x) for x in zip(*Apatient)])
So, my for loop has a random integer array generated inside it, but if I don't do that, I don't get to see each threshold applied in turn. I want the threshold for the whole array to be 3, 3.1, 3.2, etc.
I hope I got my point across. Thanks in advance.
You can solve your problem using this approach:
import numpy as np
def get_sums_by_threshold(data, threshold, axis): # use axis=0 to sum values along rows, axis=1 - along columns
    result = list(np.where(data >= threshold, data, 0).sum(axis=axis))
    return result
xpatient=5
sd_healthy=2
xhealthy=7
sd_patient=2
thresholdvalue1=(xpatient-sd_healthy)*10
thresholdvalue2=(((xhealthy+sd_patient))*10)
np.random.seed(100) # to keep generated array reproducable
data = np.random.randint(10,size=(1000,100))
thresholds = [num / 10.0 for num in range(thresholdvalue1, thresholdvalue2+1)]
sums = list(map(lambda x: get_sums_by_threshold(data, x, axis=0), thresholds))
But you should know that your initial array includes only integer values, so you will get the same result for multiple thresholds that share the same integer part (e.g. 3.0, 3.1, 3.2, ..., 3.9). If you want to store float numbers from 0 to 9 in your initial array with the specified shape, you can do the following:
data = np.random.randint(90,size=(1000,100)) / 10.0
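sums then holds one list per threshold, in the same order as thresholds, so the two can be inspected side by side (a small illustration):
for t, s in zip(thresholds, sums):
    print(t, s[:5])  # threshold and the sums of the first five columns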
