I have to scale a matrix to [0, 1]. For each element of the matrix I have to apply this formula:
(Element - min_cols) / (max_cols - min_cols)
min_cols -> array holding the minimum of each column of the matrix. max_cols -> the same, but with the maxima.
My problem is, I want to calculate the result with this:
result = (Element- min_cols) / (max_cols - min_cols)
In other words, for each element of the matrix I take the difference between that element and the minimum of the element's column, and divide it by the difference between that column's maximum and its minimum.
But when, for example, the value from min_cols is negative and the value from max_cols is also negative, the subtraction turns into a sum of the two.
I want to specify that the matrix is: _mat = np.random.randn(1000, 1000) * 50
Use numpy
Example
import numpy as np
x = 50*np.random.rand(6,4)
array([[26.7041017 , 46.88118463, 41.24541748, 31.17881807],
[47.57036124, 16.49040094, 6.62454156, 37.15976348],
[46.7157895 , 8.53357717, 39.01399714, 5.14287858],
[24.36012016, 5.67603151, 40.7697121 , 13.09877845],
[21.69045322, 12.61989002, 8.74692768, 46.23368735],
[ 3.9058066 , 35.50845507, 4.66785679, 2.34177134]])
Apply your formula
np.divide(np.subtract(x, x.min(axis=0)), x.max(axis=0)-x.min(axis=0))
array([[0.52212361, 1. , 1. , 0.65700132],
[1. , 0.26245187, 0.05349413, 0.79326663],
[0.98042871, 0.06934923, 0.93899483, 0.06381829],
[0.46844205, 0. , 0.98699461, 0.24507946],
[0.40730168, 0.16851918, 0.1115184 , 1. ],
[0. , 0.7239974 , 0. , 0. ]])
The max value of each column is mapped to 1, the min value of each column is mapped to 0, and the intermediate values are linearly mapped between 0 and 1.
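To address the worry about negative values: subtracting a negative minimum does turn into an addition, but that is exactly what the formula calls for, and the output still lands in [0, 1]. A minimal sketch on the matrix described in the question (np.random.randn scaled by 50, so it definitely contains negatives):

import numpy as np

_mat = np.random.randn(1000, 1000) * 50  # contains plenty of negative values
min_cols = _mat.min(axis=0)              # per-column minima (negative here)
max_cols = _mat.max(axis=0)              # per-column maxima

result = (_mat - min_cols) / (max_cols - min_cols)

print(result.min(), result.max())        # 0.0 and 1.0, as expected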
Related
I have a dataframe consisting of individual tweets (id, text, author_id, nn_list), where nn_list is a list of other tweet indices that were previously identified as potential nearest neighbours. Now I have to calculate the cosine similarity between each tweet and every single entry of its list by looking up the indices in the tfidf matrix and comparing the vectors, but my current approach is rather slow. The current code looks something like this:
from sklearn.metrics.pairwise import pairwise_distances

for index, row in data_df.iterrows():
    nn_distance = float("inf")  # reset the best distance for each tweet
    for candidate in row["nn_list"]:
        candidate_cos = float("%.2f" % pairwise_distances(
            tfidf_matrix[candidate], tfidf_matrix[index], metric='cosine'))
        if candidate_cos < nn_distance:
            current_nn_candidate = candidate
            nn_distance = candidate_cos
Is there a significantly faster way to calculate this?
The following code should work, assuming the range of IDs is not too large:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame({"nn_list": [[1, 2], [1,2,3], [1,2,3,7], [11, 12, 13], [2,1]]})
# Data consistent with https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
df["data"] = df["nn_list"].apply(lambda x: np.repeat(1, len(x)))
df["row"] = df.index
df["row_ind"] = df[['row', 'nn_list']].apply(lambda x: np.repeat(x[0], len(x[1])), axis=1)
df["col_ind"] = df['nn_list'].apply(lambda x: np.array(x))
m = csr_matrix(
    (np.concatenate(df['data']),
     (np.concatenate(df['row_ind']), np.concatenate(df['col_ind']))))
cosine_similarity(m)
Will return:
array([[1. , 0.81649658, 0.70710678, 0. , 1. ],
[0.81649658, 1. , 0.8660254 , 0. , 0.81649658],
[0.70710678, 0.8660254 , 1. , 0. , 0.70710678],
[0. , 0. , 0. , 1. , 0. ],
[1. , 0.81649658, 0.70710678, 0. , 1. ]])
If you have a larger range of IDs, I recommend using Spark or having a look at cosine similarity on large sparse matrices with numpy.
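For moderate numbers of tweets, the precomputed similarity matrix can also replace the original iterrows loop directly. A rough sketch, reusing tfidf_matrix and data_df from the question (note the full similarity matrix is dense, so this assumes it fits in memory):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(tfidf_matrix)  # one (n, n) pass instead of a double loop

for index, row in data_df.iterrows():
    candidates = np.array(row["nn_list"])
    # Cosine distance = 1 - cosine similarity, so the nearest
    # neighbour is the candidate with the highest similarity.
    current_nn_candidate = candidates[np.argmax(sims[index, candidates])]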
I have a text file that has some values of a matrix, but it only has half of them, like this:
1. 1. 0.01
2. 1. 0.052145
2. 2. 0.045
3. 1. 0.054521
3. 2. 0.05424
3. 3. 0.05459898
The first two columns give the matrix (x, y) position and the last one the value it holds. The positions are 1-based, so the actual array indices are those values minus 1.
I made a function that reads the file and mirrors these values to a full matrix:
def expand_mirror_matrix(matrix_path='data.txt'):
    data = np.loadtxt(matrix_path)
    shape = (int(data[-1][0]), int(data[-1][1]))
    m = np.zeros(shape)
    for d in data:
        x, y, z = int(d[0]), int(d[1]), d[2]
        m[x-1, y-1] = z
        m[shape[0]-x, shape[1]-y] = z
    return m
But it has some unnecessary iterations, like the first and the last, and the one that overwrites the value at the center of the matrix.
Is there a way of optimizing it? The file actually has thousands of lines, so it would be great to cut down the loop's execution time.
I believe this does what you want, at least without the mirroring:
def expand_mirror_matrix(matrix_path='data.txt'):
    data = np.loadtxt(matrix_path)
    shape = (int(data[-1][0]), int(data[-1][1]))
    xs = data[:,0].astype(int) - 1  # Numpy uses zero-based indexing.
    ys = data[:,1].astype(int) - 1
    m = np.zeros(shape)
    m[(xs, ys)] = data[:,2]
    return m
For your example file above this returns:
array([[0.01 , 0. , 0. ],
[0.052145 , 0.045 , 0. ],
[0.054521 , 0.05424 , 0.05459898]])
If you wish to mirror it you probably want to edit the above function with the following:
m[(xs, ys)] = data[:,2]
m[(ys, xs)] = data[:,2] # Mirrored.
The result of that is:
array([[0.01 , 0.052145 , 0.054521 ],
[0.052145 , 0.045 , 0.05424 ],
[0.054521 , 0.05424 , 0.05459898]])
Note that this assumes the matrix is square.
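For reference, a sketch of the full function with the mirroring folded in, under the same assumptions (1-based indices in the file, square matrix):

import numpy as np

def expand_mirror_matrix(matrix_path='data.txt'):
    data = np.loadtxt(matrix_path)
    n = int(data[-1][0])             # square, so the last line holds the size
    xs = data[:, 0].astype(int) - 1  # zero-based row indices
    ys = data[:, 1].astype(int) - 1  # zero-based column indices
    m = np.zeros((n, n))
    m[xs, ys] = data[:, 2]
    m[ys, xs] = data[:, 2]           # mirror across the diagonal
    return m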
I'm trying to find a minimum for a function which takes a 5x4 matrix shown below.
def objective(w):
    ...
    return value

# Random initial input matrix where rows sum up to 1.0
w = np.array(
    [[0.33333333, 0.        , 0.2       , 0.46666667],
     [0.07142857, 0.42857143, 0.5       , 0.        ],
     [0.        , 0.42857143, 0.        , 0.57142857],
     [0.31034483, 0.27586207, 0.27586207, 0.13793103],
     [0.27272727, 0.22727273, 0.22727273, 0.27272727]])
The only constraint is that each row must sum to 1.0.
I've tried
1) scipy.optimize.fmin(objective, w) - which gives me back a converging result, but is incorrect because I'm not sure how to apply the constraint.
2) scipy.optimize.minimize(objective, w) - which isn't changing the initial matrix.
Any suggestions of what I can look at?
Thanks in advance.
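One direction to look at, sketched under stated assumptions since the real objective isn't shown: scipy.optimize.minimize supports equality constraints with method='SLSQP'. minimize works on a flat vector, so reshape inside the objective and express the row sums as a constraint (objective_flat below is a placeholder):

import numpy as np
from scipy.optimize import minimize

def objective_flat(w_flat):
    w = w_flat.reshape(5, 4)
    return np.sum((w - 0.25) ** 2)  # placeholder objective for the sketch

# Equality constraint: every row of the 5x4 matrix must sum to 1.0.
row_sums_to_one = {
    'type': 'eq',
    'fun': lambda w_flat: w_flat.reshape(5, 4).sum(axis=1) - 1.0,
}

w0 = np.full((5, 4), 0.25)          # feasible start: rows already sum to 1.0
res = minimize(objective_flat, w0.ravel(), method='SLSQP',
               constraints=[row_sums_to_one],
               bounds=[(0.0, 1.0)] * w0.size)
w_opt = res.x.reshape(5, 4)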
I am calculating the difference between every pair of elements in a numpy array. My code is
import numpy as np
M = 10
x = np.random.uniform(0,1,M)
y = np.array([x])
# Calculate the difference
z = np.array(y[:,None]-y)
When I run my code I get [[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]. I don't get a 10 by 10 array.
Where do I go wrong?
You should read the broadcasting rules for numpy. One way:
y.T - x
Another way:
np.subtract.outer(x, x)
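For reference, a minimal sketch of the broadcasting at work on x directly, without the extra y = np.array([x]) wrapper (that wrapper gives y the shape (1, 10), which is why the original expression collapses to zeros):

import numpy as np

M = 10
x = np.random.uniform(0, 1, M)  # shape (M,)

# (M, 1) minus (1, M) broadcasts to an (M, M) matrix of pairwise differences.
z = x[:, None] - x[None, :]
print(z.shape)                  # (10, 10)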
You are not getting a 10 by 10 array because the value of M is 10. Try:
M = (10,10)
I have a numpy array called "PRECIP" with shape (2,3,3) which corresponds to (time, lat, lon)
array([[[ 0.05368402, 0.43843025, 0.09521903],
[ 0.22627141, 0.12920409, 0.17039465],
[ 0.48148674, 0.59170703, 0.41321763]],
[[ 0.63621704, 0.11119242, 0.25992372],
[ 0.67846732, 0.3710733 , 0.25641174],
[ 0.1992151 , 0.86837441, 0.80136514]]])
I have another numpy array called "idx" which is a list of indices, with the shape (3, 4):
array([[0,0,1,1], # time
[0,2,0,2], # x coordinate
[0,2,0,2]]) # y coordinate
So far I have been able to index the "PRECIP" variable with the "idx" variable so that I get an array with the shape (4,), i.e.
>>>accum = PRECIP[idx[0,:],idx[1,:],idx[2,:]]
array([ 0.05368402, 0.41321763, 0.63621704, 0.80136514])
But what I need is a (3, 3) array of zeros "accum", populated with the sum of "PRECIP" for each pair of coordinates in "idx". All other gridpoints not listed in "idx" would stay 0.
Basically I want an array "accum" that looks like this
>>> accum
array([[ 0.68990106, 0.        , 0.        ],  # 0.68990106 = 0.05368402 + 0.63621704
       [ 0.        , 0.        , 0.        ],
       [ 0.        , 0.        , 1.21458277]])  # 1.21458277 = 0.41321763 + 0.80136514
I'd appreciate any help! Thanks :)
If I understand correctly what you need is:
array = [0.5] * 249
It will return an array of length 249 populated with 0.5 at each index. After that you can slice it if it's necessary to retrieve the number of elements you like.
If that is not what you want, you can use a dictionary and add a key that is the tuple you want, this way:
dict = {(40, 249): array}
I hope it helps.
Convert any NaNs in the Lat and Lon columns of PRECIP to zero, then sum them and reshape the result.
np.nan_to_num(PRECIP[idx[1,:], idx[2,:]]).sum(axis=1).reshape(PRECIP.shape[1], PRECIP.shape[2])
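A sketch of an alternative that handles the accumulation explicitly with np.add.at, which sums values in place and, unlike plain fancy-index assignment, accumulates repeated index pairs (names taken from the question):

import numpy as np

accum = np.zeros(PRECIP.shape[1:])              # (lat, lon) grid of zeros
vals = PRECIP[idx[0, :], idx[1, :], idx[2, :]]  # the four selected values
np.add.at(accum, (idx[1, :], idx[2, :]), vals)  # duplicates are summed per gridpoint

This reproduces the desired accum above: 0.05368402 + 0.63621704 at (0, 0) and 0.41321763 + 0.80136514 at (2, 2).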