I'm trying to compute cosine similarity between two sets of data (of unequal lengths).
The test set contains 4 random, similar images from Google.
The training set contains 1 image from Google that is similar to the test set.
Below is the code I'm using, which converts each image to a vector and calculates cosine similarity:
import os
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
from img_to_vec import Img2Vec
import numpy as np

test_path = '/Users/Desktop/img_vec/test'
train_path = '/Users/Desktop/img_vec/train'

print("Getting vectors for test images...\n")
img2vec = Img2Vec()

# For each test image, we store the filename and vector as key, value in a dictionary
pics = {}
for file in os.listdir(test_path):
    filename = os.fsdecode(file)
    img = Image.open(os.path.join(test_path, filename))
    vec = img2vec.get_vec(img)
    pics[filename] = vec
# print(pics)

pic_name = {}
for file1 in os.listdir(train_path):
    filename1 = os.fsdecode(file1)
    img1 = Image.open(os.path.join(train_path, filename1))
    vec1 = img2vec.get_vec(img1)
    pic_name[filename1] = vec1
# print(pic_name)

vec1 = np.array([pics])
vec2 = np.array([pic_name])

sims = {}
for key in list(pics.keys()):
    print(key)
    sims[key] = cosine_similarity(vec1[vec2].reshape((1, -1)), vec1[key].reshape((1, -1)))[0][0]

d_view = [(v, k) for k, v in sims.items()]
d_view.sort(reverse=True)
for v, k in d_view:
    print(v, k)
However, I'm unable to resolve the following error:
sims[key] = cosine_similarity(vec1[vec2].reshape((1, -1)), vec1[key].reshape((1, -1)))[0][0]
IndexError: arrays used as indices must be of integer (or boolean) type
I also tried computing cosine similarity in Python manually (using NumPy) and with the library, as a sanity check, but it doesn't help with my case above. I believe it's an issue with the dtype.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# vectors
a = np.array([1,2,3])
b = np.array([1,1,4])
# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
# use library, operates on sets of vectors
aa = a.reshape(1,3)
ba = b.reshape(1,3)
cos_lib = cosine_similarity(aa, ba)
Any help / guidance / alternative is much appreciated.
vec1 = np.array([pics])
vec2 = np.array([pic_name])
I don't see the need to do this.
Also, in the line where the error occurs, the problem is here:
vec1[vec2].reshape((1, -1))
because you're indexing vec1 with vec2. I suppose you meant to use key instead of vec2.
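As a sketch of what I mean (assuming pics holds the test vectors and pic_name holds the single training vector, as in your code), you can compare the dictionaries directly and skip the np.array([...]) wrapping entirely:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sims = {}
# pic_name holds the single training image's vector
train_vec = list(pic_name.values())[0]
for key, test_vec in pics.items():
    # compare each test vector against the training vector
    sims[key] = cosine_similarity(
        np.asarray(train_vec).reshape(1, -1),
        np.asarray(test_vec).reshape(1, -1),
    )[0][0]

d_view = sorted(((v, k) for k, v in sims.items()), reverse=True)
for v, k in d_view:
    print(v, k)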
Related
I am trying to convert coordinates from one system to another using the affine transformation method in Python. The code seems to run, but the output looks strange. Here is my code:
import json
import numpy as np

labels = open('labels3.txt')
read_ready = labels.read()
JSONfile = json.loads(read_ready)

# input_sys should be the ground-plane (georeferenced) image coordinates.
# Note: assuming the points aren't rotated; rotation may be introduced at the very end.
input_sys = np.array([[0,0,0],[0,23133,0],[35093,23133,0],[35093,0,0]])

# output_sys should be the slant-range coords; these aren't too special and should be straightforward.
output_sys = np.array([[0,0,0],[0,16384,0],[16384,16384,0],[16384,0,0]])

# The point to be transformed
p = np.array([7954,9338,0])

# Finding the transformation
l = len(input_sys)
entry = lambda r, d: np.linalg.det(np.delete(np.vstack([r, input_sys.T, np.ones(l)]), d, axis=0))
M = np.array([[(-1)**i * entry(R, i) for R in output_sys.T] for i in range(l+1)])
A, t = np.hsplit(M[1:].T / (-M[0])[:, None], [l-1])
t = np.transpose(t)[0]

# Output the transform
print("Affine transformation matrix:\n", A)
print("Affine transformation translation vector:\n", t)

# print("Testing:")
# for p, P in zip(np.array(input_sys), np.array(output_sys)):
#     image_p = np.dot(A, p) + t
#     result = "[OK]" if np.allclose(image_p, P) else "[ERROR]"
#     print(p, " mapped to: ", image_p, " ; expected: ", P, result)

# print('Calculation:')
P = np.dot(A, p) + t
print(p, 'mapped to:', P)
The output for the matrix M is a zero matrix, which is not correct. Can someone help me find a way to get a nonzero M?
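One likely cause, as an assumption from the data shown: every point in input_sys has z = 0, so the matrix stacked inside entry contains an all-zero row and (almost) every determinant vanishes; in particular M[0] is zero, so the division by (-M[0]) breaks everything downstream. A minimal sketch of the same determinant construction in 2D, after dropping the degenerate z coordinate (a 2D affine map is determined by 3 point correspondences):

import numpy as np

# 2D correspondences (z column dropped); 3 points fix a 2D affine map
input_sys = np.array([[0, 0], [0, 23133], [35093, 23133]])
output_sys = np.array([[0, 0], [0, 16384], [16384, 16384]])
p = np.array([7954, 9338])

l = len(input_sys)  # 3 correspondences
# Same Cramer-style construction as above, now on full-rank data
entry = lambda r, d: np.linalg.det(
    np.delete(np.vstack([r, input_sys.T, np.ones(l)]), d, axis=0))
M = np.array([[(-1)**i * entry(R, i) for R in output_sys.T]
              for i in range(l + 1)])
A, t = np.hsplit(M[1:].T / (-M[0])[:, None], [l - 1])
t = t.ravel()

print("A:\n", A)
print("t:", t)
print(p, 'mapped to:', np.dot(A, p) + t)

With these corner points the result is a diagonal scaling (16384/35093 in x, 16384/23133 in y) with zero translation, which is a useful sanity check.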
I'm trying to design a Gaussian notch filter in Python to remove periodic noise. I tried implementing the following formula:
(formula image: Gaussian notch filter)
And here is the code:
import math
import numpy as np

def gaussian_bandpass_filter(image):
    image_array = np.array(image)

    # Fourier transform
    fourier_transform = np.fft.fftshift(np.fft.fft2(image_array))

    # Size of the image
    m = np.shape(fourier_transform)[0]
    n = np.shape(fourier_transform)[1]
    u = np.arange(m)
    v = np.arange(n)

    # Find the center
    u0 = int(m / 2)
    v0 = int(n / 2)

    # Bandwidth
    D0 = 10

    gaussian_filter = np.zeros(np.shape(fourier_transform))
    for x in u:
        for y in v:
            D1 = math.sqrt((x - m/2 - u0)**2 + (y - n/2 - v0)**2)
            D2 = math.sqrt((x - m/2 + u0)**2 + (y - n/2 + v0)**2)
            gaussian_filter[x][y] = 1 - math.exp(-0.5 * D1*D2 / (D0**2))

    # Apply the filter
    fourier_transform = fourier_transform + gaussian_filter
    image_array = np.fft.ifft2(np.fft.ifftshift(fourier_transform))
    return image_array
This function is supposed to apply the Gaussian notch filter to an image and return the filtered result, but it doesn't seem to work. I don't know where I went wrong (maybe I misunderstood the formula?), so if anyone could help me I would really appreciate it.
Edit:
As an example, here is a noisy image.
Using the existing gaussian_filter function from the scipy.ndimage library, I get this result, which is acceptable.
But my function returns this. (I'm using the PIL.Image.fromarray function to convert the array back to an image.)
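For comparison, here is a sketch of how a Gaussian notch-reject filter is usually applied, under two assumptions about the intended formula: (u0, v0) is the notch offset from the spectrum center rather than the center itself, and the filter multiplies the spectrum instead of being added to it:

import numpy as np

def gaussian_notch_reject(image, u0, v0, D0=10):
    """Sketch: multiplicative Gaussian notch-reject filter.
    (u0, v0) is the notch offset from the spectrum center."""
    f = np.fft.fftshift(np.fft.fft2(np.asarray(image, dtype=float)))
    m, n = f.shape
    x, y = np.meshgrid(np.arange(m), np.arange(n), indexing='ij')
    # Distances to the two symmetric notch centers
    D1 = np.sqrt((x - m/2 - u0)**2 + (y - n/2 - v0)**2)
    D2 = np.sqrt((x - m/2 + u0)**2 + (y - n/2 + v0)**2)
    H = 1 - np.exp(-0.5 * D1 * D2 / D0**2)
    # Multiply (not add) the spectrum by the filter
    filtered = f * H
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))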
I am trying to compute the cosine distance between all pairs of rows of a large matrix (3M x 2048) and extract the top-30 most similar vectors using PyTorch.
The following is my code. It works fine, but it takes about 30 seconds per iteration, which is far too slow for 3 million word vectors.
Any ideas to speed it up?
import torch.nn.functional as F
import torch
from tqdm import tqdm
import gc

sym_dict = {}
tmp_list = []

tot_dict = torch.load('xbx.pt')
all_tensors = torch.cat([v.unsqueeze(0) for k, v in tot_dict.items()], dim=0)
token_list = [i for i in tot_dict.keys()]
del tot_dict
gc.collect()

for counter, value in tqdm(enumerate(token_list)):
    uniq_vec = torch.unsqueeze(all_tensors[counter], dim=0)
    dist = 1 - F.cosine_similarity(uniq_vec, all_tensors)
    index_sorted = torch.argsort(dist)
    roll_me = index_sorted[:30].cpu().numpy().tolist()
    for ind in roll_me:
        tmp_list.append(token_list[ind])
    sym_dict.update({value: tmp_list})
    tmp_list = []

# save .pt file
torch.save(sym_dict, 'sym_dict.pt')
Would directly finding pairwise distances between the two matrices work? Here's the code:
import torch

def pairwise_dist(x, y, p=2, eps=1e-6):
    x_a = x[..., None, :, :]
    y_a = y[..., None, :]
    dist = torch.pow(torch.abs((x_a - y_a) + eps), p).sum(dim=-1, keepdim=True).squeeze(2)
    return torch.pow(dist, 1 / p)

t1 = torch.rand(3, 10)
t2 = torch.rand(4, 10)
dist = pairwise_dist(t1, t2, eps=0)
print(dist)
dist is of shape 4 x 3, where each row holds the distances from one vector of t2 to all vectors of t1.
Note that the pairwise distance between two vectors here is exactly equivalent to PyTorch's F.pairwise_distance.
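Since the question is specifically about cosine distance, here is a sketch of an approach that is usually much faster, under the assumption that the matrix fits in memory (otherwise process it in chunks, as below): L2-normalize the rows once so that cosine similarity becomes a single matrix multiplication, then take torch.topk per chunk instead of a full argsort:

import torch
import torch.nn.functional as F

def top_k_cosine(all_tensors, k=30, chunk=1024):
    """Sketch: top-k cosine neighbours via normalized matmul."""
    normed = F.normalize(all_tensors, dim=1)  # rows now have unit length
    results = []
    for start in range(0, normed.size(0), chunk):
        block = normed[start:start + chunk]
        sims = block @ normed.T              # cosine similarities, (chunk, N)
        top = torch.topk(sims, k, dim=1)     # indices of the k most similar
        results.append(top.indices)
    return torch.cat(results, dim=0)

# usage sketch: map indices back to tokens
# idx = top_k_cosine(all_tensors)
# sym_dict = {tok: [token_list[j] for j in row.tolist()]
#             for tok, row in zip(token_list, idx)}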
This is a program for face recognition using PCA. Everything works except for an IndexError that comes up near the end of the program.
When I run the code, I get the error at the fourth-to-last line:
distances.append((dist, y[i]))
IndexError: list index out of range
Can anyone help with this? I am a newbie to Python, so I am not an expert at debugging.
Here is my code :
from sklearn.decomposition import RandomizedPCA
import numpy as np
import glob
import cv2
import math
import os.path
import string

# function to get ID from filename
def ID_from_filename(filename):
    part = string.split(filename, '/')
    return part[1].replace("s", "")

# function to convert image to right format
def prepare_image(filename):
    img_color = cv2.imread(filename)
    img_gray = cv2.cvtColor(img_color, cv2.cv.CV_RGB2GRAY)
    img_gray = cv2.equalizeHist(img_gray)
    return img_gray.flat

IMG_RES = 92 * 112      # image resolution
NUM_EIGENFACES = 10     # images per train person
NUM_TRAINIMAGES = 110   # total images in training set

# loading training set from folder train_faces
folders = glob.glob('train_faces/*')

# Create an array with flattened images X
# and an array with the ID of the person on each image y
X = np.zeros([NUM_TRAINIMAGES, IMG_RES], dtype='int8')
y = []

# Populate training array with flattened images from subfolders of
# train_faces, and y with the names
c = 0
for x, folder in enumerate(folders):
    train_faces = glob.glob(folder + '/*')
    for i, face in enumerate(train_faces):
        X[c,:] = prepare_image(face)
        y.append(ID_from_filename(face))
        c = c + 1

# perform principal component analysis on the images
pca = RandomizedPCA(n_components=NUM_EIGENFACES, whiten=True).fit(X)
X_pca = pca.transform(X)

# load test faces (usually one), located in folder test_faces
test_faces = glob.glob('test_faces/*')

# Create an array with flattened images X
X = np.zeros([len(test_faces), IMG_RES], dtype='int8')

# Populate test array with flattened images from test_faces
for i, face in enumerate(test_faces):
    X[i,:] = prepare_image(face)

# run through test images (usually one)
for j, ref_pca in enumerate(pca.transform(X)):
    distances = []
    # Calculate Euclidean distance from the test image to each of the
    # known images and save the distances
    for i, test_pca in enumerate(X_pca):
        dist = math.sqrt(sum([diff**2 for diff in (ref_pca - test_pca)]))
        distances.append((dist, y[i]))

    found_ID = min(distances)[1]
    print "Identified (result: " + str(found_ID) + " - dist - " + str(min(distances)[0]) + ")"
Your i in the loop below goes up to len(X_pca) - 1:

for i, test_pca in enumerate(X_pca):
    dist = math.sqrt(sum([diff**2 for diff in (ref_pca - test_pca)]))
    distances.append((dist, y[i]))

However, your y is not necessarily built to have that length:

for x, folder in enumerate(folders):
    train_faces = glob.glob(folder + '/*')
    for i, face in enumerate(train_faces):
        X[c,:] = prepare_image(face)
        y.append(ID_from_filename(face))

So you end up using an index i that exceeds the bounds of your list y.
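A sketch of one way to keep X and y in sync, assuming the folder layout from the question (IMG_RES, prepare_image and ID_from_filename as defined there): collect all the training file paths first and size the array from that list, so X_pca and y always end up the same length:

import glob
import numpy as np

# Collect every training file first, then size X to match
train_files = [face for folder in glob.glob('train_faces/*')
               for face in glob.glob(folder + '/*')]

X = np.zeros([len(train_files), IMG_RES], dtype='int8')
y = []
for c, face in enumerate(train_files):
    X[c, :] = prepare_image(face)       # one row per actual image
    y.append(ID_from_filename(face))    # one label per row, same length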
I am trying to write a script that calculates the similarity of a few documents. I want to do it using LSA. I found the following code and changed it a bit. It takes 3 documents as input and outputs a 3x3 matrix with the similarities between them. I want to do the same similarity calculation but using only the sklearn library. Is that possible?
from numpy import zeros
from scipy.linalg import svd
from math import log
from numpy import asarray, sum
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

titles = [doc1, doc2, doc3]
ignorechars = ''',:'!'''

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords.words('english')
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0

    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower()
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
        return -1 * self.Vt

    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i, j] = (self.A[i, j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
a = mylsa.calc()
cosine_similarity(a)
From @ogrisel's answer:
I ran the following code, but my mouth is still open :) When TF-IDF gives at most 80% similarity on two documents with the same subject, this code gives me 99.99%. That's why I think something is wrong :P
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = [doc1, doc2, doc3]
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(dataset)
lsa = TruncatedSVD()
X = lsa.fit_transform(X)
X = Normalizer(copy=False).fit_transform(X)
cosine_similarity(X)
You can use the TruncatedSVD transformer from sklearn 0.14+: call fit_transform on your database of documents, then call transform (on the same TruncatedSVD instance) on the query document. You can then compute the cosine similarity of the transformed query with the transformed database using sklearn.metrics.pairwise.cosine_similarity, and numpy.argsort the result to find the index of the most similar document.
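A minimal sketch of that pipeline, with made-up example strings (and note that TruncatedSVD defaults to n_components=2, which can make unrelated documents look nearly identical in the reduced space and would explain the 99.99% above):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the cat sat on the mat",
             "dogs and cats make good pets",
             "stock markets fell sharply today"]
query = ["my cat sleeps on a mat"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# n_components defaults to 2; use many more for a realistic corpus
lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)                            # fit on the database
query_lsa = lsa.transform(vectorizer.transform(query))  # reuse both fitted transformers

sims = cosine_similarity(query_lsa, X_lsa)[0]
print(np.argsort(sims)[::-1])  # indices, most similar document first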
Note that under the hood, scikit-learn also uses NumPy but in a more efficient way than the snippet you gave (by using the Randomized SVD trick by Halko, Martinsson and Tropp).