How to optimize a function for calculating distribution similarity - python

I have a dataframe with different actors' distributions of attention towards different issues. It looks like this:
Social politics & Welfare Technology & IT Business, Finance, & Economy ...
actor_1 0.034483 0.051724 0.017241 ...
actor_2 0.032000 0.016000 0.056000 ...
actor_3 0.012195 0.004065 0.010163 ...
actor_4 0.000000 0.045977 0.022989 ...
actor_5 0.027397 0.006849 0.000000 ...
actor_6 0.128205 0.000000 0.051282 ...
I've created two functions for creating a matrix with the similarity scores between all the different actors.
import math
import numpy as np

def dist_sim(array1, array2):
    array1 = array1 * 100
    array2 = array2 * 100
    distances = array1 - array2
    total_distance = 0
    for distance in distances:
        total_distance += math.sqrt(distance * distance)
    return 100 - total_distance / 2

def dist_sim_matrix(df):
    matrix = []
    for index, row in df.iterrows():
        party_matrix = []
        for index1, row1 in df.iterrows():
            party_matrix.append(dist_sim(row, row1))
        matrix.append(party_matrix)
    return np.array(matrix, int)
They work perfectly fine; however, when I apply them to a large dataframe (e.g. with 2000 different actors and 25 issues) it takes forever (I'm actually not sure I've got enough RAM for it?).
I'm new to the business of writing my own functions, so any help on optimization would be awesome!

Here is what you can do:
import pandas as pd
import numpy as np

# I used a fake dataframe
df = pd.DataFrame(data={'c1': np.random.rand(10),
                        'c2': np.random.rand(10),
                        'c3': np.random.rand(10),
                        'c4': np.random.rand(10)},
                  index=[f'actor_{i}' for i in range(1, 11)])

# Transpose it so each actor becomes a column
df = df.T

# Define the function to compute the similarity
def dist_sim_opt(array1, array2):
    '''
    Use vectorization, the distributive property and numpy functions
    '''
    d = np.sqrt((np.square(array1 - array2)).sum()) * 100
    return 100 - d / 2

# Initialize an empty dataframe
sim_df = pd.DataFrame(columns=list(df), index=list(df))

# Cycle over the dataframe actors - exploit symmetry to halve the number of iterations
for i, c1 in enumerate(list(df)):
    for c2 in list(df)[i:]:
        sim_df.loc[c1, c2] = sim_df.loc[c2, c1] = dist_sim_opt(df[c1], df[c2])
The resulting dataframe is something like
sim_df
actor_1 actor_2 actor_3 ... actor_8 actor_9 actor_10
actor_1 100 67.146 56.3693 ... 74.2303 77.7915 55.0946
actor_2 67.146 100 64.7546 ... 61.9146 72.5428 63.7388
actor_3 56.3693 64.7546 100 ... 57.5318 51.5127 95.3162
actor_4 68.5392 59.2313 75.0851 ... 73.3381 61.7608 74.6694
actor_5 72.671 67.2219 79.2112 ... 64.2796 59.9031 77.3241
actor_6 62.8109 67.1849 87.7293 ... 60.9305 53.3952 83.9605
actor_7 62.0589 63.5562 35.7006 ... 57.5888 61.3989 33.1785
actor_8 74.2303 61.9146 57.5318 ... 100 69.602 55.4216
actor_9 77.7915 72.5428 51.5127 ... 69.602 100 51.4612
actor_10 55.0946 63.7388 95.3162 ... 55.4216 51.4612 100

In this case there is an optimised function in scipy; see the spatial.distance module, specifically the pdist function for computing:
Pairwise distances between observations in n-dimensional space.
In your case you can do:
from scipy.spatial import distance
d = distance.squareform(distance.pdist(df, 'euclidean'))
dd = pd.DataFrame(d, df.index, df.index)
Note that these are "distances", so the distance from an actor to itself is zero. If you really want it to take a maximal value there instead (as in your calculations) you could do:
d *= -50
d += 100
before turning it into a dataframe. Note that I'm doing these calculations in place so that additional copies of a potentially enormous matrix aren't created.
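As an aside (not part of the answer above): the original dist_sim sums absolute differences, which corresponds to pdist's 'cityblock' (Manhattan) metric rather than 'euclidean'. A minimal sketch that reproduces the original scores exactly, assuming df holds one row per actor as in the question:

from scipy.spatial import distance
import pandas as pd

# 'cityblock' sums absolute differences, matching the original dist_sim
d = distance.squareform(distance.pdist(df, 'cityblock'))
d *= -50   # in place: distance -> similarity scale
d += 100
sim_df = pd.DataFrame(d, index=df.index, columns=df.index)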

Related

Minimum distance between coordinates

I have two pandas data frames, each containing a column with a tuple of coordinates (one for schools, the other for houses).
I would like to create a new column in the houses dataframe containing the shortest distance from any of the schools. Below is some code I tried, but without success:
import pandas as pd
import numpy as np
import geopy.distance

schools = pd.DataFrame([(29.775803, -95.56353), (40.060276, -83.004196), (40.70592, -74.010765)])
houses = pd.DataFrame([(41.291989997, -73.087632372), (41.16741635, -73.188437585), (41.038689564, -73.635282641), (40.825542, -96.60775)])

x = 0
minimum_distance = []
for i in t:
    for j in private:
        if geopy.distance.geodesic(i, j).km > x:
            v = geopy.distance.geodesic(i, j).km
            minimum_distance.append(v).km
        else:
            continue
schools['shortest_distance'] = minimum_distance
The houses dataframe should look like this afterwards:
0 1 shortest_distance
0 41.291990 -73.087632 101.332983
1 41.167416 -73.188438 86.153595
2 41.038690 -73.635283 48.656830
3 40.825542 -96.607750 1156.075739
Does anyone have any idea how to perform this? I used a double loop in my code because I don't think there is another way, since each element has to be searched, but I am also wondering how efficient it would be with 2 dataframes of 20000 rows each.
Thank you in advance for your help!
Louis
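The thread stops at the question here. As a hedged sketch (not from the original thread), one straightforward brute-force approach with geopy is to take, for each house, the minimum geodesic distance over all schools:

import pandas as pd
import geopy.distance

schools = pd.DataFrame([(29.775803, -95.56353), (40.060276, -83.004196), (40.70592, -74.010765)])
houses = pd.DataFrame([(41.291989997, -73.087632372), (41.16741635, -73.188437585),
                       (41.038689564, -73.635282641), (40.825542, -96.60775)])

school_coords = list(schools.itertuples(index=False, name=None))  # plain (lat, lon) tuples
house_coords = list(houses.itertuples(index=False, name=None))

# for each house, keep the smallest geodesic distance to any school
houses['shortest_distance'] = [
    min(geopy.distance.geodesic(h, s).km for s in school_coords)
    for h in house_coords
]

For two frames of 20000 rows each this would still be 400 million geodesic calls, so at that scale a vectorized haversine formula or a BallTree with the haversine metric (e.g. from scikit-learn) would be the more realistic route.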

How to speed up pandas dataframe iteration involving 2 different dataframes with a complex condition?

I have a pandas dataframe A of approximately 300000 rows. Each row has a latitude and longitude value.
I also have a second pandas dataframe B of about 10000 rows, which has an ID number, a maximum and minimum latitude, and a maximum and minimum longitude.
For each row in A, I need the ID of the corresponding row in B, such that the latitude and longitude of the row in A is contained within the bounding box represented by the row in B.
So far I have the following:
ID_list = []
for index, row in A.iterrows():
    filtered_B = B.apply(lambda x: x['ID'] if row['latitude'] >= x['min_latitude']
                         and row['latitude'] < x['max_latitude']
                         and row['longitude'] >= x['min_longitude']
                         and row['longitude'] < x['max_longitude']
                         else None, axis=1)
    ID_list.append(B.loc[filtered_B == True]['ID'])
The ID_list variable was created with the intention of adding it as an ID column to A. The greater than or equal to and less than conditions are included so that each row in A has only one ID from B.
The above code technically works, but it completes about 1000 rows per minute, which is just not feasible for such a large dataset.
Any tips would be appreciated, thank you.
Edit: sample dataframes:
A:
location   latitude     longitude
1          -33.81263    151.23691
2          -33.994823   151.161274
3          -33.320154   151.662009
4          -33.99019    151.1567332
B:
ID       min_latitude  max_latitude  min_longitude  max_longitude
9ae8704  -33.815       -33.810       151.234        151.237
2ju1423  -33.555       -33.543       151.948        151.957
3ef4522  -33.321       -33.320       151.655        151.668
0uh0478  -33.996       -33.990       151.152        151.182
expected output:
ID_list = [9ae8704, 0uh0478, 3ef4522, 0uh0478]
I would use geopandas to do this, which makes use of rtree indexing.
import geopandas as gpd
from shapely.geometry import box

a_gdf = gpd.GeoDataFrame(a[['location']],
                         geometry=gpd.points_from_xy(a.longitude, a.latitude))
b_gdf = gpd.GeoDataFrame(
    b[['ID']],
    geometry=[box(*bounds) for _, bounds in b.loc[:, ['min_longitude',
                                                      'min_latitude',
                                                      'max_longitude',
                                                      'max_latitude']].iterrows()])
gpd.sjoin(a_gdf, b_gdf)
Output:
   location                                geometry  index_right       ID
0         1             POINT (151.23691 -33.81263)            0  9ae8704
1         2           POINT (151.161274 -33.994823)            3  0uh0478
3         4  POINT (151.1567332 -33.99019000000001)            3  0uh0478
2         3           POINT (151.662009 -33.320154)            2  3ef4522
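If you then want the flat ID_list from the question, in the row order of a, one possible follow-up on the result above would be:

joined = gpd.sjoin(a_gdf, b_gdf)
ID_list = joined.sort_index()['ID'].tolist()   # sorted back into a's row order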
We can create a multi-interval-index on b and then use a regular loc index into it with tuples from the rows of a. Interval indexes are useful in situations like this, when we have a table of low and high values to bucket a variable into.
from io import StringIO
import pandas as pd
a = pd.read_table(StringIO("""
location latitude longitude
1 -33.81263 151.23691
2 -33.994823 151.161274
3 -33.320154 151.662009
4 -33.99019 151.1567332
"""), sep='\s+')
b = pd.read_table(StringIO("""
ID min_latitude max_latitude min_longitude max_longitude
9ae8704 -33.815 -33.810 151.234 151.237
2ju1423 -33.555 -33.543 151.948 151.957
3ef4522 -33.321 -33.320 151.655 151.668
0uh0478 -33.996 -33.990 151.152 151.182
"""), sep='\s+')
lat_index = pd.IntervalIndex.from_arrays(b['min_latitude'], b['max_latitude'], closed='left')
lon_index = pd.IntervalIndex.from_arrays(b['min_longitude'], b['max_longitude'], closed='left')
index = pd.MultiIndex.from_tuples(list(zip(lat_index, lon_index)), names=['lat_range', 'lon_range'])
b = b.set_index(index)
print(b.loc[list(zip(a.latitude, a.longitude)), 'ID'].tolist())
The above will even handle rows of a that have no corresponding row in b by gracefully filling in those values with nan.
A good option for this might be to perform a cross-product merge and drop the undesirable columns. For example, you might do:
AB_cross = A.merge(
    B,
    how="cross"
)
Now we have a giant dataframe with all the possible pairings, where each ID in B might (or might not, we don't know yet) have a bounding box containing a given point in A. This is fast, but it creates a large dataset in memory, since the cross product has len(A) × len(B) rows, which is enormous at the scale in the question.
Now we need to apply our logic by filtering the dataset accordingly. This is a numpy process (as far as I'm aware), so it's vectorized and very fast! I will also say that it might be easier to use between to make your code a bit more semantic.
Note that below I use .between(inclusive='left') to capture the fact that you want to check whether min_long <= long < max_long (the inclusive inequality is on the left side).
ID_list = AB_cross['ID'].loc[
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive='left') &
    AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive='left')
]
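Since the full cross product may not fit in memory at 300000 × 10000 rows, a possible variant (an assumption, not part of the original answer) is to build it in chunks of A and filter each chunk before concatenating:

import pandas as pd

chunk_size = 10_000   # arbitrary; tune to available memory
pieces = []
for start in range(0, len(A), chunk_size):
    part = A.iloc[start:start + chunk_size].merge(B, how="cross")
    inside = (part['longitude'].between(part['min_longitude'], part['max_longitude'], inclusive='left')
              & part['latitude'].between(part['min_latitude'], part['max_latitude'], inclusive='left'))
    pieces.append(part.loc[inside, ['location', 'ID']])
matched = pd.concat(pieces, ignore_index=True)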
A reasonably fast approach could be to pre-sort points by latitudes and longitudes, then iterate over boxes finding points inside the box by latitude (lat_min < lat < lat_max) and longitude (lon_min < lon < lon_max) separately with np.searchsorted and then intersecting them with np.intersect1d.
For 300K points and 10K non-overlapping boxes in my tests it took less than 10 seconds to run.
Here's an example implementation:
# create `ids` series to be populated with box IDs for each point
ids = pd.Series(np.nan, index=a.index)

# create series with points sorted by lats and lons
lats = a['latitude'].sort_values()
lons = a['longitude'].sort_values()

# iterate over boxes
for bi, r in b.set_index('ID').iterrows():
    # find points inside the box by latitude:
    i1, i2 = np.searchsorted(lats, [r['min_latitude'], r['max_latitude']])
    ix_lat = lats.index[i1:i2]
    # find points inside the box by longitude:
    j1, j2 = np.searchsorted(lons, [r['min_longitude'], r['max_longitude']])
    ix_lon = lons.index[j1:j2]
    # find points inside the box as intersection and set values in ids:
    ix = np.intersect1d(ix_lat, ix_lon)
    ids.loc[ix] = bi

ids.tolist()
Output (on provided sample data):
['9ae8704', '0uh0478', '3ef4522', '0uh0478']
Comparing the runtime of your code with this code:
A = pd.DataFrame.from_dict(
    {'location': {0: 1, 1: 2, 2: 3, 3: 4},
     'latitude': {0: -33.81263, 1: -33.994823, 2: -33.320154, 3: -33.99019},
     'longitude': {0: 151.23691, 1: 151.161274, 2: 151.662009, 3: 151.1567332}}
)
B = pd.DataFrame.from_dict(
    {'ID': {0: '9ae8704', 1: '2ju1423', 2: '3ef4522', 3: '0uh0478'},
     'min_latitude': {0: -33.815, 1: -33.555, 2: -33.321, 3: -33.996},
     'max_latitude': {0: -33.81, 1: -33.543, 2: -33.32, 3: -33.99},
     'min_longitude': {0: 151.234, 1: 151.948, 2: 151.655, 3: 151.152},
     'max_longitude': {0: 151.237, 1: 151.957, 2: 151.668, 3: 151.182}}
)

def func(latitude, longitude):
    # return the ID of the first box in B that contains the point
    for y, x in B.iterrows():
        if (latitude >= x.min_latitude and latitude < x.max_latitude
                and longitude >= x.min_longitude and longitude < x.max_longitude):
            return x['ID']

A.apply(lambda x: func(x.latitude, x.longitude), axis=1).to_list()
the new solution is about 2.33 times faster.
I think Geopandas would be the best solution for making sure all of your edge cases are covered like meridian/equator crossovers and the like - spatial queries are exactly what Geopandas is designed for. It can be a pain to install though.
One naive approach in numpy (assuming that most of the points don't change signs anywhere) would be to calculate each of your clauses as a sign of a difference and then keep only the matches where all the signs match your criteria.
For lots of intensive, repetitive calculations like this, it's usually better to dump it out of pandas into numpy, process and then put it back into pandas.
a_lats = A.latitude.values.reshape(-1,1)
b_min_lats = B.min_latitude.values.reshape(1,-1)
b_max_lats = B.max_latitude.values.reshape(1,-1)
a_lons = A.longitude.values.reshape(-1,1)
b_min_lons = B.min_longitude.values.reshape(1,-1)
b_max_lons = B.max_longitude.values.reshape(1,-1)
north_of_min_lat = np.sign(a_lats - b_min_lats)
south_of_max_lat = np.sign(b_max_lats - a_lats)
west_of_min_lon = np.sign(a_lons - b_min_lons)
east_of_max_lon = np.sign(b_max_lons - a_lons)
margin_matches = (north_of_min_lat + south_of_max_lat + west_of_min_lon + east_of_max_lon)
match_indexes = (margin_matches == 4).nonzero()
matches = [(A.location[i], B.ID[j]) for i, j in zip(match_indexes[0], match_indexes[1])]
print(matches)
PS - You can painlessly run this on a GPU if you use CuPy and replace all references to numpy with cupy.
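If you need the result as an ID column on A rather than a list of (location, ID) pairs, one way (a sketch, reusing the match_indexes computed above) could be:

import pandas as pd

ids = pd.Series(index=A.index, dtype=object)
ids.iloc[match_indexes[0]] = B.ID.values[match_indexes[1]]  # positional indices into A and B from the broadcast
A['ID'] = ids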

Vectorize operation on dataframe where I need to subset another dataframe (pearson correlation)

What's the best way to do an operation on a dataframe when, for every row, I need to do a selection on another dataframe?
For example:
My first dataframe has the similarity between every pair of items. For starters, I'll set every similarity to zero and calculate the correct similarity later.
import pandas as pd
import numpy as np
import scipy as sp
from scipy.spatial import distance
items = [1,2,3,4]
item_item_idx = pd.MultiIndex.from_product([items, items], names = ['from_item', 'to_item'])
item_item_df = pd.DataFrame({'similarity': np.zeros(len(item_item_idx))},
                            index=item_item_idx)
My next dataframe has the rating every user gave for every item. For the sake of simplicity, let's assume every user rated every item and generate random ratings between 1 and 5.
users = [1,2,3,4,5]
ratings_idx = pd.MultiIndex.from_product([items, users], names = ['item', 'user'])
rating_df = pd.DataFrame(
    {'rating': np.random.randint(low=1, high=6, size=len(users)*len(items))},
    columns=['rating'],
    index=ratings_idx
)
Now that I have the ratings, I want to update the cosine similarity between the items. What I need to do is, for every row in item_item_df, select from rating_df the vector of ratings for each item, and calculate the cosine distance between those two.
I want to know the least dumb way to do this. Here's what I tried so far:
==== FIRST TRY - Iterating over rows
def similarity(ii, iu):
    for index, row in ii.iterrows():
        v = iu.loc[index[0]]
        u = iu.loc[index[1]]
        row['similarity'] = distance.cosine(v, u)
    return ii

import time
start_time = time.time()
item_item_df = similarity(item_item_df, rating_df)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.01002s to run this. In a problem with 10k items, I estimate it would take in the ballpark of 20 hours to run. Not good.
The thing is, I'm iterating over rows; my hope is that I can vectorize this to make it faster. I played around with df.apply() and df.map(). This is the best I've done so far:
==== SECOND TRY - index.map()
def similarity_map(idx):
    v = rating_df.loc[idx[0]]
    u = rating_df.loc[idx[1]]
    return distance.cosine(v, u)

start_time = time.time()
item_item_df['similarity'] = item_item_df.index.map(similarity_map)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.034961s to execute. Slower than just iterating over rows.
So this was a naive attempt to vectorize. Is it even possible to do? What other options do I have to improve the runtime?
Thanks for the attention.
For your given example I'd just pivot it into an array and move on with my life.
from sklearn.metrics.pairwise import cosine_similarity
rating_df = rating_df.reset_index().pivot(index='item', columns='user')
cs_df = pd.DataFrame(cosine_similarity(rating_df),
                     index=rating_df.index, columns=rating_df.index)
>>> cs_df
item 1 2 3 4
item
1 1.000000 0.877346 0.660529 0.837611
2 0.877346 1.000000 0.608781 0.852029
3 0.660529 0.608781 1.000000 0.758098
4 0.837611 0.852029 0.758098 1.000000
This would be more difficult with a giant, highly-sparse array. Sklearn cosine_similarity takes sparse arrays though so as long as your number of items is reasonable (since the output matrix will be dense) this should be solvable.
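As a small illustration of that point (an aside, assuming the pivoted items-by-users rating_df from above), cosine_similarity accepts a scipy sparse matrix directly:

from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

sparse_ratings = csr_matrix(rating_df.values)   # items x users; mostly zeros in a real dataset
cs = cosine_similarity(sparse_ratings)          # still returns a dense items x items array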
Same thing but different. Work with numpy arrays. Fine for small arrays but with 10k rows you'll have some large arrays.
import numpy as np

data = rating_df.unstack().values        # shape (4, 5): items x users
udotv = np.dot(data, data.T)             # shape (4, 4)
mag_data = np.linalg.norm(data, axis=1)
mag = mag_data * mag_data[:, None]
cos_sim = 1 - (udotv / mag)              # cosine distance, matching distance.cosine in the question
item_item_df['sim2'] = cos_sim.flatten()
4k users and 14k items pretty much blows up my poor computer. I'm going to have to look how sklearn.metrics.pairwise.cosine_similarity handles that large data.

Cross-correlation (time-lag-correlation) with pandas?

I have various time series, that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is the greatest.
I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.
Edit
The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the timeseries nature of my data. When I correlate a time series that starts in, say, 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020-entry array (the length of the longer series) full of nan.
The various Q's on this subject indicate that there should be a way to solve the different length issue, but so far, I have seen no indication on how to use it for specific time periods. I just need to shift by 12 months in increments of 1, for seeing the time of maximum correlation within one year.
Edit2
Some minimal sample data:
import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq = 'MS')
dfdata1 = (np.random.random_integers(-30,30,(len(dfdates1)))/10.0) #My real data is from measurements, but random between -3 and 3 is fitting
df1 = pd.DataFrame(dfdata1, index = dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq = 'MS')
dfdata2 = (np.random.random_integers(-30,30,(len(dfdates2)))/10.0)
df2 = pd.DataFrame(dfdata2, index = dfdates2)
Due to various processing steps, those dfs end up as dataframes that are indexed from 1940 to 2015. This should reproduce that:
bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)
This is what I get when I correlate with pandas and shift one dataset:
In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523
And trying scipy:
In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]:
array([[ nan],
[ nan],
[ nan],
...,
[ nan],
[ nan],
[ nan]])
which according to whos is
scicorr ndarray 1801x1: 1801 elems, type `float64`, 14408 bytes
But I'd just like to have 12 entries.
/Edit2
The idea I have come up with is to implement a time-lag correlation myself, like so:
corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on
But this is probably slow, and I am probably trying to reinvent the wheel here. Edit: The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I would still prefer a built-in method.
As far as I can tell, there isn't a built-in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:
def autocorr(self, lag=1):
    """
    Lag-N autocorrelation

    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.

    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))
So a simple time-lagged cross-correlation function would be
def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation.

    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    ----------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))
Then if you wanted to look at the cross correlations at each month, you could do
xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
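If the goal is the lag with the greatest correlation within the year, one way to pick it out of that list (a sketch, using numpy's NaN-aware argmax on the question's big1/big2 series) is:

import numpy as np

xcov_monthly = [crosscorr(big1[0], big2[0], lag=i) for i in range(12)]
best_lag = int(np.nanargmax(xcov_monthly))    # or np.nanargmax(np.abs(xcov_monthly)) for strongest in either direction
print(best_lag, xcov_monthly[best_lag])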
There is a better approach: you can create a function that shifts your dataframe first before calling corr().
Take this dataframe as an example:
d = {'prcp': [0.1,0.2,0.3,0.0], 'stp': [0.0,0.1,0.2,0.3]}
df = pd.DataFrame(data=d)
>>> df
prcp stp
0 0.1 0.0
1 0.2 0.1
2 0.3 0.2
3 0.0 0.3
A function to shift the other columns (except the target):
def df_shifted(df, target=None, lag=0):
    if not lag and not target:
        return df
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return pd.DataFrame(data=new)
Suppose that your target is comparing prcp (the precipitation variable) with stp (atmospheric pressure).
If you do it at the present (lag 0), it will be:
>>> df.corr()
prcp stp
prcp 1.0 -0.2
stp -0.2 1.0
But if you shift all other columns by one period and keep the target (prcp):
df_new = df_shifted(df, 'prcp', lag=-1)
>>> print df_new
prcp stp
0 0.1 0.1
1 0.2 0.2
2 0.3 0.3
3 0.0 NaN
Note that now the column stp is shifted up one period, so if you call corr() it will be:
>>> df_new.corr()
prcp stp
prcp 1.0 1.0
stp 1.0 1.0
So you can do this with lag -1, -2, ..., -n!
To build on Andre's answer - if you only care about (lagged) correlation to the target, but want to test various lags (e.g. to see which lag gives the highest correlations), you can do something like this:
lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})
This way, each row corresponds to a different lag value, and each column corresponds to a different variable (one of them is the target itself, giving the autocorrelation).
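A possible follow-up on that frame: idxmax gives, for each column, the lag (row label) at which the correlation with the target is highest.

best_lags = lagged_correlation.idxmax()   # lag with the strongest positive correlation per column
strongest = lagged_correlation.max()      # the corresponding correlation values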

Increasing performance of nearest neighbors of rows in Pandas

I am given an 8000x3 data set similar to this one:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(8000,3), columns=list('XYZ'))
So for a visual reference, df.head(5) looks like this:
X Y Z
0 0.462433 0.559442 0.016778
1 0.663771 0.092044 0.636519
2 0.111489 0.676621 0.839845
3 0.244361 0.599264 0.505175
4 0.115844 0.888622 0.766014
I'm trying to implement a method that, when given an index from the dataset, will return similar items from the dataset (in some reasonable way). For now I have:
def find_similiar_items(item_id):
    tmp_df = df.sub(df.loc[item_id], axis='columns')
    tmp_series = tmp_df.apply(np.square).apply(np.sum, axis=1)
    tmp_series.sort()
    return tmp_series
This method takes your row, then subtracts it from each other row in the dataframe, then calculates the norm for each row. So this method simply returns a series of the nearest points to your given point using the euclidean distance.
So you can get the nearest 5 points, for instance, with:
df.loc[find_similiar_items(5).index].head(5)
which yields:
X Y Z
5 0.364020 0.380303 0.623393
4618 0.369122 0.399772 0.643603
4634 0.352484 0.402435 0.619763
5396 0.386675 0.370417 0.600555
3229 0.355186 0.410202 0.616844
The problem with this method is that it takes roughly half a second each time I call it. This isn't acceptable for my purpose, so I need to figure out how to improve the performance of this method in someway. So I have a few questions:
Question 1 Is there perhaps a more efficient way of simply calculating the euclidean distance as above?
Question 2 Is there some other technique that will yield reasonable results like this (the euclidean distance isn't important, for instance)? Computation time is more important than memory in this problem, and pre-processing time is not important; so I would be willing, for instance, to construct a new dataframe that has the size of the Cartesian product (n^2) of the original dataframe (but anything more than that might become unreasonable).
Your biggest (and easiest) performance gain is likely to be from merely doing this in numpy rather than pandas. I'm seeing over a 200x improvement just from a quick conversion of the code to numpy:
arr = df.values

def fsi_numpy(item_id):
    tmp_arr = arr - arr[item_id]
    tmp_ser = np.sum(np.square(tmp_arr), axis=1)
    return tmp_ser

df['dist'] = fsi_numpy(5)
df = df.sort_values('dist').head(5)
X Y Z dist
5 0.272985 0.131939 0.449750 0.000000
5130 0.272429 0.138705 0.425510 0.000634
4609 0.264882 0.103006 0.476723 0.001630
1794 0.245371 0.175648 0.451705 0.002677
6937 0.221363 0.137457 0.463451 0.002883
Check that it gives the same result as your function (since we have different random draws):
df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)
X Y Z
5 0.272985 0.131939 0.449750
5130 0.272429 0.138705 0.425510
4609 0.264882 0.103006 0.476723
1794 0.245371 0.175648 0.451705
6937 0.221363 0.137457 0.463451
Timings:
%timeit df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)
1 loops, best of 3: 638 ms per loop
In [105]: %%timeit
...: df['dist'] = fsi_numpy(5)
...: df = df.sort_values('dist').head(5)
...:
100 loops, best of 3: 2.69 ms per loop
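Regarding Question 2: since pre-processing time doesn't matter here, another option (not covered in the answer above, and assuming df is still the original 8000x3 frame) would be to build a k-d tree once with scipy and query it per item:

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

points = df[['X', 'Y', 'Z']].to_numpy()
tree = cKDTree(points)                     # built once, up front

def find_similiar_items_kdtree(item_id, k=5):
    # returns the k nearest rows; the query point itself comes back as the first hit
    _, idx = tree.query(points[item_id], k=k)
    return df.iloc[idx]

find_similiar_items_kdtree(5)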
