Increasing performance of nearest neighbors of rows in Pandas - python

I am given an 8000x3 data set similar to this one:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(8000,3), columns=list('XYZ'))
So for a visual reference, df.head(5) looks like this:
X Y Z
0 0.462433 0.559442 0.016778
1 0.663771 0.092044 0.636519
2 0.111489 0.676621 0.839845
3 0.244361 0.599264 0.505175
4 0.115844 0.888622 0.766014
I'm trying to implement a method that, when given an index from the dataset, returns similar items from the dataset (in some reasonable way). For now I have:
def find_similiar_items(item_id):
    tmp_df = df.sub(df.loc[item_id], axis='columns')
    tmp_series = tmp_df.apply(np.square).apply(np.sum, axis=1)
    tmp_series.sort()
    return tmp_series
This method takes your row, subtracts it from every other row in the dataframe, then calculates the squared Euclidean norm of each difference. So it simply returns a series of the points nearest to your given point under the Euclidean distance.
So you can get the nearest 5 points, for instance, with:
df.loc[find_similiar_items(5).index].head(5)
which yields:
X Y Z
5 0.364020 0.380303 0.623393
4618 0.369122 0.399772 0.643603
4634 0.352484 0.402435 0.619763
5396 0.386675 0.370417 0.600555
3229 0.355186 0.410202 0.616844
The problem with this method is that it takes roughly half a second each time I call it. This isn't acceptable for my purpose, so I need to figure out how to improve the performance of this method in some way. So I have a few questions:
Question 1 Is there perhaps a more efficient way of simply calculating the Euclidean distance as above?
Question 2 Is there some other technique that will yield reasonable results like this (the Euclidean distance isn't important, for instance)? Computation time is more important than memory in this problem and pre-processing time is not important; so I would be willing, for instance, to construct a new dataframe that has the size of the Cartesian product (n^2) of the original dataframe (but anything more than that might become unreasonable).

Your biggest (and easiest) performance gain is likely to be from merely doing this in numpy rather than pandas. I'm seeing over a 200x improvement just from a quick conversion of the code to numpy:
arr = df.values
def fsi_numpy(item_id):
    tmp_arr = arr - arr[item_id]
    tmp_ser = np.sum( np.square( tmp_arr ), axis=1 )
    return tmp_ser
df['dist'] = fsi_numpy(5)
df = df.sort_values('dist').head(5)
X Y Z dist
5 0.272985 0.131939 0.449750 0.000000
5130 0.272429 0.138705 0.425510 0.000634
4609 0.264882 0.103006 0.476723 0.001630
1794 0.245371 0.175648 0.451705 0.002677
6937 0.221363 0.137457 0.463451 0.002883
Check that it gives the same result as your function (since we have different random draws):
df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)
X Y Z
5 0.272985 0.131939 0.449750
5130 0.272429 0.138705 0.425510
4609 0.264882 0.103006 0.476723
1794 0.245371 0.175648 0.451705
6937 0.221363 0.137457 0.463451
Timings:
%timeit df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)
1 loops, best of 3: 638 ms per loop
In [105]: %%timeit
...: df['dist'] = fsi_numpy(5)
...: df = df.sort_values('dist').head(5)
...:
100 loops, best of 3: 2.69 ms per loop
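Since the question says pre-processing time is not a concern and only query time matters, another option is to build a spatial index once and query it repeatedly. A minimal sketch using scipy's cKDTree (assuming scipy is available; fsi_kdtree is just an illustrative name), built on the full arr = df.values array from above:

from scipy.spatial import cKDTree

tree = cKDTree(arr)                 # one-time preprocessing cost

def fsi_kdtree(item_id, k=5):
    # distances and row positions of the k nearest rows (the query row itself is included)
    dist, idx = tree.query(arr[item_id], k=k)
    return idx, dist

fsi_kdtree(5)

The build cost is paid once; each query then traverses only part of the tree rather than scanning all 8000 rows, which helps when the function is called many times.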

Related

fastest way to filter a 2d numpy array

I am trying to filter a 2D numpy array; I have written a function like the following:
import numba as nb
import numpy as np

@nb.njit
def numpy_filter(npX):
    n = np.full(npX.shape[0], True)
    for npo_index in range(npX.shape[0]):
        n[npo_index] = npX[npo_index][0] < 2000 and npX[npo_index][1] < 4000 and npX[npo_index][2] < 5000
    return npX[n]
It takes 1.75 s (numba njit mode) for an array of length 600K, while the equivalent list comprehension [x for x in obj1 if x[0] < 2000 and x[1] < 4000 and x[2] < 5000] takes under 0.5 s.
Is there a better implementation of this filtering function that would make it run faster?
Generally with Pandas/NumPy arrays, you'll get the best performance if you:
avoid iterating over the array
only create soft copies or views of the base array
create a minimal number of intermediate Python objects
Pandas is probably your friend here, allowing you to create a view over the backing NumPy array and operate on its rows via a shared index.
starting data
This creates a random array of the same shape as your source data, with values ranging from 0 to 10000.
>>> import numpy as np
>>> arr = np.random.rand(600000, 3) * 10000
>>> arr
array([[8079.54193993, 925.74430028, 2031.45569251],
[8232.74161149, 2347.42814063, 7571.21287502],
[7435.52165567, 756.74380534, 1023.12181186],
...,
[2176.36643662, 5374.36584708, 637.43482263],
[2645.0737415 , 9059.42475818, 3913.32941652],
[3626.54923011, 1494.57126083, 6121.65034039]])
create a Pandas DataFrame
This creates a view over your source data so you can work with all the rows together using a shared index.
>>> import pandas as pd
>>> df = pd.DataFrame(arr)
>>> df
0 1 2
0 8079.541940 925.744300 2031.455693
1 8232.741611 2347.428141 7571.212875
2 7435.521656 756.743805 1023.121812
3 4423.799649 2256.125276 7591.732828
4 6892.019075 3170.699818 1625.226953
... ... ... ...
599995 642.104686 3164.107206 9508.818253
599996 102.819102 3068.249711 1299.341425
599997 2176.366437 5374.365847 637.434823
599998 2645.073741 9059.424758 3913.329417
599999 3626.549230 1494.571261 6121.650340
[600000 rows x 3 columns]
filter
This gets a filtered view of the index for each column and uses the combined result to filter the DataFrame.
>>> df[(df[0] < 2000) & (df[1] < 4000) & (df[2] < 5000)]
0 1 2
35 1829.777633 1333.083450 1928.982210
38 653.584288 3129.089395 4753.734920
71 1354.736876 279.202816 5.793797
97 1381.531847 551.465381 3767.436640
115 183.112455 1573.272310 1973.143995
... ... ... ...
599963 1895.537096 1695.569792 1866.575164
599970 1061.011239 51.534961 1014.290040
599988 1780.535714 2311.671494 1012.828410
599994 878.643910 352.858091 3014.505666
599996 102.819102 3068.249711 1299.341425
[24067 rows x 3 columns]
A benchmark may follow, but it's very fast.
The jit function had not been warmed up; after the first run, it takes only 0.07 s to finish the task.
Make your jit function return only the mask n; don't return npX[n].
Since the jit compiler cannot fix the return size of the filtered array, there is a chance it will slow things down.
Do the filtering, i.e. npX[n], outside the jit function. That should speed it up.
Also, to get the actual timing, it is better to add a signature to the decorator; this forces eager compilation.
The optimization paths in numpy and numba are different, so always experiment to see which is faster. But when the speed is almost the same, you can add the parallel option to make it even faster (I guess you already know that).
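A rough sketch of that advice (the function name and the exact signature are illustrative choices, not benchmarked here): return only the boolean mask from the jit function, give it an explicit signature so it compiles eagerly, and do the fancy indexing outside:

import numba as nb
import numpy as np

@nb.njit(nb.boolean[:](nb.float64[:, :]))
def make_mask(npX):
    # build only the boolean mask inside the compiled function
    n = np.full(npX.shape[0], True)
    for i in range(npX.shape[0]):
        n[i] = npX[i, 0] < 2000 and npX[i, 1] < 4000 and npX[i, 2] < 5000
    return n

arr = np.random.rand(600000, 3) * 10000
filtered = arr[make_mask(arr)]   # do the filtering outside the jit function

If the speeds end up close, the loop can also be parallelized with parallel=True and nb.prange, as the answer mentions.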

Vectorize operation on dataframe where I need to subset another dataframe (pearson correlation)

What's the best way to do an operation on a dataframe that, for every row, I need to do a selection on another dataframe?
For example:
My first dataframe has the similarity between every pair of items. For starters, I'll assume every similarity is zero and calculate the correct similarity later.
import pandas as pd
import numpy as np
import scipy as sp
from scipy.spatial import distance
items = [1,2,3,4]
item_item_idx = pd.MultiIndex.from_product([items, items], names = ['from_item', 'to_item'])
item_item_df = pd.DataFrame({'similarity': np.zeros(len(item_item_idx))},
                            index = item_item_idx
                            )
My next dataframe has the rating every user gave for every item. For sake of simplification, let's assume every user rated every item and generate random ratings between 1 and 5.
users = [1,2,3,4,5]
ratings_idx = pd.MultiIndex.from_product([items, users], names = ['item', 'user'])
rating_df = pd.DataFrame(
    {'rating': np.random.randint(low = 1, high = 6, size = len(users)*len(items))},
    columns = ['rating'],
    index = ratings_idx
)
Now that I have the ratings, I want to update the cosine similarity between the items. What I need to do is, for every row in item_item_df, select from rating_df the vector of ratings for each item, and calculate the cosine distance between those two.
I want to know the least dumb way to do this. Here's what I tried so far:
==== FIRST TRY - Iterating over rows
def similarity(ii, iu):
    for index, row in ii.iterrows():
        v = iu.loc[index[0]]
        u = iu.loc[index[1]]
        row['similarity'] = distance.cosine(v, u)
    return(ii)
import time
start_time = time.time()
item_item_df = similarity(item_item_df, rating_df)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.01002 s to run this. In a problem with 10k items, I estimate it would take in the ballpark of 20 hours to run. Not good.
The thing is, I'm iterating over rows; my hope is that I can vectorize this to make it faster. I played around with df.apply() and df.map(). This is the best I've done so far:
==== SECOND TRY - index.map()
def similarity_map(idx):
    v = rating_df.loc[idx[0]]
    u = rating_df.loc[idx[1]]
    return distance.cosine(v, u)
start_time = time.time()
item_item_df['similarity'] = item_item_df.index.map(similarity_map)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.034961 s to execute. Slower than just iterating over rows.
So this was a naive attempt to vectorize. Is it even possible to do? What other options do I have to improve the runtime?
Thanks for the attention.
For your given example I'd just pivot it into an array and move on with my life.
from sklearn.metrics.pairwise import cosine_similarity
rating_df = rating_df.reset_index().pivot(index='item', columns='user')
cs_df = pd.DataFrame(cosine_similarity(rating_df),
                     index=rating_df.index, columns=rating_df.index)
>>> cs_df
item 1 2 3 4
item
1 1.000000 0.877346 0.660529 0.837611
2 0.877346 1.000000 0.608781 0.852029
3 0.660529 0.608781 1.000000 0.758098
4 0.837611 0.852029 0.758098 1.000000
This would be more difficult with a giant, highly-sparse array. Sklearn cosine_similarity takes sparse arrays though so as long as your number of items is reasonable (since the output matrix will be dense) this should be solvable.
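If you then want those scores back in item_item_df's long format, one way is to stack the square matrix. A short sketch, assuming the item labels in cs_df match the values in item_item_df's MultiIndex:

# flatten the item x item similarity matrix into the (from_item, to_item) MultiIndex
sim_long = cs_df.stack()
sim_long.index.names = ['from_item', 'to_item']
item_item_df['similarity'] = sim_long   # aligns on the MultiIndex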
Same thing but different. Work with numpy arrays. Fine for small arrays but with 10k rows you'll have some large arrays.
import numpy as np
data = rating_df.unstack().values # shape (4,5)
udotv = np.dot(data,data.T) # shape (4,4)
mag_data = np.linalg.norm(data,axis=1)
mag = mag_data * mag_data[:,None]
cos_sim = 1 - (udotv / mag)
item_item_df['sim2'] = cos_sim.flatten()
4k users and 14k items pretty much blows up my poor computer. I'm going to have to look at how sklearn.metrics.pairwise.cosine_similarity handles data that large.

Is there a faster way to split a pandas dataframe into two complementary parts?

Good evening all,
I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.
What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.
In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.
I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.
data = chunk.iloc[:,1:776]
listy1 = []
listy2 = []
for i in range(0,len(data)):
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)
Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."
Do you have some insight into why this is so slow, or any suggestions as to make this faster?
A key concept in efficient numpy/scipy/pandas coding is using library-shipped vectorized functions whenever possible. Try to process multiple rows at once instead of iterating explicitly over rows, i.e. avoid for loops and .iterrows().
The implementation provided is a little subtle in terms of indexing, but the vectorization idea is straightforward, as follows:
Draw the main dataset at once.
The complementary dataset: draw the 0-rows at once, the complementary 1-rows at once, and then put them into the corresponding rows at once.
Code:
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(52) # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0,1]*int(n/2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10*n, 10))
    }
)
t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n) # repeatable draw
df_main = df.iloc[draw_idx, :].reset_index(drop=True)
# 2. draw the complementary dataset
# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df["773"][draw_idx].values)
n_0 = n - n_1
# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)
# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)
# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0 ,:] = df_1.iloc[idx_1, :].values # df_1 into mask_0
df_comp.iloc[~mask_0 ,:] = df_0.iloc[idx_0, :].values # df_0 into ~mask_0
print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")
Check
print(df_main.head(5))
773 dummy1 dummy2
0 0 28 280
1 1 11 110
2 1 13 130
3 1 23 230
4 0 86 860
print(df_comp.head(5))
773 dummy1 dummy2
0 1 19 190
1 0 74 740
2 0 28 280 <- this row is complementary to df_main
3 0 60 600
4 1 37 370
Efficiency gain: 14.23s -> 0.011s (ca. 1300x)
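As a quick sanity check, every row of df_comp should carry the opposite "773" value of the row at the same position in df_main; a minimal assertion (assuming the column stays 0/1-valued):

# complementarity check: "773" is 0/1, so "opposite" just means "not equal" row by row
assert (df_main["773"].values != df_comp["773"].values).all()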

Pandas dataframe interpolation with logarithmically sampled time intervals

I have a pandas dataframe with a 'Minutes' column (time in minutes) and a 'Value' column pulled in from a data logger. The data are logged in logarithmic time intervals, meaning that the first values are logged at fractional minutes and as time proceeds the time intervals get longer:
print(df)
Minutes Value
0 0.001 0.00100
1 0.005 0.04495
2 0.010 0.04495
3 0.015 0.09085
4 0.020 0.11368
.. ... ...
561 4275.150 269.17782
562 4285.150 266.90964
563 4295.150 268.35306
564 4305.150 269.42984
565 4315.150 268.37594
I would like to linearly interpolate the 'Value' at one minute intervals from 0 to 4315 minutes.
I have attempted a few different iterations of df.interpolate() however have not found success. Can someone please help me out? Thank you
I think it's possible that my question was very basic or I asked a confusing question. Either way, I just wrote a little loop to solve my problem and felt like I should share it. I am sure that this is not the most efficient way of doing what I was asking, and hopefully somebody can suggest better ways of accomplishing this. I am still very new at this whole thing.
First a few qualifying things:
The 'Value' data that I was talking about is called 'drawdown', which refers to a difference in water level from the initial starting water level inside a water well. It starts at 0.
This kind of data is often viewed in a semi-log plot and sometimes it's easier to replace 0 with a very low number instead (e.g. 0.0001) so that it plots easily in other programs.
This code takes a .csv file with column names 'Minutes' and 'Drawdown' and compares time values with a new reference dataframe of minutes from 0 through the end of the dataset. It references the 2 closest time values to the desired time value in the list, makes a weighted average of those values, then creates a new csv of the integer minutes with drawdown.
Cheers!
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 22 13:42:29 2020
#author: cmeyer
"""
import pandas as pd
import numpy as np
df=pd.read_csv('Read_in.csv')
length=len(df)-1
last=df.at[length,'Drawdown']
lengthpump=int(df.at[length,'Minutes'])
minutes=np.arange(0,lengthpump,1)
dfminutes=pd.DataFrame(minutes)
dfminutes.columns = ['Minutes']
for i in range(1, lengthpump, 1):
    non_uni_minutes=df['Minutes']
    uni_minutes=dfminutes.at[i,'Minutes']
    close1=non_uni_minutes[np.argsort(np.abs(non_uni_minutes-uni_minutes))[0]]
    close2=non_uni_minutes[np.argsort(np.abs(non_uni_minutes-uni_minutes))[1]]
    index1 = np.where(non_uni_minutes == close1)
    index1 = int(index1[0])
    index2 = np.where(non_uni_minutes == close2)
    index2 = int(index2[0])
    num1=df.at[index1,'Drawdown']
    num2=df.at[index2,'Drawdown']
    weight1 = 1-abs((i-close1)/i)
    weight2 = 1-abs((i-close2)/i)
    Value = (weight1*num1+weight2*num2)/(weight1+weight2)
    dfminutes.at[i,'Drawdown'] = Value
dfminutes.at[0,'Drawdown'] = 0.000001
dfminutes.at[0,'Minutes'] = 0.000001
dfminutes.to_csv('integer_minutes_drawdown.csv')
Here I implemented an efficient solution using numpy.interp. I've coded a somewhat fancy way of reading the data into a pandas.DataFrame from a string; you may use any simpler way that suits your needs, like pandas.read_csv(...).
Try the following code online!
import math
import pandas as pd, numpy as np
# Here is just fancy way of reading data, use any other method of reading instead
df = pd.DataFrame([list(map(float, line.split())) for line in """
0.001 0.00100
0.005 0.04495
0.010 0.04495
0.015 0.09085
0.020 0.11368
4275.150 269.17782
4285.150 266.90964
4295.150 268.35306
4305.150 269.42984
4315.150 268.37594
""".splitlines() if line.strip()], columns = ['Time', 'Value'])
a = df.values
# Create array of integer x = [0 1 2 3 ... LastTimeFloor].
x = np.arange(math.floor(a[-1, 0] + 1e-6) + 1)
# Linearly interpolate
y = np.interp(x, a[:, 0], a[:, 1])
df = pd.DataFrame({'Time': x, 'Value': y})
print(df)
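For reference, much the same can be done with pandas' own interpolation, which is probably what the earlier df.interpolate() attempts were missing: the interpolation has to be done against the time values (method='index'), not the default row positions. A sketch, assuming the original logger frame with 'Minutes' and 'Value' columns from the question:

import numpy as np

# integer-minute grid from 0 up to the last logged time
target = np.arange(int(df['Minutes'].iloc[-1]) + 1, dtype=float)

s = df.set_index('Minutes')['Value']
s_interp = (s.reindex(s.index.union(target))   # insert the integer minutes as empty rows
             .interpolate(method='index')      # linear interpolation against the index values
             .loc[target])                     # keep only the integer minutes

# minute 0 lies before the first sample (0.001 min), so it is still NaN; back-fill it
s_interp = s_interp.bfill()
result = s_interp.rename_axis('Minutes').reset_index(name='Value')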

How to optimize a function for calculating distribution similarity

I have a dataframe with different actors distribution of attention towards different issues. It looks like this:
Social politics & Welfare Technology & IT Business, Finance, & Economy ...
actor_1 0.034483 0.051724 0.017241 ...
actor_2 0.032000 0.016000 0.056000 ...
actor_3 0.012195 0.004065 0.010163 ...
actor_4 0.000000 0.045977 0.022989 ...
actor_5 0.027397 0.006849 0.000000 ...
actor_6 0.128205 0.000000 0.051282 ...
I've created two functions for creating a matrix with the similarity scores between all the different actors.
def dist_sim(array1, array2):
    array1 = array1*100
    array2 = array2*100
    distances = array1-array2
    total_distance = 0
    for distance in distances:
        total_distance += math.sqrt(distance*distance)
    return(100-total_distance/2)
def dist_sim_matrix(df):
    matrix = []
    for index, row in df.iterrows():
        party_matrix = []
        for index1, row1 in df.iterrows():
            party_matrix.append(dist_sim(row, row1))
        matrix.append(party_matrix)
    return np.array(matrix, int)
They work perfectly fine; however, when I apply them to a large dataframe (e.g. with 2000 different actors and 25 issues) it takes forever (I'm actually not sure I've got enough RAM for it?).
I'm new in the business of creating my own functions, so any help on optimization would be awesome!
Here is what you can do:
import pandas as pd
import numpy as np
# I used a fake dataframe
df = pd.DataFrame(data={'c1': np.random.rand(10),
                        'c2': np.random.rand(10),
                        'c3': np.random.rand(10),
                        'c4': np.random.rand(10)},
                  index=[f'actor_{i}' for i in range(1,11)])
# Transpose it
df = df.T
# Define the function to compute distance
def dist_sim_opt(array1, array2):
    '''
    Use vectorization, distributive property and numpy functions
    '''
    d = np.sqrt((np.square(array1-array2)).sum())*100
    return(100-d/2)
# Initialize an empty dataframe
sim_df = pd.DataFrame(columns=list(df), index=list(df))
# cycle over the dataframe actors - exploit symmetry to halve the number of iterations
for i,c1 in enumerate(list(df)):
    for c2 in list(df)[i:]:
        sim_df.loc[c1, c2]=sim_df.loc[c2, c1]=dist_sim_opt(df[c1], df[c2])
The resulting dataframe is something like
sim_df
actor_1 actor_2 actor_3 ... actor_8 actor_9 actor_10
actor_1 100 67.146 56.3693 ... 74.2303 77.7915 55.0946
actor_2 67.146 100 64.7546 ... 61.9146 72.5428 63.7388
actor_3 56.3693 64.7546 100 ... 57.5318 51.5127 95.3162
actor_4 68.5392 59.2313 75.0851 ... 73.3381 61.7608 74.6694
actor_5 72.671 67.2219 79.2112 ... 64.2796 59.9031 77.3241
actor_6 62.8109 67.1849 87.7293 ... 60.9305 53.3952 83.9605
actor_7 62.0589 63.5562 35.7006 ... 57.5888 61.3989 33.1785
actor_8 74.2303 61.9146 57.5318 ... 100 69.602 55.4216
actor_9 77.7915 72.5428 51.5127 ... 69.602 100 51.4612
actor_10 55.0946 63.7388 95.3162 ... 55.4216 51.4612 100
in this case there is an optimised function in scipy, see the spatial.distance module, specifically the pdist function for computing:
Pairwise distances between observations in n-dimensional space.
in your case you can do:
from scipy.spatial import distance
d = distance.squareform(distance.pdist(df, 'euclidean'))
dd = pd.DataFrame(d, df.index, df.index)
note that these are "distances", so the distance to the same actor is zero. if you really want to have it take a maximal value (as in your calculations) you could do:
d *= -50
d += 100
before turning it into a dataframe. note that I'm doing these calculations "inplace" so that additional copies of a potentially enormous matrix aren't created
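Putting those pieces together in order (a short sketch; sim_df is just an illustrative name, and the rescaling happens in place before the DataFrame is built):

from scipy.spatial import distance
import pandas as pd

# pairwise Euclidean distances between actors (rows of df), as a full square matrix
d = distance.squareform(distance.pdist(df, 'euclidean'))

# rescale in place to the 0-100 similarity used in the question: sim = 100 - 50 * dist
d *= -50
d += 100

sim_df = pd.DataFrame(d, index=df.index, columns=df.index)
# the diagonal is now 100, i.e. each actor is maximally similar to itself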
