Comparing values in Dataframes - python

I am doing a Python project and trying to cut down on some computational time at the start using Pandas.
The code currently is:
for c1 in centres1:
    for c2 in centres2:
        if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
            possible_commet.append([c1,c2])
I am trying to put centres1 and centres2 into data frames and then compare each value to every other value. Would pandas help me cut some time off it (currently about 2 minutes)? If not, how could I work around it?
Thanks

Unfortunately this is never going to be fast, as you are performing n² comparisons. For example, if you are comparing n objects where n = 1000, you only have 1 million comparisons; if n = 10_000, you have 100 million comparisons. A problem 10x bigger becomes 100 times slower.
Nevertheless, for loops in Python are relatively expensive. Using a library like pandas may mean you only need to make one function call, which will shave some time off. Without any input data it's hard to assist further, but the below should provide some building blocks:
import pandas

df1 = pandas.DataFrame(centres1)
df2 = pandas.DataFrame(centres2)
# cross join: every row of df1 paired with every row of df2
df3 = df1.merge(df2, how='cross')
# squared distance between the two centres in each pair
df3['combined_centre'] = (df3['0_x'] - df3['0_y'])**2 + (df3['1_x'] - df3['1_y'])**2
# keep only the pairs closer than the search radius
df3[df3['combined_centre'] < search_rad**2]
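If the cross join itself becomes too large to hold in memory, a spatial index is another option. Below is a rough sketch, assuming scipy is available and that each centre is an (x, y) pair; query_ball_point returns, for every point in centres1, the indices of the points in centres2 within search_rad:
import numpy as np
from scipy.spatial import cKDTree

pts1 = np.asarray(centres1)
pts2 = np.asarray(centres2)

# build the tree on one set of centres and query it with the other
tree = cKDTree(pts2)
matches = tree.query_ball_point(pts1, r=search_rad)

# rebuild the same pair list as the original loop
# (note: this includes points exactly at the radius, unlike the strict < above)
possible_commet = [[pts1[i], pts2[j]] for i, ids in enumerate(matches) for j in ids]
The tree query avoids materialising all n² pairs at once, which is the main memory cost of the cross-merge approach.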

Yes, pandas will certainly help cut off at least some of the time compared with what you are getting right now, but you can also try this out:
for c1, c2 in zip(centres1, centres2):
    if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
        possible_commet.append([c1,c2])

Related

Speeding up numpy operations

Using a 2D numpy array, I want to create a new array that expands the original one using a moving window. Let me explain what I mean using an example code:
# Simulate some data
import numpy as np
np.random.seed(1)
t = 20000 # total observations
location = np.random.randint(1, 5, (t,1))
var_id = np.random.randint(1, 8, (t,1))
hour = np.repeat(np.arange(0, (t/5)), 5).reshape(-1,1)
value = np.random.rand(t,1)
df = np.concatenate((location,var_id,hour,value),axis = 1)
Having "df" I want to create a new array "results" like below:
# length of moving window
window = 10
hours = df[:,2]
# create an empty array to store the results
results = np.empty((0,4))
for i in range(len(set(hours))-window+1):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    results = np.concatenate((results, obs_data), axis=0)
My problem is that the concatenation is very slow (on my system the operation takes 1.4 and 16 seconds without and with the concatenation, respectively). I have over a million data points and I want to speed up this code. Does anyone know a better way to create the new array faster (possibly without using np.concatenate)?
If you need to iterate, make the results array big enough to hold all the values.
# create an array big enough to hold all the results
n = len(set(hours))-window+1
results = np.empty((n,4))
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    results[i,:] = obs_data
Repeated concatenate is slow; list append is faster.
It may be possible to get all obs_data from df with one indexing call, but I won't try to explore that now.
Not a completely for-loop-free answer either, but a working one:
window = 10
hours = df[:,2]
# collect each window's rows in a list, then stack once at the end
lr = []
for i in range(len(set(hours))-window+1):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    lr.append(obs_data)
results = np.vstack(lr)
It is way faster, for the reason already given: calling concatenate in a loop is awfully slow, whereas a Python list can be expanded much more efficiently.
I would have preferred something like hpaulj's answer, with an array initially created and then filled. Even if obs_data is not a single row (as they seem to assume) but several rows, it is not really a problem. Something like
p = 0
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i+window)]
    results[p:p+len(obs_data),:] = obs_data
    p += len(obs_data)
would do.
But the problem here is to estimate the size of results. With your example, with uniformly distributed hours, it is quite easy: (len(set(hours))-window+1)*window*(len(hours)/len(set(hours))).
But I guess in reality, each obs_data has a different size.
So, the only way to compute the size of results in advance would be to do a first iteration just to compute the sum of len(obs_data), and then a second to store obs_data. So vstack, even if not entirely satisfying, is probably the best option.
Anyway, it is a very visible improvement over your version (on my computer, 22 seconds vs. less than 1).
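For completeness, a two-pass sketch of that idea (reusing df, hours, window and n exactly as defined above) might look like this:
import numpy as np

# first pass: count how many rows fall in each window
counts = [np.count_nonzero((hours >= i) & (hours <= i + window)) for i in range(n)]

# second pass: fill a preallocated array
results = np.empty((sum(counts), 4))
p = 0
for i in range(n):
    obs_data = df[(hours >= i) & (hours <= i + window)]
    results[p:p + len(obs_data), :] = obs_data
    p += len(obs_data)
It pays for the extra pass over the data, so whether it beats vstack depends on how expensive the boolean masking is relative to the copies.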

Python Dask - Group-by performance on all columns

I want to count the number of unique rows in my data. Below is a quick input/output example.
#input
A,B
0,0
0,1
1,0
1,0
1,1
1,1
#output
A,B,count
0,0,1
0,1,1
1,0,2
1,1,2
The data in my pipeline has more than 5000 columns and more than 1M rows; each cell is a 0 or a 1. Below are my two attempts at scaling with Dask (with 26 columns):
import numpy as np
import string
import time
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=6, threads_per_worker=2, processes=True)
columns = list(string.ascii_uppercase)
data = np.random.randint(2, size = (1000000, len(columns)))
ddf_parent = dd.from_pandas(pd.DataFrame(data, columns = columns), npartitions=20)
#1st solution
ddf = ddf_parent.astype(str)
ddf_concat = ddf.apply(''.join, axis=1).to_frame()
ddf_concat.columns = ['pattern']
ddf_concat = ddf_concat.groupby('pattern').size()
start = time.time()
ddf_concat = ddf_concat.compute()
print(time.time()-start)
#2nd solution
ddf_concat_other = ddf_parent.groupby(list(ddf.columns)).size()
start = time.time()
ddf_concat_other = ddf_concat_other.compute()
print(time.time() - start)
results:
9.491615056991577
12.688117980957031
The first solution first concatenates every column into a string and then runs the group-by on it. The second one just groups by all the columns. I am leaning toward the first one as it is faster in my tests, but I am open to suggestions. Feel free to completely change my solution if there is anything better in terms of performance. (Also, interestingly, sort=False does not speed up the group-by, which may actually be related to this: https://github.com/dask/dask/issues/5441 and this: https://github.com/rapidsai/cudf/issues/2717)
NOTE:
After some testing, the first solution scales relatively well with the number of columns. I guess one improvement could be to hash the strings to always have a fixed length (a rough sketch of that idea is shown below). Any suggestion on the partition number in this case? From the remote dashboard I can see that after a couple of operations the nodes in the computational graph reduce to only 3, not taking advantage of the other workers available.
The second solution fails when the number of columns increases.
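A rough sketch of that hashing idea (hypothetical, using hashlib.md5 on the concatenated 'pattern' column from the 1st solution before it is grouped; the names below are just for illustration):
import hashlib

# map each concatenated pattern string to a fixed-length digest
ddf_concat['pattern'] = ddf_concat['pattern'].map(
    lambda s: hashlib.md5(s.encode()).hexdigest(),
    meta=('pattern', 'object'),
)
ddf_hashed = ddf_concat.groupby('pattern').size()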
NOTE2:
Also, with the first solution, something really strange is happening with what I guess is how Dask schedules and maps operations. After some time a single worker gets many more tasks than the others, then that worker exceeds 95% of its memory and crashes; the tasks are then split correctly, but after some time another worker gets more tasks (and the cycle restarts). The pipeline runs fine, but I was wondering if this is the expected behavior. Attached a screenshot.

Computation between two large columns in a Pandas Dataframe

I have a dataframe that has 2 columns of zip codes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes with no completion, so I feel what I'm doing is extremely inefficient.
Here is the code:
import pgeocode
dist = pgeocode.GeoDistance('us')
def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency-wise, that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
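For illustration, a toy example of that vectorized call (the zip codes below are arbitrary placeholders, not from the question):
import pandas as pd
import pgeocode

dist = pgeocode.GeoDistance('us')

# arbitrary example zip codes, purely for illustration
zips = pd.DataFrame({'zipstart': ['90210', '10001'],
                     'zipend':   ['60601', '94105']})

# one vectorized call instead of one call per row
zips['distance'] = dist.query_postal_code(zips['zipstart'].values,
                                          zips['zipend'].values)
print(zips)  # distances come back as an array, in kilometres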

Looking for faster loops for big dataframes

I have a very simple loop that just takes too long to iterate over my big dataframe.
value = df.at[n,'column_A']
for y in range(0,len(df)):
    index = df[df['column_B'].ge(value_needed)].index[y]
    if index > n:
        break
With this, I'm trying to find the first index that has a value greater than value_needed. The problem is that this loop is just too inefficient to run when len(df) > 200000.
Any ideas on how to solve this issue?
In general you should try to avoid loops with pandas; here is a vectorized way to get what you want:
df.loc[(df['column_B'].ge(value_needed)) & (df.index > n)].index[0]
I wish you had provided sample data. Try this on your data and let me know what you get:
import numpy as np
index = np.where(df['column_B'] > value_needed)[0].flat[0]
Then
#continue with other logic
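If column_B happens to be sorted (an assumption not stated in the question), np.searchsorted could find that position without scanning the whole column:
import numpy as np

# assumes df['column_B'] is sorted ascending and at least one value qualifies;
# side='left' gives the first element >= value_needed (use side='right' for strictly greater)
pos = np.searchsorted(df['column_B'].to_numpy(), value_needed, side='left')
first_index = df.index[pos]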

Looking for ways to improve the speed of my python script that uses the Pandas library

I am fairly new to Pandas, and have started using the library to work with data sets in Power BI. I recently had to write a snippet of code to run some calculations on a column of integers, but had a hard time translating my code from standard Python to Pandas. The code essentially casts the column to a list, then runs a loop on the items in the list, appending the resulting number to a new list that I then make into its own column.
I have read that running loops in Pandas can be slow, and the execution of the code below does indeed seem slow. Any help pointing me in the right direction would be much appreciated!
Here is the code that I am trying to optimize:
import pandas as pd
df = dataset  # Required step in Power BI
gb_list = df['Estimated_Size'].T.tolist()
hours_list = []
for size in gb_list:
    hours = -0.50
    try:
        for count in range(0, round(size)):
            if count % 100 == 0:
                hours += .50
            else:
                continue
    except:
        hours = 0
    hours_list.append(hours)
df['Total Hours'] = hours_list
IIUC, your code is equivalent to:
df['Total Hours'] = (df['Estimated_Size'] // 100) * 0.5
Except that I'm not clear what value you want when Estimated_Size is exactly 100.
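As a quick, hypothetical sanity check comparing the original loop with the one-liner on a few sizes (they differ at 0 and at exactly 100, which is the caveat above):
import pandas as pd

def loop_hours(size):
    # same logic as the original loop, wrapped in a function
    hours = -0.50
    try:
        for count in range(0, round(size)):
            if count % 100 == 0:
                hours += .50
    except Exception:
        hours = 0
    return hours

sizes = pd.Series([0, 50, 100, 101, 250])
print(sizes.map(loop_hours).tolist())   # [-0.5, 0.0, 0.0, 0.5, 1.0]
print(((sizes // 100) * 0.5).tolist())  # [0.0, 0.0, 0.5, 0.5, 1.0]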
