How to Parallelize python loop with large dataset - python

I am trying to construct hierarchies from a dataset in which each row represents a student, a course they have taken, and some other metadata. From this dataset, I'm trying to construct an adjacency matrix and determine the hierarchies based on which classes students have taken and the paths that different students follow when choosing classes.
Constructing this adjacency matrix is computationally expensive. Here is the code I have currently, which has been running for around 2 hours.
uniqueStudentIds = df.Id.unique()
uniqueClasses = df['Course_Title'].unique()
for studentID in uniqueStudentIds:
    for course1 in uniqueClasses:
        for course2 in uniqueClasses:
            if (course1 != course2 and have_taken_both_courses(course1, course2, studentID)):
                x = vertexDict[course1]
                y = vertexDict[course2]
                # Assuming symmetry
                adjacency_matrix[x][y] += 1
                adjacency_matrix[y][x] += 1
                print(course1 + ', ' + course2)

def have_taken_both_courses(course1, course2, studentID):
    hasTakenFirstCourse = len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0
    if hasTakenFirstCourse:
        return len(df.loc[(df['Course_Title'] == course2) & (df['Id'] == studentID)]) > 0
    return False
Given that I have a very large dataset, I have tried to consult online resources on parallelizing/multithreading this computationally expensive for loop. However, I'm new to Python and multiprocessing, so any guidance would be greatly appreciated!

It appears you are looping far more than you have to. For every student you do NxN iterations, where N is the total number of classes, but each student has only taken a subset of those classes, so you can cut down on iterations significantly.
Your have_taken_both_courses() lookup is also more expensive than it needs to be: it scans the whole DataFrame twice on every call.
Something like this will probably go a lot faster:
import numpy as np
import itertools
import pandas as pd
df = pd.read_table('/path/to/data.tsv')
# map each student and each class to an integer key
students_df = pd.DataFrame(df['student'].unique())
students_lkp = {x[1][0]: x[0] for x in students_df.iterrows()}
classes_df = pd.DataFrame(df['class'].unique())
classes_lkp = {x[1][0]: x[0] for x in classes_df.iterrows()}
df['student_key'] = df['student'].apply(lambda x: students_lkp[x])
df['class_key'] = df['class'].apply(lambda x: classes_lkp[x])
df.set_index(['student_key', 'class_key'], inplace=True)
matr = np.zeros((len(classes_df), len(classes_df)))
for s in range(0, len(students_df)):
    print(s)
    # get all the classes for this student
    classes = df.loc[s].index.unique().tolist()
    # every ordered pair of distinct classes taken by this student
    for x, y in itertools.permutations(classes, 2):
        matr[x][y] += 1
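As a further sketch along the same lines (a different technique from the loop above, not something from the original post): if the data fits in memory, the pair counts can also be obtained without any Python-level loop by building a student-by-course indicator table and multiplying it by its transpose. The function name `cooccurrence_matrix` is made up for illustration; the column names `Id` and `Course_Title` are the ones used in the question.

```python
import numpy as np
import pandas as pd

def cooccurrence_matrix(df, student_col="Id", course_col="Course_Title"):
    """Count, for every pair of courses, how many students took both."""
    # rows: students, columns: courses, entries: number of matching rows
    counts = pd.crosstab(df[student_col], df[course_col])
    # 1 if the student took the course at least once, else 0
    taken = (counts.to_numpy() > 0).astype(np.int64)
    # (courses x students) @ (students x courses) -> pair counts
    adj = taken.T @ taken
    # the diagonal holds per-course enrolment, which is not a pair
    np.fill_diagonal(adj, 0)
    return pd.DataFrame(adj, index=counts.columns, columns=counts.columns)

# tiny made-up example
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2, 2, 3],
    "Course_Title": ["A", "B", "A", "B", "C", "C"],
})
adj = cooccurrence_matrix(toy)
```

Each symmetric entry here is counted once per student, as in the `itertools.permutations` loop; the code in the question adds to both `[x][y]` and `[y][x]` for every ordered pair, so its counts would be twice as large.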


How to replace for loop here?

I have created this function in Python to generate different price combinations for a product dataset. So if the price of a product is $10, the possible prices would be [10,11,12,13,14,15].
For example:
df = pd.DataFrame({'Product_id': [1, 2], 'price_per_tire': [10, 110]})
My function:
def price_comb(df):
    K = [0,1,2,3,4,5]
    final_df = pd.DataFrame()
    for j in K:
        print('K count=' + str(j))
        for index,i in df.iterrows():
            # the step size depends on the price band
            if (i['price_per_tire']<=100):
                i['price_per_tire'] = i['price_per_tire'] + 1*j
            elif ((i['price_per_tire']>100) & (i['price_per_tire']<200)):
                i['price_per_tire'] = i['price_per_tire'] + 2*j
            elif ((i['price_per_tire']>200) & (i['price_per_tire']<300)):
                i['price_per_tire'] = i['price_per_tire'] + 3*j
            elif i['price_per_tire']>=300:
                i['price_per_tire'] = i['price_per_tire'] + 5*j
            final_df = final_df.append(i)
    return final_df
When I run this function, the output is
df = pd.DataFrame({'Product_id': [1,1,1,1,1,1, 2,2,2,2,2,2], 'price_per_tire': [10,11,12,13,14,15, 110,112,114,116,118,120]})
However, it's taking a lot of time (up to 2 days) for a 545k-row dataset. I'm trying to find ways to execute this faster. Any help would be appreciated.
Please provide a working version of the code; it is not clear where price_per_tire comes from.
This algorithm is effectively O(N^2), because every append copies the whole frame, so there is a lot of room for improvement.
My first suggestion is to avoid the for loop by using numpy or pandas: try to solve your problem with a vectorized approach.
This means that the inner loop can be refactored using a boolean mask:
for idx, row in df.iterrows():
    if row[fld] < limit:
        df.at[idx, fld] = f(row[fld])
can be refactored:
mask = df[fld] < limit
df.loc[mask, fld] = f(df.loc[mask, fld]) # if f(unction) can work on whole columns
df.loc[mask, fld] = df.loc[mask, fld].map(f) # element-wise fallback, but slower
With this approach, your code can become dramatically faster.
Another point is that df.append is not good practice, since it copies the frame on every call; making changes in place is more efficient. Create all the needed columns before the main loop so that the required space is allocated up front.
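To make the mask idea concrete, here is a minimal sketch of how the whole function could be vectorized. It is an illustration under assumptions, not the asker's final code: the name `price_comb_vectorized` is invented, the price bands are copied from the question (so a price of exactly 200 falls through every branch and stays unchanged), and it avoids `DataFrame.append`, which was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

def price_comb_vectorized(df, k_values=(0, 1, 2, 3, 4, 5)):
    """Return one row per (product, k) with the price raised by step * k."""
    price = df["price_per_tire"].to_numpy()
    # step per product, mirroring the if/elif bands in the question
    step = np.select(
        [
            price <= 100,
            (price > 100) & (price < 200),
            (price > 200) & (price < 300),
            price >= 300,
        ],
        [1, 2, 3, 5],
        default=0,  # no band matches (e.g. exactly 200): price is left as is
    )
    k = np.asarray(k_values)
    # repeat every input row once per k, keeping rows of a product together
    out = df.loc[df.index.repeat(len(k))].reset_index(drop=True)
    out["price_per_tire"] = (
        np.repeat(price, len(k)) + np.repeat(step, len(k)) * np.tile(k, len(df))
    )
    return out

df = pd.DataFrame({"Product_id": [1, 2], "price_per_tire": [10, 110]})
result = price_comb_vectorized(df)
```

The rows come out grouped by product, as in the output listed in the question; this assumes the input index is unique.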

Can I store data from a for-loop as a different variable for each iteration?

I have a function which creates a set of results in a list. This is in a for-loop which changes one of the variables in each iteration. I need to be able to store these lists separately so that I can show the difference in results between each iteration as a graph.
Is there any way to store them separately like that? So far the only solution I've found is to copy out the function multiple times and manually change the variable and name of the list it stores to, but obviously this is a terrible way of doing it and I figure there must be a proper way.
Here is the code. The function is messy but works. Ideally I would be able to put this all in another for-loop which changes deceleration_p each iteration and then stores collected_averages as a different list so that I could compare collected_averages for each iteration.
import numpy as np
import random
import matplotlib.pyplot as plt
from statistics import mean
road_length = 500
deceleration_p = 0.1
max_speed = 5
buffer_road = np.zeros(road_length, dtype=int)
buffer_speed = 0
number_of_iterations = 1000
average_speed = 0
average_speed_list = []
collected_averages = []
total_speed = 0
for cars in range(1, road_length):
    empty_road = np.ones(road_length - cars, dtype=int) * -1
    cars_on_road = np.ones(cars, dtype=int)
    road = np.append(empty_road, cars_on_road)
    for i in range(0, number_of_iterations):
        # acceleration
        for speed in np.nditer(road, op_flags=['readwrite']):
            if -1 < speed < max_speed:
                speed[...] += 1
        # randomisation
        for speed in np.nditer(road, op_flags=['readwrite']):
            if 0 < speed:
                if deceleration_p > random.random():
                    speed += -1
        # slowing down
        for cell in range(0, road_length):
            speed = road[cell]
            for val in range(1, speed + 1):
                new_speed = val
                if (cell + val) > (road_length - 1):
                    val += -road_length
                if road[cell + val] > -1:
                    speed = val - 1
                    road[cell] = new_speed - 1
        buffer_road=np.ones(road_length, dtype=int)*-1
        for cell in range(0, road_length):
            speed = road[cell]
            buffer_cell = cell + speed
            if (buffer_cell) > (road_length - 1):
                buffer_cell += -road_length
            if speed > -1:
                total_speed += speed
                buffer_road[buffer_cell] = speed
        road = buffer_road
    average_speed = total_speed/cars
    average_speed = 0
    total_speed = 0
Not to my knowledge. As stated in the comments, you could use a dictionary, but my suggestion is to use a list. For every iteration of the loop, you could append the value. (From what I understood) You stated that your results are in a list, so you could make a 2D array. My recommendation would be to use a numpy array as it is much faster. Hopefully this was helpful.
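A minimal sketch of the dictionary idea mentioned above: wrap the simulation in a function and store one result list per parameter value, keyed by that value. `run_simulation` is a hypothetical stand-in that just returns made-up numbers; in practice it would contain the traffic model from the question and return `collected_averages`.

```python
import random

def run_simulation(deceleration_p, n_points=5, seed=0):
    """Hypothetical stand-in for the traffic model: returns a list of averages."""
    rng = random.Random(seed)
    # made-up values, only so the example produces something to store
    return [(1 - deceleration_p) * (1 + 0.01 * rng.random()) for _ in range(n_points)]

# one entry per parameter value instead of one hand-named variable per run
results = {}
for p in (0.1, 0.2, 0.3):
    results[p] = run_simulation(p)
```

Each list can then be plotted in a loop, for example `for p, averages in results.items(): plt.plot(averages, label=str(p))`.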

Optimize code for step function using only NumPy

I'm trying to optimize the function 'pw' in the following code using only NumPy functions (or perhaps list comprehensions).
from time import time
import numpy as np
def pw(x, udata):
    """
    Creates the step function
                 | 1, if d0 <= x < d1
                 | 2, if d1 <= x < d2
    pw(x,data) = ...
                 | N, if d(N-1) <= x < dN
                 | 0, otherwise
    where di is the ith element in data.
    INPUT:  x    -- interval which the step function is defined over
            data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    vals = np.arange(1,udata.shape[0]+1).reshape(udata.shape[0],1)
    pw_func = np.sum(np.where(np.greater_equal(x,udata)*np.less(x,np.roll(udata,-1)),vals,0),axis=0)
    return pw_func
N = 50000
x = np.linspace(0,10,N)
data = [1,3,4,5,5,7]
udata = np.unique(data)
ti = time()
# udata has to be a column vector so that it broadcasts against x
pw(x, udata.reshape(udata.shape[0],1))
tf = time()
print(tf - ti)
import cProfile
cProfile.run('pw(x, udata.reshape(udata.shape[0],1))')
The profiler is telling me that most of the overhead is coming from np.where (about 1 ms), but I'd like to create faster code if possible. It seems that performing the operations row-wise versus column-wise makes some difference, unless I'm mistaken, but I think I've accounted for it. I know that sometimes list comprehensions can be faster, but I couldn't figure out a faster way than what I'm doing using them.
Searchsorted seems to yield better performance but that 1 ms still remains on my computer:
def pw(xx, uu):
    """
    Creates the step function
                 | 1, if d0 <= x < d1
                 | 2, if d1 <= x < d2
    pw(x,data) = ...
                 | N, if d(N-1) <= x < dN
                 | 0, otherwise
    where di is the ith element in data.
    INPUT:  x    -- interval which the step function is defined over
            data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1,uu.shape[0]+1)
    pw_func = vals[inds[inds != uu.shape[0]]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0]-pw_func.shape[0]-num_mins)))
    return pw_func
This answer using piecewise seems pretty close, but that's on a scalar x0 and x1. How would I do it on arrays? And would it be more efficient?
Understandably, x may be pretty big but I'm trying to put it through a stress test.
I am still learning though so some hints or tricks that can help me out would be great.
There seems to be a mistake in the second function since the resulting array from the second function doesn't match the first one (which I'm confident that it works):
N1 = pw1(x,udata.reshape(udata.shape[0],1)).shape[0]
N2 = np.sum(pw1(x,udata.reshape(udata.shape[0],1)) == pw2(x,udata))
print(N1 - N2)
N1 - N2 comes out nonzero, i.e. some data points are not the same. So it seems that I don't know how to use 'searchsorted'.
Actually I fixed it:
pw_func = vals[inds[inds != uu.shape[0]]]
was changed to
pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)]-1]]
so at least the resulting arrays match. But the question still remains on whether there's a more efficient way of going about doing this.
Thanks Tin Lai for pointing out the mistake. This one should work
pw_func = vals[inds[(inds != uu.shape[0])*(inds != 0)]-1]
Maybe a more readable way of presenting it would be
non_endpts = (inds != uu.shape[0])*(inds != 0) # only consider the points in between the min/max data values
shift_inds = inds[non_endpts]-1 # searchsorted side='right' includes the left end point and not right end point so a shift is needed
pw_func = vals[shift_inds]
I think I got lost in all those brackets! I guess that's the importance of readability.
A very abstract yet interesting problem! Thanks for entertaining me, I had fun :)
P.S. I'm not sure about your pw2; I wasn't able to get it to output the same as pw1.
For reference the original pws:
def pw1(x, udata):
    vals = np.arange(1,udata.shape[0]+1).reshape(udata.shape[0],1)
    pw_func = np.sum(np.where(np.greater_equal(x,udata)*np.less(x,np.roll(udata,-1)),vals,0),axis=0)
    return pw_func
def pw2(xx, uu):
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1,uu.shape[0]+1)
    pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)]-1]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0]-pw_func.shape[0]-num_mins)))
    return pw_func
My first attempt utilised a lot of broadcasting operations from numpy:
def pw3(x, udata):
    # the None slice is to create new axis
    step_bool = x >= udata[None,:].T
    # we exploit the fact that bools are integer value of 1s
    # skipping the last value in "data"
    step_vals = np.sum(step_bool[:-1], axis=0)
    # for the step_bool that we skipped from previous step (last index)
    # we set it to zero so that we can negate the step_vals once we reached
    # the last value in "data"
    step_vals[step_bool[-1]] = 0
    return step_vals
After looking at the searchsorted call in your pw2, I came up with a new approach that utilises it with much higher performance:
def pw4(x, udata):
    inds = np.searchsorted(udata, x, side='right')
    # fix up the last data if x is already out of range of data[-1]
    if x[-1] > udata[-1]:
        inds[inds == inds[-1]] = 0
    return inds
Plots with:
plt.plot(pw1(x,udata.reshape(udata.shape[0],1)), label='pw1')
plt.plot(pw2(x,udata), label='pw2')
plt.plot(pw3(x,udata), label='pw3')
plt.plot(pw4(x,udata), label='pw4')
[Figure: pw1 to pw4 plotted with data = [1,3,4,5,5,7]]
[Figure: pw1 to pw4 plotted with data = [1,3,4,5,5,7,11]]
pw1,pw3,pw4 are all identical
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw3(x,udata)))
>>> True
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw4(x,udata)))
>>> True
Performance (timeit's repeat() ran 3 times here; each figure is the total time in seconds for number=1000 calls, not an average):
print(timeit.Timer('pw1(x,udata.reshape(udata.shape[0],1))', "from __main__ import pw1, x, udata").repeat(number=1000))
>>> [3.1938983199979702, 1.6096494779994828, 1.962694135003403]
print(timeit.Timer('pw2(x,udata)', "from __main__ import pw2, x, udata").repeat(number=1000))
>>> [0.6884554479984217, 0.6075002400029916, 0.7799002879983163]
print(timeit.Timer('pw3(x,udata)', "from __main__ import pw3, x, udata").repeat(number=1000))
>>> [0.7369808239964186, 0.7557657590004965, 0.8088172269999632]
print(timeit.Timer('pw4(x,udata)', "from __main__ import pw4, x, udata").repeat(number=1000))
>>> [0.20514375300263055, 0.20203858999957447, 0.19906871100101853]
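One caveat with pw4 is that its fix-up only looks at `x[-1]`, so it relies on x being sorted. The following is a small variant sketch (the name `pw_searchsorted` is made up here) that zeroes every index equal to `len(udata)` instead, which does not depend on the order of x and also handles x exactly equal to the last data value, where the definition gives 0.

```python
import numpy as np

def pw_searchsorted(x, udata):
    """Step function: k for d(k-1) <= x < dk, and 0 outside [d0, dN)."""
    # side='right' counts how many data values are <= x
    inds = np.searchsorted(udata, x, side='right')
    # x >= last data value is outside the last interval, so map it to 0
    inds[inds == udata.shape[0]] = 0
    return inds

udata = np.unique([1, 3, 4, 5, 5, 7])
x_test = np.array([0.5, 1.0, 2.9, 3.0, 4.5, 5.0, 6.9, 7.0, 10.0])
steps = pw_searchsorted(x_test, udata)
```

Values below d0 already get index 0 from searchsorted, so no separate handling is needed at the lower end.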

How to improve the speed of specific routines?

I'm trying to work with some basic actuarial mathematics in Python.
I have a database of 1000+ people and their pension information.
In this problem, I'm working with these variables:
m_age: Age of the insured person in months
m_tmpness: Temporariness of the benefit, if temporary.
m_tmbenef: Time elapsed, in months, from the start of the benefit.
m_interest: Interest rate for the benefit.
tableid: ID number for an actuarial table.
mytable_n: qx(probability of dying before the end of age x) from an actuarial table. I have several tables, so it's mytable_1, mytable_2, ... mytable_n
In[2]: m_age
Out[2]: [877, 877, 797, 797, 794]
In[3]: m_tmpness
Out[3]: [240, 240, 0, 120, 120]
In[4]: m_tmbenef
Out[4]: [101, 28, 0, 118, 118]
In[5]: m_interest
Out[5]: [0.0016515813019202241,
In[6]: mytable_1
0 0.000337
1 0.000337
2 0.000337
1500 1.000000
Name: at49m, Length: 1501, dtype: float64
I have calculated lx (number of people living at age x) values for each table, to support calculating Nx and Dx. I have to calculate Dx and Nx for each person in my data base, according to their data.
Dx is simply lx * 1/(1+interest)^x. Nx is the sum of all Dx values from a certain point. If x = 0, Nx = D0 + D1 + D2 + ... + Dn. If x = 50, Nx = D50 + D51 + ... + Dn.
It's absolutely easy to calculate Dx for each person, but I'm struggling with it because I need all values from Dx until certain age for each person to calculate their Nx at a certain age.
So, that's what I've been trying so far:
import pandas
lx_mytable_1 = [100000 if i==0 else 0 for i in range(len(mytable_1))]
for i in range(1, len(mytable_1)):  # start at 1: lx[0] is the starting population
    lx_mytable_1[i] = lx_mytable_1[i-1]*(1-mytable_1[i-1])
### Replicate it to n tables
### Dx and Nx
def Dx(x,lx,qx,interest):
    D_x = [((1/(1+interest))**i)*lx[i] for i in range(len(qx))]
    return D_x[x]
def Nx(x,lx,qx,interest):
    N_x = 0
    for i in range(x, len(qx)):  # Nx = Dx + D(x+1) + ... + Dn
        N_x = N_x + Dx(x=i,lx=lx,qx=qx,interest=interest)
    return N_x
### And one should run it like this, for example:
### Nx(x=100,lx=lx_mytable_1,qx=mytable_1,interest=m_interest)
aux_NX = [0 for i in range(len(tableid))]
for i in range(len(tableid)):
    if (tableid[i] == 0):
        aux_NX[i] = 0.0
    if (tableid[i] == 1):
        aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_1,qx=mytable_1,interest=m_interest[i])
    if (tableid[i] == 2):
        aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_2,qx=mytable_2,interest=m_interest[i])
    if (tableid[i] == 3):
        aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_3,qx=mytable_3,interest=m_interest[i])
    if (tableid[i] == 4):
        aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_4,qx=mytable_4,interest=m_interest[i])
    if (tableid[i] == 5):
        aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_5,qx=mytable_5,interest=m_interest[i])
### And as many elses and ifs as necessary... Currently I'm using 15 tables.
When I run it for a single line, that's fine. But when I run it for the 1000+ lines, it can take hours to run properly. Probably it's because I'm calling a for loop in Nx, and Dx is using another for loop, with 1500 iterations in Dx and Nx...
My question is: is there a computationally faster way to do the same? How?
I recommend some testing with the timeit module.
You can then find out exactly which parts of your code are taking a long time to run, and which are running quickly. You can then use this knowledge to ask a more specific question about how to optimise your code.
Quickly scanning your code, I would recommend using a dictionary to store your lx_mytable_n tables, keyed by table ID. Then you can just refer to them as lx_mytable[tableid[i]] inside your existing for loop, rather than writing 10 if/else statements. This would make your code much cleaner, and you wouldn't have to write a new line of code as your data grows.
As a side note, try using the elif syntax rather than else: if: on separate lines.
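Here is a rough sketch of how the dictionary suggestion could look when combined with numpy, so that all Dx values for one person are computed once and Nx becomes a reversed cumulative sum rather than nested Python loops. The helper names `lx_from_qx` and `nx_table` are invented for this example, and the three-age table is made-up toy data, not a real actuarial table.

```python
import numpy as np

def lx_from_qx(qx, radix=100000.0):
    """l_0 = radix, l_(x+1) = l_x * (1 - q_x)."""
    qx = np.asarray(qx, dtype=float)
    lx = np.empty(len(qx))
    lx[0] = radix
    lx[1:] = radix * np.cumprod(1.0 - qx[:-1])
    return lx

def nx_table(lx, interest):
    """N_x for every age x, where D_x = l_x * (1 + interest) ** -x."""
    ages = np.arange(len(lx))
    dx = lx * (1.0 + interest) ** (-ages)
    # reversed cumulative sum: N_x = D_x + D_(x+1) + ... + D_n
    return dx[::-1].cumsum()[::-1]

# toy data: one table with three ages, looked up by table ID
qx_tables = {1: [0.1, 0.2, 1.0]}
lx_tables = {tid: lx_from_qx(qx) for tid, qx in qx_tables.items()}

tableid = [1, 0, 1]
ages = [0, 0, 2]
rates = [0.0, 0.0, 0.0]
aux_NX = [
    0.0 if t == 0 else nx_table(lx_tables[t], r)[a]
    for t, a, r in zip(tableid, ages, rates)
]
```

Because the interest rate differs per person, `nx_table` is still called once per person, but each call is a handful of vectorized operations over the 1501 ages instead of roughly 1500 x 1500 Python-level steps.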

Compute Higher Moments of Data Matrix

This probably leads to scipy/numpy, but right now I'm happy with any functionality, as I couldn't find anything in those packages. I have a matrix that contains data for a multivariate distribution (let's say with 2 variables, for the fun of it). Is there any function to compute (higher) moments of that? All I could find was numpy.mean() and numpy.cov() :o
Thanks :)
So some more detail: I have multivariate data, that is, a matrix where rows are variables and columns are observations. Now I would like to have a simple way of computing the joint moments of that data, as defined on Wikipedia.
I'm pretty new to python/scipy so I'm not sure I'd be the best person to code this one up, especially for the n-variables case (note that the wikipedia definition is for n=2), and I kind of expected there to be some out-of-the-box thing to use as I thought this would be a standard problem.
Just for the future, in case someone wants to do something similar, the following code (which is still under review) should give the sample equivalent of the raw moments E(X^2), E(Y^2), etc. It only works for two variables right now, but it should be extendable if one feels the need. If you see some mistakes or unclean/unpython-nish code, feel free to comment.
from numpy import *
# this function should return something as
# moments[0] = 1
# moments[1] = mean(X), mean(Y)
# moments[2] = 1/n*X'X, 1/n*X'Y, 1/n*Y'Y
# moments[3] = mean(X'X'X), mean(X'X'Y), mean(X'Y'Y),
#              mean(Y'Y'Y)
# etc
def getRawMoments(data, moment, axis=0):
    a = moment
    if (axis==0):
        # n stays an integer so it can be used in shapes and slices
        n = data.shape[1]
        X = matrix(data[0,:]).reshape((n,1))
        Y = matrix(data[1,:]).reshape((n,1))
    else:
        n = data.shape[0]
        X = matrix(data[:,0]).reshape((n,1))
        Y = matrix(data[:,1]).reshape((n,1))
    result = 1
    Z = hstack((X,Y))
    iota = ones((1,n))
    moments = {}
    moments[0] = 1
    #first, generate huge-ass matrix containing all x-y combinations
    # for every power-combination k,l such that k+l = i
    # for all 0 <= i <= a
    for i in arange(1,a):
        if i==2:
            moments[i] = moments[i-1]*Z
        # if even, postmultiply with X.
        elif i%2 == 1:
            moments[i] = kron(moments[i-1], Z.T)
        # Else, postmultiply with X.T
        elif i%2==0:
            temp = moments[i-1]
            temp2 = temp[:,0:n]*Z
            temp3 = temp[:,n:2*n]*Z
            moments[i] = hstack((temp2, temp3))
    # since now we have many multiple moments
    # such as x**2*y and x*y*x, filter non-distinct elements
    momentsDistinct = {}
    momentsDistinct[0] = 1
    for i in arange(1,a):
        if i%2 == 0:
            data = 1.0/n*moments[i]
        elif i == 1:
            temp = moments[i]
            temp2 = temp[:,0:n]*iota.T
            data = 1.0/n*hstack((temp2))
        else:
            temp = moments[i]
            temp2 = temp[:,0:n]*iota.T
            temp3 = temp[:,n:2*n]*iota.T
            data = 1.0/n*hstack((temp2, temp3))
        momentsDistinct[i] = unique(data.flat)
    return momentsDistinct
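For comparison, a much shorter sketch that computes a single sample raw joint moment directly from its definition, the mean over observations of X1^k1 * X2^k2 * ... * Xn^kn. This is a different and simpler approach than the Kronecker-product code above, and the function name `raw_moment` is made up. It assumes the layout described in the question (rows are variables, columns are observations) and works for any number of variables.

```python
import numpy as np

def raw_moment(data, powers):
    """Sample estimate of E[X1**k1 * ... * Xn**kn].

    data:   array of shape (n_variables, n_observations)
    powers: one non-negative exponent per variable
    """
    data = np.asarray(data, dtype=float)
    # column vector of exponents so each row is raised to its own power
    powers = np.asarray(powers).reshape(-1, 1)
    # multiply across variables per observation, then average
    return np.prod(data ** powers, axis=0).mean()

data = np.array([[1.0, 2.0, 3.0],   # X
                 [2.0, 4.0, 6.0]])  # Y
m_xy = raw_moment(data, [1, 1])  # sample E[X*Y]
m_xx = raw_moment(data, [2, 0])  # sample E[X**2]
```

All moments of a given total order can then be collected by looping over the exponent combinations that sum to that order.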

