How to improve the speed of specific routines in Python?

I'm trying to work with some basic actuarial mathematics in Python.
I have a database consisting of 1000+ people and their pension information.
In this problem, I'm working with these variables:
m_age: Age of the insured person, in months.
m_tmpness: Temporariness of the benefit, if the benefit is temporary.
m_tmbenef: Time elapsed, in months, since the start of the benefit.
m_interest: Interest rate for the benefit.
tableid: ID number of an actuarial table.
mytable_n: qx (probability of dying before the end of age x) from an actuarial table. I have several tables, so they are named mytable_1, mytable_2, ..., mytable_n.
In[2]: m_age
Out[2]: [877, 877, 797, 797, 794]
In[3]: m_tmpness
Out[3]: [240, 240, 0, 120, 120]
In[4]: m_tmbenef
Out[4]: [101, 28, 0, 118, 118]
In[5]: m_interest
Out[5]: [0.0016515813019202241,
0.0016515813019202241,
0.0023039138595752906,
0.0040741237836483535,
0.0040741237836483535]
In[6]: mytable_1
Out[6]:
0 0.000337
1 0.000337
2 0.000337
...
1500 1.000000
Name: at49m, Length: 1501, dtype: float64
I have calculated lx (number of people alive at age x) for each table, to support calculating Nx and Dx. I have to calculate Dx and Nx for each person in my database, according to their data.
Dx is simply lx * 1/(1+interest)^x. Nx is the sum of all Dx values from a certain age onwards: if x = 0, Nx = D0 + D1 + D2 + ... + Dn; if x = 50, Nx = D50 + D51 + ... + Dn.
Calculating Dx for each person is easy, but I'm struggling because, to get each person's Nx at a certain age, I need all the Dx values from that age onwards.
So, that's what I've been trying so far:
import pandas
lx_mytable_1 = [100000 if i == 0 else 0 for i in range(len(mytable_1))]
for i in range(1, len(mytable_1)):  # start at 1 so the radix lx[0] = 100000 is kept
    lx_mytable_1[i] = lx_mytable_1[i-1]*(1-mytable_1[i-1])
### Replicate it to n tables
### Dx and Nx
def Dx(x, lx, qx, interest):
    D_x = [((1/(1+interest))**i)*lx[i] for i in range(len(qx))]
    return D_x[x]

def Nx(x, lx, qx, interest):
    N_x = 0
    for i in range(x, len(qx)):  # Nx sums Dx from age x onwards, as defined above
        N_x = N_x + Dx(x=i, lx=lx, qx=qx, interest=interest)
    return N_x
### And one should run it like this, for example:
### Nx(x=100,lx=lx_mytable_1,qx=mytable_1,interest=m_interest)
aux_NX = [0 for i in range(len(tableid))]
for i in range(len(tableid)):
    if (tableid[i] == 0):
        aux_NX[i] = 0.0
    else:
        if (tableid[i] == 1):
            aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_1,qx=mytable_1,interest=m_interest[i])
        else:
            if (tableid[i] == 2):
                aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_2,qx=mytable_2,interest=m_interest[i])
            else:
                if (tableid[i] == 3):
                    aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_3,qx=mytable_3,interest=m_interest[i])
                else:
                    if (tableid[i] == 4):
                        aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_4,qx=mytable_4,interest=m_interest[i])
                    else:
                        if (tableid[i] == 5):
                            aux_NX[i] = Nx(x=PBCIDADE[i],lx=lx_mytable_5,qx=mytable_5,interest=m_interest[i])
### And as many elses and ifs as necessary... Currently I'm using 15 tables.
When I run it for a single row, it's fine. But when I run it for all 1000+ rows, it can take hours. That is probably because Nx contains a for loop, and each call to Dx runs another loop, with 1500 iterations in both Dx and Nx...
My question is: is there a computationally faster way to do the same thing? How?

I recommend some testing with the timeit module.
You can then find out exactly which parts of your code are taking a long time to run and which are running quickly, and use that knowledge to ask a more specific question about how to optimise your code.
Quickly scanning your code, I would recommend using a dictionary to store your lx_mytable_n tables. Then you can just refer to them as lx_mytable[i] and use the same index in your existing for loop, rather than a long chain of if/else statements. This would make your code much cleaner, and you wouldn't have to write new code as your data grows.
As a side note, use the elif syntax rather than else: and if: on separate lines.
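For illustration, here is a minimal sketch of that dictionary idea, reusing the names from the question (Nx, PBCIDADE, tableid, m_interest and the per-table lx/qx objects); extend the dictionaries to however many tables you actually have:
# Sketch only: map each tableid to its lx list and qx table once, up front.
lx_tables = {1: lx_mytable_1, 2: lx_mytable_2}   # ... add entries up to table 15
qx_tables = {1: mytable_1, 2: mytable_2}         # ... add entries up to table 15

aux_NX = [0.0] * len(tableid)
for i in range(len(tableid)):
    t = tableid[i]
    if t == 0:
        aux_NX[i] = 0.0
    else:
        aux_NX[i] = Nx(x=PBCIDADE[i], lx=lx_tables[t], qx=qx_tables[t],
                       interest=m_interest[i])
The dictionary lookup replaces the whole if/elif chain, and adding a sixteenth table becomes a one-line change.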

Related

Chunk a variable into parts and sum the total in each part

My dataset has 2 million observations. I want to split it into 200 categories based on the value of a variable, 'rv'. For example, imagine I had the categories 0-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000. I would want to split an observation with value 4500 like this: 1000 in each of the first 4 categories, and 500 in the final category. I have the following code, which works but is very slow:
# create random data set
import pandas as pd
import numpy as np
data = np.random.randint(0, 5000, size=2000)
df = pd.DataFrame({'rv': data})
#%% slice
sizes = [0, 1000, 2000, 3000, 4000, 5000]
size_names = ['{:.0f} to {:.0f}'.format(lower, upper) for lower, upper in zip(sizes[0:-1], sizes[1:])]
for lower, upper, name in zip(sizes[0:-1], sizes[1:], size_names):
    df[name] = df['rv'].apply(lambda x: max(0, (min(x, upper) - lower)))
# summary table
df_slice = df[size_names].sum()
Are there better ways of doing this, where better means faster principally? With 2 million observations and 200 categories this takes quite a long time (not sure how long as I stopped the code before it had finished).
I wrote an algorithm that sorts the data beforehand, which takes it from an O(n*m) loop (over the data and the categories) to an O(n) loop (just over the data, plus O(n log n) for the sort). By sorting, you already know which bin you're in and only have to handle the summing for that particular bin, then apply the sum to that bin and all bins below it once per bin. It takes about 1.2 seconds on 2 million data points over 200 categories. Hope it helps:
from time import time
from random import randint

data = [randint(0, 4999) for i in range(2000000)]
sizes = range(0, 5001, 25)
bound_pairs = [[sizes[i], sizes[i + 1]] for i in range(len(sizes) - 1)]
results = [0 for i in range(len(sizes) - 1)]

data.sort()

curr_bin = 0
curr_bin_count = 0
curr_bin_sum = 0

for d in data:
    if d >= bound_pairs[curr_bin][1]:
        results[curr_bin] += curr_bin_sum
        for i in range(curr_bin):
            results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
        curr_bin_count = 0
        curr_bin_sum = 0
        while d >= bound_pairs[curr_bin][1]:
            curr_bin += 1
    curr_bin_count += 1
    curr_bin_sum += d - bound_pairs[curr_bin][0]

results[curr_bin] += curr_bin_sum
for i in range(curr_bin):
    results[i] += curr_bin_count * (bound_pairs[i][1] - bound_pairs[i][0])
EDIT: There may be some issues here depending on whether you want the upper bound or lower bound to be inclusive or exclusive. I leave the particulars to you.
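For comparison only (this is a sketch, not part of the answer above), the per-category capping from the question can also be written with NumPy broadcasting; np.clip(x, lower, upper) - lower is the same quantity as max(0, min(x, upper) - lower). Note that with 2 million rows and 200 categories the intermediate (rows x categories) array gets large, so you may need to process the data in chunks:
import numpy as np
import pandas as pd

data = np.random.randint(0, 5000, size=2000)
sizes = np.arange(0, 5001, 1000)
lowers, uppers = sizes[:-1], sizes[1:]
size_names = ['{:.0f} to {:.0f}'.format(lo, hi) for lo, hi in zip(lowers, uppers)]

# clip each observation into [lower, upper] and subtract lower, per category
chunks = np.clip(data[:, None], lowers, uppers) - lowers
df_slice = pd.Series(chunks.sum(axis=0), index=size_names)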

Can I store data from a for-loop as a different variable for each iteration?

I have a function which creates a set of results in a list. This is in a for-loop which changes one of the variables in each iteration. I need to be able to store these lists separately so that I can show the difference in results between each iteration as a graph.
Is there any way to store them separately like that? So far the only solution I've found is to copy the function out multiple times and manually change the variable and the name of the list it stores to, but obviously that's a terrible way of doing it and I figure there must be a proper way.
Here is the code. The function is messy but works. Ideally I would be able to put this all in another for-loop that changes deceleration_p each iteration and then stores collected_averages as a different list, so that I could compare collected_averages between iterations.
import numpy as np
import random
import matplotlib.pyplot as plt
from statistics import mean
road_length = 500
deceleration_p = 0.1
max_speed = 5
buffer_road = np.zeros(road_length, dtype=int)
buffer_speed = 0
number_of_iterations = 1000
average_speed = 0
average_speed_list = []
collected_averages = []
total_speed = 0
for cars in range(1, road_length):
    empty_road = np.ones(road_length - cars, dtype=int) * -1
    cars_on_road = np.ones(cars, dtype=int)
    road = np.append(empty_road, cars_on_road)
    np.random.shuffle(road)
    for i in range(0, number_of_iterations):
        # acceleration
        for speed in np.nditer(road, op_flags=['readwrite']):
            if -1 < speed < max_speed:
                speed[...] += 1
        # randomisation
        for speed in np.nditer(road, op_flags=['readwrite']):
            if 0 < speed:
                if deceleration_p > random.random():
                    speed += -1
        # slowing down
        for cell in range(0, road_length):
            speed = road[cell]
            for val in range(1, speed + 1):
                new_speed = val
                if (cell + val) > (road_length - 1):
                    val += -road_length
                if road[cell + val] > -1:
                    speed = val - 1
                    road[cell] = new_speed - 1
                    break
        buffer_road = np.ones(road_length, dtype=int) * -1
        for cell in range(0, road_length):
            speed = road[cell]
            buffer_cell = cell + speed
            if (buffer_cell) > (road_length - 1):
                buffer_cell += -road_length
            if speed > -1:
                total_speed += speed
                buffer_road[buffer_cell] = speed
        road = buffer_road
        average_speed = total_speed/cars
        average_speed_list.append(average_speed)
        average_speed = 0
        total_speed = 0
    steady_state_average = mean(average_speed_list[9:number_of_iterations])
    average_speed_list = []
    collected_averages.append(steady_state_average)
Not to my knowledge. As stated in the comments, you could use a dictionary, but my suggestion is to use a list: on every iteration of the outer loop, append that run's results to it. Since your results are already lists, this gives you a 2D structure, and my recommendation would be to convert it to a NumPy array, as that is much faster to work with. Hopefully this is helpful.
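As a concrete sketch of that suggestion, assuming (hypothetically) that the simulation above is wrapped in a function run_simulation(deceleration_p) that returns its collected_averages list:
import numpy as np

def run_simulation(deceleration_p):
    # placeholder: in practice this is the `for cars in range(1, road_length)`
    # loop from the question, with collected_averages returned at the end
    collected_averages = []
    ...
    return collected_averages

deceleration_values = [0.1, 0.2, 0.3]
all_averages = [run_simulation(p) for p in deceleration_values]  # one list per run
results = np.array(all_averages)  # 2D array: one row per deceleration_p value
# results[0] holds the collected_averages for deceleration_p = 0.1, and so on,
# ready for plotting or comparing between runs.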

Optimize code for step function using only NumPy

I'm trying to optimize the function 'pw' in the following code using only NumPy functions (or perhaps list comprehensions).
from time import time
import numpy as np
def pw(x, udata):
    """
    Creates the step function

                 | 1, if d0 <= x < d1
                 | 2, if d1 <= x < d2
    pw(x,data) = ...
                 | N, if d(N-1) <= x < dN
                 | 0, otherwise

    where di is the ith element in data.
    INPUT: x -- interval which the step function is defined over
           data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    vals = np.arange(1,udata.shape[0]+1).reshape(udata.shape[0],1)
    pw_func = np.sum(np.where(np.greater_equal(x,udata)*np.less(x,np.roll(udata,-1)),vals,0),axis=0)
    return pw_func
N = 50000
x = np.linspace(0,10,N)
data = [1,3,4,5,5,7]
udata = np.unique(data)
ti = time()
pw(x,udata)
tf = time()
print(tf - ti)
import cProfile
cProfile.run('pw(x,udata)')
cProfile.run tells me that most of the overhead is coming from np.where (about 1 ms), but I'd like to create faster code if possible. It seems that performing the operations row-wise versus column-wise makes some difference, unless I'm mistaken, but I think I've accounted for it. I know that list comprehensions can sometimes be faster, but I couldn't figure out a faster way than what I'm already doing.
Searchsorted seems to yield better performance but that 1 ms still remains on my computer:
(modified)
def pw(xx, uu):
    """
    Creates the step function

                 | 1, if d0 <= x < d1
                 | 2, if d1 <= x < d2
    pw(x,data) = ...
                 | N, if d(N-1) <= x < dN
                 | 0, otherwise

    where di is the ith element in data.
    INPUT: x -- interval which the step function is defined over
           data -- an ordered set of data (without repetitions)
    OUTPUT: pw_func -- an array of size x.shape[0]
    """
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1,uu.shape[0]+1)
    pw_func = vals[inds[inds != uu.shape[0]]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0]-pw_func.shape[0]-num_mins)))
    return pw_func
This answer using piecewise seems pretty close, but that's on a scalar x0 and x1. How would I do it on arrays? And would it be more efficient?
Understandably, x may be pretty big but I'm trying to put it through a stress test.
I am still learning though so some hints or tricks that can help me out would be great.
EDIT
There seems to be a mistake in the second function, since the resulting array from the second function doesn't match the first one (which I'm confident works):
N1 = pw1(x,udata.reshape(udata.shape[0],1)).shape[0]
N2 = np.sum(pw1(x,udata.reshape(udata.shape[0],1)) == pw2(x,udata))
print(N1 - N2)
yields
15000
data points that are not the same. So it seems that I don't know how to use 'searchsorted'.
EDIT 2
Actually I fixed it:
pw_func = vals[inds[inds != uu.shape[0]]]
was changed to
pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)]-1]]
so at least the resulting arrays match. But the question still remains on whether there's a more efficient way of going about doing this.
EDIT 3
Thanks Tin Lai for pointing out the mistake. This one should work
pw_func = vals[inds[(inds != uu.shape[0])*(inds != 0)]-1]
Maybe a more readable way of presenting it would be
non_endpts = (inds != uu.shape[0])*(inds != 0) # only consider the points in between the min/max data values
shift_inds = inds[non_endpts]-1 # searchsorted side='right' includes the left end point and not right end point so a shift is needed
pw_func = vals[shift_inds]
I think I got lost in all those brackets! I guess that's the importance of readability.
A very abstract yet interesting problem! Thanks for entertaining me, I had fun :)
p.s. I'm not sure about your pw2; I wasn't able to get it to output the same as pw1.
For reference the original pws:
def pw1(x, udata):
    vals = np.arange(1,udata.shape[0]+1).reshape(udata.shape[0],1)
    pw_func = np.sum(np.where(np.greater_equal(x,udata)*np.less(x,np.roll(udata,-1)),vals,0),axis=0)
    return pw_func

def pw2(xx, uu):
    inds = np.searchsorted(uu, xx, side='right')
    vals = np.arange(1,uu.shape[0]+1)
    pw_func = vals[inds[inds[(inds != uu.shape[0])*(inds != 0)]-1]]
    num_mins = np.sum(xx < np.min(uu))
    num_maxs = np.sum(xx > np.max(uu))
    pw_func = np.concatenate((np.zeros(num_mins), pw_func, np.zeros(xx.shape[0]-pw_func.shape[0]-num_mins)))
    return pw_func
My first attempt utilised a lot of broadcasting operations from numpy:
def pw3(x, udata):
    # the None slice is to create a new axis
    step_bool = x >= udata[None,:].T
    # we exploit the fact that bools have integer value 1,
    # skipping the last value in "data"
    step_vals = np.sum(step_bool[:-1], axis=0)
    # for the step_bool that we skipped in the previous step (last index)
    # we set it to zero so that we can negate the step_vals once we reach
    # the last value in "data"
    step_vals[step_bool[-1]] = 0
    return step_vals
After looking at the searchsorted from your pw2, I had a new approach that utilises it with much higher performance:
def pw4(x, udata):
    inds = np.searchsorted(udata, x, side='right')
    # fix up the last values if x goes beyond the range of data[-1]
    if x[-1] > udata[-1]:
        inds[inds == inds[-1]] = 0
    return inds
Plots with:
plt.plot(pw1(x,udata.reshape(udata.shape[0],1)), label='pw1')
plt.plot(pw2(x,udata), label='pw2')
plt.plot(pw3(x,udata), label='pw3')
plt.plot(pw4(x,udata), label='pw4')
with data = [1,3,4,5,5,7] and with data = [1,3,4,5,5,7,11] (plots not shown here).
pw1, pw3, and pw4 are all identical:
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw3(x,udata)))
>>> True
print(np.all(pw1(x,udata.reshape(udata.shape[0],1)) == pw4(x,udata)))
>>> True
Performance: (timeit's repeat() here runs 3 repetitions, each timing number=N executions of the statement)
print(timeit.Timer('pw1(x,udata.reshape(udata.shape[0],1))', "from __main__ import pw1, x, udata").repeat(number=1000))
>>> [3.1938983199979702, 1.6096494779994828, 1.962694135003403]
print(timeit.Timer('pw2(x,udata)', "from __main__ import pw2, x, udata").repeat(number=1000))
>>> [0.6884554479984217, 0.6075002400029916, 0.7799002879983163]
print(timeit.Timer('pw3(x,udata)', "from __main__ import pw3, x, udata").repeat(number=1000))
>>> [0.7369808239964186, 0.7557657590004965, 0.8088172269999632]
print(timeit.Timer('pw4(x,udata)', "from __main__ import pw4, x, udata").repeat(number=1000))
>>> [0.20514375300263055, 0.20203858999957447, 0.19906871100101853]
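Regarding the np.piecewise idea raised in the question: np.piecewise does accept the whole array x directly, with one boolean condition array per interval. An illustrative sketch (my own, not taken from the linked answer, and not expected to beat searchsorted since every condition is still evaluated) that mirrors pw1's behaviour (value i+1 on [d_i, d_(i+1)), 0 elsewhere):
def pw_piecewise(x, udata):
    # one boolean mask per interval [d(i), d(i+1)); the matching value is i+1,
    # and anything not covered by a condition defaults to 0
    conds = [(x >= lo) & (x < hi) for lo, hi in zip(udata[:-1], udata[1:])]
    vals = list(range(1, udata.shape[0]))
    return np.piecewise(x, conds, vals)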

Finding the optimal location for router placement

I am looking for an optimization algorithm that takes a text file encoded with 0s, 1s, and -1s:
1's denoting target cells that require Wi-Fi coverage
0's denoting cells that are walls
-1's denoting cells that are void (do not require Wi-Fi coverage)
Example of text file: (not shown here)
I have created a solution function along with other helper functions, but I can't seem to get the optimal positions for the routers to ensure proper coverage. There is another file that does the printing; I am struggling with finding the optimal locations. I basically need to change the get_random_position function to get the optimal one, but I am unsure how to do that. The area covered by the various routers is:
Each router covers a square area of at most (2S+1)^2
Type 1: S=5; Cost=180
Type 2: S=9; Cost=360
Type 3: S=15; Cost=480
This is the kind of output I am getting: (screenshot not shown here)
My code is as follows:
import numpy as np
import time
from random import randint

def is_taken(taken, i, j):
    for coords in taken:
        if coords[0] == i and coords[1] == j:
            return True
    return False

def get_random_position(floor, taken, nrows, ncols):
    i = randint(0, nrows-1)
    j = randint(0, ncols-1)
    while floor[i][j] == 0 or floor[i][j] == -1 or is_taken(taken, i, j):
        i = randint(0, nrows-1)
        j = randint(0, ncols-1)
    return (i, j)

def solution(floor):
    start_time = time.time()
    router_types = [1,2,3]
    nrows, ncols = floor.shape
    ratio = 0.1
    router_scale = int(nrows*ncols*0.0001)
    if router_scale == 0:
        router_scale = 1
    row_ratio = int(nrows*ratio)
    col_ratio = int(ncols*ratio)
    print('Row : ',nrows, ', Col: ', ncols, ', Router scale :', router_scale)
    global_best = [0, ([],[],[])]
    taken = []
    while True:
        found_better = False
        best = [global_best[0], (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))]
        for times in range(0, row_ratio+col_ratio):
            if time.time() - start_time > 27.0:
                print('Time ran out! Using what I got : ', time.time() - start_time)
                return global_best[1]
            fit = []
            for rtype in router_types:
                interim = (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))
                for i in range(0, router_scale):
                    pos = get_random_position(floor, taken, nrows, ncols)
                    interim[0].append(pos[0])
                    interim[1].append(pos[1])
                    interim[2].append(rtype)
                fit.append((fitness(floor, interim), interim))
            highest_fitness = fit[0]
            for index in range(1, len(fit)):
                if fit[index][0] > highest_fitness[0]:
                    highest_fitness = fit[index]
            if highest_fitness[0] > best[0]:
                best[0] = highest_fitness[0]
                best[1] = (highest_fitness[1][0], highest_fitness[1][1], highest_fitness[1][2])
                found_better = True
                global_best = best
                taken.append((best[1][0][-1], best[1][1][-1]))
                break
        if found_better == False:
            break
    print('Best:')
    print(global_best)
    end_time = time.time()
    run_time = end_time - start_time
    print("Run Time:", run_time)
    return global_best[1]

def available_cells(floor):
    available = 0
    for i in range(0, len(floor)):
        for j in range(0, len(floor[i])):
            if floor[i][j] != 0:
                available += 1
    return available

def fitness(building, args):
    render = np.array(building, dtype=int, copy=True)
    cov_factor = 220
    cost_factor = 22
    router_types = { # type: [coverage, cost]
        1: {'size' : 5, 'cost' : 180},
        2: {'size' : 9, 'cost' : 360},
        3: {'size' : 15, 'cost' : 480},
    }
    routers_used = args[-1]
    for r, c, t in zip(*args):
        size = router_types[t]['size']
        nrows, ncols = render.shape
        rows = range(max(0, r-size), min(nrows, r+size+1))
        cols = range(max(0, c-size), min(ncols, c+size+1))
        walls = []
        for ri in rows:
            for ci in cols:
                if building[ri, ci] == 0:
                    walls.append((ri, ci))
        def blocked(ri, ci):
            for w in walls:
                if min(r, ri) <= w[0] and max(r, ri) >= w[0]:
                    if min(c, ci) <= w[1] and max(c, ci) >= w[1]:
                        return True
            return False
        for ri in rows:
            for ci in cols:
                if blocked(ri, ci):
                    continue
                if render[ri, ci] == 2:
                    render[ri, ci] = 4
                if render[ri, ci] == 1:
                    render[ri, ci] = 2
        render[r, c] = 5
    return (
        cov_factor * np.sum(render > 1) -
        cost_factor * np.sum([router_types[x]['cost'] for x in routers_used])
    )
Here's a suggestion on how to solve the problem; however, I don't claim this is the best approach, and it's certainly not the only one.
Main idea
Your problem can be modelled as a weighted minimum set cover problem.
Good news, this is a well known optimization problem:
It is easy to find algorithm descriptions for approximate solutions
A quick search on the web shows many implementations of approximation algorithms in Python.
Bad news, this is an NP-hard optimization problem:
If you need an exact solution, algorithms will run in a reasonable amount of time only for "small" problems (in your case, the size of the problem is the number of "1" cells).
Approximate (a.k.a. greedy) algorithms are a trade-off between computation requirements and the risk of delivering solutions that are far from optimal in certain cases.
Note that the following part does not prove that your problem is NP-hard. The general minimum set cover problem is NP-hard. In your case the subsets have several properties that might help to design a better algorithm. I have no idea how though.
Translating into a cover set problem
Let's define some sets:
U: the set of "1" cells (requiring Wifi).
P(U): the power set of U (the set of subsets of U).
P: the set of cells on which you can place a router (not sure if P=U in your original post).
T: the set of router types (3 values in your case).
R+: the set of positive real numbers (used to describe prices).
Let's define a function (pseudo Python):
# Domain of definition : T,P --> R+,P(U)
# This function takes a router type and a position, and returns
# a tuple containing:
# - the price of a router of the given type.
# - the subset of U containing all the positions covered by a router
#   of the given type placed at the given position.
def weighted_subset(routerType, position):
    pass  # TODO: implementation
Now, we define a last set, as the image of the function we've just described: S=weighted_subset(T,P). Each element of this set is a subset of U, weighted by a price in R+.
With all this formalism, finding the router types and positions that:
give coverage to all the desirable locations,
minimize the cost,
is equivalent to finding a sub-collection of S:
whose union is equal to U,
and which minimizes the sum of the associated weights.
This is exactly the weighted minimal set cover problem.
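To make the greedy approximation concrete, here is a minimal sketch (my own illustration, not a tested solution); it assumes weighted_subset returns (price, covered_cells) with covered_cells as a Python set, and it repeatedly picks the router type and position with the lowest cost per newly covered cell:
def greedy_cover(U, T, P):
    uncovered = set(U)
    chosen = []                       # list of (routerType, position) picks
    while uncovered:
        best = None                   # (cost per new cell, type, position, covered)
        for t in T:
            for p in P:
                price, covered = weighted_subset(t, p)
                gain = len(covered & uncovered)
                if gain == 0:
                    continue
                ratio = price / gain
                if best is None or ratio < best[0]:
                    best = (ratio, t, p, covered)
        if best is None:              # the remaining cells cannot be covered
            break
        _, t, p, covered = best
        chosen.append((t, p))
        uncovered -= covered
    return chosen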

How to Parallelize python loop with large dataset

I am trying to construct hierarchies from a dataset where each row represents a student, a course they've taken, and some other metadata. From this dataset, I'm trying to construct an adjacency matrix and determine the hierarchies based on which classes students have taken and the paths different students take when choosing classes.
That being said, constructing this adjacency matrix is computationally expensive. Here is the code I currently have, which has been running for around 2 hours.
uniqueStudentIds = df.Id.unique()
uniqueClasses = df['Course_Title'].unique()
for studentID in uniqueStudentIds:
    for course1 in uniqueClasses:
        for course2 in uniqueClasses:
            if (course1 != course2 and have_taken_both_courses(course1, course2, studentID)):
                x = vertexDict[course1]
                y = vertexDict[course2]
                # Assuming symmetry
                adjacency_matrix[x][y] += 1
                adjacency_matrix[y][x] += 1
                print(course1 + ', ' + course2)

def have_taken_both_courses(course1, course2, studentID):
    hasTakenFirstCourse = len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0
    if hasTakenFirstCourse:
        return len(df.loc[(df['Course_Title'] == course2) & (df['Id'] == studentID)]) > 0
    else:
        return False
Given that I have a very large dataset, I have consulted online resources about parallelizing/multithreading this computationally expensive for loop. However, I'm new to Python and multiprocessing, so any guidance would be greatly appreciated!
It appears you are looping way more than you have to. For every student you do NxN iterations, where N is the total number of classes, but each student has only taken a subset of those classes, so you can cut down on iterations significantly.
Your have_taken_both_courses() lookup is also more expensive than it needs to be.
Something like this will probably go a lot faster:
import numpy as np
import itertools
import pandas as pd
df = pd.read_table('/path/to/data.tsv')
students_df = pd.DataFrame(df['student'].unique())
students_lkp = {x[1][0]: x[0] for x in students_df.iterrows()}
classes_df = pd.DataFrame(df['class'].unique())
classes_lkp = {x[1][0]: x[0] for x in classes_df.iterrows()}
df['student_key'] = df['student'].apply(lambda x: students_lkp[x])
df['class_key'] = df['class'].apply(lambda x: classes_lkp[x])
df.set_index(['student_key', 'class_key'], inplace=True)
matr = np.zeros((len(classes_df), len(classes_df)))
for s in range(0, len(students_df)):
    print(s)
    # get all the classes for this student
    classes = df.loc[s].index.unique().tolist()
    for x, y in itertools.permutations(classes, 2):
        matr[x][y] += 1
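A small design note on the sketch above: itertools.permutations yields both (x, y) and (y, x), so matr comes out symmetric on its own. If you prefer to halve the inner iterations, the inner loop could instead use combinations and update both entries explicitly:
    for x, y in itertools.combinations(classes, 2):
        matr[x][y] += 1
        matr[y][x] += 1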
