Is numpy.argmax slower than MATLAB [~,idx] = max()? - python

I am writing a Bayesian classifier for a normal distribution. I have code in both Python and MATLAB that is nearly identical, yet the MATLAB version runs about 50x faster than my Python script. I'm new to Python, so maybe there's something I did terribly wrong. I assume it's somewhere in the loop over the dataset.
Possibly numpy.argmax() is much slower than [~,idx]=max()? Is looping through the data frame slow? Is it bad use of dictionaries (previously I tried an object and it was even slower)?
Any advice is welcome.
Python code
import numpy as np
import pandas as pd
#import the data as a data frame
train_df = pd.read_table('hw1_traindata.txt',header = None)#training
train_df.columns = [1, 2] #rename column titles
The data here has 2 columns (300 rows/samples for training and 300,000 for test). These are the function parameters; mi and Si are the sample means and covariances.
case3_p = {'w': [], 'w0': [], 'W': []}
case3_p['w'] = {1: S1.I*m1, 2: S2.I*m2, 3: S3.I*m3}
case3_p['w0'] = {1: -1.0/2.0*(m1.T*S1.I*m1) - 1.0/2.0*np.log(np.linalg.det(S1)),
                 2: -1.0/2.0*(m2.T*S2.I*m2) - 1.0/2.0*np.log(np.linalg.det(S2)),
                 3: -1.0/2.0*(m3.T*S3.I*m3) - 1.0/2.0*np.log(np.linalg.det(S3))}
case3_p['W'] = {1: -1.0/2.0*S1.I,
                2: -1.0/2.0*S2.I,
                3: -1.0/2.0*S3.I}
#W1=-1.0/2.0*S1.I
#w1_3=S1.I*m1
#w01_3=-1.0/2.0*(m1.T*S1.I*m1)-1.0/2.0*np.log(np.linalg.det(S1))

def g3(x, W, w, w0):
    return x.T*W*x + w.T*x + w0
This is the classifier/loop
train_df['case3'] = 0
for i in range(train_df.shape[0]):
    x = np.mat(train_df.loc[i, [1, 2]]).T  # observation
    # case 3
    vals = [g3(x, case3_p['W'][1], case3_p['w'][1], case3_p['w0'][1]),
            g3(x, case3_p['W'][2], case3_p['w'][2], case3_p['w0'][2]),
            g3(x, case3_p['W'][3], case3_p['w'][3], case3_p['w0'][3])]
    train_df.loc[i, 'case3'] = np.argmax(vals) + 1  # add one to make it the class value
Corresponding MATLAB code
train = load('hw1_traindata.txt');
The discriminant functions
W1=-1/2*S1^-1;%there isn't one for the other cases
w1_3=S1^-1*m1';%fix the transpose thing
w10_3=-1/2*(m1*S1^-1*m1')-1/2*log(det(S1));
g1_3=@(x) x'*W1*x+w1_3'*x+w10_3';
W2=-1/2*S2^-1;
w2_3=S2^-1*m2';
w20_3=-1/2*(m2*S2^-1*m2')-1/2*log(det(S2));
g2_3=@(x) x'*W2*x+w2_3'*x+w20_3';
W3=-1/2*S3^-1;
w3_3=S3^-1*m3';
w30_3=-1/2*(m3*S3^-1*m3')-1/2*log(det(S3));
g3_3=@(x) x'*W3*x+w3_3'*x+w30_3';
The classifier
case3_class_tr = Inf(size(act_class_tr));
for i=1:length(train)
    x = train(i,:)'; %current sample
    %case3
    vals = [g1_3(x), g2_3(x), g3_3(x)]; %compute discriminant function values
    [~, case3_class_tr(i)] = max(vals); %get location of max
end

In cases such as this it's best to profile your code. First I created some mock data:
import numpy as np
import pandas as pd
fname = 'hw1_traindata.txt'
ar = np.random.rand(1000, 2)
np.savetxt(fname, ar, delimiter='\t')
m1, m2, m3 = [np.mat(ar).T for ar in np.random.rand(3, 2)]
S1, S2, S3 = [np.mat(ar) for ar in np.random.rand(3, 2, 2)]
Then I wrapped your code in a function and profiled it with the %lprun (line_profiler) IPython magic. These are the results:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 4.77361 s
File: <ipython-input-164-563f57dadab3>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
=====================================================
1 def train(fname, m1, S1, m2, S2, m3, S3):
2 1 9868 9868.0 0.1 train_df = pd.read_table(fname ,header = None)#training
3 1 328 328.0 0.0 train_df.columns = [1, 2] #rename column titles
4
5 1 17 17.0 0.0 case3_p = {'w': [], 'w0': [], 'W': []}
6 1 877 877.0 0.0 case3_p['w']={1:S1.I*m1,2:S2.I*m2,3:S3.I*m3}
7 1 356 356.0 0.0 case3_p['w0']={1: -1.0/2.0*(m1.T*S1.I*m1)-
8
9 1 204 204.0 0.0 1.0/2.0*np.log(np.linalg.det(S1)),
10 1 498 498.0 0.0 2: -1.0/2.0*(m2.T*S2.I*m2)-1.0/2.0*np.log(np.linalg.det(S2)),
11 1 502 502.0 0.0 3: -1.0/2.0*(m3.T*S3.I*m3)-1.0/2.0*np.log(np.linalg.det(S3))}
12 1 235 235.0 0.0 case3_p['W']={1: -1.0/2.0*S1.I,
13 1 229 229.0 0.0 2: -1.0/2.0*S2.I,
14 1 230 230.0 0.0 3: -1.0/2.0*S3.I}
15
16 1 1818 1818.0 0.0 train_df['case3'] = 0
17
18 1001 17409 17.4 0.2 for i in range(train_df.shape[0]):
19 1000 4254511 4254.5 49.9 x = np.mat(train_df.loc[i,[1, 2]]).T#observation
20
21 #case 3
22 1000 298245 298.2 3.5 vals = [g3(x,case3_p['W'][1],case3_p['w'][1],case3_p['w0'][1]),
23 1000 269825 269.8 3.2 g3(x,case3_p['W'][2],case3_p['w'][2],case3_p['w0'][2]),
24 1000 274279 274.3 3.2 g3(x,case3_p['W'][3],case3_p['w'][3],case3_p['w0'][3])]
25 1000 3395654 3395.7 39.8 train_df.loc[i,'case3'] = np.argmax(vals) + 1
26
27 1 45 45.0 0.0 return train_df
There are two lines that together take 90% of the time. So let's split these lines up a bit and rerun the profiler:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 6.15358 s
File: <ipython-input-197-92d9866b57dc>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
======================================================
...
19 1000 5292988 5293.0 48.2 thing = train_df.loc[i,[1, 2]] # Observation
20 1000 265101 265.1 2.4 x = np.mat(thing).T
...
26 1000 143142 143.1 1.3 index = np.argmax(vals) + 1 # Add one to make it the class value
27 1000 4164122 4164.1 37.9 train_df.loc[i,'case3'] = index
Most time is spent indexing the Pandas dataframe! Taking the argmax is only 1.5% of total execution time.
The situation can be improved somewhat by pre-allocating train_df['case3'] and using .iloc:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 3.26716 s
File: <ipython-input-192-f6173cdf9990>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
======= ======= ======================================
16 1 1548 1548.0 0.0 train_df['case3'] = np.zeros(len(train_df))
...
19 1000 2608489 2608.5 44.7 thing = train_df.iloc[i,[0, 1]] # Observation
20 1000 228959 229.0 3.9 x = np.mat(thing).T
...
26 1000 123165 123.2 2.1 index = np.argmax(vals) + 1 # Add one to make it the class value
27 1000 1849283 1849.3 31.7 train_df.iloc[i,2] = index
Still, iterating over individual values of a Pandas dataframe in a tight loop is a bad idea. In this case, use Pandas only for loading the text data (it's very good at that) and otherwise work with "raw" Numpy arrays, e.g. train_data = pd.read_table(fname, header=None).values, as sketched below. When you reach the analysis stage you can go back to Pandas.
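To make that concrete, here is a rough sketch of the same loop over plain NumPy arrays (variable names reused from the question; it assumes m1..m3 and S1..S3 are the matrices defined earlier):
import numpy as np
import pandas as pd
train_data = pd.read_table(fname, header=None).values   # plain (n, 2) ndarray
# pre-compute the per-class parameters as plain arrays/scalars
Ws  = [np.asarray(-0.5 * S.I) for S in (S1, S2, S3)]
ws  = [np.asarray(S.I * m).ravel() for S, m in ((S1, m1), (S2, m2), (S3, m3))]
w0s = [(-0.5 * (m.T * S.I * m) - 0.5 * np.log(np.linalg.det(S))).item()
       for S, m in ((S1, m1), (S2, m2), (S3, m3))]
case3 = np.empty(len(train_data), dtype=int)
for i, x in enumerate(train_data):        # x is a length-2 ndarray
    vals = [x @ W @ x + w @ x + w0 for W, w, w0 in zip(Ws, ws, w0s)]
    case3[i] = np.argmax(vals) + 1        # +1 turns the index into a class label
train_df['case3'] = case3                 # back into Pandas only at the end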
Some other ramblings:
- Use Python's zero-based indexing; don't go out of your way to use one-based indexing.
- Consider using plain Numpy arrays instead of matrices. When you use matrices you tend to mix them up with arrays and run into hard-to-debug problems.
- MATLAB has a JIT compiler, so a speed difference between Python and MATLAB is expected for loop-heavy code.

It's really hard to tell, but out of the box MATLAB will be faster than NumPy, primarily because it ships with its own Math Kernel Library (MKL).
Whether 50x is a reasonable figure is hard to say when comparing plain NumPy against MATLAB's MKL-backed routines.
There are Python distributions that come with their own MKL builds, such as Enthought and Anaconda.
On Anaconda's MKL Optimizations page you'll see a chart comparing regular Anaconda with the MKL-enabled build. The improvement is not linear, but it is definitely there.
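If you want to check which BLAS/LAPACK libraries your own NumPy build is linked against, NumPy can report it (a quick check, not part of the original answer):
import numpy as np
np.show_config()  # prints the BLAS/LAPACK build information for this NumPy install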

Related

Pandas Way of Weighted Average in a Large DataFrame

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to find a way to compute a weighted average over this dataframe that in turn creates another dataframe.
Here is what my dataset looks like (a very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new dataframe that is basically a weighted average of this dataframe. The requirements indicate that 12 of these location_ids should be averaged with specified weights to form the combined_location_id values.
For example, location_ids 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 with their appropriate weights (separate data coming from another dataframe) should be weighted-averaged to form combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a completely Pandas way of solving it. Therefore, I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(df_data.loc[df_data.index.get_level_values(0) == location_id] for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally, however the performance and the memory consumption are horrible. It takes over 2 hours on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I saw, you can eliminate the inner loop over mapped_location_ids by selecting them all at once with isin:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
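Dropped into the original loop, that filter would replace the per-location concat (a sketch using the names from the question):
data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    # select every mapped location in one indexing operation instead of concatenating per id
    data_for_this_combined_location = df_data.loc[
        df_data.index.get_level_values(0).isin(mapped_location_ids)]
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False).agg(f)
    data_frames.append(data_grouped_by_distance)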

Computing using two dataframes in Pandas

I'm trying to compute the following.
Given df1 (a dataframe with the character speed (char_speed) of subtitles that start at start_time and end at end_time):
char_speed start_time end_time
0 34 3 15
1 19 15 21
2 9 21 28
...
and
df2 (a dataframe with the user's listening log, where each entry starts at start_time, ends at end_time, and records the speed the user listened at during that interval):
start_time end_time speed
0 9.23 20.929 1.0
1 1.4 20.26 1.5
2 20.0 27.6 1.25
...
I want to compute the total character count during each interval:
start_time end_time speed total_char
0 9.23 20.929 1.0
1 1.4 20.26 1.5
2 20.0 27.6 1.25
...
For example, df2['total_char'].iloc[0] would be
((15-9.23)*34) + ((20.929-15)*19)
as, within the time period 9.23 ~ 20.929,
during 9.23 ~ 15 the char_speed is 34, and
during 15 ~ 20.929 the char_speed is 19;
and df2['total_char'].iloc[1] would be
(3-1.4)*0 + ((15-3)*34) + ((20.26-15)*19)
as, within the time period 1.4 ~ 20.26,
during 1.4 ~ 3 no char_speed is found in df1, so it counts as 0,
during 3 ~ 15 the char_speed is 34, and
during 15 ~ 20.26 the char_speed is 19.
I'm a newbie with Pandas and I've recently been mesmerized by how much Pandas can do in short, simple code, but I'm not sure there's a short and simple way to compute this. Right now I can only think of a way that doesn't use Pandas functions: iterating over each row of df2, searching through each row of df1, and computing the value from there.
It would be helpful if you could show me a way to code this efficiently using Pandas, or recommend any relevant functions!
Thanks in advance! :)
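For what it's worth, the overlap arithmetic in the examples above can be written with NumPy broadcasting; the sketch below is only an illustration of that computation, using the column names from the example tables:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'char_speed': [34, 19, 9],
                    'start_time': [3, 15, 21],
                    'end_time':   [15, 21, 28]})
df2 = pd.DataFrame({'start_time': [9.23, 1.4, 20.0],
                    'end_time':   [20.929, 20.26, 27.6],
                    'speed':      [1.0, 1.5, 1.25]})
# overlap of every df2 interval with every df1 interval, shape (len(df2), len(df1));
# pairs of intervals that don't overlap are clipped to 0
overlap = (np.minimum(df2['end_time'].values[:, None], df1['end_time'].values)
           - np.maximum(df2['start_time'].values[:, None], df1['start_time'].values)).clip(min=0)
# characters covered by each overlap, summed per df2 row
df2['total_char'] = overlap @ df1['char_speed'].values
# row 0 gives (15-9.23)*34 + (20.929-15)*19, matching the worked example above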
If you aren't opposed to merging the dataframes then apply makes it easy.
df2 = pd.concat([df1, df2], axis=1, sort=False)

def speed_calc(row):
    return ((row['end_time1'] - row['start_time1']) * row['char_speed']) + \
           ((row['end_time2'] - row['end_time1']) * row['char_speed'])

df2['total_char'] = df2.apply(speed_calc, axis=1)
This would require you to adjust the header names so that the two frames don't collide, for example as shown below.
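One way to produce the suffixed column names that speed_calc expects (the exact renaming scheme is an assumption, since the answer doesn't spell it out):
# hypothetical renaming so the concatenated frame has start_time1/end_time1 (from df1)
# and start_time2/end_time2 (from df2), as used by speed_calc above
df1 = df1.rename(columns={'start_time': 'start_time1', 'end_time': 'end_time1'})
df2 = df2.rename(columns={'start_time': 'start_time2', 'end_time': 'end_time2'})
df2 = pd.concat([df1, df2], axis=1, sort=False)
df2['total_char'] = df2.apply(speed_calc, axis=1)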

Optimized method for mapping contents of a column in a 2D numpy array

I have a 2D numpy array containing integers between 0 and 100. For a particular column, I want to map the values in the following way:
0-4 mapped to 0
5-9 mapped to 5
10-14 mapped to 10, and so on.
This is my code:
import numpy as np

@profile
def map_column(arr, col, incr):
    col_data = arr[:, col]
    vec = np.arange(0, 100, incr)
    for i in range(col_data.shape[0]):
        for j in range(len(vec) - 1):
            if (col_data[i] >= vec[j] and col_data[i] < vec[j+1]):
                col_data[i] = vec[j]
        if (col_data[i] > vec[-1]):
            col_data[i] = vec[-1]
    return col_data

np.random.seed(1)
myarr = np.random.randint(100, size=(80000, 4))
x = map_column(myarr, 2, 5)
This code takes 8.3 seconds to run. The following is the output of running line_profiler on this code.
Timer unit: 1e-06 s
Total time: 8.32155 s
File: testcode2.py
Function: map_column at line 2
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 @profile
3 def map_column(arr,col,incr):
4 1 17.0 17.0 0.0 col_data = arr[:,col]
5 1 34.0 34.0 0.0 vec = np.arange(0,100,incr)
6 80001 139232.0 1.7 1.7 for i in range(col_data.shape[0]):
7 1600000 2778636.0 1.7 33.4 for j in range(len(vec)-1):
8 1520000 4965687.0 3.3 59.7 if (col_data[i]>=vec[j] and col_data[i]<vec[j+1]):
9 76062 207492.0 2.7 2.5 col_data[i] = vec[j]
10 80000 221693.0 2.8 2.7 if (col_data[i]>vec[-1]):
11 3156 8761.0 2.8 0.1 col_data[i] = vec[-1]
12 1 2.0 2.0 0.0 return col_data
In the future I will have to work with real data much bigger than this one.
Can anyone suggest a faster method to do this?
I think this can be solved with an arithmetic expression, if I understand the question correctly:
def map_column(arr, col, incr):
    col_data = arr[:, col]
    return (col_data // incr) * incr
should do the trick. What happens here is that, due to the integer division, the remainder is discarded. Thus, multiplying by the increment again gives you the next smaller number that is divisible by the increment.
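As a quick sanity check of that expression on the sample data from the question (just an illustration):
import numpy as np
np.random.seed(1)
myarr = np.random.randint(100, size=(80000, 4))
mapped = (myarr[:, 2] // 5) * 5
assert ((mapped % 5) == 0).all()            # every mapped value is a multiple of 5
assert ((myarr[:, 2] - mapped) >= 0).all()  # values are mapped down, never up
assert ((myarr[:, 2] - mapped) < 5).all()   # and by less than one increment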

Pandas: duplicating dataframe entries while column higher or equal to 0

I have a dataframe containing clinical readings of hospital patients; for example, a similar dataframe could look like this:
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration, which corresponds to for i in range(n_times_back + 1), we do the following:
- Create a new, unique pid [OLD ID | i]. (As long as the new pid is unique for each duplicated entry, the exact name isn't really important to me, so feel free to change this if it makes it easier.)
- For every patient (pid), remove the rows whose time column is greater than the final time of that patient - i * interval. For example, if i * interval = 2.0 and the times associated with one pid are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], as final time - 2.0 = 0.8.
- Iterate.
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
    for curr_patient in df['pid'].unique():
        patient_data = df[df['pid'] == curr_patient]
        final_time = patient_data['time'].max()
        for i in range(n_times_back + 1):
            new_data = patient_data[patient_data['time'] <= final_time - i * interval]
            new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
            new_data['pid'] = new_data['pid'].astype(int)
            # check if there is any time index left; if not, don't add a useless entry to the dataframe
            if (new_data['time'].count() > 0):
                df = df.append(new_data)
        df = df[df['pid'] != curr_patient]  # remove original patient data, now duplicated
    df.reset_index(inplace=True, drop=True)
    return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30,000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high-level pandas functions.
I ended up using a groupby, breaking out of the loop when no more times were available, and creating an "iteration" column that I merge into the "pid" column at the end.
def expand_df(group, n_times, interval):
    df = pd.DataFrame()
    final_time = group['time'].max()
    for i in range(n_times + 1):
        new_data = group[group['time'] <= final_time - i * interval]
        new_data['iteration'] = str(i).zfill(2)
        # check if there is any time index left; if not, don't add a useless entry to the dataframe
        if (new_data['time'].count() > 0):
            df = df.append(new_data)
        else:
            break
    return df

new_df = df.groupby('pid').apply(lambda x: expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']

Pandas vectorize statistical odds-ratio test

I'm looking for a faster way to do odds-ratio tests on a large dataset. I have about 1200 variables (see var_col) that I want to test against each other for mutual exclusion / co-occurrence. The odds ratio is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples with (a) altered in neither site x nor y, (b) altered in site x but not y, (c) altered in y but not x, and (d) altered in both. I'd also like to calculate the Fisher exact test to determine statistical significance. The scipy function fisher_exact can calculate both of these (see below).
#here's a sample of my original dataframe
sample_id_no var_col
0 258.0
1 -24.0
2 -150.0
3 149.0
4 108.0
5 -126.0
6 -83.0
7 2.0
8 -177.0
9 -171.0
10 -7.0
11 -377.0
12 -272.0
13 66.0
14 -13.0
15 -7.0
16 0.0
17 189.0
18 7.0
13 -21.0
19 80.0
20 -14.0
21 -76.0
3 83.0
22 -182.0
import pandas as pd
import numpy as np
from scipy.stats import fisher_exact
import itertools
#create a dataframe with each possible pair of variables
var_pairs = pd.DataFrame(list(itertools.combinations(df.var_col.unique(), 2))).rename(columns={0: 'alpha_site', 1: 'beta_site'})

#create a cross-tab with samples and vars
sample_table = pd.crosstab(df.sample_id_no, df.var_col)

odds_ratio_results = var_pairs.apply(getOddsRatio, axis=1, args=(sample_table,))

#where the function getOddsRatio is defined as:
def getOddsRatio(pairs, sample_table):
    alpha_site, beta_site = pairs
    oddsratio, pvalue = fisher_exact(pd.crosstab(sample_table[alpha_site] > 0, sample_table[beta_site] > 0))
    return [oddsratio, pvalue]
This code runs very slowly, especially when used on large datasets. In my actual dataset I have around 700k variable pairs. Since the getOddsRatio() function is applied to each pair individually, it is definitely the main source of the slowness. Is there a more efficient solution?
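One direction that might help with the odds ratios themselves, sketched here as an untested idea rather than a definitive answer: the counts a, b, c, d for every pair can be obtained at once from matrix products of the "altered" indicator matrix, so only the fisher_exact p-values would still need a per-pair call.
import numpy as np
B  = (sample_table.values > 0)      # samples x variables, "altered" indicator
Bi = B.astype(np.int64)
Ni = (~B).astype(np.int64)
d = Bi.T @ Bi                       # altered in both x and y
b = Bi.T @ Ni                       # altered in x, not in y
c = Ni.T @ Bi                       # altered in y, not in x
a = Ni.T @ Ni                       # altered in neither
with np.errstate(divide='ignore', invalid='ignore'):
    odds_ratio = (a * d) / (b * c)  # odds ratio for every variable pair at once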
