Computing using two dataframes in Pandas

Computing using two dataframes in Pandas - python

I'm trying to compute the following:
When there are
df1 (dataframe that has speed of characters(char_speed) of subtitle that starts at start_time and ends at end_time):
char_speed start_time end_time
0 34 3 15
1 19 15 21
2 9 21 28
...
and
df2 (dataframe that has user's listening log that starts at start_time and ends at end_time with the speed that the user listened to at that interval):
start_time end_time speed
0 9.23 20.929 1.0
1 1.4 20.26 1.5
2 20.0 27.6 1.25
...
then compute the total character count during each interval:
start_time end_time speed total_char
0 9.23 20.929 1.0
1 1.4 20.26 1.5
2 20.0 27.6 1.25
...
For example, df2['total_char'].iloc[0] would be
((15-9.23)*34) + ((20.929-15)*19)
as among time period of 9.23 ~ 20.929,
during 9.23 ~ 15, the speed would be 34,
during 15 ~ 20.929, the speed would be 19
and df2['total_char'].iloc[1] would be
(3-1.4)*0 + ((15-3)*34) + ((20.26-15)*19)
as among time period of 1.4 ~ 20.26,
during 1.4 ~ 3, the speed is not found in df1, so 0
during 3 ~ 15, the speed would be 34
during 15 ~ 20.26, the speed would be 19
I'm a newbie in Pandas and I've been recently mesmerized by how Pandas can be efficient in short and simple codings, but I'm not sure if there's a way to compute this in a short and simple coding. Right now, I can only think of an way to do it without utilizing Pandas functions: calling each row of df2 and then searching through each row in df1 and then compute it.
It would be helpful if you could tell me a way to efficiently code this using Pandas. Or any recommendation of functions would be helpful too!
Thanks in advance! :)

If you aren't opposed to merging the dataframes then apply makes it easy.
df2 = pd.concat([df1, df2], axis=1, sort=False)
def speed_calc(row):
return ((row['end_time1']-row['start_time1'])*row['char_speed']) + \
((row['end_time2']-row['end_time1'])*row['char_speed'])
df2['total_char'] = df2.apply(speed_calc, axis=1)
This would require you to adjust the header names.

Related

Pandas DataFrames: Efficiently find next value in one column where another column has a greater value

The title describes my situation. I already have a working version of this, but it is very inefficient when scaled to large DataFrames (>1M rows). I was wondering if anyone has a better idea of doing this.
Example with solution and code
Create a new column next_time that has the next value of time where the price column is greater than the current row.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
series_to_concat = []
for price in df['price'].unique():
index_equal_to_price = df[df['price'] == price].index
series_time_greater_than_price = df[df['price'] > price]['time']
time_greater_than_price_backfilled = series_time_greater_than_price.reindex(index_equal_to_price.union(series_time_greater_than_price.index)).fillna(method='backfill')
series_to_concat.append(time_greater_than_price_backfilled.reindex(index_equal_to_price))
df['next_time'] = pd.concat(series_to_concat, sort=False)
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This gets me the desired result. When scaled up to some large dataframes, calculating this can take a few minutes. Does anyone have a better idea of how to approach this?
Edit: Clarification of constraints
We can assume the dataframe is sorted by time.
Another way to word this would be, given any row n (Time_n, Price_n), 0 <= n <= len(df) - 1, find x such that Time_x > Time_n AND Price_x > Price_n AND there is no y such that n < y < x with Price_y > Price_n.

These solutions were faster when I tested with %timeit on this sample, but I tested on a larger dataframe and they were much slower than your solution. It would be interesting to see if any of the 3 solutions are faster in your larger dataframe. I would look into dask or check out: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
I hope someone else is able to post a more efficient solution. Some different answers below:
You can achieve this with a next one-liner that loops through both the time and price columns simultaneously with zip. The next function works exactly the same as a list comprehension, but you use need to parentheses instead of brackets, and it only returns the first True value. You also need to pass None to handle errors as a parameter within in the next function.
You need to pass axis=1, since you are comparing column-wise.
This should speed up performance, as you don't loop through the entire column as the iteration stops after returning the first value and moves to the next row.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
df['next_time'] = (df.apply(lambda x: next((z for (y, z) in zip(df['price'], df['time'])
if y > x['price'] if z > x['time']), None), axis=1))
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
As you can see list comprehension would return the same result, but in theory will be a lot slower... as the total number of iterating would increase significantly especially with a large dataframe.
df['next_time'] = (df.apply(lambda x: [z for (y, z) in zip(df['price'], df['time'])
if y > x['price'] if z > x['time']], axis=1)).str[0]
df
Out[2]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
Another Option creating a function with some numpy and np.where():
def closest(x):
try:
lst = df.groupby(df['price'].cummax())['time'].transform('first')
lst = np.asarray(lst)
lst = lst[lst>x]
idx = (np.abs(lst - x)).argmin()
return lst[idx]
except ValueError:
pass
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
df['time'].shift(-1),
df['time'].apply(lambda x: closest(x)))

This one returned a variation of your dataframe with 1,000,000 rows and 162,000 unique prices for me in less than 7 seconds. As such, I think that since you ran it on 660,000 rows and 12,000 unique prices, the increase in speed would be 100x-1000x.
The added complexity of your question is the condition that the closest higher price must be at a later time. This answer https://stackoverflow.com/a/53553226/6366770 helped me discover the bisect functions, but it didn't have your added complexity of having to rely on a time column. As such, I had to tackle the problem from a couple of different angles (as you mentioned in a comment regarding my np.where() to break it down into a couple of different methods).
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
def bisect_right(a, x, lo=0, hi=None):
if lo < 0:
raise ValueError('lo must be non-negative')
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)//2
if x < a[mid]: hi = mid
else: lo = mid+1
return lo
def get_closest_higher(df, col, val):
higher_idx = bisect_right(df[col].values, val)
return higher_idx
df = df.sort_values(['price', 'time']).reset_index(drop=True)
df['next_time'] = df['price'].apply(lambda x: get_closest_higher(df, 'price', x))
df['next_time'] = df['next_time'].map(df['time'])
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'] )
df = df.sort_values('time').reset_index(drop=True)
df['next_time'] = np.where((df['price'].shift(-1) > df['price'])
,df['time'].shift(-1),
df['next_time'])
df['next_time'] = df['next_time'].ffill()
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN

David did come up with a great solution for finding the closest greater price at a later time. However, I did want to find the very next occurrence of a greater price at a later time though. Working with a coworker of mine, we found this solution.
Stack containing tuples (index, price)
Iterate through all rows (index i)
While the stack is non-empty AND the top of the stack has a lesser price, then pop and fill in the popped index with times[index]
Push (i, prices[i]) onto the stack
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []
next_times = np.full(len(df), np.nan)
for i in range(len(df)):
while stack and prices[i] > stack[-1][1]:
stack_time_index, stack_price = stack.pop()
next_times[stack_time_index] = times[i]
stack.append((i, prices[i]))
df['next_time'] = next_times
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This solution actually performs very fast. I am not totally sure, but I believe the complexity would be close to O(n) since it is one full pass through the entire dataframe. The reason this performs so well, is the stack is essentially sorted, where the largest prices will be at the bottom, and the smallest price is at the top of the stack.
Here is my test with an actual dataframe in action
print(f'{len(df):,.0f} rows with {len(df["price"].unique()):,.0f} unique prices ranging from ${df["price"].min():,.2f} to ${df["price"].max():,.2f}')
667,037 rows with 11,786 unique prices ranging from $1,857.52 to $2,022.00
def find_next_time_with_greater_price(df):
times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []
next_times = np.full(len(df), np.nan)
for i in range(len(df)):
while stack and prices[i] > stack[-1][1]:
stack_time_index, stack_price = stack.pop()
next_times[stack_time_index] = times[i]
stack.append((i, prices[i]))
return next_times
%timeit -n10 -r10 df['next_time'] = find_next_time_with_greater_price(df)
434 ms ± 11.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Average between points based on time

I'm trying to use Python to get time taken, as well as average speed between an object traveling between points.
The data looks somewhat like this,
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this,
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Now if I were doing this usually, I would just take the unique values of the id's,
for x in speeds.id.unique():
Speeds[speeds.id=="x"]...
But this method really isn't working.
What is the best approach for simply seeing if there are multiple id points over time, then taking the average of the speeds by that time given? Otherwise just returning the speed itself if there are not multiple values.
Is there a simpler pandas filter I could use?
Expected output is simply,
area - id - initial time - journey time - average speed.
the point is to get the average time and journey time for a vehicle going past two points

To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, delim_whitespace=True)
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
time = df.groupby(group_cols)["initialtime"].agg([max, min]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5

I tryed to use a very intuitive solution. I'm assuming the data has already been loaded to df.
df['initialtime'] = pd.to_datetime(df['initialtime'])
result = []
for car in df['id'].unique():
_df = df[df['id'] == car].sort_values('initialtime', ascending=True)
# Where the car is leaving "from" and where it's heading "to"
_df['From'] = _df['location']
_df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])
# Auxiliary columns
_df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
_df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])
# Desired columns
_df['journey_time'] = _df['end_time'] - _df['initialtime']
_df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2
_df = _df[_df['journey_time'] >= pd.Timedelta(0)]
_df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
axis=1, inplace=True)
result.append(_df)
final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5

Here is another approach. My results are different that other posts, so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create data frame and perform calculations
# create data frame
df = pd.read_csv(StringIO(data), delim_whitespace=True)
df['elapsed-time'] = df['distance'] / df['speed'] # in hours
# utility function
def hours_to_hms(elapsed):
''' Convert `elapsed` (in hours) to hh:mm:ss (round to nearest sec)'''
h, m = divmod(elapsed, 1)
m *= 60
_, s = divmod(m, 1)
s *= 60
hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
return hms
# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
/ df.groupby('id')['elapsed-time'].sum())
.rename('ave speed (km/hr)')
.round(2))
# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79

You should provide a better dataset (ie with identical time points) so that we understand better the inputs, and an exemple of expected output so that we understand the computation of the average speed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean() if df is a dataframe containing your input data.

Pandas Way of Weighted Average in a Large DataFrame

I do have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to find a way to compute weighted average on this dataframe which in turn creates another data frame.
Here is how my dataset looks like (very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I am willing to do is to create a new data frame that is basically a weighted average of this data frame. The requirements indicate that 12 of these location_ids should be averaged out by a specified weight to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another data frame) should be weighted averaged to from the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a completely Pandas way of solving it. Therefore, I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
return np.average(ds, weights=weights)
f = {'hours': 'first', 'location_id': 'first',
'temp': lambda x: self.__weighted(x, weights), 'prec': lambda x: self.__weighted(x, weights)}
data_frames = []
for combined_location in all_combined_locations:
mapped_location_ids = combined_location.location_ids
weights = combined_location.weights_of_location_ids
data_for_this_combined_location = pd.concat(df_data.loc[df_data.index.get_level_values(0) == location_id] for location_id in mapped_location_ids)
data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
data_grouped_by_distance = data_grouped_by_distance.agg(f)
data_frames.append(data_grouped_by_distance)
df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works well functionally, however the performance and the memory consumption is horrible. It is taking over 2 hours on my dataset and that is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?

From what I saw you can reduce one for loop with mapped_location_ids
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]

Pandas: duplicating dataframe entries while column higher or equal to 0

I have a dataframe containing clinical readings of hospital patients, for example a similar dataframe could look like this
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration, which corresponds to for i in range (n_times_back + 1), we do the following:
Create a new, unique pid [OLD ID | i] (Although as long as the new
pid is unique for each duplicated entry, the exact name isn't
really important to me so feel free to change this if it makes it
easier)
For every patient (pid), remove the rows with time column which is
more than the final time of that patient - i * interval. For
example if i * interval = 2.0 and the times associated to one pid
are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], as final
time - 2.0 = 0.8
iterate
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
for curr_patient in df['pid'].unique():
patient_data = df[df['pid'] == curr_patient]
final_time = patient_data['time'].max()
for i in range(n_times_back + 1):
new_data = patient_data[patient_data['time'] <= final_time - i * interval]
new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
new_data['pid'] = new_data['pid'].astype(int)
#check if there is any time index left, if not don't add useless entry to dataframe
if(new_data['time'].count()>0):
df = df.append(new_data)
df = df[df['pid'] != curr_patient] # remove original patient data, now duplicate
df.reset_index(inplace = True, drop = True)
return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30'000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high level pandas functions

ended up using a groupby function and breaking when no more times were available, as well as creating an "index" column that I merge with the "pid" column at the end.
def expand_df(group, n_times, interval):
df = pd.DataFrame()
final_time = group['time'].max()
for i in range(n_times + 1):
new_data = group[group['time'] <= final_time - i * interval]
new_data['iteration'] = str(i).zfill(2)
#check if there is any time index left, if not don't add useless entry to dataframe
if(new_data['time'].count()>0):
df = df.append(new_data)
else:
break
return df
new_df = df.groupby('pid').apply(lambda x : expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']

Is numpy.argmax slower than MATLAB [~,idx] = max()?

I am writing a Bayseian classifier for a normal distribution. I have both code in python and MATLAB which are nearly identical. However the MATLAB code runs about 50x faster than my Python script. I'm new to Python, so maybe there's something I did terribly wrong. I assume it's somewhere where I loop over the dataset.
Possibly numpy.argmax() is much slower than [~,idx]=max()? Looping through the data frame is slow? Bad use of dictionaries (previously I tried an object and it was even slow)?
Any advice is welcome.
Python code
import numpy as np
import pandas as pd
#import the data as a data frame
train_df = pd.read_table('hw1_traindata.txt',header = None)#training
train_df.columns = [1, 2] #rename column titles
The data here is a 2 columns (300 rows/samples for training and 300000 for test). This is the function parameters; mi and Si are the sample means and covariances.
case3_p = {'w': [], 'w0': [], 'W': []}
case3_p['w']={1:S1.I*m1,2:S2.I*m2,3:S3.I*m3}
case3_p['w0']={1: -1.0/2.0*(m1.T*S1.I*m1)-
1.0/2.0*np.log(np.linalg.det(S1)),
2: -1.0/2.0*(m2.T*S2.I*m2)-1.0/2.0*np.log(np.linalg.det(S2)),
3: -1.0/2.0*(m3.T*S3.I*m3)-1.0/2.0*np.log(np.linalg.det(S3))}
case3_p['W']={1: -1.0/2.0*S1.I,
2: -1.0/2.0*S2.I,
3: -1.0/2.0*S3.I}
#W1=-1.0/2.0*S1.I
#w1_3=S1.I*m1
#w01_3=-1.0/2.0*(m1.T*S1.I*m1)-1.0/2.0*np.log(np.linalg.det(S1))
def g3(x,W,w,w0):
return x.T*W*x+w.T*x+w0
This is the classifier/loop
train_df['case3'] = 0
for i in range(train_df.shape[0]):
x = np.mat(train_df.loc[i,[1, 2]]).T#observation
#case 3
vals = [g3(x,case3_p['W'][1],case3_p['w'][1],case3_p['w0'][1]),
g3(x,case3_p['W'][2],case3_p['w'][2],case3_p['w0'][2]),
g3(x,case3_p['W'][3],case3_p['w'][3],case3_p['w0'][3])]
train_df.loc[i,'case3'] = np.argmax(vals) + 1 #add one to make it the class value
Corresponding MATLAB code
train = load('hw1_traindata.txt');
The discriminant functions
W1=-1/2*S1^-1;%there isn't one for the other cases
w1_3=S1^-1*m1';%fix the transpose thing
w10_3=-1/2*(m1*S1^-1*m1')-1/2*log(det(S1));
g1_3=#(x) x'*W1*x+w1_3'*x+w10_3';
W2=-1/2*S2^-1;
w2_3=S2^-1*m2';
w20_3=-1/2*(m2*S2^-1*m2')-1/2*log(det(S2));
g2_3=#(x) x'*W2*x+w2_3'*x+w20_3';
W3=-1/2*S3^-1;
w3_3=S3^-1*m3';
w30_3=-1/2*(m3*S3^-1*m3')-1/2*log(det(S3));
g3_3=#(x) x'*W3*x+w3_3'*x+w30_3';
The classifier
case3_class_tr = Inf(size(act_class_tr));
for i=1:length(train)
x=train(i,:)';%current sample
%case3
vals = [g1_3(x),g2_3(x),g3_3(x)];%compute discriminant function value
[~, case3_class_tr(i)] = max(vals);%get location of max
end

In cases such as this it's best to profile your code. First I created some mock data:
import numpy as np
import pandas as pd
fname = 'hw1_traindata.txt'
ar = np.random.rand(1000, 2)
np.savetxt(fname, ar, delimiter='\t')
m1, m2, m3 = [np.mat(ar).T for ar in np.random.rand(3, 2)]
S1, S2, S3 = [np.mat(ar) for ar in np.random.rand(3, 2, 2)]
Then I wrapped your code in a function and profiled with the lprun (line_profiler) IPython magic. These are the results:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 4.77361 s
File: <ipython-input-164-563f57dadab3>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
=====================================================
1 def train(fname, m1, S1, m2, S2, m3, S3):
2 1 9868 9868.0 0.1 train_df = pd.read_table(fname ,header = None)#training
3 1 328 328.0 0.0 train_df.columns = [1, 2] #rename column titles
4
5 1 17 17.0 0.0 case3_p = {'w': [], 'w0': [], 'W': []}
6 1 877 877.0 0.0 case3_p['w']={1:S1.I*m1,2:S2.I*m2,3:S3.I*m3}
7 1 356 356.0 0.0 case3_p['w0']={1: -1.0/2.0*(m1.T*S1.I*m1)-
8
9 1 204 204.0 0.0 1.0/2.0*np.log(np.linalg.det(S1)),
10 1 498 498.0 0.0 2: -1.0/2.0*(m2.T*S2.I*m2)-1.0/2.0*np.log(np.linalg.det(S2)),
11 1 502 502.0 0.0 3: -1.0/2.0*(m3.T*S3.I*m3)-1.0/2.0*np.log(np.linalg.det(S3))}
12 1 235 235.0 0.0 case3_p['W']={1: -1.0/2.0*S1.I,
13 1 229 229.0 0.0 2: -1.0/2.0*S2.I,
14 1 230 230.0 0.0 3: -1.0/2.0*S3.I}
15
16 1 1818 1818.0 0.0 train_df['case3'] = 0
17
18 1001 17409 17.4 0.2 for i in range(train_df.shape[0]):
19 1000 4254511 4254.5 49.9 x = np.mat(train_df.loc[i,[1, 2]]).T#observation
20
21 #case 3
22 1000 298245 298.2 3.5 vals = [g3(x,case3_p['W'][1],case3_p['w'][1],case3_p['w0'][1]),
23 1000 269825 269.8 3.2 g3(x,case3_p['W'][2],case3_p['w'][2],case3_p['w0'][2]),
24 1000 274279 274.3 3.2 g3(x,case3_p['W'][3],case3_p['w'][3],case3_p['w0'][3])]
25 1000 3395654 3395.7 39.8 train_df.loc[i,'case3'] = np.argmax(vals) + 1
26
27 1 45 45.0 0.0 return train_df
There are two lines that together take 90% of the time. So let's split these lines up a bit and rerun the profiler:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 6.15358 s
File: <ipython-input-197-92d9866b57dc>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
======================================================
...
19 1000 5292988 5293.0 48.2 thing = train_df.loc[i,[1, 2]] # Observation
20 1000 265101 265.1 2.4 x = np.mat(thing).T
...
26 1000 143142 143.1 1.3 index = np.argmax(vals) + 1 # Add one to make it the class value
27 1000 4164122 4164.1 37.9 train_df.loc[i,'case3'] = index
Most time is spent indexing the Pandas dataframe! Taking the argmax is only 1.5% of total execution time.
The situation can be improved somewhat by pre-allocating train_df['case3'] and using .iloc:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 3.26716 s
File: <ipython-input-192-f6173cdf9990>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
======= ======= ======================================
16 1 1548 1548.0 0.0 train_df['case3'] = np.zeros(len(train_df))
...
19 1000 2608489 2608.5 44.7 thing = train_df.iloc[i,[0, 1]] # Observation
20 1000 228959 229.0 3.9 x = np.mat(thing).T
...
26 1000 123165 123.2 2.1 index = np.argmax(vals) + 1 # Add one to make it the class value
27 1000 1849283 1849.3 31.7 train_df.iloc[i,2] = index
Still though, iterating individual values from Pandas dataframes in tight loops is a bad idea. In this case use Pandas only for loading the text-data (it's very good at it) but other than that use "raw" Numpy arrays. E.g. use train_data = pd.read_table(fname, header=None).values. And when you reach the analysis stage maybe go back to Pandas.
Some other ramblings:
Use Python's zero-based indexing and don't go out of your way to use
one-based indexing.
Consider using normal Numpy arrays instead of matrices. When you use
matrices you tend to mix them up with arrays and run into hard to debug
problems.
MATLAB has a JIT compliler, so a speed difference between Python and
MATLAB is expected for loop heavy code.

It's really hard to tell, but straight out of the package Matlab will be faster than Numpy. Primarily because it comes with its own Math Kernel Library
Whether 50x is a reasonable approximatino it'll be hard to compare basic Numpy vs Matlab's MKL.
There are other Python distribution that come with their own MKL such as Enthought and Anaconda
In Anaconda's MKL Optimizations page you'll see the chart comparing the difference between regular Anaconda and the one with MKL. The improvement is not linear, but definitely there.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Computing using two dataframes in Pandas - python

Related

Pandas DataFrames: Efficiently find next value in one column where another column has a greater value

Average between points based on time

Pandas Way of Weighted Average in a Large DataFrame

Pandas: duplicating dataframe entries while column higher or equal to 0

Is numpy.argmax slower than MATLAB [~,idx] = max()?

Categories

Resources