I am using Python 2.7 and keep getting the warning below. Please let me know if you need the full code, but it is a bit long. Thank you for your help.
Warning (from warnings module):
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 3619
FutureWarning)
FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index
Here is the Portfolio class:
class Portfolio(object):
    """An abstract base class representing a portfolio of
    positions (including both instruments and cash), determined
    on the basis of a set of signals provided by a Strategy."""

    __metaclass__ = abc.ABCMeta

    @abc.abstractmethod
    def generate_positions(self):
        raise NotImplementedError("Should implement generate_positions()!")

    @abc.abstractmethod
    def backtest_portfolio(self):
        raise NotImplementedError("Should implement backtest_portfolio()!")
Here is the code that is causing the issue (see the `if __name__ == "__main__"` block):
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.io.data import DataReader
from backtest import Strategy, Portfolio
class MovingAverageCrossStrategy(Strategy):
    def __init__(self, symbol, bars, short_window=8, long_window=50):
        self.symbol = symbol
        self.bars = bars
        self.short_window = short_window
        self.long_window = long_window

    def generate_signals(self):
        signals = pd.DataFrame(index=self.bars.index)
        signals['signal'] = 0.0

        # Create the set of short and long simple moving averages over the
        # respective periods
        signals['short_mavg'] = pd.rolling_mean(self.bars['Close'], self.short_window, min_periods=1)
        signals['long_mavg'] = pd.rolling_mean(self.bars['Close'], self.long_window, min_periods=1)

        # Create a 'signal' (invested or not invested) when the short moving average crosses the long
        # moving average, but only for the period greater than the shortest moving average window
        signals['signal'][self.short_window:] = np.where(signals['short_mavg'][self.short_window:]
            > signals['long_mavg'][self.short_window:], 1.0, 0.0)

        # Take the difference of the signals in order to generate actual trading orders
        signals['positions'] = signals['signal'].diff()
        return signals
class MarketOnClosePortfolio(Portfolio):
    def __init__(self, symbol, bars, signals, initial_capital=100000.0):
        self.symbol = symbol
        self.bars = bars
        self.signals = signals
        self.initial_capital = float(initial_capital)
        self.positions = self.generate_positions()

    def generate_positions(self):
        positions = pd.DataFrame(index=self.signals.index).fillna(0.0)
        positions[self.symbol] = 100*self.signals['signal']  # This strategy buys 100 shares
        return positions

    def backtest_portfolio(self):
        portfolio = self.positions*self.bars['Close']
        pos_diff = self.positions.diff()
        portfolio['holdings'] = (self.positions*self.bars['Close']).sum(axis=1)
        portfolio['cash'] = self.initial_capital - (pos_diff*self.bars['Close']).sum(axis=1).cumsum()
        portfolio['total'] = portfolio['cash'] + portfolio['holdings']
        portfolio['returns'] = portfolio['total'].pct_change()
        return portfolio
if __name__ == "__main__":
# Obtain daily bars of stock from Yahoo Finance for the period
# 1st Jan 1990 to 1st Jan 2014 - This is an example from ZipLine
symbol = 'AAPL'
bars = DataReader(symbol, "yahoo", datetime.datetime(1990,1,1), datetime.datetime(2014,1,1))
# Create a Moving Average Cross Strategy instance with a short moving
# average window of 8 days and a long window of 50 days
mac = MovingAverageCrossStrategy(symbol, bars, short_window=8, long_window=50)
signals = mac.generate_signals()
# Create a portfolio of stock, with $100,000 initial capital
portfolio = MarketOnClosePortfolio(symbol, bars, signals, initial_capital=100000.0)
returns = portfolio.backtest_portfolio()
Without being able to run your code it is difficult to point to the exact reason, but take for example this line in backtest_portfolio:
portfolio = self.positions*self.bars['Close']
Suppose self.positions is a DataFrame and self.bars['Close'] is a Series (in this case a column of a DataFrame, which is returned as a Series). Let me try to explain the issue with a toy example.
First, generate a DataFrame and a Series (with a DatetimeIndex):
In [3]: idx = pd.date_range('2012-01-01', periods=3)
In [5]: df = pd.DataFrame({'A':[1,2,3], 'B':[10,20,30]}, index=idx)
In [6]: df
Out[6]:
A B
2012-01-01 1 10
2012-01-02 2 20
2012-01-03 3 30
In [7]: s = pd.Series([1,2,3], index=idx)
In [8]: s
Out[8]:
2012-01-01 1
2012-01-02 2
2012-01-03 3
Freq: D, dtype: int64
Now if we multiply both, we will get the warning you noticed:
In [10]: df * s
/home/joris/scipy/pandas-np16/pandas/core/frame.py:2920: FutureWarning: TimeSeries
broadcasting along DataFrame index by default is deprecated. Please use DataFrame.<op>
to explicitly broadcast arithmetic operations along the index
FutureWarning)
Out[10]:
A B
2012-01-01 1 10
2012-01-02 4 40
2012-01-03 9 90
This is because of what I mentioned in the comments, and it is explained here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html?highlight=broadcasting#data-alignment-and-arithmetic. Normally, when a DataFrame and a Series are multiplied, the Series is broadcast over the columns, while for time series this is done over the rows. But this behaviour is deprecated.
So instead you should use the equivalent method, as the warning advises. In the case of a multiplication:
In [13]: df.mul(s, axis=0)
Out[13]:
A B
2012-01-01 1 10
2012-01-02 4 40
2012-01-03 9 90
So for each operator (+, *, <, >, /, etc.) there is an equivalent method. See here for the list of methods: http://pandas.pydata.org/pandas-docs/stable/api.html#id4
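For instance, the same pattern works for the other operators (a quick sketch with the df and s from above):

df.add(s, axis=0)  # equivalent to df + s, broadcast along the index
df.sub(s, axis=0)  # equivalent to df - s
df.gt(s, axis=0)   # equivalent to df > s, compared along the index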
To show what is meant by 'broadcast over the columns', here is another example:
In [14]: s2 = pd.Series([10, 100], index=['A', 'B'])
In [15]: s2
Out[15]:
A 10
B 100
dtype: int64
In [16]: df * s2
Out[16]:
A B
2012-01-01 10 1000
2012-01-02 20 2000
2012-01-03 30 3000
So as you can see, each element of the series is matched with one column, and the whole column is then multiplied by that value. In the case of the time series, by contrast, each element of the series was matched with a row.
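Applied to the backtest code above, the fix is to replace each plain * between a DataFrame and the Close series with the explicit method. A sketch of backtest_portfolio using mul (same logic, just with the axis made explicit):

def backtest_portfolio(self):
    # Broadcast the Close price series along the index (rows) explicitly
    portfolio = self.positions.mul(self.bars['Close'], axis=0)
    pos_diff = self.positions.diff()
    portfolio['holdings'] = self.positions.mul(self.bars['Close'], axis=0).sum(axis=1)
    portfolio['cash'] = self.initial_capital - pos_diff.mul(self.bars['Close'], axis=0).sum(axis=1).cumsum()
    portfolio['total'] = portfolio['cash'] + portfolio['holdings']
    portfolio['returns'] = portfolio['total'].pct_change()
    return portfolio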
Related
I have the following problem. Suppose I have a DataFrame consisting of three columns (a mock example follows below). Essentially, it holds three factors, A, B and C, for which I have values for each business day within a time range.
import pandas as pd
import numpy as np
index_d = pd.bdate_range(start='10/5/2022', end='10/27/2022')
index = np.repeat(index_d,3)
values = np.random.randn(3*len(index_d), 1)
columns_v = len(index_d)*["A","B","C"]
df = pd.DataFrame()
df["x"] = np.asarray(index)
df["y"] = values
df["factor"] = np.asarray([columns_v]).T
I would like to plot the business weekly averages of the three factors over time. A business week goes from Monday to Friday. However, in the example above I start and end within a week, which means the first weekly average consists only of the data points on the 5th, 6th and 7th of October, and similarly for the last week. Ideally, the output should have the form
import datetime as dt

dt1 = dt.datetime.strptime("20221007", "%Y%m%d").date()
dt2 = dt.datetime.strptime("20221014", "%Y%m%d").date()
dt3 = dt.datetime.strptime("20221021", "%Y%m%d").date()
dt4 = dt.datetime.strptime("20221027", "%Y%m%d").date()
d = 3*[dt1, dt2, dt3, dt4]
values = np.random.randn(len(d), 1)
factors = 4*["A","B","C"]
df_output = pd.DataFrame()
df_output["time"] = d
df_output["values"] = values
df_output["factors"] = factors
I can then plot the weekly averages using seaborn as a lineplot with hue. Note that the time value for each weekly average is always the last business day of that week (Friday, except for the last week, where it is a Thursday).
I was thinking of groupby. However, given that my real data is much larger and possibly has some NaNs, I'm not sure how to do it, in particular with regard to the random start/end points that don't need to be Monday/Friday.
Try as follows:
res = df.groupby([pd.Grouper(key='x', freq='W-FRI'), df.factor])['y'].mean()\
    .reset_index(drop=False)
res = res.rename(columns={'x': 'time', 'factor': 'factors', 'y': 'values'})
res['time'] = res.time.map(pd.merge_asof(df.x, res.time, left_on='x',
                                         right_on='time', direction='forward')
                           .groupby('time').last()['x']).astype(str)
print(res)
time factors values
0 2022-10-07 A 0.171228
1 2022-10-07 B -0.250432
2 2022-10-07 C -0.126960
3 2022-10-14 A 0.455972
4 2022-10-14 B 0.582900
5 2022-10-14 C 0.104652
6 2022-10-21 A -0.526221
7 2022-10-21 B 0.371007
8 2022-10-21 C 0.012099
9 2022-10-27 A -0.123510
10 2022-10-27 B -0.566441
11 2022-10-27 C -0.652455
Plot data:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
fig, ax = plt.subplots(figsize=(8,5))
ax = sns.lineplot(data=res, x='time', y='values', hue='factors')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.show()
Result: a line plot of the weekly averages for each factor over time.
Explanation
First, apply df.groupby. Grouping by factor is of course easy; for the dates we can use pd.Grouper with freq parameter set to W-FRI (each week through to Friday), and then we want to get the mean for column y (NaN values will just be ignored).
In the next step, let's use df.rename to rename the columns.
We are basically done now, except for the fact that pd.Grouper will use each Friday (even if it isn't present in the actual set). E.g.:
print(res.time.unique())
['2022-10-07T00:00:00.000000000' '2022-10-14T00:00:00.000000000'
'2022-10-21T00:00:00.000000000' '2022-10-28T00:00:00.000000000']
If you are OK with this, you can just start plotting (but see below). If you would like to get '2022-10-27' instead of '2022-10-28', we can combine Series.map applied to column time with pd.merge_asof, and perform another groupby to get last in column x. I.e. this will get us the closest match to each Friday within each week (so, in fact just Friday in all cases, except the last: 2022-10-27).
In either scenario, before plotting, make sure to turn the datetime values into strings: res['time'] = res['time'].astype(str)!
You can add a column with the calendar week:
df['week'] = df.x.dt.isocalendar().week
Get a mask for all the Fridays, and for the last day:
last_of_week = (df.x.dt.isocalendar().day == 5).values
last_of_week[-1] = True
Get the actual dates:
last_days = df.x[last_of_week].unique()
Group by week and factor, take the mean:
res = df.groupby(['week', 'factor']).mean().reset_index()
Clean up:
res = res.drop('week', axis=1)
res['x'] = pd.Series(last_days).repeat(3).reset_index(drop=True)
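Putting those steps together, here is a minimal runnable sketch, assuming the df built in the question; I select ['y'] before taking the mean so only the value column is averaged:

df['week'] = df.x.dt.isocalendar().week

# Mask for Fridays, plus the final (possibly mid-week) day
last_of_week = (df.x.dt.isocalendar().day == 5).values
last_of_week[-1] = True
last_days = df.x[last_of_week].unique()

# Weekly mean of y per factor, then attach the last business day of each week
res = df.groupby(['week', 'factor'])['y'].mean().reset_index()
res = res.drop('week', axis=1)
res['x'] = pd.Series(last_days).repeat(3).reset_index(drop=True)  # 3 factors per week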
I have a pandas dataframe that looks like this:
Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid | Multiplier | Hourly_Pay
A1 | 35 | 01/01/1990 | 06/04/2020 | 5000 | 0.229961 | 32.85
B2 | 35 | 02/01/2020 | NaN | 10000 | 0.229961 | 65.70
C3 | 30 | 23/03/2020 | NaN | 5800 | 0.229961 | 44.46
The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.
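For example, for employee A1 that gives 5000 * 0.229961 / 35 ≈ 32.85, which matches the Hourly_Pay column.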
Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.
I can code the calculation quite easily; my struggle is adding a column to reflect the business days (4 in the above example) in a fresh column for all April leavers (I am not interested in any other months). So far I have tried:
df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)
However, the above approach returns an error stating that:
iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]')
I should add here that I had to change the dates to datetime64[ns] format manually.
Any pointers gratefully received. Thanks!
The issue with your np.where call is that it passes the entire series df["Termination_Date"] as an argument to np.busday_count. That function fails because it requires its arguments to be of dtype datetime64[D] (i.e., values specified only to the day), and the Series cannot easily be cast to that dtype.
One solution is to write a custom function that only calls np.busday_count on elements that are not NaT, converting them to the datetime64[D] type before the call. Then you can apply the custom function to the df["Termination_Date"] series, as below:
#!/usr/bin/env python3
import numpy as np
import pandas as pd

DATE_FORMAT = "%d-%m-%Y"

# Reproduce raw data
raw_data = [
    ["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
    ["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
    ["C3", 30, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]

# Convert raw dd/mm/yyyy dates to ISO format, then np.datetime64
def parse_raw_dates(s):
    try:
        spl = s.split("/")
        ds = "%s-%s-%s" % (spl[2], spl[1], spl[0])
    except (AttributeError, IndexError):
        ds = "NAT"
    return np.datetime64(ds)

for line in raw_data:
    line[2] = parse_raw_dates(line[2])
    line[3] = parse_raw_dates(line[3])  # the termination dates need parsing too

# Create dataframe
df = pd.DataFrame(
    data=raw_data,
    columns=[
        "Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
        "Salary_Paid", "Multiplier", "Hourly_Pay"],
)

# Create special conversion function
def myfunc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 0
    else:
        return np.busday_count('2020-04-01', d)

df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)

def format_date(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return ""
    else:
        return pd.to_datetime(d).strftime(DATE_FORMAT)

df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)
Posting my approach here in case it helps others in the future. First, the code for creating the dataframe:
import numpy as np
import pandas as pd

d = {'Emp_ID': ['A1', 'B2', 'C3'], 'Weekly Hours': [35.0, 35.0, 30.0],
     'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'],
     'Termination_Date': ['06/04/2020', np.nan, np.nan], 'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)
df
The first step was to convert the dates to a more usable format; this is where pd.to_datetime() comes in handy. The only adjustment needed was to specify the format.
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')
This has the desired effect: the dates are correctly represented, and April is picked up as the right month of termination for employee A1.
I now (slightly) adjusted Ken's custom solution for calculating the working days in April:
import datetime as dt

def workday_calc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 30.44
    else:
        d = d.astype(str)
        d = dt.datetime.strptime(d, '%Y-%m-%d')
        e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
        return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])
I spotted the error while reviewing numpy documentation on np.busday_count(). There are two useful pointers to note:
The use of datetime64[D] is mandatory in the first line of the function; you can't use pd.to_datetime() here, because the datetime64[D] format is a prerequisite for calling np.isnat().
However, once the NaT values have been dealt with, we need to switch back to a string format, which is what datetime.strptime() needs. Using datetime.strptime(), we tell Python that the date is represented in ISO format while keeping it as a string. The advantage of both datetime.strptime() and np.busday_count() is that they are built to handle strings.
Also, np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one, so that all the dates in between are counted. This may or may not be appropriate for what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee has worked 4 business days in April.
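A quick check of that end-exclusive behaviour (2020-04-06 was a Monday):

import numpy as np

np.busday_count('2020-04-01', '2020-04-06')  # 3: counts Wed 1st, Thu 2nd, Fri 3rd only
np.busday_count('2020-04-01', '2020-04-07')  # 4: bumping the end date by one also counts Mon 6th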
We then simply apply the custom function and create a new column.
df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)
I was now able to use the freshly created column to derive my multiplier - using the same old approach. The rest is simple, but I'm including the code and results below for completeness.
df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)
Output:
Emp_ID Weekly Hours Hire_Date Termination_Date Salary_Paid Days_Worked_April Multiplier Hourly_Pay_Calc
0 A1 35.0 1990-01-01 2020-04-06 5000 4.00 1.750000 250.00
1 B2 35.0 2020-01-02 NaT 10000 30.44 0.229961 65.70
2 C3 30.0 2020-03-23 NaT 5800 30.44 0.229961 44.46
There's a sensor dataset, and the values in the value column need to be corrected based on one specific reference sensor R in the data. The values are directions in degrees (a 360-degree circle). The correction method is as follows: for each individual sensor i, calculate the sum of the sine/cosine differences with respect to the reference sensor and get the correction in degrees by taking the arctangent, then subtract it from the sensor's original values. Vi(t) is the value of sensor i at time t, and VR(t) is the value of the reference sensor R at time t.
date sensor value tag
0 2000-01-01 1 200 a
1 2000-01-02 1 200 a
''''''''''''''''''''''''''''''''
7 2000-01-08 1 300 b
8 2000-01-02 2 202 c
9 2000-01-03 2 204 c
10 2000-01-04 2 206 c
I have tried a few things but am a little confused about how to complete this in a for loop.
The timestamps for sensors are not matching. The individual sensor may have more or less timestamps than the reference sensor.
I want to add an additional column to store corrected values.
Below is the sample dataset I made. If I choose sensor 2 as the reference sensor to correct the other sensors' values, how can I complete this in a Python loop? Thanks in advance!
import pandas as pd
sensor1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=8),"sensor":[1,1,1,1,1,1,1,1],"value":[200,200,200,200,200,300,300,300],"tag":pd.Series(['a','b']).repeat(4)})
sensor2 = pd.DataFrame({"date": pd.date_range('1/2/2000', periods=10),"sensor":[2,2,2,2,2,2,2,2,2,2],"value":[202,204,206,208,220,250,300,320,280,260],"tag":pd.Series(['c','d']).repeat(5)})
sensor3 = pd.DataFrame({"date": pd.date_range('1/3/2000', periods=10),"sensor":[3,3,3,3,3,3,3,3,3,3],"value":[265,222,232,220,260,300,250,200,190,223],"tag":pd.Series(['e','f']).repeat(5)})
sensor4 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=11),"sensor":[4,4,4,4,4,4,4,4,4,4,4],"value":[206,203,210,253,237,282,320,232,255,225,262],"tag":pd.Series(['c']).repeat(11)})
sensordata = sensor1.append([sensor2,sensor3,sensor4]).reset_index(drop = True)
Here is an inelegant solution, using for loops and multiple merges. As an example, I use sensor4 to correct the remaining sensors. The correction formula was not 100% clear to me, so I interpreted it as adding the sine and the cosine.
import numpy as np

def data_correction(vi, vr):
    # I assume the sine and cosine differences are summed?
    return vi - np.arctan(np.sum(np.sin(vi - vr) + np.cos(vi - vr), axis=0))

sensors = [sensor1, sensor2, sensor3]  # assuming you want to correct with sensor 4
sensorR = sensor4.copy()

for i in range(len(sensors)):
    # create a temp dataframe, merged on date, so that measurements line up
    temp = pd.merge(sensors[i], sensorR, how='inner', left_on='date', right_on='date')
    # do the correction and assign it to a new column
    temp['value_corrected'] = data_correction(temp['value_x'], temp['value_y'])
    # add this column back to the original sensor data
    sensors[i] = sensors[i].merge(temp[['date', 'value_corrected']], how='inner',
                                  left_on='date', right_on='date')
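For what it's worth, if the intended formula is the usual circular-statistics offset (arctangent of the summed sines over the summed cosines of the differences), a variant of the correction function might look like this sketch; note it converts degrees to radians before the trigonometry, which the version above does not:

import numpy as np

def data_correction_circular(vi, vr):
    # Differences in radians (the values are directions in degrees)
    diff = np.deg2rad(vi - vr)
    # Mean angular offset via arctan2 of the summed sine and cosine components
    offset = np.arctan2(np.sum(np.sin(diff)), np.sum(np.cos(diff)))
    return vi - np.rad2deg(offset)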
I'm finding my way around GroupBy, but I still need some help. Let's say I have a DataFrame with a Group column giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 closest objects of the group (so I keep 4 objects in each group; we can assume that no group is smaller than 4 objects if needed).
We assume here that we have defined the following functions:
# deg to rad
def d2r(x):
    return x * np.pi / 180.0

# rad to deg
def r2d(x):
    return x * 180.0 / np.pi

# Computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glazes over a lot of details, see here for some details on internals if you're interested.)
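A quick way to see those pairs with the mock df above:

for group_id, sub in df.groupby('Group'):
    print(group_id, sub.shape)  # the group label, and the sub-frame holding that group's rows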
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
Dec Group R RA
Group
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)

    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row

    # Calculate separation for each pair including minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))

    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index

    # Restrict the result to those 3 index labels + your minimum R
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
I want to built a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = (merged.sort_values(['Group', 'sep'], ascending=[True, True])
           .groupby('Group')[df.columns].head(4))
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454
I have a pandas data object, data, that is stored as a Series of Series. The first series is indexed on ID1 and the second on ID2.
ID1 ID2
1 10259 0.063979
14166 0.120145
14167 0.177417
14244 0.277926
14245 0.436048
15021 0.624367
15260 0.770925
15433 0.918439
15763 1.000000
...
1453 812690 0.752274
813000 0.755041
813209 0.756425
814045 0.778434
814474 0.910647
814475 1.000000
Length: 19726, dtype: float64
I have a function that uses values from this object for further data processing. Here is the function:
# Function
def getData(ID1, randomDraw):
    dataID2 = data[ID1]
    value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
    return value
I use np.vectorize to apply this function on a DataFrame - dataFrame - that has about 22 million rows.
dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
where ID1 and RAND are columns with values that are feeding into the function.
The code takes about 6 hours to process everything. A similar implementation in Java takes only about 6 minutes to get through 22 million rows of data.
On running a profiler on my program I find that the most expensive call is the indexing into data and the second most expensive is searchsorted.
Function Name: pandas.core.series.Series.__getitem__
Elapsed inclusive time percentage: 54.44
Function Name: numpy.core.fromnumeric.searchsorted
Elapsed inclusive time percentage: 25.49
Using data.loc[ID1] to get data makes the program even slower. How can I make this faster? I understand that Python cannot achieve the same efficiency as Java, but 6 hours compared to 6 minutes seems too much of a difference. Maybe I should be using a different data structure or different functions? I am using Python 2.7 and the PTVS IDE.
Adding a minimum working example:
import numpy as np
import pandas as pd
np.random.seed(0)
#Creating a dummy data object - Series within Series
alt = pd.Series(np.array([ 0.25, 0.50, 0.75, 1.00]), index=np.arange(1,5))
data = pd.Series([alt]*1500, index=np.arange(1,1501))
#Creating dataFrame -
nRows = 200000
d = {'ID1': np.random.randint(1500, size=nRows) + 1
,'RAND': np.random.uniform(low=0.0, high=1.0, size=nRows)}
dataFrame = pd.DataFrame(d)
# Function
def getData(ID1, randomDraw):
    dataID2 = data[ID1]
    value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
    return value

dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
You may get better performance with this code:
>>> def getData(ts):
...     dataID2 = data[ts.name]
...     i = np.searchsorted(dataID2.values, ts.values, side='left')
...     return dataID2.index[i]
...
>>> dataFrame['ID2'] = dataFrame.groupby('ID1')['RAND'].transform(getData)
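This should be faster because the groupby means data[ts.name] is looked up only once per unique ID1, and np.searchsorted then runs once over the whole group's RAND values as an array, instead of one Series lookup plus one scalar search per row as in the np.vectorize version.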