Question
Let's assume we are given the following sparse table indicating when a security was listed on an index.
identifier from thru
AAPL 1964-03-31 --
ABT 1999-01-03 2003-12-31
ABT 2005-12-31 --
AEP 1992-01-15 2017-08-31
KO 2014-12-31 --
ABT, for example, is on the index from 1999-01-03 to 2003-12-31 and again from 2005-12-31 until today (-- indicates today). In between, it is not listed on the index.
How can I efficiently transform this sparse table into a dense table of the following form?
date AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
... ... ... ... ...
1999-01-03 1 1 1 0
1999-01-04 1 1 1 0
... ... ... ... ...
2003-12-31 1 1 1 0
2004-01-01 1 0 1 0
... ... ... ... ...
2017-09-04 1 1 0 1
In the section My Solution you will find my solution to the problem. Unfortunately, the code seems to perform very badly: it took about 22 seconds to process 1648 entries.
As I am new to Python, I wonder how to program problems like these efficiently.
I do not expect anyone to provide a solution to my problem (unless you wish to do so). My primary goal is to understand how to solve problems like these efficiently in Python. I used the functionality of pandas to match the respective entries. Should I have used numpy and indexing instead? Should I have used other toolboxes? How can I gain performance improvements?
Please find my approach to this problem in the section below (if it is of interest to you).
Thank you very much for your help.
My Solution
I have tried to solve the problem by looping over every row of the sparse table. In each iteration, I build a Boolean frame for that row's from-thru interval with all elements set to True and append it to a list. At the end, I pd.concat the list, then unstack and reindex the resulting DataFrame.
import pandas as pd
import numpy as np
def get_ts_data(data, start_date, end_date, attribute=None, identifier=None, frequency=None):
"""
Transform sparse table to dense table.
Parameters
----------
data: pd.DataFrame
        sparse table with minimal column specification ['identifier', 'from', 'thru']
start_date: pd.Timestamp, str
start date of the dense matrix
end_date: pd.Timestamp, str
end date of the dense matrix
attribute: str
column name of the value of the dense matrix.
identifier: str
column name of the identifier
frequency: str
frequency of the dense matrix
kwargs:
Allows to overwrite naming of 'from' and 'thru' variables.
e.g.
{'from': 'start', 'thru': 'end'}
    Returns
    -------
    out_frame: pd.DataFrame
        dense matrix indexed by date, with one column per identifier
    """
if attribute is None:
attribute = ['on_index']
elif not isinstance(attribute, list):
attribute = [attribute]
if identifier is None:
identifier = ['identifier']
elif not isinstance(identifier, list):
identifier = [identifier]
if frequency is None:
frequency = 'B'
    # copy data to avoid mutating the caller's DataFrame
data_mod = data.copy()
data_mod['on_index'] = True
# specify start date and check type
if not isinstance(start_date, pd.Timestamp):
start_date = pd.Timestamp(start_date)
# specify end date and check type
if not isinstance(end_date, pd.Timestamp):
end_date = pd.Timestamp(end_date)
# specify output date range
date_range = pd.date_range(start_date, end_date, freq=frequency)
#overwrite null indicating that it is valid until today
missing = data_mod['thru'].isnull()
data_mod.loc[missing, 'thru'] = data_mod.loc[missing, 'from'].apply(lambda d: max(d, end_date))
# preallocate frms
frms = []
# add dataframe to frms with time specific entries
for index, row in data_mod.iterrows():
# date range index
d_range = pd.date_range(row['from'], row['thru'], freq=frequency)
# Multi index with date and identifier
d_index = pd.MultiIndex.from_product([d_range] + [[x] for x in row[identifier]], names=['date'] + identifier)
# add DataFrame with repeated values to list
frms.append(pd.DataFrame(data=np.repeat(row[attribute].values, d_index.size), index=d_index, columns=attribute))
out_frame = pd.concat(frms)
out_frame = out_frame.unstack(identifier)
out_frame = out_frame.reindex(date_range)
return out_frame
if __name__ == "__main__":
data = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
'from': [pd.Timestamp('1964-03-31'),
pd.Timestamp('1999-01-03'),
pd.Timestamp('2005-12-31'),
pd.Timestamp('1992-01-15'),
pd.Timestamp('2014-12-31')],
'thru': [np.nan,
pd.Timestamp('2003-12-31'),
np.nan,
pd.Timestamp('2017-08-31'),
np.nan]
})
transformed_data = get_ts_data(data, start_date='1964-03-31', end_date='2017-09-04', attribute='on_index', identifier='identifier', frequency='B')
print(transformed_data)
# Ensure dates are Pandas timestamps.
df['from'] = pd.DatetimeIndex(df['from'])
df['thru'] = pd.DatetimeIndex(df['thru'].replace('--', np.nan))
# Get sorted list of all unique dates and create index for full range.
dates = sorted(set(df['from'].tolist() + df['thru'].dropna().tolist()))
dti = pd.date_range(start=dates[0], end=dates[-1], freq='B')
# Create new target dataframe based on symbols and full date range. Initialize to zero.
df2 = pd.DataFrame(0, columns=df['identifier'].unique(), index=dti)
# Find all active symbols and set their symbols' values to one from their respective `from` dates.
for _, row in df[df['thru'].isnull()].iterrows():
df2.loc[df2.index >= row['from'], row['identifier']] = 1
# Find all other symbols and set their symbols' values to one between their respective `from` and `thru` dates.
for _, row in df[df['thru'].notnull()].iterrows():
df2.loc[(df2.index >= row['from']) & (df2.index <= row['thru']), row['identifier']] = 1
>>> df2.head(3)
AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
1964-04-02 1 0 0 0
>>> df2.tail(3)
AAPL ABT AEP KO
2017-08-29 1 1 1 1
2017-08-30 1 1 1 1
2017-08-31 1 1 1 1
>>> df2.loc[:'2004-01-02', 'ABT'].tail()
2003-12-29 1
2003-12-30 1
2003-12-31 1
2004-01-01 0
2004-01-02 0
Freq: B, Name: ABT, dtype: int64
>>> df2.loc['2005-12-30':, 'ABT'].head(3)
2005-12-30 0
2006-01-02 1
2006-01-03 1
Freq: B, Name: ABT, dtype: int64
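For comparison, here is a minimal, fully vectorized sketch of the same transformation (my own addition, not part of the answer above; it assumes the sparse table is already parsed, with -- stored as NaT): build the whole date-by-row Boolean matrix in one broadcast and collapse duplicate identifiers such as ABT with a column-wise max. The same broadcasting idea appears again in the S&P 500 answers further down.
import numpy as np
import pandas as pd

df = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
                   'from': pd.to_datetime(['1964-03-31', '1999-01-03', '2005-12-31',
                                           '1992-01-15', '2014-12-31']),
                   'thru': pd.to_datetime([None, '2003-12-31', None, '2017-08-31', None])})
end = pd.Timestamp('2017-09-04')
dates = pd.date_range(df['from'].min(), end, freq='B')
# One Boolean column per row of the sparse table, via broadcasting.
d = dates.values[:, None]
active = (d >= df['from'].values) & (d <= df['thru'].fillna(end).values)
# Collapse the duplicate ABT columns with a max, then cast to 0/1.
dense = (pd.DataFrame(active, index=dates, columns=df['identifier'])
         .T.groupby(level=0).max().T.astype(int))
print(dense.tail())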
Related
I have a dataframe which contains sales information of products. What I need to do is create a function which, based on the product id, product type and date, calculates the average sales for the time period up to the given date.
This is how I have implemented it, but this approach takes a lot of time, and I was wondering if there is a faster way to do this.
Dataframe:
import numpy as np
import pandas as pd
product_type = ['A', 'B']
df = pd.DataFrame({'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
                   'prod_type': np.random.choice(np.array(product_type), 705),
                   'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
                   'sale_amt': np.random.randint(4, 100, size=705)})
Current code:
def cal_avg(product,ptype,pdate):
temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<= pdate)]
return temp_df['sale_amt'].mean()
Calling the function:
cal_avg(2,'A','2018-02-12 15:00:00')
53.983
If you are running the cal_avg function only "rarely", then I suggest ignoring my answer. Otherwise, it might be beneficial to simply calculate the expanding-window average once for each product/product type. It might be slow depending on your dataset size (in which case maybe just run it on specific product types), but you'll only need to run it once. First sort by the column you want to perform the expanding aggregation over (expanding has no 'on' parameter) to ensure the proper row order. Then groupby and transform each group (to keep the indices of the original dataframe) with your expanding-window aggregation of choice, in this case mean:
df = df.sort_values('sales_time')
df['exp_mean_sales'] = df.groupby(['prod_id', 'prod_type'])['sale_amt'].transform(lambda gr: gr.expanding().mean())
With the result being:
df.head()
prod_id prod_type sales_time sale_amt exp_mean_sales
0 2 B 2018-01-01 00:00:00 8 8.000000
1 2 B 2018-01-01 03:00:00 72 40.000000
2 2 B 2018-01-01 06:00:00 33 37.666667
3 2 A 2018-01-01 09:00:00 81 81.000000
4 2 B 2018-01-01 12:00:00 83 49.000000
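As a hypothetical lookup against the precomputed column (lookup_avg is my name, not part of the answer): filter to the product/type and take the last row at or before the requested timestamp, which reproduces cal_avg's <= semantics.
def lookup_avg(product, ptype, pdate):
    # Last precomputed expanding mean at or before pdate for this product/type.
    grp = df[(df['prod_id'] == product) & (df['prod_type'] == ptype)
             & (df['sales_time'] <= pd.Timestamp(pdate))]
    return grp['exp_mean_sales'].iloc[-1] if len(grp) else float('nan')

lookup_avg(2, 'A', '2018-02-12 15:00:00')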
Check the code below, with a %%timeit comparison (run on Google Colab):
import numpy as np
import pandas as pd
product_type = ['A', 'B']
df = pd.DataFrame({'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
                   'prod_type': np.random.choice(np.array(product_type), 705),
                   'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
                   'sale_amt': np.random.randint(4, 100, size=705)})
## OP's function
def cal_avg(product,ptype,pdate):
temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<= pdate)]
return temp_df['sale_amt'].mean()
## Numpy data prep
prod_id_array = np.array(df.values[:,:1])
prod_type_array = np.array(df.values[:,1:2])
sales_time_array = np.array(df.values[:,2:3], dtype=np.datetime64)
values = np.array(df.values[:,3:])
OP's function -
%%timeit
cal_avg(2,'A','2018-02-12 15:00:00')
Output:
Numpy version
%%timeit -n 1000
cal_vals = [2,'A','2018-02-12 15:00:00']
mask = ((prod_id_array == cal_vals[0]) &
        (prod_type_array == cal_vals[1]) &
        (sales_time_array <= np.datetime64(cal_vals[2])))
np.mean(values[mask])
Output:
I have a dataframe of daily license_type activations (either full or trial) as shown below. Basically, I am trying to see the monthly count of Trial to Full License conversions. I am trying to do this by taking into consideration the daily data and the user_email column.
Date User_Email License_Type P.Letter Month (conversions)
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
4 2017-04-08 761179767639020420 full g 2017-04
The logic I have is to iteratively check the User_Email column. If the User_Email value is a duplicate, then check the License_Type column: if the value in License_Type is 'full', return 1 in a new column called 'Conversion', else return 0. This would be the amendment to the original dataframe above.
Then group the 'Date' column by month, and I should have an aggregate value of monthly conversions in the 'Conversion' column. It should look something like below:
Date
2017-Apr 1
2017-Feb 2
2017-Jan 1
2017-Jul 0
2017-Mar 1
Name: Conversion
Below is my attempt at getting the desired output above:
#attempt to create a new column Conversion and fill with 1 and 0 for if converted or not.
for values in df['User_email']:
if value.is_unique:
df['Conversion'] = 0 #because there is no chance to go from trial to Full
else:
if df['License_type'] = 'full': #check if license type is full
df['Conversion'] = 1 #if full, I assume it was originally trial and now is full
# Grouping daily data by month to get monthly total of conversions
converted = df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Your sample data doesn't have the features you say you are looking for, so the sample below extends it with a repeated email address. Rather than loop (always a pandas anti-pattern), use a simple function that operates row by row. For the uniqueness test, I first count how many times each email address occurs and attach that count to every row. Your logic is then transcribed in a slightly different way.
data = """ Date User_Email License_Type P.Letter Month
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
3 2017-03-13 2475366081966194134 full c 2017-03
3 2017-03-13 2475366081966194 full c 2017-03
4 2017-04-08 761179767639020420 full g 2017-04"""
a = [[t.strip() for t in re.split(" ",l) if t.strip()!=""] for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(a[1:], columns=a[0])
df["Date"] = pd.to_datetime(df["Date"])
df = df.assign(
emailc=df.groupby("User_Email")["User_Email"].transform("count"),
Conversion=lambda dfa: dfa.apply(lambda r: 0 if r["emailc"]==1 or r["License_Type"]=="trial" else 1, axis=1)
).drop("emailc", axis=1)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Output:
Date
2017-Apr 0
2017-Feb 1
2017-Jan 0
2017-Jul 0
2017-Mar 1
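For reference, the same rule can also be written without the helper column; a short sketch reusing the df built above:
# 1 only when the email occurs more than once and the row is a 'full' license.
dup = df['User_Email'].duplicated(keep=False)
df['Conversion'] = (dup & (df['License_Type'] == 'full')).astype(int)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()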
I have a Pandas DataFrame that has the dates that SP500 constituents were added to/deleted from the index. It looks something like this:
PERMNO start ending
0 10006.0 1957-03-01 1984-07-18
1 10030.0 1957-03-01 1969-01-08
2 10049.0 1925-12-31 1932-10-01
3 10057.0 1957-03-01 1992-07-02
4 10078.0 1992-08-20 2010-01-28
I also have a list of dates that I am concerned with, it consists of trading days between 1/1/2003 and 6/30/2009. I want to create a dataframe with these dates on the index and PERMNOs as the columns. It will be populated as a truth table of whether the stock was included in the SP500 on that day.
Is there a fast way of doing this?
Note: some stocks are added to the SP500, then removed, then later added again.
If I understand you correctly, you are trying to find the list of S&P 500 constituents as of a series of dates. Assuming your dataframe has start and ending as datetime64 already:
# the list of dates that you are interested in
dates = pd.Series(['1960-01-01', '1980-01-01'], dtype='datetime64[ns]')
start = df['start'].values
end = df['ending'].values
d = dates.values[:, None] # to prepare for array broadcasting
# if the date is between `start` and `ending` of the stock's membership in the S&P 500
match = (start <= d) & (d <= end)
# list of PERMNO for each as-of date
p = dates.index.to_series() \
    .apply(lambda i: df.loc[match[i], 'PERMNO']) \
    .stack().droplevel(-1).rename('PERMNO')
# tying everything together
result = dates.to_frame('AsOfDate').join(p)
Result:
AsOfDate PERMNO
0 1960-01-01 10006.0
0 1960-01-01 10030.0
0 1960-01-01 10057.0
1 1980-01-01 10006.0
1 1980-01-01 10057.0
You can use the DataFrame constructor with np.tile and np.repeat, filtering by a mask created with ravel:
dates = pd.to_datetime(['1960-01-01', '1980-01-01'])
start = df['start'].values
end = df['ending'].values
d = dates.values[:, None]
#filter by boolean broadcasting
match = (start <= d) & (d <= end)
a = np.tile(df['PERMNO'], len(dates))
b = np.repeat(dates, len(df))
mask = match.ravel()
df1 = pd.DataFrame({'Date1':b[mask], 'PERMNO':a[mask]})
print (df1)
Date1 PERMNO
0 1960-01-01 10006.0
1 1960-01-01 10030.0
2 1960-01-01 10057.0
3 1980-01-01 10006.0
4 1980-01-01 10057.0
For a different output, a True/False table:
df2 = pd.DataFrame(match, index=dates, columns=df['PERMNO'])
print (df2)
PERMNO 10006.0 10030.0 10049.0 10057.0 10078.0
1960-01-01 True True False True False
1980-01-01 True False False True False
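To get a row for every trading day in the window mentioned in the question, a sketch that reuses df from above (business days as an approximation of trading days); PERMNOs that left and rejoined the index can be collapsed with a column-wise max:
dates = pd.date_range('2003-01-01', '2009-06-30', freq='B')
d = dates.values[:, None]
match = (df['start'].values <= d) & (d <= df['ending'].values)
truth = pd.DataFrame(match, index=dates, columns=df['PERMNO'])
# Merge the columns of any PERMNO that has several membership spells.
truth = truth.T.groupby(level=0).max().T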
I'm searching for a quick and efficient approach to the following task.
I need to create a column that, for each DeviceID, contains an array of the unique SessionStartDate values for that device.
For example:
8846620190473426378 | [2018-08-01, 2018-08-02]
381156181455864495 | [2018-08-01]
Though user 8846620190473426378 may have had 30 sessions on 2018-08-01, and 25 sessions on 2018-08-02, I'm only interested in unique dates when these sessions occurred.
Currently, I'm using this approach:
df_main['active_days'] = [
sorted(
list(
set(
sessions['SessionStartDate'].loc[sessions['DeviceID'] == x['DeviceID']]
)
)
)
for _, x in df_main.iterrows()
]
df_main here is another DataFrame, containing aggregated data grouped by DeviceID
The approach is very slow (Wall time: 1h 45min 58s), and I believe there's a better solution for this task.
Thanks in advance!
I believe you need sort_values with SeriesGroupBy.unique:
rng = pd.date_range('2017-04-03', periods=4)
sessions = pd.DataFrame({'SessionStartDate': rng, 'DeviceID':[1,2,1,2]})
print (sessions)
SessionStartDate DeviceID
0 2017-04-03 1
1 2017-04-04 2
2 2017-04-05 1
3 2017-04-06 2
#if necessary convert datetimes to dates
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
.groupby('DeviceID')['SessionStartDate']
.unique())
print (out)
DeviceID
1 [2017-04-03, 2017-04-05]
2 [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object
Another solution is to remove duplicates with drop_duplicates, then groupby and convert to lists:
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
.drop_duplicates(['DeviceID', 'SessionStartDate'])
.groupby('DeviceID')['SessionStartDate']
.apply(list))
print (out)
DeviceID
1 [2017-04-03, 2017-04-05]
2 [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object
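To attach either result to df_main as the question intends, a small sketch (the df_main below is hypothetical; only its DeviceID column matters):
# Hypothetical aggregated frame keyed by DeviceID, as described in the question.
df_main = pd.DataFrame({'DeviceID': [1, 2], 'n_sessions': [30, 25]})
# 'out' is the Series produced by either groupby above (indexed by DeviceID).
df_main['active_days'] = df_main['DeviceID'].map(out)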
I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row whenever the transaction_dt falls between a pair of start and end values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using an IntervalIndex:
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
df[['End','Start']]=df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
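One caveat with the .loc lookup above: a transaction_dt that is not covered by any interval raises a KeyError. A more tolerant sketch (my own, assuming df and df2 as built above) uses get_indexer, which returns -1 for unmatched dates:
idx = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
pos = idx.get_indexer(df['transaction_dt'])   # -1 where no window matches
covered = pos >= 0
df['Start'] = pd.NaT
df['End'] = pd.NaT
df.loc[covered, ['Start', 'End']] = df2.iloc[pos[covered]][['Start', 'End']].values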
If you want the start and end of the containing month, we can use the approach from "Extracting the first day of month of a datetime type column in pandas":
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
new approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary
# This assumes dates are sorted
# if not we should change [0] -> min_dt and [-1] --> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
timestamps.append(timestamps[-1]+datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps,timestamps[1:]))
# Loop through all values and add to column start and end
for ind,value in enumerate(df["transaction_dt"]):
for i,(start,end) in enumerate(ranges):
if (value >= start and value <= end):
df.loc[ind, "start"] = start
df.loc[ind, "end"] = end
# When match is found let's also
# remove all ranges that aren't met
# This can be removed if dates are not sorted
# But this should speed things up for large datasets
for _ in range(i):
ranges.pop(0)
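A vectorized sketch of the same assignment, reusing df and the timestamps list from the block above (the win_start/win_end column names are mine): merge_asof matches each transaction to the last window start at or before it.
import pandas as pd

# Consecutive 30-day windows derived from the `timestamps` built above.
windows = pd.DataFrame({'win_start': timestamps[:-1], 'win_end': timestamps[1:]})
# Both frames must be sorted on their keys; dates that fall exactly on a
# shared boundary land in the later window here.
df = pd.merge_asof(df.sort_values('transaction_dt'), windows,
                   left_on='transaction_dt', right_on='win_start')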