Pivot Dataframe of start and ending dates into truth table - python

I have a Pandas DataFrame that has the dates that SP500 constituents were added to/deleted from the index. It looks something like this:
PERMNO start ending
0 10006.0 1957-03-01 1984-07-18
1 10030.0 1957-03-01 1969-01-08
2 10049.0 1925-12-31 1932-10-01
3 10057.0 1957-03-01 1992-07-02
4 10078.0 1992-08-20 2010-01-28
I also have a list of dates that I am concerned with: the trading days between 1/1/2003 and 6/30/2009. I want to create a DataFrame with these dates as the index and the PERMNOs as the columns, populated as a truth table of whether each stock was included in the SP500 on that day.
Is there a fast way of doing this?
Note: some stocks are added to the SP500, then removed, then later added again.

If I understand you correctly, you are trying to find the list of S&P 500 constituents as of a series of dates. Assuming your dataframe has start and ending as datetime64 already:
import pandas as pd
# the list of dates that you are interested in
dates = pd.Series(['1960-01-01', '1980-01-01'], dtype='datetime64[ns]')
start = df['start'].values
end = df['ending'].values
d = dates.values[:, None] # to prepare for array broadcasting
# if the date is between `start` and `ending` of the stock's membership in the S&P 500
match = (start <= d) & (d <= end)
# list of PERMNO for each as-of date
p = dates.index.to_series() \
    .apply(lambda i: df.loc[match[i], 'PERMNO']) \
    .stack().droplevel(-1)
# tying everything together
result = dates.to_frame('AsOfDate').join(p)
Result:
AsOfDate PERMNO
0 1960-01-01 10006.0
0 1960-01-01 10030.0
0 1960-01-01 10057.0
1 1980-01-01 10006.0
1 1980-01-01 10057.0

You can use the DataFrame constructor with np.tile and np.repeat, then filter both arrays with a boolean mask built by flattening match with ravel:
import numpy as np
import pandas as pd
dates = pd.to_datetime(['1960-01-01', '1980-01-01'])
start = df['start'].values
end = df['ending'].values
d = dates.values[:, None]
#filter by boolean broadcasting
match = (start <= d) & (d <= end)
a = np.tile(df['PERMNO'], len(dates))
b = np.repeat(dates, len(df))
mask = match.ravel()
df1 = pd.DataFrame({'Date1':b[mask], 'PERMNO':a[mask]})
print (df1)
Date1 PERMNO
0 1960-01-01 10006.0
1 1960-01-01 10030.0
2 1960-01-01 10057.0
3 1980-01-01 10006.0
4 1980-01-01 10057.0
Alternatively, for output as a True/False table:
df2 = pd.DataFrame(match, index=dates, columns=df['PERMNO'])
print (df2)
PERMNO 10006.0 10030.0 10049.0 10057.0 10078.0
1960-01-01 True True False True False
1980-01-01 True False False True False
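Regarding the note that some stocks are removed and later added again: in that case the same PERMNO occupies several rows, so a table built directly from match would have duplicate columns. A minimal sketch of one way to collapse them, assuming hypothetical membership spells (the dates and the name spells below are made up for illustration) and treating a stock as a member if any of its spells covers the date:

```python
import numpy as np
import pandas as pd

# Hypothetical spells: PERMNO 10006 appears twice (removed, then re-added).
spells = pd.DataFrame({
    "PERMNO": [10006, 10006, 10078],
    "start": pd.to_datetime(["1957-03-01", "2005-01-03", "1992-08-20"]),
    "ending": pd.to_datetime(["1984-07-18", "2008-12-31", "2010-01-28"]),
})
dates = pd.to_datetime(["1980-01-02", "2003-01-02", "2006-01-03"])

# Broadcast dates (rows) against spells (columns).
d = dates.values[:, None]
match = (spells["start"].values <= d) & (d <= spells["ending"].values)

# One column per spell, then merge columns that share a PERMNO with a logical OR.
per_spell = pd.DataFrame(match, index=dates, columns=spells["PERMNO"])
truth = per_spell.T.groupby(level=0).any().T
print(truth)
```

The transpose is used so the grouping runs over the row index, which avoids grouping along columns directly.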

Related

Replacing a for loop with something more efficient when comparing dates to a list

Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates, date_list, and a data frame df which, for present purposes, contains one column named Event Date holding the date on which an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] > date else False , axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date,event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops, as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge it with the previous results. ffill() fills gaps in the middle of the data, while the final fillna(0) handles gaps at the start of the column. (fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent.)
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.ffill().fillna(0)
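As an alternative to the groupby and merge steps, the cumulative count can also be read off a sorted array of event dates. This is only a sketch, assuming the three events from the question interpreted as 2, 3 and 3 January 2020; np.searchsorted with side="right" returns, for each date, how many sorted events fall on or before it:

```python
import numpy as np
import pandas as pd

# Events from the question (assumed to be day-month-year).
events = pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-03"])
date_list = pd.date_range("2020-01-01", "2020-01-03")

# Sort once, then binary-search every date in date_list.
sorted_events = np.sort(events.values.astype("datetime64[ns]"))
counts = np.searchsorted(
    sorted_events, date_list.values.astype("datetime64[ns]"), side="right"
)

result = pd.DataFrame({"Date": date_list, "Events": counts})
print(result)
```

Because the events are sorted only once, each date costs a binary search rather than a full pass over the data.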
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03

Uniqueness Test on Dataframe column and cross reference with value in second column - Python

I have a dataframe of daily license_type activations (either full or trial) as shown below. Basically, I am trying to see the monthly count of Trial to Full License conversions. I am trying to do this by taking into consideration the daily data and the user_email column.
Date User_Email License_Type P.Letter Month (conversions)
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
4 2017-04-08 761179767639020420 full g 2017-04
The logic I have is to iteratively check the User_Email column. If the User_Email value is a duplicate, check the License_Type column. If the value in License_Type is 'full', return 1 in a new column called 'Conversion'; otherwise return 0. This would be the amendment to the original dataframe above.
Then I would group the 'Date' column by month, which should give me an aggregate value of monthly conversions in the 'Conversion' column. It should look something like this:
Date
2017-Apr 1
2017-Feb 2
2017-Jan 1
2017-Jul 0
2017-Mar 1
Name: Conversion
Below is my attempt at getting the desired output above:
#attempt to create a new column Conversion and fill with 1 and 0 for if converted or not.
for values in df['User_email']:
    if value.is_unique:
        df['Conversion'] = 0 #because there is no chance to go from trial to Full
    else:
        if df['License_type'] = 'full': #check if license type is full
            df['Conversion'] = 1 #if full, I assume it was originally trial and now is full
# Grouping daily data by month to get monthly total of conversions
converted = df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Your sample data doesn't have the features you say you are looking for. Rather than loop (always a pandas anti-pattern), use a simple function that operates row by row.
For the uniqueness test, I first count how many times each email address occurs and set that count on each row.
I've transcribed your logic in a slightly different way.
import re
import pandas as pd
data = """ Date User_Email License_Type P.Letter Month
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
3 2017-03-13 2475366081966194134 full c 2017-03
3 2017-03-13 2475366081966194 full c 2017-03
4 2017-04-08 761179767639020420 full g 2017-04"""
a = [[t.strip() for t in re.split(" ",l) if t.strip()!=""] for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(a[1:], columns=a[0])
df["Date"] = pd.to_datetime(df["Date"])
df = df.assign(
    emailc=df.groupby("User_Email")["User_Email"].transform("count"),
    Conversion=lambda dfa: dfa.apply(lambda r: 0 if r["emailc"]==1 or r["License_Type"]=="trial" else 1, axis=1)
).drop("emailc", axis=1)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
output
Date
2017-Apr 0
2017-Feb 1
2017-Jan 0
2017-Jul 0
2017-Mar 1
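The same rule (an email that occurs more than once and a 'full' license) can also be written without a row-wise apply. A minimal sketch, assuming made-up short email values in place of the long IDs and using duplicated(keep=False) to flag every repeated email:

```python
import pandas as pd

# Hypothetical data: email "b" occurs twice, all others once.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2017-01-01", "2017-02-12", "2017-03-13", "2017-03-13", "2017-04-08"]
    ),
    "User_Email": ["a", "b", "b", "c", "d"],
    "License_Type": ["trial", "full", "full", "full", "full"],
})

# True for every row whose email appears more than once.
repeated = df["User_Email"].duplicated(keep=False)

# A conversion is a repeated email whose license type is 'full'.
df["Conversion"] = (repeated & df["License_Type"].eq("full")).astype(int)

# Monthly totals, grouped by calendar month.
monthly = df.groupby(df["Date"].dt.to_period("M"))["Conversion"].sum()
print(monthly)
```

Grouping by a monthly period keeps the result in chronological order, whereas the '%Y-%b' strings sort alphabetically.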

Calculating moving median within group

I want to compute a rolling median on the price column looking 4 days back, with the data grouped by date. So basically I want to take the prices for a given day and all prices from the 4 preceding days and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to add one row per each index value and by median definition, I am not able to somehow merge these rows to produce one result per row.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just deleted 3 first values and then just printed price value.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today and the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort it by date with sort_values, and ensure the date column is datetime.
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
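To see why the last row per day is the one to keep, here is a small sketch on made-up prices (the frame prices and the column median_5d are illustrative names, not from the question). With a '5D' window, the last row of each date sees every price from that date and the four calendar days before it:

```python
import pandas as pd

# Hypothetical prices: two rows share the date 2020-01-03.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-07", "2020-01-03"]),
    "price": [20.0, 10.0, 40.0, 30.0],
})

# Time-based rolling windows need the date column sorted.
prices = prices.sort_values("date", kind="stable")

# Median over the current day and the four calendar days before it.
prices["median_5d"] = prices.rolling("5D", on="date")["price"].median()

# Keep only the last row of each date, which covers all of that day's prices.
per_day = (
    prices.drop_duplicates("date", keep="last")[["date", "median_5d"]]
    .reset_index(drop=True)
)
print(per_day)
```

For 2020-01-07 the window starts after 2020-01-02, so the price from 2020-01-01 drops out and the median is taken over 20, 30 and 40.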
This is a step by step process. There are probably more efficient methods of getting what you want. Note, if you have time information for your dates, you would need to drop that information before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with your data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days into a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Pad with dummy zeros so the new lists match the length of the dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date and a median value for the past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)

Python performance improvements and coding style

Question
Let's assume the following sparse table is given indicating the listing of a security on an index.
identifier from thru
AAPL 1964-03-31 --
ABT 1999-01-03 2003-12-31
ABT 2005-12-31 --
AEP 1992-01-15 2017-08-31
KO 2014-12-31 --
ABT for example is on index from 1999-01-03 to 2003-12-31 and again from 2005-12-31 until today (-- indicates today). During times in between it is not listed on index.
How can I efficiently transform this sparse table to a dense table of the following form
date AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
... ... ... ... ...
1999-01-03 1 1 1 0
1999-01-04 1 1 1 0
... ... ... ... ...
2003-12-31 1 1 1 0
2004-01-01 1 0 1 0
... ... ... ... ...
2017-09-04 1 1 0 1
In the section My Solution you will find my solution to the problem. Unfortunately, the code seems to perform very badly: it took about 22 seconds to process 1648 entries.
As I am new to Python, I wondered how to program problems like these efficiently.
I do not expect anyone to provide me with a solution to my problem (unless you wish to do so). My primary goal is to understand how to solve problems like these efficiently in Python. I used the functionality of pandas to match the respective entries. Should I have used numpy and indexing instead? Should I have used other toolboxes? How can I gain performance improvements?
Please find my approach to this problem in the section below (if it is of interest to you).
Thank you very much for your help
My Solution
I have tried to resolve the problem by looping through every row entry in the first table. During every single loop, I specify a Boolean matrix for the specific from-thru-interval with all elements set to True. This matrix is appended to a list. At the end, I pd.concat the list and unstack and reindex the resulting DataFrame.
import pandas as pd
import numpy as np
def get_ts_data(data, start_date, end_date, attribute=None, identifier=None, frequency=None):
    """
    Transform sparse table to dense table.

    Parameters
    ----------
    data: pd.DataFrame
        sparse table with minimal column specification ['identifier', 'from', 'thru']
    start_date: pd.Timestamp, str
        start date of the dense matrix
    end_date: pd.Timestamp, str
        end date of the dense matrix
    attribute: str
        column name of the value of the dense matrix.
    identifier: str
        column name of the identifier
    frequency: str
        frequency of the dense matrix
    kwargs:
        Allows to overwrite naming of 'from' and 'thru' variables.
        e.g.
        {'from': 'start', 'thru': 'end'}

    Returns
    -------
    """
    if attribute is None:
        attribute = ['on_index']
    elif not isinstance(attribute, list):
        attribute = [attribute]
    if identifier is None:
        identifier = ['identifier']
    elif not isinstance(identifier, list):
        identifier = [identifier]
    if frequency is None:
        frequency = 'B'
    # copy data for security reasons
    data_mod = data.copy()
    data_mod['on_index'] = True
    # specify start date and check type
    if not isinstance(start_date, pd.Timestamp):
        start_date = pd.Timestamp(start_date)
    # specify end date and check type
    if not isinstance(end_date, pd.Timestamp):
        end_date = pd.Timestamp(end_date)
    # specify output date range
    date_range = pd.date_range(start_date, end_date, freq=frequency)
    # overwrite null indicating that it is valid until today
    missing = data_mod['thru'].isnull()
    data_mod.loc[missing, 'thru'] = data_mod.loc[missing, 'from'].apply(lambda d: max(d, end_date))
    # preallocate frms
    frms = []
    # add dataframe to frms with time specific entries
    for index, row in data_mod.iterrows():
        # date range index
        d_range = pd.date_range(row['from'], row['thru'], freq=frequency)
        # Multi index with date and identifier
        d_index = pd.MultiIndex.from_product([d_range] + [[x] for x in row[identifier]], names=['date'] + identifier)
        # add DataFrame with repeated values to list
        frms.append(pd.DataFrame(data=np.repeat(row[attribute].values, d_index.size), index=d_index, columns=attribute))
    out_frame = pd.concat(frms)
    out_frame = out_frame.unstack(identifier)
    out_frame = out_frame.reindex(date_range)
    return out_frame
if __name__ == "__main__":
    data = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
                         'from': [pd.Timestamp('1964-03-31'),
                                  pd.Timestamp('1999-01-03'),
                                  pd.Timestamp('2005-12-31'),
                                  pd.Timestamp('1992-01-15'),
                                  pd.Timestamp('2014-12-31')],
                         'thru': [np.nan,
                                  pd.Timestamp('2003-12-31'),
                                  np.nan,
                                  pd.Timestamp('2017-08-31'),
                                  np.nan]
                         })
    transformed_data = get_ts_data(data, start_date='1964-03-31', end_date='2017-09-04', attribute='on_index', identifier='identifier', frequency='B')
    print(transformed_data)
import numpy as np
import pandas as pd
# Ensure dates are Pandas timestamps.
df['from'] = pd.DatetimeIndex(df['from'])
df['thru'] = pd.DatetimeIndex(df['thru'].replace('--', np.nan))
# Get sorted list of all unique dates and create index for full range.
dates = sorted(set(df['from'].tolist() + df['thru'].dropna().tolist()))
# pd.date_range replaces the DatetimeIndex(start=..., end=...) constructor removed in pandas 1.0.
dti = pd.date_range(start=dates[0], end=dates[-1], freq='B')
# Create new target dataframe based on symbols and full date range. Initialize to zero.
df2 = pd.DataFrame(0, columns=df['identifier'].unique(), index=dti)
# Find all active symbols and set their symbols' values to one from their respective `from` dates.
for _, row in df[df['thru'].isnull()].iterrows():
    df2.loc[df2.index >= row['from'], row['identifier']] = 1
# Find all other symbols and set their symbols' values to one between their respective `from` and `thru` dates.
for _, row in df[df['thru'].notnull()].iterrows():
    df2.loc[(df2.index >= row['from']) & (df2.index <= row['thru']), row['identifier']] = 1
>>> df2.head(3)
AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
1964-04-02 1 0 0 0
>>> df2.tail(3)
AAPL ABT AEP KO
2017-08-29 1 1 1 1
2017-08-30 1 1 1 1
2017-08-31 1 1 1 1
>>> df2.loc[:'2004-01-02', 'ABT'].tail()
2003-12-29 1
2003-12-30 1
2003-12-31 1
2004-01-01 0
2004-01-02 0
Freq: B, Name: ABT, dtype: int64
>>> df2.loc['2005-12-30':, 'ABT'].head(3)
2005-12-30 0
2006-01-02 1
2006-01-03 1
Freq: B, Name: ABT, dtype: int64
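On the question of whether numpy and indexing could help: a loop-free variant is possible with array broadcasting. The following is only a sketch, not the approach of either post above. It assumes the sample table from the question, fills open-ended thru values with the last date of the output range, and merges repeated identifiers (such as ABT) with a logical OR; the names listing, on_index and dense are illustrative.

```python
import numpy as np
import pandas as pd

# Sample table from the question; NaT stands for "listed until today".
listing = pd.DataFrame({
    "identifier": ["AAPL", "ABT", "ABT", "AEP", "KO"],
    "from": pd.to_datetime(
        ["1964-03-31", "1999-01-03", "2005-12-31", "1992-01-15", "2014-12-31"]
    ),
    "thru": pd.to_datetime([None, "2003-12-31", None, "2017-08-31", None]),
})
dates = pd.date_range("1964-03-31", "2017-09-04", freq="B")

# Treat an open-ended listing as running to the end of the output range.
thru = listing["thru"].fillna(dates[-1])

# Compare every date (rows) with every interval (columns) in one step.
d = dates.values[:, None]
on_index = (listing["from"].values <= d) & (d <= thru.values)

# One column per interval, then OR together intervals of the same identifier.
dense = (
    pd.DataFrame(on_index, index=dates, columns=listing["identifier"])
    .T.groupby(level=0).any().T
    .astype(int)
)
print(dense.tail(3))
```

The intermediate boolean array has one cell per (date, interval) pair, so memory grows with the product of the two; for very long ranges or many intervals, processing the dates in chunks may be preferable.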

Python groupby dates but with if statement

My data looks like this:
id Open Close
1 1/1/15 1/1/15
2 1/1/15 2/1/15
3 3/1/15 4/1/15
I need to create a dataframe that shows the number of open cases on any day, so the result of the data above would look like:
Date #Open
1/1/15 1
2/1/15 0
3/1/15 1
Any ideas?
This method creates an index of all days between the first case open and the max of the last case opened or closed. It then iterates through each of these dates and filters the dataframe for the relevant date, checking the resulting size.
df['Open'] = pd.to_datetime(df.Open)
df['Close'] = pd.to_datetime(df.Close)
idx = pd.date_range(df.Open.min(), max(df.Open.max(), df.Close.max()))
cases = pd.DataFrame([len(df[(date >= df.Open) & (date < df.Close)])
                      for date in idx],
                     index=idx, columns=['case_count'])
>>> cases.head(3)
case_count
2015-01-01 1
2015-01-02 1
2015-01-03 1
>>> cases.tail(3)
case_count
2015-03-30 1
2015-03-31 1
2015-04-01 0
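The per-date filtering above can also be replaced by a running total: add one for each case opened on a day, subtract one for each case closed, and take the cumulative sum. This is a sketch under the same convention as the code above (a case counts as open from its Open date up to, but not including, its Close date), with the question's dates assumed to be month-first:

```python
import pandas as pd

# Data from the question, parsed as month/day/year.
cases = pd.DataFrame({
    "Open": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-03-01"]),
    "Close": pd.to_datetime(["2015-01-01", "2015-02-01", "2015-04-01"]),
})

# Every calendar day from the first open to the last close.
idx = pd.date_range(cases["Open"].min(), cases["Close"].max())

# Number of cases opened and closed on each day (0 where none).
opened = cases["Open"].value_counts().reindex(idx, fill_value=0)
closed = cases["Close"].value_counts().reindex(idx, fill_value=0)

# Running balance of opened minus closed cases.
open_count = (opened - closed).cumsum()
print(open_count.loc[["2015-01-01", "2015-02-01", "2015-03-01"]])
```

This touches each case twice instead of scanning the whole dataframe once per date.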
