Retrieving a value with multiple conditions - pandas - python

Sample data:
Dataframe 1:
cusip_id  trd_exctn_dt  time_to_maturity
00077AA2  2015-05-09    1.20 years
00077TBO  2015-05-06    3.08 years
Dataframe 2:
Index       SVENY01  SVENY02  SVENY03  SVENY04
2015-05-09  1.35467  1.23367  1.52467  1.89467
2015-05-08  1.65467  1.87967  1.43251  1.98765
2015-05-07  1.35467  1.76567  1.90271  1.43521
2015-05-06  1.34467  1.35417  1.67737  1.11167
Desired output:
I want to exactly match the 'trd_exctn_dt' in df1 with the date in the index of df2, while at the same time matching the 'time_to_maturity' in df1 with the nearest SVENYXX in df2 (rounded up, e.g. 1.20 years would be equivalent to SVENY02). For example, for cusip_id 00077AA2, the trd_exctn_dt is 2015-05-09 and the time_to_maturity is 1.20 years, so I want to obtain the corresponding value in df2 at the date 2015-05-09 in the column SVENY02.
I want to repeat this for several cusip_ids, how would I achieve this?
Any help would be appreciated!

Here is my solution code:
import pandas as pd

SVENYXX = []
for i in range(df1['cusip_id'].shape[0]):
    cusip_id = df1['cusip_id'][i]
    trd_exctn_date = df1['trd_exctn_dt'][i]
    maturity_time = df1['time_to_maturity'][i]
    # Row of df2 for this trade's execution date
    svenyVals = df2.loc[trd_exctn_date]
    # Value in that row with the smallest absolute difference
    # from the time to maturity (positional lookup via iloc)
    closestSvenyVal = svenyVals.iloc[(svenyVals - maturity_time).abs().argsort().iloc[0]]
    SVENYXX.append(closestSvenyVal)
where df1 is Dataframe 1, df2 is Dataframe 2, and SVENYXX is the list of the closest SVENY values for each cusip_id.
I loop through all the cusip_ids and obtain the corresponding trd_exctn_dt and time_to_maturity values. With the extracted data I find the corresponding row in Dataframe 2, and then, by finding the smallest difference between svenyVals and time_to_maturity, I append that value to the SVENYXX list.
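Note that this picks the column whose rate value is numerically closest to time_to_maturity; the "rounded up" rule described in the question instead maps the maturity directly to a column name. A minimal sketch of that ceil-based lookup, assuming time_to_maturity is numeric (1.20 rather than the string '1.20 years'), the dates in df1 match the type of df2's index, and sveny_value is a hypothetical output column name:

import numpy as np

# Hypothetical helper implementing the "round up" rule: 1.20 -> 'SVENY02'
def sveny_column(years):
    return f"SVENY{int(np.ceil(years)):02d}"

# One value per trade: exact date match on df2's index,
# rounded-up maturity match on df2's columns
df1['sveny_value'] = [
    df2.at[row.trd_exctn_dt, sveny_column(row.time_to_maturity)]
    for row in df1.itertuples()
]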

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but found none that point me in the right direction; hopefully someone on here can help. I have a stock price data set with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month are the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. Unfortunately, I have no code, since I have looked at for loops, groupby, etc. but can't seem to figure this one out.
You might want to split the date into month and year and to apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999

Python dataframe: create column with running formula based on values in one row prior

I have a dataframe df with 10 years of daily stock market data, having columns Date, Open, Close.
I want to calculate the change between any two consecutive values (in Close) as a ratio of the previous value.
For example, in the sample below, the first entry (-0.0002) for Interday_return is calculated as (43.06 - 43.07) / 43.07.
Similarly, the next value 0.0046 is calculated as (43.26 - 43.06) / 43.06.
And so on.
I am able to create a new column Interday_Close_change, which is basically the difference between each two consecutive rows, using the code below (i.e. finding the numerator of the above-mentioned fraction). However, I don't know how to divide each element in Interday_Close_change by the value in the preceding row to get a new column Interday_return.
df = pd.DataFrame(data, columns=columns)
df['Interday_Close_change'] = df['Close'].astype(float).diff()
df.fillna('', inplace=True)
This should do it:
df['Interday_Close_change'] = df['Close'].pct_change().fillna('')
Sample input:
Date Open Close
0 1/2/2018 42.54 43.07
1 1/3/2018 43.13 43.06
2 1/4/2018 43.14 43.26
3 1/5/2018 43.36 43.75
Sample output:
Date Open Close Interday_Close_change
0 1/2/2018 42.54 43.07
1 1/3/2018 43.13 43.06 -0.000232
2 1/4/2018 43.14 43.26 0.004645
3 1/5/2018 43.36 43.75 0.011327
Docs on pct_change.
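For reference, pct_change is equivalent to dividing the diff by the shifted series, which is exactly the missing division step from the question; a one-line sketch using the question's Interday_return column name:

# Manual equivalent of pct_change: numerator from .diff(),
# denominator (the previous row's Close) from .shift()
df['Interday_return'] = df['Close'].diff() / df['Close'].shift(1)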

Python pandas column filtering substring

I have a dataframe in python3 using pandas which has a column containing dates as strings.
This is a subset of the column:
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"
I would like to remove the rows that have the same month and day twice and keep the one with the newest year.
This would be what I would expect as a result from this subset
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
The last two rows were removed: 2020-04-08 because 04-08 already appears in 2021, and the second 2020-04-12 because it is a duplicate.
I thought of doing this with an apply and lambda but my real dataframe has hundreds of rows and tens of columns so it would not be efficient. Is there a more efficient way of doing this?
There are a couple of ways you can do this. One of them would be to extract the year, sort by it, and drop rows with duplicate month-day pairs.
# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
df.sort_values('year', inplace=True)
print(df)
This is what it would look like after separation and sorting:
ColA year mo-day
2 2020-04-12 2020 04-12
3 2020-04-08 2020 04-08
4 2020-04-12 2020 04-12
0 2021-04-03 2021 04-03
1 2021-04-08 2021 04-08
Afterwards, we can simply drop the duplicates and remove the additional columns:
# drop duplicate month-day pairs
df.drop_duplicates('mo-day', keep='first', inplace=True)
# get rid of the two columns
df.drop(['year','mo-day'], axis=1, inplace=True)
# since we dropped duplicates, reset the index
df.reset_index(drop=True, inplace=True)
print(df)
Final result:
ColA
0 2020-04-12
1 2020-04-08
2 2021-04-03
This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.
I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:
df = pd.DataFrame({'colA':
                   ["2021-04-03",
                    "2021-04-08",
                    "2020-04-12",
                    "2020-04-08",
                    "2020-04-12"]})
df['colA'] = df.colA.apply(pd.to_datetime)
Then you can group by the (day, month) and keep the highest value like so:
df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()
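That said, the row-wise apply can be avoided with the vectorized .dt accessor; a sketch over the same df (with colA already converted to datetimes as above, out being the filtered result):

# Vectorized month-day key via the .dt accessor (no per-row apply)
key = df['colA'].dt.strftime('%m-%d')

# Order rows newest-first, then keep the first row seen per month-day
order = df['colA'].sort_values(ascending=False).index
out = df.loc[order][~key.loc[order].duplicated()]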

How to sum over a Pandas dataframe conditionally

I'm looking for an efficient way (without looping) to add a column to a dataframe, containing a sum over a column of that same dataframe, filtered by some values in the row. Example:
Dataframe:
ClientID Date Orders
123 2020-03-01 23
123 2020-03-05 10
123 2020-03-10 7
456 2020-02-22 3
456 2020-02-25 15
456 2020-02-28 5
...
I want to add a column "orders_last_week" containing the total number of orders for that specific client in the 7 days before the given date.
The Excel equivalent would be something like:
SUMIFS([orders],[ClientID],ClientID,[Date]>=Date-7,[Date]<Date)
So this would be the result:
ClientID Date Orders Orders_Last_Week
123 2020-03-01 23 0
123 2020-03-05 10 23
123 2020-03-10 7 10
456 2020-02-22 3 0
456 2020-02-25 15 3
456 2020-02-28 5 18
...
I can solve this with a loop, but since my dataframe contains >20M records, this is not a feasible solution. Can anyone please help me out?
Much appreciated!
I'll assume your dataframe is named df. I'll also assume that dates aren't repeated for a given ClientID and are in ascending order (if this isn't the case, do a groupby sum and sort the result so that it is).
The gist of my solution, for a given ClientID and Date, is:
Use groupby.transform to split this problem up by ClientID.
Use rolling to check the next 7 rows for dates that are within the 1-week timespan.
In those 7 rows, dates within the timespan are labelled True (=1). Dates that are not are labelled False (=0).
In those 7 rows, multiply the Orders column by the True/False labelling of dates.
Sum the result.
Actually, we use 8 rows, because, e.g., SuMoTuWeThFrSaSu has 8 days.
What makes this hard is that rolling aggregates columns one at a time, and so doesn't obviously allow you to work with multiple columns when aggregating. If it did, you could make a filter using the date column, and use that to sum the orders.
There is a loophole, though: you can use multiple columns if you're happy to smuggle them in via the index!
I use some helper functions. Note a is understood to be a pandas series with 8 rows and values "Orders", with "Date" in the index.
Curious to know what performance is like on your real data.
import pandas as pd

data = {
    'ClientID': {0: 123, 1: 123, 2: 123, 3: 456, 4: 456, 5: 456},
    'Date': {0: '2020-03-01', 1: '2020-03-05', 2: '2020-03-10',
             3: '2020-02-22', 4: '2020-02-25', 5: '2020-02-28'},
    'Orders': {0: 23, 1: 10, 2: 7, 3: 3, 4: 15, 5: 5}
}
df = pd.DataFrame(data)

# Make sure the dates are datetimes
df['Date'] = pd.to_datetime(df['Date'])

# Put them into the index so we can smuggle them through "rolling"
df = df.set_index(['ClientID', 'Date'])

def date(a):
    # get the "Date" index-column from the dataframe
    return a.index.get_level_values('Date')

def previous_week(a):
    # get a column of 0s and 1s identifying the previous week
    # (compared to the date in the last row of a)
    return (date(a) >= date(a)[-1] - pd.DateOffset(days=7)) * (date(a) < date(a)[-1])

def previous_week_order_total(a):
    # compute the order total for the previous week
    return sum(previous_week(a) * a)

def total_last_week(group):
    # for a "ClientID", compute all the "previous week order totals"
    return group.rolling(8, min_periods=1).apply(previous_week_order_total, raw=False)

# Ok, actually compute this
df['Orders_Last_Week'] = df.groupby(['ClientID']).transform(total_last_week)

# Reset the index back so you can have the ClientID and Date columns back
df = df.reset_index()
The above code relies on the fact that the past week encompasses at most 7 rows of data, i.e. the 7 days in a week (although in your example it is actually fewer than 7).
If your time window is something other than a week, you'll need to replace all the references to the length of a week, expressed in terms of the finest division of your timestamps.
For example, if your timestamps are spaced no closer than 1 second and you are interested in a time window of 1 minute (e.g. "Orders_last_minute"), replace pd.DateOffset(days=7) with pd.DateOffset(seconds=60), and group.rolling(8, ...) with group.rolling(61, ...).
Obviously, this code is a bit pessimistic: for each row, it always looks at 61 rows, in this case. Unfortunately rolling does not offer a suitable variable window size function. I suspect that in some cases a python loop that takes advantage of the fact that the dataframe is sorted by date might run faster than this partly-vectorized solution.
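One caveat to that last point: recent pandas versions do accept time-based offset windows in rolling (e.g. '7D'), which makes the window size vary with the data. A sketch of that approach on the same sample data, assuming a pandas version where closed='left' is supported for offset windows (it keeps dates in [Date - 7 days, Date), matching the SUMIFS above):

import pandas as pd

df = pd.DataFrame({
    'ClientID': [123, 123, 123, 456, 456, 456],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-05', '2020-03-10',
                            '2020-02-22', '2020-02-25', '2020-02-28']),
    'Orders': [23, 10, 7, 3, 15, 5],
})

# Sort so the grouped rolling result lines up positionally with df
df = df.sort_values(['ClientID', 'Date'])

df['Orders_Last_Week'] = (
    df.groupby('ClientID')
      .rolling('7D', on='Date', closed='left')['Orders']
      .sum()
      .fillna(0)  # the first row of each client has an empty window
      .to_numpy()
)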

Slice, combine, and map fiscal year dates to calendar year dates to new column

I have the following pandas data frame:
Shortcut_Dimension_4_Code Stage_Code
10225003 2
8225003 1
8225004 3
8225005 4
It is part of a much larger dataset that I need to be able to filter by month and year. I need to pull the fiscal year from the first two digits for values larger than 9999999 in the Shortcut_Dimension_4_Code column, and the first digit for values less than or equal to 9999999. That value needs to be added to "20" to produce a year i.e. "20" + "8" = 2008 | "20" + "10" = 2010.
That year "2008, 2010" needs to be combined with the stage code value (1-12) to produce a month/year, i.e. 02/2010.
The date 02/2010 then needs to converted from fiscal year date to calendar year date, i.e. Fiscal Year Date : 02/2010 = Calendar Year date: 08/2009. The resulting date needs to be presented in a new column. The resulting df would end up looking like this:
Shortcut_Dimension_4_Code Stage_Code Date
10225003 2 08/2009
8225003 1 07/2007
8225004 3 09/2007
8225005 4 10/2007
I am new to pandas and python and could use some help. I am beginning with this:
Shortcut_Dimension_4_Code Stage_Code CY_Month Fiscal_Year
0 10225003 2 8.0 10
1 8225003 1 7.0 82
2 8225003 1 7.0 82
3 8225003 1 7.0 82
4 8225003 1 7.0 82
I used .map and .str methods to produce this df, but have not been able to figure out how to get the fiscal years right for FY 2008-2009.
In the code below, I'll assume Shortcut_Dimension_4_Code is an integer. If it's a string, you can convert it, or slice it like this: df['Shortcut_Dimension_4_Code'].str[:-6]. More explanations in comments alongside the code.
That should work as long as you don't have to deal with empty values.
import pandas as pd
import numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta

fiscal_month_offset = 6

input_df = pd.DataFrame(
    [[10225003, 2],
     [8225003, 1],
     [8225004, 3],
     [8225005, 4]],
    columns=['Shortcut_Dimension_4_Code', 'Stage_Code'])

# make a copy of the input dataframe to avoid modifying it
df = input_df.copy()

# numpy will help us with numeric operations on large collections
df['fiscal_year'] = 2000 + np.floor_divide(df['Shortcut_Dimension_4_Code'], 1000000)

# loop with `apply` to create `date` objects from the available columns;
# day is a required field in date, so we'll just use 1
df['fiscal_date'] = df.apply(lambda row: date(row['fiscal_year'], row['Stage_Code'], 1), axis=1)
df['calendar_date'] = df['fiscal_date'] - relativedelta(months=fiscal_month_offset)

# by default python dates will be saved as Object type in pandas (verify with `df.info()`);
# to use the clever things pandas can do with dates we need to convert it
df['calendar_date'] = pd.to_datetime(df['calendar_date'])

# I would just keep the date as datetime type so I could access year and month,
# but to create the same representation as in the question, let's format it as a string
df['Date'] = df['calendar_date'].dt.strftime('%m/%Y')

# copy the important columns into the output dataframe
output_df = df[['Shortcut_Dimension_4_Code', 'Stage_Code', 'Date']].copy()
print(output_df)
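Running this on the sample input should print something like:

   Shortcut_Dimension_4_Code  Stage_Code     Date
0                   10225003           2  08/2009
1                    8225003           1  07/2007
2                    8225004           3  09/2007
3                    8225005           4  10/2007

which matches the desired output in the question.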
