I have a dataset whose first three columns are Basket ID (a unique identifier), Sale amount (in dollars), and the date of the transaction. For each row I want to calculate the following columns, and I would like to do it in Python:
Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)
Basket  Sale  Date        PrevSale  SaleCount  MeanToDate  MaxToDate
88      $15   3/01/2012             1
88      $30   11/02/2012  $15       2          $23         $30
88      $16   16/08/2012  $30       3          $20         $30
123     $90   18/06/2012            1
477     $77   19/08/2012            1
477     $57   11/12/2012  $77       2          $67         $77
566     $90   6/07/2012             1
I'm pretty new to Python, and I'm struggling to find a clean way to do this. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. But I have no clue how to get MeanToDate and MaxToDate efficiently, short of looping... any ideas?
This should do the trick:
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),    # cumulative mean
            'MaxToDate': se.cummax(),            # cumulative max
            'SaleCount': expanding_count(se),    # cumulative count
            'Sale': se,                          # simple copy
            'PrevSale': se.shift(1)              # previous sale
        },
        axis=1
    )

# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating here. Note that pandas.stats.moments was deprecated and later removed, so the snippet above only works on old pandas releases; the same approach rewritten for modern pandas (tested on 0.24) follows:
import pandas as pd
pd.__version__  # u'0.24.2'
from pandas import concat

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),    # cumulative mean
            'MaxToDate': se.expanding().max(),      # cumulative max
            'SaleCount': se.expanding().count(),    # cumulative count
            'Sale': se,                             # simple copy
            'PrevSale': se.shift(1)                 # previous sale
        },
        axis=1
    )

###########################
from datetime import datetime
df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})
#########
new_df = df.groupby('Basket').apply(handler).reset_index()
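If you prefer to avoid groupby().apply() entirely, here is a minimal sketch using per-group cumulative operations instead (assuming the same df with Basket, Sale and Date columns as above):
df = df.sort_values(['Basket', 'Date'])
g = df.groupby('Basket')['Sale']
df['PrevSale'] = g.shift(1)                      # previous sale within the same basket
df['SaleCount'] = g.cumcount() + 1               # 1-based running count
df['MeanToDate'] = g.cumsum() / df['SaleCount']  # running mean = running sum / running count
df['MaxToDate'] = g.cummax()                     # running max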
I have a three-column dataframe as follows. For every fund I want to calculate, for each day, the three-month return, so I need to find the date with recorded NAV data three months earlier. Should I use the max() function with filter() to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in excel, I know I could use the following functions to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But in Python I don't know what to do; I only started learning it three days ago. Please help me.
This picture shows what I want, if it were in Excel. The yellow area is the original data, the white part is the intermediate calculation I need, and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use pct_change(periods=7) to get the same results as in this picture. But here is the tricky part: the row 7 positions earlier is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I first need to check whether the data used for the division exists.
What you need is an implementation of the maximum over a sliding window (in your example, 1 week / 7 days).
I recreated your example as follows (to build the data frame you have):
import pandas as pd
import datetime
from random import randint

df = pd.DataFrame(columns=["fund code", "date", "NAV"])
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
# note: DataFrame.append was removed in pandas 2.0; on newer versions build a
# list of dicts and call pd.DataFrame / pd.concat instead
for i in range(10):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
This will look like your example, with non-continuous dates and two different funds.
The sliding-window maximum (for a variable window length in days) looks like this:
from collections import deque  # deque lives in collections, not in the queue module

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back: they can never be the max again
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop values from the front that have fallen out of the window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the time frame you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        # start a fresh window whenever we move on to a new fund
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The results will look like this, with an added max column. This restarts the queue whenever the fund code changes, so it assumes each fund's rows are stored contiguously; you could instead keep a separate max_queue per fund.
Using a max queue to keep track of only the maximum inside the window gives the correct O(n) complexity for this problem, which matters if you are dealing with huge datasets and especially bigger date ranges (months instead of a week).
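As a sketch of an alternative, pandas can also compute a time-based rolling maximum per fund directly, without a hand-written queue (assuming the df built above, with columns "fund code", "date" and "NAV"):
# make sure dtypes are right (appending onto an empty frame leaves object columns)
df['date'] = pd.to_datetime(df['date'])
df['NAV'] = pd.to_numeric(df['NAV'])
df = df.sort_values(['fund code', 'date'])
rolling_max = (
    df.set_index('date')
      .groupby('fund code')['NAV']
      .rolling('7D')            # time-based window: all rows within the last 7 days
      .max()
)
# rolling_max is indexed by (fund code, date); since df is sorted the same way,
# the values can be assigned back positionally
df['max'] = rolling_max.values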
I have 2 DataFrames: one contains stock tickers and a maximum/minimum price range, along with other columns.
The other DataFrame has dates as the index and is grouped by ticker, with various metrics like open, close, high, low, etc. Now I want to count the days on which, for a given stock, the close price was higher than the minimum price.
This is where I am stuck: for example, I want to find how many days AMZN was trading below the period max price.
In other words, I want a count of days from the second dataframe based on values from the first dataframe: the number of days the closing price was lower/higher than the period max/min price.
I have added the code to reproduce the DataFrames.
Please check the screenshots.
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
import yfinance as yf
start=datetime.datetime.today()-relativedelta(years=2)
end=datetime.datetime.today()
us_stock_list='FB AMZN BABA'
data_metric = yf.download(us_stock_list, start=start, end=end,group_by='column',auto_adjust=True)
data_ticker= yf.download(us_stock_list, start=start, end=end,group_by='ticker',auto_adjust=True)
stock_list=[stock for stock in data_ticker.stack()]
# max_price
max_values=pd.DataFrame(data_ticker.max().unstack()['High'])
# min_price
min_values=pd.DataFrame(data_ticker.min().unstack()['Low'])
# latest_price
latest_day=pd.DataFrame(data_ticker.tail(1).unstack())
latest_day=latest_day.unstack().unstack().unstack().reset_index()
# latest_day=latest_day.unstack().reset_index()
latest_day=latest_day.drop(columns=['level_0','Date'])
latest_day.set_index('level_3',inplace=True)
latest_day.rename(columns={0:'Values'},inplace=True)
latest_day=latest_day.groupby(by=['level_3','level_2']).max().unstack()
latest_day.columns=[ '_'.join(x) for x in latest_day.columns ]
latest_day=latest_day.join(max_values,how='inner')
latest_day=latest_day.join(min_values,how='inner')
latest_day.rename(columns={'High':'Period_High_Max','Low':'Period_Low_Min'},inplace=True)
close_price_data=pd.DataFrame(data_metric['Close'].unstack().reset_index())
close_price_data= close_price_data.rename(columns={'level_0':'Stock',0:'Close_price'})
close_price_data.set_index('Stock',inplace=True)
Use this to reproduce:
{"Values_Close":{"AMZN":2286.0400390625,"BABA":194.4799957275,"FB":202.2700042725},"Values_High":{"AMZN":2362.4399414062,"BABA":197.3800048828,"FB":207.2799987793},"Values_Low":{"AMZN":2258.1899414062,"BABA":192.8600006104,"FB":199.0500030518},"Values_Open":{"AMZN":2336.8000488281,"BABA":195.75,"FB":201.6000061035},"Values_Volume":{"AMZN":9754900.0,"BABA":22268800.0,"FB":30399600.0},"Period_High_Max":{"AMZN":2475.0,"BABA":231.1399993896,"FB":224.1999969482},"Period_Low_Min":{"AMZN":1307.0,"BABA":129.7700042725,"FB":123.0199966431},"%_Position":{"AMZN":0.8382192115,"BABA":0.6383544892,"FB":0.7832576338}}
{"Stock":{
"0":"AMZN",
"1":"AMZN",
"2":"AMZN",
"3":"AMZN",
"4":"AMZN",
"5":"AMZN",
"6":"AMZN",
"7":"AMZN",
"8":"AMZN",
"9":"AMZN",
"10":"AMZN",
"11":"AMZN",
"12":"AMZN",
"13":"AMZN",
"14":"AMZN",
"15":"AMZN",
"16":"AMZN",
"17":"AMZN",
"18":"AMZN",
"19":"AMZN"},
"Date":{
"0":1525305600000,
"1":1525392000000,
"2":1525651200000,
"3":1525737600000,
"4":1525824000000,
"5":1525910400000,
"6":1525996800000,
"7":1526256000000,
"8":1526342400000,
"9":1526428800000,
"10":1526515200000,
"11":1526601600000,
"12":1526860800000,
"13":1526947200000,
"14":1527033600000,
"15":1527120000000,
"16":1527206400000,
"17":1527552000000,
"18":1527638400000,
"19":1527724800000 },
"Close_price":{
"0":1572.0799560547,
"1":1580.9499511719,
"2":1600.1400146484,
"3":1592.3900146484,
"4":1608.0,
"5":1609.0799560547,
"6":1602.9100341797,
"7":1601.5400390625,
"8":1576.1199951172,
"9":1587.2800292969,
"10":1581.7600097656,
"11":1574.3699951172,
"12":1585.4599609375,
"13":1581.4000244141,
"14":1601.8599853516,
"15":1603.0699462891,
"16":1610.1500244141,
"17":1612.8699951172,
"18":1624.8900146484,
"19":1629.6199951172}}
Do a merge between both dataframes, group by company (index level 0), and apply a custom function:
df_merge = close_price_data.merge(
    latest_day[['Period_High_Max', 'Period_Low_Min']],
    left_index=True,
    right_index=True)

def fun(df):
    d = {}
    d['days_above_min'] = (df.Close_price > df.Period_Low_Min).sum()
    d['days_below_max'] = (df.Close_price < df.Period_High_Max).sum()
    return pd.Series(d)
df_merge.groupby(level=0).apply(fun)
Period_Low_Min and Period_High_Max are the min and max respectively, so all closing prices will fall in that range; if this is not what you are trying to accomplish, let me know.
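As a sketch of a vectorized alternative (same df_merge as above, no apply), you could also do:
counts = (
    df_merge.assign(
        days_above_min=df_merge['Close_price'] > df_merge['Period_Low_Min'],
        days_below_max=df_merge['Close_price'] < df_merge['Period_High_Max'],
    )
    .groupby(level=0)[['days_above_min', 'days_below_max']]
    .sum()   # True counts as 1, so the sum is the number of days
)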
I want to append an expense df to a revenue df but can't properly do so. Can anyone offer how I may do this?
'''
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df = pd.read_html(url)
revenue = pd.concat(df[0:1])  # the revenue dataframe obj
revenue = revenue.dropna(axis='columns')  # drop NaN columns
header = revenue.iloc[:0]  # revenue df header row
expense = pd.concat(df[1:2])  # the expense dataframe obj
expense = expense.dropna(axis='columns')  # drop NaN columns
statement = revenue.append(expense)  # results in a dataframe with an added column (Unnamed: 0)
revenue = pd.concat(df[0:1]) has these columns:
    Fiscal year is January-December. All values CAD millions. | 2015 | 2016 | 2017 | 2018 | 2019
expense = pd.concat(df[1:2]) has these columns:
    Unnamed: 0 | 2015 | 2016 | 2017 | 2018 | 2019
'''
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
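For instance, a minimal sketch with the column names from the question (the long revenue header label is assumed to be exactly what read_html returns):
revenue = revenue.rename(columns={'Fiscal year is January-December. All values CAD millions.': 'LineItem'})
expense = expense.rename(columns={'Unnamed: 0': 'LineItem'})
statement = pd.concat([revenue, expense], ignore_index=True)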
I managed to append the dataframes with the following code. Thanks, David, for putting me on the right track. I admit this is not the best way to do it, because at run time I don't know the value of the text to rename, and I've hard-coded it here. Ideally it would be best to reference a placeholder at df.iloc[:0, 0] instead, but I'm having a tough time getting that to work.
df = pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense, ignore_index=True)
# note: DataFrame.append was removed in pandas 2.0; on newer versions use
# statement = pd.concat([revenue, expense], ignore_index=True)
Using df = pd.read_html(url), scraping the MarketWatch financials pages returns a list of DataFrames. The function below returns a single dataframe containing all balance-sheet elements. The same code applies to quarterly and annual income and cash-flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    # count the tables whose first column is unnamed (these hold the remaining line items)
    count = sum([1 for Listitem in df if 'Unnamed: 0' in Listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        df_name = pd.concat(df[rowidx + 1:rowidx + 2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = statement.append(df_name, ignore_index=True)
    return statement
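A hypothetical call, reusing the URL pattern from the question (the exact balance-sheet path is an assumption):
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/' + symbol + '/financials/balance-sheet'  # assumed path
balance_sheet = getBalanceSheet(url)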
I have been attempting to solve a problem for hours and am stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer (ISLAT) ordered again within a 5-day period, and he did it twice.
I would like to get the output in the following format:
Required Output
customerid  initial_order_id  initial_order_date  nextorderid  nextorderdate  daysbetween
ISLAT       10315             1996-09-26          10318        1996-10-01     5
ISLAT       10318             1996-10-01          10321        1996-10-03     2
First, to be able to count the difference in days, convert the orderdate column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    # keep rows whose next order (within the same customer) is at most 5 days later
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
It is a bit tricky because there can be any number of purchase pairs within a 5-day window. It is a good use case for merge_asof, which allows approximate (but not exact) matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self-join on the date index, but not exact:
    # each purchase gets paired with the customer's most recent earlier purchase
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'), allow_exact_matches=False)
    # Compute difference between the two purchase dates
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# merge_asof needs the join key sorted in ascending order
df2 = df.sort_values(by='orderdate')
# Create an index for joining (keep orderdate as a column too)
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days <= 5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
You can create the column 'daysbetween' with sort_values and diff. Then, to get the required layout, join df with a per-customer shift of itself (groupby on customerid, then shift). Finally, query the rows where the condition on 'daysbetween_next' is met:
# assumes orderdate has already been converted with pd.to_datetime
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days
df_final = df.join(df.groupby('customerid').shift(-1),
lsuffix='_initial', rsuffix='_next')\
.drop('daysbetween_initial', axis=1)\
.query('daysbetween_next <= 5 and daysbetween_next >=0')
It's quite simple. Let's write down the requirements one at a time and build on them.
First, I assume the customer has a unique id, since it's not specified. We'll use that id to identify customers.
Second, I assume it does not matter whether the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution could also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, new_df, with all rows where the x-th row has the same user id as the (x-1)-th row (i.e. the previous row).
Now, let's search for purchases within the 5 days by adding that condition to the previous piece of code:
new_df = df[(df["ID"] == df["ID"].shift(1)) & ((df["Date"] - df["Date"].shift(1)).dt.days <= 5)]
This should do the work. I cannot test it right now, so some fixes may be needed; I'll try to test it as soon as I can.
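A runnable version of this idea against the question's frame might look like the sketch below (column names taken from the question; it assumes orderdate is converted to datetime and the frame is sorted by customerid and orderdate):
df['orderdate'] = pd.to_datetime(df['orderdate'])
df = df.sort_values(['customerid', 'orderdate'])
same_customer = df['customerid'] == df['customerid'].shift(1)   # previous row is the same customer
within_5_days = (df['orderdate'] - df['orderdate'].shift(1)).dt.days <= 5
new_df = df[same_customer & within_5_days]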
tl;dr
I need df.dates[iter] - df.dates[initial_fixed] per slice of a dataframe indexed by an item_id, in the fastest way possible (for the sake of learning and improving skills... and deadlines).
I also need to calculate business hours between these same dates, not just straight elapsed time. And I need partial days (4.763 days, for example), not just an integer like .days gives.
Hi,
First, I have a dataframe df
item_id  dates              new_column  ...  other_irrelevant_columns
101      2020-09-10-08-...  FUNCTION         -neglected-
101      2020-09-18-17-...  FUNCTION         -neglected-
101      2020-10-03-11-...  FUNCTION         -neglected-
107      2017-08-dd-hh-...  FUNCTION         -neglected-
107      2017-09-dd-hh-...  FUNCTION         -neglected-
209      2019-01-dd-hh-...  FUNCTION         -neglected-
209      2019-01-dd-hh-...  FUNCTION         -neglected-
209      2019-01-dd-hh-...  FUNCTION         -neglected-
209      2019-01-dd-hh-...  FUNCTION         -neglected-
where the dates column (type: datetime object) is chronological per item_id, so the first instance is the earliest date.
I have over 400,000 rows, and I need to calculate the elapsed time by taking the difference between each datetime and the origin (first) datetime, per item_id. The result should look like this:
item_id dates [new_column = elapsed_time] ... other_irrelevant_columns
101 2020-09-10-08-... [dates[0]-dates[0] = 0 days] -neglected- for plotting
101 2020-09-18-17-... [dates[1]-dates[0] = 8.323 days] -neglected-
101 2020-10-03-11-... [dates[2]-dates[0] = 23.56 days] -neglected-
At the moment I'm stuck using a for loop (the inner expression is vectorized, I think), which calculates the total seconds of the timedelta and converts them to days as a float:
for id in df.item_id:
    df.elapsed_days[df.item_id == id] = ((df.dates[df.item_id == id] - min(df.dates[df.boot_id == id])).dt.total_seconds() / 86400).astype(float)
which is taking forever. Not in the data science spirit. What I'd like to know is a better way to perform this, whether it's using apply() with a lambda or something else; I tried to use digitize and isin() from this guy's article, but I can't fathom how to bin the item_id to make it work.
Second, I am also interested in a similar duration but over business hours only (8am-6pm, no weekends or holidays in Canada), so that the real time the item is active is measured.
Thanks for any help.
You can use join to do that much faster.
First you need to perform the min as you do in your current code:
tmp = df.loc[df['item_id'] == df['boot_id']]            # row filtering, as in your loop
tmp = tmp[['item_id', 'dates']]                         # column filtering (keep the filtered frame)
dateMin = tmp.groupby('item_id', as_index=False).min()  # find the minimal date for each item_id
Then you can do the merge:
# Actual merge
indexed_df = df.set_index('item_id')
indexed_dateMin = dateMin.set_index('item_id')
merged = indexed_df.join(indexed_dateMin, lsuffix='_df', rsuffix='_dateMin')
# Vectorized computation (assign positionally with .values: merged is indexed
# by item_id while df keeps its original index)
df['elapsed_days'] = ((merged['dates_df'] - merged['dates_dateMin']).dt.total_seconds() / 86400).values
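For what it's worth, a minimal loop-free sketch using a groupby transform (assuming the id column really is item_id, as in the frame shown above) could be:
# first date per item_id, broadcast back to every row
origin = df.groupby('item_id')['dates'].transform('min')
df['elapsed_days'] = (df['dates'] - origin).dt.total_seconds() / 86400

# for the business-time part, numpy can count whole business days
# (fractional business hours would still need a custom calculation):
# import numpy as np
# np.busday_count(origin.values.astype('datetime64[D]'),
#                 df['dates'].values.astype('datetime64[D]'))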