I am struggling to work out how to extract data from deeply nested, complex JSON. I have the following code to obtain the JSON.
import requests
import pandas as pd
import json
import pprint
import seaborn as sns
import matplotlib.pyplot as plt
base_url="https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json"
headers={'User-Agent': 'Myheaderdata'}
first_response=requests.get(base_url,headers=headers)
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
Which provides an output showing the JSON and a Pandas DataFrame. The DataFrame has three columns, with the third (facts) containing a lot of nested data.
What I want to understand is how to navigate into that nested structure, to retrieve certain data. For example, I may want to go to the DEI level, or the US GAAP level and retrieve a particular attribute. Let's say DEI > EntityCommonStockSharesOutstanding and obtain the "label", "value" and "FY" details.
When I try to use the get function as follows:
data = []
for response in response_dic:
    data.append({"EntityCommonStockSharesOutstanding": response.get('EntityCommonStockSharesOutstanding')})
new_df = pd.DataFrame(data)
new_df.head()
I end up with the following attribute error:
AttributeError Traceback (most recent call last)
<ipython-input-15-15c1685065f0> in <module>
1 data=[]
2 for response in response_dic:
----> 3 data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
4 base_df=pd.DataFrame(data)
5 base_df.head()
AttributeError: 'str' object has no attribute 'get'
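The error occurs because iterating over a dict iterates over its top-level keys, which are plain strings, so .get() is being called on a string. A quick check of the payload shape (the key names below match the code used further down):
print(list(response_dic.keys()))           # ['cik', 'entityName', 'facts']
print(list(response_dic['facts'].keys()))  # e.g. ['dei', 'us-gaap']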
Use pd.json_normalize:
For example:
entity1 = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
entity2 = response_dic['facts']['dei']['EntityPublicFloat']
df1 = pd.json_normalize(entity1, record_path=['units', 'shares'],
                        meta=['label', 'description'])
df2 = pd.json_normalize(entity2, record_path=['units', 'USD'],
                        meta=['label', 'description'])
>>> df1
end val accn ... frame label description
0 2018-10-31 106299106 0001564590-18-028629 ... CY2018Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
1 2019-02-28 106692030 0001627475-19-000007 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
2 2019-04-30 107160359 0001627475-19-000015 ... CY2019Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
3 2019-07-31 110803709 0001627475-19-000025 ... CY2019Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
4 2019-10-31 112020807 0001628280-19-013517 ... CY2019Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
5 2020-02-28 113931825 0001627475-20-000006 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
6 2020-04-30 115142604 0001627475-20-000018 ... CY2020Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
7 2020-07-31 120276173 0001627475-20-000031 ... CY2020Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
8 2020-10-31 122073553 0001627475-20-000044 ... CY2020Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
9 2021-01-31 124962279 0001627475-21-000015 ... CY2020Q4I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
10 2021-04-30 126144849 0001627475-21-000022 ... CY2021Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
[11 rows x 10 columns]
>>> df2
end val accn fy fp form filed frame label description
0 2018-10-03 900000000 0001627475-19-000007 2018 FY 10-K 2019-03-07 CY2018Q3I Entity Public Float The aggregate market value of the voting and n...
1 2019-06-28 1174421292 0001627475-20-000006 2019 FY 10-K 2020-03-02 CY2019Q2I Entity Public Float The aggregate market value of the voting and n...
2 2020-06-30 1532720862 0001627475-21-000015 2020 FY 10-K 2021-02-24 CY2020Q2I Entity Public Float The aggregate market value of the voting and n...
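To keep just the fields asked for in the question (label, value and fiscal year), you can then select those columns; this is a sketch that assumes the fy field from the SEC records came through json_normalize as shown above:
# keep only label, value and fiscal year
shares_summary = df1[['label', 'val', 'fy']]
print(shares_summary.head())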
I came across this same issue. While the solution above answers the question as asked, it might be better to flatten the entire dictionary and have all the columns represented in a single long data frame.
That data frame can be used as a building block for a DB, or it can simply be queried as you wish.
The facts key can contain more sub-keys than just dei and us-gaap.
Also, within the us-gaap dictionary, extracting multiple XBRL tags at a time gets pretty difficult with the per-tag approach.
The solution below might not be the prettiest or the most efficient, but it captures all the levels of the dictionary along with all the facts and values.
import requests
import pandas as pd
import json
from flatten_json import flatten
headers= {'User-Agent':'My User Agent 1.0', 'From':'something somethin'}
file = 'https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json'
data = json.loads(requests.get(file, headers = headers).text)
#get the cik and name of the entity
Cik_Name = dict(list(data.items())[0: 2])
Cik_Name_df = pd.DataFrame(Cik_Name,index=[0])
#Flatten file
f = flatten(data['facts'],'|')
#drop into a dataframe and transpose
f = pd.DataFrame(f,index=[0]).T
#reset index
f = f.reset_index(level=0)
#rename columns
f.rename(columns={'index': 'Col_split', 0:'values'}, inplace= True)
#split Col_split column by delimiter
f = f.join(f['Col_split'].str.split(pat='|',expand=True).add_prefix('Col_split'))
#drop original Col_split column
f = f.drop(['Col_split','Col_split4'],axis = 1)
#move values column to the end
f = f[[c for c in f if c not in ['values']] + ['values']]
#create groups based on Col_split2 containing the value label
f['groups'] = f["Col_split2"].eq('label').cumsum()
df_list = []
#loop to break df by group and create new columns for label & description
for i, g in f.groupby('groups'):
    label = g['values'].iloc[0]
    description = g['values'].iloc[1]
    g.drop(index=g.index[:2], axis=0, inplace=True)
    g['label'] = label
    g['description'] = description
    df_list.append(g)
final_df = pd.concat(df_list)
final_df.rename(columns={'Col_split0':'facts', 'Col_split1':'tag','Col_split3':'units'}, inplace=True)
final_df = final_df[['facts','tag','label','description','units','Col_split5','values']]
final_df['cum _ind'] = final_df["Col_split5"].eq('end').cumsum()
final_df = final_df.pivot(index = ['facts','tag','label','description','units','cum _ind'] , columns = 'Col_split5' ,values='values').reset_index()
final_df['cik'] = Cik_Name_df['cik'].iloc[0]
final_df['entityName'] = Cik_Name_df['entityName'].iloc[0]
final_df = final_df[['cik','entityName','facts','tag','label','description','units','accn','start','end','filed','form','fp','frame','fy','val']]
print(final_df)
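As a usage sketch, the long frame can then be filtered for any taxonomy/tag combination, for example the dei tag from the question:
# filter the flattened frame for one dei tag; any us-gaap tag works the same way
shares = final_df[(final_df['facts'] == 'dei') &
                  (final_df['tag'] == 'EntityCommonStockSharesOutstanding')]
print(shares[['label', 'units', 'end', 'val']])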
Please feel free to make improvements as you see fit and share them with the community.
I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so for each day I need to find the latest date with recorded NAV data at least three months earlier. Should I use the max() function with the filter() function to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code    date          NAV
fund 1       2021-01-04    1.0000
fund 1       2021-01-05    1.0001
fund 1       2021-01-06    1.0023
...          ...           ...
fund 2       2020-02-08    1.0000
fund 2       2020-02-09    0.9998
fund 2       2020-02-10    1.0001
...          ...           ...
fund 3       2022-05-04    2.0021
fund 3       2022-05-05    2.0044
fund 3       2022-05-06    2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in Excel, I know I could use the following array formulas to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with Python, I don't know what I could do. I only learnt it three days ago. Please help me.
This picture shows what I want, if it were in Excel. The yellow area is the original data, the white part is the intermediate calculation I need, and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use pct_change(periods=7) to get the same results as in this picture. But here is the tricky part: the row 7 rows earlier is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check whether the data used for the division exists first.
What you need is an implementation of a sliding-window maximum (for your example, 1 week / 7 days).
I recreated your example as follows (to build a data frame like the one you have):
import pandas as pd
import datetime
from random import randint
df = pd.DataFrame(columns=["fund code", "date", "NAV"])
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
for i in range(10):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
for i in range(20, 25):
    df = df.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)}, ignore_index=True)
This will look like your example, with non-continuous dates and two different funds.
The sliding-window maximum (for a variable window length in days) looks like this:
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back, then expire entries older than the window
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The result will look like this, with an added max column. This assumes the rows for each fund are stored contiguously; otherwise you could keep a separate max_queue per fund.
Using a max queue to track only the maximum within the window gives the correct O(n) complexity for a solution, which is important if you are dealing with huge datasets and especially bigger date ranges (months instead of a week).
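For the original three-month lookup (rather than the one-week demo window), an alternative sketch uses pandas.merge_asof, which finds, per fund, the most recent recorded row at or before each date minus 91 days; it assumes the column names from the question:
import pandas as pd

# make sure the date column is datetime, then build a shifted lookup table
df['date'] = pd.to_datetime(df['date'])
lookup = df.rename(columns={'date': 'date_3m_ago', 'NAV': 'NAV_3m_ago'})
# for each row, find the latest NAV recorded on or before (date - 91 days), within the same fund
merged = pd.merge_asof(
    df.assign(target=df['date'] - pd.Timedelta(days=91)).sort_values('target'),
    lookup.sort_values('date_3m_ago'),
    left_on='target',
    right_on='date_3m_ago',
    by='fund code',
    direction='backward',
)
merged['return_3m'] = merged['NAV'] / merged['NAV_3m_ago'] - 1
Rows with no NAV recorded at least 91 days earlier simply get NaN, which covers the "check whether the data exists first" requirement.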
I want to use S&P 500 company information to calculate an index. However, the constituents of the S&P 500 change frequently. I want to know the constituents for each quarter, but I can only get the most recent list from Wikipedia; the code is as below:
table=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = table[0]
tickers = df.Symbol.to_list()
'tickers' is a list that contains the tickers of all current S&P 500 companies:
['MMM',
'ABT',
'ABBV',
'ABMD',
'ACN',
'ATVI',
'ADBE',
'AMD',
'AAP',
'AES',
'AFL',
'A',
'APD',
'AKAM',
'ALK',
'ALB',
'ARE',
...]
Now I found a table that contains the historical changes of the S&P 500 constituents. There are dates, changes, and tickers for all the companies: '1' means the company was added to the list, and '-1' means it was removed. I want to use this information, particularly 'DateAfterChange', to get the lists of companies in the S&P 500 for each of the past 20 quarters (5 years). A complete list can be found here: https://docs.google.com/spreadsheets/d/1xkq2kkf-iElKl9BhEwqQx3Pgkh0B9dFKJpefQ4oOI_g/edit#gid=455032226.
DateBeforeChange DateAfterChange Change Ticker
20200623 20200624 1 TMUSR
20200618 20200619 1 BIO
20200618 20200619 1 TDY
20200618 20200619 1 TYL
20200618 20200619 -1 ADS
20200618 20200619 -1 HOG
My expected output could be single lists or in a combined format like this:
2019-Q1 2019-Q2 2019-Q3 2019-Q4
A B C D
B C D F
C D E E
D E F G
E F G H
...
What I'm thinking is to start from the most recent list of companies, first bucket the dates in the change data into quarters, and then, walking back in time, add back the tickers that were removed and remove those that were added. But I'm just not sure how to do that in Python. Can anyone please help?
This method works:
import pandas as pd
# current list
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = table[0]
tickers = df.Symbol.to_list()
# your file of changes
change = pd.read_excel("sp500change.xlsx")
# convert afterchange to datetime and set as index, sorting
change["DateAfterChange"] = pd.to_datetime(change["DateAfterChange"], format="%Y%m%d")
change.set_index("DateAfterChange", inplace=True)
change = change.sort_index(ascending=False)
# groupby quarter, creating list of tickers for additions and deletions from list
change = change.groupby([pd.Grouper(freq="Q"), "Change"])["Ticker"].agg(lambda x: list(x)).to_frame()
# set index afterchange, change to strings and set these as columns
change = change.reset_index(drop=False).set_index("DateAfterChange")
change["Change"] = change["Change"].map({-1: "drop", 1: "add"})
change = change.pivot(columns="Change")
change.columns = change.columns.droplevel(0)
# series of tickers over time
tick_series = pd.Series({pd.to_datetime("today"): tickers})
tick_series = tick_series.append(pd.Series(index=change.index)).sort_index(ascending=False)
for i in tick_series.iloc[1:].index:
    tick_series.loc[i] = list(
        set(tick_series.shift(1).loc[i] + change.loc[i]["drop"])
        .difference(set(change.loc[i]["add"]))
    )
The for loop works backwards, so it takes the previous list (the more recent one), adds the tickers that were dropped during that quarter, and removes those that were added during that quarter. Sets are needed to keep only the differences between the "add" list and the "more recent + drop" list.
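If you also want the side-by-side quarterly layout from the question, one sketch is to expand the series of lists into one column per quarter (columns of unequal length are padded with NaN):
quarterly = pd.DataFrame(
    {str(pd.Period(ts, freq='Q')): pd.Series(sorted(ticks))
     for ts, ticks in tick_series.items()}
)
print(quarterly.head())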
Hopefully you have found a solution by now anyway, and haven't waited for 2 years...
I want to append an expense df to a revenue df but can't properly do so. Can anyone offer how I may do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
revenue = pd.concat(df[0:1]) has the columns:
Fiscal year is January-December. All values CAD millions. | 2015 | 2016 | 2017 | 2018 | 2019

expense = pd.concat(df[1:2]) has the columns:
Unnamed: 0 | 2015 | 2016 | 2017 | 2018 | 2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
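For the specific frames in the question, a minimal sketch (assuming the column labels shown in the outputs above) would be:
# rename the mismatched first columns to a common name, then stack the two frames
revenue = revenue.rename(columns={'Fiscal year is January-December. All values CAD millions.': 'LineItem'})
expense = expense.rename(columns={'Unnamed: 0': 'LineItem'})
statement = pd.concat([revenue, expense], ignore_index=True)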
I managed to append the dataframes with the following code. Thanks David for putting me on the right track. I admit this is not the best way to do it, because in a runtime environment I don't know the value of the text to rename, and I've hard-coded it here. Ideally it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense,ignore_index=True)
Using the df=pd.read_html(url) construct, several lists are returned when scraping marketwatch financials. The below function returns a single dataframe of all balance sheet elements. The same code applies to quarterly and annual income and cash flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    count = sum([1 for Listitem in df if 'Unnamed: 0' in Listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        df_name = 'df_' + str(int(rowidx))
        df_name = pd.concat(df[rowidx+1:rowidx+2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = statement.append(df_name, ignore_index=True)
    return statement
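A rough usage sketch (the exact MarketWatch path for the balance sheet is an assumption and may differ):
symbol = 'MFC'
bs_url = 'https://www.marketwatch.com/investing/stock/' + symbol + '/financials/balance-sheet'
balance_sheet = getBalanceSheet(bs_url)
print(balance_sheet.head())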
I have a single .csv file with four tables, each a different financial statement for Southwest Airlines from 2001 back to 1986. I know I could separate each table into separate files, but they are initially downloaded as one.
I would like to read each table into its own pandas DataFrame for analysis. Here is a subset of the data:
Balance Sheet
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Cash & cash equivalents 2279861 522995 418819 378511
Short-term investments - - - -
Accounts & other receivables 71283 138070 73448 88799
Inventories of parts... 70561 80564 65152 50035
Income Statement
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Passenger revenues 5378702 5467965 4499360 3963781
Freight revenues 91270 110742 102990 98500
Charter & other - - - -
Special revenue adjustment - - - -
Statement of Retained Earnings
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Previous ret earn... 2902007 2385854 2044975 1632115
Cumulative effect of.. - - - -
Three-for-two stock split 117885 - 78076 -
Issuance of common.. 52753 75952 45134 10184
The tables each have 17 columns, the first being the line-item description, but varying numbers of rows, e.g. the balance sheet is 100 rows whereas the statement of cash flows is 65.
What I've Done
import pandas as pd
import numpy as np
# Lines that separate the various financial statements
lines_to_skip = [0, 102, 103, 158, 159, 169, 170]
with open('LUV.csv', 'r') as file:
fin_statements = pd.read_csv(file, skiprows=lines_to_skip)
balance_sheet = fin_statements[0:100]
I have seen posts with a similar objective noting to utilize nrows and skiprows. I used skiprows to read the entire file, then created the individual financial statements by indexing.
I am looking for comments and constructive criticism on creating a dataframe for each respective table in better Pythonic style and best practices.
What you want to do is far beyond what read_csv can do. In fact, your input file structure can be modeled as:
REPEAT:
    Dataframe name
    Header line
    REPEAT:
        Data line
    BLANK LINE OR END OF FILE
IMHO, the simplest way is to parse the file by hand, line by line, feeding a temporary csv file per dataframe and then loading the dataframe. The code could be:
import os
import tempfile
import pandas as pd

df = {}  # dictionary of dataframes

def process(tmp, df_name):
    '''Process the temporary file corresponding to one dataframe'''
    # print("Process", df_name, tmp.name)  # uncomment for debugging
    if tmp is not None:
        tmp.close()
        df[df_name] = pd.read_csv(tmp.name)
        os.remove(tmp.name)                # do not forget to remove the temp file

with open('LUV.csv') as file:
    df_name = "NONAME"                     # should never be in the resulting dict...
    tmp = None
    for line in file:
        # print(line)                      # uncomment for debugging
        if len(line.strip()) == 0:         # close temp file on empty line
            process(tmp, df_name)          # and process it
            tmp = None
        elif tmp is None:                  # a new part: store the name
            df_name = line.strip()
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
        else:
            tmp.write(line)                # just feed the temp file

# process the last part if no empty line was present...
process(tmp, df_name)
This is not really efficient because each line is written to a temporary file and then read again, but it is simple and robust.
A possible improvement would be to parse the parts with the csv module (it can parse a stream, while pandas wants files). The downside is that the csv module only parses into strings, so you lose pandas' automatic conversion to numbers. My opinion is that it is only worth it if the file is large and the full operation has to be repeated.
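A lighter variant of the same idea is to buffer each section's lines in memory and hand them to read_csv via io.StringIO, which avoids the temporary files while keeping pandas' type inference; a sketch, assuming the same "name line, header line, data lines, blank line" structure modeled above:
import io
import pandas as pd

def split_sections(path):
    '''Return a dict mapping each section name to its DataFrame.'''
    frames = {}
    with open(path) as fh:
        name, buf = None, []
        for line in fh:
            if not line.strip():              # a blank line ends the current section
                if name and buf:
                    frames[name] = pd.read_csv(io.StringIO(''.join(buf)))
                name, buf = None, []
            elif name is None:                # first non-blank line of a section is its name
                name = line.strip()
            else:                             # header and data lines
                buf.append(line)
        if name and buf:                      # last section (no trailing blank line)
            frames[name] = pd.read_csv(io.StringIO(''.join(buf)))
    return frames

dfs = split_sections('LUV.csv')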
Here is my solution:
My assumption is that each statement starts with an indicator ('Balance Sheet', 'Income Statement', 'Statement of Retained Earnings'), so we can split the table based on that to get individual dataframes. That is the premise on which the following code is based; let me know if it is a flawed assumption.
import pandas as pd
import numpy as np
#i copied your data above and created a csv with it
df = pd.read_csv('csvtable_stackoverflow',header=None)
0
0 Balance Sheet
1 Report Date 12/31/2001 12/31/...
2 Cash & cash equivalents 2279861 522995...
3 Short-term investments - - ...
4 Accounts & other receivables 71283 138070...
5 Inventories of parts... 70561 80564...
6 Income Statement
7 Report Date 12/31/2001 12/31/...
8 Passenger revenues 5378702 546796...
9 Freight revenues 91270 110742...
10 Charter & other - - ...
11 Special revenue adjustment - - ...
12 Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2...
14 Previous ret earn... 2902007 2385854...
15 Cumulative effect of.. - - ...
16 Three-for-two stock split 117885 - 78076 -
17 Issuance of common.. 52753 75952...
The code below simply uses numpy select to flag which rows contain the 'Balance Sheet', 'Income Statement' or 'Statement of Retained Earnings' markers:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html
bal_sheet = df[0].str.strip() == 'Balance Sheet'
income_stmt = df[0].str.strip() == 'Income Statement'
cash_flow_sheet = df[0].str.strip() == 'Statement of Retained Earnings'

condlist = [bal_sheet, income_stmt, cash_flow_sheet]
choicelist = ['Balance Sheet', 'Income Statement', 'Statement of Retained Earnings']
The next block creates a column indicating the sheet type, converts '0' to null and then fills down:
df = (df.assign(sheet_type = np.select(condlist,choicelist))
.assign(sheet_type = lambda x: x.sheet_type.replace('0',np.nan))
.fillna(method='ffill')
)
Last step is to pull out the individual dataframes
df_bal_sheet = df.copy().query('sheet_type=="Balance Sheet"')
df_income_sheet = df.copy().query('sheet_type=="Income Statement"')
df_cash_flow = df.copy().query('sheet_type=="Statement of Retained Earnings"')
df_bal_sheet :
0 sheet_type
0 Balance Sheet Balance Sheet
1 Report Date 12/31/2001 12/31/... Balance Sheet
2 Cash & cash equivalents 2279861 522995... Balance Sheet
3 Short-term investments - - ... Balance Sheet
4 Accounts & other receivables 71283 138070... Balance Sheet
5 Inventories of parts... 70561 80564... Balance Sheet
df_income_sheet :
0 sheet_type
6 Income Statement Income Statement
7 Report Date 12/31/2001 12/31/... Income Statement
8 Passenger revenues 5378702 546796... Income Statement
9 Freight revenues 91270 110742... Income Statement
10 Charter & other - - ... Income Statement
11 Special revenue adjustment - - ... Income Statement
df_cash_flow:
0 sheet_type
12 Statement of Retained Earnings Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2... Statement of Retained Earnings
14 Previous ret earn... 2902007 2385854... Statement of Retained Earnings
15 Cumulative effect of.. - - ... Statement of Retained Earnings
16 Three-for-two stock split 117885 - 78076 - Statement of Retained Earnings
17 Issuance of common.. 52753 75952... Statement of Retained Earnings
You can do further manipulation by fixing the column names and removing rows you do not need.
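As a further hedged sketch (assuming the values inside the single text column are separated by runs of at least two spaces, as in the sample above), you could drop the indicator rows, split that column into real columns and promote the 'Report Date' row to the header:
def tidy(statement_df):
    # drop the indicator rows ('Balance Sheet', 'Income Statement', ...) that label each block
    labels = ['Balance Sheet', 'Income Statement', 'Statement of Retained Earnings']
    body = statement_df[~statement_df[0].str.strip().isin(labels)].copy()
    # split the single text column on 2+ spaces into separate columns
    cols = body[0].str.split(r'\s{2,}', expand=True)
    cols.columns = cols.iloc[0]               # the first remaining row is 'Report Date ...'
    return cols.iloc[1:].reset_index(drop=True)

df_bal_sheet_tidy = tidy(df_bal_sheet)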
I have a dataset whose first three columns are as follows:
Basket ID (unique identifier), Sale amount (dollars) and date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:
Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)
Basket  Sale  Date        PrevSale  SaleCount  MeanToDate  MaxToDate
88      $15   3/01/2012             1
88      $30   11/02/2012  $15       2          $23         $30
88      $16   16/08/2012  $30       3          $20         $30
123     $90   18/06/2012            1
477     $77   19/08/2012            1
477     $57   11/12/2012  $77       2          $67         $77
566     $90   6/07/2012             1
I'm pretty new to Python, and I'm really struggling to find a neat way to do this. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. But I have no clue how to get the MeanToDate and MaxToDate efficiently apart from looping... any ideas?
This should do the trick:
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),   # cumulative mean
            'MaxToDate': se.cummax(),           # cumulative max
            'SaleCount': expanding_count(se),   # cumulative count
            'Sale': se,                         # simple copy
            'PrevSale': se.shift(1)             # previous sale
        },
        axis=1
    )

# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating in the pandas groupby documentation.
import pandas as pd
pd.__version__  # u'0.24.2'

from pandas import concat

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),   # cumulative mean
            'MaxToDate': se.expanding().max(),     # cumulative max
            'SaleCount': se.expanding().count(),   # cumulative count
            'Sale': se,                            # simple copy
            'PrevSale': se.shift(1)                # previous sale
        },
        axis=1
    )

###########################
from datetime import datetime

df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})
#########
new_df = df.groupby('Basket').apply(handler).reset_index()
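On a recent pandas, the same per-basket cumulative columns can also be built without apply, using vectorized groupby operations; an alternative sketch:
# sort so the cumulative statistics run in date order within each basket
out = df.sort_values(['Basket', 'Date'])
g = out.groupby('Basket')['Sale']
out = out.assign(
    PrevSale=g.shift(1),                                              # previous sale in the basket
    SaleCount=g.cumcount() + 1,                                       # running count
    MeanToDate=g.expanding().mean().reset_index(level=0, drop=True),  # running mean
    MaxToDate=g.cummax(),                                             # running max
)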