I have a single .csv file with four tables, each a different financial statement for Southwest Airlines from 1986 to 2001. I know I could separate each table into its own file, but they are downloaded as one.
I would like to read each table into its own pandas DataFrame for analysis. Here is a subset of the data:
Balance Sheet
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Cash & cash equivalents 2279861 522995 418819 378511
Short-term investments - - - -
Accounts & other receivables 71283 138070 73448 88799
Inventories of parts... 70561 80564 65152 50035
Income Statement
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Passenger revenues 5378702 5467965 4499360 3963781
Freight revenues 91270 110742 102990 98500
Charter & other - - - -
Special revenue adjustment - - - -
Statement of Retained Earnings
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Previous ret earn... 2902007 2385854 2044975 1632115
Cumulative effect of.. - - - -
Three-for-two stock split 117885 - 78076 -
Issuance of common.. 52753 75952 45134 10184
The tables each have 17 columns, the first being the line-item description, but varying numbers of rows, e.g. the balance sheet has 100 rows whereas the statement of cash flows has 65.
What I've Done
import pandas as pd
import numpy as np
# Lines that separate the various financial statements
lines_to_skip = [0, 102, 103, 158, 159, 169, 170]
with open('LUV.csv', 'r') as file:
    fin_statements = pd.read_csv(file, skiprows=lines_to_skip)
balance_sheet = fin_statements[0:100]
I have seen posts with a similar objective suggesting the use of nrows and skiprows. I used skiprows to read the entire file, then created the individual financial statements by indexing.
I am looking for comments and constructive criticism on creating a DataFrame for each respective table in better Pythonic style, following best practices.
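For reference, the nrows/skiprows idea you mention can also be taken all the way: one read_csv call per table, with explicit offsets. The offsets below are assumptions inferred from your lines_to_skip, so adjust them to the real file:

import pandas as pd

# Assumed layout, from lines_to_skip above: line 0 is the 'Balance Sheet'
# title, line 1 the header, then 100 data rows; lines 102-103 are the gap
# and the 'Income Statement' title, and so on for the remaining tables.
balance_sheet = pd.read_csv('LUV.csv', skiprows=1, nrows=100)
income_statement = pd.read_csv('LUV.csv', skiprows=104, nrows=53)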
What you want to do is far beyond what read_csv can do. In fact, your input file's structure can be modeled as:
REPEAT:
    Dataframe name
    Header line
    REPEAT:
        Data line
    BLANK LINE OR END OF FILE
IMHO, the simplest way is to parse the file by hand, line by line, feeding a temporary csv file per dataframe, then loading the dataframe. The code could be:
import os
import tempfile

import pandas as pd

df = {}  # dictionary of dataframes

def process(tmp, df_name):
    '''Process the temporary file corresponding to one dataframe.'''
    # print("Process", df_name, tmp.name)  # uncomment for debugging
    if tmp is not None:
        tmp.close()
        df[df_name] = pd.read_csv(tmp.name)
        os.remove(tmp.name)  # do not forget to remove the temp file

with open('LUV.csv') as file:
    df_name = "NONAME"  # should never end up in the resulting dict...
    tmp = None
    for line in file:
        # print(line)  # uncomment for debugging
        if len(line.strip()) == 0:    # close the temp file on an empty line
            process(tmp, df_name)     # and process it
            tmp = None
        elif tmp is None:             # a new part: store its name
            df_name = line.strip()
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
        else:
            tmp.write(line)           # just feed the temp file
    # process the last part if no trailing empty line was present...
    process(tmp, df_name)
This is not really efficient, because each line is written to a temporary file and then read again, but it is simple and robust.
A possible improvement would be to parse the parts with the csv module first (it can parse a stream, while pandas wants files). The downside is that the csv module only parses into strings, so you lose pandas' automatic conversion to numbers. My opinion is that it is worth it only if the file is large and the full operation has to be repeated.
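For what it's worth, a middle ground that keeps pandas' type conversion but avoids the temporary files is to buffer each part in memory with io.StringIO; a minimal sketch of that variant, under the same file-structure assumption:

import io

import pandas as pd

df = {}  # dictionary of dataframes
with open('LUV.csv') as file:
    df_name, buf = None, None
    for line in file:
        if not line.strip():          # blank line: the current part is complete
            if buf is not None:
                buf.seek(0)
                df[df_name] = pd.read_csv(buf)
            df_name, buf = None, None
        elif buf is None:             # first line of a part: its name
            df_name = line.strip()
            buf = io.StringIO()
        else:
            buf.write(line)           # accumulate header and data lines
    if buf is not None:               # last part, if no trailing blank line
        buf.seek(0)
        df[df_name] = pd.read_csv(buf)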
Here is my solution:
My assumption is that each statement starts with an indicator ('Balance Sheet', 'Income Statement', 'Statement of Retained Earnings'), so we can split the data on those markers to get individual dataframes. That is the premise on which the following code is based. Let me know if it is a flawed assumption.
import pandas as pd
import numpy as np

# I copied your data above and created a csv with it
df = pd.read_csv('csvtable_stackoverflow', header=None)
0
0 Balance Sheet
1 Report Date 12/31/2001 12/31/...
2 Cash & cash equivalents 2279861 522995...
3 Short-term investments - - ...
4 Accounts & other receivables 71283 138070...
5 Inventories of parts... 70561 80564...
6 Income Statement
7 Report Date 12/31/2001 12/31/...
8 Passenger revenues 5378702 546796...
9 Freight revenues 91270 110742...
10 Charter & other - - ...
11 Special revenue adjustment - - ...
12 Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2...
14 Previous ret earn... 2902007 2385854...
15 Cumulative effect of.. - - ...
16 Three-for-two stock split 117885 - 78076 -
17 Issuance of common.. 52753 75952...
The code below uses numpy.select to label which rows mark the start of the balance sheet, income statement, or statement of retained earnings:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html
bal_sheet = df[0].str.strip() == 'Balance Sheet'
income_stmt = df[0].str.strip() == 'Income Statement'
cash_flow_sheet = df[0].str.strip() == 'Statement of Retained Earnings'
condlist = [bal_sheet, income_stmt, cash_flow_sheet]
choicelist = ['Balance Sheet', 'Income Statement',
              'Statement of Retained Earnings']
The next code below creates a column indicating the sheet type, converts the default '0' to null, and then fills down:
df = (df.assign(sheet_type=np.select(condlist, choicelist))
        .assign(sheet_type=lambda x: x.sheet_type.replace('0', np.nan))
        .fillna(method='ffill')
     )
Last step is to pull out the individual dataframes
df_bal_sheet = df.copy().query('sheet_type=="Balance Sheet"')
df_income_sheet = df.copy().query('sheet_type=="Income Statement"')
df_cash_flow = df.copy().query('sheet_type=="Statement of Retained Earnings"')
df_bal_sheet:
0 sheet_type
0 Balance Sheet Balance Sheet
1 Report Date 12/31/2001 12/31/... Balance Sheet
2 Cash & cash equivalents 2279861 522995... Balance Sheet
3 Short-term investments - - ... Balance Sheet
4 Accounts & other receivables 71283 138070... Balance Sheet
5 Inventories of parts... 70561 80564... Balance Sheet
df_income_sheet:
0 sheet_type
6 Income Statement Income Statement
7 Report Date 12/31/2001 12/31/... Income Statement
8 Passenger revenues 5378702 546796... Income Statement
9 Freight revenues 91270 110742... Income Statement
10 Charter & other - - ... Income Statement
11 Special revenue adjustment - - ... Income Statement
df_cash_flow:
0 sheet_type
12 Statement of Retained Earnings Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2... Statement of Retained Earnings
14 Previous ret earn... 2902007 2385854... Statement of Retained Earnings
15 Cumulative effect of.. - - ... Statement of Retained Earnings
16 Three-for-two stock split 117885 - 78076 - Statement of Retained Earnings
17 Issuance of common.. 52753 75952... Statement of Retained Earnings
You can do further manipulation by fixing the column names and removing rows you do not need.
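For example, here is one possible cleanup of an extracted frame; it is a sketch whose row positions are assumptions based on the sample above (title row first, 'Report Date' header row second), and it is most useful once the file is read with a real delimiter so each report date lands in its own column:

bs = df_bal_sheet.drop(columns='sheet_type')       # keep only the raw data
bs.columns = bs.iloc[1]                            # promote the 'Report Date' row to header
bs = bs.drop(bs.index[:2]).reset_index(drop=True)  # drop the title and header rows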
Related
I want to retrieve financial data from Refinitiv. I use the code below; 122.csv contains the RICs of many companies, but when I run the code the results only show the first company's information, and the RIC of that first company is incomplete.
with open(r'C:\Users\zhang\Desktop\122.csv', encoding='utf-8') as date:
    for i, line in enumerate(read_csv(date)):
        RIC = line[0]
        SDate = '2013-06-10'
        EDate = '2015-06-10'
        df1, das = ek.get_data(RIC,
            fields=[
                'TR.F.TotAssets(Scale=6)',
                'TR.F.DebtTot(Scale=6)',
            ],
            parameters={'SDate': '{}'.format(SDate), 'EDate': '{}'.format(EDate)},
        )
df1
the result:
  Instrument  Total Assets  Debt - Total
0          A         10686          2699
1          A         10815          1663
How can I get results for the whole list of company RICs in my csv file?
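A likely culprit, for what it's worth: iterating directly over a pandas DataFrame yields its column labels (strings), not its rows, which is why only one truncated "RIC" is processed. A sketch that loops over every row instead; the assumption is that the RICs sit in the first column of 122.csv:

import pandas as pd

rics = pd.read_csv(r'C:\Users\zhang\Desktop\122.csv', encoding='utf-8')

frames = []
for ric in rics.iloc[:, 0]:          # every value in the first column
    df1, err = ek.get_data(ric,
        fields=[
            'TR.F.TotAssets(Scale=6)',
            'TR.F.DebtTot(Scale=6)',
        ],
        parameters={'SDate': '2013-06-10', 'EDate': '2015-06-10'},
    )
    frames.append(df1)

result = pd.concat(frames, ignore_index=True)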
I am trying to complete a script to store all the trail reports my company gets from various clearing houses. As part of this script I rip the data from multiple Excel sheets (over 20 a month) and amalgamate it into a series of pandas DataFrames (organized in a timeline). Unfortunately, when I try to output a new spreadsheet with the amalgamated summaries, I get a 'number stored as text' error from Excel.
FinalFile = Workbook()
FinalFile.create_sheet(title='Summary')  # This will hold a summary table eventually
for i in Timeline:
    index = Timeline.index(i)
    sheet = FinalFile.create_sheet(title=i)
    sheet[i].number_format = 'Currency'
    df = pd.DataFrame(Output[index])
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    df.head()
    df = df.set_index('Payment Type')
    for r in dataframe_to_rows(df, index=True, header=True):
        sheet.append(r)
    for cell in sheet['A'] + sheet[1]:
        cell.style = 'Pandas'
SavePath = SaveFolder + '/' + CurrentDate + '.xlsx'
FinalFile.save(SavePath)
Using number_format = 'Currency' to format as currency did not resolve this, nor did my attempt to use the write-only method from the openpyxl documentation page:
https://openpyxl.readthedocs.io/en/stable/pandas.html
Fundamentally this code is outputting the right index, headers, sheet name, and formatting; the only issue is the numbers stored as text in B3:D7.
Attached is an example month's output, and below is an example dataframe for the same month:
0                       Total Paid    Net   GST
Payment Type
Adjustments                  -2800  -2546  -254
Agency Upfront               23500  21363  2135
Agency Trail                 46980  42708  4270
Referring Office Trail       16003  14548  1454
NilTrailPayment                  0      0     0
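A likely cause, for what it's worth: values ripped from source sheets often come through as strings, and openpyxl writes string cells as text no matter what number format is applied. A sketch of a fix, assuming the amalgamated columns are purely numeric; it would go just before the dataframe_to_rows loop:

# Coerce string values back to real numbers; openpyxl then writes numeric
# cells and Excel stops flagging 'number stored as text'.
df = df.apply(pd.to_numeric, errors='coerce')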
I am struggling to break down the method required to extract data from deeply nested complex JSON data. I have the following code to obtain the JSON.
import requests
import pandas as pd
import json
import pprint
import seaborn as sns
import matplotlib.pyplot as plt
base_url="https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json"
headers={'User-Agent': 'Myheaderdata'}
first_response=requests.get(base_url,headers=headers)
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
Which provides an output showing the JSON and a pandas DataFrame. The DataFrame has three columns, with the third (facts) containing a lot of nested data.
What I want to understand is how to navigate into that nested structure, to retrieve certain data. For example, I may want to go to the DEI level, or the US GAAP level and retrieve a particular attribute. Let's say DEI > EntityCommonStockSharesOutstanding and obtain the "label", "value" and "FY" details.
When I try to use the get function as follows:
data = []
for response in response_dic:
    data.append({"EntityCommonStockSharesOutstanding": response.get('EntityCommonStockSharesOutstanding')})
new_df = pd.DataFrame(data)
new_df.head()
I end up with the following AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-15-15c1685065f0> in <module>
1 data=[]
2 for response in response_dic:
----> 3 data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
4 base_df=pd.DataFrame(data)
5 base_df.head()
AttributeError: 'str' object has no attribute 'get'
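The error itself arises because iterating a dict yields its keys, and the top-level keys here are plain strings:

for key in response_dic:
    print(key)    # 'cik', 'entityName', 'facts': strings, which have no .get method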
Use pd.json_normalize:
For example:
entity1 = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
entity2 = response_dic['facts']['dei']['EntityPublicFloat']
df1 = pd.json_normalize(entity1, record_path=['units', 'shares'],
meta=['label', 'description'])
df2 = pd.json_normalize(entity2, record_path=['units', 'USD'],
meta=['label', 'description'])
>>> df1
end val accn ... frame label description
0 2018-10-31 106299106 0001564590-18-028629 ... CY2018Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
1 2019-02-28 106692030 0001627475-19-000007 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
2 2019-04-30 107160359 0001627475-19-000015 ... CY2019Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
3 2019-07-31 110803709 0001627475-19-000025 ... CY2019Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
4 2019-10-31 112020807 0001628280-19-013517 ... CY2019Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
5 2020-02-28 113931825 0001627475-20-000006 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
6 2020-04-30 115142604 0001627475-20-000018 ... CY2020Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
7 2020-07-31 120276173 0001627475-20-000031 ... CY2020Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
8 2020-10-31 122073553 0001627475-20-000044 ... CY2020Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
9 2021-01-31 124962279 0001627475-21-000015 ... CY2020Q4I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
10 2021-04-30 126144849 0001627475-21-000022 ... CY2021Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
[11 rows x 10 columns]
>>> df2
end val accn fy fp form filed frame label description
0 2018-10-03 900000000 0001627475-19-000007 2018 FY 10-K 2019-03-07 CY2018Q3I Entity Public Float The aggregate market value of the voting and n...
1 2019-06-28 1174421292 0001627475-20-000006 2019 FY 10-K 2020-03-02 CY2019Q2I Entity Public Float The aggregate market value of the voting and n...
2 2020-06-30 1532720862 0001627475-21-000015 2020 FY 10-K 2021-02-24 CY2020Q2I Entity Public Float The aggregate market value of the voting and n...
I came across this same issue. While the solution provided meets the requirements of your question, it might be better to flatten the entire dictionary and have all the columns represented in one long dataframe.
That dataframe can be used as a building block for a DB, or it can simply be queried as you wish.
The facts key can have more sub-keys than just dei or us-gaap.
Also, within the us-gaap dictionary, if you want to extract multiple XBRL tags at a time you will have a pretty difficult time.
The solution below might not be the prettiest or most efficient, but it gets all the levels of the dictionary along with all the facts and values.
import requests
import pandas as pd
import json
from flatten_json import flatten
headers= {'User-Agent':'My User Agent 1.0', 'From':'something somethin'}
file = 'https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json'
data = json.loads(requests.get(file, headers = headers).text)
#get the cik and name of the entity
Cik_Name = dict(list(data.items())[0: 2])
Cik_Name_df = pd.DataFrame(Cik_Name,index=[0])
#Flatten file
f = flatten(data['facts'],'|')
#drop into a dataframe and transpose
f = pd.DataFrame(f,index=[0]).T
#reset index
f = f.reset_index(level=0)
#rename columns
f.rename(columns={'index': 'Col_split', 0:'values'}, inplace= True)
#split Col_split column by delimiter
f = f.join(f['Col_split'].str.split(pat='|',expand=True).add_prefix('Col_split'))
#drop original Col_split column
f = f.drop(['Col_split','Col_split4'],axis = 1)
#move values column to the end
f = f[[c for c in f if c not in ['values']] + ['values']]
#create groups based on Col_split2 containing the value label
f['groups'] = f["Col_split2"].eq('label').cumsum()
df_list = []
#loop to break df by group and create new columns for label & description
for i, g in f.groupby('groups'):
    label = g['values'].iloc[0]
    description = g['values'].iloc[1]
    g.drop(index=g.index[:2], axis=0, inplace=True)
    g['label'] = label
    g['description'] = description
    df_list.append(g)
final_df = pd.concat(df_list)
final_df.rename(columns={'Col_split0':'facts', 'Col_split1':'tag','Col_split3':'units'}, inplace=True)
final_df = final_df[['facts','tag','label','description','units','Col_split5','values']]
final_df['cum _ind'] = final_df["Col_split5"].eq('end').cumsum()
final_df = final_df.pivot(index = ['facts','tag','label','description','units','cum _ind'] , columns = 'Col_split5' ,values='values').reset_index()
final_df['cik'] = Cik_Name_df['cik'].iloc[0]
final_df['entityName'] = Cik_Name_df['entityName'].iloc[0]
final_df = final_df[['cik','entityName','facts','tag','label','description','units','accn','start','end','filed','form','fp','frame','fy','val']]
print(final_df)
Please feel free to make improvements as you see fit and share them with the community.
I am trying to parse quarterly investment statements to import transactions into Quicken, as my provider (TSP for federal government) does not support online download. I think I have figured out how to create a "qif" file that can be used for Quicken import.
I am using Python 3.8 and pdfplumber. Here is a snapshot from one of the pages in the pdf.
I need to parse transactions with a posting date from "Transaction Detail By Source"; for the other sections, I need the name of the fund as well as the transaction with its posting date. Here is my simple Python code:
import pdfplumber

with pdfplumber.open(r'C:\Users\ra_d\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for x in pdf.pages:
        print(x.extract_text())
        # print(x.extract_words())
        # print(x.extract_tables())
Well, the good news is that I can parse the pdf and this output is generated for the portion in the image -
TRANSACTION DETAIL BY SOURCE
Agency
Payroll Posting Automatic Agency
Office Date Transaction Type Employee (1%) Matching Total
Beginning Balance $0.00 $0.00 $0.00 $0.00
12400001 03/22/11 Auto Enrollment Contribution 69.00 23.00 69.00 161.00
Increase/Decrease in Value 0.05 0.02 0.05 0.12
Ending Balance $69.05 $23.02 $69.05 $161.12
TRANSACTION DETAIL BY FUND
Government Securities Investment (G) Fund
Number
Posting Transaction Share of Dollar
Date Transaction Type Amount Price Shares Balance
Beginning Balance $13.4882 0.0000 $0.00
03/22/11 Auto Enrollment Contribution $161.00 13.5752 11.8599
Ending Balance $13.5854 11.8599 $161.12
When I use the extract_tables() function, I get blanks.
So, I am looking for suggestions on how to improve this parsing. I need to get the headers and their values distinctly in order to process them accurately.
Thanks much.
RD
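One common reason extract_tables() comes back blank is that statements like these have no ruled cell borders, so pdfplumber's default 'lines' strategy finds nothing. A text-based strategy is worth trying; a sketch, with the settings as assumptions to tune:

import pdfplumber

settings = {
    "vertical_strategy": "text",      # infer column edges from the words themselves
    "horizontal_strategy": "text",    # infer row edges the same way
}
with pdfplumber.open(r'C:\Users\ra_d\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for page in pdf.pages:
        for table in page.extract_tables(settings):
            for row in table:
                print(row)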
def parser(filepath):
    L = []
    pdf = pdfplumber.open(filepath)
    for i in range(15):
        page = pdf.pages[i]
        try:
            table = page.extract_table()
            df = pd.DataFrame(table)
            if df.shape[0] > 1:
                df = df.drop([0, 2, 5, 6, 8], axis=1)
                for j in range(len(df)):
                    if df.iloc[j][4] is not None:
                        df.iloc[j][3] = df.iloc[j][4]
                df = df.drop([4], axis=1)
                L.append(df)
            if "Всего" in table[-1]:  # "Всего" means "Total" (the final row)
                break
        except Exception:
            continue
    df = pd.concat(L)
    df.columns = ["Раздел", "Слова", "Важность"]  # "Section", "Words", "Importance"
    df = df.reset_index(drop=True)
    df = df.drop([0, 1])
    return df
I want to append an expense df to a revenue df but can't properly do so. Can anyone offer how I may do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
revenue = pd.concat(df[0:1]) has columns:
Fiscal year is January-December. All values CAD millions. | 2015 | 2016 | 2017 | 2018 | 2019
expense = pd.concat(df[1:2]) has columns:
Unnamed: 0 | 2015 | 2016 | 2017 | 2018 | 2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
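A minimal sketch of that recipe, using the column names from the frames shown above:

revenue = revenue.rename(
    columns={'Fiscal year is January-December. All values CAD millions.': 'LineItem'})
expense = expense.rename(columns={'Unnamed: 0': 'LineItem'})
statement = pd.concat([revenue, expense], ignore_index=True)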
I managed to append the dataframes with the following code. Thanks David for putting me on the right track. I admit this is not the best way to do it, because in a runtime environment I don't know the value of the text to rename, and I've hard-coded it here. Ideally it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense,ignore_index=True)
Using the df=pd.read_html(url) construct, a list of several DataFrames is returned when scraping the MarketWatch financials. The function below returns a single dataframe of all balance sheet elements. The same code applies to quarterly and annual income and cash flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    count = sum([1 for Listitem in df if 'Unnamed: 0' in Listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        df_name = 'df_' + str(int(rowidx))
        df_name = pd.concat(df[rowidx + 1:rowidx + 2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = statement.append(df_name, ignore_index=True)
    return statement
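Hypothetical usage; the '/balance-sheet' suffix is an assumption about MarketWatch's URL scheme:

symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/' + symbol + '/financials/balance-sheet'
balance_sheet = getBalanceSheet(url)
print(balance_sheet.head())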