Difficulty parsing PDF files using pdfplumber - python

I am trying to parse quarterly investment statements to import transactions into Quicken, as my provider (the TSP for federal government employees) does not support online download. I think I have figured out how to create a "qif" file that Quicken can import.
I am using Python 3.8 and pdfplumber. Here is a snapshot from one of the pages in the pdf.
I need to parse transactions with a posting date from "Transaction Detail By Source". From "Transaction Detail By Fund" I also need the name of the fund along with each transaction and its posting date. Here is my simple Python code -
import pdfplumber

with pdfplumber.open(r'C:\Users\ra_d\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for x in pdf.pages:
        print(x.extract_text())
        # print(x.extract_words())
        # print(x.extract_tables())
Well, the good news is that I can parse the pdf and this output is generated for the portion in the image -
TRANSACTION DETAIL BY SOURCE
Agency
Payroll Posting Automatic Agency
Office Date Transaction Type Employee (1%) Matching Total
Beginning Balance $0.00 $0.00 $0.00 $0.00
12400001 03/22/11 Auto Enrollment Contribution 69.00 23.00 69.00 161.00
Increase/Decrease in Value 0.05 0.02 0.05 0.12
Ending Balance $69.05 $23.02 $69.05 $161.12
TRANSACTION DETAIL BY FUND
Government Securities Investment (G) Fund
Number
Posting Transaction Share of Dollar
Date Transaction Type Amount Price Shares Balance
Beginning Balance $13.4882 0.0000 $0.00
03/22/11 Auto Enrollment Contribution $161.00 13.5752 11.8599
Ending Balance $13.5854 11.8599 $161.12
When I use the extract_tables() function, I get blanks.
So, I am looking for suggestions on how to improve this parsing. I need to get the headers and their values distinctly in order to process them accurately.
Thanks much.
RD
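
One thing worth checking first: extract_tables() often returns nothing on statements like this because the PDF draws no ruling lines around the tables, and pdfplumber's default table strategy looks for lines. You can tell it to infer the grid from the text positions instead. A minimal sketch (the settings usually need tuning per layout):

import pdfplumber

with pdfplumber.open(r'C:\Users\ra_d\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    page = pdf.pages[0]
    # Infer columns and rows from word positions instead of ruling lines
    table = page.extract_table(table_settings={
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    })
    for row in table or []:
        print(row)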

Here is a parser built around extract_table(), cleaned up with the imports it needs (the string literals are Russian; translations in the comments):

import pdfplumber
import pandas as pd

def parser(filepath):
    frames = []
    with pdfplumber.open(filepath) as pdf:
        for page in pdf.pages[:15]:
            try:
                table = page.extract_table()
                df = pd.DataFrame(table)
                if df.shape[0] > 1:
                    df = df.drop([0, 2, 5, 6, 8], axis=1)
                    # Where column 4 has a value, move it into column 3
                    for idx in df.index:
                        if df.at[idx, 4] is not None:
                            df.at[idx, 3] = df.at[idx, 4]
                    df = df.drop([4], axis=1)
                    frames.append(df)
                if "Всего" in table[-1]:  # "Всего" = "Total": last table reached
                    break
            except Exception:
                continue
    df = pd.concat(frames)
    df.columns = ["Раздел", "Слова", "Важность"]  # "Section", "Words", "Importance"
    df = df.reset_index(drop=True)
    df = df.drop([0, 1])
    return df
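
If extract_table() keeps coming back empty for these statements, another route is to parse the extract_text() output line by line. A rough sketch, with the pattern as a stated assumption: transaction lines start with an optional 8-digit payroll office number and an MM/DD/YY posting date, and fund sections are introduced by lines ending in "Fund".

import re
import pdfplumber

# Assumed line shape: [office] date type amount(s)
TXN_RE = re.compile(
    r'^(?:(?P<office>\d{8})\s+)?'         # optional payroll office number
    r'(?P<date>\d{2}/\d{2}/\d{2})\s+'     # posting date
    r'(?P<type>[A-Za-z][A-Za-z /]*?)\s+'  # transaction type
    r'(?P<amounts>[-$\d.,\s]+)$'          # dollar/share columns
)

def transactions(path):
    """Yield (fund, date, type, amounts) for every dated line."""
    fund = None  # stays None inside "Transaction Detail By Source"
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for line in (page.extract_text() or '').splitlines():
                line = line.strip()
                if line.endswith('Fund'):  # e.g. "Government Securities Investment (G) Fund"
                    fund = line
                    continue
                m = TXN_RE.match(line)
                if m:
                    amounts = [a.lstrip('$').replace(',', '')
                               for a in m.group('amounts').split()]
                    yield fund, m.group('date'), m.group('type').strip(), amounts

Each yielded tuple carries the current fund name (or None for the by-source section), which should be enough to build the qif records.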

Related

Using ek.get_data in Python, only the first RIC from my csv shows results; the other lines disappear

I want to retrieve financial data from Refinitiv. I use the code below; 122.csv holds the RICs of many companies, but when I run it the results show only the first company's information, and even that company's RIC is incomplete:
with open(r'C:\Users\zhang\Desktop\122.csv', encoding='utf-8') as date:
    for i, line in enumerate(read_csv(date)):
        RIC = line[0]
        SDate = '2013-06-10'
        EDate = '2015-06-10'
        df1, das = ek.get_data(RIC, fields=[
                'TR.F.TotAssets(Scale=6)',
                'TR.F.DebtTot(Scale=6)',
            ],
            parameters={'SDate': '{}'.format(SDate), 'EDate': '{}'.format(EDate)},
        )
df1
the result:
  Instrument  Total Assets  Debt - Total
0          A         10686          2699
1          A         10815          1663
How can I get results for the whole list of RICs in my csv file?
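
The loop is likely iterating over the wrong object: if read_csv here is pandas.read_csv, iterating over the resulting DataFrame yields its column labels, so line[0] is just the first character of the header row, which is why only instrument 'A' comes back. ek.get_data also accepts a list of instruments, so you can read all the RICs up front and make a single call. A sketch, assuming 122.csv has one RIC per line and no header row:

import pandas as pd
import eikon as ek

# Read every RIC from the first column (header=None: no header row)
rics = pd.read_csv(r'C:\Users\zhang\Desktop\122.csv', header=None)[0].tolist()

df, err = ek.get_data(
    rics,  # a list of instruments returns rows for every RIC
    fields=['TR.F.TotAssets(Scale=6)', 'TR.F.DebtTot(Scale=6)'],
    parameters={'SDate': '2013-06-10', 'EDate': '2015-06-10'},
)
print(df)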

Resolving "number stored as text" errors

I am trying to complete a script to store all the trail reports my company gets from various clearing houses. As part of this script I rip the data from multiple Excel sheets (over 20 a month) and amalgamate it into a series of pandas dataframes (organized in a timeline). Unfortunately, when I try to output a new spreadsheet with the amalgamated summaries, I get a "number stored as text" error from Excel.
FinalFile = Workbook()
FinalFile.create_sheet(title='Summary')  # This will hold a summary table eventually
for i in Timeline:
    index = Timeline.index(i)
    sheet = FinalFile.create_sheet(title=i)
    sheet[i].number_format = 'Currency'
    df = pd.DataFrame(Output[index])
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    df.head()
    df = df.set_index('Payment Type')
    for r in dataframe_to_rows(df, index=True, header=True):
        sheet.append(r)
    for cell in sheet['A'] + sheet[1]:
        cell.style = 'Pandas'
SavePath = SaveFolder + '/' + CurrentDate + '.xlsx'
FinalFile.save(SavePath)
Using number_format = 'Currency' to format as currency did not resolve this, nor did my attempt to use the write-only method from the openpyxl documentation page:
https://openpyxl.readthedocs.io/en/stable/pandas.html
Fundamentally this code is outputting the right index, headers, sheet name and formatting; the only issue is the numbers stored as text in B3:D7.
Attached is an example month's output; here is an example dataframe for the same month:
0 Total Paid Net GST
Payment Type
Adjustments -2800 -2546 -254
Agency Upfront 23500 21363 2135
Agency Trail 46980 42708 4270
Referring Office Trail 16003 14548 1454
NilTrailPayment 0 0 0
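
The usual cause: by the time the values reach openpyxl they are Python strings (everything ripped from the source sheets stays text), and number_format only changes how a cell is displayed; it never converts text to a number. A minimal sketch of the fix, assuming the three data columns are fully numeric: coerce them with pd.to_numeric before appending rows.

import pandas as pd

# After setting 'Payment Type' as the index, give every remaining
# column a real numeric dtype so dataframe_to_rows() emits numbers
# rather than strings.
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

With real numeric dtypes Excel stores the cells as numbers and the warning in B3:D7 disappears; the Currency format can still be applied afterwards.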

How to sum the highest/lowest total cost from a specific row and column in an Excel file?

I want to write a program that evaluates data about a construction site operation from an ASCII table in a CSV file. The template file is an Excel file:
name              Qualification      costs
Max Mustermann    Seller             6.155,39
Max Mustermann    Seller             5.069,15
Max Mustermann    Seller             362,08
Klee klumper      Seller             4.637,65
Klee klumper      Seller             1.159,41
Koch Schnerider   Project Engineer   1.358,28
Koch Schnerider   Project Engineer   679,14
Müller Manim      Distribution       15.149,28
Müller Manim      Distribution       16.743,94
Schach Matt       Site Manager       14.399,79
Schach Matt       Site Manager       1.371,41
Zeimetz Kinder    Project Engineer   11.376,50
Zeimetz Kinder    Project Engineer   2.133,09
The following data should be evaluated:
Total costs from all operations
All qualifications with the respective sum of costs
I managed to calculate these two above, but how do I manage the other two?
The qualification that has the highest total cost
The qualification that has the lowest total cost
This is my first attempt at the code:
import pandas as pd
import os

filename = "site_operation.csv"
path = "."
file = os.path.join(path, filename)  # note: path first, then filename
tscv1 = pd.read_csv(file, sep=";", thousands=".", decimal=",", encoding="ansi")

total_cost = tscv1['costs'].sum()
print("Total costs from all operations: ", total_cost)
You can use the groupby function from pandas to get the costs per qualification:
tscv1.groupby('Qualification').sum()
costs
Qualification
Distribution 31893.22
Project Engineer 15547.01
Seller 17383.68
Site Manager 15771.20
# For the min and max values,
# an easy way is to sort the results:
sorted_by_qual_value = tscv1.groupby('Qualification').sum().sort_values('costs')
min_qual = sorted_by_qual_value.head(1)
# costs
# Qualification
# Project Engineer 15547.01
max_qual = sorted_by_qual_value.tail(1)
# costs
# Qualification
# Distribution 31893.22
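
Alternatively, idxmax()/idxmin() return the labels directly, with no sort needed; a small sketch on the same grouped totals:

totals = tscv1.groupby('Qualification')['costs'].sum()

print("Highest total cost:", totals.idxmax(), totals.max())
# Highest total cost: Distribution 31893.22
print("Lowest total cost:", totals.idxmin(), totals.min())
# Lowest total cost: Project Engineer 15547.01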

error "can only convert an array of size 1 to a Python scalar" when I try to get the index of a row of a pandas dataframe as an integer

I have an excel file with stock symbols and many other columns. I have a simplified version of the excel file below:
   Symbol   Industry
0  AAPL     Technology Manufacturing
1  MSFT     Technology Manufacturing
2  TSLA     Electric Car Manufacturing
Essentially, I am trying to get the Industry based on the Symbol.
For example, if I use 'AAPL' I want to get 'Technology Manufacturing'. Here is my code so far.
import pandas as pd
excel_file1 = 'file.xlsx'
df = pd.read_excel(excel_file1)
stock = 'AAPL'
row_index = df[df['Symbol'] == stock].index.item()
industry = df['Industry'][row_index]
print(industry)
after trying to get row_index, I get an error: "ValueError: can only convert an array of size 1 to a Python scalar"
can someone solve this? Also let's say row_index works: is this code (below) correct?
industry = df['Industry'][row_index]
.index.item() raises that ValueError when the boolean filter matches more (or fewer) than one row, because item() requires the index to have exactly one element. Taking the first match positionally avoids it. Use:
stock = 'AAPL'
industry = df[df['Symbol'] == stock]['Industry'].iloc[0]
OR, if you want to search using the index, use df.loc:
stock = 'AAPL'
industry = df.loc[df[df['Symbol'] == stock].index, 'Industry'].iloc[0]
But the first one's much better. (Note .iloc[0] rather than [0]: the latter is a label lookup and fails when the matching row is not at label 0.)
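
Another option, a small sketch assuming each symbol appears only once: turn the Symbol column into the index once and reuse it as a lookup table.

import pandas as pd

df = pd.read_excel('file.xlsx')

# One-time Symbol -> Industry lookup
lookup = df.set_index('Symbol')['Industry']

print(lookup['AAPL'])  # Technology Manufacturing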

Read n tables in csv file to separate pandas DataFrames

I have a single .csv file with four tables, each a different financial statement for Southwest Airlines from 2001 back to 1986. I know I could separate each table into its own file, but they are initially downloaded as one.
I would like to read each table into its own pandas DataFrame for analysis. Here is a subset of the data:
Balance Sheet
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Cash & cash equivalents 2279861 522995 418819 378511
Short-term investments - - - -
Accounts & other receivables 71283 138070 73448 88799
Inventories of parts... 70561 80564 65152 50035
Income Statement
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Passenger revenues 5378702 5467965 4499360 3963781
Freight revenues 91270 110742 102990 98500
Charter & other - - - -
Special revenue adjustment - - - -
Statement of Retained Earnings
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Previous ret earn... 2902007 2385854 2044975 1632115
Cumulative effect of.. - - - -
Three-for-two stock split 117885 - 78076 -
Issuance of common.. 52753 75952 45134 10184
The tables each have 17 columns, the first being the line-item description, but varying numbers of rows, e.g. the balance sheet is 100 rows whereas the statement of cash flows is 65.
What I've Done
import pandas as pd
import numpy as np

# Lines that separate the various financial statements
lines_to_skip = [0, 102, 103, 158, 159, 169, 170]
with open('LUV.csv', 'r') as file:
    fin_statements = pd.read_csv(file, skiprows=lines_to_skip)
balance_sheet = fin_statements[0:100]
I have seen posts with a similar objective noting to utilize nrows and skiprows. I utilized skiprows to read the entire file, then created the individual financial statements by indexing.
I am looking for comments and constructive criticism on creating a dataframe for each respective table in better Pythonic style and best practices.
What you want to do is far beyond what read_csv can do. In fact, your input file's structure can be modeled as:
REPEAT:
    Dataframe name
    Header line
    REPEAT:
        Data line
    BLANK LINE OR END OF FILE
IMHO, the simplest way is to parse the file by hand, line by line, feeding a temporary csv file per dataframe and then loading the dataframe. Code could be:
import os
import tempfile
import pandas as pd

df = {}  # dictionary of dataframes

def process(tmp, df_name):
    '''Process the temporary file corresponding to one dataframe'''
    # print("Process", df_name, tmp.name)  # uncomment for debugging
    if tmp is not None:
        tmp.close()
        df[df_name] = pd.read_csv(tmp.name)
        os.remove(tmp.name)  # do not forget to remove the temp file

with open('LUV.csv') as file:
    df_name = "NONAME"  # should never appear in the resulting dict...
    tmp = None
    for line in file:
        # print(line)  # uncomment for debugging
        if len(line.strip()) == 0:  # an empty line closes the temp file
            process(tmp, df_name)   # ...and processes it
            tmp = None
        elif tmp is None:           # a new part: store its name
            df_name = line.strip()
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
        else:
            tmp.write(line)         # just feed the temp file

# process the last part if no empty line was present...
process(tmp, df_name)
This is not really efficient, because each line is written to a temporary file and then read again, but it is simple and robust.
A possible improvement would be to parse the parts directly with the csv module (it can parse a stream, while pandas wants files). The downside is that the csv module only parses into strings, so you lose pandas' automatic conversion to numbers. My opinion is that it is only worth it if the file is large and the full operation has to be repeated.
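For what it's worth, a minimal in-memory sketch of that csv-module variant (values stay strings, so numeric columns would need pd.to_numeric afterwards):

import csv
import pandas as pd

# Split on blank lines with the csv module, then build each
# DataFrame directly - no temporary files involved.
dfs = {}
with open('LUV.csv') as f:
    name, rows = None, []
    for row in csv.reader(f):
        if not any(cell.strip() for cell in row):  # blank line ends a part
            if name is not None and rows:
                dfs[name] = pd.DataFrame(rows[1:], columns=rows[0])
            name, rows = None, []
        elif name is None:                         # first line of a part: its name
            name = row[0].strip()
        else:
            rows.append(row)
    if name is not None and rows:                  # last part, no trailing blank line
        dfs[name] = pd.DataFrame(rows[1:], columns=rows[0])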
Here is my solution:
My assumption is that each statement starts with an indicator ('Balance Sheet', 'Income Statement', 'Statement of Retained Earnings'), so we can split the table on those markers to get individual dataframes. That is the premise the following code is based on; let me know if it is a flawed assumption.
import pandas as pd
import numpy as np
# I copied your data above and created a csv from it
df = pd.read_csv('csvtable_stackoverflow', header=None)
0
0 Balance Sheet
1 Report Date 12/31/2001 12/31/...
2 Cash & cash equivalents 2279861 522995...
3 Short-term investments - - ...
4 Accounts & other receivables 71283 138070...
5 Inventories of parts... 70561 80564...
6 Income Statement
7 Report Date 12/31/2001 12/31/...
8 Passenger revenues 5378702 546796...
9 Freight revenues 91270 110742...
10 Charter & other - - ...
11 Special revenue adjustment - - ...
12 Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2...
14 Previous ret earn... 2902007 2385854...
15 Cumulative effect of.. - - ...
16 Three-for-two stock split 117885 - 78076 -
17 Issuance of common.. 52753 75952...
The code below simply uses numpy.select to flag which rows contain 'Balance Sheet', 'Income Statement' or 'Statement of Retained Earnings':
https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html
bal_sheet = df[0].str.strip() == 'Balance Sheet'
income_stmt = df[0].str.strip() == 'Income Statement'
cash_flow_sheet = df[0].str.strip() == 'Statement of Retained Earnings'
condlist = [bal_sheet, income_stmt, cash_flow_sheet]
choicelist = ['Balance Sheet', 'Income Statement', 'Statement of Retained Earnings']
The next code creates a column indicating the sheet type, converts '0' to null and then fills down:
df = (df.assign(sheet_type=np.select(condlist, choicelist))
        .assign(sheet_type=lambda x: x.sheet_type.replace('0', np.nan))
        .fillna(method='ffill'))
Last step is to pull out the individual dataframes
df_bal_sheet = df.copy().query('sheet_type=="Balance Sheet"')
df_income_sheet = df.copy().query('sheet_type=="Income Statement"')
df_cash_flow = df.copy().query('sheet_type=="Statement of Retained Earnings"')
df_bal_sheet :
0 sheet_type
0 Balance Sheet Balance Sheet
1 Report Date 12/31/2001 12/31/... Balance Sheet
2 Cash & cash equivalents 2279861 522995... Balance Sheet
3 Short-term investments - - ... Balance Sheet
4 Accounts & other receivables 71283 138070... Balance Sheet
5 Inventories of parts... 70561 80564... Balance Sheet
df_income_sheet :
0 sheet_type
6 Income Statement Income Statement
7 Report Date 12/31/2001 12/31/... Income Statement
8 Passenger revenues 5378702 546796... Income Statement
9 Freight revenues 91270 110742... Income Statement
10 Charter & other - - ... Income Statement
11 Special revenue adjustment - - ... Income Statement
df_cash_flow:
0 sheet_type
12 Statement of Retained Earnings Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2... Statement of Retained Earnings
14 Previous ret earn... 2902007 2385854... Statement of Retained Earnings
15 Cumulative effect of.. - - ... Statement of Retained Earnings
16 Three-for-two stock split 117885 - 78076 - Statement of Retained Earnings
17 Issuance of common.. 52753 75952... Statement of Retained Earnings
You can do further manipulation by fixing the column names and removing rows you do not need.
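
For example, a small follow-up sketch (a hypothetical helper, reusing the dataframes built above) that drops the indicator rows and renumbers the index:

def tidy(frame, indicator):
    # Remove the marker row (e.g. 'Balance Sheet') and reset the index
    keep = frame[0].str.strip() != indicator
    return frame[keep].reset_index(drop=True)

df_bal_sheet = tidy(df_bal_sheet, 'Balance Sheet')
df_income_sheet = tidy(df_income_sheet, 'Income Statement')
df_cash_flow = tidy(df_cash_flow, 'Statement of Retained Earnings')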
