I am trying to parse a PDF into a DataFrame using Camelot:
import camelot
import pandas as pd

file = 'foo.pdf'
tables = camelot.read_pdf(file, pages='2', flavor='stream')

v = []
for table in tables:
    v.append(table.df)
w = pd.concat(v)
print(w)
However, it's being read as below:
7 Customer No. Document Date Customer PO No. External Doc. No.\nPayment Terms
8 126207 28/02/22 STRICTLY 14 DAYS
9 PO No./Docket Unit Price \nAmount \nGST Amount Incl.
10 Description TASK DATE Quantity UOM
11 No. Excl. GST\nExcl. GST\nAmount GST
12 BOC GAS & GEAR
13 11 SNOW STREET
14 SOUTH LISMORE, NSW 2480
15 CLEAR: FL 1.5M3 BIN-CARDBOARD 02/02/22 1\nEA\n9.18\n9.18\n0.92 10.10
16 CLEAR: FL 1.5M3 BIN-CARDBOARD 16/02/22 1\nEA\n9.18\n9.18\n0.92 10.10
How do I avoid the newline characters (\n) when reading the PDF?
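One approach (a minimal sketch, not Camelot-specific: it simply post-processes the concatenated DataFrame) is to replace the embedded newlines with spaces after reading:

import camelot
import pandas as pd

file = 'foo.pdf'
tables = camelot.read_pdf(file, pages='2', flavor='stream')
v = [table.df for table in tables]
w = pd.concat(v)

# regex=True makes replace() match '\n' inside cell strings rather than
# requiring the whole cell to equal '\n'
w = w.replace('\n', ' ', regex=True)
print(w)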
I would like to parse the following idx file: https://www.sec.gov/Archives/edgar/daily-index/2022/QTR1/company.20220112.idx into a Pandas DataFrame.
I use the following code to check how it looks as a text file:

import os, requests

base_path = '/Users/GunardiLin/Desktop/Insider_Ranking/temp/'
current_dirs = os.listdir(path=base_path)
local_filename = f'20200102'
local_file_path = '/'.join([base_path, local_filename])

if local_filename in current_dirs:
    print(f'Skipping index file for {local_filename} because it is already saved.')
else:
    url = f'https://www.sec.gov/Archives/edgar/daily-index/2020/QTR1/company.20200102.idx'
    r = requests.get(url, stream=True, headers={'user-agent': 'MyName myname#outlook.com'})
    with open(local_file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=10240):
            f.write(chunk)
Next I would like to build a parser that is fault-tolerant, because it should parse a new idx file into a pd.DataFrame every day.
My idea was to use string manipulation, but that would be very complicated and not fault-tolerant.
I would be thankful if someone could show the best practice for parsing this and provide some boilerplate code.
Since this is mostly a fixed-width file, you could use pandas read_fwf to read it. You can skip over the leading information (via skiprows=) and get straight to the data. The column names are predefined and assigned when read:
import pandas as pd

idx_path = 'company.20220112.idx'
names = ['Company Name', 'Form Type', 'CIK', 'Date Filed', 'File Name']
df = pd.read_fwf(idx_path,
                 colspecs=[(0, 61), (62, 74), (74, 84), (86, 94), (98, 146)],
                 names=names, skiprows=11)
df.head(10)
Company Name Form Type CIK Date Filed File Name
0 005 - Series of IPOSharks Venture Master Fund,... D 1888451 20220112 edgar/data/1888451/0001888451-22-000002.txt
1 10X Capital Venture Acquisition Corp. III EFFECT 1848948 20220111 edgar/data/1848948/9999999995-22-000102.txt
2 110 White Partners LLC D 1903845 20220112 edgar/data/1903845/0001884293-22-000001.txt
3 15 Beach, MHC 3 1903509 20220112 edgar/data/1903509/0001567619-22-001073.txt
4 15 Beach, MHC SC 13D 1903509 20220112 edgar/data/1903509/0000943374-22-000014.txt
5 170 Valley LLC D 1903913 20220112 edgar/data/1903913/0001903913-22-000001.txt
6 1st FRANKLIN FINANCIAL CORP 424B3 38723 20220112 edgar/data/38723/0000038723-22-000003.txt
7 1st FRANKLIN FINANCIAL CORP 424B3 38723 20220112 edgar/data/38723/0000038723-22-000004.txt
8 215 BF Associates LLC D 1904145 20220112 edgar/data/1904145/0001904145-22-000001.txt
9 2401 Midpoint Drive REIT, LLC D 1903337 20220112 edgar/data/1903337/0001903337-22-000001.txt
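Since this needs to run daily without breaking, one way to harden the read (a sketch, assuming every daily idx file keeps the dashed separator line between the header block and the data rows) is to locate skiprows dynamically instead of hard-coding 11:

import pandas as pd

def read_idx(idx_path):
    # Find the dashed separator line that precedes the data rows, so a
    # header block of a different length doesn't silently corrupt the parse.
    with open(idx_path) as f:
        for i, line in enumerate(f):
            if line.startswith('---'):
                skip = i + 1
                break
        else:
            raise ValueError(f'No separator line found in {idx_path}')
    names = ['Company Name', 'Form Type', 'CIK', 'Date Filed', 'File Name']
    return pd.read_fwf(idx_path,
                       colspecs=[(0, 61), (62, 74), (74, 84), (86, 94), (98, 146)],
                       names=names, skiprows=skip)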
I am trying to import an Excel file with headers on the second row.
selected_columns = ['Student name', 'Age', 'Faculty']
data = (pd.read_excel(path_in + 'Results\\' + 'Survey_data.xlsx', header=1, usecols=selected_columns)
          .rename(columns={'Student name': 'First name'})
          .drop_duplicates())
Currently, the Excel file looks something like this:
Student name Surname Faculty Major Scholarship Age L1 TFM Date Failed subjects
Ana Ruiz Economics Finance N 20 N 0
Linda Peterson Mathematics Mathematics Y 22 N 2021-12-04 0
Gregory Olsen Engineering Industrial Engineering N 21 N 0
Ana Watson Business Marketing N 22 N 0
I have tried including the last column in the selected_columns list, but it returns the same error. I would greatly appreciate it if someone could let me know why Python is not reading all the lines.
Thanks in advance.
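One way to debug this (a sketch, assuming the file really is Survey_data.xlsx with headers on the second row) is to read the sheet without usecols first and print the exact column names, since usecols= raises a ValueError whenever a listed name doesn't exactly match a header cell (stray spaces and capitalisation count):

import pandas as pd

# Read everything with the second row as the header, then inspect the
# exact header strings before filtering columns.
df = pd.read_excel('Survey_data.xlsx', header=1)
print(df.columns.tolist())

data = (df[['Student name', 'Age', 'Faculty']]
          .rename(columns={'Student name': 'First name'})
          .drop_duplicates())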
This one has been relatively tricky for me. I am trying to extract an embedded table sourced from Google Sheets in Python.
Here is the link
I do not own the sheet but it is publicly available.
Here is my code thus far; when I go to output the headers it is showing me "". Any help would be greatly appreciated. The end goal is to convert this table into a pandas DataFrame. Thanks guys.
import requests
import lxml.html as lh
import pandas as pd

url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
Well if you would like to get the data into a DataFrame, you could load it directly:
df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727',
header=1)[0]
df.drop(columns='1', inplace=True) # remove unnecessary index column called "1"
This will give you:
Target Ticker Acquirer \
0 Acacia Communications Inc Com ACIA Cisco Systems Inc Com
1 Advanced Disposal Services Inc Com ADSW Waste Management Inc Com
2 Allergan Plc Com AGN Abbvie Inc Com
3 Ak Steel Holding Corp Com AKS Cleveland Cliffs Inc Com
4 Td Ameritrade Holding Corp Com AMTD Schwab (Charles) Corp Com
Ticker.1 Current Price Take Over Price Price Diff % Diff Date Announced \
0 CSCO $68.79 $70.00 $1.21 1.76% 7/9/2019
1 WM $32.93 $33.15 $0.22 0.67% 4/15/2019
2 ABBV $197.05 $200.22 $3.17 1.61% 6/25/2019
3 CLF $2.98 $3.02 $0.04 1.34% 12/3/2019
4 SCHW $49.31 $51.27 $1.96 3.97% 11/25/2019
Deal Type
0 Cash
1 Cash
2 C&S
3 Stock
4 Stock
Note that read_html returns a list. In this case there is only one DataFrame, so we refer to the first and only index location, [0].
I can download the annual data from this link with the following code, but it's not the same as what's shown on the website because it's the data from June:
Now I have two questions:
How do I specify the date so the annual data is the same as in the following picture (September instead of June, as shown in the red rectangle)?
When I click Quarterly, as shown in the orange rectangle, the link doesn't change. How do I grab the quarterly data?
Thanks.
Just curious, but why write the HTML to a file first and then read it with pandas? Pandas can take in the HTML request directly:
import pandas as pd
symbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/%s/financials?p=%s' %(symbol, symbol)
dfs = pd.read_html(url)
print(dfs[0])
Secondly, I'm not sure why yours is popping up with the yearly dates. Doing it the way I have it above shows September.
print(dfs[0])
0 ... 4
0 Revenue ... 9/26/2015
1 Total Revenue ... 233715000
2 Cost of Revenue ... 140089000
3 Gross Profit ... 93626000
4 Operating Expenses ... Operating Expenses
5 Research Development ... 8067000
6 Selling General and Administrative ... 14329000
7 Non Recurring ... -
8 Others ... -
9 Total Operating Expenses ... 162485000
10 Operating Income or Loss ... 71230000
11 Income from Continuing Operations ... Income from Continuing Operations
12 Total Other Income/Expenses Net ... 1285000
13 Earnings Before Interest and Taxes ... 71230000
14 Interest Expense ... -733000
15 Income Before Tax ... 72515000
16 Income Tax Expense ... 19121000
17 Minority Interest ... -
18 Net Income From Continuing Ops ... 53394000
19 Non-recurring Events ... Non-recurring Events
20 Discontinued Operations ... -
21 Extraordinary Items ... -
22 Effect Of Accounting Changes ... -
23 Other Items ... -
24 Net Income ... Net Income
25 Net Income ... 53394000
26 Preferred Stock And Other Adjustments ... -
27 Net Income Applicable To Common Shares ... 53394000
[28 rows x 5 columns]
For the second part, you could try to find the data in one of a few ways:
1) Check the XHR requests and get the data you want by adding parameters to the request URL that generates that data, which can come back in JSON format (when I looked, I could not find it right off the bat, so I moved on to the next option)
2) Search through the <script> tags, as the JSON data can sometimes be within those tags (I didn't search very thoroughly, and Selenium seemed a more direct way since pandas can read in the tables)
3) Use Selenium to simulate opening the browser, getting the table, clicking on "Quarterly", then getting that table
I went with option 3:
from selenium import webdriver
import pandas as pd
symbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/%s/financials?p=%s' %(symbol, symbol)
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get(url)
# Get Table shown in browser
dfs_annual = pd.read_html(driver.page_source)
print(dfs_annual[0])
# Click "Quarterly"
driver.find_element_by_xpath("//span[text()='Quarterly']").click()
# Get Table shown in browser
dfs_quarter = pd.read_html(driver.page_source)
print(dfs_quarter[0])
driver.close()
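Note that in Selenium 4 the find_element_by_* helpers were removed. An equivalent click (a sketch, assuming Selenium 4 with Selenium Manager resolving the ChromeDriver binary) would be:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(url)
# Same XPath as above, expressed through the By API
driver.find_element(By.XPATH, "//span[text()='Quarterly']").click()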
I am very new to Python, but I'm trying to learn more. My first mini project is to create a script that receives a user name as input and then outputs the number of hours they worked on a certain project. At the end of the week I receive a CSV for the employees in my department; within it are the various projects they are working on and the hours dedicated to each project. The catch with the CSV file is that there are duplicate projects for a given user, so when my script outputs I need it to show one project name with ALL the hours associated with that project. How can I get my code to detect the duplicates, count only the hours from the duplicates, and use just one project name?
Here is the code I've come up with so far:
import csv

with open('Report.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    firsts = []
    lasts = []
    projects = []
    hours = []
    for row in readCSV:
        first = row[0]
        last = row[1]
        project = row[2]
        hour = row[3]
        firsts.append(first)
        lasts.append(last)
        projects.append(project)
        hours.append(hour)

First_name = input("Please enter the first name: ")
First_1 = firsts.index(First_name)
Last_1 = lasts[First_1]
project_1 = projects[First_1]
hours_1 = hours[First_1]
print(First_1, Last_1, project_1, hours_1)
Here is a sample of the CSV:
First Last Project Hours
David Ayers AD-0002 Training 24
Oriana Morfitt AD-0002 Training 24
David Ayers AD-0003 Personal Time 8
David Ayers AD-0004 Sick Time 0
Oriana Morfitt AD-0005 Vacation 40
Sujatha Kavuri Beeline Blank 29
Sujatha Kavuri Beeline Blank 16
Sujatha Kavuri OPS-0001 General Operational Support 6
Jeff Moore OPS-0001 General Operational Support 5
Sri Mantri SRV-0001 Service Requests for Base and Direct Services 4
Prasanth Musunuru SRV-0001 Service Requests for Base and Direct Services 11
Prasanth Musunuru SRV-0001 Service Requests for Base and Direct Services 10
Jeff Moore SRV-0006 Standards and Processes 5
Jeff Moore SRV-0006 Standards and Processes 3
Jeff Moore SRV-2503 Internet Access Infrastructure Maintenance & Support 12.5
Jeff Moore SRV-2503 Internet Access Infrastructure Maintenance & Support 7
Jeff Moore 0024495915 Work Instructions (infrastructure) - time tracking 1
Sri Mantri 0026184229 Margin Controlling Java Rewrite 4
Sujatha Kavuri 0029157489 SCRM Life Cycle Extension 3
Jeff Moore 0031369443 Shopcall Attachment Changes 1
Jeff Moore 0031500942 MP Strategy 2015 - Spot PO via EDI (time tracking only) - 0031500942 1
I bet there's a better way of doing it with pandas, but this will work too:
import csv

# I've used the full name to avoid duplicate first names in the report
full_name = input('Enter your full name: ')

with open('Report.csv') as csvfile:
    hour_summation = {}
    read_csv = csv.reader(csvfile, delimiter=',')
    for row in read_csv:
        if ' '.join((row[0], row[1])) == full_name.strip():
            # float() rather than int(): the report contains fractional
            # hours such as 12.5
            hour_summation[row[2]] = hour_summation.get(row[2], 0) + float(row[3])

print('This is {} full hours report:'.format(full_name))
for k, v in hour_summation.items():
    print(k + ': ' + str(v) + ' hours')
Results for Sujatha Kavuri:
Enter your full name: Sujatha Kavuri
This is Sujatha Kavuri full hours report:
Beeline Blank: 45.0 hours
OPS-0001 General Operational Support: 6.0 hours
EDIT: I've only sampled half the file, so the results aren't complete.
Hope this helps.
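As for the "better way of doing it with pandas" mentioned above, here is a minimal sketch (assuming Report.csv carries the header row First, Last, Project, Hours shown in the sample):

import pandas as pd

# Read the report and sum hours per person and project, collapsing
# duplicate project rows into a single entry.
df = pd.read_csv('Report.csv')
df['Hours'] = pd.to_numeric(df['Hours'])
totals = df.groupby(['First', 'Last', 'Project'], as_index=False)['Hours'].sum()

# Filter down to one person, e.g. the name typed at the prompt.
full_name = input('Enter your full name: ')
first, last = full_name.strip().split(' ', 1)
print(totals[(totals['First'] == first) & (totals['Last'] == last)])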