I am trying to extract tables from this pdf link using camelot. However, when I try the following code:
import camelot
file = 'relacao_medicamentos_rename_2020.pdf'
tables = camelot.read_pdf(file)
tables.export('relacao_medicamentos_rename_2020.csv', f='csv', compress=False)
Simply nothing happens. This is very strange, because when I try the same code with this pdf link it works very well.
As Stefano suggested, you need to specify the relevant pages and set the option flavor='stream'. The default flavor='lattice' only works if there are lines between the cells.
Additionally, increasing row_tol helps to group rows together, so that, for example, the header of the first table is not read as three separate rows but as one row. Specifically, 'Concentração/Composição' is then identified as coherent text.
Also, you may want to use strip_text='\n' to remove newline characters.
This results in (reading page 17 and 18 as an example):
import camelot
file = 'relacao_medicamentos_rename_2020.pdf'
tables = camelot.read_pdf(file, pages='17, 18', flavor='stream', row_tol=20, strip_text='\n')
tables.export('foo.csv', f='csv', compress=False)
Still, this way you end up with one table per page and one csv file per table, i.e. in the example above you get two .csv files. This needs to be handled outside camelot.
To merge tables spanning multiple pages using pandas:
import pandas as pd

dfs = []  # list to store dataframes
for table in tables:
    df = table.df
    df.columns = df.iloc[0]  # use the first row as header
    df = df[1:]              # remove the header row from the data
    dfs.append(df)
df = pd.concat(dfs, axis=0)  # concatenate all dataframes in the list
df.to_csv('foo.csv')         # export the merged dataframe to csv
Also, there are difficulties identifying table areas on pages containing both text and tables (e.g. pdf page 16).
In these cases the table area can be specified. For the table on page 16, this would be:
tables = camelot.read_pdf(file, pages='16', flavor='stream', row_tol=20, strip_text='\n', table_areas=['35,420,380,65'])
Note: Throughout the post I referenced pages by 'counting' the pages of the file, not by the page numbers printed on each page (the latter start at the second page of the document).
Further to Stefano's comment, you need to specify both "stream" and a page number. To get the number of pages, I used PyPDF2, which should already be installed as a dependency of camelot.
In addition, I also suppressed the "no tables found" warning (which is purely optional).
import camelot
import PyPDF2
import warnings

file = 'relacao_medicamentos_rename_2020.pdf'
reader = PyPDF2.PdfFileReader(file)
num_pages = reader.getNumPages()
warnings.simplefilter('ignore')  # suppress the "no tables found" warning

for page in range(1, num_pages + 1):
    tables = camelot.read_pdf(file, flavor='stream', pages=f'{page}')
    if tables.n > 0:
        tables.export(f'relacao_medicamentos_rename_2020.{page}.csv', f='csv', compress=False)
It is hard to tell why this works on one PDF file but not on the other; there are so many different ways PDFs are created and written by the authoring software that one has to resort to trial and error.
Also, not all tables found are actual tables, and the results are a bit messy and need some cleansing.
Consider doing this straight away with something like:
tables = camelot.read_pdf(file, flavor='stream', pages=f'{page}')
for table in tables:
    df = table.df
    # clean data
    ...
    df.to_csv(...)
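As a hypothetical example of such cleansing (the exact steps depend on the file; none of this is prescribed above), stripping stray whitespace and dropping empty rows and columns often goes a long way:
import pandas as pd

tables = camelot.read_pdf(file, flavor='stream', pages=f'{page}')
for i, table in enumerate(tables):
    df = table.df
    # camelot returns all cells as strings, so strip stray whitespace
    df = df.apply(lambda col: col.str.strip())
    # drop rows and columns that came out completely empty
    df = df.replace('', pd.NA).dropna(how='all').dropna(axis=1, how='all')
    df.to_csv(f'{file}.{page}.{i}.csv', index=False)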
In many ways there is no such thing as a human-conventional "table" in a PDF. When converting from, say, a spreadsheet or word processor, the PDF writer needs to add PDF tagging for audio readers (accessibility) and may need to store the page's block of data as one single object. Many PDF writers do not bother, since the printed output is not affected :-). Thus every PDF with tabular data will be built totally differently from any other application's tabular method.
Thus there is often no connection with a partial "table" continued on the next page (so "headings" will not be associated with "columns" on a secondary page). A PDF has NO everyday constructs for "rows" or "columns"; in most cases those have been neutered by writing to a blank canvas.
So when it comes down to extraction, ignore the foreground impression of a table and break down the body text.
Take this section from page 189:
This is the order of that text placement (remember, there is no such thing as a line feed, only strings of fresh text), and the gap I have shown is just to highlight, in human terms, the challenge of what an extractor should do with this seemingly random order. How could it know to insert those 5 lines after "comprimido"?
Exclusões
Medicamento
Situação clínica
Relatório de
Recomendação da
Conitec
Portaria SCTIE
dasabuvir* 250 mg
comprimido
Hepatite C
Relatório nº 429
– Ombitasvir 12,5
mg/ veruprevir 75
mg/ ritonavir 50
mg comprimido e
dasabuvir 250 mg
comprimido para
o tratamento da
hepatite C
Nº 44, de 10/9/2019
ombitasvir +
veruprevir +
ritonavir* 12,5 mg
+ 75 mg + 50 mg
comprimido
Thus the only way to deal with a challenge like this is to treat nearby text as block-like and position it exactly the way it is laid out in the PDF, as x columns and y rows, but allow for overlaps in both directions, much like "spanning" or "cell merging". For that to work you need a free-flowing spreadsheet that is intelligent enough to self-adjust.
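A minimal sketch of that proximity idea, assuming pdfminer.six as the extraction library (the post above does not prescribe any particular tool): collect every text fragment together with its x/y position, then sort top-to-bottom and left-to-right so fragments that sit close together land in row/column order.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

fragments = []
# page_numbers is zero-based, so 188 corresponds to "page 189" above
for page_layout in extract_pages('relacao_medicamentos_rename_2020.pdf', page_numbers=[188]):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            # PDF y-coordinates grow upwards, so negate y0 to sort top-down
            fragments.append((-round(element.y0), round(element.x0), element.get_text().strip()))

for y, x, text in sorted(fragments):
    print(x, text)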
I am new to the Python world and trying to build a solution I struggle to develop. The goal is to check that some mandatory information (keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that all the transactions (and the mandatory information related to them) are in a corresponding PDF sent during the day.
So, on one side, I have several Excel rows in a sheet with the mandatory information (corresponding to info on each transaction), and on the other side, I have a folder with several PDFs.
I am trying to extract the data of each PDF to allow the workflow to check whether the information for each row in my Excel file is in a single PDF. I checked some questions raised here and tried to apply some solutions to my problem, but I haven't managed to obtain a fully working solution.
I have been able to build partial code that extracts the PDF data and looks for the keywords:
import os
from glob import glob
import re
from PyPDF2 import PdfFileReader

def search_page(pattern, page):
    yield from pattern.findall(page.extractText())

def search_document(pattern, path):
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)

searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))
for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
    # inspired by a solution on stackoverflow used to count the occurrences of keywords
Also, I think that using pandas to build the list of keywords should work, but I can't use it in my previous code; the search tool wants a string, not a list.
import pandas as pd
df=pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df)  # I wanted to check that the code was selecting the right columns only, as some other columns have unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. Also, I don't know how to require ALL the keywords of the list (one Excel row) to be found, as it is mandatory to have all the information of a transaction in the same PDF. When it finds all the info, it should return "ok row 1" or something like that and proceed to check the second row, etc. (and report an error if it doesn't find all the information).
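Roughly, this is the behaviour I imagine (an untested sketch; the folder path and Excel file name are placeholders):
import pandas as pd
from glob import glob
from PyPDF2 import PdfFileReader

def document_text(path):
    # concatenate the text of all pages of one pdf
    reader = PdfFileReader(path)
    return ' '.join(page.extractText() for page in reader.pages)

# extract each pdf in the folder once, up front
texts = {path: document_text(path) for path in glob('path of my folder/*.pdf')}

df = pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
for i, row in df.iterrows():
    keywords = [str(value) for value in row.dropna()]
    # a transaction passes only if ONE single pdf contains ALL of its keywords
    if any(all(kw in text for kw in keywords) for text in texts.values()):
        print(f'ok row {i}')
    else:
        print(f'error row {i}')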
P.S.: Originally, I wanted to extract the data only with a Python code and add it to an Alteryx workflow, but the Python tool of Alteryx doesn't accept some packages in my company.
I would be very thankful for any help!
I have a scenario where I have PDFs with a letterhead and a table-like body of text. I have tried using pdfminer, but I'm struggling to figure out how to approach my problem.
An example of the format for one of my PDFs:
Specifically, pdfminer reads the data starting from the letterhead up until the table header. It then reads the table header in a row-like fashion from left to right. From there it's just beyond messy.
Here is the Python code to convert the PDF to text:
from pdfminer.high_level import extract_text

text = extract_text('./quote2.pdf')
print(text)

# write the extracted text to a file for inspection
with open('results2.txt', 'w') as f:
    f.write(text)
And here is a snippet of what the output looks like:
... letter head info
ITEM�#
DESCRIPTION
561347
55�PCs-792.00�LB
6061-T651�PLATE�AMS�4027
4�S/C�6"�SQUARE
CUTTING�PLATE�SAW�ALUM
PACKAGING�SKIDDING
SHIP�VIA�:�OUR�TRUCK
Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
The idea was to use regex to extract the relevant numbers. As you can see, it read the first 2 records for the columns ITEM and DESCRIPTION, but from there it starts back up at the letterhead, and it's even messier below.
Is there perhaps a way to separate the letterhead from the rest of the body as a starting step? Very new to Python, not sure how to get what I want; help much appreciated!
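One direction I was considering (an untested sketch; the y-threshold is a made-up value that would need tuning for the real file): use pdfminer's layout objects instead of the plain extract_text, and drop every text box that starts above where the letterhead ends.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

LETTERHEAD_Y = 700  # hypothetical cut-off; PDF y-coordinates grow upwards

body_text = []
for page_layout in extract_pages('./quote2.pdf'):
    for element in page_layout:
        # keep only boxes whose top edge lies below the letterhead region
        if isinstance(element, LTTextContainer) and element.y1 < LETTERHEAD_Y:
            body_text.append(element.get_text())

print(''.join(body_text))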
I have a pdf file over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the pdf file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read, left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast but does not give gaps in the elements. Results are jumbled.
Pdfminer - PDFPageInterpreter() method with LAParams. This works well but is slow. At least 2 seconds per page and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(),filetypes=[("PDF files", "*.pdf")])
print(file_path) # print that path
file_data = parser.from_file(file_path) # Parse data from file
text = file_data['content'] # Get files text content
by_page = text.split('... Information')  # split the document into pages by a string that always appears at the top of each page

for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page worth of data from the pdf
    reformated = info.replace("\n", "&")  # replace the newlines with "&" to make it more readable
    print("Page: ", i)  # print the page number
    print(reformated, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like: I want the PDF to be read left to right, top to bottom. Also, if I could get a pure Python solution, that would be a bonus; I don't want my end users to be forced to install Java (I think the tika and tabula-py methods depend on Java).
I did this for .docx with the code below, where txt is the .docx text. Hope this helps: link
import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
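As a tiny usage illustration (my own example, not from the answer above): the substitution inserts a blank line after each sentence-ending punctuation mark, optionally followed by a closing quote.
import re

pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
txt = 'First sentence. Second one! "Is this third?" Yes.'
print(re.sub(pttrn, r'\1\2\n\n', txt))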
I have an LDAC FITS catalog to which, in a Python code, I need to add the elements of two arrays as two new columns.
I open the original catalog in Python:
from astropy.io import fits
from astropy.table import Table
import astromatic_wrapper as aw

cat1 = 'catalog.cat'
hdulist1 = fits.open(cat1)
data1 = hdulist1[1].data
The two arrays are ready and called ra and dec. I give them the key name, format and other needed info and convert them to columns. Finally, I join the two new columns to the original table (checking newtab.columns and newtab.data shows that the new columns are attached successfully).
racol = fits.Column(name='ALPHA_J2000', format='1D', unit='deg', disp='F11.7', array=ra)
deccol = fits.Column(name='DELTA_J2000', format='1D', unit='deg', disp='F11.7', array=dec)
cols = fits.ColDefs([racol, deccol])
tbhdu = fits.BinTableHDU.from_columns(cols)
orig_cols = data1.columns
newtab = fits.BinTableHDU.from_columns(cols + orig_cols)
When I save the new table into a new catalog:
newtab.writeto('newcatalog.cat')
it is not in the format that I need. If I look into the description of each catalog with
ldacdes -i
I see for catalog.cat :
> Reading catalog(s)
------------------Catalog information----------------
Filename:..............catalog.cat
Number of segments:....3
****** Table #1
Extension type:.........(Primary HDU)
Extension name:.........
****** Table #2
Extension type:.........BINTABLE
Extension name:.........OBJECTS
Number of dimensions:...2
Number of elements:.....24960
Number of data fields...23
Body size:..............4442880 bytes
****** Table #3
Extension type:.........BINTABLE
Extension name:.........FIELDS
Number of dimensions:...2
Number of elements:.....1
Number of data fields...4
Body size:..............28 bytes
> All done
and for the new one:
> Reading catalog(s)
------------------Catalog information----------------
Filename:..............newcatalog.cat
Number of segments:....2
****** Table #1
Extension type:.........(Primary HDU)
Extension name:.........
****** Table #2
Extension type:.........BINTABLE
Extension name:.........
Number of dimensions:...2
Number of elements:.....24960
Number of data fields...25
Body size:..............4842240 bytes
> All done
As seen above, in the original catalog catalog.cat there are three tables, and I tried to add two columns to the OBJECTS table.
I need newcatalog.cat to keep the same structure, which is required by other programs, but it does not have the OBJECTS table, and judging by the "Number of elements" and the "Number of data fields", newtab was saved as Table #2.
Is there any solution for controlling the output fits catalog format?
Thank you for your help, and I hope that I structured my very first question on Stack Overflow properly.
I don't know specifically about the LDAC format, but from your example file catalog.cat, it appears to be a multi-extension FITS file. That is, each table is stored in a separate HDU (as is typical for any file containing multiple tables with different sets of columns).
When you do something like
newtab = fits.BinTableHDU.from_columns(cols + orig_cols)
newtab.writeto('newcatalog.cat')
You're just creating a single new binary table HDU and writing that HDU to a file by itself (along with the mandatory primary HDU). What you really want is to take the same HDU structure as the original file and replace the existing table HDU with the one to which you added new columns.
Creating multi-extension FITS files is discussed some here, but you don't even need to recreate the full HDU structure from scratch. The HDUList object returned from fits.open is just a list of HDUs that can be manipulated like a normal Python list (with some extensions, for example to support indexing by EXTNAME) and written out to a file:
hdulist = fits.open(cat1)
hdulist['OBJECTS'] = newtab
hdulist.writeto('newcatalog.cat')
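One caveat worth checking (my assumption; the answer above does not mention it): fits.BinTableHDU.from_columns does not set an EXTNAME, which would explain the blank extension name in the ldacdes listing of newcatalog.cat. Setting the name on the new HDU before the assignment should restore it:
newtab.name = 'OBJECTS'  # restore the extension name the LDAC tools expect
hdulist = fits.open(cat1)
hdulist['OBJECTS'] = newtab
hdulist.writeto('newcatalog.cat')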
I have a PDF with a big table split across pages, so I need to join the per-page tables into one big table on a large page.
Is this possible with PyPDF2 or another library?
Cheers
Just working on something similar: it takes an input PDF and, via a config file, you can set the final pattern of single pages.
It is implemented with PyPDF2, but it still has issues with some PDF files (I have to dig deeper).
https://github.com/Lageos/pdf-stitcher
In principle, adding a page to the right of another one works like this:
import PyPDF2

with open('input.pdf', 'rb') as input_file:
    # load input pdf
    input_pdf = PyPDF2.PdfFileReader(input_file)
    # use the first page as the base PyPDF2 PageObject
    output_pdf = input_pdf.getPage(page_number)
    # get the second page's PyPDF2 PageObject
    second_pdf = input_pdf.getPage(second_page_number)
    # offset taken from the loaded page's width (adding the second page to the right)
    offset_x = output_pdf.mediaBox[2]
    offset_y = 0
    # add the second page to the first one
    output_pdf.mergeTranslatedPage(second_pdf, offset_x, offset_y, expand=True)
    # write the finished pdf
    with open('output.pdf', 'wb') as out_file:
        write_pdf = PyPDF2.PdfFileWriter()
        write_pdf.addPage(output_pdf)
        write_pdf.write(out_file)
Adding a page below needs an offset_y; you can get the amount from offset_y = first_pdf.mediaBox[3].
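A sketch of that vertical variant, under the same placeholder page numbers as above (my reading of the answer, not tested): use the lower page as the base and merge the upper page translated up by the lower page's height, so all offsets stay positive.
import PyPDF2

with open('input.pdf', 'rb') as input_file:
    input_pdf = PyPDF2.PdfFileReader(input_file)
    # the lower page becomes the base PageObject
    lower_page = input_pdf.getPage(second_page_number)
    upper_page = input_pdf.getPage(page_number)
    # mediaBox[3] is the top y-coordinate, i.e. the page height
    offset_y = lower_page.mediaBox[3]
    lower_page.mergeTranslatedPage(upper_page, 0, offset_y, expand=True)
    with open('output.pdf', 'wb') as out_file:
        writer = PyPDF2.PdfFileWriter()
        writer.addPage(lower_page)
        writer.write(out_file)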
My understanding is that this is quite hard. See here and here.
The problem seems to be that tables aren't very well represented in pdfs but are simply made from absolutely positioned lines (see first link above).
Here are two possible workarounds (not sure if they will do it for you):
you could print multiple pages on one page and scale the page to make it readable...
open the pdf with Inkscape or something similar. Once ungrouped, you should have access to the individual elements that make up the tables and be able to combine them the way that suits you
EDIT
Have a look at LibreOffice Draw, another vector package. I just opened a pdf in it and it seems to preserve some of the pdf structure and allows editing individual elements.
EDIT 2
Have a look at pdftables, which might help.
PDFTables helps with extracting tables from PDF files.
I haven't tried it though... might have some time a bit later to see if I can get it to work.