Extract Numbers from a certain location in PDF files - python

I'm trying to write a script to extract numbers from the "Total Deviation" graph in PDF files that look like this. The reason I am trying to extract the information from the location of the graph, rather than parsing the whole file and filtering it, is that pdfminer exports the numbers in various and unpredictable patterns (I used this script). Sometimes it extracts whole rows together and sometimes it extracts columns, so I want a way to extract the numbers from different files in a consistent manner. Any suggestions would be much appreciated!

Try pdfreader. You can extract the text as "PDF markdown" and then parse it, with regular expressions for example:
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        viewer.next()
except PageDoesNotExist:
    pass
data = my_total_deviation_parser(pdf_markdown)
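As a sketch of what `my_total_deviation_parser` might do (everything here is an assumption about your files: the `TOTAL_DEV_REGION` box, and the idea that each number appears as a `<x> <y> Td (<string>) Tj` sequence in the markdown - real content streams also use `Tm`, `TJ` arrays, and escapes that this does not handle), you could pair each text-showing operator with its preceding positioning operator and keep only numeric strings whose coordinates fall inside the graph's region:

```python
import re

# Hypothetical bounding box of the "Total Deviation" graph, in PDF units;
# you would read these coordinates off your own files.
TOTAL_DEV_REGION = (50, 300, 250, 500)  # x_min, y_min, x_max, y_max

# Matches "<x> <y> Td (<string>) Tj" sequences in the page's text content.
POSITIONED_TEXT = re.compile(r"(-?[\d.]+)\s+(-?[\d.]+)\s+Td\s*\(([^)]*)\)\s*Tj")

def numbers_in_region(markdown, region):
    """Return integers whose Td position falls inside the given box."""
    x_min, y_min, x_max, y_max = region
    found = []
    for x, y, s in POSITIONED_TEXT.findall(markdown):
        if x_min <= float(x) <= x_max and y_min <= float(y) <= y_max:
            if re.fullmatch(r"-?\d+", s.strip()):
                found.append(int(s))
    return found
```

Because the filter is positional rather than order-based, it should not matter whether pdfminer-style tools would have walked the page by rows or by columns.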

Related

PDF File dedupe issue with same content, but generated at different time periods from a docx

I'm working on a PDF file dedupe project and analyzed many libraries in Python, which read files, generate a hash value, and then compare it with the next file for duplication - similar to the logic below, or using Python's filecmp lib. But the issue I found with this logic is that if a PDF is generated from a source DOCX (Save to PDF), the outputs are not considered duplicates - even when the content is exactly the same. Why does this happen? Is there any other logic to read the content and then create a unique hash value based on the actual content?
import hashlib

def calculate_hash_val(path, blocks=65536):
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
One of the things that happens is that metadata, including the time of creation, is saved to the file. It is invisible in the rendered PDF, but it will make the hash different.
Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
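Another way around the metadata problem (a sketch; the whitespace normalization and the PyPDF2-based extraction shown in the comment are my assumptions, not a tested recipe) is to hash the extracted text rather than the raw bytes, so that creation-time metadata and layout differences do not change the digest:

```python
import hashlib
import re

def content_hash(text):
    """Hash extracted text, ignoring whitespace/layout differences."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hypothetical usage with PyPDF2 (untested sketch):
# from PyPDF2 import PdfFileReader
# reader = PdfFileReader(open(path, "rb"))
# text = "".join(reader.getPage(i).extractText()
#                for i in range(reader.getNumPages()))
# digest = content_hash(text)
```

Two PDFs saved from the same DOCX at different times should then produce the same digest, as long as the extractor returns the same text for both.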

Merge multiple PDFs to specific pages of single PDF based on PDF titles with PyPDF2

I have a folder of PDFs that I'm currently merging using PyPDF2.
import os
import datetime as dt
from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
for file in os.listdir('****'):
    if file.endswith(".pdf"):
        merger.append('****' + file)
merger.write('****' + str(dt.date.today()) + '.pdf')
merger.close()
The files contain graphs and the titles are very specific. What I would like to be able to do is:
Based on string in title, merge multiple PDFs to the same page of the new PDF (preferably split into two columns) - I know this isn't correct syntax but something like:
if 'dogs' in file:
    merger.write(..., page=1, cols=2)
elif 'cats' in file:
    merger.write(..., page=2, cols=2)
Not sure if this is possible, have looked at other answers and read through the documentation but can't figure it out. Would also like to be able to have a fair amount of graphs (I guess up to 6?) on a single page.
If the PDF titles are located in the same place in each file consistently, then you could use the extractText() function to retrieve the title text from each PDF file, then do your comparison analysis.
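As a sketch of that idea (the keyword-to-page mapping is hypothetical, and the PyPDF2 call in the comment is my assumption about how you'd get the title text), you could bucket filenames by which keyword their title contains and then merge each bucket separately:

```python
def bucket_by_keyword(titles, keyword_to_page):
    """Group titles by the first keyword they contain, keyed by target page."""
    buckets = {}
    for title in titles:
        for keyword, page in keyword_to_page.items():
            if keyword in title.lower():
                buckets.setdefault(page, []).append(title)
                break
    return buckets

# Hypothetical usage with PyPDF2 (untested):
# title = PdfFileReader(open(path, "rb")).getPage(0).extractText()
# then merge each bucket with PdfFileMerger as in the snippet above.
```

Note that PyPDF2's merger only appends whole pages; placing several source pages side by side in columns on one output page would need page scaling and overlaying, which this sketch does not attempt.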

Extracting text from a PDF file in Python

I am trying to extract text from a PDF file I usually have to deal with at work, so that I can automate it.
When using PyPDF2, it works for my CV for instance, but not for my work document. The problem is that the extracted text looks like this: "Helloworldthisisthetext". I then tried " ".join(...), but this does not work.
I read that this is a known problem with PyPDF2 - it seems to depend on the way the pdf was built.
Does anyone know another approach how to extract text out of it which I then can use for further steps?
Thank you in advance
I can suggest trying another tool - pdfreader. You can extract both plain strings and "PDF markdown" (decoded text strings plus operators). The "PDF markdown" can then be parsed as regular text (with regular expressions, for example).
Below is a code sample for walking the pages and extracting PDF content for further parsing.
from pdfreader import SimplePDFViewer, PageDoesNotExist

def my_text_parser(text):
    """ Code your parser here """
    ...

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
try:
    while True:
        viewer.render()
        pdf_markdown = viewer.canvas.text_content
        result = my_text_parser(pdf_markdown)
        # The one below will probably be the same as what PyPDF2 returns
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
The pdf_markdown variable contains all the text, including PDF commands (positioning, display): all strings come in brackets followed by the Tj or TJ operator.
For more on PDF text operators see PDF 1.7 sec. 9.4 Text Objects
You can parse it with regular expressions for example.
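For example, a minimal regex pass over such markdown could look like this (a sketch only: real content streams also use escape sequences, hex strings, and encodings that these two patterns do not handle):

```python
import re

# (string) Tj          -> a single literal string
# [(a) -20 (b)] TJ     -> an array of strings interleaved with kerning numbers
TJ_SINGLE = re.compile(r"\(([^)]*)\)\s*Tj")
TJ_ARRAY = re.compile(r"\[([^\]]*)\]\s*TJ")

def strings_from_markdown(markdown):
    """Collect the literal strings shown by Tj and TJ operators, in order found."""
    out = list(TJ_SINGLE.findall(markdown))
    for array in TJ_ARRAY.findall(markdown):
        # join the pieces of a TJ array, dropping the kerning numbers
        out.append("".join(re.findall(r"\(([^)]*)\)", array)))
    return out
```

The TJ arrays are where the "Helloworldthisisthetext" effect often comes from: the word gaps are encoded as kerning numbers rather than space characters, so a plain string join loses them.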
I had a similar requirement at work for which I used PyMuPDF. They also have a collection of recipes which cover typical scenarios of text extraction.

is there a method to read pdf files line by line?

I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read, left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast, but it does not preserve gaps between elements; results are jumbled.
pdfminer - the PDFPageInterpreter() method with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula-py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re

# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)  # print that path
file_data = parser.from_file(file_path)  # Parse data from file
text = file_data['content']  # Get the file's text content
by_page = text.split('... Information')  # split the document into pages by a string
                                         # that always appears at the top of each page
for i in range(1, len(by_page)):  # loop page by page
    info = by_page[i]  # get one page's worth of data from the pdf
    reformatted = info.replace("\n", "&")  # replace newlines with "&" for readability
    print("Page: ", i)  # print page number
    print(reformatted, "\n\n")  # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
I did this for .docx with the code below, where txt is the document's text. Hope this helps. link
import re

pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
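For the reading-order problem itself, one pure-Python idea (a sketch; the tolerance value is a guess, and the word-box shape follows PyMuPDF's `page.get_text("words")` output, which yields `(x0, y0, x1, y1, word, ...)` tuples) is to sort word boxes top-to-bottom, group boxes whose tops are within a small tolerance into one line, then order each line left-to-right:

```python
def words_to_lines(words, y_tol=3.0):
    """Group (x0, y0, x1, y1, text, ...) word boxes into left-to-right lines."""
    lines = []  # list of (line_y, [word boxes])
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(lines[-1][0] - w[1]) <= y_tol:
            lines[-1][1].append(w)  # same baseline: same line
        else:
            lines.append((w[1], [w]))  # new line
    # within each line, order words by their x coordinate
    return [" ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
            for _, ws in lines]
```

With PyMuPDF you would feed it `page.get_text("words")` directly; it stays pure Python (no Java dependency), which matches the constraint in the question.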

Combine two lists of PDFs one to one using Python

I have created a series of PDF documents (maps) using data driven pages in ESRI ArcMap 10. There is a page 1 and page 2 for each map generated from separate *.mxd. So I have one list of PDF documents containing page 1 for each map and one list of PDF documents containing page 2 for each map. For example: Map1_001.pdf, map1_002.pdf, map1_003.pdf...map2_001.pdf, map2_002.pdf, map2_003.pdf...and so one.
I would like to append these maps, pages 1 and 2, together so that both page 1 and 2 are together in one PDF per map. For example: mapboth_001.pdf, mapboth_002.pdf, mapboth_003.pdf... (they don't have to go into a new pdf file (mapboth), it's fine to append them to map1)
For each map1_ *.pdf
Walk through the directory and append map2_ *.pdf where the numbers (where the * is) in the file name match
There must be a way to do it using python. Maybe with a combination of arcpy, os.walk or os.listdir, and pyPdf and a for loop?
for pdf in os.walk(datadirectory):
    ??
Any ideas? Thanks kindly for your help.
A PDF file is structured differently from a plain text file. Simply concatenating two PDF files wouldn't work, as the files' structure and contents could be overwritten or corrupted. You could certainly write your own merger, but that would take a fair amount of time and intimate knowledge of how a PDF is internally structured.
That said, I would recommend that you look into pyPdf. It supports the merging feature that you're looking for.
This should properly find and collate all the files to be merged; it still needs the actual .pdf-merging code.
Edit: I have added pdf-writing code based on the pyPdf example code. It is not tested, but should (as nearly as I can tell) work properly.
Edit2: realized I had the map-numbering crossways; rejigged it to merge the right sets of maps.
import collections
import glob
import re

# probably need to install this module -
# pip install pyPdf
from pyPdf import PdfFileWriter, PdfFileReader

def group_matched_files(filespec, reg, keyFn, dataFn):
    res = collections.defaultdict(list)
    reg = re.compile(reg)
    for fname in glob.glob(filespec):
        data = reg.match(fname)
        if data is not None:
            res[keyFn(data)].append(dataFn(data))
    return res

def merge_pdfs(fnames, newname):
    print("Merging {} to {}".format(",".join(fnames), newname))
    # create new output pdf
    newpdf = PdfFileWriter()
    # for each file to merge
    for fname in fnames:
        with open(fname, "rb") as inf:
            oldpdf = PdfFileReader(inf)
            # for each page in the file
            for pg in range(oldpdf.getNumPages()):
                # copy it to the output file
                newpdf.addPage(oldpdf.getPage(pg))
    # write finished output
    with open(newname, "wb") as outf:
        newpdf.write(outf)

def main():
    matches = group_matched_files(
        "map*.pdf",
        r"map(\d+)_(\d+)\.pdf$",
        lambda d: "{}".format(d.group(2)),
        lambda d: "map{}_".format(d.group(1))
    )
    # pass a list, not a generator: merge_pdfs iterates its argument twice
    for map, pages in matches.items():
        merge_pdfs([page + map + '.pdf' for page in sorted(pages)],
                   "merged{}.pdf".format(map))

if __name__ == "__main__":
    main()
I don't have any test pdfs to try and combine but I tested with a cat command on text files.
You can try this out (I'm assuming unix based system): merge.py
import os, re

files = os.listdir("/home/user/directory_with_maps/")
files = [x for x in files if re.search("map1_", x)]
while len(files) > 0:
    current = files[0]
    search = re.search(r"_(\d+)\.pdf", current)
    if search:
        name = search.group(1)
        cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=FULLMAP_%s.pdf %s map2_%s.pdf" % (name, current, name)
        os.system(cmd)
    files.remove(current)
Basically it goes through and grabs the map1_ list, then assumes the matching map2_ files exist and pairs them by number. (You could also use a counter, padding the numbers with 0's, to get a similar effect.)
Test the gs command first though, I just grabbed it from http://hints.macworld.com/article.php?story=2003083122212228.
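The counter idea mentioned above could look like this (a sketch; the three-digit padding matches the `map1_001.pdf` naming in the question, and the output names are hypothetical):

```python
def paired_names(count):
    """Yield (map1, map2, output) filename triples for a zero-padded counter."""
    for i in range(1, count + 1):
        num = "%03d" % i
        yield ("map1_%s.pdf" % num, "map2_%s.pdf" % num, "mapboth_%s.pdf" % num)
```

Each triple could then be substituted into the gs command above in place of the regex-derived names.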
There are examples of how to do this on the pdfrw project page at Google Code:
http://code.google.com/p/pdfrw/wiki/ExampleTools
