How to extract particular block of text from pdf file with Python?

I want to extract blocks of text (here, the questions and their options) from a PDF and store each block separately as an image.
The length of the questions varies a lot, so using PyPDF2 to split the PDF at regular intervals is out of the question.
Any suggestions? I have read a few posts that mention using OCR to find the question numbers and then splitting, but I didn't completely understand what they were trying to say. Is there an easier method?
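
If the PDF is searchable (not scanned), OCR may not be needed at all. The following is only a minimal sketch under several assumptions: each question starts with a number followed by "." or ")", a question does not continue onto the next page, and PyMuPDF (fitz) is available for rendering. The names QUESTION_START, question_regions and save_question_images are made up for illustration. The idea is to find the top edge of every block that starts a question and to crop the page between consecutive tops.

```python
import re

# Assumption: a question starts with a number followed by "." or ")", e.g. "12." or "3)".
QUESTION_START = re.compile(r"^\s*\d+\s*[.)]")


def question_regions(blocks, page_bottom):
    """Return one (top, bottom) y-range per question found on a page.

    Each block is a tuple (x0, y0, x1, y1, text, ...), which is the shape
    PyMuPDF's page.get_text("blocks") returns.
    """
    # Sort the tops so that each region ends where the next question begins.
    starts = sorted(b[1] for b in blocks if QUESTION_START.match(b[4]))
    # The last question on the page runs to the bottom of the page.
    ends = starts[1:] + [page_bottom]
    return list(zip(starts, ends))


def save_question_images(pdf_path, out_prefix="question"):
    """Render every detected region to a PNG file (requires PyMuPDF)."""
    import fitz  # imported here so question_regions() works without PyMuPDF

    count = 0
    with fitz.open(pdf_path) as doc:
        for page in doc:
            rect = page.rect
            regions = question_regions(page.get_text("blocks"), rect.y1)
            for top, bottom in regions:
                # Crop the full page width between two question starts.
                clip = fitz.Rect(rect.x0, top, rect.x1, bottom)
                count += 1
                page.get_pixmap(clip=clip, dpi=150).save(f"{out_prefix}_{count}.png")
    return count
```

Because the last region on a page extends to the page bottom, footers and page numbers end up in that image; trimming them would need an extra rule that depends on the document.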

Related

How to extract data from messy PDF file with no standard formatting?

I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula or PyPDF2 to extract tables, but the data in the PDF is not stored in tables, so I chose pdfplumber to extract the text. So far I can read the text line by line, but I cannot figure out a universal pattern for extracting the pricing list rows, which I want to store in a pandas DataFrame and write to an Excel file.
Should I construct a regular expression, or is there something else I can use to extract the pricing list from this PDF? I cannot think of a regular expression that would fit the messy nature of the data, so is there a better approach? Or is it simply not possible?
Code
Using the following code, I can extract all lines of text, but the problem is that one price entry is spread across two rows. Assuming the current row is where most details of the entry are listed, how can I decide whether the previous or the next row also holds information related to the current entry?
If I can figure that out, what is the right approach for the column values? There can be 6 to 13 of them per line, so how can I decide which column a value at a particular position in the current line belongs to?
import pdfplumber as scrapper
text = []
with scrapper.open('./report.pdf') as pdf:
    for page in pdf.pages:
        text.append(page.extract_text())
The PDF file I am working with:
https://drive.google.com/file/d/1GtjBf9FcKJCOJVNcGA9mvAshJ6t0oFca/view?usp=sharing
Sample Pictures demonstrating which data should fit in which fields:

Extracting text from two column pdf using python

I am trying to extract text from a two-column PDF. With PyPDF2 and pdfplumber, the page is read horizontally, i.e. after reading the first line of column one it goes on to read the first line of column two instead of the second line of column one. I have also tried this code githubusercontent as it is, but I have the same issue. I also saw How to extract text from two column pdf with Python?, but I don't want to convert to images as I have thousands of pages.
Any help will be appreciated. Thanks!

You can check this blog post, which uses PyMuPDF to extract text from two-column PDFs such as research papers.
https://towardsdatascience.com/read-a-multi-column-pdf-using-pymupdf-in-python-4b48972f82dc
From what I have tested so far, it works quite well. I would highly recommend the "blocks" option.
import fitz  # PyMuPDF

# Extract the text block by block ("blocks") instead of the default "text" mode; no OCR is involved
with fitz.open(DIGITIZED_FILE_PATH) as doc:
    for page in doc:
        text = page.get_text("blocks")
        print(text)
Note: It does not work for scanned images. It works only for searchable PDF files.
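If the blocks still come back interleaved between the two columns, they can be reordered by position. Each item returned by get_text("blocks") is a tuple (x0, y0, x1, y1, text, block_no, block_type). The sketch below is a simple heuristic, not part of PyMuPDF: the helper name blocks_in_reading_order is made up, and it assumes exactly two columns split at the middle of the page.

```python
def blocks_in_reading_order(blocks, page_width):
    """Order PyMuPDF-style text blocks column by column on a two-column page."""
    middle = page_width / 2  # assumption: the column gap sits at the page centre
    # Keep text blocks only (block_type 0); image blocks have block_type 1.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Sort by column first (left edge before or after the middle), then top to bottom.
    ordered = sorted(text_blocks, key=lambda b: (b[0] >= middle, b[1]))
    return [b[4] for b in ordered]
```

Inside the loop above it would be called as blocks_in_reading_order(page.get_text("blocks"), page.rect.width). Full-width elements such as a title start at the left edge, so they are treated as part of the left column.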

How to extract charts/tables/graphs from PDF files using Python?

I searched quite a bit but couldn't find a solution for this kind of problem, hence this question. Most answers cover image or text extraction, which is comparatively easier.
I need to extract tables as text (CSV) and graphs as images from PDFs.
Can anyone help me with efficient Python 3.6 code for this?
So far I have managed to extract JPEGs using startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly.
For example, I want to extract the table on page 11 and the graphs on page 12 of the PDF linked below, as images or in whatever form is feasible. How do I go about it?
https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf

For extracting tables you can use camelot.
Here is an article about it.
For images, I found this question and answer: Extract images from PDF without resampling, in python?

Try using PyMuPDF (https://github.com/pymupdf/PyMuPDF/tree/1.18.3), which can handle the combination of text, bars, lines and axes. It has many extra utilities.

Searching for data in a PDF

I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string, but searching for specific data is difficult because of (I assume) the formatting in the original PDF.
What I am looking to do is retrieve specific known fields and the data that immediately follows them (as formatted in the PDF), and then store these in separate variables.
The PDFs are bills and hence are all presented in exactly the same way, with defined fields and images. So what I am looking to do is extract these fields.
What would be the best way to achieve this?

Regarding "I've got a PDF file that I'm trying to obtain specific data from":
In general, it is probably impossible (or extremely difficult), and the details (that you don't mention) are very important. Study the complex PDF specification in detail. Notice that PDF descends from PostScript, which is Turing complete, and that a PDF can embed JavaScript, so in full generality your problem is undecidable (it is equivalent to the halting problem).
For example, a normal human reader could read digits in the document whether they are stored as text, as a JPEG image, etc. And in practice many PDF documents contain data of that kind. Practically speaking, PDF is an output-only format designed for screen display and printing, not for extracting data from it.
You need to understand exactly how that PDF file was generated (with what exact software, from what actual data). Without help, that could take a lot of time (maybe several years of full-time reverse-engineering work).
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you'd better access that database.
Perhaps the metadata or comments in your PDF file might help in guessing how it was generated.
The source of the data might produce various kinds of PDF file. For example, my cheap scanner is able to produce PDF. But your program would have a hard time extracting numerical data from it (because that kind of PDF is essentially a wrapper around a pixelated image à la JPEG) and would need image recognition techniques (i.e. OCR) to do so.
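That said, when the bills really do share one fixed layout and the text has already been extracted as a string (as the question describes), a simple label lookup is sometimes enough. This is only a sketch under those assumptions: the helper name extract_fields and the labels in the usage note are made up, and it assumes each value follows its label, optionally after a colon.

```python
import re


def extract_fields(text, labels):
    """Return {label: value}, where value is the text that follows the label."""
    found = {}
    for label in labels:
        # Match the label, an optional colon, then capture the rest of that line.
        pattern = re.compile(re.escape(label) + r"\s*:?\s*(.+)")
        match = pattern.search(text)
        found[label] = match.group(1).strip() if match else None
    return found
```

For instance, extract_fields(bill_text, ["Invoice No", "Total Due"]) would return a dictionary with one entry per label, with None for any label that does not appear. If the extraction step scrambles the reading order, no regular expression will fix that, and the points above still apply.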

Extracting PDF text by subjects [closed]

I'm trying to extract the text from a PDF by subject.
To do so, I'm trying to identify the labels / headlines in the PDF.
So far I have converted the PDF into an XML file to get at the text data more easily, and I use the font and size of each line to decide whether the line is a label or not.
The main problem with this approach is that each PDF can have its own structure, so what works for one PDF will not necessarily work for another.
I would be glad if someone has an idea for overcoming this problem, so that the labels (text by subject) can be extracted without depending on the particular PDF (most of the PDFs I work with are articles / books).
Different ways to extract text by subject are also welcome.
(As the tag indicates, I'm trying to do this in Python.)
Edit:
At the moment I'm doing two things:
check the font of each line
check the text size of each line
I concluded that regular text is set in the font with the most lines (more than 10 times as many lines as all other fonts combined), and that the median text size is the size of the regular text.
From the first I can remove all regular text, and from the second I can take all text that is bigger; all the labels will be in this list.
The problem now is to extract only the labels from this list, since there is usually text that is bigger than the regular text yet isn't a label.
I tried to use the number of times each font appears in the text to identify the label fonts, but without much success; the count varies from PDF to PDF.
I'm looking for ideas on how to solve this problem, or for a tool that can do it more easily.

I would suggest studying many PDFs and writing down the label text sizes in each one. Then you can average the 5 largest sizes and average the 5 smallest sizes. These two averages give a range, and you can check whether a piece of text falls within that size range.
This method will not always work, but it will cover the majority of PDFs.
(The more PDFs you study, the better.)
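For reference, the per-document heuristic from the question's edit (most common font is body text, median size is body size) can be sketched as below. It assumes the lines have already been pulled out of the XML as (text, font_name, font_size) tuples; the function name heading_candidates is made up. It only produces candidates, so the size range above could then be applied to the result.

```python
from collections import Counter
from statistics import median


def heading_candidates(lines):
    """lines: list of (text, font_name, font_size). Return texts that may be labels."""
    # The font used by the most lines is taken to be the regular body font.
    body_font = Counter(font for _, font, _ in lines).most_common(1)[0][0]
    # The median size is taken to be the regular body size.
    body_size = median(size for _, _, size in lines)
    # Keep lines that are larger than body text and not set in the body font.
    return [
        text
        for text, font, size in lines
        if size > body_size and font != body_font
    ]
```

Larger text in the body font (a pull quote, say) is filtered out here, but text that merely uses a different, larger font still slips through, which is the remaining problem the question describes.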
