How do you avoid text from cropped parts in PyPDF?

I'm quite new to Python and I'm doing an ML project to extract disclosures from PDFs (published annual reports). PyPDF extracts the disclosures I need for my project, but it also includes the text from footers, which I want to remove.
I browsed through Stack Overflow and found a solution to crop out the footer through PyPDF and save the file as a new PDF. But when I run the cropped PDF through my original code, the text from the footers is still included in the extracted text. Is there any way I can overcome this?

I'm not sure why, after extracting the desired text, you want to save it as a new PDF and then load it again. Note that cropping a page with PyPDF only changes the page's crop box (the visible area); the footer text is still present in the page's content stream, which is why text extraction keeps returning it. Anyway, follow the suggestion below.
After cropping the footer from the original PDF, instead of saving the result as a new PDF, save it as a Word document. The idea is that when you load a Word document in Python using the "docx2python" library, it separates out the header, footer, and body into distinct properties.
My guess is:
1.) The newly saved Word document shouldn't have any header/footer, just the text.
2.) And in case the loaded Word document still has the footer, you can get rid of it using the same library, as in the sketch below.
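Here is a minimal sketch of that separation, assuming the cropped PDF has been converted to a hypothetical report.docx and that docx2python returns its usual nested lists of paragraph strings:

from docx2python import docx2python

# "report.docx" is a hypothetical name for the Word file
# converted from the cropped PDF.
doc = docx2python("report.docx")

def flatten(nested):
    # Collapse docx2python's nested lists into a flat list of paragraph strings.
    if isinstance(nested, str):
        return [nested]
    return [text for item in nested for text in flatten(item)]

# Body, header and footer live in separate properties, so the body text
# can be read on its own, without whatever ended up in the footer.
print("\n".join(flatten(doc.body)))
print("Footer text kept out of the body:", flatten(doc.footer))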

Related

How to extract text from a photocopy saved as a pdf in Python

I found the following code that allows one to extract text from a PDF file. However, this only works for PDFs where you can copy the text directly by highlighting it. I'm curious if there's some way in Python to extract text from a document where you can't select the text, like a photocopy or scanned document saved as a PDF.
Here is the code I use to take in text from a non-photocopy PDF file:
import PyPDF2

# Open the PDF in binary mode (legacy PyPDF2 < 3.0 API).
pdfFileObject = open('C:\\filepath\\file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages

# Concatenate the extracted text of every page.
output = ''
for i in range(count):
    page = pdfReader.getPage(i)
    output += page.extractText()
output
Works like a charm on normal PDFs. However, extracting text from a photocopied document saved as a PDF doesn't work when I use the code provided above, and I am curious about a way to do it.
PyPDF2 is a PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata. However, it can only retrieve text that is actually stored as text in the PDF.
To retrieve text from a scanned document, you need to do OCR. Optical Character Recognition (OCR) is the process that converts an image of text into a machine-readable text format: when you scan a form or a receipt, your computer saves the scan as an image file, so there is no text layer for PyPDF2 to read. There are various OCR tools, and pytesseract is one of the most widely used.
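As an illustration, here is a minimal OCR sketch with hypothetical file names, using pdf2image to rasterize the pages (it requires poppler, and pytesseract requires the Tesseract binary):

import pytesseract
from pdf2image import convert_from_path

# "scan.pdf" is a placeholder for the photocopied/scanned PDF.
# Each page is rasterized to an image, then Tesseract reads the image.
pages = convert_from_path("scan.pdf", dpi=300)

text = ""
for page_image in pages:
    text += pytesseract.image_to_string(page_image)

print(text)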

While creating the text layer for a scanned PDF, how do you edit the text without messing up the page's looks?

While creating the text layer (getting text using OCR) for a scanned PDF, how can I edit the text (because OCR sometimes gives wrong text) without messing up the way the page looks?
ocrmypdf does the best job of creating a text layer (making a scanned PDF searchable) and produces PDF/A-standard documents without disturbing the page's appearance. It uses Tesseract OCR to detect the text, but sometimes Tesseract detects the wrong text. So I want to enable the user to change that text and then complete the creation of the PDF.
For the example PDF, the OCR is not working properly, so I want to update the OCR-detected text before it is rendered into the PDF.
Either a change in the source code of ocrmypdf or updating the text using PDFBox would work for me.
Example:
OCRMYPDF input file
OCRMYPDF output file
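For context, this is roughly how ocrmypdf is driven from Python; a minimal sketch with placeholder file names. As far as I know, it exposes no built-in hook for hand-correcting the recognized text before it is rendered into the PDF, which is exactly the gap this question describes:

import ocrmypdf

# Adds a searchable text layer and emits a PDF/A document.
ocrmypdf.ocr(
    "input.pdf",
    "output.pdf",
    language="eng",  # Tesseract language pack to use
    deskew=True,     # straighten pages before OCR
)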

Python - Split pdf or powerpoint by pixel location?

I will explain my dilemma first: I have several thousand PowerPoint files (.ppt) from which I need to extract the text. The problem is the text is disorganized in the file, and when read as a complete page it makes no sense for what I need (in the example it would read: line 1, line 3, line 2, line 4, line 5).
I was using tika to read the files initially. I then thought that if I converted to PDF using glob and win32com.client I would have better luck, but it's basically the same result. The picture here is an example of what the text is like.
So now my idea is that if I can section the PDF or PPT by pixel location (and save to separate temp files if needed, to be opened and read that way), I can keep things in order and get what I need. Although the text moves around within each box, the black outlined boxes are always roughly in the same location.
I cannot find anything to split an individual PDF page, though, only to merge multiple pages into a single page. Does anyone have an idea how to go about doing this?
I need to read the text in box one together (line 1 and line 2) and load it into a dictionary or some other container, and the same for the second box. For reference, there is only one slide in the PowerPoint.
Allow me to provide the answer as a general guideline:
A .pptx file is a glorified .zip archive (a legacy .ppt file is a binary format, so convert those into .pptx first).
Use 7-Zip or WinZip to open a .pptx and understand the structure.
Each slide has an .xml file full of tags you can parse.
For example, you will find a tag for each text box, with tags for that box's text nested inside.
Also: python-pptx, which parses those tags for you; see the sketch below.
Mass convert by tweaking this VBA code: Link for VBA
Or using PowerShell: Link for PowerShell
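Here is a minimal python-pptx sketch of the position-based grouping the question asks for, assuming a hypothetical deck.pptx; shape offsets are reported in EMUs (914,400 per inch) rather than pixels, but they serve the same purpose:

from pptx import Presentation

prs = Presentation("deck.pptx")  # legacy .ppt must be converted first

boxes = []
for slide in prs.slides:  # the files in question have a single slide
    for shape in slide.shapes:
        if shape.has_text_frame:
            # shape.top / shape.left give each text box's location on the slide.
            boxes.append((shape.top or 0, shape.left or 0, shape.text_frame.text))

# Sort top-to-bottom, then left-to-right, so the lines of one box stay
# together instead of interleaving with the next box.
boxes.sort()
by_box = {f"box{i}": text for i, (_, _, text) in enumerate(boxes, start=1)}
print(by_box)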

How to extract images from PDF or Word, together with the text around images?

I found there are some libraries for extracting images from PDF or Word files, like docx2txt and pdfimages. But how can I get the content around the images (for instance, a title below the image)? Or get the page number of each image?
Some other tools like PyPDF2 and minecart can extract images page by page. However, I cannot run that code successfully.
Is there a good way to get some information about the images (from the images produced by docx2txt or pdfimages, or with another way to extract images along with their info)?
I looked at the code of docx2txt and it simply parses the XML inside the docx file, so it's actually a very easy task. A minimal sketch follows.
Ref: docx2txt
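With placeholder paths, one call returns all the text and writes the embedded images into the given directory:

import docx2txt

text = docx2txt.process("report.docx", "images/")
print(text)

Note that docx2txt's text output does not mark where each image sat, which is where the docx2python approach below helps.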
docx2python pulls the images into a folder and leaves -----image1.png---- markers in the extracted text. This might get you close to where you'd like to go.
A few months ago, I reworked docx2python to produce a structured (leveled) XML-format file from a docx file, which works out pretty well on many files.
As far as I know, a paragraph contains several runs, and each run contains only one text element, sometimes with images. You can read this document for details: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1
docx2python supports extracting images together with the text around them. When you read paragraphs with docx2python, a ----media/imageN---- placeholder shows up in the text wherever an image sits. You can then reach the image itself if you set extract_image=True: you get the image's name in the paragraph text and a list of image files, and you can match them up as you like. A sketch follows.
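A sketch of that matching, with a placeholder filename; the exact marker format and the .images property may vary between docx2python versions:

from docx2python import docx2python

# image_folder says where to write the extracted image files.
doc = docx2python("report.docx", image_folder="images")

# The running text keeps ----media/imageN.png---- style placeholders, so
# the text around an image (e.g. a caption below it) can be located by
# scanning the paragraphs for a marker.
for paragraph in doc.text.split("\n"):
    if "----media/" in paragraph:
        print("paragraph containing an image:", paragraph)

# doc.images maps the same file names to the raw image bytes.
for name, data in doc.images.items():
    print(name, len(data), "bytes")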

Python text extraction does not work on some pdfs

I am trying to read a PDF through a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF.
My code looks like this:
url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
f = urlopen(Request(url)).read()
fileInput = StringIO(f)
pdf = PyPDF2.PdfFileReader(fileInput)
print pdf.getNumPages()
print pdf.getDocumentInfo()
print pdf.getPage(1).extractText()
I am able to successfully extract text from the first link, but if I use the same program for the second PDF, I do not get any text. The page count and document info do show up.
I tried extracting text with pdfminer from the terminal and was able to extract text from the second PDF.
Any idea what is wrong with the PDF, or is there a drawback in the libraries I am using?
If you read the comments in the PyPDF documentation, you'll see that it's written right there that this functionality will not work well for some PDF files; in other words, you're looking at a limitation of the library.
Looking at the two PDF files, I can't see anything wrong with the files themselves. But...
The first file contains fully embedded fonts,
while the second file contains subsetted fonts.
This means that the second file is more difficult to extract text from, and the library probably doesn't support that properly. Just for reference, I did a text extraction with callas pdfToolbox (caution: I'm affiliated with this tool), which uses the Acrobat text extraction, and the text is properly extracted from both files (confirming that the PDF files are not the problem).
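Since the asker reports that pdfminer handled the second file, here is a minimal pdfminer.six sketch for the same URL:

from io import BytesIO
from urllib.request import urlopen

from pdfminer.high_level import extract_text

# Second URL from the question, where PyPDF2 returned empty text.
url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
data = urlopen(url).read()

text = extract_text(BytesIO(data))  # accepts a file-like object
print(text[:500])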
