I have a scanned PDF, and using Python I have already converted it to a searchable PDF, but that's not exactly what I want.
It is a well-known fact that a scanned PDF, being made of images of text, pixelates at some level of zoom, whereas a text PDF does not. That is what I want: I want to replace all of the text in my scanned PDF with real text, the same as in a typed PDF, or possibly create a new PDF with all the contents of the old one.
I am writing this post after searching the internet thoroughly and not finding anywhere to start. I went through libraries such as pytesseract and reportlab but couldn't find anything promising, and that is the reason no code is posted in this question.
Hence, I want to know a way to replace all of the text in the scanned PDF with real text at exactly the original positions.
Note: the PDF I am working on contains images as well, and I want everything to stay exactly in its original place, with only the text replaced.
As mentioned above, there is no code in this post because I haven't found a starting point yet.
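One hedged starting point (an assumption, not a tested recipe): pytesseract's `image_to_data` returns a bounding box, in image pixels, for every recognized word, and PyMuPDF can then write real text at those positions in a new PDF. The file names, DPI, and page handling below are illustrative only:

```python
def pixel_box_to_points(left, top, width, height, dpi=300):
    """Convert a pytesseract pixel bounding box to PDF points (72 pt/inch)."""
    scale = 72.0 / dpi
    return (left * scale, top * scale, (left + width) * scale, (top + height) * scale)

def rebuild_page(image_path, out_pdf, dpi=300):
    """Sketch: OCR one page image and write its words as real text into a new PDF.

    Requires pytesseract (plus the Tesseract binary), Pillow, and PyMuPDF;
    the page-size handling is an illustrative assumption.
    """
    import pytesseract          # OCR word boxes
    import fitz                 # PyMuPDF, for writing the new page
    from PIL import Image

    img = Image.open(image_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    doc = fitz.open()
    page = doc.new_page(width=img.width * 72.0 / dpi, height=img.height * 72.0 / dpi)
    # Keep the scan as a background so figures stay in place,
    # then draw each recognized word at its scaled position.
    page.insert_image(page.rect, filename=image_path)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue
        x0, y0, x1, y1 = pixel_box_to_points(
            data["left"][i], data["top"][i], data["width"][i], data["height"][i], dpi)
        page.insert_text((x0, y1), word, fontsize=(y1 - y0) or 8)
    doc.save(out_pdf)
```

Matching the original fonts and line spacing is the hard part; this only shows the mechanics of placing OCR'd words at their scaled positions.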
I have an SBI bank statement PDF which is tampered/forged. Here is the link for the PDF.
This PDF is edited using online editor www.ilovepdf.com. The edited part is the first entry under the 'Credit' column. Original entry was '2,412.00' and I have modified it to '12.00'.
Is there any programmatic way, using Python or any other open-source technology, to identify the edited/modified area of the PDF (i.e. a bounding box around the '12.00' credit entry in this PDF)?
2 things I already know:
Metadata (Info or XMP metadata) is not useful. The modify date doesn't tell whether the PDF was merely compressed or actually edited, since both operations change it, and it doesn't give the location of the edit anyway.
The span JSON generated by PyMuPDF is also not useful: the edited entry doesn't appear at the end of the spans, but in its proper order within the text of the PDF. Here is the span JSON file generated from PyMuPDF.
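For reference, span data like the linked file comes from PyMuPDF's `page.get_text("dict")`, which returns nested blocks → lines → spans. A small flattening helper (a sketch; the PDF path would be your own file):

```python
def flatten_spans(page_dict):
    """Flatten the blocks/lines/spans nesting returned by page.get_text("dict")
    into a single list of span dicts (each has "text", "bbox", font info, etc.)."""
    spans = []
    for block in page_dict.get("blocks", []):      # image blocks have no "lines"
        for line in block.get("lines", []):
            spans.extend(line.get("spans", []))
    return spans

def spans_from_pdf(path):
    """Sketch: dump every span of every page (requires PyMuPDF)."""
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return [flatten_spans(page.get_text("dict")) for page in doc]
```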
Kindly let me know if anyone has any opensource solution to resolve this problem.
iLovePDF completely rewrites the text in the document. You can even see this: just open the original and the manipulated PDFs in two Acrobat Reader tabs and switch back and forth between them; you'll see nearly all letters move a bit.
Internally iLovePDF also rewrote the PDF completely according to its own preferences, and the edit fits in perfectly.
Thus, no, you cannot recognize the manipulated text based on this document alone, because it is technically a completely different, completely new document.
The main idea is this: I have a large collection of IGCSE past papers, and I need to find which paper a particular question came from, when all I have is a screenshot of that question. I want to make a program that takes an image of a question, scans a set of PDFs to find the question, and outputs the PDF containing it. I have experience in programming, but I'm a bit stuck on how to approach the problem at hand.
Solutions I have tried:
I tried combining the PDFs into one mega PDF so I could just search that, but the resulting file is too large.
Solutions I think might work but not sure:
Making a program to read through every single pdf to find the keywords in the image.
Did you try the steps in https://automatetheboringstuff.com/chapter13/ ?
- put all pdf's in the same folder
- for each pdf go through each page
- perform extractText()
- use a regex or a simple string search to look for the question string in the extracted text, then output the pdf/page if found
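Those steps could be sketched like this (assuming PyPDF2, as in the linked chapter; the folder path and search phrase are placeholders, and newer releases of the library rename the method to `extract_text`):

```python
import os
import re

def normalize(text):
    """Collapse whitespace so line breaks in extracted text don't block matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def page_matches(page_text, phrase):
    """True if the normalized phrase occurs in the page's extracted text."""
    return normalize(phrase) in normalize(page_text)

def find_question(folder, phrase):
    """Sketch: scan every PDF in `folder`, yield (filename, page number) hits."""
    import PyPDF2  # as used in the Automate the Boring Stuff chapter
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue
        with open(os.path.join(folder, name), "rb") as fh:
            reader = PyPDF2.PdfFileReader(fh)
            for page_no in range(reader.numPages):
                if page_matches(reader.getPage(page_no).extractText(), phrase):
                    yield name, page_no + 1
```

Since all you have is a screenshot, the search phrase would first have to be typed out or OCR'd from the image (e.g. with pytesseract).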
I recently used the Google Vision API to extract text from a PDF. Now I am searching for a keyword in the response text (from the API). When I compare the given string and the found string, they do not match, even though they appear to have the same characters.
The only reason I can see is that the font types of the given and found strings look different, which leads to different ASCII/UTF-8 codes for the characters in the strings. (I have never come across such a problem before.)
How do I solve this? How can I bring these two strings to the same characters? I am using a Jupyter notebook, but even when I paste the comparison into a terminal, it still evaluates to False.
Here are the strings I am trying to match:
'КА Р5259' == 'KA P5259'
But they look the same on Stack Overflow so here's a screenshot:
Thanks everyone for your comments.
I found the solution and am posting it here; it might be helpful for someone. The two strings really do consist of different Unicode characters: what was extracted contains characters that merely look like the ones I typed (the console and Jupyter, which uses HTML to display output, render them identically), but they have different code points, so Python correctly treats them as different.
So the idea is to first bring the text response into plain-text form, which I achieved by storing the response in a .txt file (or a .pkl file, more precisely), which I had to do anyway to preserve the response objects for later data-analysis purposes. Once the response is stored in a plain file, it can be read back without the look-alike problem I faced above.
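For what it's worth, the mismatch can be diagnosed directly: the characters in the found string are Cyrillic look-alikes of the Latin ones (this is an observation about the strings posted in the question). `unicodedata` exposes the difference, and a small translation table - illustrative, not exhaustive - maps the common homoglyphs before comparing:

```python
import unicodedata

def char_report(s):
    """List the official Unicode name of each character, exposing look-alikes."""
    return [unicodedata.name(c) for c in s]

# A small, illustrative homoglyph table: Cyrillic letters that render
# like Latin capitals. A real application would need a fuller mapping.
HOMOGLYPHS = str.maketrans({
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
})

def normalize_homoglyphs(s):
    """Replace known Cyrillic look-alikes with their Latin counterparts."""
    return s.translate(HOMOGLYPHS)

found = "\u041a\u0410 \u04205259"   # 'КА Р5259' as returned by the OCR
given = "KA P5259"                  # what was typed on the keyboard
# The raw strings differ, but they compare equal after mapping homoglyphs.
```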
I've been trying for about a week to automate image extraction from a pdf. Unfortunately, the answers I found here were of no help. I've seen multiple variations on the same code using pypdf2, all with ['/XObject'] in them, which results in a KeyError.
What I'm looking for seems to be hiding in streams, which I can't find in pypdf2's dictionary (even after recursively exploring the whole structure, calling .getObject() on every indirect object I can find).
Using pypdf2 I've written one page off the pdf and opened it using Notepad++, to find some streams with the /FlateDecode filter.
pdfrw was slightly more helpful, allowing me to use PdfReader(path).pages[page].Contents.stream to get A stream (no clue how to get the others).
Using zlib, I decompressed it, and got something starting with:
/Part <</MCID 0 >>BDC
(It also contains a lot of floating-point numbers, both positive and negative)
From what I could find, BDC is some kind of marked-content operator; most of my search results only pointed at ghostscript.
At this point I gave up and decided to ask for help.
Is there a python tool to, at least, extract all streams (and identify FlateDecode tag?)
And is there a way for me to identify what's hidden in there? I expected the start tag of some image format, which this clearly isn't. How do I further parse this result to find any image that could be hidden in there?
I'm looking for something I can apply to any PDF that's displayed properly. Some tool to further parse, or at least help me make sense of the streams, or even a reference that will help me understand what's going on.
Edit: it seems, as noted by Patrick, that I was barking up the wrong tree. I went to streams since I couldn't find any xObjects when opening the PDF in Notepad++, or when running the various python scripts used to parse PDFs. I managed to find what I suspect are the images, with no xObject tags, but with what seems like a stream tag - though the information is not compressed.
Unless you are looking to extract inline images, which aren't that common, the content stream is not the place to look for images. The more common case is streams of type XObject, of subtype Image, which are usually found in a page's Resources->XObject dictionary (see sections 7.3.3, 7.8.3, and 8.9.5 of the PDF Reference indicated by @mkl).
Alternately, Image XObjects can also be found in Form XObjects (subtype Form, which indicates they have their own content streams) in their own Resource->XObject dictionary, so the search for Image XObjects can be recursive.
An Image XObject can also have an SMask (soft mask), which is itself its own Image XObject. Form XObjects are also used in tiling patterns, and so could conceivably contain Image XObjects (but those aren't that common either), or in an annotation's normal appearance (but Image XObjects are less commonly used within such annotations, except maybe 3D or multimedia annotations).
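The recursive search described above can be sketched as follows; plain Python dicts stand in here for the parsed PDF objects (a real implementation would walk pypdf or pikepdf objects, and the keys mirror the PDF dictionary names):

```python
def find_image_xobjects(resources, path=()):
    """Recursively collect Image XObjects from a Resources dict, descending
    into Form XObjects' own Resources and recording SMasks along the way."""
    images = []
    for name, xobj in resources.get("XObject", {}).items():
        here = path + (name,)
        subtype = xobj.get("Subtype")
        if subtype == "Image":
            images.append((here, xobj))
            smask = xobj.get("SMask")         # soft mask, itself an Image XObject
            if smask is not None:
                images.append((here + ("SMask",), smask))
        elif subtype == "Form":               # Form XObjects nest their own Resources
            images.extend(find_image_xobjects(xobj.get("Resources", {}), here))
    return images
```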
I have found many posts where solutions for reading PDFs have been proposed. I want to read a PDF file word by word and do some processing on it. People suggest pdfMiner, which converts an entire PDF file into a text file, but what I want is to read the PDF word by word. Can anyone suggest a library that does this?
Possibly the fastest way to do this is to first convert your PDF into a text file using pdftotext (on pdfMiner's site, there's a statement that pdfMiner is 20 times slower than pdftotext) and then parse the text file as usual.
Also, when you say "I want to read a pdf file word by word and do some processing on it", you don't specify whether you want to do processing based on the words in the PDF file or actually modify the PDF itself. If it's the latter, you have an entirely different problem on your hands.
I'm using pdfminer, and it is an excellent lib, especially if you're comfortable programming in Python. It reads a PDF and extracts every character, providing each character's bounding box as a tuple (x0, y0, x1, y1). Pdfminer will extract rectangles, lines and some images, and will try to detect words. It has an unpleasant O(N^3) routine that analyses bounding boxes to coalesce them, so it can get very slow on some files. Try converting your typical file - maybe it'll be fast for you, or maybe it'll take an hour; it depends on the file.
You can easily dump a pdf out as text, that's the first thing you should try for your application. You can also dump XML (see below), but you can't modify PDF. XML is the most complete representation of the PDF you can get out of it.
You have to read through the examples to use it in your python code, it doesn't have much documentation.
The example that comes with PdfMiner that transforms PDF into xml shows best how to use the lib in your code. It also shows you what's extracted in human-readable (as far as xml goes) form.
You can call it with parameters that tell it to "analyze" the pdf. If you do, it'll coalesce letters into blocks of text (words and sentences; sentences will have spaces so it's easy to tokenize into words in python).
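A minimal sketch of that word-by-word workflow (assuming the pdfminer.six fork, whose high-level entry point is `extract_text`; older pdfminer releases used a different API, and the tokenizer here is plain Python):

```python
import re

def tokenize_words(text):
    """Split extracted text into words; as noted above, the analyzed output
    has spaces between words, so simple pattern matching is enough."""
    return re.findall(r"\w+(?:['-]\w+)*", text)

def pdf_words(path):
    """Sketch: extract a PDF's text with pdfminer.six, then yield it word by word."""
    from pdfminer.high_level import extract_text
    return tokenize_words(extract_text(path))
```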
While I really liked the pdfminer answer, I'd note that packages don't stay the same over time. Currently pdfminer still doesn't support Python 3 and may need to be updated.
So, to bring the subject up to date (even though an answer has already been accepted), I'd propose going with pdfrw. From the website:
- Version 0.3 is tested and works on Python 2.6, 2.7, 3.3, 3.4, and 3.5
- Operations include subsetting, merging, rotating, modifying metadata, etc.
- The fastest pure-Python PDF parser available
- Has been used for years by a printer in pre-press production
- Can be used with rst2pdf to faithfully reproduce vector images
- Can be used either standalone, or in conjunction with reportlab to reuse existing PDFs in new ones
- Permissively licensed