How to extract specific text from a PDF using Python?

These are the items that need to be extracted from the PDF:
This is the link to the PDF.
Could anyone solve this problem in Python, with comments to help me understand the solution?
import pdf2image
from PIL import Image
import pytesseract
# Render every page of the PDF as a PIL image
image = pdf2image.convert_from_path('/content/SRW1012022Y0002378_220216102321.PDF')
for pagenumber, page in enumerate(image):
    # Run OCR on the rendered page and print the recognized text
    detected_text = pytesseract.image_to_string(page)
    print(detected_text)
I tried the snippet above and can extract all the text from the PDF, but I can't pick out specific values from it to apply further logic to them.
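One common way to pull specific values out of OCR output is to search the text with regular expressions. The sketch below is only an illustration: the field names, the labels ("Invoice No", "Date", "Total"), and the sample text are made up, because the items to extract are not shown here. Replace the patterns with the labels that actually appear in your PDF.

```python
import re

# Hypothetical field labels: adjust these patterns to the labels in your PDF.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*No\.?\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.IGNORECASE),
    "total": re.compile(r"Total\s*[:\-]?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return {field name: first captured value, or None if not found}."""
    results = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        results[name] = match.group(1) if match else None
    return results


# Made-up text standing in for the output of pytesseract.image_to_string(page)
sample_text = """ACME Shipping
Invoice No: INV-1001
Date: 16/02/2022
Total: 1,250.00
"""

print(extract_fields(sample_text))
```

In the loop from the question, you would call extract_fields(detected_text) for each page instead of printing the raw text. OCR output can contain misread characters, so the patterns may need to be more forgiving than these.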

Related

Python with pytesseract - How to get the same output for pytesseract.image_to_data in a searchable PDF?

I have a piece of Python code that uses pytesseract (the pytesseract.image_to_data method).
It gives me the text and its coordinates, which I save to a text file that is fed to third-party software. It works perfectly for PDF files that have been scanned:
data = pytesseract.image_to_data(Image.open('file-001-page-001.png'))
The issue is that I now need output in exactly the same structure for PDFs that already contain text. I could keep the same code and treat those PDFs as if they had no text, extracting images and running OCR, but that doesn't seem like the right solution.
Is it possible to achieve this with pytesseract?
Suggestions are welcome.

You can use this:
import pytesseract
from pdf2image import convert_from_path
# pytesseract reads images, not PDF files, so render the PDF pages to images first
pages = convert_from_path('file.pdf')
# Run OCR on the first page; the hOCR output holds the text with its bounding boxes
hocr = pytesseract.image_to_pdf_or_hocr(pages[0], extension='hocr', lang='eng', config='--oem 3 --psm 6')
# The result is returned as bytes, so write the file in binary mode
with open('output.hocr', 'wb') as f:
    f.write(hocr)
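If the third-party software expects rows like the ones from image_to_data (left, top, width, height, conf, text), the hOCR markup can be converted with the standard library. This is a rough sketch, not a drop-in replacement: it assumes Tesseract-style hOCR in which each word is a span with class ocrx_word and a title such as "bbox 36 92 96 116; x_wconf 95". The sample markup and the names HocrWordParser and hocr_to_rows are made up for illustration.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect word boxes from hOCR markup as image_to_data-like rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # row being filled while inside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            if bbox:
                x0, y0, x1, y1 = map(int, bbox.groups())
                self._current = {
                    "left": x0, "top": y0,
                    "width": x1 - x0, "height": y1 - y0,
                    "conf": int(conf.group(1)) if conf else -1,
                    "text": "",
                }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested spans (true for default output).
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.rows.append(self._current)
            self._current = None


def hocr_to_rows(hocr):
    """Parse hOCR given as str or bytes and return a list of word rows."""
    if isinstance(hocr, bytes):
        hocr = hocr.decode("utf-8")
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.rows


# Made-up hOCR fragment in the shape Tesseract emits
sample_hocr = (
    "<div class='ocr_page'>"
    "<span class='ocr_line' title='bbox 36 92 200 116'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 110 92 200 116; x_wconf 91'>world</span>"
    "</span></div>"
)

for row in hocr_to_rows(sample_hocr):
    print(row)
```

Note that this still relies on OCR. For PDFs that already contain text, a PDF text-extraction library that reports word coordinates would avoid the OCR step, but its output would have to be mapped to the same columns.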

Extract text from PDF files and preserve the original layout, in Python

I want to extract text from PDF files while preserving the layout of the text, as in the images below. The images show results from [github.com/JonathanLink/PDFLayoutTextStripper].
I tried the code below, but it doesn't preserve the layout. I want to get exactly the results shown in the images using a Python library such as PyPDF2, pdfplumber, or pdfminer. I tried all of these libraries but didn't get the desired results. I need help extracting the text from the PDF file exactly as shown in the images.
from pdfminer.high_level import extract_text
text = extract_text('test.pdf')
print(text)

You can preserve the layout and indentation with the pdftotext package.
import pdftotext
with open("target_file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# All pages
for text in pdf:
    print(text)

Wrong image src url when using Python scrapy

I am trying to scrape images from an e-commerce website using Scrapy, but for some of the items (5-10 out of 180) the image src output looks like this:
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
For the rest of the items I get the correct image URL.
Can someone help me with this?
My code for extracting the image src is:
image = response.css('.productimage img::attr(src)').extract()
Because of this, I get an error when downloading the images to my local system.

This
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
is actually an image: its bytes are encoded as a string using base64. You can use the built-in base64 module to save it as a file. Consider the following simple example:
import base64
txt = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A"
# Keep only the part after the comma (the base64 payload) and decode it
content = base64.b64decode(txt.split(',')[-1])
with open('image.png', 'wb') as f:
    f.write(content)
It will create an image.png file in the current working directory.

This base64 data:
iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
is an empty PNG image (it is not the relevant image).
This base64 data usually occurs on e-commerce websites for products that don't have images.
I recommend that you treat products with a base64 src like this as products without an image.
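To apply that in a spider, you could check each src before trying to download it. The sketch below uses only the standard library; the helper names (decode_data_uri, png_size, is_placeholder) are made up, and treating a 1x1 inline PNG as "no product image" is an assumption based on the value in the question, not a general rule.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_data_uri(src):
    """Return (mime_type, raw_bytes) for a base64 data URI, else None."""
    if not src.startswith("data:"):
        return None
    header, _, payload = src.partition(",")
    if not header.endswith(";base64"):
        return None  # other encodings are out of scope for this sketch
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def png_size(data):
    """Return (width, height) read from a PNG's IHDR chunk, or None."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    # Width and height are two big-endian 32-bit integers after 'IHDR'
    return struct.unpack(">II", data[16:24])


def is_placeholder(src):
    """True if src is an inline 1x1 PNG (assumed to mean 'no image')."""
    decoded = decode_data_uri(src)
    if decoded is None:
        return False
    _, data = decoded
    return png_size(data) == (1, 1)


src = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A"
print(is_placeholder(src))
print(is_placeholder("https://example.com/product.png"))
```

Items for which is_placeholder returns True can then be skipped or stored without an image instead of being passed to the download step.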

Paste PDF image into Pyplot figure

How can I plot the image from a PDF file into a Pyplot figure (e.g. with plt.imshow, or inside some container I can add with ax.add_artist)?
Methods that do not work:
import matplotlib.pyplot as plt
im = plt.imread('file.pdf')
(Source: this question, where it works for PNG files.)
from PIL import Image
im = Image.open('file.pdf')
(Source: this doc, but again, it doesn't work for PDF files; the question links a library to read PDFs but the doc shows no obvious way to add them to a Pyplot figure.)
Also, this question exists, but the answers solve the problem without actually loading a PDF file.

There is a module called PyMuPDF that makes this job a lot easier.
Scraping PDF images into PIL Image
Tutorials on extracting the individual images from each page and converting them into PIL format can be found here and here.
To grab an entire PDF page or pages, use page.get_pixmap(), documented here.
The snippet below shows how to iterate through a PDF and grab each page as a PIL.Image:
import io
import fitz  # PyMuPDF
from PIL import Image
file = 'myfile.pdf'
pdf_file = fitz.open(file)
# in case there is a need to loop through multiple PDF pages
for page_number in range(len(pdf_file)):
    page = pdf_file[page_number]
    # render the page to a pixmap; tobytes() returns it encoded as PNG by default
    rgb = page.get_pixmap()
    pil_image = Image.open(io.BytesIO(rgb.tobytes()))
    # display code or image manipulation here for each page #
Displaying scraped PDF Image
In either case, once there is a PIL.Image object, such as the pil_image variable above, the show() function can display it (and does so differently depending on the OS). However, if the preference is to use matplotlib.pyplot.imshow the PIL.Image must be converted to RGB first.
Snippet to display PIL.Image with pyplot.imshow
import matplotlib.pyplot as plt
plt.imshow(pil_image.convert('RGB'))

Unable to extract image from website

I am trying to download an image from a website using Python.
The relevant commands:
import urllib
imagelink = 'http://searchpan.in/hacked_captcha.php?450535633'
urllib.urlretrieve(imagelink, "image.jpg")
Viewing the image in Firefox shows the following.

You could use the following on Python 3. urlretrieve performs the GET request for you, retrieves the content, and writes it to the given filename.
import urllib.request
imagelink = 'https://i.stack.imgur.com/s2F9o.png'
urllib.request.urlretrieve(imagelink, './sample.png')
Reference: https://docs.python.org/3/howto/urllib2.html#fetching-urls

The image is a PNG, so all you need to do is save it with a '.png' extension.
Here is the code:
# Python 2; on Python 3 use urllib.request.urlretrieve instead
import urllib
imagelink = 'http://searchpan.in/hacked_captcha.php?450535633'
urllib.urlretrieve(imagelink, "image.png")
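When the URL has no file extension, as here, one way to find out which extension to use is to look at the first bytes of the downloaded data. The sketch below is a minimal illustration that only knows three formats; the function name guess_image_extension is made up, and the byte strings in the example are stand-ins for real downloaded data.

```python
def guess_image_extension(data):
    """Guess a file extension from the first bytes of image data."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return ".gif"
    return None  # unknown format; this sketch covers only three types


# Stand-ins for the first bytes of downloaded files
print(guess_image_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(guess_image_extension(b"<html><body>Not an image</body></html>"))
```

On Python 3 the data could come from urllib.request.urlopen(imagelink).read(). A result of None often means the server returned something other than an image, such as an HTML error page.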

Maybe do this in one line?
import urllib.request
urllib.request.urlretrieve("http://searchpan.in/hacked_captcha.php?450535633", "image.jpg")

The file name must end in .jpg or whatever extension the website uses for the image, and you have to give the full URL with the extension.
Extract and Download Image
Go to this link; it may be helpful.
Best of luck...
