Unable to extract image from website - python

I am trying to extract an image from a website using Python.
Relevant code:
import urllib
imagelink = 'http://searchpan.in/hacked_captcha.php?450535633'
urllib.urlretrieve(imagelink, "image.jpg")
Viewing the image in Firefox shows the following.

You could use the following on Python 3. urlretrieve performs the GET request for you (that step is abstracted away), retrieves the content, and writes it to the given filename.
import urllib.request
imagelink = 'https://i.stack.imgur.com/s2F9o.png'
urllib.request.urlretrieve(imagelink, './sample.png')
Reference: https://docs.python.org/3/howto/urllib2.html#fetching-urls

The image is a PNG; all you need to do is save it with a '.png' extension.
Here is the code:
import urllib
imagelink = 'http://searchpan.in/hacked_captcha.php?450535633'
urllib.urlretrieve(imagelink, "image.png")
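
The extension in the filename does not change what the server actually sent, so one way to avoid guessing is to look at the first bytes of the response. The following is only a sketch under a few assumptions: it targets Python 3, sniff_image_extension and download_image are hypothetical helper names (not part of urllib), and only PNG, JPEG and GIF signatures are recognised.

```python
from urllib.request import urlopen

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_image_extension(data):
    """Guess a file extension from the leading bytes of image data."""
    if data.startswith(PNG_SIGNATURE):
        return ".png"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    return None


def download_image(url, basename="image"):
    """Download url and save it as basename plus the detected extension."""
    with urlopen(url) as response:
        data = response.read()
    ext = sniff_image_extension(data)
    if ext is None:
        raise ValueError("response does not look like a PNG, JPEG or GIF")
    filename = basename + ext
    with open(filename, "wb") as f:
        f.write(data)
    return filename
```

Calling download_image(imagelink) would then write image.png or image.jpg depending on what the server returned, and raise an error if the response is something else, such as an HTML error page.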

Maybe do this in one line?
import urllib.request
urllib.request.urlretrieve("http://searchpan.in/hacked_captcha.php?450535633", "image.jpg")

The URL must include .jpg or whatever extension the website uses for the image; you have to give the full URL, including the extension.
Extract and Download Image
Go to this link. It may be helpful.
Best of luck...

Related

How to extract specific text from a pdf using python?

These are the items which are needed to be extracted from the pdf:
This is the link to the PDF.
Could anyone solve this problem using Python with proper comments to help me understand?
import pdf2image
from PIL import Image
import pytesseract
image = pdf2image.convert_from_path('/content/SRW1012022Y0002378_220216102321.PDF')
for pagenumber, page in enumerate(image):
    detected_text = pytesseract.image_to_string(page)
    print(detected_text)
I tried the above code snippet, and I can extract all the text from pdf, but I can't grab specific text to continue applying logic to it.
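
One common way to pull specific values out of OCR output is to search the text with regular expressions. The sketch below is only an illustration: the sample text, the field labels and the extract_fields helper are all hypothetical stand-ins, since the actual labels in the PDF are not shown here. In practice detected_text would be the string returned by pytesseract.image_to_string, and the patterns would need to match the labels in that document.

```python
import re

# Stand-in for the OCR output; these labels and values are made up.
detected_text = """Invoice No: INV-0042
Date: 16/02/2022
Total Amount: 1,250.00
"""

# One pattern per field, with a capture group around the value to keep.
FIELD_PATTERNS = {
    "invoice_no": r"Invoice No\s*:\s*(\S+)",
    "date": r"Date\s*:\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total Amount\s*:\s*([\d,]+\.\d{2})",
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict mapping field name to captured value (None if absent)."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_fields(detected_text))
```

OCR output is often noisy, so tolerant patterns (optional whitespace, case-insensitive matching) tend to be more robust than exact string comparisons.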

Time efficient way to convert PDF to image

Context:
I have PDF files I'm working with.
I'm using OCR to extract the text from these documents, and to do that I have to convert my PDF files to images.
I currently use the convert_from_path function of the pdf2image module, but it is very slow (9 minutes for a 9-page PDF).
Problem:
I am looking for a way to accelerate this process or another way to convert my PDF files to images.
Additional info:
I am aware that there is a thread_count parameter in the function but after several tries it doesn't seem to make any difference.
This is the whole function I am using:
import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder):
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')
    image_counter = 0
    # Iterate through all the pages stored above
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(output_folder + filename, 'JPEG')
        image_counter = image_counter + 1
    # Remove the intermediate .ppm files left in output_folder
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(output_folder + i)
Link to the convert_from_path reference.
I found an answer to this problem using another module called fitz, which is a Python binding to MuPDF.
First of all, install PyMuPDF.
The documentation can be found here, but for Windows users it's rather simple:
pip install PyMuPDF
Then import the fitz module:
import fitz
print(fitz.__doc__)
>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).
Open your file and save every page as an image:
The get_pixmap() method accepts different parameters that allow you to control the image (resolution, color, ...), so I suggest that you read the documentation here.
def convert_pdf_to_image(fic):
    # open your file
    doc = fitz.open(fic)
    # iterate through the pages of the document and create an RGB image of each page
    for page in doc:
        pix = page.get_pixmap()
        pix.save("page-%i.png" % page.number)
Hope this helps anyone else.
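
For OCR the default rendering may be too coarse. One way to raise the resolution is to pass a scaling matrix to get_pixmap(). The sketch below assumes the usual PDF convention of 72 units per inch; zoom_for_dpi and convert_pdf_to_images are hypothetical helper names, and the 150 dpi default is an arbitrary choice for illustration, not a recommendation from the PyMuPDF documentation.

```python
PDF_BASE_DPI = 72  # PDF user space uses 72 units per inch


def zoom_for_dpi(dpi):
    """Scale factor that renders a page at the requested dpi."""
    return dpi / PDF_BASE_DPI


def convert_pdf_to_images(pdf_path, dpi=150, prefix="page"):
    """Render every page of pdf_path to '<prefix>-<n>.png' at the given dpi."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # same zoom horizontally and vertically
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)
            pix.save(f"{prefix}-{page.number}.png")
```

Higher dpi values produce larger images and slower rendering, so it is worth trying a few values to find the lowest one that still gives acceptable OCR results.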

Wrong image src url when using Python scrapy

I am trying to scrape images from an e-commerce website using Scrapy, but for some of the items (5-10 out of 180) the image src output looks like this:
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
For the rest of the items I get the correct image URL.
Can someone help me with this?
My code for extracting the image src is:
image = response.css('.productimage img::attr(src)').extract()
Because of this, I get an error when downloading the images to my local system.
This
data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
is actually an image: its bytes are encoded as a string using base64. You can use the built-in base64 module to write it to a file. Consider the following simple example:
import base64
txt = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A"
content = base64.b64decode(txt.split(',')[-1])
with open('image.png', 'wb') as f:
    f.write(content)
It will create an image.png file in the current working directory.
This base64 data:
iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A
is an empty PNG image (it is not a relevant image).
This kind of base64 data usually occurs on e-commerce websites for products which don't have images.
I recommend that you treat products whose src is base64 data like this as products without an image.
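
Both suggestions above can be combined in a small helper that separates real URLs from inline data: URIs before any download is attempted. This is only a sketch: decode_data_uri and split_image_sources are hypothetical names, and only the base64 form of data: URIs is handled.

```python
import base64

PLACEHOLDER = (
    "data:image/png;base64,"
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8A"
)


def decode_data_uri(src):
    """Return (mime_type, raw_bytes) for a base64 data: URI, else None."""
    if not src.startswith("data:"):
        return None
    header, _, payload = src.partition(",")
    if not header.endswith(";base64"):
        return None
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def split_image_sources(srcs):
    """Split src values into (downloadable URLs, inline data: URIs)."""
    urls, inline = [], []
    for src in srcs:
        (inline if src.startswith("data:") else urls).append(src)
    return urls, inline


urls, inline = split_image_sources(["https://example.com/a.png", PLACEHOLDER])
print(len(urls), len(inline))
```

In the spider, the list returned by response.css(...).extract() could be passed through split_image_sources so that only the first list is handed to the downloader.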

How do I read an image using Pillow?

I want to read an image using PIL.Image.open(), but the image is in a different directory from my script.
This is the path of the Python script:
"D:\YY_Aadhi\holy-edge-master\hed\test.py"
This is the path of the image file:
"D:\YY_Aadhi\HED-BSDS\test\2018.jpg"
from PIL import Image
'''some code here'''
image = Image.open(????)
How should I fill in the question marks to access the image file?
You can simply do:
from PIL import Image
image = Image.open("D:\\YY_Aadhi\\HED-BSDS\\test\\2018.jpg")
or
from PIL import Image
directory = "D:\\YY_Aadhi\\HED-BSDS\\test\\2018.jpg"
image = Image.open(directory)
like this.
On Windows you have to double each backslash in a path string, because a single backslash starts an escape sequence. It also helps to try things out in small throwaway scripts; you learn a lot that way.
Does this image = Image.open("D:\YY_Aadhi\HED-BSDS\test\2018.jpg") not do the trick?
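
A path written with single backslashes in a normal string literal is risky, because Python interprets sequences such as \t (tab) and \201 (an octal escape) before the path ever reaches Image.open. The sketch below uses a shortened, hypothetical path (D:\test\2018.jpg) to show the difference between a plain literal, doubled backslashes and a raw string.

```python
from pathlib import PureWindowsPath

plain = "D:\test\2018.jpg"      # \t becomes a tab, \201 becomes one character
escaped = "D:\\test\\2018.jpg"  # doubled backslashes keep the path intact
raw = r"D:\test\2018.jpg"       # raw string: backslashes are taken literally

print(repr(plain))                # shows the mangled path
print(escaped == raw)             # the two safe spellings are the same string
print(PureWindowsPath(raw).name)  # file name component of the path
```

A raw string (or forward slashes, which Windows also accepts) is usually the least error-prone way to write such a path.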
You can use this to read an online image:
from urllib.request import urlopen
url = 'https://somewebsite/images/logo.png'
msg_image = urlopen(url).read()

Grabbing animated gif with python script

I've been playing around with Pythonista on iOS to create some automation scripts.
I have a problem where I'm trying to grab an animated gif from a remote url. I've come up with the following script.
import Image
from urllib import urlopen
from io import BytesIO
url = "http://someurl.com/funny.gif"
img = Image.open(BytesIO(urlopen(url).read()))
I get the image but it only appears to be the first frame of the gif? I'm guessing it has something to do with the BytesIO not reading in the whole file but I'm not sure?
Hope I'm along the right lines.
You're almost there. You use img.seek to advance frames. So..
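from PIL import Image  # Pillow; the legacy PIL spelling was `import Image`
from urllib.request import urlopen  # Python 3; on Python 2 this was `from urllib import urlopen`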
from io import BytesIO
url = 'http://upload.wikimedia.org/wikipedia/commons/2/2c/Rotating_earth_%28large%29.gif'
img = Image.open(BytesIO(urlopen(url).read()))
# Start with first frame
img.seek(0)
#img.show()
# Advance by one
img.seek(img.tell() + 1)
#img.show()
Here's a SO post showing how to save a gif using the Image class.
According to the Pillow manual:
To save all frames, the save_all parameter must be present and set to True.
So the opened image can be saved with:
image.save('filename.gif', save_all=True)
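
Instead of calling seek by hand, every frame can be walked with Pillow's ImageSequence.Iterator. The sketch below assumes a recent Pillow on Python 3; load_all_frames and make_demo_gif are hypothetical helper names, and the tiny three-frame GIF is generated in memory only so that the example runs without a network connection.

```python
from io import BytesIO

from PIL import Image, ImageSequence


def load_all_frames(gif_bytes):
    """Return a list with a copy of every frame in the GIF data."""
    with Image.open(BytesIO(gif_bytes)) as img:
        # copy() is needed because the iterator reuses the same image object
        return [frame.copy() for frame in ImageSequence.Iterator(img)]


def make_demo_gif():
    """Build a small three-frame animated GIF in memory (demo data only)."""
    colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
    frames = [Image.new("RGB", (4, 4), color) for color in colors]
    buf = BytesIO()
    frames[0].save(buf, format="GIF", save_all=True,
                   append_images=frames[1:], duration=100, loop=0)
    return buf.getvalue()


frames = load_all_frames(make_demo_gif())
print(len(frames))
```

With a downloaded GIF, the bytes returned by urlopen(url).read() can be passed to load_all_frames in place of the demo data.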
