Can you use requests.get() without HTTPS/HTTP - python

I have code that fetches a random flag from the Flag Mashup Bot and downloads it:
import requests

DIR = 'C:/Users/myUser/Desktop/Flags/'
URL = 'https://flagsmashupbot.pythonanywhere.com/mashup?passwd=fl4gsm4shupb0t'

def download_image(img_url: str, dest_dir: str):
    img_data = requests.get(img_url).content
    with open(dest_dir, 'wb') as file:
        file.write(img_data)

if __name__ == "__main__":
    response = requests.get(URL)
    if response.ok:
        page = response.text
        image_url = page[page.find('data:image', page.find('data:image') + 1):page.find('" download=')]
        name = page[page.find('" download=') + 12:page.find('_FlagsMashupBot.png"')]
        DIR += (name + '.png')
        print(DIR)
        download_image(image_url, DIR)
When I run it, I get the following error from the requests.get(img_url) call inside download_image:
requests.exceptions.InvalidSchema: No connection adapters were found for [image URL]
When I read about it, I realized that it's because the image URLs from the site don't start with "https://" (or at least that's what I understood).
So, is there a way to use requests.get() without having https at the start of the URL?

The reason you do not get an HTTP/HTTPS-based URL is that the href points to a base64-encoded data: URI containing the image itself.
You can use urllib to open the data: URI from the download link and save its contents to a file:
import urllib.request

data = 'data:image/png;charset=utf-8;base64,iVBORw0KGgoAAAANSUhEUgAABwgAAASwCAIAAABggIlUAAAABmJLR0QA/wD/AP+gvaeTAAAgAElEQVR4nOzdaZhd92Hf97Pde+fOYGYAzAx2gCBBivsCgqtEUl7piF.......'
response = urllib.request.urlopen(data)
with open('image.png', 'wb') as f:
    f.write(response.read())
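Alternatively, since the data: URI already contains the image bytes, you can skip URL opening entirely and decode the base64 payload with the standard library. A minimal sketch, assuming data holds a well-formed data: URI like the one above:
import base64

# Everything after the first comma of a data: URI is the base64 payload.
header, _, payload = data.partition(',')
img_bytes = base64.b64decode(payload)

with open('image.png', 'wb') as f:
    f.write(img_bytes)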

Related

Web scraping: images are in an unsupported format

I have been trying to scrape some images using BeautifulSoup in Python, and I am facing a problem: I can successfully scrape the link and store the file in the folder, but the images are in an unsupported format.
import os
import requests
from bs4 import BeautifulSoup

res = requests.get('https://books.toscrape.com/')
res.raise_for_status()
file = open('op.html', 'wb')
for i in res.iter_content(10000):
    file.write(i)
os.makedirs('images', exist_ok=True)
newfile = open("op.html", 'rb')
data = newfile.read()
soup = BeautifulSoup(data, 'html.parser')
for link in soup.find_all('img'):
    ll = link.get('src')
    ima = open(os.path.join('images', os.path.basename(ll)), 'wb')
    for down in res.iter_content(1000):
        ima.write(down)
It says the file format is not supported, even though it's a JPEG.
[screenshot: the downloaded images in the output folder]
The line for down in res.iter_content(1000): is not iterating the image from ll; it is re-iterating the HTML result. Your OS may recognize the file from the extension (.jpeg), but that is only because of the filename, not the content (which is not JPEG but HTML, hence the error).
You should make another request for the image itself, so it can be fetched and stored:
for link in soup.find_all('img'):
    ll = link.get('src')
    img_rs = requests.get(os.path.join('https://books.toscrape.com/', ll))  # <-- this line
    ima = open(os.path.join('images', os.path.basename(ll)), 'wb')
    for down in img_rs.iter_content(1000):  # <-- and iterate on the result
        ima.write(down)
The reason for saving the HTML is obscure. So, ignoring that part of the code in question, it comes down to this:
import requests
from os.path import join, basename
from bs4 import BeautifulSoup as BS
from urllib.parse import urljoin

URL = 'https://books.toscrape.com'
TARGET_DIR = '/tmp'

with requests.Session() as session:
    (r := session.get(URL)).raise_for_status()
    for image in BS(r.text, 'lxml').find_all('img'):
        src = image['src']
        (r := session.get(urljoin(URL, src), stream=True)).raise_for_status()
        with open(join(TARGET_DIR, basename(src)), 'wb') as t:
            for chunk in r.iter_content(chunk_size=8192):
                t.write(chunk)
In terms of performance, this can be significantly enhanced by multithreading.
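To illustrate that multithreading suggestion, here is a minimal sketch using concurrent.futures; the download_one helper and the MAX_WORKERS value are assumptions for this example, not part of the original answer:
import requests
from concurrent.futures import ThreadPoolExecutor
from os.path import join, basename
from urllib.parse import urljoin
from bs4 import BeautifulSoup as BS

URL = 'https://books.toscrape.com'
TARGET_DIR = '/tmp'
MAX_WORKERS = 8  # assumed pool size; tune to taste

def download_one(session, src):
    # Fetch one image and stream it to disk in chunks.
    r = session.get(urljoin(URL, src), stream=True)
    r.raise_for_status()
    with open(join(TARGET_DIR, basename(src)), 'wb') as t:
        for chunk in r.iter_content(chunk_size=8192):
            t.write(chunk)

with requests.Session() as session:
    (r := session.get(URL)).raise_for_status()
    srcs = [img['src'] for img in BS(r.text, 'lxml').find_all('img')]
    # Note: sharing a Session across threads is common practice, but the
    # requests documentation does not formally guarantee thread safety.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        pool.map(lambda s: download_one(session, s), srcs)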
Your problem is that after you find the URL of the image, you don't do anything with it; instead, you try to save the whole initial request, which is just the HTML of the whole website. Try something like this instead:
import os
import shutil
import requests
from bs4 import BeautifulSoup

base_url = 'https://books.toscrape.com/'
res = requests.get('https://books.toscrape.com/')
res.raise_for_status()
file = open('op.html', 'wb')
for i in res.iter_content(10000):
    file.write(i)
os.makedirs('images', exist_ok=True)
newfile = open("op.html", 'rb')
data = newfile.read()
soup = BeautifulSoup(data, 'html.parser')
for link in soup.find_all('img'):
    ll = link.get('src')
    ima = os.path.join('images', os.path.basename(ll))
    current_img = os.path.join(base_url, ll)
    img_res = requests.get(current_img, stream=True)
    with open(ima, 'wb') as f:
        shutil.copyfileobj(img_res.raw, f)
    del img_res

Python: download image from a short URL, keeping its own name

I would like to download an image file from a shortened URL, or a generated URL that doesn't contain a file name.
I have tried to use the Content-Disposition header. However, my file name is not in ASCII, so it can't print the name.
I have found out I can use urlretrieve or requests to download the file, but then I need to save it under a different name.
I want to download it while keeping its own name. How can I do this?
import re
import cgi
from urllib.request import urlopen, urlretrieve

# 'expression' and 'message' come from earlier code (not shown)
matches = re.match(expression, message, re.I)
url = matches[0]
print(url)
original = urlopen(url)
remotefile = urlopen(url)
#blah = remotefile.info()['content-Disposition']
#print(blah)
#value, params = cgi.parse_header(blah)
#filename = params["filename*"]
#print(filename)
#print(original.url)
#filename = "filedown.jpg"
#urlretrieve(url, filename)
These are the approaches I have tried, but none of them work.
I was able to get this to work with the requests library, because you can use it to find the URL that the shortened URL redirects to. Then I applied your code to the redirected URL, and it worked. There might be a way to do this using only urllib (I assume that's what you are using), but I don't know it.
import requests
from urllib.request import urlopen
import cgi

def getFilenameFromURL(url):
    req = requests.request("GET", url)
    # req.url is now the url the shortened url redirects to
    original = urlopen(req.url)
    value, params = cgi.parse_header(original.info()['Content-Disposition'])
    filename = params["filename*"]
    print(filename)
    return filename

getFilenameFromURL("https://shorturl.at/lKOY3")
You can then use urlretrieve with this. It's inefficient, but it works. Also, since you can get the actual URL with the requests library, you can probably get the filename through there as well.
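As a sketch of that last idea, using requests only, assuming the server sends an RFC 5987 filename* parameter in Content-Disposition (the helper name and fallback filename are mine, not from the thread):
import requests
from urllib.parse import unquote

def download_keeping_name(url):
    # Follow redirects, read the advertised filename, save under it.
    r = requests.get(url)
    r.raise_for_status()
    disposition = r.headers.get('Content-Disposition', '')
    filename = 'download.bin'  # assumed fallback if no name is advertised
    for part in disposition.split(';'):
        part = part.strip()
        if part.startswith('filename*='):
            # RFC 5987 form: filename*=UTF-8''%E4%BE%8B.jpg
            filename = unquote(part.split("''", 1)[-1])
        elif part.startswith('filename=') and 'filename*' not in disposition:
            filename = part.split('=', 1)[1].strip('"')
    with open(filename, 'wb') as f:
        f.write(r.content)
    return filename

print(download_keeping_name("https://shorturl.at/lKOY3"))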

Python - Downloading images using Wget. How to add a string to each file?

I'm using the following Python code to download images from a certain website. It's part of a code that I'm using to make a web scraper.
for url in links:
    # Invoke wget download method to download specified url image.
    local_image_filename = wget.download(url)
    # Print out local image file name.
    local_image_filename
    continue
It's working well, but I want to know if it's possible to add a string as a prefix to each file.
My idea is to get the page title via XPath and add it as a prefix to each file name.
I don't know where to add the string in this code. Can someone help me?
For example, I'm downloading these files:
logo.jpg, plans.jpg, circle.jpg
And I need to add a prefix, like these:
Beautiful_Plan_logo.jpg, Beautiful_Plan_plans.jpg, Beautiful_Plan_circle.jpg
Below is the entire code:
import requests
import bs4 as bs
import urllib.request
import wget

##################################################
#               getting url images               #
##################################################
url = "https://tyreehouseplans.com/shop/house-plans/blackberry-blossom/"

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla')]  # note: the attribute is addheaders, a list of tuples
urllib.request.install_opener(opener)

raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all('img')

links = []
for img in imgs:
    link = img.get('src')
    links.append(link)
print(links)

################################################
#             downloading images               #
################################################
for url in links:
    # Invoke wget download method to download specified url image.
    local_image_filename = wget.download(url)
    # Print out local image file name.
    local_image_filename
    continue
Thank you for any help!
The Python module wget has an option out, which determines the name of the output file. For example, the following script downloads 3 images, adding the prefix Beautiful_Plan_.
import wget

base_url = 'https://homepages.cae.wisc.edu/~ece533/images/'
image_names = ['airplane.png', 'arctichare.png', 'baboon.png']
prefix = 'Beautiful_Plan_'

for image_name in image_names:
    wget.download(base_url + image_name, out=prefix + image_name)
You can use shutil for this:
import shutil

prefix = "prefix_"
# your piece of code
for url in links:
    # Invoke wget download method to download specified url image.
    local_image_filename = wget.download(url)
    # Print out local image file name.
    local_image_filename
    shutil.copy(local_image_filename, prefix + local_image_filename)
Use os.rename, as described in the documentation.
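A minimal sketch of that os.rename approach, assuming the same links list and a prefix as in the other answers (renaming avoids the un-prefixed duplicate that shutil.copy leaves behind):
import os
import wget

prefix = 'Beautiful_Plan_'
for url in links:
    local_image_filename = wget.download(url)
    # Rename in place rather than copying, so only the prefixed file remains.
    os.rename(local_image_filename, prefix + local_image_filename)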
I wrote code for making a separate file with the extra information up front and a separator.
import requests
import bs4 as bs
import urllib.request
import wget

##################################################
#               getting url images               #
##################################################
url = "https://tyreehouseplans.com/shop/house-plans/blackberry-blossom/"

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla')]
urllib.request.install_opener(opener)

raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all('img')

links = []
for img in imgs:
    link = img.get('src')
    links.append(link)
# print(links)

################################################
#             downloading images               #
################################################
for url in links:
    # Invoke wget download method to download specified url image.
    try:
        local_image_filename = wget.download(url)
    except ValueError:
        break
    # Print out local image file name.
    print(local_image_filename)
    with open(local_image_filename, 'r') as myFile:
        try:
            data = myFile.read()
        except UnicodeDecodeError:
            data = "UNICODE DECODE ERROR"
        except ValueError:
            data = "VALUE ERROR"
        print(data)
        print(type(data))
    newSaveString = str(local_image_filename) + "SeperatorOfSomeKind" + str(data)
    newFileName = "NEW_" + local_image_filename
    with open(newFileName, 'w') as myFile:
        myFile.write(newSaveString)
    continue

Save streaming audio from URL as MP3, or even just audio file from URL as MP3

I am trying to have my server, in Python 3, grab files from URLs. Specifically, I would like to pass a URL into a function and have the function grab an audio file (of many varying formats) and save it as an MP3, probably using ffmpeg or ffmpy. If the URL also has a PDF, I would like to save that as a PDF. I haven't done much research on the PDF part yet, but I have been working on the audio piece and wasn't sure if this was even possible.
I have looked at several questions here, but most notably;
How do I download a file over HTTP using Python?
It's a little old, but I tried several methods from there and always ran into some sort of issue. I have tried the requests library, urllib, streamripper, and maybe one other.
Is there a way to do this and with a recommended library?
For example, most of the ones I have tried do save something, like the HTML page, or, in this case, an empty file called 'file.mp3'.
Streamripper returned a "try changing user agents" error.
I am not sure if this is possible, but I am sure there is something I'm not understanding here; could someone point me in the right direction?
This isn't necessarily the code I'm trying to use, just an example of something I have used that doesn't work.
import requests

url = "http://someurl.com/webcast/something"
r = requests.get(url)
with open('file.mp3', 'wb') as f:
    f.write(r.content)

# Retrieve HTTP meta-data
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
Edit:
import requests
import ffmpy
import datetime
import os

## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE AUDIO/MPEG, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.MP3
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE application/pdf, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.PDF
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE other than application/pdf, OR
## audio/mpeg, THE FILE WILL NOT BE SAVED

def BordersPythonDownloader(url):
    print('Beginning file download requests')
    r = requests.get(url, stream=True)
    contype = r.headers['content-type']
    if contype == "audio/mpeg":
        print("audio file")
        filename = '[{}].mp3'.format(str(datetime.datetime.now()))
        with open('file.mp3', 'wb+') as f:
            f.write(r.content)
        ff = ffmpy.FFmpeg(
            inputs={'file.mp3': None},
            outputs={filename: None}
        )
        ff.run()
        if os.path.exists('file.mp3'):
            os.remove('file.mp3')
    elif contype == "application/pdf":
        print("pdf file")
        filename = '[{}].pdf'.format(str(datetime.datetime.now()))
        with open(filename, 'wb+') as f:
            f.write(r.content)
    else:
        print("URL DID NOT RETURN AN AUDIO OR PDF FILE, IT RETURNED {}".format(contype))

# INSERT YOUR URL FOR TESTING
# OR CALL THIS SCRIPT FROM ELSEWHERE, PASSING IT THE URL

# DEFINE YOUR URL
# url = 'http://archive.org/download/testmp3testfile/mpthreetest.mp3'
# CALL THE SCRIPT, PASSING IT YOUR URL
# x = BordersPythonDownloader(url)

# ANOTHER EXAMPLE WITH A PDF
# url = 'https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SY/configuration/guide/sy_swcg/etherchannel.pdf'
# x = BordersPythonDownloader(url)
Thanks Richard, this code works and helps me understand this better. Any suggestions for improving the above working example?
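One hedged suggestion, not from the thread: since the request is already made with stream=True, writing the body in chunks with iter_content avoids holding a large audio file in memory. A minimal sketch of the change:
# Sketch: replace f.write(r.content) with chunked writes so large
# files are streamed to disk instead of buffered in memory.
with open('file.mp3', 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)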

How to download a file using Python when the URL doesn't change [duplicate]

This question already has answers here:
Download Returned Zip file from URL
(9 answers)
Closed last year.
I want to download a file from a webpage. That webpage only has one .zip file (that's what I want to download), but when I click on the .zip file, it starts downloading and the URL doesn't change (the URL still remains of the form http://ldn2800:8080/id=2800). How can I download this using Python, considering that there is no URL of the form http://example.com/1.zip?
Also, when I go directly to the page http://ldn2800:8080/id=2800, it just opens the page with the .zip file but doesn't download it without clicking. How do I download it using Python?
UPDATE: Right now I'm doing it this way:
if (str(dict.get('id')) == winID):
    #or str(dict.get('id')) == linuxID):
    #if str(dict.get('number')) == buildNo:
    buildTypeId = dict.get('id')
    ID = dict.get('id')
    downloadURL = "http://example:8080/viewType.html?buildId=26009&tab=artifacts&buildTypeId=" + ID
    directory = BindingsDest + "\\" + buildNo
    if not os.path.exists(directory):
        os.makedirs(directory)
    fileName = None
    if buildTypeId == linuxID:
        fileName = linuxLib + "-" + buildNo + ".zip"
    elif buildTypeId == winID:
        fileName = winLib + "-" + buildNo + ".zip"
    if fileName is not None:
        print(dict)
        downloadFile(downloadURL, directory, fileName)

def downloadFile(downloadURL, directory, fileName, user=user, password=password):
    if user is not None and password is not None:
        request = requests.get(downloadURL, stream=True, auth=(user, password))
    else:
        request = requests.get(downloadURL, stream=True)
    with open(directory + "\\" + fileName, 'wb') as handle:
        for block in request.iter_content(1024):
            if not block:
                break
            handle.write(block)
But it just creates a zip in the required location, and that zip can't be opened and contains nothing.
Can something like this be done: searching for the filename on the webpage and then downloading the matched file?
Check the HTTP status code to make sure that no error happened. You may use the built-in method raise_for_status to do so: https://requests.readthedocs.io/en/master/api/#requests.Response.raise_for_status
def downloadFile(downloadURL, directory, fileName, user=user, password=password):
    if user is not None and password is not None:
        request = requests.get(downloadURL, stream=True, auth=(user, password))
    else:
        request = requests.get(downloadURL, stream=True)
    request.raise_for_status()
    with open(directory + "\\" + fileName, 'wb') as handle:
        for block in request.iter_content(1024):
            if not block:
                break
            handle.write(block)
Are you sure that there is no networking issue, such as a proxy or firewall?
EDIT: according to your above comment, I'm not sure that this answers your actual problem. Revised answer:
You access a web page containing a link to a zip file. This link, you say, is the same as the page itself. But if you click on it in a browser, it downloads the file instead of reaching the HTML page again. That's strange but can be explained in various ways. Please copy/paste the whole HTML page code (including the link to the zip file), that will probably help us understanding the issue.
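On the asker's follow-up about pattern-matching the file name on the page, a minimal sketch (assuming the page's HTML contains an a tag whose href ends in .zip; the URL and variable names are placeholders, not from the thread):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "http://ldn2800:8080/id=2800"  # placeholder from the question
r = requests.get(page_url)
r.raise_for_status()

soup = BeautifulSoup(r.text, 'html.parser')
# Find the first link whose href ends with .zip, resolve it against
# the page URL, then download it with a normal streaming GET.
for a in soup.find_all('a', href=True):
    if a['href'].endswith('.zip'):
        zip_url = urljoin(page_url, a['href'])
        zip_resp = requests.get(zip_url, stream=True)
        zip_resp.raise_for_status()
        with open(a['href'].rsplit('/', 1)[-1], 'wb') as f:
            for block in zip_resp.iter_content(8192):
                f.write(block)
        break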
