Download image from URL that automatically creates download - python

I have a URL that, when opened, does nothing but initiate a download and then immediately close the page. I need to capture this download (a PNG) with Python and save it to my own directory. I have tried all the usual urllib and urllib2 methods and even tried with mechanize, but it's not working.
The fact that the URL automatically starts a download and then closes is definitely causing problems.
UPDATE: Specifically, it is using Nginx to serve the file with an X-Accel-Mapping header.

There's nothing particularly special about the X-Accel-Mapping header. Perhaps the page makes the HTTP request with AJAX and uses the X-Accel-Mapping header value to trigger the download?
Here's how I'd do it with urllib2:
import urllib2

# The real download URL is carried in the X-Accel-Mapping response header
response = urllib2.urlopen(url_to_get_x_accel_mapping_header)
download_url = response.headers['X-Accel-Mapping']
download_contents = urllib2.urlopen(download_url).read()
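To save the result to disk, a minimal follow-up sketch (output.png is just a placeholder path), opening the file in binary mode since the response is PNG data:
with open('output.png', 'wb') as f:  # 'wb' because the PNG is binary data
    f.write(download_contents)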

import urllib

URL = YOUR_URL
IMAGE = URL.rsplit('/', 1)[1]  # filename taken from the last path segment of the URL
urllib.urlretrieve(URL, IMAGE)
For details on dynamically downloading images from a list of URLs, visit here.

Related

Python Requests Module - How to download an mp4 behind a login page?

I have an mp4 on a website that I want to download using Python's Requests module. The problem is that each mp4 URL is behind a login page. For example, if you go to the URL in my code, it requires signing into an account before you can access the mp4 directly.
Is there any way to download mp4s like this easily? It's fine if users need to log into their account before downloading, but I'm unsure how to handle that step in Python. Thanks.
Below is an example of what I'm trying to achieve:
import requests

chunk_size = 256
url = "https://juilliard.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=f290a855-c462-4406-a965-af2700f259fe&remoteEmbed=true&remoteHost=https%3A%2F%2Fwww.juilliard.edu&embedApiId=panopto-video-player"
r = requests.get(url, stream=True)
with open("performance.mp4", "wb") as f:
    for chunk in r.iter_content(chunk_size=chunk_size):
        f.write(chunk)
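One way to handle the login is to authenticate with a requests.Session first, so the same cookies are reused for the mp4 request. A minimal sketch, assuming a simple form-based login; the login URL and form field names below are placeholders you would need to take from the site's actual login form:
import requests

session = requests.Session()

# Placeholder login endpoint and form fields -- inspect the site's real login
# form in your browser's developer tools and substitute the actual values.
login_url = "https://example.com/login"
credentials = {"username": "my_user", "password": "my_password"}
session.post(login_url, data=credentials)

# The session now carries the authentication cookies, so this request is made
# as a logged-in user.
video_url = "https://example.com/path/to/video.mp4"  # direct mp4 URL (placeholder)
r = session.get(video_url, stream=True)
with open("performance.mp4", "wb") as f:
    for chunk in r.iter_content(chunk_size=256):
        f.write(chunk)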

Send POST request with Python that generates a download and download the file

There's a website with a button that downloads an Excel file. After I click it, it takes around 20 seconds for the server API to generate the file and send it back to my browser for download.
If I monitor the communication after I click the button, I can see the browser send a POST request to a server with a series of headers and form values.
Is there a way I can simulate a similar POST request programmatically with Python and retrieve the Excel file after the server sends it over?
Thank you in advance
The requests module can be used for sending all kinds of request types.
requests.post sends the POST request synchronously.
The payload data can be set using data=.
The response body can be accessed using .content.
Be sure to check .status_code and only save the file on a successful response code.
Also note the use of "wb" inside open, because we want to save the file as binary rather than text.
Example:
import requests

payload = {"dao": "SampleDAO",
           "condigId": 1,
           ...}
r = requests.post("http://url.com/api", data=payload)
if r.status_code == 200:
    with open("file.save", "wb") as f:
        f.write(r.content)
Requests Documentation
I guess you could similarly do this:
file_info = requests.get(url)
with open('file_name.extension', 'wb') as file:
    file.write(file_info.content)
I honestly can't explain this in much detail, though, since I only have a limited understanding of how it works.

Python "requests" module errors sending follow request on Instagram

I am using the following script:
import requests
import json
import os

COOKIES = json.loads("")  # paste your EditThisCookie export (JSON) here to send requests
COOKIEDICTIONARY = {}

for i in COOKIES:
    COOKIEDICTIONARY[i['name']] = i['value']

def follow(id):
    post = requests.post("https://instagram.com/web/friendships/" + id + "/follow/",
                         cookies=COOKIEDICTIONARY)
    print(post.text)

follow('309438189')
os.system("pause")
This script is supposed to send a follow request to the user '309438189' on Instagram. However, when the code is run, post.text outputs some HTML, including
"This page could not be loaded. If you have cookies disabled in your
browser, or you are browsing in Private Mode, please try enabling
cookies or turning off Private Mode, and then retrying your action."
The loop is supposed to copy the cookies into COOKIEDICTIONARY in a format the requests module can read. If I print the dictionary, it shows all of the cookies and their values.
The cookies I put in are valid, and the requests syntax is (I believe) correct.
I have fixed it. The problem was that certain headers I needed were not present, such as Origin (I will post the full list soon). Anybody who wants to imitate an Instagram POST request needs those headers, or it will error; see the sketch below.
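A minimal sketch of what adding those headers might look like (the exact header set is an assumption; Instagram's requirements change, so inspect a real follow request in your browser's developer tools to confirm which headers and values are needed):
# Hypothetical header set -- copy the real values from a logged-in browser session.
HEADERS = {
    "Origin": "https://www.instagram.com",
    "Referer": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0",
    "X-CSRFToken": COOKIEDICTIONARY.get("csrftoken", ""),
}

def follow(id):
    post = requests.post("https://instagram.com/web/friendships/" + id + "/follow/",
                         cookies=COOKIEDICTIONARY,
                         headers=HEADERS)
    print(post.text)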

Python Retrieve File while Ignoring Redirect

I'm working on a program that uses Beautiful Soup to scrape a website and then urllib to retrieve images found on the website (using each image's direct URL). The website I'm scraping isn't the original host of the image, but it does link to the original image. The problem I've run into is that for certain websites, retrieving www.example.com/images/foobar.jpg redirects me to the homepage www.example.com and produces an empty (0 KB) image. In fact, going to www.example.com/images/foobar.jpg in a browser redirects as well. Interestingly, on the website I'm scraping, the image shows up normally.
I've seen some examples on SO, but they all explain how to capture cookies, headers, and other similar data from websites while getting around the redirect, and I was unable to get them to work for me. Is there a way to prevent a redirect and get the image stored at www.example.com/images/foobar.jpg?
This is the block of code that saves the image:
from urllib import urlretrieve
...
for imData in imList:
    imurl = imData['imurl']
    fName = os.path.basename(URL)
    fName, ext = os.path.splitext(fName)
    fName += "_%02d" % (ctr,) + ext
    urlretrieve(imurl, fName)
    ctr += 1
The code that handles all the scraping is too long to reasonably put here, but I have verified that imData['imurl'] holds the accurate URL for the image, for example http://upload.wikimedia.org/wikipedia/commons/9/95/Brown_Bear_cub_in_river_1.jpg. However, certain images redirect, like: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/fauna-animals-public-domain-images-pictures/bears-public-domain-images-pictures/brown-bear-in-dog-salmon-creek.jpg.
The website you are attempting to download the image from may have extra checks to limit screen scraping. A common check is the Referer header, which you can try adding to the urllib2 request:
req = urllib2.Request('<img url>')
req.add_header('Referer', '<page url / domain>')
For example, the request my browser used for an alpaca image from the website you referenced includes a Referer header:
Request URL:http://www.public-domain-image.com/cache/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos_w725_h544.jpg
Request Method:GET
....
Referer:http://www.public-domain-image.com/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos.jpg.html
User-Agent:Mozilla/5.0
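Putting it together, a minimal sketch with urllib2 (the URLs are placeholders for the real image and the page that links to it):
import urllib2

req = urllib2.Request('http://www.example.com/images/foobar.jpg')
req.add_header('Referer', 'http://www.example.com/')  # page the image is linked from
req.add_header('User-Agent', 'Mozilla/5.0')

image_data = urllib2.urlopen(req).read()
with open('foobar.jpg', 'wb') as f:
    f.write(image_data)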

How to actually download the attachment?

I'm using urllib2 to (try to) download a file from a web site. The file can only be downloaded after specifying some form fields. I can create the request and get the response without any problem, like this:
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
When I look at the response headers, e.g. print response.info()['Content-Disposition'], I see the file there, i.e. it prints something like attachment;filename=myfile.txt
But how do I actually download the attachment? If I do response.read() I just get a string containing the HTML of the page at url. The point is that url is not a file, it is a web page with an "attachment" and I'm trying to download that attachment with urllib2. I believe the attachment is dynamically generated, so it's not just sitting there on the server.
The problem was that I wasn't sending all the necessary headers. In particular, it was important that I send the right cookies in the request headers. I did the following:
Open up Chromium (or Chrome) and hit Ctrl+Shift+I to open up the developer tools.
Click "Network"
Visit the page where the file is to be downloaded.
Click the newly created entry in the developer tools, then click Headers. That's where I got all the info on the headers I needed to send (see the sketch after these steps).
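A minimal sketch of the resulting request, assuming url and data are the same values as in the question; the header names and values below are placeholders for whatever the developer tools showed, the Cookie header being the important one:
import urllib2

headers = {
    # Copy these from the developer tools "Headers" panel for the real request.
    'Cookie': 'sessionid=PASTE_SESSION_COOKIE_HERE',
    'User-Agent': 'Mozilla/5.0',
}

req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)

# Save the attachment under the name given by Content-Disposition
with open('myfile.txt', 'wb') as f:
    f.write(response.read())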
