How to actually download the attachment? - python

I'm using urllib2 to (try to) download a file from a web site. The file can only be downloaded after specifying some form fields. I can create the request and get the response without any problem, like this:
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
When I look at the response headers with print response.info()['Content-Disposition'], I can see the file there, i.e. it prints something like attachment;filename=myfile.txt
But how do I actually download the attachment? If I do response.read() I just get a string containing the HTML of the page at url. The point is that url is not a file, it is a web page with an "attachment" and I'm trying to download that attachment with urllib2. I believe the attachment is dynamically generated, so it's not just sitting there on the server.

The problem was that I wasn't sending all the necessary headers. In particular, it was important that I send the right cookies in the request headers. I did the following:
1. Open up Chromium (or Chrome) and hit Ctrl+Shift+I to open the developer tools.
2. Click "Network".
3. Visit the page where the file is to be downloaded.
4. Click the newly created entry in the developer tools, then click "Headers". That's where I got all the info on the headers I needed to send.
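With those values in hand, a minimal sketch of the full request might look like this (the URL, form fields, and cookie string below are placeholders for whatever you copy from the Network tab):
import urllib
import urllib2

url = 'http://example.com/download'          # placeholder URL
data = urllib.urlencode({'field': 'value'})  # the form fields the page expects
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'sessionid=...',               # value copied from the Network tab
}

req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)

print response.info()['Content-Disposition']  # e.g. attachment;filename=myfile.txt

# The body is now the attachment itself, so write it to disk
with open('myfile.txt', 'wb') as f:
    f.write(response.read())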

Related

How should I get another redirected page URL in Python?

When you open a URL in a normal browser, it can redirect you to another website URL, for example a shortened link: after you open it, it redirects you to the main URL.
How do I do this in Python? I need to open a URL in Python that redirects to another website's page, and then capture that page's link.
That's all I want to know, thank you.
I tried it with the Python requests and urllib modules, like this:
import requests
a = requests.get("url", allow_redirects=True)
And
import urllib.request
a = urllib.request.urlopen("url")
But it's not working at all; I mean, I didn't get the redirected page.
I know 4 types of redirections.
1. The server sends a response with a 3xx status and the new address:
HTTP/1.1 302 Found
Location: https://new_domain.com/some/folder
Wikipedia: HTTP 301, HTTP 302, HTTP 303
2. The server sends a Refresh header with a time in seconds and the new address:
Refresh: 0; url=https://new_domain.com/some/folder
3. The server sends HTML with a meta tag which emulates the Refresh header:
<meta http-equiv="refresh" content="0; url=https://new_domain.com/some/folder">
Wikipedia: meta refresh
4. JavaScript sets a new location:
location = url
location.href = url
location.replace(url)
location.assign(url)
The same works for document.location and window.location.
There may also be combinations with open(), document.open(), window.open().
requests automatically follows redirects of the first and (probably) the second type. With urllib you would probably have to check the status, get the URL, and run the next request yourself - but this is easy. You can even run it in a loop, because some pages may have many redirections. You can test it on httpbin.org (even for multi-redirections).
For the third type it is easy to check whether the HTML has the meta tag and to run the next request with the new URL. And again you can run it in a loop, because some pages may have many redirections.
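A rough sketch of that check with requests and BeautifulSoup (the URL is a placeholder, and this assumes the content attribute has the usual "0; url=..." form):
import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com/page")  # placeholder URL
soup = BeautifulSoup(r.text, "html.parser")

# Look for <meta http-equiv="refresh" content="0; url=...">
meta = soup.find("meta", attrs={"http-equiv": lambda v: v and v.lower() == "refresh"})
if meta:
    content = meta.get("content", "")
    if "url=" in content:
        new_url = content.split("url=", 1)[1]
        r = requests.get(new_url)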
But the fourth type is a problem, because requests can't run JavaScript, and there are many different methods to assign a new location. They can also be hidden in the code - "obfuscation".
In requests you can check response.history to see the redirections that were executed.
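For example, against httpbin.org (mentioned above), whose /redirect/3 endpoint chains three 302 redirects:
import requests

r = requests.get("https://httpbin.org/redirect/3", allow_redirects=True)

for resp in r.history:            # the intermediate 302 responses
    print(resp.status_code, resp.url)
print(r.status_code, r.url)       # the final page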

Send POST request with Python that generates a download and download the file

There's a website that has a button which downloads an Excel file. After I click, it takes around 20 seconds for the server API to generate the file and send it back to my browser for download.
If I monitor the communication after I click the button, I can see how the browser sends a POST request to a server with a series of headers and form values.
Is there a way that I can simulate a similar POST request programmatically using Python, and retrieve the Excel file after the server sends it over?
Thank you in advance
The requests module can be used for sending all kinds of request types.
requests.post sends the POST request synchronously.
The payload data can be set using data=.
The response body can be accessed using .content.
Be sure to check the .status_code and only save the file on a successful response code.
Also note the use of "wb" inside open(), because we want to save the file as binary data instead of text.
Example:
import requests
payload = {"dao": "SampleDAO",
           "condigId": 1,
           ...}
r = requests.post("http://url.com/api", data=payload)
if r.status_code == 200:
    with open("file.save", "wb") as f:
        f.write(r.content)
Requests Documentation
I guess you could similarly do this:
file_info = requests.get(url)
with open('file_name.extension', 'wb') as file:
    file.write(file_info.content)
I honestly do not know how to explain this, though, since I have little understanding of how it works.

Python Request .csv 401 Client Error: Unauthorized for URL

I am trying to download a csv file from an authorized website.
I am able to get a response code of 200 with the URL https://workspace.xxx.com/abc/ (clicking in this web page downloads the csv), but a response code of 401 at url = 'https://workspace.xxx.com/abc/abc.csv'
This is my code:
import requests
r = requests.get(url, auth=('myusername', 'mybasicpass'))
I tried adding headers and using a session, but I still get a response code of 401.
First of all, you have to investigate how the website accepts the password.
They might be using HTTP authentication or an Authorization header in the request.
You can log in using their website and then download the file. Study how they pass the authorization.
I am sure they are not accepting a plain password in the authorization; they might be encoding it in base64 or another encoding scheme.
My advice is to open the developer console and study their requests in the Network tab. You can post more information so one could help you more.
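As a sketch of that approach (the URL is from the question; the token and header scheme are assumptions you would replace with whatever the browser actually sends):
import requests

session = requests.Session()
# Copy the Authorization header value from the browser's Network tab;
# it is often "Basic <base64>" or "Bearer <token>".
session.headers.update({"Authorization": "Bearer <token copied from browser>"})

r = session.get("https://workspace.xxx.com/abc/abc.csv")
if r.status_code == 200:
    with open("abc.csv", "wb") as f:
        f.write(r.content)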

Why don't BeautifulSoup and lxml work?

I'm using the mechanize library to log in to a website. I checked, and it works well. But the problem is that I can't use response.read() with BeautifulSoup and 'lxml'.
#BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source) #source.txt doesn't work either
for link in soup.findAll('a', {'class': 'someClass'}):
    some_list.add(link)
This doesn't work; actually, it doesn't find any tags. It works well when I use requests.get(url).
#lxml->html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source) #source.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]') #/text() doesn't work either
print like_pages
It doesn't print anything. I know it has a problem with the return type of response, since it works well with requests.get(). What could I do? Could you please provide sample code where response.read() is used in HTML parsing?
By the way, what is the difference between the response and request objects?
Thank you!
I found the solution. It is because mechanize.browser is an emulated browser, and it gets only the raw HTML. The page I wanted to scrape adds the class to the tag with the help of JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver. I used Selenium for Python. Here is the code:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
list = driver.find_elements_by_xpath('//a[@class="someClass"]')
Note: You need to have Firefox installed. Or you can choose another driver according to the browser you want to use.
A request is what a web client sends to a server, with details about what URL the client wants, which HTTP verb to use (GET/POST, etc.), and, if you are submitting a form, the data you put in the form.
A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates whether the request was successful (usually code 200 if there were no problems, or an error code like 404 or 500). The response usually contains data, like the HTML of a page or the binary data of a JPEG. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header, which says what format the data is in).
Quote from @davidbuxton's answer on this link.
Good luck!
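To make the request/response distinction above concrete, here is a tiny illustration with requests (httpbin.org is just a public test service):
import requests

r = requests.get("https://httpbin.org/image/jpeg")
print(r.status_code)               # 200 if the request succeeded
print(r.headers["Content-Type"])   # e.g. "image/jpeg"
print(len(r.content))              # size of the binary body in bytes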

Download image from URL that automatically creates download

I have a URL that, when opened, does nothing but initiate a download and immediately close the page. I need to capture this download (a png) with Python and save it to my own directory. I have tried all the usual urllib and urllib2 methods and even tried it with mechanize, but it's not working.
The url automatically starting a download and then closing is definitely causing some problems.
UPDATE: Specifically, it is using Nginx to serve up the file with an X-Accel-Mapping header.
There's nothing particularly special about the X-Accel-Mapping header. Perhaps the page makes the HTTP request with AJAX and uses the X-Accel-Mapping header value to trigger the download?
Here's how I'd do it with urllib2:
# First request: fetch the page whose response carries the X-Accel-Mapping header
response = urllib2.urlopen(url_to_get_x_accel_mapping_header)
# The header value is the actual download URL
download_url = response.headers['X-Accel-Mapping']
# Second request: fetch the file contents themselves
download_contents = urllib2.urlopen(download_url).read()
import urllib

URL = YOUR_URL
# Use everything after the last '/' as the local filename
IMAGE = URL.rsplit('/', 1)[1]
urllib.urlretrieve(URL, IMAGE)
For details to dynamically download images from a url list visit here
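On Python 3, a rough equivalent with requests, streaming the body to disk (the URL is a placeholder):
import requests

url = "http://example.com/generate.png"  # placeholder URL
filename = url.rsplit("/", 1)[1]

r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)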
