I made a little script to automate some downloads, but I have a small issue. For example, the link is www.link.com/29292292.pdf, but if I press the download button in my browser, the file is saved as 'file with a good name.pdf'.
I don't know how to tell requests to download the file under the name 'file with a good name.pdf'.
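A browser usually takes that nicer name from the Content-Disposition response header. Assuming this server sends one (the question does not confirm it), a minimal sketch of reading the name from the headers, with the last URL segment as a fallback, could look like this. The helper name filename_from_headers is made up for illustration.

```python
import os
from email.message import Message
from urllib.parse import unquote, urlparse


def filename_from_headers(headers, url):
    """Return the server-suggested file name, or the last URL segment.

    `headers` is any mapping with a .get() method (hypothetical helper,
    not part of requests).
    """
    value = headers.get("Content-Disposition")
    if value:
        # Let the stdlib email parser handle quoting of the filename parameter.
        msg = Message()
        msg["Content-Disposition"] = value
        name = msg.get_filename()
        if name:
            # Drop any directory part the server may have included.
            return os.path.basename(name.replace("\\", "/"))
    # Fallback: last path segment of the URL, percent-decoded.
    path = unquote(urlparse(url).path)
    return os.path.basename(path) or "download"
```

With requests, r.headers can be passed in directly after r = requests.get(url), and the returned name used in open(name, 'wb').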
For reasons I'm not going to get into, I need to get the name of the file that is being downloaded in a given browser instance.
Is there a way, using Selenium, to get the name of the file being downloaded? I'm not talking about waiting for it to finish downloading and getting the name afterwards.
I am using Selenium with Chrome and Python.
So I ended up going in another direction with this: I simply create a folder for each browser instance I open, each instance downloads only into its own folder, and I retrieve the name from there.
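The folder-per-instance approach can be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, and it assumes Chrome marks in-progress downloads with a .crdownload suffix. The Selenium wiring is left as comments so the block runs without a browser.

```python
import os
import tempfile


def new_download_dir(prefix="browser-"):
    """Create a fresh folder that one browser instance will download into."""
    return tempfile.mkdtemp(prefix=prefix)


def chrome_prefs(download_dir):
    """Chrome preferences that send downloads to download_dir without a prompt."""
    return {
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
    }


def finished_downloads(download_dir):
    """Names of files in the folder that are not partial downloads."""
    return sorted(
        name for name in os.listdir(download_dir)
        if not name.endswith((".crdownload", ".tmp"))
    )


# Wiring with Selenium (needs selenium and a Chrome driver installed):
#   from selenium import webdriver
#   folder = new_download_dir()
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", chrome_prefs(folder))
#   driver = webdriver.Chrome(options=options)
#   ... trigger the download, then poll finished_downloads(folder)
```

Because each instance has its own folder, whatever appears in it can only have come from that instance.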
Trying to automate downloading a .zip file from the link here (the links will always be different, but they are always in this format):
If this link is entered into a web browser, it downloads a file called Badges.zip. When trying to download it from Python with the code below, it saves to Badges.zip, but the .zip is not an archive. It's some Google Analytics code. It's like the requests module is not redirecting all the way to the file. I've tried get, head, trying to stream the download, and lots of other ways and I can't get it to download the file correctly. Here's the current code I'm using:
import requests
url = "https://schools.clever.com/files/badges.zip?fromEmail=1&randomID=5f9cffb0ee8c81418ac2e019"
r = requests.get(url, allow_redirects=True)
open('c:/data/Badges.zip', 'wb').write(r.content)
I'm open to any ideas. Have tried other modules and get similar results. I'm even open to kicking off external utilities if needed like wget or curl (which I haven't had any luck with yet either).
Note that the Clever Badges in this download have been voided to prevent use.
Thanks!
I was looking for a way to download pdf files in python, and I saw answers on other questions recommending the urllib module. I tried to download a pdf file using it, but when I try to open the downloaded file, a message shows up saying that the file cannot be opened.
[Image: error message]
This is the code I used-
import urllib
urllib.urlretrieve("http://papers.gceguide.com/A%20Levels/Mathematics%20(9709)/9709_s11_qp_42.pdf", "9709_s11_qp_42.pdf")
What am I doing wrong? Also, the file automatically saves to the directory my python file is in. How do I change the location to which it gets saved?
Edit-
I tried again with the link to a sample pdf, http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf
The code is working with this link, so why won't it work for the other one?
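On the second part of the question (where the file ends up): urlretrieve saves to whatever path is given as its second argument, so a relative name lands in the current working directory. A minimal sketch for Python 3, where urlretrieve lives in urllib.request, with a hypothetical helper name and folder:

```python
import os
from urllib.request import urlretrieve


def download_to(url, folder, filename):
    """Download url into folder/filename and return the full path."""
    os.makedirs(folder, exist_ok=True)      # create the target folder if needed
    target = os.path.join(folder, filename)
    path, _headers = urlretrieve(url, target)
    return path
```

For example, download_to(url, "C:/papers", "9709_s11_qp_42.pdf") would save into C:/papers instead of next to the script. This only changes the location; it does not explain why the first link produces an unreadable file.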
Try this. It works.
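import requests

url = 'https://pdfs.semanticscholar.org/c029/baf196f33050ceea9ecbf90f054fd5654277.pdf'
r = requests.get(url, stream=True)
r.raise_for_status()  # stop here if the server returned an HTTP error
with open('C:/Users/MICRO HARD/myfile.pdf', 'wb') as f:
    # with stream=True, write the body in chunks instead of loading it all at once
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)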
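You can also use the wget package (a third-party module, installed with pip install wget) to download PDFs from a link:
import wget
# link is the URL of the PDF to download
wget.download(link)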
Here's a guide about how to search & download all pdf files from a webpage in one go: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
You can't download the PDF content from the given URL with requests or urllib alone, because the URL first points to another web page, and only that page then loads the PDF.
If in doubt, save the response as HTML instead of PDF and open it.
For pages like this, you need a headless browser such as PhantomJS to download the file.
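A quick way to check what actually came back, without saving and opening the file, is to look at the first bytes of the response body. The sketch below is a rough heuristic, not a full file-type detector, and the function name sniff is made up for illustration.

```python
def sniff(data):
    """Guess whether the downloaded bytes are a PDF, a zip archive, or an HTML page."""
    head = data[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"                      # PDF files start with %PDF-
    if head.startswith(b"PK\x03\x04"):
        return "zip"                      # local file header of a zip archive
    if b"<html" in head.lower() or head.lower().startswith(b"<!doctype html"):
        return "html"                     # the server sent a web page instead
    return "unknown"
```

With requests this could be called as sniff(r.content). A result of "html" suggests that the URL returned an intermediate page rather than the file itself.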
I'm trying to export a CSV from this page via a python script. The complicated part is that the page opens after clicking the export button on this page, begins the download, and closes again, rather than just hosting the file somewhere static. I've tried using the Requests library, among other things, but the file it returns is empty.
Here's what I've done:
from requests import get

url = 'http://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx?exportAll=True&%3bexportFormat=CSV&%3bisExport=True%22+id%3d%22M_C_sCDTransactions_csfFilter_ExportDialog_hlAllCSV?exportAll=True&exportFormat=CSV&isExport=True'
with open('CD_Transactions_02-27-2017.CSV', "wb") as file:
    # get request
    response = get(url)
    # write to file
    file.write(response.content)
I'm sure I'm missing something obvious, but I'm pulling my hair out.
It looks like the file is being generated on demand, and the url stays only valid as long as the session lasts.
There are multiple requests from the browser to the webserver (including POST requests).
So to get those files via code, you would have to simulate the browser, possibly including session state etc. (and in this case also __VIEWSTATE).
To see the whole communication, you can use the developer tools in the browser (usually F12, then select the Network tab to see the traffic), or use something like Wireshark.
In other words, this won't be an easy task.
If this is open government data, it might be better to just ask that government for the data or ask for possible direct links to the (unfiltered) files (sometimes there is a public ftp server for example) - or sometimes there is an API available.
The file is created on demand but you can download it anyway. Essentially you have to:
1. Establish a session to save cookies and viewstate
2. Submit a form in order to click the export button
3. Grab the link which lies behind the popped-up csv-button
4. Follow that link and download the file
You can find working code here (if you don't mind that it's written in R): Save response from web-scraping as csv file
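For a Python flavour of the first two steps, here is a rough, untested-against-this-site sketch using only the standard library. It assumes the page is a typical ASP.NET form whose state lives in hidden inputs such as __VIEWSTATE. The function names, button_name and button_value are placeholders that would have to be read from the real page; steps 3 and 4 depend on the markup of the export dialog and are not shown.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields (e.g. __VIEWSTATE) found in html."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def submit_export_form(page_url, button_name, button_value="Export"):
    """Steps 1 and 2: keep cookies in one opener, then post the form back.

    button_name / button_value are assumptions; take the real ones from
    the page source or the browser's network log.
    """
    opener = build_opener(HTTPCookieProcessor(CookieJar()))  # session cookies
    with opener.open(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    form = hidden_fields(html)          # viewstate and friends
    form[button_name] = button_value    # simulate clicking the export button
    with opener.open(page_url, data=urlencode(form).encode("ascii")) as resp:
        return resp.read()              # page that should contain the CSV link
```

The same idea works with requests.Session, which keeps cookies across calls in the same way.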
Can you please help me write a Python script that does the following:
1. Download a zip file over HTTP (I already have code for this one).
2. Download a zip file from file://<server location>. I have a problem with this one; the file is located at file://<server location>file.zip.
I can't download file #2 :(
The code is below. #1 works over HTTP, but it doesn't work with file:////. Does anybody have an idea how to download a zip file from file:////?
import urllib2
response = urllib2.urlopen('file:////server/file.zip')
print response.info()
html = response.read()
# do something
response.close() # best practice to close the file
urllib2's file:// handler is meant for files on the local machine, so it is not a good fit for a path on another server, and urlopen() needs a scheme, so passing //server/file.zip without one will not work either. If the share is reachable as an ordinary path (for example //server/file.zip), you can just use open() and read() rather than urllib2.
Your code will be simpler if you use with closing() (from contextlib); opened files are already context managers in Python 2.7 and 3.x, so they're even easier to use.
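Both suggestions can be sketched as follows. This is written for Python 3 (the question uses Python 2's urllib2, so the import differs there), and the function names and paths are placeholders.

```python
import shutil
from contextlib import closing
from urllib.request import urlopen


def copy_from_share(src_path, dst_path):
    """Copy a file that is reachable as a normal path, e.g. //server/file.zip."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)    # streams in chunks, files closed on exit
    return dst_path


def read_url(url):
    """Read a URL, closing the response even if read() raises."""
    with closing(urlopen(url)) as response:
        return response.read()
```

A call such as copy_from_share('//server/file.zip', 'file.zip') avoids URL handling entirely, which is usually the simpler route when the server is just a file share.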