Last week I defined a function to download PDFs from a journal website. I successfully downloaded several PDFs using:
import urllib2

def pdfDownload(url):
    response = urllib2.urlopen(url)
    expdf = response.read()
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
I tried this function out with:
pdfDownload('http://pss.sagepub.com/content/26/1/3.full.pdf')
At the time, this was how the URLs for the journal Psychological Science were formatted. The PDF downloaded just fine.
I then went on to write some more code to actually generate the URL lists and name the files, so I could download large numbers of appropriately named PDF documents at once.
When I came back to join my two scripts together (sorry for the non-technical language; I'm no expert, I've just taught myself the basics), the formatting of URLs on the relevant journal had changed. Following the previous URL now takes you to a page with the URL 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'. And now the pdfDownload function doesn't work anymore (with either the original URL or the new one). It creates a PDF which cannot be opened "because the file is not a supported file type or has been damaged".
I'm confused because, to me, it seems like all that has changed is the formatting of the URLs, but presumably something else must have changed to cause this? Any help would be hugely appreciated.
The problem is that the new URL points to a webpage, not the original PDF. If you print the value of expdf, you'll get a bunch of HTML rather than the binary data you're expecting.
I was able to get your original function working with a small tweak: I used the requests library to download the file instead of urllib2. requests appears to pull the file via the loader referenced in the HTML you're getting from your current implementation. Try this:
import requests

def pdfDownload(url):
    # response.content holds the raw bytes of the download
    response = requests.get(url)
    expdf = response.content
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
Note that requests is a third-party library on both Python 2.7 and Python 3, so if it isn't already installed you'll need to pip install requests.
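As a quick sanity check (a sketch, not part of the original answer; the filename is just an example), you can look at the response's Content-Type header before writing the file. If the server is handing back an HTML landing page instead of the PDF itself, that shows up immediately:

import requests

url = 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'
response = requests.get(url)

# Expect something like 'application/pdf'; 'text/html' means you got a webpage
print(response.headers.get('Content-Type'))

# Only write the file if it actually looks like a PDF
if 'pdf' in response.headers.get('Content-Type', ''):
    with open('ex.pdf', 'wb') as egpdf:
        egpdf.write(response.content)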
Related
I'm attempting to download college basketball data from barttorvik.com (specifically https://barttorvik.com/team-tables_each.php), a site which aggregates many valuable statistics. There isn't a 'download' link on the site per se, so in order to collect the data, one must append the query string '?csv=1' to the URL. When I do this by hand in the address bar of my web browser, it works perfectly, and the data is downloaded to my downloads folder with a default name (trank_team_table_data.csv). The problem arises when I attempt to automate this in a Python script. The exact method I use in Python is
import pandas as pd
data = pd.read_table('https://barttorvik.com/team-tables_each.php?csv=1', sep=',', encoding='utf-8')
which turns up errors covered in this SO post. Bypassing those errors with the destructive techniques described in that post, the file downloads and opens. However, it seems that what I get is the HTML behind the page itself rather than the data. Part of the resultant CSV (opened in Excel, downloaded using urllib.request) is shown here.
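One workaround, offered only as a sketch under the assumption that the site is rejecting non-browser clients (a common cause of the HTTP errors mentioned above), is to fetch the CSV with requests using a browser-like User-Agent header and then hand the text to pandas:

import io
import pandas as pd
import requests

url = 'https://barttorvik.com/team-tables_each.php?csv=1'

# The User-Agent value is an assumption: some sites return errors or an HTML
# page when the client doesn't identify itself as a browser
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.raise_for_status()

# Adjust read_csv options (header, names, dtypes) to match the file's actual layout
data = pd.read_csv(io.StringIO(response.text))
print(data.head())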
I have a URL in a web analytics reporting platform that triggers a download/export of the report you're looking at. The downloaded file itself is a CSV, and the link that triggers the download uses several attached parameters to define things like the fields in the report. What I am looking to do is download the CSV that the link points to.
I'm using Python 3.6, and I've been told that the server I'll be deploying on does not support Selenium or any webkits like PhantomJS. Has anyone successfully accomplished this?
If the file is a CSV file, you might want to consider downloading its content directly using the requests module, something like this:
import requests

session = requests.Session()
information = session.get(url)  # url is the link of the page here
Then you can decode the response and read the contents as you wish using the csv module (which needs to be imported), something like this:
import csv

decoded_information = information.content.decode('utf-8')
data = decoded_information.splitlines()
data = csv.DictReader(data)
You can use a for loop to access each row in the data as you wish using the column headings as dictionary keys like so:
for row in data:
    itemdate = row['Date']
    ...
Or you can save the decoded contents by writing them to a file with something like this:
decoded_information = information.content.decode('utf-8')
with open("filename.csv", "w") as output_file:
    output_file.write(decoded_information)
A couple of links to documentation on the csv module are provided here, just in case you haven't used it before:
https://docs.python.org/2/library/csv.html
http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
Hope this helps!
I want to save a visited page to disk as a file. I am using urllib and URLopener.
I chose the site http://emma-watson.net/. The file is saved correctly as .html, but when I open it I notice that the main picture at the top, which contains bookmarks to other subpages, is not displayed, and neither are some other elements (like the POTD). How can I save the page correctly so that all of it is stored on disk?
import urllib

def saveUrl(url):
    testfile = urllib.URLopener()
    testfile.retrieve(url, "file.html")
    ...

saveUrl("http://emma-watson.net")
Screenshot of the real page:
Screenshot of the file opened from my disk:
What you're trying to do is create a very simple web scraper (that is, you want to find all the links in the file, and download them, but you don't want to do so recursively, or do any fancy filtering or postprocessing, etc.).
You could do this by using a full-on web scraper library like scrapy and just restricting it to a depth of 1 and not enabling anything else.
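As a rough illustration (a sketch only, with a made-up spider name, not a tested crawler), limiting scrapy to a depth of 1 is mostly a matter of settings:

import scrapy

class PageSpider(scrapy.Spider):
    # Hypothetical spider; the name and start URL are placeholders
    name = 'page_saver'
    start_urls = ['http://emma-watson.net/']

    # Don't follow links beyond the pages one step from the start URL
    custom_settings = {'DEPTH_LIMIT': 1}

    def parse(self, response):
        # Save the page body; fetching the linked images, scripts, etc. would
        # be handled by yielding further requests or using scrapy's pipelines
        with open('file.html', 'wb') as f:
            f.write(response.body)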
Or you could do it manually. Pick your favorite HTML parser (BeautifulSoup is easy to use; html.parser is built into the stdlib; there are dozens of other choices). Download the page, then parse the resulting file, scan it for img, a, script, etc. tags with URLs, then download those URLs as well, and you're done.
If you want this all to be stored in a single file, there are a number of "web archive file" formats that exist, and different browsers (and other tools) support different ones. The basic idea of most of them is that you create a zipfile with the files in some specific layout and some extension like .webarch instead of .zip. That part's easy. But you also need to change all the absolute links to be relative links, which is a little harder. Still, it's not that hard with a tool like BeautifulSoup or html.parser or lxml.
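For the link-rewriting part, a minimal sketch (assuming BeautifulSoup, and assuming each asset has already been saved under a local directory named after the path component of its URL) might look like this:

import bs4
import urlparse

def make_links_relative(html, local_dir='file.html_files/'):
    # Rewrite absolute src/href attributes to point at locally saved copies.
    # This assumes each asset was saved under local_dir using the path part
    # of its original URL.
    soup = bs4.BeautifulSoup(html)
    for tag in soup.find_all(['img', 'script', 'link']):
        attr = 'src' if tag.has_attr('src') else 'href'
        if not tag.has_attr(attr):
            continue
        path = urlparse.urlparse(tag[attr]).path
        tag[attr] = local_dir + path.lstrip('/')
    return str(soup)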
As a side note, if you're not actually using the URLopener for anything, you're making life harder for yourself for no good reason; just use urlopen. Also, as the docs mention, you should be using urllib2, not urllib; in fact urllib.urlopen is deprecated as of 2.6. And, even if you do need to use an explicit opener, as the docs say, "Unless you need to support opening objects using schemes other than http:, ftp:, or file:, you probably want to use FancyURLopener."
Here's a simple example (enough to get you started, once you decide exactly what you do and don't want) using BeautifulSoup:
import os
import urllib2
import urlparse
import bs4
def saveUrl(url):
    # Save the page itself
    page = urllib2.urlopen(url).read()
    with open("file.html", "wb") as f:
        f.write(page)
    # Parse the downloaded HTML and fetch every image it references
    soup = bs4.BeautifulSoup(page)
    for img in soup('img'):
        imgurl = img['src']
        imgpath = urlparse.urlparse(imgurl).path
        imgpath = 'file.html_files/' + imgpath.lstrip('/')
        imgdir = os.path.dirname(imgpath)
        if imgdir and not os.path.isdir(imgdir):
            os.makedirs(imgdir)
        imgdata = urllib2.urlopen(imgurl).read()
        with open(imgpath, "wb") as f:
            f.write(imgdata)

saveUrl("http://emma-watson.net")
This code won't work if there are any images with relative links. To handle that, you need to call urlparse.urljoin to attach a base URL. And, since the base URL can be set in various different ways, if you want to handle every page anyone will ever write, you will need to read up on the documentation and write the appropriate code. It's at this point that you should start looking at something like scrapy. But, if you just want to handle a few sites, just writing something that works for those sites is fine.
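For the common case, a small sketch of that urljoin step (using the same Python 2 modules as above; the example paths are made up) would be:

import urlparse

base_url = "http://emma-watson.net/"

# urljoin resolves a relative link against the page's base URL and leaves
# absolute links untouched
print(urlparse.urljoin(base_url, "/images/banner.jpg"))    # -> http://emma-watson.net/images/banner.jpg
print(urlparse.urljoin(base_url, "http://example.com/x.png"))  # unchanged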
Meanwhile, if any of the images are loaded by JavaScript after page-load time—which is pretty common on modern websites—nothing will work, short of actually running that JavaScript code. At that point, you probably want a browser automation tool like Selenium or a browser simulator tool like Mechanize+PhantomJS, not a scraper.
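If you do end up needing JavaScript, the browser automation route looks roughly like this (a sketch, assuming Selenium and a matching browser driver are installed; the URL is the one from the question):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://emma-watson.net/")

# page_source reflects the DOM after scripts have run, so images inserted by
# JavaScript are present in it
html = driver.page_source
with open("file.html", "wb") as f:
    f.write(html.encode("utf-8"))

driver.quit()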
I'm trying to save a PDF file using urlfetch (if the URL were put into a browser, it would prompt a 'save as' dialog), but there's some kind of encoding issue.
There are lots of unknown characters showing up in the urlfetch result, as in:
urlfetch.fetch(url).text
The result has chars like this: s�*��E����
Whereas the same content in the actual file looks like this: sÀ*ÿ<81>E®<80>Ùæ
So this is presumably some sort of encoding issue, but I'm not sure how to fix it. The version of urlfetch I'm using is 1.0.
For what it's worth, the PDF I've been testing with is here: http://www.revenue.ie/en/tax/it/forms/med1.pdf
I switched to urllib instead of urlfetch, e.g.
import urllib
result = urllib.urlopen(url)
...and everything seems to be fine.
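To make that concrete, the key point is treating the response as binary rather than decoding it as text; a minimal sketch (using the Med 1 form URL from the question and a made-up output filename) is:

import urllib

url = 'http://www.revenue.ie/en/tax/it/forms/med1.pdf'
result = urllib.urlopen(url)

# Read the raw bytes and write them out in binary mode; decoding PDF bytes as
# text is the likely source of the garbled characters shown above
with open('med1.pdf', 'wb') as f:
    f.write(result.read())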
I am thinking of downloading cplusplus.com's C library reference using Python. I want to download it completely and then convert it into a linked document, like the Python documentation. This is my initial attempt at downloading the front page.
#! python3
import urllib.request

filehandle = urllib.request.urlopen('http://www.cplusplus.com/reference/clibrary/')
with open('test.html', 'w+b') as f:
    for line in filehandle:
        f.write(line)
filehandle.close()
The front page downloads correctly, but it looks quite different from the original webpage. By different I mean that the nice formatting of the original webpage is gone when I open the copy the script downloaded.
What's the reason for this?
This occurs because your scraped version doesn't include the Cascading Style Sheets (CSS) linked to by the page. It also won't include any images or JavaScript the page links to. If you want to obtain those linked files, you'll have to parse the source code you scrape for them.
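As an illustration of that last step (a sketch only, using html.parser from the standard library; it collects stylesheet, script, and image URLs, each of which would still need to be downloaded and the references rewritten):

#! python3
from html.parser import HTMLParser
import urllib.request

class AssetLinkParser(HTMLParser):
    # Collect the URLs of stylesheets, scripts, and images referenced by a page
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and attrs.get('rel') == 'stylesheet' and 'href' in attrs:
            self.assets.append(attrs['href'])
        elif tag in ('script', 'img') and 'src' in attrs:
            self.assets.append(attrs['src'])

html = urllib.request.urlopen('http://www.cplusplus.com/reference/clibrary/').read()
parser = AssetLinkParser()
parser.feed(html.decode('utf-8', 'replace'))
print(parser.assets)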