I am thinking of downloading cplusplus.com's C library reference using Python. I want to download it completely and then turn it into a linked document, like the Python documentation. This is my initial attempt at downloading the front page.
#! python3
import urllib.request
filehandle = urllib.request.urlopen('http://www.cplusplus.com/reference/clibrary/')
with open('test.html', 'w+b') as f:
    for line in filehandle:
        f.write(line)
filehandle.close()
The front page downloads correctly, but it looks quite different from the original webpage: the nice formatting of the original page is gone when I open the file that the script saved.
What's the reason for this?
This occurs because your scraped version doesn't include the Cascading Style Sheets (CSS) linked to by the page. It also won't include any images or javascript linked to either. If you want to obtain the linked files, you'll have to parse the source code you scrape for them.
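If you do want to collect those linked files, a minimal sketch of the parsing step using the standard library's html.parser might look like the following; the AssetCollector class name and the choice to look only at link, img and script tags are my own assumptions, not anything dictated by the page:
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    """Collect URLs of stylesheets, scripts and images referenced by a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and 'href' in attrs:      # stylesheets and other linked resources
            self.assets.append(urljoin(self.base_url, attrs['href']))
        elif tag in ('img', 'script') and 'src' in attrs:
            self.assets.append(urljoin(self.base_url, attrs['src']))

with open('test.html', 'rb') as f:
    html = f.read().decode('utf-8', errors='replace')

collector = AssetCollector('http://www.cplusplus.com/reference/clibrary/')
collector.feed(html)
print(collector.assets)   # these are the files you would also need to download
Each collected URL would then need to be downloaded and saved alongside test.html for the local copy to pick up its styling again.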
I have more than 5000 pdf files, each with at least 15 and at most 20 pages. I used pypdf2 to find out which of the 5000 pdf files contain the keyword I am looking for, and on which page.
Now I have the following data:
I was wondering if there is a way for me to get the specific article on the specific page using this data. I know now which filenames to check and which page.
Thanks a lot.
There is a library called tika. It can extract the text from a single page. You can split your pdf so that only the page in question is left. Then you can use:
from tika import parser

parsed_page = parser.from_file('sample.pdf')
print(parsed_page['content'])
NOTE: This library requires Java to be installed on the system
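For the splitting step mentioned above, a rough sketch using PyPDF2 (which you already have) could look like this; the file name and page number are placeholders, and the single-page output is what you would then hand to tika:
from PyPDF2 import PdfFileReader, PdfFileWriter

def extract_single_page(src_path, page_number, dst_path):
    """Write just one page (0-based page_number) of src_path to dst_path."""
    reader = PdfFileReader(src_path)
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(page_number))
    with open(dst_path, 'wb') as out:
        writer.write(out)

# Placeholder values: page 15 of the report is index 14.
extract_single_page('sample.pdf', 14, 'sample_page15.pdf')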
Goal: automate the download of various .csv files from https://wyniki.tge.pl/en/wyniki/archiwum/2/?date_to=2018-03-21&date_from=2018-02-19&data_scope=contract&market=rtee&data_period=3 using Python (this is not the main issue, though)
Specifics: in particular, I am trying to download the csv file for the "Settlement price" and "BASE Year"
Problem: when I look at the source code for this web page, I see references to the "Upload" button, but I don't see references to the csv file (to be fair, I am not very good at reading source code). As I am using Python (urllib), I need to know the URL of the csv file but don't know how to get it.
This is not a question of Python per se, but about how to find the URL of some .csv that can be downloaded from a web page. Hence, no code is provided.
If you inspect the source code from that webpage in particular, you will see that the form to obtain the csv file has 3 main inputs:
file_type
fields
contracts
So, to obtain the csv file for the "Settlement price" and "BASE Year", you would simply do a POST request to that same URL, passing these as the payload:
file_type=2&fields=4&contracts=4
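A sketch of that POST with the requests library, using the payload above; the output filename is arbitrary, and I haven't checked whether the endpoint needs extra headers or cookies:
import requests

url = ('https://wyniki.tge.pl/en/wyniki/archiwum/2/'
       '?date_to=2018-03-21&date_from=2018-02-19'
       '&data_scope=contract&market=rtee&data_period=3')
payload = {'file_type': 2, 'fields': 4, 'contracts': 4}

response = requests.post(url, data=payload)
response.raise_for_status()

# Arbitrary local filename for the downloaded csv.
with open('settlement_base_year.csv', 'wb') as f:
    f.write(response.content)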
I would recommend using wget with Python. wget is a tool for downloading any file. Once you have downloaded the file with wget, you can manipulate the csv file using another library.
I found this wget library for python.
https://pypi.python.org/pypi/wget
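A minimal usage sketch of that library; the URL here is a placeholder, since finding the real csv URL is exactly the open question:
import wget

# Placeholder: substitute the real csv URL once you have it.
csv_url = 'https://example.com/path/to/report.csv'
saved_path = wget.download(csv_url)
print('\nSaved to', saved_path)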
Regards.
Eduardo Estevez.
I have a URL in a web analytics reporting platform that basically triggers a download/export of the report you're looking at. The downloaded file itself is a CSV, and the link that triggers the download uses several attached parameters to define things like the fields in the report. What I am looking to do is download the CSV that the link triggers a download of.
I'm using Python 3.6, and I've been told that the server I'll be deploying on does not support Selenium or any webkits like PhantomJS. Has anyone successfully accomplished this?
If the file is a CSV file, you might want to consider downloading its content directly using the requests module, something like this:
import requests

url = 'link of the page here'  # the export link from your analytics platform
session = requests.Session()
information = session.get(url)
Then you can decode the response and read its contents using the csv module, something like this (note the import of the csv module):
import csv

decoded_information = information.content.decode('utf-8')
data = decoded_information.splitlines()
data = csv.DictReader(data)
You can use a for loop to access each row in the data as you wish using the column headings as dictionary keys like so:
for row in data:
    itemdate = row['Date']
    ...
Or you can save the decoded contents by writing them to a file with something like this:
decoded_information = information.content.decode('utf-8')
with open("filename.csv", "w") as f:
    f.write(decoded_information)
A couple of links to documentation on the csv module are provided here in case you haven't used it before:
https://docs.python.org/2/library/csv.html
http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
Hope this helps!
Last week I defined a function to download pdfs from a journal website. I successfully downloaded several pdfs using:
import urllib2
def pdfDownload(url):
    response = urllib2.urlopen(url)
    expdf = response.read()
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
I tried this function out with:
pdfDownload('http://pss.sagepub.com/content/26/1/3.full.pdf')
At the time, this was how the URLs on the journal Psychological Science were formatted. The pdf downloaded just fine.
I then went to write some more code to actually generate the URL lists and name the files appropriately so I could download large numbers of appropriately named pdf documents at once.
When I came back to join my two scripts together (sorry for non-technical language; I'm no expert, have just taught myself the basics) the formatting of URLs on the relevant journal had changed. Following the previous URL takes you to a page with URL 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'. And now the pdfDownload function doesn't work anymore (either with the original URL or new URL). It creates a pdf which cannot be opened "because the file is not a supported file type or has been damaged".
I'm confused because, to me, it seems like all that has changed is the formatting of the URLs, but presumably something else must have changed to cause this? Any help would be hugely appreciated.
The problem is that the new URL points to a webpage, not the original PDF. If you print the value of "expdf", you'll get a bunch of HTML rather than the binary data you're expecting.
I was able to get your original function working with a small tweak: I used the requests library to download the file instead of urllib2. requests appears to pull the file via the loader referenced in the HTML that your current implementation returns. Try this:
import requests
def pdfDownload(url):
    response = requests.get(url)
    expdf = response.content
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
Whether you're using Python 2.7 or Python 3, you'll need to pip install requests if you don't already have it; it isn't part of the standard library.
I want to save a visited page on disk as a file. I am using urllib and URLopener.
I chose the site http://emma-watson.net/. The file is saved correctly as .html, but when I open it I notice that the main picture at the top, which contains links to other subpages, is not displayed, and neither are some other elements (like the POTD). How can I save the page so that everything on it is saved to disk?
import urllib

def saveUrl(url):
    testfile = urllib.URLopener()
    testfile.retrieve(url, "file.html")
    ...

saveUrl("http://emma-watson.net")
(Screenshots were attached comparing the real page with the file opened from my disk.)
What you're trying to do is create a very simple web scraper (that is, you want to find all the links in the file, and download them, but you don't want to do so recursively, or do any fancy filtering or postprocessing, etc.).
You could do this by using a full-on web scraper library like scrapy and just restricting it to a depth of 1 and not enabling anything else.
Or you could do it manually. Pick your favorite HTML parser (BeautifulSoup is easy to use; html.parser is built into the stdlib; there are dozens of other choices). Download the page, then parse the resulting file, scan it for img, a, script, etc. tags with URLs, then download those URLs as well, and you're done.
If you want this all to be stored in a single file, there are a number of "web archive file" formats that exist, and different browsers (and other tools) support different ones. The basic idea of most of them is that you create a zipfile with the files in some specific layout and some extension like .webarch instead of .zip. That part's easy. But you also need to change all the absolute links to be relative links, which is a little harder. Still, it's not that hard with a tool like BeautifulSoup or html.parser or lxml.
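For the link-rewriting part, a rough sketch with BeautifulSoup might look like this; the asset_dir layout simply mirrors the example further down and is an assumption, not part of any archive format:
import urlparse
import bs4

def make_img_links_relative(html, asset_dir='file.html_files'):
    # Assumes every image was saved under asset_dir using the path
    # component of its original URL, as in the example further down.
    soup = bs4.BeautifulSoup(html)
    for img in soup('img'):
        src = img.get('src')
        if src:
            path = urlparse.urlparse(src).path
            img['src'] = asset_dir + '/' + path.lstrip('/')
    return str(soup)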
As a side note, if you're not actually using the URLopener for anything, you're making life harder for yourself for no good reason; just use urlopen. Also, as the docs mention, you should be using urllib2, not urllib; in fact urllib.urlopen is deprecated as of 2.6. And, even if you do need to use an explicit opener, as the docs say, "Unless you need to support opening objects using schemes other than http:, ftp:, or file:, you probably want to use FancyURLopener."
Here's a simple example (enough to get you started, once you decide exactly what you do and don't want) using BeautifulSoup:
import os
import urllib2
import urlparse
import bs4
def saveUrl(url):
    # Save the page itself.
    page = urllib2.urlopen(url).read()
    with open("file.html", "wb") as f:
        f.write(page)
    # Parse the downloaded HTML and fetch every image it references.
    soup = bs4.BeautifulSoup(page)
    for img in soup('img'):
        imgurl = img['src']
        imgpath = urlparse.urlparse(imgurl).path
        imgpath = 'file.html_files/' + imgpath
        imgdir = os.path.dirname(imgpath)
        if not os.path.isdir(imgdir):
            os.makedirs(imgdir)
        img = urllib2.urlopen(imgurl)
        with open(imgpath, "wb") as f:
            f.write(img.read())
saveUrl("http://emma-watson.net")
This code won't work if there are any images with relative links. To handle that, you need to call urlparse.urljoin to attach a base URL. And, since the base URL can be set in various different ways, if you want to handle every page anyone will ever write, you will need to read up on the documentation and write the appropriate code. It's at this point that you should start looking at something like scrapy. But, if you just want to handle a few sites, just writing something that works for those sites is fine.
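To illustrate just the urljoin behaviour (these example URLs are made up), the change in the loop above would be to resolve each src against the page URL before downloading it, i.e. imgurl = urlparse.urljoin(url, img['src']):
import urlparse

base_url = 'http://emma-watson.net/some/page.html'  # made-up page URL for illustration
print(urlparse.urljoin(base_url, '/images/banner.jpg'))            # absolute path on the same host
print(urlparse.urljoin(base_url, 'thumbs/potd.jpg'))               # relative to the page itself
print(urlparse.urljoin(base_url, 'http://cdn.example.com/x.png'))  # already absolute, left unchanged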
Meanwhile, if any of the images are loaded by JavaScript after page-load time—which is pretty common on modern websites—nothing will work, short of actually running that JavaScript code. At that point, you probably want a browser automation tool like Selenium or a browser simulator tool like Mechanize+PhantomJS, not a scraper.