With urllib.urlretrieve('http://page.com', 'page.html') I can save the index page, and only the index page, of page.com. Does urlretrieve offer something similar to wget -r that lets me download the entire web page structure, with all related HTML files, of page.com?
Regards
Not directly.
If you want to spider over an entire site, look at mechanize: http://wwwsearch.sourceforge.net/mechanize/
This will let you load a page and follow links from it.
Something like:
import mechanize
br = mechanize.Browser()
br.open('http://stackoverflow.com')
for link in br.links():
    print(link)
    response = br.follow_link(link)
    html = response.read()
    # save your downloaded page here
    br.back()
As it stands, this will only get you the pages one link away from your starting point. You could easily adapt it to cover an entire site, though.
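For example, a breadth-first crawl along these lines could cover the whole site. This is only a rough sketch: the same-site check, the page limit, and the function name are assumptions to adjust for your case.

import mechanize

def crawl_site(start_url, max_pages=100):
    br = mechanize.Browser()
    seen = set()
    queue = [start_url]
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            response = br.open(url)
            pages[url] = response.read()
            links = list(br.links())
        except Exception:
            continue  # skip broken links and non-HTML responses
        for link in links:
            # only queue links that stay on the same site
            if link.absolute_url.startswith(start_url):
                queue.append(link.absolute_url)
    return pages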
If you really just want to mirror an entire site, use wget. Doing this in Python is only worthwhile if you need some kind of clever processing (handling JavaScript, selectively following links, etc.).
Related
A webpage I would like to scrape consists of several files.
I'm interested in scraping only the highlighted one, that is, mboxFrame.
My method of scraping pages
import requests
from bs4 import BeautifulSoup
webPage = requests.get(URL, verify=False)
soup = BeautifulSoup(webPage.content, "html.parser" )
is able to scrape only the file mail.html. Is there a way to scrape only what I want?
I would appreciate any hints or tips.
The way to open a file from a server is to request it with a URL.
In fact, in the beginnings of the world wide web this was the only way to get content: content creators would put various files on servers and clients would open or download those files. The dynamic processing of URIs and parameters is a later invention. That is why commenters are asking for the URL you use. We want to see it and modify accordingly to help you see what parts need changing in order to get that particular file. You can omit the password, or replace it with some other string of letters.
In general, the file you want would be under the url you use, but ending with the file name.
If the starting URL is www.example.com/mail/, then this file would be at www.example.com/mail/mbox.msc.
Please note that any parameters should follow the path, so www.example.com/mail?user=hendrra&password=hendras_password would turn into
www.example.com/mail/mbox.msc?user=hendrra&password=hendras_password
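If that guess is right, a request like the following would fetch just that file. This is a sketch built on the hypothetical example.com URL and credentials used above; swap in your real values.

import requests

url = "http://www.example.com/mail/mbox.msc"   # guessed location of the file
params = {"user": "hendrra", "password": "hendras_password"}
response = requests.get(url, params=params, verify=False)
with open("mbox.msc", "wb") as f:
    f.write(response.content)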
I am using the requests library in python and attempting to scrape a website that has lots of public reports and documents in .pdf format. I have successfully done this on other websites, but I have hit a snag on this one: the links are javascript functions (objects? I don't know anything about javascript) that redirect me to another page, which then has the raw pdf link. Something like this:
import requests
from bs4 import BeautifulSoup as bs
url = 'page with search results.com'
html = requests.get(url).text
soup = bs(html)
obj_list = soup.findAll('a')
for a in obj_list:
    link = a['href']
    print(link)
>> javascript:readfile2("F","2201","2017_2201_20170622F14.pdf")
Ideally I would like a way to find what url this would navigate to. I could use selenium and click on the links, but there are a lot of documents and that would be time- and resource-intensive. Is there a way to do this with requests or a similar library?
Edit: It looks like every link goes to the same url, which loads a different pdf depending on which link you click. This makes me think that there is no way to do this in requests, but I am still holding out hope for something non-selenium-based.
There is probably a base URL under which all of these PDF files live.
You need to find the URL at which a PDF actually opens after you click one of the links (your browser's dev tools will show the request).
Once you have that URL, parse the PDF file name out of the anchor's href.
Then append the PDF name to that base URL and request the resulting URL.
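Something along these lines might work. It is only a sketch: BASE_URL is an assumption, so take the real value from the request your browser makes when you click a link, and the regex is based on the sample href shown in the question.

import re
import requests
from bs4 import BeautifulSoup as bs

BASE_URL = 'http://example.com/files/'   # hypothetical, find the real base in the network tab
url = 'page with search results.com'
soup = bs(requests.get(url).text, 'html.parser')

for a in soup.find_all('a', href=True):
    # hrefs look like: javascript:readfile2("F","2201","2017_2201_20170622F14.pdf")
    match = re.search(r'readfile2\(".*?",".*?","(.*?\.pdf)"\)', a['href'])
    if match:
        pdf_url = BASE_URL + match.group(1)
        print(pdf_url)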
My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables.
So far I have imported mechanize (for browsing the pages/finding the pdf files) and I have pdfminer, however I'm not sure how to use it in a script to perform the same functionality it does on the command line.
What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for stackoverflow, but I'm having trouble using google searches and sparse documentation to piece together how to code this. Thanks!
EDIT:
So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem() class to use with my spider, with fields title and url. I have a selector that's grabbing all the links I want, and I want to go through these links and create a PDFItem for each one. Here's the code I have below:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
    item['title'] = link.xpath('/text()')
    item['url'] = URL + link.xpath('@href').extract()[0]
The url line works well, but I don't really know how to do the same for title. I guess I could just perform the query at the top again with '/text()' added to the end of the selector, but that seems excessive. Is there a better way to just go through each link object in the links array and grab the text and href values?
I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
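Calling it from Python could look roughly like this. The file names are placeholders, and pdftohtml's flags may need adjusting for the output layout you want.

import subprocess

# convert report.pdf into HTML files named report*.html
subprocess.call(["pdftohtml", "report.pdf", "report"])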
Edit:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    item = PDFItem()
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
In order to browse a webpage and find PDF links, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.
PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
So the libraries you need are an HTTP library, like Requests, and PDFMiner.
The workflow of your script will be something like:
import os
import requests
from subprocess import Popen
...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
    ...
    # Build the command line parameters, the way pdf2txt expects
    # Invoke PDFMiner's pdf2txt on the created file as a subprocess
    Popen(cmd)
More info on using Requests to save the PDF as a local file is in the Requests documentation; more info on running programs as subprocesses is in the subprocess module docs.
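One iteration of that loop might look roughly like this. The URL and file names are placeholders, and the pdf2txt.py options should be checked against your PDFMiner version.

import requests
from subprocess import Popen

pdf_url = "http://example.com/reports/report.pdf"   # placeholder
r = requests.get(pdf_url)
with open("report.pdf", "wb") as f:
    f.write(r.content)

# -o names the output file; the .html extension selects HTML output
Popen(["pdf2txt.py", "-o", "report.html", "report.pdf"]).wait()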
I am working on a project where I parse the HTML of web pages. So, I took my blog (a Blogger blog with a Dynamic Template) and tried to read its content. Unfortunately, I failed to get at the "actual" source of the blog's webpage.
Here is what I observed:
1. I clicked "view source" on a random article of my blog and tried to find the content in it, and I couldn't find any. It was all JavaScript.
2. So, I saved the webpage to my laptop and checked the source again; this time I found the content.
3. I also checked the source using the developer tools in browsers and again found the content in it.
4. Then I tried the Python way:
import urllib
from bs4 import BeautifulSoup
soup = BeautifulSoup( urllib.urlopen("my-webpage-address") )
print soup.prettify()
Even then, I didn't find the content in the HTML code.
Finally, why am I unable to find the content in the source code in cases 1 and 4?
How should I get the actual HTML code? I would be glad to hear of any Python library that would do the job.
The content is loaded via JavaScript (AJAX). It's not in the "source".
In step 2, you are saving the resulting page, not the original source. In step 3, you're seeing what's being rendered by the browser.
Steps 1 and 4 "don't work" because you're getting the page's source (which doesn't contain the content). You need to actually run the JavaScript, which isn't easy for a screen scraper to do.
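One common workaround is to let a real browser run the JavaScript and then scrape the rendered DOM, for example with Selenium. This is a minimal sketch; it assumes Selenium and a Chrome driver are installed, and "my-webpage-address" is the placeholder URL from the question.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://my-webpage-address")
# page_source holds the DOM after the JavaScript has run
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())
driver.quit()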
Quite often I have to download PDFs from websites, but sometimes they are not all on one page.
The links are split across paginated pages, and I have to click through every page to get the links.
I am learning Python and I want to write a script where I can put in the web URL and it extracts the PDF links from that website.
I am new to Python, so can anyone please give me directions on how I can do this?
Pretty simple with urllib2, urlparse and lxml. I've commented things more verbosely since you're new to Python:
# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse
# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'
# fetch the page
res = urllib2.urlopen(base_url)
# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())
# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}
# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])
Result:
http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
If there are a lot of pages with links, you can try the excellent framework Scrapy (http://scrapy.org/).
It is pretty easy to understand how to use it, and it can download the PDF files you need.
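A spider for this could look roughly like the following sketch. The spider name is arbitrary, the start URL is the RenderX example page from the answer above, and the "a.next" pagination selector is an assumption to replace with whatever your site uses.

import scrapy

class PdfLinkSpider(scrapy.Spider):
    name = "pdf_links"
    start_urls = ["http://www.renderx.com/demos/examples.html"]  # put your URL here

    def parse(self, response):
        # yield every link that ends in .pdf
        for href in response.css("a::attr(href)").extract():
            if href.lower().endswith(".pdf"):
                yield {"pdf_url": response.urljoin(href)}
        # follow pagination links and parse those pages the same way
        # (the "a.next" selector is a guess; adjust it for your site)
        for next_page in response.css("a.next::attr(href)").extract():
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)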
I'm on my phone, so this may not be very readable.
If the website you are going to grab things from is all static pages, you can easily grab the HTML with requests:
import requests
page_content = requests.get(url)
But if you are grabbing from something like a community site, there will be anti-scraping measures, and getting past them is the real problem.
First way: make your requests look more like a browser (a human).
Add the headers (you can use the dev tools in Chrome, or Fiddler, to copy them).
Build the right POST form; it should copy the form data your browser sends.
Get the cookies and add them to the requests.
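A sketch of that approach is below. The URLs, header values, and form fields are placeholders; copy the real ones from your browser's dev tools.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # copied from dev tools
    "Referer": "http://example.com/login",
}
session = requests.Session()   # the session keeps cookies between requests
session.post("http://example.com/login",   # placeholder login URL and form fields
             data={"user": "me", "password": "secret"},
             headers=headers)
page_content = session.get("http://example.com/members/page", headers=headers)
print(page_content.text)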
Second way: use Selenium and a browser driver. Selenium drives a real browser (in my case, chromedriver).
Remember to add chromedriver to your PATH, or load the driver executable from code:
from selenium import webdriver
driver = webdriver.Chrome(path_to_chromedriver)  # I'm not sure this is exactly the right setup code
driver.get(url)
It really surfs the URL with a browser, so it lowers the difficulty of grabbing things.
To get the web page:
page = driver.page_source
Some websites jump through several pages, which can cause errors. Make your script wait for a certain element to show up:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "an-element-id-you-know-is-there")))
Or use an implicit wait, for however long you like:
driver.implicitly_wait(5)  # seconds
And you can control the website through WebDriver; I'm not going to describe that here, you can look up the module's documentation.