Scraping subfiles from URL using Python

A webpage I would like to scrape consists of several files.
I'm interested in scraping only one of them: mboxFrame.
My method of scraping pages
import requests
from bs4 import BeautifulSoup
webPage = requests.get(URL, verify=False)
soup = BeautifulSoup(webPage.content, "html.parser")
is able to scrape only the file mail.html. Is there a way to scrape only what I want?
I would appreciate any hints or tips.

The way to open a file from a server is to request it with a URL.
In fact, in the early days of the World Wide Web this was the only way to get content: content creators would put various files on servers, and clients would open or download those files. The dynamic processing of URIs and parameters is a later invention. That is why commenters are asking for the URL you use: we want to see it and modify it accordingly, to help you see which parts need changing in order to get that particular file. You can omit the password, or replace it with some other string of letters.
In general, the file you want would be under the url you use, but ending with the file name.
If the starting URL is www.example.com/mail/, then this file would be at www.example.com/mail/mbox.msc.
Please note that any parameters should follow the path, so www.example.com/mail?user=hendrra&password=hendras_password would turn into
www.example.com/mail/mbox.msc?user=hendrra&password=hendras_password
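As a minimal sketch of that idea, assuming the page is a frameset and the frame named mboxFrame has a src attribute pointing at the file you want (the base URL and credentials below are placeholders, not the real ones):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

URL = "https://www.example.com/mail/"            # placeholder base URL
PARAMS = {"user": "hendrra", "password": "..."}  # password omitted on purpose

# Fetch the outer page and read the frame's src instead of guessing the file name.
webPage = requests.get(URL, params=PARAMS, verify=False)
soup = BeautifulSoup(webPage.content, "html.parser")

frame = soup.find(["frame", "iframe"], attrs={"name": "mboxFrame"})
if frame is not None and frame.get("src"):
    frameURL = urljoin(URL, frame["src"])
    mboxPage = requests.get(frameURL, params=PARAMS, verify=False)
    mboxSoup = BeautifulSoup(mboxPage.content, "html.parser")
Reading the src attribute avoids having to guess the file name; whatever URL it resolves to is the one to request directly.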

Related

Automatically download cdf files from specific website in Python

I would like to create a programme that can download a specific CDF file from this website:
http://research.ssl.berkeley.edu/data/psp/data/sci/fields/l2/mag_RTN_4_Sa_per_Cyc/2018/10/
For example, I would like the user to be asked which specific date they would like to download, and the programme to download that file and store it as data.
In this site all the file names end with the date. For example:
psp_fld_l2_mag_RTN_4_Sa_per_Cyc_20181003_v01.cdf
where 20181003 means 2018/10/03 (the date).
Is this possible?
As this is a static website and doesn't involve JavaScript in loading the files, you can go ahead with requests to get the HTML from the URL using
r = requests.get(url)
You can then extract all of the file links with BeautifulSoup, and finally save each file using
r = requests.get(fetched_url, allow_redirects=True)
open(filename, 'wb').write(r.content)
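Putting those pieces together, a minimal sketch might look like this (the directory URL comes from the question; the date prompt and the file-name filter are assumptions about how you want to pick the file):
import requests
from bs4 import BeautifulSoup

BASE_URL = ("http://research.ssl.berkeley.edu/data/psp/data/sci/"
            "fields/l2/mag_RTN_4_Sa_per_Cyc/2018/10/")

date = input("Enter a date (YYYYMMDD): ")  # e.g. 20181003

# List the directory page and collect links to .cdf files for that date.
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
matches = [a["href"] for a in soup.find_all("a", href=True)
           if a["href"].endswith(".cdf") and date in a["href"]]

# Download each matching file and save it locally under its own name.
for name in matches:
    r = requests.get(BASE_URL + name, allow_redirects=True)
    with open(name, "wb") as f:
        f.write(r.content)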

Webscraping: Downloading a pdf from a javascript link

I am using the requests library in python and attempting to scrape a website that has lots of public reports and documents in .pdf format. I have successfully done this on other websites, but I have hit a snag on this one: the links are javascript functions (objects? I don't know anything about javascript) that redirect me to another page, which then has the raw pdf link. Something like this:
import requests
from bs4 import BeautifulSoup as bs
url = 'page with search results.com'
html = requests.get(url).text
soup = bs(html)
obj_list = soup.findAll('a')
for a in obj_list:
    link = a['href']
    print(link)
>> javascript:readfile2("F","2201","2017_2201_20170622F14.pdf")
Ideally I would like a way to find what url this would navigate to. I could use selenium and click on the links, but there are a lot of documents and that would be time- and resource-intensive. Is there a way to do this with requests or a similar library?
Edit: It looks like every link goes to the same url, which loads a different pdf depending on which link you click. This makes me think that there is no way to do this in requests, but I am still holding out hope for something non-selenium-based.
There is probably a base URL under which these PDF files are served.
You need to find the URL at which these PDFs actually open after you click a hyperlink.
Once you have that URL, parse the PDF name out of each link.
Then append the PDF name to the base URL (where the PDFs live) and request the resulting URL directly.
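As a sketch of that approach (the readfile2 pattern comes from the question; the search-page URL, the base URL for the raw PDFs, and the way the function's arguments map onto a path are assumptions you would confirm by opening one document manually and copying its final URL):
import re
import requests
from bs4 import BeautifulSoup as bs

SEARCH_URL = "https://example.com/search-results"  # placeholder
PDF_BASE = "https://example.com/files/"            # assumed base URL for the raw PDFs

html = requests.get(SEARCH_URL).text
soup = bs(html, "html.parser")

# Matches hrefs like: javascript:readfile2("F","2201","2017_2201_20170622F14.pdf")
pattern = re.compile(r'readfile2\("([^"]*)","([^"]*)","([^"]*)"\)')

for a in soup.find_all("a", href=True):
    m = pattern.search(a["href"])
    if not m:
        continue
    flag, doc_id, filename = m.groups()
    # Assumed URL layout; adjust it once you see where one PDF really lives.
    pdf_url = PDF_BASE + filename
    r = requests.get(pdf_url)
    if r.ok and r.headers.get("Content-Type", "").startswith("application/pdf"):
        with open(filename, "wb") as f:
            f.write(r.content)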

python - pull pdfs from webpage and convert to html

My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables.
So far I have imported mechanize (for browsing the pages/finding the pdf files) and I have pdfminer, however I'm not sure how to use it in a script to perform the same functionality it does on the command line.
What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for stackoverflow, but I'm having trouble using google searches and sparse documentation to piece together how to code this. Thanks!
EDIT:
So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem() class to use with my spider, with fields title and url. I have a selector that's grabbing all the links I want, and I want to go through these links and create a PDFItem for each one. Here's the code I have below:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
    item['title'] = link.xpath('/text()')
    item['url'] = URL + link.xpath('@href').extract()[0]
The url line works well, but I don't really know how to do the same for title. I guess I could just repeat the query at the top with '/text()' added to the end of the selector, but this seems excessive. Is there a better way to go through each link object in the links array and grab its text and href value?
I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
Edit:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    item = PDFItem()
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
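For the Poppler step, a rough sketch of driving pdftohtml from Python could look like this (it assumes the pdftohtml binary is installed and on your PATH, and that the PDF has already been downloaded; the file name is just an example taken from the selector above):
import subprocess

# Convert a downloaded PDF to HTML with Poppler's pdftohtml.
# -s writes a single HTML document covering all pages.
subprocess.run(["pdftohtml", "-s", "enforcementactions.pdf"], check=True)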
In order to browse and find PDF links on a webpage, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.
PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
So the libraries you would need are an HTTP library, like Requests, and PDFMiner.
The workflow of your script will be something like:
import os
import requests
from subprocess import Popen
...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
    ...
    # Build the command line parameters, the way pdf2txt expects
    # Invoke PDFMiner's pdf2txt on the created file as a subprocess
    Popen(cmd)
See the Requests documentation for more info on saving the PDF as a local file, and the subprocess documentation for running programs as subprocesses.
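Fleshing that workflow out, a minimal sketch might look like the following (the page URL and the filename keyword are placeholders; it also assumes pdf2txt.py is on your PATH, which PDFMiner's installer normally arranges):
import os
import subprocess
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

PAGE_URL = "https://example.com/reports/"  # placeholder
KEYWORD = "enforcementactions"             # word the PDF file name must contain

html = requests.get(PAGE_URL).text
soup = BeautifulSoup(html, "html.parser")

# Collect links to PDFs whose file name contains the keyword.
pdf_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].endswith(".pdf") and KEYWORD in a["href"]]

for href in pdf_links:
    pdf_url = urljoin(PAGE_URL, href)
    filename = os.path.basename(href)

    # Save the PDF locally.
    r = requests.get(pdf_url)
    with open(filename, "wb") as f:
        f.write(r.content)

    # Convert it to HTML with PDFMiner's pdf2txt.py (output format follows the extension).
    html_name = filename.replace(".pdf", ".html")
    subprocess.run(["pdf2txt.py", "-o", html_name, filename], check=True)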

Beautiful soup - data not in HTML file

I am new to Python. I am trying to scrape data from a website, and the data I want cannot be seen under View > Source in the browser; it comes from another file. Is it possible to scrape the actual data shown on the screen with BeautifulSoup and Python?
Example site: www.catleylakeman.co.uk/cds_banks.php
If not, is this possible using another route?
Thanks
The "other file" is http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369145707664 - you can find this out (and I suspect you already have) by using chrome's developer tools, network tab (or the equivalent in your browser).
This format is easier to parse than the final html would be; generally HTML scrapers should be used as a last resort if the website does not publish raw data like the above.
My guess is, the url you are actually looking for is:
http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122
I found it using the developer toolbar and looking at the network traffic (built into Chrome and Firefox, or via Firebug). It gets loaded with Ajax. You do not even need Beautiful Soup to parse that one, as it seems to be one long string separated with *| and sometimes **|. The following should get you initial access to that data:
import urllib2
f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
    data = f.read().split('*|')
finally:
    f.close()
print data
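If you prefer the requests library on Python 3, an equivalent sketch would be (the delimiter handling mirrors the answer above; the record layout beyond that is an assumption you would confirm by inspecting the response):
import requests

url = "http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122"
resp = requests.get(url)

# The response is a plain string with records separated by *|
# (and group boundaries marked by **|), so simple splitting is enough.
records = resp.text.split("*|")
print(records[:5])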

Python - urlretrieve for entire web page

With urllib.urlretrieve('http://page.com', 'page.html') I can save the index page, and only the index, of page.com. Does urlretrieve handle something similar to wget -r that lets me download the entire web page structure with all related HTML files of page.com?
Regards
Not directly.
If you want to spider over an entire site, look at mechanize: http://wwwsearch.sourceforge.net/mechanize/
This will let you load a page and follow links from it
Something like:
import mechanize
br = mechanize.Browser()
br.open('http://stackoverflow.com')
for link in br.links():
    print(link)
    response = br.follow_link(link)
    html = response.read()
    # save your downloaded page
    br.back()
As it stands, this will only get you the pages one link away from your starting point. You could easily adapt it to cover an entire site, though.
If you really just want to mirror an entire site, use wget. Doing this in Python is only worthwhile if you need to do some kind of clever processing (handling JavaScript, selectively following links, etc.).
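As a rough sketch of adapting that loop to cover a whole site (assuming Python 3, that following only links under the same base URL to a fixed depth is what "entire site" means here, and that the file-naming scheme is just an illustration):
import mechanize
from urllib.parse import urljoin  # on Python 2, use urlparse.urljoin instead

START = 'http://stackoverflow.com'
visited = set()
br = mechanize.Browser()


def crawl(url, depth=2):
    # Download url, save it, then follow same-site links up to `depth` levels deep.
    if url in visited or depth == 0:
        return
    visited.add(url)
    html = br.open(url).read()
    with open(url.replace('/', '_').replace(':', '') + '.html', 'wb') as f:
        f.write(html)
    # Snapshot the links first, because the browser's state changes when we recurse.
    targets = [urljoin(url, link.url) for link in br.links()]
    for target in targets:
        if target.startswith(START):
            crawl(target, depth - 1)


crawl(START)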
