I would like to create a programme that can download a specific CDF file from this website:
http://research.ssl.berkeley.edu/data/psp/data/sci/fields/l2/mag_RTN_4_Sa_per_Cyc/2018/10/
For example, I would like the user to be asked which specific date they would like to download, and the programme to download that file and store it as data.
In this site all the file names end with the date. For example:
psp_fld_l2_mag_RTN_4_Sa_per_Cyc_20181003_v01.cdf
where 20181003 means 2018/10/03 (the date).
Is this possible?
As this is a static website and doesn't involve JavaScript in loading the files, you can go ahead with requests to get the HTML from the URL using
r = requests.get(url)
You can then get all the links using BeautifulSoup's web scraping, and finally save the file using
r = requests.get(fetched_url, allow_redirects=True)
with open(filename, 'wb') as f:
    f.write(r.content)
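Putting those pieces together, here is a minimal sketch of the whole flow. The base URL and the filename pattern come from the question; the date prompt, the link matching, and the output name data.cdf are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

base_url = ('http://research.ssl.berkeley.edu/data/psp/data/sci/fields/'
            'l2/mag_RTN_4_Sa_per_Cyc/2018/10/')

# Ask the user which date they want, in the same form the filenames use
date = input('Enter a date as YYYYMMDD (e.g. 20181003): ')

# Fetch the directory listing and collect all links to .cdf files
r = requests.get(base_url)
r.raise_for_status()
soup = BeautifulSoup(r.text, 'html.parser')
cdf_links = [a['href'] for a in soup.find_all('a', href=True)
             if a['href'].endswith('.cdf')]

# Keep the file whose name contains the requested date (the listing uses
# relative filenames, so the full URL is base_url + filename)
matches = [href for href in cdf_links if date in href]
if matches:
    r = requests.get(base_url + matches[0], allow_redirects=True)
    with open('data.cdf', 'wb') as f:
        f.write(r.content)
else:
    print('No file found for that date.')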
A webpage I would like to scrape consists of several files:
I'm interested in scraping only the highlighted file, that is, mboxFrame.
My method of scraping pages
import requests
from bs4 import BeautifulSoup
webPage = requests.get(URL, verify=False)
soup = BeautifulSoup(webPage.content, "html.parser" )
is able to scrape only the file mail.html. Is there a way to scrape only what I want?
I would appreciate any hints or tips.
The way to open a file from a server is to request it with a URL.
In fact, in the early days of the World Wide Web this was the only way to get content: content creators would put various files on servers and clients would open or download those files. The dynamic processing of URIs and parameters is a later invention. That is why commenters are asking for the URL you use: we want to see it and modify it accordingly, to help you see which parts need changing in order to get that particular file. You can omit the password, or replace it with some other string of letters.
In general, the file you want would be under the url you use, but ending with the file name.
If the starting URL is www.example.com/mail/, then this file would be at www.example.com/mail/mbox.msc.
Please note that any parameters should follow the path, so www.example.com/mail?user=hendrra&password=hendras_password would turn into
www.example.com/mail/mbox.msc?user=hendrra&password=hendras_password
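If the page is built from frames, you can also ask requests for the frame's own URL directly. Here is a minimal sketch, assuming the frame you want is named mboxFrame and that www.example.com/mail/ stands in for your real (password-protected) URL:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = 'https://www.example.com/mail/'   # placeholder for your real URL

webPage = requests.get(URL, verify=False)
soup = BeautifulSoup(webPage.content, 'html.parser')

# Find the <frame>/<iframe> called mboxFrame and request its src directly
frame = soup.find(['frame', 'iframe'], attrs={'name': 'mboxFrame'})
if frame is not None:
    frame_url = urljoin(URL, frame['src'])
    framePage = requests.get(frame_url, verify=False)
    print(framePage.text[:500])  # the frame's own HTML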
I am interested in scraping the dataset from http://njdep.rutgers.edu/continuous/data.php in order to create a Shiny app that allows one to search through the data contained at that site.
Once you fill out the form on the site, it can generate a .csv file. Is there any way to find out where all of the data, from the earliest date to the most recent, is stored, and to extract it using an R or Python package?
In a browser you can right-click and inspect the page. When you click the download button, you can see the underlying REST API in the Network tab. It should look something like this:
http://njdep.rutgers.edu/continuous/data/downloadData.php?affiliation=NJDEP+-+Marine+Water+Monitoring&project=-1&huc14=-1&county=-1&munis=-1&station_type=-1&station=-1&start_date=&end_date=&params=
If you change the various form parameters, you can get an idea of how to change the URL to get different variations of the data. Then you can use a package like requests to get the data in Python.
import requests
url = 'your_modified_url'
res = requests.get(url)
res.raise_for_status()
data = res.content
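The same request can be made a little more readable by passing the form fields as a params dict; the parameter names below are the ones visible in the URL above, and the values are only examples:
import requests

base = 'http://njdep.rutgers.edu/continuous/data/downloadData.php'
params = {
    'affiliation': 'NJDEP - Marine Water Monitoring',
    'project': '-1',
    'huc14': '-1',
    'county': '-1',
    'munis': '-1',
    'station_type': '-1',
    'station': '-1',
    'start_date': '',
    'end_date': '',
    'params': '',
}

res = requests.get(base, params=params)
res.raise_for_status()

# Save the returned CSV to disk
with open('njdep_data.csv', 'wb') as f:
    f.write(res.content)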
I have a problem downloading a URL.
I need to download a webpage with a table. When I get the .html file with the help of urllib or urllib2, it has some problems connected with JavaScript (or similar languages). There is only source code, such as id_name etc., but it doesn't have any table information (columns and rows).
Nevertheless, when I save the .html in Google Chrome, it actually has the information in the table (not source code, but columns and rows). So what should I do to get the same in Python?
You can use Selenium to simulate a browser. It will execute the JavaScript, and then you can get the information you want.
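A minimal Selenium sketch, assuming Chrome is installed and the URL is a placeholder for the page with your table:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://example.com/page-with-table')

# page_source is the HTML after JavaScript has run, so the table rows are
# present and can be parsed (for example with BeautifulSoup or pandas)
html = driver.page_source
driver.quit()

with open('page.html', 'w', encoding='utf-8') as f:
    f.write(html)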
I'm working on a project where I require all of the game IDs found in the current scores section of http://www.nhl.com/ in order to download content and parse stats for each game. I want to be able to get all current game IDs in one go, but for some reason I'm unable to download the full HTML of the page, no matter how I try. I'm using requests and beautifulsoup4.
Here's my problem:
I've determined that the particular tags I'm interested in are div's where the CSS class = 'scrblk'. So, I wrote a function to pass into BeautifulSoup.find_all() to give me, specifically, blocks with that CSS class. It looks like this:
def find_scrblk(css_class):
    return css_class is not None and css_class == 'scrblk'
So, when I actually went to the web page in Firefox and saved it, then loaded the saved file into beautifulsoup4, I did the following:
>>>soup = bs(open('nhl.html'))
>>>soup.find_all(class_=find_scrblk)
[<div class="scrblk" id="hsb2015010029"> <div class="defaultState"....]
and everything was all fine and dandy; I had all the info I needed. However, when I tried to download the page using any of several automated methods I know, this returned simply an empty list. Here's what I tried:
using requests.get() and saving the .text attribute in a file
using the iter_content() and iter_lines() methods of the request object to write to the file piece by piece
using wget to download the page (through subprocess.call()) and opening the resultant file. For this option, I was sure to use the --page-requisites and --convert-links flags so I downloaded (or so I thought) all the necessary data.
With all of the above, I was unable to parse out the data that I need from the HTML files; it's as if they weren't being completely downloaded or something, but I have no idea what that something is or how to fix it. What am I doing wrong or missing here? I'm using python 2.7.9 on Ubuntu 15.04.
All of the files can be downloaded here:
https://www.dropbox.com/s/k6vv8hcxbkwy32b/nhl_html_examples.zip?dl=0
As the comments on your question state, you have to rethink your approach. What you see in the browser is not what the response contains: the site uses JavaScript to load the information you are after, so you should look more carefully at the response you actually get to see whether it contains what you are looking for.
In the future, to diagnose such problems, try disabling JavaScript in Chrome's developer console and opening the site that way. Then you will see whether you are facing JS-rendered content or whether the plain HTML already contains the values you are looking for.
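A quick way to confirm this for your case is to fetch the page and check whether the class you are matching on appears in the raw, pre-JavaScript HTML at all:
import requests

r = requests.get('http://www.nhl.com/')
# If this prints False, the scrblk blocks are added by JavaScript and will
# never show up in the plain HTML response
print('scrblk' in r.text)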
And by the way, what you are doing is against the Terms of Service of the NHL website (according to Section 2, "Prohibited Content and Activities"):
Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;
My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables.
So far I have imported mechanize (for browsing the pages/finding the pdf files) and I have pdfminer, however I'm not sure how to use it in a script to perform the same functionality it does on the command line.
What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for stackoverflow, but I'm having trouble using google searches and sparse documentation to piece together how to code this. Thanks!
EDIT:
So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem() class to use with my spider, with fields title and url. I have a selector that's grabbing all the links I want, and I want to go through these links and create a PDFItem for each one. Here's the code I have below:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
    item['title'] = link.xpath('/text()')
    item['url'] = URL + link.xpath('@href').extract()[0]
The url line works well, but I don't really know how to do the same for title. I guess I could just perform the query at the top again, adding '/text()' to the end of the selector, but this seems excessive. Is there a better way to just go through each link object in the links array and grab the text and href value?
I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
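For example, a small sketch of calling pdftohtml from Python, assuming Poppler is installed and on PATH and 'report.pdf' is a placeholder filename:
import subprocess

# Writes the HTML output next to the input file, using pdftohtml's defaults
subprocess.run(['pdftohtml', 'report.pdf'], check=True)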
Edit:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    item = PDFItem()
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
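Note that inside a Scrapy spider callback you would typically also yield item at the end of each loop iteration, so that every matched link produces its own PDFItem.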
In order to browse a webpage and find PDF links, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.
PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
So the libraries you need are an HTTP library, like Requests, and PDFMiner.
The workflow of your script will be something like:
import os
import requests
from subprocess import Popen

r = requests.get('<url-which-has-pdf-links>')

# Do a search for pdf links in r.text, e.g. with a regex or BeautifulSoup
pdf_links = [...]

for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
    pdf_path = os.path.join('/tmp', os.path.basename(pdf_url))
    with open(pdf_path, 'wb') as f:
        f.write(requests.get(pdf_url).content)

    # Build the command line parameters the way pdf2txt expects and
    # invoke PDFMiner's pdf2txt on the saved file as a subprocess
    cmd = ['pdf2txt.py', '-o', pdf_path + '.html', pdf_path]
    Popen(cmd).wait()
More info on using Requests to save the PDF as a local file is in the Requests documentation; more info on running programs as subprocesses is in the subprocess module documentation.