I am using the requests library in python and attempting to scrape a website that has lots of public reports and documents in .pdf format. I have successfully done this on other websites, but I have hit a snag on this one: the links are javascript functions (objects? I don't know anything about javascript) that redirect me to another page, which then has the raw pdf link. Something like this:
import requests
from bs4 import BeautifulSoup as bs

url = 'page with search results.com'
html = requests.get(url).text
soup = bs(html, 'html.parser')  # specify a parser to avoid the bs4 warning
obj_list = soup.find_all('a')
for a in obj_list:
    link = a['href']
    print(link)
>> javascript:readfile2("F","2201","2017_2201_20170622F14.pdf")
Ideally I would like a way to find what url this would navigate to. I could use selenium and click on the links, but there are a lot of documents and that would be time- and resource-intensive. Is there a way to do this with requests or a similar library?
Edit: It looks like every link goes to the same url, which loads a different pdf depending on which link you click. This makes me think that there is no way to do this in requests, but I am still holding out hope for something non-selenium-based.
There is probably a base URL where these PDF files are hosted.
You need to find that URL, i.e. the one a PDF opens at after you click one of those hyperlinks.
Once you have that URL, parse the PDF file name out of the anchor's href.
Then append the PDF name to that base URL and request the final URL, as in the sketch below.
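A minimal sketch of that idea, assuming a hypothetical base URL (https://example.com/files/ below stands in for wherever the PDFs actually open):

import re
import requests
from bs4 import BeautifulSoup as bs

BASE = 'https://example.com/files/'  # hypothetical: replace with the URL a PDF actually opens at
url = 'page with search results.com'
soup = bs(requests.get(url).text, 'html.parser')

for a in soup.find_all('a', href=True):
    # hrefs look like: javascript:readfile2("F","2201","2017_2201_20170622F14.pdf")
    m = re.search(r'readfile2\("(\w+)","(\d+)","([^"]+\.pdf)"\)', a['href'])
    if not m:
        continue
    pdf_name = m.group(3)
    pdf_url = BASE + pdf_name  # the other two arguments may also be needed as path segments
    r = requests.get(pdf_url)
    if r.ok and r.headers.get('Content-Type', '').startswith('application/pdf'):
        with open(pdf_name, 'wb') as f:
            f.write(r.content)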
Related
I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I used the soup.find_all() method and printed the result to check that it works, but it gives me an empty list. Can you help me fix it?
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content, "lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer",
                          attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and loads its data with scripts. When you make a request with requests.get("https://www.youtube.com/feed/explore"), you only get the initial source document, which contains little more than the head, meta tags, and the scripts themselves. In a browser those scripts then fetch the data from the server, but BeautifulSoup does not execute JavaScript or see its changes to the DOM. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class in the raw HTML.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy); a Selenium sketch follows below.
For YouTube, you can use its API as well.
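A minimal Selenium sketch of that approach (it assumes a local Chrome/chromedriver setup, and the tag/class names are the ones from the question, which YouTube may change at any time):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # needs a matching chromedriver available
driver.get("https://www.youtube.com/feed/explore")
time.sleep(5)  # crude wait for the scripts to load the data; WebDriverWait is nicer

soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

trendings = soup.find_all("ytd-video-renderer",
                          attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(len(trendings))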
In general, if a website displays a series of links to folders containing data (e.g. spreadsheets with economic data), how can I write a program that identifies all the links and downloads the data?
In particular, I am trying to download all folders from 2012 to 2018 in this website https://www.ngdc.noaa.gov/eog/viirs/download_dnb_composites.html
I tried the approach suggested below, yet it seems the links to the data are not downloaded.
import requests
from bs4 import BeautifulSoup

my_target = 'https://www.ngdc.noaa.gov/eog/viirs/download_dnb_composites.html'
r = requests.get(my_target)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
    print(link.get('href'))
Among all the URLs appended to links, none points to the data.
Finally, even once I have the right links, how can they be used to actually download the files?
Many thanks! ;)
That's a typical web scraping task.
Use requests to download the page,
then parse the content and extract the URLs using BeautifulSoup.
You can then download the files with requests using the extracted URLs, as in the sketch below.
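A minimal sketch of those three steps for the page above (the file-extension filter is an assumption; adjust it to whatever the extracted links actually end in):

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

my_target = 'https://www.ngdc.noaa.gov/eog/viirs/download_dnb_composites.html'

# 1. download the page
r = requests.get(my_target)

# 2. parse it and collect absolute URLs that look like data files
soup = BeautifulSoup(r.text, 'html.parser')
data_links = []
for a in soup.find_all('a', href=True):
    href = urljoin(my_target, a['href'])  # resolve relative links
    if href.lower().endswith(('.tgz', '.tif', '.zip', '.csv', '.xlsx')):  # assumed extensions
        data_links.append(href)

# 3. download each file, streaming so large files are not held in memory
for href in data_links:
    filename = os.path.basename(href)
    with requests.get(href, stream=True) as resp:
        resp.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)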
I am unable to extract any data from this website, although this code works for other sites. The table on this website also loads more rows when a registered user scrolls down. How can I extract the data from the table on such a website?
from pyquery import PyQuery as pq
import requests
url = "https://uk.tradingview.com/screener/"
content = requests.get(url).content
doc = pq(content)
Tickers = doc(".tv-screener__symbol").text()
Tickers
You're using a class name which doesn't appear in the source of the page. The most likely reason for this is that the page uses javascript to either load data from a server or change the DOM once the page is loaded to add the class name in question.
Since neither the requests library nor the pyquery library you're using has a JavaScript engine to duplicate that feat, you are left with the raw static HTML, which doesn't contain the tv-screener__symbol class.
To solve this, look at the document you actually receive from the server and try to find the data you're interested in within the raw HTML document:
...
content = requests.get(url).content
print(content)
(Or you can look at the data in the browser, but you must turn off Javascript in order to see the same document that Python gets to see)
If the data isn't in the raw HTML, you have to look at the JavaScript to see how it makes its requests to the server backend to load the data, and then replicate that request with the Python requests library.
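For example (a sketch only: the endpoint, payload, and field names below are placeholders; the real ones have to be copied from the XHR request you see in the browser's Network tab):

import requests

# Placeholder endpoint and payload: copy the real URL, headers, and JSON body
# from the request the page makes in the browser's developer tools.
endpoint = "https://backend.example.com/screener/scan"
payload = {
    "columns": ["name", "close", "volume"],  # placeholder field names
    "range": [0, 100],
}

resp = requests.post(endpoint, json=payload)
resp.raise_for_status()
for row in resp.json().get("data", []):  # structure depends on the real response
    print(row)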
I have been trying to scrape facebook comments using Beautiful Soup on the below website pages.
import BeautifulSoup
import urllib2
import re
url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
fb_comment = soup("div", {"class":"postText"}).find(text=True)
print fb_comment
The output is an empty set. However, I can clearly see the Facebook comment within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering whether the approach is correct and where I am going wrong.)
As Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it with Selenium (http://seleniumhq.org) and then run BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
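A sketch of that two-step approach (the fb:comments extraction follows from the answer above; the Graph API call shown is the legacy comments endpoint and is an assumption here, since the current API needs a version, path, and access token per Facebook's docs):

import requests
from bs4 import BeautifulSoup

page_url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')

# The fb:comments tag carries the URL that Facebook keys the comment thread on.
tag = soup.find('fb:comments')
if tag is not None:
    thread_url = tag['href']
    # Legacy/assumed endpoint; check Facebook's documentation for the current call.
    resp = requests.get('https://graph.facebook.com/comments/', params={'ids': thread_url})
    print(resp.json())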
The parts of the page you are looking for are not included in the source file. Use a browser and you can see this for yourself by opening the page source.
You will need to use something like pywebkitgtk to have the javascript executed before passing the document to BeautifulSoup
With urllib.urlretrieve('http://page.com', 'page.html') I can save the index page, and only the index, of page.com. Does urlretrieve offer something similar to wget -r that lets me download the entire web page structure with all related html files of page.com?
Regards
Not directly.
If you want to spider over an entire site, look at mechanize: http://wwwsearch.sourceforge.net/mechanize/
This will let you load a page and follow links from it
Something like:
import mechanize

br = mechanize.Browser()
br.open('http://stackoverflow.com')
for link in br.links():
    print(link)
    response = br.follow_link(link)
    html = response.read()
    # save your downloaded page here
    br.back()
As it stands, this will only get you the pages one link away from your starting point. You could easily adapt it to cover an entire site, though.
If you really just want to mirror an entire site, use wget. Doing this in python is only worthwhile if you need to do some kind of clever processing (handling javascript, selectively following links, etc)
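If you do want to adapt the idea to cover a whole site in Python, a minimal sketch with requests and BeautifulSoup (rather than mechanize) might look like this, assuming you only want to follow links on the same host:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start = 'http://stackoverflow.com'
host = urlparse(start).netloc
seen = {start}
queue = deque([start])

while queue:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    # save your downloaded page here, e.g. keyed by a sanitised form of the URL
    for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
        nxt = urljoin(url, a['href'])
        if urlparse(nxt).netloc == host and nxt not in seen:
            seen.add(nxt)
            queue.append(nxt)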