I've been successfully scraping articles (in Python) from a couple of news websites from my country, basically by parsing the main page, fetching the hrefs and accessing them to parse the articles. But I just hit a wall with https://www.clarin.com/. I am only getting a very limited number of elements because of the infinite scrolling. I researched a lot but couldn't find the right resource to overcome this, though it is more than likely that I am doing it wrong.
From what I see in the devtools, the request that loads more content returns a JSON file, but I don't know how to fetch it automatically in order to parse it. I would like some quick guidance on what to learn to do this. I hope I made some sense. This is my base code:
import requests
from bs4 import BeautifulSoup
source = requests.get("https://www.clarin.com/")
html = BeautifulSoup(source.text, "lxml")
This is an example request url I am seeing in chrome devtools.
https://www.clarin.com/ondemand/eyJtb2R1bGVDbGFzcyI6IkNMQUNsYXJpbkNvbnRhaW5lckJNTyIsImNvbnRhaW5lcklkIjoidjNfY29sZnVsbF9ob21lIiwibW9kdWxlSWQiOiJtb2RfMjAxOTYyMjQ4OTE0MDgzIiwiYm9hcmRJZCI6IjEiLCJib2FyZFZlcnNpb25JZCI6IjIwMjAwNDMwXzAwNjYiLCJuIjoiMiJ9.json
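The long path segment in that URL looks like base64-encoded JSON, so one thing worth trying is to decode it, change a parameter and request the result. The sketch below is only that: it assumes the "n" field is a page counter and that the server accepts a re-encoded token, neither of which is confirmed here. The helper names decode_token and build_ondemand_url are made up for illustration.

```python
import base64
import json

# Token copied from the example request URL above (the part between
# "/ondemand/" and ".json").
TOKEN = (
    "eyJtb2R1bGVDbGFzcyI6IkNMQUNsYXJpbkNvbnRhaW5lckJNTyIsImNvbnRhaW5lcklkIjoi"
    "djNfY29sZnVsbF9ob21lIiwibW9kdWxlSWQiOiJtb2RfMjAxOTYyMjQ4OTE0MDgzIiwiYm9h"
    "cmRJZCI6IjEiLCJib2FyZFZlcnNpb25JZCI6IjIwMjAwNDMwXzAwNjYiLCJuIjoiMiJ9"
)


def decode_token(token):
    """Decode the base64 token into a dict of request parameters."""
    padded = token + "=" * (-len(token) % 4)  # restore padding if it was stripped
    return json.loads(base64.b64decode(padded))


def build_ondemand_url(params):
    """Re-encode parameters into an /ondemand/ URL (format assumed from the example)."""
    raw = json.dumps(params, separators=(",", ":")).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return "https://www.clarin.com/ondemand/" + token + ".json"


params = decode_token(TOKEN)
# Assumption: "n" selects which batch of articles is returned.
next_params = dict(params, n=str(int(params["n"]) + 1))
next_url = build_ondemand_url(next_params)
print(params)
print(next_url)
```

If the assumption holds, something like requests.get(next_url).json() should return the next batch, and the loop can stop when the response comes back empty or with an error status.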
I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried the soup.find_all() method and printed the result to see if it works, but it gives me an empty list. Can you help me fix it? Click here to see the specific part of the HTML code.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and relies on scripts to load its data. When you make a request using requests.get("https://www.youtube.com/feed/explore"), you receive only the initial source document, which contains things like the head, meta tags and scripts. In a real browser, those scripts then load the data from the server. BeautifulSoup does not see changes made to the DOM by JavaScript. That's why soup.find_all("ytd-video-renderer",attrs = {"class":"style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: the downloaded HTML contains no ytd-video-renderer tag and no style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For YouTube, you can use its API as well.
I have been trying to scrape some data using beautiful soup from https://www.eia.gov/coal/markets/. However when I parse the contents some of the data does not show up at all. Those data fields are visible in chrome inspector but not in the soup. The thing is they do not seem to be text elements. I think they are fed using an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
Google inspector:
Beautiful soup parsed content:
@DMart is correct. The data you are looking for is populated by JavaScript; have a look at line 1629 in the page source. Beautiful Soup doesn't act as a client browser, so there is nowhere for the JS to execute. It looks like Selenium is your best bet.
See This thread for more information.
Not enough detail in your question but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to see whether you're using Selenium (you tagged this question as such), which may indicate you're using page_source, and that does not guarantee you the complete source of the page you're looking at.
If you're using requests, it's even more unlikely that you're capturing the page's complete source code.
The data is loaded via ajax, so it is not available in the initial document. If you go to the networking tab in chrome dev tools you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response and it looks like the data you are looking for is there.
This is a direct JSON response from the backend. It's better than Selenium if you can get it to work.
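A minimal sketch of pulling that endpoint directly, using only the standard library. The structure of the returned JSON is not described in this thread, so the code just decodes it; the User-Agent header is an assumption in case the server rejects the default one, and fetch_json and parse_json_body are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen

# Endpoint seen in the network tab, as mentioned above.
COAL_JSON_URL = "https://www.eia.gov/coal/markets/coal_markets_json.php"


def parse_json_body(body):
    """Decode raw response bytes into Python objects."""
    return json.loads(body.decode("utf-8"))


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON."""
    # Assumption: a browser-like User-Agent avoids being blocked by the server.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return parse_json_body(response.read())
```

Calling fetch_json(COAL_JSON_URL) should then give you a dict or list to explore, for example by printing its top-level keys before deciding which fields to keep.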
Thank you all!
Opening the page with a Selenium webdriver and then parsing the page source with Beautiful Soup worked.
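from selenium import webdriver
from bs4 import BeautifulSoup as BS
driver = webdriver.Chrome()  # assumes Chrome and a matching driver are installed
driver.get('https://www.eia.gov/coal/markets/')
html = driver.page_source
driver.quit()
soup = BS(html, 'html.parser')
table = soup.find("table", {'id': 'snl_dpst'})
rows = table.find_all("tr")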
I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the closing tag is there at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page so that the rest of it loads first.
However, if you look at the website, it's very simple, so as a novice part of me is thinking that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress generated and WordPress can set up delayed JavaScript on even simple websites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when web pages will require a web driver and when Requests will be enough?
2) I'm thinking one step ahead here - if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress plugin to do this. I think this supports preloading. If you look at the requests that are made, you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc.).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
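That said, when you know some text that must appear on a complete page, a cheap spot check can at least flag a truncated response before you reach for Selenium. The sketch below is not a general detector, only a sanity check; the sample HTML string is made up for illustration, and missing_markers is a hypothetical helper name.

```python
def missing_markers(html, markers):
    """Return the expected strings that do not appear anywhere in the raw HTML."""
    return [marker for marker in markers if marker not in html]


# Made-up stand-in for page.text: the first sheriff is present, the later
# one would only be added by JavaScript after the initial load.
sample_html = "<ul><li>Sheriff Harrold</li></ul><script src='grid.js'></script>"
expected = ["Sheriff Harrold", "Curtis Landers"]

print(missing_markers(sample_html, expected))  # ['Curtis Landers']
```

A non-empty result suggests the content is loaded later by a script, so the next step would be to look for the underlying request in the network tab or to render the page in a browser.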
In this video, I give you a look at the dataset I want to scrape/take from the web. Very sorry about the audio, but did the best with what I have. It is hard for me to describe what I am trying to do as I see a page with thousands of pages and obviously has tables, but pd.read_html doesn't work! Until it hit me, this page has a form to be filled out first....
https://opir.fiu.edu/instructor_eval.asp
Going to this link will allow you to select a semester, and in doing so, will show thousands upon thousands of tables. I attempted to use the URL after selecting a semester hoping to read HTML, but no such luck.. I still don't know what I'm even looking at (like, is it a webpage, or is it ASP? What even IS ASP?). If you follow the video link, you'll see that it gives an ugly error if you select spring semester, copy the link, and put it in the search bar. Some SQL error.
So this is my dilemma. I'm trying to GET this data... All these tables. Last post I made, I did a brute force attempt to get them by just clicking and dragging for 10+ minutes, then pasting into excel. That's an awful way of doing it, and it wasn't even particularly useful when I imported that excel sheet into python because the data was very difficult to work with. Very unstructured. So I thought, hey, why not scrape with bs4? Not that easy either, it seems, as the URL won't work. After filtering to spring semester, the URL just won't work, not for you, and not if you paste it into python for bs4 to use...
So I'm sort of at a loss here of how to reasonably work with this data. I want to scrape it with bs4, and put it into dataframes to be manipulated later. However, as it is ASP or whatever it is, I can't find a way to do so yet :\
ASP stands for Active Server Pages: the page is generated by a server-side script (usually VBScript). This shouldn't concern you, since you want to scrape data from the rendered page.
In order to get a valid response from /instructor_evals/instr_eval_result.asp you have to submit a POST request with the form data of /instructor_eval.asp, otherwise the page returns an error message.
If you submit the correct data with urllib you should be able to get the tables with bs4.
from urllib.request import urlopen, Request
from urllib.parse import urlencode
from bs4 import BeautifulSoup
url = 'https://opir.fiu.edu/instructor_evals/instr_eval_result.asp'
data = {'Term':'1171', 'Coll':'%', 'Dept':'','RefNum':'','Crse':'','Instr':''}
r = urlopen(Request(url, data=urlencode(data).encode()))
html = r.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
By the way, this error message is a strong indication that the page is vulnerable to SQL injection, which is a very nasty bug, and I think you should inform the admin about it.
I work with Selenium in Python 2.7. I get that loading a page and similar thing takes far longer than raw requests because it simulates everything including JS etc.
The thing I don't understand is that parsing of already loaded page takes too long.
Every time a page is loaded, I find all tags meeting some condition (about 30 div tags) and then pass each tag as an argument to a parsing function. For parsing I'm using CSS selectors and similar methods, like: on.find_element_by_css_selector("div.carrier p").text
As far as I understand, when the page is loaded, its source code is held in RAM or somewhere similar, so parsing should take milliseconds.
EDIT: I bet that parsing the same source code using BeautifulSoup would be more than 10 times faster but I don't understand why.
Do you have any explanation? Thanks
These are different tools for different purposes. Selenium is a browser automation tool that has a rich set of techniques to locate elements. BeautifulSoup is an HTML parser. When you find an element with Selenium, that is not HTML parsing. In other words, driver.find_element_by_id("myid") and soup.find(id="myid") are very different things.
When you ask selenium to find an element, say, using find_element_by_css_selector(), there is an HTTP request being sent to /session/$sessionId/element endpoint by the JSON wire protocol. Then, your selenium python client would receive a response and return you a WebElement instance if everything went without errors. You can think of it as a real-time/dynamic thing, you are getting a real Web Element that is "living" in a browser, you can control and interact with it.
With BeautifulSoup, once you download the page source, there is no network component anymore, no real-time interaction with a page and the element, there is only HTML parsing involved.
In practice, if you are doing web scraping and you need a real browser to execute JavaScript and handle AJAX, and you are doing complex HTML parsing afterwards, it makes sense to get the desired .page_source and feed it to BeautifulSoup or, even better in terms of speed, lxml.html.
Note that in cases like this there is usually no need for the complete HTML source of the page. To make the HTML parsing faster, you can feed the "inner" or "outer" HTML of the page block containing the desired data to the HTML parser of your choice. For example:
# Grab only the HTML of the block that holds the data, then parse it offline
container = driver.find_element_by_id("container").get_attribute("outerHTML")
driver.close()
soup = BeautifulSoup(container, "lxml")