I am trying to scrape a website. I have tried two methods, but neither gives me the full page source I am looking for. I am trying to scrape the news titles from the website URL provided below.
URL: "https://www.todayonline.com/"
These are the two methods I have tried, both without success.
Method 1: Beautiful Soup
import requests
from bs4 import BeautifulSoup

tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page, "html.parser")
soup # Returns me a HTML with javascript text
soup.find_all('h3')
### Returns me empty list []
Method 2: Selenium + BeautifulSoup
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

tdy_url = "https://www.todayonline.com/"
options = Options()
options.headless = True
driver = webdriver.Chrome("chromedriver", options=options)
driver.get(tdy_url)
time.sleep(10)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all('h3')
### Returns only a fraction (less than 1/4) of the 'h3' tags visible in the rendered page
Please help. I have tried scraping other news websites and they were much easier. Thank you.
The news data on the website you are trying to scrape is fetched from the server by JavaScript (this is called XHR -- XMLHttpRequest). This happens dynamically while the page is loading or being scrolled, so the data is not included in the HTML the server initially returns.
In the first example, you are getting only the page returned by the server -- without the news, but with the JS that is supposed to fetch them. Neither requests nor BeautifulSoup can execute JS.
However, you can try to reproduce the requests that fetch the news titles from the server using Python requests. Follow these steps:
1. Open your browser's DevTools (usually F12 or Ctrl+Shift+I), go to the Network tab, and look at the requests that fetch the news titles from the server. Sometimes this is even easier than scraping with BeautifulSoup.
2. Copy the request link (right-click -> Copy -> Copy link) and pass it to requests.get(...).
3. Call .json() on the response. It returns a dict that is easy to work with. To better understand the structure of the dict, I would recommend pprint instead of plain print (note you have to do from pprint import pprint before using it).
Here is an example of the code that gets the titles from the main news on the page:
import requests

nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7").json()["nodes"]
for node in nodes:
    print(node["node"]["title"])
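To see what a single node looks like before picking fields out of it, here is a minimal pprint sketch, assuming the same endpoint and response shape as above:
from pprint import pprint
import requests

nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7").json()["nodes"]
pprint(nodes[0])  # pretty-print one node to explore its structure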
If you want to scrape a group of news under a caption, you need to change the number after news_feed/ in the request URL (to find it, filter the requests by "news_feed" in DevTools and scroll the news page down).
Sometimes websites have protection against bots (although the one you are trying to scrape doesn't). In such cases, you might also need to send browser-like headers, such as a User-Agent, with your requests.
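A minimal sketch of that (the User-Agent string below is just an example; copy the exact one your browser sends from DevTools):
import requests

headers = {
    # example value; copy the real string from the request headers in DevTools
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
response = requests.get("https://www.todayonline.com/api/v3/news_feed/7", headers=headers)
print(response.status_code)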
You can access the data via the site's API (check out the Network tab):
For example,
import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()
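From there you can drill into the returned dict; assuming the same "nodes" structure as in the previous answer:
for node in data["nodes"]:
    print(node["node"]["title"])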
I suggest a fairly simple approach:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page, "html.parser")  # colon tags like news:title parse as literal tag names
news = [i.text for i in soup.find_all('news:title')]
print(news)
Output:
['DBS named world’s best bank by New York-based financial publication',
'Russia has very serious questions to answer on Navalny - UK',
"Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
'Three militants killed after fatal attack on policeman in Tunisia',
.....]
Also, you can check the XML page for more information if required.
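For instance, Google News sitemaps usually carry a publication date alongside each title; here is a sketch assuming the news:news and news:publication_date tags are present in this feed:
import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page, "html.parser")  # treats news:title etc. as literal tag names
for item in soup.find_all('news:news'):
    title = item.find('news:title')
    date = item.find('news:publication_date')
    if title and date:
        print(date.text, '-', title.text)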
P.S. Always check for compliance (the site's terms of service and robots.txt) before scraping any website :)
There are different ways of gathering the content of a webpage that is rendered by JavaScript:
Using Selenium with the Firefox web driver (see the sketch below)
Using a headless browser such as PhantomJS (note that PhantomJS is no longer maintained; headless Firefox or Chrome is the safer choice today)
Making an API call using a REST client or the Python requests library
You have to do your research first.
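As a minimal sketch of the first option (the URL is taken from the question above; options.headless works in Selenium 3 and early Selenium 4):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get("https://www.todayonline.com/")
html = driver.page_source  # HTML after JavaScript has run
driver.quit()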
Related
I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I used the soup.find_all() method and displayed the result to see if it works, but it gives me an empty list. Can you help me fix it?
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content, "lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer",
                          attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and loads its data with scripts. When you make a request with requests.get("https://www.youtube.com/feed/explore"), it fetches only the initial source file, which contains things like head and meta information plus the scripts. In a real browser, those scripts then load the data from the server, but BeautifulSoup does not catch interactions with the DOM via JavaScript. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: at that point there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class in the document.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy with a JavaScript-rendering add-on such as scrapy-splash, since Scrapy alone does not execute JavaScript either).
For YouTube, you can use its API as well.
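For example, here is a sketch using the YouTube Data API v3 "most popular" chart; YOUR_API_KEY is a placeholder, and you create a real key in the Google Cloud console:
import requests

params = {
    "part": "snippet",
    "chart": "mostPopular",
    "regionCode": "US",
    "maxResults": 10,
    "key": "YOUR_API_KEY",  # placeholder for your own API key
}
resp = requests.get("https://www.googleapis.com/youtube/v3/videos", params=params)
for item in resp.json().get("items", []):
    print(item["snippet"]["title"])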
I am attempting to scrape some basic product information from the URL below, but the bs4 find_all command isn't finding any data given the name of the class associated with the product div. Specifically, I am trying:
url = "https://www.walmart.com/grocery/browse/Cereal-&-Breakfast-Food?aisle=1255027787111_1255027787501"
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
product_list = soup.find_all('div', class_='productListTile')
print(product_list)
But this prints an empty list []. Having inspected the webpage in Chrome, I know that 'productListTile' is the correct class name. Any idea what I am doing wrong?
You will most likely need to use Selenium. Plain requests (as used with Beautiful Soup) get redirected to a "Verify Your Identity" page.
Here is a very similar question to this one, which has code with Selenium and Beautiful Soup working in concert to scrape Wal-Mart
python web scraping using beautiful soup is not working
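A minimal sketch of that combination (the class name is taken from the question; the fixed sleep is a crude placeholder for a proper wait):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get("https://www.walmart.com/grocery/browse/"
           "Cereal-&-Breakfast-Food?aisle=1255027787111_1255027787501")
time.sleep(10)  # crude wait for the JS-rendered product tiles
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
print(soup.find_all("div", class_="productListTile"))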
Web scraping techniques vary with websites. In this case, you can either use Selenium, which is a good option, or, as described here, call the site's backend API directly with requests; this has helped me a lot.
Inspect the web page, select the Network tab, and refresh the page.
Then sort the requests by type: you will see the API calls the page makes to fetch its data from the backend, so you can call that backend API directly.
Check the "Headers" pane to see the API endpoint, and the "Preview" pane to see the API response in JSON format.
If you also want the images, check the response for image URLs; you can download the images and map them to their ids.
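In code, that boils down to something like the following; the URL and the JSON keys are hypothetical placeholders for whatever you see in your own Network tab:
import requests

# hypothetical endpoint: copy the real one from the Network tab in DevTools
api_url = "https://example.com/api/endpoint-from-network-tab"
resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
data = resp.json()
print(data.keys())  # explore the structure before extracting fields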
I'm trying to scrape the CO2 per resource trend table data from this URL: caiso.com/todaysoutlook/pages/emissions.html
The href attribute of the <a> elements contains the dataset for the chart (as a very long string). I was attempting to return this attribute, but my code returns an empty result for the following request, no matter how hard I try and google other suggestions.
import requests
from bs4 import BeautifulSoup

url = 'http://www.caiso.com/todaysoutlook/pages/emissions.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
html = soup.find("a", {"class": "dropdown-item"})
print(html)
Any thoughts are appreciated! Thanks.
It seems like what you really are asking about is why the element does not have an href attribute when you inspect it in your code. The reason is, when you request the HTML page from the server it actually returns a static page without any chart data. When you view this page in a web browser, it runs some JavaScript code that queries the backend and populates the chart data dynamically. So you'll need to modify your approach to get that data.
One option is to manually inspect the page in your browser, reverse-engineer how it fetches the data, and do the same thing in your code. Most web browsers have built-in developer tools that can help with this.
Another option is to use a browser automation tool like Selenium to load the page in a web browser environment and scrape the data from there. This is less efficient, but might be easier for someone inexperienced in web programming because you can treat the JavaScript functionality as a "black box" and interact with the page more like how a real user would.
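A minimal sketch of the Selenium option (the class name is taken from the question; the fixed sleep is a crude stand-in for an explicit wait):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get("http://www.caiso.com/todaysoutlook/pages/emissions.html")
time.sleep(10)  # crude wait for the chart data to be populated by JS
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
link = soup.find("a", {"class": "dropdown-item"})
print(link["href"] if link else "not found")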
I'm currently on a mission to scrape popular joke websites. One example is jokes.cc.com. If you visit the site and briefly hover your cursor over the 'Get Random Joke' button on the left of the page, you will notice the link it points to is jokes.cc.com/#.
If you wait a while, it changes to a proper link within the website which displays the actual joke: jokes.cc.com/*legit joke link*.
If you analyze the HTML of the page, you will notice there is a link (<a>) with class=random_link whose href stores the link to the random joke the page wants to redirect you to. You can check this after the page has completely loaded; basically, the '#' is replaced by a legitimate link.
Now, here is my code for scraping the HTML, as I have done with static websites until now. I have used the BeautifulSoup library:
import urllib
from bs4 import BeautifulSoup

urlToRead = "http://jokes.cc.com"
handle = urllib.urlopen(urlToRead)  # Python 2; in Python 3 this is urllib.request.urlopen
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
# Find the exact position of the joke link in the page
print soup.findAll('a', {'class':'random_link'})[0]
Output: #
This is the expected output as I have come to realize that the page has not completely rendered.
How do I scrape the page after waiting a while, or after the rendering is complete? Will I need to use an external library like Mechanize? I'm unsure how to do that, so any help/guidance is appreciated.
EDIT: I was finally able to resolve my issue by using PhantomJS along with Selenium in Python. Here is the code which fetches the page after rendering is complete.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # Selenium driving PhantomJS
driver.get('http://jokes.cc.com/')
soupFromJokesCC = BeautifulSoup(driver.page_source, "html.parser")  # HTML source after rendering
# locate the joke link in the HTML
randomJokeLink = soupFromJokesCC.findAll('div', {'id':'random_joke'})[0].findAll('a')[0]['href']
# now go to that page and scrape the joke from there
print randomJokeLink  # It works :D
The data you're after is generated by JavaScript running dynamically on page load. BeautifulSoup does not have a JavaScript engine, so it doesn't matter how long you wait: the link will never change. There are Python libraries that can execute and understand JavaScript, but your best bet is probably to dig in and work out how the JS on the website actually works. If they pull a random joke from a data feed, for example, it might be in a format such as JSON, which Python can parse very easily. This would make your application much more lightweight than bundling a full-blown scripting engine.
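If you do find such a feed in the Network tab, the consuming code is tiny; the endpoint below is a hypothetical placeholder for whatever the site's JS actually requests:
import requests

# hypothetical endpoint: substitute the real one found in DevTools
data = requests.get("http://jokes.cc.com/feeds/random_joke.json").json()
print(data)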
I have been trying to scrape Facebook comments using Beautiful Soup on the page below.
import BeautifulSoup  # BeautifulSoup 3, under Python 2
import urllib2

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
divs = soup("div", {"class": "postText"})  # all divs with class postText
fb_comment = divs[0].find(text=True) if divs else None
print fb_comment
The output is None. However, I can clearly see that the Facebook comment is within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering whether the approach is correct and where I am going wrong.)
Like Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), and then use BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
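Historically, that was the public Graph API comments endpoint; here is a rough sketch of what that looked like (this endpoint has since been locked down behind access tokens, so treat it purely as an illustration):
import requests

page_url = "http://techcrunch.com/2012/05/15/facebook-lightbox/"
# the old public endpoint; modern Graph API versions require an access token
resp = requests.get("https://graph.facebook.com/comments/", params={"ids": page_url})
print(resp.json())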
The parts of the page you are looking for are not included in the source file; use a browser and you can see this for yourself by viewing the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.