I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried the soup.find_all() method and printed the result to see if it works, but it gives me an empty list. Can you help me fix it?
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and uses scripts to load its data. When you make a request with requests.get("https://www.youtube.com/feed/explore"), you get only the initial source of the page, which contains things like the head, meta tags, and the scripts themselves. In a real browser, those scripts then fetch the video data from the server, but BeautifulSoup does not execute JavaScript or catch its interactions with the DOM. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: in the initial source there is no ytd-video-renderer tag and no style-scope ytd-expanded-shelf-contents-renderer class.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy).
For YouTube, you can use its API as well.
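For completeness, here is a minimal Selenium sketch (it assumes a Chrome driver is installed; the tag and class names are the ones from the question and may change whenever YouTube updates its markup):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/feed/explore")
# wait until the page's scripts have rendered at least one video element
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "ytd-video-renderer"))
)
soup = BeautifulSoup(driver.page_source, "lxml")
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(len(trendings))
driver.quit()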
I am trying to scrape a website. I have tried two methods, but neither provides me with the full website source code that I am looking for. I am trying to scrape the news titles from the website URL provided below.
URL: "https://www.todayonline.com/"
Here are the two methods I have tried; both failed.
Method 1: Beautiful Soup
import requests
from bs4 import BeautifulSoup

tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page, "html.parser")
soup  # Returns HTML with JavaScript text
soup.find_all('h3')
### Returns an empty list []
Method 2: Selenium + BeautifulSoup
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

tdy_url = "https://www.todayonline.com/"
options = Options()
options.headless = True
driver = webdriver.Chrome("chromedriver", options=options)
driver.get(tdy_url)
time.sleep(10)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all('h3')
### Returns less than 1/4 of the 'h3' tags found in the original page source
Please help. I have tried scraping other news websites and it is so much easier. Thank you.
The news data on the website you are trying to scrape is fetched from the server with JavaScript (this is called XHR -- XMLHttpRequest). It happens dynamically, while the page is loading or being scrolled, so this data is not included in the page the server initially returns.
In the first example, you are getting only the page returned by the server -- without the news, but with the JS that is supposed to fetch them. Neither requests nor BeautifulSoup can execute JS.
However, you can try to reproduce the requests that fetch the news titles from the server using the Python requests library. Follow these steps:
Open the DevTools of your browser (usually F12 or Ctrl+Shift+I) and look at the requests that fetch the news titles from the server. Sometimes this is even easier than scraping with BeautifulSoup.
Copy the request link (right-click -> Copy -> Copy link), and pass it to requests.get(...).
Get the .json() of the response. It will return a dict that is easy to work with. To better understand the structure of the dict, I recommend using pprint instead of plain print. Note you have to do from pprint import pprint before using it.
Here is an example of the code that gets the titles from the main news on the page:
import requests

nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7").json()["nodes"]
for node in nodes:
    print(node["node"]["title"])
If you want to scrape a group of news under a caption, you need to change the number after news_feed/ in the request URL (to get it, just filter the requests by "news_feed" in the DevTools and scroll the news page down).
Sometimes websites have protection against bots (although the one you are trying to scrape doesn't). In such cases, you might need additional steps as well, such as sending browser-like headers with the request.
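As a rough sketch of that (the User-Agent value below is just an example of a browser-like header, not something this particular site requires):
import requests

# some sites check the User-Agent header; a browser-like value sometimes helps
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("https://www.todayonline.com/api/v3/news_feed/7", headers=headers)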
You can access data via API (check out the Network tab):
For example,
import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()
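To get a first look at the structure of the returned dict, you can use pprint with a depth limit:
from pprint import pprint

# limit the depth so the overall structure is visible without the noise
pprint(data, depth=2)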
I will suggest a fairly simple approach:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page, 'html.parser')  # html.parser keeps the 'news:title' tag names searchable as-is
news = [i.text for i in soup.find_all('news:title')]
print(news)
Output:
['DBS named world’s best bank by New York-based financial publication',
'Russia has very serious questions to answer on Navalny - UK',
"Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
'Three militants killed after fatal attack on policeman in Tunisia',
.....]
Also, you can check the XML page for more information if required.
P.S. Always check for compliance before scraping any website :)
There are different ways of gathering the content of a webpage that contains Javascript.
Using Selenium with the Firefox web driver
Using a headless browser with PhantomJS
Making an API call using a REST client or the Python requests library
You will have to do your own research to choose the best fit for your case.
I've tried to get the world population from this website: https://www.worldometers.info/world-population/
but I can only get the HTML code, not the actual numbers.
I already tried finding the children of the object I wanted the data from. I also tried listing the whole object, but nothing seemed to work.
# just importing stuff
import requests
from bs4 import BeautifulSoup

# getting the html from the website as text
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text, 'html.parser')

# here it only finds the one object that is listed below
current_population = soup.find('div', {'class': 'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
<span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
<span class="rts-counter" rel="current_population"><span class="rts-nr-sign"></span><span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e0">630</span></span>
I always only get the first one, but want to get the second one from 'inspect-mode'.
You are going to need a method that lets JavaScript run, such as Selenium, as this number is set up by a counter generated in this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver

d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# the rel="current_population" attribute identifies the live counter element
print(d.find_element_by_css_selector('[rel="current_population"]').text)
You could try writing your own version of that JavaScript script, but I wouldn't recommend it.
I didn't need an explicit wait condition for the Selenium script, but one could be added.
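For example, a sketch with an explicit wait added (same Selenium-3-style API as above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# wait up to 10 seconds for the counter element to be present in the DOM
elem = WebDriverWait(d, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[rel="current_population"]'))
)
print(elem.text)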
The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python
JavaScript renders content into the DOM at runtime, so Beautiful Soup will not work as you want it to.
You will have to use something that lets JavaScript run (e.g. a browser), so you can build your own small browser using Qt4 or the like. Sentdex had a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
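For reference, a sketch of the PyQt4 pattern that tutorial builds (it assumes PyQt4 with QtWebKit is installed; treat it as a starting point rather than a drop-in solution):
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in an offscreen WebKit page and keep the rendered HTML."""
    def __init__(self, url):
        self.app = QApplication([])
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until loadFinished fires

    def _load_finished(self, result):
        self.html = self.mainFrame().toHtml()
        self.app.quit()

html = Render('https://www.worldometers.info/world-population/').html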
Otherwise, you could use Selenium:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
time.sleep(5)  # give the page's scripts time to populate the counters
html = driver.page_source
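From there you can hand the rendered HTML to BeautifulSoup as usual, for example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# the rel="current_population" element now contains the rendered number
print(soup.select_one('[rel="current_population"]').get_text())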
I have a router that I want to log in to and retrieve information from using a Python script. I'm a newbie to Python but want to learn and explore more with it. Here is what I have written so far:
from requests.auth import HTTPBasicAuth
import requests
from bs4 import BeautifulSoup
response = requests.get('http://192.168.1.1/Settings.html/', auth=HTTPBasicAuth('Username', 'Password'))
html = response.content
soup = BeautifulSoup(html, "html.parser")
print (soup.prettify())
I have two questions:
When I run the script the first time, I receive an authentication error. On running the script a second time it seems to authenticate fine and retrieve the HTML. Is there a better method?
With BS I want to retrieve only the code I require from the script. I can't see a tag to point BS at. At the start of the HTML there is a list of variables whose data I want to scrape, for example:
var Device Pin = '12345678';
It's much easier to retrieve the information with a single script than by logging onto the web interface each time. It sits within the script tag of type="text/javascript".
Is BS the correct tool for the job? Can I just scrape that one line from the list of variables?
Any help, as always, is very much appreciated.
As far as I know, BeautifulSoup does not handle JavaScript. In this case, it's simple enough to just use regular expressions. Note that you should match against response.text (a str), not response.content (bytes):
import re

m = re.search(r"var Device Pin\s+= '(\d+)'", response.text)
if m:
    pin = m.group(1)
Regarding the authentication problem, you can retry the call (for example, by checking response.status_code) if it doesn't work the first time.
I'd run a packet sniffer, tcpdump or wireshark, to see the interaction between your script and your router. Viewing the interactions may help determine why you're unable to authenticate on the first pass. As a workaround, run the auth section in a for loop which will try N number of times to authenticate before failing.
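A rough sketch of that workaround (it assumes a failed authentication shows up as a non-200 status code; your router may behave differently):
import requests
from requests.auth import HTTPBasicAuth

response = None
for attempt in range(3):  # try up to 3 times before failing
    response = requests.get('http://192.168.1.1/Settings.html/',
                            auth=HTTPBasicAuth('Username', 'Password'))
    if response.status_code == 200:
        break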
Regarding scraping, you may want to consider lxml with the Beautiful Soup parser so you can use XPath. See: can we use xpath with BeautifulSoup?
XPath would allow you to easily pull a single value, text, attribute, etc. from the HTML, provided lxml can parse it.
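A minimal sketch of that combination (the XPath expression is illustrative; adjust it to the router page's actual markup):
from lxml.html import soupparser  # lxml parser backed by BeautifulSoup

root = soupparser.fromstring(html)
# grab the text of every inline script, where the list of variables lives
scripts = root.xpath('//script[@type="text/javascript"]/text()')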
I'm trying to extract the Earnings Per Share data through BeautifulSoup 4 from this page.
When I parse the data, the table information is missing with the default, lxml, and html5lib parsers. I believe this has something to do with JavaScript, and I have been trying to implement PyV8 to transform the script into readable HTML for BS4. The problem is I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts and it has been a very big headache for me today. Below is a quick example. The financeWrap div includes the table information, but BeautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup as bs

url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = bs(response.content, 'html5lib')
financial_tables = soup_key_ratios.find("div", {"id": "financeWrap"})
print(financial_tables)
# Output: <div id="financeWrap">
# </div>
The issue is that you're trying to get data that comes into the website through Ajax. If you go to the link you provided and look at the source via the browser, you'll see that the content with the data is not there.
However, if you use a network inspector, such as Firebug, you will see that there are Ajax requests made to a separate data URL, and that response is something you may be able to parse via BeautifulSoup (perhaps -- I haven't tried it or looked at the structure of the data).
Keep in mind that this is quite possibly against the website's ToS.
I have been trying to scrape Facebook comments using Beautiful Soup on the website page below.
import BeautifulSoup
import urllib2

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
fb_comment = soup.find("div", {"class": "postText"})
print fb_comment
The output is a null set. However, I can clearly see the Facebook comment within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering whether the approach is correct and where I am going wrong.)
Like Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org); then use BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
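As a rough sketch of what that request might look like (the comments endpoint below is from the old Graph API and is an assumption; check Facebook's current documentation):
import requests

# hypothetical sketch: the old Graph API lookup of comments by URL (circa 2012)
resp = requests.get("https://graph.facebook.com/comments/",
                    params={"ids": "http://techcrunch.com/2012/05/15/facebook-lightbox/"})
print(resp.json())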
The parts of the page you are looking for are not included in the source file. Use a browser and you can see this for yourself by opening the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.