I am doing a Python exercise that requires me to scrape the top news from the Google News website and print it to the console.
I used the Beautiful Soup library to retrieve the news. This was my code:
import bs4
from bs4 import BeautifulSoup
import urllib.request
news_url = "https://news.google.com/news/rss";
URLObject = urllib.request.urlopen(news_url);
xml_page = URLObject.read();
URLObject.close();
soup_page = BeautifulSoup(xml_page,"html.parser");
news_list = soup_page.findAll("item");
for news in news_list:
    print(news.title.text);
    print(news.link.text);
    print(news.pubDate.text);
    print("-"*60);
But it kept giving me errors instead of printing the 'link' and 'pubDate'. After some research, I found some answers here on Stack Overflow saying that, because the website uses JavaScript, one should use the Selenium package in addition to Beautiful Soup.
Despite not understanding how Selenium really works, I updated the code as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver");
driver.maximize_window();
driver.get("https://news.google.com/news/rss");
content = driver.page_source.encode("utf-8").strip();
soup = BeautifulSoup(content, "html.parser");
news_list = soup.findAll("item");
print(news_list);
for news in news_list:
    print(news.title.text);
    print(news.link.text);
    print(news.pubDate.text);
    print("-"*60);
However, when I run it, a blank browser page opens and this is printed to the console:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
(Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)
I just tried and the following code is working for me. The items = line is horrible, apologies in advance. But for now it works...
EDIT
Just updated the snippet, you can use the ElementTree.iter('tag') to iterate over all the nodes with that tag:
import urllib.request
import xml.etree.ElementTree
news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
    xml_page = page.read()
# Parse XML page
e = xml.etree.ElementTree.fromstring(xml_page)
# Get the item list
for it in e.iter('item'):
    print(it.find('title').text)
    print(it.find('link').text)
    print(it.find('pubDate').text, '\n')
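To try the parsing step without touching the network, here is a minimal sketch that runs the same ElementTree approach on an inline feed. SAMPLE_RSS and parse_items are made-up names for illustration, and the sample is only assumed to resemble the real feed's layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, shaped like a typical RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def parse_items(xml_text):
    """Return one dict per <item>, with None for any missing child element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["title"], entry["link"], entry["pubDate"])
```

Using findtext instead of find(...).text avoids an AttributeError when an item lacks one of the children.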
EDIT2: Discussion of personal library preferences for scraping
Personally, for interactive/dynamic pages where I have to do things (click here, fill in a form, obtain results, ...), I use selenium, and I usually don't need bs4, since selenium can find and parse the specific nodes of the page directly.
I use bs4 in conjunction with requests (instead of urllib.request) to parse more static webpages in projects where I don't want to have a whole webdriver installed.
There is nothing wrong with using urllib.request, but requests (see here for the docs) is one of the best Python packages out there (in my opinion) and is a great example of how to create a simple yet powerful API.
Simply use BeautifulSoup with requests.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')  # the 'xml' parser requires lxml to be installed
news_list = soup.find_all('item')
# do whatever you need with news_list
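A possible explanation for the missing 'pubDate' in the original attempt: Python's built-in HTML parser, which BeautifulSoup's "html.parser" option builds on, lowercases tag names, so a lookup for pubDate would find nothing. The sketch below uses only the standard library to show the lowercasing; TagRecorder and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Record the name of every start tag as the HTML parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


recorder = TagRecorder()
recorder.feed("<item><title>t</title><pubDate>Mon</pubDate></item>")
print(recorder.tags)  # 'pubDate' is reported as 'pubdate'
```

An XML parser keeps the original case, which is one reason to parse an RSS feed as XML rather than HTML.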
Related
I am using Selenium for web scraping and would like to use Beautiful Soup instead, but I am new to this library. I want to get all the company names and times, then jump to the next page.
Here is my code using Selenium first:
driver.get('http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml')
while True:
    links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@class="sibian"]/tbody/tr/td/table[2]/tbody/tr/td[2]/a')]
    for link in links:
        driver.get(link)
        driver.implicitly_wait(10)
        windows = driver.window_handles
        driver.switch_to.window(windows[-1])
        time = driver.find_element_by_xpath('//*[@class="con_bj"]/table[3]/tbody/tr/td/publishtime').text
        company = driver.find_element_by_xpath('//*[@class="title_A"]').text
        driver.back()
    if(len(links)< 20):
        break
I tried doing the same with beautifulsoup as:
from bs4 import BeautifulSoup
import requests
html='http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    num=link.find('a').get('href')
    print(num)
But I get nothing and am stuck at the first step.
Could you please help with that?
You are not making a request. You are treating BeautifulSoup as an HTTP request library, but it is just a parser. Think of driver.get() as requests.get() (yes, I know they are not the same, but it makes this easier to understand). You need to do something like this:
from bs4 import BeautifulSoup
import requests
html_link='http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
for cell in soup.find_all('td'):
    anchor = cell.find('a')
    # Skip table cells that contain no link
    if anchor is not None:
        print(anchor.get('href'))
This will allow you to further debug your code. It MAY NOT work, as some sites automatically reject requests or require specific headers, such as a User-Agent header. Requests is a very easy (subjective, of course) library to work with and has a lot of support on this site. To save some head-scratching, I will go ahead and tell you that if the site requires JavaScript, Selenium or some variant is the best option.
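If the site does reject header-less requests, attaching a User-Agent is the usual first thing to try. Here is a minimal sketch using only the standard library; with requests the equivalent is passing headers=... to requests.get(). The header value, build_request and fetch_html are assumptions for illustration, and nothing here confirms that this particular site needs them.

```python
import urllib.request

# Hypothetical header value; substitute whatever the target site accepts.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def build_request(url, headers=None):
    """Create a Request carrying the given headers without sending it."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url, timeout=10):
    """Send the request and return the decoded body (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Only build the request here; fetch_html() is what actually contacts the site.
request = build_request("http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml")
print(request.get_header("User-agent"))
```

The string returned by fetch_html() can be handed to BeautifulSoup in the same way as requests.get(...).text.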
I am using BeautifulSoup, and the findAll method is missing <p> tags. I run the code and it returns an empty list. But if I inspect the page I can clearly see them, as shown in the picture below.
I chose some random site.
import requests
from bs4 import BeautifulSoup
#An example web site
url = 'https://www.kite.com/python/answers/how-to-extract-text-from-an-html-file-in-python'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.findAll("p"))
The output:
(env) pinux@main:~/dev$ python trial.py
[]
I inspect the page using the browser:
The text is clearly there. Why doesn't BeautifulSoup catch them? Can someone shed some light on what is going on?
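It appears that parts of this webpage are rendered with JavaScript. You can try using Selenium, which drives a real browser that executes the page's scripts. Note that browser.get() returns once the page has loaded, so content that scripts insert later may still need an explicit wait.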
import bs4
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://url-to-webpage.com")
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
I'm new to web scraping and BeautifulSoup4, so I'm sorry if my question is obvious.
I'm trying to get the hours played from Steam, but <div id="games_list_rows" style="position: relative"> returns None when it should return a lot of different <div class="gameListRow" id="game_730"> elements with stuff inside.
I also tried a friend's profile that has only a few games, because I thought a lot of data might make BS4 ignore the div, but the div still comes back empty.
Here's my code:
import bs4 as bs
import urllib.request
# Retrieve profile
profile = "chubaquin"#input("enter profile: >")
search = "https://steamcommunity.com/id/"+profile+"/games/?tab=all"
sauce = urllib.request.urlopen(search)
soup = bs.BeautifulSoup(sauce, "lxml")
a = soup.find("div", id="games_list_rows")
print(a)
Thanks for your help!
The page content is loaded dynamically with JavaScript, so a plain HTTP fetch (urllib.request or requests) never sees it. Try using Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://steamcommunity.com/id/chubaquin/games/?tab=all"
driver = webdriver.Chrome(r"c:\path\to\chromedriver.exe")
driver.get(url)
# Wait for the page to fully render before parsing it
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find("div", id="games_list_rows"))
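A fixed sleep(5) can be too short on a slow connection and wasteful on a fast one. One alternative is to poll for a condition with a timeout, sketched below with the standard library only. wait_until is a hypothetical helper, not part of Selenium (which ships its own WebDriverWait for this purpose), and page_is_ready is a stand-in for a real check.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check.
attempts = {"count": 0}


def page_is_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(page_is_ready, timeout=2.0, interval=0.01)
print(attempts["count"])
```

With the driver above, the predicate could be something like lambda: "games_list_rows" in driver.page_source.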
Have you tried the official Steam Web API? (xPaw docs are better than their own)
You need an API key, but they're free, and it's much easier to process the JSON result than to scrape the page(s), especially because the page can change occasionally, whereas the JSON format is unlikely to change often.
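As a rough sketch of what processing the JSON could look like: the endpoint URL, the query parameters, the response shape and the assumption that playtime_forever is in minutes are all taken from memory of the owned-games call and should be checked against the API docs. SAMPLE_RESPONSE is invented data, and build_url and hours_played are hypothetical helpers.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the owned-games call; verify against the API docs.
API_URL = "https://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/"


def build_url(api_key, steam_id):
    """Build the request URL; the parameter names are assumptions."""
    query = urlencode({
        "key": api_key,
        "steamid": steam_id,
        "include_appinfo": 1,
        "format": "json",
    })
    return API_URL + "?" + query


# Invented response body in the assumed shape; playtime assumed to be minutes.
SAMPLE_RESPONSE = json.dumps({
    "response": {
        "game_count": 2,
        "games": [
            {"appid": 10, "name": "Example Game", "playtime_forever": 120},
            {"appid": 730, "playtime_forever": 90},
        ],
    }
})


def hours_played(payload_text):
    """Map each game's name (or its appid if unnamed) to hours played."""
    games = json.loads(payload_text).get("response", {}).get("games", [])
    return {
        game.get("name", str(game["appid"])): game.get("playtime_forever", 0) / 60
        for game in games
    }


print(hours_played(SAMPLE_RESPONSE))
```

Fetching build_url(...) with urllib.request or requests and passing the body to hours_played() would replace the browser-driven scraping entirely.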
I am trying to scrape this web page using the Python requests library.
But I am not able to download the complete HTML source. When I inspect elements in my web browser, it shows the complete HTML, which I believe can be used for scraping, but when I access this URL from Python, the HTML tags that hold the data are simply missing and I am not able to scrape data from them. Here is my sample code:
import requests
from bs4 import BeautifulSoup as BS
import urllib.request
import http.client
url = 'https://www.udemy.com/topic/financial-analysis/?lang=en'
user_agent='my-user-agent'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BS(html,'html.parser')
Can anybody please help me out? Thanks.
The page is likely being built by JavaScript, meaning the site sends over the same source you are pulling with urllib, and then the browser executes the JavaScript, modifying the source to render the page you are seeing.
You will need to use something like Selenium, which will open the page in a browser, render the JS, and then return the source, e.g.:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
driver.page_source # or driver.execute_script("return document.body.innerHTML;")
I recommend using the stdlib module urllib2; it lets you fetch web resources comfortably.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
AND...
For parsing the code, have a look at BeautifulSoup.
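Note that urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request. Here is a minimal Python 3 sketch of the snippet above. read_url is a hypothetical helper, and the data: URL is used only so the example runs offline.

```python
import urllib.request


def read_url(url, timeout=10):
    """Return the raw bytes served at url."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# A data: URL keeps the demonstration offline; pass an http(s) URL in practice.
page_source = read_url("data:text/plain,hello")
print(page_source)
```

Like urllib2, this returns only what the server sends, so content added by JavaScript will still be absent.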
Thanks to you both. @blakebrojan, I tried your method, but it opened a new Chrome instance and displayed the result there. What I want is to get the source code in my script and scrape data from it. Here is the code:
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe')
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
html=driver.page_source
I'm a newbie to HTML parsers. I'm trying to parse the source code of the webpage at this URL (http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student). I'm trying to get the answer_count.
I tried it in the following way:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student'
q = urllib2.urlopen(url)
soup = BeautifulSoup(q)
divs = soup.find_all('div',class_='answer_count')
But I get the list 'divs' as empty. Why is it so? Where am I wrong? How do I implement it to get the result as '2 Answers'?
Maybe you are not seeing the same page in your browser as we are (because you are logged in, for example).
When I look at the webpage you provided with Google Chrome, 'answer_count' appears nowhere in the source code. So if Google Chrome doesn't find it, BeautifulSoup won't either.
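A quick way to run the same check from code is to list the class names that actually occur in the downloaded markup before writing any selectors. Here is a minimal sketch using the standard library's HTML parser; ClassCollector, classes_in and SAMPLE_HTML are made up for illustration, and in practice the input would be the raw response body.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html_text):
    """Return the set of class names found in html_text."""
    collector = ClassCollector()
    collector.feed(html_text)
    collector.close()
    return collector.classes


# Hypothetical markup standing in for the raw response body.
SAMPLE_HTML = '<div class="header main"><span class="answer_count">2 Answers</span></div>'
print("answer_count" in classes_in(SAMPLE_HTML))
print("missing_class" in classes_in(SAMPLE_HTML))
```

If the class you are looking for is not in the set, no parser will find it in that response, and the content is probably added later by JavaScript or served only to logged-in users.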