I'm new to HTML parsers. I'm trying to parse the source code of the page at http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student to get the answer_count.
I tried it in the following way:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student'
q = urllib2.urlopen(url)
soup = BeautifulSoup(q)
divs = soup.find_all('div', class_='answer_count')
But divs comes back as an empty list. Why is that? Where am I going wrong, and how do I get the result '2 Answers'?
Maybe your browser isn't showing you the same page we see (because you are logged in, for example).
When I look at the page you linked in Google Chrome, 'answer_count' appears nowhere in the source code. If Google Chrome doesn't find it, BeautifulSoup won't either.
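You can see this behaviour for yourself without any network access: BeautifulSoup only matches what is literally present in the markup it is given, so if the served HTML never contains that class, find_all returns an empty list. A minimal sketch (the markup below is invented for illustration, not taken from Quora):

```python
from bs4 import BeautifulSoup

# Invented stand-in for HTML served without the logged-in markup
served_html = '<div class="question">How many internships are necessary?</div>'
soup = BeautifulSoup(served_html, 'html.parser')

# The class is not in the served source, so the result is empty
print(soup.find_all('div', class_='answer_count'))  # []
```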
I'd like to make an auto-login program for a website on our internal network.
So I'm trying to parse that site using the requests and BeautifulSoup libraries.
It runs, but the HTML I get back is a lot shorter than the site's actual HTML.
What's the problem? Maybe a security issue?
Please help me.
import requests
from bs4 import BeautifulSoup as bs
page = requests.get("http://test.com")
soup = bs(page.text, "html.parse")
print(soup) # I get some html alot shorter than that site's html
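Two things worth checking, though I can't verify them against your internal site. First, "html.parse" is not a parser name BeautifulSoup recognizes; the built-in one is called "html.parser", and an unknown name raises FeatureNotFound. Second, if the page sits behind a login, a plain GET usually returns the (much shorter) login page; posting credentials through a requests.Session first may help. A sketch (the login URL and form field names are assumptions):

```python
import requests
from bs4 import BeautifulSoup, FeatureNotFound

# 1) The parser name must be "html.parser", not "html.parse"
try:
    BeautifulSoup("<p>hi</p>", "html.parse")
except FeatureNotFound:
    print("unknown parser name")
soup = BeautifulSoup("<p>hi</p>", "html.parser")  # works

# 2) Log in first so the server returns the full page
#    (URL and field names below are hypothetical placeholders)
session = requests.Session()
# session.post("http://test.com/login", data={"id": "...", "pw": "..."})
# page = session.get("http://test.com")
```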
I am using Selenium for web scraping and would like to switch to Beautiful Soup instead, but I am new to this library. I want to get all the company names and times and jump to the next page.
Here is my Selenium code first:
driver.get('http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml')
while True:
    links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@class="sibian"]/tbody/tr/td/table[2]/tbody/tr/td[2]/a')]
    for link in links:
        driver.get(link)
        driver.implicitly_wait(10)
        windows = driver.window_handles
        driver.switch_to.window(windows[-1])
        time = driver.find_element_by_xpath('//*[@class="con_bj"]/table[3]/tbody/tr/td/publishtime').text
        company = driver.find_element_by_xpath('//*[@class="title_A"]').text
        driver.back()
    if len(links) < 20:
        break
I tried doing the same with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
html = 'http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    num = link.find('a').get('href')
    print(num)
But I get nothing and am stuck at the first step.
Could you please help with that?
You are not making a request. You are treating BeautifulSoup as an HTTP request library, but it is just a parser. Think of driver.get() as requests.get() (yes, I know they are not the same, but it helps understanding). You need to do something like this:
from bs4 import BeautifulSoup
import requests

html_link = 'http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    num = link.find('a').get('href')
    print(num)
This will let you debug your code further. It may not work as-is: some sites require specific headers, such as a User-Agent, or automatically reject your request. Requests is a very easy library to work with (subjective, of course) and has a lot of support on this site. To save some head-scratching I will tell you up front: if the site requires JavaScript, Selenium or some variant of it is the best option.
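For instance, a User-Agent header can be passed straight to requests.get(url, headers=...). You can also inspect what would actually be sent, without touching the network, by preparing the request (the UA string below is just an example value):

```python
import requests

# Example User-Agent value; real scrapers often mimic a browser string
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/1.0)"}

req = requests.Request(
    "GET",
    "http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml",
    headers=headers,
)
prepared = req.prepare()  # what requests.get(url, headers=headers) would send
print(prepared.headers["User-Agent"])
```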
I'm trying to parse a page to learn BeautifulSoup. Here is the code:
import requests as req
from bs4 import BeautifulSoup
page = 'https://www.pathofexile.com/trade/search/Delirium/w0brcb'
resp = req.get(page)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all('results')
print(len(res))
Result: 0
The goal is to get the first price.
I looked for the tag in Chrome and it's there, but the browser probably makes another request to fetch the results.
Can someone explain what I am missing here?
[screenshot of the website's source code]
Problems with the code
Your code is looking for a "results" element. What you really have to look for (based on your screenshot) is a div element with the class "results".
So try this:
soup.find_all("div", attrs={"class":"results"})
But if you want the price, you have to dig deeper for the element that contains it:
price = soup.find("span", attrs={"data-field":"price"}).text
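On a small stand-in snippet (the markup below is guessed from the screenshot, not copied from the site), those two lookups behave like this:

```python
from bs4 import BeautifulSoup

# Guessed stand-in for the rendered markup; the real page is more complex
html = '<div class="results"><span data-field="price">5 chaos</span></div>'
soup = BeautifulSoup(html, "html.parser")

results = soup.find_all("div", attrs={"class": "results"})
price = soup.find("span", attrs={"data-field": "price"}).text
print(len(results), price)  # 1 5 chaos
```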
Problems with the site
It seems the site loads its data via Ajax; with Requests you get the page before/without that Ajax call.
In this case you should switch from Requests to the Selenium module, which drives a real browser and lets you wait until the data has finished loading before you start scraping.
Documentation: Selenium
I am trying to scrape this web-page using python requests library.
But I am not able to download the complete HTML source code. When I inspect elements in my web browser, it shows the complete HTML, which I believe could be used for scraping; but when I access this URL with Python, the HTML tags that hold the data simply disappear and I cannot scrape them. Here is my sample code:
import urllib.request
from bs4 import BeautifulSoup as BS

url = 'https://www.udemy.com/topic/financial-analysis/?lang=en'
user_agent = 'my-user-agent'
request = urllib.request.Request(url, headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BS(html, 'html.parser')
Can anybody please help me out? Thanks.
The page is likely being built by JavaScript: the site sends the same source you are pulling with urllib, and then the browser executes the JavaScript, modifying the DOM to render the page you are seeing.
You will need to use something like Selenium, which will open the page in a browser, render the JS, and then return the source, e.g.:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
driver.page_source # or driver.execute_script("return document.body.innerHTML;")
I recommend using the stdlib module urllib2 (Python 2); it lets you fetch web resources comfortably.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
AND...
For parsing the code, have a look at BeautifulSoup.
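For example, once you have the page source as a string (from response.read() above), BeautifulSoup can pull out individual elements. The HTML here is a stand-in, since we can't bundle a live response:

```python
from bs4 import BeautifulSoup

# Stand-in for page_source = response.read()
page_source = "<html><head><title>Google</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(page_source, "html.parser")
print(soup.title.text)  # Google
print(soup.p.text)      # Hello
```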
Thanks to you both. @blakebrojan, I tried your method, but it opened a new Chrome instance and displayed the result there; what I want is to get the source code in my code and scrape data from it. Here is the code:
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe')
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
html=driver.page_source
I am doing a Python exercise that requires me to get the top news from the Google News website by web scraping and print it to the console.
As I was doing it, I just used the Beautiful Soup library to retrieve the news. This was my code:
from bs4 import BeautifulSoup
import urllib.request

news_url = "https://news.google.com/news/rss"
URLObject = urllib.request.urlopen(news_url)
xml_page = URLObject.read()
URLObject.close()
soup_page = BeautifulSoup(xml_page, "html.parser")
news_list = soup_page.findAll("item")
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-" * 60)
But it kept giving me errors: 'link' and 'pubDate' would not print. After some research, I saw some answers here on Stack Overflow saying that, because the website uses JavaScript, one should use the Selenium package in addition to Beautiful Soup.
Despite not fully understanding how Selenium works, I updated the code as follows:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver")
driver.maximize_window()
driver.get("https://news.google.com/news/rss")
content = driver.page_source.encode("utf-8").strip()
soup = BeautifulSoup(content, "html.parser")
news_list = soup.findAll("item")
print(news_list)
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-" * 60)
However, when I run it, a blank browser page opens and this is printed to the console:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
(Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)
I just tried it, and the following code works for me.
EDIT
I just updated the snippet; you can use ElementTree.iter('tag') to iterate over all the nodes with that tag:
import urllib.request
import xml.etree.ElementTree

news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
    xml_page = page.read()

# Parse the XML page
e = xml.etree.ElementTree.fromstring(xml_page)

# Iterate over the item list
for it in e.iter('item'):
    print(it.find('title').text)
    print(it.find('link').text)
    print(it.find('pubDate').text, '\n')
EDIT2: Discussion personal preferences of libraries for scraping
Personally, for interactive/dynamic pages where I have to do things (click here, fill in a form, obtain results, ...), I use Selenium, and usually I don't need bs4 at all, since you can use Selenium directly to find and parse the specific nodes you are looking for.
I use bs4 together with requests (instead of urllib.request) to parse more static webpages in projects where I don't want a whole webdriver installed.
There is nothing wrong with using urllib.request, but requests (see here for the docs) is one of the best Python packages out there (in my opinion) and is a great example of how to create a simple yet powerful API.
Simply use BeautifulSoup with requests.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')
news_list = soup.find_all('item')
# do whatever you need with news_list
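As an aside on why the 'xml' parser matters here (note it requires lxml to be installed): with 'html.parser', the `<link>` tag is treated as a self-closing HTML tag and tag names are lowercased, which is exactly why news.link.text came back empty and news.pubDate failed in the original attempt. A small offline demonstration with made-up feed data:

```python
from bs4 import BeautifulSoup

# Made-up item, shaped like an RSS entry
rss = "<item><title>T</title><link>http://example.com</link><pubDate>Mon</pubDate></item>"

item = BeautifulSoup(rss, "html.parser").find("item")
print(repr(item.link.text))   # '' because html.parser closes <link> immediately
print(item.find("pubDate"))   # None because tag names were lowercased to pubdate
```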