This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 6 years ago.
I want to get the rainfall data of each day from here.
When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.
I am using urllib2 and BeautifulSoup from bs4
Here is my code:
import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
r = urllib2.urlopen(link)
soup = BeautifulSoup(r)
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print.find_all("div", class_="dataTable")
And I got an empty array.
My question is: How can I get the page content, but not from the page source code?
If you open up the dev tools on chrome/firefox and look at the requests, you'll see that the data is generated from a request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml which gives the data for all 12 months which you can then extract from.
If you cannot find the div in the source it means that the div you are looking for is generated. It could be using some JS framework like Angular or just JQuery. If you want to browse through the rendered HTML you have to use a browser which runs the JS code included.
Try using selenium
How can I parse a website using Selenium and Beautifulsoup in python?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')
html = driver.page_source
soup = BeautifulSoup(html)
print soup.find_all("td", class_="td1_normal_class")
However note that using Selenium considerabily slows down the process since it has to pull up a headless browser.
Related
This question already has answers here:
Scraping YouTube links from a webpage
(3 answers)
Closed 2 years ago.
i am scraping youtube search results using the following code :
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
for each in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer"):
print(each.get('href'))
but it is returning nothing . what is wrong with this code?
BeatifulSoup is not the right tool for Youtube scraping_ - Youtube is generating a lot of content using JavaScript.
You can easily test it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.youtube.com/results?search_query=python"
>>> response = requests.get(url)
>>> soup = BeautifulSoup(response.content,'html.parser')
>>> soup.find_all("a")
[About, Press, Copyright, Contact us, Creators, Advertise, Developers, Terms, Privacy, Policy and Safety, Test new features]
(pay attention there's that links you see on the screenshot are not present in the list)
You need to use another solution for that - Selenium might be a good choice. Please have at look at this thread for details Fetch all href link using selenium in python
I am trying to scrape this web-page using python requests library.
But I am not able to download complete html source code. When I use my web-browser to inspect elements, it gives complete html, which I believe can be used for scraping, but when I access this url using python requests library, those html tags which have data are simply disappeared and I am not able to scrape data from those. Here is my sample code :
import requests
from bs4 import BeautifulSoup as BS
import urllib
import http.client
url = 'https://www.udemy.com/topic/financial-analysis/?lang=en'
user_agent='my-user-agent'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BS(html,'html.parser')
can anybody please help me out?? Thanks
The page is likely being built by javascript, meaning the site sends over the same source you are pulling from urllib, and then the browser executes the javascript, modifying the source to render the page you are seeing
You will need to use something like selenium, which will open the page in a browser, render the JS, and then return the source e.g.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
driver.page_source # or driver.execute_script("return document.body.innerHTML;")
I recommend you using the stdlib module urllib2, it will allow you to comfortably get web resources.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
AND...
For parsing the code, have a look at BeautifulSoup.
Thanks to you both, #blakebrojan i tried your method,, but it opened a new chrome instance and display result there,, but what i want is to get source code in my code and scrape data from that code ... here is the code
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Lenovo\\Desktop\\chrome-driver\\chromedriver.exe')
driver.get("https://www.udemy.com/topic/financial-analysis/?lang=en")
html=driver.page_source
I am doing a Python exercise, and it requires me to get the top news from the Google news website by web scraping and print to the console.
As I was doing it, I just used the Beautiful Soup library to retrieve the news. That was my code:
import bs4
from bs4 import BeautifulSoup
import urllib.request
news_url = "https://news.google.com/news/rss";
URLObject = urllib.request.urlopen(news_url);
xml_page = URLObject.read();
URLObject.close();
soup_page = BeautifulSoup(xml_page,"html.parser");
news_list = soup_page.findAll("item");
for news in news_list:
print(news.title.text);
print(news.link.text);
print(news.pubDate.text);
print("-"*60);
But it kept giving me errors by not printing the 'link' and 'pubDate'. After some research, I saw some answers here on Stack Overflow, and they said that, as the website uses Javascript, one should use the Selenium package in addition to Beautiful Soup.
Despite not understanding how Selenium really works, I updated the code as following:
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver");
driver.maximize_window();
driver.get("https://news.google.com/news/rss");
content = driver.page_source.encode("utf-8").strip();
soup = BeautifulSoup(content, "html.parser");
news_list = soup.findAll("item");
print(news_list);
for news in news_list:
print(news.title.text);
print(news.link.text);
print(news.pubDate.text);
print("-"*60);
However, when I run it, a blank browser page opens and this is printed to the console:
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
(Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)
I just tried and the following code is working for me. The items = line is horrible, apologies in advance. But for now it works...
EDIT
Just updated the snippet, you can use the ElementTree.iter('tag') to iterate over all the nodes with that tag:
import urllib.request
import xml.etree.ElementTree
news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
xml_page = page.read()
# Parse XML page
e = xml.etree.ElementTree.fromstring(xml_page)
# Get the item list
for it in e.iter('item'):
print(it.find('title').text)
print(it.find('link').text)
print(it.find('pubDate').text, '\n')
EDIT2: Discussion personal preferences of libraries for scraping
Personally, for interactive/dynamic pages in which I have to do stuff (click here, fill a form, obtain results, ...): I use selenium, and usually I don't have a need to use bs4, since you can use selenium directly to find and parse the specific nodes of the web you are looking for.
I use bs4 in conjunction with requests (instead of urllib.request) for to parse more static webpages in projects I don't want to have a whole webdriver installed.
There is nothing wrong with using urllib.request, but requests (see here for the docs) is one of the best python packages out there (in my opinion) and is a great example of how to create a simple yet powerful API.
Simply use BeautifulSoup with requests.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')
news_list = soup.find_all('item')
# do whatever you need with news_list
i working on a script for scraping video titles from this webpage
" https://www.google.com.eg/trends/hotvideos "
but the proplem is the titles are hidden on the html source page but i can see it if i used the inspector to looking for that
that's my code it's working good with this ("class":"wrap")
but when i used that with the hidden one like "class":"hotvideos-single-trend-title-container" that's did't give me anything on output
#import urllib2
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.google.com.eg/trends/hotvideos').read()
soup = BeautifulSoup(html)
print (soup.findAll('div',{"class":"hotvideos-single-trend-title-container"}))
#wrap
The page is generated/populated by using Javascript.
BeautifulSoup won't help you here, you need a library which supports Javascript generated HTML pages, see here for a list or have a look at Selenium
I'm newbie to HTML parsers. I'm actually trying to parse the source code of the webpage with url (http://www.quora.com/How-many-internships-are-necessary-for-a-B-Tech-student). I'm trying to get the answer_count.
I tried it in the following way:
import urllib2
from bs4 import BeautifulSoup
q = urllib2.urlopen(url)
soup = BeautifulSoup(q)
divs = soup.find_all('div',class_='answer_count')
But I get the list 'divs' as empty. Why is it so? Where am I wrong? How do I implement it to get the result as '2 Answers'?
Maybe you don't have the same page as us on your browser (because you are logged in or so).
When I look at the webpage you provided with Google Chrome, there is nowhere 'answer_count' in the source code. So if Google chrome doen't find it, BeautifulSoup won't either