I want to scrape data from Instagram and Twitter, but I can't do it with requests alone. So I want to use a WebDriver like Selenium to open the page in a browser and read the page source from there. How can I get the content of a page that has been opened by Selenium?
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://example.com")
html_source = driver.page_source
You can use the driver.page_source property, which returns the HTML of the page as it is currently rendered in the browser.
A website loads part of its content after the page is opened. When I use libraries such as requests and urllib3, I cannot get the part that is loaded later. How can I get the HTML of this website as it appears in the browser? I can't open a browser with Selenium to get the HTML, because the browser would slow this process down.
I tried httpx, httplib2, urllib, and urllib3, but I couldn't get the later-loaded section.
You can use Selenium to simulate a user loading the page and to wait for the additional HTML elements to appear, then parse the result with the BeautifulSoup library.
I would suggest using Selenium since it provides the WebDriverWait class, which can help you wait for those additional HTML elements before scraping.
Here is a simple example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Replace with the URL of the website you want
url = "https://www.example.com"
# Add the option for a headless browser
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Create a new instance of the Chrome webdriver with those options
driver = webdriver.Chrome(options=options)
driver.get(url)
# Wait for the additional HTML elements to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@class, 'lazy-load')]")))
# Get HTML
html = driver.page_source
print(html)
driver.quit()
In the example above you can see that I'm using an explicit wait of up to 10 seconds for a specific condition to occur. More specifically, I wait until the elements whose class attribute contains 'lazy-load' are located via By.XPATH, and then I retrieve the HTML.
Finally, I would recommend checking out both BeautifulSoup and Selenium, since both have tremendous capabilities for scraping websites and automating web-based tasks.
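If you then want to pull data out of that HTML, here is a minimal sketch with BeautifulSoup, continuing from the html variable above (the 'lazy-load' class is just the placeholder from the example):
from bs4 import BeautifulSoup
# Parse the HTML that Selenium captured after the explicit wait
soup = BeautifulSoup(html, "html.parser")
# 'lazy-load' is the placeholder class from the example above
for element in soup.find_all(class_="lazy-load"):
    print(element.get_text(strip=True))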
I am trying to scrape a Twitter page. There is a long list of comments, so using Selenium I scrolled down to the end:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get(url)  # url is the Twitter page being scraped
for i in range(30):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
Now when I try to get the elements by the tag name article, not all of the tags are captured:
> len(driver.find_elements_by_tag_name('article'))
16
When I scroll the page manually and then run the same code:
> len(driver.find_elements_by_tag_name('article'))
20
The same is the case for page_source: when I save driver.page_source to a file and search that file for a username that exists on the page, the name is not found; only the usernames near the end of the HTML are present.
At first I thought it might be a browser issue, so I tried the same thing with ChromeDriver, but the results were similar.
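A common workaround here is to scroll until the page height stops growing rather than a fixed 30 times. A minimal sketch, assuming the page keeps appending content as you scroll (the url value is a placeholder):
from selenium import webdriver
import time
url = "https://www.example.com"  # placeholder for the page being scraped
driver = webdriver.Firefox()
driver.get(url)
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to append new content
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was loaded, so we reached the end
    last_height = new_height
Note also that sites like Twitter often virtualize long lists, removing off-screen nodes from the DOM as you scroll, which would explain the missing articles; in that case the usual fix is to collect the elements inside the loop on each pass rather than once at the end.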
When I view the source HTML after manually navigating to the site via Chrome, I can see the full page source, but when loading the page via Selenium I'm not getting the complete page source.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver = webdriver.Chrome(executable_path=r"C:\Python27\Scripts\chromedriver.exe")
driver.get('http://www.magicbricks.com/')
driver.find_element_by_id("buyTab").click()
time.sleep(5)
driver.find_element_by_id("keyword").send_keys("Navi Mumbai")
time.sleep(5)
driver.find_element_by_id("btnPropertySearch").click()
time.sleep(30)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "lxml")
print(soup.prettify())
The website is possibly blocking or restricting the user agent used by Selenium. An easy test is to change the user agent and see if that fixes it. More info at this question:
Change user agent for selenium driver
Quoting:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=whatever you want")
driver = webdriver.Chrome(options=opts)  # chrome_options= is deprecated in newer Selenium
Try something like:
import time
time.sleep(5)
content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
instead of driver.page_source.
Dynamic web pages often need to be rendered by JavaScript first, and executing a script against the live DOM returns what the browser has actually built.
I want to crawl a web site that contains an iframe.
See http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20150515001896. In the Chrome browser it offers two context-menu options: "view page source" and "view frame source".
But accessing the URL using BeautifulSoup, urllib2, or Selenium gives me only the page source, without the iframe content.
How can I access the iframe source that is visible in Chrome?
The code below is what I use to access the page source of that website.
from selenium import webdriver
import urllib2
from bs4 import BeautifulSoup
url = "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20150515001896"
# Plain HTTP fetch...
f = urllib2.urlopen(url)
# ...or a real browser session via Selenium
browser = webdriver.Chrome()
browser.get(url)
html_source = browser.page_source
# Both show only the outer page source, not the iframe's content
It was simply solved by accessing the URL below, which is the document that the iframe loads:
http://dart.fss.or.kr/report/viewer.do?rcpNo=20150515001896&dcmNo=4671059&eleId=17&offset=1015699&length=132786&dtd=dart3.xsd
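Alternatively, Selenium can switch its context into the frame and read the frame's source directly, without hunting down the inner URL. A minimal sketch, assuming the report body lives in the first frame on the page:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20150515001896")
# Switch the driver's context into the first frame on the page
# (assumption: the report document is in that frame)
browser.switch_to.frame(0)
# page_source now returns the HTML of the frame's document
print(browser.page_source)
# Return to the top-level document when done
browser.switch_to.default_content()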
I am using Python 2.7 with BeautifulSoup 4 and the Selenium WebDriver. In my web-automation script I open a URL to reach the home page, then click some anchor labels to navigate to other pages; that much works. Now, whenever I land on a new page, I need to get its URL from the browser so I can pass it to BeautifulSoup 4 for scraping. How can I get such URLs dynamically?
Any advice is appreciated!
You can get the current_url attribute from the driver:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://www.google.com')
print(browser.current_url)
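If the goal is to hand the rendered page to BeautifulSoup, note that you don't have to re-download the URL. A minimal sketch, parsing driver.page_source directly:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://www.google.com')
# current_url tells you where the navigation ended up...
print(browser.current_url)
# ...but the rendered HTML is already in the driver, so it can be
# parsed directly instead of fetching the URL a second time
soup = BeautifulSoup(browser.page_source, "html.parser")
print(soup.title)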