I am trying to get some comments off the car blog Jalopnik. They don't come with the web page initially; instead, the comments are retrieved with some JavaScript, and you only get the featured ones. I need all the comments, so I would click "All" (between "Featured" and "Start a New Discussion") to get them.
To automate this, I tried learning Selenium. I modified their script from PyPI, guessing that the code for clicking a link was link.click() and link = browser.find_element_by_xpath(...). It doesn't look like the "All" button (which displays all comments) was pressed.
Ultimately I'd like to download the HTML of that version to parse.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers/") # Load page
time.sleep(0.2)
link = browser.find_element_by_xpath("//a[@class='tc cn_showall']")
link.click()
browser.save_screenshot('screenie.png')
browser.close()
Using Firefox with the Firebug plugin, I browsed to http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers.
I then opened the Firebug console and clicked on ALL; it obligingly showed a single AJAX call to http://jalopnik.com/index.php?op=threadlist&post_id=5912009&mode=all&page=0&repliesmode=hide&nouser=true&selected_thread=null
Opening that url in a new window gets me the comment feed you are seeking.
More generally, if you substitute the appropriate article-ID into that url, you should be able to automate the process without Selenium.
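As a rough sketch of that substitution, the snippet below rebuilds the URL Firebug showed from an article ID, using only the standard library. The function names comments_url and fetch_comments are made up for illustration, and treating page as a pagination counter is an assumption; only the query string itself comes from the observed AJAX call.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://jalopnik.com/index.php"


def comments_url(post_id, page=0):
    """Build the comment-feed URL seen in Firebug for a given article ID."""
    params = {
        "op": "threadlist",
        "post_id": post_id,
        "mode": "all",
        "page": page,  # assumption: higher values return later pages
        "repliesmode": "hide",
        "nouser": "true",
        "selected_thread": "null",
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_comments(post_id, page=0):
    """Download one page of the comment feed as text (performs a network call)."""
    with urlopen(comments_url(post_id, page)) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 5912009 is the article ID taken from the URL in the question.
    print(comments_url(5912009))
```

The text returned by fetch_comments(5912009) can then be handed to whatever HTML parser you prefer.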
Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')
That is the code I have so far.
It seems like the HTML content I got from the inspector is different from what I got from BeautifulSoup.
My guess is that they are preventing me from getting their data as they detected I am not accessing the site with a browser. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time
path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are taken literally
# start web browser
browser=webdriver.Chrome(path)
#navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()
Update 2:(loaded with devtool)
Any content that a website loads after the initial page load is unavailable to BS4 with your current method. This is because the content is loaded by an AJAX call made from JavaScript, and the requests library cannot parse or run JS code.
To get it, you will have to look at something like Selenium, which controls a browser from Python or other languages. There is a separate driver for each browser, e.g. Firefox, Chrome, etc.
Personally I use chrome so the drivers can be found here...
https://chromedriver.chromium.org/downloads
Download the correct driver for your version of Chrome.
Install Selenium via pip.
Create a scrape.py file and put the driver in the same folder.
Then, to get the HTML string to use with BS4:
from selenium import webdriver
import time
# start web browser
browser=webdriver.Chrome()
#navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()
You should then be able to use the html variable with BS4
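A fixed time.sleep(2) can be too short on a slow connection and wasteful on a fast one. Selenium ships WebDriverWait for this purpose; the sketch below is a hand-rolled alternative that relies only on the driver's page_source attribute and polls until the content you care about shows up. The helper name wait_for_html and its is_ready callback are illustrative, not part of Selenium.

```python
import time


def wait_for_html(driver, is_ready, timeout=10.0, poll=0.5):
    """Return driver.page_source as soon as is_ready(html) is true.

    Raises TimeoutError if the condition is still false after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # re-read the current DOM on every pass
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("page content did not appear in time")
        time.sleep(poll)
```

With the script above, html = wait_for_html(browser, lambda h: "<table" in h) would stand in for the sleep and the page_source read, assuming a table is what you are waiting for.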
I'll actually turn my comment to an answer because it is a solution to your problem :
As others said, this page is loaded dynamically, but there are ways to retrieve the data without running JavaScript. In your case, you want to look at the "Network" tab of your dev tools and filter for "fetch" requests.
This could be particularly interesting for you :
You don't need selenium or beautifulsoup at all, you can just use requests and parse the json, if you are good enough ;)
Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the Selenium solution.
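For reference, here is roughly what that cURL command looks like in Python. This sketch uses the standard library's urllib rather than requests so it runs without extra packages; with requests the equivalent would be a single requests.get call with the same URL and headers. The function names are made up for illustration, and nothing is assumed about the shape of the JSON that comes back.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://api.commerce7.com/v1/product/for-web"


def build_request(collection_slug, tenant):
    """Build the same GET request as the cURL command, tenant header included."""
    url = API_URL + "?" + urlencode({"collectionSlug": collection_slug})
    # urllib stores the header name as "Tenant"; HTTP header names are case-insensitive.
    return Request(url, headers={"tenant": tenant})


def fetch_products(collection_slug="red-wines", tenant="one-stop-wine-shop"):
    """Send the request and decode the JSON body (performs a network call)."""
    with urlopen(build_request(collection_slug, tenant)) as response:
        return json.load(response)


if __name__ == "__main__":
    request = build_request("red-wines", "one-stop-wine-shop")
    print(request.full_url)
```

Calling fetch_products() returns the decoded JSON; inspect it once to see which keys hold the product data before writing any further code against it.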
I have recently started learning web scraping with Scrapy and as a practice, I decided to scrape a weather data table from this url.
By inspecting the table element of the page, I copy its XPath into my code but I only get an empty list when running the code. I tried to check which tables are present in the HTML using this code:
from scrapy import Selector
import requests
import pandas as pd
url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).content
sel = Selector(text=html)
table = sel.xpath('//table')
It only returns one table and it is not the one I wanted.
After some research, I found out that it might have something to do with JavaScript rendering in the page source code and that Python requests can't handle JavaScript.
After going through a number of SO Q&As, I came upon a certain requests-html library which can apparently handle JS execution so I tried acquiring the table using this code snippet:
from requests_html import HTMLSession
from scrapy import Selector
session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')
resp.html.render()
html = resp.html.html
sel = Selector(text=html)
tables = sel.xpath('//table')
print(tables)
But the result doesn't change. How can I acquire that table?
Problem
Multiple problems may be at play here—not only javascript execution, but HTML5 APIs, cookies, user agent, etc.
Solution
Consider using Selenium with headless Chrome or Firefox web driver. Using selenium with a web driver ensures that page will be loaded as intended. Headless mode ensures that you can run your code without spawning the GUI browser—you can, of course, disable headless mode to see what's being done to the page in realtime and even add a breakpoint so that you can debug beyond pdb in the browser's console.
Example Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")
# find_elements_by_xpath was deprecated and later removed in Selenium 4; use find_elements with a By locator.
tables = driver.find_elements(By.XPATH, '//table')  # Several other locator strategies are available.
print(tables)
driver.quit()  # Shut down the headless browser when done.
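Printing the located elements only shows WebElement objects, not the data. One way to get at the cell values is to take the rendered HTML from driver.page_source and parse it. The sketch below does that with the standard library's html.parser instead of Scrapy's Selector or pandas, purely to stay dependency-free; TableParser and extract_tables are illustrative names, and the code assumes well-formed markup with explicit closing tags, which a browser-serialized DOM normally has.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table>, including nested ones."""

    def __init__(self):
        super().__init__()
        self.tables = []  # every table found, in document order
        self._open = []   # stack of state dicts for tables still open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            rows = []
            self.tables.append(rows)
            self._open.append({"rows": rows, "row": None, "cell": None})
        elif self._open:
            state = self._open[-1]
            if tag == "tr":
                state["row"] = []
            elif tag in ("td", "th") and state["row"] is not None:
                state["cell"] = []

    def handle_data(self, data):
        # Text is credited to the innermost open table's current cell.
        if self._open and self._open[-1]["cell"] is not None:
            self._open[-1]["cell"].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        state = self._open[-1]
        if tag in ("td", "th") and state["cell"] is not None:
            # Collapse runs of whitespace inside the cell.
            state["row"].append(" ".join("".join(state["cell"]).split()))
            state["cell"] = None
        elif tag == "tr" and state["row"] is not None:
            state["rows"].append(state["row"])
            state["row"] = None
            state["cell"] = None
        elif tag == "table":
            self._open.pop()


def extract_tables(html):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Calling extract_tables(driver.page_source) while the driver is still open gives plain lists that you can filter for the table you want or pass on to pandas.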
References
Selenium Github: https://github.com/SeleniumHQ/selenium
Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html
Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html
You can use the scrapy-splash plugin to make Scrapy work with Splash (Scrapinghub's JavaScript browser).
Using Splash, you can render JavaScript and also trigger user events such as mouse clicks.
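To give an idea of what Splash does, here is a minimal sketch that talks to its HTTP API directly rather than going through the scrapy-splash plugin. It assumes a Splash instance is already running and reachable at http://localhost:8050 (that address and the helper names are assumptions for illustration); render.html is the endpoint that returns the page's HTML after its JavaScript has run.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url, wait=2.0, splash="http://localhost:8050"):
    """Build a request URL for Splash's render.html endpoint."""
    # `wait` is how many seconds Splash lets the page run before taking the HTML.
    query = urlencode({"url": page_url, "wait": wait})
    return splash.rstrip("/") + "/render.html?" + query


def fetch_rendered_html(page_url, wait=2.0, splash="http://localhost:8050"):
    """Ask Splash to load the page and return the rendered HTML (network call)."""
    with urlopen(splash_render_url(page_url, wait, splash)) as response:
        return response.read().decode("utf-8", errors="replace")
```

The string returned by fetch_rendered_html can be passed to Selector(text=...) just like the output of requests in the question.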
I have the following code to log in to a website.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")  # raw string so the backslashes are taken literally
driver.get ("https://examplesite.com")
driver.find_element_by_id("username").send_keys("MyUsername")
driver.find_element_by_id("password").send_keys("MyPassword")
I do some clicks on that homepage, and then a second page, https://secondpage.com/some/text, opens in a different tab. I need to do some automated testing on this second page, but if I try to work directly on the second page by changing this line in my code above
driver.get ("https://examplesite.com")
to this
driver.get ("https://secondpage.com/some/text")
I'm redirected to the first page, https://examplesite.com, to log in again.
I've tried passing the credentials directly in the get command like this:
driver.get ("https://MyUsername:MyPassword@secondpage.com/some/text")
but the same thing happens and I'm redirected to the login page.
Is there a way to run the script directly on the second page without needing to log in each time I test something?
Maybe have Selenium keep in memory that I'm already logged in?
Thanks for any help
Here is my problem: I'm trying to use Selenium to access a webpage. What is special about this page is that it redirects automatically (you open the page and, after a few seconds, it redirects to another page). When I use driver = webdriver.Firefox(), my IDM catches that link perfectly after a few seconds.
Because I don't want the browser window to come up, I use PhantomJS instead, but it is not working. My application can only get the loading page URL (bitdl-1336...) but not the redirected link. Please help!
This is my code:
link = 'http://torrent.ajee.sh/hash.php?hash=' + self.global_hash_code
driver = webdriver.PhantomJS('phantomjs.exe')
driver.get(str(link))
element = driver.find_element_by_link_text('Download Zip')
element.click()
time.sleep(10)
msg = QMessageBox.information(self, QString('Thành công'),QString(driver.current_url))
And this is the result:
Please help!
Sorry about my English.
Not exactly an answer to your PhantomJS-specific question, but a workaround to the problem.
And because i don't want the browser to come up so i use Phantomjs instead
You can continue using Firefox, but start it in a Virtual Display, see more information at:
How do I run Selenium in Xvfb?
You may also need to let the browser automatically save the archive in a specified directory, see:
How do I automatically download files from a pop up dialog using selenium-python
Access to file download dialog in Firefox
I'm trying to use PhantomJS to write a scraper but even the example in the documentation of morph.io is not working. I guess the problem is "https", I tested it with http and it is working. Can you please give me a solution?
I tested it using firefox and it works.
from splinter import Browser
with Browser("phantomjs") as browser:
    # Optional, but make sure large enough that responsive pages don't
    # hide elements on you...
    browser.driver.set_window_size(1280, 1024)
    # Open the page you want...
    browser.visit("https://morph.io")
    # submit the search form...
    browser.fill("q", "parliament")
    button = browser.find_by_css("button[type='submit']")
    button.click()
    # Scrape the data you like...
    links = browser.find_by_css(".search-results .list-group-item")
    for link in links:
        print(link['href'])
PhantomJS is not working on https urls?
Splinter uses the Selenium WebDriver bindings (example) for Python under the hood, so you can simply pass the necessary options like this:
with Browser("phantomjs", service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any']) as browser:
    ...
See PhantomJS failing to open HTTPS site for why those options might be necessary. Take a look at the PhantomJS commandline interface for more options.