I'm trying to submit a dynamically generated user login form using Scrapy and then parse the HTML on the page that corresponds to a successful login.
I was wondering how I could do that with Scrapy, or with a combination of Scrapy and Selenium. Selenium makes it possible to find the element in the DOM, but is it possible to "give control back" to Scrapy after getting the full HTML, so that it can carry out the form submission and save the necessary cookies, session data, etc. in order to scrape the page?
Basically, the only reason I thought Selenium was necessary was that I need the page to finish rendering its JavaScript before Scrapy looks for the <form> element. Are there any alternatives to this, however?
Thank you!
Edit: This question is similar to this one, but unfortunately the accepted answer deals with the Requests library instead of Selenium or Scrapy. Though that scenario may be possible in some cases (watch this to learn more), as alecxe points out, Selenium may be required if "parts of the page [such as forms] are loaded via API calls and inserted into the page with the help of javascript code being executed in the browser".
Scrapy alone is not a great fit for the Coursera site, since it is extremely asynchronous: parts of the page are loaded via API calls and inserted into the page with the help of JavaScript code executed in the browser. Scrapy is not a browser and cannot handle that.
Which raises the point - why not use the publicly available Coursera API?
Aside from what is documented, there are other endpoints that you can see called in browser developer tools - you need to be authenticated to be able to use them. For example, if you are logged in, you can see the list of courses you've taken:
There is a call to the memberships.v1 endpoint.
For the sake of an example, let's start Selenium, log in, and grab the cookies with get_cookies(). Then, let's yield a Request to the memberships.v1 endpoint to get the list of archived courses, passing along the cookies we got from Selenium:
import json

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

LOGIN = 'email'
PASSWORD = 'password'


class CourseraSpider(scrapy.Spider):
    name = "courseraSpider"
    allowed_domains = ["coursera.org"]

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.maximize_window()
        self.driver.get('https://www.coursera.org/login')

        # Wait for the dynamically inserted login form to appear.
        form = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located(
            (By.XPATH, "//div[@data-js='login-body']//div[@data-js='facebook-button-divider']/following-sibling::form")))

        email = WebDriverWait(form, 10).until(EC.visibility_of_element_located((By.ID, 'user-modal-email')))
        email.send_keys(LOGIN)

        password = form.find_element_by_name('password')
        password.send_keys(PASSWORD)

        login = form.find_element_by_xpath('//button[. = "Log In"]')
        login.click()

        # Wait for the logged-in landing page before collecting cookies.
        WebDriverWait(self.driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h2[. = 'My Courses']")))

        self.driver.get('https://www.coursera.org/')
        cookies = self.driver.get_cookies()
        self.driver.close()

        courses_url = 'https://www.coursera.org/api/memberships.v1'
        params = {
            'fields': 'courseId,enrolledTimestamp,grade,id,lastAccessedTimestamp,role,v1SessionId,vc,vcMembershipId,courses.v1(display,partnerIds,photoUrl,specializations,startDate,v1Details),partners.v1(homeLink,name),v1Details.v1(sessionIds),v1Sessions.v1(active,dbEndDate,durationString,hasSigTrack,startDay,startMonth,startYear),specializations.v1(logo,name,partnerIds,shortName)&includes=courseId,vcMembershipId,courses.v1(partnerIds,specializations,v1Details),v1Details.v1(sessionIds),specializations.v1(partnerIds)',
            'q': 'me',
            'showHidden': 'false',
            'filter': 'archived'
        }
        params = '&'.join(key + '=' + value for key, value in params.items())

        yield scrapy.Request(courses_url + '?' + params, cookies=cookies)

    def parse(self, response):
        data = json.loads(response.body)
        for course in data['linked']['courses.v1']:
            print(course['name'])
For me, it prints:
Algorithms, Part I
Computing for Data Analysis
Pattern-Oriented Software Architectures for Concurrent and Networked Software
Computer Networks
This proves that we can give Scrapy the cookies from Selenium and successfully extract the data from the "for logged-in users only" pages.
Additionally, make sure you don't violate the rules from the Terms of Use, specifically:
In addition, as a condition of accessing the Sites, you agree not to
... (c) use any high-volume, automated or electronic means to access
the Sites (including without limitation, robots, spiders, scripts or
web-scraping tools);
I have recently started learning web scraping with Scrapy and as a practice, I decided to scrape a weather data table from this url.
By inspecting the table element of the page, I copied its XPath into my code, but I only got an empty list when running it. I tried to check which tables are present in the HTML using this code:
from scrapy import Selector
import requests
import pandas as pd
url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).text  # .text gives a str, which Selector(text=...) expects
sel = Selector(text=html)
table = sel.xpath('//table')
It only returns one table and it is not the one I wanted.
After some research, I found out that it might have something to do with JavaScript rendering in the page source code and that Python requests can't handle JavaScript.
After going through a number of SO Q&As, I came across the requests-html library, which can apparently handle JS execution, so I tried acquiring the table using this code snippet:
from requests_html import HTMLSession
from scrapy import Selector
session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')
resp.html.render()
html = resp.html.html
sel = Selector(text=html)
tables = sel.xpath('//table')
print(tables)
But the result doesn't change. How can I acquire that table?
Problem
Multiple problems may be at play here: not only JavaScript execution, but also HTML5 APIs, cookies, the user agent, and so on.
Solution
Consider using Selenium with a headless Chrome or Firefox web driver. Driving a real browser ensures that the page is loaded as intended. Headless mode lets you run your code without spawning a GUI browser; you can, of course, disable headless mode to watch what is being done to the page in real time, and even add a breakpoint so that you can debug in the browser's console in addition to pdb.
Example Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")
tables = driver.find_elements_by_xpath('//table')  # Several APIs are available for locating elements.
print(tables)
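Depending on how quickly the page builds its tables with JavaScript, you may also need an explicit wait before querying them. A small, optional addition, reusing the driver from the snippet above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 20 seconds for at least one <table> to appear in the DOM
# before reading the elements, since the page builds them with JavaScript.
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table")))
tables = driver.find_elements_by_xpath('//table')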
References
Selenium Github: https://github.com/SeleniumHQ/selenium
Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html
Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html
You can use the scrapy-splash plugin to make Scrapy work with Splash (Scrapinghub's JavaScript rendering service).
Using Splash, you can render JavaScript and also simulate user events such as mouse clicks.
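A minimal sketch of what that could look like for the weather page above, assuming Splash is running locally (e.g. via Docker) on port 8050 and the scrapy-splash middlewares are enabled in settings.py as described in the plugin's README; the spider name and the wait time are arbitrary:
import scrapy
from scrapy_splash import SplashRequest

# Required in settings.py (see the scrapy-splash README for the full list):
# SPLASH_URL = 'http://localhost:8050'

class WeatherSpider(scrapy.Spider):
    name = "weather"

    def start_requests(self):
        url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
        # 'wait' gives the page's JavaScript time to build the tables before rendering.
        yield SplashRequest(url, self.parse, args={'wait': 5})

    def parse(self, response):
        # The response now contains the JavaScript-rendered HTML.
        for table in response.xpath('//table'):
            self.log(table.get())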
I am trying to scrape a table but can't seem to make it visible.
The table is on this page after you expand the 'Code History' section in light purple. Login credentials are below but also easy to get from a trial account:
username = jd#mailinator.com
password = m%$)-Y95*^.1Gin+
The data I'm trying to get to is the bottom row of the 'Code History' table.
Here's the code I'm using:
from selenium import webdriver
driver_path = "path to chromedriver.exe"
url_login = "https://www.findacode.com/signin.html"
url_code = "https://www.findacode.com/code.php?set=CPT&c="
username = 'jd#mailinator.com'
password = 'm%$)-Y95*^.1Gin+'
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)
driver.get(url_login)
form = driver.find_element_by_name('login')
form.find_element_by_name('id').send_keys(username)
form.find_element_by_name('password').send_keys(password)
form.find_element_by_xpath("//input[#value='Sign In']").submit()
driver.get(url_code+'0001U')
driver.find_element_by_id('history').click()
At this point, when I look at driver.page_source, I expect the elements of the table to be visible, but that's not the case. Where's the flaw in my thinking?
This site loads fragments of the page only when they are needed (i.e. lazy loading), so the actual contents are loaded when that portion of the page is expanded. This also helps when your "trial" expires: the server can return generic content to prevent unauthorized access.
I can see 3 ways to remedy this:
Wait for the data to become available after the #history element is clicked and the content divs have loaded (the following .sectionbody div is no longer empty); see the sketch after this list.
Get the data from the fragment by directly calling the same URL after logging in, i.e. .get("https://www.findacode.com/logs/codepage_stats.php?section=sh_history_div&set=CPT&c=0001U")
Utilize their built-in auto-open feature by checking the appropriate checkbox once, then load all the data you expect normally in future requests.
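A rough sketch of the first option, picking up right after the click in your code. Note that the locator is a guess: it assumes the lazily loaded body is the .sectionbody div that follows the #history element.
from selenium.webdriver.support.ui import WebDriverWait

# After driver.find_element_by_id('history').click(), wait up to 10 seconds
# for the lazily loaded section body to receive its content. The XPath assumes
# the div following #history carries the 'sectionbody' class mentioned above.
WebDriverWait(driver, 10).until(
    lambda d: d.find_element_by_xpath(
        "//*[@id='history']/following-sibling::div[contains(@class, 'sectionbody')]"
    ).text.strip() != ""
)

print(driver.page_source)  # the table should now be in the page source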
Is it possible to catch the event when the url is changed inside my browser using selenium?
Here is my scenario:
I load my website test.com
After all the static files are loaded, when executing one of the js file, I am redirected (not sure how) to another page redirect-one.test.com/blah
My browser gets the url redirect-one.test.com/blah and gets a 307 response to go to redirect-two.test.com/blahblah
Here my browser receives a final 302 to go to final.test.com/
The page of final.test.com/ is loaded and at the end of this, selenium enables me to search for elements and so on...
I'd like to be able to intercept (and time the moment it happens) each time I am redirected.
After that, I still need to do some other steps for which selenium is more suitable:
Enter my username and password
Test some functionality
Log out
Here is a sample of how I tried to intercept the first redirect:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait

def url_contains(url):
    def check_contains_url(driver):
        return url in driver.current_url
    return check_contains_url

driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.FIREFOX)

driver.get("http://test.com/")

try:
    url = "redirect-one.test.com"
    first_redirect = WebDriverWait(driver, 20).until(url_contains(url))
    print("found first redirect")
finally:
    print("move on to the next redirect....")
Is this even possible using selenium?
I cannot change the behavior of the website and the reason it is built like this is because of an SSO mechanism I cannot bypass.
I realize I specified python but I am open to tools in other languages.
Selenium is not the tool for this. All the redirects that the browser encounters are handled by the browser in a way that Selenium does not allow you to check.
You can perform the checks using urllib2, or if you prefer a sane interface, using requests.
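For example, a rough sketch with requests; the test.com URLs are just the placeholders from your question, and this assumes the redirect chain is reachable with plain HTTP requests (if the first hop is only triggered by JavaScript, you would request redirect-one.test.com/blah directly instead):
import requests

# Follow the redirect chain and inspect every hop the server sends back.
response = requests.get("http://test.com/", allow_redirects=True)

for hop in response.history:  # the intermediate redirect responses, in order
    print(hop.status_code, hop.url, "->", hop.headers.get("Location"), hop.elapsed)

print("final:", response.status_code, response.url, response.elapsed)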
I'm trying to get all of the article links from a scraped search query; however, I don't seem to get any results.
Web page in question: http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=
My approach:
I'm trying to get the ids of the articles and then append them to the already known URL (http://www.seek.com.au/job/ + id); however, there are no ids in what I retrieve with requests (the Python package from http://docs.python-requests.org/en/latest/). In fact, there are no articles at all.
It seems that in this particular case I need to execute the scripts (the ones that generate the ids) in some way to get the full data. How could I do that?
Or maybe there are other ways to retrieve all of the results from this search query?
As mentioned, download Selenium. There are Python bindings.
Selenium is a web testing automation framework. In effect, by using Selenium you are remote-controlling a web browser. This is necessary because web browsers have JavaScript engines and DOMs, allowing AJAX to occur.
Using this test script (it assumes you have Firefox installed; Selenium supports other browsers if needed):
# Import 3rd Party libraries
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

class requester_firefox(object):
    def __init__(self):
        self.selenium_browser = webdriver.Firefox()
        self.selenium_browser.set_page_load_timeout(30)

    def __del__(self):
        self.selenium_browser.quit()
        self.selenium_browser = None

    def __call__(self, url):
        try:
            self.selenium_browser.get(url)
            the_page = self.selenium_browser.page_source
        except Exception:
            the_page = ""
        return the_page
test = requester_firefox()
print test("http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=").encode("ascii", "ignore")
It will load SEEK and wait for AJAX pages. The encode method is necessary (for me at least) because SEEK returns a unicode string which the Windows console seemingly can't print.
I am trying to scrape and interact with a site. Using BeautifulSoup, I can do MOST of what I want, but not all of it. Selenium should be able to handle that portion. I can get it working using the Selenium Firefox plugin; I just need to automate it now. My problem is that the area I need to interact with sits behind a login prompt, which is handled via an OpenID provider.
Fortunately, I was able to use this bookmarklet to get the cookie that is set: javascript:void(document.cookie=prompt(document.cookie,document.cookie)); This allows me to log in and parse the page using BeautifulSoup.
This is done via this code:
jar = cookielib.FileCookieJar("cookies")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders.append(("Cookie", "__cfduid=<hex string>; __utma=59652655.1231969161.1367166137.1368651910.1368660971.15; __utmz=59652655.1367166137.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=<a session id>; __utmb=59652655.1.10.1368660971; __utmc=59652655"))
page = opener.open(url).read()
soup = BeautifulSoup(page)
# ... parse stuff ...
At this point, the jar is empty, and I need to do the final interaction (clicking on a couple of DIV elements and verifying that another DIV has been updated appropriately). However, I need the above cookie to be available in a Selenium session so that I am logged in appropriately.
How can I move the above cookie into something that selenium knows and recognizes?
I've tried code like this:
for c in jar:
    driver.add_cookie({'name': c.name, 'value': c.value, 'path': '/', 'domain': c.domain})
But since the jar is empty, this doesn't work. Is there a way to put this cookie in the jar? Since I'm bypassing the OpenID login by using this cookie, I'm not receiving anything back from the server.
I think you might be approaching this backwards. Instead of passing a cookie to Selenium, why not perform the login with Selenium directly?
For example:
browser = webdriver.Firefox()
username = 'myusername'
password = 'mypassword'
browser.get('http://www.mywebsite.com/')
username_input = browser.find_element_by_id('username') #Using id only as an example
password_input = browser.find_element_by_id('password')
login_button = browser.find_element_by_id('login')
username_input.send_keys(username)
password_input.send_keys(password)
login_button.click()
This way you won't have to worry about manually collecting cookies.
From here, you can grab the page source and pass it to BeautifulSoup:
source = browser.page_source
soup = BeautifulSoup(source)
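If you later want to fetch pages outside the browser again (e.g. to keep parsing with BeautifulSoup the way your original code does), one option is to copy the cookies out of the Selenium session into a requests session. A rough sketch, with a placeholder URL:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
for cookie in browser.get_cookies():
    # Carry each Selenium cookie over into the requests session.
    session.cookies.set(cookie['name'], cookie['value'], domain=cookie.get('domain'))

page = session.get('http://www.mywebsite.com/some-protected-page')  # placeholder URL
soup = BeautifulSoup(page.text)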
I hope this helped.