I am trying to scrape a table but can't seem to make it visible.
The table is on this page after you expand the 'Code History' section in light purple. Login credentials are below but also easy to get from a trial account:
username = jd@mailinator.com
password = m%$)-Y95*^.1Gin+
Below's a graphic illustrating the data I'm trying to get to. I'm interested in the bottom row:
Here's the code I'm using:
from selenium import webdriver
driver_path = "path to chromedriver.exe"
url_login = "https://www.findacode.com/signin.html"
url_code = "https://www.findacode.com/code.php?set=CPT&c="
username = 'jd@mailinator.com'
password = 'm%$)-Y95*^.1Gin+'
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)
driver.get(url_login)
form = driver.find_element_by_name('login')
form.find_element_by_name('id').send_keys(username)
form.find_element_by_name('password').send_keys(password)
form.find_element_by_xpath("//input[@value='Sign In']").submit()
driver.get(url_code+'0001U')
driver.find_element_by_id('history').click()
At this point I expected elements of the table to be visible in driver.page_source, but that's not the case. Where's the flaw in my thinking?
This site loads fragments of the page only when they are needed (aka lazy loading), so the actual contents are fetched when that portion of the page is expanded. This also helps once your "trial" expires: the server can return generic content instead, preventing unauthorized access.
I can see 3 ways to remedy this:
1. Wait for the data to be available after #history.click(), i.e. until the content divs are loaded (the following .sectionbody div is no longer empty); see the sketch below.
2. Get the data from the fragment by directly calling the same URL after logging in, i.e. .get("https://www.findacode.com/logs/codepage_stats.php?section=sh_history_div&set=CPT&c=0001U")
3. Utilize their built-in auto-open feature by checking the appropriate checkbox once, then load all the data you expect normally in future requests.
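For illustration, here is a minimal sketch of options 1 and 2, continuing from the code in the question; the CSS selector for the section body is an assumption about the page structure and may need adjusting:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Option 1: expand the section, then wait until the lazily loaded rows
# actually show up before reading driver.page_source.
driver.find_element_by_id('history').click()
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, '#history + .sectionbody tr')  # selector is a guess
    )
)
html = driver.page_source  # the history table should now be present

# Option 2: after logging in, request the fragment URL directly.
driver.get('https://www.findacode.com/logs/codepage_stats.php'
           '?section=sh_history_div&set=CPT&c=0001U')
fragment_html = driver.page_source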
Related
I am currently using selenium to automate the input of data into a website. The website never changes, and the fields are always the same, with obviously only the data differing.
How I want it to work is for the user to already be logged in to the website, they run a script and a new tab opens in their current browser session with the relevant fields having the data in them.
At the moment it opens a new Chrome session (ignoring the login from the previous session), has to log in to the site, open a new tab, go to the data input page and push the keys from there. This can be a time-consuming activity, and I don't like how it has to log in each time. Snippet of my code below.
from flask import request, jsonify  # assuming this snippet runs inside a Flask route
from selenium import webdriver

req = request.get_json()
jsonify(req)
url1 = "www.loginpage.com"
driver = webdriver.Chrome(executable_path=r'chromedriver.exe')
driver.get(url1)
u = driver.find_element_by_id('username')
u.send_keys("username")
u = driver.find_element_by_id('password')
u.send_keys("password")
u = driver.find_element_by_id('loginButton').submit()
driver.execute_script('''window.open("www.datainputpage.com","_blank");''')
driver.switch_to_window(driver.window_handles[1])
driver.find_element_by_id('Field1').send_keys(req[0])
driver.find_element_by_id('Field2').send_keys(req[1])
driver.find_element_by_id('Field3').send_keys(req[2])
driver.find_element_by_id('Field4').send_keys(req[3])
Is there a way, using Python, to automate it as described: open a new tab in the current session and fill in the fields?
You can use profiles in Chrome. You specify the directory of your profile and all cookies and session data will be saved in there. The next time you run it, it should load those same cookies from your previous session and stay logged in.
chrome_options = Options()
chrome_options.add_argument("user-data-dir=selenium")
driver = webdriver.Chrome(executable_path=r'chromedriver.exe', chrome_options=chrome_options)
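Note that user-data-dir=selenium points Chrome at a profile directory (here a folder named "selenium" relative to the working directory) which is created on the first run; keep in mind that two Chrome instances cannot use the same profile directory at the same time, so close any other session using it before starting the driver.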
Another possible option is saving the cookies to a json file, then on the next run, load them and set them in the browser.
Selenium Cookies
Reading & Writing JSON
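As a rough sketch of the JSON approach (the file name and URL below are placeholders; adjust them to your site), you would dump driver.get_cookies() after a successful login and replay them on the next run:
import json
from selenium import webdriver

COOKIE_FILE = 'cookies.json'  # placeholder path

def save_cookies(driver, path=COOKIE_FILE):
    # Dump the current session's cookies to a JSON file.
    with open(path, 'w') as f:
        json.dump(driver.get_cookies(), f)

def load_cookies(driver, path=COOKIE_FILE):
    # The browser must already be on the site's domain before
    # add_cookie() will accept cookies for it.
    try:
        with open(path) as f:
            cookies = json.load(f)
    except FileNotFoundError:
        return  # first run: nothing saved yet
    for cookie in cookies:
        driver.add_cookie(cookie)

driver = webdriver.Chrome(executable_path=r'chromedriver.exe')
driver.get("https://www.loginpage.com")  # placeholder URL from the question
load_cookies(driver)   # restore the previous session, if any
driver.refresh()       # reload so the site sees the cookies
# ... do your work, then persist the (possibly refreshed) cookies:
save_cookies(driver)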
I am trying to automate a web data gathering process using Python. In my case, I need to pull the information from the https://app.ixml.com.br/documentos/nfe page. However, before you can go to this page, you need to log in at https://app.ixml.com.br/login. The code below should theoretically log into the site:
import re
from robobrowser import RoboBrowser
username = 'email'
password = 'password'
br = RoboBrowser()
br.open('https://app.ixml.com.br/login')
form = br.get_form()
form['email'] = username
form['senha'] = password
br.submit_form(form)
src = str(br.parsed())
However, by printing the src variable, I get the source code from the https://app.ixml.com.br/login page, i.e. before logging in. If I add the following lines at the end of the previous code:
br.open('https://app.ixml.com.br/documentos/nfe')
src2 = str(br.parsed())
The src2 variable contains the source code of the page https://app.ixml.com.br/. I tried some variations, such as creating a new br object, but got the same result. How can I access the information at https://app.ixml.com.br/documentos/nfe?
If it is OK to have a browser window open, you can try to solve this using Selenium. This package makes it possible to create a program that reacts just like a user would.
The following code would have you login:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://app.ixml.com.br/login")
browser.find_element_by_id("email").send_keys("abc#mail")
browser.find_element_by_id("senha").send_keys("abc")
browser.find_element_by_css_selector("button").click()
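Once logged in, a follow-up along these lines (a sketch; the fixed sleep is a crude stand-in for a proper WebDriverWait) would fetch the page you are actually after:
import time

time.sleep(5)  # crude wait for the post-login redirect to finish
browser.get("https://app.ixml.com.br/documentos/nfe")
html = browser.page_source  # hand this to BeautifulSoup, or write it to a file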
I have posted roughly 290 questions on a science forum that I would like to get back by downloading them with all the associated answers.
The first issue is that I have to be logged in to my personal space to see the list of all my messages. How do I circumvent this first barrier so that a shell script or a single wget command can retrieve all the URLs and their content? Can I pass a login and a password to wget so that I am logged in and redirected to the appropriate URL giving the list of all my messages?
Once this first issue is solved, the second issue is that I have to start from 6 different menu pages that all contain the titles and the links of the questions.
Moreover, for some of my questions, the answers and the discussions may span multiple pages.
So I wonder whether I can achieve this global download, knowing I would like to store the pages statically with the CSS also stored locally on my computer (to keep the same formatting in my browser when I consult them on my PC).
The URL of the first menu page of questions is below; I only see it once I am logged in to the website, which could also be an issue for downloading with wget.
An example of a URL containing the list of messages, once I am logged in, is:
https://forums.futura-sciences.com/search.php?searchid=22897684
The other pages (there are 6 or 7 pages of discussion titles in total appearing in the main menu) have the format:
"https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=2" (for page 2).
https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=5
(for page 5)
On each of these pages one can see the title and the link of each discussion that I would like to download along with its CSS (knowing each discussion may itself span multiple pages):
for example the first page of discussion "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps.html"
has page 2: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-2.html"
and page 3: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-3.html"
Naively, I tried to do all this with only one command (with the example URL from my personal space given at the beginning of the post, i.e. "https://forums.futura-sciences.com/search.php?searchid=22897684"):
wget -r --no-check-certificate --html-extension --convert-links "https://forums.futura-sciences.com/search.php?searchid=22897684"
but unfortunately, this command downloads all sorts of files, and maybe not even what I want, i.e. my discussions.
I don't know what approach to use: must I first store all the URLs in a file (with all the sub-pages containing all the answers and the whole discussion for each of my initial questions)?
After that, I could perhaps do wget -i all_URL_questions.txt. How can I carry out this operation?
Update
My issue needs a script; I tried the following things with Python:
1)
import urllib, urllib2, cookielib
username = 'USERNAME'
password = 'PASSWORD'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('https://forums.futura-sciences.com/login.php', login_data)
resp = opener.open('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
print resp.read()
But the page printed is not the home page of my personal space.
2)
import requests
# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'USERNAME',
    'inUserPass': 'PASSWORD'
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print p.text.encode('utf8')
    # An authorised request.
    r = s.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
    print r.text.encode('utf8')
Here too, this doesn't work.
3)
import requests
import bs4
site_url = 'https://forums.futura-sciences.com/login.php?do=login'
userid = 'USERNAME'
password = 'PASSWORD'
file_url = 'https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1'
o_file = 'abc.html'
# create session
s = requests.Session()
# GET request. This will generate cookie for you
s.get(site_url)
# login to site.
s.post(site_url, data={'vb_login_username': userid, 'vb_login_password': password})
# Next thing will be to visit URL for file you would like to download.
r = s.get(file_url)
# Download file
with open(o_file, 'wb') as output:
    output.write(r.content)
print(f"requests:: File {o_file} downloaded successfully!")
# Close session once all work done
s.close()
Same thing: the content is wrong.
4)
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
webdriver.get('https://forums.futura-sciences.com/')
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
browser = webdriver.Firefox()
browser.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
Still not able to log in with USERNAME and PASSWORD and get the content of the home page of my personal space.
5)
from selenium import webdriver
from selenium.webdriver.firefox.webdriver import FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
def MS_login(username, passwd):  # call this with username and password
    firefox_capabilities = DesiredCapabilities.FIREFOX
    firefox_capabilities['moz:webdriverClick'] = False
    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    fp = webdriver.FirefoxProfile()
    fp.set_preference("browser.download.folderList", 2)  # 0 means to download to the desktop, 1 means to download to the default "Downloads" directory, 2 means to use the directory
    fp.set_preference("browser.download.dir", "/Users/user/work_archives_futura/")
    driver.get('https://forums.futura-sciences.com/')  # change the url to your website
    time.sleep(5)  # wait for redirection and rendering
    driver.delete_all_cookies()  # clean up the prior login sessions
    driver.find_element_by_xpath("//input[@name='vb_login_username']").send_keys(username)
    elem = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='vb_login_password']")))
    elem.send_keys(Keys.TAB)
    driver.find_element_by_xpath("//input[@type='submit']").click()
    print("success !!!!")
    driver.close()  # close the browser
    return driver

if __name__ == '__main__':
    MS_login("USERNAME", "PASSWORD")
The window opens and the username is filled in, but it is impossible to fill in or submit the password and click on submit.
PS: the main issue could come from the fact that the password field has the display:none property, so I can't simulate a TAB to move to the password field and fill it in once I have entered the login.
It seems you're pretty knowledgeable already about scraping using the various methods. All that was missing were the correct field names in the post request.
I used the Chrome dev tools (F12, then go to the Network tab). With this open, if you log in and quickly stop the browser window from redirecting, you'll be able to see the full request to login.php and look at the fields, etc.
With that I was able to build this for you. It includes a nice dumping function for responses. To test that my code works, you can use your real password for the positive case and the bad-password line for the negative case.
import requests
import json
s = requests.Session()
def dumpResponseData(r, fileName):
    print(r.status_code)
    print(json.dumps(dict(r.headers), indent=1))
    cookieDict = s.cookies.get_dict()
    print(json.dumps(cookieDict, indent=1))
    outfile = open(fileName, mode="w")
    outfile.write(r.text)
    outfile.close()

username = "your-username"
password = "your-password"
# password = "bad password"

def step1():
    data = dict()
    data["do"] = "login"
    data["vb_login_md5password"] = ""
    data["vb_login_md5password_utf"] = ""
    data["s"] = ""
    data["securitytoken"] = "guest"
    data["url"] = "/search.php?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1"
    data["vb_login_username"] = username
    data["vb_login_password"] = password
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=data)
    # Logged In?
    if "vbseo_loggedin" in s.cookies.keys():
        print("Logged In!")
    else:
        print("Login Failed :(")

if __name__ == "__main__":
    step1()
I don't have any posts in my newly created Futura account so I can't really do any more testing for you - I don't want to spam their forum with garbage.
But I would probably start by requesting the post-search URL and scraping the links using bs4.
Then you could probably just use wget -r for each link you've scraped.
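As a rough sketch of that idea, reusing the authenticated session s from the snippet above (the '/archives/' filter is an assumption about how the thread URLs look and may need adjusting):
from bs4 import BeautifulSoup

search_url = ('https://forums.futura-sciences.com/search.php'
              '?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1')
r = s.get(search_url)
soup = BeautifulSoup(r.text, 'html.parser')

# Keep only links that look like discussion/archive pages.
links = {a['href'] for a in soup.find_all('a', href=True)
         if '/archives/' in a['href']}

with open('all_URL_questions.txt', 'w') as f:
    f.write('\n'.join(sorted(links)))
# Then, for example: wget -i all_URL_questions.txt --convert-links --page-requisites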
@Researcher is correct in their advice when it comes to the requests library. You are not posting all of the request params that the browser would send. Overall, I think it will be difficult to get requests to pull everything when you factor in static content and client-side JavaScript.
Your selenium code from section 4 has a few mistakes in it:
# yours
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
# should be
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.find_element_by_xpath("//input[#type='submit']").click()
You may need to fiddle with the xpath for the submit button.
Hint: you can debug along the way by taking screenshots:
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.get_screenshot_as_file('before_submit.png')
webdriver.find_element_by_xpath("//input[#type='submit']").click()
webdriver.get_screenshot_as_file('after_submit.png')
I'm trying to submit a dynamically generated user login form using Scrapy and then parse the HTML on the page that corresponds to a successful login.
I was wondering how I could do that with Scrapy or a combination of Scrapy and Selenium. Selenium makes it possible to find the element on the DOM, but I was wondering if it would be possible to "give control back" to Scrapy after getting the full HTML in order to allow it to carry out the form submission and save the necessary cookies, session data etc. in order to scrape the page.
Basically, the only reason I thought Selenium was necessary was because I needed the page to render from the Javascript before Scrapy looks for the <form> element. Are there any alternatives to this, however?
Thank you!
Edit: This question is similar to this one, but unfortunately the accepted answer deals with the Requests library instead of Selenium or Scrapy. Though that scenario may be possible in some cases (watch this to learn more), as alecxe points out, Selenium may be required if "parts of the page [such as forms] are loaded via API calls and inserted into the page with the help of javascript code being executed in the browser".
Scrapy is not actually a great fit for the coursera site since it is extremely asynchronous. Parts of the page are loaded via API calls and inserted into the page with the help of javascript code being executed in the browser. Scrapy is not a browser and cannot handle it.
Which raises the point - why not use the publicly available Coursera API?
Aside from what is documented, there are other endpoints that you can see called in browser developer tools - you need to be authenticated to be able to use them. For example, if you are logged in, you can see the list of courses you've taken:
There is a call to the memberships.v1 endpoint.
For the sake of an example, let's start selenium, log in and grab the cookies with get_cookies(). Then, let's yield a Request to memberships.v1 endpoint to get the list of archived courses providing the cookies we've got from selenium:
import json
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
LOGIN = 'email'
PASSWORD = 'password'
class CourseraSpider(scrapy.Spider):
    name = "courseraSpider"
    allowed_domains = ["coursera.org"]

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.maximize_window()
        self.driver.get('https://www.coursera.org/login')

        form = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@data-js='login-body']//div[@data-js='facebook-button-divider']/following-sibling::form")))
        email = WebDriverWait(form, 10).until(EC.visibility_of_element_located((By.ID, 'user-modal-email')))
        email.send_keys(LOGIN)

        password = form.find_element_by_name('password')
        password.send_keys(PASSWORD)

        login = form.find_element_by_xpath('//button[. = "Log In"]')
        login.click()

        WebDriverWait(self.driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h2[. = 'My Courses']")))

        self.driver.get('https://www.coursera.org/')
        cookies = self.driver.get_cookies()
        self.driver.close()

        courses_url = 'https://www.coursera.org/api/memberships.v1'
        params = {
            'fields': 'courseId,enrolledTimestamp,grade,id,lastAccessedTimestamp,role,v1SessionId,vc,vcMembershipId,courses.v1(display,partnerIds,photoUrl,specializations,startDate,v1Details),partners.v1(homeLink,name),v1Details.v1(sessionIds),v1Sessions.v1(active,dbEndDate,durationString,hasSigTrack,startDay,startMonth,startYear),specializations.v1(logo,name,partnerIds,shortName)&includes=courseId,vcMembershipId,courses.v1(partnerIds,specializations,v1Details),v1Details.v1(sessionIds),specializations.v1(partnerIds)',
            'q': 'me',
            'showHidden': 'false',
            'filter': 'archived'
        }
        params = '&'.join(key + '=' + value for key, value in params.iteritems())

        yield scrapy.Request(courses_url + '?' + params, cookies=cookies)

    def parse(self, response):
        data = json.loads(response.body)

        for course in data['linked']['courses.v1']:
            print course['name']
For me, it prints:
Algorithms, Part I
Computing for Data Analysis
Pattern-Oriented Software Architectures for Concurrent and Networked Software
Computer Networks
Which proves that we can give Scrapy the cookies from selenium and successfully extract the data from the "for logged in users only" pages.
Additionally, make sure you don't violate the rules from the Terms of Use, specifically:
In addition, as a condition of accessing the Sites, you agree not to
... (c) use any high-volume, automated or electronic means to access
the Sites (including without limitation, robots, spiders, scripts or
web-scraping tools);
I am trying to scrape and interact with a site. Using BeautifulSoup, I can do MOST of what I want, but not all of it. Selenium should be able to handle that portion. I can get it working using the Selenium Firefox Plugin. I just need to automate it now. My problem is that the area I need to interact with sits behind a login prompt, which is handled via an OpenID provider.
Fortunately, I was able to use this bookmarklet to get the cookie that is set: javascript:void(document.cookie=prompt(document.cookie,document.cookie)); This allows me to log in and parse the page using BeautifulSoup.
This is done via this code:
jar = cookielib.FileCookieJar("cookies")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders.append(("Cookie","__cfduid=<hex string>; __utma=59652655.1231969161.1367166137.1368651910.1368660971.15; __utmz=59652655.1367166137.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=<a session id>; __utmb=59652655.1.10.1368660971; __utmc=59652655"))
page = opener.open(url).read()
soup = BeautifulSoup(page)
...parse stuff...
At this point the jar is empty, and I need to do the final interaction (clicking on a couple of DIV elements and verifying that another DIV has been updated appropriately). However, I need the above cookie to be loaded into a selenium session so that I am logged in appropriately.
How can I move the above cookie into something that selenium knows and recognizes?
I've tried code like this
for c in jar:
    driver.add_cookie({'name': c.name, 'value': c.value, 'path': '/', 'domain': c.domain})
But, since the jar is empty, this doesn't work. Is there a way to put this cookie in the jar? Since I'm bypassing the OpenId login by using this cookie, I'm not receiving anything back from the server.
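For completeness, the kind of thing I imagine might work (a sketch; the placeholder values stand in for the real cookie string) is to parse that same cookie string and hand each pair to Selenium directly, though I am not sure this is the right approach:
raw_cookie = "PHPSESSID=<a session id>; __cfduid=<hex string>"  # same string used in addheaders above

driver.get(url)  # must be on the cookie's domain before add_cookie() will accept it
for pair in raw_cookie.split("; "):
    name, _, value = pair.partition("=")
    driver.add_cookie({'name': name, 'value': value, 'path': '/'})
driver.refresh()  # reload so the site sees the session cookie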
I think you might be approaching this backwards. Instead of passing a cookie to Selenium, why not perform the login with Selenium directly?
For example:
browser = webdriver.Firefox()
username = 'myusername'
password = 'mypassword'
browser.get('http://www.mywebsite.com/')
username_input = browser.find_element_by_id('username') #Using id only as an example
password_input = browser.find_element_by_id('password')
login_button = browser.find_element_by_id('login')
username_input.send_keys(username)
password_input.send_keys(password)
login_button.click()
This way you won't have to worry about manually collecting cookies.
From here, you can grab the page source and pass it to BeautifulSoup:
source = browser.page_source
soup = BeautifulSoup(source)
I hope this helped.