I am trying to scrape and interact with a site. Using BeautifulSoup, I can do most of what I want, but not all of it. Selenium should be able to handle the rest, and I can get it working with the Selenium Firefox plugin; I just need to automate it now. My problem is that the area I need to interact with sits behind a login prompt, which is handled via an OpenID provider.
Fortunately, I was able to use this bookmarklet to get the cookie that is set: javascript:void(document.cookie=prompt(document.cookie,document.cookie)); This allows me to log in and parse the page using BeautifulSoup.
This is done via this code:
jar = cookielib.FileCookieJar("cookies")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders.append(("Cookie","__cfduid=<hex string>; __utma=59652655.1231969161.1367166137.1368651910.1368660971.15; __utmz=59652655.1367166137.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=<a session id>; __utmb=59652655.1.10.1368660971; __utmc=59652655"))
page = opener.open(url).read()
soup = BeautifulSoup(page)
...parse stuff...
At this point, the jar is empty and I need to do the final interaction (clicking on a couple of DIV elements and verifying that another DIV has been updated appropriately). However, I need the above cookie to be loaded into a Selenium session so that I am logged in appropriately.
How can I move the above cookie into something that selenium knows and recognizes?
I've tried code like this:
for c in jar:
    driver.add_cookie({'name':c.name, 'value':c.value, 'path':'/', 'domain':c.domain})
But, since the jar is empty, this doesn't work. Is there a way to put this cookie in the jar? Since I'm bypassing the OpenID login by using this cookie, I'm not receiving anything back from the server.
I think you might be approaching this backwards. Instead of passing a cookie to Selenium, why not perform the login with Selenium directly?
For example:
browser = webdriver.Firefox()
username = 'myusername'
password = 'mypassword'
browser.get('http://www.mywebsite.com/')
username_input = browser.find_element_by_id('username') #Using id only as an example
password_input = browser.find_element_by_id('password')
login_button = browser.find_element_by_id('login')
username_input.send_keys(username)
password_input.send_keys(password)
login_button.click()
This way you won't have to worry about manually collecting cookies.
From here, you can grab the page source and pass it to BeautifulSoup:
source = browser.page_source
soup = BeautifulSoup(source)
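And if you later want to fetch some pages outside the browser again, you can copy Selenium's cookies into a requests session rather than building the Cookie header by hand. A rough sketch (using requests instead of the older urllib2/cookielib stack; the URL is a placeholder):
import requests
from bs4 import BeautifulSoup

session = requests.Session()
for c in browser.get_cookies():  # cookies Selenium holds after the login above
    session.cookies.set(c['name'], c['value'])

page = session.get('http://www.mywebsite.com/some/page').text  # placeholder URL
soup = BeautifulSoup(page)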
I hope this helped.
Related
I am trying to automate some processes on the site. At first I tried plain HTTP requests, but a captcha came back in response. Now I'm using seleniumrequests, and here's the problem: when I log in using Selenium tools only, everything works fine, but I can't add coupons on the site and confirm them.
from seleniumrequests import Firefox
driver = Firefox()
user = '000000'
password = '000000'
driver.get("https://1xstavka.ru/")
driver.find_element_by_id('curLoginForm').click()
driver.find_element_by_id('auth_id_email').send_keys(user)
driver.find_element_by_id('auth-form-password').send_keys(password)
driver.find_element_by_class_name('auth-button__text').click()
But if you use:
from seleniumrequests import Firefox
driver = Firefox()
driver.request('GET', 'https://1xstavka.ru')
The window opens for a second and immediately closes; a 200 response is received, but there are no cookies. It's the same with the POST requests with which I'm trying to automate the process: after the request is sent, the response is 200, but nothing happens on the site.
driver.request('POST', 'https://1xstavka.ru/user/auth', json=json)
Please tell me what is wrong or how I can solve this problem.
I am unable to access the URL given in the question, but for the captcha/coupons I created a loop with an interim stop function. This gives me the chance to input the captcha manually and then continue the loop.
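A minimal sketch of that pattern (the URL and selectors are placeholders, not the real site's):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://example.com/coupons')  # placeholder URL

for coupon in ['CODE-1', 'CODE-2']:  # whatever needs submitting
    driver.find_element_by_name('coupon').send_keys(coupon)  # placeholder selector
    # Interim stop: solve the captcha by hand in the open browser window,
    # then press Enter in the terminal so the loop continues.
    input('Solve the captcha in the browser, then press Enter to continue...')
    driver.find_element_by_name('submit').click()  # placeholder selector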
So I am trying to log in programmatically (Python) to https://www.datacamp.com/users/sign_in using my email and password.
I have tried two methods of login: one using the requests library and another using Selenium (code below). Both times I run into a [403] issue.
Could someone please help me log in programmatically to it?
Thank you!
Using the requests library:
import requests
r = requests.get("https://www.datacamp.com/users/sign_in")
r  # gives <response [403]>
Using the Selenium webdriver:
driver = webdriver.Chrome(executable_path=driver_path, options=option)
driver.get("https://www.datacamp.com/users/sign_in")
driver.find_element_by_id("user_email") # there is supposed to be form element with id=user_email for inputting email
An implicit wait, at least, should have worked, like this:
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
url = "https://www.datacamp.com/users/sign_in"
driver.get(url)
driver.find_element_by_id("user_email").send_keys("test#dsfdfs.com")
driver.find_element_by_css_selector("#new_user>button[type=button]").click()
BUT
The real issue is that the site uses anti-scraping software.
If you open the Console and go to the request itself, you'll see that the site blocks your connection even before you try to log in.
Here is similar question with different solutions: Can a website detect when you are using Selenium with chromedriver?
Not all answers will work for you, try different approaches suggested.
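For instance, one of the approaches commonly suggested in that thread (a sketch only, with no guarantee it gets past this particular site) is to start Chrome with its automation hints toned down:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome not to advertise that it is being automated
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)
driver.get("https://www.datacamp.com/users/sign_in")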
With Firefox you'll have the same issue (I've already checked).
You have to add a wait after driver.get("https://www.datacamp.com/users/sign_in") and before driver.find_element_by_id("user_email") to let the page load.
Try something like WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'user_email')))
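With the imports spelled out, that looks roughly like this (the 10-second timeout is arbitrary):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.datacamp.com/users/sign_in")
# Wait up to 10 seconds for the email field to be present before touching it
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'user_email')))
driver.find_element_by_id("user_email").send_keys("you@example.com")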
Was hoping someone could help me understand what's going on:
I'm using Selenium with Firefox browser to download a pdf (need Selenium to login to the corresponding website):
le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
time.sleep(5)
if le:
    pdf_link = le[0].get_attribute("href")
    browser.get(pdf_link)
The code does download the pdf, but after that just stays idle.
This seems to be related to the following browser settings:
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
If I disable the first, it doesn't hang but opens the pdf instead of downloading it. If I disable the second, a "Save As" pop-up window shows up. Could someone explain how to handle this?
For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then send a subsequent fetch via the Python requests library with the selenium cookies attached. More explanation below:
There are several ways to render a PDF via Firefox + Selenium. If you're using the most recent version of Firefox, it'll most likely render the PDF via pdf.js so you can view it inline. This isn't ideal because now we can't download the file.
You can disable pdf.js via Selenium options but this will likely lead to the issue in this question where the browser gets stuck. This might be because of an unknown MIME-Type but I'm not totally sure. (There's another StackOverflow answer that says this is also due to Firefox versions.)
However, we can bypass this by passing Selenium's cookie session to requests.session().
Here's a toy example:
import requests
from selenium import webdriver
pdf_url = "/url/to/some/file.pdf"
# setup driver with options
driver = webdriver.Firefox()  # add your own options/profile here
# do whatever you need to do to auth/login/click/etc.
# navigate to the PDF URL in case the PDF link issues a
# redirect because requests.session() does not persist cookies
driver.get(pdf_url)
# get the URL from Selenium
current_pdf_url = driver.current_url
# create a requests session
session = requests.session()
# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
    session.cookies.set(cookie["name"], cookie["value"])
# Note: If headers are also important, you'll need to use
# something like seleniumwire to get the headers from Selenium
# Finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)
# access the bytes response from the session
pdf_bytes = pdf_response.content
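From there, writing the bytes out as a local file is just (the filename is arbitrary):
with open("downloaded.pdf", "wb") as f:
    f.write(pdf_bytes)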
I highly recommend using seleniumwire over regular selenium because it extends Python Selenium to let you return headers, wait for requests to finish, use proxies, and much more.
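As a rough sketch of what that buys you (assuming the selenium-wire package is installed; the URL is a placeholder), the import is the only structural change, and the captured traffic shows up on the driver:
from seleniumwire import webdriver  # note: seleniumwire, not selenium

driver = webdriver.Firefox()
driver.get("https://example.com/")  # placeholder URL

# Every request the browser made is captured, with headers and response attached
for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code)
        # request.headers and request.response.headers are also available here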
I'm trying to automate a duolingo login with Selenium with the code posted below.
While everything seems to work as expected at first, I always get a "Wrong password" message on the website after the login button is clicked.
I have checked the password time and time again and even changed it to one without special characters, but still the login fails.
I have seen in other examples that there is sometimes an additional password input field; however, I cannot find one while inspecting the HTML.
What could I be missing ?
(Side note: I'm also open to a completely different solution without a webdriver since I really only want to get to the duolingo.com/learn page to scrape some data, but as of yet I haven't found an alternative way to login)
The code used:
from selenium import webdriver
from time import sleep
url = "https://www.duolingo.com/"
def login():
    driver = webdriver.Chrome()
    driver.get(url)
    sleep(2)
    hve_acnt_btn = driver.find_element_by_xpath("/html/body/div/div/div/span[1]/div/div[1]/div[2]/div/div[2]/a")
    hve_acnt_btn.click()
    sleep(2)
    email_input = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[2]/form/div[1]/div/label[1]/div/input")
    email_input.send_keys("email@email.com")
    sleep(2)
    pwd_input = driver.find_element_by_css_selector("input[type=password]")
    pwd_input.clear()
    pwd_input.send_keys("password")
    sleep(2)
    login_btn = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[2]/form/div[1]/button")
    login_btn.click()
    sleep(5)

login()
I couldn't post the website's html because of the character limit, so here is the link to the duolingo page: Duolingo
Switch to Firefox or a browser which does not tell the page that you are visiting it automated. See my earlier answer for a very similar issue here: https://stackoverflow.com/a/57778034/8375783
Long story short: When you start Chrome it will run with navigator.webdriver=true. You can check it in console. Pages can detect that flag and block login or other actions, hence the invalid login. This is a read-only flag set by the browser during startup.
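As a quick sanity check (nothing more), you can read the flag from Selenium itself:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.duolingo.com/")
print(driver.execute_script("return navigator.webdriver"))  # prints True under automated Chrome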
With Chrome I couldn't log in to Duolingo either. After I switched the driver to Firefox, the very same code just worked.
Also, if I may recommend, try to use XPath with attributes.
Instead of this:
hve_acnt_btn = driver.find_element_by_xpath("/html/body/div/div/div/span[1]/div/div[1]/div[2]/div/div[2]/a")
You can use:
hve_acnt_btn = driver.find_element_by_xpath('//*[@data-test="have-account"]')
Same goes for:
email_input = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[2]/form/div[1]/div/label[1]/div/input")
vs:
email_input = driver.find_element_by_xpath('//input[@data-test="email-input"]')
I have posted roughly 290 questions on a science forum, and I would like to get them back by downloading them with all the associated answers.
The first issue is that I have to be logged in to my personal space to see the list of all the messages. How do I get around this first barrier so that, with a shell script or a single wget command, I can retrieve all the URLs and their content? Can I pass wget a login and a password so that I am logged in and redirected to the appropriate URL listing all my messages?
Once this first issue is solved, the second issue is that I have to start from 6 different menu pages that all contain the titles and links of the questions.
Moreover, for some of my questions, the answers and discussions may span multiple pages.
So I wonder whether I can achieve this bulk download, knowing that I would like to store the pages statically, with the CSS also saved locally on my computer (to keep the same formatting in my browser when I consult them on my PC).
An example of a URL containing the list of messages, once I am logged in, is below (having to be logged in could itself be an issue for downloading with wget):
https://forums.futura-sciences.com/search.php?searchid=22897684
The other pages (there are 6 or 7 pages of discussion titles in total appearing in the main menu) have the format:
https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=2 (for page 2)
https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=5 (for page 5)
On each of these pages one can see the title and link of each discussion that I would like to download, with its CSS as well (knowing each discussion may itself span multiple pages):
for example the first page of discussion "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps.html"
has page 2: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-2.html"
and page 3: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-3.html"
Naively, I tried to do all this with only one command (using the example URL from my personal space given at the beginning of the post, i.e. "https://forums.futura-sciences.com/search.php?searchid=22897684"):
wget -r --no-check-certificate --html-extension --convert-links "https://forums.futura-sciences.com/search.php?searchid=22897684"
but unfortunately, this command downloads all the files, and maybe not even what I want, i.e. my discussions.
I don't know what approach to use: must I first store all the URLs in a file (with all the sub-pages containing all the answers and the whole discussion for each of my initial questions)?
And afterwards, I could maybe do a wget -i all_URL_questions.txt. How can I carry out this operation?
Update
My issue needs a script; I tried the following things with Python:
1)
import urllib, urllib2, cookielib
username = 'USERNAME'
password = 'PASSWORD'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('https://forums.futura-sciences.com/login.php', login_data)
resp = opener.open('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
print resp.read()
But the page printed is not the home page of my personal space.
2)
import requests
# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'USERNAME',
    'inUserPass': 'PASSWORD'
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print p.text.encode('utf8')
    # An authorised request.
    r = s.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
    print r.text.encode('utf8')
Here too, this doesn't work
3)
import requests
import bs4
site_url = 'https://forums.futura-sciences.com/login.php?do=login'
userid = 'USERNAME'
password = 'PASSWWORD'
file_url = 'https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1'
o_file = 'abc.html'
# create session
s = requests.Session()
# GET request. This will generate cookie for you
s.get(site_url)
# login to site.
s.post(site_url, data={'vb_login_username': userid, 'vb_login_password': password})
# Next thing will be to visit URL for file you would like to download.
r = s.get(file_url)
# Download file
with open(o_file, 'wb') as output:
    output.write(r.content)
print("requests:: File {} downloaded successfully!".format(o_file))
# Close session once all work done
s.close()
Same thing, the content is wrong
4)
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
webdriver.get('https://forums.futura-sciences.com/')
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
browser = webdriver.Firefox()
browser.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
Still not managing to log in with USERNAME and PASSWORD and get the content of the home page of my personal space.
5)
from selenium import webdriver
from selenium.webdriver.firefox.webdriver import FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
def MS_login(username, passwd):  # call this with username and password
    firefox_capabilities = DesiredCapabilities.FIREFOX
    firefox_capabilities['moz:webdriverClick'] = False
    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    fp = webdriver.FirefoxProfile()
    fp.set_preference("browser.download.folderList", 2)  # 0: desktop, 1: default "Downloads" directory, 2: the directory set below
    fp.set_preference("browser.download.dir", "/Users/user/work_archives_futura/")
    driver.get('https://forums.futura-sciences.com/')  # change the url to your website
    time.sleep(5)  # wait for redirection and rendering
    driver.delete_all_cookies()  # clean up the prior login sessions
    driver.find_element_by_xpath("//input[@name='vb_login_username']").send_keys(username)
    elem = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='vb_login_password']")))
    elem.send_keys(Keys.TAB)
    driver.find_element_by_xpath("//input[@type='submit']").click()
    print("success !!!!")
    driver.close()  # close the browser
    return driver

if __name__ == '__main__':
    MS_login("USERNAME", "PASSWORD")
The window opens fine and the username is filled in, but it is impossible to fill in the password or click on submit.
PS: the main issue could come from the fact that the password field has the display:none property, so I can't simulate the TAB operation to reach the password field and fill it in once I have entered the login.
It seems you're pretty knowledgeable already about scraping using the various methods. All that was missing were the correct field names in the post request.
I used the Chrome dev tools (F12, then go to the Network tab). With this open, if you log in and quickly stop the browser window from redirecting, you'll be able to see the full request to login.php and look at the fields, etc.
With that I was able to build this for you. It includes a nice dumping function for responses. To test that my code works you can use your real password for the positive case and the bad-password line for the negative case.
import requests
import json
s = requests.Session()
def dumpResponseData(r, fileName):
    print(r.status_code)
    print(json.dumps(dict(r.headers), indent=1))
    cookieDict = s.cookies.get_dict()
    print(json.dumps(cookieDict, indent=1))
    outfile = open(fileName, mode="w")
    outfile.write(r.text)
    outfile.close()

username = "your-username"
password = "your-password"
# password = "bad password"

def step1():
    data = dict()
    data["do"] = "login"
    data["vb_login_md5password"] = ""
    data["vb_login_md5password_utf"] = ""
    data["s"] = ""
    data["securitytoken"] = "guest"
    data["url"] = "/search.php?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1"
    data["vb_login_username"] = username
    data["vb_login_password"] = password
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=data)
    # Logged In?
    if "vbseo_loggedin" in s.cookies.keys():
        print("Logged In!")
    else:
        print("Login Failed :(")

if __name__ == "__main__":
    step1()
I don't have any posts in my newly created Futura account so I can't really do any more testing for you - I don't want to spam their forum with garbage.
But I would probably start by requesting the post-search URL and scraping the links using bs4.
Then you could probably just use wget -r for each link you've scraped.
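A rough sketch of that second step, reusing the logged-in session s from above and assuming the discussion links look like the /archives/... URLs in the question (the search URL may need refreshing, since searchid values expire):
from bs4 import BeautifulSoup

search_url = "https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1"
resp = s.get(search_url)  # s is the session logged in by step1()
soup = BeautifulSoup(resp.text, "html.parser")

# Collect links that look like discussion pages and save them for wget -i
links = set()
for a in soup.find_all("a", href=True):
    if "/archives/" in a["href"]:  # assumption: threads live under /archives/
        links.add(a["href"])

with open("all_URL_questions.txt", "w") as f:
    f.write("\n".join(sorted(links)))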
@Researcher is correct in their advice about the requests library. You are not posting all of the request params that the browser would send. Overall, I think it will be difficult to get requests to pull everything once you factor in static content and client-side JavaScript.
Your selenium code from section 4 has a few mistakes in it:
# yours
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
# should be
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.find_element_by_xpath("//input[#type='submit']").click()
You may need to fiddle with the xpath for the submit button.
Hint: you can debug along the way by taking screenshots:
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.get_screenshot_as_file('before_submit.png')
webdriver.find_element_by_xpath("//input[#type='submit']").click()
webdriver.get_screenshot_as_file('after_submit.png')