Python - Logging in to web scrape

I'm trying to web-scrape a page on www.roblox.com that requires me to be logged in. I have done this using the .ROBLOSECURITY cookie; however, that cookie changes every few days. I want to instead log in through the login form using Python. The form and what I have so far are below. I do NOT want to use any add-on libraries like mechanize or requests.
Form:
<form action="/newlogin" id="loginForm" method="post" novalidate="novalidate" _lpchecked="1"> <div id="loginarea" class="divider-bottom" data-is-captcha-on="False">
<div id="leftArea">
<div id="loginPanel">
<table id="logintable">
<tbody><tr id="username">
<td><label class="form-label" for="Username">Username:</label></td>
<td><input class="text-box text-box-medium valid" data-val="true" data-val-required="The Username field is required." id="Username" name="Username" type="text" value="" autocomplete="off" aria-required="true" aria-invalid="false" style="cursor: auto; background-image: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGP6zwAAAgcBApocMXEAAAAASUVORK5CYII=);"></td>
</tr>
<tr id="password">
<td><label class="form-label" for="Password">Password:</label></td>
<td><input class="text-box text-box-medium" data-val="true" data-val-required="The Password field is required." id="Password" name="Password" type="password" autocomplete="off" style="cursor: auto; background-image: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGP6zwAAAgcBApocMXEAAAAASUVORK5CYII=);"></td>
</tr>
</tbody></table>
<div>
</div>
<div>
<div id="forgotPasswordPanel">
<a class="text-link" href="/Login/ResetPasswordRequest.aspx" target="_blank">Forgot your password?</a>
</div>
<div id="signInButtonPanel" data-use-apiproxy-signin="False" data-sign-on-api-path="https://api.roblox.com/login/v1">
<a roblox-js-onclick="" class="btn-medium btn-neutral">Sign In</a>
<a roblox-js-oncancel="" class="btn-medium btn-negative">Cancel</a>
</div>
<div class="clearFloats">
</div>
</div>
<span id="fb-root">
<div id="SplashPageConnect" class="fbSplashPageConnect">
<a class="facebook-login" href="/Facebook/SignIn?returnTo=/home" ref="form-facebook">
<span class="left"></span>
<span class="middle">Login with Facebook<span>Login with Facebook</span></span>
<span class="right"></span>
</a>
</div>
</span>
</div>
</div>
<div id="rightArea" class="divider-left">
<div id="signUpPanel" class="FrontPageLoginBox">
<p class="text">Not a member?</p>
<h2>Sign Up to Build & Make Friends</h2>
Sign Up
^Don't know what that "Sign Up" thing is doing there; I can't delete it.
What I have so far:
import cookielib
import urllib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)
authentication_url = 'http://www.roblox.com/newlogin'
payload = {
    'ReturnUrl': 'http://www.roblox.com/home',
    'Username': 'usernamehere',
    'Password': 'passwordhere'
}
data = urllib.urlencode(payload)
req = urllib2.Request(authentication_url, data)
resp = urllib2.urlopen(req)
contents = resp.read()
print contents
I am very new to Python, so I don't know how much of this works. Please let me know what is wrong with my code; I only get the login page back when I print contents.
PS: The login page is HTTPS
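For reference, on Python 3 the urllib2/urllib pieces above live in urllib.request and urllib.parse. A minimal sketch of the same form-encoding step; the field names and URL are copied from the question, and whether the live site still accepts them is not guaranteed:

```python
# Python 3 sketch of the form-encoding step above. Field names and URL
# are taken from the question; the live site may differ.
import urllib.parse

payload = {
    'ReturnUrl': 'http://www.roblox.com/home',
    'Username': 'usernamehere',
    'Password': 'passwordhere',
}
# urlencode builds the application/x-www-form-urlencoded body; it must be
# encoded to bytes before being handed to urllib.request.Request.
data = urllib.parse.urlencode(payload).encode('ascii')
print(data)
```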

Solution from OP.
I finished the script myself with the code below:
import cookielib
import urllib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)
authentication_url = 'https://www.roblox.com/newlogin'
payload = {
    'username': 'YourUsernameHere',
    'password': 'YourPasswordHere',
    '': 'Log In',
}
data = urllib.urlencode(payload)
req = urllib2.Request(authentication_url, data)
resp = urllib2.urlopen(req)
PageYouWantToOpen = urllib2.urlopen("http://www.roblox.com/develop").read()

I made this class a few weeks ago using just urllib.request for some web scraping and automatic tab opening. It may help you out, or at least get you on the right path.
import urllib.request
class Log_in:
    def __init__(self, loginURL, username, password):
        self.loginURL = loginURL
        self.username = username
        self.password = password

    def log_in_to_site(self):
        auth_handler = urllib.request.HTTPBasicAuthHandler()
        auth_handler.add_password(realm=None,
                                  uri=self.loginURL,
                                  user=self.username,
                                  passwd=self.password)
        opener = urllib.request.build_opener(auth_handler)
        urllib.request.install_opener(opener)
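A quick usage sketch for the class above (reproduced here so the snippet is self-contained; the URL and credentials are placeholders). One caveat: log_in_to_site installs an HTTP Basic Auth handler globally, which only helps against sites that actually challenge with Basic Auth, not HTML form logins like the one in the question:

```python
import urllib.request

class Log_in:
    def __init__(self, loginURL, username, password):
        self.loginURL = loginURL
        self.username = username
        self.password = password

    def log_in_to_site(self):
        # Registers a global opener that answers HTTP 401 Basic Auth
        # challenges for self.loginURL with the stored credentials.
        auth_handler = urllib.request.HTTPBasicAuthHandler()
        auth_handler.add_password(realm=None,
                                  uri=self.loginURL,
                                  user=self.username,
                                  passwd=self.password)
        opener = urllib.request.build_opener(auth_handler)
        urllib.request.install_opener(opener)

# Hypothetical usage -- URL and credentials are made up:
session = Log_in('https://example.com/protected', 'alice', 'secret')
session.log_in_to_site()
# From here, urllib.request.urlopen() on that URL would retry with the
# stored credentials if the server responds 401 with a Basic challenge.
```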

Related

Login to website using BeautifulSoup cookies problem

I have checked many posts, but I can't find what I am doing wrong. I am trying to log in to a URL, but after the requests.post the printed result is still the sign-in page. Can you tell me what I am doing wrong?
Here is the HTML:
<div class="sign-in-section form-section">
<form class="simple_form form-vertical" novalidate="novalidate" action="/users/sign_in" accept-charset="UTF-8" method="post">
<input name="utf8" type="hidden" value="✓">
<input type="hidden" name="authenticity_token" value="TneAOk3yv/mEeuKsJucEN1YUSng7+EJc5YfBqKxwugv6gq2lxsZIjGecFwOK/jA0fYYF3aRb9ih15glcoHCWkg==">
<div class="form-group email optional user_email">
<input type="text" class="string email optional form-control" placeholder="Email" value="" name="user[email]" id="user_email">
</div>
<div class="password-and-submit-wrapper">
<div class="form-group password optional user_password">
<input class="password optional form-control" placeholder="Password" type="password" name="user[password]" id="user_password">
</div>
<div class="sign-in-button-and-forgot-password-link-wrapper">
<div class="forgot-password-link-wrapper">
<a class="forgot-password-link" href="/users/password/reset_email">Forgot password?</a>
</div>
<div class="sign-in-button">
<input type="submit" name="commit" value="Sign in" class=" submit elcurator-button">
</div>
</div>
</div>
</form>
</div>
And here is my code, which runs but doesn't take me to the after-login page:
# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
from lxml import html
login_url = "https://www.elcurator.net/users/sign_in"
url = "https://www.elcurator.net/shared_articles"
#Login
USERNAME = "some_email"
PASSWORD = "some_password"
def main():
    headers = {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36",
    }
    session_requests = requests.session()
    session_requests.headers.update(headers)
    result = session_requests.get(login_url, verify=False)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]
    # Create payload
    payload = {
        "user[email]": USERNAME,
        "user[password]": PASSWORD,
        "authenticity_token": authenticity_token
    }
    result = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))
    print(result)
    # Connect to the URL
    response = requests.get(url, verify=False)
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup)

if __name__ == '__main__':
    main()
The output shows "you have to login first" and other messages that let me know that I am not logged in.
What's wrong?
Thanks, it worked. I had to change these lines:
result = session_requests.post(login_url, data = payload, headers = dict(referer=login_url), cookies=result.cookies)
response = requests.get(url, verify=False,cookies=result.cookies)
@Danisotomy
You can try parsing the cookie from the sign_in call's response headers and sending it in every subsequent request.
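If you do want to pull the cookie out of the response headers by hand, the standard library's http.cookies.SimpleCookie can parse a Set-Cookie value. A sketch, where the header string is a made-up example standing in for what the sign_in response might carry:

```python
from http.cookies import SimpleCookie

# Made-up Set-Cookie header value for illustration:
set_cookie_header = '_session_id=abc123; Path=/; HttpOnly'

cookie = SimpleCookie()
cookie.load(set_cookie_header)

# Turn it into a {name: value} dict, the shape requests accepts via cookies=...
cookies = {name: morsel.value for name, morsel in cookie.items()}
print(cookies)  # {'_session_id': 'abc123'}
```

In practice a requests.Session carries cookies across requests automatically, so manual parsing like this is mainly useful for debugging what the server actually set.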

Can't log in to a form with requests

I think I am doing everything correctly, but the script does not log in to this simple form.
After logging in I use the get method to check whether I can see the user panel, but I always receive the index page, as if I were not logged in.
The username and password inputs are correct.
Any ideas?
import requests
url = 'http://streamcloud.eu/login.html'
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {
    'login': 'my_login',
    'password': 'my_password'
}
r = requests.session()
r.get(url)
login = r.post(url,data=payload,headers=headers)
result = r.get('http://streamcloud.eu/?op=my_account')
print(result.text)
You need to POST your form data to http://streamcloud.eu/. Additionally, pass a parameter called op with the value login to indicate that you want to log in. All of this can be found with a quick look at the HTML of the target website:
<form method="POST" action="http://streamcloud.eu/" class="proform" name="FL">
<input type="hidden" name="op" value="login">
<input type="hidden" name="redirect" value="http://streamcloud.eu/">
<input type="hidden" name="rand" value="">
<p>
<label>Benutzername:</label>
<input type="text" style="font-style: normal;" name="login" value="my_login" class="text_field">
</p>
<div class="clear"></div>
<p>
<label>Passwort:</label>
<input type="password" style="font-style: normal;" name="password" class="text_field">
</p>
<div class="clear"></div>
<div class="clear"></div>
<br>
<div>
<input type="submit" class="button blue medium" value="Senden">
</div>
<div class="clear"></div>
</form>
As you can see, the form posts its information to http://streamcloud.eu/:
<form method="POST" action="http://streamcloud.eu/" class="proform" name="FL">
Here you can see the hidden op parameter:
<input type="hidden" name="op" value="login">
Here is the updated code:
import requests
url = 'http://streamcloud.eu'
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {
    'op': 'login',
    'login': 'my_login',
    'password': 'my_password'
}
r = requests.session()
r.get(url)
login = r.post(url, data=payload, headers=headers)
result = r.get('http://streamcloud.eu/?op=my_account')
print(result.text)
After r.post(...) you will obtain a cookie (check login.headers['Set-Cookie']), so I guess you have to pass that cookie on to r.get(...).

python urllib form post

<div id="login-section">
<fieldset class="validation-group">
<table id="navLgnMember" cellspacing="0" cellpadding="0" style="border-collapse:collapse;">
<tr>
<td>
<div id="login-user">
<div class="input" id="username-wrapper">
<div class="loginfield-label">Number / ID / Email</div>
<div class="input-field-small float-left submit-on-enter"><div class="left"></div><input name="ctl00$ctl01$navLgnMember$Username" type="text" maxlength="80" id="Username" title="Username" class="center" style="width:85px;" /><div class="right"></div></div>
</div>
<div class="input" id="password-wrapper">
<div class="loginfield-label">
Password</div>
<div class="input-field-small float-left submit-on-enter"><div class="left"></div><input name="ctl00$ctl01$navLgnMember$Password" type="password" id="Password" title="Password" class="center" title="Password" style="width:85px;" /><div class="right"></div></div>
</div>
<div id="login-wrapper">
<input type="submit" name="ctl00$ctl01$navLgnMember$Login" value="" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ctl01$navLgnMember$Login", "", false, "", "https://tatts.com/tattersalls", false, false))" id="Login" class="button-login" />
</div>
How would one go about submitting to this form with urllib? The current code I have:
import cookielib
import urllib
import urllib2
# Store the cookies and create an opener that will hold them
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Add our headers
opener.addheaders = [('User-agent', 'RedditTesting')]
# Install our opener (note that this changes the global opener to the one
# we just made, but you can also just call opener.open() if you want)
urllib2.install_opener(opener)
# The action/ target from the form
authentication_url = 'https://tatts.com/tattersalls'
# Input parameters we are going to send
payload = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': '/wEPDwUKMTIwNzM2NDc5NQ9kFgICCBBkZBYCAgcPZBYGZg9kFgJmD2QWAmYPFgIeB1Zpc2libGVoZAIBD2QWAmYPZBYCZg8WAh8AaGQCAg9kFgJmD2QWAgIBD2QWAmYPZBYCZg9kFgYCAw9kFgICBw8WAh4FY2xhc3MFFmxhdGVzdFJlc3VsdHNCb2R5RGl2QmdkAgsPZBYCZg9kFgICBQ8WBB4JaW5uZXJodG1sBR8qRGl2IDEgJDFNIGZvciB1cCB0byA0IHdpbm5lcnMuHwBnZAIND2QWAmYPZBYCZg9kFgYCAQ8PFgIeBFRleHQFNVdobyB3b24gbW9uZXkgaW4gRHJvbWFuYT8gVGF0dHNMb3R0byBwcml6ZSB1bmNsYWltZWQhZGQCAw8PFgIfAwV5QSBkaXZpc2lvbiBvbmUgVGF0dHNMb3R0byBwcml6ZSBvZiAkODI5LDM2MS42OCB3YXMgd29uIGluIERyb21hbmEgaW4gU2F0dXJkYXkgbmlnaHTigJlzIGRyYXcgYnV0IHRoZSB3aW5uZXIgaXMgYSBteXN0ZXJ5IWRkAgUPDxYCHgtOYXZpZ2F0ZVVybAUbL3RhdHRlcnNhbGxzL3dpbm5lci1zdG9yaWVzZGRk40y89P1oSwLqvsMH4ZGTu9vsloo=',
    '__PREVIOUSPAGE': 'PnGXOHeTQRfdct4aw9jgJ_Padml1ip-t05LAdAWQvBe5-2i1ECm5zC0umv9-PrWPJIXsvg9OvNT2PNp99srtKpWlE4J-6Qp1mICoT3eP49RSXSmN6p_XiieWS68YpbKqyBaJrkmYbJpZwCBw0Wq3tSD3JUc1',
    '__EVENTVALIDATION': '/wEdAAfZmOrHFYG4x80t+WWbtymCH/lQNl+1rLkmSESnowgyHVo7o54PGpUOvQpde1IkKS5gFTlJ0qDsO6vsTob8l0lt1XHRKk5WhaA0Ow6IEfhsMPG0mcjlqyEi79A1gbm2y9z5Vxn3bdCWWa28kcUm81miXWvi1mwhfxiUpcDlmGDs/3LMo4Y=',
    'ctl00$ctl01$showUpgradeReminderHid': 'false',
    'ctl00$ctl01$navLgnMember$Username': 'x-tinct',
    'ctl00$ctl01$navLgnMember$Password': '########',
    'ctl00$ctl01$navLgnMember$Login': ''
}
# Use urllib to encode the payload
data = urllib.urlencode(payload)
# Build our Request object (supplying 'data' makes it a POST)
req = urllib2.Request(authentication_url, data)
# Make the request and read the response
resp = urllib2.urlopen(req)
contents = resp.read()
print contents
The code above is a fair way off submitting to the right part of the web form.
I'm trying to log in and create a session so I can then post further web-form data to other parts of the site.
Thanks in advance.
According to this other post from SO: Mechanize and Javascript, you have different options, from simulating in Python what the JavaScript is doing, to using the full-fledged Selenium with its Python bindings.
If you try to proceed the simple Python way, I would strongly urge you to use a network spy such as the excellent Wireshark to analyse what a successful login through a real browser actually receives and sends, compared with what your Python simulation sends.
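If you go the plain-Python route, one concrete step is to stop hard-coding __VIEWSTATE and friends: ASP.NET regenerates them, so they should be scraped from the login page on each run. A standard-library sketch (the sample HTML and values below are made up):

```python
import re

# Stand-in for the HTML returned by a GET of the login page:
page = '''
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA3fQ==" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEdAAc=" />
'''

def hidden_field(html, name):
    # Grab the value="..." of the input with the given name attribute.
    m = re.search(r'name="%s"[^>]*value="([^"]*)"' % re.escape(name), html)
    return m.group(1) if m else ''

payload = {
    '__VIEWSTATE': hidden_field(page, '__VIEWSTATE'),
    '__EVENTVALIDATION': hidden_field(page, '__EVENTVALIDATION'),
}
print(payload)
```

The freshly scraped values would then be merged into the login payload before the POST. A regex is fragile against unusual markup; an HTML parser is the more robust choice if the page allows it.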

Python 3 script for logging into a website using the Requests module

I'm trying to write some Python (3.3.2) code to log in to a website using the Requests module. Here is the form section of the login page:
<form method="post" action="https://www.ibvpn.com/billing/dologin.php" name="frmlogin">
<input type="hidden" name="token" value="236647d2da7c8408ceb78178ba03876ea1f2b687" />
<div class="logincontainer">
<fieldset>
<div class="clearfix">
<label for="username">Email Address:</label>
<div class="input">
<input class="xlarge" name="username" id="username" type="text" />
</div>
</div>
<div class="clearfix">
<label for="password">Password:</label>
<div class="input">
<input class="xlarge" name="password" id="password" type="password"/>
</div>
</div>
<div align="center">
<p>
<input type="checkbox" name="rememberme" /> Remember Me
</p>
<p>Request a Password Reset</p>
</div>
</fieldset>
</div>
<div class="actions">
<input type="submit" class="btn primary" value="Login" />
</div>
</form>
Here is my code, trying to deal with hidden input:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ibvpn.com/billing/clientarea.php'
body = {'username':'my email address','password':'my password'}
s = requests.Session()
loginPage = s.get(url)
soup = BeautifulSoup(loginPage.text)
hiddenInputs = soup.findAll(name = 'input', type = 'hidden')
for hidden in hiddenInputs:
    name = hidden['name']
    value = hidden['value']
    body[name] = value
r = s.post(url, data = body)
This just returns the login page. If I post my login data to the URL in the 'action' field, I get a 404 error.
I've seen other posts on StackExchange where automatic cookie handling doesn't seem to work, so I've also tried dealing with the cookies manually using:
cookies = dict(loginPage.cookies)
r = s.post(url, data = body, cookies = cookies)
But this also just returns the login page.
I don't know if this is related to the problem, but after I've run either variant of the code above, entering r.cookies returns <<class 'requests.cookies.RequestsCookieJar'>[]>
If anyone has any suggestions, I'd love to hear them.
You are loading the wrong URL. The form has an action attribute:
<form method="post" action="https://www.ibvpn.com/billing/dologin.php" name="frmlogin">
so you must post your login information to:
https://www.ibvpn.com/billing/dologin.php
instead of posting back to the login page. POST to soup.form['action']:
r = s.post(soup.form['action'], data=body)
Your code is handling cookies just fine; I can see that s.cookies holds a cookie after requesting the login form, for example.
If this still doesn't work (a 404 is returned), then the server is using additional techniques to tell scripts from real browsers, usually by inspecting the request headers. Look at your browser's headers and replicate them. It may just be the User-Agent header that they check, but the Accept-* headers and Referer can also play a role.
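As a concrete illustration of "replicate the browser headers", here is a hedged sketch that attaches typical browser-style headers to a request. The header values are examples copied from a generic browser, not values any particular site is known to check; with requests you would pass the same dict as headers= on the session or the individual call:

```python
import urllib.request

# Example browser-like headers; copy the real values from your own
# browser's developer tools for the site in question.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.ibvpn.com/billing/clientarea.php',
}

req = urllib.request.Request('https://www.ibvpn.com/billing/dologin.php',
                             headers=headers)
# Nothing is sent yet; urllib.request.urlopen(req) would perform the request.
# Note urllib normalizes stored header names, e.g. 'User-agent':
print(req.get_header('User-agent'))
```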

Login to webpage not working with Python requests module

I am trying to use the Python requests module to authenticate to a website and then retrieve some information from it. This is the login part of the page:
<div>
<label class="label-left" for="username"> … </label>
<input id="username" class="inputbox" type="text" size="18" alt="username" name="username"></input>
</div>
<div>
<label class="label-left" for="passwd"> … </label>
<input id="passwd" class="inputbox" type="password" alt="password" size="18" name="passwd"></input>
</div>
<div> … </div>
<div class="readon">
<input class="button" type="submit" value="Login" name="Submit"></input>
What I am doing now is:
payload = {
    'username': username,
    'passwd': password,
    'Submit': 'Login'
}

with requests.Session() as s:
    s.post(login, data=payload)
    ans = s.get(url)
    print ans.text
The problem is that I get the same login page, even after the authentication. The response code is 200, so everything should be ok. Am I missing something?
UPDATE
Thanks to the comment, I have analyzed the post requests and I've seen that there are some hidden parameters. Among them, there are some parameters whose values vary between different requests. For this reason, I am simply getting them with BeautifulSoup and then updating the payload of the post request as follows:
with requests.Session() as s:
    login_page = s.get(login)
    soup = BeautifulSoup(login_page.text)
    inputs = soup.findAll(name='input', type='hidden')
    for el in inputs:
        name = el['name']
        value = el['value']
        payload[name] = value
    s.post(login, data=payload)
    ans = s.get(url)
Nevertheless, I am still getting the login page. Could there be some other influencing elements?
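One thing worth double-checking is whether all hidden fields were actually captured. The same extraction can be sanity-checked without BeautifulSoup using the standard library's html.parser; if the two disagree, or a field is missing from both, it was probably injected by JavaScript after page load, which no static HTML parser will see. The sample HTML below is made up:

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> tags."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'input' and a.get('type') == 'hidden' and 'name' in a:
            self.fields[a['name']] = a.get('value', '')

# Made-up login page fragment:
sample = ('<input type="hidden" name="csrf_token" value="xyz789">'
          '<input type="text" name="username">')

collector = HiddenInputCollector()
collector.feed(sample)
print(collector.fields)  # {'csrf_token': 'xyz789'}
```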
