log in to webpage with python to scrape data


I am trying to build a web scraper to extract my stats data from MWO Mercs. To do so, I need to log in to the page and then go through the 6 different stats pages to get the data (this will go into a database later, but that is not my question).
The login form is given below (from https://mwomercs.com/login?return=/profile/stats?type=mech). From what I can see, two fields, EMAIL and PASSWORD, need to be filled in and posted. It should then open http://mwomercs.com/profile/stats?type=mech . After that I need to keep a session open to cycle through the various stats pages.
I have tried using urllib, mechanize and requests but I have been totally unable to find the right answer - I would prefer to use requests.
I do realise that similar questions have been asked in stackoverflow but I have searched for a very long time with no success.
Thank you for any help that could be provided
<div id="stubPage">
<div class="container">
<h1 id="stubPageTitle">LOGIN</h1>
<div id="loginForm">
<form action="/do/login" method="post">
<legend>MechWarrior Online REGISTER</legend>
<label>Email Address:</label>
<div class="input-prepend"><span class="add-on textColorBlack textPlain">@</span><input id="email" name="email" class="span4" size="16" type="text" placeholder="user@example.org"></div>
<label>Password:</label>
<div class="input-prepend"><span class="add-on"><span class="icon-lock"></span></span><input id="password" name="password" class="span4" size="16" type="password"></div>
<br>
<button type="submit" class="btn btn-large btn-block btn-primary">LOGIN</button>
<br>
<span class="pull-right">[ Forgot Your Password? ]</span>
<br>
<input type="hidden" name="return" value="/profile/stats?type=mech">
</form>
</div>
</div>
</div>

The Requests documentation is very simple and easy to follow when it comes to submitting form data. Please give this a read-through: More Complicated POST requests
Logins usually come down to saving the cookie and sending it with future requests.
After you POST to the login page with requests.post(), use the response object to retrieve the cookies. This is one way to do it:
import requests
login_url = 'https://mwomercs.com/do/login'  # the form's action, resolved against the site
stats_url = 'http://mwomercs.com/profile/stats?type=mech'
post_headers = {'content-type': 'application/x-www-form-urlencoded'}
# Keys must match the form's name attributes: email, password and the hidden return field
payload = {'email': email, 'password': password, 'return': '/profile/stats?type=mech'}
login_request = requests.post(login_url, data=payload, headers=post_headers)
cookie_dict = login_request.cookies.get_dict()
stats_request = requests.get(stats_url, cookies=cookie_dict)
If you still have problems, check the return code with login_request.status_code, or look for an error message in the page content with login_request.text.
Edit:
Some sites will redirect you several times when you make a request. Check the history attribute of the response to see what happened and why you got bounced out. For example, I get redirects like this all of the time:
>>> some_request.history
(<Response [302]>, <Response [302]>)
Each item in history is the Response from one redirect hop. You can inspect them like normal response objects, for example some_request.history[0].url, and you can disable redirects by passing allow_redirects=False in your request parameters:
login_request = requests.post(login_url, data=payload, headers=post_headers, allow_redirects=False)
In some cases, I've had to disallow redirects and add new cookies before progressing to the proper page. Try using something like this to keep your existing cookies and add the new cookies to it:
cookie_dict.update(new_request.cookies.get_dict())  # works on Python 2 and 3; newer values overwrite older ones
Doing this after each request will keep your cookies up-to-date for your next request, similar to how your browser would.
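A requests.Session does this bookkeeping for you: cookies set by any response, including intermediate redirects, are stored on the session and sent with later requests. The following is a minimal sketch rather than a tested login for this site; it assumes the site only needs the form fields shown in the question, and login_and_fetch and its parameter names are hypothetical, not part of Requests.

```python
def login_and_fetch(session, login_url, payload, page_urls):
    """Log in once, then fetch each page with the same session.

    `session` is expected to behave like requests.Session: it keeps the
    cookies set by the login response and sends them on later requests.
    """
    login_response = session.post(login_url, data=payload)
    login_response.raise_for_status()  # stop early on a 4xx/5xx reply

    pages = {}
    for url in page_urls:
        response = session.get(url)
        response.raise_for_status()
        pages[url] = response.text  # HTML of each stats page, keyed by URL
    return pages
```

With the form from the question, this could be called with requests.Session() as the session, 'https://mwomercs.com/do/login' as login_url, a payload holding the email, password and return fields, and a list of the stats page URLs.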

Related

Failing to login to website using python's requests library

Edit: I believe the issue is that when opening the website for the first time, you must click the Accept button on an "Our privacy policy has been updated..." notice. I'm now looking into how I might "click" this button using Python requests, but the button calls a JavaScript function, so I'm not sure it's possible with just the requests library.
I am attempting to write a script for this website: https://rocket-league.com/
I need to be logged in to perform my tasks, but I'm having trouble logging in.
I feel as though I've accounted for all the correct params. I fetch the csrf_token dynamically using regex. Maybe I need to do something with the cookies, but I'm not sure what?
This is my first attempt at writing a script that interacts with websites so I'm sorry if I'm naive.
I'd be grateful for any insights that aren't just a disguised way of telling me I'm dumb and/or lazy.
Here is my code so far:
import requests
import re
payload = {
    'csrf_token': '',
    'email': 'myemail',
    'password': 'mypassword',
    'submit': 'Login'
}
url = 'https://rocket-league.com/login'
with requests.Session() as s:
    r = s.get(url)
    m = re.search("<input type='hidden' name='csrf_token' value='(.+)'", r.text)
    if m: payload['csrf_token'] = m.group(1)
    else: print("couldnt find csrf_token")
    p = s.post('https://rocket-league.com/functions/login.php', data=payload)
    print(p.text)
Running this code prints out HTML that still has the login form in it which means I am not being logged in.
When I login from my browser with the developer tools > network open, I get this information for my post request:
Headers:
Request URL: https://rocket-league.com/functions/login.php
Request method: POST
Status code: 302 Found
Version HTTP/2.0
Cookies:
__cfduid: d2c83b2c9ad728195366656a56592f6d71549577451
acceptedPrivacyPolicy: 2.0
euconsent: BObnsHAObnsHAABABAENCF-AAAAkF7_______9______9uz_Ov_v_f__33e8__9v_l_7_-___u_-33d4-_1vf99yfm1-7ftr3tp_87ues2_Xur__59__3z3_tphPhA
fantasy_rlcs_id: hA8fga9ghIgaFGA9
PHPSESSID: lrse3tgov95eg574sqga6la9jc
Params:
csrf_token: 3f2588113e8921f52dc3eb78e51246a1
email: myemail
password: mypassword
submit: Login
Here is the actual HTML for the login form:
<form class="rlg-form" method="post" action="/functions/login.php">
<input type="hidden" name="csrf_token" value="3accad82ad0957cab634f805a7e28beb">
<input class="rlg-input" type="email" name="email" placeholder="Email" required="">
<input class="rlg-input" type="password" name="password" placeholder="Password" autocomplete="off" required="">
<fieldset class="rlg-checkbox">
<input type="checkbox" name="rememberme" id="rememberme-login">
<label for="rememberme-login">Remember me?</label>
</fieldset>
<input class="rlg-btn-primary" type="submit" name="submit" value="Login">
</form>
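One thing worth ruling out: the regex in the script expects single-quoted attributes, while the form shown here uses double quotes, so the token lookup may depend on how the server happens to quote its HTML. The sketch below accepts either quote style. It assumes the name attribute comes before value, as in the form above, and find_csrf_token is a hypothetical helper name.

```python
import re

# Accept either quote style around attribute values. Assumes `name` appears
# before `value` inside the tag, as in the form shown above.
CSRF_PATTERN = re.compile(
    r"""<input[^>]*\bname=["']csrf_token["'][^>]*\bvalue=["']([^"']+)["']"""
)

def find_csrf_token(html):
    """Return the csrf_token value from the page source, or None if absent."""
    match = CSRF_PATTERN.search(html)
    return match.group(1) if match else None
```

Whether the login then succeeds may still depend on the cookies listed above. Presetting acceptedPrivacyPolicy on the session is one thing to try, but that is untested here.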

python autofill form in a webpage

I am trying to fill in a form on a webpage that has a single text box and a send button. The HTML looks like this:
<form class="form-horizontal">
<div class="row">
<div class="col-md-12">
<div id="TextContainer" class="textarea-container">
<textarea id="Text" rows="5" maxlength="700" class="form-control remove-border" style="background:none;"></textarea>
</div><button id="Send" class="btn btn-primary-outline" type="button" onclick="SendMessage()" style="margin-top:10px" data-loading-text="Loading..."><span class="icon icon-pencil"></span> Send</button>
</div>
</div>
</form>
I tried to use mechanize to submit the form with this code
import re
from mechanize import Browser
br = Browser()
response=br.open("https://abcd.com/")
for f in br.forms():
    if f.attrs['class'] == 'form-horizontal':
        br.form = f
text = br.form.find_control(id="Text")
text.value = "something"
br.submit()
The code runs without an error, but no submission happens. How do I do it?
Here is the SendMessage function
function SendMessage() {
var text = $('#Text').val();
var userId = $('#RecipientId').val();
if (text.trim() === "")
{
$('#TextContainer').css('border-color', 'red');
}
else if (new RegExp("([a-zA-Z0-9]+://)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?([a-zA-Z0-9.-]+\\.[A-Za-z]{2,4})(:[0-9]+)?(/.*)?").test(text))
{
$('#TextContainer').css('border-color', 'red');
$('#message').html("Links are not allowed in messages");
}
else
{
$('#Send').button('loading');
$.ajax(
{
url: '/Messages/SendMessage',
type: 'POST',
cache: false,
data:
{
__RequestVerificationToken: $('<input name="__RequestVerificationToken" type="hidden" value="CfDJ8MQSRebrM95Pv2f7WNJmKQWGnVR66zie_VVqFsquOCZLDuYRRBPP1yzk_755VDntlD3u0L3P-YYR0-Aqqh1qIjd09HrBg8GNiN_AU48MMlrOtUKDyJyYCJrD918coQPG0dmgkLR3W85gV6P4zObdEMw" />').attr('value'),
userId: userId,
text: text
}
});
}
}

I suspect the issue is that the submit button in the HTML form is not of type=submit, so mechanize won't know what to do when you call br.submit(). The fix is to either change the button type in the page's HTML, or tell the Browser which button to use for submitting the form:
br.submit(type='button', id='Send')
The submit method takes the same arguments as the HTML Forms API, so I recommend taking a look at the documentation for more details.
Update
The problem here seems to be the JavaScript method attached to the button. Mechanize does not support calling JavaScript functions, hence you won't be able to just use the .submit() method to submit the form. Instead, the best option would probably be to read in the SendMessage() JavaScript function, which gets called if someone clicks on the Send button, and translate it to Python manually. In the best case it consists of a simple AJAX POST request which is very easy to implement in Python. Please look here for a related question.
Second Update
Given the new information in your question, in particular the JavaScript function, you can now manually implement the POST request inside your Python script. I suggest the use of the Requests module which will make the implementation much easier.
import requests
data = {
"__RequestVerificationToken": "CfDJ8MQSRebrM95Pv2f7WNJmKQWGnVR66zie_VVqFsquOCZLDuYRRBPP1yzk_755VDntlD3u0L3P-YYR0-Aqqh1qIjd09HrBg8GNiN_AU48MMlrOtUKDyJyYCJrD918coQPG0dmgkLR3W85gV6P4zObdEMw",
"userId": "something",
"text": "something else"
}
response = requests.post("https://example.com/Messages/SendMessage", data=data)
response now holds the server's reply, which you can use to check whether the request succeeded. Note that you might need to read the __RequestVerificationToken with mechanize, as I suspect it is generated each time you open the website. You could read the HTML source with html_source = br.response().read() and then search it for __RequestVerificationToken to extract the corresponding value.
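Because the POST bypasses the page's script, the client-side checks in SendMessage() no longer run. They can be mirrored in Python before posting. This is a rough translation under a few assumptions: the "@" in the pattern is assumed to be what the original script uses, the "Message is empty" wording is made up (the script only turns the border red), and validate_message and build_payload are hypothetical names.

```python
import re

# Python translation of the link check in SendMessage(). The "@" after the
# optional user:password part is an assumption about the original pattern.
LINK_PATTERN = re.compile(
    r"([a-zA-Z0-9]+://)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?"
    r"([a-zA-Z0-9.-]+\.[A-Za-z]{2,4})(:[0-9]+)?(/.*)?"
)

def validate_message(text):
    """Return an error string if the page's script would reject `text`, else None."""
    if text.strip() == "":
        return "Message is empty"  # hypothetical wording; the script shows no text here
    if LINK_PATTERN.search(text):  # RegExp.test() matches anywhere, like re.search
        return "Links are not allowed in messages"
    return None

def build_payload(token, user_id, text):
    """Build the form fields that the $.ajax call sends to /Messages/SendMessage."""
    return {"__RequestVerificationToken": token, "userId": user_id, "text": text}
```

The dictionary from build_payload can be passed as data= to requests.post once validate_message returns None.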

You can give name attribute to your text area like:
<form class="form-horizontal">
<div class="row">
<div class="col-md-12">
<div id="TextContainer" class="textarea-container">
<textarea id="Text" name="sometext" rows="5" maxlength="700" class="form-control remove-border" style="background:none;"></textarea>
</div><button id="Send" class="btn btn-primary-outline" type="button" onclick="SendMessage()" style="margin-top:10px" data-loading-text="Loading..."><span class="icon icon-pencil"></span> Send</button>
</div>
</div>
</form>
Then try this out:
import mechanize
br = mechanize.Browser()
br.open("https://abcd.com/")
br.select_form(nr=0)  # with a single form on the page, nr=0 selects it
br["sometext"] = "something"
response = br.submit()
print(response.read())
If the form is submitted successfully, you can then read the response body.

Login to https website using Python

I'm new to posting on stackoverflow so please don't bite! I had to resort to making an account and asking for help to avoid banging my head on the table any longer...
I'm trying to login to the following website https://account.socialbakers.com/login using the requests module in python. It seems as if the requests module is the place to go but the session.post() function isn't working for me. I can't tell if there is something unique about this type of form or the fact the website is https://
The login form is the following:
<form action="/login" id="login-form" method="post" novalidate="">
<big class="error-message">
<big>
<strong>
</strong>
</big>
</big>
<div class="item-full">
<label for="">
<span class="label-header">
<span>
Your e-mail address
</span>
</span>
<input id="email" name="email" type="email"/>
</label>
</div>
<div class="item-list">
<div class="item-big">
<label for="">
<span class="label-header">
<span>
Password
</span>
</span>
<input id="password" name="password" type="password"/>
</label>
</div>
<div class="item-small">
<button class="btn btn-green" type="submit">
Login
</button>
</div>
</div>
<p>
<a href="/email/reset-password">
<strong>
Lost password?
</strong>
</a>
</p>
</form>
Based on the following post How to "log in" to a website using Python's Requests module? among others I have tried the following code:
from requests import session
from bs4 import BeautifulSoup
url = 'https://account.socialbakers.com/login'
payload = dict(email = 'Myemail', password = 'Mypass')
with session() as s:
    soup = BeautifulSoup(s.get(url).content,'lxml')
    p = s.post(url, data = payload, verify=True)
    print(p.text)
This however just gives me the login page again and doesn't seem to log me in
I have checked in the form that I am referring to the correct names of the inputs 'email' and 'password'. I've tried explicitly passing through cookies as well. The verify=True parameter was suggested as a way to deal with the fact the website is https.
I can't work out what isn't working/what is different about this form to the one on the linked post.
Thanks
Edit: Updated p = s.get to p = s.post

Checked the website. It is sending the SHA3 hash of the password instead of sending as plaintext. You can see this in line 111 of script.js which is included in the main page as :
<script src="/js/script.js"></script>
inside the head tag.
So you need to replicate this behaviour while sending POST requests. I found pysha3 library that does the job pretty well.
So first install pysha3 by running pip install pysha3 (give sudo if necessary) then run the code below
import hashlib
import sha3  # pysha3; only needed on Python < 3.6, where hashlib lacks sha3_512
from requests import session
from bs4 import BeautifulSoup
url = 'https://account.socialbakers.com/login'
myemail = "abhigolu10@gmail.com"
mypassword = hashlib.sha3_512(b"st#ck0verflow").hexdigest()  # send the SHA3 hex digest, not the plaintext
payload = {'email':myemail, 'password':mypassword}
with session() as s:
    soup = BeautifulSoup(s.get(url).content,'lxml')
    p = s.post(url, data = payload, verify=True)
    print(p.text)
and you will get the correct logged in page!

Two things to look out for. First, use s.post rather than s.get. Second, check the network tab in the browser to see whether the form sends any other values.

The form does not send the password in clear text; it encrypts or hashes it first. When you type the password aaaa into the form, the network tab shows that it sends
b3744bb9a8adb2d67cfdf79095bd84f5e77500a76727e6d73eef460eb806511ba73c9f765d4b3738e0b1399ce4a4c4ac3aed17fff34e0ef4037e9be466adec61
so there is no easy way to log in via the requests library without duplicating this behavior.
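Duplicating the behavior comes down to hashing the password before it goes into the payload. A minimal sketch, assuming the site really uses SHA3-512 as the first answer reports and that Python 3.6 or newer is available (where hashlib includes sha3_512); hash_password is a hypothetical helper name.

```python
import hashlib

def hash_password(plaintext):
    """Return the SHA3-512 hex digest of the password as a 128-character string."""
    return hashlib.sha3_512(plaintext.encode("utf-8")).hexdigest()

digest = hash_password("aaaa")
print(len(digest))  # 128 hex characters, the same length as the value shown above
```

A quick check is to compare the digest for a known password with the value the browser sends for it. If they differ, the page's script is doing something other than plain SHA3-512.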

Python 3 script for logging into a website using the Requests module

I'm trying to write some Python (3.3.2) code to log in to a website using the Requests module. Here is the form section of the login page:
<form method="post" action="https://www.ibvpn.com/billing/dologin.php" name="frmlogin">
<input type="hidden" name="token" value="236647d2da7c8408ceb78178ba03876ea1f2b687" />
<div class="logincontainer">
<fieldset>
<div class="clearfix">
<label for="username">Email Address:</label>
<div class="input">
<input class="xlarge" name="username" id="username" type="text" />
</div>
</div>
<div class="clearfix">
<label for="password">Password:</label>
<div class="input">
<input class="xlarge" name="password" id="password" type="password"/>
</div>
</div>
<div align="center">
<p>
<input type="checkbox" name="rememberme" /> Remember Me
</p>
<p>Request a Password Reset</p>
</div>
</fieldset>
</div>
<div class="actions">
<input type="submit" class="btn primary" value="Login" />
</div>
</form>
Here is my code, trying to deal with hidden input:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ibvpn.com/billing/clientarea.php'
body = {'username':'my email address','password':'my password'}
s = requests.Session()
loginPage = s.get(url)
soup = BeautifulSoup(loginPage.text)
hiddenInputs = soup.findAll(name = 'input', type = 'hidden')
for hidden in hiddenInputs:
    name = hidden['name']
    value = hidden['value']
    body[name] = value
r = s.post(url, data = body)
This just returns the login page. If I post my login data to the URL in the 'action' field, I get a 404 error.
I've seen other posts on StackExchange where automatic cookie handling doesn't seem to work, so I've also tried dealing with the cookies manually using:
cookies = dict(loginPage.cookies)
r = s.post(url, data = body, cookies = cookies)
But this also just returns the login page.
I don't know if this is related to the problem, but after I've run either variant of the code above, entering r.cookies returns <<class 'requests.cookies.RequestsCookieJar'>[]>
If anyone has any suggestions, I'd love to hear them.

You are loading the wrong URL. The form has an action attribute:
<form method="post" action="https://www.ibvpn.com/billing/dologin.php" name="frmlogin">
so you must post your login information to:
https://www.ibvpn.com/billing/dologin.php
instead of posting back to the login page. POST to soup.form['action'] instead:
r = s.post(soup.form['action'], data=body)
Your code is handling cookies just fine; I can see that s.cookies holds a cookie after requesting the login form, for example.
If this still doesn't work (a 404 is returned), then the server is using additional techniques to tell scripts from real browsers. Usually this is done by inspecting the request headers. Look at the headers your browser sends and replicate those. It may just be the User-Agent header that they check, but the Accept-* headers and Referer can also play a role.
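The two points above, posting to the form's action and carrying the hidden fields along, can be sketched with the standard library alone. This is an illustration, not the site's actual requirements: LoginFormParser, build_login_post and the header values in BROWSER_HEADERS are hypothetical and should be replaced with what your own browser sends.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LoginFormParser(HTMLParser):
    """Collect the first form's action and every hidden input on the page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.hidden[attrs["name"]] = attrs.get("value", "")

def build_login_post(html, page_url, credentials):
    """Return (post_url, body) for the login form found in `html`."""
    parser = LoginFormParser()
    parser.feed(html)
    # A relative action such as "/do/login" is resolved against the page URL.
    post_url = urljoin(page_url, parser.action or "")
    body = dict(parser.hidden)
    body.update(credentials)
    return post_url, body

# Hypothetical header values; copy the real ones from your browser's network tab.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "https://www.ibvpn.com/billing/clientarea.php",
}
```

The result would then be sent with something like s.post(post_url, data=body, headers=BROWSER_HEADERS) on the same session that fetched the login page.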

What happens when submitting an HTML survey

My question is: what really happens when you hit the submit button on an HTML survey like this one?
<INPUT TYPE="radio" NAME="bev" VALUE="no" CHECKED>No beverage<BR>
<INPUT TYPE="radio" NAME="bev" VALUE="tea">Tea<BR>
<INPUT TYPE="radio" NAME="bev" VALUE="cof">Coffee<BR>
<INPUT TYPE="radio" NAME="bev" VALUE="lem">Lemonade<BR>
To be more specific, how does the browser send the data for my choice to the server? I want to write Python code that will vote for me in an HTML survey like this.

If the form's method attribute is post (which I think it is), then the browser sends a POST request containing the selected value. If you're using the requests library, this is the code:
import requests
# Keys are the inputs' name attributes; values are the content you want to send
data = {'bev': 'tea'}
r = requests.post("http://awebsite.com/", data=data)
print(r.content)
See the Requests docs on POST requests.
And if you aren't using requests, then God help you write the code.
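To see what the browser actually puts on the wire for a form with the default encoding, the standard library can build the same string. With method=get it is appended to the URL after a "?", and with method=post it is sent as the request body. encode_form is a hypothetical helper name used only for this illustration.

```python
from urllib.parse import urlencode

def encode_form(fields):
    """Encode form fields as application/x-www-form-urlencoded, as a browser does."""
    return urlencode(fields)

encoded = encode_form({"bev": "tea"})
print(encoded)  # bev=tea
```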
