I'm trying to log in to Wikipedia using a Python script, but despite following the instructions here, I just can't get it to work.
import urllib
import urllib2
import cookielib
username = 'myname'
password = 'mypassword'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6")]
login_data = urllib.urlencode({'wpName' : username, 'wpPassword' : password})
opener.open('http://en.wikipedia.org/w/index.php?title=Special:UserLogin', login_data)
resp = opener.open('http://en.wikipedia.org/wiki/Special:Watchlist')
All I get is the "You're not logged in" page. I tried logging in to another site with the script, with the same negative result. I suspect it either has something to do with cookies or I'm missing something incredibly simple here, but I just cannot find it.
If you inspect the raw request sent to the login URL (with the help of a tool such as Charles Proxy), you will see that it actually sends four parameters: wpName, wpPassword, wpLoginAttempt and wpLoginToken. The first three are static and you can fill them in anytime; the fourth, however, needs to be parsed from the HTML of the login page. You will need to POST this parsed value, along with the other three, to the login URL to be able to log in.
Here is the working code using Requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup as bs

def get_login_token(raw_resp):
    # pull the hidden wpLoginToken value out of the login form
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]

payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wpLoginAttempt': 'Log in',
    #'wpLoginToken': '',   # filled in below, after fetching the login page
}

with requests.session() as s:
    # fetch the login page to obtain a fresh token, then post the credentials
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')
Adding these two lines:

r = bs(response.content, 'lxml')
print r.get_text()

I should be able to tell whether I'm logged in or not, right? I keep seeing "Please log in to view or edit items on your watchlist." even though I'm using the clean code given above, with my own login and password.
Where is the mistake?
Wikipedia now forces HTTPS and requires additional parameters, and wpLoginAttempt became wploginattempt. Here is an updated version of K Z's initial answer:
import requests
from bs4 import BeautifulSoup as bs

def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]

payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',
}

with requests.session() as s:
    resp = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
You need to add the header Content-Type: application/x-www-form-urlencoded to your POST request.
I also added the following lines and I still appear not to be logged in.

page = response.text.encode('utf8')
# str.find() returns -1 (which is truthy) when the substring is absent,
# so the result has to be compared against -1 explicitly
if page.find('Not logged in') != -1:
    print 'You are not logged in. :('
else:
    print 'YOU ARE LOGGED IN! :)'
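For what it's worth, Requests already sends Content-Type: application/x-www-form-urlencoded on its own when you pass a dict as data, but you can also set it explicitly. A minimal sketch, reusing the payload and session from the updated answer above:

headers = {'Content-Type': 'application/x-www-form-urlencoded'}
response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                       data=payload, headers=headers)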
I'm trying to log in to a website, https://app.factomos.com/connexion, but it doesn't work: I still get a 403 error. I've tried different headers and different data, but I really don't know where the problem is...
I also tried another way with MechanicalSoup, but that still returns the login page.
If someone can help me... Thank you for your time :/
import requests
from bs4 import BeautifulSoup as bs

url = 'https://factomos.com'
email = 'myemail'
password = 'mypassword'
url_login = 'https://factomos.com/connexion'

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
})

data_login = {
    'appAction': 'login',
    'email': email,
    'password': password
}

with requests.Session() as s:
    dash = s.post(url_login, headers=headers, data=data_login)
    print(dash.status_code)

# MechanicalSoup
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
resp = browser.open("https://app.factomos.com/connexion")
browser.select_form('form[id="login-form"]')
browser["email"] = 'myemail'
browser["password"] = 'mypassword'
response = browser.submit_selected()
print("submit: ", response.status_code)
print(browser.get_current_page())
I expect a 200 response with the dashboard page, but the actual response is a 403 or the login page.
The URL you are using to log in (https://factomos.com/connexion) is not the correct login endpoint. You can find this out using your browser's devtools/inspect element panel, specifically the "Network" tab.
Accessing this panel varies by browser. Here's how you do it in Chrome, but in general you can open it by right-clicking the page and choosing Inspect element.
From there, I sent a fake login attempt and found that the actual login endpoint is:
https://app.factomos.com/controllers/app-pro/login-ajax.php
As soon as you send the request, you can view details about it in the Network tab, including the form data that was posted (the same email/password fields as in your code). And the response:
{"error":{"code":-1,"message":"Identifiant ou mot de passe incorrect"}}
I'm trying to log in to this website: https://archiwum.polityka.pl/sso/loginform to scrape some articles.
Here is my code:
import requests
from bs4 import BeautifulSoup

login_url = 'https://archiwum.polityka.pl/sso/loginform'
base_url = 'http://archiwum.polityka.pl'
payload = {"username": XXXXX, "password": XXXXX}
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0"}

with requests.Session() as session:
    # Login...
    request = session.get(login_url, headers=headers)
    post = session.post(login_url, data=payload)
    # Now I want to go to the page with a specific article
    article_url = 'https://archiwum.polityka.pl/art/na-kanapie-siedzi-len,393566.html'
    request_article = session.get(article_url, headers=headers)
    # Scrape its content
    soup = BeautifulSoup(request_article.content, 'html.parser')
    content = soup.find('p', {'class': 'box_text'}).find_next_sibling().text.strip()
    # And print it.
    print(content)
But my output is like this:
... [pełna treść dostępna dla abonentów Polityki Cyfrowej]
which in my native language means:
... [full content available for subscribers of the Polityka Cyfrowa]
My credentials are correct because I have full access to the content from the browser but not with Requests.
I will be grateful for any suggestions as to how I can do this with Requests. Or do I have to use Selenium for this?
I can help you with the login procedure. The rest, I suppose, you can manage yourself. Your payload doesn't contain all the information necessary to fetch a valid response. Fill in the username and password fields in the script below and run it. You should then see your name, just as you do when you are logged in to that webpage in a browser.
import requests
from bs4 import BeautifulSoup

payload = {
    'username': 'username here',
    'password': 'your password here',
    'login_success': 'http://archiwum.polityka.pl',
    'login_error': 'http://archiwum.polityka.pl/sso/loginform?return=http%3A%2F%2Farchiwum.polityka.pl'
}

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0"}
    page = session.post('https://www.polityka.pl/sso/login', data=payload)
    soup = BeautifulSoup(page.text, "lxml")
    profilename = soup.select_one("#container p span.border").text
    print(profilename)
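Once the login succeeds, the same session should be able to fetch the article from your question. A sketch, still inside the with block above, reusing the exact selector you already had:

    # still inside the with-block: the session now carries the auth cookies
    article_url = 'https://archiwum.polityka.pl/art/na-kanapie-siedzi-len,393566.html'
    request_article = session.get(article_url)
    soup = BeautifulSoup(request_article.content, 'html.parser')
    content = soup.find('p', {'class': 'box_text'}).find_next_sibling().text.strip()
    print(content)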
I am new to Python and trying to crawl some information on my school website (an ASPX-based site).
What I am trying to do is:
http://url.of.the.page
Login
Open the 4th link on the left
I was trying to log into my account by using req = urllib2.Request(url, data) (the data contains the ID, password and some other information I can see through Wireshark) together with result = opener.open(req) and print result.read().
Unfortunately the result printed out is the same as the original login page, so I guess I did not log in successfully; this result is also the same as when I click the 4th link without logging in.
(Another sign is that when I tried to open another link on the page, I was redirected to the login page.)
My questions are:
Did I really fail to log in?
If so, what is the correct way to log in?
My code is as follows:
# -*- coding: utf-8 -*-
import urllib2
import urllib
import cookielib
from bs4 import BeautifulSoup
import datetime
import time
from urlgrabber.keepalive import HTTPHandler

def get_ViewState(soup):
    view_input = soup.find(id="__VIEWSTATE")
    return view_input['value']

def get_EventValidation(soup):
    event_input = soup.find(id="__EVENTVALIDATION")
    return event_input['value']

cookie = cookielib.CookieJar()
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie), keepalive_handler)
urllib2.install_opener(opener)

__url = 'http://url.of.the.page'
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'),
                     ('Connection', 'Keep-Alive'),
                     ('Referer', __url)]

page = urllib.urlopen(__url).read()
soup = BeautifulSoup(page)
viewstate = get_ViewState(soup)
eventvalidation = get_EventValidation(soup)

postdata = urllib.urlencode({
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': viewstate,
    'TxtStudentId': 'xxxxxxx',
    'TxtPassword': 'xxxxxxx',
    'BtnLogin': 'login',
    '__EVENTVALIDATION': eventvalidation
})

req = urllib2.Request(
    url=__url,
    data=postdata
)
result = opener.open(req)
print result.read()
# result = opener.open(req)
# print result.info()
# print result
# print result.read()

print "------------------------------------------------"

# after login, I need to get the scores table
__queryUrl = 'http://url.of.the.page?key=0'
now = datetime.datetime.now()
opener.addheaders = [('Referer', 'http://url.of.the.page?i=' + now.strftime('%H:%M:%S'))]
result = opener.open(__queryUrl)
print result.read()

for item in cookie:
    print 'Cookie:Name = ' + item.name
    print 'Cookie:Value = ' + item.value
For logging in, you may have to use a programming API for the website, because it will probably check whether you are a robot. For clicking the fourth link, simply view the source code (HTML) of the website and find the class and ID of the link you want. Then, with some Googling, you can add that to the code and you are all set :)
By capturing the packets I found out that my POST message got an OK response from the server, which means I did log in successfully.
The reason the GET request got a 302 Found in return is that I didn't include a cookie in the header. I was using urllib2 and it did not include the cookie in the GET message automatically.
So I hard-coded the cookie into the header by doing the following:
cookie = cookielib.CookieJar()
ckName = ''
ckValue = ''
for item in cookie:
    ckName = item.name
    ckValue = item.value

opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'),
                     ('Referer', 'http://202.120.108.14/ecustedu/K_StudentQuery/K_StudentQueryLeft.aspx?i=' + now.strftime('%H:%M:%S')),
                     ('Cookie', ckName + '=' + ckValue)]
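For reference, when the opener is built with an HTTPCookieProcessor around the same jar, urllib2 normally replays the session cookie on follow-up requests by itself, so the hard-coded header should not be needed. A minimal sketch, assuming the __url, postdata and __queryUrl from the question:

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
# the POST response stores the session cookie in the jar...
opener.open(__url, postdata)
# ...and the same opener sends it back automatically on the next GET
result = opener.open(__queryUrl)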
I'm trying to write a simple scraper to get usage details for my internet account. I've successfully written it in PowerShell, but I'd like to move it to Python for ease of use/deployment. If I print r.text (the result of the POST to the login page), I just get the login page form details again.
I think the solution might be something along the lines of using prepare_request? Apologies if I'm missing something super obvious; it's been about 5 years since I touched Python ^^
import requests

USERNAME = 'usernamehere'
PASSWORD = 'passwordhere'

loginURL = 'https://myaccount.amcom.com.au/ClientLogin.aspx'
secureURL = 'https://myaccount.amcom.com.au/FibreUsageDetails.aspx'

session = requests.session()
req_headers = {'Content-Type': 'application/x-www-form-urlencoded'}

formdata = {
    'ctl00$MemberToolsContent$txtUsername': USERNAME,
    'ctl00$MemberToolsContent$txtPassword': PASSWORD,
    'ctl00$MemberToolsContent$btnLogin': 'Login'
}

session.get(loginURL)
r = session.post(loginURL, data=formdata, headers=req_headers, allow_redirects=False)
r2 = session.get(secureURL)
I've referenced these threads in my attempts:
HTTP POST and GET with cookies for authentication in python
Authentication and python Requests
PowerShell script for reference:
$r=Invoke-WebRequest -Uri 'https://myaccount.amcom.com.au/ClientLogin.aspx' -UseDefaultCredentials -SessionVariable RequestForm
$r.Forms[0].Fields['ctl00$MemberToolsContent$txtUsername'] = "usernamehere"
$r.Forms[0].Fields['ctl00$MemberToolsContent$txtPassword'] = "passwordhere"
$r.Forms[0].Fields['ctl00$MemberToolsContent$btnLogin'] = "Login"
$response = Invoke-WebRequest -Uri 'https://myaccount.amcom.com.au/ClientLogin.aspx' -WebSession $RequestForm -Method POST -Body $r.Forms[0].Fields -ContentType 'application/x-www-form-urlencoded'
$response2 = Invoke-WebRequest -Uri 'https://myaccount.amcom.com.au/FibreUsageDetails.aspx' -WebSession $RequestForm
import requests
import re
from bs4 import BeautifulSoup

user = "xyzmohsin"
passwd = "abcpassword"

s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"}
s.headers.update(headers)

login_url = "https://myaccount.amcom.com.au/ClientLogin.aspx"
r = s.get(login_url)
soup = BeautifulSoup(r.content, "html.parser")

RadMasterScriptManager_TSM = soup.find(src=re.compile("RadMasterScriptManager_TSM"))['src'].split("=")[-1]
EVENTTARGET = soup.find(id="__EVENTTARGET")['value']
EVENTARGUMENT = soup.find(id="__EVENTARGUMENT")['value']
VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
VIEWSTATEGENERATOR = soup.find(id="__VIEWSTATEGENERATOR")['value']

# use the raw '$' form for the ASP.NET field names; requests URL-encodes the
# keys itself, so pre-encoded '%24' names would end up double-encoded
data = {"RadMasterScriptManager_TSM": RadMasterScriptManager_TSM,
        "__EVENTTARGET": EVENTTARGET,
        "__EVENTARGUMENT": EVENTARGUMENT,
        "__VIEWSTATE": VIEWSTATE,
        "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR,
        "ctl00_TopMenu_RadMenu_TopNav_ClientState": "",
        "ctl00$MemberToolsContent$HiddenField_Redirect": "",
        "ctl00$MemberToolsContent$txtUsername": user,
        "ctl00$MemberToolsContent$txtPassword": passwd,
        "ctl00$MemberToolsContent$btnLogin": "Login"}

headers = {"Content-Type": "application/x-www-form-urlencoded",
           "Host": "myaccount.amcom.com.au",
           "Origin": "https://myaccount.amcom.com.au",
           "Referer": "https://myaccount.amcom.com.au/ClientLogin.aspx"}

r = s.post(login_url, data=data, headers=headers)
I don't have a username and password, hence I couldn't test the headers in the final POST request. If it doesn't work, then please remove Host, Origin and Referer from the final POST request's headers.
Hope that helps :-)
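To check whether the login actually took, you could then request the usage page from your question with the same session. A sketch; treating the presence of the login textbox name in the response as a failure signal is just an assumption:

r2 = s.get("https://myaccount.amcom.com.au/FibreUsageDetails.aspx")
print(r2.url)  # ASP.NET sites often bounce failed logins back to ClientLogin.aspx
print("txtUsername" in r2.text)  # True would suggest we are still on the login form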
I'm trying to fix a program which can log in to my MU account and retrieve some data...
I don't know what I am doing wrong... That's the code:
#!/usr/bin/env python
import urllib, urllib2, cookielib
username = 'username'
password = 'password'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('http://megaupload.com/index.php?c=login', login_data)
resp = opener.open('http://www.megaupload.com/index.php?c=filemanager')
print resp.read()
Thx for any answer!
You can simulate filling in the form.
For that you can use the mechanize library, based on the Perl module WWW::Mechanize.
#!/usr/bin/env python
import cookielib, mechanize
username = 'username'
password = 'password'
br = mechanize.Browser()
cj = cookielib.CookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6')]
br.open('http://www.megaupload.com/?c=login')
br.select_form('loginfrm')
br.form['username'] = username
br.form['password'] = password
br.submit()
resp = br.open('http://www.megaupload.com/index.php?c=filemanager')
print resp.read()
See Use mechanize to log into megaupload
Okay, I just implemented it myself and it seems you just forgot one value; that's why I always use TamperData or something similar to check what my browser is actually sending to the server. WAY easier and shorter than going through the HTML.
Anyway, just add 'redir': 1 to your dict and it'll work:
import http.cookiejar
import urllib.parse
import urllib.request

if __name__ == '__main__':
    username = 'username'
    password = 'password'
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # urlencode() returns a str; urlopen() expects bytes for POST data in Python 3
    login_data = urllib.parse.urlencode({'username': username, 'password': password,
                                         'login': 1, 'redir': 1}).encode('utf-8')
    response = opener.open("http://www.megaupload.com/?c=login", login_data)
    with open("test.txt", "w") as file:
        file.write(response.read().decode("UTF-8"))  # so we can compare resulting html easily
Although I must say I'll have a look at mechanize and co now; I do something like that often enough that it could be quite worthwhile. Still, I can't stress enough that the most important help is a browser plugin that lets you check the sent data ;)
You might have more luck with mechanize or twill, which are designed to streamline these kinds of processes. Otherwise, I think your opener is missing at least one important component: something to process cookies. Here's a bit of code I have lying around from the last time I did this:
# build opener with HTTPCookieProcessor
cookie_jar = cookielib.MozillaCookieJar('tasks.cookies')
o = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cookie_jar)
)
My guess is to add the c=login name/value pair to login_data rather than including it directly in the URL.
You're probably also breaking a TOS/EULA, but I can't say I care that much.
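A minimal sketch of that guess, reusing the opener and credentials from the question; whether Megaupload accepts the pair in the POST body instead of the query string is untested:

login_data = urllib.urlencode({'c': 'login', 'username': username, 'password': password})
resp = opener.open('http://www.megaupload.com/index.php', login_data)
print resp.read()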