What's wrong with my requests.Session for a Python crawler?

I'm coding a crawler for www.researchgate.net, but it seems that I'll be stuck on the login page forever.
Here's my code:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
params = {'login': 'my_email', 'password': 'my_password'}
session.post("https://www.researchgate.net/application.Login.html", data = params)
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print BeautifulSoup(s.text).title
Can anybody find anything wrong with my code? Why does s redirect to the login page every time?

There are hidden fields in the login form that probably need to be supplied (I can't test -- I don't have a login there).
One is request_token, which is set to a long base64-encoded string. Others are invalidPasswordCount and loginCookie, which might also be required.
Further to that, there is a session cookie that you might need to send with the login credentials.
Making this work will require an initial GET to fetch the request_token, which you need to extract somehow, e.g. with BeautifulSoup. If you use your requests session, the session cookie will be sent automatically with the following POST, so you shouldn't need to worry about that.
import requests
from bs4 import BeautifulSoup
session = requests.Session()
# initial GET to retrieve token and set cookies
r = session.get('https://www.researchgate.net/application.Login.html')
soup = BeautifulSoup(r.text)
request_token = soup.find('input', attrs={'name':'request_token'})['value']
params = {'login': 'my_email', 'password': 'my_password', 'request_token': request_token, 'invalidPasswordCount': 0, 'loginCookie': 'yes'}
session.post("https://www.researchgate.net/application.Login.html", data=params)
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print BeautifulSoup(s.text).title

Thanks to mhawke. I modified my original code as he suggested and finally logged in successfully.
Here's my new code:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
loginpage = session.get("https://www.researchgate.net/application.Login.html")
request_token = BeautifulSoup(loginpage.text).form.find("input",{"name":"request_token"}).attrs["value"]
print request_token
params = {"request_token":request_token,
"invalidPasswordCount":"0",
'login': 'my_email',
'password': 'my_password',
"setLoginCookie":"yes"
}
session.post("https://www.researchgate.net/application.Login.html", data = params)
#print s.cookies.get_dict()
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print BeautifulSoup(s.text).title

Related

Module 'requests' doesn't go through with the login

I am trying to get information from a website using the requests module. To get to the information you have to be logged in; only then can you access the page. I looked at the input tags and noticed they are called login_username and login_password, but for some reason the POST doesn't go through. I also read here that someone solved a similar problem by waiting a few seconds before requesting the next page, but that didn't help either.
Here is my code:
import requests
import time
#This URL will be the URL that your login form points to with the "action" tag.
loginurl = 'https://jadepanel.nephrite.ro/login'
#This URL is the page you actually want to pull down with requests.
requesturl = 'https://jadepanel.nephrite.ro/clan/view/123'
payload = {
    'login_username': 'username',
    'login_password': 'password'
}
with requests.Session() as session:
    post = session.post(loginurl, data=payload)
    time.sleep(3)
    r = session.get(requesturl)
    print(r.text)
login_username and login_password are not the only necessary parameters. If you look at the /login/ POST request in the browser developer tools, you would see that there is also a _token being sent.
This is something you would need to parse out of the login HTML. So the flow would be the following:
get the https://jadepanel.nephrite.ro/login page
HTML parse it and extract _token value
make a POST request with login, password and token
use the logged in session to navigate the site
For the HTML parsing we could use BeautifulSoup (there are other options, of course):
from bs4 import BeautifulSoup
login_html = session.get(loginurl).text
soup = BeautifulSoup(login_html, "html.parser")
token = soup.find("input", {"name": "_token"})["value"]
payload = {
    'login_username': 'username',
    'login_password': 'password',
    '_token': token
}
Complete code:
import time
import requests
from bs4 import BeautifulSoup
# This URL will be the URL that your login form points to with the "action" tag.
loginurl = 'https://jadepanel.nephrite.ro/login'
# This URL is the page you actually want to pull down with requests.
requesturl = 'https://jadepanel.nephrite.ro/clan/view/123'
with requests.Session() as session:
    login_html = session.get(loginurl).text
    soup = BeautifulSoup(login_html, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]
    payload = {
        'login_username': 'username',
        'login_password': 'password',
        '_token': token
    }
    post = session.post(loginurl, data=payload)
    time.sleep(3)
    r = session.get(requesturl)
    print(r.text)

Python web scraping login

I am trying to login to a website using python.
The login URL is:
https://login.flash.co.za/apex/f?p=pwfone:login
and the form action URL is shown as:
https://login.flash.co.za/apex/wwv_flow.accept
When I use 'inspect element' in Chrome while logging in manually, these are the form fields that show up in the POST (p_t02 is the password field).
There are a few hidden items that I'm not sure how to add into the Python code below.
When I use this code, the login page is returned:
import requests
url = 'https://login.flash.co.za/apex/wwv_flow.accept'
values = {'p_flow_id': '1500',
          'p_flow_step_id': '101',
          'p_page_submission_id': '3169092211412',
          'p_request': 'LOGIN',
          'p_t01': 'solar',
          'p_t02': 'password',
          'p_checksum': ''
          }
r = requests.post(url, data=values)
print r.content
How can I adjust this code to perform a login?
This is more or less how your script should look. Use a session to handle the cookies automatically, and fill in the username and password fields manually.
import requests
from bs4 import BeautifulSoup
logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text, "lxml")
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'username',
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    r = s.post(posturl, data=values)
    print r.content
Since I cannot recreate your case, I can't tell you exactly what to change, but when I was doing such things I used Postman to intercept all the requests my browser sends. I'd install that, along with its browser extension, and then perform the login. You can then view the request in Postman, along with the response it received; what's more, it provides you with Python code for the request too, so you could simply copy and use it.
In short: use Postman, perform the login, and clone the resulting request.
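For illustration, the code Postman generates usually boils down to a plain requests call with the captured URL, headers and form fields pasted in. In the sketch below every value is a placeholder to be replaced with what Postman captured from your own login:
import requests
# Placeholder values -- swap in the URL, headers and form fields
# that Postman captured from your browser's login request.
url = 'https://example.com/login'
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'login_username': 'username', 'login_password': 'password'}
response = requests.post(url, headers=headers, data=payload)
print(response.status_code)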

Logging in to a website with requests

I need to log in to a website with requests, but everything I have tried doesn't work:
from bs4 import BeautifulSoup as bs
import requests
s = requests.session()
url = 'https://www.ent-place.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=5'
def authenticate():
    headers = {'username': 'myuser', 'password': 'mypasss', '_Id': 'submit'}
    page = s.get(url)
    soup = bs(page.content)
    value = soup.form.find_all('input')[2]['value']
    headers.update({'value_name': value})
    auth = s.post(url, params=headers, cookies=page.cookies)
authenticate()
or:
import requests
payload = {
    'inUserName': 'user',
    'inUserPass': 'pass'
}
with requests.Session() as s:
    p = s.post('https://www.ent-place.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=5', data=payload)
    print(p.text)
    print(p.status_code)
    r = s.get('A protected web page url')
    print(r.text)
When I try this and check the .status_code, it returns 200, but I want 401 or 403 so I can write something like 'if logged in'...
I also found this, but I think it works in Python 2; I use Python 3 and I don't know how to convert it:
import requests
import sys
payload = {
    'username': 'sopier',
    'password': 'somepassword'
}
with requests.Session(config={'verbose': sys.stderr}) as c:
    c.post('http://m.kaskus.co.id/user/login', data=payload)
    r = c.get('http://m.kaskus.co/id/myform')
    print 'sopier' in r.content
Does anybody know how to do this? I have tested every script I could find and none of them work...
When you submit the logon, the POST request is sent to https://www.ent-place.fr/CookieAuth.dll?Logon, not https://www.ent-place.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=5 -- you get redirected to that URL afterwards.
When I tested this, the POST request contained the following parameters:
curl:Z2F
flags:0
forcedownlevel:0
formdir:5
username:username
password:password
SubmitCreds.x:69
SubmitCreds.y:9
SubmitCreds:Ouvrir une session
So, you'll likely need to supply those additional parameters as well.
Also, the line s.post(url, params=headers, cookies=page.cookies) is not correct. You should pass headers into the keyword argument data, not params -- params encodes into the request URL, whereas these values need to go into the form data. (I'm also assuming you really mean payload when you say headers.)
s.post(url, data=headers, cookies=page.cookies)
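Putting those pieces together, a rough sketch of the whole flow might look like the one below. It is untested since I don't have an account there; the field values are copied from the captured request above, and the final success check is only a heuristic:
import requests
login_url = 'https://www.ent-place.fr/CookieAuth.dll?Logon'
payload = {
    'curl': 'Z2F',
    'flags': '0',
    'forcedownlevel': '0',
    'formdir': '5',
    'username': 'myuser',
    'password': 'mypass',
    'SubmitCreds.x': '69',
    'SubmitCreds.y': '9',
    'SubmitCreds': 'Ouvrir une session',
}
with requests.Session() as s:
    # The initial GET picks up any session cookies the login page sets
    s.get('https://www.ent-place.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=5')
    p = s.post(login_url, data=payload)
    # Heuristic check: a failed login redirects back to the GetLogon page
    print('logged in' if 'GetLogon' not in p.url else 'still on the login page')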
The site you're trying to log in to has an onClick JavaScript handler on the login form, and requests won't execute JavaScript for you. This may cause issues with the site's functionality.

Scrape data from a page that requires a login

I am new to Python and web scraping, and I am trying to write a very basic script that will get data from a web page that can only be accessed after logging in. I have looked at a bunch of different examples but none of them fix the issue. This is what I have so far:
from bs4 import BeautifulSoup
import urllib, urllib2, cookielib
username = 'name'
password = 'pass'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('WebpageWithLoginForm')
resp = opener.open('WebpageIWantToAccess')
soup = BeautifulSoup(resp, 'html.parser')
print soup.prettify()
As of right now, when I print the page it just prints the contents as if I were not logged in. I think the issue has something to do with the way I am setting the cookies, but I am really not sure, because I do not fully understand what is happening with the cookie processor and its libraries.
Thank you!
Current Code:
import requests
import sys
EMAIL = 'usr'
PASSWORD = 'pass'
URL = 'https://connect.lehigh.edu/app/login'
def main():
    # Start a session so we can have persistent cookies
    session = requests.session(config={'verbose': sys.stderr})
    # This is the form data that the page sends when logging in
    login_data = {
        'username': EMAIL,
        'password': PASSWORD,
        'LOGIN': 'login',
    }
    # Authenticate
    r = session.post(URL, data=login_data)
    # Try accessing a page that requires you to be logged in
    r = session.get('https://lewisweb.cc.lehigh.edu/PROD/bwskfshd.P_CrseSchdDetl')
if __name__ == '__main__':
    main()
You can use the requests module.
Take a look at the answer I've linked below:
https://stackoverflow.com/a/8316989/6464893
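The gist of that linked answer, adapted to your current code, is sketched below. One likely culprit in your updated attempt: requests.Session() takes no config argument in current versions of requests (that keyword is from the old 0.x API). The form field names are taken from your snippet and still need to be verified against the actual login form:
import requests
EMAIL = 'usr'
PASSWORD = 'pass'
URL = 'https://connect.lehigh.edu/app/login'
with requests.Session() as session:
    # Field names copied from the question -- verify them against the real form
    login_data = {'username': EMAIL, 'password': PASSWORD, 'LOGIN': 'login'}
    # Authenticate; the session stores whatever cookies the server sets
    session.post(URL, data=login_data)
    # Reuse the same session for pages that require being logged in
    r = session.get('https://lewisweb.cc.lehigh.edu/PROD/bwskfshd.P_CrseSchdDetl')
    print(r.text)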

count cookies in python

I am struggling with Python. I want to write a Python script that creates a cookie and counts how many times the cookie is sent during a session.
Here is what I have tried so far:
import os
from Cookie import SimpleCookie
def getCookie(initialvalues):
    if os.environ.has_key('HTTP_COOKIE'):
        cookie = SimpleCookie(os.environ['HTTP_COOKIE'])
    else:
        cookie = SimpleCookie()
    for key in initialvalues.keys():
        if not cookie.has_key(key):
            cookie[key] = initialvalues[key]
    return cookie
if __name__ == '__main__':
    c = getCookie({'counter': 0})
    c['counter'] = int(c['counter'].value) + 1
    print c
But I know it is wrong. Can someone help me write the script? Any help would be appreciated.
I'm confused by your question. What I believe you want to do is request some web page and count how many times your cookie was found? You can gather cookies using a CookieJar:
import urllib, urllib2, cookielib
url = "http://example.com/cookies"
form_data = {'username': '', 'password': '' }
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
form_data = urllib.urlencode(form_data)
# data returned from this page contains a redirection
resp = opener.open(url, form_data)
print resp.read()
for cookie in jar:
    # Look for your cookie by name
    print cookie.name, cookie.value
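If what you actually want is a counter cookie that increments on every request to your own CGI script -- which is what your snippet seems to be aiming at -- a minimal Python 2 sketch could look like this (the cookie name 'counter' is arbitrary):
import os
from Cookie import SimpleCookie
# Read back whatever cookie the browser sent, if any
cookie = SimpleCookie(os.environ.get('HTTP_COOKIE', ''))
if 'counter' in cookie:
    count = int(cookie['counter'].value) + 1
else:
    count = 1
cookie['counter'] = count
# CGI output: the Set-Cookie header first, then the body
print cookie.output()
print 'Content-Type: text/plain'
print
print 'Counter is now %d' % count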
