python-requests and complicated forms - python

I'm trying to make a web scraper for my university's website, but I can't get past the login page.
import requests
URL = "https://login.ull.es/cas-1/login?service=https%3A%2F%2Fcampusvirtual.ull.es%2Flogin%2Findex.php%3FauthCAS%3DCAS"
USER = "myuser"
PASS = "mypassword"
payload = {
    "username": USER,
    "password": PASS,
    "warn": "false",
    "lt": "LT-2455188-fQ7b5JcHghCg1cLYvIMzpjpSEd0rlu",
    "execution": "e1s1",
    "_eventId": "submit",
    "submit": "submit"
}
with requests.Session() as s:
    r = s.post(URL, data=payload)
    #r = s.get(r"http://campusvirtual.ull.es/my/index.php")
    with open("test.html","w") as f:
        f.write(r.text)
That code is obviously not working, and I don't know where the mistake is. I also tried putting only the username and the password in the payload (the other values are hidden fields in the page's source code), but that fails too.
Can anyone point me in the right direction? Thanks. (Sorry for my English.)

The "lt": "LT-2455188-fQ7b5JcHghCg1cLYvIMzpjpSEd0rlu" value is a session ID or some sort of anti-CSRF protection (wild guess: an HMAC-ed random ID number). What matters is that it is not a constant value: you have to read it from the same URL by issuing a GET request first.
In the GET response you have something like:
<input type="hidden" name="lt" value="LT-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" />
Additionally, there is a JSESSIONID cookie that might be important.
This should be your flow:
1. GET the URL.
2. Extract the lt parameter and the JSESSIONID cookie from the response.
3. Fill the payload['lt'] field.
4. Set the cookie header.
5. POST to the same URL.
Extracting the cookie is very simple, see the requests documentation.
Extracting the lt parameter is a bit more difficult, but you can do it with the BeautifulSoup package. Assuming that you have the response body in a variable named text, you can use:
from bs4 import BeautifulSoup
payload['lt'] = BeautifulSoup(text, 'html.parser').find('input', {'name': 'lt', 'type': 'hidden'}).get('value')
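The flow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested login for this site: `find_input_value` and `cas_login` are hypothetical helper names, the standard-library `html.parser` stands in for BeautifulSoup so the snippet has no third-party imports, and `session` is assumed to be a `requests.Session()`, which stores the JSESSIONID cookie from the GET and sends it back with the POST on its own.

```python
from html.parser import HTMLParser


class _InputFinder(HTMLParser):
    """Remember the value of the first <input> with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == self.name and self.value is None:
            self.value = attrs.get("value") or ""


def find_input_value(html, name):
    """Return the value of the named <input>, or None if it is absent."""
    finder = _InputFinder(name)
    finder.feed(html)
    return finder.value


def cas_login(session, url, payload):
    """GET the login page, refresh payload['lt'], then POST to the same URL.

    `session` is assumed to behave like requests.Session(): cookies set by
    the GET (e.g. JSESSIONID) are sent again with the POST automatically.
    """
    page = session.get(url)
    lt = find_input_value(page.text, "lt")
    if lt is None:
        raise ValueError("no 'lt' input found on the login page")

    fresh = dict(payload)  # leave the caller's dict untouched
    fresh["lt"] = lt       # other hidden fields could be refreshed the same way
    return session.post(url, data=fresh)
```

With `requests` installed, this would be called as `cas_login(requests.Session(), URL, payload)`, and the returned response can be inspected as in the question.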

Related

Scrape order numbers from Wix.com

I want to organize a PUBG tournament. To do so, a friend and I have created a website on wix.com. There, interested players can "buy" a ticket for free and get an order number with it. Additionally, we have set up a Discord server that is "protected" by a password bot. The required password should be the order number mentioned above.
The "problem" is that the only way for me to get these order numbers is to log in to the wix.com admin site and go to the order overview. There I can either get a .csv file by pressing a button or scrape the data from the page source.
I tried to program the bot to scrape the data, but it just won't do it.
I would really appreciate any help whatsoever. Please be advised that neither I nor my friend are particularly skilled in Python, and we are still learning.
Thanks in advance.
Here is my code:
import os
import requests
from lxml import html
import selenium
from bs4 import BeautifulSoup
EMAIL = "<EMAIL>"
PASSWORD = "<PASSWORD>"
my_secret = os.environ['EMAIL']
my_secret = os.environ['password']
LOGIN_URL = "https://users.wix.com/signin"
URL = "https://manage.wix.com/dashboard/63e1c04d-52a6-49cc-91b2-149afa7ad7a9/events/b1839f48-412d-4274-b4de-2f8417028bcf/guest-list?referralInfo=sidebar"
def main():
    session_requests = requests.session()
    # Create payload
    payload = {
        "email": EMAIL,
        "password": PASSWORD,
    }
    # Perform login
    result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))
    # Scrape url
    result = session_requests.get(URL, headers = dict(referer = URL))
    tree = html.fromstring(result.content)
    Order = tree.xpath("//div[@class='_3Eow9']/a/text()")
    print(Order)

How to scrape data behind a login

I want to extract posts from a forum named "positive wellbeing during isolation" on HealthUnlocked.com.
I can extract the posts that are visible without logging in, but not the ones that require a login. I used url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page) to extract posts, but I don't know how to connect it to the login, as that URL returns JSON.
I would appreciate it if you could help me.
import requests, json
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep
url = "https://healthunlocked.com/private/programs/subscribed?user-id=1290456"
payload = {
    "username" : "my username goes here",
    "Password" : "my password goes here"
}
s= requests.Session()
p= s.post(url, data = payload)
headers = {"user-agent": "Mozilla/5.0"}
pages =2
data = []
listtitles=[]
listpost=[]
listreplies=[]
listpostID=[]
listauthorID=[]
listauthorName=[]
for page in range(1,pages):
    url = 'https://solaris.healthunlocked.com/posts/positivewellbeing/popular?pageNumber={0}'.format(page)
    r = requests.get(url,headers=headers)
    posts = json.loads(r.text)
    for post in posts:
        sleep(3.5)
        listtitles.append(post['title'])
        listreplies.append(post ["totalResponses"])
        listpostID.append(post["postId"])
        listauthorID.append(post ["author"]["userId"])
        listauthorName.append(post ["author"]["username"])
        url = 'https://healthunlocked.com/positivewellbeing/posts/{0}'.format(post['postId'])
        r = requests.get(url,headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        listpost.append(soup.select_one('div.post-body').get_text('|', strip=True))
    ## save to CSV
    df=pd.DataFrame(list(zip(*
        [listpostID,listtitles,listpost,listreplies,listauthorID,listauthorName]))).add_prefix('Col')
    df.to_csv('out1.csv',index=False)
    print(df)
    sleep(2)
For most websites, you first have to obtain a token by logging in. Most of the time, this is a cookie. You then send that cookie along with every request that needs authorization. Open the network tab in your browser's developer tools and log in with your username and password: you'll see how the login request is formatted and where it is sent. From there, try to replicate it in your code.
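As a rough sketch of that idea: log in once through a session, then make every later request through the same session object instead of calling `requests.get` directly, so that the login cookie is sent along. Everything site-specific here is a placeholder: `fetch_posts` is a hypothetical helper, and the login URL, the credential field names, and whether the login expects form data at all are assumptions that have to be checked in the network tab. `session` is assumed to be a `requests.Session()`.

```python
def fetch_posts(session, login_url, credentials, posts_url, pages):
    """Log in once, then page through a JSON endpoint with the same session."""
    # The session keeps whatever cookies the login response sets.
    # Assumption: the site accepts the credentials as ordinary form data.
    login = session.post(login_url, data=credentials)
    login.raise_for_status()

    posts = []
    for page in range(1, pages + 1):
        # Same session object, so the cookies from the login are sent along.
        response = session.get(posts_url, params={"pageNumber": page})
        response.raise_for_status()
        posts.extend(response.json())
    return posts
```

With `requests` installed, `session` would be `requests.Session()`, and `posts_url` would be the endpoint without its `?pageNumber=` query string.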

CSRF Token Missing When Posting Request To DVWA Using Python Requests Library

I'm trying to make a program that will allow me to submit a username and password to a website. For this, I am using DVWA (Damn Vulnerable Web Application), which is running on localhost:8080.
But whenever I try to send a POST request, it always returns an error:
csrf token is incorrect
Here's my code:
import requests
url = 'http://192.168.43.1:8080/login.php'
data_dict = {"username": "admin", "password": "password", "Login": "Login"}
response = requests.post(url, data_dict)
print(response.text)
You need to make a GET request to that URL first and parse the correct "CSRF" value from the response (in this case user_token). In the response HTML, you can find the hidden value:
<input type="hidden" name="user_token" value="28e01134ddf00ec2ea4ce48bcaf0fc55">
Also, it seems that you need to send the cookies from the first GET request with the following request. This happens automatically if you use a requests.Session() object. You can see the cookies with, for example, print(resp.cookies) on the first response.
Here is the modified code. I'm using the BeautifulSoup library to parse the HTML: it finds the correct input field and gets the value from it.
The POST request afterwards sends this value in the user_token parameter.
from bs4 import BeautifulSoup
import requests
with requests.Session() as s:
    url = 'http://192.168.43.1:8080/login.php'
    resp = s.get(url)
    parsed_html = BeautifulSoup(resp.content, features="html.parser")
    input_value = parsed_html.body.find('input', attrs={'name':'user_token'}).get("value")
    data_dict = {"username": "admin", "password": "password", "Login": "Login", "user_token":input_value}
    response = s.post(url, data_dict)
    print(response.content)

Python web scraping login

I am trying to log in to a website using Python.
The login URL is:
https://login.flash.co.za/apex/f?p=pwfone:login
and the 'form action' URL is shown as:
https://login.flash.co.za/apex/wwv_flow.accept
When I use 'inspect element' in Chrome while logging in manually, these are the form fields that get posted (p_t02 = password).
There are a few hidden items that I'm not sure how to add to the Python code below.
When I use this code, the login page is returned:
import requests
url = 'https://login.flash.co.za/apex/wwv_flow.accept'
values = {'p_flow_id': '1500',
          'p_flow_step_id': '101',
          'p_page_submission_id': '3169092211412',
          'p_request': 'LOGIN',
          'p_t01': 'solar',
          'p_t02': 'password',
          'p_checksum': ''
          }
r = requests.post(url, data=values)
print(r.content)
How can I adjust this code to perform a login?
This is more or less what your script should look like. Use a session to handle the cookies automatically. Fill in the username and password fields manually.
import requests
from bs4 import BeautifulSoup
logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'
with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"lxml")
    # Copy the hidden values from the freshly loaded login page
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'username',
        # repeated key: a dict keeps only one 'p_arg_names' entry
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    r = s.post(posturl, data=values)
    print(r.content)
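Rather than naming each hidden field by hand, the fields can also be collected generically. The following is a minimal sketch with several assumptions: `FormFieldCollector` and `login_fields` are hypothetical helpers, the standard-library `html.parser` is used in place of BeautifulSoup, and the page is assumed to contain only the login form. The result is a list of `(name, value)` pairs rather than a dict, so a field name that occurs more than once in the form is kept each time; `requests` accepts such a list for `data=`.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (name, value) pairs for every named <input> in a page.

    Limitation of this sketch: it does not skip unchecked checkboxes,
    radio buttons or submit buttons the way a browser would.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("name"):
            self.pairs.append((attrs["name"], attrs.get("value") or ""))


def login_fields(html, overrides):
    """Return the page's input fields in order, with some values replaced.

    `overrides` maps field names to the values to send instead, e.g. the
    username and password. Names missing from the page are appended.
    """
    collector = FormFieldCollector()
    collector.feed(html)

    seen = {name for name, _ in collector.pairs}
    fields = [(name, overrides.get(name, value)) for name, value in collector.pairs]
    fields += [(name, value) for name, value in overrides.items() if name not in seen]
    return fields
```

In the script above this would replace the hand-written dict, along the lines of `s.post(posturl, data=login_fields(res.text, {'p_t01': 'username', 'p_t02': 'password', 'p_request': 'LOGIN'}))`.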
Since I cannot recreate your case, I can't tell you exactly what to change. When I did this kind of thing, I used Postman to intercept all the requests my browser sends. I'd install it along with its browser extension and then perform the login. You can then view the request, and the response it received, in Postman. It can also generate Python code for the request, which you can copy and adapt.
In short: use Postman, perform the login, and clone the request.

Passing CSRF token

This doesn't get past the login screen. I don't think I am passing in the CSRF token correctly. How should I do it?
from bs4 import BeautifulSoup
import requests
url = 'https://app.greenhouse.io/people/new?hiring_plan_id=24047'
cookies = {'_session_id':'my_session_id'}
client = requests.session()
soup = BeautifulSoup(client.get(url, cookies=cookies).content)
csrf_metatags = soup.find_all('meta',attrs={'name':'csrf-token'})[0].get('content')
posting_data = dict(person_first_name='Morgan') ## this is what I want to post to the form
headers = dict(Referer=url, csrf_token=csrf_metatags)
r = client.post(url, data=posting_data, headers=headers)
Thanks!
If you inspect the page source, you'll find that the form carries a hidden input like this:
<input name="authenticity_token" type="hidden"
value="2auOlN425EcdnmmoXmd5HFCt4PkEOhq0gpjOCzxNKns=" />
You can catch this value with:
csrf_data = soup.find("input", {"name": "authenticity_token"}).get("value")
Now re-attach the value to the posting data, as you did with person_first_name:
posting_data = dict(person_first_name='Morgan',
authenticity_token=csrf_data)
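If the token is sent as a header instead, the header name matters. The `authenticity_token` field and the `csrf-token` meta tag follow the Rails convention, and Rails reads the token from a header named `X-CSRF-Token`, not `csrf_token` as in the question. That this site behaves like a standard Rails application is an assumption, and `csrf_headers` is a hypothetical helper that uses the standard-library `html.parser` in place of BeautifulSoup.

```python
from html.parser import HTMLParser


class CsrfMetaParser(HTMLParser):
    """Remember the content of the first <meta name="csrf-token" ...> tag."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "csrf-token" and self.token is None:
            self.token = attrs.get("content")


def csrf_headers(html, referer):
    """Build request headers that carry the page's CSRF token."""
    parser = CsrfMetaParser()
    parser.feed(html)
    if parser.token is None:
        raise ValueError("no csrf-token meta tag found")
    # Assumption: the server accepts the Rails-style X-CSRF-Token header.
    return {"Referer": referer, "X-CSRF-Token": parser.token}
```

The returned dict would be passed as `headers=` to `client.post`, with the form values still going in `data=`.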
