Unable to find CSRF token - python

I am attempting to log in to this website (https://isf.scout7.com/Apps/Login) to then scrape some data using Python and the requests library.
In the past I have followed the instructions in Step 1 on this website (http://kazuar.github.io/scraping-tutorial/) which has always worked well for me.
I believe the username and password should be submitted as login_form.login_model.username and login_form.login_model.password respectively. However, with the website I'm trying to sign in to, I have been unable to find the CSRF token needed to log in. I have gone through the HTML by inspecting the page in Chrome, but I can't find anything that resembles a CSRF token.
Am I completely missing it, or do I not need it to log in?

I entered some values into the login and password fields, then used my browser's developer tools to examine the HTTP request that is sent when clicking the Login button (captured request not reproduced here).
As you can see, no CSRF token is sent. So I guess you can just post login=<login>&password=<password>&grant_type=password (and perhaps some other values/headers from that request) to https://api.scout7.com//token and you will get an OAuth token in response.
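A minimal sketch of that with the requests library, assuming the field names login, password, and grant_type from the captured request are sufficient (the endpoint may also expect extra headers copied from the browser request):

    import requests

    # Form fields taken from the captured login request; the credentials
    # here are placeholders.
    payload = {
        "login": "your_username",
        "password": "your_password",
        "grant_type": "password",
    }

    # No CSRF token is needed; the token endpoint takes the form-encoded body.
    response = requests.post("https://api.scout7.com//token", data=payload)
    response.raise_for_status()

    # On success the response should contain an OAuth token as JSON.
    print(response.json())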

Related

Is there any way to scrape data if DUO authentication is required for a website?

I am trying to get the job listings from the Workday portal of my university. The portal URL is https://www.myworkday.com/northeastern/d/home.htmld. When I open this URL in the browser, if the cookies are right, it only asks me for my login data and bypasses the DUO auth. But when I send a POST request to this website with my data payload, it gives me a NoSuchExecutionFlow error. I am assuming it needs the DUO auth, and that's probably why it doesn't allow me to go further. Is there any way to make this work?
I tried using ParseHub, and it gives the same error as well. I also tried getting the cookies through the browser and passing those in, but it lands on the same page every time. It's supposed to go to this website, redirect to Northeastern for authentication, then come back to Workday for the portal.
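If the browser bypasses DUO because of its cookies, it is worth making sure those exact cookies are actually being sent with the request. A minimal sketch, assuming you copy the relevant cookie names and values out of the browser's developer tools (the cookie name here is a placeholder):

    import requests

    session = requests.Session()

    # Hypothetical cookie copied from an already-authenticated browser session.
    session.cookies.set("SESSION", "value-from-browser", domain="www.myworkday.com")

    # Reuse the browser's session instead of re-running the login flow.
    response = session.get("https://www.myworkday.com/northeastern/d/home.htmld")
    print(response.status_code, response.url)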

How to bypass 'headless' reCaptcha V2?

I'm creating a bot using requests, BeautifulSoup, and possibly Twill. The bot will scrape a large number of forums and gather data from them. However, the current forum I am working on (https://wearedevs.net/) uses reCaptcha V2 on its login page, so the bot cannot log in. I discovered this when, after trying to log in through code, instead of getting a valid response and the page reloading, I would continuously get a 404 error. I thought it was an error in my code, but even when trying Twill it still didn't log in.
I need to be able to log in through the site so I can access features that guest users wouldn't be able to access.
I knew the site had reCaptcha, so I looked into a reCaptcha bypass. The issue is that it's not the visual reCaptcha but the "headless" version, whose badge sits in the bottom-right corner of the page (screenshot omitted).
In other words, it's the reCaptcha that doesn't give you a captcha prompt but instead analyzes your behavior on the site and determines if you're a bot or not.
I suspected that the 404 was caused by the reCaptcha determining that the requests came from a bot. So the second thing I attempted was sending a direct POST request from the code to the site's API, which is here:
https://wearedevs.net/api/v1/account/login
Along with the required JSON data, which is in this format:
{"g-recaptcha-response":"recaptcha-response-here", "username": "example_username", "password": "example_password", "token2fa": ""}
I didn't have a valid reCaptcha response to send to the server, so I tried excluding that from the JSON data but, while the request was successful, the server sent back an error saying that the login failed because a reCaptcha response was not present.
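For reference, that attempt looks roughly like this with requests (the credentials are placeholders; as described above, leaving g-recaptcha-response empty makes the server reject the login):

    import requests

    # Placeholder credentials; g-recaptcha-response is left empty here,
    # which the server rejects as a missing reCaptcha response.
    payload = {
        "g-recaptcha-response": "",
        "username": "example_username",
        "password": "example_password",
        "token2fa": "",
    }

    response = requests.post(
        "https://wearedevs.net/api/v1/account/login",
        json=payload,  # the endpoint expects a JSON body
    )
    print(response.status_code, response.text)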
So then I tried fetching the login page and parsing it with BeautifulSoup to grab the reCaptcha response and include it in the JSON data to be sent, but I was unable to find a reCaptcha response anywhere in the page.
I have tried Selenium, but I'm currently working in an environment in which a browser is not present, so Selenium won't work and therefore is not an option.
If anyone has any ways to bypass, or validate, the headless reCaptcha V2, please share and I would be grateful. Thanks!

Getting CSRF Token from Login Page when it is not provided in the Cookies

So, I understand sometimes you can get a CSRF token from a cookie. Using the Python Requests module, I could do client.get and something like value = client.cookies['csrftoken'].
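A minimal sketch of that cookie approach, with a placeholder login URL and assuming the cookie is actually named csrftoken:

    import requests

    client = requests.Session()

    # Fetch the login page so the server sets its cookies on this session.
    client.get("https://example.com/login")  # placeholder URL

    # If the token is delivered as a cookie, it is now in the session's cookie jar.
    csrf_token = client.cookies.get("csrftoken")
    print(csrf_token)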
What other ways are there for a login page to generate a CSRF token? Is it possible for the browser itself to do it?
How would I get that token?
Could I use Selenium to run whatever process generates the token?

Python - reading the returnURL

I'm using Python without a server to deploy to. I'm trying to test the accept-payment flow for PayPal.
In the code, after sending a POST request, I store the result in a local file and then open this file using webbrowser.
However, I suspect that I am missing something here, since once I log in as a user I am not automatically redirected to authorize the transaction; I just end up logged in.
Now I suspect this is because once I hit the Payment API endpoint, it redirects me to the login API with some parameters:
http://www.paypal.com?hypotheticalredirecturl=etc
I am capturing the response, i.e. the HTML of the PayPal login page. However, the whole URL cannot be captured, so the hypotheticalredirecturl=etc part is lost, and I think this is what stops the flow from reaching the authorization page after I log in as a user.
I think if I appended the "hypotheticalredirect" part to my webpage after opening it using webbrowser, I might be able to make the flow work normally.
Does anyone know of a way to capture the URL of the response?
I tried looking in the page itself, but I don't think it's there.
Any help will be appreciated.
Thanks,
Ashwin
EDIT: I'm using urllib and urllib2. Should I be looking at httplib?
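If you stay with urllib2, note that the response object exposes the final URL after any redirects via geturl(), which may be enough to recover the redirect parameters. A minimal sketch, with a placeholder endpoint:

    import urllib2

    # Hypothetical endpoint; urllib2 follows redirects automatically.
    response = urllib2.urlopen("https://www.example.com/payment")

    # geturl() returns the URL that was actually fetched, after redirects,
    # including any query parameters appended along the way.
    print(response.geturl())

(In the Requests library the equivalent is response.url, with the redirect chain available in response.history.)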

Python urllib2 accesses page without sending authentication details

I was reading the urllib2 tutorial, which mentions that in order to access a page requiring authentication (e.g. a valid username and password), the server first responds with an HTTP 401 error code, and the (Python) client then resends the request with authentication details.
Now, the problem in my case is that there exist two different versions of a webpage, one that can be accessed without supplying any authentication details and one that is quite different when authentication details are supplied (i.e. when the user is logged in the system). As an example think about url www.gmail.com, when you are not logged in you get a log-in page, but if your browser remembers you from your last login then the result is your email account homepage with your inbox displayed.
I followed all the steps to set up a handler for authentication and install an opener. However, every time I request the page, I get back the version of the webpage that does not have the user logged in.
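For reference, the standard urllib2 authentication setup looks something like the sketch below (URL and credentials are placeholders). One thing to note: this handler only answers HTTP basic-auth 401 challenges; sites like Gmail use a form-and-cookie login instead, so the handler is never triggered for them.

    import urllib2

    # Hypothetical URL and credentials.
    url = "https://www.example.com/"
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, "my_username", "my_password")

    # Build and install an opener that responds to HTTP 401 challenges.
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    opener = urllib2.build_opener(handler)
    urllib2.install_opener(opener)

    page = urllib2.urlopen(url).read()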
How can I access the other version of webpage that has the user logged-in?
Requests makes this easy. As its creators say:
Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken.
Try using Mechanize. It has cookie handling features that would allow your program to be "logged in" even though it's not a real person.
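With Requests, the usual pattern is a Session object that posts the login form once and then carries the session cookies on every subsequent request. A minimal sketch, with placeholder URL and form field names (substitute whatever the site's login form actually uses):

    import requests

    session = requests.Session()

    # Hypothetical login endpoint and form field names.
    login_data = {"username": "my_username", "password": "my_password"}
    session.post("https://www.example.com/login", data=login_data)

    # The session now holds the login cookies, so this request
    # should return the logged-in version of the page.
    page = session.get("https://www.example.com/account")
    print(page.text)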
