How to bypass 'headless' reCaptcha V2? - python

I'm creating a bot using requests, BeautifulSoup, and possibly Twill. The bot will scrape a large number of forums and gather data from them. However, the current forum I am working on (https://wearedevs.net/) uses reCaptcha V2 on its login page, so the bot cannot log in. I discovered this after trying to log in through code: instead of returning a valid response and reloading the page, the site would continuously give me a 404 error. I thought it was an error in my code, but even when trying Twill it still didn't log in.
I need to be able to log in through the site so I can access features that guest users wouldn't be able to access.
I knew the site had reCaptcha, so I looked into a reCaptcha bypass. The issue is that it's not the visual reCaptcha but the "headless" version (the badge in the bottom right corner of the page). In other words, it's the reCaptcha that doesn't give you a captcha prompt but instead analyzes your behavior on the site and determines whether or not you're a bot.
I suspected that the 404 occurred because the reCaptcha determined that the requests came from a bot. So the second thing I attempted was sending a direct POST request from the code to the site's API, which is here:
https://wearedevs.net/api/v1/account/login
Along with the required JSON data, which is in this format:
{"g-recaptcha-response":"recaptcha-response-here", "username": "example_username", "password": "example_password", "token2fa": ""}
I didn't have a valid reCaptcha response to send to the server, so I tried excluding that field from the JSON data. While the request was successful, the server sent back an error saying that the login failed because a reCaptcha response was not present.
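The direct-API attempt described above can be sketched with requests. This is a minimal sketch, assuming the payload keys shown in the question are complete; without a real reCAPTCHA token the server rejects the login, which is exactly the problem.

```python
import requests

def build_login_payload(username, password, recaptcha_token="", token2fa=""):
    """Assemble the JSON body in the shape the site's login API expects."""
    return {
        "g-recaptcha-response": recaptcha_token,
        "username": username,
        "password": password,
        "token2fa": token2fa,
    }

def attempt_login(username, password, recaptcha_token=""):
    """POST the payload to the login endpoint and return the response.

    With an empty recaptcha_token the server answers with a
    "reCaptcha response not present" error, as described above.
    """
    return requests.post(
        "https://wearedevs.net/api/v1/account/login",
        json=build_login_payload(username, password, recaptcha_token),
        timeout=10,
    )
```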
So then I tried fetching the login page and using BeautifulSoup to grab the reCaptcha response, intending to include that in the JSON data to be sent, but I was unable to extract a reCaptcha response that way.
I have tried Selenium, but I'm currently working in an environment in which a browser is not present, so Selenium won't work and therefore is not an option.
If anyone has any way to bypass, or validate, the headless reCaptcha V2, please share; I would be grateful. Thanks!

Related

Unable to find CSRF token

I am attempting to log in to this website (https://isf.scout7.com/Apps/Login) to then scrape some data using Python and the requests library.
In the past I have followed the instructions in Step 1 on this website (http://kazuar.github.io/scraping-tutorial/) which has always worked well for me.
I believe that to input the username and password, I should use login_form.login_model.username and login_form.login_model.password respectively. However, with the website I'm trying to sign in to, I have been unable to find the CSRF token needed to log in. I have gone through the HTML by inspecting the page in Chrome, but I can't find anything that resembles a CSRF token.
Am I completely missing it, or do I not need it to log in?
I entered some values into the login and password fields, then used my browser's developer tools to examine the HTTP request that is sent when clicking the Login button.
No CSRF token is sent in that request. So I guess you can just POST login=<login>&password=<password>&grant_type=password (and maybe some other values/headers from my request) to https://api.scout7.com//token, and you will get an OAuth token in response.
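A minimal sketch of that suggestion, assuming the captured field names (login, password, grant_type) are the only ones required; in practice extra headers from the captured request may also be needed:

```python
import requests

TOKEN_URL = "https://api.scout7.com//token"  # double slash as captured

def build_token_request(login, password):
    """Form fields observed in the captured login request; no CSRF token."""
    return {"login": login, "password": password, "grant_type": "password"}

def get_oauth_token(login, password):
    """POST the form fields and return the parsed OAuth token response."""
    resp = requests.post(TOKEN_URL, data=build_token_request(login, password),
                         timeout=10)
    resp.raise_for_status()
    return resp.json()
```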

Web login using python3 requests

I am trying to scrape a news article. I try to log in to the website with Python so that I can have full access to the whole web page, but I have looked at so many tutorials and still fail.
Here is the code. Can anyone tell me why?
The code runs without errors, but I still cannot see the full text, which means I am still not logged in.
```python
import requests

url = 'https://id.wsj.com/access/pages/wsj/us/signin.html?mg=id-wsj&mg=id-wsj'
payload = {'username': 'my_user_name',
           'password': '******'}
session = requests.Session()
session.get(url)
response = session.post(url, data=payload)
print(response.cookies)
r = requests.get('https://www.wsj.com/articles/companies-push-to-repeal-amt-after-senates-last-minute-move-to-keep-it-alive-1512435711')
print(r.text)
```
Try sending your last GET request through the session object. After all, it's the one that performed the login and holds the cookies (if there are any). You used a fresh requests call for your last request, thus ignoring the login you had just made.
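A minimal correction of the snippet from the question, keeping its URLs and field names; note that the real WSJ login flow may require additional hidden form fields in practice:

```python
import requests

def fetch_article_logged_in(username, password):
    """Log in and fetch the article on the SAME session, so the login
    cookies travel with the article request."""
    url = 'https://id.wsj.com/access/pages/wsj/us/signin.html?mg=id-wsj&mg=id-wsj'
    session = requests.Session()
    session.get(url)                                   # pick up initial cookies
    session.post(url, data={'username': username,
                            'password': password})     # log in on this session
    # Use the session here -- a fresh requests.get() carries no login
    # cookies, which was the bug in the original snippet.
    return session.get('https://www.wsj.com/articles/companies-push-to-repeal-amt-after-senates-last-minute-move-to-keep-it-alive-1512435711')
```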

Using python to parse a webpage that is already open

From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page when I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is to log in normally with the developer tools open next to you and watch what request is being sent.
When logging in to bandcamp, for example, you can see the XHR request that is sent on login.
From the response to that request you can see that an identity cookie is being set. That's probably how they identify that you are logged in. So once you have that cookie set, you are authorized to view logged-in pages.
So in your program you could log in normally using requests, save the cookie in a variable, and then attach that cookie to further requests.
Of course login procedures and how this authorization mechanism works may differ, but that's the general gist of it.
So when do you actually need Selenium? You need it if a lot of the page is rendered by JavaScript. requests can only fetch the HTML, so if the menus and such are rendered with JavaScript, you won't ever be able to see that information using requests.
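The cookie-reuse pattern described above can be sketched as follows. This is a generic sketch: the URL and field names are placeholders, not bandcamp's real endpoint, and the actual login form fields must be taken from the captured request.

```python
import requests

def logged_in_session(login_url, username, password):
    """Log in once and return a Session that carries the identity cookie.

    Any Set-Cookie headers from the login response (e.g. an "identity"
    cookie) are stored in session.cookies and replayed automatically on
    every further request made through this session -- there is no need
    to copy cookies around by hand.
    """
    session = requests.Session()
    session.post(login_url, data={"username": username, "password": password})
    return session
```

With a Session in hand, `session.get("https://example.com/members-only")` would then arrive with the login cookie attached.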

Downloading URL To file... Not returning JSON data but Login HTML instead

I am writing a web scraping application. When I enter the URL directly into a browser, it displays the JSON data I want.
However, if I use Python's requests library, or URLDownloadToFile in C++, it simply downloads the HTML for the login page.
The site I am trying to scrape (DraftKings.com) requires a login; the other sites I scrape from don't.
I am 100% sure this is related, since if I paste the URL while I am logged out, I get the login page rather than the JSON data. Once I log in, if I paste the URL again, I get the JSON data again.
The thing is that even if I remain logged in in the browser and then use the Python script or C++ app to download the JSON data, it still downloads the login HTML.
Anyone know how I can fix this issue?
Please don't ask us to help with an activity that violates the terms of service of the site you are trying to (ab-)use:
Using automated means (including but not limited to harvesting bots, robots, parser, spiders or screen scrapers) to obtain, collect or access any information on the Website or of any User for any purpose.
Even if that kind of usage were allowed, the answer would be boring:
You'd need to implement the login functionality in your scraper.

Login to site programatically using python

I am wondering how to log in to a specific site, but I have had no luck so far.
The way it happens in a browser is that you click on a button, which triggers a jQuery AJAX request to /ajax/authorize_ajax.html with the POST variables login and pass. When it returns result = true, the document reloads and you are logged in.
When I go to /ajax/authorize_ajax.html in my browser, it gives me {"data": [{"result":false}]} in response. Using C#, I went to this address and posted login and pass, and it gave me {"data": [{"result":true}]} in response. However, when I then go back to the main folder of the website, I'm not logged in.
Can anyone help me solve this problem? I think the cookies are set via JavaScript; is it even possible in that case? I did some research and all I could do is this; please help me get around this problem. I used urllib in Python and the web libraries in .NET.
EDIT 0
It is setting a cookie in the response headers: SID, PATH & DOMAIN.
Example: sid=bf32b9ff0dfd24059665bf1d767ad401; path=/; domain=site
I don't know how to save this cookie and go back to / using it. I've never done anything like this before; can someone give me an example using Python?
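A sketch of what EDIT 0 asks for, using the example sid value from the question ("site" stands in for the real domain). A requests Session does this bookkeeping automatically after the login POST; the cookie jar below shows the manual equivalent of saving the cookie and sending it back when requesting /.

```python
import requests

# Store the sid cookie exactly as it appeared in the Set-Cookie header.
jar = requests.cookies.RequestsCookieJar()
jar.set("sid", "bf32b9ff0dfd24059665bf1d767ad401", domain="site", path="/")

def fetch_logged_in(base_url, cookie_jar):
    """GET the site root with the saved sid cookie attached."""
    return requests.get(base_url + "/", cookies=cookie_jar)
```

Alternatively, `session = requests.Session(); session.post(".../ajax/authorize_ajax.html", data={"login": ..., "pass": ...})` stores the sid cookie itself, and `session.get(".../")` replays it without any manual handling.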
EDIT 1
All done, thanks to this post - How to use Python to login to a webpage and retrieve cookies for later usage?
Here's a blog post I wrote a while ago about using an HttpWebRequest to post to a site when cookies are involved:
http://crazorsharp.blogspot.com/2009/06/c-html-screen-scraping-part-2.html
The idea is that when you get a Response using the HttpWebRequest, you can access the Cookies that are sent down. For every subsequent request, you can new up a CookieContainer on the request object and add the cookies you got into that container.