How do I bypass the reCAPTCHA using Python?

I'm trying to log into a website using Python.
I'm using the Requests module and Beautiful Soup so far.
I was able to get the "csrfmiddlewaretoken" with the following code:
get = s.get(url, headers=headers)
soup = BeautifulSoup(get.text, "lxml")
csrf = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]
data["csrfmiddlewaretoken"] = csrf
Now I thought I could do the same with the "g-recaptcha-response" with the following code:
recaptcha = soup.find("input", {"id": "recaptcha-token"})["value"]
data["g-recaptcha-response"] = recaptcha
That of course doesn't work.
When I inspect the login page in my browser (Firefox Developer Tools), I can find the value I'm looking for.
It only appears once I click the "I am not a robot" captcha.
Now I need some ideas on how to integrate that value into my data dictionary so I can log in successfully.
Many thanks in advance.

Related

Python Web Crawler No Results

I am making a basic Web Crawler/Spider with Python. I am trying to crawl through a YouTube channel and print all the titles of the videos on it but it never returns anything.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/c/DanTDM/videos'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
x = soup.select(".yt-simple-endpoint style-scope ytd-grid-video-renderer")
print(x)
And the output is always: []. An empty list (which means it didn't find anything). I need to know what I'm doing wrong.
The code itself seems correct.
Call print(response.text) and check whether YouTube is returning you a blocking page.
Anti-scraping measures may be in action, such as checking your user agent.
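A quick, minimal way to check this, assuming nothing beyond requests itself (the User-Agent string below is only an illustrative example, not a required value):
import requests

url = 'https://www.youtube.com/c/DanTDM/videos'
# Send a browser-like User-Agent so a simple user-agent check doesn't flag the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
# Inspect the raw HTML: if it is a consent/blocking page instead of the channel
# markup, the CSS selector can never match anything.
print(response.status_code)
print(response.text[:1000])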
Browser Automation with Selenium
When I send a request to YouTube, I receive a 'Before you continue to YouTube' consent page instead of the channel.
So we should use Selenium instead, because we need to click one of the consent buttons, and I don't think we can interact with the page that way using the requests module.
Selenium lets you control a real browser. Read the documentation!
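A rough, untested sketch of that route; the consent-button XPath and the video-title locator are assumptions about YouTube's current markup and may need adjusting:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/c/DanTDM/videos')
wait = WebDriverWait(driver, 15)

# If the "Before you continue to YouTube" consent page appears, click through it.
# The button locator is an assumption and can differ by region and language.
try:
    consent = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//button[.//span[contains(text(), 'Accept')]]")))
    consent.click()
except Exception:
    pass  # no consent page was shown

# Video links on a channel's /videos tab typically carry the id "video-title".
titles = wait.until(EC.presence_of_all_elements_located((By.ID, "video-title")))
for t in titles:
    print(t.get_attribute("title") or t.text)

driver.quit()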

How do I fix getting "None" as a response when web scraping?

So I am trying to write a small script that gets the view count from a YouTube video and prints it. However, with this code, printing the text variable just gives me "None". Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scopeytd-video-view-count-renderer"})
print(text)
To see why, use wget or curl to fetch a copy of that page and look at it, or use "view source" in your browser. That's what requests sees. None of those classes appear in the HTML you get back, which is why you get None: because there are none.
YouTube builds all of its pages dynamically, through JavaScript. requests doesn't interpret JavaScript. If you need to do this, you'll need something like Selenium to run a real browser with a JavaScript interpreter built in.
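A minimal sketch of the Selenium route; the span.view-count selector mirrors the class from the question and is an assumption about the rendered markup (a consent page may also appear first):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

# Wait until the dynamically rendered view-count element exists before reading it.
view_count = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.view-count")))
print(view_count.text)

driver.quit()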

PYTHON 3 - How to web scrape a password protected website?

I'm trying to access a website at work, but it's username/password protected. The user/password pop-up looks like the one in the attached login image.
I attach my code for viewing the website.
I can see the HTML code, however I get a "401 Authorization Required" error.
Can you please help?
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("http://10.75.19.101/mfgindex", auth=('root', 'password'))
# Convert to beautiful soup object
soup = bs(r.content, features="html.parser")
# print
print(soup.prettify())
Generally, if a site is password-protected, you obviously can't bypass the login procedure. That forces you toward an RPA-style approach where your code controls a web browser, performs the login with real credentials, then browses the pages you need and extracts the required elements from the HTML with BeautifulSoup.
For that I suggest trying out Selenium (https://www.selenium.dev/).
A short tutorial is here:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
I used it to solve a similar task some time ago and it worked well.
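A rough sketch of that flow, assuming the site exposes an ordinary HTML login form; the login URL and the field/button locators here are placeholders, not the real ones for this site. (If the pop-up is a browser-level HTTP auth dialog rather than an HTML form, Selenium cannot type into it this way.)
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://10.75.19.101/login")  # placeholder login URL

# Fill in the login form; these element names are hypothetical and must be
# adapted to the actual form on the site.
driver.find_element(By.NAME, "username").send_keys("root")
driver.find_element(By.NAME, "password").send_keys("password")
driver.find_element(By.XPATH, "//button[@type='submit']").click()

# Once logged in, open the page you need and hand the rendered HTML to BeautifulSoup.
driver.get("http://10.75.19.101/mfgindex")
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())

driver.quit()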

How to login to discord with beautifulsoup and requests

I am trying to use Beautiful Soup to send messages to somebody every so often. This cannot be a bot, as it has to work in DMs. However, in order to get to the page to DM somebody, I need to log in first. I have looked into HTTP POST, but I can't find the form data to use with the post method. How could I, with BeautifulSoup, get past this first login page? Here is my code so far:
import requests
import lxml
from bs4 import BeautifulSoup
website = requests.get("https://discord.com/channels/#me/727172129799405609")
src = website.content
soup = BeautifulSoup(src, "lxml")
print(soup.prettify())
When I receive the prettified code for this page, I can't see the form inputs anywhere. Is this normal? I am just assuming that I am on the Discord login page, but this may not be the case, and that may be the reason I can't find form data to submit.

Why can't I scrape using beautiful soup?

I need to scrape the only table from this website: https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu
I used Beautiful Soup and requests but wasn't successful. Can you suggest where I am going wrong?
import requests
import bs4
import pandas as pd

mandal_url = "https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu"
r = requests.get(mandal_url, verify=False).content
soup = bs4.BeautifulSoup(r, 'lxml')
df = pd.read_html(str(soup.find('table', {"id": "gvAgricultureVillage"})))
I am getting 'Page not found' in the data frame. I don't know where I am going wrong!
The page likely requires some sort of sign-in. Viewing it myself by clicking on the link, I also get a 'Page not found' response.
You would need to add the cookies / some other headers to the request to appear "signed in".
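A minimal sketch of what that looks like; the cookie name and value are placeholders you would copy from a signed-in browser session via the developer tools, not real values:
import requests

url = ("https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/"
       "eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu")

# The header values below are purely illustrative placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Cookie": "ASP.NET_SessionId=<your-session-id>",
}
r = requests.get(url, headers=headers, verify=False)
print(r.status_code)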
The link you're trying to scrape from may simply be an invalid link. When I click the link you provide, or the link you store in mandal_url, both return a 'Page not found' page. So you are scraping in the correct way, but the URL you pass to the scraper is invalid or no longer up.
I couldn't get access to the website, but you can read the table(s) on a webpage directly by using:
dfs = pd.read_html(your_url, header=0)
In case the URL requires authentication, you can get the table with:
r = requests.get(url_needing_authentication, auth=('myuser', 'mypasswd'))
pd.read_html(r.text, header=0)[1]
This will simplify your code. Hope it helps!
