Is it possible to receive the status code of a URL with headless Chrome (python)?
As stated in this answer, it's not possible using Python and Selenium (I assume you are using Selenium?).
A working alternative to Selenium and headless Chrome is requests. With requests, an example would look like this:
import requests
response = requests.get('https://api.github.com')
print(response.status_code)
Related
I am new to Python and web scraping, and I'm trying to scrape a website that uses JavaScript. I have managed to automate the login sequence via Selenium, but when I try to send the API call to get the data, I am not able to get anything. I'm assuming it's because the API call requires some sort of authentication. How can I get past this?
Here's my code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
import pandas as pd
import requests
import json
username = 'xxx'
password = 'xxx'
url = 'https://www.example.com/login'
#log in
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
driver.find_element(By.XPATH, '//*[@id="username"]').send_keys(username)
driver.find_element(By.XPATH, '//*[@id="password"]').send_keys(password)
driver.find_element(By.XPATH, '//*[@id="login_button"]').click()
# go to User Lines
driver.get('http://www.example.com/lines')
time.sleep(5)
response = requests.request("GET", url, headers=headers, data=payload)
subs = json.loads(response.text)
print(subs)
Every time an HTTP request is made, some metadata is included: the header data, cookies, and perhaps other session data. It has to be sent every time, because that's the only way to maintain a 'session'.
If you log in with Selenium, the browser is managing your session there. Making a request with the Python requests library has nothing to do with Selenium, and most likely the authentication you're missing is exactly what logging in via Selenium provides.
So you have a few options:
1. Make the API call using Selenium. After logging in, just get() the API URL; the page source should be the data wrapped in a tag.
2. Log in using the requests library. Instead of using Selenium, you can use requests exclusively. This can be tedious; you'll have to inspect the network calls using the devtools and piece together what you would need to replicate with requests to simulate the login that happens in the browser. You would also need a persistent session: use requests.Session() to create a session instance and make your requests through that object instead of through the requests library directly. Once you do, you can make the API request as you were. This method also has the fastest runtime, since you're not rendering a whole browser, running the JavaScript within it, and making all the network requests that entails.
3. Pass the session data from Selenium to your requests session instance. I haven't tried doing this, but since session data is just passed along in the headers as strings, you can probably find a way to get the cookies from Selenium and add them to your requests session instance to make your API call without Selenium; see the sketch below.
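A minimal sketch of option 3, assuming the Selenium login from the snippet above has already completed; the API endpoint URL here is a placeholder, so substitute the real one you found in the devtools:
import requests

# Copy every cookie from the logged-in Selenium browser into a requests session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])

# Call the API through that session; the cookies carry the authentication.
api_url = 'https://www.example.com/api/lines'  # placeholder endpoint
response = session.get(api_url)
print(response.json())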
Context
I am currently attempting to build a small-scale bot using the Selenium and Requests modules in Python.
However, the webpage I want to interact with is running behind Cloudflare.
My Python script is running over Tor using the stem module.
My traffic analysis is based on Firefox's Developer Tools -> Network tab with "Persist Logs" enabled.
My findings so far:
Selenium's Firefox webdriver can often access the webpage without going through the "checking your browser" page (status code 503) or the captcha page (status code 403).
A Requests session object with the same user agent always ends up on the captcha page (status code 403).
If Cloudflare were checking my JavaScript functionality, shouldn't my requests session also return 503?
Code Example
driver = webdriver.Firefox(firefox_profile=fp, options=fOptions)
driver.get("https://www.cloudflare.com") # usually returns code 200 without verifying the browser
session = requests.Session()
# ... applied socks5 proxy for both http and https ... #
session.headers.update({"user-agent": driver.execute_script("return navigator.userAgent;")})
page = session.get("https://www.cloudflare.com")
print(page.status_code) # return code 403
print(page.text) # returns "captcha page"
Both the Selenium and Requests modules are using the same user agent and IP.
Both are using GET without any parameters.
How does Cloudflare distinguish between these two kinds of traffic?
Am I missing something?
I tried to transfer the cookies from the webdriver to the requests session to see if a bypass is possible, but had no luck.
Here is the code I used:
for c in driver.get_cookies():
    session.cookies.set(c['name'], c['value'], domain=c['domain'])
There are additional JavaScript APIs exposed to the webpage when using Selenium, most notably the navigator.webdriver flag. If you can disable them, you may be able to fix the problem.
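A hedged sketch of one attempt at this in Firefox, assuming the dom.webdriver.enabled preference is still honored by your Firefox/geckodriver combination (it has been removed or ignored in some versions):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
# Assumption: this preference hides the navigator.webdriver automation flag.
options.set_preference("dom.webdriver.enabled", False)

driver = webdriver.Firefox(options=options)
driver.get("https://www.cloudflare.com")
# Ideally prints False or None instead of True.
print(driver.execute_script("return navigator.webdriver;"))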
Cloudflare doesn't only check HTTP headers or JavaScript; it also analyses the TLS handshake (the TLS fingerprint). I'm not sure exactly how it does this, but I've found that it can be circumvented by using NSS instead of OpenSSL (though NSS is not well integrated into Requests).
The captcha response depends on the browser fingerprint. It's not just about sending cookies and a User-Agent.
Copy all the headers from the Network tab in the developer console, and send all the key-value pairs as headers with the requests library.
Logically, this method should work; a sketch follows below.
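A minimal sketch of that suggestion; the header values below are illustrative Firefox defaults, so replace them with the exact key-value pairs your own browser sends:
import requests

# Headers copied from the browser's Network tab (example values only).
# "br" is omitted from Accept-Encoding; requests needs the brotli package to decode it.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

session = requests.Session()
session.headers.update(headers)
page = session.get("https://www.cloudflare.com")
print(page.status_code)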
I have a Python script using a MechanicalSoup StatefulBrowser to open URLs, which used to work. But it recently stopped working on one specific website, and I haven't changed any code.
I tried opening other websites, and they are fine. This is the specific website that fails to open: http://a810-bisweb.nyc.gov/bisweb/ComplaintsByAddressServlet?allbin=4606689
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# open url test
url = "http://www.cnn.com"
print("opening website: {}".format(url))
browser.open(url)
print("done website: {}".format(url))
url = "http://a810-bisweb.nyc.gov/bisweb/ComplaintsByAddressServlet?allbin=4606689"
print("opening website: {}".format(url))
browser.open(url)
print("done website: {}".format(url))
The following output shows www.cnn.com opening as expected, but the second link just hangs.
Any help? Or if anyone knows a way to contact the MechanicalSoup developers, please let me know.
Output:
opening website: http://www.cnn.com
done website: http://www.cnn.com
opening website: http://a810-bisweb.nyc.gov/bisweb/ComplaintsByAddressServlet?allbin=4606689
... hangs ...
Thank you.
Many portals block the connection if it has the wrong "User-Agent" header, which tells the server which web browser is being used to connect.
Python tools (like requests) often use the word "Python" in the User-Agent, so the server can recognize that it is not a real web browser and block the connection.
If I use the text "Mozilla/5.0" as the User-Agent, then I can connect again:
browser = mechanicalsoup.StatefulBrowser()
browser.set_user_agent('Mozilla/5.0')
Text "Mozilla/5.0" is not full text used by read web browser so you could find better text. Or it should be python's module with User-Agent from different web browsers so you can use different values in different days.
As a newbie, I wonder whether there is a method to get the HTTP response status code, to detect exceptional cases like the remote server being down, a broken URL, a URL redirect, etc.
In Selenium it's Not Possible!
You can accomplish it with requests:
import requests
from selenium import webdriver

# Selenium drives the browser, while requests fetches the status code separately.
driver = webdriver.Chrome()
driver.get("url")

r = requests.get("url")
print(r.status_code)
Update:
It actually is possible using the Chrome DevTools Protocol with event listeners.
See example script at https://stackoverflow.com/a/75067388/20443541
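For instance, a minimal sketch that reads the main document's status code from Chrome's performance log (one way to consume CDP network events, assuming Selenium 4+ and Chrome; this is distinct from the linked answer's approach):
import json
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome to expose CDP network events through the performance log.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

url = "https://api.github.com/"
driver.get(url)

# Scan the captured Network.responseReceived events for the main document.
for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message["method"] == "Network.responseReceived":
        response = message["params"]["response"]
        if response["url"] == url:
            print(response["status"])  # e.g. 200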
I am looking to open a connection with Python to http://www.horseandcountry.tv, which takes my login parameters via the POST method. I would like to open a connection to this website in order to scrape it for all the video links (this I also don't know how to do yet, but I am using this project to learn).
My question is: how do I pass my credentials to the individual pages of the website? For example, if all I wanted to do was use Python code to open a browser window pointing to http://play.horseandcountry.tv/live/ and have it open with me already logged in, how would I go about this?
As far as I know you have two options, depending on how you want to crawl and what you need to crawl:
1) Use urllib. You can make your POST request with the necessary login credentials. This is the low-level solution, which means it is fast but doesn't handle high-level features like JavaScript; see the sketch at the end of this answer.
2) Use selenium. With that you can simulate a browser (Chrome, Firefox, others...) and run actions via your Python code. It is much slower, but it works well with more "sophisticated" websites.
What I usually do: I try the first option, and if I encounter a problem like a JavaScript security layer on the website, I go for option 2. Moreover, Selenium can open a real web browser on your desktop and give you a visual of your scraping.
In any case, just google "urllib/selenium login to website" and you'll find what you need.
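A minimal sketch of option 1 with urllib; the login URL and form field names are borrowed from the requests answer below, and you should verify them against the site's actual login form in the devtools:
import http.cookiejar
import urllib.parse
import urllib.request

# A cookie jar lets the opener keep the session cookie after logging in.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

payload = urllib.parse.urlencode({
    'account_email': 'your_email',
    'account_password': 'your_password',
    'submit': 'Sign In',
}).encode('ascii')

# Log in; the session cookie ends up in the jar.
response = opener.open('https://play.horseandcountry.tv/login/', data=payload)
print(response.getcode())

# Later requests through the same opener are authenticated.
page = opener.open('http://play.horseandcountry.tv/live/')
html = page.read().decode('utf-8', errors='replace')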
If you want to avoid using Selenium (opening web browsers), you can go for requests; it can log in to the website and grab anything you need in the background.
Here is how you can log in to that website with requests:
import requests
from bs4 import BeautifulSoup
# Login form data
payload = {
    'account_email': 'your_email',
    'account_password': 'your_password',
    'submit': 'Sign In'
}

with requests.Session() as s:
    # Log in to the website.
    response = s.post('https://play.horseandcountry.tv/login/', data=payload)

    # Check if logged in successfully.
    soup = BeautifulSoup(response.text, 'lxml')
    logged_in = soup.find('p', attrs={'class': 'navbar-text pull-right'})
    print(s.cookies)
    print(response.status_code)
    if logged_in and logged_in.text.startswith('Logged in as'):
        print('Logged In Successfully!')
If you need an explanation of this, you can check this answer, or the requests documentation.
You could also use the requests module; it is one of the most popular HTTP libraries. Here are some questions related to what you would like to do:
Log in to website using Python Requests module
logging in to website using requests