MechanicalSoup StatefulBrowser: Unable to Open URL - python

I have a Python script that uses MechanicalSoup's StatefulBrowser to open a URL, and it used to work. It recently stopped working for one specific website, even though I haven't changed any code.
I tried opening other websites and they open fine. This is the specific website that fails to open: http://a810-bisweb.nyc.gov/bisweb/ComplaintsByAddressServlet?allbin=4606689
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# open url test
url = "http://www.cnn.com"
print("opening website: {}".format(url))
browser.open(url)
print("done website: {}".format(url))
url = "http://a810-bisweb.nyc.gov/bisweb/ComplaintsByAddressServlet?allbin=4606689"
print("opening website: {}".format(url))
browser.open(url)
print("done website: {}".format(url))
The output below shows that www.cnn.com opened as expected, but the second URL just hangs.
Any help? Or if anyone knows a way to contact the MechanicalSoup developers, please let me know.
Output:
opening website: http://www.cnn.com
done website: http://www.cnn.com
opening website: http://a810-bisweb.nyc.gov/bisweb/ComplaintsByAddressServlet?allbin=4606689
... hangs ...
Thank you.

Many portals block the connection if it has the wrong "User-Agent" header, which tells the server what web browser is being used to connect.
Python tools (like requests) often include the word "Python" in the User-Agent, so the server can recognize that it is not a real web browser and block the connection.
If I use the text "Mozilla/5.0" as the User-Agent, then I can connect again:
browser = mechanicalsoup.StatefulBrowser()
browser.set_user_agent('Mozilla/5.0')
Text "Mozilla/5.0" is not full text used by read web browser so you could find better text. Or it should be python's module with User-Agent from different web browsers so you can use different values in different days.

Related

BeautifulSoup not returning the title of page

I tried to get the title of a web page by web scraping with the Beautifulsoup4 Python module, and it returns the string "Not Acceptable!" as the title, but when I open the webpage in a browser the title is different. I tried looping through a list of links and extracting the titles of all the webpages, but it returns the same "Not Acceptable!" string for every link.
Here is the Python code:
from bs4 import BeautifulSoup
import requests
URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
result = requests.get(URL)
doc = BeautifulSoup(result.text, 'html.parser')
tag = doc.title
print(tag.get_text())
Here is a link to the corresponding web page (the URL used in the code above).
I don't know if this is a problem with Beautifulsoup4 or with the requests library. Is it because the site has bot protection enabled and doesn't return the HTML when the request is sent?
The server expects the User-Agent header. Interestingly, it is happy with any User-Agent, even a fictitious one:
result = requests.get(URL, headers = {'User-Agent': 'My User Agent 1.0'})
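Applied to the original snippet, the fix might look like this (a sketch; any plausible header value appears to be accepted):
import requests
from bs4 import BeautifulSoup

URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
# Sending any non-empty User-Agent avoids the "Not Acceptable!" (406) response.
result = requests.get(URL, headers={'User-Agent': 'My User Agent 1.0'})
doc = BeautifulSoup(result.text, 'html.parser')
print(doc.title.get_text())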
An easy way to debug this kind of issue is to print (or write to a file) the response text (result.text in the code above). Some servers don't allow scraping, and some websites generate HTML with JavaScript at runtime (e.g. YouTube); in these scenarios the response text can differ from the page source we see in the browser. The text below is what the server returned:
<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
Edit:
As pointed out by DYZ, this is a 406 error, caused by the missing User-Agent in the request headers.
https://www.exai.com/blog/406-not-acceptable
The 406 Not Acceptable status code is a client-side error. It's part of the HTTP response status codes in the 4xx category, which are considered client error responses.

Using Selenium to access a site hosted on sharepoint

For privacy reasons, I cannot share the URL publicly.
I have been able to access this site successfully using Python requests: session = requests.Session(); r = session.post(url, auth=HttpNtlmAuth(USERNAME, PASSWORD), proxies=proxies). This works great and I can parse the webpage with bs4. I have tried to return cookies using session.cookies.get_dict(), but it returns an empty dict (I assume because the site is hosted on SharePoint). My original thought was to retrieve the cookies and then use them to access the site.
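For reference, a fuller sketch of the working requests approach described above (it assumes the requests_ntlm package, and the url, credentials, and proxies values are placeholders):
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup

url = 'https://sharepoint.example.com/sites/somesite/page.aspx'        # placeholder
proxies = {'http': 'http://proxy:8080', 'https': 'http://proxy:8080'}  # placeholder

session = requests.Session()
r = session.post(url, auth=HttpNtlmAuth('USERNAME', 'PASSWORD'), proxies=proxies)
soup = BeautifulSoup(r.text, 'html.parser')

print(r.status_code)
print(session.cookies.get_dict())   # often empty for NTLM-authenticated SharePoint sites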
The issue I'm facing is that when you are redirected to the url, a box comes up asking for credentials, which, when entered, takes you to the url. You cannot inspect the page that the box is on, which means I can't use send_keys() etc. to log in with selenium/chromedriver.
I read through some documentation but was unable to find a way to enter the username/password when calling driver = webdriver.Chrome(path_driver) or in subsequent calls.
Any help/thoughts would be appreciated.
When right-clicking the credentials box (screenshot not included), there is no option to inspect the webpage.

How does Cloudflare differentiate Selenium and Requests traffic?

Context
I am currently attempting to build a small-scale bot using the Selenium and Requests modules in Python.
However, the webpage I want to interact with is running behind Cloudflare.
My Python script is running over Tor using the stem module.
My traffic analysis is based on Firefox's "Developer options->Network" using Persist Logs.
My findings so far:
Selenium's Firefox webdriver can often access the webpage without going through the "checking browser" page (return code 503) or the "captcha" page (return code 403).
A Requests session object with the same user agent always results in the "captcha" page (return code 403).
If Cloudflare were checking my JavaScript functionality, shouldn't my requests call also return 503?
Code Example
from selenium import webdriver
import requests

# fp is a Firefox profile and fOptions a Firefox options object, configured elsewhere.
driver = webdriver.Firefox(firefox_profile=fp, options=fOptions)
driver.get("https://www.cloudflare.com") # usually returns code 200 without verifying the browser
session = requests.Session()
# ... applied socks5 proxy for both http and https ... #
session.headers.update({"user-agent": driver.execute_script("return navigator.userAgent;")})
page = session.get("https://www.cloudflare.com")
print(page.status_code) # return code 403
print(page.text) # returns "captcha page"
Both Selenium and Requests modules are using the same user agent and ip.
Both are using GET without any parameters.
How does Cloudflare distinguish these traffic?
Am I missing something?
I tried to transfer cookies from the webdriver to the requests session to see if a bypass is possible but had no luck.
Here is the used code:
for c in driver.get_cookies():
    session.cookies.set(c['name'], c['value'], domain=c['domain'])
There are additional JavaScript APIs exposed to the webpage when using Selenium. If you can disable them, you may be able to fix the problem.
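One concrete example of such an API is the navigator.webdriver flag, which the WebDriver spec requires to be set in automated browsers; a quick way to see it from the existing driver (just a diagnostic sketch, not a bypass):
# In a Selenium-driven browser this typically prints True;
# in a normal browser the property is False or undefined.
print(driver.execute_script("return navigator.webdriver;"))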
Cloudflare doesn't only check HTTP headers or JavaScript; it also analyses the TLS handshake. I'm not sure exactly how it does it, but I've found that it can be circumvented by using NSS instead of OpenSSL (though NSS is not well integrated into Requests).
The captcha response depends on the browser fingerprint; it's not just about sending cookies and a User-Agent.
Copy all the headers from the Network tab in the developer console and send all the key-value pairs as headers with the requests library.
Logically, this method should work.
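A minimal sketch of that idea (the header values below are placeholders; copy the real ones from your own browser's Network tab):
import requests

# Headers copied from a real browser request; replace the values with your own.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

session = requests.Session()
session.headers.update(headers)
page = session.get("https://www.cloudflare.com")
print(page.status_code)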

Creating a connection to a subscription site in python

I am looking to open a connection with Python to http://www.horseandcountry.tv, which takes my login parameters via the POST method. I would like to open a connection to this website in order to scrape it for all video links (this I also don't know how to do yet, but I am using the project to learn).
My question is: how do I pass my credentials to the individual pages of the website? For example, if all I wanted to do was use Python code to open a browser window pointing to http://play.horseandcountry.tv/live/ and have it open with me already logged in, how do I go about this?
As far as I know you have two options, depending on how you want to crawl and what you need to crawl:
1) Use urllib. You can do your POST request with the necessary login credentials (see the sketch after this answer). This is the low-level solution, which means it is fast but doesn't handle high-level things like JavaScript.
2) Use selenium. With that you can simulate a browser (Chrome, Firefox, others...) and run actions via your Python code. It is much slower, but it works well with more "sophisticated" websites.
What I usually do: I try the first option, and if I encounter a problem like a JavaScript security layer on the website, I go for option 2. Moreover, selenium can open a real web browser on your desktop and give you a visual view of your scraping.
In any case, just google "urllib/selenium login to website" and you'll find what you need.
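To illustrate option 1, a minimal urllib sketch of a POST login might look like this (the login URL and form field names are assumptions; inspect the real login form to find the actual ones):
import urllib.parse
import urllib.request

# Assumed login endpoint and form field names.
login_url = 'http://play.horseandcountry.tv/login/'
form_data = urllib.parse.urlencode({
    'account_email': 'your_email',
    'account_password': 'your_password',
}).encode('utf-8')

# Passing data= turns this into a POST request.
request = urllib.request.Request(login_url, data=form_data)
with urllib.request.urlopen(request) as response:
    print(response.status)
    html = response.read().decode('utf-8', errors='replace')
# To stay logged in across further requests you would also need an
# opener with urllib.request.HTTPCookieProcessor (http.cookiejar).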
If you want to avoid using Selenium (opening web browsers), you can go with requests; it can log in to the website and grab anything you need in the background.
Here is how you can log in to that website with requests:
import requests
from bs4 import BeautifulSoup
# Login form data
payload = {
    'account_email': 'your_email',
    'account_password': 'your_password',
    'submit': 'Sign In'
}

with requests.Session() as s:
    # Log in to the website.
    response = s.post('https://play.horseandcountry.tv/login/', data=payload)

    # Check whether the login succeeded.
    soup = BeautifulSoup(response.text, 'lxml')
    logged_in = soup.find('p', attrs={'class': 'navbar-text pull-right'})
    print(s.cookies)
    print(response.status_code)
    if logged_in is not None and logged_in.text.startswith('Logged in as'):
        print('Logged In Successfully!')
If you need explanations for this, you can check the requests documentation.
You could also use the requests module. It is one of the most popular. Here are some questions that relate to what you would like to do.
Log in to website using Python Requests module
logging in to website using requests

Python Requests - Emulate Opening a Browser at Work

At work, I sit behind a proxy. When I connect to the company WiFi and open up a browser, a pop-up box usually appears asking for my company credentials before it will let me navigate to any internal/external site.
I am using the Python Requests package to automate pulling data from an external site but am encountering a 401 error that is related to not having authenticated first. This happens when I don't authenticate first using the browser. If I authenticate first with the browser and then use Python requests then everything is fine and I'm able to navigate to any site.
My question is: how do I perform the work authentication part using Python? I want to automate this process so that I can set up a cron job that grabs data from an external source every night.
I've tried providing a blank URL:
import requests
response = requests.get('')
But requests.get() requires a properly structured URL. I want to emulate opening a browser and capturing the pop-up that asks for authentication, which does not rely on any particular URL being used.
From the requests documentation:
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
