How to bypass Mod_Security while scraping - python

I tried running this Python script using the BeautifulSoup and requests modules:
from bs4 import BeautifulSoup as bs
import requests

url = 'https://udemyfreecourses.org/'
headers = {'UserAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
soup = bs(requests.get(url, headers=headers).text, 'lxml')
But when I run this line:
print(soup.get_text())
it doesn't scrape the text data; instead, it returns this output:
Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
I even used headers when requesting the webpage, so the request would look like it came from a normal browser, but I'm still getting this message, which prevents me from accessing the real webpage.
Note: The webpage works perfectly in the browser, but it doesn't show much info when I try to scrape it.
Is there any way, other than setting headers as I did, to send a valid request to the website and get past this Mod_Security check?
Any help would be very helpful. Thanks.

EDIT: The dash in "User-Agent" is essential.
Following this answer: https://stackoverflow.com/a/61968635/8106583
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
Your User-Agent is the problem. This User-Agent works for me.
Also: your IP might be blocked by now :D
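Putting that together, a minimal corrected version of the script from the question might look like this (the substantive change is the dashed 'User-Agent' key plus the header value from the linked answer):

```python
from bs4 import BeautifulSoup as bs
import requests

# The header name must be 'User-Agent' (with a dash). 'UserAgent' is sent
# as an unrecognized header, so Mod_Security still rejects the request.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

def fetch_text(url):
    """Fetch a page with browser-like headers and return its visible text."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return bs(response.text, 'lxml').get_text()

if __name__ == '__main__':
    print(fetch_text('https://udemyfreecourses.org/'))
```

No guarantee this keeps working if the site tightens its rules further, but it addresses the header-name mistake.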

Related

Why does Jupyter give me a ModSecurity error when I try to run Beautiful Soup?

I am currently trying to build a webscraping program to pull data from a real estate website using Beautiful Soup. I haven't gotten very far but the code is as follows:
import requests
from bs4 import BeautifulSoup
r=requests.get("http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/")
c=r.content
soup=BeautifulSoup(c,"html.parser")
print(soup)
When I try to print the data to at least see if the program is working, I get an error message saying "Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security." How do I get the server to stop blocking my IP address? I've read about similar issues with other programs and tried clearing the cookies, trying different browsers, etc., and nothing has fixed it.
This is happening because the webpage thinks you're a bot (and it's correct), so your request gets blocked.
To "bypass" this, try adding a user-agent to the headers parameter of the requests.get() method.
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
url = "http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print(soup.prettify())

BeautifulSoup is returning None

I am trying to get a value from a website using Beautiful Soup, but it keeps returning None. This is what I have for my code so far:
import bs4
import requests

def getmarketchange():
    source = requests.get("https://www.coinbase.com/price").text
    soup = bs4.BeautifulSoup(source, "lxml")
    marketchange = soup.get("MarketHealth__Percent-sc-1a64a42-2.bEMszd")
    print(marketchange)

getmarketchange()
and attached is a screenshot of html code I was trying to grab.
Thank you for your help in advance!
Have a look at the HTML source returned from your get() request - it's a CAPTCHA challenge. You won't be able to get to the Coinbase pricing without passing this challenge.
Excerpt:
<h2 class="cf-subheadline"><span data-translate="complete_sec_check">
Please complete the security check to access</span> www.coinbase.com
</h2>
<div class="cf-section cf-highlight cf-captcha-container">
Coinbase is recognizing that the HTTP request isn't coming from a standard browser-based user, and it is challenging the requester. BeautifulSoup doesn't have a way to pass this check on its own.
Passing in User-Agent headers (to mimic a browser request) also doesn't resolve this issue.
For example:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
source = requests.get("https://www.coinbase.com/price", headers=headers).text
You might find better luck with Selenium, although I'm not sure about that.
To avoid the CAPTCHA page, try specifying a User-Agent header:
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/price'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print(soup.select_one('[class^="MarketHealth__Percent"]').text)
Prints:
0.65%

BeautifulSoup using Python keeps returning None even though the element exists

I am running the following code to parse an Amazon page using Beautiful Soup in Python, but when I run the print line, I keep getting None. I am wondering whether I am doing something wrong or if there's an explanation/solution to this. Any help will be appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-Bulletin-Board/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
Your code is absolutely correct.
There seems to be some issue with the parser that you have used (html.parser).
I used html5lib in place of html.parser, and the code now works:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-BulletinBoard/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
title = soup.find(id='productTitle')
print(title)
More info, not directly related to the answer:
For the other answer given to this question, I wasn't asked for a CAPTCHA when visiting the page.
However, Amazon does change the response content if it detects that a bot is visiting the website: remove the headers from the requests.get() method and inspect page.text.
The default headers added by the requests library lead to the request being identified as coming from a bot.
When requesting that page outside a normal browser environment, it asked for a CAPTCHA; I'd assume that's why the element doesn't exist.
Amazon probably has specific measures to counter "robots" accessing their pages. I suggest looking at their APIs to see if there's anything helpful instead of scraping the webpages directly.
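If you want to detect the robot-check page in code before parsing, a rough heuristic is to look for its telltale phrases (the marker strings below are assumptions based on what such block pages typically say, not an official API):

```python
def looks_like_captcha(html):
    """Heuristic: robot-check pages usually mention 'captcha' or ask you
    to type the characters shown. Marker strings are assumptions; adjust
    them to whatever the block page you actually receive contains."""
    lowered = html.lower()
    return "captcha" in lowered or "type the characters" in lowered
```

Checking this before calling soup.find() makes the "element is None" case easier to diagnose.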

Web Scraping with python requests

I want to scrape https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch and download the entire set of PDF files that show up in the search results as on a date (say 01-01-2016). The employee fields are optional. On clicking search, the site shows a list of all the employees. I am unable to get the POST request to work using Python requests and keep getting a 405 error. My code is below:
from bs4 import BeautifulSoup
import requests
url = "https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch"
data = {
    'assessmentYearId': 'vH4pgBbZ8y8rhOFBoM0g7w',
    'empName': '',
    'allotmentYear': '',
    'cadreId': '',
    'iprReportType': 'cqZvyXc--mpmnRNfPp2k7w',
    'userType': 'JgPOADxEXU1jGi53Xa2vGQ',
    '_csrf': '7819ec72-eedf-4290-ba70-6f2b14cc4b79'
}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '184',  # requests computes this automatically; hard-coding it can cause problems
    'Content-Type': 'application/x-www-form-urlencoded',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.post(url, data=data, headers=headers)
I'm not familiar with the website, but I strongly suggest reading their policy before trying to scrape the content.
In similar scenarios, when you don't get the expected results from a simple POST, using requests.Session usually helps.
The problem lay in my reusing the same CSRF token. It needs to be refreshed with every request.
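A sketch of that fix, assuming the fresh token is exposed as a hidden input named _csrf on the form page (a common convention; verify against the actual HTML), using requests.Session so the cookies and token stay paired:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch"

def extract_csrf(html):
    """Read the value of the hidden '_csrf' input from the form page.
    Assumes the token is rendered as <input name="_csrf" value="...">."""
    soup = BeautifulSoup(html, "html.parser")
    token_input = soup.find("input", {"name": "_csrf"})
    return token_input["value"] if token_input else None

def search(session, form_data):
    """GET the form first (valid cookies + fresh token), then POST."""
    page = session.get(URL, timeout=30)
    data = dict(form_data, _csrf=extract_csrf(page.text))
    return session.post(URL, data=data, timeout=30)

if __name__ == "__main__":
    with requests.Session() as s:
        response = search(s, {'empName': '', 'allotmentYear': ''})
        print(response.status_code)
```

Because the GET and POST share one Session, the server sees the token alongside the session cookie it was issued with, which is what a stale hard-coded token breaks.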

How to scrape a webpage using Beautiful Soup as if using Chrome?

I see certain elements in Google Chrome (Inspect) that are missing when I view the source of the same web page in Internet Explorer.
I assume Beautiful Soup uses Internet Explorer internally? Its results match IE more closely.
However, when I use Chrome's Inspect feature, I see certain elements that are not listed in the source.
Is there a way I can emulate this in Python or using Beautiful Soup?
You can change your user agent to one of those listed here:
https://webscraping.com/blog/User-agents/
A snippet: changing the User-Agent forces the page to serve different content (mobile vs. Chrome):
from bs4 import BeautifulSoup
import requests

# headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'}
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}
result = requests.get("http://derstandard.at", headers=headers)
c = result.content
print(result.request.headers)
print(len(c))
Note: Some websites protect themselves against user-agent spoofing, so not all websites will respond well to these frequent switches.
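If you do rotate user agents, a minimal sketch could look like this (the pool below just reuses User-Agent strings that appear in the answers above; swap in whichever browsers you want to mimic):

```python
import random

import requests

# Sample pool reusing User-Agent strings from earlier answers; replace
# with strings for the browsers you actually want to impersonate.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
]

def build_headers():
    """Pick one User-Agent at random so consecutive requests don't all look identical."""
    return {'User-Agent': random.choice(USER_AGENTS)}

def get_with_random_agent(url):
    return requests.get(url, headers=build_headers(), timeout=10)
```

As the note above says, this is no guarantee: sites that fingerprint more than the User-Agent header will still detect the script.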

Categories

Resources