How to bypass the DDoS attack check on a particular site in Python?

I am trying to scrape this site, but while requesting the data the site runs a DDoS check on me. It checks for about 5 seconds and then redirects to the same URL; in a normal browser the page then opens, but when I make the same request in Python I just get the DDoS check-up page back. Is there any way I can bypass that, or any workaround?
This is my code (thanks in advance):
import requests

url = 'https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

page = requests.get(url, headers=headers)
print(page.text)

Using a headless browser will work. Use PhantomJS with Selenium WebDriver to scrape such sites, or ones that use AJAX to load content.
I found these links useful.
https://www.guru99.com/selenium-python.html
https://vocuzi.in/blog/preventing-website-web-scrapers/
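For reference, here is a minimal sketch of the headless-browser approach. It uses headless Chrome driven by Selenium rather than PhantomJS (which has since been discontinued); the URL is the one from the question, and the 10-second sleep is only an assumption about how long the JS check takes.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on your PATH

driver.get('https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed')
time.sleep(10)  # wait out the ~5 second DDoS check and the redirect
print(driver.page_source)
driver.quit()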

Anti-DDoS solutions usually take various parameters into account when inspecting a request's validity. For example, your geographical location may be a huge factor: when trying to reproduce your issue I got a 200 response, meaning the anti-DDoS decided to allow my code to access the site.
I would suggest using a VPN/proxy service such as this one, or, if this is a system destined for production, a paid service, as these are much more reliable. Note that some anti-DDoS services are robust enough to block many proxy IPs as well.
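As a rough sketch, routing the request through a proxy with requests looks like the following. The proxy address and credentials are placeholders, not a real service; substitute whatever your provider gives you.
import requests

# placeholder proxy endpoint -- replace with the address/credentials from your provider
proxies = {
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080',
}

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
url = 'https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed'

page = requests.get(url, headers=headers, proxies=proxies, timeout=15)
print(page.status_code)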

Related

How can I log in to a website such as StockX using Python requests (requests module)?

I'm having trouble finding which URL I have to send the POST request to, along with the necessary credentials and/or headers.
I tried
import requests
from bs4 import BeautifulSoup

callback = 'https://accounts.stockx.com/login/callback'
login = 'https://accounts.stockx.com/login'
shoe = "https://stockx.com/nike-dunk-high-prm-dark-russet"

headers = {
}

request = requests.get(shoe, headers=headers)
# print(request.text)
soup = BeautifulSoup(request.text, 'lxml')
print(soup.prettify())
but I keep getting
`Access to this page has been denied`
You're attempting a very difficult task as a lot of these websites have impeccable bot detection.
You can try mimicking your browser's headers by copying them from your browser's network request menu (Ctrl+Shift+I, then go to the Network tab). The most important is your User-Agent; add it to your headers. It will look something like this:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
However, you'll probably be met with some sort of captcha or other issues. It's a very uphill battle, my friend. You will want to look into understanding HTTP requests better, and into proxies so you're not IP-limited. But even then, these websites have much more advanced methods of detecting a Python script or bot, such as TLS fingerprinting and various other sorts of fingerprinting.
Your best bet would be to try and find out the actual API that your target website uses, if it is exposed.
Otherwise there is not much you can do, except accept that what you are doing is against their Terms of Service.
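For illustration, here is a sketch of wiring copied browser headers into a requests.Session. The header values are examples; paste the ones from your own browser, and note the request may still be blocked if the site fingerprints more than headers.
import requests

session = requests.Session()
# example values -- replace with the headers copied from your browser's network tab
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
})

shoe = 'https://stockx.com/nike-dunk-high-prm-dark-russet'
response = session.get(shoe)
print(response.status_code)  # may still be denied if detection goes beyond headers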

Bypass cookie agreement page while web scraping using Python

I am facing an issue with the Google cookie-consent page when scraping a Google redirect URL.
I am trying to scrape different pages from a Google News URI, but when I run this code:
req = requests.get(url,headers=headers)
with "headers" = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.1 (KHTML, like Gecko) Chrome/6.0.422.0 Safari/534.1', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'DNT': '1', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'it-IT'}
and for example URL = https://news.google.com/./articles/CAIiEMb3PYSjFFVbudiidQPL79QqGQgEKhAIACoHCAow-ImTCzDRqagDMKiIvgY?hl=it&gl=IT&ceid=IT%3Ait
the "request.content" is the HTMLs code of agreement cookies page by Google.
I have tried also to convert the redirect link into a normal link but the response gives me the redirect link to this
I have the same problem related to this question (How can I bypass a cookie agreement page while web scraping using Python?).
Anyway, the solution proposed there only works for that specific site.
Note: the entire code worked until a few weeks ago.
I solved the problem by adding the line
'Cookie':'CONSENT=YES+cb.20210418-17-p0.it+FX+917; '
to the request header.
Although the page returned by the request is still a Google page, it contains the link to the site the original request pointed at.
So, once I got that page, I did some more scraping to extract the link and then made the request I actually wanted.
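A minimal sketch of that fix, using the cookie value from above (Google may change the consent cookie format over time, so treat it as an example):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.1 (KHTML, like Gecko) Chrome/6.0.422.0 Safari/534.1',
    # consent cookie from the answer above; without it Google serves the agreement page
    'Cookie': 'CONSENT=YES+cb.20210418-17-p0.it+FX+917; ',
}

url = 'https://news.google.com/./articles/CAIiEMb3PYSjFFVbudiidQPL79QqGQgEKhAIACoHCAow-ImTCzDRqagDMKiIvgY?hl=it&gl=IT&ceid=IT%3Ait'
req = requests.get(url, headers=headers)
# the response is still a Google page, but it contains the link to the source
# article, which can be extracted with BeautifulSoup and requested next
print(req.status_code)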

BeautifulSoup returning 'None'

I am trying to get a list of reviews from the G2 website using BeautifulSoup. However, for some reason, when I run the code below, it says that 'reviews' is 'NoneType'. I can't figure this out because it clearly shows the class name in the HTML from the website (see the picture below). I have used this exact syntax to webscrape from other sites and it has worked so I have no idea why it is returning NoneType. I tried to use 'find_all' and return the length of the list (number of reviews), but that also shows nonetype. I am super confused. Please help!
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.g2.com/products/mailchimp/reviews?filters%5Bcomment_answer_values%5D=&order=most_recent&page=1')
text = BeautifulSoup(response.text, 'html.parser')
num_reviews = 500
reviews = text.find('div', attrs={'class': 'paper paper--white paper--box mb-2 position-relative border-bottom '})
print(reviews)
You need to pass headers with the HTTP request. The site is detecting that you're not a browser; if you print the variable text out, you'll see that.
This is the parsed HTML you get:
...
<h1>Pardon Our Interruption...</h1>
<p>
As you were browsing something about your browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
<li>You're a power user moving through this website with super-human speed.</li>
<li>You've disabled JavaScript and/or cookies in your web browser.</li>
<li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.</li>
...
So passing headers is enough to mimic browser activity.
To grab the headers, open your browser's developer tools and look at the network tab (explained in more detail below the code).
Code Example
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'www.g2.com',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
'cookie': '__cfduid=df6514ad701b86146978bf17180a5e6f01597144255; events_distinct_id=822bbff7-912d-4a5e-bd80-4364690b2e06; amplitude_session=1597144258387; _g2_session_id=424bfbe09b254b1a9484f50b70c3381c; reese84=3:BJ8QXTaIa+brQNrbReKzww==:n5v0tg/Q590u2q44+xAi7rnSO1i2Kn7Lp1Ar+2SCMJF5HiBJNqLVR3IPzPF0qIqgxpWjZ9veyhywY4JNSbBOtz5sJOwEecGJE9tT+NInof+vlP3hKTb6bqA3cvAf6cfDIrtEmhI0Dsjoe3ct3NtwvvcA9p8FXHPR7PAFP42nWqAAfDH88vj0hQwWlIjio/fT4g5iDsT1qZH3alC8ZbUhOURKNk9JUz2sBz+RjgkRyctO0VTGzjxmHCd2r40WJqWjVDwRmBl+/msW+/V0PW93vjFs45bMD63D5Q4JeRreBxkAN9ufIajaV0MmkYbxlFnwIZ3cEBHi/X76n+PvAobd5/UgCwgUIvt/P4pl7NEcDWR/ORaZ8gLPl4HbuQaRhEVd23Ez5OBnYFP1wjqLT/ECDkRzQq0Nn8U6qVbMO25Hp6U=:/JrPeXs0AKDQw5FlG3vKQX1dPIsF/TEXTLgQ+mktyAo=; ue-event-segment-983a43a0-1c10-4dfb-96d7-60049c0dcd62=W1siL3VzZXJzL2NvbnNlbnQvc2VsZWN0ZWQiLHsiY29uc2VudF90eXBlIjoi%0AY29va2llcyIsImdyYW50ZWQiOiJ0cnVlIn0sIjk4M2E0M2EwLTFjMTAtNGRm%0AYi05NmQ3LTYwMDQ5YzBkY2Q2MiIsIlVzZXIgQ29uc2VudCBTZWxlY3RlZCIs%0AWyJhbXBsaXR1ZGUiXV1d%0A',
'if-none-match': 'W/"3658e5098c91c183288fd70e6cfd9028"',
}
response = requests.get('https://www.g2.com/products/mailchimp/reviews', headers=headers)
text = BeautifulSoup(response.text, 'html.parser')
num_reviews = 500
reviews = text.select('div[class*="paper paper--white paper--box"]')
print(len(reviews))
Output
25
Explanation
Sometimes in order to make an HTTP request it's necessary to pass headers, a user agent, cookies, or parameters. You can play about with this; I must admit I was lazy and just sent the entire set of headers. Essentially you're trying to mimic a browser request by using the requests package. Sometimes the bot detection is a bit more nuanced than that.
Here I've inspected the page and gone to the network tools. There's a tab called Doc. I've then copied that request by right-clicking it and choosing Copy > Copy as cURL (bash). As I said, I'm lazy, so I've pasted that into curl.trillworks.com, which converts it into a nice Python format along with the boilerplate for a request.
I've modified your script slightly, as the class attribute you were matching was quite long.
The CSS selector div[class*="..."] grabs any div element whose class attribute contains the substring you specify.
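To make that selector behaviour concrete, here is a tiny self-contained example; the HTML is made up purely to show the substring match.
from bs4 import BeautifulSoup

html = '''
<div class="paper paper--white paper--box mb-2 position-relative">review 1</div>
<div class="sidebar">not a review</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# class*="..." matches any div whose class attribute contains this substring
print(soup.select('div[class*="paper paper--white paper--box"]'))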

HTML in browser different from the one requested in Python

import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
page = requests.get("https://sky.lea.moe/stats/PapaGordsmack/", headers=headers)
html_contents = page.text
print(html_contents)
I am trying to scrape the sky.lea.moe website for a specific user, but when I request the HTML and print it, it is different from the one shown in the browser (on Chrome, viewing the page source).
The one I get is: https://pastebin.com/91zRw3vP
Analyzing it, it is something about checking the browser and redirecting. Any ideas what I should do?
This is Cloudflare's anti-DDoS protection, and it is effective at stopping scraping. A JS script will usually redirect you after a few seconds.
Something like Selenium is probably your best option for getting around it, though you might be able to scrape the JS file and extract the URL it redirects to. You could also try spoofing your referrer to be this page, so the request goes to the correct one.
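If you want to try the referrer idea with plain requests, a sketch would look like this; the Referer value is an assumption (the site's own front page), and on its own it is unlikely to defeat the JS challenge.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    'Referer': 'https://sky.lea.moe/',  # assumed referrer: the site's front page
}
page = requests.get('https://sky.lea.moe/stats/PapaGordsmack/', headers=headers)
print(page.status_code)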
Browsers indeed do more than just download a webpage. They also download additional resources, parse styles, and so on. To scrape a webpage it is advisable to use a scraping library like Scrapy, which does all these things for you and provides a complete toolkit to easily extract information from pages.
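As a rough sketch of what a minimal Scrapy spider for this page could look like (the user agent and selector are assumptions; adapt them to the data you actually need):
import scrapy

class StatsSpider(scrapy.Spider):
    name = 'stats'
    start_urls = ['https://sky.lea.moe/stats/PapaGordsmack/']
    custom_settings = {
        # present a regular browser user agent
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    }

    def parse(self, response):
        # placeholder extraction: grab the page title; swap in selectors for the stats you want
        yield {'title': response.css('title::text').get()}
Run it with scrapy runspider stats_spider.py -o stats.json.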

Unable to parse a part of a webpage that is visible when opened in a browser

I have this strange problem parsing the Herald Sun webpage to get the list of RSS feeds from it. When I look at the webpage in the browser, I can see the links with titles. However, when I use Python and Beautiful Soup to parse the page, the response does not even contain the section I would like to parse.
import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.71 (KHTML, like Gecko) Version/7.0 Safari/537.71',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib.request.Request("http://www.heraldsun.com.au/help/rss", headers=hdr)
try:
    page = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.fp.read())

html_doc = page.read()
with open("Temp/original.html", 'w') as f:
    f.write(html_doc.decode('utf-8'))
As you can check, the written file does not have the results in it, so obviously Beautiful Soup has nothing to work with here.
I wonder how the webpage enables this protection and how to overcome it. Thanks.
For commercial use, read the terms of service first.
There is really not that much information the server knows about who is making the request:
either the IP, the User-Agent, or cookies. Also note that urllib2 will not pick up information that is generated by JavaScript.
JavaScript or Not?
(1) You need to open the Chrome developer tools and disable the cache and JavaScript to make sure that you can still see the information you want. If you cannot see the information there, you have to use a tool that supports JavaScript, like Selenium or PhantomJS.
However, in this case, your website does not look that sophisticated.
User-Agent? Cookie?
(2) Then the problem comes down to tuning the User-Agent or cookies. As you have tried before, the user agent alone does not seem to be enough, so the cookie is what will do the trick.
As you can see, the first page call actually returns "temporarily unavailable", and you need to click through to the RSS HTML page that comes back with a 200 code. You just need to copy the user agent and cookies from there and it will work.
Here is the code for adding a cookie using urllib2 (Python 2):
import urllib2, bs4

opener = urllib2.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")]
# I omitted the cookie here and you need to copy and paste your own
opener.addheaders.append(('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'))

soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"))
div = soup.find('div', {"id": "content-2"}).find('div', {"class": "group-content"})
for a in div.find_all('a'):
    try:
        if 'feeds.news' in a['href']:
            print a
    except KeyError:
        pass
And here are the outputs:
Breaking News
Top Stories
World News
Victoria and National News
Sport News
...
The site could very likely be serving different content, depending on the User-Agent string in the headers. Websites will often do this for mobile browsers, for example.
Since you're not specifying one, urllib is going to use its default:
By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number.
You could try spoofing a common User-Agent string by following the advice in this question. See What's My User Agent?
