First, I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
As output I do not get the expected HTML page but another HTML page that says: Misbehaving Content Scraper
Please use robots.txt
Your IP has been rate limited
To check the problem, I wrote:
try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        html_page = requests.get(baseurl).text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))
Then I get 429 (too many requests).
What can I do to handle this problem? Does it mean I cannot print the HTML of the page, and does it prevent me from scraping any content of the page? Should I rotate the IP address?
If you are only hitting the page once and getting a 429, it's probably not you hitting them too much. You can't be sure the 429 error is accurate; it's simply what their webserver returned. I've seen pages return a 404 response code, yet the page was fine, and a 200 response code on legitimately missing pages, just a misconfigured server. They may just return 429 for any bot, so try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:
requests.get(baseurl, headers = {'User-agent': 'Super Bot Power Level Over 9000'})
to declare yourself as a bot, or
requests.get(baseurl, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
if you wish to mimic a browser more closely. Note the version numbers in that browser string: they were current at the time of this writing, but you may need later ones. Just find the user agent of the browser you use; this page will tell you what that is:
https://www.whatismybrowser.com/detect/what-is-my-user-agent
Some sites return better searchable code if you just say you are a bot; for others it's the opposite. It's basically the wild wild west, and you just have to try different things.
Another pro tip: you may have to write your code to have a 'cookie jar', or a way to accept a cookie. Usually it is just an extra line in your request, but I'll leave that for another Stack Overflow question :)
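That said, a minimal sketch of the cookie-jar idea with requests, reusing baseurl from the question (a Session keeps the cookie jar for you):

import requests

baseurl = 'https://name_of_the_website.com'

# A Session keeps a cookie jar, so cookies set by the first response
# are sent back automatically on later requests.
session = requests.Session()
session.headers.update({'User-agent': 'Super Bot Power Level Over 9000'})

first = session.get(baseurl)    # the server may set cookies here
second = session.get(baseurl)   # those cookies are sent back automatically
print(session.cookies.get_dict())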
If you are indeed hitting them a lot, you need to sleep between calls. It's a server-side response completely controlled by them. You will also want to investigate how your code interacts with robots.txt; that's a file usually at the root of the webserver with the rules it would like your spider to follow.
You can read about that here: Parsing Robots.txt in python
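To combine both ideas, here is a minimal sketch reusing baseurl from the question; the extra page path and the five-second pause are just placeholders, and Python 3's urllib.robotparser is assumed:

import time
import requests
import urllib.robotparser  # Python 3 standard library

baseurl = 'https://name_of_the_website.com'

# Ask robots.txt whether our bot is allowed to fetch each page at all.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(baseurl + '/robots.txt')
rp.read()

pages = [baseurl, baseurl + '/some/other/page']  # placeholder URLs
for page in pages:
    if not rp.can_fetch('Super Bot Power Level Over 9000', page):
        print('robots.txt asks us not to fetch', page)
        continue
    html_page = requests.get(page, headers={'User-agent': 'Super Bot Power Level Over 9000'}).text
    time.sleep(5)  # sleep between calls so we don't trigger the rate limit again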
Spidering the web is fun and challenging; just remember that you could be blocked at any time by any site for any reason. You are their guest, so tread nicely :)
I'm having trouble finding which link I have to send the POST request to, along with what credentials and/or headers are necessary.
I tried:
import requests
from bs4 import BeautifulSoup

callback = 'https://accounts.stockx.com/login/callback'
login = 'https://accounts.stockx.com/login'
shoe = "https://stockx.com/nike-dunk-high-prm-dark-russet"

headers = {
}

request = requests.get(shoe, headers=headers)
#print(request.text)
soup = BeautifulSoup(request.text, 'lxml')
print(soup.prettify())
but I keep getting
`Access to this page has been denied`
You're attempting a very difficult task as a lot of these websites have impeccable bot detection.
You can try mimicking your browser's headers by copying them from your browser's network request menu (Ctrl+Shift+I, then go to the Network tab). The most important is your user-agent; add it to your headers. It will look something like this:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
However, you'll probably be met with some sort of captcha or other issues. It's a very uphill battle, my friend. You will want to look into understanding HTTP requests better, and into proxies so you are not IP limited. But even then, these websites have much more advanced methods of detecting a Python script/bot, such as TLS fingerprinting and various other sorts of fingerprinting.
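If you do experiment with proxies, requests takes them per request; a minimal sketch (the proxy address and credentials below are placeholders, not a real service):

import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}

# Placeholder proxy -- substitute one you actually control or rent.
proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

response = requests.get("https://stockx.com/nike-dunk-high-prm-dark-russet",
                        headers=headers, proxies=proxies, timeout=10)
print(response.status_code)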
Your best bet would be to try and find the actual API that your target website uses, if it is exposed.
Otherwise, there is not much you can do except accept that what you are doing is against their Terms of Service.
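If you do find an exposed JSON endpoint in the browser's Network tab (filter by XHR/Fetch), calling it is usually simpler than parsing HTML. A minimal sketch; the endpoint below is purely hypothetical and only illustrates the shape of the request:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36",
    "Accept": "application/json",
}

# Hypothetical endpoint -- replace with whatever URL the site's own
# front end calls when it loads the product page.
api_url = "https://example.com/api/products/nike-dunk-high-prm-dark-russet"

response = requests.get(api_url, headers=headers, timeout=10)
if response.ok:
    print(response.json())
else:
    print("Blocked or not found:", response.status_code)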
import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
page = requests.get("https://sky.lea.moe/stats/PapaGordsmack/", headers=headers)
html_contents = page.text
print(html_contents)
I am trying to web scrape the sky.lea.moe website for a specific user, but when I request the HTML and print it, it is different from the one shown in the browser (in Chrome, viewing page source).
The one I get is: https://pastebin.com/91zRw3vP
Analyzing this one, it seems to be something about checking the browser and redirecting. Any ideas what I should do?
This is Cloudflare's anti-DoS protection, and it is effective at stopping scraping. A JS script will usually redirect you after a few seconds.
Something like Selenium is probably your best option for getting around it, though you might be able to scrape the JS file and get the URL to redirect. You could also try spoofing your referrer to be this page, so it goes to the correct one.
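A minimal sketch of the Selenium route, assuming chromedriver is installed and on your PATH (the ten-second pause is an arbitrary guess at how long the check takes):

import time
from selenium import webdriver

# Launch a real Chrome instance driven by chromedriver.
driver = webdriver.Chrome()
driver.get("https://sky.lea.moe/stats/PapaGordsmack/")

# Give Cloudflare's JS check a few seconds to run and redirect.
time.sleep(10)

html_contents = driver.page_source
print(html_contents)
driver.quit()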
Browsers indeed do more than just download a webpage. They also download additional resources, parse styles, and so on. To scrape webpages it is often advised to use a scraping framework like Scrapy, which handles these requests for you and provides a complete library to easily extract information from the pages.
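A minimal Scrapy spider for the same page might look like this (note that Scrapy on its own still won't execute Cloudflare's JavaScript check):

import scrapy

class StatsSpider(scrapy.Spider):
    name = "stats"
    start_urls = ["https://sky.lea.moe/stats/PapaGordsmack/"]

    def parse(self, response):
        # Pull out whatever you need with CSS or XPath selectors.
        yield {"title": response.css("title::text").get()}

You can run it with `scrapy runspider stats_spider.py -o stats.json`.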
I am trying to scrape Google News search results using Python's requests to get links to different articles. I get the links using Beautiful Soup.
The problem is that although in the browser's source view all the links look normal, after the operation they are changed: they all start with "/url?q=", and after the "core" of the link there is a string of characters starting with "&". Also, some characters inside the link are changed; for example, the URL:
http://www.azonano.com/news.aspx?newsID=35576
changes to:
http://www.azonano.com/news.aspx%newsID%35576
I'm using standard "getting started" code:
import requests, bs4
url_list = list()
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.select('h3 > a'):
    url_list.append(link.get('href'))
# First link on google news page is:
# https://www.theengineer.co.uk/graphene-sensor-could-speed-hepatitis-diagnosis/
print url_list[0]  # this line will print the url modified by requests
I know it's possible to get around this problem by using Selenium, but I'd like to know where the root cause of this problem lies with requests (or, more plausibly, not with requests but with the way I'm using it).
Thanks for any help!
You're comparing what you see in a browser with what requests generates (i.e. with no user agent header). If you specify one before making the initial request, the result will reflect what you would see in a web browser. It looks like Google serves such requests differently:
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'} # I just used a general Chrome 41 user agent header
res = requests.get(url, headers=headers)
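If some hrefs still come back wrapped, you can also unwrap them yourself with the standard library rather than relying on the header alone; a minimal sketch (urllib.parse is Python 3; on Python 2 the module is urlparse):

from urllib.parse import urlparse, parse_qs  # Python 2: from urlparse import urlparse, parse_qs

def unwrap_google_href(href):
    # Turns '/url?q=http://www.azonano.com/news.aspx%3FnewsID%3D35576&sa=...'
    # back into 'http://www.azonano.com/news.aspx?newsID=35576' by reading
    # the 'q' query parameter (parse_qs also undoes the percent-encoding).
    if href.startswith('/url?'):
        query = parse_qs(urlparse(href).query)
        return query.get('q', [href])[0]
    return href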
I am trying to make a small program that downloads subtitles for movie files.
I noticed, however, that following a link in Chrome and opening it with urllib2.urlopen() do not give the same results.
As an example, let's consider the link http://www.opensubtitles.org/en/subtitleserve/sub/5523343 . In Chrome this redirects to http://osdownloader.org/en/osdownloader.subtitles-for.you/subtitles/5523343 , which after a little while downloads the file I want.
However, when I use the following code in Python, I get redirected to another page:
import urllib2
url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
response = urllib2.urlopen(url)
if response.url == url:
    print "No redirect"
else:
    print url, " --> ", response.url
Result: http://www.opensubtitles.org/en/subtitleserve/sub/5523343 --> http://www.opensubtitles.org/en/subtitles/5523343/the-musketeers-commodities-en
Why does that happen? How can I follow the same redirect as with the browser?
(I know that these sites offer APIs in Python, but this is meant as practice in Python and playing with urllib2 for the first time.)
There's a significant difference between the request you're making from Chrome and the one your script makes using urllib2 above, and that is the HTTP User-Agent header (https://en.wikipedia.org/wiki/User_agent).
opensubtitles.org probably identifies that you're trying to retrieve the webpage programmatically, and is blocking it. Try using one of the User-Agent strings from Chrome (more here: http://www.useragentstring.com/pages/Chrome/):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36
in your script.
See this question on how to edit your script to support a custom User-Agent header - Changing user agent on urllib2.urlopen.
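In short, it is only a couple of lines; a minimal sketch with urllib2, reusing the question's URL and the Chrome string above:

import urllib2

url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36'}

# urllib2.Request lets you attach custom headers before opening the URL.
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
print response.url  # see where the redirect ends up now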
I would also like to recommend using the requests library for Python instead of urllib2, as the API is much easier to understand - http://docs.python-requests.org/en/latest/.
I looked here and here for information on my issue, but with no luck.
I made some Python code that is intended to grab a webpage's source, as in Safari's Web Inspector. However, I have been getting different code from my application than from Safari's Web Inspector. Here is my code so far:
#!/usr/bin/python
import urllib2
# headers
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.28.10 (KHTML, like Gecko) Version/6.0.3 Safari/536.28.10',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Cache-Control': 'max-age=0'}

# request data
req = urllib2.Request("https://www.google.com/#q=rainbow&safe=active", headers=hdr)

# try to get data
try:
    page = urllib2.urlopen(req)
    print page.info()
except urllib2.HTTPError, e:
    print e.fp.read()
content = page.read()
#print content
print content
And the headers match up to what is in Web Inspector. The code returned is different, though, for a Google search for "rainbow".
My python:
http://paste.ubuntu.com/6270549/
Web Inspector:
http://paste.ubuntu.com/6270606/
As far as I know, it seems that my code is missing a large number of the ubiquitous }catch(e){gbar_._DumpException(e)} lines that are present in the Web Inspector code. Also, my code only has 78 lines, while the Web Inspector code has 235 lines. Does this mean that my code is not getting all of the JavaScript or some other portion of the webpage? How can I get my code to retrieve the same data as the Web Inspector?
You are using the wrong link for a Google search; the correct link should be:
https://www.google.com/search?q=rainbow&safe=active
instead of:
https://www.google.com/#q=rainbow&safe=active
The second link causes a redirect to Google's homepage when used in Python, because it is incorrect (for some reason) when not used in Safari. This is why the code is different.
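For reference, the question's script only needs that one URL swapped; a minimal sketch:

#!/usr/bin/python
import urllib2

# Same headers as before, but pointed at the /search endpoint instead of the "#" fragment.
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.28.10 (KHTML, like Gecko) Version/6.0.3 Safari/536.28.10',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Cache-Control': 'max-age=0'}

req = urllib2.Request("https://www.google.com/search?q=rainbow&safe=active", headers=hdr)
page = urllib2.urlopen(req)
print page.read()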