import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
page = requests.get("https://sky.lea.moe/stats/PapaGordsmack/", headers=headers)
html_contents = page.text
print(html_contents)
I am trying to web-scrape the sky.lea.moe website for a specific user, but when I request the HTML and print it, it is different from the one shown in the browser (in Chrome, viewing the page source).
The one I get is: https://pastebin.com/91zRw3vP
Analyzing it, it seems to be something about checking the browser and redirecting. Any ideas what I should do?
This is Cloudflare's anti-DoS protection, and it is effective at stopping scraping: a JS script will usually redirect you after a few seconds.
Something like Selenium is probably your best option for getting around it, though you might be able to scrape the JS file and extract the URL it redirects to. You could also try spoofing your Referer header to be this page, so it goes to the correct one.
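The Referer-spoofing idea might be sketched with requests like this; the request is only prepared here, not sent, and a Cloudflare JS check may well still block it when it is actually sent:

```python
import requests

# A minimal sketch of the Referer-spoofing idea from above. The URL is the
# stats page from the question; the request is prepared but never sent, so
# the headers can be inspected without touching the network.
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/35.0.1916.47 Safari/537.36')
headers = {
    'User-Agent': user_agent,
    'Referer': 'https://sky.lea.moe/stats/PapaGordsmack/',  # claim we came from this page
}

prepared = requests.Request(
    'GET', 'https://sky.lea.moe/stats/PapaGordsmack/', headers=headers
).prepare()
print(prepared.headers['Referer'])
```

Sending it would be `requests.Session().send(prepared)`; even so, a check that runs JavaScript in the browser cannot be satisfied this way, which is why Selenium is the more reliable route.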
Browsers indeed do more than just download a webpage. They also download additional resources, parse stylesheets, and so on. To scrape a webpage it is advisable to use a scraping library like Scrapy, which does all these things for you and provides a complete toolkit for extracting information from pages.
I'm having trouble finding which URL I have to send the POST request to, along with the necessary credentials and/or headers.
I tried
import requests
from bs4 import BeautifulSoup

callback = 'https://accounts.stockx.com/login/callback'
login = 'https://accounts.stockx.com/login'
shoe = "https://stockx.com/nike-dunk-high-prm-dark-russet"
headers = {
}

request = requests.get(shoe, headers=headers)
#print(request.text)
soup = BeautifulSoup(request.text, 'lxml')
print(soup.prettify())
but I keep getting
`Access to this page has been denied`
You're attempting a very difficult task, as a lot of these websites have sophisticated bot detection.
You can try mimicking your browser's headers by copying them from your browser's network request menu (Ctrl+Shift+I, then go to the Network tab). The most important is your User-Agent; add it to your headers. It will look something like this:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
However, you'll probably be met with some sort of CAPTCHA or other issues. It's a very uphill battle, my friend. You will want to look into understanding HTTP requests better, and into proxies so you are not IP-limited. But even then, these websites have much more advanced methods of detecting a Python script or bot, such as TLS fingerprinting and various other kinds of fingerprinting.
Your best bet would be to try to find the actual API that your target website uses, if it is exposed.
Otherwise there is not much you can do, except accept that what you are doing is against their Terms of Service.
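As an illustration, the header-copying advice above might look like this with a requests.Session; the header values are examples lifted from a desktop Chrome and are not guaranteed current, and a site doing TLS fingerprinting will still see through it:

```python
import requests

# A sketch of copying a browser's headers wholesale. Copy your own values
# from the Network tab; the ones below are examples, not guaranteed current.
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

# A Session applies these headers (and keeps any cookies the site sets)
# across all requests made through it.
session = requests.Session()
session.headers.update(browser_headers)
print(session.headers['User-Agent'])
```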
I'm a Korean who just started learning Python.
First, I apologize for my English.
I learned how to use BeautifulSoup on YouTube, and on certain sites my crawling was successful.
However, I found that crawling did not go well on certain other sites, and through searching I learned that I had to set a user-agent.
So I used requests to write code that sets a user-agent. Then I used the same code as before to read a particular class from the HTML, but it did not work.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
url ='https://store.leagueoflegends.co.kr/skins'
response = requests.get(url, headers = headers)
soup = BeautifulSoup(response.text, 'html.parser')
for skin in soup.select(".item-name"):
    print(skin)
Here's my code. I have no idea what the problem is.
Please help me.
Your problem is that requests does not render JavaScript; it only gives you the "initial" source code of the page. What you should use is a package called Selenium. It lets you control your browser (Chrome, Firefox, etc.) from Python. The website won't be able to tell the difference and you won't need to mess with headers and user-agents. There are plenty of videos on YouTube on how to use it.
First, I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:
import requests
from bs4 import BeautifulSoup
baseurl ='https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
As output I do not get the expected HTML page but another HTML page that says:
Misbehaving Content Scraper
Please use robots.txt
Your IP has been rate limited
To check the problem I wrote:
try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        html_page = requests.get(baseurl).text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))
Then I get 429 (too many requests).
What can I do to handle this problem? Does it mean I cannot print the HTML of the page, and does it prevent me from scraping any of its content? Should I rotate the IP address?
If you are only hitting the page once and getting a 429, it's probably not you hitting them too much. You can't be sure the 429 error is accurate; it's simply what their webserver returned. I've seen pages return a 404 response code when the page was fine, and a 200 response code on genuinely missing pages, simply because of a misconfigured server. They may just return 429 for any bot. Try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:
requests.get(baseurl, headers = {'User-agent': 'Super Bot Power Level Over 9000'})
to declare yourself as a bot, or
requests.get(baseurl, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
if you wish to mimic a browser more closely. Note all the version numbers when mimicking a browser; at the time of this writing those are current, but you may need later version numbers. Just find the user agent of the browser you use; this page will tell you what it is:
https://www.whatismybrowser.com/detect/what-is-my-user-agent
Some sites return better, more easily searchable code if you just say you are a bot; for others it's the opposite. It's basically the wild west; you have to just try different things.
Another pro tip: you may have to write your code to have a 'cookie jar', or a way to accept a cookie. Usually it is just an extra line in your request, but I'll leave that for another Stack Overflow question :)
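For what it's worth, that extra line usually comes for free with a requests.Session, which keeps a cookie jar automatically; a minimal offline sketch (the cookie name here is made up):

```python
import requests

# A Session stores cookies set by responses and sends them back on later
# requests to the same site. Here a cookie is set by hand (no network
# call); 'session_id' is a made-up name for illustration.
session = requests.Session()
session.cookies.set('session_id', 'abc123')
print(session.cookies.get('session_id'))
```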
If you are indeed hitting them a lot, you need to sleep between calls. This is a server-side response completely controlled by them. You will also want to investigate how your code interacts with robots.txt; that's a file, usually at the root of the webserver, with the rules it would like your spider to follow.
You can read about that here: Parsing Robots.txt in Python
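The standard library can parse robots.txt for you; a small sketch with made-up rules, no network access involved:

```python
from urllib import robotparser

# Made-up robots.txt rules for illustration only.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Check paths against the rules and read the requested crawl delay.
print(rp.can_fetch("*", "https://example.com/private/page"))
print(rp.can_fetch("*", "https://example.com/index.html"))
print(rp.crawl_delay("*"))
```

In real use you would call `rp.set_url(...)` and `rp.read()` to fetch the live file instead of feeding `parse` a literal string.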
Spidering the web is fun and challenging; just remember that you could be blocked at any time, by any site, for any reason. You are their guest, so tread nicely :)
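The "sleep between calls" advice above can be sketched as a small retry helper; the function name and parameters here are illustrative, not from any library:

```python
import time

def fetch_with_backoff(get, url, max_retries=3, base_delay=1.0):
    """Call get(url), retrying with exponential backoff on HTTP 429.

    `get` is any callable returning an object with a .status_code
    attribute (e.g. requests.get); it is passed in as a parameter so
    the sketch can be exercised without network access.
    """
    response = get(url)
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
        response = get(url)
    return response
```

With requests this would be called as `fetch_with_backoff(requests.get, baseurl)`; a polite crawler would also honour the site's Crawl-delay from robots.txt.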
I am trying to make a small program that downloads subtitles for movie files.
I noticed, however, that following a link in Chrome and opening it with urllib2.urlopen() do not give the same results.
As an example let's consider the link http://www.opensubtitles.org/en/subtitleserve/sub/5523343 . In chrome this redirects to http://osdownloader.org/en/osdownloader.subtitles-for.you/subtitles/5523343 which after a little while downloads the file I want.
However, when I use the following code in python, I get redirected to another page:
import urllib2
url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
response = urllib2.urlopen(url)
if response.url == url:
    print "No redirect"
else:
    print url, " --> ", response.url
Result: http://www.opensubtitles.org/en/subtitleserve/sub/5523343 --> http://www.opensubtitles.org/en/subtitles/5523343/the-musketeers-commodities-en
Why does that happen? How can I follow the same redirect as with the browser?
(I know that these sites offer APIs in python, but this is meant as practice in python and playing with urllib2 for the first time)
There's a significant difference between the request you're making from Chrome and the one your script makes using urllib2 above, and that is the HTTP header User-Agent (https://en.wikipedia.org/wiki/User_agent).
opensubtitles.org probably identifies that you're trying to retrieve the webpage programmatically, and is blocking it. Try using one of the User-Agent strings from Chrome (more here: http://www.useragentstring.com/pages/Chrome/):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36
in your script.
See this question on how to edit your script to support a custom User-Agent header - Changing user agent on urllib2.urlopen.
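For reference, here is what that looks like in Python 3's urllib.request (the Python 2 urllib2 version differs only in the module name); the request is built but not sent:

```python
import urllib.request

# User-Agent string from the answer above; any current browser string works.
ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/41.0.2227.1 Safari/537.36')

req = urllib.request.Request(
    "http://www.opensubtitles.org/en/subtitleserve/sub/5523343",
    headers={'User-Agent': ua},
)
# urllib normalizes header names to capitalized form internally,
# hence 'User-agent' here.
print(req.get_header('User-agent'))

# Sending it would then be: response = urllib.request.urlopen(req)
```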
I would also like to recommend using the requests library for Python instead of urllib2, as the API is much easier to understand - http://docs.python-requests.org/en/latest/.
I have a strange problem parsing the Herald Sun webpage to get its list of RSS feeds. When I look at the webpage in the browser, I can see the links with titles. However, when I used Python and Beautiful Soup to parse the page, the response did not even contain the section I wanted to parse.
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.71 (KHTML, like Gecko) Version/7.0 Safari/537.71',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib.request.Request("http://www.heraldsun.com.au/help/rss", headers=hdr)
try:
    page = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.fp.read())

html_doc = page.read()
f = open("Temp/original.html", 'w')
f.write(html_doc.decode('utf-8'))
The written file, as you can check, does not have the results in it, so obviously Beautiful Soup has nothing to work with here.
I wonder how the webpage enables this protection and how to overcome it? Thanks,
For commercial use, read the terms of service first.
There is really not that much information the server knows about who is making the request: essentially the IP, the User-Agent, and cookies. Also, urllib2 will not grab information that is generated by JavaScript.
JavaScript or Not?
(1) You need to open the Chrome developer tools and disable the cache and JavaScript to make sure that you can still see the information you want. If you cannot see it there, you have to use a tool that supports JavaScript, like Selenium or PhantomJS.
However, in this case, your website does not look that sophisticated.
User-Agent? Cookie?
(2) Then the problem comes down to tuning the User-Agent or cookies. As you tried before, the user agent alone seems not to be enough, so it will be the cookie that does the trick.
As you can see, the first page call actually returns 'temporarily unavailable', and you need to click the RSS HTML link that comes back with a 200 return code. You just need to copy the user-agent and cookies from there and it will work.
Here is the code showing how to add a cookie using urllib2:
import urllib2, bs4, re
opener = urllib2.build_opener()
opener.addheaders = [("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")]
# I omitted the cookie here and you need to copy and paste your own
opener.addheaders.append(('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'))
soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"))
div = soup.find('div', {"id":"content-2"}).find('div', {"class":"group-content"})
for a in div.find_all('a'):
    try:
        if 'feeds.news' in a['href']:
            print a
    except:
        pass
And here are the outputs:
Breaking News
Top Stories
World News
Victoria and National News
Sport News
...
The site could very likely be serving different content, depending on the User-Agent string in the headers. Websites will often do this for mobile browsers, for example.
Since you're not specifying one, urllib is going to use its default:
By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number.
You could try spoofing a common User-Agent string by following the advice in this question. See What's My User Agent?