urllib.error.HTTPError: HTTP Error 413: Payload Too Large - python

I am scraping a variety of pages (the_url) within a large website using the following code:
import urllib.request
opener = urllib.request.build_opener()
url = opener.open(the_url)
contents_of_webpage = url.read()
url.close()
contents_of_webpage = contents_of_webpage.decode("utf-8")
This works fine for almost every page but occasionally I get:
urllib.error.HTTPError: HTTP Error 413: Payload Too Large
Looking for solutions I come up against answers of the form: well a web server may choose to give this as a response... as if there was nothing to be done - but all of my browsers can read the page without problem and presumably my browsers should be making the same kind of request. So surely there exists some kind of solution... For example can you ask for a web page a little bit at a time to avoid a large payload?

It depends heavily on the site and the URL you're requesting. To avoid this problem, check whether the endpoint supports pagination: many sites and APIs accept GET parameters such as ?offset=<int>&limit=<int> or similar.
UPD: besides that, urllib is not very good at emulating browser behavior.
So you could try making the same request with requests, or setting the User-Agent header to the value your browser sends.
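As a minimal sketch of the User-Agent suggestion using only urllib (the library from the question): the header value below is a placeholder, and the helper names are made up for illustration. Whether this avoids the 413 depends entirely on the server.

```python
import urllib.request

# Placeholder value (assumption): copy the real string from your own browser.
BROWSER_USER_AGENT = "Mozilla/5.0"


def build_browser_like_request(the_url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries a browser-style User-Agent header."""
    return urllib.request.Request(the_url, headers={"User-Agent": user_agent})


def fetch_page(the_url):
    """Fetch the_url and decode the body as UTF-8 (performs network I/O)."""
    request = build_browser_like_request(the_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling fetch_page(the_url) would then replace the opener.open(...) / read() / decode() sequence from the question.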

Related

Python scraping HTTPError: 403 Client Error: Forbidden for url:

My python code used to work, but when I tried it today it did not work anymore.
I assume the website owner recently started blocking non-browser requests.
code
import requests, bs4
res = requests.get('https://manga1001.com/日常-raw-free/')
res.raise_for_status()
print(res.text)
I read that adding a header to the requests.get call may work, but I don't know exactly which header info I need to make it work.
error
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-15-ed1948d83d51> in <module>
3 # res = requests.get('https://manga1001.com/日常-raw-free/', headers=headers_dic)
4 res = requests.get('https://manga1001.com/日常-raw-free/')
----> 5 res.raise_for_status()
6 print(res.text)
7
~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
939
940 if http_error_msg:
--> 941 raise HTTPError(http_error_msg, response=self)
942
943 def close(self):
HTTPError: 403 Client Error: Forbidden for url: https://manga1001.com/%E6%97%A5%E5%B8%B8-raw-free/
requests.get accepts a headers argument:
res = requests.get('https://manga1001.com/日常-raw-free/', headers="")
I think adding a proper value here could make it work, but I don't know what the value is.
I would really appreciate it if you could tell me.
If you know any other ways to make it work, that would also be very helpful.
By the way, I have also tried the code below, but it didn't work either.
code 2
from requests_html import HTMLSession
url = "https://search.yahoo.co.jp/realtime"
session = HTMLSession()
r = session.get(url)
r = r.html.render()
print(r)
FYI, HTMLSession may not work in IDEs like Jupyter Notebook, so I tried it after saving it as a Python file, but it still did not work.
When I run the first code without res.raise_for_status(), I can see in the HTML the text "Why do I have to complete a CAPTCHA?" and a Cloudflare Ray ID, which shows what the problem is. The site uses Cloudflare to detect scripts/bots/hackers/spammers, and it uses a CAPTCHA to check. But if I send a 'User-Agent' header with the value from a real browser, or even just the short 'Mozilla/5.0', then I get the expected page.
It works for me with both pages.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0'
}
url = 'https://manga1001.com/日常-raw-free/'
#url = 'https://search.yahoo.co.jp/realtime'
res = requests.get(url, headers=headers)
print('status_code:', res.status_code)
print(res.text)
BTW:
If you run it often for many links in a short time, the CAPTCHA may appear again, and then you may need other methods to behave more like a real human, e.g. sleep() with a random time, Session() to keep cookies, getting the main page first (to get fresh cookies) before getting this page, or adding other headers.
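The random-sleep idea can be sketched roughly as follows. The function name, parameters and delay bounds are assumptions chosen for illustration; fetch is any callable you supply, for example the get method of a requests.Session() so that cookies are reused between calls.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch is any callable that takes a URL and returns a result.
    sleep is injectable so the pauses can be observed or skipped in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomised pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With requests this could be used as fetch_politely(urls, session.get), where session = requests.Session() and session.headers already carries the User-Agent.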
I wanted to expand on the answer given by @furas because his fix will not be the solution in all cases. Yes, in this instance you're getting the 403 and the Cloudflare security CAPTCHA page because your request does not "score" high enough with the security system (your HTTP client isn't similar enough to a real browser).
This raises a big question: what is a real browser, and what score do I need to reach? How do I increase my score and make my HTTP-request-based client look more real to the bot protection?
Firstly, it's important to understand that these 403/security blocks are based on different levels of security. Something that works on one site may not work on another because of different security configurations or versions. Two sites may use the same security system, and the same request may still only work on one of them.
Why would they have different configurations instead of everyone using the highest security available? Because each additional security measure brings more false positives and more challenges to pass. At large scale, or for an e-commerce store, this can mean lost sales due to a poor user experience, or additional bugs and downtime introduced by the security program.
What is a real browser?
A real browser can perform SSL/TLS handshakes, parse and run JavaScript, and make TCP requests. Along with this, the security programs will analyze the patterns and timings of everything from Layer 2 to see if you're a "real" human. When you use something like Python to make a request that only performs an HTTP(S) request, it's really easy for these security programs to recognise you as a bot unless you do some heavy configuration.
One way that security systems combat bots is by putting a JavaScript challenge as a proxy between the bot and the site. This requires running client-side JavaScript, which bots cannot do by default. Not only do you need to run the client-side JavaScript, the result also needs to be similar to what your own browser would generate. The challenge can consist of a few hundred individual "browser" checks, with or without a manual CAPTCHA, used to fingerprint and track your browser to see if you're a human (this is the page you're seeing).
The typical, lower-standard security systems/configurations can be beaten by using the correct headers (including capitalization, header order and HTTP version). As @furas mentioned, using consistent sessions can also help you last longer before getting another 403. More advanced, higher-level security configurations can track at lower levels by looking at some flags of the TCP connection (such as WindowSize) and by JA3 fingerprinting, which analyzes the TLS handshake and looks at your cipher suites and ALPN, amongst other things. Security systems can see characteristics that differ between browsers, browser versions and operating systems, and compare these together to generate your realness score. Your IP can also be an important factor: requests can be cross-checked against other sites, intervals, older requests you made before, and much more. You can use proxies to spread your requests and look less suspicious, but this can come with additional problems and affect your request, also causing it to be fingerprinted and blocked.
To understand this better, here's a great site you can go to in your browser and also make a GET request to, check your browser "Rank" and look at the different values which can be seen just from the TLS request alone.
I hope this provides some insight into why a block might appear, although it's impossible to tell from a single URL since blocks can appear for such a variety of different reasons.

Unable to get complete source code of web page using Python [duplicate]

I am trying to send a requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and I have tried different approaches, but they all failed.
All other websites work fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
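Once both requests have been recorded, the comparison step can be sketched as a small header diff. This is only an illustration: the function name is hypothetical, and the header dicts would come from whatever you captured (for example the JSON an echo endpoint returns, or your browser's developer tools).

```python
def diff_headers(working, failing):
    """Compare two header dicts case-insensitively.

    Returns a dict mapping each differing header name (lower-cased) to a
    (working_value, failing_value) tuple; None marks a missing header.
    """
    working_lc = {name.lower(): value for name, value in working.items()}
    failing_lc = {name.lower(): value for name, value in failing.items()}
    differences = {}
    for name in sorted(set(working_lc) | set(failing_lc)):
        if working_lc.get(name) != failing_lc.get(name):
            differences[name] = (working_lc.get(name), failing_lc.get(name))
    return differences
```

Each entry in the result is a candidate change to try on the failing request, one at a time, so the set of changes stays minimal.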
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
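If you do request that AJAX URL directly, it can help to treat its query string as data rather than as a fixed string. The sketch below only manipulates the URL; the helper name is made up, and what each parameter means to the site (including which region values are valid) is not known from this answer.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

AJAX_URL = (
    "https://rent.591.com.tw/home/search/rsList"
    "?is_new_list=1&type=1&kind=0&searchtype=1&region=1"
)


def with_query_overrides(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))
```

For example, with_query_overrides(AJAX_URL, region=3) keeps every other parameter and changes only region (the value 3 is an arbitrary illustration).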
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
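As a rough sketch of the "extract the extra information" step: many forms carry the CSRF token in a hidden input, which can be pulled out of the HTML returned by the initial GET and sent back with the POST. The class and function names are hypothetical, and the field name a given site uses for its token is an assumption you would have to check in the page source.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html_text):
    """Return a dict of hidden form fields found in html_text."""
    collector = HiddenInputCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hidden_fields
```

The returned dict can be merged into the form data of the follow-up POST, made with the same session so the cookies match the token.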
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
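import requests
with open("filepath", "r") as file:  # open for reading; 'w' would truncate the file
    links = file.read().splitlines()  # splitlines() drops the trailing \n or \r\n
for link in links:
    response = requests.get(link)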
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Python requests, change IP address

I am coding a web scraper for the website with the following Python code:
import requests
def scrape(url):
    req = requests.get(url)
    with open('out.html', 'w') as f:
        f.write(req.text)
It works a few times but then an error HTML page is returned by the website (when I open my browser, I have a captcha to complete).
Is there a way to avoid this “ban” by for example changing the IP address?
As already mentioned in the comments and by yourself, changing the IP could help. To do this quite easily, have a look at vpngate.py:
https://gist.github.com/Lazza/bbc15561b65c16db8ca8
A how-to is provided at the link.
You can use a proxy with the requests library. You can find some free proxies at a couple different websites like https://www.sslproxies.org/ and http://free-proxy.cz/en/proxylist/country/US/https/uptime/level3 but not all of them work and they should not be trusted with sensitive information.
example:
proxy = {
    "https": 'https://158.177.252.170:3128',
    "http": 'https://158.177.252.170:3128'
}
response = requests.get('https://httpbin.org/ip', proxies=proxy)
I recently answered this on another question here, but using the requests-ip-rotator library to rotate IPs through API gateway is usually the most effective way.
It's free for the first million requests per region, and it means you won't have to give your data to unreliable proxy sites.
Late answer: I found this while looking for IP spoofing. As for the OP's question, as some comments point out, you may or may not actually be getting banned. Here are two things to consider:
A soft ban: they don't like bots. Simple solution that's worked for me in the past is to add headers, so they think you're a browser, e.g.,
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
On-page active elements, scripts or popups that act as content gates, not a ban per se - e.g., country/language selector, cookie config, surveys, etc. requiring user input. Not-as-simple solution: use a webdriver like Selenium + chromedriver to render the page including JS and then add "user" clicks to deal with the problems.

Extract HTML-Content from URL of Site that probably uses Cookies via Python

I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about those procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
#...
response = requests.get(url, proxies=proxies)
content = response.text
The website I am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1, and proxies is a valid dict of my proxy servers (I tested those settings on other websites, where they worked fine). However, instead of the content of the article on this site, I receive the HTML content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what the website is doing and lack real web development experience, I could not find a solution so far, even though a similar question might have been asked before. Is there any way to access the content of this website via Python?
startr = requests.get('https://viennaairport.com/login/')
secondr = requests.post('http://xxx/', cookies=startr.cookies)
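The two lines above hand the cookies from the first response to the second request manually. A minimal standard-library sketch of the same idea, where an opener stores cookies and sends them back automatically, is shown here; the function name is made up, and whether this is enough for a particular site is an assumption to verify.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Usage (performs network I/O, so it is left commented out):
# opener, jar = build_cookie_opener()
# opener.open(first_url)    # cookies set by this response land in jar
# opener.open(second_url)   # and are sent back automatically here
```

With the requests library, a requests.Session() object plays the same role.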

What is the difference between HTTP Post URL with /post and without using Python requests module?

I am using Python 2.7 with the requests module to send http post with parameters. I encountered a strange problem.
To do an HTTP POST, it is just one line:
x = requests.post(URL, params)
I have no problem with the params. It is the URL that puzzled me.
Sometimes the URL http://hostname/path/post works. Sometimes I have to use http://hostname/path without the /post to get the HTTP POST to work. I am puzzled why this is so. What is the difference between the two? Under what conditions do I use which one?
'http://hostname/path/post' is a path. You could in principle issue an HTTP GET request to that same path (although you probably wouldn't get anything meaningful back).
In general, you should look at the site's API documentation and post to the url that they say you should post to without adding anything extra to the url.
There are two different concepts here, the URL and the HTTP method. You are confusing yourself by mixing them.
URL - an address you talk to
The URL addresses something on some server. If you are given a valid URL, take it as a string and use it as-is; do not read any meaning into it.
To compare it to visiting a friend, the URL is the address of the door to go to.
HTTP method (POST, GET, DELETE...)
There are multiple HTTP methods, which differ in how you talk to the given URL.
In the visiting-a-friend comparison, the method is the way you try to get the door opened (ring the bell, knock, or use a hammer).
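To make the separation concrete, here is a small sketch using urllib.request.Request from the standard library (chosen only because the objects can be inspected without sending anything). The URL is the placeholder from the question, and the same string is used with three different methods.

```python
import urllib.request

URL = "http://hostname/path/post"  # placeholder URL from the question

# Same address, different ways of talking to it:
get_request = urllib.request.Request(URL)  # no body, so the method is GET
post_request = urllib.request.Request(URL, data=b"key=value")  # body present, so POST
delete_request = urllib.request.Request(URL, method="DELETE")  # explicit method

for request in (get_request, post_request, delete_request):
    print(request.get_method(), request.full_url)
```

The trailing /post in the path does not make a request a POST; only the method does.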
