I am using the Python requests get method to query the MediaWiki API, but it takes a long time to receive a response. The same requests come back very quickly in a web browser. I have the same issue when requesting google.com. Here is the sample code I am trying with Python 3.5 on Windows 10:
response = requests.get("https://www.google.com")
response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
response = requests.get("http://en.wikipedia.org/w/api.php?", params={'action':'query', 'format':'json', 'titles':'Labor_mobility'})
However, I don't face this issue retrieving other websites like:
response = requests.get("http://www.stackoverflow.com")
response = requests.get("https://www.python.org/")
This sounds like an issue with the underlying connection to those particular servers, because requests to other URLs work. A few possibilities come to mind:
The server might only allow specific user-agent strings
Try adding innocuous headers, e.g.:
requests.get("https://www.example.com", headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"})
The server rate-limits you
Wait a few minutes, then try again. If this solves your issue, slow your code down by adding time.sleep() between requests to avoid being rate-limited again (see the sketch after this list).
IPv6 does not work, but IPv4 does
Verify by executing curl --ipv6 -v https://www.example.com. Then, compare to curl --ipv4 -v https://www.example.com. If the latter is significantly faster, you might have a problem with your IPv6 connection. Check here for possible solutions.
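As a rough illustration of the first two points, here is a minimal sketch; the URLs, User-Agent string, and delay are placeholders you would adapt to your case:

import time
import requests

# Placeholder browser-like User-Agent; copy a real one from your own browser if needed
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

for url in ["https://en.wikipedia.org/wiki/Main_Page", "https://www.google.com"]:
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code, response.elapsed)
    time.sleep(2)  # small pause between requests to avoid tripping rate limits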
Didn't solve your issue?
If that did not solve your issue, I have collected some other possible solutions here.
I'm having trouble figuring out which URL I have to send the POST request to, and which credentials and/or headers are necessary.
I tried
import requests
from bs4 import BeautifulSoup

callback = 'https://accounts.stockx.com/login/callback'
login = 'https://accounts.stockx.com/login'
shoe = "https://stockx.com/nike-dunk-high-prm-dark-russet"
headers = {
}
request = requests.get(shoe, headers=headers)
#print(request.text)
soup = BeautifulSoup(request.text, 'lxml')
print(soup.prettify())
but I keep getting
`Access to this page has been denied`
You're attempting a very difficult task, as a lot of these websites have very sophisticated bot detection.
You can try mimicking your browser's headers by copying them from your browser's network tab (press CTRL+SHIFT+I, then go to the Network tab). The most important one is your User-Agent; add it to your headers. It will look something like this:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
However, you'll probably be met with some sort of CAPTCHA or other roadblocks. It's an uphill battle. You'll want to get a better understanding of HTTP requests, and look into proxies so you aren't IP-limited. But even then, these websites have much more advanced methods of detecting a Python script or bot, such as TLS fingerprinting and various other kinds of fingerprinting.
Your best bet would be to try to find the actual API that your target website uses, if it is exposed.
Otherwise there is not much you can do, except accept that what you are doing is against their Terms of Service.
I am having a problem accessing a URL from Ruby, although it works with Python's requests library.
Here is what I am doing: I want to access https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN, start a session with it, and then hit https://www.nseindia.com/api/option-chain-equities?symbol=SBIN in the same session. This answer really helped me a lot, but I need to do it in Ruby. I have tried rest-client, net/http, httparty, and httpclient; even when I simply do this
require 'rest-client'
request = RestClient.get 'https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN'
It hangs indefinitely with no response. I tried the same thing with headers too, but there is still no response.
Any help would be appreciated.
Thanks.
Are you able to confirm that RestClient is working for other urls, such as google.com?
require 'rest-client'
RestClient.get "https://www.google.com"
For what it's worth, I was able to make a successful GET request to Google through RestClient, but not with the URL you provided. However, I was able to get a response by specifying a User-Agent in the headers:
require 'rest-client'
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27"
=> Hangs...
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27", {"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"}
=> RestClient::Unauthorized: 401 Unauthorized
I assume there is some authentication required if you want to get any useful data from the API.
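For reference, the Python approach the question alludes to (hitting the HTML page first so the session collects the cookies the API expects) looks roughly like this; it's a sketch and may stop working if the site changes its checks:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}

session = requests.Session()
session.headers.update(headers)
# Visit the quote page first so the session picks up the cookies the API checks for
session.get("https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN", timeout=10)
data = session.get("https://www.nseindia.com/api/option-chain-equities?symbol=SBIN", timeout=10).json()
print(list(data.keys()))

A Ruby port would follow the same pattern: make the first request, keep the cookies, and send them (plus a browser-like User-Agent) with the API request.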
I'm playing with crawling the Bing web search page using Python.
I find that the raw content received looks like bytes, but my attempt to decompress it has failed.
Does anyone know what kind of data this is, and how I can extract something readable from it? Thanks!
My code displays the raw content and then tries to gunzip it, so you can see the raw content as well as the error from the decompression.
Because the raw content is too long, I have pasted only the first few lines below.
Code:
import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')
req.add_header('upgrade-insecure-requests', 1)
ResPage = Request.urlopen(req).read()
print("RAW Content: %s" % ResPage)  # show raw content of the page
print("Try decompression:")
print(gzip.decompress(ResPage))  # try decompression
Result:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...
Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')
Process finished with exit code 1
It's much easier to get started with the requests library. It is also the most commonly used library for HTTP requests nowadays.
Install requests in your Python environment:
pip install requests
In your .py file:
import requests
r = requests.get("http://www.bing.com")
print(r.text)
OSError: Not a gzipped file (b'+p')
You either need to add an accept-encoding header (gzip or br) to the request headers and then read content-encoding from the response to pick the correct decompressor, or use the requests library instead, which will do all of this for you.
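If you want to stay with urllib, a rough sketch of reading Content-Encoding and choosing the matching decompressor could look like this (the br branch assumes the third-party brotli package is installed):

import gzip
import urllib.request

req = urllib.request.Request('https://www.bing.com', headers={'User-Agent': 'Mozilla/5.0'})
res = urllib.request.urlopen(req)
body = res.read()

# Pick the decompressor that matches the response's Content-Encoding header
encoding = res.headers.get('Content-Encoding', '')
if encoding == 'gzip':
    body = gzip.decompress(body)
elif encoding == 'br':
    import brotli  # third-party package, assumed installed
    body = brotli.decompress(body)

print(body[:500])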
The second problem that might appear: you need to pass a user-agent in the request headers to look like a "real" user visit.
If no user-agent is passed in the request headers when using the requests library, it defaults to python-requests, so Bing or another search engine recognizes that it's a bot/script and blocks the request. Check what your user-agent is.
Pass user-agent using requests library:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
See also: how to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines or figure out different tedious problems such as picking the correct CSS selector if the HTML layout is not the best out there.
Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer: I work for SerpApi.
I am trying to make a small program that downloads subtitles for movie files.
I noticed, however, that following a link in Chrome and opening it with urllib2.urlopen() do not give the same results.
As an example, consider the link http://www.opensubtitles.org/en/subtitleserve/sub/5523343. In Chrome this redirects to http://osdownloader.org/en/osdownloader.subtitles-for.you/subtitles/5523343, which after a little while downloads the file I want.
However, when I use the following code in python, I get redirected to another page:
import urllib2
url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
response = urllib2.urlopen(url)
if response.url == url:
print "No redirect"
else:
print url, " --> ", response.url
Result: http://www.opensubtitles.org/en/subtitleserve/sub/5523343 --> http://www.opensubtitles.org/en/subtitles/5523343/the-musketeers-commodities-en
Why does that happen? How can I follow the same redirect as with the browser?
(I know that these sites offer Python APIs, but this is meant as practice with Python and my first time playing with urllib2.)
There's a significant difference between the request you're making from Chrome and the one your script makes with urllib2 above, and that is the HTTP User-Agent header (https://en.wikipedia.org/wiki/User_agent).
opensubtitles.org probably detects that you're trying to retrieve the webpage programmatically and blocks it. Try using one of the User-Agent strings from Chrome (more here: http://www.useragentstring.com/pages/Chrome/):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36
in your script.
See this question on how to edit your script to support a custom User-Agent header - Changing user agent on urllib2.urlopen.
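A minimal sketch of setting a custom User-Agent with urllib2 might look like this:

import urllib2

url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36"}

# Build the request with the browser-like User-Agent and see where the redirect ends up
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
print response.url  # check whether the redirect now matches what the browser does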
I would also like to recommend using the requests library for Python instead of urllib2, as the API is much easier to understand - http://docs.python-requests.org/en/latest/.
I'm writing a simple Python 3 script to retrieve HTML data. Here's my test script:
import urllib.request
url="http://techxplore.com/news/2015-05-audi-r8-e-tron-aims-high.html"
req = urllib.request.Request(
url,
data=None,
headers={
'User-agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11',
'Referer': 'http://www.google.com'
}
)
f = urllib.request.urlopen(req)
This works fine for most websites but returns the following error for certain ones:
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
The URL shown in the script is one of the sites that returns this error. Based on research from other posts and sites, it seems like manually setting the user-agent and/or the referer should solve the problem, but this script still times out. I'm not sure why this occurs only for certain websites, and I don't know what else to try. I would appreciate any suggestions the community could offer.
I tried the script again today without changing anything, and it worked perfectly. Looks like it was just something strange going on with the remote web server.