I am trying to do an automated task in Python using the mechanize module:
Enter the keyword in a web form, submit the form.
Look for a specific element in the response.
This works once. But when I repeat the task for a list of keywords, I get HTTP Error 429 (Too Many Requests).
I tried the following to work around this:
Adding custom headers (I captured them for that specific website by using a proxy) so that the request looks like a legitimate browser request:
br = mechanize.Browser()
# Each assignment to br.addheaders replaces the previous one, so all the
# headers have to be set in a single list.
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
    ('Connection', 'keep-alive'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
    ('Upgrade-Insecure-Requests', '1'),
    ('Accept-Encoding', 'gzip, deflate, sdch'),
    ('Accept-Language', 'en-US,en;q=0.8'),
]
Since the blocked response was coming on every 5th request, I tried sleeping for 20 seconds after every 5 requests.
Neither of the two methods worked.
You need to limit the rate of your requests to conform to what the server's configuration permits. (Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? may show the permitted rate)
mechanize uses a heavily-patched version of urllib2 (Lib/site-packages/mechanize/_urllib2.py) for network operations, and its Browser class is a descendant of its _urllib2_fork.OpenerDirector.
So the simplest way to patch its logic seems to be to add a handler to your Browser object that:
- has a default_open method and an appropriate handler_order to place it before every other handler (lower is higher priority),
- stalls until the request is eligible, e.g. with a token bucket or leaky bucket algorithm as implemented in Throttling with urllib2 (note that a bucket should probably be per-domain or per-IP),
- and finally returns None to push the request on to the following handlers.
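A minimal sketch of such a handler, assuming mechanize's Browser still exposes OpenerDirector.add_handler, and using a simple fixed-interval stall rather than a full token/leaky bucket (the 10-requests-per-minute limit is an assumption; use whatever rate the server actually permits):
import time
import mechanize

class ThrottleHandler(mechanize.BaseHandler):
    # Runs before the default handlers (their handler_order is around 500;
    # lower means higher priority).
    handler_order = 100

    def __init__(self, max_per_minute=10):
        self.min_interval = 60.0 / max_per_minute
        self.last_request = 0.0

    def default_open(self, request):
        # Stall until enough time has passed since the previous request.
        wait = self.min_interval - (time.time() - self.last_request)
        if wait > 0:
            time.sleep(wait)
        self.last_request = time.time()
        return None  # hand the request on to the following handlers

br = mechanize.Browser()
br.add_handler(ThrottleHandler(max_per_minute=10))
A per-domain version would keep one such timestamp (or bucket) per request host instead of a single global one.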
Since this is a common need, you should probably publish your implementation as an installable package.
Related
I'm having trouble finding which URL I need to send the POST request to, along with the necessary credentials and/or headers.
I tried
import requests
from bs4 import BeautifulSoup

callback = 'https://accounts.stockx.com/login/callback'
login = 'https://accounts.stockx.com/login'
shoe = "https://stockx.com/nike-dunk-high-prm-dark-russet"

headers = {}

request = requests.get(shoe, headers=headers)
# print(request.text)
soup = BeautifulSoup(request.text, 'lxml')
print(soup.prettify())
but I keep getting
`Access to this page has been denied`
You're attempting a very difficult task, as a lot of these websites have sophisticated bot detection.
You can try mimicking your browser's headers by copying them from your browser's network panel (Ctrl+Shift+I, then go to the Network tab). The most important is your User-Agent; add it to your headers. It will look something like this:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
However, you'll probably be met with some sort of CAPTCHA or other obstacle. It's a very uphill battle, my friend. You will want to understand HTTP requests better, and look into proxies so you aren't IP-limited. Even then, these websites have much more advanced methods of detecting a Python script or bot, such as TLS fingerprinting and various other kinds of fingerprinting.
Your best bet would be to try to find the actual API that your target website uses, if it is exposed.
Otherwise there is not much you can do except accept that what you are doing is against their Terms of Service.
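If the site does expose a JSON API (look for XHR/fetch entries in that same network tab), calling it directly is usually far easier than scraping the rendered HTML. A rough sketch; the endpoint below is entirely hypothetical, so substitute whatever request the network tab actually shows:
import requests

# Hypothetical endpoint - replace with the XHR request the network tab shows.
api_url = "https://stockx.com/api/products/nike-dunk-high-prm-dark-russet"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36",
    "Accept": "application/json",
}

response = requests.get(api_url, headers=headers)
if response.ok:
    print(response.json())
else:
    print("Blocked or not found:", response.status_code)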
I am having a problem accessing a URL from Ruby, although it works with Python's requests library.
Here is what I am doing: I want to access this link, https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN, start a session with it, and then hit https://www.nseindia.com/api/option-chain-equities?symbol=SBIN' in the same session. This answer really helped me a lot, but I need to do this in Ruby. I have tried rest-client, net/http, httparty, and httpclient; even when I simply do this
require 'rest-client'
request = RestClient.get 'https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN'
It hangs indefinitely with no response. I tried the same thing with headers too, but still no response.
Any help would be appreciated.
Thanks.
Are you able to confirm that RestClient is working for other urls, such as google.com?
require 'rest-client'
RestClient.get "https://www.google.com"
For what it's worth, I was able to make a successful GET request to Google through RestClient, but not with the URL you provided. However, I was able to get a response by specifying a User-Agent in the headers:
require 'rest-client'
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27"
=> Hangs...
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27", {"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"}
=> RestClient::Unauthorized: 401 Unauthorized
I assume there is some authentication required if you want to get any useful data from the api.
I am trying to download a ZIP file from this website. I have looked at other questions like this one, and tried using requests and urllib, but I get the same error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
Any ideas on how to open the file straight from the web?
Here is some sample code
from urllib.request import urlopen
response = urlopen('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip')
The linked URL will redirect indefinitely; that's why you get the 302 error.
You can examine this yourself over here. As you can see, the linked URL immediately redirects to itself, creating a single-URL loop.
Works for me using the Requests library
import requests
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url)
# Unzip it into a local directory if you want
import zipfile, io
zip = zipfile.ZipFile(io.BytesIO(response.content))
zip.extractall("/path/to/your/directory")
Note that sometimes trying to access web pages programmatically leads to 302 responses because they only want you to access the page via a web browser.
If you need to fake this (don't be abusive), just set the 'User-Agent' header to be like a browser. Here's an example of making a request look like it's coming from a Chrome browser.
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
headers = {'User-Agent': user_agent}
requests.get(url, headers=headers)
There are several libraries (e.g. https://pypi.org/project/fake-useragent/) to help with this for more extensive scraping projects.
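For example, with fake-useragent the User-Agent string can be generated instead of hard-coded (a small sketch; ua.random picks a real-world browser string at random):
import requests
from fake_useragent import UserAgent

url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
ua = UserAgent()
headers = {'User-Agent': ua.random}  # or ua.chrome, ua.firefox, ...
response = requests.get(url, headers=headers)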
I am trying to scrape this site, but while getting the data the site runs a DDoS check on me: it checks for about 5 seconds and then redirects to the same URL. The page opens in a normal browser, but when I request the same thing in Python it just returns the DDoS check page. Is there any way I can bypass that, or any workaround?
This is my code, thanks :)
import requests
from urllib2 import build_opener
import time
import json
url = 'https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
page = requests.get(url, headers = headers)
print page.text
Using a headless browser will work. Use PhantomJS with the Selenium WebDriver to scrape such sites, or ones which use AJAX to load content.
I found these links useful.
https://www.guru99.com/selenium-python.html
https://vocuzi.in/blog/preventing-website-web-scrapers/
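A rough sketch of that approach (PhantomJS support has since been removed from newer Selenium releases, so this assumes an older Selenium with the PhantomJS binary on your PATH; a headless Chrome or Firefox driver is used the same way):
from selenium import webdriver

url = 'https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed'

driver = webdriver.PhantomJS()
driver.get(url)            # the JavaScript DDoS check runs inside the headless browser
print(driver.page_source)  # should now contain the real content, not the check page
driver.quit()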
Anti-DDoS solutions usually take various parameters into account when inspecting a request's validity. For example, your geographical location may be a huge factor: when trying to reproduce your issue, I'm getting a 200 response, meaning the anti-DDoS has decided to allow my code to access the site.
I would suggest using a VPN/proxy service such as this one, or, if this is a system destined for production, a paid service, as these are much more reliable. Notice that some anti-DDoS services are robust enough to block many proxy IPs as well.
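With requests, routing the traffic through a proxy is just a matter of passing a proxies dict; the proxy address below is a placeholder:
import requests

url = 'https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed'
proxies = {
    'http': 'http://user:password@proxy.example.com:8080',   # placeholder
    'https': 'http://user:password@proxy.example.com:8080',  # placeholder
}
page = requests.get(url, proxies=proxies, timeout=30)
print(page.status_code)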
One of my scripts runs perfectly on an XP system, but the exact same script hangs on a 2003 system. I always use mechanize to send the HTTP request; here's an example:
import socket, mechanize, urllib, urllib2
socket.setdefaulttimeout(60) #### No idea why it's not working
MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)', 'Referer': 'http://www.porn-w.org/ucp.php?mode=login'}
Request = urllib2.Request("http://google.com", None, Header)
Response = MechBrowser.open(Request)
I don't think there's anything wrong with my code, but each time it comes to a certain HTTP POST request to a specific URL, it hangs on that 2003 computer (only on that URL). What could be the reason for all this, and how should I debug it?
By the way, the script ran fine until several hours ago, and no settings were changed.
You could use Fiddler or Wireshark to see what is happening at the HTTP level.
It is also worth checking whether the machine has been blocked from making requests to the machine you are trying to access. Use a regular browser (with your own HTML form) and the HTTP library used by mechanize, and see if you can manually construct a request. Fiddler can also help you do this.
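For that last suggestion, here is a minimal sketch of constructing the same kind of request with plain urllib2 (the target URL and form fields are placeholders; use the ones from your actual form), so you can tell whether the hang comes from mechanize or from the network layer underneath:
import urllib
import urllib2

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)'}
data = urllib.urlencode({'field': 'value'})  # placeholder form data

request = urllib2.Request('http://example.com/post-target', data, headers)  # placeholder URL
response = urllib2.urlopen(request, timeout=60)  # explicit per-request timeout
print(response.read())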