Requests.Get fails on one server but works on another - python

I have the following Python 2.7 code:

import requests
from urllib3.util.retry import Retry

s = requests.Session()
http_retries = Retry(3)
https_retries = Retry(3)
http = requests.adapters.HTTPAdapter(max_retries=http_retries)
https = requests.adapters.HTTPAdapter(max_retries=https_retries)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'}
s.mount('http://', http)
s.mount('https://', https)
# URL is defined earlier in the script
response = s.get(URL, headers=headers, timeout=10)
I keep getting

Failed to establish a new connection: [Errno 101] Network is unreachable

when I run the script from an Amazon AWS instance, but on another network it works fine. Any idea why?
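Errno 101 is raised before any HTTP happens: the instance has no route to the address it picked (often an IPv6 address on a host without IPv6 connectivity, or outbound traffic blocked by a security group or network ACL). A minimal diagnostic sketch, assuming you substitute the host and port from URL, is to resolve the name and try a plain socket connection to each returned address:

import socket

host, port = 'example.com', 443  # hypothetical placeholder; use the host/port from URL

for family, socktype, proto, _, addr in socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM):
    s = socket.socket(family, socktype, proto)
    s.settimeout(5)
    try:
        s.connect(addr)
        print('reachable via %s' % (addr,))
    except socket.error as exc:
        print('failed via %s: %s' % (addr, exc))
    finally:
        s.close()

If only the IPv6 addresses fail, look at forcing IPv4 or fixing the instance's IPv6 setup; if everything fails, check the VPC's security group and route table.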

Related

Connection timeouts as a protection from site scraping?

I am new to Python and web scraping, but for the past two weeks I have been periodically scraping one website and successfully downloading images from it. I use different proxies and sometimes change them. But starting yesterday all my proxies suddenly stopped working with a timeout error. I've tried a whole list of them and they all fail.
Could this be some kind of site protection against scraping? If so, is there a way to get around it?
import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
proxies = {
    "http": "http://188.114.99.153",
    "https": "http://180.94.69.66:8080"
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'
html = requests.get(url, headers=header, proxies=proxies, timeout=10).text
soup = BeautifulSoup(html, 'lxml')
Error message:
ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001536A8E7190>, 'Connection to 180.94.69.66 timed out. (connect timeout=10)'))
The code below GETs the URL and retries up to 3 times on ConnectTimeoutError. It also applies a growing delay between attempts, which helps avoid failing again when the timeouts are caused by a periodic request quota.
Take a look at urllib3.util.retry.Retry; it has many options for fine-tuning retries.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

header = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://parovoz.com/newgallery/index.php?&LNG=RU&NO_ICONS=0&CATEG=-1&HOWMANY=192'

session = requests.Session()
# retry up to 3 times on connection errors, sleeping exponentially longer between attempts
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

html = session.get(url, headers=header).text
soup = BeautifulSoup(html, 'lxml')
print(soup)
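If the proxies are being throttled rather than refused outright, Retry can also be told to retry on throttling-style HTTP status codes. This is only a sketch built on the same API; allowed_methods is the parameter name in urllib3 1.26 and later (older releases call it method_whitelist):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# retry connection/read failures and 429/5xx responses, backing off exponentially
retry = Retry(
    total=5,
    connect=3,
    read=2,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
)
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))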

python web scraping IP blocked

I am trying to extract the source code of an HTML page. It was working fine before, but now the web server wants more evidence that I am not a bot. This is the error: your IP is blocked. My IP is definitely not blocked, as I can still open the page manually in any browser. Do I need to change any parameters before making the request? Thanks.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
req = requests.get(url, headers=headers)
url_content = req.content
url_content = url_content.replace(b'data-imgid', b'\ndata-imgid')
output_file = open('downloaded.txt', 'wb')
output_file.write(url_content)
output_file.close()
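Blocking happens on the server side, so no client-side change is guaranteed to work. A first step that is often suggested is to send a fuller set of browser-like headers and to reuse one Session so that any cookies the site sets persist between requests. The header values and URL below are only illustrative placeholders:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

url = "https://example.com/"  # replace with the page from the question
resp = session.get(url, timeout=10)
print(resp.status_code)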

How to access site with requests and SOCKS5 with Python 3

I'm using requests in Python 3.8 to connect to an Amazon web page. I'm also using Tor in order to connect via SOCKS5.
This is the relevant piece of code:

session = requests.session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                      'Chrome/44.0.2403.157 Safari/537.36'})
anon = {'http': "socks5://localhost:9050", 'https': "socks5://localhost:9050"}
r = session.get("myurl", proxies=anon)
print(r.content)

However, it doesn't work: Amazon returns a 503 error. What I need to know is whether there is some way around this, or whether it comes down to a sort of IP blocking.
Thank you
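One thing worth checking, as a sketch rather than a guaranteed fix (Amazon's 503 usually means the request was flagged as automated, and many Tor exit nodes are blocked outright): requests needs the PySocks extra installed for SOCKS proxies, and the socks5h:// scheme makes DNS resolution happen through Tor as well. The check.torproject.org URL below is only there to verify that traffic really goes through Tor:

# pip install requests[socks]   (installs PySocks)
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                                      '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'})

# socks5h:// (note the "h") resolves hostnames through the proxy, i.e. inside Tor
proxies = {'http': 'socks5h://localhost:9050', 'https': 'socks5h://localhost:9050'}

r = session.get('https://check.torproject.org/', proxies=proxies, timeout=30)
print(r.status_code)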

Python Requests Get not Working

I have a simple GET request I'd like to make using Python's Requests library.
import requests
HEADERS = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                          'AppleWebKit/537.36 (KHTML, like Gecko)'
                          'Chrome/45.0.2454.101 Safari/537.36'),
           'referer': 'http://stats.nba.com/scores/'}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
However, when I make the requests.get call, I get the error requests.exceptions.ReadTimeout: HTTPConnectionPool(host='stats.nba.com', port=80): Read timed out. (read timeout=5). But I am able to copy/paste that URL into my browser and view the resulting JSON. Why is requests not able to get the result?
Your HEADERS format is wrong: the adjacent string literals are concatenated without separating spaces, so the User-Agent comes out as '...(KHTML, like Gecko)Chrome/45...'. I tried with this code and it worked without any issues:
import requests
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
print(response.text)
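To see the malformed value the original HEADERS produced, print the implicitly concatenated literal:

# Adjacent string literals are joined with no separator in between:
ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
      'AppleWebKit/537.36 (KHTML, like Gecko)'
      'Chrome/45.0.2454.101 Safari/537.36')
print(ua)
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)AppleWebKit/537.36 (KHTML, like Gecko)Chrome/45.0.2454.101 Safari/537.36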

Error logging into an HTTP server

I'm trying to log into an HTTP server to fetch some tables. The code I'm using is this:

import mechanize
import urllib2

MechBrowser = mechanize.Browser()
LoginUrl = 'http://www.jlrvehiclefeedback.com'
LoginData = "username=my_username&password=my_password&do=login"
LoginHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}

LoginRequest = urllib2.Request(LoginUrl, LoginData, LoginHeader)
LoginResponse = MechBrowser.open(LoginRequest)
However, I get this error:
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
As you can see, I've defined a User-Agent, but I still can't get past the bot policy.
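For context, mechanize honours robots.txt by default, which is what produces that 403, and it has its own request handling. Below is a minimal sketch using standard mechanize APIs (set_handle_robots, addheaders, and the data argument to open); whether POSTing the urlencoded credentials to that URL actually logs in is an assumption about this particular site:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # stop mechanize from enforcing robots.txt
br.addheaders = [('User-Agent',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36')]

# POST the urlencoded credentials, as in the question
response = br.open('http://www.jlrvehiclefeedback.com',
                   data='username=my_username&password=my_password&do=login')
print(response.code)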
