My code works when I make the request from my local machine.
When I try to make the request from AWS EC2 I get the following error:
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www1.xyz.com', port=443): Read timed out. (read timeout=20)
I checked the URL and that was not the issue. I then tried visiting the URL through the HideMyAss web proxy with the location set to the AWS EC2 machine's region, and it returned a 404.
The code:
# Dummy URLs
import requests

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = 'https://www1.xyz.com/iKeys.jsp?symbol={}&date=31DEC2020'.format(symbol)
raw_page = requests.get(url, timeout=10, headers=header).text
I have also tried setting the proxies parameter of the request to another IP address that I found online:
proxies = {
    "http": "http://125.99.100.193",
    "https": "https://125.99.100.193",
}
raw_page = requests.get(url, timeout=10, headers=header, proxies=proxies).text
Still got the same error.
1- Do I need to specify the port in the proxies dict? Could that be causing the error when the proxy is set?
2- What could be a solution for this?
Thanks
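On question 1: yes, a proxy URL normally needs an explicit port, because without one requests falls back to the scheme default (80 or 443), which is rarely the port a proxy server actually listens on. A minimal sketch, assuming a hypothetical port (3128 here is only a common example, not the real port of that host):

```python
def make_proxies(host, port):
    """Build a requests-style proxies mapping with an explicit port."""
    # Without a port, the scheme default (80 for http, 443 for https) is
    # assumed, which is rarely what a proxy server listens on.
    return {
        "http": "http://{}:{}".format(host, port),
        # HTTPS traffic is usually tunneled (CONNECT) through the same HTTP proxy.
        "https": "http://{}:{}".format(host, port),
    }

# 3128 is a common proxy port, used here purely as an example.
proxies = make_proxies("125.99.100.193", 3128)
# raw_page = requests.get(url, timeout=10, headers=header, proxies=proxies).text
```

Note that free proxy IPs found online are frequently dead or blocked, so a timeout through one of them does not tell you much about your own code.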
Related
I'm trying to scrape this website: https://triller.co/. I want to get information from profile pages like https://triller.co/#warnermusicarg, so I request the JSON URL that contains the information, in this case https://social.triller.co/v1.5/api/users/by_username/warnermusicarg
When I use requests.get() it works normally and I can retrieve all the information.
import requests
from urllib.parse import urlencode

url = 'https://social.triller.co/v1.5/api/users/by_username/warnermusicarg'
headers = {
    'authority': 'social.triller.co',
    'method': 'GET',
    'path': '/v1.5/api/users/by_username/warnermusicarg',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'ar,en-US;q=0.9,en;q=0.8',
    'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsImlhdCI6MTY0MDc4MDc5NSwiZXhwIjoxNjkyNjIwNzk1fQ.eyJpZCI6IjUyNjQ3ODY5OCJ9.Ds-acbfcGSeUrGDSs47pBiT3b13Eb9SMcB8BF8OylqQ',
    'origin': 'https://triller.co',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
}
response = requests.get(url, headers=headers)
The problem arises when I try to use an API proxy provider such as Webscraping.ai, ScrapingBee, etc.:
api_key = 'my_api_key'
api_url = 'https://api.webscraping.ai/html?'
params = {'api_key': api_key, 'timeout': '20000', 'url': url}
proxy_url = api_url + urlencode(params)
response2 = requests.get(proxy_url, headers=headers)
This gives me this error:
2022-01-08 22:30:59 [urllib3.connectionpool] DEBUG: https://api.webscraping.ai:443 "GET /html?api_key=my_api_key&timeout=20000&url=https%3A%2F%2Fsocial.triller.co%2Fv1.5%2Fapi%2Fusers%2Fby_username%2Fwarnermusicarg&render_js=false HTTP/1.1" 502 91
{'status_code': 403, 'status_message': '', 'message': 'Unexpected HTTP code on the target page'}
What I tried:
1- I looked up the meaning of the 403 code in my API proxy provider's documentation; it says the api_key is wrong, but I'm 100% sure it's correct.
2- I changed to another API proxy provider, but had the same issue.
3- I also had the same issue with twitter.com.
I don't know what to do.
Currently, the code in the question successfully returns a response with code 200, but there are 2 possible issues:
1- Some sites block datacenter proxies; try the proxy=residential API parameter (params = {'api_key': api_key, 'timeout': '20000', 'proxy': 'residential', 'url': url}).
2- Some of the headers in your headers parameter are unnecessary. Webscraping.AI uses its own set of headers to mimic the behavior of normal browsers, so setting a custom user-agent, accept-language, etc. may interfere with them and cause 403 responses from the target site. Use only the necessary headers; in your case that looks like just the authorization header.
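Putting both points together, a minimal sketch of the call (the api_key is the question's placeholder, and the bearer token is elided here):

```python
from urllib.parse import urlencode

api_url = 'https://api.webscraping.ai/html?'
api_key = 'my_api_key'  # placeholder, as in the question
url = 'https://social.triller.co/v1.5/api/users/by_username/warnermusicarg'

# Ask the service for a residential exit IP, and forward only the one
# custom header the target API actually needs.
params = {'api_key': api_key, 'timeout': '20000', 'proxy': 'residential', 'url': url}
proxy_url = api_url + urlencode(params)
headers = {'authorization': 'Bearer <your token>'}
# response2 = requests.get(proxy_url, headers=headers)
```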
I don't know exactly what caused this error, but I tried using their webscraping_ai.ApiClient() instance as in here, and it worked:
import webscraping_ai

configuration = webscraping_ai.Configuration(
    host = "https://api.webscraping.ai",
    api_key = {
        'api_key': 'my_api_key'
    }
)
with webscraping_ai.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = webscraping_ai.HTMLApi(api_client)
    url_j = url  # str | URL of the target page
    headers = headers
    timeout = 20000
    js = False
    proxy = 'datacenter'
    api_response = api_instance.get_html(url_j, headers=headers, timeout=timeout, js=js, proxy=proxy)
I am trying to access https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050. It works fine from my localhost (run from VS Code), but when I deploy it on the server I get an HTTP 499 error.
Did anybody get through this and was able to fetch the data using this approach?
It looks like NSE is blocking the request somehow. But then how is it working from localhost?
P.S. - I am a paid user of pythonAnywhere (Hacker) subscription
import requests
import time

def marketDatafn(query):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
    # Hit the home page first to pick up the cookies NSE requires.
    main_url = "https://www.nseindia.com/"
    session = requests.Session()
    response = session.get(main_url, headers=headers)
    cookies = response.cookies
    url = "https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%2050"
    nifty50DataReq = session.get(url, headers=headers, cookies=cookies, timeout=15)
    nifty50DataJson = nifty50DataReq.json()  # was nifty100DataReq, an undefined name
    return nifty50DataJson['data']
Actually, PythonAnywhere only supports websites that are on this whitelist.
And I have found that there are only two subdomains available under "nseindia.com", neither of which is the one you are trying to request:
bricsonline.nseindia.com
bricsonlinereguat.nseindia.com
So PythonAnywhere is blocking you from sending requests to that website.
Here's the link to read more about how to request that your website be added to the whitelist.
I'm using requests in Python 3.8 in order to connect to an Amazon web page.
I'm also using tor, in order to connect via SOCKS5.
This is the relevant piece of code:
import requests

session = requests.session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                      'Chrome/44.0.2403.157 Safari/537.36'})
anon = {'http': "socks5://localhost:9050", 'https': "socks5://localhost:9050"}
r = session.get("myurl", proxies=anon)
print(r.content)
However, it doesn't work: Amazon returns a 503 error. What I need to know is whether there is some way around this, or whether it comes down to some sort of IP blocking.
Thank you
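Two things worth checking in this setup (a sketch, not a guaranteed fix): requests needs the PySocks extra installed (pip install "requests[socks]") for socks5:// proxies to work at all, and using the socks5h scheme makes DNS resolution also go through Tor instead of leaking locally. If the 503 persists even then, Amazon is most likely blocking the Tor exit node's IP, which you cannot fix on the client side:

```python
def tor_proxies(port=9050):
    """Proxies mapping for a local Tor SOCKS listener.

    'socks5h' (note the trailing h) resolves hostnames through the proxy,
    so DNS lookups also go via Tor; plain 'socks5' resolves them locally.
    The default port 9050 matches the question's setup.
    """
    addr = "socks5h://localhost:{}".format(port)
    return {"http": addr, "https": addr}

anon = tor_proxies()
# r = session.get("myurl", proxies=anon)  # requires: pip install "requests[socks]"
```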
I am trying to web scrape an HTTP website, and I get the error below when I try to read it:
HTTPSConnectionPool(host='proxyvipecc.nb.xxxx.com', port=83): Max retries exceeded with url: http://campanulaceae.myspecies.info/ (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden',)))
Below is the code I have written for a similar website. I tried using urllib and a User-Agent, and still have the same issue:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'})
soup = BeautifulSoup(response.text, 'html.parser')
Can anyone help me with this issue? Thanks in advance.
You should try adding a proxy when requesting the URL:
proxyDict = {
    'http': "add http proxy",
    'https': "add https proxy"
}
requests.get(url, proxies=proxyDict)
You can find more information here.
I tried using User-Agent: Defined and it worked for me:
url = "http://campanulaceae.myspecies.info/"
headers = {
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "Defined",
}
response = requests.get(url, headers=headers)
response.raise_for_status()
data = response.text
soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify())
If you get an error that says "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html-parser", it means you're not using the right parser: install the lxml module, then pass "lxml" instead of "html.parser" when you make the soup.
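For illustration, a small sketch of that parser choice with a graceful fallback (the sample HTML is made up; lxml is only used if it is installed):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Campanulaceae</h1></body></html>"  # made-up sample

try:
    soup = BeautifulSoup(html, "lxml")  # faster C parser; needs: pip install lxml
except Exception:  # bs4 raises FeatureNotFound when lxml is missing
    soup = BeautifulSoup(html, "html.parser")  # stdlib fallback, no extra install

print(soup.h1.text)
```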
I want to check the login status, so I made a program to check it:
import requests
import json
import datetime
headers = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7",
    "Connection": "keep-alive",
    "Content-Length": "50",
    "Content-Type": "application/json;charset=UTF-8",
    "Cookie": "_ga=GA1.2.290443894.1570500092; _gid=GA1.2.963761342.1579153496; JSESSIONID=A4B3165F23FBEA34B4BBE429D00F12DF",
    "Host": "marke.ai",
    "Origin": "http://marke",
    "Referer": "http://marke/event2/login",
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Mobile Safari/537.36",
}
url = "http://mark/api/users/login"
va = {"username": "seg", "password": "egkegn"}
c = requests.post(url, data=json.dumps(va), headers=headers)
if c.status_code != 200:
    print("error")
This works very well locally on Windows with PyCharm, but when I ran the code on Linux I got an error like this:
requests.exceptions.ProxyError: HTTPConnectionPool(host='marke', port=80):
Max retries exceeded with url: http://marke.ai/api/users/login (
Caused by ProxyError('Cannot connect to proxy.',
NewConnectionError('<urllib3.connection.HTTPConnection>: Failed to establish a new connection: [Errno 110] Connection timed out',)
)
)
So, what is the problem? If you know the solution, please teach me! Thank you.
According to your error, it seems you are behind a proxy.
So you have to specify your proxy parameters when building your request.
Build your proxies as a dict following this format:
proxies = {
    "http": "http://my_proxy:my_port",
    "https": "https://my_proxy:my_port"
}
If you don't know your proxy parameters, you can get them using the urllib module:
import urllib.request

proxies = urllib.request.getproxies()
There's a proxy server configured on that Linux host, and it can't connect to it.
Judging by the documentation, you may have a PROXY_URL environment variable set.
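To see what that Linux host is actually configured with, you can compare what urllib detects against the proxy-related environment variables directly (PROXY_URL here is only the variable name suggested above, not a standard one):

```python
import os
import urllib.request

# getproxies() reads http_proxy / https_proxy (and their uppercase
# variants) from the environment on Linux.
detected = urllib.request.getproxies()

# Any proxy-ish variables set on this host, for comparison.
env_vars = {k: v for k, v in os.environ.items()
            if k.lower() in ("http_proxy", "https_proxy", "no_proxy", "proxy_url")}

print(detected)
print(env_vars)
```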
Modifying @Arkenys' answer, please try this:
import urllib.request
proxies = urllib.request.getproxies()
# all other things
c = requests.post(url, data=json.dumps(va), headers=headers, proxies=proxies)