import sys
import pdb
import http.client
def PassParse():
    headers = {"Accept":" application/json, text/plain, */*",
               "Authorization":" Basic YWRtaW46YXNkZg==",
               "Referer":" http://192.168.1.113:8080/#/apps",
               "Accept-Language":" zh-CN",
               "Accept-Encoding":" gzip, deflate",
               "User-Agent":" Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko LBBROWSER",
               "Host":" 192.168.1.113:8080",
               "DNT":" 1",
               "Connection":" Keep-Alive"};
    conn = http.client.HTTPConnection("192.168.1.113:8080");
    conn.request(method="Get",url="/api/v1/login",body=None,headers=headers);
    response = conn.getresponse();
    responseText = response.getheaders("content-lentgh");
    print ("succ!^_^!");
    #print (response.status);
    print (responseText);
    conn.close();
run error:
Traceback (most recent call last):
File "F:\Python\test1-3.4.py", line 32, in <module>
PassParse();
File "F:\Python\test1-3.4.py", line 24, in PassParse
response = conn.getresponse();
File "E:\program files\Python 3.4.3\lib\http\client.py", line 1171, in getresponse
response.begin()
File "E:\program files\Python 3.4.3\lib\http\client.py", line 351, in begin
version, status, reason = self._read_status()
File "E:\program files\Python 3.4.3\lib\http\client.py", line 333, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: <html>
It seems that the HTTP server didn't return a valid HTTP response; you can check it with telnet.
telnet 192.168.1.113 8080
then send:
GET /api/v1/login HTTP/1.1
reference: https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html
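If telnet isn't handy, here is a rough Python equivalent of the same check (a minimal sketch using a raw socket, assuming the same host, port, and path as above):
import socket

s = socket.create_connection(("192.168.1.113", 8080))
s.sendall(b"GET /api/v1/login HTTP/1.1\r\nHost: 192.168.1.113:8080\r\n\r\n")
print(s.recv(4096).decode("latin-1"))  # the first line should be an HTTP status line like "HTTP/1.1 200 OK"
s.close()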
From the Python docs for httplib:
exception httplib.BadStatusLine
A subclass of HTTPException. Raised if a server responds with a HTTP status code that we don’t understand.
Sounds like your API might not be returning something that can be parsed as a valid HTTP response with a valid HTTP status code. You might want to check that the code for your API endpoint is working as expected and is not failing.
Besides that, your code runs fine except for one thing: response.getheader() takes an argument, while response.getheaders() takes no argument, so Python will complain about that once you resolve the BadStatusLine exception.
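For illustration, a minimal sketch of the corrected calls once the server does return a valid HTTP response (assuming "Content-Length" is the header the original "content-lentgh" was meant to be):
# response is the HTTPResponse returned by conn.getresponse()
content_length = response.getheader("Content-Length")   # one header: takes the header name
all_headers = response.getheaders()                      # all headers: takes no argument
print(content_length)
print(all_headers)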
I have solved the problem using the following code:
import requests
from requests.auth import HTTPBasicAuth
res = requests.get('http://192.168.1.113:8080/api/v1/login', auth=(username, password));
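If the goal is still the Content-Length header from the original code, the requests response exposes it as well; assuming res is the response above:
print(res.status_code, res.headers.get('Content-Length'))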
Related
I am trying to build a Python web scraper with beautifulsoup4. If I run the code on my MacBook the script works, but if I run it on my home server (Ubuntu VM) I get the error message below. I tried a VPN connection and multiple headers without success.
I would highly appreciate your feedback on how to get the script working. Thanks!
Here is the error message:
{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7 ChromePlus/1.5.0.0alpha1'}
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
[...]
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
[Finished in 15.9s with exit code 1]
Here is my code:
from bs4 import BeautifulSoup
import requests
import pyuser_agent
URL = f"https://www.edmunds.com/inventory/srp.html?radius=5000&sort=publishDate%3Adesc&pagenumber=2"
ua = pyuser_agent.UA()
headers = {'User-Agent': ua.random}
print(headers)
response = requests.get(url=URL, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
print(overview)
I tried multiple headers, but did not get a result.
Try using a real web browser User-Agent instead of a random one from pyuser_agent. For example:
import requests
from bs4 import BeautifulSoup
URL = f"https://www.edmunds.com/inventory/srp.html?radius=5000&sort=publishDate%3Adesc&pagenumber=2"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"}
response = requests.get(url=URL, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
overview = soup.find()
print(overview)
A possible explanation is that the server keeps a list of real-world User-Agents and doesn't serve any page to non-existent ones.
I'm pretty bad at figuring out the right set of headers and cookies, so in these situations, I often end up resorting to:
either cloudscraper
response = cloudscraper.create_scraper().get(URL)
or HTMLSession - which is particularly nifty in that it also parses the HTML and has some JavaScript support as well
response = HTMLSession().get(URL)
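For completeness, a self-contained sketch of both fallbacks (assuming the cloudscraper and requests-html packages are installed; URL is the same Edmunds URL as above):
import cloudscraper
from requests_html import HTMLSession

URL = "https://www.edmunds.com/inventory/srp.html?radius=5000&sort=publishDate%3Adesc&pagenumber=2"

# Option 1: cloudscraper layers some anti-bot handling on top of requests
response = cloudscraper.create_scraper().get(URL)
print(response.status_code)

# Option 2: requests-html parses the HTML and can also render JavaScript if needed
response = HTMLSession().get(URL)
print(response.status_code)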
I am trying to make a Python 3 script that iterates through a list of mods hosted on a shared website and downloads the latest one. I have gotten stuck on step one: going to the website and getting the mod version list. I am trying to use urllib but it is throwing a 403: Forbidden error.
I thought it might be some sort of anti-scraping rejection from the server, and I read that you could get around it by defining the headers. I ran Wireshark while using my browser and was able to identify the headers it was sending out:
Host: ocsp.pki.goog\r\n
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\r\n
Accept: */*\r\n
Accept-Language: en-US,en;q=0.5\r\n
Accept-Encoding: gzip, deflate\r\n
Content-Type: application/ocsp-request\r\n
Content-Length: 83\r\n
Connection: keep-alive\r\n
\r\n
I believe I was able to define the headers correctly, but I had to back out two entries because they gave a 400 error:
from urllib.request import Request, urlopen
count = 0
mods = ['mod1', 'mod2', ...] #this has been created to complete the URL and has been tested to work
#iterate through all mods and download latest version
while mods:
    url = 'https://Domain/'+mods[count]
    #change the header to the browser I was using at the time of writing the script
    req = Request(url)
    #req.add_header('Host', 'ocsp.pki.goog\\r\\n') #this reports 400 bad request
    req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\\r\\n')
    req.add_header('Accept', '*/*\\r\\n')
    req.add_header('Accept-Language', 'en-US,en;q=0.5\\r\\n')
    req.add_header('Accept-Encoding', 'gzip, deflate\\r\\n')
    req.add_header('Content-Type', 'application/ocsp-request\\r\\n')
    #req.add_header('Content-Length', '83\\r\\n') #this reports 400 bad request
    req.add_header('Connection', 'keep-alive\\r\\n')
    html = urlopen(req).read().decode('utf-8')
This still throws a 403: Forbidden error:
Traceback (most recent call last):
File "SCRIPT.py", line 19, in <module>
html = urlopen(req).read().decode('utf-8')
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/usr/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm not sure what I'm doing wrong. I assume there is something wrong with the way I've defined my header values, but I can't see what. Any help would be appreciated.
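As a side note, a sketch of how those captured headers would normally be passed to urllib. The trailing \r\n in the Wireshark capture is the wire-format line terminator, not part of the header value, so it is left off here; the Host, Content-Type, and Content-Length entries appear to belong to a browser OCSP request rather than the target site, so they are omitted too. The domain is a placeholder, and a 403 may still occur if the server blocks non-browser clients.
from urllib.request import Request, urlopen

url = 'https://Domain/mod1'  # placeholder, as in the question
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    # Accept-Encoding is left out so the body comes back uncompressed and .decode() works
}
req = Request(url, headers=headers)
html = urlopen(req).read().decode('utf-8')
print(html[:200])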
I am using urllib.request to open a page source with Python 3.2.1, but I am getting an error saying urllib.error.HTTPError: HTTP Error 503: Service Unavailable. Please find the code and error below.
Code
import re
import urllib.request
html = urllib.request.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html").read().decode()
print (html)
Error
Traceback (most recent call last):
File "I:/Private/nabm/python/python_challenge/python_challenge_2.py", line 4, in <module>
html = urllib.request.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html").read().decode()
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 375, in open
response = meth(req, response)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 487, in http_response
'http', request, response, code, msg, hdrs)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 413, in error
return self._call_chain(*args)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 347, in _call_chain
result = func(*args)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 495, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
Process finished with exit code 1
Could anyone see what could be causing this error?
HTTP error 503 means that the server wasn't able to respond at that moment, either due to overload or because it refused your connection. In other words, there is nothing you can change in your code to fix it.
I faced the same issue with some URLs, and providing a header helped. When I looked into it more, I found out that servers sometimes identify that a bot is trying to access the website, and to prevent it they give a fake connection error.
from urllib.request import urlopen, Request
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36."}
req = Request("url", headers=header)
response = urlopen(req, timeout=60)
I know it's been a while since this was asked, but I will post how I dealt with the "HTTP Error 503" in case it helps someone else.
First of all, I put request.urlretrieve(...) into a try block to catch the error.
In my case the server I tried to access simply needs time to handle the requests. (The server I accessed is not Amazon.com or the sort of site that is said to prevent programs from accessing its content.)
Within the try block, when the exception occurs, I made the program wait for 20 seconds using time.sleep(20). This enables my program to complete.
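A minimal sketch of that approach as described above (the URL, filename, and the single retry are placeholders; the original may loop or handle the error differently):
import time
from urllib import request
from urllib.error import HTTPError

url = "http://example.com/some/file"   # placeholder
dest = "file.out"                      # placeholder

try:
    request.urlretrieve(url, dest)
except HTTPError:
    time.sleep(20)                     # give the busy server time to recover
    request.urlretrieve(url, dest)     # retry once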
I am really confused. I am new to Python and working on a script that scrapes a website for products on Python 2.7. I am trying to use urllib2 to do this, and when I run the script it prints multiple traceback errors. Suggestions?
Script:
import urllib2, zlib, json
url='https://launches.endclothing.com/api/products'
req = urllib2.Request(url)
req.add_header(':host', 'launches.endclothing.com')
req.add_header(':method', 'GET')
req.add_header(':path', '/api/products')
req.add_header(':scheme', 'https')
req.add_header(':version', 'HTTP/1.1')
req.add_header('accept', 'application/json, text/plain, */*')
req.add_header('accept-encoding', 'gzip,deflate')
req.add_header('accept-language', 'en-US,en;q=0.8')
req.add_header('cache-control', 'max-age=0')
req.add_header('cookie', '__/')
req.add_header('user-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36')
resp = urllib2.urlopen(req).read()
resp = zlib.decompress(bytes(bytearray(resp)), 15+32)
data = json.loads(resp)
for product in data:
    for attrib in product.keys():
        print str(attrib)+' :: '+ str(product[attrib])
    print '\n'
Error(s):
C:\Users\Luke>py C:\Users\Luke\Documents\EndBot2.py
Traceback (most recent call last):
File "C:\Users\Luke\Documents\EndBot2.py", line 5, in <module>
resp = urllib2.urlopen(req).read()
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 391, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 409, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1181, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "C:\Python27\lib\urllib2.py", line 1148, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error>
You're running into issues with the SSL configuration of your request. I'm sorry, but I won't correct your code, because it's 2016 and there's a wonderful library that you should use instead: requests.
So its usage is pretty simple:
>>> import requests
>>> user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
>>> result = requests.get('https://launches.endclothing.com/api/products', headers={'user-agent': user_agent})
>>> result
<Response [200]>
>>> result.json()
[{u'name': u'Adidas Consortium x HighSnobiety Ultraboost', u'colour': u'Grey', u'id': 30, u'releaseDate': u'2016-04-09T00:01:00+0100', …
You'll notice that I changed the user-agent in the previous query to get it working because, weirdly enough, the website refuses API access to requests' default one:
>>> result = requests.get('https://launches.endclothing.com/api/products')
>>> result
<Response [403]>
>>> result.text
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data. What can I do to resolve this? If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware. If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.
Otherwise, now that you've tried requests and your life has changed, you might still run into this issue again. As you can read in many places on the internet, this is related to SNI and outdated libraries, and you might get headaches trying to figure it out. My best advice would be to upgrade to Python 3, as the problem is likely to be solved by installing a new vanilla version of Python and the libraries involved.
HTH
I want to fetch an image (GIF format) from a website, so I use Tornado's built-in asynchronous HTTP client to do it. My code is the following:
import tornado.httpclient
import tornado.ioloop
import tornado.gen
import tornado.web

tornado.httpclient.AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
http_client = tornado.httpclient.AsyncHTTPClient()

class test(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        content = yield http_client.fetch('http://www.baidu.com/img/bdlogo.gif')
        print('=====', type(content.body))

application = tornado.web.Application([
    (r'/', test)
])
application.listen(80)
tornado.ioloop.IOLoop.instance().start()
So when I visit the server it should fetch a GIF file. However, it catches an exception.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 8: invalid start byte
ERROR:tornado.application:Uncaught exception GET / (127.0.0.1)
HTTPRequest(protocol='http', host='127.0.0.1', method='GET', uri='/', version='HTTP/1.1', remote_ip='127.0.0.1', headers={'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3', 'Accept-Encoding': 'gzip, deflate', 'Host': '127.0.0.1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130922 Firefox/17.0', 'Connection': 'keep-alive', 'Cache-Control': 'max-age=0', 'If-None-Match': '"da39a3ee5e6b4b0d3255bfef95601890afd80709"'})
Traceback (most recent call last):
File "/usr/lib/python3.2/site-packages/tornado/web.py", line 1144, in _when_complete
if result.result() is not None:
File "/usr/lib/python3.2/site-packages/tornado/concurrent.py", line 129, in result
raise_exc_info(self.__exc_info)
File "<string>", line 3, in raise_exc_info
File "/usr/lib/python3.2/site-packages/tornado/stack_context.py", line 302, in wrapped
ret = fn(*args, **kwargs)
File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 550, in inner
self.set_result(key, result)
File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 476, in set_result
self.run()
File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 505, in run
yielded = self.gen.throw(*exc_info)
File "test.py", line 12, in get
content = yield http_client.fetch('http://www.baidu.com/img/bdlogo.gif')
File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 496, in run
next = self.yield_point.get_result()
File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 395, in get_result
return self.runner.pop_result(self.key).result()
File "/usr/lib/python3.2/concurrent/futures/_base.py", line 393, in result
return self.__get_result()
File "/usr/lib/python3.2/concurrent/futures/_base.py", line 352, in __get_result
raise self._exception
tornado.curl_httpclient.CurlError: HTTP 599: Failed writing body (0 != 1024)
ERROR:tornado.access:500 GET / (127.0.0.1) 131.53ms
It seems to attempt to decode my binary file as UTF-8 text, which is unnecessary. If I comment
tornado.httpclient.AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
out, which makes it use the simple HTTP client instead of pycurl, it works well (it tells me that the type of "content" is bytes).
So if it returns a bytes object, why does it try to decode it? I think the problem is pycurl or Tornado's wrapper around pycurl, right?
My Python version is 3.2.5, Tornado 3.1.1, pycurl 7.19.
Thanks!
pycurl 7.19 doesn't support Python 3. Ubuntu (and possibly other Linux distributions) ship a modified version of pycurl that partially works with Python 3, but it doesn't work with Tornado (https://github.com/facebook/tornado/issues/671), and fails with an exception that looks like the one you're seeing here.
Until there's a new version of pycurl that officially supports Python 3 (or you use the change suggested in that Tornado bug report), I'm afraid you'll need to either go back to Python 2.7 or use Tornado's simple_httpclient instead.
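For reference, a minimal sketch of the simple_httpclient route suggested above (same GIF URL as the question; the output filename is a placeholder):
import tornado.gen
import tornado.httpclient
import tornado.ioloop

# No AsyncHTTPClient.configure(...) call, so Tornado falls back to its
# pure-Python simple_httpclient, which leaves the response body as bytes.

@tornado.gen.coroutine
def fetch_gif():
    client = tornado.httpclient.AsyncHTTPClient()
    response = yield client.fetch('http://www.baidu.com/img/bdlogo.gif')
    print(type(response.body))            # <class 'bytes'>
    with open('bdlogo.gif', 'wb') as f:   # write the binary body untouched
        f.write(response.body)

tornado.ioloop.IOLoop.instance().run_sync(fetch_gif)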