Python urllib2 - cannot read a page

I am using urllib2 in Python to scrape a webpage, but the read() call never returns; it just hangs.
Here is the code I am using:
import urllib2
url = 'http://edmonton.en.craigslist.ca/kid/'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, headers=headers)
f_webpage = urllib2.urlopen(request)
html = f_webpage.read() # <- does not return
I last ran the script a month ago and it was working fine then.
Note that the same script runs well for webpages of other categories on Edmonton Craigslist like http://edmonton.en.craigslist.ca/act/ or http://edmonton.en.craigslist.ca/eve/.
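One way to surface the failure instead of blocking forever (an aside, not part of the original script; the 10-second timeout is an arbitrary choice) is to pass a timeout to urlopen, so a hung read() raises instead:
import socket
import urllib2
url = 'http://edmonton.en.craigslist.ca/kid/'
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    f_webpage = urllib2.urlopen(request, timeout=10)  # timeout applies to reads too
    html = f_webpage.read()
except (urllib2.URLError, socket.timeout) as e:
    print 'request failed:', e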

As requested in comments :)
Install requests with $ pip install requests, then use it like this:
>>> import requests
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = requests.get(url, headers=headers)
>>> request.ok
True
>>> request.text # content in string, similar to .read() in question
...
...
Disclaimer: this is not technically an answer to the OP's question, but it does solve the OP's problem, since urllib2 is known to be problematic and the requests library was created to address exactly these pain points.

It returns (or more specifically, errors out) fine for me:
>>> import urllib2
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = urllib2.Request(url, headers=headers)
>>> f_webpage = urllib2.urlopen(request)
>>> html = f_webpage.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 647, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
Chances are that Craigslist is detecting that you are a scraper and refusing to give you the actual page.
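If that is what's happening, about all you can do on the client side is catch the reset and retry after a pause. A minimal sketch (the retry count and delay are arbitrary choices):
import socket
import time
import urllib2
def fetch(url, retries=3, delay=5):
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    last_error = None
    for attempt in range(retries):
        try:
            return urllib2.urlopen(request).read()
        except socket.error as e:  # e.g. Errno 104, connection reset by peer
            last_error = e
            time.sleep(delay)  # back off before trying again
    raise last_error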

I ran into a similar problem. Part of my error output:
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
File "C:\Python27\lib\httplib.py", line 573, in read
s = self.fp.read(amt)
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
error: [Errno 10054]
I solved it by reading the response in small chunks instead of all at once:
def readBuf(fsrc, length=16*1024):
    # read from fsrc in chunks of `length` bytes until EOF
    result = ''
    while True:
        buf = fsrc.read(length)
        if not buf:
            break
        result += buf
    return result
Instead of using html = f_webpage.read(), you can use html = readBuf(f_webpage) to scrape the webpage.

Related

Face API Python

I am trying to use the Face API from Python 2.7. I want to do gender recognition on a photo, but I always get an error. This is my code:
from six.moves import urllib
import httplib
import json
params = urllib.urlencode({
    'returnFaceId': 'true',
    'returnFaceAttributes': 'true',
    'returnFaceAttributes': '{string}',
})
headers = {
    'ocp-apim-subscription-key': "ee6b8785e7504dfe91efb96d37fc7f51",
    'content-type': "application/octet-stream"
}
img = open("d:/Taylor.jpg", "rb")
conn = httplib.HTTPSConnection("api.projectoxford.ai")
conn.request("POST", "/vision/v1.0/tag?%s" % params, img, headers)
res = conn.getresponse()
data = res.read()
conn.close()
I got this error:
Traceback (most recent call last):
  File "<ipython-input-314-df31294bc16f>", line 3, in <module>
    res = conn.getresponse()
  File "d:\Anaconda2\lib\httplib.py", line 1136, in getresponse
    response.begin()
  File "d:\Anaconda2\lib\httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "d:\Anaconda2\lib\httplib.py", line 409, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "d:\Anaconda2\lib\socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize)
  File "d:\Anaconda2\lib\ssl.py", line 756, in recv
    return self.read(buflen)
  File "d:\Anaconda2\lib\ssl.py", line 643, in read
    v = self._sslobj.read(len)
error: [Errno 10054]
If I use a link instead of a photo:
img_url='https://raw.githubusercontent.com/Microsoft/Cognitive-Face-Windows/master/Data/detection1.jpg'
conn = httplib.HTTPSConnection("api.projectoxford.ai")
conn.request("POST", "/vision/v1.0/tag?%s" % params, img_url, headers)
res = conn.getresponse()
data = res.read()
conn.close()
I get:
data
Out[330]: '{ "statusCode": 401, "message": "Access denied due to invalid subscription key. Make sure to provide a valid key for an active subscription." }'
If I use:
KEY = 'ee6b8785e7504dfe91efb96d37fc7f51'
CF.Key.set(KEY)
img_url = 'https://raw.githubusercontent.com/Microsoft/Cognitive-Face-Windows/master/Data/detection1.jpg'
result = CF.face.detect(img_url)
everything works fine:
result
[{u'faceId': u'52c0d1ac-f041-49cd-a587-e81ef67be2fb',
  u'faceRectangle': {u'height': 213,
                     u'left': 154,
                     u'top': 207,
                     u'width': 213}}]
But in this case I don't know how to use the returnFaceAttributes parameter (for gender detection). Also, if I pass img instead of img_url to CF.face.detect(), I get an error:
status_code: 400
response: {"error":{"code":"InvalidImageSize","message":"Image size is too small or too big."}}
Traceback (most recent call last):
  File "<ipython-input-332-3fe2623ccadc>", line 1, in <module>
    result = CF.face.detect(img)
  File "d:\Anaconda2\lib\site-packages\cognitive_face\face.py", line 41, in detect
    data=data)
  File "d:\Anaconda2\lib\site-packages\cognitive_face\util.py", line 84, in request
    error_msg.get('message'))
CognitiveFaceException: Image size is too small or too big.
This happens with images of all sizes.
Could anyone explain how to solve these problems?
It seems like you need to use your API key when connecting to Face API. That's why your second example works and the others don't.
Code 401 means that you're unauthorised, i.e. at that point you didn't log in with your key.
Maybe it'd be easier for you to try and use requests instead of urllib.
The first error is caused by sending the file object instead of the file content; you should send img.read() instead of img.
The second error is caused by the missing subscription key in the request header.
For the last error, I suspect you have already read the image file object once, so reading it again returns an empty result. Try sending a file path instead; that should work with the helper in the Face API Python SDK.
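Putting those three points together, a minimal sketch with the Face API Python SDK (the key is a placeholder, and the attributes parameter name is my assumption about the SDK's detect helper):
import cognitive_face as CF
CF.Key.set('your-subscription-key')  # placeholder; substitute your own active key
# The SDK helper accepts a local file path as well as a URL; asking for
# the 'gender' attribute covers the gender-recognition use case.
result = CF.face.detect('d:/Taylor.jpg', attributes='gender')
print result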

How do I use Python and lxml to parse a local html file?

I am working with a local HTML file in Python, and I am trying to use lxml to parse it. For some reason I can't get the file to load properly, and I'm not sure whether this has to do with not having an HTTP server set up on my local machine, my etree usage, or something else.
My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/
This could be a related problem: Requests : No connection adapters were found for, error in Python3
Here is my code:
from lxml import html
import requests
page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)
test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
print test
The traceback that I'm getting reads:
C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'
Process finished with exit code 1
You can see that it has something to do with a "connection adapter" but I'm not sure what that means.
If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
There is a better way to do this: use the parse function instead of fromstring.
tree = html.parse(r"C:\Users\...site_1.html")  # raw string so the backslashes aren't treated as escapes
print(html.tostring(tree))
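As a follow-up, parse() returns an ElementTree rather than an element, but you can still call xpath() on it directly (the XPath here is a simplified stand-in for the long one in the question):
test = tree.xpath('//strong/text()')  # hypothetical, shortened XPath
print(test)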
You can also try using Beautiful Soup:
from bs4 import BeautifulSoup
import io
# open() has no encoding parameter on Python 2; io.open works on both 2 and 3
with io.open("filepath", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")
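From there, a lookup loosely mirroring the question's XPath might look like this (the tag choice is hypothetical):
for tag in soup.find_all('strong'):
    print(tag.get_text())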

Transcribing sound files to text in python and google speech api

I have a bunch of WAV files. I made a simple script to convert them to FLAC so I can use them with the Google Speech API. Here is the Python code:
import urllib2
url = "https://www.google.com/speech-api/v1/recognize?client=chromium&lang=en-US"
audio = open('somefile.flac','rb').read()
headers={'Content-Type': 'audio/x-flac; rate=16000', 'User-Agent':'Mozilla/5.0'}
request = urllib2.Request(url, data=audio, headers=headers)
response = urllib2.urlopen(request)
print response.read()
However I am getting this error:
Traceback (most recent call last):
  File "transcribe.py", line 7, in <module>
    response = urllib2.urlopen(request)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 392, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 370, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1194, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1161, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 32] Broken pipe>
I thought at first that it was because the file was too big, but I recorded myself for 5 seconds and it still does the same.
I don't think Google has officially released the API yet, so it's hard to understand why it's failing.
Is there any other good speech-to-text API out there that can be used from either Python or Node?
Edit: here is my attempt with requests:
import json
import requests
url = 'https://www.google.com/speech-api/v1/recognize?client=chromium&lang=en-US'
data = {'file': open('file.flac', 'rb')}
headers = {'Content-Type': 'audio/x-flac; rate=16000', 'User-Agent':'Mozilla/5.0'}
r = requests.post(url, data=data, headers=headers)
# r = requests.post(url, files=data, headers=headers) ## does not work either
# r = requests.post(url, data=open('file.flac', 'rb').read(), headers=headers) ## does not work either
print r.text
Produced the same problem as above.
The API accepts HTTP POST requests, while you're sending an HTTP GET request here. This can be confirmed by loading the URI from your code directly in a browser:
Error 405: HTTP method GET is not supported by this URL
Also, I'd recommend using the requests Python library; see http://www.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
Lastly, it seems that the API only accepts segments up to 15 seconds long. Perhaps the file is too large? If you can upload an example FLAC file, perhaps we could diagnose further.
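For reference, a minimal sketch of that POST using requests (the URL and rate parameter are taken from the question; whether the endpoint still accepts them is not something I can verify):
import requests
url = 'https://www.google.com/speech-api/v1/recognize?client=chromium&lang=en-US'
headers = {'Content-Type': 'audio/x-flac; rate=16000', 'User-Agent': 'Mozilla/5.0'}
# Send the raw FLAC bytes as the POST body; keep the clip under the
# ~15-second limit mentioned above.
with open('somefile.flac', 'rb') as f:
    r = requests.post(url, data=f.read(), headers=headers)
print r.status_code
print r.text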

python unhashable type - posting xml data

First, I'm not a python programmer. I'm an old C dog that's learned new Java and PHP tricks, but python looks like a pretty cool language.
I'm getting an error that I can't quite follow. The error appears after the code below.
import httplib, urllib
url = "pdb-services-beta.nipr.com"
xml = '<?xml version="1.0"?><!DOCTYPE SCB_Request SYSTEM "http://www.nipr.com/html/SCB_XML_Request.dtd"><SCB_Request Request_Type="Create_Report"><SCB_Login_Data CustomerID="someuser" Passwd="somepass" /><SCB_Create_Report_Request Title=""><Producer_List><NIPR_Num_List_XML><NIPR_Num NIPR_Num="8980608" /><NIPR_Num NIPR_Num="7597855" /><NIPR_Num NIPR_Num="10166016" /></NIPR_Num_List_XML></Producer_List></SCB_Create_Report_Request></SCB_Request>'
params = {}
params['xmldata'] = xml
headers = {}
headers['Content-type'] = 'text/xml'
headers['Accept'] = '*/*'
headers['Content-Length'] = "%d" % len(xml)
connection = httplib.HTTPSConnection(url)
connection.set_debuglevel(1)
connection.request("POST", "/pdb-xml-reports/scb_xmlclient.cgi", params, headers)
response = connection.getresponse()
print response.status, response.reason
data = response.read()
print data
connection.close
Here's the error:
Traceback (most recent call last):
  File "C:\Python27\tutorial.py", line 14, in <module>
    connection.request("POST", "/pdb-xml-reports/scb_xmlclient.cgi", params, headers)
  File "C:\Python27\lib\httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\lib\httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "C:\Python27\lib\httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 818, in _send_output
    self.send(message_body)
  File "C:\Python27\lib\httplib.py", line 790, in send
    self.sock.sendall(data)
  File "C:\Python27\lib\ssl.py", line 229, in sendall
    v = self.send(data[count:])
TypeError: unhashable type
My log file says that the xmldata parameter is empty.
Any ideas?
I guess params has to be a string when passed to .request; that would explain the error, since a dict is not hashable.
Try encoding your params first with
params = urllib.urlencode(params)
You can find another code example at the bottom of http://docs.python.org/release/3.1.5/library/http.client.html
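Applied to the question's code, the whole thing might look like this sketch (switching Content-type to application/x-www-form-urlencoded is my assumption, since the body is now a form-encoded string rather than raw XML):
import httplib, urllib
xml = '<?xml version="1.0"?>...'  # the XML document from the question
params = urllib.urlencode({'xmldata': xml})  # dict -> 'xmldata=...' string
headers = {
    'Content-type': 'application/x-www-form-urlencoded',
    'Accept': '*/*',
}
connection = httplib.HTTPSConnection('pdb-services-beta.nipr.com')
connection.request("POST", "/pdb-xml-reports/scb_xmlclient.cgi", params, headers)
response = connection.getresponse()
print response.status, response.reason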
Thanks for the feedback.
I guess I was making this too hard. I went a different route and it seems to work.
import urllib2
URL = "https://pdb-services-beta.nipr.com/pdb-xml-reports/scb_xmlclient.cgi"
DATA = 'xmldata=<?xml version="1.0"?><!DOCTYPE SCB_Request SYSTEM "http://www.nipr.com/html/SCB_XML_Request.dtd"><SCB_Request Request_Type="Create_Report"><SCB_Login_Data CustomerID="someuser" Passwd="somepass" /><SCB_Create_Report_Request Title=""><Producer_List><NIPR_Num_List_XML><NIPR_Num NIPR_Num="8980608" /></NIPR_Num_List_XML></Producer_List></SCB_Create_Report_Request></SCB_Request>'
req = urllib2.Request(url=URL, data=DATA)
f = urllib2.urlopen(req)
print f.read()

In Python 3.2, I can open and read an HTTPS web page with http.client, but urllib.request is failing to open the same page

I want to open and read https://yande.re/ with urllib.request, but I'm getting an SSL error. I can open and read the page just fine using http.client with this code:
import http.client
conn = http.client.HTTPSConnection('www.yande.re')
conn.request('GET', 'https://yande.re/')
resp = conn.getresponse()
data = resp.read()
However, the following code using urllib.request fails:
import urllib.request
opener = urllib.request.build_opener()
resp = opener.open('https://yande.re/')
data = resp.read()
It gives me the following error: ssl.SSLError: [Errno 1] _ssl.c:392: error:1411809D:SSL routines:SSL_CHECK_SERVERHELLO_TLSEXT:tls invalid ecpointformat list. Why can I open the page with HTTPSConnection but not opener.open?
Edit: Here's my OpenSSL version and the traceback from trying to open https://yande.re/
>>> import ssl; ssl.OPENSSL_VERSION
'OpenSSL 1.0.0a 1 Jun 2010'
>>> import urllib.request
>>> urllib.request.urlopen('https://yande.re/')
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    urllib.request.urlopen('https://yande.re/')
  File "C:\Python32\lib\urllib\request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python32\lib\urllib\request.py", line 369, in open
    response = self._open(req, data)
  File "C:\Python32\lib\urllib\request.py", line 387, in _open
    '_open', req)
  File "C:\Python32\lib\urllib\request.py", line 347, in _call_chain
    result = func(*args)
  File "C:\Python32\lib\urllib\request.py", line 1171, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Python32\lib\urllib\request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 1] _ssl.c:392: error:1411809D:SSL routines:SSL_CHECK_SERVERHELLO_TLSEXT:tls invalid ecpointformat list>
What a coincidence! I'm having the same problem as you are, with an added complication: I'm behind a proxy. I found this bug report regarding https-not-working-with-urllib. Luckily, they posted a workaround.
import urllib.request
import ssl
## uncomment this code if you're behind a proxy
## https port is 443 but it doesn't work for me, used port 80 instead
##proxy_auth = '{0}://{1}:{2}#{3}'.format('https', 'username', 'password',
##                                        'proxy:80')
##proxies = {'https': proxy_auth}
##proxy = urllib.request.ProxyHandler(proxies)
##proxy_auth_handler = urllib.request.HTTPBasicAuthHandler()
##opener = urllib.request.build_opener(proxy, proxy_auth_handler,
##                                     https_sslv3_handler)

https_sslv3_handler = urllib.request.HTTPSHandler(context=ssl.SSLContext(ssl.PROTOCOL_SSLv3))
opener = urllib.request.build_opener(https_sslv3_handler)
urllib.request.install_opener(opener)

resp = opener.open('https://yande.re/')
data = resp.read().decode('utf-8')
print(data)
Btw, thanks for showing how to use http.client. I didn't know that there's another library that can be used to connect to the internet. ;)
This is due to a bug in the early 1.x OpenSSL implementation of elliptic curve cryptography. Take a closer look at the relevant part of the exception:
_ssl.c:392: error:1411809D:SSL routines:SSL_CHECK_SERVERHELLO_TLSEXT:tls invalid ecpointformat list
This is an error from the underlying OpenSSL library code, the result of mishandling the EC point format TLS extension. One workaround is to use the SSLv3 method instead of SSLv23; the other is to use a cipher suite specification that disables all ECC cipher suites (I had good results with ALL:-ECDH; use openssl ciphers for testing). The real fix is to update OpenSSL.
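A sketch of that cipher-suite workaround (assuming Python 3.2's ssl.SSLContext and the same HTTPSHandler context argument used in the workaround above):
import ssl
import urllib.request
# Build a context whose cipher list excludes the ECC suites that trigger
# the buggy EC point format handling ('ALL:-ECDH', as noted above).
ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
ctx.set_ciphers('ALL:-ECDH')
opener = urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))
data = opener.open('https://yande.re/').read()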
The problem is due to the hostnames you're giving in the two examples:
import http.client
conn = http.client.HTTPSConnection('www.yande.re')
conn.request('GET', 'https://yande.re/')
and...
import urllib.request
urllib.request.urlopen('https://yande.re/')
Note that in the first example you're asking the client to make a connection to the host www.yande.re, while in the second example urllib first parses the URL 'https://yande.re/' and then makes its request to the host yande.re.
Although www.yande.re and yande.re may resolve to the same IP address, from the perspective of the web server these are different virtual hosts. My guess is that you had an SNI configuration problem on the web server's side. Given that the original question was posted on May 21 and the current cert at yande.re starts May 28, I suspect you have already fixed this problem.
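You can check which host urllib will actually contact by parsing the URL yourself:
>>> from urllib.parse import urlsplit
>>> urlsplit('https://yande.re/').hostname
'yande.re'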
Try this:
import connection  # imports connection
import url

url = 'http://www.google.com/'
webpage = url.open(url)
try:
    connection.receive(webpage)
except:
    webpage = url.text('This webpage is not available!')
    connection.receive(webpage)
