Open a URL with urllib2 that already has GET parameters - Python

I want to get data from pages like http://www.site.com/list?a=data&b=data...
I retrieve all those URLs from a page on site.com. When I try to open one of those links, I get the error: TypeError: expected BaseHandler instance, got <type 'str'>.
My guess is that the URL needs to be "encoded", but how?
Thanks for your help, guys!
Edit:
OK, here is the code. All connections pass through my proxy server, and I try to open the URL found earlier, as described above.
Code:
import urllib2
import socks  # SocksiPy / PySocks
from sockshandler import SocksiPyHandler  # import path assumed; SocksiPyHandler may also live in a local helper module

tileurl = 'http://www.site.com/list?a=data&b=data'
proxy = SocksiPyHandler(socks.PROXY_TYPE_SOCKS4, '192.168.0.190', 12500)
opener = urllib2.build_opener(proxy)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open(tileurl)
tile_bin = infile.read()
Traceback (most recent call last):
  File "C:\Users\Jean-michel\Dropbox\Projects\Python Code\Maps Saver\map.py", line 89, in <module>
    opener = urllib2.build_opener(tileurl)
  File "C:\Python27\lib\urllib2.py", line 490, in build_opener
    opener.add_handler(h)
  File "C:\Python27\lib\urllib2.py", line 326, in add_handler
    type(handler))
TypeError: expected BaseHandler instance, got <type 'str'>

import urlparse
import urllib

## tile is the full URL and t1 its query string, from earlier code
tileurl = tile.replace(t1, "")   ## strip the query string (t1) from the URL
p = urlparse.parse_qs(t1)        ## decode the parameters
tileparam = urllib.urlencode(p)  ## re-encode the parameters properly
Problem solved!! :)
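For reference, here is a minimal self-contained sketch (Python 2) of the same fix; the host and parameter values are placeholders. It splits off the query string, decodes it, and re-encodes it before the URL is opened:
import urllib
import urlparse

raw = 'http://www.site.com/list?a=some data&b=other data'
parts = urlparse.urlsplit(raw)
params = urlparse.parse_qs(parts.query)        # {'a': ['some data'], 'b': ['other data']}
query = urllib.urlencode(params, doseq=True)   # 'a=some+data&b=other+data'
clean = urlparse.urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))
print clean  # now safe to pass to opener.open()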

Related

How do I use Python and lxml to parse a local html file?

I am working with a local HTML file in Python, and I am trying to use lxml to parse the file. For some reason I can't get the file to load properly, and I'm not sure if this has to do with not having an HTTP server set up on my local machine, etree usage, or something else.
My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/
This could be a related problem: Requests : No connection adapters were found for, error in Python3
Here is my code:
from lxml import html
import requests
page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)
test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
print test
The traceback that I'm getting reads:
C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
  File "C:/Users/.../extract_html/extract.py", line 4, in <module>
    page = requests.get('C:\Users\...\sites\site_1.html')
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'

Process finished with exit code 1
You can see that it has something to do with a "connection adapter" but I'm not sure what that means.
If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.
from lxml import html

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()

tree = html.fromstring(page)
There is a better way to do it: use the parse function instead of fromstring.
from lxml import html

tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))
You can also try using Beautiful Soup
from bs4 import BeautifulSoup
import io

f = io.open("filepath", encoding="utf8")  # io.open accepts the encoding argument on Python 2
soup = BeautifulSoup(f, "html.parser")
f.close()
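Putting it together, a minimal sketch (the path and the XPath expression are placeholders) that parses the local file directly with lxml and runs an XPath query, with no HTTP server involved:
from lxml import html

tree = html.parse(r"C:\Users\...\sites\site_1.html")  # placeholder path
test = tree.xpath('//strong/text()')                  # substitute the real expression
print test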

BeautifulSoup: An issue with find_all() and unicode?

So I'm using BeautifulSoup to build a webscraper to grab every ad on a Craigslist page. Here's what I've got so far:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import bs4
page = "http://miami.craigslist.org/search/roo?query=brickell"
search_html = requests.get(page).text
roomSoup = BeautifulSoup(search_html, "html.parser")
ad_list = roomSoup.find_all("a", {"class":"hdrlnk"})
#print ad_list
ad_ls = [item["href"] for item in ad_list]
#print ad_ls
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
#print ad_urls
url_str = [str(unicode) for unicode in ad_urls]
# What's in url_str?
for url in url_str:
    print url
When I run this, I get:
miami.craigslist.org/mdc/roo/4870912192.html
miami.craigslist.org/mdc/roo/4858122981.html
miami.craigslist.org/mdc/roo/4870665175.html
miami.craigslist.org/mdc/roo/4857247075.html
miami.craigslist.org/mdc/roo/4870540048.html ...
This is exactly what I want: a list containing the URLs to each ad on the page.
My next step was to extract something from each of those pages; hence building another BeautifulSoup object. But I get stopped short:
for url in url_str:
    ad_html = requests.get(str(url)).text
Here we finally get to my question: What exactly is this error? The only thing I can make sense of is the last 2 lines:
Traceback (most recent call last):
  File "webscraping.py", line 24, in <module>
    ad_html = requests.get(str(url)).text
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
It looks like the issue is that all my links are preceded by u', so requests.get() isn't working. This is why you see me pretty much trying to force all the URLs into a regular string with str(). No matter what I do, though, I get this error. Is there something else I'm missing? Am I completely misunderstanding my problem?
Thanks much in advance!
It looks like you misunderstood the problem.
The message:
u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
means the URL lacks the http:// prefix (the schema) at the front,
so replacing
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
by
ad_urls = ["http://miami.craigslist.org" + ad for ad in ad_ls]
should do the job
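A slightly more robust variant (a sketch, not from the answers above) is to join the hrefs against the site root with urlparse.urljoin, which handles both relative hrefs and hrefs that are already absolute:
import urlparse  # urllib.parse on Python 3

base = "http://miami.craigslist.org"
hrefs = ["/mdc/roo/4870912192.html",
         "http://miami.craigslist.org/mdc/roo/4858122981.html"]
ad_urls = [urlparse.urljoin(base, href) for href in hrefs]
# Both entries come out as absolute http:// URLs.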

Python urllib2 - cannot read a page

I am using urllib2 in Python to scrape a webpage. However, the read() method does not return.
Here is the code I am using:
import urllib2
url = 'http://edmonton.en.craigslist.ca/kid/'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, headers=headers)
f_webpage = urllib2.urlopen(request)
html = f_webpage.read() # <- does not return
I last ran the script a month ago and it was working fine then.
Note that the same script runs well for webpages of other categories on Edmonton Craigslist like http://edmonton.en.craigslist.ca/act/ or http://edmonton.en.craigslist.ca/eve/.
As requested in comments :)
Install requests with:
$ pip install requests
Then use it as follows:
>>> import requests
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = requests.get(url, headers=headers)
>>> request.ok
True
>>> request.text # content in string, similar to .read() in question
...
...
Disclaimer: this is not technically an answer to the OP's question, but it solves the OP's problem, since urllib2 is known to be problematic and the requests library was created to address such problems.
It returns (or more specifically, errors out) fine for me:
>>> import urllib2
>>> url = 'http://edmonton.en.craigslist.ca/kid/'
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> request = urllib2.Request(url, headers=headers)
>>> f_webpage = urllib2.urlopen(request)
>>> html = f_webpage.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 647, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
Chances are that Craigslist is detecting that you are a scraper and refusing to give you the actual page.
I ran into a similar problem. Part of my error output:
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
File "C:\Python27\lib\httplib.py", line 573, in read
s = self.fp.read(amt)
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
error: [Errno 10054]
I solved it by reading the response in small chunks instead of reading it all at once.
def readBuf(fsrc, length=16*1024):
    """Read from fsrc in chunks of `length` bytes until EOF."""
    result = ''
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        result += buf
    return result
Instead of using html=f_webpage.read(), you can use html=readBuf(f_webpage) to scrape the webpage.
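For example, plugged into the original script (a sketch; readBuf is the helper defined above):
import urllib2

request = urllib2.Request('http://edmonton.en.craigslist.ca/kid/',
                          headers={'User-Agent': 'Mozilla/5.0'})
f_webpage = urllib2.urlopen(request)
html = readBuf(f_webpage)  # chunked read in place of f_webpage.read()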

REST API to output in JSON

I am creating a script that calls a REST API in Python and spits out the results in JSON format. I am getting a few traceback errors in my code. How can I go about fixing this issue?
'import sitecustomize' failed; use -v for traceback
Traceback (most recent call last):
  File "/home/Desktop/Sync.py", line 12, in <module>
    url = urllib2.Request(request)
  File "/usr/lib/python2.7/urllib2.py", line 202, in __init__
    self.__original = unwrap(url)
  File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap
    url = url.strip()
  File "/usr/lib/python2.7/urllib2.py", line 229, in __getattr__
    raise AttributeError, attr
AttributeError: strip
Here's the code:
import urllib2
import json
url = "http://google.com"
request = urllib2.Request(url)
request.add_header("Authorization","Basic xxxxxxxxxxxxxxxxxx")
socket = urllib2.urlopen(request)
data = json.dumps(socket)
hdrs = socket.headers
source = socket.read()
socket.close()
print "---- Headers -----"
print data
print "---- Source HTML -----"
print source
print "---- END -----"
value = 0
for line in source.splitlines():
    if not line.strip():
        continue
    if line.startswith("value="):
        try:
            value = line.split("=")
        except IndexError:
            pass
    if value > 0:
        break

open("some.json", "w").write("value is: %d" % value)
You seem to have an issue here:
request = urllib2.Request("http.google.com")
request.add_header("Authorization", "Basic xxxxxxxxxxxxxxxxxxxxxxxx=")
url = urllib2.Request(request)
socket = urllib2.urlopen(url)
You are trying to create a Request object named "url" by passing a Request object into the constructor.
See http://docs.python.org/2/library/urllib2.html#urllib2.Request
Try this (note the URL should also be a proper http:// address):
request = urllib2.Request("http://google.com")
request.add_header("Authorization", "Basic xxxxxxxxxxxxxxxxxxxxxxxx=")
socket = urllib2.urlopen(request)
From the documentation of the Request class:
url should be a string containing a valid URL.
You are currently passing another Request object to its constructor, and that is the reason for the error you're seeing. The correct way to do this:
request = urllib2.Request("http://google.com")
request.add_header("Authorization", "Basic xxxxxxxxxxxxxxxxxxxxxxxx=")
socket = urllib2.urlopen(request)
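Putting the pieces together, a minimal sketch of a corrected script; the endpoint and credentials are placeholders. Note that a JSON response body is parsed with json.loads, whereas the json.dumps(socket) call in the question would fail on the response object:
import json
import urllib2

request = urllib2.Request("http://google.com")  # placeholder endpoint
request.add_header("Authorization", "Basic xxxxxxxxxxxxxxxxxx")
socket = urllib2.urlopen(request)
source = socket.read()
socket.close()

data = json.loads(source)  # only if the endpoint actually returns JSON
print data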

Python Proxy Error With Requests Library

I am trying to access the web via a proxy server in Python. I am using the requests library, and I am having an issue authenticating with my proxy, as the proxy I am using requires a password.
proxyDict = {
    'http': 'username:mypassword#77.75.105.165',
    'https': 'username:mypassword#77.75.105.165'
}
r = requests.get("http://www.google.com", proxies=proxyDict)
I am getting the following error:
Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    r = requests.get("http://www.google.com", proxies=proxyDict)
  File "C:\Python27\lib\site-packages\requests\api.py", line 78, in get
    :param url: URL for the new :class:`Request` object.
  File "C:\Python27\lib\site-packages\requests\api.py", line 65, in request
    """Sends a POST request. Returns :class:`Response` object.
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 187, in request
    def head(self, url, **kwargs):
  File "C:\Python27\lib\site-packages\requests\models.py", line 407, in send
    """
  File "C:\Python27\lib\site-packages\requests\packages\urllib3\poolmanager.py", line 127, in proxy_from_url
  File "C:\Python27\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 521, in connection_from_url
  File "C:\Python27\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 497, in get_host
ValueError: invalid literal for int() with base 10: 'h6f2v6jh5dsxa#77.75.105.165'
How do I solve this?
Thanks in advance for your help.
You should remove the embedded username and password from proxyDict, and use the auth parameter instead.
import requests
from requests.auth import HTTPProxyAuth

proxyDict = {
    'http': '77.75.105.165',
    'https': '77.75.105.165'
}
auth = HTTPProxyAuth('username', 'mypassword')
r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
I've been having a similar problem on Windows and found the only way to get requests to work was to set the proxies as environment variables before I started Python. For you this would be something like this:
set HTTP_PROXY=http://77.75.105.165
set HTTPS_PROXY=https://77.75.105.165
You might also want to check whether a specific port is required, and if so, set it after the URL. For example, if the port is 8443, do:
set HTTP_PROXY=http://77.75.105.165:8443
set HTTPS_PROXY=https://77.75.105.165:8443
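The same can be done from inside Python before the first request is made (a sketch; requests picks these environment variables up automatically):
import os
os.environ['HTTP_PROXY'] = 'http://77.75.105.165:8443'
os.environ['HTTPS_PROXY'] = 'https://77.75.105.165:8443'

import requests
r = requests.get("http://www.google.com")  # routed through the proxy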
You can also use the legacy urllib module for this; on Python 2 its urlopen accepts a proxies mapping:
import urllib

urllib.urlopen("your URL", proxies=urllib.getproxies())
