I have run into a strange problem.
A simple piece of Python code:
import urllib2
request = urllib2.Request('http://google.com')
request.add_header('foo', 'bar')
response = urllib2.urlopen(request)
data = response.read()
print data
Raises AttributeError on add_header. Here is the traceback:
Traceback (most recent call last):
  File "C:/path/to/bizarro.py", line 4, in <module>
    request.add_header('foo', 'bar')
  File "C:\Python27\lib\urllib2.py", line 229, in __getattr__
    raise AttributeError, attr
AttributeError: add_header
This exact code works fine when I run it on a remote Linux server.
Also, adding headers works using build_opener:
opener = urllib2.build_opener()
opener.addheaders = [('foo', 'bar')]
response = opener.open('http://google.com')
print response.read()
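By the way, headers can also be passed straight to the Request constructor; a minimal sketch of that variant (note that internally the constructor calls add_header for each entry, so on a urllib2.py with that method missing it would fail the same way):
import urllib2

# equivalent to calling add_header once per entry
request = urllib2.Request('http://google.com', headers={'foo': 'bar'})
response = urllib2.urlopen(request)
print response.read()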
It feels like it has something to do with Python or Windows (I'm running Windows 7).
I have consulted Google but have found no hint so far of where to look. Has anyone encountered anything like this? Any ideas where to look for a solution?
Thanks to Padraic Cunningham for pointing me in the right direction!
A single line was missing from urllib2.py. WTF? It was the declaration of the add_header method.
Should be like this:
def add_header(self, key, val):
    # useful for something like authentication
    self.headers[key.capitalize()] = val
It was like this:
    # useful for something like authentication
    self.headers[key.capitalize()] = val
That was the only line missing in the library. I added it and the code works.
EDIT: Thinking about it, I think it's possible I deleted it accidentally myself. In the IDE (PyCharm) I could have jumped into the library source with an accidental Ctrl+click on add_header, then cut the line with a quick Ctrl+X (which I use often) without noticing. And there you have it.
I am unable to download an xls file from a URL. I have tried with both urlopen and urlretrieve, but I receive a really long error message starting with:
Traceback (most recent call last):
  File "C:/Users/Henrik/Documents/Development/Python/Projects/ImportFromWeb.py", line 6, in <module>
    f = ur.urlopen(dls)
  File "C:\Users\Henrik\AppData\Local\Programs\Python\Python35\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
and ending with:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
Unfortunately I can't provide the URL I am using since the data is sensitive. However, I will give you the URL with some parts removed.
https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504
As you can see, the URL doesn't end with a "/file.xls", for example. I don't know if that matters, but most of the threads regarding this issue have had those types of links.
If I enter the URL in my address bar, the file download window appears:
[Image of download window]
The code I have written looks like this:
import urllib.request as ur
import openpyxl as pyxl
dls = 'https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504'
f = ur.urlopen(dls)
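One idea I have seen suggested for 302 loops is to use an opener that keeps cookies, since some servers redirect until they see a session cookie, and plain urlopen never stores one. A minimal sketch of that idea, assuming cookies are the cause (the output filename is just an example):
import http.cookiejar
import urllib.request as ur

# an opener that remembers cookies across the redirect chain
cj = http.cookiejar.CookieJar()
opener = ur.build_opener(ur.HTTPCookieProcessor(cj))
f = opener.open(dls)  # dls is the URL defined above
with open('export.xls', 'wb') as out:
    out.write(f.read())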
I am grateful for any help you can provide!
I want to change the port in a given URL.
OLD=http://test:7000/vcc3
NEW=http://test:7777/vcc3
I tried the code below; I am able to change the URL but not the port.
>>> from urlparse import urlparse
>>> aaa = urlparse('http://test:7000/vcc3')
>>> aaa.hostname
'test'
>>> aaa.port
7000
>>> aaa._replace(netloc=aaa.netloc.replace(aaa.hostname, "newurl")).geturl()
'http://newurl:7000/vcc3'
>>> aaa._replace(netloc=aaa.netloc.replace(aaa.port, "7777")).geturl()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected a character buffer object
It's not a particularly good error message. It's complaining because you're passing ParseResult.port, an int, to the string's replace method, which expects a str. Just stringify the port before you pass it in:
aaa._replace(netloc=aaa.netloc.replace(str(aaa.port), "7777"))
I'm astonished that there isn't a simple way to set the port using the urlparse library. It feels like an oversight. Ideally you'd be able to say something like parseresult._replace(port=7777), but alas, that doesn't work.
The details of the port are stored in netloc, so you can simply do:
>>> a = urlparse('http://test:7000/vcc3')
>>> a._replace(netloc='newurl:7777').geturl()
'http://newurl:7777/vcc3'
>>> a._replace(netloc=a.hostname+':7777').geturl() # Keep the same host
'http://test:7777/vcc3'
The problem is that ParseResult's port member is read-only, so you can't change the attribute. And don't even try to use the private _replace() method. A solution is here:
from urllib.parse import urlparse, ParseResult

old = urlparse('http://test:7000/vcc3')
new = ParseResult(scheme=old.scheme, netloc="{}:{}".format(old.hostname, 7777),
                  path=old.path, params=old.params, query=old.query,
                  fragment=old.fragment)
new_url = new.geturl()
The second idea is to convert the ParseResult to a list and change it afterwards, like here:
Changing hostname in a url
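For completeness, a minimal sketch of that list-based idea, using urlunparse from the same module (values are illustrative):
from urllib.parse import urlparse, urlunparse

parts = list(urlparse('http://test:7000/vcc3'))
parts[1] = 'test:7777'  # index 1 is the netloc, which carries host:port
print(urlunparse(parts))  # http://test:7777/vcc3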
BTW, the urlparse library is not flexible in this area!
I want to query PubMed through Python. I found a nice biology-related library to do this:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
I found some example code here:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"
handle = Entrez.egquery(term="orchid")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        print row["Count"]
When I change the email and run this code I get the following error:
Traceback (most recent call last):
  File "pubmed.py", line 15, in <module>
    handle = Entrez.egquery(term=my_query)
  File "/usr/lib/pymodules/python2.7/Bio/Entrez/__init__.py", line 299, in egquery
    return _open(cgi, variables)
  File "/usr/lib/pymodules/python2.7/Bio/Entrez/__init__.py", line 442, in _open
    raise exception
urllib2.HTTPError: HTTP Error 404: Not Found
There is not much of a lead to the source of the problem. I don't know what url it is trying to access. When I search "pubmed entrez urllib2.HTTPError: HTTP Error 404: Not Found", I get 8 results, none of which are related (aside from this thread).
The example works for me. It looks like it was a temporary NCBI issue, although the "Error 404" is quite unusual and not typical of the network problems I have seen with Entrez. In general with any network resource, give it a few hours or a day before worrying that something has broken.
There is also an Entrez Utilities announcement mailing list you may wish to subscribe to, although if there was a planned service outage recently, it was not mentioned there:
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce
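If you want to see what URL it is trying to access, you can also call the EGQuery endpoint directly and inspect the raw response; a minimal sketch, assuming the standard E-utilities address (the email value is a placeholder):
import urllib
import urllib2

# hit the same CGI endpoint that Bio.Entrez.egquery wraps
params = urllib.urlencode({"term": "orchid", "email": "A.N.Other@example.com"})
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?" + params
print urllib2.urlopen(url).read()  # raw XML that Entrez.read would parse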
I am working on an open-source project called RubberBand that allows you to do what the title says: locally execute a Python file that is located on a web server. However, I have run into a problem. If a comma is located in a string (e.g. "http:"), it will return an error.
'''
RubberBand Version 1.0.1 'Indigo-Charlie'
http://www.lukeshiels.com/rubberband

CHANGE-LOG:
Changed Error Messages.
Changed Whole Code Into one function, rather than three.
Changed Importing required libraries into one line instead of two
'''

#Edit Below this line
import httplib, urlparse

def executeFromURL(url):
    if (url == None):
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
    else:
        CORE = None
        good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
        host, path = urlparse.urlparse(url)[1:3]
        try:
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path)
            CORE = conn.getresponse().status
        except StandardError:
            CORE = None
        if (CORE in good_codes):
            exec(url)
        else:
            print "!# RUBBERBAND_ERROR: File Does Not Exist On WEBSERVER #!"
RubberBand in three lines without error checking:
import requests

def execute_from_url(url):
    exec(requests.get(url).content)
You should use a return statement in your if (url == None): block as there is no point in carrying on with your function.
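Something like this shape, sketched from your code:
def executeFromURL(url):
    if url is None:
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
        return  # bail out early; nothing to do without a URL
    # ... the rest of the function follows, one indent level shallower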
Whereabouts in your code is the error? Is there a full traceback? URIs with commas parse fine with the urlparse module.
Is it perhaps httplib.ResponseNotReady when calling CORE = conn.getresponse().status?
Never mind that error message; that was me quickly testing your code and re-using the same connection object. I can't see what would be erroneous in your code.
My suggestion: avoid commas in the URL. I would suggest checking this question:
Can I use commas in a URL?
This seems to work well for me:
import urllib
fn, hd = urllib.urlretrieve('http://host.com/file.py')
execfile(fn)
I prefer to use standard libraries, because I'm using Python bundled with third-party software (Abaqus), which makes it a real headache to add packages.
I'm working on a scraper using Mechanize and Beautiful Soup in Python and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "thing" and "stuff"; I don't normally do that, trust me):
stuff = soup.find('div', attrs={'class' : 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    pageUrl = thing['href']
    print pageUrl
    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()
    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString
Anyway, products on Kraft's website that have more than one page of search results display a link to go to the next page(s). The source code lists, for example, this as the next page for Kraft's line of steak sauces and marinades, which redirects to this
Anyway, thing['href'] has the old link in it because it scrapes the web page for it; one would think that doing browser.open() on that link would cause mechanize to go to the new link and return that as a response. However, running the code gives this result:
http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
  File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
    response = browser.open(req)
  File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
  File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out
I get a time-out; I imagine it's because, for some reason, mechanize is looking for the old URL and isn't being redirected to the new one (I also tried this with urllib2 and received the same result). What's going on here?
Thanks for the help and let me know if you need any more information.
Update: Alright, I enabled logging; now my code reads:
req = mechanize.Request(pageUrl)
print logging.INFO
When I run it I get this:
url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'
20
Update 2 (which occurred while writing the first update): It turns out that it was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', '+') and it works perfectly.
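For what it's worth, a more general fix is to percent-encode all unsafe characters instead of replacing spaces by hand; a minimal sketch with urllib.quote, keeping the query delimiters in the safe set (the href value is shortened for illustration):
import urllib

href = 'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?searchtext=a.1. steak sauces&pageno=2'
pageUrl = urllib.quote(href, safe=':/?&=')  # spaces become %20
print pageUrl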
Both urllib2 and mechanize openers include a handler for redirect responses by default (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being correctly followed.
To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script (I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for every request, and adding the handler to your opener object using the add_handler method).
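For instance, a minimal sketch of such a handler (the class name and the amount of detail logged are illustrative):
import urllib2

class LoggingHandler(urllib2.BaseHandler):
    handler_order = 100  # run before the default handlers

    def http_request(self, req):
        # called for every outgoing request; log it and pass it through
        print req.get_method(), req.get_full_url()
        print req.header_items()
        return req

    https_request = http_request

opener = urllib2.build_opener(LoggingHandler())
response = opener.open('http://www.kraftrecipes.com/')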