Retrieving full URL from cgi.FieldStorage - python

I'm passing a URL to a python script using cgi.FieldStorage():
http://localhost/cgi-bin/test.py?file=http://localhost/test.xml
test.py just contains
#!/usr/bin/env python
import cgi
print "Access-Control-Allow-Origin: *"
print "Content-Type: text/plain; charset=x-user-defined"
print "Accept-Ranges: bytes"
print
print cgi.FieldStorage()
and the result is
FieldStorage(None, None, [MiniFieldStorage('file', 'http:/localhost/test.xml')])
Note that the value only contains http:/localhost (a single slash). How do I pass the full encoded URI so that file is the whole URI? I've tried percent-encoding the file parameter (http%3A%2F%2Flocalhost%2Ftest.xml), but this also doesn't work.
The output on the web page isn't what I expect, even though the encoded URL in the request is correct.

Your CGI script works fine for me using Apache 2.4.10 and Firefox (curl also). What web server and browser are you using?
My guess is that you are using Python's CGIHTTPServer, or something based on it. This exhibits the problem that you describe: CGIHTTPServer assumes that it is being handed a path to a CGI script, so it collapses the path without regard to any query string that might be present. Collapsing the path removes duplicate forward slashes as well as relative path elements such as '..'.
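As a rough illustration (CGIHTTPServer's collapsing isn't literally posixpath.normpath, but the effect on the duplicate slash in the query string is the same):
>>> import posixpath
>>> posixpath.normpath('/cgi-bin/test.py?file=http://localhost/test.xml')
'/cgi-bin/test.py?file=http:/localhost/test.xml'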
If you are using this web server, I don't see any obvious way around the problem by changing the URL. You won't be using it in production, so perhaps look at another web server such as Apache, nginx or lighttpd.

The problem is with your query parameters; you should be encoding them:
>>> from urllib import urlencode
>>> urlencode({'file': 'http://localhost/test.xml', 'other': 'this/has/forward/slashes'})
'other=this%2Fhas%2Fforward%2Fslashes&file=http%3A%2F%2Flocalhost%2Ftest.xml'
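For completeness, a sketch of the receiving side: FieldStorage percent-decodes the parameter, so the script should see the original URL again, assuming the web server passes the query string through untouched:
import cgi

form = cgi.FieldStorage()
print form.getvalue('file')   # -> http://localhost/test.xml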

Related

Python ValueError: unknown url type: space (?)

I am using the urllib2 module in Python 2.7 (via Spyder 3.0) to batch-download text files by reading a text file that contains a list of their URLs:
import sys
import urllib2

reload(sys)
sys.setdefaultencoding('utf-8')
with open('ocean_not_templated_url.txt', 'r') as text:
    lines = text.readlines()
    for line in lines:
        url = urllib2.urlopen(line.strip('ï \xa0\t\n\r\v'))
        with open(line.strip('\n\r\t ').replace('/', '!').replace(':', '~'), 'wb') as out:
            for d in url:
                out.write(d)
I've already discovered a bunch of weird characters in the URLs, which I've since stripped. However, the script fails when it is nearly 90% complete, giving the ValueError: unknown url type error from the title. I thought the culprit was a non-breaking space (the \xa0 stripped in the code), but it still fails. Any ideas?
That's an odd URL!
You need to specify the communication protocol (the URL scheme). If the file exists on the web, try prefixing the URL with http:// followed by the domain name.
Files always reside somewhere, in some server's directory, or locally on your system. So there must be a network path to such files, for example:
http://127.0.0.1/folder1/samuel/file1.txt
The same example, with localhost being (generally) an alias for 127.0.0.1:
http://localhost/folder1/samuel/file1.txt
That might solve the problem. Just think about where your file exists and how it should be addressed...
Update:
I experimented quite a bit on this. I think I know why that error is raised! :D
I speculate that the file which stores the URLs actually has a sneaky empty line near the end. I can say it's near the end because you said the script executes about 90% of the way and then fails. The urllib2 function get_type is unable to process that empty URL and throws the unknown url type error.
I think that's the problem! Remove that empty line in the file ocean_not_templated_url.txt and try it out!
Just check and let me know! :P
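If you would rather make the script tolerant of such lines, a minimal sketch (reusing the filename and strip characters from the question) is to skip anything that is empty after stripping:
import urllib2

with open('ocean_not_templated_url.txt', 'r') as text:
    for line in text:
        cleaned = line.strip('\xa0 \t\n\r\v')
        if not cleaned:
            continue  # an empty line would otherwise raise ValueError: unknown url type
        url = urllib2.urlopen(cleaned)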

How to read all HTTP headers in Python CGI script?

I have a Python CGI script that receives a POST request containing a specific HTTP header. How do you read and parse the headers received? I am not using BaseHTTPRequestHandler or HTTPServer. I receive the body of the POST with sys.stdin.read(). Thanks.
It is possible to get a custom request header's value in an Apache CGI script with Python.
Apache's mod_cgi sets an environment variable for each HTTP request header received. The variables set in this manner all have an HTTP_ prefix, so, for example, x-client-version: 1.2.3 will be available as the variable HTTP_X_CLIENT_VERSION.
So, to read the above custom header just call os.environ["HTTP_X_CLIENT_VERSION"].
The below script will print all HTTP_* headers and values:
#!/usr/bin/env python
import os
print "Content-Type: text/html"
print "Cache-Control: no-cache"
print
print "<html><body>"
for headername, headervalue in os.environ.iteritems():
    if headername.startswith("HTTP_"):
        print "<p>{0} = {1}</p>".format(headername, headervalue)
print "</body></html>"
You might want to look at the cgi module included with Python's standard library. It has a cgi.parse_header(string) function that you might find helpful when working with header values.
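Note that parse_header parses a single header's value rather than fetching the request headers themselves (those still come from os.environ), for example:
>>> import cgi
>>> cgi.parse_header('text/html; charset=utf-8')
('text/html', {'charset': 'utf-8'})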

Internal Server Error with very simple python script

I'm new to Python, and I'm trying to run a simple script (on a Mac, if that's important).
Now, this code gives me an Internal Server Error:
#!/usr/bin/python
print 'hi'
But this one works like a charm (the only difference is the extra print statement):
#!/usr/bin/python
print
print 'hi'
Any explanation? Thanks!
Update:
When I run this script from the Terminal everything is fine. But when I run it from the browser:
http://localhost/cgi-bin/test.py
I get this error (and again, only if I don't add the extra print statement).
I'm using the Apache server, of course.
It looks like you're running your script as a CGI script (your edit confirms that you're using CGI), and the initial (empty) print is required to signify the end of the headers.
Check your Apache's error log (/var/log/apache2/error.log probably) to see if it says 'Premature end of script headers' (more info here).
EDIT: a bit more explanation:
A CGI script in Apache is responsible for generating its own HTTP response.
An HTTP response consists of a header block, an empty line, and the so-called body contents. Even though you should generate some headers, it's not mandatory to do so. However, you do need to output the empty line; Apache expects it to be there, and if it's not (or if you only output a body which can't be parsed as headers), Apache will generate an error.
That's why your first version didn't work, but your second did: adding the empty print added the required empty line that Apache was expecting.
This will also work:
#!/usr/bin/env python
print "Content-Type: text/html" # header block
print "Vary: *" # also part of the header block
print "X-Test: hello world" # this too is a header, but a made-up one
print # empty line
print "hi" # body

get source html in local system python

I want to get the source of a page, but from the local filesystem rather than over the internet.
For example: url = urllib.request.urlopen('c://1.html')
This works for a remote page:
>>> import urllib.request
>>> url=urllib.request.urlopen ('http://google.com')
>>> page =url.read()
>>> page=page.decode()
>>> page
What's my problem?
from os.path import abspath

with open(abspath('c:/1.html')) as fh:
    print(fh.read())
Since url.read() just gives you the data as-is, and .decode() doesn't really do anything except convert the byte data from the socket into an ordinary string, just print the file contents as shown above.
urllib is mainly (if not only) a transport for receiving HTML data, not for parsing the content. All it does is connect to the source, separate the headers and give you the content. If you've already stored the page locally in a file, then urllib has no more use to you. If you need to parse the content, consider an HTML parsing library such as BeautifulSoup.
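That said, if you want to stay with urllib, urlopen can also read a local file through a file:// URL (assuming the file really is at C:\1.html):
>>> import urllib.request
>>> url = urllib.request.urlopen('file:///C:/1.html')
>>> page = url.read().decode()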

python cgitb is not functioning through a browser

I can't seem to get the python module cgitb to output the stack trace in a browser. I have no problems in a shell environment. I'm running Centos 6 with python 2.6.
Here is an example simple code that I am using:
import cgitb; cgitb.enable()
print "Content-type: text/html"
print
print 1/0
I get an Internal Server Error instead of the detailed report. I have tried different error types, different browsers, etc.
When I don't have an error, Python of course works fine, and in a shell the error is printed fine. The point of cgitb is to print the error report instead of returning an "Internal Server Error" in the browser for most exceptions. Basically, I'm just trying to get cgitb to work in a browser environment.
Any Suggestions?
Okay, I got my problem fixed, and the OP brought me to it: even though cgitb will output HTML by default, it will not output a header! Apache does not like that and might give you a rather unhelpful error like:
<...blablabla>: Response header name '<!--' contains invalid characters, aborting request
It indicates that Apache was still working its way through the headers when it encountered some HTML. Look at what the OP prints before the error is triggered: that is a header, and you need it, including the empty line.
I will just quote the docs:
Make sure that your script is readable and executable by "others"; the Unix file mode should be 0755 octal (use chmod 0755 filename).
Make sure that the first line of the script contains #! starting in column 1 followed by the pathname of the Python interpreter, for instance:
#!/usr/local/bin/python
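Putting the quoted advice and the header fix together, a minimal script that should render a cgitb report in the browser (assuming Apache CGI is set up and the file has been made executable) looks roughly like this:
#!/usr/bin/env python
# Remember: chmod 0755 this file so Apache can execute it.
import cgitb
cgitb.enable()

# Emit the header block and the mandatory blank line before anything can fail,
# so Apache never tries to parse cgitb's HTML output as response headers.
print "Content-Type: text/html"
print

print 1 / 0  # deliberately raises ZeroDivisionError; cgitb renders the report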
