How to get current URL in python web page? - python

I am a noob in Python. I just installed it and spent 2 hours googling how to read a simple parameter sent in the URL to a Python script.
Found this
Very helpful, except I cannot for anything in the world figure out how to replace
import urlparse
url = 'http://foo.appspot.com/abc?def=ghi'
parsed = urlparse.urlparse(url)
print urlparse.parse_qs(parsed.query)['def']
With what do I replace url = 'string' to make it work?
I just want to access http://site.com/test/test.py?param=abc and see abc printed.
Final code after Alex's answer:
import os
import urlparse
url = os.environ["REQUEST_URI"]
parsed = urlparse.urlparse(url)
print urlparse.parse_qs(parsed.query)['param']

If you don't have any libraries to do this for you, you can construct your current URL from the HTTP request that gets sent to your script via the browser.
The headers that interest you are Host and whatever's after the HTTP method (probably GET, in your case). Here are some more explanations (first link that seemed ok, you're free to Google some more :).
This answer shows you how to get the headers in your CGI script:
If you are running as a CGI, you can't read the HTTP header directly,
but the web server put much of that information into environment
variables for you. You can just pick it out of os.environ[].
If you're doing this as an exercise, then it's fine because you'll get to understand what's behind the scenes. If you're building anything reusable, I recommend you use libraries or a framework so you don't reinvent the wheel every time you need something.
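As a rough illustration of picking the URL out of the environment, here is a minimal Python 3 sketch. The function name `current_url` is made up for this example, and the `HTTPS` variable is an assumption: it is set by some servers (Apache, for instance) but is not part of the CGI specification. The function takes the environment as a plain mapping so it can be tried without a web server.

```python
from urllib.parse import parse_qs, urlsplit


def current_url(environ):
    """Rebuild the requested URL from CGI-style environment variables."""
    # Assumption: the server sets HTTPS=on for TLS requests (not guaranteed).
    scheme = "https" if environ.get("HTTPS", "").lower() in ("on", "1") else "http"
    # HTTP_HOST mirrors the Host header; SERVER_NAME is the fallback.
    host = environ.get("HTTP_HOST") or environ.get("SERVER_NAME", "localhost")
    path = environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
    url = "%s://%s%s" % (scheme, host, path)
    query = environ.get("QUERY_STRING", "")
    if query:
        url += "?" + query
    return url


# Hypothetical environment for http://site.com/test/test.py?param=abc
example_env = {
    "HTTP_HOST": "site.com",
    "SCRIPT_NAME": "/test/test.py",
    "QUERY_STRING": "param=abc",
}
url = current_url(example_env)
print(url)
print(parse_qs(urlsplit(url).query)["param"][0])
```

In a real CGI script you would pass `os.environ` instead of the example dictionary.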

This is how I capture in Python 3 from CGI (A) URL, (B) GET parameters and (C) POST data:
=======================================================
import sys, os

# CAPTURE URL
myDomainSelf = os.environ.get('SERVER_NAME', '')
myPathSelf = os.environ.get('PATH_INFO', '')
myURLSelf = myDomainSelf + myPathSelf

# CAPTURE GET DATA
myQuerySelf = os.environ.get('QUERY_STRING', '')

# CAPTURE POST DATA
# CONTENT_LENGTH is the standard CGI name; HTTP_CONTENT_LENGTH is kept as a fallback
myTotalBytesStr = os.environ.get('CONTENT_LENGTH') or os.environ.get('HTTP_CONTENT_LENGTH')
myPostData = None
if not myTotalBytesStr:
    myJSONStr = '{"error": {"value": true, "message": "No (post) data received"}}'
else:
    myTotalBytes = int(myTotalBytesStr)
    myPostDataRaw = sys.stdin.buffer.read(myTotalBytes)
    myPostData = myPostDataRaw.decode("utf-8")

# Write RAW to FILE
mySpy = "myURLSelf: [" + str(myURLSelf) + "]\n"
mySpy = mySpy + "myQuerySelf: [" + str(myQuerySelf) + "]\n"
mySpy = mySpy + "myPostData: [" + str(myPostData) + "]\n"

# You need to define your own myPath here
myPath = "."  # placeholder: current directory
myFilename = "spy.txt"
myFilePath = os.path.join(myPath, myFilename)
with open(myFilePath, "w") as myFile:
    myFile.write(mySpy)
=======================================================
Here are some other useful CGI environment vars:
AUTH_TYPE
CONTENT_LENGTH
CONTENT_TYPE
GATEWAY_INTERFACE
PATH_INFO
PATH_TRANSLATED
QUERY_STRING
REMOTE_ADDR
REMOTE_HOST
REMOTE_IDENT
REMOTE_USER
REQUEST_METHOD
SCRIPT_NAME
SERVER_NAME
SERVER_PORT
SERVER_PROTOCOL
SERVER_SOFTWARE

Related

How do I get the args from a post or get with Python without using cgi.FieldStorage

I just read that cgi is deprecated and so cgi.FieldStorage will stop working.
I'm struggling to find the replacement for this functionality. All the searches I've tried refer to urllib or requests, both of which (AFAIK) are designed to create requests, not to respond to them.
Thanks in advance
The reference to urllib is actually a bit misleading. The following might give some insight into the CGI interface from a Python programmer's point of view:
#!/usr/bin/python3
'''
preflight_cgi.py
check the preflight option call
'''
import sys
import os

if __name__ == "__main__":
    print("Content-Type: text/html")  # HTML is following
    print()  # blank line ends the headers
    # command-line arguments passed to the script
    for i, arg in enumerate(sys.argv):
        print("argv{}: {}\n".format(i, arg))
    # request body (e.g. POST data), line by line
    for i, line in enumerate(sys.stdin):
        print("line {}: {}\n".format(i, line))
    print("<TITLE>CGI script output</TITLE>")
    print("<H1>This is the environment</H1>")
    for key, value in os.environ.items():
        print("<p>{} = {}</p>".format(key, value))
Put that where your current cgi.FieldStorage based app is and call it via the address line of the browser.
You will see something like
[...]
CONTENT_LENGTH = 0
QUERY_STRING = par=meter&var=able
REQUEST_URI = /cgi-bin/preflight_cgi.py?par=meter&var=able
REDIRECT_STATUS = 200
SCRIPT_NAME = /cgi-bin/preflight_cgi.py
REQUEST_METHOD = GET
SERVER_PROTOCOL = HTTP/1.1
SERVER_SOFTWARE = lighttpd/1.4.53
GATEWAY_INTERFACE = CGI/1.1
REQUEST_SCHEME = http
SERVER_PORT = 80
[...]
The environment variables have already done most of the work for you.
As an alternative you can also use one of the http.server classes to build the server completely in python.
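As a sketch of how those environment variables can stand in for `cgi.FieldStorage` in simple cases, the function below parses GET parameters from `QUERY_STRING` and URL-encoded POST bodies from the request body using `urllib.parse.parse_qs`. The name `read_params` is made up for this example, and it deliberately ignores `multipart/form-data` (file uploads), which needs a separate parser.

```python
from urllib.parse import parse_qs


def read_params(environ, body_stream):
    """Collect GET and URL-encoded POST parameters as {name: [values]}."""
    params = parse_qs(environ.get("QUERY_STRING", ""), keep_blank_values=True)
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        # Strip any "; charset=..." suffix before comparing the media type.
        content_type = environ.get("CONTENT_TYPE", "").split(";")[0].strip().lower()
        if content_type == "application/x-www-form-urlencoded":
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = body_stream.read(length).decode("utf-8")
            for key, values in parse_qs(body, keep_blank_values=True).items():
                params.setdefault(key, []).extend(values)
    return params
```

In a CGI script the call would be `read_params(os.environ, sys.stdin.buffer)`; passing the two explicitly makes the function easy to try with a plain dictionary and an `io.BytesIO` object.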

MITMProxy: smart URL replacement

We use a custom scraper that has to take a separate website for each language (this is an architecture limitation), like site1.co.uk, site1.es, site1.de etc.
But we need to parse a website with many languages, separated by url - like site2.com/en, site2.com/de, site2.com/es and so on.
I thought about MITMProxy: I could redirect all requests this way:
en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
I have written a small script which simply takes URLs and rewrites them:
class MyMaster(flow.FlowMaster):
    def handle_request(self, r):
        url = r.get_url()
        # replace URLs
        if 'blabla' in url:
            r.set_url(url.replace('something', 'another'))
But the target host generates 301 redirect with the response from the webserver - 'the page has been moved here' and the link to the site2.com/en
It worked when I played with URL rewriting, i.e. site2.com/en --> site2.com/de.
But for different hosts (subdomain and the root domain, to be precise), it does not work.
I tried to replace the Host header in the handle_request method from above:
for key in r.headers.keys():
    if key.lower() == 'host':
        r.headers[key] = ['site2.com']
I also tried to replace the Referer header. None of that helped.
How can I finally spoof that request from the subdomain to the main domain? If it generates a HTTP(s) client warning it's ok since we need that for the scraper (and the warnings there can be turned off), not the real browser.
Thanks!
You need to replace the content of the response and craft the header with just a few fields.
Open a new connection to the redirected url and craft your response :
def handle_request(self, flow):
    newUrl = <new-url>
    retryCount = 3
    newResponse = None
    while True:
        try:
            newResponse = requests.get(newUrl) # import requests
        except Exception:
            if retryCount == 0:
                print 'Cannot reach new url ' + newUrl
                traceback.print_exc() # import traceback
                return
            retryCount -= 1
            continue
        break
    responseHeaders = Headers() # from netlib.http import Headers
    # copy only the few header fields the client needs
    for name in ('Date', 'Connection', 'Content-Type', 'Content-Length', 'Content-Encoding'):
        if name in newResponse.headers:
            responseHeaders[name] = str(newResponse.headers[name])
    response = HTTPResponse( # from libmproxy.models import HTTPResponse
        http_version='HTTP/1.1',
        status_code=200,
        reason='OK',
        headers=responseHeaders,
        content=newResponse.content)
    flow.reply(response)
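The value for the new URL has to come from somewhere. One way to compute it, sketched in Python 3 with only the standard library, is a pure function that turns the language subdomain into a path prefix. The function name `subdomain_to_path` is made up for this example, and it assumes that the original path and query string should be appended after the language prefix, which the mapping in the question leaves open.

```python
from urllib.parse import urlsplit, urlunsplit


def subdomain_to_path(url, root="site2.com"):
    """Map http://de.site2.com/page to http://site2.com/de/page."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = "." + root
    if not host.endswith(suffix):
        return url  # not a subdomain of the root domain: leave untouched
    lang = host[:-len(suffix)]
    if not lang or "." in lang:
        return url  # only rewrite single-label subdomains such as "de"
    netloc = root if parts.port is None else "%s:%d" % (root, parts.port)
    tail = parts.path if parts.path not in ("", "/") else ""
    return urlunsplit((parts.scheme, netloc, "/" + lang + tail, parts.query, parts.fragment))


print(subdomain_to_path("http://de.site2.com/products?id=3"))
```

Keeping the mapping separate from the proxy hook makes it possible to check the rewriting rules without running a proxy.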

How to download several files with GAE Python

I'd like to download several files with GAE Python code.
My current code is like below
import webapp2, urllib
url1 = 'http://dummy/sample1.jpg'
url2 = 'http://dummy/sample2.jpg'
class DownloadHandler(webapp2.RequestHandler):
    def get(self):
        #image1
        self.response.headers['Content-Type'] = 'application/octet-stream'
        self.response.headers['Content-Disposition'] = 'attachment; filename="' + 'sample1.jpg' + '"'
        f = urllib.urlopen(url1)
        data = f.read()
        self.response.out.write(data)
        #image2
        self.response.headers['Content-Type'] = 'application/octet-stream'
        self.response.headers['Content-Disposition'] = 'attachment; filename="' + 'sample2.jpg' + '"'
        f = urllib.urlopen(url2)
        data = f.read()
        self.response.out.write(data)
app = webapp2.WSGIApplication([('/.*', DownloadHandler)],
                              debug=True)
I expected the download dialogue to appear twice with this code, but it appeared only once, and only sample2.jpg was downloaded.
How can I trigger the download dialogue several times?
I'd also like to add some other functions on top of the above:
To display progress messages in the browser such as
sample1.jpg was downloaded
sample2.jpg was downloaded
sample3.jpg was downloaded ...
And to redirect to another page after downloading the files.
When I wrote code such as
self.redirect('/otherpage')
after
self.response.out.write(data)
only the redirect happened and the download did not occur.
Could you give me any ideas to solve this, please?
I'm using Python 2.7.
Two things.
You cannot write two files in one response that has a Content-Type of application/octet-stream. To stuff multiple files into the response, you would have to encode your response with multipart/form-data or multipart/mixed and hope that the client would understand that, parse it and show two download dialogues.
Once you've already called self.response.out.write(…), you shouldn't be setting any more headers.
To me it seems that the most foolproof option would be to serve an HTML file that contains something like:
<script>
window.open('/path/to/file/1.jpg');
window.open('/path/to/file/2.jpg');
</script>
… and then handle those paths using different handlers.
Another option would be to zip the two files and serve the zipfile to the client, though it may or may not be preferable in your case.
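A minimal sketch of the zip option, written for Python 3 with the standard `zipfile` module (the question itself targets Python 2.7). The helper name `build_zip` and the sample file contents are made up for this example; in a handler, the returned bytes would be written to the response with a Content-Type of `application/zip`.

```python
import io
import zipfile


def build_zip(files):
    """Pack a {archive_name: bytes} mapping into one in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Placeholder contents standing in for the downloaded images.
payload = build_zip({"sample1.jpg": b"first image bytes",
                     "sample2.jpg": b"second image bytes"})
print(len(payload) > 0)
```

The browser then shows a single download dialogue for the archive, at the cost of the user having to unpack it.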
I achieved what I wanted to do.
For the user interaction, I generate HTML that includes the following:
<script type="text/javascript">
window.open("/download?url=http://dummy/sample1.jpg")
window.open("/download?url=http://dummy/sample2.jpg")
</script>
The newly opened windows are then handled by this code:
import os, urllib, webapp2

class DownloadHandler(webapp2.RequestHandler):
    def get(self):
        url = self.request.get('url')
        filename = str(os.path.basename(url))
        self.response.headers['Content-Type'] ='application/octet-stream'
        self.response.headers['Content-Disposition'] = 'attachment; filename="%s"' % (filename)
        data = urllib.urlopen(url).read()
        self.response.out.write(data)
app = webapp2.WSGIApplication([('/download', DownloadHandler)], debug=True)
Thank you, Attila.

How can I un-shorten a URL using python?

I have seen this thread already - How can I unshorten a URL?
My issue with the resolved answer (that is using the unshort.me API) is that I am focusing on unshortening youtube links. Since unshort.me is used readily, this returns almost 90% of the results with captchas which I am unable to resolve.
So far I am stuck with using:
def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)
    print resolvedURL.url
    #t = Test()
    #c = pycurl.Curl()
    #c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
    #c.setopt(c.WRITEFUNCTION, t.body_callback)
    #c.perform()
    #c.close()
    #dom = xml.dom.minidom.parseString(t.contents)
    #resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue
    return resolvedURL.url
Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
A one-line function using the requests library. And yes, it follows chains of redirects.
import requests
def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead.
import httplib
import urlparse
def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource )
    response = h.getresponse()
    # // is integer division in both Py2k and Py3k
    if response.status // 100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
    else:
        return url
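For Python 3, the same HEAD-based idea might look like the sketch below, using `http.client` and `urllib.parse`. It is not the original author's code: the split into `head_request` and `unshorten_url`, the `max_hops` limit and the loop check are choices made for this example. The `head` parameter exists only so the redirect logic can be exercised with a fake lookup instead of real network traffic.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_request(url, timeout=10):
    """Send one HEAD request and return (status, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        resource = parts.path or "/"
        if parts.query:
            resource += "?" + parts.query
        conn.request("HEAD", resource)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def unshorten_url(url, head=head_request, max_hops=10):
    """Follow redirects hop by hop; stop on a non-3xx answer, a loop, or max_hops."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            break  # redirect loop detected
        seen.add(url)
        status, location = head(url)
        if status // 100 == 3 and location:
            url = urljoin(url, location)  # Location may be relative
        else:
            break
    return url
```

Calling `unshorten_url("<your short url goes here>")` performs real HEAD requests; a dictionary-backed function passed as `head` is enough to check the behaviour offline.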
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection, keep it open in the background, by using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and redirects to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get, besides tweaking your kernel's TCP/IP stack.
Here is source code that takes into account most of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
recursively resolve the input url and avoid getting stuck in a loop.
The source code is on GitHub: https://github.com/amirkrifa/UnShortenUrl
Comments are welcome ...
import logging
import traceback
import urlparse
import httplib

logging.basicConfig(level=logging.DEBUG)
TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s'%url)
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'}
                          )
                response = h.getresponse()
            except Exception:
                traceback.print_exc()
                return url
            logging.info('Response status: %d'%response.status)
            if response.status // 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s'%(red_url, previous_url))
                # stop when the redirect points back to where we came from
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except Exception:
            traceback.print_exc()
            return None
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)

Get current URL in Python

How would I get the current URL with Python?
I need to grab the current URL so I can check it for query strings, e.g.
from urlparse import urlparse
requested_url = "URL_HERE"
url = urlparse(requested_url)
if url[4]:
    params = dict([part.split('=') for part in url[4].split('&')])
Also, this is running in Google App Engine.
Try this:
self.request.url
Also, if you just need the querystring, this will work:
self.request.query_string
And, lastly, if you know the querystring variable that you're looking for, you can do this:
self.request.get("name-of-querystring-variable")
For anybody finding this via Google:
I figured it out.
You can get the query strings on your current request using:
url_get = self.request.GET
which is a UnicodeMultiDict of your query strings!
I couldn't get the other answers to work, but here is what worked for me:
url = os.environ['HTTP_HOST']
uri = os.environ['REQUEST_URI']
return url + uri
Try this
import os
url = os.environ['HTTP_HOST']
This is how I capture in Python 3 from CGI (A) URL, (B) GET parameters and (C) POST data:
=======================================================
import sys, os

# CAPTURE URL
myDomainSelf = os.environ.get('SERVER_NAME', '')
myPathSelf = os.environ.get('PATH_INFO', '')
myURLSelf = myDomainSelf + myPathSelf

# CAPTURE GET DATA
myQuerySelf = os.environ.get('QUERY_STRING', '')

# CAPTURE POST DATA
# CONTENT_LENGTH is the standard CGI name; HTTP_CONTENT_LENGTH is kept as a fallback
myTotalBytesStr = os.environ.get('CONTENT_LENGTH') or os.environ.get('HTTP_CONTENT_LENGTH')
myPostData = None
if not myTotalBytesStr:
    myJSONStr = '{"error": {"value": true, "message": "No (post) data received"}}'
else:
    myTotalBytes = int(myTotalBytesStr)
    myPostDataRaw = sys.stdin.buffer.read(myTotalBytes)
    myPostData = myPostDataRaw.decode("utf-8")

# Write RAW to FILE
mySpy = "myURLSelf: [" + str(myURLSelf) + "]\n"
mySpy = mySpy + "myQuerySelf: [" + str(myQuerySelf) + "]\n"
mySpy = mySpy + "myPostData: [" + str(myPostData) + "]\n"

# You need to define your own myPath here
myPath = "."  # placeholder: current directory
myFilename = "spy.txt"
myFilePath = os.path.join(myPath, myFilename)
with open(myFilePath, "w") as myFile:
    myFile.write(mySpy)
=======================================================
Here are some other useful CGI environment vars:
AUTH_TYPE
CONTENT_LENGTH
CONTENT_TYPE
GATEWAY_INTERFACE
PATH_INFO
PATH_TRANSLATED
QUERY_STRING
REMOTE_ADDR
REMOTE_HOST
REMOTE_IDENT
REMOTE_USER
REQUEST_METHOD
SCRIPT_NAME
SERVER_NAME
SERVER_PORT
SERVER_PROTOCOL
SERVER_SOFTWARE
============================================
I am using these methods running Python 3 on Windows Server with CGI via MIIS.
Hope this can help you.
A response object from the requests module has a 'url' attribute, which holds the final URL after any redirects.
Just try this:
import requests
current_url=requests.get("some url").url
print(current_url)
If your python script is server side:
You can use os
import os
url = os.environ
print(url)
With that, you will see all the data os.environ gives you. It looks like you need the 'QUERY_STRING'. os.environ behaves like a dictionary, so you can obtain the value like this.
import os
url = os.environ['QUERY_STRING']
print(url)
And if you want a solution you can reuse anywhere, you can save the variables into a dictionary (named params here) like so:
import os
params = {}
for pair in os.environ.get('QUERY_STRING', '').split('&'):
    if not pair:
        continue
    name, _, value = pair.partition('=')  # split on the first '=' only
    params[name] = value
print(params)
If you are client side, then any of the other responses involving the get request will work
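Splitting the query string by hand leaves percent-encoded characters and `+` signs undecoded. As a hedged alternative for Python 3, the standard library's `urllib.parse.parse_qsl` handles the decoding; the helper name `query_to_dict` and the sample query string are made up for this example, and repeated parameter names keep only their last value because the result is a plain dictionary.

```python
from urllib.parse import parse_qsl


def query_to_dict(query_string):
    """Decode a query string into a {name: value} dictionary."""
    # keep_blank_values=True keeps parameters such as "flag=" in the result.
    return dict(parse_qsl(query_string, keep_blank_values=True))


print(query_to_dict("name=J%C3%BCrgen+M&token=a=b&flag="))
```

On the server side, the argument would be `os.environ.get('QUERY_STRING', '')`.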
