How can I get the base URI in a Google AppEngine app written in Python? I'm using the webapp framework.
e.g.
http://example.appspot.com/
The proper way to parse self.request.url is not with a regular expression, but with Python standard library's urlparse module:
import urlparse
...
o = urlparse.urlparse(self.request.url)
Object o will be an instance of the ParseResult class with string-valued fields such as o.scheme (probably http;-) and o.netloc ('example.appspot.com' in your case). You can put some of the strings back together again with the urlparse.urlunparse function from the same module, e.g.
s = urlparse.urlunparse((o.scheme, o.netloc, '', '', '', ''))
which would give you in s the string 'http://example.appspot.com' in this case.
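In Python 3 the same functions live in urllib.parse. A minimal sketch of the steps above, where base_url is a hypothetical helper name and not part of webapp:

```python
from urllib.parse import urlparse, urlunparse


def base_url(url):
    """Return only the scheme and host part of url."""
    o = urlparse(url)
    # Keep scheme and netloc; blank out path, params, query and fragment.
    return urlunparse((o.scheme, o.netloc, "", "", "", ""))
```

In a handler this would be called as base_url(self.request.url).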
If you just want to find your app ID, you can get that from the environment without having to parse the current URL. The environment variable is APPLICATION_ID.
You can also use the environment to find the current version (CURRENT_VERSION_ID), the auth domain (AUTH_DOMAIN, which tells you whether you're running on appspot.com) and whether you're running on the local development server or in production (SERVER_SOFTWARE).
So to get the full base URL, try something like this:
import os
def get_base_url():
    # Environment variable names must be passed as strings.
    if os.environ["AUTH_DOMAIN"] == "gmail.com":
        app_id = os.environ["APPLICATION_ID"]
        return "http://" + app_id + ".appspot.com"
    else:
        return "http://" + os.environ["AUTH_DOMAIN"]
edit: AUTH_DOMAIN contains the custom domain, no need to include the app ID.
This will return the current version's base URL even if you're not accessing the current version, or if you visit the current version using a URL like http://current-version.latest.app_id.appspot.com (unlike URL-parsing methods)
Related
This is an extension of Get protocol + host name from URL, with the added requirement that I want only the domain name, not the subdomain.
So, for example,
Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu
Input: mail.google.com
Output: google.com
Input: google.co.uk
Output: google.co.uk
For more context, I accept one or more seed URLs from a user and then run a scrapy crawler on the links. I need the domain name (without the subdomain) to set the allowed_urls attribute.
I've also taken a look at Python urlparse -- extract domain name without subdomain but the answers there seem outdated.
My current code uses urlparse but this also gets the subdomain which I don't want...
from urllib.parse import urlparse
uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'
Is there a (hopefully stdlib) way of getting (only) the domain in python-3.x?
I use tldextract when I need to parse domains.
In your case you only need to combine the domain and the suffix.
import tldextract
tldextract.extract('mail.google.com')
Out[756]: ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
Out[757]: ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
Out[758]: ExtractResult(subdomain='', domain='google', suffix='co.uk')
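Combining the two fields means keeping one label to the left of the suffix. If tldextract cannot be installed, the same idea can be sketched with the standard library alone, under a big assumption: KNOWN_SUFFIXES below is a tiny hand-written stand-in for the Public Suffix List, and registered_domain is a hypothetical helper name that belongs to neither tldextract nor the stdlib.

```python
from urllib.parse import urlparse

# Assumption: illustrative only. The real Public Suffix List has thousands of entries.
KNOWN_SUFFIXES = {"com", "edu", "co.uk"}


def registered_domain(url, suffixes=KNOWN_SUFFIXES):
    """Return the domain plus suffix, dropping any subdomain labels."""
    # urlparse only fills in the host when the input has a '//' prefix.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that 'co.uk' is found before a bare 'uk' could be.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep one label to the left of the suffix, if there is one.
            return ".".join(labels[max(i - 1, 0):])
    # Unknown suffix: fall back to the full host name.
    return host
```

For real crawling, a maintained suffix list (which is what tldextract provides) is the safer choice, because a hand-written set will miss suffixes.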
I am working with an application that returns urls, written with Flask. I want the URL displayed to the user to be as clean as possible so I want to remove the http:// from it. I looked and found the urlparse library, but couldn't find any examples of how to do this.
What would be the best way to go about it, and if urlparse is overkill is there a simpler way? Would simply removing the "http://" substring from the URL just using the regular string parsing tools be bad practice or cause problems?
I don't think urlparse offers a single method or function for this. This is how I'd do it:
from urlparse import urlparse
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
def strip_scheme(url):
    parsed = urlparse(url)
    scheme = "%s://" % parsed.scheme
    return parsed.geturl().replace(scheme, '', 1)
print strip_scheme(url)
Output:
stackoverflow.com/questions/tagged/python?page=2
If you'd use (only) simple string parsing, you'd have to deal with http[s], and possibly other schemes yourself. Also, this handles weird casing of the scheme.
If you are using these programmatically rather than using a replace, I suggest having urlparse recreate the url without a scheme.
The ParseResult object is a tuple. So you can create another removing the fields you don't want.
# py2/3 compatibility
try:
    from urllib.parse import urlparse, ParseResult
except ImportError:
    from urlparse import urlparse, ParseResult
def strip_scheme(url):
    parsed_result = urlparse(url)
    return ParseResult('', *parsed_result[1:]).geturl()
You can remove any component of the ParseResult by replacing the corresponding field with an empty string.
It's important to note there is a functional difference between this answer and @Lukas Graf's answer. The most likely difference to matter is that the '//' component of a URL isn't technically part of the scheme, so this answer preserves it, whereas Lukas Graf's answer removes it.
>>> Lukas_strip_scheme('https://yoman/hi?whatup')
'yoman/hi?whatup'
>>> strip_scheme('https://yoman/hi?whatup')
'//yoman/hi?whatup'
A simple regex search and replace works.
import re
def strip_scheme(url: str):
    return re.sub(r'^https?:\/\/', '', url)
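The pattern above is case-sensitive and only knows http and https. A hedged variant that accepts any scheme name in any casing could look like this; strip_any_scheme and SCHEME_RE are hypothetical names, and the character class follows the general shape of a URI scheme (a letter followed by letters, digits, '+', '.' or '-'):

```python
import re

# A letter, then letters, digits, '+', '.' or '-', followed by '://'.
SCHEME_RE = re.compile(r'^[a-z][a-z0-9+.\-]*://', re.IGNORECASE)


def strip_any_scheme(url):
    """Remove a leading 'scheme://' from url, whatever the scheme or casing."""
    return SCHEME_RE.sub('', url, count=1)
```

URLs without a scheme pass through unchanged.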
I've seen this done in Flask libraries and extensions. It relies on the ._replace method of ParseResult/SplitResult. Despite the leading underscore, ._replace is a documented public method of named tuples rather than a protected member.
from urllib.parse import urlsplit, urlunsplit
url = 'HtTp://stackoverflow.com/questions/tagged/python?page=2'
split_url = urlsplit(url)
# >>> SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
split_url_without_scheme = split_url._replace(scheme="")
# >>> SplitResult(scheme='', netloc='stackoverflow.com', path='/questions/tagged/python', query='page=2', fragment='')
new_url = urlunsplit(split_url_without_scheme)
# >>> '//stackoverflow.com/questions/tagged/python?page=2'
Almost 9 years since the question was asked and still not much has changed :D.
This is the cleanest way I came up with to solve that issue:
from urllib.parse import urlparse
def strip_scheme(url: str) -> str:
    schemaless = urlparse(url)._replace(scheme='').geturl()
    return schemaless[2:] if schemaless.startswith("//") else schemaless
And some unit tests:
import pytest
@pytest.mark.parametrize(
    ['url', 'expected_url'],
    [
        # Test url not changed when no scheme
        ('www.test-url.com', 'www.test-url.com'),
        # Test https scheme stripped
        ('https://www.test-url.com', 'www.test-url.com'),
        # Test http scheme stripped
        ('http://www.test-url.com', 'www.test-url.com'),
        # Test only scheme stripped when url with path
        ('https://www.test-url.com/de/fr', 'www.test-url.com/de/fr'),
        # Test only scheme stripped when url with path and params
        ('https://test.com/de/fr?param1=foo', 'test.com/de/fr?param1=foo'),
    ]
)
def test_strip_scheme(url: str, expected_url: str) -> None:
    assert strip_scheme(url) == expected_url
According to the documentation (https://docs.python.org/3/library/urllib.parse.html#url-parsing), the return value is a named tuple whose items can be accessed by index or as named attributes. So we can access specific parts of the parsed URL through named attributes:
from urllib.parse import urlparse
def delete_http(link):
    url = urlparse(link)
    return url.netloc + url.path
user_link = input()
print(delete_http(user_link))
Input: https://stackoverflow.com/
Output: stackoverflow.com/
I have a bottle site that loads content via getJSON() requests.
To handle normal requests made when navigating the site, getJSON() requests are made to a Python script which dumps the results in JSON format.
Push/pop state is used to change the URL when the dynamic content is loaded.
To handle this same dynamic content when the URL is accessed directly, e.g. when reloading the page, I create a conditional which loads the bottle template and passes the path to the getJSON() request, which then loads the dynamic content.
@route('/')
@route('/<identifier>')
def my_function(identifier='index'):
    # do things with certain paths using if and elif conditionals
    #
    # handle the getJSON() path
    elif value_passed_to_getJSON.startswith("part_one/"):
        # get the path after / , in this example it should be 'part_two'
        my_variable = value_passed_to_getJSON.split("/")[1]
        # perform database query with my_variable
        response.content_type = 'application/json'
        return dumps(cursor)
    # and later for direct URL access, eg reloading the page, this essentially routes the query
    # back to the getJSON() path handler above.
    elif identifier == "part_one/part_two":
        return template('site_template', value_passed_to_getJSON="/" + identifier)
This setup works when the identifier is something like part_one, but not when it has the form above, i.e. part_one/part_two; in that case a 404 is raised.
As another test, if I simply put:
    elif identifier == "part_one/part_two":
        return "hello"
I also get a 404 with a Firebug error on GET part_two.
I am wondering if this is because the initial route @route('/<identifier>') only matches a single value after the forward slash?
Does it need an extra wildcard to handle the two parts of the path?
Demonstrated Solution (per comment below)
@route('/')
@route('/<identifier:path>')
@view('my_template.tpl')
def index(identifier='index'):
    if identifier == 'part_one/part_two':
        return "hello"
    else:
        return "something standard"
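As a rough illustration of why the path filter helps: a plain wildcard stops at the next slash, while the path filter also accepts slashes. The regular expressions below are approximations written for this example, not Bottle's actual internals, and match_identifier is a hypothetical helper.

```python
import re

# Approximation of '/<identifier>': one segment, no slashes allowed.
PLAIN = re.compile(r"^/(?P<identifier>[^/]+)$")
# Approximation of '/<identifier:path>': slashes are allowed too.
PATH = re.compile(r"^/(?P<identifier>.+?)$")


def match_identifier(pattern, url_path):
    """Return the captured identifier, or None when the route would not match."""
    m = pattern.match(url_path)
    return m.group("identifier") if m else None
```

With these patterns, '/part_one/part_two' is rejected by PLAIN (which corresponds to the 404) and captured whole by PATH.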
I am working on an open-source project called RubberBand that lets you do what the title says: locally execute a Python file that is located on a web server. However, I have run into a problem. If a comma is located in a string (e.g. "http:"), it returns an error.
'''
RubberBand Version 1.0.1 'Indigo-Charlie'
http://www.lukeshiels.com/rubberband
CHANGE-LOG:
Changed Error Messages.
Changed Whole Code Into one function, rather than three.
Changed Importing required libraries into one line instead of two
'''
#Edit Below this line
import httplib, urlparse
def executeFromURL(url):
    if (url == None):
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
    else:
        CORE = None
        good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
        host, path = urlparse.urlparse(url)[1:3]
        try:
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path)
            CORE = conn.getresponse().status
        except StandardError:
            CORE = None
        if(CORE in good_codes):
            exec(url)
        else:
            print "!# RUBBERBAND_ERROR: File Does Not Exist On WEBSERVER #!"
RubberBand in three lines without error checking:
import requests
def execute_from_url(url):
    exec(requests.get(url).content)
You should use a return statement in your if (url == None): block as there is no point in carrying on with your function.
Whereabouts in your code is the error? Is there a full traceback? URIs with commas parse fine with the urlparse module.
Is it perhaps httplib.ResponseNotReady when calling CORE = conn.getresponse().status?
Nevermind that error message, that was me quickly testing your code and re-using the same connection object. I can't see what would be erroneous in your code.
I would suggest checking this question.
Avoid commas in URLs; that's my suggestion.
Can I use commas in a URL?
This seems to work well for me:
import urllib
(fn,hd) = urllib.urlretrieve('http://host.com/file.py')
execfile(fn)
I prefer to use standard libraries, because I'm using python bundled with third party software (abaqus) which makes it a real headache to add packages.
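In Python 3, execfile is gone and urlretrieve has moved to urllib.request, so a stdlib-only equivalent needs a slightly different shape. A minimal sketch, assuming the URL serves plain Python source; the optional namespace parameter is an addition made for this example:

```python
from urllib.request import urlopen


def execute_from_url(url, namespace=None):
    """Download Python source from url, run it, and return the namespace used."""
    if namespace is None:
        namespace = {}
    with urlopen(url) as response:
        source = response.read()
    # exec must receive the downloaded source, not the URL string itself.
    # Compiling first lets tracebacks name the URL as the file.
    exec(compile(source, url, "exec"), namespace)
    return namespace
```

Names defined by the remote script end up in the returned dictionary. Only run this against sources you trust, since the downloaded code runs with the full rights of the calling process.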
Current url is
http://myapp.appspot.com/something/<user-id>
or
http://127.0.0.1:8080/something/<user-id>
How can I get http://myapp.appspot.com/ or http://127.0.0.1:8080/ in my Python code?
I need this to generate links dynamically, for example to http://myapp.appspot.com/somethingelse.
self.request.path returns the whole path.
self.request.host_url
I think you want app_identity.get_default_version_hostname().
If an app is served from a custom domain, it may be necessary to
retrieve the entire hostname component. You can do this using the
app_identity.get_default_version_hostname() method.
This code:
logging.info(app_identity.get_default_version_hostname())
prints localhost:8080 on the development server.
If self.request.path returns the whole path, can't you just do:
import urlparse
def get_domain(url):
    return urlparse.urlparse(url).netloc
>>> get_domain("http://myapp.appspot.com/something/")
'myapp.appspot.com'
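For the dynamic-link use case from the question, the standard library can also build the target URL directly from the current one. A small Python 3 sketch, where link_to is a hypothetical helper and current_url stands for whatever the handler reports as the request URL:

```python
from urllib.parse import urljoin


def link_to(current_url, path):
    """Build an absolute URL on the same scheme and host as current_url."""
    # A path starting with '/' replaces the whole path of current_url.
    return urljoin(current_url, "/" + path.lstrip("/"))
```

Because the host comes from the current request, the same call works on appspot.com and on the development server.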