How to get current domain name with Python/GAE? - python

Current url is
http://myapp.appspot.com/something/<user-id>
or
http://127.0.0.1:8080/something/<user-id>
How in my python code I can get http://myapp.appspot.com/ or http://127.0.0.1:8080/?
This is need for dynamic links generation, for ex., to http://myapp.appspot.com/somethingelse.
self.request.path returns the whole path.

self.request.host_url

I think you want app_identity.get_default_version_hostname().
If an app is served from a custom domain, it may be necessary to
retrieve the entire hostname component. You can do this using the
app_identity.get_default_version_hostname() method.
This code:
logging.info(app_identity.get_default_version_hostname())
prints localhost:8080 on the development server.

If self.request.path returns the whole path, can't you just do:
import urlparse
def get_domain(url):
return urlparse.urlparse(url).netloc
>>> get_domain("http://myapp.appspot.com/something/")
'myapp.appspot.com'

Related

Redirect hostname/endpoint to api.hostname/endpoint in django

I have my api built with this pattern: api.hostname/endpoint.
However there is a plugin to my app that uses hostname/endpoint pattern.
I would like to solve it on the backend side by adding redirection to api.hostname/endpoint.
I tried to experiment with adding urls or paths to urlpatterns, but it didn't help me.
How can I achieve it? Any ideas?
Regards,
Maciej.
You can use urllib
import urllib.parse
url = "https://hostname/endpoint"
split_url = urllib.parse.urlsplit(url)
result = f"{split_url.scheme}://api.{split_url.hostname}/{split_url.endpoint}"
print(result)
>> "https://api.hostname/endpoint"

Get protocol and domain (WITHOUT subdomain) from a URL

This is an extension of Get protocol + host name from URL, with the added requirement that I want only the domain name, not the subdomain.
So, for example,
Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu
Input: mail.google.com
Output: google.com
Input: google.co.uk
Output: google.co.uk
For more context, I accept one or more seed URLs from a user and then run a scrapy crawler on the links. I need the domain name (without the subdomain) to set the allowed_urls attribute.
I've also taken a look at Python urlparse -- extract domain name without subdomain but the answers there seem outdated.
My current code uses urlparse but this also gets the subdomain which I don't want...
from urllib.parse import urlparse
uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'
Is there a (hopefully stdlib) way of getting (only) the domain in python-3.x?
I am using tldextract When I doing the domain parse.
In your case you only need combine the domain + suffix
import tldextract
tldextract.extract('mail.google.com')
Out[756]: ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
Out[757]: ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
Out[758]: ExtractResult(subdomain='', domain='google', suffix='co.uk')

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeding.
The first problem I have encountered is, that a big part is loaded via JavaScript, but that shouldn't be much problem because I can see what I need in the source code.
So, for example, I want to extract for a given word, the first two meanings, so in this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grĂșa.
This is my code:
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse((urllib2.urlopen(url)))
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, no success. I am not sure what is going on, the only way I have been able to crawl it is using curl, but that is sloopy, I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get("http://www.wordreference.com/es/translation.asp?tranword=crane")
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[#class="ToWrd"]/text()')
for i in trans:
print(i)
Prints:
grulla
grĂșa
plataforma
...
grulla blanca
grulla trompetera

Locally execute python file that is located on a web server

I am working on an open-source project called RubberBand which is an open source project that allows you to do what the title says. Locally execute python file that is located on a web server, however I have run a problem. If a comma is located in a string (etc. "http:"), It Will return an error.
'''
RubberBand Version 1.0.1 'Indigo-Charlie'
http://www.lukeshiels.com/rubberband
CHANGE-LOG:
Changed Error Messages.
Changed Whole Code Into one function, rather than three.
Changed Importing required libraries into one line instead of two
'''
#Edit Below this line
import httplib, urlparse
def executeFromURL(url):
if (url == None):
print "!# RUBBERBAND_ERROR: No URL Specified #!"
else:
CORE = None
good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
host, path = urlparse.urlparse(url)[1:3]
try:
conn = httplib.HTTPConnection(host)
conn.request('HEAD', path)
CORE = conn.getresponse().status
except StandardError:
CORE = None
if(CORE in good_codes):
exec(url)
else:
print "!# RUBBERBAND_ERROR: File Does Not Exist On WEBSERVER #!"
RubberBand in three lines without error checking:
import requests
def execute_from_url(url):
exec(requests.get(url).content)
You should use a return statement in your if (url == None): block as there is no point in carrying on with your function.
Where abouts in your code is the error, is there a full traceback as URIs with commas parse fine with the urlparse module.
Is it perhaps httplib.ResponseNotReady when calling CORE = conn.getresponse().status?
Nevermind that error message, that was me quickly testing your code and re-using the same connection object. I can't see what would be erroneous in your code.
I would suggest to check this question.
avoid comma in URL, that my suggestion.
Can I use commas in a URL?
This seems to work well for me:
import urllib
(fn,hd) = urllib.urlretrieve('http://host.com/file.py')
execfile(fn)
I prefer to use standard libraries, because I'm using python bundled with third party software (abaqus) which makes it a real headache to add packages.

How can I get the base URI in AppEngine?

How can I get the base URI in a Google AppEngine app written in Python? I'm using the webapp framework.
e.g.
http://example.appspot.com/
The proper way to parse self.request.url is not with a regular expression, but with Python standard library's urlparse module:
import urlparse
...
o = urlparse.urlparse(self.request.url)
Object o will be an instance of the ParseResult class with string-valued fields such as o.scheme (probably http;-) and o.netloc ('example.appspot.com' in your case). You can put some of the strings back together again with the urlparse.urlunparse function from the same module, e.g.
s = urlparse.urlunparse((o.scheme, o.netloc, '', '', '', ''))
which would give you in s the string 'http://example.appspot.com' in this case.
If you just want to find your app ID, you can get that from the environment without having to parse the current URL. The environment variable is APPLICATION_ID
You can also use this to find the current version (CURRENT_VERSION_ID), auth domain (which will let you know whether you're running on appspot.com, AUTH_DOMAIN) and whether you're running on the local development server or in production (SERVER_SOFTWARE).
So to get the full base URL, try something like this:
import os
def get_base_url():
if os.environ[AUTH_DOMAIN] == "gmail.com":
app_id = os.environ[APPLICATION_ID]
return "http://" + app_id + ".appspot.com"
else:
return "http://" + os.environ[AUTH_DOMAIN]
edit: AUTH_DOMAIN contains the custom domain, no need to include the app ID.
This will return the current version's base URL even if you're not accessing the current version, or if you visit the current version using a URL like http://current-version.latest.app_id.appspot.com (unlike URL-parsing methods)

Categories

Resources