Get protocol and domain (WITHOUT subdomain) from a URL - python

This is an extension of Get protocol + host name from URL, with the added requirement that I want only the domain name, not the subdomain.
So, for example,
Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu
Input: mail.google.com
Output: google.com
Input: google.co.uk
Output: google.co.uk
For more context, I accept one or more seed URLs from a user and then run a scrapy crawler on the links. I need the domain name (without the subdomain) to set the allowed_domains attribute.
I've also taken a look at Python urlparse -- extract domain name without subdomain but the answers there seem outdated.
My current code uses urlparse but this also gets the subdomain which I don't want...
from urllib.parse import urlparse
uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'
Is there a (hopefully stdlib) way of getting (only) the domain in python-3.x?

I use tldextract when I parse domains.
In your case, you only need to combine the domain and the suffix:
import tldextract
tldextract.extract('mail.google.com')
Out[756]: ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
Out[757]: ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
Out[758]: ExtractResult(subdomain='', domain='google', suffix='co.uk')
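If a third-party dependency is not an option, the same idea can be roughly sketched with the stdlib only. This is an illustration under a loud assumption: KNOWN_SUFFIXES is a hypothetical, hand-written set that is nowhere near the full Public Suffix List, and registered_domain is a made-up helper name, so it is not a replacement for tldextract.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny suffix set (assumption for this sketch);
# the real Public Suffix List has thousands of entries.
KNOWN_SUFFIXES = {"com", "edu", "org", "co.uk"}

def registered_domain(url):
    # urlparse only fills in the host when a scheme or a leading "//" is present
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before "uk" would be
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            # Keep exactly one label in front of the matched suffix
            return ".".join(labels[i - 1:])
    # No known suffix matched: fall back to the full host name
    return host

print(registered_domain("classes.usc.edu/xxx/yy/zz"))
print(registered_domain("mail.google.com"))
print(registered_domain("google.co.uk"))
```

With the three inputs from the question this prints usc.edu, google.com and google.co.uk. Any suffix missing from the set falls through to the full host name, which is why a maintained suffix list is the safer choice for real crawls.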

Related

How can I remove 'www.' from original URL through [urllib] parse in python?

Original URL ▶ https://www.exeam.org/index.html
I want to extract exeam.org/ or exeam.org from original URL.
To do this, I used urllib, the most powerful parser in Python that I know of,
but unfortunately urllib (url.scheme, url.netloc, ...) couldn't give me the format I wanted.
To extract the domain name from a URL using urllib:
from urllib.parse import urlparse
surl = "https://www.exam.org/index.html"
urlparsed = urlparse(surl)
# network location from parsed url
print(urlparsed.netloc)
# ParseResult Object
print(urlparsed)
This will give you www.exam.org. If you are after just the exam.org part, you need to decompose it further into the registered domain. Simple splits could be sufficient, but you could also use a library such as tldextract, which knows how to parse subdomains, suffixes and more:
from tldextract import extract
ext = extract(surl)
print(ext.registered_domain)
This will produce:
exam.org
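For the "simple split" route mentioned above, a minimal stdlib sketch could look like the following. It only drops a leading www. label and leaves every other subdomain in place; strip_www is a name made up for this example.

```python
from urllib.parse import urlparse

def strip_www(url):
    host = urlparse(url).netloc
    # Only a leading "www." is removed; other subdomains are kept as they are
    return host[4:] if host.startswith("www.") else host

print(strip_www("https://www.exam.org/index.html"))
```

This is enough when www. is the only prefix you expect. For arbitrary subdomains or multi-part suffixes such as co.uk, a suffix-aware library is the more robust option.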

How to get domain name for any site?

I am getting the output "com" instead of "google.com".
I have tried changing the site name, but every site shows "com".
from tld import get_tld
def get_domain_name(url):
    domain_name = get_tld(url)
    return domain_name
print(get_domain_name("https://www.google.com"))
I expect the output to be "google.com". I have tried this code in the Piaza workspace terminal.
What you're asking for is the first level domain, so you'd use the function get_fld:
from tld import get_fld
print(get_fld("https://www.google.com"))
# google.com
See the documentation at readthedocs

How to get the part of a URL without protocol nor domain

I have URLs of the form
http://example.com/example/a/b/c.html
https://www.example.com/
How do I get the path from the server root, without protocol or domain name? With the examples above, the function should return:
/example/a/b/c.html
/
(I am using Django: answers relying on this framework are accepted!)
The urlparse module can solve this:
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
parsed_url = urlparse('http://example.com/abc/cde')
assert parsed_url.path == '/abc/cde'
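Applied to the two URLs from the question, a small Python 3 sketch might look like this. The helper name path_from_root is made up here, and falling back to "/" for a URL with an empty path is an assumption about what the asker wants.

```python
from urllib.parse import urlparse

def path_from_root(url):
    # urlparse reports an empty path for e.g. "http://example.com",
    # so fall back to "/" (an assumption made for this sketch)
    return urlparse(url).path or "/"

print(path_from_root("http://example.com/example/a/b/c.html"))
print(path_from_root("https://www.example.com/"))
```

This prints /example/a/b/c.html and /. The query string and fragment are not part of .path, so they are dropped as well.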
You could use the path attribute of the Django HttpRequest object, in other words:
request.path
See the docs for more.

How to get current domain name with Python/GAE?

Current url is
http://myapp.appspot.com/something/<user-id>
or
http://127.0.0.1:8080/something/<user-id>
How can I get http://myapp.appspot.com/ or http://127.0.0.1:8080/ in my Python code?
This is needed for generating dynamic links, e.g. to http://myapp.appspot.com/somethingelse.
self.request.path returns the whole path.
self.request.host_url
I think you want app_identity.get_default_version_hostname().
If an app is served from a custom domain, it may be necessary to
retrieve the entire hostname component. You can do this using the
app_identity.get_default_version_hostname() method.
This code:
logging.info(app_identity.get_default_version_hostname())
prints localhost:8080 on the development server.
If self.request.path returns the whole path, can't you just do:
import urlparse  # Python 2 module; on Python 3 the same functions live in urllib.parse
def get_domain(url):
    return urlparse.urlparse(url).netloc
>>> get_domain("http://myapp.appspot.com/something/")
'myapp.appspot.com'
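On Python 3, the same idea can be sketched with urllib.parse, including the dynamic link generation mentioned in the question. The names base_url and link_to are made up for illustration, and the input is assumed to be an absolute URL such as the one the request object provides.

```python
from urllib.parse import urljoin, urlsplit

def base_url(url):
    # Keep only the scheme and host[:port]; drop path, query and fragment
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"

def link_to(current_url, path):
    # urljoin replaces the whole path when `path` starts with "/"
    return urljoin(base_url(current_url), path)

print(base_url("http://127.0.0.1:8080/something/42"))
print(link_to("http://myapp.appspot.com/something/42", "/somethingelse"))
```

This prints http://127.0.0.1:8080/ and http://myapp.appspot.com/somethingelse, so the same code works on the development server and in production.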

How can I get the base URI in AppEngine?

How can I get the base URI in a Google AppEngine app written in Python? I'm using the webapp framework.
e.g.
http://example.appspot.com/
The proper way to parse self.request.url is not with a regular expression, but with Python standard library's urlparse module:
import urlparse
...
o = urlparse.urlparse(self.request.url)
Object o will be an instance of the ParseResult class with string-valued fields such as o.scheme (probably http;-) and o.netloc ('example.appspot.com' in your case). You can put some of the strings back together again with the urlparse.urlunparse function from the same module, e.g.
s = urlparse.urlunparse((o.scheme, o.netloc, '', '', '', ''))
which would give you in s the string 'http://example.appspot.com' in this case.
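The snippets above use the Python 2 urlparse module. A rough Python 3 equivalent, assuming only that the functions moved to urllib.parse, could be sketched like this (base_uri is a hypothetical helper name):

```python
from urllib.parse import urlparse, urlunparse

def base_uri(url):
    o = urlparse(url)
    # Rebuild the URL from scheme and netloc only; path, params,
    # query and fragment are passed as empty strings
    return urlunparse((o.scheme, o.netloc, "", "", "", ""))

print(base_uri("http://example.appspot.com/some/page?x=1"))
```

This prints http://example.appspot.com, matching the result described above.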
If you just want to find your app ID, you can get that from the environment without having to parse the current URL. The environment variable is APPLICATION_ID.
You can also use the environment to find the current version (CURRENT_VERSION_ID), the auth domain (AUTH_DOMAIN, which tells you whether you're running on appspot.com) and whether you're running on the local development server or in production (SERVER_SOFTWARE).
So to get the full base URL, try something like this:
import os
def get_base_url():
    # Environment variable names must be passed as strings
    if os.environ["AUTH_DOMAIN"] == "gmail.com":
        app_id = os.environ["APPLICATION_ID"]
        return "http://" + app_id + ".appspot.com"
    else:
        return "http://" + os.environ["AUTH_DOMAIN"]
edit: AUTH_DOMAIN contains the custom domain, no need to include the app ID.
This will return the current version's base URL even if you're not accessing the current version, or if you visit the current version using a URL like http://current-version.latest.app_id.appspot.com (unlike URL-parsing methods).
