Get Root Domain of Link - python

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?

Getting the hostname is easy enough using urlparse:
hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname
Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and private domains which are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:
import publicsuffix
import urlparse

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself. Make sure you
    # update frequently, though!
    psl = publicsuffix.fetch()

    hostname = urlparse.urlparse(url).hostname

    return publicsuffix.get_public_suffix(hostname, psl)
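A quick usage sketch of the function above (the values in the comments are what you would expect, assuming the library resolves registrable domains as described):
print get_base_domain("http://www.techcrunch.com/")     # techcrunch.com (expected)
print get_base_domain("http://www.theregister.co.uk/")  # theregister.co.uk (expected)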

General structure of a URL:
scheme://netloc/path;parameters?query#fragment
In the spirit of the TIMTOWTDI motto (There Is More Than One Way To Do It):
Using urlparse,
>>> from urllib.parse import urlparse # python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever') # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '') # as per your case
>>> print(result)
'stackoverflow.com/'
Using tldextract,
>>> import tldextract # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
in your case:
>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'
tldextract, on the other hand, knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
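Recent versions of tldextract also expose a registered_domain attribute on the result, which joins domain and suffix for you (worth verifying against the version you have installed):
>>> tldextract.extract('http://www.techcrunch.com/').registered_domain
'techcrunch.com'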
Cheerio! :)

The following script is not perfect, but it can be used for display/shortening purposes. If you really want or need to avoid any third-party dependencies (especially remotely fetching and caching TLD data), I can suggest the following script, which I use in my projects. It keeps the last two parts of the domain for the most common domain extensions and the last three parts for the rest of the less-known extensions. In the worst case the domain will have three parts instead of two:
from urlparse import urlparse

def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path  # Just in case, for URLs without a scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
            'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
    return domain
extract_domain('google.com') # google.com
extract_domain('www.google.com') # google.com
extract_domain('sub.sub2.google.com') # google.com
extract_domain('google.co.uk') # google.co.uk
extract_domain('sub.google.co.uk') # google.co.uk
extract_domain('www.google.com') # google.com
extract_domain('sub.sub2.voila.fr') # sub2.voila.fr

Using Python 3.3 and not 2.x
I would like to add a small thing to Ben Blank's answer.
from urllib.parse import quote,unquote,urlparse
u=unquote(u) #u= URL e.g. http://twitter.co.uk/hello/there
g=urlparse(u)
u=g.netloc
At this point, I have just the domain name from urlparse.
To remove the subdomains you first need to know which parts are top-level domains and which are not. E.g. in http://twitter.co.uk above, co.uk is a TLD, while in http://sub.twitter.com we have only .com as the TLD and sub is a subdomain.
So we need a file/list which has all the TLDs.
tlds = load_file("tlds.txt")  # tlds holds the list of TLDs

hostname = u.split(".")
if len(hostname) > 2:
    if hostname[-2].upper() in tlds:
        hostname = ".".join(hostname[-3:])
    else:
        hostname = ".".join(hostname[-2:])
else:
    hostname = ".".join(hostname[-2:])

from urlparse import urlsplit  # Python 2; on Python 3 use: from urllib.parse import urlsplit

def get_domain(url):
    u = urlsplit(url)
    return u.netloc

def get_top_domain(url):
    u"""
    >>> get_top_domain('http://www.google.com')
    'google.com'
    >>> get_top_domain('http://www.sina.com.cn')
    'sina.com.cn'
    >>> get_top_domain('http://bbc.co.uk')
    'bbc.co.uk'
    >>> get_top_domain('http://mail.cs.buaa.edu.cn')
    'buaa.edu.cn'
    """
    domain = get_domain(url)
    domain_parts = domain.split('.')
    if len(domain_parts) < 2:
        return domain
    top_domain_parts = 2
    # if a domain's last part is 2 letters long, it is likely a country code
    if len(domain_parts[-1]) == 2:
        if domain_parts[-1] in ['uk', 'jp']:
            if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                top_domain_parts = 3
        else:
            if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                top_domain_parts = 3
    return '.'.join(domain_parts[-top_domain_parts:])

You don't need a package, or any of the complexities people are suggesting, to do this; it's as simple as the snippet below, which you can tweak to your liking. Note that it returns the hostname (netloc) as-is, including any www. prefix.
def is_root(url):
    head, sep, tail = url.partition('//')
    # Take everything up to the first '/' in the remainder, i.e. the hostname
    is_root_domain = tail.split('/', 1)[0] if '/' in tail else tail
    # printing or returning is_root_domain will give you what you seek
    print(is_root_domain)
is_root('http://www.techcrunch.com/')

This worked for me:
from urllib.parse import urlparse

def get_sub_domains(url):
    urlp = urlparse(url)
    urlsplit = urlp.netloc.split(".")
    l = []
    if len(urlsplit) < 3:
        return l
    for item in urlsplit:
        urlsplit = urlsplit[1:]
        l.append(".".join(urlsplit))
        if len(urlsplit) < 3:
            return l
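For illustration, with the fixes above this returns the list of parent domains produced by stripping one leading label at a time:
>>> get_sub_domains('http://sub.sub2.google.com/path')
['sub2.google.com', 'google.com']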

This simple code will get the base URL (scheme plus full hostname, including any www. prefix) from any valid URL:
from urllib.parse import urlparse

url = 'https://www.google.com/search?q=python'
parsed = urlparse(url)
root_url = parsed.scheme + '://' + parsed.hostname
print(root_url)  # https://www.google.com

This worked for my purposes. I figured I'd share it.
".".join("www.sun.google.com".split(".")[-2:])

Related

Python add to url

I have a URL as follows:
http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
I need to insert a node 'us' in this case, as follows:
http://www.example.com/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01
Using Python's urlparse library, I can get to the path as follows:
path = urlparse(url).path
... and then I reconstruct the new URL using a complicated and ugly routine involving splitting the path on slashes and inserting the new node:
>>> path = urlparse(url).path
>>> path.split('/')
['', 'boards', 'results', 'current:entry1,current:entry2', 'modular', 'table', 'alltables', 'alltables', 'alltables', '2011-01-01']
>>> ps = path.split('/')
>>> ps.insert(3, 'us')
>>> '/'.join(ps)
'/boards/results/us/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01'
>>>
Is there a more elegant/pythonic way to accomplish this using default libraries?
EDIT:
The 'results' in the URL is not fixed - it can be 'results' or 'products' or 'prices' and so on. However, it will always be right after 'boards'.
path = "http://www.example.com/boards/results/current:entry1,current:entry2/modular/table/alltables/alltables/alltables/2011-01-01"
replace_start_word = 'results'
replace_word_length = len(replace_start_word)
replace_index = path.find(replace_start_word)
new_url = '%s/us%s' % (path[:replace_index + replace_word_length], path[replace_index + replace_word_length:])
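Since the EDIT says the segment after 'boards' can change, here is a sketch that keys off 'boards' itself instead of a hard-coded 'results' (standard library only; the helper name insert_after_boards is mine):
from urlparse import urlsplit, urlunsplit  # Python 2; urllib.parse on Python 3

def insert_after_boards(url, node):
    parts = urlsplit(url)
    segments = parts.path.split('/')
    # Place the new node right after the segment that follows 'boards',
    # whatever that segment happens to be ('results', 'products', ...).
    segments.insert(segments.index('boards') + 2, node)
    # SplitResult is a namedtuple, so _replace builds a new one with the changed path
    return urlunsplit(parts._replace(path='/'.join(segments)))

# insert_after_boards(url, 'us') returns the URL with 'us' inserted right
# after the 'boards/<whatever>' segment.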

crlDistributionPoints dirName

I am a new user of pyOpenSSL, and I want to make a certificate with the following code:
from OpenSSL import crypto as c
cert = c.X509()
cert.add_extensions([
    c.X509Extension('crlDistributionPoints', False, 'dirName:/C=US/O=TEST'),
])
This code doesn't work; can anyone help me? pyOpenSSL seems not to support dirName, although the URI form does work:
cert.add_extensions([
    c.X509Extension('crlDistributionPoints', False, 'URI:http://somesite'),
])
I had exactly the same problem and, although I couldn't find a real solution either, I managed to put together a sort of workaround to get it done via Python.
The formatting is explained on this page: http://openssl.org/docs/apps/x509v3_config.html#CRL-distribution-points, along with an option to use raw DER bytes (section: ARBITRARY EXTENSIONS).
First 'collect' the DER bytes from a certificate which already has the correct URI and dirName. Alternatively, make a certificate with openssl that has the correct crlDistributionPoints; tmpcert in this example is that certificate. Also figure out which extension index is used: get_short_name will give the 'key' of the extension, so search for crlDistributionPoints.
Collect it using:
from binascii import hexlify
print tmpcert.get_extension(5).get_short_name()
print hexlify(tmpcert.get_extension(5).get_data())
Afterwards, format this output and use it in the initialiser of X509Extension():
crypto.X509Extension('crlDistributionPoints', False,
"DER:30:6a:xx:xx:xx:..........:xx:xx")
As you can see, this is quite a 'hardcoded' solution; there is no straightforward way of altering the content of this field this way.
Here is a way in which you can generate the DER. It does not include the code for dirName, but I hope it gives an idea of how you can construct the DER:
from pyasn1.codec.der import encoder as der_encoder
from pyasn1.type import tag
from pyasn1_modules import rfc2459


class GeneralNames(rfc2459.GeneralNames):
    """
    rfc2459 has wrong tagset.
    """
    tagSet = tag.TagSet(
        (),
        tag.Tag(tag.tagClassContext, tag.tagFormatConstructed, 0),
    )


class DistributionPointName(rfc2459.DistributionPointName):
    """
    rfc2459 has wrong tagset.
    """
    tagSet = tag.TagSet(
        (),
        tag.Tag(tag.tagClassContext, tag.tagFormatConstructed, 0),
    )


cdps = [('uri', 'http://something'), ('dns', 'some.domain.com')]

cdp = rfc2459.CRLDistPointsSyntax()
position = 0
for cdp_type, cdp_value in cdps:
    cdp_entry = rfc2459.DistributionPoint()

    general_name = rfc2459.GeneralName()
    if cdp_type == 'uri':
        general_name.setComponentByName(
            'uniformResourceIdentifier',
            cdp_value,
        )
    elif cdp_type == 'dns':
        general_name.setComponentByName(
            'dNSName',
            cdp_value,
        )

    general_names = GeneralNames()
    general_names.setComponentByPosition(0, general_name)

    name = DistributionPointName()
    name.setComponentByName('fullName', general_names)

    cdp_entry.setComponentByName('distributionPoint', name)
    cdp.setComponentByPosition(position, cdp_entry)
    position += 1

cdp_der = der_encoder.encode(cdp)

# 'extensions' is the list later passed to cert.add_extensions();
# 'crypto' is OpenSSL.crypto from pyOpenSSL.
extensions.append(
    crypto.X509Extension(
        b'crlDistributionPoints',
        False,
        'DER:' + cdp_der.encode('hex'),
    ),
)

How to manipulate a URL string in order to extract a single piece?

I'm new to programming and Python.
Background
My program accepts a url. I want to extract the username from the url.
The username is the subdomain.
If the subdomain is 'www', the username should be the main part of the domain. The rest of the domain should be discarded (e.g. '.com/', '.org/').
I've tried the following:
def get_username_from_url(url):
    if url.startswith(r'http://www.'):
        user = url.replace(r'http://www.', '', 1)
        user = user.split('.')[0]
        return user
    elif url.startswith(r'http://'):
        user = url.replace(r'http://', '', 1)
        user = user.split('.')[0]
        return user
easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"
print get_username_from_url(easy_url)
# output = httpwwwweirdusername (good! expected.)
print get_username_from_url(hard_url)
# output = weirdusername (bad! username should = httpwwwweirdusername)
I've tried many other combinations using strip(), split(), and replace().
Could you advise me on how to solve this relatively simple problem?
There is a module called urlparse that is specifically for the task:
>>> from urlparse import urlparse
>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> urlparse(url).hostname.split('.')[0]
'httpwwwweirdusername'
In the case of http://www.httpwwwweirdusername.com/ it would output www, which is not desired. There are workarounds to ignore the www part, for example taking the first item from the split hostname that is not equal to www:
>>> from urlparse import urlparse
>>> url = "http://www.httpwwwweirdusername.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'
>>> url = "http://httpwwwweirdusername.blogger.com/"
>>> next(item for item in urlparse(url).hostname.split('.') if item != 'www')
'httpwwwweirdusername'
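If a third-party package is an option, tldextract (mentioned earlier on this page) handles both cases without special-casing www; a quick sketch (the helper name is mine, expected output shown):
>>> import tldextract
>>> def get_username_from_url(url):
...     ext = tldextract.extract(url)
...     # Prefer the subdomain when there is one that isn't just 'www';
...     # otherwise fall back to the registered domain name itself.
...     return ext.subdomain if ext.subdomain and ext.subdomain != 'www' else ext.domain
...
>>> get_username_from_url("http://www.httpwwwweirdusername.com/")
'httpwwwweirdusername'
>>> get_username_from_url("http://httpwwwweirdusername.blogger.com/")
'httpwwwweirdusername'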
It's possible to do this with regular expressions (you could probably modify the regex to be more accurate/efficient):
import re

url_pattern = re.compile(r'.*/(?:www\.)?(\w+)')

def get_username_from_url(url):
    match = re.match(url_pattern, url)
    if match:
        return match.group(1)
easy_url = "http://www.httpwwwweirdusername.com/"
hard_url = "http://httpwwwweirdusername.blogger.com/"
print get_username_from_url(easy_url)
print get_username_from_url(hard_url)
Which yields us:
httpwwwweirdusername
httpwwwweirdusername

script to serve from url, for requests matching regular expression

I am a complete n00b in Python and am trying to figure out a stub for mitmproxy.
I have tried the documentation, but it assumes we know Python, so I am at a standstill.
I've been working with a script:
original_url = 'http://production.domain.com/1/2/3'
new_content_path = '/home/andrepadez/proj/main.js'
body = open(new_content_path, 'r').read()

def response(context, flow):
    url = flow.request.get_url()
    if url == original_url:
        flow.response.content = body
As you would expect, the proxy takes every request to 'http://production.domain.com/1/2/3' and serves the content of my file.
I need this to be more dynamic:
for every request to 'http://production.domain.com/*', I need to serve the content of the corresponding URL, for example:
http://production.domain.com/1/4/3 -> http://develop.domain.com/1/4/3
I know I have to use a regular expression so I can capture and map it correctly, but I don't know how to serve the contents of the develop URL as "flow.response.content".
Any help will be welcome.
You would have to do something like this:
import re

# In order not to re-read the original file every time, we maintain
# a cache of already-read bodies.
bodies = {}

def response(context, flow):
    # Intercept all URLs
    url = flow.request.get_url()
    # Check if this URL is one of "ours" (check out Python regexps)
    m = re.search(r'REGEXP_FOR_ORIGINAL_URL/(\d+)/(\d+)/(\d+)', url)
    if m is not None:
        # It is, and m will contain this information
        # The three numbers are in m.group(1), (2), (3)
        key = "%s.%s.%s" % (m.group(1), m.group(2), m.group(3))
        try:
            body = bodies[key]
        except KeyError:
            # We do not yet have this body; retrieve it however is
            # appropriate for your setup, e.g. from a local file:
            body = open("%s.txt" % (key,), 'r').read()
            bodies[key] = body
        flow.response.content = body
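If the goal is to serve the corresponding develop URL rather than local files, a minimal sketch in the same style (assuming the same old flow.request.get_url() API used above and that the requests library is available; fetching inside the proxy callback is blocking, so treat this as a starting point, not a finished script):
import re
import requests

PRODUCTION_PREFIX = re.compile(r'^http://production\.domain\.com/(.*)$')

def response(context, flow):
    url = flow.request.get_url()
    m = PRODUCTION_PREFIX.match(url)
    if m is not None:
        # Map the production URL onto its develop counterpart and
        # serve that body instead of the original response.
        develop_url = 'http://develop.domain.com/' + m.group(1)
        flow.response.content = requests.get(develop_url).content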

How can I talk to UniProt over HTTP in Python?

I'm trying to get some results from UniProt, which is a protein database (details are not important). I'm trying to use a script that translates from one kind of ID to another. I was able to do this manually in the browser, but could not do it in Python.
In http://www.uniprot.org/faq/28 there are some sample scripts. I tried the Perl one and it seems to work, so the problem is my Python attempts. The (working) script is:
## tool_example.pl ##
use strict;
use warnings;
use LWP::UserAgent;

my $base = 'http://www.uniprot.org';
my $tool = 'mapping';
my $params = {
    from => 'ACC', to => 'P_REFSEQ_AC', format => 'tab',
    query => 'P13368 P20806 Q9UM73 P97793 Q17192'
};

my $agent = LWP::UserAgent->new;
push @{$agent->requests_redirectable}, 'POST';
print STDERR "Submitting...\n";
my $response = $agent->post("$base/$tool/", $params);

while (my $wait = $response->header('Retry-After')) {
    print STDERR "Waiting ($wait)...\n";
    sleep $wait;
    print STDERR "Checking...\n";
    $response = $agent->get($response->base);
}

$response->is_success ?
    print $response->content :
    die 'Failed, got ' . $response->status_line .
        ' for ' . $response->request->uri . "\n";
My questions are:
1) How would you do that in Python?
2) Will I be able to massively "scale" that (i.e., use a lot of entries in the query field)?
Question 1:
This can be done using Python's urllib modules:
import urllib, urllib2
import time
import sys

# Join the IDs passed on the command line into a single query string
query = ' '.join(sys.argv[1:])

# encode params as a list of 2-tuples
params = (('from', 'ACC'), ('to', 'P_REFSEQ_AC'), ('format', 'tab'), ('query', query))
# url encode them
data = urllib.urlencode(params)
url = 'http://www.uniprot.org/mapping/'

# fetch the data
try:
    foo = urllib2.urlopen(url, data)
except urllib2.HTTPError, e:
    if e.code == 503:
        # get the value of the Retry-After header and wait that long
        wait_time = int(e.hdrs.get('Retry-after', 0))
        print 'Sleeping %i seconds...' % (wait_time,)
        time.sleep(wait_time)
        foo = urllib2.urlopen(url, data)

# foo is a file-like object, do with it what you will.
foo.read()
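For what it's worth, the same request with the third-party requests library (if installing it is an option) is roughly this sketch:
import requests

params = {'from': 'ACC', 'to': 'P_REFSEQ_AC', 'format': 'tab',
          'query': 'P13368 P20806 Q9UM73 P97793 Q17192'}
# requests follows redirects by default; for long-running jobs you may
# still want to honour any Retry-After header yourself.
response = requests.post('http://www.uniprot.org/mapping/', data=params)
print(response.text)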
You're probably better off using the Protein Identifier Cross Reference service from the EBI to convert one set of IDs to another. It has a very good REST interface.
http://www.ebi.ac.uk/Tools/picr/
I should also mention that UniProt has very good web services available, though if you are tied to using simple HTTP requests for some reason then it's probably not useful.
Let's assume that you are using Python 2.5.
We can use httplib to directly call the web site:
import httplib, urllib

querystring = {}
# Build the query string here from the following keys (query, format, columns, compress, limit, offset)
querystring["query"] = ""
querystring["format"] = ""   # one of html | tab | fasta | gff | txt | xml | rdf | rss | list
querystring["columns"] = ""  # the columns you want, comma separated
querystring["compress"] = "" # yes or no
## These may be optional
querystring["limit"] = ""    # I guess if you only want a few rows
querystring["offset"] = ""   # bring on paging

## From the examples - query=organism:9606+AND+antigen&format=xml&compress=no
## Delete the following and replace with your query
querystring = {}
querystring["query"] = "organism:9606 AND antigen"
querystring["format"] = "xml"   # make it human readable
querystring["compress"] = "no"  # I don't want to have to unzip

conn = httplib.HTTPConnection("www.uniprot.org")
conn.request("GET", "/uniprot/?" + urllib.urlencode(querystring))
r1 = conn.getresponse()
if r1.status == 200:
    data1 = r1.read()
    print data1  # or do something with it
You could then make a function around creating the query string and you should be away.
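A sketch of such a wrapper (the function name is mine, built directly on the httplib calls above):
def uniprot_query(query, format="tab", **extra):
    # Wrap the query-string construction and the GET request shown above.
    params = dict(query=query, format=format, **extra)
    conn = httplib.HTTPConnection("www.uniprot.org")
    conn.request("GET", "/uniprot/?" + urllib.urlencode(params))
    resp = conn.getresponse()
    if resp.status == 200:
        return resp.read()
    return None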
Check out bioservices; it interfaces with a lot of databases through Python.
https://pythonhosted.org/bioservices/_modules/bioservices/uniprot.html
conda install bioservices --yes
To complement O.rka's answer:
Question 1:
from bioservices import UniProt
u = UniProt()
res = u.get_df("P13368 P20806 Q9UM73 P97793 Q17192".split())
This returns a dataframe with all information about each entry.
Question 2: same answer. This should scale up.
Disclaimer: I'm the author of bioservices
There is a Python package on PyPI which does exactly what you want:
pip install uniprot-mapper
