How to distinguish a domain and a hostname in python

A domain is something like google.com or yahoo.com. It also has a WHOIS record.
A hostname is something like m.google.com, www.google.com, or images.google.com.
A domain can have very interesting TLDs and ccTLDs: google.co.uk, google.academy, google.xxx
A hostname can also look like this: mail.services.1.google.com, xxx.google.com
Here is the question: I have a string variable and I want to decide whether its value is a hostname or a domain. Is there a clever way to distinguish them in Python?

You already seem to know how to differentiate between them.
Use urllib.parse to break down the string and then write your own logic to decide.
Docs: https://docs.python.org/3/library/urllib.parse.html

I found the answer. We can do this with the tldextract package.
import tldextract
test_str = 'mail.google.co.uk'
te_result = tldextract.extract(test_str)
# domain label + public suffix = the registrable domain
domain = '{}.{}'.format(te_result.domain, te_result.suffix)
print('domain: {}'.format(domain))
print('is_hostname: {}'.format(test_str != domain))
print('is_domain: {}'.format(test_str == domain))
Output:
domain: google.co.uk
is_hostname: True
is_domain: False
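If installing a package is not an option, the same comparison can be sketched with the standard library alone. The suffix table below is a tiny hand-picked sample, not the real Public Suffix List, and the names SAMPLE_SUFFIXES, registrable_domain and classify are made up for this illustration.

```python
# A minimal sketch: SAMPLE_SUFFIXES is a hypothetical, hand-picked table,
# not the real Public Suffix List that tldextract consults.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "academy", "xxx"}


def registrable_domain(name, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus one label, or None if unknown."""
    labels = name.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the whole name is itself a suffix, e.g. "co.uk".
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


def classify(name, suffixes=SAMPLE_SUFFIXES):
    """Label a name as 'domain', 'hostname', or 'unknown'."""
    domain = registrable_domain(name, suffixes)
    if domain is None:
        return "unknown"
    return "domain" if name.lower().rstrip(".") == domain else "hostname"


for candidate in ("google.co.uk", "mail.google.co.uk", "xxx.google.com"):
    print(candidate, "->", classify(candidate))
```

With a real suffix list loaded into the set, the same two functions would apply unchanged; only the table is an assumption here.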

Related

How to check if 2 urls in same domain? [duplicate]

How would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
Thanks.

Here's a great python module someone wrote to solve this problem after seeing this question:
https://github.com/john-kurkowski/tldextract
The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.
Quote:
tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains]
and ccTLDs [Country Code Top-Level Domains] look like
by looking up the currently living ones according to the Public Suffix
List. So, given a URL, it knows its subdomain from its domain, and its
domain from its country code.

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's. There's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!).

Using this file of effective TLDs, which someone else found on Mozilla's website:
from urllib.parse import urlparse
# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]
def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]
    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        # i=-3: ["abcde","co","uk"]
        # i=-2: ["co","uk"]
        # i=-1: ["uk"] etc
        candidate = ".".join(last_i_elements)  # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate
        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"
    raise ValueError("Domain not in global list of TLDs")
print(get_domain("http://abcde.co.uk", tlds))
results in:
abcde.co.uk
I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
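On the request for comments, one possible tidy-up is sketched below: keep the rules in a set for constant-time lookups and walk the labels with plain non-negative indices, from the longest candidate to the shortest. RULES is a hypothetical in-memory sample standing in for the loaded file (the "*.ck" and "!www.ck" entries are modeled on the wildcard and exception syntax the code above handles), and the sketch assumes the URL includes a scheme so that urlparse can find the host.

```python
from urllib.parse import urlparse

# Hypothetical in-memory sample standing in for the loaded rule file.
RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def get_domain(url, rules=RULES):
    """Return the registrable domain of url, judged against rules."""
    labels = urlparse(url).hostname.split(".")
    # Walk from the longest candidate to the shortest; the first hit wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                break  # the host is a bare public suffix
            return ".".join(labels[i - 1:])
    raise ValueError("no registrable domain found in %r" % url)


print(get_domain("http://abcde.co.uk"))
```

ValueError seems a reasonable choice for input that cannot be resolved, since the argument has the right type but an unusable value.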

Using python tld
https://pypi.python.org/pypi/tld
Install
pip install tld
Get the TLD name as a string from the given URL
from tld import get_tld
print(get_tld("http://www.google.co.uk"))
# co.uk
or without the protocol
from tld import get_tld
get_tld("www.google.co.uk", fix_protocol=True)
# 'co.uk'
Get the TLD as an object
from tld import get_tld
res = get_tld("http://some.subdomain.google.co.uk", as_object=True)
res
# 'co.uk'
res.subdomain
# 'some.subdomain'
res.domain
# 'google'
res.tld
# 'co.uk'
res.fld
# 'google.co.uk'
res.parsed_url
# SplitResult(
# scheme='http',
# netloc='some.subdomain.google.co.uk',
# path='',
# query='',
# fragment=''
# )
Get the first level domain name as string from the URL given
from tld import get_fld
get_fld("http://www.google.co.uk")
# 'google.co.uk'

There are many, many TLDs. Here's the list:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Here's another list
http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
Here's another list
http://www.iana.org/domains/root/db/

Until get_tld is updated for all the new ones, I pull the TLD from the error message. Sure, it's bad code, but it works.
def get_tld(self):
    # Method of a class that has a content_url attribute; the inner call
    # resolves to the module-level get_tld imported from the tld package.
    try:
        return get_tld(self.content_url)
    except Exception as e:
        re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
        matches = re_domain.findall(str(e))
        if matches:
            return matches[0]
        raise

Here's how I handle it:
import re
import sys
from urllib.parse import urlparse
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url)[1]
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
# exit with an error status if the last two labels don't look like a domain
if not match or not match.group(0):
    sys.exit(2)

In Python I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='www.mybrand', domain='sa', suffix='com'!
So finally, I decided to write this method.
IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

Python print certain lines from terraform output from certain resource

I'm trying to print a certain line from a string held in a variable in Python 3. The variable comes from an os.popen call, as in the following example.
some_url = os.popen("terraform state show 'module.dns.aws_route53_record.main'").read()
To print the output I do something like this:
print(f"{color.DARKCYAN}[SOME_URL]{color.END}, {some_url}")
But the output looks like this:
[SOME_URL], # module.dns.aws_route53_record.main:
resource "aws_route53_record" "main" {
allow_overwrite = true
fqdn = "xxxxxxxxxx-xxxxxxxxxxxx-xxxxxxx-xxxxxx.xxxxxx.xxxxxxxx.xxxxxxx.com"
id = "xxxxxxxxxxxxxxxxxxxxxxxxx_xxxxx-xxxxxxxxxxxxx-xxxxxx-xxxxx.xxxxx.xxxx.xxxxxxx.com_CNAME"
name = "xxxxxxxxx-xxxxxxxxxxxxx-xxxxxxx-xxxxxxx.xxxxxxxxx.xx.xxxxxx.xxxx"
records = [
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxx.xxxxxxxxxx.xxxx.xxxxxxxxxxxx.xxx",
]
ttl = xxxx
type = "xxxx"
zone_id = "xxxxxxxxxxxxxxxxxxxxxxxxxx"
}
Is there a simple way to parse and print just the line with the fqdn after [SOME_URL]?

Is this what you were looking for?
import re
string = '''
resource "aws_route53_record" "main" {
allow_overwrite = true
fqdn = "xxxxxxxxxx-xxxxxxxxxxxx-xxxxxxx-xxxxxx.xxxxxx.xxxxxxxx.xxxxxxx.com"
id = "xxxxxxxxxxxxxxxxxxxxxxxxx_xxxxx-xxxxxxxxxxxxx-xxxxxx-xxxxx.xxxxx.xxxx.xxxxxxx.com_CNAME"
name = "xxxxxxxxx-xxxxxxxxxxxxx-xxxxxxx-xxxxxxx.xxxxxxxxx.xx.xxxxxx.xxxx"
records = [
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxx.xxxxxxxxxx.xxxx.xxxxxxxxxxxx.xxx",
]
ttl = xxxx
type = "xxxx"
zone_id = "xxxxxxxxxxxxxxxxxxxxxxxxxx"
}
'''
# raw strings keep \s and \n as regex escapes instead of string escapes
if m := re.search(r"(?<=\n)\s*fqdn\s.*", string):
    fqdn = re.sub(r"\s+=", " =", m.group().strip())
    print(f"\x1b[1;36m[SOME_URL]\x1b[0m, {fqdn}")
[SOME_URL], fqdn = "xxxxxxxxxx-xxxxxxxxxxxx-xxxxxxx-xxxxxx.xxxxxx.xxxxxxxx.xxxxxxx.com"
The expression should be precise enough.
I can recommend learning regular expressions, especially for cases like this!
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.'
Now they have two problems."
"Regular expressions tend to be easier to write than they are to read. This is less of a problem if you are the only one who ever needs to maintain the program (or sed routine, or shell script, or what have you), but if several people need to watch over it, the syntax can turn into more of a hindrance than an aid."
(Stephen Ramsay, University of Virginia)

Terraform has a very nice way to give you access to the information you need from the state: output values. In your case, you can create an output for the fqdn and read just this one piece of information:
output "main_dns" {
  value = aws_route53_record.main.fqdn
}
And in the Python call, use the terraform output main_dns command:
some_url = os.popen("terraform output main_dns").read()
Now you can use the fqdn as you see fit.
terraform output documentation
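One way to read that output from Python without stray quotes or newlines is to ask for JSON and decode it. This is only a sketch: it assumes the terraform binary is on the PATH and that an output named main_dns exists, and parse_output_value and read_output are names invented for the illustration. The sample string stands in for real terraform output.

```python
import json
import subprocess


def parse_output_value(raw):
    """Decode the JSON text printed by `terraform output -json NAME`."""
    return json.loads(raw)


def read_output(name):
    """Run terraform and return the decoded value of one output."""
    result = subprocess.run(
        ["terraform", "output", "-json", name],
        capture_output=True, text=True, check=True,
    )
    return parse_output_value(result.stdout)


# Sample text standing in for real terraform output (placeholder value).
sample = '"host.example.com"\n'
print(parse_output_value(sample))
```

In a real project the call would be read_output("main_dns"); passing the command as a list avoids shell quoting issues that os.popen can run into.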

Match referer by regex

I'd like to set up a simple notification if a view has a specific base referer.
Let's say I land on http://myapp.com/page/ and I came from http://myapp.com/other/page/1. Here's an example of my pseudocode; basically, if I'm coming from any page/X I want to set up a notification.
I'm thinking it might be something like ^r^myapp.com/other/page/$ but I'm not so familiar with how to use regex with Python.
from django.http import HttpRequest
def someview(request):
    notify = False
    ... # other stuff not important to question
    req = HttpRequest()
    test = req.META['HTTP_REFERER'] like "http://myapp.com/other/page*"
    # where * denotes matching anything past that point and the test returns T/F
    if test:
        notify = True
    return # doesn't matter here
This may be more of a "how do I use regex in this context" question rather than a Django question specifically.

You could go with something like this:
import re
referrer = "http://myapp.com/other/page/aaa"
# escape the dots so they match literally
m = re.match(r"^http://myapp\.com/other/page/(.*)", referrer)
if m:
    print(m.group(1))
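Wrapped up as a helper that returns True or False, the same match might look like the sketch below. The base URL is the one assumed in the question, and came_from_page is a made-up name. The helper also tolerates a missing referer, which is common because browsers do not always send the header.

```python
import re

# Assumed base URL from the question; dots are escaped so they match literally.
PAGE_REFERER = re.compile(r"^http://myapp\.com/other/page/(?P<rest>.*)$")


def came_from_page(referer):
    """Return True when the referer is under /other/page/."""
    return bool(referer) and PAGE_REFERER.match(referer) is not None


print(came_from_page("http://myapp.com/other/page/1"))
print(came_from_page(None))
```

Inside a Django view, the value would typically come from the request passed to the view, for example notify = came_from_page(request.META.get("HTTP_REFERER")), rather than from a freshly constructed HttpRequest.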

IMAPClient and BODY[HEADER.FIELDS (FROM)] field

I'm really starting to get the hang of IMAPClient. The code 'BODY[HEADER.FIELDS (FROM)]' returns
From: First Last <first.last@domain.com>
I'd really just like it to return the email address like this:
first.last@lbox.com
Do I need to pass it to a variable first and trim it down or is there another fetch modifier I can use?
response = server.fetch(messages, ['FLAGS', 'RFC822.SIZE', 'BODY[HEADER.FIELDS (FROM)]'])
for msgid, data in response.iteritems():
    print ' ID %d: %d bytes, From: %s flags=%s' % (msgid,
                                                  data['RFC822.SIZE'],
                                                  data['BODY[HEADER.FIELDS (FROM)]'],
                                                  data['FLAGS'])

No, you can't do that with an IMAP request. If you look at my other post you'll notice something using parseaddr, but here it is again with your example:
>>> from email.utils import parseaddr
>>> a = 'From: First Last <first.last@domain.com>'
>>> parseaddr(a)
('First Last', 'first.last@domain.com')

IMAPLIB doesn't parse much of the protocol for you. It's returning the line from the server as is.
You can and should use the parsers in the email library to help you out.
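As a rough sketch of that advice, the fetched header block can be handed to the email package first, so that the "From:" field name is handled by the header parser rather than by string trimming. The sketch assumes the fetched value arrives as bytes (on Python 2 it would be a str and message_from_string would be the counterpart), and sender_address is a name invented here.

```python
from email import message_from_bytes
from email.utils import parseaddr


def sender_address(header_bytes):
    """Return just the address from a raw 'From: ...' header block."""
    message = message_from_bytes(header_bytes)
    display_name, address = parseaddr(message.get("From", ""))
    return address


# Sample bytes standing in for data['BODY[HEADER.FIELDS (FROM)]'].
raw = b"From: First Last <first.last@domain.com>\r\n\r\n"
print(sender_address(raw))
```

If the header is missing, the function returns an empty string, which is what parseaddr yields when it cannot find an address.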

How to extract a word from text in Python

I have this string "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate." in a log file. What I need to do is look for this message and extract the IP address (1.2.3.4) from the log file.
import os
import shutil
import optparse
import sys
def main():
    file = open("messages", "r")
    log_data = file.read()
    file.close()
    search_str = "is currently trusted in the white list, but it is now using a new trusted certificate."
    index = log_data.find(search_str)
    print(index)
    return
if __name__ == '__main__':
    main()
How do I extract the IP address? Your response is appreciated.

Really simple answer:
msg = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
parts = msg.split(' ', 2)
print(parts[1])
results in:
1.2.3.4
You could also do REs if you wanted, but for something this simple...

There are dozens of possible approaches; the pros and cons depend on the details of your log file. One example, using the re module:
import re
x = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
pattern = r"IP ([0-9.]+) is currently trusted in the white list"
m = re.match(pattern, x)
for ip in m.groups():
    print(ip)
If you want to print out every instance of that string in your log file, you'd do something like this:
import re
pattern = r"(IP [0-9.]+ is currently trusted in the white list, but it is now using a new trusted certificate\.)"
# findall scans the whole text; re.match would only try the very start
for match in re.findall(pattern, log_data):
    print(match)

Use regular expressions.
Code like this:
import re
compiled = re.compile(r"""
    .*?                                # Leading junk
    (?P<ipaddress>\d+\.\d+\.\d+\.\d+)  # IP address
    .*?                                # Trailing junk
    """, re.VERBOSE)
# named text rather than str, to avoid shadowing the built-in
text = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
m = compiled.match(text)
print(m.group("ipaddress"))
And you get this:
>>> import re
>>>
>>> compiled = re.compile(r"""
...     .*?                                # Leading junk
...     (?P<ipaddress>\d+\.\d+\.\d+\.\d+)  # IP address
...     .*?                                # Trailing junk
...     """, re.VERBOSE)
>>> text = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
>>> m = compiled.match(text)
>>> print(m.group("ipaddress"))
1.2.3.4
Also, I learned that there is a dictionary of matches, groupdict():
>>> text = "Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate. Consider removing its likely outdated white list entry."
>>> m = compiled.match(text)
>>> print(m.groupdict())
{'ipaddress': '10.11.6.224'}
Later: fixed that. The initial '.*' was eating your first character match. Changed it to be non-greedy. For consistency (but not necessity), I changed the trailing match, too.

Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format into a regular expression. In your case, you would do the following:
from stringparser import Parser
parser = Parser("IP {} is currently trusted in the white list, but it is now using a new trusted certificate.")
ip = parser(text)
If you have a file with multiple lines, you can replace the last line with:
with open("log.txt", "r") as fp:
    ips = [parser(line) for line in fp]
Good luck.
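Whichever matching approach is used, a pattern such as \d+\.\d+\.\d+\.\d+ also accepts strings like 999.1.1.1. A possible refinement, sketched here with the standard library only, is to collect every candidate from the log and let the ipaddress module reject the invalid ones. The shortened message pattern and the name trusted_ips are assumptions made for the example.

```python
import ipaddress
import re

# Candidate dotted quads after the literal word "IP"; validated below.
IP_CANDIDATE = re.compile(r"\bIP (\d{1,3}(?:\.\d{1,3}){3}) is currently trusted")


def trusted_ips(log_text):
    """Return every valid IPv4 address named in a matching log message."""
    found = []
    for match in IP_CANDIDATE.finditer(log_text):
        try:
            found.append(str(ipaddress.IPv4Address(match.group(1))))
        except ipaddress.AddressValueError:
            continue  # e.g. 999.1.1.1 looks like an IP but is not one
    return found


log = (
    "IP 1.2.3.4 is currently trusted in the white list.\n"
    "unrelated line\n"
    "IP 999.1.1.1 is currently trusted in the white list.\n"
)
print(trusted_ips(log))
```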
