Get just domain name from URL in Python [duplicate]

This question already has answers here:
Extract domain name from URL in Python
(8 answers)
Closed 3 years ago.
I've seen questions similar to this, but none that really get at what I'm looking for. I'm trying to extract the main domain of a server from its URL, but just that, without any subdomains. So if the URL were, for example, "http://forums.example.com/", I want to extract just the "example.com" portion. I've tried splitting at the second-to-last dot, but that breaks down on URLs like "http://forums.example.co.uk/", where it extracts just "co.uk" when I want "example.co.uk". Is there a way I can parse URLs this way without having to find a list of TLDs to compare against?
PS: In case it matters, I will be using this in the context of mail servers, so the URLs will likely look more like "mail.example.co.uk" or "message-ID@user.mail.example.co.uk"
Edit: Okay, so I know that the answer to this question is the same as one of the answers to the "duplicate" question, but I believe the questions themselves are different. In the other question the asker didn't care about subdomains, so the selected answer used urlparse, which doesn't distinguish subdomain from domain. In addition, this question asks about email addresses as well, and urlparse doesn't work on email addresses (it raises an invalid URL exception). So I believe this question is distinct from the other and not a duplicate.

You want to check out tldextract. With it you can do everything you want easily. For example:
>>> import tldextract
>>> extracted_domain = tldextract.extract('forums.example.com')
>>> extracted_domain
ExtractResult(subdomain='forums', domain='example', suffix='com')
Then you can just:
>>> domain = "{}.{}".format(extracted_domain.domain, extracted_domain.suffix)
>>> domain
'example.com'
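Note (an addition, not from the original answer): recent versions of tldextract also expose a registered_domain attribute that does this join for you:
>>> extracted_domain.registered_domain
'example.com'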
It also works with emails:
>>> tldextract.extract('message-ID@user.mail.example.co.uk')
ExtractResult(subdomain='user.mail', domain='example', suffix='co.uk')
Just use pip to install it: pip install tldextract

Related

Is it possible in Python to capture individual parts of a URL with constant structure?

If my question is vague, I apologize; it's a difficult question to put into words. If, for example, I needed parts of this URL:
https://stackoverflow.com/questions/449775/how-can-i-split-a-url-string-up-into-separate-parts-in-python
I needed the question number, and the question title, and let's assume the title is followed by some other changing characters, but still separated by a "/". The base URL, and the word "questions" never change. The data I want changes, but is unique and specific to each question. However all this information is always in the same place in the URL.
Is there a way to parse this URL in python and separate what I need?
The code below will pick apart the URL using str.split() with '/' as a delimiter then assign the portion of interest to variables.
It's not particularly robust, but given your specification that the base URL always has the same format, it's an efficient way to do what you asked:
URL="https://stackoverflow.com/questions/449775/how-can-i-split-a-url-string-up-into-separate-parts-in-python"
protocol, _, server, question, question_number, question_title, *_ = URL.split("/")
print("protocol: ", protocol)
print("server: ", server)
print("question number:", question_number)
print("question title: ", question_title)
Results:
protocol: https:
server: stackoverflow.com
question number: 449775
question title: how-can-i-split-a-url-string-up-into-separate-parts-in-python
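As an aside (not part of the original answer), the standard library's urllib.parse can make this a little more robust, since it isolates the path before you split it:
from urllib.parse import urlparse

URL = "https://stackoverflow.com/questions/449775/how-can-i-split-a-url-string-up-into-separate-parts-in-python"
parsed = urlparse(URL)
# parsed.path is "/questions/449775/how-can-i-split-..."; drop the leading "/" and split
_, question_number, question_title, *_ = parsed.path.lstrip("/").split("/")
print("server:         ", parsed.netloc)
print("question number:", question_number)
print("question title: ", question_title)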
Let's take the link of this question which is
https://stackoverflow.com/questions/74119810/is-it-possible-in-python-to-capture-individual-parts-of-a-url-with-constant-stru
Now, if you look at the pattern: after https:// (which we can ignore), the parts we want are separated by "/". So we can split on that.
In [1]: link = "https://stackoverflow.com/questions/74119810/is-it-possible-in-p
...: ython-to-capture-individual-parts-of-a-url-with-constant-stru"
Let's remove https:// (the first 8 characters) first
In [3]: link[8:]
Out[3]: 'stackoverflow.com/questions/74119810/is-it-possible-in-python-to-capture-individual-parts-of-a-url-with-constant-stru'
Now split it
In [4]: link[8:].split('/')
Out[4]:
['stackoverflow.com',
'questions',
'74119810',
'is-it-possible-in-python-to-capture-individual-parts-of-a-url-with-constant-stru']
Now the question id is at index 2, so
In [5]: link[8:].split('/')[2]
Out[5]: '74119810'
Let's wrap it into a function:
In [6]: def get_qid(link: str):
   ...:     return link[8:].split('/')[2]
And test it on a separate link.
In [7]: get_qid("https://stackoverflow.com/questions/74119795/how-to-create-sess
...: ion-in-graphql-in-fastapi-to-store-token-safely-after-generati")
Out[7]: '74119795'
As far as the question title is concerned, you need to do some web scraping or use some kind of API. Even though you can extract a title from the link, it won't be complete, since the link truncates part of it.
As you can see in this example:
In [10]: ' '.join(link[8:].split('/')[-1].split('-'))
Out[10]: 'is it possible in python to capture individual parts of a url with constant stru'
The last element of the split link is the title; we split it on '-', which represents the spaces, and rejoin it with ' '.join.
The returned title is not complete since it was not encoded completely in the link.
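If you do need the full title, here is a minimal scraping sketch (an assumption on my part, not from the original answer: it uses the third-party requests library and pulls the page's <title> tag, with no error handling):
import re
import requests

def get_full_title(link: str) -> str:
    # Fetch the page and pull the contents of its <title> tag
    html = requests.get(link).text
    match = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
    return match.group(1) if match else ''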

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
import re

text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com"""
The code below catches all URLs in the text and returns them in a list:
urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?). The dot means any character (except the newline character), and the star means zero or more occurrences. The ? after the * makes the match non-greedy; per the docs, *? matches as few characters as possible. The brackets place it all into a group. All of this together basically means it will find everything in between URL= and the first literal dot (which must be escaped as \. in the pattern, since a bare . matches any character).
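For example, applied to the line from the question (the variable name here is just illustrative):
>>> import re
>>> file_line = 'URL=http://example.net'
>>> re.findall(r'URL=(.*?)\.', file_line)
['http://example']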
You don't need regexes (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is only going to extract one URL. Assuming that you're working with a large text from which you want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it (achieving iteration via while or for loops and, depending on how you iterate, keeping track of the position of the last extracted URL, and so on).
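Here is a minimal sketch of that idea (the function name and loop structure are illustrative, not from the original answer):
def extract_urls(text):
    # Collect every substring found between 'URL=' and the next '.'
    urls = []
    position = 0
    while True:
        start = text.find('URL=', position)
        if start == -1:
            break
        start += len('URL=')
        end = text.find('.', start)
        if end == -1:
            break
        urls.append(text[start:end])
        position = end  # continue scanning after the last match
    return urls

print(extract_urls('URL=http://example.net and URL=http://another.org'))
# Prints: ['http://example', 'http://another']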
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that would amaze you. And these are not all; I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

How to access webpage with variables in python [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 4 years ago.
I have a project and want to access a URL from Python. If I have variable1=1 and variable2=2, I want the output to be like this:
www.example.com/data.php?variable1=1&variable2=2
How do I achieve this? Thanks!
Check this out:
try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

params = ['variable1=1', 'variable2=2']
url = 'http://www.example.com/data.php?' + '&'.join(params)
response = urlopen(url)
html = response.read()
print(html)
The first four lines import some code we can use to make an HTTP request.
Then we create a list of query parameters named params.
Next we join those parameters with '&' and append the result to the base URL.
Finally we fetch the HTML at that address and print it to the terminal.
You can use a format operation on strings in Python.
For example:
variable1 = 1
variable2 = 2
url = 'www.example.com/data.php?variable1={}&variable2={}'.format(variable1, variable2)
Or, if you want to request the URL, you can build a dict of the parameters and pass it to requests like this:
import requests
url = 'http://www.example.com/data.php'
data = {'variable1': '1', 'variable2': '2'}
r = requests.get(url, params=data)
and it will make the request to this URL:
http://www.example.com/data.php?variable1=1&variable2=2
Try string formatting...
url = 'www.example.com/data.php?variable1={}&variable2={}'.format(variable1, variable2)
This means the 2 {} will be replaced with whatever you pass in .format(), which in this case is just the variables' values
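One caveat worth adding (not from the original answers): plain string formatting does not URL-encode the values. The standard library's urllib.parse.urlencode handles that, and builds the same query string:
from urllib.parse import urlencode

params = {'variable1': 1, 'variable2': 2}
url = 'http://www.example.com/data.php?' + urlencode(params)
print(url)  # http://www.example.com/data.php?variable1=1&variable2=2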

Django or python manipulate email addresses and reason about domains

I want to be able to parse email addresses to isolate the domain part, and test if an email address is part of a given domain.
The email module doesn't, as far as I can tell, do that. Is there anything worth using to do this other than the usual string handling and regex routines?
Note: I know how to deal with python strings. I don't need basic recipes, although awesome recipes are welcome.
The problem here is essentially that email addresses have the format (schematically) userpart@sub\.domain\.[sld]+\.tld.
Stripping the part before the @ is easy; the hard part is parsing the domain to work out which parts are subdomains on a larger organisation's domain, rather than generic second-level (or, I guess, even higher-order) public domains.
Imagine parsing user@mail.organisation.co.uk to find that the organisation's domain name is organisation.co.uk, and so be able to match both mail.organisation.co.uk and finance.organisation.co.uk as subdomains of organisation.co.uk.
There are basically two possible (non-dns-based) approaches: build a finite automaton that knows about all generic slds and their relation to the tld (including popular 'fake' slds like uk.com), or try to guess, based on the knowledge that there must be a tld, and assuming that if there are three (or more) elements, the second-level domain is generic if it has fewer than three/four characters. The relative drawbacks of each approach should be obvious.
The alternative is to look through DNS entries to work out what is a registered domain, which has its own drawbacks.
In any case, I would rather piggyback on the work of others.
As per @dm03514's comment, there is a python library that does exactly this: tldextract:
>>> import tldextract
>>> tldextract.extract('foo@bar.baz.org.uk')
ExtractResult(subdomain='bar', domain='baz', tld='org.uk')
(Newer versions of tldextract name the last field suffix rather than tld.)
With this simple script, we replace @ with @. so that the domain part is fully delimited and endswith won't match a domain that merely ends with the same text:
def address_in_domain(address, domain):
    return address.replace('@', '@.').endswith('.' + domain)

if __name__ == '__main__':
    addresses = [
        'user1@domain.com',
        'user1@anotherdomain.com',
        'user2@org.domain.com',
    ]
    print([address for address in addresses if address_in_domain(address, 'domain.com')])
    # Prints: ['user1@domain.com', 'user2@org.domain.com']
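Combining the two answers, here is a sketch (the function name is mine, not from the answers) that uses tldextract's result to test whether an address belongs to an organisation's domain regardless of subdomains:
import tldextract

def same_organisation(address, domain):
    # Compare the registered domain of the address's host part with `domain`
    host = address.split('@', 1)[1]
    extracted = tldextract.extract(host)
    return '.'.join([extracted.domain, extracted.suffix]) == domain

print(same_organisation('user@mail.organisation.co.uk', 'organisation.co.uk'))     # True
print(same_organisation('user@finance.organisation.co.uk', 'organisation.co.uk'))  # True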

MX Record lookup and check

I need to create a tool that will check a domain's live MX records against what should be expected (we have had issues with some of our staff fiddling with them and causing all incoming mail to be redirected into the void).
Now I won't lie, I'm not a competent programmer in the slightest! I'm about 40 pages into "dive into python" and can read and understand the most basic code. But I'm willing to learn rather than just being given an answer.
So would anyone be able to suggest which language I should be using?
I was thinking of using Python and starting with something along the lines of using os.system() to run dig +nocmd domain.com mx +noall +answer to pull up the records, but then I get a bit confused about how to compare the result to an existing set of records.
Sorry if that all sounds like nonsense!
Thanks
Chris
With the dnspython module (not built-in; you must pip install it):
>>> import dns.resolver
>>> domain = 'hotmail.com'
>>> for x in dns.resolver.resolve(domain, 'MX'):
...     print(x.to_text())
...
5 mx3.hotmail.com.
5 mx4.hotmail.com.
5 mx1.hotmail.com.
5 mx2.hotmail.com.
Take a look at dnspython, a module that should do the lookups for you just fine without needing to resort to system calls.
The above solutions are correct; some things I would like to add and update:
dnspython has been updated to work with Python 3 and has superseded the dnspython3 library, so using dnspython is recommended.
resolver.resolve will strictly take in the domain and nothing else.
For example: dnspython.org is a valid domain, not www.dnspython.org.
Here's a function if you want to get the mail servers for a domain:
from dns import resolver

def get_mx_server(domain: str = "dnspython.org") -> str:
    mail_servers = resolver.resolve(domain, 'MX')
    mail_servers = list(set(data.exchange.to_text() for data in mail_servers))
    return ",".join(mail_servers)
