Get domain from string? - Python

I need help. How do I get the domain from a string?
For example: "Hi im Natsume, check out my site http://www.mysite.com/"
How do I get just mysite.com?
Output example:
http://www.mysite.com/ (if http entered)
www.mysite.com (if http not entered)
mysite.com (if both http and www not entered)

>>> import re
>>> myString = "Hi im Natsume, check out my site http://www.mysite.com/"
>>> a = re.search(r"(?P<url>https?://[^\s]+)", myString) or re.search(r"(?P<url>www[^\s]+)", myString)
>>> a.group("url")
'http://www.mysite.com/'
>>> myString = "Hi im Natsume, check out my site www.mysite.com/"
>>> a = re.search(r"(?P<url>https?://[^\s]+)", myString) or re.search(r"(?P<url>www[^\s]+)", myString)
>>> a.group("url")
'www.mysite.com/'

Well ... You need some way to define what you consider to be something that has a "domain". One approach might be to look up a regular expression for URL-matching, and apply that to the string. If that succeeds, you at least know that the string holds a URL, and can continue to interpret the URL in order to look for a host name, from which you can then extract the domain (possibly).
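A minimal sketch of that two-step idea, assuming the text contains an http:// or https:// URL: find the URL with a simple regex, then let urllib.parse pull out the host name. The function name extract_host and the choice to strip a leading "www." are my own assumptions, not something from the question.

```python
import re
from urllib.parse import urlsplit

def extract_host(text):
    """Return the host of the first http(s) URL in text, without a leading 'www.'."""
    # Step 1: locate something that looks like a URL (scheme plus non-space characters).
    match = re.search(r"https?://\S+", text)
    if match is None:
        return None
    # Step 2: interpret the URL and read its host name.
    host = urlsplit(match.group(0)).hostname
    # Step 3 (assumption): treat "www." as a prefix to drop.
    if host and host.startswith("www."):
        host = host[len("www."):]
    return host

print(extract_host("Hi im Natsume, check out my site http://www.mysite.com/"))
```

Note that this returns the host without "www.", which is not always the registered domain (for example with subdomains or suffixes like .co.uk).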

s= "Hi im Natsume, check out my site http://www.mysite.com/"
start=s.find("http://") if s.find("http://")!=-1 else s.find("https://")+1
t = s[start+11:s.find(" ",start+11)]
print(t)
output:
mysite.com

If you want to use a regular expression, one way could be:
>>> import re
>>> s = "Hi im Natsume, check out my site http://www.mysite.com/"
>>> re.findall(r'http://www\.([a-zA-Z0-9._-]*)/', s)
['mysite.com']
This assumes the URL ends with '/'.

If all the sites had the same format, you could use a regexp like this (which works in this specific case):
re.findall(r'http://www\.(\w+)\.com', url)
However, you would need a more complex regexp to parse arbitrary URLs and extract the domain name.
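As a rough sketch of what a somewhat more general pattern could look like, the following makes both the scheme and the "www." prefix optional, which covers the three cases from the question. The names DOMAIN_RE and find_domain are made up for this example, and the pattern is an assumption about what counts as a host name (ASCII labels separated by dots, ending in a 2+ letter suffix); it will also match any other dotted word in the text.

```python
import re

# Optional scheme, optional "www.", then a dotted host name (captured).
DOMAIN_RE = re.compile(
    r"(?:https?://)?(?:www\.)?((?:[a-z0-9-]+\.)+[a-z]{2,})",
    re.IGNORECASE,
)

def find_domain(text):
    """Return the first host-like token in text, without scheme or 'www.'."""
    match = DOMAIN_RE.search(text)
    return match.group(1) if match else None

print(find_domain("Hi im Natsume, check out my site http://www.mysite.com/"))
print(find_domain("Hi im Natsume, check out my site www.mysite.com/"))
print(find_domain("Hi im Natsume, check out my site mysite.com"))
```

It captures the whole host after "www.", so subdomains are kept; it does not know about public suffixes.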

The best way is to use a regex to extract the URL, then use tldextract to get a valid domain name from the URL.
import re
import tldextract
text = "Hi im Natsume, check out my site http://www.example.com/"
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
found_url = urls[0]
info = tldextract.extract(found_url)
domain_name = info.domain
suffix_name = info.suffix
final_domain_name = domain_name+"."+suffix_name
print(final_domain_name)

How about this?
url='https://www.google.com/'
var=url.split('//www.')[1]
domain=var[0:var.index('/')]
print(domain)

Related

Matching partially string on url string in python3

I am trying to match a string in Python with re.search(), without luck.
This is my code:
import re
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
for url in urls:
    c_url = re.compile(url)
    result = re.search(c_url, request_path)
    if isinstance(result, re.Match):
        allowed_url = url
        break
print(allowed_url) # must be /colpos/papanicolaou2
What I want to happen: if url is in request_path (in this case partially), I expect result to be a re.Match instance, not None.
How can I achieve this? Is there a better way to know if my request_path is in urls?
The code above only works if url and request_path are exactly the same, which I don't want. How should I use re.search() in Python to achieve this?
Thank you.
I tried checking it with the "in" keyword instead of using re module. I think it is simpler and more readable.
request_path = '/colpos/papanicolaou2/124579/1254'
urls = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']
allowed_urls = []
for url in urls:
    if url in request_path:
        allowed_urls.append(url)
print(allowed_urls) # this contains '/colpos/papanicolaou2' like you wanted
In case you just have 2 fixed (real) parts in your request_path, you could do the following (no loops, no regex - just Python):
/colpos/papanicolaou2/124579/1254
/part_1/part_2 /param1/param2/...
Code:
urls = ['colpos/prescription', 'colpos/transfer', 'colpos/papanicolaou2', 'colpos/biopsia']
request_path = "/colpos/papanicolaou2/124579/1254"
p1, p2, params = request_path[1:].split('/', 2)
if '/'.join([p1, p2]).lower() not in urls:
    #raise Error(404)
    print("url not found")
Note: You would need to make it more stable for production usage :)
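One possible way to make the check a bit stricter, sketched under the assumption that an allowed URL should only match at the start of the path and on a whole path segment (so '/colpos/papanicolaou2' does not match '/colpos/papanicolaou22/...'). The helper name find_allowed_prefix is hypothetical.

```python
URLS = ['/colpos/prescription', '/colpos/transfer', '/colpos/papanicolaou2', '/colpos/biopsia']

def find_allowed_prefix(request_path, allowed_prefixes):
    """Return the first allowed prefix that request_path starts with, or None."""
    for prefix in allowed_prefixes:
        # Match either the exact path or the prefix followed by a '/' boundary.
        if request_path == prefix or request_path.startswith(prefix.rstrip('/') + '/'):
            return prefix
    return None

print(find_allowed_prefix('/colpos/papanicolaou2/124579/1254', URLS))
```

Compared with the "in" check, this does not match when the allowed URL only appears somewhere in the middle of the path.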

How to remove query string from a url?

I have the following URLs:
https://stackoverflow.com/questions/7990301?aaa=aaa
https://stackoverflow.com/questions/7990300?fr=aladdin
https://stackoverflow.com/questions/22375#6
https://stackoverflow.com/questions/22375?
https://stackoverflow.com/questions/22375#3_1
I need URLs for example:
https://stackoverflow.com/questions/7990301
https://stackoverflow.com/questions/7990300
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
https://stackoverflow.com/questions/22375
My attempt:
url='https://stackoverflow.com/questions/7990301?aaa=aaa'
if '?' in url:
    url=url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]
I think this is a stupid way
The very helpful library furl makes it trivial to remove both query and fragment parts:
>>> import furl
>>> furl.furl("https://hi.com/?abc=def#ghi").remove(args=True, fragment=True).url
'https://hi.com/'
You can split on something that doesn't exist in the string, you'll just get a list of one element, so depending on your goal, you could do something like this to simplify your existing code:
url = url.split('?')[0].split('#')[0]
Not saying this is the best way (furl is a great solution), but it is a way.
In your example you're also removing the fragment (the thing after a #), not just the query.
You can remove both by using urllib.parse.urlsplit, then calling ._replace on the namedtuple it returns and converting back to a string URL with urllib.parse.urlunsplit:
from urllib.parse import urlsplit, urlunsplit
def remove_query_params_and_fragment(url):
    return urlunsplit(urlsplit(url)._replace(query="", fragment=""))
Output:
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990301?aaa=aaa")
'https://stackoverflow.com/questions/7990301'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/7990300?fr=aladdin")
'https://stackoverflow.com/questions/7990300'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#6")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375?")
'https://stackoverflow.com/questions/22375'
>>> remove_query_params_and_fragment("https://stackoverflow.com/questions/22375#3_1")
'https://stackoverflow.com/questions/22375'
You could try
urls = ["https://stackoverflow.com/questions/7990301?aaa=aaa",
        "https://stackoverflow.com/questions/7990300?fr=aladdin",
        "https://stackoverflow.com/questions/22375#6",
        "https://stackoverflow.com/questions/22375?",
        "https://stackoverflow.com/questions/22375#3_1"]
urls_without_query = [url.split('?')[0] for url in urls]
For example, "https://stackoverflow.com/questions/7990301?aaa=aaa".split('?') returns a list that looks like ["https://stackoverflow.com/questions/7990301", "aaa=aaa"], and if that string is url, url.split('?')[0] would give you "https://stackoverflow.com/questions/7990301".
Edit: I didn't think about # arguments. The other answers might help you more :)
You can use w3lib
from w3lib import url as w3_url
url_without_query = w3_url.url_query_cleaner(url)
Here is an answer using standard libraries, and which parses the URL properly:
from urllib.parse import urlparse
url = 'http://www.example.com/this/category?one=two'
parsed = urlparse(url)
print("".join([parsed.scheme,"://",parsed.netloc,parsed.path]))
expected output:
http://www.example.com/this/category
Note: this also strips params and the fragment, but is easy to modify to include those if you want.
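If you do want to keep the params part, one way to sketch it is to blank out only the query and fragment on the parsed result and reassemble it with urlunparse. The function name strip_query_and_fragment and the ";v=1" params in the sample URL are just illustrative assumptions.

```python
from urllib.parse import urlparse, urlunparse

def strip_query_and_fragment(url):
    """Drop the query and fragment but keep scheme, host, path and params."""
    parts = urlparse(url)
    # ParseResult is a namedtuple, so _replace returns a modified copy.
    return urlunparse(parts._replace(query="", fragment=""))

print(strip_query_and_fragment("http://www.example.com/this/category;v=1?one=two#top"))
```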

Using regular expressions to find URL not containing certain info

I'm working on a scraper/web crawler using Python 3.5 and the re module where one of its functions requires retrieving a YouTube channel's URL. I'm using the following portion of code that includes the matching of regular expression to accomplish this:
href = re.compile("(/user/|/channel/)(.+)")
What it should return is something like /user/username or /channel/channelname. It does this successfully for the most part, but every now and then it grabs a type of URL that includes more information like /user/username/videos?view=60 or something else that goes on after the username/ portion.
In an attempt to address this issue, I rewrote the bit of code above as
href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")
along with other variations with no success. How can I rewrite my code so that it fetches URLS that do not include videos?view=60 anywhere in the URL?
Use the following approach with a specific regex pattern:
import re
user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'
pattern = re.compile(r'(/user/|/channel/)([^/]+)')
m = re.match(pattern, user_url)
print(m.group()) # /user/username
m = re.match(pattern, channel_url)
print(m.group()) # /channel/channelname
I used this approach, and it seems to do what you want.
import re
user = '/user/username/videos?view=60'
channel = '/channel/channelname/videos?view=60'
pattern = re.compile(r"(/user/|/channel/)[\w]+/")
user_match = re.search(pattern, user)
if user_match:
    print(user_match.group())
else:
    print("Invalid pattern")
pattern_match = re.search(pattern, channel)
if pattern_match:
    print(pattern_match.group())
else:
    print("Invalid pattern")
Hope this helps!
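For a scraper that sees many hrefs, a small helper along these lines might be convenient. It is only a sketch: it assumes the channel name is everything up to the next '/', '?' or '#', and the names CHANNEL_RE and channel_path are made up here.

```python
import re

# "/user/" or "/channel/" followed by one path segment (no '/', '?' or '#').
CHANNEL_RE = re.compile(r"/(?:user|channel)/[^/?#]+")

def channel_path(href):
    """Return '/user/<name>' or '/channel/<name>' from href, or None if absent."""
    match = CHANNEL_RE.search(href)
    return match.group(0) if match else None

hrefs = ['/user/username/videos?view=60', '/channel/channelname', '/watch?v=abc']
print([channel_path(h) for h in hrefs])
```

Instead of rejecting URLs that contain videos?view=60, this trims them down to the base channel path.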

Python strip Google Alerts URL

I've currently got a dataframe filled with Google Alert URLS that look like:
link = 'https://www.google.com/url?rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q'
and I just want the part following url= and before the junk.
http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/
I used urllib.parse.urlparse(link) to get a list of URL elements...
parsed = ParseResult(scheme='https', netloc='www.google.com', path='/url', params='', query='rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q', fragment='')
but even then parsed[4] only breaks it down to...
'rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q'
I found other queries on Stack with this same question but they were in other programming languages than Python.
Any ideas on a Python approach?
You may use a regex on parsed[4] to extract that URL:
(?:^|&)url=([^&]+)
See the regex demo
Details:
(?:^|&) - either start of string or &
url= - literal text url=
([^&]+) - Group 1 capturing one or more symbols other than &.
Python demo:
import re
p = re.compile(r'(?:^|&)url=([^&]+)')
s = "rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q"
mObj = p.search(s)
if mObj:
    print(mObj.group(1))
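An alternative without a regex, as a sketch using only the standard library: urllib.parse.parse_qs splits the query string into a dict of lists, so the url parameter can be looked up by name. The helper name alert_target is an assumption, and the sample link below is a shortened version of the one in the question.

```python
from urllib.parse import urlsplit, parse_qs

def alert_target(link):
    """Return the value of the 'url' query parameter of link, or None if missing."""
    query = urlsplit(link).query
    values = parse_qs(query).get("url")
    return values[0] if values else None

link = ("https://www.google.com/url?rct=j&sa=t"
        "&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga")
print(alert_target(link))
```

Like the regex, this stops at the next '&', so a target URL that itself contains an unescaped '&' would be cut short.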

How to match 0 or 1 time character at the end of line?

I am trying to normalize a URL, to extract the content after :// and before the last / at the end of line if it exists.
Here is my script:
url = "https://example.com/25194425/"
matchUrl = re.findall(r'://(.*)/?$', url)
print matchUrl
What I want is example.com/25194425, but I get example.com/25194425/. How to deal with the last /?
Why doesn't /? work?
An alternative way to do it without using regex is using urlparse
>>> from urlparse import urlparse
>>> url = 'https://example.com/25194425/'
>>> '{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'example.com/25194425'
Later on, if you want to include the protocol, port, params, ... in the normalized URL, this is easier to extend than updating the regex:
>>> '{url.scheme}://{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'https://example.com/25194425'
As one of the commenters said, you just need to make the quantifier non-greedy:
://(.*?)/?$
However, the result of findall() is a list, not a string. In this case it's a list with only one entry, but it's still a list. To get the actual string, you need to provide the index:
url = "https://example.com/25194425/"
match = re.findall(r'://(.*?)/?$', url)
print match[0]
But that seems like an inappropriate use of findall() to me. I would have gone with search():
url = "https://example.com/25194425/"
match = re.search(r'://(.*?)/?$', url)
if match:
    print match.group(1)
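To see the difference between the greedy and the non-greedy quantifier side by side, here is a small comparison (written so that it runs on Python 3; the variable names are my own):

```python
import re

url = "https://example.com/25194425/"

# Greedy: (.*) consumes the trailing slash too, so /? matches nothing.
greedy = re.search(r"://(.*)/?$", url).group(1)

# Non-greedy: (.*?) stops as early as possible, leaving the slash for /?.
lazy = re.search(r"://(.*?)/?$", url).group(1)

# The non-greedy pattern also works when there is no trailing slash.
no_slash = re.search(r"://(.*?)/?$", "https://example.com/25194425").group(1)

print(greedy)
print(lazy)
print(no_slash)
```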
This happens because, by default, the quantifier is greedy and matches as many characters as possible. So '(.*)' will match through to the last slash.
You can use this instead:
matchUrl = re.findall(r'://(.*)/[^/]?$', url)
EDIT: Please try the following pattern (Python 2.7.x):
import re
url1 = 'https://example.com/25194425?/'
url2 = 'https://example.com/25194425?'
print re.findall('https?://([\S]+)(?<!/)[/]?', url1)
print re.findall('https?://([\S]+)(?<!/)[/]?', url2)
Output:
['example.com/25194425?']
['example.com/25194425?']
Thanks @Alan Moore for pointing out the word boundary issue. Now it should work for both scenarios.
