Python Regex to Extract Domain from Text - python

I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string with, let's say,
"this is www.website1.com and this is website2.com", I get:
['www.website1.com']
['website2.com']
How can I modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...

Try this one (thanks @SunDeep for the update):
\s(?:www\.)?(\w+\.com)
Explanation
\s matches any whitespace character
(?:www\.)? non-capturing group, matches a literal www. zero or one time
(\w+\.com) capturing group, matches one or more word characters followed by a literal .com
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
A couple notes about this. First of all, matching all valid domain names is very difficult to do, so while I chose to use \w+ to capture for this example, I could have chosen something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.
This answer has a lot of helpful info about matching domains:
What is a regular expression which will match a valid domain name without a subdomain?
Next, I only look for .com domains. You could adjust my regular expression to something like:
\s(?:www\.)?(\w+\.(?:com|org|net))
to match whichever types of domains you are looking for.
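As a rough sketch of the adjusted expression with the dots escaped, the snippet below swaps the leading \s for a word boundary \b so that a domain at the very start of the string is also found. The name extract_domains and the three-entry TLD list are illustrative assumptions, not part of the answer above.

```python
import re

# Hypothetical helper; the TLD list is an illustrative assumption.
DOMAIN_RE = re.compile(r'\b(?:www\.)?([a-zA-Z0-9-]+\.(?:com|org|net))\b')

def extract_domains(text):
    """Return the domains found in text, without a leading 'www.'."""
    return DOMAIN_RE.findall(text)

print(extract_domains('www.website1.com and this is website2.com'))
```

Because the TLD alternation is a non-capturing group, re.findall returns plain strings rather than tuples.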

Here is a try:
import re
s = "www.website1.com"
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
Output:
website1.com
If s = "website1.com", the output is the same:
website1.com
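If the string holds nothing but the host name, as in this example, a plain prefix check may be enough and avoids the regex entirely. A minimal sketch, with strip_www as a hypothetical helper name:

```python
def strip_www(host):
    """Drop one leading 'www.' from host, if present."""
    prefix = 'www.'
    return host[len(prefix):] if host.startswith(prefix) else host

print(strip_www('www.website1.com'))
print(strip_www('website1.com'))
```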

Related

See which component in regex alternation was captured

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.
To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/@blhsing/JointBruisedMention
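A variation on the same idea, sketched here as an assumption rather than as part of the answer above, is to give each sub-pattern a named group and read the Match object's lastgroup attribute instead of doing index arithmetic. The labels first, second and third are hypothetical.

```python
import re

# Hypothetical labels mapped to the sub-patterns from the question.
alternatives = {
    'first': r'abc.*def',
    'second': r'mno.*pqr',
    'third': r'mno.*pqrt',
}
combined = '|'.join(f'(?P<{name}>{pat})' for name, pat in alternatives.items())

match = re.search(combined, 'mno_foobar_pqrt')
label = match.lastgroup  # name of the group that took part in the match
print(label, alternatives[label])
```

This keeps the lookup readable when the list of alternatives is reordered, since the label travels with the pattern.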
You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))', string)
groups = match.groups()
# groups[0] is the outer group; groups[1:] line up with patterns
for i in range(1, len(groups)):
    if groups[i] is not None:
        pattern = patterns[i - 1]
        break
print(pattern)
Result: abc.*def
Expressions inside round brackets are capture groups. match.group(0) is the whole match, and match.groups() returns groups 1 through the last, so here groups[0] is the outer group that wraps the whole alternation.
You then need to find the index of the inner group that matched. Note that the patterns have to be defined beforehand.
Well, you could iterate over the terms in the regex alternation:
import re
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
terms = pattern[1:-1].split("|")
for term in terms:
    if re.search(term, string):
        print("MATCH => " + term)
This prints:
MATCH => abc.*def
The right answer to the question "How should I write the regex statement?" should actually be:
There is no known way, using the provided regex pattern as is, to extract from the search result which of the alternatives triggered the match.
Because the given pattern cannot provide this, the pattern has to be changed so that the requested information can be extracted from the match.
A possible way around this regex engine limitation is proposed below, but it requires an additional regex search and can fail for some special search pattern alternatives.
The code below allows simpler regex patterns without defined groups and works the "other way around": it checks which of the alternative patterns matches the string found by the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It can fail when the matched string, taken on its own, is no longer a match for the single alternative that produced it. This can happen, for example, if the search pattern uses a non-word-boundary \B requirement (as mentioned in the comments by Kelly Bundy).
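To make that failure mode concrete, here is a small made-up case (the pattern and text are assumptions chosen for illustration) in which an alternative ending in \B matches inside the text but no longer matches the extracted string:

```python
import re

pattern = r'foo\B|bar'
text = 'foobar'

# Inside 'foobar', 'foo' is followed by 'b', so \B holds and the first alternative matches.
found = re.match(pattern, text)[0]

# On the extracted string alone, 'foo' sits at the end of the string, so \B fails.
hits = [p for p in pattern.split('|') if re.match(p, found)]
print(found, hits)
```

With no alternative left in hits, the next(...) call from the code above would raise StopIteration.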
An alternative that does not fail is to perform the search with a modified regex pattern. Below is an approach that uses a dictionary to define the alternatives and a function that returns the matched group:
import re
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
# ^-- the dictionary index is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
    pattern = '|'.join(dct_alts.values())
    re_match = re.match(pattern, text)
    return dct_alts[re_match.lastindex]
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness, here is a function that returns a list of all the alternatives that match (not only the first one):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
    matches = []
    for pattern in lst_alts:
        re_match = re.match(pattern, text)
        if re_match:
            matches.append(pattern)
    return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']

regular expression to extract part of email address

I am trying to use a regular expression to extract the part of an email address between the "@" sign and the "." character. This is how I am currently doing it, but I can't get the right results.
company = re.findall('^From:.+@(.*).',line)
Gives me:
['@iupui.edu']
I want to get rid of the .edu
To match a literal . in your regex, you need to use \., so your code should look like this:
company = re.findall(r'^From:.+@(.*)\.', line)
# the final . was unescaped before, so it matched any character
Note that this will always match up to the last occurrence of . in your string, because (.*) is greedy. If you want to stop at the first occurrence, you need to exclude any . from your capturing group:
company = re.findall(r'^From:.+@([^.]*)\.', line)
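The difference between the greedy group and the negated character class only shows up when the host part contains more than one dot. A small sketch, assuming a standard address with an @ sign and a made-up sample line:

```python
import re

# Hypothetical sample line with a dot inside the host part.
line = 'From: someone@mail.example.edu'

# Greedy group: runs up to the last dot.
greedy = re.findall(r'^From:.+@(.*)\.', line)
# Negated class: stops at the first dot after the @.
first_label = re.findall(r'^From:.+@([^.]*)\.', line)

print(greedy)
print(first_label)
```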
You can try this:
(?<=@)(.*?)(?=\.)
A simple example would be:
>>> import re
>>> re.findall(r".*(?<=@)(.*?)(?=\.)", "From: atc@moo.com")
['moo']
>>> re.findall(r".*(?<=@)(.*?)(?=\.)", "From: atc@moo-hihihi.com")
['moo-hihihi']
The leading .* is greedy, so this captures the hostname after the last @ regardless of how the line begins.
You could just split and find:
s = " abc.def@ghi.mn I"
s = s.split("@", 1)[-1]
print(s[:s.find(".")])
Or just split, if the string is not always going to contain a match:
s = s.split("@", 1)[-1].split(".", 1)[0]
If it always does, find will be the fastest:
i = s.find("@")
s = s[i+1:s.find(".", i)]
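Another non-regex option, offered only as a sketch, is str.partition, which makes the "no separator found" case explicit. The function name host_label is hypothetical, and the example assumes a standard address with an @ sign.

```python
def host_label(address):
    """Return the text between the first '@' and the next '.' (or the end of
    the string if there is no '.'), or '' when there is no '@' at all."""
    _, at, rest = address.partition('@')
    if not at:
        return ''
    return rest.partition('.')[0]

print(host_label('abc.def@ghi.mn'))
print(repr(host_label('no-at-sign')))
```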

How can I make this regex simpler?

I have this regex:
"\w{4}[A-D]{1}[a-d]*\s*"
How can I repeat the part [A-D]{1}[a-d]*\s* several times with something like *?
So if I have the expression:
"Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab"
the regex will give me:
"Bed0Dabc Babc Cabb"
"99rrAbaaaa Daa"
Your regex is invalid and lacks "\" at the start; your desired output is also invalid, and the second string should be "99rrAbaaaa Daa".
I believe what you mean is groups. This is a pretty basic concept, though, so you should probably read more about regular expressions before using them.
The desired regex:
\w{4}([A-D][a-d]*\s*)+
You should add the \s to the set of characters.
import re
data = 'Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab'
pattern = r'\w{4}(?:[A-D][a-d\s]*)+'
matches = re.findall(pattern, data)
The result:
['Bed0Dabc Babc Cabb', '99rrAbaaaa Daa']
The ?: at the start of the group defines a non-capturing group. If you omit it, re.findall returns only the last repetition of the group, and your result will look like this:
['Cabb', 'Daa']
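If the capturing group has to stay, for instance because the pattern is shared with other code, one way to still get the full matches is re.finditer together with group(0). A minimal sketch using the capturing pattern from the first answer:

```python
import re

data = 'Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab'
pattern = r'\w{4}([A-D][a-d]*\s*)+'  # capturing group kept on purpose

# group(0) is always the whole match, whatever groups the pattern defines.
full_matches = [m.group(0) for m in re.finditer(pattern, data)]
print(full_matches)
```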

Regex help to match groups

I am trying to write a regex for matching a text file that has multiple lines such as :
* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469
My intention is to match the "0050.56aa.3480 " and "Veth1379" and put them in group(1) & group(2) for using later on.
The regex I wrote is :
\*\s*\d{1,}\s*(\d{1,}\.(?:[a-z][a-z]*[0-9]+[a-z0-9]*)\.\d{1,})\s*(?:[a-z][a-z]+)\s*\d{1,}\s*.\s*.\s*((?:[a-z][a-z]*[0-9]+[a-z0-9]*))
But it does not seem to be working when I test at:
http://www.pythonregex.com/
Could someone point out any obvious error I am making here?
Thanks,
~Newbie
Try this:
^\* [0-9]{3} +([0-9]{4}\.[0-9a-z]{4}\.[0-9a-z]{4}).*(Veth[0-9]{4})$
The first part is in capture group one, the "Veth" code in capture group two.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a list of online testers in the bottom section.
I don't think you need a regex for this:
with open('myfile') as f:
    for line in f:
        fields = line.split()
        # fields[0] is the leading "*", so the address is fields[2] and the interface is fields[7]
        print("\n" + fields[2] + "\n" + fields[7])
A very strict version would look something like this:
^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$
Here I assume that:
the columns will be pretty much the same,
your first match group contains a group of decimal digits and two groups of lower-case hex digits,
and the last word can be anything.
A few notes:
\d+ is equivalent to \d{1,} or [0-9]{1,}, but reads better (imo)
use \. to match a literal ., as . would simply match anything
[a-z]{2} is equivalent to [a-z][a-z], but reads better (my opinion, again)
however, you might want to use \w instead to match a word character
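As a quick check of the strict pattern above, the sketch below runs it over the two sample rows from the question. It assumes the columns are separated by single spaces, as they appear here, and uses re.MULTILINE so that ^ and $ anchor at each line; the names row_re, mac and port are illustrative.

```python
import re

# The two sample rows from the question, with single spaces between columns.
text = (
    "* 964 0050.56aa.3480 dynamic 200 F F Veth1379\n"
    "* 930 0025.b52a.dd7e static 0 F F Veth1469\n"
)
row_re = re.compile(
    r'^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$',
    re.MULTILINE,
)

rows = row_re.findall(text)  # one (group 1, group 2) tuple per matching line
for mac, port in rows:
    print(mac, port)
```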
This will do it:
import re
# subject is assumed to hold the text of the file
reobj = re.compile(r"^.*?([\w]{4}\.[\w]{4}\.[\w]{4}).*?([\w]+)$", re.IGNORECASE | re.MULTILINE)
match = reobj.search(subject)
if match:
    group1 = match.group(1)
    group2 = match.group(2)
else:
    group1 = group2 = ""

Regex to match space and a string until a forward slash

I have two django urls,
(r'^groups/(?P<group>[\w|\W\-\.]{1,60})$')
(r'^groups/(?P<group>[\w|\W\-\.]{1,60})/users$')
The character class [\w|\W\-\.] in the URLs matches any character, including /, so the first pattern matches both soccer players and soccer players/users. Can someone help me get a regex that matches anything between groups and /? I want the regex to match anything after groups/ until it encounters a /.
You simply need to do the following, which will match anything up to a slash:
regexp = re.compile(r'^groups/(?P<group>[^/]+)$')
For the case where you need to match URLs like your example with a trailing /users, you simply add this to the expression:
regexp = re.compile(r'^groups/(?P<group>[^/]+)/users$')
If you needed to get a user id, for example, you could also use the same matching:
regexp = re.compile(r'^groups/(?P<group>[^/]+)/users/(?P<user>[^/]+)$')
Then you can get the result:
match = regexp.match(url) # "groups/soccer players/users/123"
if match:
    group = match.group("group") # "soccer players"
    user = match.group("user") # "123"
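To see that [^/]+ keeps the two routes apart, here is a small sketch that tries both patterns against both paths. It uses the groups/ prefix from the question; the variable names are illustrative.

```python
import re

group_re = re.compile(r'^groups/(?P<group>[^/]+)$')
users_re = re.compile(r'^groups/(?P<group>[^/]+)/users$')

# Each path should be accepted by exactly one of the two patterns.
results = {
    url: (bool(group_re.match(url)), bool(users_re.match(url)))
    for url in ('groups/soccer players', 'groups/soccer players/users')
}
for url, (is_group, is_users) in results.items():
    print(url, is_group, is_users)
```

Because the character class excludes the slash, the group name can no longer swallow the /users suffix.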
