Regular expression to match one from two groups - python

I want to match whichever one of two alternative groups fits, e.g.:
gmail.com|gmail.co.in
This alternation gives me the result directly, but how can I write a Python regular expression that matches an email ID in the general case?
Note: I only want to match either 3 characters after the dot, or at most two dot-separated groups of 2 characters each.
I tried this regex:
[\w]+\.?[\w]{3}|[[\w]{2}\.?]{2}
but it won't give me the expected results.
If I use (), only the group is returned, e.g. abcd#gmail.com returns gmail.com, but I need to retrieve the whole email address.

You can use the following regex :
\w+#\w+\.((\w{3})|(\w{2}\.\w{2}))
All you need is to put the first part as \w+#\w+\. and then play with grouping and the pipe. So the following pattern:
((\w{3})|(\w{2}\.\w{2}))
will match either a string of 3 word characters, or (\w{2}\.\w{2}), meaning 2 word characters, then a dot, then 2 more word characters.
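To address the concern in the question about getting the whole address rather than one group, here is a minimal sketch. It assumes the same # placeholder used in the examples above; the names EMAIL and text are made up for illustration. Non-capturing groups (?:...) keep findall returning whole matches, and the trailing \b is an added assumption so that a match cannot end in the middle of a longer word.

```python
import re

# '#' stands in for '@', exactly as in the question's examples.
# (?:...) groups without capturing, so findall returns the full match.
EMAIL = re.compile(r'\w+#\w+\.(?:\w{2}\.\w{2}|\w{3})\b')

text = 'contact abcd#gmail.com or xyz#gmail.co.in today'
print(EMAIL.findall(text))  # ['abcd#gmail.com', 'xyz#gmail.co.in']

# With capturing groups, group(0) still holds the whole address.
m = re.search(r'\w+#\w+\.((\w{3})|(\w{2}\.\w{2}))', 'abcd#gmail.com')
print(m.group(0))  # abcd#gmail.com
```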

Hope this helps.
>>> import re
>>> y = re.search(r'(?P<user>\w+)#(?P<domain>[\w.]+)', 'abc#gmail.com')
>>> print(y.group("user"), y.group("domain"))
abc gmail.com
>>> y = re.search(r'(?P<user>\w+)#(?P<domain>[\w.]+)', 'abc#gmail.co.in')
>>> print(y.group("user"), y.group("domain"))
abc gmail.co.in

I hope this can help you.
email = 'abc#gmail.com'
user = email.split('#')[0]
domain = email.split('#')[1]

Related

Regular expression to retrieve string parts within parentheses separated by commas

I have a String from which I want to take the values within the parenthesis. Then, get the values that are separated from a comma.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall(r'([^(,)]+)(?!.*\()', string)
print(l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a sequence of characters other than (, a comma, and ). To prevent the first x being picked up, the sequence may not be followed by a ( anywhere further in the string. This also makes it work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
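As a quick check of the claim about longer preambles, the sketch below wraps the same pattern in a helper. The name split_args and the second sample input are hypothetical; the pattern is the one from this answer.

```python
import re

def split_args(s):
    # Runs of characters other than '(', ',' and ')' that are not
    # followed by a '(' anywhere later in the string.
    return re.findall(r'([^(,)]+)(?!.*\()', s)

print(split_args("x(142,1,23ERWA31)"))       # ['142', '1', '23ERWA31']
print(split_args("preamble(142,1,23ERWA31)"))  # ['142', '1', '23ERWA31']
```

Text after the closing parenthesis would also be picked up, so this only fits inputs shaped like the example.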
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
A little wonky looking, and also assuming there's always going to be a grouping of 3 values in the parenthesis - but try this regex
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object - your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult object like a list (it is a tuple)
>>> print(firstResult[2])
23ERWA31

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
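If the substring itself is only known at run time, it may contain characters that are special in a regex (for example . or +). A minimal sketch of one way to handle that with re.escape follows; the function name following_chars and its parameters are made up for illustration.

```python
import re

def following_chars(text, sub, n=4):
    # re.escape makes the substring match literally; (.{n}) captures
    # the n characters that follow it (any character except a newline).
    pattern = re.escape(sub) + r'(.{%d})' % n
    return re.findall(pattern, text)

s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(following_chars(s, "a!b2"))  # ['foob', 'bazq', 'spam']

# Without escaping, the '.' in "a.b" would also match the 'x' in "axb".
print(following_chars("a.b1234 axb5678", "a.b"))  # ['1234']
```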
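import re
s = "a!b2foobar a!b2bazqux"  # sample input
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters except newlines.
group(0) is the complete match; group(1) is the text captured by the first pair of parentheses. The documentation of the re module covers group numbering in more detail.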
If you're only looking for how to grab the following 4 characters using regex, what you probably want is the curly-brace quantifier: '{}'.
The linked post goes into more detail, but essentially you would use [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
A more Pythonic way to solve this problem, however, would be to use string slicing and the index method.
E.g. given an input_string, once you find the substring at index i using index, you can use input_string[i+len(sub_str):i+len(sub_str)+4] to grab the characters that follow.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
Outputs (to show it works with <4 as well): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.
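The loop described above could look roughly like this. It is a sketch only: it uses str.find rather than str.index so that a missing substring ends the loop instead of raising ValueError, and the name find_all_following is hypothetical.

```python
def find_all_following(text, sub, n=4):
    results = []
    start = text.find(sub)
    while start != -1:
        begin = start + len(sub)
        # Slicing never raises, so fewer than n characters is fine too.
        results.append(text[begin:begin + n])
        # Resume the search one position after the previous hit.
        start = text.find(sub, start + 1)
    return results

s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(find_all_following(s, "a!b2"))  # ['foob', 'bazq', 'spam']
```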

Find the next word after a word in a string

I am trying to record the word after a specific word. For example, let's say I have a string:
First Name: John
Last Name: Doe
Email: John.Doe#email.com
I want to search the string for a key word such as "First Name:". Then I want to only capture the next word after it, in this case John.
I started using string.find("First Name:"), but I do not think that is the correct approach.
I could use some help with this. Most examples either split the string or keep everything else after "John". My goal is to be able to search strings for specific keywords no matter their location.
SOLUTION:
I used a similar set of code as below:
search = r"(First Name:)(.)(.+)"
x = re.compile(search)
This gave me the "John" with no spaces
A regular expression is the way to go:
import re
pattern = r"First Name: (\w+)"
first_names = re.findall(pattern, my_string)
It finds the prefix First Name: without extracting it, because findall returns only the captured group (\w+), which is the word that follows. Or you can split the string and iterate over the resulting list, skipping the text before the first occurrence of the prefix:
my_words = [x.split()[0] for x in my_string.split("First Name: ")[1:]]
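To search for arbitrary keywords rather than one hard-coded prefix, the same idea can be wrapped in a small helper. This is a sketch under assumptions: the name word_after is made up, and "word" is taken to mean a run of non-whitespace characters so that the email line from the question also works.

```python
import re

def word_after(text, key):
    # Escape the key so characters like ':' or '.' match literally,
    # skip any whitespace, then capture the next run of non-whitespace.
    match = re.search(re.escape(key) + r'\s*(\S+)', text)
    return match.group(1) if match else None

record = "First Name: John\nLast Name: Doe\nEmail: John.Doe#email.com"
print(word_after(record, "First Name:"))  # John
print(word_after(record, "Email:"))       # John.Doe#email.com
```

Because \s* also skips newlines, a key with an empty value would pick up the first word of the next line.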
The .find approach is a good start.
You can use split on the remaining string to limit results to the single word.
Without using regex
s = "abc def opx"
q = 'abc'
res = s[s.find(q)+len(q):].split()[0]
res == 'def'
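One caveat with the one-liner: find returns -1 when the keyword is missing, and the slice then silently starts at the wrong place. A minimal sketch that guards against that follows; the function name next_word is hypothetical.

```python
def next_word(text, key):
    pos = text.find(key)
    if pos == -1:
        return None  # keyword not present
    rest = text[pos + len(key):].split()
    # rest is empty when nothing follows the keyword
    return rest[0] if rest else None

print(next_word("abc def opx", "abc"))  # def
print(next_word("abc def opx", "zzz"))  # None
```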

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string (this should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern there are 5 groups in total:
Group 0 is the complete string that is matched.
The rest are numbered in the order of the opening brackets you see (so you are looking for the second one, (\\w*)).
If you want, you can capture only the part of the string you are interested in. Remove the brackets from the parts of the pattern you don't need and keep just (\\w*):
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example there are no groups 2, 3 and 4 as in the previous example, because we capture only 1 group. Group 0 is always available: it is the complete string that matches.
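The group numbering can be checked on a longer string that contains a second http://www. address, as mentioned in the question's update. This is only a sketch; the sample text and the other.org address are invented for illustration.

```python
import re

pattern = re.compile(r'http://www\.(\w*)\.com/(\d+)')
text = 'see http://www.other.org and http://www.someDomainName.com/2134 now'

m = pattern.search(text)
print(m.group(0))  # http://www.someDomainName.com/2134
print(m.groups())  # ('someDomainName', '2134')
```

search skips the first address because it never reaches .com/ followed by digits, and groups() returns only the numbered groups, not group 0.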
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and take everything from that index on.
This avoids regexes, which are harder to maintain:
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])

Remove -#### in zipcodes

How do I remove the +4 from zipcodes, in python?
I've got data like
85001
52804-3233
Winston-Salem
And I want that to become
85001
52804
Winston-Salem
>>> zip = '52804-3233'
>>> zip[:5]
'52804'
...and of course, when you parse your lines from the original data, you should add some kind of rule to distinguish zipcodes that need fixing from other strings. I don't know what your data looks like, so I can't help much (you could check whether they contain only digits and the '-' symbol, maybe?).
>>> import re
>>> s = "52804-3233"
>>> # regex to remove a dash and 4 digits after the dash after 5 digits:
>>> re.sub('(\d{5})-\d{4}', '\\1', s)
'52804'
The \\1 is a so called back reference and gets replaced by the first group, which would be the 5 digit zipcode in this case.
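Applied to the three sample values from the question, the substitution could be sketched as follows. The anchors ^ and $ are an added assumption so that only values consisting entirely of a ZIP+4 code are touched; the names ZIP_PLUS4, data and cleaned are made up for illustration.

```python
import re

# Anchored so that only a complete "12345-6789" value is rewritten.
ZIP_PLUS4 = re.compile(r'^(\d{5})-\d{4}$')

data = ['85001', '52804-3233', 'Winston-Salem']
cleaned = [ZIP_PLUS4.sub(r'\1', item) for item in data]
print(cleaned)  # ['85001', '52804', 'Winston-Salem']
```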
You could try something like this:
cleaned = []
for item in inputs:
    # Keep only the first 5 characters when they are all digits
    if item[:5].isnumeric():
        item = item[:5]
    cleaned.append(item)
Just keep the first 5 characters of anything that has digits in its first 5 positions.
re.sub('-\d{4}$', '', zipcode)
This matches all items of the format 00000-0000 with a space or other word boundary before and after the number and replaces each with its first five digits. The other regexes posted will match some other number formats that you might not want. Note the raw strings: in a plain string literal, '\b' is a backspace character rather than a word boundary.
re.sub(r'\b(\d{5})-\d{4}\b', r'\1', zipcode)
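A short sketch of how a word-boundary pattern of this kind behaves on running text, written with raw strings so that \b reaches the regex engine as a word boundary. The sample sentence is invented for illustration.

```python
import re

text = 'Ship to 52804-3233 or 85001, not 1234567-8901.'
result = re.sub(r'\b(\d{5})-\d{4}\b', r'\1', text)
print(result)  # Ship to 52804 or 85001, not 1234567-8901.
```

The seven-digit number is left alone because no word boundary falls exactly five digits before its dash.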
Or without regex:
output = [line[:5] if line[:5].isnumeric() and line[6:].isnumeric() else line for line in text if line]
