First, I want to grab this kind of string from a text file:
{kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au
And then convert it to separate strings such as:
kevin.knerr@google.com.au
sam.mcgettrick@google.com.au
mike.grahs@google.com.au
For example, the text file can be:
Some gibberish words
{kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au
Some Gibberish words
As said in the comments, it is better to grab the part in {} and apply some programming logic afterwards. You can grab the different parts with:
\{(?P<individual>[^{}]+)\}@(?P<domain>\S+)
# looks for {
# captures everything that is not } into the group individual
# looks for @ afterwards
# saves everything that is not whitespace into the group domain
See a demo on regex101.com.
In Python this would be:
import re
rx = r'\{(?P<individual>[^{}]+)\}@(?P<domain>\S+)'
string = 'gibberish {kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au gibberish'
for match in re.finditer(rx, string):
    print(match.group('individual'))
    print(match.group('domain'))
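From there, the programming logic is just splitting the captured individual part on commas and re-attaching the domain; a minimal sketch building on the snippet above (sample text taken from the question):
import re

rx = r'\{(?P<individual>[^{}]+)\}@(?P<domain>\S+)'
text = 'Some gibberish words {kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au Some Gibberish words'
for match in re.finditer(rx, text):
    domain = match.group('domain')
    # split the comma-separated names and re-attach the domain
    for name in match.group('individual').split(','):
        print(name.strip() + '@' + domain)
# kevin.knerr@google.com.au
# sam.mcgettrick@google.com.au
# mike.grahs@google.com.au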
Python Code
ip = "{kevin.knerr, sam.mcgettrick, mike.grahs}#google.com.au"
arr = re.match(r"\{([^\}]+)\}(\#\S+$)", ip)
#Using split for solution
for x in arr.group(1).split(","):
print (x.strip() + arr.group(2))
#Regex Based solution
arr1 = re.findall(r"([^, ]+)", arr.group(1))
for x in arr1:
print (x + arr.group(2))
IDEONE DEMO
I have a string like this: 'approved:rakeshc@IAD.GOOGLE.COM'
I would like to extract the text after ':' and before '@'.
In this case the text to be extracted is rakeshc.
It can be done using the split method - 'approved:rakeshc@IAD.GOOGLE.COM'.split(':')[1].split('@')[0] -
but I would like this to be done using a regular expression.
This is what I have tried so far:
import re
iptext = 'approved:rakeshc@IAD.GOOGLE.COM'
re.sub('^(.*approved:)', "", iptext)  # --> gives everything after ':'
re.sub('(@IAD.GOOGLE.COM)$', "", iptext)  # --> gives everything before '@'
I would like the result in a single expression; the expression would be used to replace the string with only the middle part.
Here is a regex one-liner:
inp = "approved:rakeshc#IAD.GOOGLE.COM"
output = re.sub(r'^.*:|#.*$', '', inp)
print(output) # rakeshc
The approach above strips all text from the start up to, and including, the ':', as well as all text from '@' until the end. This leaves behind the email ID.
Use a capture group to copy the part between the matches to the result.
result = re.sub(r'.*approved:(.*)@IAD\.GOOGLE\.COM$', r'\1', iptext)
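For completeness, a minimal runnable sketch of that capture-group approach, using the variable name from the question:
import re

iptext = 'approved:rakeshc@IAD.GOOGLE.COM'
# keep only what the capture group matched between ':' and '@'
result = re.sub(r'.*approved:(.*)@IAD\.GOOGLE\.COM$', r'\1', iptext)
print(result)  # rakeshc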
Hope this works for you:
import re
input_text = "approved:rakeshc#IAD.GOOGLE.COM"
out = re.search(':(.+?)#', input_text)
if out:
found = out.group(1)
print(found)
You can use this one-liner:
re.sub(r'^.*:(\w+)@.*$', r'\1', iptext)
Output:
rakeshc
I am trying to extract some groups of data from a text and validate if the input text is correct. In the simplified form my input text looks like this:
Sample=A,B;C,D;E,F;G,H;I&other_text
in which A-I are the groups I am interested in extracting.
In the generic form, Sample looks like this:
val11,val12;val21,val22;...;valn1,valn2;final_val
An arbitrary number of comma-separated pairs which are separated by semicolons, and one single value at the very end.
There must be at least two pairs before the final value.
The regular expression I came up with is something like this:
r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)'
Assuming my desired groups are simply words (in reality they are more complex but this is out of the scope of the question).
It actually captures the whole text but fails to group the values correctly.
I am just assuming that your "values" are composed of any characters other than , and ;, i.e. [^,;]+. This clearly needs to be modified in the re.match and re.finditer calls to meet your actual requirements.
import re
s = 'Sample=val11,val12;val21,val22;val31,val32;valn1,valn2;final_val'
# verify if there is a match:
m = re.match(r'^Sample=([^,;]+),+([^,;]+)(;([^,;]+),+([^,;]+))+;([^,;]+)$', s)
if m:
    final_val = m.group(6)
    other_vals = [(m.group(1), m.group(2)) for m in re.finditer(r'([^,;]+),+([^,;]+)', s[7:])]
    print(final_val)
    print(other_vals)
Prints:
final_val
[('val11', 'val12'), ('val21', 'val22'), ('val31', 'val32'), ('valn1', 'valn2')]
You can do this with a regex that has an OR in it to decide which kind of data you are parsing. I spaced out the regex for commenting and clarity.
import re

data = 'val11,val12;val21,val22;valn1,valn2;final_val'
pat = re.compile(r'''
    (?P<pair>                 # either a comma-separated pair ending in a semicolon
        (?P<entry_1>[^,;]+) , (?P<entry_2>[^,;]+) ;
    )
    |                         # OR
    (?P<end_part>             # the ending token, which contains no comma or semicolon
        [^;,]+
    )''', re.VERBOSE)

results = []
for match in pat.finditer(data):
    if match.group('pair'):
        results.append(match.group('entry_1', 'entry_2'))
    elif match.group('end_part'):
        results.append(match.group('end_part'))
print(results)
This results in:
[('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2'), 'final_val']
You can do this without using regex, by using string.split.
An example:
words = list(map(lambda x: x.split(','), 'val11,val12;val21,val22;valn1,valn2;final_val'.split(';')))
This will result in the following list:
[
['val11', 'val12'],
['val21', 'val22'],
['valn1', 'valn2'],
['final_val']
]
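If you also want to enforce the question's requirement of at least two pairs before the final value, here is a small sketch building on the split approach (sample string assumed):
s = 'val11,val12;val21,val22;valn1,valn2;final_val'
*pairs, final_val = s.split(';')
pairs = [p.split(',') for p in pairs]
# validate: at least two pairs, each consisting of exactly two values
is_valid = len(pairs) >= 2 and all(len(p) == 2 for p in pairs)
print(is_valid)    # True
print(pairs)       # [['val11', 'val12'], ['val21', 'val22'], ['valn1', 'valn2']]
print(final_val)   # final_val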
I have a document that when converted to text splits the phone number onto multiple lines like this:
(xxx)-xxx-
xxxx
For a variety of reasons related to my project I can't simply join the lines.
If I know that phonenumber = "(555)-555-5555", how can I compile a regex so that if I run it over
(555)-555-
5555
it will match?
EDIT:
To help clarify my question here it is in a more abstract form.
test_string = "xxxx xx x xxxx"
text = """xxxx xx
x
xxxx"""
I need the test string to be found in the text. Newlines can be anywhere in the text and characters that need to be escaped should be taken into consideration.
A simple workaround would be to replace all the \n characters in the document text before you search it:
pat = re.compile(r'\(\d{3}\)-\d{3}-\d{4}')
numbers = pat.findall(text.replace('\n',''))
# ['(555)-555-5555']
If this cannot be done for any reasons, the obvious answer, though unsightly, would be to handle a newline character between each search character:
pat = re.compile(r'\(\n*5\n*5\n*5\n*\)\n*-\n*5\n*5\n*5\n*-\n*5\n*5\n*5\n*5')
If you needed to handle any format, you can pad the format like so:
phonenumber = '(555)-555-5555'
pat = re.compile('\n*'.join(['\\'+i if not i.isalnum() else i for i in phonenumber]))
# pat
# re.compile(r'\(\n*5\n*5\n*5\n*\)\n*\-\n*5\n*5\n*5\n*\-\n*5\n*5\n*5\n*5', re.UNICODE)
Test case:
import random
def rndinsert(s):
    i = random.randrange(len(s)-1)
    return s[:i] + '\n' + s[i:]

for i in range(10):
    print(pat.findall(rndinsert('abc (555)-555-5555 def')))
# ['(555)-555-5555']
# ['(555)-5\n55-5555']
# ['(555)-5\n55-5555']
# ['(555)-555-5555']
# ['(555\n)-555-5555']
# ['(5\n55)-555-5555']
# ['(555)\n-555-5555']
# ['(555)-\n555-5555']
# ['(\n555)-555-5555']
# ['(555)-555-555\n5']
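An equivalent way to build the same padded pattern is to let re.escape do the per-character escaping; a short sketch under the same assumptions:
import re

phonenumber = '(555)-555-5555'
# escape each character as needed and put an optional-newline token between them
pat = re.compile('\n*'.join(re.escape(c) for c in phonenumber))
print(pat.findall('abc (555)-555-\n5555 def'))  # ['(555)-555-\n5555']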
You can search for a possible \n existing in the string:
import re
nums = ["(555)-555-\n5555", "(555)-555-5555"]
new_nums = [i for i in nums if re.findall(r'\([\d\n]+\)[\n-][\d\n]+-[\d\n]+', i)]
Output:
['(555)-555-\n5555', '(555)-555-5555']
data = ["(555)-555-\n5555", "(55\n5)-555-\n55\n55", "(555\n)-555-\n5555", "(555)-555-5555"]
input = '(555)-555-5555'
#add new lines to input string
input = re.sub(r'(?!^|$)', r'\\n*', input)
#escape brackets ()
input = re.sub(r'(?=[()])', r'\\',input)
r = re.compile(input)
match = filter(r.match, data)
Code demo
I'm trying to extract information from a Snort file using regular expressions. I've successfully got the IPs and SID, but I seem to be having trouble extracting a specific part of the text.
How can I extract part of a Snort log file? The part I'm trying to extract can look like [Classification: example-of-attack] or [Classification: Example of Attack]. However, the first example may have any number of hyphens, whilst the second instance doesn't have any hyphens but contains some capital letters.
How could I extract just example-of-attack or Example-of-Attack?
I unfortunately only know how to search for static words such as:
test = re.search("exact-name", line)
t = test.group()
print(t)
I've tried many different commands on the web, but I just don't seem to get it.
You can use the following regex:
>>> m = re.search(r'\[Classification:\s*([^]]+)\]', line).group(1)
( Explanation | Working Demo )
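For reference, a short usage sketch of that pattern applied to the two sample lines from the question:
import re

for line in ['[Classification: example-of-attack]', '[Classification: Example of Attack]']:
    m = re.search(r'\[Classification:\s*([^]]+)\]', line)
    if m:
        print(m.group(1))
# example-of-attack
# Example of Attack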
You could use look-behinds,
>>> s = "[Classification: example-of-attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s)
>>> m
<_sre.SRE_Match object at 0x7ff54a954370>
>>> m.group()
'example-of-attack'
>>> s = "[Classification: Example of Attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s).group()
>>> m
'Example of Attack'
Use the regex module if there is more than one space after the string Classification:,
>>> import regex
>>> s = "[Classification: Example of Attack]"
>>> regex.search(r'(?<=Classification:\s+\b)[^\]]*', s).group()
'Example of Attack'
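If you prefer to stay with the standard library re module, the variable-space case can also be handled by matching the spaces outside the lookbehind and using a capture group instead; a minimal sketch:
import re

s = "[Classification:   Example of Attack]"
m = re.search(r'Classification:\s*([^\]]*)\]', s)
if m:
    print(m.group(1))  # Example of Attack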
If you want to match any substring with the pattern [Word: Value], you could use the following regex,
ptrn = r"\[\s*(\w+):\s*([\w\s-]+)\s*\]"
Here I've used two groups, one for the first word ("Classification" in your question) and one for the second (either "example-of-attack" or "Example of Attack"). It also requires opening and closing square brackets. For example,
txt1 = '[Classification: example-of-attack]'
m = re.search( ptrn, txt1 )
>>> m.group(2)
'example-of-attack'
I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
...<name@domain.com>...
What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, and then grab the entirety of the address within the <...>'s and add it to a list? The trouble I have is with the variable length of the different addresses.
This code extracts the email addresses in a string. Use it while reading line by line
>>> import re
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol"
>>> match = re.search(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk#bob.com.lol'
If you have several email addresses use findall:
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol or popop#coco.com"
>>> match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match
['jdsk#bob.com.lol', 'popop#coco.com']
The regex above probably finds the most common non-fake email addresses. If you want to be completely aligned with RFC 5322 you should check which email addresses follow the specification. Check this out to avoid any bugs in finding email addresses correctly.
Edit: as suggested in a comment by @kostek:
In the string Contact us at support@example.com. my regex returns support@example.com. (with dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+ instead.
Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ which will capture example@do-main.com as well.
Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad#ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
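As a quick sanity check, here is the pattern from this answer run against the cases discussed in the edits above (a hyphenated domain, a multi-segment domain, and a non-address like bad@ss):
import re

pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'
tests = ['example@do-main.com', 'first.last+tag@abc.co.uk', 'bad@ss']
for t in tests:
    print(t, re.findall(pattern, t))
# example@do-main.com ['example@do-main.com']
# first.last+tag@abc.co.uk ['first.last+tag@abc.co.uk']
# bad@ss []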
Update 2023
Seems stackabuse has compiled a post based on the popular SO answer mentioned above.
import re
regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")#([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")
def isValid(email):
if re.fullmatch(regex, email):
print("Valid email")
else:
print("Invalid email")
isValid("name.surname#gmail.com")
isValid("anonymous123#yahoo.co.uk")
isValid("anonymous123#...uk")
isValid("...#domain.us")
You can also use the following to find all the email addresses in a text and print them in an array or each email on a separate line.
import re
line = "why people don't know what regex are? let me know asdfal2#als.com, Users1#gmail.de " \
"Dariush#dasd-asasdsa.com.lo,Dariush.lastName#someDomain.com"
match = re.findall(r'[\w\.-]+#[\w\.-]+', line)
for i in match:
print(i)
If you want to add it to a list just print the "match"
# this will print the list
print(match)
import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)
Forgive me lord for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.
Feel free to play around with it here
I also made a variation where the regex captures emails like name at example.com
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
If you're looking for a specific domain:
>>> import re
>>> text = "this is an email la#test.com, it will be matched, x#y.com will not, and test#test.com will"
>>> match = re.findall(r'[\w-\._\+%]+#test\.com',text) # replace test\.com with the domain you're looking for, adding a backslash before periods
>>> match
['la#test.com', 'test#test.com']
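If the domain is only known at runtime, the same idea can be parameterized with re.escape; a small sketch (the domain variable here is illustrative):
import re

domain = "test.com"  # hypothetical: whatever domain you are filtering for
pattern = r'[\w.%+-]+@' + re.escape(domain)
text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
print(re.findall(pattern, text))  # ['la@test.com', 'test@test.com']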
import re
reg_pat = r'\S+@\S+\.\S+'
test_text = 'xyz.byc@cfg-jj.com ir_er@cu.co.kl uiufubvcbuw bvkw ko@com m@urice'
emails = re.findall(reg_pat, test_text, re.IGNORECASE)
print(emails)
Output:
['xyz.byc@cfg-jj.com', 'ir_er@cu.co.kl']
import re
mess = '''Jawadahmed@gmail.com Ahmed@gmail.com
abc@gmail'''
email = re.compile(r'([\w\.-]+@gmail\.com)')
result = email.findall(mess)
if result:
    print(result)
The above code will extract only the Gmail addresses when it is run.
You can use \b at the end of the pattern to define where the email ends and get the correct address.
The regex
[\w\.\-]+@[\w\-\.]+\b
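A small illustration of what the trailing \b buys you (sample string assumed):
import re

line = "contact someone@example.com. for details"
print(re.findall(r'[\w\.\-]+@[\w\-\.]+\b', line))  # ['someone@example.com'] - trailing dot excluded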
Example: if the mail ID contains a-z (all lower case), _ or any number 0-9, then the regex below will match:
>>> str1 = "abcdef_12345@gmail.com"
>>> regex1 = "^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345@gmail.com'
import re

content = ' abcdabcd jcopelan@nyx.cs.du.edu afgh 65882@mimsy.umd.edu qwertyuiop mangoe@cs.umd'
match_objects = re.findall(r'\w+@\w+[\.\w+]+', content)
# \b[\w|\.]+ ---> means it begins with any English letter or digit character, or a dot.
import re
marks = '''
!()[]{};?#$%:'"\,/^&é*
'''
text = 'Hello from priyankv@gmail.com to python@gmail.com, datascience@@gmail.com and machinelearning@@yahoo..com wrong email address: farzad@google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|\.]*@[\w|\.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
    for x in marks:
        p = p.replace(x, "")
    if len(re.findall(pattern, p)) > 0:
        print(re.findall(pattern, p))
One other way is to divide it into 3 different groups and capture the group(0). See below:
import re

emails = []
for line in email:  # email is the text file object where the addresses exist
    e = re.search(r'([.\w\d-]+)(@)([.\w\d-]+)', line)  # 3 different groups are composed
    if e:
        emails.append(e.group(0))
print(emails)
Here's another approach for this specific problem, with a regex from emailregex.com:
text = "blabla <hello#world.com>><123#123.at> <huhu#fake> bla bla <myname#some-domain.pt>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall('<\S+?>', text) # ['<hello#world.com>', '<123#123.at>', '<huhu#fake>', '<myname#somedomain.edu>']
# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print emails # ['hello#world.com', '123#123.at', 'myname#some-domain.pt']
import re
txt = 'hello from absc@gmail.com to par1@yahoo.com about the meeting @2PM'
email = re.findall(r'\S+@\S+', txt)
print(email)
Printed output:
['absc@gmail.com', 'par1@yahoo.com']
import re
with open("file_name",'r') as f:
s = f.read()
result = re.findall(r'\S+#\S+',s)
for r in result:
print(r)