Extract email sub-strings from large document - python

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
...<name@domain.com>...
What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, then grab the entire address within the <...>'s and add it to a list? The trouble I have is with the variable length of the different addresses.
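A minimal sketch of one way to do this (the file name and target domain below are placeholders, not from the question): read the file line by line and pull whatever sits between < and > for the domain you care about; the + quantifier handles the variable-length local part.
import re

domain = "example.com"                                  # placeholder: the @domain you are filtering for
pattern = re.compile(r'<([\w.+-]+@' + re.escape(domain) + r')>')

emails = []
with open('big_file.txt', 'r') as f:                    # placeholder file name
    for line in f:                                      # stream line by line instead of loading the whole file
        emails.extend(pattern.findall(line))            # findall returns the captured address without the <>

print(emails)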

This code extracts the email addresses in a string. Use it while reading the file line by line:
>>> import re
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol"
>>> match = re.search(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk#bob.com.lol'
If you have several email addresses, use findall:
>>> line = "should we use regex more often? let me know at jdsk@bob.com.lol or popop@coco.com"
>>> match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match
['jdsk@bob.com.lol', 'popop@coco.com']
The regex above matches most common, real-world email addresses. If you want to be fully aligned with RFC 5322, you should check which email addresses actually follow the specification to avoid bugs when extracting them.
Edit: as suggested in a comment by @kostek:
In the string "Contact us at support@example.com." my regex returns support@example.com. (with a dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+
Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+, which will capture example@do-main.com as well.
Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad@ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
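A quick check of the cases mentioned in these edits (a small sketch; the sample strings are mine):
>>> import re
>>> pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'
>>> re.findall(pattern, "reach us at +tag@abc.co.uk or example@do-main.com")
['+tag@abc.co.uk', 'example@do-main.com']
>>> re.findall(pattern, "this is not an address: bad@ss :)")
[]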
Update 2023
It seems stackabuse has compiled a post based on the popular SO answer mentioned above.
import re
regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")#([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")
def isValid(email):
if re.fullmatch(regex, email):
print("Valid email")
else:
print("Invalid email")
isValid("name.surname#gmail.com")
isValid("anonymous123#yahoo.co.uk")
isValid("anonymous123#...uk")
isValid("...#domain.us")

You can also use the following to find all the email addresses in a text and print them as a list, or print each email on a separate line.
import re
line = "why people don't know what regex are? let me know asdfal2#als.com, Users1#gmail.de " \
"Dariush#dasd-asasdsa.com.lo,Dariush.lastName#someDomain.com"
match = re.findall(r'[\w\.-]+#[\w\.-]+', line)
for i in match:
print(i)
If you want the addresses as a list, findall already returns one; just print match:
# this will print the list
print(match)

import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)
Forgive me, lord, for having a go at this infamous regex. It works for a decent portion of email addresses. I mostly used this reference as my basis for the valid characters in an email address.
Feel free to play around with it here
I also made a variation where the regex captures emails like name at example.com
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
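For example (a quick sketch; the sample text is mine), pulling group 1 out of the findall tuples the same way get_first_group does above:
import re

rgx_at = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
text = "contact: john.doe (at) example.com thanks"
print([m[0] for m in re.findall(rgx_at, text)])  # ['john.doe (at) example.com']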

If you're looking for a specific domain:
>>> import re
>>> text = "this is an email la#test.com, it will be matched, x#y.com will not, and test#test.com will"
>>> match = re.findall(r'[\w-\._\+%]+#test\.com',text) # replace test\.com with the domain you're looking for, adding a backslash before periods
>>> match
['la#test.com', 'test#test.com']

import re
reg_pat = r'\S+#\S+\.\S+'
test_text = 'xyz.byc@cfg-jj.com ir_er@cu.co.kl uiufubvcbuw bvkw ko@com m@urice'
emails = re.findall(reg_pat, test_text, re.IGNORECASE)
print(emails)
Output:
['xyz.byc@cfg-jj.com', 'ir_er@cu.co.kl']

import re
mess = '''Jawadahmed@gmail.com Ahmed@gmail.com
abc@gmail'''
email = re.compile(r'([\w\.-]+@gmail\.com)')
result = email.findall(mess)
if result:
    print(result)
The above code extracts only the Gmail addresses when you call it.

You can use \b at the end to mark where the email ends, so you get the correct address without trailing characters.
The regex:
[\w\.\-]+@[\w\-\.]+\b
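For instance (a small sketch; the sample string is mine), the trailing \b keeps a sentence-ending dot out of the match:
>>> import re
>>> re.findall(r'[\w\.\-]+@[\w\-\.]+\b', "Contact us at support@example.com.")
['support@example.com']
>>> re.findall(r'[\w\.\-]+@[\w\-\.]+', "Contact us at support@example.com.")
['support@example.com.']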

Example: if the mail id contains only lowercase letters (a-z), digits (0-9), and _ or ., the regex below will match it:
>>> import re
>>> str1 = "abcdef_12345@gmail.com"
>>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345@gmail.com'

content = ' abcdabcd jcopelan@nyx.cs.du.edu afgh 65882@mimsy.umd.edu qwertyuiop mangoe@cs.umd'
match_objects = re.findall(r'\w+@\w+[\.\w+]+', content)

# \b[\w|\.]+ ---> starts at a word boundary and matches letters, digits, underscores, dots or '|' characters.
import re
marks = '''
!()[]{};?#$%:'"\,/^&é*
'''
text = 'Hello from priyankv@gmail.com to python@gmail.com, datascience@@gmail.com and machinelearning@@yahoo..com wrong email address: farzad@google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|\.]*@[\w|\.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
    for x in marks:
        p = p.replace(x, "")
    if len(re.findall(pattern, p)) > 0:
        print(re.findall(pattern, p))

One other way is to divide it into 3 different groups and capture group(0). See below:
emails = []
for line in email:  # email is the text file where some emails exist.
    e = re.search(r'([.\w\d-]+)(@)([.\w\d-]+)', line)  # 3 different groups are composed.
    if e:
        emails.append(e.group(0))
print(emails)

Here's another approach for this specific problem, with a regex from emailregex.com:
text = "blabla <hello#world.com>><123#123.at> <huhu#fake> bla bla <myname#some-domain.pt>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall('<\S+?>', text) # ['<hello#world.com>', '<123#123.at>', '<huhu#fake>', '<myname#somedomain.edu>']
# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print emails # ['hello#world.com', '123#123.at', 'myname#some-domain.pt']

import re
txt = 'hello from absc@gmail.com to par1@yahoo.com about the meeting @2PM'
email = re.findall(r'\S+@\S+', txt)
print(email)
Printed output:
['absc@gmail.com', 'par1@yahoo.com']

import re
with open("file_name",'r') as f:
s = f.read()
result = re.findall(r'\S+#\S+',s)
for r in result:
print(r)

Related

Python extract email address from a HUGE string [duplicate]

I have been using this:
(I know, there are probably more efficient ways...)
Given this in an email message:
Submitted data:
First Name: MyName
Your Email Address: email@domain.com
TAG:
I coded this:
intStart = (bodystring.rfind('First ')) + 12
intEnd = (bodystring.rfind('Your Email'))
receiver_name = bodystring[intStart:intEnd]
intStart = (bodystring.rfind('Your Email Address: ')) + 20
intEnd = (bodystring.rfind('TAG:'))
receiver_email = bodystring[intStart:intEnd]
... and got what I needed. This worked because I had the 'TAG' label.
Now I am given this:
Submitted data:
First name: MyName
Last name:
Email: email@domain.com
I'm having a brain block on getting the email address when there is no following label, only whitespace after it. Can someone nudge me in the right direction? I suspect I can dig out the email address after the occurrence of 'Email:' using regex...
You can, in fact, make use of RegEx to extract e-mails.
To find single e-mails in a text, you can make use of
re.search().group()
In case you want to find multiple emails, you can make use of
re.findall()
An example
import re
text = "First name: MyName Last name: Email: email#domain.com "
email = re.search(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", text)
print(email.group())
emails = re.findall(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", text)
print (emails)
This would give the output as
email#domain.com
['email#domain.com']
If the email should come after the word Email followed by a :, you could match that label part and capture the email in a group using an email-like pattern.
\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)
\bEmail A word boundary to prevent a partial match, then match Email
[^:]*:\s* Match optional chars other than :, then match : and optional whitespace chars
( Capture group 1
[^\s@]+@[^\s@]+ Match a single @ between 1+ non-whitespace chars excluding the @ itself
) Close group 1
Regex demo
Example with re.findall that returns the values of the capture groups:
import re
regex = r"\bEmail[^:]*:\s*([^\s#]+#[^\s#]+)"
s = ("Submitted data:\n"
"First Name: MyName\n"
"Your Email Address: email#domain.com\n"
"TAG:\n\n"
"Submitted data:\n"
"First name: MyName\n"
"Last name:\n"
"Email: email#domain.com")
print(re.findall(regex, s))
Output
['email#domain.com', 'email#domain.com']
Searching for strings like this is often better done with splitting, and only occasionally with regular expressions. So first split the text into lines:
bodylines = bodystring.splitlines()
Split the resulting lines on the first : delimiter, skipping lines that don't contain one (make a generator):
chunks = (line.split(':', 1) for line in bodylines if ':' in line)
Now grab the first one that has "email" on the left and @ on the right:
address = next(val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val)
If you want all the emails across multiple lines, replace next with a list comprehension:
addresses = [val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val]
This can be done in one line with no imports (if you replace chunks with its definition, not that I recommend it). Regexes are a much heavier tool that lets you specify much more general patterns, but they are also much slower as a result. If you can get away with simple and effective tools, do it: don't bring in the sledgehammer until you need it!
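For the record, the combined one-liner would look roughly like this (a sketch, not a recommendation):
address = next(
    val.strip()
    for key, val in (line.split(':', 1) for line in bodystring.splitlines() if ':' in line)
    if 'email' in key.lower() and '@' in val
)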

How can I extract the email address string

My Python script currently pulls an email address as a list, but I need to get the text portion only. In this example, it should have been golfshop@3lakesgolf.com. I have tried using the text attribute (gc_email.text), but that didn't work.
gc_email=web.select('a[href^=mailto]')
print(gc_email)
output:
[<a href="mailto:golfshop@3lakesgolf.com">golfshop@3lakesgolf.com</a>]
Help! How can I extract just the mailto address?
You can use a regex capture to pull this string
import re
str = '<a href="mailto:golfshop@3lakesgolf.com">golfshop@3lakesgolf.com</a>'
regex = '<a href="mailto:(.*?)".*'
try:
    match = re.match(regex, str).group(1)
except:
    match = None
if match is not None:
    print(match)
Output
golfshop@3lakesgolf.com
Assuming every line follows the format you provided, you could use the '.split()' function on a series of characters and then select the appropriate items from the returned lists.
line = '[<a href="mailto:golfshop@3lakesgolf.com">golfshop@3lakesgolf.com</a>]'
sections1 = line.split(':')
email = sections1[1].split('.com')[0]+'.com'
Output
golfshop@3lakesgolf.com
If the formatting varies and is not like this every single time, then I'd go with regular expressions.
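Alternatively, since select() returns a list of Tag objects, you can skip the string handling and read the text or the href attribute directly (a sketch, assuming web is the BeautifulSoup object from the question):
gc_email = web.select('a[href^=mailto]')
if gc_email:
    tag = gc_email[0]                              # select() returns a list, so take the first Tag
    print(tag.get_text())                          # the link text, e.g. golfshop@3lakesgolf.com
    print(tag['href'].replace('mailto:', '', 1))   # or strip the mailto: prefix from the href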

Text Replacement Using RE - Allow The First Occurrence, Replace The Rest

I am looking for some thoughts on how I would be able to accomplish these tasks:
Allow the first occurrence of a problem_word, but ban any following uses of it and the rest of the problem words.
No modifications to the original document (.txt file). Only modify for print().
Keep the same structure of the email. If there are line breaks, or tabs, or weird spacings, let them keep their integrity.
Here is the code sample:
import re
# Sample email is "Hello, banned1. This is banned2. What is going on with
# banned3? Hopefully banned1 is alright."
sample_email = open('email.txt', 'r').read()
# First use of any of these words is allowed; those following are banned
problem_words = ['banned1', 'banned2', 'banned3']
# TODO: Filter negative_words into overused_negative_words
banned_problem_words = []
for w in problem_words:
    if sample_email.count(f'\\b{w}s?\\b') > 1:
        banned_problem_words.append(w)
pattern = '|'.join(f'\\b{w}s?\\b' for w in banned_problem_words)
def list_check(email, pattern):
    return re.sub(pattern, 'REDACTED', email, flags=re.IGNORECASE)
print(list_check(sample_email, pattern))
# Result should be: "Hello, banned1. This is REDACTED. What is going on with
# REDACTED? Hopefully REDACTED is alright."
The repl argument of re.sub can take a function that takes a match object and returns the replacement string. Here is my solution:
import re
sample_email = open('email.txt', 'r').read()
# First use of any of these words is allowed; those following are banned
problem_words = ['banned1', 'banned2', 'banned3']
pattern = '|'.join(f'\\b{w}\\b' for w in problem_words)
occurrences = 0
def redact(match):
    global occurrences
    occurrences += 1
    if occurrences > 1:
        return "REDACTED"
    return match.group(0)
replaced = re.sub(pattern, redact, sample_email, flags=re.IGNORECASE)
print(replaced)
(As a further note, str.count doesn't support regex, but there is no need to count here.)
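If you did need a regex-aware count at some point (a side note, not required for the solution above), len(re.findall(...)) does the job that str.count cannot:
import re

sample_email = "Hello, banned1. This is banned2. What is going on with banned3? Hopefully banned1 is alright."
# counts regex matches, unlike str.count, which only counts literal substrings
print(len(re.findall(r'\bbanned1\b', sample_email, flags=re.IGNORECASE)))  # 2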

Python regex manipulation extract email id

First, I want to grab this kind of string from a text file
{kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au
And then convert it to separate strings such as
kevin.knerr@google.com.au
sam.mcgettrick@google.com.au
mike.grahs@google.com.au
For example, the text file could look like:
Some gibberish words
{kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au
Some Gibberish words
As said in the comments, it is better to grab the part in {} and use some programming logic afterwards. You can grab the different parts with:
\{(?P<individual>[^{}]+)\}@(?P<domain>\S+)
# looks for {
# captures everything that is not } into the group individual
# looks for @ afterwards
# saves everything that is not a whitespace into the group domain
See a demo on regex101.com.
In Python this would be:
import re
rx = r'\{(?P<individual>[^{}]+)\}@(?P<domain>\S+)'
string = 'gibberish {kevin.knerr, sam.mcgettrick, mike.grahs}@google.com.au gibberish'
for match in re.finditer(rx, string):
    print(match.group('individual'))
    print(match.group('domain'))
Python Code
ip = "{kevin.knerr, sam.mcgettrick, mike.grahs}#google.com.au"
arr = re.match(r"\{([^\}]+)\}(\#\S+$)", ip)
#Using split for solution
for x in arr.group(1).split(","):
print (x.strip() + arr.group(2))
#Regex Based solution
arr1 = re.findall(r"([^, ]+)", arr.group(1))
for x in arr1:
print (x + arr.group(2))
IDEONE DEMO

Regex in Python 2.7 for extraction of information from Snort log files

I'm trying to extract information from a Snort file using regular expressions. I've successfully got the IPs and the SID, but I seem to be having trouble with extracting a specific part of the text.
How can I extract part of a Snort log file? The part I'm trying to extract can look like [Classification: example-of-attack] or [Classification: Example of Attack]. However, the first example may have any number of hyphens, whilst the second doesn't have any hyphens but contains some capital letters.
How could I extract just example-of-attack or Example of Attack?
I unfortunately only know how to search for static words such as:
test = re.search("exact-name", line)
t = test.group()
print t
I've tried many different commands on the web, but I just don't seem to get it.
You can use the following regex:
>>> m = re.search(r'\[Classification:\s*([^]]+)\]', line).group(1)
( Explanation | Working Demo )
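For example (a short check using the two sample lines from the question):
>>> import re
>>> re.search(r'\[Classification:\s*([^]]+)\]', "[Classification: example-of-attack]").group(1)
'example-of-attack'
>>> re.search(r'\[Classification:\s*([^]]+)\]', "[Classification: Example of Attack]").group(1)
'Example of Attack'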
You could use look-behinds,
>>> s = "[Classification: example-of-attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s)
>>> m
<_sre.SRE_Match object at 0x7ff54a954370>
>>> m.group()
'example-of-attack'
>>> s = "[Classification: Example of Attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s).group()
>>> m
'Example of Attack'
Use the third-party regex module if there is more than one space after the string Classification:, since re does not support variable-length look-behinds:
>>> import regex
>>> s = "[Classification: Example of Attack]"
>>> regex.search(r'(?<=Classification:\s+\b)[^\]]*', s).group()
'Example of Attack'
If you want to match any substring with the pattern [Word: Value], you could use the following regex,
ptrn = r"\[\s*(\w+):\s*([\w\s-]+)\s*\]"
Here I've used two groups, one for the first word ("Classification" in your question) and one for the second (either "example-of-attack" or "Example of Attack"). It also requires opening and closing square brackets. For example,
txt1 = '[Classification: example-of-attack]'
m = re.search( ptrn, txt1 )
>>> m.group(2)
'example-of-attack'
