How can I extract the email address string - python

My Python script currently returns the email address as a list, but I need just the text portion. In this example, that would be golfshop#3lakesgolf.com. I tried the text attribute (gc_email.text), but that didn't work.
gc_email=web.select('a[href^=mailto]')
print(gc_email)
output:
[<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>]
Help! How can I extract just the mailto address?

You can use a regex capture to pull this string
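import re
# the scraped tag as a string (markup assumed from the regex below)
tag = '<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>'
regex = r'<a href="mailto:(.*?)".*'
try:
    match = re.match(regex, tag).group(1)
except AttributeError:  # re.match returned None: no mailto link found
    match = None
if match is not None:
    print(match)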
Output
golfshop#3lakesgolf.com

Assuming every line follows the format you provided, you could use the '.split()' function on a series of characters and then select the appropriate items from the returned lists.
line = '[<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>]'
sections1 = line.split(':')
email = sections1[1].split('.com')[0]+'.com'
print(email)
Output
golfshop#3lakesgolf.com
If the formatting varies and is not like this every single time, then I'd go with regular expressions.
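For what it's worth, gc_email.text fails in the question because select() returns a list of tags, so the attribute has to be read from an element (for example gc_email[0]) rather than from the list. If you would rather not depend on the exact markup at all, here is a minimal sketch using only the standard library's html.parser instead of the regex or split approaches. The MailtoCollector class and the sample markup are made up for illustration, and the address keeps the # separator exactly as it appears in this thread.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collects the address part of every <a href="mailto:..."> link."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query string
                self.addresses.append(value[len("mailto:"):].split("?")[0])


# Hypothetical markup standing in for the scraped page
sample_html = '<p><a href="mailto:golfshop#3lakesgolf.com">Email the shop</a></p>'

collector = MailtoCollector()
collector.feed(sample_html)
print(collector.addresses)
```

This reads the address from the href attribute, so it still works when the link text is something other than the address itself.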

Related

Python regex manipulation extract email id

First, I want to grab this kind of string from a text file
{kevin.knerr, sam.mcgettrick, mike.grahs}#google.com.au
And then convert it to separate strings such as
kevin.knerr#google.com.au
sam.mcgettrick#google.com.au
mike.grahs#google.com.au
For example text file can be as:
Some gibberish words
{kevin.knerr, sam.mcgettrick, mike.grahs}#google.com.au
Some Gibberish words
As said in the comments, better grab the part in {} and use some programming logic afterwards. You can grab the different parts with:
\{(?P<individual>[^{}]+)\}#(?P<domain>\S+)
# looks for {
# captures everything not } into the group individual
# looks for # afterwards
# saves everything not a whitespace into the group domain
See a demo on regex101.com.
In Python this would be:
import re
rx = r'\{(?P<individual>[^{}]+)\}#(?P<domain>\S+)'
string = 'gibberish {kevin.knerr, sam.mcgettrick, mike.grahs}#google.com.au gibberish'
for match in re.finditer(rx, string):
    print(match.group('individual'))
    print(match.group('domain'))
Python Code
import re
ip = "{kevin.knerr, sam.mcgettrick, mike.grahs}#google.com.au"
arr = re.match(r"\{([^\}]+)\}(\#\S+$)", ip)
# Using split for solution
for x in arr.group(1).split(","):
    print(x.strip() + arr.group(2))
# Regex based solution
arr1 = re.findall(r"([^, ]+)", arr.group(1))
for x in arr1:
    print(x + arr.group(2))
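Putting the two answers together, a small Python 3 sketch could look like the following. It reuses the named-group pattern from the first answer and the split step from the second; the expand_addresses helper name is an assumption made for this example, not something from either answer.

```python
import re

# Same pattern as above: names inside {...}, then '#', then the domain
BRACE_GROUP = re.compile(r"\{(?P<individual>[^{}]+)\}#(?P<domain>\S+)")


def expand_addresses(text):
    """Return one 'name#domain' string per name listed inside the braces."""
    expanded = []
    for match in BRACE_GROUP.finditer(text):
        domain = match.group("domain")
        for name in match.group("individual").split(","):
            name = name.strip()
            if name:  # ignore empty pieces such as a trailing comma
                expanded.append(name + "#" + domain)
    return expanded


sample = (
    "Some gibberish words\n"
    "{kevin.knerr, sam.mcgettrick, mike.grahs}#google.com.au\n"
    "Some Gibberish words"
)
for address in expand_addresses(sample):
    print(address)
```

Because it uses finditer, the same function handles a file that contains several brace groups with different domains.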

Python re.findall prints list instead of string

address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
regex = re.findall(r"([a-f\d]{12})", html)
If you run the script, the output will be something similar to this:
['aaaaaaaaaaaa', 'bbbbbbbbbbbb', 'cccccccccccc']
How do I make the script print this output instead (note the line breaks):
aaaaaaaaaaaa
bbbbbbbbbbbb
cccccccccccc
any help?
Just print regex like this:
print "\n".join(regex)
address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
regex = re.findall(r"([a-f\d]{12})", html)
print "\n".join(regex)
re.findall() returns a list. So you can either iterate over the list and print out each element separately like so:
address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
for match in re.findall(r"([a-f\d]{12})", html):
    print(match)
Or you can do as #bigOTHER suggests and join the list together into one long string and print the string. It's essentially doing the same thing.
Source: https://docs.python.org/2/library/re.html#re.findall
re.findall(pattern, string, flags=0) Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
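The snippets in this thread are Python 2 (urllib2, print statement). As a rough Python 3 sketch of the same idea, with a made-up html string standing in for the downloaded page so that it runs without network access:

```python
import re

# Made-up page content; the real script would download this
html = "<li>aaaaaaaaaaaa</li><li>bbbbbbbbbbbb</li><li>cccccccccccc</li>"

# With no capturing group (or exactly one), findall returns a list of strings
matches = re.findall(r"[a-f\d]{12}", html)

# Option 1: join the list into a single newline-separated string
print("\n".join(matches))

# Option 2: unpack the list and let print() insert the separators
print(*matches, sep="\n")
```

Both options print one match per line; join is the one to use if you need the resulting string for something other than printing.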
Use join on the result:
"".join("{0}\n".format(x) for x in re.findall(r"([a-f\d]{12})", html))

Capture a match with Regex on Python and Assign the captured string value to variable

Just started learning python/regex.
I have error log file in which I want to capture the strings that match specific patterns, and create a list from it. There is one error per line. I have the datetime portion down. I need to extract 'company' and 'errorline', assign them to variables, append to my nested list.
The error lines look something like this:
2013-02-02 12:20:15 blahblahblah=123214, moreblah=1021, blah.blah.blah, company=201944, errorline=#2043
f = open("/path/error.log","r")
errorlist = [["datetime","company","errorline"]] #I want to append to nested list
for line in f:
    datetime = line[:19]
    company = re.search(r"=[0-9]{6},",line)
    company = company.group[1:-1] #to remove the '=' and ','
    errorline = re.search(r"#[0-9]{1,}",line)
    errorline = errorline.group()[1:]
    errorlist.append([datetime,company,errorline])
I know that this code does not work because I can't assign the .group() to a variable.
Please help!
it should be:
company = re.search(r'=([0-9]{6}),',line).group(1)
errorline = re.search(r'#([0-9]{1,})',line).group(1)
note the parentheses, and call to .group. Also, you may do it all together:
company, errorline = re.search(r'=([0-9]{6}),.*?#([0-9]{1,})',line).groups()
re.search returns a match object, or None if nothing matched.
Classically, your code for the match should be:
match = re.search(r'(\d+)', 'abc 123 def')
if match:
    digits = match.group(1)
else:
    # react to no match
    pass
You can also combine both matches in your example into a single pattern:
>>> s='2013-02-02 12:20:15 blahblahblah=123214, moreblah=1021, blah.blah.blah, company=201944, errorline=#2043'
>>> match=re.search(r'^.*company=(\d+)\D+(\d+)', s)
>>> match.group(1)
'201944'
>>> match.group(2)
'2043'
so then the match part of your code becomes something like:
match = re.search(r'^.*company=(\d+)\D+(\d+)', line)
if match:
    company = match.group(1)
    errorline = match.group(2)
    # do whatever with company and errorline
else:
    # react to an unexpected line format...
    pass
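To tie this back to the nested list in the question, here is a minimal sketch that builds the whole list in one pass. The parse_error_lines helper and the named groups are my own choices for illustration, and it assumes every useful line contains "company=<digits>, errorline=#<digits>" as in the sample line.

```python
import re

# Named groups make it obvious which captured value is which
LINE_PATTERN = re.compile(r"company=(?P<company>\d+),\s*errorline=#(?P<errorline>\d+)")


def parse_error_lines(lines):
    """Return a header row followed by [datetime, company, errorline] rows."""
    rows = [["datetime", "company", "errorline"]]
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match is None:
            continue  # skip lines that don't have the expected fields
        # The timestamp is the first 19 characters, as in the question
        rows.append([line[:19], match.group("company"), match.group("errorline")])
    return rows


sample_lines = [
    "2013-02-02 12:20:15 blahblahblah=123214, moreblah=1021, blah.blah.blah, company=201944, errorline=#2043",
    "a line in some other format",
]
errorlist = parse_error_lines(sample_lines)
print(errorlist)
```

An open file object can be passed in place of sample_lines, since iterating over a file yields one line at a time.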

Extract email sub-strings from large document

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
...<name#domain.com>...
What is the best way to have Python to cycle through the entire .txt file looking for a all instances of a certain #domain string, and then grab the entirety of the address within the <...>'s, and add it to a list? The trouble I have is with the variable length of different addresses.
This code extracts the email addresses in a string. Use it while reading line by line
>>> import re
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol"
>>> match = re.search(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk#bob.com.lol'
If you have several email addresses use findall:
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol or popop#coco.com"
>>> match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match
['jdsk#bob.com.lol', 'popop#coco.com']
The regex above covers the most common forms of real email addresses. If you need full RFC 5322 compliance, check which addresses the specification actually permits, so that valid addresses are not missed or truncated.
Edit: as suggested in a comment by #kostek:
In the string "Contact us at support#example.com." my regex returns support#example.com. (with the dot at the end). To avoid this, use [\w\.,]+#[\w\.,]+\.\w+
Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+#[\w\.-]+\.\w+ which will capture example#do-main.com as well.
Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad#ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
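To make the trailing-dot issue from the first edit concrete, here is a quick comparison of the original pattern and the one from Edit II on the sentence quoted in the comment. The variable names are just for this sketch, and the # separator is kept as it appears throughout this page.

```python
import re

text = "Contact us at support#example.com."

# Original pattern: the last character class allows '.', so it swallows
# the full stop that ends the sentence
loose = re.findall(r"[\w.+-]+#[\w-]+\.[\w.-]+", text)

# Edit II pattern: the match must end in '.' followed by word characters
strict = re.findall(r"[\w.-]+#[\w.-]+\.\w+", text)

print(loose)   # ['support#example.com.']
print(strict)  # ['support#example.com']
```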
Update 2023
Seems stackabuse has compiled a post based on the popular SO answer mentioned above.
import re
regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")#([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")
def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")
isValid("name.surname#gmail.com")
isValid("anonymous123#yahoo.co.uk")
isValid("anonymous123#...uk")
isValid("...#domain.us")
You can also use the following to find all the email addresses in a text and print them in an array or each email on a separate line.
import re
line = "why people don't know what regex are? let me know asdfal2#als.com, Users1#gmail.de " \
"Dariush#dasd-asasdsa.com.lo,Dariush.lastName#someDomain.com"
match = re.findall(r'[\w\.-]+#[\w\.-]+', line)
for i in match:
    print(i)
match is already a list, so if you want the whole list rather than one address per line, just print it:
# this will print the list
print(match)
import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)
Forgive me lord for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.
Feel free to play around with it here
I also made a variation where the regex captures emails like name at example.com
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
If you're looking for a specific domain:
>>> import re
>>> text = "this is an email la#test.com, it will be matched, x#y.com will not, and test#test.com will"
>>> match = re.findall(r'[\w.+%-]+#test\.com',text) # replace test\.com with the domain you're looking for, adding a backslash before periods
>>> match
['la#test.com', 'test#test.com']
import re
reg_pat = r'\S+#\S+\.\S+'
test_text = 'xyz.byc#cfg-jj.com ir_er#cu.co.kl uiufubvcbuw bvkw ko#com m#urice'
emails = re.findall(reg_pat ,test_text,re.IGNORECASE)
print(emails)
Output:
['xyz.byc#cfg-jj.com', 'ir_er#cu.co.kl']
import re
mess = '''Jawadahmed#gmail.com Ahmed#gmail.com
abc#gmail'''
email = re.compile(r'([\w\.-]+#gmail\.com)')
result = email.findall(mess)
if result:
    print(result)
The code above returns only the Gmail addresses.
You can add \b at the end of the pattern to mark where the address ends.
The regex
[\w\.\-]+#[\w\-\.]+\b
Example: if the mail id contains only lowercase letters a-z, digits 0-9, and at most one dot or underscore, the regex is:
>>> str1 = "abcdef_12345#gmail.com"
>>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345#gmail.com'
content = ' abcdabcd jcopelan#nyx.cs.du.edu afgh 65882#mimsy.umd.edu qwertyuiop mangoe#cs.umd'
match_objects = re.findall(r'\w+#\w+[\.\w+]+', content)
# \b[\w|\.]+ ---> means begins with any english and number character or dot.
import re
marks = '''
!()[]{};?#$%:'"\,/^&é*
'''
text = 'Hello from priyankv#gmail.com to python#gmail.com, datascience##gmail.com and machinelearning##yahoo..com wrong email address: farzad#google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|\.]*#[\w|\.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
    for x in marks:
        p = p.replace(x, "")
    if len(re.findall(pattern, p)) > 0:
        print(re.findall(pattern, p))
One other way is to divide it into 3 different groups and capture the group(0). See below:
emails = []
for line in email: # email is the text file where some emails exist.
    e = re.search(r'([.\w\d-]+)(#)([.\w\d-]+)', line) # 3 different groups are composed.
    if e:
        emails.append(e.group(0))
print(emails)
Here's another approach for this specific problem, with a regex from emailregex.com:
import re
text = "blabla <hello#world.com>><123#123.at> <huhu#fake> bla bla <myname#some-domain.pt>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text) # ['<hello#world.com>', '<123#123.at>', '<huhu#fake>', '<myname#some-domain.pt>']
# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print(emails) # ['hello#world.com', '123#123.at', 'myname#some-domain.pt']
import re
txt = 'hello from absc#gmail.com to par1#yahoo.com about the meeting #2PM'
email = re.findall(r'\S+#\S+', txt)
print(email)
Printed output:
['absc#gmail.com', 'par1#yahoo.com']
import re
with open("file_name",'r') as f:
    s = f.read()
result = re.findall(r'\S+#\S+',s)
for r in result:
    print(r)
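None of the answers above combines the three things the question asks for: a very large file, addresses wrapped in <...>, and a specific domain. A minimal sketch of that combination follows. The addresses_for_domain name and the sample lines are assumptions for illustration, and the # separator is kept as it appears on this page.

```python
import re


def addresses_for_domain(lines, domain):
    """Yield each address found inside <...> whose domain equals `domain`."""
    # re.escape keeps the dots in the domain from matching arbitrary characters
    pattern = re.compile(r"<([^<>\s#]+#" + re.escape(domain) + r")>")
    for line in lines:
        for address in pattern.findall(line):
            yield address


sample_lines = [
    "junk <alice#domain.com> more junk <bob#other.org>",
    "<a.very.long.name#domain.com> trailing text",
]
found = list(addresses_for_domain(sample_lines, "domain.com"))
print(found)
```

Passing an open file object instead of sample_lines reads the file one line at a time, so the whole file never has to sit in memory. The variable length of the addresses is not a problem, since the local part is matched with + up to the # separator.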

python regex search addition to parse a tag in a text file

I got some help with this earlier today but I cannot figure out the last part of the problem I am having. This regex search returns all of the matches in the open file from the input. What I need to do is also find which part of the file that the match comes from.
Each section is opened and closed with a tag. For example one of the tags opens with <opera> and ends with </opera>. What I want to be able to do is when I find a match I want to either go backwards to the open tag or forwards to the close tag and include the contents of the tag, in this case "opera" in the output. My question is can I do this with an addition to the regular expression or is there a better way? Here is the code I have that works great already:
text = open_file.read()
#the test string for this code is "NNP^CC^NNP"
grammarList = raw_input("Enter your grammar string: ");
tags = grammarList.split("^")
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"
# gives you r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"
from re import findall
print(findall(tags_pattern, text))
One way to do it would be to find all occurrences of your start and end section tags (say they're <opera> and </opera>), get the indices, and compare them to each match of tags_pattern. This uses finditer which is like findall but returns indices too. Something like:
import re
# finditer returns iterators, so turn them into lists before indexing
startTags = list(re.finditer("<opera>", text))
endTags = list(re.finditer("</opera>", text))
matches = re.finditer(tags_pattern, text)
# m.start() gives the starting index of each match in `text`.
# If <opera> starts at indices 0, 1000, 2345
# and you get a match starting at index 1100,
# then it's in the 1000-2345 block.
for m in matches:
    # sections that open before this match; the last one encloses it
    sec = [i for i in range(len(startTags)) if startTags[i].start() < m.start()]
    if len(sec) == 0:
        print("err couldn't find it")
    else:
        sec = sec[-1]
        print("found in\n" + text[startTags[sec].start():endTags[sec].end()])
(Note: m.group() returns the matched text. With no argument it defaults to group 0, i.e. the entire match, and you can use m.group(i) for the i-th capturing group.)
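As a rough Python 3 sketch of the same index-comparison idea, the lookup can be done with bisect rather than scanning every start tag for each match. The section_for_each_match helper and the sample text are made up for this example, and it assumes the sections are not nested and every opening tag has a closing tag.

```python
import bisect
import re


def section_for_each_match(text, pattern, tag):
    """Pair each match of `pattern` with the <tag>...</tag> block containing it."""
    opens = [m.start() for m in re.finditer("<%s>" % re.escape(tag), text)]
    closes = [m.end() for m in re.finditer("</%s>" % re.escape(tag), text)]
    results = []
    for m in re.finditer(pattern, text):
        # Index of the last opening tag that starts at or before the match
        i = bisect.bisect_right(opens, m.start()) - 1
        if i < 0 or i >= len(closes) or m.end() > closes[i]:
            results.append((m.group(), None))  # match lies outside any section
        else:
            results.append((m.group(), text[opens[i]:closes[i]]))
    return results


sample = "<opera>Figaro/NNP and/CC Susanna/NNP</opera> gap <opera>Tosca/NNP sings/VBZ</opera>"
pattern = r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"
results = section_for_each_match(sample, pattern, "opera")
for matched, section in results:
    print(matched, "->", section)
```

finditer yields matches in order of position, so opens is already sorted, which is what bisect needs.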
from BeautifulSoup import BeautifulSoup
tags = """stuff outside<opera>asdfljlaksdjf lkasjdfl kajsdlf kajsdf stuff
<asdf>asdf</asdf></opera>stuff outside"""
soup = BeautifulSoup(tags)
soup.opera.text
Out[22]: u'asdfljlaksdjf lkasjdfl kajsdlf kajsdf stuffasdf'
str(soup.opera)
Out[23]: '<opera>asdfljlaksdjf lkasjdfl kajsdlf kajsdf stuff
<asdf>asdf</asdf></opera>'
