I am quite new to Python and regex, and I was wondering how to extract the first part of an email address, up to the domain name. For example, if:
s='xjhgjg876896#domain.com'
I would like the regex result to be (taking into account all "sorts" of email ids i.e including numbers etc..):
xjhgjg876896
I get the idea of regex - as in I know I need to scan till "#" and then store the result - but I am unsure how to implement this in python.
Thanks for your time.
You should just use the split method of strings:
s.split("#")[0]
As others have pointed out, the better solution is to use split.
If you're really keen on using regex then this should work:
import re
regexStr = r'^([^#]+)#[^#]+$'
emailStr = 'foo#bar.baz'
matchobj = re.search(regexStr, emailStr)
if matchobj is not None:
    print(matchobj.group(1))
else:
    print("Did not match")
and it prints out
foo
NOTE: This is going to work only with email strings of SOMEONE#SOMETHING.TLD. If you want to match emails of type NAME<SOMEONE#SOMETHING.TLD>, you need to adjust the regex.
You shouldn't use a regex or split.
local, at, domain = 'john.smith#example.org'.rpartition('#')
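A minimal sketch of why rpartition can be safer than split here: it cuts at the last separator, so a local part that itself contains the separator survives intact. The helper name local_part and its sep parameter are made up for illustration, and # is used as the separator only because the examples in this thread are written that way. This is not a full address parser.

```python
def local_part(address, sep="#"):
    """Return everything before the LAST occurrence of sep."""
    local, found, _domain = address.rpartition(sep)
    if not found:
        # rpartition returns ('', '', address) when sep is missing
        raise ValueError("no separator in %r" % address)
    return local


print(local_part('john.smith#example.org'))   # john.smith
print(local_part('"#####"#example.com'))      # "#####"
print('"#####"#example.com'.split('#')[0])    # " (split cuts at the first separator)
```

The last line shows the difference: split('#')[0] stops at the first separator and returns only the opening quote.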
You have to use a proper RFC 5322 parser.
"#####"#example.com is a valid email address, and semantically localpart("#####") is different from its username(#####)
As of Python 3.6, you can use email.headerregistry:
from email.headerregistry import Address
s='xjhgjg876896#domain.com'
Address(addr_spec=s).username # => 'xjhgjg876896'
#!/usr/bin/python3.6
def email_splitter(email):
    username = email.split('#')[0]
    domain = email.split('#')[1]
    domain_name = domain.split('.')[0]
    domain_type = domain.split('.')[1]
    print('Username : ', username)
    print('Domain : ', domain_name)
    print('Type : ', domain_type)
email_splitter('foo.goo#bar.com')
Output :
Username : foo.goo
Domain : bar
Type : com
Here is another way, using the index method.
s='xjhgjg876896#domain.com'
# Now let's find the location of the "#" sign
index = s.index("#")
# Next let's get the string from the beginning up to the location of the "#" sign.
s_id = s[:index]
print(s_id)
And the output is
xjhgjg876896
You need to install the package first:
pip install email_split
from email_split import email_split
email = email_split("ssss#ggh.com")
print(email.domain)
print(email.local)
Below should help you do it :
fromAddr = message.get('From').split('#')[1].rstrip('>')
fromAddr = fromAddr.split(' ')[0]
Good answers have already been given, but I want to add mine anyway.
If I have an email john#gmail.com, I want to get just "john".
If I have an email john.joe#gmail.com, I also want to get just "john".
So this is what I did:
name = recipient.split("#")[0]
name = name.split(".")[0]
print(name)
cheers
You can also try to use email_split.
from email_split import email_split
email = email_split('xjhgjg876896#domain.com')
email.local # xjhgjg876896
email.domain # domain.com
You can find more here https://pypi.org/project/email_split/ . Good luck :)
The following will return the continuous text before #
re.findall(r'(\S+)#', s)
You can find all the words in the email and then return the first word.
import re
def returnUserName(email):
    # raw string so that \w is passed to the regex engine unchanged
    return re.findall(r"\w*", email)[0]
print(returnUserName("johns123.ss#google.com")) #Output is - johns123
print(returnUserName('xjhgjg876896#domain.com')) #Output is - xjhgjg876896
Related
I am using
import re
def transform_record(record):
    new_record = re.sub(r'(,[^a-zA-z])', r'\1+1-',record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Expected output: "Sabrina Green,+1-802-867-5309,System Administrator"
But I am getting this output:
Sabrina Green,8+1-02-867-5309,S+-ystem Administrator
Below one is working.
re.sub(r",([\d-]+)",r",+1-\1" ,record)
import re
def transform_record(record):
    new_record = re.sub(r',(?=\d)', r',+1-',record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator
print(transform_record("Eli Jones,684-3481127,IT specialist"))
# Eli Jones,+1-684-3481127,IT specialist
print(transform_record("Melody Daniels,846-687-7436,Programmer"))
# Melody Daniels,+1-846-687-7436,Programmer
print(transform_record("Charlie Rivera,698-746-3357,Web Developer"))
# Charlie Rivera,+1-698-746-3357,Web Developer
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
def transform_record(record, number_field=1):
    fields = record.split(",") # See note.
    if not fields[number_field].startswith("+1-"):
        fields[number_field] = "+1-" + fields[number_field]
    return ",".join(fields)
A note on the above implementation: you are probably working with CSV data. If so, you should use a proper CSV parser instead of just splitting on commas. Splitting on commas goes wrong as soon as a field contains a quoted or escaped comma.
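The note above can be sketched with the standard csv module. This is only an illustration under assumptions: the name transform_record_csv is made up here, and it assumes each record is a single CSV line in the default dialect (comma delimiter, double-quote quoting).

```python
import csv
import io


def transform_record_csv(record, number_field=1):
    """Prefix the phone field with +1-, parsing the line as CSV."""
    # csv.reader accepts any iterable of lines; we feed it exactly one.
    fields = next(csv.reader([record]))
    if not fields[number_field].startswith("+1-"):
        fields[number_field] = "+1-" + fields[number_field]
    # Write the fields back out so that quoting is restored where needed.
    out = io.StringIO()
    csv.writer(out).writerow(fields)
    return out.getvalue().rstrip("\r\n")


print(transform_record_csv("Sabrina Green,802-867-5309,System Administrator"))
print(transform_record_csv('"Green, Sabrina",802-867-5309,System Administrator'))
```

In the second call the name field contains a comma inside quotes. A plain split(",") would treat it as two fields, while the CSV parser keeps it as one.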
If your data is not well ordered, and you want to add +1- after any comma that is followed by a digit, you may use
re.sub(r',(?=\d)', r',+1-', record)
The ,(?=\d) pattern matches a comma first, and then (?=\d) positive lookahead makes sure there is a digit right after, without consuming the digit (and it remains in the replacement result).
First, match the pattern r",(?=[0-9])" in the record text. It matches a comma only when a digit follows it, so re.sub inserts +1- right after that comma and leaves the phone number itself untouched.
For example, 345-345-34567 is converted to +1-345-345-34567.
import re
def transform_record(record):
    new_record = re.sub(r",(?=[0-9])",",+1-",record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator
print(transform_record("Eli Jones,684-3481127,IT specialist"))
# Eli Jones,+1-684-3481127,IT specialist
print(transform_record("Melody Daniels,846-687-7436,Programmer"))
# Melody Daniels,+1-846-687-7436,Programmer
print(transform_record("Charlie Rivera,698-746-3357,Web Developer"))
# Charlie Rivera,+1-698-746-3357,Web Developer
import re
def transform_record(record):
    new_record = re.sub(r"([\d-]+)",r"+1-\1",record)
    return new_record
new_record = re.sub(r",([\d])",r",+1-\1",record)
This works for me.
This pattern searches for a comma followed by one or more digits, so \d goes in a character class followed by the "+" sign. In the re.sub replacement, put "+1-" in front of the captured digits (\1):
new_record = re.sub(r',([\d]+)',r',+1-\1', record)
In the 'details' column, every entry has 'Mobile' and 'Email' text inside it. I want to separate out the mobile number and email ID of each entry into individual columns using Python code.
Please help.
Thanks in advance!
You could try something like this -
import pandas as pd
data = pd.read_csv('AIOS_data.csv')
data['Mobile'] = data['Mobile'].str.extract(r'(Mobile[\d|\D]+Email)')
data['Mobile'] = data['Mobile'].str.replace('[Mobile:|Email:]', '').str.strip()
data['Email'] = data['Email'].str.extract(r'(Email:[\d|\D]+)')
data['Email'] = data['Email'].str.replace('Email:','').str.strip()
Use Series.str.extract with a regex to pull out the values that follow Mobile: and Email:. Here \s* matches zero or more whitespace characters, and each (.*) group captures the value in between:
df[['Mobile','Email']] = df['Details'].str.extract(r'Mobile:\s*(.*)\s+Email:\s*(.*)')
If you also want the address:
cols = ['Address','Mobile','Email']
df[cols] = df['Details'].str.extract(r'Address:\s*(.*)\s*Mobile:\s*(.*)\s+Email:\s*(.*)')
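To see what the Mobile/Email pattern captures without loading a DataFrame, here is a small sketch that runs the same regex through the re module on a single string. The sample text is the one quoted elsewhere in this thread, and the name DETAIL_RE is made up for illustration.

```python
import re

# Same pattern as the str.extract call, compiled once for reuse.
DETAIL_RE = re.compile(r'Mobile:\s*(.*)\s+Email:\s*(.*)')

detail = ("Address: 108/81-B, METTU STREET, SE...KKAL TAMIL NADU 637409 "
          "Mobile: 9789617285 Email: Leens1794#gmail.com")

match = DETAIL_RE.search(detail)
if match:
    mobile, email = match.groups()
    print(mobile)  # 9789617285
    print(email)   # Leens1794#gmail.com
else:
    print("pattern not found")
```

The first (.*) is greedy, but the \s+Email: that follows forces it to back off until it holds only the mobile number.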
Without providing the full code, I guess you have to take three steps:
Read the CSV file into memory. Python has a handy module for that called csv.
Once you have done this, you can iterate over each row and search the detail text for the mobile number and email address. If detail is always written in the same way, you can just use the str.find() method for that.
E.g.
detail = "Address: 108/81-B, METTU STREET, SE...KKAL TAMIL NADU 637409 Mobile: 9789617285 Email: Leens1794#gmail.com"
mobile_start = detail[detail.find("Mobile:")+8:] # => '9789617285 Email: Leens1794#gmail.com'
mobile = mobile_start[:mobile_start.find(' ')] # => '9789617285'
(You do the same for email)
You store the results (mobile and email) in new columns and export them to CSV, again using the csv module.
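The three steps can be sketched end to end as follows. Everything here is an assumption made for illustration: the helper name field_after, the column names details, mobile and email, and the one-row sample (with a made-up address), which is read from an in-memory string instead of a real file so the sketch is self-contained.

```python
import csv
import io


def field_after(detail, label):
    """Return the first space-delimited token that follows label, or ''."""
    start = detail.find(label)
    if start == -1:
        return ""
    rest = detail[start + len(label):].strip()
    end = rest.find(" ")
    return rest if end == -1 else rest[:end]


# Step 1: read the CSV (an in-memory stand-in for the real file).
source = io.StringIO(
    'details\n'
    '"Address: 1 Example Street Mobile: 9789617285 Email: Leens1794#gmail.com"\n'
)
reader = csv.DictReader(source)

# Step 3 target: a writer with two extra columns.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["details", "mobile", "email"])
writer.writeheader()

# Step 2: search each row's detail text and store the results.
for row in reader:
    row["mobile"] = field_after(row["details"], "Mobile:")
    row["email"] = field_after(row["details"], "Email:")
    writer.writerow(row)

print(out.getvalue())
```

With a real file, the two StringIO objects would be replaced by open(...) calls using newline='' as the csv documentation recommends.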
I'm very new to python and have been getting help from peers with developing this program. I essentially have a very unrefined, dynamic scraper, that pulls emails from a given URL.
I've been considering how I would go about matching up a first/last name to the email address, and came up with the idea of matching any 4+ consecutive characters before the '#' in an email to another element on the web page, under the assumption that most businesses use at least some portion of the first/last name when creating the email address. I decided on 4 characters to avoid the mix-ups that might occur at 3 characters, which I don't feel is specific enough.
I hope this isn't too general of a question, I'm just unsure where to start.
Most of what I have found while pondering this question has been based on splitting the email and using regex to match, but I'm unsure if this will work on the page itself/how to implement.
import urllib.request,re
f = urllib.request.urlopen("http://www.sampleurl.com")
s = f.read().decode('utf-8')
print(re.findall(r"\+\d{2}\s?0?\d{10}",s))
print(re.findall(r"[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",s))
This is a very basic version of a much larger program, but it is most of the inner workings. It returns email and phone number properly based on the given URL.
from email_split import email_split
import urllib.request,re
#Find Emails
f = urllib.request.urlopen("https://www.sampleurl.com/")
s = f.read().decode('utf-8')
e = (re.findall(r"[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",s))
emails = []
for x in e:
    emails.append(str(x))
    # Split email
    email = email_split(x)
    # named `local` rather than `str`, which would shadow the built-in used above
    local = email.local
    match = re.search(r'([\w.-]+)', local)
    if match:
        print(match.group())
It is sort of a general question, but I thought of this right away. I'm sort of thinking this would be a 3 step process:
1) Extract the names on the website. I haven't used it, but it sounds like you could use spaCy to pull out named entities. Then store those in a list of names.
2) Extract all the email addresses on the site and store those in a list of email addresses.
3) Then use fuzzywuzzy to iterate through the names and find matches to an email address.
Without having a specific website to try this on, it's just in theory, and I created a sample list to sort of show what I'm thinking:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
names = ['John A. Smith', 'Michael Jordan', 'James Thomas', 'Bradley Cooper']
emails = ["jasmith#foo.com", "john.smith#bar.com", "jthomas#foobar.com", "bradc#company.com"]
email_users = [ x.split('#')[0] for x in emails ]
email_dict = dict(zip(email_users, emails))
for name in names:
    matches = process.extract(name, email_users, limit=3)
    print ('%s: ' %name)
    for each in matches:
        if each[1] >= 70:
            print ('\t%-20s - Score: %s' %(email_dict[each[0]], each[1]))
        else:
            continue
Output:
John A. Smith:
    john.smith#bar.com   - Score: 95
    jasmith#foo.com      - Score: 70
Michael Jordan:
James Thomas:
    jthomas#foobar.com   - Score: 77
Bradley Cooper:
    bradc#company.com    - Score: 72
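The "4+ consecutive characters" idea from the question can also be sketched with only the standard library, using difflib to find the longest run of characters shared by the part before the separator and a candidate name. The function names longest_shared_run and match_names, the min_run parameter, and the use of # as the separator (as in this thread's examples) are assumptions for illustration. It swaps in exact substring matching in place of fuzzywuzzy's scoring.

```python
from difflib import SequenceMatcher


def longest_shared_run(local_part, text):
    """Longest run of consecutive characters common to both strings."""
    a, b = local_part.lower(), text.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def match_names(email, names, min_run=4, sep="#"):
    """Return (name, shared_run) pairs sharing at least min_run characters."""
    local = email.split(sep)[0]
    hits = []
    for name in names:
        run = longest_shared_run(local, name)
        if len(run) >= min_run:
            hits.append((name, run))
    return hits


names = ['John A. Smith', 'Michael Jordan', 'James Thomas', 'Bradley Cooper']
for email in ["jasmith#foo.com", "jthomas#foobar.com", "bradc#company.com"]:
    print(email, match_names(email, names))
```

A fixed threshold like this will produce false positives on short common fragments, so it is probably best treated as a first filter before a fuzzier comparison.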
I am using following python regex code to analyze values from the To field of an email:
import re
PATTERN = re.compile(r'''((?:[^(;|,)"']|"[^"]*"|'[^']*')+)''')
list = PATTERN.split(raw)[1::2]
The list should contain the name and address of each recipient, split on either "," or ";" as the separator. If these characters appear within quotes, they are to be ignored, since they are part of the name, often: "Last Name, First Name"
Most of the times this works well, however in the following case I am getting unexpected behaviour:
"Some Name | Company Name" <name#example.com>
In this case it is splitting on the "|" character. Even though when I check the pattern on regex tester websites, it selects the name and address as a whole. What am I doing wrong?
Example input would be:
"Some Name | Company Name" <name1#example.com>, "Some Other Name | Company Name" <name2#example.com>, "Last Name, First Name" <name3#example.com>
This is not a direct answer to your question but to the problem you seem to be solving and therefore maybe still helpful:
To parse emails I always make extensive use of Python's email library.
In your case you could use something like this:
from email.utils import getaddresses
from email import message_from_string
msg = message_from_string(str_with_msg_source)
tos = msg.get_all('to', [])
ccs = msg.get_all('cc', [])
resent_tos = msg.get_all('resent-to', [])
resent_ccs = msg.get_all('resent-cc', [])
all_recipients = getaddresses(tos + ccs + resent_tos + resent_ccs)
for (name, address) in all_recipients:
    # do some postprocessing on name or address if necessary
    pass
This always took reliable care of splitting names and addresses in mail headers in my cases.
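As a small illustration of that approach on the example input from the question, getaddresses can also be called directly on a header string. One assumption is labeled here: the sample addresses are written with the usual @ sign, taking the # in this thread's examples to be a stand-in for it.

```python
from email.utils import getaddresses

# Example To: header value (assumption: '#' in the thread stands for '@').
raw = ('"Some Name | Company Name" <name1@example.com>, '
       '"Some Other Name | Company Name" <name2@example.com>, '
       '"Last Name, First Name" <name3@example.com>')

# getaddresses takes a list of header values and returns (name, address) pairs.
recipients = getaddresses([raw])
for name, address in recipients:
    print(name, "->", address)
```

Neither the "|" nor the comma inside the quoted display names splits an entry, since the parser treats quoted strings as a unit.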
You can use a much simpler regex using look arounds to split the text.
r'(?<=>)\s*,\s*(?=")'
Regex Explanation
\s*,\s* matches , which is surrounded by zero or more spaces (\s*)
(?<=>) Look behind assertion. Checks if the , is preceded by a >
(?=") Look ahead assertion. Checks if the , is followed by a "
Test
>>> re.split(r'(?<=>)\s*,\s*(?=")', string)
['"Some Name | Company Name" <name1#example.com>', '"Some Other Name | Company Name" <name2#example.com>', '"Last Name, First Name" <name3#example.com>']
Corrections
Case 1 In the above example, we used a single delimiter ,. If you wish to split on more than one delimiter, you can use a character class
r'(?<=>)\s*[,;]\s*(?=")'
[,;] Character class, matches , or ;
Case 2 As mentioned in comments, if the address part is missing, all we need to do is to add " to the look behind
Example
>>> string = '"Some Other Name | Company Name" <name2#example.com>, "Some Name, Nothing", "Last Name, First Name" <name3#example.com>'
>>> re.split(r'(?<=(?:>|"))\s*[,;]\s*(?=")', string)
['"Some Other Name | Company Name" <name2#example.com>', '"Some Name, Nothing"', '"Last Name, First Name" <name3#example.com>']
I am practicing sending emails with Google App Engine with Python. This code checks to see if message.sender is in the database:
class ReceiveEmail(InboundMailHandler):
    def receive(self, message):
        querySender = User.all()
        querySender.filter("userEmail =", message.sender)
        senderInDatabase = None
        for match in querySender:
            senderInDatabase = match.userEmail
This works in the development server because I send the email as "az#example.com" and message.sender="az#example.com"
But I realized that in the production server emails come formatted as "az <az#example.com>", and my code fails because now message.sender="az <az#example.com>" but the email in the database is simply "az#example.com".
I searched for how to do this with regex and it is possible but I was wondering if I can do this with Python lists? Or, what do you think is the best way to achieve this result? I need to take just the email address from the message.sender.
App Engine documentation acknowledges the formatting but I could not find a specific way to select the email address only.
Thanks!
EDIT2 (re: Forest answer)
@Forest:
parseaddr() appears to be simple enough:
>>> from email.utils import parseaddr
>>> e = "az <az#example.com>"
>>> parsed = parseaddr(e)
>>> parsed
('az', 'az#example.com')
>>> parsed[1]
'az#example.com'
>>>
But this still does not cover the other type of formatting that you mention: user#example.com (Full Name)
>>> e2 = "<az#example.com> az"
>>> parsed2 = parseaddr(e2)
>>> parsed2
('', 'az#example.com')
>>>
Is there really a formatting where full name comes after the email?
EDIT (re: Adam Bernier answer)
My try about how the regex works (probably not correct):
r # raw string
< # first limit character
( # what is inside () is matched
[ # indicates a set of characters
^ # start of string
> # start with this and go backward?
] # end set of characters
+ # repeat the match
) # end group
> # end limit character
Rather than storing the entire contents of a To: or From: header field as an opaque string, why don't you parse incoming email and store email address separately from full name? See email.utils.parseaddr(). This way you don't have to use complicated, slow pattern matching when you want to look up an address. You can always reassemble the fields using formataddr().
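A minimal sketch of that suggestion, assuming the address is normalized once when the message arrives. The helper name split_sender is made up for illustration, and the sample address uses the usual @ sign on the assumption that the # in this thread's examples stands in for it.

```python
from email.utils import formataddr, parseaddr


def split_sender(sender):
    """Split a From/To header value into (display_name, bare_address)."""
    return parseaddr(sender)


name, address = split_sender("az <az@example.com>")
print(name)                         # az
print(address)                      # az@example.com
print(formataddr((name, address)))  # az <az@example.com>

# A bare address has an empty display name.
print(split_sender("az@example.com"))  # ('', 'az@example.com')
```

Storing only the bare address means a lookup can compare against it directly, whichever form the sender header arrives in.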
If you want to use regex try something like this:
>>> import re
>>> email_string = "az <az#example.com>"
>>> re.findall(r'<([^>]+)>', email_string)
['az#example.com']
Note that the above regex handles multiple addresses...
>>> email_string2 = "az <az#example.com>, bz <bz#example.com>"
>>> re.findall(r'<([^>]+)>', email_string2)
['az#example.com', 'bz#example.com']
but this simpler regex doesn't:
>>> re.findall(r'<(.*)>', email_string2)
['az#example.com>, bz <bz#example.com'] # matches too much
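For readers unsure how the two patterns differ, here is the first one written out with re.VERBOSE so each piece carries its own comment. The name ANGLE_ADDR is made up for illustration. Note that ^ as the first character inside [...] negates the set; it does not mean "start of string" there.

```python
import re

ANGLE_ADDR = re.compile(r"""
    <          # a literal opening angle bracket
    (          # start of capture group 1
      [^>]+    # one or more characters that are NOT a closing bracket
    )          # end of capture group 1
    >          # a literal closing angle bracket
""", re.VERBOSE)

email_string2 = "az <az#example.com>, bz <bz#example.com>"

# The negated class cannot run past the first closing bracket.
print(ANGLE_ADDR.findall(email_string2))

# Greedy .* runs to the LAST closing bracket in the string.
print(re.findall(r'<(.*)>', email_string2))
```

The first call prints both addresses separately, while the second returns a single string spanning from the first opening bracket to the last closing one.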
Using slices—which I think you meant to say instead of "lists"—seems more convoluted, e.g.:
>>> email_string[email_string.find('<')+1:-1]
'az#example.com'
and if multiple:
>>> email_strings = email_string2.split(',')
>>> for s in email_strings:
...     s[s.find('<')+1:-1]
...
'az#example.com'
'bz#example.com'