Pattern between two strings in Python

I have a string in the following format:
In-product feedback from Vince (aaa#bbb.com)...In-product feedback from Corey Zimmerman Anderson (ccc#ddd.com)...In-product feedback from Andrea Ibarra (eee#fff.com)
I need to extract the email IDs from the above string. The "In-product feedback from " prefix is static and the email IDs are always inside parentheses, but the name in between varies.

Since the text you have is pretty much static and names will likely not contain parentheses, you can use a non-regex approach:
s = "In-product feedback from Vince (aaa#bbb.com)"
s_clean = s.rsplit('(')[1].strip(')')
print(s_clean)
# 'aaa#bbb.com'
Or use regex anyway:
import re
s = "In-product feedback from Vince (aaa#bbb.com)"
s_clean = re.findall(r'\((.*?)\)', s)[0]
print(s_clean)
# 'aaa#bbb.com'
And with multiple occurrences, you'll get a list of all the emails:
s = "In-product feedback from Vince (aaa#bbb.com)...In-product feedback from Corey Zimmerman Anderson (ccc#ddd.com)...In-product feedback from Andrea Ibarra (eee#fff.com)"
s_clean = re.findall(r'\((.*?)\)', s)
print(s_clean)
# ['aaa#bbb.com', 'ccc#ddd.com', 'eee#fff.com']

Use the following code:
import re
r = re.findall(r"\(([^)]+)\)", s)
print(r)
where s is your string.

Try this
import re
text = 'In-product feedback from Vince (aaa#bbb.com)'
regex = r'(In-product feedback from) ([a-zA-Z ]+) \(([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)\)'
phrase = re.match(regex, text)
print(phrase.group(1))  # In-product feedback from
print(phrase.group(2))  # Vince
print(phrase.group(3))  # aaa#bbb.com
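Since re.match only finds the first occurrence, here is a minimal sketch of how the name and the email could be pulled out of every occurrence with re.finditer. The pattern and the group names (name, email) are illustrative assumptions, not taken from the answers above, and the sketch assumes names never contain parentheses.

```python
import re

s = ("In-product feedback from Vince (aaa#bbb.com)..."
     "In-product feedback from Corey Zimmerman Anderson (ccc#ddd.com)..."
     "In-product feedback from Andrea Ibarra (eee#fff.com)")

# Hypothetical pattern: static prefix, a lazily matched name without
# parentheses, then the parenthesised email.
pattern = re.compile(
    r"In-product feedback from (?P<name>[^()]+?) \((?P<email>[^)]+)\)"
)

# Collect (name, email) pairs for every occurrence in the string.
pairs = [(m.group("name"), m.group("email")) for m in pattern.finditer(s)]
print(pairs)
```

Anchoring on the static prefix means that parentheses elsewhere in the text are not mistaken for an email.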

Related

The phone number field contains U.S. phone numbers, and needs to be modified to the international format, with "+1-" in front of the phone number

I am using
import re
def transform_record(record):
    new_record = re.sub(r'(,[^a-zA-z])', r'\1+1-',record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Expected output: "Sabrina Green,+1-802-867-5309,System Administrator"
But I am getting this output:
Sabrina Green,8+1-02-867-5309,S+-ystem Administrator
The one below works:
re.sub(r",([\d-]+)",r",+1-\1" ,record)
import re
def transform_record(record):
    new_record = re.sub(r',(?=\d)', r',+1-',record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator
print(transform_record("Eli Jones,684-3481127,IT specialist"))
# Eli Jones,+1-684-3481127,IT specialist
print(transform_record("Melody Daniels,846-687-7436,Programmer"))
# Melody Daniels,+1-846-687-7436,Programmer
print(transform_record("Charlie Rivera,698-746-3357,Web Developer"))
# Charlie Rivera,+1-698-746-3357,Web Developer
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
def transform_record(record, number_field=1):
    fields = record.split(",") # See note.
    if not fields[number_field].startswith("+1-"):
        fields[number_field] = "+1-" + fields[number_field]
    return ",".join(fields)
Note the comment in the implementation above. You are probably working with CSV data; if so, use a proper CSV parser instead of splitting on commas, because splitting goes wrong when a field contains a quoted or escaped comma.
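As a rough sketch of what that could look like with the standard csv module: the version below assumes each record is a single CSV line and that the phone number sits in field 1, as in the function above. Re-serialising through csv.writer is my own choice here, not part of the answer.

```python
import csv
import io

def transform_record(record, number_field=1):
    # csv.reader copes with quoted fields that contain commas.
    fields = next(csv.reader([record]))
    if not fields[number_field].startswith("+1-"):
        fields[number_field] = "+1-" + fields[number_field]
    # Write the fields back out as one CSV line, re-quoting where needed.
    out = io.StringIO()
    csv.writer(out).writerow(fields)
    return out.getvalue().rstrip("\r\n")

print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
print(transform_record('"Green, Sabrina",802-867-5309,System Administrator'))
```

The second call shows the case where a plain split(",") would shift the fields, because the name itself contains a comma.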
If your data is not well ordered, and you want to add +1- after any comma that is followed by a digit, you may use
re.sub(r',(?=\d)', r',+1-', record)
The ,(?=\d) pattern matches a comma first, and then (?=\d) positive lookahead makes sure there is a digit right after, without consuming the digit (and it remains in the replacement result).
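To make the "without consuming" point concrete, here is a small comparison; the variable names are illustrative. Matching the digit directly removes it from the result, while the lookahead leaves it in place.

```python
import re

record = "Sabrina Green,802-867-5309,System Administrator"

# ",\d" consumes the first digit, so the replacement overwrites it.
consumed = re.sub(r",\d", ",+1-", record)

# ",(?=\d)" only checks that a digit follows; the digit is kept.
kept = re.sub(r",(?=\d)", ",+1-", record)

print(consumed)  # Sabrina Green,+1-02-867-5309,System Administrator
print(kept)      # Sabrina Green,+1-802-867-5309,System Administrator
```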
First, locate the phone number in the record with the pattern r",(?=[0-9])", which matches a comma that is immediately followed by a digit. Then replace that comma with ",+1-", which puts +1- in front of the phone number.
For example : 345-345-34567 convert to +1-345-345-34567
import re
def transform_record(record):
    new_record = re.sub(r",(?=[0-9])",",+1-",record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator
print(transform_record("Eli Jones,684-3481127,IT specialist"))
# Eli Jones,+1-684-3481127,IT specialist
print(transform_record("Melody Daniels,846-687-7436,Programmer"))
# Melody Daniels,+1-846-687-7436,Programmer
print(transform_record("Charlie Rivera,698-746-3357,Web Developer"))
# Charlie Rivera,+1-698-746-3357,Web Developer
import re
def transform_record(record):
    new_record = re.sub(r"([\d-]+)",r"+1-\1",record)
    return new_record
new_record = re.sub(r",([\d])",r",+1-\1",record)
This works for me.
In this code we want to match one or more digits, so use \d inside a character class followed by "+". In the re.sub replacement, put "+1-" in front of the captured phone number:
new_record = re.sub(r',([\d]+)',r',+1-\1', record)

Unable to remove accented special characters in a string despite using regex

I have the following code
import re
oldstr="HR Director, LearningÂ"
newstr = re.sub(r"[-()\"#/#;:<>{}`+=&~|.!?,^]", " ", oldstr)
print(newstr)
The above code does not work.
Current result
"HR Director, LearningÂ"
Expected result
"HR Director, Learning"
How can I achieve this?
Converting my comment to answer so that solution is easy to find for future visitors.
You may use:
import re
oldstr="HR Director, LearningÂ"
newstr = re.sub(r'[^\x00-\x7f]+|[-()"#/#;:<>{}`+=&~|.!?,^]+', "", oldstr)
print(newstr)
Output:
HR Director Learning
[^\x00-\x7f] will match all non-ASCII characters.
You can use this method too:
s = "HR Director, LearningÂ"
def _removeNonAscii(s):
    return "".join(i for i in s if ord(i)<128)
print(_removeNonAscii(s))
Output:
HR Director, Learning
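Another way to get the same filtering, sketched here as an alternative rather than as something from the answers above, is to round-trip the string through an ASCII encode with errors="ignore". The function name remove_non_ascii is hypothetical.

```python
def remove_non_ascii(s):
    # Encoding to ASCII with errors="ignore" silently drops every
    # character outside the 0-127 range; decoding gives back a str.
    return s.encode("ascii", "ignore").decode("ascii")

cleaned = remove_non_ascii("HR Director, LearningÂ")
print(cleaned)  # HR Director, Learning
```

Like the generator version, this keeps ASCII punctuation such as the comma and removes only the non-ASCII characters.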

How to extract person name using regular expression?

I am new to regular expressions and I have a kind of phone directory. I want to extract the names from it. I wrote the code below, but it extracts a lot of unwanted text rather than just names. Can you tell me what I am doing wrong and how to correct it? Here is my code:
import re
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA#anet.com
Cellular: 612-393-0029
Dustin Andrews'''
nameRegex = re.compile('''
(
[A-Za-z]{2,25}
\s
([A-Za-z]{2,25})+
)
''',re.VERBOSE)
print(nameRegex.findall(directory))
the output it gives is:
[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]
Would be really grateful for help!
Your problem is that \s will also match newlines. Instead of \s just add a space. That is
name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names) then you may want to expand this to something like:
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)
This looks for one or more words and will stretch from the beginning to end of a line (e.g. will not just get 'John Paul' from 'John Paul Jones')
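A minimal sketch of the line-anchored idea, run against a shortened copy of the directory: the pattern below is a hypothetical variant (one word, then one or more further words, each preceded by a literal space or hyphen), and the "John Paul Jones" entry is added only for illustration.

```python
import re

directory = """Mark Adamson
Home: 843-798-6698
E-mail:madamson#sncn.net
Allison Andrews
Cellular: 612-393-0029
John Paul Jones
Dustin Andrews"""

# A literal space or hyphen between words (never \s), anchored to
# whole lines with re.MULTILINE so a match cannot cross a newline.
name_line = re.compile(r"^[A-Za-z]+(?:[ -][A-Za-z]+)+$", re.MULTILINE)

names = name_line.findall(directory)
print(names)
```

Lines such as "Home: ..." or "E-mail:..." are rejected because the colon and digits are not allowed anywhere between ^ and $.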
I suggest trying the following regex; it works for me:
"([A-Z][a-z]+\s[A-Z][a-z]+)"
The following regex works as expected.
Related part of the code:
nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)
print(nameRegex.findall(directory))
Output:
$ python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
Try:
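nameRegex = re.compile(r'^((?:\w+\s*){2,})$', flags=re.MULTILINE)
This only selects complete lines made up of 'word' characters and whitespace. Note that \w also matches digits and underscores, and \s* can match a newline, so two adjacent name lines would be merged into one match.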

Replacing the dots for a list of abbreviations?

I'm trying to remove the dots from a list of abbreviations so that they will not confuse the sentence tokenizer. This should be very straightforward, but I don't know why my code is not working.
Below please find my code:
import re
abbrevs = [
    "No.", "U.S.", "Mses.", "B.S.", "B.A.", "D.C.", "B.Tech.", "Pte.", "Mr.", "O.E.M.",
    "I.R.S", "sq.", "Reg.", "S-K."
]
def replace_abbrev(abbrs, text):
    re_abbrs = [r"\b" + re.escape(a) + r"\b" for a in abbrs]
    abbr_no_dot = [a.replace(".", "") for a in abbrs]
    pattern_zip = zip(re_abbrs, abbr_no_dot)
    for p in pattern_zip:
        text = re.sub(p[0], p[1], text)
    return text
text = "Test No. U.S. Mses. B.S. Test"
text = replace_abbrev(abbrevs, text)
print(text)
Here is the result. Nothing happened. What was wrong? Thanks.
Test No. U.S. Mses. B.S. Test
re_abbrs = [r"\b" + re.escape(a) for a in abbrs]
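You need to use this. There must be no \b after the trailing dot: \b only matches between a word character and a non-word character, and here the "." is followed by a space, so both sides are non-word characters and the pattern never matches. Without the trailing \b you get the correct output: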
Test No US Mses BS Test
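The boundary behaviour can be checked directly, and the per-abbreviation loop can also be folded into one compiled pattern. The sketch below is one possible arrangement under that assumption; the function name strip_abbrev_dots and the shortened abbreviation list are illustrative.

```python
import re

text = "Test No. U.S. Mses. B.S. Test"

# "." followed by a space: both are non-word characters, so the
# trailing \b cannot match and the search finds nothing.
with_trailing_boundary = re.search(r"\bNo\.\b", text)
without_trailing_boundary = re.search(r"\bNo\.", text)

def strip_abbrev_dots(abbrs, text):
    # Longest abbreviations first so a shorter one cannot win the
    # alternation when it is a prefix of a longer one.
    alternation = "|".join(
        re.escape(a) for a in sorted(abbrs, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternation + r")")
    # Replace each matched abbreviation with its dot-free form.
    return pattern.sub(lambda m: m.group(0).replace(".", ""), text)

result = strip_abbrev_dots(["No.", "U.S.", "Mses.", "B.S."], text)
print(with_trailing_boundary, without_trailing_boundary.group(0), result)
```

A single pass over the text avoids re-scanning it once per abbreviation, which matters more as the list grows.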
You could use map and operator.methodcaller; there is no need for re, even though it's a great library. Note that this strips the dots from the abbreviation list itself, not from a text:
from operator import methodcaller
' '.join(map(methodcaller('replace', '.', ''), abbrevs))
#No US Mses BS BA DC BTech Pte Mr OEM IRS sq Reg S-K

Regular expressions in a Python find-and-replace script? Update

I'm new to Python scripting, so please forgive me in advance if the answer to this question seems inherently obvious.
I'm trying to put together a large-scale find-and-replace script using Python. I'm using code similar to the following:
import sys
infile = sys.argv[1]
charenc = sys.argv[2]
outFile = infile + '.output'
findreplace = [
    ('term1', 'term2'),
]
# Read the raw bytes and decode them with the encoding given on the command line.
inF = open(infile, 'rb')
s = inF.read().decode(charenc)
inF.close()
for couple in findreplace:
    outtext = s.replace(couple[0], couple[1])
    s = outtext
# The output is always written as UTF-8.
outF = open(outFile, 'wb')
outF.write(outtext.encode('utf-8'))
outF.close()
How would I go about having the script do a find and replace for regular expressions?
Specifically, I want it to find some information (metadata) specified at the top of a text file. Eg:
Title: This is the title
Author: This is the author
Date: This is the date
and convert it into LaTeX format. Eg:
\title{This is the title}
\author{This is the author}
\date{This is the date}
Maybe I'm tackling this the wrong way. If there's a better way than regular expressions please let me know!
Thanks!
Update: Thanks for posting some example code in your answers! I can get it to work so long as I replace the findreplace action, but I can't get both to work. The problem now is I can't integrate it properly into the code I've got. How would I go about having the script do multiple actions on 'outtext' in the below snippet?
for couple in findreplace:
    outtext=s.replace(couple[0],couple[1])
    s=outtext
>>> import re
>>> s = """Title: This is the title
... Author: This is the author
... Date: This is the date"""
>>> p = re.compile(r'^(\w+):\s*(.+)$', re.M)
>>> print(p.sub(r'\\\1{\2}', s))
\Title{This is the title}
\Author{This is the author}
\Date{This is the date}
To change the case, pass a function as the replacement argument:
def repl_cb(m):
    return "\\%s{%s}" % (m.group(1).lower(), m.group(2))
p = re.compile(r'^(\w+):\s*(.+)$', re.M)
print(p.sub(repl_cb, s))
\title{This is the title}
\author{This is the author}
\date{This is the date}
See re.sub()
The regular expression you want would probably be along the lines of this one:
^([^:]+): (.*)
and the replacement expression would be
\\\1{\2}
>>> import re
>>> m = 'title', 'author', 'date'
>>> s = """Title: This is the title
... Author: This is the author
... Date: This is the date"""
>>> for i in m:
...     s = re.compile(i+': (.*)', re.I).sub(r'\\' + i + r'{\1}', s)
...
>>> print(s)
\title{This is the title}
\author{This is the author}
\date{This is the date}
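For the follow-up about doing several things to the same text, one option is to keep reassigning a single variable: apply the literal pairs first, then the regex substitution. This is only a sketch; the function name convert, the META pattern restricted to the three known keys, and the sample text are assumptions for illustration.

```python
import re

findreplace = [
    ("term1", "term2"),  # literal pairs, as in the question
]

# Hypothetical pattern limited to the three metadata keys.
META = re.compile(r"^(Title|Author|Date):\s*(.+)$", re.MULTILINE)

def convert(text):
    # Step 1: plain string replacements, each applied to the running text.
    for old, new in findreplace:
        text = text.replace(old, new)
    # Step 2: regex substitution on the result of step 1.
    return META.sub(
        lambda m: "\\%s{%s}" % (m.group(1).lower(), m.group(2)), text
    )

sample = ("Title: This is the title\n"
          "Author: This is the author\n"
          "Date: This is the date\n"
          "term1 appears here")
converted = convert(sample)
print(converted)
```

Because every step reads and rewrites the same variable, further passes can be appended without touching the earlier ones.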
