Extract phone numbers from email using python 2.7 regex - python

I'm trying to extract phone numbers from many email files. I wrote a regex to extract them, but it only returns results for one format.
PHONERX = re.compile("(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})")
phonenumber = re.findall(PHONERX,content)
When I reviewed the data, I found that phone numbers appear in many formats.
How can I extract all the phone numbers that use any of these formats:
800-569-0123
1-866-523-4176
(324)442-9843
(212) 332-1200
713/853-5620
713 853-0357
713 837 1749
This link is a sample of the dataset. The problem is that the phone number regex sometimes also matches the message ID and other numbers in the email.
https://www.dropbox.com/sh/pw2yfesim4ejncf/AADwdWpJJTuxaJTPfha38OdRa?dl=0

You may want to use:
\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. \/]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b
This matches all your examples and ignores false positives such as:
113 837 1749
222 2222 22222
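As a rough sketch of how the pattern above could be applied with re.findall, assuming the email body is already available as a string. The names PHONE_RX and find_phones are illustrative and not part of the question:

```python
import re

# Pattern from the answer above. It contains no capturing groups,
# so findall() returns the full matched text for each hit.
PHONE_RX = re.compile(
    r"\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. \/]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b"
)

def find_phones(text):
    """Return every phone-like substring found in text."""
    return PHONE_RX.findall(text)

samples = [
    "800-569-0123", "1-866-523-4176", "(324)442-9843", "(212) 332-1200",
    "713/853-5620", "713 853-0357", "713 837 1749",
]
for s in samples:
    print(find_phones("call " + s + " today"))

print(find_phones("113 837 1749"))    # [] (the area code may not start with 0 or 1)
print(find_phones("222 2222 22222"))  # []
```

The same pattern works in Python 2.7 if the print calls are adapted. It may still pick up digit runs inside headers such as the message ID, so filtering out header lines first is probably worth considering.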

You don't need to list all the possibilities using a logical OR. You can use the following regex:
(?:\(\d+\)\s?\d*|\d+)([-\/ ]\d+){1,3}
To use it with re.findall(), make the repeated group non-capturing:
(?:\(\d+\)\s?\d*|\d+)(?:[-\/ ]\d+){1,3}
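To see why the non-capturing form matters, here is a small comparison of the two patterns under re.findall. The sample text and variable names are made up for illustration:

```python
import re

text = "Call 713/853-5620 or (212) 332-1200"

capturing = r"(?:\(\d+\)\s?\d*|\d+)([-\/ ]\d+){1,3}"
non_capturing = r"(?:\(\d+\)\s?\d*|\d+)(?:[-\/ ]\d+){1,3}"

# With a capturing group, findall() returns only what the group captured
# (its last repetition), not the whole phone number.
with_group = re.findall(capturing, text)
print(with_group)      # ['-5620', '-1200']

# Without capturing groups, findall() returns the full match.
without_group = re.findall(non_capturing, text)
print(without_group)   # ['713/853-5620', '(212) 332-1200']
```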

Related

Python, pandas replace entire column with regex match of string

I'm using pandas to analyze data from 3 different sources, which are imported into dataframes and require modification to account for human error, as this data was all entered by humans and contains errors.
Specifically, I'm working with street names. Until now, I have been using .str.replace() to remove street types (st., street, blvd., ave., etc.), as shown below. This isn't working well enough, and I decided I would like to use regex to match a pattern, and transform that entire column from the original street name, to the pattern matched by regex.
df['street'] = df['street'].str.replace(r' avenue+', '', regex=True)
I've decided I would like to use regex to identify the following (and remove all other characters from the address column's fields): any number of digits, followed by a space, and then the first 3 alphabetic characters.
For example, "3762 pearl street" should become "3762 pea" with the following regex:
(\d+ )+\w{0,3}
How can I use pandas' .str.replace to do this? I don't want to specify WHAT to replace with in the second argument. I want to replace the original string with the part matched by the regex.
Something that, in my mind, might work like this:
df['street'] = df['street'].str.replace(ORIGINAL STRING, r' (\d+ )+\w{0,3}', regex=True)
which might turn "43 milford st." into "43 mil".
Thank you, please let me know if I'm being unclear.
You could use the extract method to overwrite the column with its own content:
pat = r'(\d+\s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat)
Just an observation: the regex you shared, (\d+ )+\w{0,3}, also matches the following inputs and returns some unexpected results:
1131 1313 street
121 avenue
1 1 1 1 1 1 avenue
42
I've changed it up a bit based on what you described, but I'm not sure if that works for all your data points.
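To compare the two patterns without a DataFrame, here is a small sketch using plain re.search. Series.str.extract takes the capture groups from the first match in each element, so re.search on a single string behaves the same way for this purpose. The helper name first_match and the pattern names are made up:

```python
import re

ANSWER_PAT = re.compile(r"(\d+\s[a-zA-Z]{3})")   # pattern from the answer
QUESTION_PAT = re.compile(r"(\d+ )+\w{0,3}")     # pattern from the question

def first_match(pattern, street):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(street)
    return m.group(0) if m else None

for street in ["3762 pearl street", "43 milford st.", "1131 1313 street"]:
    print(street, "->",
          first_match(QUESTION_PAT, street), "|",
          first_match(ANSWER_PAT, street))
```

For "1131 1313 street" the question's pattern keeps both numbers ("1131 1313 str"), while the answer's pattern keeps only the number directly before the letters ("1313 str"). In pandas, rows without a match come back as NaN from str.extract.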

Regex expression to find strings between two strings in Python

I am trying to write a regular expression which returns a string which is between two other strings. For example: I want to get the string along with spaces which resides between the strings "15/08/2017" and "$610,000"
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
should return
"TRANSFER OF LAND"
Here is the expression I have pieced together so far:
re.search(r'15/08/2017(.*?)$610,000', a).group(1)
It doesn't return any matches. I think it is because we also need to consider spaces in the expression. Is there a way to find strings between two strings ignoring the spaces?
Use a regex lookbehind and lookahead.
For example:
import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
print(re.search(r"(?<=15/08/2017).*?(?=\$610,000)", a).group().strip())
Output:
TRANSFER OF LAND
>>> re.search(r'15/08/2017(.*)\$610,000',a).group(1)
' TRANSFER OF LAND '
Since $ is a regex metacharacter (standing for the end of a logical line), you need to escape it to use as a literal '$'.
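If the two delimiter strings are not known in advance, one option is to let re.escape handle metacharacters such as $ for you. A minimal sketch, where the helper name between is an assumption and not from the question:

```python
import re

def between(text, start, end):
    """Return the text between the literal strings start and end, or None."""
    # re.escape() neutralises regex metacharacters such as '$' in the delimiters.
    # The \s* on both sides keeps surrounding whitespace out of the captured group.
    pattern = re.escape(start) + r"\s*(.*?)\s*" + re.escape(end)
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None

a = '172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
print(between(a, '15/08/2017', '$610,000'))  # TRANSFER OF LAND
```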
Might be easier to use find:
a = '172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
b = '15/08/2017'
c = '$610,000'
a[a.find(b) + len(b):a.find(c)].strip()
'TRANSFER OF LAND'

Regex to extract digits before word while ignoring certain lines

Using Python and pdf2text, I'm trying to extract a postcode from each of 4000-odd single-page PDF files I have received to print and mail. Unfortunately, I do not have access to the original files, so I can't adjust them at creation time.
My end goal here is to rename all the PDF files to Postalcode_ExistingFilename.pdf so I can sort them for the postal network. I'll also need to combine PDFs for the same customer into one file, but that's another problem.
In the PDF we have the word "Dear" and the postal code is before that (albeit a few lines up):
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample Suburb
Sample City 1234
Dear Sam
I've managed to get it to work with
(\d+)\s*Dear
until the number of address lines changes, which causes the conversion to text to insert a block of text between the postcode and "Dear":
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample City 1234
PO Box 1234
Sample City
Phone: 01234567
Fax: 01234568
Email: email#email.com
Website: email.com
Dear Sam
I tried approaching this from the top, looking for the first 4-digit number other than 2018; however, any 4-digit street numbers were also being matched, which isn't what I'm after.
Any advice you can give would be awesome.
You can use this regular expression:
\b\d{4}$\b(?<!2018)
\b Open word boundary.
\d{4}$ Match exactly four digits at the end of line.
\b Close word boundary.
(?<!2018) Negative lookbehind to check that the group of four digits is not 2018.
The regular expression assumes, as per the comments, that the postcode occurs at the end of a line. If you are expecting different years, you can adjust the negative lookbehind to cover additional years. For example:
(?<!2018|2017) will exclude 2017 or 2018.
(?<!201[0-9]) will exclude years from 2010 to 2019.
In Python you need to pass the re.MULTILINE flag so that $ matches at the end of every line rather than only at the end of the string:
>>> import re
>>> text = """04 Jul 2018
Mr Sam Sample
1235 Sample Street
Sample City 1234
PO Box 1237
Sample City
Phone: 01234567
Fax: 01234568
Email: email#email.com
Website: email.com
Dear Sam"""
>>> re.findall(r"\b\d{4}$\b(?<!2018)", text, re.MULTILINE)
['1234', '1237']
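To make the effect of the flag concrete, here is a short comparison on a shortened version of the sample letter. The variable names are illustrative only:

```python
import re

text = (
    "04 Jul 2018\n"
    "Mr Sam Sample\n"
    "1235 Sample Street\n"
    "Sample City 1234\n"
    "PO Box 1237\n"
    "Dear Sam"
)

pattern = r"\b\d{4}$\b(?<!2018)"

# Without re.MULTILINE, $ only matches at the very end of the string,
# and the string ends with "Dear Sam", so nothing is found.
without_flag = re.findall(pattern, text)
print(without_flag)   # []

# With re.MULTILINE, $ matches at the end of every line.
with_flag = re.findall(pattern, text, re.MULTILINE)
print(with_flag)      # ['1234', '1237']

# Variant that excludes any year from 2010 to 2019.
any_2010s = re.findall(r"\b\d{4}$\b(?<!201[0-9])", text, re.MULTILINE)
print(any_2010s)      # ['1234', '1237']
```

Note that the PO Box number is still returned, so some extra rule (for example, taking only the first hit) would be needed to isolate the postcode.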
How about matching 4-digit numbers at the end of a line, but only on lines that don't contain the date (that is, skipping lines that begin with a digit)?
import re
re.findall(r'^[^\d].*?\s+(\d{4})\s*$', data, re.MULTILINE)
# ['1234']

Python RegEx for Australian Phone Numbers - False negative - 2 matches in the same substring

I am trying to extract phone numbers from a web page using Python & RegEx
Australian number format
+61 (international code - shown below as 'i')
02, 03, 07 or 08 (state codes - shown below as 's')
1234-5678 (8 digit local number - shown below as 'x')
Common variations of format (in order of commonality):
Format 1: ss xxxx xxxx (e.g. 02 1234 5678)
Format 2: +ii s xxxx xxxx (e.g. +61 2 1234 5678) (note the first 's' digit is removed here)
Format 3: (seen rarely) +ii (s)s xxxx-xxxx (e.g. +61 (0)2 1234 5678)
My RegEx:
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', sample_text))
works well on a simple sample_text:
sample_text =
"610212345678ABC##610312345678ABC##610712345678ABC##610812345678ABC##0212345678ABC##0312345678ABC##0712345678ABC##0812345678ABC##61212345678ABC##61312345678ABC##61712345678ABC##61812345678ABC##0412345678ABC##61412345678ABC##130012345678ABC##180012345678ABC##"
Result:
['0212345678', '0312345678', '0712345678', '0812345678',
'0212345678', '0312345678', '0712345678', '0812345678',
'61212345678', '61312345678', '61712345678', '61812345678',
'0412345678', '61412345678', '1300123456', '1800123456']
The Goal
Using http://www.outware.com.au/contact as an example ...
The 2 actual numbers on the page are:
+61 (0)3 8684 9912 and +61 (0)2 8064 7043 (both numbers appear twice - once in the main section of the page and once in the footer)
The Problem
#take HTML markup from body tags
b = driver.find_element_by_css_selector('body').text
#remove all non-word characters (punctuation and whitespace)
b = re.sub(r'\W+', '', b)
Result:
"PORTFOLIOINNOVATIONSERVICESCAREERSINSIGHTSNEWSABOUTCONTACTCONTACTOUTWAREMelbourneLe......AFRFast100Nov92017EXPLOREOUTWAREPortfolioInnovationWorkingatOutwareAboutSitemapCONNECTMELBOURNELevel3469LaTrobeStMelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"
Now if I apply my regex to this string
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', b))
Result:
[u'0386849912', u'0761028064', u'0386849912', u'0761028064']
I am getting a false positive because I have concatenated a postcode "NSW2007" onto the start of the phone number.
I presume because the regex has parsed the first part of "NSW2007610280647043" matching "0761028064" it doesn't then match "0280647043" which is also part of the same substring
I actually don't mind the false positive (i.e. getting "0761028064") but I do need to solve the false negative (i.e. not getting "0280647043")
I know there's some RegEx gurus here who can help on this. :-)
Please help!!
Don't search/replace any text prior to using the regex. That will make your input unusable. Try this:
(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}
https://regex101.com/r/1Q4HuD/3
It might help to use a negative lookahead to make sure the character following the match is not a digit. For example: (?!\d).
This could cause a problem, though, if the text following a phone number starts with a digit.
The lookahead looks like this when added to your regex:
(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)
(I removed the square brackets as you do not need them when trying to match a single character)
This answer should be a comment, it isn't because of my low reputation!
I've seen you're updating the regex and I think this variation can help you. It should match very uncommon formats!
(\+61 )?(?:0|\(0\))?[2378] (?:[\s-]*\d){8}
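The false negative in the question comes from re.findall returning non-overlapping matches: once "0761028064" is consumed, the digits it shares with "0280647043" are gone. If you do want to keep stripping the text first, one possible workaround (a different technique from the answers above) is to wrap the pattern in a capturing lookahead so that matches may overlap. This is only a sketch; NUMBER is a condensed form of the alternatives in the question, and the names are made up:

```python
import re

# Condensed equivalent of the alternatives listed in the question.
NUMBER = r"0[23478]\d{8}|61[23478]\d{8}|1[38]00\d{6}"

# A lookahead consumes no characters, so the scan can report a match
# at every starting position, including overlapping ones.
OVERLAPPING = re.compile(r"(?=(" + NUMBER + r"))")

squashed = "NSW2007610280647043"

plain = re.findall(NUMBER, squashed)
print(plain)        # ['0761028064']

overlapping = OVERLAPPING.findall(squashed)
print(overlapping)  # ['0761028064', '0280647043']
```

This keeps the false positive as well, which the question says is acceptable, and on other input it is likely to produce more false positives than the plain pattern.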

How to split a multi-line string using regular expressions?

I have been banging my beginner head for most of the day trying various things.
Here is the string
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 Production active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Test active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 Backup active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
What I need is to split it as shown below. I tried regex101.com to test various regexes but did not have much luck. I managed to isolate the delimiters with (\n\d+) and then wanted to use a lookbehind, but I got an error saying that a lookbehind needs a fixed-length pattern.
Here is a link to the regex101 section:
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 VLAN047 active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Rogers-Refresh-MGT active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 ManagementSegtNorthW active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
Update: I updated the regex101 example, but it is not selecting what I want. The Python code works. I wonder what the problem with regex101 is.
That's pretty simple - use lookahead instead of lookbehind:
parsed = re.split(r'\n(?=\d)', data)
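A small runnable version of that split on a shortened sample, assuming the whole listing is held in one string named data. The lookahead is zero-width, so the newline is consumed as the separator while the digit stays at the start of the next block:

```python
import re

data = (
    "1 default active Eth2/45, Eth2/46, Eth2/47\n"
    "Eth3/41, Eth3/42, Eth3/43\n"
    "47 Production active Po1, Po21, Po23\n"
    "Po102, Eth2/1, Eth2/2\n"
    "128 Test active Po1, Eth1/13"
)

# Split on a newline only when the next character is a digit.
parsed = re.split(r"\n(?=\d)", data)
for block in parsed:
    print(repr(block))
```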
In Python there is always more than one way to skin a cat. Multiline regexes are usually very hard to get right. The following is a lot simpler and, more importantly, more readable:
sections = []
section = []
for line in data.split("\n"):
    # a line that starts with a digit begins a new section
    if line[:1].isdigit():
        if section:
            sections.append("\n".join(section))
        section = []
    section.append(line)
sections.append("\n".join(section))  # grab the last one
print(sections)
Performance-wise, I think this would probably be better, because we are not searching for a pattern in the entire string; we only look at the first character of each line.
