How to split by a character in Python yet keep that character? - python

Google Maps results are often displayed thus:
'\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
Another variation:
'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844'
And another:
'Wildwood, MO\nUnited States\n(636) 458-7707'
Notice the variation in the placement of the \n characters.
I'm looking to extract the first X lines as address, and the last line as phone number. A regex such as (.*\n.*)\n(.*) would suffice for the first example, but falls short for the other two. The only thing I can rely on is that the phone number will be in the form (ddd) ddd-dddd.
I think a regex that will allow for each and every possible variation will be hard to come by. Is it possible to use split(), but maintain the character we have split by? So in this example, split by "(", to split out the address and phone number, but retain this character in the phone number? I could concatenate the "(" back into split("(")[1], but is there a neater way?

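Don't use a regex to do the splitting. Strip the string, then split it on '\n'. The last element is the phone number; the remaining elements are the address.
import re
REGEX_PHONE_US = r'\(\d{3}\) \d{3}-\d{4}$'  # assumed pattern for the (ddd) ddd-dddd form
lines = inputString.strip().split('\n')  # strip() drops leading/trailing newlines
phone = lines[-1] if re.match(REGEX_PHONE_US, lines[-1]) else None  # str has no .match() method
address = '\n'.join(lines[:-1]) if phone else inputString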
Python has a lot of great built in tools for manipulating strings in a more... human way... than regex allows.
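As a minimal sketch of the split-on-newline approach applied to the three examples from the question, assuming the phone number always has the (ddd) ddd-dddd form (PHONE_RE and split_address_phone are illustrative names, not part of the original answer):

```python
import re

# Assumed pattern for the "(ddd) ddd-dddd" form described in the question.
PHONE_RE = re.compile(r'\(\d{3}\) \d{3}-\d{4}')


def split_address_phone(text):
    """Return (address, phone); phone is None if the last line is not a phone number."""
    # strip() drops the leading/trailing '\n' seen in the first example.
    lines = text.strip().split('\n')
    if PHONE_RE.fullmatch(lines[-1]):
        return '\n'.join(lines[:-1]), lines[-1]
    return text.strip(), None


samples = [
    '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n',
    'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844',
    'Wildwood, MO\nUnited States\n(636) 458-7707',
]
for sample in samples:
    address, phone = split_address_phone(sample)
    print(repr(address), repr(phone))
```

Because the address is whatever precedes the last line, the number of address lines does not matter.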

If I understand you correctly, you want to "extract the first X lines as address". Assuming that all the addresses you need are in the US this regex code should work for you. In any case, it works on the 3 examples you provided:
import re
x = 'Wildwood, MO\nUnited States\n(636) 458-7707'
print(re.findall(r'.*\n+.*States', x))
The output is:
['Wildwood, MO\nUnited States']
If you want to print it later without the \n you can do it this way:
x = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
y = re.findall(r'.*\n+.*States', x)
y = y[0].rstrip()
When you print y the output:
113 W 5th St
Eureka, MO, United States
And, if you want to extract the phone number separately you can do this:
tel = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
num = re.findall(r'.*\d+\-\d+', tel)
num = num[0].rstrip()
When you print num the output:
(636) 938-9310
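To address the literal question (splitting on "(" while keeping it), one possible sketch uses str.rpartition, which returns the separator as its own element so it can be glued back onto the phone number. The helper name split_on_last_paren is hypothetical, and the sketch assumes the last "(" in the string starts the phone number:

```python
def split_on_last_paren(text):
    """Split at the last '(' and keep it as part of the phone number."""
    address, sep, rest = text.rpartition('(')
    if not sep:
        # No '(' found: treat the whole string as the address.
        return text.strip(), None
    return address.strip(), (sep + rest).strip()


print(split_on_last_paren('Wildwood, MO\nUnited States\n(636) 458-7707'))
# ('Wildwood, MO\nUnited States', '(636) 458-7707')
```

Using rpartition rather than partition means a "(" that appears inside the address does not cut the string too early.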

Related

How to Remove Extra characters from the column Value using python

I am trying to map the column values to a set of reference values: if a field value matches one of them, all the extra characters must be removed from it. I can do the matching, but how can I remove the extra characters from the column?
Input Data
col_data
Indi8
United states / 08
UNITED Kindom (55)
ITALY 22
israel
Expected Output:
col_data
India
United States
United Kindom
Italy
Israel
Script i am using :
import numpy as np
from difflib import SequenceMatcher
match_val=['India','United Kingdom','Israel','United States','Italy']
lower = [x.lower() for x in match_val]
def nearest(s):
    # index of the reference value most similar to s
    idx = np.argmax([SequenceMatcher(None, s.lower(), i).ratio() for i in lower])
    return np.array(match_val)[idx]
df['col_data'] = df['col_data'].apply(nearest)
The script above matches the values against the list, but it is not able to remove the extra characters. How can I modify the script so that it also removes the extra characters after mapping?
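I like this str.extract approach:
df['col_data'] = df['col_data'].str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)', expand=False).str.title()
The regex ([A-Za-z]+(?: [A-Za-z]+)*) matches the first run of all-letter words in each value and omits the trailing content you want to remove. Passing expand=False makes str.extract return a Series, so .str.title() can be chained. Note that this only strips characters; it does not correct a misspelled value such as Indi8.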
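The two ideas can be combined: strip the non-letter noise first, then map the result to the closest reference value. The following is a rough sketch using only the standard library; clean_country and LETTERS_RE are made-up names, and picking the highest SequenceMatcher ratio is an assumption about what "nearest" should mean:

```python
import re
from difflib import SequenceMatcher

match_val = ['India', 'United Kingdom', 'Israel', 'United States', 'Italy']
LETTERS_RE = re.compile(r'[A-Za-z]+(?: [A-Za-z]+)*')


def clean_country(value):
    """Strip trailing non-letter noise, then map to the closest name in match_val."""
    found = LETTERS_RE.search(value)
    letters = found.group(0).lower() if found else value.lower()
    # Pick the candidate with the highest similarity ratio.
    return max(match_val,
               key=lambda name: SequenceMatcher(None, letters, name.lower()).ratio())


for raw in ['Indi8', 'United states / 08', 'UNITED Kindom (55)', 'ITALY 22', 'israel']:
    print(raw, '->', clean_country(raw))
```

With a DataFrame like the one in the question, this could be applied as df['col_data'].apply(clean_country). It returns the canonical spelling from match_val, so "UNITED Kindom (55)" becomes "United Kingdom".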

Regex for matching alphabet, numbers and special charters while looping in python

I am trying to find certain words and print the value that follows each of them, using the code below. Everything works except that I am unable to print the last value (which is a number).
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid ']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(i), textfile)
    print(y)
The text I am working with:
textfile = """1, REBECCA M. ROTH , COLLECTOR OF TAXES of the taxing district of the
township of MORRIS for Six Hundred Sixty Seven dollars andFifty Two cents, the land
in said taxing district described as Block No. 10303 Lot No. 10 :
and known as 239 E HANOVER AVE , on the tax Taxes For: 2012
Sewer
Assessments For Improvements
Total Cost of Sale 35.00
Total
Premium (if any) Paid 1,400.00 """
I would like to know where I am making a mistake.
Any suggestions are appreciated.
A couple of issues:
As others have mentioned, you need to escape special characters such as parentheses ( ) and dots (.). The simplest way is to use re.escape.
Another issue is the trailing space in 'Premium (if any) Paid ': the pattern ends up trying to match two spaces instead of one, because your regex {} ([^ ]*) also contains a space after the placeholder.
You should instead change your code to the following:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(re.escape(i)), textfile)
    print(y)
Two problems:
Your current 'Premium (if any) Paid ' string ends on a space, and '{} ([^ ]*)' also has a space after {}, which adds them together. Delete the trailing space in 'Premium (if any) Paid '.
You need to escape the parentheses, so if you want to keep your regular expression unchanged, the string in the list should be 'Premium \(if any\) Paid'. You can also use re.escape instead.
For your particular cases, this seems to be an optimal solution:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    # raw string avoids invalid-escape warnings; textfile is the variable from the question
    y = re.findall(r'{}\s+(\S*)'.format(re.escape(i)), textfile, re.I)
    print(y)
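For reference, here is a self-contained sketch of the fixes discussed in these answers, run against an excerpt of the text from the question. The helper name values_after is invented for illustration, and stripping each label before escaping it is one way (an assumption, not the only one) to deal with the trailing space:

```python
import re

# Excerpt of the text from the question.
textfile = """in said taxing district described as Block No. 10303 Lot No. 10 :
and known as 239 E HANOVER AVE , on the tax Taxes For: 2012
Total Cost of Sale 35.00
Premium (if any) Paid 1,400.00 """

labels = ['Town of', 'Block No.', 'Lot No.', 'Premium (if any) Paid ']


def values_after(labels, text):
    """Map each label to the list of whitespace-delimited tokens that follow it."""
    results = {}
    for label in labels:
        # strip() removes stray spaces; re.escape() neutralises '.', '(' and ')'.
        pattern = r'{}\s+(\S+)'.format(re.escape(label.strip()))
        results[label.strip()] = re.findall(pattern, text)
    return results


for label, found in values_after(labels, textfile).items():
    print(label, found)
```

'Town of' yields an empty list because that phrase does not occur in the text.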

input value same as predefined list but getting a different output

I've written a program that takes the names and ages of multiple entries separated by commas, separates the letters from the digits, and then compares each name with a predefined set/list.
If an entry doesn't match the predefined data, the program prints an "incorrect entry" message along with the element that didn't match.
Here's the code:
from string import digits
print("enter name and age")
order=input("Seperate entries using a comma ',':")
order1=order.strip()
order2=order1.replace(" ","")
order_sep=order2.split()
removed_digits=str.maketrans('','',digits)
names=order.translate(removed_digits)
print(names)
names1=names.split(',')
names_list=['abby','chris','john','cena']
names_list=set(names_list)
for name in names1:
    if name not in names_list:
        print(f"{name}:doesnt match with predefined data")
The problem I'm having is that even when I enter chris or john, the program treats them as if they don't belong to the predefined list.
sample input : ravi 19,chris 20
output:ravi ,chris
ravi :doesnt match with predefined data
chris :doesnt match with predefined data
I also have another issue: I've written a part to eliminate whitespace, but I don't know why it doesn't eliminate it.
sample input:ravi , chris
ravi :doesnt match with predefined data
()chris :doesnt match with predefined data
There's a space where I've put the parentheses.
Any suggestions to tackle this problem and/or improve this code are appreciated!
I think some of the parts can be simplified, especially when removing the digits. As long as the input is entered with a space between the name and age, you can use split() twice. First to separate the entries with split(',') and next to separate out the ages with split(). It makes comparisons easier later if you store the names by themselves with no punctuation or whitespace around them. To print the names out from an iterable, you can use the str.join() function. Here is an example:
print("enter name and age")
order = input("Seperate entries using a comma ',': ")
names1 = [x.split()[0] for x in order.split(',')]
print(', '.join(names1))
names_list=['abby', 'chris', 'john', 'cena']
for name in names1:
    if name not in names_list:
        print(f"{name}:doesnt match with predefined data")
This will give the desired output:
enter name and age
Seperate entries using a comma ',': ravi 19, chris 20
ravi, chris
ravi:doesnt match with predefined data
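The reason the original code failed is that names was built from order rather than from the whitespace-free order2, so the space left behind after removing the digits stayed attached to each name ('chris ' is not equal to 'chris'). A small function-based sketch of the split-twice idea, with hypothetical names (KNOWN_NAMES, unknown_names) and lower-casing added as an assumption about the desired behaviour:

```python
KNOWN_NAMES = {'abby', 'chris', 'john', 'cena'}


def unknown_names(order, known=KNOWN_NAMES):
    """Return the names in a 'name age, name age' string that are not in `known`."""
    names = []
    for entry in order.split(','):
        parts = entry.split()      # split() with no argument discards all whitespace
        if parts:                  # skip empty entries such as a trailing comma
            names.append(parts[0].lower())
    return [name for name in names if name not in known]


print(unknown_names('ravi 19, chris 20'))   # ['ravi']
print(unknown_names('ravi , chris'))        # ['ravi']
```

Skipping empty entries also avoids the IndexError that x.split()[0] would raise on input such as 'ravi 19,'.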

regex groups: How to get the desired output with a more specific match pattern?

The following input list of entries
l = ["555-8396 Neu, Allison",
"Burns, C. Montgomery",
"555-5299 Putz, Lionel",
"555-7334 Simpson, Homer Jay"]
is expected to be transformed to:
Allison Neu 555-8396
C. Montgomery Burns
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
I tried the following:
for i in l:
    mo = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", i)
    if mo:
        print("{} {} {}".format(mo.group(3), mo.group(2), mo.group(1)))
and it results in the following incorrect output (note the "None" in the second line of output)
Allison Neu 555-8396
C. Montgomery Burns None
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
However the following solution mentioned in the e-book does indeed give the desired output:
for i in l:
    mo = re.search(r"([0-9-]*)\s*([A-Za-z]+),\s+(.*)", i)
    print(mo.group(3) + " " + mo.group(2) + " " + mo.group(1))
In short, it boils down to the difference in the groups() output of the 2 reg exp searches:
>>> mo = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", "Burns, C. Montgomery")
>>> mo.groups()
(None, 'Burns', 'C. Montgomery')
versus
>>> mo = re.search(r"([0-9-]*)\s*(\w*),\s*(\S.*$)", "Burns, C. Montgomery")
>>> mo.groups()
('', 'Burns', 'C. Montgomery')
None vs ''
I wanted to do a more accurate match of the phone number format with [0-9]{3}-[0-9]{4} instead of using [0-9-]* which can match arbitrary number and - combinations (ex: "0-1-2" or "1-23").
Why does "*" result in a different grouping than "?"?
Yes, it is trivial for me to take care of the "None" while printing out the result, but I am interested to know the reason for the difference in grouping results.
((?:[0-9]{3}-[0-9]{4})?)\s*(\w*),\s*(\S.*$)
Try this. See the demo:
https://regex101.com/r/Qx6ylw/1
In the book's example the group itself was not optional, only its contents were. In your regex the group itself was optional.
Let me say in plain English what RegEx demos are hinting at and actually answer your actual question:
([0-9-]*) Matches 0 or more characters of digits or the - character. When there is no telephone present, that would be the case of matching 0 characters. But note the operative word matching, i.e. it is still a match. Thus, mo.group(1) returns ''.
([0-9]{3}-[0-9]{4})? Attempts to match a phone number in a specific format, but this match is optional. When the phone number is not present in the input, the match does not exist and thus mo.group(1) returns None.
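The difference can be checked directly. This short sketch compares the optional group from the question with a variant in which only the contents of the group are optional (the variable names are illustrative):

```python
import re

entry = "Burns, C. Montgomery"

# Optional group: the group itself may be skipped, so it reports None.
optional_group = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", entry)
# Group with optional contents: the group always takes part and matches ''.
optional_contents = re.search(r"((?:[0-9]{3}-[0-9]{4})?)\s*(\w*),\s*(\S.*$)", entry)

print(optional_group.groups())     # (None, 'Burns', 'C. Montgomery')
print(optional_contents.groups())  # ('', 'Burns', 'C. Montgomery')
```

The second pattern keeps the strict ddd-dddd phone format while still returning an empty string when no number is present.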
Using judicious whitespace trimming, a simple find and replace example is this :
Find: ^((?:\d+(?:-\d+)+)?)\s*([^,]*?)\s*,\s*(.*)
Replace: \3 \2 \1
https://regex101.com/r/oo0NWy/1
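In Python, that find-and-replace could be expressed with re.sub, roughly as follows. The names PATTERN and reorder are illustrative, and the final strip() is an addition to drop the trailing space left when an entry has no phone number:

```python
import re

PATTERN = re.compile(r'^((?:\d+(?:-\d+)+)?)\s*([^,]*?)\s*,\s*(.*)')

l = ["555-8396 Neu, Allison",
     "Burns, C. Montgomery",
     "555-5299 Putz, Lionel",
     "555-7334 Simpson, Homer Jay"]


def reorder(entry):
    """Rewrite 'phone Last, First' as 'First Last phone'."""
    # Group 1 always takes part in the match, so \1 is '' rather than unset.
    return PATTERN.sub(r'\3 \2 \1', entry).strip()


for entry in l:
    print(reorder(entry))
```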
This code solves your problem:
for i in l:
    mo = re.search(r"([0-9]{3}-[0-9]{4})?\s*(\w*),\s*(\S.*$)", i)
    if mo:
        if mo.group(1):
            print("{} {} {}".format(mo.group(3), mo.group(2), mo.group(1)))
        else:
            print("{} {}".format(mo.group(3), mo.group(2)))
Output:
Allison Neu 555-8396
C. Montgomery Burns
Lionel Putz 555-5299
Homer Jay Simpson 555-7334

Canadian postal code validation - python - regex

Below is the code I have written for a Canadian postal code validation script. It's supposed to read in a file:
123 4th Street, Toronto, Ontario, M1A 1A1
12456 Pine Way, Montreal, Quebec H9Z 9Z9
56 Winding Way, Thunder Bay, Ontario, D56 4A3
34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T
and output whether each postal code is valid or not. All of my postal codes are returned as invalid, when postal codes 1 and 2 are valid and 3 and 4 are invalid.
import re
filename = input("Please enter the name of the file containing the input Canadian postal code: ")
fo = open(filename, "r")
for line in open(filename):
    regex = '^(?!.*[DFIOQU])[A-VXY][0-9][A-Z]●?[0-9][A-Z][0-9]$'
    m = re.match(regex, line)
    if m is not None:
        print("Valid: ", line)
    else: print("Invalid: ", line)
fo.close
I do not guarantee that I fully understand the format, but this seems to work:
\b(?!.{0,7}[DFIOQU])[A-VXY]\d[A-Z][^-\w\d]\d[A-Z]\d\b
You can also fix yours (at least for the example) with this change:
(?!.*[DFIOQU])[A-VXY][0-9][A-Z].?[0-9][A-Z][0-9]
(except that it accepts a hyphen, which is forbidden)
But in this case, an explicit pattern may be best:
\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\s\d[ABCEGHJ-NPRSTV-Z]\d\b
Which completes in 1/4 the steps of the others.
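A minimal sketch of the explicit pattern applied to the four sample lines from the question. It uses re.search rather than re.match, since the postal code sits at the end of each line and re.match only tests the start of the string; POSTAL_RE and has_valid_postal_code are illustrative names, and the addresses are held in a list here instead of being read from a file:

```python
import re

# Explicit pattern from the answer above: excludes D, F, I, O, Q, U everywhere
# and additionally W and Z in the first position.
POSTAL_RE = re.compile(r'\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\s\d[ABCEGHJ-NPRSTV-Z]\d\b')

addresses = [
    "123 4th Street, Toronto, Ontario, M1A 1A1",
    "12456 Pine Way, Montreal, Quebec H9Z 9Z9",
    "56 Winding Way, Thunder Bay, Ontario, D56 4A3",
    "34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T",
]


def has_valid_postal_code(line):
    """True if the line contains a well-formed postal code anywhere in it."""
    # search() scans the whole line; match() would only test the start of it.
    return POSTAL_RE.search(line) is not None


for line in addresses:
    print("Valid:  " if has_valid_postal_code(line) else "Invalid:", line)
```

The first two lines are reported as valid and the last two as invalid, which is the result the question expects.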
This generic code can help you
import re
PIN = input("Enter your Address")
PIN1= PIN.upper()
if (len(re.findall(r'[A-Z]{1}[0-9]{1}[A-Z]{1}\s*[0-9]{1}[A-Z]{1}[0-9]{1}',PIN1)))==1:
    print("valid")
else:
    print("invalid")
Because the input comes from the user, there is a good chance that the postal code is typed without a space or in lower-case letters. This code handles:
1) Improper spacing
2) Lower-case letters
