Python regex, negate a set of characters in between a string - python

I have several set of strings with numbers followed words and jumbled numbers and words etc.
For example,
"Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"
I am trying to separate the street names from the apartment numbers.
Here is my current code:
import re
pattern_street = re.compile(r'[A-Za-z]+\s?\w+\s?[A-Za-z]+\s?[A-Za-z]+',re.X)
pattern_apartmentnumber = re.compile(r'(^\d+\s? | [A-Za-z]+[\s?]+[0-9]+$)',re.X)
for i in ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]:
match_street = pattern_street.search(i)
match_apartmentnumber = pattern_apartmentnumber.search(i)
fin_street = match_street[0]
fin_apartmentnumber = match_apartmentnumber[0]
print("street--",fin_street)
print("apartmentnumber--",fin_apartmentnumber)
which prints:
street-- Street 50 No
apartmentnumber-- No 40
street-- saint bakers holy street
apartmentnumber-- 5
street-- Syndicate street
apartmentnumber-- 32
I want to remove the "No" from the first street name. i.e. if there is any street with No followed by a number at the end, that needs to be taken as the apartment number,
and not as the street.
How can I do this for my above example strings?

First try the case where there is a No 123 at the end, use a positive lookahead.
If not found try a street without this.
pattern_street = re.compile(r'[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+',re.X)

You can find the street name by the following regex pattern to eliminate No [0-9] from the statement.
pattern_street = re.compile(r'[A-Za-z]+((?!No).)+',re.X)

Related

regular expression to exclude 2 consecutive capital letters

I'm having difficulty using regex to solve this expression,
e.g when given below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
for i in address:
if zipcode in i:
postal_code = (re.search("[A-Z]{2}\s[0-9]{5}", x)).group(0)
# returns "OG 56432"
digits = (re.search("\d+", x)).group(0)
# returns "686"
address = (re.search("\D+", x)).group(0)
# returns "Middle Street Pollocksville OG"
print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# returns OG 56432: High Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I manipulate my address variable Regex search to exclude the 2 capital consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals based on A regular expression to exclude a word/string but I think this is a step in the wrong direction.
PS: I understand there are easier methods to solve this, but I want to use regex to improve my fundamentals
If you just want to remove the two consecutive Capital Letters which are predecessor of zip-code(a 5 digit number) then use this
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
For removing all occurrences of two consecutive Capital Letters:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071
With re.sub() and group capturing you can use:
s="686 Middle Street Pollocksville OG 56432"
re.sub(r"(\d+)(.*)\s+([A-Z]+\s+\d+)",r"\3: \2 | \1",s)
Out: 'OG 56432: Middle Street Pollocksville | 686'

How can I use a regex function to extract european streetnames, housing numbers and adjectives with pandas?

I have a dataframe with 6000 records and need to extract/split the column with streetname into: "Streetname", "Housingnumber" and "Adjectives". Unfortunately, the problem is not solved yet using regex functions because there is no structure in the notation of df["streetname"]:
**Input from df["Streetname"]**
St. edward's Lane 26
Vineyardlane3a
High Street 0-9
ParkRoad near #33
Queens Road ??
s-Georgelane9abc
Kings Road 9b
1st Park Avenue 67 near cyclelane
**Output that I would like:
df["Street"] df["housingnumber"] df["adjective"]**
St. Edward's lane 26
Vineyardlane 3 a
High Street 0-9
ParkRoad 33
Queens Road
s-Georgelane 9 abc
Kings Road 9 b
1st Park Avenue 67
I tried this:
Filter = r'(?P<S>.*)(?P<H>\s[0-9].*)'
df["Streetname"] = df["Streetname"].str.extract(Filter)
I lose a lot of data and the result is only written into one column... Hope that someone can help!
Not 100% perfect (I doubt that this will be possible without a database or machine learning algorithms) but a starting point:
^ # start of line/string
(?P<street>\w+?\D+) # [a-zA-Z0-9_]+? followed by not a number
(?P<nr>\d*[-\d]*) # a digit, followed by - and other digits, eventually
(?P<adjective>[a-zA-Z]*) # a-z
.* # consume the rest of the string
See a demo on regex101.com.
You might want to strip of #, whitespaces or ? from the end of street afterwards.

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like reconstruct full names from photo captions using Regex in Python, by appending last name back to the first name in patterns "FirstName1 and FirstName2 LastName". We can rely on names starting with capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I would need to avoid matching patterns like this: 'Jay Smith and Albert Diamond' and generate a non-existent name 'Smith Diamond'
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
s = 'John and Albert McDonald'
so = re.search('([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
print so.group(1) + ' ' + so.group(2).split()[1]
print so.group(2)
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex epression? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Live preview
Swapping out your current pattern, with this pattern should work with your current Python code. You do need to strip() the captured results though.
Which for your examples and current code would yield:
Input
First print
Second print
John and Albert McDonald
John McDonald
Albert McDonald
Stephen Stewart, John and Albert Diamond
John Diamond
Albert Diamond
It was a great day hanging out with John and Stephen Diamond.
John Diamond
Stephen Diamond

Remove numbers conditionally?

I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address) however it's currently removing all the numbers. I thought that by putting ,1 after address it would only remove the first occurrence but that wasn't the case
If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
input = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
output = re.sub(pattern, "\\1", input)
print(output) #SUITE 200 PARK AVENUE SOUTH NEW YORK
Your description of what you want to do isn't very clear, but if I understand correctly you want to is to delete the first occurrence of a number sequence?
You could do this without using a regex,
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
l = s.split(' ')
for i, w in enumerate(l):
for c in w:
if c.isdigit():
del l[i]
break
print ' '.join(l)
Output: >>> SUITE 200 PARK AVENUE SOUTH NEW YORK

FInd a US street address in text (preferably using Python regex)

Disclaimer: I read very carefully this thread:
Street Address search in a string - Python or Ruby
and many other resources.
Nothing works for me so far.
In some more details here is what I am looking for is:
The rules are relaxed and I definitely am not asking for a perfect code that covers all cases; just a few simple basic ones with assumptions that the address should be in the format:
a) Street number (1...N digits);
b) Street name : one or more words capitalized;
b-2) (optional) would be best if it could be prefixed with abbrev. "S.", "N.", "E.", "W."
c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters
d) Street "type": one of ("st.", "ave.", "way");
e) City name : 1 or more Capitalized words;
f) (optional) state abbreviation (2 letters)
g) (optional) zip which is any 5 digits.
None of the above needs to be a valid thing (e.g. an existing city or zip).
I am trying expressions like these so far:
pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
Don't work, and for me it's not easy to understand why. Specifically: how do I separate in my pattern a group of any words from one of specific words that should follow, like state abbrev. or street "type ("st., ave.)?
Anyhow: here is an example of what I am hoping to get:
Given
def ex_addr(text):
# does the re magic
# returns 1st address (all addresses?) or None if nothing found
for t in [
'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
'This was written in 1999 in Montreal',
"Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
"We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
] print ex_addr(t)
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?
I just ran across this in GitHub as I am having a similar problem. Appears to work and be more robust than your current solution.
https://github.com/madisonmay/CommonRegex
Looking at the code, the regex for street address accounts for many more scenarios. '\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?
In this regex, you have one too many spaces (before ( \w+){1,5}, which already begins with one). Removing it, it matches your example.
I don't think you can assume that a "unit 123" or similar will be there, or there might be several ones (e.g. "building A, apt 3"). Note that in your initial regex, the . might match , which could lead to very long (and unwanted) matches.
You should probably accept several such groups with a limitation on the number (e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}.
In any case, you will probably never get something 100% accurate that will accept any variation people might throw at them. Do lots of tests! Good luck.

Categories

Resources