regular expression to exclude 2 consecutive capital letters

regular expression to exclude 2 consecutive capital letters - python

I'm having difficulty using regex to solve this expression,
e.g when given below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
for i in address:
if zipcode in i:
postal_code = (re.search("[A-Z]{2}\s[0-9]{5}", x)).group(0)
# returns "OG 56432"
digits = (re.search("\d+", x)).group(0)
# returns "686"
address = (re.search("\D+", x)).group(0)
# returns "Middle Street Pollocksville OG"
print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# returns OG 56432: High Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I manipulate my address variable Regex search to exclude the 2 capital consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals based on A regular expression to exclude a word/string but I think this is a step in the wrong direction.
PS: I understand there are easier methods to solve this, but I want to use regex to improve my fundamentals

If you just want to remove the two consecutive Capital Letters which are predecessor of zip-code(a 5 digit number) then use this
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
For removing all occurrences of two consecutive Capital Letters:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071

With re.sub() and group capturing you can use:
s="686 Middle Street Pollocksville OG 56432"
re.sub(r"(\d+)(.*)\s+([A-Z]+\s+\d+)",r"\3: \2 | \1",s)
Out: 'OG 56432: Middle Street Pollocksville | 686'

Related

Python regex, negate a set of characters in between a string

I have several set of strings with numbers followed words and jumbled numbers and words etc.
For example,
"Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"
I am trying to separate the street names from the apartment numbers.
Here is my current code:
import re
pattern_street = re.compile(r'[A-Za-z]+\s?\w+\s?[A-Za-z]+\s?[A-Za-z]+',re.X)
pattern_apartmentnumber = re.compile(r'(^\d+\s? | [A-Za-z]+[\s?]+[0-9]+$)',re.X)
for i in ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]:
match_street = pattern_street.search(i)
match_apartmentnumber = pattern_apartmentnumber.search(i)
fin_street = match_street[0]
fin_apartmentnumber = match_apartmentnumber[0]
print("street--",fin_street)
print("apartmentnumber--",fin_apartmentnumber)
which prints:
street-- Street 50 No
apartmentnumber-- No 40
street-- saint bakers holy street
apartmentnumber-- 5
street-- Syndicate street
apartmentnumber-- 32
I want to remove the "No" from the first street name. i.e. if there is any street with No followed by a number at the end, that needs to be taken as the apartment number,
and not as the street.
How can I do this for my above example strings?

First try the case where there is a No 123 at the end, use a positive lookahead.
If not found try a street without this.
pattern_street = re.compile(r'[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+',re.X)

You can find the street name by the following regex pattern to eliminate No [0-9] from the statement.
pattern_street = re.compile(r'[A-Za-z]+((?!No).)+',re.X)

How can I use a regex function to extract european streetnames, housing numbers and adjectives with pandas?

I have a dataframe with 6000 records and need to extract/split the column with streetname into: "Streetname", "Housingnumber" and "Adjectives". Unfortunately, the problem is not solved yet using regex functions because there is no structure in the notation of df["streetname"]:
**Input from df["Streetname"]**
St. edward's Lane 26
Vineyardlane3a
High Street 0-9
ParkRoad near #33
Queens Road ??
s-Georgelane9abc
Kings Road 9b
1st Park Avenue 67 near cyclelane
**Output that I would like:
df["Street"] df["housingnumber"] df["adjective"]**
St. Edward's lane 26
Vineyardlane 3 a
High Street 0-9
ParkRoad 33
Queens Road
s-Georgelane 9 abc
Kings Road 9 b
1st Park Avenue 67
I tried this:
Filter = r'(?P<S>.*)(?P<H>\s[0-9].*)'
df["Streetname"] = df["Streetname"].str.extract(Filter)
I lose a lot of data and the result is only written into one column... Hope that someone can help!

Not 100% perfect (I doubt that this will be possible without a database or machine learning algorithms) but a starting point:
^ # start of line/string
(?P<street>\w+?\D+) # [a-zA-Z0-9_]+? followed by not a number
(?P<nr>\d*[-\d]*) # a digit, followed by - and other digits, eventually
(?P<adjective>[a-zA-Z]*) # a-z
.* # consume the rest of the string
See a demo on regex101.com.
You might want to strip of #, whitespaces or ? from the end of street afterwards.

Remove numbers conditionally?

I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address) however it's currently removing all the numbers. I thought that by putting ,1 after address it would only remove the first occurrence but that wasn't the case

If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
input = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
output = re.sub(pattern, "\\1", input)
print(output) #SUITE 200 PARK AVENUE SOUTH NEW YORK

Your description of what you want to do isn't very clear, but if I understand correctly you want to is to delete the first occurrence of a number sequence?
You could do this without using a regex,
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
l = s.split(' ')
for i, w in enumerate(l):
for c in w:
if c.isdigit():
del l[i]
break
print ' '.join(l)
Output: >>> SUITE 200 PARK AVENUE SOUTH NEW YORK

How can I extract address from raw text using NLTK in python?

I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
. How the address part can be extracted from the above text using NLTK? I have tried Stanford NER Tagger, which gives me only New York as Location. How to solve this?

Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) will return an array with all the occurrences found.

Pyap works best not just for this particular example but also for other addresses contained in texts.
text = ...
addresses = pyap.parse(text, country='US')

Checkout libpostal, a library dedicated to address extraction
It cannot extract address from raw text but may help in related tasks

For US address extraction from bulk text:
For US addresses in bulks of text I have pretty good luck, though not perfect with the below regex. It wont work on many of the oddity type addresses and only captures first 5 of the zip.
Explanation:
([0-9]{1,6}) - string of 1-5 digits to start off
(.{5,75}) - Any character 5-75 times. I looked at the addresses I was interested in and the vast vast majority were over 5 and under 60 characters for the address line 1, address 2 and city.
(BIG LIST OF AMERICAN STATS AND ABBERVIATIONS) - This is to match on states. Assumes state names will be Title Case.
.{1,2} - designed to accomodate many permutations of ,/s or just /s between the state and the zip
([0-9]{5}) - captures first 5 of the zip.
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
out_address = " ".join(address)
out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2 etc. I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some python solutions for this like usaddress

ArcGIS Field Calculator strip words after spaces

I have a feature class which contains 40,000 mailing addresses. Each address contains the street address, city, state and zipcode separated by spaces.
Example 1: 123 Northwest Johnson St Cleveland Ohio 12345
Example 2: PO Box 3 Pine Springs Ohio 12345
I want to pull out just the street addresses. How do I say: trim off the string starting at the 3rd or 4th to last space?
Thanks. Any help would be appreciated. I'm trying combinations of split, trim, etc. but can't get it right.

This is how you can do it in pure Python, I am not sure about differences when using ArcGIS:
ad1 = "123 Northwest Johnson St Cleveland Ohio 12345"
ad2 = "PO Box 3 Pine Springs Ohio 12345"
ad1split = ad1.split(" ")
ad2split = ad2.split(" ")
print ' '.join( ad1split[: len(ad1split)-3 ] ) # 123 Northwest Johnson
print ' '.join( ad2split[: len(ad1split)-3 ] ) # PO Box 3
This however only works if all addresses have the same format.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

regular expression to exclude 2 consecutive capital letters - python

With re.sub() and group capturing you can use: s="686 Middle Street Pollocksville OG 56432" re.sub(r"(\d+)(.*)\s+([A-Z]+\s+\d+)",r"\3: \2 | \1",s) Out: 'OG 56432: Middle Street Pollocksville | 686'

Related

Python regex, negate a set of characters in between a string

How can I use a regex function to extract european streetnames, housing numbers and adjectives with pandas?

Remove numbers conditionally?

How can I extract address from raw text using NLTK in python?

ArcGIS Field Calculator strip words after spaces

Categories

Resources