I have a feature class which contains 40,000 mailing addresses. Each address contains the street address, city, state and zipcode separated by spaces.
Example 1: 123 Northwest Johnson St Cleveland Ohio 12345
Example 2: PO Box 3 Pine Springs Ohio 12345
I want to pull out just the street addresses. How do I say: trim off the string starting at the 3rd or 4th to last space?
Thanks. Any help would be appreciated. I'm trying combinations of split, trim, etc. but can't get it right.
This is how you can do it in pure Python; I am not sure whether anything differs when using ArcGIS:
ad1 = "123 Northwest Johnson St Cleveland Ohio 12345"
ad2 = "PO Box 3 Pine Springs Ohio 12345"
ad1split = ad1.split(" ")
ad2split = ad2.split(" ")
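print(' '.join(ad1split[:len(ad1split) - 3]))  # 123 Northwest Johnson St
print(' '.join(ad2split[:len(ad2split) - 4]))  # PO Box 3 (two-word city, so four tokens are dropped)
However, this only works if you know how many words the city and state occupy: the first address needs its last three tokens dropped, the second needs four. A single fixed count only works if all addresses have the same format.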
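Since the city can be one or two words, one possible workaround is to match the tail of each address against a list of known city names and cut the string there. This is only a sketch: `KNOWN_CITIES` and `street_part` are hypothetical names made up for illustration, and the pattern assumes a one-word state followed by a five-digit ZIP code.

```python
import re

# Hypothetical list of city names; in practice this would come from your data.
KNOWN_CITIES = ("Cleveland", "Pine Springs")

def street_part(address, cities=KNOWN_CITIES):
    """Return the text before '<city> <state> <zip>' if a known city is found."""
    # Try longer names first so a multi-word city wins over a shorter one.
    for city in sorted(cities, key=len, reverse=True):
        tail = r"\s+" + re.escape(city) + r"\s+\S+\s+\d{5}$"
        match = re.search(tail, address)
        if match:
            return address[:match.start()]
    # No known city found: leave the address unchanged.
    return address

print(street_part("123 Northwest Johnson St Cleveland Ohio 12345"))
print(street_part("PO Box 3 Pine Springs Ohio 12345"))
```

Addresses whose city is not in the list are returned unchanged, so they can be reviewed by hand.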
I have several sets of strings in which numbers and words appear in varying orders.
For example,
"Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"
I am trying to separate the street names from the apartment numbers.
Here is my current code:
import re
pattern_street = re.compile(r'[A-Za-z]+\s?\w+\s?[A-Za-z]+\s?[A-Za-z]+',re.X)
pattern_apartmentnumber = re.compile(r'(^\d+\s? | [A-Za-z]+[\s?]+[0-9]+$)',re.X)
for i in ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]:
    match_street = pattern_street.search(i)
    match_apartmentnumber = pattern_apartmentnumber.search(i)
    fin_street = match_street[0]
    fin_apartmentnumber = match_apartmentnumber[0]
    print("street--", fin_street)
    print("apartmentnumber--", fin_apartmentnumber)
which prints:
street-- Street 50 No
apartmentnumber-- No 40
street-- saint bakers holy street
apartmentnumber-- 5
street-- Syndicate street
apartmentnumber-- 32
I want to remove the "No" from the first street name, i.e. if a string ends with No followed by a number, that part should be taken as the apartment number and not as part of the street.
How can I do this for my above example strings?
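First try the case where the string ends with No followed by a number (e.g. No 123), using a positive lookahead so that the street match stops just before it.
If that is not found, fall back to matching a street without this suffix.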
pattern_street = re.compile(r'[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+',re.X)
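As a quick check, the pattern above can be run against the three sample strings from the question. The `street_name` helper is a hypothetical wrapper added only for this illustration; the `re.X` flag is omitted because the pattern contains no literal whitespace.

```python
import re

# Same pattern as above: lookahead alternative first, plain street as fallback.
pattern_street = re.compile(r'[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+')

def street_name(text):
    """Return the matched street name, or None if nothing matches."""
    match = pattern_street.search(text)
    return match.group(0) if match else None

for sample in ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]:
    print("street--", street_name(sample))
# street-- Street 50
# street-- saint bakers holy street
# street-- Syndicate street
```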
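Alternatively, you can match the street name with the following pattern, which uses a negative lookahead to stop before the first occurrence of No. Note that it stops at any No, including one inside a word such as North.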
pattern_street = re.compile(r'[A-Za-z]+((?!No).)+',re.X)
I have a pandas DataFrame column street_address with strings that look like this:
id | street_address
----------------------
1 | 3510 WILSHIRE BLVD #1500
2 | PO BOX 29043
3 | RE HIAM S ABU QARTUMI 4676 ADMIRALTY WAY STE 632
4 | RE: SOON, LEE YEE 3510 WILSHIRE BLVD #150
5 | LAW OFFICES OF JOE M DOE 133 SANDSTONE ST STE 901
6 | SUITE 940, 1500 N CENTRAL AVE
I want to remove the text that precedes the street number (where the actual address begins), but I need to leave PO Boxes and addresses that begin with a suite number untouched.
I want the output to be something like this:
street_address
----------------------
3510 WILSHIRE BLVD #1500
PO BOX 29043
4676 ADMIRALTY WAY STE 632
3510 WILSHIRE BLVD #150
133 SANDSTONE ST STE 901
SUITE 940, 1500 N CENTRAL AVE
Thanks for your help!
EDIT
Thanks everyone for the help!
However, for my example I made it work by using replace:
# When an address starts with a string,
# remove that string through the first number
# unless that string is similar to 'PO BOX' or 'SUITE'.
# This catches variants like
# PO BOX, P.O BOX, PMB, STE, Suite, ste, etc.
pattern = r"^(?![PO.\sBX]{2,}|[PMB]{2,}|[\d]|[SUITE])(\D+)(.+)"
# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df['str_addr'] = df['street_address'].str.replace(pattern, r'\2', regex=True)
Use the following regex:
r'^(?:(?!(PO BOX|SUITE|\d+)))([a-zA-Z :,]+)'
The first part is a negative lookahead, wrapped in a non-capturing group, that rejects rows starting with a number, "PO BOX", or "SUITE". The second part, ([a-zA-Z :,]+), captures the leading text of the remaining rows, i.e. everything before the street number. Because the alternation inside the lookahead is itself parenthesized, this prefix is group 2. You can extract that group from the offending rows and strip them down to the address. If the prefix can contain characters other than [a-zA-Z :,], add them to the bracketed class to grab them too.
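To see what the pattern does, here is a small sketch that applies it with plain `re` to the six sample addresses from the question. The `strip_prefix` helper is a hypothetical name used for illustration; on a DataFrame the same pattern could presumably be passed to `Series.str.replace` with `regex=True` instead.

```python
import re

# Pattern from the answer above: skip rows starting with PO BOX, SUITE, or a digit.
PREFIX = re.compile(r'^(?:(?!(PO BOX|SUITE|\d+)))([a-zA-Z :,]+)')

def strip_prefix(address):
    """Remove the leading non-address text, if the pattern matches."""
    return PREFIX.sub('', address)

addresses = [
    "3510 WILSHIRE BLVD #1500",
    "PO BOX 29043",
    "RE HIAM S ABU QARTUMI 4676 ADMIRALTY WAY STE 632",
    "RE: SOON, LEE YEE 3510 WILSHIRE BLVD #150",
    "LAW OFFICES OF JOE M DOE 133 SANDSTONE ST STE 901",
    "SUITE 940, 1500 N CENTRAL AVE",
]

cleaned = [strip_prefix(a) for a in addresses]
for line in cleaned:
    print(line)
```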
I am trying to parse the text of this format:
1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR
In the text, I want to capture:
Circuit name: In the example above, 1ST CIRCUIT. Circuit number can be between 1ST and 99TH. This information is not always there.
State name: In the text above, NEW YORK SOUTHERN. It can be at most three words. This information is not always there.
Title: It can be either District or Magistrate.
Last Name: Here, it is SMITH.
Name: Here, it is JOHN T., JR
To make my problem more clear, let me give two more examples of the text I want to parse.
15TH Circuit U.S. District Court for ALABAMA Magistrate Judge NEELY, CATHERINE
Magistrate Judge COOKE, THOMAS M
I have tried the following expression. It was able to capture the name of the judge but failed to capture the circuit and the state.
((?P<circuit>\d{1,2}\w{2} Circuit)?\s?(U\.S\. District Court for )?\s?(?P<state>\b[A-Z]*(\s[A-Z]*)\b)*)?.* (?<=Judge )(?P<lname>[A-Z]*), (?P<name>[A-Z,. ]*)( {1,2}\(.*\))?
Many thanks.
This should help.
import re
s = ["1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR", "15TH Circuit U.S. District Court for ALABAMA Magistrate Judge NEELY, CATHERINE"]
for sVal in s:
    m = re.search(r"((?P<circuit>\d*(ST|TH) Circuit)) U.S. District Court for (?P<state>\b[A-Z\s]*\b)(?P<title>(District|Magistrate)) Judge (?P<lname>[A-Z]*), (?P<name>.*$)", sVal)
    if m:
        for i in ["circuit", "state", "title", "lname", "name"]:
            print(m.group(i))
        print("-----")
Output:
1ST Circuit
NEW YORK SOUTHERN
District
SMITH
JOHN T., JR
-----
15TH Circuit
ALABAMA
Magistrate
NEELY
CATHERINE
-----
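The pattern above requires the circuit and court parts, so it would not match the third example from the question ("Magistrate Judge COOKE, THOMAS M"). A possible variant, offered as a sketch rather than a definitive solution, wraps those parts in optional groups and limits the state to at most three words. The names `JUDGE` and `parse_judge` are hypothetical, and the ordinal suffixes ND and RD are an assumption based on the stated range of 1ST to 99TH.

```python
import re

JUDGE = re.compile(
    # Optional circuit, e.g. "1ST Circuit" (ordinal suffixes are an assumption).
    r"^(?:(?P<circuit>\d{1,2}(?:ST|ND|RD|TH) Circuit) )?"
    # Optional court with a state of one to three uppercase words.
    r"(?:U\.S\. District Court for (?P<state>[A-Z]+(?: [A-Z]+){0,2}) )?"
    r"(?P<title>District|Magistrate) Judge "
    r"(?P<lname>[A-Z]+), (?P<name>.+)$"
)

def parse_judge(text):
    """Return a dict of the named groups, or None if the text does not match."""
    match = JUDGE.match(text)
    return match.groupdict() if match else None

print(parse_judge("Magistrate Judge COOKE, THOMAS M"))
```

Groups that are absent from the input, such as the circuit and state in this last example, come back as None.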
I'm having difficulty solving this problem with regex.
For example, given the call below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
    for x in address:
        if zipcode in x:
            postal_code = (re.search(r"[A-Z]{2}\s[0-9]{5}", x)).group(0)
            # returns "OG 56432"
            digits = (re.search(r"\d+", x)).group(0)
            # returns "686"
            address = (re.search(r"\D+", x)).group(0)
            # returns " Middle Street Pollocksville OG " (with surrounding spaces)
            print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# prints OG 56432: Middle Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I change the regex search for my address variable so that it excludes the two consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals, based on the question "A regular expression to exclude a word/string", but I think this is a step in the wrong direction.
PS: I understand there are easier methods to solve this, but I want to use regex to improve my fundamentals
If you just want to remove the two consecutive capital letters that precede the ZIP code (a 5-digit number), use this:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
To remove all occurrences of two consecutive capital letters, drop the lookahead:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'\b[A-Z]{2}\s', '', text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071
With re.sub() and capture groups you can use:
s = "686 Middle Street Pollocksville OG 56432"
re.sub(r"(\d+)\s+(.*)\s+([A-Z]+\s+\d+)", r"\3: \2 | \1", s)
Out: 'OG 56432: Middle Street Pollocksville | 686'
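Putting this together with the original function signature, a regex-only version of `regex_exp` might look like the sketch below. It assumes every address has the shape "house number, street and city, two capital letters, five-digit code", and it returns the formatted string instead of printing it; the `LINE` pattern name is made up for this illustration.

```python
import re

# Assumed shape: "<number> <street and city> <2 capitals> <5 digits>".
LINE = re.compile(r"^(?P<number>\d+)\s+(?P<street>.+?)\s+(?P<postal>[A-Z]{2}\s\d{5})$")

def regex_exp(addresses, zipcode):
    """Return 'POSTAL: street | number' for the first address with that postal code."""
    for line in addresses:
        match = LINE.match(line)
        if match and match.group("postal") == zipcode:
            return "{}: {} | {}".format(
                match.group("postal"), match.group("street"), match.group("number")
            )
    return None  # no address carries the requested postal code

address = [
    "622 Gordon Lane St. Louisville OH 52071",
    "432 Main Long Road St. Louisville OH 43071",
    "686 Middle Street Pollocksville OG 56432",
]
print(regex_exp(address, "OG 56432"))
# OG 56432: Middle Street Pollocksville | 686
```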
I have been looking and reading for a few days about how to get the last word from a variable-length string. I have found lots of posts about how to collect or split off the last word, but none of them addresses strings of varying length.
I would like to use this function for column population, automated labeling and content filtering from inside either the field calculator or label expression interfaces.
String examples: Morgan County, Johnson Parrish, John Brown County, Rick de la Rosa City, Big Wild life Area.
Output example: County, Parrish, County, City, Area
Nothing I have tried has worked 100%. The following code just about works, and would work if all my strings were two words long: s.split( " " )[1:][0]
I am using ArcMap 10.2 / Python.
How about this:
# example comma separated string ending with a period
s = "Morgan County, Johnson Parrish, John Brown County, Rick de la Rosa City, Big Wild life Area."
# output list
out = []
for pair in s.replace('.', '').split(', '):
    # the last space-separated token of each entry
    out.append(pair.split(' ')[-1])
print(out)
Which results in: ['County', 'Parrish', 'County', 'City', 'Area']
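For a single field value, as in a field calculator or label expression, the same idea can be wrapped in a small function. This is a minimal sketch; `last_word` is a hypothetical name, and how it is wired into the ArcMap code block is left as an assumption.

```python
def last_word(value):
    """Return the last whitespace-separated word of value, or '' if there is none."""
    # split() with no arguments collapses repeated spaces and ignores
    # leading or trailing whitespace, so the string length does not matter.
    words = value.split()
    return words[-1] if words else ""

for name in ["Morgan County", "John Brown County", "Rick de la Rosa City", "Big Wild life Area"]:
    print(last_word(name))
# County
# County
# City
# Area
```

Indexing with [-1] counts from the end of the list, which is why it works regardless of how many words the string contains, unlike [1:][0].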