Split Australian addresses into street_address, suburb, state and postcode - python

I have scraped addresses from a website, but their format is not consistent, for instance:
address = '139 McKinnon Road, PINELANDS, NT, 829'
address_2 = '108 East Point Road, Fannie Bay, NT, 820'
address_3 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
I have tried splitting them on the space character ' ' but couldn't get the desired result.
Here is what I tried:
if "," in address:
raw_address = address.split(",")
splitted_address = [
adr for adr in raw_address if not adr.islower() and not adr.isupper()
]
splitted_suburb = [adr for adr in raw_address if adr.isupper()]
item["Street_Address"] = splitted_address[0].strip()
item["Suburb"] = splitted_address[1].strip()
item["State"] = splitted_suburb[0].strip()
item["Postcode"] = splitted_address[2].strip()
else:
raw_address = address.split(" ")
splitted_address = [
adr for adr in raw_address if not adr.islower() and not adr.isupper()
]
splitted_suburb = [adr for adr in raw_address if adr.isupper()]
item["Street_Address"] = " ".join(splitted_address[:-1])
item["Suburb"] = splitted_suburb[0]
item["State"] = splitted_suburb[1]
item["Postcode"] = splitted_address[-1]
And my desired output should be like this:
Street_Address,Suburb,State,Postcode
Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356
How can I split the full address into these specific fields?
Update:
I have parsed out the desired fields using this regex pattern:
regex_str = "(^.*?(?:Lane|Street|Boulevard|Crescent|Place|Road|Highway|Avenue|Drive|Circuit|Parade|Telopea|Nicklin Way|Terrace|Square|Court|Close|Endeavour Way|Esplanade|East|The Centreway|Mall|Quay|Gateway|Low Way|Point|Rd|Morinda|Way|Ave|St|South Steyne|Broadway|HQ|Expressway|Strett|Castlereagh|Meadow Way|Track|Kulkyne Way|Narabang Way|Bank)),? ?(.*?),? ?([A-Z]{3}),? ?(\d{,4})$"
matches = re.search(regex_str, full_address)
street, suburb, state, postcode = matches.groups()
item["Street_Address"] = street
item["Suburb"] = suburb
item["State"] = state
item["Postcode"] = postcode
It works for some addresses, like address_3, but for address and address_2 the pattern does not match and I get a NoneType error:
File "colliers_sale.py", line 164, in parse_details
street, suburb, state, postcode = matches.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
How can I fix this?
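Note the state group ([A-Z]{3}) only allows three-letter states, so QLD (address_3) matches but NT (address and address_2) never can; re.search then returns None and .groups() raises. A minimal sketch of a fix: widen the group to ([A-Z]{2,3}) and guard the unpacking:

import re

matches = re.search(regex_str, full_address)  # regex_str with ([A-Z]{2,3}) for the state
if matches:
    street, suburb, state, postcode = matches.groups()
else:
    # the pattern didn't match this address; skip or log it instead of crashing
    print("No match:", full_address)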

You can use regular expressions, but you will probably need multiple patterns, something like this:
import re

match = None
if (match := re.search(r'(.*?\d+-\d+),? (.+?) ([A-Z ]+) ([A-Z]+) (\d+)$', address)):
    pass  # this matches address, address_3, address_4
elif (match := re.search(r'(\d+-\d+) (.+?), (.+?), ([A-Z]+), (\d+)$', address)):
    pass  # this matches address_2
# elif (...another pattern...)

if match:
    print(match[1], match[2], match[3], match[4], match[5], sep=' # ')
else:
    print('nothing matched')

Try the 're' package. You can do it using regular expressions like this:
import re

address = 'Units 1-14, 29 Wiltshire Lane DELACOMBE VIC 3356'
address_2 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
address_3 = '6-10 Mount Street MOUNT DRUITT NSW 2770'
address_4 = '34-36 Fairfield Street FAIRFIELD EAST NSW 2165'
addresses = [address, address_2, address_3, address_4]
for add in addresses:
    print(', '.join(re.findall(r"(.*\d+-\d+)[, ]+(\w*\s*\w+\s+\w+)[, ]+(\w*\s*\w+)[, ]+(\w+)[, ]+(\d+)", add)[0]))
The parentheses in the re.findall pattern capture the parts you want.
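If an address is fully comma-separated, as all three examples in the question are, a simpler fallback is to split from the right on the last three commas; rsplit keeps any commas inside the street part intact:

address = 'Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356'
# rsplit from the right: suburb, state and postcode never contain commas
street, suburb, state, postcode = [part.strip() for part in address.rsplit(',', 3)]
print(street, suburb, state, postcode, sep=' | ')
# Units 1-14, 29 Wiltshire Lane | DELACOMBE | VIC | 3356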

Related

Why does geolocate not give me the right addresses?

So I was analyzing a data set with addresses in Philadelphia, PA. Now, in order to make use of these, I wanted to get the exact longitude and latitude to later show them on a map.
I have gotten the unique entries of the column as a list and have implemented a loop to get me the longitude and latitude, though it's giving me the same coordinates for every city and sometimes even ones that are outside of Philadelphia.
Here's what I did so far:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my_user_agent")
geocode = lambda query: geolocator.geocode("%s, Philadelphia PA" % query)
cities = list(philly["station_name"].unique())
for city in cities:
    address = city
    location = geolocator.geocode(address)
    if location != None:
        philly["longitude"] = location.longitude
        philly["latitude"] = location.latitude
philly["coordinates"] = list(zip(philly["latitude"], philly["longitude"]))
If "philly" is a list of dictionary objects then you can iterate over the list and add the location properties to each record.
from geopy.geocoders import Nominatim

philly = [{'station_name': '30th Street Station'}]
geolocator = Nominatim(user_agent="my_user_agent")
for row in philly:
    address = row["station_name"]
    location = geolocator.geocode(f"{address}, Philadelphia, PA", country_codes="us")
    if location:
        print(address)
        print(">>", location.longitude, location.latitude)
        row["longitude"] = location.longitude
        row["latitude"] = location.latitude
        row["coordinates"] = (location.longitude, location.latitude)
print(philly)
Output:
30th Street Station
>> -75.1821442 39.9552836
[{'station_name': '30th Street Station', 'longitude': -75.1821442, 'latitude': 39.9552836, 'coordinates': (-75.1821442, 39.9552836)}]
If working with a Pandas dataframe then you can iterate over each record in the dataframe then set the latitude, longitude and coordinates fields in it.
You can do something like this:
from geopy.geocoders import Nominatim
import pandas as pd

geolocator = Nominatim(user_agent="my_user_agent")
philly = [{'station_name': '30th Street Station'}]
df = pd.DataFrame(philly)

# add empty location columns to data frame
df["latitude"] = ""
df["longitude"] = ""
df["coordinates"] = ""

for idx, row in df.iterrows():
    address = row.station_name
    location = geolocator.geocode(f"{address}, Philadelphia, PA", country_codes="us")
    if location:
        # write back through the index; assigning to the iterrows row
        # would only modify a temporary copy
        df.at[idx, "latitude"] = location.latitude
        df.at[idx, "longitude"] = location.longitude
        df.at[idx, "coordinates"] = (location.longitude, location.latitude)
print(df)
Output:
station_name latitude longitude coordinates
0 30th Street Station 39.955284 -75.182144 (-75.1821442, 39.9552836)
If you have a list with duplicate station names then you should cache the results so you don't make duplicate geolocation requests.
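For instance, a minimal memoization sketch (geocode_cached is a hypothetical helper name), assuming duplicate station names are exact string matches:

from functools import lru_cache
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my_user_agent")

@lru_cache(maxsize=None)
def geocode_cached(address):
    # repeated addresses are served from the cache instead of hitting Nominatim again
    return geolocator.geocode(f"{address}, Philadelphia, PA", country_codes="us")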

Separating text/text processing using regex

I have a paragraph that needs to be separated by a certain list of keywords.
Here is the text (a single string):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
So I want to separate this paragraph based on the variable names using python. "Evaluation Note", "Date","ID","Contact","Name","Address","Home Phone","Additional Information" and "Author" are the variable names. I think using regex seems nice but I don't have a lot of experience in regex.
Here is what I am trying to do:
import re

regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+) ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (?P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
But it doesn't find any matches.
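One reason the pattern can never match: \D+ only matches non-digits, yet the Date value (3/13/2019) is mostly digits. A sketch of a corrected pattern, assuming the labels always appear in this fixed order:

import re

# \D+ can't match values that contain digits, so use a lazy .+? between labels instead
regex = (r"Evaluation Note:? (?P<note>.+?) "
         r"Date:? (?P<date>.+?) "
         r"ID:? (?P<id>.+?) "
         r"Contact:? (?P<contact>.+?) "
         r"Name:? (?P<name>.+)")
test_str = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 "
            "ID: #N/A Contact: Not Specified Name: Cecilia Valore")
m = re.search(regex, test_str)
if m:
    print(m.groupdict())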
You can probably generate that regex on the fly, so long as the order of the params is fixed.
Here's my try at it; it does the job. The actual regex it builds is something like Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*), and so on.
import re

test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']

def find(keys, string):
    keys = [(key, key.replace(' ', '_')) for key in keys]  # spaces aren't valid group names
    pattern = ''.join([f'{key}(?P<{name}>.*)' for key, name in keys])  # generate the actual regex
    for found in re.findall(pattern, string):  # use the parameter, not the global
        for item in found:
            yield item.strip(':').strip()  # clean up the result

for result in find(keys, test_str):
    print(result)
Which returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
You can use search to get locations of variables and parse text accordingly. You can customize it easily.
import re

# the paragraph from the question
text = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A "
        "Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive "
        "Home Phone: 353 273 400 Additional Information: Please tell me when the mother "
        "arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker")
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
this will output:
Evaluation Note: Suspected abuse by own mother.
Date: 3/13/2019
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
I hope this helps
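Another compact option, sketched here under the assumption that the label list is fixed and the labels never occur inside the values: split the paragraph once on an alternation of the known labels and pair up the results.

import re

text = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A "
        "Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive "
        "Home Phone: 353 273 400 Additional Information: Please tell me when the mother "
        "arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker")
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name', 'Address',
        'Home Phone', 'Additional Information', 'Author']

# split on the labels, keeping them via the capture group; label/value pairs then alternate
parts = re.split(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b:?', text)
record = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
print(record['Address'])  # 189 West Moncler Drive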

Append a word in a text file after searching for a specific word in Python

I want to read a text file, find a specific word, and then append another word next to it.
For example:
I want to find a first name in the file, like John, and then append a last name to it, like John Smith.
Here is the code I have written up till now.
usrinp = input("Enter name: ")
lines = []
with open('names.txt','rt') as in_file:
    for line in in_file:
        lines.append(line.rstrip('\n'))
for element in lines:
    if usrinp in element is not -1:
        print(lines[0] + " Smith")
        print(element)
That's what the text file looks like:
My name is FirstName
My name is FirstName
My name is FirstName
FirstName is a asp developer
Java developer is FirstName
FirstName is a python developer
Using replace is one way to do it.
Input file (names.txt):
My name is John
My name is John
My name is John
John is a asp developer
Java developer is John
John is a python developer
Script:
name = 'John'
last_name = 'Smith'

with open('names.txt', 'r') as names_file:
    content = names_file.read()

new = content.replace(name, ' '.join([name, last_name]))

with open('new_names.txt', 'w') as new_names_file:
    new_names_file.write(new)
Output file (new_names.txt):
My name is John Smith
My name is John Smith
My name is John Smith
John Smith is a asp developer
Java developer is John Smith
John Smith is a python developer
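One caveat with plain str.replace: it would also rewrite names that merely contain John, such as Johnson. A hedged variant using a word-boundary regex avoids that:

import re

name = 'John'
last_name = 'Smith'

with open('names.txt', 'r') as names_file:
    content = names_file.read()

# \b ensures only the whole word John is replaced, not e.g. Johnson
new = re.sub(r'\b' + re.escape(name) + r'\b', f'{name} {last_name}', content)

with open('new_names.txt', 'w') as new_names_file:
    new_names_file.write(new)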
search_string = 'john'
file_path = 'names.txt'  # path to the input file
file_content = open(file_path, 'r+')
lines = []
flag = 0
for line in file_content:
    line = line.lower()
    stripped_line = line
    if search_string in line:
        flag = 1
        stripped_line = line.strip('\n') + ' ' + 'smith \n'
    lines.append(stripped_line)
file_content.close()
if flag == 1:
    file_content = open(file_path, 'w')
    file_content.writelines(lines)
    file_content.close()
Output:
My name is FirstName
My name is FirstName
My name is FirstName
FirstName is a asp developer
Java developer is john smith
FirstName is a developer

Searching for a title that 'starts with' in Python?

So I have a list of names
name_list = ["John Smith", "John Wrinkle", "John Wayne", "David John", "David Wrinkle", "David Wayne"]
I want to be able to search, for example, John and
John Smith
John Wrinkle
John Wayne
will display. At the moment my code will display
John Smith
John Wrinkle
John Wayne
David John
What am I doing wrong?
Here is my code
search = input(str("Search: "))
search = search.lower()
matches = [name for name in name_list if search in name]
for i in matches:
    if search == "":
        print("Empty search field")
        break
    else:
        i = i.title()
        print(i)
Change your matches to:
matches = [name for name in name_list if name.lower().startswith(search)]
(lowercasing name as well, since you lowercase the search input). You can also make some changes to your code:
# You can do this in one go
search = input(str("Search: ")).lower()

# Why bother looping if no search string was provided.
if not search:
    print("Empty search field")
else:
    # This can be a generator
    for i in (name for name in name_list if name.lower().startswith(search)):
        print(i.title())
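For instance, with the name list from the question, the lowercase prefix match keeps 'David John' out of the results:

name_list = ["John Smith", "John Wrinkle", "John Wayne",
             "David John", "David Wrinkle", "David Wayne"]
search = "john"
print([name for name in name_list if name.lower().startswith(search)])
# ['John Smith', 'John Wrinkle', 'John Wayne']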

Parsing unstructured text file with Python

I have a text file, a few snippets of which look like this:
Page 1 of 515
Closing Report for Company Name LLC
222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..
Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets#yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:
(there should be 2 blank lines here; this note is not in the actual text file)
Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson
242 N 9th Street, #100 & 200
File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%
Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia#yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation#yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama#cylon.net
Here's my code:
import re

file = open('file-path.txt', 'r')

# if there are more than two consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False

for line in file:
    line = line.rstrip('\n').strip()
    if line == '':
        if prev_blank == True:
            # end of the entry, append the entry
            if len(curr) > 0:
                entries.append(curr)
                print curr
            curr = []
            prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        str = ''
        start = False
        for item in line_list:
            if re.match('[a-zA-Z\s]+:.*', item):
                if len(str) > 0:
                    curr.append(str)
                str = item
                start = True
            elif start == True:
                str = str + ' ' + item
Here is the output:
['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']
My issues are as follows:
First, there should be 2 records in the output, and I'm only outputting one.
Second, in the top block of text, my script has trouble knowing where the previous value ends and the next one starts: 'Status: Fell Thru' should be one value, and fields like 'Primary closing party: Buyer', 'Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000', 'Commission: 33.00%' should all be caught.
Once this is parsed correctly, I will need to parse again to separate keys from values (i.e. 'Closing date': 01/10/2010), ideally in a list of dicts.
I can't think of a better way other than using regex to pick out keys, and then grabbing the snippets of text that follow.
When complete, I'd like a csv w/a header row filled with keys, that I can import into pandas w/read_csv. I've spent quite a few hours on this one..
(This isn't a complete answer, but it's too long for a comment).
Field names can have spaces (e.g. MLS number)
Several fields can appear on each line (e.g. Home: Pager:)
The Notes field has the time in it, with a : in it
These mean you can't take your approach to identifying the fieldnames by regex. It's impossible for it to know whether "MLS" is part of the previous data value or the subsequent fieldname.
Some of the Home: Pager: lines refer to the Seller, some to the Buyer or the Lessee Agent or the Leasing Agent. This means the naive line-by-line approach I take below doesn't work either.
This is the code I was working on, it runs against your test data but gives incorrect output due to the above. It's here for a reference of the approach I was taking:
replaces = [
    ('Closing Report for', 'Report_for:')
    ,('Report for', 'Report_for:')
    ,('File number', 'File_number')
    ,('Primary closing party', 'Primary_closing_party')
    ,('MLS number', 'MLS_number')
    ,('Sale Price', 'Sale_Price')
    ,('Account owner', 'Account_owner')
    # ...
    # etc.
]

def fix_linemash(data):
    # splits many fields on one line into several lines
    results = []
    mini_collection = []
    for token in data.split(' '):
        if ':' not in token:
            mini_collection.append(token)
        else:
            results.append(' '.join(mini_collection))
            mini_collection = [token]
    return [line for line in results if line]

def process_record(data):
    # takes a collection of lines
    # fixes them, and builds a record dict
    record = {}
    for old, new in replaces:
        data = data.replace(old, new)
    for line in fix_linemash(data):
        print line
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record

records = []
collection = []
blank_flag = False

for line in open('d:/lol.txt'):
    # Read through the file collecting lines and
    # looking for double blank lines;
    # every pair of blank lines, process the stored ones and reset
    line = line.strip()
    if line.startswith('Page '): continue
    if line.startswith('Printed on '): continue
    if not line and blank_flag:  # record finished
        records.append(process_record(' '.join(collection)))
        blank_flag = False
        collection = []
    elif not line:  # maybe end of record?
        blank_flag = True
    else:  # false alarm, record continues
        blank_flag = False
        collection.append(line)

for record in records:
    print record
I'm now thinking it would be a much better idea to do some pre-processing tidy-up steps over the data:
Strip out "Page n of n" and "Printed on ..." lines, and similar
Identify all valid field names, then break up the combined lines, meaning every line has one field only, fields start at the start of a line.
Run through and just process the Seller/Buyer/Agents blocks, replacing fieldnames with an identifying prefix, e.g. Email: -> Seller Email:.
Then write a record parser, which should be easy - check for two blank lines, split the lines at the first colon, use the left bit as the field name and the right bit as the value. Store however you want (nb. that dictionary keys are unordered).
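As a rough sketch of that final record parser (the helper name parse_record and the sample lines are illustrative only, assuming the pre-processing above has already reduced each record to one "Field: value" per line):

def parse_record(lines):
    # split each cleaned line at the first colon: left is the field name, right the value
    record = {}
    for line in lines:
        if ':' in line:
            name, value = line.split(':', 1)
            record[name.strip()] = value.strip()
    return record

sample = ['File number: Jackie Grant', 'Status: Fell Thru',
          'Closing date: 11/11/2011', 'Sale price: $200,000']
print(parse_record(sample))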
I suppose it is easier to start a new record by hitting the word "Page".
Just to share a little of my own experience: it is just too difficult to write a generalized parser.
The situation isn't that bad given the data here. Instead of using a simple list to store an entry, use an object. Add all other fields as attributes/values to the object.
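A bare-bones sketch of that object-based suggestion (the Entry class and field names here are illustrative, not from the original post):

class Entry(object):
    # one Entry per record; parsed fields hang off the instance
    def __init__(self):
        self.fields = {}

    def add_field(self, name, value):
        self.fields[name] = value

entry = Entry()
entry.add_field('Status', 'Fell Thru')
entry.add_field('Closing date', '11/11/2011')
print(entry.fields)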
