I have spent the last 2 months working on a script that cleans, formats, and geocodes addresses. It is quite successful right now, however, there are some addresses that are giving me problems.
For example:
Addresses such as 7TH AVENUE 530 xxxxxxxxxxx are causing my geolocation module to fail. You can assume that the x's are other text. The other text isn't causing errors, it's purely due to the street number coming after avenue. I currently have filters in my program to truncate the address after street suffixes, such as avenue, street, etc. Due to this the program will ultimately only send 7th avenue to the cleaning module, which isn't accurate.
How can I account for instances where there are a group of numbers immediately preceding the street suffix and then move them to the front of the address. Then I can continue as I already am and truncate the string after the suffix.
You can assume that I have a list of all of the street suffixes named patterns.
Thank you. Any help is greatly appreciated.
FURTHER CLARIFICATION: I would only need to perform this rearrangement of the string if the group of numbers was 3 digits or less, because the zip code will frequently come after the address suffix, and in cases like that I wouldn't want to rearrange the string.
I am not sure if this helps, but you can start with this:
import re
address = '7TH AVENUE 530 xxxxxxxxxxx'
m = re.search('(?<=AVENUE )\d{1,3}', address)
print (m.group(0))
>>> 530
Edit based on your comment:
import re
original = '7TH AVENUE 530 xxxxxxxxxxx'
patterns = ['street', 'avenue', 'road', 'place']
regex = r'(.*)(' + '|'.join(patterns) + r')(.*)'
address = re.sub(regex, r'\1\2', original.lower()).lstrip()
new_addr = re.search(r'(?<=%s )\d{1,3} ' % address, original.lower())
resulting_address = ' '.join([new_addr.group(0).strip(),address]) if new_addr else address
address = ' '.join(map(str.capitalize, resulting_address.split()))
Related
I'm using pandas to analyze data from 3 different sources, which are imported into dataframes and require modification to account for human error, as this data was all entered by humans and contains errors.
Specifically, I'm working with street names. Until now, I have been using .str.replace() to remove street types (st., street, blvd., ave., etc.), as shown below. This isn't working well enough, and I decided I would like to use regex to match a pattern, and transform that entire column from the original street name, to the pattern matched by regex.
df['street'] = df['street'].str.replace(r' avenue+', '', regex=True)
I've decided I would like to use regex to identify (and remove all other characters from the address column's fields): any number of integers, followed by a space, and then the first 3 number of alphabetic characters.
For example, "3762 pearl street" might become "3762 pea" if x is 3 with the following regex:
(\d+ )+\w{0,3}
How can I use panda's .str.replace to do this? I don't want to specify WHAT I want to replace with the second argument. I want to replace the original string with the pattern matched from regex.
Something that, in my mind, might work like this:
df['street'] = df['street'].str.replace(ORIGINAL STRING, r' (\d+ )+\w{0,3}, regex=True)
which might make 43 milford st. into "43 mil".
Thank you, please let me know if I'm being unclear.
you could use the extract method to overwrite the column with its own content
pat = r'(\d+\s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat)
Just an observation: The regex you shared (\d+ )+\w{0,3} matches the following patterns and returns some funky stuff as well
1131 1313 street
121 avenue
1 1 1 1 1 1 avenue
42
I've changed it up a bit based on what you described, but i'm not sure if that works for all your datapoints.
I'm very new to python and have been getting help from peers with developing this program. I essentially have a very unrefined, dynamic scraper, that pulls emails from a given URL.
I've been considering how I would go about matching up a first/last name to the email address, and come up with the idea of matching any 4+ consecutive characters before the '#' in an email to another element on the web page, under the assumption that most business's use at least some portion of first/last name in the creation of the email. I also decided to go with 4 characters to avoid any mix ups that might occur at 3+ characters, as I don't feel this is specific enough.
I hope this isn't too general of a question, I'm just unsure where to start.
Most of what I have found while pondering this question has been based on splitting the email and using regex to match, but I'm unsure if this will work on the page itself/how to implement.
import urllib.request,re
f = urllib.request.urlopen("http://www.sampleurl.com")
s = f.read().decode('utf-8')
print(re.findall(r"\+\d{2}\s?0?\d{10}",s))
print(re.findall(r"[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",s))
This a very basic version of a much larger program, but it is most of the inner workings. It returns email and phone number properly based on the given URL.
from email_split import email_split
import urllib.request,re
#Find Emails
f = urllib.request.urlopen("https://www.sampleurl.com/")
s = f.read().decode('utf-8')
e = (re.findall(r"[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",s))
emails = []
for x in e:
emails.append(str(x))
#Split Email
email = email_split(x)
str = email.local
match = re.search(r'([\w.-]+)', str)
if match:
print match.group()
It is sort of a general question, but I thought of this right away. I'm sort of thinking this would be a 3 step process:
1) extract the names on the website. I haven't used it, but sounds like you could use spaCy to pull out name entities. Then store those in some type of list of names.
2) extract all the email addresses on the site and store those in a list of email addresses.
3) Then use fuzzywuzzy to iterate through the names and find matches to an email address sample of fuzzywuzzy
Without having a specific website to try this on, it's just in theory, and I created a sample list to sort of show what I'm thinking:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
names = ['John A. Smith', 'Michael Jordan', 'James Thomas', 'Bradley Cooper']
emails = ["jasmith#foo.com", "john.smith#bar.com", "jthomas#foobar.com", "bradc#company.com"]
email_users = [ x.split('#')[0] for x in emails ]
email_dict = dict(zip(email_users, emails))
for name in names:
matches = process.extract(name, email_users, limit=3)
print ('%s: ' %name)
for each in matches:
if each[1] >= 70:
print ('\t%-20s - Score: %s' %(email_dict[each[0]], each[1]))
else:
continue
Output:
John A. Smith:
john.smith#bar.com - Score: 95
jasmith#foo.com - Score: 70
Michael Jordan:
James Thomas:
jthomas#foobar.com - Score: 77
Bradley Cooper:
bradc#company.com - Score: 72
I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
. How the address part can be extracted from the above text using NLTK? I have tried Stanford NER Tagger, which gives me only New York as Location. How to solve this?
Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) will return an array with all the occurrences found.
Pyap works best not just for this particular example but also for other addresses contained in texts.
text = ...
addresses = pyap.parse(text, country='US')
Checkout libpostal, a library dedicated to address extraction
It cannot extract address from raw text but may help in related tasks
For US address extraction from bulk text:
For US addresses in bulks of text I have pretty good luck, though not perfect with the below regex. It wont work on many of the oddity type addresses and only captures first 5 of the zip.
Explanation:
([0-9]{1,6}) - string of 1-5 digits to start off
(.{5,75}) - Any character 5-75 times. I looked at the addresses I was interested in and the vast vast majority were over 5 and under 60 characters for the address line 1, address 2 and city.
(BIG LIST OF AMERICAN STATS AND ABBERVIATIONS) - This is to match on states. Assumes state names will be Title Case.
.{1,2} - designed to accomodate many permutations of ,/s or just /s between the state and the zip
([0-9]{5}) - captures first 5 of the zip.
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
out_address = " ".join(address)
out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2 etc. I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some python solutions for this like usaddress
I have 1 Billion addresses which are kinda in a bad format like:
'12-as FS street, 456 DLGG Area, Rand. District, Sydney, Australia 32 1020203'
I need the output like
Column1:12AS
Column2: FS 456 DLGG Area
Column3: Rand
Column4: Sydney
Column5: Australia
Column6: 32
Column7: 1020203
So basically i need them to be separated as house number, address line, state, country, statecode, pincode and remove words like street, district, countryside, road etc.
Also I need to search for the most frequent words above a particular threshold.
You just need to write a parser. Its code would depend on data. Unless somebody has written parser for your specific data format.
List of immediate questions (incomplete):
1) Is comma the separator for all lines?
2) Is comma used inside values (e.g. inside street name)?
3) List of all words to be removed (road, rd., blvd. etc.)
4) Can address be in the form of "house name" instead of street with number?
This is a random example of address parser with some learning functionality:
https://github.com/datamade/usaddress
If your format and requirements are not exactly matching some existing parser, then you have to write on your own.
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
parse item.title for bandRaw, venue, date
if bandRaw == str(band)
send venue name + ", Dallas, TX" to google for geocoding
return lat,long
list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individual, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
if m:
bandRaw, venue, date = m.groups()
if band == bandRaw:
place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
lines.append(",".join([band, venue, date, lat, lng]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.