I have a table with 10,000 address entries. I would like to obtain the fully formatted address and the state for each one.
How could I do this with the geopy module? Other modules, such as geopandas, are not available.
Example
address
------------------------------------------------------------------------------
Prosperity Drive Northwest, Huntsville, Alabama, United States
------------------------------------------------------------------------------
Eureka Drive, Fremont, Newark, Alameda County, 94538, United States
------------------------------------------------------------------------------
Messenger Loop NW, Facebook Data Center, Los Lunas, Valencia County, New Mexico
Desired
address with format | state
----------------------------------------------------------------------------------------------
Prosperity Drive Northwest, Huntsville, Madison County, Alabama, 35773, United States | Alabama
----------------------------------------------------------------------------------------------
Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States | California
----------------------------------------------------------------------------------------------
Messenger Loop NW, Facebook Data Center, Los Lunas, Valencia County, New Mexico, 87031, United States | New Mexico
Thank you for your time.
Note
I am aware of how to use the Nominatim geocoder and the extra module that can handle pandas DataFrames, but that extra module is not available in this case.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
location = geolocator.geocode("Eureka Drive, Fremont, Newark, Alameda County, 94538, United States")
print(location.address)
Output:
Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States
Exactly as it's used in the documentation...
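Since the extra helper module isn't available, one option is to geocode each row yourself and pull the state out of the formatted address string. The helper below is only a sketch: it assumes the Nominatim output keeps its usual comma-separated order, with the country last and an optional ZIP code just before it, as in the examples above.

```python
import re

def extract_state(full_address):
    """Return the state from a Nominatim-formatted address string.

    Assumes the order '..., state, [zipcode,] country' seen in the
    examples above: the component before the country is the state,
    unless it is a ZIP code, in which case we step back one more.
    """
    parts = full_address.split(', ')
    if len(parts) >= 3 and re.fullmatch(r'\d{5}(-\d{4})?', parts[-2]):
        return parts[-3]
    return parts[-2]

state = extract_state(
    "Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States"
)
print(state)  # California
```

For the full table you would call geolocator.geocode once per row, storing both location.address and extract_state(location.address); Nominatim's usage policy allows at most one request per second, so add a time.sleep(1) between calls. Alternatively, passing addressdetails=True to geocode should make location.raw['address'] a dict with a 'state' key, avoiding the string parsing.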
Related
I have a dataframe column 'address' with values like this in each row:
3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)
Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)
I need to keep only the value Bronx / Queens / Manhattan / Staten Island from each row.
Is there any way to do this?
Thanks in advance.
One option, assuming the values are always in the same position, is to use .split(', ')[2]:
"3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)".split(', ')[2]
If the source file is a CSV (comma-separated values), I would have a look at pandas and pandas.read_csv('filename.csv') and leverage all the nice features pandas provides.
If the values are not always in the same position and you only need to know whether each value is in a given set:
import pandas as pd
df = pd.DataFrame(["The Bronx", "Queens", "Man"])
df.isin(["Queens", "The Bronx"])
You could add a column, let's call it 'district', and then populate it like this:
import pandas as pd

df = pd.DataFrame({'address': ["3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
                               "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)"]})
districts = ['Bronx', 'Queens', 'Manhattan', 'Staten Island']
df['district'] = ''
for district in districts:
    df.loc[df['address'].str.contains(district), 'district'] = district
print(df)
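An alternative, sketched under the same assumption that one of the borough names always appears somewhere in the string, is a single str.extract with an alternation pattern instead of the loop:

```python
import pandas as pd

df = pd.DataFrame({'address': [
    "3466B, Jerome Avenue, The Bronx, Bronx County, New York, 10467, United States, (40.881836199999995, -73.88176324294639)",
    "Jackson Heights 74th Street - Roosevelt Avenue (7), 75th Street, Queens, Queens County, New York, 11372, United States, (40.74691655, -73.8914737373454)",
]})

# The first match of any borough name wins; expand=False returns a Series
pattern = r'(Bronx|Queens|Manhattan|Staten Island)'
df['district'] = df['address'].str.extract(pattern, expand=False)
print(df['district'].tolist())  # ['Bronx', 'Queens']
```

Rows with no borough name at all get NaN instead of an empty string, which you can fillna('') afterwards if needed.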
I tried a simple demo to check whether geograpy can do what I'm looking for: finding the country name and ISO code in denormalized addresses (which is basically what geograpy is written for!).
The problem is that, in the test I made, geograpy finds several countries for each address, including the right one in most cases, but I can't find any parameter to decide which country is the most "correct".
The list of fake addresses I used, which should reflect the kind of real data to be analyzed, is this:
John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
John Doe 30 Huntington Terrace Newark, New York 07112 USA
John Doe 22 Huntington Terrace Newark, New York 07112 US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
This is the simple code I wrote:
import geograpy
ind = ["John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti",
"John Doe 160 Huntington Terrace Newark, New York 07112 United States of America",
"John Doe 30 Huntington Terrace Newark, New York 07112 USA",
"John Doe 22 Huntington Terrace Newark, New York 07112 US",
"Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia",
"Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy"]
locator = geograpy.locator.Locator()
for address in ind:
    places = geograpy.get_place_context(text=address)
    print(address)
    # print(places)
    for country in places.countries:
        print("Country:" + country + ", IsoCode:" + locator.getCountry(name=country).iso)
    print()
and this is the output:
John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
Country:United States, IsoCode:US
Country:United Kingdom, IsoCode:GB
Country:Netherlands, IsoCode:NL
Country:Jamaica, IsoCode:JM
Country:Argentina, IsoCode:AR
John Doe 30 Huntington Terrace Newark, New York 07112 USA
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 22 Huntington Terrace Newark, New York 07112 US
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
Country:Italy, IsoCode:IT
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
First of all, the biggest problem is that for the Italian address (number 4) it is unable to find the right country (Italia/Italy) at all, and I don't know where the three countries it did find come from.
The second is that in most cases it finds wrong countries in addition to the right one, and I don't have any kind of indicator - a confidence percentage, a distance, or similar - that would let me judge whether a located country is acceptable as an answer and, when there are multiple results, which one is the "best".
I want to apologize in advance: I didn't have time to study geograpy3 in depth, so I don't know if this is a silly question, but I haven't found anything about confidence/probability/distance in the documentation.
I am answering as a committer of geograpy3.
It looks like you are trying to use the legacy interface from geograpy version 1 for your first step, and only then use the locator. For your use case the improved locator interface might be much more suitable: it can use extra information such as population or GDP per capita to find the "most likely" country for disambiguation.
The Stati Uniti/United States and Italia/Italy issue is a language problem - see the long-standing open issue https://github.com/ushahidi/geograpy/issues/23 of geograpy version 1. As of today there seems to be no corresponding issue in geograpy3 yet - feel free to file one if you need this improvement.
I added your example to test_locator.py in the geograpy3 project to show the difference in the concepts:
def testStackOverflow64379688(self):
    '''
    compare old and new geograpy interface
    '''
    examples = ['John Doe 160 Huntington Terrace Newark, New York 07112 United States of America',
                'John Doe 30 Huntington Terrace Newark, New York 07112 USA',
                'John Doe 22 Huntington Terrace Newark, New York 07112 US',
                'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia',
                'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy',
                'Newark', 'Rome']
    for example in examples:
        city = geograpy.locateCity(example, debug=False)
        print(city)
Result:
None
None
None
None
None
Newark (US-NJ(New Jersey) - US(United States))
Rome (IT-62(Latium) - IT(Italy))
I'm trying to convert a table found on a website (full details and photo below) to a CSV. I've started with the code below, but the table isn't returning anything. I think it must have something to do with me not understanding the right naming convention for the table; any additional help to achieve my ultimate goal will be appreciated.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'https://www.privateequityinternational.com/database/#/pei-300'
page = requests.get(url) #gets info from page
soup = BeautifulSoup(page.content,'html.parser') #parses information
table = soup.findAll('table',{'class':'au-target pux--responsive-table'}) #collecting blocks of info inside of table
table
Output: []
In addition to the URL provided in the above code, I'm essentially trying to convert the below table (found on the website) to a CSV file:
The data is loaded from an external URL via Ajax. You can use the requests/json modules to get it:
import json
import requests
url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start=0'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data['data']:
    print('{:<5} {:<30} {}'.format(item['id'], item['name'], item['headquarters']))
Prints:
5611 Blackstone New York, United States
5579 The Carlyle Group Washington DC, United States
5586 KKR New York, United States
6701 TPG Fort Worth, United States
5591 Warburg Pincus New York, United States
1801 NB Alternatives New York, United States
6457 CVC Capital Partners Luxembourg, Luxembourg
6477 EQT Stockholm, Sweden
6361 Advent International Boston, United States
8411 Vista Equity Partners Austin, United States
6571 Leonard Green & Partners Los Angeles, United States
6782 Cinven London, United Kingdom
6389 Bain Capital Boston, United States
8096 Apollo Global Management New York, United States
8759 Thoma Bravo San Francisco, United States
7597 Insight Partners New York, United States
867 BlackRock New York, United States
5471 General Atlantic New York, United States
6639 Permira Advisers London, United Kingdom
5903 Brookfield Asset Management Toronto, Canada
6473 EnCap Investments Houston, United States
6497 Francisco Partners San Francisco, United States
6960 Platinum Equity Beverly Hills, United States
16331 Hillhouse Capital Group Hong Kong, Hong Kong
5595 Partners Group Baar-Zug, Switzerland
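The count and start query parameters in that URL suggest the API is paginated. Assuming they work the usual way - page size and offset, which I have only inferred from the query string, not verified against any documentation - you could build one URL per page and concatenate the results. The URL-building part is sketched below:

```python
def page_urls(base_url, total, page_size=25):
    """Build one API URL per page, assuming 'count' and 'start' act as
    page-size and offset parameters (an assumption based on the query
    string seen above, not on documented behavior)."""
    return ['{}?count={}&start={}'.format(base_url, page_size, start)
            for start in range(0, total, page_size)]

base = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s'
urls = page_urls(base, total=300)
print(len(urls))  # 12 pages of 25
print(urls[1])    # second page starts at offset 25
```

Each page's JSON could then be fetched with requests.get(url).json() and the data lists concatenated before writing the CSV.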
And selenium version:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url) #gets info from page
time.sleep(5)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser') #parses information
table = soup.select_one('table.au-target.pux--responsive-table') #collecting blocks of info inside of table
dfs = pd.read_html(table.prettify())
df = pd.concat(dfs)
df.to_csv('file.csv')
print(df.head(25))
prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
It also saves the data to file.csv.
Note: you need selenium and geckodriver, and in this code geckodriver is expected to be located at c:/program/geckodriver.exe.
I have a data frame that looks like below:
City           State  Country
Chicago        IL     United States
Boston
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco
Sacramento
Vancouver      BC     Canada
Toronto
And I have 3 lists of values that are ready to fill in the None cells:
city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']
The order of the elements in these lists corresponds across lists: the first items of all three lists belong together, and so on. How can I fill in the empty cells and produce a result like below?
City           State  Country
Chicago        IL     United States
Boston         MA     United States
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco  CA     United States
Sacramento     CA     United States
Vancouver      BC     Canada
Toronto        ON     Canada
My code gives me an error and I'm stuck.
if df.loc[df['City'] == 'Boston']:
'State' = 'MA'
Any solution is welcome. Thank you.
Create two mappings, one for <city : state>, and another for <city : country>.
city_map = dict(zip(city, state))
country_map = dict(zip(city, country))
Next, use map together with fillna, so that only the missing cells are filled and existing values (like Chicago's IL) are preserved -
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))
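A self-contained run of this approach on the sample data, assuming the blank cells are NaN:

```python
import numpy as np
import pandas as pd

# Sample frame with NaN standing in for the blank cells
df = pd.DataFrame({
    'City': ['Chicago', 'Boston', 'San Diego', 'Los Angeles',
             'San Francisco', 'Sacramento', 'Vancouver', 'Toronto'],
    'State': ['IL', np.nan, 'CA', 'CA', np.nan, np.nan, 'BC', np.nan],
    'Country': ['United States', np.nan, 'United States', 'United States',
                np.nan, np.nan, 'Canada', np.nan],
})

city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']

# Map city -> state/country, filling only the NaN cells
df['State'] = df['State'].fillna(df['City'].map(dict(zip(city, state))))
df['Country'] = df['Country'].fillna(df['City'].map(dict(zip(city, country))))
print(df)
```

Because fillna only touches missing values, already-populated rows such as Chicago are left untouched even though they are absent from the lists.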
I am trying to compare two identical lists in Robot Framework. The code I am using is:
List Test
    Lists Should Be Equal    @{List_Of_States_USA}    @{List_Of_States_USA-Temp}
and the lists are identical with the following values :
@{List_Of_States_USA}    Alabama    Alaska    American Samoa    Arizona    Arkansas    California    Colorado
... Connecticut Delaware District of Columbia Florida Georgia Guam Hawaii
... Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana
... Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri
... Montana National Nebraska Nevada New Hampshire New Jersey New Mexico
... New York North Carolina North Dakota Northern Mariana Islands Ohio Oklahoma Oregon
... Pennsylvania Puerto Rico Rhode Island South Carolina South Dakota Tennessee Texas
... Utah Vermont Virgin Islands Virginia Washington West Virginia Wisconsin
... Wyoming
This test fails with the following error:
FAIL Keyword 'Collections.Lists Should Be Equal' expected 2 to 5 arguments, got 114.
I have searched SO and other sites for a solution, but could not figure out why this happened. Thanks in advance for support
You need to use a $, not an @. When you use @, robot expands the list into multiple arguments.
From the robot framework user's guide:
When a variable is used as a scalar like ${EXAMPLE}, its value is used as-is. If a variable value is a list or list-like, it is also possible to use it as a list variable like @{EXAMPLE}. In this case individual list items are passed in as arguments separately.
Consider the case of @{foo} being a list with the values "one", "two" and "three". In such a case the following two are identical:
some keyword    @{foo}
some keyword    one    two    three
You need to change your statement to this:
Lists Should Be Equal    ${List_Of_States_USA}    ${List_Of_States_USA-Temp}
So, as suggested by Bryan Oakley above, I modified the test as follows:
${L1}    Create List    @{List_Of_States_USA}
${L2}    Create List    @{List_Of_States_USA-Temp}
Lists Should Be Equal    ${L1}    ${L2}
Now the test passes. Thanks again @Bryan