obtain confidence/distance on country extracted by geograpy3 - python

I tried a simple demo to check whether geograpy can do what I'm looking for: finding the country name and ISO code in denormalized addresses (which is basically what geograpy is written for!).
The problem is that, in the test I made, geograpy finds several countries for each address, including the right one in most cases, but I can't find any parameter for deciding which country is the most "correct".
The list of fake addresses I used, which should reflect the real-world data to be analyzed, is this:
John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
John Doe 30 Huntington Terrace Newark, New York 07112 USA
John Doe 22 Huntington Terrace Newark, New York 07112 US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
This is the simple code I wrote:
import geograpy
ind = ["John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti",
"John Doe 160 Huntington Terrace Newark, New York 07112 United States of America",
"John Doe 30 Huntington Terrace Newark, New York 07112 USA",
"John Doe 22 Huntington Terrace Newark, New York 07112 US",
"Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia",
"Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy"]
locator = geograpy.locator.Locator()
for address in ind:
    places = geograpy.get_place_context(text=address)
    print(address)
    #print(places)
    for country in places.countries:
        print("Country:"+country+", IsoCode:"+locator.getCountry(name=country).iso)
    print()
and this is the output:
John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
Country:United States, IsoCode:US
Country:United Kingdom, IsoCode:GB
Country:Netherlands, IsoCode:NL
Country:Jamaica, IsoCode:JM
Country:Argentina, IsoCode:AR
John Doe 30 Huntington Terrace Newark, New York 07112 USA
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 22 Huntington Terrace Newark, New York 07112 US
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
Country:Italy, IsoCode:IT
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
First of all, the biggest problem is that for the Italian address ending in "Italia" it fails to find the right country (Italia/Italy) at all, and I don't know where the three countries it does find come from.
The second is that in most cases it finds wrong countries in addition to the right one, and I have no indicator of any kind, such as a confidence percentage or a distance, that would let me judge whether a located country is an acceptable answer and, when there are multiple results, which one is the "best".
I apologize in advance: I haven't had time to study geograpy3 in depth and I don't know if this is a stupid question, but I haven't found anything about confidence/probability/distance in the documentation.

I am answering as a committer of geograpy3.
It looks like you are using the legacy interface from geograpy version 1 for your first step and only then using the locator. For your use case the improved locator interface might be much more appropriate: it can use extra information such as population or GDP per capita to find the "most likely" country for disambiguation.
The Stati Uniti/United States and Italia/Italy issue is a language problem - see the long-standing open issue https://github.com/ushahidi/geograpy/issues/23 of geograpy version 1. As of today there seems to be no corresponding issue in geograpy3 yet - feel free to file one if you need this improvement.
I added your example to test_locator.py in the geograpy3 project to show the difference between the two concepts:
def testStackOverflow64379688(self):
    '''
    compare old and new geograpy interface
    '''
    examples=['John Doe 160 Huntington Terrace Newark, New York 07112 United States of America',
        'John Doe 30 Huntington Terrace Newark, New York 07112 USA',
        'John Doe 22 Huntington Terrace Newark, New York 07112 US',
        'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia',
        'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy',
        'Newark','Rome']
    for example in examples:
        city=geograpy.locateCity(example,debug=False)
        print(city)
Result:
None
None
None
None
None
Newark (US-NJ(New Jersey) - US(United States))
Rome (IT-62(Latium) - IT(Italy))
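To get back to your original goal (country name plus ISO code), the City returned by locateCity already carries its country. A minimal sketch, assuming the City object exposes a country with name and iso attributes, as the printed form "Newark (US-NJ(New Jersey) - US(United States))" suggests:
import geograpy

city = geograpy.locateCity('John Doe 30 Huntington Terrace Newark, New York 07112 USA')
if city is not None:  # locateCity returns None when it cannot disambiguate a city
    # the attribute names here are an assumption based on the printed representation above
    print("Country:" + city.country.name + ", IsoCode:" + city.country.iso)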

Related

How to use Faker's unique method with generator objects

I want to use Faker's multi-locale mode (https://maleficefakertest.readthedocs.io/en/docs-housekeeping/fakerclass.html#multiple-locale-mode) by passing a list of locales to my Faker object, then calling the generator for the respective locale and generating unique values when needed in my code.
The "unique" attribute works fine on a Faker object, but not when operating on a Faker Generator. I can see what is happening, but I was expecting/hoping that I could just use the same "unique" method in multi-locale mode. For example:
from faker import Faker
fake1 = Faker("en_US")
fake2 = Faker(["en_CA", "en_US"])
print(type(fake1))
print(fake1.state())
print(fake1.unique.state())
print(type(fake2["en_US"]))
print(fake2["en_US"].state())
print(fake2["en_US"].unique.state())
Gives:
<class 'faker.proxy.Faker'>
Arizona
Illinois
<class 'faker.generator.Generator'>
Oregon
Traceback (most recent call last):
File "test.py", line 12, in <module>
print(fake2["en_US"].unique.state())
AttributeError: 'Generator' object has no attribute 'unique'
Does anyone know a way to use "unique" with the multi-locale mode?
The good news is, Canada has provinces, so you can call fake2.unique.province() 13 times (10 provinces, 3 territories) before getting a UniquenessException.
The bad news is that, in playing around with your problem for another locale that also has states (Australia), I couldn't make much sense of the behaviour; it seems to depart from what you want.
Just from observation, it looks like you start getting a UniquenessException once one of the locales is exhausted (likely Australia in this case, having 8 states and territories).
fake3 = Faker(['en_US', 'en_AU'])
for i in range(70):
    try:
        print(fake3.unique.state())
    except:
        print("UniquenessException")
Iowa
Georgia
Tasmania
Idaho
Alabama
Australian Capital Territory
Maine
Montana
Western Australia
Victoria
Queensland
Kentucky
Pennsylvania
New South Wales
Rhode Island
Arizona
South Australia
Louisiana
Northern Territory
Missouri
UniquenessException
Nevada
UniquenessException
Alaska
Connecticut
UniquenessException
Delaware
Kansas
UniquenessException
Indiana
Texas
UniquenessException
UniquenessException
UniquenessException
New York
UniquenessException
UniquenessException
New Mexico
UniquenessException
UniquenessException
UniquenessException
Illinois
Mississippi
UniquenessException
West Virginia
Ohio
Arkansas
Wyoming
UniquenessException
UniquenessException
New Hampshire
UniquenessException
UniquenessException
Hawaii
UniquenessException
Vermont
UniquenessException
North Dakota
UniquenessException
North Carolina
South Carolina
Washington
UniquenessException
UniquenessException
Minnesota
UniquenessException
Utah
UniquenessException
UniquenessException
Tennessee
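One workaround, if per-locale uniqueness is enough for your use case, is to keep a separate single-locale Faker per locale and use each object's own unique proxy. This is a sketch, not an official multi-locale feature:
from faker import Faker

locales = ["en_CA", "en_US"]
# one Faker object, and therefore one uniqueness pool, per locale
fakers = {locale: Faker(locale) for locale in locales}

print(fakers["en_US"].unique.state())     # unique among en_US states
print(fakers["en_CA"].unique.province())  # unique among en_CA provinces/territories
Each object tracks uniqueness independently, so exhausting one locale's values does not raise exceptions for the others.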

BeautifulSoup Assistance

I am trying to scrape the following website (https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0) and ultimately am interested in storing some of the data inside each 'li class="search-result-item"' to perform further analytics.
Example of one "search-result-item"
I want to capture the <h3>, <span class="plaque-role"> and <span class="plaque-location"> in a Python dictionary:
<li class="search-result-item"><img class="search-result-image max-width" src="/siteassets/home/visit/blue-plaques/find-a-plaque/blue-plaques-f-j/helen-gwynne-vaughan-plaque.jpg?w=732&h=465&mode=crop&scale=both&cache=always&quality=60&anchor=&WebsiteVersion=20220516171525" alt="" title=""><div class="search-result-info"><h3>GWYNNE-VAUGHAN, Dame Helen (1879-1967)</h3><span class="plaque-role">Botanist and Military Officer</span><span class="plaque-location">Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden</span></div></li>
So far I am trying to isolate all the "search-result-item" elements, but my current code prints absolutely nothing. If someone can help me sort that problem out and point me in the right direction for storing each data element in a Python dictionary, I would be very grateful.
from bs4 import BeautifulSoup
import requests
url = 'https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup.prettify())
print(soup.find_all(class_='search-result-item'))
You're not getting anything because the search results are generated by JavaScript. Use the API endpoint they fetch the data from.
For example:
import requests
api_url = "https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0&_=1653043005731"
plaques = requests.get(api_url).json()["plaques"]
for plaque in plaques:
    print(plaque["title"])
    print(plaque["address"])
    print(f"https://www.english-heritage.org.uk{plaque['path']}")
    print("-" * 80)
Output:
GWYNNE-VAUGHAN, Dame Helen (1879-1967)
Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden
https://www.english-heritage.org.uk/visit/blue-plaques/helen-gwynne-vaughan/
--------------------------------------------------------------------------------
READING, Lady Stella (1894-1971)
41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/stella-lady-reading/
--------------------------------------------------------------------------------
32 SOHO SQUARE
32 Soho Square, Soho, London, W1D 3AP, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/soho-square/
--------------------------------------------------------------------------------
14 BUCKINGHAM STREET
14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/buckingham-street/
--------------------------------------------------------------------------------
ABRAHAMS, Harold (1899-1978)
Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet
https://www.english-heritage.org.uk/visit/blue-plaques/abrahams-harold/
--------------------------------------------------------------------------------
ADAM, ROBERT and HOOD, THOMAS and GALSWORTHY, JOHN and BARRIE, SIR JAMES
1-3 Robert Street, Adelphi, Charing Cross, London, WC2N 6BN, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adam-hood-galsworthy-barrie/
--------------------------------------------------------------------------------
ADAMS, Henry Brooks (1838-1918)
98 Portland Place, Marylebone, London, W1B 1ET, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/united-states-embassy/
--------------------------------------------------------------------------------
ADELPHI, The
The Adelphi Terrace, Charing Cross, London, WC2N 6BJ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adelphi/
--------------------------------------------------------------------------------
ALDRIDGE, Ira (1807-1867)
5 Hamlet Road, Upper Norwood, London, SE19 2AP, London Borough of Bromley
https://www.english-heritage.org.uk/visit/blue-plaques/aldridge-ira/
--------------------------------------------------------------------------------
ALEXANDER, Sir George (1858-1918)
57 Pont Street, Chelsea, London, SW1X 0BD, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/george-alexander/
--------------------------------------------------------------------------------
ALLENBY, Field Marshal Edmund Henry Hynman, Viscount Allenby (1861-1936)
24 Wetherby Gardens, South Kensington, London, SW5 0JR, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/field-marshal-viscount-allenby/
--------------------------------------------------------------------------------
ALMA-TADEMA, Sir Lawrence, O.M. (1836-1912)
44 Grove End Road, St John's Wood, London, NW8 9NE, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/lawrence-alma-tadema/
--------------------------------------------------------------------------------
The content is generated dynamically by JavaScript, so you won't find the elements/info you are looking for with BeautifulSoup; use their API instead.
Example
import requests
url = 'https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url).json()
data = []
for e in page['plaques']:
    data.append({k: v for k, v in e.items() if k in ['title', 'professions', 'address']})
data
Output
[{'title': 'GWYNNE-VAUGHAN, Dame Helen (1879-1967)', 'address': 'Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden', 'professions': 'Botanist and Military Officer'}, {'title': 'READING, Lady Stella (1894-1971)', 'address': '41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster', 'professions': "Founder of the Women's Voluntary Service"}, {'title': '32 SOHO SQUARE', 'address': '32 Soho Square, Soho, London, W1D 3AP, City Of Westminster', 'professions': 'Botanists'}, {'title': '14 BUCKINGHAM STREET', 'address': '14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster', 'professions': 'Statesman, Diarist, Naval Official, Painter'}, {'title': 'ABRAHAMS, Harold (1899-1978)', 'address': 'Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet', 'professions': 'Athlete'}, ...]
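If you need more than one page of results, the pageBP/sizeBP query parameters suggest the endpoint is pageable. A sketch under that assumption, stopping once a page comes back empty:
import requests

base = "https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques"
records = []
page_no = 1
while True:
    params = {"pageBP": page_no, "sizeBP": 12, "borBP": 0, "keyBP": "", "catBP": 0}
    plaques = requests.get(base, params=params).json().get("plaques", [])
    if not plaques:
        break
    # keep only the fields of interest from each plaque
    records.extend({k: e.get(k) for k in ("title", "professions", "address")} for e in plaques)
    page_no += 1
print(len(records))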

extract `state` from address using geopy

I have a table with 10,000 address entries. I would like to obtain the full formatted address and the state from each entry.
How could I do this with the geopy module? Other modules such as geopandas are not available.
Example
address
------------------------------------------------------------------------------
Prosperity Drive Northwest, Huntsville, Alabama, United States
------------------------------------------------------------------------------
Eureka Drive, Fremont, Newark, Alameda County, 94538, United States
------------------------------------------------------------------------------
Messenger Loop NW, Facebook Data Center, Los Lunas, Valencia County, New Mexico
Desired
address with format | state
----------------------------------------------------------------------------------------------
Prosperity Drive Northwest, Huntsville, Madison County, Alabama, 35773, United States | Alabama
----------------------------------------------------------------------------------------------
Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States | California
----------------------------------------------------------------------------------------------
Messenger Loop NW, Facebook Data Center, Los Lunas, Valencia County, New Mexico, 87031, United States | New Mexico
Thank you for your time.
Note
I am aware of how to use the Nominatim geocoder and the extra module that can deal with pandas DataFrames, but the extra module is not available in this case.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
location = geolocator.geocode("Eureka Drive, Fremont, Newark, Alameda County, 94538, United States")
print(location.address)
Output:
Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States
Exactly as it's used in the documentation...
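To extract the state without parsing the formatted string, Nominatim can return the address broken into components: geopy's geocode accepts addressdetails=True, and the component keys (such as "state") come from Nominatim's response format. A sketch:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="specify_your_app_name_here")
location = geolocator.geocode(
    "Eureka Drive, Fremont, Newark, Alameda County, 94538, United States",
    addressdetails=True,  # ask Nominatim to return the address split into components
)
print(location.address)                      # full formatted address
print(location.raw["address"].get("state"))  # e.g. California
For 10,000 rows, keep Nominatim's usage policy in mind (roughly one request per second) and throttle the loop accordingly.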

Cannot turn string into an integer python3

I'm attempting to convert the following into integers. I have literally tried everything and keep getting errors.
For instance:
pop2007 = pop2007.astype('int32')
ValueError: invalid literal for int() with base 10: '4,779,736'
Below is the DF I'm trying to convert. I've even attempted the .values method with no success.
pop2007
Alabama 4,779,736
Alaska 710,231
Arizona 6,392,017
Arkansas 2,915,918
California 37,253,956
Colorado 5,029,196
Connecticut 3,574,097
Delaware 897,934
Florida 18,801,310
Georgia 9,687,653
Idaho 1,567,582
Illinois 12,830,632
Indiana 6,483,802
Iowa 3,046,355
Kansas 2,853,118
Kentucky 4,339,367
Louisiana 4,533,372
Maine 1,328,361
Maryland 5,773,552
Massachusetts 6,547,629
Michigan 9,883,640
Minnesota 5,303,925
Mississippi 2,967,297
Missouri 5,988,927
Montana 989,415
Nebraska 1,826,341
Nevada 2,700,551
New Hampshire 1,316,470
New Jersey 8,791,894
New Mexico 2059179
New York 19378102
North Carolina 9535483
North Dakota 672591
Ohio 11536504
Oklahoma 3751351
Oregon 3831074
Pennsylvania 12702379
Rhode Island 1052567
South Carolina 4625364
South Dakota 814180
Tennessee 6346105
Texas 25,145,561
Utah 2,763,885
Vermont 625,741
Virginia 8,001,024
Washington 6,724,540
West Virginia 1,852,994
Wisconsin 5,686,986
Wyoming 563,626
Name: 3, dtype: object
You can't turn a string with commas into an integer. Try this.
my_int = '1,000,000'
my_int = int(my_int.replace(',', ''))
print(my_int)
Have you tried pop2007.str.replace(',', '') to remove the commas from your string values before converting to integers?
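Putting that together for a pandas Series like the one shown (a sketch, assuming pop2007 holds strings, some with thousands separators and some without):
import pandas as pd

# a small sample of the data from the question
pop2007 = pd.Series({"Alabama": "4,779,736", "Alaska": "710,231", "New Mexico": "2059179"})

# strip the commas first, then cast; values without commas pass through unchanged
pop2007 = pop2007.str.replace(",", "", regex=False).astype("int32")
print(pop2007)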

Converting from HTML to CSV using Python

I'm trying to convert a table found on a website (full details and photo below) to a CSV file. I've started with the code below, but the table search isn't returning anything. I think it must have something to do with me not understanding the right naming convention for the table; any additional help toward my ultimate goal would be appreciated.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'https://www.privateequityinternational.com/database/#/pei-300'
page = requests.get(url) #gets info from page
soup = BeautifulSoup(page.content,'html.parser') #parses information
table = soup.findAll('table',{'class':'au-target pux--responsive-table'}) #collecting blocks of info inside of table
table
Output: []
In addition to the URL provided in the above code, I'm essentially trying to convert the table shown on the website (screenshot not reproduced here) to a CSV file.
The data is loaded from an external URL via Ajax. You can use the requests/json modules to get it:
import json
import requests
url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start=0'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data['data']:
    print('{:<5} {:<30} {}'.format(item['id'], item['name'], item['headquarters']))
Prints:
5611 Blackstone New York, United States
5579 The Carlyle Group Washington DC, United States
5586 KKR New York, United States
6701 TPG Fort Worth, United States
5591 Warburg Pincus New York, United States
1801 NB Alternatives New York, United States
6457 CVC Capital Partners Luxembourg, Luxembourg
6477 EQT Stockholm, Sweden
6361 Advent International Boston, United States
8411 Vista Equity Partners Austin, United States
6571 Leonard Green & Partners Los Angeles, United States
6782 Cinven London, United Kingdom
6389 Bain Capital Boston, United States
8096 Apollo Global Management New York, United States
8759 Thoma Bravo San Francisco, United States
7597 Insight Partners New York, United States
867 BlackRock New York, United States
5471 General Atlantic New York, United States
6639 Permira Advisers London, United Kingdom
5903 Brookfield Asset Management Toronto, Canada
6473 EnCap Investments Houston, United States
6497 Francisco Partners San Francisco, United States
6960 Platinum Equity Beverly Hills, United States
16331 Hillhouse Capital Group Hong Kong, Hong Kong
5595 Partners Group Baar-Zug, Switzerland
And a selenium version:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url) #gets info from page
time.sleep(5)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser') #parses information
table = soup.select_one('table.au-target.pux--responsive-table') #collecting blocks of info inside of table
dfs = pd.read_html(table.prettify())
df = pd.concat(dfs)
df.to_csv('file.csv')
print(df.head(25))
prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
It also saves the data to file.csv.
Note: you need selenium and geckodriver, and in this code geckodriver is expected at c:/program/geckodriver.exe.
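Since the ultimate goal is a CSV, the requests approach can also be paged and written straight to disk without selenium. A sketch, assuming the count/start parameters in the Ajax URL page through the full list:
import csv
import requests

base = "https://ra.pei.blaize.io/api/v1/institutions/pei-300s"
rows = []
start = 0
while True:
    batch = requests.get(base, params={"count": 25, "start": start}).json().get("data", [])
    if not batch:
        break
    rows.extend(batch)
    start += len(batch)

# keep only the fields used above and write them out
with open("pei300.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "headquarters"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)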
