Finding the Metropolitan Area or County for City, State - python

I have the following data in a dataframe:
location          amount
Jacksonville, FL  40
Provo, UT         20
Montgomery, AL    22
Los Angeles, CA   34
My dataset only contains U.S. cities in the form of [city name, state code] and I have no ZIP codes.
I want to determine either the metropolitan area or the county for each city, in order to visualize my data with ggcounty (like here).
I looked on the website of the U.S. Census Bureau but couldn't really find a table of city, state, county, or similar.
Given that I would prefer to solve the problem in R only, does anyone have an idea how to solve this?

You can get a ZIP code and more detailed info by doing this:
library(ggmap)
revgeocode(as.numeric(geocode('Jacksonville, FL')))
Hope it helps.
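If Python is an option (the thread is tagged python), a rough equivalent is to geocode each city and read the county out of the address details. This is only a sketch, not part of the answer above; it assumes a recent geopy and that Nominatim returns a "county" key for the place, which is not guaranteed for every city:
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="county-lookup")  # any descriptive user agent string

def lookup_county(city_state):
    # addressdetails=True asks Nominatim to include the address breakdown in the raw response
    loc = geolocator.geocode(city_state, addressdetails=True)
    if loc is None:
        return None
    return loc.raw.get("address", {}).get("county")

df["county"] = df["location"].apply(lookup_county)
Note that Nominatim rate-limits requests, so for a large dataframe a static city-to-county lookup table would be preferable to geocoding every row.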

Cenpy Library cannot fetch Pittsburgh, PA data

I am working on a project where I am trying to analyze a lot of ACS census data from Pittsburgh, PA, USA. I can easily go to data.census.gov to grab the data I need for the 138 census tracts I am looking for, but that is simply not efficient. So I downloaded the cenpy library which has been working great for New York City ACS data. Here is an example for New York City:
NYC_income = products.ACS(2019).from_place('New York City, NY', level='tract',
                                           variables=['B19013_001E'])
This works fine and will give me a geodataframe with the ACS variables I pass in. I have tried this for Pittsburgh and it does not work and the errors are not very helpful:
pgh_Test = products.ACS(2019).from_place('Pittsburgh, PA', level='tract',
                                         variables=['B01001A_001E'])
This will return an error:
KeyError: 'Response from API is malformed. You may have submitted too many queries,
formatted the request incorrectly, or experienced significant network connectivity
issues. Check to make sure that your inputs, like placenames, are spelled correctly,
and that your geographies match the level at which you intend to query. The original
error from the Census is:\\n(API ERROR 400:Failed to execute query.([]))'
I have also tried other variants of the spelling, such as Pittsburgh City, Pittsburgh city, and Pittsburg, and also tried spelling out the state instead of using the abbreviation.
Ultimately, I am curious if anyone has run into this issue and how to fix it so that I can access Pittsburgh ACS data via cenpy instead of selecting every individual census tract through data.census.gov.
Thank you in advance!
Use 'County Subdivision' as place_type. It seems that it helps to resolve the place correctly:
products.ACS(2019).from_place('Pittsburgh, PA',
                              place_type='County Subdivision',
                              level='tract',
                              variables=['B01001A_001E'])
Output:
Matched: Pittsburgh, PA to Pittsburgh city within layer County Subdivisions
GEOID geometry B01001A_001E state county tract
0 42003270300 POLYGON ((-8910344.550 4935795.800, -8910341.7... 1154.0 42 003 270300
1 42003980600 POLYGON ((-8909715.600 4933176.800, -8909606.6... 13.0 42 003 980600
2 42003051100 POLYGON ((-8903296.360 4930484.040, -8903251.9... 0.0 42 003 051100
3 42003050900 POLYGON ((-8903766.910 4931335.660, -8903642.5... 40.0 42 003 050900
4 42003562000 POLYGON ((-8901104.700 4930705.200, -8901104.1... 1826.0 42 003 562000
... ... ... ... ... ... ...
84 42003980500 POLYGON ((-8899981.160 4929217.570, -8899977.7... 16.0 42 003 980500
85 42003140200 POLYGON ((-8898569.740 4931230.040, -8898532.8... 1932.0 42 003 140200
86 42003111300 POLYGON ((-8898077.150 4934571.530, -8898053.1... 1499.0 42 003 111300
87 42003111500 POLYGON ((-8898240.670 4932660.630, -8898229.9... 942.0 42 003 111500
88 42003120700 POLYGON ((-8895502.550 4932516.230, -8895493.0... 17.0 42 003 120700
89 rows × 6 columns
Other values for this argument are 'Incorporated Place' and 'Census Designated Place'. From the documentation:
place_type : str
type of place to focus on, Incorporated Place, County Subdivision, or Census Designated Place.
See a demo in this colab.
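For reference, a self-contained version of the working call (a sketch; it assumes cenpy is installed and only adds the import that the snippets above leave implicit):
from cenpy import products

# place_type='County Subdivision' is what lets cenpy resolve 'Pittsburgh, PA'
pgh = products.ACS(2019).from_place('Pittsburgh, PA',
                                    place_type='County Subdivision',
                                    level='tract',
                                    variables=['B01001A_001E'])
print(pgh[['GEOID', 'B01001A_001E']].head())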

Parsing addresses from a blob of text in dataframe column

I am trying to use a library called pyap to parse addresses from text in a dataframe column.
My dataframe df has data in the following format:
MID TEXT_BODY
1 I live at 4998 Stairstep Lane Toronto ON
2 Let us catch up at the Ruby Restaurant. Here is the address 1234 Food Court Dr, Atlanta, GA 30030
The package website gives the following sample:
import pyap

test_address = """
I live at 4998 Stairstep Lane Toronto ON
"""
addresses = pyap.parse(test_address, country='CA')
for address in addresses:
    # shows found address
    print(address)
The sample returns the addresses as a list, but I would like to keep them in the dataframe as a new column.
The output I am expecting is a data frame like this:
MID ADDRESS TEXT_BODY
1 4998 Stairstep Lane Toronto ON I live at 4998 Stairstep Lane Toronto ON
2 1234 Food Court Dr, Atlanta, GA 30030 Let us catch up at the Ruby Restaurant. Here is the address 1234 Food Court Dr, Atlanta, GA 30030
I tried this:
df["ADDRESS"] = df['TEXT_BODY'].apply(lambda row: pyap.parse(row, country='US'))
But this does not work. I get an error:
TypeError: expected string or bytes-like object
How do I do this?
Apply is indeed the right direction.
def parse_address(addr):
    address = pyap.parse(addr, country="US")
    if not address:
        address = pyap.parse(addr, country="CA")
    return address[0]

df["addr"] = df.TEXT_BODY.apply(parse_address)
The result is:
MID TEXT_BODY addr
0 1 I live at 4998 Stairstep Lane Toronto ON 4998 Stairstep Lane Toronto ON
1 2 Let us catch up at the Ruby Restaurant. Here i... 1234 Food Court Dr, Atlanta, GA 30030
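If some rows contain no address, or non-string values (which is what triggers the TypeError in the question), a slightly more defensive variant may help. This is only a sketch, not part of the original answer:
import pandas as pd
import pyap

def parse_address_safe(text):
    # Coerce to str first so NaN/None rows don't raise TypeError
    for country in ("US", "CA"):
        matches = pyap.parse(str(text), country=country)
        if matches:
            return str(matches[0])  # an Address object stringifies to the full address
    return ""  # no address found in this row

df["ADDRESS"] = df["TEXT_BODY"].apply(parse_address_safe)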

Pandas : Create a new dataframe from 2 different dataframes using fuzzy matching [duplicate]

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis=1). My next goal is to compare each string in df1['Company'] to each string in df2['FDA Company'] using several different matching functions from the fuzzywuzzy module, and to return the value of the best match and its name in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL from df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The result would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But I got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
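Alternatively (a sketch, not from the original answer), process.extractOne can pull the single best df2 match and its score for each company directly into new columns; the column names below are illustrative:
import pandas as pd
from fuzzywuzzy import fuzz, process

def best_match(name):
    # extractOne returns (best_choice, score) when given a plain list of choices
    match, score = process.extractOne(name, df2['FDA Company'].tolist(),
                                      scorer=fuzz.token_sort_ratio)
    return pd.Series({'best_fda_company': match, 'best_score': score})

df1 = df1.join(df1['Company'].apply(best_match))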

Python AppEngine Sort Query Object Through Two Attributes

I have a model
class Neighborhood(db.Model):
    name = db.StringProperty(required=True)
    city = EncodedProperty(encoder=_get_city_name)
and I would like to retrieve all of the Neighborhood objects and sort them by both the "name" and "city" attributes. Initially, I tried
def retrieveAndSort():
    query = Neighborhood.all()
    query.order("name")
    query.order("city")
    return query
but after some further research, it seems GAE doesn't support sorting for EncodedProperty objects. Then, I tried to sort the data in Python after retrieving the query object with the sort() method but the Query object doesn't have this method. Lastly, I tried to sort the Query object with the sorted() method with the code:
neighborhoods = sorted(neighborhoods, key=attrgetter('city', 'name'))
and it almost worked. However, the data seemed jumbled and I received output like the following.
Abu Dhabi - Al Maryah Island
Minneapolis - Montrose
Atlanta - Buckhead
Atlanta - Home Park
Atlantic City - Boardwalk
Atlantic City - Marina
...
California - Saint Helena
California - Yountville
New York City - Central Park
Charlotte - South End
Charlotte - Third End
...
I have absolutely no idea why this occurs and would appreciate any help.
Edit:
Shorter sample output:
New York City - Meatpacking District
New York City - Brooklyn
New York City - Midtown West
I think you have to fetch the entities and then sort them:
from operator import attrgetter

neighborhoods = Neighborhood.all().fetch(1000)
neighborhoods = sorted(neighborhoods, key=attrgetter('city', 'name'))
The question is tagged as gae-datastore but "EncodedProperty" isn't a datastore property type.
https://cloud.google.com/appengine/docs/python/datastore/typesandpropertyclasses
The model should be:
class Neighborhood(db.Model):
    name = db.StringProperty(required=True)
    city = db.StringProperty(required=True)
Then your query function could simply be:
def retrieveAndSort():
    """
    Returns everything in Neighborhood model alphabetically
    """
    query = db.GqlQuery('SELECT * FROM Neighborhood ORDER BY city DESC')
    for place in query:
        print "{0} - {1}".format(place.city, place.name)

writing and saving CSV file from scraping data using python and Beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and keep a local copy on my computer.
I used Python and Beautiful Soup 4 to extract my data. I have gotten as far as extracting the data from the website, but I am having difficulty writing the script to export the data into a CSV file with the fields I need.
Attached below is my script. I need help creating code that will transfer my extracted data into a CSV file and save it to my desktop.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

for item in g_data1:
    try:
        print item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
    except:
        pass

for item in g_data2:
    try:
        print item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        pass
This is what I currently get when I run the script. I want to take this data and make it into a CSV table for geocoding later.
1801 Merrimac Trl
Williamsburg, Virginia 23185-5905
12551 Glades Rd
Boca Raton, Florida 33498-6830
Preserve Golf Club
13601 SW 115th Ave
Dunnellon, Florida 34432-5621
1000 Acres Ranch Resort
465 Warrensburg Rd
Stony Creek, New York 12878-1613
1757 Golf Club
45120 Waxpool Rd
Dulles, Virginia 20166-6923
27 Pines Golf Course
5611 Silverdale Rd
Sturgeon Bay, Wisconsin 54235-8308
3 Creek Ranch Golf Club
2625 S Park Loop Rd
Jackson, Wyoming 83001-9473
3 Lakes Golf Course
6700 Saltsburg Rd
Pittsburgh, Pennsylvania 15235-2130
3 Par At Four Points
8110 Aero Dr
San Diego, California 92123-1715
3 Parks Fairways
3841 N Florence Blvd
Florence, Arizona 85132
3-30 Golf & Country Club
101 Country Club Lane
Lowden, Iowa 52255
401 Par Golf
5715 Fayetteville Rd
Raleigh, North Carolina 27603-4525
93 Golf Ranch
406 E 200 S
Jerome, Idaho 83338-6731
A 1 Golf Center
1805 East Highway 30
Rockwall, Texas 75087
A H Blank Municipal Course
808 County Line Rd
Des Moines, Iowa 50320-6706
A-Bar-A Ranch Golf Course
Highway 230
Encampment, Wyoming 82325
A-Ga-Ming Golf Resort, Sundance
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A-Ga-Ming Golf Resort, Torch
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A. C. Read Golf Club, Bayou
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508
A. C. Read Golf Club, Bayview
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508
All you really need to do here is put your output in a list and then use the csv library to export it. I'm not entirely clear on what you are getting out of views-field-nothing-1, but to just focus on views-field-nothing, you could do something like:
courses_list = []
for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    course = [name, address1, address2]
    courses_list.append(course)
This will put the courses in a list; next you can write them to a CSV like so:
import csv

with open('filename.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
First of all, you want to put all of your items in a list and then write them to a file later, in case there is an error while you are scraping. Instead of printing, just append to a list.
Then you can write to a csv file:
f = open('filename', 'wb')
csv_writer = csv.writer(f)
for i in main_list:
    csv_writer.writerow(i)
f.close()
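Putting the two answers together, a minimal end-to-end sketch (Python 2 to match the code above; the header column names are illustrative, not from the original posts):
import csv

header = ["name", "address", "city_state_zip"]

with open("courses.csv", "wb") as f:    # on Python 3, use open("courses.csv", "w", newline="")
    writer = csv.writer(f)
    writer.writerow(header)             # one header row
    writer.writerows(courses_list)      # courses_list built in the scraping loop shown above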
