I have a dataframe with the longitude column and latitude column. When I try to get the address using geolocator.reverse() I get the error ValueError: Must be a coordinate pair or Point
I can't for the life of me insert the lat and long into the reverse function without getting that error. I tried creating a tuple using list(zip(zips['Store_latitude'], zips['Store_longitude'])) but I get the same error.
Code:
import pandas as pd
from geopy.geocoders import Nominatim
from decimal import Decimal
from geopy.point import Point
zips = pd.read_excel("zips.xlsx")
geolocator = Nominatim(user_agent="geoapiExercises")
zips['Store_latitude']= zips['Store_latitude'].astype(str)
zips['Store_longitude'] = zips['Store_longitude'].astype(str)
zips['Location'] = list(zip(zips['Store_latitude'], zips['Store_longitude']))
zips['Address'] = geolocator.reverse(zips['Location'])
What my DataFrame looks like
Store_latitude    Store_longitude
34.2262225        -118.4508349
34.017667         -118.149135
I think you might first try with a plain tuple or a geopy.point.Point before going to a list, to check whether the package works all right on its own.
I tested just now as follows (Python 3.9.13, command line style)
import geopy
p = geopy.point.Point(51.4,3.45)
gl = geopy.geocoders.Nominatim(user_agent="my_test") # Without the user_agent it raises a ConfigurationError.
gl.reverse(p)
output:
Location(Vlissingen, Zeeland, Nederland, (51.49433865, 3.415005767601362, 0.0))
This is as expected.
Maybe you should cast your dataframe['Store_latitude'] and dataframe['Store_longitude'] to numbers before/after you convert to a list? Are they perhaps strings?
More information on your dataframe and content would be required to further assist, I think.
Good luck!
EDIT: added information after OP's comments below.
When you read your Excel file as zips = pd.read_excel("yourexcel.xlsx") you will get a pandas DataFrame.
The content of the dataframe is two columns (which will be of type Series) and each element will be a numpy.float64 (if your excel has real values as input and not strings!). You can check this using the type() command:
>>> type(zips)
<class 'pandas.core.frame.DataFrame'>
>>> type(zips['Lat'])
<class 'pandas.core.series.Series'>
>>> type(zips['Lat'][0])
<class 'numpy.float64'>
What you then do is convert these floats (decimal numbers) to strings (text) by performing zips[...] = zips[...].astype(str). There is no reason to do that, because your geolocator requires numbers, not text.
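If the columns have already been cast to strings, a minimal sketch of converting them back to numbers (assuming the column names from your question):

# Convert the string columns back to numeric values
zips['Store_latitude'] = zips['Store_latitude'].astype(float)
zips['Store_longitude'] = zips['Store_longitude'].astype(float)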
As shown in the comment by @Derek, you need to iterate over each row, and while doing so you can put the resulting Locations you receive from the geolocator in a new column.
So in the next block, I first create a new (empty) list. Then I iterate over pairs of lat, lon by combining your zips['Lat'] and zips['Lon'] using the zip command (so the naming of zips is a bit unlucky if you don't know the zip command; it may thus be confusing you). But don't worry: all it does is combine the entries of each row into the variables lat and lon. Within the for loop, I append the result of the geolocator lookup. Note that the argument of the reverse command is a tuple (lat, lon), so the complete syntax is reverse((lat, lon)). Instead of (lat, lon), you could also have created a Point as in my original example, but that is not necessary imo. (Note: for brevity I just write 'Lat' and 'Lon' instead of your Store... column names.)
Finally, assign the result list as a new column in your zip pandas dataframe.
import geopy as gp
# instantiate a geolocator
gl = gp.geocoders.Nominatim(user_agent="my_test")
locations = []  # Create empty list
# For loop over each pair of lat, lon
for lat, lon in zip(zips['Lat'], zips['Lon']):
    locations.append(gl.reverse((lat, lon)))
# Add extra column to your pandas table (address will be the column name)
zips = zips.assign(address=locations)
One thing you still may want is to have just the text string instead of the complete geopy Location object in your table.
To get that you write the for loop with this small modification ([0] as the first element of the Location object). Note that this won't work if the result of the lookup of a given row is empty (None). Then the [0] will raise an error.
# For loop over each pair of lat, lon
for lat, lon in zip(zips['Lat'], zips['Lon']):
    locations.append(gl.reverse((lat, lon))[0])
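As an alternative to indexing with [0], geopy's Location objects also expose an .address attribute; a small sketch that additionally guards against empty lookups (assuming the same columns as above):

for lat, lon in zip(zips['Lat'], zips['Lon']):
    loc = gl.reverse((lat, lon))
    # Keep only the address text, or None when the lookup returned nothing
    locations.append(loc.address if loc is not None else None)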
I hope this gets you going!
I am trying to convert the JSON in the bbox (bounding box) column into a simple array of values for a DL project in python in a Jupyter notebook.
The possible labels are the following categories: [glass, cardboard, trash, metal, paper].
[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]
TO
([191 70 183 311], 0)
I'm looking for help to convert the bbox column from the JSON object for a single CSV that contains all the image names and the related bboxes.
UPDATE
The current column is a series so I keep getting a "TypeError: the JSON object must be str, bytes or bytearray, not 'Series'" any time I try to apply JSON operations on the column. So far I have tried to convert the column into JSON object and then pull out the values from the keys.
BB_CSV
You'll want to use a JSON decoder: https://docs.python.org/3/library/json.html
import json
li = json.loads('''[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]''')
d = dictionary = li[0]
result = ([d[key] for key in "left top width height".split()], 0)
print(result)
Edit:
If you want to map the operation of extracting the values from the dictionary to all elements of the list, you can do:
extracted = []
for element in li:
    result = ([element[key] for key in "left top width height".split()], 0)
    extracted.append(result)
# print(extracted)
print(extracted[:10])
# `[:10]` is there to limit the number of items displayed to 10
Similarly, as per my comment, if you do not want commas between the extracted numbers in the list, you can use:
without_comma = []
for element, zero in extracted:
    result_string = "([{}], 0)".format(" ".join([str(value) for value in element]))
    without_comma.append(result_string)
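For the sample bbox from the question, the first entry should then print as the comma-free form:

print(without_comma[0])
# expected output: ([191 70 183 311], 0)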
It looks like each row of your bbox column contains a dictionary inside of a list. I've tried to replicate your problem as follows. Edit: Clarifying that the below solution assumes that what you're referring to as a "JSON object" is represented as a list containing a single dictionary, which is what it appears to be per your example and screenshot.
# Create empty sample DataFrame with one row
df = pd.DataFrame([None],columns=['bbox'])
# Assign your sample item to the first row
df['bbox'][0] = [{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]
Now, to simply unpack the row, you can do:
df['bbox_unpacked'] = df['bbox'].map(lambda x: x[0].values())
This will get you a new column containing the five dictionary values.
If you want to go further and apply your labels, you'll likely want to create a dictionary to contain your labeling logic. Per the example you're given in the comments, I've done:
labels = {
    'cardboard': 1,
    'trash': 2,
    'glass': 3
}
This should get you your desired layout if you want a one-line solution without writing your own function.
df['bbox_unpacked'] = df['bbox'].map(lambda x: (list(x[0].values())[:4],labels.get(list(x[0].values())[-1])))
A more readable solution would be to define your own function using the .apply() method. Edit: Since it looks like your JSON object is being stored as a str inside your DataFrame rows, I added json.loads(row) to process the string first before retrieving the keys. You'll need to import json to run.
import json
def unpack_bbox(row, labels):
    # load the string into a JSON object (in this
    # case a list of length one containing the dictionary);
    # index the list to its first item [0] and use the .values()
    # dictionary method to access the values only
    keys = list(json.loads(row)[0].values())
    bbox_values = keys[:4]
    bbox_label = keys[-1]
    label_value = labels.get(bbox_label)
    return bbox_values, label_value
df['bbox_unpacked'] = df['bbox'].apply(unpack_bbox,args=(labels,))
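A quick check on a single row, assuming the bbox column holds JSON strings like the sample from the question and the labels dict defined above:

# Hypothetical sample row as a JSON string
sample = '[{"left":191,"top":70,"width":183,"height":311,"label":"glass"}]'
print(unpack_bbox(sample, labels))
# expected: ([191, 70, 183, 311], 3)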
I'm writing a Python program to extract specific values from each cell in a .CSV file column and then make all the extracted values new columns.
Sample column cell:(This is actually a small part, the real cell contains much more data)
AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,"PossibleDataMissing":false,"StreamDirection":"FROM-to-
One value I'm trying to extract is the number 10 between "JitterInterArrival": and ,"JitterInterArrivalMax". But since each cell contains relatively long strings and special characters around it (such as ""), opener=re.escape(r"***") and closer=re.escape(r"***") wouldn't work.
Does anyone know a better solution? Thanks a lot!
IIUC, you have a json string and wish to get values from its attributes. So, given
s = '''
{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,
"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,
"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,
"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,
"PossibleDataMissing":false}]}
'''
You can do
>>> import json
>>> data = json.loads(s)
>>> ji = data['AudioStreams'][0]['JitterInterArrival']
>>> ji
10
In a data frame scenario, if you have a column col of strings such as the above, e.g.
df = pd.DataFrame({"col": [s]})
You can use transform passing json.loads as argument
df.col.transform(json.loads)
to get a Series of dictionaries. Then, you can manipulate these dicts or just access the data as done above.
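For example, a minimal sketch of pulling one field out of every row into a new column (assuming each string in df.col parses like s above):

parsed = df.col.transform(json.loads)
df['JitterInterArrival'] = parsed.apply(lambda d: d['AudioStreams'][0]['JitterInterArrival'])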
I have a table that contains 2 columns.
Column 1 | Column 2
----------------------------
unique_number | '123 Main St. Suite 100 Chicago, IL'
I've been exploring address parsing using https://parserator.datamade.us/api-docs and ideally would like to parse the address, and put the results into new columns.
import usaddress
addr='123 Main St. Suite 100 Chicago, IL'
Two options for returning parsed results, and I plan on using whichever is easier to add to a dataframe:
usaddress.parse(addr) The parse method will split your address string into components, and
label each component. (returns list)
usaddress.tag(addr) The tag method will try to be a little smarter: it will merge consecutive components, strip commas, and return an address type (returns an ordered dict).
There are 26 different tags available for an address using this parser.
However, Not all addresses will contain all of these tags.
I need to grab the full address for each row, parse it, and map the parsed results to each matching column in that same row.
What the tag data looks like using from_records (index isn't exactly ideal)
What the parse data looks like using from_records
I can't quite figure out the logic of row-by-row calculations and mapping the results.
First, create a column of json responses from the parsing service
df['json_response'] = df['address'].apply(usaddress.parse)
Next, combine all the jsons into a single json string
json_combined = json.dumps(list(df['json_response']))
Finally parse the combined json to a dataframe (after parsing the json string)
df_parsed = pd.io.json.json_normalize(json.loads(json_combined))
Now you should have a structured dataframe with all required columns which you can df.join with your original dataframe to produce a single unified dataset.
Just a note: depending on the structure of the returned json, you may need to pass further arguments to the pandas.io.json.json_normalize function. The example on the linked page is a good starting point.
Super late in posting this solution, but wanted to in case anyone else ran into the same problem
Address csv file headings:
name, address
Imports:
import pandas as pd
import numpy as np
import json
import itertools
import usaddress
def address_func(address):
    try:
        return usaddress.tag(address)
    except Exception:
        return [{'AddressConversion': 'Error'}]
# import file
file = pd.read_excel('addresses.xlsx')
# apply function
file['tag_response'] = file['Full Address'].apply(address_func)
# copy values to new column
file['tags'] = file.apply(lambda row: row['tag_response'][0], axis=1)
# dump json
tags_combined = json.dumps(list(file['tags']))
# create dataframe of parsed info
df_parsed = pd.io.json.json_normalize(json.loads(tags_combined))
# merge dataframes on index
merged = file.join(df_parsed)
First of all, I have created a function that takes a latitude and longitude as input in order to filter out ships not entering a particular zone.
check_devaiation_notInZone(LAT, LON)
It takes two inputs and returns True if a ship does not enter a particular zone.
Secondly, I got data on many ships, with Lat under one header and Lon under another header, in CSV format. So I need to feed data from the two columns into the function and create another column to store the output of the function.
After I looked at Pandas: How to use apply function to multiple columns. I found the solution df1['deviation'] = df1.apply(lambda row: check_devaiation_notInZone(row['Latitude'], row['Longitude']), axis = 1)
But I have no idea why it works. Can anyone explain the things inside the apply()?
A lambda function is just like a normal function but it has no name and can be used only in the place where it is defined.
lambda row: check_devaiation_notInZone(row['Latitude'], row['Longitude'])
is the same as:
def anyname(row):
    return check_devaiation_notInZone(row['Latitude'], row['Longitude'])
So in the apply you just call another function check_devaiation_notInZone with the parameters row['Latitude'], row['Longitude'].
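So the following would be equivalent to the lambda version; axis=1 makes apply pass each row to the function as a Series, which is why row['Latitude'] and row['Longitude'] work inside it:

df1['deviation'] = df1.apply(anyname, axis=1)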
My professor uses IDL and sent me a file of ASCII data that I need to eventually be able to read and manipulate.
He used the following command to read the data:
readcol, 'sn-full.txt', format='A,X,X,X,X,X,F,A,F,A,X,X,X,X,X,X,X,X,X,A,X,X,X,X,A,X,X,X,X,F,X,I,X,F,F,X,X,F,X,F,F,F,F,F,F', $
sn, off1, dir1, off2, dir2, type, gal, dist, htype, d1, d2, pa, ai, b, berr, b0, k, kerr
Here's a picture of what the first two rows look like: http://i.imgur.com/hT7YIE3.png
Since I'm not going to be an astronomer, I am using Python, but as I am new to it, I am having a hard time reading the data.
I know that his code assigns the data type A (string data) to column one, skips columns two through six by using an X, and then assigns the data type F (floating point) to column seven, etc. Then sn is assigned to the first column that isn't skipped, and so on.
I have been trying to replicate this by using either numpy.loadtxt("sn-full.txt") or ascii.read("sn-full.txt") but am not sure how to enter the dtype parameter. I know I could assign everything to be a certain data type, but how do I assign data types to individual columns?
Using astropy.io.ascii you should be able to read your file relatively easily:
from astropy.io import ascii
# Give names for ALL of the columns, as there is no easy way to skip columns
# for a table with no column header.
colnames = ('sn', 'gal_name1', 'gal_name2', 'year', 'month', 'day', ...)
table = ascii.read('sn-full.txt', Reader=ascii.NoHeader, names=colnames)
This gives you a table with all of the data columns. The fact that you have some columns you don't need is not a problem unless the table is mega-rows long. For the table you showed you don't need to specify the dtypes explicitly since io.ascii.read will figure them out correctly.
One slight catch here is that the table you've shown is really a fixed width table, meaning that all the columns line up vertically. Notice that the first row begins with 1998S NGC 3877. As long as every row has the same pattern with three space-delimited columns indicating the supernova name and the galaxy name as two words, then you're fine. But if any of the galaxy names are a single word then the parsing will fail. I suspect that if the IDL readcol is working then the corresponding io.ascii version should work out of the box. If not then io.ascii has a way of reading fixed width tables where you supply the column names and positions explicitly.
[EDIT]
Looks like in this case a fixed width reader is needed to inform the parser how to split the columns instead of just using space as delimiter. So basically you need to add two rows at the top of the table file, where the first one gives the column names and the second has dashes that indicate the span of each column:
  a       b          c
---- ------------ ------
 1.2 hello there     2
 2.4 worlds          3
It's also possible in astropy.io.ascii to just specify by code the start and stop position of each column if you don't have the option of modifying the input data file, e.g.:
>>> ascii.read(table, Reader=ascii.FixedWidthNoHeader,
...            names=('Name', 'Phone', 'TCP'),
...            col_starts=(0, 9, 18),
...            col_ends=(5, 17, 28))
http://casa.colorado.edu/~ginsbura/pyreadcol.htm looks like it does what you want. It emulates IDL's readcol function.
Another possibility is https://pypi.python.org/pypi/fortranformat. It looks like it might be more capable, and the data you're looking at is in fixed format, with the format specifiers (X, A, etc.) being Fortran format specifiers.
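A minimal sketch of the fortranformat approach; the format string and the sample line below are made up for illustration, and the real format must match the fixed-width layout of sn-full.txt:

import fortranformat as ff
# Hypothetical format: a 6-character string, one skipped character, one float
reader = ff.FortranRecordReader('(A6, 1X, F9.3)')
fields = reader.read('1998S    123.456')
# fields is a list of the parsed values, e.g. ['1998S ', 123.456]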
I would use Pandas for that particular purpose. The easiest way to do it is, assuming your columns are single-tab-separated:
import pandas as pd
import scipy as sp # Provides all functionality from numpy, too
mydata = pd.read_table(
    'filename.dat', sep='\t', header=None,
    names=['sn', 'gal_name1', 'gal_name2', 'year', 'month', ...],
    dtype={'sn': sp.float64, 'gal_name1': object, 'year': sp.int64, ...})
(Strings here fall into the general 'object' datatype).
Each column now has a name and can be accessed as mydata['colname'], and this can then be sliced like regular numpy 1D arrays like e.g. mydata['colname'][20:50] etc. etc.
Pandas has built-in plotting calls to matplotlib, so you can quickly get an overview of a numerical type column by mydata['column'].plot(), or two different columns against each other as mydata.plot('col1', 'col2'). All normal plotting keywords can be passed.
If you want to plot the data in a normal matplotlib routine, you can just pass the columns to matplotlib, where they will be treated as ordinary Numpy vectors.
Each column can be accessed as an ordinary Numpy vector as mydata['colname'].values.
EDIT
If your data are not uniformly separated, numpy's genfromtxt() function is better. You can then convert it to a Pandas DataFrame by
mydf = pd.DataFrame(myarray, columns=['col1', 'col2', ...])
mydf = mydf.astype({'col1': sp.float64, 'col2': object, ...})
(pd.DataFrame's dtype argument only accepts a single dtype, so per-column dtypes are set with .astype instead.)
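A sketch of the genfromtxt step that produces myarray; the column names are placeholders, and you would list a name for every column in sn-full.txt:

import numpy as np
# dtype=None lets genfromtxt infer a type per column; encoding=None reads
# text columns as str rather than bytes on Python 3.
myarray = np.genfromtxt('sn-full.txt', dtype=None, encoding=None,
                        names=['col1', 'col2', 'col3'])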