I have a dictionary with geohashes as keys and a value associated with each. I am looking up values from the dict to create a new column in my pandas DataFrame.
geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
df = pd.DataFrame({'geohash': ['9q5dx', '9qh0g', '9q9hv', '9q5dv'],
                   'label': ['a', 'b', 'c', 'd']})
df['value'] = df.apply(lambda x: geo_dict[x.geohash], axis=1)
I need to be able to handle non-matches, i.e. geohashes that do not exist in the dictionary. Expected handling below:
Find k nearby geohashes and compute the mean of their values
Assign the mean of the neighboring geohashes to the pandas column
Questions -
Is there a library I can use to find nearby geohashes?
How do I code up this solution?
The module pygeodesy has several functions to calculate the distance between geohashes. We can wrap this in a function that first checks if a match exists in the dict, and otherwise returns the mean value of the n closest geohashes:
import pygeodesy as pgd
import pandas as pd

geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
geo_df = pd.DataFrame(list(geo_dict.items()), columns=['geohash', 'value'])
df = pd.DataFrame({'geohash': ['9q5dx', '9qh0g', '9q9hv', '9q5dv'],
                   'label': ['a', 'b', 'c', 'd']})

def approximate_distance(geohash1, geohash2):
    return pgd.geohash.distance_(geohash1, geohash2)
    #return pgd.geohash.equirectangular_(geohash1, geohash2)  # alternative ways to calculate distance
    #return pgd.geohash.haversine_(geohash1, geohash2)

def get_value(x, n=2):  # set the number of closest geohashes used for the approximation with n
    val = geo_df.loc[geo_df['geohash'] == x]
    if not val.empty:
        return val['value'].iloc[0]
    else:
        geo_df['tmp_dist'] = geo_df['geohash'].apply(lambda y: approximate_distance(y, x))
        # the closest geohashes are those with the smallest distances, hence nsmallest
        return geo_df.nsmallest(n, 'tmp_dist')['value'].mean()

df['value'] = df['geohash'].apply(get_value)
result:

  geohash label  value
0   9q5dx     a   10.0
1   9qh0g     b   12.5
2   9q9hv     c   15.0
3   9q5dv     d   20.0
In one column, I have 4 possible (non-sequential) values: A, 2, +, ? and I want to order the rows according to the custom sequence 2, ?, A, +. I followed some code I found online:
order_by_custom = pd.CategoricalDtype(['2', '?', 'A', '+'], ordered=True)
df['column_name'].astype(order_by_custom)
df.sort_values('column_name', ignore_index=True)
But for some reason, although it does sort, it still does so according to alphabetical (or binary value) position rather than the order I've entered them in the order_by_custom object.
Any ideas?
.astype does return a Series after conversion, but you did not do anything with the result. Try assigning it back to your df. Consider the following example:
import pandas as pd
df = pd.DataFrame({'orderno':[1,2,3],'custom':['X','Y','Z']})
order_by_custom = pd.CategoricalDtype(['Z', 'Y', 'X'], ordered=True)
df['custom'] = df['custom'].astype(order_by_custom)
print(df.sort_values('custom'))
output
orderno custom
2 3 Z
1 2 Y
0 1 X
You can use a custom dictionary to sort it. For example, the dictionary would be:
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+' : 3}
If your column name is "my_column_name" then,
df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict))
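For example, a minimal sketch applying this to the values from the question (the column name is a placeholder):

import pandas as pd

df = pd.DataFrame({'my_column_name': ['A', '2', '+', '?']})
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+': 3}

# key= is applied to the column before sorting (requires pandas >= 1.1)
print(df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict)))
#   my_column_name
# 1              2
# 3              ?
# 0              A
# 2              +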
I have a DataFrame, for which I want to store values from the keyvalue column into variables, based on the value of keyname.
Example DataFrame:
keyname keyvalue
A 100,200
B 300
C 400
Expected output:
v_A = [100, 200]
v_B = 300
v_C = 400
While this is a more verbose approach, it's posted to demonstrate the basic concept for assigning keyvalue values to a list variable, based on the value of keyname.
v_A = df.loc[df['keyname'] == 'A', 'keyvalue'].to_list()
v_B = df.loc[df['keyname'] == 'B', 'keyvalue'].to_list()
v_C = df.loc[df['keyname'] == 'C', 'keyvalue'].to_list()
Output:
['100,200']
['300']
['400']
Close. What you need is a dictionary of keys and values, because creating variables from strings in Python is not recommended:
d = df.set_index('keyname')['keyvalue'].to_dict()
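For the example DataFrame above this gives {'A': '100,200', 'B': '300', 'C': '400'}, so d['A'] returns '100,200'.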
I can suggest two options.
Convert to a dictionary; that will give you the data as key-value pairs, if that is what you want.
df.set_index('keyname').to_dict()
output:
{'keyvalue': {'A': '100,200', 'B': '300', 'C': '400'}}
Take a transpose and you will get them as columns of the DataFrame, and then you can convert each to a list:
dft = df.set_index('keyname').T
v_A = list(map(int, dft['A'][0].split(",")))
v_B = list(map(int, dft['B'][0].split(",")))
v_C = list(map(int, dft['C'][0].split(",")))
print(v_A)
print(v_B)
print(v_C)
output:
[100, 200]
[300]
[400]
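Combining the two ideas, a short sketch (assuming every keyvalue is a comma-separated string of integers) that builds a single dict of integer lists instead of separate variables:

values = {k: list(map(int, v.split(',')))
          for k, v in df.set_index('keyname')['keyvalue'].to_dict().items()}
print(values)  # {'A': [100, 200], 'B': [300], 'C': [400]}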
I'm working with some geospatial data, df_geo, and I have a CSV of values I'd like to join to the location data frame, called df_data.
My issue, however, is that there are multiple ways to spell the values in the column I'd like to join the two data frames on (region names). Look at the Catalonia example below, in df_geo: there are 6 different ways to spell the region name, depending on the language.
My question is this: if the row is named "Catalonia" in df_data, how would I go about joining df_data to df_geo?
Since the rows are unique to a region, you can create a dictionary that maps any name in 'VARNAME_1' to the index from df_geo.
Then use this to map the names in df_data to a dummy column, and you can do a simple merge on the index in df_geo and the mapped column in df_data.
To get the dictionary do:
d = dict((y, ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
Sample Data:
import pandas as pd
df_geo = pd.DataFrame({'VARNAME_1': ['Catalogna\Catalogne\Catalonia', 'A\B\C\D\E\F\G']})
df_data = pd.DataFrame({'Name': ['Catalogna', 'Seven', 'E'],
                        'Vals': [1, 2, 3]})
Code
d = dict((y, ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
#{'A': 1,
# 'B': 1,
# 'C': 1,
# 'Catalogna': 0,
# 'Catalogne': 0,
# 'Catalonia': 0,
# 'D': 1,
# 'E': 1,
# 'F': 1,
# 'G': 1}
df_data['ID'] = df_data.Name.map(d)
df_data.merge(df_geo, left_on='ID', right_index=True, how='left').drop(columns='ID')
Output:
Name Vals VARNAME_1
0 Catalogna 1 Catalogna\Catalogne\Catalonia
1 Seven 2 NaN
2 E 3 A\B\C\D\E\F\G
How the dictionary works.
df_geo.VARNAME_1.str.split(r'\\') splits the string in VARNAME_1 on the '\' character and places all the separated values in a Series of lists. Using .items on that Series gives you tuples (which we unpack into two separate values): the first value is the index, which is the same as the index of the original DataFrame, and the second is the list of split names.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    print(f'id:{ids} and val:{val}')
#id:0 and val:['Catalogna', 'Catalogne', 'Catalonia']
#id:1 and val:['A', 'B', 'C', 'D', 'E', 'F', 'G']
So now val is a list, which we again iterate over to create our dictionary.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    for y in val:
        print(f'id:{ids} and y:{y}')
#id:0 and y:Catalogna
#id:0 and y:Catalogne
#id:0 and y:Catalonia
#id:1 and y:A
#id:1 and y:B
#id:1 and y:C
#id:1 and y:D
#id:1 and y:E
#id:1 and y:F
#id:1 and y:G
And so the dictionary I created has y as the key, and the original DataFrame index ids as the value.
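For reference, the same dictionary can also be built with pandas itself, assuming a pandas version with Series.explode (>= 0.25):

s = df_geo.VARNAME_1.str.split(r'\\').explode()  # one row per name, index repeated
d = {name: ids for ids, name in s.items()}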
Given a dataframe containing a numeric (float) series and a categorical ID (df), how can I create a dictionary of the form 'key': [], where the key is an ID from the dataframe and the list contains the differences between the numbers in the two dataframes?
I have managed this using loops though I am looking for a more pandas way of doing this.
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'a': [0.75435, 0.74897, 0.60949,
                         0.87438, 0.90885, 0.28547,
                         0.27327, 0.31078, 0.15576,
                         0.58139],
                   'id': list('aaaxxbbyyy')})
rl = pd.DataFrame({'b': [0.51, 0.30], 'id': ['aaa', 'bbb']})
interval = 0.1

d = defaultdict(list)
for index, row in rl.iterrows():
    # pandas >= 1.3 spells these inclusive='neither' / 'both' (older versions used False / True)
    before = df[df['a'].between(row['b'] - interval, row['b'], inclusive='neither')]
    after = df[df['a'].between(row['b'], row['b'] + interval, inclusive='both')]
    for x, b_row in before.iterrows():
        d[b_row['id']].append(b_row['a'] - row['b'])
    for x, a_row in after.iterrows():
        d[a_row['id']].append(a_row['a'] - row['b'])

for k, v in d.items():
    print('{k}\t{v}'.format(k=k, v=len(v)))
a 1
y 2
b 2
d
defaultdict(list,
{'a': [0.09948],
'b': [-0.01452, -0.02672],
'y': [0.07138, 0.01078]})
I am new to Python, thank you for all your help in advance!
I am having a lot of trouble accomplishing something in Python that is very easy to do in Excel.
I have a pandas data frame that looks like this:
df = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5],
     'c2': [4, 6, 7, None, 3],
     'c3': [0, None, 3, None, 4]})
Notice I have NaN values in columns c2 and c3.
I want to remove all rows with NaN in c2.
So the result should look like this:
c1: [1,2,3,5]
c2: [4,6,7,3]
c3: [0,NaN,3,4]
I tried all sorts of list comprehensions but they either contain bugs or won't give me the correct result.
I think this is close:
[x for x in df["c2"] if x != None]
You don't need a list comprehension, for a pure pandas solution:
df.dropna(subset=['c2'])
subset allows you to select columns to inspect.
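For the example frame above this returns:

   c1   c2   c3
0   1  4.0  0.0
1   2  6.0  NaN
2   3  7.0  3.0
4   5  3.0  4.0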
You're very close:

d = {'c1': [1, 2, 3, 4, 5],
     'c2': [4, 6, 7, None, 3],
     'c3': [0, None, 3, None, 4]}
for k in d:
    d[k] = [x for x in d[k] if x is not None]
df = pd.DataFrame(d)

Note, though, that pd.DataFrame requires all columns to have equal lengths, and filtering each list independently leaves c1, c2 and c3 with 5, 4 and 3 items here; this only works if you filter a single column (or pad the shorter lists).
Since all your columns are stored as lists, you can use c2.index(None) to get the index of None in c2. Then remove that index from each list using pop(). More documentation here: https://docs.python.org/2/tutorial/datastructures.html
Given this data:
data = {
'c1': [4,6,7,None,3],
'c2': [4,6,7,None,3],
'c3': [0,None,3,None,4]
}
Removal of the first instance:
The values equal to None can most efficiently be removed as follows:
ind = data['c2'].index(None)
data['c2'].pop(ind)
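This leaves data['c2'] as [4, 6, 7, 3].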
You may wish to implement a function to automate this:
def remove(data_set, item, value):
    ind = data_set[item].index(value)
    return data_set[item].pop(ind)
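For example, starting again from the original data:

remove(data, 'c2', None)  # pops the first None from data['c2'] in place
print(data['c2'])         # [4, 6, 7, 3]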
Removal of all instances:
Notice that this will remove only the first occurrence of None, or of any other value. To remove all occurrences efficiently and without explicit iteration, you may wish to do as follows (note that a set preserves neither order nor duplicates):
tmp = set(data['c2']) - set([None]*len(data['c2']))
data['c2'] = list(tmp)
or define a function:
def remove(data_set, item, value):
response = set(data_set[item]) - set([value] * len(data_set[item]))
return list(response)
whereby:
data['c2'] = remove(data, 'c2', None)
Comparison of results:
All of the above return the same elements for c2 (though the set-based versions do not guarantee the original order):
[4, 6, 7, 3]
The first 2 solutions, applied to c3, return:
[0, 3, None, 4]
whereas the last 2 solutions, however, return as follows if applied to c3:
[0, 3, 4]
Hope you find this helpful.