I have a DataFrame as follows:
position_latitude position_longitude geohash
0 53.398940 10.069293 u1
1 53.408875 10.052669 u1
2 48.856350 9.171759 u0
3 48.856068 9.170798 u0
4 48.856350 9.171759 u0
What I want to do now is retrieve the nearest node to these positions, using different shapefiles based on the geohash.
So for every geohash group (e.g. u1), I want to load the graph from a file and then use this graph in a function to get the nearest node.
I could do it in a for loop; however, I think there are more efficient ways of doing so.
I thought of something like this:
df['nearestNode'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, x.geohash), axis=1)
However, I can't figure out how to load the graph only once per group, since it takes some time to get it from the file.
What I came up with so far:
groupHashed = geoSub.groupby('geohash')
geoSub['distance'] = np.nan

for name, group in groupHashed:
    G = osmnx.graph.graph_from_xml('geohash/' + name + '.osm', simplify=True, retain_all=False)
    geoSub['distance'] = geoSub.apply(
        lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G)
        if x.geohash == name else x.distance, axis=1)
It definitely seems to work; however, I feel like the if condition slows it down drastically.
Update: I just changed:
geoSub['distance'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G) if x.geohash == name else x.distance, axis=1)
to:
geoSub['distance'] = geoSub[geoSub['geohash'] == name].apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G), axis=1)
It's a lot faster now. Is there an even better method?
You can use transform.
I am stubbing G and getDistanceToEdge (as x + y + geohash[-1]) to show a working example:
import pandas as pd
from io import StringIO

data = StringIO("""
,position_latitude,position_longitude,geohash
0,53.398940,10.069293,u1
1,53.408875,10.052669,u1
2,48.856350,9.171759,u0
3,48.856068,9.170798,u0
4,48.856350,9.171759,u0
""")

df = pd.read_csv(data, index_col=0).fillna('')

def getDistanceToEdge(x, y, G):
    return x + y + G

def fun(pos):
    # stub: "load" the graph once per group from the geohash's last character
    G = int(pos.values[0][-1][-1])
    return pos.apply(lambda x: getDistanceToEdge(x[0], x[1], G))

df['pos'] = list(zip(df['position_latitude'], df['position_longitude'], df['geohash']))
df['distance'] = df.groupby(['geohash'])['pos'].transform(fun)
df = df.drop(['pos'], axis=1)
print(df)
Output:
position_latitude position_longitude geohash distance
0 53.398940 10.069293 u1 64.468233
1 53.408875 10.052669 u1 64.461544
2 48.856350 9.171759 u0 58.028109
3 48.856068 9.170798 u0 58.026866
4 48.856350 9.171759 u0 58.028109
As you can see, you can get the name of the group using pos.values[0][-1] inside the function fun. This works because we are framing the pos column as a tuple of (lat, lon, geohash), and every geohash within a group after groupby is the same. So within a group we can grab the geohash by taking the last value of the tuple (pos) of any row; pos.values[0][-1] gives the last value of the tuple of the first row.
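If you'd rather avoid packing the columns into a tuple, the same one-load-per-group idea also works with a plain groupby loop that assigns through each group's index. A minimal sketch, assuming osmnx, the per-geohash .osm files, and the real getDistanceToEdge from the question:

import osmnx

for name, group in geoSub.groupby('geohash'):
    # load each graph exactly once per geohash group
    G = osmnx.graph.graph_from_xml('geohash/' + name + '.osm', simplify=True, retain_all=False)
    # write the results back only to this group's rows
    geoSub.loc[group.index, 'distance'] = group.apply(
        lambda r: getDistanceToEdge(r.position_latitude, r.position_longitude, G), axis=1)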
I'm trying to solve a system of equations, and I would like to apply fsolve over a pandas DataFrame. How can I do that? This is my code:
import numpy as np
import pandas as pd
import scipy.optimize as opt

a = np.linspace(300, 400, 30)
b = np.random.randint(700, 18000, 30)
c = np.random.uniform(1.4, 4.0, 30)
df = pd.DataFrame({'A': a, 'B': b, 'C': c})

def func(zGuess, *Params):
    x, y, z = zGuess
    a, b, c = Params
    eq_1 = ((3.47 - np.log10(y))**2 + (np.log10(c) + 1.22)**2)**0.5
    eq_2 = (a / 101.32) * (101.32 / b)**z
    eq_3 = 0.381 * x + 0.05 * (b / 101.32) - 0.15
    return eq_1, eq_2, eq_3

zGuess = np.array([2.6, 20.2, 0.92])
df['result'] = df.apply(lambda x: opt.fsolve(func, zGuess, args=(x['A'], x['B'], x['C'])))
But it's still not working, and I can't see the problem.
The error KeyError: 'A'
basically means it can't find the reference to 'A'.
That's happening because apply doesn't operate on rows by default: it works column-wise (axis=0).
By passing axis=1 (the trailing 1 below), it will iterate over each row, where the column references 'A', 'B', ... can be found.
df['result'] = df.apply(lambda x: opt.fsolve(func, zGuess, args=(x['A'], x['B'], x['C'])), 1)
That, however, might not give the desired result, as it will save the whole output (an array) into a single column.
To avoid that, reference the three columns you want to create and build an iterator with zip(*...):
df['output_a'], df['output_b'], df['output_c'] = zip(*df.apply(lambda x: opt.fsolve(func, zGuess, args=(x['A'], x['B'], x['C'])), 1))
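On pandas 0.23+ there is also result_type='expand', which splits an array-like return value into columns directly. A sketch (the column names are just illustrative):

df[['output_a', 'output_b', 'output_c']] = df.apply(
    lambda x: opt.fsolve(func, zGuess, args=(x['A'], x['B'], x['C'])),
    axis=1, result_type='expand')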
I cannot find the source of a SettingWithCopyWarning. I tried to fix the assignment operations as suggested in the documentation, but it still gives me the SettingWithCopyWarning. Any help would be greatly appreciated.
def fuzzy_merge(df_left, df_right, key_left, key_right, threshold=84, limit=1):
    """
    df_left: the left table to join
    df_right: the right table to join
    key_left: the key column of the left table
    key_right: the key column of the right table
    threshold: how close the matches should be to return a match, based on Levenshtein distance
    limit: the amount of matches that will get returned, these are sorted high to low
    """
    s = df_right.loc[:, key_right].tolist()
    m = df_left.loc[:, key_left].apply(lambda x: process.extract(x, s, limit=limit))
    df_left.loc[:, "matches"] = m
    m2 = df_left.loc[:, "matches"].apply(
        lambda x: ", ".join([i[0] for i in x if i[1] >= threshold])
    )
    df_left.loc[:, "matches"] = m2
    return df_left
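One common cause is that the frame passed into the function is itself a slice (view) of another DataFrame, so even .loc assignments inside the function can trigger the warning. A minimal sketch of that fix, assuming process comes from fuzzywuzzy/thefuzz as in the original:

from thefuzz import process  # assumption: or fuzzywuzzy, whichever the original used

def fuzzy_merge_safe(df_left, df_right, key_left, key_right, threshold=84, limit=1):
    # take an explicit copy so assignments never touch a view of a parent frame
    df_left = df_left.copy()
    s = df_right[key_right].tolist()
    matches = df_left[key_left].apply(lambda x: process.extract(x, s, limit=limit))
    df_left["matches"] = matches.apply(
        lambda x: ", ".join(i[0] for i in x if i[1] >= threshold)
    )
    return df_left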
I'd like to know the best way to get distances from the Google Maps Distance Matrix API for my dataframe of coordinates (origin and destination), which is around 75k rows.
  origin                       destination
1 (40.7127837, -74.0059413) (34.0522342, -118.2436849)
2 (41.8781136, -87.6297982) (29.7604267, -95.3698028)
3 (39.9525839, -75.1652215) (40.7127837, -74.0059413)
4 (41.8781136, -87.6297982) (34.0522342, -118.2436849)
5 (29.7604267, -95.3698028) (39.9525839, -75.1652215)
So far my code iterates through the dataframe and calls the API, copying the distance value into the new "distance" column.
# gmaps is an authenticated googlemaps.Client instance
df['distance'] = ""

for index, row in df.iterrows():
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK":  # handle "no result" exception
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
        df['distance'].iloc[index] = KM
    else:
        df['distance'].iloc[index] = 0

df.to_csv('distance.csv')
I get the desired result, but from what I've read, iterating through a dataframe is rather inefficient and should be avoided. It took 20 seconds for 240 rows, so the whole dataframe would take about 1.5 hours. Note that once it's done there is no need to re-run it; only a few new rows (~500) are added each month.
What would be the best solution here?
Edit: if anybody has experience with the Google Distance Matrix API and its limitations, any tips/best practices are welcome.
I tried to find out about any limitations on concurrent calls here but couldn't find anything. A few suggestions:
Avoid loops
Regarding your code, I'd rather skip the for loop and use apply first:
def get_gmaps_distance(row):
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK":
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
    else:
        KM = 0
    return KM

df["distance"] = df.apply(get_gmaps_distance, axis=1)
Split your dataframe and use multiprocessing
import multiprocessing as mp

def parallelize(fun, vec, cores=mp.cpu_count()):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

# split your dataframe into as many chunks as you have cores
df = np.array_split(df, mp.cpu_count())

# this applies your function to every chunk
def parallel_distance(x):
    x["distance"] = x.apply(get_gmaps_distance, axis=1)
    return x

df = parallelize(parallel_distance, df)
df = pd.concat(df, ignore_index=True, sort=False)
Do not calculate the same distance twice (save $$$)
In case you have duplicate rows, you should drop the duplicates first:
grp = df.drop_duplicates(["origin", "destination"]).reset_index(drop=True)
Here I didn't overwrite df, as it possibly contains more information you need; you can merge the results back into it.
grp["distance"] = grp.apply(get_gmaps_distance, axis=1)
df = pd.merge(df, grp, how="left")
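To be safe, you may want to merge on the key columns explicitly, so the join cannot accidentally match on other shared columns. A small sketch:

df = pd.merge(df, grp[["origin", "destination", "distance"]],
              on=["origin", "destination"], how="left")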
Reduce decimals
You should ask yourself this question: do I really need to be accurate to the 7th decimal place? As 1 degree of latitude is ~111 km, the 7th decimal place gives you a precision of about 1 cm. You get the idea from this when-less-is-more example, where reducing the number of decimals improved the model.
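If you go that route, here is a sketch of rounding to 5 decimal places (~1 m), assuming origin and destination hold (lat, lon) tuples as in the question:

for col in ["origin", "destination"]:
    # round both coordinates of each (lat, lon) tuple to 5 decimal places
    df[col] = df[col].apply(lambda t: (round(t[0], 5), round(t[1], 5)))

Rounding before the drop_duplicates step above also increases the number of duplicate hits, so you pay for fewer API calls.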
Conclusion
If you eventually manage to use all the suggested methods, you could see some interesting improvements. I'd like you to report the results here in a comment, as I don't have a personal API key to try it myself.
I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, holds a value from another dataframe. Here's a small example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
# this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
# this works for my small-scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]

# the error being "unterminated character set at position 1343770"
# I went to check what was on that line and it returned this
basket.iloc[1343770]
user_id                                  17060480
tweet       [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like this: ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up, doing only one delete

for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)

df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
import time

st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time() - st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time() - st)
43.99799990653992
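For completeness, the same set intersection can also be used to build a boolean mask, which avoids calling drop entirely. A short sketch:

screen = set(df2.z)
mask = df.tweet.apply(lambda tags: not screen.intersection(tags))
df = df[mask]  # keep only rows whose tweet list shares nothing with screen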
For me, your code works if I make several adjustments.
First, you're missing the last row when using range(df.tweet.size); either increase this or (more robustly, if you don't have an increasing index) use df.tweet.index.
Second, you never apply your dropping; use inplace=True for that.
Third, you have #d inside a string: '#c, #d, #e, #f' is not a list of tags, and you have to change it to a list for this to work.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # if we already dropped the row, we no longer need to check it
This will provide the desired result. Be aware that it is potentially not optimal due to the missing vectorization.
EDIT:
You can achieve the string being a list with the following:
from itertools import chain

df.tweet = df.tweet.apply(
    lambda l: [s.strip() for s in chain(*(e.split(",") for e in l))]
)
This applies a function to each row (assuming each row contains a list with one or more elements): split each element (a string) by comma, strip the surrounding whitespace, and "flatten" all the resulting lists of one row (if there are multiple) together.
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind and, after getting it working, try to improve your code (fewer for-loop iterations; use tricks like collecting the indices first and then dropping them all at once).
Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print(out)
y
a (1.3066417476, 0.203717485506)
b (0.0801133382517, 0.936811414675)
c (1.55784329113, 0.132360504653)
d (0.267999459642, 0.790989680709)
dtype: object
What is the correct way to "unpack" this structure so that I get a DataFrame with two columns?
A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:
t, p = zip(*out)
but then t is
(array(1.3066417475999257),
array(0.08011333825171714),
array(1.557843291126335),
array(0.267999459641651))
and one needs to take the extra step of squeezing it.
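For reference, one sketch of that squeezing step is to round-trip through NumPy:

import numpy as np

t, p = zip(*out)
t = np.asarray(t, dtype=float)  # collapses the tuple of 0-d arrays into a flat float array
p = np.asarray(p, dtype=float)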
Maybe this is the most straightforward (most Pythonic, I guess):
df = out.apply(pd.Series)
If you want to rename the columns to something more meaningful, then:
df.columns = ['Kstats', 'Pvalue']
If you do not want the default name for the index:
df.index.name = None
Maybe:
>>> pd.DataFrame(out.tolist(), columns=['out-1','out-2'], index=out.index)
out-1 out-2
y
a -1.9153853424536496 0.067433
b 1.277561889173181 0.213624
c 0.062021492729736116 0.951059
d 0.3036745009819999 0.763993
[4 rows x 2 columns]
I believe you want this:
df=pd.DataFrame(out.tolist())
df.columns=['KS-stat', 'P-value']
result:
KS-stat P-value
0 -2.12978778869 0.043643
1 3.50655433879 0.001813
2 -1.2221274198 0.233527
3 -0.977154419818 0.338240
I have met a similar problem. The two ways I found to solve it are exactly the answers of @CT Zhu and @Siraj S.
Here is some supplementary information you might find interesting:
I compared the two ways and found that @CT Zhu's way performs much faster as the size of the input grows.
Example:
# Python 3
import time
from statistics import mean

import pandas as pd

df_a = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# function to test
def func1(x):
    c = str(x) * 3
    d = int(x) + 100
    return c, d

# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.14907703161239624

# CT Zhu's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.0014058423042297363
PS: Please forgive my ugly code.
I'm not sure whether t and r are predefined somewhere, but if not, I get the two tuples passed to t and r by:
>>> t, r = zip(*out)
>>> t
(-1.776982300308175, 0.10543682705459552, -1.7206831272759038, 1.0062163376448068)
>>> r
(0.08824925924534484, 0.9169054844258786, 0.09817788453771065, 0.3243492942246433)
Thus, you could do this:
>>> df = pd.DataFrame(columns=['t', 'r'])
>>> df.t, df.r = zip(*out)
>>> df
t r
0 -1.776982 0.088249
1 0.105437 0.916905
2 -1.720683 0.098178
3 1.006216 0.324349