I'd like to know what is the best solution to get distances from the google maps distance API for my dataframe composed of coordinates (origin & destination) which is around 75k rows.
#Origin #Destination
1 (40.7127837, -74.0059413) (34.0522342, -118.2436849)
2 (41.8781136, -87.6297982) (29.7604267, -95.3698028)
3 (39.9525839, -75.1652215) (40.7127837, -74.0059413)
4 (41.8781136, -87.6297982) (34.0522342, -118.2436849)
5 (29.7604267, -95.3698028) (39.9525839, -75.1652215)
So far my code iterates through the dataframe and calls the API copying the distance value into the new "distance" column.
df['distance'] = ""
for index, row in df.iterrows():
result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
status = result['rows'][0]['elements'][0]['status']
if status == "OK": # Handle "no result" exception
KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
df['distance'].iloc[index] = KM
else:
df['distance'].iloc[index] = 0
df.to_csv('distance.csv')
I get the desired result but from what I've read iterating through dataframe is rather inefficient and should be avoided. It took 20 secondes for 240 rows, so it would take 1h30 to do all dataframe. Note that once done, no need to re-run anymore, only new few new rows a month (~500).
What would we the best solution here ?
Edit: if anybody has experience with the google distance API and its limitations any tips/best practices is welcomed.
I tried to understand about any limitations about concurrent calls here but I couldn't find anything. Few suggestions
Avoid loops
About your code I'd rather skip for loops and use apply first
def get_gmaps_distance(row):
result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
status = result['rows'][0]['elements'][0]['status']
if status == "OK":
KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
else:
KM = 0
return KM
df["distance"] = df.apply(get_gmaps_distance, axis=1)
Split your dataframe and use multiprocessing
import multiprocessing as mp
def parallelize(fun, vec, cores=mp.cpu_count()):
with mp.Pool(cores) as p:
res = p.map(fun, vec)
return res
# split your dataframe in many chunks as the number of cores
df = np.array_split(df, mp.cpu_count())
# this use your functions for every chunck
def parallel_distance(x):
x["distance"] = x.apply(get_gmaps_distance, axis=1)
return x
df = parallelize(parallel_distance, df)
df = pd.concat(df, ignore_index=True, sort=False)
Do not calculate twice the same distance (save $$$)
In case you have duplicates row you should drop some of them
grp = df.drop_duplicates(["origin", "destination"]).reset_index(drop=True)
Here I didn't overwrite df as it possibly contain more information you need and you can merge the results to it.
grp["distance"] = grp.apply(get_gmaps_distance, axis=1)
df = pd.merge(df, grp, how="left")
Reduce decimals
You should ask you this question: do I really need to be accurate to the 7th decimal? As 1 degree of latitude is ~111km the 7th decimal place gives you a precision up to ~1cm. You get the idea from this when-less-is-more where reducing decimals they improved the model.
Conclusion
If you can eventually use all the suggested methods you could get some interesting improvements. I'd like you to comment them here as I don't have a personal API key to try by myself.
Related
My situation:
The CSV file has been converted to a data frame df5 and all the columns being used in the for loop below are of float type, this code is working but taking many many hours to just do 30,000 rows.
What I want from my situation:
I need to do the same operation on millions of rows and I am looking for fixes/alternate solutions that make it considerably faster.
Below is the code I am using currently:
for row in np.arange(0,len(df5)):
underlyingPrice = df5.iloc[row]['CLOSE_y']
strikePrice = df5.iloc[row]['STRIKE_PR']
interestRate = 10
dayss = df5.iloc[row]['Days']
optPrice = df5.iloc[row]['CLOSE_x']
result = BS([underlyingPrice,strikePrice,interestRate,dayss], callPrice= optPrice)
df5.iloc[row,df5.columns.get_loc('IV')]= result.impliedVolatility
Your loop seems to take values from each row to build another column IV.
This can be done much faster by using the apply method, which allows to use a function on each row/column to calculate a result.
Something like this:
def useBS(row):
underlyingPrice = row['CLOSE_y']
strikePrice = row['STRIKE_PR']
interestRate = 10
dayss = row['Days']
optPrice = row['CLOSE_x']
result = BS([underlyingPrice,strikePrice,interestRate,dayss], callPrice= optPrice)
return result.impliedVolatility
df5['IV'] = df5.apply(useBS, axis=1)
I am attempting to collect counts of occurrences of an id between two time periods in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1m rows) containing a time of occurrence and an id for the account which caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hour, 1 day, etc.) prior a specific occurrence and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops and while it likely would have worked (eventually), I am looking for something a bit more efficient time-wise. I have also tried using list comprehension and have run into some errors that I did not anticipate when dealing with datetimes columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", "07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
df = pd.dataframe(data)
df = df.sort_values(['id', 'datetime'], ascending=True)
# for loop attempt
totalAccounts = df['id'].unique()
for account in totalAccounts:
oneHourCount=0
subset = df[df['id'] == account]
for i in range(len(subset)):
onehour = subset['datetime'].iloc[i] - timedelta(hours=1)
for j in range(len(subset)):
if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < sub):
oneHourCount+=1
#list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) incredibly long runtime with the for loop or 2) an ValueError regarding the truth of a series being ambiguous. I know the issue is dealing with the datetimes, and perhaps it is just going to be slow-going, but I want to check here first just to make sure.
So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
left = bisect_left(keys, subset['start_time'].iloc[i]) ## calculated time
right = bisect_right(keys, subset['datetime'].iloc[i]) ## actual time of occurrence
count=len(subset['datetime'][left:right]
I have a huge set of data. Something like 100k lines and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small time example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things i've tried
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
is ['#c, #d, #e, #f'] 1 string or a list like this ['#c', '#d', '#e', '#f'] ?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
simple solution would be
screen = set(df2.z.tolist())
to_delete = list() # this will speed things up doing only 1 delete
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
speed comparaison (for 10 000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, you're missing the last line when putting range(df.tweet.size), either increase this or (more robust, if you don't have an increasing index), use df.tweet.index.
Second, you don't apply your dropping, use inplace=True for that.
Third, you have #d in a string, the following is not a list: '#c, #d, #e, #f' and you have to change it to a list so it works.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break # so if we already dropped it we no longer look whether we should drop this line
This will provide the desired result. Be aware of this potentially being not optimal due to missing vectorization.
EDIT:
you can achieve the string being a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: lelem.split(","), l))))
This applies a function to each line (assuming each line contains a list with one or more elements): Split each element (should be a string) by comma into a new list and "flatten" all the lists in one line (if there are multiple) together.
EDIT2:
Yes, this is not really performant But basically does what was asked. Keep that in mind and after having it working, try to improve your code (less for iterations, do tricks like collecting the indices and then drop all of them).
I'm working with a csv file in Python using Pandas.
I'm having a few troubles thinking on how to achieve the following goal.
What I need to achieve is to group entries using a similarity function.
For example, each group X should contain all entries where each couple in the group differs for at most Y on a certain attribute-column value.
Given this example of CSV:
<pre>
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
</pre>
and considering the age column, with a difference of at most 3 on this value I would get the following groups:
A = {john;male;newyork;20
jack;male;newyork;21}
B={eric;male;san francisco;29
jenny;female;boston2;30}
C={mary;female;losangeles;45
maryanne;female;losangeles;48}
D={maryanne;female;losangeles;48
mattia;na;BostonDynamics;50}
Actually this is my work-around but I hope there exists something more pythonic.
import pandas as pandas
import numpy as numpy
def main():
csv_path = "../resources/dataset_string.csv"
csv_data_frame = pandas.read_csv(csv_path, delimiter=";")
print("\nOriginal Values:")
print(csv_data_frame)
sorted_df = csv_data_frame.sort_values(by=["age", "name"], kind="mergesort")
print("\nSorted Values by AGE & NAME:")
print(sorted_df)
min_age = int(numpy.min(sorted_df["age"]))
print("\nMin_Age:", min_age)
max_age = int(numpy.max(sorted_df["age"]))
print("\nMax_Age:", max_age)
threshold = 3
bins = numpy.arange(min_age, max_age, threshold)
print("Bins:", bins)
ind = numpy.digitize(sorted_df["age"], bins)
print(ind)
print("\n\nClustering by hand:\n")
current_min = min_age
for cluster in range(min_age, max_age, threshold):
next_min = current_min + threshold
print("<Cluster({})>".format(cluster))
print(sorted_df[(current_min <= sorted_df["age"]) & (sorted_df["age"] <= next_min)])
print("</Cluster({})>\n".format(cluster + threshold))
current_min = next_min
if __name__ == "__main__":
main()
On one attribute this is simple:
Sort
Linearly scan the data, and whenever the threshold is violated, begin a new group.
While this won't be optimal, it should be better than what you already have, at less cost.
However, in the multivariate case, finding he optimal groups is supposedly NP-hard, so finding the optimal grouping will require brute-force search in exponential time. So you will need to approximate this, either by AGNES (in O(n³)) or by CLINK (usually worse quality, but O(n²)).
As this is fairly expensive, it will not be a simple operator of your data frame.
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited ammount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os
pd.set_option('io.hdf.default_format','table')
# create a small dataframe to simulate the real data.
def loadFrame():
frame = pd.DataFrame()
frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
return frame
def pivotSegment(segmentNumber,passedFrame):
segmentSize = 3 #take 3 rows at a time
frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF
# ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
span = pd.DataFrame()
span['catid'] = range(1,5+1)
span['shipmentid']=1
span['qty']=0
frame = frame.append(span)
return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
aggfunc='sum',fill_value=0).reset_index()
def createStore():
store = pd.HDFStore('testdata.h5')
return store
segMin = 0
segMax = 4
store = createStore()
frame = loadFrame()
print('Printing Frame')
print(frame)
print(frame.info())
for i in range(segMin,segMax):
segment = pivotSegment(i,frame)
store.append('data',frame[(i*3):(i*3 + 3)])
store.append('pivotedData',segment)
print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])
print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])
store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
...
chunk # the chunk of the DataFrame (which you want to append)
store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2), instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
(df.groupby().sum() for df in store.select('df', chunksize=50000)))
In python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
res = res.add(c, fill_value=0)
If performance is poor / if there are a large number of new groups then it may be preferable to start the res as zero of the correct size (by getting the unique group keys e.g. by looping through the chunks), and then add in place.