I have a huge dataframe with hundreds of thousands of rows for every single client. I want to summarise it into another dataframe in which a single row holds the summarised data for all the rows of one client.
The code below is only an excerpt; the full script contains about 1,000 more lines like it, and it takes a long time to execute. When I run the equivalent code in R, it is about 10 times faster. I am attaching the R code for reference as well.
Is there a way to make the Python code as fast as the R code?
Python Code:
for i in range(len(client)):
    print(i)
    sub = data.loc[data['Client Name']==client['Client Name'][i],:]
    client['requests'][i] = len(sub)
    client['ppt_req'][i] = len(sub)/sub['CID'].nunique()
    client['approval'][i] = (((sub['verify']=='Yes').sum())/client['requests'][i])*100
    client['denial'][i] = (((sub['verify']=='No').sum())/client['requests'][i])*100
    client['male'][i] = (((sub['gender']=='Male').sum())/client['requests'][i])*100
    client['female'][i] = (((sub['gender']=='Female').sum())/client['requests'][i])*100
R Code:
for (i in 1:nrow(client)) {
  print(i)
  #i=1
  sub <- subset(data, data$Client.Name == client$Client.Name[i])
  client$requests[i] <- nrow(sub)
  client$ppt_req[i] <- nrow(sub)/(length(unique(sub$CID)))
  client$approval[i] <- ((as.numeric(table(sub$verify=="Yes")["TRUE"]))/client$requests[i])*100
  client$denial[i] <- ((as.numeric(table(sub$verify=="No")["TRUE"]))/client$requests[i])*100
  client$male[i] <- ((as.numeric(table(sub$gender)["Male"]))/client$requests[i])*100
  client$female[i] <- ((as.numeric(table(sub$gender)["Female"]))/client$requests[i])*100
}
Iterating over a pandas dataframe with Python loops is very slow.
The main issue, however, comes from the line data.loc[data['Client Name']==client['Client Name'][i],:], which walks through the whole dataframe data for each client. This means the line ends up iterating over more than 100,000 strings more than 100,000 times, so tens of billions of costly string comparisons are made. On top of that, the per-group computations are repeated for each client.
You can solve this by using a groupby on the client names, which scans data only once, followed by a merge.
Here is a sketch of code (untested):
# If `data` contains many more client names than `client`, you can filter
# `data` before applying the `groupby` below, for example with:
# data = data[data['Client Name'].isin(client['Client Name'].unique())]
# Build a compact dataframe holding one row of summary information for each
# client name that appears in `data`.
clientDataInfos = pd.DataFrame([
    {
        'Client Name': name,  # key column, needed for the merge below
        'requests': len(group),
        'ppt_req': len(group) / group['CID'].nunique(),
        'approval': (((group['verify']=='Yes').sum()) / len(group)) * 100,
        'denial': (((group['verify']=='No').sum()) / len(group)) * 100,
        'male': (((group['gender']=='Male').sum()) / len(group)) * 100,
        'female': (((group['gender']=='Female').sum()) / len(group)) * 100
    } for name, group in data.groupby('Client Name')
])
# Extend `client` with the precomputed information in `clientDataInfos`.
# The added columns must not already exist in `client`.
client = client.merge(clientDataInfos, on='Client Name')
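As an alternative sketch, the same per-client summary can be computed without building one dictionary per group, by using named aggregation on the groupby. The sample data below is made up for illustration and the helper name `summarise_clients` is hypothetical; the column names follow the question, and it assumes the gender values are spelled 'Male' and 'Female'.

```python
import pandas as pd


def summarise_clients(data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per client with request counts and percentages."""
    # Precompute boolean flags once. The mean of a boolean column is the
    # fraction of True values, so multiplying by 100 gives a percentage.
    flags = data.assign(
        approved=data['verify'].eq('Yes'),
        denied=data['verify'].eq('No'),
        is_male=data['gender'].eq('Male'),
        is_female=data['gender'].eq('Female'),
    )
    summary = flags.groupby('Client Name').agg(
        requests=('CID', 'size'),
        unique_cids=('CID', 'nunique'),
        approval=('approved', 'mean'),
        denial=('denied', 'mean'),
        male=('is_male', 'mean'),
        female=('is_female', 'mean'),
    )
    pct_cols = ['approval', 'denial', 'male', 'female']
    summary[pct_cols] = summary[pct_cols] * 100
    summary['ppt_req'] = summary['requests'] / summary['unique_cids']
    return summary.drop(columns='unique_cids').reset_index()


# Made-up sample data, only to show the shape of the result.
data = pd.DataFrame({
    'Client Name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'CID': [1, 1, 2, 2, 7, 8],
    'verify': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
    'gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
})
summary = summarise_clients(data)
print(summary)
```

The result has one row per client and can be merged into `client` on 'Client Name' in the same way as above.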
Setup
I have multiple datasets, each in its own DataFrame. I'm running calculations within them before comparing my results to a separate DataFrame, which we can think of as constraints.
For example, let's say there are 2 sets of data in a dictionary:
df_data_1 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
df_data_2 = pd.DataFrame(np.random.randint(0,50,size=(10, 4)), columns=list('ABCD'))
data_sets = {'data_1': df_data_1, 'data_2': df_data_2}
and one set of constraints:
df_constraints = pd.DataFrame([['a', 10, 20, 10000000],
['b', 100, 200, 20000000],
['c', 1000, 2000, 30000000]])
df_constraints.columns = ['index', 'sumMin', 'sumMax', 'productMax']
df_constraints.set_index('index', inplace=True)
Function
I'm making calculations within each set of data and then comparing them to a set of constraints. For the sake of simplifying my question I am only comparing the data to the first row of constraints here, but in reality I have to compare the results of my calculations within each data-set to up to 20 sets of constraints.
Here is a simplified version of the function that I am trying to have run in parallel:
def test_func(df_data, df_constraints):
    # Run some calculations
    df = df_data.copy()
    df['sum'] = df.sum(axis=1)
    df['product'] = df.product(axis=1)
    # Compare results to constraints
    df['sumFit'] = ((df['sum'] > df_constraints.loc['a', 'sumMin']) &
                    (df['sum'] < df_constraints.loc['a', 'sumMax']))
    df['productFit'] = df['product'] < df_constraints.loc['a', 'productMax']
    # Analyze results
    count_sumFits = df['sumFit'].sum()
    count_productFits = df['productFit'].sum()
    df_results = pd.DataFrame([['data_set_1', count_sumFits, count_productFits]],
                              columns=['DataSet', 'FittingSums', 'FittingProducts'])
    df_results.set_index('DataSet', inplace=True)
    return df_results
Sequential Version
I can run this function sequentially through each set of data, iterating through the dictionary with a while loop and then appending the results as shown here, but with increased complexity this takes far longer than I would like. (It's ugly, but it works.)
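df_all_results = pd.DataFrame(columns=['FittingSums', 'FittingProducts'])
data_set_names = list(data_sets.keys())
n = 0
while n < len(data_sets):
    df_temp = test_func(data_sets[data_set_names[n]], df_constraints)
    # test_func returns a one-row frame indexed by 'DataSet', so read that row by position
    df_all_results.loc[n, 'FittingSums'] = df_temp['FittingSums'].iloc[0]
    df_all_results.loc[n, 'FittingProducts'] = df_temp['FittingProducts'].iloc[0]
    n += 1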
The Problem
When I have 25 data-sets and I'm running more complex analysis with more calculations, the run time ends up being minutes long, which is leading me to pursue concurrency/multiprocessing. I'm hoping to make this significantly faster, as it is one step of many that I'm trying to optimize and then run a few thousand times.
So, Multiprocessing...
Due to the need to pass two arguments to the function, I've been looking at mp.Pool.starmap and pool.map(partial(test_func, b=df_constraints), data_sets), but I haven't been able to get either method to work.
ex.1) mp.Pool.starmap
if __name__ == '__main__':
    pool = mp.Pool(processes = 8)
    output = pool.starmap(test_file.test_func, zip(data_sets, itertools.repeat(df_constraints)))
This is as far as I've been able to get. Is it possible to process data concurrently like this and then append the results to a dataframe? I don't need them to be in any particular order; I just want to get the data into the right format.
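I don't fully understand your code and your logic, but replace data_sets with data_sets.values(). Iterating over a dictionary yields its keys, so zip(data_sets, ...) passes the name strings to the function instead of the DataFrames: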
if __name__ == '__main__':
    pool = mp.Pool(processes = 8)
    output = pool.starmap(test_file.test_func, zip(data_sets.values(), itertools.repeat(df_constraints)))
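To also answer the part about collecting the results, here is a minimal sketch of the whole pattern. The worker `count_fits` is a simplified, hypothetical stand-in for `test_func` (it only checks the sum against constraint row 'a'), the helper `run_parallel` and the tiny data sets are made up, and the worker takes the data-set name as an extra argument so that each result row can be labelled.

```python
import itertools
import multiprocessing as mp

import pandas as pd


def count_fits(name, df_data, df_constraints):
    """Simplified stand-in for test_func: count rows whose sum fits constraint 'a'."""
    row_sums = df_data.sum(axis=1)
    fits = ((row_sums > df_constraints.loc['a', 'sumMin']) &
            (row_sums < df_constraints.loc['a', 'sumMax']))
    return pd.DataFrame({'FittingSums': [int(fits.sum())]},
                        index=pd.Index([name], name='DataSet'))


def run_parallel(data_sets, df_constraints, processes=2):
    # Each task is a (name, DataFrame, constraints) tuple; starmap unpacks it
    # into the three positional arguments of count_fits.
    tasks = zip(data_sets.keys(), data_sets.values(),
                itertools.repeat(df_constraints))
    with mp.Pool(processes=processes) as pool:
        results = pool.starmap(count_fits, tasks)
    # Stack the one-row results into a single DataFrame.
    return pd.concat(results)


# Made-up inputs, small enough to check by hand.
df_constraints = pd.DataFrame(
    {'sumMin': [10, 100], 'sumMax': [20, 200]}, index=['a', 'b'])
data_sets = {
    'data_1': pd.DataFrame({'A': [5, 1, 9], 'B': [6, 1, 20]}),
    'data_2': pd.DataFrame({'A': [7, 8], 'B': [7, 8]}),
}

if __name__ == '__main__':
    print(run_parallel(data_sets, df_constraints))
```

starmap returns the results in task order, so no extra bookkeeping is needed to match rows to data sets. Whether this is actually faster depends on how heavy each call is, since every DataFrame has to be pickled and sent to a worker process.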
I'd like to know the best way to get distances from the Google Maps distance API for my dataframe of coordinates (origin and destination), which has around 75k rows.
#Origin #Destination
1 (40.7127837, -74.0059413) (34.0522342, -118.2436849)
2 (41.8781136, -87.6297982) (29.7604267, -95.3698028)
3 (39.9525839, -75.1652215) (40.7127837, -74.0059413)
4 (41.8781136, -87.6297982) (34.0522342, -118.2436849)
5 (29.7604267, -95.3698028) (39.9525839, -75.1652215)
So far my code iterates through the dataframe and calls the API copying the distance value into the new "distance" column.
df['distance'] = ""
for index, row in df.iterrows():
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK": # Handle "no result" exception
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
        df['distance'].iloc[index] = KM
    else:
        df['distance'].iloc[index] = 0
df.to_csv('distance.csv')
I get the desired result, but from what I've read, iterating through a dataframe is rather inefficient and should be avoided. It took 20 seconds for 240 rows, so it would take roughly 1h30 to process the whole dataframe. Note that once this is done there is no need to re-run it; only a few new rows are added each month (~500).
What would be the best solution here?
Edit: if anybody has experience with the Google distance API and its limitations, any tips or best practices are welcome.
I tried to find out about any limitations on concurrent calls here, but I couldn't find anything. A few suggestions:
Avoid loops
For your code, I would first drop the for loop and use apply:
def get_gmaps_distance(row):
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK":
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
    else:
        KM = 0
    return KM
df["distance"] = df.apply(get_gmaps_distance, axis=1)
Split your dataframe and use multiprocessing
import multiprocessing as mp
import numpy as np
import pandas as pd
def parallelize(fun, vec, cores=mp.cpu_count()):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res
# split your dataframe into as many chunks as there are cores
df = np.array_split(df, mp.cpu_count())
# this applies your function to every chunk
def parallel_distance(x):
    x["distance"] = x.apply(get_gmaps_distance, axis=1)
    return x
df = parallelize(parallel_distance, df)
df = pd.concat(df, ignore_index=True, sort=False)
Do not calculate the same distance twice (save $$$)
If you have duplicate origin/destination pairs, query each distinct pair only once:
grp = df[["origin", "destination"]].drop_duplicates().reset_index(drop=True)
Here I didn't overwrite df, as it possibly contains more information that you need; you can merge the results back into it.
grp["distance"] = grp.apply(get_gmaps_distance, axis=1)
df = pd.merge(df, grp, how="left", on=["origin", "destination"])
Reduce decimals
Ask yourself this question: do I really need to be accurate to the 7th decimal place? As 1 degree of latitude is ~111 km, the 7th decimal place gives you a precision of ~1 cm. You can get the idea from the article "when less is more", where reducing the number of decimals improved the model.
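To make that arithmetic concrete, here is a small sketch that prints the approximate ground resolution for a given number of decimals and shows how rounding lets nearly identical coordinate pairs collapse into one. The 111 km per degree figure is the approximation used above; the helper names `precision_m` and `round_pair`, the choice of 4 decimals, and the second (slightly shifted) coordinate pair are assumptions made up for illustration.

```python
KM_PER_DEGREE = 111.0  # rough length of one degree of latitude


def precision_m(decimals):
    """Approximate ground resolution, in metres, of a latitude with this many decimals."""
    return KM_PER_DEGREE * 1000 / 10 ** decimals


def round_pair(origin, destination, decimals=4):
    """Round both coordinate tuples so that near-identical pairs share one key."""
    def _round(point):
        return (round(point[0], decimals), round(point[1], decimals))
    return _round(origin), _round(destination)


for d in (3, 5, 7):
    print(d, "decimals ->", precision_m(d), "m")

pairs = [
    ((40.7127837, -74.0059413), (34.0522342, -118.2436849)),
    # Same trip, with the origin shifted by well under a metre.
    ((40.7127801, -74.0059399), (34.0522342, -118.2436849)),
]
unique_pairs = {round_pair(o, dst) for o, dst in pairs}
print(len(pairs), "pairs ->", len(unique_pairs), "unique after rounding")
```

Rounding before dropping duplicates means fewer distinct pairs, and therefore fewer API calls. How many decimals are acceptable depends on how precise the driving distance needs to be.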
Conclusion
If you can use all the suggested methods together, you could get some interesting improvements. I'd like you to report your results in the comments, as I don't have a personal API key to try them myself.
My situation:
The CSV file has been converted to a data frame, df5, and all the columns used in the for loop below are of float type. This code works, but it takes many hours to process just 30,000 rows.
What I want from my situation:
I need to do the same operation on millions of rows, and I am looking for fixes or alternative solutions that make it considerably faster.
Below is the code I am currently using:
for row in np.arange(0,len(df5)):
    underlyingPrice = df5.iloc[row]['CLOSE_y']
    strikePrice = df5.iloc[row]['STRIKE_PR']
    interestRate = 10
    dayss = df5.iloc[row]['Days']
    optPrice = df5.iloc[row]['CLOSE_x']
    result = BS([underlyingPrice,strikePrice,interestRate,dayss], callPrice= optPrice)
    df5.iloc[row,df5.columns.get_loc('IV')]= result.impliedVolatility
Your loop takes values from each row to build another column, IV.
This can be done considerably faster with the apply method, which calls a function on each row (or column) to compute a result and avoids the repeated df5.iloc lookups and per-cell assignments.
Something like this:
def useBS(row):
    underlyingPrice = row['CLOSE_y']
    strikePrice = row['STRIKE_PR']
    interestRate = 10
    dayss = row['Days']
    optPrice = row['CLOSE_x']
    result = BS([underlyingPrice,strikePrice,interestRate,dayss], callPrice= optPrice)
    return result.impliedVolatility
df5['IV'] = df5.apply(useBS, axis=1)
I am attempting to collect counts of occurrences of an id between two points in time in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1m rows) containing a time of occurrence and an id for the account which caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hours, 1 day, etc.) prior to a specific occurrence and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops, and while that likely would have worked (eventually), I am looking for something a bit more efficient time-wise. I have also tried using a list comprehension and have run into some errors that I did not anticipate when dealing with datetime columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", 
"07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
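df = pd.DataFrame(data)
# parse the day-first strings so that they sort and subtract as real datetimes
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')
df = df.sort_values(['id', 'datetime'], ascending=True)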
# for loop attempt
from datetime import timedelta
totalAccounts = df['id'].unique()
for account in totalAccounts:
    oneHourCount=0
    subset = df[df['id'] == account]
    for i in range(len(subset)):
        current = subset['datetime'].iloc[i]
        onehour = current - timedelta(hours=1)
        for j in range(len(subset)):
            if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < current):
                oneHourCount+=1
#list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
    onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) an incredibly long runtime with the for loop or 2) a ValueError saying that the truth value of a Series is ambiguous. I know the issue is in how I deal with the datetimes, and perhaps it is just going to be slow-going, but I want to check here first just to make sure.
So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
from bisect import bisect_left, bisect_right
left = bisect_left(keys, subset['start_time'].iloc[i]) ## calculated time
right = bisect_right(keys, subset['datetime'].iloc[i]) ## actual time of occurrence
count = len(subset['datetime'][left:right])
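For readers who want a complete version of the bisection idea, here is a minimal sketch in plain Python. It assumes that `keys` in the snippet above is the sorted list of one account's timestamps, which the answer does not state. The function name `count_prior` and the sample events are made up. Unlike the snippet, it excludes the event itself and counts only earlier events in the window, matching the condition in the question's for loop.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime, timedelta


def count_prior(events, window):
    """For each (id, timestamp) event, count the events of the same id that
    fall in the half-open interval [timestamp - window, timestamp)."""
    by_id = defaultdict(list)
    for acc, ts in events:
        by_id[acc].append(ts)
    for stamps in by_id.values():
        stamps.sort()  # bisect requires sorted input

    counts = []
    for acc, ts in events:
        stamps = by_id[acc]
        left = bisect_left(stamps, ts - window)  # first event inside the window
        right = bisect_left(stamps, ts)          # first event at or after ts
        counts.append(right - left)
    return counts


fmt = "%d/%m/%Y %H:%M"
events = [
    ("A", datetime.strptime("24/06/2018 19:56", fmt)),
    ("A", datetime.strptime("24/06/2018 19:30", fmt)),
    ("A", datetime.strptime("24/06/2018 18:00", fmt)),
    ("B", datetime.strptime("24/06/2018 19:40", fmt)),
]
print(count_prior(events, timedelta(hours=1)))
print(count_prior(events, timedelta(hours=6)))
```

Each account's timestamps are sorted once, and every lookup is then a pair of binary searches, so the cost per event is logarithmic in the number of events for that account instead of linear as in the nested loops.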
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os
pd.set_option('io.hdf.default_format','table')
# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame
def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF
    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0
    frame = pd.concat([frame, span]) # DataFrame.append was removed in pandas 2.0
    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()
def createStore():
    store = pd.HDFStore('testdata.h5')
    return store
segMin = 0
segMax = 4
store = createStore()
frame = loadFrame()
print('Printing Frame')
print(frame)
print(frame.info())
for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)
print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])
print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])
store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2), instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
(df.groupby().sum() for df in store.select('df', chunksize=50000)))
In python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor, or if there are a large number of new groups, it may be preferable to initialise res as a zero-filled frame of the correct size (by collecting the unique group keys first, e.g. by looping through the chunks once), and then add in place.
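The difference between plain addition and add with fill_value=0 is easiest to see on a small example. The sketch below uses an in-memory list of made-up chunks in place of store.select(...), with the group key shipmentid borrowed from the question; the helper name `partial_sums` is hypothetical.

```python
from functools import reduce

import pandas as pd

# Three made-up chunks; shipment 3 only appears in the last one and
# shipment 1 is missing from it.
chunks = [
    pd.DataFrame({'shipmentid': [1, 2], 'qty': [4, 1]}),
    pd.DataFrame({'shipmentid': [1, 2], 'qty': [2, 2]}),
    pd.DataFrame({'shipmentid': [2, 3], 'qty': [5, 3]}),
]


def partial_sums(frames):
    """Yield the per-chunk group sums (the 'map' step)."""
    for frame in frames:
        yield frame.groupby('shipmentid').sum()


# Plain `+` aligns on the index and yields NaN wherever a key is missing
# from either operand, so groups absent from any chunk are lost.
naive = reduce(lambda x, y: x + y, partial_sums(chunks))

# `add` with fill_value=0 treats a missing key as zero instead.
total = reduce(lambda x, y: x.add(y, fill_value=0), partial_sums(chunks))

print(naive)
print(total)
```

This is the same effect the question describes: a shipmentid that is not present in every chunk drops out of the naive sum, while the fill_value version keeps it.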