(Python) Pandas - GroupBy() using a similarity function

I'm working with a CSV file in Python using pandas, and I'm having trouble working out how to achieve the following.
I need to group entries using a similarity function.
For example, each group X should contain all entries such that every pair in the group differs by at most Y on a given column.
Given this example CSV:
<pre>
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
</pre>
and considering the age column with a maximum difference of 3, I would get the following groups:
A = {john;male;newyork;20
jack;male;newyork;21}
B = {eric;male;san francisco;29
jenny;female;boston2;30}
C = {mary;female;losangeles;45
maryanne;female;losangeles;48}
D = {maryanne;female;losangeles;48
mattia;na;BostonDynamics;50}
This is my current workaround, but I hope something more Pythonic exists.
import numpy
import pandas


def main():
    csv_path = "../resources/dataset_string.csv"
    csv_data_frame = pandas.read_csv(csv_path, delimiter=";")
    print("\nOriginal Values:")
    print(csv_data_frame)
    sorted_df = csv_data_frame.sort_values(by=["age", "name"], kind="mergesort")
    print("\nSorted Values by AGE & NAME:")
    print(sorted_df)
    min_age = int(numpy.min(sorted_df["age"]))
    print("\nMin_Age:", min_age)
    max_age = int(numpy.max(sorted_df["age"]))
    print("\nMax_Age:", max_age)
    threshold = 3
    bins = numpy.arange(min_age, max_age, threshold)
    print("Bins:", bins)
    ind = numpy.digitize(sorted_df["age"], bins)
    print(ind)
    print("\n\nClustering by hand:\n")
    current_min = min_age
    for cluster in range(min_age, max_age, threshold):
        next_min = current_min + threshold
        print("<Cluster({})>".format(cluster))
        print(sorted_df[(current_min <= sorted_df["age"]) & (sorted_df["age"] <= next_min)])
        print("</Cluster({})>\n".format(cluster + threshold))
        current_min = next_min


if __name__ == "__main__":
    main()

With a single attribute this is simple:
1. Sort the data by that attribute.
2. Scan it linearly and, whenever the threshold is violated, begin a new group.
While this won't be optimal, it should give better results than what you already have, at lower cost.
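A minimal sketch of the sort-and-scan idea, run on the sample CSV from the question. It assumes that "threshold violated" means the current value lies more than the threshold above the first value of the current group; the helper name group_by_threshold and the added group column are made up for this illustration. Unlike the groups listed in the question, the groups produced here never overlap.

```python
import io

import pandas as pd

CSV = """name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
"""


def group_by_threshold(df, column, threshold):
    """Sort by `column`, then start a new group whenever a value lies
    more than `threshold` above the first value of the current group."""
    ordered = df.sort_values(column, kind="mergesort")
    group_ids = []
    group_id = -1
    group_start = None
    for value in ordered[column]:
        if group_start is None or value - group_start > threshold:
            group_id += 1
            group_start = value
        group_ids.append(group_id)
    return ordered.assign(group=group_ids)


df = pd.read_csv(io.StringIO(CSV), sep=";")
grouped = group_by_threshold(df, "age", 3)
for gid, members in grouped.groupby("group"):
    print(gid, members["name"].tolist())
```

Because every member is compared with the first (smallest) value of its group, any two members of a group differ by at most the threshold. The price is that mattia (50) ends up alone instead of sharing a group with maryanne (48).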
In the multivariate case, however, finding the optimal groups is supposedly NP-hard, so an exact solution requires brute-force search in exponential time. You will therefore need to approximate, either with AGNES (in O(n³)) or with CLINK (usually worse quality, but O(n²)).
As this is fairly expensive, it will not be a simple operator on your data frame.

Related

How to count the duration of a field in a given value while having the field change history data?

I'm working with field change history data which has timestamps for when the field value was changed. In this example, I need to calculate the overall case duration in 'Termination in Progress' status.
The given case was changed from and to this status three times in total:
see screenshot
I need to add up all three durations in this case and in other cases it can be more or less than three.
Does anyone know how to calculate that in Python?

Welcome to Stack Overflow!
Based on the limited data you provided, here is a solution that should work. The code makes some assumptions that could cause errors, so you will want to adapt it to your needs. I avoided list comprehensions and array math to keep it clear, since you said you're new to Python.
Assumptions:
- You're pulling this data into a pandas dataframe.
- Every Old Value of "Termination in progress" has a matching New Value, for every case number.
import datetime
import pandas as pd
import numpy as np
fp = r'<PATH TO FILE>\\'
f = '<FILENAME>.csv'
data = pd.read_csv(fp+f)
#convert ts to datetime for later use doing time delta calculations
data['Edit Date'] = pd.to_datetime(data['Edit Date'])
# sort by the same case number and date in opposing order to make sure values for old and new align properly
data.sort_values(by = ['CaseNumber','Edit Date'], ascending = [True,False],inplace = True)
#find timestamps where Termination in progress occurs
old_val_ts = data.loc[data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
new_val_ts = data.loc[data['New Value'] == 'Termination in progress']['Edit Date'].to_list()
#Loop over the timestamps and calc the time delta
ts_deltas = list()
for i in range(len(old_val_ts)):
    item = old_val_ts[i] - new_val_ts[i]
    ts_deltas.append(item)
# this loop could also be accomplished with list comprehension like this:
#ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]
print('Deltas between groups')
print(ts_deltas)
print()
#Sum the time deltas
total_ts_delta = sum(ts_deltas,datetime.timedelta())
print('Total Time Delta')
print(total_ts_delta)
Output:
Deltas between groups
[Timedelta('0 days 00:08:00'), Timedelta('0 days 00:06:00'), Timedelta('0 days 02:08:00')]
Total Time Delta
0 days 02:22:00
I've also attached a picture of the solution minus my file path for obvious reasons. Hope this helps. Please remember to mark as correct if this solution works for you. Otherwise let me know what issues you run into.
EDIT:
If you have multiple case numbers to look at, there are various ways to do it. The simplest is to get the unique case numbers with data['CaseNumber'].unique(), then iterate over that array, filter the data for each case number, and store each total time delta in a list or a dictionary. It is not necessarily the most efficient solution, but it will work.
cases_total_td = {}
unique_cases = data['CaseNumber'].unique()
for case in unique_cases:
    # keep only the rows that belong to this case
    temp_data = data[data['CaseNumber'] == case]
    #find timestamps where Termination in progress occurs
    old_val_ts = temp_data.loc[temp_data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
    new_val_ts = temp_data.loc[temp_data['New Value'] == 'Termination in progress']['Edit Date'].to_list()
    #Calc the time delta for each pair of timestamps
    ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]
    #Sum the time deltas
    total_ts_delta = sum(ts_deltas,datetime.timedelta())
    cases_total_td[case] = total_ts_delta
print(cases_total_td)
{1005222: Timedelta('0 days 02:22:00')}
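The per-case totals can also be sketched with groupby on a small, self-contained frame. The rows below are made-up sample data, not the asker's file, and the helper name time_in_status is hypothetical. The sketch relies on the same assumption as the answer: every entry into the status has a matching exit within the same case.

```python
import pandas as pd

STATUS = "Termination in progress"

# Made-up change history: one row per field edit.
history = pd.DataFrame({
    "CaseNumber": [1, 1, 1, 1, 2, 2],
    "Edit Date": pd.to_datetime([
        "2021-01-01 09:00", "2021-01-01 09:08",
        "2021-01-01 10:00", "2021-01-01 10:06",
        "2021-01-02 12:00", "2021-01-02 14:08",
    ]),
    "Old Value": ["Open", STATUS, "Open", STATUS, "Open", STATUS],
    "New Value": [STATUS, "Open", STATUS, "Closed", STATUS, "Closed"],
})


def time_in_status(case_rows, status=STATUS):
    """Pair each entry into `status` with the following exit and sum the gaps."""
    rows = case_rows.sort_values("Edit Date")
    entered = rows.loc[rows["New Value"] == status, "Edit Date"].reset_index(drop=True)
    left = rows.loc[rows["Old Value"] == status, "Edit Date"].reset_index(drop=True)
    return (left - entered).sum()


totals = {case: time_in_status(rows) for case, rows in history.groupby("CaseNumber")}
print(totals)
```

Sorting in ascending order and resetting the index keeps the n-th entry aligned with the n-th exit, so the subtraction pairs them positionally.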

Best way to iterate through dataframe to call google distance API

I'd like to know the best way to get distances from the Google Maps distance API for my dataframe of coordinates (origin and destination), which has around 75k rows.
   origin                       destination
1  (40.7127837, -74.0059413)    (34.0522342, -118.2436849)
2  (41.8781136, -87.6297982)    (29.7604267, -95.3698028)
3  (39.9525839, -75.1652215)    (40.7127837, -74.0059413)
4  (41.8781136, -87.6297982)    (34.0522342, -118.2436849)
5  (29.7604267, -95.3698028)    (39.9525839, -75.1652215)
So far my code iterates through the dataframe and calls the API copying the distance value into the new "distance" column.
df['distance'] = ""
for index, row in df.iterrows():
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK":  # Handle "no result" exception
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
        # iterrows() yields index labels, so write by label rather than by position
        df.at[index, 'distance'] = KM
    else:
        df.at[index, 'distance'] = 0
df.to_csv('distance.csv')
I get the desired result, but from what I've read, iterating through a dataframe is rather inefficient and should be avoided. It took 20 seconds for 240 rows, so it would take about 1h30 for the whole dataframe. Note that once this is done there is no need to re-run it; only a few new rows are added each month (~500).
What would be the best solution here?
Edit: if anybody has experience with the Google distance API and its limitations, any tips or best practices are welcome.

I tried to find out whether there are any limits on concurrent calls, but I couldn't find anything. Here are a few suggestions.
Avoid loops
For your code, I'd skip the explicit for loop and use apply first:
def get_gmaps_distance(row):
    result = gmaps.distance_matrix(row['origin'], row['destination'], mode='driving')
    status = result['rows'][0]['elements'][0]['status']
    if status == "OK":
        KM = int(result['rows'][0]['elements'][0]['distance']['value'] / 1000)
    else:
        KM = 0
    return KM

df["distance"] = df.apply(get_gmaps_distance, axis=1)
Split your dataframe and use multiprocessing
import multiprocessing as mp
import numpy as np
import pandas as pd

def parallelize(fun, vec, cores=mp.cpu_count()):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

# split your dataframe into as many chunks as there are cores
df = np.array_split(df, mp.cpu_count())

# this applies your function to every chunk
def parallel_distance(x):
    x["distance"] = x.apply(get_gmaps_distance, axis=1)
    return x

df = parallelize(parallel_distance, df)
df = pd.concat(df, ignore_index=True, sort=False)
Do not calculate the same distance twice (save $$$)
If you have duplicate rows, you should drop them before calling the API:
grp = df.drop_duplicates(["origin", "destination"]).reset_index(drop=True)
Here I didn't overwrite df, as it may contain other information you need; you can merge the results back into it.
grp["distance"] = grp.apply(get_gmaps_distance, axis=1)
df = pd.merge(df, grp, how="left")
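A small offline sketch of the deduplicate-then-merge idea. Since no API key is available here, fake_distance_km is a made-up stand-in for the real gmaps.distance_matrix call (it returns a great-circle distance, not a driving distance), and the call counter exists only to show how many lookups are made. A list comprehension over the two columns is used instead of apply, purely to keep the call count easy to reason about.

```python
import math

import pandas as pd

calls = 0  # counts how often the stand-in "API" is hit


def fake_distance_km(origin, destination):
    """Hypothetical stand-in for the API call: haversine distance in km."""
    global calls
    calls += 1
    (lat1, lon1), (lat2, lon2) = origin, destination
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return int(2 * 6371 * math.asin(math.sqrt(a)))


new_york = (40.7127837, -74.0059413)
los_angeles = (34.0522342, -118.2436849)
chicago = (41.8781136, -87.6297982)

df = pd.DataFrame({
    "origin": [new_york, chicago, new_york, chicago],
    "destination": [los_angeles, los_angeles, los_angeles, los_angeles],
})

# One lookup per unique (origin, destination) pair ...
unique_pairs = df.drop_duplicates(["origin", "destination"]).reset_index(drop=True)
unique_pairs["distance"] = [
    fake_distance_km(o, d)
    for o, d in zip(unique_pairs["origin"], unique_pairs["destination"])
]

# ... then copy the result back onto every matching row.
df = df.merge(unique_pairs, on=["origin", "destination"], how="left")
print(df)
print("lookups made:", calls)
```

With four rows but only two distinct pairs, the stand-in is called twice; with the real API that difference translates directly into fewer billed requests.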
Reduce decimals
Ask yourself this question: do I really need to be accurate to the 7th decimal place? As 1 degree of latitude is ~111 km, the 7th decimal place gives you a precision of ~1 cm. You can get the idea from this when-less-is-more article, where reducing the decimals improved the model.
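The arithmetic behind that estimate can be checked in a few lines. This sketch uses only the approximate figure of 111 km per degree of latitude quoted above; the function name precision_in_metres is made up for the illustration.

```python
KM_PER_DEGREE_LAT = 111.0  # approximate figure quoted above


def precision_in_metres(decimals):
    """Ground distance covered by one unit in the last kept decimal place."""
    return KM_PER_DEGREE_LAT * 1000 * 10 ** -decimals


for decimals in range(3, 8):
    print(f"{decimals} decimals -> ~{precision_in_metres(decimals):.4g} m")

# Rounding a coordinate to 4 decimals keeps it within roughly 11 m.
lat = 40.7127837
rounded = round(lat, 4)
print(rounded)
```

Rounding before dropping duplicates may also merge near-identical coordinates into a single lookup, at the cost of the precision shown in the loop.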
Conclusion
If you can eventually use all the suggested methods, you could see some worthwhile improvements. I'd like you to report the results here, as I don't have a personal API key to try them myself.

I am trying to write a function which divides a column into 3 parts

I am trying to write a function which takes a column as input, divides it into 3 parts (short, medium, long) and returns them as lists.
I tried to do it with loc, but it returns a dataframe rather than a list.
def DivideColumns(df,col):
    mean = df[col].mean()
    maxi = df[col].max()
    mini = df[col].min()
    less = mean - (maxi-mini)/3
    more = mean + (maxi-mini)/3
    short = df.loc[df[col] < less]
    # between() takes only the lower and upper bounds
    average = df.loc[df[col].between(less, more)]
    long = df.loc[df[col] > more]
    return short, average, long
What I expected was to get 3 different lists, but unfortunately I got 3 different dataframes.

Since you are using pandas, you can use binning. With the pandas cut function you can divide the column into whatever ranges you like, and it makes your code easier to read.
def DivideColumns(df,col):
    mean = df[col].mean()
    maxi = df[col].max()
    mini = df[col].min()
    less = mean - (maxi-mini)/3
    more = mean + (maxi-mini)/3
    # binning
    bins_values = [mini, less, more, maxi]
    group_names = ['short', 'average', 'long']
    bins = pd.cut(df[col], bins_values, labels=group_names, include_lowest=True)
    short = (df[col][bins == 'short']).tolist()
    average = (df[col][bins == 'average']).tolist()
    long = (df[col][bins == 'long']).tolist()
    return short, average, long
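For a quick check, here is a compact variant of the same binning idea run on made-up data. The function name divide_column, the dictionary return value and the sample column length are choices made for this sketch only. Note that pd.cut needs increasing bin edges, so the approach assumes mean - range/3 is not below the column minimum (and mean + range/3 is not above the maximum).

```python
import pandas as pd


def divide_column(df, col):
    """Split df[col] into short / average / long lists using pd.cut."""
    mean = df[col].mean()
    spread = (df[col].max() - df[col].min()) / 3
    edges = [df[col].min(), mean - spread, mean + spread, df[col].max()]
    labels = ["short", "average", "long"]
    bins = pd.cut(df[col], edges, labels=labels, include_lowest=True)
    return {name: df.loc[bins == name, col].tolist() for name in labels}


# Made-up sample data: mean 5.5, range 9, so the edges are 1, 2.5, 8.5, 10.
sample = pd.DataFrame({"length": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
parts = divide_column(sample, "length")
print(parts)
```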

Use the tolist() function to transform a pandas dataframe into a list. Note that on a dataframe this gives one inner list per row.
short = df.loc[df[col] < less].values.tolist()
average = df.loc[df[col].between(less, more)].values.tolist()
long = df.loc[df[col] > more].values.tolist()

How to vectorize this peak finding for loop in Python?

Basically I'm writing a peak finding function that needs to be able to beat scipy.argrelextrema in benchmarking. Here is a link to the data I'm using, and the code:
https://drive.google.com/open?id=1U-_xQRWPoyUXhQUhFgnM3ByGw-1VImKB
If this link expires, the data can be found at dukascopy bank's online historical data downloader.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('EUR_USD.csv')
data.columns = ['Date', 'open', 'high', 'low', 'close','volume']
data.Date = pd.to_datetime(data.Date, format='%d.%m.%Y %H:%M:%S.%f')
data = data.set_index(data.Date)
data = data[['open', 'high', 'low', 'close']]
data = data.drop_duplicates(keep=False)
price = data.close.values

def fft_detect(price, p=0.4):
    trans = np.fft.rfft(price)
    trans[round(p*len(trans)):] = 0
    inv = np.fft.irfft(trans)
    dy = np.gradient(inv)
    peaks_idx = np.where(np.diff(np.sign(dy)) == -2)[0] + 1
    valleys_idx = np.where(np.diff(np.sign(dy)) == 2)[0] + 1
    patt_idx = list(peaks_idx) + list(valleys_idx)
    patt_idx.sort()
    label = [x for x in np.diff(np.sign(dy)) if x != 0]
    # Look for Better Peaks
    l = 2
    new_inds = []
    for i in range(0,len(patt_idx[:-1])):
        search = np.arange(patt_idx[i]-(l+1),patt_idx[i]+(l+1))
        if label[i] == -2:
            idx = price[search].argmax()
        elif label[i] == 2:
            idx = price[search].argmin()
        new_max = search[idx]
        new_inds.append(new_max)
    plt.plot(price)
    plt.plot(inv)
    plt.scatter(patt_idx,price[patt_idx])
    plt.scatter(new_inds,price[new_inds],c='g')
    plt.show()
    return peaks_idx, price[peaks_idx]
It smooths the data using a fast Fourier transform (FFT), takes the derivative to find the indices of the minima and maxima of the smoothed data, and then finds the corresponding peaks in the unsmoothed data. Sometimes the peaks it finds are not ideal because of smoothing effects, so I run this for loop to search for higher or lower points around each index, within the bounds specified by l. I need help vectorizing this for loop, and I have no idea how to do it. Without the for loop, my code is about 50% faster than scipy.argrelextrema, but the for loop slows it down. If I can find a way to vectorize it, it would be a very quick and very effective alternative to scipy.argrelextrema. These two images represent the data without and with the for loop, respectively.

This may do it. It's not perfect, but hopefully it gets you what you want and shows a bit of how to vectorize. Happy to hear any improvements you think up.
label = np.array(label[:-1]) # not sure why this is 1 unit longer than search.shape[0]?
# the idea is to make the index matrix you're for looping over row by row all in one go.
# This part is sloppy and you can improve this generation.
# np.vstack needs a sequence of arrays, so build a list rather than a generator
search = np.vstack([np.arange(patt_idx[i]-(l+1),patt_idx[i]+(l+1)) for i in range(0,len(patt_idx[:-1]))]) # you can refine this.
# then you can make the price matrix
price = price[search]
# and you can swap the sign of elements so you only need to do argmin instead of both argmin and argmax
price[label==-2] = - price[label==-2]
# now find the indices of the minimum price on each row
idx = np.argmin(price,axis=1)
# and then extract the refined indices from the search matrix
new_inds = search[np.arange(idx.shape[0]),idx] # this too can be cleaner.
# not sure what's going on here so that search[:,idx] doesn't work for me
# probably just a misunderstanding
I find that this reproduces your result but I did not time it. I suspect the search generation is quite slow but probably still faster than your for loop.
Edit:
Here's a better way to produce search:
patt_idx = np.array(patt_idx)
starts = patt_idx[:-1]-(l+1)
stops = patt_idx[:-1]+(l+1)
ds = stops-starts
s0 = stops.shape[0]
s1 = ds[0]
search = np.reshape(np.repeat(stops - ds.cumsum(), ds) + np.arange(ds.sum()),(s0,s1))
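As another option, the same index matrix can be built with plain broadcasting, because every window has the same width. The sketch below runs on synthetic data: the random-walk price and the hand-picked patt_idx and label arrays are assumptions for the illustration, not the asker's EUR/USD series. It also assumes every window stays inside the array.

```python
import numpy as np

# Synthetic stand-ins for the question's variables.
rng = np.random.default_rng(0)
price = rng.normal(size=200).cumsum()
patt_idx = np.array([10, 40, 75, 120, 160])
label = np.array([-2, 2, -2, 2, -2])  # -2 marks a peak, 2 marks a valley
l = 2

# One row of candidate indices per pattern point: shape (n_points, 2*(l+1)).
offsets = np.arange(-(l + 1), l + 1)
search = patt_idx[:, None] + offsets

# Negate the peak rows so a single argmin handles both peaks and valleys.
windows = price[search]
signed = np.where(label[:, None] == -2, -windows, windows)
best = signed.argmin(axis=1)

# Pick, for each row, the index at the best column.
new_inds = search[np.arange(len(patt_idx)), best]
print(new_inds)
```

The offsets match np.arange(patt_idx[i]-(l+1), patt_idx[i]+(l+1)) from the original loop, so each row of search covers the same window the loop searched.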

Here is an alternative. It uses a list comprehension, which is generally faster than a for loop.
l = 2
# Define the bounds beforehand; it's marginally faster than doing it in the loop
upper = np.array(patt_idx) + l + 1
lower = np.array(patt_idx) - l - 1
# List comprehension...
new_inds = [price[low:hi].argmax() + low if lab == -2 else
            price[low:hi].argmin() + low
            for low, hi, lab in zip(lower, upper, label)]
# Find maximum within each interval
new_max = price[new_inds]
new_global_max = np.max(new_max)

Tracking Error on a number of benchmarks

I'm trying to calculate the tracking error of a fund against a number of different benchmarks (tracking error is defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all the benchmarks are in a data frame that I'm reading from an Excel file. What I have so far is below, with the idea that arg1 represents each benchmark and the function is applied using applymap, but it returns a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err
data.applymap(index_analytics)

There are a few things that need fixing. First, applymap passes each individual value from every column to your function (index_analytics), so arg1 is a single scalar value from your dataframe. data[arg1] will always raise a KeyError unless your values also happen to be column names.
You also shouldn't need apply for this. Assuming your benchmarks are in the same dataframe, you should be able to do something like this for each benchmark. Next time, include a sample of your dataframe.
data['Benchmark1_result'] = (data['Fund'] - data['Benchmark1']) / data['Benchmark1']
And if you want to calculate the standard deviations for all the benchmarks at once, you can do this:
# assume benchmark_columns lists the names of all the benchmark columns (placeholder names)
benchmark_columns = ['Benchmark1', 'Benchmark2']
# divide by the benchmark, as in the question's definition, and take one standard deviation per column
np.std((data[['Fund']].values - data[benchmark_columns].values) / data[benchmark_columns].values, axis=0)
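A self-contained sketch of the per-benchmark calculation, following the definition given in the question (standard deviation of the percent difference between fund and benchmark). The numbers and the column names BenchA and BenchB are made up, and tracking_errors is a hypothetical helper. ddof=0 is passed so that the result matches np.std, which uses the population standard deviation by default.

```python
import numpy as np
import pandas as pd

# Made-up sample series; BenchB tracks the fund exactly.
data = pd.DataFrame({
    "Fund": [100.0, 102.0, 101.0, 105.0],
    "BenchA": [100.0, 101.0, 102.0, 104.0],
    "BenchB": [100.0, 102.0, 101.0, 105.0],
})
benchmark_columns = ["BenchA", "BenchB"]


def tracking_errors(frame, fund_col, bench_cols):
    """Std of (fund - benchmark) / benchmark, one value per benchmark column."""
    benchmarks = frame[bench_cols]
    # rsub computes fund - benchmark, aligning the fund series along the rows
    pct_diff = benchmarks.rsub(frame[fund_col], axis=0).div(benchmarks)
    return pct_diff.std(ddof=0)


result = tracking_errors(data, "Fund", benchmark_columns)
print(result)
```

The result is a Series indexed by benchmark name, so adding another benchmark only means adding its column name to the list.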

Assuming you're following the definition of Tracking Error below:
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")
