Question: What is the most efficient way to implement the equivalent of the following, using Pandas dataframes, at scale (see below for context on scale): temp = df[df.feature == value]?
Background: I have daily time series data for ~500 entities over 30 years, and for each entity and each day I need to create 90 features based on various look-backs of up to 240 days in the past. Currently I'm looping through each day, manipulating all of the data from that day, then inserting it into a pre-allocated numpy matrix, but this is proving very slow for the size of my data set.
Naive approach:
df = pd.DataFrame()
for day in range(241, t_max):
    temp_a = df_timeseries[df_timeseries.t == day].copy()
    temp_b = df_timeseries[df_timeseries.t == day - 1].copy()
    new_val = temp_a.feature_1/temp_b.feature_1
    new_val['t'] = day
    new_val['entity'] = temp_a.entity
    df = pd.concat([df, new_val])
Current approach (simplified):
df = np.matrix(np.zeros([num_days*num_entities, 3]))
col_dict = dict(zip(df_timeseries.columns, list(range(0, len(df_timeseries.columns)))))
mtrx_timeseries = np.matrix(df_timeseries.to_numpy())

for i, day in enumerate(range(241, t_max)):
    interm = np.matrix(np.zeros([num_entities, 3]))
    interm[:, 0] = day

    temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :]
    temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 1)[0], :]
    temp_cr = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1

    temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 5)[0], :]
    temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 10)[0], :]
    temp_or = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1

    interm[:, 1:] = np.concatenate([temp_cr, temp_or], axis=1)
    df[i*num_entities : (i + 1)*num_entities, :] = interm
Line profiling the full version of my code shows that each statement of the form mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :] takes up roughly 23% of the total run time, hence my search for a more streamlined solution. Since indexing takes the most time, and since the loop performs this operation on every iteration, perhaps one solution is to index just once, storing each day's data in a separate array element, and then loop through the array elements?
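A minimal sketch of that "index once" idea, assuming df_timeseries has the columns t, entity and feature_1 used above, might look like this:

# Build the per-day blocks once, then reuse them inside the loop
# instead of re-scanning the whole matrix on every iteration.
blocks = {day: block for day, block in df_timeseries.groupby('t')}
for day in range(241, t_max):
    temp_a = blocks[day]
    temp_b = blocks[day - 1]
    # ... compute the look-back features from temp_a and temp_b as before ...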
This isn't a complete solution to your problem, but I think it will get you where you need to be.
Consider the following code:
entity_dict = {}
entity_idx = 0
arr = np.zeros((num_entities, t_max-240))
for entity, day, feature in df_timeseries[['entity', 'day', 'feature_1']].values:
    if entity not in entity_dict:
        entity_dict[entity] = entity_idx
        entity_idx += 1
    arr[entity_dict[entity], day-240] = feature
This will convert df_timeseries into a num_entities*num_days shaped array organized by entity, very efficiently. You won't need to do any fancy indexing at all. The most efficient way to index a numpy array or matrix is to know what indices you need ahead of time and not search the array for them. You can then perform array operations (it looks to me like your operation is simple elementwise division, which you can do in a couple of lines with no extra loop).
Then convert back to the original format.
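As a hedged illustration of that last step, a sketch of the elementwise division on the array built above (assuming column j of arr holds feature_1 for day 240 + j, and that missing days are simply zero):

# Sketch only: both look-back features for every entity and every day,
# computed as whole-array operations with no Python loop.
with np.errstate(divide='ignore', invalid='ignore'):
    # feature_1[day] / feature_1[day - 1] - 1
    ratio_1d = arr[:, 1:] / arr[:, :-1] - 1
    # feature_1[day - 5] / feature_1[day - 10] - 1
    ratio_5_10 = arr[:, 5:-5] / arr[:, :-10] - 1

Flattening these (plus the inverse of entity_dict for the entity labels) gets you back to the original long format.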
I have the following method in which I am eliminating overlapping intervals in a dataframe based on a set of hierarchical rules:
def disambiguate(arg):
    arg['length'] = (arg.end - arg.begin).abs()
    df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()
    data = []
    out = pd.DataFrame()
    for row in df.itertuples():
        test = df[df['note_id'] == row.note_id].copy()
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric), test.end.apply(pd.to_numeric), closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        # filter out overlapping rows via hierarchy
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
    out = pd.concat(data, axis=0)
    # randomly reindex to keep a random row when dropping remaining duplicates: https://gist.github.com/cadrev/6b91985a1660f26c2742
    out.reset_index(inplace=True)
    out = out.reindex(np.random.permutation(out.index))
    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])
This works fine, except for the fact that the dataframes I am iterating over have well over 100K rows each, so this is taking forever to complete. I did a timing of various methods using %prun in Jupyter, and the method that seems to eat up the most processing time is series.py:3719(apply). NB: I tried using modin.pandas, but that caused more problems (I kept getting an error about Interval needing a value where left is less than right, which I couldn't figure out; I may file a GitHub issue there).
I am looking for a way to optimize this, such as vectorization, but honestly I don't have the slightest clue how to convert this to a vectorized form.
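One hedged observation on that profile: the apply being flagged looks like the per-iteration test.begin.apply(pd.to_numeric). A minimal sketch of hoisting that conversion out of the loop (assuming begin and end arrive as strings):

# Assumption: begin/end are read in as strings; convert them once, up front
arg[['begin', 'end']] = arg[['begin', 'end']].apply(pd.to_numeric)
# ... later, inside the loop, the IntervalIndex can be built from the columns directly
iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')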
Here is a sample of my data:
begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222
I think I know what the issue was: I did my filtering on note_id incorrectly, and was thus iterating over the entire dataframe.
It should have been:
data = []
cases = set(df['note_id'].tolist())
for case in cases:
    test = df[df['note_id'] == case].copy()
    for row in test.itertuples():
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
out = pd.concat(data, axis=0)
For testing on one note, before I stopped iterating over the entire, non-filtered dataframe, it was taking over 16 minutes. Now, it's at 28 seconds!
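For reference, the same per-note split can also be expressed with groupby; a minimal sketch (the score/length hierarchy is omitted for brevity):

# Sketch: let groupby do the per-note split, and build the IntervalIndex
# once per note rather than once per row.
data = []
for note_id, test in df.groupby('note_id'):
    iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
    for row in test.itertuples():
        fx = test[iix.overlaps(pd.Interval(row.begin, row.end))].copy()
        # ... apply the same score/length hierarchy as above ...
        data.append(fx)
out = pd.concat(data, axis=0)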
I am trying to write a function which takes a column as input, divides it into 3 parts (short, medium, long), and then returns them as lists.
I tried to do it with the loc function; however, it returns a dataframe rather than a list.
def DivideColumns(df, col):
    mean = df[col].mean()
    maxi = df[col].max()
    mini = df[col].min()
    less = mean - (maxi-mini)/3
    more = mean + (maxi-mini)/3
    short = df.loc[df[col] < less]
    average = df.loc[df[col].between(less, more)]
    long = df.loc[df[col] > more]
    return short, average, long
What I expected was to get 3 different lists, but unfortunately I got 3 different dataframes.
Since you are using pandas, you can use the concept of binning. With the pandas cut function you can divide the column into the ranges you like, and it makes your code easier to read. More info is in the pandas documentation for cut.
def DivideColumns(df, col):
    mean = df[col].mean()
    maxi = df[col].max()
    mini = df[col].min()
    less = mean - (maxi-mini)/3
    more = mean + (maxi-mini)/3
    # binning
    bins_values = [mini, less, more, maxi]
    group_names = ['short', 'average', 'long']
    bins = pd.cut(df[col], bins_values, labels=group_names, include_lowest=True)
    short = (df[col][bins == 'short']).tolist()
    average = (df[col][bins == 'average']).tolist()
    long = (df[col][bins == 'long']).tolist()
    return short, average, long
Use the tolist() function to transform a pandas dataframe into a list.
short = df.loc[df[col] < less].values.tolist()
average = df.loc[df[col].between(less, more)].values.tolist()
long = df.loc[df[col] > more].values.tolist()
I'm working with a csv file in Python using Pandas.
I'm having a few troubles thinking on how to achieve the following goal.
What I need to achieve is to group entries using a similarity function.
For example, each group X should contain all entries where each pair in the group differs by at most Y on a certain attribute (column) value.
Given this example of CSV:
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
and considering the age column, with a difference of at most 3 on this value I would get the following groups:
A = {john;male;newyork;20
jack;male;newyork;21}
B={eric;male;san francisco;29
jenny;female;boston2;30}
C={mary;female;losangeles;45
maryanne;female;losangeles;48}
D={maryanne;female;losangeles;48
mattia;na;BostonDynamics;50}
This is my current work-around, but I hope there is something more Pythonic.
import pandas as pandas
import numpy as numpy
def main():
    csv_path = "../resources/dataset_string.csv"
    csv_data_frame = pandas.read_csv(csv_path, delimiter=";")
    print("\nOriginal Values:")
    print(csv_data_frame)
    sorted_df = csv_data_frame.sort_values(by=["age", "name"], kind="mergesort")
    print("\nSorted Values by AGE & NAME:")
    print(sorted_df)
    min_age = int(numpy.min(sorted_df["age"]))
    print("\nMin_Age:", min_age)
    max_age = int(numpy.max(sorted_df["age"]))
    print("\nMax_Age:", max_age)
    threshold = 3
    bins = numpy.arange(min_age, max_age, threshold)
    print("Bins:", bins)
    ind = numpy.digitize(sorted_df["age"], bins)
    print(ind)
    print("\n\nClustering by hand:\n")
    current_min = min_age
    for cluster in range(min_age, max_age, threshold):
        next_min = current_min + threshold
        print("<Cluster({})>".format(cluster))
        print(sorted_df[(current_min <= sorted_df["age"]) & (sorted_df["age"] <= next_min)])
        print("</Cluster({})>\n".format(cluster + threshold))
        current_min = next_min

if __name__ == "__main__":
    main()
On one attribute this is simple:
Sort
Linearly scan the data, and whenever the threshold is violated, begin a new group.
While this won't be optimal, it should be better than what you already have, at less cost.
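A minimal sketch of that single pass on the age column (treat it as an outline, not a drop-in function; csv_data_frame and the threshold of 3 are taken from the question):

# Sort on the attribute, then one linear scan: open a new group whenever
# the next value is more than `threshold` away from the value that
# started the current group.
sorted_df = csv_data_frame.sort_values("age")
threshold = 3
labels, group_start, group_no = [], None, -1
for age in sorted_df["age"]:
    if group_start is None or age - group_start > threshold:
        group_no += 1
        group_start = age
    labels.append(group_no)
sorted_df["group"] = labels
print(sorted_df)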
However, in the multivariate case, finding the optimal groups is supposedly NP-hard, so doing so would require a brute-force search in exponential time. You will need to approximate this instead, either with AGNES (in O(n³)) or with CLINK (usually worse quality, but O(n²)).
As this is fairly expensive, it will not be a simple operation on your data frame.
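If you do need the multivariate case, a hedged sketch using SciPy's hierarchical clustering with complete linkage (a convenient stand-in for the AGNES/CLINK idea; the choice of columns and scaling in X is an assumption you would have to make for your data):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumption: only the age column is constrained here; for several columns,
# stack and scale them into X so that "differs by at most Y" makes sense
# as a distance.
X = csv_data_frame[["age"]].to_numpy(dtype=float)
Y = 3

# Complete linkage tracks the largest pairwise distance when merging, so
# cutting the dendrogram at Y guarantees that every pair inside a cluster
# differs by at most Y.
Z = linkage(X, method="complete")
csv_data_frame["group"] = fcluster(Z, t=Y, criterion="distance")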
I want to make the for loop given below faster in Python.
import math
import pandas as pd
import numpy as np
import scipy
import scipy.stats

np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k = 1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)
for ib in [xb for xb in range(0,len(xl))]:
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')['ships_x','ships_y'].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    p.ix[k-1,'cv'] = 1 if math.isnan(scipy.stats.variation(temptab['contri'])) else scipy.stats.variation(temptab['contri'])
    p.ix[k-1,'temp'] = temp
    k = k+1
where,
xl, yl - two data frames I am working on, with columns like Concat, ships_x and ships_y.
tempno - an initial list of indices of the xl dataframe, referring to a list of 'Concat' values.
So, in the for loop we add one extra index to tempno in each iteration and then subset the 'yl' dataframe based on 'Concat' values matching those of the 'xl' dataframe. Then we compute the coefficient of variation (from the scipy library) and record it in the new dataframe 'p'.
The problem is that it takes too much time, as the number of iterations of the for loop runs into the thousands. The groupby line takes the most time. I have made a few changes; the code now looks like the version below (changes noted in the comments). There is a slight improvement, but it doesn't solve my problem. Please suggest the fastest way possible to implement this. Many thanks.
# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
#Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')['ships_x','ships_y'].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
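One further hedged sketch along the same lines (not benchmarked against the original): since tempno is fixed and each iteration only adds a single index, the expensive groupby can be done once on (Concat, PickDate) and reused inside the loop.

import math
import numpy as np
import pandas as pd
import scipy.stats

# Hedged sketch, reusing xl, yl and tempno from above: aggregate once per
# (Concat, PickDate); each iteration then only selects and re-sums a small frame.
per_concat = yl.groupby(['Concat', 'PickDate'])[['ships_x', 'ships_y']].sum()

base_idx = list(set(np.asarray(tempno).ravel()))
rows = []
for ib in range(len(xl)):
    idx = sorted(set(base_idx + [ib]))
    concats = xl['Concat'].iloc[idx]
    mask = per_concat.index.get_level_values('Concat').isin(concats)
    temptab = per_concat[mask].groupby(level='PickDate').sum()
    contri = temptab['ships_x'] / temptab['ships_y']
    cv = scipy.stats.variation(contri)
    rows.append({'temp': idx, 'cv': 1 if math.isnan(cv) else cv})
p = pd.DataFrame(rows, columns=['temp', 'cv'])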