I have created this function in Python to generate different price combinations for a product dataset. For example, if the price of a product is $10, the possible prices would be [10, 11, 12, 13, 14, 15].
Example input:
df = pd.DataFrame({'Product_id': [1, 2], 'price_per_tire': [10, 110]})
My function:
def price_comb(df):
    K = [0, 1, 2, 3, 4, 5]
    final_df = pd.DataFrame()
    c = 0
    for j in K:
        c += 1
        print('K count=' + str(c))
        for index, i in df.iterrows():
            if (i['price_per_tire'] <= 100):
                i['price_per_tire'] = i['price_per_tire'] + 1*j
            elif ((i['price_per_tire'] > 100) & (i['price_per_tire'] < 200)):
                i['price_per_tire'] = i['price_per_tire'] + 2*j
            elif ((i['price_per_tire'] > 200) & (i['price_per_tire'] < 300)):
                i['price_per_tire'] = i['price_per_tire'] + 3*j
            elif i['price_per_tire'] >= 300:
                i['price_per_tire'] = i['price_per_tire'] + 5*j
            final_df = final_df.append(i)
    return final_df
When I run this function the output is:
df = pd.DataFrame({'Product_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], 'price_per_tire': [10, 11, 12, 13, 14, 15, 110, 112, 114, 116, 118, 120]})
However, it's taking a lot of time (up to 2 days) for a 545k-row dataset. I'm trying to find ways to execute this faster. Any help would be appreciated.
Please provide a working version of the code; it is not clear where price_per_tire comes from.
This algorithm is O(N²), so there is a lot of room for improvement.
The first suggestion is to avoid the explicit for loop: use NumPy or pandas and solve the problem with a vectorized approach.
This means the inner loop can be refactored with the mask technique:
for index, x in df.iterrows():
    if x[fld] < limit:
        x[fld] = f(x[fld])
can be refactored as:
mask = df[fld] < limit
df.loc[mask, fld] = f(df.loc[mask, fld])        # if f(unction) works on a whole Series
df.loc[mask, fld] = df.loc[mask, fld].map(f)    # element-wise version, slower
With this approach you can speed your code up dramatically.
Another point is that df.append is not good practice: it copies the whole frame on every call, so making in-place changes is much more efficient. Create all the needed columns (and rows) before the main loop so that the required space is allocated once.
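Applied to the question's example, a minimal vectorized sketch (assuming the same column names and step rules as the original function; the behaviour at the exact band boundaries 200 and 300 is a slight simplification of the if/elif chain) could look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product_id': [1, 2], 'price_per_tire': [10, 110]})

# pick the step size per price band once, for the whole column
step = np.select(
    [df['price_per_tire'] <= 100,
     df['price_per_tire'] < 200,
     df['price_per_tire'] < 300],
    [1, 2, 3],
    default=5,
)

# repeat every row 6 times and add step * k (k = 0..5) in one vectorized operation
k = np.tile(np.arange(6), len(df))                 # 0..5 for every product
out = df.loc[df.index.repeat(6)].reset_index(drop=True)
out['price_per_tire'] = out['price_per_tire'].to_numpy() + np.repeat(step, 6) * k
On a frame with hundreds of thousands of rows this should finish in seconds at most, because every operation works on whole columns instead of individual rows.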
I am trying to split a pd.Series of sorted dates wherever the gap between consecutive dates is bigger than the normal one. To do this, I calculated the gap sizes with pd.Series.diff() and then iterated over all the elements of the series with a while loop. Unfortunately this is quite computationally intensive. Is there a better way (apart from parallelization)?
Minimal example with my function:
import pandas as pd
import time

def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # list that should contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: add data[i] to the last sample in samples_list
            samples_list[-1] = samples_list[-1].append(pd.Series(data[i]))
        else:
            # larger gap: start a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list
# make sample data as example
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")
The real data have over 150k rows and take several minutes to compute... :/
I'm not sure I understand completely what you want but I think this could work:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
idx collects the indices directly after a gap, and the rest just splits the series at these indices and packs the pieces into the list samples_list.
If the index is non-standard, you need some overhead (resetting the index and later setting it back to the original) to make sure that iloc can be used:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
(You don't need that for your example.)
Your code is a bit unclear about how these two different lists should be stored; specifically, I'm not sure what structure of samples_list you have in mind.
Regardless, using Series.pct_change and np.unique() you should achieve approximately what you're looking for:
uniques, indices = np.unique(
    data_with_samples.diff()[1:].pct_change(),
    return_index=True)
Now indices points you to the start and end of the unusually large gap.
If your data has more than one gap, you would only use diff()[1:].pct_change() and look for all values different from 0, e.g. with np.where(); see the sketch below.
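A minimal sketch of that multi-gap case (reusing data_with_samples from above; the fillna just keeps the leading NaN from being counted as a change) could be:
import numpy as np

# relative change of consecutive differences; non-zero wherever the spacing changes
rel_change = data_with_samples.diff()[1:].pct_change()

# positions where the gap size differs from the previous one
change_points = np.where(rel_change.fillna(0) != 0)[0]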
Using the same example data as in the question above:
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
Use the time difference and compare it with normal_distance.seconds, then create an auxiliary column tag to separate the gap groups.
import numpy as np

# start sampling
start_time = time.time()

df = data_with_samples.to_frame()
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal_distance.seconds) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()

my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])

print(f"Duration: {time.time() - start_time}")
I need to find a more efficient solution for the following problem:
Given a dataframe with 4 variables in each row, I need to find the list of 8 elements that covers all the variables of a row for the maximum possible number of rows.
A working, but very slow, solution is to create a second dataframe containing all possible combinations (basically a permutation without repetition), then loop through every combination and compare it with the initial dataframe. The number of matching rows is counted and added to the second dataframe.
import numpy as np
import pandas as pd
from itertools import combinations
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)
listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))
possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns = ['M','N','O','P','Q','R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = ''
for x, row in dfcombi.iterrows():
    comparelist = row['List'].split(',')
    pointercounter = df.index[(df['A'].isin(comparelist)) & (df['B'].isin(comparelist)) & (df['C'].isin(comparelist)) & (df['D'].isin(comparelist))].tolist()
    row['Count'] = len(pointercounter)
I assume there must be a way to avoid the for loop and replace it with something vectorized; I just cannot figure out how.
Thanks!
Your code can be rewritten as:
# working with integers is much faster than working with strings
enums, codes = df.stack().factorize()
# encodings of df
s = [set(x) for x in enums.reshape(-1,4)]
# possible combinations
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(codes)), 6)])
# count the combination with issubset
ret = [0]*len(possiblecombinations)
for a, (i, b) in product(s, enumerate(possiblecombinations)):
    ret[i] += a.issubset(b)
# the combination with maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# in code {0, 3, 4, 5, 17, 18}
# and in values:
codes[list(max_combination)]
# Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')
All that took about 2 seconds, as opposed to your code, which took around 1.5 minutes.
I have coded a for loop with a conditional statement and updates to a list variable at every iteration, which is probably making the process really slow. Is there a way to speed this up while accomplishing the same results as this code snippet?
fault_array = []
for i in x_range_original:
    for j in range(0, 16):
        lower_threshold = min(df_records[:, j+1])
        upper_threshold = max(df_records[:, j+1])
        if (df_log[i, j] < lower_threshold) or (df_log[i, j] > upper_threshold):
            print("Fault detected at timestep: ", df_records['Time'][i])
            fault_array.append(1)
        else:
            print("Normal operation at timestep: ", df_records['Time'][i])
            fault_array.append(0)
Mini code review:
fault_array = []
for i in x_range_original:
    for j in range(0, 16):
        # recomputed on every i; perhaps you wanted j to be an outer loop
        # use vectorized versions of min and max
        lower_threshold = min(df_log[:, j])
        upper_threshold = max(df_log[:, j])
        # this condition is never true:
        # df_log[i, j] cannot be less than min(df_log[:, j])
        # same for the upper threshold
        if (df_log[i, j] < lower_threshold) or (df_log[i, j] > upper_threshold):
            print("Fault detected at timestep: ", df_records['Time'][i])
            fault_array.append(1)
        else:
            # perhaps you need a vectorized operation here instead of a for loop:
            # fault_array = df.apply(lambda row: ...)
            print("Normal operation at timestep: ", df_records['Time'][i])
            fault_array.append(0)
Besides the always-false condition, I imagine you were looking for something like this:
columns = list(range(16))
# I guess the thresholds logic should be different
upper_thresholds = df[columns].max(axis=0)
lower_thresholds = df[columns].min(axis=0)
# faults is a series of bools
faults = df[columns].apply(lambda row: any(row < lower_thresholds) or any(row > upper_thresholds), axis=1)
fault_timesteps = df_records.loc[faults, 'Time']
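If the intent of the original snippet was to take the thresholds from df_records and check df_log against them, a fully vectorized sketch (assuming, purely for illustration, that both are plain 2-D NumPy arrays with matching column order) could look like this:
import numpy as np

# hypothetical stand-ins for the question's data
df_records_values = np.random.rand(1000, 16)   # reference data, one column per sensor
df_log_values = np.random.rand(500, 16)        # logged data to check

# per-column thresholds from the reference data
lower = df_records_values.min(axis=0)          # shape (16,)
upper = df_records_values.max(axis=0)          # shape (16,)

# broadcast comparison: True where a value is outside its column's range
out_of_range = (df_log_values < lower) | (df_log_values > upper)   # shape (500, 16)

fault_per_row = out_of_range.any(axis=1).astype(int)    # one flag per timestep
fault_per_cell = out_of_range.astype(int).ravel()        # one flag per (timestep, column), like the original list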
I have a numpy array with these values:
[10620.5, 11899., 11879.5, 13017., 11610.5]
import numpy as np
array = np.array([10620.5, 11899, 11879.5, 13017, 11610.5])
I would like to get values that are "close" (in this instance, 11899 and 11879.5), average them, and replace them with a single instance of the new number, resulting in this:
[10620.5, 11889, 13017, 11610.5]
the term "close" would be configurable. let's say a difference of 50
the purpose of this is to create Spans on a Bokah graph, and some lines are just too close
I am super new to python in general (a couple weeks of intense dev)
I would think that I could arrange the values in order, and somehow grab the one to the left, and right, and do some math on them, replacing a match with the average value. but at the moment, I just dont have any idea yet.
Try something like this; I added a few extra steps just to show the flow.
The idea is to group the data into adjacent groups and decide whether to average each group based on how spread out it is.
So, as you describe, you can combine your data in sets of 3 numbers, and if the difference between the max and min of a set is less than 50 you average them; otherwise you leave them as they are.
import pandas as pd
import numpy as np

arr = np.ravel([1, 24, 5.3, 12, 8, 45, 14, 18, 33, 15, 19, 22])
arr.sort()

def reshape_arr(a, n):  # n = number of consecutive adjacent items you want to compare for averaging
    hold = len(a) % n
    if hold != 0:
        container = a[-hold:]  # numbers that do not fit in the array are excluded from averaging
        a = a[:-hold].reshape(-1, n)
    else:
        a = a.reshape(-1, n)
        container = None
    return a, container

def get_mean(a, close):  # close = how close adjacent numbers need to be in order to be averaged together
    my_list = []
    for i in range(len(a)):
        if a[i].max() - a[i].min() > close:
            for j in range(len(a[i])):
                my_list.append(a[i][j])
        else:
            my_list.append(a[i].mean())
    return my_list

def final_list(a, c):  # add any elements held in the container to the final list
    if c is not None:
        c = c.tolist()
        for i in range(len(c)):
            a.append(c[i])
    return a

arr, container = reshape_arr(arr, 3)
arr = get_mean(arr, 5)
final_list(arr, container)
You could use fuzzywuzzy here to gauge the ratio of closeness between two data sets.
See details here: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/
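Note that fuzzywuzzy scores string similarity, so for this numeric problem the values would first have to be rendered as strings, which may not be what you want. A minimal usage sketch:
from fuzzywuzzy import fuzz

# fuzz.ratio returns a similarity score between 0 and 100
score = fuzz.ratio("11899.0", "11879.5")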
Taking Gustavo's answer and tweaking it to my needs:
def reshape_arr(a, close):
    flag = True
    while flag is not False:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a
This does the job if I call it like this:
candlesticks['support'] = reshape_arr(supres_df['support'], 150)
where candlesticks is the main DataFrame that I am using and supres_df is another DataFrame that I am massaging before applying it to the main one.
It works, but it is extremely slow. I am trying to optimize it now.
I added a while loop because, after averaging, the averages themselves can become close enough to be averaged again, so I keep looping until no further averaging is needed. This is total newbie work, so if you see something silly, please comment.
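For reference, a single-pass alternative (not exactly equivalent to the repeated re-averaging above, since it does not merge groups again after averaging, but often close enough) is to sort the values, start a new group at every jump of at least close, and average each group:
import numpy as np
import pandas as pd

def average_close_values(values, close):
    """Group sorted values whose consecutive differences are below `close`
    and replace each group by its mean."""
    s = pd.Series(np.sort(np.asarray(values, dtype=float)))
    group_id = (s.diff() >= close).cumsum()   # new group starts at every large jump
    return s.groupby(group_id).mean().to_numpy()

# average_close_values([10620.5, 11899, 11879.5, 13017, 11610.5], 50)
# -> array([10620.5, 11610.5, 11889.25, 13017.])  (sorted order)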
I am trying to construct hierarchies from a dataset where each row represents a student, the course they've taken, and some other metadata. From this dataset, I'm trying to build an adjacency matrix and determine the hierarchies based on which classes students have taken and the paths different students follow when choosing classes.
That said, constructing this adjacency matrix is computationally expensive. Here is the code I currently have, which has been running for around 2 hours:
uniqueStudentIds = df.Id.unique()
uniqueClasses = df['Course_Title'].unique()

for studentID in uniqueStudentIds:
    for course1 in uniqueClasses:
        for course2 in uniqueClasses:
            if course1 != course2 and have_taken_both_courses(course1, course2, studentID):
                x = vertexDict[course1]
                y = vertexDict[course2]
                # Assuming symmetry
                adjacency_matrix[x][y] += 1
                adjacency_matrix[y][x] += 1
                print(course1 + ', ' + course2)
def have_taken_both_courses(course1, course2, studentID):
    hasTakenFirstCourse = len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0
    if hasTakenFirstCourse:
        return len(df.loc[(df['Course_Title'] == course2) & (df['Id'] == studentID)]) > 0
    else:
        return False
Given that I have a very large dataset, I have tried to consult online resources about parallelizing/multithreading this computationally expensive for loop. However, I'm new to Python and multiprocessing, so any guidance would be greatly appreciated!
It appears you are looping far more than you have to. For every student you do N×N iterations, where N is the total number of classes, but each student has only taken a subset of those classes, so you can cut the number of iterations down significantly.
Your have_taken_both_courses() lookup is also more expensive than it needs to be.
Something like this will probably go a lot faster:
import numpy as np
import itertools
import pandas as pd

df = pd.read_table('/path/to/data.tsv')

students_df = pd.DataFrame(df['student'].unique())
students_lkp = {x[1][0]: x[0] for x in students_df.iterrows()}
classes_df = pd.DataFrame(df['class'].unique())
classes_lkp = {x[1][0]: x[0] for x in classes_df.iterrows()}

df['student_key'] = df['student'].apply(lambda x: students_lkp[x])
df['class_key'] = df['class'].apply(lambda x: classes_lkp[x])
df.set_index(['student_key', 'class_key'], inplace=True)

matr = np.zeros((len(classes_df), len(classes_df)))

for s in range(0, len(students_df)):
    print(s)
    # get all the classes for this student
    classes = df.loc[s].index.unique().tolist()
    for x, y in itertools.permutations(classes, 2):
        matr[x][y] += 1
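For completeness, a minimal sketch of the same idea written directly against the question's own names (df with Id and Course_Title columns, plus the vertexDict and adjacency_matrix that the question assumes already exist) might be:
from itertools import combinations

# one pass over students; only the classes each student actually took are paired
for _, courses in df.groupby('Id')['Course_Title']:
    taken = courses.unique()
    for c1, c2 in combinations(taken, 2):
        x, y = vertexDict[c1], vertexDict[c2]
        adjacency_matrix[x][y] += 1
        adjacency_matrix[y][x] += 1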