I have coded a for loop with a conditional statement and updates made to a list variable at every iteration, which is probably making the process really slow. Is there a way to speed this up and accomplish the same results as this code snippet?
fault_array = []
for i in x_range_original:
    for j in range(0,16):
        lower_threshold = min(df_records[:,j+1])
        upper_threshold = max(df_records[:,j+1])
        if((df_log[i,j] < lower_threshold) or (df_log[i,j] > upper_threshold)):
            print("Fault detected at timestep: ",df_records['Time'][i])
            fault_array.append(1)
        else:
            print("Normal operation at timestep: ",df_records['Time'][i])
            fault_array.append(0)
Mini code review:
fault_array = []
for i in x_range_original:
    for j in range(0,16):
        # recomputed on every i; perhaps you wanted j to be an outer loop
        # use vectorized versions of min and max
        lower_threshold = min(df_log[:,j])
        upper_threshold = max(df_log[:,j])
        # this condition is never true:
        # df_log[i,j] cannot be less than min(df_log[:,j])
        # same about upper threshold
        if((df_log[i,j] < lower_threshold) or (df_log[i,j] > upper_threshold)):
            print("Fault detected at timestep: ",df_records['Time'][i])
            fault_array.append(1)
        else:
            # perhaps you need to use a vectorized operation here instead of for loop:
            # fault_array = df.apply(lambda row: ...)
            print("Normal operation at timestep: ",df_records['Time'][i])
            fault_array.append(0)
Besides the always-false condition, I imagine you were looking for something like:
columns = list(range(16))
# I guess the thresholds logic should be different
upper_thresholds = df[columns].max(axis=0)
lower_thresholds = df[columns].min(axis=0)
# faults is a series of bools
faults = df[columns].apply(lambda row: any(row < lower_thresholds) or any(row > upper_thresholds), axis=1)
fault_timesteps = df_records.loc[faults, 'Time']
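If the thresholds are actually meant to come from a separate reference table and then be applied to df_log (which is what the min/max over df_records in the question suggests), a fully vectorized variant might look roughly like this; the column layout and array types here are assumptions:
import numpy as np

# assuming df_records is a DataFrame whose columns 1..16 hold the reference signals
# (plus a 'Time' column), and df_log is a NumPy array of the logged signals
lower = df_records.iloc[:, 1:17].min(axis=0).to_numpy()    # per-signal minima from the reference data
upper = df_records.iloc[:, 1:17].max(axis=0).to_numpy()    # per-signal maxima from the reference data

idx = np.asarray(x_range_original)
rows = df_log[idx, :16]                                    # only the timesteps of interest
out_of_range = (rows < lower) | (rows > upper)             # broadcast comparison, shape (len(idx), 16)

fault_array = out_of_range.any(axis=1).astype(int)         # one 0/1 flag per timestep instead of one per signal
fault_times = df_records['Time'].to_numpy()[idx][fault_array == 1]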
Related
How can I turn this code into a generator function? Or can I do it another way that avoids reading all the data into memory?
The problem right now is that my memory gets full: the process gets Killed after running for a long time.
Code:
data = [3,4,3,1,2]

def convert(data):
    for index in range(len(data)):
        if data[index] == 0:
            data[index] = 6
            data.append(8)
        elif data[index] == 1:
            data[index] = 0
        elif data[index] == 2:
            data[index] = 1
        elif data[index] == 3:
            data[index] = 2
        elif data[index] == 4:
            data[index] = 3
        elif data[index] == 5:
            data[index] = 4
        elif data[index] == 6:
            data[index] = 5
        elif data[index] == 7:
            data[index] = 6
        elif data[index] == 8:
            data[index] = 7
    return data

for i in range(256):
    output = convert(data)
    print(len(output))
Output:
266396864
290566743
316430103
346477329
376199930
412595447
447983143
490587171
534155549
582826967
637044072
692630033
759072776
824183073
903182618
982138692
1073414138
1171199621
1275457000
1396116848
1516813106
Killed
To answer the question: to turn a function into a generator function, all you have to do is yield something. You might do it like this:
def convert(data):
    while True:                        # repeat the pass so each yield is one full conversion
        for index in range(len(data)):
            ...
        yield data
Then, you can iterate over the output like this:
iter_converted_datas = convert(data)
for _, converted in zip(range(256), iter_converted_datas):
    print(len(converted))
I would also suggest some improvements to this code. The first thing that jumps out at me is to get rid of all those elif statements.
One helpful thing for this might be to supply a dictionary argument to your generator function that tells it how to convert the data values (the first one is a special case since it also appends).
Here is what that dict might look like:
replacement_dict = {
    0: 6,
    1: 0,
    2: 1,
    3: 2,
    4: 3,
    5: 4,
    6: 5,
    7: 6,
    8: 7,
}
By the way: replacing a series of elif statements with a dictionary is a pretty typical thing to do in python. It isn't always appropriate, but it often works well.
Now you can write your generator like this:
def convert(data, replacement_dict):
    while True:                                   # one yield per full conversion pass
        for index in range(len(data)):
            if data[index] == 0:                  # zero is the special case that also appends
                data.append(8)
            data[index] = replacement_dict[data[index]]
        yield data
And use it like this:
iter_converted_datas = convert(data, replacement_dict)
for _, converted in zip(range(256), iter_converted_datas):
    print(len(converted))
But we haven't yet addressed the underlying memory problem.
For that, we need to step back a second: the reason your memory is filling up is that you have created a routine whose data grows very large very fast. And if you were to keep going beyond 256 iterations, the list would keep growing without end.
If you want to compute the Xth output for some member of the list without storing the entire list into memory, you have to change things around quite a bit.
My suggestion on how you might get started: create a function to get the Xth iteration for any starting input value.
Here is a generator that just produces outputs based on the replacement dict. Depending on the contents of the replacement dict, this could be infinite, or it might have an end (in which case it would raise a KeyError). In your case, it is infinite.
def process_replacements(value, replacement_dict):
    while True:
        yield (value := replacement_dict[value])
Next we can write our function to process the Xth iteration for a starting value:
def process_xth(value, xth, replacement_dict):
    # emit the xth value from the original value
    for _, value in zip(range(xth), process_replacements(value, replacement_dict)):
        pass
    return value
Now you can process the Xth iteration for any value in your starting data list:
index = 0
xth = 256
process_xth(data[index], xth, replacement_dict)
However, we have not appended 8 to the data list anytime we encounter the 0 value. We could do this, but as you have discovered, eventually the list of 8s will get too big. Instead, what we need to do is keep COUNT of how many 8s we have added to the end.
So I suggest adding a zero_tracker function to increment the count:
def zero_tracker():
    global eights_count
    eights_count += 1
Now you can call that function in the generator every time a zero is encountered, but resetting the global eights_count to zero at the start of the iteration:
def process_replacements(value, replacement_dict):
    global eights_count
    eights_count = 0
    while True:
        if value == 0:
            zero_tracker()
        yield (value := replacement_dict[value])
Now, for any Xth iteration you perform at some point in the list, you can know how many 8s were appended at the end, and when they were added.
But unfortunately simply counting the 8s isn't enough to get the final sequence; you also have to keep track of WHEN (ie, which iteration) they were added to the sequence, so you can know how deeply to iterate them. You could store this in memory pretty efficiently by keeping track of each iteration in a dictionary; that dictionary would look like this:
eights_dict = {
    # iteration: count of 8s
}
And of course you can also calculate what each of these 8s will become at any arbitrary depth:
depth = 1
process_xth(8, depth, replacement_dict)
Once you know how many 8s there are added for every iteration given some finite number of Xth iterations, you can construct the final sequence by just yielding the correct value the right number of times over and over again, in a generator, without storing anything. I leave it to you to figure out how to construct your eights_dict and do this final part. :)
Here are a few things you can do to optimize it:
Instead of range(len(data)) you can use enumerate(data). This gives you access to both the element AND its index. Example:
EDIT: According to this post, range is faster than enumerate. If you care about speed, you could ignore this change.
for index, element in enumerate(data):
    if element == 0:
        data[index] = 6
Secondly, most of the if statements have a predictable pattern. So you can rewrite them like this:
def convert(data):
    for idx, elem in enumerate(data):
        if elem == 0:
            data[idx] = 6
            data.append(8)
        elif elem <= 8:
            data[idx] = elem - 1
Since lists are mutable, you don't need to return data. The function modifies it in place.
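For example, with the version above:
data = [3, 4, 3, 1, 2]
convert(data)        # no assignment needed; the list is mutated in place
print(data)          # -> [2, 3, 2, 0, 1]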
I see that you ask about generator functions, but that won't solve your memory issues. You run out of memory because, well, you keep everything in memory...
The memory complexity of your solution is O((8/7)^n), where n is the number of calls to convert. This is because every time you call convert(), the data structure grows by roughly 1/7 of its elements (on average), since every number in your structure has (roughly) a 1/7 probability of being zero.
So the memory complexity is O((8/7)^n), hence exponential. But can we do better?
Yes we can (assuming that the conversion function remains this "nice and predictable"). We can keep in memory just the number of zeros that were present in the structure each time we called convert(). That way, we get linear memory complexity, O(n). Does that come with a cost?
Yes. Element access time no longer has constant complexity O(1); it has linear complexity O(n), where n is the number of calls to convert() (at least that's what I came up with).
But it resolves the out-of-memory issue.
I also assumed that there would be a need to iterate over the computed list. If you are only interested in the length, it is sufficient to keep a count of each value and work over those counts. That way you would use just a few integers of memory.
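A minimal sketch of that counting idea (count_after is just an illustrative name, not part of the class below):
def count_after(seed, steps):
    """Length after `steps` conversions, tracking only nine counters."""
    counts = [0] * 9                 # counts[v] = how many entries currently hold the value v
    for value in seed:
        counts[value] += 1
    for _ in range(steps):
        zeros = counts[0]
        counts = counts[1:] + [0]    # every value drops by one
        counts[6] += zeros           # each zero wraps around to 6...
        counts[8] += zeros           # ...and also appends a new 8
    return sum(counts)

print(count_after([3, 4, 3, 1, 2], 256))   # 26984457539, the same figure the run below ends on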
Here is the code for the iterable version:
from copy import deepcopy # to keep original list untouched ;)
class Data:
    def __init__(self, seed):
        self.seed = deepcopy(seed)
        self.iteration = 0
        self.zero_counts = list()
        self.len = len(seed)

    def __len__(self):
        return self.len

    def __iter__(self):
        return DataIterator(self)

    def __repr__(self):
        """not necessary for a solution, but helps with debugging"""
        return "[" + (", ".join(f"{n}" for n in self)) + "]"

    def __getitem__(self, index: int):
        if index >= self.len:
            raise IndexError
        if index < len(self.seed):
            ret = self.seed[index] - self.iteration
        else:
            inner_it_idx = index - len(self.seed)
            for i, cnt in enumerate(self.zero_counts):
                if inner_it_idx < cnt:
                    ret = 9 + i - self.iteration
                    break
                else:
                    inner_it_idx -= cnt
        ret = ret if ret > 6 else ret % 7
        return ret

    def convert(self):
        zero_count = sum((self[i] == 0) for i, _ in enumerate(self.seed))
        for i, count in enumerate(self.zero_counts):
            i = 9 + i - self.iteration
            i = i if i > 6 else i % 7
            if i == 0:
                zero_count += count
        self.zero_counts.append(zero_count)
        self.len += self.zero_counts[self.iteration]
        self.iteration += 1


class DataIterator:
    """Iterator class for the Data class"""
    def __init__(self, seed_data):
        self.seed_data = seed_data
        self.index = 0

    def __next__(self):
        if self.index >= self.seed_data.len:
            raise StopIteration
        ret = self.seed_data[self.index]
        self.index += 1
        return ret
Here is code that tests logical equivalence with your version and prints the required output:
original_data = [3,4,3,1,2]

data = deepcopy(original_data)
d = Data(data)
for _ in range(30):
    output = convert(data)
    d.convert()
    print("---------------------------------------")
    print(len(output))
    assert len(output) == len(d)
    for i, e in enumerate(output):
        assert e == d[i]

data = deepcopy(original_data)
d = Data(data)
for _ in range(256):
    d.convert()
    print(len(d))
The results from the point where your program crashed onwards are:
1516813106
1662255394 <<< Killed here
1806321765
1976596756
2153338313
2348871138
2567316469
2792270106
3058372242
3323134871
3638852150
3959660078
4325467894
4720654782
5141141244
5625688711
6115404977
6697224392
7282794949
7964320044
8680314860
9466609138
10346343493
11256546221
12322913103
13398199926
14661544436
15963109809
17430929182
19026658353
20723155359
22669256596
24654746147
26984457539
I have created this function in Python for generating different price combinations for a product dataset. So if the price of a product is $10, the different possible prices would be [10, 11, 12, 13, 14, 15].
For example:
df = pd.DataFrame({'Product_id': [1, 2], 'price_per_tire': [10, 110]})
My function:
def price_comb(df):
    K = [0,1,2,3,4,5]
    final_df = pd.DataFrame()
    c = 0
    for j in K:
        c += 1
        print('K count=' + str(c))
        for index,i in df.iterrows():
            if (i['price_per_tire'] <= 100):
                i['price_per_tire'] = i['price_per_tire'] + 1*j
            elif ((i['price_per_tire'] > 100) & (i['price_per_tire'] < 200)):
                i['price_per_tire'] = i['price_per_tire'] + 2*j
            elif ((i['price_per_tire'] > 200) & (i['price_per_tire'] < 300)):
                i['price_per_tire'] = i['price_per_tire'] + 3*j
            elif i['price_per_tire'] >= 300:
                i['price_per_tire'] = i['price_per_tire'] + 5*j
            final_df = final_df.append(i)
    return final_df
When I run this function the output is
df = pd.DataFrame({'Product_id': [1,1,1,1,1,1, 2,2,2,2,2,2], 'price_per_tire': [10,11,12,13,14,15, 110,112,114,116,118,120]})
However, it's taking a lot of time (up to 2 days) for a 545k-row dataset. I'm trying to find ways to execute this faster. Any help would be appreciated.
Please provide a working version of the code; it is not clear where price_per_tire comes from.
This algorithm is O(N²), so there is a lot of room for improvement.
The first suggestion is to avoid the for loop by using numpy or pandas; try to solve your problem with a vectorized approach.
This means the inner loop can be refactored using the mask technique:
for _, x in df.iterrows():
    if x[fld] < limit:
        x[fld] = f(x[fld])
can be refactored:
mask = df[fld] < limit
df.loc[mask, fld] = f(df.loc[mask, fld])      # if f(unction) can work vectorially
df.loc[mask, fld] = df.loc[mask, fld].map(f)  # rolling version, but slower
With this approach, you can speed up your code to a surprisingly fast version.
Another point is that df.append is not good practice; making changes in place is more efficient. Create all the needed columns before the main loop so that all the required space is allocated up front.
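Applied to this particular function, the masked approach might look roughly like the sketch below. price_comb_vectorized is just an illustrative name, and the boundary handling simply mirrors the original if/elif chain:
import numpy as np
import pandas as pd

def price_comb_vectorized(df):
    price = df['price_per_tire']
    # per-row step size chosen by price band, mirroring the original if/elif chain
    step = np.select(
        [price <= 100,
         (price > 100) & (price < 200),
         (price > 200) & (price < 300),
         price >= 300],
        [1, 2, 3, 5],
        default=0,   # a price of exactly 200 falls through every branch in the original
    )
    # one shifted copy of the frame per j in K = [0, 1, 2, 3, 4, 5], concatenated once
    frames = []
    for j in range(6):
        shifted = df.copy()
        shifted['price_per_tire'] = price + step * j
        frames.append(shifted)
    return pd.concat(frames, ignore_index=True)

# e.g. price_comb_vectorized(pd.DataFrame({'Product_id': [1, 2], 'price_per_tire': [10, 110]}))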
I have a function which creates a set of results in a list. This is in a for-loop which changes one of the variables in each iteration. I need to be able to store these lists separately so that I can show the difference in results between each iteration as a graph.
Is there any way to store them separately like that? So far the only solution I've found is to copy out the function multiple times and manually change the variable and name of the list it stores to, but obviously this is a terrible way of doing it and I figure there must be a proper way.
Here is the code. The function is messy but works. Ideally I would be able to put this all in another for-loop which changes deceleration_p each iteration and then stores collected_averages as a different list so that I could compare collected_averages for each iteration.
import numpy as np
import random
import matplotlib.pyplot as plt
from statistics import mean

road_length = 500
deceleration_p = 0.1
max_speed = 5
buffer_road = np.zeros(road_length, dtype=int)
buffer_speed = 0
number_of_iterations = 1000
average_speed = 0
average_speed_list = []
collected_averages = []
total_speed = 0

for cars in range(1, road_length):
    empty_road = np.ones(road_length - cars, dtype=int) * -1
    cars_on_road = np.ones(cars, dtype=int)
    road = np.append(empty_road, cars_on_road)
    np.random.shuffle(road)
    for i in range(0, number_of_iterations):
        # acceleration
        for speed in np.nditer(road, op_flags=['readwrite']):
            if -1 < speed < max_speed:
                speed[...] += 1
        # randomisation
        for speed in np.nditer(road, op_flags=['readwrite']):
            if 0 < speed:
                if deceleration_p > random.random():
                    speed += -1
        # slowing down
        for cell in range(0, road_length):
            speed = road[cell]
            for val in range(1, speed + 1):
                new_speed = val
                if (cell + val) > (road_length - 1):
                    val += -road_length
                if road[cell + val] > -1:
                    speed = val - 1
                    road[cell] = new_speed - 1
                    break
        buffer_road = np.ones(road_length, dtype=int) * -1
        for cell in range(0, road_length):
            speed = road[cell]
            buffer_cell = cell + speed
            if (buffer_cell) > (road_length - 1):
                buffer_cell += -road_length
            if speed > -1:
                total_speed += speed
                buffer_road[buffer_cell] = speed
        road = buffer_road
        average_speed = total_speed/cars
        average_speed_list.append(average_speed)
        average_speed = 0
        total_speed = 0
    steady_state_average = mean(average_speed_list[9:number_of_iterations])
    average_speed_list = []
    collected_averages.append(steady_state_average)
Not to my knowledge. As stated in the comments, you could use a dictionary, but my suggestion is to use a list: for every iteration of the outer loop, append that run's results. Since your results are already in a list, that gives you a 2D structure. My recommendation would be to use a numpy array, as it is much faster. Hopefully this is helpful.
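A sketch of what that could look like, assuming the existing simulation is wrapped in a function (run_simulation is a hypothetical name; its body would be the outer for-cars loop from the question):
def run_simulation(deceleration_p):
    collected_averages = []
    # ... the existing `for cars in range(1, road_length):` loop goes here,
    #     using this deceleration_p and appending each steady_state_average ...
    return collected_averages

# one result list per deceleration probability, kept separately
results = {p: run_simulation(p) for p in [0.1, 0.2, 0.3, 0.5]}

for p, averages in results.items():
    plt.plot(averages, label=f"deceleration_p = {p}")
plt.legend()
plt.show()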
I have a numpy array with these values:
[10620.5, 11899., 11879.5, 13017., 11610.5]
import numpy as np
array = np.array([10620.5, 11899, 11879.5, 13017, 11610.5])
I would like to get values that are "close" (in this instance, 11899 and 11879) and average them, then replace them with a single instance of the new number resulting in this:
[10620.5, 11889, 13017, 11610.5]
the term "close" would be configurable. let's say a difference of 50
the purpose of this is to create Spans on a Bokah graph, and some lines are just too close
I am super new to python in general (a couple weeks of intense dev)
I would think that I could arrange the values in order, and somehow grab the one to the left, and right, and do some math on them, replacing a match with the average value. but at the moment, I just dont have any idea yet.
Try something like this. I added a few extra steps, just to show the flow:
The idea is to group the data into adjacent groups, and decide whether to merge them or not based on how spread out they are.
So, as you describe, you can combine your data in sets of 3 numbers, and if the difference between the max and min numbers is less than 50 you average them, otherwise you leave them as is.
import pandas as pd
import numpy as np

arr = np.ravel([1,24,5.3, 12, 8, 45, 14, 18, 33, 15, 19, 22])
arr.sort()

def reshape_arr(a, n):  # n is number of consecutive adjacent items you want to compare for averaging
    hold = len(a) % n
    if hold != 0:
        container = a[-hold:]  # numbers that do not fit on the array will be excluded for averaging
        a = a[:-hold].reshape(-1, n)
    else:
        a = a.reshape(-1, n)
        container = None
    return a, container

def get_mean(a, close):  # close = how close adjacent numbers need to be, in order to be averaged together
    my_list = []
    for i in range(len(a)):
        if a[i].max() - a[i].min() > close:
            for j in range(len(a[i])):
                my_list.append(a[i][j])
        else:
            my_list.append(a[i].mean())
    return my_list

def final_list(a, c):  # add any elements held in the container to the final list
    if c is not None:
        c = c.tolist()
        for i in range(len(c)):
            a.append(c[i])
    return a

arr, container = reshape_arr(arr, 3)
arr = get_mean(arr, 5)
final_list(arr, container)
You could use fuzzywuzzy here to gauge the ratio of closeness between 2 data sets.
See details here: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/
Taking Gustavo's answer and tweaking it to my needs:
def reshape_arr(a, close):
    flag = True
    while flag is not False:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a
this will do it if I do something like this:
candlesticks['support'] = reshape_arr(supres_df['support'], 150)
where candlesticks is the main DataFrame that I am using and supres_df is another DataFrame that I am massaging before I apply it to the main one.
it works, but is extremely slow. I am trying to optimize it now.
I added a while loop because, after averaging, the averages can become close enough to be averaged again, so I keep looping until nothing needs to be averaged anymore. This is total newbie work, so if you see something silly, please comment.
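For reference, one possible direction for the optimization (just a sketch, not a drop-in replacement, since it returns the merged levels instead of rewriting the Series in place): sort the unique values once, then merge each run of neighbours closer than close in a single pass.
import numpy as np

def merge_close(values, close):
    values = np.unique(values)          # sorted, distinct values
    merged = []
    group = [values[0]]
    for v in values[1:]:
        if v - group[-1] < close:       # still within `close` of the previous value: same group
            group.append(v)
        else:
            merged.append(np.mean(group))
            group = [v]
    merged.append(np.mean(group))
    return np.array(merged)

# merge_close(np.array([10620.5, 11899, 11879.5, 13017, 11610.5]), 50)
# -> array([10620.5, 11610.5, 11889.25, 13017.])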
Is there a way to speed up this code?
In the following code, the L_mo, U_star, rah, H, etc. arrays are all of the same size, i.e. 4000 x 5000.
All other variables in the loop are constant for the current loop.
When I run the following nested for loop, it takes almost 40-50 min.
for i in range(img_width):
    for j in range(img_height):
        error = 99
        L_mo_cell = L_mo[i,j]
        U_star_cell = U_star[i,j]
        rah_cell = rah[i,j]
        H_cell = H[i,j]
        Zom_cell = Zom[i,j]
        d_cell = d[i,j]
        Zoh_cell = Zoh[i,j]
        To_cell = To[i,j]
        if (L_mo_cell < 0) & (L_mo_cell > -10):  # Unstable Conditions
            while error > 0.05:
                dumm = rah_cell  # dummy variable
                x1 = (1-16*(Zm-d_cell)/L_mo_cell)**0.25
                x2 = (1-16*(Zoh_cell)/L_mo_cell)**0.25
                x3 = (1-16*(Zoh_cell)/L_mo_cell)**0.25
                phi_h1 = 2*np.log((1+x1**2)/2)
                phi_h2 = 2*np.log((1+x2**2)/2)
                phi_m1 = 2*np.log((1+x1)/2)+np.log((1+x1**2)/2)-2*np.arctan(x1)+np.pi/2
                phi_m2 = 2*np.log((1+x3)/2)+np.log((1+x3**2)/2)-2*np.arctan(x3)+np.pi/2
                # Calculate Parameters (U*,rah,H,L_mo)
                U_star_cell = (u*const_k)/(np.log((Zm-d_cell)/Zom_cell)-phi_m1+phi_m2)
                rah_cell = (np.log((Zm-d_cell)/Zoh_cell)-phi_h1+phi_h2)/(U_star_cell*const_k)
                H_cell = (rho_a*const_Cpa)*(To_cell-Ta)/rah_cell
                L_mo_cell = (U_star_cell**3*(Ta+273.15)*rho_a*const_Cpa)/(const_g*const_k*H_cell)
                error = abs((rah_cell-dumm)/rah_cell)
        elif (L_mo_cell > 0) & (L_mo_cell < 10):  # Stable Conditions
            while error > 0.05:
                dumm = rah_cell  # dummy variable
                phi_h1 = 6*((Zm-d_cell)/L_mo_cell)*np.log((1-Zm/L_mo_cell))
                phi_h2 = 6*(Zom_cell/L_mo_cell)*np.log((1-Zm/L_mo_cell))
                phi_m1 = phi_h1
                phi_m2 = phi_h2
                # Calculate Parameters (U*,rah,H,L_mo)
                U_star_cell = (u*const_k)/(np.log((Zm-d_cell)/Zom_cell)-phi_m1+phi_m2)
                rah_cell = (np.log((Zm-d_cell)/Zoh_cell)-phi_h1+phi_h2)/(U_star_cell*const_k)
                H_cell = (rho_a*const_Cpa)*(To_cell-Ta)/rah_cell
                L_mo_cell = -(U_star_cell**3*(Ta+273.15)*rho_a*const_Cpa)/(const_g*const_k*H_cell)
                error = abs((rah_cell-dumm)/rah_cell)
                # print (error)
        else:  # Neutral Conditions
            state = 0
            U_star_cell = (u*const_k)/(np.log((Zm-d_cell)/Zom_cell))
            rah_cell = np.log((Zm-d_cell)/Zom_cell)*np.log((Zm-d_cell)/Zoh_cell)/(u*const_k**2)
            H_cell = (rho_a*const_Cpa)*(To_cell-Ta)/rah_cell
            L_mo_cell = -(U_star_cell**3*Ta*rho_a*const_Cpa)/(const_g*const_k*H_cell)
        U_star[i,j] = U_star_cell
        rah[i,j] = rah_cell
        H[i,j] = H_cell
        L_mo[i,j] = L_mo_cell
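A rough, untested sketch of the usual direction for this kind of loop: replace the per-cell while loops with whole-array updates restricted by boolean masks, so every unconverged cell of a stability class is updated at once. Only the unstable branch is shown here; the stable and neutral branches would follow the same pattern with their own formulas, and all names are taken from the question.
import numpy as np

unstable = (L_mo < 0) & (L_mo > -10)
error = np.where(unstable, 99.0, 0.0)        # cells outside this class start out "converged"

while (error > 0.05).any():
    active = unstable & (error > 0.05)       # only touch cells that have not converged yet
    dumm = rah[active]
    x1 = (1 - 16 * (Zm - d[active]) / L_mo[active]) ** 0.25
    x2 = (1 - 16 * Zoh[active] / L_mo[active]) ** 0.25
    phi_h1 = 2 * np.log((1 + x1 ** 2) / 2)
    phi_h2 = 2 * np.log((1 + x2 ** 2) / 2)
    phi_m1 = 2 * np.log((1 + x1) / 2) + np.log((1 + x1 ** 2) / 2) - 2 * np.arctan(x1) + np.pi / 2
    phi_m2 = 2 * np.log((1 + x2) / 2) + np.log((1 + x2 ** 2) / 2) - 2 * np.arctan(x2) + np.pi / 2
    U_star[active] = (u * const_k) / (np.log((Zm - d[active]) / Zom[active]) - phi_m1 + phi_m2)
    rah[active] = (np.log((Zm - d[active]) / Zoh[active]) - phi_h1 + phi_h2) / (U_star[active] * const_k)
    H[active] = (rho_a * const_Cpa) * (To[active] - Ta) / rah[active]
    L_mo[active] = (U_star[active] ** 3 * (Ta + 273.15) * rho_a * const_Cpa) / (const_g * const_k * H[active])
    error[active] = np.abs((rah[active] - dumm) / rah[active])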