Finding contiguous, non-unique slices in Pandas series without iterating - python

I'm trying to parse a logfile of our manufacturing process. Most of the time the process is run automatically but occasionally, the engineer needs to switch into manual mode to make some changes and then switches back to automatic control by the reactor software. When set to manual mode the logfile records the step as being "MAN.OP." instead of a number. Below is a representative example.
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
ser_orig = pd.Series(steps)
which results in
0 1
1 2
2 2
3 MAN.OP.
4 MAN.OP.
5 2
6 2
7 3
8 3
9 MAN.OP.
10 MAN.OP.
11 4
12 4
dtype: object
I need to detect the 'MAN.OP.' and make them distinct from each other. In this example, the two regions with values == 2 should be one region after detecting the manual mode section like this:
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
I have code that iterates over this series and produces the correct result when the series is passed to my object. The setter is:
#step_series.setter
def step_series(self, ss):
"""
On assignment, give the manual mode steps a unique name. Leave
the steps done on recipe the same.
"""
manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"
counter = 0
continuous = False
for i in ss.index:
if continuous and ss.at[i] != manual_mode:
continuous = False
counter += 1
elif not continuous and ss.at[i] == manual_mode:
continuous = True
ss.at[i] = new_manual_mode_text.format(str(counter))
elif continuous and ss.at[i] == manual_mode:
ss.at[i] = new_manual_mode_text.format(str(counter))
self._step_series = ss
but this iterates over the entire dataframe and is the slowest part of my code other than reading the logfile over the network.
How can I detect these non-unique sections and rename them uniquely without iterating over the entire series? The series is a column selection from a larger dataframe so adding extra columns is fine if needed.
For the completed answer I ended up with:
#step_series.setter
def step_series(self, ss):
pd.options.mode.chained_assignment = None
manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"
newManOp = (ss=='MAN.OP.') & (ss != ss.shift())
ss[ss == 'MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)
self._step_series = ss

Here's one way:
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
steps = pd.Series(steps)
newManOp = (steps=='MAN.OP.') & (steps != steps.shift())
steps[steps=='MAN.OP.'] += seq.cumsum().astype(str)
>>> steps
0 1
1 2
2 2
3 MAN.OP.1
4 MAN.OP.1
5 2
6 2
7 3
8 3
9 MAN.OP.2
10 MAN.OP.2
11 4
12 4
dtype: object
To get the exact format you listed (starting from zero instead of one, and changing from "MAN.OP." to "Manual_mode_"), just tweak the last line:
steps[steps=='MAN.OP.'] = 'Manual_Mode_' + (seq.cumsum()-1).astype(str)
>>> steps
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
There a pandas enhancement request for contiguous groupby, which would make this type of task simpler.

There is s function in matplotlib that takes a boolean array and returns a list of (start, end) pairs. Each pair represents a contiguous region where the input is True.
import matplotlib.mlab as mlab
regions = mlab.contiguous_regions(ser_orig == manual_mode)
for i, (start, end) in enumerate(regions):
ser_orig[start:end] = new_manual_mode_text.format(i)
ser_orig
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object

Related

list index out of range in calculation of nodal distance

I am working on a small task in which I have to find the distance between two nodes. Each node has X and Y coordinates which can be seen below.
node_number X_coordinate Y_coordinate
0 0 1 0
1 1 1 1
2 2 1 2
3 3 1 3
4 4 0 3
5 5 0 4
6 6 1 4
7 7 2 4
8 8 3 4
9 9 4 4
10 10 4 3
11 11 3 3
12 12 2 3
13 13 2 2
14 14 2 1
15 15 2 0
For the purpose I mentioned above, I wrote below code,
X1_coordinate = df['X_coordinate'].tolist()
Y1_coordinate = df['Y_coordinate'].tolist()
node_number1 = df['node_number'].tolist()
nodal_dist = []
i = 0
for i in range(len(node_number1)):
dist = math.sqrt((X1_coordinate[i+1] - X1_coordinate[i])**2 + (Y1_coordinate[i+1] - Y1_coordinate[i])**2)
nodal_dist.append(dist)
I got the error
list index out of range
Kindly let me know what I am doing wrong and what should I change to get the answer.
Indexing starts at zero, so the last element in the list has an index that is one less than the number of elements in that list. But the len() function gives you the number of elements in the list (in other words, it starts counting at 1), so you want the range of your loop to be len(node_number1) - 1 to avoid an -off-by-one error.
The problems should been in this line
dist = math.sqrt((X1_coordinate[i+1] - X1_coordinate[i])**2 + (Y1_coordinate[i+1] - Y1_coordinate[i])**2)
the X1_coordinate[i+1] and the ] Y1_coordinate[i+1]] go out of range on the last number call.

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500)
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
for i in range(len(df):
if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations
pandas-rolling-apply-using-multiple-columns for reference
I have also included a simpler calc to demonstrate behaviour in simpler way
df = pd.DataFrame(index=range(5000))
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
import numpy_ext as npe
# for i in range(len(df):
# if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
# SO example - function of previous values in A and B
def f(A,B):
r = np.sum(A[:-1]/3) - np.sum(B[:-1] + 25) if len(A)>1 else A[0]
return r
# much simpler example, sum of previous values
def g(A):
return np.sum(A[:-1])
df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
sample output
A
B
AB_combo
A_running
0
1
5
1
0
1
2
5
-29.6667
1
2
2
5
-59
3
3
2
5
-88.3333
5
4
2
5
-117.667
7
5
2
5
-147
9
6
2
5
-176.333
11
7
2
5
-205.667
13
8
2
5
-235
15
9
2
5
-264.333
17

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
test = (i-numPoints):i;
ind_noaaHigh_end(test) = 1;
n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
co_noaaHigh = [co_noaaHigh mean(co_c(test))];
h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
t_c_High = []; # time
for i in range(len(valveW_c)):
# NOAA HIGH
if (valveW_c[i] == 1):
t_c_High.append(t_c[i])
n2o_noaaHigh.append(n2o_c[i])
co2_noaaHigh.append(co2_c[i])
co_noaaHigh.append(co_c[i])
h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use groupby method to get the average for the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle 0
1 246.9768
2 246.6579
3 246.7269
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
results[k] = v
print(f'[group {k}]')
print(v)
Shift(), as it suggests, shifts the column of the valve cycle allows to detect changes in number sequences. Then, cumsum() helps to give a unique number to each of the group with the same number sequence. Then we can do a groupby() on this column (which was not possible before because groups were either of ones or twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then to get the mean for each cycle; you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
if len(v) < n:
mean_n_last.append(np.nan)
else:
mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3 !
If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted you could easily sort it like this.
df.sort_values(by=['time'])

Python - Create multiple lists and zip

I am looking to produce multiple lists based on the same function which randomises data based on a list. I want to be able to easily change how many of these new lists I want to have and then combine. The code which creates each list is the following:
"""
"""
R_ensemble=[]
for i in range(0,len(R)):
if R[i]==0:
R_ensemble.append(0)
else:
R_ensemble.append(np.random.normal(loc=R[i],scale=R[i]/4,size=None))
return R_ensemble
This perturbs each value from the list based on a normal distribution.
To combine them is fine when I just want a handful of lists:
"""
"""
ensemble_form_1,ensemble_form_2,ensemble_form_3 = [],[],[]
ensemble_form_1 = normal_transform(R)
ensemble_form_2 = normal_transform(R)
ensemble_form_3 = normal_transform(R)
zipped_ensemble = list(zip(ensemble_form_1,ensemble_form_2,ensemble_form_3))
df_ensemble = pd.DataFrame(zipped_ensemble, columns = ['Ensemble_1', 'Ensemble_2','Ensemble_3'])
return ensemble_form_1, ensemble_form_2, ensemble_form_3
How could I repeat the same randomisation process to create a fixed number of lists (say 50 or 100), and then combine them into a table? Is there an easy way to do this with a for loop, or any other method? I'd need to be able to pick out each new list/column individually, as I would be combining the results in some way.
Any help would be greatly appreciated.
You can construct multiple lists and a table like this:
import pandas as pd
import numpy as np
# Your function for creating the individual lists
def normal_transform(R):
R_ensemble=[]
for i in range(0,len(R)):
if R[i]==0:
R_ensemble.append(0)
else:
R_ensemble.append(np.random.normal(loc=R[i],scale=R[i]/4,size=None))
return R_ensemble
# Construction of multiple lists and the dataframe
NUM_LISTS = 50
R = list(range(100))
data = dict()
for i in range(NUM_LISTS):
data['Ensemble_' + str(i)] = normal_transform(R)
df_ensemble = pd.DataFrame(data)
You can access the individual lists/ columns like this:
df_ensemble['Ensemble_42']
df_ensemble[df_ensemble.columns[42]]
You can use zip() with * to create dataframe with variable number of columns. For example:
import pandas as pd
def generate_list(n):
#... generate your list here
return [*range(n)]
def get_dataframe(n_columns, n):
return pd.DataFrame(zip(*[generate_list(n) for _ in range(n_columns)]), columns=['Ensemble_{}'.format(i) for i in range(1, n_columns+1)])
print(get_dataframe(8, 10))
Prints (8 columns, 10 rows):
Ensemble_1 Ensemble_2 Ensemble_3 Ensemble_4 Ensemble_5 Ensemble_6 Ensemble_7 Ensemble_8
0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9

sum multiple dataframes in model

I have build a model for the optimization of my energy consumption.
Some of the variables are given in datasheets.
For simplicity and the availability of different datasheets I build my model inside a class.
The different energy consumptions are defined in the init function of the class:
def __init__(self, PV, refrigerator, clotheswasher, dishwasher, freezer):
self.PV = PV
self.refrigerator = refrigerator
self.clotheswasher = clotheswasher
self.dishwasher = dishwasher
self.freezer = freezer
These values are used in my model for all timestamps of the data (i.e. one day, every 5 min data)
for t in self.model.T:
self.model.PV[t] = self.PV
self.model.refrigerator[t] = self.refrigerator
self.model.clotheswasher[t] = self.clotheswasher
self.model.dishwasher[t] = self.dishwasher
self.model.freezer[t] = self.freezer
I want to add them up to make a plot of my total energy consumption of the day
self.model.total[t] = self.model.PV[t] + self.model.refrigerator[t] + self.model.clotheswasher[t] + self.model.dishwasher[t] + self.model.freezer[t]
However, by doing so for every t in self.model.total[t] I get a dataframe with the summations, i.e., when adding A B and C:
index A B C
1 3 4 2
2 2 1 4
3 1 3 2
I would like to get a dataframe like:
index tot
1 9
2 7
3 6
but I get:
index tot
1 9
7
6
2 9
7
6
3 9
7
6
can someone help me out?
If we simplify what you are trying to do by working on your simplified data example, we could use the following code:
import pandas as pd
DF = pd.read_csv("Data.csv")
print DF
print
for i in range(len(DF)):
print i, sum(DF.iloc[i])
which would yield the following output:
A B C
0 3 4 2
1 2 1 4
2 1 3 2
0 9
1 7
2 6
You are probably just making some simple mistake with your class instantiation and data loading. Once you fix that, your results will likely come out right. Start from a small simple data set until you find your issue, and trouble shoot each step.

Categories

Resources