sum multiple dataframes in model - python

I have build a model for the optimization of my energy consumption.
Some of the variables are given in datasheets.
For simplicity and the availability of different datasheets I build my model inside a class.
The different energy consumptions are defined in the init function of the class:
def __init__(self, PV, refrigerator, clotheswasher, dishwasher, freezer):
self.PV = PV
self.refrigerator = refrigerator
self.clotheswasher = clotheswasher
self.dishwasher = dishwasher
self.freezer = freezer
These values are used in my model for all timestamps of the data (i.e. one day, every 5 min data)
for t in self.model.T:
self.model.PV[t] = self.PV
self.model.refrigerator[t] = self.refrigerator
self.model.clotheswasher[t] = self.clotheswasher
self.model.dishwasher[t] = self.dishwasher
self.model.freezer[t] = self.freezer
I want to add them up to make a plot of my total energy consumption of the day
self.model.total[t] = self.model.PV[t] + self.model.refrigerator[t] + self.model.clotheswasher[t] + self.model.dishwasher[t] + self.model.freezer[t]
However, by doing so for every t in self.model.total[t] I get a dataframe with the summations, i.e., when adding A B and C:
index A B C
1 3 4 2
2 2 1 4
3 1 3 2
I would like to get a dataframe like:
index tot
1 9
2 7
3 6
but I get:
index tot
1 9
7
6
2 9
7
6
3 9
7
6
can someone help me out?

If we simplify what you are trying to do by working on your simplified data example, we could use the following code:
import pandas as pd
DF = pd.read_csv("Data.csv")
print DF
print
for i in range(len(DF)):
print i, sum(DF.iloc[i])
which would yield the following output:
A B C
0 3 4 2
1 2 1 4
2 1 3 2
0 9
1 7
2 6
You are probably just making some simple mistake with your class instantiation and data loading. Once you fix that, your results will likely come out right. Start from a small simple data set until you find your issue, and trouble shoot each step.

Related

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500)
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
for i in range(len(df):
if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations
pandas-rolling-apply-using-multiple-columns for reference
I have also included a simpler calc to demonstrate behaviour in simpler way
df = pd.DataFrame(index=range(5000))
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
import numpy_ext as npe
# for i in range(len(df):
# if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
# SO example - function of previous values in A and B
def f(A,B):
r = np.sum(A[:-1]/3) - np.sum(B[:-1] + 25) if len(A)>1 else A[0]
return r
# much simpler example, sum of previous values
def g(A):
return np.sum(A[:-1])
df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
sample output
A
B
AB_combo
A_running
0
1
5
1
0
1
2
5
-29.6667
1
2
2
5
-59
3
3
2
5
-88.3333
5
4
2
5
-117.667
7
5
2
5
-147
9
6
2
5
-176.333
11
7
2
5
-205.667
13
8
2
5
-235
15
9
2
5
-264.333
17

Index and save last N points from a list that meets conditions from dataframe Python

I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
test = (i-numPoints):i;
ind_noaaHigh_end(test) = 1;
n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
co_noaaHigh = [co_noaaHigh mean(co_c(test))];
h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
t_c_High = []; # time
for i in range(len(valveW_c)):
# NOAA HIGH
if (valveW_c[i] == 1):
t_c_High.append(t_c[i])
n2o_noaaHigh.append(n2o_c[i])
co2_noaaHigh.append(co2_c[i])
co_noaaHigh.append(co_c[i])
h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use groupby method to get the average for the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle 0
1 246.9768
2 246.6579
3 246.7269
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
results[k] = v
print(f'[group {k}]')
print(v)
Shift(), as it suggests, shifts the column of the valve cycle allows to detect changes in number sequences. Then, cumsum() helps to give a unique number to each of the group with the same number sequence. Then we can do a groupby() on this column (which was not possible before because groups were either of ones or twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then to get the mean for each cycle; you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
if len(v) < n:
mean_n_last.append(np.nan)
else:
mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3 !
If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted you could easily sort it like this.
df.sort_values(by=['time'])

Change some values based on condition

Can you help on the following task? I have a dataframe column such as:
index df['Q0']
0 1
1 2
2 3
3 5
4 5
5 6
6 7
7 8
8 3
9 2
10 4
11 7
I want to substitute the values in df.loc[3:8,'Q0'] with the values in df.loc[0:2,'Q0'] if df.loc[0,'Q0']!=df.loc[3,'Q0']
The result should look like the one below:
index df['Q0']
0 1
1 2
2 3
3 1
4 2
5 3
6 1
7 2
8 3
9 2
10 4
11 7
I tried the following line:
df.loc[3:8,'Q0'].where(~df.loc[0,'Q0']!=df.loc[3,'Q0']),other=df.loc[0:2,'Q0'],inplace=True)
or
df['Q0'].replace(to_replace=df.loc[3:8,'Q0'], value=df.loc[0:2,'Q0'], inplace=True)
But it doesn't work. Most possible I am doing something wrong.
Any suggestions?
You can use the cycle function:
from itertools import cycle
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
df["Q0"][3:8] = [next(c) for _ in range(5)]
Thanks for the replies. I tried the suggestions but I have some issues:
#adnanmuttaleb -
When I applied the function in a dataframe with more than 1 column (e.g. 12x2 or larger) I notice that the value in df.Q0[8] didn't change. Why?
#jezrael -
When I adjust to your suggestion I get the error:
ValueError: cannot copy sequence with size 5 to array axis with dimension 6
When I change the range to 6, I am getting wrong results
import pandas as pd
from itertools import cycle
data={'Q0':[1,2,3,5,5,6,7,8,3,2,4,7],
'Q0_New':[0,0,0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
##### version 1
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
df['Q0_New'][3:8] = [next(c) for _ in range(5)]
##### version 2
d = cycle(df.loc[0:3,'Q0'])
if df.Q0[0] != df.Q0[3]:
df.loc[3:8,'Q0_New'] = [next(d) for _ in range(6)]
Why we have different behaviors and what corrections need to be made?
Thanks once more guys.

How can I reference particular cells in a dataframe?

I am a beginner and this is my first project.. I searched for the answer but it still isn't clear.
I have imported a worksheet from excel using Pandas..
**Rabbit Class:
Num Behavior Speaking Listening
0 1 3 1 1
1 2 1 1 1
2 3 3 1 1
3 4 1 1 1
4 5 3 2 2
5 6 3 2 3
6 7 3 3 1
7 8 3 3 3
8 9 2 3 2
What I want to do is create if functions.. ex. if a student's behavior is a "1" I want it to print one string, else print a different string. How can I reference a particular cell of the worksheet to set up such a function? I tried: val = df.at(1, "Behavior") but that clearly isn't working..
Here is the code I have so far..
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
path = r"C:\Users\USER\Desktop\Python\rabbit_class.xls"
print("Rabbit Class:")
print(df)
Also you can do
dff = df.loc[df['Behavior']==1]
if(not(dff.empty)):
# do Something
What you want is to find rows where df.Behavior is equal to 1. Use any of the following three methods.
# Method-1
df[df["Behavior"]==1]
# Method-2
df.loc[df["Behavior"]==1]
# Method-3
df.query("Behavior==1")
Output:
Num Behavior Speaking Listening LastColumn
0 0 1 3 1 1
Note: Dummy Data
Your sample data does not have a column header (the last one). So I named it LastColumn and read-in the data as a dataframe.
# Dummy Data
s = """
Num Behavior Speaking Listening LastColumn
0 1 3 1 1
1 2 1 1 1
2 3 3 1 1
3 4 1 1 1
4 5 3 2 2
5 6 3 2 3
6 7 3 3 1
7 8 3 3 3
8 9 2 3 2
"""
# Make Dataframe
ss = re.sub('\s+',',',s)
ss = ss[1:-1]
sa = np.array(ss.split(',')).reshape(-1,5)
df = pd.DataFrame(dict((k,v) for k,v in zip(sa[0,:], sa[1:,].T)))
df = df.astype(int)
df
Hope below example will help you
import pandas as pd
df = pd.read_excel(r"D:\test_stackoverflow.xlsx")
print(df.columns)
def _filter(col, filter_):
return df[df[col]==filter_]
print(_filter('Behavior', 1))
Thank you all for your answers. I finally figured out what I was trying to do using the following code:
i = 0
for i in df.index:
student_number = df["Student Number"][i]
print(student_number)
student_name = student_list[int(student_number) - 1]
behavior = df["Behavior"][i]
if behavior == 1:
print("%s's behavior is good" % student_name)
elif behavior == 2:
print ("%s's behavior is average." % student_name)
else:
print ("%s's behavior is poor" % student_name)
speaking = df["Speaking"][i]

Finding contiguous, non-unique slices in Pandas series without iterating

I'm trying to parse a logfile of our manufacturing process. Most of the time the process is run automatically but occasionally, the engineer needs to switch into manual mode to make some changes and then switches back to automatic control by the reactor software. When set to manual mode the logfile records the step as being "MAN.OP." instead of a number. Below is a representative example.
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
ser_orig = pd.Series(steps)
which results in
0 1
1 2
2 2
3 MAN.OP.
4 MAN.OP.
5 2
6 2
7 3
8 3
9 MAN.OP.
10 MAN.OP.
11 4
12 4
dtype: object
I need to detect the 'MAN.OP.' and make them distinct from each other. In this example, the two regions with values == 2 should be one region after detecting the manual mode section like this:
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
I have code that iterates over this series and produces the correct result when the series is passed to my object. The setter is:
#step_series.setter
def step_series(self, ss):
"""
On assignment, give the manual mode steps a unique name. Leave
the steps done on recipe the same.
"""
manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"
counter = 0
continuous = False
for i in ss.index:
if continuous and ss.at[i] != manual_mode:
continuous = False
counter += 1
elif not continuous and ss.at[i] == manual_mode:
continuous = True
ss.at[i] = new_manual_mode_text.format(str(counter))
elif continuous and ss.at[i] == manual_mode:
ss.at[i] = new_manual_mode_text.format(str(counter))
self._step_series = ss
but this iterates over the entire dataframe and is the slowest part of my code other than reading the logfile over the network.
How can I detect these non-unique sections and rename them uniquely without iterating over the entire series? The series is a column selection from a larger dataframe so adding extra columns is fine if needed.
For the completed answer I ended up with:
#step_series.setter
def step_series(self, ss):
pd.options.mode.chained_assignment = None
manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"
newManOp = (ss=='MAN.OP.') & (ss != ss.shift())
ss[ss == 'MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)
self._step_series = ss
Here's one way:
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
steps = pd.Series(steps)
newManOp = (steps=='MAN.OP.') & (steps != steps.shift())
steps[steps=='MAN.OP.'] += seq.cumsum().astype(str)
>>> steps
0 1
1 2
2 2
3 MAN.OP.1
4 MAN.OP.1
5 2
6 2
7 3
8 3
9 MAN.OP.2
10 MAN.OP.2
11 4
12 4
dtype: object
To get the exact format you listed (starting from zero instead of one, and changing from "MAN.OP." to "Manual_mode_"), just tweak the last line:
steps[steps=='MAN.OP.'] = 'Manual_Mode_' + (seq.cumsum()-1).astype(str)
>>> steps
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
There a pandas enhancement request for contiguous groupby, which would make this type of task simpler.
There is s function in matplotlib that takes a boolean array and returns a list of (start, end) pairs. Each pair represents a contiguous region where the input is True.
import matplotlib.mlab as mlab
regions = mlab.contiguous_regions(ser_orig == manual_mode)
for i, (start, end) in enumerate(regions):
ser_orig[start:end] = new_manual_mode_text.format(i)
ser_orig
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object

Categories

Resources