I have the following code I made that gets data from a machine in CSV format:
import pandas as pd
import numpy as np
header_list = ['Time']
df = pd.read_csv('S8-1.csv' , skiprows=6 , names = header_list)
#splits the data into proper columns
df[['Date/Time','Pressure']] = df.Time.str.split(",,", expand=True)
#deletes original messy column
df.pop('Time')
#convert Pressure from object to numeric
df['Pressure'] = pd.to_numeric(df['Pressure'], errors = 'coerce')
#converts to a time
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format = '%m/%d/%y %H:%M:%S.%f' , errors = 'coerce')
#calculates rolling and rolling center of pressure values
df['Moving Average'] = df['Pressure'].rolling(window=5).mean()
df['Rolling Average Center']= df['Pressure'].rolling(window=5, center=True).mean()
#sets threshold for machine being on or off, if rolling center average is greater than 115 psi, machine is considered on
df['Machine On/Off'] = ['1' if x >= 115 else '0' for x in df['Rolling Average Center'] ]
df
The following DF is created:
Throughout the rows in column "Machine On/Off" there will be values of 1 or 0 based on the threshold I set. I need to write code that will go through these rows and indicate whether a cycle has started. The problem is that the data is slightly noisy: during an "on" cycle there will be around 20 rows saying 1, with a couple of rows saying 0 due to poor data received.
I need code that compares the values through the data in order to determine the number of cycles the machine is on or off for. I was thinking that setting a threshold would work, so that if the value is 1 for more than 6 rows it indicates a cycle, and the incorrect 0s scattered throughout the column are ignored.
What would be the best way to program this so I can get a total count of the cycles the machine is on or off for throughout the 20,000 rows of data I have?
Edit: Here is an example DataFrame that is similar. In this example there are 3 cycles of the machine (1 values), and mixed into the on cycles are 0 values (bad data). I need code that will count the total number of cycles and ignore the bad data that may be in the middle of an "on" cycle.
import pandas as pd
Machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df2 = pd.DataFrame(Machine)
You can create groups of consecutive rows of on/off using cumsum:
machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df = pd.DataFrame(machine, columns=['Machine On/Off'])
df['group'] = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
df['group_size'] = df.groupby('group')['group'].transform('size')
# Output
    Machine On/Off  group  group_size
0                0      1           6
1                0      1           6
2                0      1           6
3                0      1           6
4                0      1           6
5                0      1           6
6                1      2           5
7                1      2           5
8                1      2           5
9                1      2           5
10               1      2           5
I'm not sure I got your intention on how you would like to filter/alter the values, but probably this can serve as a guide:
threshold = 6
# Replace 0 with 1 where the run of zeros is shorter than threshold. After this the group and group_size columns are stale.
df.loc[(df['Machine On/Off'].eq(0)) & (df.group_size.lt(threshold)), 'Machine On/Off'] = 1
# Output df['Machine On/Off'].values
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
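To get from the smoothed column to an actual count of cycles, one option is to count the 0-to-1 transitions. The following is a minimal sketch under that assumption; the names state, run_id, run_size, smoothed and n_cycles are illustrative and do not appear in the code above.

```python
import pandas as pd

machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,
           0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
state = pd.Series(machine)

# Label each run of consecutive equal values and measure its length
run_id = state.ne(state.shift()).cumsum()
run_size = state.groupby(run_id).transform('size')

# Treat short runs of 0 as bad data and overwrite them with 1
threshold = 6
smoothed = state.mask(state.eq(0) & run_size.lt(threshold), 1)

# An "on" cycle starts wherever a 1 follows a 0 (or opens the series)
starts = smoothed.eq(1) & smoothed.shift(fill_value=0).eq(0)
n_cycles = int(starts.sum())
print(n_cycles)  # 3 for the example data
```

The same comparison with the roles of 0 and 1 swapped would count the "off" periods instead.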
I am collecting heart rate values over the course of time. Each subject varies in the length of time that data was collected. I would like to make a table of the last 2 seconds of collected data.
import pandas as pd
import numpy as np
#example data
example_s = [["4/20/21 4:20", 302, 0, 0, 1, 2, 3],
["2/17/21 9:20", 135, 1, 1.4, 8, 10, np.nan, np.nan],
["2/17/21 9:20", 111, 5, 5, 1, np.nan, np.nan, np.nan, np.nan]]
example_s_table = pd.DataFrame(example_s,columns=['Date_Time','CID', 0, 1, 2, 3, 4, 5, 6])
desired_outcome = [["4/20/21 4:20",302,1, 2, 3],
["2/17/21 9:20",135, 1.4, 8, 10 ],
["2/17/21 9:20",111, 5, 5,1 ]]
desired_outcome_table = pd.DataFrame(desired_outcome,columns=['Date_Time','CID', "Second 1", "Second 2", "Second 3"])
I can see how to collect a single instance of the data from the example shown here, but would like to know how to quickly add multiple values to my table:
desired_outcome_table["Last Second"]=example_s_table.iloc[:,1:].ffill(axis=1).iloc[:, -1]
Python Dataframe Get Value of Last Non Null Column for Each Row
Try:
df = example_s_table.copy()
df = df.set_index(['Date_Time', 'CID'])
df_out = df.mask(df.eq(0))\
.apply(lambda x: pd.Series(x.dropna().tail(3).values), axis=1)\
.rename(columns = lambda x: f'Second {x+1}')
df_out['Last Second'] = df_out['Second 3']
print(df_out.reset_index())
Output:
Date_Time CID Second 1 Second 2 Second 3 Last Second
0 4/20/21 4:20 302 1.0 2.0 3.0 3.0
1 2/17/21 9:20 135 1.4 8.0 10.0 10.0
2 2/17/21 9:20 111 5.0 5.0 1.0 1.0
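If some subjects might have fewer than three readings, a variant that pads on the left can keep the most recent value in the last column. This is only a sketch: the helper last_n_valid and its parameter n are hypothetical names, and the example rows are padded with NaN to equal length for clarity.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame(
    [["4/20/21 4:20", 302, 0, 0, 1, 2, 3, np.nan, np.nan],
     ["2/17/21 9:20", 135, 1, 1.4, 8, 10, np.nan, np.nan, np.nan],
     ["2/17/21 9:20", 111, 5, 5, 1, np.nan, np.nan, np.nan, np.nan]],
    columns=['Date_Time', 'CID', 0, 1, 2, 3, 4, 5, 6])

def last_n_valid(row, n=3):
    """Return the last n non-null readings of a row, left-padded with NaN."""
    vals = row.dropna().to_numpy()[-n:]
    padded = np.concatenate([np.full(n - len(vals), np.nan), vals])
    return pd.Series(padded, index=[f'Second {i + 1}' for i in range(n)])

readings = example.set_index(['Date_Time', 'CID'])
out = readings.apply(last_n_valid, axis=1).reset_index()
print(out)
```

Changing n gives a wider or narrower window without touching the rest of the code.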
I have the below data table
A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({'A':A, 'B':B})
I'd like to calculate the average of column A when consecutive rows see column B equal to 1. All rows where column B equal to 0 are neglected and subsequently create a new dataframe like below:
Thanks for your help!
Keywords: groupby, shift, mean
Code:
df_result=df.groupby((df['B'].shift(1,fill_value=0)!= df['B']).cumsum()).mean()
df_result=df_result[df_result['B']!=0]
df_result
A B
1 2.0 1.0
3 3.0 1.0
As you might have noticed, you first need to determine the blocks of consecutive rows that have the same value.
One way to do so is by shifting B one row and then comparing it with itself.
df['B_shifted']=df['B'].shift(1,fill_value=0) # fill_value=0 to return int and replace Nan with 0's
df['A']                      = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
df['B']                      = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df['B_shifted']              = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
(df['B_shifted'] != df['B']) = [F, T, F, F, T, F, T, F, F, T, F]
                                   ↑        ↑     ↑        ↑
Now we can use the groupby pandas method as follows:
df_grouped=df.groupby((df['B_shifted'] != df['B']).cumsum())
If we loop over the DataFrameGroupBy object df_grouped,
we see the following tuples:
(0, A B B_shifted
0 2 0 0)
(1, A B B_shifted
1 3 1 0
2 1 1 1
3 2 1 1)
(2, A B B_shifted
4 4 0 1
5 1 0 0)
(3, A B B_shifted
6 5 1 0
7 3 1 1
8 1 1 1)
(4, A B B_shifted
9 7 0 1
10 5 0 0)
Now we can simply calculate the mean and filter out the groups where B is zero, as follows:
df_result=df_grouped.mean()
df_result=df_result[df_result['B']!=0][['A','B']]
Try:
m = (df.B != df.B.shift(1)).cumsum() * df.B
df_out = df.groupby(m[m > 0])["A"].mean().reset_index(drop=True).to_frame()
df_out["B"] = 1
print(df_out)
Prints:
A B
0 2.0 1
1 3.0 1
df1 = df.groupby((df['B'].shift() != df['B']).cumsum()).mean().reset_index(drop=True)
df1 = df1[df1['B'] == 1].astype(int).reset_index(drop=True)
df1
Output
A B
0 2 1
1 3 1
Explanation
We check whether each row's value of B differs from the previous row's value using Series.shift; the cumulative sum of those changes labels each run of consecutive equal values. We then group on those labels, calculate each group's mean, and assign the result to the new DataFrame df1.
Since that gives the mean for every run of consecutive 0s and of consecutive 1s, we then keep only the rows where B == 1.
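The same idea can also be written by filtering first and grouping afterwards, which avoids computing means for the B == 0 runs at all. This is a sketch rather than a replacement for the answers above; the names block, on and means are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5],
                   'B': [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]})

# A new block starts wherever B differs from the previous row
block = df['B'].ne(df['B'].shift()).cumsum()

# Keep only the rows where B == 1, then average A within each block
on = df['B'].eq(1)
means = df.loc[on, 'A'].groupby(block[on]).mean()

result = means.reset_index(drop=True).to_frame()
result['B'] = 1
print(result)
```

For the example data the two runs of 1s average to 2.0 and 3.0.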
A bit stumped here and hoping the collective can assist!
Given the following DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'machine': ['A','B','C','D','E'],
'test1': [1, 1, 0, np.nan, np.nan],
'test2': [0, 0, 1, 1, np.nan],
'test3': [1, 0, 1, np.nan, 0],
'test4': [1, 1, np.nan, 1, 1],
'test5': [1, 1, np.nan, 0, 0]
})
Imagine a 1 is a pass and a 0 is a fail; NaN means the machine was untested.
I would like to append two new columns to the end:
First - Maximum consecutive "1" values found, ignoring NaNs (NaN != 0; they are just ignored and allow consecutive "1" values to continue through them).
Expected result:
max-cons-pass
3
2
2
2 (note how this ignores the NaN in-between the 1's)
1
Second - I would like the current number of consecutive "1" values starting from the last column (test5 in this case) and going backwards, again ignoring NaNs.
Expected Results:
cur-cons-pass
3
2
2 (note how this ignores the NaNs in test4 and test5)
0
0
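One possible way to compute both columns is a row-wise pass that drops the NaNs and tracks the running streak of 1s. This is only a sketch under that reading of the question; the helper name streaks and the use of filter(like='test') to select the test columns are assumptions, not something from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'machine': ['A', 'B', 'C', 'D', 'E'],
    'test1': [1, 1, 0, np.nan, np.nan],
    'test2': [0, 0, 1, 1, np.nan],
    'test3': [1, 0, 1, np.nan, 0],
    'test4': [1, 1, np.nan, 1, 1],
    'test5': [1, 1, np.nan, 0, 0],
})

def streaks(row):
    """Longest and trailing run of 1s in a row, skipping NaN entries."""
    best = run = 0
    for v in row.dropna().astype(int).tolist():
        run = run + 1 if v == 1 else 0   # a 0 resets the streak
        best = max(best, run)
    # after the loop, run is the streak ending at the last tested column
    return pd.Series({'max-cons-pass': best, 'cur-cons-pass': run})

tests = df.filter(like='test')
out = df.join(tests.apply(streaks, axis=1))
print(out[['machine', 'max-cons-pass', 'cur-cons-pass']])
```

On the example frame this reproduces the expected values listed above. A row-wise apply is simple but not fast, so a vectorized approach may be preferable for very large frames.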
I'm trying to get the count (where all row values are 1) of each possible combination between the eight columns of a dataframe. Basically I need to understand how many times different overlaps exist.
I've tried to use itertools.product to get all the combinations, but it doesn't seem to work.
import pandas as pd
import numpy as np
import itertools
df = pd.read_excel('filename.xlsx')
df.head(15)
a b c d e f g h
0 1 0 0 0 0 1 0 0
1 1 0 0 0 0 0 0 0
2 1 0 1 1 1 1 1 1
3 1 0 1 1 0 1 1 1
4 1 0 0 0 0 0 0 0
5 0 1 0 0 1 1 1 1
6 1 1 0 0 1 1 1 1
7 1 1 1 1 1 1 1 1
8 1 1 0 0 1 1 0 0
9 1 1 1 0 1 0 1 0
10 1 1 1 0 1 1 0 0
11 1 0 0 0 0 1 0 0
12 1 1 1 1 1 1 1 1
13 1 1 1 1 1 1 1 1
14 0 1 1 1 1 1 1 0
print(list(itertools.product(new_df.columns)))
The expected output would be a dataframe with the count (n) of rows for each valid combination (where the values in the row are all 1).
For example:
a b
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 0 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 0
12 1 1
13 1 1
14 0 1
Would give
combination count
a 13
a_b 7
b 9
Note that the output would need to contain all the possible combinations of columns a through h, not just the pairwise ones.
Powerset Combinations
Use the powerset recipe from the itertools documentation with:
s = pd.Series({
'_'.join(c): df[c].min(axis=1).sum()
for c in map(list, filter(None, powerset(df)))
})
a 13
b 9
c 8
d 6
e 10
f 12
g 9
h 7
a_b 7
...
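For reference, here is a self-contained sketch of this approach on the two-column example from the question. The powerset helper is the recipe from the itertools documentation; the frame named small simply retypes columns a and b from the question and is not part of the original answer.

```python
from itertools import chain, combinations

import pandas as pd

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

small = pd.DataFrame({
    'a': [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    'b': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
})

# The row-wise minimum is 1 only when every selected column is 1,
# so summing it counts the rows that match the whole combination.
s = pd.Series({
    '_'.join(c): int(small[list(c)].min(axis=1).sum())
    for c in powerset(small.columns) if c   # skip the empty combination
})
print(s)
```

With n columns the powerset has 2**n - 1 non-empty members, so this loop grows quickly as columns are added.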
Pairwise Combinations
This is a special case, and can be vectorized. You can use dot, then extract the upper triangular matrix values:
from itertools import combinations
u = df.T.dot(df)
pd.DataFrame({
'combination': [*map('_'.join, combinations(df, 2))],
# pandas < 0.24
# 'count': u.values[np.triu_indices_from(u, k=1)]
# pandas >= 0.24
'count': u.to_numpy()[np.triu_indices_from(u, k=1)]
})
Output:
combination count
0 a_b 7
1 a_c 7
2 a_d 5
3 a_e 8
4 a_f 10
5 a_g 7
6 a_h 6
7 b_c 6
8 b_d 4
9 b_e 9
As you happen to have 8 columns, np.packbits together with
np.bincount is rather convenient here:
import numpy as np
import pandas as pd
# make large example
ncol, nrow = 8, 1_000_000
df = pd.DataFrame(np.random.randint(0,2,(nrow,ncol)), columns=list("abcdefgh"))
from time import time
T = [time()]
# encode as binary numbers and count
counts = np.bincount(np.packbits(df.values.astype(np.uint8)),None,256)
# find sets in other sets
rng = np.arange(256, dtype=np.uint8)
contained = (rng & rng[:, None]) == rng[:, None]
# and sum
ccounts = (counts * contained).sum(1)
# if there are empty bins, remove them
nz = np.where(ccounts)[0].astype(np.uint8)
# helper to build bin labels
a2h = np.array(list("abcdefgh"))
# put labels to counts
result = pd.Series(ccounts[nz], index = ["_".join((*a2h[np.unpackbits(i).view(bool)],)) for i in nz])
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
T.append(time())
s = pd.Series({
'_'.join(c): df[c].min(axis=1).sum()
for c in map(list, filter(None, powerset(df)))
})
T.append(time())
print("packbits {:.3f} powerset {:.3f}".format(*np.diff(T)))
print("results equal", (result.sort_index()[1:]==s.sort_index()).all())
This gives the same result as the powerset approach, but in this run it was more than 1000x faster:
packbits 0.016 powerset 21.974
results equal True
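To see why the bit tricks work, it may help to look at a few rows by hand. The sketch below is illustrative only (the three rows and the name mask_a_f are made up for the example). Each row of eight 0/1 flags becomes one byte, with column a as the most significant bit, and a combination is contained in a row when the bitwise AND leaves the combination's byte unchanged.

```python
import numpy as np

# Three rows of eight flags, columns a..h from left to right
rows = np.array([[1, 0, 0, 0, 0, 1, 0, 0],
                 [1, 1, 0, 0, 1, 1, 1, 1],
                 [0, 1, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

# Pack each row into a single byte (first column = most significant bit)
codes = np.packbits(rows, axis=1).ravel()
print(codes)  # [132 207  79]

# The combination a_f has bits a and f set: 0b10000100 == 132
mask_a_f = 0b10000100

# A row contains the combination if AND-ing keeps every bit of the mask
contains = (codes & mask_a_f) == mask_a_f
print(contains)  # [ True  True False]
```

The full answer applies the same test to all 256 possible masks at once by broadcasting, after counting how often each byte value occurs with np.bincount.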
If you have just values of 1 and 0, you could do:
df= pd.DataFrame({
'a': [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
'b': [1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
'c': [1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1],
'd': [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
})
(df.a * df.b).sum()
This results in 17.
To get all combinations you can use combinations from itertools:
from itertools import combinations
# single columns first, then every pair of columns
analyze=[(col,) for col in df.columns]
analyze.extend(combinations(df.columns, 2))
for cols in analyze:
    num_ser= None
    for col in cols:
        if num_ser is None:
            num_ser= df[col]
        else:
            # multiply out of place; num_ser *= df[col] could overwrite the column in df
            num_ser= num_ser * df[col]
    num= num_ser.sum()
    print(f'{cols} contains {num}')
This results in:
('a',) contains 26
('b',) contains 30
('c',) contains 23
('d',) contains 23
('a', 'b') contains 17
('a', 'c') contains 11
('a', 'd') contains 12
('b', 'c') contains 13
('b', 'd') contains 14
('c', 'd') contains 11
A co-occurrence matrix is all you need:
Let's construct an example first:
import numpy as np
import pandas as pd
mat = np.zeros((5,5))
mat[0,0] = 1
mat[0,1] = 1
mat[1,0] = 1
mat[2,1] = 1
mat[3,3] = 1
mat[3,4] = 1
mat[2,4] = 1
cols = ['a','b','c','d','e']
df = pd.DataFrame(mat,columns=cols)
print(df)
a b c d e
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0 1.0
4 0.0 0.0 0.0 0.0 0.0
Now we construct the co-occurrence matrix:
# construct the co-occurrence matrix:
co_df = df.T.dot(df)
print(co_df)
a b c d e
a 2.0 1.0 0.0 0.0 0.0
b 1.0 2.0 0.0 0.0 1.0
c 0.0 0.0 0.0 0.0 0.0
d 0.0 0.0 0.0 1.0 1.0
e 0.0 1.0 0.0 1.0 2.0
Finally the result you need:
result = {}
for c1 in cols:
    for c2 in cols:
        if c1 == c2:
            if c1 not in result:
                result[c1] = co_df[c1][c2]
        else:
            if '_'.join([c1,c2]) not in result:
                result['_'.join([c1,c2])] = co_df[c1][c2]
print(result)
{'a': 2.0, 'a_b': 1.0, 'a_c': 0.0, 'a_d': 0.0, 'a_e': 0.0, 'b_a': 1.0, 'b': 2.0, 'b_c': 0.0, 'b_d': 0.0, 'b_e': 1.0, 'c_a': 0.0, 'c_b': 0.0, 'c': 0.0, 'c_d': 0.0, 'c_e': 0.0, 'd_a': 0.0, 'd_b': 0.0, 'd_c': 0.0, 'd': 1.0, 'd_e': 1.0, 'e_a': 0.0, 'e_b': 1.0, 'e_c': 0.0, 'e_d': 1.0, 'e': 2.0}
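Because the co-occurrence matrix is symmetric, the dictionary above lists every pair twice (a_b and b_a). If only unordered pairs are wanted, itertools.combinations can drive the lookup instead. This is a sketch of that variation on the same example data, not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Same example matrix as above
mat = np.zeros((5, 5))
for r, c in [(0, 0), (0, 1), (1, 0), (2, 1), (3, 3), (3, 4), (2, 4)]:
    mat[r, c] = 1
cols = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(mat, columns=cols)

co_df = df.T.dot(df)

# Diagonal entries are the single-column counts
result = {c: co_df.loc[c, c] for c in cols}
# Each unordered pair appears once, read from the upper triangle
result.update({f'{c1}_{c2}': co_df.loc[c1, c2]
               for c1, c2 in combinations(cols, 2)})
print(result)
```

This yields 5 single-column entries plus 10 pairs. Like the matrix itself, it covers only single columns and pairs, not larger combinations.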