I have the following code, which I wrote to read data from a machine in CSV format:
import pandas as pd
import numpy as np
header_list = ['Time']
df = pd.read_csv('S8-1.csv' , skiprows=6 , names = header_list)
#splits the data into proper columns
df[['Date/Time','Pressure']] = df.Time.str.split(",,", expand=True)
#deletes original messy column
df.pop('Time')
#convert Pressure from object to numeric
df['Pressure'] = pd.to_numeric(df['Pressure'], errors = 'coerce')
#converts to a time
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format = '%m/%d/%y %H:%M:%S.%f' , errors = 'coerce')
#calculates rolling and rolling center of pressure values
df['Moving Average'] = df['Pressure'].rolling(window=5).mean()
df['Rolling Average Center']= df['Pressure'].rolling(window=5, center=True).mean()
#sets threshold for machine being on or off, if rolling center average is greater than 115 psi, machine is considered on
df['Machine On/Off'] = ['1' if x >= 115 else '0' for x in df['Rolling Average Center'] ]
df
The following DataFrame is created:
Throughout the rows in the "Machine On/Off" column there will be values of 1 or 0 based on the threshold I set. I need to write code that goes through these rows and indicates when a cycle has started. The problem is that the data is slightly off: during an "on" cycle there will be around 20 rows saying 1, with a couple of rows saying 0 because of poor data received.
I need code that compares the values across the data to determine the number of cycles the machine is on or off for. I was thinking that setting a threshold of around 6 rows would work, so that if the value is 1 for more than 6 rows it counts as a cycle and the incorrect 0s scattered throughout the column are ignored.
What would be the best way to program this so I can get a total count of the cycles the machine is on or off for across the 20,000 rows of data I have?
Edit: Here is an example DataFrame that is similar. In this example there are 3 machine cycles (runs of 1 values), and mixed into the on cycles are 0 values (bad data). I need code that counts the total number of cycles and ignores the bad data that may fall in the middle of an 'on' cycle.
import pandas as pd
Machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df2 = pd.DataFrame(Machine)
You can create groups of consecutive rows of on/off using cumsum:
machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df = pd.DataFrame(machine, columns=['Machine On/Off'])
df['group'] = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
df['group_size'] = df.groupby('group')['group'].transform('size')
# Output
Machine On/Off group group_size
0 0 1 6
1 0 1 6
2 0 1 6
3 0 1 6
4 0 1 6
5 0 1 6
6 1 2 5
7 1 2 5
8 1 2 5
9 1 2 5
10 1 2 5
I'm not sure I fully understood how you would like to filter/alter the values, but this can probably serve as a guide:
threshold = 6
# Replace 0 with 1 where group_size < threshold. Note this makes the existing group/group_size columns stale.
df.loc[(df['Machine On/Off'].eq(0)) & (df.group_size.lt(threshold)), 'Machine On/Off'] = 1
# Output df['Machine On/Off'].values
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
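To then get the total count of cycles from the cleaned column, a minimal sketch (assuming the df built above, after the replacement step) is to rebuild the run groups and count how many runs of each value remain:
# Rebuild the run groups on the cleaned column and count the runs of each value.
runs = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
run_values = df.groupby(runs)['Machine On/Off'].first()
on_cycles = int((run_values == 1).sum())
off_cycles = int((run_values == 0).sum())
print(on_cycles, off_cycles)  # 3 on cycles and 4 off cycles for the example array above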
So I asked a question related to this recently, and while the answer was simple then (I had failed to utilize a specific column), this time I don't have that column. Here is the OP. None of the extra answers provided there actually work either :/
The problem is with a multilabel DataFrame where you want to isolate rows that contain 1 for a given class and 0 for all others. Here is the code I have so far, but it loops forever and crashes Colab.
In this case I want just the Action row, but I'm also trying to loop it so I append all rows with Action equal to 1 and every column in column_list equal to 0, then History 1 with all others 0, and so on...
Again, the options provided in the link give me a 'The truth value of a Series is ambiguous' error.
Index | Drama | Western | Action | History
0     | 1     | 1       | 0      | 0
1     | 0     | 0       | 0      | 1
2     | 0     | 0       | 1      | 0
# Column list to be popped
column_list = list(balanced_df.columns)[1:]
single_labels = []
i = 0
# 28 columns total
while i < 27:
    # defining/resetting the full column list at the start of each loop
    column_list = list(balanced_df.iloc[:, 1:])
    # pop the column name at index i
    x = column_list.pop(i)
    # storing the results in a list of lists:
    # filter for rows where the popped column is 1 and the remaining columns are 0
    single_labels.append(balanced_df[(balanced_df[x] == 1) & (balanced_df[column_list] == 0)])
    # increment the column index number for the next run
    i += 1
The output here would be something like
single_labels[0]

Index | Drama | Western | Action | History
2     | 0     | 0       | 1      | 0

single_labels[1]

Index | Drama | Western | Action | History
1     | 0     | 0       | 0      | 1
You don't need a loop.
You rarely need loops with pandas.
If you're selecting rows based on conditions, you should use boolean indexing.
In your case, that's:
df.loc[df.sum(axis='columns').eq(1)]
As an example:
import pandas

pandas.DataFrame({
    'A': [1, 0, 0, 0, 0, 1, 1, 0, 0],
    'B': [0, 1, 0, 0, 1, 0, 1, 0, 0],
    'C': [0, 0, 1, 0, 1, 0, 0, 1, 0],
    'D': [0, 0, 0, 1, 0, 1, 0, 1, 0],
}).loc[lambda df: df.sum(axis='columns').eq(1)].values.tolist()
Which outputs:
[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
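If you also need the per-class lists from the question (e.g. only Action is 1 and every other label is 0), here is a sketch building on the same idea, assuming balanced_df is your frame and its label columns start at column 1 as in your loop:
# Keep only the rows that have exactly one positive label, then split them by class.
label_cols = list(balanced_df.columns)[1:]
one_hot_rows = balanced_df[balanced_df[label_cols].sum(axis='columns').eq(1)]
single_labels = [one_hot_rows[one_hot_rows[col].eq(1)] for col in label_cols]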
I would like to loop over the following check_matrix in such a way that the code recognizes whether the first and second elements are 1 and 1, or 1 and 2, etc. Then, for each separate class of pair, i.e. (1,1), (1,2) or (2,2), the code should store in new matrices the sum of the last element (which in this case has index 8) times exp(-i*q·(check_matrix[k][2:5] - check_matrix[k][5:8])), where i is the imaginary unit, k is the running index on check_matrix, and q is a vector defined as given below. There are 20 q vectors.
import numpy as np
q= []
for i in np.linspace(0, 10, 20):
    q.append(np.array((0, 0, i)))
q = np.array(q)
check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
                         [1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
                         [1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
                         [2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
                         [2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
                         [2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
This means in principle I will have 20 matrices of shape 2x2, one for each q vector.
At the moment my code gives only one matrix, which appears to be the last one, even though I am appending to Matrices. My code looks like this:
for i in range(2):
    i = i + 1
    for j in range(2):
        j = j + 1
        j_list = []
        Matrices = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                j_list.append(check_matrix[k][8] * np.exp(-1J * np.dot(q, (np.subtract(check_matrix[k][2:5], check_matrix[k][5:8])))))
        j_11 = np.sum(j_list)
        I_matrix[i-1][j-1] = j_11
        Matrices.append(I_matrix)
I_matrix is defined as below:
I_matrix= np.zeros((2,2),dtype=np.complex_)
At the moment I get following output.
Matrices = [array([[-0.66071446-0.77603624j, -0.29038112+2.34855023j], [-0.31387562-0.08116629j, 4.2788 +0.j ]])]
But I want a matrix for each q value, meaning there should be 20 matrices in total in this case, where each 2x2 matrix contains the sums for the (1,1), (1,2), (2,1) and (2,2) pairs arranged in the following manner:
array([[11., 12.],
[21., 22.]])
I would highly appreciate your suggestions on how to correct it. Thanks in advance!
I am pretty sure this problem can be solved in an easier way, and I am not 100% sure I understood you correctly, but here is some code that does what I think you want. If you have a way to check whether the results are valid, I suggest you do so.
import numpy as np
n = 20
q = np.zeros((20, 3))
q[:, -1] = np.linspace(0, 10, n)

check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
                         [1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
                         [1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
                         [2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
                         [2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
                         [2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
check_matrix[:, :2] -= 1  # python indexing is zero based

matrices = np.zeros((n, 2, 2), dtype=np.complex_)

for i in range(2):
    for j in range(2):
        k_list = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                k_list.append(check_matrix[k][8] *
                              np.exp(-1J * np.dot(q, check_matrix[k][2:5]
                                                  - check_matrix[k][5:8])))
        matrices[:, i, j] = np.sum(k_list, axis=0)
NOTE: I changed your indices to have consistent zero-based indexing.
Here is another approach where I replaced the k-loop with a vectorized version:
for i in range(2):
    for j in range(2):
        k = np.logical_and(check_matrix[:, 0] == i, check_matrix[:, 1] == j)
        temp = np.dot(check_matrix[k, 2:5] - check_matrix[k, 5:8], q[:, :, np.newaxis])[..., 0]
        temp = check_matrix[k, 8:] * np.exp(-1J * temp)
        matrices[:, i, j] = np.sum(temp, axis=0)
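A quick sanity check on either version, using the arrays defined above: matrices holds one 2x2 entry per q vector, and since q[0] is the zero vector, its exponential factor is 1, so matrices[0] reduces to the plain per-pair sums of the last column.
print(matrices.shape)  # (20, 2, 2): one 2x2 matrix per q vector
print(matrices[0])     # for q = (0, 0, 0) the exponential is 1, leaving plain sums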
3 line solution
You asked for an efficient solution in your original title, so how about this three-liner that avoids nested loops and if statements, and is thus hopefully faster?
fac=2*(check_matrix[:,0]-1)+(check_matrix[:,1]-1)
grp=np.split(check_matrix[:,8], np.cumsum(np.unique(fac,return_counts=True)[1])[:-1])
[np.sum(x) for x in grp]
output:
[-0.23872600000000002, 1.126557, 0.023742000000000003, 0.21394]
How does it work?
I combine the first two columns into a single index, treating each as "bits" (i.e. base 2)
fac=2*(check_matrix[:,0]-1)+(check_matrix[:,1]-1)
(If you have indices that exceed 2, you can still use this technique, but you will need a different base to combine the columns; i.e. if your indices go from 1 to 18, you would need to multiply column 0 by a number equal to or larger than 18 instead of 2.)
So the result of the first line is
array([0., 0., 1., 2., 2., 3.])
Note as well that it assumes the data is ordered, with one column changing fastest; if this is not the case, you will need an extra step to sort the index and the original check matrix. In your example the data is ordered.
The next step groups the data according to the index, and uses the solution posted here.
np.split(check_matrix[:,8], np.cumsum(np.unique(fac,return_counts=True)[1])[:-1])
[array([-0.243293, 0.004567]), array([1.126557]), array([ 0.038934, -0.015192]), array([0.21394])]
i.e. it outputs column 8 (the last column) of check_matrix grouped according to fac
Then the last line simply sums those. Knowing how the first two columns were combined into the single index lets you map the result back, for example as sketched below. Or you could simply append the result to check_matrix as an extra column if you wanted.
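A sketch of that mapping back, assuming the ordered example data above where all four pairs occur:
# Map the per-pair sums back onto a 2x2 layout using the combined index fac.
sums = np.array([np.sum(x) for x in grp])
pairs = np.unique(fac).astype(int)      # [0, 1, 2, 3] for the example data
result = np.zeros((2, 2))
result[pairs // 2, pairs % 2] = sums    # row = first column - 1, column = second column - 1
print(result)                           # [[-0.238726, 1.126557], [0.023742, 0.21394]]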
I have a DataFrame with over 75k rows and about 13 pre-existing columns. Now I want to create a new column based on an if statement, such that:
if a given row of a certain column has the same value as the next row, then the value in the new column for that row is 0; otherwise it is 1.
The if statement checks two equalities (the columns are tags_list and gateway_id).
The code snippet below is what I have tried:
for i in range(1, len(df_sort['date']) - 1):
    if (df_sort.iloc[i]['tags_list'] == df_sort.iloc[i+1]['tags_list']) & (df_sort.iloc[i]['gateway_id'] == df_sort[i+1]['gateway_id']):
        df_sort.iloc[i]['Transit'] = 0
    else:
        df_sort.iloc[i]['Transit'] = 1
I am getting a KeyError: 2 in this case.
PS: All of the columns have the same number of rows
if (df_sort.iloc[i]['tags_list'] == df_sort.iloc[i+1]['tags_list']) &
(df_sort.iloc[i]['gateway_id'] == df_sort.iloc[i+1]['gateway_id']):
df_sort[i+1]['gateway_id'] should be df_sort.iloc[i+1]['gateway_id']
Also, are you sure you want to iterate from 1 and not from 0?
There is numpy machinery for this, namely numpy.diff. Consider a DataFrame that already has some generic column 'x' populated.
In [48]: df['x'].values
Out[48]: array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3])
In [49]: df['x_diff'] = (np.diff(df['x'], prepend=0) != 0) * 1
In [50]: df['x_diff'].values
Out[50]: array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
If you need the zeros and ones flipped, just change != to ==.
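Applied to the original two-column check, here is a sketch along the same lines, assuming df_sort is already sorted the way you intend; note that the last row has no next row to compare against, so it gets 1 here:
# Transit is 0 when the next row has the same tags_list and gateway_id, otherwise 1.
same_tags = df_sort['tags_list'].eq(df_sort['tags_list'].shift(-1))
same_gateway = df_sort['gateway_id'].eq(df_sort['gateway_id'].shift(-1))
df_sort['Transit'] = (~(same_tags & same_gateway)).astype(int)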
I have a list that looks like this:
a = [0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0...]
How do I get the index of the first 1 in each zero/one block, so that the resulting indices are:
[8 23 ..] and so on
I've been using this code:
def find_one(a):
    for i in range(len(a)):
        if a[i] > 0:
            return i

print(find_one(a))
but it gives me only the first occurrence of 1. How can I implement it to iterate through the entire list?
Thank you!!
You can do it using zip and a list comprehension:
a = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
r = [i for n,(i,v) in zip([1]+a,enumerate(a)) if v > n]
print(r) # [8,23]
Since you tagged pandas, you can use groupby. If s = pd.Series(a), then
>>> x = s.groupby(s.diff().ne(0).cumsum()).head(1).astype(bool)
>>> x[x].index
Int64Index([8, 23], dtype='int64')
Without pandas:
b = a[1:]
[(num+1) for num,i in enumerate(zip(a,b)) if i == (0,1)]
# `state` is (prev_char, cur_char)
# where `prev_char` is the previous character seen
# and `cur_char` is the current character
#
# (0, 1) .... previous was "0", current is "1"
#             RECORD THE INDEX: a string of ones just began
#
# (0, 0) .... previous was "0", current is "0"
#             do NOT record the index
#
# (1, 1) .... previous was "1", current is "1"
#             we are in a string of ones, but not at the beginning of it
#             do NOT record the index
#
# (1, 0) .... previous was "1", current is "0"
#             a string of ones just ended; not the start of a string of ones
#             do NOT record the index
state_to_print_decision = dict()
state_to_print_decision[(0, 1)] = True
def find_one(a, state_to_print_decision):
    # pretend we just saw a bunch of zeros:
    # initialize state to (0, 0)
    state = (0, 0)
    indices = []
    for i in range(len(a)):
        # a[i] is the current character;
        # state[1] was the current character and now becomes the previous one
        state = (state[1], a[i])
        it_is_time_to_print = state_to_print_decision.get(state, False)
        if it_is_time_to_print:
            indices.append(i)
    return indices
a = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(find_one(a, state_to_print_decision))
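With those fixes it prints [8] for the short test list above, and [8, 23] for the full list from the question.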