Check several conditions for all values in a column - python

I have just started using Python and pandas. I have searched Google and Stack Overflow for an answer to my question but haven't been able to find one.
This is what I need to do:
I have a df with several data rows per person (id) and a variable called response_go, which can be coded 1 or 0 (type int64), such as this one (just way bigger with 480 rows per person...)
   ID  response_go
0   1            1
1   1            0
2   1            0
3   1            1
4   2            1
5   2            0
6   2            1
7   2            1
Now, I want to check for each ID/ person whether the entries in response_go separately are all coded 0, all coded 1, or neither (the else condition). So far, I have come up with this:
ids = df['ID'].unique()
for id in ids:
    if (df.response_go.all() == 1):
        print "ID:", id, ": 100% Go"
    elif (df.response_go.all() == 0):
        print "ID:", id, ": 100% NoGo"
    else:
        print "ID:", id, ": Mixed Response Pattern"
However, it gives me the following output:
ID: 1 : 100% NoGo
ID: 2 : 100% NoGo
ID: 2 : Mixed Response Pattern
when it should be (as both ones and zeros are present for each ID):
ID: 1 : Mixed Response Pattern
ID: 2 : Mixed Response Pattern
I am sorry if this question has been asked before, but when searching for an answer I found nothing that solves this issue; if it has been answered, please point me to the solution. Thank you, everyone! Really appreciate it!

Sample (with different data) -
df = pd.DataFrame({'ID' : [1] * 3 + [2] * 3 + [3] * 3,
'response_go' : [0, 0, 0, 1, 1, 1, 0, 1, 0]})
df
   ID  response_go
0   1            0
1   1            0
2   1            0
3   2            1
4   2            1
5   2            1
6   3            0
7   3            1
8   3            0
Use groupby + mean -
v = df.groupby('ID').response_go.mean()
v
ID
1    0.000000
2    1.000000
3    0.333333
Name: response_go, dtype: float64
Use np.select to compute your statuses based on the mean of response_go -
import numpy as np

u = np.select([v == 1, v == 0, v < 1], ['100% Go', '100% NoGo', 'Mixed Response Pattern'])
Or, use a nested np.where to do the same thing (slightly faster) -
u = np.where(v == 1, '100% Go', np.where(v == 0, '100% NoGo', 'Mixed Response Pattern'))
Now, assign the result back -
v[:] = u
v
ID
1                 100% NoGo
2                   100% Go
3    Mixed Response Pattern
Name: response_go, dtype: object
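For completeness, the loop itself can also be fixed: the bug is that df.response_go.all() looks at the whole column rather than the current ID's rows, and since .all() returns a boolean, the == 0 comparison matches whenever any 0 is present anywhere in the column. A minimal sketch of a corrected loop (Python 3 print syntax):

for id in df['ID'].unique():
    sub = df.loc[df['ID'] == id, 'response_go']   # only this ID's rows
    if sub.eq(1).all():
        print("ID:", id, ": 100% Go")
    elif sub.eq(0).all():
        print("ID:", id, ": 100% NoGo")
    else:
        print("ID:", id, ": Mixed Response Pattern")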

Related

Iterate through dataframe to find first row that satisfies condition for each group of id

I have panel data with id and time. For each id, after start = 1, I want to identify the first time that satisfies the rule: having "number in" greater than a previous "number out".
With the example data below, the expected result is: for id=1, time=5; and for id=2, time=3. The explanation is as follows.
For id=1, start=1 occurs at time=1. Tracking from time=1, time=5 is what I need, as it is the first row whose "number in" = 4 is higher than a prior "number out" = 1, which occurred at time=3 (after start=1).
Similarly, for id=2, time=3 is the first that satisfies the rule.
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'time': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                   'start': [1, 0, 0, 0, 0, 1, 0, 0, 0],
                   'number out': [2, 99, 1, 13, 9, 10, 2, 8, 8],
                   'number in': [2, 9, 1, 0, 4, 1, 5, 7, 8]})
df
   id  time  start  number out  number in
0   1     1      1           2          2
1   1     2      0          99          9
2   1     3      0           1          1
3   1     4      0          13          0
4   1     5      0           9          4
5   2     1      1          10          1
6   2     2      0           2          5
7   2     3      0           8          7
8   2     4      0           8          8
The data is grouped by id and a custom function is applied to each group. ind is the first index after the row where start=1; ind_in is the index at which to start searching 'number in'.
A check guards against there being no data, so that an error does not occur. If you are sure of your data, you can remove this line:
if ind_in[0] > 0 and ind_in[0] <= x.index[-1]:
Next, the list comprehension aaa compares each 'number in' element against the array of 'number out' values; an index is kept if at least one comparison holds, and the first kept index is used to select the time.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'time': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                   'start': [1, 0, 0, 0, 0, 1, 0, 0, 0],
                   'number out': [2, 99, 1, 13, 9, 10, 2, 8, 8],
                   'number in': [2, 9, 1, 0, 4, 1, 5, 7, 8]})
print(df)

def my_func(x):
    # first index after the row where start == 1
    ind = x[x['start'] == 1].index + 1
    # index from which to start searching 'number in'
    ind_in = ind + 1
    if ind_in[0] > 0 and ind_in[0] <= x.index[-1]:
        number_out = x.loc[ind[0]:x.index[-2], 'number out']
        aaa = [i for i in range(ind_in[0], x.index[-1] + 1)
               if (x.loc[i, 'number in'] > number_out.values).any()]
        return x.loc[aaa[0], 'time']

print(df.groupby('id').apply(my_func))
Output
id
1    5
2    3
dtype: int64
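An alternative (not from the answer above) is to compare each 'number in' against the running minimum of the earlier 'number out' values, which matches the stated rule directly. A sketch, assuming exactly one start = 1 row per id; first_time is a hypothetical helper name:

def first_time(g):
    g = g.reset_index(drop=True)
    s = g.index[g['start'] == 1][0]      # row where start == 1
    after = g.loc[s + 1:]                # rows after the start row
    # smallest 'number out' so far, shifted so only strictly prior rows count
    prior_min = after['number out'].cummin().shift()
    hit = after.loc[after['number in'] > prior_min, 'time']
    return hit.iloc[0] if not hit.empty else None

print(df.groupby('id').apply(first_time))   # id 1 -> 5, id 2 -> 3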

Create sequential event id for groups of consecutive ones

I have a df like so:
Period  Count
     1      1
     2      0
     3      1
     4      1
     5      0
     6      0
     7      1
     8      1
     9      1
    10      0
and I want to return an 'Event ID' in a new column where there are two or more consecutive occurrences of 1 in Count, and a 0 where there are not. So each row in the new column gets its label based on this criterion being met in the Count column. My desired output would then be:
Period  Count  Event_ID
     1      1         0
     2      0         0
     3      1         1
     4      1         1
     5      0         0
     6      0         0
     7      1         2
     8      1         2
     9      1         2
    10      0         0
I have researched and found solutions that flag consecutive groups of similar numbers (e.g., 1), but I haven't come across what I need yet. I would also like this method to work for any number of consecutive occurrences, not just 2; for example, sometimes I need to count 10 consecutive occurrences, and I just use 2 in the example here.
This will do the job:
import itertools
import operator

ones = df.groupby('Count').groups[1].tolist()
# indices of the rows containing a '1': [0, 2, 3, 6, 7, 8]
event_id = [0] * len(df.index)
# a list of length 10 for Event_ID, initially all '0'
# group consecutive indices in the list of ones (yields [0], [2, 3] and [6, 7, 8]):
event_counter = 0
for k, g in itertools.groupby(enumerate(ones), lambda ix: ix[0] - ix[1]):
    sublist = list(map(operator.itemgetter(1), g))
    if len(sublist) > 1:            # only runs of two or more 1s qualify
        event_counter += 1          # assign sequential event ids
        for i in sublist:
            event_id[i] = event_counter
# event_id is now [0, 0, 1, 1, 0, 0, 2, 2, 2, 0]
df['Event_ID'] = event_id
The for loop is adapted from this example (using itertools, other approaches are possible too).
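For a fully vectorized alternative (my own sketch, not part of the answer above), the shift/cumsum idiom labels each run of equal values, and the qualifying runs are then numbered sequentially; min_len is the configurable run length:

import pandas as pd

df = pd.DataFrame({'Period': range(1, 11),
                   'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})

min_len = 2  # minimum run length that counts as an event

run_id = (df['Count'] != df['Count'].shift()).cumsum()    # label consecutive runs
run_size = df.groupby(run_id)['Count'].transform('size')  # length of each run

# a row belongs to an event if it is a 1 inside a long-enough run
is_event = (df['Count'] == 1) & (run_size >= min_len)

# number the qualifying runs sequentially
event_runs = run_id[is_event]
df['Event_ID'] = 0
df.loc[is_event, 'Event_ID'] = event_runs.ne(event_runs.shift()).cumsum()
# Event_ID is now [0, 0, 1, 1, 0, 0, 2, 2, 2, 0]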

Creating a list based on column conditions

I have a DataFrame df
>>> df
   LED  CFL  Incan  Hall  Reading
0    3    2      1   100      150
1    2    3      1   150      100
2    0    1      3   200      150
3    1    2      4   300      250
4    3    3      1   170      100
I want to create two more column which contain lists, one for "Hall" and another for "Reading"
>>> df_output
   LED  CFL  Incan  Hall  Reading  Hall_List  Reading_List
0    3    2      1   100      150  [0, 2, 0]     [2, 0, 0]
1    2    3      1   150      100  [0, 3, 0]     [2, 0, 0]
2    0    1      3   200      150  [0, 1, 0]     [0, 0, 2]
3    1    2      4   300      250  [0, 2, 0]     [1, 0, 0]
4    3    3      1   100      100  [0, 2, 0]     [2, 0, 0]
Each value within the list is populated as follows:
cfl_rating = 50
led_rating = 100
incan_rating = 25
For the Hall_List:
The preference is CFL > LED > Incan. And only one of them will be used (either CFL or LED or Incan).
We first check if CFL != 0; if True, then we calculate min(ceil(Hall/cfl_rating), CFL). For index=0 this evaluates to 2, hence we have [0,2,0], whereas for index=2 we have [0,1,0].
Similarly for Reading_List, the preference is LED > Incan > CFL.
For index=2, we have LED == 0, so we calculate min(ceil(Reading/incan_rating), Incan), and hence Reading_List is [0,0,2].
My question is:
Is there a "pandas/pythony-way" of doing this? I am currently iterating through each row, and using if-elif-else conditions to assign values.
My code snippet looks like this:
# Hall_List
for i in range(df.shape[0]):
    Hall = []
    if (df['CFL'].iloc[i] != 0):
        Hall.append(0)
        Hall.append(min((math.ceil(df['Hall'].iloc[i] / cfl_rating)), df['CFL'].iloc[i]))
        Hall.append(0)
    elif (df['LED'].iloc[i] != 0):
        Hall.append(min((math.ceil(df['Hall'].iloc[i] / led_rating)), df['LED'].iloc[i]))
        Hall.append(0)
        Hall.append(0)
    else:
        Hall.append(0)
        Hall.append(0)
        Hall.append(min((math.ceil(df['Hall'].iloc[i] / incan_rating)), df['Incan'].iloc[i]))
    df['Hall_List'].iloc[i] = Hall
This is really slow and definitely feels like a bad way to code this.
I shortened your formula for simplicity's sake, but you should use df.apply(axis=1).
This passes every row to the function as an array-like; you can then apply whatever function you want, such as:
import pandas as pd

df = pd.DataFrame([[3, 2, 1, 100, 150], [2, 3, 1, 150, 100]],
                  columns=['LED', 'CFL', 'Incan', 'Hall', 'Reading'])

def create_list(ndarray):
    # simplified stand-in for the real formula: use the CFL count if it is
    # nonzero, otherwise fall back to the Incan count
    if ndarray[1] != 0:
        result = [0, ndarray[1], 0]
    else:
        result = [ndarray[2], 0, 0]
    return result

df['Hall_List'] = df.apply(lambda x: create_list(x), axis=1)
Just change the function to whatever you like here.
In [49]: df
Out[49]:
   LED  CFL  Incan  Hall  Reading  Hall_List
0    3    2      1   100      150  [0, 2, 0]
1    2    3      1   150      100  [0, 3, 0]
Hope this helps.
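For completeness, here is a sketch (my own generalization, not the answerer's code) that keeps the df.apply(axis=1) approach but implements the full preference-and-rating rules from the question; bulbs_needed and its parameters are hypothetical names, and the data repeats the question's input table:

import math
import pandas as pd

cfl_rating, led_rating, incan_rating = 50, 100, 25

df = pd.DataFrame({'LED':     [3, 2, 0, 1, 3],
                   'CFL':     [2, 3, 1, 2, 3],
                   'Incan':   [1, 1, 3, 4, 1],
                   'Hall':    [100, 150, 200, 300, 170],
                   'Reading': [150, 100, 150, 250, 100]})

def bulbs_needed(row, load, preference):
    # preference: (column, rating) pairs in priority order; per the
    # question's rules, only the first nonzero bulb type is used
    counts = {'LED': 0, 'CFL': 0, 'Incan': 0}
    for col, rating in preference:
        if row[col] != 0:
            counts[col] = min(math.ceil(row[load] / rating), row[col])
            break
    return [counts['LED'], counts['CFL'], counts['Incan']]

df['Hall_List'] = df.apply(bulbs_needed, axis=1, load='Hall',
                           preference=[('CFL', cfl_rating), ('LED', led_rating),
                                       ('Incan', incan_rating)])
df['Reading_List'] = df.apply(bulbs_needed, axis=1, load='Reading',
                              preference=[('LED', led_rating), ('Incan', incan_rating),
                                          ('CFL', cfl_rating)])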

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
One fully vectorized solution is to use the shift-groupby-cumsum-cumcount combination¹ to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
¹ See the pandas cookbook, the section on grouping: "Grouping like Python's itertools.groupby".
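For readability, the one-liner can also be unpacked step by step (an equivalent sketch):

runs = (df.A != df.A.shift()).cumsum()   # label each run of equal values
pos = df.groupby(runs).cumcount()        # 0-based position within its run
df['B'] = ((pos <= 1) & (df.A == 1)).astype(int)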
Another way (checking whether the previous two values are 1; map is wrapped in list so it also works under Python 3):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})

In [444]: limit = 2

In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
import numpy

a = df['A'].to_numpy(copy=True)  # .as_matrix() was removed in newer pandas; copy so df isn't mutated
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    # zero out elements that extend a run of 1s beyond `cutoff`
    # (note: mutates `array` in place if it is already a numpy array)
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
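A quick usage sketch on the question's data (passing a copy, since trim_runs modifies numpy input in place):

import numpy
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'].to_numpy(copy=True), 2)
# df['B'] is now [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]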

PANDAS count on condition

I am trying to tabulate a change in condition using a 'groupby' but am stumped and would appreciate any guidance. I have a data frame as follows:
SUBJECT  TYPE
      1     1
      1     2
      1     2
      2     1
      2     1
      3     1
      3     3
      3     5
I would like to generate a statement that tabulates any positive change, ignores any negative change, and generates a count of change per subject. For example, the output of the above would be:
SUBJECT  TYPE
      1     1
      2     0
      3     2
Would I need to create an if/else clause using pandas, or is there a simpler way to achieve this by summing? Maybe something like...
def tabchange(type, subject):
    current_subject = subject[0]
    type_diff = type - type
    j = 1
    for i in range(1, len(type)):
        type_diff[i] = type[i] - type[i-j]
        if subject[i] == current_subject:
            if type_diff[i] > 0:
                new_row = 1
                j += 1
            else:
                j = 1
        else:
            new_row[i] = 0
            current_subject = subject[i]
    return new_row
import pandas as pd

df = pd.DataFrame({'SUBJECT': [1, 1, 1, 2, 2, 3, 3, 3],
                   'TYPE': [1, 2, 2, 1, 1, 1, 3, 5]})
grouped = df.groupby('SUBJECT')
df['TYPE'] = grouped['TYPE'].diff() > 0
result = grouped['TYPE'].agg('sum')
yields
SUBJECT
1    1.0
2    0.0
3    2.0
Name: TYPE, dtype: float64
Above, df is grouped by SUBJECT and the diff is taken of the TYPE column:
In [253]: grouped = df.groupby('SUBJECT'); df['TYPE'] = grouped['TYPE'].diff() > 0

In [254]: df
Out[254]:
   SUBJECT   TYPE
0        1  False
1        1   True
2        1  False
3        2  False
4        2  False
5        3  False
6        3   True
7        3   True
Then, again grouping by SUBJECT, the result is obtained by counting the number of Trues in the TYPE column:
In [255]: result = grouped['TYPE'].agg('sum'); result
Out[255]:
SUBJECT
1    1.0
2    0.0
3    2.0
Name: TYPE, dtype: float64
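As a side note, if you would rather not overwrite the TYPE column along the way, a non-destructive sketch of the same idea, starting again from the original df:

result = df.groupby('SUBJECT')['TYPE'].apply(lambda s: (s.diff() > 0).sum())
# SUBJECT
# 1    1
# 2    0
# 3    2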
