np.where with two conditions and which is met first - python

I am trying to create a target variable based on two conditions. I have X values that are binary and X2 values that are also binary. My condition: whenever X changes from 1 to 0, we put a 1 in y, but only if that change is followed by a change from 0 to 1 in X2. If it is instead followed by another change from 0 to 1 in X, we don't make the change in the first place. I attached a picture from Excel.
I also did the following to account for the change in X:
import numpy as np
import pandas as pd

# this is the data frame
X = [1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2 = [0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df = pd.DataFrame()
df['X'] = X
df['X2'] = X2

df['X-prev'] = df['X'].shift(1)
# X-prev + X == 1 flags any transition (1 -> 0 or 0 -> 1), without saying which direction
df['Change-X'] = np.where(df['X-prev'] + df['X'] == 1, 1, 0)
However, this is not enough, as I need to know which change came first after the X change. I attached a picture of the example.
Thanks a lot for all the contributions.

Keep the rows that match your transitions (X=1, X+1=0) and (X2=1, X2-1=0), then merge all selected rows into one series where a value of 0 means 'start a cycle' and 1 means 'end a cycle'.
In this series you can still have consecutive starts or consecutive ends, so filter again to keep only complete (0, 1) cycles. After that, reindex the result by your original dataframe index and backfill with 1.
x1 = df['X'].sub(df['X'].shift(-1)).eq(1)   # X is 1 here and 0 on the next row (1 -> 0)
x2 = df['X2'].sub(df['X2'].shift(1)).eq(1)  # X2 is 1 here and was 0 on the previous row (0 -> 1)
sr1 = pd.Series(0, df.index[x1])            # 0 = cycle start
sr2 = pd.Series(1, df.index[x2])            # 1 = cycle end
sr = pd.concat([sr2, sr1]).sort_index()
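# For the sample data above, sr works out to (0 = start, 1 = end):
# index: 1   4   9   16  21  27
# value: 0   0   1   1   0   1
# The filter below keeps only the last start before each end and the first
# end after each start, i.e. indices 4, 9, 21 and 27.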
df['Y'] = sr[sr.lt(sr.shift(-1)) | sr.gt(sr.shift(1))] \
.reindex(df.index).bfill().fillna(0).astype(int)
>>> df
X X2 Y
0 1 0 0 # start here: (X=1, X+1=0) but never ended before another start
1 1 0 0
2 0 0 0
3 0 0 0
4 1 0 0 # start here: (X=1, X+1=0)
5 0 0 1 # <- fill with 1
6 0 0 1 # <- fill with 1
7 0 0 1 # <- fill with 1
8 0 0 1 # <- fill with 1
9 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
10 0 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 0 0
15 0 0 0
16 0 1 0 # end here: (X2=1, X2-1=0) but never started before
17 0 0 0
18 0 0 0
19 0 0 0
20 1 0 0
21 1 0 0 # start here: (X=1, X+1=0)
22 0 0 1 # <- fill with 1
23 0 0 1 # <- fill with 1
24 0 0 1 # <- fill with 1
25 0 0 1 # <- fill with 1
26 0 0 1 # <- fill with 1
27 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
28 0 1 0
29 0 1 0


Pd.crosstab missing data?

I am using pd.crosstab to count presence/absence data. In the first column I have several presence counts (represented by 1's); in the second column I have just one 'presence'. However, when I run crosstab on this data, that single presence in the second column isn't counted. Could anyone shed some light on why this is happening and what I'm doing wrong?
Python v. 3.8.5
Pandas v. 1.2.3
System: MacOS Monterey v. 12.5.1
Column1:
>>> mbx_final['Cmpd1640']
OV745_1A 0
OV745_1B 0
OV745_1C 1
OV745_1D 1
OV745_1E 0
OV745_4A 1
OV745_4B 1
OV745_4C 0
OV22_12A 1
OV22_12B 1
OV22_12C 1
OV22_12D 0
OV22_12E 0
OV22_12F 0
OV22_13A 0
OV22_13B 0
OV22_13C 0
OV86_6A 1
OV86_6D 1
OV86_6E 1
OV86_6F 1
OV86_6G 1
OV86_6H 1
OV86_6I 1
OV86_6J 1
OV86_6K 0
OV86_6L 1
OV86_8A 1
OV86_8B 1
OV86_8C 1
OB1B 1
OB1C 1
SK3A 0
SK3B 0
SK3C 0
SK7A 1
SK7B 0
Column2:
>>> mgx_final['Otu2409']
OV745_1A 0
OV745_1B 0
OV745_1C 0
OV745_1D 0
OV745_1E 0
OV745_4A 0
OV745_4B 0
OV745_4C 0
OV22_12A 0
OV22_12B 0
OV22_12C 0
OV22_12D 0
OV22_12E 0
OV22_12F 0
OV22_13A 0
OV22_13B 0
OV22_13C 0
OV86_6A 0
OV86_6D 0
OV86_6E 0
OV86_6F 0
OV86_6G 0
OV86_6H 0
OV86_6I 0
OV86_6J 0
OV86_6K 0
OV86_6L 0
OV86_8A 0
OV86_8B 0
OV86_8C 0
OB1A 1
OB1C 0
SK3A 0
SK3B 0
SK3C 0
SK7A 0
SK7B 0
Crosstab command:
contingency_tab = pd.crosstab(mbx_final['Cmpd1640'],mgx_final['Otu2409'],margins=True)
Results:
>>> contingency_tab
Otu2409 0 All
Cmpd1640
0 15 15
1 21 21
All 36 36
I would expect to see a result like this:
>>> contingency_tab
Otu2409 0 1 All
Cmpd1640
0 15 0 15
1 21 1 22
All 36 1 37
What am I doing wrong?
You can use the dropna parameter, which is by default set to True. Setting it to False will include columns whose entries are all NaN.
contingency_tab = pd.crosstab(mbx_final['Cmpd1640'],mgx_final['Otu2409'],margins=True, dropna=False)
You can read more on the official documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
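As a minimal sketch of what dropna controls (this mirrors the Categorical example in the pandas docs, not your data): categories that never occur are dropped by default, while dropna=False keeps them as all-zero rows and columns.
import pandas as pd

# 'c' and 'f' are declared categories that never appear in the data
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
print(pd.crosstab(foo, bar))                # 'c' and 'f' are dropped
print(pd.crosstab(foo, bar, dropna=False))  # 'c' and 'f' kept as zeros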
Edit 1:
I've replicated your dataset and code and run the following:
df_in = pd.read_excel("Book1.xlsx", index_col="index")
mbx_final = df_in[["Cmpd1640"]]
mgx_final = df_in[["Otu2409"]]
contingency_tab = pd.crosstab(mbx_final['Cmpd1640'], mgx_final['Otu2409'], margins=True)
display(contingency_tab)
And I get your expected output:
There might be something wrong with how you're displaying the crosstab function output.

Conditional sum of non zero values

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the running sum of consecutive non-zero values in column Fn.
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
How about using cumsum and resetting it when the value is 0:
cs = df['Fn'].ne(0).cumsum()
# subtract the running total as of the last zero (0 if no zero has been seen yet)
df['Fn2'] = cs - cs.where(df['Fn'].eq(0)).ffill().fillna(0).astype(int)
df
You can store the Fn column in a list, then create a new list and iterate over the stored column: if the previous running value is greater than zero, add it to the current value, otherwise restart from the current value. After this you can make a dataframe from the list and concat it column-wise to the existing dataframe, as in the sketch below.
fn = df['Fn'].tolist()
sum_list = [fn[0]]                     # first value copies Fn directly
for i in range(1, len(fn)):
    if fn[i] > 0:                      # extend the running streak
        sum_list.append(sum_list[i-1] + fn[i])
    else:                              # a zero resets the streak
        sum_list.append(0)
dfsum = pd.DataFrame({'Sum': sum_list})
df = pd.concat([df, dfsum], axis=1)
Hope this helps; you may need to adjust details for your data, but this is the idea.
try this:
sum_arr = [0]
for val in df['Fn']:
    if val > 0:
        sum_arr.append(sum_arr[-1] + 1)
    else:
        sum_arr.append(0)
df['sum'] = sum_arr[1:]
df

Checking for subset in a column?

I'm trying to flag some price data as "stale" if the quoted price of the security hasn't changed over, let's say, 3 trading days. I'm currently trying it with:
firm["dev"] = np.std(firm["Price"],firm["Price"].shift(1),firm["Price"].shift(2))
firm["flag"] == np.where(firm["dev"] = 0, 1, 0)
But I'm getting nowhere with it. This is what my dataframe would look like.
Index  Price  Flag
1      10     0
2      11     0
3      12     0
4      12     0
5      12     1
6      11     0
7      13     0
Any help is appreciated!
If you are okay with other conditions, you can first check whether series.diff() equals 0 and take the cumsum, then check whether the cumsum equals n-1. Also check whether the current row equals the previous one; when both conditions hold, assign a flag of 1, else 0.
n = 3
firm['Flag'] = (firm['Price'].diff().eq(0).cumsum().eq(n-1) &
                firm['Price'].eq(firm['Price'].shift())).astype(int)
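To see why this flags only the last of the three equal prices, here are the intermediate values computed from the sample Price column above:
# Price:                    10 11 12 12 12 11 13
# diff().eq(0):             F  F  F  T  T  F  F
# .cumsum():                0  0  0  1  2  2  2   <- never resets, hence the caveat above
# .eq(n-1) & eq(shift()):   F  F  F  F  T  F  F   -> Flag = 1 only at the third 12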
EDIT, to make it a generalized function for n consecutive values, use this:
def fun(df, col, n):
    c = df[col].diff().eq(0)
    return (c | c.shift(-1)).cumsum().ge(n) & df[col].eq(df[col].shift())
firm['flag_2'] = fun(firm,'Price',2).astype(int)
firm['flag_3'] = fun(firm,'Price',3).astype(int)
print(firm)
Price Flag flag_2 flag_3
Index
1 10 0 0 0
2 11 0 0 0
3 12 0 0 0
4 12 0 1 0
5 12 1 1 1
6 11 0 0 0
7 13 0 0 0
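As an aside, the np.std idea from the question can also be made to work with a rolling window. A minimal sketch (my variant, not part of the answer above; it assumes Price is numeric and n is the staleness window):
n = 3
# the rolling std over the last n prices is 0 exactly when they are all equal;
# the first n-1 windows are NaN, and NaN.eq(0) is False, so those rows flag as 0
firm['Flag'] = firm['Price'].rolling(n).std().eq(0).astype(int)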

How to split a list using two nested conditions

Basically I have a list of 0s and 1s. Each value in the list represents a data sample from an hour, so 24 values in the list mean 24 hours, or a single day. I want to capture the first time the data cycles from 0s to 1s and back to 0s within a span of 24 hours (or, vice versa, from 1s to 0s and back to 1s).
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1]
expected output:
# D
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0]
output = [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]
# ^ cycle.1:day.1 |dayline ^cycle.1:day.2
In the output list, a 1 means one cycle is completed at that position of the signal list; every other position is 0. There should be only one cycle per day, which is why there is a single 1.
I don't know how to split the list accordingly, so can someone please help?
It seems to me like what you are trying to do is split your data first into blocks of 24, and then find either the first rising edge or the first falling edge, depending on the first hour in that block.
Below I have tried to distill my understanding of what you are trying to accomplish into the following function. It takes a numpy array containing zeros and ones, as in your example. It checks what the first hour in the day is, and decides what type of edge to look for.
It detects an edge by using np.diff. This gives us an array containing -1s, 0s, and 1s. We then look for the first index of either a -1 (falling edge) or a 1 (rising edge). The function returns that index; if no edges were found, it returns the index of the last element, or nothing.
For more info see the docs for descriptions on numpy features used here np.diff, np.array.nonzero, np.array_split
import numpy as np

def get_cycle_index(day):
    '''
    returns the first index of a cycle defined by nipun vats
    if no cycle is found returns nothing
    '''
    first_hour = day[0]
    if first_hour == 0:
        edgetype = -1
    else:
        edgetype = 1
    edges = np.diff(np.r_[day, day[-1]])
    if (edges == edgetype).any():
        return (edges == edgetype).nonzero()[0][0]
    elif (day.sum() == day.size) or day.sum() == 0:
        return
    else:
        return day.size - 1
Below is an example of how you might use this function in your case.
import numpy as np
_data = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
#_data = np.random.randint(0,2,280, dtype='int')
data = np.array(_data, 'int')

# split the data into a set of 'day' blocks
blocks = np.array_split(data, np.arange(24, data.size, 24))

_output = []
for i, day in enumerate(blocks):
    print(f'day {i}')
    buffer = np.zeros(day.size, dtype='int')
    print('\tsignal:', *day, sep=' ')
    cycle_index = get_cycle_index(day)
    if cycle_index is not None:   # 'is not None' so a cycle at index 0 still counts
        buffer[cycle_index] = 1
    print('\toutput:', *buffer, sep=' ')
    _output.append(buffer)

output = np.concatenate(_output)
print('\nfinal output:\n', *output, sep=' ')
This yields the following output:
day 0
signal: 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0
output: 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 1
signal: 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
output: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 2
signal: 0 0 0 0 0 0
output: 0 0 0 0 0 0
final output:
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Create a new row with zeroes in wide Pandas DataFrame

I'm trying to fill an empty array with one row of zeroes. This is apparently easier said than done. This is my attempt:
Array = pd.DataFrame(columns=["curTime", "HC", "AC", "HG", "HF1", "HF2", "HF3", "HF4", "HF5", "HF6",
"HF7", "HF8", "HF9", "HF10", "HF11", "HF12", "HD1", "HD2", "HD3", "HD4", "HD5", "HD6",
"AG", "AF1", "AF2", "AF3", "AF4", "AF5", "AF6", "AF7", "AF8", "AF9", "AF10", "AF11", "AF12",
"AD1", "AD2", "AD3", "AD4", "AD5", "AD6"])
appendArray = [[0] * len(Array.columns)]
Array = Array.append(appendArray, ignore_index = True)
This however creates a row that stacks another 41 columns to the right of my existing 41 columns, and fills them with zeroes, while the original 41 columns get a "NaN" value.
How do I most easily do this?
You can use pd.Series within the append. Note that appendArray is a nested list, so take its first element:
Array.append(pd.Series(appendArray[0], index=Array.columns), ignore_index=True)
Out[780]:
curTime HC AC HG HF1 HF2 HF3 HF4 HF5 HF6 ... AF9 AF10 AF11 \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
AF12 AD1 AD2 AD3 AD4 AD5 AD6
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
[2 rows x 41 columns]
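Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same fill needs pd.concat or a loc assignment. A minimal sketch of the concat route:
import pandas as pd

# build a one-row DataFrame of zeros with the same columns, then concat
zero_row = pd.DataFrame([[0] * len(Array.columns)], columns=Array.columns)
Array = pd.concat([Array, zero_row], ignore_index=True)

# or assign a new row of zeros in place by label:
# Array.loc[len(Array)] = 0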
