Join DataFrames on Condition Pandas - python

I have the following two dataframes with binary values that I want to merge.
df1
   Action  Adventure  Animation  Biography
0       0          1          0          0
1       0          0          0          0
2       1          0          0          0
3       0          0          0          0
4       1          0          0          0
df2
   Action  Adventure  Biography  Comedy
0       0          0          0       0
1       0          0          1       0
2       0          0          0       0
3       0          0          0       1
4       1          0          0       0
I want to join these two dataframes so that the result has the union of their columns, and each cell is 1 if it is 1 in either dataframe and 0 otherwise.
Result
   Action  Adventure  Animation  Biography  Comedy
0       0          1          0          0       0
1       0          0          0          1       0
2       1          0          0          0       0
3       0          0          0          0       1
4       1          0          0          0       0
I am stuck on this, so I do not have a proposed solution.

Let us add the two dataframes, then clip the values at an upper bound of 1:
df1.add(df2, fill_value=0).clip(upper=1).astype(int)
   Action  Adventure  Animation  Biography  Comedy
0       0          1          0          0       0
1       0          0          0          1       0
2       1          0          0          0       0
3       0          0          0          0       1
4       1          0          0          0       0
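For completeness, here is a self-contained sketch of this approach, rebuilding df1 and df2 exactly as shown in the question:
import pandas as pd

df1 = pd.DataFrame({'Action':    [0, 0, 1, 0, 1],
                    'Adventure': [1, 0, 0, 0, 0],
                    'Animation': [0, 0, 0, 0, 0],
                    'Biography': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame({'Action':    [0, 0, 0, 0, 1],
                    'Adventure': [0, 0, 0, 0, 0],
                    'Biography': [0, 1, 0, 0, 0],
                    'Comedy':    [0, 0, 0, 1, 0]})

# add() aligns on the union of columns; fill_value=0 treats a missing
# column as all zeros, and clip(upper=1) turns any 2 back into 1
result = df1.add(df2, fill_value=0).clip(upper=1).astype(int)
print(result)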

Thinking of it as a set problem may give you a solution; have a look at the code. The first OR aligns the columns and leaves NaN where a column exists in only one frame; fillna(0).astype(int) cleans that up, and ORing with df2 once more restores the 1s in the columns that came only from df2.
print((df1 | df2).fillna(0).astype(int) | df2)
COMPLETE CODE:
import pandas as pd

df1 = pd.DataFrame(
    {
        'Action': [0, 0, 1, 0, 1],
        'Adventure': [1, 0, 0, 0, 0],
        'Animation': [0, 0, 0, 0, 0],
        'Biography': [0, 0, 0, 0, 0]
    }
)
# note that this df2 carries an all-zero Animation column, so every
# column of the intermediate result also exists in df2 for the final OR
df2 = pd.DataFrame(
    {
        'Action': [0, 0, 1, 0, 1],
        'Adventure': [1, 0, 0, 0, 0],
        'Animation': [0, 0, 0, 0, 0],
        'Biography': [0, 1, 0, 0, 0],
        'Comedy': [0, 0, 0, 1, 0]
    }
)
print((df1 | df2).fillna(0).astype(int) | df2)
OUTPUT:
   Action  Adventure  Animation  Biography  Comedy
0       0          1          0          0       0
1       0          0          0          1       0
2       1          0          0          0       0
3       0          0          0          0       1
4       1          0          0          0       0

Related

how to label multiple columns effectively using pandas

I have data columns that look like the below:
a b c d e
1 0 0 0 0
0 2 0 0 0
3 0 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 1
For this dataframe I want to create a column called label:
a b c d e label
1 0 0 0 0 cola
0 2 0 0 0 colb
3 0 0 0 0 cola
0 0 0 1 0 cold
0 0 1 0 0 colc
0 0 0 0 1 cole
The label is the name of the column holding the nonzero value.
My prior code was df['label'] = df['a'].apply(lambda x: 1 if x!=0), but it doesn't work (the lambda is missing an else clause, and it only looks at column a). Is there any way to return the expected result?
Try idxmax on axis 1
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 3, 0, 0, 0],
                   'b': [0, 2, 0, 0, 0, 0],
                   'c': [0, 0, 0, 0, 1, 0],
                   'd': [0, 0, 0, 1, 0, 0],
                   'e': [0, 0, 0, 0, 0, 1]})
# idxmax(axis=1) returns the column label of each row's maximum
df['label'] = 'col' + df.idxmax(axis=1)
Output
a b c d e label
0 1 0 0 0 0 cola
1 0 2 0 0 0 colb
2 3 0 0 0 0 cola
3 0 0 0 1 0 cold
4 0 0 1 0 0 colc
5 0 0 0 0 1 cole
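One caveat worth knowing, not raised in the thread: idxmax returns the first column label holding the row maximum, so a row of all zeros would still be labelled cola. A minimal sketch of the edge case:
import pandas as pd

df = pd.DataFrame({'a': [1, 0], 'b': [0, 0]})
# the second row has no unique maximum, so idxmax falls back to 'a'
print(df.idxmax(axis=1).tolist())   # ['a', 'a']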

How to convert consecutive Trues in a 3D array into integer values representing the length of that consecutive group

I have a (61,77,365) numpy array full of boolean values.
Taking a random slice across axis 2 (len=365) for illustrative purposes:
data = [False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
True False False False False True False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False True True False False False
True True True True True True True True False False False False
False False False False False False False False False False False False
False False False True True False False False False False False True
True True True True True False False False False False False False
False False False False False True True False False False True False
False False True True True False False True True False True True
False False True False True True True True True True False False
False False True True True]
I want to replace the True values with the length of their associated group of consecutive Trues, i.e.:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 8 8 8 8 8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 6 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0
0 1 0 0 0 3 3 3 0 0 2 2 0 2 2 0 0 1 0 6 6 6 6 6 6 0 0 0 0 3 3 3]
How can I do this efficiently for the 3D array? I want to avoid looping, as it would get very computationally expensive.
So far, I have used cumulative summing (which resets when it reaches False), then done the same on the reversed data. Adding the two together and subtracting 1 (where the data is True) gives the required answer, but it's convoluted and inefficient:
import numpy as np

axis = -1   # assumed here: operate along the last (len=365) axis
no_reset = np.cumsum(data, axis=axis)
reset = (data == 0)
excess = np.maximum.accumulate(no_reset * reset, axis=axis)
result = no_reset - excess
print(result)
result = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 1 2 3 4 5 6 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0
0 1 0 0 0 1 2 3 0 0 1 2 0 1 2 0 0 1 0 1 2 3 4 5 6 0 0 0 0 1 2 3]
no_reset_rev = np.cumsum(data[..., ::-1], axis=axis)
reset_rev = (data[..., ::-1] == 0)
excess_rev = np.maximum.accumulate(no_reset_rev * reset_rev, axis=axis)
result_rev = no_reset_rev - excess_rev
print(result_rev)
result_rev = [1 2 3 0 0 0 0 1 2 3 4 5 6 0 1 0 0 1 2 0 1 2 0 0 1 2 3 0 0 0 1 0 0 0 1 2 0
0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
final_res = result + result_rev[..., ::-1] - (1 * data)
print(final_res)
final_res = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 8 8 8 8 8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 6 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0
0 1 0 0 0 3 3 3 0 0 2 2 0 2 2 0 0 1 0 6 6 6 6 6 6 0 0 0 0 3 3 3]
The key is to get the indices at which the Trues are present, along with the length of each run of Trues.
First create a dataframe from the data:
df = pd.DataFrame({'data':data})
Then assign a group id to each run of consecutive True/False values using this shift-and-cumsum trick:
tf = (df['data'] != df['data'].shift()).cumsum()
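To see what this trick produces, here is a tiny sketch with made-up values; each run of identical values gets its own id:
import pandas as pd

s = pd.Series([False, True, True, False, True])
print((s != s.shift()).cumsum().tolist())   # [1, 2, 2, 3, 4]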
Then create another dataframe for only True values like:
df2 = pd.DataFrame(tf[df['data']])
Then reset the index so you get the indices at which the values are True (this is important), and group and aggregate the lists of indices:
df2 = df2.reset_index().rename(columns={'index':'trues'}).groupby('data',as_index=False).agg(list)
Then get the length of each list:
df2['data'] = df2['trues'].apply(lambda x:len(x))
Then explode the lists:
df2 = df2.explode('trues')
Then set the index and remove its name so you can locate the rows in your original df and assign properly:
df2 = df2.set_index('trues', drop=True)
df2.index.name = ''
Now assign the values back into the original df:
df.iloc[df2.index] = df2['data']
The complete code:
import pandas as pd

df = pd.DataFrame({'data': data})                  # data is the boolean slice from above
tf = (df['data'] != df['data'].shift()).cumsum()   # group id per consecutive run
df2 = pd.DataFrame(tf[df['data']])                 # keep only the True rows
df2 = df2.reset_index().rename(columns={'index':'trues'}).groupby('data',as_index=False).agg(list)
df2['data'] = df2['trues'].apply(lambda x:len(x))  # run length = size of each index list
df2 = df2.explode('trues')
df2 = df2.set_index('trues', drop=True)
df2.index.name = ''
df.iloc[df2.index] = df2['data']                   # write the run lengths back
Your final answer is in df['data'].
I'm not sure if this is faster, but you can try it this way. First, since your data is already a numpy array, you can use np.where to change the booleans to 0 and 1:
myarray = np.where(data == True, 1, 0)
Next, you need to get the indexes of the 1's using np.where or np.nonzero (on your full array, np.nonzero will be faster than np.where):
indexes = np.nonzero(myarray == 1)
With this, we will use a function posted by Unutbu to split the indexes based on consecutive values. The link is here: https://stackoverflow.com/a/7353335/16836078
def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

split_index = consecutive(indexes[0])
For the last part (I apologize if this is not your requirement), we use a for loop to assign the length of each consecutive group back to the original array:
for i in split_index:
    number = len(i)
    myarray[i] = number

myarray
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0,
0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
2, 0, 0, 0, 1, 0, 0, 0, 3, 3, 3, 0, 0, 2, 2, 0, 2, 2, 0, 0, 1, 0,
6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 3, 3, 3])
Extra
I have tried to use your full array for the operations; a few things need to change.
When finding the indexes of the 1's and applying the consecutive function, you can use list comprehensions:
indexes = [np.nonzero(j == 1)[0] for i in myarray for j in i]   # myarray is the full (61, 77, 365) array
split_index = [consecutive(i) for i in indexes]
Finally, a nested for loop assigns the values back into the original array:
split_index_array = np.array(split_index, dtype=object).reshape(61, 77)   # one list of grouped indexes per (row, column) slice
for dimension1, i in enumerate(split_index_array):
    for dimension2, j in enumerate(i):
        for k in j:
            numbers = len(k)
            myarray[dimension1][dimension2][k] = numbers
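For reference, the asker's own forward/backward cumulative-sum idea can also be packaged as a reusable, fully vectorized function; this is a sketch that works on any boolean array along a chosen axis, with no Python-level loop:
import numpy as np

def run_lengths(mask, axis=-1):
    # replace each True with the length of its consecutive run of Trues
    m = np.moveaxis(np.asarray(mask, dtype=bool), axis, -1)
    fwd = np.cumsum(m, axis=-1)
    fwd = fwd - np.maximum.accumulate(np.where(m, 0, fwd), axis=-1)   # resets at each False
    r = m[..., ::-1]
    bwd = np.cumsum(r, axis=-1)
    bwd = bwd - np.maximum.accumulate(np.where(r, 0, bwd), axis=-1)
    out = fwd + bwd[..., ::-1] - m   # each True is counted once from each direction
    return np.moveaxis(out, -1, axis)

print(run_lengths(np.array([False, True, True, False, True, True, True])))
# [0 2 2 0 3 3 3]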

is it possible to use fnmatch.filter on a pandas dataframe instead of regex? [closed]

I have a dataframe as below, for example. I want only the tests matching a certain pattern to be part of my updated dataframe. I was wondering if there is a way to do it with fnmatch instead of regex?
data = {'part1': [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
        'part2': [0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
        'part3': [0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1],
        'part4': [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
        'part5': [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        'part6': [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
        'part7': [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1],
        'part8': [1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1],
        'part9': [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        'part10': [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
        'part11': [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        'part12': [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data, index=['test_gt1', 'test_gt2', 'test_gf3', 'test_gf4',
                               'test_gt5', 'test_gg6', 'test_gf7', 'test_gt8',
                               'test_gg9', 'test_gf10', 'test_gg11', 'test12'])
I want to be able to create a new dataframe that only contains the test_gg, test_gf, or test_gt rows, using fnmatch.filter. All the examples I see relate to lists, so how can I apply it to a dataframe?
Import fnmatch.filter and filter on the index:
from fnmatch import filter
In [7]: df.loc[filter(df.index, '*g*')]
Out[7]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
You can also just use pandas' filter function with regex, and filter on the index:
In [8]: df.filter(regex=r".+g.+", axis='index')
Out[8]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
You can also just use like:
df.filter(like="g", axis='index')
Out[12]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
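If you prefer to keep fnmatch but work with a boolean mask (for example, to combine it with other row conditions), a small sketch using fnmatch.fnmatch per index label, assuming the df from the question:
import fnmatch

# the character class matches test_gt*, test_gf* and test_gg*, but not test12
mask = [fnmatch.fnmatch(name, 'test_g[tfg]*') for name in df.index]
print(df[mask].index.tolist())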

Create a new column based on other column values - conditional forward fill?

I have the following dataframe:
import pandas as pd

d = {'c_1': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
     'c_2': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)
I want to create another column 'f' that becomes 1 when c_1 == 1 and stays 1 until c_2 == 1, at which point it returns to 0.
Desired output as follows:
c_1 c_2 f
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 1
5 0 1 0
6 0 0 0
7 1 0 1
8 0 0 1
9 0 0 1
10 0 0 1
11 0 1 0
I think this requires some kind of conditional forward fill; I have looked at previous questions but haven't been able to arrive at the desired output.
Edit: I have come across a related scenario where the inputs differ and the current solutions do not work. I will confirm the answer, but I'd appreciate any input on the below.
d = {'c_1': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
     'c_2': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)
Desired output as follows. Same as before: I want to create another column 'f' that becomes 1 when c_1 == 1 and returns to 0 when c_2 == 1.
c_1 c_2 f
0 0 1 0
1 0 1 0
2 0 1 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
10 0 1 0
11 0 1 0
12 0 1 0
13 0 0 0
14 0 0 0
15 0 0 0
16 1 0 1
17 0 0 1
18 1 0 1
19 1 0 1
20 0 0 1
21 0 0 1
22 0 0 1
23 0 0 1
24 0 1 0
You can try the following: every 1 in either column flips a running parity, so the cumulative count of events mod 2 is 1 between a c_1 event and the next c_2 event:
df['f'] = df[['c_1','c_2']].sum(1).cumsum().mod(2)
print(df)
c_1 c_2 f
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 1
5 0 1 0
6 0 0 0
7 1 0 1
8 0 0 1
9 0 0 1
10 0 0 1
11 0 1 0
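A quick check, not in the original answer, shows why this parity trick breaks on the second scenario from the edit: back-to-back c_2 events each flip the parity:
import pandas as pd

d = {'c_1': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
     'c_2': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)
print(df[['c_1', 'c_2']].sum(1).cumsum().mod(2).tolist())
# [1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
# the first three rows should all be 0, so mod(2) is not enough here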
You can also try like this:
df.loc[df['c_2'].shift().ne(1), 'f'] = df['c_1'].replace(to_replace=0, method='ffill')
c_1 c_2 f
0 0 0 0.0
1 0 0 0.0
2 0 0 0.0
3 1 0 1.0
4 0 0 1.0
5 0 1 1.0 # <--- this value should be zero
6 0 0 NaN
7 1 0 1.0
8 0 0 1.0
9 0 0 1.0
10 0 0 1.0
11 0 1 1.0 # <---
Add one more condition if you don't want to include the end position.
Final:
df.loc[df['c_2'].shift().ne(1) & df['c_2'].ne(1), 'f'] = df['c_1'].replace(to_replace=0, method='ffill')
df = df.fillna(0)
c_1 c_2 f
0 0 0 0.0
1 0 0 0.0
2 0 0 0.0
3 1 0 1.0
4 0 0 1.0
5 0 1 0.0
6 0 0 0.0
7 1 0 1.0
8 0 0 1.0
9 0 0 1.0
10 0 0 1.0
11 0 1 0.0
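A side note: newer pandas versions deprecate the method argument of replace, so under that assumption an equivalent forward-fill sketch for the replace(to_replace=0, method='ffill') piece is:
df['c_1'].mask(df['c_1'].eq(0)).ffill().fillna(0).astype(int)
# mask() turns the zeros into NaN, ffill() carries the last 1 forward,
# and fillna(0) handles any rows before the first c_1 event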
This should work for both scenarios:
df['c_1'].groupby(df[['c_1','c_2']].sum(1).cumsum()).transform('first')
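Running this on the second scenario from the edit reproduces the desired f column, which is a quick way to verify the claim:
import pandas as pd

d = {'c_1': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
     'c_2': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
df = pd.DataFrame(d)
df['f'] = df['c_1'].groupby(df[['c_1', 'c_2']].sum(1).cumsum()).transform('first')
print(df['f'].tolist())
# [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0]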

numpy for no repeating for two columns

Basically, I am looking to compute the rising-edge columns AA and BB without repeats: once AA has a 1, the next 1 should appear in BB rather than in AA again.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype='int32'),
    'B': np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1], dtype='int32')
})
df2['AA'] = np.where(df2['A'] > df2['A'].shift(1), 1, 0)
df2['BB'] = np.where(df2['B'] > df2['B'].shift(1), 1, 0)
I am getting a repeated 1 in BB (rows 7 and 12, with no AA edge in between). How can I suppress the repeat, so that after AA gets a 1 the next 1 goes to BB, and vice versa?
df2
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 1
The result should be as follows: if AA had a 1 in a previous row, the next 1 should appear in BB rather than repeating in AA, and likewise BB should not repeat until AA has had a 1.
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 0
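No answer is attached to this question above, but as an illustrative sketch (not an accepted solution), one way to get the desired alternation is to walk the rising edges in order and keep a 1 only when it alternates with the previously kept edge:
aa = (df2['A'] > df2['A'].shift(1)).astype(int).to_numpy()
bb = (df2['B'] > df2['B'].shift(1)).astype(int).to_numpy()
last = 'B'   # so the first kept edge may come from A
for i in range(len(aa)):
    if aa[i] and last == 'A':
        aa[i] = 0            # repeated A edge: drop it
    elif aa[i]:
        last = 'A'
    if bb[i] and last == 'B':
        bb[i] = 0            # repeated B edge: drop it
    elif bb[i]:
        last = 'B'
df2['AA'], df2['BB'] = aa, bb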
