Reverse cumsum for countdown functionality in pandas? - python

I'm trying to work out how to create a pandas column that "counts down" to the next occurrence of a value in another column, like this:
rowid event countdown
1 False 0 # resets countdown
2 True 2 # resets countdown
3 False 1
4 False 0
5 True 1 # resets countdown
6 False 0
7 True 1 # resets countdown
...
The event column marks whether an event occurs in that row (True) or not (False), and the countdown column gives the number of subsequent rows/steps that have to pass until the next event occurs. The following works when one needs to "count up" from the most recent event:
df.groupby(df.event.cumsum()).cumcount()
Out[46]:
0 0
1 0
2 1
3 2
4 0
5 1
dtype: int64
However, this is effectively the inverse of what I want to accomplish. Is there a succinct way to produce the countdown in the first example? Thanks!

Use GroupBy.cumcount with ascending=False. The last value is 0 because the sample data has only 7 rows, so nothing follows the final True:
df['new'] = df.groupby(df.event.cumsum()).cumcount(ascending=False)
print (df)
rowid event countdown new
0 1 False 0 0
1 2 True 2 2
2 3 False 1 1
3 4 False 0 0
4 5 True 1 1
5 6 False 0 0
6 7 True 1 0
If the logic requires the last row to be 1 when its event is True, set it explicitly:
df.iloc[[-1], df.columns.get_loc('new')] = int(df.iloc[-1, df.columns.get_loc('event')])
print (df)
rowid event countdown new
0 1 False 0 0
1 2 True 2 2
2 3 False 1 1
3 4 False 0 0
4 5 True 1 1
5 6 False 0 0
6 7 True 1 1
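For reference, here is a minimal self-contained sketch of the approach above. It rebuilds the 7-row sample by hand, and the column names `group` and `countdown` are chosen here for illustration only. It shows why `event.cumsum()` works as a group key: each True starts a new group, and counting backwards within a group gives the rows remaining in that group.

```python
import pandas as pd

# Sample data rebuilt from the question (rowid 1..7).
df = pd.DataFrame({
    "rowid": range(1, 8),
    "event": [False, True, False, False, True, False, True],
})

# Each True increments the running sum, so every event starts a new group.
df["group"] = df["event"].cumsum()

# Count backwards inside each group: the last row of a group gets 0.
df["countdown"] = df.groupby("group").cumcount(ascending=False)

print(df)
```

The final row gets 0 here because its group contains only that row; whether it should be 1 instead depends on the rule chosen for a trailing True, as discussed above.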

Related

Adding boolean column to pandas dataframe where one row being true should make all same users rows true

I'm having trouble adding a boolean column to a pandas dataframe. The data has users with projects that they can open in several places. I need to identify the users who have worked with the same project in several places: if a user has opened the same project in different places even once, shared_projects should be true, and then all rows with that user_id should be true.
Here is an example df:
user_id project_id_x project_id_y
1 1 2
1 3 4
2 5 6
2 7 7
2 8 9
3 10 11
3 12 10
This is a simple example of the output I would like to get. If the condition is true in one row, it should be true in all rows with that user_id.
user_id project_id_x project_id_y shared_projects
1 1 2 false
1 3 4 false
2 5 6 true
2 7 7 true
2 8 9 true
3 10 11 true
3 12 10 true
I can get boolean values for each row, but I am stuck on how to make it true for all of a user's rows when it is true in one row.
Assuming you want to match on the same row:
df['shared_projects'] = (df['project_id_x'].eq(df['project_id_y'])
.groupby(df['user_id']).transform('any')
)
If you want to match on any value x/y for a given user, you can use a set intersection:
s = df.groupby('user_id').apply(lambda g: bool(set(g['project_id_x'])
.intersection(g['project_id_y'])))
# name the Series 'shared_projects' so it matches the column shown in the output
df.merge(s.rename('shared_projects'), left_on='user_id', right_index=True)
output:
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True
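The two interpretations above give different results on the 7-row example from the question, so here is a small sketch that computes both side by side. The column names `shared_same_row` and `shared_any_row` are made up for this illustration, and the second variant uses a plain dictionary built from the groups rather than `apply`, which is just one possible way to write it.

```python
import pandas as pd

# The 7-row example from the question.
df = pd.DataFrame({
    "user_id":      [1, 1, 2, 2, 2, 3, 3],
    "project_id_x": [1, 3, 5, 7, 8, 10, 12],
    "project_id_y": [2, 4, 6, 7, 9, 11, 10],
})

# Variant 1: x equals y on the same row, broadcast to every row of that user.
same_row = df["project_id_x"].eq(df["project_id_y"])
df["shared_same_row"] = same_row.groupby(df["user_id"]).transform("any")

# Variant 2: some project id appears in both columns for the user,
# possibly on different rows (set intersection per user).
shared_by_user = {
    user: bool(set(g["project_id_x"]) & set(g["project_id_y"]))
    for user, g in df.groupby("user_id")
}
df["shared_any_row"] = df["user_id"].map(shared_by_user)

print(df)
```

User 3 is the case that separates the two: project 10 appears in project_id_x on one row and in project_id_y on another, so only the second variant flags it, which is what the expected output in the question shows.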
First, do a row-wise comparison to find the users who have the same project in both columns:
df['shared_projects'] = (df['project_id_x'] == df['project_id_y'])
That creates a new boolean column, as you've already done. You can then use the index of the True rows to flag the rest, assuming that "user_id" is the index of the dataframe.
# assumes df.index holds user_id, e.g. after df = df.set_index('user_id')
for index in df[df['shared_projects']].index.unique():
    # .loc with a repeated index label sets every row of that user
    df.loc[index, 'shared_projects'] = True
Update
Another approach without apply, using value_counts.
user_id = df.melt('user_id', var_name='project', value_name='project_id') \
.value_counts(['user_id', 'project_id']) \
.loc[lambda x: x > 1].index.get_level_values('user_id')
df['shared_projects'] = df['user_id'].isin(user_id)
Output:
>>> df
user_id project_id_x project_id_y
1 1 2
1 3 4
2 5 6
2 7 7
2 8 9
# Intermediate result
>>> df.melt('user_id', var_name='project', value_name='project_id') \
.value_counts(['user_id', 'project_id'])
user_id project_id
2 7 2 # <- project 7 in multiple places for user 2
1 1 1
2 1
3 1
4 1
2 5 1
6 1
8 1
9 1
dtype: int64
Old answer
You can use melt:
shared_projects = lambda x: len(set(x)) != len(x)
user_id = df.melt('user_id').groupby('user_id')['value'].apply(shared_projects)
df['shared_projects'] = df['user_id'].isin(user_id[user_id].index)
Output:
>>> df
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True

How to Invert column values in pandas - pythonic way?

I have a dataframe like as shown below
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'Label':[1,1,1,0,0]})
My objective is to
a) replace 0s with 1s AND 1s with 0s in the Label column
I was trying something like the below
cdf.assign(invert_label=cdf.Label.loc[::-1].reset_index(drop=True)) #not work
cdf['invert_label'] = np.where(cdf['Label']==0, '1', '0')
but this doesn't work. It reverses the order.
I expect my output to look like this:
Id Label
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
You can compare with 0, which gives True for 0 and False for anything else, then convert to integers to map True, False to 1, 0:
print (cdf['Label'].eq(0))
0 False
1 False
2 False
3 True
4 True
Name: Label, dtype: bool
cdf['invert_label'] = cdf['Label'].eq(0).astype(int)
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
Another idea is to use a mapping:
cdf['invert_label'] = cdf['Label'].map({1:0, 0:1})
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
One perhaps obvious answer is to subtract the value from 1:
cdf['Label2'] = 1-cdf['Label']
output:
Id Label Label2
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
You could map the not function as well -
import operator
cdf['Label'].map(operator.not_).astype('int')
Another way, which I am adding as a separate answer because it is probably not "pythonic" enough (in the sense that it is not very explicit), is to use the bitwise xor:
cdf['Label'] ^ 1
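To compare the suggestions above, this sketch runs four of them on the question's data and collects the results in one frame. The frame name `inverted` and its column names are made up for this illustration.

```python
import pandas as pd

cdf = pd.DataFrame({"Id": [1, 2, 3, 4, 5], "Label": [1, 1, 1, 0, 0]})

# One column per approach, all computed from the same Label column.
inverted = pd.DataFrame({
    "eq_zero":   cdf["Label"].eq(0).astype(int),   # compare with 0, cast to int
    "one_minus": 1 - cdf["Label"],                 # arithmetic
    "mapping":   cdf["Label"].map({1: 0, 0: 1}),   # explicit lookup
    "xor":       cdf["Label"] ^ 1,                 # flip the lowest bit
})

print(inverted)
```

All four agree as long as Label contains only 0 and 1. They diverge on other values: for a label of 2, for example, the comparison gives 0, the subtraction gives -1, the mapping gives NaN, and the xor gives 3, so the choice matters if the column is not strictly binary.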

drop the row only if all columns contains 0

I am trying to drop rows that have 0 in all 3 columns. I tried the code below, but it dropped every row that has 0 in any one of the 3 columns instead.
indexNames = news[ news['contain1']&news['contain2'] &news['contain3']== 0 ].index
news.drop(indexNames , inplace=True)
My CSV file
contain1 contain2 contain3
1 0 0
0 0 0
0 1 1
1 0 1
0 0 0
1 1 1
With the code I used, all of my rows would be deleted. Below is the result I wanted instead:
contain1 contain2 contain3
1 0 0
0 1 1
1 0 1
1 1 1
First compare with DataFrame.ne (not equal to 0), then keep the rows with at least one match using DataFrame.any, so only the all-zero rows are removed:
df = news[news.ne(0).any(axis=1)]
#cols = ['contain1','contain2','contain3']
#if necessary filter only columns by list
#df = news[news[cols].ne(0).any(axis=1)]
print (df)
contain1 contain2 contain3
0 1 0 0
2 0 1 1
3 1 0 1
5 1 1 1
Details:
print (news.ne(0))
contain1 contain2 contain3
0 True False False
1 False False False
2 False True True
3 True False True
4 False False False
5 True True True
print (news.ne(0).any(axis=1))
0 True
1 False
2 True
3 True
4 False
5 True
dtype: bool
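It may help to see why the attempt in the question drops too much. In Python, `&` binds more tightly than `==`, so the original expression is evaluated as `(contain1 & contain2 & contain3) == 0`, which is True whenever at least one column is 0. The sketch below rebuilds the sample by hand and compares that mask with an explicit "all columns are 0" mask; the variable names are chosen here for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
news = pd.DataFrame({
    "contain1": [1, 0, 0, 1, 0, 1],
    "contain2": [0, 0, 1, 0, 0, 1],
    "contain3": [0, 0, 1, 1, 0, 1],
})

# How the original expression is grouped: bitwise AND first, then the comparison.
original_mask = (news["contain1"] & news["contain2"] & news["contain3"]) == 0

# The intended condition: every column is 0 on that row.
all_zero = news.eq(0).all(axis=1)

# Keep everything except the all-zero rows.
kept = news[~all_zero]
print(kept)
```

Here `original_mask` selects five of the six rows, while `all_zero` selects only the two rows that are entirely 0.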
If this is a pandas dataframe, you can sum each row with .sum(axis=1).
news_sums = news.sum(axis=1)  # per-row sum; axis=0 would sum each column instead
indexNames = news.loc[news_sums == 0].index
news.drop(indexNames, inplace=True)
(note: Not tested, just from memory)
A simple solution would be to filter on the sum of your columns. You can do this by running this code news[news.sum(axis=1)!=0].
Hope this will help you :)
You might want to try this.
news[(news.T != 0).any()]
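One caveat about the sum-based answers: they are equivalent to the not-equal test only when values cannot cancel out, which holds for the 0/1 data in the question. The sketch below uses hypothetical data with a negative value (not from the question) to show where the two approaches differ.

```python
import pandas as pd

# Hypothetical data, not from the question: row 0 sums to 0 but is not all zeros.
demo = pd.DataFrame({"a": [1, 0, 2], "b": [-1, 0, 0], "c": [0, 0, 0]})

# Sum-based filter: drops every row whose values add up to 0.
by_sum = demo[demo.sum(axis=1) != 0]

# Not-equal filter: drops only rows where every value is 0.
by_any = demo[demo.ne(0).any(axis=1)]

print(by_sum.index.tolist(), by_any.index.tolist())
```

If negative values are possible, the `ne(0).any(axis=1)` form is the safer of the two.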

Select columns with specific values in pandas DataFrame

I have some columns in my DataFrame with values 0 and 1
name a b c d e
0 one 1 0 1 0 0
1 two 0 0 1 0 0
2 three 0 0 1 0 1
How can I select the columns where at least one value is 1? The other columns (those that are strings or take values other than 0 and 1) must be selected too.
I tried this expression
df.iloc[:, [(clm == 'name') | (1 in df[clm].unique()) for clm in df.columns]]
Out:
name a c e
0 one 1 1 0
1 two 0 1 0
2 three 0 1 1
But this does not seem good, because I explicitly choose the column 'name'.
If it is acceptable to remove all columns that contain only 0 values, compare with DataFrame.ne (not equal) and pass the per-column DataFrame.any result to DataFrame.loc:
df = df.loc[:, df.ne(0).any()]
print (df)
name a c e
0 one 1 1 0
1 two 0 1 0
2 three 0 1 1
Details:
print (df.ne(0))
name a b c d e
0 True True False True False False
1 True False False True False False
2 True False False True False True
print (df.ne(0).any())
name True
a True
b False
c True
d False
e True
dtype: bool
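The answer above relies on string values comparing as "not equal to 0". A possible variant, sketched here under the assumption that only numeric columns should ever be dropped, restricts the zero test to numeric columns so that other columns are kept regardless of how they compare with 0. The names `numeric`, `all_zero` and `selected` are chosen for this illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "name": ["one", "two", "three"],
    "a": [1, 0, 0],
    "b": [0, 0, 0],
    "c": [1, 1, 1],
    "d": [0, 0, 0],
    "e": [0, 0, 1],
})

# Look only at numeric columns when testing for "all zeros".
numeric = df.select_dtypes("number")
all_zero = numeric.eq(0).all()

# Drop the numeric columns that are entirely 0; everything else stays.
selected = df.drop(columns=all_zero[all_zero].index)
print(selected)
```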

How to identify unique ID's that have only 1 true condition?

Simple question: How to identify unique ID's that have only 1 true condition?
Index ID value condition
0 1 1 False
1 1 3 True
2 1 2 False
3 1 1 False
4 2 3 True
5 2 4 True
6 2 5 True
In the case above, only ID 1 (1 True) would be identified, while ID 2 (3 Trues) would not.
How would I go about editing the code below? I need to keep the original index and ID in a segmented data frame.
df[df['condition']==True]['ID'].unique()
Expected output:
Index ID value condition
1 1 3 True
All the best,
Thank you for your time.
Using filter
df.groupby('ID').filter(lambda x : sum(x['condition'])==1)
Out[685]:
Index ID value condition
0 0 1 1 False
1 1 1 3 True
2 2 1 2 False
3 3 1 1 False
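The filter above keeps every row of a qualifying ID. If only the True row itself is wanted, as in the expected output, one possible sketch is to count the True values per ID with transform and combine that with the condition column. The sample is rebuilt by hand with the default index standing in for the Index column, and `true_count` and `result` are names chosen for this illustration.

```python
import pandas as pd

# Sample data rebuilt from the question; the default index plays the role of "Index".
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2],
    "value": [1, 3, 2, 1, 3, 4, 5],
    "condition": [False, True, False, False, True, True, True],
})

# Number of True values per ID, aligned to the original rows.
true_count = df["condition"].astype(int).groupby(df["ID"]).transform("sum")

# Keep rows that are True and belong to an ID with exactly one True.
result = df[(true_count == 1) & df["condition"]]
print(result)
```

Because this is a boolean mask on the original frame, the original index and ID are preserved in `result`.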
