How to identify unique IDs that have only 1 true condition? - python

Simple question: How to identify unique IDs that have only 1 true condition?
Index ID value condition
0 1 1 False
1 1 3 True
2 1 2 False
3 1 1 False
4 2 3 True
5 2 4 True
6 2 5 True
In the case above, only ID 1 (1 True) would be identified, while ID 2 (3 Trues) would not.
How would I go about editing the code below? I need to keep the original index and ID in a segmented data frame.
df[df['condition']==True]['ID'].unique()
Expected output:
Index ID value condition
1 1 3 True
All the best,
Thank you for your time.
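For reproducibility, the sample frame can be rebuilt like this (a sketch of the data shown above):
import pandas as pd

df = pd.DataFrame({
    'ID':        [1, 1, 1, 1, 2, 2, 2],
    'value':     [1, 3, 2, 1, 3, 4, 5],
    'condition': [False, True, False, False, True, True, True],
})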

Using filter
df.groupby('ID').filter(lambda x: x['condition'].sum() == 1)
Out[685]:
Index ID value condition
0 0 1 1 False
1 1 1 3 True
2 2 1 2 False
3 3 1 1 False
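A transform-based boolean mask is an equivalent alternative (a sketch; it also keeps the original index, and chaining it with the condition column yields exactly the single True row from the expected output):
# groups with exactly one True
mask = df.groupby('ID')['condition'].transform('sum').eq(1)
df[mask]                    # all rows of qualifying IDs
df[mask & df['condition']]  # only the True row, matching the expected output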

Related

How to count the number of rows that have the value 1 for all the columns in a dataframe?

The following is an example dataframe for this issue:
name gender address phone no.
---------------------------------------
1 1 1 1
1 0 0 0
1 1 1 1
1 1 0 1
The desired output here is 2 because the number of rows containing all 1s is 2.
Can anyone please help me with this issue?
Thanks.
Use eq(1) to flag the values equal to 1, then aggregate per row with all to get True when all values in the row are True, and sum the Trues, taking advantage of the True/1 equivalence:
df.eq(1).all(axis=1).sum()
output: 2
Intermediates:
df.eq(1)
name gender address phone no.
0 True True True True
1 True False False False
2 True True True True
3 True True False True
df.eq(1).all(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
Let's do
l = sum(df.eq(1).all(axis=1))
print(l)
2
Assuming the above dataframe is a binary table, i.e. all values are either 1 or 0, then the rows where df.sum(axis=1) equals the number of columns (4 here) are exactly the rows where all values are 1.
df[df.sum(axis=1) == len(df.columns)]
name gender address phone no.
0 1 1 1 1
2 1 1 1 1
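Note that the binary assumption matters: a hypothetical non-binary row such as [2, 1, 0, 1] also sums to 4, so the row-sum test can misfire on such data (a quick sketch):
import pandas as pd

# hypothetical non-binary row: it sums to 4 without being all 1s
bad = pd.DataFrame({'name': [2], 'gender': [1], 'address': [0], 'phone no.': [1]})
print(bad[bad.sum(axis=1) == len(bad.columns)])  # the row matches despite the 0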

Adding boolean column to pandas dataframe where one row being true should make all same users rows true

I am having problems adding a boolean column to a pandas dataframe. The data has users who have projects they can open in several places. I need to identify the users who have worked with the same project in several places: if the same user has opened the same project in different places even once, shared_projects should be true, and then all rows with that user_id should be true.
Here is an example df:
user_id project_id_x project_id_y
1 1 2
1 3 4
2 5 6
2 7 7
2 8 9
3 10 11
3 12 10
This is a simple example of what I would like to get. If the condition is true on one row, it should be true on all rows with that user_id.
user_id project_id_x project_id_y shared_projects
1 1 2 false
1 3 4 false
2 5 6 true
2 7 7 true
2 8 9 true
3 10 11 true
3 12 10 true
I can get boolean values per row, but I am stuck on how to make the value true for all of a user's rows when it is true on one of them.
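For reference, the example frame can be rebuilt like this (a minimal sketch of the data shown above):
import pandas as pd

df = pd.DataFrame({
    'user_id':      [1, 1, 2, 2, 2, 3, 3],
    'project_id_x': [1, 3, 5, 7, 8, 10, 12],
    'project_id_y': [2, 4, 6, 7, 9, 11, 10],
})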
Assuming you want to match on the same row:
df['shared_projects'] = (df['project_id_x'].eq(df['project_id_y'])
                           .groupby(df['user_id'])
                           .transform('any'))
If you want to match on any value x/y for a given user, you can use a set intersection:
s = df.groupby('user_id').apply(lambda g: bool(set(g['project_id_x'])
                                               .intersection(g['project_id_y'])))
df.merge(s.rename('shared_projects'), left_on='user_id', right_index=True)
output:
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True
First, create a boolean column that flags the rows where the user has the same project in both columns:
df['shared_projects'] = (df['project_id_x'] == df['project_id_y'])
That creates a new boolean column, as you've already done. Then you can use the index labels of the True rows to propagate the flag to the user's other rows, assuming "user_id" is the index of the dataframe:
for index in df[df['shared_projects'] == True].index.unique():
    df.loc[index, 'shared_projects'] = True
Update
Another approach without apply, using value_counts.
user_id = df.melt('user_id', var_name='project', value_name='project_id') \
             .value_counts(['user_id', 'project_id']) \
             .loc[lambda x: x > 1].index.get_level_values('user_id')
df['shared_projects'] = df['user_id'].isin(user_id)
Output:
>>> df
user_id project_id_x project_id_y
1 1 2
1 3 4
2 5 6
2 7 7
2 8 9
# Intermediate result
>>> df.melt('user_id', var_name='project', value_name='project_id') \
.value_counts(['user_id', 'project_id'])
user_id project_id
2 7 2 # <- project 7 in multiple places for user 2
1 1 1
2 1
3 1
4 1
2 5 1
6 1
8 1
9 1
dtype: int64
Old answer
You can use melt:
shared_projects = lambda x: len(set(x)) != len(x)
user_id = df.melt('user_id').groupby('user_id')['value'].apply(shared_projects)
df['shared_projects'] = df['user_id'].isin(user_id[user_id].index)
Output:
>>> df
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True
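One caveat worth noting (an observation, not from the answers): the duplicate-count approaches, including this one, also flag a user whose project repeats within a single column, which the set-intersection answer would not:
import pandas as pd

# hypothetical edge case: user 4 repeats a project within one column only
extra = pd.DataFrame({'user_id': [4, 4],
                      'project_id_x': [1, 1],
                      'project_id_y': [2, 3]})
melted = extra.melt('user_id')['value']
print(len(set(melted)) != len(melted))  # True, although nothing is shared across x and y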

Pandas: check if column value is unique

I have a DataFrame like:
value
0 1
1 2
2 2
3 3
4 4
5 4
I need to check whether each value is unique, and mark that boolean value in a new column. The expected result would be:
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
I have tried:
df['unique'] = ""
df.loc[df["value"].is_unique, 'unique'] = True
But this throws exception:
cannot use a single bool to index into setitem
Any advice would be highly appreciated. Thanks.
Series.is_unique returns a single boolean for the whole column, which is why the indexing fails. Use Series.duplicated with the mask inverted by ~:
df['unique'] = ~df['value'].duplicated(keep=False)
print (df)
value unique
0 1 True
1 2 False
2 2 False
3 3 True
4 4 False
5 4 False
Or:
import numpy as np

df['unique'] = np.where(df['value'].duplicated(keep=False), False, True)
This works as well:
df['unique'] = df.merge(df.value_counts().to_frame(), on='value')[0]==1
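Another common pattern (a sketch, not from the answers above) is to count occurrences per value with groupby.transform and flag counts of 1:
# flag values that occur exactly once in the question's df
df['unique'] = df.groupby('value')['value'].transform('size').eq(1)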

Drop the row only if all columns contain 0

I am trying to drop rows that have 0 in all 3 columns. I tried the code below, but it dropped every row that has 0 in any one of the 3 columns instead.
indexNames = news[ news['contain1']&news['contain2'] &news['contain3']== 0 ].index
news.drop(indexNames , inplace=True)
My CSV file
contain1 contain2 contain3
1 0 0
0 0 0
0 1 1
1 0 1
0 0 0
1 1 1
With the code above, all of my rows get deleted. Below is the result I wanted instead:
contain1 contain2 contain3
1 0 0
0 1 1
1 0 1
1 1 1
First compare against 0 with DataFrame.ne (not equal), then keep the rows with at least one True using DataFrame.any, so only all-zero rows are removed:
df = news[news.ne(0).any(axis=1)]
#cols = ['contain1','contain2','contain3']
#if necessary filter only columns by list
#df = news[news[cols].ne(0).any(axis=1)]
print (df)
contain1 contain2 contain3
0 1 0 0
2 0 1 1
3 1 0 1
5 1 1 1
Details:
print (news.ne(0))
contain1 contain2 contain3
0 True False False
1 False False False
2 False True True
3 True False True
4 False False False
5 True True True
print (news.ne(0).any(axis=1))
0 True
1 False
2 True
3 True
4 False
5 True
dtype: bool
If this is a pandas dataframe you can sum across each row with .sum(axis=1).
news_sums = news.sum(axis=1)
indexNames = news.loc[news_sums == 0].index
news.drop(indexNames, inplace=True)
(note: Not tested, just from memory)
A simple solution would be to filter on the sum of your columns. You can do this by running news[news.sum(axis=1) != 0].
Hope this will help you :)
You might want to try this.
news[(news.T != 0).any()]
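Equivalently, you can state the title's condition directly: keep a row unless every column equals 0 (a sketch along the same lines as the ne/any answer above):
# drop rows where all columns are 0
news[~news.eq(0).all(axis=1)]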

Reverse cumsum for countdown functionality in pandas?

I'm trying to work out how to create a column with pandas that "counts down" until the next occurrence of a True in another column, in essence performing the following:
rowid event countdown
1 False 0 # resets countdown
2 True 2 # resets countdown
3 False 1
4 False 0
5 True 1 # resets countdown
6 False 0
7 True 1 # resets countdown
...
Here the event column defines whether an event occurs (True) or not (False), and the countdown column gives the number of subsequent rows/steps until the next event. The following works when one needs to "count up" after an event occurs:
df.groupby(df.event.cumsum()).cumcount()
Out[46]:
0 0
1 0
2 1
3 2
4 0
5 1
dtype: int64
However, this effectively achieves the inverse of what I want to accomplish. Is there a succinct method of achieving the former example? Thanks!
Use GroupBy.cumcount with ascending=False. The last value is 0 because the sample data has only 7 rows and nothing follows the last True:
df['new'] = df.groupby(df.event.cumsum()).cumcount(ascending=False)
print (df)
rowid event countdown new
0 1 False 0 0
1 2 True 2 2
2 3 False 1 1
3 4 False 0 0
4 5 True 1 1
5 6 False 0 0
6 7 True 1 0
If the logic requires the last True row to be set to 1:
df.iloc[[-1], df.columns.get_loc('new')] = int(df.iloc[-1, df.columns.get_loc('event')])
print (df)
rowid event countdown new
0 1 False 0 0
1 2 True 2 2
2 3 False 1 1
3 4 False 0 0
4 5 True 1 1
5 6 False 0 0
6 7 True 1 1
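Putting it together, a self-contained sketch that reconstructs the sample data and applies the approach above:
import pandas as pd

df = pd.DataFrame({
    'rowid': [1, 2, 3, 4, 5, 6, 7],
    'event': [False, True, False, False, True, False, True],
})

# each True starts a new group; count the remaining rows within each group
df['countdown'] = df.groupby(df['event'].cumsum()).cumcount(ascending=False)

# optionally set a trailing True to 1, as in the adjustment above
df.iloc[[-1], df.columns.get_loc('countdown')] = int(df['event'].iloc[-1])
print(df)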
