Match id in multiple groups in Python

I have a dataframe that looks like below:

Group  ID
1      AAA
1      BBB
1      CCC
2      AAA
2      DDD
2      CCC
3      AAA
3      GGG
3      TTT
Here I want to find the number of IDs that are present in "group 1 only", "groups 1 and 2", and "groups 1, 2 and 3".
I want the final table to look like below:

Group  Count
1      3
2      2
3      1
This is just an example table, but I have 10 groups and millions of rows of data like this, so I need an efficient way to calculate the same.

Try crosstab, then cumsum:
pd.crosstab(df.Group, df.ID).cumsum().eq([1, 2, 3], axis=0).sum(1).reset_index(name='count')
Out[70]:
   Group  count
0      1      3
1      2      2
2      3      1
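
To see why this works on the sample above, here is a minimal annotated sketch. It assumes each ID appears at most once per group, and it compares against the index instead of the hard-coded [1, 2, 3] so the same line extends to the 10-group case:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'ID': ['AAA', 'BBB', 'CCC', 'AAA', 'DDD', 'CCC',
                          'AAA', 'GGG', 'TTT']})

# Presence matrix: one row per group, one 0/1 column per ID.
presence = pd.crosstab(df.Group, df.ID)

# Running total down the rows: the entry at group k is how many of
# groups 1..k contain that ID.
running = presence.cumsum()

# An ID is in every group 1..k exactly when its running total at row k
# equals k, so compare each row to its group number and count the hits.
counts = running.eq(running.index, axis=0).sum(1)
print(counts.reset_index(name='count'))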

Related

Python Pandas how to find matching values by label

I have a csv file that looks something like this:
mark  time      value1  value2
1     14:22:02  5       2
1     14:22:05  8       4
2     14:25:02  1       1
2     14:26:05  4       7
3     15:12:08  5       2
3     15:12:11  5       4
3     15:12:15  5       2
3     15:12:17  8       4
I would like to output all the matches between marks 1 and 3.
Expected result:
"Number of matches" is the number of co-occurrences of the same values between marks 1 and 3.
That is, if a 5 appears in value1 for mark 1, count all the mark 3 rows that also have 5 in value1.
By both value columns:
mark  value1  value2  Number of matches
1-3   5       2       2
1-3   8       4       1
For value1 only:

mark  value1  Number of matches
1-3   5       3
1-3   8       1
For value2 only:

mark  value2  Number of matches
1-3   2       2
1-3   4       2
You can use a groupby on the filtered DataFrame, then filter again to have a count > 1:
target = ['value1', 'value2']
(df.loc[df['mark'].isin([1, 3])]
   .astype({'mark': 'str'})
   .groupby(target, as_index=False)
   .agg(**{'mark': ('mark', lambda g: '-'.join(dict.fromkeys(g))),
           'Num matches': ('mark', 'count')})
   .loc[lambda d: d['Num matches'].gt(1)]
)
Output:
   value1  value2 mark  Num matches
0       5       2  1-3            3
2       8       4  1-3            2
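
One detail worth noting: '-'.join(dict.fromkeys(g)) deduplicates the mark values while preserving their first-seen order (dict keys keep insertion order in Python 3.7+), which is what produces the 1-3 label:
>>> '-'.join(dict.fromkeys(['1', '1', '3', '3']))
'1-3'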

Sum DataFrame rows where a column contains a substring

I have this DataFrame:
df1:
Date Value Info
1 1 XXX.othertext2
1 4 somerandomtext
1 2 XXX.othertext2
1 3 XXX.othertext3
1 2 XXX.othertext3
1 1 XXX.othertext2
1 1 XXX.othertext3
2 6 somerandomtext
2 9 XXX.othertext2
I want to sum rows with the same Date, starting at an XXX.othertext2 row and running until the next XXX.othertext2 or somerandomtext row (so each sum is the first XXX.othertext2 plus all following XXX.othertext3 rows). The Info value of the resulting row will be XXX.othertext2:
newdf:
Date Value Info
1 1 XXX.othertext2
1 4 somerandomtext
1 7 XXX.othertext2
1 2 XXX.othertext2
2 6 somerandomtext
2 9 XXX.othertext2
Here's one option, with a custom grouper:
grouper = (df1.Info.str.contains('some') | df1.Info.eq('XXX.othertext2')).cumsum().rename('block')
out = df1.groupby(['Date', grouper]).agg({'Value': 'sum', 'Info': 'first'})
out.reset_index().drop(columns='block')
You can refine it more with a regex if necessary.
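
For instance, one possible regex-based grouper (an illustrative sketch, not from the original answer) starts a new block at every row that is not an XXX.othertext3 continuation, instead of relying on the literal substring 'some':

# A row starts a new block unless it is an XXX.othertext3 continuation row.
grouper = (~df1.Info.str.match(r'XXX\.othertext3$')).cumsum().rename('block')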

Add a column in pandas dataframe using conditions on 3 existing columns

I have an existing pandas DataFrame that I want to manipulate according to the following pattern:
The existing table has different sets of codes in column 'code'. Each 'code' has certain labels listed in column 'label'. Each label has been tagged with either 0 or 1 in the 'tag' column.
I need to add a 'new_column' with value 0 or 1 for each set of 'code', depending on the following condition:
Fill 1 in the 'new_column' only when all the 'label' rows of a particular 'code' have value 1 in the 'tag' column. Note that 1 must be filled in for all the rows belonging to that particular 'code'.
As shown in the desired table, only code=30 has every 'tag' value equal to 1, so I set 'new_column' to 1 for that code. The rest of the codes are set to 0.
Existing Table:
code label tag
0 10 AAA 0
1 10 BBB 1
2 10 CCC 0
3 10 DDD 0
4 10 EEE 0
5 20 AAA 1
6 20 CCC 0
7 20 DDD 1
8 30 BBB 1
9 30 CCC 1
10 30 EEE 1
Desired Table
code label tag new_column
0 10 AAA 0 0
1 10 BBB 1 0
2 10 CCC 0 0
3 10 DDD 0 0
4 10 EEE 0 0
5 20 AAA 1 0
6 20 CCC 0 0
7 20 DDD 1 0
8 30 BBB 1 1
9 30 CCC 1 1
10 30 EEE 1 1
I have not tried any solution yet as it seems beyond my present level of expertise.
I think the right answer for this question is the one given by @user3483203 in the comments:
df['new_column'] = df.groupby('code')['tag'].transform(all).astype(int)
The transform method applies whatever function is passed to it group-wise, returning a result with the same length as the original axis.
The simple example in the documentation clearly explains the usage.
Coming to this particular question, the following happens when you run this snippet:
You first perform the grouping with respect to the 'code'. You end up with a DataFrameGroupBy object.
Next, from this you choose the tag column, ending up with a SeriesGroupBy object.
To this grouping, you apply the all function via transform, ultimately typecasting the boolean values to type int.
Basically, you can understand it like this (the values are 0/1 to mirror the tag column):
>>> int(all([1, 1, 1, 1]))
1
>>> int(all([1, 0, 1, 1]))
0
Finally, you assign the Series you just created to the new_column column of the original dataframe.
The initial answer by user3483203 works; here is a variation, though his way is more concise.
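
The variation itself did not survive here; a sketch of one equivalent (my own, exploiting the fact that tag is 0/1, so the group minimum is 1 exactly when every tag is 1) could be:

# min over 0/1 tags is 1 only if all tags in the code group are 1.
df['new_column'] = df.groupby('code')['tag'].transform('min')

On the table above this gives 0 for codes 10 and 20 and 1 for code 30, the same as the transform(all) version.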

Is there a better way to do pandas groupby with a map function on multiple conditions?

I'm translating the Excel formula COUNTIFS(pos!$D:$D,$A3,pos!$N:$N,$E3) into pandas. I have two dataframes, df1 and df2. I need to count values in a column of the first dataframe df1 and populate dataframe df2 where the counted value in df1 matches a value in df2. How do I check for the second condition in my solution below?
df1:
id member seq
0 48299 Koif 1
1 48299 Iki 1
2 48299 Juju 2
3 48299 PNik 3
4 48865 Lok 1
5 48865 Mkoj 2
6 48865 Kino 1
7 64865 Boni 1
8 64865 Afriya 2
9 50774 Amah 2
df2:
group_id group_name seq count
0 48299 e_sys 1
1 50774 Y3N 2
2 64865 nana 1
3 48865 juzti 1
Using the answer of a related question:
df2['count'] = df2['group_id'].map(df1.groupby('id')['id'].count())
The count for the first groupby condition works. To add the second condition, I've tried a few solutions below:
soln1:
df2['count'] = df2['seq'].map(df1.groupby(['seq'])['id'].count())
soln2:
df2['count'] = df2['seq'].map(df1[df1['seq']==df2['seq']].groupby(['seq'])['id'].count())
But I don't seem to get the correct counts for df2.
Expected results:
group_id group_name seq count
0 48299 e_sys 1 2
1 50774 Y3N 2 1
2 64865 nana 1 1
3 48865 juzti 1 2
I suppose you can merge, groupby and then map:
merge = pd.merge(df2, df1, left_on=['group_id', 'seq'], right_on=['id', 'seq']).groupby('id')['id'].count()
df2['count'] = df2['group_id'].map(merge)
group_id group_name seq count
0 48299 e_sys 1 2
1 50774 Y3N 2 1
2 64865 nana 1 1
3 48865 juzti 1 2
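
An equivalent that skips the intermediate merge (a sketch along the same lines, assuming a reasonably recent pandas with MultiIndex.from_frame) counts (id, seq) pairs once and looks them up directly:

# Count rows in df1 per (id, seq) pair, then look up each (group_id, seq).
pair_counts = df1.groupby(['id', 'seq']).size()
df2['count'] = pd.MultiIndex.from_frame(df2[['group_id', 'seq']]).map(pair_counts)

On the sample data this yields the same counts of 2, 1, 1, 2.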

python pandas, trying to find unique combinations of two columns and merging while summing a third column

Hi, I will show what I'm trying to do through examples:
I start with a dataframe like this:
>>> pd.DataFrame({'A': ['a', 'a', 'a', 'c'], 'B': [1, 1, 2, 3], 'count': [5, 6, 1, 7]})
A B count
0 a 1 5
1 a 1 6
2 a 2 1
3 c 3 7
I need to find all the unique combinations of columns A and B and merge the duplicate rows, adding their count values together. The result should look like the following:
A B count
0 a 1 11
1 a 2 1
2 c 3 7
Thanks for any help.
Use groupby with a sum aggregation:
print(df.groupby(['A', 'B'], as_index=False)['count'].sum())
A B count
0 a 1 11
1 a 2 1
2 c 3 7
Or, equivalently:
print(df.groupby(['A', 'B'])['count'].sum().reset_index())
A B count
0 a 1 11
1 a 2 1
2 c 3 7
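
Both calls give the same result. If the original row order should be preserved instead of the sorted key order, groupby's sort=False option (not part of the original answer) does that:

print(df.groupby(['A', 'B'], as_index=False, sort=False)['count'].sum())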
