Consider the following data, with date and id columns:
date id
2000 NaN
2001 NaN
2001 1
2002 1
2000 2
2001 2
2002 2
2001 NaN
2010 NaN
2000 1
2001 1
2002 1
2010 NaN
How can I count the values so that every run of consecutive equal values (treating consecutive NaNs as equal) is counted and its length returned? Thanks.
Count
NaN 2
1 2
2 3
NaN 2
1 3
NaN 1
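For reference, the data can be reconstructed like this (a sketch; the answers below refer to it as df with an id column):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'date': [2000, 2001, 2001, 2002, 2000, 2001, 2002, 2001, 2010, 2000, 2001, 2002, 2010],
    'id': [np.nan, np.nan, 1, 1, 2, 2, 2, np.nan, np.nan, 1, 1, 1, np.nan],
})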
Here is an approach using fillna to make the NaN values comparable to each other:
import numpy as np
import pandas as pd

# Replace NaN with a sentinel string so consecutive NaNs compare equal
s = df.id.fillna('nan')
# True at the start of each new run
mask = s.ne(s.shift())
ids = s[mask].to_numpy()
# Label runs with cumsum, then take the size of each run
counts = s.groupby(mask.cumsum()).size().to_numpy()
# Convert the 'nan' sentinel back to NaN
ids[ids == 'nan'] = np.nan
ser_out = pd.Series(counts, index=ids, name='counts')
[out]
nan 2
1.0 2
2.0 3
nan 2
1.0 3
nan 1
Name: counts, dtype: int64
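The same idea can be written a bit more compactly by aggregating each run once (a sketch, assuming the same df; note that 'first' still shows the 'nan' sentinel for NaN runs):
s = df.id.fillna('nan')                     # sentinel so consecutive NaNs compare equal
runs = s.groupby(s.ne(s.shift()).cumsum())  # one group per consecutive run
out = runs.agg(['first', 'size'])           # run value and run length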
The cumsum trick is useful here, but it's a little tricky with the NaNs, so I think you need to handle those separately. First, flag the rows where both the current and the previous value are NaN:
In [11]: df.id.isnull() & df.id.shift().isnull()
Out[11]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
Name: id, dtype: bool
Then flag the rows that equal the previous value:
In [12]: df.id.eq(df.id.shift())
Out[12]:
0 False
1 False
2 False
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 False
Name: id, dtype: bool
Either condition means the row continues the previous run:
In [13]: (df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift()))
Out[13]:
0 True
1 True
2 False
3 True
4 False
5 True
6 True
7 False
8 True
9 False
10 True
11 True
12 False
Name: id, dtype: bool
Negating this marks the start of each run, and the cumulative sum turns the marks into run labels:
In [14]: (~((df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift())))).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 4
10 4
11 4
12 5
Name: id, dtype: int64
Now you can use this labeling in your groupby:
In [15]: g = df.groupby((~((df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift())))).cumsum())
In [16]: pd.DataFrame({"count": g.id.size(), "id": g.id.nth(0)})
Out[16]:
count id
id
0 2 NaN
1 2 1.0
2 3 2.0
3 2 NaN
4 3 1.0
5 1 NaN
Suppose I import pandas and numpy as follows:
import pandas as pd
import numpy as np
and construct the following dataframe:
df = pd.DataFrame({'Alpha' : ['A','A','A','B','B','B','B','C','C','C','C','C'], 'Beta' : np.nan})
...which gives me this:
Alpha Beta
0 A NaN
1 A NaN
2 A NaN
3 B NaN
4 B NaN
5 B NaN
6 B NaN
7 C NaN
8 C NaN
9 C NaN
10 C NaN
11 C NaN
How do I use pandas to get the following dataframe?
df_u = pd.DataFrame({'Alpha':['A','A','A','B','B','B','B','C','C','C','C','C'],'Beta' : [1,2,3,1,2,2,3,1,2,2,2,3]})
i.e. this:
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Generally speaking what I'm trying to achieve can be described by the following logic:
Suppose we group df by Alpha.
For every group, for every row in the group...
if the index of the row equals the minimum index of rows in the group, then assign 1 to Beta for that row,
else if the index of the row equals the maximum index of the rows in the group, then assign 3 to Beta for that row,
else assign 2 to Beta for that row.
Let's use duplicated:
df.loc[~df.duplicated('Alpha', keep='last'), 'Beta'] = 3
df.loc[~df.duplicated('Alpha', keep='first'), 'Beta'] = 1
df['Beta'] = df['Beta'].fillna(2)
print(df)
Output:
Alpha Beta
0 A 1.0
1 A 2.0
2 A 3.0
3 B 1.0
4 B 2.0
5 B 2.0
6 B 3.0
7 C 1.0
8 C 2.0
9 C 2.0
10 C 2.0
11 C 3.0
method 1
Use np.select:
# True on the first row of each group
mask1 = df['Alpha'].ne(df['Alpha'].shift())
# True on the last row of each group
mask3 = df['Alpha'].ne(df['Alpha'].shift(-1))
# True everywhere else
mask2 = ~(mask1 | mask3)
cond = [mask1, mask2, mask3]
values = [1, 2, 3]
df['Beta'] = np.select(cond, values)
print(df)
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Details of the cond list:
print(mask1)
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 True
8 False
9 False
10 False
11 False
Name: Alpha, dtype: bool
print(mask2)
0 False
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 True
11 False
Name: Alpha, dtype: bool
print(mask3)
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
Name: Alpha, dtype: bool
method 2
Use groupby:
def assign_value(x):
    return pd.Series([1] + [2] * (len(x) - 2) + [3])

new_df = df.groupby('Alpha').apply(assign_value).rename('Beta').reset_index('Alpha')
print(new_df)
Alpha Beta
0 A 1
1 A 2
2 A 3
0 B 1
1 B 2
2 B 2
3 B 3
0 C 1
1 C 2
2 C 2
3 C 2
4 C 3
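Note that assign_value assumes every group has at least two rows; for a singleton group, [1] + [2]*(len(x) - 2) + [3] evaluates to [1, 3], which is one row too many. A guard for that edge case (my addition, not part of the original answer) could look like:
def assign_value(x):
    # A single-row group is both first and last; label it 1 here (a choice, not from the question)
    if len(x) == 1:
        return pd.Series([1])
    return pd.Series([1] + [2] * (len(x) - 2) + [3])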
Assuming that the "Alpha" column is sorted, you can do it like this:
df["Beta"] = 2
df.loc[~(df["Alpha"] == df["Alpha"].shift()), "Beta"] = 1
df.loc[~(df["Alpha"] == df["Alpha"].shift(-1)), "Beta"] = 3
df
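For completeness, here is a sketch that translates the question's index-based logic literally, using groupby/transform on the row index (assuming the df constructed in the question):
import numpy as np
import pandas as pd
idx = pd.Series(df.index, index=df.index)
first = idx.eq(idx.groupby(df['Alpha']).transform('min'))  # row has the group's minimum index
last = idx.eq(idx.groupby(df['Alpha']).transform('max'))   # row has the group's maximum index
df['Beta'] = np.select([first, last], [1, 3], default=2)   # everything else gets 2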
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2,5,6,7,1,8,9,2], 'city':[1,2,3,4,2,5,6,7,1,8,9,2]})
# The following code creates a boolean filter,
filter = df.city==2
# Assigns True to all rows where filter is True
df.loc[filter,'selected']= True
What I need is a change in the code so that it assigns True to a given number n of rows.
The actual data frame has more than 3 million rows. Sometimes I would want
df.loc[filter,'selected'] = True for only 100 rows (the actual number could be more or less than 100).
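One direct way to mark only the first n matching rows is to slice the matching index labels (a sketch; n = 100 is just an example):
n = 100
idx = df.index[df.city == 2][:n]  # index labels of the first n matches
df.loc[idx, 'selected'] = True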
I believe you need to filter by the values defined in a list first with isin, and then take the top 2 rows per group with GroupBy.head:
cities = [2, 3]
# df1 here is the original DataFrame from the question, before any 'selected' column was added
df = df1[df1.city.isin(cities)].groupby('city').head(2)
print (df)
col1 city
1 2 2
2 3 3
4 2 2
If you need to assign True in a new column:
cities= [2,3]
idx = df1[df1.city.isin(cities)].groupby('city').head(2).index
df1.loc[idx, 'selected'] = True
print (df1)
col1 city selected
0 1 1 NaN
1 2 2 True
2 3 3 True
3 4 4 NaN
4 2 2 True
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 1 1 NaN
9 8 8 NaN
10 9 9 NaN
11 2 2 NaN
Define a list of elements to be checked and test the city column against it, creating a new column of True/False booleans:
>>> check
[2, 3]
>>> df['Citis'] = df.city.isin(check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
OR
>>> df['Citis'] = df['city'].apply(lambda x: x in check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
If, in fact, you only need this for the starting rows (let's say the first 5 values to be read):
df['Citis'] = df.city.isin(check).head(5)
OR
df['Citis'] = df['city'].apply(lambda x: x in check).head(5)
(Since the shorter Series aligns on index, rows beyond the first 5 are left as NaN.)
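A variant that keeps the column boolean, avoids the NaN alignment, and marks at most the first n matching rows, which is closer to what the question asks for (a sketch, assuming the same check list):
n = 5
hits = df.city.isin(check)
df['Citis'] = hits & (hits.cumsum() <= n)  # True only for the first n matches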
node1 node2 weight date
3 6 1 2002
2 7 1 1998
2 7 1 2002
2 8 1 1999
2 15 1 2002
9 15 1 1998
2 16 1 2003
2 18 1 2001
I want to delete rows which have the values [3, 7, 18]. These values can be in either of the columns node1 or node2.
In [8]: new = df[~df.filter(regex='^node').isin([3,7,18]).any(axis=1)]
In [9]: new
Out[9]:
node1 node2 weight date
3 2 8 1 1999
4 2 15 1 2002
5 9 15 1 1998
6 2 16 1 2003
Step by step:
In [164]: df.filter(regex='^node').isin([3,7,18])
Out[164]:
node1 node2
0 True False
1 False True
2 False True
3 False False
4 False False
5 False False
6 False False
7 False True
In [165]: df.filter(regex='^node').isin([3,7,18]).any(axis=1)
Out[165]:
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [166]: ~df.filter(regex='^node').isin([3,7,18]).any(axis=1)
Out[166]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
dtype: bool
In [167]: df[~df.filter(regex='^node').isin([3,7,18]).any(axis=1)]
Out[167]:
node1 node2 weight date
3 2 8 1 1999
4 2 15 1 2002
5 9 15 1 1998
6 2 16 1 2003
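For two known columns, the same filter can also be spelled out without filter/regex (a sketch):
vals = [3, 7, 18]
new = df[~(df['node1'].isin(vals) | df['node2'].isin(vals))]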
I'm trying to work out how to show streaks of True or False in a pandas Series.
Data:
p = pd.Series([True,False,True,True,True,True,False,False,True])
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
dtype: bool
I tried p.diff(), but I am not sure how to count the False values this generates to produce my desired output, which is as follows:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
You can use cumcount on the consecutive groups created by comparing p with its shifted self (ne) and taking the cumulative sum:
print (p.ne(p.shift()))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
dtype: bool
print (p.ne(p.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
dtype: int32
print (p.groupby(p.ne(p.shift()).cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Thank you MaxU for another solution:
print (p.groupby(p.diff().cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Another alternative solution is to create the cumulative sum of the p Series and subtract the most recent cumulative sum where p is False. Then invert p and do the same. Finally, multiply the two Series together:
c = p.cumsum()
a = c.sub(c.mask(p).ffill(), fill_value=0).sub(1).abs()
c = (~p).cumsum()
d = c.sub(c.mask(~(p)).ffill(), fill_value=0).sub(1).abs()
print (a)
0 0.0
1 1.0
2 0.0
3 1.0
4 2.0
5 3.0
6 1.0
7 1.0
8 0.0
dtype: float64
print (d)
0 1.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 1.0
8 1.0
dtype: float64
print (a.mul(d).astype(int))
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int32
I have the numbers 1-12 in the SAMPLE column, and for each number I try to count the mutation types (A:T, C:G, etc.). This code works, but how can I modify it so that it covers all 12 SAMPLE conditions and every mutation type, instead of writing the same code 12 times?
In this example, AT gives me the count of A:T mutations where SAMPLE == 1. I am trying to get the number of A:T for each sample number (1, 2, ..., 12). How can I modify this code to do that? I'll appreciate any help. Thank you.
SAMPLE MUT
0 11 chr1:100154376:G:A
1 2 chr1:100177723:C:T
2 9 chr1:100177723:C:T
3 1 chr1:100194200:-:AA
4 8 chr1:10032249:A:G
5 2 chr1:100340787:G:A
6 1 chr1:100349757:A:G
7 3 chr1:10041186:C:A
8 10 chr1:100476986:G:C
9 4 chr1:100572459:C:T
10 5 chr1:100572459:C:T
... ... ...
d= df["SAMPLE", "MUT" ]
chars1 = "TGC-"
number = {}
for item in chars1:
dm= d[(d["MUT"].str.contains("A:" + item)) & (d["SAMPLE"].isin([1]))]
num1 = dm.count()
number[item] = num1
AT=number["T"]
AG=number["G"]
AC=number["C"]
A_=number["-"]
I would use the native string extraction methods in pandas:
df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)')
This returns the matches of the different groups:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN G NaN NaN
5 NaN NaN NaN NaN
6 NaN G NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
Then I would convert this to True or False using pd.isnull and invert it with ~, thereby getting True where there is a match and False where there is not.
~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
0 1 2 3
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False True False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
Then assign this to the dataframe:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
SAMPLE MUT T G C -
0 11 chr1:100154376:G:A False False False False
1 2 chr1:100177723:C:T False False False False
2 9 chr1:100177723:C:T False False False False
3 1 chr1:100194200:-:AA False False False False
4 8 chr1:10032249:A:G False True False False
5 2 chr1:100340787:G:A False False False False
6 1 chr1:100349757:A:G False True False False
7 3 chr1:10041186:C:A False False False False
8 10 chr1:100476986:G:C False False False False
9 4 chr1:100572459:C:T False False False False
10 5 chr1:100572459:C:T False False False False
Now we can simply sum the columns:
df[["T","G","C","-"]].sum()
T 0
G 2
C 0
- 0
But wait, we have not yet restricted this to SAMPLE == 1.
We can do this very easily with a mask:
sample_one_mask = df.SAMPLE == 1
df[sample_one_mask][["T","G","C","-"]].sum()
T 0
G 1
C 0
- 0
If you want this to count per SAMPLE instead, you can use the groupby function:
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
T G C -
SAMPLE
1 0 1 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
TLDR;
Do this:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
You can create a column with the mutation type (A->T, G->C) with a regular expression substitution, then apply pandas groupby to count.
import pandas as pd
import re
df = pd.read_table('df.tsv')
df['mutation_type'] = df['MUT'].apply(lambda x: re.sub(r'^.*?:([^:]+:[^:]+)$', r'\1', x))
df.groupby(['SAMPLE','mutation_type']).agg('count')['MUT']
The output is like this for your data:
SAMPLE mutation_type
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1
Name: MUT, dtype: int64
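The mutation-type column can also be built without a regular expression, by splitting on the colons (a sketch):
# Keep the last two ':'-separated fields, e.g. 'chr1:100154376:G:A' -> 'G:A'
df['mutation_type'] = df['MUT'].str.split(':').str[-2:].str.join(':')
df.groupby(['SAMPLE', 'mutation_type'])['MUT'].count()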
I had a similar answer to A.P.
import pandas as pd
df = pd.DataFrame(data={'SAMPLE': [11,2,9,1,8,2,1,3,10,4,5], 'MUT': ['chr1:100154376:G:A', 'chr1:100177723:C:T', 'chr1:100177723:C:T', 'chr1:100194200:-:AA', 'chr1:10032249:A:G', 'chr1:100340787:G:A', 'chr1:100349757:A:G', 'chr1:10041186:C:A', 'chr1:100476986:G:C', 'chr1:100572459:C:T', 'chr1:100572459:C:T']}, columns=['SAMPLE', 'MUT'])
df['Sequence'] = df['MUT'].str.replace(r'^\w+:\d+:', '', regex=True)
df.groupby(['SAMPLE', 'Sequence']).count()
Produces
MUT
SAMPLE Sequence
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1
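If a wide table (one row per SAMPLE, one column per mutation type) is more convenient, pd.crosstab gives it directly from the Sequence column built above (a sketch):
pd.crosstab(df['SAMPLE'], df['Sequence'])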