Consider the following data, with date and id columns:
date id
2000 NaN
2001 NaN
2001 1
2002 1
2000 2
2001 2
2002 2
2001 NaN
2010 NaN
2000 1
2001 1
2002 1
2010 NaN
How can I count the values so that every run of consecutive equal values (treating consecutive NaNs as equal) is counted and its length returned? Thanks.
Count
NaN 2
1 2
2 3
NaN 2
1 3
NaN 1
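For reference, the data can be reconstructed like this (a sketch; the answers below refer to it as df with an id column):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'date': [2000, 2001, 2001, 2002, 2000, 2001, 2002, 2001, 2010, 2000, 2001, 2002, 2010],
    'id': [np.nan, np.nan, 1, 1, 2, 2, 2, np.nan, np.nan, 1, 1, 1, np.nan],
})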
Here is an approach using fillna to make the NaN values comparable to each other:
import numpy as np
import pandas as pd

# Replace NaN with a sentinel string so consecutive NaNs compare equal
s = df.id.fillna('nan')
# True at the start of each new run
mask = s.ne(s.shift())
ids = s[mask].to_numpy()
# Label runs with cumsum, then take the size of each run
counts = s.groupby(mask.cumsum()).size().to_numpy()
# Convert the 'nan' sentinel back to NaN
ids[ids == 'nan'] = np.nan
ser_out = pd.Series(counts, index=ids, name='counts')
[out]
nan 2
1.0 2
2.0 3
nan 2
1.0 3
nan 1
Name: counts, dtype: int64
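The same idea can be written a bit more compactly by aggregating each run once (a sketch, assuming the same df; note that 'first' still shows the 'nan' sentinel for NaN runs):
s = df.id.fillna('nan')                     # sentinel so consecutive NaNs compare equal
runs = s.groupby(s.ne(s.shift()).cumsum())  # one group per consecutive run
out = runs.agg(['first', 'size'])           # run value and run length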
The cumsum trick is useful here, but it's a little tricky with the NaNs, so I think you need to handle those separately. First, flag the rows where both the current and the previous value are NaN:
In [11]: df.id.isnull() & df.id.shift().isnull()
Out[11]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
Name: id, dtype: bool
Then flag the rows that equal the previous value:
In [12]: df.id.eq(df.id.shift())
Out[12]:
0 False
1 False
2 False
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 False
Name: id, dtype: bool
Either condition means the row continues the previous run:
In [13]: (df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift()))
Out[13]:
0 True
1 True
2 False
3 True
4 False
5 True
6 True
7 False
8 True
9 False
10 True
11 True
12 False
Name: id, dtype: bool
Negating this marks the start of each run, and the cumulative sum turns the marks into run labels:
In [14]: (~((df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift())))).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 4
10 4
11 4
12 5
Name: id, dtype: int64
Now you can use this labeling in your groupby:
In [15]: g = df.groupby((~((df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift())))).cumsum())
In [16]: pd.DataFrame({"count": g.id.size(), "id": g.id.nth(0)})
Out[16]:
count id
id
0 2 NaN
1 2 1.0
2 3 2.0
3 2 NaN
4 3 1.0
5 1 NaN
Suppose I import pandas and numpy as follows:
import pandas as pd
import numpy as np
and construct the following dataframe:
df = pd.DataFrame({'Alpha' : ['A','A','A','B','B','B','B','C','C','C','C','C'], 'Beta' : np.nan})
...which gives me this:
Alpha Beta
0 A NaN
1 A NaN
2 A NaN
3 B NaN
4 B NaN
5 B NaN
6 B NaN
7 C NaN
8 C NaN
9 C NaN
10 C NaN
11 C NaN
How do I use pandas to get the following dataframe?
df_u = pd.DataFrame({'Alpha':['A','A','A','B','B','B','B','C','C','C','C','C'],'Beta' : [1,2,3,1,2,2,3,1,2,2,2,3]})
i.e. this:
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Generally speaking what I'm trying to achieve can be described by the following logic:
Suppose we group df by Alpha.
For every group, for every row in the group...
if the index of the row equals the minimum index of rows in the group, then assign 1 to Beta for that row,
else if the index of the row equals the maximum index of the rows in the group, then assign 3 to Beta for that row,
else assign 2 to Beta for that row.
Let's use duplicated:
df.loc[~df.duplicated('Alpha', keep='last'), 'Beta'] = 3
df.loc[~df.duplicated('Alpha', keep='first'), 'Beta'] = 1
df['Beta'] = df['Beta'].fillna(2)
print(df)
Output:
Alpha Beta
0 A 1.0
1 A 2.0
2 A 3.0
3 B 1.0
4 B 2.0
5 B 2.0
6 B 3.0
7 C 1.0
8 C 2.0
9 C 2.0
10 C 2.0
11 C 3.0
method 1
Use np.select:
# True on the first row of each group
mask1 = df['Alpha'].ne(df['Alpha'].shift())
# True on the last row of each group
mask3 = df['Alpha'].ne(df['Alpha'].shift(-1))
# True everywhere else
mask2 = ~(mask1 | mask3)
cond = [mask1, mask2, mask3]
values = [1, 2, 3]
df['Beta'] = np.select(cond, values)
print(df)
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Details of the cond list:
print(mask1)
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 True
8 False
9 False
10 False
11 False
Name: Alpha, dtype: bool
print(mask2)
0 False
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 True
11 False
Name: Alpha, dtype: bool
print(mask3)
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
Name: Alpha, dtype: bool
method 2
Use groupby:
def assign_value(x):
    return pd.Series([1] + [2] * (len(x) - 2) + [3])

new_df = df.groupby('Alpha').apply(assign_value).rename('Beta').reset_index('Alpha')
print(new_df)
Alpha Beta
0 A 1
1 A 2
2 A 3
0 B 1
1 B 2
2 B 2
3 B 3
0 C 1
1 C 2
2 C 2
3 C 2
4 C 3
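Note that assign_value assumes every group has at least two rows; for a singleton group, [1] + [2]*(len(x) - 2) + [3] evaluates to [1, 3], which is one row too many. A guard for that edge case (my addition, not part of the original answer) could look like:
def assign_value(x):
    # A single-row group is both first and last; label it 1 here (a choice, not from the question)
    if len(x) == 1:
        return pd.Series([1])
    return pd.Series([1] + [2] * (len(x) - 2) + [3])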
Assuming that the "Alpha" column is sorted, you can do it like this:
df["Beta"] = 2
df.loc[~(df["Alpha"] == df["Alpha"].shift()), "Beta"] = 1
df.loc[~(df["Alpha"] == df["Alpha"].shift(-1)), "Beta"] = 3
df
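For completeness, here is a sketch that translates the question's index-based logic literally, using groupby/transform on the row index (assuming the df constructed in the question):
import numpy as np
import pandas as pd
idx = pd.Series(df.index, index=df.index)
first = idx.eq(idx.groupby(df['Alpha']).transform('min'))  # row has the group's minimum index
last = idx.eq(idx.groupby(df['Alpha']).transform('max'))   # row has the group's maximum index
df['Beta'] = np.select([first, last], [1, 3], default=2)   # everything else gets 2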
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2,5,6,7,1,8,9,2], 'city':[1,2,3,4,2,5,6,7,1,8,9,2]})
# The following code creates a boolean filter,
filter = df.city==2
# Assigns True to all rows where filter is True
df.loc[filter,'selected']= True
What I need is a change in the code so that it assigns True to a given number n of rows.
The actual data frame has more than 3 million rows. Sometimes I would want
df.loc[filter,'selected'] = True for only 100 rows (the actual number could be more or less than 100).
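One direct way to mark only the first n matching rows is to slice the matching index labels (a sketch; n = 100 is just an example):
n = 100
idx = df.index[df.city == 2][:n]  # index labels of the first n matches
df.loc[idx, 'selected'] = True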
I believe you need to filter by the values defined in a list first with isin, and then take the top 2 rows per group with GroupBy.head:
cities = [2, 3]
# df1 here is the original DataFrame from the question, before any 'selected' column was added
df = df1[df1.city.isin(cities)].groupby('city').head(2)
print (df)
col1 city
1 2 2
2 3 3
4 2 2
If you need to assign True in a new column:
cities= [2,3]
idx = df1[df1.city.isin(cities)].groupby('city').head(2).index
df1.loc[idx, 'selected'] = True
print (df1)
col1 city selected
0 1 1 NaN
1 2 2 True
2 3 3 True
3 4 4 NaN
4 2 2 True
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 1 1 NaN
9 8 8 NaN
10 9 9 NaN
11 2 2 NaN
Define a list of elements to be checked and test the city column against it, creating a new column of True/False booleans:
>>> check
[2, 3]
>>> df['Citis'] = df.city.isin(check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
OR
>>> df['Citis'] = df['city'].apply(lambda x: x in check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
If, in fact, you only need this for the starting rows (let's say the first 5 values to be read):
df['Citis'] = df.city.isin(check).head(5)
OR
df['Citis'] = df['city'].apply(lambda x: x in check).head(5)
(Since the shorter Series aligns on index, rows beyond the first 5 are left as NaN.)
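A variant that keeps the column boolean, avoids the NaN alignment, and marks at most the first n matching rows, which is closer to what the question asks for (a sketch, assuming the same check list):
n = 5
hits = df.city.isin(check)
df['Citis'] = hits & (hits.cumsum() <= n)  # True only for the first n matches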
node1 node2 weight date
3 6 1 2002
2 7 1 1998
2 7 1 2002
2 8 1 1999
2 15 1 2002
9 15 1 1998
2 16 1 2003
2 18 1 2001
I want to delete rows which have the values [3, 7, 18]. These values can be in either of the columns node1 or node2.
In [8]: new = df[~df.filter(regex='^node').isin([3,7,18]).any(axis=1)]
In [9]: new
Out[9]:
node1 node2 weight date
3 2 8 1 1999
4 2 15 1 2002
5 9 15 1 1998
6 2 16 1 2003
Step by step:
In [164]: df.filter(regex='^node').isin([3,7,18])
Out[164]:
node1 node2
0 True False
1 False True
2 False True
3 False False
4 False False
5 False False
6 False False
7 False True
In [165]: df.filter(regex='^node').isin([3,7,18]).any(axis=1)
Out[165]:
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [166]: ~df.filter(regex='^node').isin([3,7,18]).any(axis=1)
Out[166]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
dtype: bool
In [167]: df[~df.filter(regex='^node').isin([3,7,18]).any(axis=1)]
Out[167]:
node1 node2 weight date
3 2 8 1 1999
4 2 15 1 2002
5 9 15 1 1998
6 2 16 1 2003
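For two known columns, the same filter can also be spelled out without filter/regex (a sketch):
vals = [3, 7, 18]
new = df[~(df['node1'].isin(vals) | df['node2'].isin(vals))]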
I'm trying to work out how to show streaks of True or False in a pandas Series.
Data:
p = pd.Series([True,False,True,True,True,True,False,False,True])
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
dtype: bool
I tried p.diff(), but I am not sure how to count the False values this generates to produce my desired output, which is as follows:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
You can use cumcount on the consecutive groups created by comparing p with its shifted self (ne) and taking the cumulative sum:
print (p.ne(p.shift()))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
dtype: bool
print (p.ne(p.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
dtype: int32
print (p.groupby(p.ne(p.shift()).cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Thank you MaxU for another solution:
print (p.groupby(p.diff().cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Another alternative solution is to create the cumulative sum of the p Series and subtract the most recent cumulative sum where p is False. Then invert p and do the same. Finally, multiply the two Series together:
c = p.cumsum()
a = c.sub(c.mask(p).ffill(), fill_value=0).sub(1).abs()
c = (~p).cumsum()
d = c.sub(c.mask(~(p)).ffill(), fill_value=0).sub(1).abs()
print (a)
0 0.0
1 1.0
2 0.0
3 1.0
4 2.0
5 3.0
6 1.0
7 1.0
8 0.0
dtype: float64
print (d)
0 1.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 1.0
8 1.0
dtype: float64
print (a.mul(d).astype(int))
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int32
I have the numbers 1-12 in the SAMPLE column, and for each number I try to count the mutation types (A:T, C:G, etc.). This code works, but how can I modify it so that it covers all 12 SAMPLE conditions and every mutation type, instead of writing the same code 12 times?
In this example, AT gives me the count of A:T mutations where SAMPLE == 1. I am trying to get the number of A:T for each sample number (1, 2, ..., 12). How can I modify this code to do that? I'll appreciate any help. Thank you.
SAMPLE MUT
0 11 chr1:100154376:G:A
1 2 chr1:100177723:C:T
2 9 chr1:100177723:C:T
3 1 chr1:100194200:-:AA
4 8 chr1:10032249:A:G
5 2 chr1:100340787:G:A
6 1 chr1:100349757:A:G
7 3 chr1:10041186:C:A
8 10 chr1:100476986:G:C
9 4 chr1:100572459:C:T
10 5 chr1:100572459:C:T
... ... ...
d= df["SAMPLE", "MUT" ]
chars1 = "TGC-"
number = {}
for item in chars1:
dm= d[(d["MUT"].str.contains("A:" + item)) & (d["SAMPLE"].isin([1]))]
num1 = dm.count()
number[item] = num1
AT=number["T"]
AG=number["G"]
AC=number["C"]
A_=number["-"]
I would use the native string extraction methods in pandas:
df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)')
This returns the matches of the different groups:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN G NaN NaN
5 NaN NaN NaN NaN
6 NaN G NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
Then I would convert this to True or False using pd.isnull and invert it with ~, thereby getting True where there is a match and False where there is not.
~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
0 1 2 3
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False True False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
Then assign this to the dataframe:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
SAMPLE MUT T G C -
0 11 chr1:100154376:G:A False False False False
1 2 chr1:100177723:C:T False False False False
2 9 chr1:100177723:C:T False False False False
3 1 chr1:100194200:-:AA False False False False
4 8 chr1:10032249:A:G False True False False
5 2 chr1:100340787:G:A False False False False
6 1 chr1:100349757:A:G False True False False
7 3 chr1:10041186:C:A False False False False
8 10 chr1:100476986:G:C False False False False
9 4 chr1:100572459:C:T False False False False
10 5 chr1:100572459:C:T False False False False
Now we can simply sum the columns:
df[["T","G","C","-"]].sum()
T 0
G 2
C 0
- 0
But wait, we have not yet restricted this to SAMPLE == 1.
We can do this very easily with a mask:
sample_one_mask = df.SAMPLE == 1
df[sample_one_mask][["T","G","C","-"]].sum()
T 0
G 1
C 0
- 0
If you want this to count per SAMPLE instead, you can use the groupby function:
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
T G C -
SAMPLE
1 0 1 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
TLDR;
Do this:
df[["T","G","C","-"]] = ~pd.isnull(df.MUT.str.extract('A:(T)|A:(G)|A:(C)|A:(-)'))
df[["SAMPLE","T","G","C","-"]].groupby("SAMPLE").agg(sum).astype(int)
You can create a column with the mutation type (A->T, G->C) with a regular expression substitution, then apply pandas groupby to count.
import pandas as pd
import re
df = pd.read_table('df.tsv')
df['mutation_type'] = df['MUT'].apply(lambda x: re.sub(r'^.*?:([^:]+:[^:]+)$', r'\1', x))
df.groupby(['SAMPLE','mutation_type']).agg('count')['MUT']
The output is like this for your data:
SAMPLE mutation_type
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1
Name: MUT, dtype: int64
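The mutation-type column can also be built without a regular expression, by splitting on the colons (a sketch):
# Keep the last two ':'-separated fields, e.g. 'chr1:100154376:G:A' -> 'G:A'
df['mutation_type'] = df['MUT'].str.split(':').str[-2:].str.join(':')
df.groupby(['SAMPLE', 'mutation_type'])['MUT'].count()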
I had a similar answer to A.P.
import pandas as pd
df = pd.DataFrame(data={'SAMPLE': [11,2,9,1,8,2,1,3,10,4,5], 'MUT': ['chr1:100154376:G:A', 'chr1:100177723:C:T', 'chr1:100177723:C:T', 'chr1:100194200:-:AA', 'chr1:10032249:A:G', 'chr1:100340787:G:A', 'chr1:100349757:A:G', 'chr1:10041186:C:A', 'chr1:100476986:G:C', 'chr1:100572459:C:T', 'chr1:100572459:C:T']}, columns=['SAMPLE', 'MUT'])
df['Sequence'] = df['MUT'].str.replace(r'^\w+:\d+:', '', regex=True)
df.groupby(['SAMPLE', 'Sequence']).count()
Produces
MUT
SAMPLE Sequence
1 -:AA 1
A:G 1
2 C:T 1
G:A 1
3 C:A 1
4 C:T 1
5 C:T 1
8 A:G 1
9 C:T 1
10 G:C 1
11 G:A 1
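If a wide table (one row per SAMPLE, one column per mutation type) is more convenient, pd.crosstab gives it directly from the Sequence column built above (a sketch):
pd.crosstab(df['SAMPLE'], df['Sequence'])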