Delete rows based on list in pandas - python

node1 node2 weight date
3 6 1 2002
2 7 1 1998
2 7 1 2002
2 8 1 1999
2 15 1 2002
9 15 1 1998
2 16 1 2003
2 18 1 2001
I want to delete rows which contain any of the values [3, 7, 18]. These values can be in either of the columns node1 or node2.

In [8]: new = df[~df.filter(regex='^node').isin([3,7,18]).any(axis=1)]
In [9]: new
Out[9]:
node1 node2 weight date
3 2 8 1 1999
4 2 15 1 2002
5 9 15 1 1998
6 2 16 1 2003
step by step:
In [164]: df.filter(regex='^node').isin([3,7,18])
Out[164]:
node1 node2
0 True False
1 False True
2 False True
3 False False
4 False False
5 False False
6 False False
7 False True
In [165]: df.filter(regex='^node').isin([3,7,18]).any(axis=1)
Out[165]:
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [166]: ~df.filter(regex='^node').isin([3,7,18]).any(axis=1)
Out[166]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
dtype: bool
In [167]: df[~df.filter(regex='^node').isin([3,7,18]).any(axis=1)]
Out[167]:
node1 node2 weight date
3 2 8 1 1999
4 2 15 1 2002
5 9 15 1 1998
6 2 16 1 2003
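To make the approach easy to try end to end, here is a self-contained sketch that rebuilds the question's sample frame and applies the same filter/isin/any chain:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "node1": [3, 2, 2, 2, 2, 9, 2, 2],
    "node2": [6, 7, 7, 8, 15, 15, 16, 18],
    "weight": [1] * 8,
    "date": [2002, 1998, 2002, 1999, 2002, 1998, 2003, 2001],
})

# Keep only rows where neither node column contains a banned value.
banned = [3, 7, 18]
new = df[~df.filter(regex="^node").isin(banned).any(axis=1)]
print(new)
```

`filter(regex='^node')` selects just the node columns, so new columns like node3 would be picked up automatically without changing the code.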

Related

Cumulative count resetting to and staying 0 based on a condition in pandas

I am trying to create a column 'Count' on a pandas DataFrame that cumulatively counts while 'Boolean' is True, but resets to 0 and stays at 0 while 'Boolean' is False. It also needs to be grouped by the ID column, so the count resets when a new ID starts. No loops please, as I am working with a big data set.
I used the code from the following question, which works, but I need to add a group-by to include the ID column grouping:
Pandas Dataframe - Row Iteration with Resetting Count-Value by Condition without loop
Expected output below: (ID, Boolean columns already exist, just need to create Count)
ID Boolean Count
1 True 1
1 True 2
1 True 3
1 True 4
1 True 5
1 False 0
1 False 0
1 False 0
1 False 0
1 True 1
1 True 2
1 True 3
2 True 1
2 True 2
2 True 3
2 True 4
2 False 0
2 False 0
2 False 0
2 True 1
2 True 2
2 True 3
Identify blocks by taking the cumsum of the inverted boolean mask, then group the dataframe by ID and block and use cumsum on Boolean to create the counter:
b = (~df['Boolean']).cumsum()
df['Count'] = df.groupby(['ID', b])['Boolean'].cumsum()
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
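As a self-contained sketch of the two-line answer above, rebuilding the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1] * 12 + [2] * 10,
    "Boolean": [True] * 5 + [False] * 4 + [True] * 3
             + [True] * 4 + [False] * 3 + [True] * 3,
})

# Every False row increments the block id, so each run of True values
# lands in its own (ID, block) group; cumsum over the boolean column
# then numbers the Trues 1..n and leaves False rows at 0.
b = (~df["Boolean"]).cumsum()
df["Count"] = df.groupby(["ID", b])["Boolean"].cumsum()
print(df)
```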
df['Count'] = df.groupby('ID')['Boolean'].diff()
df = df.fillna(False)
df['Count'] = df.groupby('ID')['Count'].cumsum()
df['Count'] = df.groupby(['ID', 'Count'])['Boolean'].cumsum()
df
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
You can compare the ID and Boolean columns with their shifted versions to identify the groups to do the groupby on, then take a cumsum of Boolean within each of those groups.
groups = ((df['ID']!=df['ID'].shift()) | (df['Boolean']!=df['Boolean'].shift())).cumsum()
df.assign(Count2=df.groupby(groups)['Boolean'].cumsum())
Result
ID Boolean Count Count2
0 1 True 1 1
1 1 True 2 2
2 1 True 3 3
3 1 True 4 4
4 1 True 5 5
5 1 False 0 0
6 1 False 0 0
7 1 False 0 0
8 1 False 0 0
9 1 True 1 1
10 1 True 2 2
11 1 True 3 3
12 2 True 1 1
13 2 True 2 2
14 2 True 3 3
15 2 True 4 4
16 2 False 0 0
17 2 False 0 0
18 2 False 0 0
19 2 True 1 1
20 2 True 2 2
21 2 True 3 3
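A minimal runnable sketch of this shift-based grouping, on a smaller frame than the question's so the group boundaries are easy to follow:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2],
    "Boolean": [True, True, False, True, True, False, True],
})

# A new group starts whenever ID or Boolean differs from the previous row.
groups = ((df["ID"] != df["ID"].shift())
          | (df["Boolean"] != df["Boolean"].shift())).cumsum()
df["Count"] = df.groupby(groups)["Boolean"].cumsum()
print(df)
```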

How to count consecutive repetitions in a pandas series

Consider the following series, ser
date id
2000 NaN
2001 NaN
2001 1
2002 1
2000 2
2001 2
2002 2
2001 NaN
2010 NaN
2000 1
2001 1
2002 1
2010 NaN
How can I count the values such that every run of consecutive equal values (including NaN runs) is counted and returned?
Count
NaN 2
1 2
2 3
NaN 2
1 3
NaN 1
Here is another approach using fillna to handle NaN values:
s = df.id.fillna('nan')
mask = s.ne(s.shift())
ids = s[mask].to_numpy()
counts = s.groupby(mask.cumsum()).size().to_numpy()
# Convert 'nan' string back to `NaN`
ids[ids == 'nan'] = np.nan
ser_out = pd.Series(counts, index=ids, name='counts')
[out]
nan 2
1.0 2
2.0 3
nan 2
1.0 3
nan 1
Name: counts, dtype: int64
The cumsum trick is useful here; it's a little tricky with the NaNs, though, so you need to handle those separately. One pitfall: marking rows that match the next row (shift(-1)) and taking a cumsum increments the label in the middle of a run, so runs of length 3 or more get split. Comparing each row with the previous row (shift()), negating, and taking the cumsum labels the runs correctly:
In [11]: same = (df.id.isnull() & df.id.shift().isnull()) | df.id.eq(df.id.shift())
In [12]: same
Out[12]:
0      True
1      True
2     False
3      True
4     False
5      True
6      True
7     False
8      True
9     False
10     True
11     True
12    False
Name: id, dtype: bool
In [13]: (~same).cumsum()
Out[13]:
0     0
1     0
2     1
3     1
4     2
5     2
6     2
7     3
8     3
9     4
10    4
11    4
12    5
Name: id, dtype: int64
Now you can use this labeling in your groupby (first() skips NaN within a group, so all-NaN groups come back as NaN):
In [14]: g = df.groupby((~same).cumsum())
In [15]: pd.DataFrame({"count": g.id.size(), "id": g.id.first()})
Out[15]:
    count   id
id
0       2  NaN
1       2  1.0
2       3  2.0
3       2  NaN
4       3  1.0
5       1  NaN
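The same run-length counting can be condensed into a short self-contained sketch; a row starts a new run unless it equals the previous row or both are NaN:

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 1, 1, 2, 2, 2,
                 np.nan, np.nan, 1, 1, 1, np.nan])

# Two adjacent rows belong to the same run when they are equal or both NaN
# (NaN != NaN, so equality alone would split every NaN run).
same = ser.eq(ser.shift()) | (ser.isna() & ser.shift().isna())
labels = (~same).cumsum()

counts = ser.groupby(labels).size()   # run lengths
values = ser.groupby(labels).first()  # run values (NaN for all-NaN runs)
print(pd.DataFrame({"id": values, "count": counts}))
```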

Extract specific rows based on the set cut-off values in columns

I have a TAB-delimited .txt file that looks like this.
Gene_name A B C D E F
Gene1 1 0 5 2 0 0
Gene2 4 45 0 0 32 1
Gene3 0 23 0 4 0 54
Gene4 12 0 6 8 7 4
Gene5 4 0 0 6 0 7
Gene6 0 6 8 0 0 5
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0
I want to extract rows in which at least 3 columns have values > 0.
How do I do this using pandas? I am clueless about how to use conditions in .txt files.
thanks very much!
Update: adding on to this question, how do I apply this condition to specific columns only? Let's say I look at columns A, C, E and F, and then extract rows where at least 3 of these columns have values > 5.
cheers!
df = pd.read_csv(filename, delim_whitespace=True)
In [22]: df[df.select_dtypes(['number']).gt(0).sum(axis=1).ge(3)]
Out[22]:
Gene_name A B C D E F
0 Gene1 1 0 5 2 0 0
1 Gene2 4 45 0 0 32 1
2 Gene3 0 23 0 4 0 54
3 Gene4 12 0 6 8 7 4
4 Gene5 4 0 0 6 0 7
5 Gene6 0 6 8 0 0 5
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
8 Gene9 6 0 12 34 0 11
9 Gene10 23 4 6 7 89 0
some explanation:
In [25]: df.select_dtypes(['number']).gt(0)
Out[25]:
A B C D E F
0 True False True True False False
1 True True False False True True
2 False True False True False True
3 True False True True True True
4 True False False True False True
5 False True True False False True
6 True True True True False True
7 True True False True True True
8 True False True True False True
9 True True True True True False
In [26]: df.select_dtypes(['number']).gt(0).sum(axis=1)
Out[26]:
0 3
1 4
2 3
3 5
4 3
5 3
6 5
7 5
8 4
9 5
dtype: int64
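To try the select_dtypes chain without the file, here is a self-contained sketch with a few rows in the spirit of the question's data (in practice you would start from `pd.read_csv(filename, sep="\t")`):

```python
import pandas as pd

df = pd.DataFrame({
    "Gene_name": ["Gene1", "Gene2", "Gene3"],
    "A": [1, 4, 0], "B": [0, 45, 23], "C": [5, 0, 0],
    "D": [2, 0, 0], "E": [0, 32, 0], "F": [0, 1, 54],
})

# Count positive numeric entries per row; keep rows with at least 3.
kept = df[df.select_dtypes(["number"]).gt(0).sum(axis=1).ge(3)]

# The question's update: restrict to columns A, C, E, F, threshold > 5.
kept_subset = df[df[["A", "C", "E", "F"]].gt(5).sum(axis=1).ge(3)]
print(kept)
```

Here Gene3 has only two positive values (B and F), so it is dropped by the first filter.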
Using operators (as a complement to Max's answer):
mask = (df.iloc[:, 1:] > 0).sum(axis=1) >= 3
mask
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
df[mask]
Gene_name A B C D E F
0 Gene1 1 0 5 2 0 0
1 Gene2 4 45 0 0 32 1
2 Gene3 0 23 0 4 0 54
3 Gene4 12 0 6 8 7 4
4 Gene5 4 0 0 6 0 7
5 Gene6 0 6 8 0 0 5
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
8 Gene9 6 0 12 34 0 11
9 Gene10 23 4 6 7 89 0
Similarly, querying all rows with 5 or more positive values:
df[(df.iloc[:, 1:] > 0).sum(axis=1) >= 5]
Gene_name A B C D E F
3 Gene4 12 0 6 8 7 4
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
9 Gene10 23 4 6 7 89 0
Piggybacking off of MaxU's solution, I like to go ahead and put 'Gene_name' into the index so I don't have to worry about any index slicing:
df = pd.read_csv(tfile, delim_whitespace=True, index_col=0)
df[df.gt(0).sum(axis=1).ge(3)]
Edit for the question update:
df[df[['A','C','E','F']].gt(5).sum(axis=1).ge(3)]
Output:
A B C D E F
Gene_name
Gene4 12 0 6 8 7 4
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0

How to "iron out" a column of numbers with duplicates in it

If one has the following column:
df = pd.DataFrame({"numbers":[1,2,3,4,4,5,1,2,2,3,4,5,6,7,7,8,1,1,2,2,3,4,5,6,6,7]})
How can one "iron" it out so that the duplicates become part of the series of numbers:
numbers new_numbers
1 1
2 2
3 3
4 4
4 5
5 6

1 1
2 2
2 3
3 4
4 5
5 6
6 7
7 8
7 9
8 10

1 1
1 2
2 3
2 4
3 5
4 6
5 7
6 8
6 9
7 10
(I put blank lines into the output for clarification)
Use diff and compare with lt (<) to find the start of each group (the positions where the value drops), build group labels with cumsum, and then number the rows within each group with cumcount:
#for better testing helper df1
df1 = pd.DataFrame(index=df.index)
df1['dif'] = df.numbers.diff()
df1['compare'] = df.numbers.diff().lt(0)
df1['groups'] = df.numbers.diff().lt(0).cumsum()
print (df1)
dif compare groups
0 NaN False 0
1 1.0 False 0
2 1.0 False 0
3 1.0 False 0
4 0.0 False 0
5 1.0 False 0
6 -4.0 True 1
7 1.0 False 1
8 0.0 False 1
9 1.0 False 1
10 1.0 False 1
11 1.0 False 1
12 1.0 False 1
13 1.0 False 1
14 0.0 False 1
15 1.0 False 1
16 -7.0 True 2
17 0.0 False 2
18 1.0 False 2
19 0.0 False 2
20 1.0 False 2
21 1.0 False 2
22 1.0 False 2
23 1.0 False 2
24 0.0 False 2
25 1.0 False 2
df['new_numbers'] = df.groupby(df.numbers.diff().lt(0).cumsum()).cumcount() + 1
print (df)
numbers new_numbers
0 1 1
1 2 2
2 3 3
3 4 4
4 4 5
5 5 6
6 1 1
7 2 2
8 2 3
9 3 4
10 4 5
11 5 6
12 6 7
13 7 8
14 7 9
15 8 10
16 1 1
17 1 2
18 2 3
19 2 4
20 3 5
21 4 6
22 5 7
23 6 8
24 6 9
25 7 10
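The whole answer collapses to two lines; here is a compact self-contained sketch on the first half of the question's data:

```python
import pandas as pd

df = pd.DataFrame({"numbers": [1, 2, 3, 4, 4, 5, 1, 2, 2, 3, 4, 5]})

# A drop in value (negative diff) marks the start of a new ascending run;
# cumcount renumbers each run 0..n-1, so add 1 to start from 1.
groups = df["numbers"].diff().lt(0).cumsum()
df["new_numbers"] = df.groupby(groups).cumcount() + 1
print(df)
```

Note that equal neighbours (diff of 0) stay in the same run, which is exactly what "ironing out" the duplicates requires.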

Streaks of True or False in pandas Series

I'm trying to work out how to show streaks of True or False in a pandas Series.
Data:
p = pd.Series([True,False,True,True,True,True,False,False,True])
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
dtype: bool
I tried p.diff() but I am not sure how to count the False values this generates to produce my desired output, which is as follows:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
You can use cumcount over consecutive groups, created by comparing p with its shifted self (ne) and taking the cumsum:
print (p.ne(p.shift()))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
dtype: bool
print (p.ne(p.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
dtype: int32
print (p.groupby(p.ne(p.shift()).cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Thank you MaxU for another solution:
print (p.groupby(p.diff().cumsum()).cumcount())
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int64
Another alternative: take the cumulative sum of p and subtract the most recent cumulative sum where p is False. Then invert p and do the same. Finally, multiply the two Series together:
c = p.cumsum()
a = c.sub(c.mask(p).ffill(), fill_value=0).sub(1).abs()
c = (~p).cumsum()
d = c.sub(c.mask(~(p)).ffill(), fill_value=0).sub(1).abs()
print (a)
0 0.0
1 1.0
2 0.0
3 1.0
4 2.0
5 3.0
6 1.0
7 1.0
8 0.0
dtype: float64
print (d)
0 1.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 1.0
8 1.0
dtype: float64
print (a.mul(d).astype(int))
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 0
dtype: int32
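The accepted ne/shift/cumsum pattern fits in one line; a self-contained sketch on the question's series:

```python
import pandas as pd

p = pd.Series([True, False, True, True, True, True, False, False, True])

# ne/shift marks run boundaries, cumsum turns them into run ids, and
# cumcount numbers positions within each run starting from 0.
streak = p.groupby(p.ne(p.shift()).cumsum()).cumcount()
print(streak)
```

Since cumcount starts at 0, the first element of every streak is 0, matching the desired output exactly.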