Extract specific rows based on set cut-off values in columns - python

I have a TAB-delimited .txt file that looks like this.
Gene_name A B C D E F
Gene1 1 0 5 2 0 0
Gene2 4 45 0 0 32 1
Gene3 0 23 0 4 0 54
Gene4 12 0 6 8 7 4
Gene5 4 0 0 6 0 7
Gene6 0 6 8 0 0 5
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0
I want to extract the rows in which at least 3 columns have values > 0.
How do I do this using pandas? I am clueless about how to apply conditions to data read from .txt files.
Thanks very much!
Update: adding on to this question, how do I apply this condition to specific columns? Let's say I look at columns A, C, E & F and extract rows in which at least 3 of these columns have values > 5.
Cheers!

df = pd.read_csv(filename, delim_whitespace=True)
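Note: on newer pandas (2.1+) delim_whitespace is deprecated; the equivalent is a regex separator:
df = pd.read_csv(filename, sep=r'\s+')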
In [22]: df[df.select_dtypes(['number']).gt(0).sum(axis=1).ge(3)]
Out[22]:
Gene_name A B C D E F
0 Gene1 1 0 5 2 0 0
1 Gene2 4 45 0 0 32 1
2 Gene3 0 23 0 4 0 54
3 Gene4 12 0 6 8 7 4
4 Gene5 4 0 0 6 0 7
5 Gene6 0 6 8 0 0 5
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
8 Gene9 6 0 12 34 0 11
9 Gene10 23 4 6 7 89 0
Some explanation:
In [25]: df.select_dtypes(['number']).gt(0)
Out[25]:
A B C D E F
0 True False True True False False
1 True True False False True True
2 False True False True False True
3 True False True True True True
4 True False False True False True
5 False True True False False True
6 True True True True False True
7 True True False True True True
8 True False True True False True
9 True True True True True False
In [26]: df.select_dtypes(['number']).gt(0).sum(axis=1)
Out[26]:
0 3
1 4
2 3
3 5
4 3
5 3
6 5
7 5
8 4
9 5
dtype: int64
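For completeness, the final .ge(3) step turns those counts into the boolean mask that indexes df; with this sample every row has at least three positive values, so the mask is all True:
In [27]: df.select_dtypes(['number']).gt(0).sum(axis=1).ge(3)
Out[27]:
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool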

Using operators (as a complement to Max's answer):
mask = (df.iloc[:, 1:] > 0).sum(1) >= 3
mask
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
df[mask]
Gene_name A B C D E F
0 Gene1 1 0 5 2 0 0
1 Gene2 4 45 0 0 32 1
2 Gene3 0 23 0 4 0 54
3 Gene4 12 0 6 8 7 4
4 Gene5 4 0 0 6 0 7
5 Gene6 0 6 8 0 0 5
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
8 Gene9 6 0 12 34 0 11
9 Gene10 23 4 6 7 89 0
Similarly, querying all rows with 5 or more positive values:
df[(df.iloc[:, 1:] > 0).sum(1) >= 5]
Gene_name A B C D E F
3 Gene4 12 0 6 8 7 4
6 Gene7 13 45 64 234 0 6
7 Gene8 11 6 0 7 7 9
9 Gene10 23 4 6 7 89 0
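The same operator style handles the updated question (at least 3 of the columns A, C, E, F greater than 5); a sketch, equivalent to the index-based solution below:
df[(df[['A', 'C', 'E', 'F']] > 5).sum(1) >= 3]
On this data it keeps Gene4, Gene7, Gene8, Gene9 and Gene10.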

Piggybacking off of @MaxU's solution, I like to go ahead and put 'Gene_name' into the index so I don't have to worry about all that index slicing:
df = pd.read_csv(tfile, delim_whitespace=True, index_col=0)
df[df.gt(0).sum(1).ge(3)]
Edit for question update:
df[df[['A','C','E','F']].gt(5).sum(1).ge(3)]
Output:
A B C D E F
Gene_name
Gene4 12 0 6 8 7 4
Gene7 13 45 64 234 0 6
Gene8 11 6 0 7 7 9
Gene9 6 0 12 34 0 11
Gene10 23 4 6 7 89 0
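If you later need Gene_name back as a regular column, reset_index restores it:
df[df.gt(0).sum(1).ge(3)].reset_index()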

cumcount() None

I would like to have a new column (not_ordered_in_STREET_x_before_my_car) that counts the None values in my DataFrame up until the row I am in, grouped by x and sorted by x and y.
import pandas as pd
import numpy as np

x_start = 1
y_start = 1
size_city = 10
cars = pd.DataFrame({
    'x': np.repeat(np.arange(x_start, x_start + size_city), size_city),
    'y': np.tile(np.arange(y_start, y_start + size_city), size_city),
    'pizza_ordered': np.repeat([None, None, 1, 6, 3, 7, 5, None, 8, 9,
                                0, None, None, None, 4, None, 11, 12, 14, 15], 5)
})
The first four columns are what I have, and the fifth is the one I want.
x y pizza_ordered not_ordered_in_STREET_x_before_my_car
0 1 1 None 0
1 1 2 None 1
2 1 3 1 2
3 1 4 2 2
4 1 5 1 2
5 1 6 1 2
6 1 7 1 2
7 1 8 None 2
8 1 9 1 3
9 1 10 4 3
10 2 1 1 0
11 2 2 None 0
12 2 3 None 1
13 2 4 None 2
14 2 5 4 3
15 2 6 None 3
16 2 7 5 4
17 2 8 3 4
18 2 9 1 4
19 2 10 1 4
This is what I have tried, but it does not work.
cars = cars.sort_values(['x', 'y'])
cars['not_ordered_in_STREET_x_before_my_car'] = cars.where(cars['pizza_ordered'].isnull()).groupby(['x']).cumcount().add(1)
You can try:
cars["not_ordered_in_STREET_x_before_my_car"] = (
    cars.groupby("x")["pizza_ordered"]
        .transform(lambda x: x.isna().cumsum().shift(1).fillna(0).astype(int))
)
print(cars)
Prints:
x y pizza_ordered not_ordered_in_STREET_x_before_my_car
0 1 1 None 0
1 1 2 None 1
2 1 3 None 2
3 1 4 None 3
4 1 5 None 4
5 1 6 None 5
6 1 7 None 6
7 1 8 None 7
8 1 9 None 8
9 1 10 None 9
10 2 1 1 0
11 2 2 1 0
12 2 3 1 0
13 2 4 1 0
14 2 5 1 0
15 2 6 6 0
16 2 7 6 0
17 2 8 6 0
18 2 9 6 0
19 2 10 6 0
20 3 1 3 0
21 3 2 3 0
22 3 3 3 0
23 3 4 3 0
24 3 5 3 0
25 3 6 7 0
26 3 7 7 0
27 3 8 7 0
28 3 9 7 0
29 3 10 7 0
30 4 1 5 0
31 4 2 5 0
32 4 3 5 0
33 4 4 5 0
34 4 5 5 0
35 4 6 None 0
36 4 7 None 1
37 4 8 None 2
38 4 9 None 3
39 4 10 None 4
40 5 1 8 0
41 5 2 8 0
42 5 3 8 0
43 5 4 8 0
44 5 5 8 0
45 5 6 9 0
46 5 7 9 0
47 5 8 9 0
48 5 9 9 0
49 5 10 9 0
50 6 1 0 0
51 6 2 0 0
52 6 3 0 0
53 6 4 0 0
54 6 5 0 0
55 6 6 None 0
56 6 7 None 1
57 6 8 None 2
58 6 9 None 3
59 6 10 None 4
60 7 1 None 0
61 7 2 None 1
62 7 3 None 2
63 7 4 None 3
64 7 5 None 4
65 7 6 None 5
66 7 7 None 6
67 7 8 None 7
68 7 9 None 8
69 7 10 None 9
70 8 1 4 0
71 8 2 4 0
72 8 3 4 0
73 8 4 4 0
74 8 5 4 0
75 8 6 None 0
76 8 7 None 1
77 8 8 None 2
78 8 9 None 3
79 8 10 None 4
80 9 1 11 0
81 9 2 11 0
82 9 3 11 0
83 9 4 11 0
84 9 5 11 0
85 9 6 12 0
86 9 7 12 0
87 9 8 12 0
88 9 9 12 0
89 9 10 12 0
90 10 1 14 0
91 10 2 14 0
92 10 3 14 0
93 10 4 14 0
94 10 5 14 0
95 10 6 15 0
96 10 7 15 0
97 10 8 15 0
98 10 9 15 0
99 10 10 15 0
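A vectorized variant without the lambda, a sketch: take the grouped running count of Nones and subtract the current row's own flag, which is equivalent to the shift above:
is_none = cars['pizza_ordered'].isna()
cars['not_ordered_in_STREET_x_before_my_car'] = (
    is_none.groupby(cars['x']).cumsum() - is_none
).astype(int)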
A simpler one-liner, if the count does not need to reset per street (it includes the current row and accumulates across all of x):
cars['not_ordered_in_STREET_x_before_my_car'] = pd.isnull(cars['pizza_ordered']).cumsum()

How do I obtain the second highest value in a row?

I want to obtain the second highest value of a certain section for each row from a dataframe. How do I do this?
I have tried the following code but it doesn't work:
df.iloc[:, 5:-3].nlargest(2)(axis=1, level=2)
Is there any other way to obtain this?
Using apply with axis=1, you can find the second-largest value for each row by taking the two largest values and then keeping the last of them:
df.iloc[:, 5:-3].apply(lambda row: row.nlargest(2).values[-1],axis=1)
Example
The code below finds the second-largest value in each row of df.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'Col{}'.format(i):np.random.randint(0,100,5) for i in range(5)})
In [4]: df
Out[4]:
Col0 Col1 Col2 Col3 Col4
0 82 32 14 62 90
1 62 32 74 62 72
2 31 79 22 17 3
3 42 54 66 93 50
4 13 88 6 46 69
In [5]: df.apply(lambda row: row.nlargest(2).values[-1],axis=1)
Out[5]:
0 82
1 72
2 31
3 66
4 69
dtype: int64
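apply with axis=1 makes a Python call per row, which is slow on large frames. A vectorized alternative, a sketch on the same example df, uses numpy.partition, which places the second-largest value at index -2 without a full sort:
import numpy as np
second = np.partition(df.values, -2, axis=1)[:, -2]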
I think you need to sort per row and then select:
a = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 8 8 3 7 7 0 4 2 5 2
1 2 2 1 0 8 4 0 9 6 2
2 4 1 5 3 4 4 3 7 1 1
3 7 7 0 2 9 9 3 2 5 8
4 1 0 7 6 2 0 8 2 5 1
5 8 1 5 4 2 8 3 5 0 9
6 3 6 3 4 7 6 3 9 0 4
7 4 5 7 6 6 2 4 2 7 1
8 6 6 0 7 2 3 5 4 2 4
9 3 7 9 0 0 5 9 6 6 5
print (df.iloc[:, 5:-3])
5 6
0 0 4
1 4 0
2 4 3
3 9 3
4 0 8
5 8 3
6 6 3
7 2 4
8 3 5
9 5 9
a = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
print (a)
[0 0 3 3 0 3 3 2 3 5]
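If you want the result index-aligned as a pandas Series, a sketch:
pd.Series(a, index=df.index)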
If you need both values:
a = df.iloc[:, 5:-3].values
b = pd.DataFrame(a[np.arange(len(a))[:, None], np.argsort(a, axis=1)])
print (b)
0 1
0 0 4
1 0 4
2 3 4
3 3 9
4 0 8
5 3 8
6 3 6
7 2 4
8 3 5
9 5 9
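The fancy indexing above is equivalent to a plain row-wise sort, which reads more simply:
b = pd.DataFrame(np.sort(a, axis=1))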
You need to sort your DataFrame with numpy.sort() and then take the value second from the end; after an ascending sort the second-largest sits at index -2, not 1:
import numpy as np
second = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]

Python Pandas: DataFrame modification with diagonal value = 0 [duplicate]

This question already has answers here:
Set values on the diagonal of pandas.DataFrame
(8 answers)
Closed 5 years ago.
I have a Pandas DataFrame question: I have a df whose index equals its columns. It looks like below.
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 812 62 174 0 4 46 46 7 2 15
B 62 427 27 0 0 12 61 2 4 11
C 174 27 174 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 130 10 57 33 4 5
F 46 12 13 0 10 187 4 5 0 0
......
In other words, df = df.transpose(). All I want to do is find a pandas (or numpy, working on df.values) function to zero out the cells where the index label equals the column label. My ideal output would be below.
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15
B 62 0 27 0 0 12 61 2 4 11
C 174 27 0 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5
F 46 12 13 0 10 0 4 5 0 0
......
Is there a python function that makes this step very fast? I tried a for loop with df.iloc[i, i] = 0, but since my dataset is very big, it takes a long time to finish. Thanks in advance!
Setup
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
i = pd.MultiIndex.from_product(
    [['Cat2'], list('ABCDEFGHIJ')],
    names=['DNA', 'Item']
)
a = np.random.randint(5, size=(10, 10))
df = pd.DataFrame(a + a.T + 1, i, i)
df
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 1 6 6 7 7 7 4 4 8 2
B 6 1 3 6 1 6 6 4 8 5
C 6 3 9 8 9 6 7 8 4 9
D 7 6 8 1 6 9 4 5 4 3
E 7 1 9 6 9 7 3 7 2 6
F 7 6 6 9 7 9 3 4 6 6
G 4 6 7 4 3 3 9 4 5 5
H 4 4 8 5 7 4 4 5 4 5
I 8 8 4 4 2 6 5 4 9 7
J 2 5 9 3 6 6 5 5 7 3
Option 1
The simplest way is to multiply by one minus the identity matrix: np.eye has ones on the diagonal, so 1 - np.eye(n) is zero exactly on the diagonal and one everywhere else, and elementwise multiplication therefore zeroes only the diagonal:
df * (1 - np.eye(len(df), dtype=int))
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
Option 2
However, we can also use pd.DataFrame.mask with np.eye. Masking is nice because the data doesn't have to be numeric; it will still work:
df.mask(np.eye(len(df), dtype=bool), 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
Option 3
In the event the columns and indices are not identical, or they are out of order, we can use equality to tell us where to mask:
d = df.iloc[::-1]
d.mask(d.index == d.columns.values[:, None], 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 J 2 5 9 3 6 6 5 5 7 0
I 8 8 4 4 2 6 5 4 0 7
H 4 4 8 5 7 4 4 0 4 5
G 4 6 7 4 3 3 0 4 5 5
F 7 6 6 9 7 0 3 4 6 6
E 7 1 9 6 0 7 3 7 2 6
D 7 6 8 0 6 9 4 5 4 3
C 6 3 0 8 9 6 7 8 4 9
B 6 0 3 6 1 6 6 4 8 5
A 0 6 6 7 7 7 4 4 8 2
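If the values form a single numeric block, numpy's fill_diagonal can also zero the diagonal in place; a sketch, noting that df.values must be a view of the underlying data, which mixed dtypes (or copy-on-write) would break:
import numpy as np
np.fill_diagonal(df.values, 0)   # mutates df's data directly when .values is a view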

Holding a first value in a column while another column equals a value?

I would like to hold the first value in a column while another column does not equal zero. In Column B, values alternate between -1, 0 and 1. In Column C, values can be any integer. The objective is to hold the first value of Column C while Column B equals zero. The current DataFrame is as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 8
5 5 0 9
6 6 0 1
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 8
11 5 0 9
12 6 0 10
The resulting DataFrame should be as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
You first need to create NaNs by condition in column C and then fill them forward with ffill:
mask = df['B'].shift(fill_value=0).astype(bool) | df['B'].astype(bool)
df['C'] = df.loc[mask, 'C']
df['C'] = df['C'].ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
Or use where; if all the values are integers, add astype:
mask = df['B'].shift(fill_value=0).astype(bool) | df['B'].astype(bool)
df['C'] = df['C'].where(mask).ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
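For intuition, this is the mask on the 12 input rows, a sketch: True marks the rows whose C value survives; the rest become NaN and are forward-filled:
print(mask.tolist())
[True, True, True, False, False, False, True, True, True, False, False, False]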

How to "iron out" a column of numbers with duplicates in it

If one has the following column:
import pandas as pd
df = pd.DataFrame({"numbers":[1,2,3,4,4,5,1,2,2,3,4,5,6,7,7,8,1,1,2,2,3,4,5,6,6,7]})
How can one "iron" it out so that the duplicates become part of the series of numbers:
numbers new_numbers
1 1
2 2
3 3
4 4
4 5
5 6
1 1
2 2
2 3
3 4
4 5
5 6
6 7
7 8
7 9
8 10
1 1
1 2
2 3
2 4
3 5
4 6
5 7
6 8
6 9
7 10
(I put spaces into the df for clarification)
It seems you need cumcount, grouped by a key built from diff compared with lt (<): a strict decrease marks the start of a new run (duplicates have a diff of 0, so they stay in the same run and the count keeps growing through them), and cumsum turns those starts into group ids:
# helper df1, to inspect the intermediate steps
df1 = pd.DataFrame(index=df.index)
df1['dif'] = df.numbers.diff()
df1['compare'] = df.numbers.diff().lt(0)
df1['groups'] = df.numbers.diff().lt(0).cumsum()
print (df1)
dif compare groups
0 NaN False 0
1 1.0 False 0
2 1.0 False 0
3 1.0 False 0
4 0.0 False 0
5 1.0 False 0
6 -4.0 True 1
7 1.0 False 1
8 0.0 False 1
9 1.0 False 1
10 1.0 False 1
11 1.0 False 1
12 1.0 False 1
13 1.0 False 1
14 0.0 False 1
15 1.0 False 1
16 -7.0 True 2
17 0.0 False 2
18 1.0 False 2
19 0.0 False 2
20 1.0 False 2
21 1.0 False 2
22 1.0 False 2
23 1.0 False 2
24 0.0 False 2
25 1.0 False 2
df['new_numbers'] = df.groupby(df.numbers.diff().lt(0).cumsum()).cumcount() + 1
print (df)
numbers new_numbers
0 1 1
1 2 2
2 3 3
3 4 4
4 4 5
5 5 6
6 1 1
7 2 2
8 2 3
9 3 4
10 4 5
11 5 6
12 6 7
13 7 8
14 7 9
15 8 10
16 1 1
17 1 2
18 2 3
19 2 4
20 3 5
21 4 6
22 5 7
23 6 8
24 6 9
25 7 10
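The same logic reads a little more clearly with a named grouping key; a sketch equivalent to the one-liner above:
restart = df['numbers'].diff().lt(0)   # True where the value drops, i.e. a new run starts
run_id = restart.cumsum()              # integer id for each run
df['new_numbers'] = df.groupby(run_id).cumcount() + 1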
