How to read a file and calculate graph properties with igraph in Python

I am new to igraph for Python. I have read the tutorial, but I still do not understand it very well.
I need some help calculating the graph properties of my data. My data is in bpseq format and I do not know how to read the file into igraph.
The graph properties that I need to get are:
Articulation point
Average path length
Average node betweenness
Variance of node betweenness
Average edge betweenness
Variance of edge betweenness
Average co-citation coupling
Average bibliographic coupling
Average closeness centrality
Diameter
Graph Density
This is an example of my dataset. The # line gives the name of the RNA class, the first column is the position, the letter is the base, and the third column is the pointer. The bases should be the nodes, and the bonds between the nucleotide bases should be the edges, but I do not know how to build this. There are about 2,00 datasets that look like the one below; this is one of the RNA classes.
# RF00001_AF095839_1_346-228 5S_rRNA
1 G 118
2 C 117
3 G 116
4 U 115
5 A 114
6 C 113
7 G 112
8 G 111
9 C 110
10 C 0
11 A 0
12 U 0
13 A 0
14 C 0
15 U 0
16 A 0
17 U 0
18 G 0
19 G 36
20 G 35
21 G 34
22 A 33
23 A 0
24 U 0
25 A 0
26 C 0
27 A 0
28 C 0
29 C 0
30 U 0
31 G 0
32 A 0
33 U 22
34 C 21
35 C 20
36 C 19
37 G 0
38 U 106
39 C 105
40 C 104
41 G 103
42 A 0
43 U 0
44 U 0
45 U 0
46 C 0
47 A 0
48 G 0
49 A 0
50 A 0
51 G 0
52 U 0
53 U 0
54 A 0
55 A 0
56 G 67
57 C 66
58 C 65
59 U 64
60 C 0
61 A 0
62 U 0
63 C 0
64 A 59
65 G 58
66 G 57
67 C 56
68 A 0
69 U 0
70 C 0
71 C 0
72 U 0
73 A 0
74 A 0
75 G 0
76 U 0
77 A 0
78 C 0
79 U 0
80 A 0
81 G 96
82 G 95
83 G 94
84 U 93
85 G 92
86 G 91
87 G 0
88 C 0
89 G 0
90 A 0
91 C 86
92 C 85
93 A 84
94 C 83
95 C 82
96 U 81
97 G 0
98 G 0
99 G 0
100 A 0
101 A 0
102 C 0
103 C 41
104 G 40
105 G 39
106 A 38
107 U 0
108 G 0
109 U 0
110 G 9
111 C 8
112 U 7
113 G 6
114 U 5
115 A 4
116 C 3
117 G 2
118 C 1
119 U 0
I am using Ubuntu 18.04. I really hope someone can help me and guide me on using igraph in Python.

pandas manipulations with columns

I was trying to parse some important data from HTML tables (from database links) using pandas and ran into a problem. Code:
import pandas as pd
df_list = pd.read_html('iss1.html', match='Supplier ID')
df_list2 = pd.read_html('iss2.html', match='Attachments:')
df_list3 = pd.read_html('P.html', match='AWS-HO')
df = pd.concat(df_list, axis=1)
df2 = pd.concat(df_list2, axis=1)
df3 = pd.concat(df_list3, axis=1)
df3 = df3.iloc[:, ::-1]
df_rev = df.iloc[:, ::-1]
df2 = df2.iloc[:, ::-1]
df_rev.columns = df_rev.iloc[0]
lc = df_rev[["Code"]]
lc = pd.DataFrame({"Code": df_rev["Code"].values.T.ravel(),})
lc = lc[lc['Code'] != 'SSB tracking']
lc = lc[lc['Code'] != 'USB']
lc = lc[lc['Code'] != 'Review']
lc = lc[lc['Code'] != '( Review )']
sup = df_rev[["ID"]]
sup = pd.DataFrame({"ID": df_rev["ID"].values.T.ravel(),})
sup = sup[sup['ID'] != 'SupID']
lc_sup = pd.concat([lc, sup], axis=1) # 'group' column
lc_sup['group'] = lc_sup['Code'].isna().cumsum()
lc_sup = lc_sup.sort_values(['group', 'Code'], ascending=True) # 'group' column
lc_sup = lc_sup[lc_sup['Code'].notna()] # 'group' column
ids_cons = pd.concat([ids_cons, lc_sup], axis=1)
This is 'ids_cons' DF.
I created the "group" column because of problems with NaNs and to sort the values. (lc_sup DF in code above)
The range of each project is individual and identified by the "group" column: rows sharing the same group value belong to the same project. In my example there are five projects.
Code Supplier ID group
1 d C0003 0
2 e R9996 0
3 f O0001 0
4 j MT0021 0
5 k DY0001 0
6 p B0114 0
7 z J0002 0
57 d T0096 48
58 e T0015 48
59 f R0167 48
60 i G0004 48
61 j T0021 48
62 k A0003 48
63 p S0035 48
64 z F0006 48
65 z C0002 48
113 j R0009 94
114 z A0013 94
169 e O0001 147
170 z A0013 147
281 d C0003 254
282 e O0001 254
283 f N0183 254
284 i O0001 254
So what I want to do now is add a project name to each project. I have a separate DataFrame (just a list) with the project names grouped into one column. The problem is that each required project name appears only once there, and I need to attach it to every row of its project via the 'group' column.
Previous example + added Project names DF:
Code Supplier ID group Project name
1 d C0003 0 E01
2 e R9996 0 E02
3 f O0001 0 E03
4 j MT0021 0 E04
5 k DY0001 0 E05
6 p B0114 0
7 z J0002 0
57 d T0096 48
58 e T0015 48
59 f R0167 48
60 i G0004 48
61 j T0021 48
62 k A0003 48
63 p S0035 48
64 z F0006 48
65 z C0002 48
113 j R0009 94
114 z A0013 94
169 e O0001 147
170 z A0013 147
281 d C0003 254
282 e O0001 254
283 f N0183 254
284 i O0001 254
And this the result that I want:
Code Supplier ID group Project name
1 d C0003 0 E01
2 e R9996 0 E01
3 f O0001 0 E01
4 j MT0021 0 E01
5 k DY0001 0 E01
6 p B0114 0 E01
7 z J0002 0 E01
57 d T0096 48 E02
58 e T0015 48 E02
59 f R0167 48 E02
60 i G0004 48 E02
61 j T0021 48 E02
62 k A0003 48 E02
63 p S0035 48 E02
64 z F0006 48 E02
65 z C0002 48 E02
113 j R0009 94 E03
114 z A0013 94 E03
169 e O0001 147 E04
170 z A0013 147 E04
281 d C0003 254 E05
282 e O0001 254 E05
283 f N0183 254 E05
284 i O0001 254 E05
IIUC, you can simply use groupby to get the group number (ngroup) and map to the project name:
ids_cons["Project name"] = ids_cons.groupby("group").ngroup().map(projects["Project name"])
>>> ids_cons
Code Supplier ID group Project name
0 1 d C0003 0 E01
1 2 e R9996 0 E01
2 3 f O0001 0 E01
3 4 j MT0021 0 E01
4 5 k DY0001 0 E01
5 6 p B0114 0 E01
6 7 z J0002 0 E01
7 57 d T0096 48 E02
8 58 e T0015 48 E02
9 59 f R0167 48 E02
10 60 i G0004 48 E02
11 61 j T0021 48 E02
12 62 k A0003 48 E02
13 63 p S0035 48 E02
14 64 z F0006 48 E02
15 65 z C0002 48 E02
16 113 j R0009 94 E03
17 114 z A0013 94 E03
18 169 e O0001 147 E04
19 170 z A0013 147 E04
20 281 d C0003 254 E05
21 282 e O0001 254 E05
22 283 f N0183 254 E05
23 284 i O0001 254 E05
Inputs:
projects = pd.DataFrame({"Project name": ["E01","E02","E03","E04","E05"]})
>>> projects
Project name
0 E01
1 E02
2 E03
3 E04
4 E05
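To see the mechanics end to end, here is a self-contained run of the accepted approach on a two-group toy frame (the data is made up; the `ids_cons`/`projects` names follow the answer):

```python
import pandas as pd

ids_cons = pd.DataFrame({
    "Code": ["d", "e", "d", "f"],
    "Supplier ID": ["C0003", "R9996", "T0096", "R0167"],
    "group": [0, 0, 48, 48],
})
projects = pd.DataFrame({"Project name": ["E01", "E02"]})

# ngroup() numbers each distinct 'group' value 0, 1, ...; map() then looks
# those numbers up in the projects index to fetch the matching name
ids_cons["Project name"] = (
    ids_cons.groupby("group").ngroup().map(projects["Project name"])
)
print(ids_cons)
```

This relies on `projects` having a default 0..n-1 index, in the same order the groups first appear.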

Python: how can I print the output in tabular form, with ten code-symbol pairs per line?

def display_code_ascii():
    for i in range(32, 128):
        print(chr(i))

display_code_ascii()
This is my code. The output is:
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
#
A
B
C
D
E
But I want the console output to look like this:
32 is 33 is ! 34 is " 35 is # 36 is $ 37 is % 38 is & 39 is ' 40 is ( 41 is )
42 is * 43 is + 44 is , 45 is - 46 is . 47 is / 48 is 0 49 is 1 50 is 2 51 is 3
52 is 4 53 is 5 54 is 6 55 is 7 56 is 8 57 is 9 58 is : 59 is ; 60 is < 61 is =
62 is > 63 is ? 64 is # 65 is A 66 is B 67 is C 68 is D 69 is E 70 is F 71 is G
72 is H 73 is I 74 is J 75 is K 76 is L 77 is M 78 is N 79 is O 80 is P 81 is Q
82 is R 83 is S 84 is T 85 is U 86 is V 87 is W 88 is X 89 is Y 90 is Z 91 is [
92 is \ 93 is ] 94 is ^ 95 is _ 96 is ` 97 is a 98 is b 99 is c 100 is d 101 is e
102 is f 103 is g 104 is h 105 is i 106 is j 107 is k 108 is l 109 is m 110 is n 111 is o
112 is p 113 is q 114 is r 115 is s 116 is t 117 is u 118 is v 119 is w 120 is x 121 is y
122 is z 123 is { 124 is | 125 is } 126 is ~ 127 is None
# chunk_size is the number of code-symbol pairs per output line
chunk_size = 10

def format_element(x):
    x = chr(x)
    return x if x != '\x7f' else "None"

# prepare the list of output strings
ll = [f"{x} is {format_element(x)}" for x in range(32, 128)]
# split the list into chunks of chunk_size
ll = [ll[i:i + chunk_size] for i in range(0, len(ll), chunk_size)]
# join each chunk into one output line
ll = [" ".join(x) for x in ll]
# print each line separately
for line in ll:
    print(line)
Output:
32 is 33 is ! 34 is " 35 is # 36 is $ 37 is % 38 is & 39 is ' 40 is ( 41 is )
42 is * 43 is + 44 is , 45 is - 46 is . 47 is / 48 is 0 49 is 1 50 is 2 51 is 3
52 is 4 53 is 5 54 is 6 55 is 7 56 is 8 57 is 9 58 is : 59 is ; 60 is < 61 is =
62 is > 63 is ? 64 is # 65 is A 66 is B 67 is C 68 is D 69 is E 70 is F 71 is G
72 is H 73 is I 74 is J 75 is K 76 is L 77 is M 78 is N 79 is O 80 is P 81 is Q
82 is R 83 is S 84 is T 85 is U 86 is V 87 is W 88 is X 89 is Y 90 is Z 91 is [
92 is \ 93 is ] 94 is ^ 95 is _ 96 is ` 97 is a 98 is b 99 is c 100 is d 101 is e
102 is f 103 is g 104 is h 105 is i 106 is j 107 is k 108 is l 109 is m 110 is n 111 is o
112 is p 113 is q 114 is r 115 is s 116 is t 117 is u 118 is v 119 is w 120 is x 121 is y
122 is z 123 is { 124 is | 125 is } 126 is ~ 127 is None
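An alternative sketch (mine, not part of the answer) avoids building intermediate lists by controlling `print`'s `end` argument instead:

```python
# break the line after every tenth item instead of chunking a list first
for n, code in enumerate(range(32, 128)):
    symbol = chr(code) if code != 127 else "None"
    print(f"{code} is {symbol}", end="\n" if n % 10 == 9 else " ")
print()  # newline after the final, partial row
```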

Add Sum to all grouped rows in pandas dataframe

I have a dataframe and I want to group it by its "First" and "Second" columns, then produce the expected output shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
print(df)
Output>
First Second Value_1 Value_2
0 a q 17 70
1 b e 44 47
2 c e 5 56
3 a e 23 58
4 b e 10 76
5 a q 11 67
6 b q 21 84
7 c q 42 67
8 b e 36 53
9 c q 16 63
When I group this DataFrame with groupby, I get the output below:
def func(arr, columns):
    return arr.sort_values(by=columns).drop(columns, axis=1)

df.groupby(['First','Second']).apply(func, columns=['First','Second'])
Value_1 Value_2
First Second
a e 3 23 58
q 0 17 70
5 11 67
b e 1 44 47
4 10 76
8 36 53
q 6 21 84
c e 2 5 56
q 7 42 67
9 16 63
However i want below output:
Expected output:
Value_1 Value_2
First Second
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
It is not necessary to print the "All" string; what I need is the sum of all rows in each group.
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
First Second Value_1 Value_2
0 a q 4 69
1 b e 20 74
2 c e 13 82
3 a e 9 41
4 b e 11 79
5 a q 32 77
6 b q 6 75
7 c q 39 62
8 b e 26 80
9 c q 26 42
def lambda_t(x):
    df = x.sort_values(['First','Second']).drop(['First','Second'], axis=1)
    df.loc['all'] = df.sum()
    return df

df.groupby(['First','Second']).apply(lambda_t)
Value_1 Value_2
First Second
a e 3 9 41
all 9 41
q 0 4 69
5 32 77
all 36 146
b e 1 20 74
4 11 79
8 26 80
all 57 233
q 6 6 75
all 6 75
c e 2 13 82
all 13 82
q 7 39 62
9 26 42
all 65 104
You can try this:
Reset the index in your groupby result:
d1 = df.groupby(['First','Second']).apply(func, columns=['First','Second']).reset_index()
Then group by 'First' and 'Second' and sum the value columns:
d2 = d1.groupby(['First', 'Second']).sum().reset_index()
Create the 'level_2' column in the new dataframe and concatenate with the initial one to get the desired result:
d2.loc[:, 'level_2'] = 'All'
pd.concat([d1, d2], axis=0).sort_values(by=['First', 'Second'])
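A deterministic, self-contained sketch of this reset-index/concat recipe (my data and simplification; the question used random values, and `func` from the question is replaced with a plain `reset_index`):

```python
import pandas as pd

df = pd.DataFrame({
    "First": list("aab"),
    "Second": list("qqe"),
    "Value_1": [17, 11, 44],
    "Value_2": [70, 67, 47],
})

# keep the original row label as 'level_2', like the answer's reset_index()
d1 = df.reset_index().rename(columns={"index": "level_2"})
# per-group sums, labelled 'All' in the same level_2 slot
d2 = df.groupby(["First", "Second"], as_index=False)[["Value_1", "Value_2"]].sum()
d2["level_2"] = "All"

out = (
    pd.concat([d1, d2], axis=0)
      .sort_values(["First", "Second"])
      .set_index(["First", "Second", "level_2"])
)
print(out)
```

The stable sort keeps each group's detail rows first, with its "All" row last.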
I'm not sure about your function; however, you could split this into two steps:
Create an indexed dataframe, where you append the First and Second columns to the existing index:
df.index = df.index.astype(str).rename("Total")
indexed = df.set_index(["First", "Second"], append=True).reorder_levels(
["First", "Second", "Total"]
)
indexed
Value_1 Value_2
First Second Total
a q 0 17 70
b e 1 44 47
c e 2 5 56
a e 3 23 58
b e 4 10 76
a q 5 11 67
b q 6 21 84
c q 7 42 67
b e 8 36 53
c q 9 16 63
Create an aggregation, grouped by First and Second:
summary = (
df.groupby(["First", "Second"])
.sum()
.assign(Total="All")
.set_index("Total", append=True)
)
summary
Value_1 Value_2
First Second Total
a e All 23 58
q All 28 137
b e All 90 176
q All 21 84
c e All 5 56
q All 58 130
Combine indexed and summary dataframes:
pd.concat([indexed, summary]).sort_index(level=["First", "Second"])
Value_1 Value_2
First Second Total
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130

pandas: enumerate items in each group

I have a DataFrame like
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 0
5 100 L 71 0
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 0
12 101 L 67 0
13 101 L 68 0
14 101 L 69 0
15 101 L 71 0
16 101 L 72 0
17 201 R 67 0
18 201 R 68 0
19 201 R 69 0
ord essentially gives the ordering of the entries when (prop, chi and id) all have the same value. This isn't quite what I'd like though. Instead, I'd like to be able to enumerate the entries of each group g in {(id, chi)} from 0 to n_g where n_g is the size of group g. So I'd like to obtain something that looks like
id chi prop count
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
I'd like to know if there's a simple way of doing this with pandas. The following comes very close, but it feels way too complicated, and for some reason it won't let me join the resulting dataframe with the original one.
(df.groupby(['id', 'chi'])
.apply(lambda g: np.arange(g.shape[0]))
.apply(pd.Series, 1)
.stack()
.rename('counter')
.reset_index()
.drop(columns=['level_2']))
EDIT: A second way, of course, is a for loop, but I'm looking for something more "Pythonic" than:
for gname, idx in df.groupby(['id','chi']).groups.items():
    tmp = df.loc[idx]
    df.loc[idx, 'counter'] = np.arange(tmp.shape[0])
R has a very simple way of achieving this behaviour using the tidyverse packages, but I haven't quite found the well-oiled way to achieve the same thing with pandas. Any help provided is greatly appreciated!
cumcount
df.assign(ord=df.groupby(['id', 'chi']).cumcount())
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
defaultdict and count
from itertools import count
from collections import defaultdict
d = defaultdict(count)
df.assign(ord=[next(d[t]) for t in zip(df.id, df.chi)])
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
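Regarding the asker's trouble joining the result back: `cumcount` returns a Series aligned with the original index, so no join is needed at all. A small made-up frame to illustrate:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [100, 100, 100, 110],
    "chi": ["L", "L", "R", "R"],
    "prop": [67, 68, 67, 70],
})
# cumcount is index-aligned with df, so it can be assigned directly,
# restarting at 0 within each (id, chi) group
df["counter"] = df.groupby(["id", "chi"]).cumcount()
print(df)
```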

Drop range of columns by labels

Suppose I had this large data frame:
In [31]: df
Out[31]:
A B C D E F G H I J ... Q R S T U V W X Y Z
0 0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
1 26 27 28 29 30 31 32 33 34 35 ... 42 43 44 45 46 47 48 49 50 51
2 52 53 54 55 56 57 58 59 60 61 ... 68 69 70 71 72 73 74 75 76 77
[3 rows x 26 columns]
which you can create using
alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z')+1)]
df = pd.DataFrame(np.arange(3*26).reshape(3, 26), columns=alphabet)
What's the best way to drop all columns between column 'D' and 'R' using the labels of the columns?
I found one ugly way to do it:
df.drop(df.columns[df.columns.get_loc('D'):df.columns.get_loc('R')+1], axis=1)
Here's my entry:
>>> df.drop(df.columns.to_series()["D":"R"], axis=1)
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
By converting df.columns from an Index to a Series, we can take advantage of the ["D":"R"]-style selection:
>>> df.columns.to_series()["D":"R"]
D D
E E
F F
G G
H H
I I
J J
... ...
Q Q
R R
dtype: object
Here you are:
print(df.loc[:, 'A':'C'].join(df.loc[:, 'S':'Z']))
Out[1]:
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
Here's another way, using the slice bounds of the column index:
low = df.columns.get_slice_bound('D', side='left')
high = df.columns.get_slice_bound('R', side='right')
drops = df.columns[low:high]
print(df.drop(drops, axis=1))
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
Use numpy for more flexibility: numpy string arrays support element-wise comparison of letters (lexicographic, i.e. by character code):
import numpy as np

b = np.array(['A', 'B', 'C', 'D'])
print(b)
print(b > 'B')
gives:
['A' 'B' 'C' 'D']
[False False  True  True]
More complex selections are also easily possible:
b[np.logical_and(b > 'B', b < 'D')]
gives:
array(['C'], dtype='<U1')
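On current pandas, one more variant (my suggestion, not from the thread) slices the column range with `.loc` and drops it by name:

```python
import numpy as np
import pandas as pd

alphabet = [chr(i) for i in range(ord("A"), ord("Z") + 1)]
df = pd.DataFrame(np.arange(3 * 26).reshape(3, 26), columns=alphabet)

# .loc label slices are inclusive on both ends, so 'D':'R' covers D through R
result = df.drop(columns=df.loc[:, "D":"R"].columns)
print(result.columns.tolist())
```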
