pandas manipulations with columns - python

I was trying to parse some data from HTML tables (saved from database links) using pandas and ran into a problem. Code:
import pandas as pd
df_list = pd.read_html('iss1.html', match='Supplier ID')
df_list2 = pd.read_html('iss2.html', match='Attachments:')
df_list3 = pd.read_html('P.html', match='AWS-HO')
df = pd.concat(df_list, axis=1)
df2 = pd.concat(df_list2, axis=1)
df3 = pd.concat(df_list3, axis=1)
df3 = df3.iloc[:, ::-1]
df_rev = df.iloc[:, ::-1]
df2 = df2.iloc[:, ::-1]
df_rev.columns = df_rev.iloc[0]
lc = df_rev[["Code"]]
lc = pd.DataFrame({"Code": df_rev["Code"].values.T.ravel(),})
lc = lc[lc['Code'] != 'SSB tracking']
lc = lc[lc['Code'] != 'USB']
lc = lc[lc['Code'] != 'Review']
lc = lc[lc['Code'] != '( Review )']
sup = df_rev[["ID"]]
sup = pd.DataFrame({"ID": df_rev["ID"].values.T.ravel(),})
sup = sup[sup['ID'] != 'SupID']
lc_sup = pd.concat([lc, sup], axis=1) # 'group' column
lc_sup['group'] = lc_sup['Lang Code'].isna().cumsum()
lc_sup = lc_sup.sort_values(['group', 'Lang Code'], ascending=True) # 'group' column
lc_sup = lc_sup[lc_sup['Lang Code'].notna()] # 'group' column
ids_cons = pd.concat([ids_cons, lc_sup], axis=1)
This is 'ids_cons' DF.
I created the "group" column because of problems with NaNs and to sort the values. (lc_sup DF in code above)
Each project occupies its own range of rows, identified by the "group" column: rows with the same group number belong to the same project. My example contains several projects together.
Code Supplier ID group
1 d C0003 0
2 e R9996 0
3 f O0001 0
4 j MT0021 0
5 k DY0001 0
6 p B0114 0
7 z J0002 0
57 d T0096 48
58 e T0015 48
59 f R0167 48
60 i G0004 48
61 j T0021 48
62 k A0003 48
63 p S0035 48
64 z F0006 48
65 z C0002 48
113 j R0009 94
114 z A0013 94
169 e O0001 147
170 z A0013 147
281 d C0003 254
282 e O0001 254
283 f N0183 254
284 i O0001 254
So what I want to do now is add a project name to each of these projects. I have a separate DF (just a list) with the project names in one column. The problem is that each project name appears only once there, and I need to add it to every row of its project, based on the 'group' column.
Previous example + added Project names DF:
Code Supplier ID group Project name
1 d C0003 0 E01
2 e R9996 0 E02
3 f O0001 0 E03
4 j MT0021 0 E04
5 k DY0001 0 E05
6 p B0114 0
7 z J0002 0
57 d T0096 48
58 e T0015 48
59 f R0167 48
60 i G0004 48
61 j T0021 48
62 k A0003 48
63 p S0035 48
64 z F0006 48
65 z C0002 48
113 j R0009 94
114 z A0013 94
169 e O0001 147
170 z A0013 147
281 d C0003 254
282 e O0001 254
283 f N0183 254
284 i O0001 254
And this is the result that I want:
Code Supplier ID group Project name
1 d C0003 0 E01
2 e R9996 0 E01
3 f O0001 0 E01
4 j MT0021 0 E01
5 k DY0001 0 E01
6 p B0114 0 E01
7 z J0002 0 E01
57 d T0096 48 E02
58 e T0015 48 E02
59 f R0167 48 E02
60 i G0004 48 E02
61 j T0021 48 E02
62 k A0003 48 E02
63 p S0035 48 E02
64 z F0006 48 E02
65 z C0002 48 E02
113 j R0009 94 E03
114 z A0013 94 E03
169 e O0001 147 E04
170 z A0013 147 E04
281 d C0003 254 E05
282 e O0001 254 E05
283 f N0183 254 E05
284 i O0001 254 E05

IIUC, you can simply use groupby to get the group number (ngroup) and map to the project name:
ids_cons["Project name"] = ids_cons.groupby("group").ngroup().map(projects["Project name"])
>>> ids_cons
Code Supplier ID group Project name
0 1 d C0003 0 E01
1 2 e R9996 0 E01
2 3 f O0001 0 E01
3 4 j MT0021 0 E01
4 5 k DY0001 0 E01
5 6 p B0114 0 E01
6 7 z J0002 0 E01
7 57 d T0096 48 E02
8 58 e T0015 48 E02
9 59 f R0167 48 E02
10 60 i G0004 48 E02
11 61 j T0021 48 E02
12 62 k A0003 48 E02
13 63 p S0035 48 E02
14 64 z F0006 48 E02
15 65 z C0002 48 E02
16 113 j R0009 94 E03
17 114 z A0013 94 E03
18 169 e O0001 147 E04
19 170 z A0013 147 E04
20 281 d C0003 254 E05
21 282 e O0001 254 E05
22 283 f N0183 254 E05
23 284 i O0001 254 E05
Inputs:
projects = pd.DataFrame({"Project name": ["E01","E02","E03","E04","E05"]})
>>> projects
Project name
0 E01
1 E02
2 E03
3 E04
4 E05


Python, how can I execute the output in tabular form: ten code-symbol pairs in each line?

def display_code_ascii():
    for i in range(32, 128):
        print(chr(i))
print(display_code_ascii())
This is my code. The output is:
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
E
But I want to print in a console like this:
32 is 33 is ! 34 is " 35 is # 36 is $ 37 is % 38 is & 39 is ' 40 is ( 41 is )
42 is * 43 is + 44 is , 45 is - 46 is . 47 is / 48 is 0 49 is 1 50 is 2 51 is 3
52 is 4 53 is 5 54 is 6 55 is 7 56 is 8 57 is 9 58 is : 59 is ; 60 is < 61 is =
62 is > 63 is ? 64 is @ 65 is A 66 is B 67 is C 68 is D 69 is E 70 is F 71 is G
72 is H 73 is I 74 is J 75 is K 76 is L 77 is M 78 is N 79 is O 80 is P 81 is Q
82 is R 83 is S 84 is T 85 is U 86 is V 87 is W 88 is X 89 is Y 90 is Z 91 is [
92 is \ 93 is ] 94 is ^ 95 is _ 96 is ` 97 is a 98 is b 99 is c 100 is d 101 is e
102 is f 103 is g 104 is h 105 is i 106 is j 107 is k 108 is l 109 is m 110 is n 111 is o
112 is p 113 is q 114 is r 115 is s 116 is t 117 is u 118 is v 119 is w 120 is x 121 is y
122 is z 123 is { 124 is | 125 is } 126 is ~ 127 is None
# set chunk_size, the number of elements to print on each output line
chunk_size = 10
def format_element(x):
    x = chr(x)
    return x if x != '\x7f' else "None"
# prepare initial list elements with output strings
ll = [f"{x} is {format_element(x)}" for x in range(32, 128)]
# split list into chunks using chunk_size
ll = [ll[i:i+chunk_size] for i in range(len(ll))[::chunk_size]]
# join inner lists into output lines strings
ll = [" ".join(x) for x in ll]
# print each line separately
for i in ll:
    print(i)
Output:
32 is 33 is ! 34 is " 35 is # 36 is $ 37 is % 38 is & 39 is ' 40 is ( 41 is )
42 is * 43 is + 44 is , 45 is - 46 is . 47 is / 48 is 0 49 is 1 50 is 2 51 is 3
52 is 4 53 is 5 54 is 6 55 is 7 56 is 8 57 is 9 58 is : 59 is ; 60 is < 61 is =
62 is > 63 is ? 64 is @ 65 is A 66 is B 67 is C 68 is D 69 is E 70 is F 71 is G
72 is H 73 is I 74 is J 75 is K 76 is L 77 is M 78 is N 79 is O 80 is P 81 is Q
82 is R 83 is S 84 is T 85 is U 86 is V 87 is W 88 is X 89 is Y 90 is Z 91 is [
92 is \ 93 is ] 94 is ^ 95 is _ 96 is ` 97 is a 98 is b 99 is c 100 is d 101 is e
102 is f 103 is g 104 is h 105 is i 106 is j 107 is k 108 is l 109 is m 110 is n 111 is o
112 is p 113 is q 114 is r 115 is s 116 is t 117 is u 118 is v 119 is w 120 is x 121 is y
122 is z 123 is { 124 is | 125 is } 126 is ~ 127 is None
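The same chunking idea can also be written by stepping through the code range directly, ten codes at a time. This is only a sketch: the function name `ascii_table_lines` and its parameters are made up for illustration, and it returns the lines instead of printing them so the result is easy to check.

```python
def ascii_table_lines(start=32, stop=128, per_line=10):
    """Return the table as a list of strings with per_line code-symbol pairs each."""
    def symbol(code):
        # chr(127) is the non-printable DEL character; show "None" as in the desired output.
        return "None" if code == 127 else chr(code)

    lines = []
    # range(start, stop, per_line) yields the first code of every output line.
    for first in range(start, stop, per_line):
        last = min(first + per_line, stop)
        pairs = [f"{code} is {symbol(code)}" for code in range(first, last)]
        lines.append(" ".join(pairs))
    return lines

for line in ascii_table_lines():
    print(line)
```

Note that the original function prints inside the loop and returns nothing, so wrapping the call in print() also prints a trailing None.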

How to read file and calculate graph properties in igraph python

I am new to python-igraph. I have read the tutorial, but I do not understand it very well.
I need some help calculating the graph properties of my data. My data is in bpseq format and I do not know how to read the file in igraph.
The graph properties that I need to get are:
Articulation point
Average path length
Average node betweenness
Variance of node betweenness
Average edge betweenness
Variance of edge betweenness
Average co-citation coupling
Average bibliographic coupling
Average closeness centrality
Diameter
Graph Density
This is an example of my dataset. The line starting with # is the name of the RNA class, the first column is the position, the letter is the base, and the third column is the pointer. I assume each base should be a node and each bond between nucleotide bases should be an edge, but I do not know how to do it. There are about 2,00 datasets that look like the one below; this one is from one of the RNA classes.
# RF00001_AF095839_1_346-228 5S_rRNA
1 G 118
2 C 117
3 G 116
4 U 115
5 A 114
6 C 113
7 G 112
8 G 111
9 C 110
10 C 0
11 A 0
12 U 0
13 A 0
14 C 0
15 U 0
16 A 0
17 U 0
18 G 0
19 G 36
20 G 35
21 G 34
22 A 33
23 A 0
24 U 0
25 A 0
26 C 0
27 A 0
28 C 0
29 C 0
30 U 0
31 G 0
32 A 0
33 U 22
34 C 21
35 C 20
36 C 19
37 G 0
38 U 106
39 C 105
40 C 104
41 G 103
42 A 0
43 U 0
44 U 0
45 U 0
46 C 0
47 A 0
48 G 0
49 A 0
50 A 0
51 G 0
52 U 0
53 U 0
54 A 0
55 A 0
56 G 67
57 C 66
58 C 65
59 U 64
60 C 0
61 A 0
62 U 0
63 C 0
64 A 59
65 G 58
66 G 57
67 C 56
68 A 0
69 U 0
70 C 0
71 C 0
72 U 0
73 A 0
74 A 0
75 G 0
76 U 0
77 A 0
78 C 0
79 U 0
80 A 0
81 G 96
82 G 95
83 G 94
84 U 93
85 G 92
86 G 91
87 G 0
88 C 0
89 G 0
90 A 0
91 C 86
92 C 85
93 A 84
94 C 83
95 C 82
96 U 81
97 G 0
98 G 0
99 G 0
100 A 0
101 A 0
102 C 0
103 C 41
104 G 40
105 G 39
106 A 38
107 U 0
108 G 0
109 U 0
110 G 9
111 C 8
112 U 7
113 G 6
114 U 5
115 A 4
116 C 3
117 G 2
118 C 1
119 U 0
I am using Ubuntu 18.04. I really hope someone can help me and guide me on using igraph python.
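As a possible first step, the bpseq text can be turned into a node list and an edge list with plain Python. This is a sketch under stated assumptions, not a definitive reader: the function name `parse_bpseq` is made up, positions are assumed to run consecutively from 1, a pointer of 0 is taken to mean "unpaired", and consecutive positions are joined by a backbone edge in addition to the base-pair edges (a modelling choice the question does not confirm).

```python
def parse_bpseq(text):
    """Parse bpseq text into (name, bases, edges) using 0-based node ids."""
    name = None
    bases = []     # bases[i] is the base letter at position i + 1
    pairs = set()  # base-pair edges stored as (smaller id, larger id)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            name = line.lstrip("#").strip()
            continue
        position, base, partner = line.split()
        position, partner = int(position), int(partner)
        bases.append(base)
        if partner != 0:  # assumption: 0 marks an unpaired base
            i, j = position - 1, partner - 1
            pairs.add((min(i, j), max(i, j)))
    # Assumption: consecutive positions are linked by a backbone edge.
    backbone = [(i, i + 1) for i in range(len(bases) - 1)]
    backbone_set = set(backbone)
    edges = backbone + sorted(p for p in pairs if p not in backbone_set)
    return name, bases, edges

# Tiny made-up record in the same layout as the dataset above.
sample = """# toy_example
1 G 6
2 C 5
3 A 0
4 A 0
5 G 2
6 C 1
"""
name, bases, edges = parse_bpseq(sample)
print(name, len(bases), edges)
```

The 0-based edge list has the shape that python-igraph's `Graph(n=..., edges=...)` constructor accepts, after which methods such as `diameter()` and `density()` can be called on the resulting graph.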

Dataframe pandas how to pass list as columns

I have two lists, such as:
list_columns = ['a','b','c','d','e','f','g','h','k','l','m','n']
and a list of values
list_values = [11,22,33,44,55,66,77,88,99,100, 111, 222]
I want to create a Pandas dataframe using list_columns as columns.
I tried with df = pd.DataFrame(list_values, columns=list_columns)
but it doesn't work
I get this error: ValueError: Shape of passed values is (1, 12), indices imply (12, 12)
A dataframe is a two-dimensional object. To reflect this, you need to feed a nested list. Each sublist, in this case the only sublist, represents a row.
df = pd.DataFrame([list_values], columns=list_columns)
print(df)
# a b c d e f g h k l m n
# 0 11 22 33 44 55 66 77 88 99 100 111 222
If you supply an index with length greater than 1, Pandas broadcasts for you:
df = pd.DataFrame([list_values], columns=list_columns, index=[0, 1, 2])
print(df)
# a b c d e f g h k l m n
# 0 11 22 33 44 55 66 77 88 99 100 111 222
# 1 11 22 33 44 55 66 77 88 99 100 111 222
# 2 11 22 33 44 55 66 77 88 99 100 111 222
If I understand your question correctly, just wrap list_values in brackets so it's a list of lists:
list_columns = ['a','b','c','d','e','f','g','h','k','l','m','n']
list_values = [[11,22,33,44,55,66,77,88,99,100, 111, 222]]
pd.DataFrame(list_values, columns=list_columns)
a b c d e f g h k l m n
0 11 22 33 44 55 66 77 88 99 100 111 222
From your list, you can do the following:
df = pd.DataFrame(list_values)
df=df.T
df.columns=list_columns
>>> df
a b c d e f g h k l m n
0 11 22 33 44 55 66 77 88 99 100 111 222
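The answers above all come down to how pandas reads the shape of its input. The short sketch below (with shortened lists for brevity) compares the shapes that a flat list and a nested list produce, and adds one more variant based on `dict(zip(...))`, which is not from the answers above.

```python
import pandas as pd

list_columns = ['a', 'b', 'c']
list_values = [11, 22, 33]

# A flat list is read as one column with three rows.
flat = pd.DataFrame(list_values)
print(flat.shape)       # (3, 1)

# A nested list is read as one row with three columns.
nested = pd.DataFrame([list_values], columns=list_columns)
print(nested.shape)     # (1, 3)

# Pairing names with values in a dict, wrapped in a list, also gives one row.
from_dict = pd.DataFrame([dict(zip(list_columns, list_values))])
print(from_dict.shape)  # (1, 3)
```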

pandas: enumerate items in each group

I have a DataFrame like
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 0
5 100 L 71 0
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 0
12 101 L 67 0
13 101 L 68 0
14 101 L 69 0
15 101 L 71 0
16 101 L 72 0
17 201 R 67 0
18 201 R 68 0
19 201 R 69 0
ord essentially gives the ordering of the entries when (prop, chi and id) all have the same value. This isn't quite what I'd like though. Instead, I'd like to be able to enumerate the entries of each group g in {(id, chi)} from 0 to n_g - 1, where n_g is the size of group g. So I'd like to obtain something that looks like
id chi prop count
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
I'd like to know if there's a simple way of doing this with pandas. The following comes very close, but it feels way too complicated, and it for some reason won't let me join the resulting dataframe with the original one.
(df.groupby(['id', 'chi'])
.apply(lambda g: np.arange(g.shape[0]))
.apply(pd.Series, 1)
.stack()
.rename('counter')
.reset_index()
.drop(columns=['level_2']))
EDIT: A second way of course is the for loop way, but I'm looking for something more "Pythonic" than:
for gname, idx in df.groupby(['id','chi']).groups.items():
    tmp = df.loc[idx]
    df.loc[idx, 'counter'] = np.arange(tmp.shape[0])
R has a very simple way of achieving this behaviour using the tidyverse packages, but I haven't quite found the well-oiled way to achieve the same thing with pandas. Any help provided is greatly appreciated!
cumcount
df.assign(ord=df.groupby(['id', 'chi']).cumcount())
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
defaultdict and count
from itertools import count
from collections import defaultdict
d = defaultdict(count)
df.assign(ord=[next(d[t]) for t in zip(df.id, df.chi)])
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
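The defaultdict and count variant may be easier to follow without a DataFrame. In this small sketch the `rows` list is made-up data standing in for the (id, chi) pairs: each new key gets its own counter starting at 0, and every later occurrence of that key takes the next number.

```python
from collections import defaultdict
from itertools import count

# Made-up (id, chi) keys; equal keys do not need to be adjacent.
rows = [(100, "L"), (100, "L"), (100, "R"), (100, "L"), (110, "R"), (100, "R")]

counters = defaultdict(count)  # a missing key gets a fresh count() starting at 0
numbering = [next(counters[key]) for key in rows]

print(numbering)  # [0, 1, 0, 2, 0, 1]
```

Because each counter only advances when its own key appears, the numbering follows row order within each group, which matches what cumcount produces.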

Change dataframe columns if column name exist in other dataframe, Python 3.6

I have a main data frame (DF) with below columns & data
C D E F G H I J K L QC
254 95 0 34543 43 32 4 4 4 4 Q23
255 59 1 43 tre r5 54 567 564 Q23
256 50 7 65 76557 65 65 5 5 Q23
And, mapping dataframe(MDF) with below columns
QC Res1 Res2 Res3 Res4 Res5 Res6 Res7 Res8 Res9 Res10
Q23 US CH JP CE OV NON DK TOT N KK
Q24 US ZZ JP ME KP NON DK TOT E LK
Here, column QC in both dataframe is for mapping.
I want to replace the DF column names with the values from the MDF row whose 'QC' matches the DF's 'QC' value (here Q23).
The column order is the same in both dataframes. I have 500 dataframes in total, and I want to update the columns of each one with the new names from the mapping dataframe.
Final Expected dataframe: DF
US CH JP CE OV NON DK TOT N KK QC
254 95 0 34543 43 32 4 4 4 4 Q23
255 59 1 43 tre r5 54 567 564 Q23
256 50 7 65 76557 65 65 5 5 Q23
This is a really challenging one.
You can use np.append after selecting the row of mdf that matches the dataframe's 'QC' value.
If you have dataframes like:
print(df1)
C D E F G H I J K L QC
0 254 95 0 34543 43 32.0 4 4 4 4 Q23
1 255 59 1 43 tre NaN r5 54 567 564 Q23
2 256 50 7 65 NaN 76557.0 65 65 5 5 Q23
print(df2)
C D E F G H I J K L QC
0 254 95 0 34543 43 32.0 4 4 4 4 Q24
1 255 59 1 43 tre NaN r5 54 567 564 Q24
2 256 50 7 65 NaN 76557.0 65 65 5 5 Q24
Then a for loop that assigns the columns would help, i.e.
for i in [df1, df2]:
    q = i['QC'].unique()[0]
    i.columns = np.append(mdf[mdf['QC'] == q].values[0][1:], ['QC'])
print([df1, df2])
[ US CH JP CE OV NON DK TOT N KK QC
0 254 95 0 34543 43 32.0 4 4 4 4 Q23
1 255 59 1 43 tre NaN r5 54 567 564 Q23
2 256 50 7 65 NaN 76557.0 65 65 5 5 Q23,
US ZZ JP ME KP NON DK TOT E LK QC
0 254 95 0 34543 43 32.0 4 4 4 4 Q24
1 255 59 1 43 tre NaN r5 54 567 564 Q24
2 256 50 7 65 NaN 76557.0 65 65 5 5 Q24]
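A variant of the same idea, sketched here with small stand-in frames, indexes the mapping frame by 'QC' once and renames by position. The helper name `rename_from_mapping` is made up, and the sketch assumes every frame holds a single 'QC' value and that its non-QC columns are in the same order as the Res columns of the mapping frame.

```python
import pandas as pd

# Small stand-ins for the mapping frame and two of the data frames.
mdf = pd.DataFrame({
    "QC": ["Q23", "Q24"],
    "Res1": ["US", "US"],
    "Res2": ["CH", "ZZ"],
    "Res3": ["JP", "JP"],
})
df1 = pd.DataFrame({"C": [254, 255], "D": [95, 59], "E": [0, 1], "QC": ["Q23", "Q23"]})
df2 = pd.DataFrame({"C": [254, 255], "D": [95, 59], "E": [0, 1], "QC": ["Q24", "Q24"]})

# Index the mapping by QC so each lookup is a single .loc call.
names_by_qc = mdf.set_index("QC")

def rename_from_mapping(frame, names_by_qc):
    """Return a copy of frame whose non-QC columns carry the names mapped to its QC value."""
    qc = frame["QC"].iloc[0]                    # assumption: one QC value per frame
    new_names = names_by_qc.loc[qc].tolist()    # e.g. ['US', 'CH', 'JP'] for Q23
    data_cols = [c for c in frame.columns if c != "QC"]
    return frame.rename(columns=dict(zip(data_cols, new_names)))

renamed = [rename_from_mapping(f, names_by_qc) for f in (df1, df2)]
for frame in renamed:
    print(frame)
```

Returning new frames instead of assigning to `columns` in place leaves the originals untouched, which can help when the same loop runs over several hundred frames.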
