I have a DataFrame like
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 0
5 100 L 71 0
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 0
12 101 L 67 0
13 101 L 68 0
14 101 L 69 0
15 101 L 71 0
16 101 L 72 0
17 201 R 67 0
18 201 R 68 0
19 201 R 69 0
ord essentially gives the ordering of the entries when (prop, chi and id) all have the same value. This isn't quite what I'd like, though. Instead, I'd like to enumerate the entries of each group g in {(id, chi)} from 0 to n_g - 1, where n_g is the size of group g. So I'd like to obtain something that looks like
id chi prop count
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
I'd like to know if there's a simple way of doing this with pandas. The following comes very close, but it feels way too complicated, and for some reason it won't let me join the resulting dataframe with the original one.
(df.groupby(['id', 'chi'])
   .apply(lambda g: np.arange(g.shape[0]))
   .apply(pd.Series, 1)
   .stack()
   .rename('counter')
   .reset_index()
   .drop(columns=['level_2']))
EDIT: A second way, of course, is the for-loop way, but I'm looking for something more "Pythonic" than:
for gname, idx in df.groupby(['id', 'chi']).groups.items():
    tmp = df.loc[idx]
    df.loc[idx, 'counter'] = np.arange(tmp.shape[0])
R has a very simple way of achieving this behaviour using the tidyverse packages, but I haven't quite found the well-oiled way to achieve the same thing with pandas. Any help provided is greatly appreciated!
cumcount
df.assign(ord=df.groupby(['id', 'chi']).cumcount())
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
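For reference, a minimal sketch (built from a few rows of the sample data) showing the same result assigned in place rather than through assign:
import pandas as pd

df = pd.DataFrame({
    'id':   [100, 100, 100, 100, 110],
    'chi':  ['L', 'L', 'R', 'R', 'R'],
    'prop': [67, 68, 68, 67, 70],
})
# cumcount numbers the rows of each (id, chi) group 0, 1, ..., n_g - 1
# in their original row order
df['ord'] = df.groupby(['id', 'chi']).cumcount()
print(df)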
defaultdict and count
from itertools import count
from collections import defaultdict
d = defaultdict(count)
df.assign(ord=[next(d[t]) for t in zip(df.id, df.chi)])
id chi prop ord
0 100 L 67 0
1 100 L 68 1
2 100 L 68 2
3 100 L 68 3
4 100 L 70 4
5 100 L 71 5
6 100 R 67 0
7 100 R 68 1
8 100 R 68 2
9 100 R 68 3
10 110 R 70 0
11 110 R 71 1
12 101 L 67 0
13 101 L 68 1
14 101 L 69 2
15 101 L 71 3
16 101 L 72 4
17 201 R 67 0
18 201 R 68 1
19 201 R 69 2
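The trick here is that defaultdict(count) hands every new (id, chi) key its own itertools.count() iterator, so next() restarts at 0 for each group. A minimal illustration:
from collections import defaultdict
from itertools import count

d = defaultdict(count)
print(next(d['a']), next(d['a']), next(d['b']), next(d['a']))  # 0 1 0 2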
I need to transform my table from a wide format to a long table. The table has measurements over time, say mass over time: m0, m1, m2, etc., so it looks like this:
ID | Age | m0 | m1 | m2 | m3
1 67 72 69 66 67
2 70 80 81 79 77
3 72 69 69 70 70
How I want it is:
ID | Age | time | m
1 67 0 72
1 67 1 69
1 67 2 66
1 67 3 67
2 70 0 80
2 70 1 81
2 70 2 79
2 70 3 77
...
I appreciate any help! Thank you in advance.
Cheers.
You can make use of pandas' melt method in this case:
result = df.melt(id_vars=['ID', 'Age'], value_vars=['m0', 'm1', 'm2', 'm3'])
result.columns = ['ID', 'Age', 'time', 'm']
result['time'] = result['time'].str.replace('m', '').astype(int)  # 'm0' -> 0, etc.
result = result.sort_values('Age').reset_index(drop=True)
print(result)
ID Age time m
0 1 67 0 72
1 1 67 1 69
2 1 67 2 66
3 1 67 3 67
4 2 70 0 80
5 2 70 1 81
6 2 70 2 79
7 2 70 3 77
8 3 72 0 69
9 3 72 1 69
10 3 72 2 70
11 3 72 3 70
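For reference, a slightly more compact variant (a sketch producing the same table) that names the new columns inside melt itself via its var_name and value_name parameters, and makes time numeric in the same chain:
result = (df.melt(id_vars=['ID', 'Age'], var_name='time', value_name='m')
            .assign(time=lambda d: d['time'].str.lstrip('m').astype(int))
            .sort_values('Age')
            .reset_index(drop=True))
print(result)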
Alternative method using pd.wide_to_long
result = pd.wide_to_long(df, stubnames=["m"], i=["ID", "Age"], j="").reset_index()
result.columns = ['ID', 'Age', 'time', 'm']
result = result.sort_values('Age').reset_index(drop=True)
print(result)
ID Age time m
0 1 67 0 72
1 1 67 1 69
2 1 67 2 66
3 1 67 3 67
4 2 70 0 80
5 2 70 1 81
6 2 70 2 79
7 2 70 3 77
8 3 72 0 69
9 3 72 1 69
10 3 72 2 70
11 3 72 3 70
If there are more variable groups like m, they can all be listed in stubnames (see the sketch after the documentation link below).
pd.wide_to_long documentation : https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
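To illustrate the multiple-stub case, here is a sketch with a hypothetical second measurement family h0, h1 alongside m0, m1; listing both stubs melts them in one call:
import pandas as pd

df2 = pd.DataFrame({
    'ID': [1, 2], 'Age': [67, 70],
    'm0': [72, 80], 'm1': [69, 81],      # first measurement family
    'h0': [170, 182], 'h1': [171, 181],  # hypothetical second family
})
long = (pd.wide_to_long(df2, stubnames=['m', 'h'], i=['ID', 'Age'], j='time')
          .reset_index())
print(long)  # columns: ID, Age, time, m, h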
I have a dataframe, and I want to group by its "First" and "Second" columns and then produce the expected output shown below:
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
print(df)
Output>
First Second Value_1 Value_2
0 a q 17 70
1 b e 44 47
2 c e 5 56
3 a e 23 58
4 b e 10 76
5 a q 11 67
6 b q 21 84
7 c q 42 67
8 b e 36 53
9 c q 16 63
When I group this DataFrame using groupby, I get the output below:
def func(arr, columns):
    return arr.sort_values(by=columns).drop(columns, axis=1)

df.groupby(['First', 'Second']).apply(func, columns=['First', 'Second'])
Value_1 Value_2
First Second
a e 3 23 58
q 0 17 70
5 11 67
b e 1 44 47
4 10 76
8 36 53
q 6 21 84
c e 2 5 56
q 7 42 67
9 16 63
However, I want the output below:
Expected output:
Value_1 Value_2
First Second
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
It's not necessary to print the "All" label; the point is to print the sum of all rows in each group.
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
First Second Value_1 Value_2
0 a q 4 69
1 b e 20 74
2 c e 13 82
3 a e 9 41
4 b e 11 79
5 a q 32 77
6 b q 6 75
7 c q 39 62
8 b e 26 80
9 c q 26 42
def lambda_t(x):
    df = x.sort_values(['First', 'Second']).drop(['First', 'Second'], axis=1)
    df.loc['all'] = df.sum()
    return df
df.groupby(['First','Second']).apply(lambda_t)
Value_1 Value_2
First Second
a e 3 9 41
all 9 41
q 0 4 69
5 32 77
all 36 146
b e 1 20 74
4 11 79
8 26 80
all 57 233
q 6 6 75
all 6 75
c e 2 13 82
all 13 82
q 7 39 62
9 26 42
all 65 104
You can try this:
Reset the index in your groupby result:
d1 = df.groupby(['First', 'Second']).apply(func, columns=['First', 'Second']).reset_index()
Then group by 'First' and 'Second' and sum the value columns:
d2 = d1.groupby(['First', 'Second']).sum().reset_index()
Create the 'level_2' column in the new dataframe and concatenate with the initial one to get the desired result:
d2.loc[:, 'level_2'] = 'All'
pd.concat([d1, d2], axis=0).sort_values(by=['First', 'Second'])
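Put together, a sketch of the steps above (the closing set_index only reproduces the grouped display; a stable sort keeps each 'All' row after its detail rows):
d1 = (df.groupby(['First', 'Second'])
        .apply(func, columns=['First', 'Second'])
        .reset_index())
d2 = d1.groupby(['First', 'Second']).sum().reset_index()
d2['level_2'] = 'All'
out = (pd.concat([d1, d2], axis=0)
         .sort_values(['First', 'Second'], kind='mergesort')  # mergesort is stable
         .set_index(['First', 'Second', 'level_2']))
print(out)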
Not sure about your function; however, you could chunk it into two steps:
Create an indexed dataframe, where you append the First and Second columns to the existing index:
df.index = df.index.astype(str).rename("Total")
indexed = df.set_index(["First", "Second"], append=True).reorder_levels(
    ["First", "Second", "Total"]
)
indexed
Value_1 Value_2
First Second Total
a q 0 17 70
b e 1 44 47
c e 2 5 56
a e 3 23 58
b e 4 10 76
a q 5 11 67
b q 6 21 84
c q 7 42 67
b e 8 36 53
c q 9 16 63
Create an aggregation, grouped by First and Second:
summary = (
    df.groupby(["First", "Second"])
    .sum()
    .assign(Total="All")
    .set_index("Total", append=True)
)
summary
Value_1 Value_2
First Second Total
a e All 23 58
q All 28 137
b e All 90 176
q All 21 84
c e All 5 56
q All 58 130
Combine indexed and summary dataframes:
pd.concat([indexed, summary]).sort_index(level=["First", "Second"])
Value_1 Value_2
First Second Total
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
I am new to igraph in Python; I have read the tutorial but I cannot understand it very well.
I need some help calculating the graph properties of my data. My data is in bpseq format, and I do not know how to read the file into igraph.
The graph properties that I need to get is
Articulation point
Average path length
Average node betweenness
Variance of node betweenness
Average edge betweenness
Variance of edge betweenness
Average co-citation coupling
Average bibliographic coupling
Average closeness centrality
Diameter
Graph Density
This is an example of my dataset. The # line gives the name of the RNA class, the first column is the position, the letter is the base, and the third column is the pointer (0 means the base is unpaired). Each base should be a node, and the bonds between the nucleotide bases should be the edges, but I do not know how to do it. There are about 2,00 datasets that look like the one below; this one is from one of the RNA classes.
# RF00001_AF095839_1_346-228 5S_rRNA
1 G 118
2 C 117
3 G 116
4 U 115
5 A 114
6 C 113
7 G 112
8 G 111
9 C 110
10 C 0
11 A 0
12 U 0
13 A 0
14 C 0
15 U 0
16 A 0
17 U 0
18 G 0
19 G 36
20 G 35
21 G 34
22 A 33
23 A 0
24 U 0
25 A 0
26 C 0
27 A 0
28 C 0
29 C 0
30 U 0
31 G 0
32 A 0
33 U 22
34 C 21
35 C 20
36 C 19
37 G 0
38 U 106
39 C 105
40 C 104
41 G 103
42 A 0
43 U 0
44 U 0
45 U 0
46 C 0
47 A 0
48 G 0
49 A 0
50 A 0
51 G 0
52 U 0
53 U 0
54 A 0
55 A 0
56 G 67
57 C 66
58 C 65
59 U 64
60 C 0
61 A 0
62 U 0
63 C 0
64 A 59
65 G 58
66 G 57
67 C 56
68 A 0
69 U 0
70 C 0
71 C 0
72 U 0
73 A 0
74 A 0
75 G 0
76 U 0
77 A 0
78 C 0
79 U 0
80 A 0
81 G 96
82 G 95
83 G 94
84 U 93
85 G 92
86 G 91
87 G 0
88 C 0
89 G 0
90 A 0
91 C 86
92 C 85
93 A 84
94 C 83
95 C 82
96 U 81
97 G 0
98 G 0
99 G 0
100 A 0
101 A 0
102 C 0
103 C 41
104 G 40
105 G 39
106 A 38
107 U 0
108 G 0
109 U 0
110 G 9
111 C 8
112 U 7
113 G 6
114 U 5
115 A 4
116 C 3
117 G 2
118 C 1
119 U 0
I am using Ubuntu 18.04. I really hope someone can help me and guide me on using igraph in Python.
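As a starting point, here is a minimal sketch (assuming exactly the bpseq layout shown above; the file name is hypothetical) of parsing one file into python-igraph and computing some of the listed properties:
import igraph as ig

def read_bpseq(path):
    # One vertex per base; backbone edges between consecutive positions;
    # base-pair edges from the pointer column (0 means unpaired).
    bases, pairs = [], []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip the RNA-class header line
            _pos, base, ptr = line.split()
            bases.append(base)
            pairs.append(int(ptr))
    g = ig.Graph(n=len(bases))
    g.vs['base'] = bases
    edges = [(i, i + 1) for i in range(len(bases) - 1)]              # backbone bonds
    edges += [(i, p - 1) for i, p in enumerate(pairs) if p > i + 1]  # each base pair once
    g.add_edges(edges)
    return g

g = read_bpseq('RF00001.bpseq')   # hypothetical file name
print(g.articulation_points())    # articulation points
print(g.average_path_length())    # average path length
print(g.betweenness())            # node betweenness (take mean/variance of this list)
print(g.edge_betweenness())       # edge betweenness
print(g.closeness())              # closeness centrality
print(g.diameter())               # diameter
print(g.density())                # graph density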
I have a data frame with two columns, 'xPos' and 'lineNum':
import pandas as pd
data = '''\
xPos lineNum
40 1
50 1
75 1
90 1
42 2
75 2
110 2
45 3
70 3
95 3
125 3
38 4
56 4
74 4'''
I have created the aggregate data frame for this using the command
aggrDF = df.describe(include='all')
and I am interested in the minimum of the xPos values. So I get it using
minxPos = aggrDF.loc['min']['xPos']
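(As an aside, the same number can be had without building the whole describe table; a one-line sketch:)
minxPos = df['xPos'].min()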
Desired output
data = '''\
xPos lineNum xDiff
40 1 2
50 1 10
75 1 25
90 1 15
42 2 4
75 2 33
110 2 35
45 3 7
70 3 25
95 3 25
125 3 30
38 4 0
56 4 18
74 4 18'''
The logic
I want to compare two consecutive rows of the data frame and calculate a new column based on this logic:
if df['lineNum'] != df['lineNum'].shift(1):
    df['xDiff'] = df['xPos'] - minxPos
else:
    df['xDiff'] = df['xPos'] - df['xPos'].shift(1)
Essentially, I want the new column to have the difference of the two consecutive rows in the df, as long as the line number is the same.
If the line number changes, then, the xDiff column should have the difference with the minimum xPos value that I have from the aggregate data frame.
Can you please help? Thanks!
These two lines should do it:
df['xDiff'] = df.groupby('lineNum').diff()['xPos']             # consecutive diff within each line
df.loc[df['xDiff'].isnull(), 'xDiff'] = df['xPos'] - minxPos   # first row of each line: offset from the minimum
>>> df
xPos lineNum xDiff
0 40 1 2.0
1 50 1 10.0
2 75 1 25.0
3 90 1 15.0
4 42 2 4.0
5 75 2 33.0
6 110 2 35.0
7 45 3 7.0
8 70 3 25.0
9 95 3 25.0
10 125 3 30.0
11 38 4 0.0
12 56 4 18.0
13 74 4 18.0
You just need to group by lineNum and apply the condition you already wrote down (note that the else branch here takes x['xPos'].shift(1) literally, which is why the output below differs from the desired output):
df['xDiff'] = np.concatenate(
    df.groupby('lineNum')
      .apply(lambda x: np.where(x['lineNum'] != x['lineNum'].shift(1),
                                x['xPos'] - x['xPos'].min(),
                                x['xPos'].shift(1)).astype(int))
      .values)
df
Out[76]:
xPos lineNum xDiff
0 40 1 0
1 50 1 40
2 75 1 50
3 90 1 75
4 42 2 0
5 75 2 42
6 110 2 75
7 45 3 0
8 70 3 45
9 95 3 70
10 125 3 95
11 38 4 0
12 56 4 38
13 74 4 56
If we have the following data:
X = pd.DataFrame({"t":[1,2,3,4,5],"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
X
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
How can I shift the data in a cyclical fashion so that the next step is:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
And then:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
etc.
This should also shift the index values with the row.
I know of pandas' X.shift(), but it doesn't wrap around cyclically.
You can combine reindex with np.roll:
X = X.reindex(np.roll(X.index, 1))
Another option is to combine concat with iloc:
shift = 1
X = pd.concat([X.iloc[-shift:], X.iloc[:-shift]])
The resulting output:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
Timings
Using the following setup to produce a larger DataFrame and functions for timing:
df = pd.concat([X]*10**5, ignore_index=True)

def root1(df, shift):
    return df.reindex(np.roll(df.index, shift))

def root2(df, shift):
    return pd.concat([df.iloc[-shift:], df.iloc[:-shift]])

def ed_chum(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)

def divakar1(df, shift):
    return df.iloc[np.roll(np.arange(df.shape[0]), shift)]

def divakar2(df, shift):
    idx = np.mod(np.arange(df.shape[0]) - 1, df.shape[0])
    for _ in range(shift):
        df = df.iloc[idx]
    return df
I get the following timings:
%timeit root1(df.copy(), 25)
10 loops, best of 3: 61.3 ms per loop
%timeit root2(df.copy(), 25)
10 loops, best of 3: 26.4 ms per loop
%timeit ed_chum(df.copy(), 25)
10 loops, best of 3: 28.3 ms per loop
%timeit divakar1(df.copy(), 25)
10 loops, best of 3: 177 ms per loop
%timeit divakar2(df.copy(), 25)
1 loop, best of 3: 4.18 s per loop
You can use np.roll in a custom func:
In [83]:
def roll(df, num):
    return pd.DataFrame(np.roll(df, num, axis=0), np.roll(df.index, num), columns=df.columns)
roll(X,1)
Out[83]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [84]:
roll(X,2)
Out[84]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
Here we return a DataFrame built from the rolled underlying array, with the index rolled as well.
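For context, np.roll simply rotates an array's elements, wrapping around at the ends. A minimal illustration:
import numpy as np

print(np.roll(np.arange(5), 1))   # [4 0 1 2 3]
print(np.roll(np.arange(5), -2))  # [2 3 4 0 1]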
You can use numpy.roll:
import numpy as np

nb_iterations = 3  # number of steps you want
for i in range(nb_iterations):
    for col in X.columns:
        X[col] = np.roll(X[col], 1)
Which is equivalent to:
for col in X.columns:
    X[col] = np.roll(X[col], nb_iterations)
Note that this rolls the values column by column but leaves the index in place; roll X.index as well if the index should move with the rows. Here is a link to the documentation of this useful function.
One approach would be creating such a shifted-down indexing array once and re-using it over and over to index into rows with .iloc, like so -
idx = np.mod(np.arange(X.shape[0])-1,X.shape[0])
X = X.iloc[idx]
Another way to create idx would be with np.roll: np.roll(np.arange(X.shape[0]), 1).
Sample run -
In [113]: X # Starting version
Out[113]:
A B C D E t
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
4 26 82 13 14 34 5
In [114]: idx = np.mod(np.arange(X.shape[0])-1,X.shape[0]) # Creating once
In [115]: X = X.iloc[idx] # Using idx
In [116]: X
Out[116]:
A B C D E t
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3
3 84 25 14 56 0 4
In [117]: X = X.iloc[idx] # Re-using idx
In [118]: X
Out[118]:
A B C D E t
3 84 25 14 56 0 4
4 26 82 13 14 34 5
0 34 54 56 0 78 1
1 12 87 78 23 12 2
2 78 35 0 72 31 3 ## and so on