pandas dataframe group columns based on name and apply a function - python

I have a dataframe:
df = [A B C D E_p0 E_p1 E_p2 K_p0 K_p1 K_2
a 2 r 4 3 6 1 9 5 1
e g 1 d 5 8 2 7 1 4]
And I want to group columns based on the prefix and aggregate them by a function, such as mean or max or rms.
So, for example if my function is max, the output is:
df = [A B C D E K
a 2 r 4 6 9
e g 1 d 8 7 ]

You can move the columns without the separator into the index, then group the remaining columns by prefix with a lambda function and apply an aggregate function such as max:
m = df.columns.str.contains('_')
df = (df.set_index(df.columns[~m].tolist())
        .groupby(lambda x: x.split('_')[0], axis=1)
        .max()
        .reset_index())
print (df)
A B C D E K
0 a 2 r 4 6 9
1 e g 1 d 8 7
Solution with custom function:
import numpy as np
def rms(x):
    # root mean square across the columns of one prefix group
    return np.sqrt(np.sum(x**2, axis=1)/len(x.columns))
m = df.columns.str.contains('_')
df1 = (df.set_index(df.columns[~m].tolist())
         .groupby(lambda x: x.split('_')[0], axis=1)
         .agg(rms)
         .reset_index())
print (df1)
A B C D E K
0 a 2 r 4 3.915780 5.972158
1 e g 1 d 5.567764 4.690416
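As far as I know, recent pandas releases deprecate groupby(..., axis=1), so here is a minimal sketch of the same idea that transposes first and groups on the row labels instead. The sample frame is a cut-down, hypothetical version of the one in the question (only one non-prefixed column), and the names wide and prefix are just for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the question
df = pd.DataFrame({
    "A": ["a", "e"],
    "E_p0": [3, 5], "E_p1": [6, 8], "E_p2": [1, 2],
    "K_p0": [9, 7], "K_p1": [5, 1], "K_2": [1, 4],
})

# Columns without the separator are moved into the index
has_prefix = df.columns.str.contains("_")
wide = df.set_index(df.columns[~has_prefix].tolist())

def prefix(name):
    """Return the part of a column name before the first underscore."""
    return name.split("_")[0]

# Transpose so the prefixed names become row labels, group them, transpose back
df_max = wide.T.groupby(prefix).max().T.reset_index()

# RMS: mean of the squares per prefix, then the square root
df_rms = ((wide ** 2).T.groupby(prefix).mean().T ** 0.5).reset_index()

print(df_max)
print(df_rms)
```

Transposing copies the data, so this is probably only reasonable for frames that fit comfortably in memory.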

Related

Pairwise matrix counts of two columns using pandas [duplicate]

I am trying to obtain pairwise counts of two column variables using pandas. I have a dataframe of two columns in the following format:
col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h
What I would like to get as output would be the following matrix of counts, for e.g.:
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
I am getting totally confused with pandas iterating over columns, rows, indexes and such. Appreciate some guidance here.
Pandas often has simple functions built in - in this case, you want crosstab:
pd.crosstab(dat['col1'], dat['col2'])
full code:
import pandas as pd
from io import StringIO
x = '''col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h'''
dat = pd.read_csv(StringIO(x), sep=r'\s+')
pd.crosstab(dat['col1'], dat['col2'])
You're looking for a crosstab:
count_matrix = pd.crosstab(index=df["col1"], columns=df["col2"])
print(count_matrix)
col2 e f g h
col1
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you don't like the column/index names (e.g. you still see "col1" and "col2"), you can remove them with rename_axis:
count_matrix = count_matrix.rename_axis(index=None, columns=None)
print(count_matrix)
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you want that all together in one snippet:
count_matrix = (pd.crosstab(index=df["col1"], columns=df["col2"])
.rename_axis(index=None, columns=None))
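For anyone curious what crosstab is doing here, the same counts can be built by hand with groupby, size and unstack. This is only a sketch on the data from the question; the name counts is my own.

```python
import pandas as pd

# The col1/col2 pairs from the question
col1 = list("abcdabcdabcdabcdb")
col2 = list("eghfghfefgghegghh")
df = pd.DataFrame({"col1": col1, "col2": col2})

# Count each (col1, col2) pair, then pivot col2 into the columns;
# pairs that never occur are filled with 0
counts = df.groupby(["col1", "col2"]).size().unstack(fill_value=0)
print(counts)
```

In practice pd.crosstab is shorter and clearer; the groupby version mainly helps if you want a different aggregation later.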

Rearranging table structure based on number of rows and columns pandas

I have the following data frame table. The table has the columns Id, columns, rows, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
Id columns rows 1 2 3 4 5 6 7 8 9
1 3 3 A B C D E F G H Z
2 3 2 I J K
By considering Id, the number of rows, and columns I would like to restructure the table as follows.
Id columns rows col_1 col_2 col_3
1 3 3 A B C
1 3 3 D E F
1 3 3 G H Z
2 3 2 I J K
2 3 2 - - -
Can anyone help to do this in Python Pandas?
Here's a solution using MultiIndex and .iterrows():
df
Id columns rows 1 2 3 4 5 6 7 8 9
0 1 3 3 A B C D E F G H Z
1 2 3 2 I J K None None None None None None
You can set n to any length, in your case 3:
n = 3
df = df.set_index(['Id', 'columns', 'rows'])
new_index = []
new_rows = []
for index, row in df.iterrows():
    max_cells = index[-1] * n  # 'rows' value times chunk width = number of cells to keep
    for i in range(0, len(row), n):
        if i >= max_cells:  # requested number of rows reached, stop appending
            break
        new_index.append(index)
        new_rows.append(row.values[i:i+n])
df2 = pd.DataFrame(new_rows, index=pd.MultiIndex.from_tuples(new_index))
df2
0 1 2
1 3 3 A B C
3 D E F
3 G H Z
2 3 2 I J K
2 None None None
And if you are keen on getting your old index and headers back:
new_headers = ['Id', 'columns', 'rows'] + list(range(1, n+1))
df2.reset_index().set_axis(new_headers, axis=1)
Id columns rows 1 2 3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 None None None
Using melt and str.split with floor division against your index to create groups of 3.
s = pd.melt(df, id_vars=['Id', 'columns', 'rows'])
s1 = (
    s.sort_values(["Id", "variable"])
    .assign(idx=s.index // 3)
    .fillna("-")
    .groupby(["idx", "Id"])
    .agg(
        columns=("columns", "first"), rows=("rows", "first"), value=("value", ",".join)
    )
)
s2 = s1["value"].str.split(",", expand=True).rename(
    columns=dict(zip(s1["value"].str.split(",", expand=True).columns,
                     [f'col_{i+1}' for i in range(s1["value"].str.split(',').apply(len).max())]
    ))
)
df1 = pd.concat([s1.drop('value', axis=1), s2], axis=1)
print(df1)
columns rows col_1 col_2 col_3
idx Id
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -
5 2 3 2 - - -
I modified unutbu's solution: for each row, create an array with the expected length (rows * columns), build a DataFrame from it, then collect the DataFrames in a list comprehension and join them with concat:
import numpy as np
def f(x):
    c, r = x.name[1], x.name[2]
    #print (c, r)
    arr = np.empty(c * r, dtype='O')
    vals = x.iloc[:len(arr)]
    arr[:len(vals)] = vals
    idx = pd.MultiIndex.from_tuples([x.name] * r, names=df.columns[:3])
    cols = [f'col_{c+1}' for c in range(c)]
    return pd.DataFrame(arr.reshape((r, c)), index=idx, columns=cols).fillna('-')
df1 = (pd.concat([x for x in df.set_index(['Id', 'columns', 'rows'])
                                .apply(f, axis=1)])
         .reset_index())
print (df1)
Id columns rows col_1 col_2 col_3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -
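If the pandas-heavy versions above are hard to follow, here is a plain-loop sketch of the same reshaping. It assumes the frame looks like the one in the question (value columns start at position 3) and that padding every chunk to the largest "columns" value is acceptable; the names width, records and out are mine.

```python
import pandas as pd

# The table from the question; missing cells are None
df = pd.DataFrame(
    [[1, 3, 3, "A", "B", "C", "D", "E", "F", "G", "H", "Z"],
     [2, 3, 2, "I", "J", "K", None, None, None, None, None, None]],
    columns=["Id", "columns", "rows"] + list(range(1, 10)),
)

width = int(df["columns"].max())  # widest output row (assumption: pad to this)
records = []
for _, row in df.iterrows():
    n_cols, n_rows = int(row["columns"]), int(row["rows"])
    cells = row.iloc[3:].tolist()  # everything after Id, columns, rows
    for r in range(n_rows):
        # Take the r-th chunk of n_cols cells and pad it to the common width
        chunk = cells[r * n_cols:(r + 1) * n_cols]
        chunk += [None] * (width - len(chunk))
        filled = ["-" if pd.isna(v) else v for v in chunk]
        records.append([row["Id"], n_cols, n_rows] + filled)

out = pd.DataFrame(
    records,
    columns=["Id", "columns", "rows"] + [f"col_{i + 1}" for i in range(width)],
)
print(out)
```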

insert a list as row in a dataframe at a specific position

I have a list l=['a', 'b' ,'c']
and a dataframe with columns d, e, f whose values are all numbers.
How can I insert list l in my dataframe just below the columns?
Setup
df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')
df
d e f
0 1 1 1
1 1 1 1
Option 1
I'd accomplish this task by adding a level to the columns object
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, l)))
df
d e f
a b c
0 1 1 1
1 1 1 1
Option 2
Use a dictionary comprehension passed to the dataframe constructor
pd.DataFrame({(i, j): df[i] for i, j in zip(df, l)})
d e f
a b c
0 1 1 1
1 1 1 1
But if you insist on putting it in the dataframe proper... (keep in mind, this turns the dataframe into dtype object and we lose significant computational efficiencies.)
Alternative 1
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat instead
pd.DataFrame([l], columns=df.columns).append(df, ignore_index=True)
d e f
0 a b c
1 1 1 1
2 1 1 1
Alternative 2
pd.DataFrame([l] + df.values.tolist(), columns=df.columns)
d e f
0 a b c
1 1 1 1
2 1 1 1
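To make the dtype remark concrete, here is a small sketch comparing the two approaches on the Setup frame. The names as_header and as_row are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list("def"))
l = list("abc")

# Option 1 style: the list becomes a second header level, the data stays numeric
as_header = df.copy()
as_header.columns = pd.MultiIndex.from_arrays([df.columns, l])

# Alternative 2 style: the list becomes a data row, every column turns into object
as_row = pd.DataFrame([l] + df.values.tolist(), columns=df.columns)

print(as_header.dtypes)
print(as_row.dtypes)
```

With the header version, numeric operations such as as_header.sum() still work on the data as before.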
Use pd.concat
In [1112]: df
Out[1112]:
d e f
0 0.517243 0.731847 0.259034
1 0.318821 0.551298 0.773115
2 0.194192 0.707525 0.804102
3 0.945842 0.614033 0.757389
In [1113]: pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
Out[1113]:
d e f
0 a b c
1 0.517243 0.731847 0.259034
2 0.318821 0.551298 0.773115
3 0.194192 0.707525 0.804102
4 0.945842 0.614033 0.757389
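Are you looking for append? For example (note that DataFrame.append was removed in pandas 2.0, so on newer versions use pd.concat instead):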
df = pd.DataFrame([[1,2,3]],columns=list('def'))
I = ['a','b','c']
ndf = df.append(pd.Series(I,index=df.columns.tolist()),ignore_index=True)
Output:
d e f
0 1 2 3
1 a b c
If you want to add the list to the columns as a MultiIndex:
df.columns = [df.columns, l]
print (df)
d e f
a b c
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
print (df.columns)
MultiIndex(levels=[['d', 'e', 'f'], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]])
If you want to add the list at a specific position pos:
pos = 0
df1 = pd.DataFrame([l], columns=df.columns)
print (df1)
d e f
0 a b c
df = pd.concat([df.iloc[:pos], df1, df.iloc[pos:]], ignore_index=True)
print (df)
d e f
0 a b c
1 4 7 1
2 5 8 3
3 4 9 5
4 5 4 7
5 5 2 1
6 4 3 0
But if you add this list to a numeric dataframe, you get mixed types (numbers together with strings), so some pandas functions may fail.
Setup:
df = pd.DataFrame({'d':[4,5,4,5,5,4],
'e':[7,8,9,4,2,3],
'f':[1,3,5,7,1,0]})
print (df)
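The position-based insert can be wrapped in a small helper. This is only a sketch: insert_row is a hypothetical name, not a pandas function, and it relies on pd.concat rather than the removed DataFrame.append.

```python
import pandas as pd

def insert_row(df, values, pos):
    """Return a new frame with `values` inserted as a row at position `pos`."""
    new_row = pd.DataFrame([values], columns=df.columns)
    # Split the frame at pos and put the new row in between
    return pd.concat([df.iloc[:pos], new_row, df.iloc[pos:]], ignore_index=True)

df = pd.DataFrame({"d": [4, 5], "e": [7, 8], "f": [1, 3]})
out = insert_row(df, ["a", "b", "c"], pos=1)
print(out)
print(out.dtypes)  # the columns now hold mixed values
```

pos=0 puts the list directly below the column headers, and pos=len(df) appends it at the end.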

renaming pandas dataframe using values in columns taking into account repetitions

I have a dataframe df
df
Name
0 A
1 A
2 B
3 B
4 C
5 D
6 E
7 F
8 G
9 H
How can I rename the indices of the dataframe so that
df
Name
0_A A
1_A A
0_B B
1_B B
0_C C
0_D D
0_E E
0_F F
0_G G
0_H H
Basically I would like to use the values in the column "Name" and restart the numbering every time the value changes.
Use groupby with cumcount; more possible solutions for the concatenation are in the previous answer:
print (df.groupby('Name').cumcount().astype(str))
0 0
1 1
2 0
3 1
4 0
5 0
6 0
7 0
8 0
9 0
dtype: object
df.index = df.groupby('Name').cumcount().astype(str) + '_' + df['Name']
print (df)
Name
0_A A
1_A A
0_B B
1_B B
0_C C
0_D D
0_E E
0_F F
0_G G
0_H H
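One caveat: cumcount numbers rows per Name value across the whole frame. If the same name can come back later and the counter should restart on every change (as the question wording suggests), a sketch like the following may be closer. The sample data and the names run_id, per_run and per_group are hypothetical.

```python
import pandas as pd

# Hypothetical data where "A" appears again after "B"
df = pd.DataFrame({"Name": ["A", "A", "B", "A"]})

# A new run starts whenever the value differs from the previous row
run_id = (df["Name"] != df["Name"].shift()).cumsum()

per_run = df.groupby(run_id).cumcount()      # restarts on every change
per_group = df.groupby("Name").cumcount()    # keeps counting per name

labels = per_run.astype(str) + "_" + df["Name"]
print(labels.tolist())
```

Note that labels built per run are not guaranteed to be unique, so check that before using them as an index.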

Convert N by N Dataframe to 3 Column Dataframe

I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n Dataframe where:
1) The index represents people's names
2) The column headers are the same people's names in the same order
3) Each cell of the Dataframe is the average number of times they email each other each day.
How would I transform that Dataframe into a Dataframe with 3 columns, where:
1) Column 1 would be the index of the n by n Dataframe
2) Column 2 would be the column headers of the n by n Dataframe
3) Column 3 would be the cell value corresponding to those two names from the index, column header combination from the n by n Dataframe
Edit
Apologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, from the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
##df of all relationships to build
flds = pd.Series(SO_df.fld1.unique())
flds = pd.Series(flds.append(pd.Series(SO_df.fld2.unique())).unique())
combos = []
for L in range(0, len(flds)+1):
    for subset in permutations(flds, L):
        if len(subset) == 2:
            combos.append(subset)
        if len(subset) > 2:
            break
rel_df = pd.DataFrame.from_records(data = combos, columns = ['fld1','fld2'])
rel_df['value'] = [1,4,5,7,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
print df1
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove duplicates (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4
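Putting the stack approach together in one chain, here is a sketch in Python 3 syntax on a smaller, hypothetical 3 by 3 frame. The column names fld1, fld2 and value follow the question; long_df is my own name.

```python
import pandas as pd

# Hypothetical 3 x 3 version of df1 from the question
df1 = pd.DataFrame(
    {"a": [0, 1, 2], "b": [1, 0, 2], "c": [4, 1, 0]},
    index=["a", "b", "c"],
)

long_df = (
    df1.rename_axis(index="fld1", columns="fld2")  # name both axes up front
       .stack()                                    # one row per (fld1, fld2) pair
       .rename("value")
       .reset_index()
)

# Drop the diagonal, i.e. pairs of a name with itself
long_df = long_df[long_df["fld1"] != long_df["fld2"]].reset_index(drop=True)
print(long_df)
```

Naming the axes before stacking means no column renaming is needed afterwards.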
