How to convert column names into column values in pandas (Python)

df=pd.DataFrame(index=['x','y'], data={'a':[1,2],'b':[3,4]})
How can I convert column names into the values of a column? This is my desired output:
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b

You can use:
print (df.T.unstack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])

   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
Or:
print (df.stack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])

   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
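For a self-contained check, the `stack` approach can be written end to end. Naming the columns axis with `rename_axis` up front is an optional variation that avoids renaming `level_1` afterwards:

```python
import pandas as pd

df = pd.DataFrame(index=['x', 'y'], data={'a': [1, 2], 'b': [3, 4]})

# Naming the columns axis before stacking gives the new index level
# a proper name, so no rename of 'level_1' is needed afterwards.
out = (df.rename_axis(columns='c2')
         .stack()
         .rename('c1')
         .reset_index(level='c2')[['c1', 'c2']])
print(out)
```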

Try this:
In [279]: df.stack().reset_index().set_index('level_0').rename(columns={'level_1':'c2', 0:'c1'})
Out[279]:
        c2  c1
level_0
x        a   1
x        b   3
y        a   2
y        b   4

Try:
df1 = df.stack().reset_index(-1).iloc[:, ::-1]
df1.columns = ['c1', 'c2']
df1

In [62]: (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
            .set_index('index'))
Out[62]:
      c2  c1
index
x      a   1
y      a   2
x      b   3
y      b   4
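Note that `melt` orders rows by column rather than by index. To match the row order of the desired output, the melted frame can be sorted on the index with a stable sort, a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame(index=['x', 'y'], data={'a': [1, 2], 'b': [3, 4]})

# A stable sort on the index keeps 'a' before 'b' within each index label.
out = (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
         .set_index('index')
         .sort_index(kind='stable')[['c1', 'c2']])
print(out)
```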

Related

group rows based on partial strings from two columns and sum values

df = pd.DataFrame({'c1':['Ax','Ay','Bx','By'], 'c2':['Ay','Ax','By','Bx'], 'c3':[1,2,3,4]})
c1 c2 c3
0 Ax Ay 1
1 Ay Ax 2
2 Bx By 3
3 By Bx 4
I'd like to sum the c3 values by aggregating the same xy combinations from the c1 and c2 columns.
The expected output is
c1 c2 c3
0 x y 4 #[Ax Ay] + [Bx By]
1 y x 6 #[Ay Ax] + [By Bx]
You can group by the c1 and c2 values without their first letters and aggregate c3 with sum:
df = df.groupby([df.c1.str[1:], df.c2.str[1:]])['c3'].sum().reset_index()
print (df)
  c1 c2  c3
0  x  y   4
1  y  x   6
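Putting the setup and the groupby together as a runnable sketch (selecting `c3` explicitly so only the numeric column is summed):

```python
import pandas as pd

df = pd.DataFrame({'c1': ['Ax', 'Ay', 'Bx', 'By'],
                   'c2': ['Ay', 'Ax', 'By', 'Bx'],
                   'c3': [1, 2, 3, 4]})

# Group by the suffix (everything after the first letter) of c1 and c2,
# then sum c3 within each suffix pair.
out = df.groupby([df.c1.str[1:], df.c2.str[1:]])['c3'].sum().reset_index()
print(out)
```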

Adjust column position after split

I have a column that is positioned in the middle of a dataframe. I need to split it into multiple columns, and replace it with the new columns. I'm able to do it with the following code:
df = df.join(df[col_to_split].str.split(', ', expand=True).add_prefix(col_to_split + '_'))
However, the new columns are placed at the end of the dataframe, rather than replacing the original column. I need a way to place the new columns at the same position of original columns.
Note that I don't want to manually order ALL columns (i.e. df = df[[c1, c2, c3 ... cn]]) for several reasons: it's not known how many new columns will be generated, and the dataframe contains hundreds of columns.
Sample data:
c1 c2 c3 col_to_split c4 c5 ... cn
1 a b 1,5,3 1 1 ... 1
2 a c 5,10 3 3 ... 4
3 z c 3 2 3 ... 4
Desired output:
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 ... cn
1 a b 1 5 3 1 1 ... 1
2 a c 5 10 3 3 ... 4
3 z c 3 2 3 ... 4
The idea is to use your solution and dynamically insert df1.columns into the original column list with the cols[pos:pos] trick; the position of the original column is found by Index.get_loc:
col_to_split = 'col_to_split'
cols = df.columns.tolist()
pos = df.columns.get_loc(col_to_split)
df1 = df[col_to_split].str.split(',', expand=True).fillna("").add_prefix(col_to_split + '_')
cols[pos:pos] = df1.columns.tolist()
cols.remove(col_to_split)
print (cols)
['c1', 'c2', 'c3', 'col_to_split_0', 'col_to_split_1', 'col_to_split_2',
'c4', 'c5', 'cn']
df = df.join(df1).reindex(cols, axis=1)
print (df)
   c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2  c4  c5  cn
0   1  a  b              1              5              3   1   1   1
1   2  a  c              5             10                  3   3   4
2   3  z  c              3                                 2   3   4
A similar solution that joins the column-name lists:
col_to_split = 'col_to_split'
pos = df.columns.get_loc(col_to_split)
df1 = df[col_to_split].str.split(",", expand=True).fillna("").add_prefix(col_to_split + '_')
cols = df.columns.tolist()
cols = cols[:pos] + df1.columns.tolist() + cols[pos+1:]
print(cols)
['c1', 'c2', 'c3', 'col_to_split_0', 'col_to_split_1', 'col_to_split_2',
'c4', 'c5', 'cn']
df = df.join(df1).reindex(cols, axis=1)
print (df)
   c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2  c4  c5  cn
0   1  a  b              1              5              3   1   1   1
1   2  a  c              5             10                  3   3   4
2   3  z  c              3                                 2   3   4
We can wrap this operation into a function:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""c1 c2 c3 col_to_split c4 c5 cn
1 a b 1,5,3 1 1 1
2 a c 5,10 3 3 4
3 z c 3 2 3 4"""), sep=r"\s+")

def split_by_col(df, colname):
    pos = df.columns.tolist().index(colname)
    df_tmp = df[colname].str.split(",", expand=True).fillna("")
    df_tmp.columns = [colname + "_" + str(i) for i in range(len(df_tmp.columns))]
    return pd.concat([df.iloc[:, :pos], df_tmp, df.iloc[:, pos+1:]], axis=1)
With an example:
>>> split_by_col(df, "col_to_split")
   c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2  c4  c5  cn
0   1  a  b              1              5              3   1   1   1
1   2  a  c              5             10                  3   3   4
2   3  z  c              3                                 2   3   4
Try this:
df = df.join(df[col_to_split].str.split(', ', expand=True).add_prefix(col_to_split + '_'))
df = df[["c1", "c2", "c3", "col_to_split_0", "col_to_split_1", "col_to_split_2", "c4", "c5", ..., "cn"]]
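For completeness, the insert-in-place approach can be tested end to end. This sketch recreates the sample frame (an assumption, since the question shows the data only as text) and uses a one-step slice assignment `cols[pos:pos+1] = ...` to splice the new names over the old one, instead of a separate insert and remove:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': ['a', 'a', 'z'], 'c3': ['b', 'c', 'c'],
                   'col_to_split': ['1,5,3', '5,10', '3'],
                   'c4': [1, 3, 2], 'c5': [1, 3, 3], 'cn': [1, 4, 4]})

col = 'col_to_split'
pos = df.columns.get_loc(col)                       # position of the column to replace
parts = df[col].str.split(',', expand=True).fillna('').add_prefix(col + '_')

cols = df.columns.tolist()
cols[pos:pos + 1] = parts.columns.tolist()          # splice new names over the old one
out = df.join(parts)[cols]
print(out)
```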

Mean between n and n+1 row in pandas groupby object?

I have a groupby object:
col1 col2 x y z
0 A D1 0.269002 0.131740 0.401020
1 B D1 0.201159 0.072912 0.775171
2 D D1 0.745292 0.725807 0.106000
3 F D1 0.270844 0.214708 0.935534
4 C D1 0.997799 0.503333 0.250536
5 E D1 0.851880 0.921189 0.085515
How do I sort the groupby object into the following:
col1 col2 x y z
0 A D1 0.269002 0.131740 0.401020
1 B D1 0.201159 0.072912 0.775171
4 C D1 0.997799 0.503333 0.250536
2 D D1 0.745292 0.725807 0.106000
5 E D1 0.851880 0.921189 0.085515
3 F D1 0.270844 0.214708 0.935534
And then compute the means between Row A {x, y, z} and Row B {x, y, z}, Row B {x, y, z} and Row C {x, y, z}... such that I have:
col1 col2 x_mean y_mean z_mean
0 A-B D1 0.235508 0.102326 0.58809
1 B-C D1 ... ... ...
4 C-D D1 ... ... ...
2 D-E D1 ... ... ...
5 E-F D1 ... ... ...
3 F-A D1 ... ... ...
I am basically trying to computationally find the midpoints between vertices of a hexagonal structure (well... more like 10 million). Hints appreciated!
I believe you need groupby with rolling and mean aggregation; then, for the pair labels, use shift, and delete the first NaN row of each group:
print (df)
col1 col2 x y z
0 A D1 0.269002 0.131740 0.401020
1 B D1 0.201159 0.072912 0.775171
2 D D1 0.745292 0.725807 0.106000
3 F D2 0.270844 0.214708 0.935534 <-change D1 to D2
4 C D2 0.997799 0.503333 0.250536 <-change D1 to D2
5 E D2 0.851880 0.921189 0.085515 <-change D1 to D2
df = (df.sort_values(['col1','col2'])
        .set_index('col1')
        .groupby('col2')[['x','y','z']]
        .rolling(2)
        .mean()
        .reset_index())
df['col1'] = df.groupby('col2')['col1'].shift() + '-' + df['col1']
df = df.dropna(subset=['col1','x','y','z'], how='all')
#alternative: keep only rows that are not the first of each group
#df = df[df['col2'].duplicated()]
print (df)
  col2 col1         x         y         z
1   D1  A-B  0.235081  0.102326  0.588095
2   D1  B-D  0.473226  0.399359  0.440586
4   D2  C-E  0.924840  0.712261  0.168026
5   D2  E-F  0.561362  0.567948  0.510524
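An alternative sketch without `rolling`, using the question's original all-D1 data: after sorting, average each row with the next one via `shift(-1)` per group. This drops the last row of each group; the wrap-around F-A pair from the question would need one extra step:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('ABDFCE'),
                   'col2': ['D1'] * 6,
                   'x': [0.269002, 0.201159, 0.745292, 0.270844, 0.997799, 0.851880],
                   'y': [0.131740, 0.072912, 0.725807, 0.214708, 0.503333, 0.921189],
                   'z': [0.401020, 0.775171, 0.106000, 0.935534, 0.250536, 0.085515]})

df = df.sort_values(['col2', 'col1']).reset_index(drop=True)
g = df.groupby('col2')
nxt = g[['x', 'y', 'z']].shift(-1)                 # next vertex within the same group
mid = (df[['x', 'y', 'z']] + nxt) / 2              # midpoint of consecutive vertices
mid['col1'] = df['col1'] + '-' + g['col1'].shift(-1)
mid['col2'] = df['col2']
mid = mid.dropna(subset=['col1'])                  # last row of each group has no pair
print(mid)
```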

Pandas randomly select n groups from a larger dataset

If I have a dataframe with groups like so
val label
x A
x A
x B
x B
x C
x C
x D
x D
how can I randomly pick out n groups without replacement?
You can use np.random.choice with loc:
N = 3
vals = np.random.choice(df['label'].unique(), N, replace=False)
print (vals)
['C' 'A' 'B']
df = df.set_index('label').loc[vals].reset_index()
print (df)
  label val
0     C  x5
1     C  x6
2     A  x1
3     A  x2
4     B  x3
5     B  x4
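An equivalent sketch using `isin`, which keeps the original row order and avoids the index round-trip (the `val` values here are assumed, since the question's column is only schematic):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'],
                   'label': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D']})

rng = np.random.default_rng(0)                     # seeded for reproducibility
n = 3
picked = rng.choice(df['label'].unique(), size=n, replace=False)
out = df[df['label'].isin(picked)]                 # rows of the n sampled groups
print(out)
```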

merge two dataframe based on specific column information

I am handling dataframes in several ways, and now I'd like to merge two dataframes based on specific column information and delete duplicated rows. Is it possible?
I tried to use the concat function but failed.
For example, I want to merge df1 and df2 into d3 with these conditions:
if the c1 & c2 information is the same, drop the duplicated row (keep only df1's row, even if the c3 data differs between df1 and df2)
if the c1 & c2 information is different, keep both rows (from df1 and df2)
before:
df1
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
df2
c1 c2 c3
0 0 x {'a':11 ,'b':12}
1 0 y {'a':13 ,'b':14}
2 3 z {'a':15 ,'b':16}
expected result d3:
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
3 3 z {'a':15 ,'b':16}
You can do this by first determining which rows are only in df2, using merge with how='right' and indicator=True, then concat the result with df1:
In [125]:
merged = df1.merge(df2, left_on=['c1','c2'], right_on=['c1','c2'], how='right', indicator=True)
merged = merged[merged['_merge']=='right_only']
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[125]:
c1 c2 c3_x c3 _merge
2 3 z NaN {'a':15 ,'b':16} right_only
In [126]:
combined = pd.concat([df1, merged[df1.columns]])
combined
Out[126]:
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
2 3 z {'a':15 ,'b':16}
If we break down the above:
In [128]:
merged = df1.merge(df2, left_on=['c1','c2'], right_on=['c1','c2'], how='right', indicator=True)
merged
Out[128]:
c1 c2 c3_x c3_y _merge
0 0 x {'a':1 ,'b':2} {'a':11 ,'b':12} both
1 0 y {'a':3 ,'b':4} {'a':13 ,'b':14} both
2 3 z NaN {'a':15 ,'b':16} right_only
In [129]:
merged = merged[merged['_merge']=='right_only']
merged
Out[129]:
c1 c2 c3_x c3_y _merge
2 3 z NaN {'a':15 ,'b':16} right_only
In [130]:
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[130]:
c1 c2 c3_x c3 _merge
2 3 z NaN {'a':15 ,'b':16} right_only
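A shorter alternative for the same rule (keep df1's row when a c1/c2 pair appears in both frames): concatenate and drop duplicates on the key columns, keeping the first occurrence, which comes from df1. A sketch with the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [0, 0, 2], 'c2': ['x', 'y', 'z'],
                    'c3': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]})
df2 = pd.DataFrame({'c1': [0, 0, 3], 'c2': ['x', 'y', 'z'],
                    'c3': [{'a': 11, 'b': 12}, {'a': 13, 'b': 14}, {'a': 15, 'b': 16}]})

# df1 comes first in the concat, so keep='first' prefers its rows
# whenever a (c1, c2) key occurs in both frames.
d3 = (pd.concat([df1, df2], ignore_index=True)
        .drop_duplicates(subset=['c1', 'c2'], keep='first')
        .reset_index(drop=True))
print(d3)
```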
