I want to convert every three rows of a DataFrame into columns.
Input:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,11,12,13],'b':['a','b','c','aa','bb','cc']})
print(df)
Output:
a b
0 1 a
1 2 b
2 3 c
3 11 aa
4 12 bb
5 13 cc
Expected:
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
Use set_index with keys built from floor division and modulo by 3, then unstack and flatten the MultiIndex:
import numpy as np

a = np.arange(len(df))
# if the frame has a default RangeIndex, you can use it directly:
# a = df.index
df1 = df.set_index([a // 3, a % 3]).unstack()
# Python 3.6+ solution (f-strings)
df1.columns = [f'{i}{j + 1}' for i, j in df1.columns]
# Python below 3.6
# df1.columns = ['{}{}'.format(i, j + 1) for i, j in df1.columns]
print (df1)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
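For reference, the same idea generalizes to any chunk size n. This is a sketch of my own, not part of the original answer; rows_to_cols is a hypothetical helper name:

import numpy as np
import pandas as pd

def rows_to_cols(df, n):
    # Keys: chunk number (a // n) and position within the chunk (a % n).
    a = np.arange(len(df))
    out = df.set_index([a // n, a % n]).unstack()
    # Flatten the MultiIndex columns: ('a', 0) -> 'a1', ('a', 1) -> 'a2', ...
    out.columns = [f'{c}{j + 1}' for c, j in out.columns]
    return out

# rows_to_cols(df, 3) reproduces the df1 above.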
I'm adding a different approach with groupby -> apply.
df is first grouped by df.index//3 and then the munge function is applied to each group.
def munge(group):
    g = group.T.stack()
    g.index = ['{}{}'.format(c, i % 3 + 1) for i, (c, _) in enumerate(g.index)]
    return g
result = df.groupby(df.index//3).apply(munge)
Output:
>>> df.groupby(df.index//3).apply(munge)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
Related
I have a column that is positioned in the middle of a dataframe. I need to split it into multiple columns, and replace it with the new columns. I'm able to do it with the following code:
df = df.join(df[col_to_split].str.split(', ', expand=True).add_prefix(col_to_split + '_'))
However, the new columns are placed at the end of the dataframe, rather than replacing the original column. I need a way to place the new columns at the same position of original columns.
Note that I don't want to manually order ALL columns (i.e. df = df[[c1, c2, c3 ... cn]]) for several reasons, e.g. it's not known how many new columns will be generated, and the dataframe contains hundreds of columns.
Sample data:
c1 c2 c3 col_to_split c4 c5 ... cn
1 a b 1,5,3 1 1 ... 1
2 a c 5,10 3 3 ... 4
3 z c 3 2 3 ... 4
Desired output:
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 ... cn
1 a b 1 5 3 1 1 ... 1
2 a c 5 10 3 3 ... 4
3 z c 3 2 3 ... 4
The idea is to use your solution and dynamically insert df1.columns into the original column list with the cols[pos:pos] slice-assignment trick; the position of the original column is found with Index.get_loc:
col_to_split = 'col_to_split'
cols = df.columns.tolist()
pos = df.columns.get_loc(col_to_split)
df1 = df[col_to_split].str.split(',', expand=True).fillna("").add_prefix(col_to_split + '_')
cols[pos:pos] = df1.columns.tolist()
cols.remove(col_to_split)
print (cols)
['c1', 'c2', 'c3', 'col_to_split_0', 'col_to_split_1', 'col_to_split_2',
'c4', 'c5', 'cn']
df = df.join(df1).reindex(cols, axis=1)
print (df)
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 cn
0 1 a b 1 5 3 1 1 1
1 2 a c 5 10 3 3 4
2 3 z c 3 2 3 4
A similar solution that joins the column-name lists directly:
col_to_split = 'col_to_split'
pos = df.columns.get_loc(col_to_split)
df1 = df[col_to_split].str.split(",", expand=True).fillna("").add_prefix(col_to_split + '_')
cols = df.columns.tolist()
cols = cols[:pos] + df1.columns.tolist() + cols[pos+1:]
print(cols)
['c1', 'c2', 'c3', 'col_to_split_0', 'col_to_split_1', 'col_to_split_2',
'c4', 'c5', 'cn']
df = df.join(df1).reindex(cols, axis=1)
print (df)
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 cn
0 1 a b 1 5 3 1 1 1
1 2 a c 5 10 3 3 4
2 3 z c 3 2 3 4
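An alternative worth noting (my own sketch, not from the answer above): DataFrame.insert places each new column directly at the target position, so no reindex or column-list surgery is needed. It assumes df and col_to_split as defined above:

# Split the column, drop the original, then insert each part at its slot.
parts = df[col_to_split].str.split(',', expand=True).fillna("")
pos = df.columns.get_loc(col_to_split)
df = df.drop(columns=col_to_split)
for i in parts.columns:  # parts.columns is 0, 1, 2, ...
    df.insert(pos + i, f'{col_to_split}_{i}', parts[i])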
We can wrap this operation in a function:
import pandas as pd
import numpy as np
from io import StringIO
df = pd.read_csv(StringIO("""c1 c2 c3 col_to_split c4 c5 cn
1 a b 1,5,3 1 1 1
2 a c 5,10 3 3 4
3 z c 3 2 3 4"""), sep=r"\s+")
def split_by_col(df, colname):
    pos = df.columns.tolist().index(colname)
    df_tmp = df[colname].str.split(",", expand=True).fillna("")
    df_tmp.columns = [colname + "_" + str(i) for i in range(len(df_tmp.columns))]
    return pd.concat([df.iloc[:, :pos], df_tmp, df.iloc[:, pos+1:]], axis=1)
With an example:
>>> split_by_col(df, "col_to_split")
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 cn
0 1 a b 1 5 3 1 1 1
1 2 a c 5 10 3 3 4
2 3 z c 3 2 3 4
Try this:
df = df.join(df[col_to_split].str.split(', ', expand=True).add_prefix(col_to_split + '_'))
df = df[["c1", "c2", "c3" "col_to_split_0" "col_to_split_1" "col_to_split_2" "c4" "c5" ... "cn"]]
I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
What I need is the SEQ split apart at each of the BEG_GAP/END_GAP pairs. I already have it worked out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have multiple.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv(r"..\path_to_the_csv.csv")
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don't really exist because they actually have another gap in the middle.
I.e., what it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on with the other sequences. As you can see, some slices are not being generated and some are wrong, because I don't know how to tell the code to take all the gaps into account while analyzing the sequence.
All advice is appreciated, I hope I was clear!
Let's try defining a function and applying it per group:
def truncate(data):
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    return [seq[x:y] for x, y in zip([0] + list(data.END_GAP),
                                     list(data.BEG_GAP) + [ll])]
(df.groupby('ID').apply(truncate)
   .explode().reset_index(name='Seq_slice'))
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA
In one line:
df.groupby('ID').agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max}).apply(lambda x: [x['SEQ'][b: e] for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])], axis=1).explode()
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA
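If you want the exploded two-column frame (ID, Seq_slice) from the question rather than a bare Series, a small follow-up works (a sketch, assuming the one-liner above is bound to a name such as result):

# Name the Series and promote the ID index to a column.
out = result.rename('Seq_slice').reset_index()
print(out)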
I have a pandas dataframe with multiindex where I want to aggregate the duplicate key rows as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'S':[0,5,0,5,0,3,5,0],'Q':[6,4,10,6,2,5,17,4],'A':
['A1','A1','A1','A1','A2','A2','A2','A2'],
'B':['B1','B1','B2','B2','B1','B1','B1','B2']})
df.set_index(['A','B'])
Q S
A B
A1 B1 6 0
B1 4 5
B2 10 0
B2 6 5
A2 B1 2 0
B1 5 3
B1 17 5
B2 4 0
and I would like to group this dataframe to aggregate the Q values (sum) while keeping the S value from the row with the maximal Q, yielding this:
df2 = pd.DataFrame({'S':[0,0,5,0],'Q':[10,16,24,4],'A':
['A1','A1','A2','A2'],
'B':['B1','B2','B1','B2']})
df2.set_index(['A','B'])
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
I tried the following, but it didn't work:
df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})
Any hints?
One way is to use agg, apply, and join:
g = df.groupby(['A','B'], group_keys=False)
g.apply(lambda x: x.loc[x.Q == x.Q.max(),['S']]).join(g.agg({'Q':'sum'}))
Output:
S Q
A B
A1 B1 0 10
B2 0 16
A2 B1 5 24
B2 0 4
Here's one way:
In [1800]: def agg(x):
...: m = x.S.iloc[np.argmax(x.Q.values)]
...: return pd.Series({'Q': x.Q.sum(), 'S': m})
...:
In [1801]: df.groupby(['A', 'B']).apply(agg)
Out[1801]:
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
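A third option, not from the original answers: sort so each group's largest-Q row comes last, then sum Q and take the last S. A sketch using named aggregation (pandas 0.25+); note that on ties in Q it may pick a different row than idxmax would:

# Sort by Q so the max-Q row is last within each group, then aggregate:
# Q is summed, while 'last' picks the S from that max-Q row.
out = (df.sort_values('Q')
         .groupby(['A', 'B'])
         .agg(Q=('Q', 'sum'), S=('S', 'last')))
print(out)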
With these two data frames
df1 = pd.DataFrame({'c1':['a','b','c','d'],'c2':[10,20,10,22]})
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9]})
I'm trying to add the values of c4 to df1 for only the elements in c3 that are also present in c1:
>>> df1
c1 c2 c4
a 10 3
b 20 5
c 10 6
d 22 9
Is there a simple way of doing this in pandas?
UPDATE:
If
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9],'c5':[10,20,30,40,50,60,70,80,90]})
how can I achieve this result?
>>> df1
c1 c2 c4 c5
a 10 3 30
b 20 5 50
c 10 6 60
d 22 9 90
Doing:
>>> df1['c1'].map(df2.set_index('c3')['c4','c5'])
gives me a KeyError.
You can call map on df2['c4'] after setting the index to df2['c3']; this performs a lookup:
In [239]:
df1 = pd.DataFrame({'c1':['a','b','c','d'],'c2':[10,20,10,22]})
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9]})
df1['c4'] = df1['c1'].map(df2.set_index('c3')['c4'])
df1
Out[239]:
c1 c2 c4
0 a 10 3
1 b 20 5
2 c 10 6
3 d 22 9
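For the UPDATE: map only accepts a single Series, and df2.set_index('c3')['c4','c5'] raises a KeyError because ('c4','c5') is treated as one tuple-shaped column label. A join against the indexed lookup table pulls in both columns at once. A sketch, assuming df2 carries the c5 column from the updated question and starting from the original df1 (before the c4 map above):

# Align df1['c1'] with df2's 'c3' index and bring in both columns together.
df1 = df1.join(df2.set_index('c3')[['c4', 'c5']], on='c1')
print(df1)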
I'd like to convert a Pandas DataFrame that is derived from a pivot table into a row representation as shown below.
This is where I'm at:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'goods': ['a', 'a', 'b', 'b', 'b'],
'stock': [5, 10, 30, 40, 10],
'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06', '2014-02-09', '2014-03-09'])
})
# we don't care about year in this example
df['month'] = df['date'].dt.month
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
print(piv)
which results in
stock
goods a b
category c1 c2 c1 c2
month
1 5 0 30 0
2 5 10 30 40
3 5 10 10 40
And this is where I want to get to.
goods category month stock
a c1 1 5
a c1 2 0
a c1 3 0
a c2 1 0
a c2 2 10
a c2 3 0
b c1 1 30
b c1 2 0
b c1 3 10
b c2 1 0
b c2 2 40
b c2 3 0
Previously, I used
piv = piv.stack()
piv = piv.reset_index()
print(piv)
to get rid of the multi-indexes, but because I now pivot on two columns (["goods", "category"]), this results in:
month category stock
goods a b
0 1 c1 5 30
1 1 c2 0 0
2 2 c1 5 30
3 2 c2 10 40
4 3 c1 5 10
5 3 c2 10 40
Does anyone know how I can get rid of the multi-index in the columns and get the result into a DataFrame in the format shown above?
>>> piv.unstack().reset_index().drop('level_0', axis=1)
goods category month 0
0 a c1 1 5
1 a c1 2 5
2 a c1 3 5
3 a c2 1 0
4 a c2 2 10
5 a c2 3 10
6 b c1 1 30
7 b c1 2 30
8 b c1 3 10
9 b c2 1 0
10 b c2 2 40
11 b c2 3 40
Then all you need is to rename the last column from 0 to stock.
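For example (a sketch combining the steps above):

out = (piv.unstack()
          .reset_index()
          .drop('level_0', axis=1)
          .rename(columns={0: 'stock'}))
print(out)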
It seems to me that melt (aka unpivot) is very close to what you want to do:
In [11]: pd.melt(piv)
Out[11]:
NaN goods category value
0 stock a c1 5
1 stock a c1 5
2 stock a c1 5
3 stock a c2 0
4 stock a c2 10
5 stock a c2 10
6 stock b c1 30
7 stock b c1 30
8 stock b c1 10
9 stock b c2 0
10 stock b c2 40
11 stock b c2 40
There's a rogue column (stock) that appears here because that column header level is constant in piv. If we drop that level first, the melt works out of the box:
In [12]: piv.columns = piv.columns.droplevel(0)
In [13]: pd.melt(piv)
Out[13]:
goods category value
0 a c1 5
1 a c1 5
2 a c1 5
3 a c2 0
4 a c2 10
5 a c2 10
6 b c1 30
7 b c1 30
8 b c1 10
9 b c2 0
10 b c2 40
11 b c2 40
Edit: the above actually drops the index; you need to make it a column with reset_index:
In [21]: pd.melt(piv.reset_index(), id_vars=['month'], value_name='stock')
Out[21]:
month goods category stock
0 1 a c1 5
1 2 a c1 5
2 3 a c1 5
3 1 a c2 0
4 2 a c2 10
5 3 a c2 10
6 1 b c1 30
7 2 b c1 30
8 3 b c1 10
9 1 b c2 0
10 2 b c2 40
11 3 b c2 40
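As a side note, newer pandas (1.1+, an assumption about your version) lets melt keep the index directly via ignore_index=False, avoiding the reset_index/id_vars dance. A sketch, assuming the constant top column level was already dropped as in In [12]:

# Keep the month index through the melt, then promote it to a column.
out = (piv.melt(ignore_index=False, value_name='stock')
          .reset_index())
print(out)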
I know that the question has already been answered, but for my dataset's multiindex-column problem the provided solutions were inefficient. So here I am posting another solution for unpivoting multiindex columns using pandas.
Here is the problem I had: the dataframe was composed of a 3-level MultiIndex and two levels of MultiIndex columns, and I needed it unpivoted into a flat table. When I tried the options given above, the pd.melt function didn't allow more than one column in the var_name attribute, so every time I tried a melt I would end up losing some attribute from my table.
The solution I found was to apply a double stacking over my dataframe. Before the code, it is worth noting that the desired var_name for my unpivoted value column was "Populacao residente em domicilios particulares ocupados" (see the code below); all the value entries should be stacked into this newly created column. Here is a code snippet:
import pandas as pd
# reading my table
df = pd.read_excel(r'my_table.xls', header=[2, 3],
                   index_col=[0, 1, 2], na_values=['-', ' ', '*']).fillna(0)
df.index.names = ['COD_MUNIC_7', 'NOME_MUN', 'TIPO']
df.columns.names = ['sexo', 'faixa_etaria']
df.head()
# making the stacking:
df = pd.DataFrame(
    pd.Series(df.stack(level=0).stack(),
              name='Populacao residente em domicilios particulares ocupados')
).reset_index()
df.head()
Another solution I found was to first apply a stacking function over the dataframe and then apply the melt.
Here is an alternative code:
df = (df.stack('faixa_etaria')
        .reset_index()
        .melt(id_vars=['COD_MUNIC_7', 'NOME_MUN', 'TIPO', 'faixa_etaria'],
              value_vars=['Homens', 'Mulheres'],
              value_name='Populacao residente em domicilios particulares ocupados',
              var_name='sexo'))
df.head()
Sincerely yours,
Philipe Riskalla Leal