Pandas: How can some column values be moved to new columns? - python

I have the data frame below:
import pandas as pd
d = {
"name":["RRR","RRR","RRR","RRR","RRR","ZZZ","ZZZ","ZZZ","ZZZ","ZZZ"],
"id":[1,1,2,2,3,2,3,3,4,4],"value":[12,13,1,44,22,21,23,53,64,9]
}
df = pd.DataFrame(d)
I want the output as below:

First pivot with DataFrame.set_index, using a counter from GroupBy.cumcount, and DataFrame.unstack, with a helper column ind copied from id. Then sort by the second level of the column MultiIndex and flatten the column names:
df = (df.assign(ind = df['id'])
.set_index(['name','id', df.groupby(['name','id']).cumcount()])[['value', 'ind']]
.unstack(1)
.sort_index(axis=1, kind='mergesort', level=1))
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.droplevel(1).reset_index()
print (df)
name ind_1 value_1 ind_2 value_2 ind_3 value_3 ind_4 value_4
0 RRR 1.0 12.0 2.0 1.0 3.0 22.0 NaN NaN
1 RRR 1.0 13.0 2.0 44.0 NaN NaN NaN NaN
2 ZZZ NaN NaN 2.0 21.0 3.0 23.0 4.0 64.0
3 ZZZ NaN NaN NaN NaN 3.0 53.0 4.0 9.0
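To see why the counter is needed, here is a minimal sketch that only computes the GroupBy.cumcount step on the question's data. The column name occurrence is illustrative and not part of the original answer; it numbers repeated (name, id) pairs so that each pair plus its counter is unique and can be unstacked without duplicate-index errors.

```python
import pandas as pd

d = {
    "name": ["RRR", "RRR", "RRR", "RRR", "RRR", "ZZZ", "ZZZ", "ZZZ", "ZZZ", "ZZZ"],
    "id": [1, 1, 2, 2, 3, 2, 3, 3, 4, 4],
    "value": [12, 13, 1, 44, 22, 21, 23, 53, 64, 9],
}
df = pd.DataFrame(d)

# 0 for the first row of each (name, id) pair, 1 for the second, and so on.
# "occurrence" is a hypothetical helper column name.
df["occurrence"] = df.groupby(["name", "id"]).cumcount()
print(df)
```

The counter becomes the row position inside each name in the final result, which is why RRR and ZZZ each end up with two output rows.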

try this:
def func(sub: pd.DataFrame) -> pd.DataFrame:
    dfs = [g.reset_index(drop=True).rename(
        columns=lambda x: f'{x}_{n}') for n, g in sub.drop(columns='name').groupby('id')]
    return pd.concat(dfs, axis=1)
res = df.groupby('name').apply(func).droplevel(1).reset_index()
print(res)
>>>
name id_1 value_1 id_2 value_2 id_3 value_3 id_4 value_4
0 RRR 1.0 12.0 2.0 1.0 3.0 22.0 NaN NaN
1 RRR 1.0 13.0 2.0 44.0 NaN NaN NaN NaN
2 ZZZ NaN NaN 2.0 21.0 3.0 23.0 4.0 64.0
3 ZZZ NaN NaN NaN NaN 3.0 53.0 4.0 9.0

Related

Creating non-exist columns in multiindex dataframe

Let's say we have a dataframe like this:
df = pd.DataFrame({
"metric": ["1","2","1" ,"1","2"],
"group1":["o", "x", "x" , "o", "x"],
"group2":['a', 'b', 'a', 'a', 'b'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
metric group1 group2 value value2
0 1 o a 0 0
1 2 x b 1 2
2 1 x a 2 4
3 1 o a 3 6
4 2 x b 4 8
Then I want to pivot it into this format:
df['g'] = df.groupby(['group1','group2'])['group2'].cumcount()
df1 = df.pivot(index=['g','metric'], columns=['group1','group2'], values=['value','value2']).sort_index(axis=1).rename_axis(columns={'g':None})
value value2
group1 o x o x
group2 a a b a a b
g metric
0 1 0.0 2.0 NaN 0.0 4.0 NaN
2 NaN NaN 1.0 NaN NaN 2.0
1 1 3.0 NaN NaN 6.0 NaN NaN
2 NaN NaN 4.0 NaN NaN 8.0
From here we can see that ("value","o","b") and ("value2","o","b") do not exist after the pivot,
but I need those columns to be present, filled with NA.
So I tried:
cols = [('value','x','a'), ('value','o','a'),('value','o','b')]
df1.assign(**{col : "NA" for col in np.setdiff1d(cols, df1.columns.values)})
which gives
Expected output
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
One corner case: if b does not exist at all, how can that column be created?
value value2
group1 o x o x
group2 a a a a
g metric
0 1 0.0 2.0 0.0 4.0
2 NaN NaN NaN NaN
1 1 3.0 NaN 6.0 NaN
2 NaN NaN NaN NaN
Multiple insert columns if not exist pandas
Pandas: Check if column exists in df from a list of columns
Pandas - How to check if multi index column exists
Use DataFrame.stack with DataFrame.unstack:
df1 = df1.stack([1,2],dropna=False).unstack([2,3])
print (df1)
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
Or select the last two levels by negative position:
df1 = df1.stack([-2,-1],dropna=False).unstack([-2,-1])
Another idea:
df1 = df1.reindex(pd.MultiIndex.from_product(df1.columns.levels), axis=1)
print (df1)
value value2
group1 o x o x
group2 a b a b a b a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 NaN 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN NaN
2 NaN NaN NaN 4.0 NaN NaN NaN 8.0
EDIT:
If you need to set new columns from a list of tuples:
cols = [('value','x','a'), ('value','o','a'),('value','o','b')]
df = df1.reindex(pd.MultiIndex.from_tuples(cols).union(df1.columns), axis=1)
print (df)
value value2
o x o x
a b a b a a b
g metric
0 1 0.0 NaN 2.0 NaN 0.0 4.0 NaN
2 NaN NaN NaN 1.0 NaN NaN 2.0
1 1 3.0 NaN NaN NaN 6.0 NaN NaN
2 NaN NaN NaN 4.0 NaN NaN 8.0
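For the corner case in the question where b never appears in the data, df1.columns.levels cannot supply it, so the full set of labels has to come from somewhere else. A minimal sketch, assuming the complete label lists are known up front (the frame wide and the hard-coded lists below are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# A small stand-in for the pivoted frame in which group2 only contains "a".
cols = pd.MultiIndex.from_tuples(
    [("value", "o", "a"), ("value", "x", "a"),
     ("value2", "o", "a"), ("value2", "x", "a")],
    names=[None, "group1", "group2"],
)
wide = pd.DataFrame(
    [[0.0, 2.0, 0.0, 4.0], [3.0, np.nan, 6.0, np.nan]],
    columns=cols,
)

# Assumption: the full label sets are known in advance.
full = pd.MultiIndex.from_product(
    [["value", "value2"], ["o", "x"], ["a", "b"]],
    names=list(cols.names),
)

# Columns missing from wide are created and filled with NaN.
out = wide.reindex(columns=full)
print(out)
```

Existing columns keep their values, and every combination that was absent appears as an all-NaN column.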

Convert two pandas rows into one

I want to convert the dataframe below,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the separate rows for A and B should be merged into one row, as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID',columns='TYPE', values=['A','B'])
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df.reset_index()
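Here is a self-contained sketch of the same pivot on the question's data. It additionally reorders the columns so that A and B sit next to each other per TYPE, in order of first appearance, as in the desired output. The names wide and order are illustrative, and the reordering step is an addition that the original answer does not include.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4, 5, 5],
    "TYPE": ["MISSING", "1T", "2T", "MISSING", "2T", "CBN", "DSV"],
    "A": [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    "B": [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# Pivot produces MultiIndex columns of the form (value_name, TYPE).
wide = df.pivot(index="ID", columns="TYPE", values=["A", "B"])

# Group the columns per TYPE, keeping the order in which types first appear.
order = [(v, t) for t in df["TYPE"].unique() for v in ["A", "B"]]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(order))

# Flatten (value_name, TYPE) into "TYPE_value_name".
wide.columns = [f"{t}_{v}" for v, t in wide.columns]
wide = wide.reset_index()
print(wide)
```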

how to concat between columns keeping sequence unchanged in 2 dataframes pandas

I have 2 dataframes and I want to concatenate them as follows:
df1:
index 394 min FIC-2000 398 min FFC
0 Recycle Gas min 20K20 Compressor min 20k
1 TT date kg/h AT date ..
2 nan 2011-03-02 -20.7 2011-03-02
08:00:00 08:00:00
3 nan 2011-03-02 -27.5 ...
08:00:10
df2:
index Unnamed:0 0 1 .. 394 395 .....
0 Service Prop Prop1 Recycle Gas RecG
the output df3 should be like this:
df3
index Unnamed:0 0 .. 394 395..
0 Service Prop Recycle Gas RecG
1 Recycle Gas min FIC-2000
2 min 20K20
3 TT date kg/h
4 nan 2011-03-02 -20.7
08:00:00
5 nan 2011-03-02 -27.5
08:00:10
I've tried this code:
df3=pd.concat([df1,df2], axis=1)
but this only concatenates index 394, and the rest of df1 is appended to the end of df2.
Any idea how to do this?
Just change to axis=0.
Consider this:
Input:
>>> df
col1 col2 col3
0 1 4 2
1 2 1 5
2 3 6 319
>>> df_1
col4 col5 col6
0 1 4 12
1 32 12 3
2 3 2 319
>>> df_2
col1 col3 col6
0 12 14 2
1 4 132 3
2 23 22 9
Concat with mismatched column names:
>>> pd.concat([df, df_1], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
Concat matching:
>>> pd.concat([df, df_1, df_2], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
0 12.0 NaN 14.0 NaN NaN 2.0
1 4.0 NaN 132.0 NaN NaN 3.0
2 23.0 NaN 22.0 NaN NaN 9.0
Concat matching, filling NaNs (None values can be filled the same way):
>>> pd.concat([df, df_1, df_2], axis=0).fillna(0) # to prettify the result; for string columns consider .fillna('')
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 0.0 0.0 0.0
1 2.0 1.0 5.0 0.0 0.0 0.0
2 3.0 6.0 319.0 0.0 0.0 0.0
0 0.0 0.0 0.0 1.0 4.0 12.0
1 0.0 0.0 0.0 32.0 12.0 3.0
2 0.0 0.0 0.0 3.0 2.0 319.0
0 12.0 0.0 14.0 0.0 0.0 2.0
1 4.0 0.0 132.0 0.0 0.0 3.0
2 23.0 0.0 22.0 0.0 0.0 9.0
EDIT
Following the conversation with the OP in the comments.
So you do:
(1) To concat dataframes
df3=pd.concat([df1,df2], axis=0)
(2) To join another dataframe on them:
df5=pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
(you may want to consider the suffixes parameter if you think it's relevant)
REF: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
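As a rough illustration of the suffixes parameter mentioned above, here is a sketch with two made-up frames that share both the key column FIC and a non-key column min. The data and the suffix strings are hypothetical; only pd.merge and its on, how and suffixes parameters come from pandas itself.

```python
import pandas as pd

# Hypothetical frames: both have a "min" column besides the join key "FIC".
left = pd.DataFrame({"FIC": ["a", "b"], "min": [1, 2]})
right = pd.DataFrame({"FIC": ["b", "c"], "min": [20, 30]})

# An outer merge keeps keys from both sides; the overlapping non-key column
# "min" is disambiguated with the given suffixes.
merged = pd.merge(left, right, on="FIC", how="outer",
                  suffixes=("_left", "_right"))
print(merged)
```

Without explicit suffixes pandas would fall back to _x and _y, which is harder to read once several merges are chained.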

ReArrange Pandas DataFrame date columns in date order

I have a pandas dataframe that summarises sales by calendar month & outputs something like:
Month level_0 UNIQUE_ID 102018 112018 12018 122017 122018 22018 32018 42018 52018 62018 72018 82018 92018
0 SOLD_QUANTITY 01 3692.0 5182.0 3223.0 1292.0 2466.0 2396.0 2242.0 2217.0 3590.0 2593.0 1665.0 3371.0 3069.0
1 SOLD_QUANTITY 011 3.0 6.0 NaN NaN 7.0 5.0 2.0 1.0 5.0 NaN 1.0 1.0 3.0
2 SOLD_QUANTITY 02 370.0 130.0 NaN NaN 200.0 NaN NaN 269.0 202.0 NaN 201.0 125.0 360.0
3 SOLD_QUANTITY 03 2.0 6.0 NaN NaN 2.0 1.0 NaN 6.0 11.0 9.0 2.0 3.0 5.0
4 SOLD_QUANTITY 08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 175.0 NaN NaN
I want to be able to programmatically re-arrange the column headers in ascending date order (e.g. starting 122017, 12018, 22018...). It has to be programmatic because every time the report runs it will contain a different list of months, as it runs every month for the last 365 days.
The index data type:
Index(['level_0', 'UNIQUE_ID', '102018', '112018', '12018', '122017', '122018',
'22018', '32018', '42018', '52018', '62018', '72018', '82018', '92018'],
dtype='object', name='Month')
Use set_index so that only the date columns remain, convert them to datetimes, get the sorted positions with argsort, then reorder the columns with iloc:
df = df.set_index(['level_0','UNIQUE_ID'])
df = df.iloc[:, pd.to_datetime(df.columns, format='%m%Y').argsort()].reset_index()
print (df)
level_0 UNIQUE_ID 122017 12018 22018 32018 42018 52018 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0 3590.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0 5.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0 202.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0 11.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN NaN
62018 72018 82018 92018 102018 112018 122018
0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN 175.0 NaN NaN NaN NaN NaN
Another idea is to create a monthly PeriodIndex with DatetimeIndex.to_period, which makes it possible to use sort_index:
df = df.set_index(['level_0','UNIQUE_ID'])
df.columns = pd.to_datetime(df.columns, format='%m%Y').to_period('M')
#alternative: convert to datetimes only
#df.columns = pd.to_datetime(df.columns, format='%m%Y')
df = df.sort_index(axis=1).reset_index()
print (df)
level_0 UNIQUE_ID 2017-12 2018-01 2018-02 2018-03 2018-04 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN
2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11 2018-12
0 3590.0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 5.0 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 202.0 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 11.0 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN NaN 175.0 NaN NaN NaN NaN NaN
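If you would rather keep the original column labels and avoid date parsing altogether, one possible alternative is to sort the labels by a (year, month) key built from the string itself. This is only a sketch: it assumes every month label ends with a four-digit year, and the names id_cols, month_cols and ordered are illustrative.

```python
import pandas as pd

# One row from the question, with a subset of the month columns.
df = pd.DataFrame(
    [["SOLD_QUANTITY", "01", 3692.0, 5182.0, 3223.0, 1292.0, 2466.0, 2396.0]],
    columns=["level_0", "UNIQUE_ID", "102018", "112018",
             "12018", "122017", "122018", "22018"],
)

id_cols = ["level_0", "UNIQUE_ID"]
month_cols = [c for c in df.columns if c not in id_cols]

# Last four characters are the year, whatever precedes them is the month.
ordered = sorted(month_cols, key=lambda c: (int(c[-4:]), int(c[:-4])))

df = df[id_cols + ordered]
print(df.columns.tolist())
```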

convert specific rows of pandas dataframe into multiindex

here is my DataFrame:
0 1 2
0 0 0.0 20.0 NaN
1 1.0 21.0 NaN
2 2.0 22.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2011.0
1 0 3.0 23.0 NaN
1 4.0 24.0 NaN
2 5.0 25.0 NaN
3 6.0 26.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2012.0
I want to convert the 'ID' and 'Year' rows into the dataframe index, with 'ID' as level 0 and 'Year' as level 1. I tried using stack() but still cannot figure it out.
Edited: my desired output should look like below:
0 1
11111 2011 0 0.0 20.0
1 1.0 21.0
2 2.0 22.0
2012 0 3.0 23.0
1 4.0 24.0
2 5.0 25.0
3 6.0 26.0
This should work:
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']
dfs = df1.unstack()
dfi = df1.index
dfn = df.drop(dfi).drop('2', axis=1).unstack()
dfn.set_index([dfs.ID, dfs.Year]).stack()
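A different way to reach the same shape, sketched on a shortened copy of the question's frame, is to look up each block's ID and Year and build the new index explicitly. This is not the method above, just an alternative under the assumption that the frame has a two-level row index (block number, row label) and string column names "0", "1", "2". All variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, "ID"), (0, "Year"),
     (1, 0), (1, 1), (1, "ID"), (1, "Year")]
)
df = pd.DataFrame(
    {"0": [0.0, 1.0, nan, nan, 3.0, 4.0, nan, nan],
     "1": [20.0, 21.0, nan, nan, 23.0, 24.0, nan, nan],
     "2": [nan, nan, 11111.0, 2011.0, nan, nan, 11111.0, 2012.0]},
    index=index,
)

block = df.index.get_level_values(0)
key = df.index.get_level_values(1)

# One ID and one Year per block, indexed by block number.
ids = df.loc[key == "ID", "2"].droplevel(1).astype(int)
years = df.loc[key == "Year", "2"].droplevel(1).astype(int)

# Keep only the data rows and the data columns.
is_data = ~key.isin(["ID", "Year"])
data = df.loc[is_data, ["0", "1"]].copy()

# Replace the block number with that block's ID and Year.
data_block = block[is_data]
data.index = pd.MultiIndex.from_arrays(
    [data_block.map(ids), data_block.map(years), key[is_data]],
    names=["ID", "Year", None],
)
print(data)
```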
