I have two DataFrames and I want to concatenate them as follows:
df1:
index 394 min FIC-2000 398 min FFC
0 Recycle Gas min 20K20 Compressor min 20k
1 TT date kg/h AT date ..
2 nan 2011-03-02 -20.7 2011-03-02
08:00:00 08:00:00
3 nan 2011-03-02 -27.5 ...
08:00:10
df2:
index Unnamed:0 0 1 .. 394 395 .....
0 Service Prop Prop1 Recycle Gas RecG
the output df3 should be like this:
df3
index Unnamed:0 0 .. 394 395..
0 Service Prop Recycle Gas RecG
1 Recycle Gas min FIC-2000
2 min 20K20
3 TT date kg/h
4 nan 2011-03-02 -20.7
08:00:00
5 nan 2011-03-02 -27.5
08:00:10
I've tried this code:
df3=pd.concat([df1,df2], axis=1)
but this only concatenates index 394, and the rest of df1 is appended to the end of df2.
Any idea how to do this?
Just change to axis=0.
Consider this:
Input:
>>> df
col1 col2 col3
0 1 4 2
1 2 1 5
2 3 6 319
>>> df_1
col4 col5 col6
0 1 4 12
1 32 12 3
2 3 2 319
>>> df_2
col1 col3 col6
0 12 14 2
1 4 132 3
2 23 22 9
Concat mismatched (per column name)
>>> pd.concat([df, df_1], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
Concat matching:
>>> pd.concat([df, df_1, df_2], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
0 12.0 NaN 14.0 NaN NaN 2.0
1 4.0 NaN 132.0 NaN NaN 3.0
2 23.0 NaN 22.0 NaN NaN 9.0
Concat matched, fill NaNs (you can fill Nones the same way)
>>> pd.concat([df, df_1, df_2], axis=0).fillna(0) # to prettify the output; for string columns consider .fillna('')
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 0.0 0.0 0.0
1 2.0 1.0 5.0 0.0 0.0 0.0
2 3.0 6.0 319.0 0.0 0.0 0.0
0 0.0 0.0 0.0 1.0 4.0 12.0
1 0.0 0.0 0.0 32.0 12.0 3.0
2 0.0 0.0 0.0 3.0 2.0 319.0
0 12.0 0.0 14.0 0.0 0.0 2.0
1 4.0 0.0 132.0 0.0 0.0 3.0
2 23.0 0.0 22.0 0.0 0.0 9.0
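Note that the row labels 0, 1, 2 repeat in every result above, because concat keeps each frame's original index. A minimal sketch, assuming you want a fresh 0..n-1 index instead, is to pass ignore_index=True (the frames below are rebuilt from the example so the snippet runs on its own):

```python
import pandas as pd

# Two of the example frames from above, rebuilt so the snippet is self-contained.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 1, 6], "col3": [2, 5, 319]})
df_2 = pd.DataFrame({"col1": [12, 4, 23], "col3": [14, 132, 22], "col6": [2, 3, 9]})

# ignore_index=True discards the original row labels and renumbers the rows.
stacked = pd.concat([df, df_2], axis=0, ignore_index=True)
print(stacked)
```

Columns missing from one of the inputs are still filled with NaN, exactly as in the examples above.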
EDIT
Triggered by the conversation with OP in the comment section below.
So you do:
(1) To concat dataframes
df3=pd.concat([df1,df2], axis=0)
(2) To join another dataframe on them:
df5=pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
(you may want to consider the suffixes parameter if the two frames share column names other than the join key)
REF: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
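To make the two steps concrete, here is a small sketch. The frames and their contents are made up for illustration (only the column names "FIC" and "min" come from the snippet above), so treat it as an assumption about the shape of your data rather than a drop-in solution:

```python
import pandas as pd

# Hypothetical stand-ins for the real frames; only "FIC" and "min" are taken
# from the answer, everything else is invented for the example.
df1 = pd.DataFrame({"FIC": ["FIC-2000", "FIC-2001"], "value": [-20.7, -27.5]})
df2 = pd.DataFrame({"FIC": ["FIC-2002"], "value": [3.1]})
df4 = pd.DataFrame({
    "FIC": ["FIC-2000", "FIC-2002", "FIC-2003"],
    "min": [1, 2, 3],
    "other": ["a", "b", "c"],
})

# (1) stack the rows of df1 and df2
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)

# (2) bring in the "min" column from df4, keeping keys found on either side
df5 = pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
print(df5)
```

With how="outer", keys that exist on only one side are kept and the missing cells become NaN; use how="left" if you only want the rows of df3.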
Related
I have the below data frame
d = {
"name":["RRR","RRR","RRR","RRR","RRR","ZZZ","ZZZ","ZZZ","ZZZ","ZZZ"],
"id":[1,1,2,2,3,2,3,3,4,4],"value":[12,13,1,44,22,21,23,53,64,9]
}
I want the output as below:
First pivot with DataFrame.set_index, using a counter from GroupBy.cumcount, and DataFrame.unstack, with a helper column ind copied from id; then sort by the second level of the column MultiIndex and flatten the column names:
df = (df.assign(ind = df['id'])
.set_index(['name','id', df.groupby(['name','id']).cumcount()])[['value', 'ind']]
.unstack(1)
.sort_index(axis=1, kind='mergesort', level=1))
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.droplevel(1).reset_index()
print (df)
name ind_1 value_1 ind_2 value_2 ind_3 value_3 ind_4 value_4
0 RRR 1.0 12.0 2.0 1.0 3.0 22.0 NaN NaN
1 RRR 1.0 13.0 2.0 44.0 NaN NaN NaN NaN
2 ZZZ NaN NaN 2.0 21.0 3.0 23.0 4.0 64.0
3 ZZZ NaN NaN NaN NaN 3.0 53.0 4.0 9.0
try this:
def func(sub: pd.DataFrame) -> pd.DataFrame:
    dfs = [g.reset_index(drop=True).rename(
        columns=lambda x: f'{x}_{n}') for n, g in sub.drop(columns='name').groupby('id')]
    return pd.concat(dfs, axis=1)
res = df.groupby('name').apply(func).droplevel(1).reset_index()
print(res)
>>>
name id_1 value_1 id_2 value_2 id_3 value_3 id_4 value_4
0 RRR 1.0 12.0 2.0 1.0 3.0 22.0 NaN NaN
1 RRR 1.0 13.0 2.0 44.0 NaN NaN NaN NaN
2 ZZZ NaN NaN 2.0 21.0 3.0 23.0 4.0 64.0
3 ZZZ NaN NaN NaN NaN 3.0 53.0 4.0 9.0
I want to convert below dataframe,
ID TYPE A B
0 1 MISSING 0.0 0.0
1 2 1T 1.0 2.0
2 2 2T 3.0 4.0
3 3 MISSING 0.0 0.0
4 4 2T 10.0 4.0
5 5 CBN 15.0 20.0
6 5 DSV 25.0 35.0
to:
ID MISSING_A MISSING_B 1T_A 1T_B 2T_A 2T_B CBN_A CBN_B DSV_A DSV_B
0 1 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
3 3 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN 10.0 4.0 NaN NaN NaN NaN
5 5 NaN NaN NaN NaN NaN NaN 15.0 20.0 25.0 35.0
For IDs with multiple types, the multiple rows for A and B should merge into one row as shown above.
You are looking for a pivot, which will end up giving you a multi-index. You'll need to join those columns to get the suffix you are looking for.
df = df.pivot(index='ID',columns='TYPE', values=['A','B'])
df.columns = ['_'.join(reversed(col)).strip() for col in df.columns.values]
df = df.reset_index()
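For reference, here is a self-contained sketch of the pivot approach run against the sample data from the question. The final reordering step is my own addition (an assumption about the column order you want, based on the desired output), since pivot sorts the flattened columns by A/B first:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":   [1, 2, 2, 3, 4, 5, 5],
    "TYPE": ["MISSING", "1T", "2T", "MISSING", "2T", "CBN", "DSV"],
    "A":    [0.0, 1.0, 3.0, 0.0, 10.0, 15.0, 25.0],
    "B":    [0.0, 2.0, 4.0, 0.0, 4.0, 20.0, 35.0],
})

# Pivot to one row per ID; the columns become a (value, TYPE) MultiIndex.
wide = df.pivot(index="ID", columns="TYPE", values=["A", "B"])

# Flatten ("A", "1T") into "1T_A".
wide.columns = [f"{typ}_{val}" for val, typ in wide.columns]
wide = wide.reset_index()

# Optional: group the columns by TYPE, in order of first appearance.
order = [f"{typ}_{val}" for typ in df["TYPE"].unique() for val in ("A", "B")]
wide = wide[["ID"] + order]
print(wide)
```

Keep in mind that pivot raises an error if an (ID, TYPE) pair occurs more than once; in that case pivot_table with an aggregation function is the usual fallback.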
Say I have a dataset that is indexed on date
id, date, col1, col2
1, 4, 1, 12
1, 5, 2, 13
1, 6, 6, 14
2, 4, 20, 16
2, 5, 8, 17
2, 6, 11, 18
...
and I wish to compute the rolling mean, sum, min, max for col1 and col2 grouped by id, with window size 2 and 3. I can do that in a loop like so
def multi_rolling(df, winsize, column):
    return [df.groupby("id")[column].rolling(winsize).mean(),
            df.groupby("id")[column].rolling(winsize).sum(),
            df.groupby("id")[column].rolling(winsize).min(),
            df.groupby("id")[column].rolling(winsize).max(),
            df.groupby("id")[column].rolling(winsize).count()]
Then I just have to call the above in a loop. But this feels inefficient. Is there a way to call it on all combinations of all functions, all columns and all window sizes more efficiently? E.g. run them in parallel?
Use agg on the rolling object:
new_df = df.groupby("id").rolling(2)[["col1","col2"]].agg(['mean','sum','min','max','count'])
print(new_df)
Output:
col1 col2 \
mean sum min max count mean
col1 col2 col1 col2 col1 col2 col1 col2 col1 col2 col1 col2
id
1 0 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN
1 1.5 12.5 3.0 25.0 1.0 12.0 2.0 13.0 2.0 2.0 1.5 12.5
2 4.0 13.5 8.0 27.0 2.0 13.0 6.0 14.0 2.0 2.0 4.0 13.5
2 3 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN
4 14.0 16.5 28.0 33.0 8.0 16.0 20.0 17.0 2.0 2.0 14.0 16.5
5 9.5 17.5 19.0 35.0 8.0 17.0 11.0 18.0 2.0 2.0 9.5 17.5
sum min max count
col1 col2 col1 col2 col1 col2 col1 col2
id
1 0 NaN NaN NaN NaN NaN NaN 1.0 1.0
1 3.0 25.0 1.0 12.0 2.0 13.0 2.0 2.0
2 8.0 27.0 2.0 13.0 6.0 14.0 2.0 2.0
2 3 NaN NaN NaN NaN NaN NaN 1.0 1.0
4 28.0 33.0 8.0 16.0 20.0 17.0 2.0 2.0
5 19.0 35.0 8.0 17.0 11.0 18.0 2.0 2.0
Because your question is ambiguous, I'm not sure I understand what you need the output to look like.
But see if this one-liner helps:
df.groupby("id")[column].rolling(winsize).agg(['mean','sum','min','max','count'])
Your version is bound to be inefficient because it repeats the groupby for every statistic.
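Neither snippet covers several window sizes at once, so here is a rough sketch of one way to do that while grouping only once. The sample frame is rebuilt from the question; the names funcs, windows, pieces and results are mine, and this does not run anything in parallel, it just avoids repeating the groupby:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "date": [4, 5, 6, 4, 5, 6],
    "col1": [1, 2, 6, 20, 8, 11],
    "col2": [12, 13, 14, 16, 17, 18],
})

funcs = ["mean", "sum", "min", "max", "count"]
windows = [2, 3]

# Group once and reuse the grouped object for every window and statistic.
grouped = df.groupby("id")[["col1", "col2"]]

pieces = {}
for w in windows:
    roll = grouped.rolling(w)
    for f in funcs:
        # Each call returns a frame with columns col1/col2,
        # indexed by (id, original row label).
        pieces[(w, f)] = getattr(roll, f)()

# Columns of the result are (window, function, column).
results = pd.concat(pieces, axis=1)
print(results[(2, "mean", "col1")])
```

A single column of the result can then be picked with a tuple such as results[(3, "sum", "col2")].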
Say we have a dataframe set up as follows:
x = pd.DataFrame(np.random.randint(1, 10, 30).reshape(5,6),
columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan
col0 col1 col2 col3 col4 col5 col6 col7
0 6 5 1 5 2 4 NaN NaN
1 8 8 9 6 7 2 NaN NaN
2 8 3 9 6 6 6 NaN NaN
3 8 4 4 4 8 9 NaN NaN
4 5 3 4 3 8 7 NaN NaN
When calling x.shift(2, axis=1), col2 through col5 shift correctly, but col6 and col7 stay NaN.
How can I overwrite the NaN values in col6 and col7 with the values from col4 and col5? Is this a bug or intended?
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 NaN NaN
1 NaN NaN 8.0 8.0 9.0 6.0 NaN NaN
2 NaN NaN 8.0 3.0 9.0 6.0 NaN NaN
3 NaN NaN 8.0 4.0 4.0 4.0 NaN NaN
4 NaN NaN 5.0 3.0 4.0 3.0 NaN NaN
It's possible this is a bug. You can use np.roll to achieve this:
In[11]:
x.apply(lambda x: np.roll(x, 2), axis=1)
Out[11]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
Speed-wise, it's probably quicker to pass the result of np.roll as the data argument to the DataFrame constructor and reuse the existing columns:
In[12]:
x = pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
x
Out[12]:
col0 col1 col2 col3 col4 col5 col6 col7
0 NaN NaN 6.0 5.0 1.0 5.0 2.0 4.0
1 NaN NaN 8.0 8.0 9.0 6.0 7.0 2.0
2 NaN NaN 8.0 3.0 9.0 6.0 6.0 6.0
3 NaN NaN 8.0 4.0 4.0 4.0 8.0 9.0
4 NaN NaN 5.0 3.0 4.0 3.0 8.0 7.0
timings
In[13]:
%timeit pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
%timeit x.fillna(0).astype(int).shift(2, axis=1)
10000 loops, best of 3: 117 µs per loop
1000 loops, best of 3: 418 µs per loop
So constructing a new df from the result of np.roll is quicker than first filling the NaN values, casting to int, and then shifting.
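One caveat: np.roll is a circular shift, so the last two columns wrap around into col0 and col1. That matches shift here only because col6 and col7 are all NaN. A sketch of an alternative that keeps real shift semantics is to cast the whole frame to a single float dtype first. My assumption (not a confirmed diagnosis) is that the original problem comes from the int and float columns being shifted separately:

```python
import numpy as np
import pandas as pd

# Same layout as the question, with deterministic values instead of random ones.
x = pd.DataFrame(np.arange(1, 31).reshape(5, 6),
                 columns=[f"col{i}" for i in range(6)])
x["col6"] = np.nan
x["col7"] = np.nan

# With one dtype for every column, the shift moves values across all columns.
shifted = x.astype(float).shift(2, axis=1)
print(shifted)
```

Unlike np.roll, this always leaves NaN in the first two columns, whatever the last two columns contained.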
Here is my DataFrame:
0 1 2
0 0 0.0 20.0 NaN
1 1.0 21.0 NaN
2 2.0 22.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2011.0
1 0 3.0 23.0 NaN
1 4.0 24.0 NaN
2 5.0 25.0 NaN
3 6.0 26.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2012.0
I want to convert the 'ID' and 'Year' rows to DataFrame index levels, with 'ID' being level=0 and 'Year' being level=1. I tried using stack() but still cannot figure it out.
Edit: my desired output should look like this:
0 1
11111 2011 0 0.0 20.0
1 1.0 21.0
2 2.0 22.0
2012 0 3.0 23.0
1 4.0 24.0
2 5.0 25.0
3 6.0 26.0
This should work:
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']
dfs = df1.unstack()
dfi = df1.index
dfn = df.drop(dfi).drop('2', axis=1).unstack()
dfn.set_index([dfs.ID, dfs.Year]).stack()
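If that is hard to follow, here is a more explicit sketch of the same idea, written as a separate approach rather than a line-by-line expansion of the code above. It assumes the column labels are the strings '0', '1', '2' (as the snippet above implies) and that every block has exactly one 'ID' row and one 'Year' row; the names ids, years, data and groups are mine:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question: a two-level row index in which the
# 'ID' and 'Year' rows carry their value in column '2'.
index = pd.MultiIndex.from_tuples([
    (0, 0), (0, 1), (0, 2), (0, "ID"), (0, "Year"),
    (1, 0), (1, 1), (1, 2), (1, 3), (1, "ID"), (1, "Year"),
])
nan = np.nan
df = pd.DataFrame({
    "0": [0, 1, 2, nan, nan, 3, 4, 5, 6, nan, nan],
    "1": [20, 21, 22, nan, nan, 23, 24, 25, 26, nan, nan],
    "2": [nan, nan, nan, 11111, 2011, nan, nan, nan, nan, 11111, 2012],
}, index=index)

level1 = df.index.get_level_values(1)

# One ID and one Year per outer block, indexed by the block number.
ids = df.loc[level1 == "ID", "2"].droplevel(1).astype(int)
years = df.loc[level1 == "Year", "2"].droplevel(1).astype(int)

# Keep only the data rows and the data columns.
data = df.loc[~level1.isin(["ID", "Year"]), ["0", "1"]]

# Replace the block number with that block's ID and Year.
groups = data.index.get_level_values(0)
data.index = pd.MultiIndex.from_arrays(
    [ids.reindex(groups).to_numpy(),
     years.reindex(groups).to_numpy(),
     data.index.get_level_values(1)],
    names=["ID", "Year", None],
)
print(data)
```

The result has 'ID' as level 0, 'Year' as level 1 and the original row counter as the innermost level.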