Concat dataframes row-wise and merge rows if they exist - python

I have two dataframes:
Df_1:
A B C D
1 10 nan 20 30
2 20 30 20 10
Df_2:
A B
1 10 40
2 30 70
I want to merge them and have this final dataframe.
A B C D
1 10 40 20 30
2 20 30 20 10
3 30 70 nan nan
How do I do that?

Looking at the expected result, I think the index of the second row of Df_2
should be 3 (instead of 2).
Run Df_1.combine_first(Df_2).
The result is:
A B C D
1 10.0 40.0 20.0 30.0
2 20.0 30.0 20.0 10.0
3 30.0 70.0 NaN NaN
i.e. due to possible NaN values, the type of columns is coerced to float.
But if you want, you can revert this where possible, by applying to_numeric:
Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')
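To try this end to end, here is a minimal sketch that rebuilds the two frames from the question. It assumes, as suggested above, that the second row of Df_2 carries index 3; the variable names df_1, df_2 and combined are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frames from the question; Df_2's second row is given index 3 (assumption).
df_1 = pd.DataFrame(
    {"A": [10, 20], "B": [np.nan, 30], "C": [20, 20], "D": [30, 10]},
    index=[1, 2],
)
df_2 = pd.DataFrame({"A": [10, 30], "B": [40, 70]}, index=[1, 3])

# combine_first keeps df_1's values and fills its NaNs and missing labels from df_2.
combined = df_1.combine_first(df_2)
print(combined)
```

Row 1 keeps its C and D values and picks up B from df_2, while row 3 exists only in df_2 and therefore has NaN in C and D.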

Convert a python df which is in pivot format to a proper row column format

I have the following dataframe:
id a_1_1, a_1_2, a_1_3, a_1_4, b_1_1, b_1_2, b_1_3, c_1_1, c_1_2, c_1_3
1 10 20 30 40 90 80 70 Nan Nan Nan
2 33 34 35 36 nan nan nan 11 12 13
and I want my result to be as follows:
id col_name 1 2 3
1 a 10 20 30
1 b 90 80 70
2 a 33 34 35
2 c 11 12 13
I am trying to use pd.melt, but it does not yield the correct result.
IIUC, you can reshape using an intermediate MultiIndex after extracting the letter and last digit from the original column names:
(df.set_index('id')
.pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
d.columns.str.extract(r'^([^_]+).*(\d+)'),
names=['col_name', None]
), axis=1))
.stack('col_name')
.dropna(axis=1) # assuming you don't want columns with NaNs
.reset_index()
)
Variant using janitor's pivot_longer:
# pip install pyjanitor
import janitor
(df
.pivot_longer(index='id', names_to=('col_name', '.value'),
names_pattern=r'([^_]+).*(\d+)')
.pipe(lambda d: d.dropna(thresh=d.shape[1]-2))
.dropna(axis=1)
)
output:
id col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
Code:
df = df1.melt(id_vars=["id"],
var_name="Col_name",
value_name="Value").dropna()
df['Num'] = df['Col_name'].apply(lambda x: x[-1])
df['Col_name'] = df['Col_name'].apply(lambda x: x[0])
df = df.pivot(index=['id','Col_name'], columns='Num', values='Value').reset_index().dropna(axis=1)
df
Output:
Num id Col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
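For anyone who wants to run the melt/pivot idea without an existing frame, the following is a self-contained sketch. The sample frame is rebuilt from the question (without the stray commas in the header), and the names long_df, parts and wide are illustrative. It uses str.extract instead of x[0] / x[-1], on the assumption that prefixes or numbers might be longer than one character.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question.
df1 = pd.DataFrame({
    "id": [1, 2],
    "a_1_1": [10, 33], "a_1_2": [20, 34], "a_1_3": [30, 35], "a_1_4": [40, 36],
    "b_1_1": [90, np.nan], "b_1_2": [80, np.nan], "b_1_3": [70, np.nan],
    "c_1_1": [np.nan, 11], "c_1_2": [np.nan, 12], "c_1_3": [np.nan, 13],
})

# Long format: one row per (id, original column), empty cells removed.
long_df = df1.melt(id_vars="id", var_name="name", value_name="value").dropna()

# Split e.g. "a_1_3" into the leading prefix ("a") and the trailing number ("3").
parts = long_df["name"].str.extract(r"^([^_]+)_.*_(\d+)$")
long_df["col_name"] = parts[0]
long_df["num"] = parts[1]

wide = (
    long_df.pivot(index=["id", "col_name"], columns="num", values="value")
    .dropna(axis=1)  # drops "4", which is only filled for the "a" columns
    .reset_index()
)
wide.columns.name = None  # remove the leftover "num" label on the columns
print(wide)
```

Passing a list to index in pivot needs pandas 1.1 or newer.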

How to adjust subtotal columns in pandas using groupby?

I'm exporting dataframes to Excel after joining them.
However, when I calculate subtotals with groupby on the joined dataframe,
the word "Subtotal" ends up in the index column.
Is there any way to move it into the code column and renumber the index?
Here is the code:
def subtotal(df__, str):
    container = []
    for key, group in df__.groupby(['key']):
        group.loc['subtotal'] = group[['quantity', 'quantity2', 'quantity3']].sum()
        container.append(group)
    df_subtotal = pd.concat(container)
    df_subtotal.loc['GrandTotal'] = df__[['quantity', 'quantity2', 'quantity3']].sum()
    print(df_subtotal)
    return (df_subtotal.to_excel(writer, sheet_name=str))
Use np.where() to fill the NaN entries in the code column with the corresponding values from df.index. Then assign a new index array to df.index.
import numpy as np
df['code'] = np.where(df['code'].isna(), df.index, df['code'])
df.index = np.arange(1, len(df) + 1)
print(df)
code key product quntity1 quntity2 quntity3
1 cs01767 a apple-a 10 0 10.0
2 Subtotal NaN NaN 10 0 10.0
3 cs0000 b bannana-a 50 10 40.0
4 cs0000 b bannana-b 0 0 0.0
5 cs0000 b bannana-c 0 0 0.0
6 cs0000 b bannana-d 80 20 60.0
7 cs0000 b bannana-e 0 0 0.0
8 cs01048 b bannana-f 0 0 NaN
9 cs01048 b bannana-g 0 0 0.0
10 Subtotal NaN NaN 130 30 100.0
11 cs99999 c melon-a 50 10 40.0
12 cs99999 c melon-b 20 20 0.0
13 cs01188 c melon-c 10 0 10.0
14 Subtotal NaN NaN 80 30 50.0
15 GrandTotal NaN NaN 220 60 160.0
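An alternative is to write the label into the code column while the subtotal rows are built, so no repair step is needed afterwards. The sketch below uses a small hypothetical sample frame (the real data is not shown in the question); the names qty_cols, parts, grand and result are illustrative, and the Excel export is left out.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question's code.
df = pd.DataFrame({
    "code": ["cs01", "cs02", "cs03"],
    "key": ["a", "b", "b"],
    "quantity": [10, 50, 80],
    "quantity2": [0, 10, 20],
    "quantity3": [10, 40, 60],
})
qty_cols = ["quantity", "quantity2", "quantity3"]

parts = []
for _, group in df.groupby("key"):
    # One-row frame holding the sums, labelled in the "code" column.
    subtotal = group[qty_cols].sum().to_frame().T.assign(code="Subtotal")
    parts.append(pd.concat([group, subtotal]))

grand = df[qty_cols].sum().to_frame().T.assign(code="GrandTotal")

# ignore_index=True gives a clean 0..n-1 index; shift it to start at 1.
result = pd.concat(parts + [grand], ignore_index=True)
result.index += 1
print(result)
```

The key column is NaN on the subtotal rows, because those one-row frames do not carry it.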

With pd.merge() on different column names, the resulting dataframe has duplicated columns. How to avoid that?

Assume the dataframes df_1 and df_2 below, which I want to merge "left".
df_1= pd.DataFrame({'A': [1,2,3,4,5],
'B': [10,20,30,40,50]})
df_2= pd.DataFrame({'AA': [1,5],
'BB': [10,50],
'CC': [100, 500]})
>>> df_1
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
>>> df_2
AA BB CC
0 1 10 100
1 5 50 500
I want to perform a merging which will result to the following output:
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
So, I tried pd.merge(df_1, df_2, left_on=['A', 'B'], right_on=['AA', 'BB'], how='left') which unfortunately duplicates the columns upon which I merge:
A B AA BB CC
0 1 10 1.0 10.0 100.0
1 2 20 NaN NaN NaN
2 3 30 NaN NaN NaN
3 4 40 NaN NaN NaN
4 5 50 5.0 50.0 500.0
How do I achieve this without needing to drop the columns 'AA' and 'BB'?
Thank you!
You can use rename and join by A, B columns only:
df = pd.merge(df_1, df_2.rename(columns={'AA':'A','BB':'B'}), on=['A', 'B'], how='left')
print (df)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
The right_on parameter of pd.merge also accepts array-like arguments. From the documentation:
Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
df_1.merge(
df_2["CC"], left_on=["A", "B"], right_on=[df_2["AA"], df_2["BB"]], how="left"
)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
Pass df.merge a right-hand frame that does not contain the columns you want to merge on (here df_2["CC"]).
Give right_on as a list of the key arrays you want to merge on; [df_2['AA'], df_2['BB']] is equivalent to [*df_2[['AA', 'BB']].to_numpy().T]
IMHO this method is cumbersome. As @jezrael posted, renaming the columns and then merging is more pythonic/pandorable.
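As a quick check of the rename approach, here is a self-contained sketch using the frames from the question. The key_map dictionary is a hypothetical helper introduced here to keep the renaming and the on= list in sync. For comparison it also builds the frame the long way (merge, then drop AA and BB), which is what the question wanted to avoid.

```python
import pandas as pd

df_1 = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({"AA": [1, 5], "BB": [10, 50], "CC": [100, 500]})

# Hypothetical helper: right-hand key name -> left-hand key name.
key_map = {"AA": "A", "BB": "B"}

# Rename the right-hand keys so both frames share key names, then merge on them.
by_rename = df_1.merge(
    df_2.rename(columns=key_map), on=list(key_map.values()), how="left"
)

# For comparison only: merge on differing names, then drop the duplicated keys.
by_drop = df_1.merge(
    df_2, left_on=["A", "B"], right_on=["AA", "BB"], how="left"
).drop(columns=["AA", "BB"])

print(by_rename)
print(by_rename.equals(by_drop))
```

rename returns a new frame, so df_2 itself keeps its original column names.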

Get sum of values from last nth row by group id

I just want to know how to get, for every row, the sum of the last 5 values with the same id.
df:
id values
-----------------
a 5
a 10
a 10
b 2
c 2
d 2
a 5
a 10
a 20
a 10
a 15
a 20
expected df:
id values sum(x.tail(5))
-------------------------------------
a 5 NaN
a 10 NaN
a 10 NaN
b 2 NaN
c 2 NaN
d 2 NaN
a 5 NaN
a 10 NaN
a 20 40
a 10 55
a 15 55
a 20 60
For simplicity, I'm only trying to find the sum of the values in the last 5 rows for each row with id a.
I tried df.apply(lambda x: x.tail(5)), but that only showed me the last 5 rows of the entire df. I want the sum of the last n rows for each and every row. Basically it's like a rolling sum for time series data.
You can calculate the sum of the last 5 like this:
df["rolling As"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum()
(This includes the current row as one of the 5; not sure if that is what you want.)
id values rolling As
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 55.0
8 a 10 60.0
9 a 10 60.0
10 a 15 65.0
11 a 20 75.0
If you don't want it included, you can shift:
df["rolling"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum().shift()
to give:
id values rolling
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 NaN
8 a 10 55.0
9 a 10 60.0
10 a 15 60.0
11 a 20 65.0
Try using groupby, transform, and rolling:
df['sum(x.tail(5))'] = df.groupby('id')['values']\
.transform(lambda x: x.rolling(5, min_periods=5).sum().shift())
Output:
id values sum(x.tail(5))
1 a 5 NaN
2 a 10 NaN
3 a 10 NaN
4 b 2 NaN
5 c 2 NaN
6 d 2 NaN
7 a 5 NaN
8 a 10 NaN
9 a 20 40.0
10 a 10 55.0
11 a 15 55.0
12 a 20 60.0
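The groupby version can be checked against the expected output from the question with this self-contained sketch. The column name prev5_sum is an illustrative stand-in for sum(x.tail(5)).

```python
import pandas as pd

# Data from the question: ids a, a, a, b, c, d, then six more rows of a.
df = pd.DataFrame({
    "id": list("aaabcdaaaaaa"),
    "values": [5, 10, 10, 2, 2, 2, 5, 10, 20, 10, 15, 20],
})

# Per id: rolling sum over 5 rows, then shift so the current row is excluded.
df["prev5_sum"] = (
    df.groupby("id")["values"]
    .transform(lambda s: s.rolling(5).sum().shift())
)
print(df)
```

Because the rolling window is applied inside each group, the b, c and d rows between the a rows do not break the window for id a.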

Transposing a Pandas DataFrame Without Aggregating

I have a multi-columned dataframe which holds several numerical values that are the same. It looks like the following:
A B C D
0 1 1 10 1
1 1 1 20 2
2 1 5 30 3
3 2 2 40 4
4 2 3 50 5
This is great, however, I need to make A the index and B the column. The problem is that the column is aggregated and is averaged for every identical value of B.
df = DataFrame({'A':[1,1,1,2,2],
'B':[1,1,5,2,3],
'C':[10,20,30,40,50],
'D':[1,2,3,4,5]})
transposed_df = df.pivot_table(index=['A'], columns=['B'])
Instead of keeping 10 and 20 across B1, it averages the two to 15.
C D
B 1 2 3 5 1 2 3 5
A
1 15.0 NaN NaN 30.0 1.5 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
Is there any way I can Keep column B the same and display every value of C and D using Pandas, or am I better off writing my own function to do this? Also, it is very important that the index and column stay the same because only one of each number can exist.
EDIT: This is the desired output. I understand that this exact layout probably isn't possible, but it shows that 10 and 20 need to both be in column 1 and index 1.
C D
B 1 2 3 5 1 2 3 5
A
1 10.0,20.0 NaN NaN 30.0 1.0,2.0 NaN NaN 3.0
2 NaN 40.0 50.0 NaN NaN 4.0 5.0 NaN
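One possible way to get close to this layout, sketched under the assumption that list-valued cells are acceptable, is to collect the values per (A, B) pair with groupby instead of letting pivot_table average them. The name collected is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": [1, 1, 5, 2, 3],
                   "C": [10, 20, 30, 40, 50],
                   "D": [1, 2, 3, 4, 5]})

# Gather every C and D value for each (A, B) pair into a list,
# then move B from the row index into the columns.
collected = df.groupby(["A", "B"])[["C", "D"]].agg(list).unstack("B")
print(collected)
```

Each cell then holds a list such as [10, 20] for A=1, B=1, and combinations that never occur stay NaN. The cells are Python objects rather than numbers, so further numeric operations on this frame need extra handling.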
