Reset index without multiple headers after pivot in pandas - python

I have this DataFrame
df = pd.DataFrame({'store':[1,1,1,2],'upc':[11,22,33,11],'sales':[14,16,11,29]})
which gives this output
store upc sales
0 1 11 14
1 1 22 16
2 1 33 11
3 2 11 29
I want something like this
store upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
I tried this
newdf = df.pivot(index='store', columns='upc')
newdf.columns = newdf.columns.droplevel(0)
and the output looks like this with multiple headers
upc 11 22 33
store
1 14.0 16.0 11.0
2 29.0 NaN NaN
I also tried
newdf = df.pivot(index='store', columns='upc').reset_index()
This also gives multiple headers
store sales
upc 11 22 33
0 1 14.0 16.0 11.0
1 2 29.0 NaN NaN

Try it via an f-string, the columns attribute, and a list comprehension:
newdf = df.pivot(index='store', columns='upc')
newdf.columns = [f"upc_{y}" for x, y in newdf.columns]
newdf = newdf.reset_index()
Or, in 2 steps:
newdf = df.pivot(index='store', columns='upc').reset_index()
newdf.columns = [f"upc_{y}" if y != '' else f"{x}" for x, y in newdf.columns]
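With the sample data above, both versions should produce:
store upc_11 upc_22 upc_33
0 1 14.0 16.0 11.0
1 2 29.0 NaN NaN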

Another option, which is longer than @Anurag's:
(df.pivot(index='store', columns='upc')
   .droplevel(axis=1, level=0)
   .rename(columns=lambda col: f"upc_{col}")
   .rename_axis(index=None, columns=None)
)
upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
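The droplevel/rename pair can also be shortened with DataFrame.add_prefix, which prepends a string to every column label (a sketch of the same idea):
(df.pivot(index='store', columns='upc')
   .droplevel(axis=1, level=0)
   .add_prefix('upc_')
   .rename_axis(index=None, columns=None)
)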

Related

Convert a python df which is in pivot format to a proper row column format

I have the following dataframe
id a_1_1 a_1_2 a_1_3 a_1_4 b_1_1 b_1_2 b_1_3 c_1_1 c_1_2 c_1_3
1 10 20 30 40 90 80 70 NaN NaN NaN
2 33 34 35 36 NaN NaN NaN 11 12 13
and I want my result to be as follows
id col_name 1 2 3
1 a 10 20 30
1 b 90 80 70
2 a 33 34 35
2 c 11 12 13
I am trying to use the pd.melt function, but it is not yielding the correct result.
IIUC, you can reshape using an intermediate MultiIndex after extracting the letter and last digit from the original column names:
(df.set_index('id')
   .pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
       d.columns.str.extract(r'^([^_]+).*(\d+)'),
       names=['col_name', None]
   ), axis=1))
   .stack('col_name')
   .dropna(axis=1)  # assuming you don't want columns with NaNs
   .reset_index()
)
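For reference, the extract call is what builds the two index levels: it pulls the leading letter and the last digit out of each column name. A quick check with a few of the sample columns:
import pandas as pd

cols = pd.Series(['a_1_1', 'b_1_3', 'c_1_2'])
print(cols.str.extract(r'^([^_]+).*(\d+)'))
#    0  1
# 0  a  1
# 1  b  3
# 2  c  2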
Variant using janitor's pivot_longer:
# pip install pyjanitor
import janitor

(df.pivot_longer(index='id', names_to=('col_name', '.value'),
                 names_pattern=r'([^_]+).*(\d+)')
   .pipe(lambda d: d.dropna(thresh=d.shape[1]-2))
   .dropna(axis=1)
)
Output:
id col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
Code:
df = df1.melt(id_vars=["id"],
              var_name="Col_name",
              value_name="Value").dropna()
# last character of the original name is the column number, first character
# is the letter (this assumes single-letter prefixes and single-digit suffixes)
df['Num'] = df['Col_name'].apply(lambda x: x[-1])
df['Col_name'] = df['Col_name'].apply(lambda x: x[0])
df = df.pivot(index=['id', 'Col_name'], columns='Num',
              values='Value').reset_index().dropna(axis=1)
df
Output:
Num id Col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
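As a small refinement, the two apply calls above can be replaced with the vectorized .str accessor, which does the same slicing without a Python-level loop:
df['Num'] = df['Col_name'].str[-1]       # last character: the column number
df['Col_name'] = df['Col_name'].str[0]   # first character: the letter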

Combining two dataframes

I've tried merging two dataframes, but I can't seem to get it to work. Each time I merge, the rows where I expect values are all 0. Dataframe df1 already has some data in it, with some rows left blank. Dataframe df2 should populate those blank rows in df1 where the column names match, at each value of "TempBin" and each value of "Month" in df1.
EDIT:
Both dataframes are in a for loop. df1 acts as my "storage"; df2 changes with each location iteration. So if df2 contained the results for LocationZP, I would also want that data inserted into the matching df1 rows. If I use df1 = df1.append(df2) in the for loop, all of the rows from df2 keep getting appended at the very end of df1 on each iteration.
df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 0 0 0
13 1 0 0 0
13 2 0 0 0
13 3 0 0 0
13 4 0 0 0
13 5 0 0 0
df2:
Month TempBin LocationAA
13 0 11
13 1 22
13 2 33
13 3 44
13 4 55
13 5 66
desired output in df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 11 0 0
13 1 22 0 0
13 2 33 0 0
13 3 44 0 0
13 4 55 0 0
13 5 66 0 0
import pandas as pd

df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]})
df1 = pd.merge(df1, df2, on=["Month","TempBin","LocationAA"], how="left")
result:
Month TempBin LocationAA LocationXA LocationZP
1 0 7.0 1.0 2.0
1 1 98.0 0.0 89.0
1 2 12.0 23.0 38.0
1 3 3.0 14.0 17.0
1 4 7.0 9.0 14.0
1 5 1.0 8.0 99.0
13 0 NaN NaN NaN
13 1 NaN NaN NaN
13 2 NaN NaN NaN
13 3 NaN NaN NaN
13 4 NaN NaN NaN
13 5 NaN NaN NaN
Here's some code that worked for me:
# Merge the two dataframes on the columns "TempBin" and "Month",
# coalesce the LocationAA columns, then fill remaining NaN values with 0.
import pandas as pd

df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]})

df_merge = pd.merge(df1, df2, how='left', on=['TempBin', 'Month'])
# add column LocationAA and fill it with the non-null value from
# LocationAA_x and LocationAA_y (do this before filling with 0,
# otherwise the isnull check can never fire)
df_merge['LocationAA'] = df_merge.apply(
    lambda x: x['LocationAA_x'] if pd.isnull(x['LocationAA_y'])
    else x['LocationAA_y'], axis=1)
# remove columns LocationAA_x and LocationAA_y
df_merge.drop(['LocationAA_x', 'LocationAA_y'], axis=1, inplace=True)
df_merge.fillna(0, inplace=True)
print(df_merge)
Output:
Month TempBin LocationXA LocationZP LocationAA
0 1 0 1 2 7.0
1 1 1 0 89 98.0
2 1 2 23 38 12.0
3 1 3 14 17 3.0
4 1 4 9 14 7.0
5 1 5 8 99 1.0
6 13 0 0 0 11.0
7 13 1 0 0 22.0
8 13 2 0 0 33.0
9 13 3 0 0 44.0
10 13 4 0 0 55.0
11 13 5 0 0 66.0
Let me know if there's something you don't understand in the comments :)
PS: Sorry for the extra comments, but I left them in for some extra explanation.
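As an aside, the row-wise apply above can be replaced by a vectorized coalesce, which is equivalent here and faster on large frames:
df_merge['LocationAA'] = df_merge['LocationAA_y'].fillna(df_merge['LocationAA_x'])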
You need to use append to get the desired output:
df1 = df1.append(df2)
and if you want to replace the nulls with zeros, add:
df1 = df1.fillna(0)
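Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the concat equivalent is:
df1 = pd.concat([df1, df2], ignore_index=True).fillna(0)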
Here is another way, using combine_first():
i = ['Month','TempBin']
df2.set_index(i).combine_first(df1.set_index(i)).reset_index()
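combine_first aligns both frames on the ['Month', 'TempBin'] index and keeps df2's values where they exist, falling back to df1 for the missing rows and columns. With the sample data above it should give something like:
Month TempBin LocationAA LocationXA LocationZP
1 0 7.0 1.0 2.0
1 1 98.0 0.0 89.0
1 2 12.0 23.0 38.0
1 3 3.0 14.0 17.0
1 4 7.0 9.0 14.0
1 5 1.0 8.0 99.0
13 0 11.0 0.0 0.0
13 1 22.0 0.0 0.0
13 2 33.0 0.0 0.0
13 3 44.0 0.0 0.0
13 4 55.0 0.0 0.0
13 5 66.0 0.0 0.0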

Normalizing/Adjusting time series dataframe

I am fairly new to Python and Pandas; I've been searching for a solution for a couple of days with no luck... here's the problem:
I have a data set like the one below, and I need to cull the first few values of some rows so that the highest value in each row ends up in column A. In the example below, rows 0 & 3 would drop the values in column A, and row 4 would drop the values in columns A and B, then shift all remaining values to the left.
A B C D
0 11 23 21 14
1 24 18 17 15
2 22 18 15 13
3 10 13 12 10
4 5 7 14 11
Desired
A B C D
0 23 21 14 NaN
1 24 18 17 15
2 22 18 15 13
3 13 12 10 NaN
4 14 11 NaN NaN
I've looked at df.shift(), but I don't see how I can get that function to work on a row-by-row basis. Should I instead be using an array and a loop?
Any help is greatly appreciated.
You need to turn all values to the left of the max into np.nan and then use the solution in this question; I use the one from @cs95:
df_final = df[df.eq(df.max(1), axis=0).cummax(1)].apply(lambda x: sorted(x, key=pd.isnull), 1)
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
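A step-by-step breakdown of that one-liner, using the sample frame above:
import pandas as pd

row_max = df.max(axis=1)        # each row's maximum
mask = df.eq(row_max, axis=0)   # True where a value equals its row's max
mask = mask.cummax(axis=1)      # True from the max onwards
masked = df[mask]               # everything left of the max becomes NaN
# push the NaNs to the right within each row
df_final = masked.apply(lambda x: sorted(x, key=pd.isnull), axis=1)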
You can loop over the unique shifts (fewer of these than rows) with a groupby and join the results back:
import pandas as pd

# the left-shift each row needs is the column position of its max
shifts = df.to_numpy().argmax(1)
pd.concat([gp.shift(-i, axis=1) for i, gp in df.groupby(shifts)]).sort_index()
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
One approach is to convert each row of the data frame to a list (excluding the index) and append NaN values. Then keep N elements, starting with the max value.
import numpy as np
import pandas as pd

ncols = len(df.columns)
nans = [np.nan] * ncols
new_rows = list()
for row in df.itertuples():
    # convert each row of the data frame to a list
    # (start at 1 to exclude the index) and append the list of NaNs
    new_list = list(row[1:]) + nans
    # find index of max value (excluding the NaNs we appended)
    k = np.argmax(new_list[:ncols])
    # collect the new row, starting at the max element
    new_rows.append(new_list[k : k + ncols])

# create new data frame
df_new = pd.DataFrame(new_rows, columns=df.columns)
df_new
import numpy as np

df = df.astype(float)  # work in float so NaN fits in the frame
for i in range(df.shape[0]):
    arr = list(df.iloc[i, :])
    c = 0
    # drop leading values until the row's max comes first
    while arr[0] != max(arr):
        arr.pop(0)
        c += 1
    # pad with NaN at the end to keep the number of columns
    arr.extend([np.nan] * c)
    df.iloc[i, :] = arr
print(df)
I looped over every row, found the max value, removed the values before it, and padded NaN values at the end so every row keeps the same number of columns.

Python: read all the sheets and combine

I tried to concatenate all the sheets in the Excel file without leaving NaNs in the columns coming from the other sheets.
import pandas as pd
excel_file = "C:/Users/User/Documents/UiPath/Endo Bot/endoProcess/NEW ENDO PASTE HERE/-r1- (07-23-2020).xlsx"
fil = pd.ExcelFile(excel_file)
names = fil.sheet_names
df = pd.concat([fil.parse(name) for name in names])
print(df)
Looks like it only appends the sheets to the first sheet.
The result:
COUNT NAME Number count2
0 4.0 kiko NaN NaN
1 5.0 esmer NaN NaN
2 6.0 jason NaN NaN
0 NaN NaN 9.0 23.0
1 NaN NaN 10.0 13.0
2 NaN NaN 11.0 14.0
The result that I want:
COUNT NAME Number count2
0 4.0 kiko 9.0 23.0
1 5.0 esmer 10.0 13.0
2 6.0 jason 11.0 14.0
Concatenate on axis 1 (columns) instead of axis 0 (index, the default), like so: df = pd.concat([fil.parse(name) for name in names], axis=1).
Code
import pandas as pd
excel_file = "C:/Users/User/Documents/UiPath/Endo Bot/endoProcess/NEW ENDO PASTE HERE/-r1- (07-23-2020).xlsx"
fil = pd.ExcelFile(excel_file)
names = fil.sheet_names
# concatenated
df = pd.concat([fil.parse(name) for name in names], axis=1)
print(df)
Output
COUNT NAME Number count2
0 4 kiko 9 23
1 5 esmer 10 13
2 6 jason 11 14
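As a side note, pd.read_excel can load every sheet at once by passing sheet_name=None, which returns a dict mapping sheet names to DataFrames; the same concatenation then becomes (a sketch):
import pandas as pd

sheets = pd.read_excel(excel_file, sheet_name=None)  # dict: sheet name -> DataFrame
df = pd.concat(sheets.values(), axis=1)
print(df)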

Add Series to DataFrame with additional index values

I have a DataFrame which looks like this:
Value
1 23
2 12
3 4
And a Series which looks like this:
1 24
2 12
4 34
Is there a way to add the Series to the DataFrame to obtain a result which looks like this:
Value New
1 23 24
2 12 12
3 4 0
4 0 34
Using concat(..., axis=1) and .fillna():
import pandas as pd
df = pd.DataFrame([23,12,4], columns=["Value"], index=[1,2,3])
s = pd.Series([24,12,34],index=[1,2,4], name="New")
df = pd.concat([df, s], axis=1)
print(df)
df = df.fillna(0)  # or df.fillna(0, inplace=True)
print(df)
Output:
Value New
1 23.0 24.0
2 12.0 12.0
3 4.0 NaN
4 NaN 34.0
# If replacing NaNs with 0:
Value New
1 23.0 24.0
2 12.0 12.0
3 4.0 0.0
4 0.0 34.0
You can use join between a series and a dataframe:
my_df.join(my_series, how='outer').fillna(0)
Example:
>>> df
Value
1 23
2 12
3 4
>>> s
1    24
2    12
4    34
Name: 0, dtype: int64
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> type(s)
<class 'pandas.core.series.Series'>
>>> df.join(s, how='outer').fillna(0)
Value 0
1 23.0 24.0
2 12.0 12.0
3 4.0 0.0
4 0.0 34.0
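If you want the new column to be called New as in the question, rename the series before joining; astype(int) optionally restores the integer dtype lost to the NaNs:
>>> df.join(s.rename('New'), how='outer').fillna(0).astype(int)
Value New
1 23 24
2 12 12
3 4 0
4 0 34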
