Confused about the meaning of calculating sum in Pandas Pivot Table - python

pd.pivot_table(df1,index=['Staff'],columns=['Products'],values=['Price'],margins=True)
Results:
Price
Products A B C All
Staff
Staff_A 14472.000000 15877.777778 14982.352941 14890.196078
Staff_B 14775.000000 15620.000000 16330.000000 15815.789474
Staff_C 15293.333333 14262.500000 15214.285714 15000.000000
All 14779.545455 15231.818182 15426.470588 15099.000000
The table is produced successfully, but I do not understand where the values in the All column and All row come from. Can anyone explain how this "sum" is computed? I thought it should be the first three columns of data added together.

To illustrate:
import pandas as pd
import numpy as np
Create dummy dataframe:
df = pd.DataFrame({
'id': ['a', 'b', 'c', 'a', 'b', 'c'],
'stat': ['col1', 'col1', 'col1', 'col2', 'col2', 'col2'],
'val': [2, 4, 6, 6, 8, 10],
})
This will give us:
>>> df
id stat val
0 a col1 2
1 b col1 4
2 c col1 6
3 a col2 6
4 b col2 8
5 c col2 10
Pivoting without giving aggfunc argument
table = pd.pivot_table(df, index='id', columns='stat', values='val', margins=True)
The above will give us:
>>> table
stat col1 col2 All
id
a 2 6 4.0
b 4 8 6.0
c 6 10 8.0
All 4 8 6.0
Notice that the All column and All row give us the mean of the rows and columns, respectively: pivot_table's default aggfunc is 'mean', which is why the margins in your question are averages rather than sums.
Pivoting and using aggfunc=np.sum:
new_table = pd.pivot_table(df, index='id', columns='stat', values='val', margins=True, aggfunc=np.sum)
This will give us:
>>> new_table
stat col1 col2 All
id
a 2 6 8
b 4 8 12
c 6 10 16
All 12 24 36
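The key point for the original question is what the margins average over: with the default aggfunc, each All cell is the mean of all the underlying rows that fall into it, not the mean of the already-aggregated cells (the two coincide here only because every group has the same size). A quick check with the dummy data above:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'a', 'b', 'c'],
    'stat': ['col1', 'col1', 'col1', 'col2', 'col2', 'col2'],
    'val': [2, 4, 6, 6, 8, 10],
})

# Default aggfunc is 'mean': every cell, including the margins,
# is the mean of the raw rows that fall into it.
table = pd.pivot_table(df, index='id', columns='stat', values='val', margins=True)

assert table.loc['All', 'All'] == df['val'].mean()                     # 6.0
assert table.loc['a', 'All'] == df.loc[df['id'] == 'a', 'val'].mean()  # 4.0
```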

Related

How to vertically combine two pandas dataframes that have different number of columns

There are two dataframes, one dataframe might have less columns than another one. For instance,
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'B'],
'col2': [2, 9],
'col3': [0, 1]
})
df1 = pd.DataFrame({
'col1': ['G'],
'col2': [3]
})
I would like to combine these two dataframes, and the missing values should be filled with some given value, such as -100. How can I perform this kind of combination?
You could reindex the DataFrames first to "preserve" the dtypes; then concatenate:
cols = df.columns.union(df1.columns)
out = pd.concat([d.reindex(columns=cols, fill_value=-100) for d in [df, df1]],
ignore_index=True)
Output:
col1 col2 col3
0 A 2 0
1 B 9 1
2 G 3 -100
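A quick check (a sketch using the frames from the question) of why the reindex approach preserves the integer dtype: -100 is filled in before concatenation, so no NaN, and hence no float upcast, is ever introduced:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B'], 'col2': [2, 9], 'col3': [0, 1]})
df1 = pd.DataFrame({'col1': ['G'], 'col2': [3]})

# Reindex each frame to the union of columns, filling the missing
# column with -100 up front, so concat never has to introduce NaN.
cols = df.columns.union(df1.columns)
out = pd.concat([d.reindex(columns=cols, fill_value=-100) for d in [df, df1]],
                ignore_index=True)

assert str(out['col3'].dtype) == 'int64'   # no float upcast
assert out.loc[2, 'col3'] == -100
```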
Use concat with DataFrame.fillna:
df = pd.concat([df, df1], ignore_index=True).fillna(-100)
print(df)
col1 col2 col3
0 A 2 0.0
1 B 9 1.0
2 G 3 -100.0
If you need the original dtypes back, add DataFrame.astype (Series.append was removed in pandas 2.0, so combine the dtypes with pd.concat):
d = pd.concat([df.dtypes, df1.dtypes]).to_dict()
df = pd.concat([df, df1], ignore_index=True).fillna(-100).astype(d)
print(df)
col1 col2 col3
0 A 2 0
1 B 9 1
2 G 3 -100

Pandas how to perform outer merge with specific order of adding rows

I have two data frames:
df:
col1 col2
0 x 1
1 a 2
2 b 3
3 c 4
and
df2:
col1 col2
0 x 1
1 a 2
2 f 6
3 c 4
And I want to obtain a data frame in which each new row from df2 is added right after the row with the same index from df, like this:
col1 col2
0 x 1
1 a 2
2 b 3
3 f 6
4 c 4
df1 = pd.DataFrame({
'col1': ['x', 'a', 'b', 'c'],
'col2': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
'col1': ['x', 'a', 'f', 'c'],
'col2': [1, 2, 6, 4]
})
In order to get your output, concatenate the two dataframes, drop the rows that appear in both, and sort by the (preserved) index; then reset_index() renumbers the result:
df = pd.concat([df1, df2]).drop_duplicates().sort_index().reset_index(drop=True)
# Output
col1 col2
0 x 1
1 a 2
2 b 3
3 f 6
4 c 4
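A runnable sketch of the approach above; kind='stable' is added to sort_index so that, for duplicated index values, df1's surviving row is guaranteed to come before df2's new row (the default sort does not promise a stable order):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['x', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'col1': ['x', 'a', 'f', 'c'], 'col2': [1, 2, 6, 4]})

# concat keeps each frame's original index; after dropping the rows that
# appear in both frames, a stable sort on that index slots each new df2
# row directly after the df1 row that occupied the same position.
out = (pd.concat([df1, df2])
         .drop_duplicates()
         .sort_index(kind='stable')
         .reset_index(drop=True))

assert out['col1'].tolist() == ['x', 'a', 'b', 'f', 'c']
assert out['col2'].tolist() == [1, 2, 3, 6, 4]
```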

Unexpected row getting changed in pandas loc assignment

I want to copy a portion of a pandas dataframe onto a different portion, overwriting the existing values there. I am using .loc but more rows are changing than the ones I am referencing.
My example:
df = pd.DataFrame({
'col1': ['A', 'B', 'C', 'D', 'E'],
'col2': range(1, 6),
'col3': range(6, 11)
})
print(df)
col1 col2 col3
0 A 1 6
1 B 2 7
2 C 3 8
3 D 4 9
4 E 5 10
I want to write the values of col2 and col3 from the C and D rows onto the A and B rows. Using .loc:
df.loc[0:2, ["col2", "col3"]] = df.loc[2:4, ["col2", "col3"]].values
print(df)
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 5 10
3 D 4 9
4 E 5 10
This does what I want for rows A and B, but row C has also changed. I expect only the first two rows to change, i.e. my expected output is
col1 col2 col3
0 A 3 8
1 B 4 9
2 C 3 8
3 D 4 9
4 E 5 10
Why did the C row also change, and how can I do this while changing only the first two rows?
Unlike Python list slicing, pandas.DataFrame.loc slicing is inclusive on both ends:
Warning Note that contrary to usual python slices, both the start and the stop
are included
so you should do
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values
In addition, you can pass explicit lists of labels; that way the rows need not be consecutive:
df.loc[[0,1], ["col2", "col3"]] = df.loc[[2,3], ["col2", "col3"]].values
You went too far with the indices:
df.loc[0:1, ["col2", "col3"]] = df.loc[2:3, ["col2", "col3"]].values
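To make the two slicing conventions concrete, here is the corrected assignment run end to end (the expected values match the asker's desired output):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D', 'E'],
    'col2': range(1, 6),
    'col3': range(6, 11),
})

# .loc label slices include BOTH endpoints: 0:1 selects exactly rows 0 and 1,
# so only A and B receive the values copied from rows 2 and 3 (C and D).
df.loc[0:1, ['col2', 'col3']] = df.loc[2:3, ['col2', 'col3']].values

assert df['col2'].tolist() == [3, 4, 3, 4, 5]
assert df['col3'].tolist() == [8, 9, 8, 9, 10]
```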

apply function only within the same row index?

I have a dataframe with a sorted two-level index, and I want to apply diff to the column only within each col1 group, in the order sorted by col2.
mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1,2,3,4], 'col3': [1,4,7,3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.col3.diff(1)
This gives me
col3 diff
col1 col2
__________________________
A 1 1 nan
4 3 2
B 2 4 1
C 3 7 3
Above, diff is applied row by row across the whole column, ignoring the groups.
What I want is
col3 diff
col1 col2
__________________________
A 1 1 nan
4 3 2
B 2 4 nan
C 3 7 nan
You'll want to use groupby to apply diff to each group:
mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1,2,3,4], 'col3': [1,4,7,3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.groupby(level='col1')['col3'].diff()
Since you have already done the heavy lifting of sorting, you can diff the whole column and keep the result only where the group label is unchanged. You can't shift a non-datetime index directly, so either wrap it in a Series, or use np.roll (though np.roll wraps around, which would give the wrong answer for a single-group DataFrame):
import pandas as pd
s = pd.Series(mini_df.index.get_level_values('col1'))
mini_df['diff'] = mini_df.col3.diff().where(s.eq(s.shift(1)).values)
col3 diff
col1 col2
A 1 1 NaN
4 3 2.0
B 2 4 NaN
C 3 7 NaN
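For reference, a self-contained version of the groupby answer with a couple of checks (data taken from the question):

```python
import pandas as pd

mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'],
                        'col2': [1, 2, 3, 4],
                        'col3': [1, 4, 7, 3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()

# diff restarts inside each col1 group, so the first row of every group
# is NaN instead of differencing across a group boundary.
mini_df['diff'] = mini_df.groupby(level='col1')['col3'].diff()

assert mini_df['diff'].isna().sum() == 3           # one NaN per group
assert mini_df['diff'].dropna().tolist() == [2.0]  # within A: 3 - 1
```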

for loop for searching value in dataframe and updating values next to it

I want Python to update the values next to a value found in both dataframes (somewhat similar to VLOOKUP in MS Excel). For example, given:
import pandas as pd
df1 = pd.DataFrame(data = {'col1':['a', 'b', 'd'], 'col2': [1, 2, 4], 'col3': [2, 3, 4]})
df2 = pd.DataFrame(data = {'col1':['a', 'f', 'c', 'd']})
In [3]: df1
Out[3]:
col1 col2 col3
0 a 1 2
1 b 2 3
2 d 4 4
In [4]: df2
Out[4]:
col1
0 a
1 f
2 c
3 d
Outcome must be the following:
In [6]: df3 = *somecode*
df3
Out[6]:
col1 col2 col3
0 a 1 2
1 f
2 c
3 d 4 4
The main part is that I want some sort of "for loop" to do this.
So, for instance, Python searches for the first value of col1 in df2, finds it in df1, and updates col2 and col3 respectively, then moves on to the next value.
A for loop in pandas is best avoided when a vectorized solution exists.
I think merge with a left join is what you need; the on parameter can be omitted because col1 is the only column common to both DataFrames:
df3 = df2.merge(df1, how='left')
print (df3)
col1 col2 col3
0 a 1.0 2.0
1 f NaN NaN
2 c NaN NaN
3 d 4.0 4.0
A simple left join will solve your problem:
pd.merge(df2,df1,how='left',on=['col1'])
col1 col2 col3
0 a 1.0 2.0
1 f NaN NaN
2 c NaN NaN
3 d 4.0 4.0
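Both answers are the same left join; as a quick check, the unmatched keys get NaN, which is also why the integer columns come back as float:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'd'], 'col2': [1, 2, 4], 'col3': [2, 3, 4]})
df2 = pd.DataFrame({'col1': ['a', 'f', 'c', 'd']})

# Left join: every row of df2 is kept, matching df1 rows fill in
# col2/col3, and keys absent from df1 ('f', 'c') become NaN.
df3 = df2.merge(df1, how='left', on='col1')

assert df3['col1'].tolist() == ['a', 'f', 'c', 'd']
assert df3['col2'].isna().sum() == 2
assert df3.loc[3, 'col3'] == 4.0
```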
