How to align yticklabels when combining a barplot with heatmap - python

I have a similar problem to this question: I am trying to combine three plots in seaborn, but the labels on my y-axis are not aligned with the bars.
My code (now a working copy-paste example):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
### Generate example data
np.random.seed(123)
year = [2018, 2019, 2020, 2021]
task = [x + 2 for x in range(18)]
student = [x for x in range(200)]
amount = [x + 10 for x in range(90)]
violation = [letter for letter in "thisisjustsampletextforlabels"] # one letter labels
df_example = pd.DataFrame({
    # some ways to create random data
    'year': np.random.choice(year, 500),
    'task': np.random.choice(task, 500),
    'violation': np.random.choice(violation, 500),
    'amount': np.random.choice(amount, 500),
    'student': np.random.choice(student, 500)
})
### My code
temp = df_example.groupby(["violation"])["amount"].sum().sort_values(ascending = False).reset_index()
total_violations = temp["amount"].sum()
sns.set(font_scale = 1.2)
f, axs = plt.subplots(1, 3,
                      figsize=(5, 5),
                      sharey="row",
                      gridspec_kw=dict(width_ratios=[3, 1.5, 5]))
# Plot frequency
df1 = df_example.groupby(["year","violation"])["amount"].sum().sort_values(ascending = False).reset_index()
frequency = sns.barplot(data = df1, y = "violation", x = "amount", log = True, ax=axs[0])
# Plot percent
df2 = df_example.groupby(["violation"])["amount"].sum().sort_values(ascending = False).reset_index()
total_violations = df2["amount"].sum()
percent = sns.barplot(x='amount', y='violation', estimator=lambda x: sum(x) / total_violations * 100, data=df2, ax=axs[1])
# Pivot table and plot heatmap
df_heatmap = df_example.groupby(["violation", "task"])["amount"].sum().sort_values(ascending = False).reset_index()
df_heatmap_pivot = df_heatmap.pivot("violation", "task", "amount")
df_heatmap_pivot = df_heatmap_pivot.reindex(index=df_heatmap["violation"].unique())
heatmap = sns.heatmap(df_heatmap_pivot, fmt = "d", cmap="Greys", norm=LogNorm(), ax=axs[2])
plt.subplots_adjust(top=1)
axs[2].set_facecolor('xkcd:white')
axs[2].set(ylabel="",xlabel="Task")
axs[0].set_xlabel('Total amount of violations per year')
axs[1].set_xlabel('Percent (%)')
axs[1].set_ylabel('')
axs[0].set_ylabel('Violation')
The result can be seen here: [image: the two bar plots and the heatmap, with the y-tick labels misaligned against the bars]
The y-labels are aligned according to my last plot, the heatmap. However, the bars in the bar plots are clipping at the top, and are not aligned to the labels. I just have to nudge the bars in the barplot -- but how? I've been looking through the documentation, but I feel quite clueless as of now.

Note that none of the y-axis ticklabels are aligned because a different dataframe is used for each plot. It is better to create a single dataframe, violations, with the aggregated data to be plotted. Start with the sum of amounts by violation, then add a new percent column. This ensures the two bar plots share the same y-axis order.
Instead of using .groupby and then .pivot to create df_heatmap_pivot, use .pivot_table, and then reindex with violations.violation so the heatmap rows follow the same order.
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1, seaborn 0.11.2
DataFrames and Imports
import pandas as pd
import numpy as np
import seaborn as sns  # needed for sns.set and the bar plots below
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
# Generate example data
year = [2018, 2019, 2020, 2021]
task = [x + 2 for x in range(18)]
student = [x for x in range(200)]
amount = [x + 10 for x in range(90)]
violation = list("thisisjustsampletextforlabels") # one letter labels
np.random.seed(123)
df_example = pd.DataFrame({name: np.random.choice(group, 500) for name, group in
                           zip(['year', 'task', 'violation', 'amount', 'student'],
                               [year, task, violation, amount, student])})
# organize all of the data
# violations frequency
violations = df_example.groupby(["violation"])["amount"].sum().sort_values(ascending=False).reset_index()
total_violations = violations["amount"].sum()
# add percent
violations['percent'] = violations.amount.div(total_violations).mul(100).round(2)
# Use .pivot_table to create the pivot table
df_heatmap_pivot = df_example.pivot_table(index='violation', columns='task', values='amount', aggfunc='sum')
# Set the index to match the plot order of the 'violation' column
df_heatmap_pivot = df_heatmap_pivot.reindex(index=violations.violation)
Plotting
Using sharey='row' is causing the alignment problem. Use sharey=False, and remove the yticklabels from axs[1] and axs[2] with axs[1].set_yticks([]) and axs[2].set_yticks([]).
See How to add value labels on a bar chart for additional details and examples using .bar_label.
# set seaborn plot format
sns.set(font_scale=1.2)
# create the figure and set sharey=False
f, axs = plt.subplots(1, 3, figsize=(12, 12), sharey=False, gridspec_kw=dict(width_ratios=[3,1.5,5]))
# Plot frequency
sns.barplot(data=violations, x="amount", y="violation", log=True, ax=axs[0])
# Plot percent
sns.barplot(data=violations, x='percent', y='violation', ax=axs[1])
# add the bar labels
axs[1].bar_label(axs[1].containers[0], fmt='%.2f%%', label_type='edge', padding=3)
# add extra space for the annotation
axs[1].margins(x=1.3)
# plot the heatmap
heatmap = sns.heatmap(df_heatmap_pivot, fmt="d", cmap="Greys", norm=LogNorm(), ax=axs[2])
# additional formatting
axs[2].set_facecolor('xkcd:white')
axs[2].set(ylabel="", xlabel="Task")
axs[0].set_xlabel('Total amount of violations per year')
axs[1].set_xlabel('Percent (%)')
axs[1].set_ylabel('')
axs[0].set_ylabel('Violation')
# remove yticks / labels
axs[1].set_yticks([])
_ = axs[2].set_yticks([])
Comment out the last two lines to verify the yticklabels are aligned for each axs.
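If you prefer to keep the tick marks and hide only the labels, matplotlib's tick_params is an alternative (a sketch of the same idea):
# alternative: hide the y-tick labels but keep the ticks themselves
axs[1].tick_params(labelleft=False)
axs[2].tick_params(labelleft=False)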
DataFrame Views
df_example.head()
year task violation amount student
0 2020 2 i 84 59
1 2019 2 u 12 182
2 2020 5 s 20 9
3 2020 11 u 56 163
4 2018 17 t 59 125
violations
violation amount percent
0 s 4869 17.86
1 l 3103 11.38
2 t 3044 11.17
3 e 2634 9.66
4 a 2177 7.99
5 i 2099 7.70
6 h 1275 4.68
7 f 1232 4.52
8 b 1191 4.37
9 m 1155 4.24
10 o 1075 3.94
11 p 763 2.80
12 r 762 2.80
13 j 707 2.59
14 u 595 2.18
15 x 578 2.12
df_heatmap_pivot
task 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
violation
s 62.0 36.0 263.0 273.0 191.0 250.0 556.0 239.0 230.0 188.0 185.0 516.0 249.0 331.0 212.0 219.0 458.0 411.0
l 83.0 245.0 264.0 451.0 155.0 314.0 98.0 125.0 310.0 117.0 21.0 99.0 98.0 50.0 40.0 268.0 192.0 173.0
t 212.0 255.0 45.0 141.0 74.0 135.0 52.0 202.0 107.0 128.0 158.0 NaN 261.0 137.0 339.0 207.0 362.0 229.0
e 215.0 315.0 NaN 116.0 213.0 165.0 130.0 194.0 56.0 355.0 75.0 NaN 118.0 189.0 160.0 177.0 79.0 77.0
a 135.0 NaN 165.0 156.0 204.0 115.0 77.0 65.0 80.0 143.0 83.0 146.0 21.0 29.0 285.0 72.0 116.0 285.0
i 209.0 NaN 20.0 187.0 83.0 136.0 24.0 132.0 257.0 56.0 201.0 52.0 136.0 226.0 104.0 145.0 91.0 40.0
h 27.0 NaN 255.0 NaN 99.0 NaN 71.0 53.0 100.0 89.0 NaN 106.0 NaN 170.0 86.0 79.0 140.0 NaN
f 75.0 23.0 99.0 NaN 26.0 103.0 NaN 185.0 99.0 145.0 NaN 63.0 64.0 29.0 114.0 141.0 38.0 28.0
b 44.0 70.0 56.0 12.0 55.0 14.0 158.0 130.0 NaN 11.0 21.0 NaN 52.0 137.0 162.0 NaN 231.0 38.0
m 86.0 NaN NaN 147.0 74.0 131.0 49.0 180.0 94.0 16.0 NaN 88.0 NaN NaN NaN 51.0 161.0 78.0
o 109.0 NaN 51.0 NaN NaN NaN 20.0 139.0 149.0 NaN 101.0 60.0 NaN 143.0 39.0 73.0 10.0 181.0
p 16.0 NaN 197.0 50.0 87.0 NaN 88.0 NaN 11.0 162.0 NaN 14.0 NaN 78.0 45.0 NaN NaN 15.0
r NaN 85.0 73.0 40.0 NaN NaN 68.0 77.0 NaN 26.0 122.0 105.0 NaN 98.0 NaN NaN NaN 68.0
j NaN 70.0 NaN NaN 73.0 76.0 NaN 150.0 NaN NaN NaN 81.0 NaN 97.0 97.0 63.0 NaN NaN
u 174.0 45.0 NaN NaN 32.0 NaN NaN 86.0 30.0 56.0 13.0 NaN 24.0 NaN NaN 69.0 54.0 12.0
x 69.0 29.0 NaN 106.0 NaN 43.0 NaN NaN NaN 97.0 56.0 29.0 149.0 NaN NaN NaN NaN NaN
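As a quick sanity check (a sketch), you can confirm before plotting that all three axes will share the same row order:
# df_heatmap_pivot was reindexed with violations.violation, so this should hold
assert list(df_heatmap_pivot.index) == list(violations.violation)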

Related

Apply fillna(method='bfill') only if the values in same year and month with Python

Let's say I have a panel dataframe with lots of NaNs inside, as follows:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130226', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
for col in df.columns:
    df.loc[df.sample(frac=0.4).index, col] = np.nan  # pd.np was removed in pandas 1.0; use np.nan
df
Out:
A B C
2013-02-26 NaN NaN NaN
2013-02-27 NaN NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 NaN 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 NaN
2015-02-12 67.0 NaN NaN
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
I need to apply df.fillna(method='bfill') or df.fillna(method='ffill') to the dataframe only if they are in same year and month:
For example, if I apply df.fillna(method='bfill'), the expected result will like this:
A B C
2013-02-26 62.0 NaN 44.0
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 74.0
2015-02-12 67.0 10.0 74.0
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
How could I do that in Pandas? Thanks.
You could resample by M (month) and transform bfill:
>>> df.resample("M").transform('bfill')
A B C
2013-02-26 62.0 NaN 44.0
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 74.0
2015-02-12 67.0 10.0 74.0
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
[720 rows x 3 columns]
For specific columns:
>>> df[['A', 'B']] = df.resample("M")[['A', 'B']].transform('bfill')
>>> df
A B C
2013-02-26 62.0 NaN NaN
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 NaN
2015-02-12 67.0 10.0 NaN
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
[720 rows x 3 columns]
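An equivalent approach (a sketch, assuming a DatetimeIndex as above) is to group by year and month and backfill within each group:
# backfill within each (year, month) group; fills never cross a month boundary
filled = df.groupby([df.index.year, df.index.month]).bfill()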

Randomly replace 10% of dataframe with NaNs?

I have a randomly generated 10*10 dataset and I need to replace 10% of dataset randomly with NaN.
import pandas as pd
import numpy as np
Dataset = pd.DataFrame(np.random.randint(0, 100, size=(10, 10)))
Try the following method. I used this when I was setting up a hackathon and needed to inject missing data for the competition.
You can use np.random.choice to create a mask of the same shape as the dataframe. Just make sure to set the probabilities p for the True and False values, where True marks the entries that will be replaced by NaNs.
Then simply apply the mask using df.mask:
import pandas as pd
import numpy as np
p = 0.1  # fraction of missing data required
df = pd.DataFrame(np.random.randint(0,100,size=(10,10)))
mask = np.random.choice([True, False], size=df.shape, p=[p,1-p])
new_df = df.mask(mask)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 50.0 87 NaN 14 78.0 44.0 19.0 94 28 28.0
1 NaN 58 3.0 75 90.0 NaN 29.0 11 47 NaN
2 91.0 30 98.0 77 3.0 72.0 74.0 42 69 75.0
3 68.0 92 90.0 90 NaN 60.0 74.0 72 58 NaN
4 39.0 51 NaN 81 67.0 43.0 33.0 37 13 40.0
5 73.0 0 59.0 77 NaN NaN 21.0 74 55 98.0
6 33.0 64 0.0 59 27.0 32.0 17.0 3 31 43.0
7 75.0 56 21.0 9 81.0 92.0 89.0 82 89 NaN
8 53.0 44 49.0 31 76.0 64.0 NaN 23 37 NaN
9 65.0 15 31.0 21 84.0 7.0 24.0 3 76 34.0
EDIT:
Updated my answer for the exact 10% of values that you are looking for. It uses itertools.product and random.sample to draw an exact set of indexes to mask, and then sets them to NaN. This should give exactly the count you expected.
from itertools import product
from random import sample
p = 0.1
n = int(df.shape[0] * df.shape[1] * p)  # count of NaNs to insert
# sample exactly n (row, column) index pairs
ids = sample(list(product(range(df.shape[0]), range(df.shape[1]))), n)
idx, idy = list(zip(*ids))
data = df.to_numpy().astype(float)  # copy the data as a float numpy array
data[idx, idy] = np.nan             # set the sampled positions to NaN
# assign to a new dataframe
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 52.0 50.0 24.0 81.0 10.0 NaN NaN 75.0 14.0 81.0
1 45.0 3.0 61.0 67.0 93.0 NaN 90.0 34.0 39.0 4.0
2 1.0 NaN NaN 71.0 57.0 88.0 8.0 9.0 62.0 20.0
3 78.0 3.0 82.0 1.0 75.0 50.0 33.0 66.0 52.0 8.0
4 11.0 46.0 58.0 23.0 NaN 64.0 47.0 27.0 NaN 21.0
5 70.0 35.0 54.0 NaN 70.0 82.0 69.0 94.0 20.0 NaN
6 54.0 84.0 16.0 76.0 77.0 50.0 82.0 31.0 NaN 31.0
7 71.0 79.0 93.0 11.0 46.0 27.0 19.0 84.0 67.0 30.0
8 91.0 85.0 63.0 1.0 91.0 79.0 80.0 14.0 75.0 1.0
9 50.0 34.0 8.0 8.0 10.0 56.0 49.0 45.0 39.0 13.0
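A numpy-only variant of the exact-count idea (a sketch under the same assumptions) samples flat positions without replacement and converts them back to row/column indexes:
rng = np.random.default_rng(0)  # seeded generator; the seed is an arbitrary choice
n = int(df.size * p)  # exact number of NaNs
flat = rng.choice(df.size, size=n, replace=False)  # flat positions, no repeats
idx, idy = np.unravel_index(flat, df.shape)  # back to (row, col) indexes
data = df.to_numpy().astype(float)
data[idx, idy] = np.nan
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)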

adding new rows to an existing dataframe

This is my dataframe. How do I add max_value, min_value, mean_value, and median_value names to rows so that my index values will be like
0
1
2
3
4
max_value
min_value
mean_value
median_value
Could anyone help me in solving this
If you want to add rows, use append with DataFrame.agg:
df1 = df.append(df.agg(['max','min','mean','median']))
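Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the drop-in replacement for the same idea:
# same result with pd.concat (works on modern pandas)
df1 = pd.concat([df, df.agg(['max', 'min', 'mean', 'median'])])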
If you want to add columns, use assign with min, max, mean and median:
df2 = df.assign(max_value=df.max(axis=1),
                min_value=df.min(axis=1),
                mean_value=df.mean(axis=1),
                median_value=df.median(axis=1))
One way is as follows (thanks to @jezrael for the help):
df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
df1=df.copy()
#column wise calc
df.loc['max']=df1.max()
df.loc['min']=df1.min()
df.loc['mean']=df1.mean()
df.loc['median']=df1.median()
#row wise calc
df['max']=df1.max(axis=1)
df['min']=df1.min(axis=1)
df['mean']=df1.mean(axis=1)
df['median']=df1.median(axis=1)
Output:
A B C D max min mean median
0 49.0 91.0 16.0 17.0 91.0 16.0 43.25 33.0
1 20.0 42.0 86.0 60.0 86.0 20.0 52.00 51.0
2 32.0 25.0 94.0 13.0 94.0 13.0 41.00 28.5
3 40.0 1.0 66.0 31.0 66.0 1.0 34.50 35.5
4 18.0 30.0 67.0 31.0 67.0 18.0 36.50 30.5
max 49.0 91.0 94.0 60.0 NaN NaN NaN NaN
min 18.0 1.0 16.0 13.0 NaN NaN NaN NaN
mean 31.8 37.8 65.8 30.4 NaN NaN NaN NaN
median 32.0 30.0 67.0 31.0 NaN NaN NaN NaN
This worked well:
df1 = df.copy()
df.loc['max']=df1.max()
df.loc['min']=df1.min()
df.loc['mean']=df1.mean()
df.loc['median']=df1.median()
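The four column-wise statistics can also be added in a loop (a compact sketch of the same code):
for stat in ['max', 'min', 'mean', 'median']:
    df.loc[stat] = getattr(df1, stat)()  # e.g. df.loc['max'] = df1.max()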

Transposing dataframe column, creating different rows per day

I have a dataframe that has one column and a timestamp index including anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values/day) fill up a row. And the next row is the next day of values and so on (so five days of forecasted data has five rows with 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occurs in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried giving a multi-index with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
Not sure why your MultiIndex code doesn't work. I'm assuming it was something along these lines, which gives the same output as the pivot:
In []
df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date','Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
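The unstack output has MultiIndex columns ('kWh', hour); selecting the 'kWh' level flattens them to match the pivot result (a small follow-on sketch):
out = df.unstack()['kWh']  # columns become the plain hour values 0..23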

Get total of Pandas column

I have a Pandas data frame, as shown below, with multiple columns and would like to get the total of column, MyColumn.
print df
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
My attempt:
I have attempted to get the sum of the column using groupby and .sum():
Total = df.groupby['MyColumn'].sum()
print Total
This causes the following error:
TypeError: 'instancemethod' object has no attribute '__getitem__'
Expected Output
I'd have expected the output to be as follows:
319
Or alternatively, I would like df to be edited with a new row entitled TOTAL containing the total:
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
TOTAL 319
You should use sum:
Total = df['MyColumn'].sum()
print(Total)
319
Then use loc with a Series; in that case, the index should be set to the name of the specific column you need to sum:
df.loc['Total'] = pd.Series(df['MyColumn'].sum(), index=['MyColumn'])
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
because if you pass a scalar, all columns of the new row are filled with it:
df.loc['Total'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
Total 319 319 319.0 319.0
Two other solutions are at and ix; see the applications below:
df.at['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
df.ix['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
Note: Since Pandas v0.20, ix has been deprecated. Use loc or iloc instead.
Another option you can go with here:
df.loc["Total", "MyColumn"] = df.MyColumn.sum()
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#Total NaN 319.0 NaN NaN
You can also use the append() method:
df.append(pd.DataFrame(df.MyColumn.sum(), index = ["Total"], columns=["MyColumn"]))
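On modern pandas (append was removed in 2.0), pd.concat gives the same result (a sketch):
pd.concat([df, pd.DataFrame(df.MyColumn.sum(), index=["Total"], columns=["MyColumn"])])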
Update:
In case you need to append sums for all numeric columns, you can do one of the following:
Use append to do this in a functional manner (doesn't change the original data frame):
# select numeric columns and calculate the sums
# (pd.np was removed in pandas 1.0; import numpy directly)
import numpy as np
sums = df.select_dtypes(np.number).sum().rename('total')
# append sums to the data frame
df.append(sums)
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 319.0 400.0 398.0
Use loc to mutate the data frame in place:
df.loc['total'] = df.select_dtypes(np.number).sum()
df
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 319.0 400.0 398.0
Similar to getting the length of a dataframe, len(df), the following worked for pandas and blaze:
Total = sum(df['MyColumn'])
or alternatively
Total = sum(df.MyColumn)
print(Total)
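One caveat worth knowing (a small illustration): the built-in sum propagates NaN, while pandas' .sum skips NaN by default:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])
print(sum(s))   # nan -- the built-in propagates NaN
print(s.sum())  # 3.0 -- Series.sum defaults to skipna=True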
There are two ways to sum a column:
dataset = pd.read_csv("data.csv")
1: sum(dataset.Column_name)
2: dataset['Column_Name'].sum()
If there is any issue with this, please correct me.
As another option, you can do something like the below. Given this data:
Group Valuation amount
0 BKB Tube 156
1 BKB Tube 143
2 BKB Tube 67
3 BAC Tube 176
4 BAC Tube 39
5 JDK Tube 75
6 JDK Tube 35
7 JDK Tube 155
8 ETH Tube 38
9 ETH Tube 56
You can use the script below for the above data:
import pandas as pd
data = pd.read_csv("daata1.csv")
bytreatment = data.groupby('Group')
bytreatment['amount'].sum()
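For comparison (assuming the same data), the groupby call returns one subtotal per group, while a plain column sum returns a single grand total:
subtotals = data.groupby('Group')['amount'].sum()  # one row per Group value
grand_total = data['amount'].sum()                 # one scalar for the whole column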
