I have a randomly generated 10x10 dataset and I need to replace 10% of the dataset with NaN at random.
import pandas as pd
import numpy as np
Dataset = pd.DataFrame(np.random.randint(0, 100, size=(10, 10)))
Try the following method. I used it when I was setting up a hackathon and needed to inject missing data for the competition.
You can use np.random.choice to create a boolean mask with the same shape as the dataframe. Just set the choice probabilities p for True and False, where True marks the values that will be replaced by NaN.
Then simply apply the mask using df.mask:
import pandas as pd
import numpy as np
p = 0.1  # fraction of values to replace with NaN
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 10)))
mask = np.random.choice([True, False], size=df.shape, p=[p, 1 - p])
new_df = df.mask(mask)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 50.0 87 NaN 14 78.0 44.0 19.0 94 28 28.0
1 NaN 58 3.0 75 90.0 NaN 29.0 11 47 NaN
2 91.0 30 98.0 77 3.0 72.0 74.0 42 69 75.0
3 68.0 92 90.0 90 NaN 60.0 74.0 72 58 NaN
4 39.0 51 NaN 81 67.0 43.0 33.0 37 13 40.0
5 73.0 0 59.0 77 NaN NaN 21.0 74 55 98.0
6 33.0 64 0.0 59 27.0 32.0 17.0 3 31 43.0
7 75.0 56 21.0 9 81.0 92.0 89.0 82 89 NaN
8 53.0 44 49.0 31 76.0 64.0 NaN 23 37 NaN
9 65.0 15 31.0 21 84.0 7.0 24.0 3 76 34.0
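Note that np.random.choice gives approximately 10% NaNs rather than an exact count. An equivalent one-liner sketch of the same idea, drawing uniform random numbers instead of sampling True/False:
new_df = df.mask(np.random.random(df.shape) < p)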
EDIT:
Updated my answer to produce the exact 10% of values that you are looking for. It uses itertools.product and random.sample to get a set of indexes to mask, then sets them to NaN. The count should be exact, as you expected.
from itertools import product
from random import sample
p = 0.1
n = int(df.shape[0] * df.shape[1] * p)  # number of NaNs to insert
# Sample exactly n (row, col) index pairs
ids = sample(list(product(range(df.shape[0]), range(df.shape[1]))), n)
idx, idy = list(zip(*ids))
data = df.to_numpy().astype(float)  # extract the data as a float NumPy array (a copy, not a view)
data[idx, idy] = np.nan  # set the sampled positions to NaN
# Assign to a new dataframe
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 52.0 50.0 24.0 81.0 10.0 NaN NaN 75.0 14.0 81.0
1 45.0 3.0 61.0 67.0 93.0 NaN 90.0 34.0 39.0 4.0
2 1.0 NaN NaN 71.0 57.0 88.0 8.0 9.0 62.0 20.0
3 78.0 3.0 82.0 1.0 75.0 50.0 33.0 66.0 52.0 8.0
4 11.0 46.0 58.0 23.0 NaN 64.0 47.0 27.0 NaN 21.0
5 70.0 35.0 54.0 NaN 70.0 82.0 69.0 94.0 20.0 NaN
6 54.0 84.0 16.0 76.0 77.0 50.0 82.0 31.0 NaN 31.0
7 71.0 79.0 93.0 11.0 46.0 27.0 19.0 84.0 67.0 30.0
8 91.0 85.0 63.0 1.0 91.0 79.0 80.0 14.0 75.0 1.0
9 50.0 34.0 8.0 8.0 10.0 56.0 49.0 45.0 39.0 13.0
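A NumPy-only sketch of the same exact-count idea (my variation, not part of the original answer): sample n flat positions without replacement, then map them back to (row, col) pairs with np.unravel_index.
flat = np.random.choice(df.size, size=n, replace=False)
idx, idy = np.unravel_index(flat, df.shape)
data = df.to_numpy().astype(float)
data[idx, idy] = np.nan
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)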
I am trying to replicate COUNTIFS in Excel to get a rank between two unique values that are listed in my dataframe. I have attached the expected output, calculated in Excel using the COUNTIF and LET/RANK functions.
I am trying to generate the "Avg Rank of Gas & Coal plants" column: it takes the number from the "Average Rank" column and then ranks the two unique types from Technology (CCGT or COAL) into two new ranks (Gas or Coal), so that I can then get the relevant quantiles. In case you are wondering why I would need this when there are only two coal plants: when I run this model on a larger dataset it will be useful to know how to do it in code rather than manually.
Ideally the output will return two ranks, 1-47 for all units with Technology == CCGT and 1-2 for all units with Technology == COAL.
This is the column I am looking to make ("Avg Rank of Gas & Coal plants"). Daily ranks cover 03/01/2022 through 08/01/2022, listed in date order; some units are missing days:
Unit ID   Technology  Daily Ranks (03/01-08/01/2022)  Average Rank  Unit Rank  Avg Rank of Gas & Coal  Gas Quintiles  Coal Quintiles  Quintiles
FAWN-1    CCGT        1.0 5.0 1.0 5.0 2.0 1.0         2.5           1          1                       1              0               Gas_1
GRAI-6    CCGT        4.0 18.0 2.0 4.0 3.0 3.0        5.7           2          2                       1              0               Gas_1
EECL-1    CCGT        5.0 29.0 4.0 1.0 1.0 2.0        7.0           3          3                       1              0               Gas_1
PEMB-21   CCGT        7.0 1.0 6.0 13.0 8.0 8.0        7.2           4          4                       1              0               Gas_1
PEMB-51   CCGT        3.0 3.0 3.0 11.0 16.0           7.2           5          5                       1              0               Gas_1
PEMB-41   CCGT        9.0 4.0 7.0 7.0 10.0 13.0       8.3           6          6                       1              0               Gas_1
WBURB-1   CCGT        6.0 9.0 22.0 2.0 7.0 5.0        8.5           7          7                       1              0               Gas_1
PEMB-31   CCGT        14.0 6.0 13.0 6.0 4.0 9.0       8.7           8          8                       1              0               Gas_1
GRMO-1    CCGT        2.0 7.0 10.0 24.0 11.0 6.0      10.0          9          9                       1              0               Gas_1
PEMB-11   CCGT        21.0 2.0 9.0 10.0 9.0 14.0      10.8          10         10                      2              0               Gas_2
STAY-1    CCGT        19.0 12.0 5.0 23.0 6.0 7.0      12.0          11         11                      2              0               Gas_2
GRAI-7    CCGT        10.0 27.0 15.0 9.0 15.0 11.0    14.5          12         12                      2              0               Gas_2
DIDCB6    CCGT        28.0 11.0 11.0 8.0 19.0 15.0    15.3          13         13                      2              0               Gas_2
SCCL-3    CCGT        17.0 16.0 31.0 3.0 18.0 10.0    15.8          14         14                      2              0               Gas_2
STAY-4    CCGT        12.0 8.0 20.0 18.0 14.0 23.0    15.8          14         14                      2              0               Gas_2
CDCL-1    CCGT        13.0 22.0 8.0 25.0 12.0 16.0    16.0          16         16                      2              0               Gas_2
STAY-3    CCGT        8.0 17.0 17.0 20.0 13.0 22.0    16.2          17         17                      2              0               Gas_2
MRWD-1    CCGT        19.0 26.0 5.0 19.0              17.3          18         18                      2              0               Gas_2
WBURB-3   CCGT        24.0 14.0 17.0 17.0             18.0          19         19                      3              0               Gas_3
WBURB-2   CCGT        14.0 21.0 12.0 31.0 18.0        19.2          20         20                      3              0               Gas_3
GYAR-1    CCGT        26.0 14.0 17.0 20.0 21.0        19.6          21         21                      3              0               Gas_3
STAY-2    CCGT        18.0 20.0 18.0 21.0 24.0 20.0   20.2          22         22                      3              0               Gas_3
KLYN-A-1  CCGT        24.0 12.0 19.0 27.0             20.5          23         23                      3              0               Gas_3
SHOS-1    CCGT        16.0 15.0 28.0 15.0 29.0 27.0   21.7          24         24                      3              0               Gas_3
DIDCB5    CCGT        10.0 35.0 22.0                  22.3          25         25                      3              0               Gas_3
CARR-1    CCGT        33.0 26.0 27.0 22.0 4.0         22.4          26         26                      3              0               Gas_3
LAGA-1    CCGT        15.0 13.0 29.0 32.0 23.0 24.0   22.7          27         27                      3              0               Gas_3
CARR-2    CCGT        24.0 25.0 27.0 29.0 21.0 12.0   23.0          28         28                      3              0               Gas_3
GRAI-8    CCGT        11.0 28.0 36.0 16.0 26.0 25.0   23.7          29         29                      4              0               Gas_4
SCCL-2    CCGT        29.0 16.0 28.0 25.0             24.5          30         30                      4              0               Gas_4
LBAR-1    CCGT        19.0 25.0 31.0 28.0             25.8          31         31                      4              0               Gas_4
CNQPS-2   CCGT        20.0 32.0 32.0 26.0             27.5          32         32                      4              0               Gas_4
SPLN-1    CCGT        23.0 30.0 30.0                  27.7          33         33                      4              0               Gas_4
DAMC-1    CCGT        23.0 21.0 38.0 34.0             29.0          34         34                      4              0               Gas_4
KEAD-2    CCGT        30.0                            30.0          35         35                      4              0               Gas_4
SHBA-1    CCGT        26.0 23.0 35.0 37.0             30.3          36         36                      4              0               Gas_4
HUMR-1    CCGT        22.0 30.0 37.0 37.0 33.0 28.0   31.2          37         37                      4              0               Gas_4
CNQPS-4   CCGT        27.0 33.0 35.0 30.0             31.3          38         38                      5              0               Gas_5
CNQPS-1   CCGT        25.0 40.0 33.0                  32.7          39         39                      5              0               Gas_5
SEAB-1    CCGT        32.0 34.0 36.0 29.0             32.8          40         40                      5              0               Gas_5
PETEM1    CCGT        35.0                            35.0          41         41                      5              0               Gas_5
ROCK-1    CCGT        31.0 34.0 38.0 38.0             35.3          42         42                      5              0               Gas_5
SEAB-2    CCGT        31.0 39.0 39.0 34.0             35.8          43         43                      5              0               Gas_5
WBURB-43  COAL        32.0 37.0 40.0 39.0 31.0        35.8          44         1                       0              1               Coal_1
FDUNT-1   CCGT        36.0                            36.0          45         44                      5              0               Gas_5
COSO-1    CCGT        30.0 42.0 36.0                  36.0          45         44                      5              0               Gas_5
WBURB-41  COAL        33.0 38.0 41.0 40.0 32.0        36.8          47         2                       0              1               Coal_1
FELL-1    CCGT        34.0 39.0 43.0 41.0 33.0        38.0          48         46                      5              0               Gas_5
KEAD-1    CCGT        43.0                            43.0          49         47                      5              0               Gas_5
I have tried to do it the same way I got Average Rank (which is a rank of the average of the inputs in the dataframe), but it doesn't seem to work with additional conditions.
Thank you!!
import pandas as pd
df = pd.read_csv("gas.csv")
display(df['Technology'].value_counts())
print('------')
display(df['Technology'].value_counts().iloc[0])  # count of CCGT (positional access via .iloc)
display(df['Technology'].value_counts().iloc[1])  # count of COAL
Output:
CCGT 47
COAL 2
Name: Technology, dtype: int64
------
47
2
By the way: pd.cut or pd.qcut can be used to compute quantile bins, so you don't have to define manually what a quantile is.
Refer to the documentation and other resources:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
https://www.geeksforgeeks.org/how-to-use-pandas-cut-and-qcut/
There are many methods you can pass to rank. Refer to the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
df['rank'] = df.groupby("Technology")["Average Rank"].rank(method="dense", ascending=True)
df
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
How to rank the group of records that have the same value (i.e. ties):
average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in the order they appear in the array
dense: like 'min', but rank always increases by 1 between groups
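To turn the grouped rank into the Gas_/Coal_ quintile labels, here is a minimal sketch using pd.qcut per technology group. The helper name, the bin-count guard for the two-plant COAL group, and the CCGT-to-Gas / COAL-to-Coal mapping are my assumptions, not from the original post:
import pandas as pd

def quintile_labels(s):
    # Use at most 5 bins, and never more bins than distinct values,
    # so the tiny COAL group (two plants) doesn't make qcut fail.
    q = min(5, s.nunique())
    return pd.qcut(s, q=q, labels=list(range(1, q + 1)))

df['Group Rank'] = df.groupby('Technology')['Average Rank'].rank(method='dense')
tech_bin = df.groupby('Technology', group_keys=False)['Group Rank'].apply(quintile_labels)
prefix = df['Technology'].map({'CCGT': 'Gas', 'COAL': 'Coal'})
df['Quintiles'] = prefix + '_' + tech_bin.astype(str)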
I have a Pandas data frame, as shown below, with multiple columns, and would like to get the total of the column MyColumn.
print df
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
My attempt:
I have attempted to get the sum of the column using groupby and .sum():
Total = df.groupby['MyColumn'].sum()
print Total
This causes the following error:
TypeError: 'instancemethod' object has no attribute '__getitem__'
Expected Output
I'd have expected the output to be as follows:
319
Or alternatively, I would like df to be edited with a new row entitled TOTAL containing the total:
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
TOTAL 319
You should use sum():
Total = df['MyColumn'].sum()
print(Total)
319
Then, to add a Total row, use loc with a Series; the index of the Series must match the name of the column you want to fill:
df.loc['Total'] = pd.Series(df['MyColumn'].sum(), index=['MyColumn'])
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
because if you pass a scalar, every column of the new row will be filled with it:
df.loc['Total'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
Total 319 319 319.0 319.0
Two other solutions use at and ix; see their application below:
df.at['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
df.ix['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
Note: ix has been deprecated since pandas v0.20 and was removed in v1.0. Use loc or iloc instead.
Another option you can go with here:
df.loc["Total", "MyColumn"] = df.MyColumn.sum()
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#Total NaN 319.0 NaN NaN
You can also use the append() method:
df.append(pd.DataFrame(df.MyColumn.sum(), index = ["Total"], columns=["MyColumn"]))
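Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a minimal pd.concat equivalent of the call above:
total_row = pd.DataFrame(df.MyColumn.sum(), index=["Total"], columns=["MyColumn"])
pd.concat([df, total_row])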
Update:
In case you need to append the sum for all numeric columns, you can do one of the following:
Use append to do this in a functional manner (doesn't change the original data frame):
import numpy as np  # pd.np was removed in newer pandas; use numpy directly

# select numeric columns and calculate the sums
sums = df.select_dtypes(np.number).sum().rename('total')
# append sums to the data frame
df.append(sums)
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 319.0 400.0 398.0
Use loc to mutate data frame in place:
df.loc['total'] = df.select_dtypes(np.number).sum()
df
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 319.0 400.0 398.0
Similar to getting the length of a dataframe, len(df), the following worked for pandas and blaze:
Total = sum(df['MyColumn'])
or alternatively
Total = sum(df.MyColumn)
print(Total)
There are two ways to sum a column:
dataset = pd.read_csv("data.csv")
1: sum(dataset.Column_name)
2: dataset['Column_Name'].sum()
If there is any issue with this, please correct me.
As another option, you can do something like the below.
Group Valuation amount
0 BKB Tube 156
1 BKB Tube 143
2 BKB Tube 67
3 BAC Tube 176
4 BAC Tube 39
5 JDK Tube 75
6 JDK Tube 35
7 JDK Tube 155
8 ETH Tube 38
9 ETH Tube 56
You can use the script below for the data above:
import pandas as pd
data = pd.read_csv("daata1.csv")
bytreatment = data.groupby('Group')
bytreatment['amount'].sum()
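If instead you want each row to keep its group's total alongside the original values, a short transform sketch (my addition, assuming the same Group/amount columns):
data['group_total'] = data.groupby('Group')['amount'].transform('sum')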
I am trying to sort out a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying this, although I feel like it's a mishmash of a few commands:
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
but I want to restrict the ffill to rows where one particular column (A) is NaN, i.e.
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would only like to fill a row if and only if the value of A is NaN, while leaving (C, 0) and (D, 0) as NaN, giving the dataframe below:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So just to clarify: the ONLY rows that get replaced with ffill are 3 and 8, and the reason is that the value of column A in rows 3 and 8 is NaN.
Thanks
---Update---
When I'm debugging and evaluate the expression df.loc[df['A'].isna(), :], I get
I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume what's happening here is: I then attempt ffill on this new dataframe containing only rows 3 and 8, and obviously I can't ffill NaNs with NaNs.
Change values only in those rows that start with NaN:
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null. Then take those same rows from the forward-filled dataframe:
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
You just want to forward fill (DataFrame.ffill) where (DataFrame.where) df['A'] is NaN, and leave the rest as before (df):
df=df.ffill().where(df['A'].isna(),df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
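A slightly shorter equivalent sketch (my addition): with a .loc assignment the right-hand side aligns on the index, so re-selecting the masked rows on the right is optional:
df.loc[df['A'].isna()] = df.ffill()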
I have a dataframe that has one column and a timestamp index covering anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values per day) fills up a row, with the next row holding the next day of values, and so on (so five days of forecasted data gives five rows with 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occur in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried creating a MultiIndex with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
Not sure why your MultiIndex code doesn't work. I'm assuming it is something along these lines, which gives the same output as the pivot:
In []:
df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date','Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
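If the series is guaranteed to cover complete days with no gaps, a reshape-based sketch (my assumption; note the example frame above ends mid-day, so it would need trimming first) avoids the pivot entirely:
vals = df["kWh"].to_numpy().reshape(-1, 24)  # requires len(df) to be a multiple of 24
wide = pd.DataFrame(vals, index=pd.unique(df.index.date), columns=range(24))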