Python: read a CSV file, remove outliers, then rebuild the CSV file

I have a CSV file "trainning_data.csv" that contains 7 columns of data, but I only read the last one.
The format of the CSV file is as below:
A B C D E F Last
1 1.5 14.2 21.5 50.1 25.5 14.2 25.2
2 ... ... ... ... ... ... ...
3 ... ... ... ... ... ... ...
...
I read the data file using pandas and then visualized it:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('trainning_data.csv')
saved_column = df['Last']
plt.plot(saved_column, 'o')
plt.show()
Then I removed the outliers as follows:
import numpy as np

Q1 = np.percentile(saved_column, 25)
Q3 = np.percentile(saved_column, 75)
bounds = [Q1 - 1.5*(Q3 - Q1), Q3 + 1.5*(Q3 - Q1)]
id_max = np.where(saved_column > bounds[1])
id_min = np.where(saved_column < bounds[0])
position = np.concatenate((id_max, id_min), axis=1)
saved_column = np.array(saved_column, dtype='double')
new_column = np.delete(saved_column, position.T)
len(new_column)
plt.plot(new_column, 'o')
plt.xlim(0, 1000)
plt.ylim(0,500)
plt.show()
After removing all the outliers, I want to rebuild the data set. I tried:
import csv

fileHeader = ["Last"]
myFile = open('Training_Data_New.csv', 'w')
writer = csv.writer(myFile)
writer.writerow(fileHeader)
writer.writerows(new_column)
but it throws an error: iterable expected, not numpy.float64
Another problem is that I also need to delete the data in the other columns at the positions of the outliers I found. How do I fix this?
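A quick note on the error itself (my addition, reusing the variables from the snippets above): csv.writer.writerows expects an iterable of rows, not bare floats, so one minimal fix would be
# give writerows one-element rows instead of bare floats
writer.writerows([[v] for v in new_column])
The pandas approaches below avoid the csv module entirely and also handle removing the matching rows from the other columns.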

You can create a DataFrame from the NumPy array and write it to a file with to_csv:
pd.DataFrame({'Last':new_column}).to_csv('Training_Data_New.csv', index=False)
Pandas solution for removing outliers:
I think you can use quantile, filter with between and boolean indexing, and finally write the DataFrame to a file with to_csv:
df = pd.DataFrame({'Last':[1,2,3,5,8,10,45,100], 'A': np.arange(8)})
print (df)
A Last
0 0 1
1 1 2
2 2 3
3 3 5
4 4 8
5 5 10
6 6 45
7 7 100
Q1 = df['Last'].quantile(.25)
Q3 = df['Last'].quantile(.75)
q1 = Q1-1.5*(Q3-Q1)
q3 = Q3+1.5*(Q3-Q1)
df1 = df[df['Last'].between(q1, q3)]
print (df1)
A Last
0 0 1
1 1 2
2 2 3
3 3 5
4 4 8
5 5 10
plt.plot(df1['Last'].values, 'o')
plt.xlim(0, 1000)
plt.ylim(0,500)
plt.show()
# if you want to write only the Last column
df1[['Last']].to_csv('Training_Data_New.csv', index=False)
# if you want to write all columns
df1.to_csv('Training_Data_New.csv', index=False)
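Tying this back to the file from the question (a sketch, my addition): because the between filter keeps whole rows, the values in the other six columns at the outlier positions are dropped at the same time:
df = pd.read_csv('trainning_data.csv')
Q1 = df['Last'].quantile(.25)
Q3 = df['Last'].quantile(.75)
df_clean = df[df['Last'].between(Q1 - 1.5*(Q3 - Q1), Q3 + 1.5*(Q3 - Q1))]
df_clean.to_csv('Training_Data_New.csv', index=False)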

You can add your new_column variable as a column to your existing DataFrame and then use to_csv() to save it.
After you get the new_column variable:
1. Drop the column Last from df:
df.drop('Last', axis=1, inplace=True)
2. Assign the new column:
df['Last'] = new_column
3. Save your df:
df.to_csv('Training_Data_New.csv', index=False)

Related

How to identify and highlight outliers in each row of a pandas dataframe

I want to do the following to my dataframe:
For each row identify outliers/anomalies
Highlight/color the identified outliers' cells (preferably 'red' color)
Count the number of identified outliers in each row (store in a column 'anomaly_count')
Export the output as an xlsx file
See below for sample data
import numpy as np
import pandas as pd

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)
df
A B C D E
0 -1.685112 -0.432143 0.876200 1.626578 1.512677
1 0.401134 0.439393 1.027222 0.036267 -0.655949
2 -0.074890 0.312793 -0.236165 0.660909 0.074468
3 0.842169 2.759467 0.223652 0.432631 -0.484871
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380
5 0.083653 0.792835 -0.643204 1.182606 -1.207692
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188
8 2.354769 1.099483 -0.653342 -0.532208 0.269307
9 0.431649 0.666982 0.361765 0.419482 0.531072
10 -0.124268 -0.170720 -0.979012 -0.410861 1.000371
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283
14 0.029966 -0.579152 0.648176 0.833141 -0.942752
15 0.824767 0.974580 0.363170 0.428062 -0.232174
The desired outcome should look something like this:
## I want to ONLY identify the outliers, NOT remove or substitute them. I only used NaN to depict the outlier values. Ideally, the outlier cells should be colored/highlighted 'red'.
## Please note: the NaN outliers in the sample are randomly assigned.
A B C D E Anomaly_Count
0 NaN -0.432143 0.876200 NaN 1.512677 2
1 0.401134 0.439393 1.027222 0.036267 -0.655949 0
2 -0.074890 0.312793 -0.236165 0.660909 0.074468 0
3 0.842169 NaN 0.223652 0.432631 -0.484871 1
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380 0
5 0.083653 0.792835 -0.643204 NaN NaN 2
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728 0
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188 0
8 2.354769 1.099483 -0.653342 -0.532208 0.269307 0
9 0.431649 0.666982 0.361765 0.419482 0.531072 0
10 -0.124268 -0.170720 -0.979012 -0.410861 NaN 1
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289 0
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504 0
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283 0
14 0.029966 -0.579152 0.648176 0.833141 -0.942752 0
15 0.824767 NaN 0.363170 0.428062 -0.232174 1
See below for my attempt; I am open to other approaches.
import numpy as np
from scipy import stats

def outlier_detection(df):
    # step I: identify the outliers in each row
    df[(np.abs(stats.zscore(df)) < 3).all(axis=0)]  # unfortunately this removes the outliers, which I don't want
    # step II: color/highlight the outlier cells
    df = df.style.highlight_null('red')
    # step III: count the number of outliers in each row
    df['Anomaly_count'] = df.isnull().sum(axis=1)
    # step IV: export as an xlsx file
    df.to_excel(r'Path to store the exported excel file\File Name.xlsx', sheet_name='Your sheet name', index=False)

outlier_detection(df)
Thanks for your time.
This works for me
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)

mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")

sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(writer,
                                                           sheet_name=sheet_name,
                                                           index=False)
Here mask is the boolean condition that is True wherever the z-score exceeds the limit. Based on this boolean mask I create a string DataFrame style_df with the value 'background-color: red' in the deviating cells. The last statement imposes the values of style_df on the style of the df DataFrame.
The resulting Excel file now looks like this:

python stacking data with missing values in the header

I have data that is imported from a CSV file. In reality there are more columns and more cycles, but this is a representative snippet:
Export date 2020-10-10
Record #3 Record #2 Record #1
Cycle #5 Cycle #4 Cycle #3
time ( min.) Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3
0 0.0390625 9.89619 0.853909 14.409 10.1961 0.859037 14.4676 10.0274 0.832598
1 0.0390625 9.53452 0.949844 14.4096 10.3034 1.224 14.4676 10.0323 1.20403
2 0.0390625 9.8956 1.47227 14.4097 10.6586 1.14486 14.4676 10.4936 1.12747
3 0.0390625 10.7829 1.44412 14.4097 10.9185 1.20247 14.5116 10.6892 1.12459
The top part of the data contains a row (export date) that is not needed in the table.
I would like to stack the data so that there will be Cycle and Record columns. The problem is that these values are found only above the first column of data for every cycle. For example, Cycle 5 has three columns of data, then Cycle 4 has three columns of data, etc.
This is how the output should look:
I didn't get very far:
df = pd.read_csv('cycles.csv')
#Fill the names of cycles to the right
df.ffill(axis = 1, inplace = True)
#Not sure this is needed, it might make it easier to melt/stack
df.iloc[0,0] = "time ( min.)"
df.iloc[1,0] = "time ( min.)"
Thank you for your thoughts and assistance!!
There are a couple of problems here, all of which you need to address:
First, read all the required info:
This cannot be done unless each piece of info is read separately:
import pandas as pd
from io import StringIO
string = open('SO.csv').read()
records = [i.split('#')[1].strip() for i in string.split('\n')[1].split(',') if '#' in i]
cycles = [i.split('#')[1].strip() for i in string.split('\n')[2].split(',') if '#' in i]
data = pd.read_csv(StringIO(string), sep=',', header=3).dropna(how = 'any')
Rename columns so they follow a pattern:
cols = [col for col in data.columns if '.' not in col]
data = data.rename(columns = dict(zip(cols ,[col+'.0' for col in cols])))
Build a loop to pluck out the data for each record and cycle:
dfs = []
for rdx, rec in enumerate(records):
    df = data[['time ( min.)'] + [col for col in data.columns if col.endswith(str(rdx))]].rename(columns=dict(zip([col + f'.{rdx}' for col in cols], cols)))
    df[['Cycle', 'Record']] = cycles[rdx], records[rdx]
    dfs.append(df)
Finally, merge them all:
pd.concat(dfs)
This results in:
time ( min.) Parameter1 Something2 Whatever3 Cycle Record
0 0.0 0.039062 9.89619 0.853909 5 3
1 1.0 0.039062 9.53452 0.949844 5 3
2 2.0 0.039062 9.89560 1.472270 5 3
3 3.0 0.039062 10.78290 1.444120 5 3
0 0.0 14.409000 10.19610 0.859037 4 2
1 1.0 14.409600 10.30340 1.224000 4 2
2 2.0 14.409700 10.65860 1.144860 4 2
3 3.0 14.409700 10.91850 1.202470 4 2
0 0.0 14.467600 10.02740 0.832598 3 1
1 1.0 14.467600 10.03230 1.204030 3 1
2 2.0 14.467600 10.49360 1.127470 3 1
3 3.0 14.511600 10.68920 1.124590 3 1
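If the repeated 0-3 row index in the concatenated result is unwanted, you can optionally reset it (a small addition, not from the original answer):
pd.concat(dfs, ignore_index=True)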
Breaking a problem down into simple steps will help you not only here but in EVERY OTHER case. Just figure out what you need to do, break it into steps, and go with it!

Python pandas: group by DataFrame columns and use them to calculate new columns in Excel sheets

My DataFrame collected from the dataset1.xlsx looks like this:
TimePoint Object 0 Object 1 Object 2 Object 3 Object 4 Object 0 Object 1 Object 2 Object 3 Object 4
0 10 4642.99 2000.71 4869.52 4023.69 3008.99 11188.15 2181.62 12493.47 10275.15 8787.99
1 20 4640.09 2005.17 4851.07 4039.73 3007.16 11129.38 2172.37 12438.31 10218.92 8723.45
Problem:
The data contains header columns with duplicate names. I need to aggregate them to find each occurrence and then initialize the IDA and IAA values for each Object. Based on these new values I need to calculate the Fc and EAPP values. So, the final Excel output should look like this:
TimePoint Objects IDA IAA Fc (using IDA- (a * IAA)) EAPP (using Fc/ (Fc + (G *Fc)))
10 Object 0 4642.99 11188.15 3300.412 0.463177397
10 Object 1 2000.71 2181.62 -527.78758 1
10 Object 2 4869.52 12493.47 4869.52 1
10 Object 3 4023.69 10275.15 4023.69 1
10 Object 4 3008.99 8787.99 3008.99 1
20 Object 0 4640.09 11129.38 4640.09 1
20 Object 1 2005.17 2172.37 2005.17 1
20 Object 2 4851.07 12438.31 4851.07 1
20 Object 3 4039.73 10218.92 4039.73 1
20 Object 4 3007.16 8723.45 3007.16 1
I tried to solve this problem using the following python script:
import glob
import pandas as pd

def main():
    all_data = pd.DataFrame()
    a = 0.12
    G = 1.159
    for f in glob.glob("data/dataset1.xlsx"):
        df = pd.read_excel(f, 'Sheet1')  # , header=[1]
        all_data = all_data.append(df, ignore_index=True, sort=False)
    all_data.columns = all_data.columns.str.split('.').str[0]
    print(all_data)
    object_df = all_data.groupby(all_data.columns, axis=1)
    print(object_df)
    for k in object_df.groups.keys():
        if k != 'TimePoint':
            for row_index, row in object_df.get_group(k).iterrows():
                print(row)
                # This logic is not working to group by Object and then apply the following formula
                # TODO: Calculation for the newly added columns. Assumption: every time there will be two
                # occurrences of any Object (i.e. Object 0...4 in this example), but the Object count can
                # vary; sometimes only one Object can appear
                # IDA is the first occurrence value of the Object
                all_data['IDA'] = row[0]  # This is NOT correct
                # IAA is the second occurrence value of the Object
                all_data['IAA'] = row[1]
    all_data['Fc'] = all_data.IDA.fillna(0) - (a * all_data.IAA.fillna(0))
    all_data['EAPP'] = all_data.Fc.fillna(0) / (all_data.Fc.fillna(0) + (G * all_data.Fc.fillna(0)))
    # now save the data frame
    writer = pd.ExcelWriter('data/dataset1.xlsx')
    all_data.to_excel(writer, 'Sheet2', index=True)
    writer.save()

if __name__ == '__main__':
    main()
Please let me know how to assign the IDA and IAA values for each Object using groupby in pandas, referring to my code above.
I think melt might help you a lot
import pandas as pd
df = pd.read_clipboard()
# This part of breaking the df into 2 might be different based on how you're reading the dataframe into memory
df1 = df[df.columns[:6]]
df2 = df[['TimePoint'] + df.columns.tolist()[6:]]
tdf1 = df1.melt(['TimePoint']).assign(key=range(10))
tdf2 = df2.melt(['TimePoint']).assign(key=range(10)).drop(['TimePoint', 'variable'], axis=1)
df = tdf1.merge(tdf2, on='key', how='left').drop(['key'], axis=1).rename(columns={'value_x': 'IDA', 'value_y': 'IAA'})
a = 0.12
G = 1.159
df['Fc'] = df['IDA'] - a * df['IAA']
df['EAPP'] = df['Fc'].div(df['Fc']+(G*df['Fc']))
TimePoint variable IDA IAA Fc EAPP
0 10 Object_0 4642.99 11188.15 3300.4120 0.463177
1 20 Object_0 4640.09 11129.38 3304.5644 0.463177
2 10 Object_1 2000.71 2181.62 1738.9156 0.463177
3 20 Object_1 2005.17 2172.37 1744.4856 0.463177
4 10 Object_2 4869.52 12493.47 3370.3036 0.463177
5 20 Object_2 4851.07 12438.31 3358.4728 0.463177
6 10 Object_3 4023.69 10275.15 2790.6720 0.463177
7 20 Object_3 4039.73 10218.92 2813.4596 0.463177
8 10 Object_4 3008.99 8787.99 1954.4312 0.463177
9 20 Object_4 3007.16 8723.45 1960.3460 0.463177
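Side note (my addition, not from the original answer): since EAPP = Fc / (Fc + G*Fc) = 1 / (1 + G) for any non-zero Fc, the EAPP column is constant, which matches the output above:
# quick sanity check of the algebra
G = 1.159
print(1 / (1 + G))  # ~0.463177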

Append string of column index to DataFrame columns

I am working on a project using Learning to Rank. Below is the example dataset format (taken from https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). The first column is the rank, the second column is the query id, and the following columns are [feature number]:[feature value].
1008 qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000 … 46:0.00000
1007 qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333 … 46:0.000000
1006 qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000 … 46:0.000000
Right now, I have successfully converted my data into the following format in a pandas DataFrame.
10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0 92.28260 ...
...
The first two columns are already fine. What I need next is to prefix the values in the remaining columns with their feature number (e.g. the first feature changes from 3500 to 1:3500).
I know I can prepend a string to a column by using the following command:
df['col'] = 'str' + df['col'].astype(str)
The first feature, 3500, is located at column index 2, so what I can think of is prepending column index - 1 for each column. How do I prepend the string based on the column number?
Any help would be appreciated.
I think you need DataFrame.radd to add the column names from the right side and iloc to select from the third column to the end:
print (df)
0 1 2 3 4 5 6 7 8 \
0 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
1 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
9
0 92.2826
1 92.2826
df.iloc[:, 2:] = df.iloc[:, 2:].astype(str).radd(':').radd((df.columns[2:] - 1).astype(str))
print (df)
0 1 2 3 4 5 6 7 \
0 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
1 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
8 9
0 7:1840.0 8:92.2826
1 7:1840.0 8:92.2826
You can simply concatenate the columns:
df['new_col'] = df[df.columns[3]].astype(str) + ':' + df[df.columns[2]].astype(str)
This will output a new column in your df named new_col. Now you can delete the unnecessary columns.
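If you want to apply the same kind of concatenation to every feature column at once, a small sketch (my addition, assuming the feature columns start at positional index 2 and that the prefix is the 1-based feature number):
# prefix each feature column's values with its feature number
for i, col in enumerate(df.columns[2:], start=1):
    df[col] = f'{i}:' + df[col].astype(str)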
You can convert the string to a dictionary and then read it again as a pandas DataFrame.
import pandas as pd
import ast

df = pd.DataFrame({'rank': [1008, 1007, 1006],
                   'column': ['qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000',
                              'qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333',
                              'qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000']})

def putquotes(x):
    x1 = x.split(":")
    return "'" + x1[0] + "':" + x1[1]

def putcommas(x):
    x1 = x.split()
    return "{" + ",".join([putquotes(t) for t in x1]) + "}"

df1 = [ast.literal_eval(putcommas(x)) for x in df['column'].tolist()]
df = pd.concat([df, pd.DataFrame(df1)], axis=1)

Pandas dataframe total row

I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series which, although I can convert it back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated. You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
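For reference, a minimal sketch of the concat-based alternative (my addition; it leaves the original df untouched):
totals = df.sum(numeric_only=True).rename('Total').to_frame().T
df_with_total = pd.concat([df, totals])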
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution, though, so I'd recommend sticking to operations on the dataframe, e.g.
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
               margins=True,
               margins_name='total',  # defaults to 'All'
               aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer
df.append(df.sum(numeric_only=True), ignore_index=True)
If you want to continue using your current index, you can name the sum Series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on the answer from Matthias Kauer:
To add a row total:
df.loc["Row_Total"] = df.sum()
To add a column total:
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that is visible in Jupyter like this:
Styling
with a little longer code, you can even make the last row look different:
df.style.concat(
    df.agg(['sum']).style
      .set_properties(**{'background-color': 'yellow'})
)
to get:
See other ways to style (such as bold font or table lines) in the docs.
The following helped me add a column total and a row total to a dataframe.
Assume dft1 is your original dataframe... now add a column total and a row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
Actually, all the proposed solutions render the original DataFrame unusable for any further analysis and can invalidate subsequent computations, which is easy to overlook and could lead to false results.
This is because you add a row to the data which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields
   0
0  1
1  5
2  6
3  8
4  9

            0
count       5
mean      5.8
std   3.11448
min         1
25%         5
50%         6
75%         8
max         9
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
        0
0       1
1       5
2       6
3       8
4       9
Totals  29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:
             0
count        6
mean   9.66667
std    9.87252
min          1
25%       5.25
50%          7
75%       8.75
max         29
So: watch out! Apply this only after doing all other analyses of the data, or work on a copy of the DataFrame!
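A minimal sketch of the copy-based approach (my addition):
# total a copy for display; keep the original frame intact for analysis
display_df = df.copy()
display_df.loc['Totals'] = df.sum(numeric_only=True, axis=0)
print(display_df)      # the Totals row exists only in the copy
print(df.describe())   # statistics of the original data stay correct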
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end so as to avoid breaking the integrity of the dataframe (right before printing), I created a summary_rows_cols method which returns a printable dataframe:
import pandas as pd

def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False
                      ) -> pd.DataFrame:
    ret = df.copy()
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)
