Stacking data with missing values in the header

I have data that is imported from a csv file, in reality there are more columns and more cycles, but this is a representative snippet:
Export date 2020-10-10
Record #3 Record #2 Record #1
Cycle #5 Cycle #4 Cycle #3
time ( min.) Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3
0 0.0390625 9.89619 0.853909 14.409 10.1961 0.859037 14.4676 10.0274 0.832598
1 0.0390625 9.53452 0.949844 14.4096 10.3034 1.224 14.4676 10.0323 1.20403
2 0.0390625 9.8956 1.47227 14.4097 10.6586 1.14486 14.4676 10.4936 1.12747
3 0.0390625 10.7829 1.44412 14.4097 10.9185 1.20247 14.5116 10.6892 1.12459
The top part of the data contains a row (export date) that is not needed in the table.
I would like to stack the data so that there will be Cycle and Record columns. The problem is that these values are found only above the first column of data for every cycle. For example, Cycle5 has three columns of data, then Cycle4 has three columns of data etc.
This is what the output should look like: the same measurements stacked vertically, with Cycle and Record as ordinary columns (as in the result table at the end of the answer below).
I didn't get very far:
df = pd.read_csv('cycles.csv')
#Fill the names of cycles to the right
df.ffill(axis = 1, inplace = True)
#Not sure this is needed, it might make it easier to melt/stack
df.iloc[0,0] = "time ( min.)"
df.iloc[1,0] = "time ( min.)"
Thank you for your thoughts and assistance!!

There are a couple of problems here, and you need to address all of them:
First, read all the required info:
This cannot be done unless each piece of header info is read separately:
import pandas as pd
from io import StringIO
string = open('SO.csv').read()
records = [i.split('#')[1].strip() for i in string.split('\n')[1].split(',') if '#' in i]
cycles = [i.split('#')[1].strip() for i in string.split('\n')[2].split(',') if '#' in i]
data = pd.read_csv(StringIO(string), sep=',', header=3).dropna(how = 'any')
Rename the columns so they follow a pattern (pandas deduplicates repeated header names by appending .1, .2, and so on, so renaming the first occurrences to end in .0 gives every record's columns a consistent numeric suffix):
cols = [col for col in data.columns if '.' not in col]
data = data.rename(columns = dict(zip(cols ,[col+'.0' for col in cols])))
Build a loop to pull out the columns for each record and cycle:
dfs = []
for rdx, rec in enumerate(records):
    # time column plus the data columns whose suffix matches this record
    df = data[['time ( min.)'] + [col for col in data.columns if col.endswith(str(rdx))]]
    df = df.rename(columns=dict(zip([f'{col}.{rdx}' for col in cols], cols)))
    df[['Cycle', 'Record']] = cycles[rdx], records[rdx]
    dfs.append(df)
Finally, merge them all:
pd.concat(dfs)
This results in:
time ( min.) Parameter1 Something2 Whatever3 Cycle Record
0 0.0 0.039062 9.89619 0.853909 5 3
1 1.0 0.039062 9.53452 0.949844 5 3
2 2.0 0.039062 9.89560 1.472270 5 3
3 3.0 0.039062 10.78290 1.444120 5 3
0 0.0 14.409000 10.19610 0.859037 4 2
1 1.0 14.409600 10.30340 1.224000 4 2
2 2.0 14.409700 10.65860 1.144860 4 2
3 3.0 14.409700 10.91850 1.202470 4 2
0 0.0 14.467600 10.02740 0.832598 3 1
1 1.0 14.467600 10.03230 1.204030 3 1
2 2.0 14.467600 10.49360 1.127470 3 1
3 3.0 14.511600 10.68920 1.124590 3 1
Breaking a problem down into simple steps will not only help you with this one but with EVERY OTHER case. Just figure out what you need to do, break it into steps and go with it!


How to avoid using a loop in a df when accessing previous lines

I use pandas to process transport data. I study the attendance of bus lines. I have two columns counting the people getting on and off the bus at each stop, and I want to create one which counts the people currently on board. At the moment I use a loop through the df; for line n it does current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'current'] = df.loc[index, 'on']
    else:
        df.loc[index, 'current'] = df.loc[index, 'on'] - df.loc[index, 'off'] + df.loc[index-1, 'current']
Is there a way to avoid using a loop?
Thanks for your time!
You can use Series.cumsum(), which accumulates the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
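One caveat: the loop in the question uses only 'on' for the first row (current[0] = on[0]), while the cumulative sums also subtract off[0]. If the original semantics matter, here is a minimal sketch that reproduces them, assuming the same column names:
import pandas as pd

a = pd.DataFrame([[3, 4], [6, 4], [1, 2], [4, 5]], columns=["off", "on"])

# reproduce current[0] = on[0]: ignore 'off' at the first stop
off_adj = a["off"].copy()
off_adj.iloc[0] = 0
a["current"] = (a["on"] - off_adj).cumsum()
print(a)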
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6

Python: read a CSV file, remove outliers, then rebuild the CSV file

I have a CSV file, "trainning_data.csv", that contains 7 columns of data, but I only read the last one.
The format of the CSV file is as below:
A B C D E F Last
1 1.5 14.2 21.5 50.1 25.5 14.2 25.2
2 ... ... ... ... ... ... ...
...
I read the data file using pandas and then visualized it:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('trainning_data.csv')
saved_column = df['Last']
plt.plot(saved_column, 'o')
plt.show()
Then I removed the outliers:
import numpy as np

Q1 = np.percentile(saved_column, 25)
Q3 = np.percentile(saved_column, 75)
bounds = [Q1 - 1.5*(Q3 - Q1), Q3 + 1.5*(Q3 - Q1)]  # 1.5*IQR fences
id_max = np.where(saved_column > bounds[1])
id_min = np.where(saved_column < bounds[0])
position = np.concatenate((id_max, id_min), axis=1)
saved_column = np.array(saved_column, dtype='double')
new_column = np.delete(saved_column, position.T)
len(new_column)
plt.plot(new_column, 'o')
plt.xlim(0, 1000)
plt.ylim(0,500)
plt.show()
After removing all the outliers, I want to rebuild the data set. I tried:
import csv

fileHeader = ["Last"]
myFile = open('Training_Data_New.csv', 'w')
writer = csv.writer(myFile)
writer.writerow(fileHeader)
writer.writerows(new_column)
but it throws an error: iterable expected, not numpy.float64.
Another problem: I also need to delete the data in the other columns at the positions of the outliers I found. How do I fix this?
You can create a DataFrame from a numpy array and write it to a file with to_csv:
pd.DataFrame({'Last':new_column}).to_csv('Training_Data_New.csv', index=False)
Pandas solution for removing outliers:
I think you can use quantile and filter with between and boolean indexing; last, to write the DataFrame to a file, use to_csv:
df = pd.DataFrame({'Last':[1,2,3,5,8,10,45,100], 'A': np.arange(8)})
print (df)
A Last
0 0 1
1 1 2
2 2 3
3 3 5
4 4 8
5 5 10
6 6 45
7 7 100
Q1 = df['Last'].quantile(.25)
Q3 = df['Last'].quantile(.75)
q1 = Q1-1.5*(Q3-Q1)
q3 = Q3+1.5*(Q3-Q1)
df1 = df[df['Last'].between(q1, q3)]
print (df1)
A Last
0 0 1
1 1 2
2 2 3
3 3 5
4 4 8
5 5 10
plt.plot(df1['Last'].values, 'o')
plt.xlim(0, 1000)
plt.ylim(0,500)
plt.show()
#if want write only Last column
df1[['Last']].to_csv('Training_Data_New.csv', index=False)
#if you want write all columns
df1.to_csv('Training_Data_New.csv', index=False)
You can add your new_column variable as a column of your existing dataframe and then use to_csv() to save. After you get the new_column variable:
1. Drop the column 'Last' from df:
df.drop('Last', axis=1, inplace=True)
2. Assign the new column:
df['Last'] = new_column
3. Save your df:
df.to_csv('Training_Data_New.csv', index=False)
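Note that step 2 only lines up if new_column still has one value per row of df. If rows were actually removed, you can instead drop the same rows from the whole dataframe; a sketch reusing the df and position variables from the question:
# drop the outlier rows from every column, not just 'Last'
df_clean = df.drop(df.index[position.flatten()])
df_clean.to_csv('Training_Data_New.csv', index=False)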

Iterate over rows of Pandas dataframe

I have df like:
CELLID lon lat METER LATITUDE_SM LONGITUDE_SM Path_ID
2557709 5.286339 51.353820 E0047000004028217 51.3501 5.3125 2557709_E0047000004028217
For each Path_ID (str) I would like to iterate over the rows and produce a df1 like:
Path_ID METER LATITUDE_SM LONGITUDE_SM
2557709_E0047000004028217 E0047000004028217 51.3501 5.3125
Path_ID CELLID Lon lat
2557709_E0047000004028217 2557709 5.286339 51.353820
I have many rows in the df.
I am doing something like:
for index, row in df.iterrows():
    print(row['Path_ID'], row['METER'], row['LATITUDE_SM'], row['LONGITUDE_SM'])
It is very hard to understand your goal, but IIUC, you want to group by Path_ID and print each group:
grouped_df = df.groupby("Path_ID")[["Path_ID", "METER", "LATITUDE_SM", "LONGITUDE_SM"]]
for key, val in grouped_df:
    print(grouped_df.get_group(key), "\n")
Output
Path_ID METER LATITUDE_SM LONGITUDE_SM
0 2557709_E0047000004028217 E0047000004028217 51.35 5.3125
It is unclear why you want this behaviour, but you can achieve this with pd.DataFrame.iloc.
If you only need specific columns, replace : with a list of column positions (see the sketch after the output below).
import pandas as pd, numpy as np
df = pd.DataFrame(np.random.random((5, 5)))
for i in range(len(df.index)):
    print(df.iloc[[i], :])
# 0 1 2 3 4
# 0 0.587349 0.947435 0.974285 0.498303 0.135898
# 0 1 2 3 4
# 1 0.292748 0.880276 0.522478 0.081902 0.187494
# 0 1 2 3 4
# 2 0.692022 0.908397 0.200202 0.099722 0.348589
# 0 1 2 3 4
# 3 0.041564 0.980425 0.899634 0.725757 0.569983
# 0 1 2 3 4
# 4 0.787038 0.000077 0.213646 0.444095 0.022923
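For example, to keep only the first and third columns while iterating (a sketch; the chosen positions are illustrative):
# select columns by position instead of ':' -- here columns 0 and 2
for i in range(len(df.index)):
    print(df.iloc[[i], [0, 2]])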

How to speed up the code - searching through a dataframe takes hours

I've got a CSV file containing the distance between centroids in a GIS-model in the next format:
InputID,TargetID,Distance
1,2,3050.01327866
1,7,3334.99565217
1,5,3390.99115304
1,3,3613.77046864
1,4,4182.29900892
...
...
3330,3322,955927.582933
It is sorted on origin (InputID) and then on the nearest destination (TargetID).
For a specific modelling tool I need this data in a CSV file, formatted as follows (the numbers are the centroid numbers):
distance1->1, distance1->2, distance1->3,.....distance1->3330
distance2->1, distance2->2,.....
.....
distance3330->1,distance3330->2....distance3330->3330
So no InputID's or TargetID's, just the distances with the origins on the rows and the destinations on the columns:
(example for the first 5 origins/destinations)
0,3050.01327866,3613.77046864,4182.29900892,3390.99115304
3050.01327866,0,1326.94611797,1175.10254872,1814.45584129
3613.77046864,1326.94611797,0,1832.209595,3132.78725738
4182.29900892,1175.10254872,1832.209595,0,1935.55056767
3390.99115304,1814.45584129,3132.78725738,1935.55056767,0
I've built the following code, and it works. But it is so slow that running it would take days to produce the 3330x3330 file. As I am a beginner in Python, I think I am overlooking something...
import pandas as pd
import numpy as np

df = pd.read_csv('c:\\users\\Niels\\Dropbox\\Python\\centroid_distances.csv')
df = df.sort_values(by=['InputID', 'TargetID'], ascending=[True, True])
number_of_zones = 3330
text_file = open("c:\\users\\Niels\\Dropbox\\Python\\Output.csv", "w")
for origin in range(1, number_of_zones + 1):   # inclusive of the last zone
    output_string = ''
    print(origin)
    for destination in range(1, number_of_zones + 1):
        if origin == destination:
            distance = 0
        else:
            # I guess this is the time-consuming part
            distance_row = df[(df['InputID'] == origin) & (df['TargetID'] == destination)]
            distance = distance_row.iloc[0]['Distance']
        output_string = output_string + str(distance) + ','
    text_file.write(output_string[:-1] + '\n')  # strip last ',' of line
text_file.close()
Could you give me some hints to speed up this code?
IIUC, all you need is pivot. If you start from a frame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns="InputID,TargetID,Distance".split(","))
df["InputID"] = np.arange(36) // 6 + 1
df["TargetID"] = np.arange(36) % 6 + 1
df["Distance"] = np.random.uniform(0, 100, len(df))
df = df[df.InputID != df.TargetID]
df = df.sort_values(["InputID", "Distance"])
>>> df.head()
InputID TargetID Distance
2 1 3 6.407198
3 1 4 43.037829
1 1 2 52.121284
4 1 5 86.769620
5 1 6 96.703294
and we know the InputID and TargetID are unique, we can simply pivot:
>>> pv = df.pivot(index="InputID", columns="TargetID", values="Distance").fillna(0)
>>> pv
TargetID 1 2 3 4 5 6
InputID
1 0.000000 52.121284 6.407198 43.037829 86.769620 96.703294
2 53.741611 0.000000 27.555296 85.328607 59.561345 8.895407
3 96.142920 62.532984 0.000000 6.320273 37.809105 69.896308
4 57.835249 49.350647 38.660269 0.000000 7.151053 45.017780
5 72.758342 48.947788 4.212775 98.183169 0.000000 15.702280
6 32.468329 83.979431 23.578347 30.212883 82.580496 0.000000
>>> pv.to_csv("out_dist.csv", index=False, header=False)
>>> !cat out_dist.csv
0.0,52.1212839519,6.40719759732,43.0378290605,86.769620064,96.7032941473
53.7416111725,0.0,27.5552964592,85.3286070586,59.5613449796,8.89540736892
96.1429198049,62.5329836475,0.0,6.32027280686,37.8091052942,69.8963084944
57.8352492462,49.3506467609,38.6602692461,0.0,7.15105257546,45.0177800391
72.7583417281,48.9477878574,4.21277494476,98.183168992,0.0,15.7022798801
32.4683285321,83.9794307564,23.578346756,30.2128827937,82.5804959193,0.0
The reshaping section of the tutorial might be useful.
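Applied to the file from the question, the whole job then collapses to a few lines (a sketch, assuming the same paths and column names as the question's code):
import pandas as pd

df = pd.read_csv('c:\\users\\Niels\\Dropbox\\Python\\centroid_distances.csv')

# origins as rows, destinations as columns; missing self-distances become 0
pv = df.pivot(index='InputID', columns='TargetID', values='Distance').fillna(0)
pv.to_csv('c:\\users\\Niels\\Dropbox\\Python\\Output.csv', index=False, header=False)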

Pandas dataframe total row

I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series which, although I can convert it back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated (and removed in pandas 2.0). You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
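For reference, a minimal pd.concat version of the totals row (a sketch using the columns from the question):
import pandas as pd

df = pd.DataFrame({'foo': list('abcde'),
                   'bar': [1, 3, 2, 9, 3],
                   'qux': [3.14, 2.72, 1.62, 1.41, 0.58]})

# build the totals as a one-row frame, then concatenate instead of append
total = df.sum(numeric_only=True).to_frame('total').T
total['foo'] = 'total'
df_with_total = pd.concat([df, total], ignore_index=True)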
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe, e.g.:
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
               margins=True,
               margins_name='total',  # defaults to 'All'
               aggfunc=sum)
VoilĂ !
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer:
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on the answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show a total (or any other statistic), because it does not change the original dataframe, and it works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table, visible in Jupyter, with a sum row appended below the data.
Styling
With a little more code, you can even make the last row look different:
df.style.concat(
    df.agg(['sum']).style
      .set_properties(**{'background-color': 'yellow'})
)
to get the sum row highlighted in yellow.
See other ways to style (such as bold font, or table lines) in the docs.
The following helped me add a column total and a row total to a dataframe.
Assume dft1 is your original dataframe... now add a column total and a row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
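On pandas 2.0 and later, where DataFrame.append has been removed, that last step can be written with pd.concat instead (a sketch):
# equivalent of dft1.append(dft1_sum) on modern pandas
dft1 = pd.concat([dft1, dft1_sum])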
Actually, all of the proposed solutions render the original DataFrame unusable for any further analysis and can invalidate subsequent computations, which is easy to overlook and could lead to false results.
This is because you add a row to the data which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields

   0
0  1
1  5
2  6
3  8
4  9

and

             0
count  5.00000
mean   5.80000
std    3.11448
min    1.00000
25%    5.00000
50%    6.00000
75%    8.00000
max    9.00000
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this:

         0
0        1
1        5
2        6
3        8
4        9
Totals  29
This looks nice, but the new row is treated as if it were an additional data item, so df.describe() will produce false results:

              0
count   6.00000
mean    9.66667
std     9.87252
min     1.00000
25%     5.25000
50%     7.00000
75%     8.75000
max    29.00000
So: watch out! Apply this only after doing all other analyses of the data, or work on a copy of the DataFrame!
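A minimal sketch of the work-on-a-copy approach (the display_df name is illustrative):
import pandas as pd

df = pd.DataFrame([1, 5, 6, 8, 9])

# totals go on a display copy only; df stays clean for analysis
display_df = df.copy()
display_df.loc['Totals'] = df.sum(numeric_only=True, axis=0)

print(display_df)      # table with the totals row
print(df.describe())   # statistics still computed on the original data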
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end, so as to avoid breaking the integrity of the dataframe (right before printing), I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False) -> pd.DataFrame:
    ret = df.copy()
    if column_sum:
        ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg:
        ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median:
        ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    if row_sum:
        ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg:
        ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median:
        ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)
