Get all row values from single column - python

This code is working and retrieving the exact data i need. I just need to output all the row values into a single value;
HERE IS THE bit of CODE
dfs = pd.read_html(str(tables))
df = dfs[0].iloc[[1,3,5,7,9,11,13,15,17,19],[1]]
s = df['score'].str.split('-',expand=True).astype(int)
df['team_win'] = np.where(s[0] == s[1], 0,s.idxmax(1) + 1)
df = df['team_win'].drop([9, 11])
HERE IS THE OUTPUT
Name: team_win, dtype: int64
1 1
3 0
5 2
7 0
13 1
15 2
17 2
19 2
I need the team_win to be outputed like this... 10201222 so i can copy into google sheet

Ranked from best performance to worst performance, you could:
1)
''.join(map(str, df['team_win']))
Out[319]: '10201222'
2)
''.join(df['team_win'].map(str))
Out[320]: '10201222'
3)
''.join([str(i) for i in df['team_win']])
Out[313]: '10201222'
Thanks to #piRSquared for the additional suggestions

Related

Python: how to multiply 2 columns?

I have the simple dataframe and I would like to add the column 'Pow_calkowita'. If 'liczba_kon' is 0, 'Pow_calkowita' is 'Powierzchn', but if 'liczba_kon' is not 0, 'Pow_calkowita' is 'liczba_kon' * 'Powierzchn. Why I can't do that?
for index, row in df.iterrows():
if row['liczba_kon'] == 0:
row['Pow_calkowita'] = row['Powierzchn']
elif row['liczba_kon'] != 0:
row['Pow_calkowita'] = row['Powierzchn'] * row['liczba_kon']
My code didn't return any values.
liczba_kon Powierzchn
0 3 69.60495
1 1 39.27270
2 1 130.41225
3 1 129.29570
4 1 294.94400
5 1 64.79345
6 1 108.75560
7 1 35.12290
8 1 178.23905
9 1 263.00930
10 1 32.02235
11 1 125.41480
12 1 47.05420
13 1 45.97135
14 1 154.87120
15 1 37.17370
16 1 37.80705
17 1 38.78760
18 1 35.50065
19 1 74.68940
I have found some soultion:
result = []
for index, row in df.iterrows():
if row['liczba_kon'] == 0:
result.append(row['Powierzchn'])
elif row['liczba_kon'] != 0:
result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
Is it good way?
To write idiomatic code for Pandas and leverage on Pandas' efficient array processing, you should avoid writing codes to loop over the array by yourself. Pandas allows you to write succinct codes yet process efficiently by making use of vectorization over its efficient numpy ndarray data structure. Underlying, it uses fast array processing using optimized C language binary codes. Pandas already handles the necessary looping behind the scene and this is also an advantage using Pandas by single statement without explicitly writing loops to iterate over all elements. By using Pandas, you would better enjoy its fast efficient yet succinct vectorization processing instead.
As your formula is based on a condition, you cannot use direct multiplication. Instead you can use np.where() as follows:
import numpy as np
df['Pow_calkowita'] = np.where(df['liczba_kon'] == 0, df['Powierzchn'], df['Powierzchn'] * df['liczba_kon'])
When the test condition in first parameter is true, the value from second parameter is taken, else, the value from the third parameter is taken.
Test run output: (Add 2 more test cases at the end; one with 0 value of liczba_kon)
print(df)
liczba_kon Powierzchn Pow_calkowita
0 3 69.60495 208.81485
1 1 39.27270 39.27270
2 1 130.41225 130.41225
3 1 129.29570 129.29570
4 1 294.94400 294.94400
5 1 64.79345 64.79345
6 1 108.75560 108.75560
7 1 35.12290 35.12290
8 1 178.23905 178.23905
9 1 263.00930 263.00930
10 1 32.02235 32.02235
11 1 125.41480 125.41480
12 1 47.05420 47.05420
13 1 45.97135 45.97135
14 1 154.87120 154.87120
15 1 37.17370 37.17370
16 1 37.80705 37.80705
17 1 38.78760 38.78760
18 1 35.50065 35.50065
19 1 74.68940 74.68940
20 0 69.60495 69.60495
21 2 74.68940 149.37880
To answer the first question: "Why I can't do that?"
The documentation states (in the notes):
Because iterrows returns a Series for each row, ....
and
You should never modify something you are iterating over. [...] the iterator returns a copy and not a view, and writing to it will have no effect.
this basically means that it returns a new Series with the values of that row
So, what you are getting is NOT the actual row, and definitely NOT the dataframe!
BUT what you are doing is working, although not in the way that you want to:
df = DF(dict(a= [1,2,3], b= list("abc")))
df # To demonstrate what you are doing
a b
0 1 a
1 2 b
2 3 c
for index, row in df.iterrows():
... print("\n------------------\n>>> Next Row:\n")
... print(row)
... row["c"] = "ADDED" ####### HERE I am adding to 'the row'
... print("\n -- >> added:")
... print(row)
... print("----------------------")
...
------------------
Next Row: # as you can see, this Series has the same values
a 1 # as the row that it represents
b a
Name: 0, dtype: object
-- >> added:
a 1
b a
c ADDED # and adding to it works... but you aren't doing anything
Name: 0, dtype: object # with it, unless you append it to a list
----------------------
------------------
Next Row:
a 2
b b
Name: 1, dtype: object
### same here
-- >> added:
a 2
b b
c ADDED
Name: 1, dtype: object
----------------------
------------------
Next Row:
a 3
b c
Name: 2, dtype: object
### and here
-- >> added:
a 3
b c
c ADDED
Name: 2, dtype: object
----------------------
To answer the second question: "Is it good way?"
No.
Because using the multiplication like SeaBean has shown actually uses the power of
numpy and pandas, which are vectorized operations.
This is a link to a good article on vectorization in numpy arrays, which are basically the building blocks of pandas DataFrames and Series.
dataframe is designed to operate with vectorication. you can treat it as a database table. So you should use its functions as long as it's possible.
tdf = df # temp df
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1) # replace 0 to 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn'] # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita'] # copy column
This simplified the code and enhanced performance., we can test their performance:
sampleSize = 100000
df=pd.DataFrame({
'liczba_kon': np.random.randint(3, size=(sampleSize)),
'Powierzchn': np.random.randint(1000, size=(sampleSize)),
})
# vectorication
s = time.time()
tdf = df # temp df
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1) # replace 0 to 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn'] # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita'] # copy column
print(time.time() - s)
# iteration
s = time.time()
result = []
for index, row in df.iterrows():
if row['liczba_kon'] == 0:
result.append(row['Powierzchn'])
elif row['liczba_kon'] != 0:
result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
print(time.time() - s)
We can see vectorication performed much faster.
0.0034716129302978516
6.193516492843628

fill in entire dataframe cell by cell based on index AND column names?

I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
columns = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
for seq2 in df.columns:
df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
ae azde afgle arlde afghijklbcmde
afghijklde 8 7 5 6 3
afghijklmde 9 8 6 7 2
ade 1 1 3 2 10
afghilmde 7 6 4 5 4
amde 2 1 3 2 9
What is a better way to do this, perhaps using applymap() ?. Everything I've tried with applymap() or apply or df.iterrows() has returned errors of the kind AttributeError: "'float' object has no attribute 'index'" . Thanks.
Turns out there's an even better way to do this. onepan's dictionary comprehension answer above is good but returns the df index and columns in random order. Using a nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is to not get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
you could use comprehensions, which speeds it up ~4.5x on my pc
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
# ae azde afgle arlde afghijklbcmde
# ade 1 2 2 2 2
# afghijklde 1 3 4 4 9
# afghijklmde 1 3 4 4 10
# afghilmde 1 3 4 4 8
# amde 1 3 3 3 3
# this matches to edit_distance('ae', 'afghijklde') == 8, e.g.
note I used this code for edit_distance (first response in your link):
def edit_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]

Splitting and copying a row in pandas

I have a task that is completely driving me mad. Lets suppose we have this df:
import pandas as pd
k = {'random_col':{0:'a',1:'b',2:'c'},'isin':{0:'ES0140074008', 1:'ES0140074008ES0140074010', 2:'ES0140074008ES0140074016ES0140074024'},'n_isins':{0:1,1:2,2:3}}
k = pd.DataFrame(k)
What I want to do is to double or triple a row a number of times goberned by col n_isins which is a number obtained by dividing the lentgh of col isin didived by 12, as isins are always strings of 12 characters.
So, I need 1 time row 0, 2 times row 1 and 3 times row 2. My real numbers are up-limited by 6 so it is a hard task. I began by using booleans and slicing the col isin but that does not take me to nothing. Hopefully my explanation is good enough. Also I need the col isin sliced like this [0:11] + ' ' + [12:23]... splitting by the 'E' but I think I know how to do that, I just post it cause is the criteria that rules the number of times I have to copy each row. Thanks in advance!
I think you need numpy.repeat with loc, last remove duplicates in index by reset_index. Last for new column use custom splitting function with numpy.concatenate:
n = np.repeat(k.index, k['n_isins'])
k = k.loc[n].reset_index(drop=True)
print (k)
isin n_isins random_col
0 ES0140074008 1 a
1 ES0140074008ES0140074010 2 b
2 ES0140074008ES0140074010 2 b
3 ES0140074008ES0140074016ES0140074024 3 c
4 ES0140074008ES0140074016ES0140074024 3 c
5 ES0140074008ES0140074016ES0140074024 3 c
#https://stackoverflow.com/a/7111143/2901002
def chunks(s, n):
"""Produce `n`-character chunks from `s`."""
for start in range(0, len(s), n):
yield s[start:start+n]
s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))
df['new'] = pd.Series(s, index = df.index)
print (df)
isin n_isins random_col new
0 ES0140074008 1 a ES0140074008
1 ES0140074008ES0140074010 2 b ES0140074008
2 ES0140074008ES0140074010 2 b ES0140074010
3 ES0140074008ES0140074016ES0140074024 3 c ES0140074008
4 ES0140074008ES0140074016ES0140074024 3 c ES0140074016
5 ES0140074008ES0140074016ES0140074024 3 c ES0140074024

Fastest way to compare rows of two pandas dataframes?

So I have two pandas dataframes, A and B.
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
B is 1024 rows x 10 columns, and is a full iteration of 0's and 1's, hence having 1024 rows.
I am trying to find which rows in A, at a particular 10 columns of A, correspond with a given row in B. I need the whole row to match up, rather than element by element.
For example, I would want
A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)==(1,0,1,0,1,0,0,1,0,0)).all(axis=1)]
To return something that rows (3,5,8,11,15) in A match up with that (1,0,1,0,1,0,0,1,0,0) row of B at those particular columns (1,2,3,4,5,6,7,8,9,10)
And I want to do this over every row in B.
The best way I could figure out to do this was:
import numpy as np
for i in B:
B_array = np.array(i)
Matching_Rows = A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == B_array).all(axis=1)]
Matching_Rows_Index = Matching_Rows.index
This isn't terrible for one instance, but I use it in a while loop that runs around 20,000 times; therefore, it slows it down quite a bit.
I have been messing around with DataFrame.apply to no avail. Could map work better?
I was just hoping someone saw something obviously more efficient as I am fairly new to python.
Thanks and best regards!
We can abuse the fact that both dataframes have binary values 0 or 1 by collapsing the relevant columns from A and all columns from B into 1D arrays each, when considering each row as a sequence of binary numbers that could be converted to decimal number equivalents. This should reduce the problem set considerably, which would help with performance. Now, after getting those 1D arrays, we can use np.in1d to look for matches from B in A and finally np.where on it to get the matching indices.
Thus, we would have an implementation like so -
# Setup 1D arrays corresponding to selected cols from A and entire B
S = 2**np.arange(10)
A_ID = np.dot(A[range(1,11)],S)
B_ID = np.dot(B,S)
# Look for matches that exist from B_ID in A_ID, whose indices
# would be desired row indices that have matched from B
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
Sample run -
In [157]: # Setup dataframes A and B with rows 0, 4 in A having matches from B
...: A_arr = np.random.randint(0,2,(10,14))
...: B_arr = np.random.randint(0,2,(7,10))
...:
...: B_arr[2] = A_arr[4,1:11]
...: B_arr[4] = A_arr[4,1:11]
...: B_arr[5] = A_arr[0,1:11]
...:
...: A = pd.DataFrame(A_arr)
...: B = pd.DataFrame(B_arr)
...:
In [158]: S = 2**np.arange(10)
...: A_ID = np.dot(A[range(1,11)],S)
...: B_ID = np.dot(B,S)
...: out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
...:
In [159]: out_row_idx
Out[159]: array([0, 4])
You can use merge with reset_index - output are indexes of B which are equal in A in custom columns:
A = pd.DataFrame({'A':[1,0,1,1],
'B':[0,0,1,1],
'C':[1,0,1,1],
'D':[1,1,1,0],
'E':[1,1,0,1]})
print (A)
A B C D E
0 1 0 1 1 1
1 0 0 0 1 1
2 1 1 1 1 0
3 1 1 1 0 1
B = pd.DataFrame({'0':[1,0,1],
'1':[1,0,1],
'2':[1,0,0]})
print (B)
0 1 2
0 1 1 1
1 0 0 0
2 1 1 0
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A')))
index_B 0 1 2 index_A A B C D E
0 0 1 1 1 2 1 1 1 1 0
1 0 1 1 1 3 1 1 1 0 1
2 1 0 0 0 1 0 0 0 1 1
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A'))[['index_B','index_A']])
index_B index_A
0 0 2
1 0 3
2 1 1
You can do it in pandas by using loc or ix and telling it to find the rows where the ten columns are all equal. Like this:
A.loc[(A[1]==B[1]) & (A[2]==B[2]) & (A[3]==B[3]) & A[4]==B[4]) & (A[5]==B[5]) & (A[6]==B[6]) & (A[7]==B[7]) & (A[8]==B[8]) & (A[9]==B[9]) & (A[10]==B[10])]
This is quite ugly in my opinion but it will work and gets rid of the loop so it should be significantly faster. I wouldn't be surprised if someone could come up with a more elegant way of coding the same operation.
In this special case, your rows of 10 zeros and ones can be interpreted as 10 digit binaries. If B is in order, then it can be interpreted as a range from 0 to 1023. In this case, all we need to do is take A's rows in 10 column chunks and calculate what its binary equivalent is.
I'll start by defining a range of powers of two so I can do matrix multiplication with it.
twos = pd.Series(np.power(2, np.arange(10)))
Next, I'll relabel A's columns into a MultiIndex and stack to get my chunks of 10.
A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A.columns = pd.MultiIndex.from_tuples(zip((A.columns / 10).tolist(), (A.columns % 10).tolist()))
A_ = A.stack(0)
A_.head()
Finally, I'll multiply A_ with twos to get integer representation of each row and unstack.
A_.dot(twos).unstack()
This is now a 1000 x 50 DataFrame where each cell represents which of B's rows we matched for that particular 10 column chunk for that particular row of A. There isn't even a need for B.

Pandas dataframe total row

I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command but I end up with a Series, which although I can convert back to a Dataframe, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
pd.append is now deprecated. You could use pd.concat instead but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution so I'd recommend sticking to operations on the dataframe, though. eg.
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
margins=True,
margins_name='total', # defaults to 'All'
aggfunc=sum)
VoilĂ !
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change you dataframe, works even if you have an "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that is visible in jupyter as this:
Styling
with a little longer code, you can even make the last row look different:
df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)
to get:
see other ways to style (such as bold font, or table lines) in the docs
Following helped for me to add a column total and row total to a dataframe.
Assume dft1 is your original dataframe... now add a column total and row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
Actually all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate following computations, which will be easy to overlook and could lead to false results.
This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields
0
0
1
1
5
2
6
3
8
4
9
0
count
5
mean
5.8
std
3.11448
min
1
25%
5
50%
6
75%
8
max
9
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
0
0
1
1
5
2
6
3
8
4
9
Totals
29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:
0
count
6
mean
9.66667
std
9.87252
min
1
25%
5.25
50%
7
75%
8.75
max
29
So: Watch out! and apply this only after doing all other analyses of the data or work on a copy of the DataFrame!
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since i generally want to do this at the very end as to avoid breaking the integrity of the dataframe (right before printing). I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
column_sum: bool = False,
column_avg: bool = False,
column_median: bool = False,
row_sum: bool = False,
row_avg: bool = False,
row_median: bool = False
) -> pd.DataFrame:
ret = df.copy()
if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
if row_median: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
if row_avg: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
ret.fillna('-', inplace=True)
return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)

Categories

Resources