I have a Pandas DataFrame 'df' in which I'd like to perform some scaling, column by column.
In column 'a', I need the maximum number to be 1, the minimum number to be 0, and all other values to be spread proportionally in between.
In column 'b', however, I need the minimum number to be 1, the maximum number to be 0, and all other values to be spread proportionally in between.
Is there a Pandas function to perform these two operations? If not, numpy would certainly do.
    a    b
A  14  103
B  90  107
C  90  110
D  96  114
E  91  114
This is how you can do it using scikit-learn and its preprocessing module, which has many functions for scaling and centering data.
In [0]: import pandas as pd
   ...: from sklearn.preprocessing import MinMaxScaler
In [1]: df = pd.DataFrame({'A':[14,90,90,96,91],
'B':[103,107,110,114,114]}).astype(float)
In [2]: df
Out[2]:
A B
0 14 103
1 90 107
2 90 110
3 96 114
4 91 114
In [3]: scaler = MinMaxScaler()
In [4]: df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
In [5]: df_scaled
Out[5]:
A B
0 0.000000 0.000000
1 0.926829 0.363636
2 0.926829 0.636364
3 1.000000 1.000000
4 0.939024 1.000000
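The question also asks for column 'b' to run from 1 down to 0; MinMaxScaler only scales low-to-high, so (assuming the df_scaled above) you would flip that column afterwards, the same trick used in the next answer:
In [6]: df_scaled['B'] = 1 - df_scaled['B']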
You could subtract the min, then divide by the max (beware of 0/0 for a constant column). Note that after subtracting the min, the new max is the original max - min.
In [11]: df
Out[11]:
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114
In [12]: df -= df.min() # equivalent to df = df - df.min()
In [13]: df /= df.max() # equivalent to df = df / df.max()
In [14]: df
Out[14]:
a b
A 0.000000 0.000000
B 0.926829 0.363636
C 0.926829 0.636364
D 1.000000 1.000000
E 0.939024 1.000000
To switch the direction of a column (from 1 to 0 rather than 0 to 1):
In [15]: df['b'] = 1 - df['b']
An alternative is to negate the 'b' column first (df['b'] = -df['b']) and then scale, as in the sketch below.
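A minimal end-to-end sketch combining the negation idea with the scaling above (assuming the original df from the question):
In [16]: df = pd.DataFrame({'a': [14, 90, 90, 96, 91],
    ...:                    'b': [103, 107, 110, 114, 114]}, index=list('ABCDE'))
In [17]: df['b'] = -df['b']   # negate so the original max becomes the new min
In [18]: df -= df.min()       # shift each column to start at 0
In [19]: df /= df.max()       # scale each column to end at 1; 'b' now runs 1 -> 0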
In case you want to scale only one column in the dataframe, you can do the following:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['Col1_scaled'] = scaler.fit_transform(df['Col1'].values.reshape(-1,1))
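If you'd rather avoid the sklearn dependency, a pure-pandas sketch of the same single-column scaling (Col1 is the hypothetical column name from above):
col = df['Col1']
df['Col1_scaled'] = (col - col.min()) / (col.max() - col.min())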
This is not very elegant, but the following works for this two-column case:
# Create the dataframe
df = pd.DataFrame({'A': [14, 90, 90, 96, 91], 'B': [103, 107, 110, 114, 114]})

# apply() operates on each row or column with the lambda function
# axis=0 -> act on columns, axis=1 -> act on rows
# x stands for the whole row or column
# This line scales each column so its minimum is 0 and its maximum is 1
df2 = df.apply(lambda x: (x.astype(float) - x.min()) / (x.max() - x.min()), axis=0)

# Now invert the order on column 'B':
# use apply again, reverse the numbers, select column 'B' only,
# and reassign it to column 'B' of the scaled dataframe
df2['B'] = df2.apply(lambda x: 1 - x, axis=1)['B']
If I find a more elegant way (for example, using the column index, (0 or 1) mod 2 - 1, to select the sign in the apply operation so it can be done with just one apply command), I'll let you know.
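For what it's worth, a hedged sketch of that single-apply idea, assuming even-positioned columns should scale 0 to 1 and odd-positioned ones 1 to 0:
def scale(x):
    s = (x.astype(float) - x.min()) / (x.max() - x.min())
    # flip the direction for odd column positions (here: 'B')
    return s if df.columns.get_loc(x.name) % 2 == 0 else 1 - s

df2 = df.apply(scale, axis=0)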
I think Acumenus' comment in this answer should be mentioned explicitly as an answer, as it is a one-liner.
>>> import pandas as pd
>>> from sklearn.preprocessing import minmax_scale
>>> df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
>>> minmax_scale(df)
array([[0. , 0. ],
[0.92682927, 0.36363636],
[0.92682927, 0.63636364],
[1. , 1. ],
[0.93902439, 1. ]])
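Note that minmax_scale returns a plain NumPy array; if you want the labels back, wrap it yourself:
>>> pd.DataFrame(minmax_scale(df), columns=df.columns, index=df.index)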
Given a data frame
df = pd.DataFrame({'A':[14,90,90,96,91], 'B':[103,107,110,114,114]})
scale to mean 0 and variance 1 with
df.apply(lambda x: (x - np.mean(x)) / np.std(x), axis=0)
and scale to the range 0 to 1 with
df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)), axis=0)
(Dividing by np.max(x) alone only maps the maximum to 1; it does not bring the minimum to 0. Both snippets assume import numpy as np.)
Related
How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id  scores  cosine
1   100     0.8
2   75      0.7
3   50      0.4
4   25      0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd
score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
    {
        "score": score,
        "cosine": cosine,
    }
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
def my_fuct(row):
    i = int(row["index"])
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]

correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method, which avoids rebuilding the whole correlation matrix on every call (note that the walrus operator := requires Python 3.8+):
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column, index, holding what was previously your row index. When using DataFrame.apply, access it via:
correlation_df.apply(lambda r: r["index"])
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.
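Putting the three steps together with the question's my_fuct (a sketch):
correlation_df = correlation_df.reset_index()
correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
correlation_df = correlation_df.set_index("index")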
I have a dataframe consisting of 6 columns. What would be the fastest way to generate a matrix which does the following:
Step 1) col1*col1a, col2*col2a, col3*col3a, col4*col4a
Step 2) col_new = ((col1*col1a) - (col2*col2a)) / (col1a - col2a)
Using a for loop is one of the options, but what could be a quicker way to go about this?
import pandas as pd
df=pd.DataFrame()
df['col1']=[100,200,300,400,500]
df['col1a']=[6,71,8,90,10]
df['col2']=[600,700,800,1900,100]
df['col2a']=[6,17,8,9,10]
df['col3']=[100,220,300,440,500]
df['col3a']=[1,22,3,44,5]
df['1x2'] = (df['col1']*df['col1a'] - df['col2']*df['col2a']) / (df['col1a'] - df['col2a'])
I need to have column combinations of 1x3,1x4,1x5,2x3,2x4 and so on...
Here is how I will approach it:

def new_col(df, col1, col2):
    """
    Add a new column, modifying the dataframe in place.

    col1: int
        column counter in the first column name
    col2: int
        column counter in the second column name
    """
    nr = (
        df.loc[:, f"col{col1}"] * df.loc[:, f"col{col1}a"]
        - df.loc[:, f"col{col2}"] * df.loc[:, f"col{col2}a"]
    )
    dr = df.loc[:, f"col{col1}a"] - df.loc[:, f"col{col2}a"]
    df.loc[:, f"col{col1}X{col2}"] = nr / dr
I will call this function with the desired column combinations. For example:
new_col(df, 1, 2)
The calls can be issued from a loop, as sketched below.
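For example, a sketch looping over the pairs with itertools.combinations (assuming columns col1..col3 as in the question):
from itertools import combinations

for c1, c2 in combinations([1, 2, 3], 2):  # pairs (1,2), (1,3), (2,3)
    new_col(df, c1, c2)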
So apparently, my first answer only matched the original question. Here is an answer for the updated question:
from itertools import combinations
from functools import partial
primary_columns = df.columns[~df.columns.str.endswith("a")]
combs = combinations(primary_columns, 2)
def column_comparison(first, second, df):
    return (df[first] * df[first + "a"] - df[second] * df[second + "a"]) / (df[first + "a"] - df[second + "a"])

dct = {'{first}X{second}'.format(first=comb[0].lstrip("col"), second=comb[1].lstrip("col")):
       partial(column_comparison, comb[0], comb[1]) for comb in combs}
So we created a dictionary that contains the name of the desired columns and the right function.
Now we can leverage assign
df.assign(**dct)
to obtain
col1 col1a col2 col2a col3 col3a 1X2 1X3 2X3
0 100 6 600 6 100 1 -inf 100.000000 700.000000
1 200 71 700 17 220 22 42.592593 191.020408 -1412.000000
2 300 8 800 8 300 3 -inf 300.000000 1100.000000
3 400 90 1900 9 440 44 233.333333 361.739130 64.571429
4 500 10 100 10 500 5 inf 500.000000 -300.000000
In a previous version I was using a lambda here, but this was not working - check here for an explanation. I only realized this after finding the solution using partial.
I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).
Here's what I've tried:
df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)
But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.
You can use len:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
A B
0 1 2
1 1 4
The number of rows is in the attribute .shape[0]:
df.groupby('A').filter(lambda x: x.shape[0] >= min_size)
NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).
groupby.filter can be very slow for larger datasets / a large number of groups. A faster approach is to use groupby.transform:
Here's an example, first create the dataset:
import pandas as pd
import numpy as np
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
y A
0 1.375980 1
1 -0.023861 1
2 -0.474707 1
3 -0.151859 2
4 -1.696823 2
... ... ...
2424 0.276737 998
2425 -0.142171 999
2426 -0.718891 999
2427 -0.621315 999
2428 1.335450 999
[2429 rows x 2 columns]
Then time both approaches; with many small groups, the transform version is typically much faster than filter.
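A sketch of the two equivalents being compared (min_size is an assumed threshold for illustration):
min_size = 3  # assumed threshold for illustration

# slow: calls the lambda once per group
slow = df.groupby('A').filter(lambda x: len(x) >= min_size)

# fast: broadcasts each group's size back to its rows, then boolean-indexes
fast = df[df.groupby('A')['y'].transform('size') >= min_size]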
I need to add some 'noise' to my data, so I would like to add a different random number to every cell in my pandas dataframe. This code works, but seems unpythonic. Is there a better way?
import pandas as pd
import numpy as np

df = pd.DataFrame(0.0, index=[1, 2, 3, 4, 5], columns=list('ABC'))
print(df)
for x, line in df.iterrows():
    for col in df:
        line[col] = line[col] + (np.random.rand() - 0.5) / 1000.0
print(df)
df + np.random.rand(*df.shape) / 10000.0
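If you need the noise to be reproducible, a sketch with a seeded generator (the seed value is an arbitrary assumption):
rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
df + (rng.random(df.shape) - 0.5) / 1000.0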
OR
Let's use applymap (renamed to DataFrame.map in pandas 2.1):
df = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABC'))
df.applymap(lambda x: x + np.random.rand()/10000.0)
output: a 5x3 DataFrame of floats just above 1.0, each cell perturbed by its own random value.
This would be a more succinct and equivalent method:
In [147]:
df = pd.DataFrame((np.random.rand(5,3) - 0.5)/1000.0, columns=list('ABC'))
df
Out[147]:
A B C
0 0.000381 -0.000167 0.000020
1 0.000482 0.000007 -0.000281
2 -0.000032 -0.000402 -0.000251
3 -0.000037 -0.000319 0.000260
4 -0.000035 0.000178 0.000166
If you're doing this to an existing df with non-zero values then add:
In [149]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df
Out[149]:
A B C
0 -1.705644 0.149067 0.835378
1 -0.956335 -0.586120 0.212981
2 0.550727 -0.401768 1.421064
3 0.348885 0.879210 0.136858
4 0.271063 0.132579 1.233789
In [154]:
df.add((np.random.rand(df.shape[0], df.shape[1]) - 0.5)/1000.0)
Out[154]:
A B C
0 -1.705459 0.148671 0.835761
1 -0.956745 -0.586382 0.213339
2 0.550368 -0.401651 1.421515
3 0.348938 0.878923 0.136914
4 0.270864 0.132864 1.233622
For nonzero data:
df + (np.random.rand(*df.shape) - 0.5) * 0.001
OR
df + np.random.uniform(-0.01, 0.01, df.shape)
For cases where your data frame contains zeros that you wish to keep as zero:
df * (1 + (np.random.rand(*df.shape) - 0.5) * 0.001)
OR
df * (1 + np.random.uniform(-0.01, 0.01, df.shape))
Either of these should work: generate a same-size array of noise as your existing df and add it to it (or multiply by 1 + noise where you wish zeros to remain zero). Note that np.random.rand takes the dimensions as separate arguments, hence the *df.shape unpacking, while np.random.uniform accepts a size tuple directly. With the uniform function you can determine the scale of your noise by altering the 0.01 value.
I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series which, although I can convert it back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated (and removed in pandas 2.0). You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The numeric_only=True argument is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe, though. E.g.
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
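A concat-based sketch of the non-mutating approach mentioned in the update (column names as in the question):
totals = df.sum(numeric_only=True).to_frame().T
totals['foo'] = 'total'
summary = pd.concat([df, totals], ignore_index=True)  # df itself stays untouched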
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
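On pandas 2.x, where DataFrame.append was removed, an equivalent sketch is:
pd.concat([df, df.sum().rename('Total').to_frame().T])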
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
margins=True,
margins_name='total', # defaults to 'All'
aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index, you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that renders nicely in a Jupyter notebook.
Styling
With a little more code, you can even make the last row look different:
df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)
to get a highlighted totals row.
See other ways to style (such as bold font or table lines) in the docs.
The following helped me add a column total and a row total to a dataframe.
Assume dft1 is your original dataframe; now add a column total and a row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
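On pandas 2.x, where DataFrame.append was removed, that last step becomes:
pd.concat([dft1, dft1_sum])  # same result as dft1.append(dft1_sum)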
Actually all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate following computations, which will be easy to overlook and could lead to false results.
This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields

   0
0  1
1  5
2  6
3  8
4  9

              0
count  5.00000
mean   5.80000
std    3.11448
min    1.00000
25%    5.00000
50%    6.00000
75%    8.00000
max    9.00000
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
        0
0       1
1       5
2       6
3       8
4       9
Totals  29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:
             0
count  6.00000
mean   9.66667
std    9.87252
min    1.00000
25%    5.25000
50%    7.00000
75%    8.75000
max   29.00000
So: watch out! Apply this only after doing all other analyses of the data, or work on a copy of the DataFrame, as sketched below.
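A sketch of the copy-based approach, so the original df stays clean for describe() and friends:
display_df = df.copy()
display_df.loc['Totals'] = df.sum(numeric_only=True, axis=0)
# df is unchanged, so df.describe() still reports the true statistics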
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end, to avoid breaking the integrity of the dataframe (right before printing), I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False) -> pd.DataFrame:
    ret = df.copy()
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
     a   b   c Sum Median
0    1   4   7  12      4
1    2   5   8  15      5
2    3   6   9  18      6
Sum  6  15  24   -      -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)