I am just starting to use Python and I'm trying to learn some of the general things about it. As I was playing around, I wanted to see if I could make a dataframe that shows a starting number compounded by a return. Sorry if this description doesn't make much sense, but I basically want a dataframe x rows long that shows me:
number*(return)^(row number) in each row
So, for example, say the number is 10 and the return is 10%; I would like the dataframe to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!
Let us try:
import numpy as np
import pandas as pd

val = 10     # starting value
det = 0.1    # periodic return
n = 4
out = val * ((1 + det) ** np.arange(n))   # val, val*1.1, val*1.1**2, ...
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice that here I am indexing from 0, since 1.1**0 yields the original value.
I think this does what you want:
import pandas as pd

df = pd.DataFrame({'returns': list(range(1, 10))})
df.index = df.index + 1                      # start the index at 1
df.returns = df.returns.apply(lambda x: 10 * (1.1 ** x))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
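For longer series, the same result can be computed without the per-row apply. A vectorized sketch, assuming the same start value of 10, a 10% return, and 9 periods as above:

import numpy as np
import pandas as pd

start, ret, n = 10, 0.10, 9
# start * (1 + ret)**k for k = 1..n, with the index starting at 1
df = pd.DataFrame({'returns': start * (1 + ret) ** np.arange(1, n + 1)},
                  index=range(1, n + 1))
print(df)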
I am working on a project using Learning to Rank. Below is the example dataset format (taken from https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). The first column is the rank, the second column is the query id, and the following columns are [feature number]:[feature value] pairs.
1008 qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000 … 46:0.00000
1007 qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333 … 46:0.000000
1006 qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000 … 46:0.000000
Right now, I have successfully converted my data into the following format in a pandas DataFrame.
10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0 92.28260 ...
...
The first two columns are already fine. What I need next is to prefix the remaining columns with their feature numbers (e.g. the first feature value, 3500, becomes 1:3500).
I know I can prepend a string to a column using the following command.
df['col'] = 'str' + df['col'].astype(str)
The first feature, 3500, is located at column index 2, so what I can think of is prefixing each column with its column index - 1. How do I prepend the string based on the column number?
Any help would be appreciated.
I think you need DataFrame.radd to add the column names from the right side, and iloc to select from the third column (index 2) to the end:
print (df)
0 1 2 3 4 5 6 7 8 \
0 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
1 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
9
0 92.2826
1 92.2826
# prefix each feature value with '<column index - 1>:'
df.iloc[:, 2:] = df.iloc[:, 2:].astype(str).radd(':').radd((df.columns[2:] - 1).astype(str))
print (df)
0 1 2 3 4 5 6 7 \
0 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
1 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
8 9
0 7:1840.0 8:92.2826
1 7:1840.0 8:92.2826
You can simply concatenate the columns:
df['new_col'] = df[df.columns[3]].astype(str) + ':' + df[df.columns[2]].astype(str)
This will output a new column in your df named new_col. You can then delete the unnecessary columns.
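That handles a single pair of columns; a sketch that applies the same idea to every feature column at once (assuming, as in the question, that the feature values start at column index 2) could look like:

import pandas as pd

# hypothetical frame in the question's layout: rank, qid, then raw feature values
df = pd.DataFrame([[10, 'qid:354714443278337', 3500, 1, 122.0, 156.0]])
for i, col in enumerate(df.columns[2:], start=1):
    df[col] = f'{i}:' + df[col].astype(str)   # 3500 -> '1:3500', 1 -> '2:1', ...
print(df)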
You can convert each string to a dictionary and then read it back into a pandas DataFrame:
import ast
import pandas as pd

df = pd.DataFrame({'rank': [1008, 1007, 1006],
                   'column': ['qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000',
                              'qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333',
                              'qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000']})

def putquotes(x):
    # quote the key, so 1:0.004356 becomes '1':0.004356
    x1 = x.split(":")
    return "'" + x1[0] + "':" + x1[1]

def putcommas(x):
    # wrap a whole row into a dictionary literal
    x1 = x.split()
    return "{" + ",".join([putquotes(t) for t in x1]) + "}"

df1 = [ast.literal_eval(putcommas(x)) for x in df['column'].tolist()]
df = pd.concat([df, pd.DataFrame(df1)], axis=1)
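For comparison, a minimal sketch that builds the dictionaries directly, skipping the string-literal and ast round trip (the parsed values stay strings here instead of becoming numbers):

parsed = [dict(kv.split(':', 1) for kv in row.split()) for row in df['column']]
df2 = pd.concat([df, pd.DataFrame(parsed)], axis=1)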
This question already has answers here: How to calculate correlation between all columns and remove highly correlated ones using pandas? (28 answers). Closed 1 year ago.
I have a DataFrame like this
import pandas as pd

dict_ = {'Date': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'],
         'Col1': [1, 2, 3, 4, 5],
         'Col2': [1.1, 1.2, 1.3, 1.4, 1.5],
         'Col3': [0.33, 0.98, 1.54, 0.01, 0.99]}
df = pd.DataFrame(dict_)
I then calculate the Pearson correlation between the columns and filter out columns that are correlated above my threshold of 0.95:
import numpy as np

def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
which yields
uncorrelated_factors = trimm_correlated(df, 0.95)
print(uncorrelated_factors)
Col3
0 0.33
1 0.98
2 1.54
3 0.01
4 0.99
So far I am happy with the result, but I would like to keep one column from each correlated pair, so in the above example I would like to include Col1 or Col2, to get something like this:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
Also, on a side note, is there any further evaluation I can do to determine which of the correlated columns to keep?
Thanks!
You can use np.tril() instead of np.eye() for the mask. Masking the lower triangle, including the diagonal, means each column is only compared against the columns before it, so the first column of each correlated pair survives:
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.tril(np.ones([len(df_corr)]*2, dtype=bool))).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
Output:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
Use this directly on the dataframe to pick out the top correlation values.
import pandas as pd
import numpy as np

def correl(X_train):
    cor = X_train.corr()
    corrm = np.corrcoef(X_train.transpose())
    corr = corrm - np.diagflat(corrm.diagonal())  # zero out the diagonal
    print("max corr:", corr.max(), ", min corr: ", corr.min())
    c1 = cor.stack().sort_values(ascending=False).drop_duplicates()
    high_cor = c1[c1.values != 1]
    # change this value to get more correlation results
    thresh = 0.9
    display(high_cor[high_cor > thresh])  # display() is available in IPython/Jupyter

correl(X)  # X is your numeric feature DataFrame
output:
max corr: 0.9821068918331252 , min corr: -0.2993837739125243
count_rech_2g_8 sachet_2g_8 0.982107
count_rech_2g_7 sachet_2g_7 0.979492
count_rech_2g_6 sachet_2g_6 0.975892
arpu_8 total_rech_amt_8 0.946617
arpu_3g_8 arpu_2g_8 0.942428
isd_og_mou_8 isd_og_mou_7 0.938388
arpu_2g_6 arpu_3g_6 0.933158
isd_og_mou_6 isd_og_mou_8 0.931683
arpu_3g_7 arpu_2g_7 0.930460
total_rech_amt_6 arpu_6 0.930103
isd_og_mou_7 isd_og_mou_6 0.926571
arpu_7 total_rech_amt_7 0.926111
dtype: float64
I'm trying to get the row of the median value for a column.
I'm using data.median() to get the median value of each column:
id 30444.5
someProperty 3.0
numberOfItems 0.0
column 70.0
And data.median()['column'] is subsequently:
data.median()['column']
>>> 70.0
How can I get the row or index of the median value?
Is there anything similar to idxmax / idxmin?
I tried filtering, but it's not reliable in cases where multiple rows have the same value.
Thanks!
You can use rank and idxmin and apply it to each column:
import numpy as np
import pandas as pd
def get_median_index(d):
    # percentile ranks: the median sits at 0.5
    ranks = d.rank(pct=True)
    close_to_median = abs(ranks - 0.5)
    # index of the value whose percentile rank is closest to the median
    return close_to_median.idxmin()
df = pd.DataFrame(np.random.randn(13, 4))
df
0 1 2 3
0 0.919681 -0.934712 1.636177 -1.241359
1 -1.198866 1.168437 1.044017 -2.487849
2 1.159440 -1.764668 -0.470982 1.173863
3 -0.055529 0.406662 0.272882 -0.318382
4 -0.632588 0.451147 -0.181522 -0.145296
5 1.180336 -0.768991 0.708926 -1.023846
6 -0.059708 0.605231 1.102273 1.201167
7 0.017064 -0.091870 0.256800 -0.219130
8 -0.333725 -0.170327 -1.725664 -0.295963
9 0.802023 0.163209 1.853383 -0.122511
10 0.650980 -0.386218 -0.170424 1.569529
11 0.678288 -0.006816 0.388679 -0.117963
12 1.640222 1.608097 1.779814 1.028625
df.apply(get_median_index, 0)
0 7
1 7
2 3
3 4
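For a single column, a shorter sketch of the same idea picks the row whose value is closest to the median; this also covers even-length columns, where the median falls between two values (the frame below is illustrative):

import pandas as pd

data = pd.DataFrame({'column': [70.0, 69.5, 71.0, 70.0]})
idx = (data['column'] - data['column'].median()).abs().idxmin()
print(idx, data.loc[idx, 'column'])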
Maybe just: data[data['column'] == data['column'].median()].
I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series which, although I can convert back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated. You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
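A minimal sketch of the pd.concat replacement, using a frame shaped like the one in the question (column names taken from there):

import pandas as pd

df = pd.DataFrame({'foo': list('abcde'), 'bar': [1, 3, 2, 9, 3],
                   'qux': [3.14, 2.72, 1.62, 1.41, 0.58]})
tot = df.sum(numeric_only=True).rename('total')
out = pd.concat([df, tot.to_frame().T])  # totals row; non-numeric columns get NaN
print(out)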
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The numeric_only conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe, though. E.g.:
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
margins=True,
margins_name='total', # defaults to 'All'
aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index, you can name the sum Series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way I do it: by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on the answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistic), because it does not change the original dataframe and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that renders inline in Jupyter.
Styling
With a little more code, you can even make the last row look different:
df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)
to get a version with a highlighted totals row.
See other ways to style (such as bold font or table lines) in the docs.
The following helped me add a column total and row total to a dataframe.
Assume dft1 is your original dataframe; now add a column total and row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
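On pandas 2.x, where DataFrame.append has been removed, the last step can be written with concat instead:

pd.concat([dft1, dft1_sum])  # same result: row_total appended to dft1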
Actually, all of the proposed solutions that modify the DataFrame in place render it unusable for further analysis and can invalidate subsequent computations, which is easy to overlook and can lead to false results.
This is because you add a row to the data which pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields

   0
0  1
1  5
2  6
3  8
4  9

and

             0
count  5.00000
mean   5.80000
std    3.11448
min    1.00000
25%    5.00000
50%    6.00000
75%    8.00000
max    9.00000
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
         0
0        1
1        5
2        6
3        8
4        9
Totals  29
This looks nice, but the new row is treated as if it were an additional data item, so df.describe() will produce false results:
             0
count  6.00000
mean   9.66667
std    9.87252
min    1.00000
25%    5.25000
50%    7.00000
75%    8.75000
max   29.00000
So: watch out! Apply this only after doing all other analyses of the data, or work on a copy of the DataFrame!
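A minimal sketch of the copy approach (the names are illustrative):

display_df = df.copy()                                   # the analysis frame stays intact
display_df.loc['Totals'] = df.sum(numeric_only=True, axis=0)
print(display_df)                                        # totals appear only in the copy
print(df.describe())                                     # statistics remain correct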
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end, so as to avoid breaking the integrity of the dataframe (right before printing), I created a summary_rows_cols method which returns a printable dataframe:
import pandas as pd

def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False
                      ) -> pd.DataFrame:
    ret = df.copy()
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to pass in a generic (numeric) df and get summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)