Getting meaningful results from pandas.describe() - python

I called describe on one column of a dataframe and ended up with the following output,
count 1.048575e+06
mean 8.232821e+01
std 2.859016e+02
min 0.000000e+00
25% 3.000000e+00
50% 1.400000e+01
75% 6.000000e+01
max 8.599700e+04
What parameter do I pass to get meaningful integer values? What I mean is that when I check the count in SQL it is about 43 million, and all the other values look different too. Can someone help me understand what this conversion (scientific notation) means, and how do I get the floats rounded to 2 decimal places? I'm new to Pandas.

You can use round() directly and pass the number of decimals you want as an argument:
import pandas as pd
import numpy as np

# setting the seed so the dataframe is reproducible
np.random.seed(25)

# creating a 5 x 4 dataframe of random floats
df = pd.DataFrame(np.random.random([5, 4]), columns=["A", "B", "C", "D"])

# rounding the describe() output to 2 decimal places
df.describe().round(2)
A B C D
count 5.00 5.00 5.00 5.00
mean 0.52 0.47 0.38 0.42
std 0.21 0.23 0.19 0.29
min 0.33 0.12 0.16 0.11
25% 0.41 0.37 0.28 0.19
50% 0.45 0.58 0.37 0.44
75% 0.56 0.59 0.40 0.52
max 0.87 0.70 0.68 0.84
See the pandas docs for DataFrame.round.

There are two ways to control pandas' float output: set a global display option, or format the result with apply.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df['X'].describe().apply("{0:.5f}".format)
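For a single column like the one in the question (the column name 'X' follows the snippet above and the frame below is just a stand-in), a minimal sketch combining both ideas to get fixed-point output with 2 decimal places:
import numpy as np
import pandas as pd

# toy stand-in for the column in the question
df = pd.DataFrame({'X': np.random.randint(0, 86000, size=1_000_000)})

# option 1: change the global display format (affects all float output)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print(df['X'].describe())
pd.reset_option('display.float_format')  # undo the global change

# option 2: format only this result (returns strings, not numbers)
print(df['X'].describe().apply('{0:,.2f}'.format))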

Related

filter dataframe by condition when the condition is for the columns and not on the values

I have a dataframe whose column names are floats, similar to this:
>>>397.55 400.231 404.42 407.12 465.23 478.92 492.3 501.2 505.6 ...
0 0.23 0.122 0.43 0.11 0.345 0.22 0.66 0.34 0.21
1 0.12 0.443 0.76 0.12 0.22 0.24 0.56 0.11 0.04
2 0.45 0.87 0.23 0.99 0.11 0.44 0.78 0.65 0.23
...
I want to filter the dataframe so that it keeps only the columns whose names are between 405.2 and 472.7.
I tried to filter it with a condition on the columns, but it did not work:
df[(df.columns>405.2)]
>>>ValueError: Item wrong length 224 instead of 10783.
224 is the number of columns and 10783 is number of rows.
Is there any way I can filter my dataframe to be between two values when the values are the column name?
Use DataFrame.loc; the first : means select all rows, and the boolean condition selects the columns:
df.loc[:, (df.columns>405.2)]
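To keep only the columns between the two bounds in the question, combine both comparisons. A small sketch against a stand-in frame with float column names:
import pandas as pd

# stand-in with float column names, like the question's data
df = pd.DataFrame([[0.23, 0.11, 0.345, 0.34],
                   [0.12, 0.12, 0.22, 0.11]],
                  columns=[397.55, 407.12, 465.23, 501.2])

# keep every row, and only the columns whose name falls inside the range
subset = df.loc[:, (df.columns > 405.2) & (df.columns < 472.7)]
print(subset)  # columns 407.12 and 465.23 remain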

Python: how can I pass parameters in def to inputs in pandas loc?

I want to pass the parameters in my def to inputs in pandas loc but I am not sure how to do so, as loc requires defined labels as inputs. Or is there any other way I can perform Excel INDEX MATCH equivalent in Python but not using loc? Many thanks!
Below please find my code:
def get_correl_diff_tenor(p1, p2):
    correl = IRCorrMatrix.loc['p1', 'p2']
    return correl
p1 and p2 in loc['p1', 'p2'] refer to the tenor pair used to look up the corresponding correlation value in the matrix below.
IRCorrMatrix, shown below, is a correlation matrix indexed by tenor pairs.
2w 1m 3m 6m 1y
Tenor
2w 1.00 0.73 0.64 0.57 0.44
1m 0.73 1.00 0.78 0.67 0.50
3m 0.64 0.78 1.00 0.85 0.66
6m 0.57 0.67 0.85 1.00 0.81
1y 0.44 0.50 0.66 0.81 1.00
IIUC, remove the quotes from 'p1' and 'p2' so that the function's arguments are passed as variables:
IRCorrMatrix.loc[p1, p2]
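Put together, a minimal sketch of the corrected function (the IRCorrMatrix below is rebuilt from the matrix shown above, with tenors as both index and columns):
import pandas as pd

# stand-in for the correlation matrix in the question
tenors = ['2w', '1m', '3m', '6m', '1y']
IRCorrMatrix = pd.DataFrame(
    [[1.00, 0.73, 0.64, 0.57, 0.44],
     [0.73, 1.00, 0.78, 0.67, 0.50],
     [0.64, 0.78, 1.00, 0.85, 0.66],
     [0.57, 0.67, 0.85, 1.00, 0.81],
     [0.44, 0.50, 0.66, 0.81, 1.00]],
    index=tenors, columns=tenors)

def get_correl_diff_tenor(p1, p2):
    # p1 and p2 are label variables, so they are not quoted
    return IRCorrMatrix.loc[p1, p2]

print(get_correl_diff_tenor('3m', '1y'))  # 0.66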

Create new DF with values representing difference between two dataframes

I am working with two numeric data frames, both with 13803 observations and 13803 variables. Their column and row names are identical, but their entries are different. What I want to do is create a new data frame where I have subtracted the df2 values from the df1 values.
The "formula" would be: df1(entry values) - df2(entry values) = df3 (difference). In other words, the purpose is to find the difference between all entries.
My problem is illustrated here.
DF1
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.71 0.98 0.32
[GENE128] 0.23 0.61 0.90
[GENE271] 0.87 0.95 0.63
DF2
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.70 0.94 0.30
[GENE128] 0.25 0.51 0.80
[GENE271] 0.82 0.92 0.60
NEW DF3
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.01 0.04 0.02
[GENE128] -.02 0.10 0.10
[GENE271] 0.05 0.03 0.03
So, in DF3 the values are the difference between DF1 and DF2 for each entry.
DF1(GENE231) - DF2(GENE231) = DF3(DIFFERENCE-GENE231)
DF1(GENE271) - DF2(GENE271) = DF3(DIFFERENCE-GENE271)
and so on...
Help would be much appreciated!
Kind regards,
Harebell
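In pandas, a minimal sketch of the idea (assuming df1 and df2 are DataFrames that share identical row and column labels, as described): subtraction aligns on both labels, so df1 - df2 already gives the element-wise difference.
import pandas as pd

genes = ['GENE231', 'GENE128', 'GENE271']
cols = ['GENE128', 'GENE271', 'GENE2983']

df1 = pd.DataFrame([[0.71, 0.98, 0.32],
                    [0.23, 0.61, 0.90],
                    [0.87, 0.95, 0.63]], index=genes, columns=cols)
df2 = pd.DataFrame([[0.70, 0.94, 0.30],
                    [0.25, 0.51, 0.80],
                    [0.82, 0.92, 0.60]], index=genes, columns=cols)

# label-aligned, element-wise subtraction
df3 = df1 - df2  # equivalently df1.subtract(df2)
print(df3.round(2))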

Make console-friendly string a useable pandas dataframe python

A quick question as I'm currently changing from R to pandas for some projects:
I get the following printed output from scikit-learn's metrics.classification_report:
             precision    recall  f1-score   support

          0       0.67      0.67      0.67         3
          1       0.50      1.00      0.67         1
          2       1.00      0.80      0.89         5

avg / total       0.83      0.78      0.79         9
I want to use this (and similar ones) as a matrix/dataframe so that I could subset it to extract, say, the precision of class 0.
In R, I'd give the first "column" a name like 'outcome_class' and then subset it:
my_dataframe[my_dataframe$class_outcome == 1, 'precision']
And I could do this in pandas, but the "dataframe" that I want to use is simply a string (see scikit-learn's docs).
How can I make the table output here to a useable dataframe in pandas?
Assign it to a variable, s:
s = classification_report(y_true, y_pred, target_names=target_names)
Or directly:
s = '''
             precision    recall  f1-score   support

    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3

avg / total       0.70      0.60      0.61         5
'''
Use that as the string input for StringIO:
import io  # For Python 2.x use: import StringIO
import pandas as pd

df = pd.read_table(io.StringIO(s), sep=r'\s{2,}', engine='python')  # For Python 2.x use StringIO.StringIO(s)
df
Out:
precision recall f1-score support
class 0 0.5 1.00 0.67 1
class 1 0.0 0.00 0.00 1
class 2 1.0 0.67 0.80 3
avg / total 0.7 0.60 0.61 5
Now you can slice it like an R data.frame:
df.loc['class 2']['f1-score']
Out: 0.80000000000000004
Here, classes are the index of the DataFrame. You can use reset_index() if you want to use it as a regular column:
df = df.reset_index().rename(columns={'index': 'outcome_class'})
df.loc[df['outcome_class']=='class 1', 'support']
Out:
1 1
Name: support, dtype: int64
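As a side note, newer scikit-learn releases (0.20 and later) can skip the string parsing entirely: classification_report accepts output_dict=True, and the nested dict it returns feeds straight into a DataFrame. A small sketch using the same toy labels that produce the report string above:
import pandas as pd
from sklearn.metrics import classification_report

# the same toy labels behind the example report string
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']

# output_dict=True returns a nested dict instead of a formatted string
report = classification_report(y_true, y_pred, target_names=target_names,
                               output_dict=True)
df = pd.DataFrame(report).transpose()

print(df.loc['class 2', 'f1-score'])  # 0.8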

python pandas How to remove outliers from a dataframe and replace with an average value of preceding records

I have a dataframe with 16k records, multiple groups of countries, and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing and manipulation: remove skews or outliers and replace them with a value based on certain rules.
i.e. in the data below, how could I identify the skewed points (any value greater than 1) and replace them with the average of the next two records, or the previous record if there are no later records (within that group)?
So in the dataframe below I would like to replace Bill%4 for IT week1 (1.21) with the average of week2 and week3 for IT, so it becomes 0.81.
Any tricks for this?
Country Week Bill%1 Bill%2 Bill%3 Bill%4 Bill%5 Bill%6
IT week1 0.94 0.88 0.85 1.21 0.77 0.75
IT week2 0.93 0.88 1.25 0.80 0.77 0.72
IT week3 0.94 1.33 0.85 0.82 0.76 0.76
IT week4 1.39 0.89 0.86 0.80 0.80 0.76
FR week1 0.92 0.86 0.82 1.18 0.75 0.73
FR week2 0.91 0.86 1.22 0.78 0.75 0.71
FR week3 0.92 1.29 0.83 0.80 0.75 0.75
FR week4 1.35 0.87 0.84 0.78 0.78 0.74
I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.index = list('abcdeflght')

# Define cutoff value
cutoff = 0.90

for col in df.columns:
    # Identify index labels above the cutoff
    outliers = df[col][df[col] > cutoff]
    # Browse through outliers and average according to index location
    for idx in outliers.index:
        # Get the integer position of this label
        loc = df.index.get_loc(idx)
        if loc < df.shape[0] - 2:
            # Not one of the last two rows: use the mean of the next two values
            df.loc[idx, col] = df[col].iloc[loc + 1:loc + 3].mean()
        else:
            # Otherwise use the mean of the values two and three rows back
            df.loc[idx, col] = df[col].iloc[loc - 3:loc - 1].mean()
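The loop above uses a flat 0.90 cutoff and ignores the Country grouping. A sketch of how the same idea could be applied per country with the question's greater-than-1 rule (the function name and the toy frame below are illustrative, not from the original post):
import numpy as np
import pandas as pd

def smooth_outliers(group, cutoff=1.0):
    # Replace values above the cutoff with the mean of the next two rows
    # in the same group, or the previous row when no later rows exist.
    group = group.copy()
    bill_cols = [c for c in group.columns if c.startswith('Bill')]
    for col in bill_cols:
        col_pos = group.columns.get_loc(col)
        for pos in np.where(group[col].to_numpy() > cutoff)[0]:
            if pos < len(group) - 2:
                replacement = group[col].iloc[pos + 1:pos + 3].mean()
            else:
                replacement = group[col].iloc[pos - 1]
            group.iloc[pos, col_pos] = replacement
    return group

# a small slice of the question's data (IT rows, Bill%4 only), for illustration
df = pd.DataFrame({'Country': ['IT', 'IT', 'IT', 'IT'],
                   'Week': ['week1', 'week2', 'week3', 'week4'],
                   'Bill%4': [1.21, 0.80, 0.82, 0.80]})

cleaned = df.groupby('Country', group_keys=False).apply(smooth_outliers)
print(cleaned)  # the 1.21 becomes (0.80 + 0.82) / 2 = 0.81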
