Plot distribution of differences between two pandas dataframe columns - python

I have a pandas dataframe with columns A and B.
I want to plot the distribution of the percentage difference between columns A and B.
              A             B
1  1.051990e+10  1.051990e+04
2  1.051990e+10  1.051990e+04
5  4.841800e+10  1.200000e+10
8  2.327700e+10  2.716000e+10
9  1.204900e+10  2.100000e+08
The distribution graph should show how many records have a 10% difference, how many have a 20% difference, and so on.
I tried the following:
def percCal(x,y):
    return (x-y)*100/x
df['perc'] = df.apply(lambda x: percCal(df['A'], df['B']), axis=1)
This does not work. I'm new to pandas, so any help is appreciated.

You don't need the lambda operation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(1, 10, (20, 2)), columns=['A', 'B'])
def percCal(x,y):
    return (x-y)*100/x
Alternatively, just manipulate the columns directly:
df1['diff'] = (df1['A'] - df1['B']) * 100 / df1['A']
Apply the function and plot:
df1['diff'] = percCal(df1['A'], df1['B'])
df1['diff'].plot(kind='density')

df['perc'] = (df['A'] - df['B']) *100/df['A']

def percCal(x,y):
    return (x-y)*100/x
df['perc'] = df.apply(lambda x: percCal(x['A'], x['B']), axis=1)
In the lambda, change df to x. With axis=1, apply passes each row to the lambda as x, so percCal(x['A'], x['B']) receives the values from that single row and returns one value. When you pass df['A'] and df['B'] instead, you hand percCal the entire columns on every call, so it returns a whole Series rather than a single value. Also check your data: if x in percCal can be 0, the division is a problem.
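As a rough sketch of both points above (the zero denominator and the "how many records have a 10% difference" count from the question), something like the following might work. The sample values are copied from the question; treating A == 0 as missing and bucketing the absolute difference in 10-point steps are my own assumptions, not something the asker specified.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({
    "A": [1.051990e+10, 1.051990e+10, 4.841800e+10, 2.327700e+10, 1.204900e+10],
    "B": [1.051990e+04, 1.051990e+04, 1.200000e+10, 2.716000e+10, 2.100000e+08],
})

# Vectorized percentage difference. Replacing 0 with NaN in the denominator
# (an assumption about how zeros should be handled) gives NaN instead of inf.
df["perc"] = (df["A"] - df["B"]) * 100 / df["A"].replace(0, np.nan)

# Count records per 10-point bucket of the absolute percentage difference.
# Differences above 100% fall outside these edges and are not counted.
edges = np.arange(0, 101, 10)  # 0, 10, ..., 100
abs_perc = df["perc"].abs().dropna()
counts, _ = np.histogram(abs_perc, bins=edges)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:3d}-{hi:3d}%: {n}")
```

The counts array can then be passed to a bar plot, or df["perc"] can be plotted directly with a histogram as in the other answers.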

I think this is what you are looking for:
import pandas as pd
# Dummy df
data = [
[1.051990e+10, 1.051990e+04],
[1.051990e+10, 1.051990e+04],
[4.841800e+10, 1.200000e+10],
[2.327700e+10, 2.716000e+10],
[1.204900e+10, 2.100000e+08],
]
cols = ['A', 'B']
df2 = pd.DataFrame(data, columns=cols)
# Solution
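import seaborn as sns
df2['pct_diff'] = (df2['A'] - df2['B']) / df2['A']
# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(df2['pct_diff'], kde=True)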

Related

Difference of two Dataframes is not exact

I am trying to draw a random sample of rows from dataframe DF1 and store it in a new variable DF2. I then want the difference, i.e. the remaining rows of the original dataframe DF1 that were not sampled. I need to do this without using the sklearn library.
I tried two ways of drawing the random sample:
Method 1:
DF2 = DF1.sample(n = 1000, random_state = 10)
Method 2:
chosen_idx = np.random.choice(2000, replace = False, size = 1000)
DF2 = DF1.iloc[chosen_idx]
This is how I take the difference to get a dataframe with the remaining rows, say DF3:
DF3 = pd.concat([DF1, DF2]).drop_duplicates(keep=False)
The problem is that len(DF1) - len(DF2) - len(DF3) should be 0, but it is not. I am not sure where I went wrong. Here is my actual code, with different variable names:
def train_validation_test(set_dataframe):
    if isinstance(set_dataframe, pd.DataFrame):
        df_length = len(set_dataframe.index)
        seventy = math.floor(df_length*0.7)
        seventy = seventy if seventy%2==0 else seventy+1
        remaining = int((df_length - seventy)/2)
        # one = set_dataframe.sample(n = seventy, random_state = 10)
        chosen_idx = np.random.choice(df_length, replace = False, size = seventy)
        one = set_dataframe.iloc[chosen_idx]
        return one
    else:
        return print('Argument passed is not dataframe. Please pass dataframe as argument.')
abc = train_validation_test(task01_df)
xyz = pd.concat([task01_df, abc]).drop_duplicates(keep=False)
print(len(task01_df) - len(abc) - len(xyz))
The result is 7, but it depends on random_state. It varies and is never 0.
You can use train_test_split from sklearn:
# Python env: pip install scikit-learn
# Conda env: conda install scikit-learn
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
DF1 = pd.DataFrame(np.random.randint(1, 100, (2000, 3)), columns=list('ABC'))
DF2, DF3 = train_test_split(DF1, train_size=1000)
Output:
>>> DF1.shape
(2000, 3)
>>> DF2.shape
(1000, 3)
>>> DF3.shape
(1000, 3)
>>> DF2.index.intersection(DF3.index)
Int64Index([], dtype='int64') # no overlaps
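Since the question asks for a way without sklearn, here is a minimal sketch that uses only pandas. It assumes DF1 has a unique index. A likely reason the concat/drop_duplicates approach misses is that drop_duplicates(keep=False) also removes rows that are duplicated within DF1 itself, so dropping by index label is safer. The tiny value range below is chosen on purpose to force duplicate rows; it is illustrative data, not the asker's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
# Small value range so that duplicate rows are certain to occur.
DF1 = pd.DataFrame(rng.integers(0, 3, size=(2000, 2)), columns=["A", "B"])

DF2 = DF1.sample(n=1000, random_state=10)

# Complement by index label: keeps every row that was not sampled,
# regardless of whether its values are repeated elsewhere.
DF3 = DF1.drop(DF2.index)
print(len(DF1) - len(DF2) - len(DF3))  # 0

# The approach from the question compares row values, so rows that are
# duplicated inside DF1 disappear as well and the lengths no longer add up.
naive = pd.concat([DF1, DF2]).drop_duplicates(keep=False)
print(len(naive))
```

The same DF1.drop(one.index) call should work with the iloc-based sampling from the question, because iloc preserves the original index labels.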

Create a function to perform a T-Test in python

I'm using the following code to compute t-test in python
import researchpy as rp
import scipy.stats as stats
summary, results = rp.ttest(group1= dfEnt['HA'][df['Q8_5'] == 0], group1_name= "Nascent",
                            group2= dfEnt['HA'][df['Q8_5'] == 2], group2_name= "Established")
How could I create a function that takes as arguments the dataframe and the column on which I want to compute the t-test? For example, I would like to run the t-test with dfEnt['IA'] or dfSel['IA'], ...
Thanks for your help
What version of researchpy are you using? In the newest version this can be done using difference_test(). The code below conducts a t-test with the column Exercise, which contains the group categories, using the values from StressReactivity.
import pandas as pd
import researchpy as rp
import numpy as np
np.random.seed(12345678)
df = pd.DataFrame(np.random.randint(10, size= (100, 2)),
                  columns= ['No', 'Yes'])
df["id"] = range(1, df.shape[0] + 1)
# pandas is imported as pd above, so melt must be called through pd
df2 = pd.melt(df, id_vars = "id", value_vars = ["No", "Yes"],
              var_name = "Exercise", value_name = "StressReactivity")
rp.difference_test("StressReactivity ~ C(Exercise)",
                   data = df2,
                   equal_variances = True,
                   independent_samples = True)
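For the function the question actually asks for, a minimal sketch could look like the one below. Note that it swaps researchpy for scipy.stats.ttest_ind, which the question already imports. The function name ttest_by_group and its parameters are hypothetical, the data is made up, and it assumes the grouping column Q8_5 lives in the same dataframe as the value column (the question filters dfEnt with a column from df).

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_by_group(frame, value_col, group_col, group_a, group_b):
    """Independent two-sample t-test of value_col between two groups.

    Hypothetical helper: frame is any dataframe that holds both columns.
    """
    a = frame.loc[frame[group_col] == group_a, value_col].dropna()
    b = frame.loc[frame[group_col] == group_b, value_col].dropna()
    return stats.ttest_ind(a, b, equal_var=True)


# Made-up data shaped like the dataframes in the question.
rng = np.random.default_rng(0)
dfEnt = pd.DataFrame({
    "Q8_5": rng.choice([0, 2], size=60),
    "HA": rng.normal(size=60),
    "IA": rng.normal(size=60),
})

# The same function can be reused for any dataframe and column.
for col in ["HA", "IA"]:
    res = ttest_by_group(dfEnt, col, "Q8_5", 0, 2)
    print(col, res.statistic, res.pvalue)
```

Calling ttest_by_group(dfSel, "IA", "Q8_5", 0, 2) would run the same test on another dataframe, provided it has those columns.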

Pandas subsetting returning different results to numpy

I am trying to subset a pandas dataframe using two conditions. However, I am not getting the same results as when done with numpy. What am I doing wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(20,120,101)
y = np.linspace(-45,25,101)
xs,ys = np.meshgrid(x,y)
idx = (xs >=100) & (ys >= 0)
plt.scatter(xs,ys,s=2,c='b')
plt.scatter(xs[idx],ys[idx],s=2,c='r')
I need to remove the red block from my dataset, which I can do with numpy by using:
plt.scatter(xs[~idx],ys[~idx],s=2,c='b')
How do I replicate this with a pandas dataframe?
I've tried using the same logic as I used above:
data = {'x':x,'y':y}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
I've also tried using loc:
df.loc[(df.x >=100) & (df.y >= 0),['x','y']] = np.nan
Both of these methods give the following result:
How do I replicate the results from numpy?
Many thanks.
You don't obtain the same result because you didn't create all the pairs of coordinates before passing them to pandas. Here is a quick solution:
data = {'x':xs.flatten(),'y':ys.flatten()}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
plt.scatter(df2.x,df2.y,s=2,c='b')
flatten() reshapes your arrays to one dimension, so that they can be used to construct a dataframe containing every pair of coordinates rather than lists.
Output:
Edit: the same result, starting from a dataframe that contains x and y.
Split the dataframe into chunks:
data_x = np.linspace(20,120,101)
data_y = np.linspace(-45,25,101)
dataframe = pd.DataFrame({'x':data_x,'y':data_y})
chunk_size = 25
dfs = [dataframe[i:i+chunk_size] for i in range(0,dataframe.shape[0],chunk_size)]
Define a function that yields the points you are interested in. Two loops are needed because every combination of x and y chunks has to be covered:
def generatorPoints(dfs):
    for i in range(len(dfs)):
        x = dfs[i].x
        for j in range(len(dfs)):
            y = dfs[j].y
            xs, ys = np.meshgrid(x,y)
            idx = (xs >=100) & (ys >= 0)
            yield xs[~idx], ys[~idx]
x, y = [], []
for xs, ys in generatorPoints(dfs):
    x.extend(xs), y.extend(ys)
plt.scatter(x,y,s=2,c='b')
This gives the same result as the previous code. There is certainly room for optimization, but this is a start for your request :).
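To make the difference between the two constructions concrete, here is a small sketch that compares row counts. It reuses the x and y arrays from the question; the names df_pairs, df_grid and kept are my own.

```python
import numpy as np
import pandas as pd

x = np.linspace(20, 120, 101)
y = np.linspace(-45, 25, 101)

# Pairing x[i] with y[i] gives only the 101 points on the diagonal.
df_pairs = pd.DataFrame({"x": x, "y": y})

# meshgrid + flatten gives every (x, y) combination: 101 * 101 rows.
xs, ys = np.meshgrid(x, y)
df_grid = pd.DataFrame({"x": xs.flatten(), "y": ys.flatten()})

mask = (df_grid.x >= 100) & (df_grid.y >= 0)
kept = df_grid[~mask]
print(len(df_pairs), len(df_grid), len(kept))

# The pandas selection keeps as many points as the NumPy boolean
# indexing from the question.
idx = (xs >= 100) & (ys >= 0)
print(len(kept) == xs[~idx].size)
```

The dataframe in the question only ever contained the diagonal, which is why its mask could not reproduce the rectangular block removed in the NumPy version.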

python scipy spearman correlations

I am trying to obtain the column names from the dataframe (df) and associate them to the resulting array produced by the spearmanr correlation function. I need to associate both the column names (a-j) back to the correlation value (spearman) and the p-values (spearman_pvalue). Is there an intuitive way to perform this task?
from scipy.stats import pearsonr,spearmanr
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.randint(0,100,size= (100,10)),columns=list('abcdefghij'))
def binary(row):
    if row>=50:
        return 1
    else:
        return 0
df['target']=df.a.apply(binary)
spearman,spearman_pvalue=spearmanr(df.drop(['target'],axis=1),df.target)
print(spearman)
print(spearman_pvalue)
It seems you need:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
df=pd.DataFrame(np.random.randint(0,100,size= (100,10)),columns=list('abcdefghij'))
#print (df)
#faster for binary df
df['target'] = (df['a'] >= 50).astype(int)
#print (df)
spearman,spearman_pvalue=spearmanr(df.drop(['target'],axis=1),df.target)
df1 = pd.DataFrame(spearman.reshape(-1, 11), columns=df.columns)
#print (df1)
df2 = pd.DataFrame(spearman_pvalue.reshape(-1, 11), columns=df.columns)
#print (df2)
### Kyle, we can assign the index back to the column names for the total matrix:
df2=df2.set_index(df.columns)
df1=df1.set_index(df.columns)
Or:
df1 = pd.DataFrame(spearman.reshape(-1, 11),
                   columns=df.columns,
                   index=df.columns)
df2 = pd.DataFrame(spearman_pvalue.reshape(-1, 11),
                   columns=df.columns,
                   index=df.columns)
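If only the correlation of each column with target is of interest, the labelled matrices can be reduced to one row per feature. This is a sketch under the assumption that passing the whole frame (target included) to spearmanr is acceptable; the names rho_df, pval_df and summary are my own, and a seeded generator replaces np.random.randint so the example is repeatable.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 10)), columns=list("abcdefghij"))
df["target"] = (df["a"] >= 50).astype(int)

# With a 2-D input, spearmanr treats each column as a variable and returns
# an 11 x 11 matrix of coefficients and a matching matrix of p-values.
rho, pval = spearmanr(df.to_numpy())

rho_df = pd.DataFrame(rho, index=df.columns, columns=df.columns)
pval_df = pd.DataFrame(pval, index=df.columns, columns=df.columns)

# One row per feature: its correlation with target and the p-value.
summary = pd.DataFrame({
    "spearman": rho_df["target"],
    "spearman_pvalue": pval_df["target"],
}).drop(index="target")
print(summary)
```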

Visually separating bar chart clusters in pandas

This is more of a hack that almost works.
#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
# Create original dataframe
df = DataFrame(np.random.rand(5,4), index=['art','mcf','mesa','perl','gcc'],
               columns=['pol1','pol2','pol3','pol4'])
# Estimate average
average = df.mean()
average.name = 'average'
# Append dummy row with zeros and then average
row = DataFrame([dict({p:0.0 for p in df.columns}), ])
df = df.append(row)
df = df.append(average)
print(df)
df.plot(kind='bar')
plt.show()
and gives:
             pol1      pol2      pol3      pol4
art      0.247309  0.139797  0.673009  0.265708
mcf      0.951582  0.319486  0.447658  0.259821
mesa     0.888686  0.177007  0.845190  0.946728
perl     0.902977  0.863369  0.194451  0.698102
gcc      0.836407  0.700306  0.739659  0.265613
0        0.000000  0.000000  0.000000  0.000000
average  0.765392  0.439993  0.579993  0.487194
and a bar chart (image not preserved) that gives the visual separation between the benchmarks and the average.
Is there a way to get rid of the 0 on the x-axis?
It turns out that DataFrame does not allow me to have multiple dummy rows this way.
My solution was to change
row = pd.DataFrame([dict({p:0.0 for p in df.columns}), ])
into
row = pd.Series({p: 0.0 for p in df.columns})
row.name = ""
Series can be named with empty string.
Still pretty hacky, but it works:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create original dataframe
df = pd.DataFrame(np.random.rand(5,4), index=['art','mcf','mesa','perl','gcc'],
                  columns=['pol1','pol2','pol3','pol4'])
# Estimate average
average = df.mean()
average.name = 'average'
# Append dummy row with zeros and then average
row = pd.DataFrame([dict({p:0.0 for p in df.columns}), ])
df = df.append(row)
df = df.reindex(np.where(df.index, df.index, ''))
df = df.append(average)
print(df)
df.plot(kind='bar')
plt.show()
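DataFrame.append was removed in pandas 2.0, so the snippets above only run on older versions. A rough equivalent with pd.concat might look like this; the spacer row labelled with an empty string is the same trick as above, and the names spacer and combined are my own. The Agg backend is selected only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Create original dataframe
df = pd.DataFrame(np.random.rand(5, 4),
                  index=["art", "mcf", "mesa", "perl", "gcc"],
                  columns=["pol1", "pol2", "pol3", "pol4"])

# Spacer row of zeros labelled with an empty string, then the column means.
spacer = pd.DataFrame(0.0, index=[""], columns=df.columns)
average = df.mean().rename("average").to_frame().T

combined = pd.concat([df, spacer, average])
print(combined)

ax = combined.plot(kind="bar")
# plt.show()  # uncomment in an interactive session
```

Because the spacer is built with its label already set, no reindex step is needed to blank out the 0 on the x-axis.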
