Difference of two Dataframes is not exact - python

I am trying to draw random rows from dataframe DF1 and store them in a new variable DF2. I then want to take the difference, so that the remaining rows (those in the original dataframe DF1 but not in DF2) end up in a third dataframe. I need to do this without using the sklearn library.
I tried two ways to get the random rows:
Method 1:
DF2 = DF1.sample(n = 1000, random_state = 10)
Method 2:
chosen_idx = np.random.choice(2000, replace = False, size = 1000)
DF2 = DF1.iloc[chosen_idx]
This is how I take their difference to get a dataframe with the remaining rows, say DF3:
DF3 = pd.concat([DF1, DF2]).drop_duplicates(keep=False)
The problem is that len(DF1) - len(DF2) - len(DF3) should be 0, but it is not. I am not sure where I am going wrong. Here is my actual code, with different variable names:
def train_validation_test(set_dataframe):
    if isinstance(set_dataframe, pd.DataFrame):
        df_length = len(set_dataframe.index)
        seventy = math.floor(df_length*0.7)
        seventy = seventy if seventy%2==0 else seventy+1
        remaining = int((df_length - seventy)/2)
        # one = set_dataframe.sample(n = seventy, random_state = 10)
        chosen_idx = np.random.choice(df_length, replace = False, size = seventy)
        one = set_dataframe.iloc[chosen_idx]
        return one
    else:
        return print('Argument passed is not dataframe. Please pass dataframe as argument.')
abc = train_validation_test(task01_df)
xyz = pd.concat([task01_df, abc]).drop_duplicates(keep=False)
print(len(task01_df) - len(abc) - len(xyz))
The result is 7, but it depends on random_state. It is never 0, and the value varies.

You can use train_test_split from sklearn:
# Python env: pip install scikit-learn
# Conda env: conda install scikit-learn
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
DF1 = pd.DataFrame(np.random.randint(1, 100, (2000, 3)), columns=list('ABC'))
DF2, DF3 = train_test_split(DF1, train_size=1000)
Output:
>>> DF1.shape
(2000, 3)
>>> DF2.shape
(1000, 3)
>>> DF3.shape
(1000, 3)
>>> DF2.index.intersection(DF3.index)
Int64Index([], dtype='int64') # no overlaps
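Since the question asks for an approach without sklearn, here is a minimal sketch of one. It assumes DF1 has a unique index (the default RangeIndex is), and the demo data is made up. The idea is to remove the sampled rows by index label with drop, rather than by value: drop_duplicates(keep=False) compares row values, so any rows of DF1 that happen to be identical to each other are removed as well, which would explain a non-zero remainder.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
# Hypothetical data with a small value range, so DF1 contains repeated rows on purpose
DF1 = pd.DataFrame(rng.integers(1, 5, size=(2000, 3)), columns=list("ABC"))

# Sample 1000 rows, then keep every row whose index label was not sampled
DF2 = DF1.sample(n=1000, random_state=10)
DF3 = DF1.drop(DF2.index)

# Value-based difference from the question, for comparison
DF3_by_value = pd.concat([DF1, DF2]).drop_duplicates(keep=False)

print(len(DF1) - len(DF2) - len(DF3))           # index-based remainder
print(len(DF1) - len(DF2) - len(DF3_by_value))  # value-based remainder
```

If the index is not unique, calling DF1.reset_index(drop=True) first should make the same approach applicable.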

Related

Create a function to perform a T-Test in python

I'm using the following code to compute t-test in python
import researchpy as rp
import scipy.stats as stats
summary, results = rp.ttest(group1= dfEnt['HA'][df['Q8_5'] == 0], group1_name= "Nascent",
                            group2= dfEnt['HA'][df['Q8_5'] == 2], group2_name= "Established")
How could I create a function that takes, as arguments, the dataframe and the column on which I want to compute the t-test? For example, I would like to run the t-test with dfEnt['IA'] or dfSel['IA'], ...
Thanks for your help
What version of researchpy are you using? In the newest version this can be done with difference_test(). The code below conducts a t-test on the values in StressReactivity, grouped by the column Exercise, which contains the group categories.
import pandas as pd
import researchpy as rp
import numpy as np
np.random.seed(12345678)
df = pd.DataFrame(np.random.randint(10, size= (100, 2)),
                  columns= ['No', 'Yes'])
df["id"] = range(1, df.shape[0] + 1)
# pandas is imported as pd above, so melt has to be called through pd
df2 = pd.melt(df, id_vars = "id", value_vars = ["No", "Yes"],
              var_name = "Exercise", value_name = "StressReactivity")
rp.difference_test("StressReactivity ~ C(Exercise)",
                   data = df2,
                   equal_variances = True,
                   independent_samples = True)
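To address the "wrap it in a function" part of the question directly, here is a small sketch. It uses scipy.stats.ttest_ind (scipy is already imported in the question) rather than researchpy, and the function name run_ttest, its parameters, and the demo dataframe are all hypothetical. The value column and the grouping column are passed as Series, so they may come from different dataframes as in the question, provided they share an index.

```python
import numpy as np
import pandas as pd
from scipy import stats


def run_ttest(values, groups, level_a, level_b, equal_var=True):
    """Independent two-sample t-test of `values` between two levels of `groups`."""
    sample_a = values[groups == level_a].dropna()
    sample_b = values[groups == level_b].dropna()
    return stats.ttest_ind(sample_a, sample_b, equal_var=equal_var)


# Hypothetical data standing in for dfEnt / dfSel and the Q8_5 grouping column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "HA": rng.normal(size=60),
    "IA": rng.normal(size=60),
    "Q8_5": np.repeat([0, 1, 2], 20),
})

# Same function, different columns
result_ha = run_ttest(demo["HA"], demo["Q8_5"], 0, 2)
result_ia = run_ttest(demo["IA"], demo["Q8_5"], 0, 2)
print(result_ha.statistic, result_ha.pvalue)
print(result_ia.statistic, result_ia.pvalue)
```

The same wrapper shape should work around rp.ttest if the researchpy summary tables are needed; only the last line of the function would change.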

Getting a python error "cannot reindex from a duplicate index" and a warning too "setting with copy warning" in the below code

import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import accuracy_score, classification_report
#kindly download the data FIRST which is required and then update the path accordingly for the variables you have to give the path
# variable1 = pd.read_csv(r"give the path to the data")
variable1 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/TCS.csv")
variable2 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/WIPRO.csv")
variable3 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/HDFC.csv")
variable4 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/ITC.csv")
frames = [variable1,variable2,variable3,variable4]
all = pd.concat(frames)
print(all)
price_data = all [['Symbol','Date','Close','High','Low','Open','Volume']]
First, for average investors, the return of an asset is a complete and scale-free summary of the investment opportunity. Second, return series are easier to handle than price series because they have more attractive statistical properties.
# sort the values by symbol and then date
price_data.sort_values(by = ['Symbol','Date'], inplace = True)
# calculate the change in price
price_data['change_in_price'] = price_data['Close'].diff()
# identify rows where the symbol changes
mask = price_data['Symbol'] != price_data['Symbol'].shift(1)
# For those rows, let's make the value null
price_data['change_in_price'] = np.where(mask == True, np.nan, price_data['change_in_price'])
# print the rows that have a null value, should only be 5
price_data[price_data.isna().any(axis = 1)]
days_out = 30
# Group by symbol, then apply the rolling function and grab the Min and Max.
price_data_smoothed = price_data.groupby(['Symbol'])[['Close','Low','High','Open','Volume']].transform(lambda x: x.ewm(span = days_out).mean())
# Join the smoothed columns with the symbol and datetime column from the old data frame.
smoothed_df = pd.concat([price_data[['Symbol','Date']], price_data_smoothed], axis=1, sort=False)
smoothed_df
days_out = 30
# create a new column that will house the flag, and for each group calculate the diff compared to 30 days ago. Then use Numpy to define the sign.
smoothed_df['Signal_Flag'] = smoothed_df.groupby('Symbol')['Close'].transform(lambda x : np.sign(x.diff(days_out)))
# print the first 50 rows
smoothed_df.head(50)
Up to here it works, but when I execute the code below, it throws the error "cannot reindex from a duplicate axis":
n = 14
# First make a copy of the data frame twice
up_df, down_df = price_data[['Symbol','change_in_price']].copy(), price_data[['Symbol','change_in_price']].copy()
# For up days, if the change is less than 0 set to 0.
up_df.loc['change_in_price'] = up_df.loc[(up_df['change_in_price'] < 0), 'change_in_price'] = 0
# For down days, if the change is greater than 0 set to 0.
down_df.loc['change_in_price'] = down_df.loc[(down_df['change_in_price'] > 0), 'change_in_price'] = 0
# We need change in price to be absolute.
down_df['change_in_price'] = down_df['change_in_price'].abs()
# Calculate the EWMA (Exponential Weighted Moving Average), meaning older values are given less weight compared to newer values.
ewma_up = up_df.groupby('Symbol')['change_in_price'].transform(lambda x: x.ewm(span = n).mean())
ewma_down = down_df.groupby('Symbol')['change_in_price'].transform(lambda x: x.ewm(span = n).mean())
# Calculate the Relative Strength
relative_strength = ewma_up / ewma_down
# Calculate the Relative Strength Index
relative_strength_index = 100.0 - (100.0 / (1.0 + relative_strength))
# Add the info to the data frame.
price_data['down_days'] = down_df['change_in_price']
price_data['up_days'] = up_df['change_in_price']
price_data['RSI'] = relative_strength_index
# Display the head.
price_data.head(30)
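A hedged reading of the two messages, not verified against the CSV files in the question: pd.concat(frames) keeps each file's own 0..n index, so the combined frame has repeated index labels, and taking a column subset without .copy() is the usual trigger for SettingWithCopyWarning. In addition, the chained assignment up_df.loc['change_in_price'] = ... = 0 appears to add a row labelled 'change_in_price', after which the indexes no longer match and pandas has to reindex over duplicate labels. The sketch below uses made-up prices and shows one way to avoid all three: ignore_index=True, an explicit copy, and clip instead of masked assignment.

```python
import pandas as pd

# Hypothetical stand-ins for two of the CSV files
a = pd.DataFrame({"Symbol": "TCS", "Close": [10.0, 11.0, 12.0]})
b = pd.DataFrame({"Symbol": "WIPRO", "Close": [20.0, 19.0, 21.0]})

stacked = pd.concat([a, b])                   # index labels 0, 1, 2, 0, 1, 2
clean = pd.concat([a, b], ignore_index=True)  # index labels 0..5

# Explicit copy, so later column assignments do not act on a slice
price_data = clean[["Symbol", "Close"]].copy()

# Per-symbol price change; the first row of each symbol is NaN automatically
price_data["change_in_price"] = price_data.groupby("Symbol")["Close"].diff()

# Up days keep positive changes, down days keep the size of negative changes
price_data["up_days"] = price_data["change_in_price"].clip(lower=0)
price_data["down_days"] = price_data["change_in_price"].clip(upper=0).abs()
print(price_data)
```

The EWMA and RSI steps from the question can then be applied to up_days and down_days unchanged.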

Plot distribution of differences between two pandas dataframe columns

I have a pandas dataframe with columns A and B.
I just want to plot a distribution graph of the percentage difference between columns A and B.
A B
1 1.051990e+10 1.051990e+04
2 1.051990e+10 1.051990e+04
5 4.841800e+10 1.200000e+10
8 2.327700e+10 2.716000e+10
9 1.204900e+10 2.100000e+08
The distribution graph should show how many records have a 10% difference, how many have a 20% difference, and so on.
I tried the following:
def percCal(x,y):
    return (x-y)*100/x
df['perc'] = df.apply(lambda x: percCal(df['A'], df['B']), axis=1)
This is not working. I'm a newbie, so please help.
You don't need the lambda operation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(1, 10, (20, 2)), columns=['A', 'B'])
def percCal(x,y):
    return (x-y)*100/x
Alternatively, just manipulate the columns directly:
df1['diff'] = (df1['A'] - df1['B']) * 100 / df1['A']
Apply the function and plot:
df1['diff'] = percCal(df1['A'], df1['B'])
df1['diff'].plot(kind='density')
df['perc'] = (df['A'] - df['B']) *100/df['A']
def percCal(x,y):
    return (x-y)*100/x
df['perc'] = df.apply(lambda x: percCal(x['A'], x['B']), axis=1)
Inside the lambda, change df to x. With x you pass percCal the values of the current row, whereas with df you pass the whole dataframe, so the function returns a dataframe rather than a single value. Also check your data: if x in the function can be 0, the division is a problem.
I think this is what you are looking for:
import pandas as pd
# Dummy df
data = [
[1.051990e+10, 1.051990e+04],
[1.051990e+10, 1.051990e+04],
[4.841800e+10, 1.200000e+10],
[2.327700e+10, 2.716000e+10],
[1.204900e+10, 2.100000e+08],
]
cols = ['A', 'B']
df2 = pd.DataFrame(data, columns=cols)
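# Solution
import seaborn as sns
df2['pct_diff'] = (df2['A'] - df2['B']) / df2['A']
# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(df2['pct_diff'], kde=True);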
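The question also asks how many records fall into each 10% band. A minimal sketch of that count, using the five rows from the question, is below. Taking the absolute value and capping the bands at 100% are assumptions on my part; differences above 100% would fall outside these bins and need wider edges.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1.051990e10, 1.051990e10, 4.841800e10, 2.327700e10, 1.204900e10],
        "B": [1.051990e04, 1.051990e04, 1.200000e10, 2.716000e10, 2.100000e08],
    }
)

# Absolute percentage difference relative to column A (assumes A is never 0)
pct_diff = ((df["A"] - df["B"]) / df["A"]).abs() * 100

# Ten-point bands: 0-10, 10-20, ..., 90-100
edges = np.arange(0, 101, 10)
counts, _ = np.histogram(pct_diff, bins=edges)
labels = [f"{lo}-{hi}%" for lo, hi in zip(edges[:-1], edges[1:])]
bucket_counts = pd.Series(counts, index=labels)
print(bucket_counts)
```

With matplotlib installed, bucket_counts.plot(kind="bar") should draw the same counts as a bar chart.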

Statsmodels OLS with rolling window problem

I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided any sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    # start date chosen to match the output shown below
    date = pd.to_datetime("1st of Dec, 2018")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index = dates)
    return(df)
df = sample(rSeed = 123, colNames = ['X1', 'X2'], periodLength = 50)
print(df)
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least represents the structure of your output too, meaning that you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for some other examples using the same function online and in the statsmodels docs, but I was unable to find specific examples that actually worked. What I did find were a few discussions talking about how this functionality was deprecated a while ago. So then I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    and runs a series of regressions with a specified window length, and
    returns a dataframe with beta or R^2 for each window split of the data.

    Parameters:
    ===========
    df: pandas dataframe
    subset: integer - has to be smaller than the size of the df
    dependent: string that specifies name of dependent variable
    independent: LIST of strings that specifies name of independent variables
    const: boolean - whether or not to include a constant term
    win: integer - window length of each model
    parameters: string that specifies which model parameters to return:
                'beta' or 'R2'

    Example:
    ========
    RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
                   const = True, parameters = 'beta', win = 30)
    """
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df
    # Loopinfo
    end = df.shape[0]
    win = win
    rng = np.arange(start = win, stop = end, step = 1)
    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1
    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        #print(frames[frame])
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
        df_results = pd.concat([df_results, df_temp], axis = 0)
    return(df_results)
df_rolling = RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'], const = True, parameters = 'beta',
                            win = 30)
Output: A dataframe with beta estimates for OLS of X1 on X2 for each 30 period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304

Prophet Python ValueError: Regressor missing from dataframe

I am trying to use the latest (2nd) 0.3 version of the Prophet package for Python.
My model should include an exogenous regressor, but I receive a ValueError stating that the regressor, which does exist, is missing from the dataframe. Is this a bug, or what am I doing wrong?
#Random Dataset Preparation
import random
random.seed(a=1)
df = pandas.DataFrame(data = None, columns = ['ds', 'y', 'ex'], index = range(50))
datelist = pandas.date_range(pandas.datetime.today(), periods = 50).tolist()
y = numpy.random.normal(0, 1, 50)
ex = numpy.random.normal(0, 2, 50)
df['ds'] = datelist
df['y'] = y
df['ex'] = ex
#Model
prophet_model = Prophet(seasonality_prior_scale = 0.1)
Prophet.add_regressor(prophet_model, 'ex')
prophet_model.fit(df)
prophet_forecast_step = prophet_model.make_future_dataframe(periods=1)
#Result-df
prophet_x_df = pandas.DataFrame(data=None, columns=['Date_x', 'Res'], index = range(int(len(y))))
#Error
prophet_x_df.iloc[0,1] = prophet_model.predict(prophet_forecast_step).iloc[0,0]
You first need to create a column with the regressor values, and it must be present in both the fitting and the prediction dataframes.
Refer to the Prophet docs.
make_future_dataframe generates a dataframe with only the ds column.
You need to add the 'ex' column to the prophet_forecast_step dataframe in order to use it as a regressor.
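A pandas-only sketch of that last step is below. To stay runnable without Prophet installed, the future frame is built by hand as a stand-in for the output of make_future_dataframe(periods=1): the historical dates plus one extra day in a single ds column. The fixed start date is made up, and carrying the last observed regressor value forward is purely an assumption; in practice the future value of 'ex' has to come from somewhere real.

```python
import numpy as np
import pandas as pd

# Hypothetical history in the same shape as the question's df
rng = np.random.default_rng(1)
history = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=50, freq="D"),
    "y": rng.normal(0, 1, 50),
    "ex": rng.normal(0, 2, 50),
})

# Stand-in for prophet_model.make_future_dataframe(periods=1): 'ds' column only
future = pd.DataFrame({"ds": pd.date_range("2018-01-01", periods=51, freq="D")})

# Attach the known regressor values by date
future = future.merge(history[["ds", "ex"]], on="ds", how="left")

# ASSUMPTION: the unknown future value is taken to be the last observed one
future["ex"] = future["ex"].ffill()

print(future.tail(3))
```

With a fitted model, prophet_model.predict(future) should then find the 'ex' column it is looking for.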
