Calculate 2.5% below and 2.5% above the mean in Python - python

How do I print the dataframe, where the population is within 5% of the mean? (2.5% below and 2.5% above)
Here is what I've tried:
mean = df['population'].mean()
minimum = mean - (0.025*mean)
maximum = mean + (0.025*mean)
df[df.population < maximum]

Use:
df.loc[(df['population'] > minimum) & (df['population'] < maximum)]

import pandas as pd
df = pd.read_csv("fileName.csv")
#suppose this dataFrame contains the population in the int format
mean = df['population'].mean()
minimum = mean - (0.025*mean)
maximum = mean + (0.025*mean)
ans = df.loc[(df['population']>minimum) & (df['population'] <maximum)]
ans
you can use this

I built this dataframe for testing.
import numpy as np
import pandas as pd
random_data = np.random.randint(1_000_000, 100_000_000, 200)
random_df = pd.DataFrame(random_data, columns=['population'])
random_df
Here's the answer to specifically what you were asking for.
pop = random_df.population
top_boundary = pop.mean() + pop.mean() * 0.025
low_boundary = pop.mean() - pop.mean() * 0.025
criteria_boundary_limits = random_df.population.between(low_boundary, top_boundary)
criteria_boundary_df = random_df.loc[criteria_boundary_limits]
criteria_boundary_df
But, maybe, another answer could be had by using quantiles. I used 40 quantiles because 1/40 = 0.025.
groups_list = list(range(1,41))
random_df['groups'] = pd.qcut(random_df['population'], 40, labels = groups_list)
criteria_groups_limits = random_df.groups.between(20,21)
criteria_groups_df = random_df.loc[criteria_groups_limits]
criteria_groups_df

Related

Python's `.loc` is really slow on selecting subsets of Data

I'm having a large multindexed (y,t) single valued DataFrame df. Currently, I'm selecting a subset via df.loc[(Y,T), :] and create a dictionary out of it. The following MWE works, but the selection is very slow for large subsets.
import numpy as np
import pandas as pd
# Full DataFrame
y_max = 50
Y_max = range(1, y_max+1)
t_max = 100
T_max = range(1, t_max+1)
idx_max = tuple((y,t) for y in Y_max for t in T_max)
df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])
# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN+1)
t1 = 5
tN = 9
T = range(t1, tN+1)
idx_sub = tuple((y,t) for y in Y for t in T)
data_sub = df.loc[(Y,T), :] #This is really slow
dict_sub = dict(zip(idx_sub, data_sub['Value']))
# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']
I was thinking of using df.loc[(y1,t1),(yN,tN), :], but it does not work properly, as the second index is only bounded in the final year yN.
One idea is use Index.isin with itertools.product in boolean indexing:
from itertools import product
idx_sub = tuple(product(Y, T))
dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print (dict_sub)

Dataframe with Monte Carlo Simulation calculation next row Problem

I want to build up a Dataframe from scratch with calculations based on the Value before named Barrier option. I know that i can use a Monte Carlo simulation to solve it but it just wont work the way i want it to.
The formula is:
Value in row before * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
The first code I write just calculates the first column. I know that I need a second loop but can't really manage it.
The result should be, that for each simulation it will calculate a new value using the the value before, for 500 Day meaning S_1 should be S_500 with a total of 1000 simulations. (I need to generate new columns based on the value before using the formular.)
similar to this:
So for the 1. Simulations 500 days, 2. Simulation 500 day and so on...
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
simulation = 0
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 500
df = pd.DataFrame()
for i in range (0,TradingDays):
z = norm.ppf(rd.random())
simulation = simulation + 1
S_1 = S_0*np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
df = df.append ({
'S_1':S_1,
'S_0':S_0
}, ignore_index=True)
df = df.round ({'Z':6,
'S_T':2
})
df.index += 1
df.index.name = 'Simulation'
print(df)
I found another possible code which i found here and it does solve the problem but just for one row, the next row is just the same calculation. Generate a Dataframe that follow a mathematical function for each column / row
If i just replace it with my formular i get the same problem.
replacing:
exp(r - q * sqrt(sigma))*T+ (np.random.randn(nrows) * sqrt(deltaT)))
with:
exp((r-sigma**2/2)*T/nrows+sigma*np.sqrt(T/nrows)*z))
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 50
Simulation = 100
df = pd.DataFrame({'s0': [S_0] * Simulation})
for i in range(1, TradingDays):
z = norm.ppf(rd.random())
df[f's{i}'] = df.iloc[:, -1] * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
print(df)
I would work more likely with the last code and solve the problem with it.
How about just overwriting the value of S_0 by the new value of S_1 while you loop and keeping all simulations in a list?
Like this:
import numpy as np
import pandas as pd
import random
from scipy.stats import norm
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
trading_days = 50
output = []
for i in range(trading_days):
z = norm.ppf(random.random())
value = S_0*np.exp((r - sigma**2 / 2) * T / trading_days + sigma * np.sqrt(T/trading_days) * z)
output.append(value)
S_0 = value
df = pd.DataFrame({'simulation': output})
Perhaps I'm missing something, but I don't see the need for a second loop.
Also, this eliminates calling df.append() in a loop, which should be avoided. (See here)
Solution based on the the answer of bartaelterman, thank you very much!
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
#Dividing the list in chunks to later append it to the dataframe in the right order
def chunk_list(lst, chunk_size):
for i in range(0, len(lst), chunk_size):
yield lst[i:i + chunk_size]
def blackscholes():
d1 = ((math.log(S_0/K)+(r+sigma**2/2)*T)/(sigma*np.sqrt(2)))
d2 = ((math.log(S_0/K)+(r-sigma**2/2)*T)/(sigma*np.sqrt(2)))
preis_call_option = S_0*norm.cdf(d1)-K*np.exp(-r*T)*norm.cdf(d2)
return preis_call_option
K = 40
S_0 = 42
T = 2
r = 0.02
sigma = 0.2
U = 38
simulation = 10000
trading_days = 500
trading_days = trading_days -1
#creating 2 lists for the first and second loop
loop_simulation = []
loop_trading_days = []
#first loop calculates the first column in a list
for j in range (0,simulation):
print("Progressbar_1_2 {:2.2%}".format(j / simulation), end="\n\r")
S_Tag_new = 0
NORM_S_INV = norm.ppf(rd.random())
S_Tag = S_0*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
S_Tag_new = S_Tag
loop_simulation.append(S_Tag)
#second loop calculates the the rows for the columns in a list
for i in range (0,trading_days):
NORM_S_INV = norm.ppf(rd.random())
S_Tag = S_Tag_new*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
loop_trading_days.append(S_Tag)
S_Tag_new = S_Tag
#values from the second loop will be divided in number of Trading days per Simulation
loop_trading_days_chunked = list(chunk_list(loop_trading_days,trading_days))
#First dataframe with just the first results from the firstloop for each simulation
df1 = pd.DataFrame({'S_Tag 1': loop_simulation})
#Appending the the chunked list from the second loop to a second dataframe
df2 = pd.DataFrame(loop_trading_days_chunked)
#Merging both dataframe into one
df3 = pd.concat([df1, df2], axis=1)

Getting a python error "cannot reindex from a duplicate index" and a warning too "setting with copy warning" in the below code

import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import accuracy_score, classification_report
#kindly download the data FIRST which is required and then update the path accordingly for the variables you have to give the path
# variable1 = pd.read_csv(r"give the path to the data")
variable1 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/TCS.csv")
variable2 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/WIPRO.csv")
variable3 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/HDFC.csv")
variable4 = pd.read_csv(r"C:/Users/hp/Desktop/NIFTY/ITC.csv")
frames = [variable1,variable2,variable3,variable4]
all = pd.concat(frames)
print(all)
price_data = all [['Symbol','Date','Close','High','Low','Open','Volume']]
First, for average investors, the return of an asset is a complete and scale–free summary of the investment opportunity. Second, return series are easier to handle than prices series as they have more attractive statistical properties
# sort the values by symbol and then date
price_data.sort_values(by = ['Symbol','Date'], inplace = True)
# calculate the change in price
price_data['change_in_price'] = price_data['Close'].diff()
# identify rows where the symbol changes
mask = price_data['Symbol'] != price_data['Symbol'].shift(1)
# For those rows, let's make the value null
price_data['change_in_price'] = np.where(mask == True, np.nan, price_data['change_in_price'])
# print the rows that have a null value, should only be 5
price_data[price_data.isna().any(axis = 1)]
days_out = 30
# Group by symbol, then apply the rolling function and grab the Min and Max.
price_data_smoothed = price_data.groupby(['Symbol'])
[['Close','Low','High','Open','Volume']].transform(lambda x: x.ewm(span = days_out).mean())
# Join the smoothed columns with the symbol and datetime column from the old data frame.
smoothed_df = pd.concat([price_data[['Symbol','Date']], price_data_smoothed], axis=1, sort=False)
smoothed_df
days_out = 30
# create a new column that will house the flag, and for each group calculate the diff compared to 30 days ago. Then use Numpy to define the sign.
smoothed_df['Signal_Flag'] = smoothed_df.groupby('Symbol')['Close'].transform(lambda x :
np.sign(x.diff(days_out)))
# print the first 50 rows
smoothed_df.head(50)
up to here it is working but when i execute the below code then it throws an error cannot reindex from a duplicate axis
n = 14
# First make a copy of the data frame twice
up_df, down_df = price_data[['Symbol','change_in_price']].copy(),
price_data[['Symbol','change_in_price']].copy()
# For up days, if the change is less than 0 set to 0.
up_df.loc['change_in_price'] = up_df.loc[(up_df['change_in_price'] < 0), 'change_in_price'] = 0
# For down days, if the change is greater than 0 set to 0.
down_df.loc['change_in_price'] = down_df.loc[(down_df['change_in_price'] > 0), 'change_in_price']
= 0
# We need change in price to be absolute.
down_df['change_in_price'] = down_df['change_in_price'].abs()
# Calculate the EWMA (Exponential Weighted Moving Average), meaning older values are given less weight compared to newer values.
ewma_up = up_df.groupby('Symbol')['change_in_price'].transform(lambda x: x.ewm(span = n).mean())
ewma_down = down_df.groupby('Symbol')['change_in_price'].transform(lambda x: x.ewm(span =
n).mean())
# Calculate the Relative Strength
relative_strength = ewma_up / ewma_down
# Calculate the Relative Strength Index
relative_strength_index = 100.0 - (100.0 / (1.0 + relative_strength))
# Add the info to the data frame.
price_data['down_days'] = down_df['change_in_price']
price_data['up_days'] = up_df['change_in_price']
price_data['RSI'] = relative_strength_index
# Display the head.
price_data.head(30)

Pandas groupby winsorized mean

The normal groupby mean is easy:
df.groupby(['col_a','col_b']).mean()[col_i_want]
However, if i want to apply a winsorized mean (default limits of 0.05 and 0.95) which is equivalent to clipping the dataset then performing a mean, there suddenly seems to be no easy way to do it? I would have to:
winsorized_mean = []
col_i_want = 'col_c'
for entry in df['col_a'].unique():
for entry2 in df['col_b'].unique():
sub_df = df[(df['col_a'] == entry) & (df['col_b'] == entry2)]
m = sub_df[col_to_groupby].clip(lower=0.05,upper=0.95).mean()
winsorized_mean.append([entry,entry2,m])
Is there a function I'm not aware of to do this automatically?
You can use scipy.stats.trim_mean:
import pandas as pd
from scipy.stats import trim_mean
# label 'a' will exhibit different means depending on trimming
label = ['a'] * 20 + ['b'] * 80 + ['c'] * 400 + ['a'] * 100
data = list(range(100)) + list(range(500, 1000))
df = pd.DataFrame({'label': label, 'data': data})
grouped = df.groupby('label')
# trim 5% off both ends
print(grouped.apply(stats.trim_mean, .05))
# trim 10% off both ends
print(grouped.apply(stats.trim_mean, .1))

Adding a 45 degree line to a time series stock data plot

I guess this is supposed to be simple.. But I cant seem to make it work.
I have some stock data
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start = "06/01/2018", end = "08/01/2018"),
data = np.random.rand(62)*100)
I am doing some analysis on it, this results of my drawing some lines on the graph.
And I want to plot a 45 line somewhere on the graph as a reference for lines I drew on the graph.
What I have tried is
x = df.tail(len(df)/20).index
x = x.reset_index()
x_first_val = df.loc[x.loc[0].date].adj_close
In order to get some point and then use slope = 1 and calculate y values.. but this sounds all wrong.
Any ideas?
Here is a possibility:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start = "06/01/2018", end = "08/01/2018"),
data=np.random.rand(62)*100,
columns=['data'])
# Get values for the time:
index_range = df.index[('2018-06-18' < df.index) & (df.index < '2018-07-21')]
# get the timestamps in nanoseconds (since epoch)
timestamps_ns = index_range.astype(np.int64)
# convert it to a relative number of days (for example, could be seconds)
time_day = (timestamps_ns - timestamps_ns[0]) / 1e9 / 60 / 60 / 24
# Define y-data for a line:
slope = 3 # unit: "something" per day
something = time_day * slope
trendline = pd.Series(something, index=index_range)
# Graph:
df.plot(label='data', alpha=0.8)
trendline.plot(label='some trend')
plt.legend(); plt.ylabel('something');
which gives:
edit - first answer, using dayofyear instead of the timestamps:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(start = "06/01/2018", end = "08/01/2018"),
data=np.random.rand(62)*100,
columns=['data'])
# Define data for a line:
slope = 3 # unit: "something" per day
index_range = df.index[('2018-06-18' < df.index) & (df.index < '2018-07-21')]
dayofyear = index_range.dayofyear # it will not work around the new year...
dayofyear = dayofyear - dayofyear[0]
something = dayofyear * slope
trendline = pd.Series(something, index=index_range)
# Graph:
df.plot(label='data', alpha=0.8)
trendline.plot(label='some trend')
plt.legend(); plt.ylabel('something');

Categories

Resources