As an exercise, I am trying to set up a Monte Carlo simulation on a chosen ticker symbol.
from numpy.random import randint
from datetime import date
from datetime import timedelta
import pandas as pd
import yfinance as yf
from math import log

# ticker symbol
ticker_input = "AAPL"  # change as needed

# date range for the Yahoo Finance API: the last 5 years of data
end_date = date.today()
start_date = end_date - timedelta(days=1826)

# retrieve data from Yahoo Finance
data = yf.download(ticker_input, start_date, end_date)
yf_data = data.reset_index()

# dataframe: define columns
df = pd.DataFrame(columns=['date', 'ln_change', 'open_price', 'random_num'])
open_price = yf_data["Open"].values
date_historical = yf_data["Date"].values

# list order: descending (most recent first)
open_price[:] = open_price[::-1]
date_historical[:] = date_historical[::-1]

# populate the dataframe
for i in range(0, len(open_price) - 1):
    day = date_historical[i]                      # date
    lnc = log(open_price[i] / open_price[i + 1])  # natural log of the day-over-day change
    rnd = randint(1, 1258)                        # random number
    df.loc[i] = [day, lnc, open_price[i], rnd]
I was wondering how to calculate Big O when you have, e.g., nested loops or exponential complexity but a limited input like the one in my example: the maximum input size is 1259 floating-point values, and it is not going to change.
How do you calculate code complexity in that scenario?
It is a matter of point of view. Both ways of seeing it are technically correct. The question is: what information do you wish to convey to the reader?
Consider the following code:
quadraticAlgorithm(n) {
    for (i <- 1...n)
        for (j <- 1...n)
            doSomethingConstant();
}
quadraticAlgorithm(1000);
The function is clearly O(n²). And yet the program will always run in the same, constant time, because it consists of just one function call with n = 1000. It is still perfectly valid to refer to the function as O(n²), and we can refer to the program as O(1).
But sometimes the boundaries are not that clear. Then it is up to you to choose if you wish to see it as an algorithm with a time complexity as some function of n, or as a piece of constant code that runs in O(1). The importance is to make it clear to the reader how you define things.
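The same point in Python (the function name and the operation count are just for illustration): the function's work grows quadratically with n, but a program that only ever calls it with one fixed n does a constant amount of work.

```python
def quadratic_algorithm(n):
    """O(n^2) in its input size n: the nested loops do n*n units of work."""
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1  # stand-in for doSomethingConstant()
    return count

# As a function of n, the work grows quadratically:
print(quadratic_algorithm(10))   # 100 operations
print(quadratic_algorithm(100))  # 10000 operations

# But a program that always calls quadratic_algorithm(1000) performs a
# fixed amount of work, so the program as a whole can be called O(1).
```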
I'm working with field change history data which has timestamps for when the field value was changed. In this example, I need to calculate the overall case duration in 'Termination in Progress' status.
The given case was changed from and to this status three times in total:
see screenshot
I need to add up all three durations in this case and in other cases it can be more or less than three.
Does anyone know how to calculate that in Python?
Welcome to Stack Overflow!
Based on the limited data you provided, here is a solution that should work, although the code makes some assumptions that could cause errors, so you will want to modify it to suit your needs. I avoided list comprehensions and array math to keep it clear, since you said you're new to Python.
Assumptions:
You're pulling this data into a pandas dataframe
All Old values of "Termination in Progress" have a matching new value for all Case Numbers
import datetime
import pandas as pd
import numpy as np

fp = r'<PATH TO FILE>\\'
f = '<FILENAME>.csv'
data = pd.read_csv(fp + f)

# convert ts to datetime for later use doing time delta calculations
data['Edit Date'] = pd.to_datetime(data['Edit Date'])

# sort by case number and date (descending) so the old and new values align properly
data.sort_values(by=['CaseNumber', 'Edit Date'], ascending=[True, False], inplace=True)

# find timestamps where Termination in progress occurs
old_val_ts = data.loc[data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
new_val_ts = data.loc[data['New Value'] == 'Termination in progress']['Edit Date'].to_list()

# loop over the timestamps and calc the time delta
ts_deltas = list()
for i in range(len(old_val_ts)):
    item = old_val_ts[i] - new_val_ts[i]
    ts_deltas.append(item)

# this loop could also be accomplished with a list comprehension like this:
# ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]

print('Deltas between groups')
print(ts_deltas)
print()

# sum the time deltas
total_ts_delta = sum(ts_deltas, datetime.timedelta())
print('Total Time Delta')
print(total_ts_delta)
Deltas between groups
[Timedelta('0 days 00:08:00'), Timedelta('0 days 00:06:00'), Timedelta('0 days 02:08:00')]
Total Time Delta
0 days 02:22:00
I've also attached a picture of the solution minus my file path for obvious reasons. Hope this helps. Please remember to mark as correct if this solution works for you. Otherwise let me know what issues you run into.
EDIT:
If you have multiple case numbers you want to look at, you could do it in various ways, but the simplest is to get a list of unique case numbers with data['CaseNumber'].unique(), then iterate over that array, filtering for each case number and appending the total time delta to a new list or dictionary (not necessarily the most efficient solution, but it will work).
cases_total_td = {}
unique_cases = data['CaseNumber'].unique()
for case in unique_cases:
    temp_data = data[data['CaseNumber'] == case]
    # find timestamps where Termination in progress occurs for this case
    old_val_ts = temp_data.loc[temp_data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
    new_val_ts = temp_data.loc[temp_data['New Value'] == 'Termination in progress']['Edit Date'].to_list()
    # calc the time deltas and sum them
    ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]
    total_ts_delta = sum(ts_deltas, datetime.timedelta())
    cases_total_td[case] = total_ts_delta
print(cases_total_td)
{1005222: Timedelta('0 days 02:22:00')}
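For reference, the per-case loop can also be written with groupby, which avoids filtering the frame once per case. A minimal runnable sketch under the same assumptions (the column names and the synthetic timestamps below are made up to mirror the shape of your data):

```python
import pandas as pd

# Synthetic frame: one case that enters and leaves the status twice.
data = pd.DataFrame({
    'CaseNumber': [1005222] * 4,
    'Edit Date': pd.to_datetime(['2020-01-01 10:00', '2020-01-01 10:08',
                                 '2020-01-01 11:00', '2020-01-01 11:06']),
    'Old Value': ['Open', 'Termination in progress',
                  'Open', 'Termination in progress'],
    'New Value': ['Termination in progress', 'Open',
                  'Termination in progress', 'Open'],
})

totals = {}
for case, group in data.groupby('CaseNumber'):
    # Timestamps where the case entered and left the status, in matching order.
    entered = group.loc[group['New Value'] == 'Termination in progress', 'Edit Date']
    left = group.loc[group['Old Value'] == 'Termination in progress', 'Edit Date']
    totals[case] = (left.values - entered.values).sum()

print(totals)  # 8 min + 6 min = 14 min for this case
```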
I'm working on a project for college and it's kicking my ass.
I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions
I'm trying to use an ANOVA to see if there's a statistically significant difference in time taken to summit between the seasons.
The F value I'm getting back doesn't seem to make any sense. Any suggestions?
#import pandas
import pandas as pd
#import expeditions as csv file
exp = pd.read_csv('C:\\filepath\\expeditions.csv')
#extract only the data relating to everest
exp= exp[exp['peak_name'] == 'Everest']
#create a subset of the data only containing
exp_peaks = exp[['peak_name', 'member_deaths', 'termination_reason', 'hired_staff_deaths', 'year', 'season', 'basecamp_date', 'highpoint_date']]
#extract successful attempts
exp_peaks = exp_peaks[(exp_peaks['termination_reason'] == 'Success (main peak)')]
#drop missing values from basecamp_date & highpoint_date
exp_peaks = exp_peaks.dropna(subset=['basecamp_date', 'highpoint_date'])
#convert basecamp date to datetime
exp_peaks['basecamp_date'] = pd.to_datetime(exp_peaks['basecamp_date'])
#convert highpoint date to datetime
exp_peaks['highpoint_date'] = pd.to_datetime(exp_peaks['highpoint_date'])
from datetime import datetime
exp_peaks['time_taken'] = exp_peaks['highpoint_date'] - exp_peaks['basecamp_date']
#convert seasons from strings to ints
exp_peaks['season'] = exp_peaks['season'].replace('Spring', 1)
exp_peaks['season'] = exp_peaks['season'].replace('Autumn', 3)
exp_peaks['season'] = exp_peaks['season'].replace('Winter', 4)
#remove summer and unknown
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Summer')]
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Unknown')]
#subset the data according to the season
exp_peaks_spring = exp_peaks[exp_peaks['season'] == 1]
exp_peaks_autumn = exp_peaks[exp_peaks['season'] == 3]
exp_peaks_winter = exp_peaks[exp_peaks['season'] == 4]
#calculate the average time taken in spring
exp_peaks_spring_duration = exp_peaks_spring['time_taken']
mean_exp_peaks_spring_duration = exp_peaks_spring_duration.mean()
#calculate the average time taken in autumn
exp_peaks_autumn_duration = exp_peaks_autumn['time_taken']
mean_exp_peaks_autumn_duration = exp_peaks_autumn_duration.mean()
#calculate the average time taken in winter
exp_peaks_winter_duration = exp_peaks_winter['time_taken']
mean_exp_peaks_winter_duration = exp_peaks_winter_duration.mean()
# Turn the season column into a categorical
exp_peaks['season'] = exp_peaks['season'].astype('category')
exp_peaks['season'].dtypes
from scipy.stats import f_oneway
# One-way ANOVA
f_value, p_value = f_oneway(exp_peaks['season'], exp_peaks['time_taken'])
print("F-score: " + str(f_value))
print("p value: " + str(p_value))
It seems that f_oneway requires the different samples of continuous data to be arguments, rather than taking a categorical variable argument. You can achieve this using groupby.
f_oneway(*(group for _, group in exp_peaks.groupby("season")["time_taken"]))
Or equivalently, since you have already created series for each season:
f_oneway(exp_peaks_spring_duration, exp_peaks_autumn_duration, exp_peaks_winter_duration)
I would have thought there would be an easier way to perform an ANOVA in this common case but can't find it.
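To make the groupby pattern concrete, here is a self-contained sketch with synthetic data (the column names and numbers are made up; time_taken is in days rather than a timedelta, since f_oneway needs numeric samples):

```python
import pandas as pd
from scipy.stats import f_oneway

df = pd.DataFrame({
    "season": ["Spring"] * 4 + ["Autumn"] * 4 + ["Winter"] * 4,
    "time_taken": [30, 32, 31, 29, 40, 41, 39, 42, 55, 53, 54, 56],
})

# f_oneway wants one positional sample per group,
# not a categorical column plus a value column.
f_value, p_value = f_oneway(*(g for _, g in df.groupby("season")["time_taken"]))
print(f"F = {f_value:.1f}, p = {p_value:.2g}")
```

With clearly separated group means like these, the F statistic is large and the p-value is tiny; passing the categorical column itself as a sample, as in the question, produces a meaningless F.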
No matter what I do I don't seem to be able to add all the base volumes and quote volumes together easily! I want to end up with a total base volume and a total quote volume of all the data in the data frame. Can someone help me on how you can do this easily?
I have tried summing and saving the data in a dictionary first and then adding it but I just don't seem to be able to make this work!
import urllib
import pandas as pd
import json
def call_data(): # Call data from Poloniex
    global df
    datalink = 'https://poloniex.com/public?command=returnTicker'
    df = urllib.request.urlopen(datalink)
    df = df.read().decode('utf-8')
    df = json.loads(df)
    global current_eth_price
    for k, v in df.items():
        if 'ETH' in k:
            if 'USDT_ETH' in k:
                current_eth_price = round(float(v['last']), 2)
                print("Current ETH Price $:", current_eth_price)

def calc_volumes(): # Calculate the base & quote volumes
    global volume_totals
    for k, v in df.items():
        if 'ETH' in k:
            basevolume = float(v['baseVolume']) * current_eth_price
            quotevolume = float(v['quoteVolume']) * float(v['last']) * current_eth_price
            if quotevolume > 0:
                percentages = (quotevolume - basevolume) / basevolume * 100
                volume_totals = {'key': [k],
                                 'basevolume': [basevolume],
                                 'quotevolume': [quotevolume],
                                 'percentages': [percentages]}
                print("volume totals:", volume_totals)
                print("#" * 8)

call_data()
calc_volumes()
A few notes:
For the next 2 years, don't use the global keyword for anything.
Put function documentation in a docstring just under the def line.
Using the requests library would be much easier than urllib. However ...
pandas can fetch the JSON and parse it all in one step.
OK, it doesn't have to be as split up as this; I'm just showing you how to properly pass variables around instead of using globals.
I could not find "ETH" by itself. The data they sent has these 3: ['BTC_ETH', 'USDT_ETH', 'USDC_ETH'], so I used "USDT_ETH". I hope the substitution is OK.
calc_volumes seems to do the calculation and also act as some sort of filter (it's picky about what it prints). This function needs to be broken up into its two separate jobs: printing and calculating. (Maybe there was a filter step, but I leave that for homework.)
import pandas as pd

eth_price_url = 'https://poloniex.com/public?command=returnTicker'

def get_data(url=''):
    """Call data from Poloniex and put it in a dataframe"""
    data = pd.read_json(url)
    return data

def get_current_eth_price(data=None):
    """Grab the price out of the dataframe"""
    current_eth_price = round(float(data['USDT_ETH']['last']), 2)
    return current_eth_price

def calc_volumes(data=None, current_eth_price=None):
    """Calculate the base & quote volumes"""
    data = data[data.columns[data.columns.str.contains('ETH')]].loc[['baseVolume', 'quoteVolume', 'last']]
    data = data.transpose()
    data[['baseVolume', 'quoteVolume']] *= current_eth_price
    data['quoteVolume'] *= data['last']
    data['percentages'] = (data['quoteVolume'] - data['baseVolume']) / data['baseVolume'] * 100
    return data

df = get_data(url=eth_price_url)
the_price = get_current_eth_price(data=df)
print(f'the current eth price is: {the_price}')
volumes = calc_volumes(data=df, current_eth_price=the_price)
print(volumes)
This code seems kind of odd and inconsistent... for example, you're importing pandas and calling your variable df but you're not actually using dataframes. If you used df = pd.read_json('https://poloniex.com/public?command=returnTicker', 'index')* to get a dataframe, most of your data manipulation here would become much easier, and wouldn't require any loops either.
For example, the first function's code would become as simple as current_eth_price = df.loc['USDT_ETH','last'].
The second function's code would basically be
eth_rows = df[df.index.str.contains('ETH')]
total_base_volume = (eth_rows.baseVolume * current_eth_price).sum()
total_quote_volume = (eth_rows.quoteVolume * eth_rows['last'] * current_eth_price).sum()
(*The 'index' argument tells pandas to read the JSON dictionary indexed by rows, then columns, rather than columns, then rows.)
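To make the summing concrete without hitting the live API, here is a minimal sketch with a hand-made frame in the same shape (tickers as the row index, made-up numbers):

```python
import pandas as pd

# Stand-in for pd.read_json(url, 'index'): rows are tickers, columns are fields.
df = pd.DataFrame(
    {"last": [0.05, 2000.0, 1999.0],
     "baseVolume": [10.0, 500.0, 400.0],
     "quoteVolume": [200.0, 0.25, 0.2]},
    index=["BTC_ETH", "USDT_ETH", "USDC_ETH"],
)

# Price lookup: one row, one column, no loop.
current_eth_price = df.loc["USDT_ETH", "last"]

# Vectorized totals over every ETH pair.
eth_rows = df[df.index.str.contains("ETH")]
total_base_volume = (eth_rows.baseVolume * current_eth_price).sum()
total_quote_volume = (eth_rows.quoteVolume * eth_rows["last"] * current_eth_price).sum()
print(total_base_volume, total_quote_volume)
```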
I have the following Excel file, with the timestamp in the format
20180821_2330
for a lot of days.
1) How would I format it as a standard time so that I can plot it against the other sensor values?
2) I would like one big plot with, for example, the sensor 1 reading against all the days. Is that possible?
https://www.mediafire.com/file/m36ha4777d6epvd/median_data.xlsx/file
Is this something you are looking for? I improvised and created an 'n' dictionary which could represent your timestamp column as a data frame. Basically, what I think you should do is apply another function (let's call it apply_fun) to the column which stores the timestamps: a function which takes each element and transforms it via strptime().
import datetime
import pandas as pd

n = {'timestamp': ['20180822_2330', '20180821_2334', '20180821_2334', '20180821_2330']}
data_series = pd.DataFrame(n)

def format_dates(n):
    x = n.find('_')
    y = datetime.datetime.strptime(n[:x] + n[x+1:], '%Y%m%d%H%M')
    return y

def apply_fun(dataset):
    dataset['timestamp2'] = dataset['timestamp'].apply(format_dates)
    return dataset

print(apply_fun(data_series))
When it comes to the 2nd point, I am not able to reach the site due to the McAfee agent at work, which does not allow me to open it. Once you have the 1st, you can ask about the 2nd separately.
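As an aside, strptime can consume the underscore directly if you put it in the format string, so the find/slice step is optional. A small sketch (the function name is mine):

```python
from datetime import datetime

def parse_stamp(s):
    # '%Y%m%d_%H%M' matches strings like '20180821_2330',
    # including the literal underscore, in one call.
    return datetime.strptime(s, '%Y%m%d_%H%M')

print(parse_stamp('20180821_2330'))  # 2018-08-21 23:30:00
```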
I'm trying to create a program in Python, in which the user inputs a number of stock tickers and a desired date range; the output would be a graph of daily cumulative returns of the portfolio in that date range assuming the stocks have equal weights.
So far, this is my input block:
import datetime
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib import pyplot
import csv
from pandas.io.data import DataReader
isValid = False
userInStart = raw_input("Enter Start Date mm/dd/yyyy: ") """start of date range"""
userInEnd = raw_input ("Enter End Date mm/dd/yyyy: ") """end of date range"""
StockCount = input ('Input the number of stocks in the porfolio: ')
StockArray = list() """an array of input stocks"""
Return = [[]] """an array of daily returns for each stock"""
CumulativeReturn = list() """an array of cumulative returns for each stock"""
The processing block is:
for i in range(0, StockCount):
    for j in range(1, len(StockArray[i]["Adj Close"])):
        Return[i].append(StockArray[i]["Adj Close"][j] - StockArray[i]["Adj Close"][j-1])
1) If I input 1 stock, by the end I am left with an array of returns but no dates. How can I relate the Return array to the dates?
2) If I input more than one stock, I get an index out of range error. What may be causing the problem?
Any help on conceptualizing a solution would also be highly appreciated.
Thanks
Without seeing the full listing of your program it is impossible to answer question 2. If you take a closer look at the stack trace of the 'index out of range' exception, it should tell you the line number at which the exception was raised. Start from there.
As for question 1, you should consider applying an OO design technique in your application.
For example, a return can be implemented as a class like this:
class Return(object):
    def __init__(self, date, r):
        self.date = date
        self.r = r
In this way you associate a date with a return value.
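Building the list might look like this (the dates and closing prices are made up; zip pairs each return with the date it belongs to):

```python
import datetime as dt

class Return(object):
    def __init__(self, date, r):
        self.date = date
        self.r = r

dates = [dt.date(2015, 1, 2), dt.date(2015, 1, 5), dt.date(2015, 1, 6)]
closes = [100.0, 101.5, 100.75]

# Each day-over-day difference is stored together with its date.
returns = [Return(d, b - a) for d, a, b in zip(dates[1:], closes, closes[1:])]
for ret in returns:
    print(ret.date, ret.r)
```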