As an exercise, I am trying to set up a Monte Carlo simulation on a chosen ticker symbol.
from numpy.random import randint
from datetime import date, timedelta
from math import log
import pandas as pd
import yfinance as yf

# ticker symbol
ticker_input = "AAPL"  # change as needed

# start and end dates for the Yahoo Finance API: 1826 days, roughly 5 years of data
end_date = date.today()
start_date = end_date - timedelta(days=1826)

# retrieve data from Yahoo Finance
data = yf.download(ticker_input, start=start_date, end=end_date)
yf_data = data.reset_index()

# dataframe: define columns
df = pd.DataFrame(columns=['date', 'ln_change', 'open_price', 'random_num'])

# open prices and dates in descending order (newest first)
open_price = yf_data["Open"].values[::-1]
date_historical = yf_data["Date"].values[::-1]

# populate the dataframe
for i in range(0, len(open_price) - 1):
    day = date_historical[i]                      # date
    lnc = log(open_price[i] / open_price[i + 1])  # natural log of the day-over-day price change
    rnd = randint(1, 1258)                        # random integer (numpy's upper bound is exclusive)
    df.loc[i] = [day, lnc, open_price[i], rnd]
I was wondering how to calculate Big O when you have, for example, nested loops or exponential complexity, but a bounded input like the one in my example: the maximum input size is 1,259 floating-point values, and the input size is not going to change.
How do you calculate code complexity in that scenario?
It is a matter of perspective; both ways of seeing it are technically correct. The question is: what information do you wish to convey to the reader?
Consider the following code:
def do_something_constant():
    pass  # stand-in for any constant-time work

def quadratic_algorithm(n):
    for i in range(n):
        for j in range(n):
            do_something_constant()

quadratic_algorithm(1000)
The function is clearly O(n²). And yet the program will always run in the same, constant time, because it only ever makes one call with n = 1000. It is still perfectly valid to refer to the function as O(n²), and we can refer to the program as O(1).
But sometimes the boundaries are not that clear. Then it is up to you to choose whether you wish to see it as an algorithm whose time complexity is some function of n, or as a piece of constant code that runs in O(1). What matters is making it clear to the reader how you define things.
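Applied to your example: the loop that populates the dataframe is O(n) in the number of rows, and a hypothetical pairwise comparison over the prices would be O(n²); but because the input is capped at about 1,259 values, both are bounded by a constant, so the program as a whole can also be described as O(1). A tiny illustrative sketch (the nested loop is made up and not part of your simulation):

prices = list(range(1259))  # stand-in for the ~1,259 open prices

# O(n^2) as a function of the input length n...
count = 0
for i in range(len(prices)):
    for j in range(len(prices)):
        count += 1

# ...but with n fixed at 1,259 the running time is bounded by a constant,
# so the program as a whole can also be called O(1)
print(count)  # always 1259 * 1259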
My data comes from BigQuery, exported to a GCS bucket as CSV files; if the file size is quite large, BigQuery automatically splits the data into several chunks. With time series in mind, a time series might be scattered across different files. I have a custom function that I want to apply to each TimeseriesID.
Here are some constraints on the data:
The data is sorted by TimeseriesID and TimeID.
The number of rows in each file may vary, but there is at least 1 row per file (a single-row file is very unlikely).
The starting TimeID is not always 0.
The length of each time series may vary, but at most it is scattered across 2 files; no time series is spread over 3 different files.
Here's the initial setup to illustrate the problem:
import pandas as pd
import numpy as np

# Please take note this is just for simplicity. The actual goal is not to calculate the mean
# for every group, but to apply a custom_func to each TimeseriesID.
def custom_func(x):
    return np.mean(x)

# Please take note this is just for simplicity. In reality I read the files one by one,
# since reading all the data at once is not possible.
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
This would be pretty trivial if I could just concat all the files, but the problem is that if I concat all the dataframes they won't fit in memory.
The desired output should be similar to this, but without concatenating all the files:
pd.concat([df1,df2,df3],axis=0).groupby('TimeseriesID').agg({"value":custom_func})
I'm also aware of vaex and dask, but I want to stick with plain pandas for the time being.
I'm also open to solutions that involve modifying the BigQuery export so that the files are split better.
The approach presented by the OP, using concat on millions of records, would be overkill in terms of memory and other resources.
I have tested the OP's code in a Google Colab notebook, and it is a bad approach:
import pandas as pd
import numpy as np
import time
# Please take note this is just for simplicity. The actual goal is not to calculate the mean
# for every group, but to apply a custom_func to each TimeseriesID.
def custom_func(x):
    return np.mean(x)

# Please take note this is just for simplicity. In reality the files are read one by one,
# since reading all the data at once is not possible.
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
start = time.time()
df = pd.concat([df1,df2,df3]).groupby('TimeseriesID').agg({"value":custom_func})
elapsed = (time.time() - start)
print(elapsed)
print(df.head())
The output will be:
0.023952960968017578
                  value
TimeseriesID
A             11.666667
B             16.250000
C             20.000000
D             18.333333
As you can see, concat takes time to process; with only a few records this is not noticeable.
The approach should be as follows:
Get files with only the data that you are going to process, i.e. only the workable columns.
Create a dictionary of keys and values from the processed files, obtaining the values per key file by file where necessary. You can store the results in a 'results' directory as JSON/CSV files:
A.csv will hold all the values for key 'A'
...
n.csv will hold all the values for key 'n'
Iterate through the results directory and start building your final output inside a dictionary:
{'A': [10, 20, 5], 'B': [30, 10, 20, 5], 'C': [30, 10], 'D': [20, 5, 30]}
Apply the custom function to each key's list of values:
{'A': 11.666666666666666, 'B': 16.25, 'C': 20.0, 'D': 18.333333333333332}
You can check the logic with the code below; I use JSON to store the data:
from google.colab import files
import json
import pandas as pd
#initial dataset
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
#get unique keys and its values
df1.groupby('TimeseriesID')['value'].apply(list).to_json('df1.json')
df2.groupby('TimeseriesID')['value'].apply(list).to_json('df2.json')
df3.groupby('TimeseriesID')['value'].apply(list).to_json('df3.json')
#as this is an example, you can download the output as JSON files
files.download('df1.json')
files.download('df2.json')
files.download('df3.json')
Update 06/10/2021
I have tuned the code for the OP's needs. This part creates the refined files.
from google.colab import files
import json
#you should use your own function to get the data from the file
def retrieve_data(uploaded,file):
    return json.loads(uploaded[file].decode('utf-8'))

#you should use your own function to get a list of files to process
def retrieve_files():
    return files.upload()

key_list = []
#call a function that gets a list of files to process
file_to_process = retrieve_files()
#read every raw file:
for file in file_to_process:
    file_data = retrieve_data(file_to_process,file)
    for key,value in file_data.items():
        if key not in key_list:
            key_list.append(key)
            with open(f'{key}.json','w') as new_key_file:
                new_json = json.dumps({key:value})
                new_key_file.write(new_json)
        else:
            with open(f'{key}.json','r+') as key_file:
                raw_json = key_file.read()
                old_json = json.loads(raw_json)
                new_json = json.dumps({key:old_json[key]+value})
                key_file.seek(0)
                key_file.write(new_json)

for key in key_list:
    files.download(f'{key}.json')
print(key_list)
Update 07/10/2021
I have updated the code to avoid confusion. This part processes the refined files.
import time
import numpy as np
import pandas as pd

#Once we get the refined values we can use them to apply custom functions
def custom_func(x):
    return np.mean(x)

#Get key and data content from a single json
def get_data(file_data):
    content = file_data.popitem()
    return content[0],content[1]

#load key list and build our refined dictionary
refined_values = []
#call a function that gets a list of files to process
file_to_process = retrieve_files()
start = time.time()
#read every refined file:
for file in file_to_process:
    #read content of file n
    file_data = retrieve_data(file_to_process,file)
    #parse and apply function per file read
    key,data = get_data(file_data)
    func_output = custom_func(data)
    #start building refined list
    refined_values.append([key,func_output])
elapsed = (time.time() - start)
print(elapsed)
df = pd.DataFrame.from_records(refined_values,columns=['TimeseriesID','value']).sort_values(by=['TimeseriesID'])
df = df.reset_index(drop=True)
print(df.head())
The output will be:
0.00045609474182128906
  TimeseriesID      value
0            A  11.666667
1            B  16.250000
2            C  20.000000
3            D  18.333333
To summarize:
When handling large datasets, always focus on the data you are actually going to use and keep it minimal, using only the workable values.
Processing is faster when operations are performed with basic operators or native Python libraries.
I'm trying to create a function to calculate Heikin Ashi candles (for financial analysis).
My indicators.py file looks like this:
from pandas import DataFrame, Series

def heikinashi(dataframe):
    open = dataframe['open']
    high = dataframe['high']
    low = dataframe['low']
    close = dataframe['close']

    ha_close = 0.25 * (open + high + low + close)
    ha_open = 0.5 * (open.shift(1) + close.shift(1))
    ha_low = max(high, ha_open, ha_close)
    ha_high = min(low, ha_open, ha_close)

    return dataframe, ha_close, ha_open, ha_low, ha_high
And in my main script I'm trying to call this function in the most efficient way to get those four series: ha_close, ha_open, ha_low and ha_high.
My main script looks something like:
import indicators as ata
ha_close, ha_open, ha_low, ha_high = ata.heikinashi(dataframe)
dataframe['ha_close'] = ha_close(dataframe)
dataframe['ha_open'] = ha_open(dataframe)
dataframe['ha_low'] = ha_low(dataframe)
dataframe['ha_high'] = ha_high(dataframe)
But for some strange reason I cannot find the dataframes.
What would be the most efficient way to do this, code-wise and with minimal calls?
I expect dataframe['ha_close'] etc. to come back filled with the correct data, as computed in the function.
Any advice appreciated
Thank you!
In heikinashi() you are returning five elements, dataframe plus the four ha_* series; however, in your main script you only assign the return value to four variables.
Try changing the main script to:
dataframe, ha_close, ha_open, ha_low, ha_high = ata.heikinashi(dataframe)
Apart from that, it looks like the dataframe which you are passing into heikinashi() is not defined anywhere.
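As a side note, the ha_* values are pandas Series, so in the main script they should be assigned to columns directly rather than called like functions, and the built-in max()/min() inside heikinashi() would need to become element-wise pandas operations. Below is a self-contained sketch of that idea, not a drop-in replacement for your module: it returns only the four series (matching a four-variable unpacking), the sample prices are made up, and it follows the usual Heikin Ashi convention where ha_high is the element-wise maximum and ha_low the minimum.

import pandas as pd

def heikinashi(dataframe):
    o, h, l, c = dataframe['open'], dataframe['high'], dataframe['low'], dataframe['close']
    ha_close = 0.25 * (o + h + l + c)
    ha_open = 0.5 * (o.shift(1) + c.shift(1))
    # element-wise max/min across the three series, instead of the built-in max()/min()
    ha_high = pd.concat([h, ha_open, ha_close], axis=1).max(axis=1)
    ha_low = pd.concat([l, ha_open, ha_close], axis=1).min(axis=1)
    return ha_close, ha_open, ha_low, ha_high

# made-up sample data, for illustration only
dataframe = pd.DataFrame({'open':  [10.0, 10.5, 10.2],
                          'high':  [10.8, 10.9, 10.6],
                          'low':   [9.9, 10.1, 10.0],
                          'close': [10.5, 10.2, 10.4]})

# the returned values are Series, so assign them directly -- no call parentheses
dataframe['ha_close'], dataframe['ha_open'], dataframe['ha_low'], dataframe['ha_high'] = heikinashi(dataframe)
print(dataframe)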
I'm trying to calculate tracking error for a number of different benchmarks versus a fund that I'm looking at (tracking error is defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all the benchmarks are in a data frame that I'm reading from an Excel file. What I have so far is below (the idea is that arg1 represents each benchmark and the function is applied using applymap), but it's returning a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err
data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value from every column to your function (index_analytics), so arg1 is a single scalar value from your dataframe, and data[arg1] will always raise a KeyError unless all of your values also happen to be column names.
You also shouldn't need apply to do this. Assuming your benchmarks are in the same dataframe, you should be able to do something like the following for each benchmark (next time, please include a sample of your dataframe):
df['Benchmark1_result'] = (df['Fund'] - df['Benchmark1']) / df['Benchmark1']
And if you want to calculate the standard deviations for all the benchmarks at once, you can do this:
# assume df is your dataframe and you have a list of all the benchmark column names
benchmark_columns = [list, of, benchmark, columns]
np.std((df['Fund'].values[:, None] - df[benchmark_columns].values) / df[benchmark_columns].values, axis=0)
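Equivalently, you can stay entirely within pandas. A small sketch with made-up benchmark column names and values (in practice the data would come from pd.read_excel):

import pandas as pd

# made-up example data for illustration
df = pd.DataFrame({
    'Fund':       [100.0, 101.5, 102.1, 101.8],
    'Benchmark1': [ 99.5, 101.0, 101.9, 101.2],
    'Benchmark2': [100.2, 101.8, 102.5, 102.0],
})
benchmark_columns = ['Benchmark1', 'Benchmark2']

# (Fund - benchmark) / benchmark for every benchmark column, then the std of each column
pct_diff = df[benchmark_columns].rsub(df['Fund'], axis=0).div(df[benchmark_columns])
tracking_errors = pct_diff.std(ddof=0)  # ddof=0 matches np.std's default
print(tracking_errors)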
Assuming you're following the definition of tracking error as the square root of the sum of squared active returns:
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")
I need to create a pivot table of 2,000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames with a .append() followed by .groupby('someKey').sum(), all my memory is used up and Python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3 instead of 4, the code produces a false positive for correct output. The main issue is that if a shipmentid entry is not present in each and every chunk that the summation looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os
pd.set_option('io.hdf.default_format','table')
# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid'] = [1,2,3,1,2,3,1,2,3]  #evenly distributing shipmentid values for testing purposes
    frame['qty'] = np.random.randint(1,5,9)    #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9)  #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3  #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)]  #slice the input DF
    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid'] = 1
    span['qty'] = 0
    frame = frame.append(span)
    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')

print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')

store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
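For example, a small sketch of querying the store (the key 'df' and the column names 'someKey' and 'qty' are assumptions carried over from the question; a where clause only works on the index or on columns stored as data columns, e.g. data_columns=['someKey'] when appending):

import pandas as pd

store = pd.HDFStore('store.h5')

# read only the rows matching a condition, without loading the whole table
subset = store.select('df', where='someKey == 1')

# or read only a subset of the columns
some_cols = store.select('df', columns=['someKey', 'qty'])

store.close()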
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
(df.groupby().sum() for df in store.select('df', chunksize=50000)))
In Python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor, or if there are a large number of new groups, then it may be preferable to start res as zeros of the correct size (by first getting the unique group keys, e.g. by looping through the chunks), and then add in place.
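A rough sketch of that two-pass idea, assuming (as in the question) that the store key is 'df' and the grouping column is 'someKey':

import pandas as pd

store = pd.HDFStore('store.h5')

# first pass: collect the unique group keys and the columns produced by the group-wise sum
keys = set()
value_cols = None
for chunk in store.select('df', chunksize=50000):
    keys.update(chunk['someKey'].unique())
    if value_cols is None:
        value_cols = chunk.groupby('someKey').sum().columns

# pre-allocate the result as zeros of the final size
res = pd.DataFrame(0.0, index=sorted(keys), columns=value_cols)

# second pass: accumulate each chunk's partial group sums in place (aligned on labels)
for chunk in store.select('df', chunksize=50000):
    partial = chunk.groupby('someKey').sum()
    res.loc[partial.index, partial.columns] += partial

store.close()
print(res)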