In my program I have a part of code that uses an Estimated Moving Average (EMA) 4 times, but each time with different length. The program uses one or more EMAs depending on how much data it gets.
For now the code is not looped, just copy pasted with minor tweeks. That makes making changes difficult because I have to change everything 4 times.
Can somebody help me loop the code in such a way it wont loose it behaviour pattern. The mock-up code is presented here:
import random
import numpy as np
zakres=[5,10,15,20]
data=[]
def SI_sma(data, zakres):
weights=np.ones((zakres,))/zakres
smas=np.convolve(data, weights, 'valid')
return smas
def SI_ema(data, zakres):
weights_ema = np.exp(np.linspace(-1.,0.,zakres))
weights_ema /= weights_ema.sum()
ema=np.convolve(data,weights_ema)[:len(data)]
ema[:zakres]=ema[zakres]
return ema
while True:
data.append(random.uniform(0,100))
print(len(data))
if len(data)>zakres[0]:
smas=SI_sma(data=data, zakres=zakres[0])
ema=SI_ema(data=data, zakres=zakres[0])
print(smas[-1]) #calc using smas
print(ema[-1]) #calc using ema1
if len(data)>zakres[1]:
ema2=SI_ema(data=data, zakres=zakres[1])
print(ema2[-1]) #calc using ema2
if len(data)>zakres[2]:
ema3=SI_ema(data=data, zakres=zakres[2])
print(ema3[-1]) #calc using ema3
if len(data)>zakres[3]:
ema4=SI_ema(data=data, zakres=zakres[3])
print(ema4[-1]) #calc using ema4
input("press a key")
A variable number of variables is usually a bad idea. As you have found, it can make maintaining code cumbersome and error-prone. Instead, you can define a dict of results and use a for loop to iterate scenarios, defining len(data) just once.
ema = {}
while True:
data.append(random.uniform(0,100))
n = len(data)
for i, val in enumerate(zakres):
if n > val:
if i == 1:
smas = SI_sma(data=data, zakres=val)
ema[i] = SI_ema(data=data, zakres=val)
You can then access results via ema[0], ..., ema[3] as required.
"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns.The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
portfolio_value = [100]
rand_values = np.random.rand(600)
interpolated_returns = np.interp(rand_values,cdf_values,x)
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
portfolio_value[j] = portfolio_value[j]-1
portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I was having a look at iterations using nditer but I was unable to move ahead with that.
I guess the easiest way to figure out how you can vectorize your stuff would be to look at the equations that govern your evolution and see how your portfolio actually iterates, finding patterns that could be vectorized instead of trying to vectorize the code you already have. You would have noticed that the cumprod actually appears quite often in your iterations.
Nevertheless you can find the semi-vectorized code below. I included your code as well such that you can compare the results. I also included a simple loop version of your code which is much easier to read and translatable into mathematical equations. So if you share this code with somebody else I would definitely use the simple loop option. If you want some fancy-pants vectorizing you can use the vector version. In case you need to keep track of your single steps you can also add an array to the simple loop option and append the pv at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
import numpy as np
from scipy.special import erf
# Prepare simple return model - Normal distributed with mu &sigma = 0.01
x = np.linspace(-10,10,100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))
# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)
# Your code - Gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
portfolio_value = [100]
rand_values = rnd
interpolated_returns = np.interp(rand_values,cdf_values,x)
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
portfolio_value[j] = portfolio_value[j]-1
portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))
# Using vectors
portfolio_final = []
for i in range(nIterations):
portfolio_values = np.ones(nSteps)*100.0
rcp = np.cumprod(np.interp(rnd,cdf_values,x) + 1)
portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
portfolio_final.append(portfolio_values[-1])
print (np.mean(portfolio_final))
# Simple loop
portfolio_final = []
for i in range(nIterations):
pv = 100
rets = np.interp(rnd,cdf_values,x) + 1
for i in range(nSteps):
pv = pv * rets[i] - 1
portfolio_final.append(pv)
print (np.mean(portfolio_final))
Forget about np.nditer. It does not improve the speed of iterations. Only use if you intend to go one and use the C version (via cython).
I'm puzzled about that inner loop. What is it supposed to be doing special? Why the loop?
In tests with simulated values these 2 blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
portfolio_value[j] = portfolio_value[j]-1
interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I assuming that interpolated_returns and portfolio are 1d arrays of the same length.
if I have a list of phases for a sinusoidal data points and want to make a plot of time vs phase, the plot goes back to 0 after the data is past 2 pi. Is there a way I could manipulate the data so it continues after 2 pi?
I'm currently using phase = [i % 2*np.pi for i in phase], but this doesn't work.
This being about phase isn't important though. Lets say I had a list of data:
data = [0,1,2,0,1,2,0,1,2]
But I didn't want the data to reset to 0 after 2, so I want the data to be:
data = [0,1,2,3,4,5,6,7,8,9]
There are a few ways.
If you use numpy, the de-facto standard math and array manipulation library for python, then just use numpy.unwrap.
If you want to do it yourself or for some reason not use numpy you can do this
def my_unwrap(phase):
phase_diffs = [phase[i+1] - phase[i] for i in range(len(phases)-1)]
unwrapped_phases = [phase[0]]
previous_phase = phase[0]
for phase_diff in phase_diffs:
if abs(phase_diff) > pi:
phase_diff += 2*pi
previous_phase += phase_diff
unwrapped_phases.append(previous_phase)
return unwrapped
Seems to work in basic test cases.
I want to import in python some ascii file ( from tecplot, software for cfd post processing).
Rules for those files are (at least, for those that I need to import):
The file is divided in several section
Each section has two lines as header like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variable given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continous block of values.
There are a fixed number of values per row (6).
When a variable ends, the next one starts in a new line.
Variables are "IJK ordered data".The I-index varies the fastest; the J-index the next fastest; the K-index the slowest. The I-index should be the inner loop, the K-index shoould be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #fin first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new at python and I have written a code to import the data to a dictionary, writing the variables as 3D numpy.array . Those files could be very big, (up to Gb). How can I make this code faster? (or more generally, how can I import such files as fast as possible)?
import re
from numpy import zeros, array, prod
def vectorr(I, J, K):
"""function"""
vect = []
for k in range(0, K):
for j in range(0, J):
for i in range(0, I):
vect.append([i, j, k])
return vect
a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
strVARIABLES = re.findall('VARIABLES', filelist[count])
variables = re.findall(r'"(.*?)"', filelist[count])
countzone = countzone+1
data[countzone] = {key:[] for key in variables}
count = count+1
strI = re.findall('I=....', filelist[count])
strI = re.findall('\d+', strI[0])
I = int(strI[0])
##
strJ = re.findall('J=....', filelist[count])
strJ = re.findall('\d+', strJ[0])
J = int(strJ[0])
##
strK = re.findall('K=....', filelist[count])
strK = re.findall('\d+', strK[0])
K = int(strK[0])
data[countzone]['indmax'] = array([I, J, K])
pr = prod(data[countzone]['indmax'])
lin = pr // NumberCol
if pr%NumberCol != 0:
lin = lin+1
vect = vectorr(I, J, K)
for key in variables:
init = zeros((I, J, K))
for ii in range(0, lin):
count = count+1
temp = map(float, filelist[count].split())
for iii in range(0, len(temp)):
init.itemset(tuple(vect[ii*6+iii]), temp[iii])
data[countzone][key] = init
count = count+1
Ps. In python, no cython or other languages
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring
# replace the nested for-loop with:
count += 1
for key in variables:
str_vector = ' '.join(filelist[count:count+lin])
ar = fromstring(str_vector, sep=' ')
ar = ar.reshape((I, J, K), order='F')
data[countzone][key] = ar
count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset and probably loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail..
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case this is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the fromstring method that I used scales rather bad with the length of the string. E.g. from a certain string length it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).
If you use regular expressions here, there's two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one, using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)", and grabbing the part matched inside the parentheses using regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step.
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...
Thanks for the answers, I have not used StackOverflow before so I was suprised by the number of answers and the speed of them - its fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image in this because i don't have enough points but you can see an image
at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. So you can see on the horizontal lines across the page are price points on the chart. Now where you get a clustering of lines within 0.5% of each, this is considered to be a good thing and why I want to identify those clusters automatically. You can see on the chart that there is a cluster at S2 & MR1, R2 & WPP1.
So everyday I produce these price points and then I can identify manually those that are within 0.5%. - but the purpose of this question is how to do it with a python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also the 0.5% is not fixed this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks Again.
Mark
PS. I will get cracking on checking the answers now now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python, (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels (see below)) where the sublist has two items (stockprice, weight). I want to put the stockprices into groups when they are within 0.5% of each other. A groups strength will be determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels[
[175.24,8]
[147.85,6]
[144.13,8]
[130.44,6]
[127.79,8]
[127.42,5]
[120.94,6]
[120.22,5]
[118.10,3]
[116.73,2]
[116.23,1]
[115.93,2]
[115.83,5]
[115.56,3]
[115.53,1]
[114.79,2]
[114.59,1]
[113.99,2]
[113.89,1]
[113.50,3]
[112.95,1]
[112.85,2]
[112.25,1]
[112.05,2]
[111.31,1]
[110.97,3]
[110.91,2]
[110.87,4]
[108.91,3]
[108.64,5]
[106.37,3]
[104.31,3]
[104.25,5]
[103.53,6]
[99.42,7]
[97.05,5]
[96.68,8]
[94.03,6]
[92.66,5]
[80.34,8]
[76.62,6]
[67.12,6]
[49.23,8]
[32.89,8]
]
I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
k = 1
clusters = [items]
maxdist = diameter(items)
while maxdist > threshold:
k += 1
clusters = Kmc(items, k)
maxdist = max(diameter(c) for c in clusters)
return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
A stock s belong in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.
You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)
You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick, dirty way to do something that you might find useful:
def make_group_function(tolerance = 0.05):
from math import log10, floor
# I forget why this works.
tolerance_factor = -1.0/(-log10(1.0 + tolerance))
# well ... since you might ask
# we want: log(x)*tf - log(x*(1+t))*tf = -1,
# so every 5% change has a different group. The minus is just so groups
# are ascending .. it looks a bit nicer.
#
# tf = -1/(log(x)-log(x*(1+t)))
# tf = -1/(log(x/(x*(1+t))))
# tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
# tf = -1/(0-log(1*(1+t)))
# tf = -1/(-log((1+t))
def group_function(value):
# don't just use int - it rounds up below zero, and down above zero
return int(floor(log10(value)*tolerance_factor))
return group_function
Usage:
group_function = make_group_function()
import random
groups = {}
for i in range(50):
v = random.random()*500+1000
group = group_function(v)
if group in groups:
groups[group].append(v)
else:
groups[group] = [v]
for group in sorted(groups):
print 'Group',group
for v in sorted(groups[group]):
print v
print
For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.
apart from the proper way to pick which values fit together, this is a problem where a little Object Orientation dropped in can make it a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, but which can make the classification a lot easier -- you get a single point to play with it on the Group class.
I can see the code bellow is incorrect, in the sense the limtis for group inclusion varies as new members are added -- even it the separation crieteria remaisn teh same, you heva e torewrite the get_groups method to use a multi-pass approach. It should nto be hard -- but the code would be too long to be helpfull here, and i think this snipped is enoguh to get you going:
from copy import copy
class Group(object):
def __init__(self,data=None, name=""):
if data:
self.data = data
else:
self.data = []
self.name = name
def get_mean_stock(self):
return sum(item[0] for item in self.data) / len(self.data)
def fits(self, item):
if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
return True
return False
def get_weight(self):
return sum(item[1] for item in self.data)
def __repr__(self):
return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
self.name,
"\n".join("%.02f, %d" % tuple(item) for item in self.data ),
self.get_weight())
class StockGrouper(object):
def __init__(self, data=None):
if data:
self.floor_levels = data
else:
self.floor_levels = []
def get_groups(self):
groups = []
floor_levels = copy(self.floor_levels)
name_ord = ord("A") - 1
while floor_levels:
seed = floor_levels.pop(0)
name_ord += 1
group = Group([seed], chr(name_ord))
groups.append(group)
to_remove = []
for i, item in enumerate(floor_levels):
if group.fits(item):
group.data.append(item)
to_remove.append(i)
for i in reversed(to_remove):
floor_levels.pop(i)
return groups
testing:
floor_levels = [ [stock. weight] ,... <paste the data above> ]
s = StockGrouper(floor_levels)
s.get_groups()
For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done, and then you could test if the current value in the iteration was different to the last by <0.5%, and have itertools.groupby() break into a new group every time your function returned false.