How does this class work? (Related to Quantopian, Python and Pandas) - python

From here: https://www.quantopian.com/posts/wsj-example-algorithm
class Reversion(CustomFactor):
    """
    Here we define a basic mean reversion factor using a CustomFactor. We
    take a ratio of the last close price to the average price over the
    last 60 days. A high ratio indicates a high price relative to the mean
    and a low ratio indicates a low price relative to the mean.
    """
    inputs = [USEquityPricing.close]
    window_length = 60

    def compute(self, today, assets, out, prices):
        out[:] = -prices[-1] / np.mean(prices, axis=0)
Reversion() seems to return a pandas.DataFrame, and I have absolutely no idea why.
For one thing, where are inputs and window_length used?
And what exactly is out[:]?
Is this specific behavior related to Quantopian in particular or Python/Pandas?

TL;DR
Reversion() doesn't return a DataFrame, it returns an instance of the
Reversion class, which you can think of as a formula for performing a
trailing window computation. You can run that formula over a particular time
period using either quantopian.algorithm.pipeline_output or
quantopian.research.run_pipeline, depending on whether you're writing a
trading algorithm or doing offline research in a notebook.
The compute method is what defines the "formula" computed by a Reversion
instance. It calculates a reduction over a 2D numpy array of prices, where
each row of the array corresponds to a day and each column of the array
corresponds to a stock. The result of that computation is a 1D array
containing a value for each stock, which is copied into out. out is also
a numpy array. The syntax out[:] = <expression> says "copy the values from
<expression> into out".
compute writes its result directly into an output array instead of simply
returning because doing so allows the CustomFactor base class to ensure
that the output has the correct shape and dtype, which can be nontrivial for
more complex cases.
Having a function "return" by overwriting an input is unusual and generally
non-idiomatic Python. I wouldn't recommend implementing a similar API unless
you're very sure that there isn't a better solution.
All of the code in the linked example is open source and can be found in
Zipline, the framework on top of
which Quantopian is built. If you're interested in the implementation, the
following files are good places to start:
zipline/pipeline/engine.py
zipline/pipeline/term.py
zipline/pipeline/graph.py
zipline/pipeline/pipeline.py
zipline/pipeline/factors/factor.py
You can also find a detailed tutorial on the Pipeline API
here.
I think there are two kinds of answers to your question:
1. How does the Reversion class fit into the larger framework of a
   Zipline/Quantopian algorithm? In other words, "how is the Reversion class
   used"?
2. What are the expected inputs to Reversion.compute() and what computation
   does it perform on those inputs? In other words, "what, concretely, does the
   Reversion.compute() method do"?
It's easier to answer (2) with some context from (1).
How is the Reversion class used?
Reversion is a subclass of CustomFactor, which is part of Zipline's
Pipeline API. The primary purpose of the Pipeline API is to make it easy
for users to perform a certain special kind of computation efficiently over
many sources of data. That special kind of computation is a cross-sectional
trailing-window computation, which has the form:
Every day, for some set of data sources, fetch the last N days of data for all
known assets and apply a reduction function to produce a single value per
asset.
A very simple cross-sectional trailing-window computation would be something
like "close-to-close daily returns", which has the form:
Every day, fetch the last two days of close prices and, for each asset,
calculate the percent change between the asset's previous day close price and
its current close price.
To describe a cross-sectional trailing-window computation, we need at least
three pieces of information:
1. On what kinds of data (e.g. price, volume, market cap) does the computation
   operate?
2. On how long of a trailing window of data (e.g. 1 day, 20 days, 100 days)
   does the computation operate?
3. What reduction function does the computation perform over the data described
   by (1) and (2)?
The CustomFactor class defines an API for consolidating these three pieces of
information into a single object.
The inputs attribute describes the set of inputs needed to perform a
computation. In the snippet from the question, the only input is
USEquityPricing.close, which says that we just need trailing daily close
prices. In general, however, we can ask for any number of inputs. For
example, to compute VWAP (Volume-Weighted Average Price), we would use
something like inputs = [USEquityPricing.close, USEquityPricing.volume] to
say that we want trailing close prices and trailing daily volumes.
The window_length attribute describes the number of days of trailing data
required to perform a computation. In the snippet above we're requesting 60
days of trailing close prices.
The compute method describes the trailing-window computation to be
performed. In the section below, I've outlined exactly how compute performs
its computation. For now, it's enough to know that compute is essentially a
reduction function from some number of 2-dimensional arrays to a single
1-dimensional array.
You might notice that we haven't defined an actual set of dates on which we
might want to compute a Reversion factor. This is by design, since we'd like
to be able to use the same Reversion instance to perform calculations at
different points in time.
Quantopian defines two APIs for computing expressions like Reversion: an
"online" mode designed for use in actual trading algorithms, and a "batch" mode
designed for use in research and development. In both APIs, we first construct
a Pipeline object that holds all the computations we want to perform. We
then feed our pipeline object into a function that actually performs the
computations we're interested in.
In the batch API, we call run_pipeline passing our pipeline, a start date,
and an end date. A simple research notebook computing a custom factor might
look like this:
from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.research import run_pipeline

class Reversion(CustomFactor):
    # Code from snippet above.

reversion = Reversion()
pipeline = Pipeline({'reversion': reversion})
result = run_pipeline(pipeline, start_date='2014-01-02', end_date='2015-01-02')
do_stuff_with(result)
In a trading algorithm, we're generally interested in the most recently
computed values from our pipeline, so there's a slightly different API: we
"attach" a pipeline to our algorithm on startup, and we request the latest
output from the pipeline at the start of each day. A simple trading algorithm
using Reversion might look something like this:
import quantopian.algorithm as algo
from quantopian.pipeline import Pipeline, CustomFactor

class Reversion(CustomFactor):
    # Code from snippet above.

def initialize(context):
    reversion = Reversion()
    pipeline = Pipeline({'reversion': reversion})
    algo.attach_pipeline(pipeline, name='my_pipe')

def before_trading_start(context, data):
    result = algo.pipeline_output(name='my_pipe')
    do_stuff_with(result)
The most important thing to understand about the two examples above is that
simply constructing an instance of Reversion doesn't perform any
computation. In particular, the line:
reversion = Reversion()
doesn't fetch any data or call the compute method. It simply creates an
instance of the Reversion class, which knows that it needs 60 days of close
prices each day to run its compute function. Similarly,
USEquityPricing.close isn't a DataFrame or a numpy array or anything like
that: it's just a sentinel value that describes what kind of data Reversion
needs as an input.
One way to think about this is by an analogy to mathematics. An instance of
Reversion is like a formula for performing a calculation, and
USEquityPricing.close is like a variable in that formula.
Simply writing down the formula doesn't produce any values; it just gives us a
way to say "here's how to compute a result if you plug in values for all of
these variables".
We get a concrete result by actually plugging in values for our variables,
which happens when we call run_pipeline or pipeline_output.
So what, concretely, does Reversion.compute() do?
Both run_pipeline and pipeline_output ultimately boil down to calls to
PipelineEngine.run_pipeline, which is where actual computation happens.
To continue the analogy from above, if reversion is a formula, and
USEquityPricing.close is a variable in that formula, then PipelineEngine is
the grade school student whose homework assignment is to look up the value of
the variable and plug it into the formula.
When we call PipelineEngine.run_pipeline(pipeline, start_date, end_date), the
engine iterates through our requested expressions, loads the inputs for those
expressions, and then calls each expression's compute method once per trading
day between start_date and end_date with appropriate slices of the loaded
input data.
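The loop described above can be sketched with plain NumPy. This is a toy illustration only, not Zipline's actual engine: ToyReversion, run_toy_engine, the string input name "close", and the tiny price table are all hypothetical names and data made up for the example, and the window is shortened to 3 days so the numbers stay readable.

```python
import numpy as np


class ToyReversion:
    """Hypothetical stand-in for Reversion; no Zipline base class involved."""
    inputs = ["close"]   # which data the factor needs (assumed string keys)
    window_length = 3    # how many trailing days each compute() call sees

    def compute(self, today, assets, out, prices):
        out[:] = -prices[-1] / np.mean(prices, axis=0)


def run_toy_engine(factor, data, dates, assets):
    """Call factor.compute once for every day that has a full trailing window."""
    results = {}
    for end in range(factor.window_length, len(dates) + 1):
        # One (window_length, n_assets) slice per entry in factor.inputs.
        windows = [data[name][end - factor.window_length:end]
                   for name in factor.inputs]
        # The engine, not the factor, allocates the output array.
        out = np.full(len(assets), np.nan)
        today = dates[end - 1]
        factor.compute(today, assets, out, *windows)
        results[today] = out
    return results


dates = ["d1", "d2", "d3", "d4"]
assets = np.array([1, 2])
data = {"close": np.array([[10.0, 20.0],
                           [12.0, 20.0],
                           [14.0, 20.0],
                           [16.0, 20.0]])}

results = run_toy_engine(ToyReversion(), data, dates, assets)
for day, values in results.items():
    print(day, values)
```

The point of the sketch is only that inputs and window_length are read by the code that drives the factor, never by compute itself.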
Concretely, the engine expects that each expression has a compute method with
a signature like:
def compute(self, today, assets, out, input1, input2, ..., inputN):
The first four arguments are always the same:
self is the CustomFactor instance in question (e.g. reversion in the
snippets above). This is how methods work in Python in general.
today is a pandas Timestamp representing the day on which compute is
being called.
assets is a 1-dimensional numpy array containing an integer for every
tradeable asset on today.
out is a 1-dimensional numpy array of the same shape as assets. The
contract of compute is that it should write the result of its computation
into out.
The remaining parameters are 2-D numpy arrays with shape (window_length, len(assets)).
Each of these parameters corresponds to an entry in the expression's inputs
list. In the case of Reversion, we only have a single input,
USEquityPricing.close, so there's only one extra parameter, prices, which
contains a 60 x len(assets) array containing 60 days of trailing close prices
for every asset that existed on today.
One unusual feature of compute is that it's expected to write its computed
results into out. Having functions "return" by mutating inputs is common in
low level languages like C or Fortran, but it's rare in Python and generally
considered non-idiomatic. compute writes its outputs into out partly for
performance reasons (we can avoid extra copying of large arrays in some cases),
and partly to make it so that CustomFactor implementors don't need to worry
about constructing output arrays with correct shapes and dtypes, which can be
tricky in more complex cases where a user has more than one return value.
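A small NumPy sketch of why the slice assignment matters. The two function names and the 3 x 2 price array are invented for illustration; only the expression inside them comes from the question.

```python
import numpy as np


def compute_in_place(out, prices):
    # Slice assignment copies the values into the caller's array.
    out[:] = -prices[-1] / np.mean(prices, axis=0)


def compute_rebinding(out, prices):
    # Plain assignment only rebinds the local name `out`;
    # the caller's array is left unchanged.
    out = -prices[-1] / np.mean(prices, axis=0)


# 3 days (rows) of prices for 2 assets (columns).
prices = np.array([[10.0, 20.0],
                   [12.0, 20.0],
                   [14.0, 20.0]])

filled = np.zeros(2)
compute_in_place(filled, prices)

untouched = np.zeros(2)
compute_rebinding(untouched, prices)

print(filled)     # one value per asset, written by compute_in_place
print(untouched)  # still all zeros
```

This is ordinary NumPy and Python name-binding behavior, so nothing here is specific to Quantopian.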

The way you presented it, that compute method might as well be static, as it's not using anything from within the Reversion class, unless whatever out is implicitly uses a predefined CustomFactor class when slicing/setting its elements. Also, since they don't share their source code, we can only guess how exactly the quantopian.pipeline.CustomFactor class is implemented and used internally, so you won't be getting a 100% correct answer, but we can split it into two parts and explain it using only Python natives.
The first is setting something to a sequence slice, which is what happens within the compute() method - that is a special sequence (a Pandas data frame most likely, but we'll stick to how it basically operates) that has its __setslice__() magic method overridden (Python 2 only; Python 3 routes slice assignment through __setitem__()) so that it doesn't produce the expected result - the expected in this case being replacing each of the elements in out with a given sequence, e.g.:
my_list = [1, 2, 3, 4, 5]
print(my_list) # [1, 2, 3, 4, 5]
my_list[:] = [5, 4, 3, 2, 1]
print(my_list) # [5, 4, 3, 2, 1]
But in that example, the right hand side doesn't necessarily produce a same-sized sequence as out so it most likely does calculations with each of the out elements and updates them in the process. You can create such a list like:
class InflatingList(list):  # I recommend extending collections.MutableSequence instead
    def __setslice__(self, i, j, value):
        for x in range(i, min(len(self), j)):
            self[x] += value
So now when you use it it would appear, hmm, non-standard:
test_list = InflatingList([1, 2, 3, 4, 5])
print(test_list) # [1, 2, 3, 4, 5]
test_list[:] = 5
print(test_list) # [6, 7, 8, 9, 10]
test_list[2:4] = -3
print(test_list) # [6, 7, 5, 6, 10]
The second part purely depends on where else the Reversion class (or any other derivate of CustomFactor) is used - you don't have to explicitly use class properties for them to be useful to some other internal structure. Consider:
class Factor(object):
    scale = 1.0
    correction = 0.5

    def compute(self, out, inflate=1.0):
        out[:] = inflate

class SomeClass(object):
    def __init__(self, factor, data):
        assert isinstance(factor, Factor), "`factor` must be an instance of `Factor`"
        self._factor = factor
        self._data = InflatingList(data)

    def read_state(self):
        return self._data[:]

    def update_state(self, inflate=1.0):
        self._factor.compute(self._data, self._factor.scale)
        self._data[:] = -self._factor.correction + inflate
So, while Factor doesn't directly use its scale/correction variables, some other class might. Here's what happens when you run it through its cycles:
test = SomeClass(Factor(), [1, 2, 3, 4, 5])
print(test.read_state()) # [1, 2, 3, 4, 5]
test.update_state()
print(test.read_state()) # [2.5, 3.5, 4.5, 5.5, 6.5]
test.update_state(2)
print(test.read_state()) # [5.0, 6.0, 7.0, 8.0, 9.0]
But now you get the chance to define your own Factor that SomeClass uses, so:
class CustomFactor(Factor):
    scale = 2.0
    correction = -1

    def compute(self, out, inflate=1.0):
        out[:] = -inflate  # deflate instead of inflate
Can give you different results for the same input data:
test = SomeClass(CustomFactor(), [1, 2, 3, 4, 5])
print(test.read_state()) # [1, 2, 3, 4, 5]
test.update_state()
print(test.read_state()) # [1.0, 2.0, 3.0, 4.0, 5.0]
test.update_state(2)
print(test.read_state()) # [2.0, 3.0, 4.0, 5.0, 6.0]
[Opinion time] I'd argue that this structure is badly designed and whenever you encounter a behavior that's not really expected, chances are that somebody was writing a solution in search of a problem that serves only to confuse the users and signal that the writer is very knowledgeable since he/she can bend the behavior of a system to their whims - in reality, the writer is most likely a douche who wastes everybody's valuable time so that he/she can pat him/herself on the back. Both Numpy and Pandas, while great libraries on their own, are guilty of that - they're even worse offenders because a lot of people get introduced to Python by using those libraries and then when they want to step out of the confines of those libraries they find themselves wondering why my_list[2, 5, 12] doesn't work...

Related

How to ensure good, different initial NumPy MT19937 states?

Introduction - legacy NumPy
The legacy NumPy code of initializing MT19937 instances (same as on Wikipedia) ensured that different seed values lead to different initial states (or at least if a single int is provided). Let's check the first 3 numbers in the prng's state:
np.random.seed(3)
np.random.get_state()[1][:3]
# gives array([ 3, 1142332464, 3889748055], dtype=uint32)
np.random.seed(7)
np.random.get_state()[1][:3]
# gives array([ 7, 4097098180, 3572822661], dtype=uint32)
However, this method is criticized for 2 reasons:
seed size is limited by the underlying type, uint32
similar seeds may result in similar random numbers
The former can be solved if one can provide a sequence of int (which is indeed implemented, but how?), but the latter is harder to address. The implementation of the legacy code has been written keeping this property in mind [^1].
Introduction - new NumPy random
In the new implementation, the seed value provided is hashed first, then used to feed the initial state of the MT19937. This hashing ensures that
the similarity of the seed values doesn't matter: two similar seed values produce different initial states with the same probability as non-similar seed values. Previously we saw that, for adjacent seed values, the first state variable (out of 600+) is similar. In the new implementation, by contrast, not a single similar value can be found (with high probability), except for the first one for some reason:
prng = np.random.Generator(np.random.MT19937(3))
prng.bit_generator.state["state"]["key"][:3]
# gives array([2147483648, 2902887791, 607385081], dtype=uint32)
prng = np.random.Generator(np.random.MT19937(7))
prng.bit_generator.state["state"]["key"][:3]
# gives array([2147483648, 3939563265, 4185785210], dtype=uint32)
Two different seed values (ints of any length) may result in a similar initial state with a probability of $2^{-128}$ (by default).
If the problem of similar seeds has already been solved by Matsumoto et al. [^1], then there was no need to use a hash function, which introduces the state collision problem.
Question
Given the new implementation in NumPy, is there a good practice that ensures the different initial states of the MT19937 instances and passes quality requirements when it comes to similar seed values? I am looking for an initialization method that consumes at least 64 bits.
How about modifying the generate_state output of the SeedSequence class: if two ints are given, replace the first 2 states (maybe except the first one) with the given seed values themselves:
class secure_SeedSequence(np.random.SeedSequence):
    def __init__(self, seed1: np.uint32, seed2: np.uint32):
        self.seed1 = seed1
        self.seed2 = seed2

    def generate_state(self, n_words, dtype):
        ss = np.random.SeedSequence([self.seed1, self.seed2])
        states = ss.generate_state(n_words, dtype)
        states[1] = self.seed1
        states[2] = self.seed2
        return states

ss_a = secure_SeedSequence(3, 1)
prng_a = np.random.Generator(np.random.MT19937(ss_a))
# gives [2147483648 3 1 354512857 3507208499 1943065218]
ss_b = secure_SeedSequence(3, 2)
prng_b = np.random.Generator(np.random.MT19937(ss_b))
# gives [2147483648 3 2 2744275888 1746192816 3474647608]
Here secure_SeedSequence consumes 2*32=64 bits, prng_a and prng_b are in different states, and except for the first 3 state variables, all the state variables are not alike. According to Wikipedia, the first 2 numbers may have some correlation with the 2 first state-variables, but after generating 624 random numbers, the next internal state won't reflect the initial seeds anymore. To avoid this problem, the code can be improved by skipping the first 2 random numbers.
Workaround
One can claim that the chance that two MT19937 instances will have the same state after providing different entropy for their SeedSequence is arbitrarily low; by default, it is $2^{-128}$. But I am looking for a solution that ensures with 100% probability that the initial states are different, not only with probability $1-2^{-32\cdot N}$.
Moreover, my concern with this calculation is that although the chance of getting garbage streams is low, once we have them, they produce garbage output forever. Therefore, if streams of length $M$ are generated and $N$ streams/prngs are used, then by selecting $M$ numbers from this $M \times N$ 2D array, the chance that a number is garbage tends to 1.
Why did I ask it here?
this is strongly related to a given implementation in NumPy
the chance that I get an answer is the highest here
I think this is a common issue and I hope others have investigated this topic deeply already
[^1]: Common Defects in Initialization of Pseudorandom Number Generators, MAKOTO MATSUMOTO et al., around equation 30.

How does Python's membership function work?

Can someone explain what fuzz.trimf(x, [0, 5, 10]) takes in? The first argument is the range array, in this case x, but what is [0, 5, 10] for? Please explain.
The purpose of a membership function is to generalize a function using valuation.
In the case of the trimf() function, the membership function being created is triangularly shaped. In order to determine the bounds of the generalization being created based on the actual data, the user must input scalars as constraints on how large or small the user wants the generalization to be.
Those scalars are the second parameter of the trimf() function and are represented by the list [0, 5, 10].
If you are familiar with the underlying math, the attached image shows the equation used to determine the value of the membership function.
In the attached image,
The a would be your 0
The b would be your 5
And the c would be your 10.
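Since the image is not reproduced here, the following is a minimal sketch of the standard triangular membership formula, max(min((x - a) / (b - a), (c - x) / (c - b)), 0), written in plain Python. It assumes a < b < c and is not scikit-fuzzy's own implementation; triangular_membership is a name chosen for this example.

```python
def triangular_membership(x, abc):
    """Degree of membership of x in a triangle with feet a, c and peak b.

    Assumes a < b < c, so neither slope divides by zero.
    """
    a, b, c = abc
    rising = (x - a) / (b - a)    # climbs from 0 at a to 1 at b
    falling = (c - x) / (c - b)   # drops from 1 at b to 0 at c
    return max(min(rising, falling), 0.0)


xs = [0, 2.5, 5, 7.5, 10, 12]
memberships = [triangular_membership(x, [0, 5, 10]) for x in xs]
print(memberships)  # [0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
```

So with [0, 5, 10] the membership is 0 at and below 0, rises linearly to 1 at 5, and falls back to 0 at 10 and beyond.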

Finding an abstraction for repetitive code: Bootstrap analysis

Intro
There is a pattern that I use all the time in my Python code which analyzes
numerical data. All implementations seem overly redundant or very cumbersome or
just do not play nicely with NumPy functions. I'd like to find a better way to
abstract this pattern.
The Problem / Current State
A method of statistical error propagation is the bootstrap method. It works by
running the same analysis many times with slightly different inputs and looking
at the distribution of final results.
To compute the actual value of ams_phys, I have the following equation:
ams_phys = (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
All the values that go into that equation have a statistical error associated
with them. These values are also computed from other equations. For instance,
amk_phys is computed from this equation, where both numbers also have
uncertainties:
amk_phys = mk_phys / a_inv
The value of mk_phys is given as (494.2 ± 0.3) in a paper. What I now do is
parametric bootstrap and generate R samples from a Gaussian distribution
with mean 494.2 and standard deviation 0.3. This is what I store in
mk_phys_dist:
mk_phys_dist = bootstrap.make_dist(494.2, 0.3, R)
The same is done for a_inv, which is also quoted with an error in the
literature. The above equation is then converted into a list comprehension to
yield a new distribution:
amk_phys_dist = [mk_phys / a_inv
                 for a_inv, mk_phys in zip(a_inv_dist, mk_phys_dist)]
The first equation is then also converted into a list comprehension:
ams_phys_dist = [
    (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
    for ampi_phys, amk_phys, aB, amcr
    in zip(ampi_phys_dist, amk_phys_dist, aB_dist, amcr_dist)]
To get the end result in terms of (Value ± Error), I then take the average and
standard deviation of this distribution of numbers:
ams_phys_val, ams_phys_avg, ams_phys_err \
    = bootstrap.average_and_std_arrays(ams_phys_dist)
The actual value is supposed to be computed with the actual value coming in,
not the mean of this bootstrap distribution. Before I had the code replicated
for that, now I have the original value at the 0th position in the _dist
arrays. The arrays now contain 1 + R elements and the
bootstrap.average_and_std_arrays function will separate that element.
This kind of line occurs for every number that I might want to quote in my
writing. I got annoyed by the writing and created a snippet for it:
$1_val, $1_avg, $1_err = bootstrap.average_and_std_arrays($1_dist)
The need for the snippet strongly told me that I need to do some refactoring.
Also the list comprehensions are always of the following pattern:
foo_dist = [ ... bar ...
            for bar in bar_dist]
It feels bad to write bar three times there.
The Class Approach
I have tried to make those _dist things a Boot class such that I would not
write ampi_dist and ampi_val but could just use ampi.val without having
to explicitly call this average_and_std_arrays functions and type a bunch of
names for it.
class Boot(object):
    def __init__(self, dist):
        self.dist = dist

    def __str__(self):
        return str(self.dist)

    @property
    def cen(self):
        return self.dist[0]

    @property
    def val(self):
        x = np.array(self.dist)
        return np.mean(x[1:,], axis=0)

    @property
    def err(self):
        x = np.array(self.dist)
        return np.std(x[1:,], axis=0)
However, this still does not solve the problem of the list comprehensions. I
fear that I still have to repeat myself there three times. I could make the
Boot object inherit from list, such that I could at least write it like
this (without the _dist):
bar = Boot([... foo ... for foo in foo])
Magic Approach
Ideally all those list comprehensions would be gone such that I could just
write
bar = ... foo ...
where the dots mean some non-trivial operation. Those can be simple arithmetic
as above, but they could also be a function call to something that does not
support being called with multiple values (as NumPy functions do).
For instance, the scipy.optimize.curve_fit function needs to be called a bunch of times:
popt_dist = [op.curve_fit(linear, mpi, diff)[0]
             for mpi, diff in zip(mpi_dist, diff_dist)]
One would have to write a wrapper for that because it does not automatically loop over lists of arrays.
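One possible shape for such a wrapper, as a rough sketch only: boot_map and average_and_std are hypothetical helpers written for this example (they are not the functions in my bootstrap module), the numbers are made up, and the layout assumed is the one described above, with the central value at position 0 followed by the R samples.

```python
def boot_map(func, *dists):
    """Apply func sample by sample across parallel bootstrap distributions."""
    return [func(*samples) for samples in zip(*dists)]


def average_and_std(dist):
    """Split off the central value at index 0, then summarize the R samples."""
    central, samples = dist[0], dist[1:]
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return central, mean, variance ** 0.5


# Made-up distributions: central value first, then R = 2 samples.
mk_phys_dist = [494.2, 494.0, 494.4]
a_inv_dist = [2.0, 2.0, 2.0]

# The variable names appear once in the lambda instead of three times.
amk_phys_dist = boot_map(lambda mk_phys, a_inv: mk_phys / a_inv,
                         mk_phys_dist, a_inv_dist)
amk_phys_val, amk_phys_avg, amk_phys_err = average_and_std(amk_phys_dist)
print(amk_phys_val, amk_phys_avg, amk_phys_err)
```

Because func is arbitrary, the same helper would cover a call such as curve_fit, at the cost of wrapping each expression in a function.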
Question
Do you see a way to abstract this process of running every transformation with
1 + R sets of data? I would like to get rid of those patterns and the huge
number of variables in each namespace (_dist, _val, _avg, ...), as this
makes passing them to functions rather tedious.
Still I need to have a lot of freedom in the ... foo ... part where I need to
call arbitrary functions.

Python: nearest neighbour (or closest match) filtering on data records (list of tuples)

I am trying to write a function that will filter a list of tuples (mimicking an in-memory database), using a "nearest neighbour" or "nearest match" type algorithm.
I want to know the best (i.e. most Pythonic) way to go about doing this. The sample code below hopefully illustrates what I am trying to do.
datarows = [(10,2.0,3.4,100),
            (11,2.0,5.4,120),
            (17,12.9,42,123)]
filter_record = (9,1.9,2.9,99) # record that we are seeking to retrieve from 'database' (or nearest match)
weights = (1,1,1,1) # weights to apportion to each field in the filter

def get_nearest_neighbour(data, criteria, weights):
    for row in data:
        # calculate 'distance metric' (e.g. simple differencing) and multiply by relevant weight
        # determine the row which was either an exact match or was 'least dissimilar'
        # return the match (or nearest match)
        pass

if __name__ == '__main__':
    result = get_nearest_neighbour(datarows, filter_record, weights)
    print(result)
For the snippet above, the output should be:
(10,2.0,3.4,100)
since it is the 'nearest' to the sample data passed to the function get_nearest_neighbour().
My question then is, what is the best way to implement get_nearest_neighbour()? For the purpose of brevity etc., assume that we are only dealing with numeric values, and that the 'distance metric' we use is simply an arithmetic subtraction of the input data from the current row.
Simple out-of-the-box solution:
import math

def distance(row_a, row_b, weights):
    diffs = [math.fabs(a-b) for a,b in zip(row_a, row_b)]
    return sum([v*w for v,w in zip(diffs, weights)])

def get_nearest_neighbour(data, criteria, weights):
    def sort_func(row):
        return distance(row, criteria, weights)
    return min(data, key=sort_func)
If you need to work with huge datasets, you should consider switching to NumPy and SciPy and using SciPy's KDTree (scipy.spatial.KDTree) to find nearest neighbours. The advantage is not only that it uses a more advanced algorithm, but also that the search runs in optimized compiled code rather than in a Python loop.
About naive-NN:
Many of these other answers propose "naive nearest-neighbor", which is an O(N*d)-per-query algorithm (d is the dimensionality, which in this case seems constant, so it's O(N)-per-query).
While an O(N)-per-query algorithm is pretty bad, you might be able to get away with it, if you have less than any of (for example):
10 queries and 100000 points
100 queries and 10000 points
1000 queries and 1000 points
10000 queries and 100 points
100000 queries and 10 points
Doing better than naive-NN:
Otherwise you will want to use one of the techniques (especially a nearest-neighbor data structure) listed in:
http://en.wikipedia.org/wiki/Nearest_neighbor_search (most likely linked off from that page), some examples linked:
http://en.wikipedia.org/wiki/K-d_tree
http://en.wikipedia.org/wiki/Locality_sensitive_hashing
http://en.wikipedia.org/wiki/Cover_tree
especially if you plan to run your program more than once. There are most likely libraries available. Not using a NN data structure would take too much time if you have a large product of #queries * #points. As user 'dsign' points out in comments, you can probably squeeze out a large additional constant factor of speed by using the numpy library.
However if you can get away with using the simple-to-implement naive-NN though, you should use it.
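Use heapq.nsmallest on a generator calculating the weighted distance for each record (nsmallest rather than nlargest, since the nearest records are the ones with the smallest distances).
Something like:
heapq.nsmallest(N, ((row, dist_function(row, criteria, weight)) for row in data), key=operator.itemgetter(1))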
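A runnable version of the heapq idea, as a sketch under the question's assumptions (numeric fields, weighted absolute differences). Because the nearest rows are those with the smallest distance, it uses heapq.nsmallest; weighted_distance and nearest_rows are names chosen for this example.

```python
import heapq


def weighted_distance(row, criteria, weights):
    """Sum of weighted absolute differences between two records."""
    return sum(w * abs(a - b) for a, b, w in zip(row, criteria, weights))


def nearest_rows(data, criteria, weights, n=1):
    """Return the n rows with the smallest weighted distance, nearest first."""
    return heapq.nsmallest(
        n, data, key=lambda row: weighted_distance(row, criteria, weights))


datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

nearest = nearest_rows(datarows, filter_record, weights, n=2)
print(nearest[0])  # (10, 2.0, 3.4, 100)
```

For n=1 this is equivalent to min(data, key=...); nsmallest only pays off when several nearest matches are wanted.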

Get function given a list of values

Is there a way that I can give python a list of values like [ 1, 3, 4.5, 1] and obtain a function that relates to those values? like y = 3x+4 or something like that?
I don't want to plot it or anything, I just want to substitute values in that function and see what the result would be.
edit: is there a way that python can calculate how the data is related? like if I give it the list containing thousands of values and it returns me the function that was adjusted to those values.
Based on your comments to David Heffernan's answer,
I want is to know what the relation between the values is, I have thousands of values stored in a list and I want to know if python can tell me how they are related..
it seems like you are trying to do a regression analysis (probably a linear regression) and fit the values.
You can use NumPy for linear regression analysis in Python. Here is a sample from the NumPy cookbook.
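Since the cookbook sample itself is not reproduced here, the following is a minimal sketch of a straight-line fit with np.polyfit. It assumes the x values are simply the positions 1, 2, 3, ... of the list entries, which the question does not state.

```python
import numpy as np

y = np.array([1, 3, 4.5, 1])
x = np.arange(1, len(y) + 1)   # assumed x values: 1, 2, 3, 4

# Least-squares fit of a degree-1 polynomial: y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Wrap the coefficients in a callable so new values can be substituted.
fitted = np.poly1d([slope, intercept])
prediction = fitted(5)

print(slope, intercept, prediction)
```

Raising deg fits a higher-order polynomial instead; which model is appropriate depends on what is known about the data.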
Yes, the function is called map().
def y(x):
    return 3*x+4
map(y, [1,3,4.5,1])
The map() function applies the function to every item and returns a list of the results (in Python 3 it returns an iterator, so wrap the call in list() to get a list).
Based on your revised question, I'm going to go ahead and add an answer. No, there is no such function. I imagine you're unlikely to find a function that comes close in any programming language. Your definitions aren't tight enough for anything to be reasonable yet. If we take a simple case with only two input integers you can have all sorts of relationships:
[10, 1]
possible relationships:
def x(y):
    return y ** 0
def x(y):
    return y / 10
def x(y):
    return y % 10 + 1
... ... repeat. Admittedly, some of those are arbitrary, but they are valid relationships between the first and second values in the array you passed in. The possibilities for "solutions" become even more absurd as you ask for a relationship between 10, 15, or 35 numbers.
I assume you want to find out whether the sequences [1, 2, 3, 4] and [ 1, 3, 4.5, 1] (or else the pairs [(1, 1), (2, 3), (3, 4.5), (4, 1)]) are related by a (linear) function or not.
Try to plot these and see if they form something that looks like a (straight) line or not.
You can also look for correlation techniques. Check this site with basic statistics material (look down the page for correlation): Basic Statistics
What you're looking for is called "statistical regression". There are many methods by which you might do this; here's a site that might help: Least Squares Regression but ultimately, this is a field to which many books have been devoted. There's polynomial regressions, trig regressions, logarithmic...you'll have to know something about your data before you decide which model you apply; if you don't have any knowledge of what the dataset will look like before you process it, I'd suggest comparing the residuals of whatever you get and choosing the one with the lowest sum.
Short answer: No, no function.
