I was playing around in Python. I used the following code in IDLE:
p = [1, 2]
p[1:1] = [p]
print p
The output was:
[1, [...], 2]
What is this […]? Interestingly I could now use this as a list of list of list up to infinity i.e.
p[1][1][1]....
I could write the above as long as I wanted and it would still work.
EDIT:
How is it represented in memory?
What's its use? Examples of some cases where it is useful would be helpful.
Any link to official documentation would be really useful.
This is what your code created
It's a list where the first and last elements point to two numbers (1 and 2) and the middle element points to the list itself.
In Common Lisp, when printing of circular structures is enabled, such an object would be printed as
#1=#(1 #1# 2)
meaning that there is an object (labelled 1 with #1=) that is a vector with three elements, the second being the object itself (back-referenced with #1#).
In Python instead you just get the information that the structure is circular with [...].
In this specific case the description is not ambiguous (the back-reference points to a list, and since there is only one list it must be that one). In other cases, however, it may be ambiguous... for example in
[1, [2, [...], 3]]
the backward reference could either point to the outer or to the inner list.
These two different structures printed in the same way can be created with
x = [1, [2, 3]]
x[1][1:1] = [x[1]]
y = [1, [2, 3]]
y[1][1:1] = [y]
print(x)
print(y)
and they would be in memory as
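Even though x and y print identically, an identity check can tell the two structures apart:

```python
x = [1, [2, 3]]
x[1][1:1] = [x[1]]   # the inner list now contains itself
y = [1, [2, 3]]
y[1][1:1] = [y]      # the inner list now contains the outer list

print(x)  # [1, [2, [...], 3]]
print(y)  # [1, [2, [...], 3]]

# the reprs are identical, but `is` distinguishes the back-references
print(x[1][1] is x[1])  # True: points to the inner list
print(y[1][1] is y)     # True: points to the outer list
```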
It means that you created an infinitely nested list that cannot be printed in full. p contains p, which contains p... and so on. The [...] notation is a way to let you know this, and to inform you that it can't be represented literally! Take a look at #6502's answer to see a nice picture showing what's happening.
Now, regarding the three new items after your edit:
This answer seems to cover it
Ignacio's link describes some possible uses
This is more a topic of data structure design than programming languages, so it's unlikely that any reference is found in Python's official documentation
To the question "What's its use", here is a concrete example.
Graph reduction is an evaluation strategy sometimes used to interpret a computer language. This is a common strategy for lazy evaluation, notably in functional languages.
The starting point is to build a graph representing the sequence of "steps" the program will take. Depending on the control structures used in that program, this might lead to a cyclic graph (because the program contains some kind of "forever" loop, or uses recursion whose "depth" will be known at evaluation time, but not at graph-creation time)...
In order to represent such a graph, you need infinite "data structures" (sometimes called recursive data structures), like the one you noticed. Usually, a little bit more complex though.
If you are interested in that topic, here is (among many others) a lecture on that subject: http://undergraduate.csse.uwa.edu.au/units/CITS3211/lectureNotes/14.pdf
We do this all the time in object-oriented programming. If any two objects refer to each other, directly or indirectly, they are both infinitely recursive structures (or both part of the same infinitely recursive structure, depending on how you look at it). That's why you don't see this much in something as primitive as a list -- because we're usually better off describing the concept as interconnected "objects" than an "infinite list".
You can also get ... with an infinitely recursive dictionary. Let's say you want a dictionary of the corners of a triangle, where each value is a dictionary of the other corners connected to that corner. You could set it up like this:
a = {}
b = {}
c = {}
triangle = {"a": a, "b": b, "c": c}
a["b"] = b
a["c"] = c
b["a"] = a
b["c"] = c
c["a"] = a
c["b"] = b
Now if you print triangle (or a, b, or c for that matter), you'll see it's full of {...} because any two corners are referring back to each other.
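Following the links programmatically confirms the cycle (using the names from the snippet above):

```python
a, b, c = {}, {}, {}
triangle = {"a": a, "b": b, "c": c}
a["b"] = b; a["c"] = c
b["a"] = a; b["c"] = c
c["a"] = a; c["b"] = b

# walking a -> b -> a loops back to the very same object
print(triangle["a"]["b"]["a"] is a)  # True
# the built-in printer marks the back-references with {...}
print("{...}" in repr(triangle))     # True
```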
As I understood it, this is an example of a fixed point:
p = [1, 2]
p[1:1] = [p]
f = lambda x: x[1]
f(p) == p     # True
f(f(p)) == p  # True
Can someone explain what this fuzz.trimf(x, [0, 5, 10]) membership function is taking in? The first argument is the range array, which in this case is x, but what is the [0, 5, 10] for? Please explain.
The purpose of membership functions is to generalize a function using valuation.
In the case of the trimf() function, the membership function being created is triangularly shaped. In order to determine the bounds of the generalization being created based on the actual data, the user must input scalars as constraints on how large or small the user wants the generalization to be.
Those scalars are the second parameter of the trimf() function and are represented by the list [0, 5, 10].
If you are familiar with the underlying math, the attached image shows the equation used to determine the value of the membership function.
In the attached image,
The a would be your 0
The b would be your 5
And the c would be your 10.
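If it helps to see the math in code, here is a minimal NumPy sketch of a triangular membership function (my own illustration of the formula, not skfuzzy's actual implementation):

```python
import numpy as np

def trimf_sketch(x, abc):
    """Triangular membership: 0 at a, rising to 1 at b, falling back to 0 at c."""
    a, b, c = abc
    rising = (x - a) / (b - a)    # left edge of the triangle
    falling = (c - x) / (c - b)   # right edge of the triangle
    return np.maximum(np.minimum(rising, falling), 0)

x = np.arange(0, 11)
print(trimf_sketch(x, [0, 5, 10]))
# membership is 0.0 at x=0, reaches 1.0 at the peak x=5, and is 0.0 again at x=10
```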
Suppose I have two categorical pandas.Series like so:
> series_1 = pandas.Categorical(
      ["A", "B", "C", "A", "C"],
      categories=["A", "B", "C"]
  )
> series_2 = pandas.Categorical(
      [1, 2, 3, 1, 3],
      categories=[1, 3, 2]
  )
So, the two series have the exact same informational content but differ only by how the categories were labeled. My goal is to test this very fast because I have a data frame with hundreds of such columns.
What I did so far was to calculate a contingency table with pandas.crosstab and check if it is a diagonal matrix (with np.diag(cont_table).sum() == cont_table.sum(), which is not perfect).
I could simply convert the labels into integers and always use the order of the first appearance to guarantee that corresponding labels would be assigned to the same integer, but I feel that this is such a basic task that surely pandas already have some way of doing this.
Hence the question is: is there a fast, simple way of doing this check with a few calls to pandas methods?
EDIT:
Changed to a different example that demonstrates the difficulty of the task more clearly, since some of the answers worked for the previous example but don't solve the general problem. Notice that I can't generally trust that the categories in the two series will be correctly paired in the same order of the corresponding labels.
In this gist there is code that generates random instances of this problem for testing candidate solutions. The code simply:
generates two numpy arrays with the same structure but different labels
creates two series from those labels and calls the .astype('category') method
It routinely generates cases where the categories are not in the same order.
Well, after banging my head against the documentation for a while it turns out that I can do this:
import numpy as np
import pandas as pd

def compare_categorical_series(series_1, series_2):
    values_1, *_ = pd.factorize(series_1)
    values_2, *_ = pd.factorize(series_2)
    return np.all(values_1 == values_2)
The factorize function transforms every entry into an integer, using the same integer for equal values. Of course, this alone is not enough: it also needs to assign those integers in the same order, irrespective of the actual labels.
Although this behavior is not spelled out in the pandas documentation, extensive testing suggests that the integers are assigned in the order the labels appear in the series, which is enough to guarantee the behavior needed for this application.
But since this behavior is not documented, it might change in the future, so it's good to have test cases in place to detect a possible change in behavior.
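Applied to the example from the question, the check looks like this (I convert to plain arrays first, an extra precaution on my part so the result does not depend on how factorize treats categorical inputs):

```python
import numpy as np
import pandas as pd

series_1 = pd.Categorical(["A", "B", "C", "A", "C"], categories=["A", "B", "C"])
series_2 = pd.Categorical([1, 2, 3, 1, 3], categories=[1, 3, 2])

# factorize assigns integers in order of first appearance, so two series
# with the same grouping structure yield identical code arrays
codes_1, _ = pd.factorize(np.asarray(series_1))
codes_2, _ = pd.factorize(np.asarray(series_2))
print(np.array_equal(codes_1, codes_2))  # True
```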
If you are confident that the two series share the same categories in the same order, you could just do:
series_match = (series_1 == series_2).all()  # True in that case
Note, however, that pandas raises a TypeError when comparing two Categoricals whose categories differ, so this shortcut does not apply to the example above.
From here: https://www.quantopian.com/posts/wsj-example-algorithm
class Reversion(CustomFactor):
    """
    Here we define a basic mean reversion factor using a CustomFactor. We
    take a ratio of the last close price to the average price over the
    last 60 days. A high ratio indicates a high price relative to the mean
    and a low ratio indicates a low price relative to the mean.
    """
    inputs = [USEquityPricing.close]
    window_length = 60

    def compute(self, today, assets, out, prices):
        out[:] = -prices[-1] / np.mean(prices, axis=0)
Reversion() seems to return a pandas.DataFrame, and I have absolutely no idea why.
For one thing, where is inputs and window_length used?
And what exactly is out[:]?
Is this specific behavior related to Quantopian in particular or Python/Pandas?
TL;DR
Reversion() doesn't return a DataFrame, it returns an instance of the
Reversion class, which you can think of as a formula for performing a
trailing window computation. You can run that formula over a particular time
period using either quantopian.algorithm.pipeline_output or
quantopian.research.run_pipeline, depending on whether you're writing a
trading algorithm or doing offline research in a notebook.
The compute method is what defines the "formula" computed by a Reversion
instance. It calculates a reduction over a 2D numpy array of prices, where
each row of the array corresponds to a day and each column of the array
corresponds to a stock. The result of that computation is a 1D array
containing a value for each stock, which is copied into out. out is also
a numpy array. The syntax out[:] = <expression> says "copy the values from
<expression> into out".
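The difference between writing through out[:] and rebinding the name out can be seen with plain numpy arrays (a standalone illustration; the function names are mine):

```python
import numpy as np

def fill_inplace(out):
    out[:] = [1.0, 2.0, 3.0]          # writes into the caller's array

def rebind(out):
    out = np.array([1.0, 2.0, 3.0])   # rebinds the local name; caller unaffected

a = np.zeros(3)
fill_inplace(a)
print(a)  # [1. 2. 3.]

b = np.zeros(3)
rebind(b)
print(b)  # [0. 0. 0.]
```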
compute writes its result directly into an output array instead of simply
returning because doing so allows the CustomFactor base class to ensure
that the output has the correct shape and dtype, which can be nontrivial for
more complex cases.
Having a function "return" by overwriting an input is unusual and generally
non-idiomatic Python. I wouldn't recommend implementing a similar API unless
you're very sure that there isn't a better solution.
All of the code in the linked example is open source and can be found in
Zipline, the framework on top of
which Quantopian is built. If you're interested in the implementation, the
following files are good places to start:
zipline/pipeline/engine.py
zipline/pipeline/term.py
zipline/pipeline/graph.py
zipline/pipeline/pipeline.py
zipline/pipeline/factors/factor.py
You can also find a detailed tutorial on the Pipeline API
here.
I think there are two kinds of answers to your question:
How does the Reversion class fit into the larger framework of a
Zipline/Quantopian algorithm? In other words, "how is the Reversion class
used"?
What are the expected inputs to Reversion.compute() and what computation
does it perform on those inputs? In other words, "What, concretely, does the
Reversion.compute() method do?"
It's easier to answer (2) with some context from (1).
How is the Reversion class used?
Reversion is a subclass of CustomFactor, which is part of Zipline's
Pipeline API. The primary purpose of the Pipeline API is to make it easy
for users to perform a certain special kind of computation efficiently over
many sources of data. That special kind of computation is a cross-sectional
trailing-window computation, which has the form:
Every day, for some set of data sources, fetch the last N days of data for all
known assets and apply a reduction function to produce a single value per
asset.
A very simple cross-sectional trailing-window computation would be something
like "close-to-close daily returns", which has the form:
Every day, fetch the last two days' of close prices and, for each asset,
calculate the percent change between the asset's previous day close price and
its current close price.
To describe a cross-sectional trailing-window computation, we need at least
three pieces of information:
On what kinds of data (e.g. price, volume, market cap) does the computation
operate?
On how long of a trailing window of data (e.g. 1 day, 20 days, 100 days)
does the computation operate?
What reduction function does the computation perform over the data described
by (1) and (2)?
The CustomFactor class defines an API for consolidating these three pieces of
information into a single object.
The inputs attribute describes the set of inputs needed to perform a
computation. In the snippet from the question, the only input is
USEquityPricing.close, which says that we just need trailing daily close
prices. In general, however, we can ask for any number of inputs. For
example, to compute VWAP (Volume-Weighted Average Price), we would use
something like inputs = [USEquityPricing.close, USEquityPricing.volume] to
say that we want trailing close prices and trailing daily volumes.
The window_length attribute describes the number of days of trailing data
required to perform a computation. In the snippet above we're requesting 60
days of trailing close prices.
The compute method describes the trailing-window computation to be
performed. In the section below, I've outlined exactly how compute performs
its computation. For now, it's enough to know that compute is essentially a
reduction function from some number of 2-dimensional arrays to a single
1-dimensional array.
You might notice that we haven't defined an actual set of dates on which we
might want to compute a Reversion factor. This is by design, since we'd like
to be able to use the same Reversion instance to perform calculations at
different points in time.
Quantopian defines two APIs for computing expressions like Reversion: an
"online" mode designed for use in actual trading algorithms, and a "batch" mode
designed for use in research and development. In both APIs, we first construct
a Pipeline object that holds all the computations we want to perform. We
then feed our pipeline object into a function that actually performs the
computations we're interested in.
In the batch API, we call run_pipeline passing our pipeline, a start date,
and an end date. A simple research notebook computing a custom factor might
look like this:
from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.research import run_pipeline

class Reversion(CustomFactor):
    # Code from snippet above.
    ...

reversion = Reversion()
pipeline = Pipeline({'reversion': reversion})
result = run_pipeline(pipeline, start_date='2014-01-02', end_date='2015-01-02')
do_stuff_with(result)
In a trading algorithm, we're generally interested in the most recently
computed values from our pipeline, so there's a slightly different API: we
"attach" a pipeline to our algorithm on startup, and we request the latest
output from the pipeline at the start of each day. A simple trading algorithm
using Reversion might look something like this:
import quantopian.algorithm as algo
from quantopian.pipeline import Pipeline, CustomFactor

class Reversion(CustomFactor):
    # Code from snippet above.
    ...

def initialize(context):
    reversion = Reversion()
    pipeline = Pipeline({'reversion': reversion})
    algo.attach_pipeline(pipeline, name='my_pipe')

def before_trading_start(context, data):
    result = algo.pipeline_output(name='my_pipe')
    do_stuff_with(result)
The most important thing to understand about the two examples above is that
simply constructing an instance of Reversion doesn't perform any
computation. In particular, the line:
reversion = Reversion()
doesn't fetch any data or call the compute method. It simply creates an
instance of the Reversion class, which knows that it needs 60 days of close
prices each day to run its compute function. Similarly,
USEquityPricing.close isn't a DataFrame or a numpy array or anything like
that: it's just a sentinel value that describes what kind of data Reversion
needs as an input.
One way to think about this is by an analogy to mathematics. An instance of
Reversion is like a formula for performing a calculation, and
USEquityPricing.close is like a variable in that formula.
Simply writing down the formula doesn't produce any values; it just gives us a
way to say "here's how to compute a result if you plug in values for all of
these variables".
We get a concrete result by actually plugging in values for our variables,
which happens when we call run_pipeline or pipeline_output.
So what, concretely, does Reversion.compute() do?
Both run_pipeline and pipeline_output ultimately boil down to calls to
PipelineEngine.run_pipeline, which is where actual computation happens.
To continue the analogy from above, if reversion is a formula, and
USEquityPricing.close is a variable in that formula, then PipelineEngine is
the grade school student whose homework assignment is to look up the value of
the variable and plug it into the formula.
When we call PipelineEngine.run_pipeline(pipeline, start_date, end_date), the
engine iterates through our requested expressions, loads the inputs for those
expressions, and then calls each expression's compute method once per trading
day between start_date and end_date with appropriate slices of the loaded
input data.
Concretely, the engine expects that each expression has a compute method with
a signature like:
def compute(self, today, assets, out, input1, input2, ..., inputN):
The first four arguments are always the same:
self is the CustomFactor instance in question (e.g. reversion in the
snippets above). This is how methods work in Python in general.
today is a pandas Timestamp representing the day on which compute is
being called.
assets is a 1-dimensional numpy array containing an integer for every
tradeable asset on today.
out is a 1-dimensional numpy array of the same shape as assets. The
contract of compute is that it should write the result of its computation
into out.
The remaining parameters are 2-D numpy arrays with shape (window_length, len(assets)).
Each of these parameters corresponds to an entry in the expression's inputs
list. In the case of Reversion, we only have a single input,
USEquityPricing.close, so there's only one extra parameter, prices, which
contains a 60 x len(assets) array containing 60 days of trailing close prices
for every asset that existed on today.
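A stripped-down sketch of what the engine does for a single day, in plain numpy (the names, shapes, and fake data here are purely illustrative, not Zipline's actual code):

```python
import numpy as np

window_length = 60
assets = np.array([101, 102, 103])  # hypothetical integer asset ids

# fake trailing close prices: one row per day, one column per asset
rng = np.random.default_rng(0)
prices = rng.uniform(10.0, 20.0, size=(window_length, len(assets)))

# the engine allocates the output buffer and hands it to compute
out = np.empty(len(assets))

def compute(today, assets, out, prices):
    # same body as Reversion.compute above
    out[:] = -prices[-1] / np.mean(prices, axis=0)

compute(None, assets, out, prices)
print(out.shape)  # (3,) -- one value per asset
```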
One unusual feature of compute is that it's expected to write its computed
results into out. Having functions "return" by mutating inputs is common in
low level languages like C or Fortran, but it's rare in Python and generally
considered non-idiomatic. compute writes its outputs into out partly for
performance reasons (we can avoid extra copying of large arrays in some cases),
and partly to make it so that CustomFactor implementors don't need to worry
about constructing output arrays with correct shapes and dtypes, which can be
tricky in more complex cases where a user has more than one return value.
The way you presented it, that compute method might as well be static, since it doesn't use anything from within the Reversion class (unless whatever out is implicitly relies on a predefined CustomFactor class when slicing/setting its elements). Also, since they don't share their source code, we can only guess how exactly the quantopian.pipeline.CustomFactor class is implemented and used internally, so you won't get a 100% correct answer; still, we can split the behavior into two parts and explain it using only Python natives.
The first part is assignment to a sequence slice, which is what happens within the compute() method. Here out is a special sequence (a Pandas structure most likely, but we'll stick to how it basically operates) whose slice-assignment magic method (__setslice__() in Python 2, __setitem__() with a slice in Python 3) is overridden so that it doesn't produce the expected result, the expected result being to replace each element of out with the elements of a given sequence, e.g.:
my_list = [1, 2, 3, 4, 5]
print(my_list) # [1, 2, 3, 4, 5]
my_list[:] = [5, 4, 3, 2, 1]
print(my_list) # [5, 4, 3, 2, 1]
But in that example, the right hand side doesn't necessarily produce a same-sized sequence as out so it most likely does calculations with each of the out elements and updates them in the process. You can create such a list like:
class InflatingList(list):  # consider extending collections.abc.MutableSequence instead
    # Python 3 routes slice assignment through __setitem__ (the Python 2
    # __setslice__ hook no longer exists), so that's what we override:
    def __setitem__(self, index, value):
        if isinstance(index, slice):
            for x in range(*index.indices(len(self))):
                list.__setitem__(self, x, self[x] + value)
        else:
            list.__setitem__(self, index, value)
So now when you use it it would appear, hmm, non-standard:
test_list = InflatingList([1, 2, 3, 4, 5])
print(test_list) # [1, 2, 3, 4, 5]
test_list[:] = 5
print(test_list) # [6, 7, 8, 9, 10]
test_list[2:4] = -3
print(test_list) # [6, 7, 5, 6, 10]
The second part purely depends on where else the Reversion class (or any other derivative of CustomFactor) is used; you don't have to explicitly use class properties for them to be useful to some other internal structure. Consider:
class Factor(object):
    scale = 1.0
    correction = 0.5

    def compute(self, out, inflate=1.0):
        out[:] = inflate

class SomeClass(object):
    def __init__(self, factor, data):
        assert isinstance(factor, Factor), "`factor` must be an instance of `Factor`"
        self._factor = factor
        self._data = InflatingList(data)

    def read_state(self):
        return self._data[:]

    def update_state(self, inflate=1.0):
        self._factor.compute(self._data, self._factor.scale)
        self._data[:] = -self._factor.correction + inflate
So, while Factor doesn't directly use its scale/correction variables, some other class might. Here's what happens when you run it through its cycles:
test = SomeClass(Factor(), [1, 2, 3, 4, 5])
print(test.read_state()) # [1, 2, 3, 4, 5]
test.update_state()
print(test.read_state()) # [2.5, 3.5, 4.5, 5.5, 6.5]
test.update_state(2)
print(test.read_state()) # [5.0, 6.0, 7.0, 8.0, 9.0]
But now you get the chance to define your own Factor that SomeClass uses, so:
class CustomFactor(Factor):
    scale = 2.0
    correction = -1

    def compute(self, out, inflate=1.0):
        out[:] = -inflate  # deflate instead of inflate
Can give you vastly different results for the same input data:
test = SomeClass(CustomFactor(), [1, 2, 3, 4, 5])
print(test.read_state())  # [1, 2, 3, 4, 5]
test.update_state()
print(test.read_state())  # [1.0, 2.0, 3.0, 4.0, 5.0]
test.update_state(2)
print(test.read_state())  # [2.0, 3.0, 4.0, 5.0, 6.0]
[Opinion time] I'd argue that this structure is badly designed. Whenever you encounter behavior that isn't really expected, chances are somebody was writing a solution in search of a problem, one that serves only to confuse users and to signal how knowledgeable the writer is for bending the behavior of a system to their whims; in reality, such cleverness mostly wastes everybody's valuable time. Both NumPy and Pandas, while great libraries on their own, are guilty of this, and they're even worse offenders because a lot of people are introduced to Python through those libraries and then, when they want to step outside their confines, find themselves wondering why my_list[2, 5, 12] doesn't work...