I am trying to understand why I get different output from two different functions in R versus the same(?) implementations in Python.
python:
def increment(n):
    n = n + 1
    print(n)
n = 1
increment(n)
print(n)
2
1
def increment2(x):
    x[0] = x[0] + 1
    print(x)
n = [1]
increment2(n)
print(n)
[2]
[2]
R:
increment <- function(n){
  n = n + 1
  print(n)
}
n = 1
increment(n)
2
print(n)
1
increment2 <- function(n){
  n[1] = n[1] + 1
  print(n)
}
n = c(1)
increment2(n)
2
print(n)
1
To me the R output seems more consistent: everything stays inside the function and does not leak outside (unless I return the result and assign it back to n). Can anyone give me a pythonic interpretation of it?
This can be interpreted in terms of object identity.
A list x in Python is like a pointer in that it has an identity independent of its contents: assigning a new value to an element changes the contents but not the identity of the list. A function is therefore free to change the contents of a list that was passed to it, and the caller sees those changes.
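For example, mutating a list's contents leaves its identity (as reported by id()) unchanged; a minimal sketch of that idea, not from the original answer:

x = [1]
print(id(x))
x[0] = x[0] + 1   # mutate the contents in place
print(id(x))      # same id: same list object, new contents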
A vector in R does not have an identity apart from its contents. Changing the contents inside a function creates a new vector, and the original vector is unchanged. R does have objects with identity independent of their contents -- they are called environments.
increment3 <- function(e){
  e$n = e$n + 1
  print(e$n)
}
e <- new.env()
e$n <- 1
increment3(e)
## [1] 2
print(e$n)
## [1] 2
In R, it is also possible to modify a vector in place using external C or C++ code. For example, see https://gist.github.com/ggrothendieck/53811e2769d0582407ae
I can't speak for how R passes parameters, but it's pretty common for programming languages (including Python) to have mutations on mutable objects be reflected outside of the function that performed the mutation. Java, C#, and other popular languages that support OOP (Object Oriented Programming) act this way too.
Lists like [1] are mutable objects, so you see that mutation outside of the function. This type of behavior makes object oriented programming much more convenient.
If this behavior is undesirable, consider using a functional programming style in python (immutable objects, map, filter, reduce) or passing copies of your mutable objects to your functions.
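For instance, passing a (shallow) copy keeps the caller's list unchanged; a small sketch reusing increment2 from the question:

def increment2(x):
    x[0] = x[0] + 1
    print(x)

n = [1]
increment2(list(n))   # the function mutates and prints a copy: [2]
print(n)              # [1] -- the original list is untouched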
I don't think there's much going on here that has to do with it being pythonic or not. It's a language mechanism: nothing more.
R is heavily influenced by functional languages, most notably Scheme. In functional languages, a "function" is understood just as in mathematics: it does not (and cannot) change its arguments, and its output depends only on its arguments (and nothing else).
# pseudocode
let x be 1
tell_me sin(x) # 0.841
tell_me x # still 1
It is inconceivable that sin(x) would commit a sin (from a functional perspective) and assign a new value to x.
R is not a purely functional language, however.
(1) You can (easily, and sometimes with bad consequences) access objects from within a function.
> rm(jumbo) # if you're not running this for the first time
> mumbo <- function() jumbo
> mumbo()
Error in mumbo() : object 'jumbo' not found
> jumbo <- 1
> mumbo()
[1] 1
[edit] There was an objection in a comment that some objects need to be visible from within a function. That is completely true; for example, one cannot possibly define arithmetical operations in every function, so the definition of + must be accessible ... but the difference is that in some languages you have explicit control over what is accessible and what is not. I'm not a Python expert, but I guess that's what is meant by
from jumbo import *
R has packages, which you can attach in a similar way, but the difference is that everything in your workspace is, by default, visible from within a function. This may be useful, but it is also dangerous: you may inadvertently refer to objects that you forgot to define within a function, and the code will silently do the wrong thing, as in the following example:
X <- 1e+10
addone <- function(x) X + 1 # want to increment the argument by 1
addone(3)
# [1] 1e+10
addone(3)==1e+10+1
# [1] TRUE
This is avoided in packages, so a function in a package cannot accidentally get values from your global workspace. And if you are so inclined, you can change the environment of your own functions as well. This might be a way to prevent such accidental errors (not necessarily a convenient way, though):
environment(mumbo) # .GlobalEnv
environment(mumbo) <- baseenv() # changing the environment
mumbo() # error: object 'jumbo' not found
[/edit]
(2) You can, if you want to, change outside objects from within a function, for example, with <<- (as opposed to <-):
> increment.n <- function(){
+ n <<- n + 1
+ print(n)
+ }
> increment.n()
Error in increment.n() : object 'n' not found
> n <- 1
> increment.n()
[1] 2
> n
[1] 2
>
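For comparison, the closest everyday Python analogue to R's <<- is probably rebinding a module-level name with the global statement; a rough sketch, not part of the original answer:

n = 1

def increment_n():
    global n       # rebind the module-level n, much like R's <<-
    n = n + 1
    print(n)

increment_n()   # prints 2
print(n)        # prints 2 -- the change is visible outside the function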
Related
I have noticed that it's common for beginners to have the following simple logical error. Since they genuinely don't understand the problem, a) their questions can't really be said to be caused by a typo (a full explanation would be useful); b) they lack the understanding necessary to create a proper example, explain the problem with proper terminology, and ask clearly. So, I am asking on their behalf, to make a canonical duplicate target.
Consider this code example:
x = 1
y = x + 2
for _ in range(5):
    x = x * 2 # so it will be 2 the first time, then 4, then 8, then 16, then 32
    print(y)
Each time through the loop, x is doubled. Since y was defined as x + 2, why doesn't it change when x changes? How can I make it so that the value is automatically updated, and I get the expected output
4
6
10
18
34
?
Declarative programming
Many beginners expect Python to work this way, but it does not. Worse, they may inconsistently expect it to work that way. Carefully consider this line from the example:
x = x * 2
If assignments were like mathematical formulas, we'd have to solve for x here. The only possible (numeric) value for x would be zero, since any other number is not equal to twice that number. And how should we account for the fact that the code previously says x = 1? Isn't that a contradiction? Should we get an error message for trying to define x two different ways? Or should we expect x to blow up to infinity, as the program keeps trying to double the old value of x?
Of course, none of those things happen. Like most programming languages in common use, Python is an imperative language, meaning that lines of code describe actions that occur in a defined order. Where there is a loop, the code inside the loop is repeated; where there is something like if/else, some code might be skipped; but in general, code within the same "block" simply happens in the order that it's written.
In the example, first x = 1 happens, so x is equal to 1. Then y = x + 2 happens, which makes y equal to 3 for the time being. This happened because of the assignment at that moment, not because of any ongoing relationship between y and x. Thus, when x changes later on in the code, that does not cause y to change.
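To see that concretely, here is a minimal illustration of the point above:

x = 1
y = x + 2   # y becomes 3, right now, because of this assignment
x = 10      # rebinding x later does not touch y
print(y)    # still 3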
Going with the (control) flow
So, how do we make y change? The simplest answer is: the same way that we gave it this value in the first place - by assignment, using =. In fact, thinking about the x = x * 2 code again, we already have seen how to do this.
In the example code, we want y to change multiple times - once each time through the loop, since that is where print(y) happens. What value should be assigned? It depends on x - the current value of x at that point in the process, which is determined by using... x. Just like how x = x * 2 checks the existing value of x, doubles it, and changes x to that doubled result, so we can write y = x + 2 to check the existing value of x, add two, and change y to be that new value.
Thus:
x = 1
for _ in range(5):
    x = x * 2
    y = x + 2
    print(y)
All that changed is that the line y = x + 2 is now inside the loop. We want that update to happen every time that x = x * 2 happens, immediately after that happens (i.e., so that the change is made in time for the print(y)). So, that directly tells us where the code needs to go.
Defining relationships
Suppose there were multiple places in the program where x changes:
x = x * 2
y = x + 2
print(y)
x = 24
y = x + 2
print(y)
Eventually, it will get annoying to remember to update y after every line of code that changes x. It's also a potential source of bugs, that will get worse as the program grows.
In the original code, the idea behind writing y = x + 2 was to express a relationship between x and y: we want the code to treat y as if it meant the same thing as x + 2, anywhere that it appears. In mathematical terms, we want to treat y as a function of x.
In Python, like most other programming languages, we express the mathematical concept of a function using something called... a function. In Python specifically, we use the def statement to write functions. It looks like:
def y(z):
    return z + 2
We can write whatever code we like inside the function, and when the function is "called", that code will run, much like our existing "top-level" code runs. When Python first encounters the block starting with def, though, it only creates a function from that code - it doesn't run the code yet.
So, now we have something named y, which is a function that takes in some z value and gives back (i.e., returns) the result of calculating z + 2. We can call it by writing something like y(x), which will give it our existing x value and evaluate to the result of adding 2 to that value.
Notice that the z here is the function's own name for the value that was passed in, and it does not have to match our own name for that value. In fact, we don't have to have our own name for that value at all: for example, we can write y(1), and the function will compute 3.
What do we mean by "evaluating to", or "giving back", or "returning"? Simply, the code that calls the function is an expression, just like 1 + 2, and when the value is computed, it gets used in place, in the same way. So, for example, a = y(1) will make a be equal to 3:
The function receives a value 1, calling it z internally.
The function computes z + 2, i.e. 1 + 2, getting a result of 3.
The function returns the result of 3.
That means that y(1) evaluated to 3; thus, the code proceeds as if we had put 3 where the y(1) is.
Now we have the equivalent of a = 3.
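Putting those steps together (assuming the y function defined above):

def y(z):
    return z + 2

a = y(1)    # the call y(1) evaluates to 3
print(a)    # 3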
For more about using functions, see How do I get a result (output) from a function? How can I use the result later?.
Going back to the beginning of this section, we can therefore use calls to y directly for our prints:
x = x * 2
print(y(x))
x = 24
print(y(x))
We don't need to "update" y when x changes; instead, we determine the value when and where it is used. Of course, we technically could have done that anyway: it only matters that y is "correct" at the points where it's actually used for something. But by using the function, the logic for the x + 2 calculation is wrapped up, given a name, and put in a single place. We don't need to write x + 2 every time. It looks trivial in this example, but y(x) would do the trick no matter how complicated the calculation is, as long as x is the only needed input. The calculation only needs to be written once: inside the function definition, and everything else just says y(x).
It's also possible to make the y function use the x value directly from our "top-level" code, rather than passing it in explicitly. This can be useful, but in the general case it gets complicated and can make code much harder to understand and prone to bugs. For a proper understanding, please read Using global variables in a function and Short description of the scoping rules?.
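A minimal sketch of that approach, for contrast (reading the global x instead of taking a parameter; it works, but the dependency on x is now hidden):

x = 1

def y():
    return x + 2   # reads whatever the global x is at call time

x = x * 2
print(y())   # 4
x = 24
print(y())   # 26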
I'm aware that, technically, you cannot know the length of a Python iterator without actually iterating through it.
The __length_hint__ method (i.e. it.__length_hint__()) returns an estimate of len(list(it)). There's even a wrapper around this method in the operator module, whose documentation says that the method "may over- or under-estimate by an arbitrary amount."
For finite iterators, what are the cases where __length_hint__ will be inaccurate? If this can't be known, why not?
I don't see any reference to this in PEP 424.
>>> obja = iter(range(98345984))
>>> obja.__length_hint__()
98345984
>>> import numpy as np
>>> objb = iter(np.arange(817483))
>>> objb.__length_hint__()
817483
I know it's not a great idea to rely on an implementation detail. But this is a detail that is already explicitly used in a top-level function of the operator module. Are there, for instance, specific data structures for which the hint is guaranteed to be accurate?
Basically, anything that is iterating over something that is generated dynamically, rather than iterating over a completed sequence.
Consider a simple iterator that flips a coin, with a head worth 1 point and a tail worth 2 points. It continues to flip the coin until you reach 4 points.
import random

def coinflip():
    s = 0
    while s < 4:
        x = random.choice([1, 2])
        s += x
        yield ("H" if x == 1 else "T")
How long will the sequence be? It could be as short as 2: TT. It could be as long as 4: either HHHH or HHHT. However, in the majority of cases it will be 3: HHT, HTH, HTT, THT or THH. In this case, 3 would be the "safest" guess, but that could be higher or lower.
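Since such a generator cannot know its own length, it simply does not provide __length_hint__; in Python 3.4+ the operator.length_hint wrapper then falls back to the supplied default. A small check, assuming the coinflip generator defined above:

import operator

print(operator.length_hint(coinflip(), -1))   # -1: a generator has no len() and no __length_hint__
print(operator.length_hint([1, 2, 3], -1))    # 3: a list knows its length exactly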
I am currently trying to learn OCaml, and I am searching for the equivalent of this Python code:
f(*l[:n])
I thought I'd try to write a function that emulates this behavior, but it doesn't work. Here is the code:
let rec arg_supply f xs n =
  if n = 0 then
    f
  else
    match xs with
    | x :: remainder -> arg_supply (f x) remainder (n - 1);;
And here is the error message I get:
Error: This expression has type 'a but an expression was expected of type
'b -> 'a
The type variable 'a occurs inside 'b -> 'a
Any help is appreciated, be it a way to get my function working, or another way to supply the first n elements of a list to a function as arguments.
Edit: n is the number of arguments needed to call the function, and is therefore constant.
Edit: This one doesn't work either.
type ('a, 'r) as_value = ASFunction of ('a -> ('a, 'r) as_value) | ASReturnValue of 'r;;

let rec _arg_supply f_or_rval xs n =
  if n = 0 then
    f_or_rval
  else
    match f_or_rval with
    | ASFunction func -> (
        match xs with
        | x :: r -> _arg_supply (func x) r (n - 1)
        | _ -> failwith "too few arguments for f"
      )
    | ASReturnValue out -> out;;
Your Python function f is being passed different numbers of arguments, depending on the value of n. This can't be expressed in OCaml. Functions take a statically fixed number of arguments (i.e., you can tell the number of arguments by reading the source code).
The way to pass different numbers of arguments to an OCaml function is to pass a list, corresponding to the Python code f(l[:n]).
(It's more accurate to say that an OCaml function takes one argument, but this is a discussion for another time.)
Update
If n is actually a constant, let's say it's 3. Then you can do something like the following:
match l with
| a :: b :: c :: _ -> f a b c
| _ -> failwith "too few arguments for f"
Update 2
Here's another way to look at what you want to do. You want to write a function, let's call it apply, that works like this:
let apply f xs =
. . .
The idea is that apply calls the function f, giving it arguments from the list xs.
OCaml is a strongly typed language, so we should be able to give a type for f and for xs. What is the type of f? You want it to be a function of n arguments, where n is the length of xs. I.e., n is not a constant! But there is no such type in OCaml. Any function type has a fixed, static number of parameters. Since there's no OCaml type for f, you can't write the function apply. You can write a series of functions apply1, apply2, and so on. Note that each of these functions has a different type. If you don't mind calling the correct one for each different function f, that would work.
It's fairly likely that you can restructure your problem to work with OCaml's typing rather than struggling with it. I stand by my comments on strong typing: once you get used to strong typing, it's very hard to give up the benefits and go back to languages with little (or no) typing support.
def f(u):
    value = 0.0
    if u > -1 and u < 1:
        value = u * u
    return value
Given the above, the following produces the expected plot:
plot(f,(x,-5,5))
But plot(f(x),(x,-5,5)) just draws a horizontal line. Can anyone explain what's going on?
The former passes the function, allowing it to be called inside plot(). The latter calls the function once and passes the returned value, resulting in the same value each time.
Similar to what @Ignacio said, the cause is the function being called only once. The problem with this, compared to other functions like sin, is the conditional: the if statement is evaluated when the function is called, and it is not preserved as a symbolic statement. That is, the u > -1 and u < 1[1] is evaluated on the first function call and the result is treated accordingly (i.e. value is left at 0).
As an illustration of what is happening:
sage: x = var('x')
sage: print ":)" if x > 0 else ":("
:(
There is no way to get around this in general[2], because Python has to evaluate the condition in the if statement to work out which code path to take when the function is called.
Best case solution
There is a solution that should work (but doesn't yet). Sage provides Piecewise, so you can define f as:
f = Piecewise([((-5, -1), ConstantFunction(0)),
               ((-1, 1), x*x),
               ((1, 5), ConstantFunction(0))],
              x)
Unfortunately, the implementation of Piecewise is as yet incomplete, and severely lacking, so the only way to plot this seems to be:
f.plot()
(Limitations: trying to call f with a variable causes errors; it doesn't work with the conventional plot; you can't restrict the domain in Piecewise.plot, it plots the whole thing (hence why I restricted it to ±5); it doesn't cope with infinite intervals.)
Working solution
You could also just detect whether the argument to f is a number or variable and do the appropriate action based on that:
def f(u):
    try:
        float(u)  # check that it's a number by trying to convert it
        return u*u if -1 < u < 1 else 0.0
    except TypeError:  # the conversion failed: u is symbolic or callable
        if callable(u):
            return lambda uu: f(u(uu))
        else:
            return f
Note the callable call: it checks whether u is a function (in some sense), and if so returns the composition of f with u.
This version allows us to do things like:
sage: f(10)
0.0
sage: f(x)(0.5)
0.25
sage: f(x+3)(-2.2)
0.64
and it also works perfectly fine with plot, in either form. (Although it warns about DeprecationWarnings because of the u(uu) syntax; there are ways to get around this using u.variables but they are fairly awkward.)
Note: This "working" solution is quite fragile, and very suboptimal; the Piecewise version would be the correct solution, if it worked.
[1]: Python actually allows you to write this as -1 < u < 1. Pretty cool.
[2]: Although in some special cases you can, e.g. if you know x > 0, then you can use assume(x > 0) which means the example will print :).
Here is a (possibly) simpler solution for now, using lambdas.
sage: plot(lambda x:f(x), (x,-5,5))
My first time posting here, so I hope I've asked my question in the right sort of way.
After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)
My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.
The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.
So, for example, given the following dictionary:
import random

d = {}
for i in xrange(15000):
    d[random.randint(15000000, 18000000)] = 0
can you get Python to tell you how many collisions happened when creating it?
My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.
To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.
Thanks for your help.
Edit: Some code to implement @Winston Ewert's solution:
n = 1500

global collision_count
collision_count = 0

class Foo():
    def __eq__(self, other):
        global collision_count
        collision_count += 1
        return id(self) == id(other)

    def __hash__(self):
        #return id(self)  # @John Machin: yes, I know!
        return 1

objects = [Foo() for i in xrange(n)]

d = {}
for o in objects:
    d[o] = 1

print collision_count
Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define a __hash__ function.
It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.
Anyone know why this is saying 0 collisions?
Edit:
I might have figured this out.
Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as @John Machin put it)?
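A tiny illustration of my own (not from the original post) of that behaviour: with distinct objects, __eq__ is only consulted when two keys' hash values clash, which is why the id-based version above never calls it.

class Key(object):
    def __hash__(self):
        return 1                  # force every Key onto the same hash value
    def __eq__(self, other):
        print "__eq__ called"     # fires only when two keys' hashes match
        return self is other

d = {}
d[Key()] = 1
d[Key()] = 2    # same hash, different object -> __eq__ runs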
Short answer:
You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.
Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".
You shouldn't be worrying about collisions.
Long answer:
Some explanations, derived from reading the source code:
A dict is implemented as a table of 2 ** i entries, where i is an integer.
dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.
When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer
It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.
Consequently collisions are inevitable.
However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.
The code below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).
In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts, e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a separate question.
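For instance, the flat tuple-keyed layout mentioned above looks like this (a sketch with made-up x, y, dist values):

x, y, dist = 3, 7, 1.5

nested = {}
nested.setdefault(x, {})[y] = dist   # accessed as nested[x][y]

flat = {}
flat[(x, y)] = dist                  # accessed as flat[(x, y)]

print flat[(x, y)]                   # 1.5
print (x, y) in flat                 # True -- use "k in d" rather than d.has_key(k)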
Update after testing on Python 2.6:
Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".
>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
... i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
... pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>
This idea doesn't actually work, see discussion in the question.
A quick look at the C implementation of python shows that the code for resolving collisions does not calculate or store the number of collisions.
However, it will invoke PyObject_RichCompareBool on the keys to check if they match. This means that __eq__ on the key will be invoked for every collision.
So:
Replace your keys with objects that define __eq__ and increment a counter when it is called. This will be slower because of the overhead involved in jumping into python for the compare. However, it should give you an idea of how many collisions are happening.
Make sure you use different objects as the key, otherwise python will take a shortcut because an object is always equal to itself. Also, make sure the objects hash to the same value as the original keys.