What does self do inside a Python array embedded in a Class?

def traverse(self):
    print("Traversing...")
    nodes_to_visit = [self]
    while len(nodes_to_visit) != 0:
        current_node = nodes_to_visit.pop()
        print(current_node.value)
        nodes_to_visit += current_node.children
I have this function inside a class (I'm learning Data Structures), and on the third line there's a self inside an array, which is then used. What does it do, and what does it return? (And while I'm asking: are Data Structures advanced? Can I consider myself an 'advanced' programmer now ;)?)

Data Structures can indeed be advanced. Much of the topic involves performance when processing large amounts of data, like thousands or millions of records. You will learn terms like run-time complexity, O(n), and O(log n).
An example of the benefit of good knowledge of Data Structures is parsing an Excel file that contains a million rows: a colleague of mine wrote a script that took an hour to finish the job, while mine took only 5 minutes.
Take note that one of the most basic and most useful data structures in Python is the dictionary: its lookups run in O(1) on average rather than O(n).
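As for the [self] on the third line: it simply builds a one-element list containing the current instance, which seeds the queue of nodes to visit; each iteration then pops a node, prints its value, and appends its children. Here is a minimal, self-contained sketch, assuming a Node class whose name and attributes are made up for illustration and are not from the original post:

class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

    def traverse(self):
        print("Traversing...")
        nodes_to_visit = [self]            # start the traversal from this node itself
        while len(nodes_to_visit) != 0:
            current_node = nodes_to_visit.pop()
            print(current_node.value)
            nodes_to_visit += current_node.children

root = Node("root", [Node("a", [Node("a1")]), Node("b")])
root.traverse()                            # prints: root, b, a, a1 (pop() takes from the end)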

Related

Best practice: how to pass many arguments to a function?

I am running some numerical simulations, in which my main function must receive lots and lots of arguments - I'm talking 10 to 30 arguments depending on the simulation to run.
What are some best practices to handle cases like this? Dividing the code into, say, 10 functions with 3 arguments each doesn't sound very feasible in my case.
What I do is create an instance of a class (with no methods), store the inputs as attributes of that instance, then pass the instance - so the function receives only one input.
I like this because the code looks clean, easy to read, and because I find it easy to define and run alternative scenarios.
I dislike it because accessing class attributes within a function is slower than accessing a local variable (see: How / why to optimise code by copying class attributes to local variables?) and because it is not an efficient use of memory - too much data stored multiple times unnecessarily.
Any thoughts or recommendations?
class MyInput:
    pass  # a class with no methods, as described above; the inputs become attributes

myinput = MyInput()
myinput.input_sql_table = that_sql_table
myinput.input_file = that_input_file
myinput.param1 = param1
myinput.param2 = param2
myoutput = calc(myinput)
Alternative scenarios:
import collections  # for OrderedDict
import copy         # for deepcopy

inputs = collections.OrderedDict()
scenarios = collections.OrderedDict()
inputs['base scenario'] = copy.deepcopy(myinput)
inputs['param2 = 100'] = copy.deepcopy(myinput)
inputs['param2 = 100'].param2 = 100
# loop through all the inputs and store the outputs in the ordered dictionary scenarios
I don't think this is really a StackOverflow question, more of a Software Engineering question. For example check out this question.
As far as whether or not this is a good design pattern, this is an excellent way to handle a large number of arguments. You mentioned that this isn't very efficient in terms of memory or speed, but I think you're making an improper micro-optimization.
As far as memory is concerned, the overhead of running the Python interpreter is going to dwarf the couple of extra bytes used by instantiating your class.
Unless you have run a profiler and determined that accessing members of that options class is slowing you down, I wouldn't worry about it. This is especially the case because you're using Python. If speed is a real concern, you should be using something else.
You may not be aware of this, but most of the large-scale number-crunching libraries for Python aren't actually written in Python; they're just wrappers around C/C++ libraries that are much faster.
I recommend reading this article; it is well established that "Premature optimization is the root of all evil".
You could pass in a dictionary, like so:

def some_func_or_class(kwarg1: int = -1, kwarg2: int = 0, kwargN: str = ''):
    print(kwarg1, kwarg2, kwargN)

all_the_kwargs = {'kwarg1': 0, 'kwarg2': 1, 'kwargN': 'xyz'}
some_func_or_class(**all_the_kwargs)
Or you could use several named tuples, as referenced here: Type hints in namedtuple.
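For instance, here is a minimal sketch of grouping the arguments in a typed NamedTuple; the field names and the toy calc function are assumptions for illustration, not taken from the question:

from typing import NamedTuple

class SimulationInputs(NamedTuple):
    input_file: str
    param1: float = 1.0
    param2: float = 2.0

def calc(inp: SimulationInputs) -> float:
    # toy calculation, just to show attribute access on the tuple
    return inp.param1 * inp.param2

base = SimulationInputs(input_file="data.csv")
alt = base._replace(param2=100)   # a convenient way to define an alternative scenario
print(calc(base), calc(alt))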
Also note that, depending on which version of Python you are using, there may be a limit to the number of arguments you can pass in a function call.
Or you could use just a dictionary:
def some_func(a_dictionary):
    a_dictionary.get('argXYZ', None)  # defaults to None if argXYZ doesn't exist

Python - get object - Is dictionary or if statements faster?

I am making a POST to a python script, the POST has 2 parameters. Name and Location, and then it returns one string. My question is I am going to have 100's of these options, is it faster to do it in a dictionary like this:
myDictionary = {"Name": {"Location": "result", "LocationB": "resultB"},
                "Name2": {"Location2": "result2A", "Location2B": "result2B"}}
And then I would use .get("Name").get("Location") to get the results.
OR do something like this:
if Name == "Name":
    if Location == "Location":
        result = "result"
    elif Location == "LocationB":
        result = "resultB"
elif Name == "Name2":
    if Location == "Location2":
        result = "result2A"
    elif Location == "Location2B":
        result = "result2B"
Now if there are hundreds or thousands of these what is faster? Or is there a better way all together?
First of all:
Generally, it's much more pythonic to match keys to values using dictionaries. You should do that from a point of style.
Secondly:
If you really care about performance, python might not always be the optimal tool. However, the dict approach should be much much faster, unless your selections happen about as often as the creation of these dicts. The creation of thousands and thousands of PyObjects to check your case is a really bad idea.
Thirdly:
If you care about your application so much, you might really want to benchmark both solutions -- as usual with performance questions, there are a million factors, including your computing platform, that only experiments will help sort out.
Fourth(ly?):
It looks like you're building something like a protocol parser. That's really not Python's forte, performance-wise. Maybe you'd want to look into one of the dozens of tools that can write C parser code for you and wrap that in a native module; it's pretty sure to be faster than either of your implementations, if done right.
Here's the python documentation on Extending Python with C or C++
I decided to test the two scenarios of 1000 Names and 2 locations
The Test Samples
Team Dictionary:
di = {}
for i in range(1000):
    di["Name{}".format(i)] = {'Location': 'result{}'.format(i), 'LocationB': 'result{}B'.format(i)}

def get_dictionary_value():
    di.get("Name999").get("LocationB")
Team If Statement:
I used a Python script to generate a 5000-line function if_statements(name, location) following this pattern:
elif name == 'Name994':
    if location == 'Location':
        return 'result994'
    elif location == 'LocationB':
        return 'result994B'
# Some time later ...

def get_if_value():
    if_statements("Name999", "LocationB")
Timing Results
You can use timeit.timeit to measure how long each function takes to complete.
import timeit
print(timeit.timeit(get_dictionary_value))
# 0.06353...
print(timeit.timeit(get_if_value))
# 6.3684...
So there you have it, dictionary was 100 times faster on my machine than the hefty 165 KB if-statement function.
I will root for dict().
In most cases [key] lookup is much faster than a chain of conditional checks; as a rule of thumb, conditionals are meant for boolean logic, not for dispatching on many values.
The reason for this is that when you create a dictionary, you essentially create a registry of that data, stored as hashes in buckets. When you then write dictionary_name['key'], Python knows (via the hash) the exact location of that value, if it exists, and returns it almost instantly.
Conditionals are different: they are sequential checks, meaning that in the worst case every condition has to be evaluated before the value's existence is even established and the respective data is returned.
As you can see, with hundreds of statements this can be problematic, and in this case dictionaries are faster. You also need to be cognizant of when these lookups happen relative to building the dictionary: if a lookup runs before the dictionary has been fully populated, you may get a key-not-found error.
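As a small sketch of the dictionary approach, a dict keyed by the (name, location) pair avoids both the nested .get() chain and the if/elif ladder; the sample data below is just the example from the question:

results = {
    ("Name", "Location"): "result",
    ("Name", "LocationB"): "resultB",
    ("Name2", "Location2"): "result2A",
    ("Name2", "Location2B"): "result2B",
}

name, location = "Name2", "Location2B"
print(results.get((name, location)))   # 'result2B'; None if the pair is unknown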

Python memory explosion with embedded functions

I have used Python for a while, and from time to time I run into memory explosion problems. I have searched for sources to resolve my question, such as
Memory profiling embedded python
and
https://mflerackers.wordpress.com/2012/04/12/fixing-and-avoiding-memory-leaks-in-python/
and
https://docs.python.org/2/reference/datamodel.html#object.__del__
However, none of them works for me.
My current problem is the memory explosion when using embedded functions. The following code (in pseudocode form) works fine:
class A:
    def fa(self):
        # some operations
        # get dictionary1
        # combine dictionary1 to get string1
        dictionary1 = None
        return string1

    def fb(self):
        for i in range(0, j):
            # call self.fa
            # get dictionary2 by processing string1
            # dictionary1 and dictionary2 are basically the same.
            # update dictionary3 by processing dictionary2
            dictionary2 = None
        return dictionary3

class B:
    def ga(self):
        for n in range(0, m):
            # call A.fb  (as one argument is updated dynamically, I have to call it within the loop)
            # process dictionary3
        return something
The problem arose when I noticed that I don't need to combine dictionary1 into string1; I can pass dictionary1 directly to A.fb. I implemented it this way, and the program became extremely slow while the memory usage exploded to more than 10 times its previous level. I have verified that both methods return the correct result.
Can anybody suggest why such a small modification results in such a large difference?
Previously, I also noticed this when I was levelizing nodes in a multi-source tree (with 100,000+ nodes). If I start levelizing from the source node with the largest height, the memory usage is 100 times worse than starting from the source node with the smallest height, while the levelization time is about the same.
This has baffled me for a long time. Thank you so much in advance!
If anybody is interested, I can email you the source code for a clearer explanation.
The fact that you're solving the same problem shouldn't imply anything about the efficiency of the solution. The same issue arises with sorting arrays: you can use bubble sort, O(n^2); merge sort, O(n log n); or, if you can apply some restrictions, a non-comparison sorting algorithm like radix or bucket sort, which have linear runtime.
Starting the traversal from different nodes will generate different ways of traversing the graph, some of which might be inefficient (visiting nodes more times than necessary).
As for "combine dictionary1 to string1": it might be a very expensive operation, and since this function is called repeatedly (many times), the performance could be significantly poorer. But that's just an educated guess and cannot be confirmed without more details about the complexity of the operations performed in these functions.

How to optimize operations on large (75,000 items) sets of booleans in Python?

There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.
The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.
After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.
I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
You could try doing it with numpy instead of plain Python. I found it to be very fast for operations like these.
For example:
import numpy

# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)
# Get all items that are larger than 500, takes 2.58ms
y = x > 500
# Add 10 to those items, takes 26.1ms
x[y] += 10
Since that's with a lot more rows, I think that 75000 should not be a problem either :)
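Applied to the revision-set use case, a boolean numpy array indexed by revision number makes the set operations nearly free. Here is a rough sketch; the sizes and ranges below are made up for illustration:

import numpy

N = 75000                      # illustrative upper bound on revision numbers
a = numpy.zeros(N + 1, dtype=bool)
b = numpy.zeros(N + 1, dtype=bool)
a[1:74001] = True              # one long contiguous span of revisions
b[50000:] = True

union = a | b                  # set union as element-wise OR
intersection = a & b           # set intersection as element-wise AND
difference = a & ~b            # set subtraction
print(intersection.sum())      # number of revisions present in both sets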
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
import re  # needed for the revision-range parsing below

class RevisionSet(set):
    """
    A set of revisions, held in dictionary form for easy manipulation. If we
    were to rewrite this script for Python 2.3+, we would subclass this from
    set (or UserSet). As this class does not include branch
    information, it's assumed that one instance will be used per
    branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""
        revision_range_split_re = re.compile('[-:]')
        if isinstance(parm, set):
            print "1"
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                              int(rev_or_revs[1]) + 1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start,end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e + 1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s, e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)
Addition:
By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
For example, all keys from 1 to 74,000 contain true
Why not work on a subset? Just 74001 to the end.
Pruning 74/75th of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.
There is no compelling reason to use code that supports python 2.3 and earlier.
Just a thought. I used to do this kind of thing using run-length coding in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, etc.
Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
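As a rough sketch of that idea (not the svnmerge.py code itself), a set can be stored as sorted, non-overlapping (start, end) runs, and boolean operations become simple merges. Here is an intersection; the function name and data are made up for illustration:

def intersect_runs(a, b):
    """Intersect two sorted lists of inclusive (start, end) runs."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # advance whichever run ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

print(intersect_runs([(1, 74000)], [(500, 600), (73990, 75000)]))
# [(500, 600), (73990, 74000)]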

Can this Python code be written more efficiently?

So I have this code in Python that writes some values to a dictionary, where each key is a student ID number and each value is an instance of a Student class, and each instance has some variables associated with it.
Code
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)] == varschosen[1])):
        valuetowrite = str(row[i])
        if students[str(variablekey)].var2 != []:
            students[str(variablekey)].var2.append(valuetowrite)
        else:
            students[str(variablekey)].var2 = [valuetowrite]
except:
    two = 1  # This is just a dummy assignment because I can't leave it empty...
             # I don't need my program to do anything if the "try" doesn't work.
             # I just want to prevent a crash.

# Assign var3
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)] == varschosen[2])):
        valuetowrite = str(row[i])
        if students[str(variablekey)].var3 != []:
            students[str(variablekey)].var3.append(valuetowrite)
        else:
            students[str(variablekey)].var3 = [valuetowrite]
except:
    two = 1

# Assign var4
try:
    if ((str(i) in row_num_id.iterkeys()) and (row_num_id[str(i)] == varschosen[3])):
        valuetowrite = str(row[i])
        if students[str(variablekey)].var4 != []:
            students[str(variablekey)].var4.append(valuetowrite)
        else:
            students[str(variablekey)].var4 = [valuetowrite]
except:
    two = 1
The same code repeats many, many times, once for each variable that the student has (var5, var6, ..., varX). However, the RAM spike in my program occurs while executing the function that does this series of variable assignments.
I wish to find a way to make this faster or more memory efficient, because running this part of my program takes up around half a gig of memory. :(
Thanks for your help!
EDIT:
Okay let me simplify my question:
In my case, I have a dictionary of about 6000 instantiated classes, where each instance has 1000 attributes, all of type string or list of strings. I don't really care about the number of lines of code or the speed at which it runs (right now, my code is at almost 20,000 lines and is about a 1 MB .py file!). What I am concerned about is the amount of memory it takes up, because this is the culprit in throttling my CPU. The ultimate question is: does the number of code lines by which I build up this massive dictionary matter much in terms of RAM usage?
My original code functions fine, but the RAM usage is high. I'm not sure if that is "normal" with the amount of data I am collecting. Does writing the code in a condensed fashion (as shown by the people who helped me below) actually make a noticeable difference in the amount of RAM I am going to eat up? Sure there are X ways to build a dictionary, but does it even affect the RAM usage in this case?
Edit: The suggested code-refactoring below won't reduce the memory consumption very much. 6000 classes each with 1000 attributes may very well consume half a gig of memory.
You might be better off storing the data in a database and pulling out the data only as you need it via SQL queries. Or you might use shelve or marshal to dump some or all of the data to disk, where it can be read back in only when needed. A third option would be to use a numpy array of strings. The numpy array will hold the strings more compactly. (Python strings are objects with lots of methods which make them bulkier memory-wise. A numpy array of strings loses all those methods but requires relatively little memory overhead.) A fourth option might be to use PyTables.
And lastly (but not leastly), there might be ways to re-design your algorithm to be less memory intensive. We'd have to know more about your program and the problem it's trying to solve to give more concrete advice.
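If the database/shelve route sounds appealing, here is a rough sketch of pushing the per-student data out to disk with the standard-library shelve module, so only the records currently needed sit in RAM; the filename, keys, and field names are made up for illustration:

import shelve

# Open (or create) a disk-backed dictionary; only accessed records are loaded.
db = shelve.open("students_data")
db["12345"] = {"var2": ["a", "b"], "var3": ["c"]}   # one entry per student ID
record = db["12345"]                                # read back on demand
record["var2"].append("d")
db["12345"] = record                                # write the modified record back
db.close()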
Original suggestion:
for idx, v in enumerate(('var2', 'var3', 'var4'), start=1):
    try:
        if row_num_id.get(str(i)) == varschosen[idx]:
            valuetowrite = str(row[i])
            value = getattr(students[str(variablekey)], v)
            if value != []:
                value.append(valuetowrite)
            else:
                setattr(students[str(variablekey)], v, [valuetowrite])
    except PUT_AN_EXPLICIT_EXCEPTION_HERE:
        pass
PUT_AN_EXPLICIT_EXCEPTION_HERE should be replaced with something like AttributeError, TypeError, or ValueError, or maybe something else.
It's hard to guess what to put here because I don't know what kind of values the variables might have.
If you run the code without the try...except block and your program crashes, take note of the traceback error message you receive. The last line will say something like
TypeError: ...
In that case, replace PUT_AN_EXPLICIT_EXCEPTION_HERE with TypeError.
If your code can fail in a number of ways, say with TypeError or ValueError, then you can replace PUT_AN_EXPLICIT_EXCEPTION_HERE with
(TypeError, ValueError) to catch both kinds of error.
Note: there is a little technical caveat that should be mentioned regarding
row_num_id.get(str(i)) == varschosen[idx]. The expression row_num_id.get(str(i)) returns None if str(i) is not in row_num_id.
But what if varschosen[idx] is None and str(i) is not in row_num_id? Then the condition is True, whereas the longer original condition returned False.
If that is a possibility, then the solution is to use a sentinel default value, like row_num_id.get(str(i), object()) == varschosen[idx]. Now row_num_id.get(str(i), object()) returns object() when str(i) is not in row_num_id. Since object() is a new instance of object, there is no way it could equal varschosen[idx].
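A tiny illustration of that caveat, with made-up data (an empty row_num_id and a varschosen entry that happens to be None):

row_num_id = {}                 # pretend str(i) is not a key here
varschosen = [None, None]       # suppose varschosen[idx] happens to be None
idx = 1

print(row_num_id.get('5') == varschosen[idx])             # True -- a false positive
print(row_num_id.get('5', object()) == varschosen[idx])   # False -- the sentinel avoids it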
You've spelled this wrong
two = 1  # This is just a dummy assignment because I can't leave it empty... I don't need my program to do anything if the "try" doesn't work. I just want to prevent a crash.
It's spelled
pass
You should read a tutorial on Python.
Also,
except:
is a bad policy. Your program will fail to crash when it's supposed to crash.
Names like var2 and var3 are evil. They are intentionally misleading.
Don't repeat str(variablekey) over and over again.
I wish to find out a way to make this more efficient in speed or more memory efficient because running this part of my program takes up around half a gig of memory. :(
This request is unanswerable because we don't know what it's supposed to do. Intentionally obscure names like var1 and var2 make it impossible to understand.
"6000 instantiated classes, where each class has 1000 attributed variables"
So. 6 million objects? That's a lot of memory. A real lot of memory.
What I am concerned about is the amount of memory it is taking up because this is the culprit in throttling my CPU
Really? Any evidence?
but the RAM usage is high
Compared with what? What's your basis for this claim?
Python dicts use a surprisingly large amount of memory. Try:
import sys

for i in range(30):
    d = dict((j, j) for j in range(i))
    print "dict with", i, "elements is", sys.getsizeof(d), "bytes"
for an illustration of just how expensive they are. Note that this is just the size of the dict itself: it doesn't include the size of the keys or values stored in the dict.
By default, an instance of a Python class stores its attributes in a dict. Therefore, each of your 6000 instances is using a lot of memory just for that dict.
One way that you could save a lot of memory, provided that your instances all have the same set of attributes, is to use __slots__ (see http://docs.python.org/reference/datamodel.html#slots). For example:
class Foo(object):
    __slots__ = ('a', 'b', 'c')
Now, instances of class Foo have space allocated for precisely three attributes, a, b, and c, but no instance dict in which to store any other attributes. This uses only 4 bytes (on a 32-bit system) per attribute, as opposed to perhaps 15-20 bytes per attribute using a dict.
Another way in which you could be wasting memory, given that you have a lot of strings, is if you're storing multiple identical copies of the same string. Using the intern function (see http://docs.python.org/library/functions.html#intern) could help if this turns out to be a problem.
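For example, here is a tiny sketch of how intern collapses equal strings into a single shared object (Python 2 syntax, matching the code above; in Python 3 the same function lives at sys.intern):

a = intern("".join(["sp", "am"]))   # builds the string "spam" at runtime
b = intern("".join(["sp", "am"]))   # an equal string, built separately
print a is b                        # True: both names refer to one shared object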
