I've got this script written which does some involved parsing while running through a large file. For every line in the file (after some heavy manipulation), I need to add a check to see whether it meets certain criteria and if it does, include it in a list for additional processing later.
The function doing the parsing is already a bit cluttered, and I'm wondering whether it's possible to shunt the line-checking and list manipulation to another function to make things easier to modify later on. My inclination is to use a global variable which gets modified by the function, but I know that's usually poor form. Until now I have never made use of classes, but I vaguely recall there being some advantage to them regarding persistent local variables.
One version of this part of the script could be:
matchingLines = []

def lineProcess(line):
    global matchingLines
    if line.startswith(criteria):
        matchingLines.append(line)

for line in myFile:
    # lots of other stuff
    lineProcess(line)
Obviously in this simple example it's not much of a pain to just do the checking in the main function and not bother with any additional functions. But in principle, I'm wondering what a better general way of doing this sort of thing is without using external variables.
EDIT: Part of the reason I find the separate function attractive is that once I've collected the list of lines, I'm going to use them to manipulate another external file, and it would be convenient to have the whole operation wrapped up in a contained module. However, I recognize that this might be premature optimization.
Something like the following might be considered more pythonic:
def is_valid_line(line):
    """Return True if the line is valid, else False."""
    return line.startswith(criteria)

valid_lines = [l for l in myFile if is_valid_line(l)]
Incidentally, it would be even better practice to use a generator expression rather than a list comprehension, e.g.
valid_lines = (l for l in myFile if is_valid_line(l))
That way the file reading and line validation will only actually happen when something tries to iterate over valid_lines, and not before. For example, in the following case:
valid_lines = [l for l in myFile if is_valid_line(l)]
for line in valid_lines:
    stuff_that_can_raise_exception(line)
Here you read and validate the entire (huge) file and build the full list of validated lines; if the very first line then causes an error, the time spent validating the whole file is wasted. If you use a generator expression (the (x for x in y) version) instead of a list comprehension (the [x for x in y] version), then when the error happens you haven't actually validated the file yet (only the first line). I only mention it because I'm terrible about not doing this more often myself, and it can yield big performance gains (in CPU and memory) in a lot of cases.
You could use a class, and have matching_lines be an attribute. But you could also just do this:
def process_line(line, matching_lines):
    if line.startswith(criteria):
        matching_lines.append(line)

...

matches = []
for line in my_file:
    # lots of other stuff
    process_line(line, matches)
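If you do later want the class route mentioned above, a minimal sketch might look like the following (LineCollector and its attribute names are purely illustrative, and criteria / my_file are assumed to exist as in the snippets above):

class LineCollector:
    """Collects lines that match some criteria, for later processing."""

    def __init__(self, criteria):
        self.criteria = criteria
        self.matching_lines = []  # persistent state lives on the instance

    def process_line(self, line):
        if line.startswith(self.criteria):
            self.matching_lines.append(line)

collector = LineCollector(criteria)
for line in my_file:
    # lots of other stuff
    collector.process_line(line)
# collector.matching_lines now holds every matching line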
Related
Extremely basic question that I don't quite get.
If I have this line:
my_string = "how now brown cow"
and I change it like so
my_string.split()
Is it acceptable coding practice to just write it like that in order to change it?
or should I instead change it like so:
my_string = my_string.split()
Don't both effectively do the same thing?
When would I use one over the other?
How does this ultimately affect my code?
Always try to avoid:
my_string = my_string.split()
Never, ever do something like that. The main problem is that the result of the split() operation is not a string anymore: it's a list. Assigning a result of that type to a variable named my_string is going to introduce bugs down the line, especially for another maintainer of the code, and is bound to cause more problems in the end.
The first line doesn't actually change it - it calls the .split() method on the string, but since you're not doing anything with what that function call returns, the results are just discarded.
In the second case, you assign the returned value to my_string; that means your original string is discarded and my_string now refers to the list of parts returned by .split().
Both calls to .split() do the same thing, but the lines of your program do something different.
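A quick demonstration of that difference (the words name below is just for illustration):

my_string = "how now brown cow"

my_string.split()          # returns ['how', 'now', 'brown', 'cow'], which is immediately discarded
print(my_string)           # still "how now brown cow": strings are immutable

words = my_string.split()  # keep the result under a new, accurately named variable
print(words)               # ['how', 'now', 'brown', 'cow']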
You would only use the first example if you wanted to know if a split would cause an error, for example:
try:
    my_string.split()
except:
    print('That was unexpected...')
The second example is the typical use, although you could use the result directly in some other way, for example by passing it to a function:
print(my_string.split())
It's not a bad question though - you'll find that some libraries favour methods that change the contents of the object they are called on, while other libraries favour returning the processed result without touching the original. They are different programming paradigms and programmers can be very divided on the subject.
In most cases, Python itself (and its built-in functions and standard libraries) favours the more functional approach and will return the result of the operation, without changing the original, but there are exceptions.
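One familiar illustration of the two styles is list.sort() versus sorted(): the method mutates the list in place and returns None, while the built-in returns a new sorted list and leaves the original untouched.

numbers = [3, 1, 2]
result = numbers.sort()      # in-place: the list is mutated, the return value is None
print(numbers, result)       # [1, 2, 3] None

numbers = [3, 1, 2]
new_list = sorted(numbers)   # functional: a new list is returned, the original is unchanged
print(numbers, new_list)     # [3, 1, 2] [1, 2, 3]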
Part of a utility system in my AcecoolLib package, which I'm writing by porting all or most of my logic to Python and various other languages, contains a simple but greatly useful helper... a function named ENUM.
It has many useful features, such as automatically creating maps of the enums, extended or reverse maps if you have the map assigned to more than just values, and a lot more.
It can create maps for generating function names dynamically, it can create simple maps between enumeration and text or string identifiers for language, and much more.
The function declaration is simple, too:
def ENUM( _count = None, *_maps ):
It has an extra helper... Here: https://www.dropbox.com/s/6gzi44i7dh58v61/dynamic_properties_accessorfuncs_and_more.py?dl=0
Of the extra helpers there, ENUM_MAP is used; the other one isn't.
Anyway, before I start going into etc.. etc.. the question is:
How can I count the return variables outside of the function... ie:
ENUM_EXAMPLE_A, ENUM_EXAMPLE_B, ENUM_EXAMPLE_C, ENUM_LIST_EXAMPLE, MAP_ENUM_EXAMPLE = ENUM( None, [ '#example_a', '#example_b', '#example_c' ] )
Where the list is a simple list of 0 = 0, 1 = 1, 2 = 2, or something; then the map links them, so [ 0 = '#example_a', 1 = '#example_b', etc. ], then [ '#example_a' = 0, etc. ] for the reverse... or something along those lines.
There are other advanced use cases, not sure if I have those features in the file above, but regardless... I'm trying to simply count the return vars... and get the names.
I know it is likely possible to read the line from which the call is executed: read the file, get the line, break it apart, and do all of that... but I'm hoping something already exists to do this so I don't have to code it from scratch on top of the default Python system...
In short: I'd like to get rid of the first argument of ENUM( _count, *_maps ) so that only the optional *_maps is used. So if I call ENUM_A, ENUM_B, ENUM_C, LIST_ENUMS = ENUM( ), it'll detect 4 return values and get their names, so I can see if the last one is named in a different style from the first... i.e. whether they want the list, etc. If they add a map, then an optional list, etc., I can just count back n _maps to find the list arg, or not...
I know it probably isn't necessary, but I want it to be easy and dynamic so that if I add a new enum to a giant list, I don't have to update the count (although for those I use the maps, which means I have to add an entry anyway)...
Either way, I know this is stupidly easy to do in Lua with built-in functions... I'm hoping Python has built-in functions to easily grab the data too.
Thanks!
Here is the one proposed answer, similar to what I could do in my Lua framework... The difference, though, is that my framework has to load all of the files into memory (for dynamic reloading and dynamic changes, going to the appropriate location, and to network the data by combining everything so the file I/O cost is 'averted'), and Lua handles tables incredibly well.
The simple answer is that it is possible. I'm not sure about doing it in default Python without file I/O; however, this method would easily work. This answer is in pseudo-code terms, but the functionality does exist.
Logic:
1) Using traces, you can determine which file / path and which line called the ENUM function.
2) Read the calling file as text. If you can read directly to a line without having to process the entire file, that would be quicker; there may be some libraries out there that do this. In default Python I haven't done a huge amount of file I/O beyond the basics, so I'm not up to speed on the most useful tools, as I typically use SQL for storage purposes, etc.
3) Split the line text on '=' so that you have the return variables on one side and the function call itself on the other; call the result _result.
4a) If you have no results, then someone called the function without assigning its return values - odd.
4b) Split _result[ 0 ] on ',' to get each individual return variable, and trim whitespace left / right.
5) Combine the cleaned names into a list.
6) Process the names - i.e. determine the method the developer uses to name their enum values, and see whether that style changes for the last argument (if there is no map). If there is a map, then go back n or n*2 elements for the list, then onward from there for the map vars. With maps, the map returns are given; the only thing I need to do dynamically is count them and determine whether the user wants a list arg or not.
Note: Python has very simple mechanisms for doing several of these steps in a single line of code (see the sketch below).
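As a rough sketch of these steps, Python's inspect module can fetch the calling line for you. This is only an illustration, assuming the call fits on a single source line and the source file is readable; the ENUM below is a simplified stand-in, not the full AcecoolLib helper:

import inspect

def ENUM(*_maps):
    # Frame record of the caller; code_context holds the source line of the
    # call (it is None if the source isn't available, e.g. in a REPL).
    caller = inspect.stack()[1]
    call_line = caller.code_context[0]

    # Everything before '=' is the tuple of names being assigned to.
    lhs = call_line.split('=')[0]
    names = [name.strip() for name in lhs.split(',')]

    # Hand back one integer per requested name; a real implementation would
    # also build the list and the forward / reverse maps described above.
    return tuple(range(len(names)))

ENUM_A, ENUM_B, ENUM_C = ENUM()  # ENUM_A == 0, ENUM_B == 1, ENUM_C == 2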
All of this is possible and easy to create in Python. The thing I dislike about this solution is that it requires file I/O: if your program is executed from another program and doesn't remain in memory, these tasks are repeated on every run, making it less friendly and more costly...
If the program opens and remains open, then the cost is up front instead of ongoing, which isn't as bad.
Because I use ENUMs in everything, including quick executable scripts which run and then close, I don't want to use file I/O...
But a solution does exist. I'm looking for an alternative.
Simple answer is you can't.
In Python when you do (a, b, c) = func() it's called tuple unpacking. Essentially it's expecting func() to return a tuple of exactly 3 elements (in this example). However, you can also do a = func() and then a will contain a 3-element tuple or whatever func decided to return. Regardless of how func is called, there's nothing within the method that knows how the return value is going to be processed after it's returned.
I wanted to provide a more pythonic way of doing what you're intending, but I'm not really sure I understand the purpose of ENUM(). It seems like you're trying to create constants, but Python doesn't really have true constants.
EDIT:
Methods are only aware of what's passed in as arguments. If you want some sort of ENUM to value mapping then the best equivalent is a dict. You could then have a method that took ENUM('A', 'B', 'C') and returned {'A':0, 'B':1, 'C':2} and then you'd use dict look-ups to get the values.
enum = ENUM('A', 'B', 'C')
print(enum['A']) # prints 0
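A minimal version of such a helper (a sketch of the dict-based approach described above, not the AcecoolLib function) is just a dict comprehension over enumerate(), and a reverse map falls out the same way:

def ENUM(*names):
    """Map each name to a sequential integer, e.g. ENUM('A', 'B') -> {'A': 0, 'B': 1}."""
    return {name: index for index, name in enumerate(names)}

enum = ENUM('A', 'B', 'C')
print(enum['A'])  # prints 0

reverse = {value: key for key, value in enum.items()}
print(reverse[2])  # prints 'C'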
I've got a django app that, at a very pseudo-codey level, does something like this:
class SerialisedContentItem():
    def __init__(self, item):
        self.__output = self.__jsonify(item)

    def fetch_output(self):
        return self.__output

    def __jsonify(self, item):
        serialized = do_a_bunch_of_serialisey_stuff()
        return json.dumps(serialized)
So basically, as soon as the class is instantiated, it:
runs an internal function to generate an output string of JSON
stores it in an internal variable
exposes a public function that can be called later to retrieve the JSON
It's then being used to generate a page something like this:
for item in page.items:
    json_item = SerialisedContentItem(item)
    yield json_item.fetch_output()
This, to me, seems a bit pointless. And it's also causing issues with some business logic changes we need to make.
What I'd prefer to do is defer the calling of the "jsonify" function until I actually want it. Roughly speaking, changing the above to:
class SerialisedContentItem():
    def __init__(self, item):
        self.__item = item

    def fetch_output(self):
        return self.__jsonify(self.__item)
This seems simpler, and mucks with my logic slightly less.
But: is there a downside I'm not seeing? Is my change less performant, or not a good way of doing things?
As long as you only call fetch_output once per item, there's no performance hit (there would be one, obviously, if you called fetch_output twice on the same SerialisedContentItem instance). And not doing useless operations is usually a good thing too: you don't expect open("/path/to/some/file.ext") to read the file's content, do you?
The only caveat is that, with the original version, if item is mutated between the initialisation of SerialisedContentItem and the call to fetch_output, the change won't be reflected in the JSON output (since it's created right at initialisation time), while with your "lazy" version those changes WILL be reflected in the JSON. Whether this is a no-go, a potential issue, or actually just what you want depends on the context, so only you can tell.
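If you do end up calling fetch_output more than once per item, a middle ground is to serialise lazily but cache the result. This is just a sketch, keeping the original class name, method names and the do_a_bunch_of_serialisey_stuff placeholder:

import json

class SerialisedContentItem():
    def __init__(self, item):
        self.__item = item
        self.__output = None  # nothing serialised yet

    def fetch_output(self):
        # Serialise on the first call only, then reuse the cached JSON string.
        if self.__output is None:
            self.__output = self.__jsonify(self.__item)
        return self.__output

    def __jsonify(self, item):
        serialized = do_a_bunch_of_serialisey_stuff()
        return json.dumps(serialized)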
EDIT:
What prompted my question: according to my (poor) understanding of yield, using it here makes sense: I only need to iterate over the page items once, so do it in a way that minimises the memory footprint. But the bulk of the work isn't currently being done in the yield statement; it's being done on the line above it, when the class is instantiated, which makes the yield seem a bit pointless. Or am I misunderstanding how it works?
I'm afraid you are indeed misunderstanding yield. Deferring the JSON serialization until the yield json_item.fetch_output() line will change nothing (nada, zero, zilch, shunya) about memory consumption compared to the original version.
yield is not a function, it's a keyword. What it does is turn the function containing it into a "generator function": a function that returns a generator (a lazy iterator) object that you can then iterate over. It will not change anything about the memory used to jsonify an item, and whether this jsonification happens "on the same line" as the yield keyword or not is totally irrelevant.
What a generator brings you (with respect to memory use) is that you don't have to create a whole list of contents at once, i.e.:
def eager():
    result = []
    for i in range(1000):
        result.append("foo {}\n".format(i))
    return result

with open("file.txt", "w") as outfile:
    for item in eager():
        outfile.write(item)
This FIRST creates a 1000-item list in memory, then iterates over it.
vs
def lazy():
    for i in range(1000):
        yield "foo {}\n".format(i)

with open("file.txt", "w") as outfile:
    for item in lazy():
        outfile.write(item)
This lazily generates string after string on each iteration, so you never hold a 1000-item list in memory. BUT you still generate 1000 strings, each using the same amount of space as in the first solution. The difference is that since (in this example) you don't keep any reference to those strings, they can be garbage collected on each iteration, whereas storing them in a list prevents them from being collected until there's no more reference to the list itself.
In https://github.com/python/cpython/blob/3.6/Lib/linecache.py there are three instances of this check.
In getlines
def getlines(filename, module_globals=None):
    if filename in cache:
        entry = cache[filename]
        if len(entry) != 1:
            return cache[filename][2]
In updatecache
def updatecache(filename, module_globals=None):
    if filename in cache:
        if len(cache[filename]) != 1:
            del cache[filename]
In lazycache
def lazycache(filename, module_globals):
    if filename in cache:
        if len(cache[filename]) == 1:
            return True
        else:
            return False
I am writing my own version of linecache, and to write the tests for it I need to understand the scenario in which the cache entry can be a tuple of length 1.
There was one scenario in which the statement in getlines got executed: the file was accessed once and stored in the cache, and then the file was removed before it was accessed a second time. But I still cannot figure out why the check is there in the other two functions.
It would be very helpful if someone could help me understand the purpose of using this length check.
Look at the places the code stores values in the cache. It can store two different kinds of values under cache[filename]:
A 4-tuple of size, mod time, list of lines, and name (here and here).
A 1-tuple of a function that returns a list of lines on demand (here).
The 1-tuple is used for setting up lazy loading of modules. To lazy-load a normal file, you just open and read it. But module source may be available from the module's loader (found via the import system) yet not available just by opening and reading a file; zipimport modules are one example. So lazycache has to stash the loader's get_source method, so that if it later needs the lines, it can get them.
And this means that, whenever it uses those cache values, it has to check which kind is stored and act accordingly: if it needs the lines now and it has a lazy 1-tuple, it has to go load the lines (via updatecache); if it's checking for cache eviction and finds a lazy tuple that was never evaluated, it drops it; and so on.
Also notice that in updatecache, if it's loaded the file from a lazy 1-tuple, it doesn't have a mod time, which means that in checkcache, it can't check whether the file is stale. But that's fine for module source—even if you change the file, the old version is still the one that's imported and being used.
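To make the two shapes concrete, here is a stripped-down illustration of how a cache like this can tell lazy 1-tuples from full 4-tuples. It is not the stdlib code; the helper names and the way the size is computed are simplified:

import os

cache = {}

def add_lazy_entry(filename, get_source):
    # 1-tuple: only a callable that can produce the source text on demand.
    cache[filename] = (lambda: get_source(filename),)

def add_full_entry(filename):
    with open(filename) as fp:
        lines = fp.readlines()
    stat = os.stat(filename)
    # 4-tuple: size, mod time, list of lines, full name.
    cache[filename] = (stat.st_size, stat.st_mtime, lines, filename)

def get_lines(filename):
    entry = cache.get(filename)
    if entry is None:
        return []
    if len(entry) == 1:
        # Lazy entry: materialise the lines now and cache the full form
        # (no mod time is available, so staleness can't be checked later).
        lines = entry[0]().splitlines(keepends=True)
        cache[filename] = (len("".join(lines)), None, lines, filename)
        return lines
    return entry[2]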
If you were designing this from scratch, rather than hacking on something that's been in the stdlib since the early 1.x dark ages, you'd probably design this very differently. For example, you might use a class, or possibly two classes implementing the same interface.
Also, notice that a huge chunk of the code in linecache is there to deal with special cases related to loading module source that, unless you're trying to build something that reflects on Python code (like traceback does), you don't need any of. (And even if you are doing that, you'd probably want to use the inspect module rather than talking directly to the import system.)
So, linecache may not be the best sample code to base your own line cache on.
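As a sketch of that design suggestion (the class names here are invented), the eager and lazy cases could become two small classes sharing one getlines() interface:

class FileEntry:
    """Eagerly reads and caches the lines of an ordinary file on disk."""

    def __init__(self, filename):
        with open(filename) as fp:
            self._lines = fp.readlines()

    def getlines(self):
        return self._lines

class LazyEntry:
    """Defers to a loader's get_source() until the lines are first requested."""

    def __init__(self, name, get_source):
        self._name = name
        self._get_source = get_source
        self._lines = None

    def getlines(self):
        if self._lines is None:
            source = self._get_source(self._name)
            self._lines = source.splitlines(keepends=True)
        return self._lines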
I want to write a Python generator function that never actually yields anything. Basically it's a "do-nothing" drop-in that can be used by other code which expects to call a generator (but doesn't always need results from it). So far I have this:
def empty_generator():
    # ... do some stuff, but don't yield anything
    if False:
        yield
Now, this works OK, but I'm wondering if there's a more expressive way to say the same thing, that is, declare a function to be a generator even if it never yields any value. The trick I've employed above is to show Python a yield statement inside my function, even though it is unreachable.
Another way is
def empty_generator():
    return
    yield
Not really "more expressive", but shorter. :)
Note that iter([]) or simply [] will do as well.
An even shorter solution:
def empty_generator():
    yield from []
For maximum readability and maintainability, I would prioritize a construct which goes at the top of the function. So either
your original if False: yield construct, but hoisted to the very first line, or
a separate decorator which adds generator behavior to a non-generator callable.
(That's assuming you didn't just need a callable which did something and then returned an empty iterable/iterator. If so then you could just use a regular function and return ()/return iter(()) at the end.)
Imagine the reader of your code sees:
def name_fitting_what_the_function_does():
    # We need this function to be an empty generator:
    if False: yield
    # ... that crucial stuff that this function exists to do
Having this at the top immediately cues in every reader of this function to this detail, which affects the whole function - affects the expectations and interpretations of this function's behavior and usage.
How long is your function body? More than a couple of lines? Then as a reader, I will feel righteous fury and condemnation towards the author if I don't get a cue that this function is a generator until the very end, because I will probably have spent significant mental effort weaving a model in my head based on the assumption that this is a regular function. The first yield in a generator should ideally be immediately visible, since you don't even know to look for it.
Also, in a function longer than a few lines, a construct at the very beginning of the function is more trustworthy - I can trust that anyone who has looked at a function has probably seen its first line every time they looked at it. That means a higher chance that if that line was mistaken or broken, someone would have spotted it. That means I can be less vigilant for the possibility that this whole thing is actually broken but being used in a way that makes the breakage non-obvious.
If you're working with people who are sufficiently fluently familiar with the workings of Python, you could even leave off that comment, because to someone who immediately remembers that yield is what makes Python turn a function into a generator, it is obvious that this is the effect, and probably the intent since there is no other reason for correct code to have a non-executed yield.
Alternatively, you could go the decorator route:
import functools

def generator_that_yields_nothing(wrapped):
    @functools.wraps(wrapped)
    def wrapper_generator():
        if False: yield
        wrapped()
    return wrapper_generator

@generator_that_yields_nothing
def name_fitting_what_the_function_does():
    # ... that crucial stuff for which this exists
    ...
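With the decorator defined before it is used, calling name_fitting_what_the_function_does() returns a generator object; iterating it (for example with list(...)) runs the wrapped body and yields nothing, so the result is an empty list.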