Python idiom for applying sequential steps to an iterable

When doing data processing tasks I often find myself applying a series of compositions, vectorized functions, etc. to some input iterable of data to generate a final result. Ideally I would like something that will work for both lists and generators (in addition to any other iterable). I can think of a number of approaches to structuring code to accomplish this, but every one of them has some aspect that feels unclean or unidiomatic to me. I have outlined below the different methods I can think of, but my question is: is there a recommended, idiomatic way to do this?
Methods I can think of, illustrated with a simple example that is generally representative of my use case:
Write it as one large expression
result = [sum(group)
          for key, group in itertools.groupby(
              filter(lambda x: x < 2, [x**2 for x in input]),
              key=lambda x: x % 3)]
This is often quite difficult to read for any non-trivial sequence of steps. When reading through the code one also encounters each step in reverse order.
Save each step into a different variable name
squared = [x**2 for x in input]
filtered = filter(lambda x: x < 2, squared)
grouped = itertools.groupby(filtered, key=lambda x: x % 3)
result = [sum(group) for key, group in grouped]
This introduces a number of local variables that can often be hard to name descriptively. Additionally, if the results of some or all of the intermediate steps are especially large, keeping them around can waste a lot of memory. If one wants to add a step to this process, care must be taken that all variable names get updated correctly. For example, if we wished to divide every number by two, we would add the line halved = [x / 2.0 for x in filtered], but would also have to remember to change filtered to halved in the following line, as shown below.
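Concretely, the pipeline with the extra step becomes:
squared = [x**2 for x in input]
filtered = filter(lambda x: x < 2, squared)
halved = [x / 2.0 for x in filtered]  # the new step
grouped = itertools.groupby(halved, key=lambda x: x % 3)  # 'filtered' renamed to 'halved'
result = [sum(group) for key, group in grouped]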
Store each step into the same variable name
tmp = [x**2 for x in input]
tmp = filter(lambda x: x < 2, tmp)
tmp = itertools.groupby(tmp, key=lambda x: x % 3)
result = [sum(group) for key, group in tmp]
This seems to me the least bad of these options, but storing things in a generically named placeholder variable feels un-Pythonic to me and makes me suspect that there is some better way out there.

Code Review is often a better place for style questions; SO is more for problem solving. But CR can be picky about the completeness of the example.
Still, I can offer a few observations:
if you wrap this calculation in a function, naming isn't such a big deal. The names don't have to be globally meaningful.
a number of your expressions are generators. itertools functions tend to return lazy iterators, so memory use shouldn't be much of an issue.
def better_name(input):
    squared = (x**2 for x in input)  # generator expression
    filtered = filter(lambda x: x < 2, squared)
    grouped = itertools.groupby(filtered, lambda x: x % 3)
    result = (sum(group) for key, group in grouped)
    return result
list(better_name(input))
Using def functions instead of lambdas can also make the code clearer. There's a trade-off, though: your lambdas are simple enough that I'd probably keep them.
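For illustration, here is a sketch of the same pipeline with named functions instead of lambdas (the helper names is_small and mod3 are my own choices, not from the question):
import itertools

def is_small(x):  # hypothetical helper name
    return x < 2

def mod3(x):  # hypothetical helper name
    return x % 3

def better_name(input):
    squared = (x**2 for x in input)
    filtered = filter(is_small, squared)
    grouped = itertools.groupby(filtered, mod3)
    return (sum(group) for key, group in grouped)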
Your 2nd option is much more readable than the 1st. The order of the expressions guides my reading and mental evaluation. In the 1st it's hard to identify the innermost, or first, evaluation. And groupby is a complex operation, so any help in compartmentalizing the action is welcome.
Following the filter docs, these are equivalent:
filtered = filter(lambda x: x < 2, squared)
filtered = (x for x in squared if x < 2)
I was missing the return. The function could return a generator as I show, or an evaluated list.
Also note: groupby's key function parameter is named key, not keyfunc, so passing keyfunc= raises a TypeError. Pass the function positionally (as above) or as key=lambda x: x % 3.
groupby is a complex function. It returns an iterator that produces (key, group) tuples, where each group is itself an iterator. Returning something like this makes that structure more obvious:
((key, list(group)) for key, group in grouped)
So a code style that clarifies its use is desirable.
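To make that concrete, here is a minimal sketch (the data list is made up for illustration). Note that groupby only groups consecutive elements, so the input usually needs to be sorted by the same key first:
import itertools

def keyfunc(x):
    return x % 3

data = [1, 3, 4, 6, 7]  # made-up example data
grouped = itertools.groupby(sorted(data, key=keyfunc), keyfunc)
# Materialize the nested iterators into plain lists for inspection
materialized = [(key, list(group)) for key, group in grouped]
# -> [(0, [3, 6]), (1, [1, 4, 7])]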

Related

When and why to map a lambda function to a list

I am working through a preparatory course for a Data Science bootcamp, and it goes over the lambda keyword and the map and filter functions fairly early on. It gives you the syntax and how to use them, but I am looking for the why and the when, for context. Here is a sample of their solutions:
def error_line_traces(x_values, y_values, m, b):
    return list(map(lambda x_value: error_line_trace(x_values, y_values, m, b, x_value), x_values))
I feel as if every time I go over their solutions to the labs, I've turned a single-line return solution into a multi-part function. Is this just a matter of style, or is their approach something I should be adopting?
I'm not aware of any situations where it makes sense to use a map of a lambda, since it's shorter and clearer to use a generator expression instead. And a list of a map of a lambda is even worse, because it could be a list comprehension:
def error_line_traces(x_values, y_values, m, b):
    return [error_line_trace(x_values, y_values, m, b, x) for x in x_values]
Look how much shorter and clearer that is!
A filter of a lambda can also be rewritten as a comprehension. For example:
list(filter(lambda x: x > 5, range(10)))
[x for x in range(10) if x > 5]
That said, there are good uses for lambda, map, and filter, but usually not in combination. Even list(map(...)) can be OK depending on the context, for example converting a list of strings to a list of integers:
[int(x) for x in list_of_strings]
list(map(int, list_of_strings))
These are about as clear and concise, so really the only thing to consider is whether people reading your code will be familiar with map, and whether you want to give a meaningful name to the elements of the iterable (here x, which, admittedly, is not a great example).
Once you get past the bootcamp, keep in mind that map and filter are iterators and do lazy evaluation, so if you're only looping over them and not building a list, they're often preferable for performance reasons, though a generator will probably perform just as well.
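For example (list_of_strings and process are placeholder names, not from the question):
# No intermediate list is built; each string is converted
# only when the loop asks for the next element.
for n in map(int, list_of_strings):
    process(n)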

Change the variable in a for loop just like in a list comprehension

I want to be able to change what something does in a for loop.
Here is a very simple example:
for i in range(10):
    print(i)
This will print 0 up to 9 and not include 10, and of course I want 1 to 10, not 0 to 9.
To fix this I would like to say:
i+1 for i in range(10):
    print(i)
but I can't.
If I did a list comprehension I could do:
list0 = [i+1 for i in range(10)]
which is very handy.
Now I have to either do
for i in range(1, 10+1):
which is very annoying,
or do
print(i+1)
but if I used i 10 times I'd have to change them all,
or I could say:
for i in range(10):
    i += 1
These methods are all not very nice; I'm just wondering if the neat way I'm looking for exists at all.
Thanks.
You ask if there exists any way to change the value received from an iterable in a for loop. The answer is yes; this can be accomplished in one of two ways. I'll continue to use your example with range to demonstrate this, but do note that I am in no way suggesting that these are ideal ways of solving that particular problem.
The first method is using the builtin map:
for i in map(lambda x: x + 1, range(10)):
    print(i)
map accepts a callable and an iterable, and will lazily apply the given callable to each element produced by the iterable. Do note that since this involves an additional function call during each iteration, this technique can incur a noticeable runtime penalty compared to performing the same action within the loop body.
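A rough way to see that overhead for yourself is a quick timeit comparison (a sketch; the absolute numbers will vary by machine):
import timeit

# Calls the lambda once per element
t_map = timeit.timeit('for i in map(lambda x: x + 1, range(1000)): pass',
                      number=1000)
# Performs the increment inline in the loop body
t_inline = timeit.timeit('for i in range(1000): i + 1', number=1000)

print(t_map, t_inline)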
The second method is using a generator expression (or, alternatively, any other flavor of list/set/dict comprehension):
for i in (x + 1 for x in range(10)):
    print(i)
As with map, using a generator will lazily produce transformed elements from the given iterable. Do note that if you opt to use a comprehension instead, the entire collection will be constructed upfront, which may be undesirable.
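To make the distinction concrete:
lazy = (x + 1 for x in range(10))   # generator: nothing computed yet
eager = [x + 1 for x in range(10)]  # list: all ten values built upfront
for i in lazy:  # values are produced one at a time here
    print(i)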
Again, for incrementing the values produced by range, neither of these are ideal. Simply using range(1, 11) is the natural solution for that.

Datediff in same column

I'm trying to get the difference in time between the last two times a person has applied for our service. My solution works, but it's ugly.
Is there a more pythonic way of accomplishing this?
for customer in previous_apps:
    app_times = df.loc[df['customer_id'] == customer, 'hit_datetime']
    days_since_last_app = [(b - a).days for a, b in zip(app_times, app_times[1:])][-1:][0]
    df.loc[df['customer_id'] == customer, 'days_since_last_app'] = days_since_last_app
Having a list comprehension calculate all the differences between application dates, then slicing it with [-1:] so you have a list with only the last element, then extracting that element by indexing with [0], is completely unnecessary.
You can just take the last application date, app_times.iloc[-1], and the second to last one, app_times.iloc[-2], and take the difference:
days_since_last_app = (app_times.iloc[-1] - app_times.iloc[-2]).days
This will fail if there are fewer than 2 entries in the list, so you probably want a special case for that.
(I'm guessing that line evolved into what it is by trying to resolve IndexErrors that were the result of not having previous entries.)
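A minimal sketch of that special case, assuming app_times is the Series selected in the loop above:
if len(app_times) >= 2:
    days_since_last_app = (app_times.iloc[-1] - app_times.iloc[-2]).days
else:
    days_since_last_app = None  # or whatever sentinel fits your data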
Start by defining a two-argument function that calculates the time difference for you, e.g. time_diff(a, b). Use it something like this:
df["last_visit"] = df.groupby("customer_id").apply(
lambda x: x.apply(time_diff(*x["hit_datetime"][-2:]))
(Assuming the values in hit_datetime are sorted, which your code implies they are.)
The above "broadcasts" the last_visit values, since multiple records have the same customer_id. If you prefer you can just store the result as a Series with one row per customer:
last_visit = df.groupby("customer_id").apply(
    lambda x: time_diff(*x["hit_datetime"].iloc[-2:]))
I'm not sure I precisely understand how your data is structured, but the following should provide the functionality you require:
df.sort_values(['customer_id', 'hit_datetime'], ascending=True, inplace=True)
# x != x is True only for NaT, i.e. each customer's first row, which has no predecessor
df['days_since_last_app'] = df.groupby('customer_id')['hit_datetime'].transform(
    lambda y: y.diff().apply(lambda x: 0 if x != x else x.days))

Sum a python list without the string values

So according to duck-typing advice, you aren't supposed to check types in Python, but simply see whether an operation succeeds or fails. In which case, how do I sum a list of (mainly) numbers while omitting the strings?
sum([1, 2, 3, 4, ''])  # fails
sum(filter(lambda x: type(x) == int, [1, 2, 3, 4, '']))  # bad style
I would do something like this:
a = [1, 2, 3, 4, '']
print(sum(x if not isinstance(x, str) else 0 for x in a))
Well, I see two main solutions here:
Pre-processing: filter the input data to prevent occurrences of 'missing data'. This might be quite complex; we can't help you on this point without more information.
Post-processing: filter the result list and remove 'missing data'. This is easy, but it isn't really scalable.
About post-processing, here is a solution using list comprehension, and another using your filter-based approach:
a = [1,2,3,4,'']
filtered_a = [x for x in a if isinstance(x, int)]
filtered_a = filter(lambda x: isinstance(x, int), a)
Then, you can simply do sum(filtered_a)
We can also argue that you could check for data consistency during processing, and simply not add strings to your list in the first place.

Pythonic Improvement Of Function In List Comprehension?

Is there a more Pythonic way to write the following code? I would like to do it in one line.
parse_row is a function that can return a tuple of size 3, or None.
parsed_rows = [parse_row(tr) for tr in tr_els]
data = [x for x in parsed_rows if x is not None]
Doing this in one line won't make it more Pythonic; it will make it less readable. If you really want to, you can always translate it directly by substitution like this:
data = [x for x in [parse_row(tr) for tr in tr_els] if x is not None]
… which can obviously be flattened as Doorknob of Snow shows, but it's still hard to understand. However, he didn't get it quite right: clauses nest from left to right, and you want x to be each parse_row result, not each element of each parse_row result (as Volatility points out), so the flattened version would be:
data = [x for tr in tr_els for x in (parse_row(tr),) if x is not None]
I think the fact that a good developer got it backward and 6 people upvoted it before anyone realized the problem, and then I missed a second problem and 7 more people upvoted that before anyone caught it, is pretty solid proof that this is not more pythonic or more readable, just as Doorknob said. :)
In general, when faced with either a nested comp or a comp with multiple for clauses, if it's not immediately obvious what it does, you should translate it into nested for and if statements with an innermost append expression statement, as shown in the tutorial. But if you need to do that with a comprehension you're trying to write, it's a pretty good sign you shouldn't be trying to write it…
However, there is a way to make this more Pythonic, and also more efficient: change the first list comprehension to a generator expression, like this:
parsed_rows = (parse_row(tr) for tr in tr_els)
data = [x for x in parsed_rows if x is not None]
All I did is change the square brackets to parentheses, and that's enough to compute the first one lazily, calling parse_row on each tr as needed, instead of calling it on all of the rows, and building up a list in memory that you don't actually need, before you even get started on the real work.
In fact, if the only reason you need data is to iterate over it once (or to convert it into some other form, like a CSV file or a NumPy array), you can make that a generator expression as well.
Or, even better, replace the list comprehension with a map call. When your expression is just "call this function on each element", map is generally more readable (whereas when you have to write a new function, especially with lambda, just to wrap up some more complex expression, it's usually not). So:
parsed_rows = map(parse_row, tr_els)
data = [x for x in parsed_rows if x is not None]
And now it actually is readable to sub in:
data = [x for x in map(parse_row, tr_els) if x is not None]
You could similarly turn the second comprehension into a filter call. However, just as with map, if the predicate isn't just "call this function and see if it returns something truthy", it usually ends up being less readable. In this case:
data = filter(lambda x: x is not None, map(parse_row, tr_els))
But notice that you really don't need to check is not None in the first place. The only non-None values you have are 3-tuples, which are always truthy. So, you can replace the if x is not None with if x, which simplifies your comprehension:
data = [x for x in map(parse_row, tr_els) if x]
… and which can be written in two different ways with filter:
data = filter(bool, map(parse_row, tr_els))
data = filter(None, map(parse_row, tr_els))
Asking which of those two is better will start a religious war on any of the Python lists, so I'll just present them both and let you decide.
Note that if you're using Python 2.x, map is not lazy; it will generate the whole intermediate list. So, if you want to get the best of both worlds and can't use Python 3, use itertools.imap instead of map. And in the same way, in 3.x, filter is lazy, so if you want a list, use list(filter(…)).
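A sketch of both spellings, reusing the parse_row and tr_els names from the question:
# Python 3: map and filter are lazy iterators; wrap in list() when
# you actually need a list.
data = list(filter(None, map(parse_row, tr_els)))

# Python 2 equivalent of a lazy map (plain map builds a list there):
#     from itertools import imap
#     parsed_rows = imap(parse_row, tr_els)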
You can nest one in the other:
data = [x for tr in tr_els for x in parse_row(tr) if x is not None]
(Also, @Volatility points out that this will give an error if parse_row(tr) is None, which can be solved like this:
data = [x for tr in tr_els for x in (parse_row(tr),) if x is not None]
)
However, in my opinion this is much less readable. Shorter is not always better.
