Python's list comprehensions and other better practices

This relates to a project to convert a 2-way ANOVA program in SAS to Python.
I pretty much started trying to learn the language Thursday, so I know I have a lot of room for improvement. If I'm missing something blatantly obvious, by all means, let me know. I haven't got Sage up and running yet, nor numpy, so right now, this is all quite vanilla Python 2.6.1. (portable)
Primary query: Need a good set of list comprehensions that can extract the data in lists of samples in lists by factor A, by factor B, overall, and in groups of each level of factors A&B (AxB).
After some work, the data is in the following form (3 layers of nested lists):
response[a][b][n]
(meaning [ [a1 [b1 [n1, ..., nN]] ... [bB [n1, ..., nN]]], ..., [aA [b1 [n1, ..., nN]] ... [bB [n1, ..., nN]]] ]
Hopefully that's clear.)
Factor levels in my example case: A=3 (0-2), B=8 (0-7), N=8 (0-7)
byA= [[a[i] for i in range(b)] for a[b] in response]
(Can someone explain why this syntax works? I stumbled into it trying to see what the parser would accept. I haven't seen that syntax attached to that behavior elsewhere, but it's really nice. Any good links on sites or books on the topic would be appreciated. Edit: Persistence of variables between runs explained this oddity. It doesn't work.)
byB = lstcrunch([[Bs[i] for i in range(len(Bs))] for Bs in response])
(It bears noting that zip(*response) almost does what I want. The above version isn't actually working, as I recall. I haven't run it through a careful test yet.)
byAxB= [item for sublist in response for item in sublist]
(Stolen from a response by Alex Martelli on this site. Again could someone explain why? List comprehension syntax is not very well explained in the texts I've been reading.)
ByO= [item for sublist in byAxB for item in sublist]
(Obviously, I simply reused the former comprehension here, 'cause it did what I need. Edit:)
I'd like these to end up the same datatype, at least when looped through by the factor in question, s.t. the same average/sum/SS/etc. functions can be applied and reused.
This could easily be replaced by something cleaner:
def lstcrunch(Dlist):
    """Returns a list containing the entire
    contents of whatever is imported,
    reduced by one level.
    If a rectangular array, it reduces a dimension by one.
    lstcrunch(DataSet[a][b]) -> DataOutput[a]
    [[1, 2], [[2, 3], [2, 4]]] -> [1, 2, [2, 3], [2, 4]]
    """
    flat = []
    if islist(Dlist):  # 1D top-level list
        for i in Dlist:
            if islist(i):
                flat += i
            else:
                flat.append(i)
        return flat
    else:
        return [Dlist]
Oh, while I'm on the topic, what's the preferred way of identifying a variable as a list?
I have been using:
def islist(a):
    "Returns 'True' if input is a list and 'False' otherwise"
    return type(a) == type([])
Parting query:
Is there a way to explicitly convert a shallow copy into a deep copy? Or, similarly, when assigning to a variable, is there a way of declaring that the assignment should replace the reference, too, and not merely the value, s.t. the assignment won't propagate to other shallow copies? Sharing references might be useful, as well, from time to time, so being able to control when it does or doesn't occur sounds really nice.
(I really stepped all over myself when I prepared my table for inserting by calling:
response=[[[0]*N]*B]*A
)
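For anyone hitting the same wall: the repeated multiplication copies references, not contents, so every "row" ends up being the very same inner list object. A quick sketch of the pitfall and the comprehension-based fix, with small hypothetical dimensions:

```python
A, B, N = 2, 2, 3  # small hypothetical dimensions

# The pitfall: * copies references, so all B rows (and all A planes)
# are the very same inner list object.
bad = [[[0] * N] * B] * A
bad[0][0][0] = 99
# every "row" now shows the change: bad[1][1][0] is also 99

# The fix: comprehensions build a fresh list on every iteration.
good = [[[0] * N for _ in range(B)] for _ in range(A)]
good[0][0][0] = 99
# good[1][1][0] is still 0
```

The same fix is what the gen3d function further down implements.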
Edit:
Further investigation led to most of this working fine. I've since made the class and tested it; it works fine. I'll leave the list comprehension forms intact for reference.
def byB(array_a_b_c):
    y = range(len(array_a_b_c))
    x = range(len(array_a_b_c[0]))
    return [[array_a_b_c[i][j][k]
             for k in range(len(array_a_b_c[0][0]))
             for i in y]
            for j in x]

def byA(array_a_b_c):
    return [[repn for rowB in rowA for repn in rowB]
            for rowA in array_a_b_c]

def byAxB(array_a_b_c):
    return [rowB for rowA in array_a_b_c
            for rowB in rowA]

def byO(array_a_b_c):
    return [rep
            for rowA in array_a_b_c
            for rowB in rowA
            for rep in rowB]

def gen3d(row, col, inner):
    """Produces a 3d nested array without any naughty shallow copies.
    [row[col[inner]]] named s.t. the outer can be split on, per lprn, for easy display"""
    return [[[k for k in range(inner)]
             for i in range(col)]
            for j in range(row)]

def lprn(X):
    """This prints a list by lines.
    Not fancy, but works"""
    if isiterable(X):
        for line in X: print line
    else:
        print X

def isiterable(a):
    return hasattr(a, "__iter__")
Thanks to everyone who responded. Already see a noticeable improvement in code quality due to improvements in my gnosis. Further thoughts are still appreciated, of course.

byAxB= [item for sublist in response for item in sublist] Again could someone explain why?
I am sure A.M. will be able to give you a good explanation. Here is my stab at it while waiting for him to turn up.
I would approach this from left to right. Take these four words:
for sublist in response
I hope you can see the resemblance to a regular for loop. These four words are doing the ground work for performing some action on each sublist in response. It appears that response is a list of lists. In that case sublist would be a list for each iteration through response.
for item in sublist
This is again another for loop in the making. Given that we first heard about sublist in the previous "loop" this would indicate that we are now traversing through sublist, one item at a time. If I were to write these loops out without comprehensions it would look like this:
for sublist in response:
    for item in sublist:
Next, we look at the remaining words: [, item and ]. This effectively means: collect the items in a list and return the resulting list.
Whenever you have trouble creating or understanding a list comprehension, write the relevant for loops out and then compress them:
result = []
for sublist in response:
    for item in sublist:
        result.append(item)
This will compress to:
[
    item
    for sublist in response
    for item in sublist
]
List comprehension syntax is not very well explained in the texts I've been reading
Dive Into Python has a section dedicated to list comprehensions. There is also this nice tutorial to read through.
Update
I forgot to say something. List comprehensions are another way of achieving what has been traditionally done using map and filter. It would be a good idea to understand how map and filter work if you want to improve your comprehension-fu.
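For instance, a filter-then-map pipeline and the equivalent comprehension side by side (a small illustrative sketch):

```python
nums = [1, 2, 3, 4, 5, 6]

# map/filter style: keep the even numbers, then double them
doubled_evens = list(map(lambda n: n * 2,
                         filter(lambda n: n % 2 == 0, nums)))

# the same thing as a list comprehension: the `if` clause is the
# filter, the expression before `for` is the map
doubled_evens_lc = [n * 2 for n in nums if n % 2 == 0]

# both produce [4, 8, 12]
```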

For the copy part, look into the copy module. Python simply uses references after the first object is created, so any change through another "copy" propagates back to the original; the copy module makes real copies of objects, and you can choose between shallow and deep copy modes.
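A minimal sketch of the difference (Python 3 spelling, but copy.copy and copy.deepcopy behave the same in 2.x):

```python
import copy

original = [[1, 2], [3, 4]]

alias = original                # plain assignment: same object, no copy at all
shallow = copy.copy(original)   # new outer list, but the inner lists are shared
deep = copy.deepcopy(original)  # fully independent copy, all the way down

original[0][0] = 99
# alias sees the change (same object); shallow sees it too, because it
# shares the inner lists; deep does not
```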

It is sometimes tricky to produce the right level of recursion in your data structure; however, I think in your case it should be relatively simple. To test it out we need some sample data, say:
data = [[a,
         [b,
          range(1, 9)]]
        for b in range(8)
        for a in range(3)]
print 'Origin'
print data
print 'Flat'
## from this we see how to produce the data flat
print [(a, b, c) for a, [b, c] in data]
print "Sum of data in third level = %f" % sum(point for a, [b, c] in data for point in c)
print "Sum of all data = %f" % sum(a + b + sum(c) for a, [b, c] in data)
For the type check, generally you should avoid it, but if you must (as when you do not want to recurse into strings) you can do it like this:
if not isinstance(data, basestring): ...
If you need to flatten a structure you can find useful code in the Python documentation (other ways to express it are chain(*listOfLists), or the list comprehension [d for sublist in listOfLists for d in sublist]):
from itertools import chain

def flatten(listOfLists):
    "Flatten one level of nesting"
    return chain.from_iterable(listOfLists)
This does not work, though, if you have data at different depths. For a heavyweight flattener see: http://www.python.org/workshops/1994-11/flatten.py
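For the common one-level case, the recipe and the comprehension agree; a quick check on hypothetical data:

```python
from itertools import chain

def flatten(list_of_lists):
    "Flatten one level of nesting"
    return chain.from_iterable(list_of_lists)

data = [[1, 2], [3, 4], [5]]  # hypothetical sample
via_chain = list(flatten(data))
via_comprehension = [d for sublist in data for d in sublist]
# both are [1, 2, 3, 4, 5]
```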

Related

How to add up values of the "sublists" within a list of lists

I have a list of lists in my script:
list = [[1,2],
        [4,3],
        [6,2],
        [1,6],
        [9,2],
        [6,5]]
I am looking for a solution to sum up the contents of each "sublist" within the list of lists. The desired output would be:
new_list = [3,7,8,7,11,11]
I know about combining ALL of these lists into one which would be:
new_list = [27,20]
But that's not what I'm looking to accomplish.
I need to combine the two values within these "sublists" and have them remain as their own entry in the main list.
I would also greatly appreciate it if it was also explained how you solved the problem rather than just handing me the solution. I'm trying to learn python so even a minor explanation would be greatly appreciated.
Using Python 3.7.4
Thanks, Riftie.
The "manual" solution would be using a for loop:
new_list = []
for sub_list in list:
    new_list.append(sum(sub_list))
or as a list comprehension:
new_list = [sum(sub_list) for sub_list in list]
The for loop iterates through the elements of list. In your case, list is a list of lists, so every element is a list by itself. That means that while iterating, sub_list is a simple list. To get the sum of a list I used the sum() built-in function. You can of course iterate manually and sum every element:
new_list = []
for sub_list in list:
    sum_val = 0
    for element in sub_list:
        sum_val = sum_val + element
    new_list.append(sum_val)
but no need for that.
A better approach would be to use numpy, which allows you to sum along an axis, as it treats a list of lists like an array. Since you are learning basic Python, it's too soon to learn about numpy; just keep in mind that there is a package for handling multi-dimensional arrays that lets you perform actions like summing along an axis of your choice.
Edit: I've seen the other solution suggested. Both will work, but I believe this solution is more "accessible" for someone learning to program for the first time. Using a list comprehension is great and correct, but may be a bit confusing when first learning. Also, as suggested, calling your variable list is a bad idea because it shadows the built-in. Better names would be my_list, tmp_list or something else.
Use list comprehension. Also avoid using keywords as variable names, in your case you overrode the builtin list.
# a, b -> sequence unpacking
summed = [a + b for a, b in lst] # where lst is your list of lists
# if the inner lists contain variable number of elements, a more
# concise solution would be
summed2 = [sum(seq) for seq in lst]
Read more about the powerful list comprehension here.

Python functions: return method inside a 'for' loop

I have the following code:
def encrypt(plaintext, k):
return "".join([alphabet[(alphabet.index(i)+k)] for i in plaintext.lower()])
I don't understand how Python can read this kind of syntax; can someone break down the order of execution here?
I came across this kind of "one-line" writing style in python a lot, which always seemed to be so elegant and efficient but I never understood the logic.
Thanks in advance, have a wonderful day.
In Python we call this a list comprehension. There other stackoverflow posts that have covered this topic extensively such as: What does “list comprehension” mean? How does it work and how can I use it? and Explanation of how nested list comprehension works?.
In your example the code is not complete, so it is hard to figure out what "alphabet" or "plaintext" are. However, let's try to break down what it does at a high level.
"".join([alphabet[(alphabet.index(i)+k)] for i in plaintext.lower()])
Can be broken down as:
"".join(  # the join method will stitch all the elements from the container (list) together
    [
        alphabet[alphabet.index(i) + k]  # alphabet seems to be a list that we index, shifted by k
        for i in plaintext.lower()
        # we loop through each element in plaintext.lower()
        # (notice the i is used in alphabet[alphabet.index(i) + k])
    ]
)
Note that we can re-write the comprehension as a for loop. I have created a similar example that I hope can clarify things better:
alphabet = ['a', 'b', 'c']
some_list = []
for i in "ABC".lower():
    some_list.append(alphabet[alphabet.index(i)])  # k omitted (i.e. k = 0) for simplicity
bringing_alphabet_back = "".join(some_list)
print(bringing_alphabet_back)  # abc
And last, the return just returns the result. It is similar to returning the entire result of bringing_alphabet_back.
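Putting it together as a runnable sketch: the real alphabet and plaintext aren't shown in the question, so the 26-letter string below is an assumption, and % is used to wrap around the end of the alphabet (without it, alphabet.index(i) + k can run out of range, as in the original):

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"  # assumed; not shown in the question

def encrypt(plaintext, k):
    # shift each letter k places, wrapping with % so we never index past the end
    return "".join(alphabet[(alphabet.index(i) + k) % len(alphabet)]
                   for i in plaintext.lower())

print(encrypt("Abc", 1))  # prints "bcd"
print(encrypt("xyz", 3))  # prints "abc"
```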

How to search for strings within nested lists

One of the questions for an assignment I'm doing consists of looking within a nested list consisting of "an ultrashort story and its author" to find a string that was inputted by a user. Not too sure how to go about this; here is the assignment brief below if anyone would like more clarification. There are also more questions I'm not too sure on, e.g. "find all stories by a certain author". Some explanations, or pointing me in the right direction, would be greatly appreciated :)
list = []
mylist = [['a','b','c'],['d','e','f']]
string = input("String?")
if string in [elem for sublist in mylist for elem in sublist] == True:
    list.append(elem)
This is just an example of something I've tried; the list above is similar enough to the one I'm actually using for the question. I've been going through different methods of iterating over nested lists and adding matching items to another list. The above code is just one example of an attempt I've made at this process.
""" the image above states that the data is in the
form of a list of sublists, with each sublist containing
two strings
"""
stories = [
    ['story string 1', 'author string 1'],
    ['story string 2', 'author string 2']
]

""" find stories that contain a given string
"""
stories_with_substring = []
substring = 'some string'  # search string
for story, author in stories:
    # if the substring is not in the story, a ValueError is raised
    try:
        story.index(substring)
        stories_with_substring.append((story, author))
    except ValueError:
        continue

""" find stories by a given author
"""
stories_by_author = []
target_author = 'first last'
for story, author in stories:
    if author == target_author:
        stories_by_author.append((story, author))
This line here
for story, author in stories:
'Unpacks' each pair. It's equivalent to
for pair in stories:
    story = pair[0]
    author = pair[1]
Or, to go even further:
i = 0
while i < len(stories):
    pair = stories[i]
    story = pair[0]
    author = pair[1]
    i += 1
I'm sure you can see how useful this is when dealing with lists that contain lists/tuples.
You may need to call .lower() on some of the strings if you want the search to be case insensitive
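To make the unpacking concrete, here is the same pattern on a couple of made-up story/author pairs, with .lower() applied on both sides for a case-insensitive match:

```python
stories = [  # made-up sample data
    ['The Fox and the Grapes', 'Aesop'],
    ['The Tortoise and the Hare', 'Aesop'],
]

substring = 'FOX'
matches = [(story, author)
           for story, author in stories            # unpack each [story, author] pair
           if substring.lower() in story.lower()]  # case-insensitive containment
# matches == [('The Fox and the Grapes', 'Aesop')]
```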
You can do a few things here. Your example showed the use of a list comprehension, so let's focus on some other aspects of this problem.
Recursion
You can define a function that iterates through all the items in the top level list. Assuming you know for sure that all items are either strings or more lists, you can use type() to check whether each item is another list or a string. If it's a string, do your search; if it's a list, have your function call itself. Let's look at an example. Please note that we should never use variables named list or string - these are built-in types and we don't want to accidentally overwrite them!
mylist = [['a','b','c'],['d','e','f']]

def find_nested_items(my_list, my_input):
    results = []
    for i in my_list:
        if type(i) == list:
            items = find_nested_items(i, my_input)
            results += items
        elif my_input in i:
            results.append(i)
    return results
We're doing a few things here:
- Creating an empty list named results
- Iterating through the top level items of my_list
- If one of those items is another list, we have our function call itself - at some point this will trigger the condition where an item is not a list, and will eventually return the results from that. For now, we assume the results we're getting back are going to be correct, so we concatenate those results to our top level results list
- If the item is not a list, we simply check for the existence of our input and, if found, add it to our results list
This kind of recursion is typically very safe, because it's inherently limited by our data structure. It can't run forever unless the data structure itself is infinitely deep.
Generators
Next, let's look at a much cooler function of python 3: generators. Right now, we're doing all the work of collecting the results in one go. If we later on want to iterate through those results, we need to iterate over them separately.
Instead of doing that, we can define a generator. This works almost the same, practically speaking, but instead of collecting the results in one loop and then using them in a second, we can collect and use each result all within a single loop. A generator "yields" a value, then stops until it is called the next time. Let's modify our example to make it a generator:
mylist = [['a','b','c'],['d','e','f']]

def find_nested_items(my_list, my_input):
    for i in my_list:
        if type(i) == list:
            yield from find_nested_items(i, my_input)
        elif my_input in i:
            yield i
You'll notice this version is a fair bit shorter. There's no need to hold items in a temporary list - each item is "yielded", which means it's passed directly to the caller to use immediately, and the caller will stop our generator until it needs the next value.
yield from basically does the same recursion, it simply sets up a generator within a generator to return those nested items back up the chain to the caller.
These are some good techniques to try - please give them a go!
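As a quick usage sketch of the generator version (note the loop reads the my_list parameter, not the global):

```python
def find_nested_items(my_list, my_input):
    # recursive generator: descend into sublists, yield matching strings
    for i in my_list:
        if type(i) == list:
            yield from find_nested_items(i, my_input)
        elif my_input in i:
            yield i

mylist = [['a', 'b', 'c'], ['d', 'e', 'f']]

# consume lazily, one match at a time ...
for match in find_nested_items(mylist, 'e'):
    print(match)  # prints "e"

# ... or drain everything into a list in one go
matches = list(find_nested_items(mylist, 'e'))  # ['e']
```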

Creating a Python list comprehension with an if and break with nested for loops

I noticed from this answer that the code
for i in userInput:
    if i in wordsTask:
        a = i
        break
can be written as a list comprehension in the following way:
next([i for i in userInput if i in wordsTask])
I have a similar problem which is that I would like to write the following (simplified from original problem) code in terms of a list comprehension:
for i in xrange(N):
    point = Point(long_list[i], lat_list[i])
    for feature in feature_list:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            new_list.append(feature['properties'])
            break
I expect each point to be associated with a single polygon from the feature list. Hence, once a polygon that contains the point is found, break is used to move on to the next point. Therefore, new_list will have exactly N elements.
I wrote it as a list comprehension as follows:
new_list = [feature['properties'] for i in xrange(1000) for feature in feature_list if shape(feature['geometry']).contains(Point(long_list[i], lat_list[i]))]
Of course, this doesn't take into account the break in the if statement, and therefore takes significantly longer than using nested for loops. Using the advice from the above-linked post (which I probably don't fully understand), I did
new_list2 = next(feature['properties'] for i in xrange(1000) for feature in feature_list if shape(feature['geometry']).contains(Point(long_list[i], lat_list[i])))
However, new_list2 has far fewer than N elements (in my case, N=1000 and new_list2 had only 5 elements)
Question 1: Is it even worth doing this as a list comprehension? The only reason is that I read that list comprehensions are usually a bit faster than nested for loops. With 2 million data points, every second counts.
Question 2: If so, how would I go about incorporating the break statement in a list comprehension?
Question 3: What was the error going on with using next in the way I was doing?
Thank you so much for your time and kind help.
List comprehensions are not necessarily faster than a for loop. If you have a pattern like:
some_var = []
for ...:
if ...:
some_var.append(some_other_var)
then yes, the list comprehension is faster than the bunch of .append()s. You have extenuating circumstances, however. For one thing, it is actually a generator expression in the case of next(...) because it doesn't have the [ and ] around it.
You aren't actually creating a list (and therefore not using .append()). You are merely getting one value.
Your generator calls Point(long_list[i], lat_list[i]) once for each feature for each i in xrange(N), whereas the loop calls it only once for each i.
and, of course, your generator expression doesn't work.
Why doesn't your generator expression work? Because it finds only the first value overall. The loop, on the other hand, finds the first value for each i. You see the difference? The generator expression breaks out of both loops, but the for loop breaks out of only the inner one.
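A tiny illustration of that difference on made-up data:

```python
buckets = [[1, 2], [3, 4], [5, 6]]  # stand-in for the feature list
targets = [4, 1, 6]                 # stand-in for the points

# next() *inside* the comprehension: a fresh inner generator per target,
# so the inner search short-circuits once for each target (like `break`)
firsts = [next(b for b in buckets if t in b) for t in targets]
# firsts == [[3, 4], [1, 2], [5, 6]]  -- one result per target

# a single next() around one big generator stops after the first
# match overall, which is the "too few elements" behaviour
first_only = next(b for t in targets for b in buckets if t in b)
# first_only == [3, 4]
```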
If you want a slight improvement in performance, use itertools.izip() (or just zip() in Python 3):
from itertools import izip

for long, lat in izip(long_list, lat_list):
    point = Point(long, lat)
    ...
I don't know that complex list comprehensions or generator expressions are that much faster than nested loops if they're running the same algorithm (e.g. visiting the same number of values). To get a definitive answer you should probably try to implement a solution both ways and test to see which is faster for your real data.
As for how to short-circuit the inner loop but not the outer one, you'll need to put the next call inside the main list comprehension, with a separate generator expression inside of it:
new_list = [next(feature['properties'] for feature in feature_list
                 if shape(feature['geometry']).contains(Point(long, lat)))
            for long, lat in zip(long_list, lat_list)]
I've changed up one other thing: Rather than indexing long_list and lat_list with indexes from a range I'm using zip to iterate over them in parallel.
Note that if creating the Point objects over and over ends up taking too much time, you can streamline that part of the code by adding in another nested generator expression that creates the points and lets you bind them to a (reusable) name:
new_list = [next(feature['properties'] for feature in feature_list
                 if shape(feature['geometry']).contains(point))
            for point in (Point(long, lat) for long, lat in zip(long_list, lat_list))]

Python: remove lots of items from a list

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.
I have a list of tuples. The list ranges in length from say 40,000 - 1,000,000 records. Now I have a dictionary where each and every (value, key) is a tuple in the list.
So, I might have
myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}
I want to remove each (v, k) tuple from the list.
Currently I am doing:
for k, v in myDict.iteritems():
    myList.remove((v, k))
Removing 838 tuples from the list containing 20,000 tuples takes anywhere from 3 - 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000 so I need this to be faster.
Is there a better way to do this?
I can provide code used to test, plus pickled data from the actual application if needed.
You'll have to measure, but I can imagine this to be more performant:
myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)
because the lookup happens in the dict, which is more suited for this kind of thing. Note, though, that this will create a new list before removing the old one; so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggest might be in order.
Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."
To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:
totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]
The preparation of the set is a small one-time cost, which saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assigning to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name -- you really want to rebind the contents!-).
I don't have your test data around to do the time measurement myself, alas!, but let me know how it plays out on your test data!
If the values are not hashable (e.g. they're sub-lists, for example), fastest is probably:
sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[0], sentinel) != x[1]]
or maybe (shouldn't make a big difference either way, but I suspect the previous one is better -- indexing is cheaper than unpacking and repacking):
sentinel = object()
myList[:] = [(a,b) for (a,b) in myList if myDict.get(a, sentinel) != b]
In these two variants the sentinel idiom is used to ward against values of None (which is not a problem for the preferred set-based approach -- if values are hashable!) as it's going to be way cheaper than if a not in myDict or myDict[a] != b (which requires two indexings into myDict).
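On the question's own sample data, the set-based recipe plays out like this (dict.items() is used here, the Python 3 spelling of iteritems()):

```python
myDict = {11: 20000, 9: 14000}
myList = [(20000, 11), (16000, 4), (14000, 9)]

# one-time set of (value, key) tuples to discard
totoss = set((v, k) for (k, v) in myDict.items())  # {(20000, 11), (14000, 9)}

# rebind the *contents* so other references to myList see the change
myList[:] = [x for x in myList if x not in totoss]
# myList == [(16000, 4)]
```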
Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.
Have you tried doing the "inverse" operation of:
newMyList = [(v,k) for (v,k) in myList if not k in myDict]
But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.
Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.
[(i, j) for i, j in myList if myDict.get(j) != i]
Try something like this:
myListSet = set(myList)
myDictSet = set(zip(myDict.values(), myDict.keys()))
myList = list(myListSet - myDictSet)
This will convert myList to a set, will swap the keys/values in myDict and put them into a set, and will then find the difference, turn it back into a list, and assign it back to myList. :) (Note that this loses the original ordering and any duplicate entries, since sets are unordered and hold unique items.)
The problem looks to me to be the fact that you are using a list as the container you are trying to remove from, and lookup in a list is not hash-based. So finding each item in the list is a linear operation (O(n)); it has to iterate over the whole list until it finds a match.
If you could swap the list for some other container (a set?) which uses a hash() of each item for lookup, then each match could be performed much more quickly.
The following code shows how you could do this using a combination of ideas offered by myself and Nick on this thread:
list_set = set(original_list)
dict_set = set(zip(original_dict.values(), original_dict.keys()))
difference_set = list_set - dict_set  # keep this as a set for O(1) membership tests

final_list = []
for item in original_list:
    if item in difference_set:
        final_list.append(item)
[i for i in myList if i not in list(zip(myDict.values(), myDict.keys()))]
A list containing a million 2-tuples is not large on most machines running Python. However if you absolutely must do the removal in situ, here is a clean way of doing it properly:
def filter_by_dict(my_list, my_dict):
    sentinel = object()
    for i in xrange(len(my_list) - 1, -1, -1):
        key = my_list[i][1]
        if my_dict.get(key, sentinel) == my_list[i][0]:
            del my_list[i]
Update Actually each del costs O(n) shuffling the list pointers down using C's memmove(), so if there are d dels, it's O(n*d) not O(n**2). Note that (1) the OP suggests that d approx == 0.01 * n and (2) the O(n*d) effort is copying one pointer to somewhere else in memory ... so this method could in fact be somewhat faster than a quick glance would indicate. Benchmarks, anyone?
What are you going to do with the list after you have removed the items that are in the dict? Is it possible to piggy-back the dict-filtering onto the next step?
