Python: fastest way to create a list of n lists - python

So I was wondering how to best create a list of blank lists:
[[],[],[]...]
Because of how Python works with lists in memory, this doesn't work:
[[]]*n
This does create [[],[],...] but each element is the same list:
d = [[]]*n
d[0].append(1)
#[[1],[1],...]
Something like a list comprehension works:
d = [[] for x in xrange(0,n)]
But this uses the Python VM for looping. Is there any way to use an implied loop (taking advantage of it being written in C)?
d = []
map(lambda n: d.append([]),xrange(0,10))
This is actually slower. :(

The probably only way which is marginally faster than
d = [[] for x in xrange(n)]
is
from itertools import repeat
d = [[] for i in repeat(None, n)]
It does not have to create a new int object in every iteration and is about 15 % faster on my machine.
Edit: Using NumPy, you can avoid the Python loop using
d = numpy.empty((n, 0)).tolist()
but this is actually 2.5 times slower than the list comprehension.

The list comprehensions actually are implemented more efficiently than explicit looping (see the dis output for example functions) and the map way has to invoke an ophaque callable object on every iteration, which incurs considerable overhead overhead.
Regardless, [[] for _dummy in xrange(n)] is the right way to do it and none of the tiny (if existent at all) speed differences between various other ways should matter. Unless of course you spend most of your time doing this - but in that case, you should work on your algorithms instead. How often do you create these lists?

Here are two methods, one sweet and simple(and conceptual), the other more formal and can be extended in a variety of situations, after having read a dataset.
Method 1: Conceptual
X2=[]
X1=[1,2,3]
X2.append(X1)
X3=[4,5,6]
X2.append(X3)
X2 thus has [[1,2,3],[4,5,6]] ie a list of lists.
Method 2 : Formal and extensible
Another elegant way to store a list as a list of lists of different numbers - which it reads from a file. (The file here has the dataset train)
Train is a data-set with say 50 rows and 20 columns. ie. Train[0] gives me the 1st row of a csv file, train[1] gives me the 2nd row and so on. I am interested in separating the dataset with 50 rows as one list, except the column 0 , which is my explained variable here, so must be removed from the orignal train dataset, and then scaling up list after list- ie a list of a list. Here's the code that does that.
Note that I am reading from "1" in the inner loop since I am interested in explanatory variables only. And I re-initialize X1=[] in the other loop, else the X2.append([0:(len(train[0])-1)]) will rewrite X1 over and over again - besides it more memory efficient.
X2=[]
for j in range(0,len(train)):
X1=[]
for k in range(1,len(train[0])):
txt2=train[j][k]
X1.append(txt2)
X2.append(X1[0:(len(train[0])-1)])

To create list and list of lists use below syntax
x = [[] for i in range(10)]
this will create 1-d list and to initialize it put number in [[number] and set length of list put length in range(length)
To create list of lists use below syntax.
x = [[[0] for i in range(3)] for i in range(10)]
this will initialize list of lists with 10*3 dimension and with value 0
To access/manipulate element
x[1][5]=value

So I did some speed comparisons to get the fastest way.
List comprehensions are indeed very fast. The only way to get close is to avoid bytecode getting exectuded during construction of the list.
My first attempt was the following method, which would appear to be faster in principle:
l = [[]]
for _ in range(n): l.extend(map(list,l))
(produces a list of length 2**n, of course)
This construction is twice as slow as the list comprehension, according to timeit, for both short and long (a million) lists.
My second attempt was to use starmap to call the list constructor for me, There is one construction, which appears to run the list constructor at top speed, but still is slower, but only by a tiny amount:
from itertools import starmap
l = list(starmap(list,[()]*(1<<n)))
Interesting enough the execution time suggests that it is the final list call that is makes the starmap solution slow, since its execution time is almost exactly equal to the speed of:
l = list([] for _ in range(1<<n))
My third attempt came when I realized that list(()) also produces a list, so I tried the apperently simple:
l = list(map(list, [()]*(1<<n)))
but this was slower than the starmap call.
Conclusion: for the speed maniacs:
Do use the list comprehension.
Only call functions, if you have to.
Use builtins.

Related

List of lists updates an entire column when given a specific element [duplicate]

So I was wondering how to best create a list of blank lists:
[[],[],[]...]
Because of how Python works with lists in memory, this doesn't work:
[[]]*n
This does create [[],[],...] but each element is the same list:
d = [[]]*n
d[0].append(1)
#[[1],[1],...]
Something like a list comprehension works:
d = [[] for x in xrange(0,n)]
But this uses the Python VM for looping. Is there any way to use an implied loop (taking advantage of it being written in C)?
d = []
map(lambda n: d.append([]),xrange(0,10))
This is actually slower. :(
The probably only way which is marginally faster than
d = [[] for x in xrange(n)]
is
from itertools import repeat
d = [[] for i in repeat(None, n)]
It does not have to create a new int object in every iteration and is about 15 % faster on my machine.
Edit: Using NumPy, you can avoid the Python loop using
d = numpy.empty((n, 0)).tolist()
but this is actually 2.5 times slower than the list comprehension.
The list comprehensions actually are implemented more efficiently than explicit looping (see the dis output for example functions) and the map way has to invoke an ophaque callable object on every iteration, which incurs considerable overhead overhead.
Regardless, [[] for _dummy in xrange(n)] is the right way to do it and none of the tiny (if existent at all) speed differences between various other ways should matter. Unless of course you spend most of your time doing this - but in that case, you should work on your algorithms instead. How often do you create these lists?
Here are two methods, one sweet and simple(and conceptual), the other more formal and can be extended in a variety of situations, after having read a dataset.
Method 1: Conceptual
X2=[]
X1=[1,2,3]
X2.append(X1)
X3=[4,5,6]
X2.append(X3)
X2 thus has [[1,2,3],[4,5,6]] ie a list of lists.
Method 2 : Formal and extensible
Another elegant way to store a list as a list of lists of different numbers - which it reads from a file. (The file here has the dataset train)
Train is a data-set with say 50 rows and 20 columns. ie. Train[0] gives me the 1st row of a csv file, train[1] gives me the 2nd row and so on. I am interested in separating the dataset with 50 rows as one list, except the column 0 , which is my explained variable here, so must be removed from the orignal train dataset, and then scaling up list after list- ie a list of a list. Here's the code that does that.
Note that I am reading from "1" in the inner loop since I am interested in explanatory variables only. And I re-initialize X1=[] in the other loop, else the X2.append([0:(len(train[0])-1)]) will rewrite X1 over and over again - besides it more memory efficient.
X2=[]
for j in range(0,len(train)):
X1=[]
for k in range(1,len(train[0])):
txt2=train[j][k]
X1.append(txt2)
X2.append(X1[0:(len(train[0])-1)])
To create list and list of lists use below syntax
x = [[] for i in range(10)]
this will create 1-d list and to initialize it put number in [[number] and set length of list put length in range(length)
To create list of lists use below syntax.
x = [[[0] for i in range(3)] for i in range(10)]
this will initialize list of lists with 10*3 dimension and with value 0
To access/manipulate element
x[1][5]=value
So I did some speed comparisons to get the fastest way.
List comprehensions are indeed very fast. The only way to get close is to avoid bytecode getting exectuded during construction of the list.
My first attempt was the following method, which would appear to be faster in principle:
l = [[]]
for _ in range(n): l.extend(map(list,l))
(produces a list of length 2**n, of course)
This construction is twice as slow as the list comprehension, according to timeit, for both short and long (a million) lists.
My second attempt was to use starmap to call the list constructor for me, There is one construction, which appears to run the list constructor at top speed, but still is slower, but only by a tiny amount:
from itertools import starmap
l = list(starmap(list,[()]*(1<<n)))
Interesting enough the execution time suggests that it is the final list call that is makes the starmap solution slow, since its execution time is almost exactly equal to the speed of:
l = list([] for _ in range(1<<n))
My third attempt came when I realized that list(()) also produces a list, so I tried the apperently simple:
l = list(map(list, [()]*(1<<n)))
but this was slower than the starmap call.
Conclusion: for the speed maniacs:
Do use the list comprehension.
Only call functions, if you have to.
Use builtins.

time and space complexity comparison: adding elements of two lists to a dict through loop or dict(zip(s, t))

I'm new to python and am trying to figure out the concept of time and space complexity. I want to make a dict of two lists, both of the same length. I can do this in the following two ways:
1) by looping over the lists and adding them to the dict:
dictLists = {}
for i in range(0,len(list1)):
dictLists[list1[i]] = list2[i]
2) by zipping the lists and then making a dict from that:
dictZip = dict(zip(list1,list2))
To my understanding, the time complexity of the first method should be O(N) where N is the length of the lists. However, I do not know the time complexity for the second option, except the fact that the zip operation itself takes O(1) time complexity.
What would be the difference in time complexity between these two methods? Would there be additional space complexity in the second method due to an extra zip object?
Both have the same time and space complexity. They each have their own individual overheads that aren’t included when talking about complexity, like the zip object you mentioned and the range object you didn’t, all the function calls that happen in the shadows….
In practice, these aren’t important, so don’t micro-optimize prematurely (“prematurely” here means without having a good reason to expect a performance problem, without encountering one, and without benchmarking) – pick the readable option of dict(zip(list1, list2)).
P.S.
except the fact that the zip operation itself takes O(1) time complexity
Creating a zip is O(1), but iterating over all of its elements is O(N) on the number of elements.
Due to python being a dynamic interpreted language and needs to figure out the type of variables in runtime, some variation of the way you implement your code can be noticeably different in run time. For example in the first solution, python would need to figure out the type of "i" in every iteration (can by fixed using cython) so this would kinda slow down the program. With that being said you would not probably notice that with a small number of iteration. As you can see in the testbench the first approach is almost 4X slower.
import time
list1 = [x for x in range(1000000)]
list2 = [x for x in range(1000000)]
dictLists = dict()
l = len(list1)
s = time.time()
for i in range(0, l):
dictLists[list1[i]] = list2[i]
print(f"Time: {time.time()-s}")
# 0.39275574684143066
dictLists = dict()
s = time.time()
dictZip = dict(zip(list1,list2))
print(f"Time: {time.time()-s}")
# 0.09296393394470215

A more streamline, efficient way to split a list

is there a faster, more efficient way to split rows in a list. My current setup isn't slow but does take longer than I think to split the whole list, maybe due to how many iterations it is required to go through the whole list.
I currently have the code below
found_reader = pd.read_csv(file, delimiter='\n', engine='c')
loaded_list = found_reader
for i in range(len(loaded_list)):
loaded_email_list = loaded_email_list + [loaded_list[i].split(':')[0]]
I just would like a method to do the above in the quickest but efficient time
Here's how you do that efficiently if both loaded_list and loaded_email_list were regular lists (it may need slight adaptation for whatever it is that Pandas uses):
loaded_email_list += [x.partition(':')[0] for x in loaded_list]
Why this is better:
It iterates over the list directly, instead of using range, len, and an index variable
It uses partition, which stops looking after the first :, instead of split, which walks the whole string
It uses a list comprehension to create the new list all at once, rather than creating and concatenating a bunch of single-element lists
It uses x += y, instead of x = x + y, which could theoretically be faster if its __iadd__ is more efficient than assigning its __add__ result back to itself.

Creating a Python list comprehension with an if and break with nested for loops

I noticed from this answer that the code
for i in userInput:
if i in wordsTask:
a = i
break
can be written as a list comprehension in the following way:
next([i for i in userInput if i in wordsTask])
I have a similar problem which is that I would like to write the following (simplified from original problem) code in terms of a list comprehension:
for i in xrange(N):
point = Point(long_list[i],lat_list[i])
for feature in feature_list:
polygon = shape(feature['geometry'])
if polygon.contains(point):
new_list.append(feature['properties'])
break
I expect each point to be associated with a single polygon from the feature list. Hence, once a polygon that contains the point is found, break is used to move on to the next point. Therefore, new_list will have exactly N elements.
I wrote it as a list comprehension as follows:
new_list = [feature['properties'] for i in xrange(1000) for feature in feature_list if shape(feature['geometry']).contains(Point(long_list[i],lat_list[i])]
Of course, this doesn't take into account the break in the if statement, and therefore takes significantly longer than using nested for loops. Using the advice from the above-linked post (which I probably don't fully understand), I did
new_list2 = next(feature['properties'] for i in xrange(1000) for feature in feature_list if shape(feature['geometry']).contains(Point(long_list[i],lat_list[i]))
However, new_list2 has much fewer than N elements (in my case, N=1000 and new_list2 had only 5 elements)
Question 1: Is it even worth doing this as a list comprehension? The only reason is that I read that list comprehensions are usually a bit faster than nested for loops. With 2 million data points, every second counts.
Question 2: If so, how would I go about incorporating the break statement in a list comprehension?
Question 3: What was the error going on with using next in the way I was doing?
Thank you so much for your time and kind help.
List comprehensions are not necessarily faster than a for loop. If you have a pattern like:
some_var = []
for ...:
if ...:
some_var.append(some_other_var)
then yes, the list comprehension is faster than the bunch of .append()s. You have extenuating circumstances, however. For one thing, it is actually a generator expression in the case of next(...) because it doesn't have the [ and ] around it.
You aren't actually creating a list (and therefore not using .append()). You are merely getting one value.
Your generator calls Point(long_list[i], lat_list[i]) once for each feature for each i in xrange(N), whereas the loop calls it only once for each i.
and, of course, your generator expression doesn't work.
Why doesn't your generator expression work? Because it finds only the first value overall. The loop, on the other hand, finds the first value for each i. You see the difference? The generator expression breaks out of both loops, but the for loop breaks out of only the inner one.
If you want a slight improvement in performance, use itertools.izip() (or just zip() in Python 3):
from itertools import izip
for long, lat in izip(long_list, lat_list):
point = Point(long, lat)
...
I don't know that complex list comprehensions or generator expressions are that much faster than nested loops if they're running the same algorithm (e.g. visiting the same number of values). To get a definitive answer you should probably try to implement a solution both ways and test to see which is faster for your real data.
As for how to short-circuit the inner loop but not the outer one, you'll need to put the next call inside the main list comprehension, with a separate generator expression inside of it:
new_list = [next(feature['properties'] for feature in feature_list
if shape(feature['shape']).contains(Point(long, lat)))
for long, lat in zip(long_list, lat_list)]
I've changed up one other thing: Rather than indexing long_list and lat_list with indexes from a range I'm using zip to iterate over them in parallel.
Note that if creating the Point objects over and over ends up taking too much time, you can streamline that part of the code by adding in another nested generator expression that creates the points and lets you bind them to a (reusable) name:
new_list = [next(feature['properties'] for feature in feature_list
if shape(feature['shape']).contains(point))
for point in (Point(long, lat) for long, lat in zip(long_list, lat_list))]

Python: remove lots of items from a list

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.
I have a list of tuples. The list ranges in length from say 40,000 - 1,000,000 records. Now I have a dictionary where each and every (value, key) is a tuple in the list.
So, I might have
myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}
I want to remove each (v, k) tuple from the list.
Currently I am doing:
for k, v in myDict.iteritems():
myList.remove((v, k))
Removing 838 tuples from the list containing 20,000 tuples takes anywhere from 3 - 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000 so I need this to be faster.
Is there a better way to do this?
I can provide code used to test, plus pickled data from the actual application if needed.
You'll have to measure, but I can imagine this to be more performant:
myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)
because the lookup happens in the dict, which is more suited for this kind of thing. Note, though, that this will create a new list before removing the old one; so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggest might be in order.
Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."
To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:
totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]
The preparation of the set is a small one-time cost, wich saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assignign to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name -- you really want to rebind the contents!-).
I don't have your test-data around to do the time measurement myself, alas!, but, let me know how it plays our on your test data!
If the values are not hashable (e.g. they're sub-lists, for example), fastest is probably:
sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[0], sentinel) != x[1]]
or maybe (shouldn't make a big difference either way, but I suspect the previous one is better -- indexing is cheaper than unpacking and repacking):
sentinel = object()
myList[:] = [(a,b) for (a,b) in myList if myDict.get(a, sentinel) != b]
In these two variants the sentinel idiom is used to ward against values of None (which is not a problem for the preferred set-based approach -- if values are hashable!) as it's going to be way cheaper than if a not in myDict or myDict[a] != b (which requires two indexings into myDict).
Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.
Have you tried doing the "inverse" operation of:
newMyList = [(v,k) for (v,k) in myList if not k in myDict]
But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.
Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.
[(i, j) for i, j in myList if myDict.get(j) != i]
Try something like this:
myListSet = set(myList)
myDictSet = set(zip(myDict.values(), myDict.keys()))
myList = list(myListSet - myDictSet)
This will convert myList to a set, will swap the keys/values in myDict and put them into a set, and will then find the difference, turn it back into a list, and assign it back to myList. :)
The problem looks to me to be the fact you are using a list as the container you are trying to remove from, and it is a totally unordered type. So to find each item in the list is a linear operation (O(n)), it has to iterate over the whole list until it finds a match.
If you could swap the list for some other container (set?) which uses a hash() of each item to order them, then each match could be performed much quicker.
The following code shows how you could do this using a combination of ideas offered by myself and Nick on this thread:
list_set = set(original_list)
dict_set = set(zip(original_dict.values(), original_dict.keys()))
difference_set = list(list_set - dict_set)
final_list = []
for item in original_list:
if item in difference_set:
final_list.append(item)
[i for i in myList if i not in list(zip(myDict.values(), myDict.keys()))]
A list containing a million 2-tuples is not large on most machines running Python. However if you absolutely must do the removal in situ, here is a clean way of doing it properly:
def filter_by_dict(my_list, my_dict):
sentinel = object()
for i in xrange(len(my_list) - 1, -1, -1):
key = my_list[i][1]
if my_dict.get(key, sentinel) is not sentinel:
del my_list[i]
Update Actually each del costs O(n) shuffling the list pointers down using C's memmove(), so if there are d dels, it's O(n*d) not O(n**2). Note that (1) the OP suggests that d approx == 0.01 * n and (2) the O(n*d) effort is copying one pointer to somewhere else in memory ... so this method could in fact be somewhat faster than a quick glance would indicate. Benchmarks, anyone?
What are you going to do with the list after you have removed the items that are in the dict? Is it possible to piggy-back the dict-filtering onto the next step?

Categories

Resources