I'm certainly not the Python guru I'd like to be, and I mostly learn by studying and experimenting in my spare time, so it is very likely I'm about to ask a trivial question for experienced users... yet I really want to understand, and this is a place that helps me a lot.
Now, after the due premise, Python documentation says:
4.6.3. Mutable Sequence Types
s.append(x) appends x to the end of the sequence (same as
s[len(s):len(s)] = [x])
[...]
s.insert(i, x) inserts x into s at the index given by i (same as
s[i:i] = [x])
and, moreover:
5.1. More on Lists
list.append(x) Add an item to the end of the list. Equivalent to
a[len(a):] = [x].
[...]
list.insert(i, x) Insert an item at a given position. The first
argument is the index of the element before which to insert, so
a.insert(0, x) inserts at the front of the list, and a.insert(len(a),
x) is equivalent to a.append(x).
So now I'm wondering why there are two methods to do, basically, the same thing. Wouldn't it have been possible (and simpler) to have just one append/insert(x, i=len(this)), where i would be an optional parameter and, when absent, would mean add to the end of the list?
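The documented equivalences are easy to check directly; here's a tiny sketch with made-up values:

```python
a = [1, 2, 3]
a.append(4)      # same as a[len(a):] = [4]
a.insert(0, 0)   # same as a[0:0] = [0]

b = [1, 2, 3]
b[len(b):] = [4]  # slice assignment at the end == append
b[0:0] = [0]      # slice assignment at index 0 == insert at the front

assert a == b == [0, 1, 2, 3, 4]
```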
The difference between append and insert here is the same as in normal usage, and in most text editors. Append adds to the end of the list, while insert adds in front of a specified index. The reason they are different methods is both because they do different things, and because append can be expected to be a quick operation, while insert might take a while depending on the size of the list and where you're inserting, because everything after the insertion point has to be shifted.
I'm not privy to the actual reasons insert and append were made different methods, but I would make an educated guess that it is to help remind the developer of the inherent performance difference. Rather than one insert method, with an optional parameter, which would normally run in linear time except when the parameter was not specified, in which case it would run in constant time (very odd), a second method which would always be constant time was added. This type of design decision can be seen in other places in Python, such as when methods like list.sort return None, instead of a new list, as a reminder that they are in-place operations, and not creating (and returning) a new list.
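That list.sort behavior can be seen in two lines (a trivial sketch):

```python
xs = [3, 1, 2]
result = xs.sort()  # sorts in place and returns None as a reminder
print(result, xs)   # None [1, 2, 3]
```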
The other answer is great -- but another fundamental difference is that insert is much slower:
$ python -m timeit -s 'x=list(range(100000))' 'x.append(1)'
10000000 loops, best of 3: 0.19 usec per loop
$ python -m timeit -s 'x=list(range(100000))' 'x.insert(len(x), 1)'
1000000 loops, best of 3: 0.514 usec per loop
insert is O(n) in general: everything after the insertion point has to be shifted, so the cost grows with the number of elements that follow. (Inserting at len(x) shifts nothing, so the gap in the timings above is mostly the extra len(x) call and argument handling.) append is amortized O(1), meaning it's always very fast, no matter how big your list is.
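To see the general O(n) cost in action, here's a minimal timeit sketch comparing repeated appends to repeated front insertions (absolute numbers will vary by machine):

```python
from timeit import timeit

# 10,000 appends do linear total work; 10,000 front insertions shift
# every existing element each time, so quadratic total work.
t_append = timeit('x.append(1)', setup='x = []', number=10_000)
t_insert_front = timeit('x.insert(0, 1)', setup='x = []', number=10_000)
print(t_append, t_insert_front)  # front insertion should be far slower
```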
Related
I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2, 2, 4, 7, 7, 8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given, and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set: how does that make the code clearer or more efficient? Isn't it enough to create an empty list, as I did in my first example?
The code for the alternative solution is the following:
a = [10, 20, 30, 20, 10, 50, 60, 40, 80, 50, 40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error, TypeError: set() missing 1 required positional argument: 'items'. (This is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
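That difference is easy to measure with timeit; the sizes and counts below are arbitrary, and absolute timings depend on your machine:

```python
import timeit

n = 100_000
# Worst case for the list: the probe value sits at the very end,
# so the list scan must traverse all n elements every time.
t_list = timeit.timeit(f'{n - 1} in xs', setup=f'xs = list(range({n}))', number=1_000)
t_set = timeit.timeit(f'{n - 1} in xs', setup=f'xs = set(range({n}))', number=1_000)
print(t_list, t_set)  # the set lookup should be dramatically cheaper
```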
Lookup of an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, which is not your case). So if you use the same list both to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). A set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm takes only O(N) time on average, a whole order of growth better.
In most cases sets are faster than lists. One of these cases is when you look for an item using the in keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in second code snippet works faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I also think your code is inefficient in that, for each item, you search an ever-growing list, while the second snippet looks the item up in a set. (The set is not always smaller, though: if the input is all unique items, both containers end up the same size; the win comes from the hash lookup, not the size.)
Hope that clarifies things.
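As an aside (not part of the original exercise): since Python 3.7, plain dicts preserve insertion order, so the same de-duplication can be written as a one-liner:

```python
a = [10, 20, 30, 20, 10, 50, 60, 40, 80, 50, 40]
# dict keys are unique and (since 3.7) keep first-seen order
uniq_items = list(dict.fromkeys(a))
print(uniq_items)  # [10, 20, 30, 50, 60, 40, 80]
```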
lt = 1000  # list primes up to lt
remaining = list(range(2, lt + 1))  # remaining primes
for c in remaining:  # current "prime" being tested
    for t in remaining[0:remaining.index(c)]:  # test divisor
        if c % t == 0 and c != t:
            if c in remaining:
                remaining.remove(c)
If you don't need context:
How can I either re-run the same target-list value, or use something other than for that reads the expression list every iteration?
If you need context:
I am currently creating a program that lists primes from 2 to a given value (lt). I have a list, remaining, that starts as all integers from 2 to the given value. One at a time, the program takes a value from the list, c, and tests it for divisibility by each smaller number on the list, t. If c is divisible by t, it is removed from the list. By the end of the program, in theory, only primes should remain. But I have run into a problem: because I am removing items from the list, and for reads remaining only once, for skips values in remaining and thus leaves composites in the list.
What you're trying to do is almost never the right answer (and it's definitely not the right answer here, for reasons I'll get to later), which is why Python doesn't give you a way to do it automatically. In fact, it's illegal to delete from or insert into a list while you're iterating over it, even if CPython and other Python implementations usually don't check for that error.
But there is a way you can simulate what you want, with a little verbosity:
for i in range(remaining.index(c)):
    if i >= remaining.index(c): break
    t = remaining[i]
Now we're not iterating over remaining, we're iterating over its indices. So, if we remove values, we'll be iterating over the indices of the modified list. (Of course we're not really relying on the range there, since the if…break tests the same thing; if you prefer for i in itertools.count():, that will work too.)
And, depending on what you want to do, you can expand it in different ways, such as:
end = remaining.index(c)
for i in range(end):
    if i >= end: break
    t = remaining[i]
    # possibly subtract from end within the loop
    # so we don't have to recalculate remaining.index(c)
… and so on.
However, as I mentioned at the top, this is really not what you want to be doing. If you look at your code, it's not only looping over all the primes less than c, it's calling a bunch of functions inside that loop that also loop over either all the primes less than c or your entire list (that's how index, remove, and in work for lists), meaning you're turning linear work into quadratic work.
The simplest way around this is to stop trying to mutate the original list to remove composite numbers, and instead build a set of primes as you go along. You can search, add, and remove from a set in constant time. And you can just iterate your list in the obvious way because you're no longer mutating it.
Finally, this isn't actually implementing a proper prime sieve, but a much less efficient algorithm that for some reason everyone has been teaching as a Scheme example for decades and more recently translating into other languages. See The Genuine Sieve of Eratosthenes for details, or this project for sample code in Python and Ruby that shows how to implement a proper sieve and a bit of commentary on performance tradeoffs.
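For reference, a proper sieve can be sketched in a few lines (not tuned for performance):

```python
def sieve(limit):
    # Mark composites by crossing off multiples of each prime,
    # instead of testing divisibility or mutating a list mid-iteration.
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n in range(2, limit + 1) if is_prime[n]]

print(sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```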
(In the following, I ignore the XY problem of finding primes using a "mutable for".)
It's not entirely trivial to design an iteration over a sequence with well-defined (and efficient) behavior when the sequence is modified. In your case, where the sequence is merely being depleted, one reasonable thing to do is to use a list but "delete" elements by replacing them with a special value. (This makes it easy to preserve the current iteration position and avoids the cost of shifting the subsequent elements.)
To make it efficient to skip the deleted elements (both for the outer iteration and any inner iterations like in your example), the special value should be (or contain) a count of any following deleted elements. Note that there is a special case of deleting the current element, where for maximum efficiency you must move the cursor while you still know how far to move.
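Here's a bare-bones sketch of the tombstone idea, without the skip-count optimization described above (the sentinel and names are made up for illustration):

```python
DELETED = object()  # sentinel marking a "deleted" slot

xs = [2, 3, 4, 5, 6]
xs[2] = DELETED  # "delete" 4: no shifting, later positions stay valid

# iteration simply skips the tombstones
live = [x for x in xs if x is not DELETED]
print(live)  # [2, 3, 5, 6]
```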
EDIT: Python 2.7.8
I have two files. p_m has a few hundred records that contain acceptable values in column 2. p_t has tens of millions of records, and I want to make sure that column 14 of each is from the set of acceptable values already mentioned. So in the first while loop I read in all the acceptable values, make a set (for de-duping), and then turn that set into a list (I didn't benchmark whether a set would have been faster than a list, actually...). I got the second loop down to about as few lines as possible, but I don't know if they are the FASTEST few lines (I use the [14] index twice because exceptions are so very rare that I didn't want to bother with an assignment to a variable). Currently a scan takes about 40 minutes. Any ideas on how to improve that?
import sets  # deprecated Python 2 module; the set() builtin would also work

def contentScan(p_m, p_t):
    """ """
    vcont = sets.Set()
    i = 0
    h = open(p_m, "rb")
    while True:
        line = h.readline()
        if not line:
            break
        i += 1
        vcont.add(line.split("|")[2])
    h.close()
    vcont = list(vcont)
    vcont.sort()
    i = 0
    h = open(p_t, "rb")
    while True:
        line = h.readline()
        if not line:
            break
        i += 1
        if line.split("|")[14] not in vcont:
            print "%s is not defined in the matrix." % line.split("|")[14]
            return 1
    h.close()
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Checking for membership in a set of a few hundred items is much faster than checking for membership in the equivalent list. However, given your staggering 40-minute running time, the difference may not be that meaningful. E.g.:
ozone:~ alex$ python -mtimeit -s'a=list(range(300))' '150 in a'
100000 loops, best of 3: 3.56 usec per loop
ozone:~ alex$ python -mtimeit -s'a=set(range(300))' '150 in a'
10000000 loops, best of 3: 0.0789 usec per loop
so if you're checking "tens of millions of times" using the set should save you tens of seconds -- better than nothing, but barely measurable.
The same consideration applies for other very advisable improvements, such as turning the loop structure:
h = open(p_t, "rb")
while True:
    line = h.readline()
    if not line:
        break
    ...
h.close()
into a much-sleeker:
with open(p_t, 'rb') as h:
    for line in h:
        ...
again, this won't save you as much as a microsecond per iteration; so, over, say, 50 million lines, that's less than one of those 40 minutes. Ditto for the removal of the completely unused i += 1: it makes no sense for it to be there, but taking it away will make little difference.
One answer focused on the cost of the split operation. That depends on how many fields per record you have, but, for example:
ozone:~ alex$ python -mtimeit -s'a="xyz|"*20' 'a.split("|")[14]'
1000000 loops, best of 3: 1.81 usec per loop
so, again, whatever optimization here could save you maybe at most a microsecond per iteration -- again, another minute shaved off, if that.
Really, the key issue here is why reading and checking, e.g., 50 million records should take as much as 40 minutes (2400 seconds, i.e. 48 microseconds per line), and no doubt still more than 40 microseconds per line even with all the optimizations mentioned here and in other answers and comments.
So once you have applied all the optimizations (and confirmed the code is still just too slow), try profiling the program -- per e.g http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html -- to find out exactly where all of the time is going.
Also, just to make sure it's not simply the I/O to some peculiarly slow disk, do a run with the meaty part of the big loop "commented out", just reading the big file and doing no processing or checking at all on it. This will tell you the "irreducible" I/O overhead. If I/O is responsible for the bulk of your elapsed time, then you can't do much to improve things in code, though changing the open to open(thefile, 'rb', HUGE_BUFFER_SIZE) might help a bit, and you may want to consider improving the hardware set-up instead: defragment the disk, use a local rather than a remote filesystem, and so on.
The list lookup was the issue (as you correctly noticed). Searching a list has O(n) time complexity, where n is the number of items stored in the list. Finding a value in a hash table (which is what the Python dictionary actually is), on the other hand, has O(1) complexity. Since you have hundreds of items in the list, a list lookup is roughly two orders of magnitude more expensive than a dictionary lookup. This is in line with the 34x improvement you saw when replacing the list with the dictionary.
To further reduce execution time by 5-10x, you can use a Python JIT. I personally like PyPy (http://pypy.org/features.html). You do not need to modify your script; just install pypy and run:
pypy [your_script.py]
EDIT: Made it more Pythonic.
EDIT 2: Using the set builtin rather than a dict.
Based on the comments, I decided to try using a dict instead of a list to store the acceptable values against which I'd be checking the big file (I did keep a watchful eye on .split, but did not change it). Just changing the list to a dict produced an immediate and HUGE improvement in execution time.
Using timeit and running 5 iterations over a million-line file, I get 884.2 seconds for the list-based checker, and 25.4 seconds for the dict-based checker! So like a 34x improvement for changing 2 or 3 lines.
Thanks all for the inspiration! Here's the solution I landed on:
def contentScan(p_m, p_t):
    """ """
    vcont = set()
    with open(p_m, 'rb') as h:
        for line in h:
            vcont.add(line.split("|")[2])
    with open(p_t, "rb") as h:
        for line in h:
            if line.split("|")[14] not in vcont:
                print "%s is not defined in the matrix." % line.split("|")[14]
                return 1
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Yes, it's not optimal at all. split is EXPENSIVE like hell (it creates a new list plus N new strings and appends them to the list). Instead, scan forward to the "|" before field 14, scan from there to the next "|", and slice line between those two positions.
Pretty sure you can make it run 2-10x faster with this small change. To go further, don't even extract the substring: loop through the valid strings and, for each one, compare char by char starting just after that "|" while the chars match; if you reach the closing "|" for one of the strings, it's a good one. It will also help a bit to sort the valid-strings list by frequency in the data file. But not creating a list of dozens of strings for every line is far more important.
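A sketch of the no-split extraction (the helper name is made up; field index 14 is the text between the 14th and 15th "|"):

```python
def field_14(line):
    # Walk to the 14th "|" without building a list of all fields.
    pos = -1
    for _ in range(14):
        pos = line.find("|", pos + 1)
        if pos == -1:
            return None  # fewer than 15 fields
    end = line.find("|", pos + 1)
    if end == -1:
        end = len(line)
    return line[pos + 1:end]

line = "|".join(str(i) for i in range(20))
print(field_14(line))  # 14
```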
Here are your tests:
generator (you can adjust it to produce some realistic test data)
no-op (just reading)
original solution
no-split solution
I wasn't able to get the code to format here, so here is a gist:
https://gist.github.com/iced/16fe907d843b71dd7b80
Test conditions: VBox with 2 cores and 1GB RAM running Ubuntu 14.10 with the latest updates. Each variation was executed 10 times, rebooting the VBox before each run and throwing out the lowest and highest run times.
Results:
no_op: 25.025
original: 26.964
no_split: 25.102
original - no_op: 1.939
no_split - no_op: 0.077
Though in this case this particular optimization is useless, as the majority of the time is spent in I/O. I was unable to find a test setup where I/O took less than 70% of the total. In any case, split IS expensive and should be avoided when it's not needed.
PS. Yes, I understand that with even 1K of good items it's far better to use a hash (actually, it's better from the point where hash computation becomes faster than lookup, probably around 100 elements); my point is that split is expensive in this case.
This may seem like an odd question, but why doesn't Python "iterate" through a single object by default?
I feel it would increase the resilience of for loops in low-level programming/simple scripts.
At the same time, it would promote sloppiness in defining data structures properly, and it clashes with strings being iterable by character.
E.g.
x = 2
for a in x:
    print(a)
As opposed to:
x = [2]
for a in x:
    print(a)
Are there any reasons?
FURTHER INFO: I am writing a function that takes a column/multiple columns from a database and puts them into a list of lists. It would just be visually "nice" to have a number instead of a single-element list, without putting type sorting into the function (probably me just being OCD again, though).
Pardon the slightly ambiguous question; this is a "why is it so?", not a "how to?". But in my ignorant world, I would prefer integers to be iterable for the sake of the above-mentioned function. So why is it not implemented? Is it because adding an __iter__ to the integer object would be an extra strain on computing?
Discussion Points
Is an __iter__ too much of a drain on machine resources?
Do programmers want an exception to be thrown, since they expect integers to be non-iterable?
It brings up the idea that if you can't already do it, why not just allow it, since people in the status quo will keep doing what they've been doing, unaffected (unless, of course, the previous point is what they want); and
From a set theory perspective I guess strictly a set does not contain itself and it may not be mathematically "pure".
Python cannot iterate over an object that is not 'iterable'.
The for loop actually calls special methods on the iterable data type (its iterator hooks), which allow it to extract elements from the iterable.
Non-iterable data types don't have these methods, so there is no way to extract elements from them.
This Stack Overflow question on checking whether an object is iterable is a great resource.
The problem is with the definition of "single object". Is "foo" a single object (Hint: it is an iterable with three strings)? Is [[1, 2, 3]][0] a single object (It is only one object, with 3 elements)?
The short answer is that there is no generalizable way to do it. However, you can write functions that have knowledge of your problem domain and can do conversions for you. I don't know your specific case, but suppose you want to handle an integer or list of integers transparently. You can create your own iterator:
def col_iter(item):
    if isinstance(item, int):
        yield item
    else:
        for i in item:
            yield i

x = 2
for a in col_iter(x):
    print(a)

y = [1, 2, 3, 4]
for a in col_iter(y):
    print(a)
The only thing that I can think of is that Python for loops are looking for something to iterate through, not just a value. If you think about it, what would the value of a be? If you want it to be the number 2, then you don't need the for loop in the first place. If you want it to go through 1, 2 or 0, 1, 2, then you want for a in range(x):. Not positive that's the answer you're looking for, but it's what I've got.
total = sum([float(item) for item in s.split(",")])
total = sum(float(item) for item in s.split(","))
Source: https://stackoverflow.com/a/21212727/1825083
The first one uses a list comprehension to build a list of all of the float values.
The second one uses a generator expression to build a generator that only produces each float value as requested, one at a time. This saves a lot of memory when the list would be very large.
The generator expression may also be either faster (because it allows work to be pipelined, and avoids memory allocation times) or slower (because it adds a bit of overhead), but that's usually not a good reason to choose between them. Just follow this simple rule of thumb:
If you need a list (or, more likely, just something you can store, loop over multiple times, print out, etc.), build a list. If you just need to loop over the values, don't build a list.
In this case, obviously, you don't need a list, so leave the square brackets off.
In Python 2.x, there are some other minor differences; in 3.x, a list comprehension is actually defined as just calling the list function on a generator expression. (Although there is a minor bug in at least 3.0-3.3 which you will only find if you go looking for it very hard…)
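That equivalence is easy to check for a simple case (sample string made up):

```python
s = "1.23,2.4,3.123"
via_comprehension = [float(item) for item in s.split(",")]
via_list_of_genexpr = list(float(item) for item in s.split(","))
print(via_comprehension == via_list_of_genexpr)  # True
```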
The first one makes a list while the second one is a generator expression. Try them without the sum() function call.
In [25]: [float(a) for a in s.split(',')]
Out[25]: [1.23, 2.4, 3.123]
In [26]: (float(a) for a in s.split(','))
Out[26]: <generator object <genexpr> at 0x0698EF08>
In [27]: m = (float(a) for a in s.split(','))
In [28]: next(m)
Out[28]: 1.23
In [29]: next(m)
Out[29]: 2.4
In [30]: next(m)
Out[30]: 3.123
So, the first expression creates the whole list in memory first and then computes the sum, whereas the second one just gets the next item in the expression and adds it to its running total, which is more memory-efficient.
As others have said, the first creates a list, while the second creates a generator that generates all the values. The reason you might care about this is that creating the list puts all the elements into memory at once, whereas with the generator, you can process them as they are generated without having to store them all, which might matter for very large amounts of data.
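A small sketch of the memory difference (object sizes are CPython-specific and approximate):

```python
import sys

s = ",".join(["1.5"] * 10_000)
as_list = [float(item) for item in s.split(",")]
as_gen = (float(item) for item in s.split(","))

# the list stores all 10,000 floats; the generator is one small object
print(sys.getsizeof(as_list), sys.getsizeof(as_gen))
```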
The first one creates a list and then sums the numbers in the list. It is a list comprehension inside a sum.
The second one computes each item in turn and adds it to a running total, returning that running total as the sum once all items are exhausted. This is a generator expression.
It does not create a list at all, which means it doesn't spend extra time allocating memory for a list and populating it. It also has better space complexity, since it uses only constant space (for the call to float; the call to split, which both lines make, still builds a list of strings).