[Edit]
From the feedback/answers I have received, I gather there is some confusion regarding the original question. Consequently, I have reduced the problem to its most rudimentary form.
Here are the relevant facts of the problem:
I have a sorted sequence: S
I have an item (denoted by i) that is GUARANTEED to be contained in S
I want a find() algorithm that returns an iterator (iter) that points to i
After obtaining the iterator, I want to be able to iterate FORWARD (BACKWARD?) over the elements in S, starting FROM (and including) i
For my fellow C++ programmers who can also program in Python, what I am asking for is the equivalent of:
const_iterator std::find(const key_type& x) const;
The iterator returned can then be used to iterate over the sequence. I am just trying to find out (pun unintended) whether there is a similar built-in algorithm in Python, to save me having to reinvent the wheel.
Given your relevant facts:
>>> import bisect
>>> def find_fwd_iter(S, i):
...     j = bisect.bisect_left(S, i)
...     for k in xrange(j, len(S)):
...         yield S[k]
...
>>> def find_bkwd_iter(S, i):
...     j = bisect.bisect_left(S, i)
...     for k in xrange(j, -1, -1):
...         yield S[k]
...
>>> L = [100, 150, 200, 300, 400]
>>> list(find_fwd_iter(L, 200))
[200, 300, 400]
>>> list(find_bkwd_iter(L, 200))
[200, 150, 100]
>>>
Yes, you can do it like this:
import itertools
from datetime import datetime

data = {
    "2008-11-10 17:53:59": "data",
    "2005-11-10 17:53:59": "data",
}

list_ = data.keys()
new_list = [datetime.strptime(x, "%Y-%m-%d %H:%M:%S") for x in list_]

begin_date = datetime.strptime("2007-11-10 17:53:59", "%Y-%m-%d %H:%M:%S")

for i in itertools.ifilter(lambda x: x > begin_date, new_list):
    print i
If you know for a fact that the items in your sequence are sorted, you can just use a generator expression:
(item for item in seq if item >= 5)
This returns a generator; it doesn't actually traverse the list until you iterate over it, i.e.:
for item in (item for item in seq if item >= 5):
    print item
will only traverse seq once.
Using a generator expression like this is pretty much identical to using itertools.ifilter, which produces a generator that iterates over the list returning only values that meet the filter criterion:
>>> import itertools
>>> seq = [1, 2, 3, 4, 5, 6, 7]
>>> list(itertools.ifilter(lambda x: x>=3, seq))
[3, 4, 5, 6, 7]
I'm not sure why (except for backwards compatibility) we need itertools.ifilter anymore now that we have generator expressions, but other methods in itertools are invaluable.
If, for instance, you don't know that your sequence is sorted and you still want to return everything in the sequence from a known item onward, you can't use a generator expression. Instead, use itertools.dropwhile. This produces a generator that skips values for as long as the predicate holds, and then yields every remaining value unchanged:
>>> seq = [1, 2, 4, 3, 5, 6, 7]
>>> list(itertools.dropwhile(lambda x: x != 3, seq))
[3, 5, 6, 7]
As far as searching backwards goes, this will only work if the sequence you're using is actually a sequence (like a list, i.e. something that has an end and can be navigated backwards) and not just any iterable (e.g. a generator that returns the next prime number). To do this, use the reversed function, e.g.:
(item for item in reversed(seq) if item >= 5)
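To mirror the original question's "start from i and walk backward" request on a concrete sequence, you could combine reversed with dropwhile. A minimal sketch (note that if i occurs more than once, this starts from the last occurrence):
from itertools import dropwhile

seq = [1, 2, 3, 4, 5, 6, 7]
# walk backward starting from (and including) 5:
print(list(dropwhile(lambda x: x != 5, reversed(seq))))
# [5, 4, 3, 2, 1]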
One simpler way (albeit slower) would be to use filter to select the keys before/after that date. filter has to process every element of the list, whereas slicing does not.
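A minimal sketch of that filter-based approach (the dates list here is illustrative):
dates = ["2001-01-01", "2002-01-01", "2004-01-02", "2005-01-02"]
# keep only the keys on or after the cutoff date
on_or_after = list(filter(lambda d: d >= "2002-01-01", dates))
# ['2002-01-01', '2004-01-02', '2005-01-02']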
You can do
def on_or_after(date):
    from itertools import dropwhile
    sorted_items = sorted(date_dictionary.iteritems())
    def before_date(pair):
        return pair[0] < date
    on_or_after_date = dropwhile(before_date, sorted_items)
    return on_or_after_date
which I think is about as efficient as it's going to get if you're just doing one such lookup on each sorted collection. The returned iterator yields (date, value) pairs.
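A hypothetical usage (assuming date_dictionary maps date strings to values, as above):
for date, value in on_or_after("2002-01-01"):
    print date, value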
Another option would be to build a dictionary as a separate index into the sorted list:
sorted_items = sorted(date_dictionary.iteritems())
date_index = dict((pair[0], i) for i, pair in enumerate(sorted_items))
and then get the items on or after a date with
def on_or_after(date):
    return sorted_items[date_index[date]:]
This second approach will be faster if you're going to be doing a lot of lookups on the same series of sorted dates (which it sounds like you are).
If you want really speedy slicing of the sorted dates, you might see some improvement by storing it in a tuple instead of a list. I could be wrong about that though.
note the above code is untested, let me know if it doesn't work and you can't sort out why.
First off, this question isn't related to dicts. You're operating on a sorted list. You're using the results on a dict, but that's not relevant to the question.
You want the bisect module, which implements binary searching. Starting from your code:
import bisect

mydict = {
    "2001-01-01": "data1",
    "2005-01-02": "data2",
    "2002-01-01": "data3",
    "2004-01-02": "data4",
}

# ['2001-01-01', '2002-01-01', '2004-01-02', '2005-01-02']:
sorted_dates = sorted(mydict)

# Iterates over 2002-01-01, 2004-01-02 and 2005-01-02:
offset = bisect.bisect_left(sorted_dates, "2002-01-01")
for item in sorted_dates[offset:]:
    print item
I have a list:
input = ['a','b','c','a','b','d','e','d','g','g']
I want the index of every element in the list, skipping duplicates (i.e. keeping only the first occurrence of each value).
output = [0,1,2,5,6,8]
Iterate over the enumerated list, adding each element to a set of "seen" elements as you go; append the index to the output list only if the element hasn't already been seen (is not in the "seen" set).
(Oh, and the name input shadows the built-in input() function, so I renamed it input_list.)
output = []
seen = set()
for i, e in enumerate(input_list):
    if e not in seen:
        output.append(i)
        seen.add(e)
which gives output as [0, 1, 2, 5, 6, 8].
why use a set?
You could be thinking, why use a set when you could do something like:
[i for i,e in enumerate(input_list) if input_list.index(e) == i]
This works because .index returns the index of the first element in the list with that value, so comparing an element's position against it tells you whether it is the first occurrence, letting you filter out the elements that aren't first occurrences.
However, this is not as efficient as using a set, because list.index requires Python to iterate over the list until it finds the element (or doesn't). That operation has O(n) complexity, and since we are calling it for every element in input_list, the whole solution would be O(n^2).
On the other hand, using a set, as in the first solution, yields an O(n) solution, because checking if an element is in a set is O(1) (average case). This is due to how sets are implemented: roughly, each element is stored in a slot derived from its hash, so membership can be checked by hashing the element and looking in that slot rather than scanning the whole collection (an oversimplification, but it conveys the idea).
Thus, since each check for membership is O(1), and we do this for each element, we get an O(n) solution which is much better than an O(n^2) solution.
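If you want to see the difference for yourself, here is a rough timeit sketch (not a rigorous benchmark; the list and repeat counts are arbitrary and exact numbers will vary):
import timeit

setup = "input_list = list(range(100)) * 2"

set_version = """
seen = set()
output = []
for i, e in enumerate(input_list):
    if e not in seen:
        output.append(i)
        seen.add(e)
"""

index_version = "[i for i, e in enumerate(input_list) if input_list.index(e) == i]"

print(timeit.timeit(set_version, setup, number=1000))    # O(n) version
print(timeit.timeit(index_version, setup, number=1000))  # O(n^2) version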
You could do something like this, checking counts (although it is computation-heavy):
indexes = []
for i, x in enumerate(inputlist):
    if (inputlist.count(x) == 1
            or x not in inputlist[:i]):
        indexes.append(i)
This appends the index when either of the following holds:
the item appears only once in the list, or
the item hasn't appeared before in the list up till now (i.e. this is its first occurrence).
In case you don't mind getting the indexes of the last occurrences of duplicates instead, and you are using Python 3.6+ (where dicts preserve insertion order), here's an alternative solution:
list(dict(map(reversed, enumerate(input))).values())
This returns:
[3, 4, 2, 7, 6, 9]
Here is a one-liner using zip and reversed:
>>> input = ['a','b','c','a','b','d','e','d','g','g']
>>> sorted(dict(zip(reversed(input), range(len(input)-1, -1, -1))).values())
[0, 1, 2, 5, 6, 8]
This question is missing a pandas solution. 😉
>>> import pandas as pd
>>> inp = ['a','b','c','a','b','d','e','d','g','g']
>>>
>>> pd.DataFrame(list(enumerate(inp))).groupby(1).first()[0].tolist()
[0, 1, 2, 5, 6, 8]
Yet another version, using a side effect in a list comprehension.
>>> xs=['a','b','c','a','b','d','e','d','g','g']
>>> seen = set()
>>> [i for i, v in enumerate(xs) if v not in seen and not seen.add(v)]
[0, 1, 2, 5, 6, 8]
The list comprehension filters indices of values that have not been seen already.
The trick is that not seen.add(v) is always true because seen.add(v) returns None.
Because of short circuit evaluation, seen.add(v) is performed if and only if v is not in seen, adding new values to seen on the fly.
At the end, seen contains all the values of the input list.
>>> seen
{'a', 'c', 'g', 'b', 'd', 'e'}
Note: it is usually a bad idea to use side effects in a list comprehension,
but you might see this trick sometimes.
I want to calculate the sum of a collection, for sections of different sizes:
d = (1, 2, 3, 4, 5, 6, 7, 8, 9)
sz = (2, 3, 4)
# here I expect 1+2=3, 3+4+5=12, 6+7+8+9=30
itd = iter(d)
result = tuple(sum(tuple(next(itd) for i in range(s))) for s in sz)
print("result = {}".format(result))
I wonder whether the solution I came up with is the most 'pythonic' (elegant, readable, concise) way to achieve what I want...
In particular, I wonder whether there is a way to get rid of the separate iterator 'itd', and whether it would be easier to work with slices?
I would use itertools.islice since you can directly use the values in sz as the step size at each point:
>>> from itertools import islice
>>> it = iter(d)
>>> [sum(islice(it, s)) for s in sz]
[3, 12, 30]
Then you can convert that to a tuple if needed.
The iter is certainly needed in order to step through the tuple from the point where the last slice left off. Otherwise each slice would be d[0:s].
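For completeness, since the question asked about slices: you would have to track the running offsets yourself, which is arguably less tidy. A sketch:
offsets = [0]
for s in sz:
    offsets.append(offsets[-1] + s)  # cumulative end positions: [0, 2, 5, 9]
result = [sum(d[a:b]) for a, b in zip(offsets, offsets[1:])]
# [3, 12, 30]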
There's no reason to get rid of your iterator – iterating over d is what you are doing, after all.
You do seem to have an overabundance of tuples in that code, though. The line that's doing all the work could be made more legible by getting rid of them:
it = iter(d)
result = [sum(next(it) for _ in range(s)) for s in sz]
# [3, 12, 30]
… which has the added advantage that now you're producing a list rather than a tuple. d and sz also make more sense as lists, by the way: they're variable-length sequences of homogeneous data, not fixed-length sequences of heterogeneous data.
Note also that it is the conventional name for an arbitrary iterator, and _ is the conventional name for any variable that must exist but is never actually used.
Going a little further, next(it) for _ in range(s) is doing the same work that islice() could do more legibly:
from itertools import islice
it = iter(d)
result = [sum(islice(it, s)) for s in sz]
# [3, 12, 30]
… at which point, I'd say the code's about as elegant, readable and concise as it's likely to get.
I have the following problem while trying to do some nodal analysis:
For example:
my_list=[[1,2,3,1],[2,3,1,2],[3,2,1,3]]
I want to write a function that treats the element lists inside my_list in the following way:
- The number of occurrences of an element inside a list is not important: as long as the unique elements inside two lists are the same, the lists are identical.
- Find the identical loops based on the above premises, keep only the first one, and ignore the other identical lists of my_list while preserving the order.
Thus, in the above example the function should return just the first list, [1,2,3,1], because all the lists inside my_list are equal based on the above premises.
I wrote a function in Python to do this, but I think it can be shortened and I am not sure whether this is an efficient way to do it. Here is my code:
def _remove_duplicate_loops(duplicate_loop):
    loops = []
    for i in range(len(duplicate_loop)):
        unique_el_list = []
        for j in range(len(duplicate_loop[i])):
            if (duplicate_loop[i][j] not in unique_el_list):
                unique_el_list.append(duplicate_loop[i][j])
        loops.append(unique_el_list[:])
    loops_set = [set(x) for x in loops]
    unique_loop_dict = {}
    for k in range(len(loops_set)):
        if (loops_set[k] not in list(unique_loop_dict.values())):
            unique_loop_dict[k] = loops_set[k]
    unique_loop_pos = list(unique_loop_dict.keys())
    unique_loops = []
    for l in range(len(unique_loop_pos)):
        unique_loops.append(duplicate_loop[unique_loop_pos[l]])
    return unique_loops
from collections import OrderedDict

my_list = [[1, 2, 3, 1], [2, 3, 1, 2], [3, 2, 1, 3]]

seen_combos = OrderedDict()
for sublist in my_list:
    unique_elements = frozenset(sublist)
    if unique_elements not in seen_combos:
        seen_combos[unique_elements] = sublist

my_list = seen_combos.values()
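As a quick illustrative check with the example input, only the first list survives:
>>> list(my_list)
[[1, 2, 3, 1]]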
You could do it in a fairly straightforward way using dictionaries, but you'll need to use frozenset instead of set, as sets are mutable and therefore not hashable.
from collections import OrderedDict

def _remove_duplicate_lists(duplicate_loop):
    dupdict = OrderedDict((frozenset(x), x) for x in reversed(duplicate_loop))
    return reversed(dupdict.values())
should do it. Note the double reversed(): normally the last item with a given key is the one that is preserved, whereas you want the first, and the two reverses accomplish that.
Edit: correction, yes, per Steven's answer, it must be an OrderedDict(), or the values returned will not be correct. His version might be slightly faster too.
edit again: You need an ordered dict if the order of the lists is important. Say your list is
[[1,2,3,4], [4,3,2,1], [5,6,7,8]]
The ordered dict version will ALWAYS return
[[1,2,3,4], [5,6,7,8]]
However, the regular dict version may return the above, or may return
[[5,6,7,8], [1,2,3,4]]
If you don't care, a non-ordered dict version may be faster/use less memory.
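For reference, a sketch of that plain-dict variant (the function name is made up here, and the order of the result is not guaranteed):
def _remove_duplicate_lists_unordered(duplicate_loop):
    # last assignment wins, so reversing the input keeps each first occurrence
    dupdict = dict((frozenset(x), x) for x in reversed(duplicate_loop))
    return dupdict.values()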
I just started Python classes and I'm really in need of some help. Please keep in mind that I'm new if you're answering this.
I have to make a program that takes the average of all the elements in a certain list "l". That is a pretty easy function by itself; the problem is that the teacher wants us to remove any empty string present in the list before doing the average.
So when I receive the list [1,2,3,'',4] I want the function to ignore the '' for the average, and just take the average of the other four values (i.e. divide by 4 rather than by len(l)). Can anyone help me with this?
Maybe a loop that keeps comparing each position of the list with '' and removes those from the list? I've tried that, but it's not working.
You can use a list comprehension to remove all elements that are '':
mylist = [1, 2, 3, '', 4]
mylist = [i for i in mylist if i != '']
Then you can calculate the average by taking the sum and dividing it by the number of elements in the list:
avg = sum(mylist)/len(mylist)
Floating-point average (assuming Python 2)
Depending on your application you may want your average to be a float and not an int. If that is the case, cast one of these values to a float first:
avg = float(sum(mylist))/len(mylist)
Alternatively you can use Python 3's division:
from __future__ import division
avg = sum(mylist)/len(mylist)
You can use filter():
filter() returns a list in Python 2 if we pass it a list, and an iterator in Python 3. As suggested by @PhilH, you can use itertools.ifilter() in Python 2 to get an iterator.
To get a list as output in Python 3 use list(filter(lambda x:x != '', lis))
In [29]: lis = [1, 2, 3, '', 4, 0]
In [30]: filter(lambda x:x != '', lis)
Out[30]: [1, 2, 3, 4, 0]
Note: to filter out any falsy value you can simply use filter(None, ...); be aware that this also drops 0, which would change your average:
>>> lis = [1, 2, 3, '', 4, 0]
>>> filter(None, lis)
[1, 2, 3, 4]
The other answers show you how to create a new list with the desired element removed (which is the usual way to do this in python). However, there are occasions where you want to operate on a list in place -- Here's a way to do it operating on the list in place:
while True:
    try:
        mylist.remove('')
    except ValueError:
        break
Although I suppose it could be argued that you could do this with slice assignment and a list comprehension:
mylist[:] = [i for i in mylist if i != '']
And, as some have raised issues about memory usage and the wonders of generators:
mylist[:] = (i for i in mylist if i != '')
works too.
itertools.ifilterfalse(lambda x: x=='', myList)
This uses iterators, so it doesn't create a copy of the list and should be more efficient in both time and memory, making it robust for long lists.
JonClements points out that this means keeping track of the length separately, so to show that process:
def ave(anyOldIterator):
    elementCount = 0
    runningTotal = 0
    for element in anyOldIterator:
        runningTotal += element
        elementCount += 1
    return runningTotal / elementCount
Or even better
def ave(anyOldIterator):
    idx = None
    runningTotal = 0
    for idx, element in enumerate(anyOldIterator):
        runningTotal += element
    return runningTotal / (idx + 1)
Reduce:
def ave(anyOldIterator):
    pieces = reduce(lambda x, y: (y[0], x[1] + y[1]), enumerate(anyOldIterator))
    return pieces[1] / (pieces[0] + 1)
Timeit on the average of range(0,1000) run 10000 times gives the list comprehension a time of 0.9s and the reduce version 0.16s. So it's already 5x faster before we add in filtering.
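For reference, a sketch of how a comparison like that can be run with timeit (Python 2 era, where reduce is a builtin; exact figures will vary by machine):
import timeit

setup = "data = range(0, 1000)"

listcomp_stmt = "sum([x for x in data]) / len(data)"
reduce_stmt = ("p = reduce(lambda x, y: (y[0], x[1] + y[1]), enumerate(data)); "
               "p[1] / (p[0] + 1)")

print(timeit.timeit(listcomp_stmt, setup, number=10000))
print(timeit.timeit(reduce_stmt, setup, number=10000))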
You can use:
alist = ['',1,2]
new_alist = filter(None, alist)
new_alist_2 = filter(bool, alist)
Result:
new_alist = [1,2]
new_alist_2 = [1,2]
mylist = [1, 2, 3, '', 4]
newlist = []
for i in mylist:
    try:
        newlist.append(int(i))
    except ValueError:
        pass
avg = sum(newlist)/len(newlist)
'' is falsy (it evaluates to False in a boolean context). Since 0 is falsy as well, we have to special-case it to keep it in the result; with a list comprehension:
[x for x in a if x or x == 0]
Or if we strictly want to filter out empty strings:
[x for x in a if x != '']
This may not be the fastest way.
Edit: added some benchmark results comparing with the other solutions (not for the sake of comparing myself to others, but because I was curious too which method was the fastest):
ragsagar: 6.81217217445
pistache: 1.0873541832
cerealy: 1.07090902328
Matt: 1.40736508369
Ashwini Chaudhary: 2.04662489891
Phil H (just the generator): 0.935978889465
Phil H with list(): 3.58926296234
I made the script quickly using timeit(); I used [0,1,2,0,3,4,'',5,8,0,'',4] as the list. I ran multiple tests, and the results did not vary.
NOTE: I'm not trying to put my solution on top by using speed as a criterion. I know the OP didn't specifically ask about speed, but I was curious, and maybe some others are too.
I'm new to Python and have a list of numbers. e.g.
5,10,32,35,64,76,23,53....
and I've grouped them into fours (5,10,32,35 and 64,76,23,53, etc.) using the code from this post.
def group_iter(iterator, n=2, strict=False):
    """ Transforms a sequence of values into a sequence of n-tuples.
    e.g. [1, 2, 3, 4, ...] => [(1, 2), (3, 4), ...] (when n == 2)
    If strict, then it will raise ValueError if there is a group of fewer
    than n items at the end of the sequence. """
    accumulator = []
    for item in iterator:
        accumulator.append(item)
        if len(accumulator) == n:  # tested as fast as separate counter
            yield tuple(accumulator)
            accumulator = []  # tested faster than accumulator[:] = []
            # and tested as fast as re-using one list object
    if strict and len(accumulator) != 0:
        raise ValueError("Leftover values")
How can I access the individual groups so that I can perform functions on them? For example, I'd like to get the average of the first values of every group (e.g. 5 and 64 in my example numbers).
Let's say you have the following tuple of tuples:
a=((5,10,32,35), (64,76,23,53))
To access the first element of each tuple, use a for-loop:
for i in a:
    print i[0]
To calculate average for the first values:
elements=[i[0] for i in a]
avg=sum(elements)/float(len(elements))
Ok, this is yielding a tuple of four numbers each time it's iterated. So, convert the whole thing to a list:
L = list(group_iter(your_list, n=4))
Then you'll have a list of tuples:
>>> L
[(5, 10, 32, 35), (64, 76, 23, 53), ...]
You can get the first item in each tuple this way:
firsts = [tup[0] for tup in L]
(There are other ways, of course.)
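And since the goal was the average of those first values, you can then do, e.g.:
avg = sum(firsts) / float(len(firsts))  # (5 + 64) / 2 = 34.5 for the example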
You've created a tuple of tuples, or a list of tuples, or a list of lists, or a tuple of lists, or whatever...
You can access any element of any nested list directly:
toplist[x][y] # yields the yth element of the xth nested list
You can also access the nested structures by iterating over the top structure:
for sublist in lists:  # avoid naming the loop variable "list": it shadows the builtin
    print sublist[y]
Might be overkill for your application, but you should check out my library, pandas. Stuff like this is pretty simple with the GroupBy functionality:
http://pandas.sourceforge.net/groupby.html
To do the 4-at-a-time thing you would need to compute a bucketing array:
import numpy as np
bucket_size = 4
n = len(your_list)
buckets = np.arange(n) // bucket_size
Then it's as simple as:
data.groupby(buckets).mean()
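A runnable sketch of the whole thing, assuming data is a pandas Series built from your numbers (the variable names here are illustrative):
import numpy as np
import pandas as pd

your_list = [5, 10, 32, 35, 64, 76, 23, 53]
data = pd.Series(your_list)
buckets = np.arange(len(your_list)) // 4  # [0, 0, 0, 0, 1, 1, 1, 1]
print(data.groupby(buckets).mean())       # group 0: 20.5, group 1: 54.0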