I'm new to Python and have a list of numbers. e.g.
5,10,32,35,64,76,23,53....
and I've grouped them into fours (5,10,32,35, 64,76,23,53 etc..) using the code from this post.
def group_iter(iterator, n=2, strict=False):
    """Transforms a sequence of values into a sequence of n-tuples.
    e.g. [1, 2, 3, 4, ...] => [(1, 2), (3, 4), ...] (when n == 2)
    If strict, then it will raise ValueError if there is a group of fewer
    than n items at the end of the sequence."""
    accumulator = []
    for item in iterator:
        accumulator.append(item)
        if len(accumulator) == n:  # tested as fast as separate counter
            yield tuple(accumulator)
            accumulator = []  # tested faster than accumulator[:] = []
                              # and tested as fast as re-using one list object
    if strict and len(accumulator) != 0:
        raise ValueError("Leftover values")
How can I access the individual groups so that I can perform functions on them? For example, I'd like to get the average of the first values of every group (e.g. 5 and 64 in my example numbers).
Let's say you have the following tuple of tuples:
a=((5,10,32,35), (64,76,23,53))
To access the first element of each tuple, use a for-loop:
for i in a:
    print i[0]
To calculate the average of the first values:
elements = [i[0] for i in a]
avg = sum(elements) / float(len(elements))
Ok, this is yielding a tuple of four numbers each time it's iterated. So, convert the whole thing to a list:
L = list(group_iter(your_list, n=4))
Then you'll have a list of tuples:
>>> L
[(5, 10, 32, 35), (64, 76, 23, 53), ...]
You can get the first item in each tuple this way:
firsts = [tup[0] for tup in L]
(There are other ways, of course.)
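Putting it together to get the average the question actually asks about (a minimal sketch, assuming the group_iter function from the question is in scope and your_list holds the example numbers):

your_list = [5, 10, 32, 35, 64, 76, 23, 53]   # example numbers from the question

L = list(group_iter(your_list, n=4))          # [(5, 10, 32, 35), (64, 76, 23, 53)]
firsts = [tup[0] for tup in L]                # [5, 64]
average = sum(firsts) / float(len(firsts))    # float() keeps this exact on Python 2
print(average)                                # 34.5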
You've created a tuple of tuples, or a list of tuples, or a list of lists, or a tuple of lists, or whatever...
You can access any element of any nested list directly:
toplist[x][y] # yields the yth element of the xth nested list
You can also access the nested structures by iterating over the top structure:
for sublist in toplist:
    print sublist[y]
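A small illustrative sketch (the data is just the question's example, nested as a list of lists):

toplist = [[5, 10, 32, 35], [64, 76, 23, 53]]

print(toplist[1][0])      # 64: element 0 of nested list 1

for sublist in toplist:
    print(sublist[0])     # 5, then 64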
Might be overkill for your application but you should check out my library, pandas. Stuff like this is pretty simple with the GroupBy functionality:
http://pandas.sourceforge.net/groupby.html
To do the 4-at-a-time thing you would need to compute a bucketing array:
import numpy as np
bucket_size = 4
n = len(your_list)
buckets = np.arange(n) // bucket_size
Then it's as simple as:
data.groupby(buckets).mean()
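Putting that together, a rough sketch (assuming a reasonably current pandas/numpy, with the numbers held in a Series called data; names other than groupby/mean are just for illustration):

import numpy as np
import pandas as pd

your_list = [5, 10, 32, 35, 64, 76, 23, 53]
data = pd.Series(your_list)

bucket_size = 4
buckets = np.arange(len(your_list)) // bucket_size   # [0, 0, 0, 0, 1, 1, 1, 1]

print(data.groupby(buckets).mean())
# 0    20.5
# 1    54.0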
Related
I need to create a python function that takes a list of numbers (and possibly lists) and returns a list of both the nested level and the sum of that level. For example:
given a list [1,4,[3,[100]],3,2,[1,[101,1000],5],1,[7,9]] I need to count the values of all the integers at level 0 and sum them together, then count the integers at level 1 and sum them together, and so on, until I have found the sum of the deepest nested level.
The return output for the example list mentioned above should be:
[[0,11], [1,25], [2,1201]]
where the first value in each of the lists is the level, and the second value is the sum. I am supposed to use recursion or a while loop, without importing any modules.
My original idea was to create a loop that goes through the lists and finds any integers (ignoring nested lists), calculates the sum, then removes those integers from the list, turns the next highest level into integers, and repeats. However, I could not find a way to convert a list inside of a list into individual integer values (essentially removing the 0th level and turning the 1st level into the new 0th level).
The code that I am working with now is as follows:
def sl(lst, p=0):
    temp = []
    lvl = 0
    while lst:
        if type(lst[0]) == int:
            temp.append(lst[0])
            lst = lst[1:]
            return [lvl, sum(temp)]
        elif type(lst[0]) == list:
            lvl += 1
            return [lvl, sl(lst[1:], p=0)]
Basically, I created a while loop to iterate through, find any integers, and append them to a temp list where I could then find the sum. But I cannot find a way to make the loop access the next level to do the same, especially when the original list is going up and down in levels from left to right.
I would do it like this:
array = [1, 4, [3, [100]], 3, 2, [1, [101, 1000], 5], 1, [7, 9]]
sum_by_level = []
while array:
    numbers = (element for element in array if not isinstance(element, list))
    sum_by_level.append(sum(numbers))
    array = [element for list_element in array if isinstance(list_element, list) for element in list_element]
print(sum_by_level)
print(list(enumerate(sum_by_level)))
Gives the output:
[11, 25, 1201]
[(0, 11), (1, 25), (2, 1201)]
So I sum up the non-list elements and then take the list elements and strip off the outer lists. I repeat this until the array is empty, which means all levels have been stripped off. I decided against saving the level information directly, as it is just the index, but if you need it you can use enumerate for that (which gives tuples instead of lists).
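Since the assignment also allows recursion, here is a minimal recursive sketch of the same idea (sum_by_level and its parameters are names I made up; it returns the [level, sum] pairs the question asks for):

def sum_by_level(lst, level=0, sums=None):
    # Accumulate one running total per nesting level.
    if sums is None:
        sums = []
    if len(sums) <= level:
        sums.append(0)
    for element in lst:
        if isinstance(element, list):
            sum_by_level(element, level + 1, sums)   # recurse one level deeper
        else:
            sums[level] += element
    return [[lvl, total] for lvl, total in enumerate(sums)]

print(sum_by_level([1, 4, [3, [100]], 3, 2, [1, [101, 1000], 5], 1, [7, 9]]))
# [[0, 11], [1, 25], [2, 1201]]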
Given a tuple T that contains all different integers, I want to get all the tuples that result from dropping individual integers from T. I came up with the following code:
def drop(T):
    S = set(T)
    for i in S:
        yield tuple(S.difference({i}))

for t in drop((1, 2, 3)):
    print(t)
# (2, 3)
# (1, 3)
# (1, 2)
I'm not unhappy with this, but I wonder if there is a better/faster way because with large tuples, difference() needs to look for the item in the set, but I already know that I'll be removing items sequentially. However, this code is only 2x faster:
def drop(T):
    for i in range(len(T)):
        yield T[:i] + T[i+1:]
and in any case, neither scales linearly with the size of T.
Instead of looking at it as "remove one item each time", you can look at it as "use all but one", and then with itertools it becomes straightforward:
from itertools import combinations

T = (1, 2, 3, 4)
for t in combinations(T, len(T) - 1):
    print(t)
Which gives:
(1, 2, 3)
(1, 2, 4)
(1, 3, 4)
(2, 3, 4)
* Assuming the order doesn't really matter
From your description, you're looking for combinations of the elements of T. With itertools.combinations, you can ask for all r-length tuples, in sorted order, without repeated elements. For example:

import itertools

T = [1, 2, 3]
for i in itertools.combinations(T, len(T) - 1):
    print(i)
I have a function f([i_0, i_1, ..., i_k-1]) which takes as input a k-dimensional array of integers and returns some object.
Given the range for each index (as a two-dimensional array ranges=[i_0_range, i_1_range, ...]), how do I generate a k-dimensional list / array containing objects evaluated for each value of indices?
If k was fixed, I'd simply do k nested loops. But I would like to have a solution working for any k. How can I do this in Python?
You can use itertools.product to generate all the different combinations of indexes from the ranges. You can then iterate over the tuples produced by that iterator, calling f for each tuple. For example, if f is defined to return a string of the input indexes:
import itertools

def f(indexes):
    return ','.join(map(str, indexes))

ranges = [range(0, 2), range(1, 3), range(2, 4)]
objs = [f(list(t)) for t in itertools.product(*ranges)]
print(objs)
Output:
['0,1,2', '0,1,3', '0,2,2', '0,2,3', '1,1,2', '1,1,3', '1,2,2', '1,2,3']
Note that, depending on your implementation of f, it might not be necessary to convert the tuple produced by itertools.product to a list, and you could just use f(t) instead of f(list(t)).
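If you actually need a nested k-dimensional list rather than a flat one, one possible sketch (evaluate_grid is a hypothetical helper; it regroups the flat result so that result[i_0][i_1]...[i_k-1] lines up with the corresponding index tuple):

import itertools

def evaluate_grid(f, ranges):
    # Evaluate f for every combination of indices, then nest the flat
    # list one level per range (the last range varies fastest in product()).
    flat = [f(list(t)) for t in itertools.product(*ranges)]
    for r in reversed(ranges[1:]):
        size = len(r)
        flat = [flat[i:i + size] for i in range(0, len(flat), size)]
    return flat

grid = evaluate_grid(lambda idx: ','.join(map(str, idx)),
                     [range(0, 2), range(1, 3), range(2, 4)])
print(grid[1][0][1])   # '1,1,3'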
I was recently trying to solve a task in Python, and I found a solution that seems to have a complexity of O(n log n), but I believe it is very inefficient for some inputs (such as the first parameter being 0 and values being a very long list of zeros).
It also has three levels of for loops. I believe it can be optimized, but at the moment I cannot optimize it further; I am probably just missing something obvious ;)
So, basically, the problem is as follows:
Given a list of integers (values), the function needs to return the number of index pairs that meet the following criterion:
let's assume a single index pair is a tuple like (index1, index2);
then values[index1] == complementary_diff - values[index2] must be true.
Example:
If given a list like [1, 3, -4, 0, -3, 5] as values and 1 as complementary_diff, the function should return 4 (which is the length of the following list of indexes' pairs: [(0, 3), (2, 5), (3, 0), (5, 2)]).
This is what I have so far. It should work fine most of the time, but, as I said, in some cases it can run very slowly despite the estimated O(n log n) complexity (the worst case looks like O(n^2)).
def complementary_pairs_number(complementary_diff, values):
    value_key = {}  # dictionary storing indexes indexed by values
    for index, item in enumerate(values):
        try:
            value_key[item].append(index)
        except (KeyError,):  # the item has not been found in value_key's keys
            value_key[item] = [index]
    key_pairs = set()  # key pairs are unique by nature
    for pos_value in value_key:  # iterate through keys of value_key dictionary
        sym_value = complementary_diff - pos_value
        if sym_value in value_key:  # checks if the symmetric value has been found
            for i1 in value_key[pos_value]:  # iterate through pos_values' indexes
                for i2 in value_key[sym_value]:  # as above, through sym_values
                    # add indexes' pairs or ignore if already added to the set
                    key_pairs.add((i1, i2))
                    key_pairs.add((i2, i1))
    return len(key_pairs)
For the given example it behaves like that:
>>> complementary_pairs_number(1, [1, 3, -4, 0, -3, 5])
4
If you see how the code could be "flattened" or "simplified", please let me know.
I am not sure if just checking for complementary_diff == 0 etc. is the best approach - if you think it is, please let me know.
EDIT: I have corrected the example (thanks, unutbu!).
I think this improves the complexity to O(n):

1) value_key.setdefault(item, []).append(index) is faster than using the try..except blocks. It is also faster than using a collections.defaultdict(list). (I tested this with ipython %timeit.)

2) The original code visits every solution twice. For each pos_value in value_key, there is a unique sym_value associated with pos_value. There are solutions when sym_value is also in value_key. But when we iterate over the keys in value_key, pos_value is eventually assigned the value of sym_value, which makes the code repeat a calculation it has already done. So you can cut the work in half if you can stop pos_value from equaling the old sym_value. I implemented that with a seen = set() to keep track of seen sym_values.

3) The code only cares about len(key_pairs), not the key_pairs themselves. So instead of keeping track of the pairs (with a set), we can simply keep track of the count (with num_pairs), replacing the two inner for-loops with
num_pairs += 2 * len(value_key[pos_value]) * len(value_key[sym_value])
or half that in the "unique diagonal" case, pos_value == sym_value.
def complementary_pairs_number(complementary_diff, values):
    value_key = {}  # dictionary storing indexes indexed by values
    for index, item in enumerate(values):
        value_key.setdefault(item, []).append(index)
    # print(value_key)
    num_pairs = 0
    seen = set()
    for pos_value in value_key:
        if pos_value in seen:
            continue
        sym_value = complementary_diff - pos_value
        seen.add(sym_value)
        if sym_value in value_key:
            # print(pos_value, sym_value, value_key[pos_value], value_key[sym_value])
            n = len(value_key[pos_value]) * len(value_key[sym_value])
            if pos_value == sym_value:
                num_pairs += n
            else:
                num_pairs += 2 * n
    return num_pairs
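A quick sanity check against the examples above (assuming the function as defined here):

print(complementary_pairs_number(1, [1, 3, -4, 0, -3, 5]))   # 4
print(complementary_pairs_number(1, [1, 0, 1]))              # 4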
You may want to look into functional programming idioms, such as reduce, etc.
Oftentimes, nested array logic can be simplified by using functions like reduce, map, reject, etc.
For an example (in JavaScript) check out Underscore.js. I'm not terribly smart at Python, so I don't know which libraries it has available.
I think (some or all of) these would help, but I'm not sure how I would prove it yet.
1) Take values and reduce it to a distinct set of values, recording the count of each element (O(n))
2) Sort the resulting array. (n log n)
3) If you can allocate lots of memory, I guess you might be able to populate a sparse array with the values - so if the range of values is -100 : +100, allocate an array of [201] and any value that exists in the reduced set pops a one at the value index in the large sparse array.
4) Any value that you want to check if it meets your condition now has to look at the index in the sparse array according to the x - y relationship and see if a value exists there.
5) as unutbu pointed out, it's trivially symmetric, so if {a,b} is a pair, so is {b,a}.
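A rough sketch of how steps 1, 4 and 5 could look in code (count_complementary_pairs is a name I made up; a Counter stands in for both the distinct-value set with counts and the sparse lookup, so the sort in step 2 isn't needed here):

from collections import Counter

def count_complementary_pairs(complementary_diff, values):
    counts = Counter(values)            # step 1: distinct values + multiplicities, O(n)
    pairs = 0
    for x, cx in counts.items():        # step 4: look up the complementary value
        y = complementary_diff - x
        if y in counts:
            # cx * counts[y] ordered pairs; the symmetry from step 5 is
            # covered because (x, y) and (y, x) are each visited once.
            pairs += cx * counts[y]
    return pairs

print(count_complementary_pairs(1, [1, 3, -4, 0, -3, 5]))   # 4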
I think you can improve this by separating out the algebra part from the search and using smarter data structures.
Go through the list and, for each item, subtract it from the complementary diff:
resultlist[index] = complementary_diff - originallist[index]
You can use either a map or a simple loop. -> Takes O(n) time.
See if the number in the resulting list exists in the original list.
Here, with a naive list, you would actually get O(n^2), because you can end up searching through the whole original list for each item in the resulting list.
However, there are smarter ways to organize your data than this. If you have the original list sorted, your search time reduces to O(n log n + n log n) = O(n log n): n log n for the sort, and log n per element (n log n in total) for the binary searches.
If you wanted to be even smarter, you can make your list into a dictionary (or hash table) and then this step becomes O(n + n) = O(n): n to build the dictionary and 1 * n to search each element in the dictionary. (EDIT: since you cannot assume uniqueness of each value in the original list, you might want to keep count of how many times each value appears in the original list.)
So with this now you get O(n) total runtime.
Using your example:
1, [1, 3, -4, 0, -3, 5],
Generate the result list:
>>> resultlist
[0, -2, 5, 1, 4, -4].
Now we search:
Flatten out the original list into a dictionary. I chose to use the original list's index as the value, as that seems like the side data you're interested in (each index is stored in a list because values may repeat).
>>> original_table
{1: [0], 3: [1], -4: [2], 0: [3], -3: [4], 5: [5]}
For each element in the result list, search in the hash table and make the tuple:
(resultlist_index, original_table[resultlist[resultlist_index]])
This should look like the example solution you had.
Now you just find the length of the resulting list of tuples.
Now here's the code:
example_diff = 1
example_values = [1, 3, -4, 0, -3, 5]
example2_diff = 1
example2_values = [1, 0, 1]

def complementary_pairs_number(complementary_diff, values):
    """
    Given an integer complement and a list of values, count how many
    complementary pairs there are in the list.
    """
    print "Input:", complementary_diff, values
    # Step 1. Result list
    resultlist = [complementary_diff - value for value in values]
    print "Result List:", resultlist
    # Step 2. Flatten into dictionary
    original_table = {}
    for original_index in xrange(len(values)):
        if values[original_index] in original_table:
            original_table[values[original_index]].append(original_index)
        else:
            original_table[values[original_index]] = [original_index]
    print "Flattened dictionary:", original_table
    # Step 2.5 Search through dictionary and count up the resulting pairs.
    pair_count = 0
    for resultlist_index in xrange(len(resultlist)):
        if resultlist[resultlist_index] in original_table:
            pair_count += len(original_table[resultlist[resultlist_index]])
    print "Complementary Pair Count:", pair_count
    # (Optional) Step 2.5 Search through dictionary and create complementary pairs. Adds O(n^2) complexity.
    pairs = []
    for resultlist_index in xrange(len(resultlist)):
        if resultlist[resultlist_index] in original_table:
            pairs += [(resultlist_index, original_index) for original_index in
                      original_table[resultlist[resultlist_index]]]
    print "Complementary Pair Indices:", pairs
    # Step 3
    return pair_count

if __name__ == "__main__":
    complementary_pairs_number(example_diff, example_values)
    complementary_pairs_number(example2_diff, example2_values)
Output:
$ python complementary.py
Input: 1 [1, 3, -4, 0, -3, 5]
Result List: [0, -2, 5, 1, 4, -4]
Flattened dictionary: {0: [3], 1: [0], 3: [1], 5: [5], -4: [2], -3: [4]}
Complementary Pair Count: 4
Complementary Pair Indices: [(0, 3), (2, 5), (3, 0), (5, 2)]
Input: 1 [1, 0, 1]
Result List: [0, 1, 0]
Flattened dictionary: {0: [1], 1: [0, 2]}
Complementary Pair Count: 4
Complementary Pair Indices: [(0, 1), (1, 0), (1, 2), (2, 1)]
Thanks!
Modified the solution provided by @unutbu:
The problem can be reduced to comparing these 2 dictionaries:
1) value_key: indexes keyed by the values themselves
2) answer_key: indexes keyed by the pre-computed values (complementary_diff - values[i])
def complementary_pairs_number(complementary_diff, values):
    value_key = {}  # dictionary storing indexes indexed by values
    for index, item in enumerate(values):
        value_key.setdefault(item, []).append(index)
    answer_key = {}  # dictionary storing indexes indexed by (complementary_diff - values)
    for index, item in enumerate(values):
        answer_key.setdefault((complementary_diff - item), []).append(index)
    num_pairs = 0
    print(value_key)
    print(answer_key)
    for pos_value in value_key:
        if pos_value in answer_key:
            num_pairs += len(value_key[pos_value]) * len(answer_key[pos_value])
    return num_pairs
I have a dataset of events (tweets to be specific) that I am trying to bin / discretize. The following code seems to work fine so far (assuming 100 bins):
import datetime
from datetime import timedelta

HOUR = timedelta(hours=1)
start = datetime.datetime(2009, 1, 1)
z = [start + x * HOUR for x in xrange(1, 100)]
But then I came across this fateful line in the Python docs: 'This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n)'. The zip idiom does indeed work, but I can't understand how (what is the * operator doing, for instance?). How could I use it to make my code prettier? I'm guessing this means I should make a generator/iterable for time that yields the time in graduations of an HOUR?
I will try to explain zip(*[iter(s)]*n) in terms of a simpler example:
imagine you have the list s = [1, 2, 3, 4, 5, 6]
iter(s) gives you a listiterator object that will yield the next number from s each time you ask for an element.
[iter(s)] * n gives you the list with iter(s) in it n times e.g. [iter(s)] * 2 = [<listiterator object>, <listiterator object>] - the key here is that these are 2 references to the same iterator object, not 2 distinct iterator objects.
zip takes a number of sequences and returns a list of tuples where each tuple contains the ith element from each of the sequences. e.g. zip([1,2], [3,4], [5,6]) = [(1, 3, 5), (2, 4, 6)] where (1, 3, 5) are the first elements from the parameters passed to zip and (2, 4, 6) are the second elements from the parameters passed to zip.
The * in front of [iter(s)]*n converts the list into multiple parameters being passed to zip, so if n is 2 we get zip(<listiterator object>, <listiterator object>).
zip will request the next element from each of its parameters but because these are both references to the same iterator this will result in (1, 2), it does the same again resulting in (3, 4) and again resulting in (5, 6) and then there are no more elements so it stops. Hence the result [(1, 2), (3, 4), (5, 6)]. This is the clustering a data series into n-length groups as mentioned.
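A minimal sketch of the idiom in action (list() around zip just makes the result visible on Python 3 as well):

s = [1, 2, 3, 4, 5, 6]
n = 2

# [iter(s)] * n holds n references to the *same* iterator,
# so zip pulls n consecutive items for each output tuple.
print(list(zip(*[iter(s)] * n)))   # [(1, 2), (3, 4), (5, 6)]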
The expression from the docs looks like this:
zip(*[iter(s)]*n)
This is equivalent to:
it = iter(s)
zip(*[it, it, ..., it]) # n times
The [...]*n repeats the list n times, and this results in a list that contains n references to the same iterator.
This, in turn, is equivalent to:
it = iter(s)
zip(it, it, ..., it) # turning a list into positional parameters
The * before the list turns the list elements into positional parameters of the function call.
Now, when zip is called, it starts from left to right to call the iterators to obtain elements that should be grouped together. Since all parameters refer to the same iterator, this yields the first n elements of the initial sequence. Then that process continues for the second group in the resulting list, and so on.
The result is the same as if you had constructed the list like this (evaluated from left to right):
it = iter(s)
[(it.next(), it.next(), ..., it.next()), (it.next(), it.next(), ..., it.next()), ...]
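To see that both formulations agree, a small sketch (using next(it), the Python 3 spelling of it.next()):

s = [1, 2, 3, 4, 5, 6]
n = 3

# The documented idiom:
print(list(zip(*[iter(s)] * n)))                 # [(1, 2, 3), (4, 5, 6)]

# The hand-rolled equivalent, built left to right:
it = iter(s)
print([tuple(next(it) for _ in range(n))
       for _ in range(len(s) // n)])             # [(1, 2, 3), (4, 5, 6)]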