Sorting Data Into Clumps With Python


I'd like to clump a list of data based on a list of ranges, with the idea of making a histogram from the end result. I know about collections.Counter but have not seen anyone use it, or another built-in, to generate clumps. I have written out the long form below but am hoping someone can offer something more efficient.
def min_to_sec(val):
    ret_val = 60 * int(val)
    return ret_val
def hr_to_sec(val):
    ret_val = 3600 * int(val)
    return ret_val
def histogram(y_lst):
    x_lst = [ 10,
              20,
              30,
              40,
              50,
              60,
              90,
              min_to_sec(2),
              min_to_sec(3),
              min_to_sec(4),
              min_to_sec(5),
              min_to_sec(10),
              min_to_sec(15),
              min_to_sec(20),
            ]
    results = {}
    for y_val in y_lst:
        for x_val in x_lst:
            if y_val < x_val:
                results[ str(x_val) ] = results.get( str(x_val), 0) + 1
                break
        else:
            results['greater'] = results.get('greater', 0) + 1
    return results
Updated to include an example of the desired output:
So if my x_lst and y_lst are:
x_lst = [10,20,30,40]
y_lst = [1,2,3,15,22,27,40]
I'd like a return value similar to a Counter:
{
    10: 3,
    20: 1,
    30: 2,
}
So while my code above works, its nested for loop makes it quite slow, and I'm hoping there's a way to use something like collections.Counter to do this 'clumping' operation.

You could use collections.Counter to do this kind of counting of elements in a list:
In [1]: from collections import Counter
In [2]: Counter([1, 2, 10, 1, 2, 100])
Out[2]: Counter({1: 2, 2: 2, 100: 1, 10: 1})
You can increment a Counter more simply using:
results['foo'] += 1
In order to count only those before the inequality, you could use itertools.takewhile:
In [3]: from itertools import takewhile
In [4]: Counter(takewhile(lambda x: x < 10, [1, 2, 10, 1, 2, 100]))
Out[4]: Counter({1: 1, 2: 1})
However this won't keep track of those which have broken out of the takewhile.
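One way to avoid the nested loop, while still counting the values that fall past the last boundary, is to binary-search the sorted boundaries with the standard-library bisect module and tally the hits in a Counter. This is only a sketch: the function name clump is made up for illustration, it assumes the boundaries can be sorted, and it uses the integer boundaries themselves as keys rather than the strings used in the question.

```python
from bisect import bisect_right
from collections import Counter

def clump(y_lst, x_lst):
    """Count how many values fall below each boundary (hypothetical helper)."""
    edges = sorted(x_lst)
    counts = Counter()
    for y in y_lst:
        # Index of the first boundary strictly greater than y,
        # which mirrors the `y_val < x_val` test in the question.
        i = bisect_right(edges, y)
        key = edges[i] if i < len(edges) else 'greater'
        counts[key] += 1
    return counts

print(clump([1, 2, 3, 15, 22, 27, 40], [10, 20, 30, 40]))
```

Each lookup is a binary search, so the work grows with len(y_lst) * log(len(x_lst)) instead of len(y_lst) * len(x_lst). The value 40 is not below any boundary, so it lands under 'greater'.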

Have you considered using pandas? You could put y_lst into a DataFrame and pretty easily make a histogram.
Assuming you have matplotlib and pylab imported...
import pandas as pd
data = pd.DataFrame([1, 2, 3, 15, 22, 27, 40])
data[0].hist(bins = 4)
That would give you the histogram you describe above. However, once the data is in a pandas DataFrame it's not too challenging to slice it up however you'd like.

Related

How would I fix this function?

Hey this is my first question so I hope I'm doing it right.
I'm trying to write a function that given a list of integers and N as the maximum occurrence, would then return a list with any integer above the maximum occurrence deleted. For example if I input:
[20,37,20,21] #list of integers and 1 #maximum occurrence.
Then as output I would get:
[20,37,21] because the number 20 appears twice and the maximum occurrence is 1, so it is deleted from the list. Here's another example:
Input: [1,1,3,3,7,2,2,2,2], 3
Output: [1,1,3,3,7,2,2,2]
Here's what I wrote so far, how would I be able to optimize it? I keep on getting a timeout error. Thank you very much in advance.
def delete_nth(order,n):
    order = Counter(order)
    for i in order:
        if order[i] > n:
            while order[i] > n:
                order[i] - 1
    return order
print(delete_nth([20,37,20,21], 1))

You can remove building the Counter at the beginning - and just have temporary dictionary as counter:
def delete_nth(order,n):
    out, counter = [], {}
    for v in order:
        counter.setdefault(v, 0)
        if counter[v] < n:
            out.append(v)
        counter[v] += 1
    return out
print(delete_nth([20,37,20,21], 1))
Prints:
[20, 37, 21]

You wrote:
while order[i] > n:
    order[i] - 1
That second line should presumably be order[i] -= 1, or any code that enters the loop will never leave it.

You could use a predicate whose default argument is a collections.defaultdict to retain state while your list of numbers is being filtered.
def delete_nth(numbers, n):
    from collections import defaultdict
    # predicate is redefined on every call, so `seen` starts empty each time
    def predicate(number, seen=defaultdict(int)):
        seen[number] += 1
        return seen[number] <= n
    return list(filter(predicate, numbers))
print(delete_nth([1, 1, 3, 3, 7, 2, 2, 2, 2], 3))
Output:
[1, 1, 3, 3, 7, 2, 2, 2]

I've renamed variables to something that had more meaning for me:
This version, though very short and fairly efficient, will output identical values adjacently:
from collections import Counter
def delete_nth(order, n):
    counters = Counter(order)
    output = []
    for value in counters:
        cnt = min(counters[value], n)
        output.extend([value] * cnt)
    return output
print(delete_nth([1,1,2,3,3,2,7,2,2,2,2], 3))
print(delete_nth([20,37,20,21], 1))
Prints:
[1, 1, 2, 2, 2, 3, 3, 7]
[20, 37, 21]
This version will maintain original order, but run a bit more slowly:
from collections import Counter
def delete_nth(order, n):
    counters = Counter(order)
    for value in counters:
        counters[value] = min(counters[value], n)
    output = []
    for value in order:
        if counters[value]:
            output.append(value)
            counters[value] -= 1
    return output
print(delete_nth([1,1,2,3,3,2,7,2,2,2,2], 3))
print(delete_nth([20,37,20,21], 1))
Prints:
[1, 1, 2, 3, 3, 2, 7, 2]
[20, 37, 21]

Building up a counting function

I need to build up a counting function starting from a dictionary. The dictionary is a classical Bag_of_Words and looks as follows:
D={'the':5, 'pow':2, 'poo':2, 'row':2, 'bub':1, 'bob':1}
I need the function that for a given integer returns the number of words with at least that number of occurrences. In the example F(2)=4, all words but 'bub' and 'bob'.
First of all I build up the inverse dictionary of D:
ID={5:1, 2:3, 1:2}
I think I'm fine with that. Then here is the code:
values=list(ID.keys())
values.sort(reverse=True)
Lk=[]
Nw=0
for val in values:
    Nw=Nw+ID[val]
    Lk.append([Nw, val])
The code works fine but I do not like it. The point is that I would prefer to use a list comprehension to build up Lk; also, I really hate the Nw variable I have used. It does not seem Pythonic at all.

You can create a sorted array of your word counts and then find the insertion point with np.searchsorted to get how many items lie on either side of it. np.searchsorted is very efficient and fast. If your dictionary doesn't change often, this call is basically free compared to other methods.
import numpy as np
def F(n, D):
    # creating the array on every call is slow; if D doesn't change,
    # build and sort it once outside the function
    arr = np.array(list(D.values()))  # list() is needed on Python 3
    arr.sort()
    L = len(arr)
    return L - np.searchsorted(arr, n) # this line does all the work...
What's going on:
First we take just the word counts (and convert them to a sorted array):
D = {"I'm": 12, "pretty": 3, "sure":12, "the": 45, "Donald": 12, "is": 3, "on": 90, "crack": 11}
vals = np.array(list(D.values()))
# vals = array([12, 3, 12, 45, 12, 3, 90, 11])  (insertion order on Python 3.7+)
vals.sort()
# vals = array([ 3, 3, 11, 12, 12, 12, 45, 90])
Then, if we want to know how many values are greater than or equal to n, we simply find the length of the array from the first number greater than or equal to n onward. We do this by finding the leftmost index where n could be inserted while keeping the array sorted (np.searchsorted uses a binary search for this) and subtracting that index from the total number of elements (len).
# how many are >= 10?
# insertion point for value of 10..
#
#               | index: 2
#               v
# array([ 3, 3, 11, 12, 12, 12, 45, 90])
# find how many elements there are
# len(arr) = 8
# subtract: 8 - 2 = 6 elements that are >= 10
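The same "length minus insertion point" idea can be sketched without numpy, using bisect_left from the standard library on a sorted list. The helper name count_at_least is an assumption made for this illustration; it is not part of the answer above.

```python
from bisect import bisect_left

def count_at_least(n, sorted_counts):
    """Number of entries >= n in an already sorted list (illustrative helper)."""
    # bisect_left returns the leftmost index at which n could be inserted,
    # i.e. the number of entries that are strictly smaller than n.
    return len(sorted_counts) - bisect_left(sorted_counts, n)

counts = sorted([90, 12, 12, 3, 11, 12, 45, 3])
print(count_at_least(10, counts))  # 6
print(count_at_least(12, counts))  # 5
```

As with the numpy version, the sort is the expensive part, so it pays to do it once and reuse the sorted list for many queries.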

A fun little trick for counting things: True has a numerical value of 1 and False has a numerical value of 0. So we can do things like
sum(v >= k for v in D.values())
where k is the value you're comparing against.
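Applied to the dictionary from the question, that one-liner can be wrapped as follows. This is a small sketch; the two-argument signature of F is an assumption, since the question only writes F(2).

```python
D = {'the': 5, 'pow': 2, 'poo': 2, 'row': 2, 'bub': 1, 'bob': 1}

def F(k, counts):
    # Each comparison yields True (1) or False (0), so sum() counts the matches.
    return sum(v >= k for v in counts.values())

print(F(2, D))  # 4: every word except 'bub' and 'bob'
```

This scans every value on each call, which is fine for small dictionaries and needs no preprocessing.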

collections.Counter() is an ideal choice for this. Use it on the dict.values() list. Also, unlike numpy, it does not need to be installed separately. Example:
>>> from collections import Counter
>>> D = {'the': 5, 'pow': 2, 'poo': 2, 'row': 2, 'bub': 1, 'bob': 1}
>>> c = Counter(D.values())
>>> c
Counter({2: 3, 1: 2, 5: 1})
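Starting from that Counter, the running total Nw from the question can be produced by itertools.accumulate, which leaves Lk as a plain list comprehension. This is one possible sketch, not the only way to drop the explicit accumulator variable.

```python
from collections import Counter
from itertools import accumulate

D = {'the': 5, 'pow': 2, 'poo': 2, 'row': 2, 'bub': 1, 'bob': 1}

ID = Counter(D.values())                      # occurrences -> number of words
values = sorted(ID, reverse=True)             # [5, 2, 1]
totals = accumulate(ID[v] for v in values)    # running sums: 1, 4, 6
Lk = [[nw, val] for nw, val in zip(totals, values)]

print(Lk)  # [[1, 5], [4, 2], [6, 1]]
```

Each pair reads as "nw words occur at least val times", so the entry [4, 2] matches F(2)=4 from the question.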

Method to get the max distance (step) between values in python?

Given a list of integers, does a built-in method exist to find the max distance between values?
So if I have this array
[1, 3, 5, 9, 15, 30]
the max step between the values is 15. Does the list object have a method to do that?

No, list objects have no standard "adjacent differences" method or the like. However, using the pairwise function mentioned in the itertools recipes:
from itertools import tee
def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)  # on Python 2 this was itertools.izip
...you can (concisely and efficiently) define
>>> max(b-a for (a,b) in pairwise([1, 3, 5, 9, 15, 30]))
15

No, but it's trivial to code:
def max_dist(data):  # wrapper name is illustrative; it makes the return valid
    last = data[0]
    dist = 0
    for i in data[1:]:
        dist = max(dist, i-last)
        last = i
    return dist

You can do:
>>> s = [1, 3, 5, 9, 15, 30]
>>> max(x[0] - x[1] for x in zip(s[1:], s))
15
This uses max and zip. It computes the difference between all consecutive elements and returns the max of those.

l=[1, 3, 5, 9, 15, 30]
max([j-i for i, j in zip(l[:-1], l[1:])])
That is using pure python and gives you the desired output "15".
If you like to work with "numpy" you could do:
import numpy as np
max(np.diff(l))

The list object does not. However, it is pretty quick to write a function that does that:
def max_step(my_list):
    max_step = 0
    for ind in range(len(my_list)-1):
        step = my_list[ind+1] - my_list[ind]
        if step > max_step:
            max_step = step
    return max_step
>>> max_step([1, 3, 5, 9, 15, 30])
15
Or if you prefer even shorter:
max_step = lambda l: max([l[i+1] - l[i] for i in range(len(l)-1)])
>>> max_step([1, 3, 5, 9, 15, 30])
15

It is possible to use the reduce() function, but it is not that elegant as you need some way to keep track of the previous value:
from functools import reduce  # reduce is no longer a built-in on Python 3
def step(maxStep, cur):
    if isinstance(maxStep, int):
        maxStep = (abs(maxStep-cur), cur)
    return (max(maxStep[0], abs(maxStep[1]-cur)), cur)
l = [1, 3, 5, 9, 15, 30]
print(reduce(step, l)[0])
The solution works by returning the previous value and the accumulated max calculation as a tuple for each iteration.
Also what is the expected outcome for [10,20,30,5]? Is it 10 or 25? If 25 then you need to add abs() to your calculation.
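To make the difference concrete, here is a small sketch that computes both readings of "max step". The function name and its absolute flag are assumptions introduced only for this illustration.

```python
def max_step(values, absolute=False):
    """Largest difference between consecutive items (illustrative helper)."""
    # Pair each item with its successor and take the signed difference.
    steps = (b - a for a, b in zip(values, values[1:]))
    if absolute:
        # Treat a drop of 25 the same as a rise of 25.
        steps = map(abs, steps)
    return max(steps)

print(max_step([1, 3, 5, 9, 15, 30]))             # 15
print(max_step([10, 20, 30, 5]))                  # 10 (largest increase)
print(max_step([10, 20, 30, 5], absolute=True))   # 25 (largest jump either way)
```

Like the other answers, this raises ValueError for lists with fewer than two items, because max() receives no steps to compare.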

Pythonic way to manipulate same dictionary

A very naive question.. I have the following function:
def vectorize(pos, neg):
    vec = {item_id:1 for item_id in pos}
    for item_id in neg:
        vec[item_id] = 0
    return vec
Example:
>>> print(vectorize([1, 2], [3, 200, 201, 202]))
{1: 1, 2: 1, 3: 0, 200: 0, 201: 0, 202: 0}
I feel this is too verbose for Python. Is there a more Pythonic way to do this?
Basically, I am returning a dictionary whose values are 1 if the key is in pos (a list) and 0 otherwise.

I'm not particularly sure if this is more pythonic... Maybe a little bit more efficient? Dunno, really
pos = [1, 2, 3, 4]
neg = [5, 6, 7, 8]
def vectorize(pos, neg):
    vec = dict.fromkeys(pos, 1)
    vec.update(dict.fromkeys(neg, 0))
    return vec
print(vectorize(pos, neg))
Outputs:
{1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0}
But I like your way too... Just giving an idea here.

I'd probably just do:
def vectorize(pos, neg):
    vec = {}
    vec.update((item, 1) for item in pos)
    vec.update((item, 0) for item in neg)
    return vec
But your code is fine as well.

You could use
vec = {item_id : 0 if item_id in neg else 1 for item_id in pos}
Note however that the lookup item_id in neg won't be efficient if neg is a list (as opposed to a set).
Update: After seeing your expected output.
Note that the above does not insert 0s for items that are only in neg. If you want that too, the following one-liner could be used.
vec = dict([(item_id, 1) for item_id in pos] + [(item_id, 0) for item_id in neg])
If you want to avoid creating the two temporary lists, itertools.chain could help.
from itertools import chain
vec = dict(chain(((item_id, 1) for item_id in pos), ((item_id, 0) for item_id in neg)))

This would be Pythonic, in the sense of being relatively short and making maximum use of the language's features:
def vectorize(pos, neg):
    pos_set = set(pos)
    return {item_id: int(item_id in pos_set) for item_id in set(pos+neg)}
print(vectorize([1, 2], [3, 200, 201, 202]))
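On Python 3.5 and later, another compact option is to merge two dict.fromkeys results with dictionary unpacking. This is a sketch rather than a recommendation. Note that the answers here disagree when an id appears in both lists: in this version the later mapping wins, so such an id ends up as 0, as in the question's loop.

```python
def vectorize(pos, neg):
    # Later keys overwrite earlier ones, so an id present in both
    # pos and neg is mapped to 0, as in the loop from the question.
    return {**dict.fromkeys(pos, 1), **dict.fromkeys(neg, 0)}

print(vectorize([1, 2], [3, 200, 201, 202]))
# {1: 1, 2: 1, 3: 0, 200: 0, 201: 0, 202: 0}
```

Unlike the set-based comprehension, this keeps the keys in the order they appear in pos and then neg.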

Algorithm to offset a list of data

Given a list of data as follows:
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
I would like to create an algorithm that offsets the list by a certain number of steps. For example, if the offset = -1:
def offsetFunc(inputList, offsetList):
    #make something
    return output
where:
output = [0,0,0,0,1,1,5,5,5,5,5,5,3,3,3,2,2]
Important Note: The elements of the list are float numbers and they are not in any progression. So I actually need to shift them; I cannot use any work-around to get the result.
So basically, the algorithm should replace the first set of values (the four "1"s) with 0 and then it should:
Detect the length of the next range of values
Create a parallel output vector with the values delayed by one set
The way I have roughly described the algorithm above is how I would do it. However, I'm a newbie to Python (and a beginner in programming in general), and I have gradually found out that Python has a lot of built-in functions that could make the algorithm lighter and less iterative. Does anyone have any suggestions for a better way to write a script for this kind of job? This is the code I have written so far (assuming a static offset of -1):
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
output = []
PrevVal = 0
NextVal = input[0]
i = 0
while input[i] == NextVal:
    output.append(PrevVal)
    i += 1
while i < len(input):
    PrevVal = NextVal
    NextVal = input[i]
    while input[i] == NextVal:
        output.append(PrevVal)
        i += 1
        if i >= len(input):
            break
print(output)
Thanks in advance for any help!
BETTER DESCRIPTION
My list will always be composed of "sets" of values. They are usually float numbers, and they take values such as in this short example:
Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
In this example, the first set (the one with value "1.236") has length 4 while the second one has length 6. What I would like to get as output, when the offset = -1, is:
The value "0.000" in the first 4 elements;
The value "1.236" in the next 6 elements.
So basically, this "offset" function creates a list with the same "structure" (ranges of lengths) but with the values delayed by "offset" sets.
I hope it's clear now; unfortunately the problem itself is still a bit silly to me (plus I don't even speak good English :) )
Please don't hesitate to ask for any additional info to complete the question and make it clearer.

How about this:
def generateOutput(input, value=0, offset=-1):
    values = []
    for i in range(len(input)):
        if i < 1 or input[i] == input[i-1]:
            yield value
        else: # value change in input detected
            values.append(input[i-1])
            if len(values) >= -offset:
                value = values.pop(0)
            yield value
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
print(list(generateOutput(input)))
It will print this:
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
And in case you just want to iterate, you do not even need to build the list. Just use for i in generateOutput(input): … then.
For other offsets, use this:
print(list(generateOutput(input, 0, -2)))
prints:
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 5, 5, 5, 3, 3]

Using a deque as the queue, with maxlen defining the shift length, and holding only unique values: pushing in new values at the end pushes old values out at the start of the queue once the shift length has been reached.
from collections import deque
def shift(it, shift=1):
    q = deque(maxlen=shift+1)
    q.append(0)
    for i in it:
        if q[-1] != i:
            q.append(i)
        yield q[0]
Sample = [1.236,1.236,1.236,1.236,1.863,1.863,1.863,1.863,1.863,1.863]
print(list(shift(Sample)))
#[0, 0, 0, 0, 1.236, 1.236, 1.236, 1.236, 1.236, 1.236]

My try:
#Input
input = [1,1,1,1,5,5,3,3,3,3,3,3,2,2,2,5,5]
shift = -1
#Build service structures: for each 'set of data' store its length and its value
set_lengths = []
set_values = []
prev_value = None
set_length = 0
for value in input:
    if prev_value is not None and value != prev_value:
        set_lengths.append(set_length)
        set_values.append(prev_value)
        set_length = 0
    set_length += 1
    prev_value = value
else:
    # the loop has finished: store the last 'set of data' as well
    set_lengths.append(set_length)
    set_values.append(prev_value)
#Output the result, shifting the values
output = []
for i, l in enumerate(set_lengths):
    j = i + shift
    if j < 0:
        output += [0] * l
    else:
        output += [set_values[j]] * l
print(input)
print(output)
gives:
[1, 1, 1, 1, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 5, 5]
[0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]

def x(list, offset):
    return [el + offset for el in list]

A completely different approach than my first answer is this:
import itertools
First analyze the input:
values, amounts = zip(*((n, len(list(g))) for n, g in itertools.groupby(input)))
We now have (1, 5, 3, 2, 5) and (4, 2, 6, 3, 2). Now apply the offset:
values = (0,) * (-offset) + values # nevermind that it is longer now.
And synthesize it again:
output = sum([ [v] * a for v, a in zip(values, amounts) ], [])
This is way more elegant, way less understandable and probably way more expensive than my other answer, but I didn't want to hide it from you.
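The three steps above can be collected into one function. The following is a sketch under a few assumptions: the name offset_runs and the fill parameter are invented for this illustration, only zero or negative offsets are handled (as in the question), and list.extend replaces the sum(..., []) idiom to avoid repeatedly copying lists.

```python
from itertools import groupby

def offset_runs(data, offset=-1, fill=0):
    """Shift run values by -offset runs, keeping run lengths (illustrative)."""
    # Analyze: collapse the input into (value, run length) pairs.
    runs = [(value, len(list(group))) for value, group in groupby(data)]
    values = [value for value, _ in runs]
    lengths = [length for _, length in runs]
    # Offset: prepend filler values; zip() below drops the surplus at the end.
    shifted = [fill] * (-offset) + values
    # Synthesize: expand each shifted value to the original run length.
    output = []
    for value, length in zip(shifted, lengths):
        output.extend([value] * length)
    return output

data = [1, 1, 1, 1, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 5, 5]
print(offset_runs(data))
# [0, 0, 0, 0, 1, 1, 5, 5, 5, 5, 5, 5, 3, 3, 3, 2, 2]
```

groupby compares neighbours with ==, so float inputs such as 1.236 form a run only when the repeated values are exactly equal.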
