Group data by a tolerance

I have an ordered list:
L = [301.148986835,
301.148986835,
301.148986835,
301.161562835,
301.161562835,
301.16156333500004,
301.167179835,
301.167179835,
301.167179835,
301.167179835,
301.167179835,
301.179755835,
301.179755835,
301.179755835,
301.646611835,
301.659187335,
301.659187335,
301.659187335,
301.659187335,
302.138619335,
302.142316335,
302.151194835,
302.1568118349999,
302.15681183500004,
302.15681183500004,
302.15681183500004,
302.156812335,
302.156812335,
302.156812335,
302.169387835,
302.169387835,
302.169387835,
302.169387835,
302.169387835,
302.169388335,
302.636243335,
302.636243835,
302.648819835,
302.648819835,
303.137565335,
303.140827335,
303.140827335,
303.146443835,
303.146443835,
303.146444335,
303.159019835,
303.159019835,
303.15901983500004,
303.159020335,
303.159020335,
303.15902033500004,
303.63283533500004,
303.638451335,
304.130459335,
304.130459335,
304.14370483499994,
304.14370483499994,
304.14370483499994,
304.148651835,
304.148652335,
304.148652335]
I want to group it with a margin of ±0.5.
The expected output:
R = [[301.148986835,
301.148986835,
301.148986835,
301.161562835,
301.161562835,
301.16156333500004,
301.167179835,
301.167179835,
301.167179835,
301.167179835,
301.167179835,
301.179755835,
301.179755835,
301.179755835,
301.646611835,
301.659187335,
301.659187335,
301.659187335,
301.659187335,
302.138619335],[302.142316335,
302.151194835,
302.1568118349999,
302.15681183500004,
302.15681183500004,
302.15681183500004,
302.156812335,
302.156812335,
302.156812335,
302.169387835,
302.169387835,
302.169387835,
302.169387835,
302.169387835,
302.169388335,
302.636243335,
302.636243835,
302.648819835,
302.648819835,
303.137565335,
303.140827335,
303.140827335,
303.146443835,
303.146443835,
303.146444335,
303.159019835,
303.159019835,
303.15901983500004,
303.159020335,
303.159020335,
303.15902033500004],
[303.63283533500004,
303.638451335,
304.130459335,
304.130459335,
304.14370483499994,
304.14370483499994,
304.14370483499994],[304.148651835,
304.148652335,
304.148652335]]
When I use this code (my question is not a duplicate):
def grouper(iterable):
    prev = None
    group = []
    for item in iterable:
        if prev is None or item - prev <= 1:
            group.append(item)
        else:
            yield group
            group = [item]
        prev = item
    if group:
        yield group
I get the same list back as the output.
calculate within a tolerance

You update prev on every iteration, so each item is compared with its immediate predecessor. Consecutive values in your list never differ by more than 1, so no new group is ever started. You want to update prev only when you start a new group.
Better yet, get rid of prev altogether and always compare against the first element of the current group.
I'd also suggest adding a tol argument so that the function is more flexible:
def grouper(iterable, tol=0.5):
    tol = abs(tol*2)  # Since we're counting from the start of the group, multiply tol by 2
    group = []
    for item in iterable:
        if not group or item - group[0] <= tol:
            group.append(item)
        else:
            yield group
            group = [item]
    if group:
        yield group
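To see the difference between the two comparison strategies on a small input, here is a minimal sketch. The function names group_by_previous and group_by_first and the sample values are made up for illustration; the threshold of 1 matches the question's code.

```python
def group_by_previous(items, width=1.0):
    """Start a new group only when an item is more than `width` above its predecessor."""
    groups = []
    for item in items:
        if groups and item - groups[-1][-1] <= width:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


def group_by_first(items, width=1.0):
    """Start a new group when an item is more than `width` above the group's first item."""
    groups = []
    for item in items:
        if groups and item - groups[-1][0] <= width:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


# Hypothetical sorted sample: every neighbouring gap is 0.75
sample = [1.0, 1.75, 2.5, 3.25]
print(group_by_previous(sample))  # [[1.0, 1.75, 2.5, 3.25]]
print(group_by_first(sample))     # [[1.0, 1.75], [2.5, 3.25]]
```

Comparing against the previous item lets a chain of small steps stretch a group indefinitely, while comparing against the first item caps the width of each group.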

Related

How to get the maximum value in a matrix and its row number

I would like to get the maximum value of a matrix and the position of its row in the matrix. How can I do that?
Thanks again.
I can get the maximum value in the matrix with the following code, but I am not sure how to get its row index. Note that the values within each row of the matrix can be equal.
ratio =[[0.01556884 0.01556884]
[0.1290337 0.1290337 ]
[0.07015939 0.07015939]
[0.12288323 0.12288323]]
dup = []
for k in ratio:
    for i in k:
        dup.append(i)
print(max(dup))
0.1290337
I expect to obtain the maximum value that I already have,
0.1290337
and its row position, 1.
Could someone help me get the row position?

Assuming you have your data in a nested Python list, you can do it with a generator like this:
ratio = [[0.01556884, 0.01556884],
         [0.1290337, 0.1290337],
         [0.07015939, 0.07015939],
         [0.12288323, 0.12288323]]
max_val, row_max, col_max = max((value, i, j)
                                for i, row in enumerate(ratio)
                                for j, value in enumerate(row))
print(f'Max value: ratio[{row_max}][{col_max}] = {max_val}')
# Max value: ratio[1][1] = 0.1290337
If you have a NumPy array, then you can do:
import numpy as np
ratio = np.array([[0.01556884, 0.01556884],
                  [0.1290337, 0.1290337],
                  [0.07015939, 0.07015939],
                  [0.12288323, 0.12288323]])
row_max, col_max = np.unravel_index(np.argmax(ratio), ratio.shape)
max_val = ratio[row_max, col_max]
print(f'Max value: ratio[{row_max}][{col_max}] = {max_val}')
# Max value: ratio[1][0] = 0.1290337
Note that the two approaches report different columns because the maximum occurs twice in row 1: max() on (value, i, j) tuples breaks ties in favor of the larger indices, while np.argmax returns the first occurrence.

Here is a naive implementation using a double for loop.
ratio = [[0.01556884, 0.01556884],
         [0.1290337, 0.1290337],
         [0.07015939, 0.07015939],
         [0.12288323, 0.12288323]]
max_val = ratio[0][0]
max_loc = (0, 0)
for i, row in enumerate(ratio):
    for j, value in enumerate(row):
        if value > max_val:
            max_val = value
            max_loc = (i, j)
print(f"max value {max_val} at {max_loc}")
Outputs:
max value 0.1290337 at (1, 0)
There is no reason to copy the values into a new list. If you are already iterating through the nested list, just keep track of the index and the max value instead.

Use max and enumerate:
ratio =[[0.01556884, 0.01556884], [0.1290337, 0.1290337 ],[0.07015939, 0.07015939],[0.12288323, 0.12288323]]
print(max(enumerate(map(max, ratio)), key=lambda x:x[1]))
Results:
(1, 0.1290337)
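The one-liner above can be unpacked into its intermediate steps. This is only an illustrative sketch; the names row_maxima, indexed, row_index and max_value are not part of the original answer.

```python
ratio = [[0.01556884, 0.01556884],
         [0.1290337, 0.1290337],
         [0.07015939, 0.07015939],
         [0.12288323, 0.12288323]]

# Step 1: map(max, ratio) reduces each row to its largest value
row_maxima = list(map(max, ratio))

# Step 2: enumerate pairs each row maximum with its row index
indexed = list(enumerate(row_maxima))

# Step 3: max with a key compares the pairs by the value, not by the index
row_index, max_value = max(indexed, key=lambda pair: pair[1])

print(row_index, max_value)  # 1 0.1290337
```

If several rows share the overall maximum, max returns the first such pair, so the lowest row index is reported.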

This will do the job quite nicely (your original ratio matrix was missing some commas):
ratio = [[0.01556884, 0.01556884],
         [0.1290337, 0.1290337],
         [0.07015939, 0.07015939],
         [0.12288323, 0.12288323]]
max_val = float("-inf")  # start below any possible value so negative matrices also work
idx = None
for i, row in enumerate(ratio):
    if max(row) > max_val:
        max_val = max(row)
        idx = i
print(f"max value: {max_val}, at row: {idx}")
Output:
max value: 0.1290337, at row: 1
Or the same, a little more concisely:
ratio = [[0.01556884, 0.01556884],
         [0.1290337, 0.1290337],
         [0.07015939, 0.07015939],
         [0.12288323, 0.12288323]]
idx, max_val = max(enumerate(map(max, ratio)), key=lambda x: x[1])
print(f"max value: {max_val}, at row: {idx}")
Output:
max value: 0.1290337, at row: 1

How to iterate a list of integer pairs, computing new 'union pairs'

I have a list of integer pairs representing year ranges and need to compute union ranges for pairs that are contiguous (within 1 year).
Example input:
ts_list = [[1777, 1777], [1778, 1781], [1786, 1791], [1792, 1795]]
Desired output:
[[1777, 1781], [1786, 1795]]
I've tried for and while loops and can get the union before the first disjoint pair, but I'm stumped as to how to iterate properly. For example, the code below produces a newlist of
[[1777, 1781], [1778, 1781], [1786, 1795]]
and then raises TypeError: 'int' object is not subscriptable. The first and third pairs are correct, but the second is extraneous.
ts_list = [[1777, 1777], [1778, 1781], [1786, 1791], [1792, 1795]]
newlist = []
last = ts_list[len(ts_list)-1][1]
for x in range(len(ts_list)):
    ts = ts_list[x]
    start = ts[0]
    end = ts[1]
    ts_next = ts_list[x+1] if x < len(ts_list)-1 else last
    if ts_next[0] - end > 1:
        # next is disjoint, break out
        newlist.append([start, end])
    else:
        # next is contiguous (within 1 year)
        newlist.append([start, ts_next[1]])

You could do it like this:
ts_list = [[1777, 1777], [1778, 1781], [1786, 1791], [1792, 1795]]
# We start with a copy of the first range, so that ts_list itself is not modified
out = [list(ts_list[0])]
for start, end in ts_list[1:]:
    if start <= out[-1][1] + 1:
        # if the new range starts at most one year
        # after the end of the previous one, we extend it:
        out[-1][1] = end
    else:
        # otherwise, we append this new range to the output
        out.append([start, end])
print(out)
# [[1777, 1781], [1786, 1795]]
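The loop above can also be wrapped in a reusable function. The sketch below is one possible variant, not the answer's own code: the name merge_ranges and the gap parameter are assumptions, and it additionally sorts the input and keeps the larger end year so that unsorted or nested ranges are handled too.

```python
def merge_ranges(ranges, gap=1):
    """Merge [start, end] pairs that overlap or lie within `gap` of each other."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + gap:
            # contiguous or overlapping: extend the last merged range
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # disjoint: start a new range (a fresh list, so the input is untouched)
            merged.append([start, end])
    return merged


ts_list = [[1777, 1777], [1778, 1781], [1786, 1791], [1792, 1795]]
print(merge_ranges(ts_list))
# [[1777, 1781], [1786, 1795]]
```

Taking the max of the two end years matters only when one range is fully contained in another; for already sorted, non-nested input it behaves like the loop above.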

Using set/list comprehensions:
ts_list = [[1777, 1777], [1778, 1781], [1786, 1791], [1792, 1795]]
# Get all years that are contained in one of the date ranges, including start and end year.
# Using a set ensures that each year appears only once.
all_years = {y for years in ts_list for y in range(years[0], years[1] + 1)}
# Get start years (years where the previous year is not in all_years)
# and end years (years where the following year is not in all_years).
start_years = [y for y in all_years if y - 1 not in all_years]
end_years = [y for y in all_years if y + 1 not in all_years]
# Combine start and end years. To match the right ones, sort first.
start_years.sort()
end_years.sort()
result = [[start_year, end_year] for start_year, end_year in zip(start_years, end_years)]
print(result)
# [[1777, 1781], [1786, 1795]]

Python: Find all nodes connected to n (tuple)

I want to find out whether I can reach all nodes from a certain node. I am not interested in the path; I just want to output YES or NO. Assume I have the following graph. As a constraint, I need to represent my nodes as tuples (i,j):
graph = {
    (1,1): [(1,2),(2,2)],
    (1,2): [(1,3)],
    (1,3): [(1,2),(2,3)],
    (2,2): [(3,3)],
    (2,3): [],
    (3,3): [(2,2)]
}
Now I need to show whether, starting from (1,1), (2,2) or (3,3), i.e. a node (i,j) with i = j, I can reach all other nodes where i != j. If yes, print YES; if no, print NO.
The example above would output YES for node (1,1), since I can reach (1,2), (1,3) and (2,3) from node (1,1).
I tried the following:
G = nx.DiGraph()
G.add_edges_from(graph)
for reachable_node in nx.dfs_postorder_nodes(G, source=None):
    print(reachable_node)
However, if I pass (1,1), (2,2) or (3,3) as the source to nx.dfs_postorder_nodes(), I get an error such as KeyError: (1,1).
Which function or library (the more standard the library, the better!) should I use to indicate whether I can reach all nodes from any of the (i, i) nodes?
Thanks for all clarifications! I am a new member, so if my question doesn't follow the Stack Overflow guidelines, feel free to tell me how I can improve my next questions!

This program should do the job using only the standard library; it collects every node that can be visited from a given starting node:
graph = {
    (1,1): [(1,2), (2,2)],
    (1,2): [(1,3)],
    (1,3): [(1,2), (2,3)],
    (2,2): [(3,3)],
    (2,3): [],
    (3,3): [(2,2)]
}
node0 = (1,1)  # choose the starting node
node0_connections = [node0]  # this list will contain all the nodes that can be visited from node0
for node in node0_connections:
    for node_to in graph[node]:
        if node_to not in node0_connections:
            node0_connections.append(node_to)
print('All nodes that can be visited from node', node0, ':', node0_connections)
# the targets are all nodes (i,j) with i != j
targets = [node for node in graph if node[0] != node[1]]
if all(target in node0_connections for target in targets):
    print('YES')
else:
    print('NO')
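The same reachability check can be packaged as a breadth-first search with a set of visited nodes, which avoids scanning a list for every membership test. This is a sketch under stated assumptions: the function name reachable_from is invented here, edges are treated as directed, and every node is assumed to appear as a key of the dictionary.

```python
from collections import deque


def reachable_from(graph, source):
    """Return the set of nodes reachable from `source`, following directed edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


graph = {
    (1, 1): [(1, 2), (2, 2)],
    (1, 2): [(1, 3)],
    (1, 3): [(1, 2), (2, 3)],
    (2, 2): [(3, 3)],
    (2, 3): [],
    (3, 3): [(2, 2)],
}

# Targets are all nodes (i, j) with i != j
targets = {node for node in graph if node[0] != node[1]}

for source in [(1, 1), (2, 2), (3, 3)]:
    answer = "YES" if targets <= reachable_from(graph, source) else "NO"
    print(source, answer)
# (1, 1) YES
# (2, 2) NO
# (3, 3) NO
```

With directed edges, (2,2) and (3,3) can only reach each other, so only (1,1) yields YES.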

I think I understand your question. You could try an exhaustive approach with a try/except block using nx.shortest_path like this. Note that nx.Graph is undirected, so edges are followed in both directions here:
import networkx as nx
graph = {
    (1,1): [(1,2),(2,2)],
    (1,2): [(1,3)],
    (1,3): [(1,2),(2,3)],
    (2,2): [(3,3)],
    (3,3): [(2,2)],
    (4,4): [(1,3)],
    (5,5): []
}
G = nx.Graph(graph)
balanced_nodes = [node for node in G.nodes() if node[0] == node[1]]
unbalanced_nodes = [node for node in G.nodes() if node[0] != node[1]]
for balanced_node in balanced_nodes:
    connected = True
    for unbalanced_node in unbalanced_nodes:
        try:
            nx.shortest_path(G, balanced_node, unbalanced_node)
        except nx.NetworkXNoPath:
            # no path to this unbalanced node, so stop checking
            connected = False
            break
    print(balanced_node, ": ", connected)
This results in:
(1, 1) : True
(2, 2) : True
(3, 3) : True
(4, 4) : True
(5, 5) : False

Return highest key value pair in-between certain number

I am using an image scanning API to look for specific images, e.g. a cat or a dog. The API tells you the probability of each result for your image. For example, an image of a dog would return
{u'dog': 0.99628395, u'cat': 0.87454434}
I want my end result to contain only the highest returned value, AND only if it's above .89 AND below 1.
Code I have so far:
import operator
#define lists
tags = [u'dog', u'cat']
tags_value= [0.99628395, 0.87454434]
#merge key and value pairs
keywords = dict(zip(tags, tags_value))
#check the result
print (keywords)
which gives me
{u'dog': 0.99628395, u'cat': 0.87454434}
I am looking for an end result of
[u'dog']
(notice how the end result is in [] and not {})
Note: if the end result is {u'dog': 0.88628395, u'cat': 0.87454434} then I don't want to return anything because the value is less than .89

There are two steps:
To get the key with the max value of a dictionary, you can do best = max(keywords, key=keywords.get). This returns the key whose value is largest, because max() uses keywords.get to look up the value for each key.
Then you can simply check whether it's within your bounds: return [best] if .89 < keywords[best] < 1 else []. This returns an empty list if the value is not between .89 and 1.
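The two steps above can be wrapped in a small function. This is a minimal sketch: the name best_tag and the low/high parameters are made up for illustration, and returning a one-element list follows the [u'dog'] format the question asks for.

```python
def best_tag(scores, low=0.89, high=1.0):
    """Return [key] for the highest-scoring key if low < score < high, else []."""
    if not scores:
        # max() would raise ValueError on an empty dictionary
        return []
    best = max(scores, key=scores.get)
    return [best] if low < scores[best] < high else []


print(best_tag({'dog': 0.99628395, 'cat': 0.87454434}))  # ['dog']
print(best_tag({'dog': 0.88628395, 'cat': 0.87454434}))  # []
```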

Updated
For efficiency (from @Mad Physicist's comment), you can take the key with the maximum value and check whether that value is in the expected range.
data = {u'dog': 0.99628395, u'cat': 0.87454434}
probable_key = max(data, key=data.get)
result = None
if 0.89 < data[probable_key] < 1:
    result = probable_key
print(result) # dog
You can also sort the dictionary items by value and then check whether the largest value is in the expected range.
import operator
data = {u'dog': 0.99628395, u'cat': 0.87454434}
sorted_data = sorted(data.items(), key=operator.itemgetter(1), reverse=True)
result = None
if 0.89 < sorted_data[0][1] < 1:
    result = sorted_data[0][0]
print(result) # dog

max((k for k, v in keywords.items() if .89 < v < 1), default=None, key=keywords.get)
# 'dog'
The first argument filters the dictionary for items that meet the condition, i.e. .89 < v < 1.
The second argument is a default value that is returned if no items meet the condition, e.g. {u'dog': 0.88628395, u'cat': 0.87454434} -> None.
The last argument is a key function, so max() compares the remaining keys by their values in the original dictionary.

You can convert the items into (value, key) tuples and then use max; tuples are compared by their first item first:
sample={u'dog': 0.99628395, u'cat': 0.87454434}
print(max([(j,i) for i,j in sample.items() if j>0.89 and j<1]))
output:
(0.99628395, 'dog')

Answer 1 (more pythonic):
res = [ key for key, val in keywords.items() if ( (val == max(list(keywords.values()))) and (0.89 <= val) and (val <= 1) ) ]
print (res)
Answer 2 (more "granular"):
keys = list(keywords.keys())
values = list(keywords.values())
print( keys, values )
max_value = max(values)
max_index = values.index(max_value)
max_key = keys[max_index]
res = []
if ( (0.89 <= max_value) and (max_value <= 1) ) :
    res.append(max_key)
print (res)
From what I tested, option 1 is ca. 25% faster (5.3us vs. 6.9us).

Python - partitioning a list of strings using an equivalence relation

I have a list of alphabetic strings [str1,str2,...] which I need to partition into equivalence classes using an equivalence relation R, where str1 R str2 (in relational notation) if str2 can be obtained from str1 by a sequence of valid one-letter changes, where 'valid' means that each change produces a valid alphabetic word, e.g. cat --> car is valid but cat --> cax is not. If the input list was ['cat','ace','car','zip','ape','pip'] then the code should return [['cat','car'],['ace','ape'],['zip','pip']].
I've got an initial working version which, however, produces some "classes" which contain duplicates.
I don't suppose there is any Python package which allows me to define such equivalence relations, but even otherwise what would be the best way of doing this?

Should work for strings of different lengths. Obviously, ordering matters.
def is_one_letter_different(s1, s2):
    if len(s1) != len(s2):
        return False
    diff_count = 0
    for char1, char2 in zip(s1, s2):
        if char1 != char2:
            diff_count += 1
    return diff_count == 1
def group(candidates):
    groups = []
    for candidate in candidates:
        for group in groups:
            for word in group:
                if is_one_letter_different(word, candidate):
                    group.append(candidate)
                    break
            if candidate in group:
                break
        else:
            # no existing group accepted the candidate, so it starts a new one
            groups.append([candidate])
    return groups
print(group(['bread','breed', 'bream', 'tread', 'treat', 'short', 'shorn', 'shirt', 'shore', 'store','eagle','mired', 'sired', 'hired']))
Output:
[['bread', 'breed', 'bream', 'tread', 'treat'], ['short', 'shorn', 'shirt', 'shore', 'store'], ['eagle'], ['mired', 'sired', 'hired']]
EDIT: Updated to follow additional test cases. I'm not sure the output is correct, so please validate it. And please provide good test cases next time.

I would do it something like this: construct an undirected graph where each word is a node and each edge indicates that the relation holds between the two words. Then you can identify each disconnected "island" in the graph, each of which represents an equivalence class.
from collections import defaultdict
def exactly_one(iterable):
    count = 0
    for x in iterable:
        if x:
            count += 1
            if count > 1:
                break
    return count == 1
def are_one_letter_apart(a, b):
    if len(a) != len(b): return False
    return exactly_one(a_char != b_char for a_char, b_char in zip(a, b))
def pairs(seq):
    for i in range(len(seq)):
        for j in range(i+1, len(seq)):
            yield (seq[i], seq[j])
def search(graph, node):
    seen = set()
    to_visit = set()
    to_visit.add(node)
    while to_visit:
        cur = to_visit.pop()
        if cur in seen: continue
        for neighbor in graph[cur]:
            if neighbor not in seen:
                to_visit.add(neighbor)
        seen.add(cur)
    return seen
def get_islands(graph):
    seen = set()
    islands = []
    for item in list(graph):
        if item in seen: continue
        group = search(graph, item)
        seen = seen | group
        islands.append(group)
    return islands
def create_classes(seq, f):
    graph = defaultdict(list)
    for a, b in pairs(seq):
        if f(a, b):
            graph[a].append(b)
            graph[b].append(a)
    # one last pass to pick up items with no relations to anything else
    for item in seq:
        if item not in graph:
            graph[item].append(item)
    return [list(group) for group in get_islands(graph)]
seq = ['cat','ace','car','zip','ape','pip']
print(create_classes(seq, are_one_letter_apart))
Result (sets are unordered, so the order of the classes and of the words within them may vary):
[['ace', 'ape'], ['pip', 'zip'], ['car', 'cat']]
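An alternative to searching the graph explicitly is a union-find (disjoint-set) structure, which is a different technique from the island search above. The sketch below is only illustrative: the names one_letter_apart and equivalence_classes are assumptions, and storing the words as dictionary keys also drops duplicate inputs.

```python
def one_letter_apart(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def equivalence_classes(words, related):
    """Partition words into classes under the transitive closure of `related`."""
    parent = {word: word for word in words}  # each word starts as its own root

    def find(word):
        # follow parent links to the root, shortening the path along the way
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    unique = list(parent)
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if related(a, b):
                parent[find(a)] = find(b)  # union the two classes

    classes = {}
    for word in unique:
        classes.setdefault(find(word), []).append(word)
    return list(classes.values())


seq = ['cat', 'ace', 'car', 'zip', 'ape', 'pip']
print(equivalence_classes(seq, one_letter_apart))
# [['cat', 'car'], ['ace', 'ape'], ['zip', 'pip']]
```

Because dictionaries preserve insertion order, the classes and their members come out in order of first appearance in the input, so the output is deterministic.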
