I have a list of items like so: T = [T_0, T_1, ..., T_N], where each T_i is itself a time series. I want to find the pairwise distances (via DTW) for all possible pairs.
E.g. If T=[T_0, T_1, T_2] and I had a DTW function f, I want to find f(T_0, T_1), f(T_0, T_2), f(T_1, T_2).
Note T_i actually looks like ( id of i, [ time series values ] ).
My code snippet looks like this:
cluster = defaultdict(list)
donotcluster = defaultdict(list)
for i, lst1 in tqdm(enumerate(T)):
    for lst2 in tqdm(T):
        if lst2 in cluster[lst1[0]] or lst2 in donotcluster[lst1[0]]:
            pass
        else:
            distance, path = fastdtw(lst1[1], lst2[1], dist=euclidean)
            if distance <= distance_threshold:
                cluster[lst1[0]] += [lst2]
                cluster[lst2[0]] += [lst1]
            else:
                donotcluster[lst1[0]] += [lst2]
                donotcluster[lst2[0]] += [lst1]
Right now I have around 20,000 time series and this takes far too long (it would run for about 5 days). I am using the Python library fastdtw. Is there a more optimised library, or just a better/faster way of computing all possible distances? Since distances are symmetric, I don't have to calculate, for example, f(T_41, T_33) if I have already calculated f(T_33, T_41).
I would recommend keeping a set of all of the pairs you've done so far, since a set has constant-time lookup. Beyond that, you should consider approaches where you don't extend your lists so often (that nasty += you're doing), since it can be rather expensive. I don't know enough about your application to comment on that, though; if you provide more information, I may be able to figure out a way to get rid of some of the += calls you don't need. One idea (for efficiency) would be to append each list to a list of lists, and then flatten it at the end of your script with something like:
[i for x in cluster[lst[0]] for i in x]
I modified your code as follows:
cluster = defaultdict(list)
donotcluster = defaultdict(list)
seen = set()  # added this

def hashPair(a, b):  # added this: order-independent, hashable key for a pair
    a, b = tuple(a), tuple(b)
    return (min(a, b), max(a, b))

for i, lst1 in tqdm(enumerate(T)):
    for lst2 in tqdm(T):
        # changed around your condition
        if hashPair(lst1[1], lst2[1]) not in seen and lst2 not in cluster[lst1[0]] and lst2 not in donotcluster[lst1[0]]:
            seen.add(hashPair(lst1[1], lst2[1]))  # added this
            distance, path = fastdtw(lst1[1], lst2[1], dist=euclidean)
            if distance <= distance_threshold:
                cluster[lst1[0]] += [lst2]
                cluster[lst2[0]] += [lst1]
            else:
                donotcluster[lst1[0]] += [lst2]
                donotcluster[lst2[0]] += [lst1]
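The key property needed from the pair key is order-independence: both orderings of a pair must map to the same set entry. A minimal standalone check of that idea, using a hypothetical hash_pair helper (the series are converted to tuples because set members must be hashable):

```python
def hash_pair(a, b):
    # canonical, hashable key for an unordered pair of sequences
    a, b = tuple(a), tuple(b)
    return (min(a, b), max(a, b))

seen = set()
seen.add(hash_pair([1, 2], [3, 4]))

# the reversed pair maps to the same key, so it would be skipped
assert hash_pair([3, 4], [1, 2]) in seen
```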
I cannot answer your question about whether there is a more optimized DTW library, but you can use itertools to generate the combinations you want without duplicates:
import itertools

for combination in itertools.combinations(T, 2):
    f(combination[0], combination[1])
Here is an example of the combinations:
('T_1', 'T_2')
('T_1', 'T_3')
('T_1', 'T_4')
('T_1', 'T_5')
('T_2', 'T_3')
('T_2', 'T_4')
('T_2', 'T_5')
('T_3', 'T_4')
('T_3', 'T_5')
('T_4', 'T_5')
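Applied to the question's setup, a sketch of the symmetric computation (with a toy stand-in distance in place of fastdtw and hypothetical IDs/values, since only the pairing logic matters here):

```python
import itertools

def dist(a, b):
    # toy stand-in for fastdtw: sum of absolute differences (assumes equal lengths)
    return sum(abs(x - y) for x, y in zip(a, b))

# T entries are (id, [time series values]) as in the question
T = [("id0", [0, 1, 2]), ("id1", [0, 1, 3]), ("id2", [5, 5, 5])]

distances = {}
for (id1, s1), (id2, s2) in itertools.combinations(T, 2):
    distances[(id1, id2)] = dist(s1, s2)  # each unordered pair computed exactly once

# N*(N-1)/2 pairs instead of N*N: 3 instead of 9 for N == 3
```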
Related
I am trying to have i range from 0 to 20. I tried a for loop, but then deleted it due to run time. How would one do this with a list comprehension?
finallinearsystem = [
    [np.transpose(i), np.transpose(pts_3d[i]), np.dot(-y[i], np.transpose(pts_3d[i]))],
    [np.transpose(pts_3d[i]), np.transpose(i), np.dot(-x[i], np.transpose(pts_3d[i]))],
]
I know you accepted my answer already, but I think your performance issue isn't the for loop itself; it's that you're doing several expensive computations multiple times.
I'd try this:
final_linear_system = []
for i in range(20):
    transposed_i = np.transpose(i)  # you were doing this twice
    transposed_pts_3d_i = np.transpose(pts_3d[i])  # you were doing this 4 times
    final_linear_system.append([transposed_i, transposed_pts_3d_i, np.dot(-y[i], transposed_pts_3d_i)])
    final_linear_system.append([transposed_pts_3d_i, transposed_i, np.dot(-x[i], transposed_pts_3d_i)])
It looks like you want to add two elements to the list for each value of i; this can be done with a nested comprehension:
print([j for i in range(5) for j in ([f"{i}a",f"{i}a"],[f"{i}b",f"{i}b"])])
# [['0a', '0a'], ['0b', '0b'], ['1a', '1a'], ['1b', '1b'], ['2a', '2a'], ['2b', '2b'], ['3a', '3a'], ['3b', '3b'], ['4a', '4a'], ['4b', '4b']]
or:
finallinearsystem = [
    j for i in range(20)
    for j in (
        [np.transpose(i), np.transpose(pts_3d[i]), np.dot(-y[i], np.transpose(pts_3d[i]))],
        [np.transpose(pts_3d[i]), np.transpose(i), np.dot(-x[i], np.transpose(pts_3d[i]))]
    )
]
This may sound a bit insane but I have an iterator with N = 10**409 elements. Is there a way to get items from the end of this "list"? I.e. when I call next(iterator) it gives me what I want to be the last thing, but to get to what I want to be first thing I would need to call next(iterator) N times.
If I do something like list(iterator).reverse() it will of course crash due to lack of memory.
Edit: how the iterator is being used with a simplified example:
# prints all possible alphabetical character combinations that can fit in a tweet
import itertools

chars = "abcdefghijklmnopqrstuvwxyz "
cproduct = itertools.product(chars, repeat=250)
for subset in cproduct:
    print(''.join(subset))
# will start with `aaaaaaaa...aaa`
# but I want it to start with `zzz...zzz`
For some problems, you can compute the elements in reverse. For the example you provide, one can simply reverse the items you are taking the product of.
In this example, we reverse the symbols before taking the product to get the "reverse iterator":
>>> symbols = "abc"
>>> perms = itertools.product(symbols, repeat=5)
>>> perms = ["".join(x) for x in perms]
>>> perms
['aaaaa', 'aaaab', 'aaaac', 'aaaba', 'aaabb',
...,
'cccbb', 'cccbc', 'cccca', 'ccccb', 'ccccc']
>>> perms_rev = itertools.product(symbols[::-1], repeat=5)
>>> perms_rev = ["".join(x) for x in perms_rev]
>>> perms_rev
['ccccc', 'ccccb', 'cccca', 'cccbc', 'cccbb',
...,
'aaabb', 'aaaba', 'aaaac', 'aaaab', 'aaaaa']
>>> perms_rev == perms[::-1]
True
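Applied to the tweet example above, the same trick works (a sketch; note the reversed iterator starts from the last symbol of chars, so the trailing space is dropped here so that the output starts at 'zzz...zzz'):

```python
import itertools

chars = "abcdefghijklmnopqrstuvwxyz"  # trailing space dropped so 'z' is the last symbol
cproduct_rev = itertools.product(chars[::-1], repeat=250)

# the product is lazy, so taking the first element is cheap despite the huge total count
first = ''.join(next(cproduct_rev))
# first == 'z' * 250
```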
I have a list of multiple strings, and I want to separate them by a marker:
MainList :
[
GENERAL NOTES & MISCELLANEOUS DETAILS_None_None_None,
STR_XX_XX_0001,
STR_XX_XX_0002,
STR_XX_XX_0003,
GENERAL ARRANGEMENT_None_None_None,
STR_XX_XX_10001.0,
STR_XX_XX_10002.0,
STR_XX_XX_10003.0,
STR_XX_XX_10004.0,
STR_XX_XX_10005.0,
STR_XX_XX_10006.0
]
If the string "_None_None_None" is found in the main list, that element should be added to a new, empty list, and the following STR_XX_XX_... values to another list; this continues until the next string with "_None_None_None" is found, and then the same again.
I have tried it myself, but I don't think I'll be able to break my loop when it finds the next string with "_None_None_None". I'm just figuring out the approach; I'm not sure the logic is right.
empty1 = []
empty2 = []
for i in MainList:
    if "_None_None_None" in i:
        empty1.append(i)
        # Need help from here onwards
    else:
        while "_None" not in i:
            empty2.append(i)
            break
I am expecting the output in two lists, something like this:
List1:
[
GENERAL NOTES & MISCELLANEOUS DETAILS_None_None_None,
GENERAL ARRANGEMENT_None_None_None
]
List2:
[
[STR_XX_XX_0001,STR_XX_XX_0002,STR_XX_XX_0003],[STR_XX_XX_10001.0,STR_XX_XX_10002.0,STR_XX_XX_10003.0,STR_XX_XX_10004.0,STR_XX_XX_10005.0,STR_XX_XX_10006.0]
]
List2 is the list with sublists
You are making it a little too complicated; you can let the loop run the whole way through without the internal while loop. Just make the decision for each element as it shows up:
empty1 = []
empty2 = []
for i in MainList:
    if "_None_None_None" in i:
        empty1.append(i)
    else:
        empty2.append(i)
This will give you two lists:
> empty1
> ['GENERAL NOTES & MISCELLANEOUS DETAILS_None_None_None',
'GENERAL ARRANGEMENT_None_None_None']
> empty2
> ['STR_XX_XX_0001',
'STR_XX_XX_0002',
'STR_XX_XX_0003',
'STR_XX_XX_10001.0',
'STR_XX_XX_10002.0',
'STR_XX_XX_10003.0',
'STR_XX_XX_10004.0',
'STR_XX_XX_10005.0',
'STR_XX_XX_10006.0']
EDIT Based on comment
If the commenter is correct and you want to group the non-NONE values into separate lists, this is a good use case for itertools.groupby. It will make the groups for you in a convenient, efficient way and your loop will look almost the same:
from itertools import groupby

empty1 = []
empty2 = []
for k, i in groupby(MainList, key=lambda x: "_None_None_None" in x):
    if k:
        empty1.extend(i)
    else:
        empty2.append(list(i))
This will give you the same empty1, but empty2 will now be a list of lists:
[['STR_XX_XX_0001', 'STR_XX_XX_0002', 'STR_XX_XX_0003'],
['STR_XX_XX_10001.0',
'STR_XX_XX_10002.0',
'STR_XX_XX_10003.0',
'STR_XX_XX_10004.0',
'STR_XX_XX_10005.0',
'STR_XX_XX_10006.0']]
You can try the following code snippet:
dlist = ["GENERAL NOTES & MISCELLANEOUS DETAILS_None_None_None","STR_XX_XX_0001","STR_XX_XX_0002","STR_XX_XX_0003", "GENERAL ARRANGEMENT_None_None_None","STR_XX_XX_10001.0","STR_XX_XX_10002.0", "STR_XX_XX_10003.0", "STR_XX_XX_10004.0", "STR_XX_XX_10005.0", "STR_XX_XX_10006.0"]
with_None = [elem for elem in dlist if elem.endswith("_None")]
without_None = [elem for elem in dlist if not elem.endswith("_None")]
You can also write a generic function for the process:
def cust_sept(src_list, value_to_find, f):
    with_value = [elem for elem in src_list if f(elem, value_to_find)]
    without_value = [elem for elem in src_list if not f(elem, value_to_find)]
    return with_value, without_value

list_one, list_two = cust_sept(dlist, "_None", str.endswith)
I have a list of file paths which I need to order in a specific way prior to reading and processing the files. The specific way is defined by a smaller list which contains only some file names, but not all of them. All other file paths which are not listed in presorted_list need to stay in the order they had previously.
Examples:
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
expected_list = ['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
I've found some related posts:
Sorting list based on values from another list?
How to sort a list according to another list?
But as far as I can tell the questions and answers always rely on two lists of the same length which I don't have (which results in errors like ValueError: 'bar_foo' is not in list) or a presorted list which needs to contain all possible values which I can't provide.
My Idea:
I've come up with a solution which seems to work but I'm unsure if this is a good way to approach the problem:
import os
import re

EXPECTED_LIST = ['path/to/foo_baz.csv',
                 'other/path/to/foo_baz.csv',
                 'path/to/foo(ignore_this).csv',
                 'path/to/bar_foo.csv',
                 'path/to/foo_bar(ignore_this).csv']
PRESORTED_LIST = ["foo_baz", "foo"]

def sort_function(item, len_list):
    # strip path and unwanted parts
    filename = re.sub(r"[\(\[].*?[\)\]]", "", os.path.basename(item)).split('.')[0]
    if filename in PRESORTED_LIST:
        return PRESORTED_LIST.index(filename)
    return len_list

def main():
    some_list = ['path/to/bar_foo.csv',
                 'path/to/foo_baz.csv',
                 'path/to/foo_bar(ignore_this).csv',
                 'path/to/foo(ignore_this).csv',
                 'other/path/to/foo_baz.csv']
    list_length = len(some_list)
    sorted_list = sorted(some_list, key=lambda x: sort_function(x, list_length))
    assert sorted_list == EXPECTED_LIST

if __name__ == "__main__":
    main()
Are there other (shorter, more pythonic) ways of solving this problem?
Here is how I think I would do it:
import re
from collections import OrderedDict
from itertools import chain
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
expected_list = ['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
def my_sort(lst, presorted_list):
    rgx = re.compile(r"^(.*/)?([^/(.]*)(\(.*\))?(\.[^.]*)?$")
    d = OrderedDict((n, []) for n in presorted_list)
    d[None] = []
    for p in lst:
        m = rgx.match(p)
        n = m.group(2) if m else None
        if n not in d:
            n = None
        d[n].append(p)
    return list(chain.from_iterable(d.values()))

print(my_sort(some_list, presorted_list) == expected_list)
# True
An easy implementation is to add sentinels to the entries before sorting, so no custom ordering function is needed. Regex can also be avoided if all filenames follow the pattern you gave:
for n, file1 in enumerate(presorted_list):
    for m, file2 in enumerate(some_list):
        if '/' + file1 + '.' in file2 or '/' + file1 + '(' in file2:
            some_list[m] = "%03d%03d:%s" % (n, m, file2)
some_list.sort()
some_list = [file.split(':', 1)[-1] for file in some_list]
print(some_list)
Result:
['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
Let me think. It's an interesting problem; I'll try to suggest a solution.
only_sorted_elements = [x for x in some_list
                        if x.rpartition("/")[-1].partition(".")[0] in presorted_list]
only_sorted_elements.sort(key=lambda x: presorted_list.index(x.rpartition("/")[-1].partition(".")[0]))
expected_list = []
count = 0
for each_element in some_list:
    if each_element.rpartition("/")[-1].partition(".")[0] not in presorted_list:
        expected_list.append(each_element)
    else:
        expected_list.append(only_sorted_elements[count])
        count += 1
Hope this solves your problem.
I first filter for only those elements which are in presorted_list,
then I sort those elements according to their order in presorted_list,
then I iterate over the list and append accordingly.
Edited:
Changed the key from the filename with path to the exact filename.
This retains the original order of those files which are not in the presorted list.
Edited again:
The new code below changes the keys and gives the sorted results first and the unsorted ones after.
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
only_sorted_elements = [x for x in some_list
                        if x.rpartition("/")[-1].partition("(")[0].partition(".")[0] in presorted_list]
unsorted_all = [x for x in some_list
                if x.rpartition("/")[-1].partition("(")[0].partition(".")[0] not in presorted_list]
only_sorted_elements.sort(key=lambda x: presorted_list.index(x.rpartition("/")[-1].partition("(")[0].partition(".")[0]))
expected_list = only_sorted_elements + unsorted_all
print(expected_list)
Result :
['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
Since Python's sort is already stable, you only need to provide it with a coarse grouping for the sort key.
Given the specifics of your sorting requirements this is better done using a function. For example:
def presort(presorted):
    def sortkey(name):
        filename = name.split("/")[-1].split(".")[0].split("(")[0]
        if filename in presorted:
            return presorted.index(filename)
        return len(presorted)
    return sortkey

sorted_list = sorted(some_list, key=presort(['foo_baz', 'foo']))
In order to keep the process generic and simple to use, the presorted_list should be provided as a parameter and the sort key function should use it to produce the grouping keys. This is achieved by returning a function (sortkey) that captures the presorted list parameter.
This sortkey() function returns the index of the file name in the presorted_list or a number beyond that for unmatched file names. So, if you have 2 names in the presorted_list, they will group the corresponding files under sort key values 0 and 1. All other files will be in group 2.
The conditions that you use to determine which part of the file name should be found in presorted_list are somewhat unclear so I only covered the specific case of the opening parenthesis. Within the sortkey() function, you can add more sophisticated parsing to meet your needs.
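To make the grouping concrete, here is a quick check of the keys that sortkey() produces for a few of the example paths (a sketch reusing the function above):

```python
def presort(presorted):
    def sortkey(name):
        # strip directory, extension, and anything after an opening parenthesis
        filename = name.split("/")[-1].split(".")[0].split("(")[0]
        if filename in presorted:
            return presorted.index(filename)
        return len(presorted)
    return sortkey

key = presort(['foo_baz', 'foo'])
keys = [key(p) for p in ['path/to/foo_baz.csv',
                         'path/to/foo(ignore_this).csv',
                         'path/to/bar_foo.csv']]
# foo_baz -> group 0, foo -> group 1, everything else -> group 2
```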
What I have:
I have a list of lists, nodes. Each list has the following structure:
nodes = [[ID, number 1, number 2, number 3],[...]]
I also have two other lists of lists called sampleID and sampleID2 where each list may only have single data equal to an ID number which belongs to a subset of the total IDs contained in nodes:
sampleID = [[IDa],[...]]
sampleID2 = [[IDb],[...]], len(sampleID) + len(sampleID2) <= len(nodes)
In some cases these lists can also be like:
sampleID = [[IDa1,IDa2, IDa3,...],[...]]
What I want:
Given the above three lists I'd like to obtain in a fast way a fourth list which contains the lists where IDi==ID, i=a,b:
extractedlist = [[ID, number 1, number 2, number 3],[...]], len(extractedlist) = len(sampleID) + len(sampleID2)
My code:
Very basic, it works but it takes a lot of time to compute:
import itertools

for line in nodes[:]:
    for line2, line3 in itertools.izip(sampleID[:], sampleID2[:]):
        for i in range(0, len(line2)):
            if line2[i] == line[0]:
                extractedlist.append([line[0], line[1], line[2], line[3]])
        for j in range(0, len(line3)):
            if line3[j] == line[0]:
                extractedlist.append([line[0], line[1], line[2], line[3]])
I could not understand your problem completely, but here is what I understood:
nodes = [ .... ]
sampleID = [ .... ]
sampleID2 = [ .... ]

final_ids = []
for list_item in sampleID:
    final_ids.extend(list_item)
for list_item in sampleID2:
    final_ids.extend(list_item)

extractedlist = []
for line in nodes:
    if line[0] in final_ids:
        extractedlist.append(line)
Hope this is what you need. Otherwise, please add the original input list and the expected result to the question so I can understand what you want to do. :)
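Since the original loop was slow, one refinement worth noting (a sketch with hypothetical toy data, assuming the IDs are hashable): collecting the IDs into a set makes each membership check constant-time instead of a linear scan of final_ids.

```python
from itertools import chain

# toy data in the shapes described in the question (hypothetical values)
nodes = [[1, 0.1, 0.2, 0.3],
         [2, 0.4, 0.5, 0.6],
         [3, 0.7, 0.8, 0.9],
         [4, 1.0, 1.1, 1.2]]
sampleID = [[1]]
sampleID2 = [[3]]

# flatten both ID lists into a set: membership tests become O(1)
wanted = set(chain.from_iterable(sampleID)) | set(chain.from_iterable(sampleID2))
extractedlist = [line for line in nodes if line[0] in wanted]
# extractedlist holds the full rows for IDs 1 and 3, in their original order
```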