Computing Euclidean distance with multiple lists in Python

I'm writing a simple program to compute the Euclidean distances between multiple lists using Python. This is the code I have so far:
import math

euclidean = 0
euclidean_list = []
euclidean_list_complete = []
test1 = [[0.0, 0.0, 0.0, 152.0, 12.29], [0.0, 0.0, 0.357, 245.0, 10.4], [0.0, 0.0, 0.10, 200.0, 11.0]]
test2 = [[0.0, 0.0, 0.0, 72.0, 12.9], [0.0, 0.0, 0.0, 80.0, 11.3]]
for i in range(len(test2)):
    for j in range(len(test1)):
        for k in range(len(test1[0])):
            euclidean += pow((test2[i][k]-test1[j][k]), 2)
        euclidean_list.append(math.sqrt(euclidean))
        euclidean = 0
    euclidean_list_complete.append(euclidean_list)
print euclidean_list_complete
My problem with this code is that it doesn't print the output I want. The output should be
[[80.0023, 173.018, 128.014], [72.006, 165.002, 120.000]]
but instead, it prints
[[80.00232559119766, 173.01843095173416, 128.01413984400315, 72.00680592832875, 165.0028407300917, 120.00041666594329], [80.00232559119766, 173.01843095173416, 128.01413984400315, 72.00680592832875, 165.0028407300917, 120.00041666594329]]
I'm guessing it has something to do with the loops. What should I do to fix it? By the way, I don't want to use numpy or scipy, for study purposes.
In case it's unclear: I want to calculate the distance from each list in test2 to each list in test1.

Not sure what you are trying to achieve for 3 vectors, but for two the code can be much, much simpler:
test2 = [[0.0, 0.0, 0.0, 72.0, 12.9], [0.0, 0.0, 0.0, 80.0, 11.3]]

def distance(list1, list2):
    """Distance between two vectors."""
    squares = [(p-q) ** 2 for p, q in zip(list1, list2)]
    return sum(squares) ** .5

d2 = distance(test2[0], test2[1])
With numpy it's an even shorter statement.
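For reference only (the OP wants to avoid numpy for study purposes), a minimal numpy sketch of that shorter statement:

import numpy as np

# Euclidean distance is the 2-norm of the difference of the two vectors
d2 = np.linalg.norm(np.array(test2[0]) - np.array(test2[1]))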
PS. Python 3 recommended.

The question has partly been answered by @Evgeny. The answer the OP posted to his own question is an example of how not to write Python code. Here is a shorter, faster and more readable solution, given test1 and test2 are lists like in the question:
def euclidean(v1, v2):
    return sum((p-q)**2 for p, q in zip(v1, v2)) ** .5

d2 = []
for i in test2:
    foo = [euclidean(i, j) for j in test1]
    d2.append(foo)
print(d2)
#[[80.00232559119766, 173.01843095173416, 128.01413984400315],
# [72.00680592832875, 165.0028407300917, 120.00041666594329]]
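The two loops can also be collapsed into a single nested list comprehension producing the same d2:

d2 = [[euclidean(i, j) for j in test1] for i in test2]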

test1 = [[0.0, 0.0, 0.0, 152.0, 12.29], [0.0, 0.0, 0.357, 245.0, 10.4], [0.0, 0.0, 0.10, 200.0, 11.0]]
test2 = [[0.0, 0.0, 0.0, 72.0, 12.9], [0.0, 0.0, 0.0, 80.0, 11.3]]
final_list = []
for a in test2:
    temp = []  # temporary list holding the distances from a to each list in test1
    for b in test1:
        dis = sum([pow(a[i] - b[i], 2) for i in range(len(a))])
        temp.append(round(pow(dis, 0.5), 4))
    final_list.append(temp)
print(final_list)
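Output (the same distances as above, rounded to four decimal places):
[[80.0023, 173.0184, 128.0141], [72.0068, 165.0028, 120.0004]]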

I got it: the trick is to create the euclidean list inside the first for loop, and then delete the list after appending it to the complete euclidean list.
import math

euclidean = 0
euclidean_list_complete = []
test1 = [[0.0, 0.0, 0.0, 152.0, 12.29], [0.0, 0.0, 0.357, 245.0, 10.4], [0.0, 0.0, 0.10, 200.0, 11.0]]
test2 = [[0.0, 0.0, 0.0, 72.0, 12.9], [0.0, 0.0, 0.0, 80.0, 11.3]]
for i in range(len(test2)):
    euclidean_list = []
    for j in range(len(test1)):
        for k in range(len(test1[0])):
            euclidean += pow((test2[i][k]-test1[j][k]), 2)
        euclidean_list.append(math.sqrt(euclidean))
        euclidean = 0
    euclidean_list.sort(reverse=True)
    euclidean_list_complete.append(euclidean_list)
    del euclidean_list
print euclidean_list_complete

Related

Python list comprehension and looping question

I have variables as follows:
J = range(1, 16)
T = range(1, 9)
x = {}  # 0/1 decision variable to be determined
These variables turn into combinations of x[j,t].
I am trying to implement a constraint that forces x[j,t] = 0 for unacceptable t types in T.
I have a dictionary U with the j's as keys and the t type values stored in a list. A zero value means t is unacceptable; 1 means acceptable. The index is the range 1-9, not 0-8. So in the example below, for j 2, type 3 (because it sits at position 3 on range(1,9)) is the only acceptable value.
{1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
2: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
3: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
4: [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
5: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
6: [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
7: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
8: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
9: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
10: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
11: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
12: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
13: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
14: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
15: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
}
I am struggling to generate the x[j,t] combinations because of the misaligned index. I set it up like so:
for j, t in x:
    if t in U[j] == 0:
        pass  # do the thing... e.g. addConstr(x[j,t], GRB.EQUAL, 0)
So for j 2 the results I need are {(2,1):0, (2,2):0, (2,4):0, (2,5):0, (2,6):0, (2,7):0, (2,8):0} where the index value on range (1,9) becomes the t value in the tupledict.
Any pointers? Thank you!
Assuming your example data is stored in U, what you want to do is:
j_results = []
for j, types in U.items():
    result = {}
    # enumerate from 1 so positions match your range(1, 9) indexing
    for idx, t in enumerate(types, start=1):
        if t == 0.0:
            result[(j, idx)] = 0
    j_results.append(result)
The j_results list will contain all the results like you described:
for j 2 the results I need are {(2,1):0, (2,2):0, (2,4):0, (2,5):0, (2,6):0, (2,7):0, (2,8):0}
will be in j_results[1] (counterintuitive, because your U data starts from 1 while the list is 0-indexed).
Note the enumerate(types, start=1): it yields the 1-based position of each value, which becomes the integer t in the result tuples, while the float value itself is only compared against 0.0.
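If you prefer a single flat mapping in the (j, t) tupledict shape the question asks for, the loop collapses into one dict comprehension over the same U data (a sketch, assuming U is exactly the dictionary shown above):

zero_x = {(j, idx): 0
          for j, types in U.items()
          for idx, t in enumerate(types, start=1)
          if t == 0.0}
# zero_x contains (2, 1): 0, (2, 2): 0, (2, 4): 0, ..., as described for j 2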

How to find the longest consecutive non-zero subset of a list?

I have a list of floats that somewhat looks like this:
[
163.33333333333334,
0.0,
0.0,
154.73684210526315,
172.94117647058823,
155.8303886925795,
0.0,
156.93950177935943,
0.0,
0.0,
0.0,
151.5463917525773,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
165.1685393258427,
156.93950177935943,
169.6153846153846,
159.7826086956522,
167.04545454545453,
158.06451612903226,
168.9655172413793,
157.5,
0.0,
159.7826086956522,
0.0,
163.94052044609666,
166.41509433962264,
0.0,
0.0,
0.0,
]
The actual list is much larger than this but has similar values.
From this list, I want to find the largest consecutive subset of this that is nonzero. In this case that would be:
[165.1685393258427,
156.93950177935943,
169.6153846153846,
159.7826086956522,
167.04545454545453,
158.06451612903226,
168.9655172413793]
I am new to Python and coding in general, so any help would be greatly appreciated.
You can use a simple algorithm with a buffer: make a for loop, keep building the current subset, and whenever the current subset grows longer than the maximum so far, make it the new maximum.
def get_longest_consecutive_non_zero_subset(input_list: list) -> list:
    max_subset = []
    current_max_subset = []
    for number in input_list:
        if number > 0:
            current_max_subset.append(number)
        else:
            if len(current_max_subset) > len(max_subset):
                max_subset = current_max_subset
            current_max_subset = []
    # handle a run that reaches the end of the list
    if len(current_max_subset) > len(max_subset):
        max_subset = current_max_subset
    return max_subset

test_list = [0, 1, 2, 3, 0, 0, 1, 2, 3, 4, 0]
result = get_longest_consecutive_non_zero_subset(test_list)
print(result)
assert result == [1, 2, 3, 4]
You could use itertools.groupby, grouping on whether the values are 0 or not, then select all sublists which have non-zero values and find the one with the maximum length:
from itertools import groupby

# l is your list of floats from the question
g = groupby(l, key=lambda x: x > 0.0)
m = max([list(s) for v, s in g if v], key=len)
print(m)
Output (for your sample data):
[
165.1685393258427,
156.93950177935943,
169.6153846153846,
159.7826086956522,
167.04545454545453,
158.06451612903226,
168.9655172413793,
157.5
]
Note that since you only need to compare with 0, you can just use bool as the groupby function (i.e. g = groupby(l, bool)). This should be faster than comparing with 0.
You can exploit the way groupby() works on unsorted data:
from itertools import groupby
lst = [163.33333333333334, 0.0, 0.0, 154.73684210526315, 172.94117647058823, 155.8303886925795, 0.0, 156.93950177935943, 0.0, 0.0, 0.0, 151.5463917525773, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 165.1685393258427, 156.93950177935943, 169.6153846153846, 159.7826086956522, 167.04545454545453, 158.06451612903226, 168.9655172413793, 157.5, 0.0, 159.7826086956522, 0.0, 163.94052044609666, 166.41509433962264, 0.0, 0.0, 0.0]
result = max((list(g) for k, g in groupby(lst, bool) if k), key=len)
You can try with some if statements. Since you say you're new to Python, I preferred to keep the code as easy as possible, but it would be nice "training" to optimize it.
fulllist = [163.33333333333334, 0.0, 0.0, 154.73684210526315, 172.94117647058823, 155.8303886925795, 0.0, 156.93950177935943, 0.0, 0.0, 0.0, 151.5463917525773, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 165.1685393258427, 156.93950177935943, 169.6153846153846, 159.7826086956522, 167.04545454545453, 158.06451612903226, 168.9655172413793, 157.5, 0.0, 159.7826086956522, 0.0, 163.94052044609666, 166.41509433962264, 0.0, 0.0, 0.0]
longest = []
new_try = []
for element in fulllist:
    if element != 0:
        new_try.append(element)
        # compare lengths, not the lists themselves (> on lists is element-wise)
        if len(new_try) > len(longest):
            longest = new_try.copy()
    if element == 0:
        new_try = []
print(longest)
Output:
[165.1685393258427, 156.93950177935943, 169.6153846153846, 159.7826086956522, 167.04545454545453, 158.06451612903226, 168.9655172413793, 157.5]
def max_non_zero_subset(arr):
    max_non_zero = []
    curr_non_zero = []
    for n in arr:
        if n == 0:
            if len(curr_non_zero) > len(max_non_zero):
                max_non_zero = curr_non_zero
            curr_non_zero = []
        else:
            curr_non_zero.append(n)
    # the final run may be the longest if the list does not end with 0
    return max_non_zero if len(max_non_zero) >= len(curr_non_zero) else curr_non_zero
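A quick usage sketch with a short made-up list, showing that a run reaching the end of the input is still returned:

data = [0.0, 1.5, 2.5, 0.0, 3.5, 4.5, 5.5]
print(max_non_zero_subset(data))  # [3.5, 4.5, 5.5]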

Python: read a file with lists converted to strings and convert back to lists

I have a txt file which looks like this:
0.41,"[1.0, 0.0, 1.0]","[16.4, 0.0, 0.0]"
0.23,"[0.0, 2.0, 2.0]","[32.8, 0.0, 0.0]"
0.19,"[0.0, 0.0, 3.0]","[92.8, 0.0, 0.0]"
and I hope to read it and convert the strings to lists:
a=[0.41, 0.23, 0.19, 0.03, 0.02, 0.02]
b=[[1.0, 0.0, 1.0],[0.0, 2.0, 2.0],[0.0, 0.0, 3.0]]
c=[[16.4, 0.0, 0.0],[32.8, 0.0, 0.0],[92.8, 0.0, 0.0]]
How can I do this in Python?
Thanks in advance,
Fei
I would use the csv module to properly tokenize the items, then I'd transpose the rows using zip and convert the string data to Python lists/values using ast.literal_eval:
import csv
import ast

with open("file.txt") as f:
    cr = csv.reader(f)
    items = [[ast.literal_eval(x) for x in row] for row in zip(*cr)]

print(items)
result: a list of lists
[[0.41, 0.23, 0.19], [[1.0, 0.0, 1.0], [0.0, 2.0, 2.0], [0.0, 0.0, 3.0]], [[16.4, 0.0, 0.0], [32.8, 0.0, 0.0], [92.8, 0.0, 0.0]]]
That's not the general case, but if you know you have exactly 3 items per row, you can unpack them into any variables you want:
if len(items) == 3:
    a, b, c = items
    print(a)
    print(b)
    print(c)
and you get:
[0.41, 0.23, 0.19]
[[1.0, 0.0, 1.0], [0.0, 2.0, 2.0], [0.0, 0.0, 3.0]]
[[16.4, 0.0, 0.0], [32.8, 0.0, 0.0], [92.8, 0.0, 0.0]]
Note that your expected output for a is not possible given the input data (it lists six values, but the file only has three rows).
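For clarity, the zip(*cr) step is what turns rows into columns; a minimal sketch with two made-up rows:

rows = [("0.41", "[1.0, 0.0, 1.0]"), ("0.23", "[0.0, 2.0, 2.0]")]
cols = list(zip(*rows))
# [('0.41', '0.23'), ('[1.0, 0.0, 1.0]', '[0.0, 2.0, 2.0]')]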

Python: What is an efficient way to sort a nested array by repeated values?

data is a list where each entry is a list of floats.
L is the number of buckets: for each i in range(L), I check whether the first entry of a sublist in data equals i and, if so, store that sublist at index i in c.
c = []
d = []
for i in range(L):
for seq in data:
if int(seq[0]) == i:
d.append(seq)
c.append(d)
d = []
return c
>>> data = [[4.0, 0.0, 15.0, 67.0], [3.0, 0.0, 15.0, 72.0], [4.0, 0.0, 15.0, 70.0], [1.0, -0.0, 15.0, 90.0], [3.0, -0.0, 15.0, 75.0], [2.0, -0.0, 15.0, 83.0], [3.0, 0.0, 15.0, 74.0], [4.0, 0.0, 15.0, 69.0], [4.0, 0.0, 14.0, 61.0], [3.0, 0.0, 15.0, 74.0], [3.0, 0.0, 15.0, 75.0], [4.0, 0.0, 15.0, 67.0], [5.0, 0.0, 14.0, 45.0], [6.0, 0.0, 13.0, 30.0], [3.0, 0.0, 15.0, 74.0], [4.0, 0.0, 15.0, 55.0], [7.0, 0.0, 13.0, 22.0], [6.0, 0.0, 13.0, 25.0], [1.0, -0.0, 15.0, 83.0], [7.0, 0.0, 13.0, 18.0]]
>>> sort(data,7)
[[], [[1.0, -0.0, 15.0, 90.0], [1.0, -0.0, 15.0, 83.0]], [[2.0, -0.0, 15.0, 83.0]], [[3.0, 0.0, 15.0, 72.0], [3.0, -0.0, 15.0, 75.0], [3.0, 0.0, 15.0, 74.0], [3.0, 0.0, 15.0, 74.0], [3.0, 0.0, 15.0, 75.0], [3.0, 0.0, 15.0, 74.0]], [[4.0, 0.0, 15.0, 67.0], [4.0, 0.0, 15.0, 70.0], [4.0, 0.0, 15.0, 69.0], [4.0, 0.0, 14.0, 61.0], [4.0, 0.0, 15.0, 67.0], [4.0, 0.0, 15.0, 55.0]], [[5.0, 0.0, 14.0, 45.0]], [[6.0, 0.0, 13.0, 30.0], [6.0, 0.0, 13.0, 25.0]]]
len(data) is on the order of 2 million, and L is on the order of 8000.
I need a way to speed this up, ideally!
Optimization attempt
Assuming you want to sort your sublists into buckets according to the first value of each sublist.
For simplicity, I use the following to generate random numbers for testing:
import random

L = 10
data = [[round(random.random() * 10.0, 2) for _ in range(3)] for _ in range(10)]
First, about your code, just to make sure that I got your intention correctly:
c = []
d = []
for i in range(L):             # Loop over all buckets
    for e in data:             # Loop over entire data
        if int(e[0]) == i:     # If first float of sublist falls into i-th bucket
            d.append(e)        # Append entire sublist to current bucket
    c.append(d)                # Append current bucket to list of buckets
    d = []                     # Reset
This is inefficient because you loop over the full data set once for each of your buckets. If you have, as you say, some 8000 buckets and 2,000,000 lists of floats, you end up performing 16,000,000,000 (16 billion) comparisons. Additionally, you fully populate your bucket lists on creation instead of reusing the existing lists in your data variable, so you make just as many data reference copies.
Thus, you should think about working with your data's indices, e.g.
bidx = [int(e[0]) for e in data]  # Calculate bucket indices for all sublists
buck = []
for i in range(L):                # Loop over all buckets
    lidx = [k for k, b in enumerate(bidx) if b == i]  # Get sublist indices for this bucket
    buck.append([data[l] for l in lidx])              # Collect list references
print(buck)
This way, a single iteration over your data calculates the bucket indices up front. Then only one further iteration over all your buckets is performed, where the corresponding bucket indices are collected from bidx (you still have a double loop, but it may be a bit faster), with lidx holding the positions of the sublists in data that fall into the current bucket. Finally, the list references are collected into the bucket's list and stored.
The last step can be costly though, because it contains a lot of reference copying. You should consider storing only the indices in each bucket, not the entire data, e.g.
lidx = ...
buck.append(lidx)
However, optimizing performance purely in code has its limits when the data is large: all linear iterations will be costly. You can try to reduce them as far as possible, but there is a lower bound defined by the data size itself!
If you have to perform many operations on millions of records, you should think about changing to another data representation or format. For example, if you need to perform frequent operations within one script, you may want to think about trees (e.g. b-trees). If you want to store the data for further processing, you may want to think about a database with proper indexes.
Running in Python 3, I get two orders of magnitude better performance than jbndlr with this algorithm:
rl = range(L)                        # Generate the range of bucket indices
buck = [[] for _ in rl]              # Create all the buckets
for seq in data:                     # Loop over entire data once
    try:
        idx = rl.index(int(seq[0]))  # Find the bucket index
        buck[idx].append(seq)        # Append current data in its bucket
    except ValueError:
        pass                         # There is no bucket for that value
Comparing the algorithms with:
L = 1000
data = [[round(random.random() * 1200.0, 2) for _ in range(3)] for _ in range(100000)]
I get:
yours: 26.66 sec
jbndlr: 6.78 sec
mine: 0.07 sec
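For completeness, a single-pass variant with collections.defaultdict avoids even the range lookup; a sketch under the same assumptions about data and L, not benchmarked here:

from collections import defaultdict

buckets = defaultdict(list)
for seq in data:
    buckets[int(seq[0])].append(seq)  # hash directly on the bucket key, one pass
c = [buckets[i] for i in range(L)]    # materialize in bucket order; missing buckets stay empty
# values >= L end up in extra keys of buckets and are simply not collected into c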

For-Loop Execution in Python - Does the Executable Code Reset?

Trying to plot multiple lines on one graph using matplotlib and for loops, but the code doesn't work after the first iteration. Here's the code:
import csv
import matplotlib.pyplot as plt

r = csv.reader(open('CrimeStatebyState.csv', 'rb'))
line1 = r.next()

def crime_rate(*state):
    for s in state:
        orig_dict = {}
        for n in range(1960, 2006):
            orig_dict[n] = []
        for line in r:
            if line[0] == s:
                orig_dict[int(line[3])].append(int(line[4]))
        for y in orig_dict:
            orig_dict[y] = sum(orig_dict[y])
        plt.plot(orig_dict.keys(), orig_dict.values(), 'r')
        print orig_dict.values()
        print s

crime_rate("Alabama", "California", "New York")
Here's what it returns:
[39920, 38105, 41112, 44636, 53550, 55131, 61838, 65527, 71285, 75090, 85399, 86919, 84047, 91389, 107314, 125497, 139573, 136995, 147389, 159950, 190511, 191834, 182701, 162361, 155691, 158513, 173807, 181751, 188261, 190573, 198604, 219400, 217889, 204274, 206859, 206188, 205962, 211188, 200065, 192819, 202159, 192835, 200331, 201572, 201664, 197071]
Alabama
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
California
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
New York
**[[[Graph of Alabama's values]]]**
Why am I getting zeroes after the loop runs once? Is this why the other two graphs aren't showing up? Is there an issue with the sum function, the "for line in r" loop, or using *state?
Sorry if that's not enough information! Thanks to those kind/knowledgeable enough to help.
It would appear that your csv reader is exhausted after you have processed the first state, so when you next run "for line in r:" for the next state there are no more lines to read. You can confirm this by putting a print statement straight after it to see what it has to process, e.g.
for line in r:
    print "test"  # Test print
    if line[0] == s:
        orig_dict[int(line[3])].append(int(line[4]))
If you re-define your csv reader within each state loop you should get your data correctly processed:
import csv
import matplotlib.pyplot as plt

def crime_rate(*state):
    for s in state:
        r = csv.reader(open('CrimeStatebyState.csv', 'rb'))
        line1 = r.next()
        orig_dict = {}
        for n in range(1960, 2006):
            orig_dict[n] = []
        for line in r:
            if line[0] == s:
                orig_dict[int(line[3])].append(int(line[4]))
        for y in orig_dict:
            orig_dict[y] = sum(orig_dict[y])
        plt.plot(orig_dict.keys(), orig_dict.values(), 'r')
        print orig_dict.values()
        print s

crime_rate("Alabama", "California", "New York")
Others have already explained the source of your error. May I suggest you use pandas for this task:
import pandas as pd
states = ["Alabama", "California", "New York"]
data = pd.read_csv('CrimeStatebyState.csv') # import data
df = data[(1996 <= data.Year) & (data.Year <= 2005)] # filter by year
pd.pivot_table(df, rows='Year', cols='State', values='Count')[states].plot()
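A side note: rows= and cols= are from an old pandas API; in current pandas the same pivot is spelled with index= and columns= (the column names Year, State and Count are assumed from the answer above):

pivot = pd.pivot_table(df, index='Year', columns='State', values='Count')
pivot[states].plot()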
