PySpark: difficulty implementing KMeans with mapreduce functions - python

I am currently tasekd in a Distributed DataBase class to create an implementation of kmeans with map reduce based approach (yes i know that there is a premade function for it but the task is specifically to do your own approach), and while i have figured out the approach itself, i am struggling with implementing it with the appropriate use of the map and reduce functions.
def Find_dist(x,y):
sum = 0
vec1= list(x)
vec2 = list(y)
for i in range(len(vec1)):
sum = sum +(vec1[i]-vec2[i])*(vec1[i]-vec2[i])
return sum
def mapper(cent, datapoint):
min = Find_dist(datapoint,cent[0])
closest = cent[0]
for i in range(1,len(cent)):
curr = Find_dist(datapoint,cent[i])
if curr < min:
min = curr
closest = cent[i]
yield closest,datapoint
def combine(x):
Key = x[0]
Values = x[1]
sum = [0]*len(Key)
counter = 0
for datapoint in Values:
vec = list(datapoint[0])
counter = counter+1
sum = sum+vec
point = Row(vec)
result = (counter,point)
yield Key, result
def Reducer(x):
Key = x[0]
Values = x[1]
sum = [0]*len(Key)
counter = 0
for datapoint in Values:
vec = list(datapoint[0])
counter = counter+1
sum = sum+vec
avg = [0]*len(Key)
for i in range(len(Key)):
avg[i] = sum[i]/counter
centroid = Row(avg)
yield Key, centroid
def kmeans_fit(data,k,max_iter):
centers = data.rdd.takeSample(False,k,seed=42)
for i in range(max_iter):
mapped = data.rdd.map(lambda x: mapper(centers,x))
combined = mapped.reduceByKeyLocally(lambda x: combiner(x))
reduced = combined.reduceByKey(lambda x: Reducer(x)).collect()
flag = True
for i in range(k):
if(reduced[i][1] != reduced[i][0] ):
for j in range(k):
centers[i] = reduced[i][1]
flag = False
break
if (flag):
break
return centers
data = spark.read.parquet("/mnt/ddscoursedatabricksstg/ddscoursedatabricksdata/random_data.parquet")
kmeans_fit(data,5,10)
My main issue for the most part is I encounter difficulty in the usage of dataframes and the map, reducebykeylocally and reducebykey fucntions.
Currently the run fails at calling reduceByKeyLocally(lambda x: combiner(x)) because "ValueError: not enough values to unpack (expected 2, got 1)", and i really need to get this all working properly soon, so please, anyone i would love assistance on this, and thank you in advance, I will be very grateful for any help!

Related

MaxValueSelect() function has some errors . any fixes?

My code is:
def maxValueSelection(items,V):
maxval = 0
val = 0
sorted_dict={}
for i in items.items():
sorted_dict[i[1][1]] = [i[1][0],i[0]]
sorted_dict_list = (sorted(sorted_dict))[::-1]
while sorted_dict_list!=[]:
item = sorted_dict_list[0]
if(sorted_dict[item][0] + val<=V):
maxval+=item
val = val+sorted_dict[item][0]
sorted_dict_list.pop(0)
return maxval
items = {1:(4,400),2:(9,1800),3:(10,3500),4:(20,4000),5:(2,1000),6:(1,200)}
V = 20
print(maxValueSelection(items,V))
I have used a greedy algorithm for the question in which I have two values which records the value of the item and one monitors the weight of the items which should not be exceeded more than a threshold value mentioned in the question. It seems like my greedy strategy is working upto some extent but nearly misses the maxValue in every test case. It will be helpful if someone tells me how to fix this issue with my code
If I understood correctly what you need.
this code selects V elements from the dict items, with maximum total value.
from operator import itemgetter, truediv
def maxValueSelection(items,V):
maxval = 0
val = 0
itemlist = sorted([i[1] for i in items.items()],key=itemgetter(1),reverse=True)
for item in itemlist:
if V > 0:
val = item[1]
maxval += val * min(V,item[0])
V -= min(V,item[0])
return maxval
items = {1:(4,400),2:(9,1800),3:(10,3500),4:(20,4000),5:(2,1000),6:(1,200)}
V = 20
print(maxValueSelection(items,V))

Find minimum number of elements in a list that covers specific values

A recruiter wants to form a team with different skills and he wants to pick the minimum number of persons which can cover all the required skills.
N represents number of persons and K is the number of distinct skills that need to be included. list spec_skill = [[1,3],[0,1,2],[0,2,4]] provides information about skills of each person. e.g. person 0 has skills 1 and 3, person 1 has skills 0, 1 and 2 and so on.
The code should outputs the size of the smallest team that recruiter could find (the minimum number of persons) and values indicating the specific IDs of the people to recruit onto the team.
I implemented the code with brute force as below but since some data are more than thousands, it seems I need to be solved with heuristic approaches. In this case it is possible to have approximate answer.
Any suggestion how to solve it with heuristic methods will be appreciated.
N,K = 3,5
spec_skill = [[1,3],[0,1,2],[0,2,4]]
A = list(range(K))
set_a = set(A)
solved = False
for L in range(0, len(spec_skill)+1):
for subset in itertools.combinations(spec_skill, L):
s = set(item for sublist in subset for item in sublist)
if set_a.issubset(s):
print(str(len(subset)) + '\n' + ' '.join([str(spec_skill.index(item)) for item in subset]))
solved = True
break
if solved: break
Here is my way of doing this. There might be potential optimization possibilities in the code, but the base idea should be understandable.
import random
import time
def man_power(lst, K, iterations=None, period=0):
"""
Specify a fixed number of iterations
or a period in seconds to limit the total computation time.
"""
# mapping each sublist into a (sublist, original_index) tuple
lst2 = [(lst[i], i) for i in range(len(lst))]
mini_sample = [0]*(len(lst)+1)
if period<0 or (period == 0 and iterations is None):
raise AttributeError("You must specify iterations or a positive period")
def shuffle_and_pick(lst, iterations):
mini = [0]*len(lst)
for _ in range(iterations):
random.shuffle(lst2)
skillset = set()
chosen_ones = []
idx = 0
fullset = True
# Breaks from the loop when all skillsets are found
while len(skillset) < K:
# No need to go further, we didn't find a better combination
if len(chosen_ones) >= len(mini):
fullset = False
break
before = len(skillset)
skillset.update(lst2[idx][0])
after = len(skillset)
if after > before:
# We append with the orginal index of the sublist
chosen_ones.append(lst2[idx][1])
idx += 1
if fullset:
mini = chosen_ones.copy()
return mini
# Estimates how many iterations we can do in the specified period
if iterations is None:
t0 = time.perf_counter()
mini_sample = shuffle_and_pick(lst, 1)
iterations = int(period / (time.perf_counter() - t0)) - 1
mini_result = shuffle_and_pick(lst, iterations)
if len(mini_sample)<len(mini_result):
return mini_sample, len(mini_sample)
else:
return mini_result, len(mini_result)

Python: Concatenate similiar objects in List

I have a list containing strings as ['Country-Points'].
For example:
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8', ...other countries...]
I want to calculate the average for each country without creating a new list. So the output would be (in the case above):
lst = ['Albania-7.5', 'Andorra-4.25', ...other countries...]
Would realy appreciate if anyone can help me with this.
EDIT:
this is what I've got so far. So, "data" is actually a dictionary, where the keys are countries and the values are list of other countries points' to this country (the one as Key). Again, I'm new at Python so I don't realy know all the built-in functions.
for key in self.data:
lst = []
index = 0
score = 0
cnt = 0
s = str(self.data[key][0]).split("-")[0]
for i in range(len(self.data[key])):
if s in self.data[key][i]:
a = str(self.data[key][i]).split("-")
score += int(float(a[1]))
cnt+=1
index+=1
if i+1 != len(self.data[key]) and not s in self.data[key][i+1]:
lst.append(s + "-" + str(float(score/cnt)))
s = str(self.data[key][index]).split("-")[0]
score = 0
self.data[key] = lst
itertools.groupby with a suitable key function can help:
import itertools
def get_country_name(item):
return item.split('-', 1)[0]
def get_country_value(item):
return float(item.split('-', 1)[1])
def country_avg_grouper(lst) :
for ctry, group in itertools.groupby(lst, key=get_country_name):
values = list(get_country_value(c) for c in group)
avg = sum(values)/len(values)
yield '{country}-{avg}'.format(country=ctry, avg=avg)
lst[:] = country_avg_grouper(lst)
The key here is that I wrote a function to do the change out of place and then I can easily make the substitution happen in place by using slice assignment.
I would probabkly do this with an intermediate dictionary.
def country(s):
return s.split('-')[0]
def value(s):
return float(s.split('-')[1])
def country_average(lst):
country_map = {}|
for point in lst:
c = country(pair)
v = value(pair)
old = country_map.get(c, (0, 0))
country_map[c] = (old[0]+v, old[1]+1)
return ['%s-%f' % (country, sum/count)
for (country, (sum, count)) in country_map.items()]
It tries hard to only traverse the original list only once, at the expense of quite a few tuple allocations.

Compute Higher Moments of Data Matrix

this probably leads to scipy/numpy, but right now I'm happy with any functionality as I couldn't find anything in those packages. I have a matrix that contains data for a multi-variate distribution (let's say, 2, for the fun of it). Is there any function to compute (higher) moments of that? All I could find was numpy.mean() and numpy.cov() :o
Thanks :)
/edit:
So some more detail: I have multivariate data, that is, a matrix where rows display variables and columns observations. Now I would like to have a simple way of computing the joint moments of that data, as defined in http://en.wikipedia.org/wiki/Central_moment#Multivariate_moments .
I'm pretty new to python/scipy so I'm not sure I'd be the best person to code this one up, especially for the n-variables case (note that the wikipedia definition is for n=2), and I kind of expected there to be some out-of-the-box thing to use as I thought this would be a standard problem.
/edit2:
Just for the future, in case someone wants to do something similar, the following code (which is still under review) should give the sample equivalent of the raw moments E(X^2), E(Y^2), etc. It only works for two variables right now, but it should be extendable if one feels the need. If you see some mistakes or unclean/unpython-nish code, feel free to comment.
from numpy import *
# this function should return something as
# moments[0] = 1
# moments[1] = mean(X), mean(Y)
# moments[2] = 1/n*X'X, 1/n*X'Y, 1/n*Y'Y
# moments[3] = mean(X'X'X), mean(X'X'Y), mean(X'Y'Y),
# mean(Y'Y'Y)
# etc
def getRawMoments(data, moment, axis=0):
a = moment
if (axis==0):
n = float(data.shape[1])
X = matrix(data[0,:]).reshape((n,1))
Y = matrix(data[1,:]).reshape((n,1))
else:
n = float(data.shape[0])
X = matrix(data[:,0]).reshape((n,1))
Y = matrix(data[:,1]).reshape((n,11))
result = 1
Z = hstack((X,Y))
iota = ones((1,n))
moments = {}
moments[0] = 1
#first, generate huge-ass matrix containing all x-y combinations
# for every power-combination k,l such that k+l = i
# for all 0 <= i <= a
for i in arange(1,a):
if i==2:
moments[i] = moments[i-1]*Z
# if even, postmultiply with X.
elif i%2 == 1:
moments[i] = kron(moments[i-1], Z.T)
# Else, postmultiply with X.T
elif i%2==0:
temp = moments[i-1]
temp2 = temp[:,0:n]*Z
temp3 = temp[:,n:2*n]*Z
moments[i] = hstack((temp2, temp3))
# since now we have many multiple moments
# such as x**2*y and x*y*x, filter non-distinct elements
momentsDistinct = {}
momentsDistinct[0] = 1
for i in arange(1,a):
if i%2 == 0:
data = 1/n*moments[i]
elif i == 1:
temp = moments[i]
temp2 = temp[:,0:n]*iota.T
data = 1/n*hstack((temp2))
else:
temp = moments[i]
temp2 = temp[:,0:n]*iota.T
temp3 = temp[:,n:2*n]*iota.T
data = 1/n*hstack((temp2, temp3))
momentsDistinct[i] = unique(data.flat)
return momentsDistinct(result, axis=1)

How to write this program into a for loop?

I'm trying to learn how to change this program into a for loop for the sake of knowing both ways
def Diff(a_list):
num = enumerate(max(x) - min(x) for x in a_list)
return max(x[::-1] for x in num)
I want it to be something like
def Diff(x):
for a in x
if it helps the program is intended to return the row that has the smallest sum of the elements inside it so like [[1,2,3,4],[-500],[10,20]] would be 1.
I do not understand why you use this name for your function, it does something else (as far as I understand). It searches for the inner-list inside a list for which the difference between min and max, the span, are maximal and the n returns a tuple (span, idx), idx being the index within the outer loop.
When you want to have the same as a loop, try:
def minRow_loop(a_list):
rv = (0,0)
for idx, row in enumerate(a_list):
span = max(row) - min(row)
span_and_idx = (span, idx)
if span_and_idx > rv:
rv = span_and_idx
return rv
But your code doesn't do what it'S intended to do, so I created two correct versions, once with and once without a loop.
import random
random.seed(12346)
def minRow(a_list):
num = enumerate(max(x) - min(x) for x in a_list)
return max(x[::-1] for x in num)
def minRow_loop(a_list):
rv = (0,0)
for idx, row in enumerate(a_list):
span = max(row) - min(row)
span_and_idx = (span, idx)
if span_and_idx > rv:
rv = span_and_idx
return rv
def minRow_correct(a_list):
return min(enumerate([sum(l) for l in a_list]),
key=lambda (idx, val): val)[0]
def minRow_correct_loop(a_list):
min_idx = 0
min_sum = 10e50
for idx, list_ in enumerate(a_list):
sum_ = sum(list_)
if sum_<min_sum:
min_idx = idx
min_sum = sum
return min_idx
li = [[random.random() for i in range(2)] for j in range(3)]
from pprint import pprint
print "Input:"
pprint(li)
print "\nWrong versions"
print minRow(li)
print minRow_loop(li)
which prints:
Input:
[[0.46318380478657073, 0.7396007585882016],
[0.38778699106140135, 0.7078233515518557],
[0.7453097328344933, 0.23853757442660117]]
Wrong versions
(0.5067721584078921, 2)
(0.5067721584078921, 2)
Corrected versions
2
2
What you want can actually be done in two lines of code:
# Let's take the list from your example
lst = [[1,2,3,4],[-500],[10,20]]
# Create a new list holding the sums of each sublist using a list comprehension
sums = [sum(sublst) for sublst in lst]
# Get the index of the smallest element
sums.index(min(sums)) # Returns: 1
if you're looking for minimum sum, just go through every row and keep track of the smallest:
def minRow(theList):
foundIndex = 0 # assume first element is the answer for now.
minimumSum = sum(theList[0])
for index, row in enumerate(theList):
if sum(row) < minimumSum:
foundIndex = index
minimumSum = sum(row) # you don't have to sum() twice, but it looks cleaner
return foundIndex
If your looking for greatest range (like the first Diff() function), it'd be similar. You'd keep track of the greatest range and return its index.
Thorsten's answer is very complete. But since I finished this anyway, I'm submitting my "dumbed down" version in case it helps you understand.

Categories

Resources