I am following the Machine Learning with Spark book and converting its Python code to Scala. I use a Beaker notebook to share variables so that I can pass values to Python and plot them with matplotlib, as the book describes. I have been able to convert most of the code so far, but I am having trouble converting the try-catch used for data cleansing on the u.item dataset. The code below ends in an infinite loop, with no clear indication of what the error is.
val movieData = sc.textFile("/Users/minHenry/workspace/ml-100k/u.item")
val movieDataSplit = movieData.first()
val numMovies = movieData.count()
def convertYear(x:String):Int = x.takeRight(4) match {
case x => x.takeRight(4).toInt
case _ => 1900
}
val movieFields = movieData.map(lines => lines.split('|'))
print(movieData.first())
val years1 = movieFields.map(fields => fields(2))
val years = movieFields.map(fields => fields(2).map(x=>convertYear(x.toString())))
val filteredYears = years.filter(x => x!=1900)
years.take(2).foreach(println)
I suspect my problem is with my pattern match, but I am not exactly sure what's wrong with it. I think takeRight() works, because the compiler doesn't complain about the type it is applied to.
UPDATE
I have updated the code as follows, per advice from the answer provided thus far:
import scala.util.Try
val movieData = sc.textFile("/Users/minHenry/workspace/ml-100k/u.item")
def convertYear(x:String):Int = Try(x.toInt).getOrElse(1900)
val movieFields = movieData.map(lines => lines.split('|'))
val preYears = movieFields.map(fields => fields(2))
val years = preYears.map(x => x.takeRight(4))//.map(x=>convertYear(x))
println("=======> years")
years.take(2).foreach(println) //--output = 1995/n1995
println("=======> filteredYears")
val filteredYears = years.filter(x => x!=1900)
filteredYears.take(2).foreach(println)
//val movieAges = filteredYears.map(yr => (1998-yr)).countByValue()
I commented out the map following takeRight(4) because it's easier to comment out than writing x => convertYear(x.takeRight(4)), and it should produce the same output. When I apply the convertYear() function I still end up in an infinite loop. The values print as expected in the few print statements shown. The problem is that if I cannot remove the data points that cannot be converted to Int, I am unable to run the countByValue() call in the last line.
Here is the link to my public beaker notebook for more context:
https://pub.beakernotebook.com/#/publications/56eed31d-85ad-4728-a45d-14b3b08d673f
movieData: RDD[String]
movieFields: RDD[Array[String]]
years1: RDD[String]
In val years = movieFields.map(fields => fields(2).map(x=>convertYear(x.toString()))), fields(2) is a String, so the inner map iterates over its characters and x is a Char, because a String is treated as a Seq[Char]. Every input to convertYear(x: String) is therefore a one-character string.
Your error is that convertYear(x.toString()) hides a type incompatibility. That is an alarm bell. Always rely on the type system in Scala; don't hide a problem with toString(), isInstanceOf, or something similar. Then the compiler reports the error before the code runs.
P.S.
The second call to takeRight is redundant.
def convertYear(x:String):Int = x.takeRight(4) match {
case x => x.takeRight(4).toInt
case _ => 1900
}
Pattern matching is about checking types or conditions (with an if guard). Your first case doesn't check anything, so every input goes to x.takeRight(4).toInt and case _ is never reached. There is also no protection against the exception that toInt can throw.
Instead, use def convertYear(x: String): Int = Try(x.toInt).getOrElse(1900).
Update
scala> import scala.util.Try
import scala.util.Try
scala> def convertYear(x:String):Int = Try(x.toInt).getOrElse(1900)
convertYear: (x: String)Int
scala> List("sdsdf", "1989", "2009", "1945", "asdf", "455")
res0: List[String] = List(sdsdf, 1989, 2009, 1945, asdf, 455)
scala> res0.map(convertYear)
res1: List[Int] = List(1900, 1989, 2009, 1945, 1900, 455)
It works the same way with an RDD, because an RDD is a functor just like List.
val filteredYears = years.filter(x => x!=1900) won't work as you expect. x is a String, not an Int, and Scala doesn't implicitly convert types for comparison, so the predicate is always true.
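For comparison, here is a minimal Python sketch of the clean-then-filter pattern that the question is porting. The name convert_year and the sample release_dates list are illustrative assumptions, not taken from the book or the dataset; the point is only that the conversion must produce integers before the comparison with 1900 means anything.

```python
def convert_year(date_field):
    """Return the trailing four characters as an int, or 1900 if that fails."""
    try:
        return int(date_field[-4:])
    except ValueError:
        # Sentinel for missing or malformed dates, filtered out below.
        return 1900

# Hypothetical sample of release-date fields, including one empty value.
release_dates = ["01-Jan-1995", "", "14-Feb-1996"]

years = [convert_year(d) for d in release_dates]
# The comparison is int against int, so the sentinel is actually removed.
filtered_years = [y for y in years if y != 1900]

print(years)
print(filtered_years)
```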
I'm trying to call a Go function from Python using c-shared (.so) file. In my python code I'm calling the function like this:
website = "https://draftss.com"
domain = "draftss.com"
website_ip = "23.xxx.xxx.xxx"
website_tech_finder_lib = cdll.LoadLibrary("website_tech_finder/builds/websiteTechFinder.so")
result_json_string: str = website_tech_finder_lib.FetchAllData(website, domain, website_ip)
On Go side I'm converting the strings to Go strings based on this SO post (out of memory panic while accessing a function from a shared library):
func FetchAllData(w *C.char, d *C.char, dIP *C.char) *C.char {
var website string = C.GoString(w)
var domain string = C.GoString(d)
var domainIP string = C.GoString(dIP)
fmt.Println(website)
fmt.Println(domain)
fmt.Println(domainIP)
.... // Rest of the code
}
website, domain, and domainIP each contain only the first character of the string that I passed:
fmt.Println(website) // -> h
fmt.Println(domain) // -> d
fmt.Println(domainIP) // -> 2
I'm a bit new to Go, so I'm not sure if I'm doing something stupid here. How do I get the full string that I passed?
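You need to pass the parameters as UTF-8 encoded bytes. In Python 3, ctypes passes a str as a wide-character (wchar_t *) string, so C.GoString stops at the first zero byte, which for ASCII text typically comes right after the first character. A bytes object is passed as a char *.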
website = "https://draftss.com".encode('utf-8')
domain = "draftss.com".encode('utf-8')
website_ip = "23.xxx.xxx.xxx".encode('utf-8')
website_tech_finder_lib = cdll.LoadLibrary("website_tech_finder/builds/websiteTechFinder.so")
result_json_string: str = website_tech_finder_lib.FetchAllData(website, domain, website_ip)
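The effect can be reproduced without the Go library. The sketch below is only an illustration and assumes a little-endian machine: it lays the same text out in memory as wide characters and as UTF-8 bytes, then reads each buffer up to the first zero byte, which is how a char *-based reader such as C.GoString sees it.

```python
import ctypes

url = "https://draftss.com"

# The text stored as wide characters (wchar_t), which is how ctypes passes a str.
wide = ctypes.create_unicode_buffer(url)
# Read it as a plain C string: everything up to the first zero byte.
seen_as_wide = ctypes.string_at(ctypes.addressof(wide))

# The same text encoded to UTF-8 is an ordinary NUL-terminated char string.
narrow = ctypes.create_string_buffer(url.encode("utf-8"))
seen_as_bytes = ctypes.string_at(ctypes.addressof(narrow))

print(seen_as_wide)   # only the first character on a little-endian machine
print(seen_as_bytes)  # the full URL
```

Since the Go function returns *C.char, it is probably also worth setting FetchAllData.restype to ctypes.c_char_p on the Python side; otherwise ctypes treats the return value as an int.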
I'm using a MATLAB Function block in Simulink to call a Python script that does some calculations on the input values. The Python script returns a string to the MATLAB function, which I split into an array. The split string always has to be a cell array of 6 strings:
dataStringArray = '[[-5.01 0.09785429][-8.01 0.01284927]...' '10.0' '20.0' '80.0' '80.0' '50.0'
To call functions like strsplit, or the Python script itself through a specific m-file, I'm using the coder.extrinsic('*') method.
Now I want to index a specific value, for example dataStringArray(3) to get '20.0', and define it as an output value of the MATLAB function, but this doesn't work! I tried to predefine dataStringArray with dataStringArray = cell(1,6); but I always get the same 4 errors:
Subscripting into an mxArray is not supported.
Function 'MATLAB Function' (#23.1671.1689), line 42, column 24:
"dataStringArray(3)"
2x Errors occurred during parsing of MATLAB function 'MATLAB Function'
Error in port widths or dimensions. Output port 1 of 's_function_Matlab/MATLAB Function/constIn5' is a one dimensional vector with 1 elements.
What am I doing wrong?
SAMPLE CODE
The commented-out code after each output definition is what I need:
function [dataArrayOutput, constOut1, constOut2, constOut3, constOut4, constOut5] = fcn(dataArrayInput, constIn1, constIn2, constIn3, constIn4, constIn5)
coder.extrinsic('strsplit');
% Python-Script String Output
pythonScriptOutputString = '[[-5.01 0.088068861]; [-4.96 0.0]]|10.0|20.0|80.0|80.0|50.0';
dataStringArray = strsplit(pythonScriptOutputString, '|');
% Outputs
dataArrayOutput = dataArrayInput; % str2num(char((dataStringArray(1))));
constOut1 = constIn1; % str2double(dataStringArray(2));
constOut2 = constIn2; % str2double(dataStringArray(3));
constOut3 = constIn3; % str2double(dataStringArray(4));
constOut4 = constIn4; % str2double(dataStringArray(5));
constOut5 = constIn5; % str2double(dataStringArray(6));
SOLUTION 1
Cell arrays are not supported in MATLAB Function blocks; only the native Simulink data types are possible.
A workaround is to put the whole code in a normal function and call it from the MATLAB Function block, declaring it with coder.extrinsic. It's important to initialize the output variables with a known type and size before calling the extrinsic function.
SOLUTION 2
Another solution is to use the strfind function, which returns a double matrix with the positions of the delimiter character. With those positions, you can extract just the range of characters you need. In this case, all of your code stays inside the MATLAB Function block.
function [dataArrayOutput, constOut1, constOut2, constOut3, constOut4, constOut5] = fcn(dataArrayInput, constIn1, constIn2, constIn3, constIn4, constIn5)
coder.extrinsic('strsplit', 'str2num');
% Python-Script String Output
pythonScriptOutputString = '[[-5.01 0.088068861]; [-4.96 0.0]; [-1.01 7.088068861]]|10.0|20.0|80.0|80.0|50.0';
dataStringArray = strfind(pythonScriptOutputString,'|');
% preallocate
dataArrayOutput = zeros(3, 2);
constOut1 = 0;
constOut2 = 0;
constOut3 = 0;
constOut4 = 0;
constOut5 = 0;
% Outputs
dataArrayOutput = str2num(pythonScriptOutputString(1:dataStringArray(1)-1));
constOut1 = str2num(pythonScriptOutputString(dataStringArray(1)+1:dataStringArray(2)-1));
constOut2 = str2num(pythonScriptOutputString(dataStringArray(2)+1:dataStringArray(3)-1));
constOut3 = str2num(pythonScriptOutputString(dataStringArray(3)+1:dataStringArray(4)-1));
constOut4 = str2num(pythonScriptOutputString(dataStringArray(4)+1:dataStringArray(5)-1));
constOut5 = str2num(pythonScriptOutputString(dataStringArray(5)+1:end));
When using an extrinsic function, the data type returned is of mxArray, which you cannot index into as the error message suggests. To work around this problem, you first need to initialise the variable(s) of interest to cast them to the right data type (e.g. double). See Working with mxArrays in the documentation for examples of how to do that.
The second part of the error message is about dimensions. Without seeing the code of the function, the Simulink model, and how the inputs/outputs of the function are defined, it's difficult to tell what's going on, but you need to make sure you have the correct size and data type defined in the Ports and Data Manager.
I have a Python program which performs calculations using nested dictionaries. The problem is that if someone enters a value that is not in one of the dictionaries, it won't work. I could force the user to choose from the existing values, but I'd rather perform interpolation to get the 'expected' value. However, I cannot figure out how to unpack these dictionaries, get them ordered, and perform the interpolation.
Any help would be greatly appreciated. My code is below.
Dictionaries like this:
from decimal import *
pga_values = {
"tee": {
100:2.92, 120:2.99, 140:2.97, 160:2.99, 180:3.05, 200:3.12, 240:3.25, 260:3.45, 280:3.65,
300:3.71, 320:3.79, 340:3.86, 360:3.92, 380:3.96, 400:3.99, 420:4.02, 440:4.08, 460:4.17,
480:4.28, 500:4.41, 520:4.54, 540:4.65, 560:4.74, 580:4.79, 600:4.82
},
"fairway": {
5:2.10,10:2.18,20:2.40,30:2.52,40:2.60,50:2.66,60:2.70,70:2.72,80:2.75,
ETC... (edited to be concise)
lie_types = set(pga_values.keys())
user_preshot_lie = input("What was your pre-shot lie type?")
user_preshot_distance_to_hole = Decimal(input('How far away from the hole were you before your shot?'))
user_postshot_lie = input("What was your post-shot lie type?")
user_postshot_distance_to_hole = Decimal(input('How far away from the hole were you?'))
assert user_preshot_lie in lie_types
assert user_postshot_lie in lie_types
preshot_pga_tour_shots_to_hole_out = pga_values[user_preshot_lie][user_preshot_distance_to_hole]
postshot_pga_tour_shots_to_hole_out = pga_values[user_postshot_lie][user_postshot_distance_to_hole]
user_strokes_gained = Decimal((preshot_pga_tour_shots_to_hole_out - postshot_pga_tour_shots_to_hole_out)-1)
print(user_strokes_gained)
To isolate the problem a bit, given e.g.:
tee = {
100:2.92, 120:2.99, 140:2.97, 160:2.99, 180:3.05, 200:3.12, 240:3.25, 260:3.45, 280:3.65,
300:3.71, 320:3.79, 340:3.86, 360:3.92, 380:3.96, 400:3.99, 420:4.02, 440:4.08, 460:4.17,
480:4.28, 500:4.41, 520:4.54, 540:4.65, 560:4.74, 580:4.79, 600:4.82
}
you could have:
import bisect
teekeys = sorted(tee)
def lookup(aval):
    where = bisect.bisect_left(teekeys, aval)
    lo = teekeys[where-1]
    hi = teekeys[where]
    if lo==hi: return tee[lo]
    delta = float(aval-lo)/(hi-lo)
    return delta*tee[hi] + (1-delta)*tee[lo]
So for example:
print(lookup(110))
2.955
print(lookup(530))
4.595
Not sure what you want to do if the value is <min(tee) or >max(tee) -- is raising an exception OK in such anomalous cases?
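If raising an exception is acceptable for out-of-range input, the lookup could be sketched as below. This is one possible variant under stated assumptions, not the only reasonable behaviour: the function name interpolate is made up for illustration, the table is any dictionary of distance to strokes such as pga_values["tee"], and the distance is assumed to be a float or int (a Decimal would need converting first, since Decimal and float do not mix in arithmetic).

```python
import bisect

def interpolate(table, distance):
    """Linearly interpolate table (a dict of distance -> strokes) at distance."""
    keys = sorted(table)
    if not keys[0] <= distance <= keys[-1]:
        # Assumption: refuse to extrapolate beyond the tabulated range.
        raise ValueError("distance %r is outside %r..%r" % (distance, keys[0], keys[-1]))
    where = bisect.bisect_left(keys, distance)
    hi = keys[where]
    if hi == distance:
        # Exact hit: no interpolation needed.
        return table[hi]
    lo = keys[where - 1]
    fraction = (distance - lo) / float(hi - lo)
    return table[lo] + fraction * (table[hi] - table[lo])

# A shortened copy of the tee table, for illustration only.
tee_sample = {100: 2.92, 120: 2.99, 140: 2.97}
print(interpolate(tee_sample, 110))
```

With the nested dictionary from the question, the call would look like interpolate(pga_values[user_preshot_lie], float(user_preshot_distance_to_hole)).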
In Machine learning in action Chapter 2, one example reads records from file, each line like:
124 110 223 largeDoses
(forget its actual meaning)
One function in kNN.py is:
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines,3))
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector
The problem is that listFromLine[-1] is a string ('largeDoses', etc.). How can it be converted to an int?
In the book, it says numpy can handle this.
(From the book : You have to explicitly tell the interpreter that you’d like the integer version of the last item in the list, or it will give you the string version. Usually, you’d have to do this, but NumPy takes care of those details for you.)
However,
ValueError: invalid literal for int() with base 10: 'largeDoses'
occurs for
import kNN
kNN.file2matrix('dataset.txt')
BTW, the book's Chinese version is different from English Version.
A string like this (indeed) cannot be converted to an int, neither in Python nor in any other environment.
However, the solution is to put Machine Learning (indeed) in action.
If all kNN input training / cross-validation records (a.k.a. observations, examples) conform to the convention of [3x FEATURE, 1x LABEL], use:
classLabelVector.append( listFromLine[-1] ) # to .append a LABEL, not an int()
You should convert 'largeDoses', 'smallDoses', and 'didntLike' to numbers by hand. A string cannot be converted to an int unless its content is an integer literal.
if listFromLine[-1] == 'largeDoses':
    listFromLine[-1] = '3'
elif listFromLine[-1] == 'smallDoses':
    listFromLine[-1] = '2'
else:
    listFromLine[-1] = '1'
Rather than converting the string directly to an integer, look it up in a table. The modified program is as follows.
labels = {'didntLike':1,'smallDoses':2,'largeDoses':3}
classLabelVector.append(labels[listFromLine[-1]])
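Put together, the table lookup could be used as in the sketch below. It is a simplified stand-in for file2matrix, not the book's code: it takes an iterable of lines instead of a filename, returns plain lists rather than a NumPy matrix, and the name parse_records and the sample lines are made up for illustration.

```python
# Mapping from the text labels in the data file to integer class labels.
LABELS = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}

def parse_records(lines):
    """Split tab-separated records into feature rows and integer labels."""
    features, class_labels = [], []
    for line in lines:
        fields = line.strip().split("\t")
        # The first three columns are numeric features.
        features.append([float(x) for x in fields[0:3]])
        # The last column is a text label; an unknown label raises KeyError.
        class_labels.append(LABELS[fields[-1]])
    return features, class_labels

# Hypothetical sample records in the same layout as the question's example.
sample = ["124\t110\t223\tlargeDoses\n", "14\t1.5\t0.9\tdidntLike\n"]
features, class_labels = parse_records(sample)
print(features)
print(class_labels)
```

Letting an unknown label raise KeyError is a deliberate choice here: it surfaces unexpected values in the file instead of silently assigning them to a class.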
I'm in need of a function returning only the significant part of a value with respect to a given error. Meaning something like this:
def (value, error):
""" This function takes a value and determines its significant
accuracy by its error.
It returns only the scientific important part of a value and drops the rest. """
magic magic magic....
return formated value as String.
What I have written so far to show what I mean:
import numpy as np
def signigicant(value, error):
    """ Returns a number in a scientific format. Meaning a value has an error
    and that error determines how many digits of the
    value are significant. e.g. value = 12.345MHz,
    error = 0.1MHz => 12.3MHz because the error is at the first digit.
    (in reality drop the MHz, it's just to show why.)"""
    xx = "%E"%error # I assume this is most ineffective.
    xx = xx.split("E")
    xx = int(xx[1])
    if error <= value: # this should be the normal case
        yy = np.around(value, -xx)
        if xx >= 0: # Error is 1 or bigger
            return "%i"%yy
        else: # Error is smaller than 1
            string = "%."+str(-xx) +"f"
            return string%yy
    if error > value: # This should not be usual but it can happen.
        return "%g"%value
What I don't want is a function like NumPy's around or the built-in round. Those functions take a value and need to be told which part of the value is important. The point is that in general I don't know how many digits are significant. It depends on the size of the error of that value.
Another example:
value = 123, error = 12, => 120
One can drop the 3, because the error is of the order of 10. However, this behaviour is not so important, because some people still write 123 for the value. Here it is okay but not perfectly right.
For big numbers the "g" string operator is a usable choice, but not always what I need. For example,
if the error is bigger than the value (which happens e.g. when someone wants to measure something that does not exist):
value = 10, error = 100
I still wish to keep the 10 as the value because I don't know it any better. The function should then return 10 and not 0.
The code I have written does work more or less, but it's clearly not efficient or elegant in any way. I also assume this question concerns hundreds of people, because every scientist has to format numbers this way. So I'm sure there is a ready-to-use solution somewhere, but I haven't found it yet.
Probably my Google skills aren't good enough, but I wasn't able to find a solution to this in two days, so now I am asking here.
For testing my code I used the following, but more is needed.
errors = [0.2,1.123,1.0, 123123.1233215,0.123123123768]
values = [12.3453,123123321.4321432, 0.000321 ,321321.986123612361236,0.00001233214 ]
for value, error in zip(values, errors):
    print("Test Value: ", value, "Error:", error)
    print("Result: ", signigicant(value, error))
import math
def round_on_error(value, error):
    significant_digits = 10**math.floor(math.log(error, 10))
    return value // significant_digits * significant_digits
Example:
>>> errors = [0.2,1.123,1.0, 123123.1233215,0.123123123768]
>>> values = [12.3453,123123321.4321432, 0.000321 ,321321.986123612361236,0.00001233214 ]
>>> map(round_on_error, values, errors)
[12.3, 123123321.0, 0.0, 300000.0, 0.0]
And if you want to keep a value that is smaller than its error:
def round_on_error(value, error):
    # Return the value unchanged when it is smaller than its error
    if value < error:
        return value
    significant_digits = 10**math.floor(math.log(error, 10))
    return value // significant_digits * significant_digits
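Since the question asks for a formatted string, the same idea can be sketched as a formatter. This is one possible reading of the requirements rather than a standard routine: the name format_with_error is made up, it rounds to the nearest step instead of flooring as round_on_error does, and it falls back to "%g" when the error is not positive or is larger than the value.

```python
import math

def format_with_error(value, error):
    """Format value, keeping only the digits that are significant given error."""
    if error <= 0 or abs(value) < error:
        # Assumption: with no usable error, or an error larger than the value,
        # keep the value as it is.
        return "%g" % value
    # Decimal position of the leading digit of the error, e.g. -1 for 0.2.
    exponent = math.floor(math.log10(error))
    rounded = round(value, -exponent)
    # Print decimals only when the error is smaller than 1.
    decimals = max(0, -exponent)
    return "%.*f" % (decimals, rounded)

print(format_with_error(12.3453, 0.2))
print(format_with_error(123, 12))
print(format_with_error(10, 100))
```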