interpolation of data in dictionary python 3 - python

I have a Python program which performs calculations using nested dictionaries. The problem is that if someone enters a value that is not in one of the dictionaries, it won't work. I could force the user to choose from the existing values, but I'd rather perform interpolation to get the 'expected' value. However, I cannot figure out how to unpack these dictionaries, get them ordered, and perform the interpolation.
Any help would be greatly appreciated. My code is below.
Dictionaries like this:
from decimal import *
pga_values = {
    "tee": {
        100:2.92, 120:2.99, 140:2.97, 160:2.99, 180:3.05, 200:3.12, 240:3.25, 260:3.45, 280:3.65,
        300:3.71, 320:3.79, 340:3.86, 360:3.92, 380:3.96, 400:3.99, 420:4.02, 440:4.08, 460:4.17,
        480:4.28, 500:4.41, 520:4.54, 540:4.65, 560:4.74, 580:4.79, 600:4.82
    },
    "fairway": {
        5:2.10,10:2.18,20:2.40,30:2.52,40:2.60,50:2.66,60:2.70,70:2.72,80:2.75,
        ETC... (edited to be concise)
lie_types = set(pga_values.keys())
user_preshot_lie = input("What was your pre-shot lie type?")
user_preshot_distance_to_hole = Decimal(input('How far away from the hole were you before your shot?'))
user_postshot_lie = input("What was your post-shot lie type?")
user_postshot_distance_to_hole = Decimal(input('How far away from the hole were you?'))
assert user_preshot_lie in lie_types
assert user_postshot_lie in lie_types
preshot_pga_tour_shots_to_hole_out = pga_values[user_preshot_lie][user_preshot_distance_to_hole]
postshot_pga_tour_shots_to_hole_out = pga_values[user_postshot_lie][user_postshot_distance_to_hole]
user_strokes_gained = Decimal((preshot_pga_tour_shots_to_hole_out - postshot_pga_tour_shots_to_hole_out)-1)
print(user_strokes_gained)

To isolate the problem a bit, given e.g.:
tee = {
100:2.92, 120:2.99, 140:2.97, 160:2.99, 180:3.05, 200:3.12, 240:3.25, 260:3.45, 280:3.65,
300:3.71, 320:3.79, 340:3.86, 360:3.92, 380:3.96, 400:3.99, 420:4.02, 440:4.08, 460:4.17,
480:4.28, 500:4.41, 520:4.54, 540:4.65, 560:4.74, 580:4.79, 600:4.82
}
you could have...:
import bisect
teekeys = sorted(tee)
def lookup(aval):
    where = bisect.bisect_left(teekeys, aval)
    lo = teekeys[where-1]
    hi = teekeys[where]
    if lo==hi: return tee[lo]
    delta = float(aval-lo)/(hi-lo)
    return delta*tee[hi] + (1-delta)*tee[lo]
So for example:
print(lookup(110))
2.955
print(lookup(530))
4.595
Not sure what you want to do if the value is <min(tee) or >max(tee) -- is raising an exception OK in such anomalous cases?
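If raising an exception is acceptable for out-of-range input, the lookup could be guarded as in the sketch below. This is only one possible policy, not something the question settles. The name lookup_checked and the shortened table are assumptions made for illustration.

```python
import bisect

# Small excerpt of the "tee" table from the question.
tee = {100: 2.92, 120: 2.99, 140: 2.97, 160: 2.99}
tee_keys = sorted(tee)

def lookup_checked(distance):
    """Linearly interpolate; raise ValueError outside the table (an assumed policy)."""
    if not tee_keys[0] <= distance <= tee_keys[-1]:
        raise ValueError(f"{distance} is outside [{tee_keys[0]}, {tee_keys[-1]}]")
    if distance in tee:
        # Exact table entry: no interpolation needed.
        return tee[distance]
    where = bisect.bisect_left(tee_keys, distance)
    lo, hi = tee_keys[where - 1], tee_keys[where]
    # float() keeps the arithmetic working if the input is a Decimal.
    fraction = float(distance - lo) / (hi - lo)
    return fraction * tee[hi] + (1 - fraction) * tee[lo]
```

Clamping to the nearest end of the table would be an alternative to raising, depending on what makes sense for the data.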

Related

Python: Data-structure and processing of GPS points and properties

I'm trying to read data from a CSV and then process it in different ways. (For starters, just the average.)
Data
(OneDrive) https://1drv.ms/u/s!ArLDiUd-U5dtg0teQoKGguBA1qt9?e=6wlpko
The data looks like this:
ID; Property1; Property2; Property3...
1; ....
1; ...
1; ...
2; ...
2; ...
3; ...
...
Every line is a GPS point. All points with the same ID (for example 1) together form one route. The routes are not all the same length and some IDs are skipped, so the IDs do not increase seamlessly.
I should add that the points are ALWAYS the same number of meters apart from each other, and I don't currently need the XY information.
Wanted Result
In the end I want something like this:
[ID, AVG_Property1, AVG_Property2...]
[1, 1.00595, 2.9595, ...]
[2, 1.50606, 1.5959, ...]
What I got so far
import os
import numpy
import pandas as pd
data = pd.read_csv(os.path.join('C:\\data' ,'data.csv'), sep=';')
# [id, len, prop1, prop2, ...]
routes = numpy.zeros((data.size, 10)) # 10 properties
sums = numpy.zeros(8)
nr_of_entries = 0
current_id = 1
for index, row in data.iterrows():
    if(int(row['id']) != current_id): #after the last point of the route
        routes[current_id-1][0] = current_id
        routes[current_id-1][1] = nr_of_entries #how many points are in this route?
        routes[current_id-1][2] = sums[0] / nr_of_entries
        routes[current_id-1][3] = sums[1] / nr_of_entries
        routes[current_id-1][4] = sums[2] / nr_of_entries
        routes[current_id-1][5] = sums[3] / nr_of_entries
        routes[current_id-1][6] = sums[4] / nr_of_entries
        routes[current_id-1][7] = sums[5] / nr_of_entries
        routes[current_id-1][8] = sums[6] / nr_of_entries
        routes[current_id-1][9] = sums[7] / nr_of_entries
        current_id = int(row['id'])
        sums = numpy.zeros(8)
        nr_of_entries = 0
    sums[0] += row[3]
    sums[1] += row[4]
    sums[2] += row[5]
    sums[3] += row[6]
    sums[4] += row[7]
    sums[5] += row[8]
    sums[6] += row[9]
    sums[7] += row[10]
    nr_of_entries = nr_of_entries + 1
routes
My problem
1.) The way I did it, I have to copy and paste the same code for every other processing approach, since, as stated, I need to process the data in multiple different ways. The average is just an example.
2.) The reading of the data is clumsy and fails when IDs are missing.
3.) I'm a C# developer, so my approach would be to create a class 'Route' which holds all the points and then provides methods such as 'calculate average for prop 1'. This way I could also tweak the data if needed (extreme values, for example). But I have no idea how this would be done in Python, or whether this is a reasonable approach in this language.
4.) Is there a more elegant way to iterate through the original CSV, getting Route ID 1, then Route ID 2, and so on? Maybe something like LINQ queries in C#?
Thanks for any help.
Here is a solution and some ideas you can use. The example features multiple options for the same issue, so you have to choose which fits the purpose best. Also, it is written for Python 3.7; you didn't specify a version, so I hope this works.
import pandas as pd

class Route(object):
    """description of class"""
    def __init__(self, id, rawdata): # on startup
        self.id = id
        self.rawdata = rawdata
        self.avg_Prop1 = self.calculate_average('Prop1')
        self.sum_Prop4 = None
    def calculate_average(self, Prop_Name): #selfreference for first argument in class method
        return self.rawdata[Prop_Name].mean()
    def give_Prop_data(self, Prop_Name): #return the Propdata as list
        return self.rawdata[Prop_Name].tolist()
    def any_function(self, my_function, Prop_Name): #not sure what dataframes support so turning it into a list first
        return my_function(self.rawdata[Prop_Name].tolist())
#end of class definition
data = pd.read_csv('testdata.csv', sep=';')
# [id, len, prop1, prop2, ...]
route_list = [] #List of all the objects created from the route class
for i in data.id.unique():
    print('Current id:', i,' with ',len(data[data['id']==i]),'entries')
    route_list.append(Route(i,data[data['id']==i]))
#created the Prop1 average in initialization of route so just accessing attribute
print(route_list[1].avg_Prop1)
for current_route in route_list:
    print('Route ',current_route.id , ' Properties :')
    for i in current_route.rawdata.columns[1:]: #for all except the first (id)
        print(i, ' has average ', current_route.calculate_average(i)) #i is the string of the column not just an id
#or pass any function that you want
route_list[1].sum_Prop4 = (route_list[1].any_function(sum,'Prop4'))
print(route_list[1].sum_Prop4)
#which is equivalent to
print(sum(route_list[1].rawdata['Prop4']))
To address your individual problems out of order:
For 2. and 4.) Looping only over the existing IDs (data.id.unique()) solves the problem. I have no idea what LINQ queries are, but I assume they are similar. In general, Python has a great way of looping over objects (like for current_route in route_list), which is worth looking into if you want to use it a little more.
For 1. and 3.) Again, looping solves the issue. I created a class in the example, mostly to show the syntax for classes. The benefits and drawbacks of using classes should be the same in Python as in C#.
As it is right now the class probably isn't great, but this depends on how you want to use it. If the class should just be a practical way of storing and accessing data, it shouldn't have the methods, because you don't need an individual average method for each route. Then you can just access its data and use it in a function, as in sum(route_list[1].rawdata['Prop4']). If, however, different calculations are necessary depending on the data (the number of rows, for example), it might come in handy to use the method calculate_average and differentiate in there.
Another example would be the use of the attributes. If you need the average for Prop1 every time, creating it at initialization seems like a good idea; otherwise I wouldn't bother always calculating it.
I hope this helps!
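For comparison, the grouping itself can also be done without a class. The sketch below uses only the standard library and assumes a ';'-separated file with an id column followed by numeric property columns. The sample data, the column names, and the function name aggregate_routes are made up for illustration. pandas users may find that data.groupby('id').mean() expresses the same idea in one call.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample in the question's layout: one GPS point per row.
SAMPLE_CSV = """id;Prop1;Prop2
1;1.0;4.0
1;3.0;6.0
3;10.0;20.0
"""

def aggregate_routes(csv_text, func=mean):
    """Apply func to every property column, grouped by route id."""
    columns_by_route = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=";"):
        route_id = int(row.pop("id"))
        for name, value in row.items():
            columns_by_route[route_id][name].append(float(value))
    return {
        route_id: {name: func(values) for name, values in columns.items()}
        for route_id, columns in columns_by_route.items()
    }

averages = aggregate_routes(SAMPLE_CSV)          # skipped ids (here 2) are no problem
maxima = aggregate_routes(SAMPLE_CSV, func=max)  # any other aggregation plugs in the same way
```

Passing the aggregation as a parameter avoids copying the loop for every new kind of processing.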

Rpy2: set an R formula from Python

I am a little confused by R's formula syntax.
I created the following python function with Rpy2:
robjects.r('''
project_var <- function(grid,points) {
    coordinates(points) = ~X + Y
    gridded(grid) = ~X+Y
    grid = idw(Z~1, points,grid)
    grid <- as.data.frame(grid)
    return(grid)
}
''')
Then I import it
project_var = robjects.globalenv['project_var']
Then I call it:
test = project_var(model,points_top)
And it works as expected!
I would like 'Z' to be set by an argument of my function, something like this:
project_var <- function(grid,points,feature_name) {
...
grid = idw(feature_name~1, points,grid)
My Problem :
idw(feature_name~1, points,grid)
I do not really understand this line, or what feature_name really is (it is neither a string nor a known variable at this point, but the name of a column used in a formula).
For info, idw comes from the gstat library, and I do not know R.
Here is the doc:
idw.locations(formula, locations, data, newdata, nmax = Inf, nmin = 0,
omax = 0, maxdist = Inf, block, na.action = na.pass, idp = 2.0,
debug.level = 1)
https://cran.r-project.org/web/packages/gstat/gstat.pdf
So what should I put for feature_name on the Python side? Or how can I build it in R so that the string feature_name is transformed into something that works?
Any help would be appreciated.
Thank you for reading so far.
I do not really understand this line and what is really feature name (because it is not a string nor known variable at this point, but the name of a column).
R differs from Python as expressions in a function call (here idw(Z~1, points,grid)) will only be evaluated within the function, and the unevaluated expression itself is available to the code in the body of the function.
In addition to that, Z~1 is itself a special thing: it is an R formula. You could write fml <- Z ~ 1 in R and the object fml will be a "formula". The constructor for the formula is somewhat hidden, as <something> ~ <something> is considered a language construct in R, but in fact you have something like build_formula(<left_side_expression>, <right_side_expression>). You can try fml <- get("~")(Z, 1) in R and see that this is exactly what is happening.
Okay, I just need to use as.formula to convert a string to a formula :-)
idw(as.formula(feature_name), points,grid)
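Since as.formula parses the text of a whole formula, the string handed over from Python would presumably need to be the complete formula (for example "Z ~ 1") rather than just the column name. A minimal sketch of building that string on the Python side follows. The helper name build_formula_string and the name check are assumptions for illustration, not part of rpy2 or gstat.

```python
import re

def build_formula_string(feature_name, right_side="1"):
    """Return an R formula as text, e.g. 'Z ~ 1', for a given column name."""
    # Accept only simple R-style names so that arbitrary R code cannot slip in.
    if not re.fullmatch(r"[A-Za-z.][A-Za-z0-9._]*", feature_name):
        raise ValueError(f"not a simple column name: {feature_name!r}")
    return f"{feature_name} ~ {right_side}"

formula_text = build_formula_string("Z")
# formula_text would then be passed as the feature_name argument of project_var.
```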

How to efficiently find existing key-values of a 2-dimensional dictionary in Python which lie between 4 values?

I have a little problem in Python. I have a 2-dimensional dictionary; let's call it dict[x,y] for now. x and y are integers. I am trying to select only the key-value pairs that lie between 4 points. The function should look like this:
def search(topleft_x, topleft_y, bottomright_x, bottomright_y):
For example: search(20, 40, 200000000, 300000000)
Now all dictionary items should be returned that match:
20 < x < 20000000000
AND 40 < y < 30000000000
Most of the key-value pairs in this huge matrix are not set (see picture), which is why I can't just iterate.
This function should return a shortened dictionary. In the example shown in the picture, it would be a new dictionary with the 3 green-circled values. Is there any simple solution to realize this?
So far I have used two nested for loops. In this example they would look like this:
def search():
    for x in range(20, 2000000000):
        for y in range(40, 3000000000):
            try:
                #Do something
            except:
                #Well item just doesnt exist
Of course this is highly inefficient. So my question is: how can I speed up this simple thing in Python? In C# I used LINQ for stuff like this. What should I use in Python?
Thanks for help!
Example Picture
You don't go over arbitrary number ranges and ask for forgiveness 4 million times. Instead, you use two ranges to specify your "filters" and go only over the existing keys in the dictionary that fall into those ranges:
# get fancy on filtering if you like, I used explicit conditions and continues for clarity
def search(d:dict,r1:range, r2:range)->dict:
    d2 = {}
    for x in d: # only use existing keys in d - not 20k that might be in
        if x not in r1: # skip it - not in range r1
            continue
        d2[x] = {}
        for y in d[x]: # only use existing keys in d[x] - not 20k that might be in
            if y not in r2: # skip it - not in range r2
                continue
            d2[x][y] = "found: " + d[x][y][:] # take it, it's in both ranges
    return d2
d = {}
d[20] = {99: "20",999: "200",9999: "2000",99999: "20000",}
d[9999] = { 70:"70",700:"700",7000:"7000",70000:"70000"}
print(search(d,range(10,30), range(40,9000)))
Output:
{20: {99: 'found: 20', 999: 'found: 200'}}
It might be useful to take a look at modules providing sparse matrices.
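If the dictionary is keyed by (x, y) tuples, as the dict[x,y] notation in the question suggests, the same idea can be sketched as a dictionary comprehension over the existing keys. The function name search_points and the sample data are assumptions for illustration.

```python
def search_points(points, top_left_x, top_left_y, bottom_right_x, bottom_right_y):
    """Return the entries of a {(x, y): value} dict strictly inside the rectangle."""
    return {
        (x, y): value
        for (x, y), value in points.items()
        if top_left_x < x < bottom_right_x and top_left_y < y < bottom_right_y
    }

# Hypothetical sparse data: only three points are set.
points = {(25, 50): "a", (10, 50): "b", (100, 400): "c"}
inside = search_points(points, 20, 40, 200, 300)
```

This still visits every stored entry once, so its cost depends on the number of set values rather than on the size of the coordinate range.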

spark mapPartitionRDD can't print values

I am following the Machine Learning with Spark book and trying to convert the Python code to Scala, using a Beaker notebook to share variables so that I can pass values to Python and plot with matplotlib as described in the book. I have been able to convert most of the code so far, but I am having some issues converting the try-catch used for data cleansing of the u.item dataset. The code below ends in an infinite loop, with no clear indication of what the error is.
val movieData = sc.textFile("/Users/minHenry/workspace/ml-100k/u.item")
val movieDataSplit = movieData.first()
val numMovies = movieData.count()
def convertYear(x:String):Int = x.takeRight(4) match {
  case x => x.takeRight(4).toInt
  case _ => 1900
}
val movieFields = movieData.map(lines => lines.split('|'))
print(movieData.first())
val years1 = movieFields.map(fields => fields(2))
val years = movieFields.map(fields => fields(2).map(x=>convertYear(x.toString())))
val filteredYears = years.filter(x => x!=1900)
years.take(2).foreach(println)
I suspect my problem is with my pattern match but I am not exactly sure what's wrong with it. I think the takeRight() works because it doesn't complain about the type that this function is being applied to.
UPDATE
I have updated the code as follows, per advice from the answer provided thus far:
import scala.util.Try
val movieData = sc.textFile("/Users/minHenry/workspace/ml-100k/u.item")
def convertYear(x:String):Int = Try(x.toInt).getOrElse(1900)
val movieFields = movieData.map(lines => lines.split('|'))
val preYears = movieFields.map(fields => fields(2))
val years = preYears.map(x => x.takeRight(4))//.map(x=>convertYear(x))
println("=======> years")
years.take(2).foreach(println) //--output = 1995/n1995
println("=======> filteredYears")
val filteredYears = years.filter(x => x!=1900)
filteredYears.take(2).foreach(println)
//val movieAges = filteredYears.map(yr => (1998-yr)).countByValue()
I commented out the map following the takeRight(4) because it's easier to comment out than writing x=>convertYear(x.takeRight(4)), and it should produce the same output. When I apply this convertYear() function I still end up in an infinite loop. The values print as expected in the few print statements shown. The problem is that if I cannot remove the data points that cannot be easily converted to Int, then I am unable to run the countByValue() function in the last line.
Here is the link to my public beaker notebook for more context:
https://pub.beakernotebook.com/#/publications/56eed31d-85ad-4728-a45d-14b3b08d673f
movieData: RDD[String]
movieFields: RDD[Array[String]]
years1: RDD[String]
val years = movieFields.map(fields => fields(2).map(x=>convertYear(x.toString()))): here fields(2) is a String, so x is a Char, because a String is treated as a Seq[Char]. Every input to convertYear(x: String) is therefore a one-character string.
Your error is a type incompatibility hidden by convertYear(x.toString()). That is an alarm bell. Always use the type system in Scala; don't hide a problem with toString(), isInstanceOf, or something similar. Then the compiler shows the error before the code runs.
P.S.
Second call of takeRight is useless.
def convertYear(x:String):Int = x.takeRight(4) match {
  case x => x.takeRight(4).toInt
  case _ => 1900
}
Pattern matching is about checking type or conditions (with if statement). Your first partial function doesn't check anything. All inputs go to x.takeRight(4).toInt. Also there is no defence against toInt exception.
Use instead def convertYear(x: String): Int = Try(x.toInt).getOrElse(1900).
Update
scala> import scala.util.Try
import scala.util.Try
scala> def convertYear(x:String):Int = Try(x.toInt).getOrElse(1900)
convertYear: (x: String)Int
scala> List("sdsdf", "1989", "2009", "1945", "asdf", "455")
res0: List[String] = List(sdsdf, 1989, 2009, 1945, asdf, 455)
scala> res0.map(convertYear)
res1: List[Int] = List(1900, 1989, 2009, 1945, 1900, 455)
It works the same way with an RDD, because it is a functor just like List.
val filteredYears = years.filter(x => x!=1900) won't work as you expect. x is a String, not an Int. Scala doesn't implicitly convert types for comparison, so you always get true.
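Since the code is being ported from Python, the fallback-then-filter idea can be sketched in plain Python for comparison. The sample strings below are made up in the style of the u.item release dates, and the function name convert_year is an assumption for illustration.

```python
def convert_year(text, default=1900):
    """Parse the last four characters as a year, falling back to a sentinel."""
    try:
        return int(text[-4:])
    except ValueError:
        return default

# Hypothetical release-date strings; one entry is empty and cannot be parsed.
release_dates = ["01-Jan-1995", "", "14-Feb-1996"]
years = [convert_year(date) for date in release_dates]
# Compare against the integer sentinel, not against a string.
filtered_years = [year for year in years if year != 1900]

# Mapping over a single string visits its characters one by one,
# which is the pitfall described for fields(2).map(...) above.
per_character = [convert_year(ch) for ch in "1995"]
```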

Translate ruby to python

I'm rewriting some code from Ruby to Python. The code is for a Perceptron, listed in section 8.2.6 of Clever Algorithms: Nature-Inspired Programming Recipes. I've never used Ruby before and I don't understand this part:
def test_weights(weights, domain, num_inputs)
  correct = 0
  domain.each do |pattern|
    input_vector = Array.new(num_inputs) {|k| pattern[k].to_f}
    output = get_output(weights, input_vector)
    correct += 1 if output.round == pattern.last
  end
  return correct
end
Some explanation: num_inputs is an integer (2 in my case), and domain is a list of arrays: [[1,0,1], [0,0,0], etc.]
I don't understand this line:
input_vector = Array.new(num_inputs) {|k| pattern[k].to_f}
It creates an array with 2 values, where the element at each index k stores pattern[k].to_f, but what is pattern[k].to_f?
Try this:
input_vector = [float(pattern[i]) for i in range(num_inputs)]
pattern[k].to_f
converts pattern[k] to a float.
I'm not a Ruby expert, but I think it would be something like this in Python:
def test_weights(weights, domain, num_inputs):
    correct = 0
    for pattern in domain:
        output = get_output(weights, pattern[:num_inputs])
        if round(output) == pattern[-1]:
            correct += 1
    return correct
There is plenty of scope for optimising this: if num_inputs is always one less than the length of the lists in domain, then you may not need that parameter at all.
Be careful about doing line by line translations from one language to another: that tends not to give good results no matter what languages are involved.
Edit: since you said you don't think you need to convert to float you can just slice the required number of elements from the domain value. I've updated my code accordingly.
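To try the translation end to end, here is a small self-contained sketch. The get_output below is a hypothetical stand-in (weighted sum with the last weight used as a bias, followed by a step function), not the book's code, and the counting function is named count_correct_patterns here purely for illustration. Note also that Python's round() rounds halves to the nearest even integer while Ruby's Float#round rounds them away from zero; with a 0.0/1.0 step output this makes no difference.

```python
def get_output(weights, input_vector):
    """Hypothetical stand-in: weighted sum plus a bias weight, then a step function."""
    activation = weights[-1]  # assumption: the last weight acts as the bias
    for weight, value in zip(weights, input_vector):
        activation += weight * value
    return 1.0 if activation >= 0 else 0.0

def count_correct_patterns(weights, domain, num_inputs):
    """Count patterns whose rounded output matches the expected last element."""
    correct = 0
    for pattern in domain:
        # Equivalent of Array.new(num_inputs) {|k| pattern[k].to_f}
        input_vector = [float(pattern[k]) for k in range(num_inputs)]
        output = get_output(weights, input_vector)
        if round(output) == pattern[-1]:
            correct += 1
    return correct

# Logical OR as a toy domain: two inputs followed by the expected output.
or_problem = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
score = count_correct_patterns([1.0, 1.0, -0.5], or_problem, 2)
```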
