F# library or .Net Numerics equivalent to Python Numpy function - python

I have the following python Numpy function; it is able to take X, an array with an arbitrary number of columns and rows, and output a Y value predicted by a least squares function.
What is the Math.Net equivalent for such a function?
Here is the Python code:
newdataX = np.ones([dataX.shape[0],dataX.shape[1]+1])
newdataX[:,0:dataX.shape[1]]=dataX
# build and save the model
self.model_coefs, residuals, rank, s = np.linalg.lstsq(newdataX, dataY)

I think you are looking for the functions on this page: http://numerics.mathdotnet.com/api/MathNet.Numerics.LinearRegression/MultipleRegression.htm
You have a few options to solve :
Normal Equations : MultipleRegression.NormalEquations(x, y)
QR Decomposition : MultipleRegression.QR(x, y)
SVD : MultipleRegression.SVD(x, y)
Normal equations are faster but less numerically stable while SVD is the most numerically stable but the slowest.

You can call numpy from .NET using pythonnet (C# CODE BELOW IS COPIED FROM GITHUB):
The only "funky" part right now with pythonnet is passing numpy arrays. It is possible to convert them to Python lists at the interface, though this reduces performance for some situations.
https://github.com/pythonnet/pythonnet/tree/develop
static void Main(string[] args)
{
using (Py.GIL()) {
dynamic np = Py.Import("numpy");
dynamic sin = np.sin;
Console.WriteLine(np.cos(np.pi*2));
Console.WriteLine(sin(5));
double c = np.cos(5) + sin(5);
Console.WriteLine(c);
dynamic a = np.array(new List<float> { 1, 2, 3 });
dynamic b = np.array(new List<float> { 6, 5, 4 }, Py.kw("dtype", np.int32));
Console.WriteLine(a.dtype);
Console.WriteLine(b.dtype);
Console.WriteLine(a * b);
Console.ReadKey();
}
}
outputs:
1.0
-0.958924274663
-0.6752620892
float64
int32
[ 6. 10. 12.]
Here is example using F# posted on github:
https://github.com/pythonnet/pythonnet/issues/112
open Python.Runtime
open FSharp.Interop.Dynamic
open System.Collections.Generic
[<EntryPoint>]
let main argv =
//set up for garbage collection?
use gil = Py.GIL()
//-----
//NUMPY
//import numpy
let np = Py.Import("numpy")
//call a numpy function dynamically
let sinResult = np?sin(5)
//make a python list the hard way
let list = new Python.Runtime.PyList()
list.Append( new PyFloat(4.0) )
list.Append( new PyFloat(5.0) )
//run the python list through np.array dynamically
let a = np?array( list )
let sumA = np?sum(a)
//again, but use a keyword to change the type
let b = np?array( list, Py.kw("dtype", np?int32 ) )
let sumAB = np?add(a,b)
let SeqToPyFloat ( aSeq : float seq ) =
let list = new Python.Runtime.PyList()
aSeq |> Seq.iter( fun x -> list.Append( new PyFloat(x)))
list
//Worth making some convenience functions (see below for why)
let a2 = np?array( [|1.0;2.0;3.0|] |> SeqToPyFloat )
//--------------------
//Problematic cases: these run but don't give good results
//make a np.array from a generic list
let list2 = [|1;2;3|] |> ResizeArray
let c = np?array( list2 )
printfn "%A" c //gives type not value in debugger
//make a np.array from an array
let d = np?array( [|1;2;3|] )
printfn "%A" d //gives type not value in debugger
//use a np.array in a function
let sumD = np?sum(d) //gives type not value in debugger
//let sumCD = np?add(d,d) // this will crash
//can't use primitive f# operators on the np.arrays without throwing an exception; seems
//to work in c# https://github.com/tonyroberts/pythonnet //develop branch
//let e = d + 1
//-----
//NLTK
//import nltk
let nltk = Py.Import("nltk")
let sentence = "I am happy"
let tokens = nltk?word_tokenize(sentence)
let tags = nltk?pos_tag(tokens)
let taggedWords = nltk?corpus?brown?tagged_words()
let taggedWordsNews = nltk?corpus?brown?tagged_words(Py.kw("categories", "news") )
printfn "%A" taggedWordsNews
let tlp = nltk?sem?logic?LogicParser(Py.kw("type_check",true))
let parsed = tlp?parse("walk(angus)")
printfn "%A" parsed?argument
0 // return an integer exit code

Related

Why nested When().Then() is slower than Left Join in Rust Polars?

In Rust Polars(might apply to python pandas as well) assigning values in a new column with a complex logic involving values of other columns can be achieved in two ways. The default way is using a nested WhenThen expression. Another way to achieve same thing is with LeftJoin. Naturally I would expect When Then to be much faster than Join, but it is not the case. In this example, When Then is 6 times slower than Join. Is that actually expected? Am I using When Then wrong?
In this example the goal is to assign weights/multipliers column based on three other columns: country, city and bucket.
use std::collections::HashMap;
use polars::prelude::*;
use rand::{distributions::Uniform, Rng}; // 0.6.5
pub fn bench() {
// PREPARATION
// This MAP is to be used for Left Join
let mut weights = df![
"country"=>vec!["UK"; 5],
"city"=>vec!["London"; 5],
"bucket" => ["1","2","3","4","5"],
"weights" => [0.1, 0.2, 0.3, 0.4, 0.5]
].unwrap().lazy();
weights = weights.with_column(concat_lst([col("weights")]).alias("weihts"));
// This MAP to be used in When.Then
let weight_map = bucket_weight_map(&[0.1, 0.2, 0.3, 0.4, 0.5], 1);
// Generate the DataSet itself
let mut rng = rand::thread_rng();
let range = Uniform::new(1, 5);
let b: Vec<String> = (0..10_000_000).map(|_| rng.sample(&range).to_string()).collect();
let rc = vec!["UK"; 10_000_000];
let rf = vec!["London"; 10_000_000];
let val = vec![1; 10_000_000];
let frame = df!(
"country" => rc,
"city" => rf,
"bucket" => b,
"val" => val,
).unwrap().lazy();
// Test with Left Join
use std::time::Instant;
let now = Instant::now();
let r = frame.clone()
.join(weights, [col("country"), col("city"), col("bucket")], [col("country"), col("city"), col("bucket")], JoinType::Left)
.collect().unwrap();
let elapsed = now.elapsed();
println!("Left Join took: {:.2?}", elapsed);
// Test with nested When Then
let now = Instant::now();
let r1 = frame.clone().with_column(
when(col("country").eq(lit("UK")))
.then(
when(col("city").eq(lit("London")))
.then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
.otherwise(NULL.lit())
)
.otherwise(NULL.lit())
)
.collect().unwrap();
let elapsed = now.elapsed();
println!("Chained When Then: {:.2?}", elapsed);
// Check results are identical
dbg!(r.tail(Some(10)));
dbg!(r1.tail(Some(10)));
}
/// All this does is building a chained When().Then().Otherwise()
fn rf_rw_map(col: Expr, map: HashMap<String, Expr>, other: Expr) -> Expr {
// buf is a placeholder
let mut it = map.into_iter();
let (k, v) = it.next().unwrap(); //The map will have at least one value
let mut buf = when(lit::<bool>(false)) // buffer WhenThen
.then(lit::<f64>(0.).list()) // buffer WhenThen, needed to "chain on to"
.when(col.clone().eq(lit(k)))
.then(v);
for (k, v) in it {
buf = buf
.when(col.clone().eq(lit(k)))
.then(v);
}
buf.otherwise(other)
}
fn bucket_weight_map(arr: &[f64], ntenors: u8) -> HashMap<String, Expr> {
let mut bucket_weights: HashMap<String, Expr> = HashMap::default();
for (i, n) in arr.iter().enumerate() {
let j = i + 1;
bucket_weights.insert(
format!["{j}"],
Series::from_vec("weight", vec![*n; ntenors as usize])
.lit()
.list(),
);
}
bucket_weights
}
The result is surprising to me: Left Join took: 561.26ms vs Chained When Then: 3.22s
Thoughts?
UPDATE
This does not make much difference. Nested WhenThen is still over 3s
// Test with nested When Then
let now = Instant::now();
let r1 = frame.clone().with_column(
when(col("country").eq(lit("UK")).and(col("city").eq(lit("London"))))
.then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
.otherwise(NULL.lit())
)
.collect().unwrap();
let elapsed = now.elapsed();
println!("Chained When Then: {:.2?}", elapsed);
The joins are one of the most optimized algorithms in polars. A left join will be executed fully in parallel and has many performance related fast paths. If you want to combine data based on equality, you should almost always choose a join.

Assigned a complex value in cupy RawKernel

I am a beginner learning how to exploit GPU for parallel computation using python and cupy. I would like to implement my code to simulate some problems in physics and require to use complex number, but don't know how to manage it. Although there are examples in Cupy's official document, it only mentions about include complex.cuh library and how to declare a complex variable. I can't find any example about how to assign a complex number correctly, as well ass how to call the function in the complex.cuh library to do calculation.
I am stuck in line 11 of this code. I want to make a complex number value equal x[tIdx]+j*y[t_Idx], j is the imaginary number. I tried several ways and no one works, so I left this one here.
import cupy as cp
import time
add_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>
extern "C" __global__
void test(double* x, double* y, complex<float>* z){
int tId_x = blockDim.x*blockIdx.x + threadIdx.x;
int tId_y = blockDim.y*blockIdx.y + threadIdx.y;
complex<float>* value = complex(x[tId_x],y[tId_y]);
z[tId_x*blockDim.y*gridDim.y+tId_y] = value;
}''',"test")
x = cp.random.rand(1,8,4096,dtype = cp.float32)
y = cp.random.rand(1,8,4096,dtype = cp.float32)
z = cp.zeros((4096,4096), dtype = cp.complex64)
t1 = time.time()
add_kernel((128,128),(32,32),(x,y,z))
print(time.time()-t1)
What is the proper way to assign a complex number in the RawKernel?
Thank you for answering this question!
#plaeonix, thank you very much for your hint. I find out the answer.
This line:
complex<float>* value = complex(x[tId_x],y[tId_y])
should be replaced to:
complex<float> value = complex<float>(x[tId_x],y[tId_y])
Then the assignment of a complex number works.

PyTorch C++ extension: How to index tensor and update it?

I'm creating a PyTorch C++ extension and after much research I can't figure out how to index a tensor and update its values. I found out how to iterate over a tensor's entries using the data_ptr() method, but that's not applicable to my use case.
Given is a matrix M, a list of lists (blocks) of index pairs P and a function f: dtype(M)^2 -> dtype(M)^2 that takes two values and spits out two new values.
I'm trying to implement the following pseudo code:
for each block B in P:
for each row R in M:
for each index-pair (i,j) in B:
M[R,i], M[R,j] = f(M[R,i], M[R,j])
After all, this code is going to run on the GPU using CUDA, but since I don't have any experience with that, I wanted to first write a pure C++ program and then convert it.
Can anyone suggest how to do this or how to convert the algorithm to do something equivalent?
What I wanted to do can be done using the
tensor.accessor<scalar_dtype, num_dimensions>()
method. If executing on the GPU instead use scalars.packed_accessor64<scalar_dtype, num_dimensions, torch::RestrictPtrTraits>()
or
scalars.packed_accessor32<scalar_dtype, num_dimensions, torch::RestrictPtrTraits>() (depending on the size of your tensor).
auto num_rows = scalars.size(0);
matrix = torch::rand({10, 8});
auto a = matrix.accessor<float, 2>();
for (auto i = 0; i < num_rows; ++i) {
auto x = a[i][some_index];
auto new_x = some_function(x);
a[i][some_index] = new_x;
}

Rpy2: set a R formulat from python

I am little confused by R syntax formula
I created the following python function with Rpy2:
objects.r('''
project_var <- function(grid,points) {
coordinates(points) = ~X + Y
gridded(grid) = ~X+Y
grid = idw(Z~1, points,grid)
grid <- as.data.frame(grid)
return(grid)
}
''')
Then I import it
project_var = robjects.globalenv['project_var']
Then I call it:
test = project_var(model,points_top)
And it works as expected!
I would like to'Z' to be set by an argument of my function, something like this:
project_var <- function(grid,points,feature_name) {
...
grid = idw(feature_name~1, points,grid)
My Problem :
idw(feature_name~1, points,grid)
I do not really understand this line and what is really feature name (because it is not a string nor known variable at this point, but the name of a column as a formula).
for info idw comes from gstat library... and I do not know R...
here is the doc:
idw.locations(formula, locations, data, newdata, nmax = Inf, nmin = 0,
omax = 0, maxdist = Inf, block, na.action = na.pass, idp = 2.0,
debug.level = 1)
https://cran.r-project.org/web/packages/gstat/gstat.pdf
So what should I put for feature_name in the python side ? or how to build it in R so it would transform the string feature_name into something that would work ?
Any help would be appreciate.
Thank you for reading so far.
I do not really understand this line and what is really feature name (because it is not a string nor known variable at this point, but the name of a column).
R differs from Python as expressions in a function call (here idw(Z~1, points,grid)) will only be evaluated within the function, and the unevaluated expression itself is available to the code in the body of the function.
In addition to that, Z~1 is itself a special thing: it is an R formula. You could write fml <- Z ~ 1 in R and the object fml will be a "formula". The constructor for the formula is somewhat hidden as <something> ~ <something> is considered a language construct in R, but in fact you have something like build_formula(<left_side_expression>, <right_side_expression>). You can try in R fml <- get("~")(Z, 1) and see that this is exactly that happening.
okay, just need to use as.formula to convert a string to a formula :-)
idw(as.formula(feature_name), points,grid)

From C to Python

I'm really sorry if this is a lame question, but I think this may potentially help others making the same transition from C to Python. I have a program that I started writing in C, but I think it's best if I did it in Python because it just makes my life a lot easier.
My program retrieves intraday stock data from Yahoo! Finance and stores it inside of a struct. Since I'm so used to programming in C I generally try to do things the hard way. What I want to know is what's the most "Pythonesque" way of storing the data into an organized fashion. I was thinking an array of tuples?
Here's a bit of my C program.
// Parses intraday stock quote data from a Yahoo! Finance .csv file.
void parse_intraday_data(struct intraday_data *d, char *path)
{
char cur_line[100];
char *csv_value;
int i;
FILE *data_file = fopen(path, "r");
if (data_file == NULL)
{
perror("Error opening file.");
return;
}
// Ignore the first 15 lines.
for (i = 0; i < 15; i++)
fgets(cur_line, 100, data_file);
i = 0;
while (fgets(cur_line, 100, data_file) != NULL) {
csv_value = strtok(cur_line, ",");
csv_value = strtok(NULL, ",");
d->close[i] = atof(csv_value);
csv_value = strtok(NULL, ",");
d->high[i] = atof(csv_value);
csv_value = strtok(NULL, ",");
d->low[i] = atof(csv_value);
csv_value = strtok(NULL, ",");
d->open[i] = atof(csv_value);
csv_value = strtok(NULL, "\n");
d->volume[i] = atoi(csv_value);
i++;
}
d->close[i] = 0;
d->high[i] = 0;
d->low[i] = 0;
d->open[i] = 0;
d->volume[i] = 0;
d->count = i - 1;
i = 0;
fclose(data_file);
}
So far my Python program retrieves the data like this.
response = urllib2.urlopen('https://www.google.com/finance/getprices?i=' + interval + '&p=' + period + 'd&f=d,o,h,l,c,v&df=cpct&q=' + ticker)
Question is, what's the best or most elegant way of storing this data in Python?
Keep it simple. Read the line, split it by commas, and store the values inside a (named)tuple. That’s pretty close to using a struct in C.
If your program gets more elaborate it might (!) make sense to replace the tuple by a class, but not immediately.
Here’s an outline:
from collections import namedtuple
IntradayData = namedtuple('IntradayData',
['close', 'high', 'low', 'open', 'volume', 'count'])
response = urllib2.urlopen('https://www.google.com/finance/getprices?q=AAPL')
result=response.read().split('\n')
result = result[15 :] # Your code does this, too. Not sure why.
all_data = []
for i, data in enumerate(x):
if data == '': continue
c, h, l, o, v, _ = map(float, data.split(','))
all_data.append(IntradayData(c, h, l, o, v, i))
I believe it depends on how much data manipulation you will want to do after retrieving data.
If, for example, you plan to just print them on the screen then an array of tuple would do.
However, if you need to be able to sort, search, and other kind of data manipulation I believe a custom class could help: you would then work with a list (or even a home-brew container) of custom objects, allowing you for easily adding custom methods, based on your need.
Note that this is just my opinion, and I'm not an advanced Python developer.
Pandas (http://pandas.pydata.org/pandas-docs/stable/) is particularly well suited to this. Numpy is a little lower level, but may also suit your purposes. I really recommend going the pandas route, though. Either way you shouldn't lose too much of C's speed, so that's a plus.

Categories

Resources