I am in the process of converting some MATLAB code to Python and I ran into MATLAB's spline function. I assumed NumPy would have something similar, but all I can find on Google is scipy.interpolate, which has so many options I don't even know where to start: http://docs.scipy.org/doc/scipy/reference/interpolate.html Is there an exact equivalent to MATLAB's spline? Since I need it to run for various cases, there is no single test case; in the worst case I would have to recode the function, which would take an unnecessary amount of time.
Thanks
Edit:
So I have tried the examples from the answers so far, but I don't see how they are equivalent. For example, spline(x,y) in MATLAB returns:
>> spline(x,y)
ans =
form: 'pp'
breaks: [0 1 2 3 4 5 6 7 8 9]
coefs: [9x4 double]
pieces: 9
order: 4
dim: 1
SciPy:
scipy.interpolate.UnivariateSpline
http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.UnivariateSpline.html#scipy.interpolate.UnivariateSpline
Note that it returns an interpolator (a function), not interpolated values. You have to call the resulting object:
spline = UnivariateSpline(x, y)
yy = spline(xx)
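For reference, a minimal self-contained sketch (the sample data here is made up for illustration; s=0 forces the spline to pass through every data point, which is closer to MATLAB's interpolating spline than the default smoothing behaviour):

import numpy as np
from scipy.interpolate import UnivariateSpline

# Example data (arbitrary, for illustration only)
x = np.arange(10)
y = np.sin(x)

# s=0 disables smoothing so the spline interpolates every point
spline = UnivariateSpline(x, y, s=0)

xx = np.linspace(0, 9, 100)
yy = spline(xx)  # evaluate the interpolator at new points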
For some functions it can be hard to tell from a 3D plot whether the function takes negative values, so I tried to make a small script that checks the max and min values using basic multivariable calculus. Is there any smart way to check whether I have values of 0 or below? I can still just draw a line where the function is 0, but there has to be another way to make a script that prints out something like "the function is below 0 at point X".
My code is this
from sympy import *
x,y = symbols("x,y")
Function = x**3*y**2 + x*y + x**2*y + cos(x)*sin(y)
plotting.plot3d(Function,(x,-1,1),(y,-1,1))
And yes, in this case it is trivial to see that the function goes below 0.
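One possible numerical check (a sketch, not necessarily the "smart" symbolic approach being asked about): lambdify the expression and sample it on a grid, reporting any point where the value is 0 or below. The grid resolution here is an arbitrary assumption.

import numpy as np
from sympy import symbols, cos, sin, lambdify

x, y = symbols("x,y")
expr = x**3*y**2 + x*y + x**2*y + cos(x)*sin(y)

# Turn the symbolic expression into a fast numerical function
f = lambdify((x, y), expr, "numpy")

# Sample the domain [-1, 1] x [-1, 1] on a grid (resolution is arbitrary)
xs = np.linspace(-1, 1, 201)
ys = np.linspace(-1, 1, 201)
X, Y = np.meshgrid(xs, ys)
Z = f(X, Y)

# Report whether (and where) the sampled values drop to 0 or below
if (Z <= 0).any():
    i, j = np.unravel_index(Z.argmin(), Z.shape)
    print(f"the function is 0 or below, e.g. at x={X[i, j]:.3f}, y={Y[i, j]:.3f} "
          f"(value {Z[i, j]:.3f})")
else:
    print("no sampled values were 0 or below")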
I have code for a 2D grid search and it works perfectly. Here is the sample code:
chinp = np.zeros((N,N))
O_M_De = []
for x,y in list(((x,y) for x in range(len(omega_M)) for y in range(len(omega_D)))):
    Omin = omega_M[x]
    Odin = omega_D[y]
    print(Omin, Odin)
    chi = np.sum((dist_data - dist_theo)**2/(chi_err))
    chinp[y,x] = chi
    chi_values1.append(chi)
    O_M_De.append((x,y))
My question is, at some point in the future I may want to perform a grid search over more dimensions. If it were 3 dimensions, it would be as simple as adding another variable 'z' to my 'for' statement (line 3). That approach would keep working as I add more dimensions (I have tried it and it works).
However, as you can tell, if I wanted to grid search over a large number of dimensions, it would get tedious and inefficient to keep adding variables to my 'for' statement (e.g. for 5D it would go something like 'for v,w,x,y,z in list(((v,w,x,y,z)...').
From various Google searches I am under the impression that itertools is very helpful when it comes to performing grid searches; however, I am still fairly new to programming and unfamiliar with it.
My question is whether anyone knows a way (using itertools or some other method I am not aware of) to extend this code to N dimensions more efficiently (i.e. maybe change the 'for' statement so I can grid search over N dimensions easily without adding another 'for z in range ...').
Thank you in advance for your help.
You want to take a look at the product function from itertools:
import itertools
x_list = [0, 1, 2]
y_list = [10, 11, 12]
z_list = [20, 21, 22]
for x, y, z in itertools.product(x_list, y_list, z_list):
    print(x, y, z)
0 10 20
0 10 21
0 10 22
0 11 20
0 11 21
(...)
2 11 21
2 11 22
2 12 20
2 12 21
2 12 22
Note that this will not be the most efficient way. You will get the best results if you add some vectorization (for example using numpy or numba) and parallelism (using multiprocessing or numba).
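Applied to the grid search from the question, a minimal sketch might look like the following. The parameter grids and the chi-squared computation are stand-ins, since the data behind omega_M, omega_D, dist_data, dist_theo and chi_err isn't shown in the original snippet:

import itertools
import numpy as np

# Stand-in parameter grids; in the real code these would be omega_M,
# omega_D, and any further parameter axes added later.
param_grids = [
    np.linspace(0.0, 1.0, 10),   # e.g. omega_M values
    np.linspace(0.0, 1.0, 10),   # e.g. omega_D values
    # add more axes here for higher-dimensional searches
]

chi_values = []
indices = []

# itertools.product over the index ranges replaces the nested
# "for x ... for y ..." generator, for any number of dimensions.
for idx in itertools.product(*(range(len(g)) for g in param_grids)):
    params = [g[i] for g, i in zip(param_grids, idx)]
    # Placeholder chi-squared; substitute the real dist_data/dist_theo
    # computation from the question here.
    chi = float(np.sum(np.square(params)))
    chi_values.append(chi)
    indices.append(idx)

# The chi values can be reshaped onto the N-dimensional grid afterwards.
chinp = np.array(chi_values).reshape([len(g) for g in param_grids])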
I'm trying to understand what's the execution complexity of the iloc function in pandas.
I read in the following Stack Exchange thread (Pandas DataFrame search is linear time or constant time?) that:
"accessing single row by index (index is sorted and unique) should have runtime O(m) where m << n_rows"
which suggests that iloc runs in O(m) time. What is m here (linear, logarithmic, constant, ...)?
Some experiments I ran:
>>> import pandas as pd
>>> a = pd.DataFrame([[1,2,3],[1,3,4],[2,3,4],[2,4,5]], columns=['a','b','c'])
>>> a = a.set_index('a').sort_index()
>>> a
b c
a
1 3 4
1 4 5
2 2 3
2 3 4
>>> a.iloc[[0,1,2,3]]
b c
a
1 3 4
1 4 5
2 2 3
2 3 4
So iloc clearly works with positional offsets and not with the integer-based index (column a). Even if we delete a few rows at the top, the iloc offset-based lookup still works correctly:
>>> a.drop([1]).iloc[[0,1]]
b c
a
2 2 3
2 3 4
So why doesn't iloc's offset lookup run in a time comparable to numpy arrays, when each column is simply a numpy array that can be accessed in constant time (a few operations)? And what is its complexity?
UPDATE:
I tried to compare the efficiency of pandas vs. numpy on a 10000000x2 matrix, incrementing a value per row in a DataFrame df and in an array arr, with and without a for loop:
# Initialization
import numpy as np
import pandas as pd

SIZE = 10000000
arr = np.ones((SIZE,2), dtype=np.uint32)
df = pd.DataFrame(arr)

# numpy, no for-loop
arr[range(SIZE),1] += 1

# pandas, no for-loop
df.iloc[range(SIZE),1] += 1

# numpy, for-loop
for i in range(SIZE):
    arr[i,1] += 1

# pandas, for-loop
for i in range(SIZE):
    df.iloc[i,1] += 1
Method                | Execution time
----------------------|---------------
numpy, no for-loop    | 7 seconds
pandas, no for-loop   | 24 seconds
numpy, with for-loop  | 27 seconds
pandas, with for-loop | > 2 hours
There likely isn't one answer for the runtime complexity of iloc. The method accepts a huge range of input types, and that flexibility necessarily comes with costs. These costs are likely to include both large constant factors and non-constant costs that are almost certainly dependent on the way in which it is used.
One way to sort of answer your question is to step through the code in the two cases.
Indexing with range
First, indexing with range(SIZE). Assuming df is defined as you did, you can run:
import pdb
pdb.run('df.iloc[range(SIZE), 1]')
and then step through the code to follow the path. Ultimately, this arrives at this line:
self._values.take(indices)
where indices is an ndarray of integers constructed from the original range, and self._values is the source ndarray of the data frame.
There are two things to note about this. First, the range is materialized into an ndarray, which means you have a memory allocation of at least SIZE elements. So...that's going to cost you some time :). I don't know how the indexing happens in NumPy itself, but given the time measurements you've produced, it's possible that there is no (or a much smaller) allocation happening.
The second thing to note is that numpy.take makes a copy. You can verify this by looking at the .flags attribute of the object returned from calling this method, which indicates that it owns its data and is not a view onto the original. (Also note that np.may_share_memory returns False.) So...there's another allocation there :).
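As a quick standalone check of that copy behaviour, here is the same claim demonstrated on a plain ndarray rather than by stepping through the pandas internals:

import numpy as np

values = np.arange(10.0)
taken = values.take(np.arange(5))

# take() returns a new array that owns its own buffer, not a view.
print(taken.flags['OWNDATA'])              # True: the result is a copy
print(np.may_share_memory(taken, values))  # False: no shared memory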
Take home: It's not obvious that there's any non-linear runtime here, but there are clearly large constant factors. Multiple allocations are probably the big killer, but the complex branching logic in the call tree under the .iloc property surely doesn't help.
Indexing in a loop
The code taken in this path is much shorter. It pretty quickly arrives here:
return self.obj._get_value(*key, takeable=self._takeable)
The really crappy runtime here is probably due to that tuple-unpacking. That repeatedly unpacks and repacks key as a tuple on each loop iteration. (Note that key is already the tuple (i, 1), so that sucks. The cost of accepting a generic iterable.)
Runtime analysis
In any case, we can get an estimate of the actual runtime of your particular use case by profiling. The following script generates a log-spaced list of array sizes from 10 to 10^8, indexes each with a range, and prints out the time it takes to run the __getitem__ method. (There are only two such calls in the tree, so it's easy to see which one we care about.)
import pandas as pd
import numpy as np
import cProfile
import pstats

sizes = [10 ** i for i in range(1, 9)]

for size in sizes:
    df = pd.DataFrame(data=np.zeros((size, 2)))
    with cProfile.Profile() as pr:
        pr.enable()
        df.iloc[range(size), 1]
        pr.disable()
    stats = pstats.Stats(pr)
    print(size)
    stats.print_stats("__getitem__")
Once the output gets above the minimum resolution, you can see pretty clear linear behavior here:
Size      | Runtime
----------|--------
10000     | 0.002
100000    | 0.021
1000000   | 0.206
10000000  | 2.145
100000000 | 24.843
So I'm not sure what sources you're referring to that talk about nonlinear runtime of indexing. They could be out of date, or considering a different code path than the one using range.
I'm currently working on Project Euler 18, which involves a triangle of numbers and finding the maximum path sum from top to bottom. It says you can solve this problem either by brute force or by figuring out a trick. I think I've figured out the trick, but I can't even begin to solve it because I don't know how to start manipulating the triangle in Python.
https://projecteuler.net/problem=18
Here's a smaller example triangle:
   3
  7 4
 2 4 6
8 5 9 3
In this case, the maximum route would be 3 -> 7 -> 4 -> 9 for a value of 23.
Some approaches I considered:
I've used NumPy quite a lot for other tasks, so I wondered if an array would work. For that triangle with a 4-number base, I could maybe make a 4x4 array and fill up the rest with zeros, but aside from not knowing how to import the data that way, it also doesn't seem very efficient. I also considered a list of lists, where each sublist is a row of the triangle, but I don't know how I'd separate out the terms without going through and adding commas after each one.
Just to emphasise, I'm not looking for a method or a solution to the problem, just a way I can start to manipulate the numbers of the triangle in Python.
Here is a little snippet that should help you with reading the data:
rows = []
with open('problem-18-data') as f:
    for line in f:
        rows.append([int(i) for i in line.rstrip('\n').split(" ")])
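For the small example triangle, the same parsing logic gives a list of lists like this (the triangle is shown inline as a string here just for illustration, instead of reading the 'problem-18-data' file):

# Same parsing logic, applied to the small example triangle inline
data = """3
7 4
2 4 6
8 5 9 3"""

rows = [[int(i) for i in line.split()] for line in data.splitlines()]
print(rows)        # [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
print(rows[1][0])  # 7 -- row 1, first entry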
I have 3 functions of 6 variables (p1, p2, p3, p4, p5, p6), and each function is equal to the same value x (say):
f1 = sgn(2-p1)*sqrt(abs(2-p1)) + sgn(2-p2)*sqrt(abs(2-p2)) + sgn(2-p3)*sqrt(abs(2-p3))
f2 = sgn(p4-2)*sqrt(abs(p4-2)) + sgn(p5-2)*sqrt(abs(p5-2)) + sgn(p6-2)*sqrt(abs(p6-2))
f3 = sgn(p1-p4)*sqrt(abs(p1-p4)) + sgn(p2-p5)*sqrt(abs(p2-p5)) + sgn(p3-p6)*sqrt(abs(p3-p6))
I want to find the combination of values of p1, p2, p3, p4, p5 and p6 for which x is maximum. The constraints are:
0 <= p1, p2, p3, p4, p5, p6 <= 4
Simply varying every variable from 0 to 4 in small steps is not a good solution. Can someone suggest an efficient method to optimize this (preferably in Python)?
This is a non-linear optimization problem without an obvious closed-form solution. It might be better to ask this question in a forum focused on optimization.
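That said, one way this kind of problem is often set up numerically is with scipy.optimize.minimize: maximize the common value x subject to the equality constraints f1 = f2 = f3 and the box bounds. The sketch below assumes that interpretation of the question; note that the sgn(d)*sqrt(|d|) terms are non-smooth at 0, so a gradient-based solver like SLSQP may struggle, and this should be treated as illustrative only (the starting guess is arbitrary).

import numpy as np
from scipy.optimize import minimize

def sgn_sqrt(d):
    # sgn(d) * sqrt(|d|), applied elementwise
    return np.sign(d) * np.sqrt(np.abs(d))

def f1(p):
    return sgn_sqrt(2 - p[0]) + sgn_sqrt(2 - p[1]) + sgn_sqrt(2 - p[2])

def f2(p):
    return sgn_sqrt(p[3] - 2) + sgn_sqrt(p[4] - 2) + sgn_sqrt(p[5] - 2)

def f3(p):
    return sgn_sqrt(p[0] - p[3]) + sgn_sqrt(p[1] - p[4]) + sgn_sqrt(p[2] - p[5])

# Maximize x = f1 (i.e. minimize -f1) subject to f1 = f2 and f2 = f3,
# with every variable bounded to [0, 4].
constraints = [
    {"type": "eq", "fun": lambda p: f1(p) - f2(p)},
    {"type": "eq", "fun": lambda p: f2(p) - f3(p)},
]
bounds = [(0, 4)] * 6
p0 = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])  # arbitrary starting guess

result = minimize(lambda p: -f1(p), p0, method="SLSQP",
                  bounds=bounds, constraints=constraints)
print(result.x, -result.fun)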