Creating table efficiently in numpy - python

I'm implementing the CKY algorithm. I tried it once with lists of lists in plain Python and once with numpy.zeros, and the list-of-lists version is faster every time. I would expect numpy to be faster, but I am new to it, so it is likely that I am not writing the numpy version as efficiently as possible. It is also possible that I am just using a small dataset and lists of lists are simply faster at that size.
The only part that differs is the initialization of the tables.
Straight Python:
table = [[[] for i in range(length)] for j in range(length)]
and numpy:
import numpy as np

table = np.zeros((n_dimension, n_dimension), dtype=object)
for i in range(n_dimension):
    for j in range(n_dimension):
        table[i, j] = []
I think numpy is slower here because I am not forming the table in an optimized way, while the plain-Python version is about as efficient as it can be. How can I write a similarly optimized numpy version so that I can accurately assess the time difference between the two? Or are lists of lists just faster in this case? At what point does numpy become more efficient than lists of lists?
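To compare just the two initializations on their own, here is a minimal timing sketch using the standard timeit module; the size is made up and assumes length == n_dimension:
import timeit

setup = "import numpy as np; n = 100"

list_version = "[[[] for i in range(n)] for j in range(n)]"
numpy_version = (
    "table = np.zeros((n, n), dtype=object)\n"
    "for i in range(n):\n"
    "    for j in range(n):\n"
    "        table[i, j] = []"
)

print(timeit.timeit(list_version, setup=setup, number=1000))
print(timeit.timeit(numpy_version, setup=setup, number=1000))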


Dask run all combination of elements in different lists in parallel

I'm trying to run a function on every combination of the elements of several arrays with dask, and I'm struggling to apply it.
The serial code is as below:
for i in range(5):
    for j in range(5):
        for k in range(5):
            function(listA[i], listB[j], listC[k])
            print(f'{i}.{j}.{k}')
            k = k + 1
        j = j + 1
    i = i + 1
This code takes 18 minutes to run on my computer, even though each array has only 5 elements. I want to run it in parallel with dask on bigger arrays.
The calculations inside the function do not depend on one another.
You can assume that what the function does is: listA[i]*listB[j]*listC[k]
After searching a lot online, I couldn't find any solution.
Much appreciated.
The snippet can be improved before using dask. Instead of iterating over indices and then looking up the corresponding items in the lists, one can iterate over the lists directly (i.e. use for item in list_A:). Since in this case we are interested in all combinations of items from three lists, we can use itertools.product:
from itertools import product

triples = product(list_A, list_B, list_C)
for i, j, k in triples:
    function(i, j, k)
To use dask, one option is the delayed API. By wrapping function with dask.delayed, we obtain a lazy reference to the result of each call. After collecting all the lazy references, we can compute them in parallel with dask.compute:
import dask
from itertools import product

triples = product(list_A, list_B, list_C)
delayeds = [dask.delayed(function)(i, j, k) for i, j, k in triples]
results = dask.compute(*delayeds)
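For a quick end-to-end check, here is a toy version; the three lists and the multiply stand-in for function are assumptions for illustration, not the original data:
import dask
from itertools import product

list_A = [1, 2, 3, 4, 5]
list_B = [10, 20, 30, 40, 50]
list_C = [100, 200, 300, 400, 500]

def function(a, b, c):
    # stand-in for the expensive computation
    return a * b * c

delayeds = [dask.delayed(function)(i, j, k) for i, j, k in product(list_A, list_B, list_C)]
results = dask.compute(*delayeds)  # evaluates the 125 calls through dask's scheduler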

finding different element in column numpy

I am using numpy to find the distinct elements in the first column of a numpy array, using the code below. I also looked at the np.unique method but couldn't find the proper function.
k = 0
c = 0
nonrep = []
for i in range(len(xin)):
    for j in range(len(nonrep)):
        if xin[i, 0] == nonrep[j]:
            c = c + 1
    if c == 0:
        nonrep.append(xin[i, 0])
    c = 0
I am sure I can do this better and faster using the numpy library. I would be glad if you could help me find a better and faster way to do this.
This is definitely not a good way to do it, since each membership check is a linear search and you do not even break out of the loop once the element has been found. This makes it an O(n²) algorithm.
Using numpy: O(n log n), does not preserve order
You can simply use:
np.unique(xin[:,0])
This will work in O(n log n). This is still not the most efficient approach.
Using pandas: O(n), preserves order
If you really need fast computations, you can better use pandas:
import pandas as pd
pd.DataFrame(xin[:,0])[0].unique()
This works in O(n) (given the elements can be efficiently hashed) and furthermore preserves order. Here the result is again a numpy array.
As @B.M. says in their comment, you can avoid constructing a one-column DataFrame and construct a Series instead:
import pandas as pd
pd.Series(xin[:,0]).unique()
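If you want to stay entirely in numpy and still preserve the order of first appearance, one possible sketch (still O(n log n) because of the sorting; it assumes xin is a 2-D array) is to ask np.unique for the index of each value's first occurrence and re-sort by it:
import numpy as np

vals, first_idx = np.unique(xin[:, 0], return_index=True)
ordered_unique = vals[np.argsort(first_idx)]  # unique values in order of first appearance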

Fastest way to initialize numpy array with values given by function

I am mainly interested in 2-D ((d1, d2)) numpy arrays (matrices), but the question makes sense for arrays with more axes. I have a function f(i, j) and I'd like to initialize an array with its values:
A = np.empty((d1, d2))
for i in range(d1):
    for j in range(d2):
        A[i, j] = f(i, j)
This is readable and works but I am wondering if there is a faster way since my array A will be very large and I have to optimize this bit.
One way is to use np.fromfunction. Your code can be replaced with the line:
np.fromfunction(f, shape=(d1, d2))
This is implemented in terms of NumPy operations and so should be quite a bit faster than Python for loops for larger arrays. Note that np.fromfunction calls f once with whole index arrays rather than once per element, so f must be written to accept array arguments (e.g. built from numpy ufuncs).
a = np.arange(d1)
b = np.arange(d2)
A = f(a[:, np.newaxis], b)
Here broadcasting expands the two index vectors into the full (d1, d2) grid, so again f must accept arrays. Equivalently, you can build the index grids explicitly with a meshgrid:
X, Y = np.meshgrid(a, b, indexing='ij')
A = f(X, Y)
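As a quick sanity check, here is a toy f (an assumption for illustration, written elementwise so it works on index arrays):
import numpy as np

def f(i, j):
    return 10 * i + j  # elementwise, so it works on whole index arrays

A = np.fromfunction(f, shape=(3, 4))
B = f(np.arange(3)[:, np.newaxis], np.arange(4))
print(np.array_equal(A, B))  # True: both fill the same (3, 4) table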

Efficient insertion of row into sorted DataFrame

My problem requires the incremental addition of rows into a sorted DataFrame (with a DateTimeIndex), but I'm currently unable to find an efficient way to do this. There doesn't seem to be any concept of an "insort".
I've tried appending the row and re-sorting in place, and I've also tried getting the insertion point with searchsorted and slicing and concatenating to create a new DataFrame. Both were "too slow".
Is Pandas just not suited to jobs where you don't have all the data at once and instead get your data incrementally?
Solutions I've tried:
Concatenation
import pandas

def insert_data(df, data, index):
    insertion_index = df.index.searchsorted(index)
    new_df = pandas.concat([df[:insertion_index], pandas.DataFrame(data, index=[index]), df[insertion_index:]])
    return new_df, insertion_index
Resorting
def insert_data(df, data, index):
    # note: DataFrame.append was removed in pandas 2.0; pandas.concat is the replacement
    new_df = df.append(pandas.DataFrame(data, index=[index]))
    new_df.sort_index(inplace=True)
    return new_df
pandas is built on numpy, and numpy arrays are fixed-size objects. While there are numpy append and insert functions, in practice they construct new arrays from the old and new data.
There are 2 practical approaches to incrementally defining these arrays:
initialize a large empty array, and fill in values incrementally
incrementally create a Python list (or dictionary), and create the array from the completed list.
Appending to a Python list is a common and fast task. There is also a list insert, but it is slower. For sorted inserts there are specialized Python structures (e.g. bisect).
Pandas may have added functions to deal with common creation scenarios. But unless it has coded something special in C it is unlikely to be faster than a more basic Python structure.
Even if you have to use Pandas features at various points along the incremental build, it may be best to create a new DataFrame on the fly from the underlying Python structure.
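A minimal sketch of that idea (the column name and timestamps below are made-up placeholders): keep the keys and rows in plain Python lists, use the bisect module to find each sorted insertion point, and build the DataFrame once at the end:
import bisect
import pandas as pd

keys, rows = [], []  # keys stay sorted; rows[i] corresponds to keys[i]

def insert_row(key, row):
    pos = bisect.bisect_left(keys, key)  # binary search for the insertion point
    keys.insert(pos, key)
    rows.insert(pos, row)

insert_row(pd.Timestamp('2024-01-02'), {'price': 101.0})
insert_row(pd.Timestamp('2024-01-01'), {'price': 100.0})

df = pd.DataFrame(rows, index=pd.DatetimeIndex(keys))  # built once, already sorted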

How to "simulate" numpy.delete during processing

For the sake of speeding up my algorithm that has numpy arrays with tens of thousands of elements, I'm wondering if I can reduce the time used by numpy.delete().
In fact, if I can just eliminate it?
I have an algorithm where I've got my array alpha.
And this is what I'm currently doing:
alpha = np.delete(alpha, 0)
beta = sum(alpha)
But why do I need to delete the first element? Is it possible to simply sum up the entire array using all elements except the first one? Will that reduce the time used in the deletion operation?
Avoid np.delete whenever possible. It returns a new array, which means that new memory has to be allocated and (almost) all the original data has to be copied into it. That's slow, so avoid it if possible.
beta = alpha[1:].sum()
should be much faster.
Note also that sum(alpha) is calling the Python builtin function sum. That's not the fastest way to sum items in a NumPy array.
alpha[1:].sum() calls the NumPy array method sum which is much faster.
Note that if you were calling np.delete(alpha, 0) in a loop, the code would be deleting more than just the first element from the original alpha. In that case, as Sven Marnach points out, it would be more efficient to compute all of the partial sums at once like this:
np.cumsum(alpha[:0:-1])[::-1]
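Here is a small self-contained check of that equivalence; the toy alpha is an assumption for illustration:
import numpy as np

alpha = np.arange(1, 6)  # hypothetical data: [1, 2, 3, 4, 5]

# loop version: repeatedly drop the first element and sum the rest
loop_sums = []
a = alpha.copy()
while len(a) > 1:
    a = np.delete(a, 0)
    loop_sums.append(a.sum())

# vectorised version: suffix sums excluding the first element
vec_sums = np.cumsum(alpha[:0:-1])[::-1]

print(np.array_equal(loop_sums, vec_sums))  # True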
