Finding the distinct elements of a column in numpy - python

I am using numpy to find the distinct elements in the first column of a numpy array, using the code below. I also looked at the np.unique method, but I couldn't find the proper function.
k = 0
c = 0
nonrep = []
for i in range(len(xin)):
    for j in range(len(nonrep)):
        if xin[i, 0] == nonrep[j]:
            c = c + 1
    if c == 0:
        nonrep.append(xin[i, 0])
    c = 0
I am sure I can do this better and faster using the numpy library. I would be glad if you could help me find a better and faster way to do it.

This is definitely not a good way to do it, since you perform the membership checks with a linear search. Furthermore, you do not even break after you have found the element. This makes it an O(n²) algorithm.
Using numpy: O(n log n), order not preserved
You can simply use:
np.unique(xin[:,0])
This works in O(n log n), since np.unique sorts the values. It is still not the most efficient approach.
Using pandas: O(n), order preserved
If you really need fast computations, you are better off using pandas:
import pandas as pd
pd.DataFrame(xin[:,0])[0].unique()
This works in O(n) (given the elements can be efficiently hashed) and furthermore preserves order. Here the result is again a numpy array.
As @B.M. says in their comment, you can avoid constructing a one-column DataFrame and construct a Series instead:
import pandas as pd
pd.Series(xin[:,0]).unique()
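As a quick illustration of the difference (the toy array below is made up): np.unique returns the distinct values in sorted order, while the pandas unique methods keep the order of first appearance.
import numpy as np
import pandas as pd

# made-up data standing in for xin[:, 0]
xs = np.array([3, 1, 3, 2, 1])

print(np.unique(xs))           # [1 2 3]  -> sorted, original order lost
print(pd.Series(xs).unique())  # [3 1 2]  -> order of first appearance kept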

Related

Creating table efficiently in numpy

I'm implementing the CKY algorithm. I tried it once with a list of lists in Python and once with numpy.zeros, and the list of lists is faster every time. I would expect numpy to be faster, but I am new to using it, and it is likely that I am not writing it in the most efficient way possible. It is also possible that I am just using a small dataset and lists of lists are simply faster on smaller datasets.
The only bit that differs is the initialization of the tables:
straight python:
table = [[[] for i in range(length)] for j in range(length)]
and numpy:
table = np.zeros((n_dimension, n_dimension), dtype=object)
for i in range(n_dimension):
    for j in range(n_dimension):
        table[i][j] = []
I think numpy is slower here because I am not building my table in an optimized way, while the plain Python version is about as efficient as it can be written. How can I make a similar, optimized implementation in numpy so that I can accurately assess the time difference between the two? Or are lists of lists just faster in this case? At what point does numpy become more efficient than lists of lists?
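A minimal timing sketch for comparing just the two initializations (the size n = 100 and the repetition count are made-up values); note that with dtype=object the numpy table only stores references to ordinary Python lists, so numpy cannot vectorize any of the per-cell work and the extra indexing machinery tends to show up as overhead.
import timeit
import numpy as np

n = 100  # made-up table size

def init_lists():
    # plain Python: one list per cell
    return [[[] for i in range(n)] for j in range(n)]

def init_numpy():
    # object array: each cell still holds a plain Python list
    table = np.zeros((n, n), dtype=object)
    for i in range(n):
        for j in range(n):
            table[i][j] = []
    return table

print("list of lists     :", timeit.timeit(init_lists, number=200))
print("numpy object array:", timeit.timeit(init_numpy, number=200))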

Broadcasting a multiplication across a pandas Panel

I have a pandas Panel that is long, wide, and shallow. In reality it's bigger, but for ease of example let's say it's 2x3x6:
panel=pd.Panel(pd.np.random.rand(2,3,6))
I have a Series that is the length of the shortest dimension - in this case 2:
series=pd.Series([0,1])
I want to multiply the panel by the series, by broadcasting the series across the two other axes.
Using panel.mul doesn't work, because it can only take Panels or DataFrames, I think:
panel.mul(series) # returns None
Using panel.apply(lambda x: x.mul(series), axis=0) works, but it seems to do the calculation separately for each of the series along the other axes, in this case 3x6=18 of them, but in reality more than 1m, and so it is extremely slow.
Using pd.np.multiply seems to require a very awkward construction:
pd.np.multiply(panel, pd.np.asarray(series)[:, pd.np.newaxis, pd.np.newaxis])
Is there an easier way?
I don't think there's anything wrong conceptually with your last way of doing it (and I can't think of an easier way). A more idiomatic way to write it would be
import numpy as np
panel.values * (series.values[:,np.newaxis,np.newaxis])
using values to return the underlying numpy arrays of the pandas objects.
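A small, self-contained sketch of the same broadcasting with plain numpy arrays (Panel has since been removed from pandas, so the arrays below just stand in for panel.values and series.values):
import numpy as np

panel_values = np.random.rand(2, 3, 6)  # stands in for panel.values
series_values = np.array([0.0, 1.0])    # stands in for series.values

# broadcast the length-2 series across the last two axes
result = panel_values * series_values[:, np.newaxis, np.newaxis]

print(result.shape)                             # (2, 3, 6)
print(np.allclose(result[0], 0.0))              # first slice scaled by 0
print(np.allclose(result[1], panel_values[1]))  # second slice scaled by 1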

Mandelbrot set in python using matplotlib + need some advice

This is my first post here, so I'm sorry if I didn't follow the rules.
I recently learned Python. I know the basics, and I like implementing famous sequences and plotting them; I've written code for the Hofstadter sequence and a logistic sequence and succeeded in both.
Now I've tried writing the Mandelbrot sequence without any complex-number machinery, actually doing it "by hand".
For example, if Z(n) is my complex variable (x + iy) and C my complex constant (c + ik), I write the sequence as {x(n) = x(n-1)^2 - y(n-1)^2 + c ; y(n) = 2*x(n-1)*y(n-1) + k}.
from math import sqrt
import matplotlib.pyplot as plt

def mandel(p, u):
    X = []  # real parts of the parameters that stay bounded
    Y = []  # imaginary parts of the parameters that stay bounded
    c = 5
    k = 5
    for i in range(p):
        c = 5
        k = k - 10 / p
        for n in range(p):
            c = c - 10 / p
            x = 0
            y = 0
            for m in range(u):
                # note: y is computed from the already-updated x here;
                # using a temporary for x would give the exact Mandelbrot recurrence
                x = x * x - y * y + c
                y = 2 * x * y + k
                if sqrt(x * x + y * y) > 2:
                    break
            if sqrt(x * x + y * y) < 2:
                X = X + [c]
                Y = Y + [k]
        print(round((i / p) * 100), "%")
    return plt.plot(X, Y, '.'), plt.show()
p is the width, i.e. the number of complex parameters I want along each axis, and u is the number of iterations.
This is what I get as a result (plot not shown):
I think it's getting close to what I want.
Now for my questions: how can I make the function faster, and how can I make it better?
Thanks a lot!
A good place to start would be to profile your code.
https://docs.python.org/2/library/profile.html
Using the cProfile module or the command-line profiler, you can find the inefficient parts of your code and try to optimize them. If I had to guess without profiling it myself, your list appending (X = X + [c] builds a brand new list every time) is probably inefficient.
You can either use a numpy array that is preallocated at an appropriate size, or, in pure Python, make a list of a given size (like 50) and fill it; when it fills up, extend your main list with it. This reduces the number of times the main container has to be rebuilt. The same can be done with a numpy array.
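A minimal sketch of the preallocation idea applied to this code (the grid size and the capacity p*p are assumptions based on the loops above): allocate the coordinate arrays up front, fill them with a running counter, and trim once at the end instead of growing a list point by point.
import numpy as np

p = 200                 # made-up grid size; worst case keeps every sampled point
X = np.empty(p * p)
Y = np.empty(p * p)
count = 0

# inside the loops, instead of X = X + [c]:
#     X[count] = c
#     Y[count] = k
#     count += 1

# trim to the points actually stored before plotting
X = X[:count]
Y = Y[:count]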
Some quick things you could do, though:
if sqrt(x*x+y*y)>2:
should become this
if x*x+y*y>4:
Remove calls to sqrt if you can; it's faster to square the other side of the comparison instead. Multiplication is cheaper than taking roots.
Another thing you could do is this.
print (round((i/p)*100),"%")
should become this
# print (round((i/p)*100),"%")
You want faster code? Then remove things not related to actually computing and plotting the set.
Also, you break out of the for loop after a comparison and then make the same comparison again. Do what you want right after the comparison and then break; there is no need to compute it twice.
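Putting those suggestions together, a minimal sketch of how the per-point test could look (not the original poster's code): the escape test compares the squared magnitude against 4 instead of calling sqrt, the early return means the test is evaluated only once, and the tuple assignment updates x and y from the previous values together, which is the exact recurrence noted in the code comment above.
def mandel_point(c, k, u):
    # iterate z -> z^2 + C for a single parameter C = c + i*k
    x = 0.0
    y = 0.0
    for m in range(u):
        x, y = x * x - y * y + c, 2 * x * y + k
        if x * x + y * y > 4:   # same test as sqrt(x*x + y*y) > 2, without the sqrt
            return False        # escaped, so C is not in the set
    return True                 # never escaped within u iterations

# usage inside the double loop over the grid:
# if mandel_point(c, k, u):
#     X.append(c)
#     Y.append(k)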

Efficient insertion of row into sorted DataFrame

My problem requires the incremental addition of rows into a sorted DataFrame (with a DateTimeIndex), but I'm currently unable to find an efficient way to do this. There doesn't seem to be any concept of an "insort".
I've tried appending the row and re-sorting in place, and I've also tried getting the insertion point with searchsorted and then slicing and concatenating to create a new DataFrame. Both are "too slow".
Is Pandas just not suited to jobs where you don't have all the data at once and instead get your data incrementally?
Solutions I've tried:
Concatenation
import pandas

def insert_data(df, data, index):
    insertion_index = df.index.searchsorted(index)
    new_df = pandas.concat([df[:insertion_index], pandas.DataFrame(data, index=[index]), df[insertion_index:]])
    return new_df, insertion_index
Resorting
def insert_data(df, data, index):
    new_df = df.append(pandas.DataFrame(data, index=[index]))
    new_df.sort_index(inplace=True)
    return new_df
pandas is built on numpy, and numpy arrays are fixed-size objects. While there are numpy append and insert functions, in practice they construct new arrays from the old and new data.
There are two practical approaches to incrementally building these arrays:
initialize a large empty array, and fill in values incrementally
incrementally create a Python list (or dictionary), and create the array from the completed list.
Appending to a Python list is a common and fast task. There is also a list insert, but it is slower. For sorted inserts there are specialized Python structures (e.g. bisect).
Pandas may have added functions to deal with common creation scenarios. But unless it has coded something special in C it is unlikely to be faster than a more basic Python structure.
Even if you have to use Pandas features at various points along the incremental build, it might be best to create a new DataFrame on the fly from the underlying Python structure.
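A minimal sketch of that approach (the column name and timestamps are made up): keep the sorted index values and the row data in plain Python lists, insert with the bisect module, and only build the DataFrame once the incremental phase is done.
import bisect
import pandas as pd

keys = []   # sorted index values (e.g. timestamps)
rows = []   # row data, kept parallel to keys

def insert_row(key, row):
    # find the insertion point in the sorted keys and insert into both lists
    pos = bisect.bisect(keys, key)
    keys.insert(pos, key)
    rows.insert(pos, row)

insert_row(pd.Timestamp("2017-01-02"), {"price": 101.0})
insert_row(pd.Timestamp("2017-01-01"), {"price": 100.0})
insert_row(pd.Timestamp("2017-01-03"), {"price": 99.5})

# build the sorted DataFrame once, at the end
df = pd.DataFrame(rows, index=pd.DatetimeIndex(keys))
print(df)
list.insert is still O(n) per call, but it is a plain memory move rather than a full DataFrame rebuild, which is why it tends to be so much cheaper in practice.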

How to "simulate" numpy.delete during processing

For the sake of speeding up my algorithm that has numpy arrays with tens of thousands of elements, I'm wondering if I can reduce the time used by numpy.delete().
In fact, if I can just eliminate it?
I have an algorithm where I've got my array alpha.
And this is what I'm currently doing:
alpha = np.delete(alpha, 0)
beta = sum(alpha)
But why do I need to delete the first element? Is it possible to simply sum up the entire array using all elements except the first one? Will that reduce the time used in the deletion operation?
Avoid np.delete whenever possible. It returns a new array, which means that new memory has to be allocated and (almost) all of the original data has to be copied into the new array. That's slow, so avoid it if possible.
beta = alpha[1:].sum()
should be much faster.
Note also that sum(alpha) is calling the Python builtin function sum. That's not the fastest way to sum items in a NumPy array.
alpha[1:].sum() calls the NumPy array method sum which is much faster.
Note that if you were calling np.delete(alpha, 0) in a loop, then the code may be deleting more than just the first element from the original alpha. In that case, as Sven Marnach points out, it would be more efficient to compute all the partial sums at once, like this:
np.cumsum(alpha[:0:-1])[::-1]
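A tiny demonstration of both points on a made-up array: slicing plus the array's own sum method replaces the delete-then-sum pattern, and the reversed cumulative sum gives every "sum after dropping the first i elements" in a single pass.
import numpy as np

alpha = np.array([1, 2, 3, 4])

# sum of everything except the first element, without np.delete
beta = alpha[1:].sum()
print(beta)      # 9

# all the partial sums at once: [sum(alpha[1:]), sum(alpha[2:]), sum(alpha[3:])]
partial = np.cumsum(alpha[:0:-1])[::-1]
print(partial)   # [9 7 4]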
