Broadcasting a multiplication across a pandas Panel - python

I have a pandas Panel that is long, wide, and shallow. In reality it's bigger, but for ease of example, let's say it's 2x3x6:
panel = pd.Panel(pd.np.random.rand(2, 3, 6))
I have a Series that is the length of the shortest dimension, in this case 2:
series = pd.Series([0, 1])
I want to multiply the panel by the series, by broadcasting the series across the two other axes.
Using panel.mul doesn't work, because that can only take Panels or DataFrames, I think:
panel.mul(series)  # returns None
Using panel.apply(lambda x: x.mul(series), axis=0) works, but it seems to do the calculation separately for every combination of the other two axes (3x6=18 here, but over a million in reality), and so is extremely slow.
Using pd.np.multiply seems to require a very awkward construction:
pd.np.multiply(panel, pd.np.asarray(series)[:, pd.np.newaxis, pd.np.newaxis])
Is there an easier way?

I don't think there's anything wrong conceptually with your last way of doing it (and I can't think of an easier way). A more idiomatic way to write it would be:
import numpy as np
panel.values * series.values[:, np.newaxis, np.newaxis]
using .values to return the underlying numpy arrays of the pandas objects.
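For completeness, a minimal sketch of the whole round trip, assuming a pandas version that still ships Panel (it was removed in pandas 0.25): multiply via the underlying arrays, then wrap the result back into a Panel with the original labels.
import numpy as np
import pandas as pd

panel = pd.Panel(np.random.rand(2, 3, 6))
series = pd.Series([0, 1])

# Add two trailing axes so the series broadcasts along items (axis 0).
scaled = panel.values * series.values[:, np.newaxis, np.newaxis]

# Wrap the raw ndarray back into a Panel, reusing the original labels.
result = pd.Panel(scaled, items=panel.items,
                  major_axis=panel.major_axis, minor_axis=panel.minor_axis)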

Related

finding different element in column numpy

I am using numpy to find the distinct elements in the first column of a numpy array. I am using the code below; I also looked at the np.unique method, but I couldn't find the proper function.
k = 0
c = 0
nonrep = []
for i in range(len(xin)):
    for j in range(len(nonrep)):
        if xin[i, 0] == nonrep[j]:
            c = c + 1
    if c == 0:
        nonrep.append(xin[i, 0])
    c = 0
I am sure I can do this better and faster using the numpy library. I will be glad if you help me find a better and faster way to do this.
This is definitely not a good way to do it, since you perform membership checks by linear search. Furthermore, you do not even break after you have found the element. This makes it an O(n²) algorithm.
Using numpy: O(n log n), order not preserved
You can simply use:
np.unique(xin[:,0])
This will work in O(n log n). This is still not the most efficient approach.
Using pandas: O(n), order preserved
If you really need fast computations, you are better off using pandas:
import pandas as pd
pd.DataFrame(xin[:,0])[0].unique()
This works in O(n) (given the elements can be efficiently hashed) and furthermore preserves order. Here the result is again a numpy array.
Like @B.M. says in their comment, you can avoid constructing a one-column DataFrame and construct a Series instead:
import pandas as pd
pd.Series(xin[:,0]).unique()
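For concreteness, a small demonstration of the two approaches (xin here is a made-up two-column array):
import numpy as np
import pandas as pd

xin = np.array([[3, 0], [1, 0], [3, 0], [2, 0]])

np.unique(xin[:, 0])           # array([1, 2, 3]) -- sorted, original order lost
pd.Series(xin[:, 0]).unique()  # array([3, 1, 2]) -- first-seen order preserved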

Fastest way to initialize numpy array with values given by function

I am mainly interested in (d1, d2)-shaped numpy arrays (matrices), but the question makes sense for arrays with more axes. I have a function f(i, j) and I'd like to initialize an array by some operation of this function:
A = np.empty((d1, d2))
for i in range(d1):
    for j in range(d2):
        A[i, j] = f(i, j)
This is readable and works, but I am wondering if there is a faster way, since my array A will be very large and I have to optimize this bit.
One way is to use np.fromfunction. Your code can be replaced with the line:
np.fromfunction(f, shape=(d1, d2))
Note that np.fromfunction calls f once, passing it whole arrays of indices rather than one pair at a time, so f must be written in terms of vectorized operations; that is what makes it quite a bit faster than Python for loops for larger arrays.
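As a sketch, with a deliberately simple f that works on index arrays:
import numpy as np

def f(i, j):
    # f receives index arrays, not scalars, so it must use
    # array-compatible operations.
    return i * 10 + j

A = np.fromfunction(f, shape=(3, 4))  # A[i, j] == i*10 + j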
Alternatively, you can rely on broadcasting:
a = np.arange(d1)
b = np.arange(d2)
A = f(a[:, np.newaxis], b)
Note that if f cannot broadcast the index arrays itself, you can create a full meshgrid instead (indexing='ij' keeps the result (d1, d2)-shaped):
X, Y = np.meshgrid(a, b, indexing='ij')
A = f(X, Y)
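A quick sanity check that the broadcasting and meshgrid versions agree, using the same toy f as above:
import numpy as np

def f(i, j):
    return i * 10 + j  # toy function; any vectorized f works

d1, d2 = 3, 4
a = np.arange(d1)
b = np.arange(d2)

A = f(a[:, np.newaxis], b)               # broadcast the index vectors
X, Y = np.meshgrid(a, b, indexing='ij')  # or build explicit index grids
B = f(X, Y)

assert np.array_equal(A, B)  # both give the same (d1, d2) matrix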

creating numpy structured arrays from columns

If I want to create a numpy array with dtype = [('index','<u4'),('valid','b1')], and I have separate numpy arrays for the 32-bit index and boolean valid values, how can I do it?
I don't see a way in the numpy.ndarray constructor; I know I can do this:
arr = np.zeros(n, dtype = [('index','<u4'),('valid','b1')])
arr['index'] = indices
arr['valid'] = validity
but somehow calling np.zeros() first seems wrong.
Any suggestions?
An alternative is
arr = np.fromiter(zip(indices, validity), dtype=[('index','<u4'),('valid','b1')])
but I suspect your initial idea is more efficient. (In your approach, you could use np.empty() instead of np.zeros() for a tiny performance benefit.)
Just use empty instead of zeros, and it should feel less 'wrong', since you are just allocating the data without unnecessarily zeroing it.
Or use fromiter, and also pass in the optional count argument if you're keen on performance.
This is in any case a matter of taste in more than 99% of the use cases, and won't lead to any noticeable performance improvement IMHO.
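Putting the suggestions together, a minimal sketch with made-up column data:
import numpy as np

dt = [('index', '<u4'), ('valid', 'b1')]
indices = np.array([7, 8, 9], dtype='<u4')   # hypothetical column data
validity = np.array([True, False, True])

# empty + per-field assignment: allocate without zeroing, then fill.
arr = np.empty(len(indices), dtype=dt)
arr['index'] = indices
arr['valid'] = validity

# fromiter with count, so numpy can preallocate the result in one go.
arr2 = np.fromiter(zip(indices, validity), dtype=dt, count=len(indices))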

Apply function to pandas Series with argument (which varies for every element)

I have a pandas Series and a function that I want to apply to each element of the Series. The function has an additional argument too. So far so good: see, for example, python pandas: apply a function with arguments to a series.
Update: what about if the argument varies by itself, running over a given list?
I had to face this problem in my code and I found a straightforward solution, but it is quite specific and (even worse) does not use the apply method.
Here is a toy model code:
a = pd.DataFrame({'x': [1, 2]})
t = [10, 20]
I want to multiply the elements in a['x'] by the elements in t. Here the function is quite simple, and len(t) matches len(a['x'].index), so I could just do:
a['t'] = t
a['x*t'] = a['x'] * a['t']
But what about if the function is more elaborate or the two lengths do not match?
What I would like is a single line like:
a['x'].apply(lambda x, y: x*y, args=t)
The point is that this specific line exits with an error, because the args parameter only accepts a fixed tuple of extra arguments that is passed identically to every call. I do not see any 'place' to put the varying elements of t.
What you're looking for is similar to what R calls "recycling", where operations on arrays of unequal length loop through the smaller array over and over, as many times as needed to match the length of the longer array.
I'm not aware of any simple, built-in way to do this with numpy or pandas. What you can do is use np.tile to repeat your smaller array. Something like:
a.x * np.tile(t, len(a) // len(t))
This will only work if the longer array's length is a whole multiple of the shorter one's.
The behavior you want is somewhat unusual. Depending on what you're doing, there may be a better way to handle it. Relying on the values to match up in the desired way just by repetition is a little fragile. If you have some way to match up the values in each array that you want to multiply, you could use the .map method of Series to select the right "other value" to multiply each element of your Series with.
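For illustration, a sketch of both suggestions on the toy data; the 'key' column is an invented way of matching each row to an element of t:
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 1, 2]})
t = [10, 20]

# Recycling via np.tile (requires len(a) to be a multiple of len(t)).
a['x*t'] = a['x'] * np.tile(t, len(a) // len(t))

# Explicit matching via .map: each row names the element of t it wants.
a['key'] = [0, 1, 0, 1]                    # hypothetical matching column
a['x*t_mapped'] = a['x'] * a['key'].map({0: 10, 1: 20})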

Efficiently processing DataFrame rows with a Python function?

In many places in our Pandas-using code, we have some Python function process(row). That function is applied over DataFrame.iterrows(), taking each row, doing some processing, and returning a value, which we ultimately collect into a new Series.
I realize this usage pattern circumvents most of the performance benefits of the numpy / Pandas stack.
What would be the best way to make this usage pattern as efficient
as possible?
Can we possibly do it without rewriting most of our code?
Another aspect of this question: can all such functions be converted to a numpy-efficient representation? I've much to learn about the numpy / scipy / Pandas stack, but it seems that for truly arbitrary logic, you may sometimes need to just use a slow pure Python architecture like the one above. Is that the case?
You should apply your function along axis=1. The function will receive a row as an argument, and anything it returns will be collected into a new Series object:
df.apply(your_function, axis=1)
Example:
>>> df = pd.DataFrame({'a': np.arange(3),
...                    'b': np.random.rand(3)})
>>> df
   a         b
0  0  0.880075
1  1  0.143038
2  2  0.795188
>>> def func(row):
...     return row['a'] + row['b']
...
>>> df.apply(func, axis=1)
0    0.880075
1    1.143038
2    2.795188
dtype: float64
As for the second part of the question: row-wise operations, even optimised ones using pandas apply, are not the fastest solution there is. They are certainly a lot faster than a Python for loop, but not the fastest. You can test that by timing the operations, and you'll see the difference.
Some operations can be converted to column-oriented ones (the one in my example could easily be converted to just df['a'] + df['b']), but others cannot, especially if you have a lot of branching, special cases or other logic that should be performed on your rows. In that case, if apply is too slow for you, I would suggest "Cython-izing" your code. Cython plays really nicely with the NumPy C API and will give you the maximal speed you can achieve.
Or you can try numba. :)
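As one illustration of converting branchy row logic into column-oriented code (a sketch of the idea, not the only way), np.select evaluates each condition on whole columns at once:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5), 'b': np.random.rand(5)})

def process(row):
    # row-wise version with branching
    if row['a'] > 2:
        return row['b'] * 2
    elif row['a'] > 0:
        return row['b'] + 1
    return 0.0

slow = df.apply(process, axis=1)

# Column-oriented equivalent: conditions and choices are whole columns.
fast = pd.Series(np.select([df['a'] > 2, df['a'] > 0],
                           [df['b'] * 2, df['b'] + 1],
                           default=0.0),
                 index=df.index)

assert slow.equals(fast)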
