Convert a pandas Series of lists into a numpy array - python

I want to convert a pandas Series of strings, each representing a list of numbers, into a numpy array. What I have is something like:
ds = pd.Series(['[1 -2 0 1.2 4.34]', '[3.3 4 0 -1 9.1]'])
My desired output:
arr = np.array([[1, -2, 0, 1.2, 4.34], [3.3, 4, 0, -1, 9.1]])
What I have done so far is to convert the pandas Series to a Series of a list of numbers as:
ds1 = ds.apply(lambda x: [float(number) for number in x.strip('[]').split(' ')])
but I don't know how to go from ds1 to arr.

Use Series.str.strip + Series.str.split and create a new np.array with dtype=float:
arr = np.array(ds.str.strip('[]').str.split().tolist(), dtype='float')
Result:
arr
array([[ 1.  , -2.  ,  0.  ,  1.2 ,  4.34],
       [ 3.3 ,  4.  ,  0.  , -1.  ,  9.1 ]])
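If you already have ds1 from the question (a Series of Python lists), the remaining step is just to stack those lists into one array; a minimal sketch, assuming every list has the same length:
arr = np.array(ds1.tolist(), dtype=float)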

You can remove the "[]" from the Series object first; then the rest becomes easier (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html).
ds1 = ds.str.strip("[]")
# split and expand the data, convert to numpy array
arr = ds1.str.split(" ", expand=True).to_numpy(dtype=float)
Then arr will be in the format you want:
array([[ 1.  , -2.  ,  0.  ,  1.2 ,  4.34],
       [ 3.3 ,  4.  ,  0.  , -1.  ,  9.1 ]])
I also did a little profiling in comparison with Shubham's solution.
# Shubham's way
%timeit arr = np.array(ds.str.strip('[]').str.split().tolist(), dtype='float')
332 µs ± 5.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# my way
%timeit ds.str.strip("[]").str.split(" ", expand=True).to_numpy(dtype=float)
741 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Obviously, his solution is much faster! Cheers!

Sorting by subsampling every nth element in numpy array?

I am trying to sample every nth element to sort an array. My current solution works, but it feels like there should be a solution that does not involve concatenation.
My current implementation is as follows.
arr = np.arange(10)
print(arr)
[0 1 2 3 4 5 6 7 8 9]
# sample every 5th element
res = np.empty(shape=0)
for i in range(5):
    res = np.concatenate([res, arr[i::5]])
print(res)
[0. 5. 1. 6. 2. 7. 3. 8. 4. 9.]
Looking for any tips to make this faster/more pythonic. My use case is with an array of ~10,000 values.
Reshape your vector into a 2D array with N elements per row, and then flatten it column-wise:
import numpy as np
# Pick "subsample stride"
N = 5
# Create a vector with length divisible by N.
arr = np.arange(2 * N)
print(arr)
# Reshape arr into a 2D array with N elements per row and however many
# columns required.
# Flatten it with "F" ordering for "Fortran style" (column-major).
output = arr.reshape(-1, N).flatten("F")
print(output)
outputs
[0 1 2 3 4 5 6 7 8 9]
[0 5 1 6 2 7 3 8 4 9]
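The same column-major read can also be written as a transpose followed by a C-order ravel, which some find more explicit; a small equivalent sketch, still assuming the length is divisible by N:
output = arr.reshape(-1, N).T.ravel()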
Performance comparison
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
In [2]: def sol0(arr):
   ...:     """OP's original solution."""
   ...:     res = np.empty(shape=0)
   ...:     for i in range(5):
   ...:         res = np.concatenate([res, arr[i::5]])
   ...:     return res
   ...:
In [3]: def sol1(arr):
   ...:     """This answer's solution."""
   ...:     return arr.reshape(-1, 5).flatten("F")
   ...:
In [4]: def sol2(arr):
   ...:     """#seralouk's solution, with shape error patch"""
   ...:     res = np.empty((5, arr.size//5), order='F')
   ...:     for i in range(5):
   ...:         res[i::5] = arr[i::5]
   ...:     return res.reshape(-1)
In [5]: arr = np.arange(10_000)
In [6]: %timeit sol0(arr)
26.6 µs ± 724 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit sol1(arr)
7.81 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit sol2(arr)
36.3 µs ± 841 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Apply a function to multiple vectors stored in a numpy array without loops

I have a function that takes two vectors of the same size, does some calculations with them, and returns a third vector. Now consider that I have a multi-dimensional array containing many vectors that I want to pass to my function as the first argument, and a fixed vector that I want to pass as the second argument. Below is an example. Is there a way to simplify the code by removing the loops?
def foo(x, y):
    result = np.zeros(x.shape)
    for i in range(y.size-1):
        result[i+1] = result[i] + (x[i+1] + x[i]) / (y[i+1] + y[i])
    return result
a = np.arange(2*3*4).reshape(2,3,4)
b = np.arange(4)*10
c = np.ones(a.shape)*-9999.
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        c[i, j, :] = foo(a[i, j, :], b)
Thanks for your help!
EDIT: Below is the real function I'm trying to implement.
def press2alt(press, Tv):
    """
    Convert pressure level to altitude level with hydrostatic atmosphere calculation (hypsometric
    equation). The altitude reference (z = 0 km) is taken for the largest pressure level.
    :param press: pressure level profile [hPa]
    :param Tv: virtual temperature profile [K]
    :return: altitude level profile [km]
    """
    press_c = np.copy(press)
    if press[0] < press[1]:
        press_c = press_c[::-1]  # high press first
        Tv = Tv[::-1]
    alt = np.zeros(press_c.size)
    for i in range(alt.size-1):
        alt[i+1] = alt[i] + DRY_AIR_GAS_CST/STD_GRAV_ACC*(Tv[i+1]-Tv[i])* \
            (np.log(press_c[i])-np.log(press_c[i+1]))/(np.log(Tv[i+1])-np.log(Tv[i]))
    if press[0] < press[1]:
        alt = alt[::-1]
    return alt
# Get altitude at each pressure level
z = np.ones(tv.shape)*FILL_VALUE_FLOAT
for i_month in range(tv.shape[0]):
    for i_lat in range(tv.shape[2]):
        for i_lon in range(tv.shape[3]):
            z[i_month, :, i_lat, i_lon] = \
                press2alt(pressure_level, tv[i_month, :, i_lat, i_lon])
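As a side note, a hedged sketch of how press2alt itself could be called without the explicit Python loops, using np.vectorize with a signature (the same idea applied to foo in the answer below). It assumes pressure_level has shape (n_levels,) and tv has shape (n_months, n_levels, n_lat, n_lon) as in the loops above, and it removes the loops without necessarily making anything faster:
# sketch only: map press2alt over every (month, lat, lon) profile at once
press2alt_vec = np.vectorize(press2alt, signature="(n),(n)->(n)")
# move the level axis to the end, apply, then move it back to position 1
z = np.moveaxis(press2alt_vec(pressure_level, np.moveaxis(tv, 1, -1)), -1, 1)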
The c from your sample (which you should have shown) is:
In [164]: c
Out[164]:
array([[[0. , 0.1 , 0.2 , 0.3 ],
[0. , 0.9 , 1.26666667, 1.52666667],
[0. , 1.7 , 2.33333333, 2.75333333]],
[[0. , 2.5 , 3.4 , 3.98 ],
[0. , 3.3 , 4.46666667, 5.20666667],
[0. , 4.1 , 5.53333333, 6.43333333]]])
np.vectorize with signature turns out to be easier to use than I first thought:
In [165]: f = np.vectorize(foo, signature="(n),(n)->(n)")
In [166]: f(a, b)
Out[166]:
array([[[0. , 0.1 , 0.2 , 0.3 ],
[0. , 0.9 , 1.26666667, 1.52666667],
[0. , 1.7 , 2.33333333, 2.75333333]],
[[0. , 2.5 , 3.4 , 3.98 ],
[0. , 3.3 , 4.46666667, 5.20666667],
[0. , 4.1 , 5.53333333, 6.43333333]]])
But vectorize does not improve speed:
In [167]: %%timeit
     ...: c = np.ones(a.shape)*-9999.
     ...: for i in range(a.shape[0]):
     ...:     for j in range(a.shape[1]):
     ...:         c[i, j, :] = foo(a[i, j, :], b)
     ...:
57 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [168]: timeit f(a, b)
206 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In other recent cases I've found that vectorize does improve in relative performance with larger arrays, but that wasn't with signature.
The function can be rewritten to accept arrays of any size, as long as the iteration on the last dimension is correct. Basically I use ... in the indexing:
def myfoo(a, b):
    result = np.zeros(a.shape)
    for i in range(a.shape[-1] - 1):
        result[..., i + 1] = result[..., i] + (a[..., i + 1] +
                                               a[..., i]) / (b[..., i + 1] + b[..., i])
    return result
In [182]: myfoo(a, b)
Out[182]:
array([[[0. , 0.1 , 0.2 , 0.3 ],
[0. , 0.9 , 1.26666667, 1.52666667],
[0. , 1.7 , 2.33333333, 2.75333333]],
[[0. , 2.5 , 3.4 , 3.98 ],
[0. , 3.3 , 4.46666667, 5.20666667],
[0. , 4.1 , 5.53333333, 6.43333333]]])
In [183]: timeit myfoo(a, b)
65.8 µs ± 483 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
This doesn't help with speed, possibly because the last axis, size 4, is comparable to the 2*3 iterations of the first. I expect it will do better, relatively, if the initial dimensions get much larger.
We may be able to improve speed by replacing the i iteration on:
(a[..., i + 1] + a[...,i]) / ( b[..., i + 1] + b[..., i])
with
(a[...,1:]+a[...,:-1])/(b[...,1:]+b[...,:-1])
edit
In [192]: ab = (a[..., 1:] + a[..., :-1]) / (b[..., 1:] + b[..., :-1])
In [193]: ab
Out[193]:
array([[[0.1 , 0.1 , 0.1 ],
[0.9 , 0.36666667, 0.26 ],
[1.7 , 0.63333333, 0.42 ]],
[[2.5 , 0.9 , 0.58 ],
[3.3 , 1.16666667, 0.74 ],
[4.1 , 1.43333333, 0.9 ]]])
In [194]: ab.cumsum(axis=2)
Out[194]:
array([[[0.1 , 0.2 , 0.3 ],
[0.9 , 1.26666667, 1.52666667],
[1.7 , 2.33333333, 2.75333333]],
[[2.5 , 3.4 , 3.98 ],
[3.3 , 4.46666667, 5.20666667],
[4.1 , 5.53333333, 6.43333333]]])
Those are the values - except for the leading 0's column.
In [195]: timeit ((a[..., 1:] + a[..., :-1]) / (b[..., 1:] + b[..., :-1])).cumsum(axis=2)
18.8 µs ± 36.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
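To recover the leading column of zeros and match foo's output exactly, one option is to prepend a zero slab to the cumulative sum; a small sketch:
ab = (a[..., 1:] + a[..., :-1]) / (b[..., 1:] + b[..., :-1])
zeros = np.zeros(a.shape[:-1] + (1,))
c_fast = np.concatenate([zeros, ab.cumsum(axis=-1)], axis=-1)  # same values as the looped foo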

How can I make a distance matrix with my own metric without using a loop?

I have an np.array like this:
[[ 1.3 , 2.7 , 0.5 , NaN , NaN],
[ 2.0 , 8.9 , 2.5 , 5.6 , 3.5],
[ 0.6 , 3.4 , 9.5 , 7.4 , NaN]]
And a function to compute the distance between two rows:
def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())
I need all pairwise distances, and I don't want to use a loop. How do I do that?
Leveraging broadcasting -
def manhattan_nan(a):
    s = np.nansum(np.abs(a[:,None,:] - a), axis=-1)
    m = ~np.isnan(a)
    k = m.sum(1)
    r = a.shape[1]/np.minimum.outer(k,k)
    out = s*r
    return out
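A quick sanity check, assuming the question's 3x5 sample array is stored in a (as in the benchmark below); the result is symmetric with zeros on the diagonal, and the off-diagonal entries match nan_manhattan applied to each pair of rows:
out = manhattan_nan(a)
# array([[ 0.        , 14.83333333, 17.33333333],
#        [14.83333333,  0.        , 19.625     ],
#        [17.33333333, 19.625     ,  0.        ]])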
Benchmarking
From OP's comments, the use-case seems to be a tall array. Let's reproduce one for benchmarking, re-using the given sample data:
In [2]: a
Out[2]:
array([[1.3, 2.7, 0.5, nan, nan],
[2. , 8.9, 2.5, 5.6, 3.5],
[0.6, 3.4, 9.5, 7.4, nan]])
In [3]: a = np.repeat(a, 100, axis=0)
# #Dani Mesejo's soln
In [4]: %timeit pdist(a, nan_manhattan)
1.02 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Naive for-loop version
In [18]: n = a.shape[0]
In [19]: %timeit [[nan_manhattan(a[i], a[j]) for i in range(j+1,n)] for j in range(n)]
991 ms ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With broadcasting
In [9]: %timeit manhattan_nan(a)
8.43 ms ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use pdist:
import numpy as np
from scipy.spatial.distance import pdist, squareform
def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())
arr = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
                [2.0, 8.9, 2.5, 5.6, 3.5],
                [0.6, 3.4, 9.5, 7.4, np.nan]])
result = squareform(pdist(arr, nan_manhattan))
print(result)
Output
[[ 0. 14.83333333 17.33333333]
[14.83333333 0. 19.625 ]
[17.33333333 19.625 0. ]]

Pandas nunique equivalent with NumPy [duplicate]

This question already has answers here:
Number of unique elements per row in a NumPy array
(4 answers)
Closed 3 years ago.
Is there a numpy equivalent of pandas nunique, applied row-wise? I checked out np.unique with return_counts but it doesn't seem to return what I want. For example:
a = np.array([[120.52971, 75.02052, 128.12627], [119.82573, 73.86636, 125.792],
              [119.16805, 73.89428, 125.38216], [118.38071, 73.35443, 125.30198],
              [118.02871, 73.689514, 124.82088]])
uniqueColumns, occurCount = np.unique(a, axis=0, return_counts=True) ## axis=0 row-wise
The results:
>>> occurCount
array([1, 1, 1, 1, 1], dtype=int64)
I should be expecting all 3 as opposed to all 1.
The workaround, of course, is to convert to pandas and call nunique, but there is a speed issue, and I want to explore a pure numpy implementation to speed things up. I am working with large dataframes, so I am hoping to find speedups wherever I can. I am open to other solutions too.
We can use some sorting and consecutive differences -
a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
For some perf. boost, we can use slicing to replace np.diff -
a_s = np.sort(a,axis=1)
out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
If you want to introduce some tolerance value for checking unique-ness, we can use np.isclose -
a.shape[1]-(np.isclose(np.diff(np.sort(a,axis=1),axis=1),0)).sum(1)
Sample run -
In [51]: import pandas as pd
In [48]: a
Out[48]:
array([[120.52971 , 120.52971 , 128.12627 ],
[119.82573 , 73.86636 , 125.792 ],
[119.16805 , 73.89428 , 125.38216 ],
[118.38071 , 118.38071 , 118.38071 ],
[118.02871 , 73.689514, 124.82088 ]])
In [49]: pd.DataFrame(a).nunique(axis=1).values
Out[49]: array([2, 3, 3, 1, 3])
In [50]: a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
Out[50]: array([2, 3, 3, 1, 3])
Timings on a simplistic case with random numbers and at least 2 unique numbers per row -
In [41]: np.random.seed(0)
...: a = np.random.rand(10000,5)
...: a[:,-1] = a[:,0]
In [42]: %timeit pd.DataFrame(a).nunique(axis=1).values
...: %timeit a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
1.31 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
758 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [43]: %%timeit
...: a_s = np.sort(a,axis=1)
...: out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
694 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
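For repeated use, the sliced-comparison version can be wrapped in a small helper; a minimal sketch (the function name is mine):
def nunique_rows(a):
    # number of distinct values per row (exact equality, no tolerance)
    a_s = np.sort(a, axis=1)
    return a.shape[1] - (a_s[:, :-1] == a_s[:, 1:]).sum(1)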

Weighted average element-wise between two arrays

I have two arrays of numbers and I want to compute a weighted average element-wise between these arrays and store it in a new array.
The solution I use for now is:
import numpy as np
array_1 = [0,1,2,3,4]
array_2 = [2,3,4,5,6]
weight_1 = 0.5
weight_2 = 0.5
array_3 = np.zeros(len(array_1))
for i in range(0, len(array_1)):
    array_3[i] = np.average(a=[array_1[i], array_2[i]], weights=[weight_1, weight_2])
print(array_3)
>> [1. 2. 3. 4. 5.]
The problem is that it is not really efficient. How can I do this more efficiently?
Just use NumPy's vectorised operations. To do so, first convert your lists to arrays, then multiply each array by the respective weight and take the sum:
import numpy as np
array_1 = np.array([0,1,2,3,4])
array_2 = np.array([2,3,4,5,6])
weight_1 = 0.5
weight_2 = 0.5
array_3 = weight_1*array_1 + weight_2*array_2
# array([1., 2., 3., 4., 5.])
A direct NumPy solution using np.average would be the following, where axis=0 means the average is taken across the two stacked rows (element-wise along the columns). np.vstack() simply stacks the two arrays vertically.
np.average(np.vstack((array_1, array_2)), axis=0, weights=[weight_1, weight_2])
As pointed out by #yatu, you can also pass a list of your arrays and specify the axis
np.average([array_1, array_2], axis=0, weights=[weight_1, weight_2])
Timing comparison inspired by the comments on #yatu's answer: as you can see, the list comprehension with zip is slightly faster here, but only because the arrays are small; for large arrays, the vectorised solutions will take over.
Devesh's method
%timeit result = [ item1 * weight_1 + item2 * weight_2 for item1, item2 in zip(array_1, array_2)]
# 25.5 µs ± 3.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.average([array_1, array_2], axis=0, weights=[weight_1, weight_2])
# 42.9 µs ± 2.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.average(np.vstack((array_1, array_2)), axis=0, weights=[weight_1, weight_2])
# 44.8 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can zip both lists and multiply each element by the corresponding weight:
array_1 = [0,1,2,3,4]
array_2 = [2,3,4,5,6]
weight_1 = 0.5
weight_2 = 0.5
#Zip both iterators and multiply weight with corresponding item
result = [ item1 * weight_1 + item2 * weight_2 for item1, item2 in zip(array_1, array_2)]
print(result)
The output will be
[1.0, 2.0, 3.0, 4.0, 5.0]
Given that you're using NumPy, you can easily vectorize this by doing:
array_1 = np.array([0,1,2,3,4])
array_2 = np.array([2,3,4,5,6])
weight_1 = 0.5
weight_2 = 0.5
array_1*weight_1 + array_2*weight_2
# array([1., 2., 3., 4., 5.])
Can this be generalised for multiple arrays and weights?
For a more generalizable answer, the best way is to use np.average, which accepts an array_like both for the arrays and the weights to be applied to each of these:
np.average([array_1, array_2], weights=[weight_1, weight_2], axis=0)
# array([1., 2., 3., 4., 5.])
Shouldn't the equation for the weighted average be:
(array_1*weight_1 + array_2*weight_2)/(weight_1 + weight_2)
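For what it's worth, np.average already divides by the sum of the weights, so it stays correct even when the weights do not sum to 1; a small sketch with unequal weights (values assume the arrays from the question):
np.average([array_1, array_2], axis=0, weights=[1, 3])
# (1*array_1 + 3*array_2) / 4 -> array([1.5, 2.5, 3.5, 4.5, 5.5])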
