Convert pandas dataframe of lists into numpy array - python

I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([{'a': [1,3,2]},{'a': [7,6,5]},{'a': [9,8,8]}])
df
df['a'].to_numpy()
=> array([list([1, 3, 2]), list([7, 6, 5]), list([9, 8, 8])], dtype=object)
How can I get a numpy array of shape (3,3) without writing a for loop?

First convert the column to a nested list, then convert that to an array. This only works if all lists have the same length:
arr = np.array(df.a.tolist())
print (arr)
[[1 3 2]
[7 6 5]
[9 8 8]]
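The same-length requirement can be checked explicitly before converting. The following is a minimal sketch under that assumption; the helper name `lists_to_array` is hypothetical and not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

def lists_to_array(series):
    """Turn a Series of equal-length lists into a 2-D array (hypothetical helper)."""
    lengths = series.map(len)
    # Refuse ragged input instead of relying on NumPy's behaviour for it
    if lengths.nunique() != 1:
        raise ValueError("all lists must have the same length")
    return np.array(series.tolist())

df = pd.DataFrame([{'a': [1, 3, 2]}, {'a': [7, 6, 5]}, {'a': [9, 8, 8]}])
arr = lists_to_array(df['a'])
print(arr.shape)
```

For the example frame this should give a (3, 3) integer array without an explicit Python loop over the rows.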

If the lists always have the same length:
pd.DataFrame(df.a.tolist()).values
array([[1, 3, 2],
[7, 6, 5],
[9, 8, 8]])

All of these answers focus on a single column rather than an entire DataFrame. If you have multiple columns where every entry (i, j) is a list, you can do this:
df = pd.DataFrame({"A": [[1, 2], [3, 4]], "B": [[5, 6], [7, 8]]})
print(df)
A B
0 [1, 2] [5, 6]
1 [3, 4] [7, 8]
arrays = df.applymap(lambda x: np.array(x, dtype=np.float32)).to_numpy()
# stack the cells of each row, then stack the rows into a single 3-D array
result = np.stack([np.stack(row) for row in arrays])
print(result, result.shape)
array([[[1., 2.],
[5., 6.]],
[[3., 4.],
[7., 8.]]], dtype=float32)
I cannot speak to the speed of this, as I use it on very small amounts of data.
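A shorter route that may work for the same multi-column case is to go through a nested list, as in the single-column answers. This is only a sketch, assuming every cell holds a list of the same length:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [[1, 2], [3, 4]], "B": [[5, 6], [7, 8]]})

# to_numpy() gives a 2-D object array of lists; tolist() turns it into
# plain nested lists that np.array can convert in one call.
result = np.array(df.to_numpy().tolist(), dtype=np.float32)
print(result.shape)
```

The axes come out as (rows, columns, list elements), matching the stacked result above.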

Related

Accessing columns of matlab matrix and its realization in numpy

I am trying to find a realization of accessing elements of numpy arrays corresponding to a feature of Matlab.
Suppose given a (2,2,2) Matlab matrix m in the form
m(:,:,1) = [1,2;3,4]
m(:,:,2) = [5,6;7,8]
Even though this is a 3-d array, Matlab allows accessing its columns with a single column index, like this:
m(:,1) = [1;3]
m(:,2) = [2;4]
m(:,3) = [5;7]
m(:,4) = [6;8]
I am curious whether numpy supports such indexing, so that given the following array
m = array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
one can access its columns in the same fashion as in Matlab above.
My answer to this question is as follows. Suppose we are given the array from the question:
m = array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
One can create a list, which I call m_list, like this:
m_list = [m[i][:,j] for i in range(m.shape[0]) for j in range(m.shape[-1])]
This produces m_list in the form
m_list = [array([1, 3]), array([2, 4]), array([5, 7]), array([6, 8])]
Now we can access the elements of m_list in the same order as the Matlab columns listed in the question.
In [41]: m = np.arange(1,9).reshape(2,2,2)
In [42]: m
Out[42]:
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
Indexing the equivalent blocks:
In [47]: m[0,:,0]
Out[47]: array([1, 3])
In [48]: m[0,:,1]
Out[48]: array([2, 4])
In [49]: m[1,:,0]
Out[49]: array([5, 7])
In [50]: m[1,:,1]
Out[50]: array([6, 8])
We can reshape, to "flatten" one pair of dimensions:
In [84]: m = np.arange(1,9).reshape(2,2,2)
In [85]: m.reshape(2,4)
Out[85]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
In [87]: m.reshape(2,4)[:,2]
Out[87]: array([3, 7])
and throw in a transpose:
In [90]: m.transpose(1,0,2).reshape(2,4)
Out[90]:
array([[1, 2, 5, 6],
[3, 4, 7, 8]])
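The transpose-and-reshape step above can be wrapped in a small helper that mimics MATLAB's m(:,k). This is a sketch only: the name `matlab_column` is hypothetical, and it assumes the NumPy array stores MATLAB's page m(:,:,p+1) as m[p], as in the question.

```python
import numpy as np

def matlab_column(m, k):
    """Return what MATLAB's m(:,k) would give (k is 1-based).

    Assumes m has shape (pages, rows, cols) with m[p] equal to m(:,:,p+1).
    """
    pages, rows, cols = m.shape
    # Put rows first, then lay the pages side by side along the column axis
    flat = m.transpose(1, 0, 2).reshape(rows, pages * cols)
    return flat[:, k - 1]

m = np.arange(1, 9).reshape(2, 2, 2)
columns = [matlab_column(m, k).tolist() for k in range(1, 5)]
print(columns)
```

For the 2x2x2 example this should reproduce the four columns listed in the question, in the same order.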
MATLAB originally was strictly 2d. Then sometime around v3.9 (2000) they allowed for more, but in a kludgy way. They added a way to index the trailing dimension as though it were multidimensional. In another recent SO question I noticed that when reshaping to (2,2,1,1) the result remained (2,2). Trailing size-1 dimensions are squeezed out.
I suspect the m(:,3) is a consequence of that as well.
Testing a 4d MATLAB
>> m=reshape(1:36,2,3,3,2);
>> m(:,:,1)
ans =
1 3 5
2 4 6
>> reshape(m,2,3,6)(:,:,1)
ans =
1 3 5
2 4 6
>> m(:,17)
ans =
33
34
>> reshape(m,2,18)(:,17)
ans =
33
34

Best way to concatenate subarrays along an axis?

I have an array of arrays:
a = [ [[1,2], [3,4]], [[5,6], [7,8]] ]
How can I, for each subarray in the above array, perform the following:
reshape(-1,1)
concatenate along axis=1
a is not a numpy array but it can of course be converted to a numpy array. I do not care whether it stays a Python list or is converted to a numpy array - whatever yields the most concise/clean solution.
I am aware that this can be accomplished using a for loop, but that seems clumsy and I am interested in cleaner solutions :)
The result I expect is:
[[1, 5], [2, 6], [3, 7], [4, 8]]
The library einops contains a function rearrange that should do the trick in a tidy manner. It uses the same type of indexing as np.einsum to operate flexible reshaping operations.
import numpy as np
a = [ [[1,2], [3,4]], [[5,6], [7,8]] ]
an = np.array(a)
from einops import rearrange
ar = rearrange(an, "a b c -> (b c) a")
ar
output:
array([[1, 5],
[2, 6],
[3, 7],
[4, 8]])
You are looking to reshape each sub-array to (-1, 1), which means the overall shape of the initial array is (len(a), -1, 1). From there we can np.concatenate on axis=1:
>>> a
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
>>> a.reshape((len(a), -1, 1))
array([[[1],
[2],
[3],
[4]],
[[5],
[6],
[7],
[8]]])
>>> np.concatenate(a.reshape((len(a), -1, 1)), 1)
array([[1, 5],
[2, 6],
[3, 7],
[4, 8]])
Note: len(a) is essentially a.shape[0]
Alternatively you can discard the last axis and work with np.stack instead:
>>> np.stack(a.reshape((len(a), -1)), 1)
array([[1, 5],
[2, 6],
[3, 7],
[4, 8]])
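Since each sub-array is flattened and the results are placed side by side as columns, the same output can also be obtained with a reshape followed by a transpose. A minimal sketch, assuming the input converts to a regular (non-ragged) array:

```python
import numpy as np

a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# Flatten every sub-array into one row, then transpose so that each
# flattened sub-array becomes a column of the result.
out = a.reshape(len(a), -1).T
print(out)
```

This avoids the intermediate trailing axis and the call to np.concatenate or np.stack.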
A python list solution:
In [182]: alist = []
...: for xy in zip(*a):
...: alist.extend(list(zip(*xy)))
...:
In [183]: alist
Out[183]: [(1, 5), (2, 6), (3, 7), (4, 8)]

Asymmetric slicing python

Consider the following matrix:
X = np.arange(9).reshape(3,3)
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
Let's say I want to extract the following array from it:
array([[0, 4, 2],
[3, 7, 5]])
It is possible with some indexing of rows and columns, for instance
col=[0,1,2]
row = [[0,1],[1,2],[0,1]]
Then if I store the result in a variable array I can do it with the following code:
array=np.zeros([2,3],dtype='int64')
for i in range(3):
array[:,i]=X[row[i],col[i]]
Is there a way to broadcast this kind of operation? I have to do this as a data-cleaning stage for a large file (~5 GB), and I would like to use dask to parallelize it. But as a first step, I would be happy just to avoid the for loop.
With NumPy's advanced indexing, it would be:
X[row, np.asarray(col)[:,None]].T
Sample run -
In [9]: X
Out[9]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
In [10]: col=[0,1,2]
...: row = [[0,1],[1,2],[0,1]]
In [11]: X[row, np.asarray(col)[:,None]].T
Out[11]:
array([[0, 4, 2],
[3, 7, 5]])
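The expression above relies on the two index arrays broadcasting against each other. A sketch of an equivalent form that transposes the row indices first, so no final .T is needed (the names `rows` and `cols` are introduced here for illustration):

```python
import numpy as np

X = np.arange(9).reshape(3, 3)
col = [0, 1, 2]
row = [[0, 1], [1, 2], [0, 1]]

rows = np.asarray(row).T   # shape (2, 3): one column of row indices per output column
cols = np.asarray(col)     # shape (3,): broadcasts across the two output rows

# out[i, j] == X[rows[i, j], cols[j]]
out = X[rows, cols]
print(out)
```

Both index arrays broadcast to shape (2, 3), which is also the shape of the result.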

How to return some column items in a NumPy array?

I want to print some items of a 2D NumPy array.
For example:
a = [[1, 2, 3, 4],
[5, 6, 7, 8]]
a = numpy.array(a)
My questions:
How can I return just (1 and 2)? As well as (5 and 6)?
And how can I keep the dimension as [2, 2]?
The following:
a[:, [0, 1]]
will select only the first two columns (with index 0 and 1). The result will be:
array([[1, 2],
[5, 6]])
You can use slicing to get necessary parts of the numpy array.
To get 1 and 2 you need to select row 0 and the first two columns, i.e.
>>> a[0, 0:2]
array([1, 2])
Similarly for 5 and 6
>>> a[1, 0:2]
array([5, 6])
You can also select a 2x2 subarray, e.g.
>>> a[:,0:2]
array([[1, 2],
[5, 6]])
You can do it like this:
In [44]: a[:, :2]
Out[44]:
array([[1, 2],
[5, 6]])
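The answers above differ in the shape they return and in whether the result is a view or a copy. A small sketch that compares them (the variable names are chosen here for illustration):

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

row_1d = a[0, :2]      # integer row index drops that axis
row_2d = a[0:1, :2]    # slicing the row keeps the array 2-D
block = a[:, :2]       # basic slicing: a view on a
fancy = a[:, [0, 1]]   # list index: a copy of the selected columns

print(row_1d.shape, row_2d.shape, block.shape)
print(np.shares_memory(a, block), np.shares_memory(a, fancy))
```

If the result is going to be modified, the view/copy distinction matters: writing into `block` changes `a`, while writing into `fancy` does not.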

Search Numpy array with multiple values

I have a 2D NumPy array with duplicate values.
I am searching the array like this:
In [104]: import numpy as np
In [105]: array = np.array
In [106]: a = array([[1, 2, 3],
...: [1, 2, 3],
...: [2, 5, 6],
...: [3, 8, 9],
...: [4, 8, 9],
...: [4, 2, 3],
...: [5, 2, 3]])
In [107]: num_list = [1, 4, 5]
In [108]: for i in num_list :
...: print(a[np.where(a[:,0] == i)])
...:
[[1 2 3]
[1 2 3]]
[[4 8 9]
[4 2 3]]
[[5 2 3]]
The input is a list of numbers to match against the values in column 0.
The end result I want is the matching rows in any format (array, list, or tuple), for example:
array([[1, 2, 3],
[1, 2, 3],
[4, 8, 9],
[4, 2, 3],
[5, 2, 3]])
My code works but doesn't seem Pythonic. Is there a better strategy for searching with multiple values, something like a[np.where(a[:,0] == l)], where a single lookup retrieves all the values?
My real array is large.
Approach #1 : Using np.in1d -
a[np.in1d(a[:,0], num_list)]
Approach #2 : Using np.searchsorted -
num_arr = np.sort(num_list) # Sort num_list and get as array
# Get indices of occurrences of first column in num_list
idx = np.searchsorted(num_arr, a[:,0])
# Take care of out of bounds cases
idx[idx==len(num_arr)] = 0
out = a[a[:,0] == num_arr[idx]]
You can do
a[numpy.in1d(a[:, 0], num_list), :]
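The NumPy documentation recommends np.isin over np.in1d for new code. A sketch of the same lookup with np.isin, using the array from the question:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [1, 2, 3],
              [2, 5, 6],
              [3, 8, 9],
              [4, 8, 9],
              [4, 2, 3],
              [5, 2, 3]])
num_list = [1, 4, 5]

# Boolean mask: True where the first column holds any value from num_list
mask = np.isin(a[:, 0], num_list)
out = a[mask]
print(out)
```

The mask is built in a single vectorized call, so the rows for all requested values are selected at once and keep their original order.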
