3 dimensional numpy array to pandas dataframe - python

I have a 3 dimensional numpy array
([[[0.30706802]],
[[0.19451728]],
[[0.19380492]],
[[0.23329106]],
[[0.23849282]],
[[0.27154338]],
[[0.2616704 ]], ... ])
with shape (844, 1, 1), produced by predict() on an RNN model built with Keras:
y_prob = loaded_model.predict(X)
My problem is how to convert it to a pandas DataFrame.
My objective is to have this:
0 0.30706802
7 0.19451728
21 0.19380492
35 0.23329106
42 ...
...
815 ...
822 ...
829 ...
836 ...
843 ...
Name: feature, Length: 78, dtype: float32

The idea is to first flatten the nested list into a flat list, then convert it to a DataFrame using the from_records method of pandas DataFrame:
import numpy as np
import pandas as pd
data = np.array([[[0.30706802]],[[0.19451728]],[[0.19380492]],[[0.23329106]],[[0.23849282]],[[0.27154338]],[[0.2616704 ]]])
import itertools
data = list(itertools.chain(*data))
df = pd.DataFrame.from_records(data)
Without itertools
data = [i for j in data for i in j]
df = pd.DataFrame.from_records(data)
Or you can use the flatten() method, as mentioned in one of the other answers, and pass the result directly to the constructor:
pd.DataFrame(data.flatten(),columns = ['col1'])

Here you go!
import pandas as pd
y = ([[[[11]],[[13]],[[14]],[[15]]]])
a = []
for i in y[0]:
    a.append(i[0])
df = pd.DataFrame(a)
print(df)
Output:
0
0 11
1 13
2 14
3 15
Feel free to set your custom index values both for axis=0 and axis=1.

You could try:
s = pd.Series(your_array.flatten(), name='feature')
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
You can then convert the series to a dataframe using s.to_frame()
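To make the flatten-then-wrap idea concrete, here is a minimal sketch. It assumes a small stand-in array with the same (n, 1, 1) layout as the question's predict() output; the four values are copied from the question, and the column name "feature" comes from the desired output shown there.

```python
import numpy as np
import pandas as pd

# Stand-in for the (844, 1, 1) result of model.predict(); only 4 rows here.
y_prob = np.array(
    [[[0.30706802]], [[0.19451728]], [[0.19380492]], [[0.23329106]]],
    dtype=np.float32,
)

# reshape(-1) collapses every axis into one, giving shape (4,).
flat = y_prob.reshape(-1)

# A named Series becomes a one-column DataFrame via to_frame().
s = pd.Series(flat, name="feature")
df = s.to_frame()
print(df.shape)
```

The dtype of the array (float32 here) is carried over to the Series unchanged.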

Related

How to efficiently update pandas row if computation involving lookup another array value

The objective is to update the df rows by combining each row's values with reference values from an external NumPy array.
Currently, I have to use a for loop to update each row, as below.
However, I wonder whether this can be tackled with a pandas built-in.
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
for n in range(len(df)):
    a = df.loc[n]
    drange = list(range(a['start'], a['end'] + 1))
    darr = arr[0, drange]
    r = np.where(darr == np.amax(darr))[0].item()
    df.loc[n, 'pos_peak'] = drange[r]
Expected output
start end o pos_peak
0 1 4 a 3.0
1 7 11 g 8.0
2 13 17 t 16.0
My approach would be to use pandas apply() function with which you can apply a function to each row of your dataframe. In order to find the index of the maximum element, I used the numpy function argmax() onto the relevant part of arr. Here is the code:
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
df['pos_peak'] = df.apply(lambda x: x['start'] + np.argmax(arr[0][x['start']:x['end']+1]), axis=1)
df
Output:
start end o pos_peak
0 1 4 a 3
1 7 11 g 8
2 13 17 t 16
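As a possible variant of the apply() approach, the same argmax logic can be written as a list comprehension over the two columns, which avoids building a Series for every row. This is only a sketch on the question's data; it keeps arr one-dimensional instead of reshaping it, which is my simplification.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 5, 100, 3, 6, 8, 3, 99, 12, 5, 6, 8, 11, 14, 11, 100, 1, 3])
df = pd.DataFrame({"start": [1, 7, 13], "end": [4, 11, 17], "o": ["a", "g", "t"]})

# argmax gives the offset of the peak inside the inclusive slice
# arr[start:end + 1]; adding start maps it back to a position in arr.
df["pos_peak"] = [
    start + int(np.argmax(arr[start:end + 1]))
    for start, end in zip(df["start"], df["end"])
]
print(df)
```

Note that argmax returns the first position when the maximum occurs more than once, whereas the original loop's .item() call would raise an error in that case.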

How to convert pandas data frame to NumPy array?

Following the suggestions I got on my previous question here, I'm converting a Pandas data frame to a numeric NumPy array. To do this I used numpy.asarray.
My data frame:
DataFrame
----------
label vector
0 0 1:0.0033524514 2:-0.021896651 3:0.05087798 4:...
1 0 1:0.02134219 2:-0.007388343 3:0.06835007 4:0....
2 0 1:0.030515702 2:-0.0037591448 3:0.066626 4:0....
3 0 1:0.0069114454 2:-0.0149497045 3:0.020777626 ...
4 1 1:0.003118149 2:-0.015105667 3:0.040879637 4:...
... ... ...
19779 0 1:0.0042634667 2:-0.0044222944 3:-0.012995412...
19780 1 1:0.013818732 2:-0.010984628 3:0.060777966 4:...
19781 0 1:0.00019213723 2:-0.010443398 3:0.01679976 4...
19782 0 1:0.010373874 2:0.0043582567 3:-0.0078354385 ...
19783 1 1:0.0016790542 2:-0.028346825 3:0.03908631 4:...
[19784 rows x 2 columns]
DataFrame datatypes :
label object
vector object
dtype: object
To convert into a Numpy Array I'm using this script:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt
r_filenameTSV = 'TSV/A19784.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                  columns=['label', 'vector'])
print('DataFrame\n----------\n', df)
print('\nDataFrame datatypes :\n', df.dtypes)
arr = np.asarray(df, dtype=np.float64)
print('\nNumpy Array\n----------\n', arr)
print('\nNumpy Array Datatype :', arr.dtype)
I get this error on line 22, arr = np.asarray(df, dtype=np.float64):
ValueError: could not convert string to float: ' 1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 5:-0.013740167 6:-0.0014883851 7:0.02230502 8:0.0053563705 9:0.00465044 10:-0.0030826542 11:0.010156203 12:-0.021754289 13:-0.03744049 14:0.011198468 15:-0.021201309 16:-0.0006497681 17:0.009229079 18:0.04218278 19:0.020572046 20:0.0021593391 ...
How can I solve this issue?
Regards and thanks for your time
Use a list comprehension that builds one dictionary per row from the index:value pairs, and pass the result to DataFrame:
df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print (df)
1 2 3 4
0 0.0033524514 -0.021896651 0.05087798 0
1 0.02134219 -0.007388343 0.06835007 0
2 0.030515702 -0.0037591448 0.066626 0
3 0.0069114454 -0.0149497045 0.020777626 0
4 0.003118149 -0.015105667 0.040879637 0.4
And then convert to floats and to numpy array:
print (df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665 0.05087798 0. ]
[ 0.02134219 -0.00738834 0.06835007 0. ]
[ 0.0305157 -0.00375914 0.066626 0. ]
[ 0.00691145 -0.0149497 0.02077763 0. ]
[ 0.00311815 -0.01510567 0.04087964 0.4 ]]
It seems one of your columns holds strings rather than numbers. Either remove that column or encode it numerically before converting the dataframe to an array.
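If the label is needed alongside the features, a sketch like the following may help. It assumes each line has the layout "label index:value index:value ...", as in the question's printout; the two sample lines are shortened, made-up inputs in that layout, and X and y are names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample lines in the "label index:value ..." layout.
lines = [
    "0 1:0.0033524514 2:-0.021896651 3:0.05087798",
    "1 1:0.003118149 2:-0.015105667 3:0.040879637",
]

labels = []
rows = []
for line in lines:
    # First token is the label, the rest are "index:value" pairs.
    label, *pairs = line.split()
    labels.append(int(label))
    rows.append({int(k): float(v) for k, v in (p.split(":") for p in pairs)})

# Columns are the feature indices; sort them so the array columns are ordered.
X = pd.DataFrame(rows).sort_index(axis=1).to_numpy(dtype=np.float64)
y = np.array(labels)
print(X.shape, y)
```

A feature index that is missing from some line would show up as NaN in X, so a fill step may be needed for sparse data.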

How to convert a list of Numpy arrays to a Pandas DataFrame

I have a list of Numpy arrays that looks like this:
[400.31865662]
[401.18514808]
[404.84015554]
[405.14682194]
[405.67735105]
[273.90969447]
[274.0894528]
When I try to convert it to a Pandas Dataframe with the following code
y = pd.DataFrame(data)
print(y)
I get the following output when printing it. Why do I get all those zeros?
0
0 400.318657
0
0 401.185148
0
0 404.840156
0
0 405.146822
0
0 405.677351
0
0 273.909694
0
0 274.089453
I would like to get a single column dataframe which looks like that:
400.31865662
401.18514808
404.84015554
405.14682194
405.67735105
273.90969447
274.0894528
You could flatten the numpy array:
import numpy as np
import pandas as pd
data = [[400.31865662],
[401.18514808],
[404.84015554],
[405.14682194],
[405.67735105],
[273.90969447],
[274.0894528]]
arr = np.array(data)
df = pd.DataFrame(data=arr.flatten())
print(df)
Output
0
0 400.318657
1 401.185148
2 404.840156
3 405.146822
4 405.677351
5 273.909694
6 274.089453
Since I assume the many visitors of this post aren't here for OP's specific and un-reproducible issue, here's a general answer:
df = pd.DataFrame(array)
The strength of pandas is to be nice for the eye (like Excel), so it's important to use column names.
import numpy as np
import pandas as pd
array = np.random.rand(5, 5)
array([[0.723, 0.177, 0.659, 0.573, 0.476],
[0.77 , 0.311, 0.533, 0.415, 0.552],
[0.349, 0.768, 0.859, 0.273, 0.425],
[0.367, 0.601, 0.875, 0.109, 0.398],
[0.452, 0.836, 0.31 , 0.727, 0.303]])
columns = [f'col_{num}' for num in range(5)]
index = [f'index_{num}' for num in range(5)]
Here's where the magic happens:
df = pd.DataFrame(array, columns=columns, index=index)
col_0 col_1 col_2 col_3 col_4
index_0 0.722791 0.177427 0.659204 0.572826 0.476485
index_1 0.770118 0.311444 0.532899 0.415371 0.551828
index_2 0.348923 0.768362 0.858841 0.273221 0.424684
index_3 0.366940 0.600784 0.875214 0.108818 0.397671
index_4 0.451682 0.836315 0.310480 0.727409 0.302597
I just figured out my mistake. (data) was a list of arrays:
[array([400.0290173]), array([400.02253235]), array([404.00252113]), array([403.99466754]), array([403.98681395]), array([271.97896036]), array([271.97110677])]
So I used np.vstack(data) to concatenate it
conc = np.vstack(data)
[[400.0290173 ]
[400.02253235]
[404.00252113]
[403.99466754]
[403.98681395]
[271.97896036]
[271.97110677]]
Then I converted the concatenated array into a Pandas DataFrame:
newdf = pd.DataFrame(conc)
0
0 400.029017
1 400.022532
2 404.002521
3 403.994668
4 403.986814
5 271.978960
6 271.971107
Et voilà!
There is another way, which isn't mentioned in the other answers. If you have a NumPy array which is essentially a row vector (or column vector), i.e. one with shape (n,), then you could do the following:
# sample array
x = np.zeros((20))
# empty dataframe
df = pd.DataFrame()
# add the array to df as a column
df['column_name'] = x
This way you can add multiple arrays as separate columns.
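For the list-of-arrays case from the question, a flat column can also be built with np.concatenate. This is a minimal sketch, assuming every element of the list is a one-element array; the column name "value" is arbitrary.

```python
import numpy as np
import pandas as pd

# A list of one-element arrays, as in the question.
data = [np.array([400.31865662]), np.array([401.18514808]), np.array([404.84015554])]

# concatenate joins the arrays end to end into one 1-D array of shape (3,).
flat = np.concatenate(data)

# Passing a dict gives the column a name instead of the default 0.
df = pd.DataFrame({"value": flat})
print(df)
```

Compared with np.vstack, which yields shape (3, 1), this yields shape (3,); both give a single-column DataFrame.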

Using a subset of Pandas dataframe with Scipy Kmeans?

I have a data frame that I import using df = pd.read_csv('my.csv',sep=','). In that CSV file, the first row is the column name, and the first column is the observation name.
I know how to select a subset of the pandas DataFrame, using:
df.iloc[:,1::]
which gives me only the numeric values. But when I try and use this with scipy.cluster.vq.kmeans using this command,
kmeans(df.iloc[:,1::],3)
I get the error 'DataFrame' object has no attribute 'dtype'
Any suggestions?
Here is an example to use KMeans.
from sklearn.datasets import make_blobs
from itertools import product
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
# try to simulate your data
# =====================================================
X, y = make_blobs(n_samples=1000, n_features=10, centers=3)
columns = ['feature' + str(x) for x in np.arange(1, 11, 1)]
d = {key: values for key, values in zip(columns, X.T)}
d['label'] = y
data = pd.DataFrame(d)
Out[72]:
feature1 feature10 feature2 ... feature8 feature9 label
0 1.2324 -2.6588 -7.2679 ... 5.4166 8.9043 2
1 0.3569 -1.6880 -5.7671 ... -2.2465 -1.7048 0
2 1.0177 -1.7145 -5.8591 ... -0.5755 -0.6969 0
3 1.5735 -0.0597 -4.9009 ... 0.3235 -0.2400 0
4 -0.1042 -1.6703 -4.0541 ... 0.4456 -1.0406 0
.. ... ... ... ... ... ... ...
995 -0.0983 -1.4569 -3.5179 ... -0.3164 -0.6685 0
996 1.3151 -3.3253 -7.0984 ... 3.7563 8.4052 2
997 -0.9177 0.7446 -4.8527 ... -2.3793 -0.4038 0
998 2.0385 -3.9001 -7.7472 ... 5.2290 9.2281 2
999 3.9357 -7.2564 5.7881 ... 1.2288 -2.2305 1
[1000 rows x 11 columns]
# fit your data with KMeans
# =====================================================
kmeans = KMeans(n_clusters=3)
kmeans.fit_predict(data.iloc[:, :-1].values)
Out[70]: array([1, 0, 0, ..., 0, 1, 2], dtype=int32)
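The question asks about scipy.cluster.vq.kmeans rather than scikit-learn. The error suggests that function wants a plain ndarray, so one option is to pass the DataFrame's underlying values. A minimal sketch, assuming a CSV with a header row and observation names in the first column; the CSV text and its column names are made up for illustration.

```python
import io

import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans

# Hypothetical data: three well-separated 2-D blobs written out as CSV text.
rng = np.random.default_rng(0)
points = np.vstack(
    [rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)]
)
csv_text = "name,x,y\n" + "\n".join(
    f"obs{i},{x},{y}" for i, (x, y) in enumerate(points)
)

# index_col=0 moves the observation names into the index,
# leaving only numeric columns in the frame.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Hand kmeans an ndarray instead of the DataFrame itself.
obs = df.to_numpy(dtype=float)
codebook, distortion = kmeans(obs, 3)
print(codebook.shape)
```

With the names in the index, df.iloc[:, 1:] is no longer needed; if the names stay in a regular column, df.iloc[:, 1:].to_numpy() should work the same way.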

Inflating a 1D array into a 2D array in numpy

Say I have a 1D array:
import numpy as np
my_array = np.arange(0,10)
my_array.shape
(10, )
In Pandas I would like to create a DataFrame with only one row and 10 columns using this array. For example:
import pandas as pd
import random, string
# Random list of characters to be used as columns
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
But when I try:
pd.DataFrame(my_array, columns = cols)
I get:
ValueError: Shape of passed values is (1,10), indices imply (10,10)
I presume this is because Pandas expects a 2D array, and I have a (flat) 1D array. Is there a way to inflate my 1D array into a 2D array, or to have Pandas use a 1D array in the creation of the dataframe?
Note: I am using the latest stable version of Pandas (0.11.0)
Your value array has length 9, (values from 1 till 9), and your cols list has length 10.
I don't understand your error message; based on your code, I get:
ValueError: Shape of passed values is (1, 9), indices imply (10, 9)
Which makes sense.
Try:
my_array = np.arange(10).reshape(1,10)
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
pd.DataFrame(my_array, columns=cols)
Which results in:
F H L N M X B R S N
0 0 1 2 3 4 5 6 7 8 9
Either of these should do it:
my_array2 = my_array[None]  # same as my_array[np.newaxis]
or
my_array2 = my_array.reshape((1,10))
A single-row, many-columned DataFrame is unusual. A more natural, idiomatic choice would be a Series indexed by what you call cols:
pd.Series(my_array, index=cols)
But, to answer your question, the DataFrame constructor is assuming that my_array is a column of 10 data points. Try DataFrame(my_array.reshape((1, 10)), columns=cols). That works for me.
By using one of the alternate DataFrame constructors it is possible to create a DataFrame without needing to reshape my_array.
import numpy as np
import pandas as pd
import random, string
my_array = np.arange(0,10)
cols = [random.choice(string.ascii_uppercase) for x in range(10)]
pd.DataFrame.from_records([my_array], columns=cols)
Out[22]:
H H P Q C A G N T W
0 0 1 2 3 4 5 6 7 8 9
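The reshaping options mentioned across these answers can be compared side by side. This is a small sketch; np.atleast_2d is one more option not named above, and the fixed column labels replace the random ones only so the result is reproducible.

```python
import numpy as np
import pandas as pd

my_array = np.arange(10)
cols = list("ABCDEFGHIJ")  # fixed labels instead of random letters

# Three ways to turn shape (10,) into shape (1, 10).
row_a = my_array[np.newaxis, :]
row_b = my_array.reshape(1, -1)
row_c = np.atleast_2d(my_array)

df = pd.DataFrame(row_a, columns=cols)
print(df.shape)
```

Using -1 in reshape lets NumPy infer the column count, so the same line works for arrays of any length.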
