How to convert pandas data frame to NumPy array? - python

Following the suggestions I got from my previous question here, I'm converting a Pandas data frame to a numeric NumPy array. To do this I'm using numpy.asarray.
My data frame:
DataFrame
----------
label vector
0 0 1:0.0033524514 2:-0.021896651 3:0.05087798 4:...
1 0 1:0.02134219 2:-0.007388343 3:0.06835007 4:0....
2 0 1:0.030515702 2:-0.0037591448 3:0.066626 4:0....
3 0 1:0.0069114454 2:-0.0149497045 3:0.020777626 ...
4 1 1:0.003118149 2:-0.015105667 3:0.040879637 4:...
... ... ...
19779 0 1:0.0042634667 2:-0.0044222944 3:-0.012995412...
19780 1 1:0.013818732 2:-0.010984628 3:0.060777966 4:...
19781 0 1:0.00019213723 2:-0.010443398 3:0.01679976 4...
19782 0 1:0.010373874 2:0.0043582567 3:-0.0078354385 ...
19783 1 1:0.0016790542 2:-0.028346825 3:0.03908631 4:...
[19784 rows x 2 columns]
DataFrame datatypes :
label object
vector object
dtype: object
To convert it into a NumPy array I'm using this script:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt
r_filenameTSV = 'TSV/A19784.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ', 1).tolist(),
                  columns=['label', 'vector'])
print('DataFrame\n----------\n', df)
print('\nDataFrame datatypes :\n', df.dtypes)
arr = np.asarray(df, dtype=np.float64)
print('\nNumpy Array\n----------\n', arr)
print('\nNumpy Array Datatype :', arr.dtype)
I'm getting this error from the line arr = np.asarray(df, dtype=np.float64):
ValueError: could not convert string to float: ' 1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 5:-0.013740167 6:-0.0014883851 7:0.02230502 8:0.0053563705 9:0.00465044 10:-0.0030826542 11:0.010156203 12:-0.021754289 13:-0.03744049 14:0.011198468 15:-0.021201309 16:-0.0006497681 17:0.009229079 18:0.04218278 19:0.020572046 20:0.0021593391 ...
How can I solve this issue?
Regards and thanks for your time

Use a list comprehension with a nested dictionary comprehension to build the DataFrame:
df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print (df)
1 2 3 4
0 0.0033524514 -0.021896651 0.05087798 0
1 0.02134219 -0.007388343 0.06835007 0
2 0.030515702 -0.0037591448 0.066626 0
3 0.0069114454 -0.0149497045 0.020777626 0
4 0.003118149 -0.015105667 0.040879637 0.4
And then convert to floats and to a numpy array:
print (df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665 0.05087798 0. ]
[ 0.02134219 -0.00738834 0.06835007 0. ]
[ 0.0305157 -0.00375914 0.066626 0. ]
[ 0.00691145 -0.0149497 0.02077763 0. ]
[ 0.00311815 -0.01510567 0.04087964 0.4 ]]
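If you also need to keep the label that precedes the feature pairs, here is a sketch building on the answer above, assuming each raw line has the form "<label> 1:v1 2:v2 ..." as in the question:
# sketch: split off the label, parse the index:value pairs, stack into one array
labels = df['vector'].str.split(n=1).str[0].astype(float)
features = pd.DataFrame(
    [dict(p.split(':') for p in line.split()[1:]) for line in df['vector']]
).astype(float)
arr = np.column_stack([labels.to_numpy(), features.to_numpy()])  # label in column 0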

It seems one of your columns holds strings, not numbers. Either remove that column or encode it numerically before converting the dataframe to an array.
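For example, a minimal sketch of encoding a string column numerically first (pd.factorize is one option; not part of the original answer):
codes, uniques = pd.factorize(df['label'])  # maps each distinct label to 0, 1, 2, ...
df['label'] = codes
arr = np.asarray(df, dtype=np.float64)      # still requires every other column to be numeric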

Related

3 dimensional numpy array to pandas dataframe

I have a 3-dimensional numpy array
([[[0.30706802]],
[[0.19451728]],
[[0.19380492]],
[[0.23329106]],
[[0.23849282]],
[[0.27154338]],
[[0.2616704 ]], ... ])
with shape (844, 1, 1), resulting from an RNN's model.predict():
y_prob = loaded_model.predict(X)
My problem is how to convert it to a pandas dataframe. I have used Keras.
My objective is to have this:
0 0.30706802
7 0.19451728
21 0.19380492
35 0.23329106
42 ...
...
815 ...
822 ...
829 ...
836 ...
843 ...
Name: feature, Length: 78, dtype: float32
The idea is to first flatten the nested list into a flat list, then convert it into a df using the from_records method of pandas DataFrame:
import numpy as np
import pandas as pd
data = np.array([[[0.30706802]],[[0.19451728]],[[0.19380492]],[[0.23329106]],[[0.23849282]],[[0.27154338]],[[0.2616704 ]]])
import itertools
data = list(itertools.chain(*data))
df = pd.DataFrame.from_records(data)
Without itertools
data = [i for j in data for i in j]
df = pd.DataFrame.from_records(data)
Or you can use the flatten() method as mentioned in one of the answers, but you can use it directly like this:
pd.DataFrame(data.flatten(),columns = ['col1'])
Here you go!
import pandas as pd
y = ([[[[11]],[[13]],[[14]],[[15]]]])
a = []
for i in y[0]:
    a.append(i[0])
df = pd.DataFrame(a)
print(df)
Output:
0
0 11
1 13
2 14
3 15
Feel free to set your custom index values both for axis=0 and axis=1.
You could try:
s = pd.Series(your_array.flatten(), name='feature')
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
You can then convert the series to a dataframe using s.to_frame()
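Putting the two steps together, a minimal sketch (y_prob being the (844, 1, 1) prediction array from the question):
s = pd.Series(y_prob.flatten(), name='feature')  # (844, 1, 1) -> (844,)
df = s.to_frame()                                # single-column DataFrame named 'feature'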

How to convert a list of Numpy arrays to a Pandas DataFrame

I have a list of Numpy arrays that looks like this:
[400.31865662]
[401.18514808]
[404.84015554]
[405.14682194]
[405.67735105]
[273.90969447]
[274.0894528]
When I try to convert it to a Pandas DataFrame with the following code:
y = pd.DataFrame(data)
print(y)
I get the following output when printing it. Why do I get all those zeros?
0
0 400.318657
0
0 401.185148
0
0 404.840156
0
0 405.146822
0
0 405.677351
0
0 273.909694
0
0 274.089453
I would like to get a single-column dataframe which looks like this:
400.31865662
401.18514808
404.84015554
405.14682194
405.67735105
273.90969447
274.0894528
You could flatten the numpy array:
import numpy as np
import pandas as pd
data = [[400.31865662],
[401.18514808],
[404.84015554],
[405.14682194],
[405.67735105],
[273.90969447],
[274.0894528]]
arr = np.array(data)
df = pd.DataFrame(data=arr.flatten())
print(df)
Output
0
0 400.318657
1 401.185148
2 404.840156
3 405.146822
4 405.677351
5 273.909694
6 274.089453
Since I assume the many visitors of this post aren't here for OP's specific and un-reproducible issue, here's a general answer:
df = pd.DataFrame(array)
The strength of pandas is that it's easy on the eye (like Excel), so it's worth using column names.
import numpy as np
import pandas as pd
array = np.random.rand(5, 5)
array([[0.723, 0.177, 0.659, 0.573, 0.476],
[0.77 , 0.311, 0.533, 0.415, 0.552],
[0.349, 0.768, 0.859, 0.273, 0.425],
[0.367, 0.601, 0.875, 0.109, 0.398],
[0.452, 0.836, 0.31 , 0.727, 0.303]])
columns = [f'col_{num}' for num in range(5)]
index = [f'index_{num}' for num in range(5)]
Here's where the magic happens:
df = pd.DataFrame(array, columns=columns, index=index)
col_0 col_1 col_2 col_3 col_4
index_0 0.722791 0.177427 0.659204 0.572826 0.476485
index_1 0.770118 0.311444 0.532899 0.415371 0.551828
index_2 0.348923 0.768362 0.858841 0.273221 0.424684
index_3 0.366940 0.600784 0.875214 0.108818 0.397671
index_4 0.451682 0.836315 0.310480 0.727409 0.302597
I just figured out my mistake: data was a list of arrays:
[array([400.0290173]), array([400.02253235]), array([404.00252113]), array([403.99466754]), array([403.98681395]), array([271.97896036]), array([271.97110677])]
So I used np.vstack(data) to concatenate it:
conc = np.vstack(data)
[[400.0290173 ]
[400.02253235]
[404.00252113]
[403.99466754]
[403.98681395]
[271.97896036]
[271.97110677]]
Then I converted the concatenated array into a Pandas DataFrame:
newdf = pd.DataFrame(conc)
0
0 400.029017
1 400.022532
2 404.002521
3 403.994668
4 403.986814
5 271.978960
6 271.971107
Et voilà!
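An equivalent one-step alternative (my own sketch, not from the answer above):
# flatten the list of 1-element arrays directly into a named column
newdf = pd.DataFrame(np.concatenate(data).ravel(), columns=['feature'])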
There is another way, which isn't mentioned in the other answers. If you have a NumPy array which is essentially a row vector (or column vector), i.e. with shape (n,), then you could do the following:
# sample array
x = np.zeros((20))
# empty dataframe
df = pd.DataFrame()
# add the array to df as a column
df['column_name'] = x
This way you can add multiple arrays as separate columns.
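For instance, a short sketch with two hypothetical arrays a and b:
import numpy as np
import pandas as pd
a = np.arange(3)               # shape (3,)
b = np.linspace(0.0, 1.0, 3)   # shape (3,)
df = pd.DataFrame()
df['col_a'] = a                # each (n,)-shaped array becomes one column
df['col_b'] = b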

Convert pandas.DataFrame to numpy tensor using factor levels for shape [duplicate]

This question already has answers here:
Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array
(2 answers)
Closed 4 years ago.
I have data from a full factorial experiment. For example, for each of N samples, I have J types of measurement and K measurement loci. I receive this data in long format, for example,
import numpy as np
import pandas as pd
import itertools
from numpy.random import normal as rnorm
# [[N], [J], [K]]
levels = [[1,2,3,4], ['start', 'stop'], ['gene1', 'gene2', 'gene3']]
# fully crossed
exp_design = list(itertools.product(*levels))
df = pd.DataFrame(exp_design, columns=["sample", "mode", "gene"])
# some fake data
df['x'] = rnorm(size=len(exp_design))
which results in 24 observations (x) with a column for each of the three factors.
> df.head()
sample mode gene x
0 1 start gene1 -1.229370
1 1 start gene2 1.129773
2 1 start gene3 -1.155202
3 1 stop gene1 -0.757551
4 1 stop gene2 -0.166129
I want to convert these observations to the corresponding (N, J, K)-shaped tensor (numpy array). I was thinking that pivoting to wide format with a MultiIndex and then extracting the values would generate the correct tensor, but it simply comes out as a column vector:
> df.pivot_table(values='x', index=['sample', 'mode', 'gene']).values
array([[-1.22936989],
[ 1.12977346],
[-1.15520216],
...,
[-0.1031641 ],
[ 1.1296491 ],
[ 1.31113584]])
Is there a quick way to get tensor-formatted data from a long-format pandas.DataFrame?
Try with
df.agg('nunique')
Out[69]:
sample 4
mode 2
gene 3
x 24
dtype: int64
s=df.agg('nunique')
df.x.values.reshape(s['sample'],s['mode'],s['gene'])
Out[71]:
array([[[-2.78133759e-01, -1.42234420e+00, 5.42439121e-01],
[ 2.15359867e+00, 6.55837886e-01, -1.01293568e+00]],
[[ 7.92306679e-01, -1.62539763e-01, -6.13120335e-01],
[-2.91567999e-01, -4.01257702e-01, 7.96422763e-01]],
[[ 1.05088264e-01, -7.23400925e-02, 2.78515041e-01],
[ 2.63088568e-01, 1.47477886e+00, -2.10735619e+00]],
[[-1.71756374e+00, 6.12224005e-04, -3.11562798e-02],
[ 5.26028807e-01, -1.18502045e+00, 1.88633760e+00]]])
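Note that the plain reshape assumes the rows arrive sorted by sample, mode, gene. A sketch that makes that ordering explicit first (my own variant of the same idea):
# force sample > mode > gene order before reshaping to (N, J, K)
t = df.set_index(['sample', 'mode', 'gene'])['x'].sort_index()
tensor = t.to_numpy().reshape(s['sample'], s['mode'], s['gene'])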

Scikit-learn: How to normalize row values horizontally?

I would like to normalize the values below horizontally (row-wise) instead of vertically. The code reads the CSV file provided after it and writes a new CSV file with the normalized values. How can I make it normalize horizontally? The code is as follows:
Code
#norm_code.py
#normalization = (x - min) / (max - min)
import numpy as np
from sklearn import preprocessing
all_data = np.loadtxt(open("c:/Python27/test.csv", "r"),
                      delimiter=",",
                      skiprows=0,
                      dtype=np.float64)
x=all_data[:]
print('total number of samples (rows):', x.shape[0])
print('total number of features (columns):', x.shape[1])
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(x)
X_minmax=minmax_scale.transform(x)
with open('test_norm.csv', "w") as f:
    f.write("\n".join(",".join(map(str, x)) for x in X_minmax))
test.csv
1 2 0 4 3
3 2 1 1 0
2 1 1 0 1
You can simply operate on the transpose, and take a transpose of the result:
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(x.T)
X_minmax=minmax_scale.transform(x.T).T
One-liner answer without using sklearn:
X_minmax = np.transpose((x.T - np.min(x, axis=1)) / (np.max(x, axis=1) - np.min(x, axis=1)))
(Note the x.T: the row-wise min and max have shape (n_rows,), so they only broadcast correctly against the transposed array.)
This is about 8x faster than using the MinMaxScaler from preprocessing.
from sklearn.preprocessing import MinMaxScaler
data = np.array([[1 , 2 , 0 , 4 , 3],
[3 , 2 , 1, 1, 0],
[2, 1 , 1 , 0 , 1]])
scaler = MinMaxScaler()
print(data)
print(scaler.fit_transform(data.T).T)# row-wise transform
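For reference, the same row-wise scaling can also be written without any transposes by keeping the reduced axis (a sketch equivalent to the one-liner above):
mn = data.min(axis=1, keepdims=True)  # per-row minimum, shape (n_rows, 1)
mx = data.max(axis=1, keepdims=True)  # per-row maximum, shape (n_rows, 1)
row_scaled = (data - mn) / (mx - mn)  # broadcasts across the columns of each row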

Using a subset of Pandas dataframe with Scipy Kmeans?

I have a data frame that I import using df = pd.read_csv('my.csv', sep=','). In that CSV file, the first row contains the column names, and the first column contains the observation names.
I know how to select a subset of the Panda dataframe, using:
df.iloc[:,1::]
which gives me only the numeric values. But when I try to use this with scipy.cluster.vq.kmeans via this command,
kmeans(df.iloc[:,1::],3)
I get the error 'DataFrame' object has no attribute 'dtype'
Any suggestions?
Here is an example using scikit-learn's KMeans instead.
from sklearn.datasets import make_blobs
from itertools import product
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
# try to simulate your data
# =====================================================
X, y = make_blobs(n_samples=1000, n_features=10, centers=3)
columns = ['feature' + str(x) for x in np.arange(1, 11, 1)]
d = {key: values for key, values in zip(columns, X.T)}
d['label'] = y
data = pd.DataFrame(d)
Out[72]:
feature1 feature10 feature2 ... feature8 feature9 label
0 1.2324 -2.6588 -7.2679 ... 5.4166 8.9043 2
1 0.3569 -1.6880 -5.7671 ... -2.2465 -1.7048 0
2 1.0177 -1.7145 -5.8591 ... -0.5755 -0.6969 0
3 1.5735 -0.0597 -4.9009 ... 0.3235 -0.2400 0
4 -0.1042 -1.6703 -4.0541 ... 0.4456 -1.0406 0
.. ... ... ... ... ... ... ...
995 -0.0983 -1.4569 -3.5179 ... -0.3164 -0.6685 0
996 1.3151 -3.3253 -7.0984 ... 3.7563 8.4052 2
997 -0.9177 0.7446 -4.8527 ... -2.3793 -0.4038 0
998 2.0385 -3.9001 -7.7472 ... 5.2290 9.2281 2
999 3.9357 -7.2564 5.7881 ... 1.2288 -2.2305 1
[1000 rows x 11 columns]
# fit your data with KMeans
# =====================================================
kmeans = KMeans(n_clusters=3)
kmeans.fit_predict(data.iloc[:, :-1].values)  # .ix was removed from pandas; .iloc selects all feature columns
Out[70]: array([1, 0, 0, ..., 0, 1, 2], dtype=int32)
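If you would rather stay with scipy, the error usually goes away once you pass a plain ndarray instead of the DataFrame; a minimal sketch (scipy's kmeans expects whitened observations):
from scipy.cluster.vq import kmeans, whiten
obs = whiten(df.iloc[:, 1:].to_numpy(dtype=float))  # drop the observation-name column, normalize per feature
centroids, distortion = kmeans(obs, 3)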
