Transform data to numpy array for sklearn

I have a dataset in CSV format made up of some text columns (each with a limited set of possible values) and some numeric columns. Is there any way to automatically transform the text columns to numbers (for example, A becomes 0, B becomes 1, and so on) so that the dataset can be converted to an np.array?
This will be later used on scikit-learn, so it needs to be np.array at the end of all the processing.
EDIT: Adding one line of the dataset:
ENABLED;ENABLED;10;MANUAL;ENABLED;ENABLED;1800000;OFF;0.175;5.0;0.13;OFF;NEITHER;ENABLED;-65;2417;"wifi01";65;-75;DISCONNECTED;NO;NO;2621454;432477;3759;2.2436838539123705E-6;

You can apply sklearn.preprocessing.LabelEncoder to each text column. Here is an example:
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5],
'col2': ['ON','ON','OFF','OFF','ON']})
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df['encoded'] = lb.fit_transform(df.col2)
df
col1 col2 encoded
0 1 ON 1
1 2 ON 1
2 3 OFF 0
3 4 OFF 0
4 5 ON 1
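I just added the numerical values in another column, but you can replace the original column instead. You can also convert the DataFrame to a NumPy array (df.as_matrix() was removed in pandas 1.0; df.to_numpy() is the current equivalent):
df.to_numpy()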
array([[1, 'ON', 1],
[2, 'ON', 1],
[3, 'OFF', 0],
[4, 'OFF', 0],
[5, 'ON', 1]], dtype=object)
Here is how you can encode with NumPy alone. In this example I am just passing a Python list:
import numpy as np
alist = ['ON','ON','OFF','OFF','ON']
unique_values, y = np.unique(alist, return_inverse=True)
print(unique_values)
print(y)
The results are:
['OFF' 'ON']
[1 1 0 0 1]
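To go from a mixed CSV to a single numeric array, the same idea can be applied to every text column at once. The following is only a sketch: it swaps in pandas categorical codes in place of LabelEncoder (both assign integers in sorted order of the distinct values), and the three-row sample data is hypothetical, written in the same semicolon-separated style as the line in the question.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the same semicolon-separated style as the question's data.
csv_text = (
    "ENABLED;10;MANUAL;0.175\n"
    "DISABLED;20;AUTO;0.5\n"
    "ENABLED;30;MANUAL;0.25\n"
)
df = pd.read_csv(io.StringIO(csv_text), sep=";", header=None)

# Encode every non-numeric column as integer codes; numeric columns are left alone.
for col in df.select_dtypes(exclude="number").columns:
    df[col] = df[col].astype("category").cat.codes

# A purely numeric array, ready to be passed to scikit-learn.
X = df.to_numpy(dtype=float)
print(X)
```

Note that integer codes imply an ordering between categories; for models that are sensitive to this, one-hot encoding may be a better fit.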

Related

Faster way to set new df value using np.array index values in dataframe

I need to set the value of a new pandas df column based on the NumPy array index, also stored in the df. This works, but it is pretty slow with a large df. Any tips on how to speed things up?
a=np.random.random((5,5))
df=pd.DataFrame(np.array([[1,1],[3,3],[2,2],[3,2]]),columns=['i','j'])
df['ij']=df.apply(lambda x: (int(x['i']-1),int(x['j']-1)),axis=1)
for idx,r in df.iterrows():
    df.loc[idx,'new']=a[r['ij']]
With NumPy indexing:
inds = df[["i", "j"]].to_numpy() - 1
df["new"] = a[inds[:, 0], inds[:, 1]]
Here we index into a along its rows with the first column of inds and along its columns with the second column of inds. This gives:
>>> a
array([[0.27494719, 0.17706064, 0.71306907, 0.94776026, 0.04024955],
[0.56557293, 0.63732559, 0.12254121, 0.53177861, 0.48435987],
[0.33299644, 0.43459935, 0.57227818, 0.96142159, 0.79794503],
[0.80112425, 0.52816002, 0.01885327, 0.39880301, 0.51974912],
[0.60377461, 0.24419486, 0.88203753, 0.87263663, 0.49345361]])
>>> inds
array([[0, 0],
[2, 2],
[1, 1],
[2, 1]])
>>> df
i j new
0 1 1 0.274947
1 3 3 0.572278
2 2 2 0.637326
3 3 2 0.434599
For the ij column, you can do df["ij"] = inds.tolist().
Coming from the numpy side you could reshape the indices such that they match the a shape, using ravel_multi_index:
df["new"] = np.take(a, np.ravel_multi_index([df.i -1, df.j - 1], a.shape))
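As a quick sanity check, the sketch below compares both vectorized approaches against an explicit per-row lookup. The seeded random generator and the variable names fancy, flat, and looped are illustrative choices, not part of the original answers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the check is reproducible
a = rng.random((5, 5))
df = pd.DataFrame(np.array([[1, 1], [3, 3], [2, 2], [3, 2]]), columns=["i", "j"])

# Zero-based (row, column) pairs, one pair per DataFrame row.
inds = df[["i", "j"]].to_numpy() - 1

# Approach 1: integer-array (fancy) indexing.
fancy = a[inds[:, 0], inds[:, 1]]

# Approach 2: convert the pairs to flat indices, then take from the flattened array.
flat = np.take(a, np.ravel_multi_index((inds[:, 0], inds[:, 1]), a.shape))

# Reference: explicit per-row lookup, equivalent to the slow loop in the question.
looped = np.array([a[i - 1, j - 1] for i, j in zip(df["i"], df["j"])])

df["new"] = fancy
print(np.allclose(fancy, looped), np.allclose(flat, looped))
```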

Creating a pandas dataframe from a 2d numpy array (to be a column of 1d numpy arrays) and a 1d np array of labels

For example I have these numpy arrays:
import pandas as pd
import numpy as np
# points could be in n dimension, i need a solution that would cover that up
# and being able to calculate distance between points so flattening the data
# is not my goal.
points = np.array([[1, 2], [2, 1], [100, 100], [-2, -1], [0, 0], [-1, -2]]) # a 2d numpy array containing points in space
labels = np.array([0, 1, 1, 1, 0, 0]) # the labels of the points (not necessarily only 0 and 1)
I tried to make a dictionary and create the pandas dataframe from it:
my_dict = {'point': points, 'label': labels}
df = pd.DataFrame(my_dict, columns=['point', 'label'])
But it didn't work and I got the following exception:
Exception: Data must be 1-dimensional
Probably it's because of the numpy array of points (a 2d numpy array).
The desired result:
        point  label
0      [1, 2]      0
1      [2, 1]      1
2  [100, 100]      1
3    [-2, -1]      1
4      [0, 0]      0
5    [-1, -2]      0
Thanks in advance for all the helpers :)
You should generally try to normalize your data so that each column contains only scalar values, not multi-dimensional data.
In this case, I would do something like this:
>>> df = pd.DataFrame({'x': points[:,0], 'y': points[:, 1], 'label': labels},
columns=['x', 'y', 'label'])
>>> df
x y label
0 1 2 0
1 2 1 1
2 100 100 1
3 -2 -1 1
4 0 0 0
5 -1 -2 0
If you truly insist on keeping the points as such, transform them to a list of lists or a list of tuples before passing them to pandas to avoid this error.
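A minimal sketch of that second option, assuming the points and labels arrays from the question: the points are converted with tolist() so each cell holds one point, and the 2D array is rebuilt when distances are needed. The names restored and dist are illustrative.

```python
import numpy as np
import pandas as pd

points = np.array([[1, 2], [2, 1], [100, 100], [-2, -1], [0, 0], [-1, -2]])
labels = np.array([0, 1, 1, 1, 0, 0])

# A list of lists is one-dimensional from pandas' point of view:
# each cell of the 'point' column holds one whole point.
df = pd.DataFrame({"point": points.tolist(), "label": labels})
print(df)

# Rebuild the 2D array whenever numeric work (e.g. distances) is needed.
restored = np.array(df["point"].tolist())
dist = np.linalg.norm(restored[0] - restored[1])  # distance between points 0 and 1
print(dist)
```

This works for points of any dimension, but the column has object dtype, so vectorized pandas operations on it will be slow.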

Assign constant numpy array value to pandas dataframe column

I would like to assign a constant numpy array value to a pandas dataframe column.
Here is what I tried:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new'] = np.array([]) # did not work
my_df['new'] = np.array([])*len(my_df) # did not work
Here is what worked:
my_df['new'] = my_df['col_1'].apply(lambda x: np.array([]))
I am curious why it works with a simple scalar but does not work with a numpy array. Is there a simpler way to assign a numpy array value?
Your "new" column will contain arrays, so it must be an object-type column.
The simplest way to initialize it is:
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new']=None
You can then fill it as you want. For example (using .at, which sets a single cell and accepts an array there):
for index,(a,b,_) in my_df.iterrows():
    my_df.at[index,'new']=np.arange(a,b)
#
# col_1 col_2 new
# 0 1 4 [1, 2, 3]
# 1 2 5 [2, 3, 4]
# 2 3 6 [3, 4, 5]
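As an alternative to filling the column cell by cell, the whole column can be built in one assignment from a Python list of arrays. This is a sketch of that variation rather than the answer's own method; it relies on pandas storing a list of arrays as an object column with one array per row.

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})

# Build one array per row, then assign the whole list at once.
# The list has one entry per row, so pandas stores each array in its own cell.
my_df["new"] = [np.arange(a, b) for a, b in zip(my_df["col_1"], my_df["col_2"])]
print(my_df)
```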

how to get the actual index of my dataframe row while getting topk nearest neighbors?

this is the sample dataframe to be fit
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
neighbor_index = neigh.kneighbors([[1.3,4.5,2.5]],return_distance=False)
print(neighbor_index)
output:
Here are the positions of my 3 nearest neighbors:
array([[0, 1, 3]], dtype=int64)
I want the actual index labels from the dataframe, like array([[a,b,d]]). How can I get this?
This is easy to achieve. You just need some pandas indexing magic.
Do this:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
#load the data
df = pd.read_csv('data.csv')
print(df)
#build the model and fit it
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
#get the index
neighbor_index = neigh.kneighbors([[1.3,4.5,2.5]],return_distance=False)
print(neighbor_index)
#get the row index (the row names) of the dataframe
#neighbor_index is 2-D (one row per query point), so take its first row
names = df.index[neighbor_index[0]].tolist()
print(names)
Results:
   0  1  2
a  1  2  3
b  3  4  5
c  5  2  3
d  4  3  5
[[0 1 3]]
['a', 'b', 'd']
See the pandas documentation here about using numeric indices with a pandas DataFrame.
Below is an example recreating the dataframe in your question. The .iloc indexer selects rows by their integer position, so you can select the neighbor positions and then read the .index of the result to get the labels as they appear in the dataframe.
df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 3, 2], [4, 3, 5]], index=['a', 'b', 'c', 'd'])
df.iloc[[0, 1, 3]].index
which returns an Index containing 'a', 'b', and 'd'.
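When several points are queried at once, the neighbor positions come back as a 2D array with one row per query. A sketch of mapping all of them to labels in one step is shown below; to stay self-contained it uses a brute-force NumPy distance computation as a stand-in for NearestNeighbors.kneighbors, and the second query point is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [3, 4, 5], [5, 2, 3], [4, 3, 5]], index=["a", "b", "c", "d"]
)
# First query is from the question; the second is a hypothetical extra point.
queries = np.array([[1.3, 4.5, 2.5], [5.0, 2.0, 3.0]])

# Brute-force Euclidean distances (stand-in for kneighbors): shape (n_queries, n_rows).
dists = np.linalg.norm(df.to_numpy()[None, :, :] - queries[:, None, :], axis=2)
neighbor_index = np.argsort(dists, axis=1)[:, :3]  # positions of the 3 nearest rows

# A NumPy array of labels can be indexed with the 2D position array directly.
labels = df.index.to_numpy()[neighbor_index]
print(labels)
```

Converting the index to a NumPy array first avoids indexing a pandas Index with a 2D array, which recent pandas versions do not allow.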

Insert list of lists into single column of pandas df

I am trying to place multiple lists into a single column of a Pandas df. My list of lists is very long, so I cannot do so manually.
The desired output would look like this:
list_of_lists = [[1,2,3],[3,4,5],[5,6,7],...]
df = pd.DataFrame(list_of_lists)
>>> df
0
0 [1,2,3]
1 [3,4,5]
2 [5,6,7]
3 ...
Thank you for the assistance.
If you're trying to add it to an existing df, you can assign it by wrapping it in a Series (rows without a matching entry are filled with NaN):
In [7]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[7]:
a b c
0 -1.675422 -0.696623 -1.025674
1 0.032192 0.582190 0.214029
2 -0.134230 0.991172 -0.177654
3 -1.688784 1.275275 0.029581
4 -0.528649 0.858710 -0.244512
In [9]:
df['new_col'] = pd.Series([[1,2,3],[3,4,5],[5,6,7]])
df
Out[9]:
a b c new_col
0 -1.675422 -0.696623 -1.025674 [1, 2, 3]
1 0.032192 0.582190 0.214029 [3, 4, 5]
2 -0.134230 0.991172 -0.177654 [5, 6, 7]
3 -1.688784 1.275275 0.029581 NaN
4 -0.528649 0.858710 -0.244512 NaN
What about
df = pd.DataFrame({0: [[1,2,3],[3,4,5],[5,6,7]]})
The above solutions were helpful but wanted to add a little bit in case they didn't quite do the trick for someone...
pd.Series will not accept a 2D np.ndarray that looks like a list of lists, e.g. a one-hot labels array such as array([[1, 0, 0], [0, 1, 0], ..., [0, 0, 1]]).
In this case you can wrap the variable with list(), which turns it into a list of 1D arrays:
df['new_col'] = pd.Series(list(one_hot_labels))
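A small runnable sketch of that last point, with a hypothetical 3x3 one-hot array built from np.eye and an illustrative existing column named a:

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot labels: a 2D array with one row per sample.
one_hot_labels = np.eye(3, dtype=int)

df = pd.DataFrame({"a": [10, 20, 30]})

# list() splits the 2D array into a list of 1D row arrays,
# which pandas stores as one object per cell.
df["new_col"] = pd.Series(list(one_hot_labels), index=df.index)
print(df)
```

Passing index=df.index keeps the rows aligned even when the DataFrame does not use the default 0..n-1 index.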
