How to build a list of training samples from a dataset - python

Let's say I have a dataset:
a b label
2 apple 4
1 bin 5
I want to make a list like this:
[['apple', 4], ['bin', 5]]

import pandas as pd
import numpy as np
df = pd.DataFrame([[2,'apple',4],[1,'bin',5]], columns = ['a','b','label'])
np.stack((df.b, df.label), axis = 1)
# output
array([['apple', 4],
       ['bin', 5]], dtype=object)

Try:
df.drop('a', axis=1).values.tolist()
[['apple', 4], ['bin', 5]]
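If the frame has more columns than just a, dropping one of them may not be enough. A minimal sketch of the same idea that picks the wanted columns by name instead (the variable name pairs is only illustrative):

```python
import pandas as pd

df = pd.DataFrame([[2, 'apple', 4], [1, 'bin', 5]], columns=['a', 'b', 'label'])

# Select the wanted columns explicitly, then convert the rows to nested lists.
pairs = df[['b', 'label']].values.tolist()
print(pairs)
```

This keeps working unchanged if more feature columns are added to the frame later.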

Related

Is storing numpy array as value in a cell of pandas dataframe a good practice?

Suppose we create a dataframe whose col2 values are numpy arrays, like this:
import numpy as np
import pandas as pd
d = {'col1': [11, 12], 'col2': [np.array([1.2,2.3,3.4]),np.array([4,5,6])]}
df = pd.DataFrame(data=d)
output :
   col1             col2
0    11  [1.2, 2.3, 3.4]
1    12        [4, 5, 6]
Question -
Should we store numpy arrays in a dataframe like this at all? And if yes, what problems might someone face with this design?
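Not an answer to the design question, just a small sketch of two things that can be observed with the frame above: the array-valued column is stored with object dtype, and the cells have to be stacked back together before they can be used as one numeric array. The name stacked is only illustrative, and np.stack only works here because both arrays have the same length.

```python
import numpy as np
import pandas as pd

d = {'col1': [11, 12], 'col2': [np.array([1.2, 2.3, 3.4]), np.array([4, 5, 6])]}
df = pd.DataFrame(data=d)

# The cells hold whole arrays, so pandas stores the column as generic Python objects.
print(df['col2'].dtype)

# To get a regular 2-D numeric array back, the cells must be stacked explicitly.
stacked = np.stack(df['col2'].to_list())
print(stacked.shape)
```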

Assign constant numpy array value to pandas dataframe column

I would like to assign constant numpy array value to pandas dataframe column.
Here is what I tried:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new'] = np.array([]) # did not work
my_df['new'] = np.array([])*len(my_df) # did not work
Here is what worked:
my_df['new'] = my_df['col_1'].apply(lambda x: np.array([]))
I am curious why it works with a simple scalar but does not work with a numpy array. Is there a simpler way to assign a numpy array value?
Your "new" column will contain arrays, so it must be an object-type column.
The simplest way to initialize it is:
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new']=None
You can then fill it as you want. For example :
for index, (a, b, _) in my_df.iterrows():
    # .at sets a single cell, so the whole array is stored as one object
    my_df.at[index, 'new'] = np.arange(a, b)
#
#    col_1  col_2        new
# 0      1      4  [1, 2, 3]
# 1      2      5  [2, 3, 4]
# 2      3      6  [3, 4, 5]
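For the original question (the same constant array in every row), one alternative that avoids both apply and an explicit loop over rows is to assign a list with one array per row. This is a minimal sketch, not the only way; the column name rng is made up for the second example.

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [4, 5, 6]})

# A list with exactly len(my_df) entries is assigned row by row,
# so each cell receives its own (separate) empty array.
my_df['new'] = [np.array([]) for _ in range(len(my_df))]

# The same pattern with a different array per row, built from two columns.
my_df['rng'] = [np.arange(a, b) for a, b in zip(my_df['col_1'], my_df['col_2'])]
print(my_df)
```

Building a fresh array inside the comprehension means the rows do not share one underlying array, so changing one cell later does not affect the others.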

How to get the actual index of my dataframe rows while getting the top-k nearest neighbors?

This is the sample dataframe to be fit:
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
neighbor_index = neigh.kneighbors([[1.3,4.5,2.5]],return_distance=False)
print(neighbor_index)
output:
Here are the indices of my 3 nearest neighbors:
array([[0, 1, 3]], dtype=int64)
I want the actual index labels from the dataframe, like array([['a', 'b', 'd']]). How can I get them?
This is easy to achieve. You just need some pandas indexing magic.
Do this:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
# load the data (assuming the first CSV column holds the row labels)
df = pd.read_csv('data.csv', index_col=0)
print(df)
# build the model and fit it
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
# get the positional indices of the neighbors
neighbor_index = neigh.kneighbors([[1.3, 4.5, 2.5]], return_distance=False)
print(neighbor_index)
# map the positions for the single query point to the dataframe's row labels
names = list(df.index[neighbor_index[0]])
print(names)
Results:
   0  1  2
a  1  2  3
b  3  4  5
c  5  2  3
d  4  3  5
[[0 1 3]]
['a', 'b', 'd']
See the pandas documentation on position-based indexing (.iloc) for using numeric indices with a pandas DataFrame.
Below is an example recreating the dataframe in your question. The .iloc indexer returns rows of a dataframe based on their numeric position. You can retrieve the rows by position and then read their index to get the labels as they appear in the dataframe.
df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 3, 2], [4, 3, 5]], index=['a', 'b', 'c', 'd'])
df.iloc[[0, 1, 3]].index
which returns an Index containing 'a', 'b' and 'd'.
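Putting the pieces together without a CSV file, here is a self-contained sketch. It assumes the four rows printed in the results above (with c = 5, 2, 3), builds the frame in memory, and uses keyword arguments for NearestNeighbors; the names positions and labels are only illustrative.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Same data as the printed results above, with string row labels.
df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 2, 3], [4, 3, 5]],
                  index=['a', 'b', 'c', 'd'])

neigh = NearestNeighbors(n_neighbors=3)
neigh.fit(df.to_numpy())

# kneighbors returns row positions, one row of positions per query point.
positions = neigh.kneighbors([[1.3, 4.5, 2.5]], return_distance=False)

# Translate the positions for the first (only) query point into index labels.
labels = df.index[positions[0]].tolist()
print(positions)
print(labels)
```

With several query points, the same lookup can be repeated for each row of positions.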

python - Transform data to numpy array for sklearn

I have a dataset formed by some text columns (with limited possibilities) and some numeric columns in a csv format. Is there any way to automatically transform the text columns to numbers (for example: A will be 0, B will be 1 and so on) to transform the dataset to np.array?
This will be later used on scikit-learn, so it needs to be np.array at the end of all the processing.
EDIT: Adding one line of the dataset:
ENABLED;ENABLED;10;MANUAL;ENABLED;ENABLED;1800000;OFF;0.175;5.0;0.13;OFF;NEITHER;ENABLED;-65;2417;"wifi01";65;-75;DISCONNECTED;NO;NO;2621454;432477;3759;2.2436838539123705E-6;
You can apply sklearn.preprocessing.LabelEncoder to each text column. Here is an example:
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5],
                   'col2': ['ON','ON','OFF','OFF','ON']})
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df['encoded'] = lb.fit_transform(df.col2)
df
   col1 col2  encoded
0     1   ON        1
1     2   ON        1
2     3  OFF        0
3     4  OFF        0
4     5   ON        1
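I just added the numerical values in another column, but you can replace the original column with them instead. You can also convert the dataframe into a numpy array (as_matrix() has been removed from recent pandas versions; use to_numpy()):
df.to_numpy()
array([[1, 'ON', 1],
       [2, 'ON', 1],
       [3, 'OFF', 0],
       [4, 'OFF', 0],
       [5, 'ON', 1]], dtype=object)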
Here is how you can encode with numpy instead. In this example I am just passing a python list:
import numpy as np
alist = ['ON','ON','OFF','OFF','ON']
unique_values, y = np.unique(alist, return_inverse=True)
print(unique_values)
print(y)
The results are:
['OFF' 'ON']
[1 1 0 0 1]
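To go all the way to a purely numeric array for scikit-learn, every text column has to be replaced by its codes. The sketch below does that in a loop, using pandas.factorize in place of LabelEncoder (with sort=True it numbers the sorted unique values, like the np.unique example above). The col3 column and the names encoded, mappings and X are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['ON', 'ON', 'OFF', 'OFF', 'ON'],
                   'col3': ['NO', 'YES', 'NO', 'NO', 'YES']})  # hypothetical extra column

encoded = df.copy()
mappings = {}  # remembers which text value each code stands for
for name in encoded.columns:
    if not pd.api.types.is_numeric_dtype(encoded[name]):
        # codes[i] is the position of the i-th value in the sorted unique values
        codes, uniques = pd.factorize(encoded[name], sort=True)
        encoded[name] = codes
        mappings[name] = list(uniques)

X = encoded.to_numpy()  # all columns are numeric now
print(X)
print(mappings)
```

Keeping the mappings around makes it possible to translate the codes back to the original text later.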

Insert list of lists into single column of pandas df

I am trying to place multiple lists into a single column of a Pandas df. My list of lists is very long, so I cannot do so manually.
The desired output would look like this:
list_of_lists = [[1,2,3],[3,4,5],[5,6,7],...]
df = pd.DataFrame(list_of_lists)
>>> df
0
0 [1,2,3]
1 [3,4,5]
2 [5,6,7]
3 ...
Thank you for the assistance.
You can assign it by wrapping it in a Series if you're trying to add to an existing df (rows without a matching list are filled with NaN):
In [7]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[7]:
a b c
0 -1.675422 -0.696623 -1.025674
1 0.032192 0.582190 0.214029
2 -0.134230 0.991172 -0.177654
3 -1.688784 1.275275 0.029581
4 -0.528649 0.858710 -0.244512
In [9]:
df['new_col'] = pd.Series([[1,2,3],[3,4,5],[5,6,7]])
df
Out[9]:
a b c new_col
0 -1.675422 -0.696623 -1.025674 [1, 2, 3]
1 0.032192 0.582190 0.214029 [3, 4, 5]
2 -0.134230 0.991172 -0.177654 [5, 6, 7]
3 -1.688784 1.275275 0.029581 NaN
4 -0.528649 0.858710 -0.244512 NaN
What about
df = pd.DataFrame({0: [[1,2,3],[3,4,5],[5,6,7]]})
The above solutions were helpful, but I wanted to add a little bit in case they didn't quite do the trick for someone...
pd.Series will not accept a 2-D np.ndarray (one that looks like a list of lists), e.g. a one-hot label array such as array([[1, 0, 0], [0, 1, 0], ..., [0, 0, 1]]).
So in this case one can wrap the variable with list():
df['new_col'] = pd.Series(list(one_hot_labels))
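A small runnable sketch of that last point, assuming a 3x3 one-hot array and a frame with matching length (the names one_hot_labels and the column a are only illustrative):

```python
import numpy as np
import pandas as pd

one_hot_labels = np.eye(3, dtype=int)      # 2-D array: one row per sample
df = pd.DataFrame({'a': [10, 20, 30]})

# list() splits the 2-D array into a list of 1-D row arrays,
# which pd.Series stores as one object per cell.
df['new_col'] = pd.Series(list(one_hot_labels), index=df.index)
print(df)
```

Passing index=df.index keeps the rows aligned if the frame does not use the default 0..n-1 index.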
