Assign constant numpy array value to pandas dataframe column - python

I would like to assign a constant NumPy array value to a pandas DataFrame column.
Here is what I tried:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new'] = np.array([]) # did not work
my_df['new'] = np.array([])*len(my_df) # did not work
Here is what worked:
my_df['new'] = my_df['col_1'].apply(lambda x: np.array([]))
I am curious why this works with a simple scalar but not with a NumPy array. Is there a simpler way to assign a NumPy array value?

Your "new" column will contain arrays, so it must be an object-dtype column.
The simplest way to initialize it is:
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new']=None
You can then fill it however you want. For example:
for index, (a, b, _) in my_df.iterrows():
    # .at sets a single cell, so the whole array is stored as one object
    my_df.at[index, 'new'] = np.arange(a, b)
#
# col_1 col_2 new
# 0 1 4 [1, 2, 3]
# 1 2 5 [2, 3, 4]
# 2 3 6 [3, 4, 5]
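On the "why" part of the question: a scalar is broadcast to every row, whereas an array or list is read as one value per row, so its length has to match the number of rows. Here is a minimal sketch of a loop-free alternative, assuming you want a separate empty array in each row. It passes a list with one array per row:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [4, 5, 6]})

# One array per row: the list length equals len(my_df), so each array
# ends up as the content of a single cell in an object-dtype column.
my_df['new'] = [np.array([]) for _ in range(len(my_df))]

print(my_df['new'].dtype)
print(type(my_df['new'].iloc[0]))
```

Building a fresh array per row (rather than repeating one array object) keeps the cells independent if you later modify one of them in place.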

Related

Comparing two data frames columns and assigning Zero and One

I have a dataframe and a list, which includes a part of columns' name from my dataframe as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now I want to create an array with one entry per column of my original dataframe, containing only zeros and ones: 1 if the column name appears in "my_list", otherwise 0. My desired output should look like this:
my_array={[1,0,1,0,0,...,1]}
This should help:
import pandas as pd
dictt = {'a':[1,2,3],
'b':[4,5,6],
'c':[7,8,9]}
df = pd.DataFrame(dictt)
my_list = ['a','h','g','c']
my_array = []
for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)
print(my_array)
Output:
[1, 0, 1]
If you want my_array to be a NumPy array instead of a list, use this:
import pandas as pd
import numpy as np
dictt = {'a':[1,2,3],
'b':[4,5,6],
'c':[7,8,9]}
df = pd.DataFrame(dictt)
my_list = ['a','h','g','c']
my_array = np.empty(0,dtype = int)
for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array,1)
    else:
        my_array = np.append(my_array,0)
print(my_array)
Output:
[1 0 1]
I used test data in my code to make it easier to follow. Replace the test dataframe with your actual dataframe. Hope this helps!
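The same flags can probably be computed without an explicit loop. This is a small sketch using the same test data as above; it relies on `Index.isin`, which returns a boolean NumPy array that can be cast to integers:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
my_list = ['a', 'h', 'g', 'c']

# True where the column name is in my_list, then cast True/False to 1/0
my_array = df.columns.isin(my_list).astype(int)
print(my_array)  # [1 0 1]
```

Names in my_list that are not columns ('h' and 'g' here) are simply ignored.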

I want to make a train dataset's list

Let's say I have a dataset:
a b label
2 apple 4
1 bin 5
I want to make a list like:
[[apple,4], [bin,5]]
import pandas as pd
import numpy as np
df = pd.DataFrame([[2,'apple',4],[1,'bin',5]], columns = ['a','b','label'])
np.stack((df.b, df.label), axis = 1)
# Output:
array([['apple', 4],
['bin', 5]], dtype=object)
Try:
df.drop('a', axis=1).values.tolist()
[['apple', 4], ['bin', 5]]
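If the real dataset has more columns than this toy one, selecting the wanted columns by name may be safer than dropping the unwanted ones. A minimal sketch, assuming the column names 'b' and 'label' from the question:

```python
import pandas as pd

df = pd.DataFrame([[2, 'apple', 4], [1, 'bin', 5]], columns=['a', 'b', 'label'])

# Pick the two columns explicitly, then convert the rows to nested lists
pairs = df[['b', 'label']].to_numpy().tolist()
print(pairs)  # [['apple', 4], ['bin', 5]]
```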

Adding Numpy ndarray into dataframe

I would like to add a numpy array to a row in my dataframe:
I have a dataframe holding some data in each row, and now I would like to add a new column which contains an n-element array.
for example:
Name, Years
Test, 2
Test2, 4
Now I would like to add:
testarray1 = [100, 101, 1 , 0, 0, 5] as a new column='array' to Name='Test'
Name, Years, array
Test, 2, testarray1
Test2, 4, NaN
How can I do this?
import pandas as pd
import numpy as np
testarray1 = [100, 101, 1 , 0, 0, 5]
d = {'Name':['Test', 'Test2'],
'Years': [2, 4]
}
df = pd.DataFrame(d) # create a DataFrame of the data
df.set_index('Name', inplace=True) # set the 'Name' column as the dataframe index
df['array'] = np.nan # create a new empty 'array' column (filled with NaNs); np.NaN was removed in NumPy 2.0
df['array'] = df['array'].astype(object) # convert it to an 'object' data type
df.at['Test', 'array'] = testarray1 # fill in the cell where index equals 'Test' and column equals 'array'
df.reset_index(inplace=True) # if you don't want 'Name' to be the dataframe index
print(df)
Name Years array
0 Test 2 [100, 101, 1, 0, 0, 5]
1 Test2 4 NaN
Try this
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['test', 'test2'], 'year':[1,2]})
print(df)
x = np.arange(5)
df['array']=[x,np.nan]
print(df)
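When several names need arrays, one option is to keep them in a dictionary and build the whole column in one pass. This is only a sketch: `arrays_by_name` is a hypothetical lookup table, not something from the question, and names without an entry fall back to NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Test', 'Test2'], 'Years': [2, 4]})

# Hypothetical lookup table: name -> array to attach to that row
arrays_by_name = {'Test': [100, 101, 1, 0, 0, 5]}

# One value per row; rows whose name is missing from the dict get NaN
df['array'] = [arrays_by_name.get(name, np.nan) for name in df['Name']]
print(df)
```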

How to get the actual index of my dataframe row while getting top-k nearest neighbors?

This is the sample dataframe to be fit:
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
neighbor_index = neigh.kneighbors([[1.3,4.5,2.5]],return_distance=False)
print(neighbor_index)
Output (the indices of my 3 nearest neighbors):
array([[0, 1, 3]], dtype=int64)
I want the actual index labels from the dataframe, like array([[a,b,d]]). How can I get them?
This is easy to achieve. You just need some pandas indexing magic.
Do this:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
#load the data
df = pd.read_csv('data.csv')
print(df)
#build the model and fit it
neigh = NearestNeighbors(n_neighbors=3, radius=0.4)
neigh.fit(df)
#get the index
neighbor_index = neigh.kneighbors([[1.3,4.5,2.5]],return_distance=False)
print(neighbor_index)
#get the row index (the row names) of the dataframe
names = [df.index[row].to_numpy() for row in neighbor_index]
print(names)
Results:
0 1 2
a 1 2 3
b 3 4 5
c 5 2 3
d 4 3 5
[[0 1 3]]
[array(['a', 'b', 'd'], dtype=object)]
See the pandas documentation on integer-position indexing (.iloc) for details on using numeric indices with a pandas DataFrame.
Below is an example recreating the dataframe in your question. The .iloc indexer returns rows of a dataframe based on their integer position. Select the rows by position, then read .index to get the labels as they appear in the dataframe.
df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 3, 2], [4, 3, 5]], index=['a', 'b', 'c', 'd'])
df.iloc[[0, 1, 3]].index
which returns an Index containing 'a', 'b' and 'd'.
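Because kneighbors returns a 2-D array (one row of positions per query point), it can be convenient to translate the whole array at once. The sketch below uses a hard-coded array as a stand-in for the kneighbors output, so it runs without scikit-learn, and looks the labels up with plain NumPy indexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 2, 3], [4, 3, 5]],
                  index=['a', 'b', 'c', 'd'])

# Stand-in for neigh.kneighbors(..., return_distance=False): shape (n_queries, k)
neighbor_index = np.array([[0, 1, 3]])

# Indexing the label array with the 2-D position array keeps the same shape
labels = df.index.to_numpy()[neighbor_index]
print(labels)  # [['a' 'b' 'd']]
```

With several query points, each row of labels corresponds to one query.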

Insert list of lists into single column of pandas df

I am trying to place multiple lists into a single column of a Pandas df. My list of lists is very long, so I cannot do so manually.
The desired output would look like this:
list_of_lists = [[1,2,3],[3,4,5],[5,6,7],...]
df = pd.DataFrame(list_of_lists)
>>> df
0
0 [1,2,3]
1 [3,4,5]
2 [5,6,7]
3 ...
Thank you for the assistance.
You can assign it by wrapping it in a Series if you're trying to add it to an existing df:
In [7]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[7]:
a b c
0 -1.675422 -0.696623 -1.025674
1 0.032192 0.582190 0.214029
2 -0.134230 0.991172 -0.177654
3 -1.688784 1.275275 0.029581
4 -0.528649 0.858710 -0.244512
In [9]:
df['new_col'] = pd.Series([[1,2,3],[3,4,5],[5,6,7]])
df
Out[9]:
a b c new_col
0 -1.675422 -0.696623 -1.025674 [1, 2, 3]
1 0.032192 0.582190 0.214029 [3, 4, 5]
2 -0.134230 0.991172 -0.177654 [5, 6, 7]
3 -1.688784 1.275275 0.029581 NaN
4 -0.528649 0.858710 -0.244512 NaN
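The NaN values in the last two rows come from index alignment: the Series has labels 0 to 2, and rows of df without a matching label are left empty. A small deterministic sketch of the same behaviour (the column name 'a' is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': range(5)})

# The Series only has index labels 0, 1, 2; rows 3 and 4 get NaN
df['new_col'] = pd.Series([[1, 2, 3], [3, 4, 5], [5, 6, 7]])
print(df['new_col'].isna().tolist())  # [False, False, False, True, True]
```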
What about
df = pd.DataFrame({0: [[1,2,3],[3,4,5],[5,6,7]]})
The above solutions were helpful, but I wanted to add a little bit in case they don't quite do the trick for someone...
pd.Series will not accept a 2-D np.ndarray (one that looks like a list of lists), e.g. a one-hot labels array([[1, 0, 0], [0, 1, 0], ..., [0, 0, 1]]).
In this case you can wrap the variable with list(), which turns it into a list of 1-D row arrays:
df['new_col'] = pd.Series(list(one_hot_labels))
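Here is a runnable sketch of that last case. The one-hot matrix comes from `np.eye` purely for illustration, and the 'label' column is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['x', 'y', 'z']})

# Illustrative 2-D one-hot matrix, one row per dataframe row
one_hot_labels = np.eye(3, dtype=int)

# list() splits the matrix into 1-D row arrays, so each cell holds one row
df['new_col'] = pd.Series(list(one_hot_labels), index=df.index)
print(df)
```

Passing index=df.index keeps the rows lined up even when df does not use the default 0..n-1 index.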
