I would like to add a NumPy array to individual rows of my DataFrame.
I have a DataFrame holding some data in each row, and I would like to add a new column that contains an n-element array.
For example:
Name, Years
Test, 2
Test2, 4
Now I would like to add:
testarray1 = [100, 101, 1, 0, 0, 5] as a new column 'array' in the row where Name == 'Test':
Name, Years, array
Test, 2, testarray1
Test2, 4, NaN
How can I do this?
import pandas as pd
import numpy as np
testarray1 = [100, 101, 1 , 0, 0, 5]
d = {'Name':['Test', 'Test2'],
'Years': [2, 4]
}
df = pd.DataFrame(d) # create a DataFrame of the data
df.set_index('Name', inplace=True) # set the 'Name' column as the dataframe index
df['array'] = np.nan # create a new empty 'array' column (filled with NaNs); np.NaN was removed in NumPy 2.0
df['array'] = df['array'].astype(object) # convert it to an 'object' data type
df.at['Test', 'array'] = testarray1 # fill in the cell where index equals 'Test' and column equals 'array'
df.reset_index(inplace=True) # if you don't want 'Name' to be the dataframe index
print(df)
Name Years array
0 Test 2 [100, 101, 1, 0, 0, 5]
1 Test2 4 NaN
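If several rows need their own array, the same object-column idea extends naturally. Below is a minimal sketch, assuming a hypothetical lookup dict `arrays_by_name` (not part of the original question) that maps a name to its list. Names without an entry get NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Test', 'Test2'], 'Years': [2, 4]})

# Hypothetical lookup: which array belongs to which name
arrays_by_name = {'Test': [100, 101, 1, 0, 0, 5]}

# One entry per row; a list of Python objects becomes an object-dtype column
df['array'] = [arrays_by_name.get(name, np.nan) for name in df['Name']]
print(df)
```

This avoids setting the index first, at the cost of building the whole column in one go.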
Try this
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['test', 'test2'], 'year':[1,2]})
print(df)
x = np.arange(5)
df['array']=[x,np.nan]
print(df)
Related
I currently have a pandas dataframe that has a column of values that are numpy arrays. I am trying to get the rows of the dataframe where the value of the column is an empty numpy array but I can't index using the pandas method.
Here is an example dataframe.
data = {'Name': ['A', 'B', 'C', 'D'], 'stats': [np.array([1,1,1]), np.array([]), np.array([2,2,2]), np.array([])]}
df = pd.DataFrame(data)
I am trying to just get the rows where 'stats' is None, but when I try df[df['stats'] is None] I just get a KeyError: False.
How can I filter by rows that contain an empty list?
Additionally, how can I filter by row where the numpy array is something specific? i.e. get all rows of df where df['stats'] == np.array([1, 1, 1])
Thanks
You can check the length with Series.str.len, because it works with any iterable:
print (df['stats'].str.len())
0 3
1 0
2 3
3 0
Name: stats, dtype: int64
And then filter, e.g. rows with len=0:
df = df[df['stats'].str.len().eq(0)]
#alternative
#df = df[df['stats'].apply(len).eq(0)]
print (df)
Name stats
1 B []
3 D []
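If you need to test for a specific array, you can convert each array to a tuple and compare (run this on the original df, since the filter above overwrote df):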
df =df[ df['stats'].apply(tuple) == tuple(np.array([1, 1, 1]))]
print (df)
Name stats
0 A [1, 1, 1]
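As for the `KeyError: False` in the question: `df['stats'] is None` is evaluated by Python rather than pandas and yields the single value `False`, so `df[False]` looks for a column named `False`. A small sketch of an element-wise emptiness check instead, using the `size` attribute of each stored array (the variable name `is_empty` is just for illustration):

```python
import numpy as np
import pandas as pd

data = {'Name': ['A', 'B', 'C', 'D'],
        'stats': [np.array([1, 1, 1]), np.array([]),
                  np.array([2, 2, 2]), np.array([])]}
df = pd.DataFrame(data)

# An identity test on the whole Series: one Python bool, not a per-row mask
print(df['stats'] is None)

# Element-wise test: True where the stored array has no elements
is_empty = df['stats'].apply(lambda arr: arr.size == 0)
print(df[is_empty])
```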
For this part of the question:
"Additionally, how can I filter by row where the numpy array is something specific? i.e. get all rows of df where df['stats'] == np.array([1, 1, 1])"
data = {'Name': ['A', 'B', 'C', 'D'], 'stats': [np.array([1,1,1]), np.array([]), np.array([2,2,2]), np.array([])]}
df = pd.DataFrame(data)
df = df[df['stats'].apply(lambda x: np.array_equal(x, np.array([1,1,1])))]
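As a rough sketch of the same approach, the comparison can be wrapped in a small helper so the target array is passed in (the name `rows_matching` is made up for illustration). `np.array_equal` returns False for arrays of different shape, so the empty arrays are simply skipped.

```python
import numpy as np
import pandas as pd

def rows_matching(frame, column, target):
    """Return the rows of `frame` whose array in `column` equals `target`."""
    target = np.asarray(target)
    mask = frame[column].apply(lambda arr: np.array_equal(arr, target))
    return frame[mask]

data = {'Name': ['A', 'B', 'C', 'D'],
        'stats': [np.array([1, 1, 1]), np.array([]),
                  np.array([2, 2, 2]), np.array([])]}
df = pd.DataFrame(data)

matches = rows_matching(df, 'stats', [1, 1, 1])
print(matches['Name'].tolist())
```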
I have a DataFrame and a list that contains some of the DataFrame's column names, as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now I want to create an array with one entry per column of my original DataFrame, consisting only of zeros and ones. The array should contain 1 where the column's name appears in "my_list" and 0 otherwise. My desired output looks like this:
my_array={[1,0,1,0,0,...,1]}
This should help you:
import pandas as pd
dictt = {'a':[1,2,3],
'b':[4,5,6],
'c':[7,8,9]}
df = pd.DataFrame(dictt)
my_list = ['a','h','g','c']
my_array = []
for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)
print(my_array)
Output:
[1, 0, 1]
If you want my_array to be a NumPy array instead of a list, use this:
import pandas as pd
import numpy as np
dictt = {'a':[1,2,3],
'b':[4,5,6],
'c':[7,8,9]}
df = pd.DataFrame(dictt)
my_list = ['a','h','g','c']
my_array = np.empty(0,dtype = int)
for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array,1)
    else:
        my_array = np.append(my_array,0)
print(my_array)
Output:
[1 0 1]
I have used test data in my code to make it easier to follow. You can replace the test DataFrame with your actual DataFrame. Hope this helps!
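For wider frames, the explicit loop can probably be avoided. A minimal sketch, assuming the same test data as above, that uses `Index.isin` to build the 0/1 array in one step:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
my_list = ['a', 'h', 'g', 'c']

# isin gives a boolean NumPy array with one entry per column; cast it to 0/1
my_array = df.columns.isin(my_list).astype(int)
print(my_array)
```

Names in `my_list` that are not columns of the DataFrame (here 'h' and 'g') are ignored.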
A CSV file is given. I am supposed to return the row label of one row of the DataFrame as a string. How do I do that?
import pandas as pd
df= pd.read_csv('olympics.csv', index_col=0, skiprows=1)
s= df.loc[df['Gold'].idxmax()]
return s.index
Here 'Gold' is just an example column name. I have been trying this code, but it only gives me the column labels. I need the row label as a string.
df = pd.DataFrame({'id':['1','2','3','4','5','6','7','8'],
'A':['foo','bar','foo','bar','foo','bar','foo','foo'],
'C':[10, 10, 10, 30, 50, 60, 50, 8],
'D':[9, 8, 7, 6, 5, 4, 3, 2]},
index = list('abcdefgh'))
idxmax() returns the row label of the maximum value:
>>> df['C'].idxmax()
'f'
Selecting that row produces a Series whose name is the index of that row.
>>> df.loc[df['C'].idxmax()]
id 6
A bar
C 60
D 4
Name: f, dtype: object
>>> df.loc[df['C'].idxmax()].name
'f'
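Put together as a function, which is what the `return` in the question suggests was wanted. This is only a sketch: the function name `label_of_max` is made up, and `str()` is applied in case the index labels are not already strings.

```python
import pandas as pd

def label_of_max(frame, column):
    """Return the row label (as a string) of the largest value in `column`."""
    return str(frame[column].idxmax())

df = pd.DataFrame({'C': [10, 10, 10, 30, 50, 60, 50, 8]},
                  index=list('abcdefgh'))
print(label_of_max(df, 'C'))
```

Calling `idxmax()` on the column directly skips the intermediate row Series, whose `.index` holds the column labels rather than the row label.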
I would like to assign a constant NumPy array value to a pandas DataFrame column.
Here is what I tried:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new'] = np.array([]) # did not work
my_df['new'] = np.array([])*len(my_df) # did not work
Here is what worked:
my_df['new'] = my_df['new'].apply(lambda x: np.array([]))
I am curious why it works with a simple scalar but not with a NumPy array. Is there a simpler way to assign a NumPy array value?
Your "new" column will contain arrays, so it must be an object-type column.
The simplest way to initialize it is:
my_df = pd.DataFrame({'col_1': [1,2,3], 'col_2': [4,5,6]})
my_df['new']=None
You can then fill it as you want. For example:
for index,(a,b,_) in my_df.iterrows():
    # .at sets exactly one cell, so the array is stored as a single object
    my_df.at[index,'new']=np.arange(a,b)
#
# col_1 col_2 new
# 0 1 4 [1, 2, 3]
# 1 2 5 [2, 3, 4]
# 2 3 6 [3, 4, 5]
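If every row should simply start with the same constant array, a list comprehension may be the shortest route. A plain `my_df['new'] = np.array([])` fails because pandas treats the array's elements as one value per row, and the lengths do not match. The sketch below assumes each row should get its own independent copy of the array:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [4, 5, 6]})

# Build one separate empty array per row; the column gets dtype object
my_df['new'] = [np.array([]) for _ in range(len(my_df))]
print(my_df)
```

Using a comprehension rather than `[np.array([])] * len(my_df)` means the rows do not all share the same array object.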
Is it possible to create a pandas.DataFrame that includes a list-type field?
For example, I'd like to load the following CSV into a pandas.DataFrame:
id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"
Strip the double quotes:
id,scores
1, [1,2,3,4]
2, [1,2]
3, [0,2,4]
And you should be able to do this:
import pandas
query = [[1, [1,2,3,4]], [2, [1,2]], [3, [0,2,4]]]
df = pandas.DataFrame(query, columns=['id', 'scores'])
print(df)
You can use:
import pandas as pd
import io
temp=u'''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''
df = pd.read_csv(io.StringIO(temp), sep=',', index_col=[0] )
print(df)
scores
id
1 [1,2,3,4]
2 [1,2]
3 [0,2,4]
But the scores column then has dtype object and holds strings, not lists.
One approach is to use ast and converters:
import pandas as pd
import io
from ast import literal_eval
temp=u'''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''
def converter(x):
    # parse a string such as "[1,2,3]" into a Python list
    return literal_eval(x)
# map each column name to its converter
converters={'scores': converter}
df = pd.read_csv(io.StringIO(temp), sep=',', converters=converters)
print(df)
id scores
0 1 [1, 2, 3, 4]
1 2 [1, 2]
2 3 [0, 2, 4]
#check lists:
print(2 in df.scores[2])
#True
print(1 in df.scores[2])
#False
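If the file has already been loaded without converters, the string column can also be parsed afterwards. A minimal sketch, assuming the same inline CSV text as above:

```python
import io
from ast import literal_eval

import pandas as pd

temp = '''id,scores
1,"[1,2,3,4]"
2,"[1,2]"
3,"[0,2,4]"'''

df = pd.read_csv(io.StringIO(temp))

# Each cell is a string such as "[1,2]"; literal_eval turns it into a list
df['scores'] = df['scores'].apply(literal_eval)
print(df['scores'][2])
```

`literal_eval` only accepts Python literals, so it is a safer choice here than `eval`.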