Set pandas dataframe cell to list [duplicate] - python

I have a list 'abc' and a dataframe 'df':
abc = ['foo', 'bar']
df =
A B
0 12 NaN
1 23 NaN
I want to insert the list into cell 1B, so I want this result:
A B
0 12 NaN
1 23 ['foo', 'bar']
Ho can I do that?
1) If I use this:
df.ix[1,'B'] = abc
I get the following error message:
ValueError: Must have equal len keys and value when setting with an iterable
because it tries to insert the list (that has two elements) into a row / column but not into a cell.
2) If I use this:
df.ix[1,'B'] = [abc]
then it inserts a list that has only one element that is the 'abc' list ( [['foo', 'bar']] ).
3) If I use this:
df.ix[1,'B'] = ', '.join(abc)
then it inserts a string: ( foo, bar ) but not a list.
4) If I use this:
df.ix[1,'B'] = [', '.join(abc)]
then it inserts a list but it has only one element ( ['foo, bar'] ) but not two as I want ( ['foo', 'bar'] ).
Thanks for help!
EDIT
My new dataframe and the old list:
abc = ['foo', 'bar']
df2 =
A B C
0 12 NaN 'bla'
1 23 NaN 'bla bla'
Another dataframe:
df3 =
A B C D
0 12 NaN 'bla' ['item1', 'item2']
1 23 NaN 'bla bla' [11, 12, 13]
I want insert the 'abc' list into df2.loc[1,'B'] and/or df3.loc[1,'B'].
If the dataframe has columns only with integer values and/or NaN values and/or list values then inserting a list into a cell works perfectly. If the dataframe has columns only with string values and/or NaN values and/or list values then inserting a list into a cell works perfectly. But if the dataframe has columns with integer and string values and other columns then the error message appears if I use this: df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.
Another dataframe:
df4 =
A B
0 'bla' NaN
1 'bla bla' NaN
These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.

Since set_value has been deprecated since version 0.21.0, you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df.at[1, 'B'] = ['m', 'n']
df =
A B
0 1 x
1 2 [m, n]
2 3 z
You also need to make sure the column you are inserting into has dtype=object. For example
>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A int64
B int64
dtype: object
>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence
>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
A B
0 1 1
1 2 [1, 2, 3]
2 3 3

Pandas >= 0.21
set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.
Setting Cell Values with at/iat
# Setup
>>> df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
>>> df
A B
0 12 [a, b]
1 23 [c, d]
>>> df.dtypes
A int64
B object
dtype: object
If you want to set a value in second row of the "B" column to some new list, use DataFrame.at:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
You can also set by integer position using DataFrame.iat
>>> df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
What if I get ValueError: setting an array element with a sequence?
I'll try to reproduce this with:
>>> df
A B
0 12 NaN
1 23 NaN
>>> df.dtypes
A int64
B float64
dtype: object
>>> df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.
This is because of a your object is of float64 dtype, whereas lists are objects, so there's a mismatch there. What you would have to do in this situation is to convert the column to object first.
>>> df['B'] = df['B'].astype(object)
>>> df.dtypes
A int64
B object
dtype: object
Then, it works:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 NaN
1 23 [m, n]
Possible, But Hacky
Even more wacky, I've found that you can hack through DataFrame.loc to achieve something similar if you pass nested lists.
>>> df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
>>> df
A B
0 12 [a, b]
1 23 [m, n, o, p]
You can read more about why this works here.

df3.set_value(1, 'B', abc) works for any dataframe. Take care of the data type of column 'B'. For example, a list can not be inserted into a float column, at that case df['B'] = df['B'].astype(object) can help.

Quick work around
Simply enclose the list within a new list, as done for col2 in the data frame below. The reason it works is that python takes the outer list (of lists) and converts it into a column as if it were containing normal scalar items, which is lists in our case and not normal scalars.
mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data
col1 col2
0 1 [1, 4]
1 2 [2, 5]
2 3 [3, 6]

Also getting
ValueError: Must have equal len keys and value when setting with an iterable,
using .at rather than .loc did not make any difference in my case, but enforcing the datatype of the dataframe column did the trick:
df['B'] = df['B'].astype(object)
Then I could set lists, numpy array and all sorts of things as single cell values in my dataframes.

As mentionned in this post pandas: how to store a list in a dataframe?; the dtypes in the dataframe may influence the results, as well as calling a dataframe or not to be assigned to.

I've got a solution that's pretty simple to implement.
Make a temporary class just to wrap the list object and later call the value from the class.
Here's a practical example:
Let's say you want to insert list object into the dataframe.
df = pd.DataFrame([
{'a': 1},
{'a': 2},
{'a': 3},
])
df.loc[:, 'b'] = [
[1,2,4,2,],
[1,2,],
[4,5,6]
] # This works. Because the list has the same length as the rows of the dataframe
df.loc[:, 'c'] = [1,2,4,5,3] # This does not work.
>>> ValueError: Must have equal len keys and value when setting with an iterable
## To force pandas to have list as value in each cell, wrap the list with a temporary class.
class Fake(object):
def __init__(self, li_obj):
self.obj = li_obj
df.loc[:, 'c'] = Fake([1,2,5,3,5,7,]) # This works.
df.c = df.c.apply(lambda x: x.obj) # Now extract the value from the class. This works.
Creating a fake class to do this might look like a hassle but it can have some practical applications. For an example you can use this with apply when the return value is list.
Pandas would normally refuse to insert list into a cell but if you use this method, you can force the insert.

I prefer .at and .loc. It is important to note, that the target column needs a dtype (object), which can handle the list.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A': [0, 1, 2, 3],
'B': np.array([np.nan]*3 + [[3, 33]], dtype=object),
})
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
df.at[0, 'B'] = [0, 100] # at assigns single elemnt
df.loc[1, 'B'] = [[ [1, 11] ]] # loc expects 2d input
print('df modified:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
A B
0 0 NaN
1 1 NaN
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
df modified:
A B
0 0 [0, 100]
1 1 [[1, 11]]
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object

first set the cell to blank. next use at to assign the abc list to the cell at 1, 'B'
abc = ['foo', 'bar']
df =pd.DataFrame({'A':[12,23],'B':[np.nan,np.nan]})
df.loc[1,'B']=''
df.at[1,'B']=abc
print(df)

Related

How do I keep original pandas dataframe values while using df.astype() ? I need to raise a value error for below example

df = pd.DataFrame({
'A': ['a', 'b', 'c', 'd', 'e'],
'B': [1, 2.5, 3, 4, 5],
'C': ['abc', 'def', 'ghi', 'jkl', 'mno'] })
col_type = {'A':str, 'B':int, 'C':str}
df = df.astype(col_type)
df
Output is:
A B C
0 a 1 abc
1 b 2 def
2 c 3 ghi
3 d 4 jkl
4 e 5 mno
But I want to raise a value error at index 1 for column B. I don't need the integer value. I want to do it automatically( Like loop through all columns)
Panda's built .astype() in doesn't appear to have a 'safe casting' method as you want.
In numpy you can use
np.ndarray.astype(preferred_type, casting='safe')
So unfortunately I don't have a pretty solution for you but I would do something like
coltypes = [str,int,str]
colnames = ['a','b','c']
data_for_df = [df.values[:,i].astype(coltypes[i], casting='safe') for i in range(len(df))]
df = pd.DataFrame(data_for_df,columns=colnames)
Someone might be able to give a better answer than me :)
If you want to control that some columns of floating point values only contain integers, you can simply examine the difference between the original column(s) and the same one(s) after int conversion:
(df['B'] - df['B'].astype('int')) == 0
gives following Series:
0 True
1 False
2 True
3 True
4 True
Name: B, dtype: bool
From there, you can raise an exception
tmp = (df['B'] - df['B'].astype('int')) == 0
if not tmp.all():
raise TypeError("Non int value at "+ ', '.join(df[(df['B'] - df['B'].astype('int')) != 0]
.index.astype(str)))
With the sample data, it gives as expected:
TypeError: Non int value at 1

Pandas Set element of a new column as a list (iterable) raise ValueError: setting an array element with a sequence

I want to, at the same time, create a new column in a pandas dataframe and set its first value to a list.
I want to transform this dataframe
df = pd.DataFrame.from_dict({'a':[1,2],'b':[3,4]})
a b
0 1 3
1 2 4
into this one
a b c
0 1 3 [2,3]
1 2 4 NaN
I tried :
df.loc[0, 'c'] = [2,3]
df.loc[0, 'c'] = np.array([2,3])
df.loc[0, 'c'] = [[2,3]]
df.at[0,'c'] = [2,3]
df.at[0,'d'] = [[2,3]]
It does not work.
How should I proceed?
If the first element of a series is a list, then the series must be of type object (not the most efficient for numerical computations). This should work, however.
df = df.assign(c=None)
df.loc[0, 'c'] = [2, 3]
>>> df
a b c
0 1 3 [2, 3]
1 2 4 None
If you really need the remaining values of column c to be NaNs instead of None, use this:
df.loc[1:, 'c'] = np.nan
The problem seems to have something to do with the type of the c column. If you convert it to type 'object', you can use iat, loc or set_value to set a cell as a list.
df2 = (
df.assign(c=np.nan)
.assign(c=lambda x: x.c.astype(object))
)
df2.set_value(0,'c',[2,3])
Out[86]:
a b c
0 1 3 [2, 3]
1 2 4 NaN

Slicing a DataFrameGroupBy object

Is there a way to slice a DataFrameGroupBy object?
For example, if I have:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})
A B
0 2 x
1 1 y
2 1 z
3 3 r
4 3 p
dfg = df.groupby('A')
Now, the returned GroupBy object is indexed by values from A, and I would like to select a subset of it, e.g. to perform aggregation. It could be something like
dfg.loc[1:2].agg(...)
or, for a specific column,
dfg['B'].loc[1:2].agg(...)
EDIT. To make it more clear: by slicing the GroupBy object I mean accessing only a subset of groups. In the above example, the GroupBy object will contain 3 groups, for A = 1, A = 2, and A = 3. For some reasons, I may only be interested in groups for A = 1 and A = 2.
It seesm you need custom function with iloc - but if use agg is necessary return aggregate value:
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[0:3]))
print (df)
A
1 y,z
2 x
3 r,p
Name: B, dtype: object
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
A
1 z
2
3 p
Name: B, dtype: object
For multiple columns:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3],
'B': ['x', 'y', 'z', 'r', 'p'],
'C': ['g', 'y', 'y', 'u', 'k']})
print (df)
A B C
0 2 x g
1 1 y y
2 1 z y
3 3 r u
4 3 p k
df = df.groupby('A').agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
B C
A
1 z y
2
3 p k
If I understand correctly, you only want some groups, but those are supposed to be returned completely:
A B
1 1 y
2 1 z
0 2 x
You can solve your problem by extracting the keys and then selecting groups based on those keys.
Assuming you already know the groups:
pd.concat([dfg.get_group(1),dfg.get_group(2)])
If you don't know the group names and are just looking for random n groups, this might work:
pd.concat([dfg.get_group(n) for n in list(dict(list(dfg)).keys())[:2]])
The output in both cases is a normal DataFrame, not a DataFrameGroupBy object, so it might be smarter to first filter your DataFrame and only aggregate afterwards:
df[df['A'].isin([1,2])].groupby('A')
The same for unknown groups:
df[df['A'].isin(list(set(df['A']))[:2])].groupby('A')
I believe there are some Stackoverflow answers refering to this, like How to access pandas groupby dataframe by key

A straightforword method to select columns by position

I am trying to clarify how can I manage pandas methods to call columns and rows in a Dataframe. An example will clarify my issue
dic = {'a': [1, 5, 2, 7], 'b': [6, 8, 4, 2], 'c': [5, 3, 2, 7]}
df = pd.DataFrame(dic, index = ['e', 'f', 'g', 'h'] )
than
df =
a b c
e 1 6 5
f 5 8 3
g 2 4 2
h 7 2 7
Now if I want to select column 'a' I just have to type
df['a']
while if I want to select row 'e' I have to use the ".loc" method
df.loc['e']
If I don't know the name of the row, but just it's position ( 0 in this case) than I can use the "iloc" method
df.iloc[0]
What looks like it is missing is a method for calling columns by position and not by name, something that is the "equivalent for columns of the 'iloc' method for rows". The only way I can find to do this is
df[df.keys()[0]]
is there something like
df.ilocColumn[0]
?
You can add : because first argument is position of selected indexes and second position of columns in function iloc:
And : means all indexes in DataFrame:
print (df.iloc[:,0])
e 1
f 5
g 2
h 7
Name: a, dtype: int64
If need select first index and first column value:
print (df.iloc[0,0])
1
Solution with ix work nice if need select index by name and column by position:
print (df.ix['e',0])
1

pandas DataFrame set value on boolean mask

I'm trying to set a number of different in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources on this specific error.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df[mask] = 30
Traceback (most recent call last):
...
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Above, I want to replace all of the True entries in the mask with the value 30.
I could do df.replace instead, but masking feels a bit more efficient and intuitive here. Can someone explain the error, and provide an efficient way to set all of the values?
You can't use the boolean mask on mixed dtypes for this unfortunately, you can use pandas where to set the values:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note: that the above will fail if you do inplace=True in the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non
np.nan value
EDIT
OK, after a little misunderstanding you can still use where y just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Example
Let's say we want to, replace values in A_1 and 'A_2' according to a mask in B_1 and B_2. For example, replace those values in A (to 999) that corresponds to nulls in B.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
df = pd.DataFrame({
'A_1': [1, 2, 3],
'A_2': [4, 5, 6],
'B_1': ['y', 'n', np.nan],
'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)
A_1 A_2
0 1 4
1 2 999
2 999 999
I'm not 100% sure but I suspect the error message relates to the fact that there is not identical treatment of missing data across different dtypes. Only float has NaN, but integers can be automatically converted to floats so it's not a problem there. But it appears mixing number dtypes and object dtypes does not work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where( mask, 30, df )
A B
0 30 30
1 2 b
2 30 f
pandas uses NaN to mark invalid or missing data and can be used across types, since your DataFrame as mixed int and string data types it will not accept the assignment to a single type (other than NaN) as this would create a mixed type (int and str) in B through an in-place assignment.
#JohnE method using np.where creates a new DataFrame in which the type of column B is an object not a string as in the initial example.

Categories

Resources