I have a pandas DataFrame with a column that can hold either null values or an array of string values, but I'm having trouble working out how to store values in this column.
This is my code now:
df_completed = df[df.completed]
df['links'] = None
for i, row in df_completed.iterrows():
    results = get_links(row['nct_id'])
    if results:
        df[df.nct_id == row['nct_id']].links = results
        print df[df.nct_id == row['nct_id']].links
But this has two problems:
When results is an array of length 1, the printed output is None, rather than the array, so I think I must be saving the value wrong
When results is a longer array, the line where I save the value produces an error: ValueError: Length of values does not match length of index
What am I doing wrong?
I am not sure it's advisable to try to store arrays in pandas like this; have you considered serialising the array contents and storing the result instead?
If storing an array is what you're after anyway, you can try the set_value() method, like this (make sure you take care of the dtype of the target column first):
In [35]: df = pd.DataFrame(data=np.random.rand(5,5), columns=list('ABCDE'))
In [36]: df
Out[36]:
A B C D E
0 0.741268 0.482689 0.742200 0.210650 0.351758
1 0.798070 0.929576 0.522227 0.280713 0.168999
2 0.413417 0.481230 0.304180 0.894934 0.327243
3 0.797061 0.561387 0.247033 0.330608 0.294618
4 0.494038 0.065731 0.538588 0.095435 0.397751
In [38]: df.dtypes
Out[38]:
A float64
B float64
C float64
D float64
E float64
dtype: object
In [39]: df.A = df.A.astype(object)
In [40]: df.dtypes
Out[40]:
A object
B float64
C float64
D float64
E float64
dtype: object
In [41]: df.set_value(0, 'A', ['some','values','here'])
Out[41]:
A B C D E
0 [some, values, here] 0.482689 0.742200 0.210650 0.351758
1 0.79807 0.929576 0.522227 0.280713 0.168999
2 0.413417 0.481230 0.304180 0.894934 0.327243
3 0.797061 0.561387 0.247033 0.330608 0.294618
4 0.494038 0.065731 0.538588 0.095435 0.397751
I hope this helps!
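One caveat worth adding: set_value is deprecated in recent pandas (and was removed in pandas 1.0). On modern versions the equivalent single-cell write is the .at accessor. A minimal sketch, assuming the same df as above with column A already cast to object:

# .at is label-based single-cell access, equivalent to the set_value call above
df.at[0, 'A'] = ['some', 'values', 'here']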
Related
Extracting a single row from a pandas DataFrame (e.g. using .loc or .iloc) yields a pandas Series. However, when dealing with heterogeneous data in the DataFrame (i.e. the DataFrame’s columns are not all the same dtype), this causes all the values from the different columns in the row to be coerced into a single dtype, because a Series can only have one dtype. Here is a simple example to show what I mean:
import numpy
import pandas
a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
df.dtypes
# a int64
# b uint64
# c float64
# dtype: object
df
# a b c
# 0 0 0 0.0
# 1 1 1 1.0
# 2 2 4 8.0
# 3 3 9 27.0
# 4 4 16 64.0
df.loc[2]
# a 2.0
# b 4.0
# c 8.0
# Name: 2, dtype: float64
All values in df.loc[2] have been converted to float64.
Is there a good way to extract a row without incurring this type conversion? I could imagine e.g. returning a numpy structured array, but I don’t see a hassle-free way of creating such an array.
As you already realized, a Series doesn't allow mixed dtypes. However, it can hold mixed data if you give it dtype object. So you can convert the DataFrame's dtypes to object: every column will have dtype object, but every value still keeps its underlying type (int or float).
In [10]: df1 = df.astype('O')

In [11]: df1
Out[11]:
a b c
0 0 0 0
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
In [12]: df1.loc[2].map(type)
Out[12]:
a <class 'int'>
b <class 'int'>
c <class 'float'>
Name: 2, dtype: object
Otherwise, you can convert the DataFrame to an np.recarray, which preserves a separate dtype per column:
In [21]: n_recs = df.to_records(index=False)

In [22]: n_recs
Out[22]:
rec.array([(0, 0, 0.), (1, 1, 1.), (2, 4, 8.), (3, 9, 27.),
(4, 16, 64.)],
dtype=[('a', '<i8'), ('b', '<u8'), ('c', '<f8')])
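A single record pulled out of the recarray keeps each field's dtype. A small sketch, assuming the n_recs from above:

rec = n_recs[2]           # the record (2, 4, 8.)
print(type(rec['a']))     # <class 'numpy.int64'>
print(type(rec['c']))     # <class 'numpy.float64'>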
Another approach (but it feels slightly hacky):
Instead of using an integer with loc or iloc, you can use a slice of length 1. This returns a DataFrame of length 1, so iloc[0] contains your data, e.g.:

In [1]: row2 = df[2:2+1]

In [2]: type(row2)
Out[2]: pandas.core.frame.DataFrame

In [3]: row2.dtypes
Out[3]:
a      int64
b     uint64
c    float64
dtype: object

In [4]: a2 = row2.a.iloc[0]

In [5]: type(a2)
Out[5]: numpy.int64

In [6]: c2 = row2.c.iloc[0]

In [7]: type(c2)
Out[7]: numpy.float64
To me this feels preferable to converting the data types twice (once during row extraction, and again afterwards), and clearer than referring to the original DataFrame multiple times with the same row specification (which could be computationally expensive).
I think it would be better if pandas had a DataFrameRow type for this situation.
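A closely related trick, sketched here: passing a one-element list to .iloc also returns a one-row DataFrame, with per-column dtypes intact:

row2 = df.iloc[[2]]        # list indexer -> DataFrame, not Series
a2 = row2['a'].iloc[0]     # numpy.int64, no coercion to float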
I have a DataFrame, and I want to create a new column and store an array in each row of that column. I know that to do this I have to change the dtype of the column to 'object'. I tried the following, but it doesn't work:
import pandas
import numpy as np
df = pandas.DataFrame({'a':[1,2,3,4]})
df['b'] = np.nan
df['b'] = df['b'].astype(object)
df.loc[0,'b'] = [[1,2,4,5]]
The error is
ValueError: Must have equal len keys and value when setting with an ndarray
However, it works if I convert the datatype of the whole dataframe into 'object':
df = pandas.DataFrame({'a':[1,2,3,4]})
df['b'] = np.nan
df = df.astype(object)
df.loc[0,'b'] = [[1,2,4,5]]
So my question is: why do I have to change the datatype of whole DataFrame?
Try this:
In [12]: df.at[0,'b'] = [1,2,4,5]
In [13]: df
Out[13]:
a b
0 1 [1, 2, 4, 5]
1 2 NaN
2 3 NaN
3 4 NaN
PS: be aware that as soon as you put a non-scalar value in any cell, the corresponding column's dtype is changed to object so that it can hold non-scalar values:
In [14]: df.dtypes
Out[14]:
a int64
b object
dtype: object
PPS: generally it's a bad idea to store non-scalar values in cells, because the vast majority of Pandas/NumPy methods will not work properly with such data.
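As for the "why" part of the question: my understanding, sketched below, is that .at is strictly single-cell access and never tries to align the right-hand side, while .loc interprets a list-like value as an array to align with the selection, which is where the length-mismatch error comes from:

df.at[0, 'b'] = [1, 2, 4, 5]    # stored as one Python object, no alignment
# df.loc[0, 'b'] = [1, 2, 4, 5] # .loc may try to align the list and raise,
#                               # depending on pandas version and column dtype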
This question is related to how to check the dtype of a column in python pandas.
An empty pandas dataframe is created. Following this, it's filled with data. How can I then check if any of its columns contain complex types?
index = [np.array(['foo', 'qux'])]
columns = ["A", "B"]
df = pd.DataFrame(index=index, columns=columns)
df.loc['foo']["A"] = 1 + 1j
df.loc['foo']["B"] = 1
df.loc['qux']["A"] = 2
df.loc['qux']["B"] = 2
print df
for type in df.dtypes:
    if type == complex:
        print type
At the moment, I get the type as object which isn't useful.
A B
foo (1+1j) 1
qux 2 2
Consider the Series s
s = pd.Series([1, 3.4, 2 + 1j], dtype=np.object)
s
0 1
1 3.4
2 (2+1j)
dtype: object
If I use pd.to_numeric, it will upcast the dtype to complex if any are complex
pd.to_numeric(s).dtype
dtype('complex128')
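Applied to the question's DataFrame, a minimal sketch of the column check (assuming the df built above, and that every object column is numeric so to_numeric won't raise):

import numpy as np
import pandas as pd

complex_cols = []
for col in df.columns:
    # to_numeric infers the tightest numeric dtype; any complex value
    # upcasts the whole column to complex128
    if pd.to_numeric(df[col]).dtype == np.complex128:
        complex_cols.append(col)
print(complex_cols)   # ['A']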
I'm trying to use assign to create a new column in a pandas DataFrame. I need to use something like str.format to have the new column be pieces of existing columns. For instance...
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 3))
gives me...
0 1 2
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
An assign for a totally new column works:
# works
print(df.assign(c = "a"))
0 1 2 c
0 -0.738703 -1.027115 1.129253 a
1 0.674314 0.525223 -0.371896 a
2 1.021304 0.169181 -0.884293 a
But if I want to build the new column from an existing column, it seems like pandas is putting the string representation of the whole existing column into every row of the new column.
# doesn't work
print(df.assign(c = "a{}b".format(df[0])))
0 1 2 \
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
c
0 a0 -0.738703\n1 0.674314\n2 1.021304\n...
1 a0 -0.738703\n1 0.674314\n2 1.021304\n...
2 a0 -0.738703\n1 0.674314\n2 1.021304\n...
Thanks for the help.
In [131]: df.assign(c="a"+df[0].astype(str)+"b")
Out[131]:
0 1 2 c
0 0.833556 -0.106183 -0.910005 a0.833556419295b
1 -1.487825 1.173338 1.650466 a-1.48782514804b
2 -0.836795 -1.192674 -0.212900 a-0.836795026809b
'a{}b'.format(df[0]) is a str. "a"+df[0].astype(str)+"b" is a Series.
In [142]: type(df[0].astype(str))
Out[142]: pandas.core.series.Series
In [143]: type('{}'.format(df[0]))
Out[143]: str
When you assign a single string to the column c, that string is repeated for every row in df.
Thus, df.assign(c = "a{}b".format(df[0])) assigns the string 'a{}b'.format(df[0])
to each row of df:
In [138]: 'a{}b'.format(df[0])
Out[138]: 'a0 0.833556\n1 -1.487825\n2 -0.836795\nName: 0, dtype: float64b'
It is really no different than what happened with df.assign(c = "a").
In contrast, when you assign a Series to the column c, then the index of the Series is aligned with the index of df and the corresponding values are assigned to df['c'].
Under the hood, the Series.__add__ method is defined so that adding a string to a Series of strings returns a new Series with the string concatenated to each of the Series' values:
In [149]: "a"+df[0].astype(str)
Out[149]:
0 a0.833556419295
1 a-1.48782514804
2 a-0.836795026809
Name: 0, dtype: object
(The astype method was called to convert the floats in df[0] into strings.)
df['c'] = "a" + df[0].astype(str) + 'b'
df
0 1 2 c
0 -1.134154 -0.367397 0.906239 a-1.13415403091b
1 0.551997 -0.160217 -0.869291 a0.551996920472b
2 0.490102 -1.151301 0.541888 a0.490101854737b
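If you specifically want str.format semantics (e.g. for padding or precision), one alternative, sketched here, is to map the bound format method over the column element-wise:

df['c'] = df[0].map('a{}b'.format)   # calls 'a{}b'.format(x) for each value x
# with formatting control: df[0].map('a{:.3f}b'.format)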
I have two Series objects that I would like to add:
from pandas import Series

s1 = Series([1, 1], index=['a', 'b'])
s2 = Series([2, 2], index=['x', 'y'])
When I add them, I get a Series with 4 elements with NaN values, but what I want is a Series that is [s1.a + s2.x, s1.b + s2.y]. This seems like it should be possible, because the indices have an ordering.
I can get what I want from pd.Series(s1.values + s2.values), but I'd like to know if there is a function that already operates on the Series objects this way and returns a series, rather than having to go down to numpy.
Depends on what you want for the final index:
In [20]:
s1+s2.values
Out[20]:
a 3
b 3
dtype: int64
In [21]:
s2+s1.values
Out[21]:
x 3
y 3
dtype: int64
Or even a MultiIndex:
In [22]:
s3=s2+s1.values
s3.index = pd.MultiIndex.from_tuples(list(zip(s1.index, s2.index)))
s3
Out[22]:
a x 3
b y 3
dtype: int64