Idiomatic way to add two pandas Series objects with different indices

I have two Series objects that I would like to add:
import pandas as pd

s1 = pd.Series([1, 1], index=['a', 'b'])
s2 = pd.Series([2, 2], index=['x', 'y'])
When I add them, I get a Series with four elements, all NaN, but what I want is a Series equal to [s1.a + s2.x, s1.b + s2.y]. This seems like it should be possible, because the indices have an ordering.
I can get what I want from pd.Series(s1.values + s2.values), but I'd like to know whether there is a function that already operates on the Series objects this way and returns a Series, rather than having to drop down to NumPy.

It depends on what you want for the final index:
In [20]:
s1+s2.values
Out[20]:
a 3
b 3
dtype: int64
In [21]:
s2+s1.values
Out[21]:
x 3
y 3
dtype: int64
Or even a MultiIndex:
In [22]:
s3 = s2 + s1.values
s3.index = pd.MultiIndex.from_tuples(list(zip(s1.index, s2.index)))
s3
Out[22]:
a x 3
b y 3
dtype: int64
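If you want to stay entirely in pandas, one more option (a minimal sketch, not part of the original answer) is to drop both indices so that + aligns positionally:
In [23]:
s1.reset_index(drop=True) + s2.reset_index(drop=True)
Out[23]:
0 3
1 3
dtype: int64
reset_index(drop=True) gives each Series a fresh RangeIndex, so the two align element by element.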

Related

Preserving dtypes when extracting a row from a pandas DataFrame

Extracting a single row from a pandas DataFrame (e.g. using .loc or .iloc) yields a pandas Series. However, when dealing with heterogeneous data in the DataFrame (i.e. the DataFrame’s columns are not all the same dtype), this causes all the values from the different columns in the row to be coerced into a single dtype, because a Series can only have one dtype. Here is a simple example to show what I mean:
import numpy
import pandas
a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
df.dtypes
# a int64
# b uint64
# c float64
# dtype: object
df
# a b c
# 0 0 0 0.0
# 1 1 1 1.0
# 2 2 4 8.0
# 3 3 9 27.0
# 4 4 16 64.0
df.loc[2]
# a 2.0
# b 4.0
# c 8.0
# Name: 2, dtype: float64
All values in df.loc[2] have been converted to float64.
Is there a good way to extract a row without incurring this type conversion? I could imagine e.g. returning a numpy structured array, but I don’t see a hassle-free way of creating such an array.
As you already realized, a Series doesn't allow mixed dtypes. However, it can hold mixed data if its dtype is object. So you can convert the DataFrame's dtypes to object: every column will then have dtype object, but each value still keeps its original type (int or float).
In [10]: df1 = df.astype('O')
In [11]: df1
Out[11]:
a b c
0 0 0 0
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
In [12]: df1.loc[2].map(type)
Out[12]:
a <class 'int'>
b <class 'int'>
c <class 'float'>
Name: 2, dtype: object
Otherwise, you can convert the DataFrame to a NumPy record array with to_records:
In [21]: n_recs = df.to_records(index=False)
In [22]: n_recs
Out[22]:
rec.array([(0, 0, 0.), (1, 1, 1.), (2, 4, 8.), (3, 9, 27.),
(4, 16, 64.)],
dtype=[('a', '<i8'), ('b', '<u8'), ('c', '<f8')])
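As a quick sanity check (my addition, not part of the original answer), each field of the record array keeps its own dtype, and pulling out a single row does not coerce anything:
In [23]: n_recs['a'].dtype
Out[23]: dtype('int64')
In [24]: n_recs[2]
Out[24]: (2, 4, 8.)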
Another approach (though it feels slightly hacky):
Instead of using an integer with loc or iloc, you can use a slice of length 1. This returns a DataFrame of length 1, so .iloc[0] on each of its columns contains your data, e.g.:
In[1] : row2 = df[2:2+1]
In[2] : type(row2)
Out[2]: pandas.core.frame.DataFrame
In[3] : row2.dtypes
Out[3]:
a int64
b uint64
c float64
In[4] : a2 = row2.a.iloc[0]
In[5] : type(a2)
Out[5]: numpy.int64
In[6] : c2 = row2.c.iloc[0]
In[7] : type(c2)
Out[7]: numpy.float64
To me this feels preferable to converting the data types twice (once during row extraction, and again afterwards), and clearer than referring to the original DataFrame multiple times with the same row specification (which could be computationally expensive).
I think it would be better if pandas had a DataFrameRow type for this situation.
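On current pandas there is a simpler spelling of the same list-selector idea (a sketch, assuming the default integer positional index): passing a one-element list to iloc also returns a one-row DataFrame with dtypes intact:
In[8] : type(df.iloc[[2]])
Out[8]: pandas.core.frame.DataFrame
In[9] : df.iloc[[2]].dtypes
Out[9]:
a int64
b uint64
c float64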

pandas dividing a column by lagged values

I'm trying to divide a Pandas DataFrame column by a lagged value, which is 1 in this example.
First, create the DataFrame. This example has only one column, even though my real data has dozens:
dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})
When I try this vector division (I'm a Python newbie coming from R):
dTest.ix[range(1,4),'Open'] / dTest.ix[range(0,3),'Open']
I get this output:
0 NaN
1 1.0
2 1.0
3 NaN
But I'm expecting:
1.0004327915052085
1.0013682367854484
0.9988446159101413
There's clearly something that I don't understand about the data structure. I'm expecting 3 values but it's outputting 4. What am I missing?
What you tried fails because the two sliced ranges only overlap on the middle two rows, and pandas aligns operands by index: each overlapping value is divided by itself, which is why you got 1s. (Note that .ix is deprecated in modern pandas; .loc and .iloc are the replacements.) You should use shift to shift the rows and achieve what you want:
In [166]:
dTest['Open'] / dTest['Open'].shift()
Out[166]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can also use div:
In [159]:
dTest['Open'].div(dTest['Open'].shift(), axis=0)
Out[159]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can see that the indices are different when you slice, so when using / only the common index labels produce values:
In [164]:
dTest.ix[range(0,3),'Open']
Out[164]:
0 0.99355
1 0.99398
2 0.99534
Name: Open, dtype: float64
In [165]:
dTest.ix[range(1,4),'Open']
Out[165]:
1 0.99398
2 0.99534
3 0.99419
Name: Open, dtype: float64
and their index intersection contains only the middle rows:
In [168]:
dTest.ix[range(0,3),'Open'].index.intersection(dTest.ix[range(1,4),'Open'].index)
Out[168]:
Int64Index([1, 2], dtype='int64')
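If the end goal is a ratio against the previous row, a related shortcut (my sketch, not from the original answer) is pct_change, which computes x / x.shift() - 1, so adding 1 back recovers the same ratios:
In [169]:
dTest['Open'].pct_change() + 1
Out[169]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64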

Store array of string values in column in pandas? [duplicate]

This question already has answers here:
Python pandas insert list into a cell
(9 answers)
Closed 6 years ago.
I have a pandas dataframe. I have a column that could potentially have null values or an array of string values in it. But I'm having trouble working out how to store values in this column.
This is my code now:
df_completed = df[df.completed]
df['links'] = None
for i, row in df_completed.iterrows():
    results = get_links(row['nct_id'])
    if results:
        df[df.nct_id == row['nct_id']].links = results
        print df[df.nct_id == row['nct_id']].links
But this has two problems:
1. When results is an array of length 1, the printed output is None rather than the array, so I think I must be saving the value incorrectly.
2. When results is a longer array, the line where I save the value raises an error: ValueError: Length of values does not match length of index.
What am I doing wrong?
I am not sure it's advisable to try to store arrays in pandas like this; have you considered serialising the array contents and then storing the result?
If storing an array is what you're after anyway, then you can try the set_value() method, like this (make sure you take care of the dtype of the target column, links in your case):
In [35]: df = pd.DataFrame(data=np.random.rand(5,5), columns=list('ABCDE'))
In [36]: df
Out[36]:
A B C D E
0 0.741268 0.482689 0.742200 0.210650 0.351758
1 0.798070 0.929576 0.522227 0.280713 0.168999
2 0.413417 0.481230 0.304180 0.894934 0.327243
3 0.797061 0.561387 0.247033 0.330608 0.294618
4 0.494038 0.065731 0.538588 0.095435 0.397751
In [38]: df.dtypes
Out[38]:
A float64
B float64
C float64
D float64
E float64
dtype: object
In [39]: df.A = df.A.astype(object)
In [40]: df.dtypes
Out[40]:
A object
B float64
C float64
D float64
E float64
dtype: object
In [41]: df.set_value(0, 'A', ['some','values','here'])
Out[41]:
A B C D E
0 [some, values, here] 0.482689 0.742200 0.210650 0.351758
1 0.79807 0.929576 0.522227 0.280713 0.168999
2 0.413417 0.481230 0.304180 0.894934 0.327243
3 0.797061 0.561387 0.247033 0.330608 0.294618
4 0.494038 0.065731 0.538588 0.095435 0.397751
I hope this helps!
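One caveat to add here: set_value was deprecated in pandas 0.21 and removed in 1.0. On modern pandas the equivalent single-cell assignment (same technique, newer spelling) is the .at indexer:
df.at[0, 'A'] = ['some', 'values', 'here']  # column 'A' must already have object dtype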

How to sort a Series or DataFrame by a given index order?

Suppose I have a Series like this:
In [19]: sr
Out[19]:
a 1
b 2
c 3
d 4
dtype: int64
In [20]: sr.index
Out[20]: Index([u'a', u'b', u'c', u'd'], dtype='object')
Instead of sorting lexicographically, I would like to sort this series based on a custom order, say, cdab. How can I do that?
What if it is a DataFrame; how can I sort it by a given index list?
You can do this in a number of different ways. For Series objects, you can simply pass your preferred order of the index labels, like this:
>>> sr[['c','d','a','b']]
c 3
d 4
a 1
b 2
dtype: int64
Alternatively, both Series and DataFrame objects have a reindex method. This gives you more flexibility when reordering the index. For instance, you can insert new labels into the index (and even choose what fill value they should get):
>>> sr.reindex(['c','d','a','b','e'])
c 3
d 4
a 1
b 2
e NaN # <-- new index location 'e' is filled with NaN
dtype: int64
Yet another option for both Series and DataFrame objects is the ever-useful loc indexer for accessing index labels:
>>> sr.loc[['c','d','a','b']]
c 3
d 4
a 1
b 2
dtype: int64
Just use reindex, for example:
In [3]: sr.reindex(['c', 'd', 'a', 'b'])
Out[3]:
c 3
d 4
a 1
b 2
dtype: int64
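For the DataFrame half of the question, the same tools work row-wise; a minimal sketch with a made-up frame:
>>> df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])
>>> df.reindex(['c', 'd', 'a', 'b'])
x
c 3
d 4
a 1
b 2
>>> df.loc[['c', 'd', 'a', 'b']]  # same result via loc
reindex also accepts columns= if you want to reorder columns the same way.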

how to preserve pandas dataframe identity when extracting a single row

I am extracting a subset of my dataframe by index using either .xs or .loc (they seem to behave the same). When my condition retrieves multiple rows, the result stays a dataframe. When only a single row is retrieved, it is automatically converted to a series. I don't want that behavior, since that means I need to handle multiple cases downstream (different method sets available for series vs dataframe).
In [1]: df = pd.DataFrame({'a': range(7), 'b': ['one']*4 + ['two'] + ['three']*2,
   ...:                    'c': range(10,17)})
In [2]: df.set_index('b', inplace=True)
In [3]: df.xs('one')
Out[3]:
a c
b
one 0 10
one 1 11
one 2 12
one 3 13
In [4]: df.xs('two')
Out[4]:
a 4
c 14
Name: two, dtype: int64
In [5]: type(df.xs('two'))
Out [5]: pandas.core.series.Series
I can manually convert that series back to a dataframe, but it seems cumbersome and will also require case testing to see if I should do that. Is there a cleaner way to just get a dataframe back to begin with?
IIUC, you can simply add brackets, [], and use .loc:
>>> df.loc["two"]
a 4
c 14
Name: two, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
>>> df.loc[["two"]]
a c
b
two 4 14
[1 rows x 2 columns]
>>> type(_)
<class 'pandas.core.frame.DataFrame'>
This may remind you of how numpy advanced indexing works:
>>> a = np.arange(9).reshape(3,3)
>>> a[1]
array([3, 4, 5])
>>> a[[1]]
array([[3, 4, 5]])
Now this will probably require some refactoring of code so that you're always accessing with a list, even if the list only has one element, but it works well for me in practice.
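One more option worth mentioning (my addition, using the same df): xs accepts drop_level=False, which keeps the index level and therefore returns a one-row DataFrame even for a single label:
>>> df.xs('two', drop_level=False)
a c
b
two 4 14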
