How to preserve pandas DataFrame identity when extracting a single row - Python

I am extracting a subset of my dataframe by index using either .xs or .loc (they seem to behave the same). When my condition retrieves multiple rows, the result stays a dataframe. When only a single row is retrieved, it is automatically converted to a series. I don't want that behavior, since that means I need to handle multiple cases downstream (different method sets available for series vs dataframe).
In [1]: df = pd.DataFrame({'a':range(7), 'b':['one']*4 + ['two'] + ['three']*2,
   ...: 'c':range(10,17)})
In [2]: df.set_index('b', inplace=True)
In [3]: df.xs('one')
Out[3]:
     a   c
b
one  0  10
one  1  11
one  2  12
one  3  13
In [4]: df.xs('two')
Out[4]:
a 4
c 14
Name: two, dtype: int64
In [5]: type(df.xs('two'))
Out[5]: pandas.core.series.Series
I can manually convert that series back to a dataframe, but it seems cumbersome and will also require case testing to see if I should do that. Is there a cleaner way to just get a dataframe back to begin with?

IIUC, you can simply add brackets, [], and use .loc:
>>> df.loc["two"]
a 4
c 14
Name: two, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
>>> df.loc[["two"]]
     a   c
b
two  4  14

[1 rows x 2 columns]
>>> type(_)
<class 'pandas.core.frame.DataFrame'>
This may remind you of how numpy advanced indexing works:
>>> a = np.arange(9).reshape(3,3)
>>> a[1]
array([3, 4, 5])
>>> a[[1]]
array([[3, 4, 5]])
Now this will probably require some refactoring of code so that you're always accessing with a list, even if the list only has one element, but it works well for me in practice.
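To make that pattern hard to forget, you can hide the list inside a small helper so every lookup comes back 2-dimensional. A minimal sketch (the helper name select_rows is my own, not part of pandas):
import pandas as pd

df = pd.DataFrame({'a': range(7),
                   'b': ['one']*4 + ['two'] + ['three']*2,
                   'c': range(10, 17)}).set_index('b')

def select_rows(frame, key):
    # Wrapping the key in a list keeps the result a DataFrame,
    # even when it matches exactly one row.
    return frame.loc[[key]]

type(select_rows(df, 'two'))   # pandas.core.frame.DataFrame
type(select_rows(df, 'one'))   # pandas.core.frame.DataFrame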

Related

Getting first element of series from pandas .value_counts() [duplicate]

I am new to Python. This seems like a basic question to ask, but I really want to understand what is happening here.
import numpy as np
import pandas as pd
tempdata = np.random.random(5)
myseries_one = pd.Series(tempdata)
myseries_two = pd.Series(data = tempdata, index = ['a','b','c','d','e'])
myseries_three = pd.Series(data = tempdata, index = [10,11,12,13,14])
myseries_one
Out[1]:
0 0.291293
1 0.381014
2 0.923360
3 0.271671
4 0.605989
dtype: float64
myseries_two
Out[2]:
a 0.291293
b 0.381014
c 0.923360
d 0.271671
e 0.605989
dtype: float64
myseries_three
Out[3]:
10 0.291293
11 0.381014
12 0.923360
13 0.271671
14 0.605989
dtype: float64
Indexing the first element of each Series
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_three[0]
KeyError:0
Doubt 1: Why is this happening? Why does myseries_three[0] give me a KeyError?
What do we mean by calling myseries_one[0], myseries_two[0], or myseries_three[0]? Does calling this way mean we are indexing by row names?
Doubt 2: Do row names and row numbers work differently in Python than they do in R?
myseries_one[0:2]
Out[78]:
0 0.291293
1 0.381014
dtype: float64
myseries_two[0:2]
Out[79]:
a 0.291293
b 0.381014
dtype: float64
myseries_three[0:2]
Out[80]:
10 0.291293
11 0.381014
dtype: float64
Doubt 3: If calling myseries_three[0] means calling by row name, then how does myseries_three[0:2] produce output? Does myseries_three[0:2] mean we are calling by row number? Please explain and guide. I am migrating from R to Python, so it's a bit confusing for me.
When you are attempting to slice with myseries[something], the something is often ambiguous. You are highlighting a case of that ambiguity. In your case, pandas is trying to help you out by guessing what you mean.
myseries_one[0] #As expected
Out[74]: 0.29129291112626043
myseries_one has integer labels. It makes sense that when you index with an integer, you intend to get the element labeled with that integer. It turns out that you have an element labeled 0, and so that is returned to you.
myseries_two[0] #As expected
Out[75]: 0.29129291112626043
myseries_two has string labels. It's highly unlikely that you meant to slice this series with a label of 0 when the labels are all strings. So pandas assumes that you meant a position of 0 and returns the first element (thanks pandas, that was helpful).
myseries_three[0]
KeyError:0
myseries_three has integer labels and you are attempting to slice with an integer... perfect. Let's just get that value for you... KeyError. Whoops, that index label does not exist. In this case, it is safer for pandas to fail than to guess that maybe you meant to slice by position. The documentation even suggests that if you want to remove the ambiguity, you should use loc for label-based slicing and iloc for position-based slicing.
Let's try loc
myseries_one.loc[0]
0.29129291112626043
myseries_two.loc[0]
KeyError:0
myseries_three.loc[0]
KeyError:0
Only myseries_one has a label 0. The other two raise KeyErrors.
Let's try iloc
myseries_one.iloc[0]
0.29129291112626043
myseries_two.iloc[0]
0.29129291112626043
myseries_three.iloc[0]
0.29129291112626043
They all have a position of 0 and return the first element accordingly.
For range slicing, pandas decides to be less interpretive and sticks to positional slicing for the integer slice 0:2. Keep in mind that actual people (the programmers writing pandas code) are the ones making these decisions. When you attempt to do something ambiguous, you may get varying results. To remove the ambiguity, use loc and iloc.
iloc
myseries_one.iloc[0:2]
0 0.291293
1 0.381014
dtype: float64
myseries_two.iloc[0:2]
a 0.291293
b 0.381014
dtype: float64
myseries_three.iloc[0:2]
10 0.291293
11 0.381014
dtype: float64
loc
myseries_one.loc[0:2]
0 0.291293
1 0.381014
2 0.923360
dtype: float64
myseries_two.loc[0:2]
TypeError: cannot do slice indexing on <class 'pandas.indexes.base.Index'> with these indexers [0] of <type 'int'>
myseries_three.loc[0:2]
Series([], dtype: float64)
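A footnote on those last outputs: .loc slices by label and includes both endpoints, which is why myseries_one.loc[0:2] returned three elements, while myseries_three.loc[0:2] returned an empty Series (it has no labels between 0 and 2). A minimal illustration:
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=[10, 11, 12, 13, 14])

s.loc[10:12]    # labels 10, 11 and 12 -- both endpoints included
s.iloc[0:2]     # positions 0 and 1 -- stop excluded, like plain Python slicing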

Uncomfortable output of mode() in pandas DataFrame

I have a dataframe with several columns (the features).
>>> print(df)
   col1  col2
a     1     1
b     2     2
c     3     3
d     3     2
I would like to compute the mode of one of them. This is what happens:
>>> print(df['col1'].mode())
0 3
dtype: int64
I would like to output simply the value 3.
This behaviour is quite strange if you consider that the following, very similar code works:
>>> print(df['col1'].mean())
2.25
So two questions: why does this happen? How can I obtain the pure mode value as it happens for the mean?
Because Series.mode() can return multiple values. Consider the following DataFrame:
In [77]: df
Out[77]:
   col1  col2
a     1     1
b     2     2
c     3     3
d     3     2
e     2     3
In [78]: df['col1'].mode()
Out[78]:
0 2
1 3
dtype: int64
From docstring:
Empty if nothing occurs at least 2 times. Always returns Series
even if only one value.
If you want to choose the first value:
In [83]: df['col1'].mode().iloc[0]
Out[83]: 2
In [84]: df['col1'].mode()[0]
Out[84]: 2
I agree that it's too cumbersome. Note that .iloc[0] already returns a scalar, so df['col1'].mode().iloc[0] is as short as it gets.
A series can have only one mean(), but it can have more than one mode(). For example, [2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7, 8] has modes 2, 3, and 4, so the output must be indexed.
mode() will return all values that tie for the most frequent value.
In order to support that functionality, it must return a collection, which takes the form of a DataFrame or Series.
For example, if you had a series:
[2, 2, 3, 3, 5, 5, 6]
Then the most frequent values occur twice. The result would then be the series [2, 3, 5], since each of these occurs twice.
If you want to collapse this into a single value, you can access the first value, compute the max(), min(), or whatever makes most sense for your application.
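To make that concrete, here is the round trip for the tied series above; which scalar you keep is an application decision:
import pandas as pd

s = pd.Series([2, 2, 3, 3, 5, 5, 6])

modes = s.mode()     # the Series [2, 3, 5] -- each value occurs twice
modes.iloc[0]        # 2, the first (smallest) mode
modes.max()          # 5, the largest mode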

Issues comparing lists in a pandas DataFrame

I have a DataFrame in pandas where one of the columns contains lists of int, like so:
df = pandas.DataFrame([[1,2,3,[4,5]],[6,7,8,[9,10]]], columns=['a','b','c','d'])
>>> df
   a  b  c        d
0  1  2  3   [4, 5]
1  6  7  8  [9, 10]
I'd like to build a filter using d, but the normal comparison operations don't seem to work:
>>> df['d'] == [4,5]
0 False
1 False
Name: d, dtype: bool
However when I inspect row by row, I get what I would expect
>>> df.loc[0,'d'] == [4,5]
True
What's going on here? How can I do list comparisons?
It is a curious issue; it probably has to do with the fact that lists are not hashable.
I would go for apply:
df['d'].apply(lambda x: x == [4,5])
Of course, as suggested by DSM, the following works if you store tuples instead of lists:
df = pd.DataFrame([[1,2,3,(4,5)],[6,7,8,(9,10)]], columns=['a','b','c','d'])
df['d'] == (4,5)
Another solution is to use a list comprehension:
df[[v == [4, 5] for v in df['d']]]
As an alternative, if you wish to keep your "series of lists" structure, you can convert your series to tuples for comparison purposes only. This is possible via pd.Series.apply:
>>> df['d'].apply(tuple) == (4, 5)
0 True
1 False
Name: d, dtype: bool
However, note that none of the options available for a series of lists are vectorised. You are advised to split your data into numeric series before performing comparisons.
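As a sketch of that last suggestion (the d0/d1 column names here are arbitrary), you can expand the list column into ordinary numeric columns, after which the comparison is vectorised:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, [4, 5]], [6, 7, 8, [9, 10]]],
                  columns=['a', 'b', 'c', 'd'])

# Expand each list into its own numeric column.
expanded = pd.DataFrame(df['d'].tolist(), columns=['d0', 'd1'], index=df.index)

# An ordinary vectorised comparison on the expanded columns.
mask = (expanded['d0'] == 4) & (expanded['d1'] == 5)
df[mask]    # the rows of the original frame where d == [4, 5]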

Idiomatic way to add two pandas Series objects with different indices

I have two Series objects that I would like to add:
s1 = Series([1, 1], index=['a', 'b'])
s2 = Series([2, 2], index=['x', 'y'])
When I add them, I get a Series with 4 elements with NaN values, but what I want is a Series that is [s1.a + s2.x, s1.b + s2.y]. This seems like it should be possible, because the indices have an ordering.
I can get what I want from pd.Series(s1.values + s2.values), but I'd like to know if there is a function that already operates on the Series objects this way and returns a series, rather than having to go down to numpy.
It depends on what you want for the final index:
In [20]:
s1+s2.values
Out[20]:
a 3
b 3
dtype: int64
In [21]:
s2+s1.values
Out[21]:
x 3
y 3
dtype: int64
Or even a MultiIndex:
In [22]:
s3 = s2 + s1.values
s3.index = pd.MultiIndex.from_tuples(list(zip(s1.index, s2.index)))
s3
Out[22]:
a x 3
b y 3
dtype: int64
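If you do not care which index survives, another way to stay inside pandas is to drop both indices so that alignment falls back to position. A minimal sketch:
import pandas as pd

s1 = pd.Series([1, 1], index=['a', 'b'])
s2 = pd.Series([2, 2], index=['x', 'y'])

# reset_index(drop=True) replaces each index with 0..n-1,
# so the two Series align element by element.
s1.reset_index(drop=True) + s2.reset_index(drop=True)
# 0    3
# 1    3
# dtype: int64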

pandas DataFrame combine_first and update methods have strange behavior

I'm running into a strange issue (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 if the supplied argument does not include the boolean columns.
Example workflow in ipython:
In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])
In [145]: test
Out[145]:
   a  b isBool isBool2
0  1  2  False    True
1  4  5   True   False
In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])
In [148]: b
Out[148]:
    a   b
0  45  45
In [149]: test.update(b)
In [150]: test
Out[150]:
    a   b isBool isBool2
0  45  45      0       1
1   4   5      1       0
Was this meant to be the behavior of the update function? I would think that if nothing was specified, update wouldn't mess with the other columns.
EDIT: I started tinkering around a little more. The plot thickens. If I insert one more command, test.update([]), before running test.update(b), the boolean behavior works, at the cost of the numbers being upcast to object. This also applies to DSM's simplified example.
Based on pandas' source code, it looks like reindex_like on the empty list creates a DataFrame of dtype object, while reindex_like on b creates a DataFrame of dtype float64. Since object is more general, subsequent operations work with bools. Unfortunately, running np.log on the numeric columns will then fail with an AttributeError.
This is a bug; update shouldn't touch unspecified columns. Fixed here: https://github.com/pydata/pandas/pull/3021
Before updating, the dataframe b is filled by reindex_like, so that b becomes
In [5]: b.reindex_like(test)
Out[5]:
    a   b isBool isBool2
0  45  45    NaN     NaN
1 NaN NaN    NaN     NaN
update then uses numpy.where to fill in the data frame.
The catch is that when the two arguments of numpy.where have different types, the more general type is used. For example:
In [20]: np.where(True, [True], [0])
Out[20]: array([1])
In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])
Since NaN in numpy is a floating type, it will also return a floating type.
In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])
Therefore, after updating, your 'isBool' and 'isBool2' columns become floating point.
I've added this issue to the pandas issue tracker.
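Until you are on a version with the fix, one workaround is to record which columns are boolean and cast them back after the update. A sketch, assuming the update introduces no NaNs into those columns:
import pandas as pd

test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

bool_cols = test.select_dtypes(include='bool').columns   # the columns to protect
test.update(b)                                           # may upcast them to float
test[bool_cols] = test[bool_cols].astype(bool)           # restore the bool dtype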
