I'm using Pandas and NumPy on Python 3 with the following versions:
Python 3.5.1 (via Anaconda 2.5.0), 64-bit
Pandas 0.19.1
Numpy 1.11.2 (probably not relevant here)
Here is the minimal code producing the problem:
import pandas as pd
import numpy as np
a = pd.DataFrame({'i' : [1,1,1,1,1], 'a': [1,2,5,6,100], 'b': [2, 4,10, np.nan, np.nan]})
a.set_index(keys='a', inplace=True)
v = a.groupby(level=0).apply(lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())
v.index.names
This code is a simple groupby-apply, but I don't understand the outcome:
FrozenList(['a', 'a'])
For some reason, the index of the result is ['a', 'a'], which seems like a questionable choice on pandas' part. I would have expected simply ['a'].
Does anyone have some idea about why Pandas chooses to duplicate the column in the index?
Thanks in advance.
This is happening because sort_values returns a DataFrame or Series, so its index gets concatenated with the existing groupby index. The same thing happens if you call shift on the 'b' column:
In [99]:
v = a.groupby(level=0).apply(lambda x: x['b'].shift())
v
Out[99]:
a a
1 1 NaN
2 2 NaN
5 5 NaN
6 6 NaN
100 100 NaN
Name: b, dtype: float64
Even with as_index=False it still produces a MultiIndex:
In [102]:
v = a.groupby(level=0, as_index=False).apply(lambda x: x['b'].shift())
v
Out[102]:
a
0 1 NaN
1 2 NaN
2 5 NaN
3 6 NaN
4 100 NaN
Name: b, dtype: float64
If the lambda returns a plain scalar value, then no duplicated index is created:
In [104]:
v = a.groupby(level=0).apply(lambda x: x['b'].max())
v
Out[104]:
a
1 2.0
2 4.0
5 10.0
6 NaN
100 NaN
dtype: float64
I don't think this is a bug; rather, it is a semantic detail to be aware of: some methods return an object whose index is aligned with the pre-existing index.
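If the goal is just a single 'a' level in the result, one possible workaround (a sketch on the same data as above, not necessarily the only way) is to pass group_keys=False so that the group labels are not prepended to the result's index:

import pandas as pd
import numpy as np

a = pd.DataFrame({'i': [1, 1, 1, 1, 1],
                  'a': [1, 2, 5, 6, 100],
                  'b': [2, 4, 10, np.nan, np.nan]})
a.set_index(keys='a', inplace=True)

# group_keys=False keeps the index returned by the applied function
# instead of stacking the group keys on top of it.
v = a.groupby(level=0, group_keys=False).apply(
    lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())
print(v.index.names)  # expected: FrozenList(['a'])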
Related
I have the following series
s = pd.Series(['hey', 'hey', 2, 2.14], index=[1, 2, 3, 4])
I basically want to mask the series, check whether each value is a str, and if so replace it with np.nan. How could I achieve that?
Wanted result
s = pd.Series([np.nan, np.nan, 2, 2.14], index=[1, 2, 3, 4])
I tried this
s.mask(isinstance(s,str))
But I got the following ValueError: Array conditional must be same shape as self. I am kind of a newbie when it comes to these methods and would appreciate an explanation of why this happens.
You can use
out = s.mask(s.apply(type).eq(str))
print(out)
1 NaN
2 NaN
3 2
4 2.14
dtype: object
If you are set on using mask, you could try:
s = pd.Series(['hey','hey',2,2.14],index=[1,2,3,4])
s.mask(s.apply(isinstance,args = [str]))
print(s)
1 NaN
2 NaN
3 2
4 2.14
dtype: object
But as you can see, many roads lead to Rome...
Use to_numeric with the errors="coerce" parameter.
s = pd.to_numeric(s, errors = 'coerce')
Out[73]:
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64
IIUC, you need to create the pd.Series as below, then use isinstance:
import numpy as np
import pandas as pd
s = pd.Series(['hey','hey',2,2.14],index=[1,2,3,4])
s = s.apply(lambda x: np.nan if isinstance(x, str) else x)
print(s)
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64
You could use:
s[s.str.match(r'\D+').fillna(False)] = np.nan
But if you are looking to mask all values whose type is str, not just text representations like "1.23", then refer to #Ynjxsjmh's answer.
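A hypothetical side-by-side illustration of that caveat (the '1.23' entry is made up for this example): the regex approach leaves numeric-looking strings alone, while the type check masks them.

import numpy as np
import pandas as pd

s = pd.Series(['hey', '1.23', 2, 2.14], index=[1, 2, 3, 4])

# Regex approach: only masks strings that start with a non-digit character.
regex_masked = s.copy()
regex_masked[regex_masked.str.match(r'\D+').fillna(False)] = np.nan

# Type-check approach: masks every value whose type is str, including '1.23'.
type_masked = s.mask(s.apply(type).eq(str))

print(regex_masked)  # '1.23' survives
print(type_masked)   # '1.23' becomes NaN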
I have a dataframe with values like
0 1 2
a 5 NaN 6
a NaN 2 NaN
I need the output obtained by combining the two rows that share the index 'a'.
I also need to add the multiple columns together and output them as a single column.
The desired output is below; the value is 13 since 5 + 2 + 6 = 13.
0
a 13
I am trying this with the concat function but getting errors.
How about using pandas DataFrame.sum()?
import pandas as pd
import numpy as np
data = pd.DataFrame({"0":[5, np.NaN], "1":[np.NaN, 2], "2":[6,np.NaN]})
row_total = data.sum(axis = 1, skipna = True)
row_total.sum(axis = 0)
result:
13.0
EDIT: #Chris's comment (which I did not see while writing my answer) shows how to do it in one line, if all rows have the same index.
data:
data = pd.DataFrame({"0":[5, np.NaN],
"1":[np.NaN, 2],
"2":[6,np.NaN]},
index=['a', 'a'])
gives:
0 1 2
a 5.0 NaN 6.0
a NaN 2.0 NaN
Then
data.groupby(data.index).sum().sum(axis=1)
Returns
a    13.0
dtype: float64
I am using pandas 0.18.1 on a large dataframe. I am confused by the behaviour of value_counts(). This is my code:
print df.phase.value_counts()
def normalise_phase(x):
print x
return int(str(x).split('/')[0])
df['phase_normalised'] = df['phase'].apply(normalise_phase)
This prints the following:
2 35092
3 26248
1 24646
4 22189
1/2 8295
2/3 4219
0 1829
dtype: int64
1
nan
Two questions:
1. Why is nan printed as an output of normalise_phase, when nan is not listed as a value by value_counts?
2. Why does value_counts show the dtype as int64 when it also contains string values like 1/2 and nan?
You need to pass dropna=False for NaNs to be tallied (see the docs).
int64 is the dtype of the resulting Series (the counts of the values). The values themselves form the index; the dtype of the index will be object, if you check.
ser = pd.Series([1, '1/2', '1/2', 3, np.nan, 5])
ser.value_counts(dropna=False)
Out:
1/2 2
5 1
3 1
1 1
NaN 1
dtype: int64
ser.value_counts(dropna=False).index
Out: Index(['1/2', 5, 3, 1, nan], dtype='object')
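If the goal is then to make normalise_phase robust to those NaNs, one possible adjustment (a sketch; the data here is illustrative, not the actual column) is to guard with pd.isnull:

import numpy as np
import pandas as pd

def normalise_phase(x):
    # Leave missing values as NaN instead of attempting int(str(nan)).
    if pd.isnull(x):
        return np.nan
    return int(str(x).split('/')[0])

ser = pd.Series(['2', '1/2', np.nan, 3])
print(ser.apply(normalise_phase))  # 2.0, 1.0, NaN, 3.0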
I'm just getting started with Pandas as a tool for munging two-dimensional arrays of data. It's super overwhelming, even after reading the docs. You can do so much that I can't figure out how to do anything, if that makes any sense.
My dataframe (simplified):
Date Stock1 Stock2 Stock3
2014.10.10 74.75 NaN NaN
2014.9.9 NaN 100.95 NaN
2010.8.8 NaN NaN 120.45
So each column only has one value.
I want to remove all columns that have a max value less than x. So say here as an example, if x = 80, then I want a new DataFrame:
Date Stock2 Stock3
2014.10.10 NaN NaN
2014.9.9 100.95 NaN
2010.8.8 NaN 120.45
How can this be achieved? I've looked at DataFrame.max(), which gives me a Series. Can I use that, or somehow pass a lambda function to select()?
Use df.max() to index with.
In [19]: from pandas import DataFrame
In [23]: df = DataFrame(np.random.randn(3,3), columns=['a','b','c'])
In [36]: df
Out[36]:
a b c
0 -0.928912 0.220573 1.948065
1 -0.310504 0.847638 -0.541496
2 -0.743000 -1.099226 -1.183567
In [24]: df.max()
Out[24]:
a -0.310504
b 0.847638
c 1.948065
dtype: float64
Next, we make a boolean expression out of this:
In [31]: df.max() > 0
Out[31]:
a False
b True
c True
dtype: bool
Next, you can index df.columns by this (this is called boolean indexing):
In [34]: df.columns[df.max() > 0]
Out[34]: Index([u'b', u'c'], dtype='object')
You can finally pass this to the DataFrame:
In [35]: df[df.columns[df.max() > 0]]
Out[35]:
b c
0 0.220573 1.948065
1 0.847638 -0.541496
2 -1.099226 -1.183567
Of course, instead of 0, you can use any value you want as the cutoff for dropping.
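Applied to the question's data, the whole thing collapses into a single .loc call with the boolean mask (a sketch, assuming Date is the index so that only the stock columns are compared against the cutoff):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Stock1': [74.75, np.nan, np.nan],
                   'Stock2': [np.nan, 100.95, np.nan],
                   'Stock3': [np.nan, np.nan, 120.45]},
                  index=['2014.10.10', '2014.9.9', '2010.8.8'])

x = 80
# .loc accepts the boolean Series directly and keeps only the columns
# whose maximum exceeds the cutoff.
result = df.loc[:, df.max() > x]
print(result)
#             Stock2  Stock3
# 2014.10.10     NaN     NaN
# 2014.9.9    100.95     NaN
# 2010.8.8       NaN  120.45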
I'm trying to concatenate several columns, which mostly contain NaNs, into one; here is an example with only 2:
2013-06-18 21:46:33.422096-05:00 A NaN
2013-06-18 21:46:35.715770-05:00 A NaN
2013-06-18 21:46:42.669825-05:00 NaN B
2013-06-18 21:46:45.409733-05:00 A NaN
2013-06-18 21:46:47.130747-05:00 NaN B
2013-06-18 21:46:47.131314-05:00 NaN B
This could go on for 3 or 4 or 10 columns, always with exactly one value that is pd.notnull() and the rest NaN.
I want to concatenate these into 1 column the fastest way possible. How can I do this?
Since you get exactly one string per row and the other cells are NaN, you can simply ask for the row-wise max value:
df.max(axis=1)
As per the comments, if it doesn't work in Python 3, convert your NaNs to strings first:
df.fillna('').max(axis=1)
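A minimal sketch of that on mock data shaped like the question's (the column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', np.nan],
                   'col2': [np.nan, np.nan, 'B']})

# Fill NaNs with empty strings so the row-wise comparison stays
# string-vs-string, then take the max of each row.
combined = df.fillna('').max(axis=1)
print(combined)
# 0    A
# 1    A
# 2    B
# dtype: object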
You could do
In [278]: df = pd.DataFrame([[1, np.nan], [2, np.nan], [np.nan, 3]])
In [279]: df
Out[279]:
0 1
0 1 NaN
1 2 NaN
2 NaN 3
In [280]: df.sum(1)
Out[280]:
0 1
1 2
2 3
dtype: float64
Since NaNs are treated as 0 when summed, they don't show up.
A couple of caveats: you need to be sure that only one of the columns has a non-NaN value per row for this to work, and it will only work on numeric data.
You can also use
df.fillna(method='ffill', axis=1).iloc[:, -1]
The last column will now contain all the valid observations, since the valid values have been forward-filled across each row. See the documentation here. This second way should be more flexible but slower. I slice off every row and just the last column with iloc[:, -1].
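A short sketch of the same forward-fill mechanism on mock data (column names are made up), assuming exactly one valid value per row:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', np.nan, np.nan],
                   'col2': [np.nan, 'B', np.nan],
                   'col3': [np.nan, np.nan, 'C']})

# Forward-fill across the columns: each row's single valid value is
# propagated to the right, so the last column ends up holding it.
combined = df.fillna(method='ffill', axis=1).iloc[:, -1]
print(combined)
# 0    A
# 1    B
# 2    C
# Name: col3, dtype: object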