How can I pass grouping index value as an additional argument alongside the group's subdataframe?
This crude example just applies a univariate function:
df = pd.DataFrame(data=np.random.randint(0,10, size=(3,3)), index = ['a','b','a'])
t = df.groupby(df.index).apply(lambda x: ''.join(str(x)))
0 1 2
a 8 6 7
b 6 2 4
a 8 2 4
This function accepts as argument the index upon which the dataframe was grouped.
def f(g, indx):
return ''.join(str(x)) +'___' str(indx)
The output should be:
0
a '8 6 7 8 2 4___a'
b '6 2 4___b'
I understand that this example is trivial, but the point is to pass the grouping index value as argument alongside the grouped subdataframe. The solution I see is to iterate over the grouping object. I am not sure it's good solution performance-wise.
Mathematica has MapIndexed function that does the job but without prior grouping. It seems this question was asked before.
You can get to the index name via .name. So you do something like:
df.groupby(df.index).apply(lambda x: ''.join(str(x.values)) + '___' + str(x.name))
The output is not exactly what you want but figured I'd get this info to you quickly. Assume that you can clean it up to what you want.
Output (older version of your data):
a [[8 4 6]\n [6 8 9]]___a
b [[1 3 2]]___b
Related
I have a dataframe (df) with 2 columns:
Out[2]:
0 1
0 1 2
1 4 5
2 3 6
3 10 12
4 1 2
5 4 5
6 3 6
7 10 12
I would like to use calculate for all the elements of df[0] a function of itself and df[1] column:
def custom_fct_2(x,y):
res=stats.percentileofscore(y.values,x.iloc[-1])
return res
I get the following error: TypeError:
("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats
def custom_fct_2(x,y):
res=stats.percentileofscore(y.values,x.iloc[-1])
return res
df= pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc']=df.rolling(3).apply(custom_fct_2(df[0],df[1]))
Can someone help me on that? ( I am new in Python)
Out[2]:
0 1
...
5 4 5
6 3 6
7 10 12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that rolling().apply() function cannot give you a segment of 3 rows across all the columns. Instead, it gives you series for the column 0 first, then the column 1.
Maybe there are better solutions, but I would show my one which at least works.
df= pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
def custom_fct_2(s):
score = df[0][s.index.values[1]] # you may use .values[-1] if you want the last element
a = s.values
return stats.percentileofscore(a, score)
I'm using the same data you provided. But I modified your custom_fct_2() function. Here we get the s which is a series of 3 rolling values from the column 1. Fortunately, we have indexes in this series, so we can get the score from the column 0 via the "middle" index of the series. BTW, in Python [-1] means the last element of a collection, but from your explanation, I believe you actually want the middle one.
Then, apply the function.
# remove the shift() function if you want the value align to the last value of the rolling scores
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift function is optional. It depends on your requirements whether your prec need to be aligned with column 0 (the middle score is using) or the rolling scores of column 1. I would assume you need it.
I am slicing a pandas dataframe and I seem to be getting unexpected slices using .loc, at least as compared to numpy and ordinary python slicing. See the example below.
>>> import pandas as pd
>>> a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
>>> a
0 1 2
0 0 1 2
1 3 4 5
2 4 5 6
3 9 10 11
4 34 2 1
>>> a.loc[1:3, :]
0 1 2
1 3 4 5
2 4 5 6
3 9 10 11
>>> a.values[1:3, :]
array([[3, 4, 5],
[4, 5, 6]])
Interestingly, this only happens with .loc, not .iloc.
>>> a.iloc[1:3, :]
0 1 2
1 3 4 5
2 4 5 6
Thus, .loc appears to be inclusive of the terminating index, but numpy and .iloc are not.
By the comments, it seems this is not a bug and we are well warned. But why is it the case?
Remember .loc is primarily label based indexing. The decision to include the stop endpoint becomes far more obvious when working with a non-RangeIndex:
df = pd.DataFrame([1,2,3,4], index=list('achz'))
# 0
#a 1
#c 2
#h 3
#z 4
If I want to select all rows between 'a' and 'h' (inclusive) I only know about 'a' and 'h'. In order to be consistent with other python slicing, you'd need to also know what index follows 'h', which in this case is 'z' but could have been anything.
There's also a section of the documents hidden away that explains this design choice Endpoints are Inclusive
Additionally to the point in the docs, pandas slice indexing using .loc is not cell index based. It is in fact value based indexing (in the pandas docs it is called "label based", but for numerical data I prefer the term "value based"), whereas with .iloc it is traditional numpy-style cell indexing.
Furthermore, value based indexing is right-inclusive, whereas cell indexing is not. Just try the following:
a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
a.index = [0, 1, 2, 3.1, 4] # add a float index
# value based slicing: the following will output all value up to the slice value
a.loc[1:3.1]
# Out:
# 0 1 2
# 1.0 3 4 5
# 2.0 4 5 6
# 3.1 9 10 11
# index based slicing: will raise an error, since only integers are allowed
a.iloc[1:3.1]
# Out: TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Float64Index'> with these indexers [3.2] of <class 'float'>
To give an explicit answer to your question why it is right-inclusive:
When using values/labels as indices, it is, at least in my opinion, intuitive, that the last index is included. This is as far as I know a design decision of how the implemented function is meant to work.
So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function that adds 5 to a number called add5. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization not apply as this concept is going to be expanded to a dataset with hundreds of thousands of entries and speed will be important. I can do it without the greater than 3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function that is:
def someThingSpecial(x,y)
# z = do something special with x,y
return z
I now want to create a new column in df that bears the computed z value
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
Which returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x,y):
return x + y
df.apply(lambda x: someThingSpecial(x.A, x.B), 1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
if your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
to store it:
dfX['C'] = dfX['A'].apply(your_func)
I wanna implement a calculate method like a simple scenario:
value computed as the sum of daily data during the previous N days (set N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
to do calculating like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Now I have done this using for cycle like:
df['value']=[0]*len(df)
for idx in df.index
loc=df.index.get_loc(idx)
if((loc-N)>=0):
tmp=df.ix[df.index[loc-3]:df.index[loc-1]]
sum=tmp['value'].sum()
else:
sum=0
df['new'].ix(idx)=sum
But, when the length of dataframe or the value of N is very long / big, these calculating will be very slow....How I can implement this faster using a function or by other ways?
Besides, if the scenario is more complex? how ? Thanks.
Since you want the sum of the previous three excluding the current one, you can use rolling_apply over the a window of four and sum up all but the last value.
new = rolling_apply(df, 4, lambda x:sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)