Pandas - apply transformation on index of DataFrame - python

This is my code:
import pandas as pd
x = pd.DataFrame.from_dict({'A':[1,2,3,4,5,6], 'B':[10, 20, 30, 44, 48, 81]})
a = x['A'].apply(lambda t: t%2==0) # works
c = x.index.apply(lambda t: t%2==0) # error
How can I make that code work in the easiest way? I know how to reset_index() and then treat it as a column, but I was curious if it's possible to operate on the index as if it's a regular column.

You have to convert the Index to a Series using to_series if you want to call apply, since Index objects have no apply method:
c = x.index.to_series().apply(lambda t: t % 2 == 0)
There are a limited number of methods and operations for an Index: http://pandas.pydata.org/pandas-docs/stable/api.html#modifying-and-computations
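A quick check with the DataFrame above confirms the result is a boolean Series aligned to the default RangeIndex:
print(c.tolist())  # [True, False, True, False, True, False]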

Pandas hasn't implemented pd.Index.apply. You can, for simple calculations, use the underlying NumPy array:
c = x.index.values % 2 == 0
As opposed to lambda-based solutions, this takes advantage of vectorised operations.
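A quick check shows this returns a plain NumPy boolean array rather than a Series:
c = x.index.values % 2 == 0
print(type(c))  # <class 'numpy.ndarray'>
print(c)        # [ True False  True False  True False]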

Pandas Index objects have a map method:
c = x.index.map(lambda t: t % 2 == 0)  # Index([True, False, True, False, True, False], dtype='object')
Note that this returns an Index, not a pandas Series.

Related

applying a function to a pair of pandas series

Suppose I have two series:
s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])
I want to apply a two argument function to the two series to get a series of results (as a series). Now, one way to do it is as follows:
df = pd.concat([s, t], axis=1)
result = df.apply(lambda x: foo(x[0], x[1]), axis=1)
But this seems clunky. Is there any more elegant way?
There are many ways to do what you want.
Depending on the function in question, you may be able to apply it directly to the series. For example, calling s + t returns
0 37
1 40
2 23
dtype: int64
However, if your function is more complicated than simple arithmetic, you may need to get creative. One option is to use the built-in Python map function. For example, calling
list(map(np.add, s, t))
returns
[37, 40, 23]
If the two series have the same index, you can create a series with a list comprehension:
result = pd.Series([foo(xs, xt) for xs, xt in zip(s, t)], index=s.index)
If you can't guarantee that the two series have the same index, concat is the way to go as it helps align the index.
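A minimal end-to-end sketch of the comprehension approach, using a hypothetical two-argument foo (any scalar function works):
import pandas as pd
def foo(a, b):
    # hypothetical example function: weighted sum of the two values
    return 2 * a + b
s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])
result = pd.Series([foo(xs, xt) for xs, xt in zip(s, t)], index=s.index)
print(result.tolist())  # [57, 61, 35]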
If I understand correctly, you can use this to apply a function over two columns and copy the result into another column (note that apply with axis=1 passes each row to foo as a single Series, and the columns here are assumed to be named 's' and 't'):
df['result'] = df.loc[:, ['s', 't']].apply(foo, axis=1)
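If foo instead takes two scalars (as in the question), a sketch that names the columns when concatenating and unpacks each row:
df = pd.concat([s.rename('s'), t.rename('t')], axis=1)
df['result'] = df.loc[:, ['s', 't']].apply(lambda row: foo(row['s'], row['t']), axis=1)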
It might be possible to use numpy.vectorize:
from numpy import vectorize
vect_foo = vectorize(foo)
result = vect_foo(s, t)
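Note that numpy.vectorize returns a plain ndarray, so if you need a Series with the original index you can wrap the result (reusing the hypothetical foo from above):
result = pd.Series(vect_foo(s, t), index=s.index)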

Is there a way to use a method/function as an expression for .loc() in pandas?

I've been going crazy trying to figure this out. I'm trying to avoid using df.iterrows() to iterate through the rows of a dataframe, as it's quite time consuming and .loc() is better from what I've seen.
I know this works:
df = df.loc[df.number == 3, :]
And that'll basically set df to be each row where the "number" column is equal to 3.
But, I get an error when I try something like this:
df = df.loc[someFunction(df.number), :]
What I want is to get every row where someFunction() returns True whenever the "number" value of said row is set as the parameter.
For some reason, it's passing the entire column (the dataframe's entire "number" column, in this example) instead of each row's value, like the previous example does.
Again, I know I can just use a for loop and .iterrows(), but I'm working with around 280,000 rows and it just takes longer than I'd like. I've also tried using a lambda function, among other things.
Apply is slow. If you can, put the vectorised logic in the function itself by taking Series as arguments:
import pandas as pd
df = pd.DataFrame()
df['a'] = [7, 6, 5, 4, 3, 2]
df['b'] = [1, 2, 3, 4, 5, 6]
def my_func(series1, series2):
    return (series2 > 3) | (series1 == series2)
df.loc[my_func(df.b, df.a), 'new_column_name'] = True
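For reference, printing df afterwards shows True where the mask holds; rows the mask excludes are left as NaN:
print(df)
#    a  b new_column_name
# 0  7  1            True
# 1  6  2            True
# 2  5  3            True
# 3  4  4            True
# 4  3  5             NaN
# 5  2  6             NaN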
I think this is what you need:
import pandas as pd
df = pd.DataFrame({"number": [x for x in range(10)]})
def someFunction(row):
    if row > 5:
        return True
    else:
        return False
df = df.loc[df.number.apply(someFunction)]
print(df)
Output:
number
6 6
7 7
8 8
9 9
You can use an anonymous function with .loc, where x refers to the dataframe you are indexing:
df.loc[lambda x: x.number > 5, :]
Two options I can think of (see the sketch after this list):
Create a new column using the pandas apply() method and a lambda function that returns either True or False depending on someFunction(). Then use loc to filter on the new column you just created.
Use a for loop and df.itertuples(), as it is much faster than iterrows. Make sure to look up the documentation, as the syntax is slightly different for itertuples.
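A minimal sketch of both options, assuming someFunction takes a scalar and returns a bool:
import pandas as pd
df = pd.DataFrame({'number': range(10)})
def someFunction(n):
    # hypothetical scalar predicate
    return n > 5
# Option 1: build a boolean mask with apply, then filter with loc
mask = df['number'].apply(someFunction)
filtered = df.loc[mask]
# Option 2: itertuples (row.number accesses the column by name)
rows = [row for row in df.itertuples(index=False) if someFunction(row.number)]
filtered2 = pd.DataFrame(rows)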
Something like this will also work:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['number'] = np.arange(10)
display(df[df['number'] > 5])
display(df[(df['number'] > 2) & (df['number'] < 7)])

pandas apply function to each group (output is not really an aggregation)

I have a list of time-series (= pandas dataframes) and want to calculate the matrix profile for each time-series (of a device).
One option is to iterate over all the devices - which seems to be slow.
A second option would be to group by the devices and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group but the same number of rows as the input.
Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?
import pandas as pd
df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
print('***************************')
# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    this_group = df[df.bar == g]
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrixprofile for each time-series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)
print('***************************')
def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x
# neatly vectorized application of a non-scalar function
# but this fails with: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
For non-aggregating functions applied to each distinct group, i.e. ones that do not return a scalar value, you need to iterate the method across groups and then compile the results together.
Therefore, consider a list or dict comprehension using groupby(), followed by concat. Be sure the method takes and returns a full data frame, series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
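Applied to the question's example with a hypothetical 1:1 function (here: demeaning baz within each group), the pattern looks like this:
import pandas as pd
df = pd.DataFrame({'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]})
def myfunction(sub):
    # hypothetical per-group UDF returning one row per input row
    sub = sub.copy()
    sub['result'] = sub['baz'] - sub['baz'].mean()
    return sub
final_df = pd.concat([myfunction(sub) for _, sub in df.groupby('bar')])
print(final_df.sort_index())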
Indeed, this (see also the link in the comments above) is a way to get it to work in a faster/more desirable way. Perhaps there is an even better alternative:
import pandas as pd
df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # obviously the UDF needs to be applied here, not this idempotent operation (= doing nothing)
    altered.append(subframe)
    print(index)
    # print(subframe)
pd.concat(altered, ignore_index=True)
# pd.DataFrame(altered)

Savgol filter over dataframe columns

I'm trying to apply a savgol filter from SciPy to smooth my data. I've successfully applied the filter by selecting each column separately, defining a new y value and plotting it. However I wanted to apply the function in a more efficient way across a dataframe.
y0 = alldata_raw.iloc[:,0]
w0 = savgol_filter(y0, 41, 1)
My first thought was to create an empty list, write a for loop to apply the function to each column, append each result to the list and finally concatenate. However I got an error: 'TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid' (savgol_filter returns NumPy arrays, which pd.concat does not accept).
smoothed_array = []
for key, values in alldata_raw.iteritems():
    y = savgol_filter(values, 41, 1)
    smoothed_array.append(y)
alldata_smoothed = pd.concat(smoothed_array, axis=1)
Instead I tried using the DataFrame apply() method; however, I'm having issues with that too. I get an error message: 'TypeError: expected x and y to have same length'
alldata_smoothed = alldata_raw.apply(savgol_filter(alldata_raw, 41, 1), axis=1)
print(alldata_smoothed)
I'm quite new to python so any advice on how to make each method work and which is preferable would be appreciated!
In order to use the filter, first create a function that takes a single argument - the column data. Then you can apply it to the dataframe columns like this:
from scipy.signal import savgol_filter
def my_filter(x):
    return savgol_filter(x, 41, 1)
alldata_smoothed = alldata_raw.apply(my_filter)
You could also go with a lambda function:
alldata_smoothed = alldata_raw.apply(lambda x: savgol_filter(x,41,1))
Passing axis=1 to apply would apply the function to dataframe rows; what you need is the default axis=0, which applies it to the columns.
That was pretty general, but the docs for savgol_filter tell me that it accepts an axis argument too. So in this specific case you could apply the filter to the whole dataframe at once. This will probably be more performant, but I haven't checked =).
alldata_smoothed = pd.DataFrame(savgol_filter(alldata_raw, 41, 1, axis=0),
                                columns=alldata_raw.columns,
                                index=alldata_raw.index)
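A self-contained sketch with made-up data (alldata_raw isn't shown in the question; note that savgol_filter needs at least window_length rows, here 41):
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter
rng = np.random.default_rng(0)
# hypothetical noisy data standing in for alldata_raw
alldata_raw = pd.DataFrame(rng.normal(size=(100, 3)).cumsum(axis=0), columns=['a', 'b', 'c'])
alldata_smoothed = pd.DataFrame(savgol_filter(alldata_raw, 41, 1, axis=0),
                                columns=alldata_raw.columns,
                                index=alldata_raw.index)
print(alldata_smoothed.head())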

Vectorized Flag Assignment in Dataframe

I have a dataframe with observations possessing a number of codes. I want to compare the codes present in a row with a list. If any codes are in that list, I wish to flag the row. I can accomplish this using the itertuples method as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1','abc6']
# initialize flag column
df['flag'] = 0
# itertuples method
for row in df.itertuples():
    if any(df.iloc[row.Index, 1:4].isin(code_flags)):
        df.at[row.Index, 'flag'] = 1
The output correctly adds a flag column with the appropriate flags, where 1 indicates a flagged entry.
However, on my actual use case, this takes hours to complete. I have attempted to vectorize this approach using numpy.where.
df['flag'] = 0 # reset
df['flag'] = np.where(any(df.iloc[:,1:4].isin(code_flags)),1,0)
This appears to evaluate everything to the same value. I think I'm confused about how the vectorization treats the index. I can remove the colon and comma and write df.iloc[1:4] and obtain the same result.
Am I misunderstanding the where function? Is my indexing incorrect and causing a True evaluation for all cases? Is there a better way to do this?
Use np.where with the DataFrame method .any (row-wise), not the Python built-in any(..):
np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
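Assigning the result back reproduces the flag column from the itertuples version:
df['flag'] = np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
print(df['flag'].tolist())  # [1, 0, 0, 1, 0]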
