Savgol filter over dataframe columns - python

I'm trying to apply a savgol filter from SciPy to smooth my data. I've successfully applied the filter by selecting each column separately, defining a new y value and plotting it. However I wanted to apply the function in a more efficient way across a dataframe.
y0 = alldata_raw.iloc[:,0]
w0 = savgol_filter(y0, 41, 1)
My first thought was to create an empty list, write a for loop that applies the function to each column, append each result to the list and finally concatenate the list. However, I got the error 'TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid'
smoothed_array = []
for key,values in alldata_raw.iteritems():
    y = savgol_filter(values, 41, 1)
    smoothed_array.append(y)
alldata_smoothed = pd.concat(smoothed_array, axis=1)
Instead I tried using the apply() function, but I'm having issues with that too. I get the error message: 'TypeError: expected x and y to have same length'
alldata_smoothed = alldata_raw.apply(savgol_filter(alldata_raw, 41, 1), axis=1)
print(alldata_smoothed)
I'm quite new to python so any advice on how to make each method work and which is preferable would be appreciated!

In order to use the filter first create a function that takes a single argument - the column data. Then you can apply it to dataframe columns like this:
from scipy.signal import savgol_filter
def my_filter(x):
    return savgol_filter(x, 41, 1)
alldata_smoothed = alldata_raw.apply(my_filter)
You could also go with a lambda function:
alldata_smoothed = alldata_raw.apply(lambda x: savgol_filter(x,41,1))
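Passing axis=1 to apply runs the function on each row of the dataframe. What you need is the default, axis=0, which runs it on each column. Note also that apply expects the function itself, not the result of calling it; your attempt passes the output of savgol_filter(alldata_raw, 41, 1) instead of a function.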
That was pretty general but the docs for savgol_filter tell me that it accepts an axis argument too. So in this specific case you could apply the filter to the whole dataframe at once. This will probably be more performant but I haven't checked =).
alldata_smoothed = pd.DataFrame(savgol_filter(alldata_raw, 41, 1, axis=0),
                                columns=alldata_raw.columns,
                                index=alldata_raw.index)
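As a rough check of the three approaches discussed here (a fixed version of the loop, apply, and the axis argument), the sketch below runs them on synthetic data and compares the results. The random frame and its column names "a", "b", "c" are assumptions standing in for alldata_raw; the window length 41 and polynomial order 1 are taken from the question.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# Synthetic stand-in for alldata_raw (assumption): 100 rows, 3 noisy columns.
rng = np.random.default_rng(0)
alldata_raw = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# 1. Loop: savgol_filter returns NumPy arrays, which pd.concat rejects,
#    so collect them in a dict and build one DataFrame from it instead.
smoothed = {name: savgol_filter(col, 41, 1) for name, col in alldata_raw.items()}
via_loop = pd.DataFrame(smoothed, index=alldata_raw.index)

# 2. apply: pass a function; it is called once per column (axis=0 by default).
via_apply = alldata_raw.apply(lambda col: savgol_filter(col, 41, 1))

# 3. Whole frame at once: filter along axis=0 (down each column).
via_axis = pd.DataFrame(savgol_filter(alldata_raw, 41, 1, axis=0),
                        columns=alldata_raw.columns,
                        index=alldata_raw.index)

print(np.allclose(via_loop, via_apply), np.allclose(via_apply, via_axis))
```

All three should agree up to floating-point error, so the choice is mostly about readability and speed.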

Related

applying a function to a pair of pandas series

Suppose I have two series:
s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])
I want to apply a two argument function to the two series to get a series of results (as a series). Now, one way to do it is as follows:
df = pd.concat([s, t], axis=1)
result = df.apply(lambda x: foo(x[0], x[1]), axis=1)
But this seems clunky. Is there any more elegant way?
There are many ways to do what you want.
Depending on the function in question, you may be able to apply it directly to the series. For example, calling s + t returns
0 37
1 40
2 23
dtype: int64
However, if your function is more complicated than simple arithmetic, you may need to get creative. One option is to use the built-in Python map function. For example, calling
list(map(np.add, s, t))
returns
[37, 40, 23]
If the two series have the same index, you can create a series with a list comprehension:
result = pd.Series([foo(xs, xt) for xs,xt in zip(s,t)], index=s.index)
If you can't guarantee that the two series have the same index, concat is the way to go as it helps align the index.
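If I understand correctly, you can use this to apply a function to two columns and store the results in another column. Note that foo receives each row as a single Series here, not as two separate arguments: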
df['result'] = df.loc[:, ['s', 't']].apply(foo, axis=1)
It might be possible to use numpy.vectorize:
from numpy import vectorize
vect_foo = vectorize(foo)
result = vect_foo(s, t)
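To compare the options above side by side, here is a small sketch. The function foo is a hypothetical stand-in for a two-argument function that is not plain arithmetic. Series.combine is not mentioned in the answers above; it is included as one more pandas method that applies a function to a pair of series element by element.

```python
import numpy as np
import pandas as pd

s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])

def foo(a, b):
    # Hypothetical function with a branch, so s + t style arithmetic won't do.
    return a - b if a > 20 else a + b

# List comprehension over zip: pairs values by position, reuses s's index.
result_zip = pd.Series([foo(a, b) for a, b in zip(s, t)], index=s.index)

# Series.combine: pairs values by index label and returns a Series.
result_combine = s.combine(t, foo)

# np.vectorize: returns a NumPy array, so wrap it back into a Series.
vect_foo = np.vectorize(foo)
result_vec = pd.Series(vect_foo(s.to_numpy(), t.to_numpy()), index=s.index)

print(result_zip.tolist())
```

np.vectorize is a convenience wrapper around a Python loop, so it is unlikely to be much faster than the list comprehension.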

In pandas expanding/rolling functions, how to use the index of a dataframe or series?

Let's say I have a pandas.Series with a datetime index:
srs = pd.Series(index = pd.date_range('2013-01-01','2013-01-10' )).fillna(1)
I can use the expanding function to calculate say expanding sum of the series.
srs.expanding(5).sum()
However, I can not access other properties of the series (say its index) by using the expanding function. For example by running:
srs.expanding(5).apply(lambda x: x.index[-1])
I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'index'
Why are the groups passed as numpy arrays as opposed to pandas.Series? Is there another way to use expanding/rolling functions and still have access to the indices?
You can get access to the indices when using expanding, but only if the indices have the same type as the values.
For example it works:
s1 = pd.Series(index = range(10)).fillna(1)
s1.expanding(5).agg(lambda x: x.index[-1])
But it doesn't work:
srs = pd.Series(index = pd.date_range('2013-01-01','2013-01-10' )).fillna(1)
srs.expanding(5).agg(lambda x: x.index[-1])
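Another possible route, sketched under the assumption of a pandas version whose expanding/rolling apply accepts the raw parameter: with raw=False each window is passed to the function as a Series, index included. The function still has to return a number, so this example returns the day of the last timestamp rather than the timestamp itself.

```python
import pandas as pd

srs = pd.Series(1.0, index=pd.date_range("2013-01-01", "2013-01-10"))

# raw=False passes each window as a Series, so w.index is available.
# The result must be numeric, hence .day instead of the Timestamp itself.
last_day = srs.expanding(min_periods=5).apply(
    lambda w: w.index[-1].day, raw=False
)
print(last_day)
```

The first four entries are NaN because fewer than five observations are available; after that each entry is the day number of the window's last index label.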

Dynamically add columns to dataframe via apply

The following code applies a function f to a dataframe column data_df["c"] and concats the results to the original dataframe, i.e. concatenating 1024 columns to the dataframe data_df.
data_df = apply_and_concat(data_df, "c", lambda x: f(x, y), [y + "-dim" + str(i) for i in range(0,1024)])
def apply_and_concat(df, field, func, column_names):
    return pd.concat((
        df,
        df[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
The problem is that I want to execute this dynamically, meaning that I don't know how many columns it returns. f returns a list. Is there any better or easier way to add these columns without the need to specify the number of columns beforehand?
Your use of pd.concat((df, df[field].apply(...)), axis=1) already solves the main task well. It seems like your main question really boils down to "how do I name an unknown number of columns", where you're happy to use a name based on sequential integers. For that, use itertools.count():
import itertools
f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))
Then use f_modified instead of f. That way, you get a dictionary instead of a list, with an arbitrary number of dynamically generated names as keys. When converting this dictionary to a Series, you'll end up with the keys being used as the index, so you don't need to provide an explicit list as the index, and hence don't need to know the number of columns in advance.
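Put together, the idea might look like the sketch below. The function f here is a hypothetical stand-in that returns a three-element list, and apply_and_concat is a trimmed variant of the question's helper that no longer takes column_names; both are assumptions for illustration only.

```python
import itertools
import pandas as pd

def f(cell, prefix):
    # Hypothetical stand-in: returns a list whose length the caller doesn't know.
    return [cell * k for k in range(1, 4)]

def f_named(cell, prefix):
    # zip stops at the end of the list, so the infinite counter is safe.
    names = ('{}-dim{}'.format(prefix, i) for i in itertools.count())
    return dict(zip(names, f(cell, prefix)))

def apply_and_concat(df, field, func):
    # pd.Series(dict) uses the dict keys as the index, which become column names.
    expanded = df[field].apply(lambda cell: pd.Series(func(cell)))
    return pd.concat((df, expanded), axis=1)

data_df = pd.DataFrame({"c": [1, 2]})
out = apply_and_concat(data_df, "c", lambda x: f_named(x, "y"))
print(out.columns.tolist())
```

Because the names are generated alongside the values, the number of new columns never has to be stated anywhere.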

Multiplying columns by another column in a dataframe

(Full disclosure that this is related to another question I asked, so bear with me if I should have appended it to what I wrote previously, even though the problem is different.)
I have a dataframe consisting of a column of weights and columns containing binary values of 0 and 1. I'd like to multiply every column within the dataframe by the weights column. However, I seem to be replacing every column within the dataframe with the weight column. I'm sure I'm missing something incredibly stupid/basic here--I'm rather new to pandas and python as a whole. What am I doing wrong?
celebfile = pd.read_csv(celebcsv)
celebframe = pd.DataFrame(celebfile)
behaviorfile = pd.read_csv(behaviorcsv)
behaviorframe = pd.DataFrame(behaviorfile)
celebbehavior = pd.merge(celebframe, behaviorframe, how ='inner', on = 'RespID')
celebbehavior2 = celebbehavior.copy()
def multiplycolumns(column):
    for column in celebbehavior:
        return celebbehavior[column]*celebbehavior['WEIGHT']
celebbehavior2 = celebbehavior2.apply(lambda column: multiplycolumns(column), axis=0)
print(celebbehavior2.head())
You have a return statement inside the for loop, which means the loop body runs only once. To multiply a data frame by a column, you can use the mul method with the correct axis parameter:
celebbehavior.mul(celebbehavior['WEIGHT'], axis=0)
read_csv returns a pd.DataFrame. It is not necessary to use pd.DataFrame on top of it.
mul with axis=0: you can use apply, but that is awkward. Use mul(axis=0). This should be all you need.
df = pd.read_csv(celebcsv).merge(pd.read_csv(behaviorcsv), on='RespID')
df = df.mul(df.WEIGHT, 0)
You said that it looks like you are just replacing every column with the weights column. Are your other columns all ones?
You can use the mul method to multiply the columns. However, just FYI, if you do want to use apply, bear the following in mind:
The apply function passes each series in the dataframe to the function. This looping is inherent to the apply function. Therefore first thing to say is that your loop within the function is redundant. Also you have a return statement inside it which is causing the behavior you do not want.
If each column is passed as the argument automatically all you need to do is tell the function what to multiply it by. In this case your weights series.
Here is an implementation using apply. The undesirable side effect is that the weights column is also multiplied by itself:
df = pd.DataFrame({'1' : [1, 1, 0, 1],
                   '2' : [0, 0, 1, 0],
                   'weights' : [0.5, 0.25, 0.1, 0.05]})
def multiply_columns(column, weights):
    return column * weights
df.apply(lambda x: multiply_columns(x, df['weights']))
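One way to avoid squaring the weights, sketched on the same toy frame: select only the value columns, multiply them with mul(axis=0), and leave the weights column untouched. The names value_cols and weighted are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'1': [1, 1, 0, 1],
                   '2': [0, 0, 1, 0],
                   'weights': [0.5, 0.25, 0.1, 0.05]})

# Every column except the weights themselves.
value_cols = df.columns.drop('weights')

# axis=0 aligns the weights with the row index, so each row is scaled
# by its own weight.
weighted = df.copy()
weighted[value_cols] = df[value_cols].mul(df['weights'], axis=0)
print(weighted)
```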

calling apply() on an empty pandas DataFrame

I'm having a problem with the apply() method of the pandas DataFrame. My issue is that apply() can return either a Series or a DataFrame, depending on the return type of the input function; however, when the frame is empty, apply() (almost) always returns a DataFrame. So I can't write code that expects a Series. Here's an example:
import pandas as pd
def area_from_row(row):
    return row['width'] * row['height']
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1)
# This works as expected.
non_empty_frame = pd.DataFrame(data=[[2, 3]], columns=['width', 'height'])
add_area_column(non_empty_frame)
# This fails!
empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])
add_area_column(empty_frame)
Is there a standard way of dealing with this? I can do the following, but it's silly:
def area_from_row(row):
    # The way we respond to an empty row tells pandas whether we're a
    # reduction or not.
    if not len(row):
        return None
    return row['width'] * row['height']
(I'm using pandas 0.11.0, but I checked this on 0.12.0-1100-g0c30665 as well.)
You can set the result_type parameter in apply to 'reduce'.
From the documentation,
By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
And then,
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
In your code, update here:
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1, result_type='reduce')
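A self-contained version of this fix, as a sketch that assumes a pandas release recent enough to support result_type (the versions named in the question predate it):

```python
import pandas as pd

def area_from_row(row):
    return row['width'] * row['height']

def add_area_column(frame):
    # result_type='reduce' asks apply for a Series, even for an empty frame.
    frame['area'] = frame.apply(area_from_row, axis=1, result_type='reduce')

non_empty_frame = pd.DataFrame(data=[[2, 3]], columns=['width', 'height'])
add_area_column(non_empty_frame)

empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])
empty_result = empty_frame.apply(area_from_row, axis=1, result_type='reduce')
add_area_column(empty_frame)

print(type(empty_result).__name__, list(empty_frame.columns))
```

With the empty frame, apply should hand back an empty Series, so the column assignment goes through and the frame simply gains an empty area column.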
