I'm repeatedly trying to get similar data from time series dataframes. For example, the monthly (annualized) standard deviation of the series:
any_df.resample('M').std().mean()*12**(1/2)
It would save typing and would probably limit errors if these methods could be assigned to a variable, so they can be re-used - I guess this would look something like
my_stdev = .resample('M').std().mean()*12**(1/2)
result = any_df.my_stdev()
Is this possible and if so is it sensible?
Thanks in advance!
Why not just make your own function?
def my_stdev(df):
    return df.resample('M').std().mean() * 12 ** (1 / 2)
result = my_stdev(any_df)
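If you want to keep the chained, method-like feel, pandas also supports passing a function through DataFrame.pipe; a minimal sketch (the toy daily series here is purely illustrative, and newer pandas spells the month-end frequency 'ME' instead of 'M'):

```python
import numpy as np
import pandas as pd

def my_stdev(df):
    # monthly (annualized) standard deviation, as in the question
    return df.resample('M').std().mean() * 12 ** (1 / 2)

# toy daily data, purely for illustration
idx = pd.date_range('2020-01-01', periods=365, freq='D')
any_df = pd.DataFrame({'x': np.random.randn(365)}, index=idx)

result = my_stdev(any_df)        # plain function call
chained = any_df.pipe(my_stdev)  # identical result, chainable syntax
```

pipe just calls the function with the frame as its first argument, so it slots into a longer method chain without defining anything on the DataFrame class itself.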
I'm looking for a way to count how many times my database function runs.
My code looks like this
df = pd.read_csv("nodos_results3.csv")
df['URL_dashboard'] = df.apply(create_url, axis = 1)
df.to_csv('nodos_results4.csv', index = False)
I want to count how many times the function "create_url" runs. If I was in C++, for example, I would simply have the function take in another input
void create_url(database& i_DB, int& i_count)
{
    // stuff I want done to the database
    i_count++;  // count by reference so the caller sees the total
}
but I'm not sure how to do something equivalent using pandas dataframe apply.
Update: For anyone who might find this while googling in the future - I personally didn't solve this issue. I was moved to another project and didn't continue working on this. Sorry I can't be more help.
apply executes the function exactly once for each row, so the function is executed df.shape[0] times. Correction: as per @juanpa.arrivillaga, apply executes the function twice on the first row, so the correct answer is df.shape[0] + 1.
Alternatively, create a global variable (say, create_url_counter=0) and increment it in the function body:
def create_url(...):
    global create_url_counter
    ...
    create_url_counter += 1
Be aware, though, that having global variables is in general a bad idea.
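Putting the global-counter idea together into a runnable sketch — create_url here is a hypothetical stand-in, since the original function isn't shown, and older pandas versions may invoke it one extra time on the first row:

```python
import pandas as pd

create_url_counter = 0  # global, incremented on every call

def create_url(row):
    # hypothetical stand-in for the poster's function
    global create_url_counter
    create_url_counter += 1
    return f"https://example.com/node/{row['id']}"

df = pd.DataFrame({'id': [1, 2, 3]})
df['URL_dashboard'] = df.apply(create_url, axis=1)
print(create_url_counter)
```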
Is it possible to modify a class outside of a class?
I frequently have to add a new column to my data frames and am looking for cleaner syntax. All my dataframes come ready for this operation.
It's essentially this operation:
DF['Percent'] = float(DF['Earned'])/DF['Total']
I'd love to add this functionality like so:
DF = DF.add_percent()
Or
DF.add_percent(inplace=True)
Right now I'm only able to do something like:
DF = add_percent(DF)
where I declare add_percent as a function outside of pandas.
You can do
DF = DF.eval('Percent = Earned / Total')
(eval returns a new frame by default; pass inplace=True if you want to modify DF in place). I don't think it gets much cleaner than that.
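A runnable sketch of that approach (the column values are made up for illustration):

```python
import pandas as pd

DF = pd.DataFrame({'Earned': [3, 5], 'Total': [4, 10]})

# eval returns a new frame by default, so rebind the name
# (or call DF.eval('Percent = Earned / Total', inplace=True))
DF = DF.eval('Percent = Earned / Total')
print(DF['Percent'].tolist())  # [0.75, 0.5]
```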
When I use R, I can use str() to inspect objects which are a list of things most of the times.
I recently switched to Python for statistics and don't know how to inspect the objects I encounter. For example:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
heart.groupby(['censors'])['age']
I want to investigate what kind of object is heart.groupby(['censors']) that allows me to add ['age'] at the end. However, print heart.groupby(['censors']) only tells me the type of the object, not its structure and what I can do with it.
So how do I get to understand the structure of numpy / pandas object, similar to str() in R?
If you're trying to get some insight into what you can do with a Python object, you can inspect it using a beefed-up Python console like IPython. In an IPython session, first put the object you want to look at into a variable:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
h_grouped = heart.groupby(['censors'])
Then type out the variable name and double-tap Tab to bring up a list of the object's methods:
In [5]: h_grouped.<Tab><Tab>
# Shows the object's methods
A further benefit of the IPython console is you can quickly check the
help for any individual method by adding a ?:
h_grouped.apply?
# Apply function and combine results
# together in an intelligent way.
If you don't have IPython or a similar console, you can achieve something similar using dir(), e.g. dir(h_grouped), although this will also list
the object's private methods which are generally not useful and shouldn't be
touched in regular use.
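A rough str()-like listing can be had by filtering dir() output; this sketch uses a small hand-made frame rather than the statsmodels dataset so it is self-contained:

```python
import pandas as pd

df = pd.DataFrame({'censors': [0, 1, 0], 'age': [54.3, 60.1, 48.0]})
h_grouped = df.groupby(['censors'])

# drop _private/__dunder__ names to approximate what Tab-completion shows
public = [name for name in dir(h_grouped) if not name.startswith('_')]
print(type(h_grouped).__name__)        # DataFrameGroupBy
print('mean' in public, 'apply' in public)
```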
type(heart.groupby(['censors'])['age'])
type will tell you what kind of object it is. At the moment you are grouping by a dimension and not telling pandas what to do with age. If you want the mean for example you could do:
heart.groupby(['censors'])['age'].mean()
This would take the mean of age by the group, and return a series.
The groupby is, I think, a red herring -- "age" is just a column name:
import statsmodels.api as sm
heart = sm.datasets.heart.load_pandas().data
heart
# survival censors age
# 0 15 1 54.3
# ...
heart.keys()
# Index([u'survival', u'censors', u'age'], dtype='object')
I am new to Python and try to modify a pair trading script that I found here:
https://github.com/quantopian/zipline/blob/master/zipline/examples/pairtrade.py
The original script is designed to use only prices. I would like to use returns to fit my models and prices for the invested quantity, but I don't see how to do it.
I have tried:
- defining a data frame of returns in main and calling it in run
- defining a data frame of returns in main as a global object and using it where needed in handle_data
- defining a data frame of returns directly in handle_data
I assume the last option is the most appropriate, but then I get an error with the pandas shift attribute.
More specifically I try to define 'DataRegression' as follow:
DataRegression = data.copy()
DataRegression[Stock1]=DataRegression[Stock1]/DataRegression[Stock1].shift(1)-1
DataRegression[Stock2]=DataRegression[Stock2]/DataRegression[Stock2].shift(1)-1
DataRegression[Stock3]=DataRegression[Stock3]/DataRegression[Stock3].shift(1)-1
DataRegression = DataRegression.dropna(axis=0)
where 'data' is a data frame containing prices, and Stock1, Stock2 and Stock3 are column names defined globally. These lines in handle_data raise the error:
File "A:\Apps\Python\Python.2.7.3.x86\lib\site-packages\zipline-0.5.6-py2.7.egg\zipline\utils\protocol_utils.py", line 85, in __getattr__
return self.__internal[key]
KeyError: 'shift'
Would anyone know why and how to do that correctly?
Many Thanks,
Vincent
This is an interesting idea. The easiest way to do this in zipline is to use the Returns transform which adds a returns field to the event-frame (which is an ndict, not a pandas DataFrame as someone pointed out).
For this you have to add the transform to the initialize method:
self.add_transform(Returns, 'returns', window_length=1)
(make sure to add from zipline.transforms import Returns at the beginning).
Then, inside the batch_transform you can access returns instead of prices:
@batch_transform
def ols_transform(data, sid1, sid2):
    """Computes regression coefficient (slope and intercept)
    via Ordinary Least Squares between two SIDs.
    """
    p0 = data.returns[sid1]
    p1 = sm.add_constant(data.returns[sid2])
    slope, intercept = sm.OLS(p0, p1).fit().params
    return slope, intercept
Alternatively, you could also create a batch_transform to convert prices to returns like you wanted to do.
@batch_transform
def returns(data):
    return data.price / data.price.shift(1) - 1
And then pass that to the OLS transform. Or do this computation inside of the OLS transform itself.
HTH,
Thomas
I have a matrix which is built at the start of the class and then used within a couple of functions. In the last function I want to delete a row from the matrix without binding the result to a new variable name, because otherwise I would have to change every single function call (it loops until a winner is found). It works roughly like this:
matrix defined
...a lot of functions that use it, then winner() is called

def winner():
    if hasWon():
        .....
    else:
        fun1()

def fun1():
    .....
    Iterativefun()

def Iterativefun():
    .....
    matrix = numpy.delete(matrix, obj, axis)
    winner()
Is there a way to delete a row? I thought of changing every number in that row (values 1-10) to 0, a value I'm not using, so the row would effectively be ignored. Any help will be appreciated.
I'm not sure what version of Numpy you have (or exactly what you're trying to do for that matter), but if you have Numpy 1.7, numpy.squeeze looks like it might be what you're looking for... http://docs.scipy.org/doc/numpy/reference/generated/numpy.squeeze.html#numpy.squeeze
If changing a row to all zeros is acceptable, you can just use:
matrix[row] = 0
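For completeness, numpy.delete really does drop a row; it returns a new array, which you can simply rebind to the same name so the other functions keep working. A quick sketch contrasting the two approaches:

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)

# delete returns a new array without row 1; rebind the old name
matrix = np.delete(matrix, 1, axis=0)
print(matrix.shape)  # (2, 4)

# the zeroing alternative blanks a row in place, keeping the shape
matrix[0] = 0
print(matrix[0])  # [0 0 0 0]
```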