I'm a Python user and I'm quite lost on the task below.
Let df be a time series of 1000 stock returns.
I would like to calculate an iterating mean, as below:
df[0:500].mean()
df[0:501].mean()
df[0:502].mean()
...
df[0:999].mean()
df[0:1000].mean()
How can I write efficient code for this?
Many thanks
Pandas has common transformations like this built in. See for example:
df.expanding().mean()
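If the running mean should only start once 500 observations are available (to match the slices in the question), one way, sketched here on a toy series, is to pass min_periods and drop the leading NaNs:

import numpy as np
import pandas as pd

# Toy stand-in for the 1000 stock returns described in the question.
df = pd.Series(np.random.randn(1000))

# Means of df[0:500], df[0:501], ..., df[0:1000] in one vectorised call.
expanding_means = df.expanding(min_periods=500).mean().dropna()
print(expanding_means.head())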
I am a beginner with pandas; I picked it up as it seemed to be the most popular and easiest library to work with based on reviews. My intention is fast data processing using async processes (pandas doesn't really support async, but I haven't reached that problem yet). If you believe I could use a better library for my needs based on the scenarios below, please let me know.
My code runs websockets using asyncio, which constantly fetch activity data and store it into a pandas DataFrame like so:
data_set.loc[len(data_set)] = [datetime.now(), res['data']['E'], res['data']['s'], res['data']['p'], res['data']['q'], res['data']['m']]
That seems to work when printing out the results. The DataFrame gets big quickly, so I have a clean-up function that checks the len() of the DataFrame and drop()s rows.
My intention is to take the full set in data_set, create a summary view based on a group value, and calculate additional analytics using the grouped data and data points at different date_time snapshots. These calculations would run multiple times per second.
What I mean is this (it's all made up, not a working code example, just the principle of what's needed):
grouped_data = data_set.groupby('name')
stats_data['name'] = grouped_data['name'].drop_duplicates()
stats_data['latest'] = grouped_data['column_name'].tail(1)
stats_data['change_over_1_day'] = ? (need to get the oldest record within a 1-day window (out of multiple days of data), take the value from a specific column, and compare it against ['latest'])
stats_data['change_over_2_day'] = ?
stats_data['change_over_3_day'] = ?
stats_data['total_over_1_day'] = grouped_data.filter(data > 1 day ago).sum(column_name)
I have googled a million things, but every time the examples are quite basic and don't really cover my scenario.
Any help appreciated.
The question was a bit vague, I guess, but after some more research (googling) and trial and error (hours), I managed to accomplish everything I mentioned here.
Hopefully this can save some time for someone who is new to this:
# latest row per name
stats_data = data.loc[data.groupby('name')['date_time'].idxmax()].reset_index(drop=True)
# oldest value per name within the last day (day_1 is the cutoff timestamp)
one_day_ago = data.loc[data[data.date_time > day_1].groupby('name')['date_time'].idxmin()].drop(labels=['date_time', 'id', 'volume', 'flag'], axis=1).set_index('name')['value']
# percent change of the latest value vs. the value one day ago
stats_data['change_over_1_day'] = stats_data['value'].astype('float') / stats_data['name'].map(one_day_ago).astype('float') * 100 - 100
The same principle applies to the other columns.
If anyone has a much more efficient/faster way to do this, please post your answer.
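For reference, one possibly tidier variant of the same idea (a sketch only, assuming the same 'name', 'date_time' and 'value' columns as above, a raw frame called data, and a helper change_over introduced purely for illustration):

import pandas as pd

def change_over(data, days):
    # Percent change of each name's latest value vs. its oldest value within the window.
    cutoff = data['date_time'].max() - pd.Timedelta(days=days)
    recent = data[data['date_time'] > cutoff].sort_values('date_time')
    grouped = recent.groupby('name')['value']
    return grouped.last().astype(float) / grouped.first().astype(float) * 100 - 100

stats = pd.DataFrame({
    'change_over_1_day': change_over(data, 1),
    'change_over_2_day': change_over(data, 2),
    'change_over_3_day': change_over(data, 3),
})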
I'm studying Python and one of my goals is to write most of my code without packages. I would like to write a structure that works like pandas's DataFrame, but without using any other package. Is there any way to do that?
Using pandas, my code looks like this:
from pandas import DataFrame
...
s = DataFrame(s, index = ind)
where ind is the result of a function.
Maybe dictionary could be the answer?
Thanks
No native Python data structure has all the features of a pandas DataFrame, which is part of why pandas was written in the first place. Leveraging packages others have written brings the time and work of many other people into your code, advancing your own code's capabilities, much as Isaac Newton said his famous discoveries were only possible by standing on the shoulders of giants.
There's no easy summary for your answer except to point out that pandas is open-source, and their implementation of the dataframe can be found at https://github.com/pandas-dev/pandas.
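That said, if you only need a tiny subset of the behaviour (labelled columns and row lookup), a plain dictionary of lists can serve as a rough stand-in. A minimal sketch, purely hypothetical and nowhere near the real DataFrame API:

class TinyFrame:
    def __init__(self, columns, index):
        self.columns = columns            # dict of column name -> list of values
        self.index = list(index)          # row labels

    def row(self, label):
        # Return one labelled row as a plain dict.
        i = self.index.index(label)
        return {name: values[i] for name, values in self.columns.items()}

tf = TinyFrame({'price': [10.0, 10.5], 'volume': [100, 80]}, index=['2021-01-01', '2021-01-02'])
print(tf.row('2021-01-02'))   # {'price': 10.5, 'volume': 80}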
I need to split my dataset into chunks, which I currently do with the following simple code:
cases = []
for i in set(df['key']):
    cases.append(df[df['key'] == i].copy())
But my dataset is huge and this ends up taking a couple hours, so I was wondering if there is a way to maybe use multithreading to accelerate this? Or if there is any other method to make this go faster?
I'm fairly certain you want to group-by unique keys. Use the built-in functionality to do this.
cases = list(df.groupby('key'))
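For reference, a small runnable sketch of what that gives you (the 'key' column name comes from the question): each element is a (key, sub-DataFrame) pair, and a dict comprehension gives direct lookup by key with a single pass over the data:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# One pass over the data instead of one boolean scan per unique key.
cases = {key: group.copy() for key, group in df.groupby('key')}
print(cases['a'])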
I am working on a project that involves some larger-than-memory datasets, and have been evaluating different tools for working on a cluster instead of my local machine. One project that looked particularly interesting was dask, as it has a very similar API to pandas for its DataFrame class.
I would like to be taking aggregates of time-derivatives of timeseries-related data. This obviously necessitates ordering the time series data by timestamp so that you are taking meaningful differences between rows. However, dask DataFrames have no sort_values method.
When working with Spark DataFrames and using Window functions, there is out-of-the-box support for ordering within partitions. That is, you can do things like:
from pyspark.sql.window import Window
my_window = Window.partitionBy(df['id'], df['agg_time']).orderBy(df['timestamp'])
I can then use this window function to calculate differences etc.
I'm wondering if there is a way to achieve something similar in dask. I can, in principle, use Spark, but I'm in a bit of a time crunch, and my familiarity with its API is much less than with pandas.
You probably want to set your timeseries column as your index.
df = df.set_index('timestamp')
This allows for much smarter time-series algorithms, including rolling operations, random access, and so on. You may want to look at http://dask.pydata.org/en/latest/dataframe-api.html#rolling-operations.
Note that in general setting an index and performing a full sort can be expensive. Ideally your data comes in a form that is already sorted by time.
Example
So in your case, if you just want to compute a derivative you might do something like the following:
df = df.set_index('timestamp')
df.x.diff(...)
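For instance, a minimal runnable sketch of that idea (the 'timestamp' and 'x' names come from the snippets above; everything else, including the toy data, is made up):

import pandas as pd
import dask.dataframe as dd

# Small illustrative frame; in practice the data would come from dd.read_csv / read_parquet.
pdf = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1000, freq='1min'),
    'x': range(1000),
})
ddf = dd.from_pandas(pdf, npartitions=4)

ddf = ddf.set_index('timestamp')   # full sort by time; can be expensive on large data
diffs = ddf['x'].diff()            # row-to-row differences are now meaningful
print(diffs.compute().head())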
Still new to this, sorry if I ask something really stupid. What are the differences between a Python ordered dictionary and a pandas series?
The only difference I could think of is that an OrderedDict can have nested dictionaries within the data. Is that all? Is that even true?
Would there be a performance difference between using one vs the other?
My project is a sales forecast, most of the data will be something like: {Week 1 : 400 units, Week 2 : 550 units}... Perhaps an ordered dictionary would be redundant since input order is irrelevant compared to Week#?
Again I apologize if my question is stupid, I am just trying to be thorough as I learn.
Thank you!
-Stephen
Most importantly, pd.Series is part of the pandas library, so it comes with a lot of added functionality - see the attributes and methods as you scroll down the pd.Series docs. Compare this to OrderedDict: docs.
For your use case, using pd.Series or pd.DataFrame (which could be a way of using nested dictionaries as it has an index and multiple columns) seem quite appropriate. If you take a look at the pandas docs, you'll also find quite comprehensive time series functionality that should come in handy for a project around weekly sales forecasts.
Since pandas is built on numpy, the specialized scientific computing package, performance is quite good.
OrderedDict is implemented as part of the Python collections library. These collections are very fast containers for specific use cases. If you were looking for only dictionary-related functionality (like ordering, in this case), I would go for that. But you say you are going to do deeper analysis in an area that pandas is really made for (e.g. plotting, filling missing values), so I would recommend going with pandas.Series.
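To make the comparison concrete, a small sketch using the weekly-sales numbers from the question (nothing here is specific to forecasting, it just shows what the Series buys you):

from collections import OrderedDict
import pandas as pd

# The weekly-sales example from the question, in both forms.
sales_dict = OrderedDict([('Week 1', 400), ('Week 2', 550)])

sales = pd.Series(sales_dict)    # keys become the index, values the data
print(sales.mean())              # built-in analytics
print(sales.pct_change())        # e.g. week-over-week growth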