Python Pandas Frequency documentation

I have been trying to find proper documentation for the freq arguments used in pandas. For example, to resample a dataframe we can do something like
df.resample(rule='W', how='sum')
which will resample it weekly. I was wondering what the other options are and how I can define custom frequencies/rules.
EDIT: To clarify, I am asking what the other legal options for rule are.

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
And, almost immediately below that, the anchored offsets: W-SAT and others.
I'll admit, links to this particular piece of documentation are pretty scarce. More general frequencies can be represented by supplying a DateOffset instance. Even more general resamplings can be done via groupby.
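For illustration, here is a minimal sketch of a few of those options (the data and column name are made up; note that newer pandas versions call the aggregation after resample instead of passing how=):
import pandas as pd
from pandas.tseries.offsets import Week

# Made-up daily data with a single 'value' column
idx = pd.date_range('2017-01-01', periods=60, freq='D')
df = pd.DataFrame({'value': range(60)}, index=idx)

weekly = df.resample('W').sum()          # offset alias: weekly, anchored on Sunday
weekly_sat = df.resample('W-SAT').sum()  # anchored alias: weekly, anchored on Saturday
biweekly = df.resample(Week(2)).sum()    # an explicit DateOffset instance
monthly = df.groupby(df.index.to_period('M'))['value'].sum()  # fully custom grouping via groupby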

Related

What does ... mean in Python?

I am an elementary Python programmer and have been using this module called "Pybaseball" to analyze sabermetrics data. When using this module, I came across a problem when trying to retrieve information from the program. The program reads a CSV file from a baseball stats site and outputs it for ease of use, but the problem is that some of the information is not shown and is instead replaced with a "...". An example of this is shown:
from pybaseball import batting_stats_range
data = batting_stats_range('2017-05-01', '2017-05-08')
print(data.head())
I should be getting:
https://github.com/jldbc/pybaseball#batting-stats-hitting-stats-for-players-within-seasons-or-during-a-specified-time-period
But the information is cut off from 'TM' all the way to 'CS' and is replaced with a ... in my output. Can someone explain to me why this happens and how I can prevent it?
As the docs state, head() is meant for "quickly testing if your object has the right type of data in it." So it is expected that some data may not show, because the display is collapsed.
If you need to analyze the data in more detail you can access specific columns with other methods.
For example, using .iloc[]. You can read more about it here, but essentially you can "ask" for a slice of those columns and then apply a new slice to get only the first n rows.
Another example would be .loc[] (docs here). The main difference is that .loc[] uses labels (column names) to filter data instead of the numerical order of the columns. You can select a subset of specific columns and then take a sample of rows from that.
So, to answer your question: "..." is pandas' way of collapsing data in order to show a prettier view of the results.
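For example, a small sketch of both approaches (the column positions and labels below are guesses, since the actual pybaseball columns aren't shown here):
# Slice columns by position with .iloc: first 5 rows, columns 4 through 11
print(data.iloc[:5, 4:12])

# Or pick columns by label with .loc (these column names are assumptions)
print(data.loc[:, ['Tm', 'HR', 'RBI', 'SB', 'CS']].head())

# The "..." itself is only display truncation; widening the display shows every column
import pandas as pd
pd.set_option('display.max_columns', None)
print(data.head())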

Python: (findall, wide to long, pandas): Data manipulation

So, I am trying to pivot this data (link here) so that all the metrics/numbers are in one column, with another column serving as the ID column. Obviously, having a ton of data/metrics spread across a bunch of columns is much harder to compare and do calculations on than having it all in one column.
So, I know what tools I need for this: pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having a bit of difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data I get pretty confused. I am basing my idea on this example (about halfway down, where they use findall / regex to define new column names). I am looking to create a new column for each category of metrics (i.e. population estimate is one column and % change is another; they should not all be one column).
Can someone help me set up the syntax correctly for this part? I am not good with the regular expressions used for pattern matching.
Thank you.
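Without seeing the actual data it is hard to give the exact syntax, but a rough sketch of the wide_to_long pattern being described (the ID column, metric names, and suffixes are all assumptions) might look like this:
import pandas as pd

# Toy wide-format data; 'GeoName' and the metric column names are invented
df = pd.DataFrame({
    'GeoName': ['A', 'B'],
    'pop_est_2019': [100, 200],
    'pop_est_2020': [110, 210],
    'pct_change_2019': [1.0, 2.0],
    'pct_change_2020': [1.1, 2.1],
})

# wide_to_long splits columns of the form <stub><sep><suffix> into rows,
# producing one output column per metric category (stub)
long_df = pd.wide_to_long(
    df,
    stubnames=['pop_est', 'pct_change'],
    i='GeoName',   # the ID column
    j='year',      # name for the extracted suffix
    sep='_',
)
print(long_df.reset_index())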

How to perform time derivatives in Dask without sorting

I am working on a project that involves some larger-than-memory datasets, and have been evaluating different tools for working on a cluster instead of my local machine. One project that looked particularly interesting was dask, as it has a very similar API to pandas for its DataFrame class.
I would like to take aggregates of time derivatives of timeseries data. This obviously requires ordering the time series data by timestamp, so that you are taking meaningful differences between rows. However, dask DataFrames have no sort_values method.
When working with Spark DataFrames and Window functions, there is out-of-the-box support for ordering within partitions. That is, you can do things like:
from pyspark.sql.window import Window
my_window = Window.partitionBy(df['id'], df['agg_time']).orderBy(df['timestamp'])
I can then use this window function to calculate differences etc.
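For context, a rough sketch of how that window gets used (the value column 'x' is an assumption):
from pyspark.sql import functions as F

# Difference from the previous row, ordered by timestamp within each window partition
df = df.withColumn('dx', F.col('x') - F.lag('x').over(my_window))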
I'm wondering if there is a way to achieve something similar in dask. I can, in principle, use Spark, but I'm in a bit of a time crunch, and my familiarity with its API is much less than with pandas.
You probably want to set your timeseries column as your index.
df = df.set_index('timestamp')
This allows for much smarter time-series algorithms, including rolling operations, random access, and so on. You may want to look at http://dask.pydata.org/en/latest/dataframe-api.html#rolling-operations.
Note that in general setting an index and performing a full sort can be expensive. Ideally your data comes in a form that is already sorted by time.
Example
So in your case, if you just want to compute a derivative you might do something like the following:
df = df.set_index('timestamp')
df.x.diff(...)
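Putting that together, a minimal sketch (the file pattern and the column names 'timestamp' and 'x' are assumptions):
import dask.dataframe as dd

# Hypothetical larger-than-memory timeseries data with 'timestamp' and 'x' columns
df = dd.read_csv('data-*.csv', parse_dates=['timestamp'])

# Keep a copy of the timestamp as a column, then sort by setting it as the index
df = df.assign(ts=df['timestamp']).set_index('timestamp')

# Row-to-row differences give a discrete time derivative
dx = df.x.diff()
dt = df.ts.diff().dt.total_seconds()
rate = dx / dt

# Aggregate the derivative, e.g. a daily mean, and trigger the computation
daily_rate = rate.resample('1D').mean().compute()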

orderedDict vs pandas series

Still new to this, sorry if I ask something really stupid. What are the differences between a Python ordered dictionary and a pandas series?
The only difference I could think of is that an orderedDict can have nested dictionaries within the data. Is that all? Is that even true?
Would there be a performance difference between using one vs the other?
My project is a sales forecast, most of the data will be something like: {Week 1 : 400 units, Week 2 : 550 units}... Perhaps an ordered dictionary would be redundant since input order is irrelevant compared to Week#?
Again I apologize if my question is stupid, I am just trying to be thorough as I learn.
Thank you!
-Stephen
Most importantly, pd.Series is part of the pandas library, so it comes with a lot of added functionality - see the attributes and methods as you scroll down the pd.Series docs. This compares to OrderedDict: docs.
For your use case, using pd.Series or pd.DataFrame (which could be a way of using nested dictionaries as it has an index and multiple columns) seem quite appropriate. If you take a look at the pandas docs, you'll also find quite comprehensive time series functionality that should come in handy for a project around weekly sales forecasts.
Since pandas is built on numpy, the specialized scientific computing package, performance is quite good.
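For instance, a small sketch using the weekly-sales example from the question (the numbers are made up):
from collections import OrderedDict
import pandas as pd

sales = OrderedDict([('Week 1', 400), ('Week 2', 550), ('Week 3', 610)])

# The same data as a pandas Series, indexed by week
s = pd.Series(sales)

# A few of the extras you get for free with a Series
print(s.mean())             # 520.0
print(s.pct_change())       # week-over-week growth
print(s.rolling(2).mean())  # 2-week moving average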
OrderedDict is implemented as part of the Python collections library. These collections are very fast containers for specific use cases. If you were looking for only dictionary-related functionality (like ordering, in this case), I would go for that. But you say you are going to do deeper analysis in an area that pandas is really made for (e.g. plotting, filling missing values), so I would recommend going with pandas.Series.

Error using bootstrap_plot in pandas if values have NaN

I'm trying to do a bootstrap analysis using pandas' bootstrap_plot. My dataset has NaNs. I get the following error message:
AttributeError: max must be larger than min in range parameter.
If I fill the data with fillna(0), it works, but then I'm changing my data set. Is there a reason why bootstrap_plot (and autocorrelation_plot, for that matter) don't do the Right Thing with the NaNs?
It's a little clunky, but maybe this:
bootstrap_plot(df[df['x'].notnull()]['x'])
Re your question about bootstrap_plot doing the Right Thing: this is an area where pandas is still improving, but there's often going to be a bit of manual labor involved, and it's generally not hard to handle it with fillna or notnull. And honestly, it's often a feature to be forced to do this rather than have missing values handled automatically in a way you might not have liked or even been aware of.
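A self-contained sketch of that workaround (the column name 'x' and the data are made up; in recent pandas versions bootstrap_plot lives in pandas.plotting):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import bootstrap_plot

df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0]})

# Filter out the NaNs before plotting; df['x'].dropna() is equivalent to the notnull() indexing above
bootstrap_plot(df['x'].dropna(), size=3, samples=100)
plt.show()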
