Okay, so I am doing research on how to do time-series prediction. As always, it's preprocessing the data that's the difficult part. I understand that I have to convert the "time-stamp" in a data file into a "datetime" or "timestep", and I did that:
df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])
(I may have parsed the datetime incorrectly. I have seen people use pd.read_csv() itself to parse the dates instead. If that's better, please advise on how to do it properly.)
I also understand the part where I scale my data. (Could someone explain how the scaling works? I know it maps all my data into the range I give it, but would the output of my prediction also be scaled or something?)
Lastly, once I have scaled and parsed the data and timestamps, how would I actually predict with the trained model? I don't know what to enter into (for example) model.predict().
I did some research and it seems like I have to shift my dataset or something, but I don't really understand what the documentation is saying, and the example isn't directly related to time-series prediction.
I know this is a lot and you might not be able to answer all the questions. I am fairly new to this. Just help with whatever you can. Thank you!
So, because you're working with airpassengers.csv and asking about predictive modeling, I'm going to assume you're working through this GitHub repo.
There are a couple of things I want to make sure you know before I dive into the answers to your questions.
There are lots of different types of predictive models used in forecasting; you can read all about them here. You're asking a lot of broad questions, but I'll break the main ones down into two steps and describe what's happening using the example that I believe you're trying to replicate.
Let's break it down
Loading and parsing the data
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

# header=0: the first row holds the column headers
# parse_dates=[0]: parse the first column as dates
# names=[...]: assign our own column names
# index_col=0: use the first column (Month) as the index
air_passengers = pd.read_csv("./data/AirPassengers.csv", header=0, parse_dates=[0], names=['Month', 'Passengers'], index_col=0)
This section of code loads the data from a .csv (comma-separated values) file and saves it into the DataFrame air_passengers. In the call that reads the CSV we also state that there is a header in the first row, that the first column is full of dates, what our column names should be, and that the data frame should be indexed on the first column.
Scaling the data
log_air_passengers = np.log(air_passengers.Passengers)
This is done to make the math better behaved. Logs are the inverse of exponentials: log2(x) undoes 2^x. NumPy's log function gives us the natural log, i.e. log base e. A nice side effect is that differences of these logged values are so close to percent changes that you can use them as such.
Now that the data has been scaled, we can prep it for statistical modeling
# Subtract each value's predecessor (shift() moves the series down one step)
log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
# The first entry has no predecessor, so drop the resulting NaN
log_air_passengers_diff.dropna(inplace=True)
This changes the series to be the difference between each data point and the one before it (a first difference) instead of the log values themselves.
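To answer your scaling question: yes, a model trained on these values predicts scaled targets, here log differences, so you have to undo both transformations to recover passenger counts. A minimal sketch of the inversion, where predicted_diffs is a hypothetical pandas Series of predicted log differences continuing on from the last observed point:
# predicted_diffs is hypothetical: the model's predicted log differences
# Undo the differencing: accumulate the diffs on top of the last known log value
log_forecast = log_air_passengers.iloc[-1] + predicted_diffs.cumsum()
# Undo the log transform to get back to passenger counts
forecast = np.exp(log_forecast)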
The last part of your question involves too many steps to cover here, and it is not as simple as calling a single function. I encourage you to learn more from here.
Situation
I have two datasets:
1. Raster data loaded using rioxarray, an xarray.DataArray
2. A geopandas.GeoDataFrame with geometries indicating areas in dataset 1
The geo data in both datasets are in the same CRS (EPSG:4326).
Problem
For each entry in dataset 2, I want to aggregate all values from dataset 1 which overlap with that entry's geometry. Kind of like a .groupby() using the geometries, followed by a .sum().
Current WIP approach
The package xagg already does this, but it is unfortunately slow on a subset of my dataset and scales even worse when I try to use it on the full dataset.
Question
Is there a simple and efficient way to do this in Python?
(The solution doesn't need to replicate the results from xagg exactly.)
WRT your comment, here is some pseudocode I've used to do what you're after. The function being executed outputs files in this case. If it's not obvious, this strategy isn't going to help if you just have one big raster and one big polygon file: the method assumes tiled data and uses an indexing system to match the right rasters with the overlaying polygons. The full example is beyond the scope of a single answer, but if you ask about specific issues I can try to assist. In addition to dask's good documentation, there are lots of other posts on here with dask.delayed examples.
results_list = []
for f in raster_file_list:
    # Build one lazy task per raster tile; nothing executes yet
    temp_list = dask.delayed(your_custom_function)(f, your_custom_function_arg_1, your_custom_function_arg_2)
    results_list.append(temp_list)

# Run all tasks in parallel across worker processes
results = dask.compute(*results_list, scheduler='processes')
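For illustration only, a hypothetical your_custom_function could clip one raster tile to each overlapping polygon and sum the values; every name and the file layout here are assumptions, not part of the original pipeline:
import geopandas as gpd
import rioxarray

def your_custom_function(raster_path, polygons_path, output_path):
    # Hypothetical sketch: sum raster values inside each polygon of one tile.
    # Assumes every polygon in the file actually overlaps this tile.
    raster = rioxarray.open_rasterio(raster_path)
    polys = gpd.read_file(polygons_path)
    sums = [
        float(raster.rio.clip([geom], polys.crs, drop=True).sum())
        for geom in polys.geometry
    ]
    # Write the per-geometry sums alongside the geometries
    polys.assign(raster_sum=sums).to_file(output_path)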
I am an elementary Python programmer and have been using a module called "Pybaseball" to analyze sabermetrics data. While using this module, I came across a problem when trying to retrieve information from it. The module reads a CSV file from a baseball stats site and outputs it for ease of use, but the problem is that some of the information is not shown and is instead replaced with a "...". An example of this is shown:
from pybaseball import batting_stats_range
data = batting_stats_range('2017-05-01', '2017-05-08')
print(data.head())
I should be getting:
https://github.com/jldbc/pybaseball#batting-stats-hitting-stats-for-players-within-seasons-or-during-a-specified-time-period
But the information is cut off from 'TM' all the way to 'CS' and is replaced with a ... in my output. Can someone explain to me why this happens and how I can prevent it?
As the docs state, head() is meant for "quickly testing if your object has the right type of data in it". So it is expected that some data may not show, because the display is collapsed.
If you need to analyze the data in more detail, you can access specific columns with other tools.
For example, using .iloc[]. You can read more about it here, but essentially you can "ask" for a slice of those columns and then apply another slice to get only the rows you want.
Another example would be .loc[], docs here. The main difference is that .loc[] uses labels (column names) to filter data instead of the numerical positions of columns. You can filter down to a subset of specific columns and then take a sample of rows from that.
So, to answer your question: "..." is pandas' way of collapsing data in order to present a prettier view of the results.
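If the goal is simply to see every column, pandas' display options can also be relaxed; a minimal sketch using the standard display.max_columns option:
import pandas as pd
from pybaseball import batting_stats_range

# Show all columns instead of collapsing the middle ones into "..."
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

data = batting_stats_range('2017-05-01', '2017-05-08')
print(data.head())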
I'm learning how to plot things (CSV files) in Python, using import matplotlib.pyplot as plt.
Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
I can plot the one above with plt.plotfile('test0.csv', (0, 1), delimiter=';'), which gives me the figure below.
Do you see the axis labels, column1 and column2? They are in lower case in the figure, but in the data file they begin with upper case.
Also, I tried plt.plotfile('test0.csv', ('Column1', 'Column2'), delimiter=';'), which does not work.
So it seems matplotlib.pyplot works only with lowercase names :(
Combining this issue with this other one, I guess it's time to try something else.
As I am pretty new to plotting in Python, I would like to ask: Where should I go from here, to get a little more than what matplotlib.pyplot provides?
Should I go to pandas?
You are mixing up two things here.
Matplotlib is designed for plotting data. It is not designed for managing data.
Pandas is designed for data analysis. Even if you were using pandas, you would still need to plot the data. How? Well, probably using matplotlib!
Independently of what you're doing, think of it as a three-step process:
Data acquisition, data read-in
Data processing
Data representation / plotting
plt.plotfile() is a convenience function, which you can use if you don't need step 2. at all. But it surely has its limitations.
Methods to read in data (not a complete list, of course) include pure Python open(), Python's csv.reader or similar, numpy / scipy, pandas, etc.
Depending on what you want to do with your data, you can already choose a suitable input method: numpy for large numerical data sets, pandas for data sets which include qualitative data or rely heavily on cross correlations, etc.
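As a concrete sketch of that three-step process for the file above (assuming it is saved as test0.csv), reading with pandas keeps the column names exactly as written, including their case:
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: read-in; sep=';' matches the file's delimiter
df = pd.read_csv('test0.csv', sep=';')

# Step 2: no processing needed for this small example

# Step 3: plot, labeling the axes with the original upper-case names
plt.plot(df['Column1'], df['Column2'])
plt.xlabel('Column1')
plt.ylabel('Column2')
plt.show()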
First, thanks for reading this, and thanks a lot if you can give any clue to help me solve it.
As I'm new to Scikit-learn, don't hesitate to provide any advice that can help me to improve the process and make it more professional.
My goal is to classify data between two categories. I would like to find a solution that would give me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line:
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for training, and I run my tests over 100K lines.
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so the data can be used with several classifiers.
I tried several things:
LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
OneHotEncoder: if I understand it correctly, it is almost perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always runs into a MemoryError. Moreover, its input must be numerical, so it is compulsory to LabelEncode everything first.
StandardScaler: this is quite useful, but not for what I need here. I decided to use it to scale my continuous values.
FeatureHasher: at first I didn't understand what it does; then I saw that it is mainly used for text analysis. I tried to use it for my problem anyway, cheating by creating a new array containing the result of the transformation. I don't think it was built to work that way, and it was not even logical.
DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory.
partial_fit: this method is provided by only 5 classifiers. I would like to be able to use it with Perceptron, KNearest and RandomForest at least, so it doesn't match my needs.
I looked on the documentation and found these information on the page Preprocessing and Feature Extraction.
I would like a way to encode all the nominal values so that they are not treated as ordered, and that can be applied to large datasets with a lot of categories on weak resources.
Is there any way I didn't explore that can fit my needs?
Thanks for any clue and piece of advice.
To convert unordered categorical features you can try get_dummies in pandas; see its documentation for more details. Another way is to use CatBoost, which can handle categorical features directly, without transforming them into a numerical type.
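A minimal sketch of the get_dummies approach, with hypothetical column names standing in for your real ones; sparse=True is worth trying against the MemoryError you hit with OneHotEncoder:
import pandas as pd

# Hypothetical columns standing in for the real nominal features
df = pd.DataFrame({
    'brand': ['RENAULT', 'PEUGEOT', 'RENAULT'],
    'fuel': ['Diesel', 'Essence', 'Diesel'],
    'price': [668.77, 512.30, 247.97],
})

# One 0/1 column per category, so nothing is treated as ordered;
# sparse=True stores the dummies as sparse columns to save memory
encoded = pd.get_dummies(df, columns=['brand', 'fuel'], sparse=True)
print(encoded.head())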
Note from maintainers: this question is about the obsolete bokeh.charts API removed several years ago. For an example of timeseries charts in modern Bokeh, see here:
https://docs.bokeh.org/en/latest/docs/gallery/range_tool.html
I'm trying to create a timeseries graph with bokeh. This is my first time using bokeh, and my first time dealing with pandas as well. Our customers receive reviews on their products. I'm trying to create a graph which shows how their average review rating has changed over time.
Our database contains the dates of each review. We also have the average review value for that date. I need to plot a line with the x axis being dates and the y axis being the review value range (1 through 10).
When I accepted this project I thought it would be easy. How wrong I was. I found a timeseries example that looks good. Unfortunately, the example completely glosses over the most difficult part of creating a solution: it does not show how to create an appropriate data structure from your source data. The example retrieves pre-built data structures from the Yahoo API. I've tried examining these structures, but they don't look exactly straightforward to me.
I found a page explaining pandas data structures. It is a little difficult for me to understand. Particularly confusing to me is how to represent points in the graph without necessarily labeling them. For example, the y axis should display whole numbers, but data points need not land exactly on those whole-number values. The page I found is linked below:
http://pandas.pydata.org/pandas-docs/stable/dsintro.html
Does anyone know of a working example for the timeseries chart type which exemplifies how to build the necessary data structure?
UPDATE:
Thanks to the answer below, I toyed around with just passing lists into line(). It didn't occur to me that I could do this, but it works very well. For example:
date = ["1/11/2011", "1/12/2011", "1/13/2011", "4/5/2014"]
rating = [4, 4, 5, 2]
line(
    date,             # x coordinates
    rating,           # y coordinates
    color='#A6CEE3',  # set a color for the line
    x_axis_type="datetime",  # NOTE: only needed on first call
    tools="pan,wheel_zoom,box_zoom,reset,previewsave"  # NOTE: only needed on first call
)
You don't have to use pandas; you simply need to supply a sequence of x-values and a sequence of y-values. These can be plain Python lists of numbers, NumPy arrays, or pandas Series. Here is another time series example that uses just NumPy arrays:
http://docs.bokeh.org/en/latest/docs/gallery/color_scatter.html
EDIT: link updated
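Since the bokeh.charts line API shown in the update was removed (see the maintainers' note at the top), a sketch of the same plot in modern Bokeh might look like this, reusing the example values from the update:
from datetime import datetime
from bokeh.plotting import figure, show

dates = [datetime(2011, 1, 11), datetime(2011, 1, 12),
         datetime(2011, 1, 13), datetime(2014, 4, 5)]
ratings = [4, 4, 5, 2]

# x_axis_type="datetime" makes the x axis render as dates
p = figure(x_axis_type="datetime", tools="pan,wheel_zoom,box_zoom,reset")
p.line(dates, ratings, color='#A6CEE3')
show(p)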