Create bokeh timeseries graph using database info - python

Note from maintainers: this question is about the obsolete bokeh.charts API removed several years ago. For an example of timeseries charts in modern Bokeh, see here:
https://docs.bokeh.org/en/latest/docs/gallery/range_tool.html
I'm trying to create a timeseries graph with bokeh. This is my first time using bokeh, and my first time dealing with pandas as well. Our customers receive reviews on their products. I'm trying to create a graph which shows how their average review rating has changed over time.
Our database contains the dates of each review. We also have the average review value for that date. I need to plot a line with the x axis being dates and the y axis being the review value range (1 through 10).
When I accepted this project I thought it would be easy. How wrong I was. I found a timeseries example that looks good. Unfortunately, the example completely glosses over the most difficult part of creating a solution: it does not show how to create an appropriate data structure from your source data. The example retrieves pre-built data structures from the Yahoo API. I've tried examining these structures, but they don't exactly look straightforward to me.
I found a page explaining pandas structs. It is a little difficult for me to understand. Particularly confusing to me is how to represent points in the graph without necessarily labeling those points. For example the y axis should display whole numbers, but data points need not intersect with the whole number value. The page I found is linked below:
http://pandas.pydata.org/pandas-docs/stable/dsintro.html
Does anyone know of a working example for the timeseries chart type which exemplifies how to build the necessary data structure?
UPDATE:
Thanks to the answer below I toyed around with just passing lists into lines. It didn't occur to me that I could do this, but it works very well. For example:
date = ["1/11/2011", "1/12/2011", "1/13/2011", "4/5/2014"]
rating = [4, 4, 5, 2]
line(
    date,             # x coordinates
    rating,           # y coordinates
    color='#A6CEE3',  # set a color for the line
    x_axis_type="datetime",  # NOTE: only needed on the first call
    tools="pan,wheel_zoom,box_zoom,reset,previewsave"  # NOTE: only needed on the first call
)

You don't have to use Pandas, you simply need to supply a sequence of x-values and a sequence of y-values. These can be plain Python lists of numbers, or NumPy arrays, or Pandas Series. Here is another time series example that uses just NumPy arrays:
http://docs.bokeh.org/en/latest/docs/gallery/color_scatter.html
EDIT: link updated
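To make the "plain Python lists" point concrete: the date strings from the update above can first be parsed into real datetime objects, so a datetime axis spaces the points correctly. A minimal stdlib sketch (the values are the made-up ones from the question):

```python
from datetime import datetime

# Made-up review data: date strings and average ratings
dates_raw = ["1/11/2011", "1/12/2011", "1/13/2011", "4/5/2014"]
ratings = [4, 4, 5, 2]

# Parse month/day/year strings into datetime objects
dates = [datetime.strptime(d, "%m/%d/%Y") for d in dates_raw]

# Any sequence type then works as x/y input to the plotting call:
# plain lists, NumPy arrays, or pandas Series
print(dates[0])  # 2011-01-11 00:00:00
```

Passing the parsed `dates` list instead of the raw strings is what makes the datetime axis place the 2014 point far to the right of the three January 2011 points.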

Aggregate raster data based on geopandas DataFrame geometries

Situation
I have two datasets:
Raster data loaded using rioxarray, an xarray.DataArray
A geopandas.GeoDataFrame with geometries indicating areas in the first dataset
The geo data in both datasets are in the same CRS (EPSG:4326).
Problem
For each entry in the second dataset I want to aggregate all values from the first dataset which overlap with that entry's geometry. Roughly a group-by over the geometries followed by .sum().
Current WIP approach
The package xagg already does this, but it is unfortunately slow on a subset of my dataset and scales worse when I try to use it on the full dataset.
Question
Is there a simple and efficient way to do this in Python?
(The solution doesn't need to replicate the results from xagg exactly.)
WRT your comment, here is some pseudo code I've used to do what you're after. The function being executed outputs files in this case. If it's not obvious, this strategy isn't going to help if you just have one big raster and one big poly file. This method assumes tiled data and uses an indexing system to match the right rasters with the overlapping polys. The full example is beyond the scope of a single answer, but if you run into issues and ask specifics I can try to assist. In addition to dask's good documentation, there are lots of other posts on here with dask delayed examples.
results_list = []
for f in raster_file_list:
    temp = dask.delayed(your_custom_function)(f, your_custom_function_arg_1, your_custom_function_arg_2)
    results_list.append(temp)
results = dask.compute(*results_list, scheduler='processes')
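For the aggregation step itself, once each geometry has been burned into a label array (e.g. with rasterio's rasterize, not shown here), the per-geometry sum is a cheap vectorized group-by. A minimal numpy sketch with made-up values and labels:

```python
import numpy as np

# Made-up 4x4 raster of values
values = np.array([[1., 2., 3., 4.],
                   [5., 6., 7., 8.],
                   [1., 1., 1., 1.],
                   [2., 2., 2., 2.]])

# Label array: which geometry each pixel belongs to (0 = outside all geometries)
labels = np.array([[1, 1, 2, 2],
                   [1, 1, 2, 2],
                   [3, 3, 0, 0],
                   [3, 3, 0, 0]])

# Group-by-label sum in one vectorized call
sums = np.bincount(labels.ravel(), weights=values.ravel())
# sums[k] is the total of values where labels == k
print(sums)  # [ 6. 14. 22.  6.]
```

This is only the aggregation core under the assumption that rasterizing the geometries is acceptable; unlike xagg it does not weight pixels that partially overlap a geometry.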

Preprocessing data for Time-Series prediction

Okay, so I am doing research on how to do time-series prediction. As always, preprocessing the data is the difficult part. I understand I have to convert the "time-stamp" in a data file into a "datetime" or "timestep", and I did that:
df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])
(I may have parsed the datetime incorrectly; I've seen people use pd.read_csv() with parse_dates instead. If that's better, please advise on how to do it properly.)
I also understand the part where I scale my data. (Could someone explain to me how the scaling works? I know it maps all my data into the range I give it, but would the output of my prediction also come out scaled?)
Lastly, once I have scaled and parsed the data and timestamps, how would I actually predict with the trained model? I don't know what to enter into (for example) model.predict().
From my research it seems I have to shift my dataset somehow, but I don't really understand what the documentation is saying, and the example isn't directly related to time-series prediction.
I know this is a lot, and you might not be able to answer all the questions. I am fairly new to this. Just help with whatever you can. Thank you!
So, because you're working with airpassengers.csv and asking about predictive modeling I'm going to assume you're working through this github
There's a couple of things I want to make sure you know before I dive into the answer to your questions.
There are lots of different types of predictive models used in forecasting. You can find out all about them here.
You're asking a lot of broad questions, so I'll break the main ones down into two steps and describe what's happening using the example that I believe you're trying to replicate.
Let's break it down
Loading and parsing the data
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
air_passengers = pd.read_csv("./data/AirPassengers.csv", header = 0, parse_dates = [0], names = ['Month', 'Passengers'], index_col = 0)
This section of code loads the data from a .csv (comma-separated values) file into the data frame air_passengers. Inside the call that reads the csv we also state that there is a header in the first row, that the first column contains dates, what the columns should be named, and that the data frame should be indexed by the first column.
Scaling the data
log_air_passengers = np.log(air_passengers.Passengers)
This is done to make the math better behaved. Logarithms are the inverse of exponentiation (if y = e^x, then x = ln y). NumPy's log function gives the natural log, i.e. log base e. A nice side effect is that differences of log values are so close to percent changes that you can read your predicted values as such.
Now that the data has been scaled, we can prep it for statistical modeling
log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
log_air_passengers_diff.dropna(inplace=True)
This changes the data frame to hold the difference between each data point and the previous one, instead of the log values themselves.
The last part of your question involves too many steps to cover here, and it is not as simple as calling a single function. I encourage you to learn more from here
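On the "would my output come out scaled?" worry: yes, a model trained on log differences predicts log differences, but both transforms are invertible, so predictions can be mapped back to the original passenger scale. A small numpy sketch with made-up values standing in for the series:

```python
import numpy as np

# Made-up passenger counts standing in for the AirPassengers series
passengers = np.array([112., 118., 132., 129.])

# Forward transform: log, then first differences (what the model trains on)
log_series = np.log(passengers)
diffs = np.diff(log_series)

# Inverse transform: cumulative-sum the diffs onto the first log value, then exp
reconstructed = np.exp(np.concatenate([[log_series[0]],
                                       log_series[0] + np.cumsum(diffs)]))
print(np.round(reconstructed))  # [112. 118. 132. 129.]
```

The same inverse (cumsum, then exp) applied to predicted differences yields predictions in original units.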

How to automatically reduce the range of a chart?

First of all, sorry for my bad English as it is not my first language.
I have recently started learning Python and I am trying to develop a "simple" program, but I have run into a problem.
I am using xlwings to modify and interact with Excel. What I want to achieve (or to know if it's possible) is:
I have Excel look into data and plot a graph. However, this graph sometimes has, for example, 20 values for the X-axis and in other cases only 10, thus leaving 10 empty #NA spaces. Based on this, I want to adjust the graph to show only 10 values by changing the range that shapes the graph.
The function get_prod_hours() looks how many values I want on the X-Axis:
def get_prod_hours():
    """From the input gets the production hours to adapt the graphs"""
    dt = wb.sheets['Calculatrice']
    return dt.range('E24').value
Based on the value returned by the function, I must modify the range of values on the graph (by reducing it).
Solutions such as creating the graphs from scratch are not OK: I would like to only modify the range of the graph, because the Excel file is a "standard" at my company.
I hope for something like:
Column A in Excel has values 1, 2, 3, 4, 5 and get_prod_hours() returns 5, so my graph should have only 5 points and not, for example, 6 of which one is #NA.
Thank you very much, and sorry for the wall of text.
The xlwings API doesn't offer a lot of options for charts (see https://docs.xlwings.org/en/stable/api.html?highlight=charts#xlwings.main.Charts).
Try to find the chart in wb.sheets[0].charts.
The chart's source range can then be modified with something like:
chart = wb.sheets[0].charts[0]
chart.set_source_data(wb.sheets[0].range((1, 1), (int(get_prod_hours()), 1)))
But from looking at the API and knowing how many options Excel charts have, the API feels too thin.
If this doesn't work, an option is to add a VBA macro which modifies the chart and call that. See How do I call an Excel macro from Python using xlwings?

plotting pandas core series.series/values not showing

I am trying to plot the availability of my network per hour. I have a massive dataframe containing multiple variables, including the availability and the hour. I can get everything I want to plot when I do the following:
mond_data= mond_data.groupby('Hour')['Availability'].mean()
The only problem is, if I call .plot() on the result of the code above, the x-axis does not show the 'Hour' values. How can I plot this so the x-axis shows 'Hour'? There should be 24 values, as the code above takes an average for the whole day from midnight to 11pm.
Here is how I solved it.
grouped = mond_data.groupby('Hour')['Availability'].mean()
plt.plot(grouped.index, grouped.values)
For some reason the index was not plotted unless passed explicitly. I have not tested many cases, so additional explanation of this problem is still welcome.
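The underlying point is that groupby(...).mean() returns a Series whose index is 'Hour', so plotting the Series itself (or its index against its values) puts Hour on the x-axis. A small sketch with made-up availability data:

```python
import pandas as pd

# Made-up availability readings for a few hours of the day
df = pd.DataFrame({
    'Hour': [0, 0, 1, 1, 2],
    'Availability': [1.0, 0.5, 0.75, 0.25, 0.5],
})

# The grouped result is a Series indexed by Hour
grouped = df.groupby('Hour')['Availability'].mean()
print(grouped.index.tolist())  # [0, 1, 2]
print(grouped.tolist())        # [0.75, 0.5, 0.5]

# grouped.plot() -- or plt.plot(grouped.index, grouped.values) --
# would now put the Hour values on the x-axis
```

A pitfall in the original snippet is mixing the raw frame's index with the grouped values: they have different lengths, so the two sequences passed to plt.plot must both come from the grouped Series.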

Python Pandas Regression

I am struggling to figure out if regression is the route I need to go in order to solve my current challenge with Python. Here is my scenario:
I have a Pandas Dataframe that is 195 rows x 25 columns
All data (except for index and headers) are integers
I have one specific column (Column B) that I would like compared to all other columns
I am attempting to determine if there is a range of numbers in any of the columns that influences or impacts column B
An example of the results I would like to calculate in Python is something similar to: Column B is above 3.5 when data in Column D is between 10.20 - 16.4
The examples I've been reading online about regression in Python appear to produce charts and statistics that I don't need (or maybe I am interpreting them incorrectly). I believe the proper wording for what I am asking is: identify specific values, or a range of values, in one column that relate linearly to another column in a Pandas dataframe.
Can anyone help point me in the right direction?
Thank you all in advance!
Your goals sound very much like exploratory data analysis at this point. You should probably first calculate the correlation between your target column B and each other column using pandas.Series.corr (which really is the same as bivariate regression). You could collect these like so:
other_cols = [col for col in df.columns if col != 'B']
corr_B = {other: df['B'].corr(df[other]) for other in other_cols}
To get a handle on specific ranges, I would recommend looking at:
the cut and qcut functionality to bin your data as you like and either plot or correlate subsets accordingly: see docs here and here.
To visualize bivariate and simple multivariate relationships, I would recommend
the seaborn package because it includes various types of plots designed to help you get a quick grasp of covariation among variables. See for instance the examples for univariate and bivariate distributions here, linear relationship plots here, and categorical data plots here.
The above should help you understand bivariate relationships. Once you want to progress to multivariate relationships, you could return to the scikit-learn or statsmodels packages best suited for this in python IMHO. Hope this helps to get you started.
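To make the cut suggestion concrete, here is a sketch that bins a hypothetical column D into ranges and looks at the mean of B per bin (column names follow the question; the data is random, so the exact bin edges are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up data shaped like the question: B is the target, D a candidate driver
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'B': rng.uniform(1, 6, size=100),
    'D': rng.uniform(0, 30, size=100),
})

# Bin D into 3 equal-width ranges and aggregate B within each bin
df['D_bin'] = pd.cut(df['D'], bins=3)
summary = df.groupby('D_bin', observed=True)['B'].mean()
print(summary)  # mean of B for each range of D
```

A large mean of B in one D bin relative to the others is exactly the kind of "B is above 3.5 when D is between 10.20 and 16.4" statement you are after; pd.qcut works the same way but makes equal-count bins instead of equal-width ones.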
