I am trying to apply a best fit line to a time series showing NDVI over time, but I keep running into errors. My x values, in this case, are dates stored as strings that are not evenly spaced, and y is the NDVI value for each date.
When I use the poly1d function in numpy I get the following error:
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U32') dtype('<U32') dtype('<U32')
I have attached a sample of the data set I am working with
# plot data and models
plt.subplots(figsize=(20, 10))
plt.xticks(rotation=90)
plt.plot(x,y,'-', color= 'blue')
plt.title('WSC-10-50')
plt.ylabel('NDVI')
plt.xlabel('Date')
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(y)))
plt.legend(loc='upper right')
Any help fixing my code or a better way I can get the best fit line for my data?
When I apply a best fit line to time series data, I replace the dates with an evenly spaced numeric axis to simplify the regression. So I use np.linspace() to create as many evenly spaced points as there are dates.
Code:
from io import StringIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = StringIO("""
date value
24-Jan-16 0.786
25-Feb-16 0.781
29-Apr-16 0.786
15-May-16 0.761
16-Jun-16 0.762
04-Sep-16 0.783
22-Oct-16 0.797
""")
df = pd.read_csv(data, sep=r"\s+")
# To read from csv use:
# df = pd.read_csv("/path/to/file.csv")
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%y")
y_values = df.loc[:, "value"]
x_values = np.linspace(0,1,len(df.loc[:, "value"]))
poly_degree = 3
coeffs = np.polyfit(x_values, y_values, poly_degree)
poly_eqn = np.poly1d(coeffs)
y_hat = poly_eqn(x_values)
plt.figure(figsize=(12,8))
plt.plot(df.loc[:, "date"], df.loc[:,"value"], "ro")
plt.plot(df.loc[:, "date"],y_hat)
plt.title('WSC-10-50')
plt.ylabel('NDVI')
plt.xlabel('Date')
plt.savefig("NDVI_plot.png")
Output:
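The TypeError in the question comes from passing date strings to np.polyfit, which needs numeric x values. Since the dates are not evenly spaced, one alternative to the linspace axis is to regress on the number of days elapsed since the first observation. The following is only a sketch under that assumption; the names days, slope, intercept and trend are illustrative and not part of the original code:

```python
import numpy as np
import pandas as pd

# Sample data from the answer above
dates = pd.to_datetime(
    ["24-Jan-16", "25-Feb-16", "29-Apr-16", "15-May-16",
     "16-Jun-16", "04-Sep-16", "22-Oct-16"],
    format="%d-%b-%y",
)
ndvi = np.array([0.786, 0.781, 0.786, 0.761, 0.762, 0.783, 0.797])

# Numeric x axis: days elapsed since the first date (keeps the uneven spacing)
days = (dates - dates[0]).days.to_numpy(dtype=float)

# Degree-1 least-squares fit, evaluated at the observed dates
slope, intercept = np.polyfit(days, ndvi, 1)
trend = slope * days + intercept
```

The fitted values can then be drawn against the real dates, for example with plt.plot(dates, trend).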
I have 2 tables, a 10 by 110 and a 35 by 110, and both contain random numbers from an exponential distribution function provided by my professor. The assignment is to prove the central limit theorem in statistics.
What I thought to try is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
"importing data"
df1 = pd.read_excel(r'C:\Users\Henry\Desktop\n10.xlsx')
df2 = pd.read_excel(r'C:\Users\Henry\Desktop\n30.xlsx')
df1avg = pd.read_excel(r'C:\Users\Henry\Desktop\n10avg.xlsx')
df2avg = pd.read_excel(r'C:\Users\Henry\Desktop\n30avg.xlsx')
"plotting n10 histogram"
plt.hist(df1, bins=34)
plt.hist(df1avg, bins=11)
"plotting n30 histogram"
plt.hist(df2, bins=63)
plt.hist(df2avg, bins=11)
Is that OK, or do I need to format the tables into a single column, and if so, what is the most efficient way to do that?
I suspect that you will want to flatten your dataframe first, as illustrated below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = np.random.exponential(1, [40, 5])
df = pd.DataFrame(N) # convert to dataframe
bin_edges = np.linspace(0,6,30)
plt.figure()
plt.hist(df, bins = bin_edges, density = True)
plt.xlabel('Value')
plt.ylabel('Probability density')
The multiple (5) colours of lines per bin show the histograms for each column of the data frame.
Fortunately, this is not hard to adjust. You can convert the data frame to a numpy array and flatten it:
plt.hist(df.to_numpy().flatten(), bins = bin_edges, density = True)
plt.ylabel('Probability density')
plt.xlabel('Value')
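For the central limit theorem part of the question, it may help to compare the flattened values with the per-sample means. This is a minimal sketch with simulated data, assuming each row of the table is one sample of size 10 (the 110 x 10 layout, the seed and the variable names are assumptions, not the asker's files):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated stand-in for the n10 table: 110 samples (rows) of size 10
df = pd.DataFrame(rng.exponential(1.0, size=(110, 10)))

# All raw values as one 1-D array (what the flattened histogram uses)
all_values = df.to_numpy().ravel()

# One mean per sample; the CLT is about the distribution of these
sample_means = df.mean(axis=1).to_numpy()
```

A histogram of all_values should look exponential, while a histogram of sample_means should be narrower and closer to a bell shape.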
I am trying to plot the accuracy evolution of NN models over time. So, I have an Excel file with data like the following:
and I wrote the following code to extract data and plot the scatter:
import pandas as pd
data = pd.read_excel (r'SOTA DNN.xlsx')
acc1 = pd.DataFrame(data, columns= ['Top-1-Acc'])
para = pd.DataFrame(data, columns= ['Parameters'])
dates = pd.to_datetime(data['Date'], format='%Y-%m-%d')
import matplotlib.pyplot as plt
plt.grid(True)
plt.ylim(40, 100)
plt.scatter(dates, acc1)
plt.show()
Is there a way to draw a line in the same figure to show only the ones achieving the maximum and print their names at the same time as in this example:
is it also possible to stretch the x-axis (for the dates)?
It is still not clear what you mean by "stretch the x-axis" and you did not provide your dataset, but here is a possible general approach:
import matplotlib.pyplot as plt
import pandas as pd
#fake data generation, this has to be substituted by your .xls import routine
import string
import numpy as np
np.random.seed(1234)
n = 30
acc = np.concatenate([np.random.randint(0, 10, 10), np.random.randint(0, 30, 10), np.random.randint(0, 100, n-20)])
date_range = pd.date_range("20190101", periods=n)
#random 5-letter model names (stand-in for the private pandas._testing.rands_array helper)
model = ["".join(np.random.choice(list(string.ascii_letters), 5)) for _ in range(n)]
df = pd.DataFrame({"Model": model, "Date": date_range, "TopAcc": acc})
df = df.sample(frac=1).reset_index(drop=True)
#now to the actual data modification
#first, we extract the dataframe with monotonically increasing values after sorting the date column
df = df.sort_values("Date").reset_index()
df["Max"] = df.TopAcc.cummax().diff()
df.loc[0, "Max"] = 1
dfmax = df[df.Max > 0]
#then, we plot all data, followed by the best performers
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(df.Date, df.TopAcc, c="grey")
ax.plot(dfmax.Date, dfmax.TopAcc, marker="x", c="blue")
#finally, we annotate the best performers
for _, xylabel in dfmax.iterrows():
    ax.text(xylabel.Date, xylabel.TopAcc, xylabel.Model, c="blue", horizontalalignment="right", verticalalignment="bottom")
plt.show()
Sample output:
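The record-selection step (cummax followed by diff) can be checked on a tiny series. This is only an illustrative sketch; the values and the names acc, is_record and records are made up for the example:

```python
import pandas as pd

# Accuracies in date order (hypothetical values)
acc = pd.Series([3, 1, 5, 5, 4, 9])

# cummax() tracks the best value so far; a positive diff marks a new record.
# The first entry has no predecessor, so it is treated as a record (fillna(1)).
is_record = acc.cummax().diff().fillna(1) > 0
records = acc[is_record]
```

Here records keeps only the entries that improved on every earlier value, which is what the blue line in the plot connects.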
I have a pandas dataframe with 27 columns of electricity consumption data: the first column holds the date and time over a two-year period, and the other columns hold hourly electricity consumption values recorded for 26 houses during those two years. What I'm doing is clustering using k-means. Whenever I try to plot the date on the x-axis and the electricity consumption values on the y-axis, I get an error saying that x and y must have the same size. I tried to reshape the data, but that did not solve the problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import datetime
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
X=data_consumption2.iloc[: , 1:26].values
X=np.nan_to_num(X)
np.concatenate(X)
date=data_consumption2.iloc[: , 0].values
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
C = kmeans.cluster_centers_
plt.scatter(X, R , s=40, c= kmeans.labels_.astype(float), alpha=0.7)
plt.scatter(C[:,0] , C[:,1] , marker='*' , c='r', s=100)
I always get the same error message: x and y must be the same size; try to reshape your data. When I tried to reshape the data it did not work, because the date column's size is always smaller than the size of the remaining columns.
I think what you are essentially doing is a time series clustering of all households to find similar electricity usage pattern over time.
For that, each timestamp becomes a 'feature', while each household's usage becomes your data row. This will make it easier to apply sklearn clustering methods, which are typically in the form of method.fit(x) where x represents the features (pass the data as 2D array that has the shape of (row, column)). So your data needs to be transposed.
The refactored code is as such:
# what you have done
import pandas as pd
df = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
# this is to fill all the NaN values with 0
df.fillna(0,inplace=True)
# transpose the dataframe accordingly
df = df.set_index('Timestamp').transpose()
# '%m/%d/%y' is the portable spelling of '%D', which is not available on every platform
df.rename(columns=lambda x : x.strftime('%m/%d/%y %H:%M:%S'), inplace=True)
df.reset_index(inplace=True)
df.rename(columns={'index':'house_no'}, inplace=True)
df.columns.rename(None, inplace=True)
df.head()
and you should see something like this (don't mind the data shown, I created some dummy data that is similar to yours).
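To make the reshaping concrete, here is a small sketch with made-up numbers (three hypothetical houses, four timestamps; none of these values come from the original workbook). After the transpose, each row is one house and each column is one timestamp, which is the (samples, features) layout that fit() expects:

```python
import pandas as pd

# Hypothetical wide table: one row per timestamp, one column per house
wide = pd.DataFrame({
    "Timestamp": pd.date_range("2020-01-01", periods=4),
    "house_1": [1.0, 1.2, 0.9, 1.1],
    "house_2": [3.0, 3.1, 2.8, 3.3],
    "house_3": [0.5, 0.4, 0.6, 0.5],
})

# Transpose so that houses become rows and timestamps become feature columns
by_house = wide.set_index("Timestamp").transpose()
```

With this layout the number of rows equals the number of houses, so one cluster label per house lines up with the data.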
Next, for clustering, this is what you can do:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(df.iloc[:,1:])
y_kmeans = kmeans.predict(df.iloc[:,1:])
C = kmeans.cluster_centers_
# add a new column to your dataframe that contains the predicted clusters
df['cluster'] = y_kmeans
Finally, for plotting, you can produce the scatter chart you wanted using the code below:
import matplotlib.pyplot as plt
color = ['red','green','blue']
plt.figure(figsize=(16,4))
for index, row in df.iterrows():
    plt.scatter(x=row.index[1:-1], y=row.iloc[1:-1], c=color[row.iloc[-1]], marker='x', alpha=0.7, s=40)
for index, cluster_center in enumerate(kmeans.cluster_centers_):
    plt.scatter(x=df.columns[1:-1], y=cluster_center, c=color[index], marker='o', s=100)
plt.xticks(rotation='vertical')
plt.ylabel('Electricity Consumption')
plt.title(f'All Clusters - Scatter', fontsize=20)
plt.show()
But I would recommend plotting line plots for individual clusters, more visually appealing (to me):
plt.figure(figsize=(16,16))
for cluster_index in [0,1,2]:
    plt.subplot(3,1,cluster_index + 1)
    for index, row in df.iterrows():
        if row.iloc[-1] == cluster_index:
            plt.plot(row.iloc[1:-1], c=color[row.iloc[-1]], linestyle='--', marker='x', alpha=0.5)
    plt.plot(kmeans.cluster_centers_[cluster_index], c = color[cluster_index], marker='o', alpha=1)
    plt.xticks(rotation='vertical')
    plt.ylabel('Electricity Consumption')
    plt.title(f'Cluster {cluster_index}', fontsize=20)
plt.tight_layout()
plt.show()
Cheers!
I'm trying to smooth a graph line out, but since the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now, to smooth this line there are some useful Stack Overflow questions already, like:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get any code working to do this for my example. Any suggestions?
You can use the interpolation functionality that ships with pandas. Because your dataframe already has a value for every index entry, you can reindex it with a denser index (hourly instead of daily), which fills every previously non-existent entry with NaN. Then, after choosing one of the many interpolation methods available, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
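The reindex-then-interpolate idea can be verified on a handful of points. This is a minimal sketch with a made-up five-day series (daily, dense_index and dense are illustrative names); like the answer's code, method="cubic" relies on SciPy being installed:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for df["value"]
daily = pd.Series(
    [0.0, 1.0, 0.5, 2.0, 1.5],
    index=pd.date_range("2015-05-15", periods=5),
)

# Denser index over the same span: one point every 6 hours
dense_index = pd.date_range(daily.index[0], daily.index[-1], freq="6h")

# Reindexing inserts NaN at the new timestamps; interpolate() fills them in
dense = daily.reindex(dense_index).interpolate(method="cubic")
```

The interpolated curve still passes through the original daily values, so this smooths the shape of the line rather than filtering noise out of the data.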
There is one workaround: create two plots, 1) the non-smoothed (non-interpolated) data with date labels and 2) the smoothed data without date labels.
Plot 1) using the argument linestyle=" " and convert the dates plotted on the x-axis to strings.
Plot 2) using the argument linestyle="-", interpolating the x-axis with np.linspace and the y-axis with make_interp_spline.
Following is the use of the discussed workaround for your code.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround by creating a linspace over the positions of your x axis
# (stop at the last position, len - 1, so the spline is never evaluated outside the data)
x_new = np.linspace(0, len(df.index) - 1, 300)
a_BSpline = make_interp_spline(
    np.arange(len(df.index)),
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
    x_new,
    y_new,
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
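Since the snippet above leaves values, datetimes and axes undefined, here is a self-contained sketch of the same idea using a simulated daily series (the seed and the names noisy and smoothed are assumptions made for the example):

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
index = pd.date_range("2015-05-15", "2015-12-05")
noisy = pd.Series(rng.normal(0, 1, size=len(index)), index=index)

# window_length must be odd and larger than polyorder
smoothed = pd.Series(
    savgol_filter(noisy.to_numpy(), window_length=21, polyorder=1),
    index=index,
)
```

Because the result keeps the original DatetimeIndex, it can be plotted directly against the dates, for example with smoothed.plot().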
I want to create a plot just like this:
The code:
P.fill_between(DF.start.index, DF.lwr, DF.upr, facecolor='blue', alpha=.2)
P.plot(DF.start.index, DF.Rt, '.')
but with dates in the x axis, like this (without bands):
the code:
P.plot_date(DF.start, DF.Rt, '.')
the problem is that fill_between fails when x values are date_time objects.
Does anyone know of a workaround? DF is a pandas DataFrame.
It would help if you show how df is defined. What does df.info() report? This will show us the dtypes of the columns.
There are many ways that dates can be represented: as strings, ints, floats, datetime.datetime, NumPy datetime64s, Pandas Timestamps, or Pandas DatetimeIndex. The correct way to plot it depends on what you have.
Here is an example showing your code works if df.index is a DatetimeIndex:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
plt.fill_between(df.index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(df.index, df.Rt, '.')
plt.show()
If the index has string representations of dates, then (with Matplotlib version 1.4.2) you would get a TypeError:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'])
index = [item.strftime('%Y-%m-%d') for item in index]
plt.fill_between(index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(index, df.Rt, '.')
plt.show()
yields
File "/home/unutbu/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/ma/core.py", line 2237, in masked_invalid
condition = ~(np.isfinite(a))
TypeError: Not implemented for this type
In this case, the fix is to convert the strings to Timestamps:
index = pd.to_datetime(index)
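As a quick check of what that conversion does, here is a small sketch with three made-up date strings:

```python
import pandas as pd

# Dates stored as plain strings (hypothetical values)
labels = ["2000-01-31", "2000-02-29", "2000-03-31"]

# pd.to_datetime turns a list of strings into a DatetimeIndex
index = pd.to_datetime(labels)
```

The resulting index holds Timestamps rather than str objects, which is the form the working example above passes to fill_between.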
Regarding the error reported by chilliq:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs
could not be safely coerced to any supported types according to the casting
rule ''safe''
This can be produced if the DataFrame columns have "object" dtype when using fill_between. Changing the example column types and then trying to plot, as follows, results in the error above:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
dfo = df.astype(object)
plt.fill_between(dfo.index, dfo.lwr, dfo.upr, facecolor='blue', alpha=.2)
plt.show()
From dfo.info() we see that the column types are "object":
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 180 entries, 2000-01-31 to 2014-12-31
Freq: M
Data columns (total 3 columns):
lwr 180 non-null object
Rt 180 non-null object
upr 180 non-null object
dtypes: object(3)
memory usage: 5.6+ KB
Ensuring that the DataFrame has numerical columns will solve the problem. To do this we can use pandas.to_numeric to convert, as follows:
dfn = dfo.apply(pd.to_numeric)
plt.fill_between(dfn.index, dfn.lwr, dfn.upr, facecolor='blue', alpha=.2)
plt.show()
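The dtype change can be confirmed without plotting. This is a small sketch with made-up values for the three columns:

```python
import pandas as pd

# Numeric values stored with object dtype (hypothetical data)
dfo = pd.DataFrame(
    {"lwr": [9.1, 9.5], "Rt": [10.0, 10.2], "upr": [10.8, 11.0]}
).astype(object)

# Convert each column back to a numeric dtype
dfn = dfo.apply(pd.to_numeric)
```

Comparing dfo.dtypes with dfn.dtypes shows object columns before the conversion and float64 columns after it.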
I got similar error while using fill_between:
ufunc 'bitwise_and' not supported
However, in my case the cause of the error was rather silly. I was passing the color as a positional argument without its keyword name, so it was taken as the fourth parameter, where. Simply making sure that keyword parameters are passed by name solved the issue:
ax.fill_between(xdata, highs, lows, color=color, alpha=0.2)
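For reference, the leading parameters of fill_between are x, y1, y2 and where, so a fourth positional value is interpreted as the where mask. A minimal runnable sketch of the keyword form, using made-up data and the non-interactive Agg backend (both assumptions for the example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)
lows, highs = x - 1.0, x + 1.0

fig, ax = plt.subplots()
# color and alpha are passed by keyword, so nothing lands in the `where` slot
band = ax.fill_between(x, highs, lows, color="tab:blue", alpha=0.2)
```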
I think none of the answers addresses the original question, they all change it a little bit.
If you want to plot timedeltas you can use this workaround:
ax = df.Rt.plot()
x = ax.get_lines()[0].get_xdata().astype(float)
ax.fill_between(x, df.lwr, df.upr, color="b", alpha=0.2)
plt.show()
This works in your case. In general, the only caveat is that you always need to plot the index using pandas first and then get the coordinates from the artist. I am sure that by looking at the pandas code, one can find out how it plots the timedeltas. One could then apply that directly, and the first plot would no longer be needed.