Putting a gap/break in a pyplot line plot without losing data - python

I have a time series with several large data gaps. I would like to see a connecting line between data points that are less than an hour apart, but not if the gap is larger. The accepted answer to the question, Put a gap/break in a line plot, would work except that you sacrifice the masked points. I would like to avoid that.
I have attempted to make a list comprehension that would insert NaNs into the array, I think that would automatically achieve the same result, but I don't seem to be able to do it correctly. The best I have found is as follows:
import datetime as dtm
import numpy as np
x = np.array([dtm.datetime(2001,4,3,0,47,30),dtm.datetime(2001,4,3,0,52,30),dtm.datetime(2001,4,3,0,57,30),dtm.datetime(2001,4,3,3,57,30),dtm.datetime(2001,4,3,4,2,30),dtm.datetime(2001,4,3,4,7,30)])
xmod = np.array([x[0]]+[dt1 if dt1-dt0 < dtm.timedelta(hours=1.) else [dt1,np.nan] for dt1, dt0 in zip(x[1:],x[:-1])])
This gives the result:
In [7]: xmod
Out[7]:
array([datetime.datetime(2001, 4, 3, 0, 47, 30),
datetime.datetime(2001, 4, 3, 0, 47, 30),
datetime.datetime(2001, 4, 3, 0, 52, 30),
[datetime.datetime(2001, 4, 3, 0, 57, 30), nan],
datetime.datetime(2001, 4, 3, 3, 57, 30),
datetime.datetime(2001, 4, 3, 4, 2, 30)], dtype=object)
I have not been able to find a way to insert both the data point and the np.nan without putting brackets around them. Is this possible? Is there a better way to achieve my goal? Thanks!

In accordance with the comment above, probably the easiest way to do this would be to separate the data into groups where you need the gaps. Here is one way to implement such a thing.
import datetime as dtm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x = np.array([dtm.datetime(2001,4,3,0,47,30),dtm.datetime(2001,4,3,0,52,30),dtm.datetime(2001,4,3,0,57,30),
dtm.datetime(2001,4,3,3,57,30),dtm.datetime(2001,4,3,4,2,30),dtm.datetime(2001,4,3,4,7,30)])
y = range(len(x))
# make a dataframe with groups separated that are over an hour apart
data = []
g = 0
for i in range(len(x)):
x0 = x[i]
y0 = y[i]
if i < (len(x)-1):
x1 = x[i+1]
td = x1 - x0
elapsed_seconds = td.total_seconds()
hrs = (elapsed_seconds/60)/60
if hrs < 1:
data.append([x0,y0, g])
else:
data.append([x0,y0, g])
g+=1
else:
data.append([x0,y0, g])
df = pd.DataFrame(data, columns=['x', 'y', 'group'])
# draw a plot
fig, ax = plt.subplots(1,1, figsize = (8,5))
for i, dfg in df.groupby('group'):
ax.plot(dfg['x'], dfg['y'], c='b')

So, I accepted the answer by djakubosky because it seems clean and is probably the right approach. However, by the time that answer was posted, I had decided that what I was doing was inappropriate for a list comprehension and simply wrote it as a for loop - and that worked fine. Possibly this will be useful to someone else. Here is the code:
def insert_breaks(x,y):
import datetime as dtm
import numpy as np
xnew = []
ynew = []
for dt1, dt0, y1, y0 in zip(x[1:],x[:-1],y[1:],y[:-1]):
if dt1-dt0 < dtm.timedelta(hours=1):
xnew+=[dt0]
ynew+=[y0]
else:
xnew+=[dt0,dt0+(dt1-dt0)/2]
ynew+=[y0, np.nan]
xnew+=[dt1]
ynew+=[y1]
return xnew, ynew

Related

How to use np.where multiple times without iteration?

Note that this question is not about multiple conditions within a single np.where(), see this thread for that.
I have a numpy array arr1 with some numbers (without a particular structure):
arr0 = \
np.array([[0,3,0],
[1,3,2],
[1,2,0]])
and a list of all the entries in this array:
entries = [0,1,2,3]
I also have another array, arr1:
arr1 = \
np.array([[4,5,6],
[6,2,4],
[3,7,9]])
I would like to perform some function on multiple subsets of elements of arr1. A subset consts of numbers which are at the same position as arr0 entries with a cetrain value. Let this function be finding the max value. Performing the function on each subset via a list comprehension:
res = [np.where(arr0==index,arr1,0).max() for index in entries]
res is [9, 6, 7, 5]
As expected: 0 in arr0 is on the top left, top right, bottom right corner, and the biggest number from the top left, top right, bottom right entries of arr1 (ie 4, 6, 9) is 9. Rest follow with a similar logic.
How can I achieve this without iteration?
My actual arrays are much bigger than these examples.
With broadcasting
res = np.where(arr0[...,None] == entries, arr1[...,None], 0).max(axis=(0, 1))
The result of np.where(...) is a (3, 3, 4) array, where slicing [...,0] would give you the same 3x3 array you get by manually doing the np.where with just entries[0], etc. Then taking the max of each 3x3 subarray leaves you with the desired result.
Timings
Apparently this method doesn't scale well for bigger arrays. The other answer using np.unique is more efficient because it reduces the maximum operation down to a few unique value regardless of how big the original arrays are.
import timeit
import matplotlib.pyplot as plt
import numpy as np
def loops():
return [np.where(arr0==index,arr1,0).max() for index in entries]
def broadcast():
return np.where(arr0[...,None] == entries, arr1[...,None], 0).max(axis=(0, 1))
def numpy_1d():
arr0_1D = arr0.ravel()
arr1_1D = arr1.ravel()
arg_idx = np.argsort(arr0_1D)
u, idx = np.unique(arr0_1D[arg_idx], return_index=True)
return np.maximum.reduceat(arr1_1D[arg_idx], idx)
sizes = (3, 10, 25, 50, 100, 250, 500, 1000)
lengths = (4, 10, 25, 50, 100)
methods = (loops, broadcast, numpy_1d)
fig, ax = plt.subplots(len(lengths), sharex=True)
for i, M in enumerate(lengths):
entries = np.arange(M)
times = [[] for _ in range(len(methods))]
for N in sizes:
arr0 = np.random.randint(1000, size=(N, N))
arr1 = np.random.randint(1000, size=(N, N))
for j, method in enumerate(methods):
times[j].append(np.mean(timeit.repeat(method, number=1, repeat=10)))
for t in times:
ax[i].plot(sizes, t)
ax[i].legend(['loops', 'broadcasting', 'numpy_1d'])
ax[i].set_title(f'Entries size {M}')
plt.xticks(sizes)
fig.text(0.5, 0.04, 'Array size (NxN)', ha='center')
fig.text(0.04, 0.5, 'Time (s)', va='center', rotation='vertical')
plt.show()
It's more convenient to work in 1D case. You need to sort your arr0 then find starting indices for every group and use np.maximum.reduceat.
arr0_1D = np.array([[0,3,0],[1,3,2],[1,2,0]]).ravel()
arr1_1D = np.array([[4,5,6],[6,2,4],[3,7,9]]).ravel()
arg_idx = np.argsort(arr0_1D)
>>> arr0_1D[arg_idx]
array([0, 0, 0, 1, 1, 2, 2, 3, 3])
u, idx = np.unique(arr0_1D[arg_idx], return_index=True)
>>> idx
array([0, 3, 5, 7], dtype=int64)
>>> np.maximum.reduceat(arr1_1D[arg_idx], idx)
array([9, 6, 7, 5], dtype=int32)

Python using row index as variable input to equation within numpy array

I can't figure out how to in python without creating a for loop. I'm hoping you can teach me the simpler way.
I trimmed the relevant stuff. I'm doing a polyfit and then want to use these a and b coefficients, coeff[0:1], to update an array and solve the relevant y's like: y = ax + b
I can brute force it and included two methods here, but they're both clunky.
import numpy as np
raw = [0, 3, 6, 8, 11, 15]
coeff = np.polyfit(np.arange(0, len(raw)), raw[:], 1) #fits slope of values in raw
fit = np.zeros(shape=(len(raw), 2))
fit[:,0] = np.arange(0,fit.shape[0]) # this creates an index so I can use the row index as the "x" variable
fit[:,1] = fit[:,0]*coeff[0] + fit[:,0]*coeff[1] # calculating y = ax * b in column [1]
## Alternate method with the for loop
for_fit = np.zeros(len(raw))
for i in range(0,len(raw)) :
for_fit[i] = i*coeff[0] + i*coeff[1]
I tried to make it a little bit cleaner. The main issue I saw is that you did not use the formula y = ax+b but rather y=ax+bx.
import numpy as np
raw = [0, 3, 6, 8, 11, 15]
x = np.arange(0, len(raw))
coeff = np.polyfit(x, raw[:], 1)
y = x*coeff[0] + coeff[1]
To visualise the result we can use:
import matplotlib.pyplot as plt
plt.plot(x, raw, 'bo')
plt.plot(x, y, 'r')
#EDIT
Are you looking for something like this?
y_arr = np.empty((10, len(x)))
for i in range(10):
...
y_arr[i] = y

Numeric integration in numpy

I want to do something quite simple but I'm unable to find it in the depths of numpy. I want to numerically and continuously integrate a function given by its values (not by its formula!). That means I simply want an array which holds the sums of the beginning of the input array. Example:
Input:
[ 4, 3, 5, 8 ]
Output:
[ 4, 7, 12, 20 ] # [ sum(i[0:1]), sum(i[0:2]), sum(i[0:3]), sum(i[0:4]) ]
Sounds pretty straight forward, so I'm hopeful this must be easy with some numpy functionality I'm currently unable to find.
I found stuff like scipy.integrate.quad() but that seems to integrate over a given range (from a to b) and the returns a single value. I need an array as output.
You're looking for numpy.cumsum:
>>> numpy.cumsum([ 4, 3, 5, 8 ])
array([ 4, 7, 12, 20])
You would simply need numpy.cumsum().
import numpy as np
a = np.array([ 4, 3, 5, 8 ])
print np.cumsum(a) # prints [ 4 7 12 20]
You can use quadpy (pip install quadpy), a project of mine, which as opposed to scipy.integrate.quad() does vectorized compution. Provide it with many intervals, and get all the integral values over these intervals back.
import numpy
import quadpy
a = 0.0
b = 3.0
h = 1.0e-2
n = int((b-a) / h)
x0 = numpy.linspace(a, b, num=n, endpoint=False)
x1 = x0 + h
intervals = numpy.stack([x0, x1])
vals = quadpy.line_segment.integrate(
lambda x: numpy.sin(x),
intervals,
quadpy.line_segment.GaussLegendre(5)
)
res = numpy.cumsum(vals)
import matplotlib.pyplot as plt
plt.plot(x1, numpy.sin(x1), label='f')
plt.plot(x1, res, label='F')
plt.legend()
plt.show()
You don't need numpy to get the output. Using standard itertools we get the following:
from itertools import accumulate
a = [4, 3, 5, 8]
*b, = accumulate(a)
print(b)
# [4, 7, 12, 20]

Interpolation of datetimes for smooth matplotlib plot in python

I have lists of datetimes and values like this:
import datetime
x = [datetime.datetime(2016, 9, 26, 0, 0), datetime.datetime(2016, 9, 27, 0, 0),
datetime.datetime(2016, 9, 28, 0, 0), datetime.datetime(2016, 9, 29, 0, 0),
datetime.datetime(2016, 9, 30, 0, 0), datetime.datetime(2016, 10, 1, 0, 0)]
y = [26060, 23243, 22834, 22541, 22441, 23248]
And can plot them like this:
import matplotlib.pyplot as plt
plt.plot(x, y)
I would like to be able to plot a smooth version using more x-points. So first I do this:
delta_t = max(x) - min(x)
N_points = 300
xnew = [min(x) + i*delta_t/N_points for i in range(N_points)]
Then attempting a spline fit with scipy:
from scipy.interpolate import spline
ynew = spline(x, y, xnew)
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
What is the best way to proceed? I am open to solutions involving other libraries such as pandas or plotly.
You're trying to pass a list of datetimes to the spline function, which are Python objects (hence dtype('O')). You need to convert the datetimes to a numeric format first, and then convert them back if you wish:
int_x = [i.total_seconds() for i in x]
ynew = spline(int_x, y, xnew)
Edit: total_seconds() is actually a timedelta method, not for datetimes. However it looks like you sorted it out so I'll leave this answer as is.
Figured something out:
x_ts = [x_.timestamp() for x_ in x]
xnew_ts = [x_.timestamp() for x_ in xnew]
ynew = spline(x_ts, y, xnew_ts)
plt.plot(xnew, ynew)
This works very nicely, but I'm still open to ideas for simpler methods.

Sorted cumulative plots

How can I get sorted cumulative plots in numpy/matplotlib or Pandas?
Let me explain this with an example. Say we have the following data:
number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]
We want to plot a chart that, for a given (x,y) value is read as: the top %X selling stores sold %Y items. That is, it displays the data as follows:
where the best selling stores are to the left (i.e. the slope of the plot decreases monotonically). How can I do this in numpy or Pandas ? (i.e. assuming the above is a Series).
Assuming that you want the best performing stores to come first:
import numpy as np
import matplotlib.pyplot as plt
number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]
ar = sorted(number_of_items_sold_per_store,reverse=True)
y = np.cumsum(ar).astype("float32")
#normalise to a percentage
y/=y.max()
y*=100.
#prepend a 0 to y as zero stores have zero items
y = np.hstack((0,y))
#get cumulative percentage of stores
x = np.linspace(0,100,y.size)
#plot
plt.plot(x,y)
plt.show()
I think the steps involved here are:
Sort the list of sale counts in descending order
Get the cumulative sum of the sorted list
Divide by the overall total and multiply by 100 to convert to percentage
Plot!
n_sold = number_of_items_sold_per_store
sorted_sales = list(reversed(sorted(n_sold)))
total_sales = np.sum(n_sold)
cum_sales = np.cumsum(sorted_sales).astype(np.float64) / total_sales
cum_sales *= 100 # Convert to percentage
# Borrowing the linspace trick from ebarr
x_vals = np.linspace(0, 100, len(cum_sales))
plt.plot(x_vals, cum_sales)
plt.show()
This works for me (you can convert ':
number_of_items_sold_per_store' to numpy array using number_of_items_sold_per_store.values)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]
# Create histogram
values, base = np.histogram(number_of_items_sold_per_store, bins=500)
# Cumulative data
cum = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cum, c='red')
plt.show()

Categories

Resources