Problems with matplotlib and datetime - python

I am trying to plot the observed and calculated values of a time series parameter using matplotlib. The observed data are stored in an Excel file which I read using openpyxl and convert to a pandas dataframe. The simulated values are read as elapsed days, which I convert to numpy datetime64 using
delt = get_simulated_time()
t0 = np.datetime64('2004-01-01T00:00:00')
tsim = t0 + np.asarray(delt).astype('timedelta64[D]')
I plot the data using the following code snippet
df = obs_data_df.query("block=='blk-7'")
pobs = df['pres']
tobs = df['date']
tobs = np.array(tobs, dtype='datetime64')
print(type(tobs), np.min(tobs), np.max(tobs))
axs.plot(tobs, pobs, '.', color='g', label='blk-7, obs', markersize=8)
tsim = np.array(curr_sim_obj.tsim, dtype='datetime64')
print(type(tsim), np.min(tsim), np.max(tsim))
axs.plot(tsim, curr_sim_obj.psim[:, 0], '-', color='g', label='blk-7, sim', linewidth=1)
The results of the print statements are:
print(type(tobs), np.min(tobs), np.max(tobs))
... <class 'numpy.ndarray'> 2004-06-01T00:00:00.000000000 2020-06-01T00:00:00.000000000
print(type(tsim), np.min(tsim), np.max(tsim))
... <class 'numpy.ndarray'> 2004-01-01T00:00:00 2020-07-20T00:00:00
These types look OK but I get this error message from matplotlib:
ValueError: view limit minimum -36907.706903627106 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
I don't understand why I am getting this message since the print statements indicate that the data are consistent. I tried investigating further using
print(np.dtype(tsim), np.min(tobs), np.max(tobs))
but get this error message:
TypeError: data type not understood
This has confused me even further since I set the tobs data type in the preceding statement. I have to say that I am really confused about the differences in the way that python, pandas and numpy handle dates and the various code kludges above reflect workarounds that I have picked up along the way. I would basically like to know how to plot the two different time series on the same plot so all suggestions very welcome. Thank you in advance!
Update:
While cutting down the code to get a simpler case that reproduced the error I found the following code buried in the plotting routine:
axs.plot(10*np.random.randn(100), 10*np.random.randn(100), 'o')
This was left over from testing the plot routine. Once I removed this the errors disappeared. I guess I need to check my code more carefully ...

The solution was to use the matplotlib.dates.num2date function on both sets of data.
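On the `TypeError: data type not understood` from the question: `np.dtype()` builds a dtype from a type specifier such as `'datetime64[s]'`, so passing it an array fails. The array's own `.dtype` attribute is the usual way to inspect it. A minimal sketch, where the elapsed-day values in `delt` are made up for illustration and stand in for `get_simulated_time()`:

```python
import numpy as np

# Hypothetical elapsed days since the simulation start (integers for simplicity)
delt = [0, 152, 6045]

t0 = np.datetime64('2004-01-01T00:00:00')
tsim = t0 + np.asarray(delt).astype('timedelta64[D]')

# Inspect the dtype through the attribute, not through np.dtype(tsim)
print(tsim.dtype)
print(tsim.min(), tsim.max())
```

The observed dates printed in the question have nanosecond resolution and the simulated ones second resolution. Both are `datetime64` arrays, which is consistent with the update above: the dates were fine and the stray plot of plain floats caused the error.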

Related

How can I calculate the time lag between two similar time series?

I'm trying to compute/visualize the time lag between 2 time series (I want to know the time lag between the humidity progression outside and inside a room).
Each data point of my series was taken hourly. Plotting the 2 series together, I can clearly see a shift between them (plot not shown; sorry for hiding the axis).
Here is part of my time series data. I will pack it in 2 arrays:
inside_humidity =
[11.77961297, 11.59755268, 12.28761522, 11.88797553, 11.78122077, 11.5694668,
11.70421932, 11.78122077, 11.74272005, 11.78122077, 11.69438733, 11.54126933,
11.28460592, 11.05624965, 10.9611012, 11.07527934, 11.25417308, 11.56040908,
11.6657186, 11.51171572, 11.49246536, 11.78594142, 11.22968373, 11.26840678,
11.26840678, 11.29447992, 11.25553344, 11.19711371, 11.17764047, 11.11922075,
11.04132778, 10.86996123, 10.67410607, 10.63493504, 10.74922916, 10.74922916,
10.6294765, 10.61011497, 10.59075345, 10.80373021, 11.07479154, 11.15223764,
11.19711371, 11.17764047, 11.15816723, 11.22250051, 11.22250051, 11.202915,
11.18332948, 11.16374396, 11.14415845, 11.12457293, 11.10498742, 11.14926578,
11.16896413, 11.16896413, 11.14926578, 10.8307902, 10.51742195, 10.28187137,
10.12608544, 9.98977276, 9.62267727, 9.31289289, 8.96438546, 8.77077022,
8.69332413, 8.51907042, 8.30609366, 8.38353975, 8.4513867, 8.47085994,
8.50980642, 8.52927966, 8.50980642, 8.55887037, 8.51969934, 8.48052831,
8.30425867, 8.2177078, 7.98402891, 7.92560918, 7.89950166, 7.83489682,
7.75789537, 7.5984808, 7.28426807, 7.39778913, 7.71943214, 8.01149931,
8.18276652, 8.23009255, 8.16215295, 7.93822471, 8.00350215, 7.93843482,
7.85072729, 7.49778011, 7.31782649, 7.29862668, 7.60162032, 8.29665484,
8.58797834, 8.50011383, 8.86757784, 8.76600556, 8.60491125, 8.4222628,
8.24923231, 8.14470714, 8.17351638, 8.52530093, 8.72220151, 9.26745883,
9.1580007, 8.61762692, 8.22187405, 8.43693644, 8.32414835, 8.32463974,
8.46833012, 8.55865487, 8.72647164, 9.04112806, 9.35578449, 9.59465974,
10.47339785, 11.07218093, 10.54091351, 10.56138918, 10.46099958, 10.38129168,
10.16434831, 10.10612612, 10.009246, 10.53502351, 10.8307902, 11.13420052,
11.64337309, 11.18958511, 10.49630791, 10.60856932, 10.37029108, 9.86281478,
9.64699826, 9.95341012, 10.24329812, 10.6848196, 11.47604231, 11.30505352,
10.72194974, 10.30058448, 10.05022037, 10.06318411, 9.90118897, 9.68530059,
9.47790657, 9.48585784, 9.61639418, 9.86244265, 10.29009361, 10.28297229,
10.32073088, 10.65389513, 11.09656351, 11.20188562, 11.24124169, 10.40503955,
9.74632512, 9.07606098, 8.85145589, 9.37080152, 9.65082743, 10.0707891,
10.68776091, 11.25879751, 11.0416348, 10.89558456, 10.7908258, 10.66539685,
10.7297755, 10.77571398, 10.9268264, 11.16021492, 11.60961709, 11.43827534,
11.96155427, 12.16116437, 12.80412266, 12.52540805, 11.96752965, 11.58099292]
outside_humidity =
[10.17449206, 10.4823292, 11.06818167, 10.82768699, 11.27582592, 11.4196233,
10.99393027, 11.4122507, 11.18192837, 10.87247831, 10.68664321, 10.37949651,
9.57155882, 10.86611665, 11.62547196, 11.32004266, 11.75537602, 11.51292063,
11.03107569, 10.7297755, 10.4345622, 10.61271497, 9.49271162, 10.15594248,
9.99053828, 9.80915398, 9.6452438, 10.06900573, 11.18075689, 11.8289847,
11.83334752, 11.27480708, 11.14370467, 10.88149985, 10.73930381, 10.7236597,
10.26210496, 11.01260226, 11.05428228, 11.58321342, 12.70523808, 12.5181118,
11.90023799, 11.67756426, 11.28859471, 10.86878222, 9.73984486, 10.18253902,
9.80915398, 10.50980784, 11.38673459, 11.22751685, 10.94171823, 10.56484228,
10.38220753, 10.05388847, 9.96147203, 9.90698862, 9.7732203, 9.85262125,
8.7412938, 8.88281702, 8.07919545, 8.02883587, 8.32341424, 8.07357711,
7.27302616, 6.73660684, 6.66722819, 7.29408637, 7.00046542, 6.46322019,
6.07150988, 6.00207234, 5.8818402, 6.82443881, 7.20212882, 7.52167696,
7.88857771, 8.351627, 8.36547023, 8.24802846, 8.18520693, 7.92420816,
7.64926024, 7.87944972, 7.82118727, 8.02091833, 7.93071882, 7.75789457,
7.5416447, 6.94430133, 6.65907535, 6.67454591, 7.25493614, 7.76939457,
7.55357806, 6.61479472, 7.17641357, 7.24664082, 8.62732387, 8.66913548,
8.70925667, 9.0477017, 8.24558224, 8.4330502, 8.44366397, 8.17995798,
8.1875752, 9.33296518, 9.66567041, 9.88581085, 8.95449382, 8.3587624,
9.20584448, 8.90605388, 8.87494884, 9.12694892, 8.35055177, 7.91879933,
7.78867253, 8.22800878, 9.03685287, 12.49630018, 11.11819755, 10.98869374,
10.65897176, 10.36444573, 10.052609, 10.87627021, 10.07379564, 10.02233847,
9.62022856, 11.21575473, 10.85483543, 11.67324627, 11.89234248, 11.10068132,
10.06942096, 8.50405894, 8.13168561, 8.83616476, 8.35675085, 8.33616802,
8.35675085, 9.02209801, 9.5530404, 9.44738836, 10.89645958, 11.44771721,
11.79943601, 10.7765335, 11.1453622, 10.74874776, 10.55195175, 10.34494483,
9.83813522, 11.26931785, 11.20641798, 10.51555027, 10.90808954, 11.80923545,
11.68300879, 11.60313809, 7.95163365, 7.77213815, 7.54209557, 7.30603673,
7.17842173, 8.25899805, 8.56494995, 10.44245578, 11.08542758, 11.74129079,
11.67979686, 12.94362214, 11.96285343, 11.8289847, 11.01388413, 10.6793698,
11.20662595, 11.97684701, 12.46383177, 11.34178655, 12.12477078, 12.48698059,
12.89325064, 12.07470295, 12.6777319, 10.91689448, 10.7676326, 10.66710434]
I know cross correlation is the right term to use, but after a while I still don't get the idea of using scipy.signal.correlate and numpy.correlate, because all I got is an array full of NaNs. So clearly I need some more knowledge in this area.
What I expect to achieve is probably a plot like those in the answer section of the thread "How to make a correlation plot with a certain lag of two time series", where I can see at how many hours the time lag is most likely.
Thank you a lot in advance!
With the given data, you can use the numpy and matplotlib modules to fit and plot the relationship between the two series.
You can do something like this:
import numpy as np
from matplotlib import pyplot as plt
x = np.array(inside_humidity)
y = np.array(outside_humidity)
fig = plt.figure()
# fit a curve of your choice
a, b = np.polyfit(inside_humidity, outside_humidity, 1)
y_fit = a * x + b
# scatter plot, and fitted plot (best fit used)
plt.scatter(inside_humidity, outside_humidity)
plt.plot(x, y_fit)
plt.show()
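which gives a scatter plot of outside against inside humidity with the fitted line drawn through it (image not shown).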
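For the lag itself, one common approach is to cross-correlate the mean-removed series and take the shift with the largest correlation. The sketch below is one way to do that with `numpy.correlate`; the helper name `estimate_lag` and the synthetic test signal are illustrative assumptions, not part of the original thread. Missing values propagate through `numpy.correlate`, so if the full series contains NaNs that could explain an output full of NaNs. Here they are replaced by zero after mean removal, which is itself an assumption about how gaps should be treated.

```python
import numpy as np


def estimate_lag(reference, delayed):
    """Return (best_lag, lags, corr); best_lag > 0 means `delayed` trails `reference`."""
    a = np.asarray(reference, dtype=float)
    b = np.asarray(delayed, dtype=float)
    # Subtract the means so the constant offset does not dominate the correlation
    a = a - np.nanmean(a)
    b = b - np.nanmean(b)
    # Assumption: treat missing samples as "no deviation" so they do not poison the sums
    a = np.nan_to_num(a)
    b = np.nan_to_num(b)
    # corr[i] = sum over n of b[n + lags[i]] * a[n]
    corr = np.correlate(b, a, mode="full")
    lags = np.arange(-(len(a) - 1), len(b))
    return int(lags[np.argmax(corr)]), lags, corr


# Synthetic check: a random signal and a copy shifted by 3 samples
rng = np.random.default_rng(0)
signal = rng.standard_normal(200)
shifted = np.roll(signal, 3)
best_lag, lags, corr = estimate_lag(signal, shifted)
print(best_lag)
```

With the hourly data from the question, `estimate_lag(outside_humidity, inside_humidity)` would return the lag in hours, and plotting `corr` against `lags` gives the kind of lag plot the question describes.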

Struggling to correctly utilize the str.find() function

I'm trying to use str.find() and it keeps raising an error. What am I doing wrong?
I have a matrix where the 1st column is numbers and the 2nd is an abbreviation assigned to each of them. The abbreviations are either ED, LI or NA. I'm trying to find the positions that correspond to those abbreviations so that I can plot a scatter graph that is colour coded to match those 3 groups.
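import numpy as np
import pandas as pd
import scipy.io as sio
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
mat=sio.loadmat('PBMC_extract.mat') # loading the data file into python
data=mat['spectra']
data_name=mat['name'] # reading in the name variable
data_name = pd.DataFrame(data_name) # converting into a readable matrix
pca=PCA(n_components=20) # performs PCA on data with 20 components
pca.fit(data) # fits to data set
datatrans=pca.transform(data) # transforms data using PCA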
# plotting the graph that accounts for majority of data and noise
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
fig = plt.figure()
ax1 = Axes3D(fig)
#str.find to find individual positions of anticoagulants
str.find(data_name,'ED')
#renaming data for easiness
x_data=datatrans[0:35,0]
y_data=datatrans[0:35,1]
z_data=datatrans[0:35,2]
x2_data=datatrans[36:82,0]
y2_data=datatrans[36:82,1]
z2_data=datatrans[36:82,2]
x3_data=datatrans[83:97,0]
y3_data=datatrans[83:97,1]
z3_data=datatrans[83:97,2]
# scatter plot of score of PC1,2,3
ax1.scatter(x_data, y_data, z_data,c='b', marker="^")
ax1.scatter(x2_data, y2_data, z2_data,c='r', marker="o")
ax1.scatter(x3_data, y3_data, z3_data,c='g', marker="s")
ax1.set_xlabel('PC 1')
ax1.set_ylabel('PC 2')
ax1.set_zlabel('PC 3')
plt.show()
The error that keeps showing up is the following:
File "/Users/emma/Desktop/Final year project /working example of colouring data", line 49, in <module>
str.find(data_name,'ED')
TypeError: descriptor 'find' requires a 'str' object but received a 'DataFrame'
The error is because the find method expects a str object instead of a DataFrame object. As PiRK mentioned the problem is you're replacing the data_name variable here:
data_name = pd.DataFrame(data_name)
I believe it should be:
data = pd.DataFrame(data_name)
Also, although str.find(data_name, 'ED') works when data_name is a str, the suggested way is to call the method on the string itself and pass only the search term, like this:
data_name.find('ED')
The proper syntax would be
data_name.find('ED')
Look at the examples here:
https://www.programiz.com/python-programming/methods/string/find
EDIT 1
Though I just noticed data_name is a pandas DataFrame, so that won't work. What exactly are you trying to do?
Your broken function call isn't even returning into a variable, so it's hard to answer your question.
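For the stated goal of locating the rows of each group, an element-wise search on a pandas Series is one option. This is a minimal sketch under the assumption that the labels can be brought into a one-dimensional Series of plain strings; the `names` values below are invented, and data loaded from a .mat file may need unwrapping or stripping first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'name' column loaded from the .mat file
names = pd.Series(['ED', 'ED', 'LI', 'NA', 'LI', 'ED'])

# Series.str.find applies str.find to every element; -1 means "not found"
is_ed = names.str.find('ED') >= 0
ed_positions = np.flatnonzero(is_ed.to_numpy())

# Row positions for each of the three groups, usable to index the PCA scores
groups = {
    label: np.flatnonzero((names == label).to_numpy())
    for label in ('ED', 'LI', 'NA')
}
print(ed_positions, groups)
```

Each entry of `groups` could then replace the hard-coded slices, for example `datatrans[groups['ED'], 0]` for the first principal component of the ED samples.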

Plotting pandas dataframe after doing pandas melt is slow and creates strange y-axis

This could be caused by me not understanding how pandas.melt works, but I get strange behaviour when plotting "melted" dataframes using plotnine. Both frames have been converted from wide to long format: one frame with a column containing string values (df_slow) and another with only numerical values (df_fast).
The following code gives different behaviour. Plotting df_slow is slow and gives a strange looking y-axis. Plotting df_fast looks ok and is fast. My guess is that pandas melt is doing something strange with the data which causes this behaviour. See example plots.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as p9
SIZE = 200
value = np.random.rand(SIZE, 1)
# Create test data, one with a column containing strings, one with only numeric values
df_slow = pd.DataFrame({'value': value.flatten(), 'string_column': ['A']*SIZE})
df_fast = pd.DataFrame({'value': value.flatten()})
# Set index
df_slow = df_slow.reset_index()
df_fast = df_fast.reset_index()
# Convert 'df_slow', 'df_fast' to long format
df_slow = pd.melt(df_slow, id_vars='index')
df_fast = pd.melt(df_fast, id_vars='index')
print(df_slow.head())
print(df_fast.head())
df_slow = df_slow[df_slow.variable == 'value']
# This is slow and has many breaks on y-axis
p = (p9.ggplot(df_slow, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
# This is much faster and y-axis looks good
p = (p9.ggplot(df_fast, p9.aes(x='index', y='value')) + p9.geom_point())
p.draw()
slow and strange plot
fast and good looking plot
Possible fix
Changing the type of the "value" column in df_slow makes it behave like df_fast when plotting.
# This makes df_slow behave like df_fast when plotting
df_slow['value'] = df_slow.value.astype(np.float64)
Question
Is this a bug in plotnine (or pandas) or am I doing something wrong?
Answer
When pivoting two columns with different data types, in this case string and float, I guess it makes sense that the resulting column containing both strings and floats will have the type object. As ALollz pointed out, this probably makes plotnine interpret the values as strings, which causes this behavior.

How to use datetime.time to plot in Python

I have list of timestamps in the format of HH:MM:SS and want to plot against some values using datetime.time. Seems like python doesn't like the way I do it. Can someone please help ?
import datetime
import matplotlib.pyplot as plt
# random data
x = [datetime.time(12,10,10), datetime.time(12, 11, 10)]
y = [1,5]
# plot
plt.plot(x,y)
plt.show()
TypeError: float() argument must be a string or a number
Well, a two-step story to get 'em PLOT really nice
Step 1: prepare data into a proper format
from a datetime to a matplotlib convention compatible float for dates/times
As usual, devil is hidden in detail.
matplotlib dates are almost equal, but not equal:
# mPlotDATEs.date2num.__doc__
#
# *d* is either a class `datetime` instance or a sequence of datetimes.
#
# Return value is a floating point number (or sequence of floats)
# which gives the number of days (fraction part represents hours,
# minutes, seconds) since 0001-01-01 00:00:00 UTC, *plus* *one*.
# The addition of one here is a historical artifact. Also, note
# that the Gregorian calendar is assumed; this is not universal
# practice. For details, see the module docstring.
So, highly recommended to re-use their "own" tool:
from matplotlib import dates as mPlotDATEs # helper functions num2date()
# # and date2num()
# # to convert to/from.
Step 2: manage axis-labels & formatting & scale (min/max) as a next issue
matplotlib brings you arms for this part too.
Check code in this answer for all details
This is still an issue in Python 3.5.3 and Matplotlib 2.1.0.
A workaround is to use datetime.datetime objects instead of datetime.time ones:
import datetime
import matplotlib.pyplot as plt
# random data
x = [datetime.time(12,10,10), datetime.time(12, 11, 10)]
x_dt = [datetime.datetime.combine(datetime.date.today(), t) for t in x]
y = [1,5]
# plot
plt.plot(x_dt, y)
plt.show()
By default the date part should not be visible. Otherwise you can always use DateFormatter:
import matplotlib.dates as mdates
ax = plt.gca()  # current axes from the plot above
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H-%M-%S'))
I came to this page because I have a similar issue. I have a Pandas DataFrame df with a datetime column df.dtm and a data column df.x, spanning several days, but I want to plot them using matplotlib.pyplot as a function of time of day, not date and time (datetime, datetimeindex). I.e., I want all data points to be folded into the same 24h range in the plot. I can plot df.x vs. df.dtm without issue, but I've just spent two hours trying to figure out how to convert df.dtm to df.time (containing the time of day without a date) and then plotting it. The (to me) straightforward solution does not work:
df.dtm = pd.to_datetime(df.dtm)
ax.plot(df.dtm, df.x)
# Works (with times on different dates; a range >24h)
df['time'] = df.dtm.dt.time
ax.plot(df.time, df.x)
# DOES NOT WORK:
# matplotlib.units.ConversionError: Failed to convert value(s) to axis units:
# array([datetime.time(0, 0), datetime.time(0, 5), etc.])
This does work:
pd.plotting.register_matplotlib_converters() # Needed to plot Pandas df with Matplotlib
df.dtm = pd.to_datetime(df.dtm, utc=True) # NOTE: MUST add a timezone, even if undesired
ax.plot(df.dtm, df.x)
# Works as before
df['time'] = df.dtm.dt.time
ax.plot(df.time, df.x)
# WORKS!!! (with time of day, all data in the same 24h range)
Note that the differences are in the first two lines. The first line allows better collaboration between Pandas and Matplotlib, the second seems redundant (or even wrong), but that doesn't matter in my case, since I use a single timezone and it is not plotted.
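An alternative that avoids `datetime.time` objects altogether is to keep full datetimes but move every sample onto the same arbitrary date, so Matplotlib's normal date handling applies. The sketch below assumes a datetime Series like `df.dtm`; the sample timestamps and the anchor date `2000-01-01` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical timestamps spanning two days
dtm = pd.Series(pd.to_datetime([
    '2021-03-01 00:05', '2021-03-01 13:30',
    '2021-03-02 00:10', '2021-03-02 13:45',
]))

# Time elapsed since each sample's own midnight, as Timedelta values
since_midnight = dtm - dtm.dt.normalize()

# Re-attach every offset to one common anchor date
anchor = pd.Timestamp('2000-01-01')
folded = anchor + since_midnight
print(folded)
```

Plotting `folded` against the data column, with a formatter such as `mdates.DateFormatter('%H:%M')` on the x axis, should then show all points in a single 24h range with only the time of day on the tick labels.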

Imshow and pcolor throw errors when trying to create test pattern-style bars

I am trying to create an image to use as a test pattern for a new colormap I'm creating. The map is supposed to have nine unique colors with breaks at the integers from 0-8. The colormap itself is fine, but I can't seem to generate the image itself.
I'm using pandas to make the test array like this:
mask=pan.DataFrame(index=np.arange(0,100),columns=np.arange(1,91))
mask.ix[:,1:10]=0.0
mask.ix[:,11:20]=1.0
mask.ix[:,21:30]=2.0
mask.ix[:,31:40]=3.0
mask.ix[:,41:50]=4.0
mask.ix[:,51:60]=5.0
mask.ix[:,61:70]=6.0
mask.ix[:,71:80]=7.0
mask.ix[:,81:90]=8.0
Maybe not the most elegant method, but it creates the array I want.
However, when I try to plot it using either imshow or pcolor I get an error. So:
fig=plt.figure()
ax=fig.add_subplot(111)
image=ax.imshow(mask)
fig.canvas.draw()
yields the error: "TypeError: Image data can not convert to float"
and substituting pcolor for imshow yields this error: "AttributeError: 'float' object has no attribute 'view'"
However, when I replace the values in mask with anything else - say random numbers - it plots just fine:
mask=pan.DataFrame(rand(100,90),index=np.arange(0,100),columns=np.arange(1,91))
fig=plt.figure()
ax=fig.add_subplot(111)
image=ax.imshow(mask)
fig.canvas.draw()
yields the standard colored speckle one would expect (no errors).
The problem here is that your dataframe is full of objects, not numbers. You can see it if you do mask.dtypes. If you want to use pandas dataframes, create mask by specifying the data type:
mask=pan.DataFrame(index=np.arange(0,100),columns=np.arange(1,91), dtype='float')
otherwise pandas cannot know which data type you want. After that change your code should work.
However, if you want to just test the color maps with integers, then you might be better off using simple numpy arrays:
mask = np.empty((100,90), dtype='int')
mask[:, :10] = 0
mask[:, 10:20] = 1
...
And, of course, there are shorter ways to do that filling, as well. For example:
mask[:] = np.arange(90)[None,:] // 10
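The dtype point in the answer can be checked directly. The sketch below builds a small frame with and without an explicit dtype; the 4x3 size is only for illustration. It fills the frame with `.loc`, because `.ix` was removed in pandas 1.0.

```python
import numpy as np
import pandas as pd

# Without data or a dtype, an empty frame holds generic Python objects
mask_obj = pd.DataFrame(index=np.arange(4), columns=np.arange(1, 4))
print(mask_obj.dtypes.unique())

# Declaring the dtype up front gives numeric (NaN-filled) columns
mask_f = pd.DataFrame(index=np.arange(4), columns=np.arange(1, 4), dtype='float')

# .loc slices by label and includes both endpoints
mask_f.loc[:, 1:2] = 0.0
mask_f.loc[:, 3:3] = 1.0
print(mask_f.dtypes.unique())
```

A frame that is already object-typed can also be converted after filling with `mask.astype(float)` before it is passed to imshow.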
