Generating an array of dates in Python

I am writing a Python script that produces a bar graph of data between two dates specified by the user.
For example, here the user enters 30 November and 4 December:
import datetime as dt
dateBegin = dt.date(2012,11,30)
dateEnd = dt.date(2012,12,4)
Is there a way to return an array of the dates between dateBegin and dateEnd?
What I want is something like [30, 1, 2, 3, 4]. Any suggestions?

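Sure! You are looking for matplotlib.dates.drange. Like range, it excludes the end point, so the call below stops at 3 December; use DT.datetime(2012, 12, 5) as the end to include 4 December: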
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
import datetime as DT
dates = mdates.num2date(mdates.drange(DT.datetime(2012, 11, 30),
                                      DT.datetime(2012, 12, 4),
                                      DT.timedelta(days=1)))
print(dates)
# [datetime.datetime(2012, 11, 30, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>), datetime.datetime(2012, 12, 1, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>), datetime.datetime(2012, 12, 2, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>), datetime.datetime(2012, 12, 3, 0, 0, tzinfo=<matplotlib.dates._UTC object at 0x8c8f8ec>)]
vals = np.random.randint(10, size=len(dates))
fig, ax = plt.subplots()
ax.bar(dates, vals, align='center')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.xticks(rotation=25)
ax.set_xticks(dates)
plt.show()
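For the plain list the question asks for, Matplotlib is not strictly needed. The following is a minimal sketch using only the standard library, with the end date included; the helper name date_range is hypothetical.

```python
import datetime as dt

def date_range(begin, end):
    """Return every date from begin to end, both endpoints included."""
    n_days = (end - begin).days
    return [begin + dt.timedelta(days=i) for i in range(n_days + 1)]

date_begin = dt.date(2012, 11, 30)
date_end = dt.date(2012, 12, 4)

dates = date_range(date_begin, date_end)
days = [d.day for d in dates]  # day-of-month numbers only
print(days)  # [30, 1, 2, 3, 4]
```

The dates list keeps the full date objects, which are usually the better choice for axis positions; days is only the day-of-month view shown in the question.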


What is plotted when string data is passed to the matplotlib API?

# first, some imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Let's say I want to make a scatter plot, using this data:
np.random.seed(42)
x=np.arange(0,50)
y=np.random.normal(loc=3000,scale=1,size=50)
Plot via:
plt.scatter(x,y)
I get this answer:
Ok, let's create a dataframe first:
df=pd.DataFrame.from_dict({'x':x,'y':y.astype(str)})
(I am aware that I am storing y as str - this is a reproducible example, and I do this to reflect the real use case.)
Then, if I do:
plt.scatter(df.x,df.y)
I get:
What am I seeing in this second plot? I thought that the second plot must be showing the x column plotted against the y column, which are converted to float. This is clearly not the case.
Matplotlib doesn't automatically convert str values to numbers, so your y values are treated as categorical. As far as Matplotlib is concerned, the distance from '1.0' to '0.9' is the same as the distance from '1.0' to '100.0'.
So, the y-axis on the plot will be the same as range(len(y)) (since the difference between all categorical values is the same) with labels assigned from the categorical values.
Since your x is a range equal to range(50), and now your y is a range too (also equal to range(50)), it plots x = y, with y-labels set to respective str value.
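To make the mapping concrete, here is a minimal pure-Python sketch of the idea. It is not Matplotlib's actual implementation: it only assumes that each distinct string gets the next integer position in order of first appearance and that the string itself becomes the tick label. The helper name categorical_positions is hypothetical.

```python
def categorical_positions(values):
    """Map each string to an integer position in order of first appearance."""
    position_of = {}
    positions = []
    for value in values:
        # A string seen for the first time gets the next free position.
        if value not in position_of:
            position_of[value] = len(position_of)
        positions.append(position_of[value])
    return positions, list(position_of)

y = ["3000.5", "2999.1", "3001.7", "2999.1"]
positions, labels = categorical_positions(y)
print(positions)  # [0, 1, 2, 1]
print(labels)     # ['3000.5', '2999.1', '3001.7']
```

The numeric value of each string plays no role here, which is why the points end up evenly spaced, and a repeated string reuses its earlier position.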
As per the excellent answer by dm2, when you pass y as a string, y is simply being treated as arbitrary string labels, and being plotted one after the other in the order in which they appear. To demonstrate, here's an even simpler example.
from matplotlib import pyplot as plt
x = [1, 2, 3, 4]
y = [5, 25, 10, 1] # these are ints
plt.scatter(x, y)
So far so good. Now, different string y values.
y = list("abcd")
plt.scatter(x, y)
You can see how it takes the y labels and drops them on the axis one after another.
Finally,
y = ["5", "25", "10", "1"]
plt.scatter(x, y)
Compare this with the previous results and now it should become obvious what's going on.
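If the goal is a numeric axis, one option is to convert the strings to numbers before plotting. A minimal sketch with plain floats follows; with a DataFrame, df.y.astype(float) or pd.to_numeric(df.y) should do the same job.

```python
x = [1, 2, 3, 4]
y_str = ["5", "25", "10", "1"]

# Convert the string labels to real numbers so the axis is numeric.
y_num = [float(v) for v in y_str]
print(y_num)  # [5.0, 25.0, 10.0, 1.0]

# plt.scatter(x, y_num) would then place the points at their numeric values.
```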
It's more obvious once the labels and locations are extracted: the API plots the strings as labels, and the axis locations are 0-indexed numbers based on how many categories exist (len).
.get_xticks() and .get_yticks() extract a list of the numeric locations.
.get_xticklabels() and .get_yticklabels() extract a list of matplotlib.text.Text, Text(x, y, text).
There are fewer numbers in the list for the y axis because there were duplicate values as a result of rounding.
This applies to any API, such as seaborn or pandas, that uses matplotlib as the backend.
sns.scatterplot(data=df, x='x_num', y='y', ax=ax1)
ax1.scatter(data=df, x='x_num', y='y')
ax1.plot('x_num', 'y', 'o', data=df)
Labels, Locs, and Text
print(x_nums_loc)
print(y_nums_loc)
print(x_lets_loc)
print(y_lets_loc)
print(x_lets_labels)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
[Text(0, 0, 'A'), Text(1, 0, 'B'), Text(2, 0, 'C'), Text(3, 0, 'D'), Text(4, 0, 'E'),
Text(5, 0, 'F'), Text(6, 0, 'G'), Text(7, 0, 'H'), Text(8, 0, 'I'), Text(9, 0, 'J'),
Text(10, 0, 'K'), Text(11, 0, 'L'), Text(12, 0, 'M'), Text(13, 0, 'N'), Text(14, 0, 'O'),
Text(15, 0, 'P'), Text(16, 0, 'Q'), Text(17, 0, 'R'), Text(18, 0, 'S'), Text(19, 0, 'T'),
Text(20, 0, 'U'), Text(21, 0, 'V'), Text(22, 0, 'W'), Text(23, 0, 'X'), Text(24, 0, 'Y'),
Text(25, 0, 'Z')]
Imports, Data, and Plotting
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sample data
np.random.seed(45)
x_numbers = np.arange(100, 126)
x_letters = list(string.ascii_uppercase)
y= np.random.normal(loc=3000, scale=1, size=26).round(2)
df = pd.DataFrame.from_dict({'x_num': x_numbers, 'x_let': x_letters, 'y': y}).astype(str)
# plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3.5))
df.plot(kind='scatter', x='x_num', y='y', ax=ax1, title='X Numbers', rot=90)
df.plot(kind='scatter', x='x_let', y='y', ax=ax2, title='X Letters')
x_nums_loc = ax1.get_xticks()
y_nums_loc = ax1.get_yticks()
x_lets_loc = ax2.get_xticks()
y_lets_loc = ax2.get_yticks()
x_lets_labels = ax2.get_xticklabels()
fig.tight_layout()
plt.show()

Parse a list of numbers in datetime in Python Pandas

I have a dataframe of transportation data.
The datetime fields are logged in this format: [2020, 12, 10, 15, 0, 5, 18000000].
How do I parse these as datetime objects?
You can unpack the list with * into the datetime.datetime constructor, and use any pandas function (for example Series.apply) to do this for every value in your pandas.Series.
>>> from datetime import datetime
>>> datetime(*[2020, 12, 10, 15, 0, 5, 18000])
datetime.datetime(2020, 12, 10, 15, 0, 5, 18000)
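One more point: you will need to adjust the last field, because datetime only accepts microseconds in the range 0 to 999999 and 18000000 is too large. The example below divides it by 1000.
E.g.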
from datetime import datetime
import pandas as pd
df = pd.DataFrame({"example": [[2020, 12, 10, 15, 0, 5, 18000000]]})
df.example = df.example.apply(lambda x: datetime(
    *(v if i != len(x) - 1 else v // 1000 for i, v in enumerate(x))
))
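The same conversion can be written as a small named function, which is easier to test on its own. This is only a sketch: it assumes every list has exactly seven fields and that the last one is in nanoseconds (an assumption suggested by the division by 1000 above, not stated in the question). The name parse_fields is hypothetical.

```python
from datetime import datetime

def parse_fields(fields):
    """Build a datetime from [year, month, day, hour, minute, second, nanos]."""
    *head, nanos = fields
    # datetime stores microseconds, so scale the assumed nanoseconds down.
    return datetime(*head, nanos // 1000)

parsed = parse_fields([2020, 12, 10, 15, 0, 5, 18000000])
print(parsed)  # 2020-12-10 15:00:05.018000
```

With pandas, the function can be applied to a column of lists with df.example.apply(parse_fields).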

DateLocator in matplotlib to show the first days of both the week and the month

I would like to create a DateLocator in matplotlib that selects all Mondays and the first days of the month. As matplotlib uses the dateutil library I read the docs of how to use RRuleLocator with rrule objects. With the rruleset object from dateutil I can achieve the required functionality:
>>> rrset = rruleset()
>>> rrset.rrule(rrule(DAILY, byweekday=MO, count=5))
>>> rrset.rrule(rrule(DAILY, bymonthday=1, count=5))
>>> list(rrset)
[datetime.datetime(2020, 11, 30, 16, 10, 2),
datetime.datetime(2020, 12, 1, 16, 10, 2),
datetime.datetime(2020, 12, 7, 16, 10, 2),
datetime.datetime(2020, 12, 14, 16, 10, 2),
datetime.datetime(2020, 12, 21, 16, 10, 2),
datetime.datetime(2020, 12, 28, 16, 10, 2),
datetime.datetime(2021, 1, 1, 16, 10, 2),
datetime.datetime(2021, 2, 1, 16, 10, 2),
datetime.datetime(2021, 3, 1, 16, 10, 2),
datetime.datetime(2021, 4, 1, 16, 10, 2)]
But unfortunately I did not manage to find out how to use rruleset with matplotlib. RRuleLocator expects a rrulewrapper object (defined in matplotlib) that hides away the rrule instance and I can not use it with rruleset. Any other way to do this?
If I understood you correctly, calling .set_xticks(list(rrset)) might be enough. For example:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import dateutil
from dateutil.rrule import *
import datetime
import numpy as np
rrset = rruleset()
rrset.rrule(rrule(DAILY, byweekday=MO, count=5))
rrset.rrule(rrule(DAILY, bymonthday=1, count=5))
print(list(rrset))
## generate dates 90 days into the future
base = datetime.datetime.today()
dates = [base + datetime.timedelta(days=3*x) for x in range(30)]
fig = plt.figure(figsize=(10,5))
ax = plt.subplot(111)
ax.set_autoscale_on(True)
## simply plot dates over dates
ax.plot(dates,dates,marker='s')
ax.set_xticks(list(rrset))
formatter = mdates.DateFormatter('%m/%d/%y')
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_tick_params(rotation=30, labelsize=10)
ax.autoscale_view()
ax.grid()
plt.show()
This yields the following plot (run on 11/26/20, when the next Monday is 11/30/2020, so its tick label overlaps with the one for the first of the month):
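Because each rule above uses count=5, the ticks only cover the first few weeks and months. As an alternative sketch that does not use dateutil, the tick dates for a fixed range can be collected with a plain loop; the helper name mondays_and_month_starts and the chosen range are assumptions for illustration.

```python
import datetime as dt

def mondays_and_month_starts(start, end):
    """Return every Monday and every first-of-month between start and end."""
    ticks = []
    day = start
    while day <= end:
        # weekday() == 0 is Monday; day == 1 is the first of the month.
        if day.weekday() == 0 or day.day == 1:
            ticks.append(day)
        day += dt.timedelta(days=1)
    return ticks

ticks = mondays_and_month_starts(dt.date(2020, 11, 26), dt.date(2020, 12, 31))
print(ticks)
```

The resulting list could be passed to ax.set_xticks(ticks) in the same way as list(rrset).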

Pandas : first datetime field gets automatically converted to timestamp type

When creating a pandas dataframe object (python 2.7.9, pandas 0.16.2), the first datetime field gets automatically converted into a pandas timestamp. Why? Is it possible to prevent this so as to keep the field in the original type?
Please see code below:
import numpy as np
import datetime
import pandas
create a dict:
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
     'vstart': np.array([datetime.datetime(2001, 11, 16, 0, 0),
                         datetime.datetime(2012, 2, 28, 0, 0),
                         datetime.datetime(2014, 12, 22, 0, 0)],
                        dtype=object),
     'vstop': np.array([datetime.datetime(2012, 2, 28, 0, 0),
                        datetime.datetime(2014, 12, 22, 0, 0),
                        datetime.datetime(9999, 12, 31, 0, 0)],
                       dtype=object),
     'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'],
                    dtype='|S18')}
So, the vstart and vstop keys are datetime so far. However, after:
df = pandas.DataFrame(data = x)
vstart becomes a pandas Timestamp automatically, while vstop remains a datetime:
type(df.vstart[0])
#class 'pandas.tslib.Timestamp'
type(df.vstop[0])
#type 'datetime.datetime'
I don't understand why the first datetime column that the constructor comes across gets converted to Timestamp by pandas. How can I tell pandas to keep the data types as they are? Can you help? Thank you.
Actually, this has nothing to do with which date column comes first. Your vstop column contains dt.datetime(9999, 12, 31, 0, 0), which lies outside the range a nanosecond-resolution pandas Timestamp can represent (it ends in the year 2262), so pandas leaves that column as plain datetime objects. If you change the year on this date to a normal one such as 2020, both columns are treated the same.
Note that I'm importing the datetime module as dt and pandas as pd:
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
'vstop': np.array([dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0), dt.datetime(2020, 12, 31, 0, 0)], dtype=object),
'vstart': np.array([dt.datetime(2001, 11, 16, 0, 0),dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0)], dtype=object),
'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'], dtype='|S18')}
In [27]:
df = pd.DataFrame(x)
df
Out[27]:
   cusip                  id     vstart      vstop
10553M10  EQ0000000000041095 2001-11-16 2012-02-28
67085120  EQ0000000000041095 2012-02-28 2014-12-22
67085140  EQ0000000000041095 2014-12-22 2020-12-31
In [25]:
type(df.vstart[0])
Out[25]:
pandas.tslib.Timestamp
In [26]:
type(df.vstop[0])
Out[26]:
pandas.tslib.Timestamp
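A quick way to check whether a given datetime would fit is to see if its offset from 1970-01-01 in nanoseconds fits in a signed 64-bit integer, which is how nanosecond-resolution timestamps are stored. The following is a standard-library sketch of that check, not pandas code; the helper name fits_in_ns_int64 is hypothetical, and pandas itself exposes the actual limits as pd.Timestamp.min and pd.Timestamp.max.

```python
import datetime as dt

EPOCH = dt.datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1

def fits_in_ns_int64(value):
    """Return True if value, as nanoseconds since the epoch, fits in int64."""
    delta = value - EPOCH
    nanos = ((delta.days * 86_400 + delta.seconds) * 1_000_000_000
             + delta.microseconds * 1_000)
    # The most negative int64 is excluded because it is reserved for NaT.
    return -INT64_MAX <= nanos <= INT64_MAX

print(fits_in_ns_int64(dt.datetime(2020, 12, 31)))  # True
print(fits_in_ns_int64(dt.datetime(9999, 12, 31)))  # False
```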

How to read the date/time field from the csv file and plot a graph accordingly in python

I'm importing records from a CSV file using the Python csv module. The date/time field expects the date to be in a specific format, but different spreadsheet programs default to different formats, and I don't want the user to have to change their own format. I want to find a way to either detect the format the string is in, or only allow several specified formats.
How do I read the date/time field from the CSV file and plot a graph accordingly?
dateutil can parse date strings in a variety of formats, without you having to specify in advance what format the date string is in:
In [8]: import dateutil.parser as parser
In [9]: parser.parse('Jan 1')
Out[9]: datetime.datetime(2011, 1, 1, 0, 0)
In [10]: parser.parse('1 Jan')
Out[10]: datetime.datetime(2011, 1, 1, 0, 0)
In [11]: parser.parse('1-Jan')
Out[11]: datetime.datetime(2011, 1, 1, 0, 0)
In [12]: parser.parse('Jan-1')
Out[12]: datetime.datetime(2011, 1, 1, 0, 0)
In [13]: parser.parse('Jan 2,1999')
Out[13]: datetime.datetime(1999, 1, 2, 0, 0)
In [14]: parser.parse('2 Jan 1999')
Out[14]: datetime.datetime(1999, 1, 2, 0, 0)
In [15]: parser.parse('1999-1-2')
Out[15]: datetime.datetime(1999, 1, 2, 0, 0)
In [16]: parser.parse('1999/1/2')
Out[16]: datetime.datetime(1999, 1, 2, 0, 0)
In [17]: parser.parse('2/1/1999')
Out[17]: datetime.datetime(1999, 2, 1, 0, 0)
In [18]: parser.parse("10-09-2003", dayfirst=True)
Out[18]: datetime.datetime(2003, 9, 10, 0, 0)
In [19]: parser.parse("10-09-03", yearfirst=True)
Out[19]: datetime.datetime(2010, 9, 3, 0, 0)
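For the other option mentioned in the question, allowing only a few specified formats, a minimal standard-library sketch is to try each format in turn with datetime.strptime. The format list and the name parse_allowed are assumptions for illustration; note that %b matches month names in the current locale.

```python
from datetime import datetime

# Hypothetical whitelist; adjust it to the formats you want to accept.
ALLOWED_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")

def parse_allowed(text):
    """Parse text with the first matching format, or raise ValueError."""
    for fmt in ALLOWED_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {text!r}")

print(parse_allowed("1999-1-2"))
print(parse_allowed("2 Jan 1999"))
```

Unlike dateutil, this rejects anything outside the list, which avoids silently guessing between day-first and month-first input.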
Once you've collected the dates and values into lists, you can plot them with plt.plot. For example:
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
n=20
now=dt.datetime.now()
dates=[now+dt.timedelta(days=i) for i in range(n)]
values=[np.sin(np.pi*i/n) for i in range(n)]
plt.plot(dates,values)
plt.show()
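Collecting the dates and values from the CSV file itself might look like the sketch below. The in-memory text stands in for an open file, and the column names date and value as well as the date format are assumptions, since the question does not show the file.

```python
import csv
import io
from datetime import datetime

# Stand-in for an open CSV file; the column names are assumptions.
csv_text = "date,value\n2011-02-01,1.5\n2011-02-02,2.0\n2011-02-03,0.5\n"

dates, values = [], []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Parse the date column into datetime objects and the value into a float.
    dates.append(datetime.strptime(row["date"], "%Y-%m-%d"))
    values.append(float(row["value"]))

print(dates[0], values)
# plt.plot(dates, values) would then draw the series.
```

For files with mixed date formats, the strptime call could be swapped for dateutil's parser.parse(row["date"]).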
Per Joe Kington's comment, a graph similar to the one above could also be made using matplotlib.dates.datestr2num instead of using dateutil.parser explicitly:
import matplotlib.pyplot as plt
import matplotlib.dates as md
import datetime as dt
import numpy as np
n=20
dates=['2011-Feb-{i}'.format(i=i) for i in range(1,n)]
dates=md.datestr2num(dates)
values=[np.sin(np.pi*i/n) for i in range(1,n)]
plt.plot_date(dates,values,linestyle='solid',marker='None')
plt.show()
