Error ValueError: day is out of range for month - python

I was looking at a sample table of student information and wanted to see what days were the most popular for students to enroll on a course. The script worked fine the first day I ran it and I left it. A few days later I returned to take another look at it but started getting the ValueError message.
Why did it stop working? No new information was added to the dataset since it worked the first time. The code now fails at:
df["EnrolmentDate"] = pd.to_datetime(df.EnrolmentDate)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
colors = sns.cubehelix_palette(28, rot=-0.4)
df = pd.read_csv("data.csv")
#print(df.dtypes)
# To change the format of the data type from object to datetime. Has to be run at start of script or the format returns to object.
#df['Enrolment Date'] = pd.to_datetime(df['Enrolment Date'])
print(df.EnrolmentDate.str.slice(0, 10))
df["EnrolmentDate"] = pd.to_datetime(df.EnrolmentDate)
print(df.head())
print(df.dtypes)
#Tells us what day of the week the enrolment date was. Can also use .dayofyear. Google Pandas API Reference, search for .dt., datetime properties
print(df.EnrolmentDate.dt.weekday_name)
#Shows the latest or greatest enrolment date
print(df.EnrolmentDate.max())
print(df.EnrolmentDate.min())
print(df.EnrolmentDate.max()-df.EnrolmentDate.min())
df["EnrolmentDay"] = df.EnrolmentDate.dt.weekday_name
print(df.head())
print(df.EnrolmentDay.value_counts())
print(df.EnrolmentDay.value_counts().plot())
#print(df.Day.value_counts().sort_index())
#df.EnrolmentDay.value_counts().sort_index().plot()
# naming the x axis
plt.xlabel('Day')
# naming the y axis
plt.ylabel('No. of Enrolments')
plt.show()
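One way to narrow this down, offered as a sketch rather than a confirmed diagnosis: parse the column with errors="coerce" so that strings that cannot be converted become NaT instead of raising, then look at the rows that failed. The sample values below are made up; only the column name EnrolmentDate comes from the question, and the ISO date format is an assumption. Passing an explicit format also stops the parser from guessing, which matters because some fallback parsers fill missing fields from the current date and can therefore behave differently from one day to the next.

```python
import pandas as pd

# Made-up sample; the second value is not a real calendar date.
df = pd.DataFrame({"EnrolmentDate": ["2019-01-15", "2019-02-30", "2019-03-01"]})

# errors="coerce" turns values that cannot be parsed into NaT instead of raising.
# The format string is an assumption; adjust it to match the real file.
parsed = pd.to_datetime(df["EnrolmentDate"], format="%Y-%m-%d", errors="coerce")

# Rows whose original string could not be converted to a date.
bad_rows = df[parsed.isna()]
print(bad_rows)
```

Once the offending rows are known, they can be corrected in the CSV or dropped before the weekday counts are computed.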

Related

Why am I not able to feed my data series from pandas into calmap.yearplot? Trying to create a calendar heat map

Beginner question here.
What I'm trying to build:
A program that takes data from a CSV and creates a calendar heat map from it. I am a language learner (language as in Spanish, Japanese, etc.) and the data set I'm using is a CSV that shows how many hours I spent immersing in my target language per day.
I want the individual values in the heat map to be the number of hours. Y axis will be days of the week, and x axis will be months.
What I have tried:
I have tried many methods for the past two days (most of them using seaborn), that have all resulted in error-infested spaghetti code...
The method I'm using today is with calmap. Here is what I have so far:
import seaborn as sns
import matplotlib as plt
import numpy as np
from vega_datasets import data as vds
import calmap
import pandas as pd
import calplot
# importing CSV from google drive
df = pd.read_csv('ImmersionHours.csv', names=['Type', 'Name', 'Date', 'Time', 'Total Time'])
# deleting extraneous row of data
df.drop([0], inplace=True)
# making sure dates are in datetime format
df['Date'] = pd.to_datetime(df['Date'])
# setting the dates as the index
df.set_index('Date', inplace=True)
# the data is now formatted how I want
# creating a series for the heat map values
hm_values = pd.Series(df.Time)
# trying to create the heat map from the series (hm_values)
calmap.yearplot(data=hm_values, year=2021)
and here is a copy of the data set that I imported into Python (for reference) https://docs.google.com/spreadsheets/d/1owZv0NDLz7S4R5Spf-hzRDGMTCS1FVSMvi0WsZJenWE/edit?usp=sharing
Can someone tell me where I'm going wrong and why the heat map won't show?
Thank you in advance for any advice/tips/corrections.
The question is a bit old, but in case anyone is interested, I had the same problem and found that this notebook was very helpful to solve the issue: https://github.com/amandasolis/Fitbit/blob/master/FitbitSummaryPlots.ipynb
import numpy as np
import pandas as pd
import calmap
fulldf = pd.read_csv("./data.csv", index_col=0, header=None,names=['date','duration','frac'], parse_dates=['date'], usecols=['date','frac'], infer_datetime_format=True, dayfirst=True)
fulldf.index=pd.to_datetime(fulldf.index)
events = pd.Series(fulldf['frac'])
calmap.yearplot(events, year=2022) #the notebook linked above has a better but complex viz
first lines of data.csv (I plot frac, the 3rd column, not duration, but it should be similar):
03/11/2022,1,"0.0103"
08/11/2022,1,"0.0103"
15/11/2022,1,"0.0103"
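For the original question, one thing worth checking (an assumption, since the spreadsheet isn't reproduced here) is whether the values are numeric and the index is a DatetimeIndex before the series is handed to calmap.yearplot. Because the header row was read as data and then dropped, the Time column may still hold strings. A small sketch using the three sample rows above, without calling calmap itself:

```python
import io
import pandas as pd

# The three sample rows shown above, inlined so the sketch is self-contained.
csv_text = '03/11/2022,1,"0.0103"\n08/11/2022,1,"0.0103"\n15/11/2022,1,"0.0103"\n'

# Read everything as strings first so no parsing happens implicitly.
raw = pd.read_csv(io.StringIO(csv_text), header=None,
                  names=["date", "duration", "frac"], dtype=str)

# Parse the day-first dates explicitly and force the values to floats.
index = pd.DatetimeIndex(pd.to_datetime(raw["date"], format="%d/%m/%Y"))
events = pd.Series(pd.to_numeric(raw["frac"]).to_numpy(), index=index, name="frac")

print(events.index.dtype, events.dtype)
```

A series prepared this way can then be passed on as calmap.yearplot(events, year=2022), as in the answer above.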

Python: How to access a subordinate Column?

I'm Sven, and I should say up front that I am an absolute beginner with Python. I read the books "Beginning with Python" and "Python for Data Analysis" to get at least a basic understanding of what I'm doing. My goal with the code below is to show the volume of the S&P 500 together with a rolling mean of the last 250 days, that is, to combine a bar chart (seaborn) with a line chart (matplotlib.pyplot).
The problem arises when plotting the S&P 500 volume with seaborn as a bar chart, because I cannot access the subordinate column "Date". I have an idea but I'm not quite sure how to start. Does anybody have an idea? Thanks a lot.
My approach is somewhere between indexing, hierarchical indexing and grouping.
              Open    High     Low   Close  Adj Close     Volume
Date
1993-02-01  438.78  442.52  438.78  442.52     442.52  238570000
1993-02-02  442.52  442.87  440.76  442.55     442.55  271560000
1993-02-03  442.56  447.35  442.56  447.20     447.20  345410000
1993-02-04  447.20  449.86  447.20  449.56     449.56  351140000
1993-02-05  449.56  449.56  446.95  448.93     448.93  324710000
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
yesterday = datetime.now()-timedelta(1)
datetime.strftime(yesterday, "%Y-%m-%d")
SP500 = yf.download('^GSPC', start='1993-02-01', end=yesterday)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
SP500f = SP500.head()
SP500f.groupby
#Stats_Vol = SP500["Volume"]
#Date = SP500["Date"]
#print(Stats_Vol)
#print(Stats_Vol.describe())
#sns.barplot(data=SP500, y="Volume")
#print(Stats_Vol.rolling(250).mean().plot())
plt.show()
Primarily, you need to access Date, which is the index rather than a column.
You could call reset_index() to make it a column.
There are too many dates to plot individually, so I resampled to yearly means and then created a new column that holds the display format for the x-axis.
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
yesterday = datetime.now()-timedelta(1)
fig, ax = plt.subplots()
SP500 = yf.download('^GSPC', start='1993-02-01', end=yesterday)
# too many days to draw one bar each, so resample to yearly means
# (newer pandas versions spell the "Y" alias as "YE")
# and add a display format for the date (which is the index)
sns.barplot(data=SP500.loc[:, "Volume"]
            .resample("Y").mean().to_frame()
            .assign(GDate=lambda dfa: dfa.index.strftime("%Y")),
            x="GDate", y="Volume", ax=ax)
# rotate the labels
l = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
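The question also asked for a 250-day rolling mean of the volume. Here is a rough sketch of both steps on synthetic data (random numbers stand in for the yfinance download, so the values mean nothing). The Date index is used directly for the yearly grouping, and reset_index() turns it into an ordinary column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for SP500: business days as the index, random volumes.
dates = pd.date_range("1993-02-01", periods=600, freq="B", name="Date")
rng = np.random.default_rng(0)
sp500 = pd.DataFrame(
    {"Volume": rng.integers(200_000_000, 400_000_000, len(dates))},
    index=dates,
)

# The dates live in the index, not in a column, so group on the index itself.
yearly_mean = sp500["Volume"].groupby(sp500.index.year).mean()

# Rolling mean over the last 250 trading days; the first 249 rows are NaN.
rolling_250 = sp500["Volume"].rolling(250).mean()

# reset_index() moves Date into a regular column if a function needs one.
as_column = sp500.reset_index()

print(yearly_mean)
print(as_column.columns.tolist())
```

The rolling series keeps the original Date index, so rolling_250.plot() draws the line chart against dates without any further work.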

How to omit only weekends from my data frame?

I am working on a project to create an algorithmic trader. However, I want to remove the weekends from my data frame because they ruin the data. I have tried some things I found on Stack Overflow, but I get an error saying the type is Timestamp, so I can't use that technique. The date also isn't a column in the data frame. I'm new to Python so I'm not very sure, but I think it's the index, since printing .index shows me the date and time. I'm sorry if these are stupid questions, but I am new to Python and pandas.
Here is my code:
#import all the libraries
import nsetools as ns
import pandas as pd
import numpy
import matplotlib.pyplot as plt
from datetime import datetime
import yfinance as yf
plt.style.use('fivethirtyeight')
a = input("Enter the ticker name you wish to apply strategy to")
ticker = yf.Ticker(a)
hist = ticker.history(period="1mo", interval="15m")
print(hist)
plt.figure(figsize=(12.5, 4.5))
plt.plot(hist['Close'], label=a)
plt.title('close price history')
plt.xlabel("13 Nov 2020 too 13 Dec 2020")
plt.ylabel("Close price")
plt.legend(loc='upper left')
plt.show()
EDIT: On the suggestion of a user, I tried to modify my code to this
plt.style.use('fivethirtyeight')
a = input("Enter the ticker name you wish to apply strategy to")
ticker = yf.Ticker(a)
hist = ticker.history(period="1mo", interval="15m")
refinedlist = hist[hist.index.dayofweek<5]
print (refinedlist)
And graphed that, but the graph still includes the weekends on the x axis.
To begin with, there is no stock market data for weekends and public holidays because the market is closed. And because your data is sampled in time intervals, there is also no data between the time the market closes and the time it opens the next day.
For example, I graphed the first 50 results. (The x-axis doesn't seem to be correct.)
plt.plot(hist['Close'][:50], label=a)
As one example, if you reindex to a full 15-minute grid so that weekends and holidays are included, and draw the graph with missing values for the times when the market is not open, you get the following.
new_idx = pd.date_range(hist.index[0], hist.index[-1], freq='15min')
hist = hist.reindex(new_idx, fill_value=np.nan)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import yfinance as yf
# plt.style.use('fivethirtyeight')
# a = input("Enter the ticker name you wish to apply strategy to")
a = 'AAPL'
ticker = yf.Ticker(a)
hist = ticker.history(period="1mo", interval="15m")
new_idx = pd.date_range(hist.index[0], hist.index[-1], freq='15min')
hist = hist.reindex(new_idx, fill_value=np.nan)
plt.figure(figsize=(12.5, 4.5))
plt.plot(hist['Close'], label=a)
plt.title('close price history')
plt.xlabel("13 Nov 2020 too 13 Dec 2020")
plt.ylabel("Close price")
plt.legend(loc='upper left')
plt.show()
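If the goal is a chart with no gaps at all, a common workaround (not part of the answer above) is to plot against the row position instead of the timestamp and then relabel a few ticks with the matching dates. This sketch uses synthetic 15-minute data in place of the yfinance download. The frame name hist and the Close column mimic the question, everything else is made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for hist: 15-minute bars over two weeks, weekends included.
idx = pd.date_range("2020-11-16", "2020-11-29 23:45", freq="15min")
hist = pd.DataFrame({"Close": np.linspace(100, 120, len(idx))}, index=idx)

# Keep Monday to Friday only (dayofweek: Monday=0 ... Sunday=6).
weekdays = hist[hist.index.dayofweek < 5]

# Plot against 0..N-1 so the removed periods take up no space on the x axis.
positions = np.arange(len(weekdays))
fig, ax = plt.subplots(figsize=(12.5, 4.5))
ax.plot(positions, weekdays["Close"].to_numpy())

# Label a handful of evenly spaced ticks with the real timestamps.
tick_pos = positions[:: len(positions) // 8]
ax.set_xticks(tick_pos)
ax.set_xticklabels(weekdays.index[tick_pos].strftime("%d %b"), rotation=45)
```

In an interactive session the Agg line can be dropped and plt.show() called at the end. The same positional trick also hides overnight gaps once the rows outside trading hours are filtered out.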

Make line chart with multiple series and error bars

I'm hoping to create a line graph which shows the changes to flowering and fruiting times (phenophases) from year to year. For each phenophase I'd like to plot the average Day of Year and, if possible, show the min and max for each year as an error bar. I've filtered down all the data I need in a few data frames, grouped it all in a sensible way, but I can't figure out how to get it all to plot. Here's a screen grab of where I'm at: Imgur
All the examples I've found adding error bars have been based on formulas or other equal amounts over/under, but in my case the max/min will be different so I'm not sure how to integrate that. Possibly just create a list of each column's data and feed that to plot? I'm playing with that now but not getting far.
Also, if anyone has general suggestions as to better ways to present this data I'm all ears. I've looked into Gantt plots but didn't get far with them, as this seems a bit more straight-forward just using matplotlib. I'm happy to put some demo data or the rest of my notebook up if anyone thinks that would help.
Edit: Here's some sample data and the code from my notebook: Gist
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
pd.set_option('display.max_columns', 40)
tick_spacing = 1
dfClean = df[['Site_Cluster', 'Species', 'Phenophase_Name',
              'Phenophase_Status', 'Observation_Year', 'Day_of_Year']]
dfClean = dfClean[dfClean.Phenophase_Status == 1]
PhenoNames = ['Open flowers', 'Ripe fruits']
dfLakes = dfClean[(dfClean.Phenophase_Name.isin(PhenoNames))
                  & (dfClean.Site_Cluster == 'Lakes')
                  & (dfClean.Species == 'lapponica')]
dfLakesGrouped = dfLakes.groupby(['Observation_Year', 'Phenophase_Name'])
dfLakesReady = dfLakesGrouped.Day_of_Year.agg([np.min, np.mean, np.max]).round(0)
dfLakesReady = dfLakesReady.unstack()
print(dfLakesReady['mean'].plot())
Here's another answer:
from pandas import DataFrame, date_range, Timedelta
import numpy as np
from matplotlib import pyplot as plt
rng = date_range(start='2015-01-01', periods=5, freq='24H')
df = DataFrame({'y':np.random.normal(size=len(rng))}, index=rng)
y1 = df['y']
y2 = (y1*3)
# error sizes must be non-negative, so take the absolute value
sd1 = (y1*2).abs()
sd2 = (y1*2).abs()
fig,(ax1,ax2) = plt.subplots(2,1,sharex=True)
_ = y1.plot(yerr=sd1, ax=ax1)
_ = y2.plot(yerr=sd2, ax=ax2)
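When the distances to the minimum and the maximum differ, Matplotlib's errorbar accepts yerr as a two-row array: the first row is the distance below each point and the second row the distance above. Here is a sketch with made-up day-of-year numbers for one phenophase. The column names min, mean and max are assumed to match the aggregated frame in the question, which may name them differently depending on the pandas version.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up per-year summary for one phenophase (values are days of the year).
summary = pd.DataFrame(
    {"min": [150, 148, 155], "mean": [160, 157, 166], "max": [175, 170, 171]},
    index=pd.Index([2013, 2014, 2015], name="Observation_Year"),
)

# Row 0: distance from mean down to min; row 1: distance from mean up to max.
yerr = np.vstack([summary["mean"] - summary["min"],
                  summary["max"] - summary["mean"]])

fig, ax = plt.subplots()
ax.errorbar(summary.index, summary["mean"], yerr=yerr,
            marker="o", capsize=4, label="Open flowers")
ax.set_xticks(summary.index)
ax.set_xlabel("Observation year")
ax.set_ylabel("Day of year")
ax.legend()
```

Calling ax.errorbar once per phenophase on the same axes would give one line with its own error bars for each series.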

Reading data from csv and create a graph

I have a csv file with data in the following format -
Issue_Type DateTime
Issue1 03/07/2011 11:20:44
Issue2 01/05/2011 12:30:34
Issue3 01/01/2011 09:44:21
... ...
I'm able to read this csv file, but what I'm unable to achieve is to plot a graph or rather trend based on the data.
For instance, I'm trying to plot a graph with the X-axis as DateTime (month only) and the Y-axis as the number of issues. So I would show the trend in a line graph with 3 lines indicating the pattern of issues under each category for the month.
I really don't have any code for plotting the graph and hence can't share any; so far I'm only reading the csv file. I'm not sure how to proceed further to plot a graph.
PS: I'm not bent on using Python. Since I've parsed csv using Python earlier I thought of using the language, but if there is an easier approach using some other language, I would be open to exploring that as well.
A way to do this is to use dataframes with pandas.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";") #Reads the csv (raw string keeps the backslashes literal)
df.index = pd.to_datetime(df["DateTime"]) #Set the index of the dataframe to the DateTime column
del df["DateTime"] #The DateTime column is now useless
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_Type"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m')) #This will only show the month number on the graph
This assumes that the Issue1/2/3 values are integers; I assumed so because I didn't really understand what they were supposed to be.
Edit: This should do the trick then, it's not pretty and can probably be optimised, but it works well:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";")
df.index = pd.to_datetime(df["DateTime"])
del df["DateTime"]
# Extract the number after the "Issue" prefix, e.g. "Issue2" -> 2
issue_numbers = []
for issue in df["Issue_Type"]:
    issue_numbers.append(int(issue[5:]))
df["Issue_number"] = issue_numbers
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_number"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m'))
plt.show()
The first thing you need to do is to parse the datetime fields as dates/times. Try using dateutil.parser for that.
Next, you will need to count the number of issues of each type in each month. The naive way to do that would be to maintain a list of monthly counters for each issue type, iterate through each row, see which month and which issue type it is, and then increment the appropriate counter.
When you have such a frequency count of issues, sorted by issue types, you can simply plot them against dates like this:
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# starting_year, ending_year and issues (one list of monthly counts per issue type)
# are expected to come from the counting step above
dates = []
for year in range(starting_year, ending_year):
    for month in range(1, 13):  # range excludes the stop value, so 13 covers December
        dates.append(dt.datetime(year=year, month=month, day=1))
month_formatter = mdates.DateFormatter('%b') # Format dates to only show month names
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(dates, issues[0]) # To plot just issues of type 1
ax.plot(dates, issues[1]) # To plot just issues of type 2
ax.plot(dates, issues[2]) # To plot just issues of type 3
ax.xaxis.set_major_formatter(month_formatter) # Format X tick labels
plt.show()
plt.close()
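The counting step described above can also be done with pandas instead of hand-maintained lists. This is a sketch, assuming the dates are day/month/year (the question doesn't say) and using a few made-up rows in the question's format:

```python
import pandas as pd

# Made-up rows in the same shape as the question's file.
df = pd.DataFrame({
    "Issue_Type": ["Issue1", "Issue2", "Issue1", "Issue3", "Issue1"],
    "DateTime": ["03/07/2011 11:20:44", "01/05/2011 12:30:34",
                 "15/07/2011 08:00:00", "01/01/2011 09:44:21",
                 "20/05/2011 17:45:10"],
})

# Assumption: the day comes first. Swap %d and %m if the file is month-first.
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%d/%m/%Y %H:%M:%S")

# One row per month, one column per issue type, cells = number of issues.
monthly = (df.groupby([df["DateTime"].dt.to_period("M"), "Issue_Type"])
             .size()
             .unstack(fill_value=0))
print(monthly)
```

Calling monthly.plot() on the result draws one line per issue type with the months on the x axis.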
Honestly, I would just use R. Check this link out on downloading and setting up R and RStudio.
data <- read.csv(file="c:/yourdatafile.csv", header=TRUE, sep=",")
attach(data)
data$Month <- format(as.Date(data$DateTime), "%m")
plot(DateTime, Issue_Type)
