How to get a date range boxplot with a pandas dataframe

I have a dataframe whose index holds datetime objects and whose data are sales counts. Most of the time there are multiple sales throughout the day, and each day can have a different number of sales. I would like to create a box plot over a date range that nicely formats the xticklabels, depending on how many days I'd like to show in the plot. I've tried different variants of code but have thus far been unsuccessful. Could someone look at my script below and please help me?
import pandas as pd
import matplotlib.pyplot as plt
index1 = ['2017-07-01','2017-07-01','2017-07-02','2017-07-02','2017-07-03','2017-07-03','2017-07-03']
index2 = pd.to_datetime(index1,format='%Y-%m-%d')
df = pd.DataFrame([[123456],[123789],[123654],[654321],[654987],[789456],[789123]],columns=['Count'],index=index2)
df.plot(kind='box')
plt.show()

Use T to transpose your dataframe, so that each row becomes its own column:
df.T.plot(kind='box', figsize=(10,7))
To keep the dates separate, with all the records of one date grouped into one box, do this:
df.reset_index().set_index('index',append=True).unstack()['Count'].plot(kind='box',figsize=(10,7))
This is better (it needs numpy):
import numpy as np
df.set_index(np.arange(len(df)),append=True).unstack(0)['Count']\
    .plot(kind='box',figsize=(10,7))
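The question also asks for nicely formatted xticklabels, which the snippets above leave to the defaults. A minimal sketch of one way to do it, assuming the sample data from the question: group the counts by calendar day, draw the boxes with matplotlib directly, and label them with short date strings. The '%m-%d' format and the variable names grouped and labels are illustrative choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it to get a window
import matplotlib.pyplot as plt
import pandas as pd

index = pd.to_datetime(['2017-07-01', '2017-07-01', '2017-07-02', '2017-07-02',
                        '2017-07-03', '2017-07-03', '2017-07-03'])
df = pd.DataFrame({'Count': [123456, 123789, 123654, 654321, 654987, 789456, 789123]},
                  index=index)

# One list of counts per calendar day, in date order.
grouped = df.groupby(df.index.normalize())['Count'].apply(list)

fig, ax = plt.subplots(figsize=(10, 7))
ax.boxplot(grouped.tolist())

# Boxes sit at positions 1..n, so label those positions with short date strings.
labels = list(grouped.index.strftime('%m-%d'))
ax.set_xticks(range(1, len(grouped) + 1))
ax.set_xticklabels(labels, rotation=45)
ax.set_ylabel('Count')
```

To show a different date range, slice the dataframe first (for example df.loc['2017-07-01':'2017-07-02']) and pick a format string that suits the span.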

Related

pandas/matplotlib graph on frequency of appearance

I am a pandas newbie and I want to make a graph from a CSV I have. The CSV contains a list of dates, and I want to graph how frequently each date appears.
This is how it looks :
2022-01-12
2022-01-12
2022-01-12
2022-01-13
2022-01-13
2022-01-14
Here, we can see that I have three records on the 12th of January, two records on the 13th and only one record on the 14th. So we should see a decrease on the graph.
So, I tried converting my csv like this :
date,records
2022-01-12,3
2022-01-13,2
2022-01-14,1
And then making a graph with the date as the x axis and the number of records as the y axis.
But is there a way pandas (or matplotlib, I never understand which one to use) can make a graph based on the frequency of appearance, so that I don't have to convert the CSV first?

pandas has a function, value_counts(), that counts how many times each value appears.
First off, you'd need to read your csv file into a dataframe. Do this by using:
import pandas as pd
df = pd.read_csv("~csv file name~")
Using the unique() function in the pandas library, you can get all of the unique values. The syntax should look like:
uniqueVals = df["~column name~"].unique()
That should return an array of all the unique values. Then what you'll do is call value_counts() on the same column and put the value you are trying to count in square brackets after the normal brackets. The syntax should look something like this:
totalOfVals = []
for date in uniqueVals:
    # value_counts() returns a Series indexed by value, so [date] picks that date's count
    numDate = df["~column name~"].value_counts()[date]
    totalOfVals.append(numDate)
Then you can use the two arrays you have for the unique dates and the amount of dates there are to then use matplotlib to create a graph.
You'll want to use the syntax:
import matplotlib.pyplot as mpl
mpl.plot(uniqueVals, totalOfVals, color = "~whatever colour you want the line to be~", marker = "~whatever you want the marker to look like~")
mpl.xlabel('Date')
mpl.ylabel('Number of occurrences')
mpl.title('Number of occurrences of dates')
mpl.grid(True)
mpl.show()
And that should display a graph with all the dates and number of occurrences with a grid behind it. Of course, if you don't want the grid, either pass False to mpl.grid or remove that line.
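For reference, a shorter sketch of the same idea, assuming the CSV has no header row and a single column of dates as in the question. The column name "date" and the inline csv_text are made up for illustration. value_counts() already returns the unique values together with their counts, so no explicit loop is needed.

```python
from io import StringIO

import pandas as pd

# Stand-in for the file on disk; the real code would pass a path to read_csv.
csv_text = """2022-01-12
2022-01-12
2022-01-12
2022-01-13
2022-01-13
2022-01-14
"""

df = pd.read_csv(StringIO(csv_text), header=None, names=["date"], parse_dates=["date"])

# value_counts() orders by frequency; sort_index() puts the dates back in
# chronological order, which is what the x axis needs.
counts = df["date"].value_counts().sort_index()
```

From here counts.plot(marker="o") draws the line chart directly, with the dates on the x axis.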

Datetime to Time/HH:MM format – investigating events on multiple dates by the time of day

I have a pandas dataframe with a column "Datetime" which has values in pd.Timestamp / np.datetime64 format. How should I extract the hours and minutes while keeping the status of this "HH:MM" as "continuous plottable values?"
I want to plot a histogram of the dataframe column (pd.Series) based on the frequency in "HH:MM sense" in which case the x-axis would range from 00:00 to 23:59 etc.
import pandas as pd
# ...
new_df["Datetime"][0]
> Timestamp('2022-08-08 16:58:00')
I saw examples of extracting the time as a string. Not good enough. I could also use groupby hour and then e.g. plot a bar chart by count but that's not exactly what I was looking for, either...

...or I could convert each row to a string and then immediately back to pd.Timestamp with the same date. It's not ideal, but works. Any better ideas?
I battled with this a bit longer and got it working decently. Is this really the most straightforward way of doing it? The lambda stuff feels always a bit far-fetched, and this one still keeps the full date which isn't a problem per se but not necessary, either (and requires extra formatting on the xaxis).
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
fig, ax = plt.subplots()
plt.xticks(rotation=45)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
# pd.Timestamp automatically sets the date to "today" if YYYYMMDD is not specified
new_df["Datetime"].apply(lambda t:pd.Timestamp(f'{t.hour:02d}:{t.minute:02d}')).hist(ax=ax)
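An alternative sketch that avoids both the lambda and the dummy date: convert each timestamp to a fractional hour of the day with the .dt accessor. The result is an ordinary float column from 0 to 24, so it stays continuous and plottable. The three sample timestamps and the name hours are assumptions for illustration.

```python
import pandas as pd

new_df = pd.DataFrame({"Datetime": pd.to_datetime([
    "2022-08-08 16:58:00",
    "2022-08-09 09:15:00",
    "2022-08-10 16:30:00",
])})

# Hour of day as a float, e.g. 09:15 -> 9.25; the date part is simply ignored.
hours = new_df["Datetime"].dt.hour + new_df["Datetime"].dt.minute / 60
```

A histogram is then hours.hist(bins=24), and the ticks can be relabelled as HH:MM if the decimal hours are not readable enough.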

Irregular Overlapped Time Series with Total Values

I have a time series that looks like this one:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO('''
id,start,end,value
1,2021-01-01,2021-03-31,1000
1,2021-01-01,2021-06-30,2000
1,2021-01-01,2021-12-31,5000
2,2021-01-01,2021-02-01,100
2,2021-02-02,2021-05-04,200
2,2021-01-01,2021-08-24,1000
'''))
The steps can be irregular and may overlap, and each value is the total for its whole interval. I need to transform this into the smallest possible non-overlapping time intervals per id, discounting the values already covered by the overlapped intervals. The output for that time series would look like this:
output = pd.read_csv(StringIO('''
id,start,end,value
1,2021-01-01,2021-03-31,1000
1,2021-04-01,2021-06-30,1000
1,2021-07-01,2021-12-31,3000
2,2021-01-01,2021-02-01,100
2,2021-02-02,2021-05-04,200
2,2021-05-05,2021-08-24,700
'''))
I tried to adapt the answer to "Overlap in date range grouped dataframe" to my problem, but without success.
Thanks in advance!
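One possible approach, sketched under assumptions that the question does not state explicitly: within each id, a wider interval fully contains the earlier intervals it overlaps, and its value is the total over its whole span. Processing rows in order of end date, each row then keeps only what is left after the pieces it contains are discounted. The names ONE_DAY, resolved and result are illustrative.

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''
id,start,end,value
1,2021-01-01,2021-03-31,1000
1,2021-01-01,2021-06-30,2000
1,2021-01-01,2021-12-31,5000
2,2021-01-01,2021-02-01,100
2,2021-02-02,2021-05-04,200
2,2021-01-01,2021-08-24,1000
'''), parse_dates=['start', 'end'])

ONE_DAY = pd.Timedelta(days=1)
rows = []
for id_, group in df.groupby('id'):
    resolved = []  # non-overlapping (start, end, value) pieces found so far for this id
    for row in group.sort_values('end').itertuples(index=False):
        # Pieces already resolved that lie entirely inside the current interval.
        inside = [p for p in resolved if p[0] >= row.start and p[1] <= row.end]
        # What remains of the total once those pieces are discounted.
        value = row.value - sum(p[2] for p in inside)
        # The remainder starts the day after the last piece it contains.
        start = max(p[1] for p in inside) + ONE_DAY if inside else row.start
        resolved.append((start, row.end, value))
    rows.extend((id_, *p) for p in resolved)

result = pd.DataFrame(rows, columns=['id', 'start', 'end', 'value'])
```

On the sample data this reproduces the expected output above. Partially overlapping intervals (neither one containing the other) are outside what this sketch handles.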

The matplotlib chart changes when I change the index in python pandas dataframe

I have a dataset of S&P500 historical prices with the date, the price and other data that I don't need right now to solve my problem.
      Date  Price
0  1981.01   6.19
1  1981.02   6.17
2  1981.03   6.24
3  1981.04   6.25
...
and so on till 2020
The date is a float with the year, a dot and the month.
I tried to plot all historical prices with matplotlib.pyplot as plt.
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result. I used df["Price"].tail(100) so that the difference between the first and the second graph (which you'll see in a moment) is easier to spot.
But then I tried to change the index from the default one (0, 1, 2, etc.) to the df["Date"] column of the DataFrame in order to see the date on the x axis.
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result, and it's quite disappointing.
I have the Date where it should be in the x axis but the problem is that the graph is different from the one before which is the right one.
If you need the dataset to try out the problem here you can find it.
It is called U.S. Stock Markets 1871-Present and CAPE Ratio.
Hope you've understood everything.
Thanks in advance
UPDATE
I found something that could be causing the problem. If you look closely at the dates, you can see that month #10 is written as a float without its trailing zero in the original dataset: for example, October 1884 is stored as 1884.1. The problem occurs when you use pd.to_datetime() to transform the Date float series into datetimes. The October date then becomes 1884-01-01 (to stay with the example), which is the first month of the year, and that affects the final plot.
SOLUTION
Finally, I solved my problem!
Yes, the error was the one I explained in the UPDATE paragraph, so I decided to append a "0" wherever the length of the Date (as a string) is 6, in order to change, for example: 1884.1 ==> 1884.10
df["Date"] = df["Date"].astype(str)  # len() needs strings, not floats
df["len"] = df["Date"].apply(len)
df["Date"] = df["Date"].where(df["len"] == 7, df["Date"] + "0")
Then I drop the len column I've just created.
df.drop(columns="len", inplace=True)
At the end I changed the "Date" to a Datetime with pd.to_datetime
df["Date"] = pd.to_datetime(df["Date"], format='%Y.%m')
df = df.set_index("Date")
And then I plot
df["Price"].tail(100).plot()
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
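A minimal sketch of the trailing-zero problem from the UPDATE and a one-step way around it, assuming the Date column holds floats as described. The three sample values are made up; formatting with '{:.2f}' is an alternative to the length check above, not the method used in the SOLUTION.

```python
import pandas as pd

dates = pd.Series([1884.09, 1884.1, 1884.11])  # September, October, November 1884

# Converting the floats to text drops October's trailing zero,
# which is why it can be mistaken for month 1.
as_text = dates.astype(str).tolist()

# Formatting with exactly two decimals restores it before parsing.
padded = dates.map('{:.2f}'.format)
parsed = pd.to_datetime(padded, format='%Y.%m')
```

Here as_text contains '1884.1' for October, while padded contains '1884.10', so the parsed months come out as 9, 10 and 11.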

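The easiest way would be to transform the date into an actual datetime index. This way matplotlib will automatically pick it up and plot it accordingly. For example, given your date format, you could do the following (the floats are formatted with two decimals first, so that October, stored as e.g. 1884.1, is not read as January):
df["Date"] = pd.to_datetime(df["Date"].map("{:.2f}".format), format='%Y.%m')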
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
Currently, the first plot you showed is actually plotting the Price column against the index, which seems to be a regular range index from 0 to 1800 and something. You suggested your data starts in 1981, so those index values are not dates, but each observation is evenly spaced on the x axis (at an interval of 1, which is the jump from one index value to the next). That's why the chart looks reasonable, yet the x-axis values don't.
Now when you set the Date (as float) to be the index, note that you're not evenly covering the interval between, for example, 1981 and 1982. You have evenly spaced values from 1981.01 to 1981.12, but nothing from 1981.12 to 1982.01. So the second chart is also plotted exactly as the data says, just not the way you wanted. Setting the index to a DatetimeIndex as described above should remove this issue, as Matplotlib will know how to evenly space the dates along the x-axis.

I think your problem is that your Date is of float type, and using it as the x-axis does exactly what is expected for an array of the kind ([2012.01, 2012.02, ..., 2012.12, 2013.01....]). You might convert the Date column to a DatetimeIndex first and then use the built-in pandas plot method:
df["Price"].tail(100).plot()

It is not a good idea to treat df['Date'] as float. It should be converted into pandas datetime64[ns]. This can be achieved using pandas pd.to_datetime method.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('ie_data.csv')
df=df[['Date','Price']]
df.dropna(inplace=True)
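#converting to pandas datetime format
#two decimals keep October as e.g. 1884.10; str() alone would give 1884.1, which parses as January
df['Date'] = df['Date'].astype(float).map('{:.2f}'.format).str.replace('.', '', regex=False)
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m')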
df.set_index(['Date'],inplace=True)
#plotting
df.plot() #full data plot
df.tail(100).plot() #plotting just the tail
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()

Convert date in pandas

I know this has been asked like 100 times but I still don't get it and the given solutions don't get me anywhere.
I'm trying to convert time into a comparable format with Pandas/Python. I used database entries as data, and currently I have trouble using times like this:
52 2017-08-04 12:26:56.348698
53 2017-08-04 12:28:22.961560
54 2017-08-04 12:34:20.299041
the goal is to use it as year1 and year2 to make a graph like:
def sns_compare(year1,year2):
    f, (ax1) = plt.subplots(1, figsize=LARGE_FIGSIZE)
    for yr in range(int(year1),int(year2)):
        sns.distplot(tag2[str(yr)].dropna(), hist=False, kde=True, rug=False, bins=25)
sns_compare(year1,year2)
When I try to do it like this I get ValueError: invalid literal for int() with base 10: '2017-08-04 12:34:20.299041'.
So currently I'm thinking about using regex to manipulate the time fields, but this can't be the way to go, or at least I can't imagine it is. I tried all kinds of suggestions from SO/GitHub but nothing really worked. I also don't know what the "optimal" time structure should look like. Is it 20170804123420299041 or something like 2017-08-04-12-34-20-299041? I hope somebody can make this clear to me.

This is your data:
from matplotlib import pyplot as plt
from datetime import datetime
import pandas as pd
df = pd.DataFrame([("2017-08-04 12:26",56.348698),("2017-08-04 12:28",22.961560),("2017-08-04 12:34",20.299041)])
df.columns = ["date", "val"]
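First, we convert to datetime, then we subtract the start of year1, and finally we convert the difference to days (year1 and year2 are set to example values here):
year1, year2 = 2017, 2018  # example values
df['date'] = pd.to_datetime(df["date"])
df["days"]=(df['date'] -datetime(year1,1,1)).dt.total_seconds()/86400.0
Plot the data, and display only the days between year1 and year2: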
plt.scatter(df["days"],df["val"])
plt.xlim((0,(year2-year1)*365))
plt.show()

Have you looked at pd.to_datetime? Pandas and Seaborn should be able to handle dates fine, and you don't have to convert them to integers.
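A minimal sketch of what that looks like for the timestamps in the question, assuming they arrive as strings. The names stamps, parsed, years and in_range are illustrative. Once parsed, the values compare and subtract directly, and the year is available through the .dt accessor without any regex.

```python
import pandas as pd

stamps = pd.Series([
    "2017-08-04 12:26:56.348698",
    "2017-08-04 12:28:22.961560",
    "2017-08-04 12:34:20.299041",
])

# Parse the strings into datetime64 values; microseconds are preserved.
parsed = pd.to_datetime(stamps)

# Integer years, usable for a year1 <= year < year2 filter.
years = parsed.dt.year
in_range = parsed[(years >= 2017) & (years < 2018)]

# Differences are Timedeltas, so elapsed time needs no manual string handling.
gap_seconds = (parsed.iloc[1] - parsed.iloc[0]).total_seconds()
```

The parsed column can also be used as a DatetimeIndex, which lets a frame be sliced by year with df.loc["2017"].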
