How to plot stacked time histogram starting from a Pandas DataFrame? - python

Consider the following DataFrame df:
Date Kind
2018-09-01 13:15:32 Red
2018-09-02 16:13:26 Blue
2018-09-04 22:10:09 Blue
2018-09-04 09:55:30 Red
... ...
In which you have one column with a datetime64[ns] dtype and another with an object dtype that can take only a finite number of values (in this case, 2).
You have to plot a date histogram in which you have:
On the x-axis, the dates (per-day histogram showing month and day);
On the y-axis, the number of items belonging to that date, showing in a stacked bar the difference between Blue and Red.
How is it possible to achieve this using Matplotlib?
I was thinking to do a set_index and resample as follows:
df.set_index('Date', inplace=True)
df.resample('1d').count()
But I'm losing the information on the number of items per Kind. I also want to keep any missing day as zero.
Any help is much appreciated.

Use groupby, count and unstack to adjust the dataframe:
df2 = df.groupby(['Date', 'Kind'])['Kind'].count().unstack('Kind').fillna(0)
Next, re-sample the dataframe and sum the count for each day. This will also add any missing days that are not in the dataframe (as specified). Then adjust the index to only keep the date part.
df2 = df2.resample('D').sum()
df2.index = df2.index.date
Now plot the dataframe with stacked=True:
df2.plot(kind='bar', stacked=True)
Alternatively, the plt.bar() function can be used for the final plotting:
import matplotlib.pyplot as plt

cols = df['Kind'].unique()  # all original values in the column
ind = range(len(df2))
p1 = plt.bar(ind, df2[cols[0]])
p2 = plt.bar(ind, df2[cols[1]], bottom=df2[cols[0]])
Here it is necessary to set the bottom argument of each part to be the sum of all the parts that came before.
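Putting the steps together, a minimal runnable sketch (using a small made-up df standing in for the original data) might look like:

```python
import pandas as pd

# Hypothetical sample data standing in for the original df
df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-09-01 13:15:32', '2018-09-02 16:13:26',
                            '2018-09-04 22:10:09', '2018-09-04 09:55:30']),
    'Kind': ['Red', 'Blue', 'Blue', 'Red'],
})

# Count per (timestamp, Kind), then pivot Kind into columns
df2 = df.groupby(['Date', 'Kind'])['Kind'].count().unstack('Kind').fillna(0)

# Daily totals; missing days (2018-09-03 here) become zero rows
df2 = df2.resample('D').sum()
df2.index = df2.index.date

print(df2)
# df2.plot(kind='bar', stacked=True)  # draws the stacked bars
```

The `resample('D')` step is what reinstates the missing 2018-09-03 as an all-zero row, so gaps show up as empty slots in the bar chart rather than being silently skipped.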

Related

Plotting a cumulative sum with groupby in pandas

I'm missing something really obvious or simply doing this wrong. I have two dataframes of similar structure and I'm trying to plot a time-series of the cumulative sum of one column from both. The dataframes are indexed by date:
df1
value
2020-01-01 2435
2020-01-02 12847
...
2020-10-01 34751
The plot should be grouped by month and be a cumulative sum of the whole time range. I've tried:
line1 = df1.groupby(pd.Grouper(freq='1M')).value.cumsum()
line2 = df2.groupby(pd.Grouper(freq='1M')).value.cumsum()
and then plot, but it resets after each month. How can I change this?
I am guessing you want to group and take the mean or something to represent the cumulative value for each month, and plot:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'value': np.random.randint(100, 200, 366)},
                   index=pd.date_range(start='1/1/2018', end='1/1/2019'))
df1.cumsum().groupby(pd.Grouper(freq='1M')).mean().plot()
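For context, the monthly reset happens because cumsum is applied inside each monthly group; taking cumsum over the whole series first avoids it. A tiny sketch with a toy two-month series (using `index.to_period('M')` to mirror the monthly grouping of `pd.Grouper(freq='1M')`):

```python
import pandas as pd

idx = pd.date_range('2020-01-30', '2020-02-02')  # four days spanning a month boundary
df1 = pd.DataFrame({'value': [1, 1, 1, 1]}, index=idx)

# cumsum computed *within* each monthly group: the running total resets in February
per_month = df1.groupby(df1.index.to_period('M')).value.cumsum()
print(per_month.tolist())  # [1, 2, 1, 2]

# cumsum over the whole range first: no reset
overall = df1.value.cumsum()
print(overall.tolist())    # [1, 2, 3, 4]
```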

The matplotlib chart changes when I change the index in python pandas dataframe

I have a dataset of S&P500 historical prices with the date, the price, and other data that I don't need for this problem.
Date Price
0 1981.01 6.19
1 1981.02 6.17
2 1981.03 6.24
3 1981.04 6.25
. . .
and so on till 2020
The date is a float with the year, a dot and the month.
I tried to plot all historical prices with matplotlib.pyplot as plt.
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result. I used df["Price"].tail(100) so you can see the difference between the first and the second graph more clearly (you'll see it in a second).
But then I tried to change the index from the default one (0, 1, 2, etc.) to the df["Date"] column, in order to see the dates on the x-axis.
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result, and it's quite disappointing.
I have the Date where it should be, on the x-axis, but the problem is that the graph is different from the one before, which was the right one.
If you need the dataset to try out the problem here you can find it.
It is called U.S. Stock Markets 1871-Present and CAPE Ratio.
Hope you've understood everything.
Thanks in advance
UPDATE
I found something that could cause the problem. If you look closely at the dates, month #10 is written as a float in the original dataset without a trailing zero: for example, October 1884 appears as 1884.1. The problem occurs when you use pd.to_datetime() to convert the float Date series: 1884.1 is parsed as 1884-01-01, i.e. January instead of October, which distorts the final plot.
SOLUTION
Finally, I solved my problem!
Yes, the error was the one I explained in the UPDATE paragraph, so I decided to append a "0" string wherever the length of the Date string is 6, changing, for example: 1884.1 ==> 1884.10
df["len"] = df["Date"].apply(len)
df["Date"] = df["Date"].where(df["len"] == 7, df["Date"] + "0")
Then I drop the len column I've just created.
df.drop(columns="len", inplace=True)
At the end I changed the "Date" to a Datetime with pd.to_datetime
df["Date"] = pd.to_datetime(df["Date"], format='%Y.%m')
df = df.set_index("Date")
And then I plot
df["Price"].tail(100).plot()
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
The easiest way would be to transform the date into an actual datetime index. This way matplotlib will automatically pick it up and plot it accordingly. For example, given your date format, you could do:
df["Date"] = pd.to_datetime(df["Date"].astype(str), format='%Y.%m')
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
Currently, the first plot you showed is actually plotting the Price column against the index, which seems to be a regular range index from 0 to 1800-something. Since each observation is evenly spaced along that index (the jump from one index value to the next is always 1), the chart's shape looks reasonable, even though the x-axis values are meaningless as dates.
Now when you set the Date (as a float) to be the index, note that the values do not evenly cover the interval between, for example, 1981 and 1982: they are evenly spaced from 1981.01 to 1981.12, but there is a large gap from 1981.12 to 1982.01. Matplotlib plots exactly what those floats imply, which is why the second chart looks distorted. Setting the index to a DatetimeIndex as described above removes the issue, as Matplotlib will then know how to space the dates evenly along the x-axis.
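A quick numeric check makes the uneven spacing concrete (using a few made-up float dates in the same year-dot-month format):

```python
import numpy as np

# float "dates" for Nov 1981 through Feb 1982
dates = np.array([1981.11, 1981.12, 1982.01, 1982.02])

# the step within a year is 0.01, but the step across the year boundary is 0.89
gaps = np.diff(dates)
print(gaps.round(2))  # [0.01 0.89 0.01]
```

So on a float x-axis, eleven months of data get squeezed into the width that the December-to-January step alone occupies 89 times over.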
I think your problem is that your Date is of float type, so the x-axis is exactly what you would expect for an array like [2012.01, 2012.02, ..., 2012.12, 2013.01, ...]. Convert the Date column to a DatetimeIndex first and then use the built-in pandas plot method:
df["Price"].tail(100).plot()
It is not a good idea to treat df['Date'] as float. It should be converted into pandas datetime64[ns], which can be done with the pd.to_datetime method.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('ie_data.csv')
df = df[['Date', 'Price']]
df.dropna(inplace=True)
# converting to pandas datetime format
# pad the month so '1884.1' (October) becomes '188410' rather than '18841',
# which '%Y%m' would misread as January 1884
df['Date'] = df['Date'].astype(str).map(lambda x: x.split('.')[0] + x.split('.')[1].ljust(2, '0'))
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m')
df.set_index(['Date'],inplace=True)
#plotting
df.plot() #full data plot
df.tail(100).plot() #plotting just the tail
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
Output:

Accessing last value in a time series dataframe with pandas and plotly

How would I grab the very last value of a time series?
I have a df with timeseries info for many countries, that tracks several variables and does some simple averaging etc.
I just want to grab the most recent value / values for each country and graph it with plotly. I have tried using .last() but not really sure where to fit it into the loop.
I need to grab both the last value for one chart, and the last n values for another chart.
# Daily Change
country = "X"
#Plot rolling average new cases
data = [go.Scatter(x=df_join.loc[f'{country}']['Date'],
                   y=df_join.loc[f'{country}']['Pct Change'],
                   mode='lines',
                   name='Pct Change')]
layout = go.Layout(title=f'{country}: Pct Change')
fig = go.Figure(data=data, layout=layout)
pyo.plot(fig)
IIUC you need to filter your dataframe beforehand:
dates = pd.date_range(pd.Timestamp('today'),pd.Timestamp('today') + pd.DateOffset(days=5))
df = pd.DataFrame({'Date' : dates, 'ID' : ['A','A','A','B','B','B']})
df2 = df.loc[df.groupby(['ID'])['Date'].idxmax()]
print(df2)
Date ID
2 2020-05-16 12:26:06.772939 A
5 2020-05-19 12:26:06.772939 B
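For the last n values per group (the second chart), `groupby(...).tail(n)` keeps the n most recent rows per group; a sketch with hypothetical Date/ID data, assuming the frame is sorted by Date:

```python
import pandas as pd

dates = pd.date_range('2020-05-01', periods=6)
df = pd.DataFrame({'Date': dates, 'ID': ['A', 'A', 'A', 'B', 'B', 'B']})

# last row per ID, via the idxmax approach
last = df.loc[df.groupby('ID')['Date'].idxmax()]

# last 2 rows per ID, e.g. to feed a short-window plotly trace
last2 = df.sort_values('Date').groupby('ID').tail(2)
print(last2['ID'].tolist())  # ['A', 'A', 'B', 'B']
```

Either result can then be passed to go.Scatter exactly like the full frame.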

producing a scatter plot from multi-level dataframe [pandas]

I have a big data frame, on which I've done df.groupby(["event_type", "day"]).count() and gotten the following multi-indexed df:
My aim is to produce a scatter plot that shows the number of occurrences of an event per day, sorted by event_type. So a scatter plot where the x axis is "day" and the y axis would be "id" from the above table (which is a count). But I don't know how to go about making it.
background: event_type is only 3 types. day is like 2 years of dates. "id" is id of things I'm tracking, but in the above .groupby() data frame, its actually the count of ids. I'd ideally like to get 3 separate lines plotted (one per event_type) of the id counts versus day of the year. Thanks!
I hope this will help:
a['date'] = pd.to_datetime(a['date'])
for name, group in a.groupby(['type', 'date']).count().groupby('type'):
    plt.plot(group.reset_index().set_index('date')['v1'],
             marker='o', linestyle='', label=name)
plt.legend()
If you want a normal line plot instead of a scatter, remove the marker and linestyle arguments.

pandas: filter intraday df by non-consecutive list of dates

I have dataframes of 1 minute bars going back years (the datetime is the index). I need to get a set of bars covering an irregular (non-consecutive) long list of dates.
For daily bars, I could do something like this:
datelist = ['20140101','20140205']
dfFiltered = df[df.index.isin(datelist)]
However if I try that on 1 minute bar data, it only gives me the bars with time 00:00:00, e.g. in this case it gives me two bars for 20140101 00:00:00 and 20140205 00:00:00.
My actual source df will look something like:
df1m = pd.DataFrame(index=pd.date_range('20100101', '20140730', freq='1min'),
                    data={'open': 3, 'high': 4, 'low': 1, 'close': 2}
                    ).between_time('00:00:00', '07:00:00')
Is there any better way to get all the bars for each day in the list than looping over the list? Thanks in advance.
One way is to add a date column based on the index
df1m['date'] = pd.to_datetime(df1m.index.date)
Then use that column when filtering
datelist = ['20140101','20140205']
df1m[df1m['date'].isin(datelist)]
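An alternative that avoids adding a helper column is to filter on the normalized (midnight-truncated) index directly; a sketch over a smaller date range than the original:

```python
import pandas as pd

df1m = pd.DataFrame(index=pd.date_range('20140101', '20140210', freq='1min'),
                    data={'open': 3, 'high': 4, 'low': 1, 'close': 2}
                    ).between_time('00:00:00', '07:00:00')

datelist = pd.to_datetime(['20140101', '20140205'])

# index.normalize() truncates each timestamp to midnight, so isin matches whole days
filtered = df1m[df1m.index.normalize().isin(datelist)]
print(filtered.index.normalize().unique())
```

This keeps the frame unchanged and works on any DatetimeIndex, at the cost of recomputing the normalization on each filter.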
