Problems while plotting label vs datetime in a pandas column? [duplicate] - python

I have the following pandas dataframe, which consists of datetime timestamps and user ids:
id datetime
130 2018-05-17 19:46:18
133 2018-05-17 20:59:57
131 2018-05-17 21:54:01
142 2018-05-17 22:49:07
114 2018-05-17 23:02:34
136 2018-05-18 06:06:48
324 2018-05-18 12:21:38
180 2018-05-18 12:49:33
120 2018-05-18 14:03:58
120 2018-05-18 15:28:36
How can I plot the id on the y axis and the day or minutes on the x axis? I tried:
plt.plot(df3['datetime'], df3['id'], '|')
plt.xticks(rotation='vertical')
However, I have two problems. First, my dataframe is quite large and contains many ids. Second, I wasn't able to give each id its own label on the y axis and plot it against its datetime value on the x axis. Any idea how to do this?
The whole objective of this plot is to visualize the logins over time of a specific user.

Something like this?
X axis: date, Y axis: id
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
# set your data as df
# keep only the YYYY-mm-dd part of the original `datetime` column
df['datetime'] = pd.to_datetime(df['datetime']).dt.date
# plot
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.plot(df.datetime, df.id, '|')
plt.gcf().autofmt_xdate()
Output: (figure not reproduced)
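If each id should get its own row on the y axis regardless of its numeric value, one option is to plot the ids as strings, which matplotlib treats as categories. The following is only a sketch under that assumption; it uses the ten sample rows from the question, and the figure size and marker size are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "id": [130, 133, 131, 142, 114, 136, 324, 180, 120, 120],
    "datetime": pd.to_datetime([
        "2018-05-17 19:46:18", "2018-05-17 20:59:57", "2018-05-17 21:54:01",
        "2018-05-17 22:49:07", "2018-05-17 23:02:34", "2018-05-18 06:06:48",
        "2018-05-18 12:21:38", "2018-05-18 12:49:33", "2018-05-18 14:03:58",
        "2018-05-18 15:28:36",
    ]),
})

# Strings are plotted as categories, so every id gets its own evenly spaced row.
labels = df["id"].astype(str).tolist()

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["datetime"], labels, "|", markersize=12)
ax.set_xlabel("datetime")
ax.set_ylabel("id")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

Call plt.show() to display the figure. For a large dataframe it may be more readable to filter first, for example df[df["id"] == 120], and plot the logins of one user at a time.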

Related

How to aggregate a metric and plot groups separately

I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
df['date_m'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
And I want to make a line plot grouped by month and campaign, so I have tried this code:
df['sales'].groupby(df['date_m','campaign']).mean().plot.line()
But I get this error message KeyError: ('date_m', 'campaign'). Please, any help will be greatly appreciated.
Plotting is typically dependent upon the shape of the DataFrame.
.groupby creates a long format DataFrame, which is great for seaborn
.pivot_table creates a wide format DataFrame, which easily works with pandas.DataFrame.plot
.groupby the DataFrame
df['sales'].groupby(...) is incorrect, because df['sales'] selects one column of the dataframe; none of the other columns are available
.groupby converts the DataFrame into a long format, which is great for plotting with seaborn.lineplot.
Specify the hue parameter to separate by 'campaign'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# perform groupby and reset the index
dfg = df.groupby(['date_m','campaign'])['sales'].mean().reset_index()
# display(dfg.head())
date_m campaign sales
0 2011-01-01 0 10000
1 2011-01-01 1 7000
2 2011-02-01 0 11000
3 2011-02-01 1 8000
4 2011-03-01 0 12000
# plot with seaborn
sns.lineplot(data=dfg, x='date_m', y='sales', hue='campaign')
.pivot_table the DataFrame
.pivot_table shapes the DataFrame correctly for plotting with pandas.DataFrame.plot, and it has an aggregation parameter.
The DataFrame is shaped into a wide format.
# pivot the dataframe into the correct shape for plotting
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')
# display(dfp.head())
campaign 0 1
date_m
2011-01-01 10000 7000
2011-02-01 11000 8000
2011-03-01 12000 5000
2011-04-01 10500 6000
2011-05-01 10000 6000
# plot the dataframe
dfp.plot()
Plotting with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 6))
for v in df.campaign.unique():
    # select the data based on the campaign
    data = df[df.campaign.eq(v)]
    # this is only necessary if there is more than one value per date
    data = data.groupby(['date_m','campaign'])['sales'].mean().reset_index()
    ax.plot('date_m', 'sales', data=data, label=f'{v}')
plt.legend(title='campaign')
plt.show()
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4
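On the KeyError itself: df['date_m','campaign'] asks for a single column whose label is the tuple ('date_m', 'campaign'), and no such column exists. The sketch below rebuilds the question's data (assembling date_m from year, month and an assumed day of 1) and shows that the original "select, then group" order also works when the grouping keys are passed as a list of Series.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2011] * 12,
    "month": [1, 2, 3, 4, 5, 6] * 2,
    "campaign": [0] * 6 + [1] * 6,
    "sales": [10000, 11000, 12000, 10500, 10000, 9500,
              7000, 8000, 5000, 6000, 6000, 7000],
})
# Build the month start date from the year and month columns (day fixed to 1).
df["date_m"] = pd.to_datetime(df[["year", "month"]].assign(day=1))

# A tuple inside [] is looked up as ONE column label, hence the KeyError.
try:
    df["date_m", "campaign"]
except KeyError as err:
    print("KeyError:", err)

# Passing the key columns as a list of Series keeps the "select, then group" order.
by_series = df["sales"].groupby([df["date_m"], df["campaign"]]).mean()

# Equivalent: group the whole frame by column names, then select the column.
by_frame = df.groupby(["date_m", "campaign"])["sales"].mean()
```

Both results are Series indexed by (date_m, campaign); calling .unstack('campaign').plot() on either one gives one line per campaign.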

How to plot data from csv for specific date and time using matplotlib?

I have written a python program to get data from csv using pandas and plot the data using matplotlib. My code is below with result:
import pandas as pd
import datetime
import csv
import matplotlib.pyplot as plt
headers = ['Sensor Value','Date','Time']
df = pd.read_csv(r'C:/Users\Lala Rushan\Downloads\DataLog.CSV',parse_dates= {"Datetime" : [1,2]},names=headers)
#pd.to_datetime(df['Date'] + ' ' + df['Time'])
#df.apply(lambda r : pd.datetime.combine(r['Date'],r['Time']),)
print (df)
#f = plt.figure(figsize=(10, 10))
df.plot(x='Datetime', y='Sensor Value')
plt.title('Title here!', color='black')
plt.tight_layout()
plt.show()
Now, as you can see, the x-axis looks horrible. How can I plot the x-axis for a single date and time interval so that the labels do not overlap each other? I have stored both date and time as one column in my dataframe.
My Dataframe looks like this:
Datetime Sensor Value
0 2017/02/17 19:06:17.188 2
1 2017/02/17 19:06:22.360 72
2 2017/02/17 19:06:27.348 72
3 2017/02/17 19:06:32.482 72
4 2017/02/17 19:06:37.515 74
5 2017/02/17 19:06:42.580 70
Hacky way
Try this:
import pylab as pl
pl.xticks(rotation = 90)
It will rotate the labels by 90 degrees, thus eliminating overlap.
Cleaner way
Use fig.autofmt_xdate() and let matplotlib pick the best way to format your dates.
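A minimal sketch of the autofmt_xdate approach, assuming the six sample rows from the question. The '%H:%M:%S' label format is an illustrative choice that is not part of the original answer.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows copied from the question; the format string matches their layout.
df = pd.DataFrame({
    "Datetime": pd.to_datetime(
        [
            "2017/02/17 19:06:17.188", "2017/02/17 19:06:22.360",
            "2017/02/17 19:06:27.348", "2017/02/17 19:06:32.482",
            "2017/02/17 19:06:37.515", "2017/02/17 19:06:42.580",
        ],
        format="%Y/%m/%d %H:%M:%S.%f",
    ),
    "Sensor Value": [2, 72, 72, 72, 74, 70],
})

fig, ax = plt.subplots()
ax.plot(df["Datetime"], df["Sensor Value"])
# Show only the time of day on each tick, since all rows share one date.
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S"))
fig.autofmt_xdate()  # rotate and right-align the tick labels
```

Call plt.show() afterwards to display the figure.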
Pandas way
Use to_datetime() and set_index with DataFrame.plot():
df.Datetime = pd.to_datetime(df.Datetime)
# set_index returns a new DataFrame, so assign the result back
df = df.set_index('Datetime')
df['Sensor Value'].plot()
pandas will then take care to plot it nicely for you:

Pandas Frequency Conversion

I'm trying to find out whether it is possible to use data.asfreq(MonthEnd()) with data that was not created with date_range.
Here is what I'm trying to achieve. I read a CSV with the following code:
import numpy as np
import pandas as pd
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True)
data.columns = ["period", "integ"]
data['period'] = pd.to_datetime(data['period'], infer_datetime_format=True)
Then I want to assign frequency to my 'period' column by doing this:
tdelta = data.period[1] - data.period[0]
data.period.freq = tdelta
And some print commands:
print(data)
print(data.period.freq)
print(data.dtypes)
Returns:
..........
270 1948-07-01 2033.2
271 1948-04-01 2021.9
272 1948-01-01 1989.5
273 1947-10-01 1960.7
274 1947-07-01 1930.3
275 1947-04-01 1932.3
276 1947-01-01 1934.5
[277 rows x 2 columns]
-92 days +00:00:00
period datetime64[ns]
integ float64
dtype: object
I can also parse the original 'DATE' column by making it 'index':
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True, index_col='DATE')
What I want to do is just to convert the quarterly data into monthly rows. For example:
270 1948-07-01 2033.2
271 1948-06-01 NaN
272 1948-05-01 NaN
273 1948-04-01 2021.9
274 1948-03-01 NaN
275 1948-02-01 NaN
276 1948-01-01 1989.5
......and so on.......
I'm eventually trying to do this by using ts.asfreq(MonthBegin()) and ts.asfreq(MonthBegin(), method='pad'), so far unsuccessfully. I get the following error:
NameError: name 'MonthBegin' is not defined
My question is: can I use asfreq if I don't use date_range to create the frame, by somehow passing my date column to the function? If this is not the solution, is there any other easy way to convert quarterly to monthly frequency?
Use a Grouper:
import pandas as pd
periods = ['1948-07-01', '1948-04-01', '1948-01-01', '1947-10-01',
           '1947-07-01', '1947-04-01', '1947-01-01']
integs = [2033.2, 2021.9, 1989.5, 1960.7, 1930.3, 1932.3, 1934.5]
df = pd.DataFrame({'period': pd.to_datetime(periods), 'integ': integs})
df = df.set_index('period')
# group by month start; min_count=1 leaves months without data as NaN instead of 0
df = df.groupby(pd.Grouper(freq='MS')).sum(min_count=1).sort_index(ascending=False)
EDIT: You can also use resample instead of a Grouper:
df.resample('MS').sum(min_count=1).sort_index(ascending=False)
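As for the NameError in the question, it only means that MonthBegin was never imported. The sketch below imports it from pandas.tseries.offsets and applies asfreq to the same seven sample values; it assumes the dates are set as a sorted DatetimeIndex first, which is what lets asfreq work without date_range.

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin  # the import missing in the question

periods = ['1948-07-01', '1948-04-01', '1948-01-01', '1947-10-01',
           '1947-07-01', '1947-04-01', '1947-01-01']
integs = [2033.2, 2021.9, 1989.5, 1960.7, 1930.3, 1932.3, 1934.5]

# asfreq needs the dates in the index; sort them into ascending order first.
ts = pd.Series(integs, index=pd.to_datetime(periods), name='integ').sort_index()

# Insert the missing month starts; the new rows are NaN.
monthly = ts.asfreq(MonthBegin())

# Same, but carry the last known quarterly value forward.
monthly_filled = ts.asfreq(MonthBegin(), method='pad')
```

To get the descending order shown in the question, append .sort_index(ascending=False) to either result.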

Pandas: bar plot with multiIndex dataframe

I have a pandas DataFrame with a TIMESTAMP column (not the index), and the timestamp format is as follows:
2015-03-31 22:56:45.510
I also have columns called CLASS and AXLES. I would like to compute the count of records for each month separately for each unique value of AXLES (AXLES can take an integer value between 3-12).
I came up with a combination of resample and groupby:
resamp = dfWIM.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
This seems to give me a multiIndex dataframe object, as shown below.
In [72]: resamp
Out [72]:
AXLES TIMESTAMP
3 2014-07-31 5517
2014-08-31 31553
2014-09-30 42816
2014-10-31 49308
2014-11-30 44168
2014-12-31 45518
2015-01-31 54782
2015-02-28 52166
2015-03-31 47929
4 2014-07-31 3147
2014-08-31 24810
2014-09-30 39075
2014-10-31 46857
2014-11-30 42651
2014-12-31 48282
2015-01-31 42708
2015-02-28 43904
2015-03-31 50033
From here, how can I access different components of this multiIndex object to create a bar plot for the following conditions?
show data when AXLES = 3
show x ticks in the Month - Year format (no days, hours, minutes etc.)
Thanks!
EDIT: Following code gives me the plot, but I could not change the xtick formatting to MM-YY.
resamp[3].plot(kind='bar')
EDIT 2: Below is a code snippet that generates a small sample of the data similar to what I have:
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
resamp[3].plot(kind='bar')
EDIT 3:
Here is the solution:
A. Plot the whole resampled dataframe (based on @Ako's suggestion):
df = resamp.unstack(0)
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
B. Plot an individual index from the resampled dataframe (based on @Alexander's suggestion):
df = resamp[3]
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
You could generate and set the labels explicitly using ax.xaxis.set_major_formatter with a ticker.FuncFormatter. This will allow you to keep your DataFrame's MultiIndex with timestamp values, while displaying the timestamps in the desired %m-%Y format:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as ticker
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
# count rows per AXLES value and month-end bin (pandas >= 2.2 spells the alias 'ME')
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M')['CLASS'].count()
ax = resamp[3].plot(kind='bar')
ticklabels = [timestamp.strftime('%m-%Y') for axle, timestamp in resamp.index]
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: ticklabels[int(x)]))
plt.gcf().autofmt_xdate()
plt.show()
yields
The following should work, but it is difficult to test without some data.
Start by resetting your index to get access to the TIMESTAMP column. Then use strftime to format it to your desired text representation (e.g. mm-yy). Finally, reset the index back to AXLES and TIMESTAMP.
df = resamp.reset_index()
df['TIMESTAMP'] = [ts.strftime('%m-%y') for ts in df.TIMESTAMP]
df.set_index(['AXLES', 'TIMESTAMP'], inplace=True)
df.xs(3, level=0).plot(kind='bar')
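Current pandas versions no longer accept the how= argument of resample, so the monthly counts can also be obtained by grouping on the calendar month directly. This is a sketch under that assumption, using the three-row sample from the question. Unlike resample, months without any rows are simply absent from the result.

```python
import pandas as pd

dfTest = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2014-08-31', '2014-09-30', '2014-10-31']),
    'AXLES': [3, 3, 3],
    'CLASS': [5, 6, 7],
})

# Group by AXLES and by the calendar month of each timestamp, then count rows.
month = dfTest['TIMESTAMP'].dt.to_period('M')
counts = dfTest.groupby(['AXLES', month])['CLASS'].count()

# Pick AXLES == 3 and turn the monthly periods into MM-YYYY text labels.
axle3 = counts.xs(3, level='AXLES')
axle3.index = axle3.index.strftime('%m-%Y')

ax = axle3.plot(kind='bar', rot=0)
```

Because the index now holds plain strings, the bar plot shows them as tick labels without any further formatter.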

When plotting datetime index data, put markers in the plot on specific days (e.g. weekend)

I create a pandas dataframe with a DatetimeIndex like so:
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create datetime index and random data column
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=14, freq='D')
data = np.random.randint(1, 10, size=14)
columns = ['A']
df = pd.DataFrame(data, index=index, columns=columns)
# initialize new weekend column, then set all values to 'yes' where the index corresponds to a weekend day
df['weekend'] = 'no'
df.loc[(df.index.weekday == 5) | (df.index.weekday == 6), 'weekend'] = 'yes'
print(df)
Which gives
A weekend
2014-10-13 7 no
2014-10-14 6 no
2014-10-15 7 no
2014-10-16 9 no
2014-10-17 4 no
2014-10-18 6 yes
2014-10-19 4 yes
2014-10-20 7 no
2014-10-21 8 no
2014-10-22 8 no
2014-10-23 1 no
2014-10-24 4 no
2014-10-25 3 yes
2014-10-26 8 yes
I can easily plot the A column with pandas by doing:
df.plot()
plt.show()
which plots a line of the A column but leaves out the weekend column as it does not hold numerical data.
How can I put a "marker" on each spot of the A column where the weekend column has the value yes?
Meanwhile I found out that it is as simple as using boolean indexing in pandas. Here the plot is done directly with pyplot instead of pandas' own plot wrapper (which is more convenient to me):
plt.plot(df.index, df.A)
plt.plot(df[df.weekend=='yes'].index, df[df.weekend=='yes'].A, 'ro')
Now the red dots mark all weekend days, i.e. the rows where df.weekend == 'yes'.
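A variant of the same idea, sketched with fixed sample values instead of random data: the weekend mask is computed straight from the index, so the helper column is not needed, and the markevery argument of plot restricts the markers to those positions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fixed dates and values (taken from the printed frame) keep the sketch reproducible.
index = pd.date_range('2014-10-13', periods=14, freq='D')
df = pd.DataFrame({'A': [7, 6, 7, 9, 4, 6, 4, 7, 8, 8, 1, 4, 3, 8]}, index=index)

# weekday is 5 for Saturday and 6 for Sunday.
is_weekend = df.index.weekday >= 5
weekend_positions = np.flatnonzero(is_weekend).tolist()

fig, ax = plt.subplots()
# Draw one line and place red circle markers only at the weekend positions.
ax.plot(df.index, df['A'], '-o', markevery=weekend_positions,
        markerfacecolor='red', markeredgecolor='red')
fig.autofmt_xdate()
```

This draws a single line object, which keeps the legend and styling in one place; call plt.show() to display it.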
