I have a DataFrame with three columns, one of which (hour) follows a repeating pattern. It looks like this:
>>> df
date hour value
0 01/01/2022 1 0.267648
1 01/01/2022 2 1.564420
2 01/01/2022 ... 0.702019
3 01/01/2022 24 1.504663
4 01/02/2022 1 0.309097
5 01/02/2022 2 0.309097
6 01/02/2022 ... 0.309097
7 01/02/2022 24 0.309097
>>>
I want to make a heatmap from this: the x-axis would be the month, the y-axis the hour of the day, and each cell the median of all values for that hour across every day in the month.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df.date = pd.to_datetime(df.date)
df['month'] = df.date.dt.month
pivot = df.pivot_table(columns='month', index='hour', values='value', aggfunc='median')
sns.heatmap(pivot.sort_index(ascending=False))
plt.show()
Output: (image: Seaborn heatmap)
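To see what the pivot step produces before it reaches the heatmap, here is a minimal sketch on synthetic data. The dates, the made-up values (hour plus day of month) and the month-first date format are assumptions for illustration, not the asker's data.

```python
import pandas as pd

# Synthetic data (an assumption for illustration): three days in January
# and three in February, 24 hourly rows each, value = hour + day of month.
rows = []
for date in ["01/01/2022", "01/02/2022", "01/03/2022",
             "02/01/2022", "02/02/2022", "02/03/2022"]:
    day = int(date[3:5])
    for hour in range(1, 25):
        rows.append({"date": date, "hour": hour, "value": float(hour + day)})
df = pd.DataFrame(rows)

# Assumes the strings are month/day/year.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
df["month"] = df["date"].dt.month

# One row per hour, one column per month, each cell the median over the days.
pivot = df.pivot_table(index="hour", columns="month", values="value", aggfunc="median")
print(pivot.head())
```

The resulting frame has 24 rows and one column per month, which is the shape sns.heatmap expects.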
I'm new to Python. I hope you can help me.
I have a dataframe with two columns. The first column is called dates and the second column is filled with numbers. The dataframe has 351 rows.
dates numbers
01.03.2019 5
02.03.2019 8
...
20.02.2020 3
21.02.2020 2
I want the whole first column to be on the x-axis. I tried to plot it like this:
graph = FinalDataframe.plot(figsize=(12, 8))
graph.legend(loc='upper center', bbox_to_anchor=(0.5, -0.075), ncol=4)
graph.set_xticklabels(FinalDataframe['dates'])
plt.show()
But the x-axis shows only the first few values from the column instead of the whole column. Furthermore, the labels do not line up with the data from the second column.
Any suggestions?
Thank you in advance!
Your issue is that x ticks are generated automatically and spaced out to be readable. However, you then tell matplotlib to use all the labels, so only the first few labels land on those few ticks. The simple fix is to tell it to use one tick per entry, but that’s going to make your x-axis unreadable:
graph.set_xticks(range(len(FinalDataframe['dates'])))
Now you could space them out manually:
graph.set_xticks(range(0, len(FinalDataframe['dates']), 61))
graph.set_xticklabels(FinalDataframe['dates'][::61])
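The point of the two calls above is that tick positions and tick labels must be the same length and use the same step. A small sketch of that bookkeeping, using a stand-in for the dates column (the variable names here are hypothetical):

```python
import pandas as pd

# Stand-in for FinalDataframe['dates']: one string per day (an assumption).
dates = pd.date_range("2019-03-01", "2020-02-21", freq="D").strftime("%d.%m.%Y")

step = 61
positions = list(range(0, len(dates), step))  # what set_xticks would receive
labels = list(dates[::step])                  # what set_xticklabels would receive

print(positions)
print(labels)
```

Because both are built with the same step, each label ends up on the row it belongs to.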
However, the best way to plot dates on the x-axis is still to use pandas’ built-in date objects. We can do this with pd.to_datetime.
This also lets pandas know where to place points on the x-axis, once you specify that the x-axis should be the dates. That way, if dates are unsorted or missing, the gaps are handled properly and each point sits above the right date on the x-axis.
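One way to see why the conversion matters: day-first strings do not sort chronologically, while parsed dates do. A minimal sketch with three made-up dates in the question's dd.mm.YYYY format:

```python
import pandas as pd

# Three made-up dates in the question's dd.mm.YYYY format.
s = pd.Series(["22.03.2019", "21.02.2020", "01.03.2019"])

# Sorting the raw strings compares them character by character.
as_text = s.sort_values().tolist()

# Parsing first gives a true chronological order.
parsed = pd.to_datetime(s, format="%d.%m.%Y")
as_dates = parsed.sort_values().dt.strftime("%d.%m.%Y").tolist()

print(as_text)
print(as_dates)
```

In the string sort, 21.02.2020 lands before 22.03.2019 even though it is almost a year later.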
I’m first recreating a dataframe that looks like what you posted:
>>> df = pd.DataFrame({'dates': pd.date_range('20190301', '20200221', freq='D').strftime('%d.%m.%Y'), 'numbers': np.random.randint(0, 10, 358)})
>>> df
dates numbers
0 01.03.2019 2
1 02.03.2019 2
2 03.03.2019 5
3 04.03.2019 4
4 05.03.2019 3
.. ... ...
353 17.02.2020 2
354 18.02.2020 1
355 19.02.2020 2
356 20.02.2020 3
357 21.02.2020 1
(This should be the same as FinalDataframe, or if your dates are the index, then it’s the same as FinalDataframe.reset_index())
Now I’m converting the dates:
>>> df['dates'] = pd.to_datetime(df['dates'], format='%d.%m.%Y')
>>> df
dates numbers
0 2019-03-01 2
1 2019-03-02 2
2 2019-03-03 5
3 2019-03-04 4
4 2019-03-05 3
.. ... ...
353 2020-02-17 2
354 2020-02-18 1
355 2020-02-19 2
356 2020-02-20 3
357 2020-02-21 1
You can check that your column contains dates and not string representations of dates by checking the dtypes:
>>> df.dtypes
dates datetime64[ns]
numbers int64
Finally plotting:
>>> ax = df.plot(x='dates', y='numbers', figsize=(12, 8))
>>> ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.075), ncol=4)
<matplotlib.legend.Legend object at 0x7fc8c24fd4f0>
>>> plt.show()
Legends are taken care of automatically. This is what you get:
I have created a visualization utilizing the plotly library within Python. Everything looks fine, except the axis is starting with 2020 and then shows 2019. The axis should be the opposite of what is displayed.
Here is the data (df):
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
This is what I am doing
import plotly.express as px
px.scatter(df, x = "date", y = "percent", color = "type", facet_col = "type")
How would I make it so that the dates are sorted correctly, earliest to latest? The dates are sorted within the raw data so why is it not reflecting this on the graph?
Any suggestion will be appreciated.
Here is the result:
It is plotting in the order of your df. If you want date order, sort the dataframe by date first.
df.sort_values('date', inplace=True)
A lot of other graphing utilities (Seaborn, etc) by default sort when plotting. Plotly Express does not do this.
Your date column seems to be a string. If you convert it to a datetime you don't have to sort your dataframe: plotly express will set the x-axis to datetime:
Working code example:
import pandas as pd
import plotly.express as px
from io import StringIO
text = """
date percent type
3/1/2020 10 a
3/1/2020 0 b
4/1/2020 15 a
4/1/2020 60 b
1/1/2019 25 a
1/1/2019 1 b
2/1/2019 50 c
2/1/2019 20 d
"""
df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)
df['date'] = pd.to_datetime(df['date'])  # parse the strings so plotly uses a date axis
px.scatter(df, x="date", y="percent", color="type", facet_col="type")
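To check which situation you are in before plotting, you can inspect the column's dtype. The sketch below uses a cut-down copy of the question's data and assumes the strings are month/day/year; it shows the conversion and, for completeness, the sort suggested in the other answer.

```python
import pandas as pd

# Cut-down copy of the question's data, in the same (unsorted) order.
df = pd.DataFrame({
    "date": ["3/1/2020", "4/1/2020", "1/1/2019", "2/1/2019"],
    "percent": [10, 15, 25, 50],
})

# Before conversion the column holds plain strings.
print(df["date"].dtype)

# Assumes month/day/year; adjust the format if the data is day-first.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
df = df.sort_values("date")

print(df["date"].dtype)
print(df)
```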
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
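Use Grouper with GroupBy.last to take the last entry of each month, forward fill months without entries with ffill, and turn the result back into columns with Series.reset_index:
# if necessary, convert the column to datetime first
#df['date'] = pd.to_datetime(df['date'])
# note: pandas 2.2 and later deprecate the 'm'/'M' alias used below in favor of 'ME'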
df = df.groupby(pd.Grouper(freq='m',key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('m',on='date')['totalShrs'].last().ffill().reset_index()
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
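If the month-end alias differs between your pandas versions, a variant that groups on monthly periods sidesteps it. This is a sketch of an alternative, not the method above: it takes the last entry per calendar month, reindexes over every month in the range so empty months appear, then forward fills.

```python
import pandas as pd

# The question's sample data (date and totalShrs only).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-04-23", "2009-04-24", "2009-04-27", "2009-04-30",
        "2009-05-05", "2009-05-20", "2009-05-28", "2009-05-29",
        "2009-06-01", "2009-06-02", "2009-06-12", "2009-08-03",
    ]),
    "totalShrs": [10000.0 * i for i in range(1, 13)],
})

df = df.sort_values("date")
month = df["date"].dt.to_period("M")

# Last balance recorded in each month that has entries.
monthly = df.groupby(month)["totalShrs"].last()

# Add the months with no entries (July here) and carry the balance forward.
full = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
monthly = monthly.reindex(full).ffill()

# Label each month by its last calendar day.
monthly.index = monthly.index.to_timestamp(how="end").normalize()
print(monthly)
```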
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand = True)[1] # split date column to get month column
newdf = pd.DataFrame(columns=df.columns) # create a new dataframe for output
grouped = df.groupby('month') # get grouped values
for g in grouped: # for each group, get last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf),:] = gdf.iloc[-1,:] # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1) # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
I have a data frame with columns for every month of every year from 2000 to 2016
df.columns
output
Index(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
'2000-07', '2000-08', '2000-09', '2000-10',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=200)
and I would like to group these columns by quarter.
I have built a dictionary, believing the best approach would be groupby followed by an aggregate with mean:
m2q = {'2000q1': ['2000-01', '2000-02', '2000-03'],
'2000q2': ['2000-04', '2000-05', '2000-06'],
'2000q3': ['2000-07', '2000-08', '2000-09'],
...
'2016q2': ['2016-04', '2016-05', '2016-06'],
'2016q3': ['2016-07', '2016-08']}
but
df.groupby(m2q)
is not giving me the desired output.
In fact, it's giving me an empty grouping.
Any suggestions to make this grouping work?
Or perhaps a more Pythonic solution to categorize by quarters, taking the mean of the specified columns?
You can convert your index to a DatetimeIndex (example 1) or a PeriodIndex (example 2).
Please also see the "Time Series / Date functionality" section of the pandas documentation for more detail.
import numpy as np
import pandas as pd
idx = ['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
'2000-07', '2000-08', '2000-09', '2000-10', '2000-11', '2000-12']
df = pd.DataFrame(np.arange(12), index=idx, columns=['SAMPLE_DATA'])
print(df)
SAMPLE_DATA
2000-01 0
2000-02 1
2000-03 2
2000-04 3
2000-05 4
2000-06 5
2000-07 6
2000-08 7
2000-09 8
2000-10 9
2000-11 10
2000-12 11
# Handle your timeseries data with pandas timeseries / date functionality
df.index=pd.to_datetime(df.index)
example 1
print(df.resample('Q').sum())
SAMPLE_DATA
2000-03-31 3
2000-06-30 12
2000-09-30 21
2000-12-31 30
example 2
print(df.to_period('Q').groupby(level=0).sum())
SAMPLE_DATA
2000Q1 3
2000Q2 12
2000Q3 21
2000Q4 30
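The examples above put the months in the index, whereas the question has them as column labels. A minimal sketch for that layout, on a hypothetical two-row frame: map each column label to its quarter with PeriodIndex, then group the transposed frame and take the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column per month, as in the question.
cols = [f"2000-{m:02d}" for m in range(1, 13)]
df = pd.DataFrame(np.arange(24).reshape(2, 12), columns=cols, index=["x", "y"])

# Each 'YYYY-MM' label becomes the quarter it falls in.
quarters = pd.PeriodIndex(df.columns, freq="Q")

# Transpose so the months are rows, group by quarter, average, transpose back.
quarterly = df.T.groupby(quarters).mean().T
print(quarterly)
```

Transposing avoids grouping along axis=1, which recent pandas releases deprecate.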
I create a pandas dataframe with a DatetimeIndex like so:
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create datetime index and random data column
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=14, freq='D')
data = np.random.randint(1, 10, size=14)
columns = ['A']
df = pd.DataFrame(data, index=index, columns=columns)
# initialize new weekend column, then set all values to 'yes' where the index corresponds to a weekend day
df['weekend'] = 'no'
df.loc[(df.index.weekday == 5) | (df.index.weekday == 6), 'weekend'] = 'yes'
print(df)
Which gives
A weekend
2014-10-13 7 no
2014-10-14 6 no
2014-10-15 7 no
2014-10-16 9 no
2014-10-17 4 no
2014-10-18 6 yes
2014-10-19 4 yes
2014-10-20 7 no
2014-10-21 8 no
2014-10-22 8 no
2014-10-23 1 no
2014-10-24 4 no
2014-10-25 3 yes
2014-10-26 8 yes
I can easily plot the A column with pandas by doing:
df.plot()
plt.show()
which plots a line of the A column but leaves out the weekend column as it does not hold numerical data.
How can I put a "marker" on each spot of the A column where the weekend column has the value yes?
In the meantime I found out that it is as simple as using boolean indexing in pandas. Here I plot directly with pyplot rather than pandas' own plot wrapper (which is more convenient to me):
plt.plot(df.index, df.A)
plt.plot(df[df.weekend=='yes'].index, df[df.weekend=='yes'].A, 'ro')
Now the red dots mark all weekend days, i.e. the rows where df.weekend == 'yes'.
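The helper column is not strictly needed: the same boolean mask can be built straight from the index. A small sketch with a fixed date range and made-up values, so the selected rows are predictable:

```python
import pandas as pd

# Fixed two-week range starting on a Monday, with made-up values.
index = pd.date_range("2014-10-13", periods=14, freq="D")
df = pd.DataFrame({"A": range(1, 15)}, index=index)

# weekday is 0 for Monday through 6 for Sunday, so >= 5 means Saturday or Sunday.
is_weekend = df.index.weekday >= 5
weekend = df[is_weekend]
print(weekend)
```

Passing weekend.index and weekend.A to plt.plot with 'ro', as above, draws the same markers.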