Using dataframe index containing the year as x axis - python

I'm plotting a chart with two y-axes, each representing a dataframe column. I used the index of one of the dataframes as the x-axis (both dataframes have the same index), but the x-tick labels are not showing correctly. I should see the years from 2000 to 2018.
I used the following code to create the plot:
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(df1.index, df1, 'g-')
ax2.plot(df1.index, df2, 'b-')
ax1.set_xlabel('X data')
ax1.set_ylabel('Y1 data', color='g')
ax2.set_ylabel('Y2 data', color='b')
plt.show()
The index of df1 is as follows:
Index(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
'2009', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
'2017', '2018'],
dtype='object')
Here's a small snippet of the two dfs:
df1.head()
gdp
2000 1.912873
2001 7.319967
2002 3.121450
2003 5.961162
2004 4.797018
df2.head()
lifeex
2000 68.684
2001 69.193
2002 69.769
2003 70.399
2004 71.067
The plot looks like:
I tried different solutions, including the one in "Set Xticks frequency to dataframe index", but none of them got all the years to show.
I would really appreciate any help. Thanks in advance.
When I try ax1.set_xticks(df1.index), I get the following error: '<' not supported between instances of 'numpy.ndarray' and 'str'

I couldn't reproduce your issue (matplotlib version 3.2.2):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'col1':np.random.randint(1,7 , 19)},
index=[str(i) for i in range(2000,2019)])
print(df1.index)
df2 = pd.Series(np.linspace(69,78, 19))
fig, ax1 = plt.subplots(figsize=(15,8))
ax2 = ax1.twinx()
ax1.plot(df1.index, df1, 'g-')
ax2.plot(df1.index, df2, 'b-')
ax1.set_xlabel('X data')
ax1.set_ylabel('Y1 data', color='g')
ax2.set_ylabel('Y2 data', color='b')
plt.show()
Output:
Index(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
'2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017',
'2018'],
dtype='object')

The following code solved the problem for me:
years = list(df1.index)
for i in range(0, len(years)):
    # convert each year label from str to int
    years[i] = int(years[i])
ax1.xaxis.set_ticks(years)
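One way to read the fix above: the index holds the years as strings, so matplotlib most likely treats them as categories rather than numbers. The following is a minimal sketch, assuming the index can be converted to integers before plotting. The gdp and lifeex values are made-up placeholders, not the asker's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data (assumption): string years 2000..2018 as the index.
years = [str(y) for y in range(2000, 2019)]
df1 = pd.DataFrame({"gdp": np.linspace(1.9, 6.0, len(years))}, index=years)
df2 = pd.DataFrame({"lifeex": np.linspace(68.7, 78.0, len(years))}, index=years)

# Convert the string labels to integers so the x-axis is numeric.
x = df1.index.astype(int).to_numpy()

fig, ax1 = plt.subplots(figsize=(12, 5))
ax2 = ax1.twinx()
ax1.plot(x, df1["gdp"], "g-")
ax2.plot(x, df2["lifeex"], "b-")

# One tick per year; rotate the labels so they do not overlap.
ax1.set_xticks(x)
ax1.tick_params(axis="x", rotation=45)

ax1.set_xlabel("Year")
ax1.set_ylabel("gdp", color="g")
ax2.set_ylabel("lifeex", color="b")
ticks = list(ax1.get_xticks())
```

With numeric x values, set_xticks receives numbers on a numeric axis, which avoids comparing strings with arrays.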

Related

Plotting a graph using cells containing two variables

I have a dataframe like this:
import pandas as pd
import numpy as np
date_vals = ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05',
'2022-01-06', '2022-01-07', '2022-01-08', '2022-01-09', '2022-01-10',
'2022-01-11', '2022-01-12', '2022-01-13', '2022-01-14', '2022-01-15',
'2022-01-16', '2022-01-17', '2022-01-18', '2022-01-19', '2022-01-20']
machine_vals = ['M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2',
'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2', 'M-1', 'M-2']
shift_vals = ['Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night',
'Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night', 'Day', 'Night']
type_vals = [['Type A', 'Type B'], 'Type B', 'Type A', 'Type B', 'Type A', 'Type A', 'Type B', 'Type B',
'Type A', 'Type B', ['Type A', 'Type B'], 'Type B', ['Type A', 'Type B'], 'Type A', 'Type B', ['Type A', 'Type B'],
'Type A', 'Type B', 'Type A', 'Type B']
meter_vals = [[1000, 800], 1500, 900, 1700, 1200, 1300, 1600, 1400, 1300, 1100, [1400, 200], 1200, [1000, 700], 1500, 1600, [1300, 900], 1200, 1100, 1300, 1700]
data = {
'Date': date_vals,
'Machine': machine_vals,
'Shift': shift_vals,
'Type': type_vals,
'Meter': meter_vals
}
df = pd.DataFrame(data)
df
Some cells in the Type and Meter columns hold two values each. For example, the first row means 1000 meters of Type A and 800 meters of Type B.
I have two questions:
First, is it good practice to store two different values in a single cell of a column? Will it cause problems in later stages of the work? Do you have any advice for such situations?
Second, using the dataframe above, I would like to plot:
the dates on the bottom axis and, on the other axis, how many meters each machine makes on each date, split by Type A and Type B;
the dates on the bottom axis and, on the other axis, how many meters each machine makes on each date, split by Day and Night.
The focus of my questions is how to handle a cell that holds two different values (for example, ['Type A', 'Type B'] and [1000, 800]). Thanks.
First, thank you for providing a reproducible example. It makes answering such questions so much easier.
So, the first step would be to clean up your df a bit:
Make Date actual Timestamps.
Explode the columns that have lists. It makes further processing easier.
df2 = df.assign(Date=pd.to_datetime(df['Date'])).explode(['Type', 'Meter'])
>>> df2
Date Machine Shift Type Meter
0 2022-01-01 M-1 Day Type A 1000
0 2022-01-01 M-1 Day Type B 800
1 2022-01-02 M-2 Night Type B 1500
.. ... ... ... ... ...
17 2022-01-18 M-2 Night Type B 1100
18 2022-01-19 M-1 Day Type A 1300
19 2022-01-20 M-2 Night Type B 1700
[24 rows x 5 columns]
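To make the effect of explode concrete, here is a small sketch on a two-row frame (the frame `small` is a cut-down, hypothetical version of the question's data). It assumes pandas 1.3 or later, where explode accepts a list of columns.

```python
import pandas as pd

# Cut-down illustration: the first row holds two paired values per cell.
small = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-02"],
    "Type": [["Type A", "Type B"], "Type B"],
    "Meter": [[1000, 800], 1500],
})

tidy = (
    small.assign(Date=pd.to_datetime(small["Date"]))
         .explode(["Type", "Meter"])   # one row per (Type, Meter) pair
         .reset_index(drop=True)
)
# After explode the Meter column has object dtype; make it numeric again.
tidy["Meter"] = tidy["Meter"].astype(int)
print(tidy)
```

Each list element gets its own row, so every cell holds a single value and ordinary groupby or pivot_table calls work on the result.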
Then, I presume you'd like to plot these measurements in various ways.
Here are some suggestions:
By type, each date has its own bar(s)
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (5, 3)
plt.rcParams['figure.facecolor'] = 'white'
# each date with its own bar(s):
z = df2.pivot_table(values='Meter', index='Date', columns='Type', aggfunc='sum')
ax = z.plot.bar()
# matplotlib messes up categorical date formats
ax.set_xticklabels(z.index.strftime('%Y-%m-%d'))
plt.show()
Aggregate by week
zw = z.resample('W').sum()
ax = zw.plot.bar()
ax.set_xticklabels(zw.index.strftime('%Y-%m-%d'))
plt.show()
By shift (day or night), total meters, grouped by weeks
z = df2.groupby([pd.Grouper(freq='W', key='Date'), 'Shift'])['Meter'].sum().unstack('Shift')
ax = z.plot.bar()
ax.set_xticklabels(z.index.strftime('%Y-%m-%d'))
plt.show()
Two plots: one for each shift
In each plot, show meter by types.
fig, axes = plt.subplots(nrows=2, sharex=True, figsize=(5,5))
for ax, (label, sdf) in zip(axes, df2.groupby('Shift')):
    z = sdf.groupby([pd.Grouper(freq='W', key='Date'), 'Type'])['Meter'].sum().unstack('Type')
    z.plot.bar(ax=ax)
    ax.set_xticklabels(z.index.strftime('%Y-%m-%d'))
    ax.set_title(label)
plt.tight_layout()
plt.show()

Plotly ticks are weirdly aligned

Let's take the following pd.DataFrame as an example
df = pd.DataFrame({
'month': ['2022-01', '2022-02', '2022-03'],
'col1': [1_000, 1_500, 2_000],
'col2': [100, 150, 200],
}).melt(id_vars=['month'], var_name='col_name')
which creates
month col_name value
-----------------------------
0 2022-01 col1 1000
1 2022-02 col1 1500
2 2022-03 col1 2000
3 2022-01 col2 100
4 2022-02 col2 150
5 2022-03 col2 200
When I use plain seaborn
sns.barplot(data=df, x='month', y='value', hue='col_name');
I get:
Now I would like to use plotly with the following code
import plotly.express as px
fig = px.histogram(df,
x="month",
y="value",
color='col_name', barmode='group', height=500, width=1_200)
fig.show()
And I get:
So why are the x-ticks so weird and not simply 2022-01, 2022-02 and 2022-03?
What is happening here?
I found that I always have this problem with the ticks when using color. It somehow messes the ticks up.
You can solve it by customizing the step as 1 month per tick with dtick="M1", as follows:
import pandas as pd
import plotly.express as px
df = pd.DataFrame({
'month': ['2022-01', '2022-02', '2022-03'],
'col1': [1000, 1500, 2000],
'col2': [100, 150, 200],
}).melt(id_vars=['month'], var_name='col_name')
fig = px.bar(df,
x="month",
y="value",
color='col_name', barmode='group', height=500, width=1200)
fig.update_xaxes(
tickformat = '%Y-%m',
dtick="M1",
)
fig.show()

Order categories in a grouped bar in matplotlib

I am trying to plot a groupby-pandas-dataframe in which I have a categorical variable by which I would like to order the bars.
A sample code of what I am doing:
import pandas as pd
df = {"month":["Jan", "Jan", "Jan","Feb", "Feb", "Mar"],
"cat":["High", "High", "Low", "Medium", "Low", "High"]}
df = pd.DataFrame(df)
df.groupby("month")["cat"].value_counts().unstack(0).plot.bar()
Which plots:
However, within each category I would like the bars to be ordered Jan, Feb, Mar.
Any help on how to achieve this would be appreciated.
Kind regards.
You could make the month column categorical to fix an order:
import pandas as pd
df = {"month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Mar"],
"cat": ["High", "High", "Low", "Medium", "Low", "High"]}
df = pd.DataFrame(df)
df["month"] = pd.Categorical(df["month"], ["Jan", "Feb", "Mar"])
df.groupby("month")["cat"].value_counts().unstack(0).plot.bar(rot=0)
An alternative would be to select the column order after the call to unstack(0):
df.groupby("month")["cat"].value_counts().unstack(0)[["Jan", "Feb", "Mar"]].plot.bar(rot=0)
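The same count table can also be built with pd.crosstab and put in the desired order with reindex. This is a sketch of an equivalent route, not the answerer's code; the name month_order is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Mar"],
    "cat": ["High", "High", "Low", "Medium", "Low", "High"],
})

month_order = ["Jan", "Feb", "Mar"]  # desired bar order within each category

# Rows are categories, columns are months; missing combinations count as 0.
counts = pd.crosstab(df["cat"], df["month"]).reindex(columns=month_order, fill_value=0)
print(counts)
```

Calling counts.plot.bar(rot=0) on the result draws one group of bars per category, with the months in the order given by month_order.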
I recommend using the seaborn package for plotting data from dataframes. It makes it simple to organize and order each element when plotting.
First let's add a column with the counts of each existing month/cat combination:
import pandas as pd
data = {"month":["Jan", "Jan", "Jan","Feb", "Feb", "Mar"],
"cat":["High", "High", "Low", "Medium", "Low", "High"]}
df = pd.DataFrame(data)
df = df.value_counts().reset_index().rename(columns={0: 'count'})
print(df)
# output:
#
# month cat count
# 0 Jan High 2
# 1 Mar High 1
# 2 Jan Low 1
# 3 Feb Medium 1
# 4 Feb Low 1
Plotting with seaborn then becomes as simple as:
import matplotlib.pyplot as plt
import seaborn as sns
sns.barplot(
data=df,
x='cat',
y='count',
hue='month',
order=['Low', 'Medium', 'High'], # Order of elements in the X-axis
hue_order=['Jan', 'Feb', 'Mar'], # Order of colored bars at each X position
)
plt.show()
Output image:

Pandas - apply rolling to columns speed

I have a dataframe where I take the subset of only numeric columns, calculate the 5 day rolling average for each numeric column and add it as a new column to the df.
This approach works but currently takes quite a long time (8 seconds per column). I'm wondering if there is a better way to do this.
A working toy example of what I'm doing currently:
import pandas as pd
data = {'Group': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
'Year' : ['2017', '2017', '2017', '2018', '2018', '2018', '2017', '2017', '2018', '2018', '2017', '2017', '2017', '2017', '2018', '2018'],
'Score 1' : [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
'Score 2': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
for col in ['Score 1', 'Score 2']:
    df[col + '_avg'] = df.groupby(['Year', 'Group'])[col].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())
For anyone who lands on this, I was able to speed this up significantly by sorting first and avoiding the lambda function:
return_df[col + '_avg'] = df.sort_values(['Group', 'Year']).groupby(['Group'])[col].rolling(2,1).mean().shift().values
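For comparison, here is a sketch that keeps the per-group semantics of the question's lambda (rolling mean, then shift, then back-fill, all within each Year/Group) while using only built-in groupby methods. The helper name grouped_prev_rolling_mean is hypothetical, and whether this is actually faster on a given dataset is an assumption that should be timed.

```python
import pandas as pd

data = {'Group': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Year': ['2017', '2017', '2017', '2018', '2018', '2018', '2017', '2017',
                 '2018', '2018', '2017', '2017', '2017', '2017', '2018', '2018'],
        'Score 1': [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
        'Score 2': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
keys = ['Year', 'Group']

def grouped_prev_rolling_mean(frame, col):
    """Per-group rolling mean, shifted by one row and back-filled (hypothetical helper)."""
    # Rolling mean within each group; drop the group levels to get the original index back.
    rolled = (frame.groupby(keys)[col]
                   .rolling(2, min_periods=1).mean()
                   .droplevel(keys)
                   .sort_index())
    by = [frame[k] for k in keys]
    # Shift and back-fill inside each group so values never leak across groups.
    shifted = rolled.groupby(by).shift()
    return shifted.groupby(by).bfill()

for col in ['Score 1', 'Score 2']:
    df[col + '_avg'] = grouped_prev_rolling_mean(df, col)
```

Because the result keeps the original index, it can be assigned straight back to df without sorting the frame first.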

How do I plot a simple bar chart with python and seaborn?

I am trying to draw a bar chart using Python and seaborn, but I am getting an error:
ValueError: Could not interpret input 'total'.
This is what I am trying to turn into a bar chart:
level_1 1900 2014 2015 2016 2017 2018
total 0.0 154.4 490.9 628.4 715.2 601.5
This is an image of the same dataframe:
Also, I want to delete the column 1900, but when I try to do it by deleting the index, the column 2014 is deleted instead.
This is what I have so far:
valor_ano = sns.barplot(
data= valor_ano,
x= ['2014', '2015', '2016', '2017', '2018'],
y= 'total')
Any suggestions?
Do something like the following:
import seaborn as sns
import pandas as pd
valor_ano = pd.DataFrame({'level_1':[1900, 2014, 2015, 2016, 2017, 2018],
'total':[0.0, 154.4, 490.9, 628.4,715.2,601.5]})
valor_ano.drop(0, axis=0, inplace=True)
valor_plot = sns.barplot(
data= valor_ano,
x= 'level_1',
y= 'total')
This produces the following plot:
EDIT: If you want to do it without the dataframe and just pass in the raw data you can do it with the following code. You can also just use a variable containing a list instead of hard-coding the list:
valor_graph = sns.barplot(
x= [2014, 2015, 2016, 2017, 2018],
y= [154.4, 490.9, 628.4,715.2,601.5])
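If the data already exists in the wide layout shown in the question (one row labelled total, one column per year), it can be reshaped instead of retyped. This sketch assumes that layout, with the years stored as string column labels under the name level_1; the error in the question probably arises because total is a row label there, not a column.

```python
import pandas as pd

# Assumed layout of the asker's frame: a single 'total' row, years as columns.
wide = pd.DataFrame(
    [[0.0, 154.4, 490.9, 628.4, 715.2, 601.5]],
    index=["total"],
    columns=pd.Index(["1900", "2014", "2015", "2016", "2017", "2018"], name="level_1"),
)

# Drop the unwanted year, then transpose so years become rows and 'total' a column.
long_df = wide.drop(columns="1900").T.reset_index()
print(long_df)
```

The resulting frame has the columns level_1 and total, so it can be passed to sns.barplot with x='level_1' and y='total' as in the answer above.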
