In a pandas DataFrame with 3 variables, I want to plot 2 columns as 2 separate boxes and use the 3rd column as hue with seaborn.
I can get the first step working with pd.melt, but I can't add the hue and make it work.
This is what I have:
df=pd.DataFrame({'A':['a','a','b','a','b'],'B':[1,3,5,4,7],'C':[2,3,4,1,3]})
df2=df[['B','C']].copy()
sb.boxplot(data=pd.melt(df2), x="variable", y="value",palette= 'Blues')
I want to do this with the first DataFrame, setting column 'A' as hue.
Can you help me?
Thank you
IIUC, you can achieve this as follows:
Apply df.melt, using column A for id_vars, and ['B','C'] for value_vars.
Next, inside sns.boxplot, feed the melted df to the data parameter, and add hue='A'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'A':['a','a','b','a','b'], 'B':[1,3,5,4,7], 'C':[2,3,4,1,3]})
sns.boxplot(data=df.melt(id_vars='A', value_vars=['B','C']),
x='variable', y='value', hue='A', palette='Blues')
plt.show()
Result
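To see why the hue works, it can help to look at the melted frame on its own. A minimal sketch using the same sample data (the name long_df is only for illustration): column A is carried along on every row, so seaborn can split each box by it.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'a', 'b'],
                   'B': [1, 3, 5, 4, 7],
                   'C': [2, 3, 4, 1, 3]})

# Keep A as an identifier; stack B and C into (variable, value) pairs.
long_df = df.melt(id_vars='A', value_vars=['B', 'C'])

# Each original row now appears twice: once for B and once for C.
print(long_df)
```

The resulting columns are A, variable and value, which are exactly the names passed to hue, x and y in the boxplot call.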
I have weekly time-series data that I want to plot as a weekly line chart using matplotlib/seaborn. I aggregated the time series and tried to plot it, but the output does not look right. In my data, the columns are countries and the index is a weekly time index. What I want to do is iterate over the columns (one per country), group each by year and week, and draw a weekly line chart for each country. My aggregation is somewhat inefficient, which I assume is causing the problem. Can anyone suggest a way to do this, i.e. get a line chart per column while grouping on the time index?
my attempt and data
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)
df1_bf.index = pd.to_datetime(df1_bf.index, errors="coerce")
df1_bf.index.name = 'date'
df1_bf.reset_index('date')
df1_bf['year'] = pd.DatetimeIndex(df1_bf.index).year
df1_bf['week'] = pd.DatetimeIndex(df1_bf.index).week
for i in df1_bf.columns:
    df_grp = df1.groupby(['year', 'week'])[i].sum().unstack()
    fig,ax1 = plt.subplots(nrows=1,ncols=1,squeeze=True,figsize=(16,10))
    for j in df_grp['year']:
        ax1.plot(df_grp.week, j, next(linecycler),linewidth=3)
    plt.gcf().autofmt_xdate()
    plt.style.use('ggplot')
    plt.xticks(rotation=0)
    plt.show()
    plt.close()
But I couldn't get the correct plot with the attempt above. It seems I got the data aggregation for the plot wrong. Can anyone suggest a way to make this right?
desired output
This is an example of the plot I want to make. I want to iterate over the columns and group by the time index, so that I get a weekly time-series line chart for each country in a loop.
How can I get this plot? Is there a way to do this with matplotlib or seaborn?
You need to melt your dataframe and then groupby. Then, use Seaborn to create a plot, passing the data, x, y and hue. Passing hue allows you to avoid looping and makes it a lot cleaner:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['Unnamed: 0'])
df = df.rename({'Unnamed: 0' : 'date'}, axis=1)
df['year'] = df['date'].dt.year
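# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df['week'] = df['date'].dt.isocalendar().week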
df = df.melt(id_vars=['date','week','year'])
df = df.groupby(['year', 'week'], as_index=False)['value'].sum()
fig, ax = plt.subplots(squeeze=True,figsize=(16,10))
sns.lineplot(data=df, x='week', y='value', hue='year',linewidth=3)
plt.show()
This is the first and last 5 rows of df before plotting:
     year  week    value
0    2018     1   2268.0
1    2019     1  11196.0
2    2019     2      0.0
3    2019     3      0.0
4    2019     4      0.0
..    ...   ...      ...
100  2020    49  17111.0
101  2020    50  18203.0
102  2020    51  12787.0
103  2020    52  26245.0
104  2020    53  11772.0
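One thing to watch for in the output above: the first row pairs year 2018 with week 1. That happens when the calendar year (dt.year) is combined with the ISO week number, because the last days of December can belong to ISO week 1 of the following year. A small sketch of the effect, with two dates chosen only for illustration; whether you prefer the ISO year is a judgment call for your data:

```python
import pandas as pd

# 2018-12-31 is a Monday and belongs to ISO week 1 of 2019.
dates = pd.Series(pd.to_datetime(['2018-12-31', '2019-01-07']))

iso = dates.dt.isocalendar()  # DataFrame with columns year, week, day

calendar_year = dates.dt.year.tolist()
iso_year = iso['year'].tolist()
iso_week = iso['week'].tolist()

print(calendar_year, iso_year, iso_week)
```

Grouping on iso['year'] instead of dt.year would keep that late-December week together with the rest of its ISO year.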
Per your comment, you are looking for relplot with kind='line'. relplot accepts many formatting parameters, or you can loop through the axes to make further changes:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/7c96d7f7c05f16abcc39befcd74f5ca8/raw/8997332cd3cdec7610aeaa0300a1b85f9daafccb/prod_sales.csv'
df = pd.read_csv(url, parse_dates=['Unnamed: 0'])
df = df.rename({'Unnamed: 0' : 'date'}, axis=1)
df['year'] = df['date'].dt.year
df['week'] = df['date'].dt.isocalendar().week
df = df.melt(id_vars=['date','week','year'], var_name='country')
df = df.loc[df['value'] < 3000].groupby(['country', 'year', 'week'], as_index=False)['value'].sum()
sns.relplot(data=df, x='week', y='value', hue='year', row='country', kind='line', facet_kws={'sharey': False, 'sharex': True})
df
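The reshaping step can be checked without the plot. Below is a minimal sketch on a made-up frame (the country codes US and DE, the dates and the values are all hypothetical, not taken from the CSV), showing the long per-country table that relplot receives:

```python
import pandas as pd

# Hypothetical wide frame: one row per week, one column per country.
wide = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-07', '2019-01-14', '2020-01-06']),
    'US': [10, 20, 30],
    'DE': [1, 2, 3],
})
wide['year'] = wide['date'].dt.year
# Cast the nullable UInt32 week to a plain int for simpler downstream handling.
wide['week'] = wide['date'].dt.isocalendar().week.astype(int)

# Long format: one row per (date, country) pair.
long_df = wide.melt(id_vars=['date', 'week', 'year'], var_name='country')

# One aggregated value per country, year and week.
weekly = long_df.groupby(['country', 'year', 'week'], as_index=False)['value'].sum()
print(weekly)
```

With the data in this shape, row='country' gives one facet per country and hue='year' gives one line per year within each facet.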
I have 2 separate dataframes with the same structure but different numbers in them:
df = pd.DataFrame({'clip emotes':[79,223,435,291,188,99,153,50,55,78,83,48,43,73]}, index=['roohappy','rooblank','lul','omegalul','pog','pogchamp','roovv','roowut','roopog','pepehands','biblethumb','roocry','rooree','rooblind'])
df
and
df = pd.DataFrame({'vod emotes':[3963,7286,5560,4390,3386,3111,2639,2612,2422,1999,1948,1691,1654,1573,1308,1090,1024,1019,1019,974,945,912,893,856,790,771,731,677,658,652]}, index=['rood','roovv','pepega','lul','clap','rookek','roocult','rooblank','pog','rooree','rooaww','roohappy','omegaroll','rooduck','rooh','rareroo','roocry','pepehand','lulw','rooderp','roopog','hyperclap','roospy','rooayaya','omegalul','roolove','roowut','roonya','monkas','roo4'])
df
and then I call df.plot(kind='bar') on each of them separately. I can't figure out how to put both datasets into one graph, one over the other, so that bars with the same name are stacked on top of each other in different colours.
You can do it by joining them:
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'clip emotes':[79,223,435,291,188,99,153,50,55,78,83,48,43,73]}, index=['roohappy','rooblank','lul','omegalul','pog','pogchamp','roovv','roowut','roopog','pepehands','biblethumb','roocry','rooree','rooblind'])
df2 = pd.DataFrame({'vod emotes':[3963,7286,5560,4390,3386,3111,2639,2612,2422,1999,1948,1691,1654,1573,1308,1090,1024,1019,1019,974,945,912,893,856,790,771,731,677,658,652]}, index=['rood','roovv','pepega','lul','clap','rookek','roocult','rooblank','pog','rooree','rooaww','roohappy','omegaroll','rooduck','rooh','rareroo','roocry','pepehand','lulw','rooderp','roopog','hyperclap','roospy','rooayaya','omegalul','roolove','roowut','roonya','monkas','roo4'])
df3 = df2.join(df1)
df3.plot(kind='bar', stacked=True)
plt.tight_layout()
plt.show()
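Note that DataFrame.join defaults to a left join, so emotes that appear only in the clip frame are dropped from the result. If you want to keep them, an outer join is one option. A minimal sketch on a cut-down version of the data (the names clips, vods, left and outer are for illustration only):

```python
import pandas as pd

clips = pd.DataFrame({'clip emotes': [435, 99]}, index=['lul', 'pogchamp'])
vods = pd.DataFrame({'vod emotes': [5560, 3963]}, index=['lul', 'rood'])

# Default (left) join: only index labels present in vods survive.
left = vods.join(clips)

# Outer join keeps labels from both frames; missing counts become 0.
outer = vods.join(clips, how='outer').fillna(0)

print(left)
print(outer)
```

Calling outer.plot(kind='bar', stacked=True) then stacks the two counts per emote; leaving out stacked=True draws them side by side instead.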
I am plotting a multi-index columns DataFrame.
What is the syntax to specify the column(s) to be plotted on secondary_y using the .plot method of pandas DataFrame?
Setup
import numpy as np
import pandas as pd
mt_idx = pd.MultiIndex.from_product([['A', 'B'], ['first', 'second']])
df = pd.DataFrame(np.random.randint(0, 10, size=(20, len(mt_idx))), columns=mt_idx)
My Attempts
df.plot(secondary_y=('B', 'second'))
df.plot(secondary_y='(B, second)')
None of the above worked, as all the lines were plotted on the principal y-axis.
One possible solution is to plot each top-level column group separately and pass secondary_y=True for the group that should go on the secondary axis. Doing it this way requires you to specify the axes on which they will be plotted:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
mt_idx = pd.MultiIndex.from_product([['A', 'B'], ['first', 'second']])
df = pd.DataFrame(np.random.randint(0, 10, size=(20, len(mt_idx))), columns=mt_idx)
df.A.plot(ax=ax)
df.B.plot(ax=ax, secondary_y=True)
plt.show()
You might flatten the column MultiIndex into single-level names. If you don't want to modify the original dataframe, this could be done on a copy of it.
df2 = df.copy()
df2.columns = df2.columns.map('_'.join)
df2.plot(secondary_y='B_second')
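To make the flattening step concrete, here is a small sketch of what the column names look like afterwards. It uses a frame of zeros rather than random data, and the name flat is only for illustration:

```python
import numpy as np
import pandas as pd

mt_idx = pd.MultiIndex.from_product([['A', 'B'], ['first', 'second']])
df = pd.DataFrame(np.zeros((3, len(mt_idx))), columns=mt_idx)

# Work on a copy so the original MultiIndex columns stay untouched.
flat = df.copy()
# Each column label is a tuple such as ('B', 'second'); join it into one string.
flat.columns = flat.columns.map('_'.join)

flat_names = list(flat.columns)
print(flat_names)
```

Any of these flat names can then be passed to secondary_y as a plain string.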
I am trying to use Python matplotlib to plot a pandas DataFrame. The DataFrame has a 'time' column and a 'val' column. The 'time' column is set as the index and has resolution up to microseconds. When I plot it, the values on the x-axis are totally off (way outside the time range of the data). What could be wrong? Any help is appreciated.
Below is the code:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates
df = pd.read_csv("/tmp/a.csv")
df = df.set_index('time')
def plot1(df):
    ax = df.plot(y='val')
    ax.get_yaxis().get_major_formatter().set_useOffset(False)
    ax.get_xaxis().set_major_formatter(matplotlib.dates.DateFormatter("%H%M%S.%f"))
    plt.show()
    return ax
plot1(df)
Data in '/tmp/a.csv':
time,val
143642.229348,12
143642.250195,53
143642.252341,17
143642.254349,56
143642.311674,31
143642.313758,36
143642.320217,24
143642.339777,86
You would need to convert your time column to datetime after reading it from the CSV file:
df['time'] = pd.to_datetime(df['time'], format="%H%M%S.%f")
Alternatively, you can do it on the fly when parsing your CSV file (note that date_parser is deprecated since pandas 2.0 in favour of date_format):
tm_parser = lambda x: pd.to_datetime(x, format="%H%M%S.%f")
df = pd.read_csv('/tmp/a.csv',
sep=',',
parse_dates=['time'],
date_parser=tm_parser,
index_col='time')
After that you don't need matplotlib.dates.DateFormatter:
In [147]: df.plot()
Out[147]: <matplotlib.axes._subplots.AxesSubplot at 0x8201f60>
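The conversion can be checked on a couple of the sample rows without touching the file system. This is a sketch under two assumptions of mine, not part of the original answer: the CSV text is inlined through io.StringIO, and the time column is read as a string (dtype=str) so the digits are parsed exactly as written rather than passing through a float first.

```python
import io
import pandas as pd

# Two rows copied from the sample data, inlined instead of read from /tmp/a.csv.
csv_text = "time,val\n143642.229348,12\n143642.250195,53\n"

# Read the time column as text, then parse it as HHMMSS.microseconds.
df = pd.read_csv(io.StringIO(csv_text), dtype={'time': str})
df['time'] = pd.to_datetime(df['time'], format="%H%M%S.%f")
df = df.set_index('time')

first = df.index[0]
print(first)
```

Because the format has no date part, pandas fills in a default date of 1900-01-01. The index is now a DatetimeIndex, so df.plot() can label the x-axis with real times instead of treating the raw numbers as dates.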