I have a Pandas dataframe that looks like this:
store_id days times rating
100 monday '1:00pm - 3:00pm' 0
100 monday '3:00pm - 6:00pm' 1
100 monday '6:00pm - 9:00pm' 2
...
store n
Where there are ~60 stores and the ratings range from 0 to 2. I would like to create a 6x5 grid of Seaborn heatmaps, with one heatmap per store. I would like the x-axis to be the days and the y-axis to be the times.
I tried this:
f, axes = plt.subplots(5,6)
i=0
for store in df['store_id']:
    sns.heatmap(data=df[df['store_id']==store]['rating'], ax=axes[i])
    i+=1
This creates the 5x6 grid, but generates an error ('Inconsistent shape between the condition and the input...'). What's the best way to do this?
For a heatmap, you need to pivot your data so that the days become the columns (x-axis) and the times become the index (y-axis):
f, axes = plt.subplots(5,6)
# flatten axes for looping
axes = axes.ravel()
# use groupby to extract data faster
for ax, (store, data) in zip(axes, df.groupby('store_id')):
    pivot = data.pivot_table(index='times', columns='days', values='rating')
    sns.heatmap(data=pivot, ax=ax)
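To see what the pivot step hands to sns.heatmap, here is a minimal sketch on made-up data (the sample rows and the day_order list are assumptions, not taken from the question). It also reorders the columns, since pivot_table sorts the day labels alphabetically rather than in weekday order.

```python
import pandas as pd

# Hypothetical sample shaped like the question's data (two stores, two days)
df = pd.DataFrame({
    "store_id": [100, 100, 100, 100, 101, 101, 101, 101],
    "days": ["monday", "monday", "friday", "friday"] * 2,
    "times": ["1:00pm - 3:00pm", "3:00pm - 6:00pm"] * 4,
    "rating": [0, 1, 2, 1, 2, 2, 0, 1],
})

day_order = ["monday", "friday"]  # assumed display order for the x-axis

pivots = {}
for store, data in df.groupby("store_id"):
    # rows = time slots, columns = days, cells = rating
    pivot = data.pivot_table(index="times", columns="days", values="rating")
    # pivot_table sorts column labels alphabetically; restore the weekday order
    pivots[store] = pivot.reindex(columns=day_order)

print(pivots[100])
```

Each entry of pivots is a 2-D table, which is the shape a heatmap expects, unlike the single 'rating' column passed in the original attempt.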
Related
Say I have a dataframe structured like so:
Name x y
Joe 0,1,5 0,3,8
Sue 0,2,8 1,9,5
...
Harold 0,5,6 0,7,2
I'd like to plot the values in the x and y axis on a line plot based on row. In reality, there are many x and y values, but there is always one x value for every y value in these columns. The name of the plot would be the value in "name".
I've tried to do this by first converting x and y to lists in their own separate columns like so:
df['xval'] = df['x'].str.split(',')
df['yval'] = df['y'].str.split(',')
And then passing them to seaborn:
ax = sns.lineplot(x=df['xval'], y=df['yval'], data=df)
However, this does not work because 1) I receive an error, which I presume is due to attempting to pass a list from a dataframe, claiming:
TypeError: unhashable type: 'list'
And 2) I cannot specify the value for df['name'] for the specific line plot. What's the best way to go about solving this problem?
Data and imports:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame({
'name': ['joe', 'sue', 'mike'],
'x': ['0,1,5', '0,2,8', '0,4'],
'y': ['0,3,8', '1,9,5', '1,6']
})
We should convert df into a usable format for plotting, which makes all of the plotting easier. We can take advantage of the fact that x and y have a 1-to-1 relationship. Notice I've added a third name with 2 x/y values instead of 3, to show that this method works for varying numbers of x and y values per name, as long as each row has equal numbers of x and y values.
Creating the plot_df:
# Grab Name Column to Start Plot DF with
plot_df = df.loc[:, ['name']]
# Split X column
plot_df['x'] = df['x'].str.split(',')
# Explode X into Rows
plot_df = plot_df.explode('x').reset_index(drop=True)
# Split and Series Explode y in one step
# This works IF AND ONLY IF a 1-to-1 relationship for x and y
plot_df['y'] = df['y'].str.split(',').explode().reset_index(drop=True)
# These need to be numeric to plot correctly
# (assign the columns directly; assigning through .loc[:, ...] can leave the dtype as object)
plot_df[['x', 'y']] = plot_df[['x', 'y']].astype(int)
plot_df:
name x y
0 joe 0 0
1 joe 1 3
2 joe 5 8
3 sue 0 1
4 sue 2 9
5 sue 8 5
6 mike 0 1
7 mike 4 6
References to the methods used in creating plot_df:
DataFrame.loc to subset the dataframe
Series.str.split to split the comma separated values into a list
DataFrame.explode to upscale the DataFrame based on the iterable in x
DataFrame.reset_index to make index unique again after exploding
Series.explode to upscale the lists in the Series y.
Series.reset_index to make index unique again after exploding
DataFrame.astype to convert the values to a numeric type. Splitting and exploding alone is not enough, since the values are still strings and would not plot correctly.
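As a possible shortcut, and assuming pandas 1.3 or newer, DataFrame.explode also accepts a list of columns, so x and y can be exploded together instead of one at a time. This is only a sketch of that variant on the same sample data; it relies on the same requirement that each row has equal numbers of x and y values.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["joe", "sue", "mike"],
    "x": ["0,1,5", "0,2,8", "0,4"],
    "y": ["0,3,8", "1,9,5", "1,6"],
})

plot_df = (
    df.assign(x=df["x"].str.split(","), y=df["y"].str.split(","))
      .explode(["x", "y"])            # multi-column explode (pandas >= 1.3)
      .astype({"x": int, "y": int})   # strings -> integers
      .reset_index(drop=True)         # make the index unique again
)

print(plot_df)
```

The resulting plot_df has the same rows as the one built step by step above.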
Plotting (Option 1)
# Plot with hue set to name.
sns.lineplot(data=plot_df, x='x', y='y', hue='name')
plt.show()
References for plotting separate lines:
sns.lineplot to plot. Note the hue argument to create separate lines based on name.
pyplot.show to display.
Plotting (Option 2.a) Subplots:
sns.relplot(data=plot_df, x='x', y='y', col='name', kind='line')
plt.tight_layout()
plt.show()
Plotting (Option 2.b) Subplots:
# Use Grouper From plot_df
grouper = plot_df.groupby('name')
# Create Subplots based on the number of groups (ngroups)
fig, axes = plt.subplots(nrows=grouper.ngroups)
# Iterate over axes and groups
for ax, (grp_name, grp) in zip(axes, grouper):
    # Plot from each grp DataFrame on ax from axes
    sns.lineplot(data=grp, x='x', y='y', ax=ax, label=grp_name)
plt.show()
References for plotting subplots:
2.a
relplot, where the row or col parameter can be used to create subplots in a similar way to how hue creates multiple lines. This returns a seaborn.FacetGrid, so post-processing differs from lineplot, which returns a matplotlib.axes.Axes.
2.b
groupby to create iterable that can be used to plot subplots.
pyplot.subplots to create subplots to plot on.
groupby.ngroups to count number of groups.
zip to iterate over axes and groups simultaneously.
sns.lineplot to plot. Note label is needed to have legends. grp_name contains the current key that is common in the current grp DataFrame.
pyplot.show to display.
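One caveat with Option 2.b, shown here as a hedged sketch: when there is only one group, plt.subplots returns a single Axes rather than an array, so zip(axes, grouper) would fail. Passing squeeze=False and flattening the result keeps the loop working for any number of groups. The one-name sample data is hypothetical, and plain Axes.plot stands in for sns.lineplot to keep the example small.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with a single name, the case that trips up zip(axes, ...)
plot_df = pd.DataFrame({"name": ["joe"] * 3, "x": [0, 1, 5], "y": [0, 3, 8]})
grouper = plot_df.groupby("name")

# squeeze=False always returns a 2-D array of Axes, even for one subplot
fig, axes = plt.subplots(nrows=grouper.ngroups, squeeze=False)

for ax, (grp_name, grp) in zip(axes.ravel(), grouper):
    ax.plot(grp["x"], grp["y"], label=grp_name)
    ax.legend()
```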
Plotting option 3 (separate plots):
# Plot from each grp DataFrame in its own plot
for grp_name, grp in plot_df.groupby('name'):
    fig, ax = plt.subplots()
    sns.lineplot(data=grp, x='x', y='y', ax=ax)
    ax.set_title(grp_name)
    fig.show()
(Output: three separate figures, titled joe, mike, and sue.)
References for plotting separate plots:
groupby to create iterable that can be used to plot each name separately.
pyplot.subplots to create separate plot to plot on.
sns.lineplot to plot. No label is needed here, since each plot shows a single line.
Axes.set_title to title each plot. grp_name contains the current key that is common in the current grp DataFrame.
Figure.show to display.
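From what I understood, this is what you want:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

df = pd.DataFrame()
df['name'] = ['joe', 'sue']
df['x'] = ['0,1,5', '0,2,8']
df['y'] = ['0,3,8', '1,9,5']
# split the strings and convert each piece to int so the axes are numeric
df['newx'] = df['x'].str.split(',').apply(lambda vals: [int(v) for v in vals])
df['newy'] = df['y'].str.split(',').apply(lambda vals: [int(v) for v in vals])
# draw one line per row, labelled with that row's name
for i in range(len(df)):
    sns.lineplot(x=df.loc[i, 'newx'], y=df.loc[i, 'newy'], label=df.loc[i, 'name'])
plt.legend()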
I have a pandas DataFrame which has 200 columns and each column is a list of 200 values.
I want to plot those values in series in such a way that
First column (200 values) lie between 0 to 1 in x-axis
Second column (200 values) lie between 1 to 2 in x-axis
Third column (200 values) lie between 2 to 3 in x-axis
...
Is there any way in Python to solve this problem?
Thanks in advance
So, I gather that by "between 0 and 1", you actually want the points of Column 1 situated at x=0.5. To have all values of Column 1 at the same x-coordinate, just pass that fixed x-coordinate to the call to scatter. I show here the example for 20 columns with 20 values per column:
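import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame()
for i in range(20):
    df[f'Col {i}'] = np.random.randn(20)
fig, axes = plt.subplots()
for i in range(20):
    # every value of column i is drawn at x = i + 0.5
    axes.scatter([i+0.5]*len(df), df[f'Col {i}'])
axes.set_xticks(range(20))
plt.show()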
Personally, I prefer to iterate over the columns (or column keys) because one is flexible with the column names. This code snippet is a quick example:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# create random data with non-serial column names
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['Col 1','Col 2','Col 4','Col 5'])
fig, ax= plt.subplots()
# create a list of the DataFrame's column names
columns = list(df)
# iterate over columns of the DataFrame
for i,col in enumerate(columns):
    y = df[col]
    x = [i+0.5] * len(y)
    ax.scatter(x,y)
plt.show()
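If "between 0 and 1" is instead read literally, so that the values of each column are spread across their own unit interval and drawn as one continuous series, a sketch along these lines may be closer to the question. That reading is an assumption, and the random data and three column names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 3)), columns=["a", "b", "c"])

n = len(df)
# n evenly spaced positions in [0, 1); column i is shifted to [i, i + 1)
offsets = np.linspace(0, 1, n, endpoint=False)
x = np.concatenate([i + offsets for i in range(df.shape[1])])
# column-major flattening keeps each column's values together and in order
y = df.to_numpy().ravel(order="F")

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xticks(range(df.shape[1] + 1))
```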
I have a csv file which I have read into a Pandas Dataframe. The dataframe (say 'cdata') has the below columns
I want to be able to group this data by State and subplot the cumulative confirmed column data for each state in the same plot. The data will be plotted against the Date column.
The distribution of data against the Date column is not uniform i.e. not all State will have data row for each Date.
When I try to subplot this using the below the plotted data does not look okay.
fig,ax = plt.subplots(figsize=(8,6))
count = 1
for state, df in cdata.groupby('State'):
    if count < 5:
        df.plot(x='Date', y='Confirmed', ax=ax, label=state)
    count = count + 1
plt.legend()
This obviously does not look okay since if I look at the data the cumulative figure for State='Andhra Pradesh' on the 1st May is 1463 and not ~400 that the plotted graph seems to point.
What am I doing wrong here?
You are plotting the daily confirmed number and not the cumulative sum of confirmed. You can add a new column with the cumulative sum and plot it instead.
Also, be sure to convert the 'Date' column to a datetime type and sort by it before calculating the cumulative sum. You can do something like this:
## Transform 'Date' to datetime
cdata['Date'] = pd.to_datetime(cdata['Date'])
## Sort cdata by the 'Date' column
cdata = cdata.sort_values('Date')
## Calculate cumulative sum of 'Confirmed' by state
cdata['Total Confirmed'] = cdata.groupby('State')['Confirmed'].cumsum()
## Plot the first four states
fig, ax = plt.subplots(figsize=(8, 6))
count = 1
for state, state_df in cdata.groupby('State'):
    if count < 5:
        state_df.plot(x='Date', y='Total Confirmed', ax=ax, label=state)
    count = count + 1
plt.legend()
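A small worked example may make the per-state cumulative sum easier to check by hand. The four rows below are invented, with uneven dates per state as described in the question.

```python
import pandas as pd

# Hypothetical daily counts; the states do not share the same dates
cdata = pd.DataFrame({
    "Date": ["2020-05-02", "2020-05-01", "2020-05-01", "2020-05-03"],
    "State": ["A", "A", "B", "B"],
    "Confirmed": [5, 10, 1, 2],
})

cdata["Date"] = pd.to_datetime(cdata["Date"])
# sort so the running total follows time within each state
cdata = cdata.sort_values(["State", "Date"]).reset_index(drop=True)
# the running total restarts for every state
cdata["Total Confirmed"] = cdata.groupby("State")["Confirmed"].cumsum()

print(cdata)
```

State A accumulates 10 and then 15, while state B accumulates 1 and then 3, independently of A.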
I was able to achieve the outcome I was looking for with the below code. However I am sure this is not the most elegant way of achieving the same and am still looking for alternatives that are much more intuitive.
grouped = cdata.groupby(['Date','State'],sort=False)['Confirmed'].sum().unstack('State')
grouped.reset_index(inplace=True)
columns = grouped.columns.to_list()[1:-1]
fig,ax = plt.subplots(figsize=(20,14))
grouped.plot(x='Date',y=columns, ax=ax)
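One possible alternative, sketched on invented data: pivot_table produces the same wide layout (one column per state) in a single call, and DataFrame.plot then draws one line per column. The forward fill is an assumption that only makes sense if 'Confirmed' already holds cumulative totals, so that a missing date can reuse the last known value.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cumulative totals; not every state has a row for every date
cdata = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-01", "2020-05-01", "2020-05-02", "2020-05-03"]),
    "State": ["A", "B", "A", "B"],
    "Confirmed": [10, 1, 15, 3],
})

# one row per Date, one column per State; missing combinations become NaN
wide = cdata.pivot_table(index="Date", columns="State", values="Confirmed", aggfunc="sum")
# assumption: carry the last known total forward over the gaps
wide = wide.ffill()

fig, ax = plt.subplots(figsize=(8, 6))
wide.plot(ax=ax)  # the Date index becomes the x-axis, one line per state
```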
I have a csv file with 2 columns:
col1: Timestamp data (yyyy-mm-dd hh:mm:ss.ms, 8 months of data)
col2: Heat data (continuous variable).
Since there are almost 50k records, I would like to partition col1 (the timestamp column) into months or weeks and then apply a box plot to the heat data with respect to the timestamp.
I tried this in R, but it takes a long time. I need help doing it in Python. I think I need to use seaborn.boxplot.
Please guide.
Group by Frequency then plot groups
First, read your CSV data into a Pandas DataFrame
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# assumes NO header line in csv
# (raw string so the backslashes in the placeholder path are not treated as escapes)
df = pd.read_csv(r'\file\path', names=['time','temp'], parse_dates=[0])
I will use some fake data, 30 days of hourly samples.
heat = np.random.random(24*30) * 100
dates = pd.date_range('1/1/2011', periods=24*30, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
Set the timestamps as the DataFrame's index
df = df.set_index('time')
Now group by the period you want, seven days for this example
gb = df.groupby(pd.Grouper(freq='7D'))
Now you can plot each group separately
for g, week in gb:
    #week.plot()
    week.boxplot()
    plt.title(f'Week Of {g.date()}')
    plt.show()
    plt.close()
And... I didn't realize you could do this but it is pretty cool
ax = gb.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=30)
plt.show()
plt.close()
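Since the question also mentions months, here is a hedged sketch of the same idea with one box per calendar month on a single Axes. It groups by the index converted to monthly periods and uses plain matplotlib; the 90 days of fake hourly data and the names are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 90 days of fake hourly samples: all of January, February and March 2011
dates = pd.date_range("2011-01-01", periods=24 * 90, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temp": rng.random(len(dates)) * 100}, index=dates)

# label every row with its calendar month, then collect the readings per month
months = df.index.to_period("M")
groups = {str(m): g["temp"].to_numpy() for m, g in df.groupby(months)}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))          # boxes are drawn at positions 1..N
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()), rotation=30)
```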
heat = np.random.random(24*300) * 100
dates = pd.date_range('1/1/2011', periods=24*300, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
df = df.set_index('time')
To partition the data in five time periods then get weekly boxplots of each:
Determine the total timespan; divide by five; create a frequency alias; then groupby
dt = df.index[-1] - df.index[0]
dt = dt/5
alias = f'{dt.total_seconds()}S'
gb = df.groupby(pd.Grouper(freq=alias))
Each group is a DataFrame so iterate over the groups; create weekly groups from each and boxplot them.
for g,d_frame in gb:
    gb_tmp = d_frame.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
    plt.show()
    plt.close()
There might be a better way to do this; if so I'll post it, or maybe someone will feel free to edit this. Looks like this could lead to the last group not having a full set of data. ...
If you know that your data is periodic you can just use slices to split it up.
n = len(df) // 5
for tmp_df in (df[i:i+n] for i in range(0, len(df), n)):
    gb_tmp = tmp_df.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(),rotation=90)
    plt.show()
    plt.close()
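When len(df) is not a multiple of five, slicing in steps of len(df) // 5 yields a short sixth chunk. A possible workaround, sketched here on made-up data, is to split the row positions with np.array_split, which always returns exactly five nearly equal parts.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose length (103) is not a multiple of 5
dates = pd.date_range("2011-01-01", periods=103, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temp": np.arange(103.0)}, index=dates)

# split the row positions, then select each block of rows with iloc
positions = np.array_split(np.arange(len(df)), 5)
chunks = [df.iloc[idx] for idx in positions]

sizes = [len(c) for c in chunks]
print(sizes)
```

Each element of chunks is a DataFrame that can be grouped and box-plotted in the same way as tmp_df above.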
Frequency aliases
pandas.read_csv()
pandas.Grouper()
I'm trying to set the ticks (time-steps) of the x-axis on my matplotlib graph of a Pandas DataFrame. My goal is to use the first column of the DataFrame as the ticks, but I haven't been successful so far.
My attempts so far have included:
Attempt 1:
#See 'xticks'
data_df[header_names[1]].plot(ax=ax, title="Roehrig Shock Data", style="-o", legend=True, xticks=data_df[header_names[0]])
Attempt 2:
ax.xaxis.set_ticks(data_df[header_names[0]])
header_names is just a list of the column header names and the dataframe is as follows:
Compression Velocity Compression Force
1 0.000213 6.810879
2 0.025055 140.693200
3 0.050146 158.401500
4 0.075816 171.050200
5 0.101011 178.639500
6 0.126681 186.228800
7 0.150925 191.288300
8 0.176597 198.877500
9 0.202269 203.937000
10 0.227466 208.996500
11 0.252663 214.056000
And here is the data in CSV format:
Compression Velocity,Compression Force
0.0002126891606,6.810879
0.025055073079999997,140.6932
0.050145696,158.4015
0.07581600279999999,171.0502
0.1010109232,178.6395
0.12668120459999999,186.2288
0.1509253776,191.2883
0.1765969798,198.8775
0.2022691662,203.937
0.2274659662,208.9965
0.2526627408,214.056
And here is an implementation of reading and plotting the graph:
data_df = pd.read_csv(file).astype(float)
fig = Figure()
ax = fig.add_subplot(111)
ax.set_xlabel("Velocity (m/sec)")
ax.set_ylabel("Force (N)")
data_df[header_names[1]].plot(ax=ax, title="Roehrig Shock Data", style="-o", legend=True)
The current graph looks like:
The x-axis is currently the number of rows in the dataframe (e.g. 12) rather than the actual values within the first column.
Is there a way to use the data from the first column in the dataframe to set as the ticks/intervals/time-steps of the x-axis?
This works for me:
data_df.plot(x='Compression Velocity', y='Compression Force', xticks=data_df['Compression Velocity'])
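For completeness, a self-contained sketch of that call using the first three rows of the CSV from the question. Reading from an in-memory string and shortening the tick labels to three decimals are choices made here for illustration, not part of the original answer.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

csv_text = """Compression Velocity,Compression Force
0.0002126891606,6.810879
0.025055073079999997,140.6932
0.050145696,158.4015
"""
data_df = pd.read_csv(io.StringIO(csv_text)).astype(float)

fig, ax = plt.subplots()
# x= selects the column used for the x-axis; xticks= places a tick at every value
data_df.plot(
    x="Compression Velocity",
    y="Compression Force",
    ax=ax,
    style="-o",
    title="Roehrig Shock Data",
    xticks=data_df["Compression Velocity"],
)
# shorten the labels so neighbouring ticks stay readable
ax.set_xticklabels([f"{v:.3f}" for v in data_df["Compression Velocity"]], rotation=45)
ax.set_xlabel("Velocity (m/sec)")
ax.set_ylabel("Force (N)")
```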