Using the matplotlib library in Python, I would like to plot some graphs with dynamic y variables, i.e. variables that change according to another variable set before my plot functions.
From my imported data frame, I have extracted the concentrations (M**_conc) and fluxes (M**_flux) of different gases.
M33_conc = ec_top["M 33(ppbv)"]
M39_conc = ec_top["M 39(ncps)"]
M45_conc = ec_top["M 45(ppbv)"]
M59_conc = ec_top["M 59(ppbv)"]
M69_conc = ec_top["M 69(ppbv)"]
M71_conc = ec_top["M 71(ppbv)"]
M81_conc = ec_top["M 81(ppbv)"]
M137_conc = ec_top["M 137(ppbv)"]
M87_conc = ec_top["M 87(ppbv)"]
M47_conc = ec_top["M 47(ppbv)"]
M61_conc = ec_top["M 61(ppbv)"]
M33_flux = ec_top["Flux_M 33"]
M45_flux = ec_top["Flux_M 45"]
M59_flux = ec_top["Flux_M 59"]
M69_flux = ec_top["Flux_M 69"]
M71_flux = ec_top["Flux_M 71"]
M81_flux = ec_top["Flux_M 81"]
M137_flux = ec_top["Flux_M 137"]
M87_flux = ec_top["Flux_M 87"]
M47_flux = ec_top["Flux_M 47"]
M61_flux = ec_top["Flux_M 61"]
I want to be able to plot the evolution of these gas concentrations/fluxes over time, with a single function that lets me choose between plotting the concentrations or the fluxes of these gases.
Here is what I have written so far:
color_1 = 'black'
graph_type='conc'
fig, ((ax1, ax2, ax3), (ax5, ax7, ax8),(ax9,ax10,ax11)) = plt.subplots(3, 3, sharex=True, sharey=False)
fig.suptitle('Influence of wind direction of BVOCs concentration')
ax1.plot(wind_dir,'M33_'+graph_type,linestyle='',marker='.',color=color_1)
ax1.set_title('Methanol')
ax1.set(ylabel='Concentration [ppbv]')
ax2.plot(wind_dir,M39_conc,linestyle='',marker='.',color=color_1)
ax2.set_title('Water cluster')
ax2.set(ylabel='Concentration [ncps]')
ax3.plot(wind_dir,M45_conc,linestyle='',marker='.',color=color_1)
ax3.set_title('Acetaldehyde')
ax3.set(ylabel='Concentration [ppbv]')
# ax4.plot(wind_dir,M47_conc,linestyle='',marker='.',color='color_1')
# ax4.set_title('Unknown')
ax5.plot(wind_dir,M59_conc,linestyle='',marker='.',color=color_1)
ax5.set_title('Acetone')
ax5.set(ylabel='Concentration [ppbv]')
# ax6.plot(wind_dir,M61_conc,linestyle='',marker='.',color='color_1')
# ax6.set_title('Unknown')
ax7.plot(wind_dir,M69_conc,linestyle='',marker='.',color=color_1)
ax7.set_title('Isoprene')
ax7.set(ylabel='Concentration [ppbv]')
ax8.plot(wind_dir,M71_conc,linestyle='',marker='.',color=color_1)
ax8.set_title('Methyl vinyl, ketone and methacrolein')
ax8.set(ylabel='Concentration [ppbv]')
ax9.plot(wind_dir,M81_conc,linestyle='',marker='.',color=color_1)
ax9.set_title('Fragment of monoterpenes')
ax9.set(ylabel='Concentration [ppbv]',xlabel='Wind direction [°]')
ax10.plot(wind_dir,M87_conc,linestyle='',marker='.',color=color_1)
ax10.set_title('Methylbutenols')
ax10.set(ylabel='Concentration [ppbv]',xlabel='Wind direction [°]')
ax11.plot(wind_dir,M137_conc,linestyle='',marker='.',color=color_1)
ax11.set_title('Monoterpenes')
ax11.set(ylabel='Concentration [ppbv]',xlabel='Wind direction [°]')
plt.show()
When I try to parametrize the data I want to plot, I write, for example:
'M33_'+graph_type
which I expect to take the value 'M33_conc'.
Could someone help me to do this?
Thanks in advance
You have mentioned wanting to plot the evolution of the gases with time, but in the code sample you have given, you use wind_dir as the x variable. In this answer, I disregard this and use time as the x variable instead.
Looking at your code, I understand that you want to create two different figures made of small multiples, one for gas concentrations and one for gas fluxes. For this kind of plot, I recommend using pandas or seaborn so that you can plot all the variables contained in a pandas dataframe at once. Here I share an example using pandas.
Because you want to plot different measurements of the same substances, I recommend creating a table that lists the names of the variables and units associated with each unique substance (see df_subs below). I create one here using code to extract the units, but this is easier to do with spreadsheet software.
Having a table like that makes it easier to create a plotting function that selects the group of variables you want to plot from the ec_top dataframe. You can then use the pandas plotting function like this: df.plot(subplots=True).
Most of the code shown below creates some sample data based on your code, so that you (or anyone else who would like to give this a try) can recreate exactly what I show here. If you want to use this solution, you can skip most of it: all you need to do is create the substances table your way and then adjust the plotting function to fit your preferences.
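As a quick illustration of that core call, here is a minimal sketch on a small made-up dataframe (the column names below are placeholders, not your real data):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy dataframe with a datetime index and two hypothetical columns
idx = pd.date_range('2021-01-15 12:00', periods=10, freq='min')
toy = pd.DataFrame({'M 33(ppbv)': np.random.rand(10),
                    'M 45(ppbv)': np.random.rand(10)}, index=idx)
toy.plot(subplots=True, marker='.', linestyle='')  # one subplot per column
plt.show()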
Create sample dataset
import io # from Python v 3.8.5
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import matplotlib.dates as mdates
pd.set_option("display.max_columns", 6)
rng = np.random.default_rng(seed=1) # random number generator
# Copy paste variable names from sample given in question
var_strings = '''
"M 33(ppbv)"
"M 39(ncps)"
"M 45(ppbv)"
"M 59(ppbv)"
"M 69(ppbv)"
"M 71(ppbv)"
"M 81(ppbv)"
"M 137(ppbv)"
"M 87(ppbv)"
"M 47(ppbv)"
"M 61(ppbv)"
"Flux_M 33"
"Flux_M 45"
"Flux_M 59"
"Flux_M 69"
"Flux_M 71"
"Flux_M 81"
"Flux_M 137"
"Flux_M 87"
"Flux_M 47"
"Flux_M 61"
'''
variables = pd.read_csv(io.StringIO(var_strings), header=None, names=['var'])['var']
# Create datetime variable
nperiods = 60
time = pd.date_range('2021-01-15 12:00', periods=nperiods, freq='min')
# Create range of numbers to compute sine waves for fake data
x = np.linspace(0, 2*np.pi, nperiods)
# Create dataframe containing gas concentrations
var_conc = np.array([var for var in variables if '(' in var])
conc_sine_wave = np.reshape(np.sin(x), (len(x), 1))
loc = rng.exponential(scale=10, size=var_conc.size)
scale = loc/10
var_conc_noise = rng.normal(loc, scale, size=(x.size, var_conc.size))
data_conc = conc_sine_wave + var_conc_noise + 2
df_conc = pd.DataFrame(data_conc, index=time, columns=var_conc)
# Create dataframe containing gas fluxes
var_flux = np.array([var for var in variables if 'Flux' in var])
flux_sine_wave = np.reshape(np.sin(x)**2, (len(x), 1))
loc = rng.exponential(scale=10, size=var_flux.size)
scale = loc/10
var_flux_noise = rng.normal(loc, scale, size=(x.size, var_flux.size))
data_flux = flux_sine_wave + var_flux_noise + 1
df_flux = pd.DataFrame(data_flux, index=time, columns=var_flux)
# Merge concentrations and fluxes into single dataframe
ec_top = pd.merge(left=df_conc, right=df_flux, how='outer',
left_index=True, right_index=True)
ec_top.head()
# M 33(ppbv) M 39(ncps) M 45(ppbv) ... Flux_M 87 Flux_M 47 Flux_M 61
# 2021-01-15 12:00:00 11.940054 5.034281 53.162767 ... 8.079255 2.402073 31.383911
# 2021-01-15 12:01:00 13.916828 4.354558 45.706391 ... 10.229084 2.494649 26.816754
# 2021-01-15 12:02:00 13.635604 5.500438 53.202743 ... 12.772899 2.441369 33.219213
# 2021-01-15 12:03:00 13.146823 5.409585 53.346907 ... 11.373669 2.817323 33.409331
# 2021-01-15 12:04:00 14.124752 5.491555 49.455010 ... 11.827497 2.939942 28.639749
Create substances table containing variable names and units
The substances are shown in the figure subplots in the order that they are listed here. Information from this table is used to create the labels and titles of the subplots.
# Copy paste substance codes and names from sample given in question
subs_strings = """
M33 "Methanol"
M39 "Water cluster"
M45 "Acetaldehyde"
M47 "Unknown"
M59 "Acetone"
M61 "Unknown"
M69 "Isoprene"
M71 "Methyl vinyl, ketone and methacrolein"
M81 "Fragment of monoterpenes"
M87 "Methylbutenols"
M137 "Monoterpenes"
"""
# Create dataframe containing substance codes and names
df_subs = pd.read_csv(io.StringIO(subs_strings), header=None,
names=['subs', 'subs_name'], index_col=False,
delim_whitespace=True)
# Add units and variable names matching the substance codes
# Do this for gas concentrations
for var in var_conc:
    var_subs, var_unit_raw = var.split('(')
    var_subs_num = var_subs.lstrip('M ')
    var_unit = var_unit_raw.rstrip(')')
    for i, subs in enumerate(df_subs['subs']):
        if var_subs_num == subs.lstrip('M'):
            df_subs.loc[i, 'conc_unit'] = var_unit
            df_subs.loc[i, 'conc_var'] = var
# Do this for gas fluxes
for var in var_flux:
    var_subs_num = var.split('M')[1].lstrip()
    var_unit = rng.choice(['unit_a', 'unit_b', 'unit_c'])
    for i, subs in enumerate(df_subs['subs']):
        if var_subs_num == subs.lstrip('M'):
            df_subs.loc[i, 'flux_unit'] = var_unit
            df_subs.loc[i, 'flux_var'] = var
df_subs
# subs subs_name conc_unit conc_var flux_unit flux_var
# 0 M33 Methanol ppbv M 33(ppbv) unit_c Flux_M 33
# 1 M39 Water cluster ncps M 39(ncps) NaN NaN
# 2 M45 Acetaldehyde ppbv M 45(ppbv) unit_a Flux_M 45
# 3 M47 Unknown ppbv M 47(ppbv) unit_b Flux_M 47
# 4 M59 Acetone ppbv M 59(ppbv) unit_a Flux_M 59
# 5 M61 Unknown ppbv M 61(ppbv) unit_c Flux_M 61
# 6 M69 Isoprene ppbv M 69(ppbv) unit_a Flux_M 69
# 7 M71 Methyl vinyl, ketone and methacrolein ppbv M 71(ppbv) unit_a Flux_M 71
# 8 M81 Fragment of monoterpenes ppbv M 81(ppbv) unit_c Flux_M 81
# 9 M87 Methylbutenols ppbv M 87(ppbv) unit_c Flux_M 87
# 10 M137 Monoterpenes ppbv M 137(ppbv) unit_b Flux_M 137
Create plotting function based on pandas
Here is one way of creating a plotting function that lets you select the variables for the plot with the graph_type argument. It works by selecting the relevant variables from the substances table using the if/elif statement. This and the ec_top[variables].plot(...) call are all that is really necessary to create the plot; the rest is for formatting the figure. The variables are plotted in the order of the variables list. I draw only two columns of subplots because of width constraints here (max 10 inches width to get a sharp image on Stack Overflow).
# Create plotting function that creates a single figure showing all
# variables of the chosen type
def plot_grid(graph_type):
    # Set the type of variables and units to fetch in df_subs: using if
    # statements for the strings lets you use a variety of strings
    if 'conc' in graph_type:
        var_type = 'conc_var'
        unit_type = 'conc_unit'
    elif 'flux' in graph_type:
        var_type = 'flux_var'
        unit_type = 'flux_unit'
    else:
        return f'Error: "{graph_type}" is not a valid string, \
it must contain "conc" or "flux".'
    # Create list of variables to plot depending on type
    variables = df_subs[var_type].dropna()
    # Set parameters for figure dimensions
    nvar = variables.size
    cols = 2
    rows = int(np.ceil(nvar/cols))
    width = 10/cols
    height = 3
    # Draw grid of line plots: note that x_compat is used to override the
    # default x-axis time labels, remove it if you do not want to use custom
    # tick locators and formatters like the ones created in the loop below
    grid = ec_top[variables].plot(subplots=True, figsize=(cols*width, rows*height),
                                  layout=(rows, cols), marker='.', linestyle='',
                                  xlabel='Time', x_compat=True)
    # The code in the following loop is optional formatting based on my
    # preferences, if you remove it the plot should still look ok but with
    # fewer informative labels and the legends may not all be in the same place
    # Loop through the subplots to edit format, including creating labels and
    # titles based on the information in the substances table (df_subs):
    for ax in grid.flatten()[:nvar]:
        # Edit tick locations and format
        plt.setp(ax.get_xticklabels(which='both'), fontsize=8, rotation=0, ha='center')
        loc = mdates.AutoDateLocator()
        ax.xaxis.set_major_locator(loc)
        ax.set_xticks([], minor=True)
        fmt = mdates.ConciseDateFormatter(loc, show_offset=False)
        ax.xaxis.set_major_formatter(fmt)
        # Edit legend
        handle, (var_name,) = ax.get_legend_handles_labels()
        subs = df_subs[df_subs[var_type] == var_name]['subs']
        ax.legend(handle, subs, loc='upper right')
        # Add y label
        var_unit, = df_subs[df_subs[var_type] == var_name][unit_type]
        ylabel_type = f'{"Concentration" if "conc" in graph_type else "Flux"}'
        ax.set_ylabel(f'{ylabel_type} [{var_unit}]')
        # Add title
        subs_name, = df_subs[df_subs[var_type] == var_name]['subs_name']
        ax.set_title(subs_name)
    # Edit figure format
    fig = plt.gcf()
    date = df_conc.index[0].strftime('%b %d %Y')
    title_type = f'{"concentrations" if "conc" in graph_type else "fluxes"}'
    fig.suptitle(f'BVOCs {title_type} on {date} from 12:00 to 13:00',
                 y=0.93, fontsize=15)
    fig.subplots_adjust(wspace=0.3, hspace=0.4)
    plt.show()
plot_grid('conc') # any kind of string works if it contains 'conc' or 'flux'
plot_grid('graph fluxes')
Documentation: matplotlib date ticks
Related
I would be so thankful if someone would be able to help me with this. I am creating a graph in matplotlib; however, I would love to split up the 14 lines created from the while loop into the x and y values of P, so that instead of plt.plot(t,P) it would be plt.plot(t,(P[1])[0]) and plt.plot(t,(P[1])[1]). I would love it if someone could help me very quickly; it should be easy, but I am just getting errors with the arrays.
#Altering Alpha in Tumor Cells vs PACCs
#What is alpha? α = Rate of conversion of cancer cells to PACCs
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
from google.colab import files
value = -6
counter = -1
array = []
pac = []
while value <= 0:
    def modelP(x,t):
        P, C = x
        λc = 0.0601
        K = 2000
        α = 1 * (10**value)
        ν = 1 * (10**-6)
        λp = 0.1
        γ = 2
        #returning odes
        dPdt = ((λp))*P*(1-(C+(γ*P))/K)+ (α*C)
        dCdt = ((λc)*C)*(1-(C+(γ*P))/K)-(α*C) + (ν*P)
        return dPdt, dCdt
    #initial
    C0 = 256
    P0 = 0
    Pinit = [P0,C0]
    #time points
    t = np.linspace(0,730)
    #solve odes
    P = odeint(modelP,Pinit,t)
    plt.plot(t,P)
    value += 1
#plot results
plt.xlabel('Time [days]')
plt.ylabel('Number of PACCs')
plt.show()
You can use subplots() to create two subplots and then plot each individual line on the subplot you need. To do this, first add the subplots at the start (before the while loop) by adding this line...
fig, ax = plt.subplots(2,1) ## Plot with 2 rows, 1 column... change if required
Then... within the while loop, replace the plotting line...
plt.plot(t,P)
with the following (take care of the indentation so that the lines stay within the while loop):
if value < -3: ## I am using value = -3 as the point of split, change as needed
    ax[0].plot(t,P) ## Add to first subplot
else:
    ax[1].plot(t,P) ## Add to second subplot
This will give a plot with the lines split across the two subplots.
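For reference, here is a consolidated sketch of how these pieces fit together, reusing the ODE model from the question (it assumes the (ν***P) term was meant to be ν*P):
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, sharex=True)  # 2 rows, 1 column
t = np.linspace(0, 730)
value = -6
while value <= 0:
    def modelP(x, t, value=value):
        P, C = x
        lam_c, lam_p, K, gamma = 0.0601, 0.1, 2000, 2
        alpha = 10.0**value   # rate of conversion of cancer cells to PACCs
        nu = 1e-6             # assumed: the (ν***P) term read as nu*P
        dPdt = lam_p*P*(1 - (C + gamma*P)/K) + alpha*C
        dCdt = lam_c*C*(1 - (C + gamma*P)/K) - alpha*C + nu*P
        return dPdt, dCdt
    P = odeint(modelP, [0, 256], t)          # columns: PACCs, cancer cells
    target = ax[0] if value < -3 else ax[1]  # route line to first or second subplot
    target.plot(t, P[:, 0], label=f'alpha = 1e{value}')
    value += 1
ax[1].set_xlabel('Time [days]')
for a in ax:
    a.set_ylabel('Number of PACCs')
    a.legend()
plt.show()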
I want to plot an infinite, non-ending line between two points that are in the form of a pandas series. I am able to successfully plot a standard line between the points; however, I don't want the line to "end" and instead it should continue. Expanding on this, I would also like to extract the values of this new infinite line to a new dataframe, so that I can see what line value corresponds to a given x value.
data = yf.download("AAPL", start="2021-01-01", interval = "1d").drop(columns=['Adj Close'])
data = data[30:].rename(columns={"Open": "open", "High": "high", "Low": "low", "Close": "close", "Volume": "volume"})
local_max = argrelextrema(data['high'].values, np.greater)[0]
local_min = argrelextrema(data['low'].values, np.less)[0]
highs = data.iloc[local_max,:]
lows = data.iloc[local_min,:]
highesttwo = highs["high"].nlargest(2)
lowesttwo = lows["low"].nsmallest(2)
fig = plt.figure(figsize=[10,7])
data['high'].plot(marker='o', markevery=local_max)
data['low'].plot(marker='o', markevery=local_min)
highesttwo.plot()
lowesttwo.plot()
plt.show()
Currently my plot looks like this:
However, I want it to look like this, and I would also like to be able to get the values of the line for a corresponding x value.
This can be done in a few steps as shown in the following example where the lines are computed with element-wise operations (i.e. vectorized) using the slope-intercept form of the line equation.
The stock data has a frequency based on the opening dates of the stock exchange. This frequency is not automatically recognized by pandas, so the .plot method produces a plot with a continuous date for the x-axis and includes the days with no data. This can be avoided by setting the argument use_index=False so that the x-axis uses integers starting from zero instead.
The challenge is to then create nicely formatted tick labels. The following example attempts to imitate the pandas tick format by using list comprehensions to select the tick locations and format the labels. These will need to be adjusted if the date range is significantly lengthened or shortened.
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
from scipy.signal import argrelextrema # v 1.6.1
import yfinance as yf # v 0.1.54
# Import data
data = (yf.download('AAPL', start='2021-01-04', end='2021-03-15', interval='1d')
.drop(columns=['Adj Close']))
data = data.rename(columns={'Open': 'open', 'High': 'high', 'Low': 'low',
'Close': 'close', 'Volume': 'volume'})
# Extract points and get appropriate x values for the points by using
# reset_index for highs/lows
local_max = argrelextrema(data['high'].values, np.greater)[0]
local_min = argrelextrema(data['low'].values, np.less)[0]
highs = data.reset_index().iloc[local_max, :]
lows = data.reset_index().iloc[local_min, :]
htwo = highs['high'].nlargest(2).sort_index()
ltwo = lows['low'].nsmallest(2).sort_index()
# Compute slope and y-intercept for each line
slope_high, intercept_high = np.polyfit(htwo.index, htwo, 1)
slope_low, intercept_low = np.polyfit(ltwo.index, ltwo, 1)
# Create dataframe for each line by using reindexed htwo and ltwo so that the
# index extends to the end of the dataset and serves as the x variable then
# compute y values
# High
line_high = htwo.reindex(range(htwo.index[0], len(data))).reset_index()
line_high.columns = ['x', 'y']
line_high['y'] = slope_high*line_high['x'] + intercept_high
# Low
line_low = ltwo.reindex(range(ltwo.index[0], len(data))).reset_index()
line_low.columns = ['x', 'y']
line_low['y'] = slope_low*line_low['x'] + intercept_low
# Plot data using pandas plotting function and add lines with matplotlib function
fig = plt.figure(figsize=[10,6])
ax = data['high'].plot(marker='o', markevery=local_max, use_index=False)
data['low'].plot(marker='o', markevery=local_min, use_index=False)
ax.plot(line_high['x'], line_high['y'])
ax.plot(line_low['x'], line_low['y'])
ax.set_xlim(0, len(data)-1)
# Set major and minor tick locations
tks_maj = [idx for idx, timestamp in enumerate(data.index)
if (timestamp.month != data.index[idx-1].month) | (idx == 0)]
tks_min = range(len(data))
ax.set_xticks(tks_maj)
ax.set_xticks(tks_min, minor=True)
# Format major and minor tick labels
labels_maj = [ts.strftime('\n%b\n%Y') if (data.index[tks_maj[idx]].year
!= data.index[tks_maj[idx-1]].year) | (idx == 0)
else ts.strftime('\n%b') for idx, ts in enumerate(data.index[tks_maj])]
labels_min = [ts.strftime('%d') if (idx+3)%5 == 0 else ''
for idx, ts in enumerate(data.index[tks_min])]
ax.set_xticklabels(labels_maj)
ax.set_xticklabels(labels_min, minor=True)
plt.show()
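If you also want to look up the value of either line for a given x (as asked in the question), the line_high and line_low dataframes built above already hold those values. A minimal sketch, reusing the names defined above:
# Series mapping integer x positions to y values of the upper trend line
lookup_high = line_high.set_index('x')['y']
print(lookup_high.loc[line_high['x'].iloc[-1]])  # line value at the last plotted position
# Or attach both lines to the stock data as extra columns (NaN before each line starts)
data_ext = data.reset_index()
data_ext['trend_high'] = line_high.set_index('x')['y']
data_ext['trend_low'] = line_low.set_index('x')['y']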
You can find more examples of tick formatting here and here in Solution 1.
Date string format codes
I have a dataset with 7 columns: level, Time_30, Time_60, Time_90, Time_120, Time_150 and Time_180.
My main goal is to do a time-series anomaly detection using cell count in a 30-minute interval.
I want to do the following data preparation steps:
(I) melt/reshape the df into the appropriate time-series format (from wide to long): consolidate the columns Time_30, Time_60, ..., Time_180 into one column time with 6 levels (30, 60, ..., 180)
(II) since the result from (I) comes out as 30, 60, ..., 180, I want to set the time column to the appropriate time or date format for time-series (something like '%H:%M:%S')
(III) use a for-loop to plot the time-series plot for each level (A, B, ..., F) for comparison purposes
(IV) Anomaly detection
# generate/import dataset
import pandas as pd
df = pd.DataFrame({'level':[A,B,C,D,E,F],
'Time_30':[1993.05,1999.45, 2001.11, 2007.39, 2219.77],
'Time_60':[2123.15,2299.59, 2339.19, 2443.37, 2553.15],
'Time_90':[2323.56,2495.99,2499.13, 2548.71, 2656.0],
'Time_120':[2355.52,2491.19,2519.92,2611.81, 2753.11],
'Time_150':[2425.31,2599.51, 2539.9, 2713.77, 2893.58],
'Time_180':[2443.35,2609.92, 2632.49, 2774.03, 2901.25]} )
Desired outcome
# first series
level, time, count
A, 30, 1993.05
B, 60, 2123.15
C, 90, 2323.56
D, 120, 2355.52
E, 150, 2425.31
F, 180, 2443.35
# 2nd series
level,time,count
A,30,1999.45
B,60,2299.59
C,90,2495.99
D,120,2491.19
E,150,2599.51
F,180,2609.92
.
.
.
.
# up until the last series
See below for my attempt
# (I)
df1 = pd.melt(df,id_vars = ['level'],var_name = 'time',value_name = 'count') #
# (II)
df1['time'] = pd.to_datetime(df1['time'],format= '%H:%M:%S' ).dt.time
OR
df1['time'] = pd.to_timedelta(df1['time'], unit='m')
# (III)
plt.figure(figsize=(10,5))
plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.3)
# Perform STL Decomp
stl = STL(df1)
result = stl.fit()
seasonal, trend, resid = result.seasonal, result.trend, result.resid
plt.figure(figsize=(8,6))
plt.subplot(4,1,1)
plt.plot(df1)
plt.title('Original Series', fontsize=16)
plt.subplot(4,1,2)
plt.plot(trend)
plt.title('Trend', fontsize=16)
plt.subplot(4,1,3)
plt.plot(seasonal)
plt.title('Seasonal', fontsize=16)
plt.subplot(4,1,4)
plt.plot(resid)
plt.title('Residual', fontsize=16)
plt.tight_layout()
estimated = trend + seasonal
plt.figure(figsize=(12,4))
plt.plot(df1)
plt.plot(estimated)
plt.figure(figsize=(10,4))
plt.plot(resid)
# Anomaly detection
resid_mu = resid.mean()
resid_dev = resid.std()
lower = resid_mu - 3*resid_dev
upper = resid_mu + 3*resid_dev
anomalies = df1[(resid < lower) | (resid > upper)] # returns the datapoints with the anomalies
anomalies
plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.6)
plt.scatter(anomalies.index, anomalies.count, color='r', marker='D')
Please note: if you can only attempt I and/or II that would be much appreciated.
I made a few small edits to your sample dataframe based on my comment above:
import pandas as pd
df = pd.DataFrame({'level':['A','B','C','D','E'],
'Time_30':[1993.05,1999.45, 2001.11, 2007.39, 2219.77],
'Time_60':[2123.15,2299.59, 2339.19, 2443.37, 2553.15],
'Time_90':[2323.56,2495.99,2499.13, 2548.71, 2656.0],
'Time_120':[2355.52,2491.19,2519.92,2611.81, 2753.11],
'Time_150':[2425.31,2599.51, 2539.9, 2713.77, 2893.58],
'Time_180':[2443.35,2609.92, 2632.49, 2774.03, 2901.25]} )
First, manipulate the Time_* column names to be integer values:
timecols = [int(c.replace("Time_","")) for c in df.columns if c != 'level']
df.columns = ['level'] + timecols
After that you can pd.melt() like you were thinking, yielding a dataframe with all those "series" you mentioned above concatenated together:
df1 = df.melt(id_vars=['level'], value_vars=timecols, var_name='time', value_name='count').sort_values(['level','time']).reset_index(drop=True)
print(df1.head(10))
level time count
0 A 30 1993.05
1 A 60 2123.15
2 A 90 2323.56
3 A 120 2355.52
4 A 150 2425.31
5 A 180 2443.35
6 B 30 1999.45
7 B 60 2299.59
8 B 90 2495.99
9 B 120 2491.19
If you want to loop over the levels, select them with:
for level in df1['level'].unique():
    tmp = df1[df1['level']==level]
or
for level in df1['level'].unique():
    tmp = df1[df1['level']==level].copy()
...if you intend to modify/add data to the tmp dataframe.
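Building on that loop, here is a minimal sketch of step (III) from the question, plotting one line per level. It assumes df1 as created above, with time still stored as integer minutes, and matplotlib imported as plt:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
for level in df1['level'].unique():
    tmp = df1[df1['level'] == level]
    ax.plot(tmp['time'], tmp['count'], marker='o', label=f'level {level}')
ax.set_xlabel('Time [minutes]')
ax.set_ylabel('Cell count')
ax.legend()
plt.show()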
As for making timestamps, you could do:
df1['time'] = pd.to_timedelta(df1['time'], unit='min')
...like you were attempting, but it depends on how you're using it. If you just want strings that look like "00:30:00", etc, you can try something like:
df1['time'] = pd.to_timedelta(df1['time'], unit='min').apply(lambda x:str(x)[-8:])
Anyway, hope that gets you on track for what you need.
I am having a problem with a waterfall chart. I took this chart from the matplotlib site and added my own data frame with 2 simple columns containing some integer numbers. My waterfall was produced, but without numbers, just empty bars. I am a bit lost and would appreciate any suggestions.
What I am trying to build is a custom waterfall that takes one dataframe with column names, values, and some values for filters like countries. I haven't found anything like that anywhere, so I am trying to build my own.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
dataset = pd.read_csv('waterfall_test_data.csv')
#Use python 2.7+ syntax to format currency
def money(x, pos):
    'The two args are the value and tick position'
    return "${:,.0f}".format(x)
formatter = FuncFormatter(money)
#Data to plot. Do not include a total, it will be calculated
index = dataset['columns']
data = dataset['amount']
#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)
#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total
#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0
#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(15, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)
#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)
#Get an offset so labels don't sit right on top of the bar
max = trans.max()
neg_offset = max / 25
pos_offset = max / 50
plot_offset = int(max / 15)
#Start label loop
loop = 0
for index, row in trans.iterrows():
    # For the last item in the list, we don't want to double count
    if row['amount'] == total:
        y = y_height[loop]
    else:
        y = y_height[loop] + row['amount']
    # Determine if we want a neg or pos offset
    if row['amount'] > 0:
        y += pos_offset
    else:
        y -= neg_offset
    my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
    loop+=1
#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
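A likely cause of the empty bars, assuming waterfall_test_data.csv really has the two columns 'columns' and 'amount': pd.DataFrame(data=data, index=index) re-aligns the amount Series on the new index, so every value becomes NaN. A minimal sketch of a fix is to build trans from the raw values instead:
# Avoid index alignment by passing the underlying values rather than the Series
trans = pd.DataFrame({'amount': data.values}, index=index)
# equivalent alternative: trans = dataset.set_index('columns')[['amount']]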
I am using Python with matplotlib and need to visualize the distribution percentages of sub-groups of a data set.
Imagine this tree:
Data --- group1 (40%)
     --- group2 (25%)
     --- group3 (35%)

group1 --- A (25%)
       --- B (25%)
       --- c (50%)
And it can go on: each group can have several sub-groups, and the same applies to each sub-group.
How can I plot a proper chart for this info?
I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
For instance, we could get the following counts for the subgroups.
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]: group subgroup
1 A 17
C 16
B 5
2 A 23
C 10
B 7
3 C 8
A 7
B 7
Name: subgroup, dtype: int64
I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']) and incrementally plots the bars with the corresponding percentages.
import matplotlib.pyplot as plt
import matplotlib.cm
def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data
    is distributed at different levels. The order of the
    levels is given by the ordering parameter.

    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.

    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """
    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        plt.axis('off')
    counts = [data.shape[0]]
    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed
    cmap = matplotlib.cm.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))
    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        if ordering[:i]:
            counts['c_' + o] = counts.groupby(ordering[:i]).transform('sum')['c_' + ordering[-1]]
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]
    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    comb = 1  # keeps track of the number of possible combinations at each level
    for bar, col in enumerate(ordering):
        labels = sorted(data[col].unique())*comb
        comb *= len(data[col].unique())
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars
    ax.legend(colors)
    return fig
With the data shown above we would get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
Have you tried a stacked bar graph?
https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py
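For reference, here is a minimal sketch of that idea applied to the hypothetical percentages from the tree in the question (one stacked bar per level of the hierarchy):
import matplotlib.pyplot as plt

# Hypothetical percentages taken from the tree in the question
groups = {'group1': 40, 'group2': 25, 'group3': 35}
group1_subs = {'A': 25, 'B': 25, 'C': 50}

fig, ax = plt.subplots()
bottom = 0
for name, pct in groups.items():
    ax.bar('Data', pct, bottom=bottom, label=name)
    bottom += pct
bottom = 0
for name, pct in group1_subs.items():
    ax.bar('group1', pct, bottom=bottom, label='group1/' + name)
    bottom += pct
ax.set_ylabel('Percentage [%]')
ax.legend()
plt.show()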