I'm trying to output a Pandas dataframe as a pyplot table to be inserted into a presentation, but I'm running into problems with the formatting of the values in the table itself.
This is one part of a much larger project, so I'll try to give as much code and context as I can:
values = []
for item in scenarios:
    value = dataframe.loc[dataframe['Scenario'] == item, 'Value'].sum()
    # Formatting for scientific notation with 2 decimal places (so, 3 sig figs)
    values.append("{:.2e}".format(value))
# Create a dataframe to hold this new, visual-ready data and add the generated values to it.
visual_data = pd.DataFrame(scenarios, columns=["Scenario"])
visual_data["Data"] = values
# Converting the data back to float so Pandas can sort it numerically.
visual_data["Data"] = visual_data["Data"].astype(float)
# Sorting. By default, Pandas sorts in ascending order.
visual_data.sort_values(by=['Data'], inplace=True)
# Matplotlib code.
fig, ax = plt.subplots()
# fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')
table = ax.table(cellText=visual_data.to_numpy(), colLabels=visual_data.columns, loc='center', cellLoc='center')
fig.tight_layout()
plt.show()
The problem I run into is how everything in the table itself displays. The data I'm working with ranges from 1e-8 to 1e-4. When I just print my sorted dataframe, everything is formatted in proper scientific notation, from lowest to highest values.
As soon as I pass the data into the table (the .to_numpy() call when creating the table), the output looks something like the following:
[['DUJ' 3.964e-08]
['DUE' 4.467e-08]
['DUD' 1.172e-07]
['DUC' 2.098e-07]
['DUG' 2.136e-07]
...
['DUN' 7.356e-05]
['MCC' 0.0001046]
['ALU' 0.0001652]]
The final two entries render as standard floats instead of the consistent scientific notation used by all the other entries in the table.
I know that NumPy's set_printoptions has a suppress=True argument that should control this kind of behavior, but I can't figure out how to apply it here (or how to disable it, as it were).
Any ideas?
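One possible workaround (a minimal sketch reusing the visual_data frame built above): since ax.table simply renders whatever strings it is handed, sort on the numeric column first and then pass pre-formatted strings as cellText, so every cell ends up in the same scientific notation.

import matplotlib.pyplot as plt

# Sort on the numeric column, then build display strings for the table cells.
visual_data.sort_values(by=['Data'], inplace=True)
cell_text = [
    [row["Scenario"], "{:.2e}".format(row["Data"])]
    for _, row in visual_data.iterrows()
]

fig, ax = plt.subplots()
ax.axis('off')
ax.axis('tight')
table = ax.table(cellText=cell_text, colLabels=visual_data.columns,
                 loc='center', cellLoc='center')
fig.tight_layout()
plt.show()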
I have a plot that iterates through different dataframes in a list. However, I want to know which dataframe belongs to which graph. The dataframes don't have a title, so I was thinking of using a specific value from a column/row, which makes it clear to me which dataframe I am seeing.
In order to extract a specific float from a column/row in my list of dataframes, I use:
specific_float = list_of_databases[list_index][database_column_name][database_index]
I would like to use the same principle in my for loop with plots, so I was thinking of something like this:
Some context about the names in the code:
dc_oppervlakte_filter is the list of dataframes
TP_TARGET is a column name from the dataframes
TP_SMOTE is another column name, same story as TP_TARGET
STRIP_WIDTH is the column I want to use for my legend, with index 0
for k in dc_oppervlakte_filter:
    plt.plot(k['TP_TARGET'], k['TP_SMOTE'], 'o', label=k['STRIP_WIDTH'][0])
    plt.xlabel("TP_TARGET in ($\u00b0C$)")
    plt.ylabel("TP_SMOTE in ($\u00b0C$)")
    plt.legend(loc='lower right')
However, this gives me the error: KeyError: 0
So I tried:
for k in dc_oppervlakte_filter:
    plt.plot(k['TP_TARGET'], k['TP_SMOTE'], 'o', label=dc_oppervlakte_filter[k]['STRIP_WIDTH'][0])
    plt.xlabel("TP_TARGET in ($\u00b0C$)")
    plt.ylabel("TP_SMOTE in ($\u00b0C$)")
    plt.legend(loc='lower right')
But it gives me the error: TypeError: list indices must be integers or slices, not DataFrame
Ideally, there would be a legend showing the specific float chosen from every dataframe that is used to plot a graph.
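A likely fix, sketched under the assumption that the filtered dataframes no longer have the label 0 in their index (which is what raises the KeyError): select the first STRIP_WIDTH value by position with .iloc[0] instead of by label.

import matplotlib.pyplot as plt

for k in dc_oppervlakte_filter:
    # .iloc[0] picks the first row by position, so it works even after filtering
    # has removed the original index label 0 from the dataframe.
    plt.plot(k['TP_TARGET'], k['TP_SMOTE'], 'o', label=k['STRIP_WIDTH'].iloc[0])

plt.xlabel("TP_TARGET in ($\u00b0C$)")
plt.ylabel("TP_SMOTE in ($\u00b0C$)")
plt.legend(loc='lower right')
plt.show()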
I would like to find a way to modify the labels on HoloViews Sankey diagrams so that they show, in addition to the numerical values, also the percentage values.
For example:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')
data = {'A':['XX','XY','YY','XY','XX','XX'],
        'B':['RR','KK','KK','RR','RK','KK'],
        'values':[10,5,8,15,19,1]}
df = pd.DataFrame(data, columns=['A','B','values'])
sankey = hv.Sankey(df)
For the 'From' label 'YY', which currently reads 'YY - 8', I want to change this to 'YY - 8 (13.7%)', i.e. add the percentage in there.
I have found ways to change from the absolute value to percentage by using something along the lines of:
value_dim = hv.Dimension('Percentage', unit='%')
But I can't find a way to have both values in the label.
Additionally, I tried to modify the hover text. In my search I found ways to reference and display various attributes in the hover information (through the Bokeh tooltips), but it does not seem like that information can be manipulated.
This post explains two possible ways to achieve the desired result. Let's start with the example DataFrame and the necessary imports.
import holoviews as hv
from holoviews import opts, dim # only needed for 2. solution
import pandas as pd
data = {'A':['XX','XY','YY','XY','XX','XX'],
        'B':['RR','KK','KK','RR','RK','KK'],
        'values':[10,5,8,15,19,1],
}
df = pd.DataFrame(data)
Option 1
Use hv.Dimension(spec, **params), which lets you apply a formatter to a column name via the value_format keyword. This formatter simply combines the value with the value expressed as a percentage.
total = df.groupby('A', sort=False)['values'].sum().sum()

def fmt(x):
    # express x as a percentage of the grand total
    return f'{x} ({round(x/total*100, 1)}%)'

hv.Sankey(df, vdims=hv.Dimension('values', value_format=fmt))
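With the example data the grand total is 58, so for instance the 'YY' node (value 8) should come out as roughly 'YY - 8 (13.8%)'.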
Option 2
Extend the DataFrame df by one column which stores the labels you want to use. This column can later be reused inside the Sankey with opts(labels=dim('labels')). To check whether the calculations are correct, you can turn show_values on, but this will duplicate the value inside the labels; therefore, in the final solution, show_values is set to False. It can sometimes be tricky to get the labels in the correct order.
labels = []
for item in ['A', 'B']:
    grouper = df.groupby(item, sort=False)['values']
    total_sum = grouper.sum().sum()
    for name, group in grouper:
        _sum = group.sum()
        _percent = round(_sum/total_sum*100, 1)
        labels.append(f'{name} - {_sum} ({_percent}%)')

df['labels'] = labels
hv.Sankey(df).opts(show_values=False, labels=dim('labels'))
The downside of this solution is that we apply a groupby to both columns 'A' and 'B', which is something HoloViews will do again internally, so it is not very efficient.
Output
Comment
Both solutions create nearly the same figure, except that the HoverTool output differs.
I love how easy it is to set up basic hover feedback with HoverTool, but I'm wrestling with a couple aspects of the display. I have time-series data, with measurements that represent amounts in US$. This data starts out life as a pandas.Series. Legible plotting is easy (following assumes jupyter notebook):
from bokeh.plotting import figure, show
from bokeh.models import NumeralTickFormatter

p = figure(title='Example currency', x_axis_type='datetime',
           plot_height=200, plot_width=600, tools='')
p.line(my_data.index, my_data)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
show(p)
This shows me the time-series, with date-formatting on the x-axis and y-axis values that look like "$150,000", "$200,000", "$250,000", etc. I have two questions about HoverTool behavior:
Controlling formatting for $x and $y.
Accessing the name of the dataset under the cursor.
Simply adding a HoverTool allows me to see values, but in unhelpful units:
p.add_tools(HoverTool())
The corresponding tooltip values with these defaults show "1.468e+5" rather than "$146,800" (or even "146800", the underlying Series value); similarly, the date value appears as "1459728000000" rather than (say) "2016-04-04". I can manually work around this display issue by making my pandas.Series into a ColumnDataSource and adding string columns with the desired formatting:
from bokeh.models import ColumnDataSource, HoverTool

# Make sure Series and its index have `name`, before converting to DataFrame
my_data.name = 'revenue'
my_data.index.name = 'day'
df = my_data.reset_index()
# Add str columns for tooltip display
df['daystr'] = df['day'].dt.strftime('%d %b %Y')
df['revstr'] = df['revenue'].apply(lambda x: '${:,d}'.format(int(x)))
cds = ColumnDataSource(df)
p = figure(title='Example currency', x_axis_type='datetime',
           plot_height=200, plot_width=600, tools='')
p.line('day', 'revenue', source=cds)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
p.add_tools(HoverTool(tooltips=[('Amount', '@revstr'), ('Day', '@daystr')]))
show(p)
but is there a way to handle the formatting in the HoverTool configuration instead? That seems much more desirable than all the data-set transformation that's required above. I looked through the documentation and (quickly) scanned through the source, and didn't see anything obvious that would save me from building the "output" columns as above.
Related to that, when I have several lines in a single plot, is there any way for me to access the name (or perhaps legend value) of each line within HoverTool.tooltips? It would be extremely helpful to include something in the tooltip to differentiate which dataset values are coming from, rather than needing to rely on (say) line-color in conjunction with the tool-tip display. For now, I've added an additional column to the ColumnDataSource that's just the string value I want to show; that obviously only works for datasets that include a single measurement column. When multiple lines are sharing an underlying ColumnDataSource, it would be sufficient to access the column-name that's provided to y.
Hey, I know it's two years late, but this is for other people who come across this:
p.add_tools(HoverTool(
    tooltips=[
        ('Date', '@Date{%F}'),
        ('Value', '@Value{int}')],
    formatters={
        '@Date': 'datetime',
        '@Value': 'numeral'},
    mode='vline'
))
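To cover both parts of the original question in one place, here is a sketch (assuming the ColumnDataSource cds from above with its 'day' and 'revenue' columns): format specifiers in the tooltip fields handle the display directly in the HoverTool, and the $name special variable reports the name given to the glyph under the cursor, which helps when several lines share one plot.

from bokeh.plotting import figure, show
from bokeh.models import HoverTool

p = figure(title='Example currency', x_axis_type='datetime',
           plot_height=200, plot_width=600, tools='')
# Naming the glyph lets the tooltip identify which line the cursor is over via $name.
p.line('day', 'revenue', source=cds, name='revenue')
p.add_tools(HoverTool(
    tooltips=[
        ('Series', '$name'),            # the name= given to the glyph above
        ('Day', '@day{%F}'),            # needs the datetime formatter below
        ('Amount', '@revenue{$0,0}'),   # numeral.js format, e.g. "$146,800"
    ],
    formatters={'@day': 'datetime'},
))
show(p)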
I have a dataframe called afplot:
apple_fplot = apple_f1.groupby(['Year','Domain Category'])['Value'].sum()
afplot = apple_fplot.unstack('Domain Category')
I now need to produce a plot for each column of afplot, and save each plot to a unique filename.
I've been trying to do this through a for loop (I know that's inefficient), but can't seem to get it right.
for index, column in afplot.iteritems():
    plt.figure(index); afplot[column].plot(figsize=(12,6))
    plt.xlabel('Year')
    plt.ylabel('Fungicide used / lb')
    plt.title('Amount of fungicides used on apples in the US')
    plt.legend()
    plt.savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/apple_fplot{}'.format(index))
I'm not sure if I'm going about this the right way, but the whole idea is for the plot to be reset on each iteration, plotting only the next column's line plot and then saving it to a new filename.
The df.iteritems() iterator returns (column name, Series) pairs (see the docs), so you can simplify:
for col, data in afplot.iteritems():
    ax = data.plot(title='Amount of fungicides used on apples in the US')
    ax.set_ylabel('Fungicide used / lb')
    plt.gcf().savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/apple_fplot{}'.format(col))
    plt.close()
The xlabel should already be 'Year', as this seems to be the name of the index, and the legend is enabled by default. See the additional plot parameters in the pandas docs.
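If you are on a newer pandas (2.0 or later), iteritems() has been removed; items() is the drop-in replacement and yields the same (column name, Series) pairs:

for col, data in afplot.items():  # same pairs as iteritems() on older pandas
    ax = data.plot(title='Amount of fungicides used on apples in the US')
    ax.set_ylabel('Fungicide used / lb')
    plt.gcf().savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/apple_fplot{}'.format(col))
    plt.close()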
I'm plotting:
df['close'].plot(legend=True,figsize=(10,4))
The original data series comes in descending order, so I then did:
df.sort_values(['quote_date'])
The table now looks good and sorted in the desired manner, but the graph is still the same, showing today first and then going back in time.
Does .plot() order by index? If so, how can I fix this?
Alternatively, I'm importing the data with:
df = pd.read_csv(url1)
Can I somehow sort the data there already?
There are two problems with this code:
1) df.sort_values(['quote_date']) does not sort in place; it returns a sorted DataFrame while df itself is unchanged, so assign the result back:
df = df.sort_values(['quote_date'])
2) Yes, the plot() method plots by index by default, but you can change this behavior with the keyword use_index:
df['close'].plot(use_index=False, legend=True, figsize=(10,4))
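As for sorting at import time: read_csv itself has no sort option, but you can chain the sort onto the read. A minimal sketch using the names from the question (parse_dates assumes 'quote_date' holds date strings; drop it if it does not):

import pandas as pd

# read_csv cannot sort for you; sort_values right after the read gives a
# chronologically ordered frame before any plotting happens.
df = (pd.read_csv(url1, parse_dates=['quote_date'])
        .sort_values('quote_date')
        .reset_index(drop=True))
df['close'].plot(legend=True, figsize=(10,4))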