I'm stuck at a point where I have to write multiple pandas dataframes to a PDF file. The function accepts a dataframe as input.
However, while the first call writes to the PDF fine, all subsequent calls override the existing data, leaving only one dataframe in the PDF by the end.
Please find the Python function below:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def fn_print_pdf(df):
    pp = PdfPages('Sample.pdf')
    total_rows, total_cols = df.shape
    rows_per_page = 30  # Number of rows per page
    rows_printed = 0
    page_number = 1
    while total_rows > 0:
        fig = plt.figure(figsize=(8.5, 11))
        plt.gca().axis('off')
        # note: in newer pandas versions this helper lives at pandas.plotting.table
        matplotlib_tab = pd.tools.plotting.table(plt.gca(), df.iloc[rows_printed:rows_printed + rows_per_page],
                                                 loc='upper center', colWidths=[0.15] * total_cols)
        # Tabular styling
        table_props = matplotlib_tab.properties()
        table_cells = table_props['child_artists']
        for cell in table_cells:
            cell.set_height(0.024)
            cell.set_fontsize(12)
        # Header, footer and page number
        fig.text(4.25/8.5, 10.5/11., "Sample", ha='center', fontsize=12)
        fig.text(4.25/8.5, 0.5/11., 'P' + str(page_number), ha='center', fontsize=12)
        pp.savefig()
        plt.close()
        # Update variables
        rows_printed += rows_per_page
        total_rows -= rows_per_page
        page_number += 1
    pp.close()
And I'm calling this function as:
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
fn_print_pdf(df_a)

raw_data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
fn_print_pdf(df_b)
As you can see in the resulting Sample.pdf, only the data from the second dataframe is saved in the end. Is there a way to append to the same Sample.pdf on the second and subsequent passes, while still preserving the earlier data?
Your PDF is being overwritten because you're creating a new PDF document every time you call fn_print_pdf(). You can keep your PdfPages instance open between function calls and call pp.close() only after all your plots are written. For reference, see this answer.
Another option is to write the PDFs to separate files and use pyPDF to merge them; see this answer.
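A minimal sketch of that second approach, using the PdfMerger class from the newer PyPDF2/pypdf packages (the input and output file names here are hypothetical):

from PyPDF2 import PdfMerger

# Hypothetical input files, each produced by a separate fn_print_pdf() call
input_files = ['Sample_a.pdf', 'Sample_b.pdf']

merger = PdfMerger()
for path in input_files:
    merger.append(path)            # append all pages of each PDF in order

merger.write('Sample_merged.pdf')  # combined output file
merger.close()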
Edit: Here is some working code for the first approach.
Your function is modified to:
def fn_print_pdf(df, pp):
    total_rows, total_cols = df.shape
    rows_per_page = 30  # Number of rows per page
    rows_printed = 0
    page_number = 1
    while total_rows > 0:
        fig = plt.figure(figsize=(8.5, 11))
        plt.gca().axis('off')
        matplotlib_tab = pd.tools.plotting.table(plt.gca(), df.iloc[rows_printed:rows_printed + rows_per_page],
                                                 loc='upper center', colWidths=[0.15] * total_cols)
        # Tabular styling
        table_props = matplotlib_tab.properties()
        table_cells = table_props['child_artists']
        for cell in table_cells:
            cell.set_height(0.024)
            cell.set_fontsize(12)
        # Header, footer and page number
        fig.text(4.25/8.5, 10.5/11., "Sample", ha='center', fontsize=12)
        fig.text(4.25/8.5, 0.5/11., 'P' + str(page_number), ha='center', fontsize=12)
        pp.savefig()
        plt.close()
        # Update variables
        rows_printed += rows_per_page
        total_rows -= rows_per_page
        page_number += 1
Call your function with:
pp = PdfPages('Sample.pdf')
fn_print_pdf(df_a,pp)
fn_print_pdf(df_b,pp)
pp.close()
Related
Let's say I have a dataframe like the following:
df = pd.DataFrame({'Sample': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'NFW': [8.16, 8.63, 9.25, 8.97, 7.5, 8.21],
                   'Qubit': [55, 100, 229, 30, 42, 33],
                   'Lane': ['1', '1', '2', '2', '3', '3']})
I want to dynamically colorize the background of all rows based on the values of the Lane column. Also, I'll write this dataframe to an Excel file, and I need to keep all the style changes there too.
IIUC, you can use a colormap (for example from matplotlib) and map it to each row using a Categorical and style.apply:
import matplotlib
import matplotlib.cm
import matplotlib.colors
import numpy as np
import pandas as pd

cmap = matplotlib.cm.get_cmap('Pastel1')
colors = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
# ['#fbb4ae', '#b3cde3', '#ccebc5', '#decbe4', '#fed9a6', '#ffffcc', '#e5d8bd', '#fddaec', '#f2f2f2']

def color(df, colors=colors):
    # get unique values as category
    s = df['Lane'].astype('category')
    # map the colors and create CSS strings
    s = ('background-color: '
         + s.cat.rename_categories(colors[:len(s.cat.categories)])
            .astype(str)
         )
    # expand to DataFrame size
    return pd.DataFrame(np.tile(s, (df.shape[1], 1)).T,
                        index=df.index, columns=df.columns)

df.style.apply(color, axis=None)
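Since you also want the styling preserved when writing to Excel, the same Styler object can be exported directly. A short sketch, assuming openpyxl is installed (the output file name is just an example):

# exporting the Styler (rather than the plain DataFrame) keeps the
# background colours in the generated workbook
styled = df.style.apply(color, axis=None)
styled.to_excel('lanes_styled.xlsx', engine='openpyxl')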
I have a table with a fixed number of columns but an undefined number of rows (until it's populated). I am able to modify some example code to dynamically enter the rows, but I can't figure out how to apply the table styling, specifically the border properties. I would like the thicker grey outline of the table to continue around the left hand side and bottom.
Can anyone point me in the right direction please?
from docxtpl import DocxTemplate

tpl = DocxTemplate(r'G:\ELECSUPP\Documentation\Results\Kaya Marks/dynamic_table_tpl.docx')
context = {
    'col_labels': ['Description', 'Company Number', 'Calibration \nDue Date'],
    'tbl_contents': [
        {'label': '1', 'cols': ['Desc', 'BXXXXXX', '12/12/21']},
        {'label': '2', 'cols': ['Desc', 'BXXXXXX', '10/01/22']},
        {'label': '3', 'cols': ['Desc', 'BXXXXXX', '10/10/10']},
    ],
}
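For reference, rendering and saving the populated template would look roughly like this (the output file name is just an example); note that the thicker grey outline has to come from the table style defined inside the .docx template itself rather than from the Python code:

# fill the template placeholders with the context and write the result
tpl.render(context)
tpl.save('dynamic_table_result.docx')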
My question is a continuation from my previous question.
I am trying to create a networkx flow diagram from a pandas dataframe. The dataframe records how an order flows through multiple firms. Most of the rows in the dataframe are connected and the connections are manifested in multiple columns. Sample data is as below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'event_type': ['new', 'route', 'receive', 'execute', 'route', 'receive', 'execute'],
                   'event_id': ['110', '120', '200', '210', '220', '300', '310'],
                   'prior_event_id': [np.nan, '110', np.nan, '120', '210', np.nan, '300'],
                   'route_id': [np.nan, 'foo', 'foo', np.nan, 'bar', 'bar', np.nan]}
                  )
The dataframe looks like below:
Company event_type event_id prior_event_id route_id
0 A new 110 NaN NaN
1 A route 120 110 foo
2 B receive 200 NaN foo
3 B execute 210 120 NaN
4 B route 220 210 bar
5 C receive 300 NaN bar
6 C execute 310 300 NaN
I was able to create the source and target columns from the sample data using the code below:
df['event_sub'] = df.groupby([df.Company, df.event_type]).cumcount()+1
df['event'] = df.Company + ' ' + df.event_type + ' ' + df.event_sub.astype(str)
replace_dict_event = dict(df[['event_id', 'event']].values)
df['source'] = df['prior_event_id'].apply(lambda x: replace_dict_event.get(x) if replace_dict_event.get(x) else np.nan )
df['target'] = df['event_id'].apply(lambda x: replace_dict_event.get(x) if replace_dict_event.get(x) else np.nan )
replace_dict_rtd = dict(df[df.event_type == 'route'][['route_id', 'event']].values)
df.loc[df.event_type == 'receive', 'source'] = df[df.event_type == 'receive']['route_id'].apply(lambda x: replace_dict_rtd.get(x))
Now the dataframe looks like this:
The slight difference between the result above and the result in my previous question is that I incorporated the company name in the current result. And the networkx graph I created from the source and target columns looks like below:
However, the problem I am facing is that in my actual data, the company names are longer and there are more nodes. Therefore quite often the labels are all squeezed together and basically become unintelligible. The first solution that came to my mind is to break the labels into multiple lines. My desired node looks like this:
What I tried is to add '\n' in the pertinent columns, so I changed the second line of my last code block to
df['event'] = df.Company + '\n' + df.event_type + ' ' + df.event_sub.astype(str)
But this didn't give me what I want. Instead I got "KeyError: 'Node A\nnew 1 not in graph.'" I tried some other methods I found on SO, but no luck either.
Is there any way to achieve this?
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

# dummy data
a = np.random.randint(0, 2, size=(10, 10))
G = nx.from_numpy_matrix(a)
pos = nx.spring_layout(G)
# draw without labels, then draw labels separately
nx.draw_networkx(G, pos=pos, with_labels=False)
# draw_networkx_labels takes as keyword argument a dictionary called labels
# which links the id of a node to a name.
# you can create one using a dictionary comprehension like so:
nodenames = {n: 'firstline \n secondline \n thirdline' for n in G.nodes()}
# and then draw:
nx.draw_networkx_labels(G, pos=pos, labels=nodenames)
plt.show()
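For your actual dataframe, one way to get multi-line node labels without the KeyError is to keep the single-line names as the node identifiers and only split them at draw time. A sketch, assuming the source/target columns built earlier and splitting each label at its first space (which is just one possible rule):

import matplotlib.pyplot as plt
import networkx as nx

# build the graph from the single-line source/target names
edges = df.dropna(subset=['source', 'target'])
G = nx.from_pandas_edgelist(edges, source='source', target='target',
                            create_using=nx.DiGraph())

pos = nx.spring_layout(G)
nx.draw_networkx(G, pos=pos, with_labels=False)

# the multi-line text is used only for display, so the node ids stay intact
labels = {n: n.replace(' ', '\n', 1) for n in G.nodes()}
nx.draw_networkx_labels(G, pos=pos, labels=labels)
plt.show()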
I am practising building a Pie Chart in Plotly Express using Python.
So, this is the pie chart that I made:
This chart was built from a file with two columns:
gender with values of [0, 1, 2]
count_genders with values of [total_count_0, total_count_1, total_count_2]
I am planning to add some description to those values; for instance
0 - female
1 - male
2 - undefined
This is where I am currently stuck.
If I remember correctly, if you want to change a label in the legend (at least in a Choropleth map), you can manipulate the ticks located on the colorscale bar and thereby rename the labels for the data. So I am wondering: can you do the same in a pie chart?
My current code for this graph:
import pandas as pd
import plotly.express as px

'''
Pandas DataFrame:
'''
users_genders = pd.DataFrame({'gender': {0: 0, 1: 1, 2: 2},
                              'count_genders': {0: 802420, 1: 246049, 2: 106}})

''' Pie Chart Viz '''
gender_distribution = px.pie(users_genders,
                             values='count_genders',
                             names='gender',
                             color_discrete_map={'0': 'blue',
                                                 '1': 'red',
                                                 '2': 'green'},
                             title='Gender Distribution <br>'
                                   'between 2006-02-16 to 2014-02-20',
                             hole=0.35)
gender_distribution.update_traces(textposition='outside',
                                  textinfo='percent+label',
                                  marker=dict(line=dict(color='#000000',
                                                        width=4)),
                                  pull=[0.05, 0, 0.03],
                                  opacity=0.9,
                                  # rotation=180
                                  )
gender_distribution.update_layout(legend=dict({'traceorder': 'normal'}
                                              # ticks='inside',
                                              # tickvals=[0, 1, 2],
                                              # ticktext=["0 - Female",
                                              #           "1 - Male",
                                              #           "2 - Undefined"],
                                              # dtick=3
                                              ),
                                  legend_title_text='User Genders')
gender_distribution.show()
I tried to add the ticks in the update_layout to no avail. It returns an error message about incorrect parameters. Would someone kindly help me fix this issue?
Edit 1: In case I wasn't clear, I want to know if it's possible to modify the values displayed in the legend without changing the original values in the file. Many thanks for your time to those who are already kind enough to help me fix this issue!
Edit 2: Added the imports and other prior details of the code, and removed the Dropbox link.
If I'm understanding your question correctly, you'd like to change what's displayed in the legend without changing the names in your data source. There may be more elegant ways of doing this but I've put together a custom function newLegend(fig, newNames) that will do exactly that for you.
So with a figure like this:
...running:
fig = newLegend(fig = fig, newNames = {'Australia':'Australia = Dangerous',
'New Zealand' : 'New Zealand = Peaceful'})
...will give you:
I hope this is what you were looking for. Don't hesitate to let me know if not!
Complete code:
import plotly.express as px

df = px.data.gapminder().query("continent == 'Oceania'")
fig = px.pie(df, values='pop', names='country')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')

def newLegend(fig, newNames):
    for item in newNames:
        for i, elem in enumerate(fig.data[0].labels):
            if elem == item:
                fig.data[0].labels[i] = newNames[item]
    return fig

fig = newLegend(fig=fig, newNames={'Australia': 'Australia = Dangerous',
                                   'New Zealand': 'New Zealand = Peaceful'})
fig.show()
Edit 1: Example with data sample from OP
The challenge with your data was that genders were of type integer and not string. So the custom function tried to replace an element of one type with an element of another type. I've solved this by replacing the entire array containing your labels in one go instead of manipulating it element by element.
Plot:
Complete code:
import pandas as pd
import plotly.express as px
import numpy as np

# custom function to change labels
def newLegend(fig, newNames):
    newLabels = []
    for item in newNames:
        for i, elem in enumerate(fig.data[0].labels):
            if elem == item:
                # fig.data[0].labels[i] = newNames[item]
                newLabels.append(newNames[item])
    fig.data[0].labels = np.array(newLabels)
    return fig
'''
Pandas DataFrame:
'''
users_genders = pd.DataFrame({'gender': [0, 1, 2],
                              'count_genders': [802420, 246049, 106]})
''' Pie Chart Viz '''
gender_distribution = px.pie(users_genders,
                             values='count_genders',
                             names='gender',
                             color_discrete_map={'0': 'blue',
                                                 '1': 'red',
                                                 '2': 'green'},
                             title='Gender Distribution <br>'
                                   'between 2006-02-16 to 2014-02-20',
                             hole=0.35)
gender_distribution.update_traces(textposition='outside',
                                  textinfo='percent+label',
                                  marker=dict(line=dict(color='#000000',
                                                        width=4)),
                                  pull=[0.05, 0, 0.03],
                                  opacity=0.9,
                                  # rotation=180
                                  )
gender_distribution.update_layout(legend=dict({'traceorder': 'normal'}
                                              # ticks='inside',
                                              # tickvals=[0, 1, 2],
                                              # ticktext=["0 - Female",
                                              #           "1 - Male",
                                              #           "2 - Undefined"],
                                              # dtick=3
                                              ),
                                  legend_title_text='User Genders')
# custom function set to work
gender_distribution = newLegend(gender_distribution, {0: "0 - Female",
                                                      1: "1 - Male",
                                                      2: "2 - Undefined"})
gender_distribution.show()
Alternatively, you can rename the legend entries in place with for_each_trace:

newnames = {'0': 'zero', '1': 'one', '2': 'two'}
fig.for_each_trace(lambda t: t.update(
    labels=[newnames[label] for label in t.labels]
))
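For the pie chart from the question, where the labels are the raw integer gender codes (as the earlier edit notes), the same approach would look roughly like this (a sketch; unmapped labels are passed through unchanged):

# map the raw gender codes to descriptive legend labels
newnames = {0: '0 - Female', 1: '1 - Male', 2: '2 - Undefined'}

gender_distribution.for_each_trace(lambda t: t.update(
    labels=[newnames.get(label, label) for label in t.labels]
))
gender_distribution.show()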
I have an application that is using reportlab to build a document of tables. What I want to happen is when a flowable (in this case, always a Table) needs to split across pages, it should first add a page break. Thus, a Table should be allowed to split, but any table that is split should always start on a new page. There are multiple Tables in the same document, and if two can fit on the same page without splitting, there should not be a page break.
The closest I have gotten to this is to set allowSplitting to False when initializing the document. However, the issue is that when a table exceeds the space it has to fit in, it just fails with an error. If, instead of failing, it would wrap to the next page, that is what I am looking for.
For instance, this will fail with an error about not having enough space:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter, inch
from reportlab.platypus import SimpleDocTemplate, Table
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("simple_table_grid.pdf", pagesize=letter, allowSplitting=False)

# container for the 'Flowable' objects
elements = []

data2 = []
data = [['00', '01', '02', '03', '04'],
        ['10', '11', '12', '13', '14'],
        ['20', '21', '22', '23', '24'],
        ['30', '31', '32', '33', '34']]
for i in range(100):
    data2.append(['AA', 'BB', 'CC', 'DD', 'EE'])

t1 = Table(data)
t2 = Table(data2)

elements.append(t1)
elements.append(t2)
doc.build(elements)
The first table (t1) fits fine; however, t2 does not. If allowSplitting is left off, everything fits in the doc, but t1 and t2 end up on the same page. Because t2 is longer than one page, I would like a page break to be added before it starts, and for it to then split on the following pages where needed.
One option is to make use of the document height and table height to calculate the correct placement of PageBreak() elements. Document height can be obtained from the SimpleDocTemplate object and the table height can be calculated with the wrap() method.
The example below inserts a PageBreak() if the available height is less than table height. It then recalculates the available height for the next table.
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, PageBreak

doc = SimpleDocTemplate("simple_table_grid.pdf", pagesize=letter)

# Create multiple tables of various lengths.
tables = []
for rows in [10, 10, 30, 50, 30, 10]:
    data = [[0, 1, 2, 3, 4] for _ in range(rows)]
    tables.append(Table(data, style=[('BOX', (0, 0), (-1, -1), 2, (0, 0, 0))]))

# Insert PageBreak() elements at appropriate positions.
elements = []
available_height = doc.height
for table in tables:
    table_height = table.wrap(0, available_height)[1]
    if available_height < table_height:
        elements.extend([PageBreak(), table])
        if table_height < doc.height:
            available_height = doc.height - table_height
        else:
            available_height = table_height % doc.height
    else:
        elements.append(table)
        available_height = available_height - table_height

doc.build(elements)
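An alternative sketch of the same idea uses reportlab's CondPageBreak flowable, which only breaks the page when less than the requested height remains; capping the request at one full page means a table that cannot fit anywhere simply starts on a fresh page and then splits. This reuses the tables list from above, and the output file name is just an example:

from reportlab.lib.pagesizes import letter
from reportlab.platypus import CondPageBreak, SimpleDocTemplate

doc = SimpleDocTemplate("cond_table_grid.pdf", pagesize=letter)

elements = []
for table in tables:
    # height the table would need if it had a whole page to itself
    table_height = table.wrap(doc.width, doc.height)[1]
    # break to a new page only if the remaining space is smaller than the
    # table; the min() cap lets oversized tables start on a fresh page
    # and then split across the following pages
    elements.append(CondPageBreak(min(table_height, doc.height)))
    elements.append(table)

doc.build(elements)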