I am working in Python using ReportLab. I need to generate a report in PDF format. The data is retrieved from the database and inserted into a table.
Here is the simple code:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
from reportlab.lib.units import inch
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', 'Here is large field retrieve from database', '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data)
cellWidth = 1.9*inch
for x in range(5):
    t._argW[x] = cellWidth
elements.append(t)
doc.build(elements)
There are three issues:
Long data in a cell overlaps the next cell in the row.
When I increase the column width manually, for example cellWidth = 2.9*inch, the table runs off the page and I cannot scroll left or right to see it.
I do not know how to wrap the data in a cell: if the data is too long, it should continue on the next line within the same cell.
How can I solve this problem?
For starters, I would not set the column size as you did. Just pass the colWidths argument to Table like this:
Table(data, colWidths=[1.9*inch] * 5)
Now to your problem. If you don't set the colWidths parameter, ReportLab will do this for you and space the columns according to your data. If this is not what you want, you can wrap your data in Paragraphs, like Bertrand said. Here is a modified example of your code:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
styles = getSampleStyleSheet()
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', Paragraph('Here is large field retrieve from database', styles['Normal']), '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data)
elements.append(t)
doc.build(elements)
I think you will get the idea.
I was facing the same problem today. I was looking for a solution where I could only resize a single column - with much different content length than other columns - and have reportlab do the rest for me.
This is what worked for me:
Table(data, colWidths=[1.9*inch] + [None] * (len(data[0]) - 1))
This sets the width of the first column only. Of course, you can just as easily place the number anywhere among the Nones.
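If the fixed column is not the first one, building the list by hand gets fiddly. A small sketch of one way to do it is below; make_col_widths is a hypothetical helper of mine, and INCH simply repeats the 72 points per inch that reportlab.lib.units.inch stands for, so the snippet runs without ReportLab.

```python
INCH = 72.0  # points per inch, the same value as reportlab.lib.units.inch


def make_col_widths(n_columns, fixed_widths):
    """Build a colWidths list: None (auto-size) except where a width is given.

    fixed_widths maps a zero-based column index to a width in points.
    """
    widths = [None] * n_columns
    for column_index, width in fixed_widths.items():
        widths[column_index] = width
    return widths


# Fix only the second column; the other four are left for ReportLab to size.
col_widths = make_col_widths(5, {1: 1.9 * INCH})
```

The resulting list can then be passed as Table(data, colWidths=col_widths).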
Hi, I was getting the same issue while resizing the content of a table, and then I found a solution. It might help you resolve all three of the issues mentioned here.
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
from reportlab.lib.units import inch
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', 'Here is large field retrieve from database', '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data,colWidths=[1.9*inch]*5, rowHeights=[0.9*inch] *4)
#colWidth = size * number of columns
#rowHeights= size * number of rows
elements.append(t)
doc.build(elements)
Related
I'm very new to Python and I'm trying to get the index of an element in a list of lists. Here is my list:
data = [['0', '999.8', '1.78e-3'], ['5', '1000', '1.52e-3'], ['10', '999.7', '1.31e-3'], ['15', '999.1', '1.14e-3'], ['20', '998.2', '1.00e-3'], ['25', '997.0', '0.89e-3'], ['30', '995.7', '0.80e-3'], ['40', '992.2', '0.65e-3']]
I want to find the index of '10'. Here is my code:
for element in data:
    for e in element:
        index_valeur = e.index('10')
        print(index_valeur)
It doesn't seem to work and this is the error message:
ValueError: substring not found
How can I get the index of the value?
The most Pythonic way is to use pandas. Here is my rough attempt ;-) It prints the two T values closest to a number you enter:
import numpy as np
import pandas as pd
import io
mycsv = '''
T,rho,mu
0,999.8,1.78e-3
5,1000,1.52e-3
10,999.7,1.31e-3
15,999.1,1.14e-3
20,998.2,1.00e-3
25,997.0,0.89e-3
30,995.7,0.80e-3
40,992.2,0.65e-3
'''
myNum = float(input("Enter number: "))
df = pd.read_csv(io.StringIO(mycsv))
print(sorted(df['T'].values.tolist(), key= lambda x:abs(x-myNum))[:2])
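As for the error itself: in the question's loop, e is a single string such as '0', so e.index('10') searches for a substring inside that string and raises ValueError when it is not there. A minimal plain-Python sketch that looks the value up in each row instead is shown below; find_index is a name I made up for illustration, and the list is shortened.

```python
def find_index(rows, target):
    """Return (row_index, column_index) of the first cell equal to target.

    Returns None when no cell matches. Hypothetical helper for illustration.
    """
    for row_index, row in enumerate(rows):
        if target in row:
            # list.index compares whole elements, not substrings
            return row_index, row.index(target)
    return None


data = [['0', '999.8', '1.78e-3'], ['5', '1000', '1.52e-3'],
        ['10', '999.7', '1.31e-3'], ['15', '999.1', '1.14e-3']]

position = find_index(data, '10')
print(position)  # (2, 0)
```

Note that '1000' in the second row is not reported, because the comparison is against the whole cell.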
Let me try to explain to the best of my ability, as I am not a Python wizard. I have read a PDF table of COVID-19 data for Mexico with PyPDF2 and tokenized it. (Long story: I tried tabula first but did not get the format I was expecting, and I was going to spend more time reformatting the resulting CSV document than analyzing it.) The result is a list of strings with a length of 16792, which is fine.
Now, the problem I am facing is that I need to format it in the appropriate way by concatenating some (not all) of those strings together so I can create a list of lists with the same length which is 9 columns.
This is an example of how it looks right now, the columns are Case number, State, Locality, Gender, Age, Date when symptoms started, Status, Type of contagion, Date of arrival to Mexico:
['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020'
What I would want is to get certain strings like 'ZONA', 'NORTE' as 'ZONA NORTE' or 'CIUDAD', 'DE', 'MEXICO' as 'CIUDAD DE MEXICO' or 'ESTADOS', 'UNIDOS' as 'ESTADOS UNIDOS'...
I seriously do not know how to tackle this. I have tried split(), replace(), trying to find the index of each frequency, reading all the questions about manipulating lists, and almost all the responses provided, and I haven't been able to do it.
Any guidance will be greatly appreciated. Sorry if this is a very basic question, but I know there has to be a way; I just don't know it.
Since the phrase is not split the same way in each row, you need a way to "recognize" it. One way would be: if two list items next to each other each contain a run of three or more letters, join them.
import re
row_list = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020']
words_longer_than_3 = r'([^\d\W]){3,}'

def is_a_word(text):
    return bool(re.findall(words_longer_than_3, text))

def get_next_item(row_list, i):
    try:
        next_item = row_list[i+1]
    except IndexError:
        return
    return is_a_word(next_item)

for i, item in enumerate(row_list):
    item_is_a_word = is_a_word(row_list[i])
    if not item_is_a_word:
        continue
    next_item_is_a_word = get_next_item(row_list, i)
    while next_item_is_a_word:
        row_list[i] += f' {row_list[i+1]}'
        del row_list[i+1]
        next_item_is_a_word = get_next_item(row_list, i)
print(row_list)
result:
['1', 'PUEBLA PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso Contacto', 'NA', '2', 'GUERRERO ZONA NORTE', 'M', '29', '15/03/2020', 'Sospechoso Contacto', 'NA', '3', 'BAJA CALIFORNIA TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso Estados Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso Italia', '03/03/2020', '5', 'JALISCO CENTRO GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso España', '17/03/2020']
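Notice that 'DE' and 'NA' are left on their own in the result, while 'Sospechoso' gets glued to the column after it. To see why, it can help to run the pattern on a few single tokens. This is just a small check of the same regular expression, with the pattern compiled once; the name WORD_PATTERN is mine.

```python
import re

# Same pattern as above: a run of three or more word characters that are not digits.
WORD_PATTERN = re.compile(r'([^\d\W]){3,}')


def is_a_word(text):
    """True when text contains at least three consecutive letters."""
    return bool(WORD_PATTERN.search(text))


tokens = ['PUEBLA', 'MÉXICO', 'Sospechoso', 'DE', 'NA', 'M', '49', '15/03/2020']
checks = {token: is_a_word(token) for token in tokens}
for token, matched in checks.items():
    print(f'{token!r}: {matched}')
```

Only the first three tokens match, so two-letter pieces such as 'DE' are never joined, and any two neighbouring long words are joined even when they belong to different columns. The rule is a heuristic that still needs some cleanup afterwards.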
I suppose that the data you want to process comes from a file similar to this one here, which contains 2623 rows x 8 columns.
You can load the data from the PDF file using tabula-py. You can install it through pip install tabula-py==1.3.0. One issue with extracting the table like this is that tabula-py can mess up things sometimes. For example, the header of the table was extracted from the PDF file like this:
"",,,,Identificación Fecha de,
"",,,,de COVID-19,Fecha del llegada a
N° Caso,Estado,Sexo,Edad,Inicio de Procedencia por RT-PCR en,México
"",,,,síntomas,
"",,,,tiempo real,
Pretty nasty huh?
In addition, tabula-py could not separate some of the columns, that is, it did not write a comma in the right place for the output CSV file to be parsed correctly. For instance, the line with número de caso (case number) 8:
8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020
could be fixed by just replacing " Sospechoso " by ",Sospechoso,". And you are lucky because this is the only parsing issue you will have to deal with for now. Thus, iterating through the lines of the output CSV file and replacing " Sospechoso " by ",Sospechoso," takes care of everything.
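As a quick check of that replacement on the line for case 8, using only plain string methods:

```python
raw_line = "8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020"

# Turn the spaces around "Sospechoso" into commas so it becomes its own column.
fixed_line = raw_line.replace(" Sospechoso ", ",Sospechoso,")
fields = fixed_line.split(",")

print(fixed_line)
print(len(fields))  # 8
```

The line goes from 6 comma-separated fields to 8, which matches the 8 columns of the table.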
Finally, I added an option (removeaccents) for removing the accents from the data. This can help you avoid encoding issues in the future. For this you will need unidecode: pip install unidecode.
Putting everything together, the code that reads the PDF file and converts it to a CSV file and loads it as a pandas dataframe is as follows (You can download the preprocessed CSV file here):
import tabula
from tabula import wrapper
import pandas as pd
import unidecode
"""
1. Adds the header.
2. Skips "corrupted" lines from tabula.
3. Replacing " Sospechoso " with ",Sospechoso," automatically separates
   the previous column ("fecha_sintomas") from the next one ("procedencia").
4. Finally, eliminates accents to avoid encoding issues.
"""
def simplePrep(
    input_path,
    header = [
        "numero_caso",
        "estado",
        "sexo",
        "edad",
        "fecha_sintomas",
        "identification_tiempo_real",
        "procedencia",
        "fecha_llegada_mexico"
    ],
    lookfor = " Sospechoso ",
    replacewith = ",Sospechoso,",
    output_path = "preoprocessed.csv",
    skiprowsupto = 5,
    removeaccents = True
):
    fin = open(input_path, "rt")
    fout = open(output_path, "wt")
    fout.write(",".join(header) + "\n")
    count = 0
    for line in fin:
        # skip the first `skiprowsupto` lines (the broken header from tabula)
        if count > skiprowsupto - 1:
            if removeaccents:
                fout.write(unidecode.unidecode(line.replace(lookfor, replacewith)))
            else:
                fout.write(line.replace(lookfor, replacewith))
        count += 1
    fin.close()
    fout.close()
"""
Reads all the pdf pages specifying that the table spans multiple pages.
multiple_tables = True, otherwise the first row of each page will be missed.
"""
tabula.convert_into(
    input_path = "data.pdf",
    output_path = "output.csv",
    output_format = "csv",
    pages = 'all',
    multiple_tables = True
)
simplePrep("output.csv", removeaccents = True)
# reads preprocess data set
df = pd.read_csv("preoprocessed.csv", header = 0)
# prints the first 5 samples in the dataframe
print(df.head(5))
I am getting one HTML table per day, so if I search for 20 days it returns 20 tables. I want to add all 20 tables together into one table so I can verify the data as a time series.
I have tried the merge and add functions of pandas, but they just join the values as strings.
Table one
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
Table two
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
Here is my code, but it adds the values as strings:
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]
df = pd.DataFrame(tab_data)
df2 = pd.DataFrame(tab_data)
df3 = df.add(df2,fill_value=0)
df
If you want to convert the numeric cells into integers, you would need to do that explicitly, as follows:
tab_data = [[int(item.text) if item.text.isdigit() else item.text
             for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]
Hope it helps.
The way you are converting the data frame treats all values as text.
There are two options here:
Explicitly convert the strings to the data type you want using astype.
Use read_html to create data frames from HTML tables; it also tries to do the data type conversion.
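A minimal sketch of the first option, assuming each scraped table has the labels in its first column and the column names in its first row, as in the question. The helper name table_to_numeric_frame and the "label" index name are made up for illustration, and the tables are shortened to two data rows.

```python
import pandas as pd


def table_to_numeric_frame(tab_data):
    """Turn a scraped list of string rows into an integer DataFrame.

    The first row supplies the column names and the first cell of every
    other row becomes the index. Hypothetical helper for illustration.
    """
    header, *rows = tab_data
    frame = pd.DataFrame(rows, columns=["label"] + header[1:])
    # Move the text labels into the index so only numeric cells remain.
    return frame.set_index("label").astype(int)


table_one = [
    ["\xa0", "All Issues", "Investment Grade", "High Yield", "Convertible"],
    ["Total Issues Traded", "8039", "5456", "2386", "197"],
    ["Advances", "3834", "2671", "1075", "88"],
]
table_two = [list(row) for row in table_one]  # same sample values, second day

total = table_to_numeric_frame(table_one).add(
    table_to_numeric_frame(table_two), fill_value=0
)
print(total)
```

With the labels in the index and the cells converted to integers, add sums the numbers cell by cell instead of joining strings.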
I have an application that is using reportlab to build a document of tables. What I want to happen is when a flowable (in this case, always a Table) needs to split across pages, it should first add a page break. Thus, a Table should be allowed to split, but any table that is split should always start on a new page. There are multiple Tables in the same document, and if two can fit on the same page without splitting, there should not be a page break.
The closest I have gotten to this is to set allowSplitting to False when initializing the document. However, the issue is that when a table exceeds the space available to it, the build just fails. If it would instead move to a new page and then split, rather than fail, that is what I am looking for.
For instance, this will fail with an error about not having enough space:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter, inch
from reportlab.platypus import SimpleDocTemplate, Table
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("simple_table_grid.pdf", pagesize=letter, allowSplitting=False)
# container for the 'Flowable' objects
elements = []
data2 = []
data = [['00', '01', '02', '03', '04'],
['10', '11', '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', '33', '34']]
for i in range(100):
    data2.append(['AA', 'BB', 'CC', 'DD', 'EE'])
t1 = Table(data)
t2 = Table(data2)
elements.append(t1)
elements.append(t2)
doc.build(elements)
The first table (t1) will fit fine, however t2 does not. If the allowSplitting is left off, it will fit everything in the doc, however t1 and t2 are on the same page. Because t2 is longer than one page, I would like it to add a page break before it starts, and then to split on the following pages where needed.
One option is to make use of the document height and table height to calculate the correct placement of PageBreak() elements. Document height can be obtained from the SimpleDocTemplate object and the table height can be calculated with the wrap() method.
The example below inserts a PageBreak() if the available height is less than table height. It then recalculates the available height for the next table.
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, PageBreak
doc = SimpleDocTemplate("simple_table_grid.pdf", pagesize=letter)
# Create multiple tables of various lengths.
tables = []
for rows in [10, 10, 30, 50, 30, 10]:
    data = [[0, 1, 2, 3, 4] for _ in range(rows)]
    tables.append(Table(data, style=[('BOX', (0, 0), (-1, -1), 2, (0, 0, 0))]))
# Insert PageBreak() elements at appropriate positions.
elements = []
available_height = doc.height
for table in tables:
    table_height = table.wrap(0, available_height)[1]
    if available_height < table_height:
        elements.extend([PageBreak(), table])
        if table_height < doc.height:
            available_height = doc.height - table_height
        else:
            available_height = table_height % doc.height
    else:
        elements.append(table)
        available_height = available_height - table_height
doc.build(elements)
These are my stacks of arrays, both with variables arranged columnwise.
final_a = np.stack((four, five, st, dist, ru), axis=-1)
final_b = np.stack((org, own, origin, init), axis=-1)
Example:
In: final_a
Out: array([['9999', '10793', ' 1', '99', '2'],
['9999', '10799', ' 1', '99', '2'],
['9999', '10712', ' 1', '99', '2'],
...,
['9999', '23960', '33', '99', '1'],
['9999', '82920', '33', '99', '2'],
['9999', '82920', '33', '99', '2']],
dtype='<U5')
But when I try to save either of them to a .csv file using this code:
np.savetxt("/Users/jaisaranc/Documents/ASI selected data - A.csv", final_a, delimiter=",")
It throws this error:
TypeError: Mismatch between array dtype ('<U5') and format specifier ('%.18e,%.18e,%.18e,%.18e,%.18e')
I have no idea what to do.
savetxt in NumPy allows you to specify a format for how the array will be written to the file. The default format (fmt='%.18e') can only format arrays containing only numeric elements. Your array contains strings (dtype='<U5' means the type is Unicode with length 5), so it raises an error. In your case you should also include fmt='%s' as an argument to ensure the array elements in the output file are formatted as strings. For example:
np.savetxt("example.csv", final_a, delimiter=",", fmt="%s")
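Here is a small self-contained version you can run to check the result. The array holds just two sample rows from the question, and the file goes into a temporary directory, which is my own choice for the sketch.

```python
import os
import tempfile

import numpy as np

# Two sample rows of strings, like the array in the question.
final_a = np.array([["9999", "10793", " 1", "99", "2"],
                    ["9999", "10799", " 1", "99", "2"]])

out_path = os.path.join(tempfile.mkdtemp(), "example.csv")

# fmt="%s" writes every element as a string instead of the default '%.18e'.
np.savetxt(out_path, final_a, delimiter=",", fmt="%s")

with open(out_path) as fh:
    saved_lines = fh.read().splitlines()
print(saved_lines)
```

Each row is written as one comma-separated line, and the leading space in ' 1' is kept as it is.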