Reading a table with blank cells with tabula-py in Python

I am trying to load a large table (an example is attached) from a Form 10-K into Python using tabula-py. The table does not have clear borders and has a lot of blank cells, which causes several issues.
My code is
df = tabula.read_pdf("firm_xxx_10K.pdf", pages='100-101',guess=True,stream=True,columns=(144,210,300,340,380,420,450))
With stream=True, I get all the data, but information that spans multiple rows is recognized as separate entries. With lattice=True, cells that span multiple rows are correctly recognized as one cell, but the result misses a lot of observations.
Is there a better way to set the options? I have tried many combinations, but I am stuck. Any help is much appreciated.
Example of the Table I am Trying to Read
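Not the asker's solution, but one workaround that is sometimes used: keep stream mode so no rows are lost, then merge the wrapped continuation rows back together in pandas. The column boundaries and the rule for spotting a wrapped row (all non-label columns empty) are assumptions here and would need adjusting to the actual table.

import tabula
import pandas as pd

# Read with stream mode so no observations are dropped (column boundaries assumed)
df = tabula.read_pdf("firm_xxx_10K.pdf", pages='100-101', stream=True, guess=False,
                     columns=(144, 210, 300, 340, 380, 420, 450))[0]

# Rows whose non-label columns are all empty are treated as continuations of the
# previous row; their label text is appended to that row's label
is_cont = df.iloc[:, 1:].isna().all(axis=1)
group_id = (~is_cont).cumsum()
label_col = df.columns[0]
merged = df.groupby(group_id).agg(
    {label_col: lambda s: ' '.join(s.dropna().astype(str)),
     **{c: 'first' for c in df.columns[1:]}})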

Related

Fill out Excel Template with Python

I'm having trouble finding a solution to fill out an Excel template using Python. I currently have a pandas DataFrame and use openpyxl to write the necessary data to specific rows and cells in a for loop. The issue is that in my next project several of the cells I have to write to are not contiguous, so for example instead of going A1, A2, A3 it can go A1, A5, A9. This time, listing the cells the way I did in the past would be impractical.
So I was looking for something that works similarly to a VLOOKUP in Excel, where Python would match the necessary row and column in the template and drop the information there. I know I might need to use different commands.
I added a picture below as an example. I need to drop values into the empty cells, and ideally Python would read "USA" and "Revenue" and know to place that value in cell B2. I know I might also need something to map it; I am just not sure how to start or whether it is even possible.
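Not from the original post, but a minimal sketch of the lookup idea with openpyxl, assuming the template keeps the row labels (e.g., "USA") in column A and the column labels (e.g., "Revenue") in row 1; the file names and labels are placeholders.

from openpyxl import load_workbook

wb = load_workbook("template.xlsx")   # placeholder template name
ws = wb.active

def find_cell(ws, row_label, col_label):
    # Intersection of a row header in column A and a column header in row 1
    row = next(r for r in range(1, ws.max_row + 1)
               if ws.cell(row=r, column=1).value == row_label)
    col = next(c for c in range(1, ws.max_column + 1)
               if ws.cell(row=1, column=c).value == col_label)
    return ws.cell(row=row, column=col)

find_cell(ws, "USA", "Revenue").value = 1234.5   # lands in B2 in the example layout
wb.save("filled_template.xlsx")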

How to extract a table without all borders into text with Python?

I am trying to extract a table like this into a DataFrame. How can I do that (and also extract the names split across several lines) with Python?
Also, I want this to be general and applicable to every table (even one that doesn't have this structure), so giving the coordinates of each separate table won't work well.
I don't know your exact problem, but if you want to extract data or tables from a PDF, try the camelot-py library; it is easy to use and, in my experience, gives more than 90% accuracy.
I am also working on a similar project.
import camelot
tables = camelot.read_pdf(PDF_file_Path, flavor='stream', pages='1', table_areas=['5,530,620,180'])
tables[0].parsing_report
df = tables[0].df
The parameters of camelot.read_pdf are:
PDF_file_Path: the path to the PDF file;
table_areas: optional; if you know exactly where the table sits, provide its location, otherwise camelot scans the page and picks up all tables and data;
pages: the page numbers to parse.
.parsing_report shows a description of the result, e.g., accuracy and whitespace.
.df returns the table as a DataFrame; index 0 refers to the first table. It depends on your data.
You can read more about them in the camelot documentation.
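Not part of the original answer, but for completeness: once the parsing report looks reasonable, camelot can also write the result out directly (the file names below are placeholders).

tables[0].to_csv("page1_table.csv")        # save a single table
tables.export("all_tables.csv", f="csv")   # save every detected table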

Pandas Pivot table and Excel's style cells

I get data measurements from instruments. These measurements depend on several parameters, and a pivot table is a good way to represent the data. Every measurement can be associated with a scope screenshot to be more explicit. I get all the data in the following CSV format:
The number of measurements and parameters can change.
I am trying to write a Python script (for now with the pandas library) that creates a pivot table in Excel. With pandas, I can color the data that falls in or out of a defined range. However, I would also like to create a link on every cell that takes me to the corresponding screenshot. But I am stuck here.
I would like a result like the following (but with the link to the corresponding screenshot) :
Actually, I found a way to add the link to all the cells, thanks to the =HYPERLINK() Excel function applied with the pandas apply() function.
However, I can no longer apply conditional formatting with XlsxWriter because the cells no longer have numerical content.
I could apply the conditional formatting first and then iterate through the whole sheet to add the links, but it would be a total mess to keep track of the relation between the data and the different measurement parameters.
I would appreciate your help with ideas and efficient ways to do this.
XlsxWriter has a function called write_url, but you must apply write_url first, while creating the new worksheet, and then use openpyxl to insert your pandas DataFrame:
1) Create the worksheet and insert the links with write_url.
2) Use openpyxl to write the data into the already formatted cells.
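A rough sketch of that two-step idea, not taken from the original answer; the file, sheet and screenshot names are placeholders, and it is worth checking that the links and conditional formats written by XlsxWriter survive the openpyxl round trip on your library versions.

import xlsxwriter
from openpyxl import load_workbook

# Step 1 (XlsxWriter): write the links and the conditional formatting first
wb = xlsxwriter.Workbook("pivot.xlsx")
ws = wb.add_worksheet("Measurements")
red = wb.add_format({"bg_color": "#FFC7CE"})
ws.write_url("B2", "external:screenshots/meas_B2.png")   # link to the screenshot
ws.conditional_format("B2:F20", {"type": "cell", "criteria": "not between",
                                 "minimum": 0, "maximum": 100, "format": red})
wb.close()

# Step 2 (openpyxl): write the numeric values into the already formatted cells
wb2 = load_workbook("pivot.xlsx")
ws2 = wb2["Measurements"]
ws2["B2"] = 12.7
wb2.save("pivot.xlsx")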

How to reformat dataframe in Pandas using Python?

I'm pretty new to pandas and Python, but I have a solid coding background. I've decided to pick this up because it will help me automate certain financial reports at work.
To give you some background on my issue: I'm taking a PDF and using Tabula to convert it into a CSV file, which works fine but gives me certain formatting issues. The reports come as roughly 60-page PDF files, which I export to CSV and then try to manipulate in Python using pandas.
The issue: when I reformat the data, I get a CSV file that looks something like this -
The issue here is that certain tables are shifting and I think it is due to the amount of pages and multiple headings within those.
Would it be possible for me to reformat this data using Pandas, and basically create a set of rules for how it gets reformatted?
Basically, I would like to shift the rows that are misplaced back into their respective places based on something like blank spaces.
Is it possible for me to delete rows containing certain strings, i.e., to delete the extra/unnecessary headers?
Can I somehow save the 'Total' data at the bottom by searching for the row with 'Total' and placing it somewhere else?
In essence, is there a way to partition this data by a set of commands (without specifying row numbers - because this changes daily) and then reposition it accordingly so that I can manipulate the data however necessary?
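A sketch of the kind of rule-based cleanup that is possible with pandas; the file name, the heading strings, and the assumption that the labels sit in the first column are placeholders, not taken from the original report.

import pandas as pd

df = pd.read_csv("tabula_export.csv", header=None)   # placeholder file name

# Delete rows that are really repeated page headings, matched by their text
headings = ["Description", "Account", "Page"]        # assumed heading strings
df = df[~df[0].astype(str).str.strip().isin(headings)]

# Pull the 'Total' rows out so they can be re-attached wherever they belong
is_total = df[0].astype(str).str.strip().str.startswith("Total")
totals = df[is_total]
df = df[~is_total].reset_index(drop=True)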

Delete cells in Excel using Python 2.7 and openpyxl

I'm trying to delete cells from an Excel spreadsheet using openpyxl. It seems like a pretty basic command, but I've looked around and can't find out how to do it. I can set their values to None, but they still exist as empty cells. worksheet.garbage_collect() throws an error saying that it's deprecated. I'm using the most recent version of openpyxl. Is there any way of just deleting an empty cell (as one would do in Excel), or do I have to manually shift all the cells up? Thanks.
In openpyxl, cells are stored individually in a dictionary. This makes aggregate actions like deleting or adding columns or rows difficult, as the code has to process lots of individual cells. However, even moving to a tabular or matrix implementation is tricky, because the coordinates are stored on each cell, meaning that you have to process all cells to the right of and below an inserted or deleted cell. This is why we have not yet added any convenience methods for this: they could be really, really slow, and we don't want the responsibility for that.
We are hoping to move towards a matrix implementation in a future version, but there's still the problem of cell coordinates to deal with.
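For what it's worth, later openpyxl releases (2.5 and newer, if I remember correctly) did add convenience methods that shift the surrounding cells for you; a minimal sketch with a placeholder file name:

from openpyxl import load_workbook

wb = load_workbook("data.xlsx")   # placeholder file name
ws = wb.active

ws.delete_rows(3)       # remove row 3 and shift everything below it up
ws.delete_cols(2, 2)    # remove columns B and C and shift the rest left

wb.save("data.xlsx")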
