Using the tabula-py package for Python, I am trying to extract tables from multiple PDF files. This works beautifully for multi-row tables; however, some of the PDF files have tables with only a single row. When trying to convert these PDFs, it returns an empty list. It makes sense that these files are problematic, since a single-row table is essentially just another line of text.
However, it is important that these PDFs are also converted into DataFrames, since they appear fairly frequently in my dataset. Unfortunately, the PDF files are proprietary, so I can't show them here. I'm hoping that this limitation does not prevent a solution from being found. Below is the line of code that does the conversion.
df = tabula.read_pdf(DIRECTORY + file_name, pages='all', pandas_options={'header': None}, encoding="utf-8")
I've attempted to solve this problem in a few ways. First, I tried getting an extra row inserted into the original PDF files at the source; unfortunately, this is impossible. I then tried the tips on the tabula-py website (https://tabula-py.readthedocs.io/en/latest/faq.html#i-got-a-empty-dataframe-how-can-i-resolve-it):
Set a specific area for accurate table detection.
Try the lattice=True option for tables with explicit ruling lines.
Try the stream=True option.
Following the first tip, I tried specifying an area using measurements taken in Adobe; this still returned an empty list. I tried the second and third tips, and these again returned an empty list.
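For reference, here is roughly what those attempts looked like (the area values below are placeholders; the real measurements came from Adobe):

# Attempt 1: explicit area (top, left, bottom, right, in points; placeholder values).
# guess=False so tabula-py does not override the supplied area.
df = tabula.read_pdf(DIRECTORY + file_name, pages='all', guess=False, area=[100, 50, 200, 550], pandas_options={'header': None}, encoding="utf-8")

# Attempt 2: lattice mode, for tables with explicit ruling lines
df = tabula.read_pdf(DIRECTORY + file_name, pages='all', lattice=True, pandas_options={'header': None}, encoding="utf-8")

# Attempt 3: stream mode
df = tabula.read_pdf(DIRECTORY + file_name, pages='all', stream=True, pandas_options={'header': None}, encoding="utf-8")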
So my question is: "Is there a way to get the tabula-py package to identify tables with only a single row in a PDF?"
I'm hoping that someone knows how to solve this problem. Thanks in advance for the effort.
Thanks for taking the time to read my question.
I am working on a personal project to learn Python scripting for Excel, and I want to learn how to move data from one workbook to another.
In this example, I am emulating a company employee ledger that has name, position, address, and more (the organization is by row, so every employee takes up one row). The project is to have a selected number of people transferred to a new ledger (another Excel file). So I have a list of emails in a .txt file (it could even be another Excel file, but I thought .txt would be easier), and I want the script to run through the .txt file, get the emails, and look for any rows with a matching email address (all emails are in column B). If any are found, it should copy that entire row to the new Excel file.
I tried a lot of ways to make this work, but I could not figure it out. I am really new to Python, so I am not even sure this is possible. I would really appreciate some help!
You have essentially two packages that allow manipulation of Excel files. For reading in data and performing analysis, the standard package is pandas. You can save files as .xlsx, but you are only really working with the base table data and not the file itself (i.e., you are extracting data FROM the file, not working WITH the file).
However, what you really need is to manipulate the Excel files directly, which is better done with openpyxl.
You can also read files (such as your text file) using the with open(...) construct, which is native to Python and not a third-party import like pandas or openpyxl.
Part of learning to program includes learning how to use documentation.
As such, here is the documentation you require with sufficient examples to learn openpyxl: https://openpyxl.readthedocs.io/en/stable/
And you can learn about pandas here: https://pandas.pydata.org/docs/user_guide/index.html
And you can learn about python with open here: https://docs.python.org/3/tutorial/inputoutput.html
Hope this helps.
EDIT: It's possible that I or another person can give you a specific example using your data / code, etc., but you would have to provide it fully. Since you're learning, I suggest using the documentation or YouTube.
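In the meantime, here is a minimal sketch of the approach (the file names are placeholders, and the assumption that emails sit in column B comes from your description; adjust as needed):

from openpyxl import Workbook, load_workbook

# Collect the wanted email addresses from the text file (placeholder filename).
with open('emails.txt') as f:
    wanted = {line.strip().lower() for line in f if line.strip()}

src = load_workbook('ledger.xlsx')   # the existing ledger (placeholder name)
ws = src.active

out = Workbook()                     # the new ledger
out_ws = out.active

for row in ws.iter_rows():
    email = row[1].value             # column B, as described in the question
    if email and str(email).strip().lower() in wanted:
        out_ws.append([cell.value for cell in row])  # copy the entire row

out.save('selected_employees.xlsx')  # placeholder name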
I need to extract tables from PDFs. These tables can be of any type: multiple headers, vertical headers, horizontal headers, etc.
I have implemented the basic use cases for both and found Tabula doing a bit better than Camelot, but it is still not able to detect all tables perfectly, and I am not sure whether it will work for all kinds of tables.
So I am seeking suggestions from experts who have implemented a similar use case.
Example PDFs: PDF1 PDF2 PDF3
Tabula Implementation:
import tabula

tab = tabula.read_pdf('pdfs/PDF1.pdf', pages='all')
for t in tab:
    print(t, "\n=========================\n")
Camelot Implementation:
import camelot

tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all', split_text=True)
for tabs in tables:
    print(tabs.df, "\n=================================\n")
Please read this: https://camelot-py.readthedocs.io/en/master/#why-camelot
The main advantage of Camelot is that the library is rich in parameters through which you can improve the extraction.
Obviously, applying these parameters requires some study and experimentation.
Here you can find a comparison of Camelot with other PDF table extraction libraries.
I think Camelot extracts data in a cleaner format, without jumbling it up (i.e., the data retains its information and row contents are not affected). So the quality of the extracted data is better when the number of lines per cell differs.
Note that Tabula requires a Java Runtime Environment.
There are both open-source (Tabula, pdf-table-extract) and closed-source (smallpdf, PDFTables) tools that are widely used to extract tables from PDF files. They either give a nice output or fail miserably. There is no in-between. This is not helpful, since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table.
Camelot was created to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!
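As an illustration of that tweaking, here is a sketch (the flavor choice and tolerance values are assumptions to experiment with, not known-good settings for these PDFs):

import camelot

# flavor='lattice' suits tables with ruling lines; flavor='stream' suits
# whitespace-separated tables. edge_tol and row_tol are stream-mode knobs;
# the values below are starting points to tune, not known-good settings.
tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all',
                          flavor='stream', edge_tol=500, row_tol=10)

for t in tables:
    print(t.parsing_report)  # per-table accuracy and whitespace metrics
    print(t.df)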
I'm really new to programming, and I've been trying to emulate the pandas.read_table code from the Python for Data Analysis book (the chapter on the MovieLens 1M dataset, around p. 23). Below are the link to the file used for the database and images of the Jupyter notebook in which I typed the code. As you'll see there, the data values are not being read properly, and I can't seem to figure out why. Your help will be much appreciated!
Trouble screen
Database file
If you are reading data from a .csv file, use pd.read_csv.
If you want to use pd.read_table, you have to specify the delimiter as the comma with the argument sep=','. What is happening is that pd.read_table is trying to split your input at every ::, but it looks like your data is separated by commas instead.
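For instance, a minimal sketch (the filename is a placeholder for your database file):

import pandas as pd

# Both of these read a comma-separated file; the filename is a placeholder.
df = pd.read_csv('movies.csv')
df = pd.read_table('movies.csv', sep=',')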
More information here:
http://pandas.pydata.org/pandas-docs/stable/io.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
I am using python-docx to convert a Word .docx file to a custom HTML equivalent. The document I need to convert has images and tables, but I haven't been able to figure out how to access the images and tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!
There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
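For reference, the approach in that issue looks roughly like this (a sketch adapted from the issue; the module paths match recent python-docx versions and may differ in older ones):

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """Yield each paragraph and table child of parent, in document order."""
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("expected a Document or _Cell")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)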
There is no exact counterpart I know of for inline items, but I expect you could get pretty far with paragraph.runs; all inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something, you could go down to the lxml level and decode some of the XML to get what you need. If you get that far along and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.
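Building on the iter_block_items() sketch above, such a recursive walk over nested tables could look like this (again just a sketch):

def walk_block_items(parent, depth=0):
    # Recurse into table cells, since paragraphs and tables nest inside them.
    for block in iter_block_items(parent):
        if isinstance(block, Paragraph):
            print('  ' * depth + block.text)
        elif isinstance(block, Table):
            for row in block.rows:
                for cell in row.cells:
                    walk_block_items(cell, depth + 1)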
Assuming doc is of type Document, what you want is three separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working is that paragraphs don't hold references to the tables or shapes in the document; those are stored on the Document object.
Here is the documentation for more info: python-docx
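A minimal sketch of those three iterations (the filename is a placeholder):

from docx import Document

doc = Document('example.docx')  # placeholder filename

for para in doc.paragraphs:
    print(para.text)

for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

for shape in doc.inline_shapes:
    print(shape.type, shape.width, shape.height)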
I'll start off by saying that I'm new to Python. I'm trying to create an application that runs a simple Q+A and exports the answers to specific cells of an Excel file. I have an existing spreadsheet that I would like to modify and save as a separate output file, leaving the original untouched. I've seen various ways to append to the file, but they all overwrite the original.
As an example, I would like this code:
hq = input('Headquarters: ')
to put the response in cell S1
Am I way off base trying to use Python for this task? Any help would be greatly appreciated!
-Paul
There may not be very straightforward solutions but there are a couple of tools which might help you.
The first one is openpyxl (https://openpyxl.readthedocs.org/en/2.0.2/#). If you have .xlsx files, you should be able to modify them with this.
You might also be able to do what you want by using the xlutils module: http://pythonhosted.org/xlutils/index.html However, you'll then need to first read the file, then edit it, and then save it to another file. Formatting may be lost, etc.
This is heavily YMMV due to the not-so-well-defined file format, but I'd start with openpyxl.
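A minimal sketch with openpyxl (file names are placeholders; loading the original and saving under a new name leaves the original untouched):

from openpyxl import load_workbook

hq = input('Headquarters: ')

wb = load_workbook('original.xlsx')   # placeholder filename
ws = wb.active
ws['S1'] = hq                         # put the response in cell S1

wb.save('modified_copy.xlsx')         # new name, so the original stays intact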