Unable to contain text in tabulate cell when changing table format


I'm using tabulate to create some nice tables for a lab, though I'm running into an issue. I have a "comments" section where the tables include, you guessed it, comments. Only thing is when I apply tablefmt="fancy_grid", the text wraps all the way through and doesn't stay contained in the cell. Even without using "fancy_grid", I had to limit the word count, because it would go over the cell width and break the table. I'll include a picture below.
My question: is there a way to apply a class that would contain the text and make it wrap within its cell, rather than across the entire table width? Nothing on the PyPI tabulate page shows a solution to this. You can format columns, but I didn't see anything related to this specific issue. I know about using p{width} in LaTeX, but I decided to go with tabulate from the start instead, as I'm not a big fan of the LaTeX notation.
Regular table with contained cell content:
[image: properly structured table]
Broken table due to text overflow:
[image: improperly structured table]
Cheers.
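
One way to keep long comments inside their cell is to wrap the text yourself before handing the rows to tabulate, since fancy_grid draws multi-line cells when a value contains newlines. The sketch below is a minimal example under stated assumptions: wrap_column, render, the 30-character width and the sample rows are all made up for illustration. Only textwrap.fill and tabulate(..., tablefmt="fancy_grid") are real library calls. Recent tabulate releases also appear to accept a maxcolwidths argument that does the wrapping for you, so it is worth checking the README for your installed version.

```python
import textwrap


def wrap_column(rows, column, width=30):
    """Return a copy of rows with the text in one column wrapped to `width` characters."""
    wrapped = []
    for row in rows:
        row = list(row)  # copy, so the caller's rows are left untouched
        row[column] = textwrap.fill(str(row[column]), width=width)
        wrapped.append(row)
    return wrapped


def render(rows, headers):
    """Draw the table; cells containing newlines become multi-line cells."""
    # Imported here so wrap_column can be used without tabulate installed.
    from tabulate import tabulate
    return tabulate(rows, headers=headers, tablefmt="fancy_grid")


# Hypothetical lab data, for illustration only.
rows = [
    ["Trial 1", 9.81, "Reading was stable; the sensor was recalibrated before the run started."],
    ["Trial 2", 9.79, "Slight drift."],
]
wrapped = wrap_column(rows, column=2, width=30)
```

Calling print(render(wrapped, headers=["Run", "g", "Comments"])) should then draw the comments column about 30 characters wide, with the long comment spread over several lines inside its own cell.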

Related

Tabula Python Package: Reading a pdf with a single row

Using the tabula package for python, I am trying to extract tables from multiple pdf files. This works beautifully for multi-rowed tables, however, some of the pdf files have tables with only a single row. When trying to convert these pdfs, it returns an empty list. It makes sense that these files are problematic since a single-rowed table is essentially just another line of text.
However, it is important that these pdfs are also converted into DataFrames since they appear fairly frequently in my dataset. Unfortunately, the pdf files are proprietary so I can't show them here. I'm hoping that this limitation does not prohibit a solution from being found. Below is the line of code that does the conversion.
df = tabula.read_pdf(DIRECTORY + file_name, pages = 'all', pandas_options={'header': None}, encoding="utf-8")
I've attempted to solve this problem in a few ways. First, I tried inserting an extra row in the original pdf files at the source, but this turned out to be impossible. Then I tried the tips on the tabula-py website (https://tabula-py.readthedocs.io/en/latest/faq.html#i-got-a-empty-dataframe-how-can-i-resolve-it):
Set a specific area for accurate table detection.
Try lattice = True option for the table having explicit line.
Try stream = True option
Following the first tip, I tried specifying an area using measurements taken in Adobe. This still returned an empty DataFrame. I tried the second and third tips and this again returned an empty list.
So the question I have is: "Is there a way to let the tabula-py package identify tables with only a single row from a pdf?"
I'm hoping that someone knows how to solve this problem. Thanks in advance for the effort.
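
One thing that may be worth ruling out: as far as I know, tabula's area option expects (top, left, bottom, right) in PDF points (72 per inch), so measurements read off in Adobe in inches need converting first. The sketch below assumes the Adobe measurements were in inches. area_in_points and read_fixed_area are hypothetical helper names, while area, guess, stream, pages and pandas_options are existing tabula.read_pdf options. Passing guess=False with a fixed area stops tabula from trying to detect the table region itself. Whether this is enough to pick up a single-row table is not guaranteed.

```python
POINTS_PER_INCH = 72.0


def area_in_points(top, left, bottom, right):
    """Convert a (top, left, bottom, right) box measured in inches to PDF points."""
    return [round(edge * POINTS_PER_INCH, 2) for edge in (top, left, bottom, right)]


def read_fixed_area(pdf_path, area):
    """Read every page of pdf_path, looking only inside `area` (given in points)."""
    # Imported here because tabula-py needs a Java runtime to be available.
    import tabula
    return tabula.read_pdf(
        pdf_path,
        pages="all",
        area=area,
        guess=False,   # do not auto-detect the table region
        stream=True,   # split columns on whitespace rather than ruled lines
        pandas_options={"header": None},
    )


# Hypothetical box: 1 inch from the top, 0.5 inch from the left, and so on.
area = area_in_points(1, 0.5, 2, 8)
```

With that in place, read_fixed_area(DIRECTORY + file_name, area) would return a list of DataFrames, the same as the original call.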

Deleting pages in a word document using python-docx

I have a .docx template I use containing tables with placeholder names that I replace with data from a .xlsx file using openpyxl and python-docx. For example I have a "tool" placeholder that is replaced with the contents of cell A1 in the .xlsx.
It's a dumb workaround, but that's life when I can't get the business to use .xlsx to begin with.
The template is 126 pages, and I use anything from 1-126 depending on the part being documented. Currently I remove the unused pages manually. Is there a way to remove for example pages 10 through 126 using python?
All the pages are identical to start with, so another solution would be to generate the correct number of pages at the beginning. I've tried various ways of doing that, but can never get the logo picture to copy correctly.

Modifying and creating xlsx files with Python, specifically formatting single words of, e.g., a sentence in a cell

I'm working a lot with Excel xlsx files which I convert using Python 3 into Pandas dataframes, wrangle the data using Pandas and finally write the modified data into xlsx files again.
The files also contain text data, which may be formatted. While most of the modifications I have made have been pretty straightforward, I run into problems with partially formatted text within a single cell:
Example of cell content: "Medical device whith remote control and a Bluetooth module for communication"
The formatting in the example is bold and italic but may also be a color.
So, I have two questions:
Is there a way of preserving such formatting in xlsx files when importing the file into a Python environment?
Is there a way of creating/modifying such formatting using a specific python library?
So far I have been using Pandas, OpenPyxl, and XlsxWriter but have not succeeded yet. So I shall appreciate your help!
As pointed out below in a comment and the linked question OpenPyxl does not allow for this kind of formatting:
Any other ideas on how to tackle my task?

I have recently been working with openpyxl. Generally, if a whole cell has the same style (font/color), you can get the style from cell.font: cell.font.b means bold, cell.font.i means italic, and cell.font.color contains the color object.
But if the style varies within one cell, this cannot help. There is only some minor indication in cell.value.

IPython notebook read string from raw text cell

I have a raw text cell in my IPython notebook project.
Is there a way to get the text as a string with a built-in function or something similar?

My (possibly unsatisfactory) answer is in two parts. This is based on a personal investigation of iPython structures, and it's entirely possible I've missed something that more directly answers the question.
Current session
The raw text for code cells entered during the current session is available within a notebook using the list In.
So the raw text of the current cell can be returned by the following expression within the cell:
In[len(In)-1]
For example, evaluating a cell containing this code:
print "hello world"
three = 1+2
In[len(In)-1]
yields this corresponding Out[] value:
u'print "hello world"\nthree = 1+2\nIn[len(In)-1]'
So, within an active notebook session, you can access the raw text of a cell as In[n], where n is the displayed index of the required cell.
But if the cell was entered during a previous notebook session, which has subsequently been closed and reopened, that no longer works.
Also, only code cells seem to be included in the In list, so this wouldn't work for a raw text cell.
Cells from saved notebook sessions
In my research, the only way I could uncover to get the raw text from previous sessions was to read the original notebook file. There is a documentation page Importing IPython Notebooks as Modules describing how to do this. The key code is in In[4]:
# load the notebook object
with io.open(path, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')
where current is an instance of the API described at Module: nbformat.current.
The notebook object returned is accessed as a nested dictionary and list structure, e.g.:
for cell in nb.worksheets[0].cells:
    ...
The cell objects thus enumerated have two key fields for the purpose of this question:
cell.cell_type is the type of the cell ("code", "markdown", "raw", etc.).
cell.input is the raw text content of the cell as a list of strings, with an entry for each line of text.
Much of this can be seen by looking at the JSON data that constitutes a saved iPython notebook.
Apart from the "prompt number" fields in a notebook, which seem to change whenever the field is re-evaluated, I could find no way to create a stable reference to a notebook cell.
Conclusion
I couldn't find an easy answer to the original question. What I found is covered above. Without knowing the motivation behind the original question, I can't know if it's enough.
What I looked for, but was unable to identify, was a way to reference the current notebook that can be used from within the notebook itself (e.g. via a function like get_ipython()). That doesn't mean it doesn't exist.
The other missing piece in my response is any kind of stable way to refer to a specific cell. Looking at the notebook file format, raw text cells consist solely of a cell type ("raw") and the raw text itself, though it appears that cell metadata might also be included. This suggests the only way to directly reference a cell is through its position in the notebook, but that is subject to change when the notebook is edited.
(Researched and answered as part of the Oxford participation in http://aaronswartzhackathon.org)

I am not allowed to comment due to my lack of reputation, so I will post an update to Graham Klyne's answer as an answer of its own, in case someone else stumbles onto this. (IPython has no updated documentation on this to date.)
Use nbformat instead of IPython.nbformat.current.
The worksheets attribute is gone, so use cells directly.
I have an example of what the updated code looks like:
https://github.com/ldiary/marigoso/blob/master/marigoso/NotebookImport.py
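
For reference, here is a minimal sketch of reading raw cells straight from a saved notebook. It assumes a version-4 .ipynb file, which is plain JSON with a top-level "cells" list where each cell has "cell_type" and "source". raw_cell_texts is a hypothetical helper name, not part of any library. It uses only the standard json module. nbformat.read(path, as_version=4) should give the same data with attribute access (nb.cells, cell.cell_type, cell.source).

```python
import json


def raw_cell_texts(notebook_path):
    """Return the text of every raw cell in a version-4 .ipynb file, in document order."""
    with open(notebook_path, encoding="utf-8") as f:
        nb = json.load(f)

    texts = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "raw":
            source = cell.get("source", "")
            # On disk, "source" may be one string or a list of line strings.
            texts.append(source if isinstance(source, str) else "".join(source))
    return texts
```

This reads whatever was last saved to disk, so unsaved edits in the running notebook will not show up, and cells are still identified only by their position.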

How to iterate over everything in a python-docx document?

I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
    for run in para.runs:
        # How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!

There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of to deal with inline items, but I expect you could get pretty far with paragraph.runs. All inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something, you could go down to the lxml level and decode some of the XML to get what you needed. If you get that far along and are still keen, post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get you what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.

Assuming doc is of type Document, then what you want to do is have 3 separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working is that paragraphs don't hold references to the tables or shapes in the document; those are stored on the Document object.
Here is the documentation for more info: python-docx
