I am trying to extract text using PyMuPDF (fitz), following this tutorial: https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467
instead of
blocks = page.getText("dict")["blocks"]
I wrote
blocks = page.get_text("dict", sort=True)["blocks"]
according to https://pymupdf.readthedocs.io/en/latest/recipes-text.html
But still, the text is not in the order I expect. The first paragraph will appear in the middle.
This happens when a page has more than one column of text.
Using the sort argument is a good first step. But please note that PDF can address every single character separately, so any basic sorting approach can be defeated by the "right" counter-example PDF.
If a page contains n text characters, then there exist n! different ways to encode the page - all of them looking identical, but only one of them extracting the "natural" reading sequence right away.
If your page contains tables, or if the text is organized in multiple columns (as is customary in newspapers), then you must invest additional logic to cope with that.
If you use the PyMuPDF module, you can extract text in a layout preserving manner: python -m fitz gettext -mode layout ....
If you need to achieve a similar effect within your script, you may have to fall back on character-level text extraction, page.get_text("rawdict"), and use the returned character positions to bring the characters into the right sequence.
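For illustration, here is a minimal sketch of that character-level approach (the file name, and the crude grouping of characters into lines by rounding the vertical coordinate, are assumptions; real pages usually need smarter grouping):

import fitz  # PyMuPDF

page = fitz.open("example.pdf")[0]  # hypothetical file
chars = []  # will hold (sort key, character) entries
for block in page.get_text("rawdict")["blocks"]:
    if block["type"] != 0:  # 0 = text block, skip image blocks
        continue
    for line in block["lines"]:
        for span in line["spans"]:
            for ch in span["chars"]:
                x0, y0, x1, y1 = ch["bbox"]
                # sort by (rounded) bottom coordinate first, then left coordinate
                chars.append((round(y1), x0, ch["c"]))

chars.sort()
text = "".join(c for _, _, c in chars)
print(text)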
BTW the sort parameter causes the text blocks to be sorted ascending by (1) the vertical, then (2) the horizontal coordinate of their bounding boxes. So if, on a multi-column page, the second column starts at a slightly smaller y-coordinate (slightly higher on the page), its blocks will come before those of the first column. To handle such a case you must build specialized code around this knowledge.
Assuming you now have a two-column page, the following code snippet might be used:
width2 = page.rect.width / 2 # half of the page width
left = page.rect + (0, 0, -width2, 0) # the left half page
right = page.rect + (width2, 0, 0, 0) # the right half page
# now extract the two halves separately:
lblocks = page.get_text("dict", clip=left, sort=True)["blocks"]
rblocks = page.get_text("dict", clip=right, sort=True)["blocks"]
blocks = lblocks + rblocks
# now process 'blocks'
...
I'm currently trying to extract information from lots of PDF forms such as this:
The text 'female' should be extracted here. So, contrary to my title, I'm actually trying to extract text without strikethrough rather than text with strikethrough. But if I can identify which words have strikethrough, I can easily identify the inverse.
Gaining inspiration from this post, I came up with this code:
import os
import glob
from pdf2docx import parse
from docx import Document

lst = []
files = glob.glob(os.getcwd() + r'\PDFs\*.pdf')
for i in range(len(files)):
    filename = files[i].split('\\')[-1].split('.')[-2]
    parse(files[i])
    document = Document(os.getcwd() + rf'\PDFs\{filename}.docx')
    for p in document.paragraphs:
        for run in p.runs:
            if run.font.strike:
                lst.append(run.text)
    os.remove(os.getcwd() + rf'\PDFs\{filename}.docx')
What the above code does is convert all my PDF files into Word documents (docx), search the Word documents for text with strikethrough, extract that text, and then delete the Word documents.
As you may have rightfully suspected, this set of code is very slow and inefficient, taking about 30s to run on my sample set of 4 PDFs with less than 10 pages combined.
I don't believe this is the best way to do this. However, when I did some research online, I found that pdf2docx extracts data from PDFs using PyMuPDF, yet PyMuPDF does not come with the capability to recognise strikethrough in PDF text. How can that be, when pdf2docx can perfectly convert strikethroughs in PDFs into a docx document, which indicates the strikethroughs are recognised at some level?
All in all, I would like to seek advice on whether or not it is possible to extract text with strikethroughs in PDF using Python. Thank you!
If these strikethroughs in fact are annotations, PyMuPDF offers a simple and extremely fast solution:
On a page make a list of all strikethrough annotation rectangles and extract the text "underneath" them.
Or, similarly, take the keywords you are interested in (like "male", "female") and check whether any of them is covered by a strike-out annotation.
# strike-out annotation rectangles
st_rects = [a.rect for a in page.annots(types=[fitz.PDF_ANNOT_STRIKE_OUT])]
words = page.get_text("words")  # the words on the page

for rect in st_rects:
    for w in words:
        wrect = fitz.Rect(w[:4])  # rect of the word
        wtext = w[4]  # word text
        if wrect.intersects(rect):
            print(f"{wtext} is struck out")
# the above checks if a word area intersects a strike-out rect;
# because strike-out rectangles are often defined sloppily, this is the safest way

# alternatively, simpler:
for rect in st_rects:
    print(page.get_textbox(rect + (-5, -5, 5, 5)), "is struck out")
# here I have increased the strike-out rect by 5 points in every direction
# in the hope of covering the respective text
Another case is PDF drawings, so-called "line art". These are not annotations (which can be removed) but things like lines, curves and rectangles that are permanently stored in the page's rendering code (the /Contents objects).
PyMuPDF also lets you extract this line art. If your text is struck out this way, then there will be overlaps between text rectangles and line-art rectangles.
Office software (MS Word, LibreOffice) usually uses thin rectangles instead of true lines to better cope with zoomed displays -- so to catch all those cases, you must select both horizontal lines and rectangles whose height is small in absolute terms and much smaller than their width.
Here is code that extracts those horizontal lines and "pseudo-lines" from a page:
lines = []  # to be filled with horizontal "lines": thin rectangles
paths = page.get_drawings()  # list of drawing dictionary objects
for path in paths:  # dictionary with single draw commands
    for item in path["items"]:  # check item types
        if item[0] in ("c", "qu"):  # skip curves and quads
            continue
        if item[0] == "l":  # a true line
            p1, p2 = item[1:]  # start / stop points
            if p1.y != p2.y:  # skip non-horizontal lines
                continue
            # make a thin rectangle of height 2
            rect = fitz.Rect(p1.x, p1.y - 1, p2.x, p2.y + 1)
            lines.append(rect)
        elif item[0] == "re":  # a rectangle, check if roughly a horizontal line
            rect = item[1]  # the item's rectangle
            if rect.width <= 2 * rect.height or rect.height > 4:
                continue  # not a pseudo-line
            lines.append(rect)
Now you can use these line rectangles to check any intersections with text rectangles.
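For example, a minimal sketch of that check, reusing page, the fitz import and the lines list from the snippet above:

words = page.get_text("words")  # (x0, y0, x1, y1, word, block_no, line_no, word_no)
struck = []
for w in words:
    wrect = fitz.Rect(w[:4])  # the word's rectangle
    if any(wrect.intersects(line) for line in lines):
        struck.append(w[4])  # the word's text

print("words crossed by line art:", struck)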
Disclaimer: I am the author of borb, the library suggested in this answer
Ultimately, the exact code will end up varying depending on how strikethrough is implemented in your PDF. Allow me to clarify:
A PDF document (typically) has no notion of structure. So while we may see a paragraph of text, made up of several lines of text, a PDF (for the most part) just contains rendering instructions.
Things like:
Go to X, Y
Set the current font to Helvetica-Bold
Set the current color to black
Draw the letter "H"
Go to X, Y (moving slightly to the right this time)
Draw the letter "e"
etc
So in all likelihood, the text that is striked through is not marked as such in any meaningful way.
I think there are a few options:
PDF has the concept of annotations. These are typically pieces of content that are added on top of a page. These can be extra text, geometric figures, etc. There is a specific annotation for strikethrough.
It might be an annotation that is not a strikethrough annotation, but a geometric figure (in this case a line) that simply appears over the text.
It might be a drawing instruction (inside the page content stream that is) that simply renders a black line over the text.
Your PDF might contain one (or more) of these, depending on which software initially created the strikethrough.
You can identify all of these using borb.
What I would do (in pseudo-code):
Extend SimpleTextExtraction (this is the main class in borb that deals with extracting text from a PDF)
Whenever this class sees an event (this is typically the parser having finished a particular instruction) you can check whether you saw a text-rendering instruction, or a line-drawing instruction. Keep track of text, and keep track of lines (in particular their bounding boxes).
When you have finished processing all events on a page, get all the annotations from the page, and filter out strikethrough annotations. Keep track of their bounding boxes.
From the list of TextRenderEvent objects, filter out those whose bounding box overlaps with: either a line, or a strikethrough bounding box
Copy the base algorithm for rebuilding text from these events
I'm building a table in reportlab and want some cells in a table to be displayed in the format "Example: some text", with part of the cell being bolded and the rest not. I'm wrapping each cell in Paragraph to allow for wrapping when lines are too long, but this doesn't provide a neat way to apply formatting to only part of the cell's contents. These are the unideal things I have tried:
Use XML to apply the formatting to the first part of the cell's content, concatenate the string with the second part, and wrap the whole thing in Paragraph; this is currently what I'm doing, and while it technically works, it isn't the prettiest code to look at, especially when working as part of the rest of my script. Relevant code (let me know if it needs more context):
cellData = (example, someText)
cellBold = "".join(("<b>", cellData[0], "</b>", cellData[1]))
tableRow.append(Paragraph(cellBold, normalStyle))
Display both Paragraphs in the same cell; I tried this in a similar manner described in the answer to this similar question, but doing so displayed the two Paragraphs on separate lines, instead of on the same line. This would work perfectly if there was a way to remove the line break at the end of a Paragraph, but I don't think there is. Relevant code:
cellData = (example, someText)
tableRow.append([Paragraph(cellData[0], boldStyle), Paragraph(cellData[1], normalStyle)])
Two other solutions would be to be able to apply formatting to part of a Paragraph or to concatenate two Paragraphs, but I don't think these are possible in reportlab. Is there a way to neatly accomplish what I want, or do I have to stick with the code mess of using XML formatting on the strings themselves? I'm using the doc.build method of building my PDF, if that's relevant.
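For reference, a small sketch of how the markup-based approach from the first bullet could at least be hidden behind a helper, reusing the variables from the snippets above (the helper name boldPrefixCell is just illustrative):

from xml.sax.saxutils import escape
from reportlab.platypus import Paragraph

def boldPrefixCell(prefix, rest, style):
    # build a Paragraph whose leading part is bold, e.g. "<b>Example:</b> some text"
    return Paragraph("<b>%s</b> %s" % (escape(prefix), escape(rest)), style)

# usage inside the row-building loop
tableRow.append(boldPrefixCell(example, someText, normalStyle))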
I am looking to perform text replacements in a shape's text. I am using code similar to the snippet below:
# define key/value
SRKeys, SRVals = ['x', 'y', 'z'], [1, 2, 3]

# define text
text = shape.text

# iterate through values and perform subs
for i in range(len(SRKeys)):
    # replace text
    text = text.replace(SRKeys[i], str(SRVals[i]))

# write text subs to comment box
shape.text = text
However, if the initial shape.text has formatted characters (bolded for example), the formatting is removed on the read. Is there a solution for this?
The only thing I could think of is to iterate over the characters and check for formatting, then add these formats before writing to shape.text.
@usr2564301 is on the right track. Character formatting (a.k.a. "font") is specified at the run level. This is what a run is: a "run" (sequence) of characters all sharing the same character formatting.
When you assign to shape.text you replace all the runs that used to be there with a single new run having default formatting. If you want to preserve formatting you need to preserve whatever runs are not directly involved in the text replacement.
This is not a trivial problem because there is no guarantee runs break on word boundaries. Try printing out the runs for a few paragraphs and I think you'll see what I mean.
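A quick throwaway sketch to do just that (the file name is made up):

from pptx import Presentation

prs = Presentation("test.pptx")  # hypothetical file name
for shape in prs.slides[0].shapes:
    if not shape.has_text_frame:
        continue
    for para in shape.text_frame.paragraphs:
        # each paragraph's text, split the way PowerPoint stored it in runs
        print([run.text for run in para.runs])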
In rough pseudocode, I think this is the approach you would need to take:
Do your search for the target text in the paragraph to determine the offset of its first character.
Traverse all the runs in the paragraph, keeping a running total of how many characters come before each run, maybe something like (run_idx, prefix_len, length): (0, 0, 8), (1, 8, 4), (2, 12, 9), etc. (see the sketch below).
Identify the starting run, the ending run, and any in-between runs involved in your search string.
Split the first run at the start of the search term, split the last run at the end of the search term, and delete all but the first of the "middle" runs.
Change the text of that remaining middle run to the replacement text and clone the formatting from the prior (original starting) run. You might do this last bit at the time you split the starting run.
This preserves any runs that do not involve the search string and preserves the formatting of the "matched" word in the "replaced" word.
This requires a few operations that are not directly supported by the current API. For those you'd need to use lower-level lxml calls to directly manipulate the XML, although you could get hold of all the existing elements you need from python-pptx objects without ever having to parse in the XML yourself.
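As an illustration of the bookkeeping in steps 2 and 3, here is a minimal sketch (names are just for illustration) that computes the (run_idx, prefix_len, length) tuples for a paragraph and finds which runs a match touches:

def run_offsets(paragraph):
    # yield (run_idx, prefix_len, length) for each run in the paragraph
    prefix_len = 0
    for idx, run in enumerate(paragraph.runs):
        yield idx, prefix_len, len(run.text)
        prefix_len += len(run.text)

def runs_touched(paragraph, target):
    # return the indices of the runs that contain any part of `target`
    full_text = "".join(r.text for r in paragraph.runs)
    start = full_text.find(target)
    if start == -1:
        return []
    end = start + len(target)
    return [idx for idx, prefix, length in run_offsets(paragraph)
            if prefix < end and prefix + length > start]

The splitting, deleting and re-formatting of those runs still needs the lower-level lxml work described above.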
Here is an adapted version of the code I'm using (inspired by @scanny's answer). It replaces text in all shapes (with a text frame) on a slide.
from pptx import Presentation

prs = Presentation('../../test.pptx')
slide = prs.slides[1]

# iterate through all shapes on slide
for shape in slide.shapes:
    if not shape.has_text_frame:
        continue
    # iterate through paragraphs in shape
    for p in shape.text_frame.paragraphs:
        # store formats and their runs by index (not dict because of duplicate runs)
        formats, newRuns = [], []
        # iterate through runs
        for r in p.runs:
            # get text
            text = r.text
            # replace text
            text = text.replace('s', 'xyz')
            # store run
            newRuns.append(text)
            # store format
            formats.append({'size': r.font.size,
                            'bold': r.font.bold,
                            'underline': r.font.underline,
                            'italic': r.font.italic})
        # clear paragraph
        p.clear()
        # iterate through new runs and formats and write to paragraph
        for i in range(len(newRuns)):
            # add run with text
            run = p.add_run()
            run.text = newRuns[i]
            # format run
            run.font.bold = formats[i]['bold']
            run.font.italic = formats[i]['italic']
            run.font.size = formats[i]['size']
            run.font.underline = formats[i]['underline']

prs.save('../../test.pptx')
I have a Microsoft Word document which we want to transfer to Excel. Every sentence needs to be separated and then pasted into the next appropriate cell in Excel. These sentences also need to be classified as a heading, requirement, or informational.
I will recreate what the typical Word format looks like:
2.3.4 Lightning Transient Response
The device shall meet spec 24532. Voltage must resemble figure.
Figure 1.
which translates to
<numbering> <Heading>
<Requirements/information>
In Excel, that is almost exactly how I would like the document to look, except each requirement sentence should be in the row just below the previous one.
2.3.4 | Lightning Transient Response       | Heading
      | The device shall meet spec 24532.  | Requirement
      | Voltage must resemble figure.      | Requirement
      | Figure 1                           | Informational
I have attempted this project with Python, using the openpyxl and python-docx modules. I have code that can go into Word and get sentences, and code that can analyze each sentence. I'm retrieving runs from paragraphs. I am having problems because not all sentences are coming back, due to how the Word document is formatted; I typically only get the headings back. The heading numbers are not stored in runs. The requirements underneath the headings are stored in tables. I have written some code to get into the tables and extract the text from cells, so that is one way to get the requirements, but that snippet of code is giving me problems (it returns the same sentence three times in a row).
I'm looking for other possible ways to do this. I'm considering a format switch: XML has been mentioned, and converting to PDF and using one of Python's PDF modules may also be possible.
Any thoughts or advice would be greatly appreciated.
-Chris
XML is going to be harder, not easier. You're closer than you seem to think. I recommend attacking each problem separately until you crack it.
The sentence-three-times problem in the table is because of merged cells. The way python-docx works on tables, there is an underlying table layout of x rows and y columns. If two side-by-side cells are merged, you get the same result for both of those cells. You can detect this by comparing the two cells for equality, roughly like "if this_cell == last_cell, skip this cell".
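A minimal sketch of that skip with python-docx (the file name is made up; comparing the underlying _tc elements is one way to detect that two cell objects refer to the same merged cell):

from docx import Document

doc = Document("requirements.docx")  # hypothetical file name
for table in doc.tables:
    for row in table.rows:
        prior_tc = None
        for cell in row.cells:
            if cell._tc is prior_tc:  # same underlying element -> merged-cell repeat
                continue
            prior_tc = cell._tc
            print(cell.text)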
There's no way around the heading problem. Heading numbers only exist inside a running instance of Word; they are generated at display (or print) time. To get those you need to apply the same rules to generate your own numbers. So you'd need to keep track of how many headings you've passed through at each level and form your own dot-separated numbering.
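A rough sketch of generating such numbers yourself from the built-in heading styles (this assumes the standard "Heading 1", "Heading 2", ... style names and ignores custom numbering definitions):

from docx import Document

doc = Document("requirements.docx")  # hypothetical file name
counters = [0] * 9  # one counter per heading level

for para in doc.paragraphs:
    style_name = para.style.name
    if style_name.startswith("Heading "):
        level = int(style_name.split()[-1])  # "Heading 2" -> 2
        counters[level - 1] += 1
        counters[level:] = [0] * (9 - level)  # reset deeper levels
        number = ".".join(str(c) for c in counters[:level])
        print(number, para.text)  # e.g. "2.3.4 Lightning Transient Response"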
Why are you using Python for this? Just use VBA, since you are working with Excel and Word.
Something like this should get you pretty close to where you want to be. It may need some tweaking...
Sub Demo()
    Dim wdApp As Word.Application
    Set wdApp = Word.Application
    Dim wdDoc As Word.Document
    Set wdDoc = wdApp.ActiveDocument
    wdDoc.Range.Copy
    ActiveSheet.Paste Destination:=ActiveSheet.Range("A1")
    With ActiveSheet
        .Paste Destination:=Range("A" & .Cells.SpecialCells(xlCellTypeLastCell).Row + 1)
    End With
    Set myRange = Range("A1:A100")
    For i = 1 To myRange.Rows.Count
        If InStr(myRange.Cells(i, "A").Value, "Voltage") > 0 Then
            myRange.Cells(i, "A").Offset(1, 0).Select
            ActiveCell.EntireRow.Insert
            ActiveCell.Offset(-1, 0).Select
            If InStr(myRange.Cells(i, "A").Value, "Voltage") > 0 Then
                position1 = InStr(1, ActiveCell.Value, "Voltage")
                myRange.Cells(i + 1, "A").Value = Mid(ActiveCell.Value, position1, 99)
                ActiveCell.Value = Left(ActiveCell.Value, position1 - 2)
                i = i + 2
            End If
        End If
    Next i
End Sub
So, copy the text from your Word doc, which should be open and active, and you're good to go. There are other ways to do this too.
How can I invert (rotate 180 degrees) a text object so that the text is kerned appropriately?
My example uses Python and the svgwrite package, but my question seems about any SVG.
Suppose I use the following code:
dwg = svgwrite.Drawing()
dwg.add(dwg.text(fullName, (int(width/2.),gnameHeight),
font_size=gnameFontSize, text_anchor="middle"))
The above code generates text looking like this:
dwg.text() objects accept a rotate parameter that is applied to all characters in a text string, so I've used the following code to reverse the string first:
pcRotate = [180]
ngap = 1
revFullName = fullName
rcl = []
for c in revFullName:
    rcl.append(c)
    for i in range(ngap):
        rcl.append(' ')
rcl.reverse()
revFullName = ''.join(rcl)
dwg.add(dwg.text(revFullName, (int(width/2.), pcnameHeight),
                 font_size=gnameFontSize, text_anchor="middle", rotate=pcRotate))
But, this produces the very ugly version below:
and this is using an artificial space gap between characters to make it slightly less unreadable.
What's the best way to tap into whatever kerning is being used by standard text in this inverted situation?
The rotate attribute of a <text> element is intended for situations where you want to rotate individual characters. If you want to rotate the whole text object then you should be using a transform instead.
http://pythonhosted.org/svgwrite/classes/mixins.html#transform-mixin
I'm posting this as a self-answer, only to make the formatting clearer. Two useful hints from @paul-lebeau are happily acknowledged.
While the svgwrite package seems solid, its documentation is a bit thin. The two things I wish it had said:
The rotate attribute of a <text> element is intended for situations where you want to rotate individual characters. If you want to rotate the whole text object, then you should be using a transform mixin instead.
If you need to center the transformed text with respect to some center (other than the default current user coordinate system), add two additional parameters xctr,yctr. This differs from the doc, which calls for a single center argument that is a 2-tuple.
The correct code is:
pcRotate = 'rotate(180,%s,%s)' % (int(width/2.), pcnameHeight)
textGroup = svgwrite.container.Group(transform=pcRotate)
textGroup.add(dwg.text(fullName, (int(width/2.), pcnameHeight),
                       font_size=gnameFontSize, text_anchor="middle"))
dwg.add(textGroup)