How to use Python Fitz detect Hyphen when using search_for? - python

I'm new to the Fitz library and am working on a project where I need to find a string in a PDF page. I'm running into a case where the text on the page that I'm searching on is hyphenated. I am aware of the TEXT_DEHYPHENATE flag that I can use in the search for function, but that doesn't work for me (as shown in the image here https://postimg.cc/zHZPdd6v ). I'm getting no cases when I search for the hyphenated string.
Python Script
LOC = "./test.pdf"
doc = fitz.open(LOC)
page = doc[1]
print(page.get_text())
found = page.search_for("lowcost", flags=TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low-cost", flags=TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low cost", flags=TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
for rect in found:
print(rect)
Output
Abstract
The objective of “XXXXXXXXXXXXXXXXXX” was design and assemble a low-
cost and efficient tool.
DONE
0
DONE
0
DONE
0
Can someone please point me to how I might be able to detect the hyphen in my file? Thank you!

Your first approach should work, look here:
# insert some hyphenated text
page.insert_textbox((100,100,300,300),"The objective of 'xxx' was design and assemble a low-\ncost and efficient tool.")
157.94699853658676
# now search for it again
page.search_for("lowcost") # 2 rectangles!
[Rect(159.3009796142578, 116.24800109863281, 175.8009796142578, 131.36199951171875),
Rect(100.0, 132.49501037597656, 120.17399597167969, 147.6090087890625)]
# each containing a text portion with hyphen removed
for rect in page.search_for("lowcost"):
print(page.get_textbox(rect))
low
cost
Without the original file there is no way to tell the reason for your failure.
Are you sure there really is text - and not e.g. an image or other hickups?
Edited: As per the comment of user #KJ below: PyMuPDF's C base library MuPDF regards all of the unicodes '-', 0xAD, 0x2010, 0x2011 as hyphens in this context. They all should work the same. Just reconfirmed it in an example.

Related

Python's ElementTree, how to create links in a paragraph

I have a website I'm building running off Python 2.7 and using ElementTree to build the HTML on the fly. I have no problem creating the elements and appending them to the tree. It's where I have to insert links in the middle of a large paragraph that I am stumped. This is easy when it's done in text, but this is doing it via XML. Here's what I mean:
Sample text:
lawLine = "..., a vessel as defined in Section 21 of the Harbors and Navigation Code which is inhabited and designed for habitation, an inhabited floating home as defined in subdivision (d) of Section 18075.55 of the Health and Safety Code, ..."
To add that text to the HTML as H4-style text, I typically use:
h4 = ET.Element('h4')
htmlTree.append(h4)
h4.text = lawLine
I need to add links at the word "Section" and the numbers associated with it, but I can't simply create a new element "a" in the middle of a paragraph and add it to the HTML tree, so I'm trying to build that piece as text, then do ET.fromstring and append it to the tree:
thisLawType = 'PC'
matches = re.findall(r'Section [0-9.]*', lawLine)
if matches:
lawLine = """<h4>{0}</h4>""".format(lawLine)
for thisMatch in matches:
thisMatchLinked = """{2}""".format(thisLawType, thisMatch.replace('Section ',''), thisMatch)
lawLine = lawLine.replace(thisMatch, thisMatchLinked)
htmlBody.append(ET.fromstring(lawLine))
I am getting "xml.etree.ElementTree.ParseError: not well-formed" errors when I do ET.fromstring. Is there a better way to do this in ElementTree? I'm sure there are better extensions out there, but my work environment is limited to Python 2.7 and the standard library. Any help would be appreciated. Thanks!
Evan
The xml you are generating is indeed not well formed, because of the presence of & in thisMatchLinked. It's one of the special charcters which need to be escaped (see an interesting explanation here).
So try replacing & with & and see if it works.

How do you insert an image into a specific location in an existing Word document with Python?

What I want to do is insert an image into a specific location in an existing Word document using Python. I've looked at various libraries to do this; I'm using the docx-mailmerge package to insert text and tables using Word merge fields, but unfortunately image merging is just a TODO/wishlist feature. python-docx meanwhile allows image insertion, but only at the end of a document, not in specific places.
Is there another library that does this, or a good trick to accomplish it?
Fiddling around with the underlying API (and thanks to this SO answer) I hacked my way to success:
add a placeholder in your Word document where the image should go, something like a single line that says [ChartImage1]
find the paragraph object in the document that contains that text
replace the text of that paragraph with an empty string
add a run, and inside that add your image
So something like:
document = Document("template.docx")
image_paras = [i for i, p in enumerate(document.paragraphs) if "[ChartImage1]" in p.text]
p = document.paragraphs[image_paras[0]]
p.text = ""
r = p.add_run()
r.add_picture("path/to/image.png")
document.save("my_doc.docx")

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there anyway I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
i dont have much information but i will try to help with what i got im assuming that URL= is part of the string in that case you can do this
re.findall(r'URL=(.*?).', STRINGNAMEHERE)
Let me go more into detail about (.*?) the dot means Any character (except newline character) the star means zero or more occurences and the ? is hard to explain but heres an example from the docs "Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’." the brackets place it all into a group. All this togethear basicallly means it will find everything inbettween URL= and .
You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course that this is just going to extract one URL. Assuming that you're working with a large text, where you'd want to extract all URLs, you'll want to put this logic into a function so that you can reuse it, and build around it (achieve iteration via the while or for loops, and, depending on how you're iterating, keep track of the position of the last extracted URL and so on).
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

Get num of page with beautifulsoup

i want to get the number of pages in the next code html:
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ" class="outputText marginLeft0punto5">1</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ" class="outputText marginLeft0punto5">37</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ" class="outputText marginLeft0punto5">736</span>
The goal is get the number 1, 37 and 736
My problem is that i don't know how define the line to extract the numbers, for example for the number 1:
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
first_page = int(soup.find('span', {'id': 'viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ'}).getText())
Thanks so much
EDIT: Finally i found a solution with Selenium:
numpag = int(driver.find_element_by_xpath('//*[#id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ"]').text)
pagtotal = int(driver.find_element_by_xpath('//*[#id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ"]').text)
totaltotal = int(driver.find_element_by_xpath('//*[#id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ"]').text)
Thanks #abarnert, sorry for the caos in my question, it was my first post =)
The code you provided already works for the example you provided.
My guess is that your problem is that it doesn't work for any other page, probably because those id values are different each time.
If that's the case, you need to look at (or show us) multiple different outputs to figure out if there's a recognizable pattern that you can match with a regular expression or a function full of string operations or whatever. See Searching the tree in the docs for the different kinds of filters you can use.
As a wild guess, that Z7 and AVEQAI930OBRD02JPMTPG21004 are replaced by different strings of capitals and digits each time, but the rest of the format is always the same? If so, there are some pretty obvious regular expressions you can use:
rnumpag = re.compile(r'.*:form1:textfooterInfoNumPagMAQ')
rtotalpagina = re.compile(r'.*:form1:textfooterInfoTotalPaginaMAQ')
rtotaltotal = re.compile(r'.*:form1:textfooterTotalTotalMAQ')
numpag = int(soup.find('span', id=rnumpag).string)
totalpagina = int(soup.find('span', id=rtotalpagina).string)
totaltotal = int(soup.find('span', id=rtotaltotal).string)
This works on your provided example, and would also work on a different page that had different strings of characters within the part we're matching with .*.
And, even if my wild guess was wrong, this should show you how to write a search for whatever you actually do have to search for.
As a side note, you were using the undocumented legacy function getText(). This implies that you're copying and pasting ancient BS3 code. Don't do that. Some of it will work with BS4, even when it isn't documented to (as in this case), but it's still a bad idea. It's like trying to run Python 2 source code with Python 3 without understanding the differences.
What you want here is either get_text(), string, or text, and you should look at what all three of these mean in the docs to understand the difference—but here, the only thing within the tag is a text string, so they all happen to do the same thing.

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
If Target = "" Then
ExportText = ""
Else
ExportText = Descr & Chr(44) & Assign & Chr(44) & _
Target & Chr(13) & Chr(10)
Print #fnum, ExportText
End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target)-1))
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row in Application.ActiveDocument.Tables(2).Rows
Descr = row.Cells(2).Range.Text
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=?) # not sure what to use for ?
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
You could use OpenOffice. It can open word files, and also can run python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
how about saving the file as xml. then using python or something else and pull the data out of word and into the database.
It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.

Categories

Resources