I am writing a Python program that takes in a Word document and returns the same document with certain words/phrases highlighted. The program uses the python-docx and spaCy APIs.
Currently, the program reads the text from a .docx Word document into a python-docx Document, converts it into a list of spaCy Docs (one Doc for each paragraph in the Document), and uses the spaCy Matcher to output all of the word/phrase matches. The output is a list containing each matched string, its starting character index in its respective paragraph, and its ending character index in its respective paragraph. See below:
import spacy
import docx
from spacy.matcher import Matcher
# Create the nlp object
nlp = spacy.load("en_core_web_sm")
# Read a Word document into a list of nlp Docs, one per paragraph
# (named document so that it does not shadow the docx module)
document = docx.Document('file.docx')
docs = []
for paragraph in document.paragraphs:
    docs.append(nlp(paragraph.text))
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# matcher rule logic omitted
matches = []
docx_matches = []
# Call the matcher on each nlp Doc (one per paragraph) and collect each Doc's matches
for doc in docs:
    matches.append(matcher(doc))
# Iterate through the matches found in each nlp Doc and return them in a format that
# python-docx can understand (character indices rather than token indices)
for index, match in enumerate(matches):
    for match_id, start, end in match:
        span = docs[index][start:end]
        match_text = span.text
        match_start_index = span.start_char
        match_end_index = span.end_char
        docx_matches.append([match_text, match_start_index, match_end_index])
I'm now struggling to take the output of the matcher (the docx_matches list) and highlight only those spans of text in the original file.docx document using the start and end character indices. As far as I can tell, python-docx only has an add_run() function for appending runs to the end of an existing paragraph. Ideally, we would be able to create a run covering the start and end index of each match and change the highlight of that run.
I tried leveraging the code here: https://github.com/python-openxml/python-docx/issues/980. However, I receive an error when attempting to:
return Run(r, paragraph)
in the form of "NameError: name 'Run' is not defined". I also tried setting the run "r" in the code from the above link to
r.font.highlight_color = WD_COLOR_INDEX.YELLOW
but received the error AttributeError: 'CT_R' object has no attribute 'font', meaning that "r" isn't of type Run in the first place.
Can anyone help with a different solution or how to fix the ones above? Thank you for your time - I appreciate it a lot! Happy to answer questions/provide more code as necessary.
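One possible way to approach the highlighting, sketched under assumptions: rather than inserting a run in the middle of a paragraph, split the paragraph text into highlighted and plain segments from the character offsets, then rebuild the paragraph from those segments. The helper name split_by_spans and the sample sentence are hypothetical; only the (start, end) offsets correspond to what docx_matches stores.

```python
def split_by_spans(text, spans):
    """Split text into (segment, highlighted) pairs.

    spans holds (start_char, end_char) pairs with an exclusive end,
    like the offsets stored in docx_matches. Overlaps are clipped.
    """
    segments = []
    cursor = 0
    for start, end in sorted(spans):
        start = max(start, cursor)  # clip any overlap with the previous span
        if end <= start:
            continue  # span is already fully covered
        if start > cursor:
            segments.append((text[cursor:start], False))
        segments.append((text[start:end], True))
        cursor = end
    if cursor < len(text):
        segments.append((text[cursor:], False))
    return segments


# Hypothetical paragraph text and match offsets for illustration
paragraph_text = "The quick brown fox jumps over the lazy dog."
segments = split_by_spans(paragraph_text, [(4, 9), (35, 39)])
print(segments)
```

With python-docx, each segment could then be written back with paragraph.add_run(segment), setting run.font.highlight_color = WD_COLOR_INDEX.YELLOW (from docx.enum.text) on the highlighted ones after clearing the paragraph's existing runs. Note that rebuilding a paragraph this way discards any per-run formatting it had, and that docx_matches as written does not record which paragraph each match came from, so the paragraph index would need to be stored alongside the offsets.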
I'm trying to use a regex statement to extract a specific block of text between two known phrases that will be repeated in other documents, and remove everything else. These few sentences will then be passed into other functions.
My problem seems to be that when I use a regex statement where the words I'm searching for are on the same line, it works. If they're on different lines I get:
print(match.group(1).strip())
AttributeError: 'NoneType' object has no attribute 'group'
I'm expecting future reports to have line breaks at different points depending on what was written before - is there a way to prepare the text first by removing all line breaks, or to make my regex statement ignore those when searching?
Any help would be great, thanks!
import fitz
import re
doc = fitz.open(r'file.pdf')
text_list = [ ]
for page in doc:
    text_list.append(page.getText())
    #print(text_list[-1])
text_string = ' '.join(text_list)
test_string = "Observations of Client Behavior: THIS IS THE DESIRED TEXT. Observations of Client's response to skill acquisition" #works for this test
pat = r".*?Observations of Client Behavior: (.*) Observations of Client's response to skill acquisition*"
match = re.search(pat, text_string)
print(match.group(1).strip())
When the phrases in pat are on the same line in the long text, it works. But as soon as they are on different lines, it no longer works.
Here is a sample of the input text giving me an issue:
Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency
has been low and stable at 0 occurrences for the past two consecutive sessions. Elopement frequency is on an
overall decreasing trend. Property destruction frequency is on an overall decreasing trend. Non-compliance
frequency has been stagnant at 2 occurrences for the past two consecutive sessions, but overall on a
decreasing trend. Tantrum duration data are variable; data were at 89 minutes on 9/27/21, but have starkly
decreased to 0 minutes for the past two consecutive sessions. Observations of Client's response to skill
acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands
Note that . matches any character other than a newline, so you could use (.|\n) to capture everything. Also, it seems that the line could break inside your fixed phrases, so match the whitespace between their words with \s+. First define the prefix and suffix of the pattern:
prefix=r"Observations\s+of\s+Client\s+Behavior:"
suffix=r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"
and then create pattern and find all occurrences:
pattern=prefix+r"((?:.|\n)*?)"+suffix
f=re.findall(pattern,text_string)
By using *? at the end of r"((?:.|\n)*?)" we match as few characters as possible.
Example of multi-line multi-pattern:
text_string = '''any thing Observations of Client Behavior: patern1 Observations of Client's
response to skill acquisition: any thing
any thing Observations of Client Behavior: patern2 Observations of
Client's response to skill acquisition: any thing Observations of Client
Behavior: patern3 Observations of Client's response to skill acquisition: any thing any thing'''
result=re.findall(pattern,text_string)
result=[' patern1 ', ' patern2 ', ' patern3 ']
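The other option raised in the question, preparing the text by removing line breaks first, could look roughly like the sketch below. The sample text is a shortened stand-in for the PDF output, and the variable names are assumptions for illustration.

```python
import re

# Shortened stand-in for the PDF text, with a line break inside the end marker
raw_text = (
    "Observations of Client Behavior: Aggression frequency\n"
    "has been low and stable. Observations of Client's response to skill\n"
    "acquisition: Overall skill acquisition data trends"
)

# Collapse every run of whitespace (including newlines) into a single space
flat_text = " ".join(raw_text.split())

pattern = (
    r"Observations of Client Behavior:\s*(.*?)\s*"
    r"Observations of Client's response to skill acquisition"
)
match = re.search(pattern, flat_text)

# Guard against None so a missing block does not raise AttributeError
extracted = match.group(1) if match else None
print(extracted)
```

Once the whitespace is normalized, the marker phrases can be written with plain spaces, and checking the match for None avoids the AttributeError from the question when a report does not contain the block.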
I'm working on filtering a website's data and looking for keywords. The website uses a long JSON body and I only need to parse everything before a base64-encoded image. I cannot parse the JSON object regularly as the structure changes often and sometimes it's cut off.
Here is a snippet of code I'm parsing:
<script id="__APP_DATA" type="application/json">{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]
As you can see, I'm trying to weed out everything except for these:
{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}
As the JSON is really long and it wouldn't be smart to parse the entire thing, is there a way to find strings like these without actually parsing the JSON object? Ideally, I'd like for everything to be in an array. Will regular expressions work?
The ID is 5 numbers long, the code is 32 characters long, and there is a title.
Thanks a lot in advance
The following will use string.find() to step through the string and, if it finds both the start AND end of your target string, it will extract it as a dictionary. If it only finds the start but not the end, it assumes it's a broken or interrupted string and breaks out of the loop as there's nothing further to do.
I'm using the ast module to convert the string to dictionary. This isn't strictly needed to answer the question but I think it makes the end result more usable.
import ast
testdata = '{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]'
# Create a list to hold the dictionary objects
itemlist = []
# Create variable to keep track of our position in the string
strMarker = 0
# Loop until no further complete target string is found
while True:
    # Find occurrence of the beginning of a target string
    strStart = testdata.find('{"id":',strMarker)
    if not strStart == -1:
        # If we've found the start, now look for the end marker of the string,
        # starting from the location we identified as the beginning of that string
        strEnd = testdata.find('}', strStart)
        # If it does not exist, this suggests it might be an interrupted string
        # so we don't do anything further with it, just allow the loop to break
        if not strEnd == -1:
            # Save this marker as it will be used as the starting point
            # for the next search cycle.
            strMarker = strEnd
            # Extract the substring based on the start and end positions, +1 to capture
            # the final '}'; as this string is nicely formatted as a dictionary object
            # already, we are using ast.literal_eval() to turn it into an actual usable
            # dictionary object
            itemlist.append(ast.literal_eval(testdata[strStart:strEnd+1]))
            # We're happy to keep searching so jump to the next loop
            continue
    # If nothing happened to trigger a jump to the next loop, break out of the
    # while loop
    break
# Print out the first entry in the list as a demo
print(str(itemlist[0]))
print(str(itemlist[0]["title"]))
Output from this code should be the first entry printed as a Python dict, followed by its title:
{'id': 54572, 'code': '0ef69e1d334c4d8c9ffbd088843bf2dd', 'title': 'Binance Will List GYEN'}
Binance Will List GYEN
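One caveat worth sketching: ast.literal_eval() only understands Python literals, so it works here because the extracted fragments contain nothing but numbers and strings. If a fragment ever contained a JSON-only literal such as null, json.loads() would be the safer choice. The fragment below is a hypothetical example built for illustration, not taken from the page.

```python
import ast
import json

# Hypothetical fragment containing a JSON-only literal (null)
fragment = '{"id": 54572, "parentCatalogId": null, "title": "Binance Will List GYEN"}'

# json.loads understands null and maps it to None
item = json.loads(fragment)

# ast.literal_eval rejects it, because null is not a Python literal
try:
    ast.literal_eval(fragment)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(item["parentCatalogId"], literal_eval_ok)
```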
Regular expression should work here. Try matching with the following regular expression. It matches the desired sections, when I try it in https://regexr.com/. Also, regexr helps you understand the regular expression, in case you are new to it.
(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})
Here is a small sample python script to find all of the sections.
import re
pattern=r'(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})'
string_to_parse='...'
sections = re.findall(pattern, string_to_parse, re.DOTALL)
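As a rough, self-contained sketch of the same idea, the version below uses named groups so that each match becomes a dictionary without any JSON parsing. The sample is a shortened copy of the data with the last object cut off on purpose; narrowing the code to 32 lowercase hex characters is an assumption based on the sample, not something the site guarantees.

```python
import re

# Named groups capture the three fields; raw string keeps the backslashes intact
article_pattern = re.compile(
    r'\{"id":(?P<id>\d{5}),'
    r'"code":"(?P<code>[0-9a-f]{32})",'
    r'"title":"(?P<title>[^"]*)"\}'
)

# Shortened sample; the third object is deliberately truncated
sample = (
    '"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd",'
    '"title":"Binance Will List GYEN"},'
    '{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510",'
    '"title":"Binance Will List Internet Computer (ICP)"},'
    '{"id":54649,"code":"2291f02b964f45b195fd6d46'
)

# Each complete object becomes a dict of strings; the cut-off one is skipped
articles = [m.groupdict() for m in article_pattern.finditer(sample)]
print(articles)
```

The captured values are all strings, so the id would need int() if a number is wanted. A title containing an escaped double quote would end the match early, which is one limitation of not parsing the JSON.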
I am using the docx library to read a Word document, and I am trying to extract only the questions using regex search and match. I have found many ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the word doc into a different type of file, that'll be great to know for feedback. Thank you
I am using regex101, and I've tried the following regex expressions to match only the sentences that end in a question mark:
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document
wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into directories so I can export the data to a csv file
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
# Open file
f = open('test.txt', 'r')
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct. Although it makes sense logically to match only sentences that end in a ?, one of your matches is "place to pay your rent. Will my financial aid pay for housing?", for example. Only the second part of that match is an actual question. So the match should start at a capital letter rather than at any lower case letter. Your regex should be something like:
[A-Z].*\?$
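Putting the two parts together, a minimal sketch might look like the following. The list of strings stands in for the paragraph.text values that would be read from the document, and the pattern is a variant rather than the exact regex above: because .* is greedy, [A-Z].*\?$ still starts at the first capital letter on the line, so this version also refuses to cross sentence-ending punctuation.

```python
import re

# Plain strings standing in for paragraph.text values read from the document
paragraph_texts = [
    "Will my financial aid pay for housing?",
    "You can then use the refund to pay your rent. Will my financial aid pay for housing?",
    '"financial" "help" "house"',
    "How do I pay for my housing?",
]

# Capital letter, then no sentence-ending punctuation, then a final question mark
question_pattern = re.compile(r"[A-Z][^.?!]*\?$")

questions = []
for text in paragraph_texts:
    # search() receives a string here, not a Document object
    match = question_pattern.search(text.strip())
    if match:
        questions.append(match.group(0))

print(questions)
```

The resulting list of strings could then be written out with the csv module if a CSV export is the goal.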
I want to extract disease words from medical data to build a disease word dictionary (think of notes written by doctors and test results). I'm using Python. I tried the following approaches:
Used the Google API to check whether a word is a disease, depending on the results. It didn't go well because it was extracting other medical words too. I even tried modifying the search, and I would also have to buy Google CSE, which I feel is costly because I have a huge amount of data. The code is too large to include in full in the post.
Used Weka to predict the words, but the data I have is plain text that doesn't follow any rules and is not in ARFF or CSV format.
Tried NER for extracting disease words, but all the models I have seen need a predefined dictionary to search and perform tf-idf on the input data. I don't have such a dictionary.
All the models I have seen suggest tokenizing and then POS-tagging the data, which I did, but I couldn't find a way to extract only the disease words from that.
I even tried extracting only the nouns, which didn't work well because other medical terms were also tagged as nouns.
My data looks like the following, and the format is not consistent across the whole document:
After conducting clinical reviews the patient was suffering with
diabetes,htn which was revealed when a complete blood picture of the
patient's blood was done. He was advised to take PRINIVIL TABS 20 MG
(LISINOPRIL) 1.
Believe me, I googled a lot and couldn't come up with a perfect solution. Please suggest a way for me to move forward.
The following is one of the approaches I tried, which extracted other medical terms too. Sorry, the code looks a bit clumsy; I am posting only the main function, as posting the whole code would be very lengthy. Look at the search_word variable, as the main logic lies there:
def search(self,wordd): #implemented google custom search engine api
    #responseData = 'None'
    global flag
    global page
    search_word="\"is+%s+an+organ?\"" %(wordd)
    search_word=str(search_word)
    if flag == 1:
        search_word="\"%s+is+a+disease\"" %(wordd)
    try: #searching google for the word
        url = 'https://www.googleapis.com/customsearch/v1?key=AIzaSyAUGKCa2oHSYeZynSMD6zElBKUrg596G_k&cx=00262342415310682663:xy7prswaherw&num=3&q='+search_word
        print url
        data = urllib2.urlopen(url)
        response_data = json.load(data)
        results=response_data['queries']['request'][0]['totalResults']
        results_count=int(results)
        print "the results is: ",results_count
        if(results_count == 0):
            print "no results found"
            flag = 0
            return 0
        else:
            return 1
    #except IOError:
        #print "network issues!"
    except ValueError:
        print "Problem while decoding JSON data!"