I'm trying to use a regex statement to extract a specific block of text between two known phrases that will be repeated in other documents, and remove everything else. These few sentences will then be passed into other functions.
My problem seems to be that when I use a regex statement that has the words I'm searching for on the same line, it works. If they're on different lines I get:
print(match.group(1).strip())
AttributeError: 'NoneType' object has no attribute 'group'
I'm expecting future reports to have line breaks at different points depending on what was written before - is there a way to prepare the text first by removing all line breaks, or to make my regex statement ignore those when searching?
Any help would be great, thanks!
import fitz
import re
doc = fitz.open(r'file.pdf')
text_list = []
for page in doc:
    text_list.append(page.getText())
    # print(text_list[-1])
text_string = ' '.join(text_list)
test_string = "Observations of Client Behavior: THIS IS THE DESIRED TEXT. Observations of Client's response to skill acquisition" #works for this test
pat = r".*?Observations of Client Behavior: (.*) Observations of Client's response to skill acquisition*"
match = re.search(pat, text_string)
print(match.group(1).strip())
When the two phrases in my pattern appear on the same line of the long text, the search works. But as soon as they are on different lines, it no longer works.
Here is a sample of the input text giving me an issue:
Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency
has been low and stable at 0 occurrences for the past two consecutive sessions. Elopement frequency is on an
overall decreasing trend. Property destruction frequency is on an overall decreasing trend. Non-compliance
frequency has been stagnant at 2 occurrences for the past two consecutive sessions, but overall on a
decreasing trend. Tantrum duration data are variable; data were at 89 minutes on 9/27/21, but have starkly
decreased to 0 minutes for the past two consecutive sessions. Observations of Client's response to skill
acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands
Note that . matches any character other than newline, so you could use (.|\n) to capture everything. Also, it seems that a line break can fall inside your fixed phrases, so match the whitespace between their words with \s+. First define the prefix and suffix of the pattern:
prefix=r"Observations\s+of\s+Client\s+Behavior:"
sufix=r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"
and then create pattern and find all occurrences:
pattern = prefix + r"((?:.|\n)*?)" + suffix
f = re.findall(pattern, text_string)
By using *? in r"((?:.|\n)*?)" we match as few characters as possible, so each captured block stops at the first occurrence of the suffix.
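Equivalently, you can keep a plain . and pass the DOTALL flag, which makes . match newlines as well; a minimal sketch using the prefix and suffix defined above:
import re

# re.DOTALL makes '.' match newline too, so (?:.|\n) is unnecessary
pattern = prefix + r"(.*?)" + suffix
f = re.findall(pattern, text_string, flags=re.DOTALL)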
Example of multi-line multi-pattern:
text_string = '''any thing Observations of Client Behavior: patern1 Observations of Client's
response to skill acquisition: any thing
any thing Observations of Client Behavior: patern2 Observations of
Client's response to skill acquisition: any thing Observations of Client
Behavior: patern3 Observations of Client's response to skill acquisition: any thing any thing'''
result = re.findall(pattern, text_string)
# result == [' patern1 ', ' patern2 ', ' patern3 ']
Summary
I am building a text summarizer in Python. The kind of documents that I am mainly targeting are scholarly papers that are usually in pdf format.
What I Want to Achieve
I want to effectively extract the body of the paper (abstract to conclusion), excluding title of the paper, publisher names, images, equations and references.
Issues
I have tried looking for effective ways to do this, but I was not able to find anything tangible and useful. The current code I have tries to split the pdf document by sentences and then filters out the entries that have fewer than the average number of characters per sentence. Below is the code:
from pdfminer import high_level
# input: string (path to the file)
# output: list of sentences
def pdf2sentences(pdf):
    article_text = high_level.extract_text(pdf)
    sents = article_text.split('.')  # splitting on '.' roughly splits on every sentence
    run_ave = 0
    for s in sents:
        run_ave += len(s)
    run_ave /= len(sents)
    sents_strip = []
    for sent in sents:
        if len(sent.strip()) >= run_ave:
            sents_strip.append(sent)
    return sents_strip
Note: I am using this article as input.
The above code seems to work fine, but I am still not able to effectively filter out things like the title and publisher names that come before the abstract section, and things like the references section that come after the conclusion. Moreover, images are causing gibberish characters to show up in the text, which is messing up the overall quality of the output. Due to the weird unicode characters I am not able to write the output to a txt file.
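For the write error specifically, here is a minimal sketch that sidesteps the encoding problem, assuming sents_strip is the list returned by pdf2sentences above:
# Write with an explicit encoding instead of the platform default
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write('.'.join(sents_strip))

# Or strip the non-ASCII gibberish outright before writing
clean = [s.encode('ascii', errors='ignore').decode() for s in sents_strip]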
Appeal
Are there ways I can improve the performance of this parser and make it more consistent?
Thank you for your answers!
I am using the docx library to read a Word document, and I am trying to extract only the questions using regex search and match. I have found countless ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the word doc into a different type of file, that'll be great to know for feedback. Thank you
I am using regex101, and I've tried the following regular expressions to match only the sentences that end in a question mark:
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document
wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into dictionaries so I can export the data to a csv file.
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
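A minimal sketch of that extraction, assuming the text lives in ordinary paragraphs rather than tables:
import re
from docx import Document

wordDoc = Document('botDoc.docx')
# Join the text of every paragraph into one searchable string
full_text = '\n'.join(p.text for p in wordDoc.paragraphs)
matches = re.findall(r'your pattern', full_text)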
FYI, the way you would do this for a simple text file is the following:
# Open the file and feed its text into findall();
# it returns a list of all the found strings
with open('test.txt', 'r') as f:
    strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct. Although logically it makes sense to match only sentences that end in a ?, one of your matches is "place to pay your rent. Will my financial aid pay for housing?", for example. Only the second part of that sentence is an actual question, so anchor the match at a capital letter instead of letting it start mid-sentence. Your regex should be something like:
[A-Z].*\?$
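Note that with $ you will want the MULTILINE flag when searching a string that spans several lines; a quick sketch:
import re

text = "Some intro text.\nWill my financial aid pay for housing?\nHow do I pay for housing?"
print(re.findall(r'[A-Z].*\?$', text, flags=re.MULTILINE))
# ['Will my financial aid pay for housing?', 'How do I pay for housing?']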
I am writing code to extract certain information from texts, and I am using spaCy.
The goal is that IF a particular token of a text contains the string "refstart" then I want to get the noun chunk preceding that token.
Just for info: the tokens containing "refstart" and "refend" are generated with a regex before the nlp object is created in spaCy.
So far I am using this code:
import spacy

nlp = spacy.load('en_core_web_sm')
raw_text = ("Figure 1 shows a cross-sectional view refstart10,20,30refend of a "
            "refrigerator refstart41,43refend that uses a new cooling technology "
            "refstart10,23a,45refend including a retrofitting pump including "
            "high density fluid refstart10refend.")
doc3 = nlp(raw_text)
list_of_references = []
for token in doc3:
    # look if the token is a ref. sign
    # in order to see the functioning of the loops uncomment the prints
    # print('looking for:', token.text)
    if 'refstart' in token.text:
        # print('yes it is in')
        ref_token_text = token.text
        ref_token_position = token.i
        # print('token text:', ref_token_text)
        for chunk in doc3.noun_chunks:
            if chunk.end == ref_token_position:
                # we have a chunk and a ref. sign
                list_of_references.append((chunk.text, chunk.start, chunk.end, ref_token_text))
                break
This works: I get a list of tuples containing each noun chunk, its start and end, and the text of the following token that contains the string refstart.
The result of this code should be:
a cross-sectional view, refstart10,20,30refend
a refrigerator, refstart41,43refend
a new cooling technology, refstart10,23a,45refend
high density fluid, refstart10refend
See how "retrofiting pump" is not part of the list because is not followed by a token including "refstart"
This is nevertheless very inefficient for loops over text that are very large can slow down the data pipeline a lot.
Solution 2:
I thought about creating a list of tokens with their positions and a list of noun chunks
# build the list with all the noun chunks, with start and end positions in the text
list_chunks = []
print("chunks")
for chunk in doc3.noun_chunks:
    list_chunks.append((chunk.text, chunk.start, chunk.end))
    try:
        print(f'start:{chunk.start},end:{chunk.end} \t \t {chunk.text} \t following text:{doc3[chunk.end+1]}')
    except IndexError:
        # this is done just to avoid an error on the last chunk
        print(f'start:{chunk.start},end:{chunk.end} \t \t {chunk.text} \t following text:last one')

print("refs------------------")
# build the list with all the ref tokens and their positions
list_ref_tokens = []
for token in doc3:
    if 'refstart' in token.text:
        list_ref_tokens.append((token.text, token.i))
        print(token.text, token.i)
But now I would have to compare the tuples inside list_chunks and list_ref_tokens, which is also tricky.
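One rough idea I had for that comparison (a sketch only): index the chunks by their end position once, so each ref token lookup becomes a dictionary access instead of a scan.
# Index noun chunks by end position in a single pass
chunk_by_end = {chunk.end: chunk for chunk in doc3.noun_chunks}

list_of_references = []
for token in doc3:
    if 'refstart' in token.text and token.i in chunk_by_end:
        chunk = chunk_by_end[token.i]
        list_of_references.append((chunk.text, chunk.start, chunk.end, token.text))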
Any other suggestions?
Thanks.
I have a column called Description in my Dataframe. I have text in that column as below.
Description
Summary: SD1: Low free LOG space in database saptempdb: 2.99% Date: 01/01/2017 Severity: Major Reso
Summary: SD1: Low free DATA space in database 10:101:101:1 2.99% Date: 01/01/2017 Severity: Major Res
Summary: SAP SolMan Sys=SM1_SNG01AMMSOL04,MO=AGEEPM40,Alert=Columnstore Unloads,Desc= ,Cat=Exception
How do I extract the server names or IPs from the descriptions above? I have around 10,000 rows.
I have written the following to split the sentences into words. Now I need to filter out the server names or IPs:
df['sentsplit'] = df["Description"].str.split(" ")
print(df)
The general case of what you're asking is "How do I parse this input?" The task, then, is to work out what knowledge of your input you can exploit to answer that question. Do all the lines follow one or a few forms? Can you place any restrictions on where the hostname or IP address will be on each line?
Given your input, here's a regex I might apply. Quick and dirty -- not elegant -- but if it's only for 10,000 lines, and a one-off job, who cares? It's functional:
database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),
This regex assumes that the IP address will always come after the word database and be preceded by a space, OR that the hostname will come after the word database, OR that the hostname will be preceded by Sys= and followed by a comma or a space.
Obviously, test for your purposes, and fine tune as appropriate. In the Python API:
host_or_ip_re = re.compile(r'database (\d+:\d+:\d+:\d+)|database (\w+)|Sys=([^, ]+),')
for line in log:
    m = host_or_ip_re.search(line)
    if m:
        print(m.groups())
The detail that always trips me up is the difference between match and search: match only matches at the beginning of the string, while search scans anywhere in it.
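A quick illustration of that difference, using one of the sample lines:
import re

s = 'Summary: SD1: Low free LOG space in database saptempdb: 2.99%'
print(re.match(r'database (\w+)', s))            # None, because match anchors at position 0
print(re.search(r'database (\w+)', s).group(1))  # saptempdb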
I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindal wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides the functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers using common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains you may get a fairly good indication.
E.g. a line with greeting words near the start is the salutation (salutations may also contain phrases that refer to the past, e.g. "it was good to see you last time").
A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.), question marks, and offerings (e.g. want to, can we, should we, prefer...).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
The signature will contain closing words.
If you find a datasource that has messages of the structure you want you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score [salutation score, body score, signature score,..]
e.g. hello could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
this means hello would get assigned [900, 10, 3, ..]
cheers might get assigned [10,3,100,..]
Now you will have a large list of about 500,000 words. Words that don't have a large range aren't useful: e.g. catch might have [100, 101, 80, ..], a range of only 21 (it was good to catch up, wanna go catch a fish, catch you later), because catch can occur anywhere. Filtering those out reduces the number of words down to about 10,000.
Now, for each line, give the line a score of the same form [salutation score, body score, signature score, ..], calculated by adding the vector scores of each of its words. E.g. the sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ..] + [10, 3, 100, ..] + ... = [900+10+.., 10+3+.., 3+100+.., ..]
= [1023, 900, 500, ..], say.
Then, because the biggest number is in the salutation-score position, this sentence is a salutation.
Whenever you need to classify one of your lines, you add up the scores of its words in exactly the same way.
Good luck; there is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model to base your calculations on, it will help.
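A minimal sketch of that scoring idea (the word vectors here are made-up illustrative counts, not real frequency data):
# Hypothetical per-word zone scores: [salutation, body, signature]
WORD_SCORES = {
    'hello':  [900, 10, 3],
    'cheers': [10, 3, 100],
    'meet':   [5, 300, 2],
}
ZONES = ['salutation', 'body', 'signature']

def classify_line(line):
    # Sum the zone-score vectors of every known word in the line
    total = [0, 0, 0]
    for word in line.lower().split():
        for i, score in enumerate(WORD_SCORES.get(word, [0, 0, 0])):
            total[i] += score
    # The zone with the biggest summed score wins
    return ZONES[total.index(max(total))]

print(classify_line('Hello Dave'))  # salutation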
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. Here's a little bit of code:
linearray = emailtext.split('\n')
Now you have an array of strings, each one roughly a paragraph, so linearray[0] would contain the salutation.
Deciding where the reply text starts is a little trickier. I noticed that there is a double newline just before it, so you could search for that from the back and hope that the last one indicates the start of the reply text; see the sketch below.
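A rough sketch of that back-search:
# Search from the back for the last double newline;
# everything after it is (hopefully) the quoted reply text
split_at = emailtext.rfind('\n\n')
if split_at != -1:
    own_text, reply_text = emailtext[:split_at], emailtext[split_at + 2:]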
Or store some signature words you might expect and search for those from the front, like cheers, regards, and whatever else.
Once you figure out where the signature is, the rest is easy.
Hope this helped!
I built a pretty cheap API for this, actually, to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs for it.
Basically you send it a header 'x-api-key' with a JSON body like so and it parses all the contacts in the reply chain of an email.
{
    "subject": "Thanks for meeting...",
    "from_address": "bgates@example.com",
    "from_name": "Bill Gates",
    "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
    "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
    "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
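A minimal sketch of such a call with the requests library (the endpoint URL and key below are placeholders, not the real SigParser route; take the real values from the Swagger docs):
import requests

# Placeholder endpoint and key; substitute the real values
resp = requests.post(
    'https://api.example.com/parse-email',
    headers={'x-api-key': 'YOUR_API_KEY'},
    json={
        "subject": "Thanks for meeting...",
        "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
    },
)
print(resp.json())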