I'm trying to extract a sentence that sits between two dots. Every sentence I want contains inflam or Inflam, which is my keyword, but I don't know how to make that happen.
what I want is ".The bulk of the underlying fibrous connective tissue consists of diffuse aggregates of chronic inflammatory cells."
or
".The fibrous connective tissue reveals scattered vascular structures and possible chronic inflammation."
from a long paragraph.
What I have tried so far is this:
#@title Extract microscopic-inflammation { form-width: "20%" }
import re

def inflammation1(microscopic_description):
    PATTERNS = [
        "(?=\.)(.*)(?<=inflamm)",
        "(?=inflamm)(.*)(?<=.)",
    ]
    for pattern in PATTERNS:
        matches = re.findall(pattern, microscopic_description)
        if len(matches) > 0:
            break
    inflammation1 = ''.join([k for k in matches])
    return inflammation1

for index, microscopic_description in enumerate(texts):
    print(inflammation1(microscopic_description))
    print("#" * 79, index)
This hasn't worked for me and gives me an error. When I separate my patterns and run them in different cells they work; the problem is that they don't work together to give me the sentence between the "." before inflamm and the "." after it.
import re
string='' # replace with your paragraph
print(re.search(r"\.[\s\w]*\.",string).group()) #will print first matched string
print(re.findall(r"\.[\s\w]*\.",string)) #will print all matched strings
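To pull out only the sentence containing your keyword, the same idea can be combined with the word itself. A rough sketch (it assumes the only periods in the text are sentence boundaries, so abbreviations or decimals would break it):
import re
string = ''  # replace with your paragraph
# a period, then a run of non-period characters containing inflamm/Inflamm, up to the next period
m = re.search(r"\.[^.]*[Ii]nflamm[^.]*\.", string)
if m:
    print(m.group())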
You can try by checking for the word in every sentence of the text.
for sentence in text.split("."):
    if word in sentence:
        print(sentence[1:])
Here you do exactly that, and if you find the word, you print the sentence without the space at the start of it. You can modify it in any way you want.
I want to remove the letters br from the end of every word in my Pandas dataframe column (as you will see, the rows of this column are actually sentences, all different from one another).
Unfortunately, I had already cleaned the data without giving much thought to the < br > tags, so I am now left with words like 'startbr,' 'nicebr,' and 'hellobr,' which are of no use to me.
A dataframe row may look something like this (errors denoted by ** ** tags):
Sentence = here are **somebr** examples of poorly written paragraphs **andbr** well-written **paragraphsbr** on the same **topicbr** how do they compare?
What I would like (without the br at the end):
Sentence: here are **some** examples of poorly written paragraphs **and** well-written **paragraphs** on the same **topic** how do they compare?
I am hoping for an answer that will let me keep the original sentence, just without the letters br at the end of any word. Words like "brutish," "breathtaking," and "ember" should be kept as is, since they could be of value. Fortunately, there are no words that I would like to retain that end with the letters br.
Use a regex with a word boundary (\b) to match the end of words:
df['text'] = df['text'].str.replace(r'br\b', '', regex=True)
Example (with assignment as a new column text2):
                        text                  text2
0  word wordbr bread breadbr  word word bread bread
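For completeness, a small self-contained sketch of the same replacement (the column names text and text2 are just those from the example above):
import pandas as pd

df = pd.DataFrame({'text': ['word wordbr bread breadbr']})
# r'br\b' only matches br at a word boundary, so 'bread' is left untouched
# while 'wordbr' and 'breadbr' lose the trailing br
df['text2'] = df['text'].str.replace(r'br\b', '', regex=True)
print(df)
#                         text                  text2
# 0  word wordbr bread breadbr  word word bread bread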
I want the program to search for all occurrences of crocodile, etc. with fuzzy matching, i.e. if there are any spelling mistakes, it should count those words as well.
s="Difference between a crocodile and an alligator is......." #Long paragraph, >10000 words
to_search=["crocodile","insect","alligator"]
for i in range(len(to_search)):
for j in range(len(s)):
a = s[j:j+len(to_search[i])]
match = difflib.SequenceMatcher(None,a,to_search[I]).ratio()
if(match>0.9): #90% similarity
print(a)
So all of the following should be considered as instances of "crocodile": "crocodile", "crocodil", "crocodele", etc.
The above method works but is too slow if the main string ("s" here) is of large size like >1million words.
Is there any way to do this that's faster than the above method (splitting the string into sub-string-sized blocks and then comparing each sub-string with the reference word)?
One of the reasons it takes so long on a large body of text is that you are repeating the sliding window through the entire text multiple times, once for each word you are searching for. And a lot of the computation is spent comparing your words to blocks of the same length which might contain parts of multiple words.
If you are willing to posit that you are always looking to match individual words, you could split the text into words and just compare against the words - far fewer comparisons (number of words, vs. windows starting at every position in the text), and the splitting only needs to be done once, not for every search term. Here's an example:
import difflib

to_search = ["crocodile", "insect", "alligator"]
s = "Difference between a crocodile and an alligator is"  # Long paragraph, >10000 words
s_words = s.replace(".", " ").split(" ")  # Split on spaces, with periods removed

for search_for in to_search:
    for s_word in s_words:
        match = difflib.SequenceMatcher(None, s_word, search_for).ratio()
        if match > 0.9:  # 90% similarity
            print(s_word)
            break  # no longer need to continue the search for this word!
This should give you a significant speedup; hope it meets your needs!
Happy Coding!
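If you only need to know whether a close match exists for each search term (rather than printing every occurrence), difflib.get_close_matches can do the comparison against the set of unique words in a single call per term; a sketch along those lines:
import difflib

s = "Difference between a crocodile and an alligator is"  # Long paragraph, >10000 words
to_search = ["crocodile", "insect", "alligator"]

# Compare each search term against the unique words only, with the same 0.9 cutoff as above.
unique_words = set(s.replace(".", " ").split())
for search_for in to_search:
    close = difflib.get_close_matches(search_for, unique_words, n=5, cutoff=0.9)
    if close:
        print(search_for, "->", close)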
I'm looking to count the number of words per sentence, calculate the mean words per sentence, and put that info into a CSV file. Here's what I have so far. I probably just need to know how to count the number of words before a period. I might be able to figure it out from there.
# Read the data in the text file as a string
with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

# Change '!' and '?' to '.'
for ch in ['!', '?']:
    if ch in pnp:
        pnp = pnp.replace(ch, ".")

# Remove the period after Dr., Mr., Mrs. (choosing not to include etc., as that often ends a sentence although it can also appear in the middle)
pnp = pnp.replace("Dr.", "Dr")
pnp = pnp.replace("Mr.", "Mr")
pnp = pnp.replace("Mrs.", "Mrs")
To split a string into a list of strings on some character:
pnp = pnp.split('.')
Then we can split each of those sentences into a list of strings (words)
pnp = [sentence.split() for sentence in pnp]
Then we get the number of words in each sentence
pnp = [len(sentence) for sentence in pnp]
Then we can use statistics.mean to calculate the mean:
statistics.mean(pnp)
To use statistics you must put import statistics at the top of your file. If you don't recognize the ways I'm reassigning pnp, look up list comprehensions.
You might be interested in the split() function for strings. It seems like you're editing your text to make sure all sentences end in a period and every period ends a sentence.
Thus,
pnp.split('.')
is going to give you a list of all sentences. Once you have that list, for each sentence in the list,
sentence.split() # i.e., split according to whitespace by default
will give you a list of words in the sentence.
Is that enough of a start?
You can try the code below.
numbers_per_sentence = [len(element) for element in (element.split() for element in pnp.split("."))]
mean = sum(numbers_per_sentence)/len(numbers_per_sentence)
However, for real natural language processing I would probably recommend a more robust solution such as NLTK. The text manipulation you perform (replacing "?" and "!", removing periods after "Dr.", "Mr." and "Mrs.") is probably not enough to be 100% sure that a period is always a sentence separator (and that there are no other sentence separators in your text), even if it happens to be true for Pride and Prejudice.
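For instance, a sketch of the NLTK route (assuming nltk is installed and the punkt tokenizer data has been downloaded):
import nltk
# nltk.download('punkt')  # one-time download of the sentence tokenizer data

with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

sentences = nltk.sent_tokenize(pnp)
words_per_sentence = [len(nltk.word_tokenize(s)) for s in sentences]
print(sum(words_per_sentence) / len(words_per_sentence))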
I'm reading sentences from an Excel file (containing bio data) and want to extract the organizations where the people are working. The file also contains sentences which specify where a person is studying.
For example:
I'm studying in 'x' institution (university)
I'm a student in 'y' college
I want to skip these types of sentences.
I am using a regular expression to match these sentences: if a sentence is related to a student, I skip it, and I only want to write the other lines to a separate Excel file.
My code is as below:
import re
import pandas

csvdata = pandas.read_csv("filename.csv", ",")
for data in csvdata:
    regEX = re.compile('|'.join([r'\bstudent\b', r'\bstudy[ing]\b']), re.I)
    matched_data = re.match(regEX, data)
    if matched_data is not None:
        continue
    else:
        pass  # write the sentence to excel
But when I check the newly created Excel file, it still contains the sentences that contain 'student' or 'study'.
How can the regular expression be modified to get the desired result?
There are 2 things here:
1) Use re.search (re.match only searches at the string start)
2) The regex should be regEX=re.compile(r"\b(?:{})\b".format('|'.join([r'student',r'study(?:ing)?'])),re.I)
The [ing] only matches 1 character, either i, n or g, while you intended to match an optional ing ending. A non-capturing group with a ? quantifier, (?:ing)?, matches 1 or 0 occurrences of ing.
Also, \b(x|y)\b is a more efficient pattern than \bx\b|\by\b, as it involves fewer backtracking steps.
Here is just a demo of what this regex looks like:
import re
pat = r"\b(?:{})\b".format('|'.join([r'student',r'study(?:ing)?']))
print(pat)
# => \b(?:student|study(?:ing)?)\b
regEX=re.compile(pat,re.I)
s = "He is studying here."
mObj = regEX.search(s)
if mObj:
    print(mObj.group(0))
    # => studying
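If the goal is to drop those rows from the file itself, the same pattern can be applied with pandas; a sketch (the column name 'bio' is an assumption, adjust it to your CSV, and to_excel would also work if openpyxl is installed):
import re
import pandas as pd

pat = r"\b(?:{})\b".format('|'.join([r'student', r'study(?:ing)?']))
csvdata = pd.read_csv("filename.csv")
# keep only the rows whose 'bio' column does NOT mention student/study/studying
mask = csvdata['bio'].str.contains(pat, case=False, regex=True, na=False)
csvdata[~mask].to_csv("filtered.csv", index=False)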
Using pandas in Python 2.7 I am attempting to count the number of times a phrase (e.g., "very good") appears in pieces of text stored in a CSV file. I have multiple phrases and multiple pieces of text. I have succeeded in this first part using the following code:
for row in df_book.itertuples():
    index, text = row
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()
    for row in df_phrase.itertuples():
        index, phrase = row
        count = sum(1 for x in re.finditer(r"\b%s\b" % (re.escape(phrase)), normed))
        file.write("%s," % (count))
However, I don't want to count the phrase if it's preceded by a different phrase (e.g., "it is not"). Therefore I used a negative lookbehind assertion:
for row in df_phrase.itertuples():
    index, phrase = row
    for row in df_negations.itertuples():
        index, negation = row
        count = sum(1 for x in re.finditer(r"(?<!%s )\b%s\b" % (negation, re.escape(phrase)), normed))
The problem with this approach is that it records a value for each and every negation as pulled from the df_negations dataframe. So, if finditer doesn't find "it was not 'very good'", then it will record a 0. And so on for every single possible negation.
What I really want is just an overall count for the number of times a phrase was used without a preceding phrase. In other words, I want to count every time "very good" occurs, but only when it's not preceded by a negation ("it was not") on my list of negations.
Also, I'm more than happy to hear suggestions on making the process run quicker. I have 100+ phrases, 100+ negations, and 1+ million pieces of text.
I don't really do pandas, but this cheesy non-Pandas version gives some results with the data you sent me.
The primary complication is that the Python re module does not allow variable-width negative look-behind assertions. So this example looks for matching phrases, saving the starting location and text of each phrase, and then, if it found any, looks for negations in the same source string, saving the ending locations of the negations. To make sure that negation ending locations are the same as phrase starting locations, we capture the whitespace after each negation along with the negation itself.
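(As an aside, the third-party regex package, unlike re, does allow variable-width lookbehind, so if installing it is an option a sketch like this would sidestep that limitation:)
import regex  # third-party: pip install regex

negations = ["not", "it was not"]
# a variable-width negative lookbehind: allowed by the regex package, rejected by re
pattern = r"(?<!\b(?:%s) )\bvery good\b" % "|".join(negations)
text = "it was not very good but later it was very good"
print(len(regex.findall(pattern, text)))  # => 1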
Repeatedly calling functions in the re module is fairly costly. If you have a lot of text as you say, you might want to batch it up, e.g. by using 'non-matching-string'.join() on some of your source strings.
import re
from collections import defaultdict
import csv

def read_csv(fname):
    with open(fname, 'r') as csvfile:
        result = list(csv.reader(csvfile))
    return result

df_negations = read_csv('negations.csv')[1:]
df_phrases = read_csv('phrases.csv')[1:]
df_book = read_csv('test.csv')[1:]

negations = (str(row[0]) for row in df_negations)
phrases = (str(re.escape(row[1])) for row in df_phrases)

# Capture the whitespace after each negation so that its end position lines
# up with the start of the phrase that follows it.
negation_pattern = r"\b((?:%s)\W+)" % '|'.join(negations)
phrase_pattern = r"\b(%s)\b" % '|'.join(phrases)

counts = defaultdict(int)
for row in df_book:
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', row[0]).lower().strip()
    # Find the location and text of any matching good groups
    phrases = [(x.start(), x.group()) for x in
               re.finditer(phrase_pattern, normed)]
    if not phrases:
        continue
    # If we had matches, find the end locations of matching bad groups
    negated = set(x.end() for x in re.finditer(negation_pattern, normed))
    for start, text in phrases:
        if start not in negated:
            counts[text] += 1
        else:
            print("%r negated and ignored" % text)

for pattern, count in sorted(counts.items()):
    print(count, pattern)