How to parse a file sentence by sentence in Python

I need to read a large number of large text files.
For each file, I need to open it and read the text sentence by sentence.
Most of the approaches I found read line by line.
How can I do this in Python?

If you want sentence tokenization, nltk is probably the quickest way to do it. http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt will get you pretty far.
For example, from the docs:
>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
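Applied to the original question, a minimal sketch (assuming the punkt model is already downloaded, e.g. via nltk.download('punkt'), and that each file fits in memory; 'sample.txt' is a placeholder path):
>>> import nltk.data
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> with open('sample.txt') as f:
...     for sentence in sent_detector.tokenize(f.read().strip()):
...         print(sentence)  # or hand each sentence to your own processing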

If the files have large numbers of lines, you can make a generator using the yield statement:
def read(filename):
    with open(filename, "r") as file:
        for line in file:  # iterate lazily instead of loading everything with readlines()
            for word in line.split():
                yield word

for word in read("sample.txt"):
    print(word)
This yields every word of every line of the file.
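The same idea can be adapted to yield sentences instead of words. A rough sketch, assuming sentences end with ., ! or ? (far more naive than punkt, which knows about abbreviations):
import re

def read_sentences(filename):
    buffer = ''
    with open(filename, 'r') as file:
        for line in file:  # stream line by line
            buffer += ' ' + line.strip()
            # split off every complete sentence accumulated so far
            parts = re.split(r'(?<=[.!?])\s+', buffer)
            for sentence in parts[:-1]:
                yield sentence.strip()
            buffer = parts[-1]  # keep the unfinished tail
    if buffer.strip():
        yield buffer.strip()

for sentence in read_sentences("sample.txt"):
    print(sentence)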

Related

Why is the non-greedy Pattern greedy?

import re
from textblob import TextBlob

f = open('G:/temp1/words.srt')
fp = open('G:/temp1/words1.txt', 'w')
pattern = re.compile(r'/NN.+? .+?/VB/B-VP.+? .+?/NN')
for line in f:
    blob = TextBlob(line)
    for sentence in blob.sentences:
        if re.search(pattern, sentence.parse()):
            print(sentence, file=fp)
            print(sentence.parse(), file=fp)
f.close()
fp.close()
Input:
dogs eat bones.
it's a performance they put on at her school
Result:
dogs eat bones.
dogs/NNS/B-NP/O eat/VB/B-VP/O bones/NNS/B-NP/O ././O/O
it's a performance they put on at her school
it/PRP/B-NP/O '/POS/O/O s/PRP/B-NP/O a/DT/I-NP/O performance/NN/I-NP/O they/PRP/I-NP/O put/VB/B-VP/O on/IN/B-PP/B-PNP at/IN/I-PP/I-PNP her/PRP$/B-NP/I-PNP school/NN/I-NP/I-PNP
Question: I want to match only lines 1-2 (dogs eat bones), but lines 3-4 were also selected. Why?
. matches anything, including spaces and slashes, so yes, both lines match that RE.
If you want to prevent NN.+? from matching more than one token, you need something that says "anything but spaces" instead of "anything".
Using NN\S+ works, and then you don't need the ?:
pattern = re.compile(r'/NN\S+ \S+/VB/B-VP\S+ \S+/NN')
Demo: https://regex101.com/r/8N6yKW/2
Compare with your original RE: https://regex101.com/r/EpFo5i/1
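A quick check of the corrected pattern against the two tagged lines from the question (it should match the first and not the second):
import re

pattern = re.compile(r'/NN\S+ \S+/VB/B-VP\S+ \S+/NN')
lines = [
    "dogs/NNS/B-NP/O eat/VB/B-VP/O bones/NNS/B-NP/O ././O/O",
    "it/PRP/B-NP/O '/POS/O/O s/PRP/B-NP/O a/DT/I-NP/O performance/NN/I-NP/O they/PRP/I-NP/O put/VB/B-VP/O on/IN/B-PP/B-PNP at/IN/I-PP/I-PNP her/PRP$/B-NP/I-PNP school/NN/I-NP/I-PNP",
]
for line in lines:
    print(bool(pattern.search(line)))  # True, then False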

Scraping a sentence across many lines | Recursive error unresolved

Goal: if a PDF line contains a sub-string, copy the entire sentence (across multiple lines).
I am able to print() the line the phrase appears in.
Now, once I find this line, I want to iterate backwards until I find a sentence terminator (. ! ?) from the previous sentence, then iterate forward again until the next sentence terminator.
This is so I can print() the entire sentence the phrase belongs to.
However, I get a RecursionError, with scrape_sentence() stuck running infinitely.
Jupyter Notebook:
# pip install PyPDF2
# pip install pdfplumber
# ---
# import re
import glob
import PyPDF2
import pdfplumber
# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.pattern('^[A-Z][^?!.]*[?.!]$')

def scrape_sentence(sentence, lines, index, phrase):
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence.replace('\n', '').strip()
    sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1, phrase)  # previous line
    sentence = scrape_sentence(sentence + lines[index+1], lines, index+1, phrase)  # following line
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    sentence = sentence[0]  # first occurrence
    print(sentence)
    return sentence
# ---
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i)  # !
                print(sentence)  # !
            i += 1
Output:
connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there
Phrase:
Responsible Care Company
Sentence (across multiple lines):
"GPIC is a Responsible Care Company certified for RC 14001
since July 2010."
PDF (pg. 2).
Please let me know if there is anything else I can add to the post.
I solved this problem by removing all recursion from scrape_sentence().
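For reference, a non-recursive sketch of the same idea (my reconstruction under stated assumptions, not the exact code of the solution): walk backwards and forwards from the matching line until a sentence terminator is seen on each side, then keep the fragment containing the phrase.
def scrape_sentence(lines, index, phrase):
    def has_terminator(line):
        return '.' in line or '!' in line or '?' in line
    # walk back until the previous line contains a terminator
    start = index
    while start > 0 and not has_terminator(lines[start - 1]):
        start -= 1
    start = max(start - 1, 0)  # include that boundary line; the sentence may start partway through it
    # walk forward until the current line contains a terminator
    end = index
    while end < len(lines) - 1 and not has_terminator(lines[end]):
        end += 1
    text = ' '.join(lines[start:end + 1]).replace('\n', ' ')
    # normalise terminators, then keep the fragment containing the phrase
    fragments = text.replace('!', '.').replace('?', '.').split('.')
    matches = [f for f in fragments if phrase in f]
    return matches[0].strip() if matches else ''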

Extracting words/phrase followed by a phrase

I have a text file with a list of phrases. Below is how the file looks:
Filename: KP.txt
From the input (paragraph) below, I want to extract the next 2 words after any phrase from KP.txt (the phrases could be anything, as shown in my KP.txt file above). All I need is to extract the next 2 words.
Input:
This is Lee. Thanks for contacting me. I wanted to know the exchange policy at Noriaqer hardware services.
In the above example, the phrase "I wanted to know" matches the KP.txt file content. So if I want to extract the next 2 words after it, my output will be "exchange policy".
How could I extract this in Python?
Assuming you already know how to read the input file into a list, it can be done with some help from regex.
>>> import re
>>> wordlist = ['I would like to understand', 'I wanted to know', 'I wish to know', 'I am interested to know']
>>> input_text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
>>> def word_extraction(input_text, wordlist):
...     for word in wordlist:
...         if word in input_text:
...             output = re.search(r'(?<=%s)(.\w*){2}' % word, input_text)
...             print(output.group().lstrip())
>>> word_extraction(input_text, wordlist)
exchange policy
>>> input_text = 'This is Lee. Thanks for contacting me. I wish to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)
where is
>>> input_text = 'This is Lee. Thanks for contacting me. I\'d like to know where is Noriaqer hardware.'
>>> word_extraction(input_text, wordlist)
>>>
First we check whether any of the phrases we want is in the sentence. It's not the most efficient way if you have a large list, but it works for now.
Next, if it is in our "dictionary" of phrases, we use regex to extract the two words that follow it.
Finally we strip the leading whitespace in front of our target words.
Regex hint:
(?<=%s) is a lookbehind assertion: it matches only at a position immediately preceded by the phrase, e.g. "I wanted to know".
(.\w*){2} matches one character (the space) followed by zero or more word characters, twice, i.e. it stops 2 words after the key phrase.
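One caveat worth adding (my suggestion, not part of the original answer): Python lookbehinds must be fixed-width, and a phrase containing regex metacharacters such as ? or ( would break the pattern, so it is safer to escape the phrase before interpolating it:
output = re.search(r'(?<=%s)(.\w*){2}' % re.escape(word), input_text)  # re.escape neutralises metacharacters in the phrase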
I think natural language processing could be a better solution, but this code would help :)
def search_in_text(kp, text):
    for line in kp:
        # if a search phrase is found in the kp lines
        if line in text:
            # the starting index of the two words
            i1 = text.find(line) + len(line)
            # the end index of the following two words (first index + 50 at maximum)
            i2 = (i1 + 50) if len(text) > (i1 + 50) else len(text)
            # split the following text into words (next_words) and remove empty spaces
            next_words = [word for word in text[i1:i2].split(' ') if word != '']
            # return only the next two words from (next_words)
            return next_words[0:2]
    return []  # return an empty list if no phrase matches

# read your kp file as a list of lines
kp = open("kp.txt").read().split("\n")

# input 1
text = 'This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.'
print('input ->>', text)
output = search_in_text(kp, text)
print('output ->>', output)
input ->> This is Lee. Thanks for contacting me. I wanted to know exchange policy at Noriaqer hardware services.
output ->> ['exchange', 'policy']

# input 2
text = 'Boss was very angry and said: I wish to know why you are late?'
print('input ->>', text)
output = search_in_text(kp, text)
print('output ->>', output)
input ->> Boss was very angry and said: I wish to know why you are late?
output ->> ['why', 'you']
You can use this:
with open("KP.txt") as fobj:
    phrases = list(map(lambda sentence: sentence.lower().strip(), fobj.readlines()))

paragraph = input("Enter The Whole Paragraph in one line:\t").lower()
for phrase in phrases:
    if phrase in paragraph:
        temp = paragraph.split(phrase)[1:]
        for clause in temp:
            print(" ".join(clause.split()[:2]))
For the sample paragraph this prints the two words following each occurrence of a matched phrase, e.g. exchange policy.

Identify in-text Citations (in APA, MLA, Harvard, Vancouver, etc.) with Python

I'm trying to identify all sentences that contain in-text citations in a journal article in PDF format.
I converted the .pdf to .txt and wanted to find all sentences that contain a citation, possibly in one of the following formats:
Smith (1990) stated that....
An agreement was made on... (Smith, 1990).
An agreement was made on... (April, 2005; Smith, 1990)
Mixtures of the above
I first tokenized the txt into sentences:
import nltk
from nltk.tokenize import sent_tokenize
ss = sent_tokenize(text)
This makes type(ss) a list, so I converted the list into a str to use re.findall:
def listtostring(s):
    str1 = ' '
    return (str1.join(s))

ee = listtostring(ss)
Then, my idea was to identify the sentences that contained a four-digit number:
import re
for sentence in ee:
    zz = re.findall(r'\d{4}', ee)
    if zz:
        print(zz)
However, this extracts only the years but not the sentences that contained the years.
Using regex, something (try it out) that can have decent recall while avoiding inappropriate matches (\d{4} on its own may give you a few false positives) is
\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)
A python example (using spaCy instead of NLTK) would then be
import re
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("One statement. Then according to (Smith, 1990) everything will be all right. Or maybe not.")
l = [sent.text for sent in doc.sents]
for sentence in l:
    if re.findall(r'\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)', sentence):
        print(sentence)
A simpler fix for the original code is to keep the list of sentences and test each one:
import re
l = ['This is 1234', 'Hello', 'Also 1234']
for sentence in l:
    if re.findall(r'\d{4}', sentence):
        print(sentence)
Output
This is 1234
Also 1234

Python: How to format large text outputs to be 'prettier' and user defined

Ahoy StackOverflow-ers!
I have a rather trivial question, but it's something that I haven't been able to find in other questions here or in online tutorials: how might we format the output of a Python program so that it fits a certain aesthetic format, without any extra modules?
The aim: I have a block of plain text, like that from a newspaper article, which I've already filtered to extract just the words I want. Now I'd like to print it out so that each line is at most 70 characters long and no word is broken across a line break.
Using .ljust(70), as in stdout.write(article.ljust(70)), doesn't seem to do anything to it.
So rather than words being broken like this:
Latest news tragic m
urder innocent victi
ms family quiet neig
hbourhood
it should look more like this:
Latest news tragic
murder innocent
victims family
quiet neighbourhood
Thank you all kindly in advance!
Check out the python textwrap module (a standard library module):
>>> import textwrap
>>> t = """Latest news tragic murder innocent victims family quiet neighbourhood"""
>>> print("\n".join(textwrap.wrap(t, width=20)))
Latest news tragic
murder innocent
victims family quiet
neighbourhood
>>>
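For reference, textwrap.fill does the join in one call:
>>> print(textwrap.fill(t, width=20))
Latest news tragic
murder innocent
victims family quiet
neighbourhood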
Use the textwrap module: http://docs.python.org/library/textwrap.html
I'm sure this can be improved on. Without any libraries:
def wrap_text(text, wrap_column=80):
    sentence = ''
    for word in text.split(' '):
        if len(sentence) + len(word) + 1 <= wrap_column:  # would the word still fit?
            sentence = (sentence + ' ' + word).strip()
        else:
            print(sentence)
            sentence = word
    print(sentence)
EDIT: From the comment, if you want to use regular expressions to just pick out words, use this:
import re

def wrap_text(text, wrap_column=80):
    sentence = ''
    for word in re.findall(r'\w+', text):
        if len(sentence) + len(word) + 1 <= wrap_column:
            sentence = (sentence + ' ' + word).strip()
        else:
            print(sentence)
            sentence = word
    print(sentence)
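A quick usage example with the question's sample text, calling the fixed function above:
wrap_text("Latest news tragic murder innocent victims family quiet neighbourhood", 20)
which prints:
Latest news tragic
murder innocent
victims family quiet
neighbourhood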
