Why is the non-greedy Pattern greedy? - python

import re
from textblob import TextBlob
f = open('G:/temp1/words.srt')
fp = open('G:/temp1/words1.txt','w')
pattern = re.compile(r'/NN.+? .+?/VB/B-VP.+? .+?/NN')
for line in f:
    blob = TextBlob(line)
    for sentence in blob.sentences:
        if re.search(pattern, sentence.parse()):
            print(sentence, file=fp)
            print(sentence.parse(), file=fp)
f.close()
fp.close()
Input:
dogs eat bones.
it's a performance they put on at her school
result:
dogs eat bones.
dogs/NNS/B-NP/O eat/VB/B-VP/O bones/NNS/B-NP/O ././O/O
it's a performance they put on at her school
it/PRP/B-NP/O '/POS/O/O s/PRP/B-NP/O a/DT/I-NP/O performance/NN/I-NP/O they/PRP/I-NP/O put/VB/B-VP/O on/IN/B-PP/B-PNP at/IN/I-PP/I-PNP her/PRP$/B-NP/I-PNP school/NN/I-NP/I-PNP
question:
I want to get only lines 1-2 (dogs eat bones), but lines 3-4 were also selected. Why?

. matches anything, so yeah, both lines match that RE.
If you want to prevent NN.+? from matching more than one token, you would need to use something that says "anything but spaces" instead of "anything".
Using NN\S+ works, and then you don't need the non-greedy ?:
pattern = re.compile(r'/NN\S+ \S+/VB/B-VP\S+ \S+/NN')
Demo: https://regex101.com/r/8N6yKW/2
Compare with your original RE: https://regex101.com/r/EpFo5i/1
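As a quick check, here is a minimal sketch (using the parse strings from the question's own output) showing that the \S+ version matches only the first parse:
import re

# Parse strings copied from the question's output
parses = [
    "dogs/NNS/B-NP/O eat/VB/B-VP/O bones/NNS/B-NP/O ././O/O",
    "it/PRP/B-NP/O '/POS/O/O s/PRP/B-NP/O a/DT/I-NP/O performance/NN/I-NP/O they/PRP/I-NP/O put/VB/B-VP/O on/IN/B-PP/B-PNP at/IN/I-PP/I-PNP her/PRP$/B-NP/I-PNP school/NN/I-NP/I-PNP",
]
pattern = re.compile(r'/NN\S+ \S+/VB/B-VP\S+ \S+/NN')
for p in parses:
    print(bool(pattern.search(p)))
# Expected: True for the first parse, False for the second,
# because \S+ cannot cross the spaces between tokens.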

Related

pyspark dataframe: remove some whole words but case insensitive in a column

I am trying to remove some whole words (but case insensitive) in a pyspark dataframe column.
import re
s = "I like the book. i'v seen it. Iv've" # add a new phrase
exclude_words = ["I", "I\'v", "I\'ve"]
exclude_words_re = re.compile(r"\b(" + r"|".join(exclude_words) +r")\b|\s", re.I|re.M)
exclude_words_re.sub("" , s)
I added
"Iv've"
but got:
'like the book. seen it.'
"Iv've" should not have been removed, because it does not match any whole word in exclude_words.
Two changes to make to your code:
Use the proper regex flag to ignore case.
Add \b so that only whole words are matched.
import re
s = "I like the book. i'v seen it. Iv've I've"
exclude_words = ["I", "I\'v", "I\'ve"]
exclude_words_re = re.compile(r"(^|\b)((" + r"|".join(exclude_words) +r"))(\s|$)", re.I|re.M)
exclude_words_re.sub("" , s)
"like the book. seen it. Iv've "

Identify in-text Citations (in APA, MLA, Harvard, Vancouver, etc.) with Python

I'm trying to identify all sentences that contain in-text citations in a journal article in PDF format.
I converted the .pdf to .txt and wanted to find all sentences that contain a citation, possibly in one of the following formats:
Smith (1990) stated that....
An agreement was made on... (Smith, 1990).
An agreement was made on... (April, 2005; Smith, 1990)
Mixtures of the above
I first tokenized the txt into sentences:
import nltk
from nltk.tokenize import sent_tokenize
ss = sent_tokenize(text)
This makes ss a list, so I converted the list into a str to use re.findall:
def listtostring(s):
    str1 = ' '
    return str1.join(s)

ee = listtostring(ss)
Then, my idea was to identify the sentences that contained a four digit number:
import re
for sentence in ee:
    zz = re.findall(r'\d{4}', ee)
    if zz:
        print(zz)
However, this extracts only the years but not the sentences that contained the years.
Using regex, something (try it out) that can have decent recall while trying to avoid inappropriate matches (\d{4} may give you a few) is
\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)
A python example (using spaCy instead of NLTK) would then be
import re
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("One statement. Then according to (Smith, 1990) everything will be all right. Or maybe not.")
l = [sent.text for sent in doc.sents]
for sentence in l:
    if re.findall(r'\(([^)]+)?(?:19|20)\d{2}?([^)]+)?\)', sentence):
        print(sentence)
import re
l = ['This is 1234','Hello','Also 1234']
for sentence in l:
    if re.findall(r'\d{4}', sentence):
        print(sentence)
Output
This is 1234
Also 1234

How to extract person name using regular expression?

I am new to regular expressions and I have a kind of phone directory. I want to extract the names out of it. I wrote the code below, but it extracts lots of unwanted text rather than just names. Can you kindly tell me what I am doing wrong and how to correct it? Here is my code:
import re
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA#anet.com
Cellular: 612-393-0029
Dustin Andrews'''
nameRegex = re.compile('''
    (
        [A-Za-z]{2,25}
        \s
        ([A-Za-z]{2,25})+
    )
    ''', re.VERBOSE)
print(nameRegex.findall(directory))
the output it gives is:
[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]
Would be really grateful for help!
Your problem is that \s will also match newlines. Instead of \s just add a space. That is
name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names) then you may want to expand this to something like:
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)
This looks for one or more words and will stretch from the beginning to end of a line (e.g. will not just get 'John Paul' from 'John Paul Jones')
I can suggest trying the following regex; it works for me:
"([A-Z][a-z]+\s[A-Z][a-z]+)"
The following regex works as expected.
Related part of the code:
nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)
print(nameRegex.findall(directory))
Output:
>>> python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
Try:
nameRegex = re.compile('^((?:\w+\s*){2,})$', flags=re.MULTILINE)
This will only choose complete lines that are made up of two or more words composed of 'word' characters.
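Applied to the directory string from the question, this regex should pick out just the name lines; a quick sketch:
import re

nameRegex = re.compile('^((?:\w+\s*){2,})$', flags=re.MULTILINE)
print(nameRegex.findall(directory))
# Expected: ['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']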

How to parse a file sentence by sentence in Python

I need to read a large amount of large text files.
For each file, I need to open it and read in text sentence by sentence.
Most of the approaches I found read line by line.
How can I do it with Python?
If you want sentence tokenization, nltk is probably the quickest way to do it. The Punkt tokenizer (http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt)
will get you pretty far.
i.e., code from the docs:
>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
If the files have a large number of lines, you could make a generator using the yield statement:
def read(filename):
    file = open(filename, "r")
    for line in file.readlines():
        for word in line.split():
            yield word

for word in read("sample.txt"):
    print(word)
This would yield all the words of each line of the file.
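To get sentences rather than words, a minimal sketch (assuming nltk and its punkt data are installed; the filename is a placeholder) could combine file reading with sent_tokenize:
from nltk.tokenize import sent_tokenize

def read_sentences(filename):
    # Read the whole file, then yield one sentence at a time.
    # For very large files you may prefer to read and tokenize in chunks
    # (e.g. paragraph by paragraph) instead of all at once.
    with open(filename, "r") as f:
        text = f.read()
    for sentence in sent_tokenize(text):
        yield sentence

for sentence in read_sentences("sample.txt"):
    print(sentence)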

splitting strings using re.split

I have multiple strings (>1000) of the form:
\r\nSenor Sisig\nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n
The strings may have whitespace before the '\n'.
How do I split these strings (in an efficient way) so as to avoid getting any empty or duplicate (the whitespace case) elements?
I was using:
re.split(r'\r|\n', str)
EDIT:
some more examples:
\r\nThe Creme Brulee Cart \r\nCurry Up Now\r\nKoJa Kitchen\r\nAn the Go\r\nPacific Puffs\r\nEbbett's Good to Go\r\nFiveten Burger\r\nGo Streatery\r\nHiyaaa\r\nSAJJ\r\nKinder's Truck\r\nBlue Saigon\r
\r\nThe Chairman\r\nSanguchon\r\nSeoul on Wheels\r\nGo Streatery\r\nStreet Dog Truck\r\nKinder's Truck\r\nYummi BBQ\r\nLexie's Frozen Custard\r\nDrewski's Hot Rod Kitchen\r
\n An the Go \n Cheese Gone Wild \n Cupkates \n Curry Up Now \n Fins on the Hoof\n KoJa Kitchen\n Lobsta Truck \n Oui Chef \n Sanguchon\n Senor Sisig \n The Chairman \n The Rib Whip
thanks!
Your example doesn't show any "whitespace before the \n" except for a single optional \r.
If that's all you're trying to handle, instead of splitting on either \r or \n, split on a possible \r and a definite \n:
re.split(r"\r?\n", s)
Of course that's assuming you don't have any bare \r without \n to handle. If you need to handle \r, \r\n, and \n all equally (similar to Python's universal newline support…):
re.split(r"\r|\n|(\r\n)", s)
Or, more simply:
re.split(r"(\r|\n)+", s)
If you want to remove leading spaces, tabs, multiple \r, etc., you could do that in the regexp, or just call lstrip on each result:
map(str.lstrip, re.split(r"\r|\n", s))
… but that can leave you with empty elements. You could filter those out, but it's probably better to just split on any run of whitespace that ends with a \n instead:
re.split(r"\s*\n", s)
That will still leave empty elements at the start and end, because your string starts and ends with newlines, and that's what re.split is supposed to do. If you want to eliminate them, you can either strip the string before parsing, or toss the end values after parsing:
re.split(r"\s*\n", s.strip())
re.split(r"\s*\n", s)[1:-1]
I think one of these last two is exactly what you want… but that's really just a guess based on the limited information you gave. If not, then one of the others (along with its explanation) should hopefully be enough for you to write what you really want.
From your new examples, it looks like what you really want to split on is any run of whitespace that includes at least one \n. And your input may or may not have newlines at the start and end (your first example has both, your second has \r\n at the start but nothing at the end…), and you want to ignore them if it does. So:
re.split(r"\s*\n\s*", s.strip())
However, at this point, it might be worth asking why you're trying to parse this as a string instead of as a text file. Assuming you got these from some file or file-like object, instead of this:
with open(path, 'r') as f:
    s = f.read()
results = re.split(regexpr, s.strip())
… something like this might be a lot more readable, and more than fast enough (maybe not as fast as the optimal regexp, but still so fast that any wasted string-processing time is swamped by the actual file reading time anyway):
with open(path, 'r') as f:
    results = filter(None, map(str.strip, f))
Especially if you just want to iterate over this list once, in which case (assuming either Python 3.x, or using ifilter and imap from itertools if 2.x) this version doesn't have to read the whole file into memory and process it before you start doing your actual work.
re.split(r'[\s\n\r]+', str.strip())
>>> s = "\r\nSenor Sisig\nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n"
>>> [x for x in s.strip("\r\n").split("\n") if x]
['Senor Sisig', 'The Chairman', 'Cupkates', 'Little Green Cyclo', 'Sanguchon', 'Seoul on Wheels', 'Kasa Indian', 'Go Streatery', 'Whip Out!', 'Liba Falafel', 'Grilled Cheese Bandits']
If you insist on regex
>>> import re
>>> re.split(r"[\r\n]+", s.strip("\r\n"))
['Senor Sisig', 'The Chairman', 'Cupkates', 'Little Green Cyclo', 'Sanguchon', 'Seoul on Wheels', 'Kasa Indian', 'Go Streatery', 'Whip Out!', 'Liba Falafel', 'Grilled Cheese Bandits']
Just filter out the empty values
from itertools import ifilter  # Python 2; in Python 3 use the built-in filter()
list(ifilter(None, re.split(r"\r|\n", your_string)))
Python's regular expressions offer the \s character class, which matches any whitespace in [ \t\n\r\f\v] (unless the UNICODE flag is set, in which case it depends on the character database in use).
As mentioned in the other answers (@abarnert), your regex could be \s*\n, which is zero or more whitespace characters ending with a \n. Below is an example.
In [1]: import re
In [2]: from itertools import ifilter
In [3]: my_string = """\r\nSenor Sisig \nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n"""
In [4]: list(ifilter(None, re.split(r"\s*\n", my_string)))
Out[4]:
['Senor Sisig',
'The Chairman',
'Cupkates',
'Little Green Cyclo',
'Sanguchon',
'Seoul on Wheels',
'Kasa Indian',
'Go Streatery',
'Whip Out!',
'Liba Falafel',
'Grilled Cheese Bandits']
Note that I'm using ifilter from the itertools package. You could use filter or a list comp.
Like so:
[x for x in re.split("\s*\n", my_string) if x]
