Python regex fullmatch doesn't work as expected

I have a text file that contains some sentences. I'm checking whether each one is a valid sentence based on some rules and writing "valid" or "not valid" to a separate text file. My main problem is that when I press Ctrl+F and enter my regex in the search bar, it matches the strings I want to match, but in code it behaves differently. Here is my code:
import re
pattern = re.compile('(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')
text=open('validSentences',"w+")
with open('sentences.txt',encoding='utf8') as file:
    lines = file.readlines()
    for line in lines:
        matches = pattern.fullmatch(line)
        if(matches==None):
            text.write("not valid"+"\n")
        else:
            text.write("valid"+"\n")
file.close()
The documentation says that fullmatch matches only when the whole string matches, and that's what I'm trying to do, but this code writes "not valid" for every sentence I have. The text file that I have:
How can you say that to me?
As he looked at his reflection in the mirror, he took a deep breath.
He nodded at himself and, feeling braver, he stepped outside the bathroom. He bumped straight into the
extremely tall man, who was waiting by the door.
David said ‘Oh, sorry!’.
The happy pair discussed their future life 2gether and shared sweet words of admiration.
We will not stop you; I promise!
Come here ASAP!
He pushed his chair back and went to the kitchen at 2 pM.
I do not know...
The main character in the movie said: "Play hard. Work harder."
When I enter my regex in VS Code with Ctrl+F, the whole first, second, fourth, seventh and eighth lines are highlighted, so according to the fullmatch() function they should be written as "valid", but they aren't. I need help with this issue.

First, you don't need lines = file.readlines(): a file object can be iterated line by line directly, and readlines() leaves the file handle at the end of the stream, so a later loop over file itself would yield nothing. Then, keep in mind that with for line in lines: (or for line in file:), the line variable has a trailing newline, which is why fullmatch fails. So
Either use line=line.rstrip() to remove the trailing whitespace before running the regex, or
Ensure your pattern ends in \n? (an optional newline), or even \s* (zero or more whitespace characters).
So, a possible solution looks like
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        matches = pattern.fullmatch(line.rstrip('\n'))
        ...
Or,
pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')
#...
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        ....
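To see the trailing-newline effect in isolation, here is a minimal sketch that reuses the pattern from this answer (without the trailing \s*) on two of the sample sentences. The classify helper is a hypothetical name introduced only for illustration.

```python
import re

# Pattern copied from the answer above, without the trailing \s*.
pattern = re.compile(
    r'([A-Z][a-z\s,]*)'
    r'(?:: ["‘][a-z,!?\.\s]*["’][.,!?]'
    r'|; [a-zA-Z\s]*[!.?]'
    r'|\s["‘][a-z,.;!?\s]*["’]'
    r'|[.?!])'
)


def classify(line):
    """Return 'valid' if the line, minus its trailing newline, fully matches."""
    return "valid" if pattern.fullmatch(line.rstrip("\n")) else "not valid"


raw = "How can you say that to me?\n"
print(pattern.fullmatch(raw))         # None: the trailing "\n" is left unmatched
print(classify(raw))                  # valid
print(classify("Come here ASAP!\n"))  # not valid
```

An editor's search box only highlights the part of a line that matches, so it never has to account for the line terminator, whereas fullmatch must consume every character of the string it is given.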

Related

python regex matching paragraph(s) starting with labels

I'm trying to match a paragraph or paragraphs which are led by letters. I'm testing on and have tried DOTALL, lookaheads, MULTILINE, etc., and I can't seem to get one to work. The string I'm trying to match looks like this:
A-B: Object, procedure:
- Somethings.
- More things, might run over several lines like this where the sentence just keeps on going and going and going and sometimes isn't even a sentence.
- Another line, sometimes not ending with period
- Variable amount of white space at the beginning of new lines
Comment (A-B): sometimes, there are comments which are separated by two \n\n characters like this.*
C. Second object, other procedure:
- More lines.
- Can have various leads (including no ' - ' leading.
- Variable number of lines.
The closest I've come to a match was using '(.+?\n\n|.+?$)' and dotALL (which I realize is sloppy), but even this didn't work because it misses comments or paragraphs separated by more lines but still under the header ([A-Z]?-?[A-Z]).
Ideally I'd like to capture the header or title (A-B:) or (C.) in match.group(1) and the rest of the paragraph(s) before the next title in match.group(2), but I'd be happy just to capture everything. I tried lookaheads to catch everything between titles, but that misses the last instance, which won't have a title after it.
I'm a newb and I apologize if this has already been answered or if I'm not clear (but I have been looking for the past 2 days without success). Thanks!

So here is my proposed solution for you :)
import re
with open('./samplestring.txt') as f:
    header = []
    nonheader = []
    content = f.read()
    for line in content.splitlines():
        # A header line starts with a label such as "A-B:" or "C."
        if re.match(r'(^[A-Z]?-?[A-Z]:)|(^[A-Z]\.)', line.lstrip()):
            header.append(line)
        else:
            nonheader.append(line)

I ended up giving up on capturing comments and everything after them. I used the following code to capture the letter for each header (group(1)), the text for the header (group(2)), and the text in the paragraph excluding comments (group(3)).
([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) +(\w.+)\n+((\s*(- *.+))+)
([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) +
captures the letter (group 1), the colon or period, and the space(s) after that
(\w.+)\n+
captures the text of the header, and the next line(s)
((\s*(- *.+))+)
captures multiple lines starting variably with a space, dash, space, and text
I appreciate all your help with this! :)

You can use
(^[^\n]+)(?:\n *-.+(?:\n.+)*|\n\n.+\n)+
(^[^\n]+) - Match the header line, then repeatedly alternate between
\n *-.+(?:\n.+)* - A non-comment line: starts with whitespace, followed by -, optionally running across multiple lines
\n\n.+\n - Or, match a comment line
(no dotall flag)
https://regex101.com/r/6kle0u/2
This depends on the comment lines always having \n\n before them.
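As a rough check of how this pattern behaves, here is a small sketch. The sample text is a hypothetical, shortened version of the string in the question, and the re.MULTILINE flag is an assumption on my part so that ^ can match at the start of every header line rather than only at the start of the text.

```python
import re

# Pattern from the answer above. re.MULTILINE is an assumption: it lets ^
# match at the start of each line, so finditer can find every section.
section = re.compile(r'(^[^\n]+)(?:\n *-.+(?:\n.+)*|\n\n.+\n)+', re.MULTILINE)

# Hypothetical, shortened sample in the shape described in the question.
text = (
    "A-B: Object, procedure:\n"
    " - Somethings.\n"
    " - More things.\n"
    "\n"
    "Comment (A-B): a comment after a blank line.\n"
    "C. Second object, other procedure:\n"
    " - More lines.\n"
    " - Variable number of lines.\n"
)

matches = list(section.finditer(text))
headers = [m.group(1) for m in matches]
for m in matches:
    print(repr(m.group(1)))  # the header line of each section
```

One thing to keep in mind: (?:\n.+)* accepts any non-empty line, so a header that directly follows a dash line with no blank line or comment in between would be absorbed into the previous section.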

Python regex a binary file text file - how to use a range of numbers and word boundry?

I have a text file that requires me to read it in binary and write out in binary. No problem. I need to mask out Social Security Numbers with Xs, pretty easy normally:
text = re.sub("\\b\d{3}-\d{2}-\d{4}\\b","XXX-XX-XXXX", text)
This is a sample of the text I'm parsing:
more stuff here
CHILDREN�S 001-02-0003 get rid of that
stuff goes here
not001-02-0003
but ssn:001-02-0003
and I need to turn it into this:
more stuff here
CHILDREN�S XXX-XX-XXXX get rid of that
stuff goes here
not001-02-0003
but ssn:XXX-XX-XXXX
Super! So now I'm trying to write that same regex 'in binary'. Here is what I've got; it 'works', but gosh, it doesn't feel right at all:
line = re.sub(b"\\B(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\\B", b"\x00X\x00X\x00X\x00-\x00X\x00X\x00-\x00X\x00X\x00X\x00X", line)
Notes:
that junk in CHILDRENS, gotta keep it like that
need to word boundary, thus the 4th line doesn't get masked out
Shouldn't my regex be a range of numbers instead? I just don't know how to do that in binary. And my word boundaries only work backwards as \B instead of \b, uh.. what is up with that?
UPDATE: I've also tried this:
line = re.sub(b"[\x30-\x39]", b"\x58", line)
and that does it for EVERY number, but if I try to even do something simple like:
line = re.sub(b"[\x30-\x39][\x30-\x39]", b"\x58\x58", line)
it doesn't match anything anymore, any idea why?

You might try:
import re
# The file is opened in binary mode, so the pattern and the replacement must be bytes.
rx = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')
with open("test.txt", "rb") as fr, open("test2.txt", "wb+") as fp:
    repl = rx.sub(b'XXX-XX-XXXX', fr.read())
    fp.write(repl)
This keeps all junk characters as they are and writes them to test2.txt.
Note that when you don't want to escape every backslash, you can use a raw string, r'string here' (or rb'string here' for bytes), in Python.
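Here is a small in-memory sketch of a bytes pattern with a digit range and word boundaries. The sample bytes are hypothetical: \x92 stands in for the undecodable character in the question's data. It also assumes a single-byte encoding in which the digits are adjacent bytes. The \x00 bytes in the question's own pattern suggest the real file may interleave null bytes (as UTF-16 would), in which case this pattern would not match as written.

```python
import re

# \d and \b work in bytes patterns too; they apply ASCII rules there.
ssn = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

# Hypothetical sample; \x92 stands in for the junk byte that must be preserved.
data = (
    b"CHILDREN\x92S 001-02-0003 get rid of that\n"
    b"not001-02-0003\n"
    b"but ssn:001-02-0003\n"
)

masked = ssn.sub(b"XXX-XX-XXXX", data)
print(masked)
```

The second line is left alone because there is no word boundary between the "t" and the first digit, while the colon in the third line does create one.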

Where i am wrong? Count total words excluding header and footer in python?

I am trying to read this file, test.txt, and count the total number of words in it.
I have written this code for it:
import re
import string
def create_wordlist(filename, is_Gutenberg=True):
    words = 0
    wordList = []
    data = False
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    file1 = open("temp",'w+')
    with open(filename, 'r') as file:
        if is_Gutenberg:
            for line in file:
                if line.startswith("*** START "):
                    data = True
                    continue
                if line.startswith("End of the Project Gutenberg EBook"):
                    #data = False
                    break
                if data:
                    line = line.strip().replace("-"," ")
                    line = line.replace("_"," ")
                    line = regex.sub("",line)
                    for word in line.split():
                        wordList.append(word.lower())
        #print(wordList)
        #words = words + len(wordList)
    return len(wordList)
    #return wordList
create_wordlist('test.txt', True)
Here are a few rules to be followed:
1. Strip off whitespace and punctuation.
2. Replace hyphens with spaces.
3. Skip the file header and footer. The header ends with a line that starts with "*** START OF THIS" and the footer starts with "End of the Project".
My answer is 60513 but the expected answer is 60570. The expected answer came with the question itself; it may be correct or wrong. Where am I going wrong?

You give a number for the actual answer -- the answer you consider correct, that you want your code to output.
You did not tell us how you got that number.
It looks to me like the two numbers come from different definitions of "word".
For example, you have in your example text several numbers in the form:
140,000,000
Is that one word or three?
You are replacing hyphens with spaces, so a hyphenated word will be counted as two. Other punctuation you are removing. That would make the above number (and there are other, similar, examples in your text) into one word. Is that what you intended? Is that what was done to get your "correct" number? I suspect this is all or part of your difference.
At a quick glance, I see three numbers in the form above (counted as either 3 or 9, difference 6)
I see 127 apostrophes (words like wife's, which could be counted as either one word or two) for a difference of 127.
Your difference is 57, so the answer is not quite so simple, but I still strongly suspect different definitions of what is a word, for specific corner cases.
By the way, I am not sure why you are collecting all the words into a huge list and then getting the length. You could skip the append loop and just accumulate a sum of len(line.split()). This would remove complexity, which lessens the possibility of bugs (and probably make the program faster, if that matters in this case)
Also, you have a line:
if line.startswith("*** START " in"):
When I try that in my python interpreter, I get a syntax error. Are you sure the code you posted here is what you are running? I would have expected:
if line.startswith("*** START "):

Without an example text file that shows this behaviour it is difficult to guess what goes wrong. But there is one clue: your number is less than what you expect. That seems to imply that you somehow glue together separate words, and count them as a single word. And the obvious candidate for this behaviour is the statement line = regex.sub("",line): this replaces any punctuation character with an empty string. So if the text contains that's, your program changes this to thats.
If that is not the cause, you really need to provide a small sample of text that shows the behaviour you get.
Edit: if your intention is to treat punctuation as word separators, you should replace the punctuation character with a space, so: line = regex.sub(" ",line).
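To make the effect of the two definitions of "word" concrete, here is a minimal sketch that counts the same line both ways, using a running total instead of a list. The count_words function and the sample line are hypothetical, chosen only to contain an apostrophe, a hyphen, and a comma-separated number.

```python
import re
import string

# Same character class as in the question: every ASCII punctuation character.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def count_words(lines, replacement):
    """Count words, replacing each punctuation character with `replacement`."""
    total = 0
    for line in lines:
        line = line.strip().replace("-", " ").replace("_", " ")
        line = PUNCTUATION.sub(replacement, line)
        total += len(line.split())  # running total, no word list needed
    return total


sample = ["that's a well-known fact: 140,000,000\n"]
removed = count_words(sample, "")   # punctuation deleted: "thats", "140000000"
spaced = count_words(sample, " ")   # punctuation becomes a separator
print(removed, spaced)              # 6 9
```

Neither count is "the" right one. Which of the two the expected answer was based on is not known from the question, so it is worth trying both against the real file.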

split() not splitting all white spaces?

I am trying to take a text document and write each word separately into another text document. My only issue is that, with the code I have, sometimes the words aren't all split on the white space, and I'm wondering if I'm just using .split wrong. If so, could you explain why, or what to do better?
Here's my code:
list_of_words = []
with open('ExampleText.txt', 'r') as ExampleText:
    for line in ExampleText:
        for word in line.split(''):
            list_of_words.append(word)
    print("Done!")
print("Also done!")
with open('TextTXT.txt', 'w') as EmptyTXTdoc:
    for word in list_of_words:
        EmptyTXTdoc.write("%s\n" % word)
    EmptyTXTdoc.close()
This is the first line in the ExampleText text document as it is written in the newly created EmptyTXTdoc:
Submit
a personal
statement
of
research
and/or
academic
and/or
career
plans.

Use .split() (or .split(' ') for only spaces) instead of .split('').
Also, consider sanitizing the line with .strip() on every iteration over the file, since each line is read with a trailing newline (\n).

.split('') will not split on a space because there isn't a space between the two quote marks. You're telling it to split on, well, nothing, and Python rejects an empty separator with a ValueError.
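The difference between the three calls can be seen on a single string. This is a small sketch with a hypothetical line that contains a double space, a tab, and a trailing newline.

```python
line = "Submit a  personal\tstatement\n"

# No argument: split on any run of whitespace and drop leading/trailing whitespace.
on_whitespace = line.split()
print(on_whitespace)   # ['Submit', 'a', 'personal', 'statement']

# A single space: only spaces separate, so the tab and newline stay in the words
# and the double space produces an empty string.
on_space = line.split(' ')
print(on_space)        # ['Submit', 'a', '', 'personal\tstatement\n']

# An empty separator is not allowed at all.
try:
    line.split('')
except ValueError as exc:
    error_message = str(exc)
    print(error_message)   # empty separator
```

If the output file shows two words on one line, a tab or other non-space whitespace between them in the source text is one plausible cause, which .split() with no argument handles.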

Deleting prior two lines of a text file in Python where certain characters are found

I have a text file that is an ebook. Sometimes full sentences are broken up by two new lines, and I am trying to get rid of these extra new lines so the sentences are not separated mid-sentence by the new lines. The file looks like
Here is a regular sentence.
This one is fine too.
However, this
sentence got split up.
If I hit delete twice on the keyboard, it'd fix it. Here's what I have so far:
with open("i.txt","r") as input:
    with open("o.txt","w") as output:
        for line in input:
            line = line.strip()
            if line[0].isalpha() == True and line[0].isupper() == False:
                # something like hitting delete twice on the keyboard
                output.write(line + "\n")
            else:
                output.write(line + "\n")
Any help would be greatly appreciated!

If you can read the entire file into memory, then a simple regex will do the trick - it says, replace a sequence of new lines preceding a lowercase letter, with a single space:
import re
i = open('i.txt').read()
o = re.sub(r'\n+(?=[a-z])', ' ', i)
open('o.txt', 'w').write(o)
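The same substitution can be tried on an in-memory string before touching any files. This is a small sketch in which the text variable is a hypothetical copy of the sample from the question, with the blank line restored between the two halves of the split sentence.

```python
import re

# Hypothetical in-memory copy of the sample file from the question.
text = (
    "Here is a regular sentence.\n"
    "This one is fine too.\n"
    "However, this\n"
    "\n"
    "sentence got split up.\n"
)

# Any run of newlines that is followed by a lowercase letter becomes one space.
joined = re.sub(r'\n+(?=[a-z])', ' ', text)
print(joined)
```

The lookahead (?=[a-z]) only checks the next character without consuming it, so the first letter of the continued sentence is kept. A continuation that starts with an uppercase letter or a digit would not be joined by this pattern.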

The important difference here is that you are not in an editor but writing out lines. That means you can't 'go back' to delete things; instead, you have to recognise that they are wrong before you write them.
In this case, you will iterate five times. Each time you will get a string ending in \n to indicate a newline, and in two of those cases you want to remove that newline. The easiest way I can see to identify those cases is that there is no full stop at the end of the line. If we check for that, and strip the newline off in that case, we can get the result you want:
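with open("i.txt", "r") as input, open("o.txt", "w") as output:
    for line in input:
        if line.endswith(".\n"):
            output.write(line)
        elif line.strip():
            # Broken line: drop the newline and add a space so the words don't run together.
            output.write(line.rstrip("\n") + " ")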
Obviously, sometimes it will not be possible to tell in advance that you need to make a change. In such cases, you will either need to make two passes over the file (the first to find where you want to make changes, the second to make them) or store part or all of the file in memory until you know what you need to change. Note that if your files are extremely large, storing them in memory could cause problems.

You can use fileinput if you just want to remove the lines in the original file:
from __future__ import print_function
import fileinput
for line in fileinput.input("i.txt", inplace=True):
    if not line.rstrip().endswith("."):
        print(line.rstrip(),end=" ")
    else:
        print(line, end="")
Output:
Here is a regular sentence.
This one is fine too.
However, this sentence got split up.
