split() not splitting all white spaces? - python

I am trying to take a text document and write each word separately into another text document. My only issue is with the code I have sometimes the words aren't all split based on the white space and I'm wondering if I'm just using .split wrong? If so, could you explain why or what to do better?
Here's my code:
list_of_words = []
with open('ExampleText.txt', 'r') as ExampleText:
for line in ExampleText:
for word in line.split(''):
list_of_words.append(word)
print("Done!")
print("Also done!")
with open('TextTXT.txt', 'w') as EmptyTXTdoc:
for word in list_of_words:
EmptyTXTdoc.write("%s\n" % word)
EmptyTXTdoc.close()
This is the first line in the ExampleText text document as it is written in the newly created EmptyTXTdoc:
Submit
a personal
statement
of
research
and/or
academic
and/or
career
plans.

Use .split() (or .split(' ') for only spaces) instead of .split('').
Also, consider sanitizing the line with .strip() for every iteration of the file, since the line is accepted with a newline (\n) in its end.

.split('') Will not remove a space because there isn't a space in between the two apostrophes. You're telling it to split on, well, nothing.

Related

Python regex fullmatch doesn't work as expected

I have a text file that contains some sentences, I'm checking them if they are valid sentences based on some rules and writing valid or not valid to a seperate text file. My main problem is when I'm using ctrl + f and enter my regex to search bar it matches the strings that I wanted to match but in code, it works wrong. Here is my code:
import re
pattern = re.compile('(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')
text=open('validSentences',"w+")
with open('sentences.txt',encoding='utf8') as file:
lines = file.readlines()
for line in lines:
matches = pattern.fullmatch(line)
if(matches==None):
text.write("not valid"+"\n")
else:
text.write("valid"+"\n")
file.close()
In documents it says that fullmatch matches only whole string matches and thats what I'm trying to do but this code writes not valid for all sentences that I have. The text file that I have:
How can you say that to me?
As he looked at his reflection in the mirror, he took a deep breath.
He nodded at himself and, feeling braver, he stepped outside the bathroom. He bumped straight into the
extremely tall man, who was waiting by the door.
David said ‘Oh, sorry!’.
The happy pair discussed their future life 2gether and shared sweet words of admiration.
We will not stop you; I promise!
Come here ASAP!
He pushed his chair back and went to the kitchen at 2 pM.
I do not know...
The main character in the movie said: "Play hard. Work harder."
When I enter my regex in vs code with ctrl+f whole first, second, fourth, seventh and eight lines are highligting so according to fullmatch() funtion they need to print as "valid" but they aren't. I need help with this issue.
First, remove lines = file.readlines() as it already moves the file handle to the end of the file stream. Then, you need to keep in mind that when using for line in lines:, the line variable has a trailing newline, so
Either use line=line.rstrip() to remove the trailing whitespace before running the regex or
Ensure your pattern ends in \n? (an optional newline), or even \s* (any zero or more whitespace).
So, a possible solution looks like
with open('sentences.txt',encoding='utf8') as file:
for line in file:
matches = pattern.fullmatch(line.rstrip('\n'))
...
Or,
pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')
#...
with open('sentences.txt',encoding='utf8') as file:
for line in file:
....

How to read from text file into array paragraph by paragraph?

Making a text based game and want to read from the story text file via paragraph rather than printing a certain amount of characters?
You wake up from a dazed slumber to find yourself in a deep dank cave with moonlight casting upon the entrance...
You see a figure approaching towards you... Drawing nearer you hear him speak...
You want this: my_list = my_string.splitlines()
https://docs.python.org/3/library/stdtypes.html#str.splitlines
Like #martineau suggested you need a delimiter for separate different paragraphs.
This can even be a new line character (\n) and after you have it you read all content of the file and split it by the defined delimiter.
Doing so you generate a list of elements with each one being a paragraph.
Some example code:
delimiter = "\n"
with open("paragraphs.txt", "r") as paragraphs_file:
all_content = paragraphs_file.read() #reading all the content in one step
#using the string methods we split it
paragraphs = all_content.split(delimiter)
This approach has some drawbacks like the fact that read all the content and if the file is big you fill the memory with thing that you don't need now, at the moment of the story.
Looking at your text example and knowing that you will continuously print the retrieved text, reading one line a time could be a better solution:
with open("paragraphs.txt", "r") as paragraphs_file:
for paragraph in paragraphs_file: #one line until the end of file
if paragraph != "\n":
print(paragraph)
Obviously add some logic control where you need it.

Where i am wrong? Count total words excluding header and footer in python?

This is the file i am trying to read and count the total no of words in this file test.txt
I have written a code for it:
def create_wordlist(filename, is_Gutenberg=True):
words = 0
wordList = []
data = False
regex = re.compile('[%s]' % re.escape(string.punctuation))
file1 = open("temp",'w+')
with open(filename, 'r') as file:
if is_Gutenberg:
for line in file:
if line.startswith("*** START "):
data = True
continue
if line.startswith("End of the Project Gutenberg EBook"):
#data = False
break
if data:
line = line.strip().replace("-"," ")
line = line.replace("_"," ")
line = regex.sub("",line)
for word in line.split():
wordList.append(word.lower())
#print(wordList)
#words = words + len(wordList)
return len(wordList)
#return wordList
create_wordlist('test.txt', True)
Here are few rules to be followed:
1. Strip off whitespace, and punctuation
2. Replace hyphens with spaces
3.skip the file header and footer. Header ends with a line that starts with "*** START OF THIS" and footer starts with "End of the Project".
My answer: 60513 but the actual answer is 60570. This answer came with the question itself. It may be correct or wrong. Where I am doing it wrong.
You give a number for the actual answer -- the answer you consider correct, that you want your code to output.
You did not tell us how you got that number.
It looks to me like the two numbers come from different definitions of "word".
For example, you have in your example text several numbers in the form:
140,000,000
Is that one word or three?
You are replacing hyphens with spaces, so a hyphenated word will be counted as two. Other punctuation you are removing. That would make the above number (and there are other, similar, examples in your text) into one word. Is that what you intended? Is that what was done to get your "correct" number? I suspect this is all or part of your difference.
At a quick glance, I see three numbers in the form above (counted as either 3 or 9, difference 6)
I see 127 apostrophes (words like wife's, which could be counted as either one word or two) for a difference of 127.
Your difference is 57, so the answer is not quite so simple, but I still strongly suspect different definitions of what is a word, for specific corner cases.
By the way, I am not sure why you are collecting all the words into a huge list and then getting the length. You could skip the append loop and just accumulate a sum of len(line.split()). This would remove complexity, which lessens the possibility of bugs (and probably make the program faster, if that matters in this case)
Also, you have a line:
if line.startswith("*** START " in"):
When I try that in my python interpreter, I get a syntax error. Are you sure the code you posted here is what you are running? I would have expected:
if line.startswith("*** START "):
Without an example text file that shows this behaviour it is difficult to guess what goes wrong. But there is one clue: your number is less than what you expect. That seems to imply that you somehow glue together separate words, and count them as a single word. And the obvious candidate for this behaviour is the statement line = regex.sub("",line): this replaces any punctuation character with an empty string. So if the text contains that's, your program changes this to thats.
If that is not the cause, you really need to provide a small sample of text that shows the behaviour you get.
Edit: if your intention is to treat punctuation as word separators, you should replace the punctuation character with a space, so: line = regex.sub(" ",line).

Deleting prior two lines of a text file in Python where certain characters are found

I have a text file that is an ebook. Sometimes full sentences are broken up by two new lines, and I am trying to get rid of these extra new lines so the sentences are not separated mid-sentence by the new lines. The file looks like
Here is a regular sentence.
This one is fine too.
However, this
sentence got split up.
If I hit delete twice on the keyboard, it'd fix it. Here's what I have so far:
with open("i.txt","r") as input:
with open("o.txt","w") as output:
for line in input:
line = line.strip()
if line[0].isalpha() == True and line[0].isupper() == False:
# something like hitting delete twice on the keyboard
output.write(line + "\n")
else:
output.write(line + "\n")
Any help would be greatly appreciated!
If you can read the entire file into memory, then a simple regex will do the trick - it says, replace a sequence of new lines preceding a lowercase letter, with a single space:
import re
i = open('i.txt').read()
o = re.sub(r'\n+(?=[a-z])', ' ', i)
open('o.txt', 'w').write(o)
The important difference here is that you are not in an editor, but rather writing out lines, that means you can't 'go back' to delete things, but rather recognise they are wrong before you write them.
In this case, you will iterate five times. Each time you will get a string ending in \n to indicate a newline, and in two of those cases, you want to remove that newline. The easiest way I can see to identify that time is if there was no full stop at the end of the line. If we check that, and strip the newline off in that case, we can get the result you want:
with open("i.txt", "r") as input, open("o.txt", "w") as output:
for line in input:
if line.endswith(".\n"):
output.write(line)
else:
output.write(line.rstrip("\n"))
Obviously, sometimes it will not be possible to tell in advance that you need to make a change. In such cases, you will either need to make two iterations over the file - the first to find where you want to make changes, the second to make them, or, alternatively, store (part of, or all of) the file in memory until you know what you need to change. Note that if your files are extremely large, storinbg them in memory could cause problems.
You can use fileinput if you just want to remove the lines in the original file:
from __future__ import print_function
for line in fileinput.input(""i.txt", inplace=True):
if not line.rstrip().endswith("."):
print(line.rstrip(),end=" ")
else:
print(line, end="")
Output:
Here is a regular sentence.
This one is fine too.
However, this sentence got split up.

How can I format a txt file in python so that extra paragraph lines are removed as well as extra blank spaces?

I'm trying to format a file similar to this: (random.txt)
Hi, im trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
This is how it should look below: (randomoutput.txt)
Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.
def removespaces(input, output):
ivar = open(input, 'r')
ovar = open(output, 'w')
n = ivar.read()
ovar.write(' '.join(n.split()))
ivar.close()
ovar.close()
Edit:
I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:
m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))
You should process the input line-by line. Not only will this make your program simpler but also more easy on the system's memory.
The logic for normalizing horizontal white space in a line stays the same (split words and join with a single space).
What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag whether the previous line was empty too. You simply throw away the empty lines but if you encounter a non-empty line and the flag is set, print a single empty line before it.
with open('input.txt', 'r') as istr:
new_par = False
for line in istr:
line = line.strip()
if not line: # blank
new_par = True
continue
if new_par:
print() # print a single blank line
print(' '.join(line.split()))
new_par = False
If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
If you want to go more fancy, have a look at the textwrap module but be aware that is has (or, at least, used to have, from what I can say) some bad worst-case performance issues.
The trick here is that you want to turn any sequence of 2 or more \n into exactly 2 \n characters. This is hard to write with just split and join—but it's dead simple to write with re.sub:
n = re.sub(r'\n\n+', r'\n\n', n)
If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.
You probably also want to change your space-stripping code to use split(' ') rather than just split(), so it doesn't screw up newlines. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)
Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
split without Argument will cut your string at each occurence if a whitespace ( space, tab, new line,...).
Write
n.split(" ")
and it will only split at spaces.
Instead of writing the output to a file, put it Ingo a New variable, and repeat the step again, this time with
m.split("\n")
Firstly, let's see, what exactly is the problem...
You cannot have 1+ consecutive spaces or 2+ consecutive newlines.
You know how to handle 1+ spaces.
That approach won't work on 2+ newlines as there are 3 possible situations:
- 1 newline
- 2 newlines
- 2+ newlines
Great so.. How do you solve this then?
There are many solutions. I'll list 3 of them.
Regex based.
This problem is very easy to solve iff1 you know how to use regex...
So, here's the code:
s = re.sub(r'\n{2,}', r'\n\n', in_file.read())
If you have memory constraints, this is not the best way as we read the entire file into the momory.
While loop based.
This code is really self-explainatory, but I wrote this line anyway...
s = in_file.read()
while "\n\n\n" in s:
s = s.replace("\n\n\n", "\n\n")
Again, you have memory constraints, we still read the entire file into the momory.
State based.
Another way to approach this problem is line-by-line. By keeping track whether the last line we encountered was blank, we can decide what to do.
was_last_line_blank = False
for line in in_file:
# Uncomment if you consider lines with only spaces blank
# line = line.strip()
if not line:
was_last_line_blank = True
continue
if not was_last_line_blank:
# Add a new line to output file
out_file.write("\n")
# Write contents of `line` in file
out_file.write(line)
was_last_line_blank = False
Now, 2 of them need you to load the entire file into memory, the other one is fairly more complicated. My point is: All these work but since there is a small difference in ow they work, what they need on the system varies...
1 The "iff" is intentional.
Basically, you want to take lines that are non-empty (so line.strip() returns empty string, which is a False in boolean context). You can do this using list/generator comprehension on result of str.splitlines(), with if clause to filterout empty lines.
Then for each line you want to ensure, that all words are separated by single space - for this you can use ' '.join() on result of str.split().
So this should do the job for you:
compressed = '\n'.join(
' '.join(line.split()) for line in txt.splitlines()
if line.strip()
)
or you can use filter and map with helper function to make it maybe more readable:
def squash_line(line):
return ' '.join(line.split())
non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
To fix the paragraph issue:
import re
data = open("data.txt").read()
result = re.sub("[\n]+", "\n\n", data)
print(result)

Categories

Resources