Weird behavior when writing a string to a file - python

I am trying to make an AutoHotKey script that removes the letter 'e' from most words you type. To do this, I am going to put a list of common words in a text file and have a python script add the proper syntax to the AHK file for each word. For testing purposes, my word list file 'words.txt' contains this:
apple
dog
tree
I want the output in the file 'wordsOut.txt' (which I will turn into the AHK script) to end up like this after I run the python script:
::apple::appl
::tree::tr
As you can see, it will exclude words without the letter 'e' and removes 'e' from everything else. But when I run my script which looks like this...
f = open('C:\\Users\\jpyth\\Desktop\\words.txt', 'r')
while True:
    word = f.readline()
    if not word: break
    if 'e' in word:
        sp_word = word.strip('e')
        outString = '::{}::{}'.format(word, sp_word)
        p = open('C:\\Users\\jpyth\\Desktop\\wordsOut.txt', 'a+')
        p.write(outString)
        p.close()
f.close()
The output text file ends up like this:
::apple
::apple
::tree::tr
The weirdest part is that, while it never gets it right, the text in the output file can change depending on the number of lines in the input file.

I'm making this an official answer and not a comment because it's worth pointing out how strip works, and to be wary of hidden characters like newline characters.
f.readline() returns each line, including the '\n'. Because strip() only removes the given character from the beginning and end of the string, not from the middle, it's not actually removing anything from most words with 'e'. In fact, even a word that ends in 'e' doesn't get that 'e' removed, since there is a newline character to the right of it. It also explains why ::apple is printed over two lines.
'hello\n'.strip('o') outputs 'hello\n'
whereas 'hello'.strip('o') outputs 'hell'
As pointed out in the comments, just do sp_word = word.strip().replace('\n', '').replace('e', '')

lhay's answer is right about the behavior of strip() but I'm not convinced that list comprehension really qualifies as "simple".
I would instead go with replace():
>>> 'elemental'.replace('e', '')
'lmntal'
(Also side note: for word in f: does the same thing as the first three lines of your code.)
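Putting those two suggestions together, a minimal sketch of a corrected script might look like this (the paths are the ones from the question; note it opens the output file with 'w' instead of 'a+' so reruns don't append duplicates):
# Sketch: iterate the file directly and use replace() to drop every 'e'
with open('C:\\Users\\jpyth\\Desktop\\words.txt', 'r') as f, \
     open('C:\\Users\\jpyth\\Desktop\\wordsOut.txt', 'w') as p:
    for word in f:
        word = word.strip()        # remove the trailing newline
        if 'e' in word:
            p.write('::{}::{}\n'.format(word, word.replace('e', '')))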

Related

Where am I wrong? Count total words excluding header and footer in Python

This is the file I am trying to read, test.txt, and I want to count the total number of words in it.
I have written code for it:
import re
import string

def create_wordlist(filename, is_Gutenberg=True):
    words = 0
    wordList = []
    data = False
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    file1 = open("temp", 'w+')
    with open(filename, 'r') as file:
        if is_Gutenberg:
            for line in file:
                if line.startswith("*** START "):
                    data = True
                    continue
                if line.startswith("End of the Project Gutenberg EBook"):
                    #data = False
                    break
                if data:
                    line = line.strip().replace("-", " ")
                    line = line.replace("_", " ")
                    line = regex.sub("", line)
                    for word in line.split():
                        wordList.append(word.lower())
    #print(wordList)
    #words = words + len(wordList)
    return len(wordList)
    #return wordList

create_wordlist('test.txt', True)
Here are a few rules to be followed:
1. Strip off whitespace and punctuation
2. Replace hyphens with spaces
3. Skip the file header and footer. The header ends with a line that starts with "*** START OF THIS" and the footer starts with "End of the Project".
My answer is 60513, but the expected answer is 60570. That expected answer came with the question itself, so it may be correct or wrong. Where am I going wrong?
You give a number for the actual answer -- the answer you consider correct, that you want your code to output.
You did not tell us how you got that number.
It looks to me like the two numbers come from different definitions of "word".
For example, you have in your example text several numbers in the form:
140,000,000
Is that one word or three?
You are replacing hyphens with spaces, so a hyphenated word will be counted as two. Other punctuation you are removing. That would make the above number (and there are other, similar, examples in your text) into one word. Is that what you intended? Is that what was done to get your "correct" number? I suspect this is all or part of your difference.
At a quick glance, I see three numbers in the form above (counted as either 3 or 9 words, a difference of 6).
I see 127 apostrophes (words like wife's, which could be counted as either one word or two) for a difference of 127.
Your difference is 57, so the answer is not quite so simple, but I still strongly suspect different definitions of what is a word, for specific corner cases.
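To make those corner cases concrete, here is a small illustrative snippet (the sample tokens are taken from the examples above) showing how the two treatments of punctuation count these tokens differently:
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

for token in ["140,000,000", "wife's"]:
    removed = regex.sub("", token).split()   # punctuation deleted: token stays one word
    spaced = regex.sub(" ", token).split()   # punctuation as separator: token splits apart
    print(token, "->", len(removed), "word(s) vs", len(spaced), "word(s)")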
By the way, I am not sure why you are collecting all the words into a huge list and then getting the length. You could skip the append loop and just accumulate a running sum of len(line.split()). This would remove complexity, which lessens the possibility of bugs (and probably makes the program faster, if that matters in this case).
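A sketch of that simplification (the function name is mine, and the Gutenberg header/footer handling is omitted for brevity):
import re
import string

def count_words(filename):
    # Same cleaning steps as in the question, but with a running total instead of a list
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    total = 0
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip().replace("-", " ").replace("_", " ")
            line = regex.sub("", line)
            total += len(line.split())   # no intermediate list needed
    return total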
Also, you have a line:
if line.startswith("*** START " in"):
When I try that in my python interpreter, I get a syntax error. Are you sure the code you posted here is what you are running? I would have expected:
if line.startswith("*** START "):
Without an example text file that shows this behaviour it is difficult to guess what goes wrong. But there is one clue: your number is less than what you expect. That seems to imply that you somehow glue together separate words, and count them as a single word. And the obvious candidate for this behaviour is the statement line = regex.sub("",line): this replaces any punctuation character with an empty string. So if the text contains that's, your program changes this to thats.
If that is not the cause, you really need to provide a small sample of text that shows the behaviour you get.
Edit: if your intention is to treat punctuation as word separators, you should replace the punctuation character with a space, so: line = regex.sub(" ",line).
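A small illustrative demonstration of the difference between the two substitutions:
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

print(regex.sub("", "that's fine"))    # "thats fine"  -> "that's" becomes one word
print(regex.sub(" ", "that's fine"))   # "that s fine" -> "that's" becomes two words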

Deleting prior two lines of a text file in Python where certain characters are found

I have a text file that is an ebook. Sometimes full sentences are broken up across lines by extra newlines, and I am trying to get rid of these extra newlines so the sentences are not split mid-sentence by the line breaks. The file looks like
Here is a regular sentence.
This one is fine too.
However, this
sentence got split up.
If I hit delete twice on the keyboard, it'd fix it. Here's what I have so far:
with open("i.txt","r") as input:
with open("o.txt","w") as output:
for line in input:
line = line.strip()
if line[0].isalpha() == True and line[0].isupper() == False:
# something like hitting delete twice on the keyboard
output.write(line + "\n")
else:
output.write(line + "\n")
Any help would be greatly appreciated!
If you can read the entire file into memory, then a simple regex will do the trick - it says, replace a sequence of new lines preceding a lowercase letter, with a single space:
import re
i = open('i.txt').read()
o = re.sub(r'\n+(?=[a-z])', ' ', i)
open('o.txt', 'w').write(o)
The important difference here is that you are not in an editor but writing out lines one at a time. That means you can't 'go back' to delete things; instead you have to recognise that they are wrong before you write them.
In this case, you will iterate once per line of the input. Each time you will get a string ending in \n to indicate a newline, and in some of those cases you want to remove that newline. The easiest way I can see to identify those cases is that there is no full stop at the end of the line. If we check that, and strip the newline off in that case, we can get the result you want:
with open("i.txt", "r") as input, open("o.txt", "w") as output:
for line in input:
if line.endswith(".\n"):
output.write(line)
else:
output.write(line.rstrip("\n"))
Obviously, sometimes it will not be possible to tell in advance that you need to make a change. In such cases, you will either need to make two passes over the file (the first to find where you want to make changes, the second to make them) or, alternatively, store part or all of the file in memory until you know what you need to change. Note that if your files are extremely large, storing them in memory could cause problems.
You can use fileinput if you just want to remove the lines in the original file:
from __future__ import print_function
import fileinput

for line in fileinput.input("i.txt", inplace=True):
    if not line.rstrip().endswith("."):
        print(line.rstrip(), end=" ")
    else:
        print(line, end="")
Output:
Here is a regular sentence.
This one is fine too.
However, this sentence got split up.

How can I format a txt file in python so that extra paragraph lines are removed as well as extra blank spaces?

I'm trying to format a file similar to this: (random.txt)
Hi, im trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
This is how it should look below: (randomoutput.txt)
Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.
def removespaces(input, output):
    ivar = open(input, 'r')
    ovar = open(output, 'w')
    n = ivar.read()
    ovar.write(' '.join(n.split()))
    ivar.close()
    ovar.close()
Edit:
I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:
m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))
You should process the input line by line. Not only will this make your program simpler, it will also be easier on the system's memory.
The logic for normalizing horizontal white space in a line stays the same (split words and join with a single space).
What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag whether the previous line was empty too. You simply throw away the empty lines but if you encounter a non-empty line and the flag is set, print a single empty line before it.
with open('input.txt', 'r') as istr:
    new_par = False
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par:
            print()  # print a single blank line
        print(' '.join(line.split()))
        new_par = False
If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
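For example, a sketch of that variant (seen_text is my name for the extra flag):
with open('input.txt', 'r') as istr:
    new_par = False
    seen_text = False              # becomes True after the first non-blank line
    for line in istr:
        line = line.strip()
        if not line:               # blank
            new_par = True
            continue
        if new_par and seen_text:
            print()                # blank line between paragraphs, but not at the top
        print(' '.join(line.split()))
        new_par = False
        seen_text = True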
If you want to go more fancy, have a look at the textwrap module, but be aware that it has (or at least used to have, as far as I can tell) some bad worst-case performance issues.
The trick here is that you want to turn any sequence of 2 or more \n into exactly 2 \n characters. This is hard to write with just split and join—but it's dead simple to write with re.sub:
n = re.sub(r'\n\n+', r'\n\n', n)
If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.
You probably also want to change your space-stripping code to use split(' ') rather than just split(), so it doesn't screw up newlines. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)
Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
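For instance, here is a minimal sketch of that adjacent_pairs approach (Python 3; the helper name comes from the comment above, and the file names are the ones from the question):
from itertools import tee, zip_longest

def adjacent_pairs(iterable):
    # Yield (current, next) pairs; the final item is paired with ''.
    i1, i2 = tee(iterable)
    next(i2, None)
    return zip_longest(i1, i2, fillvalue='')

with open('random.txt') as ivar, open('randomoutput.txt', 'w') as ovar:
    for line, next_line in adjacent_pairs(ivar):
        # Drop a blank line whenever the following line is blank too,
        # so runs of blank lines collapse to a single one.
        if not line.strip() and not next_line.strip():
            continue
        ovar.write(' '.join(line.split()) + '\n')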
split without an argument will cut your string at each occurrence of whitespace (space, tab, newline, ...).
Write
n.split(" ")
and it will only split at spaces.
Instead of writing the output to a file, put it into a new variable and repeat the step again, this time with
m.split("\n")
First, let's see what exactly the problem is...
You don't want more than one consecutive space, or more than two consecutive newlines.
You already know how to handle the spaces.
That approach won't work on the newlines, because there are 3 possible situations:
- 1 newline
- 2 newlines
- 3+ newlines
Great, so how do you solve this then?
There are many solutions. I'll list 3 of them.
Regex based.
This problem is very easy to solve iff[1] you know how to use regex...
So, here's the code:
s = re.sub(r'\n{2,}', r'\n\n', in_file.read())
If you have memory constraints, this is not the best way, as we read the entire file into memory.
While loop based.
This code is really self-explanatory, but I wrote this line anyway...
s = in_file.read()
while "\n\n\n" in s:
    s = s.replace("\n\n\n", "\n\n")
Again, if you have memory constraints, this is a problem: we still read the entire file into memory.
State based.
Another way to approach this problem is line by line: by keeping track of whether the last line we encountered was blank, we can decide what to do with the current line.
was_last_line_blank = False
for line in in_file:
    # Treat lines containing only whitespace as blank
    if not line.strip():
        was_last_line_blank = True
        continue
    if was_last_line_blank:
        # Add a single blank line back to the output file
        out_file.write("\n")
    # Write contents of `line` to the output file
    out_file.write(line)
    was_last_line_blank = False
Now, 2 of them need you to load the entire file into memory, and the other one is a bit more involved. My point is: all of these work, but since there is a small difference in how they work, what they need from the system varies...
1 The "iff" is intentional.
Basically, you want to keep the lines that are non-empty (for a blank line, line.strip() returns an empty string, which is False in a boolean context). You can do this using a list/generator comprehension on the result of str.splitlines(), with an if clause to filter out the empty lines.
Then, for each line, you want to ensure that all words are separated by a single space; for this you can use ' '.join() on the result of str.split().
So this should do the job for you:
compressed = '\n'.join(
    ' '.join(line.split()) for line in txt.splitlines()
    if line.strip()
)
Or you can use filter and map with a helper function, which may be more readable:
def squash_line(line):
    return ' '.join(line.split())

non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
To fix the paragraph issue:
import re
data = open("data.txt").read()
result = re.sub("[\n]+", "\n\n", data)
print(result)

How to sort contents in a file in python

I'm trying to figure out a simple way to sort words from a file; however, the newline characters ("\n") are always returned when I print the words.
How could I improve this code to make it work properly? I'm using python 2.7
Thanks in advance.
def sorting(self):
    filename = ("food.txt")
    file_handle = open(filename, "r")
    for word in file_handle:
        word = word.split()
        print sorted(file_handle)
    file_handle.close()
You actually have two problems here.
The big one is that print sorted(file_handle) reads and sorts the whole rest of the file and prints that out. You're doing that once per line. So, what happens is that you read the first line, split it, ignore the result, sort and print all the lines after the first, and then you're done.
What you want to do is accumulate all the words as you go along, then sort and print that. Like this:
def sorting(self):
    filename = ("food.txt")
    file_handle = open(filename, "r")
    words = []
    for line in file_handle:
        words += line.split()
    file_handle.close()
    print sorted(words)
Or, if you want to print the sorted list one line at a time, instead of as a giant list, change the last line to:
print '\n'.join(sorted(words))
For the second, more minor problem, the one you asked about, you just need to strip off the newlines. So, change the words += line.split() line to this:
words += line.strip().split()
However, if you had solved the first problem, you wouldn't even have noticed this one. If you have a line like "one two three\n", and you call split() on it, you will get back ["one", "two", "three"], with no \n to worry about. So, you don't actually even need to solve this one.
While we're at it, there are a few other improvements you could make here:
Use a with statement to close the file instead of doing it manually.
Make this function return the list of words (so you can do various different things with it, instead of just printing it and returning nothing).
Take the filename as a parameter instead of hardcoding it (for similar flexibility).
Maybe turn the loop into a comprehension—but that would require an extra "flattening" step, so I'm not sure it's worth it.
If you don't want duplicate words, use a set rather than a list.
Depending on the use case, you often want to use rstrip() or rstrip('\n') to remove just the trailing newline, while leaving, say, paragraph indentation tabs or spaces. If you're looking for individual words, however, you probably don't want that.
You might want to filter out and/or split on non-alphabetical characters, so you don't get "that." as a word. Doing even this basic kind of natural-language processing is non-trivial, so I won't show an example here. (For example, you probably want "John's" to be a word, you may or may not want "jack-o-lantern" to be one word instead of three; you almost certainly don't want "two-three" to be one word…)
The self parameter is only needed in methods of classes. This doesn't appear to be in any class. (If it is, it's not doing anything with self, so there's no visible reason for it to be in a class. You might have some reason which would be visible in your larger program, of course.)
So, anyway:
def sorting(filename):
    words = []
    with open(filename) as file_handle:
        for line in file_handle:
            words += line.split()
    return sorted(words)

print '\n'.join(sorting('food.txt'))
Basically all you have to do is strip that newline (and all other whitespace because you probably don't want it):
def sorting(self):
    filename = ("food.txt")
    file_handle = open(filename, "r")
    for line in file_handle:
        word = line.strip().split()
        print sorted(file_handle)
    file_handle.close()
Otherwise you can just remove the last character with line[:-1].split()
Use .strip(). It will remove white space by default. You can also add other characters (like "\n") to strip as well. This will leave just the words.
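For example:
>>> ' tree\n'.strip()        # whitespace, including '\n', is removed by default
'tree'
>>> 'tree\n'.strip('\ne')    # an explicit set of characters can be stripped as well
'tr'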
Try this:
def sorting(self):
    words = []
    with open("food.txt") as f:
        for line in f:
            words.extend(line.split())
    return sorted(words, key=lambda word: word.lower())
To avoid printing the newlines, just put a , at the end:
print sorted(file_handle),
In your code, I don't see that you are sorting the whole file, just the line. Use a list to save all the words, and after you read the file, sort them all.

Need to match whole word completely from long set with no partials in Python

I have a function that - as a larger part of a different program - checks to see if a word entry is in a text file. So if the text file looks like this:
aardvark
aardvark's
aardvarks
abaci
.
.
.
zygotes
I just ran a quick if statement
infile = open("words","r") # Words is the file with all the words. . . yeah.
text = infile.read()
if word in text:
return 1
else:
return 0
Works, sort-of. The problem is, while it returns true for aardvark, and false for wj;ek, it also will return true for any SUBSET of any word. So, for example, the word rdva will come back as a 'word' because it IS in the file, as a subset of aardvark. I need it to match whole words only, and I've been quite stumped.
So how can I have it match an entire word (which is equivalent to an entire line, here) or nothing?
I apologize if this question is answered elsewhere, I searched before I posted!
Many Thanks!
Iterate over each line and see if the whole line matches:
def in_dictionary(word):
    for line in open('words', 'r').readlines():
        if word == line.strip():
            return True
    return False
When you use the in statement, you are basically asking whether the word is in the line.
Using == matches the whole line.
.strip() removes leading and trailing whitespace, which will cause hello to not equal {space}hello
There is a simpler approach. Your file is, conceptually, a list of words, so build that list of words (instead of a single string).
with open("words") as infile: words = infile.read().split()
return word in words
<string> in <string> does a substring search, but <anything> in <list> checks for membership. If you are going to check multiple times against the same list of words, then you may improve performance by instead storing a set of the words (just pass the list to the set constructor).
Blender's answer works, but here is a different way that doesn't require you to iterate yourself:
Each line is going to end with a newline character (\n). So, what you can do is put a \n before and after your checked string when comparing. So something like this:
infile = open("words","r") # Words is the file with all the words. . . yeah.
text = "\n" + infile.read() # add a newline before the file contents so we can check the first line
if "\n"+word+"\n" in text:
return 1
else:
return 0
Watch out, though -- your line endings may be \r\n or just \r too.
It could also have problems if the word you're checking CONTAINS a newline. Blender's answer is better.
That's all great until you want to verify every word in a longer text using that list. For me and /usr/share/dict/words it takes up to 3 ms to check a single word with word in words on the list. Therefore I suggest using a dictionary (no pun intended) instead. Lookups were about 2,500 times faster with:
words = {}
for word in open('words', 'r').readlines():
    words[word.strip()] = True

def find(word):
    return word in words
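As the earlier answer notes, a plain set gives the same fast membership test without the dummy True values; a minimal sketch of that variant:
with open('words') as f:
    words = {line.strip() for line in f}   # one word per line

def find(word):
    return word in words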
