How to iterate over a space-separated ASCII file in Python

Strange question here.
I have a .txt file that I want to iterate over. I can already read all the words from the file into a list, which is good, but I want to know how to iterate over the whole file word by word rather than letter by letter.
I want to go through the list that holds all the text from the file and count how many times each word appears in it.
The only problem is that I don't know how to write the code for it.
I tried using a for loop, but that iterates over every single letter when I want whole words.

This code reads the space-separated file.txt:
with open("file.txt", "r") as f:
    words = f.read().split()
for w in words:
    print(w)

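with open("test") as f:
    for line in f:
        # split() with no argument also drops the trailing newline,
        # which split(" ") would leave attached to the last word
        for word in line.split():
            print(word)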

Untested:
def produce_words(file_):
    # Yield the words of an open file one at a time
    for line in file_:
        for word in line.split():
            yield word

def main():
    with open('in.txt', 'r') as file_:
        for word in produce_words(file_):
            print(word)

if __name__ == '__main__':
    main()

If you want to loop over an entire file, the sensible thing to do is to iterate over it, taking the lines and splitting them into words. Working line by line is best, as it means we don't read the entire file into memory first (which, for large files, could take a lot of time or cause us to run out of memory):
with open('in.txt') as input_file:
    for line in input_file:
        for word in line.split():
            ...
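Note that line.split() splits on any run of whitespace and discards the excess, including the trailing newline. Use line.split(" ") only if you need to preserve that whitespace: it splits at every single space, so consecutive spaces produce empty strings and the newline stays attached to the last word.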
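A quick way to see the difference between the two calls is to run both on the same line. This is only an illustration; the sample string is made up:

```python
line = "the  quick brown fox\n"  # note the double space and trailing newline

# No argument: split on runs of any whitespace and drop empty strings.
words_default = line.split()

# Explicit " ": split at every single space, so the double space yields an
# empty string and the newline stays attached to the last word.
words_single_space = line.split(" ")

print(words_default)       # ['the', 'quick', 'brown', 'fox']
print(words_single_space)  # ['the', '', 'quick', 'brown', 'fox\n']
```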
Also note my use of the with statement to open the file, as it's more readable and handles closing the file, even on exceptions.
While this is a good solution, the nested loop is unnecessary if you are not doing anything per line in the outer loop. To reduce this to one loop, we can use itertools.chain.from_iterable and a generator expression:
import itertools
with open('in.txt') as input_file:
    for word in itertools.chain.from_iterable(line.split() for line in input_file):
        ...
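To tie this back to the counting the question asks about, the flattened stream of words can be fed straight into collections.Counter. This is a minimal sketch, not part of the answer above: count_words is a hypothetical helper name, and io.StringIO stands in for an open file so the example runs without in.txt.

```python
import io
import itertools
from collections import Counter

def count_words(lines):
    """Count whitespace-separated words from any iterable of lines."""
    words = itertools.chain.from_iterable(line.split() for line in lines)
    return Counter(words)

# io.StringIO stands in for an open file here; with a real file you would
# pass the object returned by open('in.txt') instead.
sample = io.StringIO("the cat sat\non the mat\n")
counts = count_words(sample)
print(counts["the"])  # 2
```

Because the words are produced lazily, only one line is held in memory at a time; the Counter itself still grows with the number of distinct words.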

Related

How to limit length of total words to read from a txt file

I have been reading a text document with the code below, but I don't want to read the entire document. Let's say the document contains 2,845 words in total.
for line in open('foo.txt', "r"):
    print(line)
I want to read only the first 1,674 words from the document.
Thanks in advance.
First of all, you should always use with open() to open and read a file, as the file then gets closed automatically. Overall, it's less error-prone and more readable.
Concerning your problem, here is a short snippet which should push you forward:
with open('foo.txt', 'r') as file:
    text = file.read()
words = text.split()
word_limited_text = ' '.join(words[:1674])
The above code works in three steps:
It reads the whole text of the file into the variable text.
It splits the text on any whitespace (spaces and newlines alike), so runs of blanks are not counted as words.
It joins the words back together, taking only the first 1674 words.
If performance matters, a better solution may be to read the file line by line and keep track of how many words have been read so far.
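The line-by-line variant mentioned above could look roughly like this. It is a sketch under assumptions: first_n_words is a hypothetical helper, and io.StringIO replaces foo.txt so the example is runnable as-is. itertools.islice stops pulling words once the limit is reached, so the rest of the file is never read.

```python
import io
import itertools

def first_n_words(lines, limit):
    """Return the first `limit` whitespace-separated words as a list."""
    words = (word for line in lines for word in line.split())
    return list(itertools.islice(words, limit))

sample = io.StringIO("one two three\nfour five six\nseven\n")
print(" ".join(first_n_words(sample, 4)))  # one two three four
```

With a real file, the call would be first_n_words(file, 1674) inside a with open('foo.txt') block.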

process and export text file into csv file

I'm using this code to take a text file as input and turn it into a CSV file as output. The CSV file has two columns, one for the words and the other for the count of each word.
from collections import Counter
file = open(r"/Users/abdullahtammour/Documents/txt/1984/1984.txt", "r", encoding="utf-8-sig")
wordcount={}
wordcount = Counter((file.read().split()))
for item in wordcount.items():
    print("{}\t{}".format(*item), file=open("/Users/abdullahtammour/Documents/txt/1984/1984.csv", "a"))
file.close()
I want to enhance the code and add two features:
1st (and the most important): I want only words in the output file, with no numbers and no characters like (*&-//.,!?) and so on.
2nd: I want all the words in the output file to be lower case.
Any help will be appreciated.
You can use the string method isalpha() to check if there are only alphabetic characters in a word, and you can use lower() to convert it to lower case. I'm assuming you don't want apostrophes or other punctuation in your words either, but if that is OK then you could strip such characters out with replace, like this:
word.replace("'",'').isalpha()
It's also better to just open a file once than to open & close it a thousand times, which is what you do by opening it in the body of the loop. It is not only inefficient but could conceivably have weird results if buffering is involved.
I rewrote it with a 'with' clause which is roughly equal to opening the file at the beginning of the clause and closing it at the end.
Not as important, but you can use the 'sep' keyword in print() instead of manually inserting a tab, like this:
print(arg1, arg2, sep='\t')
Revising your code:
from collections import Counter

# Lower-case the text before counting so "The" and "the" share one count
with open(r"/Users/abdullahtammour/Documents/txt/1984/1984.txt", "r", encoding="utf-8-sig") as file:
    words = file.read().lower().split()

# Keep only purely alphabetic tokens; anything with digits or punctuation is dropped
wordcount = Counter(word for word in words if word.isalpha())

with open("/Users/abdullahtammour/Documents/txt/1984/1984.csv", "w") as file:
    for word, count in wordcount.items():
        print(word, count, sep='\t', file=file)
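One limitation of filtering with isalpha() alone is that a token such as "cat," is discarded entirely rather than counted as "cat". A possible variation, offered only as a sketch, strips leading and trailing punctuation before the check. The helper name count_clean_words and the sample sentence are made up for illustration; string.punctuation covers ASCII punctuation only.

```python
import string
from collections import Counter

def count_clean_words(text):
    """Count lower-cased words, ignoring punctuation at either end of a token."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(string.punctuation)
        # Tokens with digits or inner punctuation (e.g. "1984", "it's") are skipped.
        if word.isalpha():
            counts[word] += 1
    return counts

counts = count_clean_words("The cat, the CAT! 1984 it's")
print(sorted(counts.items()))  # [('cat', 2), ('the', 2)]
```

Whether words with apostrophes should be kept is a design choice; here they are dropped, matching the assumption stated in the answer.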

Delete words from text file if they exist in another textfile

I have created one text file (alltext.txt) by merging five other text files. I also have a text file with one word on each line (removewords.txt). I would like to remove the words listed in removewords.txt from alltext.txt, without creating a new text file and without typing out the words from removewords.txt manually.
I have thought about using sets, but I am a bit confused about how to approach this.
My merging of the files looks like this:
files=["file1.txt", "file2.txt"...."file5.txt"]
with open("compare_out.txt", "w") as fout:
    for file in files:
        with open(file) as complete_file:
            for line in complete_file:
                fout.write(line)
Any suggestions? Thank you very much.
I would do the following:
read all words from "removewords.txt" into a list called remove_words
read all words from "alltext.txt" into a list called all_words
open the file "alltext.txt" in write mode ("w") and write content to it as follows:
for each word in all_words, check if that word is in the list remove_words. If it is not, write it to "alltext.txt"
Are these steps detailed enough so that you can solve your problem?
If not, comment below on what you are having problems with.
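For reference, the steps above might be sketched as follows. This is one possible reading rather than the answerer's own code: remove_listed_words is a hypothetical name, a set is used instead of a list for faster membership tests, and the demonstration runs in a temporary directory so no real files are touched.

```python
import os
import tempfile

def remove_listed_words(text_path, remove_path):
    """Rewrite text_path without any word listed in remove_path."""
    with open(remove_path) as remove_file:
        remove_words = set(remove_file.read().split())

    # Read everything before reopening in "w" mode, which truncates the file.
    with open(text_path) as text_file:
        lines = text_file.readlines()

    with open(text_path, "w") as text_file:
        for line in lines:
            kept = [word for word in line.split() if word not in remove_words]
            text_file.write(" ".join(kept) + "\n")

# Small demonstration with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "alltext.txt")
    remove_path = os.path.join(tmp, "removewords.txt")
    with open(text_path, "w") as f:
        f.write("the quick brown fox\njumps over the lazy dog\n")
    with open(remove_path, "w") as f:
        f.write("the\nlazy\n")
    remove_listed_words(text_path, remove_path)
    with open(text_path) as f:
        result = f.read()

print(result)
```

Note that rejoining with single spaces normalizes the original spacing within each line, and that matching is exact, so "the" does not remove "The" or "the,".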
If it is not a problem, you can load all the words to remove into a set using split, then check each word before you write it to the output file.
Split separates a string into list elements. Called with no argument, it splits on any whitespace, so it handles both the newlines between the entries in removewords.txt and the spaces between the words on each line of text.
with open('removewords.txt', 'r') as rm_word_file:
    remove_words = set(rm_word_file.read().split())
files=["file1.txt", "file2.txt"...."file5.txt"]
with open("compare_out.txt", "w") as fout:
    for file in files:
        with open(file) as complete_file:
            for line in complete_file:
                # Keep only the words that are not in the removal set
                kept = [word for word in line.split() if word not in remove_words]
                fout.write(" ".join(kept) + "\n")
Something else to think about: if there is punctuation in your text body, how will you handle it?
You could just remove all punctuation, but then "its" and "it's" would be treated as the same word, which may not be the intended behaviour.

Searching a text file and grabbing all lines that do not include ## in python

I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
with open(r"C:\Path\To\File.txt", "r") as fh:
    raw_text = fh.readlines()
clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list that contains one line of the required text per element. You could split each line into individual words using str.split(), which would give you a nested list per original line that you could easily index (assuming your text is whitespace-separated, of course). After splitting,
clean_text[4][7]
would return the 8th word of the 5th line.
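Putting the filtering and the splitting together might look like the sketch below. The helper name useful_rows and the sample data are assumptions for illustration, and io.StringIO stands in for the real results file.

```python
import io

def useful_rows(lines, marker="##"):
    """Return a list of word lists, one per line not starting with marker."""
    return [line.split() for line in lines if not line.startswith(marker)]

sample = io.StringIO("## header\n1.0 2.0 3.0\n## note\n4.0 5.0 6.0\n")
rows = useful_rows(sample)
print(rows[1][2])  # 6.0
```

The values are still strings at this point; converting them with float() would be a separate step if numeric data is needed.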
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []
with open("file.txt", "r") as f:  # "file.txt" is a placeholder for your file; "r" = read
    for line in f:
        if line[:2] != "##":  # compare the first two characters
            listoflines.append(line)
print(listoflines)
If you're feeling brave, you can also replace the loop inside the with block with a list comprehension (credit goes to Alex Thornton):
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially teaching the .startswith function, but I think this is the more pythonic way and also has the advantage of automatically closing the file as soon as you're done with it.

Deleting a specific word from a file in python

I am quite new to python and have just started importing text files. I have a text file which contains a list of words, I want to be able to enter a word and this word to be deleted from the text file. Can anyone explain how I can do this?
text_file = open('FILE.txt', 'r')
ListText = text_file.read().split(',')
text_file.close()
DeletedWord = input('Enter the word you would like to delete:')
ListText.remove(DeletedWord)  # remove() edits the list in place and returns None
I have this so far, which reads the file into a list. I can then delete a word from the list, but I also want to delete the word from the text file.
Here's what I would recommend, since it's fairly simple and I don't think you're concerned with performance:
f = open("file.txt", 'r')
lines = f.readlines()
f.close()
excludedWord = "whatever you want to get rid of"
newLines = []
for line in lines:
    newLines.append(' '.join([word for word in line.split() if word != excludedWord]))
f = open("file.txt", 'w')
for line in newLines:  # write the filtered lines, not the originals
    f.write("{}\n".format(line))
f.close()
This allows a line to have multiple words on it, but it works just as well if there is only one word per line.
In response to the updated question:
You cannot directly edit the file (or at least I don't know how); you must instead read all the contents into Python, edit them, and then rewrite the file with the altered contents.
Another thing to note: lst.remove(item) removes the first instance of item in lst, and only the first one, so a second instance of item survives .remove(). This is why my solution uses a list comprehension to exclude all instances of excludedWord from the list. If you really want to use .remove(), you can do something like this:
while excludedWord in lst:
    lst.remove(excludedWord)
But I would discourage this in favor of the equivalent list comprehension.
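The difference between a single .remove() call and the list comprehension is easy to check on a small list. The sample words here are made up:

```python
words = ["spam", "eggs", "spam", "ham"]

once = list(words)       # copy, so the original list is left alone
once.remove("spam")      # drops only the first "spam"

all_gone = [w for w in words if w != "spam"]

print(once)      # ['eggs', 'spam', 'ham']
print(all_gone)  # ['eggs', 'ham']
```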
We can replace strings in files (some imports needed):
import sys
import fileinput
# With inplace=True, anything written to standard output replaces the file's contents
for line in fileinput.input('file.txt', inplace=True):
    sys.stdout.write(line.replace('old_string', 'new_string'))
You can (maybe) find this here: http://effbot.org/librarybook/fileinput.htm
If 'new_string' is changed to '', this is the same as deleting 'old_string'.
I was trying something similar, so here are some pointers for people who might end up reading this thread. The straightforward way to replace the file with the modified contents is to open the same file in "w" mode; Python then overwrites the existing file.
I tried this using "re" and sub():
import re
with open("inputfile.txt", "rt") as f:
    inputfilecontents = f.read()
newline = re.sub("trial", "", inputfilecontents)
with open("inputfile.txt", "w") as f:
    f.write(newline)
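One thing to watch with a plain pattern like "trial" is that it also matches inside longer words such as "trials". If only whole words should be deleted, word boundaries can help. This is a sketch under that assumption; delete_word is a hypothetical helper, not part of the answer above.

```python
import re

def delete_word(text, word):
    """Remove whole-word occurrences of `word` from `text`."""
    # re.escape guards against regex metacharacters in the word;
    # \b restricts the match to word boundaries.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, "", text)

print(delete_word("a trial of trials", "trial"))  # a  of trials
```

The surrounding spaces are left in place, so a double space remains where the word used to be; collapsing whitespace afterwards is a separate decision.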
@Wnnmaw, your code is a little bit wrong there: the last loop has to write newLines rather than lines. It should go like this:
f = open("file.txt", 'r')
lines = f.readlines()
f.close()
excludedWord = "whatever you want to get rid of"
newLines = []
for line in lines:
    newLines.append(' '.join([word for word in line.split() if word != excludedWord]))
f = open("file.txt", 'w')
for line in newLines:
    f.write("{}\n".format(line))
f.close()
