How to split a text file into its words in Python?

I am very new to Python and have not worked with text before. I have 100 text files, each with around 100 to 150 lines of unstructured text describing a patient's condition. I read one file in Python using:
with open("C:\\...\\...\\...\\record-13.txt") as f:
content = f.readlines()
print (content)
Now I can split each line of this file into its words using, for example:
a = content[0].split()
print(a)
but I don't know how to split the whole file into words. Do loops (while or for) help with that?
Thank you for your help, guys. Your answers helped me write this (in my file, words are separated by spaces, so I think that's the delimiter!):
with open ("C:\\...\\...\\...\\record-13.txt") as f:
lines = f.readlines()
for line in lines:
words = line.split()
for word in words:
print (word)
That simply prints the words one per line.

It depends on how you define words, or what you regard as the delimiters.
Notice that str.split in Python accepts an optional delimiter parameter, so you could pass it like this:
for lines in content[0].split():
    for word in lines.split(','):
        print(word)
Unfortunately, str.split accepts only a single delimiter, so you may need multi-level splitting like this:
for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'):
                            if word != "":
                                print(word)
Looks ugly, right? Luckily we can use iteration instead:
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
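For example, here is a minimal check of what that loop produces on a small, made-up content list (with the placeholder delimiter removed and the empty strings that splitting leaves behind filtered out at the end):
content = ["Patient reports pain, nausea.\n", "No fever? None!\n"]
delimiters = ['\n', ' ', ',', '.', '?', '!', ':']

words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words

words = [w for w in words if w]  # drop the empty strings left between adjacent delimiters
print(words)  # ['Patient', 'reports', 'pain', 'nausea', 'No', 'fever', 'None']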
EDITED:
Or we could simply use the regular expression module. Note that the delimiters have to be escaped (characters such as '.' and '?' are regex metacharacters), and re.split expects a string, so the list of lines is joined first:
import re
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = re.split('|'.join(map(re.escape, delimiters)), ''.join(content))

with open("C:\...\...\...\record-13.txt") as f:
for line in f:
for word in line.split():
print word
Or, this gives you a list of words
with open("C:\...\...\...\record-13.txt") as f:
words = [word for line in f for word in line.split()]
Or, this gives you a list of lines, but with each line as a list of words.
with open("C:\...\...\...\record-13.txt") as f:
words = [line.split() for line in f]

Nobody has suggested a generator; I'm surprised. Here's how I would do it:
def words(stringIterable):
    # upcast the argument to an iterator; if it's an iterator already, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream:        # enumerate the lines
        for word in line.split():  # further break them down
            yield word
Now this can be used both on simple lists of sentences that you might have in memory already:
listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)
But it will work just as well on a file, without needing to read the whole file in memory:
with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)

I would use the Natural Language Toolkit (NLTK), as plain split() does not deal well with punctuation.
import nltk  # may also need nltk.download('punkt') the first time, for the tokenizer data

with open("C:\\...\\...\\...\\record-13.txt") as f:
    for line in f:
        words = nltk.word_tokenize(line)
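If you want all the tokens from the file in one flat list, here is a small sketch along the same lines (the filename is a stand-in; it assumes NLTK and its tokenizer data are installed):
import nltk

with open("record-13.txt") as f:
    tokens = [token for line in f for token in nltk.word_tokenize(line)]
print(tokens[:20])  # punctuation shows up as separate tokens such as ',' and '.'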

The most flexible approach is to use a list comprehension to generate a list of words:
with open("C:\\...\\...\\...\\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]
# Do what you want with the words list
Which you can then iterate over, add to a collections.Counter or anything else you please.
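For instance, a minimal sketch of the collections.Counter idea, reusing the words list built above:
from collections import Counter

counts = Counter(words)
print(counts.most_common(10))  # the ten most frequent words and their counts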

Related

Remove words from a subtitle file that aren't in a wordlist (of common words)

I have some subtitle files, and I'm not intending to learn every single word in these subtitles; there is no need to learn hard terms like cleidocranial or dysplasia...
I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify or run it. (I'm using Linux.)
Here is our example:
subtitle file (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.
wordlist of 3000 common words (.txt):
...
people
with
are
good
...
Output we need (.srt):
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Or just mark them if it's possible (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
If there is a solution that works just with plain text (without timecodes), that's OK; just explain how to run it.
Thank you.
The following processes the 3rd line only of every '.srt' file. It can be easily adapted to process other lines and/or other files.
import os
import re
from glob import glob

with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)
Result (for the subtitle.srt you gave as example):
! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Alternative: just add a '*' next to out-of-vocabulary words:
# replace:
# parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]
The output is then:
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
Explanation:
The first open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
We use glob to find all filenames ending in '.srt'.
For each such file, we construct a new filename derived from it as '..._new.srt'.
We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0).
line.strip() removes the trailing newline.
We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used here is a common way to split into words (in particular, it keeps single quotes, as in "don't"; that may or may not be what you want, so adapt it at will, of course).
We use a capturing group split r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
The words themselves are every other element of parts, starting at index 1.
We replace the words by '*' if their lowercase form is not in keep_words.
Finally, we re-assemble that line and, in every case, output the line to the new file.
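As a quick illustration of the capturing-group split and re-assembly (uppercasing is just a stand-in transform to show which slots hold the words):
import re

parts = re.split(r"([\w']+)", 'People, who are good.')
print(parts)           # ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.']
parts[1::2] = [w.upper() for w in parts[1::2]]
print(''.join(parts))  # PEOPLE, WHO ARE GOOD.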
You could simply run a Python script like this:
with open("words.txt", "rt") as words:
#create a list with every word
wordList = words.read().split("\n")
with open("subtitle.srt", "rt") as subtitles:
with open("subtitle_output.srt", "wt") as out:
for line in subtitles.readlines():
if line[0].isdigit():
#ignore the line as it starts with a digit
out.write(line)
continue
else:
for word in line.split():
if not word in wordList:
out.write(line.replace(word, f"*{word}*"))
This script will replace every word that isn't in the common-words file with the marked form *word*, keeping the original file intact and writing everything to a new output file.
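One caveat: the membership test above is case- and punctuation-sensitive, so 'People' or 'good.' will never match the lowercase entries in words.txt. A minimal sketch of a more forgiving check (the helper name is made up; it builds on the wordList from the script above):
def is_common(word, common_words):
    # compare in lowercase and ignore surrounding punctuation
    return word.strip('.,!?;:"\'').lower() in common_words

commonSet = {w.lower() for w in wordList}  # a set is also much faster to search than a list
print(is_common("People,", commonSet))     # True, provided "people" is in words.txt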

Replacing lines with a part of that line in Python3

I am working on files with transcripts. I have lines of text, and every few lines there is a statement similar to 'Play video starting at 16 seconds and follow transcript0:16' (there might be more words when minutes are showing). I was able to isolate the text I want to replace the whole sentence with. So the end goal is to keep all the text from the file but replace those sentences with my shorter text; in my case it will be "transcript0:16".
with open("transcript.txt", "r") as fhandle:
newline=[]
for line in fhandle.readlines():
if line.startswith("Play video"):
words = line.split()
word = words[::-1]
wordfinal = word[0]
newline.append(line.replace(line,wordfinal))
with open("transcript.txt", "w") as fhandle:
for line in newline:
fhandle.writelines(line)
Thanks
You can append all the lines of your document to newline and apply your rule when the condition is true; otherwise just append the line unchanged:
newline = []
with open("transcript.txt", "r") as fhandle:
    for line in fhandle.readlines():
        if line.startswith("Play video"):
            words = line.split()
            word = words[::-1]
            wordfinal = word[0] + "\n"  # last word of the line, e.g. "transcript0:16"
            newline.append(wordfinal)
        else:
            newline.append(line)

with open("transcript.txt", "w") as fhandle:
    for line in newline:
        fhandle.writelines(line)

Normalization on a corpus using a dictionary

I want to do lexical normalization on a corpus using a dictionary. The corpus has eight thousand lines and the dictionary has thousands of word pairs (nonstandard: standard).
I have adopted an approach which is discussed here. The code looks like this:
with open("corpus.txt", 'r', encoding='utf8') as main:
words = main.read().split()
lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3', and so on}
for x in lexnorm:
for y in words:
if lexnorm[x][0] == y:
y == x[1]
text = ' '.join(lexnorm.get(y, y) for y in words)
print(text)
The code above works well, but I'm facing a problem since there are thousands of word pairs in the dictionary. Is it possible to represent the dictionary through a text file?
One last question: the output of the code consists of only one line. It would be great if it had the same number of lines as the original corpus does.
Could anyone help me with this? I'd be thankful.
One way to output the dictionary as a text file is as a JSON string:
import json

lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3'}  # etc.
with open('lexnorm.txt', 'w') as f:
    json.dump(lexnorm, f)
See my comment to your original. I am only guessing what you are trying to do:
import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f)  # read back lexnorm dictionary

with open("corpus.txt", 'r', encoding='utf8') as main, open('new_corpus.txt', 'w') as new_main:
    for line in main:
        words = re.split(r'[^a-zA-Z]+', line)
        for word in words:
            if word in lexnorm:
                line = line.replace(word, lexnorm[word])
        new_main.write(line)
The above program reads in the corpus.txt file line by line and attempts to intelligently split the line into words. Splitting on a single space is not adequate. Consider the following sentence:
'"The fox\'s foot grazed the sleeping dog, waking it."'
A standard split on a single space yields:
['"The', "fox's", 'foot', 'grazed', 'the', 'sleeping', 'dog,', 'waking', 'it."']
You would never be able to match The, fox, dog, or it.
There are several ways to handle it. I am splitting on one or more non-alpha characters. This may need to be "tweaked" if the words in lexnorm consist of characters other than a-z:
re.split(r'[^a-zA-Z]+', '"The fox\'s foot grazed the sleeping dog, waking it."')
Yields:
['', 'The', 'fox', 's', 'foot', 'grazed', 'the', 'sleeping', 'dog', 'waking', 'it', '']
Once the line is split into words, each word is looked up in the lexnorm dictionary and if found then a simple replace of that word is done in the original line. Finally, the line and any replacements done to that line are written out to a new file. You can then delete the old file and rename the new file.
Think about how you might handle words that would match if they had been converted to lower case first.
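For instance, a hedged sketch of the lowercase-lookup idea with a couple of made-up word pairs (it assumes all lexnorm keys are stored in lowercase):
import re

lexnorm = {'u': 'you', 'gr8': 'great'}  # hypothetical pairs, keys all lowercase
line = 'U are gr8'
for word in re.split(r'[^a-zA-Z0-9]+', line):
    key = word.lower()
    if key in lexnorm:
        line = line.replace(word, lexnorm[key])
print(line)  # you are great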
Update (Major Optimization)
Since there are likely to be a lot of duplicate words in a file, an optimization is to process each unique word once, which can be done if the file is not so large that it cannot be read into memory:
import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f)  # read back lexnorm dictionary

with open("corpus.txt", 'r', encoding='utf8') as main:
    text = main.read()

word_set = set(re.split(r'[^a-zA-Z]+', text))
for word in word_set:
    if word in lexnorm:
        text = text.replace(word, lexnorm[word])

with open("corpus.txt", 'w', encoding='utf8') as main:
    main.write(text)
Here the entire file is read into text, split into words, and the words are added to a set, word_set, guaranteeing their uniqueness. Then each word in word_set is looked up and replaced throughout the text, and the text is rewritten back out to the original file.
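One thing to keep in mind: str.replace works on substrings, so a short nonstandard word could also be rewritten inside a longer word. If that matters, a hedged variant is to replace whole words only, using a regex word boundary:
import re

lexnorm = {'u': 'you'}  # hypothetical pair
text = 'u and unusual'
for word, repl in lexnorm.items():
    # \b anchors the match at word boundaries, so 'unusual' is left untouched
    text = re.sub(r'\b' + re.escape(word) + r'\b', repl, text)
print(text)  # you and unusual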

How would I read only the first word of each line of a text file?

I wanted to know how I could read ONLY the FIRST WORD of each line in a text file. I tried various pieces of code and tried altering them, but can only manage to read whole lines from the text file.
The code I used is shown below:
QuizList = []
with open('Quizzes.txt', 'r') as f:
    for line in f:
        QuizList.append(line)
line = QuizList[0]
for word in line.split():
    print(word)
This refers to an attempt to extract only the first word from the first line. In order to repeat the process for every line, I would do the following:
QuizList = []
with open('Quizzes.txt', 'r') as f:
    for line in f:
        QuizList.append(line)
capacity = len(QuizList)
capacity = capacity - 1
index = 0
while index != capacity:
    line = QuizList[index]
    for word in line.split():
        print(word)
    index = index + 1
You are using split at the wrong point, try:
for line in f:
    QuizList.append(line.split(None, 1)[0])  # add only first word
Changed to a one-liner that's also more efficient with the strip as Jon Clements suggested in a comment.
with open('Quizzes.txt', 'r') as f:
    wordlist = [line.split(None, 1)[0] for line in f]
This is pretty irrelevant to your question, but just so the line.split(None, 1) doesn't confuse you, it's a bit more efficient because it only splits the line 1 time.
From the str.split([sep[, maxsplit]]) docs
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
' 1 2 3 '.split() returns ['1', '2', '3']
and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
with open(filename, "r") as f:
    wordlist = [r.split()[0] for r in f]
I'd go for the str.split and similar approaches, but for completeness here's one that uses a combination of mmap and re, if you needed to extract more complicated data:
import mmap, re

with open('quizzes.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    wordlist = re.findall(rb'^(\w+)', mf, flags=re.M)  # mmap is bytes-like, so the pattern must be bytes
You should read one character at a time:
import string

QuizList = []
with open('Quizzes.txt', 'r') as f:
    for line in f:
        for i, c in enumerate(line):
            if c not in string.ascii_letters:
                print(line[:i])
                break
l = []
with open('task-1.txt', 'rt') as myfile:
    for x in myfile:
        l.append(x)
for i in l:
    print(i.split()[0])

I have a text file of a paragraph of writing, and want to iterate through each word in Python

How would I do this? I want to iterate through each word and see if it fits certain parameters (for example, is it longer than 4 letters, etc.; not really important though).
The text file is literally a rambling of text with punctuation and white space, much like this posting.
Try split()ing the string.
f = open('your_file')
for line in f:
    for word in line.split():
        pass  # do something with each word
If you want it without punctuation:
f = open('your_file')
for line in f:
    for word in line.split():
        word = word.strip('.,?!')
        # do something
You can simply use content.split()
f = open(filename, "r")
lines = f.readlines()
for i in lines:
    thisline = i.split(" ")
data=open("file").read().split()
for item in data:
if len(item)>4:
print "longer than 4: ",item
