I want to do lexical normalization on a corpus using a dictionary. The corpus has eight thousand lines and the dictionary has thousands of word pairs (nonstandard : standard).
I have adopted an approach which is discussed here. The code looks like this:
with open("corpus.txt", 'r', encoding='utf8') as main:
words = main.read().split()
lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3', and so on}
for x in lexnorm:
for y in words:
if lexnorm[x][0] == y:
y == x[1]
text = ' '.join(lexnorm.get(y, y) for y in words)
print(text)
The code above works well, but I'm facing a problem since there are thousands of word pairs in the dictionary. Is it possible to keep the dictionary in a text file instead of hard-coding it?
Last question: the output file of the code consists of only one line. It would be great if it had the same number of lines as the original corpus.
Could anyone help me with this? I'd be thankful.
One way to output the dictionary as a text file is as a JSON string:
import json
lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3'} # etc.
with open('lexnorm.txt', 'w') as f:
    json.dump(lexnorm, f)
See my comment to your original. I am only guessing what you are trying to do:
import json, re
with open('lexnorm.txt') as f:
    lexnorm = json.load(f)  # read back lexnorm dictionary

with open("corpus.txt", 'r', encoding='utf8') as main, open('new_corpus.txt', 'w') as new_main:
    for line in main:
        words = re.split(r'[^a-zA-Z]+', line)
        for word in words:
            if word in lexnorm:
                line = line.replace(word, lexnorm[word])
        new_main.write(line)
The above program reads in the corpus.txt file line by line and attempts to intelligently split the line into words. Splitting on a single space is not adequate. Consider the following sentence:
'"The fox\'s foot grazed the sleeping dog, waking it."'
A standard split on a single space yields:
['"The', "fox's", 'foot', 'grazed', 'the', 'sleeping', 'dog,', 'waking', 'it."']
You would never be able to match The, fox, dog, or it.
There are several ways to handle it. I am splitting on one or more non-alpha characters. This may need to be "tweaked" if the words in lexnorm consist of characters other than a-z:
re.split(r'[^a-zA-Z]+', '"The fox\'s foot grazed the sleeping dog, waking it."')
Yields:
['', 'The', 'fox', 's', 'foot', 'grazed', 'the', 'sleeping', 'dog', 'waking', 'it', '']
Once the line is split into words, each word is looked up in the lexnorm dictionary, and if it is found, a simple replace of that word is done in the original line. Finally, the line, with any replacements, is written out to a new file. You can then delete the old file and rename the new file.
Think about how you might handle words that would match if they had been converted to lower case first.
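One possible sketch of that, assuming the lexnorm keys are stored in lowercase: lowercase each word only for the lookup, but replace the word as it actually appears in the line:
import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f)  # assumed to have lowercase keys

with open("corpus.txt", 'r', encoding='utf8') as main, open('new_corpus.txt', 'w') as new_main:
    for line in main:
        for word in re.split(r'[^a-zA-Z]+', line):
            replacement = lexnorm.get(word.lower())  # case-insensitive lookup
            if replacement is not None:
                line = line.replace(word, replacement)
        new_main.write(line)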
Update (Major Optimization)
Since there are likely to be many duplicate words in a file, an optimization is to process each unique word only once. This works as long as the file is not so large that it cannot be read into memory:
import json, re
with open('lexnorm.txt') as f:
    lexnorm = json.load(f)  # read back lexnorm dictionary

with open("corpus.txt", 'r', encoding='utf8') as main:
    text = main.read()

word_set = set(re.split(r'[^a-zA-Z]+', text))

for word in word_set:
    if word in lexnorm:
        text = text.replace(word, lexnorm[word])

with open("corpus.txt", 'w', encoding='utf8') as main:
    main.write(text)
Here the entire file is read into text and split into words, and the words are added to a set word_set, guaranteeing the uniqueness of words. Then each word in word_set is looked up and replaced in the entire text, and the entire text is rewritten back out to the original file.
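One caveat: str.replace also matches inside longer words (replacing "no" would also touch "nothing"). A hedged variant, assuming the dictionary keys contain only letters, is to build a single regex with word boundaries and do all replacements in one pass:
import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f)

with open("corpus.txt", 'r', encoding='utf8') as main:
    text = main.read()

# one alternation of all dictionary keys, wrapped in word boundaries
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, lexnorm)) + r')\b')
text = pattern.sub(lambda m: lexnorm[m.group(0)], text)

with open("corpus.txt", 'w', encoding='utf8') as main:
    main.write(text)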
Related
I have some subtitle files, and I'm not intending to learn every single word in them; there is no need to learn hard terms like cleidocranial, dysplasia...
I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify it or run it. (I'm using Linux.)
Here is our example:
subtitle file (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.
wordlist of 3000 common words (.txt):
...
people
with
are
good
...
Output we need (.srt):
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Or just mark them if it's possible (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
If there is a solution that works just with plain text (without timecodes), that's OK; just explain how to run it.
Thank you.
The following processes the 3rd line only of every '.srt' file. It can be easily adapted to process other lines and/or other files.
import os
import re
from glob import glob
with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)
Result (for the subtitle.srt you gave as an example):
! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Alternative: just add a '*' next to out-of-vocabulary words:
# replace:
# parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]
The output is then:
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
Explanation:
The first open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
We use glob to find all filenames ending in '.srt'.
For each such file, we construct a new filename derived from it as '..._new.srt'.
We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0).
line.strip() removes the trailing newline.
We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used is often used to split words (in particular, it leaves in single quotes such as "don't"; it may or may not be what you want, adapt at will of course).
We use a capturing group split r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
The words themselves are every other element of parts, starting at index 1.
We replace the words by '*' if their lowercase form is not in keep_words.
Finally we re-assemble that line, and generally output all lines to the new file.
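A quick, self-contained illustration of the capturing-group split and re-assembly on one hypothetical line:
import re

line = 'People with cleidocranial dysplasia are good.'
keep_words = {'people', 'with', 'are', 'good'}

parts = re.split(r"([\w']+)", line)
parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
print(''.join(parts))  # People with * * are good.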
You could simply run a Python script like this:
with open("words.txt", "rt") as words:
    # create a list with every word
    wordList = words.read().split("\n")

with open("subtitle.srt", "rt") as subtitles:
    with open("subtitle_output.srt", "wt") as out:
        for line in subtitles.readlines():
            if line[0].isdigit():
                # copy the line unchanged, as it starts with a digit (index or timecode)
                out.write(line)
                continue
            else:
                for word in line.split():
                    if word not in wordList:
                        line = line.replace(word, f"*{word}*")
                out.write(line)
This script will replace every word that's not in the common-words file with the modified *word*, keeping the original file and putting everything into a new output file.
I have a dictionary dict with some words (2000) and I have a huge text, like a Wikipedia corpus, in text format. For each word that is both in the dictionary and in the text file, I would like to replace it with word_1.
with open("wiki.txt",'r') as original, open("new.txt",'w') as mod:
    for line in original:
        new_line = line
        for word in line.split():
            if dict.get(word.lower()) is not None:
                new_line = new_line.replace(word, word + "_1")
        mod.write(new_line)
This code creates a new file called new.txt with the words that appear in the dictionary replaced as I want.
This works for short files, but for the longer ones that I am using as input, it "freezes" my computer.
Is there a more efficient way to do that?
Edit for Adi219:
Your code seems working, but there is a problem:
If a line is like this: Albert is a friend of Albert, and in my dictionary I have Albert, then after the for loop the line will look like this: Albert_1_1 is a friend of Albert_1. How can I replace only the exact word that I want, to avoid repetitions like _1_1_1_1?
Edit2:
To solve the previous problem, I changed your code:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                mod.write(word + "_1 ")
            else:
                mod.write(word + " ")
        mod.write("\n")
Now everything should work.
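For reference, a whole-word regex substitution also avoids the repeated suffix while keeping the original spacing and punctuation. This is only a sketch, assuming the dictionary is still named dict and its keys are lowercase:
import re

with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        # \w+ matches each word exactly once, so "Albert" never becomes "Albert_1_1"
        new_line = re.sub(r"\w+",
                          lambda m: m.group(0) + "_1" if dict.get(m.group(0).lower()) is not None else m.group(0),
                          line)
        mod.write(new_line)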
A few things:
You could remove the declaration of new_line. Then, change the new_line = new_line.replace(...) line to line = line.replace(...). You would also have to write(line) afterwards.
You could add words = line.split() and use for word in words: for the for loop, as this removes a call to .split() for every iteration through the words.
You could (manually(?)) split your large .txt file into multiple smaller files and have multiple instances of your program running on each file, and then you could combine the multiple outputs into one file. Note: You would have to remember to change the filename for each file you're reading/writing to.
So, your code would look like:
with open("wiki.txt", "r") as original, open("new.txt", "w") as mod:
    for line in original:
        words = line.split()
        for word in words:
            if dict.get(word.lower()) is not None:
                line = line.replace(word, word + "_1")
        mod.write(line)
I have to open a file and read it line by line. For each line, I have to split the line into a list of words using the split() method. The program should build a list of words: for each word on each line, check to see if the word is already in the list, and if not, append it to the list. When the program completes, print the resulting words in alphabetical order.
fname = input("Enter file name: ")
fh = open(fname)
lst=list()
for line in fh:
    for each in line:
        word = line.split()
        if word not in lst:
            lst.append(word)
print(lst)
After this, I am getting 4 different lists, but I am required to get a single list and I am not able to get that.
One line: create a set in a set comprehension using split on each line, and sort the set into a list, using key=str.casefold to sort case-insensitively/locale-aware.
with open(fname) as f:
    result = sorted({word for line in f for word in line.split()}, key=str.casefold)
This is particularly efficient since you don't have to test membership with in on your existing list, which performs a linear search and is very slow if the list is big.
If the file contains punctuation, that won't work very well because split won't remove it. Use a regex in that case:
result = sorted({word for line in f for word in re.split(r"\W+", line) if word}, key=str.casefold)
(you have to add an extra non-empty filter)
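For example, on a line containing punctuation, the regex version behaves roughly like this (a small sketch with a made-up line):
import re

line = "The dog, waking it."
print(re.split(r"\W+", line))                    # ['The', 'dog', 'waking', 'it', '']
print([w for w in re.split(r"\W+", line) if w])  # ['The', 'dog', 'waking', 'it']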
Try this
fname = input("Enter file name: ")
fh = open(fname)
lst=list()
for line in fh:
    for each in line:
        word = line.split()
        for wrd in word:
            if wrd not in lst:
                lst.append(wrd)
print(lst)
There are several problems in the code
First, it's unclear what you want to do with for each in line. I simply removed it.
Second, with word = line.split(), you get word as a list of words. You need to iterate through the list of words and perform actions on individual words.
Then, use sorted() to sort the words.
The refined code looks like this:
fname = input("Enter file name: ")
fh = open(fname)
lst=list()
for line in fh:
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
fh.close()
print(sorted(lst))
Side note: You're not closing the file. Use fh.close() (as I added above) or use with open(fname) as fh: which will close the file for you after leaving the with block.
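A sketch of the same loop using with, so the file is closed automatically:
fname = input("Enter file name: ")
lst = list()

with open(fname) as fh:  # fh is closed automatically when the block ends
    for line in fh:
        for word in line.split():
            if word not in lst:
                lst.append(word)

print(sorted(lst))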
Same style as the answers before, but rather than looping over the file object directly, I use the readlines() function.
The line lines = fh.readlines() returns a full list of the lines.
Then you look at each line with for line in lines, and at each word in the line with for word in words.
fname = input("Enter file name: ")
fh = open(fname)
lst=list()
lines=fh.readlines()
for line in lines:
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
You can do this like this:
with open(fname) as fh:
    unique_words = set(fh.read().split())
To move that set into list use:
unique_words = list(unique_words)
and to sort that list:
unique_words.sort()
You should be able to adapt this idea to your problem.
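Putting those pieces together, a minimal sketch (assuming fname holds the file name, as in the question):
with open(fname) as fh:
    unique_words = sorted(set(fh.read().split()))

print(unique_words)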
words.txt
In linguistics, a word is the smallest element
that can be uttered in isolation with objective
or practical meaning. This contrasts deeply with
a morpheme, which is the smallest unit of meaning
but will not necessarily stand on its own.
import re
with open('words.txt', 'r', encoding='utf8') as f:
    words = f.read()
all_words = re.findall(r'\w+', words)
result = sorted(set(all_words), key=str.lower)
print(result)
['a', 'be', 'but', 'can', 'contrasts', 'deeply', 'element', 'in', 'In',
'is', 'isolation', 'its', 'linguistics', 'meaning', 'morpheme',
'necessarily', 'not', 'objective', 'of', 'on', 'or', 'own', 'practical',
'smallest', 'stand', 'that', 'the', 'This', 'unit', 'uttered', 'which',
'will', 'with', 'word']
Is this an interview question by chance? I am asking because of this part:
For each word on each line check to see if the word is already in the list
If this is an interview question, it could very well be that your interviewer does not necessarily want you to use language features to solve this, but rather wants you to implement a searching algorithm or structure, for example a BST.
Assuming this is not the case, however, let's go over your code. First, I would recommend you switch from opening the file the way you do, fh = open(fname), to a context, using with. Or, at least, close the file handle.
word_dictionary = dict()

with open(file_name) as source:
    for line in source:
        for word in line.split():
            if word_dictionary.get(word) is None:
                word_dictionary[word] = True

word_list = [word for word, _ in word_dictionary.items()]
word_list = sorted(word_list)
print(word_list)
Let's go over the code together.
First, we define a dictionary, called word_dictionary. A dictionary or hash table is a data structure that allows you to make lookup operation in constant, O(1), time. That is to say, very fast.
Second, we open the file containing the text. This is with open(file_name) as source: We call this a context. It is a convenient way of dealing with files (and not only files) that automatically takes care of resource management. I won't go into detail, but I will recommend this article.
We begin by reading each line, and for each line, we read each word.
for line in source:
    for word in line.split():
For each word, we check to see if we have already encountered it. We do this by using the .get() method of dictionaries. This method will check to see if the argument exists as a key in the dictionary. If it is, it will return the value associated with the key. Otherwise, it will return None.
if word_dictionary.get(word) is None:
    word_dictionary[word] = True
This says that if we encounter a word that we have not seen already, then we register seeing it. Note that it's not necessary to use True as a value; anything other than None will work.
Once we have seen every word in the text, we do this.
word_list = [word for word, _ in word_dictionary.items()]
Using .items() we iterate over the key, value pairs of the dictionary. That is to say, if we had a dictionary d = {0: "a", 1: "b", 2: "c"}, calling for key, value in d.items() will yield key = 0, value = "a" first, key = 1, value = "b" second, and finally key = 2, value = "c".
In our case, we are not interested in the value. That is why we use _. We are only interested in the word.
What the list comprehension will leave us with is a list of all the words you encountered, in no particular order. This is because (before Python 3.7) dictionaries make no guarantee about the order of the key, value pairs.
Therefore, we need to sort.
word_list = sorted(word_list)
Since word_list is a list of strings, and strings are comparable by default, they will be sorted lexicographically.
Now, there are several things to consider. First, .split() will consider 'this' and 'this!' to be different words. You may or may not want this.
If you do not want this, you can use the 'string' module to check against punctuation or you can use a regex to clean up your word.
The second thing to consider is capitalisation. You make no mention of this. Are you allowed to lower or capitalise your text? If you are, your life will be easier. You lowercase everything and your problems go away.
If you are not allowed to lower your text, you will have to change the sort call. This is because capital letters are "smaller" than lowercase ones.
>>> "A" < "a"
True
This will manifest in the following way.
>>> sorted(["b", "a", "C"])
['C', 'a', 'b']
Most likely you expect ["a", "b", "C"] here. In this case, I recommend you look into the key argument of sort.
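For example, a case-insensitive sort can be obtained with the key argument:
>>> sorted(["b", "a", "C"], key=str.lower)
['a', 'b', 'C']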
I am an absolute beginner in Python. I am doing a textual analysis of Greek plays and counting the word frequencies of each word. Because the plays are very long, I am unable to see my full set of data; it only shows the words with the lowest frequencies because there is not enough space in the Python window. I am thinking of converting it to a .csv file. My full code is below:
#read the file as one string and split the string into a list of separate words
input = open('Aeschylus.txt', 'r')
text = input.read()
wordlist = text.split()
#read file containing stopwords and split the string into a list of separate words
stopwords = open("stopwords .txt", 'r').read().split()
#remove stopwords
wordsFiltered = []
for w in wordlist:
    if w not in stopwords:
        wordsFiltered.append(w)
#create dictionary by counting no of occurences of each word in list
wordfreq = [wordsFiltered.count(x) for x in wordsFiltered]
#create word-frequency pairs and create a dictionary
dictionary = dict(zip(wordsFiltered,wordfreq))
#sort by decreasing frequency and print
aux = [(dictionary[word], word) for word in dictionary]
aux.sort()
aux.reverse()
for y in aux: print y
import csv
with open('Aeschylus.csv', 'w') as csvfile:
    fieldnames = ['dictionary[word]', 'word']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'dictionary[word]': '1', 'word': 'inherited'})
    writer.writerow({'dictionary[word]': '1', 'word': 'inheritance'})
    writer.writerow({'dictionary[word]': '1', 'word': 'inherit'})
I found the code for the csv on the internet. What I'm hoping to get is the full list of data from the highest to lowest frequency. Using the code I have right now, Python seems to be totally ignoring the csv part and just printing the data as if I didn't code for the csv.
Any idea on what I should code to see my intended result?
Thank you.
Since you have a dictionary where the words are keys and their frequencies the values, a DictWriter is ill-suited. It is good for sequences of mappings that share some common set of keys, used as the columns of the csv. For example, if you had a list of dicts such as the one you manually create:
a_list = [{'dictionary[word]': '1', 'word': 'inherited'},
          {'dictionary[word]': '1', 'word': 'inheritance'},
          {'dictionary[word]': '1', 'word': 'inherit'}]
then a DictWriter would be the tool for the job. But instead you have a single dictionary like:
dictionary = {'inherited': 1,
              'inheritance': 1,
              'inherit': 1,
              ...: ...}
But, you've already built a sorted list of (freq, word) pairs as aux, which is perfect for writing to csv:
with open('Aeschylus.csv', 'wb') as csvfile:
    header = ['frequency', 'word']
    writer = csv.writer(csvfile)
    writer.writerow(header)
    # Note the plural method name
    writer.writerows(aux)
python seems to be totally ignoring the csv part and just printing the data as if I didn't code for the csv.
sounds rather odd. At least you should've gotten a file Aeschylus.csv containing:
dictionary[word],word
1,inherited
1,inheritance
1,inherit
Your frequency counting method could also be improved. At the moment
#create dictionary by counting no of occurences of each word in list
wordfreq = [wordsFiltered.count(x) for x in wordsFiltered]
has to loop through the list wordsFiltered for each word in wordsFiltered, so O(n²). You could instead iterate through the words in the file, filter, and count as you go. Python has a specialized dictionary for counting hashable objects called Counter:
from __future__ import print_function
from collections import Counter
import csv
# Many ways to go about this, could for example yield from (<gen expr>)
def words(filelike):
    for line in filelike:
        for word in line.split():
            yield word

def remove(iterable, stopwords):
    stopwords = set(stopwords)  # O(1) lookups instead of O(n)
    for word in iterable:
        if word not in stopwords:
            yield word

if __name__ == '__main__':
    with open("stopwords.txt") as f:
        stopwords = f.read().split()
    with open('Aeschylus.txt') as wordfile:
        wordfreq = Counter(remove(words(wordfile), stopwords))
Then, as before, print the words and their frequencies, beginning from most common:
for word, freq in wordfreq.most_common():
    print(word, freq)
And/or write as csv:
# Since you're using python 2, 'wb' and no newline=''
with open('Aeschylus.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['word', 'freq'])
    # If you want to keep most common order in CSV as well. Otherwise
    # wordfreq.items() would do as well.
    writer.writerows(wordfreq.most_common())
I am very new to Python and have not worked with text before... I have 100 text files; each has around 100 to 150 lines of unstructured text describing a patient's condition. I read one file in Python using:
with open("C:\\...\\...\\...\\record-13.txt") as f:
    content = f.readlines()
print (content)
Now I can split each line of this file to its words using for example:
a = content[0].split()
print (a)
but I don't know how to split the whole file into words.
Do loops (while or for) help with that?
Thank you for your help, guys. Your answers helped me to write this (in my file, words are separated by spaces, so that's the delimiter, I think!):
with open ("C:\\...\\...\\...\\record-13.txt") as f:
lines = f.readlines()
for line in lines:
words = line.split()
for word in words:
print (word)
That simply splits the text into words and prints one word per line.
It depends on how you define words, or what you regard as the delimiters.
Notice that str.split in Python accepts an optional separator, so you could pass it like this:
for lines in content[0].split():
    for word in lines.split(','):
        print(word)
Unfortunately, str.split accepts a single separator only, so you may need multi-level splitting like this:
for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'):
                            if word != "":
                                print(word)
Looks ugly, right? Luckily we can use iteration instead:
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
EDITED:
Or we could simply use the regular expression module:
import re

delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
# re.escape is needed because '.', '?' and '!' are regex metacharacters
words = re.split('|'.join(map(re.escape, delimiters)), ''.join(content))
with open("C:\...\...\...\record-13.txt") as f:
for line in f:
for word in line.split():
print word
Or, this gives you a list of words
with open("C:\...\...\...\record-13.txt") as f:
    words = [word for line in f for word in line.split()]
Or, this gives you a list of lines, but with each line as a list of words.
with open("C:\...\...\...\record-13.txt") as f:
    words = [line.split() for line in f]
Nobody has suggested a generator, I'm surprised. Here's how I would do it:
def words(stringIterable):
    # upcast the argument to an iterator; if it's already an iterator, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream:        # enumerate the lines
        for word in line.split():  # further break them down
            yield word
Now this can be used both on simple lists of sentences that you might have in memory already:
listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)
But it will work just as well on a file, without needing to read the whole file into memory:
with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)
I would use Natural Language Tool Kit as the split() way does not deal well with punctuation.
import nltk
for line in file:
    words = nltk.word_tokenize(line)
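A small sketch of what that produces, assuming NLTK and its 'punkt' tokenizer data are installed:
import nltk
# nltk.download('punkt')  # only needed once, if the tokenizer data is not present

line = "The fox's foot grazed the sleeping dog, waking it."
print(nltk.word_tokenize(line))
# e.g. ['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',', 'waking', 'it', '.']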
The most flexible approach is to use a list comprehension to generate a list of words:
with open("C:\...\...\...\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]

# Do what you want with the words list
Which you can then iterate over, add to a collections.Counter or anything else you please.
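For instance, a short sketch of feeding that list into collections.Counter to get the most frequent words:
from collections import Counter

with open("C:\...\...\...\record-13.txt") as f:
    words = [word for line in f for word in line.split()]

print(Counter(words).most_common(10))  # the ten most frequent words and their counts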