Lemmatization of a list of words - python

I have a list of words in a text file. I want to lemmatize them so that words with the same meaning but different tenses (such as try and tried) are collapsed into one. When I do this, I keep getting an error: TypeError: unhashable type: 'list'
results=[]
with open('/Users/xyz/Documents/something5.txt', 'r') as f:
    for line in f:
        results.append(line.strip().split())
lemma= WordNetLemmatizer()
lem=[]
for r in results:
    lem.append(lemma.lemmatize(r))
with open("lem.txt","w") as t:
    for item in lem:
        print>>t, item
How do I lemmatize words which are already tokens?

The method WordNetLemmatizer.lemmatize expects a single string, but you are passing it a list of strings. That is what raises the TypeError.
line.split() returns a list of strings, and appending that list to results makes results a list of lists.
Use results.extend(line.strip().split()) instead, which adds each word individually:
from nltk.stem import WordNetLemmatizer
results = []
with open('/Users/xyz/Documents/something5.txt', 'r') as f:
    for line in f:
        results.extend(line.strip().split())
lemma = WordNetLemmatizer()
lem = map(lemma.lemmatize, results)
with open("lem.txt", "w") as t:
    for item in lem:
        print(item, file=t)  # Python 3; in Python 2 use: print >> t, item
Or, refactored as a generator so that no intermediate results list is needed:
def words(fname):
    with open(fname, 'r') as document:
        for line in document:
            for word in line.strip().split():
                yield word
lemma = WordNetLemmatizer()
lem = map(lemma.lemmatize, words('/Users/xyz/Documents/something5.txt'))
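To see the difference between append and extend without installing NLTK, here is a minimal sketch. The sample lines are made up and simply stand in for the contents of the text file.

```python
# Hypothetical sample lines standing in for the contents of the text file.
lines = ["try tried\n", "word words\n"]

nested = []
flat = []
for line in lines:
    tokens = line.strip().split()
    nested.append(tokens)  # adds the whole list as a single element
    flat.extend(tokens)    # adds each token individually

print(nested)  # [['try', 'tried'], ['word', 'words']]
print(flat)    # ['try', 'tried', 'word', 'words']
```

Every element of flat is a string, which is what a per-word function such as lemmatize needs; every element of nested is a list, which is what triggers the TypeError.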

Open the text file and read its lines into a list, as shown below:
fo = open(filename)
results1 = fo.readlines()
results1
['I have a list of words in a text file', ' \n I want to perform lemmatization on them to remove words which have the same meaning but are in different tenses', '']
# Tokenize lists
results2 = [line.split() for line in results1]
# Remove empty lists
results2 = [ x for x in results2 if x != []]
# Lemmatize each word from a list using WordNetLemmatizer
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemma_list_of_words = []
for i in range(0, len(results2)):
    l1 = results2[i]
    l2 = ' '.join([lemmatizer.lemmatize(word) for word in l1])
    lemma_list_of_words.append(l2)
lemma_list_of_words
['I have a list of word in a text file', 'I want to perform lemmatization on them to remove word which have the same meaning but are in different tense']
Compare lemma_list_of_words with results1 to see what the lemmatizer changed: with its default part of speech (noun), words became word and tenses became tense.

Related

Python count words of split sentence?

I'm not sure how to remove the "\n" at the end of the output.
I have a txt file with sentences such as:
"What does Bessie say I have done?" I asked.
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child
taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I managed to split the sentences by semicolon with code:
import re
with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    re.split(';',low)
But I'm not sure how to count the words of the split sentences, as len() doesn't work.
The output of the sentences:
['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a
child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']
In the third sentence, for example, I am trying to count the 3 words on the left and the 8 words on the right.
Thanks for reading!
Assuming words are separated by single spaces, the number of words is the number of spaces plus one.
For example, two spaces means three words:
World is wonderful
Code:
import re
import string
lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()
DELIMITER = ';'
word_count = []
for i, sentence in enumerate(lines):
    # Skip empty sentences
    if not sentence.strip():
        continue
    # Remove punctuation other than our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))
    # Split by our delimiter
    splitted = re.split(DELIMITER, sentence)
    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in splitted])
# [[9], [7, 9], [7], [3, 8]]
print(word_count)
Use str.rstrip('\n') to remove the \n at the end of each sentence.
To count the words in a sentence, you can use len(sentence.split(' '))
To transform a list of sentences into a list of counts, you can use the map function.
So here it is:
import re
with open("testing.txt") as file:
    for i, line in enumerate(file.readlines()):
        # Ignore empty lines
        if line.strip(' ') != '\n':
            # Lowercase and drop the trailing newline
            line = line.lower().rstrip('\n')
            # Split by semicolons
            parts = re.split(';', line)
            print("SENTENCES:", parts)
            counts = list(map(lambda part: len(part.split()), parts))
            print("COUNTS:", counts)
Outputs
SENTENCES: ['"what does bessie say i have done?" i asked.']
COUNTS: [9]
SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
COUNTS: [7, 9]
SENTENCES: [' taking up her elders in that manner.']
COUNTS: [7]
SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
COUNTS: [3, 8]
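The same idea can be wrapped in a small helper, sketched here on an in-memory string instead of a file. The function name count_words is made up for illustration.

```python
def count_words(line, delimiter=";"):
    """Return the word count of each delimiter-separated part of line."""
    # strip() removes the trailing '\n'; split() with no argument
    # splits on any run of whitespace, so extra spaces do not matter.
    parts = line.strip().split(delimiter)
    return [len(part.split()) for part in parts]

sample = 'Be seated somewhere; and until you can speak pleasantly, remain silent."\n'
print(count_words(sample))  # [3, 8]
```

Using part.split() rather than counting spaces avoids miscounting when a part starts with a space or contains several spaces in a row.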
You'll need the library nltk
from nltk import sent_tokenize, word_tokenize
mytext = """I have a dog.
The dog is called Bob."""
for sent in sent_tokenize(mytext):
    print(len(word_tokenize(sent)))
Output
5
6
Step by step explanation:
for sent in sent_tokenize(mytext):
    print('Sentence >>>',sent)
    print('List of words >>>',word_tokenize(sent))
    print('Count words per sentence>>>', len(word_tokenize(sent)))
Output:
Sentence >>> I have a dog.
List of words >>> ['I', 'have', 'a', 'dog', '.']
Count words per sentence>>> 5
Sentence >>> The dog is called Bob.
List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
Count words per sentence>>> 6
import re
sentences = []  # empty list for storing the result
with open('testtext.txt') as fileObj:
    lines = [line.strip() for line in fileObj if line.strip()]  # non-empty lines, already stripped of '\n'
for line in lines:
    sentences += re.split(';', line)  # split each line by ';' and store the parts in sentences
for sentence in sentences:
    print(sentence + ' ' + str(len(sentence.split())))  # print each part with its word count
Try this one:
import re
with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    low = low.strip()  # strip() also removes the trailing '\n'
    low = low.replace('\n', '')
    print(re.split(';',low))

Remove words with spaces

I want to find all the "phrases" in a list and remove them from the list, so that I have only words (without spaces) left. I'm making a hangman-type game and want the computer to choose a random word. I'm new to Python and coding, so I'm happy to hear other suggestions for my code as well.
import random
fhand = open('common_words.txt')
words = []
for line in fhand:
    line = line.strip()
    words.append(line)
for word in words:
    if ' ' in word:
        words.remove(word)
print(words)
Sets offer constant-time membership tests and drop duplicates automatically, which can help when the word list is large. Build the set while reading the file and keep only the lines without spaces.
# Load all words
words = set()
with open('common_words.txt') as file:
    for line in file.readlines():
        line = line.strip()
        if " " not in line:
            words.add(line)
# Can be converted to a one-liner
words = set(filter(lambda x: " " not in x, map(str.strip, open('common_words.txt').readlines())))
# Get random word (random.choice needs a sequence, so convert the set to a list)
import random
print(random.choice(list(words)))
Use str.split(). It separates by both spaces and newlines by default.
>>> 'some words\nsome more'.split()
['some', 'words', 'some', 'more']
>>> 'this is a sentence.'.split()
['this', 'is', 'a', 'sentence.']
>>> 'dfsonf 43 SDFd fe#2'.split()
['dfsonf', '43', 'SDFd', 'fe#2']
Read the file normally and make a list this way:
words = []
with open('filename.txt','r') as file:
    words = file.read().split()
That should be good.
with open( 'common_words.txt', 'r' ) as f:
    words = [ word for word in filter( lambda x: len( x ) > 0 and ' ' not in x, map( lambda x: x.strip(), f.readlines() ) ) ]
with is used because file objects are context managers. The list-like syntax is a list comprehension, which builds a list from the expression inside the brackets. map is a function which takes an iterable and applies a provided function to each item, producing each transformed result*. filter is a function which takes an iterable and tests each item against the provided predicate, producing each item for which the predicate evaluated to True*. lambda is used to define a small anonymous function in-line.
*: In Python 3 the actual return types are lazy iterators rather than lists, so they can still be used with for loops.
I am not sure if I understand you correctly, but I guess the split() method is what you need, like:
with open('common_words.txt') as f:
    words_nested = [line.split() for line in f]
words = [word for words in words_nested for word in words] # flatten nested list
As mentioned, the .split() method could be a solution.
Also, the NLTK module might be useful for future language processing tasks.
Hope this helps!
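One more note on the code in the question: calling remove on a list while looping over it skips the item that follows each removed one, so some phrases survive. The sketch below shows this with a made-up word list and then filters with a list comprehension instead.

```python
import random

# Hypothetical sample data standing in for common_words.txt
words = ["apple", "ice cream", "hot dog", "pear"]

# Removing while iterating shifts the remaining items left, so the loop
# skips the element that follows each removed one.
buggy = list(words)
for word in buggy:
    if " " in word:
        buggy.remove(word)
print(buggy)  # ['apple', 'hot dog', 'pear']

# Building a new list avoids the problem.
single = [word for word in words if " " not in word]
print(single)  # ['apple', 'pear']

# Pick a random word for the hangman game.
print(random.choice(single))
```

Note that splitting the whole file on whitespace would turn "ice cream" into the two words "ice" and "cream" rather than dropping the phrase, so filtering line by line matches the stated goal more closely.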

Change a file to a list to a dictionary

I am trying to write a code that takes the text from a novel and converts it to a dictionary where the keys are each unique word and the values are the number of occurrences of the word in the text.
For example it could look like: {'the': 25, 'girl': 59...etc}
I have been trying to make the text first into a list and then use the Counter function to make a dictionary of all the words:
source = open('novel.html', 'r', encoding = "UTF-8")
soup = BeautifulSoup(source, 'html.parser')
#make a list of all the words in file, get rid of words that aren't content
mylist = []
mylist.append(soup.find_all('p'))
newlist = filter(None, mylist)
cnt = collections.Counter()
for line in newlist:
    try:
        if line is not None:
            words = line.split(" ")
            for word in line:
                cnt[word] += 1
    except:
        pass
print(cnt)
This code doesn't work because of an error with "NoneType" or it just prints an empty list. I'm wondering if there is an easier way to do what I'm trying to do or how I can fix this code so it won't have this error.
import collections
from bs4 import BeautifulSoup
with open('novel.html', 'r', encoding='UTF-8') as source:
    soup = BeautifulSoup(source, 'html.parser')
cnt = collections.Counter()
for tag in soup.find_all('p'):
    # get_text() also works when the <p> contains nested tags, where tag.string is None
    for word in tag.get_text().split():
        word = ''.join(ch for ch in word.lower() if ch.isalnum())
        if word != '':
            cnt[word] += 1
print(cnt)
The with statement is simply a safer way to open the file.
soup.find_all returns a list of Tags.
tag.get_text().split() gets all the words (separated by whitespace) from the Tag.
word = ''.join(ch for ch in word.lower() if ch.isalnum()) removes punctuation and converts to lowercase so that 'Hello' and 'hello!' count as the same word.
For the counter, just do:
from collections import Counter
cnt = Counter(mylist)
Are you sure your list is getting items to begin with? After what step are you getting an empty list?
Once you've converted your page to a list, try something like this out:
#create dictionary and fake list
d = {}
x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]
#count the times a unique word occurs and add that pair to your dictionary
for word in set(x):
    count = x.count(word)
    d[word] = count
Output:
{'hello': 2, 'hey': 2, 'hi': 4}
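The loop above rescans the list once per unique word. As a sketch of the alternative mentioned earlier, collections.Counter builds the same mapping in a single pass over the same sample list:

```python
from collections import Counter

x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]

# Counter is a dict subclass that maps each item to its number of occurrences.
cnt = Counter(x)

print(cnt["hi"])           # 4
print(cnt.most_common(1))  # [('hi', 4)]
print(dict(cnt) == {"hello": 2, "hey": 2, "hi": 4})  # True
```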

How can I create a list that is created using the indexes and items of two other lists?

I have two lists. One is made up of positions from a sentence and the other is made up of words that make up the sentence. I want to recreate the variable sentence using poslist and wordlist.
recreate = []
sentence = "This and that, and this and this."
poslist = [1, 2, 3, 2, 4, 2, 5]
wordlist = ['This', 'and', 'that', 'this', 'this.']
I wanted to use a for loop to go through poslist and if the item in poslist was equal to the position of a word in wordlist it would append it to a new list, recreating the original list. My first try was:
for index in poslist:
    recreate.append(wordlist[index])
print (recreate)
I had to convert the lists to strings to write them to a text file. When I tried splitting them again and using the code shown above, it did not work. The error said that list indices must be integers or slices, not a list. I would like a solution to this problem. Thank you.
The list of words is built using:
sentence = input("Enter a sentence >>") #asking the user for an input
sentence_lower = sentence.lower() #making the sentence lower case
wordlist = [] #creating a empty list
sentencelist = sentence.split() #making the sentence into a list
for word in sentencelist: #for loop iterating over the sentence as a list
    if word not in wordlist:
        wordlist.append(word)
txtfile = open ("part1.txt", "wt")
for word in wordlist:
    txtfile.write(word +"\n")
txtfile.close()
txtfile = open ("part1.txt", "rt")
for item in txtfile:
    print (item)
txtfile.close()
print (wordlist)
And the positions are built using:
poslist = []
textfile = open ("part2.txt", "wt")
for word in sentencelist:
    poslist.append([position + 1 for position, i in enumerate(wordlist) if i == word])
print (poslist)
str1 = " ".join(str(x) for x in poslist)
textfile = open ("part2.txt", "wt")
textfile.write (str1)
textfile.close()
Lists are 0-indexed (the first item has the index 0, the second the index 1, ...), so you have to subtract 1 from the indexes if you want to use "human" indexes in the poslist:
for index in poslist:
    recreate.append(wordlist[index-1])
print (recreate)
Afterwards, you can join the words with spaces again and write them to a file:
with open("thefile.txt", "w") as f:
    f.write(" ".join(recreate))
First, your code can be simplified to:
sentence = input("Enter a sentence >>") #asking the user for an input
sentence_lower = sentence.lower() #making the sentence lower case
wordlist = [] #creating a empty list
sentencelist = sentence.split() #making the sentence into a list
with open ("part1.txt", "wt") as txtfile:
    for word in sentencelist: #for loop iterating over the sentence as a list
        if word not in wordlist:
            wordlist.append(word)
            txtfile.write(word +"\n")
poslist = [wordlist.index(word) for word in sentencelist]
print (poslist)
str1 = " ".join(str(x) for x in poslist)
with open ("part2.txt", "wt") as textfile:
    textfile.write (str1)
In your original code, poslist was a list of lists instead of a list of integers.
Then, if you want to reconstruct your sentence from poslist (which is now a list of int and not a list of lists as in the code you provided) and wordlist, you can do the following:
sentence = ' '.join(wordlist[pos] for pos in poslist)
You can also do it using a generator expression and the string join method:
sentence = ' '.join(wordlist[pos-1] for pos in poslist if pos if pos <= len(wordlist))
# 'This and that, and this and this.'
You can use operator.itemgetter() for this.
from operator import itemgetter
poslist = [0, 1, 2, 1, 3, 1, 4]
wordlist = ['This', 'and', 'that', 'this', 'this.']
print(' '.join(itemgetter(*poslist)(wordlist)))
Note that I had to subtract one from all of the items in poslist, as Python is a zero-indexed language. If you need to programmatically change poslist, you could do poslist = (n - 1 for n in poslist) right after you declare it.
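Putting the pieces together, here is a minimal round-trip sketch using the sentence from the question. It keeps the 1-based positions, and it stands in for the file step with a plain string (positions_text), which is an assumption made so the example runs without touching the disk.

```python
sentence = "This and that, and this and this."
sentencelist = sentence.split()

# Unique words in order of first appearance.
wordlist = []
for word in sentencelist:
    if word not in wordlist:
        wordlist.append(word)

# 1-based position of each word of the sentence within wordlist.
poslist = [wordlist.index(word) + 1 for word in sentencelist]

# What would be written to the file: positions separated by spaces.
positions_text = " ".join(str(pos) for pos in poslist)

# Reading back gives strings, so convert each one to int before indexing.
restored = [int(pos) for pos in positions_text.split()]
rebuilt = " ".join(wordlist[pos - 1] for pos in restored)

print(poslist)  # [1, 2, 3, 2, 4, 2, 5]
print(rebuilt)  # This and that, and this and this.
```

The int() conversion is the step that matters after a file round trip: indexing a list with a string or a nested list is what produces the "indices must be integers or slices" error.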

How to check for words in a list and then return a count?

I am relatively new to Python and have a question:
I am trying to write a script that reads a .txt file, checks whether its words are in a list that I've provided, and returns a count of how many words were in that list.
So far,
import string
#this is just an example of a list
list = ['hi', 'how', 'are', 'you']
filename="hi.txt"
infile=open(filename, "r")
lines = infile.readlines()
for line in lines:
    words = line.split()
    for word in words:
        word = word.strip(string.punctuation)
I've tried to split the file into lines and then the lines into words without punctuation.
I am not sure where to go after this. I would like ultimately for the output to be something like this:
"your file has x words that are in the list".
Thank you!
You can split your file into words using the following command (in Python 3, reduce must be imported from functools):
words=reduce(lambda x,y:x+y,[line.split() for line in f])
Then count how often each word of your list occurs by looping over it and using the count method:
from functools import reduce
w_list = ['hi', 'how', 'are', 'you']
with open("hi.txt", "r") as f :
    words=reduce(lambda x,y:x+y,[line.split() for line in f])
for w in w_list:
    print("your file has {} {}".format(words.count(w),w))
import string
# words to search for;
# (stored as a set so `word in search_for` is O(1))
search_for = set(["hi", "how", "are", "you"])
# get search text
# (no need to split into lines)
with open("hi.txt") as inf:
    text = inf.read().lower()
# create translation table
# - converts non-word chars to spaces (this maintains appropriate word-breaks)
# - keeps apostrophe (for words like "don't" or "couldn't")
non_word = string.digits + string.punctuation.replace("'", "")
trans = str.maketrans(non_word, " " * len(non_word))
# apply translation table and split into words
words = text.translate(trans).split()
# count desired words
word_count = sum(word in search_for for word in words)
# show result
print("your file has {} words that are in the list".format(word_count))
Read the file content using the with statement and open().
Remove punctuation from the file content using the string module.
Split the file content with split() and iterate over every word with a for loop.
Check whether each word is present in the input list and increment the count accordingly.
input file: hi.txt
hi, how are you?
hi, how are you?
code:
import string
input_list = ['hi', 'how', 'are', 'you']
filename="hi.txt"
count = 0
with open(filename) as fp:
    data = fp.read()
# Python 3: delete all punctuation characters
data = data.translate(str.maketrans("", "", string.punctuation))
for word in data.split():
    if word in input_list:
        count += 1
print("Total number of word present in file from the list are %d"%(count))
Output:
vivek@vivek:~/Desktop/stackoverflow$ python 18.py
Total number of word present in file from the list are 8
vivek@vivek:~/Desktop/stackoverflow$
Do not use variable names that are already defined by the Python interpreter, such as list in your code. Doing so shadows the built-in:
>>> list
<class 'list'>
>>> a = list([1,2,3])
>>> a
[1, 2, 3]
>>> list = ["hi", "how"]
>>> b = list([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable
>>>
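For comparison, here is a compact sketch of the same count that stores the search words in a set and works on an in-memory string instead of hi.txt. The names text and wanted are made up for illustration.

```python
import string

input_list = ["hi", "how", "are", "you"]
# Hypothetical stand-in for the contents of hi.txt
text = "hi, how are you?\nhi, how are you?\n"

# A set makes each membership test O(1).
wanted = set(input_list)

# Strip surrounding punctuation from each word and normalise case.
words = [w.strip(string.punctuation).lower() for w in text.split()]

count = sum(1 for w in words if w in wanted)
print("your file has {} words that are in the list".format(count))
# your file has 8 words that are in the list
```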
Use len(). For example, for a list:
myList = ["c","b","a"]
len(myList)
It will return 3, meaning there are three items in your list.
