WordNetLemmatizer error - every letter is lemmatized individually - python

I am trying to lemmatize my dataset for sentiment analysis. What should I do to get the expected output rather than the current output? The input file is a CSV, loaded into a DataFrame object.
dataset = pd.read_csv('xyz.csv')
Here is my code
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
list1_ = []
for file_ in dataset:
    result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x])
    list1_.append(result1)
dataset = pd.concat(list1_, ignore_index=True)
Expected
>> lemmatizer.lemmatize('cats')
>> [cat]
Current output
>> lemmatizer.lemmatize('cats')
>> [c,a,t,s]

TL;DR
result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x.split()])
The lemmatizer takes any string as input.
If the values in the dataset['Content'] column are strings, iterating over a string iterates over its characters, not its "words", e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> x = 'this is a foo bar sentence, that is of type str'
>>> [wnl.lemmatize(ch) for ch in x]
['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'o', 'o', ' ', 'b', 'a', 'r', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ',', ' ', 't', 'h', 'a', 't', ' ', 'i', 's', ' ', 'o', 'f', ' ', 't', 'y', 'p', 'e', ' ', 's', 't', 'r']
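So you first have to tokenize the sentence string into words. str.split() splits on whitespace only (note that 'sentence,' keeps its comma below), while word_tokenize also separates punctuation, e.g.: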
>>> from nltk import word_tokenize
>>> [wnl.lemmatize(word) for word in x.split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', 'is', 'of', 'type', 'str']
>>> [wnl.lemmatize(word) for word in word_tokenize(x)]
['this', 'is', 'a', 'foo', 'bar', 'sentence', ',', 'that', 'is', 'of', 'type', 'str']
Another example:
>>> from nltk import word_tokenize
>>> x = 'the geese ran through the parks'
>>> [wnl.lemmatize(word) for word in x.split()]
['the', u'goose', 'ran', 'through', 'the', u'park']
>>> [wnl.lemmatize(word) for word in word_tokenize(x)]
['the', u'goose', 'ran', 'through', 'the', u'park']
But to get a more accurate lemmatization, you should get the sentence word tokenized and pos-tagged, see https://github.com/alvations/earthy/blob/master/FAQ.md#how-to-use-default-nltk-functions-in-earthy
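As a rough sketch of the POS-tagged approach mentioned above: NLTK's pos_tag returns Penn Treebank tags by default, while WordNetLemmatizer.lemmatize accepts a pos argument such as 'n', 'v', 'a' or 'r'. The helper name penn_to_wordnet is hypothetical, and falling back to noun for every other tag is an assumption, not something the linked FAQ prescribes. The tags are hard-coded so the sketch runs without NLTK data installed.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # assumption: treat everything else as a noun


# Tags of the kind nltk.pos_tag typically produces, hard-coded for illustration.
tagged = [('the', 'DT'), ('geese', 'NNS'), ('ran', 'VBD'),
          ('through', 'IN'), ('the', 'DT'), ('parks', 'NNS')]

wordnet_tagged = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tagged)

# With NLTK available, each pair could then be lemmatized along the lines of:
#   [wnl.lemmatize(word, pos=wn_pos) for word, wn_pos in wordnet_tagged]
```

Passing pos='v' is what lets the lemmatizer treat 'ran' as a verb; without it the word is looked up as a noun and comes back unchanged, as in the outputs above.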

Related

How do I make a list of strings without invalid characters in any of the strings? [duplicate]

This question already has answers here:
Best way to strip punctuation from a string
For example, if I had this list of invalid characters:
invalid_char_list = [',', '.', '!']
And this list of strings:
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
I would want to get this new list:
new_string_list = ['Hello', 'world', 'I', 'am', 'a', 'programmer']
without , or . or ! in any of the strings in the list, because those are the characters in my list of invalid characters.
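You can use a regex: build the character class [,.!] from the list and replace every match with ''.
import re
# re.escape keeps characters such as ']' or '^' from breaking the character class
re_invalid = re.compile(f"[{re.escape(''.join(invalid_char_list))}]")
# for this list, re_invalid matches the same characters as re.compile(r'[,.!]')
new_string_list = [re_invalid.sub('', s) for s in string_list]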
print(new_string_list)
Output:
['Hello', 'world', 'I', 'am', 'a', 'programmer']
[.,!] : matches only the characters in the set (',', '.', '!')
You can loop through invalid_char_list and replace each invalid character with an empty string in every element of string_list.
invalid_char_list = [',', '.', '!']
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
for invalid_char in invalid_char_list:
    string_list = [x.replace(invalid_char, '') for x in string_list]
print(string_list)
The Output:
['Hello', 'world', 'I', 'am', 'a', 'programmer']
We can loop over each string in string_list and each invalid character, and use str.replace to replace any invalid character with '' (nothing).
invalid_char_list = [',', '.', '!']
string_list = ['Hello,', 'world.', 'I', 'am', 'a', 'programmer!!']
formatted_string_list = []
for string in string_list:
    for invalid in invalid_char_list:
        string = string.replace(invalid, '')
    formatted_string_list.append(string)
You can use strip():
string_list = ['Hello,', ',world.?', 'I', 'am?', '!a,', 'programmer!!?']
new_string_list = [c.strip(',.!?') for c in string_list]
print(new_string_list)
#['Hello', 'world', 'I', 'am', 'a', 'programmer']
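One caveat with strip(): it only trims characters from both ends of each string, so an invalid character in the middle survives. A minimal sketch of an alternative using str.translate, which deletes the characters wherever they occur (the sample string 'pro,grammer!!' is made up for illustration):

```python
invalid_char_list = [',', '.', '!']
string_list = ['Hello,', 'world.', 'pro,grammer!!']

invalid_chars = ''.join(invalid_char_list)

# Translation table that maps every invalid character to None (i.e. deletes it).
table = str.maketrans('', '', invalid_chars)

stripped = [s.strip(invalid_chars) for s in string_list]
translated = [s.translate(table) for s in string_list]

print(stripped)    # the comma inside 'pro,grammer' is still there
print(translated)  # every invalid character is gone
```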

How can I extract words from a file from a string

I have a file containing every English word, and a generator that gives me random letters. I want to find the words from the file that appear in those random letters and add them to a list, but I am having trouble getting this result. Could you help me please?
This is what I have tried:
import string
import random
def gen():
    b = []
    for i in range(100):
        a = random.choice(string.ascii_lowercase)
        b.append(a)
    with open('allEnglishWords.txt') as f:
        words = f.read().splitlines()
        joined = ''.join([str(elem) for elem in b])
        if joined in words:
            print(joined)
        f.close()
    print(joined)
gen()
If you are wondering where I got the text file, it is located here: http://www.gwicks.net/dictionaries.htm. I downloaded the one labeled ENGLISH - 84,000 words.
import string
import random
b = []
for i in range(100):
    a = random.choice(string.ascii_lowercase)
    b.append(a)
b = ''.join(b)
with open('engmix.txt', 'r') as f:
    words = [x.replace('\n', '') for x in f.readlines()]
output = []
for word in words:
    if word in b:
        output.append(word)
print(output)
Output:
['a', 'ad', 'am', 'an', 'ape', 'au', 'b', 'bi', 'bim', 'c', 'cb', 'd', 'e',
'ed', 'em', 'eo', 'f', 'fa', 'fy', 'g', 'gam', 'gem', 'go', 'gov', 'h',
'i', 'j', 'k', 'kg', 'ko', 'l', 'le', 'lei', 'm', 'mg', 'ml', 'mr', 'n',
'no', 'o', 'om', 'os', 'p', 'pe', 'pea', 'pew', 'q', 'ql', 'r', 's', 'si',
't', 'ta', 'tap', 'tape', 'te', 'u', 'uht', 'uk', 'v', 'w', 'wan', 'x', 'y',
'yo', 'yom', 'z', 'zed']
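A sketch of the same substring search wrapped in a function, so it can be tried without the dictionary file. The function name find_words, the min_length parameter and the sample data are all hypothetical; min_length is only there because, as the output above shows, single letters and two-letter entries dominate the result.

```python
def find_words(letters, words, min_length=1):
    """Return the words that occur as contiguous substrings of `letters`."""
    return [w for w in words if len(w) >= min_length and w in letters]


# Made-up stand-ins for the random string and the word file.
letters = 'xqtapezzgov'
words = ['tape', 'ape', 'gov', 'cat', 'a']

all_hits = find_words(letters, words)
long_hits = find_words(letters, words, min_length=3)

print(all_hits)
print(long_hits)
```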
Focusing on extracting the words from the file, and assuming your words are separated by a single space:
with open("allEnglishWords.txt") as f:
    for line in f:
        for word in line.split(" "):
            print(word)
Also, you don't need f.close() inside a with block.

Hangman game but I'm having troubles with a list function

I'm currently doing the MIT OpenCourseWare course on Python, and one of the assignments is to write a Hangman game.
Most of the functions I've managed to do pretty well but the problem I'm encountering is in these two functions:
def get_guessed_word(secret_word, letters_guessed):
    lengthOf = len(secret_word)
    listLength = ["_ "] *lengthOf
    for i,char in enumerate(secret_word):
        if char == letters_guessed:
            listLength[i]=char+" "
    listCopy = listLength[:]
    print(list)
def get_available_letters(letters_guessed):
    alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
    for i, char in enumerate(alphabet):
        if char == letters_guessed:
            alphabet[i]="_"
            alphabetCopy = alphabet[:]
            print(alphabetCopy)
            break
The problem is that the alphabet resets each time the function is called. I tried to solve this by creating a copy, but I realized even before implementing it that this doesn't work, because listCopy and alphabetCopy just copy the "zero state" on every call.
I know there are other solutions, but I specifically want this "user experience". I tried some other workarounds, but I can't figure it out right now.
I am assuming that the variable letters_guessed is a list or set containing all letters which have been guessed.
In that case, you could use:
def GetGuessedWord(SecretWord, GuessedLetters):
    Ln = len(SecretWord)
    DisplayList = ["_ "]*Ln
    for i, char in enumerate(SecretWord):
        if char in GuessedLetters: # This will check if char is present in the list
            DisplayList[i] = char + " "
    print(DisplayList)
def GetAvailableLetters(GuessedLetters):
    Letters = "abcdefghijklmnopqrstuvwxyz"
    DisplayList = [L for L in Letters] # Converts it into a list of smaller strings, 1 letter each
    for i, char in enumerate(DisplayList):
        if char in GuessedLetters:
            DisplayList[i] = "_"
    print(DisplayList)
>>> GetGuessedWord("overgrown", ['o'])
['o ', '_ ', '_ ', '_ ', '_ ', '_ ', 'o ', '_ ', '_ ']
>>> GetGuessedWord("overgrown", ['o', 'e', 'r'])
['o ', '_ ', 'e ', 'r ', '_ ', 'r ', 'o ', '_ ', '_ ']
>>> GetGuessedWord("overgrown", ['o','e','r','z'])
['o ', '_ ', 'e ', 'r ', '_ ', 'r ', 'o ', '_ ', '_ ']
>>> GetAvailableLetters(['o'])
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', '_', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> GetAvailableLetters(['o', 'e', 'r'])
['a', 'b', 'c', 'd', '_', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', '_', 'p', 'q', '_', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> GetAvailableLetters(['o', 'e', 'r', 'z'])
['a', 'b', 'c', 'd', '_', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', '_', 'p', 'q', '_', 's', 't', 'u', 'v', 'w', 'x', 'y', '_']
The '==' operator returns True if the objects being compared have the same value. Here it always returns False, because one operand is a string and the other is a list, so they can never be equal.
The 'in' operator returns True if the value on the left is present in the container on the right.
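A tiny sketch of the difference between the two operators, assuming (as above) that letters_guessed is a list of single-letter strings:

```python
letters_guessed = ['o', 'e', 'r']

equal_to_list = ('o' == letters_guessed)   # a str compared with a list
member_of_list = ('o' in letters_guessed)  # membership test against the list
equal_to_letter = ('o' == 'o')             # both sides are single letters

print(equal_to_list)
print(member_of_list)
print(equal_to_letter)
```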
However, if you are guessing one letter at a time, your functions need to leave permanent modifications on variables that live outside them. So alphabet can't be created inside get_available_letters; it must be created in the main code and passed to the function as an argument. This fixes the function when letters_guessed is a single-letter string, and in that case you can keep using the '==' operator.
def get_available_letters(letters_guessed, alphabet):
    for i, char in enumerate(alphabet):
        if char == letters_guessed:
            alphabet[i]="_" # This line will permanently change the variable alphabet
            alphabetCopy = alphabet[:] # Not useful, you may as well print the original
            print(alphabetCopy)
            break
>>> alphabet = [L for L in 'abcdefghijklmnopqrstuvwxyz']; print(alphabet)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> get_available_letters("o", alphabet)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', '_', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> get_available_letters("e", alphabet)
['a', 'b', 'c', 'd', '_', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', '_', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> get_available_letters("r", alphabet)
['a', 'b', 'c', 'd', '_', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', '_', 'p', 'q', '_', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> get_available_letters("z", alphabet)
['a', 'b', 'c', 'd', '_', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', '_', 'p', 'q', '_', 's', 't', 'u', 'v', 'w', 'x', 'y', '_']
For the other function you would need variable listLength to be declared outside and passed to this function.
def get_guessed_word(secret_word, letters_guessed, listLength):
    for i,char in enumerate(secret_word):
        if char == letters_guessed:
            listLength[i]=char+" " # Permanently modifies listLength, not breaking since multiple same letters can occur in the same word
    listCopy = listLength[:]
    print(listCopy)
>>> secret_word = "overgrown"
>>> listLength = ["_ "]*len(secret_word); print(listLength)
['_ ', '_ ', '_ ', '_ ', '_ ', '_ ', '_ ', '_ ', '_ ']
>>> get_guessed_word(secret_word, "o", listLength)
['o ', '_ ', '_ ', '_ ', '_ ', '_ ', 'o ', '_ ', '_ ']
>>> get_guessed_word(secret_word, "e", listLength)
['o ', '_ ', 'e ', '_ ', '_ ', '_ ', 'o ', '_ ', '_ ']
>>> get_guessed_word(secret_word, "r", listLength)
['o ', '_ ', 'e ', 'r ', '_ ', 'r ', 'o ', '_ ', '_ ']
>>> get_guessed_word(secret_word, "z", listLength)
['o ', '_ ', 'e ', 'r ', '_ ', 'r ', 'o ', '_ ', '_ ']
Make copies of lists only when you need to modify a list without affecting the original.
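A small sketch of why passing the list in works and what a copy is for. The helper name mark_guessed and the five-letter alphabet are made up for illustration:

```python
def mark_guessed(alphabet, letter):
    """Replace `letter` with '_' in place; the caller's list is modified."""
    if letter in alphabet:
        alphabet[alphabet.index(letter)] = '_'


alphabet = list('abcde')
snapshot = alphabet[:]   # shallow copy, unaffected by later changes

mark_guessed(alphabet, 'c')

print(alphabet)   # modified in place by the function
print(snapshot)   # still the original letters
```

Because the function receives a reference to the same list object, the change survives between calls; the slice copy is only useful when the original state has to be kept around.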
get_available_letters
So you have a list of letters that were already guessed, and you want to know which letters are left.
The simplest way is to use sets.
def get_available_letters(guessed_letters):
    alphabet = set(map(chr, range(97, 123))) # Same letters as your list, built more compactly
    return sorted(alphabet - set(guessed_letters))
What this does:
>>> get_available_letters(['a', 'e', 'f'])
['b', 'c', 'd', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
get_guessed_word
def get_guessed_word(secret_word, letters_guessed):
    guessed_word = ["_"] * len(secret_word)
    for i, letter in enumerate(secret_word):
        if letter in letters_guessed: # Changed == to in
            guessed_word[i] = letter # Don't break after a letter was found and no copy necessary
    return "".join(guessed_word)
>>> get_guessed_word("Hello", ["e", "o"])
'_e__o'

how to turn string of words into list

I have turned a list of words into a string.
Now I want to turn it back into a list, but I don't know how. Please help.
temp = ['hello', 'how', 'is', 'your', 'day']
temp_string = str(temp)
temp_string will then be "['hello', 'how', 'is', 'your', 'day']"
I want to turn this back into a list now, but when I do list(temp_string), this happens:
['[', "'", 'h', 'e', 'l', 'l', 'o', "'", ',', ' ', "'", 'h', 'o', 'w', "'", ',', ' ', "'", 'i', 's', "'", ',', ' ', "'", 'y', 'o', 'u', 'r', "'", ',', ' ', "'", 'd', 'a', 'y', "'", ']']
Please help
You can do this easily by evaluating the string. That's not something I'd normally suggest but, assuming you control the input, it's quite safe:
>>> temp = ['hello', 'how', 'is', 'your', 'day'] ; type(temp) ; temp
<class 'list'>
['hello', 'how', 'is', 'your', 'day']
>>> tempstr = str(temp) ; type(tempstr) ; tempstr
<class 'str'>
"['hello', 'how', 'is', 'your', 'day']"
>>> temp2 = eval(tempstr) ; type(temp2) ; temp2
<class 'list'>
['hello', 'how', 'is', 'your', 'day']
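If you are not fully sure you control the input, a more cautious sketch is to use ast.literal_eval from the standard library instead of eval. It only accepts Python literals (lists, strings, numbers and so on) and raises an error for anything else rather than executing it:

```python
import ast

temp = ['hello', 'how', 'is', 'your', 'day']
temp_string = str(temp)

# Parses the string as a Python literal; arbitrary expressions are rejected.
restored = ast.literal_eval(temp_string)

print(type(restored))
print(restored == temp)
```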
Duplicate question? Converting a String to a List of Words?
Working code below (Python 3)
import re
sentence_list = ['hello', 'how', 'are', 'you']
sentence = ""
for word in sentence_list:
    sentence += word + " "
print(sentence)
#output: "hello how are you "
word_list = re.sub(r"[^\w]", " ", sentence).split()  # raw string avoids an invalid-escape warning
print(word_list)
#output: ['hello', 'how', 'are', 'you']
You can strip the brackets, split on commas, and strip the quotes from each piece:
temp = ['hello', 'how', 'is', 'your', 'day']
temp_string = str(temp)
temp_new = [word.strip(" '") for word in temp_string.strip('[]').split(',')]
strip('[]') removes the surrounding brackets, split(',') cuts the string at each comma, and strip(" '") removes the leftover space and quotes around each word. Note that this breaks if a word itself contains a comma or a quote.

Simple tokenization issue in NLTK

I want to tokenize the following text:
In Düsseldorf I took my hat off. But I can't put it back on.
into these tokens:
'In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I',
"can't", 'put', 'it', 'back', 'on', '.'
But to my surprise none of the NLTK tokenizers produce this. How can I accomplish this? Is it possible to combine these tokenizers somehow to achieve the above?
You can take one of the tokenizers as a starting point and then fix the contractions (assuming that is the problem):
from nltk.tokenize.treebank import TreebankWordTokenizer
text = "In Düsseldorf I took my hat off. But I can't put it back on."
tokens = TreebankWordTokenizer().tokenize(text)
contractions = ["n't", "'ll", "'m"]
fix = []
for i in range(len(tokens)):
    for c in contractions:
        if tokens[i] == c: fix.append(i)
fix_offset = 0
for fix_id in fix:
    idx = fix_id - 1 - fix_offset
    tokens[idx] = tokens[idx] + tokens[idx+1]
    del tokens[idx+1]
    fix_offset += 1
print(tokens)
>>>['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
You should split the text into sentences before tokenizing the words:
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
If you want to get them back into a single list:
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
>>> list(chain(*text))
['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
If you must merge ["ca", "n't"] back into ["can't"] (Python 3, where izip_longest is named zip_longest):
>>> from itertools import zip_longest, chain
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
>>> contractions = ["n't", "'ll", "'re", "'s"]
# Iterate through two words at a time and then join the contractions back.
>>> [w1+w2 if w2 in contractions else w1 for w1,w2 in zip_longest(tok_text, tok_text[1:])]
['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", "n't", 'put', 'it', 'back', 'on', '.']
# Remove all contraction tokens since you've joined them to their root stem.
>>> [w for w in [w1+w2 if w2 in contractions else w1 for w1,w2 in zip_longest(tok_text, tok_text[1:])] if w not in contractions]
['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
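A single-pass alternative sketch that works on an already tokenized list, so it runs without NLTK installed. The function name merge_contractions and its default contraction list are assumptions for illustration; note that "'s" is also merged when it is a possessive.

```python
def merge_contractions(tokens, contractions=("n't", "'ll", "'re", "'s", "'m")):
    """Glue contraction tokens back onto the token that precedes them."""
    merged = []
    for tok in tokens:
        if tok in contractions and merged:
            merged[-1] += tok   # append the contraction to the previous token
        else:
            merged.append(tok)
    return merged


# Tokens as produced by the tokenizer output shown above.
tokens = ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
merged_tokens = merge_contractions(tokens)
print(merged_tokens)
```

Building a new list avoids both the index bookkeeping of deleting in place and the second filtering pass.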
