Stem words of complex structure - python

I have a list of tuples. Those tuples contain a label and a list. It looks like this:
[('neg', ['watching', 'by', 'myself', 'tweetdebate', 'not', ...]), ('pos', ['here', 'we', 'go', 'tweetdebate', 'tweetdebate', ...])]
And it's iterable by this
for label, words in labeled_words:
How can I mutate those words to have their lowercase stems?
Something like this in a loop I guess ( the stemmer is the PorterStemmer() ):
stemmer.stem(word.lower())
This doesn't work:
labeled_words = [( label, [stemmer.stem(word.lower()) for words]) for label, words in labeled_words ]
Thank you for your time.

This is mostly a 'how do I work with loops and variables' question. The main thing is to not try and modify the list you are iterating on. Instead, build up a new list.
I think this is what you are looking for:
labeled_words = [('neg', ['watching', 'by', 'myself', 'tweetdebate', 'not']), ('pos', ['here', 'we', 'go', 'tweetdebate', 'tweetdebate'])]
stemmedWords = []
for label, words in labeled_words:
    stemmed = []
    for word in words:
        # porter2 is assumed to be imported already; lowercase first, as the question asks
        stemmed.append(porter2.stem(word.lower()))
    stemmedWords.append((label, stemmed))
Output looks like:
>>> stemmedWords
[('neg', ['watch', 'by', 'myself', 'tweetdeb', 'not']), ('pos', ['here', 'we', 'go', 'tweetdeb', 'tweetdeb'])]
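The comprehension in the question fails because its inner part has no loop variable: `for words` needs to be `for word in words`. Here is a minimal sketch of the corrected one-liner. `toy_stem` and `stem_labeled` are hypothetical names, and `toy_stem` is only a stand-in so the snippet runs without a stemming library; in practice you would pass your own `stemmer.stem` instead.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing "ing".
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def stem_labeled(labeled_words, stem=toy_stem):
    # Build a new list of (label, stemmed_words) tuples; the input is left untouched.
    return [
        (label, [stem(word.lower()) for word in words])
        for label, words in labeled_words
    ]


labeled_words = [("neg", ["Watching", "by", "myself"]), ("pos", ["Here", "we", "go"])]
stemmed = stem_labeled(labeled_words)
print(stemmed)
```

With a real Porter stemmer the call would look like `stem_labeled(labeled_words, stem=stemmer.stem)`.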

Related

How can I split a txt file into a list by word but including commas on the elements

I have a big txt file and I want to split it into a list where every word is an element of the list. I want the commas to be kept on the elements, as in the example.
txt file
Hi, my name is Mick and I want to split this with commas included, like this.
list ['Hi,','my','name','is','Mick' etc. ]
Thank you very much for the help
Just use str.split() without any argument; it splits on runs of whitespace, so punctuation stays attached to the words.
value = 'Hi, my name is Mick and I want to split this with commas included, like this.'
res = value.split()
print(res) # ['Hi,', 'my', 'name', 'is', 'Mick', 'and', 'I', 'want', 'to', 'split', 'this', 'with', 'commas', 'included,', 'like', 'this.']
res = [r for r in value.split() if ',' not in r]
print(res) # ['my', 'name', 'is', 'Mick', 'and', 'I', 'want', 'to', 'split', 'this', 'with', 'commas', 'like', 'this.']

Is there a better way to tokenize some strings?

I was trying to write a code for tokenization of strings in python for some NLP and came up with this code:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s= []
a=0
for line in str:
    s.append([])
    s[a].append(line.split())
    a+=1
print(s)
the output came out to be:
[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]
As you can see, the list now has an extra dimension, for example, If I want the word 'Batman', I would have to type s[0][0][2] instead of s[0][2], so I changed the code to:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s= []
a=0
m = []
for line in str:
    s.append([])
    m=(line.split())
    for word in m:
        s[a].append(word)
    a += 1
print(s)
which got me the correct output:
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
But I have a feeling this could be done with a single loop. The dataset I will be importing is pretty large, and a complexity of n would be a lot better than n^2. Is there a better way to do this, or a way to do it with one loop?
Your original code is so nearly there.
>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s=[]
>>> for line in str:
...     s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
The line.split() gives you a list, so append that in your loop.
Or go straight for a comprehension:
[line.split() for line in str]
When you say s.append([]), you have an empty list at index 'a', like this:
L = []
If you append the result of the split to that, like L.append([1]) you end up with a list in this list: [[1]]
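A small sketch of the difference, using shortened sample data: `append` adds the whole token list as a single element (creating the extra level), while `extend` adds each token individually. The variable names are illustrative only.

```python
lines = ["I am Batman.", "I loved the tea."]

# append: the list returned by split() becomes ONE element of the inner list.
too_deep = []
for line in lines:
    too_deep.append([])
    too_deep[-1].append(line.split())

# extend: each token from split() is added to the inner list individually.
tokens = []
for line in lines:
    tokens.append([])
    tokens[-1].extend(line.split())

print(too_deep[0])
print(tokens[0])
```

Either way, appending `line.split()` directly to the outer list is simpler, since `split()` already returns the inner list you want.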
You should call split() on every string in the loop.
Example with list comprehension:
str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
[s.split() for s in str]
[['I', 'am', 'Batman.'],
['I', 'loved', 'the', 'tea.'],
['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
See this:-
>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> [i.split() for i in list1]
# split() by default splits on whitespace and returns a list
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

Nested List Iteration

I was attempting some preprocessing on a nested list before training a small word2vec model and encountered the following issue:
corpus = ['he is a brave king', 'she is a kind queen', 'he is a young boy', 'she is a gentle girl']
corpus = [_.split(' ') for _ in corpus]
[['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen'], ['he', 'is', 'a', 'young', 'boy'], ['she', 'is', 'a', 'gentle', 'girl']]
So the output above was given as a nested list & I intended to remove the stopwords e.g. 'is', 'a'.
for _ in range(0, len(corpus)):
    for x in corpus[_]:
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
[['he', 'a', 'brave', 'king'], ['she', 'a', 'kind', 'queen'], ['he', 'a', 'young', 'boy'], ['she', 'a', 'gentle', 'girl']]
The output seems to indicate that the loop skipped to the next sub-list after removing 'is' from each sub-list, instead of iterating over the whole sub-list.
What is the reasoning behind this? Index? If so, how to resolve assuming I'd like to retain the nested structure.
All your code is correct except for one minor change: iterate over a copy of each sub-list, made with [:], so that removals do not affect the list being iterated. Specifically, lst_copy = lst[:] creates a shallow copy of a list; it is one of several ways to copy a list. When you iterate over the original list and remove items from it, the remaining items shift left while the loop's internal index keeps advancing, so the item right after each removed one is skipped. That is the problem you observe.
for _ in range(0, len(corpus)):
    for x in corpus[_][:]: # <--- create a copy of the list using [:]
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
OUTPUT
[['he', 'brave', 'king'],
['she', 'kind', 'queen'],
['he', 'young', 'boy'],
['she', 'gentle', 'girl']]
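To make the skipping visible, here is a small sketch that reproduces the problem on one sub-list and then shows an alternative that builds new inner lists instead of mutating them. The names `buggy`, `filtered` and `stopwords` are illustrative only.

```python
corpus = [["he", "is", "a", "brave", "king"], ["she", "is", "a", "kind", "queen"]]
stopwords = {"is", "a"}

# Removing while iterating: after 'is' (index 1) is removed, 'a' slides into
# index 1, but the loop has already moved on to index 2, so 'a' is never examined.
buggy = list(corpus[0])
for word in buggy:
    if word in stopwords:
        buggy.remove(word)

# Building new inner lists avoids mutating during iteration and keeps the nesting.
filtered = [[word for word in sentence if word not in stopwords] for sentence in corpus]

print(buggy)
print(filtered)
```

The comprehension leaves `corpus` itself unchanged, which is usually easier to reason about than in-place removal.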
Maybe you can define a custom method to reject elements matching a certain condition. Similar to itertools (for example: itertools.dropwhile).
def reject_if(predicate, iterable):
    for element in iterable:
        if not predicate(element):
            yield element
Once you have the method in place, you can use it this way:
stopwords = ['is', 'and', 'a']
[ list(reject_if(lambda x: x in stopwords, ary)) for ary in corpus ]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]
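The standard library already ships a function with this behaviour: `itertools.filterfalse(predicate, iterable)` yields the elements for which the predicate is false. A short sketch of the same filtering with it, using a set for the stop words (a choice made here for faster membership tests, not something the answer above requires):

```python
from itertools import filterfalse

corpus = [
    ["he", "is", "a", "brave", "king"],
    ["she", "is", "a", "kind", "queen"],
]
stopwords = {"is", "and", "a"}

# Keep only the words for which "word in stopwords" is False.
filtered = [
    list(filterfalse(lambda word: word in stopwords, sentence))
    for sentence in corpus
]
print(filtered)
```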
nested = [input()]
nested = [i.split() for i in nested]

removing common words from a text file

I am trying to remove common words from a text. For example, take the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to turn it into just unique words. This means removing "it", "but", "a" etc. I have a text file that has all the common words and another text file that contains a paragraph. How can I delete the common words in the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently? I have a text file called common.txt that has all the common words listed. How do I use that list to remove identical words in the sentence above? The end output I want:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
You would want to use set objects in Python.
If order and number of occurrences are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
# Membership tests against a set are much faster than against a list
common_set = set(common_words)
[s for s in str_list if s not in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use:
l = text.replace(",","").replace(".","").split(" ")
occurs = {}
for word in l:
    occurs[word] = l.count(word)
resultx = ''
for word in occurs.keys():
    if occurs[word] < 3:
        resultx += word + " "
resultx = resultx[:-1]
You can replace 3 with whatever threshold suits you, or base it on the average count, computed with:
sum(occurs.values())/len(occurs)
Additional
If you want it to be case insensitive, replace the first line with:
l = text.replace(",","").replace(".","").lower().split(" ")
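Calling `l.count(word)` once per word rescans the whole list each time. A hedged alternative sketch using `collections.Counter`, which counts every word in a single pass; the function names and the sample text are made up for illustration.

```python
from collections import Counter


def word_counts(text):
    # Same normalisation as above: drop commas and periods, lowercase, split.
    words = text.replace(",", "").replace(".", "").lower().split()
    return Counter(words)


def rare_words(text, threshold=3):
    # Keep words that occur fewer than `threshold` times, in first-seen order.
    counts = word_counts(text)
    return [word for word in counts if counts[word] < threshold]


sample = "The cat and the dog and the bird. The end"
counts = word_counts(sample)
average = sum(counts.values()) / len(counts)  # mean occurrences per distinct word
print(rare_words(sample))
```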
The simplest method is to read() common.txt, split it into a set of words, and use a list comprehension to keep only the words that are not in that set. Splitting matters: testing against the raw string returned by read() is a substring check, so 'river' would be dropped if the file merely contained 'driver'.
with open('common.txt') as f:
    content = set(f.read().split())
s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))
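For a fuller picture, here is a sketch that loads the common words from a file into a set and compares case-insensitively, so 'It' is removed even if the file lists 'it'. The helper names are hypothetical, and a temporary file with made-up contents stands in for common.txt so the snippet is self-contained.

```python
import os
import tempfile


def load_common_words(path):
    # Read whitespace-separated words into a lowercase set for fast lookups.
    with open(path, encoding="utf-8") as f:
        return {word.lower() for word in f.read().split()}


def remove_common(words, common):
    # Keep the original spelling and order of the words that are not common.
    return [word for word in words if word.lower() not in common]


# Temporary stand-in for common.txt (contents are an assumption for the demo).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("it is not a but on the in all ways\n")
common = load_common_words(tmp.name)
os.remove(tmp.name)

sentence = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the',
            'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
result = remove_common(sentence, common)
print(result)
```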

Python append words to a list from file

I'm writing a program to read text from a file into a list, split it into a list of words using the split function. And for each word, I need to check it if its already in the list, if not I need to add it to the list using the append function.
The desired output is:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
My output is :
[['But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief']]
I have been trying to sort it and remove the double square brackets "[[ & ]]" at the beginning and end, but I'm not able to do so. And for some reason the sort() function does not seem to work.
Please let me know where I am making a mistake.
word_list = []
word_list = [open('romeo.txt').read().split()]
for item in word_list:
    if item in word_list:
        continue
    else:
        word_list.append(item)
word_list.sort()
print word_list
Remove brackets
word_list = open('romeo.txt').read().split()
Use two separate variables. Also, str.split() returns a list so no need to put [] around it:
word_list = []
word_list2 = open('romeo.txt').read().split()
for item in word_list2:
    if item in word_list:
        continue
    else:
        word_list.append(item)
word_list.sort()
print(word_list)
At the moment you're checking if item in word_list:, which will always be true because item is from word_list. Make item iterate from another list.
If order doesn't matter, it's a one liner
uniq_words = set(open('romeo.txt').read().split())
If order matters, then
uniq_words = []
for word in open('romeo.txt').read().split():
    if word not in uniq_words:
        uniq_words.append(word)
If you want to sort, then take the first approach and use sorted().
The expression open('romeo.txt').read().split() already returns a list, so remove the [ ] from [open('romeo.txt').read().split()].
For example, if I write
word = "Hello\nPeter"
s_word = [word.split()] # gives [['Hello', 'Peter']]
But
s_word = word.split() # gives ['Hello', 'Peter']
split() returns a list, so there is no need to put square brackets around the open...split expression. To remove duplicates, use a set:
word_list = sorted(set(open('romeo.txt').read().split()))
print(word_list)
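The two deduplication strategies from the answers above can be sketched side by side. An in-memory string (a shortened piece of the text from the question) stands in for romeo.txt, and the function names are illustrative. The order-preserving variant relies on dicts keeping insertion order, which holds in Python 3.7 and later.

```python
def unique_in_order(words):
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(words))


def unique_sorted(words):
    # Default string sorting puts capitalised words before lowercase ones.
    return sorted(set(words))


text = "But soft what light through yonder window breaks It is the east and Juliet is the sun"
words = text.split()

print(unique_in_order(words))
print(unique_sorted(words))
```

The capitals-first ordering is the same ordering shown in the desired output of the question.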
