how to produce bigrams without stop words

I wrote this function to generate bigrams from a string using nltk.bigrams while ignoring stop words and single letters, but the stop words and letters still appear in the output. Please help me correct the function.
def bigramReturner (tweetString, stopWords):
    bigramFeatureVector = []
    tweetStringG = tweetString.lower()
    tweetStringG = tweetString.split()
    for i in tweetStringG:
        i =replaceTwoOrMore(i)
        i =i.strip('\'"?,.')
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$", i)
        if(i in stopWords is None):
            continue
        else:
            for i in nltk.bigrams(tweetStringG):
                bigramFeatureVector.append(' '.join(i))
    return bigramFeatureVector

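Try removing the is None check. The expression i in stopWords is None is a chained comparison, equivalent to (i in stopWords) and (stopWords is None), so it is always False when stopWords is a real collection and the continue is never reached.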
def bigramReturner (tweetString, stopWords):
    bigramFeatureVector = []
    tweetStringG = tweetString.lower()
    tweetStringG = tweetString.split()
    for i in tweetStringG:
        i =replaceTwoOrMore(i)
        i =i.strip('\'"?,.')
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$", i)
        if(i in stopWords):
            continue
        else:
            for i in nltk.bigrams(tweetStringG):
                bigramFeatureVector.append(' '.join(i))
    return bigramFeatureVector
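
Even with the is None check removed, the function above still builds its bigrams from the unfiltered token list, so stop words can reappear. One way around that is to filter the tokens first and pair them up afterwards. The following is only a sketch under some assumptions: replace_two_or_more is a hypothetical stand-in for the replaceTwoOrMore helper that the question does not show, and zip(tokens, tokens[1:]) is used in place of nltk.bigrams so the snippet runs without NLTK installed.

```python
import re


def replace_two_or_more(token):
    # Hypothetical stand-in for the asker's replaceTwoOrMore helper (not shown
    # in the question): collapse any run of repeated characters to two.
    return re.sub(r"(.)\1+", r"\1\1", token)


def bigram_returner(tweet_string, stop_words):
    tokens = []
    # Lowercase first, then split, so the lowercased text is actually used.
    for raw in tweet_string.lower().split():
        token = replace_two_or_more(raw).strip('\'"?,.')
        # Same pattern as in the question: starts with a letter and has at
        # least two characters, which drops single letters.
        if not re.search(r"^[a-zA-Z][a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*$", token):
            continue
        if token in stop_words:
            continue
        tokens.append(token)
    # Pair each kept token with its successor (same pairs nltk.bigrams would
    # give for this list).
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


print(bigram_returner("The cat sat on a mat.", {"the", "on", "a"}))
# ['cat sat', 'sat mat']
```

Note that filtering first means words on either side of a removed stop word become neighbours; whether that is what you want depends on how the bigrams are used.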

Related

Printing out Words for a Python Word Game that uses Parallelism/Multiprocessing

I am working on a multiprocessing word game in Python, and I need some help getting the program to print words to the screen. The original code that I started with, before I added the multiprocessing code, was able to print words to the screen. I'm using a dictionary in which every word is a 3-letter word. The game asks for the 1st word and the 2nd word, and the program then changes one letter at a time until it reaches the 2nd word. I will include the original code, which does print the words, somewhere in the comments if I can.
I am using a dictionary file that was linked in the original post, but you can also use a dictionary of your choice.
Below is the first part of my code.
```python
from multiprocessing import Process, Queue

class node:
    def __init__(self, word):
        self._word = word
        self._neighbors = []
        self._status = 0
        self._discoveredby = 0

    def getWord(self):
        return self._word

    def getNeighbors(self):
        return self._neighbors

    def addNeighbor(self, neighbor_index):
        self._neighbors.append(neighbor_index)

    def isUnseen(self):
        if self._status == 0:
            return True
        else:
            return False

    def isSeen(self):
        if self._status == 1:
            return True
        else:
            return False

    def setUnseen(self):
        self._status = 0

    def setSeen(self):
        self._status = 1

    def discover(self, n):
        self._discoveredby = n

    def discovered(self):
        return self._discoveredby

#Get starting and ending word
word1 = input("What is the starting word? ")
word2 = input("What is the ending word? ")
```
Here is the second part of the code for the word game. It does not raise any errors as far as I know, but what I'm wondering is:
Why does it not print the words to the screen?
The original code that I started with, before modifying it to use multiprocessing, was able to print words to the screen. I can include my original code in the comments if you want to see it.
```python
if __name__ == '__main__':
    #Read in the words from a dictionary
    words = []
    dict = open("Word Game Text.txt", 'r')
    for line in dict:
        word = node(line.strip())
        words.append(word)

    def findlinks(words, q, starti, endi):
        for i in range(starti, endi):
            for j in range(i + 1, len(words)):
                #compare word i and word j
                if compareWords(words[i].getWord(), words[j].getWord()):
                    q.put((i, j)) #addLink(words, i, j)

    q = Queue()
    process_list = []
    for i in range(0,len(words)-100,100):
        p = Process(target=findlinks, args=(words,q, i, i+100))
        process_list.append(p)
    p = Process(target=findlinks, args=(words, q, i+100, len(words)))
    process_list.append(p)
    for p in process_list:
        p.start()
    while not q.empty():
        i, j = q.get()
        addLink(words, i, j)
    for p in process_list:
        p.join()
    while not q.empty():
        i, j = q.get()
        addLink(words, i, j)

    index1 = -1
    index2 = -1
    for i in range(len(words)):
        if words[i].getWord() == word1:
            index1 = i
        if words[i].getWord() == word2:
            index2 = i
    if index1 == -1:
        print(word1,"was not in the dictionary. Exiting.")
        exit(0)
    if index2 == -1:
        print(word2,"was not in the dictionary. Exiting.")
        exit(0)

    path = BFS(words, index1, index2)
    #Report the chain
    if path == []:
        print("There was no chain between those words, in my dictionary")
    else:
        for index in path:
            print(words[index].getWord())
            #print(words[index].getWord())
```
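
The question does not show compareWords, addLink or BFS, so it is hard to tell where the output gets lost. As a baseline to compare the multiprocessing version against, here is a minimal single-process sketch of the game as described (change one letter at a time until the second word is reached). The names compare_words and find_chain and the sample word list are hypothetical and only illustrate the idea; they are not the asker's helpers.

```python
from collections import deque


def compare_words(a, b):
    """Return True when a and b have equal length and differ in exactly one letter."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def find_chain(words, start, end):
    """Breadth-first search for a shortest chain from start to end; [] if none."""
    if start not in words or end not in words:
        return []
    discovered_by = {start: None}  # word -> the word it was reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            # Walk the discovered_by links back to the start word.
            chain = []
            while current is not None:
                chain.append(current)
                current = discovered_by[current]
            return chain[::-1]
        for candidate in words:
            if candidate not in discovered_by and compare_words(current, candidate):
                discovered_by[candidate] = current
                queue.append(candidate)
    return []


# Hypothetical sample dictionary of 3-letter words.
sample = ["cat", "cot", "cog", "dog", "dot", "sun"]
for word in find_chain(sample, "cat", "dog"):
    print(word)
```

If this prints a chain for your dictionary while the multiprocessing version prints nothing, the difference is most likely in how the links are collected from the queue rather than in the search itself.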

finding the longest common prefix of elements inside a list

I have a sequence print(lcp(["flower","flow","flight", "dog"])) which should return fl. Currently I can get it to return flowfl.
I can locate the instances where o or w should be removed, and I have tried different approaches to remove them. However, they all seem to hit syntax issues that I cannot resolve by myself.
I would very much appreciate a little guidance, either to get the tools to remedy this issue myself or to learn from a working proposed solution.
def lcp(strs):
    if not isinstance(strs, list) or len(strs) == 0:
        return ""
    if len(strs) == 1:
        return strs[0]
    original = strs[0]
    original_max = len(original)
    result = ""
    for _, word in enumerate(strs[1:],1):
        current_max = len(word)
        i = 0
        while i < current_max and i < original_max:
            copy = "".join(result)
            if len(copy) and copy[i-1] not in word:
                # result = result.replace(copy[i-1], "")
                # result = copy[:i-1]
                print(copy[i-1], copy, result.index(copy[i-1]), i, word)
            if word[i] == original[i]:
                result += word[i]
            i += 1
    return result
print(lcp(["flower","flow","flight", "dog"])) # returns flowfl should be fl
print(lcp(["dog","car"])) # works
print(lcp(["dog","racecar","car"])) # works
print(lcp([])) # works
print(lcp(["one"])) # works
I also worked on an alternative that does not do the removal inside the same loop but adds a counter at the end. However, my instincts suggest it can be solved within the for and while loops without increasing code bloat.
if len(result) > 1:
    counter = {char: result.count(char) for char in result}
    print(counter)

I solved this using the approach below.
from typing import List  # needed for the List[str] annotation

class Solution:
    def longestCommonPrefix(self, strs: List[str]) -> str:
        N = len(strs)
        if N == 1:
            return strs[0]
        len_of_small_str, small_str = self.get_min_str(strs)
        ans = ""
        # compare every string against the shortest one, character by character
        for i in range(len_of_small_str):
            ch = small_str[i]
            is_qualified = True
            for j in range(N):
                if strs[j][i] != ch:
                    is_qualified = False
                    break
            if is_qualified:
                ans += ch
            else:
                break
        return ans

    def get_min_str(self, A):
        # return the length of the shortest string and the string itself
        min_len = len(A[0])
        s = A[0]
        for i in range(1, len(A)):
            if len(A[i]) < min_len:
                min_len = len(A[i])
                s = A[i]
        return min_len, s

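This returns the longest prefix that all of the words have in common.
def lcp(strs):
    if len(strs) == 0:
        return ""
    result = strs[0]
    for word in strs[1:]:
        for i, (l1, l2) in enumerate(zip(result, word)):
            if l1 != l2:
                # first mismatch: keep only what matched so far
                result = result[:i]
                break
        else:
            # no mismatch: the prefix cannot be longer than the shorter string
            # (len(word) replaces i+1 so an empty string does not leave i undefined)
            result = result[:len(word)]
    return result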
Results:
>>> print(lcp(["flower","flow","flight"]))
fl
>>> print(lcp(["flower","flow","flight", "dog"]))
>>> print(lcp(["dog","car"]))
>>> print(lcp(["dog","racecar","car"]))
>>> print(lcp([]))
>>> print(lcp(["one"]))
one
>>> print(lcp(["one", "one"]))
one

You might need to rephrase your goal.
By your description you don't want the longest common prefix, but the prefix that the most words have in common with the first one.
One of your issues is that your tests cover only one real case and four edge cases. Make some more real examples.
Here's my proposal: I mostly added the elif, which checks whether the very first letter already differs and, if so, discards that entry.
It also overwrites original, so that the string is rebuilt from the common prefix with the next word (if there is one).
def lcp(strs):
    if not isinstance(strs, list) or len(strs) == 0:
        return ""
    if len(strs) == 1:
        return strs[0]
    original = strs[0]
    result = ""
    for word in strs[1:]:
        i = 0
        while i < len(word) and i < len(original) :
            if word[i] == original[i]:
                result += word[i]
            elif i == 0:
                # first letter differs: ignore this word and keep the prefix
                result = original
                break
            i += 1
        original = result
        result = ""
    return original
print(lcp(["flower","flow","flight", "dog"])) # fl
print(lcp(["shift", "shill", "hunter", "shame"])) # sh
print(lcp(["dog","car"])) # dog
print(lcp(["dog","racecar","car"])) # dog
print(lcp(["dog","racecar","dodge"])) # do
print(lcp([])) # [nothing]
print(lcp(["one"])) # one
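
To make the two readings of the goal concrete, here is a small sketch of both. strict_lcp is the usual longest common prefix of all words. lenient_lcp is a hypothetical name for one possible interpretation of the behaviour described above: words that do not even share the first letter with the first word are ignored. That interpretation is an assumption on my part, not something the question states.

```python
def strict_lcp(strs):
    """Longest prefix shared by every string in strs."""
    prefix = []
    # zip(*strs) walks the strings column by column and stops at the shortest.
    for chars in zip(*strs):
        if len(set(chars)) != 1:
            break
        prefix.append(chars[0])
    return "".join(prefix)


def lenient_lcp(strs):
    """Common prefix of the first word and every word sharing its first letter."""
    if not strs:
        return ""
    first = strs[0]
    related = [word for word in strs if word[:1] == first[:1]]
    return strict_lcp(related)


print(strict_lcp(["flower", "flow", "flight"]))          # fl
print(strict_lcp(["flower", "flow", "flight", "dog"]))   # prints an empty line
print(lenient_lcp(["flower", "flow", "flight", "dog"]))  # fl
print(lenient_lcp(["dog", "racecar", "dodge"]))          # do
```

Both versions stop at the first mismatching position, so characters after a mismatch are never appended to the result.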

Longest substring

s = 'abcabcbb'
def longsubstring(s):
    if len(s)==0:
        return 0
    list1 = []
    empty = ''
    for i in s:
        if i in empty:
            list1.append(len(empty))
            empty = ''
            continue
        else:
            empty+=i
    return max(list1)
longsubstring(s)
The above code works fine when s = 'abcabcbb', but for s = 'aab' it returns 1 when the answer should be 2. Could someone debug the code and tell me where I went wrong? Thanks in advance.

Here is the modification to your code that worked
s = 'aab'
def longsubstring(s):
    if len(s)==0:
        return 0
    list1 = []
    empty = ''
    for i in range(0,len(s)):
        empty=''
        for j in range(i,len(s)):
            if s[j] in empty:
                list1.append(len(empty))
                empty = ''
                break
            else:
                empty+=s[j]
        list1.append(len(empty))
    return max(list1)
print(longsubstring(s))
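
The version above restarts from every index, which is quadratic in the length of the string. If that ever becomes a problem, a sliding window gives the same answers in a single pass. This is a sketch of that alternative technique, not a change to the code above; the function name is my own.

```python
def longest_unique_substring_length(s):
    """Length of the longest substring of s without repeated characters."""
    last_seen = {}  # character -> index of its most recent occurrence
    start = 0       # left edge of the current window
    best = 0
    for i, ch in enumerate(s):
        # If ch already occurs inside the window, move the left edge past it.
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best


print(longest_unique_substring_length('abcabcbb'))  # 3
print(longest_unique_substring_length('aab'))       # 2
```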

Backtrack Algorithm To Check Strings form Matrix

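I have a list:
words = ["ALI", "SIN", "ASI", "LIR", "IRI", "INI", "KAR"]
I want to check whether some of them form a matrix such as this one, in which every row and every column is a word from the list:
A L I
S I N
I R I
and return my solution as a list like:
solution = ["ALI", "SIN", "IRI"]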
I have come up with this code:
words=["ALI", "SIN", "ASI", "LIR", "IRI", "INI", "KAR"]
solution =[]
failedsolutions = []
def Get_next_word():
    while True:
        for word in words:
            if (word in solution) == False:
                solution.append(word)
                if (solution in failedsolutions) == False:
                    return False
                else:
                    solution.pop(len(solution) - 1 )
        return True
def Check_word_order():
    checker = True
    for i in range(len(solution)):
        for j in range(len(words[0])):
            for word in words:
                if solution[i][j] == word[j]:
                    checker = False
                else:
                    checker = True
                if checker == False:
                    return checker
def main():
    while True:
        Get_next_word()
        check = Check_word_order()
        if check is False:
            #Backtrack
            failedsolutions.append(solution)
            solution.pop(len(solution) - 1 )
        # else:
        #     solution.append()
    print(solution)
main()
I am tired and can no longer debug my code. Any help will be appreciated.
Thank you.
PS: I advise against fixing my code if there is a better way to do it all.

You can use this simple recursive function which will analyze all possible groups:
import itertools
import copy
words = ["ALI", "SIN", "ASI", "LIR", "IRI", "INI", "KAR"]
def get_pairs(word_group, current_words, found):
    # try each remaining word as the last row and remember the most recent
    # group whose columns are all words from the list and not already rows
    if not current_words:
        return found
    new_group = list(word_group)+[current_words[0]]
    if all(''.join(i) in words and ''.join(i) not in new_group for i in zip(*new_group)):
        return get_pairs(word_group, current_words[1:], new_group)
    return get_pairs(word_group, current_words[1:], found)
starting_pairs = [i for i in itertools.combinations(words, 2)]
# list(...) is needed on Python 3, where filter returns a lazy iterator
final_listing = list(filter(lambda x:x, [get_pairs(i, copy.deepcopy(words), []) for i in starting_pairs]))
Output:
[['ALI', 'SIN', 'IRI'], ['ASI', 'LIR', 'INI']]
Which yields all combinations of valid matrices.
Or, without using itertools:
def get_combos(word, new_words):
    if new_words[1:]:
        # a plain loop replaces the original yield inside a list comprehension,
        # which is a syntax error on Python 3.8 and later
        for i in new_words:
            if i != word:
                yield (word, i)
        for b in get_combos(new_words[0], new_words[1:]):
            yield b
starting_pairs = get_combos(words[0], words[1:])
final_listing = list(filter(lambda x:x, [get_pairs(i, copy.deepcopy(words), []) for i in starting_pairs]))
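
Since the question asks for backtracking specifically, here is a sketch of how that could look: rows are added one at a time, and a branch is abandoned as soon as some partial column is no longer the beginning of any word. The function name word_squares is hypothetical, and the code assumes a non-empty list of equal-length words and, like the code above, rejects grids in which a column repeats one of the rows.

```python
def word_squares(words):
    """Return every list of rows whose columns are also words from `words`."""
    size = len(words[0])
    # Every prefix of every word, including the empty string and the full word.
    prefixes = {word[:k] for word in words for k in range(size + 1)}
    solutions = []

    def extend(rows):
        if len(rows) == size:
            columns = {"".join(col) for col in zip(*rows)}
            # Full-length columns found in `prefixes` are words; additionally
            # require that no column is one of the rows.
            if not columns & set(rows):
                solutions.append(list(rows))
            return
        for word in words:
            if word in rows:
                continue
            candidate = rows + [word]
            # Prune: every partial column must still be a prefix of some word.
            if all("".join(col) in prefixes for col in zip(*candidate)):
                extend(candidate)
            # Otherwise backtrack: drop this word and try the next one.

    extend([])
    return solutions


words = ["ALI", "SIN", "ASI", "LIR", "IRI", "INI", "KAR"]
print(word_squares(words))
```

The pruning step is what makes this backtracking rather than brute force: most candidate rows are rejected after one or two rows instead of after a full grid has been built.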

Finding All Variants Of Word With Shifted Capital Letter

I need a python function that can do the following:
Given an input of 't' and 'tattle', it should return a list like so:
['Tattle','taTtle','tatTle']
Or with 'z' and 'zzzzz':
['Zzzzz','zZzzz','zzZzz','zzzZz','zzzzZ']
I coded the following, but it does not work with the second example because the current function checks to see if the basestr matches what is already in the resulting list, R, and can pick up false positives due to words with multiple basestr's already in the word. Anyone have any advice?
def all_variants(wrapletter,word):
    L,R,WLU,basestr=list(word),[],wrapletter.upper(),''
    if L.count(wrapletter)==1:
        for char in L:
            if wrapletter==char:
                basestr=basestr+WLU
            else:
                basestr=basestr+char
        R.append(basestr)
        return(R)
    else:
        for i in range(L.count(wrapletter)):
            basestr=''
            if i==0 and L[0]==wrapletter:
                basestr=WLU
                for char in range(1,len(L)):
                    basestr=basestr+L[char]
                R.append(basestr)
            else:
                for char in L:
                    if wrapletter==char:
                        if WLU in basestr:
                            basestr=basestr+char
                        elif basestr in str(R):
                            basestr=basestr+char
                        else:
                            basestr=basestr+WLU
                    else:
                        basestr=basestr+char
                R.append(basestr)
        R.remove(R[0])
        return(R)

It's not elegant, but maybe it's what you need?
target = "daaddaad"
def capitalize(target_letter, word):
    # range and print() are the Python 3 forms of xrange and the print statement
    return [word[:i] + word[i].upper() + word[i + 1:]
            for i in range(len(word)) if word[i] == target_letter]
print(capitalize("d", target))
Outputs:
['Daaddaad', 'daaDdaad', 'daadDaad', 'daaddaaD']

inp = 't'
word = 'tattle'
inds = (i for i,ele in enumerate(word) if ele == inp)
print([word[:i]+word[i].upper()+word[i+1:] for i in inds])
['Tattle', 'taTtle', 'tatTle']

Try this. I iterate through each letter, shift it to uppercase, and sandwich it with the other parts of the original string.
def all_variants(wrapletter, word):
    variants = []
    for i, letter in enumerate(word):
        if letter == wrapletter:
            variants.append(word[:i] + letter.upper() + word[i+1:])
    return variants
print(all_variants('z', 'zzzzz'))
print(all_variants('t', 'tattle'))

def all_variants(wrapletter, word):
    variants = []  # renamed from list/str so the built-ins are not shadowed
    for i in range(len(word)):
        if word[i] == wrapletter:
            start = word[0:i].lower()
            upper = word[i].upper()
            end = word[i+1:].lower()
            variants.append(start + upper + end)
    return variants
These were returned when I ran this function:
>>> all_variants("t", "tattle")
['Tattle', 'taTtle', 'tatTle']
>>> all_variants("z", "zzzzz")
['Zzzzz', 'zZzzz', 'zzZzz', 'zzzZz', 'zzzzZ']
