How to split a word into n-grams in Python?

I need to split a word into n-grams (for example, the word ADVENTURE has three 4-grams: ADVE, ENTU, TURE). The real input is a book file (that's the reason for the counter and isalpha), which I don't have here, so I'm using only a list of two words. This is my code in Python:
words = ['adven', 'adventure']
def ngrams(words, n):
    counter = {}
    for word in words:
        if (len(word)-1) >= n:
            for i in range(0, len(word)):
                if word.isalpha() == True:
                    ngram = ""
                    for i in range(len(word)):
                        ngram += word[i:n:]
                        if len(ngram) == n:
                            ngram.join(counter)
                            counter[ngram] = counter.get(ngram, 0) + 1
    return counter
print(ngrams(words, 4))
This is what the code gives me:
{'adve': 14}
I don't care about the values in it, but I'm not very good with strings and I don't know what to change so that it gives me the three 4-grams. I tried "ngram += word[i::]", but that gives me None. Please help; this is school homework, and I can't write the other functions until this ngrams function works.

Use nltk.ngrams for this job:
from nltk import ngrams
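For comparison, here is a minimal sketch of the conventional sliding-window definition without any library. The function name char_ngrams is made up for this illustration. As far as I know, nltk.ngrams yields the same windows as tuples of characters, so note that this gives six 4-grams for "adventure", not the three the question asks for.

```python
def char_ngrams(word, n):
    """Return every run of n consecutive characters in word (hypothetical helper)."""
    # range() is empty when the word is shorter than n, so the result is [].
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams('adventure', 4))
# ['adve', 'dven', 'vent', 'entu', 'ntur', 'ture']
```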

I think your definition of n-grams is a little different from the conventional one, as pointed out by @Stuart in his comment. However, with the definition from your comment, I think the following would solve your problem.
def n_grams(word, n):
    # We can't find n-grams if the word has fewer than n letters.
    if n > len(word):
        return []
    output = []
    start_idx = 0
    end_idx = start_idx + n
    # Grab all n-grams except the last one
    while end_idx < len(word):
        n_gram = word[start_idx:end_idx]
        output.append(n_gram)
        start_idx = end_idx - 1
        end_idx = start_idx + n
    # Grab the last n-gram
    last_n_gram_start = len(word) - n
    last_n_gram_end = len(word)
    output.append(word[last_n_gram_start:last_n_gram_end])
    return output

If I've understood the rules correctly, you can do it like this
def special_ngrams(word, n):
    """ Yield character ngrams of word that overlap by only one character,
    except for the last two ngrams which may overlap by more than one
    character. The first and last ngrams of the word are always included. """
    for start in range(0, len(word) - n, n - 1):
        yield word[start:start + n]
    yield word[-n:]
for word in "hello there this is a test", "adventure", "tyrannosaurus", "advent":
    print(list(special_ngrams(word, 4)))
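Since the question ultimately wants a counter over the words of a book, here is a rough sketch of how the generator above could feed collections.Counter. The name count_ngrams and the choice to skip words shorter than n are my assumptions, not something stated in the question.

```python
from collections import Counter

def special_ngrams(word, n):
    """Yield n-grams that overlap by one character, plus the final n-gram."""
    for start in range(0, len(word) - n, n - 1):
        yield word[start:start + n]
    yield word[-n:]

def count_ngrams(words, n):
    """Count n-grams over purely alphabetic words (hypothetical helper)."""
    counter = Counter()
    for word in words:
        # Assumption: words with digits/punctuation or fewer than n letters are skipped.
        if word.isalpha() and len(word) >= n:
            counter.update(special_ngrams(word, n))
    return counter

counts = count_ngrams(['adven', 'adventure', 'x1y2z3'], 4)
print(dict(counts))
# {'adve': 2, 'dven': 1, 'entu': 1, 'ture': 1}
```

Counter.update adds one to the count of every item the generator yields, so there is no need for the counter.get(ngram, 0) + 1 pattern.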

Related

Python Inserting a string

I need to insert a string (character by character) into another string at every third position.
For example: string_1: wwwaabkccgkll
string_2: toadhp
I need to insert string_2 character by character into string_1 at every third position,
so the output must be wwtaaobkaccdgkhllp
I need this in Python, but Java is OK too.
So I tried this:
test_str = "hiimdumbiknow"
challenge = "toadh"
k = 0
new_st = challenge[k]
last = list(test_str)
for i in range(len(test_str)):
    if i % 3 == 0:
        last.insert(i, new_st)
        k += 1
and the output I get is
thitimtdutmbtiknow
You can split test_str into substrings of length 2, and then iterate, merging them with the characters of challenge:
def concat3(test_str, challenge):
    chunks = [test_str[i:i+2] for i in range(0,len(test_str),2)]
    result = []
    i = j = 0
    while i<len(chunks) or j<len(challenge):
        if i<len(chunks):
            result.append(chunks[i])
            i += 1
        if j<len(challenge):
            result.append(challenge[j])
            j += 1
    return ''.join(result)
test_str = "hiimdumbiknow"
challenge = "toadh"
print(concat3(test_str, challenge))
# hitimoduambdikhnow
This method works even if the lengths of test_str and challenge are mismatching. (The remaining characters in the longest string will be appended at the end.)
You can split test_str into groups of two letters and then re-join them with one letter from challenge after each group, as follows:
import itertools
print(''.join(f'{two}{letter}' for two, letter in itertools.zip_longest([test_str[i:i+2] for i in range(0,len(test_str),2)], challenge, fillvalue='')))
Output:
hitimoduambdikhnow
*edited to split in to groups of two rather than three as originally posted
You can try this: create an iterator over the second string, iterate over the first one, and choose which character becomes part of the final string according to its position.
def add3(s1, s2):
    def n():
        try:
            k = iter(s2)
            for i,j in enumerate(s1):
                yield (j if (i==0 or (i+1)%3) else next(k))
        except:
            try:
                yield s1[i+1:]
            except:
                pass
    return ''.join(n())
def insertstring(test_str,challenge):
    result = ''
    x = [x for x in test_str]
    y = [y for y in challenge]
    j = 0
    for i in range(len(x)):
        if i % 2 != 0 or i == 0:
            result += x[i]
        else:
            if j < 5:
                result += y[j]
                result += x[i]
                j += 1
    get_last_element = x[-1]
    return result + get_last_element
print(insertstring(test_str,challenge))
#output: hitimoduambdikhnow
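The answers above all hard-code a block size of two. A possible generalization is sketched below; the name insert_every and its every parameter are invented for this example, and appending leftover characters at the end is one choice among several.

```python
def insert_every(text, extra, every=2):
    """Insert one character of extra after each block of `every` characters of text."""
    chunks = [text[i:i + every] for i in range(0, len(text), every)]
    out = []
    for idx, chunk in enumerate(chunks):
        out.append(chunk)
        if idx < len(extra):
            out.append(extra[idx])
    # Characters of extra that were not consumed are appended at the end.
    out.append(extra[len(chunks):])
    return ''.join(out)

print(insert_every("hiimdumbiknow", "toadh"))
# hitimoduambdikhnow
```

With every=2 each inserted character lands at every third position of the result, which is what the question describes.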

How to find longest common substring of words in a list?

I have a list of words:
list1 = ['technology','technician','technical','technicality']
I want to find the substring that is repeated in every word. In this case, it is 'tech'.
I have tried converting all the characters to ASCII values, but I am stuck there because I can't think of any logic.
Can somebody please help me with this?
This is generally called the Longest common substring/subsequence problem.
A very basic (but slow) strategy:
words = ['technology', 'technician', 'technical', 'technicality']
shortest_word = min(words, key=len)
longest_substring = ""
# Loop over every start position in one word (ideally, the shortest).
for start_idx in range(len(shortest_word)):
    # Select substrings of increasing length from that position.
    for length in range(1, len(shortest_word) - start_idx + 1):
        curr_substring = shortest_word[start_idx : start_idx + length]
        # Stop extending as soon as the substring is missing from any word.
        if not all(curr_substring in word for word in words):
            break
        # Keep the substring if it is the longest found so far.
        if len(curr_substring) > len(longest_substring):
            longest_substring = curr_substring
print(longest_substring)  # techn
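Iterate over the first word and increase the prefix length for as long as all words share a single prefix (checked with a set). When the prefixes differ, return the last common one. Note that this only finds a common prefix, not a substring in the middle of the words.
list1 = ['technology', 'technician', 'technical', 'technicality']
def common_prefix(li):
    prefix = ''
    # Grow the prefix one character at a time.
    for i in range(1, min(map(len, li)) + 1):
        # A set with one element means every word shares this prefix.
        s = {word[:i] for word in li}
        if len(s) > 1:
            break
        prefix = s.pop()
    return prefix
print(common_prefix(list1))
output: techn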
Find the length of the shortest word. Iterate over increasingly small chunks of the first word, starting with a chunk equal in length to the shortest word, checking whether each is contained in all of the other strings. If it is, return that substring.
list1 = ['technology', 'technician', 'technical', 'technicality']
def shortest_common_substring(lst):
    shortest_len = min(map(len, lst))
    first = lst[0]
    # Try chunk lengths from the shortest word's length down to 1.
    for i in range(shortest_len, 0, -1):
        # Slide a window of length i across the first word.
        for j in range(len(first) - i + 1):
            substr = first[j:j + i]
            if all(substr in w for w in lst[1:]):
                return substr
And just for fun, let's replace that loop with a generator expression, and just take the first thing it gives us (or None).
def shortest_common_substring(lst):
    shortest_len = min(map(len, lst))
    first = lst[0]
    return next((first[j:j + i] for i in range(shortest_len, 0, -1)
                 for j in range(len(first) - i + 1)
                 if all(first[j:j + i] in w for w in lst[1:])),
                None)
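Another way to look at the problem, sketched here under the assumption that the words are short enough for a quadratic number of substrings per word: build the set of all substrings of each word and intersect the sets. The helper names are made up for this example, and ties between equally long candidates are broken alphabetically just to keep the result deterministic.

```python
def all_substrings(word):
    """Return the set of every non-empty substring of word."""
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + 1, len(word) + 1)}

def longest_common_substring(words):
    """Return the longest substring present in every word, or '' if none."""
    if not words:
        return ''
    common = set.intersection(*(all_substrings(w) for w in words))
    # Longest first; ties are broken alphabetically for a stable answer.
    return max(common, key=lambda s: (len(s), s), default='')

list1 = ['technology', 'technician', 'technical', 'technicality']
print(longest_common_substring(list1))  # techn
```

This trades memory for simplicity, so it is a teaching sketch rather than something to run on long strings.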

How to find the longest repeating sequence using python

I had an interview where they asked me to print the longest repeated character sequence.
I got stuck. Is there any way to get it?
My code prints only the count of each character in the string. Is there an approach that gets the expected output?
import pandas as pd
import collections
a = 'abcxyzaaaabbbbbbb'
lst = collections.Counter(a)
df = pd.Series(lst)
df
Expected output :
bbbbbbb
How can I add that logic to the code above?
A regex solution:
import re
max(re.split(r'((.)\2*)', a), key=len)
Or without library help (but less efficient):
s = ''
max((s := s * (c in s) + c for c in a), key=len)
Both compute the string 'bbbbbbb'.
Without any modules, you could use a comprehension to go backward through possible sizes and get the first character multiplication that is present in the string:
next(c*s for s in range(len(a),0,-1) for c in a if c*s in a)
That's quite bad in terms of efficiency, though.
Another approach would be to detect the positions of letter changes and take the longest subrange between them:
chg = [i for i,(x,y) in enumerate(zip(a,a[1:]),1) if x!=y]
s,e = max(zip([0]+chg,chg+[len(a)]),key=lambda se:se[1]-se[0])
longest = a[s:e]
Of course a basic for-loop solution will also work:
si,sc = 0,"" # current streak (start, character)
ls,le = 0,0 # longest streak (start, end)
for i,c in enumerate(a+" "): # extra space to force out last char.
    if i-si > le-ls: ls,le = si,i # new longest
    if sc != c: si,sc = i,c # new streak
longest = a[ls:le]
print(longest) # bbbbbbb
A more long-winded solution, picked wholesale from:
maximum-consecutive-repeating-character-string
def maxRepeating(str):
    len_s = len(str)
    count = 0
    # Find the maximum repeating
    # character starting from str[i]
    res = str[0]
    for i in range(len_s):
        cur_count = 1
        for j in range(i + 1, len_s):
            if (str[i] != str[j]):
                break
            cur_count += 1
        # Update result if required
        if cur_count > count :
            count = cur_count
            res = str[i]
    return res, count
# Driver code
if __name__ == "__main__":
    str = "abcxyzaaaabbbbbbb"
    print(maxRepeating(str))
Output:
('b', 7)
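One more option that I find easy to read is itertools.groupby, which groups consecutive equal characters. This is only a sketch; the function name longest_run is made up here, and when two runs have the same length the first one wins.

```python
from itertools import groupby

def longest_run(text):
    """Return the longest run of one repeated character ('' for empty input)."""
    # groupby yields (character, iterator over the consecutive equal characters).
    runs = (''.join(group) for _, group in groupby(text))
    return max(runs, key=len, default='')

a = 'abcxyzaaaabbbbbbb'
print(longest_run(a))  # bbbbbbb
```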

How to create a list from results from a given list within a for loop?

I am doing Google's Python class and came across this problem:
# A. match_ends
# Given a list of strings, return the count of the number of
# strings where the string length is 2 or more and the first
# and last chars of the string are the same.
# Note: python does not have a ++ operator, but += works.
I tried different approaches but can't seem to get it to work. This is what I have now:
def match_ends(words):
    words=sorted(words, key=len)
    for i in words:
        if len(i)<2:
            print(i)
            words=words[1:]
    print(words)
    for i in words:
        if i[0:2]==i[-2:]:
            x=[]
            x.append[i]
How is this done?
Easy to accomplish using sum and a generator expression:
def match_ends(words):
    return sum(len(word) >= 2 and word[0] == word[-1] for word in words)
You could simply do this:
def match_ends(words):
    count = 0
    for word in words:
        if len(word) >= 2 and word[0] == word[-1]:
            count += 1
    return count
A more Pythonic solution might be
def func(s):
    return len(s) >= 2 and s[0] == s[-1]
str_list = ['applea', 'b', 'cardc']
filtered_list = [s for s in str_list if (len(s) >= 2 and s[0] == s[-1])]
# or
filtered_list = list(filter(func, str_list))
count = len(filtered_list)
Pretty much the same as the previous answers, but as a lambda:
match_ends = lambda ws: sum(1 for w in ws if len(w)>1 and w[0] == w[-1])
or in 'expanded' form:
match_ends = lambda words: sum(1 for word in words if len(word)>1 and word[0] == word[-1])
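A quick sanity check of the counting logic, using sample lists I made up for illustration (they are not claimed to be the exercise's own test data). The length check has to come first so that the empty string never reaches word[0].

```python
def match_ends(words):
    """Count strings of length >= 2 whose first and last characters match."""
    # The length test short-circuits, so '' never hits the index lookup.
    return sum(1 for w in words if len(w) >= 2 and w[0] == w[-1])

print(match_ends(['aba', 'xyz', 'aa', 'x', 'bbb']))  # 3
print(match_ends(['', 'x', 'xy', 'xyx', 'xx']))      # 2
print(match_ends([]))                                # 0
```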

Matching case words in strings python

I'm trying to write a function that checks whether a word is in a string OR the word has len(word)-1 characters in common with one of the words in the string.
For example:
word: match string: There is a match -> True
word: matck string: There is a match -> True
The output needs to be True for both examples, because dropping the last character of both matck and match gives matc.
I have written the code below so far:
for idx, f in enumerate(files):
    for word in words:
        if word in f:
            numOfWords[idx] += 1
        else:
            file_words = f.split()
            for f_word in file_words:
                if word[:-1] == f_word[:-1]:
                    numOfWords[idx] += 1
But it's not good enough, because I have a very big list of words and a very big directory of long files, so the run time is not realistic.
You can use Levenshtein distance to check that
def minimumEditDistance(s1,s2):
    if len(s1) > len(s2):
        s1,s2 = s2,s1
    distances = range(len(s1) + 1)
    for index2,char2 in enumerate(s2):
        newDistances = [index2+1]
        for index1,char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1+1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]
print(minimumEditDistance("kitten","sitting"))
print(minimumEditDistance("rosettacode","raisethysword"))
https://rosettacode.org/wiki/Levenshtein_distance#Python
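To connect this back to the question, here is a rough sketch of how an edit distance could be used for the "one character off" check. The function fuzzy_contains and its max_distance parameter are names I invented, and treating "len(word)-1 characters in common" as "edit distance of at most 1" is my interpretation, not something the question states. The length pre-check skips most candidates cheaply, which should help a little with large inputs.

```python
def edit_distance(s1, s2):
    """Levenshtein distance, keeping only one previous row of the table."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        current = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                current.append(previous[i1])
            else:
                current.append(1 + min(previous[i1], previous[i1 + 1], current[-1]))
        previous = current
    return previous[-1]

def fuzzy_contains(word, text, max_distance=1):
    """True if some whitespace-separated word of text is within max_distance edits."""
    for candidate in text.split():
        # A length gap larger than max_distance can never be bridged.
        if abs(len(candidate) - len(word)) > max_distance:
            continue
        if edit_distance(word, candidate) <= max_distance:
            return True
    return False

print(fuzzy_contains("match", "There is a match"))  # True
print(fuzzy_contains("matck", "There is a match"))  # True
print(fuzzy_contains("zebra", "There is a match"))  # False
```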
