Convert consecutive duplicate character string to known word

I am trying to convert a string with consecutive duplicate characters to its 'dictionary' word. For example, 'aaawwesome' should be converted to 'awesome'.
Most answers I've come across have either removed all duplicates (so 'stations' would be converted to 'staion' for example), or have removed all consecutive duplicates using itertools.groupby(), however this doesn't account for cases such as 'happy' which would be converted to 'hapy'.
I have also tried using string intersections with Counter(), however this disregards the order of the letters so 'botaniser' and 'baritones' would match incorrectly.
In my specific case, I have a few lists of words:
list_1 = ["wife", "kid", "hello"]
list_2 = ["husband", "child", "goodbye"]
and, given a new word, for example 'hellllo', I want to check if it can be reduced down to any of the words, and if it can, replace it with that word.

Use the enchant module; you may need to install it with pip (the package is called pyenchant).
Find which letters are duplicated, then remove one occurrence at a time until the word is in the English dictionary.
import enchant
d = enchant.Dict("en_US")
list_1 = ["wiffe", "kidd", "helllo"]
def dup(x):
    for n, j in enumerate(x):
        # indices of every occurrence of letter j in the current word
        y = [g for g, m in enumerate(x) if m == j]
        for h in y:
            # drop one occurrence while the word is still not in the dictionary
            if len(y) > 1 and not d.check(x):
                x = x[:h] + x[h+1:]
    return x
list_1 = list(map(dup, list_1))
print(list_1)
>>> ['wife', 'kid', 'hello']
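If you only need to match against your own word lists rather than a full English dictionary, a regex-based sketch along these lines might be enough. The function name reduce_to_known is made up for illustration. The idea is to turn each known word into a pattern where every letter may repeat, so 'hello' becomes h+e+l+l+o+, and then test the whole input against it:

```python
import re

def reduce_to_known(word, known_words):
    """Return the first known word that `word` collapses to, else `word` unchanged."""
    for candidate in known_words:
        # 'hello' -> 'h+e+l+l+o+': each letter of the candidate may repeat
        pattern = "".join(re.escape(ch) + "+" for ch in candidate)
        if re.fullmatch(pattern, word):
            return candidate
    return word

list_1 = ["wife", "kid", "hello"]
list_2 = ["husband", "child", "goodbye"]
known = list_1 + list_2

print(reduce_to_known("hellllo", known))     # hello
print(reduce_to_known("aaawwesome", known))  # aaawwesome (not in these lists)
```

Because the pattern keeps every letter of the known word, double letters such as the 'll' in 'hello' are preserved, and letter order matters, so anagrams will not match.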

Related

Does string contain any of the words in my list?

I want to check a string to see if it contains any of the words I have in my list.
The list has around 100 individual words.
I have tried using regex but can't get it to work.
string = '<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>'
list = ['Café','Afrikansk','............','Sushi','Svensk','Sydamerikansk','Syditaliensk','Szechuan','Taiwansk','Thai','Tibetansk','Østeuropæisk','Dansk']
In this case the string has 'Dansk' in it. The string could contain more than one of the words in the list.
I want to write a piece of code that prints the words in the list that are also in the string.
In this case the output should be: Dansk
If there was more than one word in the string it should be: Dansk, ...., ....
I hope someone can help.

>>> list = ['Café','Afrikansk','............','Sushi','Svensk','Sydamerikansk','Syditaliensk','Szechuan','Taiwansk','Thai','Tibetansk','Østeuropæisk','Dansk']
>>> string = """<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>"""
>>> [x for x in list if x in string]
['Dansk']
I recommend not using list as a variable name, as it usually refers to the built-in type list (like str or int).

Use a list comprehension with a membership check:
[x for x in lst if x in string]
Note that I have renamed your list to lst, as list is built-in.
Example:
string = '<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>'
lst = ['Café','Afrikansk','Sushi','Svensk','Sydamerikansk','Syditaliensk','Szechuan','Taiwansk','Thai','Tibetansk','Østeuropæisk','Dansk']
print([x for x in lst if x in string])
# ['Dansk']

In your case you can also use a set intersection (my_list here is your list of words):
string_intersection = set(string.replace(',', '').split()).intersection(my_list)
print(*string_intersection, sep=',')
output:
Dansk
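One caveat with the x in string check used above: it is a plain substring test, so 'Thai' would also match inside 'Thailand'. If that matters for your data, a sketch like the following uses a regex word boundary instead. The helper name whole_word_matches is only illustrative, and the list is shortened:

```python
import re

string = '<div class="header_links">$$ - $$$, Dansk, Veganske retter, Glutenfri retter</div>'
lst = ['Café', 'Afrikansk', 'Sushi', 'Svensk', 'Thai', 'Dansk']

def whole_word_matches(words, text):
    """Return the words that occur in text as whole words, in list order."""
    return [w for w in words
            if re.search(r'\b' + re.escape(w) + r'\b', text)]

print(', '.join(whole_word_matches(lst, string)))  # Dansk
```

re.escape keeps any punctuation in a word from being read as regex syntax.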

How to compare reverse strings in list of strings with the original list of strings in python?

Take a string as input and check whether any word in it also appears reversed in the same string. If so, print that word; otherwise print $.
I split the string, put the words in a list, and then reversed the words in that list. After that, I wasn't able to compare the two lists.
str = input()
x = str.split()
for i in x:  # printing i shows the words in the list
    str1 = i[::-1]  # printing str1 shows the reverse of words in a new list
    # now how to check if any word of the new list matches to any word of the old list
    if(i==str):
        print(i)
        break
    else:
        print('$')
Input: suman is a si boy.
Output: is ( since reverse of 'is' is present in the same string)

You almost have it; you just need to add another loop to compare each word against each inverted word. Try the following:
str = input()
x = str.split()
for i in x:
    str1 = i[::-1]
    for j in x:  # <-- this is the new nested loop you are missing
        if j == str1:  # compare each inverted word against each regular word
            if len(str1) > 1:  # Potential condition if you would like to not include single letter words
                print(i)
Update
To only print the first occurrence of a match, you could, in the second loop, only check the elements that come after. We can do this by keeping track of the index:
str = input()
x = str.split()
for index, i in enumerate(x):
    str1 = i[::-1]
    for j in x[index+1:]:  # <-- only consider words that are ahead
        if j == str1:
            if len(str1) > 1:
                print(i)
Note that I used index+1 in order to not consider single word palindromes a match.

a = 'suman is a si boy'
# Construct the list of words
words = a.split(' ')
# Construct the list of reversed words
reversed_words = [word[::-1] for word in words]
# Get an intersection of these lists converted to sets
print(set(words) & set(reversed_words))
will print (the order of elements in a set may vary):
{'si', 'is', 'a'}

Another way to do this is just in a list comprehension:
string = 'suman is a si boy'
output = [x for x in string.split() if x[::-1] in string.split()]
print(output)
The split on string creates a list split on spaces. Then the word is included only if the reverse is in the string.
Output is:
['is', 'a', 'si']
One note: you have a variable named str. Best not to do that, as str is the name of Python's built-in string type, and shadowing it could cause other issues in your code later on.
If you only want words more than one letter long, then you can do:
string = 'suman is a si boy'
output = [x for x in string.split() if x[::-1] in string.split() and len(x) > 1]
print(output)
this gives:
['is', 'si']
Final Answer...
And for the final thought, in order to get just the 'is':
string = 'suman is a si boy'
seen = []
output = [x for x in string.split() if x[::-1] not in seen and not seen.append(x) and x[::-1] in string.split() and len(x) > 1]
print(output)
output is:
['is']
BUT, this is not necessarily a good way to do it, I don't believe. Basically you are storing information in seen during the list comprehension AND referencing that same list. :)

This answer won't report 'a', and it outputs 'is' only once rather than both 'is' and 'si'.
str = input()  # get input string
x = str.split()  # returns list of words
y = []  # list of matching words
while len(x) > 0:
    a = x.pop(0)  # removes first item from list and returns it, then assigns it to a
    if a[::-1] in x:  # checks if the reversed word is in the remaining list of words
        # the popped word is no longer in the list, so a single 'a' cannot match itself,
        # and the 'is'/'si' pair is reported only once
        y.append(a)
print(y)  # ['is']
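None of the snippets above prints the $ fallback the question asks for. A minimal sketch that adds it, assuming you want only the first match and want to ignore single-letter words (first_reversed_word is just an illustrative name):

```python
def first_reversed_word(sentence):
    """Return the first word whose reverse appears later in the sentence, else '$'."""
    seen = set()  # words encountered so far
    for word in sentence.split():
        # a match means an earlier word is the reverse of the current one
        if len(word) > 1 and word[::-1] in seen:
            return word[::-1]
        seen.add(word)
    return "$"

print(first_reversed_word("suman is a si boy"))  # is
print(first_reversed_word("suman is a boy"))     # $
```

Using a set keeps each lookup cheap, so the sentence is scanned only once. Punctuation is not stripped here, so a trailing "boy." is treated as the word 'boy.'.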

Python looping through lists

I have a list called:
word_list_pet_image = [['beagle', '01125.jpg'], ['saint', 'bernard', '08010.jpg']]
There is more data in this list but I kept it short. I am trying to iterate through this list and check whether each word contains only alphabetical characters; if it does, I want to append the word to a new list called
pet_labels = []
So far I have:
word_list_pet_image = []
for word in low_pet_image:
    word_list_pet_image.append(word.split("_"))
for word in word_list_pet_image:
    if word.isalpha():
        pet_labels.append(word)
print(pet_labels)
For example, I am trying to put the word beagle into the list pet_labels, but skip 01125.jpg. See below.
pet_labels = ['beagles', 'Saint Bernard']
I am getting an AttributeError:
AttributeError: 'list' object has no attribute 'isalpha'
I am sure it has to do with me not iterating through the list properly.

It looks like you are trying to join alphabetical words in each sublist. A list comprehension would be effective here.
word_list = [['beagle', '01125.jpg'], ['saint', 'bernard', '08010.jpg']]
pet_labels = [' '.join(w for w in l if w.isalpha()) for l in word_list]
>>> ['beagle', 'saint bernard']

You have a list of lists, so the brute-force method would be to nest loops, like:
for pair in word_list_pet_image:
    for word in pair:
        if word.isalpha():
            pet_labels.append(word)  # append to list
Another option might be a single for loop that indexes into each sublist:
for words in word_list_pet_image:
    if words[0].isalpha():
        pet_labels.append(words[0])  # append to list (first word of each sublist only)

word_list = [['beagle', '01125.jpg'], ['saint', 'bernard', '08010.jpg']]
Why not a list comprehension (this works only if the non-alphabetic element is always last):
pet_labels = [' '.join(l[:-1]) for l in word_list]

word_list_pet_image.append(word.split("_"))
.split() returns lists, so word_list_pet_image itself contains lists, not plain words.
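Putting this together, here is a minimal sketch that builds the labels straight from the filenames. The contents of low_pet_image are not shown in the question, so the two filenames below are an assumption based on the sample data:

```python
# Assumed input: lowercase filenames with words separated by underscores
low_pet_image = ["beagle_01125.jpg", "saint_bernard_08010.jpg"]

pet_labels = []
for filename in low_pet_image:
    # split() gives a list of parts; keep only the purely alphabetic ones
    words = [part for part in filename.split("_") if part.isalpha()]
    pet_labels.append(" ".join(words))

print(pet_labels)  # ['beagle', 'saint bernard']
```

The isalpha() call is applied to each string inside the sublist, not to the sublist itself, which is what avoids the AttributeError.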

searching a list of strings for integers

Given the following list of strings:
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n', ... ]
how can I split each entry into a list of tuples:
[ (123, 321), (223, 32221), (19823, 328771), ... ]
In my other poor attempt, I managed to extract the numbers, but I ran into a problem: the element placeholder also contains a number, which this method includes. It also produces a list rather than tuples.
numbers = list()
for s in my_list:
    for x in s:
        if x.isdigit():
            numbers.append((x))
numbers

We can first build a regex that identifies positive integers:
from re import compile
INTEGER_REGEX = compile(r'\b\d+\b')
Here \d stands for digit (so 0, 1, etc.), + for one or more, and \b are word boundaries.
We can then use INTEGER_REGEX.findall(some_string) to identify all positive integers from the input. Now the only thing left to do is iterate through the elements of the list, and convert the output of INTEGER_REGEX.findall(..) to a tuple. We can do this with:
output = [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
For your given sample data, this will produce:
>>> [tuple(INTEGER_REGEX.findall(l)) for l in my_list]
[('123', '321'), ('223', '32221'), ('19823', '328771')]
Note that digits that are not separate words will not be matched. For instance the 8 in 'see you l8er' will not be matched, since it is not a word.
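The tuples above still hold strings, while the question asks for integers. One way to get there, sketched with the three sample entries from the question, is to convert each match with int:

```python
import re

INTEGER_REGEX = re.compile(r'\b\d+\b')
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']

# findall returns the matched digit strings; int() converts each one
output = [tuple(int(n) for n in INTEGER_REGEX.findall(line)) for line in my_list]
print(output)  # [(123, 321), (223, 32221), (19823, 328771)]
```

The digit at the end of 'element0' is skipped because there is no word boundary between 't' and '0'.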

Your attempt iterates over each character of the string. You have to split the string on whitespace instead, a task that str.split does flawlessly.
Also, numbers.append((x)) is the same as numbers.append(x). For a tuple of 1 element, add a comma before the closing parenthesis, although that doesn't solve the problem either.
Now, each entry seems to contain an id (to be skipped) followed by 2 integers as strings, so why not split, drop the first token, and convert the rest to a tuple of integers?
my_list = ['element0 123 321\n', 'element1 223 32221\n', 'element2 19823 328771\n']
result = [tuple(map(int,x.split()[1:])) for x in my_list]
print(result)
gives:
[(123, 321), (223, 32221), (19823, 328771)]

Find the occurrence of: any one of the substrings (whichever first) stored in a list; in a bigger string in Python

I'm new to Python. I've gone through other answers.. I can say with some assurance that this may not be a duplicate.
Basically; let us say for example I want to find the occurrence of one of the substrings (stored in a list); and if found? I want it to stop searching for the other substrings of the list!
To illustrate more clearly;
a = ['This', 'containing', 'many']
string1 = "This is a string containing many words"
If you ask yourself, what is the first word in the bigger string string1 that matches with the words in the list a? The answer will be This, because the first word in the bigger string string1 that has a match with list of substrings a is This
a = ['This', 'containing', 'many']
string1 = "kappa pride pogchamp containing this string this many words"
Now, I've changed string1 a bit. If you ask yourself, what is the first word in the bigger string string1 that matches with the words in the list a? The answer will be containing, because the word containing is the first word that appears in the bigger string string1 that also has a match in the list of substrings a.
and if such a match is found? I want it to stop searching for any more matches!
I tried this:
string1 = "This is a string containing many words"
a = ['This', 'containing', 'many']
if any(x in string1 for x in a):
    print(a)
else:
    print("Nothing found")
The above code, prints the entire list of substrings. In other words, it checks for the occurrence of ANY and ALL of the substrings in the list a, and if found; it prints the entire list of substrings.
I've also tried looking up String find() method but I can't seem to understand how to exactly use it in my case
I'm looking for;
to word it EXACTLY: The first WORD in the bigger string that matches any of the list of words in the substring and print that word.
or
to find WHICHEVER SUBSTRING (stored in a list of SUBSTRINGS) appears first in a BIGGER STRING and PRINT that particular SUBSTRING.

You could use a set membership check + next here.
>>> a = {'This', 'containing', 'many'}
>>> next((v for v in string1.split() if v in a), 'Nothing Found!')
'This'
This runs in at most O(N) time in the length of the string: set membership tests are constant time on average, and next stops at the first match (although string1.split() still splits the whole string up front).

I think this can be done without splitting string1, by checking each element of the list against it instead. Use break to stop at the first match. Note that this walks a in list order, so it reports the first list element found in string1, which is not necessarily the one that appears earliest in the string.
string1 = "This is a string containing many words"
a = ['This', 'containing', 'many']
for x in a:
    if x in string1:
        print(x)
        break
else:
    print("Nothing found")
Or with a list comprehension:
l = [x for x in a if x in string1]
if l:
    print(l[0])
else:
    print("Nothing found")

You can use re here.
import re
a = ['This', 'containing', 'many']
string1 = "kappa pride pogchamp containing this string this many words"
print(re.search(r"\b(?:" + "|".join(a) + r")\b", string1).group())
Output:
containing
Timing the set-based approach against the regex:
import timeit
s="""
a = ['This', 'containing', 'many']
a=set(a)
string1 = 'is a string containing many words This '
c=next((v for v in string1.split() if v in a), 'Nothing Found!')
"""
s1="""
a = ['This', 'containing', 'many']
string1 = "is a string containing many words This "
re.search(r"\b(?:"+"|".join(a)+r")\b", string1)
"""
print(timeit.timeit(stmt=s, number=1000000))
print(timeit.timeit(stmt=s1, number=1000000, setup="import re"))

There are two ways you could approach this. One is using the
string.find('substring')
method, which returns the index of the first occurrence of 'substring' in the string, or -1 if there is no occurrence. By iterating over the list of search terms a, you would get a collection of indices, one for each word in your list. The smallest value that is not -1 would be the index of your first word. This is more complex, but it does not require an explicit loop over the string itself.
Another alternative would be to use
string1.split(' ')
to create a list of all of the words in the string. Then you could go through this list with a for loop and check whether each word corresponds to any of the search terms. This would be a great learning opportunity to try on your own, but let me know if I was too vague or if code would be more helpful.
Hope this helps!
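In case code is more helpful, here is a rough sketch of the first approach. The function name earliest_substring is made up for illustration. It calls str.find once per search term and keeps the term with the smallest index that is not -1:

```python
def earliest_substring(haystack, needles):
    """Return the needle that starts earliest in haystack, or None if none occur."""
    best_index, best = None, None
    for needle in needles:
        idx = haystack.find(needle)  # -1 when the needle is absent
        if idx != -1 and (best_index is None or idx < best_index):
            best_index, best = idx, needle
    return best

a = ['This', 'containing', 'many']
string1 = "kappa pride pogchamp containing this string this many words"
print(earliest_substring(string1, a) or "Nothing found")  # containing
```

The match is case-sensitive, so 'This' does not match the lowercase 'this' in string1. Since find works on substrings rather than whole words, a term such as 'is' would also match inside 'This'.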

a = ['This', 'containing', 'many']
string1 = "kappa pride pogchamp containing this string this many words"
Break is the better option, but that solution is already posted, so I wanted to show that you can do it with a slice too:
print("".join([item for item in string1.split() if item in a][:1]))
The list comprehension above is the same as:
new = []
for item in string1.split():
    if item in a:
        new.append(item)
print("".join(new[:1]))
