Words in a list with consecutively repeated letters - python

I have a list such as
data = ['dog','cat','a','aa','aac','bbb','bcca','ffffff']
I want to remove the words that contain consecutively repeated letters, that is, the words
'aa','aac','bbb','bcca','ffffff'
Maybe with import re?

Thanks to this thread: Regex to determine if string is a single repeating character
Here is the re version, but I would stick to PM2 ring and Tameem's solutions if the task was as simple as this:
import re
data = ['dog','cat','a','aa','aac','bbb','bcca','ffffff']
[i for i in data if not re.search(r'^(.)\1+$', i)]
Output
['dog', 'cat', 'a', 'aac', 'bcca']
And a version that drops every word containing a run of repeated characters anywhere:
import re
data = ['dog','cat','a','aa','aac','bbb','bcca','ffffff']
[i for i in data if not re.search(r'((\w)\2{1,})', i)]
Output
['dog', 'cat', 'a']
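To see why the two patterns give different results, here is a minimal sketch that runs both against the same data and lists what each one matches. The names `ONLY_ONE_CHAR` and `ANY_RUN` are illustrative, and `(\w)\1` is a simplified form of the second pattern above, not the exact expression from the answer.

```python
import re

data = ['dog', 'cat', 'a', 'aa', 'aac', 'bbb', 'bcca', 'ffffff']

# The whole word is one character repeated two or more times.
ONLY_ONE_CHAR = re.compile(r'^(.)\1+$')
# The word contains two identical word characters next to each other.
ANY_RUN = re.compile(r'(\w)\1')

whole_word_repeats = [w for w in data if ONLY_ONE_CHAR.search(w)]
contains_a_run = [w for w in data if ANY_RUN.search(w)]

print(whole_word_repeats)  # ['aa', 'bbb', 'ffffff']
print(contains_a_run)      # ['aa', 'aac', 'bbb', 'bcca', 'ffffff']
```

The anchors `^` and `$` are what restrict the first pattern to words made of a single repeated character.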

A loop is the way to go. Forget about sets here: they cannot tell consecutive repeats from non-consecutive ones (for example, 'dogo').
Here is a function you can use to determine whether a word is valid in a single loop:
def is_valid(word):
    last_char = None
    for i in word:
        if i == last_char:
            return False
        last_char = i
    return True
Example
In [28]: is_valid('dogo')
Out[28]: True
In [29]: is_valid('doo')
Out[29]: False
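A short sketch of how this check could be applied to the list from the question. The function is repeated so the snippet runs on its own, and the name `filtered` is only illustrative.

```python
def is_valid(word):
    """Return True if no two adjacent characters in word are equal."""
    last_char = None
    for ch in word:
        if ch == last_char:
            return False
        last_char = ch
    return True

data = ['dog', 'cat', 'a', 'aa', 'aac', 'bbb', 'bcca', 'ffffff']
filtered = [word for word in data if is_valid(word)]
print(filtered)  # ['dog', 'cat', 'a']
```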

The original version of this question wanted to drop words that consist entirely of repetitions of a single character. An efficient way to do this is to use sets. We convert each word to a set, and if it consists of only a single character the length of that set will be 1. If that's the case, we can drop that word, unless the original word consisted of a single character.
data = ['dog','cat','a','aa','aac','bbb','bcca','ffffff']
newdata = [s for s in data if len(s) == 1 or len(set(s)) != 1]
print(newdata)
output
['dog', 'cat', 'a', 'aac', 'bcca']
Here's code for the new version of your question, where you want to drop words that contain any repeated characters. This one's simpler, because we don't need to make a special test for one-character words.
data = ['dog','cat','a','aa','aac','bbb','bcca','ffffff']
newdata = [s for s in data if len(set(s)) == len(s)]
print(newdata)
output
['dog', 'cat', 'a']
If the repetitions have to be consecutive, we can handle that using groupby.
from itertools import groupby
data = ['dog','cat','a','aa','aac','bbb','bcca','ffffff', 'abab', 'wow']
newdata = [s for s in data if max(len(list(g)) for _, g in groupby(s)) == 1]
print(newdata)
output
['dog', 'cat', 'a', 'abab', 'wow']
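To make the groupby step easier to follow, here is a small sketch that shows the runs groupby produces for a word. The helper `run_lengths` is hypothetical and only meant for illustration.

```python
from itertools import groupby

def run_lengths(word):
    """Return (character, run length) pairs for consecutive runs in word."""
    return [(char, len(list(group))) for char, group in groupby(word)]

print(run_lengths('bcca'))  # [('b', 1), ('c', 2), ('a', 1)]
print(run_lengths('abab'))  # [('a', 1), ('b', 1), ('a', 1), ('b', 1)]
```

A word passes the filter when every run has length 1. Note that the max(...) expression above raises ValueError for an empty string, because there are no groups to take the maximum of.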

Here's a way to check if there are consecutive repeated characters:
def has_consecutive_repeated_letters(word):
    return any(c1 == c2 for c1, c2 in zip(word, word[1:]))
You can then use a list comprehension to filter your list:
words = ['dog','cat','a','aa','aac','bbb','bcca','ffffff', 'abab', 'wow']
[word for word in words if not has_consecutive_repeated_letters(word)]
# ['dog', 'cat', 'a', 'abab', 'wow']

One line is all it takes :)
data = ['dog','cat','a','aa','aac','bbb','bcca','ffffff']
data = [value for value in data if len(set(value)) != 1 or len(value) == 1]
print(data)
Output
['dog', 'cat', 'a', 'aac', 'bcca']

Related

Stemmer function that takes a string and returns the stems of each word in a list

I am trying to create this function which takes a string as input and returns a list containing the stem of each word in the string. The problem is, that using a nested for loop, the words in the string are appended multiple times in the list. Is there a way to avoid this?
def stemmer(text):
    stemmed_string = []
    res = text.split()
    suffixes = ('ed', 'ly', 'ing')
    for word in res:
        for i in range(len(suffixes)):
            if word.endswith(suffixes[i]):
                stemmed_string.append(word[:-len(suffixes[i])])
            elif len(word) > 8:
                stemmed_string.append(word[:8])
            else:
                stemmed_string.append(word)
    return stemmed_string
If I call the function on this text ('I have a dog that is barking'), this is the output:
['I',
'I',
'I',
'have',
'have',
'have',
'a',
'a',
'a',
'dog',
'dog',
'dog',
'that',
'that',
'that',
'is',
'is',
'is',
'barking',
'barking',
'bark']
You are appending something in each round of the loop over suffixes. To avoid the problem, don't do that.
It's not clear if you want to add the shortest possible string out of a set of candidates, or how to handle stacked suffixes. Here's a version which always strips as much as possible.
def stemmer(text):
    stemmed_string = []
    suffixes = ('ed', 'ly', 'ing')
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
        stemmed_string.append(word)
    return stemmed_string
Notice the fixed syntax for looping over a list, too.
This will reduce "sparingly" to "spar", etc.
Like every naïve stemmer, this will also do stupid things with words like "sly" and "thing".
Demo: https://ideone.com/a7FqBp
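If the intent was instead to strip at most one suffix per word and keep the 8-character truncation from the question, a minimal sketch could look like this. The function name `stem_once` and the `max_len` parameter are assumptions made for illustration, not part of the original code.

```python
def stem_once(text, suffixes=('ed', 'ly', 'ing'), max_len=8):
    """Strip at most one suffix per word; otherwise truncate long words."""
    stems = []
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
                break  # stop after the first matching suffix
        else:
            # No suffix matched: fall back to truncation, as in the question.
            word = word[:max_len]
        stems.append(word)
    return stems

print(stem_once('I have a dog that is barking'))
# ['I', 'have', 'a', 'dog', 'that', 'is', 'bark']
```

The for/else construct runs the else branch only when the loop finishes without hitting break, so each word is appended exactly once.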

strip all strings in list of specific character

I have been looking for an answer to this for a while but keep finding answers about stripping a specific string from a list.
Let's say this is my list of strings
stringList = ["cat\n","dog\n","bird\n","rat\n","snake\n"]
But all list items contain a new line character (\n)
How can I remove this from all the strings within the list?
Use a list comprehension with rstrip():
stringList = ["cat\n","dog\n","bird\n","rat\n","snake\n"]
output = [x.rstrip() for x in stringList]
print(output) # ['cat', 'dog', 'bird', 'rat', 'snake']
If you really want to target a single newline character only at the end of each string, then we can get more precise with re.sub:
import re
stringList = ["cat\n","dog\n","bird\n","rat\n","snake\n"]
output = [re.sub(r'\n$', '', x) for x in stringList]
print(output) # ['cat', 'dog', 'bird', 'rat', 'snake']
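The difference between these options only shows up when a string has other trailing whitespace or more than one newline. The sketch below compares them on made-up sample data. The helper `strip_one_newline` is hypothetical and uses slicing rather than the regex from the answer.

```python
samples = ["cat\n", "dog  \n", "bird\n\n", "rat"]

def strip_one_newline(s):
    """Remove exactly one trailing newline, if present (hypothetical helper)."""
    return s[:-1] if s.endswith("\n") else s

# rstrip() with no argument removes all trailing whitespace.
all_whitespace = [s.rstrip() for s in samples]
# rstrip("\n") removes every trailing newline but keeps other whitespace.
newlines_only = [s.rstrip("\n") for s in samples]
# The helper removes a single newline and nothing else.
single_newline = [strip_one_newline(s) for s in samples]

print(all_whitespace)   # ['cat', 'dog', 'bird', 'rat']
print(newlines_only)    # ['cat', 'dog  ', 'bird', 'rat']
print(single_newline)   # ['cat', 'dog  ', 'bird\n', 'rat']
```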
Alternatively, apply the method strip (or rstrip) to every item of the list with map:
out = list(map(str.strip, stringList))
print(out)
or use a more rudimentary check and slice:
strip_char = '\n'
out = [s[:-len(strip_char)] if s.endswith(strip_char) else s for s in stringList]
print(out)
Since you can use an if to check whether a string contains a newline character, you can loop over the list, replace the newline characters with empty strings where they occur, and keep the other strings unchanged:
stringList = ["cat\n","dog\n","bird\n","rat\n","snake\n"]
nlist = []
for string in stringList:
    if "\n" in string:
        nlist.append(string.replace("\n", ""))
    else:
        # Keep strings that have no newline instead of dropping them
        nlist.append(string)
print(nlist)
You could also use map() along with str.rstrip:
>>> string_list = ['cat\n', 'dog\n', 'bird\n', 'rat\n', 'snake\n']
>>> new_string_list = list(map(str.rstrip, string_list))
>>> new_string_list
['cat', 'dog', 'bird', 'rat', 'snake']

Count the number of occurrences different final word types succeeding the string

Suppose I have a list of lists like this
l = [['a', 'paragraph', 'is'],
['a', 'paragraph', 'can'],
['a', 'dog', 'barks']]
Also suppose I have this smaller list a = ['a', 'paragraph']. I want to count the number of distinct final words that follow this prefix. The answer in this case should be 2, since 'is' and 'can' follow 'a paragraph'.
I was trying to do something like this
l.count(a)
but that did not work and gives me 0.
I'll try to spell this idea out more clearly: we have the prefix 'a paragraph', and two different words follow it, namely 'is' and 'can'. Since there are 2 unique cases, the answer is 2.
Make a set of all the desired words:
myset = {item[2] for item in l if item[:2] == ['a', 'paragraph']}
Then use len() of the set.
You want to match each item in the list against the smaller list. We start by subsetting the larger list to match the size of the smaller list. If it doesn't match, we continue, ignoring that item. If it does match, we add the next item to a set. The set is important because it handles uniqueness.
items = [
['a', 'paragraph', 'is'],
['a', 'paragraph', 'can'],
['a', 'dog', 'barks']
]
check = ['a', 'paragraph']
check_len = len(check)
unique_words = set()
for item in items:
    if item[:check_len] != check:
        continue
    unique_words.add(item[-1])
print(len(unique_words))
2
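The same idea can be wrapped in a reusable function. This is a minimal sketch: the name `count_distinct_followers` is hypothetical, and it assumes that "final word" means the item immediately after the prefix. The sample data adds one duplicate row to show that repeats are counted once.

```python
def count_distinct_followers(rows, prefix):
    """Count distinct items that appear immediately after prefix in rows."""
    n = len(prefix)
    followers = {
        row[n]
        for row in rows
        if len(row) > n and row[:n] == prefix  # skip rows too short to have a follower
    }
    return len(followers)

rows = [['a', 'paragraph', 'is'],
        ['a', 'paragraph', 'can'],
        ['a', 'paragraph', 'is'],
        ['a', 'dog', 'barks']]
print(count_distinct_followers(rows, ['a', 'paragraph']))  # 2
```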

Want to reverse my String location and special character should be there as it is

I have a string as input, like
input = 'apple&&bat&&&cat&&dog&elephant'
and I want to reverse the order of the words while the special characters stay in place.
Output - 'elephant&&dog&&&cat&&bat&apple'
I don't know which approach to take to solve this problem.
Here is what I have tried. It gives me the words in reverse order, but I don't know how to put the '&' characters back in their respective positions.
input = 'apple&&bat&&&cat&&dog&elephant'
ab = input.split('&')[::-1]
print(ab)
output
['elephant', 'dog', '', 'cat', '', '', 'bat', '', 'apple']
But my output should be
'elephant&&dog&&&cat&&bat&apple'
First get separate lists of the words and special marks using re module:
In [2]: import re
In [4]: words = re.findall(r'\w+', input)
In [6]: words
Out[6]: ['apple', 'bat', 'cat', 'dog', 'elephant']
In [7]: special = re.findall(r'\W+', input)
In [8]: special
Out[8]: ['&&', '&&&', '&&', '&']
Then reverse the words list:
In [11]: rwords = words[::-1]
In [12]: rwords
Out[12]: ['elephant', 'dog', 'cat', 'bat', 'apple']
Finally, combine each word with the corresponding mark. Please note that I expand the special list by one empty string to make the lists the same length. The final operation is one line of code:
In [15]: ''.join(w + s for w, s in zip(rwords, special + ['']))
Out[15]: 'elephant&&dog&&&cat&&bat&apple'
Here is one solution to the problem that uses only the basic concepts. It navigates the split list from both the left and the right and swaps each pair of encountered words.
s = 'apple&&bat&&&cat&&dog&elephant'
words = s.split('&')
idx_left = 0
idx_right = len(words) - 1
while idx_left < idx_right:
    # Skip the empty strings produced by consecutive separators,
    # without letting the two indices cross
    while idx_left < idx_right and not words[idx_left]:
        idx_left += 1
    while idx_left < idx_right and not words[idx_right]:
        idx_right -= 1
    words[idx_left], words[idx_right] = words[idx_right], words[idx_left]  # Swap words
    idx_left += 1
    idx_right -= 1
output = '&'.join(words)
The result is
'elephant&&dog&&&cat&&bat&apple'
Another more advanced approach is to use groupby and list slicing:
from itertools import groupby
# Split the input string into the list
# ['apple', '&&', 'bat', '&&&', 'cat', '&&', 'dog', '&', 'elephant']
words = [''.join(g) for _, g in groupby(s, lambda c: c == '&')]
n = len(words)
words[::2] = words[n-n%2::-2] # Swapping (assume the input string does not start with a separator string)
output = ''.join(words)
Another regex solution:
>>> import re
>>> # Extract the "words" from the string.
>>> words = re.findall(r'\w+', s)
>>> words
['apple', 'bat', 'cat', 'dog', 'elephant']
>>> # Replace the words with formatting placeholders ( {} )
>>> # then format the resulting string with the words in
>>> # reverse order
>>> re.sub(r'\w+', '{}', s).format(*reversed(words))
'elephant&&dog&&&cat&&bat&apple'
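One more variant, sketched here under the assumption that the string starts and ends with a word: re.split with a capturing group keeps the separators in the result, so the words can be reversed in place with slice assignment. The function name `reverse_words_keep_separators` is made up for this example.

```python
import re

def reverse_words_keep_separators(s):
    """Reverse the order of words while leaving separator runs in place."""
    # The capturing group makes re.split keep the separators in the result,
    # so words sit at even indices and separators at odd indices.
    tokens = re.split(r'(\W+)', s)
    tokens[::2] = tokens[::2][::-1]
    return ''.join(tokens)

print(reverse_words_keep_separators('apple&&bat&&&cat&&dog&elephant'))
# elephant&&dog&&&cat&&bat&apple
```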

List comprehension output is None [duplicate]

I'm new to Python and wanted to try a list comprehension, but the result I get is a list of None values.
wordlist = ['cat', 'dog', 'rabbit']
letterlist = []
letterlist = [letterlist.append(letter) for word in wordlist for letter in word if letter not in letterlist]
print(letterlist)
# output i get: [None, None, None, None, None, None, None, None, None]
# expected output: ['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']
Why is that? It seems to work somehow, because I get the expected number of items (9), but all of them are None.
list.append(element) doesn’t return anything – it appends an element to the list in-place.
Your code could be rewritten as:
wordlist = ['cat', 'dog', 'rabbit']
letterlist = [letter for word in wordlist for letter in word]
letterlist = list(set(letterlist))
print(letterlist)
… if you really want to use a list comprehension, or:
wordlist = ['cat', 'dog', 'rabbit']
letterset = set()
for word in wordlist:
    letterset.update(word)
print(letterset)
… which is arguably clearer. Both of these assume order doesn’t matter. If it does, you could use OrderedDict:
from collections import OrderedDict
letterlist = list(OrderedDict.fromkeys("".join(wordlist)).keys())
print(letterlist)
list.append returns None. You need to adjust the expression in the list comprehension to return letters.
wordlist = ['cat', 'dog', 'rabbit']
letterset = set()
letterlist = [(letterset.add(letter), letter)[1]
              for word in wordlist
              for letter in word
              if letter not in letterset]
print(letterlist)
If order doesn't matter, do this:
resultlist = list({i for word in wordlist for i in word})
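If order does matter, a plain dict is another option. This is a minimal sketch that assumes Python 3.7 or later, where dict preserves insertion order. It is an alternative to the answers above, not what they originally proposed.

```python
wordlist = ['cat', 'dog', 'rabbit']

# dict keys are unique and, from Python 3.7 on, keep insertion order,
# so the first occurrence of each letter determines its position.
letterlist = list(dict.fromkeys(letter for word in wordlist for letter in word))
print(letterlist)  # ['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']
```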
