Removing words containing digits from a given string - python

I'm trying to write a simple program that removes all words containing digits from a received string.
Here is my current implementation:
import re
def checkio(text):
text = text.replace(",", " ").replace(".", " ") .replace("!", " ").replace("?", " ").lower()
counter = 0
words = text.split()
print words
for each in words:
if bool(re.search(r'\d', each)):
words.remove(each)
print words
checkio("1a4 4ad, d89dfsfaj.")
However, when I execute this program, I get the following output:
['1a4', '4ad', 'd89dfsfaj']
['4ad']
I can't figure out why '4ad' is printed in the second line as it contains digits and should have been removed from the list. Any ideas?

Assuming that your regular expression does what you want, you can do this to avoid removing while iterating.
import re
def checkio(text):
text = re.sub('[,\.\?\!]', ' ', text).lower()
words = [w for w in text.split() if not re.search(r'\d', w)]
print words ## prints [] in this case
Also, note that I simplified your text = text.replace(...) line.
Additionally, if you do not need to reuse your text variable, you can use regex to split it directly.
import re
def checkio(text):
words = [w for w in re.split('[,.?!]', text.lower()) if w and not re.search(r'\d', w)]
print words ## prints [] in this case

If you are testing for alpha numeric strings why not use isalnum() instead of regex ?
In [1695]: x = ['1a4', '4ad', 'd89dfsfaj']
In [1696]: [word for word in x if not word.isalnum()]
Out[1696]: []

This would be possible through using re.sub, re.search and list_comprehension.
>>> import re
>>> def checkio(s):
print([i for i in re.sub(r'[.,!?]', '', s.lower()).split() if not re.search(r'\d', i)])
>>> checkio("1a4 4ad, d89dfsfaj.")
[]
>>> checkio("1a4 ?ad, d89dfsfaj.")
['ad']

So apparently what happens is a concurrent access error. Namely - you are deleting an element while traversing the array.
At the first iteration we have words = ['1a4', '4ad', 'd89dfsfaj']. Since '1a4' has a number, we remove it.
Now, words = ['4ad','d89dfsfaj']. However, at the second iteration, the current word is now 'd89dfsfaj' and we remove it. What happens is that we skip '4ad', because it is now at index 0 and the current pointer for the for cycle is at 1.

Related

Filter a list of strings by a char in same position

I am trying to make a simple function that gets three inputs: a list of words, list of guessed letters and a pattern. The pattern is a word with some letters hidden with an underscore. (for example the word apple and the pattern '_pp_e')
For some context it's a part of the game hangman where you try to guess a word and this function gives a hint.
I want to make this function to return a filtered list of words from the input that does not contain any letters from the list of guessed letters and the filtered words contain the same letters and their position as with the given pattern.
I tried making this work with three loops.
First loop that filters all words by the same length as the pattern.
Second loop that checks for similarity between the pattern and the given word. If the not filtered word does contain the letter but not in the same position I filter it out.
Final loop checks the filtered word that it does not contain any letters from the given guessed list.
I tried making it work with not a lot of success, I would love for help. Also any tips for making the code shorter (without using third party libraries) will be a appreciated very much.
Thanks in advance!
Example: pattern: "d _ _ _ _ a _ _ _ _" guessed word list ['b','c'] and word list contain all the words in english.
output list: ['delegating', 'derogation', 'dishwasher']
this is the code for more context:
def filter_words_list(words, pattern, wrong_guess_lst):
lst_return = []
lst_return_2 = []
lst_return_3 = []
new_word = ''
for i in range(len(words)):
if len(words[i]) == len(pattern):
lst_return.append(words[i])
pattern = list(pattern)
for i in range(len(lst_return)):
count = 0
word_to_check = list(lst_return[i])
for j in range(len(pattern)):
if pattern[j] == word_to_check[j] or (pattern[j] == '_' and
(not (word_to_check[j] in
pattern))):
count += 1
if count == len(pattern):
lst_return_2.append(new_word.join(word_to_check))
for i in range(len(lst_return_2)):
word_to_check = lst_return_2[i]
for j in range(len(wrong_guess_lst)):
if word_to_check.find(wrong_guess_lst[j]) == -1:
lst_return_3.append(word_to_check)
return lst_return_3
The easiest, and likely quite efficient, way to do this would be to translate your pattern into a regular expression, if regular expressions are in your "toolbox". (The re module is in the standard library.)
In a regular expression, . matches any single character. So, we replace all _s with .s and add "^" and "$" to anchor the regular expression to the whole string.
import re
def filter_words(words, pattern, wrong_guesses):
re_pattern = re.compile("^" + re.escape(pattern).replace("_", ".") + "$")
# get words that
# (a) are the correct length
# (b) aren't in the wrong guesses
# (c) match the pattern
return [
word
for word in words
if (
len(word) == len(pattern) and
word not in wrong_guesses and
re_pattern.match(word)
)
]
all_words = [
"cat",
"dog",
"mouse",
"horse",
"cow",
]
print(filter_words(all_words, "c_t", []))
print(filter_words(all_words, "c__", []))
print(filter_words(all_words, "c__", ["cat"]))
prints out
['cat']
['cat', 'cow']
['cow']
If you don't care for using regexps, you can instead translate the pattern to a dict mapping each defined position to the character that should be found there:
def filter_words_without_regex(words, pattern, wrong_guesses):
# get a map of the pattern's defined letters to their positions
letter_map = {i: letter for i, letter in enumerate(pattern) if letter != "_"}
# get words that
# (a) are the correct length
# (b) aren't in the wrong guesses
# (c) have the correct letters in the correct positions
return [
word
for word in words
if (
len(word) == len(pattern) and
word not in wrong_guesses and
all(word[i] == ch for i, ch in letter_map.items())
)
]
The result is the same.
Probably not the most efficient, but this should work:
def filter_words_list(words, pattern, wrong_guess_lst):
fewer_words = [w for w in words if not any([wgl in w for wgl in wrong_guess_lst])]
equal_len_words = [w for w in fewer_words if len(w) == len(pattern)]
pattern_indices = [idl for idl, ltr in enumerate(pattern) if ltr != '_']
word_indices = [[idl for idl, ltr in enumerate(w) if ((ltr in pattern) and (ltr != '_'))] for w in equal_len_words]
out = [w for wid, w in zip(word_indices, equal_len_words) if ((wid == pattern_indices) and (w[pid] == pattern[pid] for pid in pattern_indices))]
return out
The idea is to first remove all words that have letters in your wrong_guess_lst.
Then, remove everything which does not have the same length (you could also merge this condition in the first one..).
Next, for both pattern and your remaining words, you create a pattern mask, which indicates the positions of non '_' letters.
To be a candidate, the masks have to be identical AND the letters in these positions have to be identical as well.
Note, that I replaced a lot of for loops in you code by list comprehension snippets. List comprehension is a very useful construct which helps a lot especially if you don't want to use other libraries.
Edit: I cannot really tell you, where your code went wrong as it was a little too long for me..
The regex rule is explicitely constructed, in particular no check on the word's length is needed. To achieve this the groupby function from the itertools package of the standard library is used:
'_ b _ _ _' -- regex-- > r'^.{1}b.{3}$'
Here how to filter the dictionary by a guess string:
import itertools as it
import re
# sample dictionary
dictionary = "a ability able about above accept according account across act action activity actually add address"
dictionary = dictionary.split()
guess = '_ b _ _ _'
guess = guess.replace(' ', '') # remove white spaces
# construction of the regex rule
regex = r'^'
for _, i in it.groupby(guess, key=lambda x: x == '_'):
if '_' in (l:=list(i)):
regex += ''.join(f'.{{{len(l)}}}') # escape the curly brackets
else:
regex += ''.join(l)
regex += '$'
# processing the regex rule
pattern = re.compile(regex)
# filter the dictionary by the rule
l = [word for word in dictionary if pattern.match(word)]
print(l)
Output
['about', 'above']

How to solve the string indices must be integers problem in a for loop for capitalizing every word in a string

I hope everyone is safe.
I am trying to go over a string and capitalize every first letter of the string.
I know I can use .title() but
a) I want to figure out how to use capitalize or something else in this case - basics, and
b) The strings in the tests, have some words with (') which makes .title() confused and capitalize the letter after the (').
def to_jaden_case(string):
appended_string = ''
word = len(string.split())
for word in string:
new_word = string[word].capitalize()
appended_string +=str(new_word)
return appended_string
The problem is the interpreter gives me "TypeError: string indices must be integers" even tho I have an integer input in 'word'. Any help?
thanks!
You are doing some strange things in the code.
First, you split the string just to count the number of words, but don't store it to manipulate the words after that.
Second, when iterating a string with a for in, what you get are the characters of the string, not the words.
I have made a small snippet to help you do what you desire:
def first_letter_of_word_upper(string, exclusions=["a", "the"]):
words = string.split()
for i, w in enumerate(words):
if w not in exclusions:
words[i] = w[0].upper() + w[1:]
return " ".join(words)
test = first_letter_of_word_upper("miguel angelo santos bicudo")
test2 = first_letter_of_word_upper("doing a bunch of things", ["a", "of"])
print(test)
print(test2)
Notes:
I assigned the value of the string splitting to a variable to use it in the loop
As a bonus, I included a list to allow you exclude words that you don't want to capitalize.
I use the original same array of split words to build the result... and then join based on that array. This a way to do it efficiently.
Also, I show some useful Python tricks... first is enumerate(iterable) that returns tuples (i, j) where i is the positional index, and j is the value at that position. Second, I use w[1:] to get a substring of the current word that starts at character index 1 and goes all the way to the end of the string. Ah, and also the usage of optional parameters in the list of arguments of the function... really useful things to learn! If you didn't know them already. =)
You have a logical error in your code:
You have used word = len(string.split()) which is of no use ,Also there is an issue in the for loop logic.
Try this below :
def to_jaden_case(string):
appended_string = ''
word_list = string.split()
for i in range(len(word_list)):
new_word = word_list[i].capitalize()
appended_string += str(new_word) + " "
return appended_string
from re import findall
def capitalize_words(string):
words = findall(r'\w+[\']*\w+', string)
for word in words:
string = string.replace(word, word.capitalize())
return string
This just grabs all the words in the string, then replaces the words in the original string, the characters inside the [ ] will be included in the word aswell
You are using string index to access another string word is a string you are accessing word using string[word] this causing the error.
def to_jaden_case(string):
appended_string = ''
for word in string.split():
new_word = word.capitalize()
appended_string += new_word
return appended_string
Simple solution using map()
def to_jaden_case(string):
return ' '.join(map(str.capitalize, string.split()))
In for word in string: word will iterate over the characters in string. What you want to do is something like this:
def to_jaden_case(string):
appended_string = ''
splitted_string = string.split()
for word in splitted_string:
new_word = word.capitalize()
appended_string += new_word
return appended_string
The output for to_jaden_case("abc def ghi") is now "AbcDefGhi", this is CammelCase. I suppose you actually want this: "Abc Def Ghi". To achieve that, you must do:
def to_jaden_case(string):
appended_string = ''
splitted_string = string.split()
for word in splitted_string:
new_word = word.capitalize()
appended_string += new_word + " "
return appended_string[:-1] # removes the last space.
Look, in your code word is a character of string, it is not index, therefore you can't use string[word], you can correct this problem by modifying your loop or using word instead of string[word]
So your rectified code will be:
def to_jaden_case(string):
appended_string = ''
for word in range(len(string)):
new_word = string[word].capitalize()
appended_string +=str(new_word)
return appended_string
Here I Changed The Third Line for word in string with for word in len(string), the counterpart give you index of each character and you can use them!
Also I removed the split line, because it's unnecessary and you can do it on for loop like len(string)

compare specific string to a word python

say I have a certain string and a list of strings.
I would like to append to a new list all the words from the list (of strings)
that are exactly like the pattern
for example:
list of strings = ['string1','string2'...]
pattern =__letter__letter_ ('_c__ye_' for instance)
I need to add all strings that are made up of the same letters in the same places as the pattern, and has the same length.
so for instance:
new_list = ['aczxyep','zcisyef'...]
I have tried this:
def pattern_word_equality(words,pattern):
list1 = []
for word in words:
for letter in word:
if letter in pattern:
list1.append(word)
return list1
help will be much appreciated :)
If your pattern is as simple as _c__ye_, then you can look for the characters in the specific positions:
words = ['aczxyep', 'cxxye', 'zcisyef', 'abcdefg']
result1 = list(filter(lambda w: w[1] == 'c' and w[4:6] == 'ye', words))
If your pattern is getting more complex, then you can start using regular expressions:
pat = re.compile("^.c..ye.$")
result2 = list(filter(lambda w: pat.match(w), words))
Output:
print(result1) # ['aczxyep', 'zcisyef']
print(result2) # ['aczxyep', 'zcisyef']
This works:
words = ['aczxyep', 'cxxye', 'zcisyef', 'abcdefg']
pattern = []
for i in range(len(words)):
if (words[i])[1].lower() == 'c' and (words[i])[4:6].lower() == 'ye':
pattern.append(words[i])
print(pattern)
You start by defining the words and pattern lists. Then you loop around for the amount of items in words by using len(words). You then find whether the i item number is follows the pattern by seeing if the second letter is c and the 5th and 6th letters are y and e. If this is true then it appends that word onto pattern and it prints them all out at the end.

How do I use Enumerate to to make each word into a number not each character

So I need to make a program that gets the user to enter a sentence, and then the code turns that sentence into numbers corresponding to it's position in the list, I cam across the command Enumerate here: Python using enumerate inside list comprehension but this gets every character not every word, so this is my code so far, can anyone help me fix this?
list = []
lists = ""
sentence= input("Enter a sentence").lower()
print(sentence)
list.append(lists)
print(lists)
for i,j in enumerate(sentence):
print (i,j)
Your sentence is string, so it is split to single chars. You should split it to words first:
for i,j in enumerate(sentence.split(' ')):
You can also try this:
>>> sentence = 'I like Moive'
>>> sentence = sentence.lower()
>>> sentence = sentence.split()
>>> for i, j in enumerate(sentence):
... print(i, j)

How do I remove 1 instance of x characters in a string and find the word it makes in Python3?

This is what I have so far, but I'm stuck. I'm using nltk for the word list and trying to find all the words with the letters in "sand". From this list I want to find all the words I can make from the remaining letters.
import nltk.corpus.words.words()
pwordlist = []
for w in wordlist:
if 's' in w:
if 'a' in w:
if 'n' in w:
if 'd' in w:
pwordlist.append(w)
In this case I have to use all the letters to find the words possible.
I think this will work for finding the possible words with the remaining letters, but I can't figure out how to remove only 1 instance of the letters in 'sand'.
puzzle_letters = nltk.FreqDist(x)
[w for w in pwordlist if len(w) = len(pwordlist) and nltk.FreqDist(w) = puzzle_letters]
I would separate the logic into four sections:
A function contains(word, letters), which we'll use to detect whether a word contains "sand"
A function subtract(word, letters), which we'll use to remove "sand" from the word.
A function get_anagrams(word), which finds all of the anagrams of a word.
The main algorithm that combines all of the above to find words that are anagrams of other words once you remove "sand".
 
from collections import Counter
words = ??? #todo: somehow get a list of every English word.
def contains(word, letters):
return not Counter(letters) - Counter(word)
def subtract(word, letters):
remaining = Counter(word) - Counter(letters)
return "".join(remaining.elements())
anagrams = {}
for word in words:
base = "".join(sorted(word))
anagrams.setdefault(base, []).append(word)
def get_anagrams(word):
return anagrams.get("".join(sorted(word)), [])
for word in words:
if contains(word, "sand"):
reduced_word = subtract(word, "sand")
matches = get_anagrams(reduced_word)
if matches:
print word, matches
Running the above code on the Words With Friends dictionary, I get a lot of results, including:
...
cowhands ['chow']
credentials ['reticle', 'tiercel']
cyanids ['icy']
daftness ['efts', 'fest', 'fets']
dahoons ['oho', 'ooh']
daikons ['koi']
daintiness ['seniti']
daintinesses ['sienites']
dalapons ['opal']
dalesman ['alme', 'lame', 'male', 'meal']
...
Program:
from nltk.corpus import words
from collections import defaultdict
def norm(word):
return ''.join(sorted(word))
completers = defaultdict(list)
for word in words.words():
completers[norm(word + 'sand')].append(word)
for word in words.words():
comps = completers[norm(word)]
if comps:
print(word, comps)
Output:
...
admirableness ['miserable']
adnascent ['enact']
adroitness ['sorite', 'sortie', 'triose']
adscendent ['cedent', 'decent']
adsorption ['portio']
adventuress ['vesture']
adversant ['avert', 'tarve', 'taver', 'trave']
...
Let's answer your question instead of spoiling the fun by doing the whole exercise for you: To remove just one instance of the letter, specify a replacement and give a limit to how many times it should apply:
>>> "Frodo".replace("o", "", 1)
'Frdo'
Or if you need to apply a regexp just once (though in this case you don't need a regexp):
>>> import re
>>> re.sub(r"[od]", "", "Frodo", 1)
'Frdo'
Now if you have a string whose letters (s, a, n, d) you want to remove from a word word, you can simply loop over the string:
>>> for letter in "sand":
word = word.replace(letter, "", word)
I'll leave it to you to embed this in a loop that goes over all words in your wordlist, and to utilize the remaining letters.

Categories

Resources