Adding punctuation to a list? - Python

I have a small problem with punctuation.
My assignment was to check whether there were any duplicated words in a text; if there were, my job was to highlight them using .upper().
Example text: I like apples, apples is the best thing I know.
So I took the original text, stripped it of punctuation, transformed all words to lowercase, and then split it into a list.
With a for-loop I compared every word in the list against the others and found all the duplicated words; all of this was placed in a new list.
Example (after using the for-loop): i like apples APPLES is the best thing I know
So the new list is now similar to the original text, with one major exception: it is lacking the punctuation.
Is there a way to add the punctuation to the new list where it is "supposed to be" (taken from its position in the old list)?
Is there some kind of method built into Python that can do this, or do I have to compare the two lists with another for-loop and then add the punctuation to the new list?
NewList = []  # creates an empty list
for word in text:
    if word not in NewList:
        NewList.append(word)
    else:  # the word has been seen before, so highlight it
        NewList.append(word.upper())
List2 = ' '.join(NewList)
The code above works for longer texts, and it's the code I have been using for highlighting duplicated words.
The only problem is that the punctuation doesn't exist in the new text; that's the only problem I have.

Here's an example using the sub function with a callback, from the built-in re module.
This solution preserves all the punctuation.
import re

txt = "I like,, ,apples, apples! is the .best. thing *I* know!!1"

def repl(match, stack):
    word = match.group(0)
    word_upper = word.upper()
    if word_upper in stack:
        return word_upper
    stack.add(word_upper)
    return word

def highlight(s):
    stack = set()
    return re.sub(r'\b([a-zA-Z]+)\b', lambda match: repl(match, stack), s)

print(txt)
print(highlight(txt))


Python "not in" not correctly identifying strings which do not contain a particular letter [duplicate]

This question already has answers here:
How to remove items from a list while iterating?
(25 answers)
Closed last year.
I am attempting to write a little program to play the popular game, Wordle.
One of my functions attempts to remove words from a word bank that do not contain a particular letter. The function does a decent job, but I have discovered that it occasionally fails to identify words that should be removed from the list. If I call the function repeatedly, it eventually eliminates all the correct words.
# Creates a list of 4266 five-letter words from a text file
five_letters = open("five.txt", "r")
five = five_letters.read()
five = five.split()

# Iterates across the list and removes words that do not contain a particular letter
def eliminate_words_not_containing(list, letter):
    for word in list:
        if letter not in word:
            list.remove(word)
    return list

# Call the function 10 times and print the length of the list after each call
print(len(five))
for i in range(10):
    five = eliminate_words_not_containing(five, "e")
    print(len(five))
The output is:
4266
2932
2319
2070
2014
2010
2010
2010
2010
2010
2010
How can I write the function so that it catches all the words that do not contain a particular letter the first time through?
Silly question: is it possible for the program to be running too quickly, so that it skips over words?
You are iterating over the original list, so when you remove a word the positions aren't updated; try making a copy of the list and iterating over that.
Edit: I'd use a list comprehension if it were me, so:
list = [word for word in list if letter not in word]
Because you are modifying your list while you are iterating over it, you are skipping words.
If your list is ['word1', 'word2', 'word3'] and you call your eliminate function, you will be left with ['word2'] because you removed 'word1', then your for loop moved on to the 2nd index, which is now 'word3' from the original list (with 'word2' being the first word in the list after the removal of 'word1').
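The skipping behavior described above can be reproduced with a tiny example (the list contents are made up for illustration; here every element gets removed, yet one survives):

```python
# Removing items from a list while iterating over it skips elements:
# after each removal, the loop index no longer lines up with the list.
words = ['word1', 'word2', 'word3']
for w in words:
    words.remove(w)
print(words)  # → ['word2']
```

The loop removes 'word1' at index 0, then advances to index 1, which is now 'word3'; 'word2' is never visited.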
There are several ways to solve the problem, but one solution would be to remember the indices to remove:
def eliminate_words_not_containing(word_list, letter):
    to_remove = set()
    for i, word in enumerate(word_list):
        if letter not in word:
            to_remove.add(i)
    return [word for i, word in enumerate(word_list) if i not in to_remove]
Probably a plain list comprehension would be better in this case, but the above generalizes in case you want to do something more complicated with each entry.
(Also, it's not good practice to use a variable named list, because it shadows the built-in type list in Python.)

How to avoid .replace replacing a word that was already replaced

Given a string, I have to reverse every word, but keeping them in their places.
I tried:
def backward_string_by_word(text):
    for word in text.split():
        text = text.replace(word, word[::-1])
    return text
But if I have the string Ciao oaiC, when it tries to reverse the second word, that word is identical to the first one after it has already been reversed, so it replaces it again. How can I avoid this?
You can use join in one line, plus a generator expression:
text = "test abc 123"
text_reversed_words = " ".join(word[::-1] for word in text.split())
s.replace(x, y) is not the correct method to use here.
It does two things:
find x in s
replace it with y
But you do not really find anything here, since you already have the word you want to replace. The problem is that replace starts searching for x from the beginning of the string each time, not from the position you are currently at (and it replaces every occurrence, not just one), so it finds the word you have already replaced, not the one you want to replace next.
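The failure from the question can be reproduced directly (the function is copied from the question as-is):

```python
# With "Ciao oaiC": the first pass turns the text into "oaiC oaiC",
# and the second pass then replaces BOTH occurrences of "oaiC".
def backward_string_by_word(text):
    for word in text.split():
        text = text.replace(word, word[::-1])
    return text

print(backward_string_by_word("Ciao oaiC"))  # → Ciao Ciao (expected: oaiC Ciao)
```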
The simplest solution is to collect the reversed words in a list, and then build a new string out of this list by concatenating all reversed words. You can concatenate a list of strings and separate them with spaces by using ' '.join().
def backward_string_by_word(text):
    reversed_words = []
    for word in text.split():
        reversed_words.append(word[::-1])
    return ' '.join(reversed_words)
If you have understood this, you can also write it more concisely by skipping the intermediate list with a generator expression:
def backward_string_by_word(text):
    return ' '.join(word[::-1] for word in text.split())
Splitting a string gives you a list. You can just reassign each element of that list to its reverse. See below:
text = "The cat tac in the hat"

def backwards(text):
    split_word = text.split()
    for i in range(len(split_word)):
        split_word[i] = split_word[i][::-1]
    return ' '.join(split_word)

print(backwards(text))

Find Compound Words in List of Words - Python

I have a simple list of words I need to filter, but each word in the list has an accompanying "score" appended to it which is causing me some trouble. The input list has this structure:
lst = ['FAST;5','BREAK;60','FASTBREAK;40',
'OUTBREAK;110','BREAKFASTBUFFET;35',
'BUFFET;75','FASTBREAKPOINTS;60'
]
I am trying to figure out how to identify words in my list that are compounded solely from other words on the same list. For example, the code applied to lst above would produce:
ans = ['FASTBREAK;40','BREAKFASTBUFFET;35']
I found a prior question that deals with a nearly identical situation, but in that instance there are no trailing scores attached to the words, and I am having trouble dealing with the trailing scores on my list. The ans list must keep the scores with the compound words found. The order of the words in lst is random and irrelevant. Ideally, I would like the ans list to be sorted by the length of the word (the part before the ';'), as shown above. This would save me some additional post-processing on ans.
I have figured out a way that works using ReGex and nested for loops (I will spare you the ugliness of my 1980s-esque brute force code, it's really not pretty), but my word list has close to a million entries, and my solution takes so long as to be completely unusable. I am looking for a solution a little more Pythonic that I can actually use. I'm having trouble working through it.
Here is some code that does the job. I'm sure it's not perfect for your situation (with a million entries), but perhaps can be useful in parts:
#!/usr/bin/env python
from collections import namedtuple

Word = namedtuple("Word", ("characters", "number"))
separator = ";"

lst = [
    "FAST;5",
    "BREAK;60",
    "FASTBREAK;40",
    "OUTBREAK;110",
    "BREAKFASTBUFFET;35",
    "BUFFET;75",
    "FASTBREAKPOINTS;60",
]

words = [Word(*w.rsplit(separator, 1)) for w in lst]

def findparts(oword, parts):
    if len(oword.characters) == 0:
        return parts
    for iword in words:
        if not parts and iword.characters == oword.characters:
            continue
        if iword.characters in oword.characters:
            parts.append(iword)
            characters = oword.characters.replace(iword.characters, "")
            return findparts(Word(characters, oword.number), parts)
    return []

ans = []
for word in words:
    parts = findparts(word, [])
    if parts:
        ans.append(separator.join(word))

print(ans)
It uses a recursive function that takes a word in your list and tries to assemble it with other words from that same list. This function will also present you with the actual atomic words forming the compound one.
It's not very smart, however. Here is an example of a composition it will not detect:
[BREAKFASTBUFFET, BREAK, BREAKFAST, BUFFET].
It uses a small detour using a namedtuple to temporarily separate the actual word from the number attached to it, assuming that the separator will always be ;.
I don't think regular expressions hold an advantage over a simple string search here.
If you know some more conditions about the composition of the compound words, like for instance the maximum number of components, the itertools combinatoric generators might help you to speed things up significantly and avoid missing the example given above too.
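As an alternative sketch (this is not the answer's approach; the names and structure here are my own assumptions), a standard word-break check with memoization decides whether a word can be split entirely into other words from the list, and it also catches compositions like BREAKFASTBUFFET = BREAK + FAST + BUFFET:

```python
# Word-break check with memoization; the WORD;score input format
# follows the question, everything else is an illustrative assumption.
from functools import lru_cache

lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'BUFFET;75', 'FASTBREAKPOINTS;60']

entries = [s.rsplit(';', 1) for s in lst]
vocab = {word for word, _ in entries}

def is_compound(word, vocab):
    # True if `word` is a concatenation of other words from `vocab`.
    parts = vocab - {word}

    @lru_cache(maxsize=None)
    def can_split(s):
        if not s:
            return True
        return any(s.startswith(p) and can_split(s[len(p):]) for p in parts)

    return can_split(word)

# Keep only compound words, sorted by the length of the word part.
ans = sorted((f"{w};{n}" for w, n in entries if is_compound(w, vocab)),
             key=lambda s: len(s.split(';')[0]))
print(ans)  # → ['FASTBREAK;40', 'BREAKFASTBUFFET;35']
```

Each word is checked in time roughly proportional to its length times the vocabulary size, so this stays workable on large lists if the vocabulary lookup is kept fast.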
I think I would do it like this: make a new list containing only the words. In a for loop go through this list, and within it look for the words that are part of the word of the outer loop. If they are found: replace the found part by an empty string. If afterwards the entire word is replaced by an empty string: show the word of the corresponding index of the original list.
EDIT: As was pointed out in the comments, there could be a problem with the code in some situations, like this one: lst = ["BREAKFASTBUFFET;35", "BREAK;60", "BREAKFAST;18", "BUFFET;75"]. In BREAKFASTBUFFET I first found that BREAK was a part of it, so I replaced that one with an empty string, which prevented BREAKFAST from being found. I hope that problem can be tackled by sorting the list descending by length of the word.
EDIT 2
My former edit was not flawless; for instance, if there were a word BREAKFASTEN, it shouldn't be "eaten" by BREAKFAST. This version does the following:
make a list of candidates: all words that are part of the word under investigation
make another list of words that the word starts with
keep track of the words in the candidates list that you've already tried
in a while loop, keep trying until either the start list is empty, or you've successfully replaced the whole word with candidates
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'POINTS;25',
       'BUFFET;75', 'FASTBREAKPOINTS;60', 'BREAKPOINTS;15']

lst2 = [s.split(';')[0] for s in lst]

for i, word in enumerate(lst2):
    # candidates: words that are part of the current word
    candidates = [x for i2, x in enumerate(lst2) if x in word and i != i2]
    if len(candidates) > 0:
        tried = []
        word2 = word
        found = False
        while not found:
            # start: subset of candidates that the current word starts with
            start = [x for x in candidates if word2.startswith(x) and x not in tried]
            for trial in start:
                word2 = word2.replace(trial, '')
                tried.append(trial)
                if len(word2) == 0:
                    print(lst[i])
                    found = True
                    break
            if len(candidates) > 1:
                candidates = candidates[1:]
                word2 = candidates[0]
            else:
                break
There are several ways of speeding up the process but I doubt there is a polynomial solution.
So let's use multiprocessing, and do what we can to generate a meaningful result. The sample below is not identical to what you are asking for, but it does compose a list of apparently compound words from a large dictionary.
For the code below, I am sourcing https://gist.github.com/h3xx/1976236 which lists about 80,000 unique words in order of frequency in English.
The code below can easily be sped up if the input wordlist is sorted alphabetically beforehand, as each head of a compound will be immediately followed by its potential compound members:
black
blackberries
blackberry
blackbird
blackbirds
blackboard
blackguard
blackguards
blackmail
blackness
blacksmith
blacksmiths
As mentioned in the comment, you may also need a semantic filter to identify true compound words: for instance, the word 'generally' isn't a compound of 'gene' and 'rally'! So while you may get a list of contenders, you will need to eliminate false positives somehow.
# python 3.9
import multiprocessing as mp

# returns an ordered list of lowercase words to be used.
def load(name) -> list:
    return [line[:-1].lower() for line in open(name)
            if not line.startswith('#') and len(line) > 3]

# function that identifies the compounds of a word from a list.
# ... can be optimised if using a sorted list.
def compounds_of(word: str, values: list):
    return [w for w in values if w.startswith(word) and w.removeprefix(word) in values]

# apply compound finding across an mp environment,
# but this is the slowest part
def compose(values: list) -> dict:
    with mp.Pool() as pool:
        result = {(word, i): pool.apply(compounds_of, (word, values))
                  for i, word in enumerate(values)}
    return result

if __name__ == '__main__':
    # https://gist.github.com/h3xx/1976236
    words = load('wiki-100k.txt')  # ordered by popularity, 3+ letters, lowercase
    words = list(dict.fromkeys(words))  # deduplicate, preserving order
    # remove those word heads which have less than 3 tails
    compounds = {k: v for k, v in compose(words).items() if len(v) > 3}
    # get the top 500 keys
    rank = list(sorted(compounds.keys(), key=lambda x: x[1]))[:500]
    # compose them into a dict and print
    tops = {k[0]: compounds[k] for k in rank}
    print(tops)

Checking if any word in a string appears in a list using python

I have a pandas dataframe that contains a column of several thousands of comments. I would like to iterate through every row in the column, check to see if the comment contains any word found in a list of words I've created, and if the comment contains a word from my list I want to label it as such in a separate column. This is what I have so far in my code:
import re

retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']

def word_checker(row):
    for sentence in df['comments']:
        if any(word in re.findall(r'\w+', sentence.lower()) for word in retirement_words_list):
            return '401k/Retirement'
        else:
            return 'Other'

df['topic'] = df.apply(word_checker, axis=1)
The code is labeling every single comment in my dataframe as 'Other' even though I have double-checked that many comments contain one or several of the words from my list. Any ideas for how I may correct my code? I'd greatly appreciate your help.
It's probably more convenient to have a set version of retirement_words_list (for efficient inclusion testing) and then loop over the words in the sentence, checking membership in this set, rather than the other way round:
retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
retirement_words_set = set(retirement_words_list)
and then
if any(word in retirement_words_set for word in sentence.lower().split()):
    # .... etc ....
Your code is just checking whether any word in retirement_words_list is a substring of the sentence, but in fact you must be looking for whole-word matches or it wouldn't make sense to include 'matching' and 'retirement' on the list given that 'match' and 'retire' are already included. Hence the use of split -- and the reason why we can then also reverse the logic.
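The difference between substring matching and whole-word matching can be shown with a made-up sentence:

```python
# Substring matching fires on "match" inside "mismatched";
# whole-word matching, via split(), does not.
retirement_words = {'match', 'matching', '401k', 'retirement', 'retire', 'rsu', 'rrsp'}

sentence = "This mismatched comment mentions nothing relevant"

substring_hit = any(w in sentence.lower() for w in retirement_words)
whole_word_hit = any(w in retirement_words for w in sentence.lower().split())

print(substring_hit)   # → True ("match" is a substring of "mismatched")
print(whole_word_hit)  # → False
```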
NOTE: You may need some further changes because your function word_checker has a parameter called row which it does not use. Possibly what you meant to do was something like:
def word_checker(sentence):
    if any(word in retirement_words_list for word in sentence.lower().split()):
        return '401k/Retirement'
    else:
        return 'Other'
and:
df['topic'] = df['comments'].apply(word_checker)
where sentence is the contents of each row from the comments column.
Wouldn't this simplified version (without regex) work?
if any(word in sentence.lower() for word in retirement_words_list):

Python check for word in list

I'm writing a spell checking function and I'm using two text files: one that has misspelled text and a text file with a bunch of words from the dictionary. I have turned the text of misspelled words into a list of strings and the text file with dictionary words into a list of words. Now I need to see if the words in my misspelled list are in my list of dictionary words.
def spellCheck():
    checkFile = input('Enter file name: ')
    inFile = open(checkFile, 'r')

    # This separates my original text file into a list like this:
    # [['It','was','the','besst','of','times,'],
    #  ['it','was','teh','worst','of','times']]
    separate = []
    for line in inFile:
        separate.append(line.split())

    # This opens my list of words from the dictionary and
    # turns it into a list of the words.
    wordFile = open('words.txt', 'r')
    words = wordFile.read()
    wordList = list(words.split())
    wordFile.close()

    # I need this newList to be a list of the correctly spelled words
    # in my separate[] list, and if the word isn't spelled correctly
    # it will go into another if statement...
    newList = []
    for word in separate:
        if word in wordList:
            newList.append(word)
    return newList
Try this:
newList = []
for line in separate:
    for word in line:
        if word in wordList:
            newList.append(word)
return newList
The problem you had was that you were iterating over separate, which is a list of lists. A list can never be an element of wordList, which is why that if-statement always failed. The words you want to iterate over are in the sublists contained in separate, so you can iterate over them with a second for-loop. You can also use for word in itertools.chain.from_iterable(separate).
Hope this helps
First, a word about data structures. Instead of lists, you should use sets, since you (apparently) only want one copy of each word. You can create sets out of your lists:
input_words = set(word for line in separate for word in line) # since it is a list of lists
correct_words = set(word_list)
Then, it is simple as that:
new_list = input_words.intersection(correct_words)
And if you want the incorrect words, you have another one liner:
incorrect = input_words.difference(correct_words)
Note that I used names_with_underscores, instead of CamelCase, as recommended in PEP 8.
Bear in mind, however, that this is not very efficient for spell checking, since you don't examine context.
