I'm trying to find the words with the most unique letters in a list of strings. The problem for me is not finding the unique letters of a single string, as I know how to do that; my problem is stepping through the list of strings to collect each word's unique characters.
Example: Say that my list of strings is...
[Apple, Banana, Tiki]
and what I want the list to look like is
[Aple, Ban, Tik]
Whenever I try to go through it step by step, I end up with the entire list smashed together instead of comma separated, and all my other solutions have yielded nothing. I can't use any packages or the set() function.
def unique_letters(words_list):
    count = 0
    temp = []
    while count < len(words_list):
        for i in words_list[count]:
            if i not in temp:
                temp.append(i)
        dupes = ''.join(temp)
        count += 1
    return dupes
What I end up getting is...
'ApleBanTik' ### when I want ---> [Aple, Ban, Tik]
I've been working on another solution, but I end up getting the same thing. Any suggestions on how I can fix this?
You could do this (with list comprehension):
def unique_letters(words_list):
    return [''.join(dict.fromkeys(word)) for word in words_list]
Here is the expanded version:
def unique_letters(words_list):
    result = []
    for word in words_list:
        result.append(''.join(dict.fromkeys(word)))
    return result
When you build a dictionary from a word's characters with dict.fromkeys, duplicate characters collapse into a single key, and the keys keep their insertion order (guaranteed since Python 3.7). Then we just join the keys back into a string.
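For example, a quick REPL check with one of the words from the question:

>>> dict.fromkeys('Banana')
{'B': None, 'a': None, 'n': None}
>>> ''.join(dict.fromkeys('Banana'))
'Ban'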
Related
I am attempting to create a minimal algorithm that exhaustively searches a list of strings for duplicates and removes them by index, without changing the case of words and thereby their meanings.
The caveat is that the list contains words such as Blood, blood, DNA, ACTN4, 34-methyl-O-carboxy, Brain, brain-facing-mouse, BLOOD, and so on.
I only want to remove the duplicate 'blood' entries, keep the first occurrence with its first letter capitalized, and not modify the case of any other words. Any suggestions on how I should proceed?
Here is my code
def remove_duplicates(list_of_strings):
    """Takes a list of strings, uses an index to iterate over each string,
    lowercases each string, and returns a list of strings with no duplicates;
    it should not modify the original strings. An exhaustive search to remove
    duplicates using the index of the list and the list of strings."""
    list_of_strings_copy = list_of_strings
    try:
        for i in range(len(list_of_strings)):
            list_of_strings_copy[i] = list_of_strings_copy[i].lower()
            word = list_of_strings_copy[i]
            for j in range(len(list_of_strings_copy)):
                if word == list_of_strings_copy[j]:
                    list_of_strings.pop(i)
                j += 1
    except Exception as e:
        print(e)
    return list_of_strings
Make a dictionary, {text.lower():text,...}, use the keys for comparison and save the first instance of the text in the values.
d = {}
for item in list_of_strings:
    if item.lower() not in d:
        d[item.lower()] = item
d.values() should be what you want.
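For example, with the sample words from the question (a quick check; the expected output is worked out by hand, so do verify it):

list_of_strings = ['Blood', 'blood', 'DNA', 'ACTN4', '34-methyl-O-carboxy',
                   'Brain', 'brain-facing-mouse', 'BLOOD']
d = {}
for item in list_of_strings:
    if item.lower() not in d:
        d[item.lower()] = item
print(list(d.values()))
# ['Blood', 'DNA', 'ACTN4', '34-methyl-O-carboxy', 'Brain', 'brain-facing-mouse']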
I think something like the following would do what you need:
def remove_duplicates(list_of_strings):
    new_list = []  # create empty return list
    for string in list_of_strings:  # iterate through the list of strings
        string = string[0].capitalize() + string[1:].lower()  # capitalize the first letter, lowercase the rest
        if string not in new_list:  # check the string is not already in the returned list
            new_list.append(string)  # if not, append it
    return new_list  # return the final list
strings = ["Blood", "blood", "DNA", "ACTN4", "34-methyl-O-carboxy", "Brain", "brain-facing-mouse", "BLOOD"]
returned_strings = remove_duplicates(strings)
print(returned_strings)
(For reference this was written in Python 3.10)
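For the sample list above, this should print the following (note that it also lowercases the tails of DNA and ACTN4 to Dna and Actn4, so check whether that fits the requirement of not modifying other words):

['Blood', 'Dna', 'Actn4', '34-methyl-o-carboxy', 'Brain', 'Brain-facing-mouse']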
This question already has answers here: How to remove items from a list while iterating? (25 answers)
I am attempting to write a little program to play the popular game, Wordle.
One of my functions attempts to remove words from a word bank that do not contain a particular letter. The function does a decent job, but I have discovered that it occasionally fails to identify words that should be removed from the list. If I call the function repeatedly, it eventually eliminates all the correct words.
#Creates a list of 4266 five-letter words from a text file
five_letters = open("five.txt", "r")
five = five_letters.read()
five = five.split()
# Iterates across the list and removes words that do not contain a particular letter
def eliminate_words_not_containing(list, letter):
    for word in list:
        if letter not in word:
            list.remove(word)
    return list
# Call on the function 10 times and print the length of the list after each call
print(len(five))
for i in range(10):
    five = eliminate_words_not_containing(five, "e")
    print(len(five))
The output is:
4266
2932
2319
2070
2014
2010
2010
2010
2010
2010
2010
How can I write the function so that it catches all the words that do not contain a particular letter the first time through?
Silly question: is it possible that the program is running too quickly and skips over words?
You are iterating over the original list, so when you remove one word the positions aren't updated. Try making a copy of the list and iterating over that.
Edit: I'd try a list comprehension if it were me, keeping only the words that do contain the letter:
list = [word for word in list if letter in word]
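Or, if you want to keep the explicit loop, here is a minimal sketch of the copy approach described above: iterate over a copy so removals don't shift the live iteration.

def eliminate_words_not_containing(word_list, letter):
    for word in list(word_list):  # list(...) makes a shallow copy to iterate over
        if letter not in word:
            word_list.remove(word)  # safe: we are not iterating over word_list itself
    return word_list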
Because you are modifying your list while you are iterating over it, you are skipping words.
If your list is ['word1', 'word2', 'word3'] and you call your eliminate function, you will be left with ['word2'] because you removed 'word1', then your for loop moved on to the 2nd index, which is now 'word3' from the original list (with 'word2' being the first word in the list after the removal of 'word1').
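You can see the index shifting with a tiny experiment that removes every word unconditionally:

lst = ['word1', 'word2', 'word3']
for word in lst:
    lst.remove(word)  # removal shifts everything left, so the loop skips the next item
print(lst)  # ['word2']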
There are several ways to solve the problem, but one solution would be to remember the indices to remove:
def eliminate_words_not_containing(word_list, letter):
    to_remove = set()
    for i, word in enumerate(word_list):
        if letter not in word:
            to_remove.add(i)
    return [word for i, word in enumerate(word_list) if i not in to_remove]
Probably just a regular list comprehension would be better in this case, but the above would be generalized in case you want to do something more complicated with each entry.
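For completeness, the plain list-comprehension version would just keep the words that do contain the letter:

def eliminate_words_not_containing(word_list, letter):
    return [word for word in word_list if letter in word]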
(Also, it's not good practice to name a variable list, because it clobbers the built-in type list in Python.)
I have a simple list of words I need to filter, but each word in the list has an accompanying "score" appended to it which is causing me some trouble. The input list has this structure:
lst = ['FAST;5','BREAK;60','FASTBREAK;40',
'OUTBREAK;110','BREAKFASTBUFFET;35',
'BUFFET;75','FASTBREAKPOINTS;60'
]
I am trying to figure out how to identify words in my list that are compounded solely from other words on the same list. For example, the code applied to lst above would produce:
ans = ['FASTBREAK;40','BREAKFASTBUFFET;35']
I found a prior question that deals with a nearly identical situation, but in that instance there are no trailing scores attached to the words, and I am having trouble dealing with the trailing scores on my list. The ans list must keep the scores with the compound words found. The order of the words in lst is random and irrelevant. Ideally, I would like the ans list to be sorted by the length of the word (before the ';'), as shown above. This would save me some additional post-processing on ans.
I have figured out a way that works using regex and nested for loops (I will spare you the ugliness of my 1980s-esque brute-force code; it's really not pretty), but my word list has close to a million entries, and my solution takes so long as to be completely unusable. I am looking for a more Pythonic solution that I can actually use, and I'm having trouble working through it.
Here is some code that does the job. I'm sure it's not perfect for your situation (with a million entries), but perhaps can be useful in parts:
#!/usr/bin/env python
from collections import namedtuple

Word = namedtuple("Word", ("characters", "number"))
separator = ";"

lst = [
    "FAST;5",
    "BREAK;60",
    "FASTBREAK;40",
    "OUTBREAK;110",
    "BREAKFASTBUFFET;35",
    "BUFFET;75",
    "FASTBREAKPOINTS;60",
]

words = [Word(*w.rsplit(separator, 1)) for w in lst]

def findparts(oword, parts):
    if len(oword.characters) == 0:
        return parts
    for iword in words:
        if not parts and iword.characters == oword.characters:
            continue
        if iword.characters in oword.characters:
            parts.append(iword)
            characters = oword.characters.replace(iword.characters, "")
            return findparts(Word(characters, oword.number), parts)
    return []

ans = []
for word in words:
    parts = findparts(word, [])
    if parts:
        ans.append(separator.join(word))

print(ans)
It uses a recursive function that takes a word in your list and tries to assemble it with other words from that same list. This function will also present you with the actual atomic words forming the compound one.
It's not very smart, however. Here is an example of a composition it will not detect:
[BREAKFASTBUFFET, BREAK, BREAKFAST, BUFFET].
It uses a small detour using a namedtuple to temporarily separate the actual word from the number attached to it, assuming that the separator will always be ;.
I don't think regular expressions hold an advantage over a simple string search here.
If you know some more conditions about the composition of the compound words, like for instance the maximum number of components, the itertools combinatoric generators might help you to speed things up significantly and avoid missing the example given above too.
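For example, here is a rough sketch of my own (not tested against a million entries): it caps the number of components and brute-forces orderings with itertools.permutations, which would also catch the BREAKFASTBUFFET = BREAKFAST + BUFFET split given above.

from itertools import permutations

def is_compound(word, vocabulary, max_parts=3):
    # candidate parts: other vocabulary words occurring inside this word
    candidates = [w for w in vocabulary if w != word and w in word]
    for n in range(2, max_parts + 1):
        for combo in permutations(candidates, n):
            if ''.join(combo) == word:
                return combo
    return None

vocab = ['FAST', 'BREAK', 'FASTBREAK', 'OUTBREAK', 'BREAKFAST',
         'BREAKFASTBUFFET', 'BUFFET']
print(is_compound('BREAKFASTBUFFET', vocab))  # ('BREAKFAST', 'BUFFET')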
I think I would do it like this: make a new list containing only the words. Loop over that list, and within the loop look for words that are part of the current (outer) word. When one is found, replace the found part with an empty string. If the entire word ends up replaced by an empty string, show the word at the corresponding index of the original list.
EDIT: As was pointed out in the comments, the code can fail in some situations, like this one: lst = ["BREAKFASTBUFFET;35", "BREAK;60", "BREAKFAST;18", "BUFFET;75"]. In BREAKFASTBUFFET I first found that BREAK was part of it, so I replaced it with an empty string, which prevented BREAKFAST from being found. I hope that problem can be tackled by sorting the list in descending order of word length.
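For instance, a one-line sketch of that sorting idea (assuming lst holds the word;score strings):

lst.sort(key=lambda s: len(s.split(';')[0]), reverse=True)  # try longer words like BREAKFAST before BREAK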
EDIT 2
My former edit was not flawless; for instance, if there were a word BREAKFASTEN, it shouldn't be "eaten" by BREAKFAST. This version does the following:
make a list of candidates: all words that are part of the word under investigation
make another list of the words that the word starts with
keep track of the words in the candidates list that you've already tried
in a while loop, keep trying until either the start list is empty or you've successfully replaced the whole word with candidates
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'POINTS;25',
       'BUFFET;75', 'FASTBREAKPOINTS;60', 'BREAKPOINTS;15'
       ]

lst2 = [s.split(';')[0] for s in lst]

for i, word in enumerate(lst2):
    # candidates: words that are part of current word
    candidates = [x for i2, x in enumerate(lst2) if x in word and i != i2]
    if len(candidates) > 0:
        tried = []
        word2 = word
        found = False
        while not found:
            # start: subset of candidates that the current word starts with
            start = [x for x in candidates if word2.startswith(x) and x not in tried]
            for trial in start:
                word2 = word2.replace(trial, '')
                tried.append(trial)
                if len(word2) == 0:
                    print(lst[i])
                    found = True
                    break
            if len(candidates) > 1:
                candidates = candidates[1:]
                word2 = candidates[0]
            else:
                break
There are several ways of speeding up the process but I doubt there is a polynomial solution.
So let's use multiprocessing, and do what we can to generate a meaningful result. The sample below is not identical to what you are asking for, but it does compose a list of apparently compound words from a large dictionary.
For the code below, I am sourcing https://gist.github.com/h3xx/1976236 which lists about 80,000 unique words in order of frequency in English.
The code below can easily be sped up if the input wordlist is sorted alphabetically beforehand, as each head of a compound will be immediately followed by its potential compound members:
black
blackberries
blackberry
blackbird
blackbirds
blackboard
blackguard
blackguards
blackmail
blackness
blacksmith
blacksmiths
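To illustrate that speed-up, here is a hedged sketch of my own (not the code below): with a sorted list, bisect jumps straight to the contiguous block of words sharing a given head.

import bisect

def compounds_of_sorted(word, sorted_words, wordset):
    # first entry after `word` itself; all words starting with `word` follow contiguously
    i = bisect.bisect_right(sorted_words, word)
    result = []
    while i < len(sorted_words) and sorted_words[i].startswith(word):
        if sorted_words[i][len(word):] in wordset:  # the tail is itself a word
            result.append(sorted_words[i])
        i += 1
    return result

words = sorted(['black', 'bird', 'board', 'berry', 'blackbird', 'blackboard', 'blackberry'])
print(compounds_of_sorted('black', words, set(words)))
# ['blackberry', 'blackbird', 'blackboard']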
As mentioned in the comment, you may also need to use a semantic filter to identify true compound words: for instance, the word 'generally' isn't a compound word for a 'gene rally'! So, while you may get a list of contenders, you will need to eliminate false positives somehow.
# python 3.9
import multiprocessing as mp

# returns an ordered list of lowercase words to be used.
def load(name) -> list:
    return [line[:-1].lower() for line in open(name) if not line.startswith('#') and len(line) > 3]

# function that identifies the compounds of a word from a list.
# ... can be optimised if using a sorted list.
def compounds_of(word: str, values: list):
    return [w for w in values if w.startswith(word) and w.removeprefix(word) in values]

# apply compound finding across an mp environment
# but this is the slowest part
def compose(values: list) -> dict:
    with mp.Pool() as pool:
        result = {(word, i): pool.apply(compounds_of, (word, values)) for i, word in enumerate(values)}
    return result

if __name__ == '__main__':
    # https://gist.github.com/h3xx/1976236
    words = load('wiki-100k.txt')  # words are ordered by popularity, are 3 or more letters, in lowercase.
    words = list(dict.fromkeys(words))
    # remove those word heads which have fewer than 3 tails
    compounds = {k: v for k, v in compose(words).items() if len(v) > 3}
    # get the top 500 keys
    rank = list(sorted(compounds.keys(), key=lambda x: x[1]))[:500]
    # compose them into a dict and print
    tops = {k[0]: compounds[k] for k in rank}
    print(tops)
I want to be able to get words that are more than three letters long from a list. I am struggling with the process of doing this.
words = "My name is bleh, I am bleh, how are you bleh?"
I would like to be able to extract everything that's longer than three letters.
Try this:
result = []
for word in words.split(' '):
    if len(word) > 3:  # strictly more than three letters
        result.append(word)
When you use split you get a list containing every word of the variable words. Then you iterate over it and keep only the words whose length, from len(), is greater than three.
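Equivalently, as a single list comprehension:

result = [word for word in words.split(' ') if len(word) > 3]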
I have a small problem with punctuation.
My assignment was to check whether there were any duplicated words in a text; if there were, my job was to highlight them using .upper().
Example text: I like apples, apples is the best thing i know.
So I took the original text, stripped the punctuation, transformed all words to lowercase, and then split it into a list.
With a for loop I compared every word in the list with the others and found all the duplicated words; all of this was placed in a new list.
Example (after using the for loop): i like apples APPLES is the best thing I know
So the new list is now similar to the original, with one major exception: it is lacking the punctuation.
Is there a way to add the punctuation to the new list where it is "supposed to be" (using the positions from the old list)?
Is there some method built into Python that can do this, or do I have to compare the two lists with another for loop and then add the punctuation to the new list?
NewList = []  # creates an empty list
for word in text:
    if word not in NewList:
        NewList.append(word)
    elif word in NewList:
        NewList.append(word.upper())

List2 = ' '.join(NewList)
The code above works for longer texts, and it's the code I have been using for highlighting duplicated words.
The only problem is that the punctuation doesn't exist in the new list; that's the only problem I have.
Here's an example using the sub method with a callback from the built-in re module.
This solution respects all the punctuation.
import re

txt = "I like,, ,apples, apples! is the .best. thing *I* know!!1"

def repl(match, stack):
    word = match.group(0)
    word_upper = word.upper()
    if word_upper in stack:
        return word_upper
    stack.add(word_upper)
    return word

def highlight(s):
    stack = set()
    return re.sub(r'\b([a-zA-Z]+)\b', lambda match: repl(match, stack), s)

print(txt)
print(highlight(txt))
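Running it should print the original string and then the highlighted one; only the second apples changes (the repeated I is already uppercase, so uppercasing it is invisible):

I like,, ,apples, apples! is the .best. thing *I* know!!1
I like,, ,apples, APPLES! is the .best. thing *I* know!!1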