Python - finding words based on repeated characters specified in a string

Let's say I have a list of words:
resign
resins
redyed
resist
reeded
I also have a string ".10.10"
I need to iterate through the list and find the words where there are repeated characters in the same locations where there are numbers in the string.
For instance, the string ".10.10" would find the word 'redyed' since there are e's where there are 1's and there are d's where there are 0's.
Another string ".00.0." would find the word 'reeded' as there are e's in that position.
My attempts in python so far are not really worth printing. At the moment I look through the string, add all 0s to an array and the 1s to an array then try to find repeated characters in the array positions. But it's terribly clumsy and doesn't work properly.

def matches(s, pattern):
    d = {}
    return all(cp == "." or d.setdefault(cp, cs) == cs
               for cs, cp in zip(s, pattern))
a = ["resign", "resins", "redyed", "resist", "reeded"]
print [s for s in a if matches(s, ".01.01")]
print [s for s in a if matches(s, ".00.0.")]
prints
['redyed']
['reeded']
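The one-liner leans on dict.setdefault: the first time a pattern digit is seen it records the letter at that position, and every later occurrence of that digit must compare equal to it. A more explicit sketch of the same idea might look like this; the name matches_explicit and the added length check are assumptions for illustration, not part of the answer above.

```python
def matches_explicit(word, pattern):
    """Return True if every digit in `pattern` marks the same letter in `word`."""
    # Assumption: a word only matches a pattern of the same length.
    if len(word) != len(pattern):
        return False
    seen = {}  # pattern symbol -> letter first found at that symbol
    for letter, symbol in zip(word, pattern):
        if symbol == ".":
            continue  # "." is a wildcard position
        if symbol not in seen:
            seen[symbol] = letter
        elif seen[symbol] != letter:
            return False
    return True

candidates = ["resign", "resins", "redyed", "resist", "reeded"]
print([w for w in candidates if matches_explicit(w, ".01.01")])  # ['redyed']
print([w for w in candidates if matches_explicit(w, ".00.0.")])  # ['reeded']
```

As in the original, two different digits are allowed to stand for the same letter; only repeated occurrences of one digit are compared.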

Related

Reading strings from a file to find unique characters

I'm trying to find words with the most unique letters from a list of strings. The problem for me is not finding the unique letters of a single string, as I know how to do that; my problem is going step by step through the list of strings to find each word's unique characters.
Example: Say that my list of strings is...
[Apple, Banana, Tiki]
and what I want the list to look like is
[Aple, Ban, Tik]
Whenever I tried to go through step by step, I end up having the entire list smashed together instead of comma separated and all my other solutions have yielded nothing. I can't use any packages or the set() function.
def unique_letters(words_list):
    count = 0
    while(count < len(words_list)):
        for i in lines[count]:
            if i not in temp:
                temp.append(i)
        dupes = ''.join(temp)
        count += 1
    return dupes
What I end up getting is...
'ApleBanTik' ### when I want ---> [Aple, Ban, Tik]
I've been working on another solution, but I end up getting the same thing. Any suggestions on how I can fix this?

You could do this (with list comprehension):
def unique_letters(words_list):
    return [''.join(dict.fromkeys(word)) for word in words_list]
Here is the expanded version:
def unique_letters(words_list):
    result = []
    for word in words_list:
        result.append(''.join(dict.fromkeys(word)))
    return result
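dict.fromkeys(word) builds a dictionary whose keys are the letters of the word. Keys are unique, which drops the duplicates, and dictionaries preserve insertion order (guaranteed since Python 3.7), so the remaining letters keep their original order. Joining the keys then turns the dictionary back into a string.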
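If relying on dictionary ordering is not an option (it is only guaranteed from Python 3.7 on), the same result can be sketched with a plain loop. This is a minimal illustration that avoids set() and imports, as the question requires; the key point is that the list of seen letters is created afresh for every word, which is what keeps the words separate.

```python
def unique_letters(words_list):
    """Drop repeated letters from each word, keeping first occurrences in order."""
    result = []
    for word in words_list:
        seen = []  # a fresh list per word keeps the words from running together
        for letter in word:
            if letter not in seen:
                seen.append(letter)
        result.append(''.join(seen))
    return result

print(unique_letters(['Apple', 'Banana', 'Tiki']))  # ['Aple', 'Ban', 'Tik']
```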

How to merge strings with overlapping characters in python?

I'm working on a python project which reads in an URL encoded overlapping list of strings. Each string is 15 characters long and overlaps with its sequential string by at least 3 characters and at most 15 characters (identical).
The goal of the program is to go from a list of overlapping strings - either ordered or unordered - to a compressed URL encoded string.
My current method fails at duplicate segments in the overlapping strings. For example, my program is incorrectly combining:
StrList1 = [ 'd+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
to output:
output = ['ublic+class+HelloWorld+%7B%0A++++public+', '%2F%2F+Sample+program%0Apublic+static+v']
when correct output is:
output = ['%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v']
I am using simple python, not biopython or sequence aligners, though perhaps I should be?
Would greatly appreciate any advice on the matter or suggestions of a nice way to do this in python!
Thanks!

You can start with one of the strings in the list (stored as string), and for each of the remaining strings in the list (stored as candidate) where:
candidate is part of string,
candidate contains string,
candidate's tail matches the head of string,
or, candidate's head matches the tail of string,
assemble the two strings according to how they overlap, and then recursively repeat the procedure with the overlapping string removed from the remaining strings and the assembled string appended, until there is only one string left in the list, at which point it is a valid fully assembled string that can be added to the final output.
Since there can potentially be multiple ways several strings can overlap with each other, some of which can result in the same assembled strings, you should make output a set of strings instead:
def assemble(str_list, min=3, max=15):
    if len(str_list) < 2:
        return set(str_list)
    output = set()
    string = str_list.pop()
    for i, candidate in enumerate(str_list):
        matches = set()
        if candidate in string:
            matches.add(string)
        elif string in candidate:
            matches.add(candidate)
        for n in range(min, max + 1):
            if candidate[:n] == string[-n:]:
                matches.add(string + candidate[n:])
            if candidate[-n:] == string[:n]:
                matches.add(candidate[:-n] + string)
        for match in matches:
            # pass min and max along so non-default limits survive the recursion
            output.update(assemble(str_list[:i] + str_list[i + 1:] + [match], min, max))
    return output
so that with your sample input:
StrList1 = ['d+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
assemble(StrList1) would return:
{'%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v'}
or as an example of an input with various overlapping possibilities (that the second string can match the first by being inside, having tail matching the head, and having head matching the tail):
assemble(['abcggggabcgggg', 'ggggabc'])
would return:
{'abcggggabcgggg', 'abcggggabcggggabc', 'abcggggabcgggggabc', 'ggggabcggggabcgggg'}
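To make the "tail matches head" step easier to follow in isolation, here is a small sketch of just that part. The helper name merge_tail_to_head and its min_overlap parameter are hypothetical; they are not part of the answer's assemble function, which performs the same comparison in both directions inside its loop.

```python
def merge_tail_to_head(left, right, min_overlap=3):
    """Return every string formed by overlapping the tail of `left` with the head of `right`."""
    merged = set()
    longest = min(len(left), len(right))
    for n in range(min_overlap, longest + 1):
        # An overlap of length n exists when the last n characters of `left`
        # equal the first n characters of `right`.
        if left[-n:] == right[:n]:
            merged.add(left + right[n:])
    return merged

print(merge_tail_to_head('%2F%2F+Sample+progr', 'program%0Apublic+'))
# {'%2F%2F+Sample+program%0Apublic+'}
```

A set is returned because two strings can overlap in more than one way, which is the same reason assemble collects its results in a set.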

Create new words from start word python

def make_new_words(start_word):
    """create new words from given start word and returns new words"""
    new_words=[]
    for letter in start_word:
        pass
        #for letter in alphabet:
            #do something to change letters
            #new_words.append(new_word)
I have a three-letter word as input, for example car, which is the start word.
I then have to create new word by replacing one letter at a time with every letter from the alphabet. Using my example car I want to create the words, aar, bar, car, dar, ear,..., zar. Then create the words car, cbr, ccr, cdr, cer,..., czr. Finally caa, cab, cac, cad, cae,..., caz.
I don't really know what the for loop should look like. I was thinking about creating some sort of alphabet list and by looping through that creating new words but I don't know how to choose what parts of the original word should remain. The new words can be appended to a list to be returned.

import string
def make_new_words(start_word):
    """create new words from given start word and returns new words"""
    new_words = []
    for i, letter in enumerate(start_word):
        word_as_list = list(start_word)
        for char in string.ascii_lowercase:
            word_as_list[i] = char
            new_words.append("".join(word_as_list))
    return new_words

lowercase is just a string containing the lowercase letters (string.lowercase exists only in Python 2; string.ascii_lowercase is the portable spelling).
We want to change each letter of the original word (here w), so we iterate over the letters of w; since we mostly need the index of each letter, the for loop runs over enumerate(w).
First of all, Python strings are immutable, so we build a list x from w, because lists are mutable.
Then a second, inner loop runs over the lowercase letters: we set the current element of x to each letter in turn and print the result. Rebuilding x from w at the top of the outer loop resets the previous position before we move on to the next one.
Because we want to print a string rather than a list of characters, we use the join method of the empty string '', which glues the elements of x together with nothing in between.
I have not included the output, but it's exactly what you asked for; just try it:
from string import lowercase
w = 'car'
for i, _ in enumerate(w):
    x = list(w)
    for s in lowercase:
        x[i] = s
        print ''.join(x)

import string
all_letters = string.ascii_lowercase
def make_new_words(start_word):
    for index, letter in enumerate(start_word):
        template = start_word[:index] + '{}' + start_word[index+1:]
        for new_letter in all_letters:
            print template.format(new_letter)

You can do this with two loops, by looping over the word and then looping over a range for all letters. By keeping an index for the first loop, you can use a slice to construct your new strings:
for index, _ in enumerate(start_word):
    for let in range(ord('a'), ord('z')+1):
        new_words.append(start_word[:index] + chr(let) + start_word[index+1:])

This could work as a brute-force approach, although you might run into performance issues when you try it with longer words.
It also sounds like you might eventually want to restrict the results to words that exist in a dictionary, which is a whole other can of worms.
For now, with three-letter words, you're on the right track, although I worry that the question might be a little too specific for Stack Overflow.
First, you will probably have more success if you loop over the indices of the word rather than its letters:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
for i in range(len(start_word)):
Then, you can use slices to grab the letters before and after the index:
    for letter in alphabet:
        new_word = start_word[:i] + letter + start_word[i + 1:]
Another approach is given above, which converts the string to a list. That works around the fact that Python strings are immutable, so simply setting start_word[i] = letter is not allowed.
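Putting the two fragments together, a complete function might look like the sketch below. It assumes the start word is lowercase ASCII and uses string.ascii_lowercase as the alphabet; note that the start word itself appears once per position, exactly as in the question's example.

```python
import string

def make_new_words(start_word):
    """Return the words formed by replacing one letter of start_word at a time."""
    new_words = []
    for i in range(len(start_word)):
        for letter in string.ascii_lowercase:
            # Keep everything before and after position i, swap only that letter.
            new_words.append(start_word[:i] + letter + start_word[i + 1:])
    return new_words

words = make_new_words('car')
print(len(words))   # 78, i.e. 3 positions x 26 letters
print(words[:5])    # ['aar', 'bar', 'car', 'dar', 'ear']
```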

How to find the combination of words that includes all the letters in the input with Python

I want to find the most efficient way to loop through the combination of letters that are entered in Python and return a set of words whose combination includes all the letters, if feasible.
Example:
Say the user entered A B C D E. The goal is to find the smallest number of words that together include all the letters. In this case the optimal solution, in order of preference, would be:
1. One word that has all 5 letters
2. Two words that together have all 5 letters (a 4-letter word + a 1-letter word, or a 3-letter word + a 2-letter word; it makes no difference)
3. ...and so on.
If there is no match, go back to 1. with n-1 letters, etc.
I have a function to check if a "combination of letters" (i.e. word) is in dictionary.
def is_in_lib(word):
    if word in lib:
        return word
    return False
The ideal answer should not involve generating every combination of those letters and looking each one up. Searching through my dictionary is very costly, so I need something that also minimizes the time spent searching the dictionary.
IMPORTANT EDIT: The order matters and continuity is required. Meaning if the user enters "H", "T", "A", you cannot build "HAT".
Real example: if the input is T - H - G - R - A - C - E - K - B - Y - E, the output should be "Grace" and "Bye".

You could build a string (or list) from the input letters and check every letter against each word in the word library:
inputstring = 'abcde'
def find_word(lib):
    # Wrapped in a function (the name is illustrative) so that return is valid
    for i in lib:
        is_okay = True
        for j in inputstring:
            if i.find(j) == -1:
                is_okay = False
        if is_okay:
            return i
I think the other cases (such as two words covering 3 and 2 letters) can be implemented recursively, but that would not be efficient.
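The same check can be written more compactly with all(), which stops at the first missing letter. This is only a sketch: the function name and the mini-library are hypothetical, and like the loop above it ignores the order and continuity requirement from the question's edit.

```python
def words_with_all_letters(letters, lib):
    """Return the words in `lib` that contain every character of `letters`."""
    return [word for word in lib if all(ch in word for ch in letters)]

# Hypothetical mini-library, for illustration only.
lib = ['bead', 'abcde', 'decade', 'backed']
print(words_with_all_letters('abcde', lib))  # ['abcde', 'backed']
```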

I think the key idea here would be to have some kind of index providing a mapping from a canonical sequence of characters to actual words. Something like that:
# List of known words
>>> words = ('bonjour', 'jour', 'bon', 'poire', 'proie')
# Build the index
>>> import collections
>>> index = collections.defaultdict(list)
>>> for w in words:
...     index[''.join(sorted(w.lower()))].append(w)
...
This gives an efficient way to find all the anagrams corresponding to a sequence of characters:
>>> index
defaultdict(<class 'list'>, {'joru': ['jour'], 'eiopr': ['poire', 'proie'], 'bjnooru': ['bonjour'], 'bno': ['bon']})
You could query the index that way:
>>> user_str = 'OIREP'
>>> index.get(''.join(sorted(user_str.lower())), "")
['poire', 'proie']
Of course, this will only find "exact" anagrams, that is, words containing all the letters provided by the user. To find all the strings that match a subset of the user-provided letters, you will have to remove one letter at a time and check each combination again. I feel like recursion will help to solve that problem ;)
EDIT:
(should I put that on a spoiler section?)
Here is a possible solution:
import collections
words = ('bonjour', 'jour', 'bon', 'or', 'pire', 'poire', 'proie')
index = collections.defaultdict(list)
for w in words:
    index[''.join(sorted(w.lower()))].append(w)
# Recursively search all the words containing a sequence of letters
def search(letters, result=None):
    # Assume "letters" ordered
    # Create the result set per call; a mutable default would be shared between calls
    if result is None:
        result = set()
    if not letters:
        return result
    solutions = index.get(letters)
    if solutions:
        for s in solutions:
            result.add(s)
    for i in range(0, len(letters)):
        search(letters[:i] + letters[i+1:], result)
    return result
# Use case:
user_str = "OIREP"
s = search(''.join(sorted(user_str.lower())))
print(s)
Producing:
set(['poire', 'or', 'proie', 'pire'])
It is not that bad, but it could be improved, since the same subsets of characters are examined several times. This is especially true if the user-provided search string contains several identical letters.
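One way to avoid re-examining the same subset is to remember which sorted letter strings have already been visited. The sketch below is an assumption-laden variant, not the answer's code: the name search_once and the explicit stack are illustrative, and the word list is the same small sample used above.

```python
import collections

# Sample word list from the answer above.
words = ('bonjour', 'jour', 'bon', 'or', 'pire', 'poire', 'proie')

index = collections.defaultdict(list)
for w in words:
    index[''.join(sorted(w.lower()))].append(w)

def search_once(letters):
    """Return every indexed word spelled from a subset of `letters`."""
    result = set()
    seen = set()  # sorted letter strings that were already examined
    stack = [''.join(sorted(letters.lower()))]
    while stack:
        current = stack.pop()
        if not current or current in seen:
            continue
        seen.add(current)
        result.update(index.get(current, ()))
        # Removing one character from a sorted string leaves it sorted.
        for i in range(len(current)):
            stack.append(current[:i] + current[i + 1:])
    return result

print(sorted(search_once("OIREP")))  # ['or', 'pire', 'poire', 'proie']
```

Each distinct subset is looked up in the index at most once, which matters most when the input contains repeated letters.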

Breaking a string into individual words in Python

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
(+5996 more)
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little python 2.7 but I am open to any recommendations in approaching this, example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this but I have no idea how to implement this in python.

We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, like in ["example", "cart", ...])
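To see the generator in action without a full dictionary, it can be run against a tiny hand-made word set. The set below is hypothetical and only meant to show that every valid split is produced, and that a name with no complete split yields nothing.

```python
def substrings_in_set(s, words):
    # Same generator as above, repeated so this example runs on its own.
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

# Hypothetical mini word set, for illustration only.
words = {"example", "pens", "ions", "pensions"}

for split in substrings_in_set("examplepensions", words):
    print(split)
# ['example', 'pensions']
# ['example', 'pens', 'ions']

print(list(substrings_in_set("examplecart", words)))  # []
```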
Then we build the English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above English dictionary for some reason lists all single letters as words.
# Remove all except "a", "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]
# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)
print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the English dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be valid both as a whole word and as a split, such as "example" / "ex" + "ample". In some cases we can solve the problem by excluding unwanted words, such as "ex" in the code above. In others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by collecting the words found for each domain name in a set (sets ignore duplicates) and counting them only once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!

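Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
from collections import Counter
def all_sub_strings(s):
    # Assumed helper: every contiguous substring of s
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]
# domainfile and DictionaryFileOfEnglishLanguage are placeholders
domains = open(domainfile)
# strip the trailing newline so that membership tests can succeed
dictionary = set(w.strip() for w in DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain.strip()):
        if substring in dictionary:
            found.append(substring)
c = Counter(found)  # this is what you want
print c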

with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]
def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result
from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com , org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1
for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0],count=spam[1])
Here's a brute-force method which only tries to split the domains into 2 English words. If a domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever about it. Fortunately I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1
