Breaking a string into individual words in Python - python

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations on approaching this; an example of code would really help. I have been told that a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement it in Python.

We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with, if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may yield nothing, as in the ["example", "cart", ...] branch, where the remainder cannot be split).
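A quick illustration of the generator, using a tiny hand-made word set (assumed here purely for the example; the real run below uses the system dictionary):

>>> words = {"example", "car", "cart", "trading"}
>>> list(substrings_in_set("examplecartrading", words))
[['example', 'car', 'trading']]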
Then we build the English dictionary:
# Assuming Linux. The word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert its path here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above English word list for some reason lists all single letters as words.
# Remove all except "a", "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]
# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)
print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be valid either as a whole word or when split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
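For even longer names, one further hedged option (not part of the original answer) is to memoise on the remaining suffix, so identical suffixes are only split once. A sketch along those lines, returning lists instead of a generator so the results can be cached:

def substrings_memo(s, words, memo=None):
    # Same splitting logic as substrings_in_set, but the result for each
    # suffix is cached in memo so it is computed at most once.
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    results = [[s]] if s in words else []
    for i in range(1, len(s)):
        if s[:i] in words:
            for rest in substrings_memo(s[i:], words, memo):
                results.append([s[:i]] + rest)
    memo[s] = results
    return results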

Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
from collections import Counter

domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

c = Counter(found)  # this is what you want
print c
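The snippet above assumes a helper all_sub_strings that yields every contiguous substring of a domain. A minimal sketch of one such helper (an assumption, not part of the original answer):

def all_sub_strings(s):
    # Yield every contiguous substring of s (quadratic in len(s)).
    s = s.strip().lower()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            yield s[i:j]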

with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits (see the sketch after the output below), but it will probably not scale well with the number of splits unless you are clever. Fortunately I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1
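A sketch of the multi-split extension mentioned above (the max_parts cap and the greedy first-match return are assumptions, not part of the original answer): the same dictionary lookup, applied recursively.

def guess_split_n(word, max_parts=4):
    # Return the first split of word into at most max_parts dictionary words,
    # or [] if no such split exists.
    if word in words:
        return [word]
    if max_parts <= 1:
        return []
    for n in xrange(1, len(word)):
        if word[:n] in words:
            rest = guess_split_n(word[n:], max_parts - 1)
            if rest:
                return [word[:n]] + rest
    return []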

Related

Find Compound Words in List of Words - Python

I have a simple list of words I need to filter, but each word in the list has an accompanying "score" appended to it which is causing me some trouble. The input list has this structure:
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'BUFFET;75', 'FASTBREAKPOINTS;60'
      ]
I am trying to figure out how to identify words in my list that are compounded solely from other words on the same list. For example, the code applied to lst above would produce:
ans = ['FASTBREAK;40', 'BREAKFASTBUFFET;35']
I found a prior question that deals with a nearly identical situation, but in that instance there are no trailing scores with the words on the list, and I am having trouble dealing with these trailing scores on my list. The ans list must keep the scores with the compound words found. The order of the words in lst is random and irrelevant. Ideally, I would like the ans list to be sorted by the length of the word (before the ';'), as shown above. This would save me some additional post-processing on ans.
I have figured out a way that works using ReGex and nested for loops (I will spare you the ugliness of my 1980s-esque brute force code, it's really not pretty), but my word list has close to a million entries, and my solution takes so long as to be completely unusable. I am looking for a solution a little more Pythonic that I can actually use. I'm having trouble working through it.
Here is some code that does the job. I'm sure it's not perfect for your situation (with a million entries), but perhaps can be useful in parts:
#!/usr/bin/env python

from collections import namedtuple

Word = namedtuple("Word", ("characters", "number"))

separator = ";"

lst = [
    "FAST;5",
    "BREAK;60",
    "FASTBREAK;40",
    "OUTBREAK;110",
    "BREAKFASTBUFFET;35",
    "BUFFET;75",
    "FASTBREAKPOINTS;60",
]

words = [Word(*w.rsplit(separator, 1)) for w in lst]

def findparts(oword, parts):
    if len(oword.characters) == 0:
        return parts
    for iword in words:
        if not parts and iword.characters == oword.characters:
            continue
        if iword.characters in oword.characters:
            parts.append(iword)
            characters = oword.characters.replace(iword.characters, "")
            return findparts(Word(characters, oword.number), parts)
    return []

ans = []
for word in words:
    parts = findparts(word, [])
    if parts:
        ans.append(separator.join(word))

print(ans)
It uses a recursive function that takes a word in your list and tries to assemble it with other words from that same list. This function will also present you with the actual atomic words forming the compound one.
It's not very smart, however. Here is an example of a composition it will not detect:
[BREAKFASTBUFFET, BREAK, BREAKFAST, BUFFET].
It uses a small detour using a namedtuple to temporarily separate the actual word from the number attached to it, assuming that the separator will always be ;.
I don't think regular expressions hold an advantage over a simple string search here.
If you know some more conditions about the composition of the compound words, like for instance the maximum number of components, the itertools combinatoric generators might help you to speed things up significantly and avoid missing the example given above too.
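A brute-force sketch of that itertools idea (the max_parts cap and the candidate pre-filter are assumptions): every ordered selection of candidate parts, with repetition allowed, is checked against the target, so compositions like BREAKFAST + BUFFET are not missed.

import itertools

def compound_splits(target, vocab, max_parts=3):
    # Candidate parts: vocabulary words that occur inside the target.
    candidates = [w for w in vocab if w != target and w in target]
    for r in range(2, max_parts + 1):
        # product(..., repeat=r) also finds compounds that reuse a part.
        for combo in itertools.product(candidates, repeat=r):
            if ''.join(combo) == target:
                yield combo

# e.g. compound_splits('BREAKFASTBUFFET', {'FAST', 'BREAK', 'BREAKFAST', 'BUFFET'})
# yields ('BREAKFAST', 'BUFFET') and, with three parts, ('BREAK', 'FAST', 'BUFFET').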
I think I would do it like this: make a new list containing only the words. In a for loop go through this list, and within it look for the words that are part of the word of the outer loop. If they are found: replace the found part by an empty string. If afterwards the entire word is replaced by an empty string: show the word of the corresponding index of the original list.
EDIT: As was pointed out in the comments, there could be a problem with the code in some situations, like this one: lst = ["BREAKFASTBUFFET;35", "BREAK;60", "BREAKFAST;18", "BUFFET;75"]. In BREAKFASTBUFFET I first found that BREAK was a part of it, so I replaced that one with an empty string, which prevented BREAKFAST from being found. I hope that problem can be tackled by sorting the list in descending order of word length.
EDIT 2
My former edit was not flaw-proof; for instance, if there was a word BREAKFASTEN, it shouldn't be "eaten" by BREAKFAST. This version does the following:
make a list of candidates: all words that are part of the word under investigation
make another list of the words that the word starts with
keep track of the words in the candidates list that you've already tried
in a while True: keep trying until either the start list is empty, or you've successfully replaced all words by the candidates
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'POINTS;25',
       'BUFFET;75', 'FASTBREAKPOINTS;60', 'BREAKPOINTS;15'
      ]

lst2 = [s.split(';')[0] for s in lst]

for i, word in enumerate(lst2):
    # candidates: words that are part of current word
    candidates = [x for i2, x in enumerate(lst2) if x in word and i != i2]
    if len(candidates) > 0:
        tried = []
        word2 = word
        found = False
        while not found:
            # start: subset of candidates that the current word starts with
            start = [x for x in candidates if word2.startswith(x) and x not in tried]
            for trial in start:
                word2 = word2.replace(trial, '')
                tried.append(trial)
                if len(word2) == 0:
                    print(lst[i])
                    found = True
                    break
            if len(candidates) > 1:
                candidates = candidates[1:]
                word2 = candidates[0]
            else:
                break
There are several ways of speeding up the process but I doubt there is a polynomial solution.
So let's use multiprocessing, and do what we can to generate a meaningful result. The sample below is not identical to what you are asking for, but it does compose a list of apparently compound words from a large dictionary.
For the code below, I am sourcing https://gist.github.com/h3xx/1976236 which lists about 80,000 unique words in order of frequency in English.
The code below can easily be sped up if the input word list is sorted alphabetically beforehand, as each head of a compound will be immediately followed by its potential compound members:
black
blackberries
blackberry
blackbird
blackbirds
blackboard
blackguard
blackguards
blackmail
blackness
blacksmith
blacksmiths
As mentioned in the comment, you may also need to use a semantic filter to identify true compound words - for instance, the word ‘generally’ isn’t a compound word for a ‘gene rally’ !! So, while you may get a list of contenders you will need to eliminate false positives somehow.
# python 3.9
import multiprocessing as mp


# returns an ordered list of lowercase words to be used.
def load(name) -> list:
    return [line[:-1].lower() for line in open(name)
            if not line.startswith('#') and len(line) > 3]


# function that identifies the compounds of a word from a list.
# ... can be optimised if using a sorted list.
def compounds_of(word: str, values: list):
    return [w for w in values if w.startswith(word) and w.removeprefix(word) in values]


# apply compound finding across an mp environment
# but this is the slowest part
def compose(values: list) -> dict:
    with mp.Pool() as pool:
        result = {(word, i): pool.apply(compounds_of, (word, values))
                  for i, word in enumerate(values)}
    return result


if __name__ == '__main__':
    # https://gist.github.com/h3xx/1976236
    words = load('wiki-100k.txt')  # words are ordered by popularity, are 3 or more letters, in lowercase.
    words = list(dict.fromkeys(words))

    # keep only those word heads which have more than 3 tails
    compounds = {k: v for k, v in compose(words).items() if len(v) > 3}

    # get the top 500 keys
    rank = list(sorted(compounds.keys(), key=lambda x: x[1]))[:500]

    # compose them into a dict and print
    tops = {k[0]: compounds[k] for k in rank}
    print(tops)
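One hedged tweak to the slow part flagged in the comments: pool.apply blocks until each call returns, so the dict comprehension above effectively runs serially. A sketch (an assumption, not the original author's code) using starmap with the same compounds_of signature hands the calls to the workers in parallel:

def compose_parallel(values: list) -> dict:
    # starmap distributes one (word, values) call per list entry across the pool
    with mp.Pool() as pool:
        tails = pool.starmap(compounds_of, [(w, values) for w in values])
    return {(word, i): tail for i, (word, tail) in enumerate(zip(values, tails))}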

Most Frequent Character - User Submitted String without Dictionaries or Counters

Currently, I am in the midst of writing a program that calculates all of the non white space characters in a user submitted string and then returns the most frequently used character. I cannot use collections, a counter, or the dictionary. Here is what I want to do:
Split the string so that white space is removed. Then count each character and return a value. I would have something to post here but everything I have attempted thus far has been met with critical failure. The closest I came was this program here:
strin = input('Enter a string: ')
fc = []
nfc = 0
for ch in strin:
    i = 0
    j = 0
    while i < len(strin):
        if ch.lower() == strin[i].lower():
            j += 1
        i += 1
    if j > nfc and ch != ' ':
        nfc = j
        fc = ch
print('The most frequent character in string is: ', fc)
If you can fix this code or tell me a better way of doing it that meets the required criteria that would be helpful. And, before you say this has been done a hundred times on this forum please note I created an account specifically to ask this question. Yes there are a ton of questions like this but some that are reading from a text file or an existing string within the program. And an overwhelmingly large amount of these contain either a dictionary, counter, or collection which I cannot presently use in this chapter.
Just do it "the old way". Create a list (okay it's a collection, but a very basic one so shouldn't be a problem) of 26 zeroes and increase according to position. Compute max index at the same time.
strin="lazy cat dog whatever"
l=[0]*26
maxindex=-1
maxvalue=0
for c in strin.lower():
pos = ord(c)-ord('a')
if 0<=pos<=25:
l[pos]+=1
if l[pos]>maxvalue:
maxindex=pos
maxvalue = l[pos]
print("max count {} for letter {}".format(maxvalue,chr(maxindex+ord('a'))))
result:
max count 3 for letter a
As an alternative to Jean's solution (not using a list that allows for one-pass over the string), you could just use str.count here which does pretty much what you're trying to do:
strin = input("Enter a string: ").strip()
maxcount = float('-inf')
maxchar = ''
for char in strin:
c = strin.count(char) if not char.isspace() else 0
if c > maxcount:
maxcount = c
maxchar = char
print("Char {}, Count {}".format(maxchar, maxcount))
If lists are available, I'd use Jean's solution. He doesn't use an O(N) function N times :-)
P.S.: you could compact this into one line if you use max:
max(((strin.count(i), i) for i in strin if not i.isspace()))
To keep track of several counts for different characters, you have to use a collection (even if it is a global namespace implemented as a dictionary in Python).
To print the most frequent non-space character while supporting arbitrary Unicode strings:
import sys

text = input("Enter a string (case is ignored)").casefold()  # default caseless matching

# count non-space character frequencies
counter = [0] * (sys.maxunicode + 1)
for nonspace in map(ord, ''.join(text.split())):
    counter[nonspace] += 1

# find the most common character
print(chr(max(range(len(counter)), key=counter.__getitem__)))
A similar list in Cython was the fastest way to find frequency of each character.

A function to count words in a corpora using dictionary values using Python

I'm a Python Newbie trying to get a count of words that occur within a corpora (corpora) using a dictionary of specific words. The corpora is a string type that has been tokenized, normalized, lemmatized, and stemmed.
dict = {}
dict['words'] = ('believe', 'tried', 'trust', 'experience')

counter = 0
Result = []
for word in corpora:
    if word in dict.values():
        counter = i + 1
    else counter = 0
This code produces a syntax error on the dict.values() line. Any help is appreciated!
Don't do dict = {}. dict is a built-in function and you are shadowing it. That's not critical here, but you won't be able to use the built-in if you need it later.
A dictionary is a key→value mapping, like a real dictionary (word → translation). What you did is say that the value ('believe', …), which is a tuple, corresponds to the key 'words' in your dictionary. Then you are using dict.values(), which gives you a sequence of all the values stored in the dictionary; in your case this sequence consists of exactly one item, and this item is a tuple. Your if condition will never be True: word is a string and dict.values() is a sequence consisting of a single tuple of strings.
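A short interactive illustration of that point (names assumed, mirroring the question):

>>> d = {}
>>> d['words'] = ('believe', 'tried', 'trust', 'experience')
>>> list(d.values())       # one value, and that value is a tuple
[('believe', 'tried', 'trust', 'experience')]
>>> 'trust' in d.values()  # compares against the whole tuple, never matches
False
>>> 'trust' in d['words']  # membership in the tuple itself does work
True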
I'm not really sure why you are using a dictionary. It seems that you've got a set of words that are important for you, and you are scanning your corpora and count the number of occurences of those words. The key word here is set. You don't need a dictionary, you need a set.
It is not clear, what you are counting. What's that i you are adding to the counter? If you meant to increment counter by one, that should be counter = counter + 1 or simply counter += 1.
Why are you resetting counter?
counter = 0
I don't think you really want to reset the counter when you find an unknown word. It seems that unknown words shouldn't change your counter; if so, just don't alter it.
Notes: try to avoid using upper-case letters in variable names (Result = [] is bad). Also, as others mentioned, you are missing a colon after else.
So, now let's put it all together. The first thing to do is to make a set of words we are interested in:
words = {'believe', 'tried', 'trust', 'experience'}
Next you can iterate over the words in your corpora and see which of them are present in the set:
for word in corpora:
    if word in words:
        # do something
It is not clear what exactly the code should do, but if your goal is to know how many times all the words in the set are found in the corpora all together, then you'll just add one to counter inside that if.
If you want to know how many times each word of the set appears in the corpora, you'll have to maintain a separate counter for every word in the set (that's where a dictionary might be useful). This can be achieved easily with collections.Counter (which is a special dictionary) and you'll have to filter your corpora to leave only the words you are interested in, that's where ifilter will help you.
filtered_corpora = itertools.ifilter(lambda w: w in words, corpora)
—this is your corpora with all the words not found in words removed. You can pass it to Counter right away.
This trick is also useful for the first case (i.e. when you need only the total count). You'll just return the length of this filtered corpora (len(filtered_corpora)).
You have multiple issues. You did not define corpora in the example here. You are redefining dict, which is a built-in type. The else is not indented correctly (and is missing a colon). dict.values() returns an iterable of the stored values, each of which is here a tuple, so word, a string, will never be inside it. It is not clear what counter actually counts. And what is Result doing there?
Your code may look similar to this (pseudo)code:
d = {'words': ('believe', 'tried', 'trust', 'experience')}  # if that's really what you want
counter = {}
for word in corpora:
    for tup in d.values():  # each tup is a tuple
        if word in tup:
            x = counter[word] if word in counter else 0
            counter[word] = x + 1
There is a shorter way to do it.
This task, of counting things, is so common that a specific class for doing so exists in the library: collections.Counter.
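A minimal sketch of that approach, with an assumed sample corpora for illustration:

from collections import Counter

words = {'believe', 'tried', 'trust', 'experience'}
corpora = ['i', 'believe', 'you', 'tried', 'and', 'tried']  # assumed sample

# count only the words of interest
counts = Counter(w for w in corpora if w in words)
print(counts)                # Counter({'tried': 2, 'believe': 1})
print(sum(counts.values()))  # total number of matches: 3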

How to find the combination of words that includes all the letters in the input with Python

I want to find the most efficient way to loop through the combination of letters that are entered in Python and return a set of words whose combination includes all the letters, if feasible.
Example:
Say user entered A B C D E. Goal is to find the least number of words that includes all the letters. In this case an optimum solution, in preference order, will be:
One word that has all 5 letters
Two words that have all 5 letters (can be a 4-letter word + a 1-letter word OR a 3-letter word + a 2-letter word; it does not make a difference)
....
etc.
If no match, then go back to 1. with n-1 letters, etc.
I have a function to check if a "combination of letters" (i.e. word) is in dictionary.
def is_in_lib(word):
    if word in lib:
        return word
    return False
The ideal answer should not involve generating every combination of those letters and searching for all of them. Searching through my dictionary is very costly, so I also need something that optimizes the time we spend searching through the dictionary.
IMPORTANT EDIT: The order matters and continuity is required. Meaning if the user enters "H", "T", "A", you cannot build "HAT".
Real Example: If the input is : T - H - G - R - A - C - E - K - B - Y - E " output should be "Grace" and "Bye"
You could create a string/list from the input letters, and iterate through them for every word in the word library:
inputstring = 'abcde'
for i in lib:
    is_okay = True
    for j in inputstring:
        if i.find(j) == -1:
            is_okay = False
    if is_okay:
        return i
I think the other cases (two words with 3+2 letters) can be implemented recursively, but it wouldn't be efficient.
I think the key idea here would be to have some kind of index providing a mapping from a canonical sequence of characters to actual words. Something like that:
# List of known words
>>> words = ('bonjour', 'jour', 'bon', 'poire', 'proie')
# Build the index
>>> import collections
>>> index = collections.defaultdict(list)
>>> for w in words:
...     index[''.join(sorted(w.lower()))].append(w)
...
This will produce an efficient way to find all the anagrams corresponding to a sequence of characters:
>>> index
defaultdict(<class 'list'>, {'joru': ['jour'], 'eiopr': ['poire', 'proie'], 'bjnooru': ['bonjour'], 'bno': ['bon']})
You could query the index that way:
>>> user_str = 'OIREP'
>>> index.get(''.join(sorted(user_str.lower())), "")
['poire', 'proie']
Of course, this will only find "exact" anagrams -- that is, those containing all the letters provided by the user. To find all the strings that match a subset of the user-provided string, you will have to remove one letter at a time and check again for each combination. I feel like recursion will help to solve that problem ;)
EDIT:
(should I put that on a spoiler section?)
Here is a possible solution:
import collections

words = ('bonjour', 'jour', 'bon', 'or', 'pire', 'poire', 'proie')

index = collections.defaultdict(list)
for w in words:
    index[''.join(sorted(w.lower()))].append(w)

# Recursively search all the words containing a sequence of letters
def search(letters, result=set()):
    # Assume "letters" ordered
    if not letters:
        return
    solutions = index.get(letters)
    if solutions:
        for s in solutions:
            result.add(s)
    for i in range(0, len(letters)):
        search(letters[:i] + letters[i+1:], result)
    return result

# Use case:
user_str = "OIREP"
s = search(''.join(sorted(user_str.lower())))
print(s)
Producing:
set(['poire', 'or', 'proie', 'pire'])
It is not that bad, but it could be improved, since the same subset of characters is examined several times. This is especially true if the user-provided search string contains several identical letters.
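One way to avoid re-examining the same subset, sketched here as an assumption on top of the code above: memoise on the (sorted) remaining letter string, so each distinct subset is expanded only once even when the input contains repeated letters.

def search_memo(letters, result=None, seen=None):
    # letters is assumed sorted, as in search() above
    if result is None:
        result, seen = set(), set()
    if not letters or letters in seen:
        return result
    seen.add(letters)
    result.update(index.get(letters, ()))
    for i in range(len(letters)):
        search_memo(letters[:i] + letters[i+1:], result, seen)
    return result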

Is there a faster way of looping through a set and replacing MWE in a sentence? - Python

The task is to group expressions that are made up of multiple words (aka Multi-Word Expressions).
Given a dictionary of MWEs, I need to add dashes to the input sentences where MWEs are detected, e.g.:
Input: i have got an ace of diamonds in my wet suit .
Output: i have got an ace-of-diamonds in my wet-suit .
Currently I loop through the sorted dictionary, check whether each MWE appears in the sentence, and replace it wherever it appears. But there are a lot of wasted iterations.
Is there a better way of doing so? One solution is to produce all possible n-grams first, i.e. chunker2():
import re, time, codecs

mwe_list = set([i.strip() for i in codecs.open(
    "wn-mwe-en.dic", "r", "utf8").readlines()])

def chunker(sentence):
    for item in mwe_list:
        if item in sentence or item.replace("-", " ") in sentence:
            #print item
            mwe_item = '-'.join(item.split(" "))
            r = re.compile(re.escape(mwe_item).replace('\\-', '[- ]'))
            sentence = re.sub(r, mwe_item, sentence)
    return sentence

def chunker2(sentence):
    nodes = []
    tokens = sentence.split(" ")
    for i in range(0, len(tokens)):
        for j in range(i, len(tokens)):
            nodes.append(" ".join(tokens[i:j]))
    n = sorted(set([i for i in nodes if i and len(i.split(" ")) > 1]))
    intersect = mwe_list.intersection(n)
    for i in intersect:
        print i
        sentence = sentence.replace(i, i.replace(" ", "-"))
    return sentence

s = "i have got an ace of diamonds in my wet suit ."

time.clock()
print chunker(s)
print time.clock()

time.clock()
print chunker2(s)
print time.clock()
I'd try doing it like this:
For each sentence, construct a set of n-grams up to a given length (the longest MWE in your list).
Now, just do mwe_ngrams.intersection(sentence_ngrams) and search/replace them.
You won't have to waste time by iterating over all of the items in your original set.
Here's a slightly faster version of chunker2:
def chunker3(sentence):
    tokens = sentence.split(' ')
    len_tokens = len(tokens)
    nodes = set()
    for i in xrange(0, len_tokens):
        for j in xrange(i, len_tokens):
            chunks = tokens[i:j]
            if len(chunks) > 1:
                nodes.add(' '.join(chunks))
    intersect = mwe_list.intersection(nodes)
    for i in intersect:
        print i
        sentence = sentence.replace(i, i.replace(' ', '-'))
    return sentence
First, a 2x improvement: because you are replacing the MWEs with hyphenated versions, you can pre-process the dictionary (wn-mwe-en.dic) to eliminate all hyphens from the MWEs in the set, eliminating one string comparison. If you allow hyphens within the sentence, then you'll have to pre-process it as well, presumably online, for a minor penalty. This should cut your runtime in half.
Next, a minor improvement: immutable tuples are generally faster to iterate over than a set or list (which are mutable, so the iterator has to check for movement of elements in memory with each step). The set() conversion will eliminate duplicates, as you intend. The tuple bit will firm it up in memory, allowing low-level iteration optimizations by the Python interpreter and its compiled libs.
Finally, you should probably parse both the sentence and the MWEs into words or tokens before doing all your comparisons; this would cut down on the number of string comparisons required by the average length of your words (4x if your words are 4 characters long on average). You'd also be able to nest another loop to search for the first word in the MWE as an anchor for all MWEs that share that first word, reducing the length of the string comparisons required. But I'll leave this lion's share for your experimentation on real data. And depending on interpreter vs. compiled-lib efficiency, doing all this splitting and nested looping at the Python level may actually slow things down.
So here's the result of the first two easy "sure" bets. Should be 2x faster despite the preprocessing, unless your sentence is very short.
mwe_list = set(i.strip() for i in codecs.open("wn-mwe-en.dic", "r", "utf8").readlines())
mwe_list = tuple(mwe.replace('-', ' ').strip() for mwe in mwe_list)
sentence = sentence.replace('-', ' ').strip()

def chunker(sentence):
    for item in mwe_list:
        if item in sentence:
            ...
Couldn't find a .dic file on my system or I'd profile it for you.
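And a hedged sketch of the "first word as anchor" idea mentioned above, assuming the de-hyphenated mwe_list and a whitespace-tokenised sentence (function and variable names are illustrative, not from the original answer):

from collections import defaultdict

# Group MWEs by their first word so each sentence token only triggers
# comparisons against the MWEs sharing that anchor word.
mwe_by_head = defaultdict(list)
for mwe in mwe_list:
    mwe_by_head[mwe.split(' ', 1)[0]].append(mwe.split(' '))

def chunker_anchored(sentence):
    tokens = sentence.split(' ')
    for i, tok in enumerate(tokens):
        for parts in mwe_by_head.get(tok, ()):
            # compare whole-token spans instead of raw substrings
            if tokens[i:i + len(parts)] == parts:
                sentence = sentence.replace(' '.join(parts), '-'.join(parts))
    return sentence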
