I am trying to search two lists for common strings. The first list is a file with text, while the second is a list of words, each preceded by its logarithmic probability. To match, a word not only needs to be in both lists, but also have a certain minimal log probability (for instance, between -2.123456 and 0.000000; that is, from negative 2 up to 0). The tab-separated list can look like:
-0.962890 dog
-1.152454 lol
-2.050454 cat
I got stuck doing something like this:
common = []
for i in list1:
    if i in list2 and re.search("\-[0-1]\.[\d]+", list2):
        common.append(i)
The idea of simply preprocessing the list to remove lines under a certain threshold is valid of course, but since both the word and its probability are on the same line, isn't a condition also possible? (Regexps aren't necessary; for comparison, solutions both with and without them would be interesting.)
EDIT: own answer to this question below.
Assuming your list contains strings such as "-0.744342 dog", and my_word_list is a list of strings, how about this:
worddict = dict(map(lambda x: (x, True), my_word_list))

import re
for item in my_list:
    parts = re.split(r"\s+", item)
    if len(parts) != 2:
        raise ValueError("Wrong data format")
    logvalue, word = float(parts[0]), parts[1]
    if logvalue > THRESHOLD and word in worddict:
        print("Hurrah, a match!")
Note the first line, which makes a dictionary out of your list of words. If you don't do this, you'll waste a lot of time doing linear searches through your word list, causing your algorithm to have a time complexity of O(n*m), while the solution I propose is much closer to O(n+m), n being the number of lines in my_list, m being the number of words in my_word_list.
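For comparison, here is a minimal sketch of the same constant-time lookup idea using a set instead of a dictionary of dummy True values (the example data here is made up):
my_word_list = ["dog", "lol", "cat"]   # made-up example data
word_set = set(my_word_list)           # build once: O(m)

print("dog" in word_set)    # each membership test is O(1) on average -> True
print("bird" in word_set)   # -> False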
Here's my solution without using regex. First create a dictionary of the words whose log probabilities are within the accepted range, then check whether each word in the text is in that dict.
word_dict = {}
with open('probability.txt', 'r') as prob, open('text.txt', 'r') as textfile:
    for line in prob:
        if -2.123456 < float(line.split()[0]) < 0:
            word_dict[line.split()[1]] = line.split()[0]
    for line in textfile:
        for word in line.split():
            if word in word_dict:
                print('MATCH, {}: {}'.format(word, word_dict[word]))
Answering my own question after hours of trial and error, and reading tips from here and there. It turns out I was thinking in the right direction from the start, but needed to separate word detection from pattern matching, and instead combine the latter with the log probability check: first create a temporary list of items with the needed log prob, then just compare that against the text file.
import re

common = []
prob = []
loga, rithmus = -9.87, -0.01
# keep only the "probability word" entries whose log probability is in the wanted range
for i in re.findall(r"-\d\.\d+\s+\w+", list2):
    if loga < float(i.split()[0]) < rithmus:
        prob.append(i)
prob = "\n".join(prob)
for i in list1:
    if i in prob:
        common.append(i)
I've already made a post about this, but since then I've managed to solve the issues I had at first, and just editing the old question would only make things more complicated.
I have a text file with about 10'000 words. The output of the function should be a list of tuples, each holding a word and the number of occurrences of that word, in descending order. For example:
out = [("word1",10),("word3",8),("word2",5)...]
So this is my code so far (keep in mind, this currently does work to a certain extent; it is just extremely inefficient):
def text(inp):
    with open(inp, "r") as file:
        content = file.readlines()
    delimiters = ["\n", " ", ",", ".", "?", "!", ":", ";", "-"]
    words = content
    spaces = ["", "'", '']
    out = []
    for delimiter in delimiters:
        new_words = []
        for word in words:
            if word in spaces:
                continue
            new_words += word.split(delimiter)
        words = new_words
    for word in words:
        x = (words.count(word), word)
        out.append(x)
    return out
I found some help from older posts on Stack Overflow for the first few lines.
The input should be the file path, and this works in my case. The first part (the lines I found on here) works nicely, although there are elements in the list such as empty strings. My questions now are:
How can I sort the output so that the word with the most occurrences comes first, continuing in descending order? Currently the order is random. Also, I'm not sure whether the same word can come up multiple times in this list. If it can, how do I make it occur only once in the output?
Also, how can I make this code more efficient? I used time.time() to check, and it took almost 419 seconds, which is terribly inefficient, since the task states that it should take less than 30 seconds.
I apologize in advance for any mistakes I made and my lack of knowledge on this.
Instead of running so many loops and conditions, you can use re.split:
import re

def text(inp):
    with open(inp, "r") as file:
        content = file.readlines()
    # delimiters = ["\n"," ",",",".","?","!",":",";","-"]
    words = content
    spaces = ["", "'", '']
    out = []
    temp_list = []
    for word in words:
        # using re.split with a character class covering all delimiters
        temp_list.extend(re.split(r'[\n, .?!:;-]', word))
    for word in set(temp_list):
        x = (word, temp_list.count(word))
        out.append(x)
    return out
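For comparison, a rough sketch of the same task using collections.Counter, which also answers the sorting question, since most_common() returns (word, count) pairs in descending order of count; the function name and the delimiter regex are my own choices here:
import re
from collections import Counter

def text_counted(inp):
    with open(inp, "r") as file:
        content = file.read()
    # split on the same delimiters as above, in one pass
    words = re.split(r"[\n ,.?!:;'-]+", content)
    counts = Counter(w for w in words if w)   # skip empty strings
    return counts.most_common()               # [('word1', 10), ('word3', 8), ...]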
I have a simple list of words I need to filter, but each word in the list has an accompanying "score" appended to it which is causing me some trouble. The input list has this structure:
lst = ['FAST;5','BREAK;60','FASTBREAK;40',
'OUTBREAK;110','BREAKFASTBUFFET;35',
'BUFFET;75','FASTBREAKPOINTS;60'
]
I am trying to figure out how to identify words in my list that are compounded solely from other words on the same list. For example, the code applied to lst above would produce:
ans = ['FASTBREAK;40','BREAKFASTBUFFET;35']
I found a prior question that deals with a nearly identical situation, but in that instance there are no trailing scores attached to the words, and it is these trailing scores that are giving me trouble. The ans list must keep the scores with the compound words found. The order of the words in lst is random and irrelevant. Ideally, I would like the ans list to be sorted by the length of the word (the part before the ';'), as shown above. This would save me some additional post-processing on ans.
I have figured out a way that works using regex and nested for loops (I will spare you the ugliness of my 1980s-esque brute force code, it's really not pretty), but my word list has close to a million entries, and my solution takes so long as to be completely unusable. I am looking for a solution a little more Pythonic that I can actually use, and I'm having trouble working through it.
Here is some code that does the job. I'm sure it's not perfect for your situation (with a million entries), but perhaps can be useful in parts:
#!/usr/bin/env python
from collections import namedtuple

Word = namedtuple("Word", ("characters", "number"))

separator = ";"

lst = [
    "FAST;5",
    "BREAK;60",
    "FASTBREAK;40",
    "OUTBREAK;110",
    "BREAKFASTBUFFET;35",
    "BUFFET;75",
    "FASTBREAKPOINTS;60",
]

words = [Word(*w.rsplit(separator, 1)) for w in lst]

def findparts(oword, parts):
    if len(oword.characters) == 0:
        return parts
    for iword in words:
        if not parts and iword.characters == oword.characters:
            continue
        if iword.characters in oword.characters:
            parts.append(iword)
            characters = oword.characters.replace(iword.characters, "")
            return findparts(Word(characters, oword.number), parts)
    return []

ans = []
for word in words:
    parts = findparts(word, [])
    if parts:
        ans.append(separator.join(word))

print(ans)
It uses a recursive function that takes a word in your list and tries to assemble it with other words from that same list. This function will also present you with the actual atomic words forming the compound one.
It's not very smart, however. Here is an example of a composition it will not detect:
[BREAKFASTBUFFET, BREAK, BREAKFAST, BUFFET].
It uses a small detour using a namedtuple to temporarily separate the actual word from the number attached to it, assuming that the separator will always be ;.
I don't think regular expressions hold an advantage over a simple string search here.
If you know some more conditions about the composition of the compound words, like for instance the maximum number of components, the itertools combinatoric generators might help you to speed things up significantly and avoid missing the example given above too.
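To make the itertools idea a bit more concrete, here is a hedged sketch (the function name, the max_parts cap, and the restriction to substring candidates are assumptions of mine, and it only makes sense for fairly small candidate sets):
from itertools import permutations

def is_composed(target, vocabulary, max_parts=3):
    # only words that actually occur inside the target can be components
    candidates = [w for w in vocabulary if w in target and w != target]
    for n in range(2, max_parts + 1):
        for combo in permutations(candidates, n):
            if "".join(combo) == target:
                return combo
    return None

# finds ('BREAKFAST', 'BUFFET'), a composition the recursive version above misses:
print(is_composed("BREAKFASTBUFFET", ["BREAK", "BREAKFAST", "BUFFET", "FAST"]))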
I think I would do it like this: make a new list containing only the words. In a for loop, go through this list, and within it look for the words that are part of the word of the outer loop. If they are found, replace the found part with an empty string. If afterwards the entire word has been replaced by an empty string, show the word at the corresponding index of the original list. A minimal sketch of that idea follows below.
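This is my reconstruction of that naive first version, since the original code is not shown here; the EDITs below discuss where it breaks down:
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'BUFFET;75', 'FASTBREAKPOINTS;60']
words = [s.split(';')[0] for s in lst]

for i, word in enumerate(words):
    remainder = word
    for j, other in enumerate(words):
        if i != j and other in word:
            remainder = remainder.replace(other, '')
    if remainder == '':
        print(lst[i])   # prints FASTBREAK;40 and BREAKFASTBUFFET;35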
EDIT: As was pointed out in the comments, there could be a problem with this code in some situations, like this one: lst = ["BREAKFASTBUFFET;35", "BREAK;60", "BREAKFAST;18", "BUFFET;75"]. In BREAKFASTBUFFET it first finds that BREAK is a part of it, so it replaces that one with an empty string, which prevents BREAKFAST from being found. I hope that problem can be tackled by sorting the list descending by length of the word.
EDIT 2
My former edit was not flaw-proof; for instance, if there were a word BREAKFASTEN, it shouldn't be "eaten" by BREAKFAST. This version does the following:
make a list of candidates: all words that are part of the word under investigation
make another list of words that the word starts with
keep track of the words in the candidates list that you've already tried
in a while loop, keep trying until either the start list is empty or you've successfully replaced the whole word with candidates
lst = ['FAST;5', 'BREAK;60', 'FASTBREAK;40',
       'OUTBREAK;110', 'BREAKFASTBUFFET;35',
       'POINTS;25',
       'BUFFET;75', 'FASTBREAKPOINTS;60', 'BREAKPOINTS;15'
      ]

lst2 = [s.split(';')[0] for s in lst]

for i, word in enumerate(lst2):
    # candidates: words that are part of the current word
    candidates = [x for i2, x in enumerate(lst2) if x in word and i != i2]
    if len(candidates) > 0:
        tried = []
        word2 = word
        found = False
        while not found:
            # start: subset of candidates that the current word starts with
            start = [x for x in candidates if word2.startswith(x) and x not in tried]
            for trial in start:
                word2 = word2.replace(trial, '')
                tried.append(trial)
                if len(word2) == 0:
                    print(lst[i])
                    found = True
                    break
            if len(candidates) > 1:
                candidates = candidates[1:]
                word2 = candidates[0]
            else:
                break
There are several ways of speeding up the process but I doubt there is a polynomial solution.
So let's use multiprocessing, and do what we can to generate a meaningful result. The sample below is not identical to what you are asking for, but it does compose a list of apparently compound words from a large dictionary.
For the code below, I am sourcing https://gist.github.com/h3xx/1976236 which lists about 80,000 unique words in order of frequency in English.
The code below can easily be sped up if the input wordlist is sorted alphabetically beforehand, as each head of a compound will be immediately followed by its potential compound members:
black
blackberries
blackberry
blackbird
blackbirds
blackboard
blackguard
blackguards
blackmail
blackness
blacksmith
blacksmiths
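To make that sorted-list speedup concrete, here is a hedged sketch using bisect; the function name and the small sample word list are made up, and in the real script the sorted word list produced by load() below would be passed in instead:
import bisect

def compounds_of_sorted(word, sorted_values, value_set):
    # every string starting with `word` sits in one contiguous slice of the sorted list
    lo = bisect.bisect_right(sorted_values, word)
    hi = bisect.bisect_left(sorted_values, word + "\U0010ffff")
    return [w for w in sorted_values[lo:hi] if w[len(word):] in value_set]

values = sorted(["black", "berries", "berry", "bird", "blackberries",
                 "blackberry", "blackbird", "board"])
print(compounds_of_sorted("black", values, set(values)))
# ['blackberries', 'blackberry', 'blackbird'] -- found without scanning the whole list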
As mentioned in the comment, you may also need to use a semantic filter to identify true compound words; for instance, the word 'generally' isn't a compound built from 'gene' and 'rally'! So, while you may get a list of contenders, you will need to eliminate false positives somehow.
# python 3.9
import multiprocessing as mp

# returns an ordered list of lowercase words to be used.
def load(name) -> list:
    return [line[:-1].lower() for line in open(name) if not line.startswith('#') and len(line) > 3]

# function that identifies the compounds of a word from a list.
# ... can be optimised if using a sorted list.
def compounds_of(word: str, values: list):
    return [w for w in values if w.startswith(word) and w.removeprefix(word) in values]

# apply compound finding across an mp environment
# but this is the slowest part
def compose(values: list) -> dict:
    with mp.Pool() as pool:
        result = {(word, i): pool.apply(compounds_of, (word, values)) for i, word in enumerate(values)}
    return result

if __name__ == '__main__':
    # https://gist.github.com/h3xx/1976236
    words = load('wiki-100k.txt')  # words are ordered by popularity, are 3 or more letters, and lowercase.
    words = list(dict.fromkeys(words))
    # keep only those word heads which have more than 3 tails
    compounds = {k: v for k, v in compose(words).items() if len(v) > 3}
    # get the top 500 keys
    rank = list(sorted(compounds.keys(), key=lambda x: x[1]))[:500]
    # compose them into a dict and print
    tops = {k[0]: compounds[k] for k in rank}
    print(tops)
I am trying to solve this problem:
You are given n words. Some words may repeat. For each word, output its number of occurrences. The output order should correspond with the input order of appearance of the word. See the sample input/output for clarification.
Note: Each input line ends with a "\n" character.
Input Format
The first line contains the integer, n.
The next n lines each contain a word.
Output Format
Output 2 lines. On the first line, output the number of distinct words from the input. On the second line, output the number of occurrences for each distinct word according to their appearance in the input.
I have implemented a solution like this,
# Enter your code here. Read input from STDIN. Print output to STDOUT
n = int(input())
mySet = set()
myDict = {}
for i in range(n):
    inp = input()[:-1]
    if inp not in mySet:
        mySet.add(inp)
        myDict[inp] = 1
    else:
        myDict[inp] += 1
print(len(mySet))
# print(' '.join(list(map(str, myDict.values()))))
print(*myDict.values())
My strategy is as follows:
If the word is not in mySet, add it to the set and create a key-value pair in myDict with the word as key and 1 as value.
If the word is in the set already, then increment the value of that word in the dictionary.
However, half of the test cases are successful, while the rest are "Wrong Answer". So I wonder, can anybody point out what I am missing in this code?
Your mistake is in inp = input()[:-1]. The [:-1] cuts off the word's last character. I guess you're trying to remove the newline character, but input() is already "stripping a trailing newline". Demo:
>>> [input() for _ in range(2)]
foo
bar
['foo', 'bar']
Simpler solution, btw:
from collections import Counter
ctr = Counter(input() for _ in range(int(input())))
print(len(ctr))
print(*ctr.values())
And for fun a tricky version of that (also gets accepted):
from collections import Counter
ctr = Counter(map(input, [''] * int(input())))
print(len(ctr))
print(*ctr.values())
Another one:
from collections import Counter
import sys
next(sys.stdin)
ctr = Counter(map(str.strip, sys.stdin))
print(len(ctr))
print(*ctr.values())
This one reads the whole lines, so here the strings do include the newline character. That wouldn't matter if all lines had it, but no, HackerRank commits the cardinal sin of not ending the last line with a newline. So I strip them off every line. Sigh.
In this problem you don't need the set at all, and the way you are using input, i.e. input()[:-1], is wrong, as it cuts off the last character of the word you entered; for the input abc it stores ab.
Now, to solve the problem: as the OP has figured out, a set gives the unique items and a dictionary gives the frequency of each word. But the set is an unnecessary extra cost, since a dictionary alone already provides the unique words (its keys are unique), and the frequency is obtained with the same operation as before.
Instead of adding a key-value pair with dict[key] = value, you can also update the dictionary with the built-in method dict1.update(dict2), where dict1 is the original dictionary and dict2 is a new dictionary with the key and value, dict2 = {key: value}; this way the order of the elements in the dict is kept.
To get the number of unique words, take the length of the dictionary, len(dic); to get the frequency of each word, print its values: print(*dic.values()).
n = int(input())
dic = {}
for _ in range(n):
    w = input()
    if w in dic.keys():
        dic[w] += 1
    else:
        dic.update({w: 1})
print(len(dic.keys()))
print(*dic.values())
n = int(input())
word_list = []
for i in range(n):
    word_list.append(input())

New_word_list = []
for element in word_list:
    if element in New_word_list:
        New_word_list = New_word_list
    else:
        New_word_list.append(element)

print(len(New_word_list))
for element in New_word_list:
    print(word_list.count(element), end=" ")
Input:
10
cat
dog
dog
rabbit
cat
pig
horse
pig
pig
goat
Results:
6
2 2 1 3 1 1
from collections import defaultdict

def foo(self, n):
    wordFreqDict = defaultdict(int)
    resultList = []
    for sentence in self.allSentences:
        for word in self.wordSet:
            if word in sentence:
                wordFreqDict[word] += 1
    for candidateWord in wordFreqDict:
        if wordFreqDict[candidateWord] >= n:
            resultList.append(candidateWord)
    return resultList
My function basically returns the list of words such that a word is in the list iff it occurs in at least n sentences. What I am doing is iterating through self.allSentences (a list of sentences, each a list of words) and through self.wordSet (all unique words in the corpus), and adding the frequency of each word to the dictionary (wordFreqDict).
Then I loop through wordFreqDict and add every word with frequency >= n to resultList.
I am assuming that this will work, but it's taking too long to check the result.
Is there a way to make it more efficient and shorten the computation time?
EDIT:
Here is how the self.allSentences is computed
def storeAllSentences(self, corpus):
    for sentence in corpus:
        self.allSentences.append(sentence)
and the self.wordSet:
def getUniqueWordSet(self, corpus):
    for sentence in self.allSentences:
        for word in sentence:
            self.wordSet.append(word)
    self.wordSet = set(self.wordSet)
An assumption here is that a word is chosen if it uniquely appears in at least n sentences. So, even if it occurs 10 times in a single sentence, that's still once.
A couple of glaring issues are your nested loops and your in check on a string. This is effectively cubic in complexity. It should be possible to reduce this drastically using a set + Counter.
from collections import Counter
from itertools import takewhile

def foo(self, n):
    c = Counter()
    for sent in self.allSentences:
        c.update(set(sent))
    resultSet = list(takewhile(lambda x: x[1] >= n, c.most_common()))
    return resultSet
Here, each sentence is reduced to its set of unique words before the Counter is updated (a regex-based variant that strips special characters and punctuation before splitting is shown below).
Finally, extract all words of the desired frequency (or more) using itertools.takewhile and return.
If sent is a string sentence, you might use re-based filtering to remove punctuation and then split:
import re

tempWordSet = set(re.sub(r'[^\w\s]+', '', sent).split())
c.update(tempWordSet)
This does not take into consideration whether a word belongs to self.wordSet or not. So if you want that as well, you might modify the first loop a bit to include a filter step:
c.update(set(filter(lambda x: x in self.wordSet, sent)))
Or, using the second technique:
tempWordSet = set(filter(lambda x: x in self.wordSet,
                         re.sub(r'[^\w\s]+', '', sent).split()))
On an unrelated note, are you trying to perform text mining? You might be interested in looking into TFIDF.
The set wordSet is probably at least two orders of magnitude larger than the words in a single sentence. Therefore it makes sense to loop over the words in a sentence instead. However, this requires that splitting a sentence is not a really slow operation. If it is, you should do the whole process in getUniqueWordSet. Here only the first for-loop is changed:
def foo(self, n):
    wordFreqDict = defaultdict(int)
    resultList = []
    for sentence in self.allSentences:
        for word in sentence:
            # This check may be left out if allSentences is guaranteed to stay the same after storing them
            if word in self.wordSet:
                wordFreqDict[word] += 1
    for candidateWord in wordFreqDict:
        if wordFreqDict[candidateWord] >= n:
            resultList.append(candidateWord)
    return resultList
I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations on approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement one in Python.
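For reference, here is a minimal sketch of the string trie idea mentioned above; the function names are made up, and the answers below get by with a plain set instead:
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = word   # mark the end of a complete word
    return root

def prefixes_in_trie(s, root):
    # yield every dictionary word that is a prefix of s
    node = root
    for ch in s:
        if ch not in node:
            return
        node = node[ch]
        if '$' in node:
            yield node['$']

trie = build_trie({"car", "cart", "trading"})
print(list(prefixes_in_trie("cartrading", trie)))   # ['car', 'cart']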
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, like in ["example", "cart", ...])
Then we build the English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []

domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()

    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    # all_sub_strings is assumed to yield every substring of domain (not defined here)
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

from collections import Counter
c = Counter(found)  # this is what you want
print c
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever. Fortunately I guess you'll only need 3 or 4 splits max. A sketch of a multi-split extension follows after the output below.
output:
deals: 1
example: 2
pensions: 1
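A hedged sketch of that multi-split extension, with memoisation so repeated suffixes aren't re-analysed (Python 3 syntax; the word set here is made-up example data rather than the system dictionary used above):
from functools import lru_cache

word_set = frozenset({"example", "summer", "offers", "deals", "pensions", "car", "trading"})

@lru_cache(maxsize=None)
def split_all(s):
    # return one way of splitting s entirely into dictionary words, or () if none exists
    if s in word_set:
        return (s,)
    for i in range(1, len(s)):
        head, tail = s[:i], s[i:]
        if head in word_set:
            rest = split_all(tail)
            if rest:
                return (head,) + rest
    return ()

print(split_all("examplesummeroffers"))   # ('example', 'summer', 'offers')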