How to optimize word_count in python

I am given n words (1 ≤ n ≤ 10^5). Some words may repeat. For each distinct word, I have to output its number of occurrences, and the output order must match the order in which the words first appear.
I have a working program for this problem, but it times out on large inputs. Here is my solution:
n=int(input())
l=[]
ll=[]
for x in range(n):
    l.append(raw_input())
    if l[x] not in ll:
        ll.append(l[x])
result = [ l.count(ll[x]) for x in range(len(ll)) ]
for x in range(len(result)):
    print result[x],

Use an ordered counter by subclassing OrderedDict and Counter:
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
    pass
counts = OrderedCounter(['b', 'c', 'b', 'b', 'a', 'c'])
for k, c in counts.items():
    print(k, c)
Which prints:
b 3
c 2
a 1
See the documentation for the collections module for a more complete recipe which includes a __repr__ for OrderedCounter.
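On Python 3.7 and later, plain dicts preserve insertion order and Counter is a dict subclass, so a plain Counter may already be enough here. A minimal sketch, assuming that interpreter version (the variable names are illustrative):

```python
from collections import Counter

words = ['b', 'c', 'b', 'b', 'a', 'c']
counts = Counter(words)  # keys stay in first-seen order on Python 3.7+

for word, count in counts.items():
    print(word, count)
```

On older interpreters the OrderedCounter subclass above remains the safer choice.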

The easiest way to count items in Python is to use a Counter from the collections module.
Assuming you have a list of items in the order that you expect, passing it to a Counter should suffice:
import collections
c = collections.Counter(['foo', 'bar', 'bar'])
print(c['bar']) # Will print 2
If words is the list of words you retrieved from the user and counter = collections.Counter(words), you can iterate over words to print the values in order of first appearance:
seen = set()
for elem in words:
    if elem not in seen:
        print(counter[elem])
        seen.add(elem)

Take a look at collections.OrderedDict. It can handle this for you, and it removes the cost of the linear membership test that using a list imposes:
import collections
n = int(input())
l = []
ll = collections.OrderedDict()
for x in range(n):
    v = raw_input()
    l.append(v)
    ll[v] = None # If v already in OrderedDict, does nothing, otherwise, appends
ll = list(ll) # Can convert back to list when you're done if you like
If you need the count, you can make a custom class based on OrderedDict that both handles counts and remains ordered.
class OrderedCounter(collections.OrderedDict):
    def __missing__(self, key):
        return 0
Then change ll to an OrderedCounter, and ll[v] = None to ll[v] += 1. At the end, ll will have the ordered words with their counts; l isn't even needed:
for word, count in ll.items():
    print(word, count)
The final code would simplify to just (omitting imports and class definition):
n = int(input())
word_counts = OrderedCounter()
for x in range(n):
    word_counts[raw_input()] += 1
for cnt in word_counts.values():
    print cnt,
Much simpler, right?
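For Python 3, where raw_input and the print statement no longer exist, the same idea might look like the sketch below. The function name count_in_order, the sample input, and the use of a text stream instead of interactive input are assumptions made for illustration, not part of the answer above:

```python
import io
from collections import OrderedDict


class OrderedCounter(OrderedDict):
    """OrderedDict whose missing keys read as 0, as in the answer above."""

    def __missing__(self, key):
        return 0


def count_in_order(stream):
    """Read n, then n words (one per line); return counts in first-seen order."""
    n = int(stream.readline())
    word_counts = OrderedCounter()
    for _ in range(n):
        word_counts[stream.readline().strip()] += 1
    return list(word_counts.values())


# Hypothetical sample input; in a real run, pass sys.stdin instead.
sample = io.StringIO("4\nbcdef\nabcdefg\nbcde\nbcdef\n")
print(*count_in_order(sample))  # prints: 2 1 1
```

Reading from a stream rather than calling input() repeatedly also tends to be faster for inputs near the 10^5 limit.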

Related

Select most frequent string in a list, if 'n' strings have the same frequency count then compare alphabetically the first letter of each string

I have a list:
Fruit_list = ['apples','oranges','peaches','peaches','watermelon','oranges','watermelon']
Want to output:
print(most_frequent(Fruit_list))
which should print out "oranges"
I want to find the most frequent string in the list. The 3 most frequent items are 'oranges', 'peaches' and 'watermelon', each appearing twice. However, I want to select 'oranges', as 'o' comes before 'p' and 'w' in the alphabet.

from collections import Counter
fruits = ['apples','oranges','peaches','peaches','watermelon','oranges','watermelon']
counter = Counter(fruits)
sorted_fruits = sorted(counter.items(), key=lambda tpl: (-tpl[1], tpl[0]))
print(sorted_fruits[0][0])
Output:
oranges

I think you're looking for a function like this:
def most_frequent(l):
    return max(sorted(l, key=str.lower), key=l.count)
Fruit_list = ['apples','oranges','peaches','peaches','watermelon','oranges','watermelon']
print(most_frequent(Fruit_list)) # outputs "oranges"
... if you don't want to use Counter.
To clarify:
sorted(l, key=str.lower) sorts the list l lexicographically.
max(<>, key=l.count) gets the mode of the sorted list.
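Because l.count rescans the list for every element, the version above does quadratic work. A possible single-pass alternative is sketched here with Counter and a tuple key (highest count first, then alphabetical order); the name most_frequent_fast is made up for this illustration:

```python
from collections import Counter


def most_frequent_fast(items):
    """Return the most common item; ties go to the alphabetically first one."""
    counts = Counter(items)
    # Sort key: negative count (so larger counts win), then the word itself.
    return min(counts, key=lambda word: (-counts[word], word.lower()))


fruit_list = ['apples', 'oranges', 'peaches', 'peaches',
              'watermelon', 'oranges', 'watermelon']
print(most_frequent_fast(fruit_list))  # oranges
```

Like max, min raises ValueError on an empty list, so callers may want to guard against that case.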

Did you try the following? It lists the three most common words:
from collections import Counter
words = ['apples','oranges','peaches','peaches','watermelon','oranges','watermelon']
most_common_words= [word for word, word_count in Counter(words).most_common(3)]
most_common_words

from collections import Counter
Fruit_list = ['apples','zranges','peaches','peaches','watermelon','zranges','watermelon']
max_counter = 0
min_ret = "z"
my_dict = dict(Counter(Fruit_list))
for items in my_dict.keys():
    if my_dict[items] > max_counter:
        max_counter = my_dict[items]
        min_ret = items
    if my_dict[items] == max_counter:
        if items < min_ret:
            min_ret = items
print(min_ret)

Filter a list of strings by frequency

I have a list of strings:
a = ['book','book','cards','book','foo','foo','computer']
I want to return every item in this list that appears more than twice (count > 2).
Final output:
a = ['book','book','book']
I'm not quite sure how to approach this, but here are two methods I had in mind:
Approach One:
I've created a dictionary to count the number of times an item appears:
a = ['book','book','cards','book','foo','foo','computer']
from collections import defaultdict
def update_item_counts(item_counts, itemset):
    for a in itemset:
        item_counts[a] += 1
test = defaultdict(int)
update_item_counts(test, a)
print(test)
Out: defaultdict(<class 'int'>, {'book': 3, 'cards': 1, 'foo': 2, 'computer': 1})
I want to filter out the list with this dictionary but I'm not sure how to do that.
Approach two:
I tried to write a list comprehension but it doesn't seem to work:
res = [k for k in a if a.count > 2 in k]

A bare-bones answer is that you should replace a.count with a.count(k), and drop the trailing in k, in your second solution.
However, do not use list.count for this, as it traverses the whole list for each item. Instead, count occurrences first with collections.Counter, which traverses the list only once.
from collections import Counter
from itertools import repeat
a = ['book','book','cards','book','foo','foo','computer']
count = Counter(a)
output = [word for item, n in count.items() if n > 2 for word in repeat(item, n)]
print(output) # ['book', 'book', 'book']
Note that the list comprehension is equivalent to the loop below.
output = []
for item, n in count.items():
    if n > 2:
        output.extend(repeat(item, n))
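The Counter-based version above groups equal items together in the output. If the original positions should be kept instead, one possible variant (a sketch, with illustrative variable names and a reordered sample list) filters the original list against the precomputed counts:

```python
from collections import Counter

a = ['book', 'foo', 'book', 'cards', 'book', 'foo', 'computer']
count = Counter(a)  # one pass to count every item

# Keep each element whose total count exceeds 2; each lookup is O(1).
output = [word for word in a if count[word] > 2]
print(output)  # ['book', 'book', 'book']
```

This still touches each element only twice, once for counting and once for filtering.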

Try this:
a_list = ['book','book','cards','book','foo','foo','computer']
b_list = []
for a in a_list:
    if a_list.count(a) > 2:
        b_list.append(a)
print(b_list)
# ['book', 'book', 'book']
Edit: You mentioned list comprehension. You are on the right track! You can do it with list comprehension like this:
a_list = ['book','book','cards','book','foo','foo','computer']
c_list = [a for a in a_list if a_list.count(a) > 2]
Good luck!

a = ['book','book','cards','book','foo','foo','computer']
list(filter(lambda s: a.count(s) > 2, a))

Your first attempt builds a dictionary with all of the counts. You need to take this a step further to get the items that you want:
res = [k for k in test if test[k] > 2]
Now that you have built this by hand, you should check out the Counter class in the collections module, which does all of this work for you.

If you just want to print the result, there are better answers already. If you want to remove the infrequent items from the list in place, you can try this:
a = ['book','book','cards','book','foo','foo','computer']
countdict = {}
for word in a:
    if word not in countdict:
        countdict[word] = 1
    else:
        countdict[word] += 1
for x, y in countdict.items():
    if (2 >= y):
        for i in range(y):
            a.remove(x)

You can try this.
def my_filter(my_list, my_freq):
    '''Filter a list of strings by frequency'''
    # use set() to unique my_list, then turn set back to list
    unique_list = list(set(my_list))
    # count frequency in unique_list
    frequencies = []
    for value in unique_list:
        frequencies.append(my_list.count(value))
    # filter frequency
    return_list = []
    for i, frequency in enumerate(frequencies):
        if frequency > my_freq:
            for _ in range(frequency):
                return_list.append(unique_list[i])
    return return_list
a = ['book','book','cards','book','foo','foo','computer']
my_filter(a, 2)
['book', 'book', 'book']

How to standardize the format of element in the list from big data

I am trying to count the unique values in the following list without using collections:
('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
The output which I require is :
('TOILET':2,'AIR CONDITIONiNGS':3)
My code currently is
for i in Data:
    if i in number:
        number[i] += 1
    else:
        number[i] = 1
print number
Is it possible to get the output?

Using difflib.get_close_matches to help determine uniqueness
import difflib
a = ('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
d = {}
for word in a:
    similar = difflib.get_close_matches(word, d.keys(), cutoff = 0.6, n = 1)
    #print(similar)
    if similar:
        d[similar[0]] += 1
    else:
        d[word] = 1
The actual keys in the dictionary will depend on the order of the words in the list.
difflib.get_close_matches uses difflib.SequenceMatcher to calculate the closeness (ratio) of the word against all possibilities even if the first possibility is close - then sorts by the ratio. This has the advantage of finding the closest key that has a ratio greater than the cutoff. But as the dictionary grows the searches will take longer.
If needed, you might be able to optimize a little by sorting the list first so that similar words appear in sequence and doing something like this (lazy evaluation) - choosing an appropriately large cutoff.
import difflib, collections
z = collections.OrderedDict()
a = sorted(a)
cutoff = 0.6
for word in a:
    for key in z.keys():
        if difflib.SequenceMatcher(None, word, key).ratio() > cutoff:
            z[key] += 1
            break
    else:
        z[word] = 1
Results:
>>> d
{'TOILET': 2, 'AIR CONDITIONING': 3}
>>> z
OrderedDict([('AIR CONDITIONING', 3), ('TOILET', 2)])
>>>
I imagine there are python packages that do this sort of thing and may be optimized.
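To get a feel for what the cutoff is compared against, the ratio can be inspected directly. This is only an exploratory sketch; the helper name similarity and the chosen pairs are illustrative:

```python
import difflib


def similarity(a, b):
    """Return difflib's similarity ratio for two strings (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


cutoff = 0.6
pairs = [('TOILET', 'TOILETS'),
         ('AIR CONDITIONING', 'AIR-CONDITIONINGS'),
         ('TOILET', 'AIR CONDITIONING')]
for left, right in pairs:
    score = similarity(left, right)
    # The last column shows whether the pair would be merged at this cutoff.
    print(left, '|', right, '|', round(score, 3), '|', score > cutoff)
```

Printing the scores for a sample of real data is a cheap way to pick a cutoff before running the full count.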

I don't believe the python list has an easy built-in way to do what you are asking. It does, however, have a count method that can tell you how many of a specific element there are in a list. Example:
some_list = ['a', 'a', 'b', 'c']
some_list.count('a') #=> 2
Usually the way you get what you want is to build a dictionary of counts by taking advantage of the dict.get(key, default) method:
some_list = ['a', 'a', 'b', 'c']
counts = {}
for el in some_list:
    counts[el] = counts.get(el, 0) + 1
counts #=> {'a' : 2, 'b' : 1, 'c' : 1}

You can try this:
import re
data = ('TOILETS','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
new_data = [re.sub(r"\W+", ' ', i) for i in data]
print new_data
final_data = {}
for i in new_data:
    s = [b for b in final_data if i.startswith(b)]
    if s:
        new_data = s[0]
        final_data[new_data] += 1
    else:
        final_data[i] = 1
print final_data
Output:
{'TOILETS': 2, 'AIR CONDITIONING': 3}

original = ('TOILETS', 'TOILETS', 'AIR CONDITIONING',
'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
a_set = set(original)
result_dict = {element: original.count(element) for element in a_set}
First, making a set from original list (or tuple) gives you all values from it, but without repeating.
Then you create a dictionary with keys from that set and values as occurrences of them in the original list (or tuple), employing the count() method.

a = ['TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for i in a:
    b.setdefault(i,0)
    b[i] += 1
You can use this code, but as Jon Clements pointed out, TOILET and TOILETS aren't the same string, so you must normalize such variants yourself before counting.
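One possible way to handle such spelling variants, sketched under loud assumptions: hyphens are treated as spaces and a single trailing S is stripped as a crude plural rule. Both rules, and the helper name normalize, are illustrative guesses about the data rather than anything the question specifies:

```python
def normalize(label):
    """Map spelling variants to one key (assumed rules: '-' -> ' ', drop trailing 'S')."""
    key = label.upper().replace('-', ' ').strip()
    if key.endswith('S'):
        key = key[:-1]
    return key


a = ['TOILET', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for item in a:
    key = normalize(item)
    b[key] = b.get(key, 0) + 1
print(b)  # {'TOILET': 2, 'AIR CONDITIONING': 3}
```

The trailing-S rule would mangle words such as GLASS, so real data would likely need a more careful rule or a lookup table of known variants.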

How to find the longest common substring of multiple strings?

I am writing a python script where I have multiple strings.
For example:
x = "brownasdfoersjumps"
y = "foxsxzxasis12sa[[#brown"
z = "thissasbrownxc-34a#s;"
All three strings share one substring, brown. I want to find such common substrings and build a dictionary like this:
dict = {[commonly occuring substring] =>
[total number of occurrences in the strings provided]}
What would be the best way of doing that? Considering that I will have more than 200 strings each time, what would be an easy/efficient way of doing it?

This is a relatively optimised naïve algorithm. You first transform each sequence into a set of all its ngrams. Then you intersect all sets and find the longest ngram in the intersection.
from functools import partial, reduce
from itertools import chain
from typing import Iterator
def ngram(seq: str, n: int) -> Iterator[str]:
    return (seq[i: i+n] for i in range(0, len(seq)-n+1))
def allngram(seq: str) -> set:
    # sizes 0..len(seq); size 0 yields '' so the intersection is never empty
    lengths = range(len(seq) + 1)
    ngrams = map(partial(ngram, seq), lengths)
    return set(chain.from_iterable(ngrams))
sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a#s;"]
seqs_ngrams = map(allngram, sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len) # -> brown
While this might get you through short sequences, this algorithm is extremely inefficient on long sequences. If your sequences are long, you can add a heuristic to limit the largest possible ngram length (i.e. the longest possible common substring). One obvious value for such a heuristic may be the shortest sequence's length.
def allngram(seq: str, minn=1, maxn=None) -> set:
    # both ranges include their upper bound so full-length ngrams are kept
    lengths = range(minn, maxn + 1) if maxn else range(minn, len(seq) + 1)
    ngrams = map(partial(ngram, seq), lengths)
    return set(chain.from_iterable(ngrams))
sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a#s;"]
maxn = min(map(len, sequences))
seqs_ngrams = map(partial(allngram, maxn=maxn), sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len) # -> brown
This may still take too long (or make your machine run out of RAM), so you might want to read about some optimal algorithms (see the link I left in my comment to your question).
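Short of suffix trees or similar structures, one cheap improvement is to generate candidates only from the shortest string, longest first, and stop at the first candidate found in every string. This avoids materializing every n-gram set. A sketch follows; the function name is made up for illustration, and the worst case is still slow for long inputs:

```python
def longest_common_substring(strings):
    """Return one longest substring shared by every string, or '' if none."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    # Try candidate lengths from longest to shortest; the first hit wins.
    for n in range(len(shortest), 0, -1):
        for start in range(len(shortest) - n + 1):
            candidate = shortest[start:start + n]
            if all(candidate in s for s in strings):
                return candidate
    return ""


sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a#s;"]
print(longest_common_substring(sequences))  # brown
```

Memory use stays constant here because only one candidate exists at a time, at the price of repeated substring searches.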
Update
To count the number of strings wherein each ngram occurs:
from collections import Counter
sequences = ["brownasdfoersjumps",
"foxsxzxasis12sa[[#brown",
"thissasbrownxc-34a#s;"]
seqs_ngrams = map(allngram, sequences)
counts = Counter(chain.from_iterable(seqs_ngrams))
Counter is a subclass of dict, so its instances have similar interfaces:
print(counts)
Counter({'#': 1,
'#b': 1,
'#br': 1,
'#bro': 1,
'#brow': 1,
'#brown': 1,
'-': 1,
'-3': 1,
'-34': 1,
'-34a': 1,
'-34a#': 1,
'-34a#s': 1,
'-34a#s;': 1,
...
You can filter the counts to leave substrings occurring in at least n strings: {string: count for string, count in counts.items() if count >= n}

I have used a straightforward method to get the common subsequences from multiple strings, although the code can be optimised further.
import itertools
def getMaxOccurrence(stringsList, key):
    count = 0
    for word in stringsList:
        if key in word:
            count += 1
    return count
def getSubSequences(STR):
    combs = []
    result = []
    for l in range(1, len(STR)+1):
        combs.append(list(itertools.combinations(STR, l)))
    for c in combs:
        for t in c:
            result.append(''.join(t))
    return result
def getCommonSequences(S):
    mainList = []
    for word in S:
        temp = getSubSequences(word)
        mainList.extend(temp)
    mainList = list(set(mainList))
    mainList = reversed(sorted(mainList, key=len))
    mainList = list(filter(None, mainList))
    finalData = dict()
    for alpha in mainList:
        val = getMaxOccurrence(S, alpha)
        if val > 0:
            finalData[alpha] = val
    finalData = {k: v for k, v in sorted(finalData.items(), key=lambda item: item[1], reverse=True)}
    return finalData
stringsList = ['abc', 'cab', 'dfab', 'xz']
seqs = getCommonSequences(stringsList)
print(seqs)

Filter List by Longest Element Containing a String

I want to group the items of a list by their last 4 digits and, for each group, print the longest item.
For example:
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
# want to return abcdabcd1234 and poiupoiupoiupoiu7890
In this case, we print the longer of the elements containing 1234, and the longer of the elements containing 7890. Finding the longest element containing a certain element is not hard, but doing it for all items in the list (different last four digits) efficiently seems difficult.
My attempt was to first identify all the different last 4 digits using a loop and slicing:
ids=[]
for x in lst:
    ids.append(x[-4:])
ids = list(set(ids))
Next, I would search through the list by index, with a "max_length" variable and "current_id" to find the largest elements of each id. This is clearly very inefficient, and I was wondering what the best way to do this would be.

Use a dictionary:
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> d = {} # to keep the longest items for digits.
>>> for item in lst:
...     key = item[-4:] # last 4 characters
...     d[key] = max(d.get(key, ''), item, key=len)
...
>>> d.values() # list(d.values()) in Python 3.x
['abcdabcd1234', 'poiupoiupoiupoiu7890']

from collections import defaultdict
d = defaultdict(str)
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
for x in lst:
    if len(x) > len(d[x[-4:]]):
        d[x[-4:]] = x
To display the results:
for key, value in d.items():
    print key,'=', value
which produces:
1234 = abcdabcd1234
7890 = poiupoiupoiupoiu7890

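itertools is great. Use groupby with a lambda to group the list by ending, and from there it is easy. Note that groupby only groups adjacent items, so this relies on items with the same ending already sitting next to each other: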
>>> from itertools import groupby
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> [max(y, key=len) for x, y in groupby(lst, lambda l: l[-4:])]
['abcdabcd1234', 'poiupoiupoiupoiu7890']
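groupby only merges adjacent items with equal keys, so for input in arbitrary order the list can be sorted by the same key first. A sketch with a shuffled version of the sample list (the helper name last_four is illustrative):

```python
from itertools import groupby


def last_four(s):
    """Grouping key: the final four characters of the string."""
    return s[-4:]


lst = ['gqweri7890', 'abcd1234', 'poiupoiupoiupoiu7890', 'abcdabcd1234']
# Sort by the grouping key so that equal endings become adjacent.
ordered = sorted(lst, key=last_four)
longest = [max(group, key=len) for _, group in groupby(ordered, key=last_four)]
print(longest)  # ['abcdabcd1234', 'poiupoiupoiupoiu7890']
```

The sort makes this O(n log n), so the dictionary-based answers remain the cheaper option for large lists.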

Slightly more generic
import string
import collections
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
z = [(x.translate(None, x.translate(None, string.digits)), x) for x in lst]
x = collections.defaultdict(list)
for a, b in z:
    x[a].append(b)
for k in x:
    print k, max(x[k], key=len)
1234 abcdabcd1234
7890 poiupoiupoiupoiu7890
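The two-argument x.translate(None, ...) call used above is Python 2 syntax; in Python 3, str.translate accepts only a translation table. The same grouping by embedded digits could be sketched for Python 3 as follows, where digits_of is a hypothetical helper name:

```python
import collections
import string


def digits_of(s):
    """Return only the ASCII digit characters of s, in their original order."""
    return ''.join(ch for ch in s if ch in string.digits)


lst = ['abcd1234', 'abcdabcd1234', 'gqweri7890', 'poiupoiupoiupoiu7890']
groups = collections.defaultdict(list)
for item in lst:
    groups[digits_of(item)].append(item)

for key, items in groups.items():
    print(key, max(items, key=len))
```

Because the key is built from every digit in the string, this also works when the digits are not confined to the last four characters.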
