I am working on a simple word count problem and trying to figure out whether it can be done using map, filter and reduce exclusively.
The following is an example of a wordRDD (the list used for Spark):
myLst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']
All I need is to count the words and present them in tuple format:
counts = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
I tried with simple map() and lambdas as:
counts = myLst.map(lambdas x: (x, <HERE IS THE PROBLEM>))
I might be wrong with the syntax or maybe confused.
P.S.: This isn't a duplicate question, as the other answers give suggestions using if/else or list comprehensions.
Thanks for the help.
You don't need map(..) at all. You can do it with just reduce(..)
>>> from collections import defaultdict
>>> from functools import reduce
>>> def function(obj, x):
...     obj[x] += 1
...     return obj
...
>>> reduce(function, myLst, defaultdict(int)).items()
dict_items([('elephants', 1), ('rats', 2), ('cats', 3)])
You can then iterate over the result.
However, there's a better way of doing it: Look into Counter
Not using a lambda but gets the job done.
from collections import Counter
c = Counter(myLst)
result = list(c.items())
And the output:
In [21]: result
Out[21]: [('cats', 3), ('rats', 2), ('elephants', 1)]
If you don't want the full reduce step done for you (which aggregated the counts in SuperSaiyan's answer), you can use map this way:
>>> myLst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']
>>> counts = list(map(lambda s: (s,1), myLst))
>>> print(counts)
[('cats', 1), ('elephants', 1), ('rats', 1), ('rats', 1), ('cats', 1), ('cats', 1)]
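To stay with only map and reduce, the (word, 1) pairs from the map step can be folded into totals by a reduce step. The following is a minimal sketch in plain Python, not Spark's RDD API; the names my_lst, pairs and totals are just illustrative.

```python
from functools import reduce

my_lst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']

# Map step: pair every word with a count of 1.
pairs = map(lambda word: (word, 1), my_lst)

# Reduce step: fold the pairs into a dict of totals.
# A new dict is built on each step so the lambda has no side effects.
totals = reduce(
    lambda acc, pair: {**acc, pair[0]: acc.get(pair[0], 0) + pair[1]},
    pairs,
    {},
)

counts = list(totals.items())
print(counts)
```

Copying the dict on every step keeps the reducer pure but costs extra work on long lists; mutating a single accumulator, as in the defaultdict answer, is cheaper.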
You can use map() to pair each word with its length (note that len(x) gives the length of the word, not how often it occurs):
myLst = ['cats', 'elephants', 'rats', 'rats', 'cats', 'cats']
list(map(lambda x : (x,len(x)), myLst))
Related
I am trying to count the frequency of word occurrences in a variable. The variable contains more than 700,000 observations. The output should be a dictionary with the words that occurred the most. I used the code below to do this:
d1 = {}
for i in range(len(words)-1):
    x=words[i]
    c=0
    for j in range(i,len(words)):
        c=words.count(x)
    count=dict({x:c})
    if x not in d1.keys():
        d1.update(count)
I ran the code on the first 1,000 observations and it worked perfectly. The output is shown below:
[('semantic', 23),
('representations', 11),
('models', 10),
('task', 10),
('data', 9),
('parser', 9),
('language', 8),
('languages', 8),
('paper', 8),
('meaning', 8),
('rules', 8),
('results', 7),
('performance', 7),
('parsing', 7),
('systems', 7),
('neural', 6),
('tasks', 6),
('entailment', 6),
('generic', 6),
('te', 6),
('natural', 5),
('method', 5),
('approaches', 5)]
When I try to run it on 100,000 observations, it keeps running. I've let it run for more than 24 hours and it still hasn't finished. Does anyone have an idea?
You can use collections.Counter.
from collections import Counter
counts = Counter(words)
print(counts.most_common(20))
@Jon's answer is the best in your case; however, in some cases collections.Counter will be slower than plain iteration (especially if you don't need to sort by frequency afterwards), as I asked in this question.
You can count frequencies by iteration.
d1 = {}
for item in words:
    # Membership tests on the dict itself are fast; no need for .keys()
    if item in d1:
        d1[item] += 1
    else:
        d1[item] = 1
# finally sort the dictionary of frequencies (ascending by count)
print(dict(sorted(d1.items(), key=lambda item: item[1])))
But again, for your case, using @Jon's answer is faster and more compact.
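As a quick check that both approaches agree, here is a small sketch that counts the same list with a plain dict and with Counter, then takes the most frequent words first. The sample words list is made up for illustration.

```python
from collections import Counter

# Hypothetical sample data standing in for the real `words` list.
words = ['semantic', 'task', 'semantic', 'data', 'task', 'semantic']

# Single pass with a plain dict.
manual = {}
for word in words:
    manual[word] = manual.get(word, 0) + 1

# Same counts via Counter.
counted = Counter(words)

# Sort descending by count to get the most frequent words first.
top_two = sorted(manual.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)
```

Both versions look at each word once, so the running time grows linearly with the length of the list.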
#...
for i in range(len(words)-1):
    #...
    #...
    for j in range(i,len(words)):
        c=words.count(x)
    #...
    if x not in d1.keys():
        #...
I've tried to highlight the problems your code is having above. In English this looks something like:
"Count the number of occurrences of each word after the word I'm looking at, repeatedly, for every word in the whole list. Also, look through the whole dictionary I'm building again for every word in the list, while I'm building it."
This is way more work than you need to do; you only need to look at each word in the list once. You do need to look in the dictionary once for every word, but in Python 2, checking against d1.keys() makes this far slower, because it builds a separate list of the keys and searches through the whole thing. The following code will do what you want, much more quickly:
words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']
word_counts = {}
# Look at each word in our list once
for word in words:
    # If we haven't seen it before, create a new count in our dictionary
    if word not in word_counts:
        word_counts[word] = 0
    # We've made sure our count exists, so just increment it by 1
    word_counts[word] += 1
print(word_counts.items())
The above example will give:
[
('charlie', 2),
('baker', 1),
('able', 2),
('dog', 3),
('easy', 1)
]
I have a list, and I want to return each element in that list along with its position in it.
For example:
my_enumerate(['dog', 'pig', 'cow'])
should return:
[(0, 'dog'), (1, 'pig'), (2, 'cow')]
The following is how I've approached it:
def my_enumerate(items):
    ''' returning position of item in a list'''
    lista = []
    for num in range(0, len(items)+1):
        for item in items:
            lista.append("({0}, {1})".format(num, item))
    return lista
which returned to me:
['(0, dog)', '(0, pig)', '(0, cow)', '(1, dog)', '(1, pig)', '(1, cow)', '(2, dog)', '(2, pig)', '(2, cow)', '(3, dog)', '(3, pig)', '(3, cow)']
The function should behave exactly like the built-in enumerate function, but I'm not allowed to use it.
Your program produces the Cartesian product of the indexes and the elements of the list: it takes each index and pairs it with every element in the list. Also note that you should iterate only up to the length of the list; with len(items) + 1 you go one past the last valid index.
You can use only the first loop, like this
>>> def my_enumerate(items):
...     lista = []
...     for num in range(len(items)):
...         lista.append("({0}, {1})".format(num, items[num]))
...     return lista
...
>>> my_enumerate(['dog', 'pig', 'cow'])
['(0, dog)', '(1, pig)', '(2, cow)']
You can also use a simple list comprehension, like this
>>> def my_enumerate(items):
...     return ["({0}, {1})".format(num, items[num]) for num in range(len(items))]
...
>>> my_enumerate(['dog', 'pig', 'cow'])
['(0, dog)', '(1, pig)', '(2, cow)']
Note 1: In Python 3.x, you don't have to use the positions in the format string unless it is necessary. So, "({}, {})".format is enough.
Note 2: If you actually wanted to return tuples, like enumerate, then you should not use string formatting, instead prepare tuples like this
>>> def my_enumerate(items):
...     return [(num, items[num]) for num in range(len(items))]
...
>>> my_enumerate(['dog', 'pig', 'cow'])
[(0, 'dog'), (1, 'pig'), (2, 'cow')]
Note 3: If you actually want to simulate enumerate as it works in Python 3.x, it is better to use a generator function, like this
>>> def my_enumerate(items):
...     for num in range(len(items)):
...         yield (num, items[num])
...
>>> my_enumerate(['dog', 'pig', 'cow'])
<generator object my_enumerate at 0x7f5ff7abf900>
>>> list(my_enumerate(['dog', 'pig', 'cow']))
[(0, 'dog'), (1, 'pig'), (2, 'cow')]
Note 4: More good news is, you can write the same my_enumerate, with yield from and a generator expression, like this
>>> def my_enumerate(items):
...     yield from ((num, items[num]) for num in range(len(items)))
...
>>> my_enumerate(['dog', 'pig', 'cow'])
<generator object my_enumerate at 0x7f5ff7abfe10>
>>> list(my_enumerate(['dog', 'pig', 'cow']))
[(0, 'dog'), (1, 'pig'), (2, 'cow')]
If you write it as a generator, it will even work with infinite generators:
def my_enumerate(items):
    i = 0
    for e in items:
        yield (i, e)
        i += 1
print(list(my_enumerate(['dog', 'pig', 'cow'])))
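To illustrate the point about infinite input, here is a small sketch that feeds the generator an endless stream and takes only the first few pairs. The squares stream and the use of itertools.islice are my own additions for the demonstration.

```python
from itertools import count, islice

def my_enumerate(items):
    # Counts by hand, so it never needs len() and works on any iterable.
    i = 0
    for e in items:
        yield (i, e)
        i += 1

# An endless stream of squares: 1, 4, 9, ...
squares = (n * n for n in count(1))

# islice stops after three pairs, so the infinite input is never exhausted.
first_three = list(islice(my_enumerate(squares), 3))
print(first_three)
```

A version based on range(len(items)) would fail here, because a generator has no length.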
You can do it in a one-liner pretty easily using zip, range, and len:
def my_enumerate(items):
    return zip(range(len(items)), items)
my_list = ['dog', 'pig', 'cow']
print(my_enumerate(my_list)) # prints [(0, 'dog'), (1, 'pig'), (2, 'cow')]
The above would be the Python 2 version. Note that in Python 3, zip returns an iterator, which may actually be just what you need, but if you absolutely need a list, you can just wrap the returned expression in a list() call.
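For reference, a Python 3 sketch of the same one-liner with the list() wrapper applied might look like this:

```python
def my_enumerate(items):
    # zip() pairs each index with its element; list() materialises the result.
    return list(zip(range(len(items)), items))

my_list = ['dog', 'pig', 'cow']
print(my_enumerate(my_list))
```

Because it calls len(), this version needs a sequence (list, tuple, string) rather than an arbitrary iterator.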
Simplest:
def my_enumerate(items):
    return list(enumerate(items))
How can I transform a list of strings into a list of tuples containing each string and its length?
Something like add_sizes([str]) -> [(str, int)]
Here is what I did:
def add_sizes(strings):
    for i in strings:
        return [(i, (len(i)))]
but this only works for one string. What should I do if I have more than one string?
Use a list comprehension
>>> l = ['foo', 'bar', 'j', 'mn']
>>> [(i,len(i)) for i in l]
[('foo', 3), ('bar', 3), ('j', 1), ('mn', 2)]
Defining it as a separate function:
def add_sizes(l):
    return [(i,len(i)) for i in l]
Your mistake
You are returning after you check the first element, hence you will get only the value for the first string. A workaround is
def add_sizes(strings):
    temp = []
    for i in strings:
        temp.append((i, (len(i))))
    return temp
Better ways to do it
A better way using map (in Python 3, map returns an iterator, so wrap it in list() to get a list)
>>> l = ['abc','defg','hijkl']
>>> list(map(lambda x:(x,len(x)),l))
[('abc', 3), ('defg', 4), ('hijkl', 5)]
And you can define your function as
def add_sizes(l):
    return list(map(lambda x:(x,len(x)),l))
Or using a list comp as shown in Avinash's answer
I would use map and zip:
def add_sizes(strings):
    return list(zip(strings, map(len, strings)))
print(add_sizes(["foo","bar","foobar"]))
[('foo', 3), ('bar', 3), ('foobar', 6)]
If you may have a mixture of lists and single strings, you need to catch that:
def add_sizes(strings):
    return list(zip(strings, map(len, strings))) if not isinstance(strings, str) else [(strings, len(strings))]
Use List Comprehension:
def add_sizes(slist):
    out = [(s, len(s)) for s in slist]
    return out
Yes, that's correct: because the return statement is inside the for loop, the function returns during the first iteration.
Move the return statement outside the for loop and create a new list to which you append every element and its length as a tuple.
Demo:
>>> def add_sizes(input_list):
...     result = []
...     for i in input_list:
...         result.append((i, len(i)))
...     return result
...
>>> l1 = ["asb", "werg", "a"]
>>> add_sizes(l1)
[('asb', 3), ('werg', 4), ('a', 1)]
>>>
I have a list with such structure:
[(key1, val1), (key2, val2), ...]
I want to iterate over it in reverse order, getting the key and the index of the item at each step.
Right now I'm doing it like this:
for index, key in reversed(list(enumerate(map(lambda x: x[0], data)))):
    print index, key
It works perfectly, but I'm wondering whether this is the proper way to do it. Is there a better solution?
enumerate() cannot count down, only up. Use an itertools.count() object instead:
from itertools import izip, count
for index, item in izip(count(len(data) - 1, -1), reversed(data)):
This produces a count starting at the length (minus 1), then counting down as you go along the reversed sequence.
Demo:
>>> from itertools import izip, count
>>> data = ('spam', 'ham', 'eggs', 'monty')
>>> for index, item in izip(count(len(data) - 1, -1), reversed(data)):
...     print index, item
...
3 monty
2 eggs
1 ham
0 spam
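The demo above is Python 2 (izip and the print statement). A rough Python 3 equivalent, assuming the built-in zip in place of izip and a made-up data list of (key, value) pairs as in the question, could look like this:

```python
from itertools import count

# Hypothetical data in the question's [(key, value), ...] shape.
data = [('spam', 1), ('ham', 2), ('eggs', 3)]

pairs = []
# count() runs down from the last index; zip stops when reversed(data) ends.
for index, (key, _value) in zip(count(len(data) - 1, -1), reversed(data)):
    pairs.append((index, key))
    print(index, key)
```

Neither count() nor reversed() copies the list, so this stays cheap on long inputs.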
Here is an interesting article about this problem. The following solution is proposed:
from itertools import izip
reverse_enumerate = lambda l: izip(xrange(len(l)-1, -1, -1), reversed(l))
>>> a = ['a', 'b', 'c']
>>> it = reverse_enumerate(a)
>>> it.next()
(2, 'c')
data = [('a', 1), ('b', 2)]  # renamed from "list" to avoid shadowing the built-in
for n,k in reversed([(i,k[0]) for i, k in enumerate(data)]):
    print n,k
You should use a dict instead of a list of key/value pairs; that is what they are for.
Edit: That should work.
Either of these 2 suffice if the performance is not absolutely crucial.
sorted(enumerate(data), reverse=True)
reversed(list(enumerate(data)))
enumerate() on the reverse slice will work:
for i, x in enumerate(data[::-1]):
    print(len(data)-1-i, x[0])
This creates few temporary objects: just one enumerate() object and one reversed copy of the list produced by the slice.
Define your own enumerate:
def enumerate_in_reverse(sequence):
    if not sequence:
        return
    for i in range(len(sequence) - 1, -1, -1):
        yield i, sequence[i]
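A short usage sketch for this generator, applied to a made-up list of (key, value) pairs like the one in the question, and checked against the reversed(list(enumerate(...))) approach:

```python
def enumerate_in_reverse(sequence):
    # Walk the indexes from last to first and yield (index, item) lazily.
    if not sequence:
        return
    for i in range(len(sequence) - 1, -1, -1):
        yield i, sequence[i]

# Hypothetical data in the question's [(key, value), ...] shape.
data = [('key1', 'val1'), ('key2', 'val2'), ('key3', 'val3')]

keys_with_index = [(index, key) for index, (key, _val) in enumerate_in_reverse(data)]
print(keys_with_index)
```

It needs a sequence that supports len() and indexing, so it works for lists, tuples and strings but not for plain iterators.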
I believe this should be pretty straightforward, but it seems I am not able to think straight to get this right.
I have a list as follows:
comp = [Amazon, Apple, Microsoft, Google, Amazon, Ebay, Apple, Paypal, Google]
I just want to print the words that occur the most. I did the following:
cnt = Counter(comp.split(','))
final_list = cnt.most_common(2)
This gives me the following output:
[[('Amazon', 2), ('Apple', 2)]]
I am not sure what parameter to pass to most_common(), since it could be different for each input list. So, I would like to know how I can print the top occurring words, be it 3 for one list or 4 for another. So, for the above sample, the output would be as follows:
[[('Amazon', 2), ('Apple', 2), ('Google',2)]]
Thanks
You can use itertools.takewhile here:
>>> from collections import Counter
>>> from itertools import takewhile
>>> lis = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon', 'Ebay', 'Apple', 'Paypal', 'Google']
>>> c = Counter(lis)
>>> items = c.most_common()
Get the max count:
>>> max_ = items[0][1]
Select only those items where count == max_, and stop as soon as an item with a lower count is found:
>>> list(takewhile(lambda x: x[1]==max_, items))
[('Google', 2), ('Apple', 2), ('Amazon', 2)]
You've misunderstood Counter.most_common:
most_common(self, n=None)
List the n most common elements and their counts from the most common
to the least. If n is None, then list all element counts.
That is, n is not the count here; it is the number of top items you want to return. It is essentially equivalent to:
>>> c.most_common(4)
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
>>> c.most_common()[:4]
[('Google', 2), ('Apple', 2), ('Amazon', 2), ('Paypal', 1)]
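Putting the takewhile steps together, a small helper could look like the sketch below. The name top_ties is hypothetical, and the order of tied items in the result depends on the Python version.

```python
from collections import Counter
from itertools import takewhile

def top_ties(items):
    """Return every (item, count) pair that shares the highest count."""
    ranked = Counter(items).most_common()  # sorted from most to least common
    if not ranked:
        return []
    highest = ranked[0][1]
    # Stop at the first pair whose count drops below the maximum.
    return list(takewhile(lambda pair: pair[1] == highest, ranked))

comp = ['Amazon', 'Apple', 'Microsoft', 'Google', 'Amazon',
        'Ebay', 'Apple', 'Paypal', 'Google']
print(top_ties(comp))
```

This returns three items for the sample list and would return four for a list with a four-way tie, without having to guess the argument to most_common().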
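You can do this by maintaining two variables, maxi and maxi_value, storing the most frequent element and the number of times it has occurred.
counts = {}  # avoid naming this "dict", which shadows the built-in
maxi = None
maxi_value = 0
for elem in comp:
    try:
        counts[elem] += 1
    except KeyError:  # a missing dict key raises KeyError, not IndexError
        counts[elem] = 1
    if counts[elem] > maxi_value:
        maxi = elem
        maxi_value = counts[elem]
print(maxi)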
Find the number of occurrences of one of the top words, and then filter the whole list returned by most_common (in Python 3, wrap filter in list() to get a list):
>>> mc = cnt.most_common()
>>> list(filter(lambda t: t[1] == mc[0][1], mc))