separate and count emojis in python list - python

I want to count the occurrences of Emojis in a list in python.
Assuming my list looks like this
li = ['šŸ˜', 'šŸ¤£šŸ˜', 'šŸ˜šŸ¤£šŸ˜‹']
Counter(li) would give me {'šŸ˜': 1, 'šŸ¤£šŸ˜': 1, 'šŸ˜šŸ¤£šŸ˜‹': 1}
But I would like to get the total amount of emojis aka {'šŸ˜': 3, 'šŸ¤£': 2, 'šŸ˜‹': 1}
My main issue is how to separate large chunks of continuous emoji into single list entries. I tried replacing the beginning "\U" with " \U" so I could then simply split by " ", but it does not seem to work.
Thanks for your help in advance :)

You can flatten your list into a single string using join and then apply Counter to that:
Counter("".join(li))
results in
Counter({'šŸ˜': 3, 'šŸ¤£': 2, 'šŸ˜‹': 1})
or maybe a more memory-efficient way is
counter = Counter()
for item in li:
    counter.update(item)

You can count the emojis by iterating on the characters of each string:
from collections import Counter
li = ['šŸ˜', 'šŸ¤£šŸ˜', 'šŸ˜šŸ¤£šŸ˜‹']
count = Counter(emoji for string in li for emoji in string)
print(count)
# Counter({'šŸ˜': 3, 'šŸ¤£': 2, 'šŸ˜‹': 1})
#Dan gave a different answer just before me, which he has sadly since deleted, so I reproduce it for the users below 10k reputation who can't see it:
Counter("".join(li))
I thought it might be less efficient because of the creation of the joined string, but I did some timings with small and larger lists up to 10 000 000 items, and it appears that his solution is consistently 30 to 40% faster.

One more way is to use the fact that counters implement addition:
>>> li = ['šŸ˜', 'šŸ¤£šŸ˜', 'šŸ˜šŸ¤£šŸ˜‹']
>>> from collections import Counter
>>> sum(map(Counter, li), Counter())
Counter({'šŸ˜': 3, 'šŸ¤£': 2, 'šŸ˜‹': 1})

Related

How can I efficiently count the number of occurrences of a given character within certain range of a string?

Given an unsorted string, e.g. "googol". I want to find the number of occurrences of character "o" in the range: [1, 3). So, in this case, the answer would be 1.
However, my method has complexity O(N^2): the problem is that copying the array takes O(N) time. Therefore, I was looking for a more efficient way. Space complexity does not matter to me. Because I am learning string-processing algorithms, it would be better if I could implement this algorithm on my own.
Any help would be appreciated.
My method:
tmp = [0] * 26  # one slot per letter of the alphabet
occurrences_table = []
tmp[ord(a_string[0]) - ord('a')] += 1
occurrences_table.append(tmp)
for i in range(1, len(a_string)):
    temp = occurrences_table[i - 1][:]  # copy the previous prefix counts
    temp[ord(a_string[i]) - ord('a')] += 1
    occurrences_table.append(temp)
Since you don't want to use Counter and want to implement it yourself, your code can be tidied up and sped up a little by using a dictionary.
a_string = "googol"
my_counter = {}
for c in a_string[:2]:
    my_counter[c] = my_counter.get(c, 0) + 1
which would give you:
{'o': 1, 'g': 1}
To explain it a little further, a_string[:2] gets the characters up to index 2 in your string ('googol'[:2] = 'go') and for c in a_string[:2]: loops over those 2 characters.
In the next line, my_counter.get(c, 0) + 1 tries to get the dictionary value for the key c (a single character in your string); if it exists, it returns its value, otherwise it returns 0, and either way the incremented value is written back to the dictionary.
EDIT:
Complexity should be just O(n) due to the for loop since the complexity of dictionary.get() is constant.
I've measured it, and for very small strings like yours this method was 8-10 times faster than collections.Counter, but for very large strings it is 2-3 times slower.
You could use a Counter:
from collections import Counter
a_string = "googol"
occurrences = Counter(a_string[0:2])
which results in
Counter({'o': 1, 'g': 1})
Notice that array slicing works on strings.
If you can use standard libraries:
>>> from itertools import islice
>>> from collections import Counter
>>> Counter(islice('googol', 1, 3))
Counter({'o': 2})
>>> Counter(islice('googol', 0, 2))
Counter({'g': 1, 'o': 1})
(islice avoids a temporary list.)
If you want to do it manually:
>>> s = 'googol'
>>> counter = dict()
>>> for i in range(0, 2):
...     if s[i] not in counter:
...         counter[s[i]] = 1
...     else:
...         counter[s[i]] += 1
...
>>> counter
{'g': 1, 'o': 1}
The point is: use a dict.

Cleanest way to obtain a list of the numeric values in a string

What is the cleanest way to obtain a list of the numeric values in a string?
For example:
string = 'version_4.11.2-2-1.4'
array = [4, 11, 2, 2, 1, 4]
As you might understand, I need to compare versions.
By "cleanest", I mean as simple / short / readable as possible.
Also, if possible, then I prefer built-in functions over regexp (import re).
This is what I've got so far, but I feel that it is rather clumsy:
array = [int(n) for n in ''.join(c if c.isdigit() else ' ' for c in string).split()]
Strangely enough, I have not been able to find an answer on SO:
In this question, the input numeric values are assumed to be separated by white spaces
In this question, the input numeric values are assumed to be separated by white spaces
In this question, the user only asks for a single numeric value at the beginning of the string
In this question, the user only asks for all the digits concatenated into a single numeric value
Thanks
Just match on consecutive digits:
map(int, re.findall(r'\d+', versionstring))
It doesn't matter what's between the digits; \d+ matches as many digits as can be found in a row. This gives you the desired output in Python 2:
>>> import re
>>> versionstring = 'version_4.11.2-2-1.4'
>>> map(int, re.findall(r'\d+', versionstring))
[4, 11, 2, 2, 1, 4]
If you are using Python 3, map() gives you an iterable map object, so either call list() on that or use a list comprehension:
[int(d) for d in re.findall(r'\d+', versionstring)]
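Since the stated goal is comparing versions, it may be worth noting that the resulting lists already compare lexicographically, so a small (hypothetical) helper could wrap the whole thing:

```python
import re

def version_key(s):
    # Extract all digit runs and compare them numerically, element by element.
    return [int(d) for d in re.findall(r'\d+', s)]

print(version_key('version_4.11.2-2-1.4'))   # [4, 11, 2, 2, 1, 4]
print(version_key('version_4.9.0') < version_key('version_4.11.0'))  # True
```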
I'd solve this with a regular expression, too.
I prefer re.finditer over re.findall for this task. re.findall returns a list, re.finditer returns an iterator, so with this solution you won't create a temporary list of strings:
>>> [int(x.group()) for x in re.finditer('\d+', string)]
[4, 11, 2, 2, 1, 4]
You are examining every character and checking whether it is a digit, and if so adding it to a list; this gets slow for larger strings.
Let's say,
import re
string='version_4.11.2-2-1.4.9.7.5.43.2.57.9.5.3.46.8.5'
l=map(int, re.findall('\d+',string))
print l
Hopefully, this should work.
I'm not sure why the answer above uses the r prefix.
You can simply resolve this using regular expressions.
import re
string = 'version_4.11.2-2-1.4'
p = re.compile(r'\d+')
p.findall(string)
Regex is definitely the best way to go, as @MartijnPieters' answer clearly shows, but if you don't want to use it, you probably can't use a list comprehension. This is how you could do it, though:
def getnumbers(string):
    numberlist = []
    substring = ""
    for char in string:
        if char.isdigit():
            substring += char
        elif substring:
            numberlist.append(int(substring))
            substring = ""
    if substring:
        numberlist.append(int(substring))
    return numberlist

Python, Take dictionary, and produce list with (words>1, most common words, longest words)

So I made a function:
def word_count(string):
    my_string = string.lower().split()
    my_dict = {}
    for item in my_string:
        if item in my_dict:
            my_dict[item] += 1
        else:
            my_dict[item] = 1
    print(my_dict)
so, what this does is that it takes a string, splits it, and produces a dictionary with the key being the word, and the value being how many times it appears.
Okay, so what I'm trying to do now is to make a function that takes the output of that function and produces a list in the following format:
((list of words longer than 1 letter), (list of most frequent words), (list of words with the longest length))
Also, for example, let's say two words have appeared 3 times and both words are 6 letters long; it should include both words in both the (most frequent) and (longest length) lists.
So, this has been my attempt thus far at tackling this problem
def analyze(x):
    longer_than_one = []
    most_frequent = []
    longest = []
    for key in x.item:
        if len(key) > 1:
            key.append(longer_than_one)
    print(longer_than_one)
So what I was trying to do here is make a series of for and if loops that append to the lists depending on whether or not the items meet the criteria. However, I have run into the following problems:
1. How do I iterate over a dictionary without getting an error?
2. I can't figure out a way to count the most frequent words (I was thinking of appending the keys with the highest values).
3. I can't figure out a way to only append the words that are the longest in the dictionary (I was thinking of using len(key), but it gave an error).
If it's any help, I'm working in Anaconda's Spyder using Python 3.5.1; any tips would be appreciated!
You really are trying to re-invent the wheel.
Imagine you have list_of_words which is, well, a list of strings.
To get the most frequent word, use Counter:
from collections import Counter
my_counter = Counter(list_of_words)
To sort the list by the length:
sorted_by_length = sorted(list_of_words, key=len)
To get the list of words longer than one letter you can simply use your sorted list, or create a new list with only these:
longer_than_one_letter = [word for word in list_of_words if len(word) > 1]
To get your output on your required format, simply use all of the above.
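For instance, a minimal sketch combining the pieces above (the function and variable names here are my own, not from the original question):

```python
from collections import Counter

def analyze(words):
    # Count each word once, then derive all three lists from the counts.
    counts = Counter(words)
    longer_than_one = [w for w in counts if len(w) > 1]
    max_count = max(counts.values())
    most_frequent = [w for w, c in counts.items() if c == max_count]
    max_len = max(len(w) for w in counts)
    longest = [w for w in counts if len(w) == max_len]
    return (longer_than_one, most_frequent, longest)

print(analyze('the cat sat on the mat a lot'.split()))
```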
Most of your problems are solved or get easier when you use a Counter.
Writing word_count with a Counter:
>>> from collections import Counter
>>> def word_count(string):
...     return Counter(string.split())
Demo:
>>> c = word_count('aa aa aa xxx xxx xxx b b ccccccc')
>>> c
Counter({'aa': 3, 'xxx': 3, 'b': 2, 'ccccccc': 1})
>>> c['aa']
3
The most_common method of a Counter helps with getting the most frequent words:
>>> c.most_common()
[('aa', 3), ('xxx', 3), ('b', 2), ('ccccccc', 1)]
>>> c.most_common(1)
[('aa', 3)]
>>> max_count = c.most_common(1)[0][1]
>>> [word for word, count in c.items() if count == max_count]
['aa', 'xxx']
You can get the words themselves with c.keys()
>>> c.keys()
['aa', 'xxx', 'b', 'ccccccc']
and a list of words with the longest length this way:
>>> max_len = len(max(c, key=len))
>>> [word for word in c if len(word) == max_len]
['ccccccc']
1)
To iterate over a dictionary you can either use:
for key in my_dict:
or, if you want to get the key and value at the same time, use:
for key, value in my_dict.items():
2)
To find the most frequent words, you have to assume that the first word is the most frequent; then look at the next word's count: if it is the same, append the word to your list; if it is less, skip it; if it is more, clear your list and assume that this one is the most frequent.
3) Pretty much the same as 2. Assume that your first word is the longest, then compare the next one: if its length equals your current max, append it to the list; if it is less, skip it; if it is more, clear your list and assume that this is your new max.
I didn't add any code since it's better if you write it your own in order to learn something
There are other nice answers for your question, but I would like to help you with your attempt. I have made a few modifications to your code to make it work:
def analyze(x):
    longer_than_one = []
    most_frequent = []
    longest = []
    for key in x:
        if len(key) > 1:
            longer_than_one.append(key)
    print(longer_than_one)
It seems you haven't attempted the 2nd and 3rd use cases yet.
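For the 2nd and 3rd use cases, a possible continuation in the same style (this is only a sketch; the names max_count and max_len are my own) would be:

```python
def analyze(x):
    longer_than_one = []
    most_frequent = []
    longest = []
    max_count = max(x.values())        # highest frequency in the dict
    max_len = max(len(k) for k in x)   # length of the longest word
    for key in x:
        if len(key) > 1:
            longer_than_one.append(key)
        if x[key] == max_count:
            most_frequent.append(key)
        if len(key) == max_len:
            longest.append(key)
    return (longer_than_one, most_frequent, longest)

print(analyze({'hello': 3, 'world': 3, 'a': 1, 'hi': 2}))
# (['hello', 'world', 'hi'], ['hello', 'world'], ['hello', 'world'])
```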
At first, check collections.Counter:
import collections
word_counts = collections.Counter(your_text.split())
Given that, you can use its .most_common method for the most common words. It produces a list of (word, its_count) tuples.
To discover the longest words in the dictionary, you can do:
import heapq
largest_words = heapq.nlargest(N, word_counts, key=len)
N being the count of largest words you want. This works because by default the iteration over a dict produces only the keys, so it sorts them according to the word length (key=len) and returns only the N largest ones.
But you seem to have fallen deep into Python without going over the tutorial. Is it homework?

python dictionary comprehension iterator

Hey, I have a doubt about the following Python code I wrote:
#create a list of elements
#use a dictionary to find out the frequency of each element
list = [1,2,6,3,4,5,1,1,3,2,2,5]
list.sort()
dict = {i: list.count(i) for i in list}
print(dict)
In the dictionary comprehension, "for i in list" is the sequence supplied to the comprehension, right? So it takes 1, 2, 3, 4... as the keys. My question is: why doesn't it take 1 three times? Because I've said "for i in list", doesn't it have to take each and every element in the list as a key?
(I'm new to python so be easy on me !)
My question is why doesn't it take 1 three times ?
That's because dictionary keys are unique. If there is another entry found for the same key, the previous value for that key will be overwritten.
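You can see the overwriting directly with a small example:

```python
lst = [1, 2, 1, 1]
# Each duplicate key simply overwrites the previous entry, so the
# comprehension ends up with one entry per distinct value.
d = {i: lst.count(i) for i in lst}
print(d)  # {1: 3, 2: 1}
```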
Well, for your issue, if you are only after counting the frequency of each element in your list, then you can use collections.Counter
And please don't use list as variable name. It's a built-in.
>>> lst = [1,2,6,3,4,5,1,1,3,2,2,5]
>>> from collections import Counter
>>> Counter(lst)
Counter({1: 3, 2: 3, 3: 2, 5: 2, 4: 1, 6: 1})
Yes, your suspicion is correct: 1 will come up 3 times during iteration. However, since dictionary keys are unique, each time 1 comes up it will replace the previously generated key/value pair with the newly generated one. While this gives the right answer, it's not the most efficient approach. You could convert the list to a set instead to avoid reprocessing duplicate keys:
dict = {i: list.count(i) for i in set(list)}
However, even this method is horribly inefficient because it does a full pass over the list for each value in the list, i.e. O(nĀ²) total comparisons. You could do this in a single pass over the list, but you wouldn't use a dict comprehension:
xs = [1,2,6,3,4,5,1,1,3,2,2,5]
counts = {}
for x in xs:
    counts[x] = counts.get(x, 0) + 1
The result for counts is: {1: 3, 2: 3, 3: 2, 4: 1, 5: 2, 6: 1}
Edit: I didn't realize there was something in the library to do this for you. You should use Rohit Jain's solution with collections.Counter instead.

counting duplicate words in python the fastest way [duplicate]

This question already has answers here:
Why do `key in dict` and `key in dict.keys()` have the same output?
(2 answers)
Closed 13 days ago.
I was trying to count duplicate words over a list of 230 thousand words. I used a Python dictionary to do so. The code is given below:
for words in word_list:
    if words in word_dict.keys():
        word_dict[words] += 1
    else:
        word_dict[words] = 1
The above code took 3 minutes! I ran the same code over 1.5 million words; it was still running after more than 25 minutes, so I lost my patience and terminated it. Then I found that I could use the following code from here (also shown below). The result was surprising: it completed within seconds! So my question is: what is the faster way to do this operation? I guess the dictionary creation process must take O(N) time. How was the Counter method able to complete this process in seconds and create an exact dictionary with each word as key and its frequency as value?
from collections import Counter
word_dict = Counter(word_list)
It's because of this:
if words in word_dict.keys():
In Python 2, .keys() returns a list of all the keys. Lists take linear time to scan, so your program was running in quadratic time! (In Python 3, .keys() returns a set-like view with constant-time membership tests, so this particular pitfall is gone.)
Try this instead:
if words in word_dict:
Also, if you're interested, you can see the Counter implementation for yourself. It's written in regular Python.
Your dictionary counting method is not well constructed.
You could have used a defaultdict in the following way:
from collections import defaultdict

d = defaultdict(int)
for word in word_list:
    d[word] += 1
But the Counter class from collections is still faster, even though it does almost the same thing, because it is written with a more efficient implementation. However, with Counter you need to pass it an iterable to count, whereas with a defaultdict you can feed in sources from different locations and use a more complicated loop.
Ultimately it is your preference: if you are counting a list, Counter is the way to go; if you are iterating over multiple sources, or you simply want a counter in your program and don't want the extra lookup to check whether an item is already being counted, then defaultdict is your choice.
You can actually look at the Counter code, here is the update method that is called on init:
(Notice it uses the performance trick of defining a local definition of self.get)
def update(self, iterable=None, **kwds):
    '''Like dict.update() but add counts instead of replacing them.
    Source can be an iterable, a dictionary, or another Counter instance.
    >>> c = Counter('which')
    >>> c.update('witch')  # add elements from another iterable
    >>> d = Counter('watch')
    >>> c.update(d)        # add elements from another counter
    >>> c['h']             # four 'h' in which, witch, and watch
    4
    '''
    # The regular dict.update() operation makes no sense here because the
    # replace behavior results in the some of original untouched counts
    # being mixed-in with all of the other counts for a mismash that
    # doesn't have a straight-forward interpretation in most counting
    # contexts. Instead, we implement straight-addition. Both the inputs
    # and outputs are allowed to contain zero and negative counts.
    if iterable is not None:
        if isinstance(iterable, Mapping):
            if self:
                self_get = self.get
                for elem, count in iterable.iteritems():
                    self[elem] = self_get(elem, 0) + count
            else:
                super(Counter, self).update(iterable)  # fast path when counter is empty
        else:
            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1
    if kwds:
        self.update(kwds)
You could also try to use defaultdict as a more competitive choice. Try:
from collections import defaultdict
word_dict = defaultdict(lambda: 0)
for word in word_list:
    word_dict[word] += 1
print(word_dict)
Similar to what monkut mentioned, one of the best ways to do this is to utilize the .get() function. Credit for this goes to Charles Severance and the Python For Everybody Course
For testing:
# Pretend line is as follow.
# It can and does contain \n (newline) but not \t (tab).
line = """Your battle is my battle . We fight together . One team . One team .
Shining sun always comes with the rays of hope . The hope is there .
Our best days yet to come . Let the hope light the road .""".lower()
His code (with my notes):
# make an empty dictionary
# split `line` into a list. default is to split on a space character
# etc., etc.
# iterate over the LIST of words (made from splitting the string)
counts = dict()
words = line.split()
for word in words:
    counts[word] = counts.get(word, 0) + 1
Your code:
for words in word_list:
    if words in word_dict.keys():
        word_dict[words] += 1
    else:
        word_dict[words] = 1
.get() does this:
Return the VALUE in the dictionary associated with word.
Otherwise (if the word is not a key in the dictionary), return 0.
No matter what is returned, we add 1 to it. Thus it handles the base case of seeing a word for the first time. We cannot use a dictionary comprehension here, because the variable the comprehension is assigned to does not exist yet while the comprehension runs. Meaning this:
counts = {word: counts.get(word, 0) + 1 for word in words}
is not possible, since counts hasn't been defined by the time we reference it to .get() from it.
Output
>> counts
{'.': 8,
'always': 1,
'battle': 2,
'best': 1,
'come': 1,
'comes': 1,
'days': 1,
'fight': 1,
'hope': 3,
'is': 2,
'let': 1,
'light': 1,
'my': 1,
'of': 1,
'one': 2,
'our': 1,
'rays': 1,
'road': 1,
'shining': 1,
'sun': 1,
'team': 2,
'the': 4,
'there': 1,
'to': 1,
'together': 1,
'we': 1,
'with': 1,
'yet': 1,
'your': 1}
As an aside, here is a "loaded" use of .get() that I wrote as a way to solve the classic FizzBuzz question. I'm currently writing code for a similar situation in which I will use modulus and a dictionary, but for a split string as input.
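The original snippet isn't shown here, but a sketch of what such a dictionary-plus-.get() FizzBuzz might look like (my own reconstruction, not the author's code) is:

```python
def fizzbuzz(n):
    # Map each (divisible-by-3, divisible-by-5) combination to its word;
    # .get() falls back to the number itself when neither divides n.
    words = {(True, False): "Fizz",
             (False, True): "Buzz",
             (True, True): "FizzBuzz"}
    return words.get((n % 3 == 0, n % 5 == 0), str(n))

print([fizzbuzz(n) for n in range(1, 16)])
# ['1', '2', 'Fizz', '4', 'Buzz', 'Fizz', '7', '8', 'Fizz', 'Buzz', '11', 'Fizz', '13', '14', 'FizzBuzz']
```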
