Counting items inside tuples in Python - python

I am fairly new to python and I could not figure out how to do the following.
I have a list of (word, tag) tuples
a = [('Run', 'Noun'),('Run', 'Verb'),('The', 'Article'),('Run', 'Noun'),('The', 'DT')]
I am trying to find all tags that have been assigned to each word and collect their counts. For example, the word "Run" has been tagged twice as 'Noun' and once as 'Verb'.
To clarify: I would like to create another list of tuples that contains (word, tag, count)

You can use collections.Counter:
>>> import collections
>>> a = [('Run', 'Noun'),('Run', 'Verb'),('The', 'Article'),('Run', 'Noun'),('The', 'DT')]
>>> counter = collections.Counter(a)
>>> counter
Counter({('Run', 'Noun'): 2, ('Run', 'Verb'): 1, ... })
>>> result = {}
>>> for (word, tag), count in counter.items():
...     result.setdefault(word, []).append({tag: count})
...
>>> print(result)
{'Run': [{'Noun': 2}, {'Verb': 1}], 'The': [{'Article': 1}, {'DT': 1}]}
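Since the question asks for a list of (word, tag, count) tuples rather than a dictionary, here is a minimal sketch of one way to get that shape directly from the Counter. The name triples is only illustrative, and the ordering shown assumes Python 3.7+, where Counter keeps insertion order.

```python
from collections import Counter

a = [('Run', 'Noun'), ('Run', 'Verb'), ('The', 'Article'), ('Run', 'Noun'), ('The', 'DT')]

# Each Counter key is a (word, tag) tuple; unpack it next to its count.
triples = [(word, tag, count) for (word, tag), count in Counter(a).items()]
print(triples)
# [('Run', 'Noun', 2), ('Run', 'Verb', 1), ('The', 'Article', 1), ('The', 'DT', 1)]
```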

Pretty easy with a defaultdict:
>>> from collections import defaultdict
>>> output = defaultdict(defaultdict(int).copy)
>>> for word, tag in a:
...     output[word][tag] += 1
...
>>> output
defaultdict(<function copy>,
{'Run': defaultdict(int, {'Noun': 2, 'Verb': 1}),
'The': defaultdict(int, {'Article': 1, 'DT': 1})})
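If plain dictionaries are preferred over nested defaultdict objects, something like the following sketch may work. It assumes a lambda factory (an alternative to the defaultdict(int).copy trick, with the same effect) and uses a hypothetical name, plain, for the converted result.

```python
from collections import defaultdict

a = [('Run', 'Noun'), ('Run', 'Verb'), ('The', 'Article'), ('Run', 'Noun'), ('The', 'DT')]

# Outer dict creates a fresh defaultdict(int) for every new word.
output = defaultdict(lambda: defaultdict(int))
for word, tag in a:
    output[word][tag] += 1

# Convert to ordinary dicts for printing or serialising.
plain = {word: dict(tags) for word, tags in output.items()}
print(plain)
# {'Run': {'Noun': 2, 'Verb': 1}, 'The': {'Article': 1, 'DT': 1}}
```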

Related

Dictionary with a query of sets in python

I am trying to record the position of each word across a list of strings and store it in a dictionary whose keys are the words and whose values are sets of the indices of the strings in which each word appears.
list_x = ["this is the first", "this is the second"]
my_dict = {}
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x] += 1
        else:
            my_dict[x] = 1
print(my_dict)
This is the code I tried, but it gives me the total number of times each word appears in the list.
What I am trying to get is this format:
{'this': {0, 1}, 'is': {0, 1}, 'the': {0, 1}, 'first': {0}, 'second': {1}}
As you can see, 'this' is a key and it appears once in the string at position 0 and once in the string at position 1, and so on. Any idea how I might get to this point?
Fixed two lines:
list_x = ["this is the first", "this is the second"]
my_dict = {}
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x].append(i)
        else:
            my_dict[x] = [i]
print(my_dict)
Returns:
{'this': [0, 1], 'is': [0, 1], 'the': [0, 1], 'first': [0], 'second': [1]}
Rather than using integers in your dict, you should use a set:
for i in range(len(list_x)):
    for x in list_x[i].split():
        if x in my_dict:
            my_dict[x].add(i)
        else:
            my_dict[x] = {i}
Or, more briefly,
for i in range(len(list_x)):
    for x in list_x[i].split():
        my_dict.setdefault(x, set()).add(i)
You can also do this with defaultdict and enumerate:
from collections import defaultdict
list_x = ["this is the first",
          "this is the second",
          "third is this"]
pos = defaultdict(set)
for i, sublist in enumerate(list_x):
    for word in sublist.split():
        pos[word].add(i)
Output:
>>> from pprint import pprint
>>> pprint(dict(pos))
{'first': {0},
 'is': {0, 1, 2},
 'second': {1},
 'the': {0, 1},
 'third': {2},
 'this': {0, 1, 2}}
The purpose of enumerate is to provide the index (position) of each string within list_x. For each word encountered, the position of its sentence within list_x will be added to the set for its corresponding key in the result, pos.
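The same idea can be wrapped in a small function. This is only a sketch: word_positions and index are hypothetical names, and the intersection at the end is one possible reason to prefer sets over lists here.

```python
from collections import defaultdict

def word_positions(sentences):
    """Map each word to the set of indices of the sentences containing it."""
    pos = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split():
            pos[word].add(i)
    return dict(pos)

index = word_positions(["this is the first", "this is the second"])
print(index)
# Set intersection answers "which sentences contain both words?"
print(index['the'] & index['second'])  # {1}
```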

Find count of identical adjacent characters in a string

I have a string: 'AAAAATTT'
I want to write a program that counts each time two adjacent characters are identical.
So in 'AAAAATTT' it would give a count of:
AA: 4
TT: 2
You can use collections.defaultdict for this. This is an O(n) complexity solution which loops through adjacent letters and builds a dictionary based on a condition.
Your output will be a dictionary with keys as repeated letters and values as counts.
The use of itertools.islice is to avoid building a new list for the second argument of zip.
from collections import defaultdict
from itertools import islice
x = 'AAAAATTT'
d = defaultdict(int)
for i, j in zip(x, islice(x, 1, None)):
    if i == j:
        d[i+j] += 1
Result:
print(d)
defaultdict(<class 'int'>, {'AA': 4, 'TT': 2})
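You could use a Counter. Note that this counts total occurrences of each character, so it only matches the adjacent-pair count when each character occurs in a single contiguous run: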
from collections import Counter
s = 'AAAAATTT'
print([(k*2, v - 1) for k, v in Counter(list(s)).items() if v > 1])
#output: [('AA', 4), ('TT', 2)]
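To see where counting total occurrences and counting adjacent pairs diverge, here is a small comparison sketch. Both function names are hypothetical; the second one reproduces the "total minus one" shortcut as a dict.

```python
from collections import Counter

def adjacent_pairs(s):
    """Count identical adjacent pairs by comparing each character with its neighbour."""
    return dict(Counter(a + b for a, b in zip(s, s[1:]) if a == b))

def total_minus_one(s):
    """Shortcut: total occurrences of each character, minus one."""
    return {k * 2: v - 1 for k, v in Counter(s).items() if v > 1}

print(adjacent_pairs('AAAAATTT'), total_minus_one('AAAAATTT'))
# {'AA': 4, 'TT': 2} {'AA': 4, 'TT': 2}
print(adjacent_pairs('AATAA'), total_minus_one('AATAA'))
# {'AA': 2} {'AA': 3}
```

The two agree on the question's input but not once a character appears in more than one run.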
You may use collections.Counter with dictionary comprehension and zip as:
>>> from collections import Counter
>>> s = 'AAAAATTT'
>>> {k: v for k, v in Counter(zip(s, s[1:])).items() if k[0]==k[1]}
{('A', 'A'): 4, ('T', 'T'): 2}
Here's another alternative using itertools.groupby, but it is not as clean as the solution above and will also be slower.
>>> from itertools import groupby
>>> {x[0]:len(x) for i,j in groupby(zip(s, s[1:]), lambda y: y[0]==y[1]) for x in (tuple(j),) if i}
{('A', 'A'): 4, ('T', 'T'): 2}
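A possibly more readable groupby variant groups the characters themselves rather than the zipped pairs: a run of n identical characters contains n - 1 adjacent identical pairs. This is a sketch, and adjacent_pair_counts is a hypothetical name.

```python
from collections import Counter
from itertools import groupby

def adjacent_pair_counts(s):
    counts = Counter()
    for char, run in groupby(s):
        length = sum(1 for _ in run)  # length of this run of identical characters
        if length > 1:
            counts[char * 2] += length - 1  # accumulate across separate runs
    return dict(counts)

print(adjacent_pair_counts('AAAAATTT'))  # {'AA': 4, 'TT': 2}
```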
One way, using Counter:
from collections import Counter
string = 'AAAAATTT'
result = dict(Counter(s1+s2 for s1, s2 in zip(string, string[1:]) if s1==s2))
print(result)
Result:
{'AA': 4, 'TT': 2}
You can also do it with just range, without importing anything:
data = 'AAAAATTT'
count_dict = {}
for i in range(len(data)):
    data_x = data[i:i+2]
    if len(data_x) > 1:
        if data_x[0] == data_x[1]:
            if data_x not in count_dict:
                count_dict[data_x] = 1
            else:
                count_dict[data_x] += 1
print(count_dict)
output:
{'TT': 2, 'AA': 4}

Find count of characters within the string in Python

I am trying to create a dictionary of each character and the number of times it is repeated in a string. Suppose the string is as below:
str1 = "aabbaba"
I want to create a dictionary like this
word_count = {'a':4,'b':3}
I am trying to use dictionary comprehension to do this.
I did
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
This ends up giving an error saying
File "<stdin>", line 1
dic = {x:dic[x]+1 if x in dic.keys() else x:1 for x in str}
^
SyntaxError: invalid syntax
Can anybody tell me what's wrong with the syntax? Also, how can I create such a dictionary using a dictionary comprehension?
As others have said, this is best done with a Counter.
You can also do:
>>> {e:str1.count(e) for e in set(str1)}
{'a': 4, 'b': 3}
But that traverses the string 1 + m times, where m is the number of unique characters (once to create the set, and once per unique character to count its occurrences), so the runtime is O(n*m) and approaches quadratic when most characters are unique. That is a bad result if you have a lot of unique characters in a long string. A Counter only traverses the string once.
If you want a no-import version that is more efficient than using .count, you can use .setdefault to build the counter:
>>> count={}
>>> for c in str1:
...     count[c]=count.setdefault(c, 0)+1
...
>>> count
{'a': 4, 'b': 3}
That only traverses the string once, no matter how long it is or how many unique characters it has.
You can also use defaultdict if you prefer:
>>> from collections import defaultdict
>>> count=defaultdict(int)
>>> for c in str1:
...     count[c]+=1
...
>>> count
defaultdict(<type 'int'>, {'a': 4, 'b': 3})
>>> dict(count)
{'a': 4, 'b': 3}
But if you are going to import collections, use a Counter!
The ideal way to do this is with collections.Counter:
>>> from collections import Counter
>>> str1 = "aabbaba"
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
You cannot achieve this with a simple dict comprehension, because each value would need to refer to the previous count for that element. As mentioned in Dawg's answer, as a workaround you can use str.count(e) within the dict comprehension to count each element of the set of characters. But the time complexity will be O(n*m), since it traverses the complete string for each unique element (where m is the number of unique elements), whereas with Counter it is O(n).
This is a nice case for collections.Counter:
>>> from collections import Counter
>>> Counter(str1)
Counter({'a': 4, 'b': 3})
It's a dict subclass, so you can work with the object much as you would with a standard dictionary:
>>> c = Counter(str1)
>>> c['a']
4
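A few other things a Counter offers beyond a plain dict, shown as a brief sketch on the same string:

```python
from collections import Counter

c = Counter("aabbaba")
print(c.most_common(1))  # [('a', 4)]
print(c['z'])            # 0, missing keys count as zero instead of raising KeyError
print(dict(c))           # {'a': 4, 'b': 3}
```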
You can do this without the Counter class as well. Simple and efficient Python code for this would be:
>>> d = {}
>>> for x in str1:
...     d[x] = d.get(x, 0) + 1
...
>>> d
{'a': 4, 'b': 3}
Note that this is not the correct way to count characters: inside the comprehension, dic still refers to the dictionary that existed before the statement ran, so repeated characters are not counted more than once (and the other keys of the original dict are lost). It does, however, answer the original question of whether if-else is possible in comprehensions and demonstrates how it can be done.
To answer your question, yes it's possible but the approach is like this:
dic = {x: (dic[x] + 1 if x in dic else 1) for x in str1}
The condition is applied to the value only, not to the key: value mapping.
The above can be made clearer using dict.get:
dic = {x: dic.get(x, 0) + 1 for x in str1}
0 is returned if x is not in dic.
Demo:
In [78]: s = "abcde"
In [79]: dic = {}
In [80]: dic = {x: (dic[x] + 1 if x in dic else 1) for x in s}
In [81]: dic
Out[81]: {'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1}
In [82]: s = "abfg"
In [83]: dic = {x: dic.get(x, 0) + 1 for x in s}
In [84]: dic
Out[84]: {'a': 2, 'b': 2, 'f': 1, 'g': 1}

creating a dictionary of words in string whose values are words following that word

I would like to create a dictionary from a text file using each unique word as a key and a dictionary of the words that follow the key, with the count of each following word as the value. For example, something that looks like this:
>>> string = 'This is a string'
>>> word_counts(string)
{'this': {'is': 1}, 'is': {'a': 1}, 'a': {'string': 1}}
Creating a dictionary of the unique words is no issue; it's creating the dictionary of following-word counts that I'm stuck on. I can't use a list.index() operation in case there are repeated words. Outside of that, I am kind of at a loss.
Actually, the collections.Counter class isn't always the best choice to count something. You can use collections.defaultdict:
from collections import defaultdict
def bigrams(text):
    words = text.strip().lower().split()
    counter = defaultdict(lambda: defaultdict(int))
    for prev, current in zip(words[:-1], words[1:]):
        counter[prev][current] += 1
    return counter
Note that if your text contains punctuation marks as well, the line words = text.strip().lower().split() should be replaced with words = re.findall(r'\w+', text.lower()), which also requires import re.
And if your text is so huge that the performance matters, you may consider the pairwise recipe from itertools docs or, if you're using python2, itertools.izip instead of zip.
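As an illustration of the punctuation-aware variant, here is a self-contained sketch. It returns plain dicts for readability, which is an assumption on my part rather than part of the answer above.

```python
import re
from collections import defaultdict

def bigrams(text):
    # \w+ keeps runs of letters, digits and underscores, dropping punctuation.
    words = re.findall(r'\w+', text.lower())
    counter = defaultdict(lambda: defaultdict(int))
    for prev, current in zip(words, words[1:]):
        counter[prev][current] += 1
    # Convert to ordinary dicts so the result prints cleanly.
    return {word: dict(following) for word, following in counter.items()}

print(bigrams('This is a string. This is fine!'))
# {'this': {'is': 2}, 'is': {'a': 1, 'fine': 1}, 'a': {'string': 1}, 'string': {'this': 1}}
```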
You can make use of Counter to achieve what you want:
from collections import Counter, defaultdict
def get_tokens(string):
    return string.split()  # put whatever token-parsing algorithm you want here
def word_counts(string):
    tokens = get_tokens(string)
    following_words = defaultdict(list)
    for i, token in enumerate(tokens):
        if i:
            following_words[tokens[i - 1]].append(token)
    # .items() on Python 3; on Python 2 this would be .iteritems()
    return {token: Counter(words) for token, words in following_words.items()}
string = 'this is a string'
print(word_counts(string))  # {'this': Counter({'is': 1}), 'is': Counter({'a': 1}), 'a': Counter({'string': 1})}
Just to give an alternative option (I imagine the other answers are more suitable for your needs) you could use the pairwise recipe from itertools:
from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)  # on Python 2, use itertools.izip here instead
Then the function can be coded as:
from collections import defaultdict
def word_counts(string):
    words = string.split()
    result = defaultdict(lambda: defaultdict(int))
    for word1, word2 in pairwise(words):
        result[word1][word2] += 1
    return result
Test:
string = 'This is a string is not an int is a string'
print(word_counts(string))
Produces:
{'a': {'string': 2}, 'string': {'is': 1}, 'This': {'is': 1}, 'is': {'a': 2, 'not': 1}, 'an': {'int': 1}, 'int': {'is': 1}, 'not': {'an': 1}}
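On Python 3.10 and later, itertools ships its own pairwise, so the recipe need not be copied. A hedged sketch, with a fallback to the recipe for older versions and a conversion to plain dicts that is my own addition:

```python
from collections import defaultdict
from itertools import tee

try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

def word_counts(string):
    result = defaultdict(lambda: defaultdict(int))
    for word1, word2 in pairwise(string.split()):
        result[word1][word2] += 1
    # Plain dicts are easier to read when printed.
    return {word: dict(following) for word, following in result.items()}

print(word_counts('This is a string is not an int is a string'))
```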

How to count frequency of single word and also double word count from input text in python?

Hello, I want to count single words and double words (pairs of adjacent words) in an input text in Python.
Ex.
"what is your name ? what you want from me ?
You know best way to earn money is Hardwork
what is your aim ?"
output:
Single W.C. :
what 3
is 3
your 2
you 2
and so on..
Double W.C. :
what is 2
is your 2
your name 1
what you 1
and so on..
Please post a way to do this.
I use the following code for the single word count:
ws={}
for line in text:
    for wrd in line:
        if wrd not in ws:
            ws[wrd]=1
        else:
            ws[wrd]+=1
from collections import Counter
s = "..."
words = s.split()
pairs = zip(words, words[1:])
single_words, double_words = Counter(words), Counter(pairs)
Output:
print("Single W.C.")
for word, count in sorted(single_words.items(), key=lambda x: -x[1]):
    print(word, count)
print("Double W.C.")
for pair, count in sorted(double_words.items(), key=lambda x: -x[1]):
    print(pair, count)
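Applied to the question's sample text, the approach might look like the sketch below. Lowercasing and dropping punctuation with a regular expression are assumptions made here so that "You" and "you" are counted together, as the expected output in the question suggests; Counter.most_common replaces the manual sort.

```python
import re
from collections import Counter

text = ("what is your name ? what you want from me ?\n"
        "You know best way to earn money is Hardwork\n"
        "what is your aim ?")

# Assumption: case-insensitive counting, punctuation ignored.
words = re.findall(r"\w+", text.lower())
single_words = Counter(words)
double_words = Counter(zip(words, words[1:]))

print("Single W.C.")
for word, count in single_words.most_common(4):
    print(word, count)
print("Double W.C.")
for (first, second), count in double_words.most_common(4):
    print(first, second, count)
```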
import nltk
from nltk import bigrams
# text is assumed to hold the input string from the question
tokens = nltk.word_tokenize(text)
tokens = [token.lower() for token in tokens if len(token) > 1]
bi_tokens = list(bigrams(tokens))  # materialize: bigrams() may return a generator, which has no .count()
print([(item, tokens.count(item)) for item in sorted(set(tokens))])
print([(item, bi_tokens.count(item)) for item in sorted(set(bi_tokens))])
This works, using defaultdict (Python 2.6):
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> string = ("what is your name ? what you want from me ?\n"
...           "You know best way to earn money is Hardwork\n what is your aim ?")
>>> l = string.split()
>>> for i in l:
...     d[i] += 1
...
>>> d
defaultdict(<type 'int'>, {'me': 1, 'aim': 1, 'what': 3, 'from': 1, 'name': 1,
'You': 1, 'money': 1, 'is': 3, 'earn': 1, 'best': 1, 'Hardwork': 1, 'to': 1,
'way': 1, 'know': 1, 'want': 1, 'you': 1, 'your': 2, '?': 3})
>>> d2 = defaultdict(int)
>>> for i in zip(l[:-1], l[1:]):
...     d2[i] += 1
...
>>> d2
defaultdict(<type 'int'>, {('You', 'know'): 1, ('earn', 'money'): 1,
('is', 'Hardwork'): 1, ('you', 'want'): 1, ('know', 'best'): 1,
('what', 'is'): 2, ('your', 'name'): 1, ('from', 'me'): 1,
('name', '?'): 1, ('?', 'You'): 1, ('?', 'what'): 1, ('to', 'earn'): 1,
('aim', '?'): 1, ('way', 'to'): 1, ('Hardwork', 'what'): 1,
('money', 'is'): 1, ('me', '?'): 1, ('what', 'you'): 1, ('best', 'way'): 1,
('want', 'from'): 1, ('is', 'your'): 2, ('your', 'aim'): 1})
>>>
I realize this question is a few years old. I wrote a little routine today to count individual words from a Word document (docx). I used docx2txt to get the text from the document, used my first regular expression ever to remove every character other than letters, digits, or spaces, and switched everything to uppercase. I am posting this because the question isn't answered.
Here is my little test routine in case it may help anyone.
import re
from collections import Counter
import docx2txt
mydoc = 'I:/flashdrive/pmw/pmw_py.docx'
words_all = {}
my_text = docx2txt.process(mydoc)
print(my_text)
my_text_org = my_text
# replace every run of non-alphanumeric characters with a space, in uppercase
my_text = re.sub(r'[\W_]+', ' ', my_text.upper(), flags=re.UNICODE)
print(my_text)
words = my_text.split()
words_org = words  # just in case I may need the original version later
# added this code for the double words (it has to run after words is built)
pairs = zip(words, words[1:])
pair_list = Counter(pairs)
print('before pair listing')
for pair, count in sorted(pair_list.items(), key=lambda x: -x[1]):
    my_pair = "{} {}".format(pair[0], pair[1])
    print(my_pair, ": ", count)
# end of added code
for i in words:
    if i not in words_all:
        words_all[i] = words.count(i)
for k, v in sorted(words_all.items()):
    print(k, v)
print("Number of items in word list: {}".format(len(words_all)))
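For trying the cleanup and counting steps without a Word document, here is a sketch that works on an in-memory string. The function name count_words_and_pairs and the sample text are hypothetical, and Counter replaces the repeated words.count(i) calls, which rescan the list for every distinct word.

```python
import re
from collections import Counter

def count_words_and_pairs(raw_text):
    # Same cleanup as above: non-alphanumeric runs become spaces, all uppercase.
    cleaned = re.sub(r'[\W_]+', ' ', raw_text.upper())
    words = cleaned.split()
    return Counter(words), Counter(zip(words, words[1:]))

word_counts, pair_counts = count_words_and_pairs("Hello, world! hello_world 42")
for word, count in sorted(word_counts.items()):
    print(word, count)
for (first, second), count in pair_counts.most_common():
    print(first, second, ":", count)
```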
