How can I count recognized entities per label?

How can I count recognized entities per label? - python

for quantitative analysis, I would like to count how many entities of a specific type were recognized in a set of descriptions.
I'm reading the excel file
checking the size
iterating through the first 100 records
so far so good - now I want to count any time a certain entity of a type is recognized in the processed line/row and print the results afterward:
eg.:
PERSON: 34,
ORG: 10,
PRODUCT: 23,...
print('RAWDATASIZE:',rawdata["Activity.Description"].size)
print('Summary of entities recognized:')
count = {}
for index, row in validation_rawdata.head(100).iterrows():
line = row['Activity.Description']
if not (line is None):
doc = nlp(str(line))
entities = {}
entities_text = []
for ent in doc.ents:
count[ent.label_] =+ 1
print(count)
the current output looks like this:
RAWDATASIZE: 233291
Summary of entities recognized:
{'PERSON': 1, 'DATE': 1, 'GPE': 1, 'SHS_PRODUCT': 1, 'ORG': 1, 'NORP': 1, 'CARDINAL': 1, 'TIME': 1, 'LOC': 1, 'WORK_OF_ART': 1}
so it seems like its resetting the count after each iteration.
How can I change the code keep counting?

There's a typo in your code:
=+ 1 should be += 1

Related

Count total number of modal verbs in text

I am trying to create a custom collection of words as shown in the following Categories:
Modal Tentative Certainty Generalizing
Can Anyhow Undoubtedly Generally
May anytime Ofcourse Overall
Might anything Definitely On the Whole
Must hazy No doubt In general
Shall hope Doubtless All in all
ought to hoped Never Basically
will uncertain always Essentially
need undecidable absolute Most
Be to occasional assure Every
Have to somebody certain Some
Would someone clear Often
Should something clearly Rarely
Could sort inevitable None
Used to sorta forever Always
I am reading text from a CSV file row by row:
import nltk
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize
count = defaultdict(int)
header_list = ["modal","Tentative","Certainity","Generalization"]
categorydf = pd.read_csv('Custom-Dictionary1.csv', names=header_list)
def analyze(file):
df = pd.read_csv(file)
modals = str(categorydf['modal'])
tentative = str(categorydf['Tentative'])
certainity = str(categorydf['Certainity'])
generalization = str(categorydf['Generalization'])
for text in df["Text"]:
tokenize_text = text.split()
for w in tokenize_text:
if w in modals:
count[w] += 1
analyze("test1.csv")
print(sum(count.values()))
print(count)
I want to find number of Modal/Tentative/Certainty verbs which are present in the above table and in each row in test1.csv, but not able to do so. This is generating words frequency with number.
19
defaultdict(<class 'int'>, {'to': 7, 'an': 1, 'will': 2, 'a': 7, 'all': 2})
See 'an','a' are not present in the table. I want to get No of Model verbs = total modal verbs present in 1 row of test.csv text
test1.csv:
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
"They convey the content of a communication."
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
I am stuck and not getting anything. How can I proceed?

I've solved your task for initial CSV format, could be of cause adopted to XML input if needed.
I've did quite fancy solution using NumPy, that's why solution might be a bit complex, but runs very fast and suitable for large data, even Giga-Bytes.
It uses sorted table of words, also sorts text to count words and sorted-search in table, hence works in O(n log n) time complexity.
It outputs original text line on first line, then Found-line where it lists each found in Tabl word in sorted order with (Count, Modality, (TableRow, TableCol)), then Non-Found-line where it lists non-found-in-table words plus Count (number of occurancies of this word in text).
Also a much simpler (but slower) similar solution is located after the first one.
Try it online!
import io, pandas as pd, numpy as np
# Instead of io.StringIO(...) provide filename.
tab = pd.read_csv(io.StringIO("""
Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
tabc = np.array(tab.columns.values.tolist(), dtype = np.str_)
taba = tab.values.astype(np.str_)
tabw = np.char.lower(taba.ravel())
tabi = np.zeros([tabw.size, 2], dtype = np.int64)
tabi[:, 0], tabi[:, 1] = [e.ravel() for e in np.split(np.mgrid[:taba.shape[0], :taba.shape[1]], 2, axis = 0)]
t = np.argsort(tabw)
tabw, tabi = tabw[t], tabi[t, :]
texts = pd.read_csv(io.StringIO("""
Text
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
""")).values[:, 0].astype(np.str_)
for i, (a, text) in enumerate(zip(map(np.array, np.char.split(texts)), texts)):
vs, cs = np.unique(np.char.lower(a), return_counts = True)
ps = np.searchsorted(tabw, vs)
unc = np.zeros_like(a, dtype = np.bool_)
psm = ps < tabi.shape[0]
psm[psm] = tabw[ps[psm]] == vs[psm]
print(
i, ': Text:', text,
'\nFound:',
', '.join([f'"{vs[i]}": ({cs[i]}, {tabc[tabi[ps[i], 1]]}, ({tabi[ps[i], 0]}, {tabi[ps[i], 1]}))'
for i in np.flatnonzero(psm).tolist()]),
'\nNon-Found:',
', '.join([f'"{vs[i]}": {cs[i]}'
for i in np.flatnonzero(~psm).tolist()]),
'\n',
)
Outputs:
0 : Text: When LIWC was first developed, the goal was to devise an efficient will system
Found: "will": (1, Modal, (6, 0))
Non-Found: "an": 1, "developed,": 1, "devise": 1, "efficient": 1, "first": 1, "goal": 1, "liwc": 1, "system": 1, "the": 1, "to": 1, "was": 2, "when":
1
1 : Text: Within a few years, it became clear that there are two very broad categories of words
Found: "clear": (1, Certainty, (10, 2))
Non-Found: "a": 1, "are": 1, "became": 1, "broad": 1, "categories": 1, "few": 1, "it": 1, "of": 1, "that": 1, "there": 1, "two": 1, "very": 1, "withi
n": 1, "words": 1, "years,": 1
2 : Text: Content words are generally nouns, regular verbs, and many adjectives and adverbs.
Found: "generally": (1, Generalizing, (0, 3))
Non-Found: "adjectives": 1, "adverbs.": 1, "and": 2, "are": 1, "content": 1, "many": 1, "nouns,": 1, "regular": 1, "verbs,": 1, "words": 1
3 : Text: They convey the content of a communication.
Found:
Non-Found: "a": 1, "communication.": 1, "content": 1, "convey": 1, "of": 1, "the": 1, "they": 1
4 : Text: To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”
Found:
Non-Found: "a": 1, "and": 2, "are:": 1, "back": 1, "content": 1, "dark": 1, "go": 1, "night”": 1, "phrase": 1, "stormy": 1, "the": 2, "to": 2, "was":
1, "words": 1, "“dark,”": 1, "“it": 1, "“night.”": 1, "“stormy,”": 1
Second solution is implemented in pure Python just for simplicity, only standard python modules io and csv are used.
Try it online!
import io, csv
# Instead of io.StringIO(...) just read from filename.
tab = csv.DictReader(io.StringIO("""Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
texts = csv.DictReader(io.StringIO("""
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
"""), fieldnames = ['Text'])
tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]
for text in texts:
cnt, mod = {}, {}
for word in text.lower().split():
if word in tabi:
cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
It outputs:
'will': (1, Modal)
'clear': (1, Certainty)
'generally': (1, Generalizing)
I'm reading from StringIO content of CSV, that is to convenience so that code contains everything without need of extra files, for sure in your case you'll need direct files reading, for this you may do same as in next code and next link (named Try it online!):
Try it online!
import io, csv
tab = csv.DictReader(open('table.csv', 'r', encoding = 'utf-8-sig'))
texts = csv.DictReader(open('texts.csv', 'r', encoding = 'utf-8-sig'), fieldnames = ['Text'])
tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]
for text in texts:
cnt, mod = {}, {}
for word in text.lower().split():
if word in tabi:
cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))

Transform every word in string to a dict and pass how many times all the word occured as value in python

I'm having trouble transforming every word of a string in a dictionary and passing how many times the word appears as the value.
For example
string = 'How many times times appeared in this many times'
The dict i wanted is:
dict = {'times':3, 'many':2, 'how':1 ...}

Using Counter
from collections import Counter
res = dict(Counter(string.split()))
#{'How': 1, 'many': 2, 'times': 3, 'appeared': 1, 'in': 1, 'this': 1}

You can loop through the words and increment the count like so:
d = {}
for word in string.split(" "):
d.setdefault(word, 0)
d[word] += 1

iterate through 2 dictionaries, if value occurs in one, count down the int value in other then remove key if count = 0

I have 2 dictionary objects:
members = {'member3': ['PCP3'], 'member4': ['PCP1'], 'member11': ['PCP2'], 'member12': ['PCP3']}
providers = {'PCP1': 2, 'PCP2': 2, 'PCP3': 1, 'PCP4': 3, 'PCP5': 4}
I want to iterate through both and each time a value of the "members" dict appears, subtract one from the count of that particular provider. If the count of a provider reaches zero, remove them from the "providers" dictionary and randomly pick one of the members out of the "member" dictionary. So in this case either member3 or member12 will be kicked out because there wasnt enough spots.
Result would look like this if member3 won the random toss for example:
members = {'member3' : 'PCP3', 'member4': 'PCP1' , 'member11': PCP2}
providers = {'PCP1: 1, 'PCP2' : 1, 'PCP4': 3, 'PCP5' : 4}
I have tried starting with this but this problem is beyond my abilities the moment
from collections import defaultdict
query_dict=defaultdict(set)
for (k,v), (k2,v2) in zip(members_reduced.items(), PCPs.items()):
query_dict[k[v]].subtract(k2[v2])
This gives error
TypeError: string indices must be integers
I also tried:
for (k,v), (k2,v2) in zip(members_reduced.items(), PCPs.items()):
if members_reduced[v] == PCPs[k2]:
PCPs[v2] -= 1
With error
TypeError: unhashable type: 'list'
I have no idea how I would input the random conditional if I got this first part right. this is a smaller model of a much larger model i need at work

One way to solve the task is:
import random
members = {'member3': ['PCP3'], 'member4': ['PCP1'], 'member11': ['PCP2'], 'member12': ['PCP3']}
providers = {'PCP1': 2, 'PCP2': 2, 'PCP3': 1, 'PCP4': 3, 'PCP5': 4}
to_remove = []
for member, provider_list in members.items():
provider = provider_list[0]
if provider in providers:
providers[provider] -= 1
if providers[provider] == 0:
providers.pop(provider)
to_remove.append(provider)
for provider in to_remove:
candidates = [
member for member, provider_list in members.items()
if provider == provider_list[0]]
candidate = random.sample(candidates, 1)[0]
members.pop(candidate)
print(members)
# {'member4': ['PCP1'], 'member11': ['PCP2'], 'member12': ['PCP3']}
print(providers)
# {'PCP1': 1, 'PCP2': 1, 'PCP4': 3, 'PCP5': 4}```
Basically, we solve the problem in two passes:
in the first pass we modify the counters and, when the counters reach 0, we mark the member to remove
in the second pass we remove the members according to what was marked for removal

Determine the proximity distance of two string in python

Clean_data is list with over 9000 text files. rules is list of dictionary containing over 500 elements. Below is the rules list
rules = [{'id': 1, 'kwd_root': 'add', 'kwd_sub': 'price target', 'word_count': 5, 'occurance': 1, 'kwd_search': 1, 'status': 1}, {'id': 2, 'kwd_root': 'add', 'kwd_sub': 'PT', 'word_count': 5, 'occurance': 1, 'kwd_search': 1, 'status': 1},.....]
My Question is : I need apply the rules for each and every element in clean_data list.below is the code i have used
for word in clean_data:
for i,d in enumerate(rules):
if any(d['kwd_root'] in word and d['kwd_sub'] in word):
if abs(word.index(d['kwd_root']) - word.index(d['kwd_sub'])) <= d['word_count']:
research.append(word)
else:
non_research.append(word)
else:
non_research.append(word)
After running this code i'm getting the len(non_research) to as 110000 and len(research) as 5500
But the expected output as len(non_research) + len(research) should be equal to len(clean_data)
Thanks

The code indentation posted is wrong. On the other side, the line 3 you use 'any' which need a list as argument. In addition research/non_research append a value each word and each condition (word x condition times). Maybe you can use:
for word in clean_data:
flag_rules = False
for i,d in enumerate(rules):
if d['kwd_root'] in word and d['kwd_sub'] in word:
if abs(word.index(d['kwd_root']) - word.index(d['kwd_sub'])) <= d['word_count']:
flag_rules = True
if flag_rules:
research.append(word)
else:
non_research.append(word)

how to find most common entry in dictionary of dictionaries in python

I have a dictionary that's two levels deep. That is, each key in the first dictionary is a url and the value is another dictionary with each key being words and each value being the number of times the word appeared on that url. It looks something like this:
dic = {
'http://www.cs.rpi.edu/news/seminars.html': {
'hyper': 1,
'summer': 2,
'expert': 1,
'koushk': 1,
'semantic': 1,
'feedback': 1,
'sandia': 1,
'lewis': 1,
'global': 1,
'yener': 1,
'laura': 1,
'troy': 1,
'session': 1,
'greenhouse': 1,
'human': 1
...and so on...
The dictionary itself is very long and has 25 urls in it, each url having another dictionary as its value with every word found within the url and the number of times its found.
I want to find the word or words that appear in the most different urls in the dictionary. So the output should look something like this:
The following words appear x times on y pages: list of words

It seems that you should use a Counter for this:
from collections import Counter
print sum((Counter(x) for x in dic.values()),Counter()).most_common()
Or the multiline version:
c = Counter()
for d in dic.values():
c += Counter(d)
print c.most_common()
To get the words which are common in all of the subdicts:
subdicts = iter(dic.values())
s = set(next(subdicts)).intersection(*subdicts)
Now you can use that set to filter the resulting counter, removing words which don't appear in every subdict:
c = Counter((k,v) for k,v in c.items() if k in s)
print c.most_common()

A Counter isn't quite what you want. From the output you show, it looks like you want to keep track of both the total number of occurrences, and the number of pages the word occurs on.
data = {
'page1': {
'word1': 5,
'word2': 10,
'word3': 2,
},
'page2': {
'word2': 2,
'word3': 1,
}
}
from collections import defaultdict
class Entry(object):
def __init__(self):
self.pages = 0
self.occurrences = 0
def __iadd__(self, occurrences):
self.pages += 1
self.occurrences += occurrences
return self
def __str__(self):
return '{} occurrences on {} pages'.format(self.occurrences, self.pages)
    def __repr__(self):
return '<Entry {} occurrences, {} pages>'.format(self.occurrences, self.pages)
counts = defaultdict(Entry)
for page_words in data.itervalues():
for word, count in page_words.iteritems():
counts[word] += count
for word, entry in counts.iteritems():
print word, ':', entry
This produces the following output:
word1 : 5 occurrences on 1 pages
word3 : 3 occurrences on 2 pages
word2 : 12 occurrences on 2 pages
That would capture the information you want, the next step would be to find the most common n words. You could do that using a heapsort (which has the handy feature of not requiring that you sort the whole list of words by number of pages then occurrences - that might be important if you've got a lot of words in total, but n of 'top n' is relatively small).
from heapq import nlargest
def by_pages_then_occurrences(item):
entry = item[1]
return entry.pages, entry.occurrences
print nlargest(2, counts.iteritems(), key=by_pages_then_occurrences)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I count recognized entities per label? - python

There's a typo in your code: =+ 1 should be += 1

Related

Count total number of modal verbs in text

Transform every word in string to a dict and pass how many times all the word occured as value in python

iterate through 2 dictionaries, if value occurs in one, count down the int value in other then remove key if count = 0

Determine the proximity distance of two string in python

how to find most common entry in dictionary of dictionaries in python

Categories

Resources