TF-IDF calculation KeyError - python

I want to calculate document frequencies of text documents. First I created the term dictionary and calculated the term frequencies. I have no problems in these steps, but when I try to use the function below it gives an error:
def computeDF(docList):
    df = {}
    df = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                df[word] += 1
    for word, val in df.items():
        df[word] = float(val)
    return df
Called the function like this:
dictList = []
for i in range(N):
    # creating dictionary for all documents
    tokens = processed_text[i]
    dictionary = dict.fromkeys(tokens, 0)
    # calculation of term frequencies for all documents
    for word in tokens:
        dictionary[word] += 1
    tf = termFreq(dictionary, tokens)
    dictList.append(dictionary)
df = computeDF(dictList)
I called the function with a list of 10 dictionaries, since the function expects a list.
N = 10 (num of documents)
Error:
line 155, in <module> df = computeDF(dictList)
line 134, in computeDF df[word] += 1
KeyError: 'flagstaff'
It works when I try the function in a different Python file with the same object types. I don't understand what the problem is. How can I solve it?

Where you have df = dict.fromkeys(docList[0].keys(), 0) you need something like
keys = set()
for doc in docList:
    keys = keys.union(set(doc.keys()))
df = dict.fromkeys(keys, 0)
That way you have keys for all your docs, not just the first one. If you want to do it in one line you can do it like this:
keys = set().union(*[set(doc.keys()) for doc in docList])
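Putting the two pieces together, a minimal sketch of the corrected computeDF (same structure as the question's version, only the initialisation changes):
def computeDF(docList):
    # collect the vocabulary from every document, not just the first one
    keys = set().union(*[set(doc.keys()) for doc in docList])
    df = dict.fromkeys(keys, 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                df[word] += 1
    for word, val in df.items():
        df[word] = float(val)
    return df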

Related

How to find the highest value element in a list with reference to a dictionary in Python

How do I code a function in Python which can:
iterate through a list of word strings, which may contain duplicate words, while referencing a dictionary,
find the word with the highest absolute sum, and
output it along with the corresponding absolute value.
The function also has to ignore words which are not in the dictionary.
For example,
Assume the function is called H_abs_W().
Given the following list and dict:
list_1 = ['apples','oranges','pears','apples']
Dict_1 = {'apples':5.23,'pears':-7.62}
Then calling the function as:
H_abs_W(list_1,Dict_1)
Should give the output:
'apples',10.46
EDIT:
I managed to do it in the end with the code below. Looking over the answers, turns out I could have done it in a shorter fashion, lol.
def H_abs_W(list_1, Dict_1):
    freqW = {}
    for char in list_1:
        if char in freqW:
            freqW[char] += 1
        else:
            freqW[char] = 1
    ASum_W = 0
    i_word = ''
    for a, b in freqW.items():
        x = 0
        d = Dict_1.get(a, 0)
        x = abs(float(b)*float(d))
        if x > ASum_W:
            ASum_W = x
            i_word = a
    return (i_word, ASum_W)
list_1 = ['apples','oranges','pears','apples']
Dict_1 = {'apples':5.23,'pears':-7.62}
d = {k: 0 for k in list_1}
for x in list_1:
    if x in Dict_1:
        d[x] += Dict_1[x]
m = max(d, key=lambda k: abs(d[k]))
print(m, abs(d[m]))
try this,
key, value = sorted(Dict_1.items(), key=lambda x: abs(x[1]) * list_1.count(x[0]), reverse=True)[0]
print(f"{key}, {abs(list_1.count(key) * value)}")
# apples, 10.46
You can use Counter to calculate the frequency (number of occurrences) of each item in the list. Scoring every word that also appears in the dictionary by abs(count * value) and taking the maximum then gives the word with the highest absolute sum, while words missing from the dictionary are ignored.
from collections import Counter

def H_abs_W(list_1, Dict_1):
    counter = Counter(list_1)
    # only words that appear in Dict_1 are candidates; score by |count * value|
    item = max((w for w in counter if w in Dict_1),
               key=lambda w: abs(counter[w] * Dict_1[w]))
    return item, abs(counter[item] * Dict_1[item])
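For the example data from the question, the H_abs_W implementations above should give:
list_1 = ['apples', 'oranges', 'pears', 'apples']
Dict_1 = {'apples': 5.23, 'pears': -7.62}
print(H_abs_W(list_1, Dict_1))  # ('apples', 10.46)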

Python: Efficient way of parsing a string into a list of floats from a file

This file has a word and tens of thousands of floats per line, and I want to transform it into a dictionary with the word as the key and a vector of all the floats as the value.
This is how I am doing it, but due to the size of the file (about 20k lines, each with about 10k values) the process is taking a bit too long. I could not find a more efficient way of doing the parsing, just some alternative approaches that were not guaranteed to decrease the run time.
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
google_words[line.split()[0]] = [float(line.split()[i]) for i in range(1, len(line.split()))]
In your solution you perform the slow line.split() repeatedly, once for every value on the line. Consider the following modification:
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
word, *numbers = line.split()
google_words[word] = [float(number) for number in numbers]
One advanced concept I used here is "unpacking":
word, *numbers = line.split()
Python allows you to unpack iterable values into multiple variables:
a, b, c = [1, 2, 3]
# This is practically equivalent to
a = 1
b = 2
c = 3
The * is a shortcut for "take the leftovers, put them in a list and assign that list to the name":
a, *rest = [1, 2, 3, 4]
# results in
a == 1
rest == [2, 3, 4]
Just don't call line.split() more than once.
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
temp = line.split()
google_words[temp[0]] = [float(temp[i]) for i in range(1, len(temp))]
Here's a simple generator of such a file:
s = "x"
for i in range(10000):
    s += " 1.2345"
print(s)
The former version takes some time.
The version with only one split call is instant.
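If you want to verify this on your own machine, a rough timing sketch along these lines works (the helper function and the file name test_vectors.txt are just placeholders for whatever test file you generate):
import time

def parse(path, one_split=True):
    # hypothetical helper wrapping the two parsing approaches shown above
    words = {}
    with open(path) as f:
        for line in f:
            if one_split:
                word, *numbers = line.split()
                words[word] = [float(n) for n in numbers]
            else:
                words[line.split()[0]] = [float(line.split()[i]) for i in range(1, len(line.split()))]
    return words

for flag in (False, True):
    start = time.perf_counter()
    parse("test_vectors.txt", one_split=flag)
    print("one_split =", flag, "took", time.perf_counter() - start, "seconds")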
You could also use the csv module, which should be more efficient than what you are doing.
It would be something like:
import csv
d = {}
with open("huge_file_so_huge.txt", "r") as g_file:
    for row in csv.reader(g_file, delimiter=" "):
        d[row[0]] = list(map(float, row[1:]))

Calculating means of values for subgroups of keys in python dictionary

I have a dictionary which looks like this:
cq={'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00,
    'B2_B2M_02':4.30, 'B3_B2M_02':2.40 etc.}
I need to calculate the mean of triplets where the keys[2:] agree. So I would ideally like to get another dictionary which will be:
new={'_B2M_01': 2.47, '_B2M_02': 3.9}
The data is/should be in triplets, so in theory I could just take the means of consecutive values, but I have it in a dictionary, so the keys/values may get reordered. Besides, I'd rather stick to the names as a quality check for the triplets assigned to each name (I will later add a bit that shows an error message when there are more than three per group).
I've tried creating a dictionary where the keys would be _B2M_01 and _B2M_02, then looping through the original dictionary to first append all the values assigned to these groups of keys so I could later calculate an average. But I am getting errors even in the first step, and anyway I am not sure this is the most effective way to do it...
cq={'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
trips=set([x[2:] for x in cq.keys()])
new={}
for each in trips:
    for k,v in cq.iteritems():
        if k[2:]==each:
            new[each].append(v)
Traceback (most recent call last):
File "<pyshell#28>", line 4, in <module>
new[each].append(v)
KeyError: '_B2M_01'
I would be very grateful for any suggestions. It seems like a fairly easy operation but I got stuck.
An alternative result, which would be even better, would be a dictionary containing all the names used in cq, but with the values being the means of the group. So the end result would be:
final={'A1_B2M_01':2.47, 'A2_B2M_01':2.47, 'A3_B2M_01':2.47, 'B1_B2M_02':3.9,
       'B2_B2M_02':3.9, 'B3_B2M_02':3.9}
Something like this should work. You can probably make it a little more elegant.
cq = {'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00, 'B2_B2M_02':4.30, 'B3_B2M_02':2.40}
sums = {}
count = {}
mean = {}
for k in cq:
    if k[2:] in sums:
        sums[k[2:]] += cq[k]
        count[k[2:]] += 1
    else:
        sums[k[2:]] = cq[k]
        count[k[2:]] = 1
for k in sums:
    mean[k] = sums[k] / count[k]
cq = {'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
sums = dict()
for k, v in cq.items():
    _, p2 = k.split('_', 1)
    if p2 not in sums:
        sums[p2] = [0, 0]
    sums[p2][0] += v
    sums[p2][1] += 1
res = {}
for k, v in sums.items():
    res[k] = v[0] / float(v[1])
print(res)
This could also be done in a single iteration.
Grouping:
SEPARATOR = '_'
cq = {'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
groups = {}
for key in cq:
    group_key = SEPARATOR.join(key.split(SEPARATOR)[1:])
    if group_key in groups:
        groups[group_key].append(cq[key])
    else:
        groups[group_key] = [cq[key]]
Generate means:
def means(groups):
    for group, group_vals in groups.items():
        yield (group, float(sum(group_vals)) / len(group_vals),)
print(list(means(groups)))
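The question also asked for a final dictionary that maps every original key to its group mean. Building on the grouping above (reusing cq, SEPARATOR and groups from this answer), a minimal sketch of that last step could be:
# map every original key back to the mean of its group
group_means = {group: sum(vals) / len(vals) for group, vals in groups.items()}
final = {key: group_means[SEPARATOR.join(key.split(SEPARATOR)[1:])] for key in cq}
print(final)  # every original key now maps to the mean of its triplet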

Python: Concatenate similar objects in List

I have a list containing strings of the form 'Country-Points'.
For example:
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8', ...other countries...]
I want to calculate the average for each country without creating a new list. So the output would be (in the case above):
lst = ['Albania-7.5', 'Andorra-4.0', ...other countries...]
Would really appreciate it if anyone could help me with this.
EDIT:
This is what I've got so far. "data" is actually a dictionary, where the keys are countries and the values are lists of other countries' points given to this country (the one used as key). Again, I'm new to Python so I don't really know all the built-in functions.
for key in self.data:
    lst = []
    index = 0
    score = 0
    cnt = 0
    s = str(self.data[key][0]).split("-")[0]
    for i in range(len(self.data[key])):
        if s in self.data[key][i]:
            a = str(self.data[key][i]).split("-")
            score += int(float(a[1]))
            cnt += 1
            index += 1
        if i+1 != len(self.data[key]) and not s in self.data[key][i+1]:
            lst.append(s + "-" + str(float(score/cnt)))
            s = str(self.data[key][index]).split("-")[0]
            score = 0
    self.data[key] = lst
itertools.groupby with a suitable key function can help (note that groupby only groups consecutive items, so the list should already be sorted or grouped by country, as it is in the example):
import itertools

def get_country_name(item):
    return item.split('-', 1)[0]

def get_country_value(item):
    return float(item.split('-', 1)[1])

def country_avg_grouper(lst):
    for ctry, group in itertools.groupby(lst, key=get_country_name):
        values = list(get_country_value(c) for c in group)
        avg = sum(values)/len(values)
        yield '{country}-{avg}'.format(country=ctry, avg=avg)

lst[:] = country_avg_grouper(lst)
The key here is that I wrote a function to do the change out of place and then I can easily make the substitution happen in place by using slice assignment.
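For the sample list from the question this should give the following (the averages are computed from the values shown; the exact string formatting of the numbers may differ slightly):
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8']
lst[:] = country_avg_grouper(lst)
print(lst)  # ['Albania-7.5', 'Andorra-4.0']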
I would probably do this with an intermediate dictionary.
def country(s):
    return s.split('-')[0]

def value(s):
    return float(s.split('-')[1])

def country_average(lst):
    country_map = {}
    for point in lst:
        c = country(point)
        v = value(point)
        old = country_map.get(c, (0, 0))
        country_map[c] = (old[0]+v, old[1]+1)
    return ['%s-%f' % (country, total/count)
            for (country, (total, count)) in country_map.items()]
It tries hard to traverse the original list only once, at the expense of quite a few tuple allocations.

python loop dictionary value references updating all values

I am having a problem updating values in a dictionary in Python. I am trying to update a nested value (either an int or a list) for a single first-level key, but instead I update the values for all first-level keys.
I start by creating the dictionary:
kmerdict = {}
innerdict = {'endcover':0, 'coverdict':{}, 'coverholder':[], 'uncovered':0, 'lowstart':0, 'totaluncover':0, 'totalbases':0}
for kmer in kmerlist: # build kmerdict
    kmerdict[kmer] = {}
    for chrom in fas: #open file and read line
        chromnum = chrom[3:-3]
        kmerdict[kmer][chromnum] = innerdict
Then I walk through chromosomes (as plain text files) from a list (fas, not shown), taking 7-mer strings (k=7) as the key. If that key is in the list of keys I am looking for (kmerlist), I use it to reference a single nested value in the dictionary:
for chrom in fas: #open file and read line
    chromnum = chrom[3:-3]
    p = 0 #chromosome position counter
    thisfile = "/var/store/fa/" + chrom
    thischrom = open(thisfile)
    thischrom.readline()
    thisline = thischrom.readline()
    thisline = string.strip(thisline.lower())
    l = 0 #line counter
    workline = thisline
    while(thisline):
        if len(workline) > k-1:
            thiskmer = ''
            thiskmer = workline[0:k] #read five bases
            if thiskmer in kmerlist:
                thisuncovered = kmerdict[thiskmer][chromnum]['uncovered']
                thisendcover = kmerdict[thiskmer][chromnum]['endcover']
                thiscoverholder = kmerdict[thiskmer][chromnum]['coverholder']
                if p >= thisendcover:
                    thisuncovered += (p - thisendcover)
                    thisendcover = ((p+k) + ext)
                    thiscoverholder.append(p)
                elif p < thisendcover:
                    thisendcover = ((p+k) + ext)
                    thiscoverholder.append(p)
                print kmerdict[thiskmer]
            p += 1
            workline = workline[1:]
        else:
            thisline = thischrom.readline()
            thisline = string.strip(thisline.lower())
            workline = workline + thisline
            l += 1
print kmerdict
But when I print the dictionary, all "thiskmer" levels are getting updated with the same values. I'm not very good with dictionaries, and I can't see the error of my ways, but they are profound! Can anyone enlighten me?
Hope I've been clear enough. I've been tinkering with this code for too long now :(
confession -- I haven't spent the time to figure out all of your code -- only the first part. The first problem you have is in the setup:
kmerdict = {}
innerdict = {'endcover':0, 'coverdict':{}, 'coverholder':[], 'uncovered':0,
             'lowstart':0, 'totaluncover':0, 'totalbases':0}
for kmer in kmerlist: # build kmerdict
    kmerdict[kmer] = {}
    for chrom in fas: #open file and read line
        chromnum = chrom[3:-3]
        kmerdict[kmer][chromnum] = innerdict
You create innerdict once and then proceed to use the same dictionary over and over again. In other words, every kmerdict[kmer][chromnum] refers to the same object. Perhaps changing the last line to:
kmerdict[kmer][chromnum] = copy.deepcopy(innerdict)
would help (with an appropriate import of copy at the top of your file)? Alternatively, you could just move the creation of innerdict into the inner loop as pointed out in the comments:
def get_inner_dict():
    return {'endcover':0, 'coverdict':{}, 'coverholder':[], 'uncovered':0,
            'lowstart':0, 'totaluncover':0, 'totalbases':0}

kmerdict = {}
for kmer in kmerlist: # build kmerdict
    kmerdict[kmer] = {}
    for chrom in fas: #open file and read line
        chromnum = chrom[3:-3]
        kmerdict[kmer][chromnum] = get_inner_dict()
-- I decided to use a function to make it easier to read :).
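A stripped-down illustration of the aliasing issue the answer describes (the dictionary and key names here are made up for the example):
import copy

shared = {'count': 0}
aliased = {'a': shared, 'b': shared}   # both keys reference the very same dict
aliased['a']['count'] += 1
print(aliased['b']['count'])           # 1 -- updating 'a' also changed 'b'

independent = {'a': copy.deepcopy(shared), 'b': copy.deepcopy(shared)}
independent['a']['count'] += 1
print(independent['a']['count'], independent['b']['count'])  # 2 1 -- only 'a' changed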
