How to create a frequency matrix?

How to create a frequency matrix? - python

I just started using Python and I just came across the following problem:
Imagine I have the following list of lists:
list = [["Word1","Word2","Word2","Word4566"],["Word2", "Word3", "Word4"], ...]
The result (matrix) i want to get should look like this:
The Displayed Columns and Rows are all appearing words (no matter which list).
The thing that I want is a programm that counts the appearence of words in each list (by list).
The picture is the result after the first list.
Is there an easy way to achieve something like this or something similar?
EDIT:
Basically I want a List/Matrix that tells me how many times words 2-4566 appeared when word 1 was also in the list, and so on.
So I would get a list for each word that displays the absolute frequency of all other 4555 words in relationship with this word.
So I would need an algorithm that iterates through all this lists of words and builts the result lists

As far as I understand you want to create a matrix that shows the number of lists where two words are located together for each pair of words.
First of all we should fix the set of unique words:
lst = [["Word1","Word2","Word2","Word4566"],["Word2", "Word3", "Word4"], ...] # list is a reserved word in python, don't use it as a name of variables
words = set()
for sublst in lst:
words |= set(sublst)
words = list(words)
Second we should define a matrix with zeros:
result = [[0] * len(words)] * len(words) # zeros matrix N x N
And finally we fill the matrix going through the given list:
for sublst in lst:
sublst = list(set(sublst)) # selecting unique words only
for i in xrange(len(sublst)):
for j in xrange(i + 1, len(sublst)):
index1 = words.index(sublst[i])
index2 = words.index(sublst[j])
result[index1][index2] += 1
result[index2][index1] += 1
print result

I find it really hard to understand what you're really asking for, but I'll try by making some assumptions:
(1) You have a list (A), containing other lists (b) of multiple words (w).
(2) For each b-list in A-list
(3) For each w in b:
(3.1) count the total number of appearances of w in all of the b-lists
(3.2) count how many of the b-lists, in which w appears only once
If these assumptions are correct, then the table doesn't correspond correctly to the list you've provided. If my assumptions are wrong, then I still believe my solution may give you inspiration or some ideas on how to solve it correctly. Finally, I do not claim my solution to be optimal with respect to speed or similar.
OBS!! I use python's built-in dictionaries, which may become terribly slow if you intend to fill them with thousands of words!! Have a look at: https://docs.python.org/2/tutorial/datastructures.html#dictionaries
frq_dict = {} # num of appearances / frequency
uqe_dict = {} # unique
for list_b in list_A:
temp_dict = {}
for word in list_b:
if( word in temp_dict ):
temp_dict[word]+=1
else:
temp_dict[word]=1
# frq is the number of appearances
for word, frq in temp_dict.iteritems():
if( frq > 1 ):
if( word in frq_dict )
frq_dict[word] += frq
else
frq_dict[word] = frq
else:
if( word in uqe_dict )
uqe_dict[word] += 1
else
uqe_dict[word] = 1

I managed to come up with the right answer to my own question:
list = [["Word1","Word2","Word2"],["Word2", "Word3", "Word4"],["Word2","Word3"]]
#Names of all dicts
all_words = sorted(set([w for sublist in list for w in sublist]))
#Creating the dicts
dicts = []
for i in all_words:
dicts.append([i, dict.fromkeys([w for w in all_words if w != i],0)])
#Updating the dicts
for l in list:
for word in sorted(set(l)):
tmpL = [w for w in l if w != word]
ind = ([w[0] for w in dicts].index(word))
for w in dicts[ind][1]:
dicts[ind][1][w] += l.count(w)
print dicts
Gets the result:
['Word1', {'Word4': 0, 'Word3': 0, 'Word2': 2}], ['Word2', {'Word4': 1, 'Word1': 1, 'Word3': 2}], ['Word3', {'Word4': 1, 'Word1': 0, 'Word2': 2}], ['Word4', {'Word1': 0, 'Word3': 1, 'Word2': 1}]]

Related

Maximum frequency value for word

Seeking help on Homework
I am given a list and asked to find the most occurring value in a list and returns the amount of times it is occurred. This question is fairly big and i have managed to get through the other parts by myself but this one stumped me.I should add that this is for an assignment any guidance would be appreciated.
Question Statement : Maximum (word) Frequency
For example in a book with the following words ['big', 'big', 'bat', 'bob', 'book'] the maximum frequency is 2, i.e., big is the most frequently occurring word, therefore 2 is the maximum frequency.
def maximum_frequency(new_list):
word_counter = {}
for word in new_list:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
I have gotten this far but I am not sure if its right/where to go from here

Try this:
from collections import Counter
c = Counter(['big', 'big', 'bat', 'bob', 'book'])
max(c.items(), key=lambda x:x[1])
the max will returns the most one by its count, you can do:
key,rate = max(c.items(), key=lambda x:x[1])
the key will be big and the rate will be 2.
also, you can access all of the items count by c.items(). and the output will be
{'big': 2, 'bat': 1, 'bob': 1, 'book': 1}
Edit:
as schwobaseggl said the best practice to find from a counter is to use most_common.
c.most_common(1)[0]

You just need to count the occurrence of all the unique elements and compare the frequency with the previously computed frequency.
sample is a list of words.
def maxfreq(sample):
m=0
frequency=0
word=''
set_sample=list(set(sample))
for i in range(len(set_sample)):
c=sample.count(set_sample[i])
if c>m:
m=c
frequency=m
word=set_sample[i]
return (frequency,word)

Since it sounds like this is some kind of challenge and/or homework you're supposed to be working on, instead of directly providing a code sample let me give you some concepts.
First off, the best way to know if you've seen a word or not is to use a map, in Python -- the term is "dict" and the syntax is simple {}, you can store values like this: my_dict['value'] = true or whatever key/value you need.
So if you're going to read your words, one by one, and store them into this dict, the what should the value be? You know you want to know the maximum frequency, right? Well, so let's use that as our value. By default, if we add a word, we should make sure to set it's initial value to 1 (we've seen it once). And if we see a word a second time, we then increment our frequency.
Now that you have a dict full of words and their frequencies, perhaps you might be able to figure out how to find the one with the largest frequency?
So that being said, things you should look into are:
How to determine if a key exists in a dict
How to modify the value of a key in a dict
How to iterate a dict's key/value pairs
After that, your answer should be pretty easy to figure out.

try this :
>>> MyList = ["above", "big", "above", "cat", "cat", "above", "cat"]
>>> my_dict = {i:MyList.count(i) for i in MyList}
>>> my_dict
{'above': 3, 'big': 1, 'cat': 3}
It can also be accomplish using collections.Counter which is compatible with Python 2.7 or 3.x !
>>> from collections import Counter
>>> MyList = ['big', 'big', 'bat', 'bob', 'book']
>>> dict(Counter(MyList))
{'big': 2, 'bat': 1, 'bob': 1, 'book': 1}
If you are open to Pandas then it can be done as follows:
>>> import pandas as pd
>>> pd.Series(MyList).value_counts()
big 2
book 1
bob 1
bat 1
dtype: int64
#Answer to the OP's next Question in the comment section what if i wanted to get just the maximum value instead of the word .
>>> pd.Series(MyList).value_counts().max()
2

How about this:
def maximum_frequency(new_list):
word_counter = {}
for word in new_list:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
max_freq = max(word_counter.items(), key=(lambda x: x[1]))
return max_freq
if __name__ == '__main__':
test_data = ['big', 'big', 'bat', 'bob', 'book']
print(maximum_frequency(test_data))
Output:
('big', 2)
Works fine with Python 2 and 3 and returns result as a tuple of most frequent word and occurrences count.
EDIT:
If you don't care at all which word has the highest count and you want only the frequency number you can simplify it a bit to:
def maximum_frequency(new_list):
word_counter = {}
for word in new_list:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
return max(word_counter.values())
if __name__ == '__main__':
test_data = ['big', 'big', 'bat', 'bob', 'book']
print(maximum_frequency(test_data))

How to standardize the format of element in the list from big data

Trying to count unique value from the following list without using collection:
('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
The output which I require is :
('TOILET':2,'AIR CONDITIONiNGS':3)
My code currently is
for i in Data:
if i in number:
number[i] += 1
else:
number[i] = 1
print number
Is it possible to get the output?

Using difflib.get_close_matches to help determine uniqueness
import difflib
a = ('TOILET','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
d = {}
for word in a:
similar = difflib.get_close_matches(word, d.keys(), cutoff = 0.6, n = 1)
#print(similar)
if similar:
d[similar[0]] += 1
else:
d[word] = 1
The actual keys in the dictionary will depend on the order of the words in the list.
difflib.get_close_matches uses difflib.SequenceMatcher to calculate the closeness (ratio) of the word against all possibilities even if the first possibility is close - then sorts by the ratio. This has the advantage of finding the closest key that has a ratio greater than the cutoff. But as the dictionary grows the searches will take longer.
If needed, you might be able to optimize a little by sorting the list first so that similar words appear in sequence and doing something like this (lazy evaluation) - choosing an appropriately large cutoff.
import difflib, collections
z = collections.OrderedDict()
a = sorted(a)
cutoff = 0.6
for word in a:
for key in z.keys():
if difflib.SequenceMatcher(None, word, key).ratio() > cutoff:
z[key] += 1
break
else:
z[word] = 1
Results:
>>> d
{'TOILET': 2, 'AIR CONDITIONING': 3}
>>> z
OrderedDict([('AIR CONDITIONING', 3), ('TOILET', 2)])
>>>
I imagine there are python packages that do this sort of thing and may be optimized.

I don't believe the python list has an easy built-in way to do what you are asking. It does, however, have a count method that can tell you how many of a specific element there are in a list. Example:
some_list = ['a', 'a', 'b', 'c']
some_list.count('a') #=> 2
Usually the way you get what you want is to construct an incrementable hash by taking advantage of the Hash::get(key, default) method:
some_list = ['a', 'a', 'b', 'c']
counts = {}
for el in some_list
counts[el] = counts.get(el, 0) + 1
counts #=> {'a' : 2, 'b' : 1, 'c' : 1}

You can try this:
import re
data = ('TOILETS','TOILETS','AIR CONDITIONING','AIR-CONDITIONINGS','AIR-CONDITIONING')
new_data = [re.sub("\W+", ' ', i) for i in data]
print new_data
final_data = {}
for i in new_data:
s = [b for b in final_data if i.startswith(b)]
if s:
new_data = s[0]
final_data[new_data] += 1
else:
final_data[i] = 1
print final_data
Output:
{'TOILETS': 2, 'AIR CONDITIONING': 3}

original = ('TOILETS', 'TOILETS', 'AIR CONDITIONING',
'AIR-CONDITIONINGS', 'AIR-CONDITIONING')
a_set = set(original)
result_dict = {element: original.count(element) for element in a_set}
First, making a set from original list (or tuple) gives you all values from it, but without repeating.
Then you create a dictionary with keys from that set and values as occurrences of them in the original list (or tuple), employing the count() method.

a = ['TOILETS', 'TOILETS', 'AIR CONDITIONING', 'AIR-CONDITIONINGS', 'AIR-CONDITIONING']
b = {}
for i in a:
b.setdefault(i,0)
b[i] += 1
You can use this code, but same as Jon Clements`s talk, TOILET and TOILETS aren't the same string, you must ensure them.

Keeping number of hits in a dictionary

I have a list of (unique) words:
words = [store, worry, periodic, bucket, keen, vanish, bear, transport, pull, tame, rings, classy, humorous, tacit, healthy]
That i want to crosscheck with two different lists of lists (with the same range), while counting the number of hits.
l1 = [[terrible, worry, not], [healthy], [fish, case, bag]]
l2 = [[vanish, healthy, dog], [plant], [waves, healthy, bucket]]
I was thinking of using a dictionary and assume the word as the key, but would need two 'values' (one for each list) for the number of hits.
So the output would be something like:
{"store": [0, 0]}
{"worry": [1, 0]}
...
{"healthy": [1, 2]}
How would something like this work?
Thank you in advance!

You can use itertools to flatten the list and then use dictionary comprehension:
from itertools import chain
words = [store, worry, periodic, bucket, keen, vanish, bear, transport, pull, tame, rings, classy, humorous, tacit, healthy]
l1 = [[terrible, worry, not], [healthy], [fish, case, bag]]
l2 = [[vanish, healthy, dog], [plant], [waves, healthy, bucket]]
l1 = list(chain(*l1))
l2 = list(chain(*l2))
final_count = {i:[l1.count(i), l2.count(i)] for i in words}

For your dictionary example, you would just need to iterate over each list and add those to the dictionary as so:
my_dict = {}
for word in l1:
if word in words: #This makes sure you only work with words that are in your list of unique words
if word not in my_dict:
my_dict[word] = [0,0]
my_dict[word][0] += 1
for word in l2:
if word in words:
if word not in my_dict:
my_dict[word] = [0,0]
my_dict[word][1] += 1
(Or you could make that repeated code a function that passes in for parameter the list, dictionary, and the index, that way you repeat fewer lines)
If your lists are 2d like in your example, then you just change the first iteration in the for loop to be 2d.
my_dict = {}
for group in l1:
for word in group:
if word in words:
if word not in my_dict:
my_dict[word] = [0,0]
my_dict[word][0] += 1
for group in l2
for word in group:
if word in words:
if word not in my_dict:
my_dict[word] = [0,0]
my_dict[word][1] += 1
Though if you are just wanting to know the words in common, perhaps sets could be an option as well, since you have the union operators in sets for easy viewing of all words in common, but sets eliminate duplicates so if the counts are necessary, then the set isn't an option.

using lambda and dictionaries functions

I wrote this function:
def make_upper(words):
for word in words:
ind = words.index(word)
words[ind] = word.upper()
I also wrote a function that counts the frequency of occurrences of each letter:
def letter_cnt(word,freq):
for let in word:
if let == 'A': freq[0]+=1
elif let == 'B': freq[1]+=1
elif let == 'C': freq[2]+=1
elif let == 'D': freq[3]+=1
elif let == 'E': freq[4]+=1

Counting letter frequency would be much more efficient with a dictionary, yes. Note that you are manually lining up each letter with a number ("A" with 0, et cetera). Wouldn't it be easier if we could have a data type that directly associated a letter with the number of times it occurs, without adding an extra set of numbers in between?
Consider the code:
freq = {"A":0, "B":0, "C":0, "D":0, ... ..., "Z":0}
for letter in text:
freq[letter] += 1
This dictionary is used to count frequencies much more efficiently than your current code does. You just add one to an entry for a given letter each time you see it.
I will also mention that you can count frequencies effectively with certain libraries. If you are interested in analyzing frequencies, look into collections.Counter() and possibly the collections.Counter.most_common() method.
Whether or not you decide to just use collections.Counter(), I would attempt to learn why dictionaries are useful in this context.
One final note: I personally found typing out the values for the "freq" dictionary to be tedious. If you want you could construct an empty dictionary of alphabet letters on-the-fly with this code:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
freq = {letter:0 for letter in alphabet}

If you want to convert strings in the list to upper case using lambda, you may use it with map() as:
>>> words = ["Hello", "World"]
>>> map(lambda word: word.upper(), words) # In Python 2
['HELLO', 'WORLD']
# In Python 3, use it as: list(map(...))
As per the map() document:
map(function, iterable, ...)
Apply function to every item of iterable and return a list of the results.
For finding the frequency of each character in word, you may use collections.Counter() (sub class dict type) as:
>>> from collections import Counter
>>> my_word = "hello world"
>>> c = Counter(my_word)
# where c holds dictionary as:
# {'l': 3,
# 'o': 2,
# ' ': 1,
# 'e': 1,
# 'd': 1,
# 'h': 1,
# 'r': 1,
# 'w': 1}
As per Counter Document:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values.

for the letter counting, don't reinvent the wheel collections.Counter
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.

def punc_remove(words):
for word in words:
if word.isalnum() == False:
charl = []
for char in word:
if char.isalnum()==True:
charl.append(char)
ind = words.index(word)
delimeter = ""
words[ind] = delimeter.join(charl)
def letter_cnt_dic(word,freq_d):
for let in word:
freq_d[let] += 1
import string
def letter_freq(fname):
fhand = open(fname)
freqs = dict()
alpha = list(string.uppercase[:26])
for let in alpha: freqs[let] = freqs.get(let,0)
for line in fhand:
line = line.rstrip()
words = line.split()
punc_remove(words)
#map(lambda word: word.upper(),words)
words = [word.upper() for word in words]
for word in words:
letter_cnt_dic(word,freqs)
fhand.close()
return freqs.values()

You can read the docs about the Counter and the List Comprehensions or run this as a small demo:
from collections import Counter
words = ["acdefg","abcdefg","abcdfg"]
#list comprehension no need for lambda or map
new_words = [word.upper() for word in words]
print(new_words)
# Lets create a dict and a counter
letters = {}
letters_counter = Counter()
for word in words:
# The counter count and add the deltas.
letters_counter += Counter(word)
# We can do it to
for letter in word:
letters[letter] = letters.get(letter,0) + 1
print(letters_counter)
print(letters)

Python - counting duplicate strings

I'm trying to write a function that will count the number of word duplicates in a string and then return that word if the number of duplicates exceeds a certain number (n). Here's what I have so far:
from collections import defaultdict
def repeat_word_count(text, n):
words = text.split()
tally = defaultdict(int)
answer = []
for i in words:
if i in tally:
tally[i] += 1
else:
tally[i] = 1
I don't know where to go from here when it comes to comparing the dictionary values to n.
How it should work:
repeat_word_count("one one was a racehorse two two was one too", 3) should return ['one']

Try
for i in words:
tally[i] = tally.get(i, 0) + 1
instead of
for i in words:
if i in tally:
tally[words] += 1 #you are using words the list as key, you should use i the item
else:
tally[words] = 1
If you simply want to count the words, use collections.Counter would fine.
>>> import collections
>>> a = collections.Counter("one one was a racehorse two two was one too".split())
>>> a
Counter({'one': 3, 'two': 2, 'was': 2, 'a': 1, 'racehorse': 1, 'too': 1})
>>> a['one']
3

Here is a way to do it:
from collections import defaultdict
tally = defaultdict(int)
text = "one two two three three three"
for i in text.split():
tally[i] += 1
print tally # defaultdict(<type 'int'>, {'three': 3, 'two': 2, 'one': 1})
Putting this in a function:
def repeat_word_count(text, n):
output = []
tally = defaultdict(int)
for i in text.split():
tally[i] += 1
for k in tally:
if tally[k] > n:
output.append(k)
return output
text = "one two two three three three four four four four"
repeat_word_count(text, 2)
Out[141]: ['four', 'three']

If what you want is a dictionary counting the words in a string, you can try this:
string = 'hello world hello again now hi there hi world'.split()
d = {}
for word in string:
d[word] = d.get(word, 0) +1
print d
Output:
{'again': 1, 'there': 1, 'hi': 2, 'world': 2, 'now': 1, 'hello': 2}

why don't you use Counter class for that case:
from collections import Counter
cnt = Counter(text.split())
Where elements are stored as dictionary keys and their counts are stored as dictionary values. Then it's easy to keep the words that exceeds your n number with iterkeys() in a for loop like
list=[]
for k in cnt.iterkeys():
if cnt[k]>n:
list.append(k)
In list you'll got your list of words.
**Edited: sorry, thats if you need many words, BrianO have the right one for your case.

As luoluo says, use collections.Counter.
To get the item(s) with the highest tally, use the Counter.most_common method with argument 1, which returns a list of pairs (word, tally) whose 2nd coordinates are all the same max tally. If the "sentence" is nonempty then that list is too. So, the following function returns some word that occurs at least n times if there is one, and returns None otherwise:
from collections import Counter
def repeat_word_count(text, n):
if not text: return None # guard against '' and None!
counter = Counter(text.split())
max_pair = counter.most_common(1)[0]
return max_pair[0] if max_pair[1] > n else None

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to create a frequency matrix? - python

Related

Maximum frequency value for word

How to standardize the format of element in the list from big data

Keeping number of hits in a dictionary

using lambda and dictionaries functions

Python - counting duplicate strings

Categories

Resources