Perform set operation difference on a list of tuples - python

I am trying to get the difference between 2 containers, but the containers have an awkward structure, so I don't know the best way to perform a difference on them. One container's type and structure I cannot alter, but the other's I can (the variable delims).
delims = ['on','with','to','and','in','the','from','or']
words = collections.Counter(s.split()).most_common()
# words results in [("the",2), ("a",9), ("diplomacy", 1)]
# I want to perform a 'difference' operation on words to remove all the delims words
descriptive_words = set(words) - set(delims)
# Because of the unique structure of words (a list of tuples), it's hard to perform a difference
# on it. What would be the best way to perform a difference? Maybe...
delims = [('on',0),('with',0),('to',0),('and',0),('in',0),('the',0),('from',0),('or',0)]
words = collections.Counter(s.split()).most_common()
descriptive_words = set(words) - set(delims)
# Or maybe
words = collections.Counter(s.split()).most_common()
n_words = []
for w in words:
    n_words.append(w[0])
delims = ['on','with','to','and','in','the','from','or']
descriptive_words = set(n_words) - set(delims)

How about just modifying words by removing all the delimiters?
words = collections.Counter(s.split())
for delim in delims:
    del words[delim]

This is how I would do it:
delims = set(['on','with','to','and','in','the','from','or'])
# ...
descriptive_words = filter(lambda x: x[0] not in delims, words)
This uses the built-in filter function. A viable alternative would be:
delims = set(['on','with','to','and','in','the','from','or'])
# ...
descriptive_words = [(word, count) for word, count in words if word not in delims]
Either way, make sure that delims is a set to allow for O(1) lookup.
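As a minimal sketch of the approach above, assuming a sample sentence s (made up here for illustration), the list of (word, count) tuples returned by most_common() can be filtered against a set of delimiters:

```python
import collections

# Sample input, assumed for illustration only.
s = "the a a a a the a a a a a diplomacy"
delims = {'on', 'with', 'to', 'and', 'in', 'the', 'from', 'or'}

# most_common() returns a list of (word, count) tuples, highest count first.
words = collections.Counter(s.split()).most_common()

# Keep only the tuples whose word is not a delimiter.
# Membership tests on a set are O(1) on average.
descriptive_words = [(word, count) for word, count in words if word not in delims]
print(descriptive_words)  # [('a', 9), ('diplomacy', 1)]
```

The result keeps the same list-of-tuples structure as words, so the container that cannot be altered stays in its original shape.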

The simplest answer is to do:
import collections
s = "the a a a a the a a a a a diplomacy"
delims = {'on','with','to','and','in','the','from','or'}
# For older versions of Python without set literals:
# delims = set(['on','with','to','and','in','the','from','or'])
words = collections.Counter(s.split())
not_delims = {key: value for (key, value) in words.items() if key not in delims}
# For older versions of Python without dict comprehensions:
# not_delims = dict(((key, value) for (key, value) in words.items() if key not in delims))
Which gives us:
{'a': 9, 'diplomacy': 1}
An alternative option is to do it pre-emptively:
import collections
s = "the a a a a the a a a a a diplomacy"
delims = {'on','with','to','and','in','the','from','or'}
counted_words = collections.Counter((word for word in s.split() if word not in delims))
Here you apply the filtering on the list of words before you give it to the counter, and this gives the same result.

If you're iterating through it anyway why bother converting them to sets?
dwords = [delim[0] for delim in delims]
words = [word for word in words if word[0] not in dwords]

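You can also use filter with a lambda function. This is not inherently faster than a comprehension; what helps performance is making delims a set:
filter(lambda word: word[0] not in delims, words)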

Related

What is an efficient way to replace specific words (but not words that include the string in criteria)?

I am creating a list with pairs of words in a large text. I am going to use those pairs for other tasks later on.
Let's say these are the words I am looking for:
word_list = ["and", "car", "melon"]
And I'm trying to find all instances of these exact words and change them into "banana".
Method 1:
words = text.split()
for i, word in enumerate(words):
    if word in word_list:
        words[i] = "banana"
Method 2:
words = text.split()
for i, word in enumerate(words):
    word = word.replace("and", "banana")
    word = word.replace("car", "banana")
    word = word.replace("melon", "banana")
    words[i] = word
I feel like both of these options are far from efficient. What are some better ways to deal with the problem?
Things to note:
The end result will be a list of lists: [["He","has"],["has","a"],["a","banana"]]
Only exact matches should be replaced (watermelon should not become waterbanana)
You could use a dictionary to do that,
value = 'banana'
d = {'and': value, 'car': value, 'melon': value}
result = ' '.join(d.get(i, i) for i in text.split())
You can create the mapping dictionary like this,
value = 'banana'
word_list = ["and", "car", "melon"]
d = dict(zip(word_list,[value]*len(word_list)))
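Putting the pieces together, here is a hedged sketch of the dictionary approach that also builds the list of word pairs described in the question. The sample text is made up for illustration, and pairing adjacent words with zip is one assumed way of producing the pairs, not something the answer specifies.

```python
# Sample text, assumed for illustration only.
text = "He has a melon and a watermelon"
word_list = ["and", "car", "melon"]

# Map every target word to the same replacement value.
replacements = dict.fromkeys(word_list, "banana")

# dict.get(w, w) returns the replacement for exact matches and the word
# itself otherwise, so "watermelon" is left untouched.
words = [replacements.get(w, w) for w in text.split()]

# Pair each word with its successor.
pairs = [[a, b] for a, b in zip(words, words[1:])]
print(pairs)
```

Because the lookup works on whole words, there is no need for repeated str.replace calls, and each word costs a single dictionary lookup.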

For loop to dictionary comprehension correct translation (Python)

I need to find the longest string in a list for each letter of the alphabet.
My first straight forward approach looked like this:
alphabet = ["a","b", ..., "z"]
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
result = {key:"" for key in alphabet} # create a dictionary
# go through all words, if that word is longer than the current longest, save it
for word in text:
    if word[0].lower() in alphabet and len(result[word[0].lower()]) < len(word):
        result[word[0].lower()] = word.lower()
print(result)
which returns:
{'a': 'andhjtje9'}
as it is supposed to do.
In order to practice dictionary comprehension I tried to solve this in just one line:
result2 = {key:"" for key in alphabet}
result2 = {word[0].lower(): word.lower() for word in text if word[0].lower() in alphabet and len(result2[word[0].lower()]) < len(word)}
I just copied the if statement into the comprehension loop...
result2, however, is:
{'a': 'ajhe5'}
Can someone explain to me why this is the case? I feel like I did exactly the same as in the first loop...
Thanks for any help!
A list, dict, or set comprehension cannot refer to itself while it is being built. Inside the comprehension, result2 still refers to the earlier dictionary of empty strings, so every word passes the length check and the last matching word for each letter wins. That is why you do not get what you want.
You can use a complicated dictionary comprehension to do this. With the help of itertools.groupby on a sorted list, it could look like this:
from string import ascii_lowercase
from itertools import groupby
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key: sorted(value, key=len)[-1]
     for key, value in groupby((s for s in sorted(text)
                                if s[0].lower() in frozenset(ascii_lowercase)),
                               lambda x: x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key: next(value) for key, value in groupby(
        (s for s in sorted(text, key=lambda x: (x[0].lower(), -len(x)))
         if s[0].lower() in frozenset(ascii_lowercase)),
        lambda x: x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or several other ways ... but why would you?
Keeping it as a for loop is much cleaner and easier to understand and, in this case, probably follows the Zen of Python better.
Read about the Zen of Python by running:
import this
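To make the explanation above concrete, the sketch below shows that result2 inside the comprehension still refers to the old dictionary of empty strings, and then shows one possible loop-based alternative that can consult the dictionary it is building. The name longest is introduced here for illustration only.

```python
from string import ascii_lowercase

text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]

# The name result2 is only rebound after the whole comprehension has been
# evaluated, so the lookups below always see the empty strings.
result2 = {key: "" for key in ascii_lowercase}
result2 = {word[0].lower(): word.lower() for word in text
           if word[0].lower() in ascii_lowercase
           and len(result2[word[0].lower()]) < len(word)}
print(result2)  # {'a': 'ajhe5'}

# A plain loop can read the dictionary while it is being filled.
longest = {}
for word in text:
    first = word[0].lower()
    if first in ascii_lowercase and len(word) > len(longest.get(first, "")):
        longest[first] = word.lower()
print(longest)  # {'a': 'andhjtje9'}
```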

Randomly pick a value from each list in a dictionary in python

I have the following code:
result = set()
with open("words.txt") as fd:
    for line in fd:
        matching_words = {word for word in line.lower().split() if len(word)==4 and "'" not in word}
        result.update(matching_words)
print(result)
print(len(result))
result_dict = {}
for word in result:
    result_dict[word[2:]] = result_dict.get(word[2:], []) + [word]
print(result_dict)
print({key: len(value) for key, value in result_dict.items()})
Output
This takes a .txt file, finds all the unique four-letter words, and excludes any that include an apostrophe. The words are then grouped by their last 2 characters: each word ending becomes a key in a dictionary whose value is the list of words with that ending, and the final line prints the number of words for each ending.
What I now need to do is disregard any list with fewer than 30 words in it.
Then randomly select one word from each of the remaining lists and print the list of words.
The following comprehension should work:
[random.choice(v) for v in result_dict.values() if len(v) >= 30]
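Alternatively, if you only want a single qualifying word ending picked at random (note that this returns one key, not one word per list), you can pass a filtered list of keys to random.choice: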
random.choice([k for k, v in result_dict.items() if len(v) >= 30])
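A self-contained sketch of the first approach, using a small made-up result_dict and a threshold of 2 instead of 30 so the example stays short (the data and the name min_size are assumptions for illustration):

```python
import random

# Hypothetical data standing in for the dictionary built from words.txt.
result_dict = {
    "ke": ["bake", "cake", "lake"],
    "st": ["best", "test"],
    "rd": ["word"],
}
min_size = 2  # the question uses 30

# One random word from every list that is long enough.
picks = [random.choice(words) for words in result_dict.values()
         if len(words) >= min_size]
print(picks)
```

The output varies from run to run, but it always contains exactly one word for each ending whose list meets the threshold.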

Keeping number of hits in a dictionary

I have a list of (unique) words:
words = ["store", "worry", "periodic", "bucket", "keen", "vanish", "bear", "transport", "pull", "tame", "rings", "classy", "humorous", "tacit", "healthy"]
I want to crosscheck these against two different lists of lists (with the same range), while counting the number of hits.
l1 = [["terrible", "worry", "not"], ["healthy"], ["fish", "case", "bag"]]
l2 = [["vanish", "healthy", "dog"], ["plant"], ["waves", "healthy", "bucket"]]
I was thinking of using a dictionary with the word as the key, but I would need two 'values' (one for each list) for the number of hits.
So the output would be something like:
{"store": [0, 0]}
{"worry": [1, 0]}
...
{"healthy": [1, 2]}
How would something like this work?
Thank you in advance!
You can use itertools to flatten the list and then use dictionary comprehension:
from itertools import chain
words = ["store", "worry", "periodic", "bucket", "keen", "vanish", "bear", "transport", "pull", "tame", "rings", "classy", "humorous", "tacit", "healthy"]
l1 = [["terrible", "worry", "not"], ["healthy"], ["fish", "case", "bag"]]
l2 = [["vanish", "healthy", "dog"], ["plant"], ["waves", "healthy", "bucket"]]
l1 = list(chain(*l1))
l2 = list(chain(*l2))
final_count = {i:[l1.count(i), l2.count(i)] for i in words}
For your dictionary example, you would just need to iterate over each list and add the words to the dictionary like so:
my_dict = {}
for word in l1:
    if word in words:  # This makes sure you only work with words that are in your list of unique words
        if word not in my_dict:
            my_dict[word] = [0,0]
        my_dict[word][0] += 1
for word in l2:
    if word in words:
        if word not in my_dict:
            my_dict[word] = [0,0]
        my_dict[word][1] += 1
(Or you could turn that repeated code into a function that takes the list, the dictionary, and the index as parameters, so that you repeat fewer lines.)
If your lists are 2D like in your example, then you just change the first iteration in the for loop to be 2D.
my_dict = {}
for group in l1:
    for word in group:
        if word in words:
            if word not in my_dict:
                my_dict[word] = [0,0]
            my_dict[word][0] += 1
for group in l2:
    for word in group:
        if word in words:
            if word not in my_dict:
                my_dict[word] = [0,0]
            my_dict[word][1] += 1
Though if you just want to know the words in common, perhaps sets could be an option as well, since sets have an intersection operator for easily finding all the words in common. But sets eliminate duplicates, so if the counts are necessary, a set isn't an option.
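The helper-function idea mentioned above could look like the following sketch. The function name count_hits and its parameters are hypothetical, the word list is shortened for brevity, and every word starts at [0, 0] so that words without hits also appear, as in the desired output.

```python
def count_hits(groups, counts, index):
    """Add hits from a list of lists to counts[word][index]."""
    for group in groups:
        for word in group:
            # Only words that were pre-registered as keys are counted.
            if word in counts:
                counts[word][index] += 1


# Shortened version of the word list from the question.
words = ["store", "worry", "bucket", "vanish", "healthy"]
l1 = [["terrible", "worry", "not"], ["healthy"], ["fish", "case", "bag"]]
l2 = [["vanish", "healthy", "dog"], ["plant"], ["waves", "healthy", "bucket"]]

# Start every word at zero hits for both lists.
counts = {word: [0, 0] for word in words}
count_hits(l1, counts, 0)
count_hits(l2, counts, 1)
print(counts)
```

Using the dictionary itself for the membership test avoids scanning the words list on every lookup.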

using lambda and dictionaries functions

I wrote this function:
def make_upper(words):
    for word in words:
        ind = words.index(word)
        words[ind] = word.upper()
I also wrote a function that counts the frequency of occurrences of each letter:
def letter_cnt(word,freq):
    for let in word:
        if let == 'A': freq[0]+=1
        elif let == 'B': freq[1]+=1
        elif let == 'C': freq[2]+=1
        elif let == 'D': freq[3]+=1
        elif let == 'E': freq[4]+=1
Counting letter frequency would be much more efficient with a dictionary, yes. Note that you are manually lining up each letter with a number ("A" with 0, et cetera). Wouldn't it be easier if we could have a data type that directly associated a letter with the number of times it occurs, without adding an extra set of numbers in between?
Consider the code:
freq = {"A":0, "B":0, "C":0, "D":0, ... ..., "Z":0}
for letter in text:
    freq[letter] += 1
This dictionary is used to count frequencies much more efficiently than your current code does. You just add one to an entry for a given letter each time you see it.
I will also mention that you can count frequencies effectively with certain libraries. If you are interested in analyzing frequencies, look into collections.Counter() and possibly the collections.Counter.most_common() method.
Whether or not you decide to just use collections.Counter(), I would attempt to learn why dictionaries are useful in this context.
One final note: I personally found typing out the values for the "freq" dictionary to be tedious. If you want you could construct an empty dictionary of alphabet letters on-the-fly with this code:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
freq = {letter:0 for letter in alphabet}
If you want to convert strings in the list to upper case using lambda, you may use it with map() as:
>>> words = ["Hello", "World"]
>>> map(lambda word: word.upper(), words) # In Python 2
['HELLO', 'WORLD']
# In Python 3, use it as: list(map(...))
As per the map() documentation:
map(function, iterable, ...)
Apply function to every item of iterable and return a list of the results.
For finding the frequency of each character in a word, you may use collections.Counter() (a dict subclass) as:
>>> from collections import Counter
>>> my_word = "hello world"
>>> c = Counter(my_word)
# where c holds dictionary as:
# {'l': 3,
# 'o': 2,
# ' ': 1,
# 'e': 1,
# 'd': 1,
# 'h': 1,
# 'r': 1,
# 'w': 1}
As per the Counter documentation:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values.
For the letter counting, don't reinvent the wheel: use collections.Counter.
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
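As a small illustration of the quoted behaviour (the sample words are made up), a Counter can be updated in place with the letters of each word, and subtract() may leave zero or negative counts:

```python
from collections import Counter

words = ["Hello", "World"]  # sample input, assumed for illustration

# update() adds the count of every letter in the given string.
letter_counts = Counter()
for word in words:
    letter_counts.update(word.upper())

print(letter_counts.most_common(2))  # [('L', 3), ('O', 2)]

# subtract() does not drop entries, so counts can go below zero.
letter_counts.subtract("LLLL")
print(letter_counts["L"])  # -1
```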
def punc_remove(words):
    # Strip every non-alphanumeric character from each word, in place.
    for word in words:
        if not word.isalnum():
            charl = []
            for char in word:
                if char.isalnum():
                    charl.append(char)
            ind = words.index(word)
            delimeter = ""
            words[ind] = delimeter.join(charl)
def letter_cnt_dic(word,freq_d):
    for let in word:
        # Skip digits and other characters that are not keys in freq_d.
        if let in freq_d:
            freq_d[let] += 1
import string
def letter_freq(fname):
    freqs = dict()
    # string.uppercase only exists in Python 2; ascii_uppercase works in both.
    alpha = list(string.ascii_uppercase)
    for let in alpha: freqs[let] = freqs.get(let,0)
    with open(fname) as fhand:
        for line in fhand:
            line = line.rstrip()
            words = line.split()
            punc_remove(words)
            #map(lambda word: word.upper(),words)
            words = [word.upper() for word in words]
            for word in words:
                letter_cnt_dic(word,freqs)
    return freqs.values()
You can read the docs about the Counter and the List Comprehensions or run this as a small demo:
from collections import Counter
words = ["acdefg","abcdefg","abcdfg"]
#list comprehension no need for lambda or map
new_words = [word.upper() for word in words]
print(new_words)
# Let's create a dict and a Counter
letters = {}
letters_counter = Counter()
for word in words:
    # Adding Counters sums their counts.
    letters_counter += Counter(word)
    # We can do the same with a plain dict
    for letter in word:
        letters[letter] = letters.get(letter,0) + 1
print(letters_counter)
print(letters)
