I'm trying to write a Python function that takes a list of strings and returns a dict mapping each index to the most frequent character at that index across all the strings. For example, list1 = ['one', 'two', 'twin', 'who'] should return index 0 = t, index 1 = w, index 2 = o, index 3 = n; the most frequent character at index 1 across all the strings is 'w'.
I found a solution, but for lists containing thousands of strings it takes too long to run. Can you help me reduce the execution time?
Here is what I tried; it is too slow for lists of thousands of strings:
list1 = ['one', 'two', 'twin', 'who']
width = len(max(list1, key=len))
chars = {}
for i, item in enumerate(zip(*[s.ljust(width) for s in list1])):
    set1 = set(item)
    if ' ' in set1:
        set1.remove(' ')
    chars[i] = max(set1, key=item.count)
print(chars)
Whether something is quick enough is a matter of use case, but this solution uses a couple of seconds to go through the default wordlist available under OS X.
Python's collections.Counter implements a counter object for you, so you don't have to keep track of the counts of multiple possible values yourself.
I've paired it with defaultdict, which initializes a key by calling a function if the key is undefined. If we haven't already seen the index we're updating the count for, it gets initialized to a Counter object that we then update.
from collections import defaultdict, Counter
with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()
letters = defaultdict(Counter)
for word in words:
    for idx, letter in enumerate(word):
        letters[idx].update((letter, ))
for idx, counter in letters.items():
    print(idx, counter.most_common(1))
Whether this is quick enough depends on your use case as mentioned; it can be done a lot quicker if necessary, but it's probably quick enough. For 235 886 words the runtime is:
python3 letterfreq.py 2.67s user 0.04s system 99% cpu 2.734 total
This assumes that every word is lowercased. If that is not the case, lowercase each word before adding it to your Counter object.
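As a rough sketch of how the pieces above could be wrapped into a reusable function that also applies the lowercasing step: the function name most_common_per_index is made up for illustration, and the input is the small example list from the question rather than the word file.

```python
from collections import Counter, defaultdict

def most_common_per_index(strings):
    """Return {index: most frequent character at that index} (hypothetical helper)."""
    position_counts = defaultdict(Counter)
    for s in strings:
        # Lowercase first so 'T' and 't' are counted as the same character.
        for idx, ch in enumerate(s.lower()):
            position_counts[idx][ch] += 1
    # most_common(1) returns a one-element list of (character, count) pairs.
    return {idx: counter.most_common(1)[0][0]
            for idx, counter in position_counts.items()}

result = most_common_per_index(['one', 'two', 'twin', 'who'])
print(result)  # {0: 't', 1: 'w', 2: 'o', 3: 'n'}
```

Returning a plain dict keeps the result in the shape the question asked for.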
If you want to implement it without using the Counter or defaultdict parts of the standard library (which are just helper functionality to avoid reimplementing the same small code repeatedly), you can do the exact thing yourself manually:
with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()
letter_positions = {}
for word in words:
    for idx, letter in enumerate(word):
        if idx not in letter_positions:
            letter_positions[idx] = {}
        if letter not in letter_positions[idx]:
            letter_positions[idx][letter] = 0
        letter_positions[idx][letter] += 1
final_dict = {}
for idx, counts in letter_positions.items():
    most_popular = sorted(counts.items(), key=lambda v: v[1], reverse=True)
    print(idx, most_popular)
    final_dict[idx] = most_popular[0][0]
print(final_dict)
Then pick as many entries as necessary from most_popular when going through the list afterwards.
Since we're no longer using the defaultdict and Counter abstractions, our running time is now roughly 40% of the previous one:
python3 letterfreq2.py 1.08s user 0.03s system 98% cpu 1.124 total
It's usually a good idea to go through what you're trying to do and formulate a strategy - i.e. "ok, I need to keep track of how many times a letter has appeared in this location .. so for that I need some way to keep values for each index .. and then for each letter ..".
Here are some improvements based on your algorithm.
First, you can use itertools.zip_longest() instead of zip(), which removes the need for ljust() and the width variable:
from itertools import zip_longest
list1 = ['one', 'two', 'twin', 'who']
chars = {}
for i, item in enumerate(zip_longest(*list1)):
    set1 = set(item)
    if None in set1:
        set1.remove(None)
    chars[i] = max(set1, key=item.count)
print(chars)
Then replace max(set1, key=item.count) with the more efficient Counter(item).most_common(1)[0][0], combined with or set1.most_common(2)[1][0] to skip None when it happens to be the most common value:
from itertools import zip_longest
from collections import Counter
list1 = ['one', 'two', 'twin', 'who']
chars = {}
for i, item in enumerate(zip_longest(*list1)):
    set1 = Counter(item)
    chars[i] = set1.most_common(1)[0][0] or set1.most_common(2)[1][0]
print(chars)
Since itertools and collections are part of the Python standard library, you can import them directly without installing anything with pip.
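A possible variation on the same idea, shown only as a sketch: instead of falling back to the second most common entry, the None padding can be dropped before counting, so the Counter never sees it. The variable name column is an illustrative choice.

```python
from collections import Counter
from itertools import zip_longest

list1 = ['one', 'two', 'twin', 'who']

chars = {}
for i, column in enumerate(zip_longest(*list1)):
    # zip_longest pads the shorter strings with None; leave the padding out.
    counts = Counter(ch for ch in column if ch is not None)
    chars[i] = counts.most_common(1)[0][0]

print(chars)  # {0: 't', 1: 'w', 2: 'o', 3: 'n'}
```

Every column contains at least one real character (otherwise zip_longest would have stopped), so most_common(1) is never empty here.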
I need to find the longest string in a list for each letter of the alphabet.
My first straight forward approach looked like this:
alphabet = ["a","b", ..., "z"]
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
result = {key:"" for key in alphabet} # create a dictionary
# go through all words, if that word is longer than the current longest, save it
for word in text:
    if word[0].lower() in alphabet and len(result[word[0].lower()]) < len(word):
        result[word[0].lower()] = word.lower()
print(result)
which returns:
{'a': 'andhjtje9'}
as it is supposed to do.
In order to practice dictionary comprehension I tried to solve this in just one line:
result2 = {key:"" for key in alphabet}
result2 = {word[0].lower(): word.lower() for word in text if word[0].lower() in alphabet and len(result2[word[0].lower()]) < len(word)}
I just copied the if statement into the comprehension...
result2, however, is:
{'a': 'ajhe5'}
Can someone explain why this is the case? I feel like I did exactly the same thing as in the first loop...
Thanks for any help!
A list, dict, or set comprehension cannot refer to itself while it is being built; that's why you do not get what you want.
You can use a complicated dictionary comprehension to do this. With the help of itertools.groupby on a sorted list, it could look like this:
from string import ascii_lowercase
from itertools import groupby
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key:sorted(value, key=len)[-1]
     for key,value in groupby((s for s in sorted(text)
                               if s[0].lower() in frozenset(ascii_lowercase)),
                              lambda x:x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5", "]more_crazy_words"]
d = {key:next(value) for key,value in groupby(
         # sort by the lowercased first letter so the order matches the groupby key
         (s for s in sorted(text, key=lambda x: (x[0].lower(),-len(x)))
          if s[0].lower() in frozenset(ascii_lowercase)),
         lambda x:x[0].lower())}
print(d) # {'a': 'andhjtje9'}
or several other ways ... but why would you?
Written as for loops, the code is much cleaner and easier to understand, and in this case it probably follows the Zen of Python better.
You can read the Zen of Python by running:
import this
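To make the first point of this answer concrete, here is a small sketch of what the one-line version in the question effectively does. The names lookup, passed, and built are illustrative only: lookup stands in for the pre-filled result2, which is all the comprehension can see while it runs.

```python
text = ["ane4", "anrhgjt8", "andhjtje9", "ajhe5"]

# While the comprehension runs, the name on the left still refers to the
# dict of empty strings; it is only rebound once the new dict is complete.
lookup = {"a": ""}

# Every word is longer than "", so the length check never filters anything.
passed = [word for word in text if len(lookup[word[0]]) < len(word)]
print(passed)  # ['ane4', 'anrhgjt8', 'andhjtje9', 'ajhe5']

# With duplicate keys in a dict comprehension, the last value simply wins.
built = {word[0]: word for word in passed}
print(built)  # {'a': 'ajhe5'}
```

The for loop behaves differently because it updates result after every word, so each comparison sees the longest word found so far.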
I have a list containing a number of strings. Some of the strings are repeated, so I want to count how many times each one occurs. For strings that occur once I just want to print the string; for repeated strings I want to print how many times they occur. The code is as follows:
for string in list:
    if list.count(string) > 1:
        print(string+" appeared: ")
        print(list.count(string))
    elif list.count(string) == 1:
        print(string)
However, it has a problem: it prints once for every instance of a repeated string. For example, if there are two "hello" strings in the list, it prints hello appeared 2 twice. Is there a way to skip the repeated instances? Thanks for your help.
list.count in a loop is expensive: it scans the entire list for each word, which is O(n^2) complexity for a list of n words. You can loop over a set of the words instead, but that is O(m*n) for m unique words, which is still not great.
Instead, you can use collections.Counter to go through your list once and then iterate over its key-value pairs. This has O(m+n) complexity.
lst = ['hello', 'test', 'this', 'is', 'a', 'test', 'hope', 'this', 'works']
from collections import Counter
c = Counter(lst)
for word, count in c.items():
    if count == 1:
        print(word)
    else:
        print(f'{word} appeared: {count}')
hello
test appeared: 2
this appeared: 2
is
a
hope
works
Use a set, so that each distinct string is visited only once.
Example:
for string in set(list):
    if list.count(string) > 1:
        print(string+" appeared: ")
        print(list.count(string))
    elif list.count(string) == 1:
        print(string)
Use a Counter
To create:
In [166]: import collections
In [169]: d = collections.Counter(['hello', 'world', 'hello'])
To display:
In [170]: for word, freq in d.items():
     ...:     if freq > 1:
     ...:         print('{0} appeared {1} times'.format(word, freq))
     ...:     else:
     ...:         print(word)
     ...:
hello appeared 2 times
world
You can use Python's collections.Counter like so:
import collections
result = dict(collections.Counter(list))
Another way to do this manually is:
result = {k: 0 for k in set(list)}
for item in list:
    result[item] += 1
Also, you should not name your list list, since that shadows Python's built-in type. Both methods will give you a dict like this:
{"a": 3, "b": 1, "c": 4, "d": 1}
where the keys are the unique values from your list and the values are how many times each key appears in your list.
I have a list of (unique) words:
words = ["store", "worry", "periodic", "bucket", "keen", "vanish", "bear", "transport", "pull", "tame", "rings", "classy", "humorous", "tacit", "healthy"]
that I want to cross-check against two different lists of lists (with the same range), while counting the number of hits.
l1 = [["terrible", "worry", "not"], ["healthy"], ["fish", "case", "bag"]]
l2 = [["vanish", "healthy", "dog"], ["plant"], ["waves", "healthy", "bucket"]]
I was thinking of using a dictionary with the word as the key, but I would need two values (one for each list) for the number of hits.
So the output would be something like:
{"store": [0, 0]}
{"worry": [1, 0]}
...
{"healthy": [1, 2]}
How would something like this work?
Thank you in advance!
You can use itertools to flatten the lists and then use a dictionary comprehension:
from itertools import chain
words = ["store", "worry", "periodic", "bucket", "keen", "vanish", "bear", "transport", "pull", "tame", "rings", "classy", "humorous", "tacit", "healthy"]
l1 = [["terrible", "worry", "not"], ["healthy"], ["fish", "case", "bag"]]
l2 = [["vanish", "healthy", "dog"], ["plant"], ["waves", "healthy", "bucket"]]
l1 = list(chain(*l1))
l2 = list(chain(*l2))
final_count = {i:[l1.count(i), l2.count(i)] for i in words}
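If the lists grow large, calling count once per word rescans the flattened list every time. One possible alternative, sketched here with a shortened words list for brevity, is to count each flattened list once with collections.Counter (a swapped-in technique, not part of the answer above) and then look the words up:

```python
from collections import Counter
from itertools import chain

# Shortened version of the question's data, for illustration only.
words = ["store", "worry", "bucket", "vanish", "healthy"]
l1 = [["terrible", "worry", "not"], ["healthy"], ["fish", "case", "bag"]]
l2 = [["vanish", "healthy", "dog"], ["plant"], ["waves", "healthy", "bucket"]]

# Count every word of each nested list in a single pass.
counts1 = Counter(chain.from_iterable(l1))
counts2 = Counter(chain.from_iterable(l2))

# A Counter returns 0 for missing keys, so words without hits get [0, 0].
final_count = {w: [counts1[w], counts2[w]] for w in words}
print(final_count)
```

For the data above this gives [0, 0] for "store", [1, 0] for "worry", and [1, 2] for "healthy", matching the output sketched in the question.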
For your dictionary example, you would just need to iterate over each list and add the words to the dictionary like so:
my_dict = {}
for word in l1:
    if word in words: #This makes sure you only work with words that are in your list of unique words
        if word not in my_dict:
            my_dict[word] = [0,0]
        my_dict[word][0] += 1
for word in l2:
    if word in words:
        if word not in my_dict:
            my_dict[word] = [0,0]
        my_dict[word][1] += 1
(Or you could turn the repeated code into a function that takes the list, the dictionary, and the index as parameters, so that you repeat fewer lines.)
If your lists are 2D like in your example, then you just add an inner loop over each sublist:
my_dict = {}
for group in l1:
    for word in group:
        if word in words:
            if word not in my_dict:
                my_dict[word] = [0,0]
            my_dict[word][0] += 1
for group in l2:
    for word in group:
        if word in words:
            if word not in my_dict:
                my_dict[word] = [0,0]
            my_dict[word][1] += 1
Though if you just want to know which words the lists have in common, sets could be an option as well, since set operations such as intersection make it easy to see all the words in common. Sets eliminate duplicates, however, so if the counts are necessary, a set isn't an option.
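The helper-function idea mentioned in this answer could look roughly like the sketch below. The name count_hits and its parameter names are made up for illustration, and the words collection is shortened and stored as a set so that the membership test is cheap.

```python
def count_hits(groups, wanted, counts, slot):
    """Add hits from a list of lists to counts[word][slot] (hypothetical helper)."""
    for group in groups:
        for word in group:
            if word in wanted:
                # Create the [0, 0] pair the first time a word is seen.
                counts.setdefault(word, [0, 0])[slot] += 1

words = {"store", "worry", "bucket", "vanish", "healthy"}
l1 = [["terrible", "worry", "not"], ["healthy"], ["fish", "case", "bag"]]
l2 = [["vanish", "healthy", "dog"], ["plant"], ["waves", "healthy", "bucket"]]

my_dict = {}
count_hits(l1, words, my_dict, 0)
count_hits(l2, words, my_dict, 1)
print(my_dict)
```

As in the loops above, words with no hits at all (such as "store") never get a key in my_dict.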
I'm super new to Python, and I've been struggling with this code for a while now. Basically, the function should return a dictionary with integers as keys, where each value is all the words whose length equals that key.
So far I'm able to create a dictionary where the values are the number of words of each length, but not the actual words themselves.
So passing the following text
"the faith that he had had had had an affect on his life"
to the function
def get_word_len_dict(text):
    result_dict = {'1':0, '2':0, '3':0, '4':0, '5':0, '6' :0}
    for word in text.split():
        if str(len(word)) in result_dict:
            result_dict[str(len(word))] += 1
    return result_dict
returns
1 - 0
2 - 3
3 - 6
4 - 2
5 - 1
6 - 1
Where I need the output to be:
2 - ['an', 'he', 'on']
3 - ['had', 'his', 'the']
4 - ['life', 'that']
5 - ['faith']
6 - ['affect']
I think I need to return the values as a list, but I'm not sure how to approach it.
I think that what you want is a dict of lists.
def get_word_len_dict(text):
    result_dict = {'1':[], '2':[], '3':[], '4':[], '5':[], '6' :[]}
    for word in text.split():
        if str(len(word)) in result_dict:
            result_dict[str(len(word))].append(word)
    return result_dict
Fixing Sabian's answer so that duplicates aren't added to the list:
def get_word_len_dict(text):
    result_dict = {1:[], 2:[], 3:[], 4:[], 5:[], 6 :[]}
    for word in text.split():
        n = len(word)
        if n in result_dict and word not in result_dict[n]:
            result_dict[n].append(word)
    return result_dict
Check out list comprehensions
Integers are legal dictionary keys, so there is no need to make the numbers strings unless you want it that way for some other reason.
The if statement in the for loop makes sure each word is added only once. You could get this effect automatically by using a set instead of a list as the value data structure. See more in the docs. I believe the following does the job:
def get_word_len_dict(text):
    result_dict = {len(word) : [] for word in text.split()}
    for word in text.split():
        if word not in result_dict[len(word)]:
            result_dict[len(word)].append(word)
    return result_dict
try to make it better ;)
Instead of defining the default value as 0, make it set(), and within the if condition do result_dict[str(len(word))].add(word).
Also, instead of pre-populating result_dict, you should use collections.defaultdict.
Since you need each word only once, I am using a set as the value instead of a list.
Hence, your final code should be:
from collections import defaultdict
def get_word_len_dict(text):
    result_dict = defaultdict(set)
    for word in text.split():
        result_dict[str(len(word))].add(word)
    return result_dict
If the values must be lists (I think a set should satisfy your requirement), you need to convert them afterwards:
for key, value in result_dict.items():
    result_dict[key] = list(value)
What you need is a map-to-list construct (if there are not many words; otherwise a 'Counter' would be fine):
Each list stands for a word class (number of characters). The map is checked to see whether the word class ('3') has been found before. The list is checked to see whether the word ('had') has been found before.
def get_word_len_dict(text):
    result_dict = {}
    for word in text.split():
        if not result_dict.get(str(len(word))): # add list to map?
            result_dict[str(len(word))] = []
        if word not in result_dict[str(len(word))]: # add word to list?
            result_dict[str(len(word))].append(word)
    return result_dict
-->
3 ['the', 'had', 'his']
2 ['he', 'an', 'on']
5 ['faith']
4 ['that', 'life']
6 ['affect']
The problem here is that you are counting the words by length, whereas you want to group them. You can achieve this by storing a collection of words instead of an int:
def get_word_len_dict(text):
    result_dict = {}
    for word in text.split():
        if len(word) in result_dict:
            result_dict[len(word)].add(word)
        else:
            result_dict[len(word)] = {word} #using a set instead of list to avoid duplicates
    return result_dict
Other improvements:
Don't hardcode the keys in the initialized dict; leave it empty instead and let the code add new keys dynamically when necessary.
You can use ints as keys instead of strings, which saves you the conversion.
Use sets to avoid repetitions.
Using groupby
Well, I'll try to propose something different: you can group by length using groupby from the python standard library
import itertools
def get_word_len_dict(text):
    # split and group by length (you get an iterator of (key, group of values) tuples)
    groups = itertools.groupby(sorted(text.split(), key=lambda x: len(x)), lambda x: len(x))
    # convert to a dictionary with sets
    return {l: set(words) for l, words in groups}
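The sorted call in the groupby version is not optional: groupby only merges neighbouring items with the same key. A short sketch on a made-up five-word input shows what happens without it; the variable names are illustrative.

```python
from itertools import groupby

words = "the faith that he had".split()

# Without sorting, the key 3 shows up twice ('the' first, 'had' last).
unsorted_keys = [k for k, _ in groupby(words, key=len)]
print(unsorted_keys)  # [3, 5, 4, 2, 3]

# Sorting by the same key first yields exactly one group per length.
sorted_keys = [k for k, _ in groupby(sorted(words, key=len), key=len)]
print(sorted_keys)  # [2, 3, 4, 5]

# In a dict comprehension the later group would silently replace the earlier one.
overwritten = {k: set(g) for k, g in groupby(words, key=len)}
print(overwritten[3])  # {'had'}
```

This is why the input is sorted with the same key function that is then passed to groupby.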
You say you want the keys to be integers but then you convert them to strings before storing them as a key. There is no need to do this in Python; integers can be dictionary keys.
Regarding your question, simply initialize the values of the keys to empty lists instead of the number 0. Then, in the loop, append the word to the list stored under the appropriate key (the length of the word), like this:
string = "the faith that he had had had had an affect on his life"
def get_word_len_dict(text):
    result_dict = {i : [] for i in range(1, 7)}
    for word in text.split():
        length = len(word)
        if length in result_dict:
            result_dict[length].append(word)
    return result_dict
This results in the following:
>>> get_word_len_dict(string)
{1: [], 2: ['he', 'an', 'on'], 3: ['the', 'had', 'had', 'had', 'had', 'his'], 4: ['that', 'life'], 5: ['faith'], 6: ['affect']}
If, as you mentioned, you wish to remove duplicate words from your input string, it is elegant to use a set and convert it to a list as a final processing step if needed. Also note the use of defaultdict, so you don't have to initialize the dictionary keys and values manually: the default value set() (i.e. the empty set) gets inserted for each key that we try to access, and for no others:
from collections import defaultdict
string = "the faith that he had had had had an affect on his life"
def get_word_len_dict(text):
    result_dict = defaultdict(set)
    for word in text.split():
        length = len(word)
        result_dict[length].add(word)
    return {k : list(v) for k, v in result_dict.items()}
This gives the following output:
>>> get_word_len_dict(string)
{2: ['he', 'on', 'an'], 3: ['his', 'had', 'the'], 4: ['life', 'that'], 5: ['faith'], 6: ['affect']}
Your code is counting the occurrence of each word length - but not storing the words themselves.
In addition to capturing each word into a list of words with the same size, you also appear to want:
If a word length is not represented, do not return an empty list for that length - just don't have a key for that length.
No duplicates in each word list
Each word list is sorted
A set container is ideal for accumulating the words - sets naturally eliminate any duplicates added to them.
Using defaultdict(set) will set up an empty dictionary of sets; a dictionary key is only created when it is referenced in the loop that examines each word.
from collections import defaultdict
def get_word_len_dict(text):
    #create empty dictionary of sets
    d = defaultdict(set)
    # the key is the length of each word
    # The value is a growing set of words
    # sets automatically eliminate duplicates
    for word in text.split():
        d[len(word)].add(word)
    # the sets in the dictionary are unordered
    # so sort them into a new dictionary, which is returned
    # as a dictionary of lists
    return {i:sorted(d[i]) for i in d.keys()}
In your example string of
a="the faith that he had had had had an affect on his life"
Calling the function like this:
z=get_word_len_dict(a)
returns the following dictionary:
print(z)
{2: ['an', 'he', 'on'], 3: ['had', 'his', 'the'], 4: ['life', 'that'], 5: ['faith'], 6: ['affect']}
The type of each value in the dictionary is "list".
print(type(z[2]))
<class 'list'>
I am trying to get the difference between two containers, but the containers have an awkward structure, so I don't know the best way to compute a difference on them. I cannot alter the type and structure of one container, but I can alter the other (the variable delims).
delims = ['on','with','to','and','in','the','from','or']
words = collections.Counter(s.split()).most_common()
# words results in [("the",2), ("a",9), ("diplomacy", 1)]
# I want to perform a 'difference' operation on words to remove all the delims words
descriptive_words = set(words) - set(delims)
# because of the unique structure of words (a list of tuples) it's hard to perform a difference
# on it. What would be the best way to perform a difference? Maybe...
delims = [('on',0),('with',0),('to',0),('and',0),('in',0),('the',0),('from',0),('or',0)]
words = collections.Counter(s.split()).most_common()
descriptive_words = set(words) - set(delims)
# Or maybe
words = collections.Counter(s.split()).most_common()
n_words = []
for w in words:
    n_words.append(w[0])
delims = ['on','with','to','and','in','the','from','or']
descriptive_words = set(n_words) - set(delims)
How about just modifying words by removing all the delimiters?
words = collections.Counter(s.split())
for delim in delims:
    del words[delim]
This is how I would do it:
delims = set(['on','with','to','and','in','the','from','or'])
# ...
descriptive_words = filter(lambda x: x[0] not in delims, words)
This uses the built-in filter function (in Python 3 it returns an iterator, so wrap it in list() if you need a list). A viable alternative would be:
delims = set(['on','with','to','and','in','the','from','or'])
# ...
descriptive_words = [(word, count) for word, count in words if word not in delims]
Either way, make sure delims is a set to allow O(1) lookup.
The simplest answer is to do:
import collections
s = "the a a a a the a a a a a diplomacy"
delims = {'on','with','to','and','in','the','from','or'}
# For older versions of python without set literals:
# delims = set(['on','with','to','and','in','the','from','or'])
words = collections.Counter(s.split())
not_delims = {key: value for (key, value) in words.items() if key not in delims}
# For older versions of python without dict comprehensions:
# not_delims = dict(((key, value) for (key, value) in words.items() if key not in delims))
Which gives us:
{'a': 9, 'diplomacy': 1}
An alternative option is to do it pre-emptively:
import collections
s = "the a a a a the a a a a a diplomacy"
delims = {'on','with','to','and','in','the','from','or'}
counted_words = collections.Counter((word for word in s.split() if word not in delims))
Here you apply the filtering on the list of words before you give it to the counter, and this gives the same result.
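Since the question works with the list of (word, count) tuples that most_common() returns, the pre-filtered counter can be turned back into that shape. This is only a sketch that reuses the names from the snippet above; descriptive_words is the variable name from the question.

```python
import collections

s = "the a a a a the a a a a a diplomacy"
delims = {'on', 'with', 'to', 'and', 'in', 'the', 'from', 'or'}

# Filter out the delimiter words before they ever reach the Counter.
counted_words = collections.Counter(word for word in s.split() if word not in delims)

# most_common() with no argument lists every (word, count) pair, highest count first.
descriptive_words = counted_words.most_common()
print(descriptive_words)  # [('a', 9), ('diplomacy', 1)]
```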
If you're iterating through it anyway, why bother converting them to sets? With delims as the list of (word, 0) tuples from the question:
dwords = [delim[0] for delim in delims]
words = [word for word in words if word[0] not in dwords]
Alternatively, you can use filter with a lambda function:
filter(lambda word: word[0] not in dwords, words)