Manipulating sets in dictionaries


I need to update a dictionary using sets. My program essentially needs to take a set and assign it as a value in a dictionary. If the key already exists, I need to update its value (keep adding the new values to the set).
Here is how my program works now:
for line in fd:
    new_line = line.split(' ')
    for word in new_line:
        new_word = ''.join(l for l in word if l.isalpha())
        new_word = new_word.lower()
        ind_count = 0
        for let in new_word:
            c_dict[let, ind_count] = new_word
            ind_count += 1
My fd file contains a list of words.
I want my result to look something like this:
print(c_dict)
{ (0, "h") : { "hello", "helps" } , (0, "c") : { "cow" } }
This essentially takes a letter from the word and its index, and sets the value to that word. My file will have hundreds of words that have the letter 'h' at position 0, and the key (0, 'h') should have a value that contains all of those words.
Right now, my program just replaces the values. Any help would be greatly appreciated.
Thanks!

dict.setdefault() is perfect for this:
for line in fd:
    new_line = line.split(' ')
    for word in new_line:
        new_word = ''.join(l for l in word if l.isalpha())
        new_word = new_word.lower()
        for ind_count, let in enumerate(new_word):
            c_dict.setdefault((let, ind_count), set()).add(new_word)
Note that I also changed the innermost for loop to use enumerate() rather than manually incrementing ind_count inside the loop.
c_dict.setdefault((let, ind_count), set()).add(new_word) is equivalent in behavior to the following code:
if (let, ind_count) in c_dict:
    c_dict[let, ind_count].add(new_word)
else:
    c_dict[let, ind_count] = set([new_word])
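Another option with the same effect is collections.defaultdict, which creates the empty set the first time a key is touched. This is only a minimal sketch: build_index and sample_lines are made-up names, and the sample list stands in for the fd file from the question.

```python
from collections import defaultdict

def build_index(lines):
    """Map (letter, position) -> set of words with that letter at that position."""
    index = defaultdict(set)  # a missing key starts out as an empty set
    for line in lines:
        for word in line.split():
            # Keep letters only and lowercase them, as in the question.
            cleaned = ''.join(ch for ch in word if ch.isalpha()).lower()
            for position, letter in enumerate(cleaned):
                index[letter, position].add(cleaned)
    return index

sample_lines = ["Hello cow", "helps!"]  # hypothetical stand-in for the file
index = build_index(sample_lines)
print(index['h', 0])  # a set holding 'hello' and 'helps' (set order may vary)
```

The trade-off is that reading a missing key from a defaultdict inserts it, so use index.get(key) when you only want to look something up.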

Related

How to extract words from repeating strings

Here I have a string in a list:
['aaaaaaappppppprrrrrriiiiiilll']
I want to get the word 'april' from this string, not just once, but as many times as the word 'april' actually occurs in the string.
The output should be something like:
['aprilaprilapril']
Because the word 'april' occurred three times in that string.
Well, the word didn't actually occur three times; all of its characters did. So I want to arrange these characters into 'april' as many times as they appear in the string.
My idea is basically to extract words from some random strings, but not just extracting the word, instead to extract all of the word that appears in the string. Each word should be extracted and the word (characters) should be ordered the way I wanted to.
But here I have some annoying conditions; you can't delete all the elements in the list and then just replace them with the word 'april'(you can't replace the whole string with the word 'april'); you can only extract 'april' from the string, not replacing them. You can't also delete the list with the string. Just think of all the string there being very important data, we just want some data, but these data must be ordered, and we need to delete all other data that doesn't match our "data chain" (the word 'april'). But once you delete the whole string you will lose all the important data. You don't know how to make another one of these "data chains", so we can't just put the word 'april' back in the list.
If anyone knows how to solve my weird problem, please help me out. I am a beginner Python programmer. Thank you!

One way is to use itertools.groupby, which groups consecutive identical characters. Unpacking the groups into zip then iterates n times, where n is the number of characters in the smallest group (i.e. the group with the fewest characters):
from itertools import groupby
result = ''
for each in zip(*[list(g) for k, g in groupby('aaaaaaappppppprrrrrriiiiiilll')]):
    result += ''.join(each)
# result = 'aprilaprilapril'
Another possible solution is to create a custom counter that counts each unique run of characters. Note that this relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 and is an implementation detail of CPython 3.6; in earlier versions the order of dictionaries is not guaranteed:
def getCounts(strng):
    if not strng:
        return [], 0
    counts = {}
    current = strng[0]
    for c in strng:
        if c in counts.keys():
            if current==c:
                counts[c] += 1
        else:
            current = c
            counts[c] = 1
    return counts.keys(), min(counts.values())
result = ''
counts=getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(counts[1]):
    result += ''.join(counts[0])
# result = 'aprilaprilapril'

How about using regex?
import re
word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    # Find the lowest amount of character repeats
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
Outputs:
aprilaprilapril
Works like a charm

Here is a more basic approach that uses plain iteration.
It has a time complexity of O(n).
It first skips any leading characters that come before the first letter of the key. It then uses an outer loop to iterate over the characters in the search key, and an inner while loop that consumes all consecutive occurrences of that character in the search string while maintaining a counter. Once all consecutive occurrences of the current letter have been consumed, it updates minLetterCount to be the minimum of its previous value and this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.
def countCompleteSequenceOccurences(searchString, key):
    left = 0
    minLetterCount = 0
    letterCount = 0
    # Skip leading characters before the first key letter (the test cases below expect this).
    while key and left < len(searchString) and searchString[left] != key[0]:
        left += 1
    for i, searchChar in enumerate(key):
        while left < len(searchString) and searchString[left] == searchChar:
            letterCount += 1
            left += 1
        minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
        letterCount = 0
    return minLetterCount
Testing:
testCasesToOracles = {
    "aaaaaaappppppprrrrrriiiiiilll": 3,
    "ppppppprrrrrriiiiiilll": 0,
    "aaaaaaappppppprrrrrriiiiii": 0,
    "aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
    "pppppppaaaaaaarrrrrriiiiiilll": 0,
    "zaaaaaaappppppprrrrrriiiiiilll": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilll": 3,
    "aaaaaaappppppprrrrrriiiiiilllzzz": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}
key = "april"
for case, oracle in testCasesToOracles.items():
    result = countCompleteSequenceOccurences(case, key)
    assert result == oracle
Usage:
key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)
Output:
aprilaprilapril

A word will only occur as many times as the minimum letter recurrence. To account for the possibility of having repeated letters in the word (for example, appril), you need to factor this count out. Here is one way of doing this using collections.Counter:
from collections import Counter
def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # now get the effective count by dividing the occurrences in string by
    # the occurrences in kernel (floor division keeps whole words only)
    effective_counter = {
        k: string_counter.get(k, 0) // v
        for k, v in kernel_counter.items()
    }
    # min occurrence of kernel is min of effective counter
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
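To see why the division matters, here is a compact restatement of the same idea with two calls. It is a sketch that assumes a non-empty kernel (min() raises ValueError on an empty one) and, like the version above, it counts letters regardless of their order in the string. The second input string is made up for illustration.

```python
from collections import Counter

def count_recurrence(kernel, string):
    """Return kernel repeated as often as the letters in string allow."""
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # A Counter returns 0 for missing letters; floor division keeps whole words only.
    repeats = min(string_counter[k] // v for k, v in kernel_counter.items())
    return kernel * repeats

print(count_recurrence('april', 'aaaaaaappppppprrrrrriiiiiilll'))  # aprilaprilapril
# 'appril' needs two p's per word, so four p's allow only two copies
# even though every other letter appears three times.
print(count_recurrence('appril', 'aaapppprrriiilll'))  # apprilappril
```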

Selecting randomly from a list with a specific character

So, I have a text file (full of words) I put into a list. I want Python 2.7 to select a word from the list randomly, but for it to start in a specific character.
list code:
d=[]
with open("dic.txt", "r") as x:
    d=[line.strip() for line in x]
It's for a game called Shiritori. The user starts by saying any word in the English language, e.g. dog. The program then has to pick another word starting with the last character, in this case 'g'.
code for the game:
game_user='-1'
game_user=raw_input("em, lets go with ")
a1=len(game_user)
I need a program that will randomly select a word beginning with that character.

Because your game relies specifically upon a random word with a fixed starting letter, I suggest first sorting all your words into a dictionary with the starting letter as the key. Then you can randomly look up any word starting with a given letter:
d=[]
lookup = {}
with open("dic.txt", "r") as x:
    d=[line.strip() for line in x]
for word in d:
    if word[0] in lookup:
        lookup[word[0]].append(word)
    else:
        lookup[word[0]] = [ word ]
Now you have a dict 'lookup' that has all your words sorted by letter.
When you need a word that starts with the last letter of the previous word, you can randomly pick an element in your list:
import random
random_word = random.choice(lookup[ game_user[-1] ])
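The grouping step can also be written with collections.defaultdict, and it helps to decide what happens when no word starts with the required letter. The following is a rough sketch, not the answer's own code: build_lookup, next_word, and the sample word list (standing in for dic.txt) are hypothetical.

```python
import random
from collections import defaultdict

def build_lookup(words):
    """Group words by their first letter; blank entries are skipped."""
    lookup = defaultdict(list)
    for word in words:
        if word:  # word[0] would raise IndexError on an empty line
            lookup[word[0].lower()].append(word)
    return lookup

def next_word(lookup, previous_word):
    """Pick a random word starting with the last letter of previous_word, or None."""
    candidates = lookup.get(previous_word[-1].lower(), [])
    return random.choice(candidates) if candidates else None

words = ["dog", "goat", "giraffe", "tiger", ""]  # hypothetical stand-in for dic.txt
lookup = build_lookup(words)
print(next_word(lookup, "dog"))  # either "goat" or "giraffe"
print(next_word(lookup, "fox"))  # None, because no word starts with "x"
```

Returning None lets the game loop decide how to react (for example, the program loses the round) instead of failing with a KeyError.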

To get a new list of all the values that start with the last letter of the user input:
choices = [x for x in d if x[0] == game_user[-1]]
Then, you can select a word with:
newWord = random.choice(choices)

>>> import random
>>> with open('/usr/share/dict/words') as f:
...     words = f.read().splitlines()
...
>>> c = 'a'
>>> random.choice([w for w in words if w.startswith(c)])
'anthologizing'
Obviously, you need to replace c = 'a' with the last character of the user's input, e.g. c = raw_input("em, lets go with ")[-1]

You might get better use out of more advanced data structures, but here's a shot:
words_dict = {}
for row in d:
    # Gives us the first letter
    myletter = row[0]
    if myletter not in words_dict:
        words_dict[myletter] = []
    words_dict[myletter].append(row)
After creating a dictionary of all letters and their corresponding words, you can then access any particular set of words like so:
words_dict['a']
This will give you all the words that start with a, in a list. Then you can take:
import random
# This could be any letter here..
someletter = 'a'
newword = words_dict[someletter][random.randint(0, len(words_dict[someletter]) - 1)]
Let me know if that makes sense.

How to change the value of a string with a dictionary value in Python

I have the following dictionary in Python:
myDict = {"how":"como", "you?":"tu?", "goodbye":"adios", "where":"donde"}
and with a string like "How are you?" I wish to have the following result once it is compared to myDict:
"como are tu?"
As you can see, a word that doesn't appear in myDict, like "are", is left as it is in the result.
This is my code so far:
myDict = {"how":"como", "you?":"tu?", "goodbye":"adios", "where":"donde"}
def translate(word):
    word = word.lower()
    word = word.split()
    for letter in word:
        if letter in myDict:
            return myDict[letter]
print(translate("How are you?"))
The result contains only the first word: como. What am I doing wrong, so that I don't get the entire sentence?
Thanks for your help in advance!

The function returns (exits) the first time it hits a return statement. In this instance, that will always be the first word.
What you should do is make a list of words, and where you see the return currently, you should add to the list.
Once you have added each word, you can then return the list at the end.
PS: your terminology is confusing. What you have are phrases, each made up of words. "This is a phrase" is a phrase of 4 words: "This", "is", "a", "phrase". A letter would be the individual part of the word, for example "T" in "This".
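As a rough illustration of that advice (collect every word in a list and return only once the loop has finished), something along these lines should work. The names my_dict, phrase, and mapping are chosen here for illustration and are not from the question.

```python
my_dict = {"how": "como", "you?": "tu?", "goodbye": "adios", "where": "donde"}

def translate(phrase, mapping):
    """Translate each word that has an entry in mapping; keep the others unchanged."""
    translated = []
    for word in phrase.lower().split():
        # dict.get returns the word itself when there is no translation for it.
        translated.append(mapping.get(word, word))
    # Return once, after every word has been handled.
    return " ".join(translated)

print(translate("How are you?", my_dict))  # como are tu?
```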

The problem is that you are returning the first word that is mapped in your dictionary. You can use this instead (I have changed some variable names because they were kind of confusing):
myDict = {"how":"como", "you?":"tu?", "goodbye":"adios", "where":"donde"}
def translate(string):
    string = string.lower()
    words = string.split()
    translation = ''
    for word in words:
        if word in myDict:
            translation += myDict[word]
        else:
            translation += word
        translation += ' ' # add a space between words
    return translation[:-1] #remove last space
print(translate("How are you?"))
Output:
como are tu?

When you call return, the function that is currently being executed terminates, which is why yours stops after finding one word. For your function to work properly, you would have to append to a string that is stored as a local variable within the function.
Here's a function that uses a list comprehension to translate each word of a string if it exists in the dictionary:
def translate(myDict, string):
    return ' '.join([myDict[x.lower()] if x.lower() in myDict else x for x in string.split()])
Example:
myDict = {"how": "como", "you?": "tu?", "goodbye": "adios", "where": "donde"}
print(translate(myDict, 'How are you?'))
>> como are tu?

myDict = {"how":"como", "you?":"tu?", "goodbye":"adios", "where":"donde"}
s = "How are you?"
newString =''
for word in s.lower().split():
    newWord = word
    if word in myDict:
        newWord = myDict[word]
    newString = newString+' '+newWord
print(newString.strip()) # strip the leading space added by the first iteration

Sorting words in a text file (with parameters) and writing them into a new file with Python

I have a file.txt with thousands of words, and I need to create a new file based on certain parameters, and then sort them a certain way.
Assuming the user imports the proper libraries when they test, what is wrong with my code? (There are 3 separate functions.)
For the first, I must select the words containing certain letters, sort them lexicographically, and then put them into a new file, list.txt.
def getSortedContain(s,ifile,ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
The second is similar, but it must select the words that do NOT contain the input string and sort them in reverse lexicographic order.
def getReverseSortedNotContain(s,ifile,ofile):
    toWrite = ""
    toWrites = ""
    for line in ifile:
        word = line[:-1]
        if s not in word:
            toWrite += word + "\n"
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    ofile.write(toWrites[:-1])
For the third, I must select words that have a given number of characters (n), and sort them lexicographically by the last character in each word.
def getRhymeSortedCount(n, ifile, ofile):
    toWrite = ""
    for line in ifile:
        word = line[:-1] #gets rid of \n
        if len(word) == n:
            toWrite += word + "\n"
    reversetoWrite = toWrite[::-1]
    newList = []
    newList.append(toWrite)
    newList.sort()
    newList.reverse()
    for h in newList:
        toWrites += h
    reversetoWrite = toWrites[::-1]
    ofile.write(reversetoWrites[:-1])
Could someone please point me in the right direction for these? Right now they are not sorting as they're supposed to.

There is a lot of stuff that is unclear here, so I'll try my best to clean this up.
You're concatenating strings together into one big string and then appending that one big string to a list. You then try to sort your 1-element list, which does nothing. Instead, put all the strings into a list and then sort that list.
For your first example, do the following:
def getSortedContain(s,ifile,ofile):
    # strip the trailing newline from each line before testing and sorting
    words = [line.rstrip("\n") for line in ifile if s in line]
    words.sort()
    ofile.write("\n".join(words))
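The same "build a list, then sort it" pattern could cover all three functions from the question. The sketch below is one possible reading of the requirements: filter_and_sort is a hypothetical helper, io.StringIO stands in for the real files, and sorting by the reversed word (so the last character is compared first) is an assumption about what the "rhyme" ordering should be.

```python
import io

def filter_and_sort(ifile, ofile, keep, key=None, reverse=False):
    """Write the words accepted by keep() to ofile, one per line, in sorted order."""
    words = [line.rstrip("\n") for line in ifile if line.strip()]
    selected = sorted((w for w in words if keep(w)), key=key, reverse=reverse)
    ofile.write("\n".join(selected))

source = "banana\napple\ncherry\nfig\nkiwi\nlime\npear\nplum\n"  # made-up word list

contain = io.StringIO()
filter_and_sort(io.StringIO(source), contain, keep=lambda w: "a" in w)

not_contain = io.StringIO()
filter_and_sort(io.StringIO(source), not_contain,
                keep=lambda w: "a" not in w, reverse=True)

rhyme = io.StringIO()
filter_and_sort(io.StringIO(source), rhyme,
                keep=lambda w: len(w) == 4, key=lambda w: w[::-1])

print(contain.getvalue().split("\n"))      # ['apple', 'banana', 'pear']
print(not_contain.getvalue().split("\n"))  # ['plum', 'lime', 'kiwi', 'fig', 'cherry']
print(rhyme.getvalue().split("\n"))        # ['lime', 'kiwi', 'plum', 'pear']
```

Passing the filter and the sort key as arguments keeps the three variants from drifting apart; with real files, the ifile and ofile arguments would be the objects returned by open().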

10 most frequent words in a string Python

I need to display the 10 most frequent words in a text file, from the most frequent to the least as well as the number of times it has been used. I can't use the dictionary or counter function. So far I have this:
import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you

The above problem can be solved easily with Python's collections module.
Below is the solution.
from collections import Counter
data_set = "Welcome to the world of Geeks " \
    "This portal has been created to provide well written well " \
    "thought and well explained solutions for selected questions " \
    "If you like Geeks for Geeks and would like to contribute " \
    "here is your chance You can write article and mail your article " \
    " to contribute at geeksforgeeks org See your article appearing on " \
    "the Geeks for Geeks main page and help thousands of other Geeks. "
# split() returns list of all the words in the string
split_it = data_set.split()
# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
# print(Counters_found)
# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)

You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()
# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)
# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0 # Initialize the count to zero.
    for word in words: # Iterate over the words.
        if word == unique: # Is this word equal to the current unique?
            count += 1 # If so, increment the count
    counts.append((count, unique))
counts.sort() # Sorting the list puts the lowest counts first.
counts.reverse() # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
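If the nested loops turn out to be too slow, one dictionary-free alternative (a different technique from the trie mentioned above) is to sort the words first, so that equal words sit next to each other, and then count each run. This is only a sketch; top_words and the sample sentence are made up for illustration.

```python
def top_words(words, n=10):
    """Return up to n (count, word) pairs, most frequent first, without dict or Counter."""
    ordered = sorted(words)  # equal words become adjacent
    counts = []
    i = 0
    while i < len(ordered):
        j = i
        # Advance j to the end of the run of words equal to ordered[i].
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        counts.append((j - i, ordered[i]))
        i = j
    # Highest count first; ties are broken alphabetically.
    counts.sort(key=lambda pair: (-pair[0], pair[1]))
    return counts[:n]

text = "the cat and the dog and the bird"
for count, word in top_words(text.lower().split(), 3):
    print('%s %d' % (word, count))
# the 3
# and 2
# bird 1
```

Sorting costs O(n log n) comparisons, and the counting pass visits each word once.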

from string import punctuation #you will need it to strip the punctuation
import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower() #the The or you You counted only once
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k,]
        #now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1
#and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]

import urllib
import operator
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile) # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace()) # removes everything that's not alphanumeric or spaces.
word_counter = {}
for word in txtFile.split(" "): # split in every space.
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter: # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1 # if 'word' already in word_counter, increment it by 1
for i,word in enumerate(sorted(word_counter,key=word_counter.get,reverse=True)[:10]):
    # sorts the dict by the values, from top to bottom, takes the 10 top items,
    print "%s: %s - %s"%(i+1,word,word_counter[word])
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces are counted. It doesn't matter that much, though.

Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
import collections
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
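For the clean-up step, a sketch along these lines may help. It assumes Python 3 (str.casefold does not exist in Python 2), and the normalise helper and the sample sentence are made up for illustration.

```python
import string

def normalise(words):
    """Strip surrounding punctuation and casefold each word; drop empty results."""
    cleaned = []
    for word in words:
        word = word.strip(string.punctuation).casefold()
        if word:  # a token made only of punctuation becomes empty
            cleaned.append(word)
    return cleaned

def counter(iterable):
    """Count occurrences of each element in a plain dict."""
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

text = 'Alice said: "Hello!" And Alice, smiling, said hello again.'
counts = counter(normalise(text.split()))
# Sort the (word, count) pairs by count, highest first, and keep the top three.
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
print(top)
```

Note that str.strip only removes punctuation from the ends of a word, so an apostrophe inside a word such as "Alice's" is kept.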

You can also do it with pandas DataFrames and get the result in a convenient form, as a table of words and their frequencies, ordered by frequency.
import pandas as pn
def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(words_df["word"].unique())
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in words_df_unique["unique"].tolist():
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i+=1
    res = words_df_unique.sort_values('count', ascending = False)
    return(res)

To do the same operation on a pandas DataFrame, you can use the Counter class from collections:
from collections import Counter
cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1
# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)
