I have a dictionary full of key-value pairs where the key is a word I want to search for in a string and the value is what I want to replace it with. It needs to be able to preserve case as well. I'm stumbling over the logic in this circumstance.
I was thinking it would work to split the string up into a list of words, but I'm not sure if this would be the simplest way.
dict = {'my':'your', 'dog':'cat'}
string = 'My dog is named Jeffrey.'
I'd like to substitute the values in for the keys in the string, but maintain case and punctuation.
You can use re.sub with the (?i) flag to make the substitution case-insensitive. The hard part is knowing which letters should be capitalized, because a key and its replacement may have different lengths, so I applied a simple rule: the first letter of the phrase is capitalized, using the capitalize method. Note that capitalize also lowercases every other character, so "Jeffrey" comes out as "jeffrey".
import re
dict = {'my':'your', 'dog':'cat'}
inputString = 'My dog is named Jeffrey.'
for key in dict.keys():
    inputString = re.sub("(?i)"+key,dict[key],inputString)
inputString = inputString.capitalize()
print(inputString)
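If the case of each replaced word should follow the word it replaces, one possible approach is a replacement function that copies the capitalisation style of the match. This is only a sketch: the helper names match_case and replace_preserving_case are made up for illustration, and it assumes the dictionary keys and values are lowercase.

```python
import re

replacements = {'my': 'your', 'dog': 'cat'}

def match_case(source, target):
    """Copy the capitalisation style of source onto target (hypothetical helper)."""
    if source.isupper():
        return target.upper()
    if source[0].isupper():
        return target.capitalize()
    return target

# One alternation of all keys, bounded by \b so "dogma" is not touched.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in replacements) + r")\b",
    re.IGNORECASE,
)

def replace_preserving_case(text):
    # Look the match up by its lowercase form, then restore its case style.
    return pattern.sub(
        lambda m: match_case(m.group(0), replacements[m.group(0).lower()]),
        text,
    )

print(replace_preserving_case('My dog is named Jeffrey.'))
# Your cat is named Jeffrey.
```

Because only the matched words are rewritten, the rest of the string, including "Jeffrey" and the punctuation, is left as it was.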
Related
I have a dictionary like this:
id_dict = {'C1001': 'John','D205': 'Ben','501': 'Rose'}
This dictionary has more than 10000 keys and values. I have to search for the key from a report which has nearly 500 words and replace with values.
I have to process thousands of reports within a few minutes, so speed and memory are really important for me.
This is the code I am using now:
str = "strings in the reports"
for key, value in id_dict.items():  # iteritems() on Python 2
    str = str.replace(key, value)
Is there any better solution than this?
Using str.replace in a loop is very inefficient, for a few reasons:
Each time a word is replaced, a new string is allocated and the old one is discarded. With a lot of words, this can take ages.
str.replace also replaces inside words, which is probably not what you want. For example, replacing "nut" with "eel" changes "donut" to "doeel".
If there are a lot of words in your replacement dictionary, you loop through all of them (in a Python loop, which is rather slow), even if the text contains none of them.
I would use re.sub with a replacement function (here a lambda), matching each alphanumeric word (letters or digits) delimited by word boundaries.
The lambda looks the word up in the dictionary and returns the replacement if found; otherwise it returns the original word, so nothing changes. Since the scanning is done inside the re module, it executes much faster.
import re
id_dict = {'C1001': 'John','D205': 'Ben','501': 'Rose'}
s = "Hello C1001, My name is D205, not X501"
result = re.sub(r"\b(\w+)\b",lambda m : id_dict.get(m.group(1),m.group(1)),s)
print(result)
prints:
Hello John, My name is Ben, not X501
(note that the last word was left unreplaced because it's only a partial match)
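Since thousands of reports have to be processed, it may help to compile the pattern once and reuse it for every report. A minimal sketch, where the function name replace_ids and the sample reports are only illustrative:

```python
import re

id_dict = {'C1001': 'John', 'D205': 'Ben', '501': 'Rose'}

# Compiled once, reused for every report.
WORD = re.compile(r"\b\w+\b")

def replace_ids(report, mapping=id_dict):
    # Each word costs one dictionary lookup; unknown words are returned unchanged.
    return WORD.sub(lambda m: mapping.get(m.group(0), m.group(0)), report)

reports = [
    "Hello C1001, My name is D205, not X501",
    "501 met C1001",
]
results = [replace_ids(report) for report in reports]
print(results)
```

The cost per report then depends on the number of words in the report rather than on the number of keys in the dictionary.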
I have a word list (a library of correctly spelled words) and a text that contains spelling errors (typos). I want to correct the misspelled words so that they match the words in the list.
For example, given this word list:
listOfWord = [...,"halo","saya","sedangkan","semangat","cemooh"..];
and this string:
string = "haaallllllooo ssya sdngkan ceemoooh , smngat semoga menyenangkan"
I want to correct the spelling errors, like this:
string = "halo saya sedangkan cemooh, semangat semoga menyenangkan"
What is the best algorithm for checking each word against the list? I have millions of words in the list, and there are many possibilities.
It depends on how your data is stored, but you'll probably want to use a pattern matching algorithm like Aho–Corasick. Of course, that assumes your input data structure is a trie. A trie is a very space-efficient storage container for words that may also be of interest to you (again, depending on your environment).
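To make the trie idea concrete, here is a minimal dictionary-based sketch. The class and method names are illustrative, not from any library, and a production trie for millions of words would likely use a more compact representation.

```python
class Trie:
    """Minimal trie: nested dicts keyed by character (illustrative sketch)."""

    END = "$end"  # marker key; longer than one character, so it never clashes with a letter

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def _walk(self, prefix):
        # Follow prefix character by character; None if the path does not exist.
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and self.END in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["halo", "saya", "sedangkan", "semangat", "cemooh"]:
    trie.insert(w)

print(trie.contains("saya"))    # True
print(trie.contains("say"))     # False
print(trie.has_prefix("sema"))  # True
```

Words that share a prefix, such as "sedangkan" and "semangat", share the nodes for that prefix, which is where the space saving comes from.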
You can use difflib's get_close_matches, though it is not that efficient.
import difflib

words = ["halo","saya","sedangkan","semangat","cemooh"]

def get_exact_words(input_str):
    # Return the closest word from the list, or the input unchanged if nothing is close enough
    exact_words = difflib.get_close_matches(input_str,words,n=1,cutoff=0.7)
    if len(exact_words)>0:
        return exact_words[0]
    else:
        return input_str

string = "haaallllllooo ssya sdngkan ceemoooh , smngat semoga menyenangkan"
string = string.split(' ')
exact = [get_exact_words(word) for word in string]
exact = ' '.join(exact)
print(exact)
Output :
With difflib
haaallllllooo saya sedangkan cemooh , semangat semangat menyenangkan
I am assuming you are writing a spell checker for some language.
You might want to tokenize the sentence into words.
Then shorten words like haaallllllooo to haalloo, assuming the language doesn't often have words with many repeated letters. This is easy to check, since you have the dictionary.
Then you can use this algorithm/implementation by Peter Norvig. All you have to do is replace his dictionary of correct words with your own.
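The shortening step can be sketched with a backreference in a regular expression. The function name squeeze_repeats and the max_run parameter are assumptions made for this example; the default of 2 matches the haaallllllooo to haalloo example above.

```python
import re

def squeeze_repeats(word, max_run=2):
    # Any run of the same character longer than max_run is cut down to max_run copies.
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)

sentence = "haaallllllooo ssya sdngkan ceemoooh , smngat semoga menyenangkan"
tokens = sentence.split()  # naive whitespace tokenization
squeezed = [squeeze_repeats(token) for token in tokens]
print(squeezed[0])  # haalloo
print(squeezed[3])  # ceemooh
```

The squeezed tokens are still misspelled, but they are much closer to real words, which makes the later correction step cheaper.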
You can use hashing techniques to check for the correct pattern, something along the lines of the Rabin-Karp algorithm.
You know the hash value of each original string in the list. For spell correction, you can try the combinations of words that give the same hash value before matching them against the original strings in the dictionary. This would still require iterating through all the characters of the misspelled words, but only once, so it will be efficient.
You can use pyenchant to check spelling with your list of words.
>>> import enchant
>>> d = enchant.request_pwl_dict("mywords.txt")
>>> d.check('helo')
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
You need to split your text into words and check them one by one; if check returns False, replace the word with the first suggestion.
There are more advanced features in the tutorial here.
http://pyenchant.readthedocs.io/en/latest/tutorial.html
I think you should apply a string distance algorithm to each word to find its nearest match in the list. Those are mostly O(n) algorithms, so in the end your sentence replacement would cost you O(n) at most.
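As one example of a string distance, here is a sketch of the classic Levenshtein (edit) distance, with a brute-force search for the nearest word. The names levenshtein and nearest_word are illustrative. Note that this dynamic-programming version takes time proportional to the product of the two word lengths per comparison, and the brute-force search compares against every word, so for millions of words some pruning or indexing would be needed.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def nearest_word(word, vocabulary):
    # Brute force: compare against every candidate and keep the closest.
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))

words = ["halo", "saya", "sedangkan", "semangat", "cemooh"]
print(levenshtein("kitten", "sitting"))  # 3
print(nearest_word("ssya", words))       # saya
print(nearest_word("sdngkan", words))    # sedangkan
```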
I'm trying to complete a basic assignment. The program should allow a user to type in a phrase. If the phrase contains the word "happy" or "sad", that word should then be randomly replaced by a synonym (stored in a dictionary). The new phrase should then be printed out. What am I doing wrong? Every time I try to run it, the program crashes. This is the error I get:
0_part1.py", line 13, in <module>
phrase["happy"] = random.choice(thesaurus["happy"])
TypeError: 'str' object does not support item assignment
Here is what I have so far:
import random
thesaurus = {
    "happy": ["glad", "blissful", "ecstatic", "at ease"],
    "sad": ["bleak", "blue", "depressed"]
}
phrase = input("Enter a phrase: ")
phrase2 = phrase.split(' ')
if "happy" in phrase:
    phrase["happy"] = random.choice(thesaurus["happy"])
if "sad" in phrase:
    phrase["sad"] = random.choice(thesaurus["sad"])
print(phrase)
The reason for your error is that phrase is a string, and strings are immutable. On top of that, strings are sequences, not mappings; you can index them or slice them (e.g., happy_index = phrase.find("happy"); phrase[happy_index:happy_index+len("happy")]), but you can't use them like dictionaries.
If you want to create a new string with the substring happy replaced by another word, use the replace method.
And there's no reason to check first; if happy isn't found, replace will do nothing.
So:
phrase = phrase.replace("happy", random.choice(thesaurus["happy"]))
While we're at it, instead of explicitly looking up each key, you may want to loop over the dictionary and apply all the synonyms:
for key, replacements in thesaurus.items():
    phrase = phrase.replace(key, random.choice(replacements))
Finally, notice that this code will replace all instances of happy with the same replacement, which I think your original code was also trying to do. If you want to replace each instance with a separately, randomly chosen synonym, that's a bit more complicated. You could loop over phrase.find("happy", offset) until it returns -1, but a neat trick might make it simpler: split the string around each instance of happy, substitute in a different synonym for each split part, then join them all back together. Like this:
parts = phrase.split("happy")
parts[:-1] = [part + random.choice(thesaurus["happy"]) for part in parts[:-1]]
phrase = ''.join(parts)
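Another way to get an independent random choice for every occurrence, shown here only as a sketch, is re.sub with a replacement function, since the function is called once per match. The function name replace_with_synonyms is made up for this example, and the \b word boundaries are an added assumption so that words such as "unhappy" are left alone.

```python
import random
import re

thesaurus = {
    "happy": ["glad", "blissful", "ecstatic", "at ease"],
    "sad": ["bleak", "blue", "depressed"],
}

# Matches any key of the thesaurus as a whole word.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, thesaurus)) + r")\b")

def replace_with_synonyms(phrase):
    # The lambda runs once per match, so each occurrence gets its own random pick.
    return pattern.sub(lambda m: random.choice(thesaurus[m.group(0)]), phrase)

print(replace_with_synonyms("I am happy, then sad, then happy again"))
```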
Generate a random number from (0..[size of list - 1]). Then, access that index of the list. To get the length of a list, just do len(list_name).
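Spelled out in Python, that suggestion might look like the following sketch. The list name synonyms is just an example.

```python
import random

synonyms = ["glad", "blissful", "ecstatic", "at ease"]

# randrange(n) returns an integer from 0 to n - 1, which is a valid index.
index = random.randrange(len(synonyms))
word = synonyms[index]
print(word)
```

random.choice(synonyms) does the same thing in a single call.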
I've got some homework involving Caesar cipher, and I got stuck here:
I need to write a function which gets a text (as a String) and a dictionary. The dictionary keys are the English ABC, and its values are other letters from the ABC.
My goal is to go over the text and, wherever there is a letter (only letters!), change it to the value that belongs to that letter in the dictionary.
edit: my function should return the deciphered text as a string.
You're looking for the translate method:
>>> u"abc".translate({ord('a'): u'x', ord('b'): u'y', ord('c'): u'z'})
'xyz'
Look at maketrans if you're using bytestrings or if your Python is older than 2.7.
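On Python 3, a letter-to-letter dictionary can be turned into a translation table with str.maketrans. The sketch below builds a hypothetical key with a shift of 3 purely for illustration; in the assignment the dictionary would be given. The function name apply_key is also made up.

```python
import string

# Hypothetical key: every letter maps to the letter 3 places later.
shift = 3
lower = string.ascii_lowercase
upper = string.ascii_uppercase
key = {c: lower[(i + shift) % 26] for i, c in enumerate(lower)}
key.update({c: upper[(i + shift) % 26] for i, c in enumerate(upper)})

def apply_key(text, mapping):
    # str.maketrans accepts a dict of single characters; characters
    # missing from the table (spaces, punctuation, digits) stay unchanged.
    return text.translate(str.maketrans(mapping))

print(apply_key("Hello, World!", key))  # Khoor, Zruog!
```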
A bit of pseudocode (language agnostic). You should be able to take it from here.
cipher = array
caesar_mask = [ A: G, ... , Z: F ]
for each letter_index in text
    cipher_letter = caesar_mask[text[letter_index]]
    cipher[] = cipher_letter
end
First question is if you have to do it in place.
Then I would look into these things:
list comprehension
map()
how to iterate through letters in string
how to join a sequence of letters to create string
how to replace characters in string
Not in any specific order and not necessarily all-inclusive.
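Putting a few of those pieces together (iterating over the letters, looking each one up, and joining the result), a minimal sketch could look like this. The function name substitute_letters and the tiny mapping are only examples.

```python
def substitute_letters(text, mapping):
    # mapping.get(ch, ch) returns the substitute for ch, or ch itself
    # when it is not a key (spaces, punctuation, digits).
    return "".join(mapping.get(ch, ch) for ch in text)

mapping = {'a': 'd', 'b': 'e', 'c': 'f'}  # illustrative three-letter key
print(substitute_letters("abc, cab!", mapping))  # def, fde!
```

Since strings are immutable, this builds and returns a new string instead of changing the text in place.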
I'm having problems using findall in Python.
I have a text such as:
the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched
So I'm trying to find all occurrences of those alphanumeric strings in the text. The thing is, I know they all have the "33e" prefix.
Right now, I have strings = re.findall(r"(33e+)+", stdout_value) but it doesn't work.
I want to be able to return 33e445a64b65, 33e5c44598e46
Try this:
>>> x="the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched"
>>> re.findall(r"33e\w+",x)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
Here's a slight variation:
>>> string = '''the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched'''
>>> re.findall(r"(33e[a-z0-9]+)", string)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
Instead of matching any word characters, it will only match digits and lowercase letters after the 33e -- that's what the [a-z0-9]+ means.
If you wanted to also match capital letters, you could replace that part with [a-zA-Z0-9]+ instead.
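For comparison, this sketch shows what the pattern from the question returns and why. In (33e+)+ the e+ only means "one or more letters e", so each match stops right after the prefix, and because the pattern contains a capturing group, findall returns the group's text. The variable names are illustrative, and the \b word boundaries in the second pattern are an optional addition.

```python
import re

text = ("the name of 33e4853h45y45 is one of the 33e445a64b65 "
        "and we want all the 33e5c44598e46 to be matched")

# Original attempt: matches only the prefix itself.
original = re.findall(r"(33e+)+", text)
print(original)  # ['33e', '33e', '33e']

# Prefix followed by letters or digits, as a whole word.
fixed = re.findall(r"\b33e[a-zA-Z0-9]+\b", text)
print(fixed)  # ['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
```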