I'm trying to replace the occurrence of a word with another:
word_list = {"ugh": "disappointed"}
tmp = ['laughing ugh']
for index, data in enumerate(tmp):
    for key, value in word_list.iteritems():
        if key in data:
            tmp[index] = data.replace(key, word_list[key])
print tmp
This works, but the occurrence of ugh inside laughing is also replaced, so the output is ladisappointeding disappointed.
How does one avoid this so that the output is laughing disappointed?
In that case, consider replacing word by word.
Example:
word_list = {"ugh": "disappointed"}
tmp = ['laughing ugh']
for t in tmp:
    words = t.split()
    for i in range(len(words)):
        if words[i] in word_list.keys():
            words[i] = word_list[words[i]]
    newline = " ".join(words)
    print(newline)
Output:
laughing disappointed
Step-by-step explanation:
Get every sentence in the tmp list:
for t in tmp:
Split the sentence into words:
    words = t.split()
Check whether each word in words is a key in word_list. If it is, replace it with the corresponding value:
    for i in range(len(words)):
        if words[i] in word_list.keys():
            words[i] = word_list[words[i]]
Rejoin the replaced words and print the result:
    newline = " ".join(words)
    print(newline)
You can do this by using a RegEx:
>>> import re
>>> re.sub(r'\bugh\b', 'disappointed', 'laughing ugh')
'laughing disappointed'
The \b stands for a word boundary.
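To make the boundary rule concrete, here is a small sketch (assuming Python 3; the sample strings are made up for illustration). \b matches only where a word character (\w) meets a non-word character or the edge of the string, so the ugh inside laughing is left alone while an ugh next to punctuation still matches.

```python
import re

pattern = re.compile(r'\bugh\b')

# "ugh" inside "laughing" has word characters on both sides, so there is no boundary.
inside_word = pattern.findall('laughing')

# Punctuation and the edges of the string both count as boundaries.
standalone = pattern.sub('disappointed', 'ugh, laughing (ugh)!')

print(inside_word)
print(standalone)
```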
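Use re.sub (this needs import re; re.escape protects keys that contain regex metacharacters):
for index in range(len(tmp)):
    for key, value in word_list.items():
        tmp[index] = re.sub(r"\b{}\b".format(re.escape(key)), value, tmp[index])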
word_list = {"ugh": "disappointed", "123": "lol"}
tmp = ['laughing 123 ugh']
for word in tmp:
    words = word.split()
    for i in words[:]:
        if i in word_list.keys():
            replace_value = word_list.get(i)
            words[words.index(i)] = replace_value
    output = " ".join(words)
    print(output)
This code swaps every whole word that is a key of the dict (the word you want to replace) with that key's value (the word you want instead), and it works with multiple entries.
Output:
laughing lol disappointed
Hope that helps!
You can use regular expressions:
import re
for index, data in enumerate(tmp):
    for key, value in word_list.items():
        if key in data:
            # Raw strings are required here: in a normal string '\b' is a backspace character.
            pattern = r'\b' + key + r'\b'
            data = re.sub(pattern, value, data)
            tmp[index] = data
Side note: you need the data = ... line (to overwrite the data variable); otherwise the code works incorrectly when word_list contains multiple entries.
Fast:
>>> [re.sub(r'\w+', lambda m: word_list.get(m.group(), m.group()), t)
...  for t in tmp]
['laughing disappointed']
Very Fast:
>>> [re.sub(r'\b(?:%s)\b' % '|'.join(word_list.keys()), lambda m: word_list.get(m.group(), m.group()), t)
...  for t in tmp]
['laughing disappointed']
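The alternation approach can be wrapped in a reusable helper. This is only a sketch under a few assumptions: Python 3, the hypothetical name replace_words, and keys that start and end with a word character (otherwise \b will not behave as intended). re.escape guards against keys that contain regex metacharacters, and sorting the keys longest-first is a precaution so a shorter key cannot shadow a longer one that shares its prefix.

```python
import re

def replace_words(text, mapping):
    """Replace whole-word occurrences of each key in mapping (hypothetical helper)."""
    if not mapping:
        return text
    # Longest keys first, each escaped so it is matched literally.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r'\b(?:%s)\b' % '|'.join(re.escape(k) for k in keys))
    # Every match is guaranteed to be one of the keys, so a direct lookup is safe.
    return pattern.sub(lambda m: mapping[m.group()], text)

word_list = {"ugh": "disappointed", "123": "lol"}
result = [replace_words(t, word_list) for t in ['laughing 123 ugh']]
print(result)
```

Because the pattern is built once per call and applied in a single pass, a replacement value is never re-scanned for further keys.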
I have a Python challenge: given a string with '_' or '-' between each word, such as the_big_red_apple or the-big-red-apple, convert it to camel case. If the first word is capitalized, it should stay capitalized. This is my code. I'm not allowed to use the re library in the challenge, but I didn't know how else to do it.
from re import sub
def to_camel_case(text):
    if text[0].isupper():
        text = sub(r"(_|-)+"," ", text).title().replace(" ", "")
    else:
        text = sub(r"(_|-)+"," ", text).title().replace(" ", "")
        text = text[0].lower() + text[1:]
    return print(text)
Word delimiters can be - dash or _ underscore.
Let's simplify, making them all underscores:
text = text.replace('-', '_')
Now we can break out words:
words = text.split('_')
With that in hand it's simple to put them back together:
text = ''.join(map(str.capitalize, words))
or more verbosely, with a generator expression,
assign ''.join(word.capitalize() for word in words).
I leave "finesse the 1st character"
as an exercise to the reader.
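For readers who want to check their answer to that exercise, here is one possible way to put the pieces together without re. It is a sketch, not the only solution, and it assumes Python 3 and that the challenge wants the first word left exactly as it was given.

```python
def to_camel_case(text):
    """Convert dash/underscore-delimited text to camel case without using re."""
    words = text.replace('-', '_').split('_')
    # Keep the first word exactly as given; capitalize every later word.
    first, rest = words[0], words[1:]
    return first + ''.join(word.capitalize() for word in rest)

print(to_camel_case('the_big_red_apple'))
print(to_camel_case('The-big-red-apple'))
```

Note that str.capitalize also lowercases the rest of each word, which may or may not be what the challenge expects for mixed-case input.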
If you RTFM you'll find it contains a wealth of knowledge.
https://docs.python.org/3/library/re.html#raw-string-notation
'+'
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s
The effect of + is to turn both
db_rows_read and
db__rows_read
into DbRowsRead.
Also,
Raw string notation (r"text") keeps regular expressions sane.
The regex in your question doesn't exactly
need a raw string, as it has no crazy
punctuation like \ backwhacks.
But it's a very good habit to always put
a regex in an r-string, Just In Case.
You never know when code maintenance
will tack on additional elements,
and who wants a subtle regex bug on their hands?
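A concrete case of such a subtle bug, sketched in Python 3 with made-up sample text: without the r prefix, \b is turned into a backspace character before the regex engine ever sees it, so a word-boundary pattern silently matches nothing.

```python
import re

plain = '\b'    # one character: ASCII backspace (0x08)
raw = r'\b'     # two characters: a backslash and 'b', which re reads as a word boundary

text = 'laughing ugh'
with_plain = re.sub('\bugh\b', 'X', text)   # searches for backspace characters: no match
with_raw = re.sub(r'\bugh\b', 'X', text)    # matches the standalone word

print(len(plain), len(raw))
print(with_plain)
print(with_raw)
```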
You can try it like this:
def to_camel_case(text):
    s = text.replace("-", " ").replace("_", " ")
    s = s.split()
    if not s:  # empty input, or nothing but delimiters
        return text
    return s[0] + ''.join(i.capitalize() for i in s[1:])
print(to_camel_case('momo_es-es'))
The output of print(to_camel_case('momo_es-es')) is momoEsEs.
r"..." is a raw string in Python, which simply means a backslash \ is treated as a literal character instead of an escape character.
(_|-)+ is a regular expression that matches one or more consecutive - or _ characters.
(_|-) matches a single - or _.
+ means the preceding element (- or _) occurs one or more times in a row.
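The effect of the + quantifier is easiest to see side by side. This is a small sketch in Python 3; the helper name collapse is made up for illustration, and the sample strings are the db_rows_read examples used earlier in this thread.

```python
import re

def collapse(text):
    # One or more consecutive delimiters become a single space.
    return re.sub(r"(_|-)+", " ", text)

single = collapse("db_rows_read").title().replace(" ", "")
double = collapse("db__rows_read").title().replace(" ", "")

# Without +, each delimiter is replaced separately, leaving two spaces behind.
without_plus = re.sub(r"(_|-)", " ", "db__rows_read")

print(single, double)
print(repr(without_plus))
```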
In case you cannot use the re library for this solution:
def to_camel_case(text):
    # Since there are 2 possible delimiters, let's simplify to one.
    # In this case, I replace all `_` characters with `-`, to make sure we have only one delimiter.
    text = text.replace("_", "-")  # the_big-red_apple => the-big-red-apple
    # Next, we split the text into words so that we can iterate through and modify them later.
    words = text.split("-")  # the-big-red-apple => ["the", "big", "red", "apple"]
    # Now, for each word (except the first word) we have to turn its first character to uppercase.
    for i in range(1, len(words)):
        # `i` starts from 1, which means the first word IS NOT INCLUDED in this loop.
        word = words[i]
        # word[1:] means the rest of the characters except the first one
        # (e.g. w = "apple" => w[1:] = "pple")
        words[i] = word[0].upper() + word[1:].lower()
        # you can also use the Python built-in method for this:
        # words[i] = word.capitalize()
    # After this loop, ["the", "big", "red", "apple"] => ["the", "Big", "Red", "Apple"]
    # Finally, we put the words back together and return the result
    # ["the", "Big", "Red", "Apple"] => theBigRedApple
    return "".join(words)
print(to_camel_case("the_big-red_apple"))
Try this:
First, replace all the delimiters with a single one, i.e. str.replace('_', '-')
Split the string on the standardized delimiter, i.e. str.split('-')
Capitalize each string in the list, i.e. str.capitalize()
Join the capitalized strings with str.join
>>> s = "the_big_red_apple"
>>> s.replace('_', '-').split('-')
['the', 'big', 'red', 'apple']
>>> ''.join(map(str.capitalize, s.replace('_', '-').split('-')))
'TheBigRedApple'
>>> ''.join(word.capitalize() for word in s.replace('_', '-').split('-'))
'TheBigRedApple'
If you need to lowercase the first char, then:
>>> camel_mile = lambda x: x[0].lower() + x[1:]
>>> s = 'TheBigRedApple'
>>> camel_mile(s)
'theBigRedApple'
Alternative,
First replace all delimiters to space str.replace('_', ' ')
Titlecase the string str.title()
Remove space from string, i.e. str.replace(' ', '')
>>> s = "the_big_red_apple"
>>> s.replace('_', ' ').title().replace(' ', '')
'TheBigRedApple'
Another alternative,
Iterate through the characters while keeping track of the previous character, i.e. for prev, curr in zip(s, s[1:])
Check whether the previous character is one of your delimiters; if so, uppercase the current character, i.e. curr.upper() if prev in ['-', '_'] else curr
Skip the delimiter characters themselves, i.e. if curr not in ['-', '_']
Then add the first character in lowercase, [s[0].lower()]
>>> chars = [s[0].lower()] + [curr.upper() if prev in ['-', '_'] else curr for prev, curr in zip(s, s[1:]) if curr not in ['-', '_']]
>>> "".join(chars)
'theBigRedApple'
Yet another alternative,
Normalize all delimiters into a single one, s.replace('-', '_')
Convert the result into a list of chars, list(s.replace('-', '_'))
While there is still a '_' in the list of chars, keep
finding the position of the next '_'
replacing the character after '_' with its uppercase
replacing the '_' with ''
>>> s = 'the_big_red_apple'
>>> s_list = list(s.replace('-', '_'))
>>> while '_' in s_list:
...     where_underscore = s_list.index('_')
...     s_list[where_underscore+1] = s_list[where_underscore+1].upper()
...     s_list[where_underscore] = ""
...
>>> "".join(s_list)
'theBigRedApple'
or
>>> s = 'the_big_red_apple'
>>> s_list = list(s.replace('-', '_'))
>>> while '_' in s_list:
...     where_underscore = s_list.index('_')
...     s_list[where_underscore:where_underscore+2] = ["", s_list[where_underscore+1].upper()]
...
>>> "".join(s_list)
'theBigRedApple'
Note: Why do we need to convert the string to a list of chars? Because strings are immutable: a 'str' object does not support item assignment.
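A quick way to see that immutability in action, sketched in Python 3 with a shortened sample string: assigning to an index of a str raises TypeError, while the same edit on a list of characters works.

```python
s = 'the_big'
try:
    s[3] = ''              # strings do not support item assignment
except TypeError as exc:
    error = str(exc)

chars = list(s)            # a list of characters can be edited in place
chars[3] = ''              # drop the underscore
chars[4] = chars[4].upper()
result = ''.join(chars)

print(error)
print(result)
```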
BTW, the regex solution can make use of capturing groups, e.g.
>>> import re
>>> s = "the_big_red_apple"
>>> upper_regex_group = lambda x: x.group(1).upper()
>>> re.sub(r"[_-](\w)", upper_regex_group, s)
'theBigRedApple'
>>> re.sub(r"[_-](\w)", lambda x: x.group(1).upper(), s)
'theBigRedApple'
I'm trying to replace a word in Python, but another word that starts with the same letters also gets replaced.
Example:
initial : 'bg bgt'
goal : 'bang banget'
current result : 'bang bangt'
Here's what my code currently looks like:
def slangwords(kalimat):
    words = kalimat.split(' ')
    for word in words:
        if any(x in word for x in "bg"):
            kalimat = kalimat.replace("bg","bang")
        if any(x in word for x in "bgt"):
            kalimat = kalimat.replace("bgt","banget")
    return kalimat
print(slangwords('bg bgt'))
I'd also appreciate it if you could show me a more effective and efficient way to replace these slang words. Thanks.
That is because you replace bg before bgt (which is a longer string that contains it); you need to change the order.
Also, you don't need if any(x in word for x in "bg"): it checks whether any single letter of "bg" appears in the word, not whether the substring itself is present. In fact, you don't need any check before using str.replace; if the string isn't there, it does nothing.
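The difference between the letter-by-letter test and a substring test can be checked directly. A minimal sketch in Python 3, using 'bang' as a sample word:

```python
word = 'bang'

# Iterating over "bgt" yields the single letters 'b', 'g' and 't'.
letter_check = any(x in word for x in "bgt")   # True, because 'b' is in 'bang'

# A plain membership test looks for the contiguous substring instead.
substring_check = "bgt" in word                # False

print(letter_check, substring_check)
```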
You just need
def slangwords(kalimat):
    return kalimat.replace("bgt", "banget").replace("bg", "bang")
Better and not order-dependent
Use a dictionary, and replace each word with its substitute
def slangwords(kalimat):
    replacements = {
        'bg': 'bang',
        'bgt': 'banget'
    }
    words = kalimat.split(' ')
    for i, word in enumerate(words):
        words[i] = replacements.get(word, word)
    return " ".join(words)
This is surely a classic case for utilising a dictionary - e.g.
D = {'bg': 'bang', 'bgt': 'banget'}
def slangwords(sentence):
    # Replace whole words only; sentence.replace() would also hit the "bg" inside "bgt".
    return ' '.join(D.get(word, word) for word in sentence.split())
if __name__ == '__main__':
    print(slangwords('bg bgt'))
In this way the slangwords() function doesn't change - all you have to do is extend your dictionary
You just need to reverse the order of the replacements:
def slangwords(kalimat):
    words = kalimat.split(' ')
    for word in words:
        if any(x in word for x in "bgt"):
            kalimat = kalimat.replace("bgt","banget")
        if any(x in word for x in "bg"):
            kalimat = kalimat.replace("bg","bang")
    return kalimat
print(slangwords('bg bgt'))
If you would like to replace many words, you can put them in a dictionary (note that the order matters here as well):
replace_words = { 'bgt' :'banget', 'bg': 'bang'}
def slangwords(kalimat):
    words = kalimat.split(' ')
    for word in words:
        for wrd, repl in replace_words.items():
            kalimat = kalimat.replace(wrd, repl)
    return kalimat
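If relying on the dictionary's insertion order feels fragile, one option is to sort the keys longest-first before replacing. This is a sketch, assuming Python 3 and the hypothetical name slangwords_sorted. It still does substring replacement, so a key that happens to appear inside an unrelated longer word would be replaced too.

```python
replace_words = {'bg': 'bang', 'bgt': 'banget'}  # deliberately in the "wrong" order

def slangwords_sorted(kalimat):
    # Longest keys first, so 'bgt' is handled before its prefix 'bg'.
    for wrd in sorted(replace_words, key=len, reverse=True):
        kalimat = kalimat.replace(wrd, replace_words[wrd])
    return kalimat

result = slangwords_sorted('bg bgt')
print(result)
```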
Here is another approach. You just need a dictionary of the substitutions. Using get lets you set a default value if the key is missing (here, the word itself).
def slangwords(kalimat, sub={'bg': 'bang', 'bgt': 'banget'}):
    return ' '.join([sub.get(w, w) for w in kalimat.split(' ')])
>>> slangwords('bg bgt abc')
'bang banget abc'
I tried matching words that contain the letters "ab" or "ba", e.g. "ab"olition, f"ab"rics, pro"ba"ble. I came up with the following regular expression:
r"[Aa](?=[Bb])[Bb]|[Bb](?=[Aa])[Aa]"
But the matches include words that start or end with non-alphanumeric characters such as ", (, ), /. How can I strip those? I just want a list of the matching words.
import sys
import re
word=[]
dict={}
f = open('C:/Python27/brown_half.txt', 'rU')
w = open('C:/Python27/brown_halfout.txt', 'w')
data = f.read()
word = data.split() # word is list
f.close()
for num2 in word:
    match2 = re.findall("\w*(ab|ba)\w*", num2)
    if match2:
        dict[num2] = (dict[num2] + 1) if num2 in dict.keys() else 1
for key2 in sorted(dict.iterkeys()):print "%s: %s" % (key2, dict[key2])
print len(dict.keys())
Here, I don't know how to mix it up with "re.compile~~" method that 1st comment said...
To match all the words with ab or ba (case insensitive):
import re
text = 'fabh, obar! (Abtt) yybA, kk'
pattern = re.compile(r"(\w*(ab|ba)\w*)", re.IGNORECASE)
# to print all the matches
for match in pattern.finditer(text):
    print(match.group(0))
# to print the first match
print(pattern.search(text).group(0))
https://regex101.com/r/uH3xM9/1
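One detail worth knowing if you use re.findall (as in the question) rather than finditer: when the pattern contains a capturing group, findall returns the group's text rather than the whole match. The sketch below, in Python 3 and reusing the sample text from this answer, shows the difference; the variable names are only illustrative.

```python
import re

text = 'fabh, obar! (Abtt) yybA, kk'

# With a capturing group, findall returns only what the group captured.
groups_only = re.findall(r"\w*(ab|ba)\w*", text, re.IGNORECASE)

# With a non-capturing group (?:...), findall returns the whole matched word.
whole_words = re.findall(r"\w*(?:ab|ba)\w*", text, re.IGNORECASE)

print(groups_only)
print(whole_words)
```

Since \w never matches punctuation, the surrounding commas and parentheses are excluded from the matched words automatically.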
Regular expressions are not the best tool for the job in this case. They'll complicate stuff way too much for such simple circumstances. You can instead use Python's builtin in operator (works for both Python 2 and 3)...
sentence = "There are no probable situations whereby that may happen, or so it seems since the Abolition."
words = [''.join(filter(lambda x: x.isalpha(), token)) for token in sentence.split()]
for word in words:
    word = word.lower()
    if 'ab' in word or 'ba' in word:
        print('Word "{}" matches pattern!'.format(word))
As you can see, 'ab' in word evaluates to True if the string 'ab' is found as-is (that is, exactly) in word, and False otherwise. For example, 'ba' in 'probable' is True and 'ab' in 'Abolition' is False. The second line takes care of dividing the sentence into words and stripping any punctuation characters. word = word.lower() makes word lowercase before the comparisons, so that for word = 'Abolition', 'ab' in word is True.
I would do it this way:
Strip unwanted characters from your string using either of the two
techniques below, your choice:
a - By building a translation dictionary and using the translate method:
>>> import string
>>> del_punc = dict.fromkeys(ord(c) for c in string.punctuation)
>>> s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = s.translate(del_punc)
>>> print(s)
abolition fabrics probable test case bank halfback 1ablution
b - Using the re.sub method (re.escape keeps characters such as ], \ and ^ from being misread inside the character class):
>>> import string
>>> import re
>>> s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = re.sub(r'[%s]' % re.escape(string.punctuation), '', s)
>>> print(s)
abolition fabrics probable test case bank halfback 1ablution
Next will be finding your words containing 'ab' or 'ba':
a - Splitting over whitespaces and finding occurrences of your desired strings, which is the one I recommend to you:
>>> [x for x in s.split() if 'ab' in x.lower() or 'ba' in x.lower()]
['abolition', 'fabrics', 'probable', 'bank', 'halfback', '1ablution']
b -Using re.finditer method:
>>> pat
re.compile('\\b.*?(ab|ba).*?\\b', re.IGNORECASE)
>>> for m in pat.finditer(s):
...     print(m.group())
...
abolition
fabrics
probable
test case bank
halfback
1ablution
string = "your string here"
lowercase = string.lower()
if 'ab' in lowercase or 'ba' in lowercase:
    print(True)
else:
    print(False)
Try this one
[(),/]*([a-z]|(ba|ab))+[(),/]*
I'm trying to write a simple program that removes all words containing digits from a received string.
Here is my current implementation:
import re
def checkio(text):
    text = text.replace(",", " ").replace(".", " ") .replace("!", " ").replace("?", " ").lower()
    counter = 0
    words = text.split()
    print words
    for each in words:
        if bool(re.search(r'\d', each)):
            words.remove(each)
    print words
checkio("1a4 4ad, d89dfsfaj.")
However, when I execute this program, I get the following output:
['1a4', '4ad', 'd89dfsfaj']
['4ad']
I can't figure out why '4ad' is printed in the second line as it contains digits and should have been removed from the list. Any ideas?
Assuming that your regular expression does what you want, you can do this to avoid removing items while iterating.
import re
def checkio(text):
    text = re.sub(r'[,.?!]', ' ', text).lower()
    words = [w for w in text.split() if not re.search(r'\d', w)]
    print(words)  # prints [] in this case
Also, note that I simplified your text = text.replace(...) line.
Additionally, if you do not need to reuse your text variable, you can use a regex to split it directly.
import re
def checkio(text):
    # Include whitespace in the split pattern so space-separated words are split too.
    words = [w for w in re.split(r'[\s,.?!]+', text.lower()) if w and not re.search(r'\d', w)]
    print(words)  # prints [] in this case
If you only want to keep words made up entirely of letters, why not use isalpha() instead of a regex? (isalnum() would not help here, since it also accepts digits.)
In [1695]: x = ['1a4', '4ad', 'd89dfsfaj']
In [1696]: [word for word in x if word.isalpha()]
Out[1696]: []
This is possible using re.sub, re.search and a list comprehension.
>>> import re
>>> def checkio(s):
...     print([i for i in re.sub(r'[.,!?]', '', s.lower()).split() if not re.search(r'\d', i)])
...
>>> checkio("1a4 4ad, d89dfsfaj.")
[]
>>> checkio("1a4 ?ad, d89dfsfaj.")
['ad']
What happens here is that you are modifying the list while iterating over it: you delete an element while traversing the list.
At the first iteration we have words = ['1a4', '4ad', 'd89dfsfaj']. Since '1a4' contains a digit, we remove it.
Now words = ['4ad', 'd89dfsfaj']. At the second iteration, however, the loop moves on to index 1, so the current word is 'd89dfsfaj', and we remove it. '4ad' is skipped because it has shifted to index 0, which the loop has already passed.
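The skipping described above can be reproduced in a few lines. The sketch below (Python 3, variable names chosen for illustration) first repeats the problem, then shows one common workaround: iterate over a shallow copy and remove from the original.

```python
import re

# Removing from the list being iterated: '4ad' is skipped.
words = ['1a4', '4ad', 'd89dfsfaj']
for each in words:
    if re.search(r'\d', each):
        words.remove(each)      # later items shift left while the loop index moves on
skipped = list(words)

# Iterating over a copy (words[:]) while mutating the original avoids the skip.
words = ['1a4', '4ad', 'd89dfsfaj']
for each in words[:]:
    if re.search(r'\d', each):
        words.remove(each)
fixed = list(words)

print(skipped)
print(fixed)
```

Building a new list with a list comprehension is usually simpler still, since it avoids mutation altogether.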
I have some random string, let's say :
s = "This string has some verylongwordsneededtosplit"
I'm trying to write a function trunc_string(string, len) that takes the string to operate on and len, the number of characters after which long words are split.
The result should look something like this:
str = trunc_string(s, 10)
str = "This string has some verylongwo rdsneededt osplit"
For now I have something like this :
def truncate_long_words(s, num):
    """Splits long words in string"""
    words = s.split()
    for word in words:
        if len(word) > num:
            split_words = list(word)
After this part I have the long word as a list of chars. Now I need to:
join 'num' chars together in some word_part temporary list
join all word_parts into one word
join this word with the rest of the words that weren't long enough to be split.
Should I do it in a similar way to this?
counter = 0
for char in split_words:
    word_part.append(char)
    counter = counter+1
    if counter == num
And here I should somehow join all the word_part pieces together to create the word, and so on.
def split_word(word, length=10):
    return (word[n:n+length] for n in range(0, len(word), length))
string = "This string has some verylongwordsneededtosplit"
print([item for word in string.split() for item in split_word(word)])
# ['This', 'string', 'has', 'some', 'verylongwo', 'rdsneededt', 'osplit']
Note: it's a bad idea to name your string str. It shadows the built-in type.
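Since the question asks for a string rather than a list, the slicing generator can be wrapped as follows. This is a sketch assuming Python 3; the name trunc_string is taken from the question, and the wrapper itself is an illustration rather than part of the original answer.

```python
def split_word(word, length=10):
    # Yield consecutive slices of at most `length` characters.
    return (word[n:n + length] for n in range(0, len(word), length))

def trunc_string(s, num):
    """Return s with every word longer than num split into num-sized pieces."""
    return ' '.join(piece for word in s.split() for piece in split_word(word, num))

result = trunc_string("This string has some verylongwordsneededtosplit", 10)
print(result)
```

Because the text is rebuilt with s.split() and ' '.join, runs of whitespace in the input are collapsed to single spaces.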
An option is the textwrap module:
http://docs.python.org/2/library/textwrap.html
Example usage:
>>> import textwrap
>>> s = "This string has some verylongwordsneededtosplit"
>>> lines = textwrap.wrap(s, width=10)
>>> for line in lines: print(line)
...
This
string has
some veryl
ongwordsne
ededtospli
t
Why not:
def truncate_long_words(s, num):
    """Splits long words in string"""
    words = s.split()
    for word in words:
        if len(word) > num:
            for i in range(0, len(word), num):
                yield word[i:i+num]
        else:
            yield word
for t in truncate_long_words(s, 10):
    print(t)
Abusing regex:
import re
def trunc_string(s, num):
    return re.sub("(\\w{%d}\\B)" % num, "\\1 ", s)
assert "This string has some verylongwo rdsneededt osplit" == trunc_string("This string has some verylongwordsneededtosplit", 10)
(Edit: adopted simplification by Brian. Thanks. But I kept the \B to avoid adding a space when the word is exactly 10 characters long.)