Python String Replace Behaving Weirdly

I am trying to get the users who are mentioned in an article, that is, words starting with the # symbol, and then wrap < and > around them.
WHAT I TRIED:
def getUsers(content):
    users = []
    l = content.split(' ')
    for user in l:
        if user.startswith('#'):
            users.append(user)
    return users
old_string = "Getting and replacing mentions of users. #me #mentee #you #your #us #usa #wo #world #word #wonderland"
users = getUsers(old_string)
new_array = old_string.split(' ')
for mention in new_array:
    for user in users:
        if mention == user and len(mention) == len(user):
            old_string = old_string.replace(mention, '<' + user + '>')
print old_string
print users
The code is behaving oddly. It wraps words that start with the same letters and even truncates the longer ones, as shown in the output below:
RESULT:
Getting and replacing mentions of users. <#me> <#me>ntee <#you> <#you>r <#us> <#us>a <#wo> <#wo>rld <#wo>rd <#wo>nderland
['#me', '#mentee', '#you', '#your', '#us', '#usa', '#wo', '#world', '#word', '#wonderland']
EXPECTED RESULT:
Getting and replacing mentions of users. <#me> <#mentee> <#you> <#your> <#us> <#usa> <#wo> <#world> <#word> <#wonderland>
['#me', '#mentee', '#you', '#your', '#us', '#usa', '#wo', '#world', '#word', '#wonderland']
Process finished with exit code 0
Why is this happening, and how can I do this the right way?

Why this happens: When you split the string, you put a lot of checks in to make sure you are looking at the right user, e.g. you have #me and #mentee - so for user #me, it will match the first and not the second.
However, when you do replace, you are doing replace on the whole string - so when you say to replace e.g. #me with <#me>, it doesn't know anything about your careful split - it's just going to look for #me in the string and replace it. So #mentee ALSO contains #me, and will get replaced.
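A quick way to see this in isolation (sample values taken from the question):
old_string = "#me #mentee"
# str.replace matches raw substrings, so '#me' also hits the start of '#mentee'
print(old_string.replace('#me', '<#me>'))   # <#me> <#me>ntee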
Two (well, three) choices: One is to add spaces around it, to guard it (as @parchment wrote).
Second is to use your split: Instead of replacing the original string, replace the local piece. The simplest way to do this is with enumerate:
new_array = old_string.split(' ')
for index, mention in enumerate(new_array):
    for user in users:
        if mention == user and len(mention) == len(user):
            # We won't replace this in old_string; we'll replace the current entry
            # old_string = old_string.replace(mention, '<' + user + '>')
            new_array[index] = '<%s>' % user
new_string = ' '.join(new_array)
Third way... this is a bit more complex, but what you really want is for any instance of '#anything' to be replaced with <#anything> (perhaps with whitespace?). You can do this in one shot with re.sub:
new_string = re.sub(r'(#\w+)', r'<\g<0>>', old_string)
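Put together as a runnable sketch (the sample string is the one from the question):
import re

old_string = "Getting and replacing mentions of users. #me #mentee #you #your #us #usa #wo #world #word #wonderland"
# \g<0> stands for the whole match, so every #word gets wrapped in < >
new_string = re.sub(r'(#\w+)', r'<\g<0>>', old_string)
print(new_string)
# Getting and replacing mentions of users. <#me> <#mentee> <#you> <#your> <#us> <#usa> <#wo> <#world> <#word> <#wonderland>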

My previous answer was based entirely on correcting the problems in your current code. But there is a better way to do this: regular expressions.
import re
oldstring = re.sub(r'(#\w+)\b', r'<\1>', oldstring)
For more information, see the documentation on the re module.

Because #me occurs first in your array, your code replaces the #me in #mentee.
The simplest way to fix that is to add a space after the username you want replaced:
old_string = old_string.replace(mention + ' ', '<' + user + '> ')
# note the space added after mention and after the closing '>'
A new problem occurs, though. The last word is not wrapped, because there's no space after it. A very simple way to fix it would be:
old_string = old_string + ' '
for mention in ...  # your loop from above
old_string = old_string[:-1]
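Putting the pieces together, a minimal sketch that reuses getUsers from the question:
old_string = "Getting and replacing mentions of users. #me #mentee #you #your #us #usa #wo #world #word #wonderland"
users = getUsers(old_string)

old_string = old_string + ' '   # temporary trailing space so the last word matches too
for mention in old_string.split(' '):
    for user in users:
        if mention == user:
            # the trailing space keeps '#me ' from matching inside '#mentee'
            old_string = old_string.replace(mention + ' ', '<' + user + '> ')
old_string = old_string[:-1]    # drop the helper space again
print(old_string)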

This should work, as long as there isn't any punctuation (like commas) next to the usernames.
def wrapUsers(content):
    L = content.split()
    newL = []
    for word in L:
        if word.startswith('#'):
            word = '<' + word + '>'
        newL.append(word)
    return " ".join(newL)

Related

How to remove a group of characters from a string in python including shifting

For instance, I have a string that needs a certain keyword removed. Let's say my string is whatthemomooofun and I want to delete the word moo from it. I have tried using the remove function, but it only removes "moo" one time; now my string is whatthemoofun and I can't seem to remove the rest. Is there any way I can do that?
You can use the built-in replace function:
def str_manipulate(word, key):
    while key in word:
        word = word.replace(key, '')
    return word
Have you tried using a while loop? You could loop through it until it doesn't find your keyword anymore, then break out.
Edit
Because the answer could be improved by an example, here is the general approach:
Example
s = 'whatthemomooofun'
while 'moo' in s:
    s = s.replace('moo', '')
print(s)
Output
whatthefun
original_string = "!(Hell#o)"
characters_to_remove = "!()#"
new_string = original_string
for character in characters_to_remove:
    new_string = new_string.replace(character, "")
print(new_string)
OUTPUT
Hello
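If you prefer a single pass, the same removal can also be written with str.translate; this is an alternative sketch, not what the answer above uses:
original_string = "!(Hell#o)"
characters_to_remove = "!()#"
# the third argument to str.maketrans lists characters to delete
new_string = original_string.translate(str.maketrans('', '', characters_to_remove))
print(new_string)  # Hello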

Substring replacements based on replace and no-replace rules

I have a string and rules/mappings for replacement and no-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if case is matched, i.e. case matters.
Whole-word replacement only; punctuation should be ignored for matching but kept after replacement.
What would be the cleanest way to solve this problem in Python 3.x?
Based on the answer of demongolem.
UPDATE
I am sorry, I missed the fact that only whole words should be replaced. I updated my code and even generalized it into a function.
import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    rx = rf"[\"\'\.,:; ]({replace_token})[\"\'\.,:; ]"
    iter = re.finditer(rx, sentence)
    out_sentence = ""
    found = []
    indices = []
    for m in iter:
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    for i in range(len(indices)):
        context = sentence[indices[i] - context_size:indices[i] + context_size]
        if dont_replace in context:
            continue
        else:
            # First replace the word only in the substring found
            to_replace = found[i].replace(replace_token, replace_with)
            # Then replace the word in the context found, so any special token
            # like quotes or '.' is carried over and the context does not change
            replace_val = context.replace(found[i], to_replace)
            # Finally replace the context found with the replacing context
            out_sentence = sentence.replace(context, replace_val)
    return out_sentence
Use regular expressions to find all occurrences of your token (we need to check whether it is a whole word or embedded in another word), using finditer(). You might need to adjust rx to your definition of a "whole word". Then take the context around each match, of the size of your no-replace rule, and check whether the context contains your no-replace string.
If it doesn't, you may replace it: use replace() on the word only, then replace the occurrence of the word in the context, then replace the context in the whole text. That way the replacement is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
After some research, this is what I believe to be the best and cleanest solution to my problem. The solution works by calling match_fun whenever a match has been found, and match_fun performs the replacement if, and only if, there is no no-replace phrase overlapping the current match. Let me know if you need more clarification or if you believe something can be improved.
import re

replace_dict = ...     # The code below assumes you already have this
no_replace_dict = ...  # The code below assumes you already have this
text = ...             # The text on input.

def match_fun(match: re.Match):
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + no_replace + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            if no_replace_match.start() >= match.start() and no_replace_match.start() < match.end():
                return str_match
            if no_replace_match.end() > match.start() and no_replace_match.end() <= match.end():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + replace + r'\b')
    text = pattern.sub(match_fun, text)
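A minimal way to exercise this; the concrete dictionary shapes are my assumption, inferred from how the code indexes them (replace_dict maps a word to its replacement, no_replace_dict maps a word to the phrases that block its replacement):
import re

replace_dict = {'sentence': 'processed_sentence'}
no_replace_dict = {'sentence': {'example sentence'}}
text = "This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."

# ...define match_fun as above, then run the loop over replace_dict...
# Expected result:
# This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced.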

Separate initials from full name input in python without using split

EDIT This was my first post and I completely forgot to show what I had already tried. I wasn't looking for a complete program, just suggestions on methods I could use to concatenate initials. EDIT
I need to create a program that allows a user to input their full name and only prints the initials. This must be done WITHOUT USING .SPLIT OR LISTS
From my hw:
Write a program that gets a string of a person's full name – first, middle, and last name and then displays their initials.
Create a function getInitials().
>>>Enter your full name: James Tiberias Kirk
>>>J.T.K.
Try this one:
n = input('Enter your full name:')
name = ''
for i, j in enumerate(n):
    if i == 0:
        name += (j + '.')
    elif j == ' ':
        name += (n[i+1] + '.')
print(name)
And here it is as a function:
def getinitials(n):
    name = ''
    for i, j in enumerate(n):
        if i == 0:
            name += (j + '.')
        elif j == ' ':
            name += (n[i+1] + '.')
    return name

print(getinitials(input('Enter your full name:')))
Output:
Enter your full name:James Tiberias Kirk
J.T.K.
I ended up figuring it out thanks to all your responses. Our teacher wanted us to only use the narrow scope of what we've learned in class so there was pretty much only one way I would be allowed to write it. I ended up coming up with this:
def getInitials():
    fullName = input("Enter your full name:")
    initials = ''
    for ch in fullName:
        if ch.isupper():
            initials += ch
            initials = str(initials) + "."
    print(initials)

if __name__ == "__main__":
    getInitials()
Assuming you won't deal with names like DeFranco or McDonald, you can iterate over the string and append encountered capital letters (that is, ord('A') <= ord(char) <= ord('Z')) with a dot to your result.
A similar approach that should work in most cases is to append the first character, then look for spaces and append the character that follows each of them.
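Minimal sketches of both ideas (the function names are mine, for illustration only):
def initials_from_capitals(name):
    # relies on every name part starting with exactly one capital letter
    result = ''
    for ch in name:
        if 'A' <= ch <= 'Z':   # same test as ord('A') <= ord(ch) <= ord('Z')
            result += ch + '.'
    return result

def initials_from_spaces(name):
    # take the first character, then every character that follows a space
    result = name[0] + '.'
    for i in range(1, len(name)):
        if name[i - 1] == ' ':
            result += name[i] + '.'
    return result

print(initials_from_capitals('James Tiberias Kirk'))  # J.T.K.
print(initials_from_spaces('James Tiberias Kirk'))    # J.T.K.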
One solution that technically doesn't use a list:
'.'.join(i for i in x if i.isupper())+'.'
...which uses a generator expression, but that may be closer to a list than the spirit of the assignment. As in the other answer, it fails if a name has more than one capital letter (McGregor, etc.).
You could use a regex while returning a generator expression from your getInitials() function, thereby technically avoiding both split() and a list:
import re
def getInitials(x):
    return '.'.join(i for i in re.findall(r'^[A-Z]|(?<=\s)[A-Z]', x)) + '.'
out = getInitials('James Tiberias Kirk McGregor')
Yields:
J.T.K.M.
Note that this method works for names with more than one capital letter per name/word.

Python - Iterations of character replacements

I apologize if this has been answered previously - I wasn't entirely sure how to explain/search for this so I couldn't find anything existing.
I'm going through a large list of strings and trying to find matches in another data set. The input data set is space separated, and the existing data set uses an inconsistent combination of underscores and camel-case.
I'm looking for a clean way to iterate across all possibilities of these combinations.
The simplest case is this:
Input: "Variant Type"
Desired Output: "Variant_Type", "VariantType"
I had been accomplishing this 2-word case by searching twice:
x = input.replace(' ','_')
# Search
x = x.replace('_','')
# Search again
But now that I'm realizing there are a lot of longer strings like:
Input: "Timeline Integration Enabled"
Desired output:
"Timeline_Integration_Enabled", "TimelineIntegration_Enabled", "Timeline_IntegrationEnabled", "TimelineIntegrationEnabled"
Is there a clever, Pythonic way to accomplish this?
Note: I know I could use something like difflib.get_close_matches(), but I was hoping to do that as a last pass over the data, prompting the user to make decisions on any fields that weren't clear.
Thanks in advance and let me know if you need any more details.
def iterate_replacements(input_data):
    if " " in input_data:
        yield from iterate_replacements(input_data.replace(" ", "", 1))
        yield from iterate_replacements(input_data.replace(" ", "_", 1))
    else:
        yield input_data

for s in iterate_replacements("Timeline Integration Enabled"):
    print(s)
Or, for 2.7, which doesn't support yield from:
def iterate_replacements(input_data):
    if " " in input_data:
        for x in iterate_replacements(input_data.replace(" ", "", 1)):
            yield x
        for x in iterate_replacements(input_data.replace(" ", "_", 1)):
            yield x
    else:
        yield input_data

for s in iterate_replacements("Timeline Integration Enabled"):
    print(s)
Result:
TimelineIntegrationEnabled
TimelineIntegration_Enabled
Timeline_IntegrationEnabled
Timeline_Integration_Enabled
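A non-recursive alternative that produces the same variants with itertools.product over the separator slots (a different take, not a correction of the answer above):
from itertools import product

def separator_variants(input_data):
    words = input_data.split(' ')
    # one separator slot ('' or '_') between each pair of adjacent words
    for seps in product(('', '_'), repeat=len(words) - 1):
        yield ''.join(w + s for w, s in zip(words, seps)) + words[-1]

for s in separator_variants("Timeline Integration Enabled"):
    print(s)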
Here is another way you can connect strings together, although the way you did it was simple: using Django's slugify.
from django.template.defaultfilters import slugify
print(slugify("Variant Type"))
So if I understand correctly you are simply trying to remove underscores and spaces.
If you get a match for Timeline_Integration you will also get a match for TimelineIntegration, so I am a little confused as to why you want every permutation possible for replacing spaces with '_' or ''.
Example: searching "Timeline Integration Method":
Searching for "Timeline_Integration":
Why not just replace '_' in the input strings for the search as well as in the searched text? This keeps things consistent. Or, if you want to be space sensitive, just replace '_' with ' '.
Unless I am completely misunderstanding the goal, I believe a possible solution would be to just completely remove the first search and do something like this:
Solutions:
space insensitive
search_string = ''.join([x for x in search_string if x != ' ' and x != '_'])
to_be_searched_string = ''.join([x for x in to_be_searched_string if x != ' ' and x != '_'])
space sensitive
search_string = ''.join([x if x != ' ' else '_' for x in search_string])
to_be_searched_string = ''.join([x if x != ' ' else '_' for x in to_be_searched_string])
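For example, with the space-insensitive normalization both spellings compare equal no matter how they were written (the normalize helper is just for illustration):
def normalize(s):
    # drop both spaces and underscores, so "Variant Type", "Variant_Type"
    # and "VariantType" all collapse to the same key
    return ''.join(x for x in s if x != ' ' and x != '_')

print(normalize("Variant Type") == normalize("Variant_Type"))  # True
print(normalize("Variant Type") == normalize("VariantType"))   # True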

python: dictionary of words and wordforms

I have the following problem: I created a dictionary (German) of words and their corresponding lemmas. Example:
"Lagerbestände", "Lager-bestand"; "Wohnhäuser", "Wohn-haus"; "Bahnhof", "Bahn-hof"
I now have a text and I want to look up the lemma of every word. It can happen that a word appears which is not in the dict, such as "Restbestände". But we already know the lemma of "bestände". So I want to take the first part of the unknown word and add it to the lemmatized second part, then print (or return) the result.
Example: "Restbestände" --> "Rest-bestand". ("bestand" is taken from the lemma of "Lagerbestände")
I coded the following:
for limit in range(1, len(Word)):
    for k, v in dicti.iteritems():
        if re.search('[\w]*' + Word[limit:], k, re.IGNORECASE) != None:
            if '-' in v:
                tmp = v.find('-')
                end = v[tmp:]
                end = re.sub(ur'[-]', "", end)
                Word = Word[:limit] + '-' + end
But I have 2 problems:
At the end of the words, "&#10" is printed out every time. How can I avoid this?
The second part of the word is sometimes not correct - there must be a logic error.
Anyway, how would you solve this?
At the end of the words, "&#10" is printed out every time. How can I avoid this?
You must use Unicode everywhere in your script. Everywhere, everywhere, everywhere.
Also, Python's regex functions accept the flag re.UNICODE, which you should always set here. German letters are outside the ASCII set, so the regex can otherwise get confused, for instance when matching r'\w'.
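A small Python 2 sketch of the idea (the hard-coded split point stands in for the question's loop over limit):
# -*- coding: utf-8 -*-
import re

dicti = {u"Lagerbestände": u"Lager-bestand"}
Word = u"Restbestände"

for k, v in dicti.iteritems():
    # with re.UNICODE, \w also matches letters like ä; without it the search can fail
    if re.search(u'\\w*' + Word[4:], k, re.IGNORECASE | re.UNICODE):
        print(Word[:4] + u'-' + v.split(u'-')[1])   # Rest-bestand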
