Python - Iterations of character replacements - python

I apologize if this has been answered previously - I wasn't entirely sure how to explain/search for this so I couldn't find anything existing.
I'm going through a large list of strings and trying to find matches in another data set. The input data set is space separated, and the existing data set uses an inconsistent combination of underscores and camel-case.
I'm looking for a clean way to iterate across all possibilities of these combinations.
The simplest case is this:
Input: "Variant Type"
Desired Output: "Variant_Type", "VariantType"
I had been accomplishing this 2-word case by searching twice:
x = input.replace(' ','_')
# Search
x = x.replace('_','')
# Search again
But now I'm realizing there are a lot of longer strings, like:
Input: "Timeline Integration Enabled"
Desired output:
"Timeline_Integration_Enabled", "TimelineIntegration_Enabled", "Timeline_IntegrationEnabled", "TimelineIntegrationEnabled"
Is there a clever, Pythonic way to accomplish this?
Note: I know I could use something like difflib.get_close_matches(), but I was hoping to do that as a last pass over the data, prompting the user to make decisions on any fields that weren't clear.
Thanks in advance and let me know if you need any more details.

def iterate_replacements(input_data):
    if " " in input_data:
        yield from iterate_replacements(input_data.replace(" ", "", 1))
        yield from iterate_replacements(input_data.replace(" ", "_", 1))
    else:
        yield input_data
for s in iterate_replacements("Timeline Integration Enabled"):
    print(s)
Or, for 2.7, which doesn't support yield from:
def iterate_replacements(input_data):
    if " " in input_data:
        for x in iterate_replacements(input_data.replace(" ", "", 1)): yield x
        for x in iterate_replacements(input_data.replace(" ", "_", 1)): yield x
    else:
        yield input_data
for s in iterate_replacements("Timeline Integration Enabled"):
    print(s)
Result:
TimelineIntegrationEnabled
TimelineIntegration_Enabled
Timeline_IntegrationEnabled
Timeline_Integration_Enabled
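As an alternative to the recursive generator above, the same combinations can be sketched with itertools.product, choosing one separator per gap between words. The function name separator_variants and its separators parameter are made up for this illustration, and the sketch assumes the input words are separated by whitespace.

```python
from itertools import product


def separator_variants(text, separators=("", "_")):
    """Yield every way of joining the words of `text` with the given separators."""
    words = text.split()
    if not words:
        return
    # One separator choice per gap between adjacent words.
    for combo in product(separators, repeat=len(words) - 1):
        joined = words[0]
        for sep, word in zip(combo, words[1:]):
            joined += sep + word
        yield joined


for variant in separator_variants("Timeline Integration Enabled"):
    print(variant)
```

For n words this yields 2**(n - 1) candidates, so it stays cheap for field names of a few words, and adding another separator (for example "-") only means extending the separators tuple.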

Here is another way to join the words, although your approach is already simple: Django's slugify. Note that slugify lowercases the text and joins the words with hyphens, so the call below prints variant-type.
from django.template.defaultfilters import slugify
print(slugify("Variant Type"))

So if I understand correctly you are simply trying to remove underscores and spaces.
If you get a match for Timeline_Integration you will also get a match for TimelineIntegration, so I am a little confused as to why you want every permutation possible for replacing spaces with '_' or ''.
Example: searching "Timeline Integration Method":
Searching for "Timeline_Integration":
Why not just remove _ from both the search string and the text being searched? That keeps the two consistent. Or, if you want to be space sensitive, just replace ' ' with '_' in both.
Unless I am completely misunderstanding the goal, I believe a possible solution would be to just completely remove the first search and do something like this:
Solutions:
space insensitive
search_string = ''.join([x for x in search_string if x != ' ' and x!= '_'])
to_be_searched_string = ''.join([x for x in to_be_searched_string if x != ' ' and x!= '_'])
space sensitive
search_string = ''.join([x if x != ' ' else '_' for x in search_string])
to_be_searched_string = ''.join([x if x != ' ' else '_' for x in to_be_searched_string])
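The normalization idea above can be sketched as a dictionary lookup: strip spaces and underscores from both sides once, then match on the normalized key. The names normalize, existing_names, lookup and find_match are hypothetical, and the sample data is invented for the example; the sketch also assumes no two existing names collapse to the same key.

```python
def normalize(name):
    """Drop spaces and underscores so differently separated names compare equal."""
    return name.replace(" ", "").replace("_", "")


# Hypothetical sample of the existing, inconsistently named data set.
existing_names = ["Variant_Type", "TimelineIntegration_Enabled"]

# Map each normalized key back to the original spelling.
lookup = {normalize(name): name for name in existing_names}


def find_match(input_name):
    """Return the existing name matching `input_name`, or None if there is none."""
    return lookup.get(normalize(input_name))


print(find_match("Variant Type"))
print(find_match("Timeline Integration Enabled"))
```

Anything for which find_match returns None could then go to the difflib.get_close_matches pass mentioned in the question.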

Related

how to search a string with spaces within another string in python?

I want to search for and blank out sentences that contain phrases like "masked 111", "My Add no", etc., in sentences such as "XYZ masked 111" or "Hello My Add" in Python. How can I do that?
I tried modifying the code below, but it did not work because of the spaces.
import re
def garbagefin(x):
    k = " ".join(re.findall("[a-zA-Z0-9]+", x))
    print(k)
    t=re.split(r'\s',k)
    print(t)
    Glist={'masked 111', 'DATA',"My Add no" , 'MASKEDDATA',}
    for n, m in enumerate(t): ##to remove entire ID
        if m in Glist:
            return ''
        else:
            return x
The outputs I am expecting are:
garbagefin("I am masked 111")-Blank
garbagefin("I am My Add No")-Blank
garbagefin("I am My add")-I am My add
garbagefin("I am My MASKEDDATA")-Blank
You can also use a regex approach like this:
import re
Glist={'masked 111', 'DATA',"My Add no" , 'MASKEDDATA',}
glst_rx = r"\b(?:{})\b".format("|".join(Glist))
def garbagefin(x):
    if re.search(glst_rx, x, re.I):
        return ''
    else:
        return x
See the Python demo.
The glst_rx = r"\b(?:{})\b".format("|".join(Glist)) code will generate the \b(?:My Add no|DATA|MASKEDDATA|masked 111)\b regex (see the online demo).
It will match the strings from Glist in a case insensitive way (note the re.I flag in re.search(glst_rx, x, re.I)) as whole words, and once found, an empty string will be returned, else, the input string will be returned.
If there are too many items in Glist, you could leverage a regex trie (see here how to use the trieregex library to generate such tries.)
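The pattern above is built by joining the phrases as they are, which works as long as none of them contains regex metacharacters. A slightly more defensive sketch, under the assumption that the phrases should always be matched literally, escapes each one with re.escape and compiles the pattern once. Sorting the alternatives longest first is an extra precaution added here, not part of the answer above.

```python
import re

Glist = {"masked 111", "DATA", "My Add no", "MASKEDDATA"}

# Escape every phrase so characters such as "." or "+" are matched literally,
# and try longer phrases before shorter ones.
alternatives = sorted((re.escape(text) for text in Glist), key=len, reverse=True)
glst_rx = re.compile(r"\b(?:{})\b".format("|".join(alternatives)), re.IGNORECASE)


def garbagefin(x):
    """Return '' if x contains any phrase from Glist as whole words, else x."""
    return "" if glst_rx.search(x) else x


for sample in ["I am masked 111", "I am My Add No", "I am My add", "I am My MASKEDDATA"]:
    print(repr(garbagefin(sample)))
```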
Seems like you don't actually need regex. Just the usual in operator.
def garbagefin(x):
    return "" if any(text in x for text in Glist) else x
If your matching is case insensitive, then compare against lowercase text.
Glist = set(map(lambda text: text.casefold(), Glist))
...
def garbagefin(x):
    x_lower = x.casefold()
    return "" if any(text in x_lower for text in Glist) else x
Output
1.
2.
3. I am My add
4.
If you're just trying to find one string inside another, you don't need such complicated code. You can also store the key strings in a list.
You can simply use the in operator and return.
def garbagefin (x):
    L=["masked 111","DATA","My Add no", "MASKEDDATA"]
    for i in L:
        if i in x:
            print("Blank")
            return
    print(x)

Python inserting spaces in string

Alright, I'm working on a little project for school, a 6-frame translator. I won't go into too much detail, I'll just describe what I wanted to add.
The normal output would be something like:
TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD
The important parts of this string are the M and the _ (the start and stop codons, biology stuff). What I wanted to do was highlight these like so:
TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSD
Now here is where (for me) it gets tricky, I got my output to look like this (adding a space and a ' to highlight the start and stop). But it only does this once, for the first start and stop it finds. If there are any other M....._ combinations it won't highlight them.
Here is my current code, attempting to make it highlight more than once:
def start_stop(translation):
    index_2 = 0
    while True:
        if 'M' in translation[index_2::1]:
            index_1 = translation[index_2::1].find('M')
            index_2 = translation[index_1::1].find('_') + index_1
            new_translation = translation[:index_1] + " '" + \
                              translation[index_1:index_2 + 1] + "' " +\
                              translation[index_2 + 1:]
        else:
            break
    return new_translation
I really thought this would do it, guess not. So now I find myself being stuck.
If any of you are willing to try and help, here is a randomly generated string with more than one M....._ set:
'TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLYMPPARRLATKSRFLTPVISSG_DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI'
Thank you to anyone willing to help :)
Regular expressions are pretty handy here:
import re
sequence = "TTCP...."
highlighted = re.sub(r"(M\w*?_)", r" '\1' ", sequence)
# Output:
"TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLY 'MPPARRLATKSRFLTPVISSG_' DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI"
Regex explanation:
We look for an M followed by any number of "word characters" \w* then an _, using the ? to make it a non-greedy match (otherwise it would just make one group from the first M to the last _).
The replacement is the matched group (\1 indicates "first group", there's only one), but surrounded by spaces and quotes.
You only need a little string slicing; no external module is required.
Python strings have a method called index; just use it. Note that this version highlights only the first M..._ pair:
string_1='TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD'
before=string_1.index('M')
after=string_1[before:].index('_')
print('{} {} {}'.format(string_1[:before],string_1[before:before+after+1],string_1[before+after+1:]))
output:
TTCPTISPALGLAWS_DLGTLGF MSYSANTASGETLVSLYQLGLFEM_ VVSYGRTKYYLICP_LFHLSVGFVPSD
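The slicing approach can be extended to every M..._ pair by remembering where the previous match ended and passing that position to str.find. The sketch below is one way to do that without regular expressions; the name highlight_orfs is made up for this example, and it assumes that an M with no later _ should be left unhighlighted.

```python
def highlight_orfs(translation):
    """Wrap every M..._ stretch in quotes, scanning left to right."""
    pieces = []
    pos = 0  # index just past the previously highlighted stretch
    while True:
        start = translation.find("M", pos)
        if start == -1:
            break  # no further start codon
        stop = translation.find("_", start)
        if stop == -1:
            break  # start codon without a following stop codon
        pieces.append(translation[pos:start])
        pieces.append(" '" + translation[start:stop + 1] + "' ")
        pos = stop + 1
    pieces.append(translation[pos:])  # whatever is left after the last match
    return "".join(pieces)


print(highlight_orfs("TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD"))
```

The key difference from the loop in the question is that both searches use absolute positions in the original string, so the indices never need to be re-offset after slicing.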

How to replace user's input?

I'm trying to make a Text to Binary converter script. Here's what I've got..
userInput = input()
a = ('00000001')
b = ('00000010')
#...Here I have every remaining letter translated in binary.
z = ('00011010')
So let's say the user types the word "Tree", I want to convert every letter in binary and display it. I hope you can understand what I'm trying to do here.
PS. I'm a bit newbie! :)
The way you have attempted to solve the problem isn't ideal. You've backed yourself into a corner by assigning the binary values to variables.
With variables, you are going to have to use eval() to dynamically get their value:
result = ' '.join((eval(character)) for character in myString)
Be advised however, the general consensus regarding the use of eval() and similar functions is don't. A much better solution would be to use a dictionary to map the binary values, instead of using variables:
characters = { "a" : '00000001', "b" :'00000010' } #etc
result = ' '.join(characters[character] for character in myString)
The ideal solution however, would be to use the built-in ord() function:
result = ' '.join(format(ord(character), 'b') for character in myString)
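One detail worth noting: the 'b' format drops leading zeros, so the pieces come out with different lengths. A minimal sketch that pads each character to 8 bits, assuming the input stays within code points below 256, could look like this. The function name text_to_binary is hypothetical, and the values are the characters' code points rather than the custom a = 00000001 table from the question.

```python
def text_to_binary(text):
    """Return the 8-bit binary form of each character, separated by spaces."""
    # "08b" means: binary, zero-padded to a width of 8 digits.
    return " ".join(format(ord(ch), "08b") for ch in text)


print(text_to_binary("Tree"))  # 01010100 01110010 01100101 01100101
```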
Check the ord function:
userinput = input()
binaries = [ord(letter) for letter in userinput]
A cheeky one-liner that prints each character on a new line with a label:
[print(val, "= ", format(ord(val), 'b')) for val in input()]
# this returns a list of None values, one per print call
A similarly cheeky one-liner that prints with an arbitrary separator, set through print's sep argument:
print(*[str(val) + "= "+str(format(ord(val), 'b')) for val in input()], sep = ' ')
Copy and paste into your favorite interpreter :)

Python - Print Each Sentence On New Line

Per the subject, I'm trying to print each sentence in a string on a new line. With the current code and output shown below, what's the syntax to return "Correct Output" shown below?
Code
sentence = 'I am sorry Dave. I cannot let you do that.'
def format_sentence(sentence):
    sentenceSplit = sentence.split(".")
    for s in sentenceSplit:
        print s + "."
Output
I am sorry Dave.
I cannot let you do that.
.
None
Correct Output
I am sorry Dave.
I cannot let you do that.
You can do this :
def format_sentence(sentence) :
    sentenceSplit = filter(None, sentence.split("."))
    for s in sentenceSplit :
        print s.strip() + "."
There are some issues with your implementation. First, as Jarvis points out in his answer, if your delimiter is the first or last character in your string, or if two delimiter characters are right next to each other, split() puts empty strings into the resulting list. To fix this, filter them out: filter(None, ...) drops every falsy item, including empty strings. Also, instead of using the + operator, use string formatting.
def format_sentence(sentences):
    sentences_split = filter(None, sentences.split('.'))
    for s in sentences_split:
        print '{0}.'.format(s.strip())
You can split the string by ". " instead of ".", then print each line with an additional "." until the last one, which will have a "." already.
def format_sentence(sentence):
    sentenceSplit = sentence.split(". ")
    for s in sentenceSplit[:-1]:
        print s + "."
    print sentenceSplit[-1]
Try:
def format_sentence(sentence):
    print(sentence.replace('. ', '.\n'))
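Another possible variation, written for Python 3, splits on the whitespace that follows sentence-ending punctuation, so the punctuation stays attached to each sentence and no empty trailing item appears. The name split_sentences is made up for this sketch, and it assumes sentences end in ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr. Smith" would be split too.

```python
import re


def split_sentences(text):
    """Split text after ., ! or ? when followed by whitespace."""
    # The lookbehind keeps the punctuation with the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


for sentence in split_sentences('I am sorry Dave. I cannot let you do that.'):
    print(sentence)
```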

Python String Replace Behaving Weirdly

I am trying to get the users who are mentioned in an article, that is, the words starting with the # symbol, and then wrap < and > around them.
WHAT I TRIED:
def getUsers(content):
    users = []
    l = content.split(' ')
    for user in l:
        if user.startswith('#'):
            users.append(user)
    return users
old_string = "Getting and replacing mentions of users. #me #mentee #you #your #us #usa #wo #world #word #wonderland"
users = getUsers(old_string)
new_array = old_string.split(' ')
for mention in new_array:
    for user in users:
        if mention == user and len(mention) == len(user):
            old_string = old_string.replace(mention, '<' + user + '>')
print old_string
print users
The code is behaving oddly. It wraps only the leading part of any word that starts with the same letters as a shorter mention, leaving the rest of the word outside the brackets, as shown in the output below:
RESULT:
Getting and replacing mentions of users. <#me> <#me>ntee <#you> <#you>r <#us> <#us>a <#wo> <#wo>rld <#wo>rd <#wo>nderland
['#me', '#mentee', '#you', '#your', '#us', '#usa', '#wo', '#world', '#word', '#wonderland']
EXPECTED RESULT:
Getting and replacing mentions of users. <#me> <#mentee> <#you> <#your> <#us> <#usa> <#wo> <#world> <#word> <#wonderland>
['#me', '#mentee', '#you', '#your', '#us', '#usa', '#wo', '#world', '#word', '#wonderland']
Process finished with exit code 0
Why is this happening, and how can I do this the right way?
Why this happens: When you split the string, you put a lot of checks in to make sure you are looking at the right user e.g. you have #me and #mentee - so for user me, it will match the first, and not the second.
However, when you do replace, you are doing replace on the whole string - so when you say to replace e.g. #me with <#me>, it doesn't know anything about your careful split - it's just going to look for #me in the string and replace it. So #mentee ALSO contains #me, and will get replaced.
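The effect described above is easy to reproduce on a two-word string. The snippet below is a small illustration: str.replace works on substrings, so the shorter mention also matches the beginning of the longer one, while a pattern that consumes the whole #word does not. The variable names are made up for the example.

```python
import re

text = "#me #mentee"

# str.replace knows nothing about word boundaries: "#me" is also
# the first three characters of "#mentee".
buggy = text.replace("#me", "<#me>")
print(buggy)  # <#me> <#me>ntee

# Matching the full mention (# followed by word characters) wraps each one whole.
fixed = re.sub(r"#\w+", lambda match: "<" + match.group(0) + ">", text)
print(fixed)  # <#me> <#mentee>
```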
Two (well, three) choices: One is to add spaces around it, to gate it (like #parchment wrote).
Second is to use your split: Instead of replacing the original string, replace the local piece. The simplest way to do this is with enumerate:
new_array = old_string.split(' ')
for index, mention in enumerate(new_array):
    for user in users:
        if mention == user and len(mention) == len(user):
            #We won't replace this in old_string, we'll replace the current entry
            #old_string = old_string.replace(a, '<' + user + '>')
            new_array[index] = '<%s>'%user
new_string = ' '.join(new_array)
Third way... this is a bit more complex, but what you really want is for any instance of '#anything' to be replaced with <#anything> (perhaps with whitespace?). You can do this in one shot with re.sub:
new_string = re.sub(r'(#\w+)', r'<\g<0>>', old_string)
My previous answer was based entirely on correcting the problems in your current code. But, there is a better way to do this, which is using regular expressions.
import re
old_string = re.sub(r'(#\w+)\b', r'<\1>', old_string)
For more information, see the documentation on the re module.
Because #me occurs first in your array, your code replaces the #me in #mentee.
Simplest way to fix that is to add a space after the username that you want to be replaced:
old_string = old_string.replace(mention + ' ', '<' + user + '> ')
# Note the space added after mention and after the closing '>'
A new problem occurs, though. The last word is not wrapped, because there's no space after it. A very simple way to fix it would be:
old_string = old_string + ' '
for mention in ... # Your loop
old_string = old_string[:-1]
This should work, as long as there isn't any punctuation (like commas) next to the usernames.
def wrapUsers(content):
    L = content.split()
    newL = []
    for word in L:
        if word.startswith('#'): word = '<'+word+'>'
        newL.append(word)
    return " ".join(newL)
