Alright, I'm working on a little project for school, a 6-frame translator. I won't go into too much detail, I'll just describe what I wanted to add.
The normal output would be something like:
TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD
The important parts of this string are the M and the _ (the start and stop codons, biology stuff). What I wanted to do was highlight these like so:
TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSD
Now here is where (for me) it gets tricky, I got my output to look like this (adding a space and a ' to highlight the start and stop). But it only does this once, for the first start and stop it finds. If there are any other M....._ combinations it won't highlight them.
Here is my current code, attempting to make it highlight more than once:
def start_stop(translation):
    index_2 = 0
    while True:
        if 'M' in translation[index_2::1]:
            index_1 = translation[index_2::1].find('M')
            index_2 = translation[index_1::1].find('_') + index_1
            new_translation = translation[:index_1] + " '" + \
                              translation[index_1:index_2 + 1] + "' " +\
                              translation[index_2 + 1:]
        else:
            break
    return new_translation
I really thought this would do it, guess not. So now I find myself being stuck.
If any of you are willing to try and help, here is a randomly generated string with more than one M....._ set:
'TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLYMPPARRLATKSRFLTPVISSG_DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI'
Thank you to anyone willing to help :)
Regular expressions are pretty handy here:
import re
sequence = "TTCP...."
highlighted = re.sub(r"(M\w*?_)", r" '\1' ", sequence)
# Output:
"TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLY 'MPPARRLATKSRFLTPVISSG_' DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI"
Regex explanation:
We look for an M followed by any number of "word characters" \w* then an _, using the ? to make it a non-greedy match (otherwise it would just make one group from the first M to the last _).
The replacement is the matched group (\1 indicates "first group", there's only one), but surrounded by spaces and quotes.
You only need a little string slicing; no external module is required.
Python strings have a method called 'index'; just use it.
string_1='TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD'
before=string_1.index('M')
after=string_1[before:].index('_')
print('{} {} {}'.format(string_1[:before],string_1[before:before+after+1],string_1[before+after+1:]))
output:
TTCPTISPALGLAWS_DLGTLGF MSYSANTASGETLVSLYQLGLFEM_ VVSYGRTKYYLICP_LFHLSVGFVPSD
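The slicing above only handles the first M....._ pair. A minimal sketch of how the same idea could be extended to every pair, using str.find with a start position instead of index. The function name highlight_all and its parameters are made up for this illustration, and it assumes that a start without a later stop should be left untouched:

```python
def highlight_all(translation, start="M", stop="_"):
    """Wrap every start...stop stretch in spaces and single quotes."""
    pieces = []
    pos = 0  # where the not-yet-copied part of the string begins
    while True:
        begin = translation.find(start, pos)
        if begin == -1:
            break  # no further start codon
        end = translation.find(stop, begin)
        if end == -1:
            break  # a start without a stop: leave the tail as it is
        pieces.append(translation[pos:begin])
        pieces.append(" '" + translation[begin:end + 1] + "' ")
        pos = end + 1
    pieces.append(translation[pos:])
    return "".join(pieces)


print(highlight_all("AAMBB_CCMDD_EE"))
```

Searching from pos each time is what keeps the indices absolute, so the second and later pairs are found relative to the whole string rather than to a slice.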
I have a list of titles with combined dates and descriptions, but I have to reduce this to just a list of dates. Some examples of these titles are stuff like this:
1/16 Stories of Time
5/18 Cock'a'doodle'do
However, some people are really bad at typing and have forgotten the spaces between the dates and the rest of the title. I need to remove everything except for numbers and the slashes between them. Using any method, but preferably regex, is there a simple way to do this? For the record, I do understand how to split and recompile the list for any method that would work on a single string.
You're thinking about this backwards. If you want to extract the date at the start of a line, do that instead of trying to get rid of everything else.
You can use a regex like this: ^\d{1,2}/\d{1,2} which means:
^ start of line
\d digit
{1,2} repeated one or two times
For example:
import re
lines = [
'1/16 Stories of Time',
"5/18 Cock'a'doodle'do",
'6/22Bible']
for line in lines:
    match = re.match(r'^\d{1,2}/\d{1,2}', line)
    if match:
        print(match.group(0))
Output:
1/16
5/18
6/22
(Note that re.match always starts matching from the start of the string, so the ^ is redundant here.)
This is more robust against titles containing numbers and slashes, like say, 4/5 The 39 Steps / The Thirty-Nine Steps -> 4/5.
However, you'll have a problem if someone forgot the space and the title itself starts with a number, like say, 7/8100 Years of Solitude -> 7/81.
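One possible way to soften the missing-space problem is to sanity-check the day and fall back to a single digit when the two-digit reading is impossible. This is only a sketch: it assumes the dates are month/day, the helper name leading_date is invented here, and a title like 7/1100 Years would still be read as 7/11 because that is a valid date.

```python
import re

DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})")


def leading_date(title):
    """Return the month/day at the start of title, or None if there is none."""
    match = DATE_RE.match(title)
    if match is None:
        return None
    month, day = match.group(1), match.group(2)
    # Assumption: a two-digit day above 31 means the second digit
    # really belongs to the title, so keep only the first digit.
    if len(day) == 2 and int(day) > 31:
        day = day[0]
    return f"{month}/{day}"


for title in ["1/16 Stories of Time", "6/22Bible", "7/8100 Years of Solitude"]:
    print(leading_date(title))
```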
You can import string to get easy access to a string of all digits, add the slash to it, and then compare your date string against that to drop any character from the date string that's not in there:
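import string

date_string = "5/18 Cock'a'doodle'do"
# Build the allowed characters locally instead of modifying string.digits itself
allowed = string.digits + "/"
# Keep only the digits and slashes
date_string = "".join(character for character in date_string if character in allowed)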
This will convert the date_string 5/18 Cock'a'doodle'do to just 5/18 without using regex at all.
Barmar, in a comment on the original question, had the best answer. To remove everything but the digits and slashes from the string, you can use this one line of code:
string = re.sub(r'[^\d/]', '', string)
This removes every character that is not a digit or a slash. Thank you Barmar, if you want to post this as an answer I can take this down and flag that instead.
string = "rk3k3rr3kk____"
print("".join([letter for letter in string if not letter.isalpha()]))
But this is what you actually want, since your data seems to always be in a specific kind of format:
string.split(" ")[0]
okay,okay,okay ... this is what you want:
string[:4]
for completeness' sake:
string = " 2/24 4/12 333333 effee24/22"
for i, x in enumerate(string):
    if len(string) <= i + 4:
        break
    if i > 0 and x != " " and not x.isalpha():
        continue
    if not string[i+1].isnumeric():
        continue
    if string[i+2] != "/":
        continue
    if not string[i+3].isnumeric():
        continue
    if not string[i+4].isnumeric():
        continue
    if len(string) == i + 6 and string[i+5] != " " and not string[i+5].isalpha():
        continue
    print(string[i+1:i+5])
I have a little "problem" that I would like to solve via programming a simple script. I don't have much programming experience and thought I'd ask here for help for what I should look for or know to do this.
Basically I want to take an email address such as placeholder.1234#fakemail.com and turn it into pl*************4#fakemail.com.
I need the script to take the letters after the first two, and before the last, and turn those letters into asterisks, and they have to match the amount of characters.
Just for clarification, I am not asking for someone to write this for me, I just need some guidance for how to go about this. I would also like to do this in Python, as I already have Python setup on my PC.
You can use Python string slicing:
email = "placeholder.1234#fakemail.com"
idx1 = 2
idx2 = email.index("#") - 1
print(email[:idx1] + "*" * (idx2 - idx1) + email[idx2:])
Output:
pl*************4#fakemail.com
Explanation:
Define the string that will contain the email address:
email = "placeholder.1234#fakemail.com"
Define the index of where the asterisks should begin, which is 2:
idx1 = 2
Define the index of where the asterisks should end, which is the index of where the # symbol is minus 1:
idx2 = email.index("#") - 1
Finally, using the indices defined, you can slice and concatenate the string defined accordingly:
print(email[:idx1] + "*" * (idx2 - idx1) + email[idx2:])
So this email will be a string.
Try using a combination of String indexing and (string replacement or string concatenation).
First, let's think about what data type we would store this in. It should be a String since it contains letters and characters.
We know we want to replace a portion of the String from the THIRD character to the character 2 before "#".
Let's think about this in terms of indexing now. The third character, or our start index for replacement is at index 2. To find the index of the character "#", we can use the: index() function as so:
end = email.index('#')
However, we want to start replacing from 2 before that index, so we can just subtract 2 from it.
Now we have the indexes (start and endpoint) of what we want to replace. We can now use a substring that goes from the starting to ending indexes and use the .replace() function to replace this substring in our email with a bunch of *'s.
To determine how many *'s we need, we can find the difference in indexes and add 1 to get the total number. So we could do something like:
stars = "*" * (end - start + 1)
email = email.replace(email[start:end + 1], stars)
Notice how I did start:end + 1 for the substring. This is because we want to include that ending index, and the substring function on its own will not include the ending index.
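Putting those steps together, here is one way it might look as a small function. It uses slicing and concatenation rather than .replace(), since .replace() would also change any identical substring elsewhere in the address. The name mask_email and the parameters keep_start and keep_end are made up for this sketch, and the separator defaults to "#" only because that is what the example address uses:

```python
def mask_email(email, separator="#", keep_start=2, keep_end=1):
    """Replace the middle of the part before the separator with asterisks."""
    local, sep, domain = email.partition(separator)
    hidden = len(local) - keep_start - keep_end
    # Assumption: leave the address alone if there is no separator
    # or the local part is too short to hide anything.
    if not sep or hidden <= 0:
        return email
    masked = local[:keep_start] + "*" * hidden + local[len(local) - keep_end:]
    return masked + sep + domain


print(mask_email("placeholder.1234#fakemail.com"))
```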
I hope this helped answer your question! Please let me know if you need any further clarification or details:)
I'm writing a function which needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter; it must be abbreviated per word and not per sentence. After that I need to join every word again in the same format, including the special characters. I tried using the re.findall() method, but it automatically removes the special characters, so I can't use " ".join() because it will leave them out.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
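As one possible variation on the pattern above, a single character class with a length quantifier avoids repeating [a-z] and also accepts uppercase letters. This is just a sketch; the names WORD and shorten are invented here. Note that it treats an apostrophe as a word boundary, so "don't" is left alone while "shouldn't" becomes "s5n't", which may or may not be what you want:

```python
import re

# Any run of four or more ASCII letters counts as a word to abbreviate.
WORD = re.compile(r"[A-Za-z]{4,}")


def abbreviate(text):
    def shorten(match):
        word = match.group(0)
        # first letter + number of letters removed + last letter
        return f"{word[0]}{len(word) - 2}{word[-1]}"

    return WORD.sub(shorten, text)


print(abbreviate("Elephant-rides are really fun!"))
```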
Thank you for viewing my question. After a few more searches, trial, and error I finally found a way to execute my code properly without changing it too much. I simply substituted re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and also removed the whitespace in "".join() for they were simply not needed anymore.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
I have 4.5 million rows to process, so I badly need to speed this up.
I don't understand regex very well, so other answers are hard for me to follow.
I have a column that contains IDs, e.g.:
ELI-123456789
The numeric part of this ID is contained in a string in a column bk_name, where it is preceded by a "#":
AAAAA#123456789_BBBBB;CCCCC;
Now my goal is to change that string into the following one, moving the ID to the end (still preceded by a "#"), and save it in new_name:
AAAAA_BBBBB;CCCCC;#123456789
Here's what I tried:
Take the ID and replace "ELI-" with "#", save as ID2:
df["ID2"] = df["ID"].str.replace("ELI-", "#")
Find ID2 in the string of bk_name and replace it with "":
df["new_name"] = df["bk_name"].replace(regex = r'(?i)' + df["ID2"].astype(str), value = "")
Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
Now the problem is that step 2 is the line that takes most of the time: it took 6.5 seconds to process 6,700 rows.
But now I have 4.6 million rows to process; it has been running for 7 hours and still hasn't finished, and I have no idea why.
In my opinion, the regex is what slows my code down, but I don't understand it well enough to be sure.
So thanks for your help in advance, any suggestions would be appreciated :)
So, your approach wasn't that bad.
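I recommend reading a function's documentation before using it. The built-in str.replace used below takes no keyword arguments; pandas' Series.replace is a different method that does accept regex= and value=.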
Regarding step 2, don't use a regex just for replacing a string with another string ("" in this case).
# A plain dict stands in for a single row here, so .replace below is the built-in str.replace
df = {}
df['ID'] = 'ELI-123456789'
df['bk_name'] = 'AAAAA#123456789_BBBBB;CCCCC;'
print(df)
# Take the ID and replace "ELI-" with "#", save as ID2:
# df["ID2"] = df["ID"].str.replace("ELI-", "#")
df["ID2"] = df["ID"].replace("ELI-", "#")
print(df)
# Find ID2 in the string of bk_name and replace it with "":
# df["new_name"] = df["bk_name"].replace(regex = r'(?i)' + df["ID2"].astype(str), value = "")
df["new_name"] = df["bk_name"].replace(df["ID2"], "")
print(df)
# Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
# df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
df["new_name"] = df["new_name"] + df["ID"].replace("ELI-", "#")
print(df)
Hope it helps
EDIT:
Are you sure it's this piece of code the one slowing down your project? Have you tried to isolate it and see how long it actually takes to execute?
Try this:
import timeit
code_to_test = """
df = {};
df['ID'] = 'ELI-123456789'; df['bk_name'] = 'AAAAA#123456789_BBBBB;CCCCC;';
df["new_name"] = df['bk_name'].replace('#' + df['ID'][4:],'') + '#' + df['ID'][4:] # <----- ONE LINER
# print(df['new_name'])
"""
elapsed_time = timeit.timeit(code_to_test, number=4600000)
print(elapsed_time)
I shortened it to a one-liner which ONLY WORKS IF:
ELI- (or a 4-character string) is always the string to be replaced
# is always the character to be removed and # is always the one to be added
At least on the laptop I'm coding this on, it always executes the code 4600000 times in under 4 seconds.
Well, I would say that if you know that the ID part is always ELI- followed by a number and that this number is always behind the # at the beginning then I would consider not reading the ID at step 1. Directly work at step 2 with this regular expression and replacement:
https://regex101.com/r/QBQB7E/1
You should be able to create the content of your new field with one single assignment of the result of the substitution on the old field. The idea for speeding things up is to avoid having 4-5 lines of code that get the ID, search for it and then recompose the result. The regular expression and the substitution pattern can do all that in one single operation.
You can generate some Python code directly from Regex101 so that you can integrate it to create your new_name field's content.
Explanation
^(.*?)#(\d+)(.*)$ is the regular expression
^ means "starting with".
$ means "ending with".
() captures a group and creates a variable that we can then use in the replacement pattern: \1 for the first matching group, \2 for the second, etc.
.*? matches any character between zero and unlimited times, as few times as possible, expanding as needed (lazy).
# matches the character # literally (case sensitive).
\d+ matches a digit (equal to [0-9]) one or more times.
.* matches any character between zero and unlimited times, as many times as possible, giving back as needed (greedy).
The replacement string is \1\3#\2, so it takes the first matching group followed by the last one and then adds a # followed by the second matching group, which is your ID.
The regex itself can also be tweaked for speed, depending on how it is written.
Second version:
^([^#]+)#(\d+)(.*)$ where I replaced .*? by [^#]+, meaning one or more characters that are not #.
Solution here: https://regex101.com/r/QBQB7E/3
Tested the code and I got it in about 4 seconds...
import timeit
# Raw string, so the backslashes below reach the timed code unchanged
code_to_test = r"""
import re
regex = r"^([^#]+)#(\d+)(.*)$"
subst = "\\1\\3#\\2"
df = {
"ID": "ELI-123456789",
"bk_name": "AAAAA#123456789_BBBBB;CCCCC;"
}
df["new_name"] = re.sub(regex, subst, df["bk_name"])
"""
elapsed_time = timeit.timeit(code_to_test, number=4600000)
print(elapsed_time)
I really think that if your code runs for 7 hours, the problem is probably somewhere else. The regex engine doesn't seem much slower than doing manual search/replace operations.
I don't understand what caused the problem but I solved it.
The only thing I changed is in step 2. I used apply/lambda function and it suddenly works and I have no idea why.
Take the ID and replace "ELI-" with "#", save as ID2:
df["ID2"] = df["ID"].str.replace("ELI-", "#")
Find ID2 in the string of bk_name and replace it with "":
df["new_name"] = df[["bk_name", "ID2"]].astype(str).apply(lambda x: x["bk_name"].replace(x["ID2"], ""), axis=1)
Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
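A possible explanation, offered as a guess rather than something verified against the original data: when Series.replace is given a whole column of patterns, each pattern may be tried against every row, so the work grows with the square of the row count, whereas the apply/lambda version only pairs each row with its own ID. The same row-by-row pairing can be sketched without apply, using plain Python and zip. The function name move_ids_to_end is invented for this example, and it assumes every ID starts with "ELI-" as in the question:

```python
def move_ids_to_end(names, ids, prefix="ELI-"):
    """For each (name, id) pair, cut '#<number>' out of name and append it."""
    new_names = []
    for name, id_ in zip(names, ids):
        tag = id_.replace(prefix, "#", 1)  # "ELI-123456789" -> "#123456789"
        # Plain substring replacement: no regex, one pass per row.
        new_names.append(name.replace(tag, "", 1) + tag)
    return new_names


names = ["AAAAA#123456789_BBBBB;CCCCC;"]
ids = ["ELI-123456789"]
print(move_ids_to_end(names, ids))
```

With a DataFrame, passing df["bk_name"] and df["ID"] as the two arguments and assigning the returned list to df["new_name"] should work the same way, since zip simply iterates over the column values. Unlike the (?i) regex in the question, this comparison is case-sensitive.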
I have looked over similar questions, but I still have trouble figuring this one out.
I have two Lists of strings, one of which consists of characters like 'abcdefg' and another one consisting of strings which consist of white spaces and a special character. The special character indicates where I should remove characters from my 'abcdefg' string. The special character's position in the list would be the same position I would need to remove a character from the first list. I also need to remove the adjacent characters.
EDIT: I want to remove a character (and the adjacent characters) at the same position the '*' char is located in airstrikes, but in reinforces. Does this make sense?
reinforces = ["abcdefg", "hijklmn"]
airstrikes = [" * "]
battlefield = reinforces[0]
bomb_range = []
count = 0
if range(len(airstrikes)) != 0:
    for airstrike in airstrikes:
        for char in airstrike:
            print(count)
            count = count + 1
            if (char == '*'):
                bomb_range.append(count-1)
                bomb_range.append(count)
                bomb_range.append(count+1)
                break
#Trying to hardcode it initially just to get it to work. Some kind of looping is needed though.
battlefield = battlefield[:bomb_range[0]] + battlefield[bomb_range[1]:]
battlefield = battlefield[:bomb_range[1]] + battlefield[bomb_range[2]:]
#battlefield = battlefield[:bomb_range[2]] + battlefield[bomb_range[3]:] #Will not work of course. But how could I achieve what I want?
I am sorry about the nested loops. If it hurts looking at it, feel free to bash and correct me. I am sorry if I missed any answers on this forum which could have helped me find a solution. Know that I did try to find one.
Use index to find where to strike, then remove the character the usual way:
>>> reinforce = "abcdefg"
>>> airstrike = "   *   "
>>> strike_at = airstrike.index('*')
>>> reinforce[:strike_at]+reinforce[strike_at+1:]
'abcefg'
Of course, airstrike.index('*') raises ValueError when there is no '*' (see try and except); the slices themselves are safe even if strike_at+1 is past the end of the string.
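To also remove the adjacent characters, as the question asks, one option is to collect every struck position first and then rebuild the string without them. A rough sketch, with a made-up function name apply_airstrike; it assumes positions that fall outside the string are simply ignored and that several '*' markers may appear in one airstrike:

```python
def apply_airstrike(reinforce, airstrike, marker="*"):
    """Drop the characters under each marker and their direct neighbours."""
    hit = set()
    for pos, char in enumerate(airstrike):
        if char == marker:
            hit.update((pos - 1, pos, pos + 1))
    # Positions like -1 or len(reinforce) never match an index, so they are harmless.
    return "".join(c for i, c in enumerate(reinforce) if i not in hit)


print(apply_airstrike("abcdefg", "   *   "))
```

Collecting the positions before deleting anything avoids the shifting-index problem that the hardcoded slices in the question run into.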