Problem to research a special character in Python - python

I have a file (I only show a part) where I would like to remove a special character.
OTU1359 UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0 UniRef90_A0A1Z9THL2 UniRef90_A0A2E3B6A5 UniRef90_A0A2E5MT47 UniRef90_A0A2E5VCW9 UniRef90_A0A2E6CDK4 UniRef90_A0A2E6KTE6 UniRef90_A0A2E8AIM6 UniRef90_A0A2E8RIG1 UniRef90_A0A2E8YNS3 UniRef90_A0A2E9VEK0 UniRef90_W6RCT6
OTU0980 UniRef90_A0A084TMQ7 UniRef90_A0A090PK65 UniRef90_A0A0P1G8P0 UniRef90_A0A0P1IHL1 UniRef90_A0A286ILS7 UniRef90_A0A2A5E7H9 UniRef90_A0A2D9J217 UniRef90_H3NS47 UniRef90_H3NSN9 UniRef90_H3NSP0 UniRef90_H3NSP7 UniRef90_H3NUB2 UniRef90_H3NY28 UniRef90_H3NY47 UniRef90_UPI000C2CBC51
I would like to remove the character "OTUXXXX" (it always start by OTU and has always 4 numbers after) . It can appears multiple OTUXXXX by line
I tried :
re.search("OTU[0-9]{4}", line)
It doesn't work.. Any help?

I suggest using re.sub and find your pattern matches as whole words to avoid partial matches inside other words.
s = re.sub(r"\s*\bOTU[0-9]{4}\b", "", line).strip()
See the regex demo. The .strip() at the end removes any redundant leading/trailing whitespaces that remain after removing the matches at the end/start of the string.
See the regex graph:

You could make use of re.sub which actually performs replacemnt or substitution of matching text with the one you provide. Here you find the doc: https://docs.python.org/3/library/re.html
And here one possible implementaiton:
from re import compile, sub, MULTILINE
text = '''
OTU1359 UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0 UniRef90_A0A1Z9THL2 UniRef90_A0A2E3B6A5 UniRef90_A0A2E5MT47 UniRef90_A0A2E5VCW9 UniRef90_A0A2E6CDK4 UniRef90_A0A2E6KTE6 UniRef90_A0A2E8AIM6 UniRef90_A0A2E8RIG1 UniRef90_A0A2E8YNS3 UniRef90_A0A2E9VEK0 UniRef90_W6RCT6
OTU0980 UniRef90_A0A084TMQ7 UniRef90_A0A090PK65 UniRef90_A0A0P1G8P0 UniRef90_A0A0P1IHL1 UniRef90_A0A286ILS7 UniRef90_A0A2A5E7H9 UniRef90_A0A2D9J217 UniRef90_H3NS47 UniRef90_H3NSN9 UniRef90_H3NSP0 UniRef90_H3NSP7 UniRef90_H3NUB2 UniRef90_H3NY28 UniRef90_H3NY47 UniRef90_UPI000C2CBC51
'''
replacemnt = ''
regex = compile(r'OTU\d{4}', flags=MULTILINE)
cleaned = sub(regex, replacemnt, text)

Related

Match words using this regex pattern, only if these words do not appear within a list of substrings

import re
input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]
noun_pattern = r"((?:\w+))" # pattern that doesnt tolerate whitespace in middle
imput_text = re.sub(r"(?:^|\s+)a\s+" + noun_pattern,
"\(\g<0>\)",
input_text, re.IGNORECASE)
print(repr(input_text)) # --> output
I need the regex to identify and replace a substring containing no whitespaces in between "((?:\w+))" when it is at the beginning of the line or preceded by "a", "(?:^|\s+)a\s+", only if "((?:\w+))" does not match any of the strings that are inside the list list_verbs_in_this_input or a dot . , using a regex pattern similar to this re.compile(r"(?:" + rf"({'|'.join(list_verbs_in_this_input)})" + r"|[.;\n]|$)", flags = re.IGNORECASE)
And the correct output should look like this:
'(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'
Note that the substrings "a corrEr" and "a saltó" were not modified, since they contained substring(words) that are in the list_verbs_in_this_input list
To exclude some words, you can use a negative look ahead assertion when at the start of the word you're about to match.
A few things to correct:
re.sub takes the flags as 5th argument, not 4th
"\(" is not an escape sequence, so you should just do "(\g<0>)" without "escaping" the parentheses -- they have no special meaning in that string.
r"(?:^|\s+)a\s+" will always require the a to be there. From your description I understood that the a could be optional when the word is at the start of a line, so r"(?:\ba\s|^)\s*"
In the regex that should match the forbidden words, make sure to require that the word ends right after the match, so add \b in the pattern.
Here is what you could do:
import re
input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]
noun_pattern = r"\w+"
exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
article = r"(?:\ba\s|^)\s*"
regex = article + exclude + noun_pattern
input_text = re.sub(regex, "(\g<0>)", input_text, flags=re.I|re.U)
print(repr(input_text))

Remove Characters From A String Until A Specific Format is Reached

So I have the following strings and I have been trying to figure out how to manipulate them in such a way that I get a specific format.
string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo
I want to be able to get rid of any of the last string so I am just left with the month and year, like below:
string1-itd_jan2021
string2itd_mar2021
string3itd_feb2021
string4-itd_mar2021
string5itd_jun2021
string6-itd_feb2021
I thought about using string.split on the - but then realized that for some strings this wouldn't work. I also thought about getting rid of a set amount of characters by putting it into a list and slicing but the end is varying characters length?
Is there anything I can do it with regex or any other python module?
Use str.rsplit with the appropriate maxsplit parameter:
s = s.rsplit("-", 1)[0]
You could also use str.split (even though this is clearly the worse choice):
s = "-".join(s.split("-")[:-1])
Or using regular expressions:
s = re.sub(r'-[^-]*$', '', s)
# "-[^-]*" a "-" followed by any number of non-"-"
With a regex:
import re
re.sub(r'([0-9]{4}).*$', r'\1', s)
Use re.sub like so:
import re
lines = '''string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo'''
for old in lines.split('\n'):
new = re.sub(r'[-][^-]+$', '', old)
print('\t'.join([old, new]))
Prints:
string1-itd_jan2021-internal string1-itd_jan2021
string2itd_mar2021-space string2itd_mar2021
string3itd_feb2021-internal string3itd_feb2021
string4-itd_mar2021-moon string4-itd_mar2021
string5itd_jun2021-internal string5itd_jun2021
string6-itd_feb2021-apollo string6-itd_feb2021
Explanation:
r'[-][^-]+$' : Literal dash (-), followed by any character other than a dash ([^-]) repeated 1 or more times, followed by the end of the string ($).

Replacing when a word is in another word but with special circumstances

My program replaces tokens with values when they are in a file. When reading in a certain line it gets stuck here is an example:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a
The two tokens in the example are Token100 and Token100a. I need a way to only replace Token100 with its data and not replace Token100a with Token100's data with an a afterwards. I can't look for spaces before and after because sometimes they are in the middle of lines. Any thoughts are appreciated. Thanks.
You can use regex:
import re
line = "1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a"
match = re.sub("Token100a", "data", line)
print(match)
Outputs:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1data
More about regex here:
https://www.w3schools.com/python/python_regex.asp
You can use a regular expression with a negative lookahead to ensure that the following character is not an "a":
>>> import re
>>> test = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
>>> re.sub(r'Token100(?!a)', 'data', test)
'1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a'

Python - Regex to avoid matching duplicates

My string looks like this:
bo_1
bo_1
bo_2
bo_2
bo_3
bo_3
bo_4
bo_4
bo_5
bo_5
bo_6
bo_6
bo_7
bo_7
bo_8
bo_8
bo_9
bo_9
bo_10
bo_10
I want to match the first instance of each digit and ignore the next duplicate line. My regex is as follows:
(bo_\d)(?![\s\S]*\1)
which returns the following:
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_1'
How would I modify the regex to return a result like this instead (to include 'bo_1' at the start and 'bo_10' at the end):
'bo_1'
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_10'
Technically you don't need regex for that (you can use set() for instance):
>>> # Assume your string is in the variable called "text"
>>> result = set(text.split('\n'))
>>> result
{'bo_7', 'bo_3', 'bo_1', 'bo_6', 'bo_5', 'bo_8', 'bo_9', 'bo_2', 'bo_4', 'bo_10'}
Anyway, the issue with your regex is that bo_1 is also matching bo_10, so it will be seen as a duplicate by the regex. You can solve it using word boundaries to ensure that the full 'word' is tested for a match:
\b(bo_\d+)\b(?![\s\S]*\b\1\b)
regex101 demo
Use
(bo_\d+$)(?![\s\S]*^\1$)
Since you want to include bo_10, you should use \d+ and not just \d in the initial group. Then, in your negative lookahead, put the backrefrence between start-of-line and end-of-line anchors, so that, for example, bo_1 does not get excluded because it's followed by a bo_10.
https://regex101.com/r/8khbcc/1

Python: strip function definition using regex

I am a very beginner of programming and reading the book "Automate the boring stuff with Python'. In Chapter 7, there is a project practice: the regex version of strip(). My code below does not work (I use Python 3.6.1). Could anyone help?
import re
string = input("Enter a string to strip: ")
strip_chars = input("Enter the characters you want to be stripped: ")
def strip_fn(string, strip_chars):
if strip_chars == '':
blank_start_end_regex = re.compile(r'^(\s)+|(\s)+$')
stripped_string = blank_start_end_regex.sub('', string)
print(stripped_string)
else:
strip_chars_start_end_regex = re.compile(r'^(strip_chars)*|(strip_chars)*$')
stripped_string = strip_chars_start_end_regex.sub('', string)
print(stripped_string)
You can also use re.sub to substitute the characters in the start or end.
Let us say if the char is 'x'
re.sub(r'^x+', "", string)
re.sub(r'x+$', "", string)
The first line as lstrip and the second as rstrip
This just looks simpler.
When using r'^(strip_chars)*|(strip_chars)*$' string literal, the strip_chars is not interpolated, i.e. it is treated as a part of the string. You need to pass it as a variable to the regex. However, just passing it in the current form would result in a "corrupt" regex because (...) in a regex is a grouping construct, while you want to match a single char from the define set of chars stored in the strip_chars variable.
You could just wrap the string with a pair of [ and ] to create a character class, but if the variable contains, say z-a, it would make the resulting pattern invalid. You also need to escape each char to play it safe.
Replace
r'^(strip_chars)*|(strip_chars)*$'
with
r'^[{0}]+|[{0}]+$'.format("".join([re.escape(x) for x in strip_chars]))
I advise to replace * (zero or more occurrences) with + (one or more occurrences) quantifier because in most cases, when we want to remove something, we need to match at least 1 occurrence of the unnecessary string(s).
Also, you may replace r'^(\s)+|(\s)+$' with r'^\s+|\s+$' since the repeated capturing groups will keep on re-writing group values upon each iteration slightly hampering the regex execution.
#! python
# Regex Version of Strip()
import re
def RegexStrip(mainString,charsToBeRemoved=None):
if(charsToBeRemoved!=None):
regex=re.compile(r'[%s]'%charsToBeRemoved)#Interesting TO NOTE
return regex.sub('',mainString)
else:
regex=re.compile(r'^\s+')
regex1=re.compile(r'$\s+')
newString=regex1.sub('',mainString)
newString=regex.sub('',newString)
return newString
Str=' hello3123my43name is antony '
print(RegexStrip(Str))
Maybe this could help, it can be further simplified of course.

Categories

Resources