No luck finding regex pattern python - python

I am having no luck getting anything from this regex search.
I have a text file that looks like this:
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
I want to extract the lines that begin with "REF*23*" and end with "~":
txtfile = open(i + fileName, "r")
for line in txtfile:
    line = line.rstrip()
    p = re.findall(r'^REF*23*.+~', line)
    print(p)
But this gives me nothing. As much as I'd like to dig deep into regex with Python, I need a quick solution to this. What I eventually want is just the digits between the last "*" and the "~". Thanks.

You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and end with "~":
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line)
print(results)
If you need to get the digit chunks:
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line[7:-1])  # Just grab the slice
See non-regex approach demo.
NOTES
In a regex, * must be escaped to match a literal asterisk.
You read the file line by line, so re.findall(r'^REF*23*.+~', line) makes little sense: re.findall is meant for collecting multiple matches, while you expect at most one per line.
Your regex is not anchored on the right; you need $ or \Z to require ~ at the end of the line. So, if you want to use a regex, it would look like
m = re.search(r'^REF\*23\*(.*)~$', line)
if m:
    results.append(m.group(1))  # To grab just the contents between delimiters
    # or
    # results.append(line)  # To get the whole line
See this Python demo
In your case, you search for lines that start and end with fixed text, so there is no need for a regex.
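To make the notes above concrete, here is a minimal sketch comparing the unescaped pattern from the question with the escaped, anchored one. The `lines` list is a hypothetical in-memory stand-in for the file contents.

```python
import re

# Hypothetical sample standing in for lines read from the file.
lines = ["REF*1L*627783972~", "REF*23*526344060~", "DTP*336*D8*20140623~"]

# Unescaped: 'F*' means "zero or more F" and '3*' means "zero or more 3",
# so this pattern never looks for a literal '*'.
unescaped = re.compile(r'^REF*23*.+~')
# Escaped asterisks, anchored at both ends, with a capture group.
escaped = re.compile(r'^REF\*23\*(.*)~$')

unescaped_hits = [line for line in lines if unescaped.search(line)]
values = [m.group(1) for m in map(escaped.search, lines) if m]

print(unescaped_hits)  # []
print(values)          # ['526344060']
```

The unescaped pattern would match text such as "RE2x~", which is why it finds nothing in this data.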
Edit as an answer to the comment
Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.
You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:
results = []
with open(fileName, "r") as txtfile:
    for line in txtfile:
        res = re.findall(r'REF\*0F\*(\d+)~', line)
        if res:
            results.extend(res)
print(results)
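As a quick check of that pattern on a single unbroken line, the sketch below runs re.findall on a made-up string. The segment contents are assumptions for illustration, not data from the question.

```python
import re

# Hypothetical one-line file content with several segments run together.
data = "ISA*00~REF*0F*452574437~REF*1L*627783972~REF*0F*99~"

# Each match returns only the captured digits between 'REF*0F*' and '~'.
numbers = re.findall(r'REF\*0F\*(\d+)~', data)

print(numbers)  # ['452574437', '99']
```

Because the pattern is not anchored with ^ or $, it finds every occurrence regardless of where it sits in the line.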

You can get the digits using string functions alone (though a simple regex might be easier to understand, really):
raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""
result = [digits[:-1]
for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
for splitted in [line.split("*")]
for digits in [splitted[-1]]]
print(result)
This yields
['526344060']

* is a special character in regex, so you have to escape it, as @The Fourth Bird points out. You are using a raw string, which means you don't have to escape characters for Python's own string parsing, but you still have to escape them for the regex engine.
r'^REF\*23\*.+~'
or
'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine
will work. Having to escape things twice leads to the Leaning Toothpick Syndrome. Using a raw-string means you have to escape once, "saving some trees" in this regard.
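A small sketch to confirm that the two spellings describe the same pattern. The test string is taken from the question's sample data.

```python
import re

raw_pattern = r'^REF\*23\*.+~'        # raw string: one backslash per escape
escaped_pattern = '^REF\\*23\\*.+~'   # normal string: backslashes doubled

# Both literals produce the identical pattern text.
same = raw_pattern == escaped_pattern
found = re.findall(raw_pattern, 'REF*23*526344060~')

print(same)   # True
print(found)  # ['REF*23*526344060~']
```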
Additional changes
You might also want to put parentheses around .+ to capture it as a group, if you want to extract it. Also change findall to match, unless you expect multiple matches per line.
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        p = re.match(r'^REF\*23\*(.+)~', line)
        if p:
            results.append(int(p.group(1)))
Consider using a regex tester such as this one.

Related

Python regex [] does not work [duplicate]

I need some help on declaring a regex. My inputs are like the following:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>
The required output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100.
and there are many other lines in the txt files
with such tags
I've tried this:
#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    for line in reader:
        line2 = line.replace('<[1> ', '')
        line = line2.replace('</[1> ', '')
        line2 = line.replace('<[1>', '')
        line = line2.replace('</[1>', '')
        print line
I've also tried this (but it seems like I'm using the wrong regex syntax):
line2 = line.replace('<[*> ', '')
line = line2.replace('</[*> ', '')
line2 = line.replace('<[*>', '')
line = line2.replace('</[*>', '')
I don't want to hard-code the replacements from 1 to 99.
This tested snippet should do it:
import re
line = re.sub(r"</?\[\d+>", "", line)
Edit: Here's a commented version explaining how it works:
line = re.sub(r"""
(?x) # Use free-spacing mode.
< # Match a literal '<'
/? # Optionally match a '/'
\[ # Match a literal '['
\d+ # Match one or more digits
> # Match a literal '>'
""", "", line)
Regexes are fun! But I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: "metacharacters" which need to be escaped (i.e. with a backslash placed in front), and note that the rules are different inside and outside character classes. There is an excellent online tutorial at: www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
str.replace() does fixed replacements. Use re.sub() instead.
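The difference between the two calls can be seen in a short sketch. The sample text is the first line of the question's input, and `TAG` is just an illustrative name.

```python
import re

text = ("this is a paragraph with<[1> in between</[1> and then there are "
        "cases ... where the<[99> number ranges from 1-100</[99>.")

# str.replace looks for the literal four characters "<[*>", which never occur.
literal = text.replace('<[*>', '')

# re.sub treats its first argument as a pattern: optional '/', '[', digits, '>'.
TAG = re.compile(r"</?\[\d+>")
cleaned = TAG.sub("", text)

print(literal == text)  # True, nothing was replaced
print(cleaned)
```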
I would go like this (regex explained in comments):
import re
# If you need to use the regex more than once it is suggested to compile it.
pattern = re.compile(r"</{0,}\[\d+>")
# <\/{0,}\[\d+>
#
# Match the character “<” literally «<»
# Match the character “/” literally «\/{0,}»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»
subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>"""
result = pattern.sub("", subject)
print(result)
If you want to learn more about regex, I recommend reading Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.
The easiest way
import re
txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>'
out = re.sub("(<[^>]+>)", '', txt)
print(out)
The replace method of string objects does not accept regular expressions, only fixed strings (see documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).
You have to use the re module:
import re
newline = re.sub(r"</?\[[0-9]+>", "", line)
You don't have to use a regular expression (for your sample string):
>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'
>>> for w in s.split(">"):
...     if "<" in w:
...         print w.split("<")[0]
...
this is a paragraph with
in between
and then there are cases ... where the
number ranges from 1-100
.
and there are many other lines in the txt files
with
such tags
import os, sys, re, glob
# Match opening and closing tags such as <[1> or </[99>.
pattern = re.compile(r"</?\[\d+>")
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(infile) as reader:
        for line in reader:
            retline = pattern.sub("", line)
            sys.stdout.write(retline)

Use findall and sub functions from regex to search and replace exact string

Based on the question "Replacing a line in a file based on a keyword search, by line from another file", I am having some difficulty with my real file. I want to search for a keyword, "PBUSH" followed by a number (which keeps increasing), and then check whether that keyword is present in the other file. If it is present, the line "PBUSH number K some decimals" in the other file should be replaced with the line found in the first file, keeping the search keyword the same. This continues until the end of the file.
The code I modified (notice the findall and sub patterns) looks like:
import re
path1 = "C:\Users\sony\Desktop\PBUSH1.BDF"
path2 = "C:\Users\sony\Desktop\PBUSH2.BDF"
with open(path1) as f1, open(path2) as f2:
    dat1 = f1.read()
    dat2 = f2.read()
matches = re.findall('^PBUSH \s [0-9] \s K [0-9 ]+', dat1, flags=re.MULTILINE)
for match in matches:
    dat2 = re.sub('^{} \s [0-9] \s K \s'.format(match.split(' ')[0]), match, dat2, flags=re.MULTILINE)
with open(path2, 'w') as f:
    f.write(dat2)
Here my search keyword is PBUSH, spaces, a number, and then the rest follows as shown in the PBUSH lines. I am unable to make it work. What could be the reason?
It is better to use groups in this case and split each line into two parts: one for the matching phrase and the other for the data.
import re
# Use raw strings for the paths; otherwise the \ characters
# would have to be escaped.
input1 = r"C:\Users\sony\Desktop\PBUSH1.BDF"
input2 = r"C:\Users\sony\Desktop\PBUSH2.BDF"
with open(input1) as f1, open(input2) as f2:
    dat1 = f1.read()
    dat2 = f2.read()
# Use finditer instead of findall so that we get
# a match object for each match.
# For each matching line we also have one subgroup containing the
# "PBUSH NNN " part, whereas the whole regex matches until
# the next end of line.
matches = re.finditer(r'^(PBUSH\s+[0-9]+\s+).*$', dat1, flags=re.MULTILINE)
for match in matches:
    # For each match we construct a regex that looks like
    # "^PBUSH 123 .*$", then replace all matches thereof
    # with the contents of the whole line. re.escape keeps any
    # special characters in the key from being read as regex syntax.
    dat2 = re.sub('^{}.*$'.format(re.escape(match.group(1))), match.group(0), dat2, flags=re.MULTILINE)
with open(input2, 'w') as outf:
    outf.write(dat2)
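The same idea can be tried without touching any files. This is a minimal sketch on in-memory strings: the PBUSH lines below are hypothetical, and the real BDF layout may differ. A function is used as the replacement so the new line is inserted literally.

```python
import re

# Hypothetical contents of the two files.
dat1 = "PBUSH   1       K       1.5     2.5\nPBUSH   3       K       9.0     9.0\n"
dat2 = "PBUSH   1       K       0.0     0.0\nPBUSH   2       K       0.0     0.0\n"

for match in re.finditer(r'^(PBUSH\s+[0-9]+\s+).*$', dat1, flags=re.MULTILINE):
    key = re.escape(match.group(1))   # "PBUSH   1       " as a literal prefix
    new_line = match.group(0)         # the full line from the first file
    # Replace every line in dat2 that starts with the same key.
    dat2 = re.sub('^' + key + '.*$', lambda _m: new_line, dat2, flags=re.MULTILINE)

print(dat2)
```

Only the "PBUSH 1" line changes here: "PBUSH 2" has no counterpart in the first string, and "PBUSH 3" does not occur in the second.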

Remove a specific pattern in fasta sequences

I have a fasta file like this,
>IWB12626
AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAA[T/C]GACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA
>IWB49383
CMGCTCATTTCTGCCGGGCTCGATAGCTGCCCTGTTCTTGAGAAGATCTC[A/G]ATTAAGGTGGAGGGCGATCTCCGGACTTGTCCGCGTCCATTTCACGGGTC
I need to remove the square brackets "[]", the "/", and the nucleotide that follows the "/", so basically choosing the first of the two variants. This is my script, but I don't know how to tell the program that one letter must be removed after the "/".
with open('myfile.fasta') as f:
    with open('outfile.fasta', 'w') as out:
        for line in f:
            if line.startswith('>'):
                out.write(line)
            else:
                out.write(line.translate(None, '[/a-z0-9]'))
my expected output,
>IWB12626
AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAATGACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA
>IWB49383
CMGCTCATTTCTGCCGGGCTCGATAGCTGCCCTGTTCTTGAGAAGATCTCAATTAAGGTGGAGGGCGATCTCCGGACTTGTCCGCGTCCATTTCACGGGTC
You could use the re.sub function.
import re
with open('myfile.fasta') as f:
    with open('outfile.fasta', 'w') as out:
        for line in f:
            if line.startswith('>'):
                out.write(line)
            else:
                out.write(re.sub(r'[\[\]]|/.', '', line))
/. matches / together with the character that follows it. [\[\]] is a character class that matches a [ or ] symbol. | is the alternation operator (logical OR), used here to combine the two patterns. Replacing all the matched characters with an empty string gives you the desired output.
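Here is the substitution applied to a shortened, made-up sequence fragment. The second pattern is an alternative of my own, not part of the answer above: it only touches a complete "[X/Y]" group and keeps the first letter through a backreference.

```python
import re

# Hypothetical short fragment in the same format as the question's sequences.
seq = "GGAGAA[T/C]GACATG"

# Pattern from the answer: drop brackets, and drop '/' plus the next character.
cleaned = re.sub(r'[\[\]]|/.', '', seq)

# Stricter alternative (an assumption, not from the answer).
strict = re.sub(r'\[([A-Z])/[A-Z]\]', r'\1', seq)

print(cleaned)  # GGAGAATGACATG
print(strict)   # GGAGAATGACATG
```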

Regex in Python. NOT matches

I'll get straight to the point: I have a string like this (but with thousands of lines)
Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2
and I need to remove the lines that do not match a-z and ąčęėįšųūž plus _ plus any integer (the 3rd and 4th lines match this). This should be case insensitive. I think the regex should be
[a-ząčęėįšųūž]+_\d+ #don't know where to put case insensitive modifier
But what should a regex look like that matches lines that are NOT letters (including Lithuanian letters) plus underscore plus integer? I tried
re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)
but no good.
Thanks in advance, and sorry if my English is not very good.
As to making the matching case insensitive, you can use the I or IGNORECASE flags from the re module, for example when compiling your regex:
regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)
As to removing the lines not matching this regex, you can simply construct a new string consisting of the lines that do match:
new_s = "\n".join(line for line in s.split("\n") if re.match(regex, line))
First of all, given your example inputs, every line ends with underscore + integers, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could land you results like this:
abcdefg_nodigitshere
But you can subfilter that this way:
import re
mydigre = re.compile(r'_\d+$')
myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)
for line in inputs.splitlines():
    if myreg.match(line):
        pass  # do x
    elif mydigre.search(line):  # search, not match: the pattern is only anchored at the end
        pass  # do y
    else:
        pass  # line doesn't end with _\d+
Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.
all_lines = set(inputs.splitlines())
alpha_lines = set(line for line in all_lines if myreg.match(line))
nonalpha_lines = all_lines - alpha_lines
nonalpha_digi_lines = set(line for line in nonalpha_lines if mydigre.search(line))
I'm not sure how Python handles modifiers, but to edit in place, use something like this (case insensitive):
Edit: note that some of these characters are non-ASCII. To use them literally, your editor and language must support this; otherwise use the \u.. codes in the character class (recommended).
s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;
where the regex is r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)',
the replacement is '',
and the modifiers are multiline and global.
Breakdown:
(?i) // case insensitive flag
^ // start of line
(?![a-ząčęėįšųūž]+_\d+(?:\n|$)) // look ahead, not this form of a line ?
.* // ok then select all except newline or eos
(?:\n|$) // select newline or end of string
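In Python, the substitution above might be sketched as follows, assuming Python 3 str patterns (which are Unicode-aware by default). re.MULTILINE stands in for the m modifier, and re.sub already replaces every match, which covers g. The `words` string is the question's sample with a trailing newline added.

```python
import re

words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2\n"

# Negative lookahead: at each line start, match (and remove) the whole line
# unless it consists of letters, an underscore, and digits only.
pattern = re.compile(
    r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)',
    re.MULTILINE,
)

kept = pattern.sub('', words)
print(kept)
```

Only "Achėmos_18" and "Ąžuolas_4" remain, which matches the 3rd and 4th lines the question expects to keep.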
