Remove a specific pattern in fasta sequences - python

I have a fasta file like this,
>IWB12626
AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAA[T/C]GACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA
>IWB49383
CMGCTCATTTCTGCCGGGCTCGATAGCTGCCCTGTTCTTGAGAAGATCTC[A/G]ATTAAGGTGGAGGGCGATCTCCGGACTTGTCCGCGTCCATTTCACGGGTC
I need to remove the square brackets "[ ]", the "/", and the nucleotide that follows the "/", so basically I am choosing the first of the two variants. This is my script, but I don't know how to tell the program that one letter should be removed after the /.
with open('myfile.fasta') as f:
    with open('outfile.fasta', 'w') as out:
        for line in f:
            if line.startswith('>'):
                out.write(line)
            else:
                out.write(line.translate(None, '[/a-z0-9]'))
My expected output:
>IWB12626
AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAATGACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA
>IWB49383
CMGCTCATTTCTGCCGGGCTCGATAGCTGCCCTGTTCTTGAGAAGATCTCAATTAAGGTGGAGGGCGATCTCCGGACTTGTCCGCGTCCATTTCACGGGTC

You could use the re.sub function.
import re

with open('myfile.fasta') as f:
    with open('outfile.fasta', 'w') as out:
        for line in f:
            if line.startswith('>'):
                out.write(line)
            else:
                out.write(re.sub(r'[\[\]]|/.', '', line))
/. matches / and also the character following the forward slash. [\[\]] is a character class that matches the [ or ] symbols. | is the alternation operator (logical OR), used here to combine the two patterns. So replacing every matched character with an empty string gives you the desired output.
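For example, running the substitution over one of the sequence lines from the question (a small standalone demo) keeps only the first of the two variants:
import re

seq = "AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAA[T/C]GACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA"
# The brackets are removed, and '/.' consumes the slash together with the second variant
print(re.sub(r'[\[\]]|/.', '', seq))
# AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAATGACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA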

Related

Python - Edit lines in a list

I have read my text file, which has about 10 lines, into a list. Now I want to cut off the first part of each line.
To be precise, the first 5 words should be cut off.
How exactly do I have to do this?
Edit:
This is how I read my text file into a list:
with open("test.txt", "r") as file:
list = []
for line in file:
list += [line.strip()]
print(list)
If I only have one line, this works for me:
newlist = " ".join(list.split(" ")[5:])
print(newlist)
But how can I do this with a list (10 lines)?
Python has a method split() that splits a string by a delimiter. By default, it splits at every whitespace. Then, to cut the first 5 words, you can either copy all the other items in the list starting from index 5, or delete indexes 0-4 from the list.
Perhaps something along the lines of:
text = []
with open('input.txt') as f:
    for l in f.readlines():
        words = l.split()
        text.append(words[5:])
Obviously you should do all sorts of error checking here but the gist should be there.
If you want to remove the first 5 words from all lines in the file, you don't have to read it line by line and then split it.
You can read the whole file and then use re.sub to remove the first 5 words together with the surrounding spaces.
import re

pattern = r"(?m)^[^\S\n]*(?:\S+[^\S\n]+){4}\S+[^\S\n]*"
with open('test.txt') as f:
    print(re.sub(pattern, "", f.read()))
The pattern matches:
(?m) Inline modifier for multiline
^ Start of a line (because of the multiline modifier)
[^\S\n]* Match optional leading spaces without newlines
(?:\S+[^\S\n]+){4} Repeat 4 times matching 1+ non whitespace chars followed by 1+ spaces
\S+ Match 1+ non whitespace chars
[^\S\n]* Match optional trailing spaces without a newline
See a regex 101 demo for the matches.
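As a quick illustration with made-up text (not the contents of test.txt, which the question does not show), the substitution leaves only the sixth word onward on every line:
import re

# Illustrative sample only; the real input would come from test.txt
sample = "one two three four five six seven\nalpha beta gamma delta epsilon zeta"
pattern = r"(?m)^[^\S\n]*(?:\S+[^\S\n]+){4}\S+[^\S\n]*"
print(re.sub(pattern, "", sample))
# six seven
# zeta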

how to write a regular expression to match a small part of a repeating pattern?

I have the following pattern to match:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For some context, it's part of a larger file, which contains many similar patterns separated by commas:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
My goal is to ditch all patterns ending with 'page' and keep the rest. For that, I'm trying to use
regular expressions to identify those patterns. Here is the one I have come up with so far:
"\(.*?,\'page\'\)"
However, it's not working as expected.
In the following Python code, I use this regex and replace every match with an empty string:
import re
txt = "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub("\(.*?,\'page\'\)", "",txt)
I was expecting new_txt to contain all the patterns ending with 'subcat', with every pattern ending with 'page' removed. However, I obtain:
new_txt = ,,,,
What's happening here? How can I change my regex to obtain the desired result?
We might be tempted to do a regex replacement here, but that would basically always leave open edge cases, as @Wiktor has correctly pointed out in a comment below. Instead, a more foolproof approach is to use re.findall and simply extract every tuple that does not end in 'page'. Here is an example:
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
The regex pattern used above matches a leading number, followed by 5 singly quoted terms, and then a sixth singly quoted term which is not 'page'. The matched tuples in the resulting list are then joined back into a single string.
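If the one-liner is hard to read, the same pattern can be written with re.VERBOSE so each piece carries a comment (a sketch; txt is the concatenated string built in the question):
import re

# The same pattern as above, spelled out with re.VERBOSE
pattern = re.compile(r"""
    \(\d+,             # opening parenthesis and the leading number
    '[^']*?'           # first quoted field
    (?:,'[^']*?'){4}   # four more quoted fields
    ,'(?!page')        # sixth quoted field must not be exactly 'page'
    [^']*?'
    \),?               # closing parenthesis and an optional trailing comma
""", re.VERBOSE)

parts = pattern.findall(txt)
print(''.join(parts))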
What happens is that you concatenate everything into one long string, and each match then removes everything from an opening ( up to the next occurrence of ,'page'), swallowing the 'subcat' entries in between and leaving only the trailing commas.
Another workaround might be to keep the strings in a list and join them with a newline instead of concatenating them.
Then use your pattern, extended to match an optional comma and newline at the end, to remove each matching line, leaving the ones that end with 'subcat'.
import re

lines = [
    "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),",
    "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),",
    "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),",
    "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),",
    "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
    "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
    "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),",
    "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
]
new_txt = re.sub(r"\(.*,'page'\)(?:,\n)?", "", '\n'.join(lines))
print(new_txt)
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Or you can use a list comprehension to keep the lines that do not match the pattern.
result = [line for line in lines if not re.match(r"\(.*,'page'\),?$", line)]
print('\n'.join(result))
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Another option is to match the parts that end with 'page') in the example data (a usage sketch follows after the breakdown):
\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?
The pattern matches:
\(\d+, Match ( followed by 1+ digits and a comma
[^)]* Optionally match any char except )
(?: Non capture group
\)(?!,\s*\(\d+,)[^)]* Only match a ) when not directly followed by the pattern ,\s*\(\d+, which matches the start of the parts in the example data
)* Close group and optionally repeat
,'page'\),? Match ,'page') with an optional comma
Regex demo
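A minimal usage sketch, reusing the lines list shown earlier and accepting that removed entries leave gaps:
import re

pattern = r"\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?"
# Note: the pattern does not consume the trailing newline, so empty lines
# remain where the 'page' entries were removed
print(re.sub(pattern, "", '\n'.join(lines)))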

No luck finding regex pattern python

I am having no luck getting anything from this regex search.
I have a text file that looks like this:
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
I want to extract the lines that begin with "REF*23*" and end with the "~".
txtfile = open(i + fileName, "r")
for line in txtfile:
    line = line.rstrip()
    p = re.findall(r'^REF*23*.+~', line)
    print(p)
But this gives me nothing. As much as I'd like to dig deep into regex with Python, I need a quick solution to this. What I'm eventually after is just the digits between the last "*" and the "~". Thanks
You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and end with "~":
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line)
print(results)
If you need to get the digit chunks:
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line[7:-1])  # Just grab the slice
See non-regex approach demo.
NOTES
In a regex, * must be escaped to match a literal asterisk
Since you read line by line, re.findall(r'^REF*23*.+~', line) makes little sense: the re.findall method is used to get multiple matches, while you expect at most one per line
Your regex is not anchored on the right; you need $ or \Z to match ~ at the end of the line. So, if you want to use a regex, it would look like
m = re.search(r'^REF\*23\*(.*)~$', line)
if m:
    results.append(m.group(1))  # To grab just the contents between delimiters
    # or
    results.append(line)  # To get the whole line
See this Python demo
In your case, you search for lines that start and end with fixed text, so there is no need to use a regex.
Edit as an answer to the comment
Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.
You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:
import re

results = []
with open(fileName, "r") as txtfile:
    for line in txtfile:
        res = re.findall(r'REF\*0F\*(\d+)~', line)
        if len(res):
            results.extend(res)
print(results)
You can entirely use string functions to get only the digits (though a simple regex might be easier to understand, really):
raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""
result = [digits[:-1]
          for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
          for splitted in [line.split("*")]
          for digits in [splitted[-1]]]
print(result)
This yields
['526344060']
* is a special character in regex, so you have to escape it, as @The Fourth Bird points out. You are using a raw string, which means you don't have to escape chars for Python's string parsing, but you still have to escape it for the regex engine.
r'^REF\*23\*.+~'
or
'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine
will work. Having to escape things twice leads to the Leaning Toothpick Syndrome. Using a raw-string means you have to escape once, "saving some trees" in this regard.
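A quick check showing that both spellings denote the exact same pattern string, applied to one of the sample lines from the question:
import re

# The raw string and the doubly escaped string are the same pattern
assert r'^REF\*23\*.+~' == '^REF\\*23\\*.+~'

m = re.search(r'^REF\*23\*.+~', 'REF*23*526344060~')
if m:
    print(m.group(0))  # REF*23*526344060~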
Additional changes
You might also want to throw parens around .+ to capture that part as a group, if you want to extract it. Also change findall to match, unless you expect multiple matches per line.
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        p = re.match(r'^REF\*23\*(.+)~', line)
        if p:
            results.append(int(p.group(1)))
Consider using a regex tester such as this one.

How can I make elements of a list part of a regex?

So I have a list of strings, let's say: my_list = ['hope', 'faith', 'help']
Now I open a text file with the name infile and separate the words with:
for line in infile:
    line_list = line.split()
Now I want to make a regex that I can change by using a for loop, like this:
for word in line_list:
    match = re.findall(word$, line_list)
    print(match)
I've tried several ways to get 'word' into that regex but none seems to work.
Any ideas?
You don't need to use a regular expression. There is the method endswith for the standard type str in Python.
with open('path/name.ext') as infile:
    line_list = infile.readlines()
    for line in line_list:
        match = [word for word in my_list if line.endswith(word)]
        print(match)
This would print out either the matching word or an empty list for every line in the file.
But you can do it with a regular expression if you absolutely want...
import re

pattern = r'({0})$'.format('|'.join(my_list))
for line in line_list:
    match = re.findall(pattern, line)
    print(match)
The search pattern consists of a group containing all the elements from my_list, combined with the logical OR operator |.
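For the example list from the question, the constructed pattern, and a couple of illustrative calls (the test strings here are made up), would look like this:
import re

my_list = ['hope', 'faith', 'help']
pattern = r'({0})$'.format('|'.join(my_list))
print(pattern)                                 # (hope|faith|help)$
print(re.findall(pattern, 'we need help'))     # ['help']
print(re.findall(pattern, 'nothing matches'))  # []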
A regex is just a string which may or may not contain wildcard or special characters. So the best way to "make elements of a list part of a regex" is to 'write' the regex:
my_list = ['hope', 'faith', 'help']
for regex_el in my_list:
    regex = "{0:s}".format(regex_el)
    print(regex)
Of course that is overly simplistic. That's just using a plain string as a regex. You could have small regular expressions to bolt into the larger regex, or you could surround the element from the list with other parts of a regex:
regex = "^ *{0:s} ".format(regex_el)
This would construct a regex that finds your word only if it is the first word in a string, preceded by zero or more spaces and followed by a space.
Then, in your code, replace the 'word' in your call to findall with the 'regex' constructed above.
You will also need to replace the line_list in your call to findall, since findall expects a pattern (be that a simple string or a genuine regex) and a string in which to search (that could be word in your loop, or line from the loop over the lines in the file).
Also note that print(match) will print an empty list if no match was found. You may wish to replace that with
if match:
    print(match)
to only print words from the line which match your constructed regex.
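Putting those pieces together, a minimal sketch of the loop might look like this (the filename is a placeholder):
import re

my_list = ['hope', 'faith', 'help']

with open('infile.txt') as infile:  # placeholder filename
    for line in infile:
        for regex_el in my_list:
            regex = "^ *{0:s} ".format(regex_el)  # word as the first word of the line
            match = re.findall(regex, line)
            if match:
                print(match)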
Could I recommend you check out this website: https://regex101.com/ to experiment with regexes and the strings you're applying them to.

Use findall and sub functions from regex to search and replace exact string

Based on this forum post, Replacing a line in a file based on a keyword search, by line from another file, I am having a little difficulty with my real file. As shown in the picture below, I want to search for a keyword, "PBUSH followed by a number (which keeps increasing)", and based on that keyword check whether it is present in the other file. If it is present, then replace the data of the line "PBUSH number K some decimals" with the found line in the other file, keeping the search keyword the same. It keeps going until the end of the file, which looks like:
And the code I modified (note the findall and sub format) looks like:
import re

path1 = "C:\Users\sony\Desktop\PBUSH1.BDF"
path2 = "C:\Users\sony\Desktop\PBUSH2.BDF"

with open(path1) as f1, open(path2) as f2:
    dat1 = f1.read()
    dat2 = f2.read()
    matches = re.findall('^PBUSH \s [0-9] \s K [0-9 ]+', dat1, flags=re.MULTILINE)
    for match in matches:
        dat2 = re.sub('^{} \s [0-9] \s K \s'.format(match.split(' ')[0]), match, dat2, flags=re.MULTILINE)

with open(path2, 'w') as f:
    f.write(dat2)
Here my search keyword is PBUSH, spaces, a number, and then the rest follows as shown in the PBUSH lines. I am unable to make it work. What could be the possible reason?
It is better to use groups in such a case and separate the whole string into two parts, one for the matching phrase and the other for the data.
import re

# must use raw strings for paths, otherwise we need to
# escape \ characters
input1 = r"C:\Users\sony\Desktop\PBUSH1.BDF"
input2 = r"C:\Users\sony\Desktop\PBUSH2.BDF"

with open(input1) as f1, open(input2) as f2:
    dat1 = f1.read()
    dat2 = f2.read()

# use finditer instead of findall so that we will get
# a match object for each match.
# For each matching line we also have one subgroup, containing the
# "PBUSH NNN " part, whereas the whole regex matches until
# the next end of line
matches = re.finditer(r'^(PBUSH\s+[0-9]+\s+).*$', dat1, flags=re.MULTILINE)
for match in matches:
    # for each match we construct a regex that looks like
    # "^PBUSH 123 .*$", then replace all matches thereof
    # with the contents of the whole line
    dat2 = re.sub('^{}.*$'.format(match.group(1)), match.group(0), dat2, flags=re.MULTILINE)

with open(input2, 'w') as outf:
    outf.write(dat2)
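To see the mechanics without the files, here is a tiny in-memory illustration of the same finditer/sub idea (the PBUSH values below are invented, not taken from the real .BDF files):
import re

dat1 = "PBUSH 101 K 1.5 2.5\nPBUSH 202 K 9.9 8.8"  # invented sample data
dat2 = "PBUSH 101 K 0.0 0.0\nPBUSH 303 K 7.7 6.6"

for match in re.finditer(r'^(PBUSH\s+[0-9]+\s+).*$', dat1, flags=re.MULTILINE):
    # lines in dat2 starting with the same "PBUSH NNN " prefix are replaced
    # by the corresponding full line from dat1
    dat2 = re.sub('^{}.*$'.format(match.group(1)), match.group(0),
                  dat2, flags=re.MULTILINE)

print(dat2)
# PBUSH 101 K 1.5 2.5
# PBUSH 303 K 7.7 6.6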
