Splitting FASTA sequence using Python - python

The following error appears when I try to run the script below.
Background: I am trying to split up a large FASTA file (~45 MB) into smaller files based on gene ID. I would like to chop it every time ">" appears. The .py script below lets me do so; however, every now and then I get the error shown further down. Any feedback would be greatly appreciated.
Script:
import os
os.chdir("/vmb/Flavia_All/Python_Commands")
outfile = os.chdir("/vmb/Flavia_All/Python_Commands")
import sys

infile = open(sys.argv[1])
outfile = []
for line in infile:
    if line.startswith(">"):
        if (outfile != []): outfile.close()
        genename = line.strip().split('|')[1]
        filename = genename + ".fasta"
        outfile = open(filename, 'w')
        outfile.write(line)
    else:
        outfile.write(line)
outfile.close()
Error Message when script is run:
Traceback (most recent call last):
  File "splitting_fasta.py", line 14, in <module>
    outfile = open(filename,'w')
IOError: [Errno 2] No such file or directory: 'AY378100.1_cds_AAR07818.1_173 [gene=pbrB/pbrC] [protein=PbrB/PbrC] [protein_id=AAR07818.1] [location=complement(152303..153451)].fasta'
*NOTE: AY378100.1_cds_AAR07818.1 is one of many genes in this FASTA file, and it is not the only gene the message has appeared for. I would like to stop having to delete every gene that triggers this message.

It seems that some of the FASTA "names" (the first description line, the one starting with >) contain too much. In particular, they contain characters that are not allowed in file names (the "/" in pbrB/pbrC, for example, is treated as a directory separator).
If names like the GenBank ID - AY378100 - are sufficiently unambiguous, then:
genename = line.strip().split('|')[1].split('.')[0]
may be OK. If you have many FASTA sequences with the same ID, you will probably be OK with:
genename = line.strip().split('|')[1].split('[')[0].strip()
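For reference, here is a minimal sketch of the whole splitting loop using the trimmed gene name; the extra re.sub call, which replaces any remaining characters not allowed in file names with underscores, is an assumption beyond the lines above:
import re
import sys

infile = open(sys.argv[1])
outfile = None
for line in infile:
    if line.startswith(">"):
        if outfile is not None:
            outfile.close()
        # keep only the part of the header before the first '[' block
        genename = line.strip().split('|')[1].split('[')[0].strip()
        # replace anything that is not a letter, digit, dot, dash or underscore
        genename = re.sub(r'[^\w.-]', '_', genename)
        outfile = open(genename + ".fasta", 'w')
    outfile.write(line)
if outfile is not None:
    outfile.close()
infile.close()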

Related

How to find first and last characters in a file using python?

I am stuck on this revision exercise, which asks me to copy an input file to an output file and return its first and last letters.
def copy_file(filename):
    input_file = open(filename, "r")
    content = input_file.read()
    content[0]
    content[1]
    return content[0] + content[-1]
    input_file.close()
Why do I get an error message when I try to get the first and last letters? And how would I copy the file to the output file?
Here is the test:
input_f = "FreeAdvice.txt"
first_last_chars = copy_file(input_f)
print(first_last_chars)
print_content('cure737.txt')
Error Message:
FileNotFoundError: [Errno 2] No such file or directory: 'hjac737(my username).txt'
All the code after a return statement is never executed; a proper code editor would highlight that for you, so I recommend you use one. So the file was never closed. A good practice is to use a context manager for that: it will automatically call close for you, even in case of an exception, when you exit the scope (indentation level).
The code you provided also never writes the content to an output file, which may be causing the error you reported.
I explicitly used the "rt" (and "wt") modes for the files (although they are the defaults), because we want the first and last character of the file, so it supports Unicode (any character, not just ASCII).
def copy_file(filename):
    with open(filename, "rt") as input_file:
        content = input_file.read()
    print(input_file.closed)  # True

    my_username = "LENORMJU"
    output_file_name = my_username + ".txt"
    with open(output_file_name, "wt") as output_file:
        output_file.write(content)
    print(output_file.closed)  # True

    # last: return the result
    return content[0] + content[-1]

print(copy_file("so67730842.py"))
When I run this script (on itself), the file is copied and I get the output "d)", which is correct.

Bulk autoreplacing string in the KML file

I have a set of placemarks, each of which includes quite a wide description shown in its balloon within the description property. Each single description (a former column header) is wrapped in its own tag. Because of the shapefile restriction of field names to 10 characters only:
https://gis.stackexchange.com/questions/15784/bypassing-10-character-limit-of-field-name-in-shapefiles
I have to retype most of these names manually.
Obviously, I use Notepad++, where I can swiftly press Ctrl+F and toggle Replace mode, as you can see below.
The green bounded strings were already replaced, the red ones still remain.
Basically, if I press "Replace All", it works fine and quickly. Unfortunately, I have to go one string at a time, and as you can see I have around 20 separate strings to "Replace All". Is there a way to do this quicker? Because all the .kml files are similar to each other, the replacements will be the same everywhere. I need a tool that can auto-replace these headers that were cut by the 10-character limit. I think Python tools might be helpful here.
https://pythonhosted.org/pykml/
But in the tool above there is no information about bulk KML editing.
How can I set up something like the "Replace All" tool for all of my strings at once, if possible?
UPDATE:
I tried the code below:
files = []
with open("YesNF016.kml") as f:
    for line in f.readlines():
        if line[-1] == '\n':
            files.append(line[:-1])
        else:
            files.append(line)

old_expression = 'ab'
new_expression = 'it worked'

for file in files:
    new_file = ""
    with open(file) as f:
        for line in f.readlines():
            new_file += line.replace(old_expression, new_expression)
    with open(file, 'w') as f:
        f.write(new_file)
The debugger shows:
[Errno 22] Invalid argument: ''
  File "\test.py", line 13, in
    with open(file) as f:
whereas line 13 is:
with open(file) as f:
The solutions here:
https://www.reddit.com/r/learnpython/comments/b9cljd/oserror_while_using_elementtree_to_parse_simple/
and
OSError: [Errno 22] Invalid argument Getting invalid argument while parsing xml in python
weren't helpful enough for me.
So you want to replace all occurrences of X with Y in a bunch of files?
Pretty easy.
Just create a file_list.txt containing the list of files to edit.
Python code:
files = []
with open("file_list.txt") as f:
    for line in f.readlines():
        if line[-1] == '\n':
            files.append(line[:-1])
        else:
            files.append(line)

old_expression = 'ab'
new_expression = 'it worked'

for file in files:
    new_file = ""
    with open(file) as f:
        for line in f.readlines():
            new_file += line.replace(old_expression, new_expression)
    with open(file, 'w') as f:
        f.write(new_file)
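If the goal is the roughly 20 header replacements mentioned in the question rather than a single old/new pair, one way to extend the same idea is a dictionary of replacements applied to every .kml file in the folder. This is only a sketch: the truncated/full header names below are made-up placeholders, not taken from your data.
import glob

# hypothetical mapping of truncated 10-character headers to their full names
replacements = {
    "Descriptio": "Description",
    "Populati_1": "Population 2010",
}

for path in glob.glob("*.kml"):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for old, new in replacements.items():
        text = text.replace(old, new)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)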

Parse multiple log files for strings

I'm trying to parse a number of log files from a log directory, to search for any number of strings in a list along with a server name. I feel like I've tried a million different options, and I have it working fine with just one log file, but when I try to go through all the log files in the directory I can't seem to get anywhere.
if args.f:
    logs = args.f
else:
    try:
        logs = glob("/var/opt/cray/log/p0-current/*")
    except IndexError:
        print "Something is wrong. p0-current is not available."
        sys.exit(1)

valid_errors = ["error", "nmi", "CATERR"]

logList = []
for log in logs:
    logList.append(log)

#theLog = open("logList")
#logFile = log.readlines()
#logFile.close()
#printList = []
#for line in logFile:
#    if (valid_errors in line):
#        printList.append(line)
#
#for item in printList:
#    print item

#with open("log", "r") as tmp_log:
#    open_log = tmp_log.readlines()
#    for line in open_log:
#        for down_nodes in open_log:
#            if valid_errors in open_log:
#                print valid_errors
down_nodes is a pre-filled list further up the script containing a list of servers which are marked as down.
Commented out are some of the various attempts I've been working through.
logList = []
for log in logs:
logList.append(log)
I thought this might be the way forward: put each individual log file in a list, then loop through the list and use open() followed by readlines(). But I'm missing some kind of logic here... maybe I'm not thinking about it correctly.
I could really do with some pointers here please.
Thanks.
So your last for loop is redundant because logs is already a list of strings. With that information, we can iterate through logs and do something for each log.
for log in logs:
    with open(log) as f:
        for line in f.readlines():
            if any(error in line for error in valid_errors):
                #do stuff
The line if any(error in line for error in valid_errors): checks whether any of the errors in valid_errors appear in the line. The argument to any() is a generator expression that evaluates error in line for each error in valid_errors.
To answer your question involving down_nodes, I don't believe you should include this in the same any(). You should try something like
if any(error in line for error in valid_errors) and \
   any(node in line for node in down_nodes):
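Putting the two checks together, a minimal sketch might look like the following; the node names in down_nodes are placeholders, since that list is built earlier in your own script:
from glob import glob

valid_errors = ["error", "nmi", "CATERR"]
logs = glob("/var/opt/cray/log/p0-current/*")  # as in the question
down_nodes = ["nid00012", "nid00013"]          # placeholder; filled earlier in your script

for log in logs:
    with open(log) as f:
        for line in f:
            if any(error in line for error in valid_errors) and \
               any(node in line for node in down_nodes):
                print(line.rstrip())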
First, you need to find all the logs:
import os
import fnmatch

def find_files(pattern, top_level_dir):
    for path, dirlist, filelist in os.walk(top_level_dir):
        for name in fnmatch.filter(filelist, pattern):
            yield os.path.join(path, name)
For example, to find all *.txt files in current dir:
txtfiles = find_files('*.txt', '.')
Then get file objects from the names:
def open_files(filenames):
    for name in filenames:
        yield open(name, 'r', encoding='utf-8')
Finally, get the individual lines from the files:
def lines_from_files(files):
    for f in files:
        for line in f:
            yield line
Since you want to find some errors the check could look like this:
import re
def find_errors(lines):
pattern = re.compile('(error|nmi|CATERR)')
for line in lines:
if pattern.search(line):
print(line)
You can now process a stream of lines generated from a given directory:
txt_file_names = find_files('*.txt', '.')
txt_files = open_files(txt_file_names)
txt_lines = lines_from_files(txt_files)
find_errors(txt_lines)
The idea of processing logs as a stream of data comes from a talk by David Beazley.

IO Error 22 python

infile1 = open("D:/p/non_rte_header_path.txt", "r")
infile2 = open("D:/p/fnsinrte.txt", "r")
for line in infile1:
    for item in infile2:
        eachfile = open(line, "r")
For the above code I am getting the error below. infile1 contains the paths of many files, like D:/folder/Src/em.h, but a \n is automatically added at the end of each path. I am not sure why this happens. Please help.
IOError: [Errno 22] invalid mode ('r') or filename: 'D:/folder/Src/em.h\n'
Everyone has provided comments telling you what the problem is, but if you are a beginner you probably don't understand why it's happening, so I'll explain that.
Basically, when a file is written, each new line (where the Enter key was pressed) is represented by a "\n".
As you read the file, it reads line by line, but unless you remove the "\n", your line variable will read
thethingsonthatline\n
This can be useful to see if a file contains multiple lines, but here you'll want to get rid of it. Edchum and alvits have given a good way of doing this!
Your corrected code would be:
infile1 = open("D:/p/non_rte_header_path.txt", "r")
infile2 = open("D:/p/fnsinrte.txt", "r")
for line in infile1:
    for item in infile2:
        eachfile = open(line.rstrip('\n'), "r")
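As an aside (not part of the answer above), the same idea with context managers and a full strip(), which also removes a trailing "\r" if the path list was written on Windows, would look roughly like this:
with open("D:/p/non_rte_header_path.txt", "r") as infile1:
    paths = [line.strip() for line in infile1 if line.strip()]

for path in paths:
    with open(path, "r") as eachfile:
        contents = eachfile.read()  # process each header file here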

Confusing Error when Reading from a File in Python

I'm having a problem opening the names.txt file. I have checked that I am in the correct directory. Below is my code:
import os
print(os.getcwd())

def alpha_sort():
    infile = open('names', 'r')
    string = infile.read()
    string = string.replace('"','')
    name_list = string.split(',')
    name_list.sort()
    infile.close()
    return 0
alpha_sort()
And the error I got:
FileNotFoundError: [Errno 2] No such file or directory: 'names'
Any ideas on what I'm doing wrong?
You mention in your question body that the file is "names.txt", however your code shows you trying to open a file called "names" (without the ".txt" extension). (Extensions are part of filenames.)
Try this instead:
infile = open('names.txt', 'r')
As a side note, make sure that when you open files you use universal newline mode, as Windows and Mac/Unix have different representations of line endings (\r\n vs \n, etc.). Universal mode gets Python to handle this for you, so it's generally a good idea to use it whenever you need to read a file. (EDIT - should read: a text file, thanks cameron)
So the code would just look like this
infile = open('names.txt', 'rU')  # the capital U indicates opening the file in universal newline mode
This doesn't solve that issue, but you might consider using with when opening files:
with open('names', 'r') as infile:
    string = infile.read()
    string = string.replace('"','')
    name_list = string.split(',')
    name_list.sort()
    return 0
This closes the file for you and handles exceptions as well.
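Putting both answers together, a sketch of the fixed function might open "names.txt" (with the extension) inside a with block and return the sorted list instead of 0; returning the list is an assumption about what the exercise wants:
def alpha_sort():
    with open('names.txt', 'r') as infile:
        string = infile.read()
    name_list = string.replace('"', '').split(',')
    name_list.sort()
    return name_list

print(alpha_sort())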
