I am trying to modify my .fasta files from this:
>YP_009208724.1 hypothetical protein ADP65_00072 [Achromobacter phage phiAxp-3]
MSNVLLKQ...
>YP_009220341.1 terminase large subunit [Achromobacter phage phiAxp-1]
MRTPSKSE...
>YP_009226430.1 DNA packaging protein [Achromobacter phage phiAxp-2]
MMNSDAVI...
to this:
>Achromobacter phage phiAxp-3
MSNVLLKQ...
>Achromobacter phage phiAxp-1
MRTPSKSE...
>Achromobacter phage phiAxp-2
MMNSDAVI...
I already have a script that does this for a single file:
with open('Achromobacter.fasta', 'r') as fasta_file:
    out_file = open('./fastas3/Achromobacter.fasta', 'w')
    for line in fasta_file:
        line = line.rstrip()
        if '[' in line:
            line = line.split('[')[-1]
            out_file.write('>' + line[:-1] + "\n")
        else:
            out_file.write(str(line) + "\n")
but I can't automate the process for all 120 files in my folder.
I tried using glob.glob, but I can't seem to make it work:
import glob
for fasta_file in glob.glob('*.fasta'):
    outfile = open('./fastas3/'+fasta_file, 'w')
    with open(fasta_file, 'r'):
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                line2 = line.split('[')[-1]
                outfile.write('>' + line2[:-1] + "\n")
            else:
                outfile.write(str(line) + "\n")
It gives me this output:
A
c
i
n
e
t
o
b
a
c
t
e
r
.
f
a
s
t
a
I managed to get a list of all files in the folder, but I can't open individual files using the entries in the list.
import os
file_list = []
for file in os.listdir("./fastas2/"):
    if file.endswith(".fasta"):
        file_list.append(file)
Since you can already change the contents of a single file, all that is left is to automate the process. We turned the single-file code into a function and removed the second file handle, so the file is opened only once.
def file_changer(filename):
    data_to_put = ''
    with open(filename, 'r+') as fasta_file:
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                line = line.split('[')[-1]
                data_to_put += '>' + line[:-1] + "\n"
            else:
                data_to_put += line + "\n"
        # rewind and truncate so the new text replaces the old content
        # instead of being appended after it
        fasta_file.seek(0)
        fasta_file.write(data_to_put)
        fasta_file.truncate()
Now we need to iterate over all your files, so let's use the glob module for that:
import glob
for file in glob.glob('*.fasta'):
    file_changer(file)
You are iterating over the file name, which gives you the characters of the name instead of the lines of the file. Here is a corrected version of the code:
import glob
for fasta_file_name in glob.glob('*.fasta'):
    with open(fasta_file_name, 'r') as fasta_file, \
            open('./fastas3/' + fasta_file_name, 'w') as outfile:
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                line2 = line.split('[')[-1]
                outfile.write('>' + line2[:-1] + "\n")
            else:
                outfile.write(str(line) + "\n")
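The header rewrite can also be pulled into a small helper, which makes it easy to check on one line before running it over 120 files. This is only a sketch: organism_header and convert_folder are illustrative names that do not appear in the original script, and it assumes the organism always sits in a trailing [...] group. Headers without brackets are passed through unchanged.

```python
from pathlib import Path


def organism_header(line):
    """Return '>organism' for a FASTA header ending in '[organism]'.

    Sequence lines and headers without brackets are returned unchanged
    (minus the trailing newline).
    """
    line = line.rstrip()
    if line.startswith('>') and '[' in line and line.endswith(']'):
        # keep the text after the last '[' and drop the closing ']'
        return '>' + line.rsplit('[', 1)[-1][:-1]
    return line


def convert_folder(src_dir, dst_dir):
    """Rewrite every *.fasta file in src_dir into dst_dir (created if missing)."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src in Path(src_dir).glob('*.fasta'):
        with src.open() as fin, (dst / src.name).open('w') as fout:
            for line in fin:
                fout.write(organism_header(line) + '\n')


# Example call (folder names taken from the question):
# convert_folder('.', './fastas3')
```

Creating the output folder up front avoids the FileNotFoundError that open() raises when ./fastas3/ does not exist yet.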
As an alternative to the Python script, you can simply use sed from the command line:
sed -i 's/^>.*\[\(.*\)\].*$/>\1/' *.fasta
This will modify all files in place, so consider copying them first.
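If you would rather stay in Python, the same substitution can be expressed with the re module. The sketch below mirrors the sed pattern above; strip_header is a hypothetical helper name, and lines that do not match are returned untouched.

```python
import re

# Same idea as the sed expression: on a header line, keep only the text
# inside the last [...] group.
HEADER = re.compile(r'^>.*\[(.*)\].*$')


def strip_header(line):
    """Apply the substitution to one line; non-matching lines are unchanged."""
    return HEADER.sub(r'>\1', line)
```

Because '.' does not match a newline, a trailing newline on the input line is preserved.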
Related
I'm creating a program that should create a file (.txt) based on each line of 'clouds.txt'. This is my code:
def CreateFile():
    global file_name
    f = open(file_name,"w+")
    f.write(list_email + ":")
    f.close()
def WriteInConfig():
    f = open("config/config.txt","a")
    f.write(list_name + "\n")
    f.close()
with open("clouds.txt","r") as f:
    list_lines = sum(1 for line in open('clouds.txt'))
    lines = f.readline()
    for line in lines:
        first_line = f.readline().strip()
        list_email = first_line.split('|')[1] #email
        print("Email: " + list_email)
        list_pass = first_line.split('|')[2] #pass
        print("Pass: " + list_pass)
        list_name = first_line.split('|')[3] #name
        print(list_name)
        global file_name
        file_name = "config/." + list_name + ".txt"
        with open('clouds.txt', 'r') as fin:
            data = fin.read().splitlines(True)
        with open('clouds.txt', 'w') as fout:
            fout.writelines(data[1:])
        CreateFile()
        WriteInConfig()
The clouds.txt file looks like this:
>|clouds.n1c0+mega01#gmail.com|cwSHklDIybllCD1OD4M|Mega01|15|39.91|FdUkLiW0ThDeDkSlqRThMQ| |x
|clouds.n1c0+mega02#gmail.com|tNFVlux4ALC|Mega02|50|49.05|lq1cTyp13Bh9-hc6cZp1RQ|xxx|x
|clouds.n1c0+mega03#gmail.com|7fe4196A4CUT3V|Mega03|50|49.94|BzW7NOGmfhQ01cy9dAdlmg|xxx|xxx >
Everything works fine until 'Mega48'. There I get "IndexError: list index out of range"
>|clouds.n1c0+mega47#gmail.com|bd61t9zxcuC1Yx|Mega47|50|10|Xjff6C8mzEqpa3VcaalUuA|xxx|x
|clouds.n1c0+mega48#gmail.com|kBdnyB6i0PUyUb|Mega48|50|0|R6YfuGP2hvE-uds0ylbQtQ|xxx|x
|clouds.n1c0+mega49#gmail.com|OcAdgpS4tmSLTO|Mega49|50|28.65|xxx| >
I checked and there are no spaces/other characters. As you can see, after creating the file, the program deletes the line. After the error, if I start the program again (so that it starts from 'Mega47'), it doesn't show the error, and everything works as planned.
Any ideas how to fix this?
I see several mistakes in your code. First, what is list_lines = sum(1 for line in open('clouds.txt')) for? It is never used.
The problem is in your for loop. lines = f.readline() makes lines the first line of the file, which is a string, so for line in lines iterates over each character of that line. Each pass then calls f.readline() again, and because the first line has more characters than the file has lines, readline() eventually returns an empty string and the split('|')[1] lookup raises IndexError.
[edited]
You don't need to know the number of lines in the file to write the loop. Just use for line in f:. The current line is then already in the variable line, so there is no need to read it again with readline().
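The advice above amounts to iterating over the file object directly and splitting each line once. A minimal sketch, assuming the pipe-separated layout shown in the question (email, password, and name in fields 1 to 3); parse_account and read_accounts are hypothetical helpers, and lines with too few fields are skipped rather than raising IndexError.

```python
def parse_account(line):
    """Split one '|'-separated line into (email, password, name).

    Returns None for blank or malformed lines instead of raising IndexError.
    """
    fields = line.strip().split('|')
    if len(fields) < 4:
        return None
    return fields[1], fields[2], fields[3]


def read_accounts(path):
    """Yield (email, password, name) for every well-formed line in the file."""
    with open(path) as f:
        for line in f:  # the file object yields one line per iteration
            account = parse_account(line)
            if account is not None:
                yield account
```

Reading the whole file in one pass like this also removes the need to rewrite clouds.txt after every line.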
What I want to do:
a) open all files in directory (in this case: chapters from long stories)
b) remove all empty lines
c) find sentences starting with "- " (in this case: dialogues)
I was able to create code that works well, but only for one file:
file = open('.\\stories\\test\\01.txt', 'r', encoding="utf-16 LE")
string_with_empty_lines = file.read()
lines = string_with_empty_lines.split("\n")
non_empty_lines = [line for line in lines if line.strip() != ""]
string_without_empty_lines = ""
for line in non_empty_lines:
    if line.startswith('- '):
        string_without_empty_lines += line + "\n"
print(string_without_empty_lines)
I got stuck at this point because I have a lot of files and I want to open them all and print the results from all of them (and probably save all results to one file, but that's not necessary right now). The first part of the new code successfully opens the files (checked with the commented-out print line), but when I add the editing part, nothing happens at all (I don't even get errors in the console).
import os
import glob
folder_path = os.path.join('G:' '.\\stories\\test')
for filename in glob.glob(os.path.join(folder_path, '**', '*.txt'), recursive=True):
    with open(filename, 'r', encoding="utf-16 LE") as f:
        string_with_empty_lines = f.read()
        # print(string_with_empty_lines)
        lines = string_with_empty_lines.split("\n")
        non_empty_lines = [line for line in lines if line.strip() != ""]
        string_without_empty_lines = ""
        for line in non_empty_lines:
            if line.startswith("- "):
                string_without_empty_lines += line + "\n"
        print(string_without_empty_lines)
If you have your source files in source_dir and want to write the processed files to target_dir, you can do it like this:
import os
from os import path
source_dir = "source_dir"
target_dir = "target_dir"
# os.listdir returns the names of the entries in the given directory
# (it works the same way on Linux, macOS, and Windows)
filenames = os.listdir(source_dir)
for filename in filenames:
    # get full path of source and target file
    filepath_source = path.join(source_dir, filename)
    filepath_target = path.join(target_dir, filename)
    # open source file and target file
    with open(filepath_source) as f_source, open(filepath_target, 'w') as f_target:
        for line in f_source:
            if len(line.strip()) == 0:
                continue
            if line[0] == '-':
                # do something
                f_target.write(line)
Here is an example for a single file. If there are more files, you can wrap it in something like
for file in dir: with open(file) ...., but remember that you would then also have to change the target file for each one.
with open('source.txt') as source:
    with open('target.txt','w') as target:
        for line in source:
            l = line.strip('\n')
            # skip empty lines first, so that l[0] below cannot raise IndexError
            if len(l) == 0:
                continue
            # identify if the 1st char is '-'
            if l[0] == '-':
                # do something, e.g. add 'dialog' at the beginning...
                pass
            # rewrite to target file
            target.write(l + '\n')
# no explicit close() calls are needed: the with blocks close both files
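For the case in the question, where the dialogue lines from every chapter should end up in one place, the per-file logic can be wrapped in a function and called for each path that glob returns. This is a sketch under a few assumptions: dialogue_lines and collect_dialogues are made-up names, the encoding is spelled 'utf-16-le', and a leading byte-order mark is stripped so that a first line starting with "- " is still recognised.

```python
import glob
import os


def dialogue_lines(text):
    """Return the lines of text that start with '- ' (empty lines never match)."""
    text = text.lstrip('\ufeff')  # drop a byte-order mark if one was decoded
    return [line for line in text.splitlines() if line.startswith('- ')]


def collect_dialogues(folder_path, encoding='utf-16-le'):
    """Gather dialogue lines from every .txt file under folder_path, recursively."""
    results = []
    pattern = os.path.join(folder_path, '**', '*.txt')
    for filename in sorted(glob.glob(pattern, recursive=True)):
        with open(filename, 'r', encoding=encoding) as f:
            results.extend(dialogue_lines(f.read()))
    return results


# Example call (path is a placeholder):
# print('\n'.join(collect_dialogues('stories/test')))
```

Returning a list instead of printing inside the loop makes it straightforward to write all results to a single file later.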
I am trying to loop over the lines in a file and create multiple directories. My script works only for the first line of the list in the file. Here is my script. I have attached an image of the list as well; it is the same for both list_bottom.dat and list_top.dat.
import os
f = open("list_top.dat", "r")
g = open("list_bottom.dat", "r")
for lines in f:
    m_top = lines.split()[0]
    m_bot = lines.split()[0]
    os.mkdir(m_top)
    os.chdir(m_top)
    for lines in g:
        print(lines)
        m_bot = lines.split()[0]
        print(m_bot)
        os.mkdir(m_top + "_" + m_bot)
        os.chdir(m_top + "_" + m_bot)
        for angle in range(5):
            os.mkdir(m_top + "_" + "m_bot" + "_angle_" + str(angle))
            os.chdir(m_top + "_" + "m_bot" + "_angle_" + str(angle))
            os.chdir("../")
        os.chdir("../")
    os.chdir("../")
os.chdir("../")
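A file object can only be iterated once: after the inner loop has read g to the end, it yields nothing on later passes of the outer loop, which would explain why only the first line of list_top.dat is processed. Read the lines into a list first and loop over that instead:
with open("file.txt") as f:
    lines = f.readlines()
for line in lines:
    do_stuff()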
(for readability i don't post this as a comment, but that's a comment)
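Putting the lines into lists first also lets the nested loops run for every combination. A rough sketch, assuming the intended layout is top/top_bottom/top_bottom_angle_N and that the quoted "m_bot" in the question was meant to be the variable m_bot; read_names and make_tree are hypothetical names.

```python
import os


def read_names(path):
    """Return the first whitespace-separated token of every non-blank line."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]


def make_tree(top_names, bottom_names, n_angles=5, root='.'):
    """Create root/top/top_bottom/top_bottom_angle_N for every combination."""
    for m_top in top_names:
        for m_bot in bottom_names:
            pair = m_top + '_' + m_bot
            for angle in range(n_angles):
                leaf = pair + '_angle_' + str(angle)
                # makedirs builds the intermediate directories as well,
                # so no os.chdir calls are needed
                os.makedirs(os.path.join(root, m_top, pair, leaf), exist_ok=True)


# Example call (file names taken from the question):
# make_tree(read_names('list_top.dat'), read_names('list_bottom.dat'))
```

Building full paths with os.path.join avoids the bookkeeping of matching every os.chdir with a later os.chdir("../").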
I have a Python file named file_1.py.
It contains some code in which I just have to change the word "file_1" to "file_2",
preserving the indentation of the other functions,
and save it as file_2.py.
There are 3 occurrences of the word file_1.
I have to do this 100 times: `file_1.py, file_2.py.....file_100.py`
Is there any way to automate this?
Run this script:
import fileinput
with fileinput.FileInput('file_1.py', inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace('file_1', 'file_2'), end='')
Hope this helps :)
Create a script.
First, read the file:
with open("./file1.py") as f:
    content = f.read()
Second, replace the filename:
new_content = content.replace("file1","file2")
Third, write the new file (I would suggest writing a new file rather than overwriting the original):
with open("./file2.py", "w") as f:
    f.write(new_content)
If you have multiple files, use something like:
for i in range(1, 100):
    filename = "file" + str(i)
    # build the next name from the number itself, so multi-digit names work
    new_filename = "file" + str(i + 1)
    with open(filename + ".py") as f:
        content = f.read()
    new_content = content.replace(filename, new_filename)
    with open("./another_folder/" + new_filename + ".py", "w") as f:
        f.write(new_content)
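If only file_1.py exists and the other 99 files are to be generated from it, the template can be read once and written out under each new name. A minimal sketch, assuming the names follow the file_N pattern from the question; make_copies and its parameters are made up for illustration.

```python
import os


def make_copies(template_path, start=2, stop=100, old_word='file_1', out_dir='.'):
    """Write file_<n>.py for n in start..stop, replacing old_word with file_<n>."""
    with open(template_path) as f:
        template = f.read()  # read once; str.replace leaves indentation untouched
    os.makedirs(out_dir, exist_ok=True)
    for n in range(start, stop + 1):
        new_word = 'file_' + str(n)
        with open(os.path.join(out_dir, new_word + '.py'), 'w') as f:
            f.write(template.replace(old_word, new_word))


# Example call:
# make_copies('file_1.py')
```

Note that a plain replace would also touch longer words that begin with file_1 (for example file_10), so this only suits a template where file_1 appears on its own.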
I have a folder full of .mpt files, each of them having the same data format.
I need to delete the first 57 lines from all files and append these files into one csv - output.csv.
I already have this part:
import glob
import os
dir_name = 'path name'
lines_to_ignore = 57
input_file_format = '*.mpt'
output_file_name = "output.csv"
def convert():
    files = glob.glob(os.path.join(dir_name, input_file_format))
    with open(os.path.join(dir_name, output_file_name), 'w') as out_file:
        for f in files:
            with open(f, 'r') as in_file:
                content = in_file.readlines()
                content = content[lines_to_ignore:]
                for i in content:
                    out_file.write(i)
print("working")
convert()
print("done")
This part works ok.
How do I add the filename of each .mpt file as the last column of output.csv?
Thank you!
This is a quick 'n dirty solution.
In this loop the variable i is just a string (a line from a CSV file):
for i in content:
    out_file.write(i)
So you just need to 1) strip off the end-of-line character(s) (either "\n" or "\r\n") and 2) append "," followed by the name of the input file, which is held in the variable f.
If you're using Unix, try:
for i in content:
    i = i.rstrip("\n") + "," + f + "\n"
    out_file.write(i)
This assumes that the field separator is a comma. Another option is:
for i in content:
    i = i.rstrip() + "," + f
    print(i, file=out_file)
This will strip all white space from the end of i.
Add quotes if you need to quote the file name:
i = i.rstrip(...) + ',"' + f + '"'
The relevant part:
with open(f, 'r') as in_file:
    content = in_file.readlines()
    content = content[lines_to_ignore:]
    for i in content:
        new_line = ",".join([i.rstrip(), f]) + "\n" #<-- this is new
        out_file.write(new_line) #<-- this is new
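Since f holds the full path returned by glob, the last column ends up containing the directory as well. If only the bare file name is wanted, os.path.basename can be applied first. The sketch below restates the whole conversion that way; merge_with_filename and its parameters are hypothetical, and it assumes the data lines are comma-separated.

```python
import glob
import os


def merge_with_filename(dir_name, output_path, pattern='*.mpt', lines_to_ignore=57):
    """Append the data lines of every matching file to output_path,
    adding the source file's base name as the last column."""
    files = sorted(glob.glob(os.path.join(dir_name, pattern)))
    with open(output_path, 'w') as out_file:
        for path in files:
            name = os.path.basename(path)
            with open(path, 'r') as in_file:
                for line_number, line in enumerate(in_file):
                    if line_number < lines_to_ignore:
                        continue  # skip the header block
                    out_file.write(line.rstrip('\r\n') + ',' + name + '\n')


# Example call (directory name is a placeholder):
# merge_with_filename('path name', os.path.join('path name', 'output.csv'))
```

Iterating over the file with enumerate avoids loading each .mpt file fully into memory, which readlines() would do.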