Using Biopython SeqIO.convert over an entire directory

I have 51 files with metagenomic sequence data that I would like to convert from fastq to fasta using a Biopython script in Windows. The function SeqIO.convert easily converts an individually specified file, but I can't figure out how to convert the entire directory. It's not really too many files to do individually, but I'm trying to learn.
I'm brand new to Biopython, so please forgive my ignorance. This convo was helpful, but I'm still not able to convert the directory from fastq to fasta.
Here's the code I've been trying to run:
# modules
import sys
import re
import os
import fileinput
from Bio import SeqIO
# define directory
Directory = "FastQ"
# convert files
def process(filename):
    return SeqIO.convert(filename, "fastq", "files.fa", filename + ".fasta", "fasta", alphabet= IUPAC.ambiguous_dna)

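You need to iterate over the files in the directory and convert each one. Assuming your directory is FastQ and you run the script from the folder that contains it (since you are using a relative path), you would need to do something like the following. Note that os.listdir returns bare file names, so each name has to be joined with the directory before it is passed to SeqIO.convert:
import os
from Bio import SeqIO
def process(directory):
    for name in os.listdir(directory):
        if name.endswith(".fastq"):
            src = os.path.join(directory, name)  # listdir returns bare names
            SeqIO.convert(src, "fastq", src.replace(".fastq", ".fasta"), "fasta")
Then you would call it from your main script:
my_directory = "FastQ"
process(my_directory)
I think that should work. The alphabet argument is optional and is not needed for this conversion, so I have left it out (IUPAC was also never imported in your code).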
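For what it's worth, the directory loop can be kept separate from the conversion call, which makes it easy to try out without Biopython installed. This is only a sketch: convert_directory and its convert parameter are hypothetical names (not part of Biopython), and the .fastq and .fasta extensions are assumed.

```python
from pathlib import Path


def convert_directory(directory, convert, in_ext=".fastq", out_ext=".fasta"):
    """Call convert(src, dst) for every file in directory ending in in_ext.

    convert_directory and the convert callback are hypothetical helpers;
    the callback is expected to do the actual format conversion.
    Returns the list of output paths, in sorted order of the inputs.
    """
    outputs = []
    for src in sorted(Path(directory).iterdir()):
        if src.is_file() and src.suffix == in_ext:
            dst = src.with_suffix(out_ext)  # same name, new extension
            convert(str(src), str(dst))
            outputs.append(dst)
    return outputs
```

With Biopython, the callback could be something like lambda src, dst: SeqIO.convert(src, "fastq", dst, "fasta"), so the call becomes convert_directory("FastQ", that_lambda).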

Related

Move files one by one to newly created directories for each file with Python 3

What I have is an initial directory with a file inside, D:\BBS\file.x, and multiple .txt files in the work directory D:\
What I am trying to do is copy the folder BBS with its contents, incrementing its name by a number, then move each existing .txt file into one of the newly created directories, giving \BBS1, \BBS2, ..., \BBSn (n depends on the number of .txt files).
[Images in the original post: initial view and desired view of \WorkFolder]
So far I have only managed to create a new directory and move the .txt files into it all at once, not one per directory as I would like. Here's my code:
from pathlib import Path
from shutil import copy
import shutil
import os
wkDir = Path.cwd()
src = wkDir.joinpath('BBS')
count = 0
for content in src.iterdir():
    addname = src.name.split('_')[0]
    out_folder = wkDir.joinpath(f'!{addname}')
    out_folder.mkdir(exist_ok=True)
    out_path = out_folder.joinpath(content.name)
    copy(content, out_path)
files = os.listdir(wkDir)
for f in files:
    if f.endswith(".txt"):
        shutil.move(f, out_folder)
I would appreciate help with incrementing the folder name and moving the files one by one into a newly created directory for each, as described above.
I don't have much experience with Python in general. I'm using Python 3 on Windows.
Thanks in advance

Now, I understand what you want to accomplish. I think you can do it quite easily by only iterating over the text files and for each one you copy the BBS folder. After that you move the file you are currently at. In order to get the folder_num, you may be able to just access the file name's characters at the particular indexes (e.g. f[4:6]) if the name is always of the pattern TextXX.txt. If the prefix "Text" may vary, it is more stable to use regular expressions like in the following sample.
Also, the function shutil.copytree copies a directory with its children.
import os
import re
import shutil
from pathlib import Path
wkDir = Path.cwd()
src = wkDir.joinpath('BBS')
for f in os.listdir(wkDir):
    if f.endswith(".txt"):
        folder_num = re.findall(r"\d+", f)[0]
        target = wkDir.joinpath(f"{src.name}{folder_num}")
        # copy BBS
        shutil.copytree(src, target)
        # move .txt file
        shutil.move(f, target)
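The same idea can be wrapped in a function that takes the work folder as an argument, so it does not depend on the current working directory. This is a sketch under assumptions: distribute_txt_files is a hypothetical name, the folder number is taken from the first run of digits in each file name, and files without digits are skipped.

```python
import re
import shutil
from pathlib import Path


def distribute_txt_files(work_dir, template_name="BBS"):
    """Copy the template folder once per .txt file and move the file into it.

    Hypothetical helper: the folder number comes from the first run of
    digits in the file name; files without digits are skipped.
    """
    work_dir = Path(work_dir)
    template = work_dir / template_name
    created = []
    for txt in sorted(work_dir.glob("*.txt")):
        match = re.search(r"\d+", txt.stem)
        if match is None:
            continue  # no number to build the folder name from
        target = work_dir / f"{template_name}{match.group()}"
        shutil.copytree(template, target)  # raises if target already exists
        shutil.move(str(txt), str(target / txt.name))
        created.append(target)
    return created
```

Calling distribute_txt_files(Path.cwd()) would then correspond to the script above.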

Is it possible to read in a whole folder of csv files using one line of read code in python?

I have about 20 csv files that I need to read in, is it possible to read in the whole folder instead of doing them individually? I am using python. Thanks

You can't. The fileinput module almost meets your needs, allowing you to pretend a bunch of files are a single file, but it also doesn't meet the requirements of files for the csv module (namely, that newline translation must be turned off). Just open the files one-by-one and append the results of parsing to a single list; it's not that much more effort. No matter what you do something must "do them individually"; there is no magic to say "read 20 files exactly as if they were one file". Even fobbing off to cat or the like (to concatenate all the files into a single stream you can read from) is just shunting the same file-by-file work elsewhere.
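A minimal sketch of the "open the files one by one and append" approach with the csv module follows. read_csv_folder is a hypothetical helper name, and header rows are not treated specially here.

```python
import csv
from pathlib import Path


def read_csv_folder(folder):
    """Parse every .csv file in folder and return all rows in one list.

    read_csv_folder is a hypothetical helper name. Files are read in sorted
    name order; header rows are not treated specially.
    """
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" is what the csv module asks for when opening files
        with open(path, newline="") as handle:
            rows.extend(csv.reader(handle))
    return rows
```

The call itself is then one line, rows = read_csv_folder("path/to/directory/"), even though the files are still read individually underneath.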

You can pull a list of files in Python by using os.listdir. From there, you can loop over your list of files, and generate a list of CSV files:
import os
filenames = os.listdir("path/to/directory/")
csv_files = []
for name in filenames:
    if name.endswith(".csv"):
        csv_files.append(name)
From there, you'll have a list containing every CSV in your directory.

The shortest thing I can think of is this. It isn't one line overall, because a few imports are needed to keep the line that does the reading from getting too long:
from os import listdir
from os.path import isfile
from os.path import splitext
from os.path import join
import pandas as pd
source = '/tmp/'
dfs = [
    pd.read_csv(join(source, path))
    for path in listdir(source)
    if isfile(join(source, path)) and splitext(join(source, path))[1] == '.csv'
]

How can I read files with similar names on python, rename them and then work with them?

I've already posted the same question here, but sadly I couldn't come up with a solution (some of you gave me awesome answers, but most of them weren't what I was looking for), so I'll try again, this time giving more information about what I'm trying to do.
So, I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them for more than one thing I'm getting something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now, what I need to do is to use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files I want to rename them as I don't want to use them like './path/etc'.
So what I wanted was to write a loop that could get these files and rename them inside the script so I can use these files with the new name in other functions (outside the loop).
So instead of having to do this individually:
GMATds1= './path/GMATd_1.txt'
GMATds2= './path/GMATd_2.txt'
I wanted to write a loop that would do that for me.
I've already tried using a dictionary:
import os
import fnmatch
examples = {}
for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, 'thing*.txt'):
        examples[filename[:6]] = filename
This does work but I can't use the dictionary key outside the loop.

If I understand correctly, you are trying to fetch files with similar names (or at least a recurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in all_files:
    new_path = create_new_path(file) # possibly split the file name, change directory and/or filename
    os.rename(file, new_path)
The glob library supports * wildcards, which makes it possible to search for files whose names follow a specific pattern. It lists all matching files in a directory (or in multiple directories if you include a * wildcard in the directory part). When you iterate over the files, you can either work with their contents directly (as you apparently intend to do) or rename them as shown in this snippet. To rename them, you need to generate a new path, so you would have to write the create_new_path function, which takes the old path and returns a new one.
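As one possible shape for the create_new_path helper mentioned above, here is a small sketch. The default replacement (GMATd_1.txt becoming GMATds1.txt) is only an assumption about the naming scheme the asker wants; the old and new parameters are hypothetical and exist just to make that assumption easy to change.

```python
import os


def create_new_path(old_path, old="d_", new="ds"):
    """Return old_path with `old` replaced by `new` in the file name only.

    The default replacement (GMATd_1.txt -> GMATds1.txt) is an assumption
    about the naming scheme; the directory part of the path is left alone.
    """
    folder, name = os.path.split(old_path)
    return os.path.join(folder, name.replace(old, new))
```

Splitting the path first keeps the replacement from touching directory names that happen to contain the same characters.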

Since Python 3.4 you can use the standard-library pathlib module instead of os or glob.
from pathlib import Path
import shutil
for file_src in Path("path/to/files").glob("GMAT*.txt"):
    # Change only the file name, not the directories in the path;
    # adjust the replacement to your own naming scheme
    file_dest = file_src.with_name(file_src.name.replace("ds", "d_"))
    shutil.move(str(file_src), str(file_dest))

You can use:
import os
path='.....' # path where these files are located
path1='.....' ## path where you want to store these files
i=1
for file in os.listdir(path):
    if file.endswith('.txt'):
        os.rename(os.path.join(path, file), os.path.join(path1, str(i) + ".txt"))
        i+=1
This moves every .txt file from the source folder into the destination folder, renamed 1.txt, 2.txt, 3.txt, ..., n.txt.

Renaming multiple file names with date

I would like to ask for your help with renaming multiple files that have dates in their names. I have netCDF files "wrfoutput_d01_2016-08-01_00:00:00" through "wrfoutput_d01_2016-08-31_00:00:00", which Windows cannot read since the output is from Linux. I want to change the file names to "wrfoutput_d01_2016-08-01_00" through "wrfoutput_d01_2016-08-31_00". How do I do that using Python?
Edit:
The containing folder has two sets of files. One is for domain 1, denoted by d01 (wrfoutput_d01_2016-08-31_00:00:00), and the other is denoted by d02 (wrfoutput_d02_2016-08-31_00:00:00). There are 744 files for d01, since the output time step is hourly, and the same for d02.
I want to rename the files for each day on an hourly basis. Say, wrfoutput_d01_2016-08-01_00:00:00, wrfoutput_d01_2016-08-01_01:00:00,... to wrfoutput_d01_2016-08-01_00, wrfoutput_d01_2016-08-01_01,...
I saw some code which allows me to access a specific set of files, e.g. d01 or d02.
import os
from netCDF4 import Dataset
from wrf import getvar
filedir = "/home/gil/WRF/Output/August/"
wrfin = [Dataset(f) for f in os.listdir(filedir)
         if f.startswith("wrfout_d02_")]
After this code I get stuck.

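First get the filenames, giving the folder path ('/home/user/myfolder...'), then rename them. os.listdir returns bare names, so join each one with the folder path before renaming.
import os
import re
folder_path = '/home/user/myfolder...'
filenames = os.listdir(folder_path)
for fn in filenames:
    os.rename(os.path.join(folder_path, fn),
              os.path.join(folder_path, re.sub(':','-',fn)))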

The other answer converts the colons to hyphens. If you would rather truncate the minutes and seconds from the file name, keeping the hour as in the question, you can use this.
This assumes the files are in the same directory as the Python script. If not, change folder from '.' to 'path/to/dir/'. It also only renames files whose names contain 'wrfoutput' and still end in a time.
from os import listdir, rename
from os.path import isfile, join
folder = '.'
only_files = [f for f in listdir(folder) if isfile(join(folder, f))]
for f in only_files:
    # Get the relevant files that have not been renamed yet
    if 'wrfoutput' in f and f[-6] == ':':
        # Remove :MM:SS from the end of the file name, keeping _HH
        rename(join(folder, f), join(folder, f[:-6]))
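The truncation can also be written with a regular expression, which avoids counting characters. This is a minimal sketch, assuming the names end in :MM:SS as in the question; strip_minutes_seconds and rename_wrf_files are hypothetical names, and the prefix default is taken from the file names quoted in the question.

```python
import os
import re

# Matches a trailing ":MM:SS", e.g. the ":00:00" in "..._2016-08-01_00:00:00"
_MIN_SEC = re.compile(r":\d{2}:\d{2}$")


def strip_minutes_seconds(name):
    """Return name without a trailing :MM:SS; other names are unchanged."""
    return _MIN_SEC.sub("", name)


def rename_wrf_files(folder, prefix="wrfoutput_"):
    """Rename matching files in folder; returns (old, new) name pairs."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        new_name = strip_minutes_seconds(name)
        if name.startswith(prefix) and new_name != name:
            os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
            renamed.append((name, new_name))
    return renamed
```

Because names that no longer end in :MM:SS are left alone, running rename_wrf_files twice on the same folder does no further harm.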

Open the Terminal
cd into your directory (cd /home/myfolder)
Start python (python)
Now, a simple rename.
import os
AllFiles=os.listdir('.')
for eachfile in AllFiles:
    os.rename(eachfile,eachfile.replace(':','_'))

How do I use wild cards in python as specified in a text file

Hello, I'm new to Python and I'd like to know how to process a .txt file line by line to copy files specified with wildcards.
Basically the .txt file looks like this:
bin/
bin/*.txt
bin/*.exe
obj/*.obj
document
binaries
With that information, I'd like to read my .txt file, match each directory, and copy all the files that match the * pattern for that directory. I'd also like to copy the folders listed in the .txt file. What's the most practical way of doing this? Your help is appreciated, thanks.

Here's something to start with...
import glob # For specifying pathnames with wildcards
import shutil # For doing common "shell-like" operations.
import os # For dealing with pathnames
# Grab all the pathnames of all the files matching those specified in `text_file.txt`
matching_pathnames = []
for line in open('text_file.txt','r'):
    pattern = line.strip() # Drop the trailing newline, which would prevent any match
    if pattern:
        matching_pathnames += glob.glob(pattern)
# Copy all the matched files to the same filename + '.new' at the end
for pathname in matching_pathnames:
    if os.path.isfile(pathname): # copyfile cannot copy directories
        shutil.copyfile(pathname, '%s.new' % (pathname,))
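To also handle the folders listed in the text file, a sketch along the following lines might work. It rests on assumptions: copy_from_pattern_file, base_dir and dest_dir are hypothetical names, every pattern is taken to be relative to base_dir, and dirs_exist_ok requires Python 3.8 or newer.

```python
import glob
import os
import shutil


def copy_from_pattern_file(pattern_file, base_dir, dest_dir):
    """Copy everything matched by the patterns in pattern_file into dest_dir.

    Hypothetical helper. Each non-empty line is a glob pattern relative to
    base_dir; matches keep their relative path under dest_dir, and matched
    directories are copied with their contents.
    Returns the matched paths relative to base_dir.
    """
    with open(pattern_file) as handle:
        patterns = [line.strip() for line in handle if line.strip()]
    copied = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(base_dir, pattern)):
            rel = os.path.relpath(path, base_dir)
            target = os.path.join(dest_dir, rel)
            if os.path.isdir(path):
                # Python 3.8+: merge into an existing target directory
                shutil.copytree(path, target, dirs_exist_ok=True)
            else:
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(path, target)
            copied.append(rel)
    return copied
```

Lines that match nothing, such as a folder name that does not exist, are skipped silently in this sketch.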

You might want to look at the glob and re modules
http://docs.python.org/library/glob.html
