Randomly select a subset of lines in a fasta file - python

I have a fasta file of around 18 million reads. I brought the head of it into Python and built a dictionary where the key = readID and value = sequence, using a for loop with if/else statements.
I would now like to randomly select a subset of 10,000 reads from my original file. I think another for loop is necessary, but I'm not sure where to begin.
Thanks in advance

When you're working with FASTQ files, you should really be using BioPython. It has support for reading FASTQ files and turning them into a dictionary, no for loop needed. For taking random samples, use the random module from the standard library.
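from Bio import SeqIO
import random
# For a FASTA file, pass a FASTA path and "fasta" as the format instead
record_dict = SeqIO.to_dict(SeqIO.parse("example.fastq", "fastq"))
# random.sample needs a sequence, so convert the dict items to a list first
random_reads = random.sample(list(record_dict.items()), 10000)
for readID, record in random_reads:
    # Each value is a SeqRecord; its .seq attribute holds the sequence
    print(readID, record.seq)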
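With 18 million reads, holding the whole dictionary in memory may be costly. As a minimal sketch of an alternative, assuming a plain FASTA file whose header lines start with `>`, reservoir sampling keeps a fixed-size sample while streaming through the file once. The helper names `read_fasta` and `reservoir_sample` are made up for this illustration and are not part of BioPython.

```python
import random


def read_fasta(path):
    """Yield (read_id, sequence) pairs from a FASTA file, one record at a time."""
    read_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if read_id is not None:
                    yield read_id, "".join(chunks)
                # Use the first word of the header as the ID (an assumption)
                words = line[1:].split()
                read_id, chunks = (words[0] if words else ""), []
            elif line:
                chunks.append(line)
    if read_id is not None:
        yield read_id, "".join(chunks)


def reservoir_sample(records, k, rng=random):
    """Return up to k items chosen uniformly at random from any iterable."""
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

Something like `reservoir_sample(read_fasta("example.fasta"), 10000)` would then return 10,000 (readID, sequence) pairs without ever storing more than that many records; the file name here is a placeholder.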


Increase speed numpy.loadtxt?

I have hundreds of thousands of data text files to read. As of now, I'm importing the data from text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file that is faster to read.
Anyway, right now every text file I have looks like:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
The code where I read and import the txt files is:
import os
import sys
import numpy as np
path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]
files = os.listdir(path)
trans_import = []
for item in files:
    # Load the current file (files[1] would reload the same file every time)
    trans_import.append(np.loadtxt(os.path.join(path, item), dtype=float, skiprows=4, usecols=(0, 1)))
The resulting array looks in the variable explorer as:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering how I could speed up this part. It currently takes a little too long just to import ~4k text files. There are 841 lines inside every text file (spectrum). The output I get with this code is 841 * 2 = 1682. Obviously, it considers the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
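A minimal sketch of the "parse once, then reload a binary file" idea, assuming every spectrum has the same shape so the arrays can be stacked, and that the data files end in `.txt`. The function name `load_spectra` and the cache file name are hypothetical, and the actual gain still has to be measured on your own data.

```python
import os

import numpy as np


def load_spectra(path, cache_file="spectra_cache.npy"):
    """Load all spectra in `path`, reusing a binary .npy cache when it exists.

    `cache_file` should end in ".npy", otherwise np.save appends the suffix
    and the existence check below never finds the file.
    """
    if os.path.exists(cache_file):
        # Fast path: one binary read instead of thousands of text parses
        return np.load(cache_file)
    arrays = []
    for name in sorted(os.listdir(path)):
        if not name.endswith(".txt"):
            continue
        full = os.path.join(path, name)
        arrays.append(np.loadtxt(full, dtype=float, skiprows=4, usecols=(0, 1)))
    data = np.stack(arrays)  # shape: (n_files, n_rows, 2)
    np.save(cache_file, data)
    return data
```

The first call pays the full text-parsing cost; later calls only read the cache. The cache has to be deleted by hand whenever the text files change.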
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
Use pandas.read_csv (with C speed) instead of numpy.loadtxt. This is a very helpful post:

import images from folder according to a list - python

I have a folder named all that contains, say, 10000 coloured images named "0.jpg", "1.jpg", "2.jpg", ..., "9998.jpg", "9999.jpg".
I would like to import them into an ndarray for training a neural network; however, I want to import only a subset of them, according to a given list of strings representing their names, for instance:
example_list = ['0.jpg','32.jpg','256.jpg','1024.jpg','4096.jpg','9998.jpg']
Is it possible to do such a thing in order to save file-reading time? If yes, how? Or is it better to import all 10000 images into an ndarray anyway and then select the subset from it?
I hope I have been clear in the explanation. Thanks in advance!
Would something like
files = [open(file, 'rb') for file in example_list]
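Here is a way to get all the jpgs into a regular Python list.
jpg_obj_list = []
for jpg in example_list:
    # Open in binary mode: JPEG data is not text
    with open("directory/here/{}".format(jpg), "rb") as f:
        # Assumed completion of the truncated snippet: keep the raw bytes
        jpg_obj_list.append(f.read())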

Merge two large text files by common row to one mapping file

I have two text files that have similar formatting. The first (732KB):
The second (5.26GB):
>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA
Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g., TACGGAGGATGCGAGCGTTATCCGGAT...) as a key.
Note, as one final complication, that the mapping is not going to be 1-to-1: there will be multiple entries of the form Stool****** for each entry of lib****. This is because the key in the first file was trimmed to 200 characters, but in the second file it can be longer.
For smaller files I would just do something like this in python but I often have trouble because these files are so big and cannot be read into memory at one time. Usually I try unix utilities but in this case I cannot think of how to accomplish this.
Thank you!
In my opinion, the easiest way would be to use BLAST+...
Set up the larger file as a BLAST database and use the smaller file as the query...
Then just write a small script to analyse the output, i.e. take the top hit or two to create the mapping file.
By the way, you might find SequenceServer (Google it) helpful in setting up a custom BLAST database and your BLAST environment...
BioPython should be able to read in large FASTA files.
from Bio import SeqIO
from collections import defaultdict
mapping = defaultdict(list)
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)
    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)
        if stool_seq.startswith(lib_seq):
            # Assumed completion of the truncated snippet: record the match
            mapping[lib_record.id].append(stool_record.id)
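The nested loop above re-parses the small file once for every record of the large file. A sketch of a faster variant, assuming (as the question states) that each key in the small file is exactly the first 200 characters of the matching sequence in the large file: index the small file in a dictionary, then stream the large file once. The function name `build_mapping` and its `key_length` parameter are hypothetical. If the keys vary in length, the `startswith` test is the safer choice.

```python
from collections import defaultdict


def build_mapping(lib_records, stool_records, key_length=200):
    """Map each lib header to the stool headers sharing its trimmed sequence.

    Both arguments are iterables of (header, sequence) pairs, so the large
    file can be passed as a generator and is never held in memory.
    """
    # Small file: fits in memory, so index its headers by sequence
    lib_by_seq = defaultdict(list)
    for lib_id, lib_seq in lib_records:
        lib_by_seq[lib_seq].append(lib_id)

    mapping = defaultdict(list)
    # Large file: streamed once, with one dictionary lookup per record
    for stool_id, stool_seq in stool_records:
        for lib_id in lib_by_seq.get(stool_seq[:key_length], ()):
            mapping[lib_id].append(stool_id)
    return mapping
```

With BioPython, the pairs could be produced lazily with a generator expression such as `((r.id, str(r.seq)) for r in SeqIO.parse('stool.fasta', 'fasta'))`.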

Extracting a random line in a file without loading the file into RAM in python

I have big svmlight files that I'm using for machine learning purposes. I'm trying to see if subsampling those files would lead to good enough results.
I want to extract random lines from my files to feed them into my models, but I want to load as little information as possible into RAM.
I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions end up loading everything in memory.
Could someone give me some hints? Thank you.
EDIT : forgot to say that I know the number of lines in my files beforehand.
You can use a heapq to select n records based on a random number, eg:
import heapq
import random
SIZE = 10
with open('yourfile') as fin:
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())
This is remarkably efficient as the heapq remains fixed size, it doesn't require a pre-scan of the data and elements get swapped out as other elements get chosen instead - so at most you'll end up with SIZE elements in memory at once.
One option is to do a random seek into the file then look backwards for a newline (or the start of the file) before reading a line. Here's a program that prints a random line of each of the Python programs it finds in the current directory.
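import random
import os
import glob
for name in glob.glob("*.py"):
    size = os.stat(name).st_size
    if size == 0:
        continue  # an empty file has no line to pick
    # Binary mode, so that seek() accepts arbitrary byte offsets
    with open(name, "rb") as inf:
        location = random.randrange(size)
        # Step backwards until the byte before `location` is a newline
        # (or the start of the file is reached)
        while location > 0:
            inf.seek(location - 1)
            if inf.read(1) == b"\n":
                break
            location -= 1
        inf.seek(location)
        line = inf.readline()
    print(name, ":", line.decode(errors="replace").rstrip("\r\n"))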
As long as the lines aren't huge this shouldn't be unduly burdensome.
You could scan the file once, counting the number of lines. Once you know that, you can generate the random line number, re-read the file and emit that line when you see it.
Actually since you're interested in multiple lines, you should look at Efficiently selecting a set of random elements from a linked list.
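Since the number of lines is known beforehand, the scan-and-pick idea extends to several lines in a single pass. A minimal sketch using only the standard library; `sample_lines` is a hypothetical helper name, and only the chosen line numbers and the selected lines are held in memory.

```python
import random


def sample_lines(path, n_lines, k, rng=random):
    """Return k distinct lines chosen uniformly at random from a file of n_lines lines."""
    # Choose the line numbers first, without touching the file
    wanted = set(rng.sample(range(n_lines), k))
    picked = []
    with open(path) as handle:
        for number, line in enumerate(handle):
            if number in wanted:
                picked.append(line)
                if len(picked) == k:
                    break  # stop reading once every chosen line is found
    return picked
```

The lines come back in file order, not in random order; shuffle the result afterwards if the order matters for training.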

Python - Import txt in a sequential pattern

In the directory I have, say, 30 txt files, each containing two columns of numbers with roughly 6000 numbers in each column. What I want to do is import the first 3 txt files, process the data to get the desired output, and then move on to the next 3 txt files.
The directory looks like:
file1c ... and so on.
I don't want to import all of the txt files simultaneously; I want to import the first 3, process the data, then the next 3 and so forth. I was thinking of making a dictionary, though I have a feeling this might involve writing each file name in the dictionary, which would take far too long.
For those who are interested, I think I have come up with a workaround. Any feedback would be greatly appreciated, since I'm not sure if this is the quickest or the most Pythonic way to do things.
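import glob
import numpy as np
def chunks(l, n):
    # Yield successive n-sized pieces of the list l
    for i in range(0, len(l), n):
        yield l[i:i+n]
Data = []
# Sort the names so that consecutive files end up in the same group
txt_files = sorted(glob.iglob("./*.txt"))
for data in txt_files:
    d = np.loadtxt(data, dtype=np.float64)
    Data.append(d)
Data_raw_all = list(chunks(Data, 3))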
Here the list 'Data' holds the contents of all the text files in the directory, and 'Data_raw_all' uses the function 'chunks' to group the elements of 'Data' into sets of 3. This way, selecting one element of Data_raw_all gives you the corresponding 3 text files from the directory.
First of all, I have nothing original to include here and I definitely do not want to claim credit for it at all because it all comes from the Python Cookbook 3rd Ed and from this wonderful presentation on generators by David Beazley (one of the co-authors of the aforementioned Cookbook). However, I think you might really benefit from the examples given in the slideshow on generators.
What Beazley does is chain a bunch of generators together in order to do the following:
yields filenames matching a given filename pattern.
yields open file objects from a sequence of filenames.
concatenates a sequence of generators into a single sequence
greps a series of lines for those that match a regex pattern
All of these code examples are located here. The beauty of this method is that the chained generators simply chew up the next pieces of information: they don't load all files into memory in order to process all the data. It's really a nice solution.
Anyway, if you read through the slideshow, I believe it will give you a blueprint for exactly what you want to do: you just have to change it for the information you are seeking.
In short, check out the slideshow linked above and follow along and it should provide a blueprint for solving your problem.
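A minimal sketch of the generator-chaining idea applied to this question, assuming the files sort into the desired order by name. The function names below (`find_files`, `in_groups`, `load_columns`, `batches`) are made up for this illustration and are not the ones from Beazley's slides. Only one group of files is parsed at a time.

```python
import glob
from itertools import islice


def find_files(pattern):
    """Yield filenames matching a glob pattern, in sorted order."""
    yield from sorted(glob.glob(pattern))


def in_groups(items, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        group = list(islice(iterator, size))
        if not group:
            return
        yield group


def load_columns(filename):
    """Parse one whitespace-separated text file into a list of float tuples."""
    with open(filename) as handle:
        return [tuple(float(x) for x in line.split())
                for line in handle if line.strip()]


def batches(pattern, size=3):
    """Yield one list of parsed files per group of `size` filenames."""
    for group in in_groups(find_files(pattern), size):
        yield [load_columns(name) for name in group]
```

A loop such as `for group in batches("./*.txt"):` would then process three files per iteration; the last group is shorter when the number of files is not a multiple of three.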
I'm presuming you want to hardcode as few of the file names as possible. Therefore most of this code is for generating the filenames. The files are then opened with a with statement.
Example code:
from itertools import count
root = "UVF2CNa"
done = False
for n in count(1):
    # A plain loop over "abc": cycle("abc") would never let n advance
    for char in "abc":
        first_part = "{}{}{}".format(root, n, char)
        try:
            with open(first_part + "i") as i, \
                 open(first_part + "j") as j, \
                 open(first_part + "k") as k:
                pass  # do stuff with files i, j and k here
        except FileNotFoundError:
            # deal with this however; here we assume there are no more files
            done = True
            break
    if done:
        break

