Merge two large text files by common row to one mapping file - python

I have two text files that have similar formatting. The first (732KB):
>lib_1749;size=599;
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTAGACTGTCACTGACACTGATGCTCGAAAGTGTGGGTATCAAACA
--
>lib_2235;size=456;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
>lib_13686;size=69;
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCAGGCAAGTCAGAAGTGAAAGCCCGGGGCTCAACCCCGGGGCTGGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
The second (5.26GB):
>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA
TACGGAGGATGCGAGCGTTATCCGGATTTACTGGGTTTAAAGGGAGCGCAGACGGGACGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCCTAAAACTGCTAGCGGTGAAATGCTTAGATATCGGGAGGAACTCCGGTTGCGAAGGCAGCATACTGGACTGCAACTGACGCTGATGCTCGAAAGTGTGGGTATCAAACAGG
--
Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g., TACGGAGGATGCGAGCGTTATCCGGAT...) as a key.
One final complication: the mapping is not going to be 1-to-1; there will be multiple entries of the form Stool****** for each lib**** entry. This is because the key in the first file was trimmed to 200 characters, but in the second file it can be longer.
For smaller files I would just do something like this in Python, but I run into trouble here because these files are so big and cannot be read into memory all at once. Usually I turn to Unix utilities, but in this case I cannot think of how to accomplish this.
Thank you!

In my opinion, the easiest way would be to use BLAST+...
Set up the larger file as a BLAST database and use the smaller file as the query...
Then just write a small script to analyse the output, i.e. take the top hit or two to create the mapping file.
By the way, you might find SequenceServer (Google it) helpful in setting up a custom BLAST database and your BLAST environment...
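For example, if the search is run with tabular output (-outfmt 6), the first two columns of each result line are the query and subject IDs, so building the mapping file only takes a short script like the sketch below. The file names are placeholders, and every reported hit per query is kept; filter on the e-value or identity columns if you only want the top hit or two.

from collections import defaultdict

# Collect the subject (Stool...) IDs hit by each query (lib_...) sequence.
mapping = defaultdict(list)
with open('blast_results.tsv') as hits:
    for line in hits:
        query_id, subject_id = line.split('\t')[:2]
        mapping[query_id].append(subject_id)

with open('mapping.txt', 'w') as out:
    for lib_id, stool_ids in mapping.items():
        out.write(lib_id + '\t' + '\t'.join(stool_ids) + '\n')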

BioPython should be able to read in large FASTA files.
from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)
# Stream the large stool file one record at a time.
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)
    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)
        # The lib sequences were trimmed to 200 characters, so match on the prefix.
        if stool_seq.startswith(lib_seq):
            mapping[lib_record.id.split(';')[0]].append(stool_record.id)
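Since the lib file is only 732KB while the stool file is 5.26GB, it may be worth inverting the loops: index the small file in memory, keyed by its trimmed sequence, and stream the large file once with a constant-time dictionary lookup per read. A minimal sketch, assuming the same file names as above and that every lib sequence is exactly 200 characters long:

from Bio import SeqIO
from collections import defaultdict

# Index the small file once: 200-character sequence -> lib ID.
lib_index = {}
for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
    lib_index[str(lib_record.seq)] = lib_record.id.split(';')[0]

mapping = defaultdict(list)
# Single pass over the large file; only the first 200 characters are looked up.
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    lib_id = lib_index.get(str(stool_record.seq)[:200])
    if lib_id is not None:
        mapping[lib_id].append(stool_record.id)

# Write the mapping file: lib ID followed by all matching stool IDs.
with open('mapping.txt', 'w') as out:
    for lib_id, stool_ids in mapping.items():
        out.write(lib_id + '\t' + '\t'.join(stool_ids) + '\n')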

Related

Memory efficient way of keeping track of unique entries

I have around a 50GB folder full of files. Each file consists of line after line of JSON data and in this JSON structure is a field for user_id.
I need to count the number of unique User IDs across all of the files (and only need the total count). What is the most memory efficient and relatively quick way of counting these?
Of course, loading everything into a huge list maybe isn't the best option. I tried pandas but it took quite a while. I then tried to simply write the IDs to text files, but I thought I'd find out if I was maybe missing something far simpler.
Since it was stated that the JSON context of user_id does not matter, we just treat the JSON files as the pure text files they are.
GNU tools solution
I'd not use Python at all for this, but rather rely on the tools provided by GNU, and pipes:
cat *.json | sed -nE 's/\s*\"user_id\"\s*\:\s*\"([0-9]+)\"\s*/\1/p' | sort -un --parallel=4 | wc -l
cat *.json: Output contents of all files to stdout
sed -nE 's/\s*\"user_id\"\s*\:\s*\"([0-9]+)\"\s*/\1/p': Look for lines containing "user_id": "{number}" and only print the number to stdout
sort -un --parallel=4: Sort the output numerically, ignoring duplicates (i.e. output only unique values), using multiple (4) jobs, and output to stdout
wc -l: Count number of lines, and output to stdout
To determine whether the values are unique, we just sort them. You can speed up the sorting by specifying a higher number of parallel jobs, depending on your core count.
Python solution
If you want to use Python nonetheless, I'd recommend using a set and re (regular expressions)
import fileinput
import re

r = re.compile(r'\s*\"user_id\"\s*\:\s*\"([0-9]+)\"\s*')
s = set()
for line in fileinput.input():
    m = r.match(line)
    if m:
        s.add(m.groups()[0])
print(len(s))
Run this using python3 <scriptname>.py *.json.
Since you only need the user_ids, load each .json (as a data structure), extract the IDs, then destroy all references to that structure and any of its parts so that it gets garbage collected.
To speed up the process, you can do this in a few processes in parallel; take a look at multiprocessing.Pool.map.
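A minimal sketch of that idea, assuming each file holds one JSON object per line; the helper ids_in_file and the worker count are illustrative, not from the original post:

import json
from glob import glob
from multiprocessing import Pool

def ids_in_file(filepath):
    # Worker: parse each line as JSON and collect its user_id.
    ids = set()
    with open(filepath) as f:
        for line in f:
            try:
                ids.add(json.loads(line)['user_id'])
            except (ValueError, KeyError):
                pass  # skip malformed lines or records without a user_id
    return ids

if __name__ == '__main__':
    with Pool(4) as pool:
        per_file = pool.map(ids_in_file, glob('*.json'))
    # Union the per-file sets in the parent process and count.
    print(len(set().union(*per_file)))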
Try the simplest approach first.
Write a function get_user_ids(filepath) that returns a list of the user_ids in a JSON file (a possible implementation is sketched after the loop below).
Then do:
from pathlib import Path

the_folder = Path("path/to/the/folder")
user_ids = set()
for jsonpath in the_folder.glob('*.json'):
    user_ids.update(get_user_ids(jsonpath))
print(len(user_ids))
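One possible get_user_ids, reusing the regular expression from the answer above since the exact JSON layout wasn't given in the question (treat it as a placeholder):

import re

USER_ID = re.compile(r'\s*\"user_id\"\s*\:\s*\"([0-9]+)\"\s*')

def get_user_ids(filepath):
    # Return the user_ids found in one file, assuming one JSON record per line.
    ids = []
    with open(filepath) as f:
        for line in f:
            m = USER_ID.match(line)
            if m:
                ids.append(m.group(1))
    return ids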
If the list of user IDs is so large that it can't reasonably fit into a set in memory, an easy and memory-efficient way to de-duplicate is to simply create files named after user IDs in an empty directory, and then count the number of files in the directory. This works because most filesystems are efficient at indexing file names in a directory.
import os

os.chdir('/')
os.mkdir('/count_unique')
os.chdir('/count_unique')
# change the following demo tuple to a generator that reads your JSON files and yields user IDs
for user_id in 'b', 'c', 'b', 'a', 'c':
    open(user_id, 'w').close()
print(sum(1 for _ in os.scandir('/count_unique')))
This outputs: 3

Increase speed numpy.loadtxt?

I have hundreds of thousands of data text files to read. As of now, I'm importing the data from the text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file that is faster to read.
Anyway, right now every text file I have looks like:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
# IMPORT DATA
path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]

files = os.listdir(path)
trans_import = []
for index, item in enumerate(files):
    trans_import.append(np.loadtxt(path+'/'+item, dtype=float, skiprows=4, usecols=(0,1)))
In the variable explorer, the resulting array looks like:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering how I could speed up this part. It takes a little too long as of now just to import ~4k text files. There are 841 data lines inside every text file (one spectrum per file). The output I get with this code is 841 * 2 = 1682. Obviously, it considers the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
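A minimal sketch of the "convert once, load fast afterwards" idea, building on the trans_import list from the question (the .npy file name is arbitrary, and all spectra are assumed to have the same 841 x 2 shape):

import numpy as np

# One-time conversion: stack all spectra into a single array and save it in binary form.
data = np.stack(trans_import)            # shape: (n_files, 841, 2)
np.save('all_spectra.npy', data)

# On every later run, this is much faster than re-parsing thousands of text files.
data = np.load('all_spectra.npy')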
Not your primary question, but since it seems to be an issue: you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just use dataframe.values.
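For example (a sketch assuming whitespace-separated columns and the four header lines shown in the question; the file name is illustrative):

import pandas as pd

df = pd.read_csv('T2/spectrum_0001.txt',
                 sep=r'\s+',             # whitespace-separated columns
                 skiprows=4,             # skip the four header lines
                 skip_blank_lines=True,  # ignore empty lines between data rows
                 header=None,
                 names=['wavelength', 'intensity'])
data = df.values                         # numpy.ndarray of shape (841, 2)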
Use pandas.read_csv (which parses at C speed) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt

How To Identify Files with Identical Content But a Different Arrangement of the Data

I'm testing an upgrade we ran on an application that processes data. I took archived data that has already run through the system before and am comparing it with output from the newly upgraded application. I'm noticing that the data is the same but the arrangement of the data in the new output is different. For example, in the new file, line 57's data used to be on line 43 in the old output. Is there a way to detect that the files contain identical content? When I run a file compare in TextPad or do an MD5 hash compare, it doesn't detect that the files have the same content. It sees them as different files.
As Enak and Dominique have mentioned, sorting text files line by line and then comparing the two will reveal with complete certainty if anything is missing or not.
You might calculate some aggregate values of both files and compare them for sufficient proof, though, which will be a lot faster. Are the numbers of words and characters the same? What about the counts of the individual letters? Count the occurrences of all 26 letters in both files (you could also do the same for any character set of your choice); if their numbers match up exactly, there is a very high probability that both files contain the same information. This is along the same lines as your hashing approach, but obviously isn't as reliable.
If you need to know with certainty, you will have to compare each line of file A with each line of file B somehow. If the lines are completely shuffled, sorting the lines in file A and B and then comparing the files will be the best option. If there is locality however (line number x of file A tends to stay around location x in file B), you might as well just compare the two files without sorting, but rather by starting your search for line x of file A around location x in file B.
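A minimal sketch of the sort-and-compare idea for files that fit in memory: comparing collections.Counter objects built over the lines is equivalent to comparing the sorted line lists and also handles repeated lines (the file names are placeholders).

from collections import Counter

def same_content(path_a, path_b):
    # Identical content regardless of line order means every distinct line
    # occurs the same number of times in both files.
    with open(path_a) as a, open(path_b) as b:
        return Counter(a) == Counter(b)

print(same_content('old_output.txt', 'new_output.txt'))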
A hash compare of the raw files is meaningless, since e.g. two files with
foo
bar
and
bar
foo
would generate completely different hashes. Otherwise hash functions would be really broken.
I think your only chance here is to check whether every line of file A is in file B (line by line). Maybe you could sort both files first; this can be done concurrently, and then you can compare the hashes of the two sorted files, since the sort algorithm is deterministic in its output.
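A sketch of that idea for files that fit in memory (the file names are placeholders): hashing the sorted lines makes the digest independent of the line arrangement.

import hashlib

def sorted_digest(path):
    # Hash the file's lines in sorted order, so the original arrangement no longer matters.
    with open(path, 'rb') as f:
        return hashlib.sha256(b''.join(sorted(f))).hexdigest()

print(sorted_digest('old_output.txt') == sorted_digest('new_output.txt'))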

Randomly select a subset of lines in a fasta file

I have a fasta file of around 18 million reads. I brought the head of it into Python and built a dictionary where the key = read ID and the value = sequence, using a for loop with if/else statements.
I would like to now randomly select a subset of 10,000 reads from my original file. I think another for loop is necessary, but I'm not sure where to begin.
Thanks in advance
When you're working with FASTA files, you should really be using BioPython. It has support for reading FASTA files and turning them into a dictionary, no for-loop needed. For taking random samples, use the random module from the standard library.
from Bio import SeqIO
import random

record_dict = SeqIO.to_dict(SeqIO.parse("example.fasta", "fasta"))
random_reads = random.sample(list(record_dict.items()), 10000)
for read_id, record in random_reads:
    print(read_id, record.seq)

Detecting duplicate files with Python

I'm trying to write a script in Python for sorting through files (photos, videos), checking metadata of each, finding and moving all duplicates to a separate directory. Got stuck with the metadata checking part. Tried os.stat - doesn't return True for duplicate files. Ideally, I should be able to do something like :
if os.stat("original.jpg") == os.stat("duplicate.jpg"):
    shutil.copy("duplicate.jpg", "C:\\Duplicate Folder")
Pointers anyone?
There are a few things you can do. You can compare the contents or hash of each file, or you can check a few select properties from the os.stat result, e.g.
def is_duplicate(file1, file2):
    stat1, stat2 = os.stat(file1), os.stat(file2)
    return stat1.st_size == stat2.st_size and stat1.st_mtime == stat2.st_mtime
A basic loop using a set to keep track of already encountered files:
import glob
import hashlib

uniq = set()
for fname in glob.glob('*.txt'):
    with open(fname, "rb") as f:
        sig = hashlib.sha256(f.read()).digest()
    if sig not in uniq:
        uniq.add(sig)
        print(fname)
    else:
        print(fname, "(duplicate)")
Please note that, as with any hash function, there is a slight chance of collision, that is, two different files having the same digest. Depending on your needs, this may or may not be acceptable.
According to Thomas Pornin in another answer:
"For instance, with SHA-256 (n=256) and one billion messages (p=10^9), the probability [of collision] is about 4.3*10^-60."
Given your needs, if you have to check for additional properties in order to identify "true" duplicates, change the sig = ... line to whatever suits you. For example, if you need to check for "same content" and "same owner" (st_uid as returned by os.stat()), write:
sig = (hashlib.sha256(f.read()).digest(),
       os.stat(fname).st_uid)
If two files have the same MD5 digest, they are exact duplicates.
from hashlib import md5

with open(file1, "rb") as original:
    original_md5 = md5(original.read()).hexdigest()
with open(file2, "rb") as duplicate:
    duplicate_md5 = md5(duplicate.read()).hexdigest()

if original_md5 == duplicate_md5:
    do_stuff()
Since the files in your example are .jpg images, they have to be opened in binary mode, i.e. with "rb" as the second argument to open; see the documentation for open.
os.stat offers information about a file's metadata, including its creation time. That is not a good approach for finding out whether two files are the same.
For instance: two files can have the same content but different creation times, so comparing stats will fail here. Sylvain Leroux's approach is the best one when combining performance and accuracy, since it is very rare for two different files to have the same hash.
So, unless you have an incredibly large amount of data and a repeated file would cause a system fatality, this is the way to go.
If that is your case (it does not seem to be), well... the only way you can be 100% sure two files are the same is by iterating over them and comparing them byte by byte.
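A small sketch of such a byte-by-byte comparison, reading both files in chunks instead of loading them whole (the chunk size is arbitrary); the standard library's filecmp.cmp(path1, path2, shallow=False) does essentially the same thing.

def identical(path1, path2, chunk_size=1 << 20):
    # Compare two files chunk by chunk and stop at the first difference.
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files exhausted at the same point
                return True

print(identical("original.jpg", "duplicate.jpg"))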
