Finding identical numbers in large files in Python

I have two data files in python, each containing two-column data as below:
3023084 5764
9152549 5812
18461998 5808
45553152 5808
74141469 5753
106932238 5830
112230478 5795
135207137 5800
148813978 5802
154818883 5798
There are about 10M entries in each file (~400 MB).
I have to sort through each file and check if any number in the first column of one file matches any number in the first column of the other file.
The code I currently have converts the files to lists:
ch1 = []
with open('ch1.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch1.append([line[0], line[1]])

ch2 = []
with open('ch2.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch2.append([line[0], line[1]])
I then iterate through both of the lists looking for a match. When a match is found, I wish to add the sum of the right-hand columns to a new list, coin:
coin = []
for item1 in ch1:
    for item2 in ch2:
        if item1[0] == item2[0]:
            coin.append(int(item1[1]) + int(item2[1]))
The issue is that this takes a very long time and/or crashes. Is there a more efficient way of doing this?

There are lots of ways to improve this; for example:
Since you only scan through the contents of ch1.txt once, you don't need to read it into a list; streaming it line by line will use less memory, though it probably won't speed things up all that much on its own.
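For instance, here is a minimal sketch of that idea, assuming ch2 has already been loaded as in the question and using a dict for fast lookups:
# Sketch: stream ch1.txt line by line instead of building the ch1 list.
# Assumes ch2 is the list of [key, value] pairs built by the code in the question.
ch2_lookup = {item[0]: int(item[1]) for item in ch2}

coin = []
with open('ch1.txt') as f:
    for line in f:
        if ':' not in line:
            key, value = line.split()
            if key in ch2_lookup:
                coin.append(int(value) + ch2_lookup[key])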
If you sort each of your lists, you can check for matches much more efficiently. Something like:
# Assumes ch1 and ch2 are sorted by their first column; since the values were
# read in as strings, convert them to int first so that sorting and the <
# comparison below are numeric rather than lexicographic.
i1, i2 = 0, 0
while i1 < len(ch1) and i2 < len(ch2):
    if ch1[i1][0] == ch2[i2][0]:
        # Do what you do for matches
        ...
        # Advance both indices
        i1 += 1
        i2 += 1
    elif ch1[i1][0] < ch2[i2][0]:
        # Advance the index of the smaller value
        i1 += 1
    else:  # ch1[i1][0] > ch2[i2][0]
        i2 += 1
If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read in the next line of the corresponding file. This should improve efficiency in time and space.
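A minimal sketch of that combined approach, assuming both files are already sorted numerically by their first column and that keys are unique within each file (file names as in the question):
# Sketch: merge two sorted files without loading either one into memory.
def read_pair(f):
    # Return (key, value) from the next data line, or None at end of file.
    for line in f:
        if ':' not in line:
            k, v = line.split()
            return int(k), int(v)
    return None

coin = []
with open('ch1.txt') as f1, open('ch2.txt') as f2:
    p1, p2 = read_pair(f1), read_pair(f2)
    while p1 is not None and p2 is not None:
        if p1[0] == p2[0]:
            coin.append(p1[1] + p2[1])
            p1, p2 = read_pair(f1), read_pair(f2)
        elif p1[0] < p2[0]:
            p1 = read_pair(f1)
        else:
            p2 = read_pair(f2)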

A few ideas to improve this:
store your data in dictionaries, with the first column as the key and the second column as the value, for later use
a match is then any key that appears in the intersection of the two dictionaries' key sets
Code example:
# inside your existing read loops, store the data in dicts instead of lists
ch1_dict[line[0]] = line[1]  # in the ch1.txt loop (start with ch1_dict = {})
ch2_dict[line[0]] = line[1]  # in the ch2.txt loop (start with ch2_dict = {})

# this is what you want to achieve: sum the values for every key present in both dicts
coin = [int(ch1_dict[key]) + int(ch2_dict[key]) for key in ch1_dict.keys() & ch2_dict.keys()]

Related

Is there a faster way to apply a function instead of using a loop within a loop in Python?

I am iterating over a file and creating a set of unique "in" strings. I next iterate over a pair of files and extract the sequence string for each fastq read. I then iterate through the "in" set to see if the string has a Levenshtein distance of <= 2, and pick the first "in" sequence that does.
The problem I have is that it's very slow having a loop within a loop.
Is there a way of speeding this up, or a better way of mapping the function to the whole list of "in" strings and returning the best match?
import pysam
import Levenshtein

# This part creates a set of strings from the infile
inlist = open("umi_tools_inlist_2000.txt", "r")
barcodes = []
for line in inlist:
    barcodes.append(line.split("\t")[0])
barcodes = set(barcodes)

# Next I iterate through two fastq files and extract the sequence of each read
with pysam.FastxFile("errors_fullbarcode_read_R1.fastq") as fh, pysam.FastxFile("errors_fullbarcode_read_R2.fastq") as fh2:
    for record_fh, record_fh2 in zip(fh, fh2):
        barcode = record_fh.sequence[0:24]
        for b in barcodes:
            if Levenshtein.distance(barcode, b) <= 2:
                b = b + record_fh.sequence[24:]
                break
            else:
                pass
You can use a dictionary instead of a list, where the keys are the barcodes.
If you don't like the loop within a loop, you can use a list comprehension and test the performance:
import pysam
import Levenshtein

inlist = open("umi_tools_inlist_2000.txt", "r")
barcodes = dict()
for line in inlist:
    barcodes[line.split("\t")[0]] = 0

# Next, iterate through the two fastq files and extract the sequence of each read
with pysam.FastxFile("errors_fullbarcode_read_R1.fastq") as fh, pysam.FastxFile("errors_fullbarcode_read_R2.fastq") as fh2:
    for record_fh, record_fh2 in zip(fh, fh2):
        barcode = record_fh.sequence[0:24]
        b_found = [b for b in barcodes.keys() if Levenshtein.distance(barcode, b) <= 2]
        # as per your logic, b_found will have zero or one element
        if b_found:
            new_b = b_found[0] + record_fh.sequence[24:]  # concatenate the matched barcode, not its dict value
            barcodes[new_b] = 0
            del barcodes[b_found[0]]

How to parse letter by letter and make a list with Python?

I have a text file I am attempting to parse. Fairly new to Python.
It contains an ID, a sequence, and frequency
SA1 GDNNN 12
SA2 TDGNNED 8
SA3 VGGNNN 3
Say the user wants to compare the frequency of the first two sequences. They would input the ID numbers. I'm having trouble figuring out how I would parse this with Python to make a list like:
GD occurs once across the two sequences, so it = 12
DN also occurs once = 12
NN occurs 3 times = 12 + 12 + 8 = 32
TD occurs once, in the second sequence = 8
DG ""
NE ""
ED ""
What do you recommend for parsing letter by letter? In a sequence: GD, then DN, then NN (without repeating it in the list), then TD, etc.?
I currently have:
#Read File
def main():
    file = open("clonedata.txt", "r")
    lines = file.readlines()
    file.close()

class clone_data:
    def __init__(id, seq, freq):
        id.seq = seq
        id.freq = freq
    def myfunc(id):
        id = input("Input ID number to see frequency: ")
        for line in infile:
            line = line.strip().upper()
            line.find(id)
            #print('y')
I'm not entirely sure from the example, but it sounds like you're trying to look at each line in the file and determine if the ID is in a given line. If so, you want to add the number at the end of that line to the current count.
This can be done in Python with something like this:
def get_total_from_lines_for_id(id_string, lines):
    total = 0  # record the total at the end of each line
    # now loop over the lines searching for the ID string
    for line in lines:
        if id_string in line:  # this will be true if the id_string is in the line and will only match once
            split_line = line.split(" ")  # split the line at each space character into an array
            number_string = split_line[-1]  # get the last item in the array, the number
            number_int = int(number_string)  # make the string a number so we can add it
            total = total + number_int  # increase the total
    return total
I'm honestly not sure what part of that task seems difficult to you, in part because I'm not sure what exactly is the task you're trying to accomplish.
Unless you expect the datafile to be enormous, the simplest way to start would be to read it all into memory, recording the id, sequence and frequency in a dictionary indexed by id: [Note 1]
with open('clonedata.txt') as file:
    data = { id : (sequence, int(frequency))
             for id, sequence, frequency in (
                 line.split() for line in file)}
With the sample data provided, that gives you: (newlines added for legibility)
>>> data
{'SA1': ('GDNNN', 12),
'SA2': ('TDGNNED', 8),
'SA3': ('VGGNNN', 3)}
and you can get an individual sequence and frequency with something like:
seq, freq = data['SA2']
Apparently, you always want to count the number of digrams (instances of two consecutive letters) in a sequence of letters. You can do that easily with collections.Counter: [Note 2]
from collections import Counter
# ...
seq, freq = data['SA1']
Counter(zip(seq, seq[1:]))
which prints
Counter({('N', 'N'): 2, ('G', 'D'): 1, ('D', 'N'): 1})
It would probably be most convenient to make that into a function:
def count(seq):
    return Counter(zip(seq, seq[1:]))
Also apparently, you actually want to multiply the counted frequency by the frequency extracted from the file. Unfortunately, Counter does not support multiplication (although you can, conveniently, add two Counters to get the sum of frequencies for each key, so there's no obvious reason why they shouldn't support multiplication.) However, you can multiply the counts afterwards:
def count_freq(seq, freq):
    retval = count(seq)
    for digram in retval:
        retval[digram] *= freq
    return retval
If you find tuples of pairs of letters annoying, you can easily turn them back into strings using ''.join().
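For example (a small illustrative snippet reusing the count() function defined above):
# Convert the tuple keys back into two-letter strings
counts = {''.join(pair): n for pair, n in count('GDNNN').items()}
# counts == {'GD': 1, 'DN': 1, 'NN': 2}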
Notes:
That code is completely devoid of error checking; it assumes that your data file is perfect, and will throw an exception for any line with too few elements, including blank lines. You could handle the blank lines by changing for line in file to for line in file if line.strip() or some other similar test, but a fully bullet-proof version would require more work.
zip(a, a[1:]) is the idiomatic way of making an iterator out of overlapping pairs of elements of a list. If you want non-overlapping pairs, you can use something very similar, using the same list iterator twice:
def pairwise(a):
    it = iter(a)
    return zip(it, it)
(Or, javascript style: pairwise = lambda a: (lambda it:zip(it, it))(iter(a)).)
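For instance (an illustrative check of the helper above):
# Non-overlapping pairs: 'GDNNN' -> [('G', 'D'), ('N', 'N')] (the trailing 'N' is dropped)
print(list(pairwise('GDNNN')))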

Comparing 2 huge (5-6 GB) csv files and counting the number of matching and unmatched rows

There are 2 huge (5-6 GB each) csv files. The objective is to compare these files: how many rows are matching and how many rows are not matching?
Let's say file1.csv contains 5 identical lines; we need to count that as 1, not 5.
Similarly, for file2.csv, if there is redundant data, we need to count it as 1.
I expect the output to display the number of rows that are matching and the number of rows that are different.
I have written a file comparer in Python that can efficiently compare huge files and get the counts of matching and different lines. Replace input_file1 and input_file2 with your 2 large files and run it. Let me know the results.
input_file1 = r'input_file.txt'
input_file2 = r'input_file.1.txt'

__author__ = 'https://github.com/praveen-kumar-rr'

# Simple, memory-efficient, high-performance file comparer.
# Can be used to efficiently compare large files.
# Algorithm:
# The lines are hashed and the hashes compared first.
# Non-matching hashes are counted as different lines.
# For the matching hashes, the exact lines are read back from the files
# and undergo the same comparison process based on the strings themselves.


def accumulate_index(values):
    '''
    Returns a dict like {key: [indexes]}
    '''
    result = {}
    for i, v in enumerate(values):
        indexes = result.get(v, [])
        result[v] = indexes + [i]
    return result


def get_lines(fp, line_numbers):
    '''
    Reads lines from the file pointer based on the line_numbers list of indexes
    '''
    return (v for i, v in enumerate(fp) if i in line_numbers)


def get_match_diff(left, right):
    '''
    Compares the left and right iterables and returns the different and matching items
    '''
    left_set = set(left)
    right_set = set(right)
    return left_set ^ right_set, left_set & right_set


if __name__ == '__main__':
    # Gets hashes of all lines for both files
    dict1 = accumulate_index(map(hash, open(input_file1)))
    dict2 = accumulate_index(map(hash, open(input_file2)))
    diff_hashes, matching_hashes = get_match_diff(
        dict1.keys(), dict2.keys())
    diff_lines_count = len(diff_hashes)
    matching_lines_count = 0
    for h in matching_hashes:
        with open(input_file1) as fp1, open(input_file2) as fp2:
            left_lines = get_lines(fp1, dict1[h])
            right_lines = get_lines(fp2, dict2[h])
            d, m = get_match_diff(left_lines, right_lines)
            diff_lines_count += len(d)
            matching_lines_count += len(m)
    print('Total number of matching lines is : ', matching_lines_count)
    print('Total number of different lines is : ', diff_lines_count)
I hope this algorithm works:
create a hash of every line in both files
create a set of those hashes for each file
take the difference and intersection of the two sets
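A minimal sketch of that idea (file names are placeholders; distinct lines are compared, so duplicates within a file count once, as the question requires):
# Sketch: compare two large files via hashes of their lines.
# Ignores the (rare) possibility of hash collisions, which the fuller answer
# above resolves by re-reading and comparing the actual lines.
def line_hashes(path):
    with open(path) as f:
        return {hash(line) for line in f}

h1 = line_hashes("file1.csv")
h2 = line_hashes("file2.csv")

matching = len(h1 & h2)   # distinct lines (by hash) present in both files
different = len(h1 ^ h2)  # distinct lines present in only one of the files

print("matching:", matching)
print("different:", different)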

How to speed up this word-tuple finding algorithm?

I am trying to create a simple model to predict the next word in a sentence. I have a big .txt file that contains sentences separated by '\n'. I also have a vocabulary file which lists every unique word in my .txt file together with a unique ID. I used the vocabulary file to convert the words in my corpus to their corresponding IDs. Now I want to make a simple model which reads the IDs from the txt file, finds the word pairs, and counts how many times each word pair was seen in the corpus. I have managed to write the code below:
tuples = [[]]  # array for word tuples to be stored in
data = []      # array for tuple frequencies to be stored in
data.append(0) # tuples array starts with an empty element at the beginning for some reason.
# Adding zero to the beginning of the frequency array levels the indexes of the two arrays
with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line to array of words
    tupleIndex = 0
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples:  # if the word pair was seen before
            data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1  # increment the frequency of said pair
        else:
            tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])  # if the word pair was never seen before,
            data.append(1)  # add the pair to the list and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write tuples to txt file
    for pair in tuples:
        if (len(pair) > 0):  # if the tuple is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    # blank lines between the two data sections

    # write frequencies of the tuples to txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
This code seems to work well for the first couple of thousand lines. Then things start to get slower, because the tuple list keeps getting bigger and I have to search the whole list to check whether the next word pair was seen before or not. I managed to process 50k lines of data in 30 minutes, but I have much bigger corpuses with millions of lines. Is there a way to store and search for the word pairs more efficiently? Matrices would probably work a lot faster, but my unique word count is about 300,000 words, which means I would have to create a 300k*300k matrix with integers as the data type. Even after taking advantage of symmetric matrices, it would require a lot more memory than what I have.
I tried using memmap from numpy to store the matrix in disk rather than memory but it required about 500 GB free disk space.
Then I studied the sparse matrices and found out that I can just store the non-zero values and their corresponding row and column numbers. Which is what I did in my code.
Right now, this model works, but it is very bad at guessing the next word correctly (about an 8% success rate). I need to train with bigger corpuses to get better results. What can I do to make this word-pair finding code more efficient?
Thanks.
Edit: Thanks to everyone who answered, I am now able to process my corpus of ~500k lines in about 15 seconds. I am adding the final version of the code below for people with similar problems:
import numpy as np
import time

start = time.time()
myDict = {}  # empty dict
with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line to array of words
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one, since the last word has no word after it
        if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict:  # if the word pair was seen before
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  # increment the frequency of said pair
        else:
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1  # if the word pair was never seen before,
            # add the pair to the dict and set its frequency to 1
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)
end = time.time()
print(end - start)

keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")
with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
As I understand it, you are trying to build a Hidden Markov Model, using frequencies of n-grams (word tuples of length n). Maybe just try out a more efficiently searchable data structure, for example a nested dictionary. It could be of the form
{ID_word1: {ID_word1: x1, ... ID_wordk: y1}, ... ID_wordk: {ID_word1: xn, ... ID_wordk: yn}}
This would mean that you only have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This should boost your performance, since you do not have to search a growing list of tuples. x and y represent the occurrence counts, which you should increment when encountering a tuple. (Never use the built-in function count()!)
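A minimal sketch of that nested-dictionary idea, assuming the corpus file from the question (markovData.txt) with one line of word IDs per sentence:
# Sketch: nested dict of pair counts, counts[first_id][second_id] = occurrences
counts = {}
with open("markovData.txt") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]):
            inner = counts.setdefault(first, {})
            inner[second] = inner.get(second, 0) + 1

# e.g. how often ID '42' was followed by ID '7':
# counts.get('42', {}).get('7', 0)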
I would also look into collections.Counter, a data structure made for your task. A Counter object is like a dictionary but counts the occurrences of a key entry. You could use this by simply incrementing a word pair as you encounter it:
from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over adjacent word pairs in the line
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1
Alternatively, you can construct the tuple list as you have and simply pass this into a Counter as an object to compute the frequencies at the end:
word_counts = Counter(word_tuple_list)

Python: compare list items to dictionary keys twice in one for loop?

I'm stuck in a script I have to write and can't find a way out...
I have two files with partly overlapping information. Based on the information in one file I have to extract info from the other and save it into multiple new files.
The first is simply a table with IDs and group information (which is used for the splitting).
The other contains the same IDs, but each twice with slightly different information.
What I'm doing:
I create a list of lists with ID and group information, like this:
table = [[ID, group], [ID, group], [ID, group], ...]
Then, because the second file is huge and not sorted in the same way as the first, I want to create a dictionary as an index. In this index, I would like to save the ID and where it can be found inside the file, so I can quickly jump there later. The problem there, of course, is that every ID appears twice. My simple solution (but I'm in doubt about this) is adding an -a or -b to the ID:
index = {"ID-a": [FPos, length], "ID-b": [FPos, length], "ID-a": [FPos, length], ...}
The code for this:
for line in file:
    read = (line.split("\t"))[0]
    if not (read + "-a") in indices:
        index = read + "-a"
        length = len(line)
        indices[index] = [FPos, length]
    else:
        index = read + "-b"
        length = len(line)
        indices[index] = [FPos, length]
    FPos += length
What I am wondering now is if the next step is actually valid (I don't get errors, but I have some doubts about the output files).
for name in table:
    head = name[0]
    ## first round
    (FPos, length) = indices[head + "-a"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head + " " + "1:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)
    ## second round
    (FPos, length) = indices[head + "-b"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head + " " + "2:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)
Is it ok to use a for loop like that?
Is there a better, cleaner way to do this?
Use a defaultdict(list) to save all your file offsets by ID:
from collections import defaultdict
index = defaultdict(list)
for line in file:
    # ...code that loops through file finding ID lines...
    index[id_value].append((fileposn, length))
The defaultdict will take care of initializing to an empty list on the first occurrence of a given id_value, and then the (fileposn,length) tuple will be appended to it.
This will accumulate all references to each id into the index, whether there are 1, 2, or 20 references. Then you can just search through the given fileposn's for the related data.
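For example, a small usage sketch (the file name and ID here are placeholders; the fields follow the question's tab-separated format):
# Hypothetical lookup: seek to every stored offset for one ID and read its record
with open("second_file.txt") as datafile:  # placeholder name for the large second file
    for fileposn, length in index["SOME_ID"]:
        datafile.seek(fileposn)
        record = datafile.read(length).rstrip()
        items = record.split("\t")
        # ...build the output block from items[9] and items[10] as in the question...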
