Match lines in two text files - python

I have two text files: the first file is 40 GB (data2) and the second is around 50 MB (data1).
I want to check whether any line in file1 has a match in file2, so I've written a Python script (below) to do so. The process takes far too long with this script because, for each line of file1, it checks the whole of file2 line by line.
for line in open("data1.txt", "r"):
    for line2 in open("data2.txt", "r"):
        if line == line2:
            print(line)
Is there any way to make this faster? The script has been running for 5 days and still hasn't finished. Is there also a way to show a % of progress or the current line number while it runs?

Use a set and reverse the logic: build a set from the lines of the smaller 50 MB file (data1.txt), then check each line of the large data file against it:
with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
lines = set(f1) # efficient 0(1) lookups using a set
for line in f2: # single pass over large file
if line in lines:
print(line)
If you want the line number, use enumerate:
with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
lines = set(f1) # efficient 0(1) lookups using a set
for lined_no, line in enumerate(f2, 1): # single pass over large file
# print(line_no) # uncomment if you want to see every line number
if line in lines:
print(line,line_no)
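The question also asks for a percentage. One sketch (the one-million-line reporting interval is an arbitrary placeholder): open both files in binary mode so the bytes consumed can be compared against the total size from os.path.getsize(). This assumes both files use the same line endings.
import os
import sys

total = os.path.getsize("data2.txt")   # size of the 40 GB file in bytes
done = 0

with open("data1.txt", "rb") as f1, open("data2.txt", "rb") as f2:
    lines = set(f1)                    # the 50 MB file fits comfortably in memory
    for line_no, line in enumerate(f2, 1):
        done += len(line)              # bytes consumed so far (exact in binary mode)
        if line in lines:
            sys.stdout.buffer.write(line)
        if line_no % 1_000_000 == 0:   # progress report every million lines
            print(f"{done / total:.1%} done, line {line_no:,}", file=sys.stderr)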

Python prints two lines in the same line when merging files

I am new to Python, and I'm getting a result that I'm not sure how to fix efficiently.
I have n files (for simplicity, let's say just two) with some info in this format:
1.250484649 4.00E-02
2.173737246 4.06E-02
... ...
This continues for up to m lines. I'm trying to append all m lines from the n files into a single file. I prepared this code:
import glob

outfile = open('temp.txt', 'w')
for inputs in glob.glob('*.dat'):
    infile = open(inputs, 'r')
    for row in infile:
        outfile.write(row)
It reads all the .dat files (the ones I am interested in) and does what I want, but it merges the last line of the first file and the first line of the second file into a single line:
1.250484649 4.00E-02
2.173737246 4.06E-02
3.270379524 2.94E-02
3.319202217 6.56E-02
4.228424345 8.91E-03
4.335169497 1.81E-02
4.557886098 6.51E-02
5.111075901 1.50E-02
5.547288248 3.34E-02
5.685118615 3.22E-03
5.923718239 2.86E-02
6.30299944 8.05E-03
6.528018125 1.25E-020.704223685 4.98E-03
1.961058114 3.07E-03
... ...
I'd like to fix this in a smart way. I could fix it by introducing a blank line between each data line and then removing all the blank lines at the end, but that seems suboptimal.
Thank you!
There's no newline on the last line of each .dat file, so you'll need to add it:
import glob

with open('temp.txt', 'w') as outfile:
    for inputs in glob.glob('*.dat'):
        with open(inputs, 'r') as infile:
            for row in infile:
                if not row.endswith("\n"):
                    row = f"{row}\n"
                outfile.write(row)
This also uses with (context managers) to automatically close the files afterwards.
To avoid a trailing newline: there are a few ways to do this, but the simplest that comes to mind is to load all the input data into memory as individual lines, then write it out in one go using "\n".join(lines). This puts "\n" between each line but not after the last line in the file.
import glob

lines = []
for inputs in glob.glob('*.dat'):
    with open(inputs, 'r') as infile:
        lines += [line.rstrip('\n') for line in infile.readlines()]

with open('temp.txt', 'w') as outfile:
    outfile.write('\n'.join(lines))
[line.rstrip('\n') for line in infile.readlines()] - this is a list comprehension. It builds a list of the lines in one input file, with the '\n' removed from the end of each line. That list can then be appended (+=) to the overall list of lines.
While we're here - let's use logging to give status updates:
import glob
import logging

logging.basicConfig(level=logging.INFO)  # info-level messages are hidden by default without this

OUT_FILENAME = 'test.txt'

lines = []
for inputs in glob.glob('*.dat'):
    logging.info(f'Opening {inputs} to read...')
    with open(inputs, 'r') as infile:
        lines += [line.rstrip('\n') for line in infile.readlines()]
    logging.info(f'Finished reading {inputs}')

logging.info(f'Opening {OUT_FILENAME} to write...')
with open(OUT_FILENAME, 'w') as outfile:
    outfile.write('\n'.join(lines))
logging.info(f'Finished writing {OUT_FILENAME}')

improve search using a dict and pyahocorasick

I'm new to Python and I don't know how to program well. How do I edit this code so that it works using pyahocorasick? My code is very slow because I need to search for lots of strings in a very big file.
Is there any other way to improve the search?
import sys

with open('C:/dict_search.txt', 'r') as search_list:
    targets = [line.strip() for line in search_list]

with open('C:/source.txt', 'r') as source_file, open('C:/out.txt', 'w') as fout:
    for line in source_file:
        if any(target in line for target in targets):
            fout.write(line)
Dict_search.txt
509344
827276
324194
782211
772854
727246
858908
280903
377881
247333
538710
182734
701212
379326
148310
542129
315285
840427
581092
485581
867746
434527
746814
749479
252045
189668
418513
624231
620284
(...)
source.txt
1,324194,20190103,0000048632,00000000000004870,0000045054!
1,701212,20190103,0000048632,00000000000147072,0000045055!
1,581092,20190103,0000048632,00000000000032900,0000045056!
(...)
I need to find whether a "word" from dict_search.txt is in source.txt, and if the word is on a line, I need to copy that line to another file.
The problem is that my source.txt is very big and I have more than 100k words in dict_search.txt.
My code takes too long to execute. I tried using the set() method, but I got a blank file.
After looking at your files, it looks like each line in the dict_search.txt file matches the format of the second column in the source.txt file. If this is the case, the code below will work for you. It's a linear-time solution, so it will be fast, at the cost of space, because it builds a dictionary in memory.
d = {}
with open("source.txt", 'r') as f:
    for index, line in enumerate(f):
        l = line.strip().split(",")
        d[l[1]] = line

with open("Dict_search.txt", 'r') as search, open('output.txt', 'w') as output:
    for line in search:
        row = line.strip()
        if row in d:
            output.write(d[row])
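Since the question specifically asks about pyahocorasick: if the search terms can appear anywhere in a line rather than only in the second column, a substring search along these lines should work. This is just a sketch, assuming pyahocorasick is installed and using the file paths from the question:
import ahocorasick

# Build the automaton once from all search terms
automaton = ahocorasick.Automaton()
with open('C:/dict_search.txt', 'r') as search_list:
    for raw in search_list:
        word = raw.strip()
        if word:
            automaton.add_word(word, word)
automaton.make_automaton()

# Scan the big file line by line; iter() yields (end_index, value) for each match
with open('C:/source.txt', 'r') as source_file, open('C:/out.txt', 'w') as fout:
    for line in source_file:
        if next(automaton.iter(line), None) is not None:
            fout.write(line)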

Search Large file for text and write result to file

I have file one, which is 2.4 million lines (256 MB), and file two, which is 32 thousand lines (1.5 MB).
I need to go through file two line by line and print the matching lines from file one.
Pseudocode:
open file 1, read
open file 2, read
open results, write
for line2 in file 2:
for line1 in file 1:
if line2 in line1:
write line1 to results
stop inner loop
My Code:
p = open("file1.txt", "r")
d = open("file2.txt", "r")
o = open("results.txt", "w")
for hash1 in p:
hash1 = hash1.strip('\n')
for data in d:
hash2 = data.split(',')[1].strip('\n')
if hash1 in hash2:
o.write(data)
o.close()
d.close()
p.close()
I am expecting 32k results.
Your file2 is not too large, so it is perfectly fine to load it into memory.
Load file2.txt into a set to speed up the search and remove duplicates;
Remove the empty line from the set;
Scan file1.txt line by line and write the found matches to results.txt.
with open("file2.txt","r") as f:
lines = set(f.readlines())
lines.discard("\n")
with open("results.txt", "w") as o:
with open("file1.txt","r") as f:
for line in f:
if line in lines:
o.write(line)
If file2 were larger, we could split it into chunks and repeat the same process for every chunk, but in that case it would be harder to compile the results together.
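As an illustration, here is a rough sketch of that chunked variant (the chunk size is an arbitrary placeholder; matches are collected in a set so duplicates collapse across chunks):
from itertools import islice

CHUNK = 500_000   # lines of file2 to hold in memory at a time (placeholder value)
matches = set()   # collect matches across chunks so duplicates collapse

with open("file2.txt", "r") as f2:
    while True:
        chunk = set(islice(f2, CHUNK))       # next batch of lines from file2
        if not chunk:
            break
        chunk.discard("\n")
        with open("file1.txt", "r") as f1:   # rescan the large file once per chunk
            for line in f1:
                if line in chunk:
                    matches.add(line)

with open("results.txt", "w") as o:
    o.writelines(sorted(matches))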

Replacing text from one file from another file

The f1.write(line2) works, but it does not replace the text in the file; it just adds to it. I want file1 to be identical to file2 when they differ, by overwriting the text in file1 with the text from file2.
Here is my code:
with open("file1.txt", "r+") as f1, open("file2.txt", "r") as f2:
for line1 in f1:
for line2 in f2:
if line1 == line2:
print("same")
else:
print("different")
f1.write(line2)
break
f1.close()
f2.close()
I would read both files, create a new list with the differing elements replaced, and then write the entire list back to the file.
with open('file2.txt', 'r') as f:
    content = [line.strip() for line in f]

with open('file1.txt', 'r') as j:
    content_a = [line.strip() for line in j]

for idx, item in enumerate(content_a):
    if content_a[idx] == content[idx]:
        print('same')
        pass
    else:
        print('different')
        content_a[idx] = content[idx]

with open('file1.txt', 'w') as k:
    k.write('\n'.join(content_a))
file1.txt before:
chrx#chrx:~/python/stackoverflow/9.28$ cat file1.txt
this
that
this
that
who #replacing
that
what
blah
code output:
same
same
same
same
different
same
same
same
file1.txt after:
chrx#chrx:~/python/stackoverflow/9.28$ cat file1.txt
this
that
this
that
vash #replaced who
that
what
blah
I want the file1 to be identical to file2
import shutil

with open('file2', 'rb') as f2, open('file1', 'wb') as f1:
    shutil.copyfileobj(f2, f1)
This will be faster as you don't have to read file1.
Your code is not working because you would have to position the file's current pointer (with f1.seek()) at the correct position before writing the line.
In your code, you read a line first, which positions the pointer just after the line you read. When you write, the line data is written at that point in the file, thus duplicating the line.
Since lines can have different sizes, making this work won't be easy: even if you position the pointer correctly, a line that becomes longer would overwrite part of the next line in the file when written. You would end up having to cache at least part of the file contents in memory anyway.
It is better to truncate the file (erase its contents) and write the other file's data directly; then they will be identical. That's what the code above does.

Making the same change to every line in a BED/Interval file in python

I have a BED interval file that I'm trying to work with using the Galaxy online tool. Currently, every line in the file begins with a number (which stands for the chromosome number). In order to upload it properly, I need every line to begin with "chr" followed by that number. For example, lines that start with "2L" need to become "chr2L", and the same goes for every other line that starts with a number (not just 2L; there are many different numbers). If I could just add "chr" to the start of every line, without affecting the other columns, that would be great, but I have no idea how to do that (I'm very new to Python).
Can you please help me out?
Thanks.
http://docs.python.org/2/library/stdtypes.html#file.writelines
with open('bed-interval') as f1, open('bed-interval-modified', 'w') as f2:
    f2.writelines('chr' + line for line in f1)
Step 1: open the file
file = open("somefile.txt")
Step 2: get the lines
lines = list(file.readlines())
file.close()
Step 3: use a list comprehension
new_lines = ["chr" + line for line in lines]
Step 4: write the new lines back to the file
with open("somefile.txt", "w") as f:
    f.writelines(new_lines)
To avoid storing all the lines in memory:
file1 = open("some.txt")
file2 = open("output.txt", "w")

for line in file1:
    file2.write("chr" + line)  # each line already ends with "\n"

file1.close()
file2.close()
Then just copy output.txt over your original filename.
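For example (using the filenames from the snippet above), that final copy step can also be done from Python:
import os

os.replace("output.txt", "some.txt")  # replace the original file with the rewritten copy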
