python mmap regex searching common entries in two files - python

I have two huge XML files. One is around 40 GB, the other is around 2 GB. Assume the XML format is something like this:
<xml>
...
  <page>
    <id>123</id>
    <title>ABC</title>
    <text> .....
    .....
    .....
    </text>
  </page>
...
</xml>
I have created an index file for both file 1 and file 2 using mmap.
Each of the index files complies with this format:
Id <page>_byte_position </page>_byte_position
So, basically, given an Id, I know from the index file where the <page> tag for that Id starts and where it ends (i.e. the tag's byte positions).
Now, what I need to do is:
- For each id in the smaller index file (for the 2 GB file), I need to figure out whether that id exists in the larger index file.
- If the id exists, I need to get the <page>_byte_pos and </page>_byte_pos for that id from the larger index file (for the 40 GB file).
My current code is awfully slow. I guess I am doing an O(m*n) algorithm, where m is the size of the larger file and n the size of the smaller file:
with open(smaller_idx_file, "r+b") as f_small_idx:
    for line in f_small_idx.readlines():
        split = line.split(" ")
        with open(larger_idx_file, "r+b") as f_large_idx:
            for line2 in f_large_idx.readlines():
                split2 = line2.split(" ")
                if split[0] in split2:
                    print split[0]
                    print split2[1] + " " + split2[2]
This is AWFULLY slow !!!!
Any better suggestions ??
Basically, given two huge files, how do you check whether each word in a particular column of the smaller file exists in the huge file, and if it does, how do you extract the other relevant fields as well?
Any suggestions would be greatly appreciated!! : )

Don't have time for an elaborate answer right now but this should work (assuming the temporary dict will fit into memory):
- Iterate over the smaller file and put all the words of the relevant column in a dict (lookup in a dict has an average-case performance of O(1)).
- Iterate over the larger file and look up each word in the dict, storing the relevant information either directly with the dict entries or elsewhere (see the sketch below).
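For example, a minimal sketch of those two steps (Python 3 syntax; the index-line layout "id start_byte end_byte" and the file-name variables are assumptions based on the question above):

# Step 1: load the ids from the smaller index file into a dict.
wanted = {}
with open(smaller_idx_file) as f_small_idx:
    for line in f_small_idx:
        parts = line.split()
        if parts:
            wanted[parts[0]] = None  # value will hold the byte positions later

# Step 2: stream the larger index file once and fill in the positions.
with open(larger_idx_file) as f_large_idx:
    for line in f_large_idx:
        parts = line.split()
        if parts and parts[0] in wanted:             # O(1) average-case lookup
            wanted[parts[0]] = (parts[1], parts[2])  # <page> / </page> byte positions

for id_, pos in wanted.items():
    if pos is not None:
        print(id_, pos[0], pos[1])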
If this does not work I would suggest sorting (or filtering) the files first so that chunks can then be processed independently (i.e. compare only everything that starts with A then B...)

Related

Comparing all contents of two files

I am trying to compare two files. One file has a list of stores. The other has the same list of stores, except it is missing a few because of a filter I ran against it from another script. I would like to compare these two files: if a store in file 1 cannot be found anywhere in file 2, I want to print it out or append it to a list (not too picky on that part). Below are partial examples of both files:
file 1:
Store: 00377
Main number: 8033056238
Store: 00525
Main number: 4075624470
Store: 00840
Main number: 4782736996
Store: 00920
Main number: 4783337031
Store: 00998
Main number: 9135631751
Store: 02226
Main number: 3107501983
Store: 02328
Main number: 8642148700
Store: 02391
Main number: 7272645342
Store: 02392
Main number: 9417026237
Store: 02393
Main number: 4057942724
File 2:
00377
00525
00840
00920
00998
02203
02226
02328
02391
02392
02393
02394
02395
02396
02397
02406
02414
02425
02431
02433
02442
Here is what I built to try and make this work, but it just keeps spewing all stores in the file.
def comparesitestest():
    with open("file_1.txt", "r") as pairsin:
        pairs = pairsin.readlines()
        pairsin.close
    with open("file_2.txt", "r") as storesin:
        stores = storesin.readlines()
        storesin.close
    for pair in pairs:
        for store in stores:
            if store not in pair:
                print(store)
When you read your first file, add the store number to a set.
store_nums_1 = set()
with open("file_1.txt") as f:
    for line in f:
        line = line.strip()  # Remove trailing whitespace
        if line.startswith("Store"):
            store_nums_1.add(line[7:])  # Add only store number to set
Next, read the other file and add those numbers to another set
store_nums_2 = set()
with open("file_2.txt") as f:
    for line in f:
        line = line.strip()  # Remove trailing whitespace
        store_nums_2.add(line)  # The entire line is the store number, so no need to slice.
Finally, find the set difference between the two sets.
file1_extras = store_nums_1 - store_nums_2
Which gives a set containing only the store numbers in file 1 but not in file 2. (I changed your file_2 to have only the first three lines, because the file you've shown actually contains more store numbers than file_1, so the result file1_extras was empty using your input)
{'00920', '00998', '02226', '02328', '02391', '02392', '02393'}
This is more efficient than using lists, because checking if something exists in a list is an O(N) operation. When you do it once for each of the M items in your first list, you end up with an O(N*M) operation. On the other hand, membership checks in a set are O(1), so the entire set-difference operation is O(M) instead of O(N*M)
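As a rough illustration of that difference, a hypothetical micro-benchmark (not from the original answer) could look like this:

import timeit

setup = "data_list = list(range(100_000)); data_set = set(data_list)"
# Membership test for an item near the end of the collection:
# the list scans all elements (O(N)), the set hashes once (O(1)).
print(timeit.timeit("99_999 in data_list", setup=setup, number=1_000))
print(timeit.timeit("99_999 in data_set", setup=setup, number=1_000))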
You are getting the output you get because your check is not checking what you want. Try changing your for loop to something like this:
for pairline in pairs:
    if pairline:
        name, number = pairline.split(': ')
        if name == "Store":
            if number not in stores:
                print(number)
Explanation is as follows:
You start with a File 1 of pairs, and a File 2 of stores (store numbers, really). Your file 2 is in decent shape. After you read it in, you've got a list of store numbers. You don't need to put that through a second loop. In fact, it's wasteful and unnecessary.
Your File 1 is a little more complicated. Although you refer to the info as pairs, it's a little more complicated than that, because the lines have a store number and what I assume is a phone number. So, for each line in File 1, I would check if the line starts with "Store:", knowing I can ignore all the other lines. If the line starts with "Store:", the next part of the line is the store number I actually want to check for in the list from File 2.
So, the program above does a little more checking to see if it's reading in a line it needs to act on. and then it acts on it if necessary by checking whether the store number is in the store number list.
Also, as a side note, it's great to use the with structure. It's good coding practice. But when you do that, you do not need to explicitly close the file. That happens automatically with that context structure. Once you leave the context, the close happens automatically.
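For instance, with the question's own file_2.txt (a generic illustration of the context-manager behaviour, not part of the original code):

with open("file_2.txt") as storesin:
    stores = storesin.readlines()
    # the file is still open here
# Leaving the with-block closes the file automatically,
# even if an exception was raised inside it.
print(storesin.closed)  # True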
As another side note, there are usually multiple good ways and bad ways to solve a problem. Another possible reasonable solution/version is:
for pairline in pairs:
    if pairline and pairline.startswith("Store:"):
        store = pairline.split()[1]
        if store not in stores:
            print(store)
It's different. Not necessarily better or worse, just different.

select and filtered files in directory with enumerating in loop

I have a folder that contains many files with the .EOF extension (they are satellite orbital information files). As you can see in my example, every file name contains a date range like 20190729_20190731. I want to sort them in the ordinary way with Python code, then select the 1st, 24th, 47th, ... files (by index) and delete the others, because I need an information file every 24 days (for example V20190822T225942_20190824T005942), not every day. For convenience, I select these files starting from the first day I need: once the first file is found, I select the one 24 days after it, then 47 days after the first (i.e. 24 days after the second), and so on. I want to keep exactly those files and delete all other files in my EOF source folder. The files I want to keep look like these:
S1A_OPER_AUX_POEORB_OPOD_20190819T120911_V20190729T225942_20190731T005942.EOF
S1A_OPER_AUX_POEORB_OPOD_20190912T120638_V20190822T225942_20190824T005942.EOF
.
.
.
Mr Zach Young wrote the code below and I appreciate him so much; I never thought somebody would help me. I think I'm very close to the goal.
The error points at the line print(f'Keeping {eof_file}'). I changed the syntax to print(f"Keeping {eof_file}") but I get the same error.
from importlib.metadata import files
import os
import pprint

items = os.listdir("C:/Users/m/Desktop/EOF")
eof_files = []
for item in items:
    # make sure case of item and '.eof' match
    if item.lower().endswith('.eof'):
        eof_files.append(item)

eof_files.sort(key=lambda fname: fname.split('_')[5])

print('All EOF files, sorted')
pprint.pprint(eof_files)

print('\nKeeping:')
files_to_delete = []
count = 0
offset = 2
for eof_file in eof_files:
    if count == offset:
        print(f"Keeping: {eof_file}")
        # reset count
        count = 0
        continue
    files_to_delete.append(eof_file)
    count += 1

print('\nRemoving:')
for f_delete in files_to_delete:
    print(f'Removing: {f_delete}')
Here's a top-to-bottom demonstration.
I recommend that you:
- Run that script as-is and make sure your print statements match mine.
- Swap in your items = os.listdir(...), and see that your files are properly sorted.
- Play with the offset variable and make sure you can control what should be kept and what should be deleted; notice that an offset of 2 keeps every third file, because count starts at 0. You might need to play around and experiment to make sure you're happy before moving to the final step.
- Finally, swap in your os.remove(f_delete).
#!/usr/bin/env python3
import pprint

items = [
    'foo_bar_baz_bak_bam_20190819T120907_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120901_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120905_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120902_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120903_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120904_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120906_V2..._SomeOtherDate.EOF',
    'bogus.txt'
]

eof_files = []
for item in items:
    # make sure case of item and '.eof' match
    if item.lower().endswith('.eof'):
        eof_files.append(item)

eof_files.sort(key=lambda fname: fname.split('_')[5])

print('All EOF files, sorted')
pprint.pprint(eof_files)

print('\nKeeping:')
files_to_delete = []
count = 0
offset = 2
for eof_file in eof_files:
    if count == offset:
        print(f'Keeping {eof_file}')
        # reset count
        count = 0
        continue
    files_to_delete.append(eof_file)
    count += 1

print('\nRemoving:')
for f_delete in files_to_delete:
    print(f'Removing {f_delete}')
When I run that, I get:
All EOF files, sorted
['foo_bar_baz_bak_bam_20190819T120901_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120902_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120903_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120904_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120905_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120906_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120907_V2..._SomeOtherDate.EOF']
Keeping:
Keeping foo_bar_baz_bak_bam_20190819T120903_V2..._SomeOtherDate.EOF
Keeping foo_bar_baz_bak_bam_20190819T120906_V2..._SomeOtherDate.EOF
Removing:
Removing foo_bar_baz_bak_bam_20190819T120901_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120902_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120904_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120905_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120907_V2..._SomeOtherDate.EOF
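Once the Keeping/Removing output looks right, the final swap mentioned above might look roughly like this (a sketch: base_dir is assumed to be the same folder passed to os.listdir, and this actually deletes files, so only run it after verifying the lists):

import os

base_dir = "C:/Users/m/Desktop/EOF"  # the folder that was passed to os.listdir
for f_delete in files_to_delete:
    print(f'Removing {f_delete}')
    # os.listdir returns bare file names, so join them with the folder before deleting
    os.remove(os.path.join(base_dir, f_delete))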

Conversion to Logn Python 3.7

I have this code that works great and does what I want; however, it does it in linear form, which is way too slow for the size of my data files, so I want to convert it to log. I tried this code and many others posted here but still have no luck getting it to work. I will post both sets of code and give examples of what I expect.
import pandas
import fileinput

'''This code runs fine and does what I expect removing duplicates from big
file that are in small file, however it is a linear function.'''
with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

for line in fileinput.input('big.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line, end='')
    else:
        print('')
'''This code is my attempt at conversion to a log function.'''
def log_search(small, big):
    first = 0
    last = len(big.txt) - 1
    while first <= last:
        mid = (first + last) / 2
        if str(mid) == small.txt:
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')
    return log_search(small, big)
The big file has millions of lines of int data.
The small file has hundreds of lines of int data.
I want to compare the data and remove duplicated data from the big file, but leave the line blank where a duplicate was removed.
Running the first block of code works, but it takes too long to search through the big file. Maybe I am approaching the problem in the wrong way. My attempt at converting it to log runs without error but does nothing.
I don't think there is a better or faster way to do this that what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating the lines in big.txt, checking whether they are in that set, will have complexity of O(b), with b being the number of lines in big.txt.
What you seem to be trying is to reduce this to O(s*logb), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt and removing/overwriting it then.
This would work well if all the lines were in a list or array with random access to any element, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you will still have to find the previous line break before that position before you can actually read the line. Also, you cannot just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.
So, yes, theoretically it can be done in O(s*logb), if you do the following:
- implement binary search, searching not on the lines, but on the characters of the big file
- for each position, backtrack to the last line break, then read the line to get the number
- try again in the lower/upper half as usual with binary search
- if the number is found, replace it with as many spaces as there are digits in the number
- repeat with the next number from the small file
On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with the same number of - characters (not just a blank line; that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors etc.
import os

def getlineat(f, pos):
    pos = f.seek(pos)
    while pos > 0 and f.read(1) != "\n":
        pos = f.seek(pos-1)
    return pos+1 if pos > 0 else 0

def bsearch(f, num):
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line:
            break  # end of file
        val = int(line)
        if val == num:
            return (pos, len(line.strip()))
        elif num < val:
            upper = mid - 1
        elif num > val:
            lower = mid + 1
    return (-1, -1)

def overwrite(filename, to_remove):
    with open(filename, "r+") as f:
        positions = [bsearch(f, n) for n in to_remove]
        for n, (pos, length) in sorted(zip(to_remove, positions)):
            print(n, pos)
            if pos != -1:
                f.seek(pos)
                f.write("-" * length)

import random
to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)
This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search will have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed, and it worked just fine.
Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*logb), at least on my system (the complexity of file.seek may depend on file system, file type, etc.).
The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.
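A sketch of that generalization (hypothetical, reusing getlineat from the code above; the original answer only describes the idea):

def bsearch_by(f, target, value_function):
    # Binary-search a sorted text file by value_function(line);
    # returns (pos, line_length) or (-1, -1) if not found.
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line:
            break  # end of file
        val = value_function(line)
        if val == target:
            return (pos, len(line.strip()))
        elif target < val:
            upper = mid - 1
        else:
            lower = mid + 1
    return (-1, -1)

# e.g. for a CSV file sorted by its first column:
# pos, length = bsearch_by(f, "some_key", lambda line: line.split(",")[0])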

Python: Extract substring from text file based on character index

So I have a file with a few thousand entries of the following form (FASTA format, if anyone wants to know):
>scaffold1110_len145113_cov91
TAGAAAATTGAATAATTGATAGTTCTTAACGAAAAGTAAAAGTTTAAAGTATACAGAAATTTCAGGCTATTCACTCTTTT
ATAATCCAAAATTAGAAATACCACACCTTGCATAAAGTTTAAGATATTTACAAAAACCTGAAGTGGATAATCCGAAATCG
...
>Next_Header
ATGCTA...
And I have a python-dictionary from part of my code that contains information like the following for a number of headers:
{'scaffold1110_len145113_cov91': [[38039, 38854, 106259], [40035, 40186, 104927]]}
This describes the entry by header and a list of start position, end position and rest of characters in that entry (so start=1 means the first character of the line below that corresponding header). [start, end, left]
What I want to do is extract the string for this interval, including 25 (or a variable number of) characters in front of and behind it if the entry allows for it; otherwise include all characters up to the beginning/end. (For example, when the start position is 8, I can't include 25 characters in front, only 8.)
And that for every entry in my dict.
Sounds not too hard probably but I am struggling to come up with a clever way to do it.
For now my idea was to read lines from my file, check if they begin with ">" and look up if they exist in my dict. Then add up the chars per line until they exceed my start position and from there somehow manage to get the right part of that line to match my startPos - X.
for line in genomeFile:
    line = line.strip()
    if line[0] == ">":
        header = line
        currentCluster = foundClusters.get(header[1:])
        if currentCluster is not None:
            outputFile.write(header + "\n")
    if currentCluster is not None:
        charCount += len(line)
        # *crazy calculations to find the actual part i want to extract*
I am quite the python beginner so maybe someone has a better idea how to solve this?
-- While typing this I got the idea to use file.read(startPos-X-1) after a line matches to a header I am looking for to read characters to get to my desired position and from there use file.read((endPos+X - startPos-X)) to extract the part I am looking for. If this works it seems pretty easy to accomplish what I want.
I'll post this anyway, maybe someone has an even better way or maybe my idea wont work.
thanks for any input.
EDIT:
turns out you can't mix for line in file with file.read(x), since the former uses buffering, soooooo back to the batcave. Also, file.read(x) probably counts newlines too, which my data for start and end positions does not.
(also fixed some stupid errors in my posted code)
Perhaps you could use a function to generate your needed splice indices.
def biggerFrame(start, end, left, frameSize=25):  # defaults to 25 frameSize
    newStart = start - frameSize
    if newStart < 0:
        newStart = 0
    if frameSize > left:
        # fewer than frameSize characters remain after end, so stop at the end of the entry
        newEnd = end + left
    else:
        newEnd = end + frameSize
    return newStart, newEnd
With that function, you can add something like the following to your code.
for indices in currentCluster:
    slice, dice = biggerFrame(indices[0], indices[1], indices[2], 50)  # frameSize is 50 here; you can make it whatever you want.
    outputFile.write(line[slice:dice] + '\n')
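Since the sequence for one header spans several lines, one way to use biggerFrame is to join the lines of each entry first and slice the joined string. A sketch under that assumption (genomeFile, outputFile, foundClusters and biggerFrame are the names from the question and answer above):

current_header = None
seq_parts = []

def flush(header, parts):
    # Write the requested slices for one completed entry, if it is in foundClusters.
    cluster = foundClusters.get(header)
    if cluster is None:
        return
    sequence = "".join(parts)  # newlines already stripped, so positions count sequence characters only
    outputFile.write(">" + header + "\n")
    for start, end, left in cluster:
        lo, hi = biggerFrame(start, end, left, 50)
        # positions are treated as 0-based here; shift by one if they are 1-based as in the question
        outputFile.write(sequence[lo:hi] + "\n")

for line in genomeFile:
    line = line.strip()
    if line.startswith(">"):
        if current_header is not None:
            flush(current_header, seq_parts)
        current_header = line[1:]
        seq_parts = []
    else:
        seq_parts.append(line)
if current_header is not None:
    flush(current_header, seq_parts)  # don't forget the last entry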

Binary search over a huge file with unknown line length

I'm working with huge CSV data files. Each file contains millions of records, and each record has a key. The records are sorted by their key. I don't want to go over the whole file when searching for certain data.
I've seen this solution : Reading Huge File in Python
But it assumes that all lines in the file have the same length, which is not the case for my data.
I thought about adding padding to each line and keeping a fixed line length, but I'd like to know if there is a better way to do it.
I'm working with Python.
You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
bytes = 24935502  # size of the file in bytes
for i, search in enumerate(list):  # list contains the list of search keys
    left, right = 0, bytes - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) / 2
        fin.seek(mid)
        # now realign to a record
        if mid:
            fin.readline()
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None  # for when search key is not found
    search.result = value  # store the result of the search
To resolve it, you can also use binary search, but you need to change it a bit:
1. Get the file size.
2. Seek to the middle of the file with file.seek.
3. Search for the first EOL character there; then you have found a new line.
4. Check this line's key; if it is not what you want, update begin/end and go to step 2.
Here is a sample code:
fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()
while begin < end:
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if key == line_key:
        pass  # find what you want
    elif key > line_key:
        begin = fp.tell()
    else:
        end = fp.tell()
Maybe the code has bugs. Verify yourself. And please check the performance if you really want the fastest way.
The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.
Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.
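A minimal sketch of that index-then-seek idea (the file name, the comma-separated key-first layout, and seeking via the file object rather than os.lseek are assumptions):

def build_offset_index(path):
    # Map the first field of every line to the byte offset where that line starts.
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b",", 1)[0]
            index[key] = offset
    return index

def lookup(path, index, key):
    # Jump straight to a record by its stored offset and return the full line.
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().rstrip(b"\r\n")

# Usage, assuming a hypothetical "data.csv" with comma-separated, key-first records:
# idx = build_offset_index("data.csv")
# print(lookup("data.csv", idx, b"12345"))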
