find sequence in non tab delimited file - python

Today I encountered a problem again.
I have a file looking like:
File A
>chr1
ACGACTGACTGTCGATCGATCGATGCTCGATGCTCGACGATCGTGCTCGATC
>chr2
GTGACGCACACGTGCTAGCGCTGATCGATCGTAGCTCAGTCAG
>chr3
CAGTCGTCGATCGTCGATCGTCG
and so on (basically, a FASTA file).
In another file I have nice tab-delimited information about my read:
File B
chr2 0 * 2S3M5I2M1D3M * CACTTTTTGTCTA NM:i:6
Both files are truly huge.
I won't write out everything that needs to be done, only the part that I have a problem with:
if the chr2 field from File B matches the line >chr2 in File A, look for CACTTTTTGTCTA (File B) in the sequence of File A (only in the sequence of the >chr2 region; the next >chr is a different chromosome, so I don't want to search there).
To simplify this, let's look for the sequence CACACGTGCTAG in File A.
I tried using a dictionary for File A, but it's completely infeasible.
Any suggestions?

Something like:
for req in fileb:
    (tag, pattern) = parseB(req)
    tag_matched = False
    filea = open(file_a_name)
    for line in filea:
        if line.startswith('>'):
            tag_matched = line[1:].startswith(tag)
        elif tag_matched and (line.find(pattern) > -1):
            do_whatever()
    filea.close()
Should do the job if you can write a parseB function.
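For reference, a minimal parseB could look like this (an assumption on my part: File B is tab-delimited exactly as in the example, with the chromosome in the first field and the read sequence in the sixth):

def parseB(line):
    # Hypothetical parser for the File B format shown above.
    fields = line.rstrip('\n').split('\t')
    return fields[0], fields[5]   # (chromosome tag, sequence to search for)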

Dictionary lookups are fast, so it seems like the part that's taking a long time must be searching within the sequence. Python's substring search (the in operator, or str.find) is implemented in C, so it's pretty efficient. If that's not fast enough, you'll probably need to go with an algorithm more specialized for efficiency, as discussed here: Python efficient way to check if very large string contains a substring
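For what it's worth, a minimal sketch of the dictionary approach (my own; it assumes File A fits in memory and reuses file_a_name, fileb, parseB and do_whatever from the answer above):

# Build a chromosome -> sequence dictionary from File A.
sequences = {}
with open(file_a_name) as filea:
    tag = None
    for line in filea:
        line = line.rstrip('\n')
        if line.startswith('>'):
            tag = line[1:]
            sequences[tag] = []
        else:
            sequences[tag].append(line)
sequences = dict((tag, ''.join(parts)) for tag, parts in sequences.items())

# Each File B record is then one dictionary lookup plus one C-level substring search.
for req in fileb:
    tag, pattern = parseB(req)
    if pattern in sequences.get(tag, ''):
        do_whatever()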

Related

Most optimized way to replace string in a set of files from a list

I have a list that contains some strings.
I have a set of files that may or may not contain these strings.
I need to replace these strings with a modified version of the string in every instance in the files (e.g. string1_abc -> string1_xyz, string2_abc -> string2_xyz). In essence, the substring that needs to be replaced and/or modified is common among all the items in the list.
Is there any optimized or easy way of doing that? The most naive algorithm I can think of looks at each line in each file, and for each line iterates over each of the items in the list and replaces it using line.replace. I know this would give me O(mnq) complexity, where m = number of files, n = number of lines per file and q = number of items in the list.
Note:
All the file sizes aren't very large, so I'm not sure if reading line by line vs. doing a file.read() into a buffer would be better.
q isn't very large either; the list is about 40-50 items.
m is quite large.
n can go up to 5000 lines.
Also, I've only played around with Python on the side and am not very used to it, and I'm limited to using Python 2.6.
Pseudo Python:
import glob

LoT = [("string1_abc", "string1_xyz"), ("string2_abc", "string2_xyz")]

for fn in glob.glob(glob_describes_your_files):
    with open(fn) as f_in:
        buf = f_in.read()   # You said n is about 5000 lines, so I would just read it all in

    for t in LoT:
        buf = buf.replace(*t)

    # write buf back out to a new file or the existing one
    with open(fn, "w") as f_out:
        f_out.write(buf)
Something like that...
If the files are BIG, investigate using a mmap on the files and everything else is more or less the same.
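A rough sketch of that idea (my own, assuming Python 2 byte strings, with fn and LoT as in the snippet above): an mmap lets you check whether a file contains any of the patterns at all before bothering to rewrite it.

import mmap

with open(fn, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only rewrite files that actually contain at least one of the old strings.
    needs_rewrite = any(mm.find(old) != -1 for old, new in LoT)
    mm.close()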

Replace Multiple Strings in a Large Text File in Python

Problem:
Replacing multiple string patterns in a large text file is taking a lot of time. (Python)
Scenario:
I have a large text file with no particular structure to it, but it contains several patterns, for example email addresses and phone numbers.
The text file has over 100 different such patterns and is about 10 MB in size (and could grow). The text file may or may not contain all 100 patterns.
At present, I am replacing the matches using re.sub(), and my approach looks as shown below.
readfile = gzip.open(path, 'r')   # read the zipped file
lines = readfile.readlines()      # load the lines

linestr = ''
for line in lines:
    if len(line.strip()) != 0:    # skip the empty lines
        linestr += line

for pattern in patterns:          # patterns contains all regexes and their respective replacements
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex)
    linestr = re.sub(compiled_regex, replace, linestr)
This approach is taking a lot of time for large files. Is there a better way to optimize it?
I am thinking of replacing += with .join() but not sure how much that would help.
You could use line_profiler to find which lines in your code take the most time:
pip install line_profiler
kernprof -l run.py
Another thing: I think you're building too large a string in memory; maybe you can make use of generators.
You may obtain slightly better results doing:
large_list = []

with gzip.open(path, 'r') as fp:
    for line in fp.readlines():
        if line.strip():
            large_list.append(line)

merged_lines = ''.join(large_list)

for regex, replace in patterns:
    compiled_regex = compile_regex(regex)
    merged_lines = re.sub(compiled_regex, replace, merged_lines)
However, further optimization can be achieved knowing what kind of processing you apply. In fact the last line will be the one that takes up all CPU power (and memory allocation). If regexes can be applied on a per-line basis, you can achieve great results using the multiprocessing package. Threading won't give you anything because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock)
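For illustration, a rough multiprocessing sketch of the per-line idea (my own, assuming Python 3 and that no regex needs to match across line boundaries; the patterns and file name are placeholders):

import gzip
import re
from multiprocessing import Pool

# Placeholder patterns; the real ones would come from patterns.
PATTERNS = [(re.compile(r'[\w.]+@[\w.]+'), '<EMAIL>'),
            (re.compile(r'\d{3}-\d{3}-\d{4}'), '<PHONE>')]

def scrub(line):
    for regex, replacement in PATTERNS:
        line = regex.sub(replacement, line)
    return line

if __name__ == '__main__':
    with gzip.open('input.txt.gz', 'rt') as fp:      # placeholder path
        lines = [line for line in fp if line.strip()]
    with Pool() as pool:
        cleaned = pool.map(scrub, lines, chunksize=10000)
    merged_lines = ''.join(cleaned)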

Python large files, how to find specific lines with a particular string

I am using Python to process data from very large text files (~52 GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to find specific lines. Luckily the string is always in the first column.
The whole thing works, memory is not a problem (I'm not loading the file, just opening and closing it as needed) and I run it on a cluster anyway. It's more about speed: the script takes days to run!
The data looks something like this:
scaffold126 1 C 0:0:20:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold126 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
and I am searching for all the lines that start with a particular string in the first column. I want to process the data and send a summary to an output file. Then I search all the lines for another string, and so on...
I am using something like this:
for thisScaff in AllScaffs:
    InFile = open(sys.argv[2], 'r')
    for line in InFile:
        LineList = line.split()
        currentScaff = LineList[0]
        if thisScaff == currentScaff:
            pass  # Then do this stuff...
The main problem seems to be that all 800 million lines have to be looked through to find those that match the current string. Then once I move to another string, all 800 million have to be looked through again. I have been exploring grep options, but is there another way?
Many thanks in advance!
Clearly you only want to read the file once. It's very expensive to read it over and over again. To speed searching, make a set of the strings you're looking for. Like so:
looking_for = set(AllScaffs)

with open(sys.argv[2]) as f:
    for line in f:
        if line.split(None, 1)[0] in looking_for:
            pass  # bingo! found one
line.split(None, 1) splits on whitespace, but at most 1 split is done. For example,
>>> "abc def ghi".split(None, 1)
['abc', 'def ghi']
This is significantly faster than splitting 29 times (which will happen if each line has 30 whitespace-separated columns).
An alternative:
if line[:line.find(' ')] in looking_for:
That's probably faster still, since no list at all is created. It searches for the leftmost blank, and takes the initial slice of line up to (but not including) that blank.
Create an index. It will require a lot of disk space. Use it only if you have to perform these scaffold lookups many times.
This will be a one-time job; it will take a good amount of time, but it will definitely serve you in the long run.
Your Index will be of the form:
scaffold126:3|34|234443|4564564|3453454
scaffold666:1|2
scaffold5112:4|23|5456456|345345|234234
where 3, 34, etc. are line numbers. Make sure the final file is sorted alphabetically (to allow binary search). Let's call this index Index_Primary.
Now you will create a secondary index to make the search faster. Let's call it Index_Second. Let's say Index_Primary contains a hundred thousand lines, each line representing one scaffold. Index_Second will give us jump points. It can look like:
scaffold1:1
scaffold1000:850
scaffold2000:1450
This says that information about scaffold2000 is present in line number 1450 of Index_Primary.
So now let's say you want to find lines with scaffold1234: you go to Index_Second. It will tell you that scaffold1234 is present somewhere between line 850 and line 1450 of Index_Primary. Now load that block and start from its middle, i.e. line 1150. Find the required scaffold using binary search and voila! You get the line numbers of the lines containing that scaffold, possibly within milliseconds!
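A rough sketch of building Index_Primary (my own; the input and output file names are placeholders):

from collections import defaultdict

index = defaultdict(list)
with open('huge_data.txt') as f:                  # placeholder input file
    for lineno, line in enumerate(f, 1):
        scaff = line.split(None, 1)[0]
        index[scaff].append(str(lineno))

# Write the entries sorted alphabetically so Index_Primary can be binary-searched.
with open('index_primary.txt', 'w') as out:
    for scaff in sorted(index):
        out.write('%s:%s\n' % (scaff, '|'.join(index[scaff])))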
My first instinct would be to load your data into a database, making sure to create an index from column 0, and then query as needed.
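For instance, a minimal sqlite3 sketch of the database idea (the file, table and column names are placeholders of my own):

import sqlite3

conn = sqlite3.connect('scaffolds.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (scaff TEXT, rest TEXT)')
with open('huge_data.txt') as f:                  # placeholder data file
    conn.executemany('INSERT INTO rows VALUES (?, ?)',
                     (line.split(None, 1) for line in f))
conn.execute('CREATE INDEX IF NOT EXISTS idx_scaff ON rows (scaff)')
conn.commit()

# Each lookup is then an indexed query instead of a full file scan.
for scaff, rest in conn.execute('SELECT scaff, rest FROM rows WHERE scaff = ?',
                                ('scaffold126',)):
    pass  # summarize as needed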
For a plain Python approach without a database, try this:
wanted_scaffs = set(['scaffold126', 'scaffold5112'])
files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}

for line in big_file:
    curr_scaff = line.split(' ', 1)[0]  # minimal splitting
    if curr_scaff in wanted_scaffs:
        files[curr_scaff].write(line)

for f in files.values():
    f.close()
Then do your summary reports:
for scaff in wanted_scaffs:
    with open(scaff + '.txt', 'r') as f:
        ...  # summarize your data

In python, is there a way for re.finditer to take a file as input instead of a string?

Let's say I have a really large file foo.txt and I want to iterate through it doing something upon finding a regular expression. Currently I do this:
f = open('foo.txt')
s = f.read()
f.close()
for m in re.finditer(regex, s):
    doSomething()
Is there a way to do this without having to store the entire file in memory?
NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.
UPDATE: I would also like this to work with stdin if possible.
UPDATE: I am considering somehow emulating a string object with a custom file wrapper but I am not sure if the regex functions would accept a custom string-like object.
If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.
from collections import deque

def textwindow(filename, numlines):
    with open(filename) as f:
        window = deque((f.readline() for i in xrange(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            nextline = f.readline()
            window.append(nextline)

for text in textwindow("bigfile.txt", 10):
    pass  # test to see whether your regex matches and do something
Either you will have to read the file chunk-wise, with overlaps to allow for the maximum possible length of the expression, or use an mmapped file, which will work almost or just as well as using a stream: https://docs.python.org/library/mmap.html
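A minimal sketch of the mmap route (my own, assuming Python 3; note the pattern must be a bytes pattern, because the mmap exposes bytes):

import mmap
import re

with open('foo.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for m in re.finditer(rb'some-pattern', mm):   # placeholder pattern
        doSomething()
    mm.close()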
UPDATE to your UPDATE:
Consider that stdin isn't a file; it just behaves a lot like one in that it has a file descriptor and so on. It is a POSIX stream. If you are unclear on the difference, do some googling around. The OS cannot mmap it, therefore Python cannot.
Also consider that what you're doing may be an ill-suited thing to use a regex for. Regexes are great for capturing small stuff, like parsing a connection string, a log entry, CSV data and so on. They are not a good tool for parsing through huge chunks of data. This is by design. You may be better off writing a custom parser.
Some words of wisdom from the past:
http://regex.info/blog/2006-09-15/247
Perhaps you could write a function that reads and yields one line of the file at a time and call re.finditer on that until the file is exhausted.
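A minimal sketch of that suggestion (my own; it only helps when a match never spans more than one line):

import re

def finditer_by_line(fileobj, pattern):
    regex = re.compile(pattern)
    for line in fileobj:           # works for regular files and for sys.stdin
        for m in regex.finditer(line):
            yield m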
Here is another solution, using an internal text buffer to progressively yield found matches without loading the entire file in memory.
This buffer acts like a "sliding window" over the file text, moving forward while yielding found matches.
As the file content is loaded in chunks, this solution works with multiline regexes too.
def find_chunked(fileobj, regex, *, chunk_size=4096):
    buffer = ""
    while 1:
        text = fileobj.read(chunk_size)
        buffer += text
        matches = list(regex.finditer(buffer))
        # End of file, search through the remaining final buffer and exit
        if not text:
            yield from matches
            break
        # Yield found matches except the last one, which may be
        # incomplete because of the chunk cut (think about '.*')
        if len(matches) > 1:
            end = matches[-2].end()
            buffer = buffer[end:]
            yield from matches[:-1]
However, note that it may end up loading the whole file into memory if no matches are found at all, so you should only use this function if you are confident that your file contains the regex pattern many times.
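Hypothetical usage of the function above (the file name and pattern are placeholders):

import re

pattern = re.compile(r'\d+\n\d+')      # a pattern that may span a line break
with open('foo.txt') as f:
    for match in find_chunked(f, pattern):
        print(match.group())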

Python goto text file line without reading previous lines

I am working with a very large text file (TSV), around 200 million entries. One of the columns is a date and the records are sorted on that date. Now I want to start reading the records from a given date. Currently I just read from the start, which is very slow since I need to read almost 100-150 million records just to reach that record. I was thinking that if I could use binary search to speed it up, I could get away with at most about 28 extra record reads (log2 of 200 million). Does Python allow reading the nth line without caching or reading the lines before it?
If the lines are not fixed length, you are out of luck: some function will have to read through the file. If the lines are fixed length, you can open the file, call file.seek(line * linesize), and then read the file from there.
If the file to read is big, and you don't want to read the whole file in memory at once:
fp = open("file")
for i, line in enumerate(fp):
    if i == 25:
        pass   # 26th line
    elif i == 29:
        pass   # 30th line
    elif i > 29:
        break
fp.close()
Note that i == n-1 for the nth line.
You can use the method fileObject.seek(offset[, whence])
#offset -- This is the position of the read/write pointer within the file.
#whence -- This is optional and defaults to 0 which means absolute file positioning, other values are 1 which means seek relative to the current position and 2 means seek relative to the file's end.
file = open("test.txt", "r")
line_size = 8   # 6 digits plus the line ending (2 bytes here)
line_number = 5
file.seek(line_number * line_size, 0)
for i in range(5):
    print(file.readline())
file.close()
For this code I used the following file:
100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
Python has no way to skip "lines" in a file. The best way that I know of is to employ a generator to yield lines based on a certain condition, e.g. date > 'YYYY-MM-DD'. At least this way you reduce memory usage and time spent on I/O.
Example:
# using python 3.4 syntax (parameter type annotation)
from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            # the file is tab separated (because .tsv is the extension)
            # the date column has column-index == 0
            # the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]
But this is still limiting you to one processor :(
Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing and the generator defined above.
I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.
With greater knowledge on the file (e.g. if all the desired dates are clustered at the beginning | center | end), you might be able to optimize the read further.
As others have commented, Python doesn't support this, as it doesn't know where lines start and end (unless they're fixed length). If you're doing this repeatedly, I'd recommend either padding the lines out to a constant length (if practical) or, failing that, reading them into some kind of basic database. You'll take a bit of a hit to memory size, but unless you're only indexing once in a blue moon it'll probably be worth it.
If space is a big concern and padding isn't possible, you could also add a (line number) tag at the start of each line. While you would have to guess the size of the jumps and then parse a sample of lines to check them, that would allow you to build a searching algorithm that finds the right line quickly for only around 10 extra characters per line.
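To illustrate the binary-search idea from the original question, here is a rough sketch of my own (not from the answers above); it binary-searches byte offsets, assuming the first tab-separated column holds the date in a lexicographically sortable format such as YYYY-MM-DD:

def seek_to_date(f, target):
    """Position the binary-mode file f near the first record whose date >= target."""
    f.seek(0, 2)                       # find the file size
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                   # discard the (possibly partial) line we landed in
        line = f.readline()
        if not line or line.split(b'\t', 1)[0] >= target:
            hi = mid
        else:
            lo = mid + 1
    f.seek(lo)
    if lo:
        f.readline()                   # re-sync to the start of the next full line

with open('big_file.tsv', 'rb') as f:  # placeholder file name
    seek_to_date(f, b'2015-01-01')
    for raw in f:
        if raw.split(b'\t', 1)[0] >= b'2015-01-01':   # guard against the boundary line
            pass  # process the record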
