I am using Python to process data from very large text files (~52 GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to locate specific lines. Luckily the string is always in the first column.
The whole thing works, memory is not a problem (I'm not loading the file, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed: the script takes days to run!
The data looks something like this:
scaffold126 1 C 0:0:20:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold126 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
scaffold5112 2 C 0:0:10:0:0:0 0:0:1:0:0:0 0:0:0:0:0:0
and I am searching for all the lines that start with a particular string from the first column. I want to process the data and send a summary to an output file. Then I search all the lines for another string, and so on...
I am using something like this:
for thisScaff in AllScaffs:
    InFile = open(sys.argv[2], 'r')
    for line in InFile:
        LineList = line.split()
        currentScaff = LineList[0]
        if thisScaff == currentScaff:
            pass  # Then do this stuff...
The main problem seems to be that all 800 million lines have to be looked through to find the ones that match the current string. Then once I move on to another string, all 800 million have to be looked through again. I have been exploring grep options, but is there another way?
Many thanks in advance!
Clearly you only want to read the file once. It's very expensive to read it over and over again. To speed searching, make a set of the strings you're looking for. Like so:
looking_for = set(AllScaffs)
with open(sys.argv[2]) as f:
    for line in f:
        if line.split(None, 1)[0] in looking_for:
            pass  # bingo! found one
line.split(None, 1) splits on whitespace, but at most 1 split is done. For example,
>>> "abc def ghi".split(None, 1)
['abc', 'def ghi']
This is significantly faster than splitting 29 times (which will happen if each line has 30 whitespace-separated columns).
An alternative:
if line[:line.find(' ')] in looking_for:
That's probably faster still, since no list at all is created. It searches for the leftmost blank, and takes the initial slice of line up to (but not including) that blank.
Create an index. It will require a lot of disk space; use it only if you have to perform these scaffold lookups many times.
It will be a one-time job and will take a good amount of time, but it will definitely serve you well in the long run.
Your Index will be of the form:
scaffold126:3|34|234443|4564564|3453454
scaffold666:1|2
scaffold5112:4|23|5456456|345345|234234
where 3, 34, etc. are line numbers. Make sure the final file is sorted alphabetically (to allow binary search). Let's call this index Index_Primary.
Now you will create a secondary index to make the search faster. Let's call it Index_Second. Let's say Index_Primary contains hundred thousand lines, each line representing one scaffold. Index_Second will give us jump points. It can be like:
scaffold1:1
scaffold1000:850
scaffold2000:1450
This says that information about scaffold2000 is present in line number 1450 of Index_Primary.
So now let's say you want to find lines with scaffold1234. You go to Index_Second, which tells you that scaffold1234 is present somewhere between lines 850 and 1450 of Index_Primary. Load that block and start from its middle, i.e. line 1150. Find the required scaffold using binary search and voila! You get the line numbers of the lines containing that scaffold, possibly within milliseconds!
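A rough sketch of building and querying the primary index. The secondary index is skipped here because a primary index with a few hundred thousand lines can simply be loaded into memory and searched with bisect; index_primary.txt is a made-up file name and the in-memory dict is my assumption, not part of the description above.

from collections import defaultdict
import bisect
import sys

# Pass 1 (one-time job): record which line numbers each scaffold appears on.
# Note: this keeps all line numbers in memory while building; for 800 million
# lines you may prefer to write entries out incrementally and sort externally.
index = defaultdict(list)
with open(sys.argv[2]) as f:
    for lineno, line in enumerate(f, 1):
        index[line.split(None, 1)[0]].append(lineno)

# Write the primary index, sorted alphabetically so it can be binary searched.
with open('index_primary.txt', 'w') as out:
    for scaff in sorted(index):
        out.write('%s:%s\n' % (scaff, '|'.join(map(str, index[scaff]))))

# Lookup: load the sorted keys once, then binary search with bisect.
with open('index_primary.txt') as f:
    entries = [line.rstrip('\n').split(':', 1) for line in f]
keys = [k for k, _ in entries]

def line_numbers_for(scaffold):
    i = bisect.bisect_left(keys, scaffold)
    if i < len(keys) and keys[i] == scaffold:
        return [int(n) for n in entries[i][1].split('|')]
    return []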
My first instinct would be to load your data into a database, making sure to create an index on column 0, and then query as needed.
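This is only a sketch of the database idea, assuming a one-off load is acceptable; the scaffolds.db file name and the scaffold_lines table are made up:

import sqlite3
import sys

conn = sqlite3.connect('scaffolds.db')
conn.execute('CREATE TABLE IF NOT EXISTS scaffold_lines (scaffold TEXT, line TEXT)')

# One-off load: store the first column alongside the raw line.
with open(sys.argv[2]) as f:
    conn.executemany('INSERT INTO scaffold_lines VALUES (?, ?)',
                     ((line.split(None, 1)[0], line) for line in f))
conn.execute('CREATE INDEX IF NOT EXISTS idx_scaffold ON scaffold_lines (scaffold)')
conn.commit()

# After that, each scaffold lookup is an indexed query instead of a full scan.
for (line,) in conn.execute('SELECT line FROM scaffold_lines WHERE scaffold = ?',
                            ('scaffold126',)):
    pass  # summarize as needed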
For a plain Python approach without a database, try this:
wanted_scaffs = set(['scaffold126', 'scaffold5112'])
files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}
for line in big_file:
    curr_scaff = line.split(' ', 1)[0]  # minimal splitting
    if curr_scaff in wanted_scaffs:
        files[curr_scaff].write(line)
for f in files.values():
    f.close()
Then do your summary reports:
for scaff in wanted_scaffs:
    with open(scaff + '.txt', 'r') as f:
        ...  # summarize your data
The program I'm working on needs to read data files that can be quite large (up to 5 GB) in ASCII. The format can vary, which is why I came up with the following approach: use readline(), split every line to get just the pure entries, append them all to one big list of strings, divide that list into smaller string lists depending on the occurrence of certain marker words, and then pass the data to a program-internal data structure for further unified processing.
This method works well enough, except that it needs way too much memory, and I wonder why.
So I wrote this little test case to illustrate my problem:
The input data here is the text of Shakespeare's Romeo and Juliet (in practice I expect mixed alphabetic/numeric input) - note that I want you to copy the data in yourself to keep things clear. The script generates a .txt file which is then read in again. The original memory size in this case is 153 KB.
Reading this file with...
f.read() gives you a single string with a size of 153 KB, too.
f.readlines() gives you a list with a single string for every line, with an overall size of 420 KB.
Splitting the line strings from f.readlines() at every whitespace and saving all those single entries in a new list results in 1619 KB of memory use.
While these numbers don't seem to be a problem in this case, a factor of >10 increase in RAM requirement definitely is one for input data on the order of GB.
I don't have any idea why this is or how to avoid it. As far as I understand, a list is just a structure of pointers to all the values stored in the list (this is also the reason why sys.getsizeof() on a list gives you a 'wrong' result).
For the values themselves it shouldn't make a difference in memory whether I have "LONG STRING" or "LONG" + "STRING", as both use the same characters, which should result in the same number of bits/bytes.
Maybe the answer is really simple but I am really stuck with this problem so I am thankful for every idea.
# step1: http://shakespeare.mit.edu/romeo_juliet/full.html
# step2: Ctrl+A and then Ctrl+C
# step3: Ctrl+V after benchmarkText

benchmarkText = """ >>INSERT ASCII DATA HERE<< """

#=== import modules =======================================
from pympler import asizeof
import sys

#=== open files and save data to a structure ==============
#--- original memory size
print("\n\nAll memory sizes are in KB:\n")
print("Original string size:")
print(asizeof.asizeof(benchmarkText)/1e3)
print(sys.getsizeof(benchmarkText)/1e3)

#--- write benchmark file
with open('benchMarkText.txt', 'w') as f:
    f.write(benchmarkText)

#--- read the whole file (should always be equal to original size)
with open('benchMarkText.txt', 'r') as f:
    # read the whole file as one string
    wholeFileString = f.read()

# check size:
print("\nSize using f.read():")
print(asizeof.asizeof(wholeFileString)/1e3)

#--- read the file into a list
listOfWordOrNumberStrings = []
with open('benchMarkText.txt', 'r') as f:
    # save every line of the file
    listOfLineStrings = f.readlines()

print("\nSize using f.readlines():")
print(asizeof.asizeof(listOfLineStrings)/1e3)

# split every line into the words or punctuation marks
for stringLine in listOfLineStrings:
    line = stringLine[:-1]  # get rid of the '\n'
    # line = re.sub('"', '', line)  # The final implementation will need this, but for the test case it doesn't matter.
    elemsInLine = line.split()
    for elem in elemsInLine:
        listOfWordOrNumberStrings.append(elem)

# check size
print("\nSize after splitting:")
print(asizeof.asizeof(listOfWordOrNumberStrings)/1e3)
(I am aware that I use readlines() instead of readline() here - I changed it for this test case because I think it makes things easier to understand.)
I have a list that contains some strings.
I have a set of files that may or may not contain these strings.
I need to replace these strings with a modified version of the string in every instance in the files (e.g. string1_abc -> string1_xyz, string2_abc -> string2_xyz). In essence, the substring that needs to be replaced and/or modified is common among all the items in the list.
Is there any optimized or easy way of doing that? The most naive algorithm I can think of looks at each line in each file, and for each line iterates over each of the items in the list and replaces it using line.replace. I know this would give me O(mnq) complexity, where m = number of files, n = number of lines per file, and q = number of items in the list.
Note:
None of the files is very large, so I'm not sure whether reading line by line or doing a file.read() into a buffer would be better.
q isn't very large either; the list is about 40-50 items.
m is quite large.
n can go up to 5000 lines.
Also, I've only played around with Python on the side and am not very used to it, and I'm limited to using Python 2.6.
Pseudo Python:
import glob

LoT = [("string1_abc", "string1_xyz"), ("string2_abc", "string2_xyz")]

for fn in glob.glob(glob_describes_your_files):
    with open(fn) as f_in:
        buf = f_in.read()  # You said n is about 5000 lines, so I would just read it in
    for t in LoT:
        buf = buf.replace(*t)
    # write buf back out to a new file or the existing one
    with open(fn, "w") as f_out:
        f_out.write(buf)
Something like that...
If the files are BIG, investigate using mmap on the files; everything else is more or less the same.
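A rough Python 2 sketch of what the mmap variant could look like; it reuses the glob_describes_your_files placeholder from above, and it only works as an in-place edit when the old and new strings have the same length (as the example pairs here happen to), otherwise stick to the read/replace/write version:

import glob
import mmap

LoT = [("string1_abc", "string1_xyz"), ("string2_abc", "string2_xyz")]

for fn in glob.glob(glob_describes_your_files):
    with open(fn, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)           # map the whole file read/write
        for old, new in LoT:
            pos = mm.find(old)
            while pos != -1:
                mm[pos:pos + len(old)] = new    # in-place, same-length replacement
                pos = mm.find(old, pos + len(new))
        mm.flush()
        mm.close()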
I have a 45 million row txt file that contains hashes. What would be the most efficient way to compare it to another file and output only the items from the second file that are not in the large txt file?
Current working:
comm -13 largefile.txt smallfile.txt >> newfile.txt
This works pretty fast, but I am looking to move this into Python so it runs regardless of OS.
Attempted with memory issues:
tp = pd.read_csv(r'./large file.txt', encoding='iso-8859-1', iterator=True, chunksize=50000)
full = pd.concat(tp, ignore_index=True)
This method maxes out my memory and generally faults for some reason.
Example:
<large file.txt>
hashes
fffca07998354fc790f0f99a1bbfb241
fffca07998354fc790f0f99a1bbfb242
fffca07998354fc790f0f99a1bbfb243
fffca07998354fc790f0f99a1bbfb244
fffca07998354fc790f0f99a1bbfb245
fffca07998354fc790f0f99a1bbfb246
<smaller file.txt>
hashes
fffca07998354fc790f0f99a1bbfb245
fffca07998354fc790f0f99a1bbfb246
fffca07998354fc790f0f99a1bbfb247
fffca07998354fc790f0f99a1bbfb248
fffca07998354fc790f0f99a1bbfb249
fffca07998354fc790f0f99a1bbfb240
Expected Output
<new file.txt>
hashes
fffca07998354fc790f0f99a1bbfb247
fffca07998354fc790f0f99a1bbfb248
fffca07998354fc790f0f99a1bbfb249
fffca07998354fc790f0f99a1bbfb240
Hash table. Or in Python terms, just use a set.
Put each item from the smaller file into the set; 200K items is perfectly fine. Then enumerate each item in the larger file to see if it exists in the set built from the smaller file. If there is a match, remove the item from the set.
When you are done, any item remaining in the set represents an item not found in the larger file.
My Python is a little rusty, but it would go something like this:
s = set()
with open("small_file.txt") as f:
    content = f.readlines()
for line in content:
    s.add(line.strip())

with open("large_file.txt") as f:
    for line in f:
        if line.strip() in s:
            s.discard(line.strip())

for i in s:
    print(i)
Haven't tested it, but I think this would not be memory intensive (no idea about speed):
unique = []
with open('large_file.txt') as lf, open('small_file.txt') as sf:
    for small_line in sf:
        for large_line in lf:
            if small_line == large_line:
                break
        else:
            unique.append(small_line)
        lf.seek(0)
Answer ended up being an idiot check that should have been conducted well before I posted.
I was running 32-bit Python in my IDE instead of 64-bit, due to a reinstall I had to do. After making this change, everything worked: I loaded the files in all at once, concatenated them, and dropped duplicates on the dataframe. I appreciate all your answers and the input and help you gave. Thanks.
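For completeness, a small sketch of how this can look with pandas once memory is no longer the constraint. It uses isin rather than the concat/drop-duplicates route described above, so treat it as one possible reading; the file and column names are taken from the example:

import pandas as pd

# file names as given in the question
large = pd.read_csv('large file.txt', encoding='iso-8859-1')
small = pd.read_csv('smaller file.txt', encoding='iso-8859-1')

# keep only the hashes from the smaller file that do not appear in the larger one
result = small[~small['hashes'].isin(large['hashes'])]
result.to_csv('new file.txt', index=False)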
Today I encountered a problem again.
I have a file looking like:
File A
>chr1
ACGACTGACTGTCGATCGATCGATGCTCGATGCTCGACGATCGTGCTCGATC
>chr2
GTGACGCACACGTGCTAGCGCTGATCGATCGTAGCTCAGTCAG
>chr3
CAGTCGTCGATCGTCGATCGTCG
and so on (basically a FASTA file).
In the other file I have nice tab-delimited information about my read:
File B
chr2 0 * 2S3M5I2M1D3M * CACTTTTTGTCTA NM:i:6
Both files are truly huge.
I won't write out everything that needs to be done, only the part that I have a problem with:
if the field chr2 from file B matches the line >chr2 in file A, look for CACTTTTTGTCTA (from file B) in the sequence of file A (only in the sequence of the >chr2 region; the next >chr is a different chromosome, so I don't want to search there).
To simplify this, let's look for the sequence CACACGTGCTAG in file A.
I was trying to use a dictionary for file A, but it's completely infeasible.
Any suggestions?
Something like:
for req in fileb:
    (tag, pattern) = parseB(req)
    tag_matched = False
    filea = open(file_a_name)
    for line in filea:
        if line.startswith('>'):
            tag_matched = line[1:].startswith(tag)
        elif tag_matched and line.find(pattern) > -1:
            do_whatever()
    filea.close()
Should do the job if you can write a parseB function.
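A minimal parseB sketch, assuming the column layout of the example line from the question (chromosome tag in column 0, sequence in column 5):

def parseB(line):
    # e.g. "chr2  0  *  2S3M5I2M1D3M  *  CACTTTTTGTCTA  NM:i:6"
    fields = line.split()
    return fields[0], fields[5]  # (tag, pattern)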
Dictionary lookups are fast, so it seems like the part that's taking a long time must be searching within the sequence. Python's substring search (the in operator / str.find) is implemented in C, so it's pretty efficient. If that's not fast enough, you'll probably need to go with an algorithm more specialized for efficiency, as discussed here: Python efficient way to check if very large string contains a substring
I am working with a very large text file (tsv), around 200 million entries. One of the columns is a date, and the records are sorted on that date. Now I want to start reading the records from a given date. Currently I just read from the start, which is very slow since I need to read almost 100-150 million records just to reach that record. I was thinking that if I could use binary search to speed it up, I could do it in at most about 28 extra record reads (log2 of 200 million). Does Python allow reading the nth line without caching or reading the lines before it?
If the lines are not fixed length, you are out of luck: some function will have to read through the file. If the lines are fixed length, you can open the file, call file.seek(line * linesize), and then read from there.
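If the lines can be made fixed length, the binary search from the question might look roughly like this; the tab delimiter, the date in column 0, and the '%Y-%m-%d' format are assumptions based on the question:

import os
from datetime import datetime

def first_line_on_or_after(path, target_date, line_size, date_fmt='%Y-%m-%d'):
    """Return the 0-based index of the first line whose date >= target_date."""
    with open(path, 'rb') as f:
        lo, hi = 0, os.path.getsize(path) // line_size
        while lo < hi:                      # ~28 iterations for 200 million lines
            mid = (lo + hi) // 2
            f.seek(mid * line_size)
            date_str = f.readline().decode().split('\t')[0]
            if datetime.strptime(date_str, date_fmt) < target_date:
                lo = mid + 1
            else:
                hi = mid
        return lo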
If the file to read is big, and you don't want to read the whole file in memory at once:
fp = open("file")
for i, line in enumerate(fp):
    if i == 25:
        pass  # 26th line
    elif i == 29:
        pass  # 30th line
    elif i > 29:
        break
fp.close()
Note that i == n-1 for the nth line.
You can use the method fileObject.seek(offset[, whence])
#offset -- This is the position of the read/write pointer within the file.
#whence -- This is optional and defaults to 0 which means absolute file positioning, other values are 1 which means seek relative to the current position and 2 means seek relative to the file's end.
file = open("test.txt", "r")

line_size = 8  # 6 characters per line plus the '\r\n' line ending (use 7 if lines end with '\n' only)
line_number = 5

file.seek(line_number * line_size, 0)
for i in range(5):
    print(file.readline())
file.close()
For this code I use the following file:
100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
Python has no way to skip "lines" in a file. The best way that I know of is to employ a generator that yields lines based on a certain condition, i.e. date > 'YYYY-MM-DD'. At least this way you reduce memory usage and time spent on I/O.
example:
# using python 3.4 syntax (parameter type annotation)
from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            # the file is tab separated (because .tsv is the extension)
            # the date column has column-index == 0
            # the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]
But this is still limiting you to one processor :(
Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing together with the generator defined above (see the sketch below).
I don't have a large file to test with right now, but I'll update this answer later with a benchmark comparing iterating over the whole file vs. splitting it and then iterating with the generator and the multiprocessing module.
With greater knowledge of the file (e.g. if all the desired dates are clustered at the beginning, middle, or end), you might be able to optimize the read further.
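For what it's worth, a rough sketch of the split-plus-multiprocessing idea; it assumes a unix-like system (fork start method), that yield_right_dates from above is defined in the same module, and that the chunk_* names are whatever split produced:

# e.g. first, in the shell:  split -l 10000000 /path/to/my/file chunk_
import glob
from datetime import datetime
from multiprocessing import Pool

def process_chunk(chunk_path):
    # run the generator from above over one chunk and collect its output
    return list(yield_right_dates(chunk_path, datetime(2015, 1, 1)))

if __name__ == '__main__':
    chunk_files = sorted(glob.glob('chunk_*'))
    with Pool() as pool:
        per_chunk = pool.map(process_chunk, chunk_files)
    desired_lines = [line for chunk in per_chunk for line in chunk]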
As others have commented, Python doesn't support this, because it doesn't know where lines start and end (unless they're fixed length). If you're doing this repeatedly, I'd recommend either padding the lines out to a constant length (if practical) or, failing that, reading them into some kind of basic database. You'll take a bit of a hit on size, but unless you're only indexing once in a blue moon it'll probably be worth it.
If space is a big concern and padding isn't possible, you could also add a (line number) tag at the start of each line. You would have to guess the size of the jumps and then parse a sample line to check them, but that would let you build a search algorithm that finds the right line quickly, for only around 10 extra characters per line.