Threading/Multiprocessing - Matching a 60 GB file against 600k search terms - Python

I have a Python script that would take ~93 days to complete on 1 CPU, or 1.5 days on 64.
I have a large file (FOO.sdf) and would like to extract the "entries" from FOO.sdf that match a pattern. An "entry" is a block of ~150 lines delimited by "$$$$". The desired output is 600K blocks of ~150 lines each. The script I have now is shown below. Is there a way to use multiprocessing or threading to divvy this task up across many cores/CPUs/threads? I have access to a server with 64 cores.
name_list = []
c = 0

# Titles of text blocks I want to extract (form [...,'25163208',...])
with open('Names.txt', 'r') as names:
    for name in names:
        name_list.append(name.strip())

# Writing the text blocks to this file
with open("subset.sdf", 'w') as subset:
    # Opening the large file with many text blocks I don't want
    with open("FOO.sdf", 'r') as f:
        # Loop through each line in the file
        for line in f:
            # Avoids appending extraneous lines or choking
            if line.split() == []:
                continue
            # Simply, this line would check if that line matches any name in "name_list".
            # But since I expect this is expensive to check, I only want it to occur
            # if it passes the first two conditions.
            if ("-" not in line.split()[0]) and (len(line.split()[0]) >= 5) and (line.split()[0] in name_list):
                c = 1  # when c = 1 it designates that line should be written
            # Write this line to output file
            if c == 1:
                subset.write(line)
            # Stop writing to file once we see "$$$$"
            if c == 1 and line.split()[0] == "$$$$":
                c = 0
                subset.write(line)
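One possible approach, as a sketch rather than a tested solution: load the 600k names into a set, since the check line.split()[0] in name_list scans a 600k-element list for every candidate line and is likely where most of the ~93 days goes, and then feed whole "$$$$"-delimited entries to a multiprocessing.Pool so the filtering runs across the 64 cores. The helper names (iter_entries, wanted) are made up for illustration, and it assumes the title being matched is the first whitespace-separated token of each entry:

import multiprocessing as mp

# Load the 600k names into a set: membership tests become O(1)
# instead of scanning a 600k-element list for every line.
with open('Names.txt', 'r') as names:
    name_set = {name.strip() for name in names}

def iter_entries(path):
    # Yield complete "$$$$"-terminated blocks one at a time.
    block = []
    with open(path, 'r') as f:
        for line in f:
            block.append(line)
            if line.strip() == "$$$$":
                yield ''.join(block)
                block = []

def wanted(entry):
    # Assumption: the title being matched is the first whitespace-separated
    # token of the block; return the whole block if that title is in the set.
    title = entry.split(None, 1)[0]
    return entry if title in name_set else None

if __name__ == '__main__':
    with mp.Pool(64) as pool, open('subset.sdf', 'w') as subset:
        # A largish chunksize keeps inter-process overhead low when there
        # are millions of entries to test.
        for result in pool.imap(wanted, iter_entries('FOO.sdf'), chunksize=1000):
            if result is not None:
                subset.write(result)

Even without the Pool, replacing the list with a set should cut the runtime dramatically, so it may be worth trying that change alone first.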

Related

Counting Entries in a File

I am trying to count entries in a text file but am having difficulty. The key is that each line is one entry, and if the term "ADALIMUMAB" shows up in the line, it counts as one. If it shows up twice, it should still only count as one. Here is an example of lines in the text file.
101700392$10170039$3$I$BUDESONIDE.$BUDESONIDE$1$Oral$9 MG, DAILY$$$$$$$$9$MG$$
101700392$10170039$4$C$ADALIMUMAB$ADALIMUMAB$1$$UNK$$$$$$$$$$$
102117144$10211714$1$PS$HUMIRA$ADALIMUMAB$1$Subcutaneous$$$$$N$ NOT AVAILABLE,NOT
I currently have this working:
fDRUG14Q3 = open("DRUG14Q3.txt")
data = fDRUG14Q3.read()
occurencesDRUG14Q3 = data.count("ADALIMUMAB")
But it will count line 2 in the example above as 2 entries rather than one.
You can use a generator expression passed to sum(). Each line will be either True (1) or False (0), and you'll take the total count. Basically you are counting how many lines return True for 'ADALIMUMAB' in line:
with open(path, 'r') as f:
    total = sum('ADALIMUMAB' in line for line in f)
print(total)
# 2
This has the added benefit of not requiring you to read the whole file into memory first.
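A quick check against the three example rows from the question (abbreviated here) shows why the per-line test gives 2 while str.count() gives 3: the second row contains ADALIMUMAB twice.

rows = [
    "101700392$10170039$3$I$BUDESONIDE.$BUDESONIDE$1$Oral$9 MG, DAILY$",
    "101700392$10170039$4$C$ADALIMUMAB$ADALIMUMAB$1$$UNK$",
    "102117144$10211714$1$PS$HUMIRA$ADALIMUMAB$1$Subcutaneous$",
]
print(sum('ADALIMUMAB' in row for row in rows))   # 2, one per matching line
print("\n".join(rows).count('ADALIMUMAB'))        # 3, double-counts the second row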

Splitting / Slicing Text File with Python

I'm learning Python, and I've been trying to split this txt file into multiple files grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3, etc.).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so this can work.
The code here I adapted from another post, and it kind of works at breaking the input into multiple files, but it requires sorting the lines before I start writing files. I also need to copy the header into each file and not isolate it in its own file:
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of 5 or 6 characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times and makes new files in the same folder as your text and Python files.
# code which separates lines in a file by an identifier,
# and makes new files for each identifier group
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # the .seek(0) 'rewinds' the file, which is apparently needed
        # if looping through a file multiple times
        filehandle.seek(0)
        # makes new file
        newfile = open(str(item) + ".txt", "w+")
        # inserts header, and goes to next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through old file, and adds relevant lines to new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        # close the new file so its contents are flushed to disk
        newfile.close()
print(Unique)
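As hinted above, there are more compact ways to do this. Here is a single-pass sketch that keeps one output file handle per identifier instead of re-reading the input for every group; it assumes, as in the code above, that the header is the first line and is the only line containing '/', and it reuses the default filename from that code:

# Single-pass variant: open one output file per identifier on first sight,
# write the header into it, then append matching lines as they are read.
handles = {}
with open("mk_newfiles.txt") as infile:
    header = next(infile)
    for line in infile:
        key = line.split()[0]
        if key not in handles:
            handles[key] = open(key + ".txt", "w")
            handles[key].write(header)
        handles[key].write(line)
for h in handles.values():
    h.close()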

Function to divide a text file into two files

I wrote a function that takes a text file and a ratio (e.g. 80%) and divides the first 80% of the file into one file and the remaining 20% into another file. The first part is correct but the second part is empty. Can someone take a look and let me know my mistake?
def splitFile(inputFilePatheName, outputFilePathNameFirst, outputFilePathNameRest, splitRatio):
    lines = 0
    buffer = bytearray(2048)
    with open(inputFilePatheName) as f:
        while f.readinto(buffer) > 0:
            lines += buffer.count('\n')
    print lines
    line80 = int(splitRatio * lines)
    print line80
    with open(inputFilePatheName) as originalFile:
        firstNlines = originalFile.readlines()[0:line80]
        restOfTheLines = originalFile.readlines()[(line80+1):lines]
        print len(firstNlines)
        print len(restOfTheLines)
    with open(outputFilePathNameFirst, 'w') as outputFileNLines:
        for item in firstNlines:
            outputFileNLines.write("{}".format(item))
    with open(outputFilePathNameRest, 'w') as outputFileRest:
        for word in restOfTheLines:
            outputFileRest.write("{}".format(word))
I believe this is your problem:
firstNlines = originalFile.readlines()[0:line80]
restOfTheLines=originalFile.readlines()[(line80+1):lines]
When you call readlines() the second time, you don't get anything, because you've already read all the lines from the file. Try:
allLines = originalFile.readlines()
firstNLines, restOfTheLines = allLines[:line80], allLines[line80:]  # slice from line80, not line80+1, so no line is dropped
Of course, for very large files there is a problem that you are reading the entire file into memory.
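If memory is a concern, a streaming variant (a sketch, reusing the parameter names from the question) counts lines in one pass and then copies line by line in a second pass, so only one line is held in memory at a time:

def splitFileStreaming(inputFilePatheName, outputFilePathNameFirst, outputFilePathNameRest, splitRatio):
    # First pass: count the lines without keeping them.
    with open(inputFilePatheName) as f:
        totalLines = sum(1 for _ in f)
    cutoff = int(splitRatio * totalLines)
    # Second pass: stream each line straight into the right output file.
    with open(inputFilePatheName) as f, \
         open(outputFilePathNameFirst, 'w') as first, \
         open(outputFilePathNameRest, 'w') as rest:
        for lineNumber, line in enumerate(f):
            if lineNumber < cutoff:
                first.write(line)
            else:
                rest.write(line)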

Concatenate lines together if they are not empty in a file

I have a file where some sentences are spread over multiple lines.
For example:
1:1 This is a simple sentence
[NEWLINE]
1:2 This line is spread over
multiple lines and it goes on
and on.
[NEWLINE]
1:3 This is a line spread over
two lines
[NEWLINE]
So I want it to look like this
1:1 This is a simple sentence
[NEWLINE]
1:2 This line is spread over multiple lines and it goes on and on.
[NEWLINE]
1:3 This is a line spread over two lines
Some lines are spread over 2, 3, or 4 lines. If a line is followed by a line that is not a newline, they should be merged into one single line.
I would like to either overwrite the given file or make a new file.
I've tried it with a while loop but without success.
input = open(file, "r")
zin = ""
lines = input.readlines()
# Makes array with the lines
for i in lines:
    while i != "\n":
        zin += i
        .....
But this creates an infinite loop.
You should not be nesting for and while loops in your use case. What happens in your code is that a line is assigned to the variable i by the for loop, but it isn't being modified by the nested while loop, so if the while condition is True it will remain that way, and without a break condition you end up with an infinite loop.
A solution might look like this:
single_lines = []
current = []
for i in lines:
    i = i.strip()
    if i:
        current.append(i)
    else:
        if not current:
            continue  # treat multiple blank lines as one
        single_lines.append(' '.join(current))
        current = []
else:
    if current:
        # collect the last line if the file doesn't end with a blank line
        single_lines.append(' '.join(current))
Good ways of overwriting the input file would be either to collect all output in memory, close the file after reading it, and reopen it for writing, or to write to another file while reading the input and then rename the second file over the first after closing both.
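A sketch of the second option (write to another file, then rename it over the original); the input filename input.txt is assumed:

import os

with open("input.txt") as infile:
    lines = infile.readlines()

# Group consecutive non-blank lines into single sentences, as above.
single_lines = []
current = []
for i in lines:
    i = i.strip()
    if i:
        current.append(i)
    elif current:
        single_lines.append(' '.join(current))
        current = []
if current:
    single_lines.append(' '.join(current))

# Write the result to a temporary file, keeping a blank line between
# sentences, then atomically replace the original file.
with open("input.txt.tmp", "w") as outfile:
    outfile.write('\n\n'.join(single_lines) + '\n')
os.replace("input.txt.tmp", "input.txt")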

How do I efficiently crossmatch two ASCII catalogs?

I have two ASCII text files with columnated data. The first column of both files is a 'name' that is consistent across both files. One file has some 6000 rows, the other only has 800. Without doing a for line in file.readlines(): approach - e.g.,
with open('big_file.txt') as catalogue:
    with open('small_file.txt') as targets:
        for tline in targets.readlines()[2:]:
            name = tline.split()[0]
            for cline in catalogue.readlines()[8:]:
                if name == cline.split()[0]:
                    print cline
                    catalogue.seek(0)
                    break
is there an efficient way to return only the rows (or lines) from the larger file that also appear in the smaller file (using the 'name' as the check)?
It's okay if it is one row at a time, e.g. for a file.write(matching_line); the idea would be to create a third file with all the info from the large file for only the objects that are in the small file.
for line in file.readlines() is not inherently bad. What's bad is the nested loops you have there. You can use a set to keep track of and check all the names in the smaller file:
s = set()
for line in targets:
    s.add(line.split()[0])
Then, just loop through the bigger file and check if the name is in s:
for line in catalogue:
    if line.split()[0] in s:
        print line
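Putting both pieces together into the third file the question asks for, as a sketch: the output name matches.txt is an assumption, while the header offsets of 2 and 8 lines are taken from the question's snippet.

with open('small_file.txt') as targets:
    # Skip the 2 header lines, then collect every name into a set.
    names = {line.split()[0] for line in targets.readlines()[2:]}

with open('big_file.txt') as catalogue, open('matches.txt', 'w') as out:
    for i, line in enumerate(catalogue):
        if i < 8:
            continue  # skip the catalogue's 8 header lines
        if line.split()[0] in names:
            out.write(line)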
