How do I efficiently crossmatch two ASCII catalogs? - python

I have two ASCII text files with columnar data. The first column of both files is a 'name' that is consistent across both files. One file has some 6000 rows, the other only 800. Without doing a for line in file.readlines(): approach - e.g.,
with open('big_file.txt') as catalogue:
    with open('small_file.txt') as targets:
        for tline in targets.readlines()[2:]:
            name = tline.split()[0]
            for cline in catalogue.readlines()[8:]:
                if name == cline.split()[0]:
                    print cline
                    catalogue.seek(0)
                    break
is there an efficient way to return only the rows (or lines) from the larger file that also appear in the smaller file (using the 'name' as the check)?
It's okay if it happens one row at a time, say via a file.write(matching_line); the idea is to create a third file with all the info from the large file for only the objects that are in the small file.

for line in file.readlines() is not inherently bad. What's bad is the nested loops you have there. You can use a set to keep track of and check all the names in the smaller file:
s = set()
for line in targets:
    s.add(line.split()[0])
Then, just loop through the bigger file and check if the name is in s:
for line in catalogue:
    if line.split()[0] in s:
        print line
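Putting the two pieces together, here is a sketch that writes the matches to a third file as the question asks. The header-skip counts (2 and 8) come from the question's own slices; the function name and output filename are placeholders:

```python
def crossmatch(big_path, small_path, out_path, big_skip=8, small_skip=2):
    """Write the lines of big_path whose first column appears in small_path."""
    with open(small_path) as targets:
        # collect every target name into a set for O(1) membership tests
        names = {line.split()[0] for line in list(targets)[small_skip:] if line.split()}
    with open(big_path) as catalogue, open(out_path, 'w') as out:
        for i, line in enumerate(catalogue):
            # skip the catalogue header, then keep lines whose name matches
            if i >= big_skip and line.split() and line.split()[0] in names:
                out.write(line)
```

This reads each file exactly once, so the cost is linear in the total number of lines instead of the product of the two row counts.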

Related

Splitting / Slicing Text File with Python

I'm learning Python and have been trying to split this txt file into multiple files, grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3 etc...).
Here is an example of the file I would like to split (first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so this can work:
This code here I adapted from another post, and it kind of works breaking the input into multiple files, but it requires sorting the lines before I start writing files; I also need to copy the header into each file and not isolate it in one file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of 5 or 6 characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times, and makes new files in the same folder as your text and Python files.
# code which separates lines in a file by an identifier,
# and makes new files for each identifier group
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # the .seek(0) 'rewinds' the file object, which is apparently
        # needed if looping through a file multiple times
        filehandle.seek(0)
        # makes new file
        newfile = open(str(item) + ".txt", "w+")
        # inserts header, and goes to next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through old file, and adds relevant lines to new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        newfile.close()
print(Unique)
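A more compact alternative (a sketch, not the code above) reads the file only once: group the body lines by their first token in a dict, then write each group to its own file with the header copied on top. The filename arguments are placeholders:

```python
import os
from collections import defaultdict

def split_by_prefix(filename, out_dir='.'):
    groups = defaultdict(list)
    with open(filename) as fh:
        header = next(fh)                      # first line is the header
        for line in fh:
            if line.strip():                   # ignore blank lines
                groups[line.split()[0]].append(line)
    for key, lines in groups.items():
        # each identifier gets its own file, with the header copied in
        with open(os.path.join(out_dir, key + '.txt'), 'w') as out:
            out.write(header)
            out.writelines(lines)
```

This also removes the need for .seek(0), since the input is never re-read.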

Threading/Multiprocessing - Match searching a 60gb file with 600k terms

I have a python script that would take ~93 days to complete on 1 CPU, or 1.5 days on 64.
I have a large file (FOO.sdf) and would like to extract the "entries" from FOO.sdf that match a pattern. An "entry" is a block of ~150 lines delimited by "$$$$". The desired output is 600K blocks of ~150 lines. The script I have now is shown below. Is there a way to use multiprocessing or threading to divvy up this task across many cores/CPUs/threads? I have access to a server with 64 cores.
name_list = []
c = 0
# Titles of text blocks I want to extract (form [...,'25163208',...])
with open('Names.txt', 'r') as names:
    for name in names:
        name_list.append(name.strip())
# Writing the text blocks to this file
with open("subset.sdf", 'w') as subset:
    # Opening the large file with many text blocks I don't want
    with open("FOO.sdf", 'r') as f:
        # Loop through each line in the file
        for line in f:
            # Avoids appending extraneous lines or choking
            if line.split() == []:
                continue
            # Simply, this line would check if that line matches any name in "name_list".
            # But since I expect this is expensive to check, I only want it to occur
            # if it passes the first two conditions.
            if ("-" not in line.split()[0]) and (len(line.split()[0]) >= 5) and (line.split()[0] in name_list):
                c = 1  # when c=1 it designates that line should be written
            # Write this line to output file
            if c == 1:
                subset.write(line)
            # Stop writing to file once we see "$$$$"
            if c == 1 and line.split()[0] == "$$$$":
                c = 0
                subset.write(line)
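Before parallelising, one serial fix is worth noting: name_list is a plain list, so every `in name_list` test scans up to 600k entries; loading the names into a set makes each lookup O(1), which on its own can shrink the runtime dramatically. A sketch of the same loop with that change (filenames are the ones from the question; unlike the script above, this writes the "$$$$" delimiter line once rather than twice):

```python
def extract_entries(names_path, sdf_path, out_path):
    """Copy $$$$-delimited blocks whose title line matches a wanted name."""
    with open(names_path) as names:
        wanted = {line.strip() for line in names}   # set gives O(1) lookups
    writing = False
    with open(sdf_path) as src, open(out_path, 'w') as out:
        for line in src:
            fields = line.split()
            if not fields:                          # skip blank lines
                continue
            first = fields[0]
            # cheap checks first, set membership last
            if not writing and '-' not in first and len(first) >= 5 and first in wanted:
                writing = True
            if writing:
                out.write(line)
                if first == '$$$$':                 # end of this entry
                    writing = False
```

If that is still too slow, the file can then be split at "$$$$" boundaries and the chunks handed to a multiprocessing.Pool, but try the set first.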

Count of unique column values from large CSV file using Python or php

I have a CSV file that is 217 GB. How can I get the count of unique column values using a Python or PHP script without timing out?
Not sure what you mean by timeout; for big files like this it will always take a long time.
tokens = {}
with open("your.csv") as infile:
    for line in infile:
        columns = line.split(',')
        # Where idx is your desired column index
        if columns[idx] not in tokens:
            tokens[columns[idx]] = 1
        else:
            tokens[columns[idx]] += 1
print(tokens)
This loads the file line by line, so your computer doesn't crash from loading the whole 217 GB into RAM. You can try this first to see if the dictionary fits in your computer's memory. Otherwise you might want to consider splitting the file into smaller chunks in a divide-and-conquer approach.
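For what it's worth, the same tally can be written with collections.Counter, which handles first-seen keys automatically and so sidesteps the initialise-vs-increment branching; the path and column index are placeholders:

```python
from collections import Counter

def column_counts(path, idx):
    """Count occurrences of each distinct value in column idx of a CSV file."""
    counts = Counter()
    with open(path) as infile:
        for line in infile:
            # still streaming line by line, so memory holds only the counts
            counts[line.rstrip('\n').split(',')[idx]] += 1
    return counts
```

len(column_counts(path, idx)) then gives the number of unique values, if that is what "count of unique column values" means.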
You could try to increase the field_size_limit:
import csv
csv.field_size_limit(1000000000)
r = csv.reader(open('doc.csv', 'r'))
for row in r:
    print(row)  # do the processing

Connecting similar lines from two files

I have two files, both very big. The information is mixed up between them, and I need to compare the two files and connect the lines that intersect.
An example would be:
1st file has
var1:var2:var3
2nd would have
var2:var3:var4
I need to connect these in a third file with output: var1:var2:var3:var4.
Please note that the lines do not line up: var4, which should go with var1 (since they share var2 and var3), could be far away in these huge files.
I need to find a way to compare each line and connect it to the one in the 2nd file. I can't seem to come up with an adequate loop. Any ideas?
Try the following (assuming var2:var3 is always a unique key in both files):
1. Iterate over all lines in the first file.
2. Add each entry into a dictionary with the value var2:var3 as key (and var1 as value).
3. Iterate over all lines in the second file.
4. Look up whether the dictionary from step 2 contains an entry for the key var2:var3; if it does, output var1:var2:var3:var4 into the output file and delete the entry from the dictionary.
This approach can use a very large amount of memory and therefore should probably not be used for very large files.
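The steps above can be sketched as follows; the filenames are placeholders, and each line is assumed to have exactly three ':'-separated fields as in the example:

```python
def join_files(file1, file2, out_file):
    lookup = {}
    with open(file1) as f1:
        for line in f1:
            var1, var2, var3 = line.rstrip('\n').split(':')
            lookup[(var2, var3)] = var1            # key on the shared fields
    with open(file2) as f2, open(out_file, 'w') as out:
        for line in f2:
            var2, var3, var4 = line.rstrip('\n').split(':')
            if (var2, var3) in lookup:
                # emit the joined line and drop the used entry
                out.write(':'.join((lookup.pop((var2, var3)), var2, var3, var4)) + '\n')
```

Only the first file is held in memory; the second file and the output are streamed.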
Based on the specific fields you said that you want to match (2 & 3 from file 1, 1 & 2 from file 2):
#!/usr/bin/python3
# Iterate over every line in file1.
# Iterate over every line in file2.
# If lines intersect, print combined line.
with open('file1') as file1:
    for line1 in file1:
        u1, h1, s1 = line1.rstrip().split(':')
        with open('file2') as file2:
            for line2 in file2:
                h2, s2, p2 = line2.rstrip().split(':')
                if h1 == h2 and s1 == s2:
                    print(':'.join((u1, h1, s2, p2)))
This is horrendously slow (in theory), but uses a minimum of RAM. If the files aren't absolutely huge, it might not perform too badly.
If memory isn't a problem, use a dictionary where the key is the same as the value:
#!/usr/bin/python
out_dict = {}
with open('file1', 'r') as file_in:
    lines = file_in.readlines()
    for line in lines:
        out_dict[line] = line
with open('file2', 'r') as file_in:
    lines = file_in.readlines()
    for line in lines:
        out_dict[line] = line
with open('output_file', 'w') as file_out:
    for key in out_dict:
        file_out.write(key)

sorting large text data

I have a large file (100 million lines of tab separated values - about 1.5GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried Hive. I would like to see if this can be done faster using Python.
Have you considered using the *nix sort program? In raw terms, it'll probably be faster than most Python scripts.
Use -t $'\t' to specify that it's tab-separated, -k n to specify the field, where n is the field number, and -o outputfile if you want to output the result to a new file.
Example:
sort -t $'\t' -k 4 -o sorted.txt input.txt
This will sort input.txt on its 4th field and output the result to sorted.txt.
You want to build an in-memory index for the file:
1. create an empty list
2. open the file
3. read it line by line (using f.readline()), and store in the list a tuple consisting of the value on which you want to sort (extracted with line.split('\t')[n].strip()) and the offset of the line in the file (which you can get by calling f.tell() before calling f.readline())
4. close the file
5. sort the list
Then to print the sorted file, reopen the file and for each element of your list, use f.seek(offset) to move the file pointer to the beginning of the line, f.readline() to read the line and print the line.
Optimization: you may want to store the length of the line in the list, so that you can use f.read(length) in the printing phase.
Sample code (optimized for readability, not speed):
def build_index(filename, sort_col):
    index = []
    f = open(filename)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        length = len(line)
        col = line.split('\t')[sort_col].strip()
        index.append((col, offset, length))
    f.close()
    index.sort()
    return index

def print_sorted(filename, col_sort):
    index = build_index(filename, col_sort)
    f = open(filename)
    for col, offset, length in index:
        f.seek(offset)
        print(f.read(length).rstrip('\n'))
    f.close()

if __name__ == '__main__':
    filename = 'somefile.txt'
    sort_col = 2
    print_sorted(filename, sort_col)
Split up into files that can be sorted in memory. Sort each file in memory. Then merge the resulting files.
Merge by reading a portion of each of the files to be merged - the same amount from each file, leaving enough space in memory for the merged result. Once merged, save it, then repeat, appending blocks of merged data onto the output file.
This minimises the file I/O and seeking around the file on disk.
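The scheme above can be sketched with heapq.merge, which performs the k-way merge while holding only one line per sorted run in memory; the chunk size and field number are placeholders:

```python
import heapq
import tempfile

def external_sort(in_path, out_path, field=3, chunk_lines=1_000_000):
    """Sort a tab-separated file on one field via sorted runs + k-way merge."""
    key = lambda line: line.split('\t')[field]
    runs = []
    with open(in_path) as src:
        while True:
            # pull up to chunk_lines lines and sort them in memory
            chunk = [line for _, line in zip(range(chunk_lines), src)]
            if not chunk:
                break
            chunk.sort(key=key)
            run = tempfile.TemporaryFile('w+')
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    with open(out_path, 'w') as out:
        # heapq.merge streams the runs, so memory stays ~one line per run
        out.writelines(heapq.merge(*runs, key=key))
    for run in runs:
        run.close()
```

With 1.5 GB of data and a 1M-line chunk size this produces on the order of a hundred runs, well within what a single merge pass can handle.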
I would store the file in a good relational database, index it on the field you are interested in, and then read the ordered items.
