Most efficient way to compare multiple files in Python

My problem is this: I have one file with 3000 lines and 8 columns (space delimited). The important thing is that the first column is a number ranging from 1 to 22. So, following the divide-and-conquer principle, I split the original file into 22 subfiles based on the first column's value.
I also have some result files: 15 studies, each containing one result file. Because each result file is too big, I applied divide-and-conquer once more and split each of the 15 results into 22 subfiles.
The file structure is as follows:

Original_file:
    split_1, split_2, split_3, ..., split_22

Studies:
    study1: split_1, split_2, ...
    study2: split_1, split_2, ...
    ...
    study15: split_1, split_2, ...
So by doing this we pay a slight overhead at the beginning, but all of these split files will be deleted at the end, so it doesn't really matter.
I need my final output to be the original file with some values from the studies appended to it.
So, my take so far is this:
Algorithm:
    for i in range(1, 23):
        for j in range(1, 16):
            compare (split_i of the original file) with split_i of the jth study
            if a value in a specific column matches:
                build a list with the needed columns from both rows, join it with ' '.join(list), and write the result to the outfile.
Is there a better way to approach this problem? The study files range from 300 MB to 1.5 GB in size.
And here's my Python code so far:
folders = ['study1', 'study2', ..., 'study15']

with open("Effects_final.txt", "w") as outfile:
    for i in range(1, 23):
        chr = i
        small_file = "split_" + str(chr) + ".txt"
        with open(small_file, 'r') as sf:
            for sline in sf:  # lines of the original file's split
                sf_parts = sline.split(' ')
                for f in folders:
                    file_to_compare_with = f + "split_" + str(chr) + ".txt"
                    with open(file_to_compare_with, 'r') as cf:  # comparison file
                        for cline in cf:
                            cf_parts = cline.split(' ')
                            if cf_parts[0] == sf_parts[1]:
                                to_write = ' '.join(cf_parts + sf_parts)
                                outfile.write(to_write)
But this code uses four nested loops, which feels like overkill, yet it seems unavoidable, since the lines of the two files being compared have to be read at the same time. This is my concern...

I found one solution that seems to run in a reasonable amount of time. The code is the following:
with open("output_file", 'w') as outfile:
for i in range(1,23):
dict1 = {} # use a dictionary to map values from the inital file
with open("split_i", 'r') as split:
next(split) #skip the header
line_list = line.split(delimiter)
for line in split:
dict1[line_list[whatever_key_u_use_as_id]] = line_list
compare_dict = {}
for f in folders:
with open("each folder", 'r') as comp:
next(comp) #skip the header
for cline in comp:
cparts = cline.split('delimiter')
compare_dict[cparts[whatever_key_u_use_as_id]] = cparts
for key in dict1:
if key in compare_dict:
outfile.write("write your data")
outfile.close()
With this approach, I'm able to process this dataset in ~10 minutes. Surely there is room for improvement. One idea is to take the time to sort the datasets; that way the later lookups would be quicker, and we might save time overall!
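For concreteness, here is a minimal runnable sketch of the same dictionary-join idea with the placeholders filled in. The file layout (split_{i}.txt in the working directory and in each study folder), the space delimiter, and the key columns are assumptions based on the description above, so adjust them to match your data; it also streams the study files instead of building a second dictionary.
import os

folders = ['study1', 'study2', 'study15']   # hypothetical study directories
delimiter = ' '                              # assumed space-delimited files
ORIG_KEY = 1                                 # assumed ID column in the original splits
STUDY_KEY = 0                                # assumed ID column in the study splits

with open("Effects_final.txt", "w") as outfile:
    for i in range(1, 23):
        # Index the original split by its ID column.
        originals = {}
        with open(f"split_{i}.txt") as sf:
            next(sf)  # skip the header
            for line in sf:
                parts = line.rstrip("\n").split(delimiter)
                originals[parts[ORIG_KEY]] = parts

        # Look up each study row's ID in that index and join on a match.
        for folder in folders:
            with open(os.path.join(folder, f"split_{i}.txt")) as cf:
                next(cf)  # skip the header
                for cline in cf:
                    cparts = cline.rstrip("\n").split(delimiter)
                    match = originals.get(cparts[STUDY_KEY])
                    if match is not None:
                        outfile.write(delimiter.join(match + cparts) + "\n")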

Related

Python: concatenating text files

Using Python, I'm seeking to iteratively combine two sets of txt files to create a third set of txt files.
I have a directory of txt files in two categories:
text_[number].txt (eg: text_0.txt, text_1.txt, text_2.txt....text_20.txt)
comments_[number].txt (eg: comments_0.txt, comments_1.txt, comments_2.txt...comments_20.txt).
I'd like to iteratively combine the text_[number] files with the matching comments_[number] files into a new file category feedback_[number].txt. The script would combine text_0.txt and comments_0.txt into feedback_0.txt, and continue through each pair in the directory. The number of text and comments files will always match, but the total number of text and comment files is variable depending on preceding scripts.
I can combine two pairs using the code below with a list of file pairs:
filenames = ['text_0.txt', 'comments_0.txt']
with open("feedback_0.txt", "w") as outfile:
for filename in filenames:
with open(filename) as infile:
contents = infile.read()
outfile.write(contents)
However, I'm uncertain how to structure iteration for the rest of the files. I'm also curious how to generate lists from the contents of the file directory. Any advice or assistance on moving forward is greatly appreciated.
It would be far simpler (and possibly faster) to just fork a cat process:
import subprocess

n = ...  # number of files
for i in range(n):
    with open(f'feedback_{i}.txt', 'w') as f:
        subprocess.run(['cat', f'text_{i}.txt', f'comments_{i}.txt'], stdout=f)
Or, if you already have lists of the file names:
for text, comment, feedback in zip(text_files, comment_files, feedback_files):
    with open(feedback, 'w') as f:
        subprocess.run(['cat', text, comment], stdout=f)
Unless these are all extremely small files, the cost of reading and writing the bytes will outweigh the cost of forking a new process for each pair.
Maybe not the most elegant but...
length = 10
txt = [f"text_{n}.txt" for n in range(length)]
com = [f"comments_{n}.txt" for n in range(length)]
feed = [f"feedback_{n}.txt" for n in range(length)]

for f, t, c in zip(feed, txt, com):
    with open(f, "w") as outfile:
        with open(t) as infile1:
            contents = infile1.read()
            outfile.write(contents)
        with open(c) as infile2:
            contents = infile2.read()
            outfile.write(contents)
There are many ways to achieve this, but I don't seem to see any solution that's both beginner-friendly and takes into account the structure of the files you described.
You can iterate through the files, and for every text_[num].txt, fetch the corresponding comments_[num].txt and write to feedback_[num].txt as shown below. There's no need to add any counters or make any other assumptions about the files that might not always be true:
import os

srcpath = 'path/to/files'
for f in os.listdir(srcpath):
    if f.startswith('text'):
        index = f[5:-4]  # extract the [num] part of 'text_[num].txt'
        # Build the paths to the text, comment, and feedback files
        txt_path = os.path.join(srcpath, f)
        cmnt_path = os.path.join(srcpath, f'comments_{index}.txt')
        fb_path = os.path.join(srcpath, f'feedback_{index}.txt')
        # Write the output, reading in byte mode following chepner's advice
        with open(fb_path, 'wb') as outfile:
            outfile.write(open(txt_path, 'rb').read())
            outfile.write(open(cmnt_path, 'rb').read())
The simplest way would probably be to just iterate from 0 onwards, stopping at the first missing file. This works assuming that your files are numbered consecutively with no gaps (e.g. you have 0, 1, 2 and not 0, 2).
import os
from itertools import count

for i in count(0):
    t = f'text_{i}.txt'
    c = f'comments_{i}.txt'
    if not os.path.isfile(t) or not os.path.isfile(c):
        break
    with open(f'feedback_{i}.txt', 'wb') as outfile:
        outfile.write(open(t, 'rb').read())
        outfile.write(open(c, 'rb').read())
You can try this:
filenames = ['text_0.txt', 'comments_0.txt', 'text_1.txt', 'comments_1.txt',
             'text_2.txt', 'comments_2.txt', 'text_3.txt', 'comments_3.txt']

for i, j in enumerate(zip(filenames[::2], filenames[1::2])):
    with open(f'feedback_{i}.txt', 'a+') as outfile:
        for k in j:
            with open(k, 'r') as f:
                contents = f.read()
                outfile.write(contents)
I have hard-coded a list here. Instead, you can do:
import os
filenames = os.listdir('path/to/folder')
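Note that os.listdir returns names in arbitrary order and includes both kinds of files, so you would still need to filter, sort, and pair them. A minimal sketch of that step, where the directory path and the numeric sort key are assumptions:
import os
import re

srcpath = 'path/to/folder'   # hypothetical directory containing the files

# Keep only the text_[num].txt files and sort them by their number.
text_files = sorted(
    (f for f in os.listdir(srcpath) if re.fullmatch(r'text_\d+\.txt', f)),
    key=lambda name: int(re.search(r'\d+', name).group()),
)

for text in text_files:
    num = re.search(r'\d+', text).group()
    comment = f'comments_{num}.txt'
    feedback = f'feedback_{num}.txt'
    with open(os.path.join(srcpath, feedback), 'w') as outfile:
        for name in (text, comment):
            with open(os.path.join(srcpath, name)) as infile:
                outfile.write(infile.read())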

More efficient way than zipping arrays for transposing a table in Python?

I have been trying to transpose my table of 2000000+ rows and 300+ columns on a cluster, but it seems that my Python script is getting killed due to lack of memory. I would just like to know if anyone has any suggestions on a more efficient way to store my table data other than using the array, as shown in my code below?
import sys

separator = "\t"
m = []
f = open(sys.argv[1], 'r')
data = f.read()
lines = data.split("\n")[:-1]
for line in lines:
    m.append(line.strip().split("\t"))

# zip(*m) yields the columns of m, i.e. the rows of the transposed table
for i in zip(*m):
    for j in range(len(i)):
        if j != len(i) - 1:
            print(i[j] + separator, end='')
        else:
            print(i[j])
Thanks very much.
The first thing to note is that you've been careless with your variables. You're loading a large file into memory as a single string, then as a list of strings, then as a list of lists of strings, before finally transposing said list. This will result in you storing all the data in the file three times before you even begin to transpose it.
If each individual string in the file is only about 10 characters long then you're going to need 18GB of memory just to store that (2e6 rows * 300 columns * 10 bytes * 3 duplicates). This is before you factor in all the overhead of python objects (~27 bytes per string object).
You have a couple of options here.
create each new transposed row incrementally, reading through the file once for each new row and appending cells one at a time (sacrifices time efficiency).
create one file for each new row and combine these row files at the end (sacrifices disk-space efficiency; possibly problematic if the initial file has a lot of columns, due to the limit on the number of files a process may have open).
Transposing with a limited number of open files
delimiter = ','
input_filename = 'file.csv'
output_filename = 'out.csv'

# find out the number of columns in the file
with open(input_filename) as input:
    old_cols = input.readline().count(delimiter) + 1

temp_files = [
    'temp-file-{}.csv'.format(i)
    for i in range(old_cols)
]

# create (empty) temp files
for temp_filename in temp_files:
    with open(temp_filename, 'w') as output:
        output.truncate()

# append each cell of a row to the temp file of its column
with open(input_filename) as input:
    for line in input:
        parts = line.rstrip().split(delimiter)
        assert len(parts) == len(temp_files), 'not enough or too many columns'
        for temp_filename, cell in zip(temp_files, parts):
            with open(temp_filename, 'a') as output:
                output.write(cell)
                output.write(',')

# combine temp files into the transposed output
with open(output_filename, 'w') as output:
    for temp_filename in temp_files:
        with open(temp_filename) as input:
            line = input.read().rstrip()[:-1] + '\n'
            output.write(line)
As the number of columns is far smaller than the number of rows, I would consider writing each column to a separate file and then combining them together.
import sys

Separator = "\t"
with open(sys.argv[1], 'r') as f:
    for line in f:
        for i, c in enumerate(line.strip().split("\t")):
            dest = column_file[i]  # you should open 300+ file handles up front, one for each column
            dest.write(c)
            dest.write(Separator)
# all you need to do after that is combine the content of your "row" files
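A minimal sketch of what the missing pieces might look like, i.e. opening one handle per column up front and then concatenating the resulting files into the transposed output; the column count is read from the first line, and the file names and delimiter are assumptions:
import sys

separator = "\t"
input_filename = sys.argv[1]

# Peek at the first line to learn the number of columns (assumes a well-formed header row).
with open(input_filename) as f:
    n_cols = len(f.readline().rstrip("\n").split(separator))

# One output file per column of the original table.
column_file = [open("column_{}.txt".format(i), "w") for i in range(n_cols)]

with open(input_filename) as f:
    for line in f:
        for i, c in enumerate(line.strip().split(separator)):
            column_file[i].write(c)
            column_file[i].write(separator)

for handle in column_file:
    handle.close()

# Combine the per-column files: each becomes one row of the transposed table.
with open("transposed.txt", "w") as out:
    for i in range(n_cols):
        with open("column_{}.txt".format(i)) as cf:
            out.write(cf.read().rstrip(separator) + "\n")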
If you cannot store the whole file in memory, you can read it n times:
import sys

column_number = 4  # if necessary, read the first line of the file to calculate it
separator = '\t'
filename = sys.argv[1]

def get_nth_column(filename, n):
    with open(filename, 'r') as file:
        for line in file:
            if line.strip():  # skip empty lines
                yield line.strip().split('\t')[n]

for column in range(column_number):
    print(separator.join(get_nth_column(filename, column)))
Note that an exception will be raised if the file does not have the right format. You could catch it if necessary.
When reading files, use the with construct to ensure that your file will be closed, and iterate directly over the file instead of reading the whole content first. It is more readable and more efficient.
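For illustration, a minimal sketch contrasting the two styles; the file name data.tsv is just a placeholder:
# Reads the whole file into memory before doing any work.
f = open('data.tsv')
lines = f.read().split('\n')
for line in lines:
    pass  # process line

# Streams the file line by line and closes it automatically.
with open('data.tsv') as f:
    for line in f:
        pass  # process line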

Trying to read a file and loop through at the same time

I'm pretty new to this so please move this topic if it's in the wrong place or something else.
Problem: (Quick note: This is all in Python) I am trying to go through these 100 or so files, each with the same number of columns, and take certain columns of the input (the same ones for each file) and write them in a new file. However, these 100 files don't necessarily all have the same number of rows. In the code below, filec is in a loop and continues altering throughout the 100 files. I am trying to get these certain columns that I want by looking at the number of rows in each txt file and looping that many times then taking the numbers I want.
filec = open(string,'r').read().split(',')
x = len(filec.readlines())
I realize the issue is that filec has become a list after using the split function and was originally a string when I used .read(). How would one go about finding the number of lines, so I can loop through the number of rows and get the positions in each row that I want?
Thank you!
You could do it like this:
filec = open(filename, 'r')
lines = filec.readlines()
for line in lines:
    words = line.split(',')
    # Your code here
Excuse me if there are any errors, I'm doing this on mobile.
As you are just looking for the count of rows, how about this:
t = tuple(open(r'filepath\filename.txt', 'r'))
print(len(t))
I tried to keep the code clear; it is very possible to do this with fewer lines. It takes in a list of file names and gives out a dictionary mapping each filename to the column you wanted (as a list).
def read_col_from_files(file_names, column_number):
    ret = {}
    for file_name in file_names:
        with open(file_name) as fp:
            column_for_file = []
            for line in fp:
                columns = line.split('\t')
                column_for_file.append(columns[column_number])
            ret[file_name] = column_for_file
    return ret
I have assumed you have tab delimited columns. Call it like this:
data = read_col_from_files(["file_1.txt", "/tmp/file_t.txt"], 5)
Here is a sensible shortening of the code using a list comprehension
def read_col_from_files(file_names, column_number):
    ret = {}
    for file_name in file_names:
        with open(file_name) as fp:
            ret[file_name] = [line.split('\t')[column_number] for line in fp]
    return ret
And here is how to do it on the command line:
cat FILENAMES | awk '{print $3}'

Output comes in dictionary sorted order in python

I have a collection of csv files with names like 2.csv , 3.csv ...., 999.csv. Each file has 91 rows in it. I want to have a set of new files collecting a particular row from all the files. Eg. row1.csv should have first row of all the 998 files and similarly row35.csv should have 35th row of all the 998 files. Therefore I should have in total 91 files (one for each row) with each file having 998 rows (one for each original file) after my script completes the run.
I use the following code for the task
import glob
import os
for i in range(2, 92):
    outfile = open("row_%i.csv" % i, 'w')
    for filename in glob.glob('DataSet-MediaEval/devFeatures/*.csv'):
        with open(filename, 'r') as infile:
            lineno = 0
            for line in infile:
                lineno += 1
                if lineno == i:
                    outfile.write(line)
    outfile.close()
Now, in any of the output files row_i.csv, my data is arranged in dictionary (lexicographic) sorted order. Example:
First row in row_50.csv file is the 50th row of 10.csv.
In other words in any row_i.csv the rows comes from 10.csv, 100.csv, 101.csv and so on.
I wanted to know why that is happening, and whether there is a way I can ensure that my row_i.csv is ordered the way the files are numbered, i.e. 2.csv, 3.csv and so on.
Thanks for the time spent reading this.
Not sure if this works out or if there are more problems, but it seems like glob either returns the filenames in sorted order (sorted as strings), or in random order. In both cases, you will have to extract the number from the filenames and sort by that number.
Try something like this:
import re
import glob

p = re.compile(r"/(\d+)\.csv")
filenames = glob.glob(...)
for filename in sorted(filenames, key=lambda s: int(re.search(p, s).group(1))):
    ...
Also, it seems like you are opening, looping and closing all those 999 files for all of the 92 outfiles again and again! It might be better to open all of the 92 outfiles at once and store them in a dictionary, mapping line numbers to files. This way, you have to loop the 999 files just once.
Something like this (totally not tested):
outfiles = {i: open("row_%i.csv" % i, 'w') for i in range(2, 92)}
p = re.compile(r"/(\d+)\.csv")
filenames = glob.glob('DataSet-MediaEval/devFeatures/*.csv')
for filename in sorted(filenames, key=lambda s: int(re.search(p, s).group(1))):
    with open(filename, 'r') as infile:
        for lineno, line in enumerate(infile, start=1):
            if lineno in outfiles:  # only rows 2..91 have an output file
                outfiles[lineno].write(line)
for outfile in outfiles.values():
    outfile.close()
You need to sort the filename list before starting to iterate over it. This can help you:
import re
import glob

filename_list = glob.glob('DataSet-MediaEval/devFeatures/*.csv')

def splitByNumbers(x):
    r = re.compile(r'(\d+)')
    l = r.split(x)
    return [int(y) if y.isdigit() else y for y in l]

filenames = sorted(filename_list, key=splitByNumbers)
Then, instead of
for filename in glob.glob('DataSet-MediaEval/devFeatures/*.csv'):
you can use
for filename in filenames:

Copy columns from multiple text files in Python

I have a large number of text files containing data arranged into a fixed number of rows and columns, the columns being separated by spaces (like a .csv but using spaces as the delimiter). I want to extract a given column from each of these files, and write it into a new text file.
So far I have tried:
results_combined = open('ResultsCombined.txt', 'wb')

def combine_results():
    for num in range(2, 10):
        f = open("result_0." + str(num) + "_.txt", 'rb')  # all the text files have similar filename styles
        lines = f.readlines()     # read in the data
        no_lines = len(lines)     # get the number of lines
        for i in range(0, no_lines):
            column = lines[i].strip().split(" ")
            results_combined.write(column[5] + " " + '\r\n')
        f.close()

if __name__ == "__main__":
    combine_results()
This produces a text file containing the data I want from the separate files, but as a single column. (i.e. I've managed to 'stack' the columns on top of each other, rather than have them all side by side as separate columns). I feel I've missed something obvious.
In another attempt, I manage to write all the separate files to a single file, but without picking out the columns that I want.
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'wb')
for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip())
        fout.write(' ')
    fout.write('\n')
fout.close()
What I basically want is to copy column 5 from each file (it is always the same column) and write them all to a single file.
If you don't know the maximum number of rows in the files and if the files can fit into memory, then the following solution would work:
import glob

files = [open(f) for f in glob.glob("*.txt")]

# Given a file, read the 6th column of each line
def readcol5(f):
    return [line.split(' ')[5] for line in f]

filecols = [readcol5(f) for f in files]
maxrows = len(max(filecols, key=len))

# Given an array, make sure it has maxrows number of elements.
def extendmin(arr):
    diff = maxrows - len(arr)
    arr.extend([''] * diff)
    return arr

filecols = map(extendmin, filecols)
lines = zip(*filecols)
lines = map(lambda x: ','.join(x), lines)
lines = '\n'.join(lines)

fout = open('output.csv', 'w')
fout.write(lines)
fout.close()
Or this option (following your second approach):
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'w')
for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip().split(' ')[5])
        fout.write(' ')
    fout.write('\n')
fout.close()
... which uses a fixed number of rows per file, but will work for very large numbers of rows because it does not store the intermediate values in memory. For moderate numbers of rows, I'd expect the first solution above to run more quickly.
Why not read all the entries from each file's 5th column into a list and, after reading in all the files, write them all to the output file?
data = [
    [],  # entries from first file
    [],  # entries from second file
    ...
]

for i in range(number_of_rows):
    outputline = []
    for vals in data:
        outputline.append(vals[i])
    outfile.write(" ".join(outputline) + "\n")
