I have the following code to compare two files. I would like this program to run when I point it at files as big as 4 or 5 MB. When I do that, the prompt cursor in the Python console just blinks, and no output is shown. Once, I ran it for the whole night and the next morning it was still blinking. What can I change in this code?
import difflib
file1 = open('/home/michel/Documents/first.csv', 'r')
file2 = open('/home/michel/Documents/second.csv', 'r')
diff = difflib.ndiff(file1.readlines(), file2.readlines())
delta = ''.join(diff)
print delta
If you are on a Linux-based system, you can call the external diff command and use its result (a sketch of calling it from Python follows the timings below). I tried it on two files of 14 MB and 9.3 MB with the diff command; it took about 1.3 seconds.
real 0m1.295s
user 0m0.056s
sys 0m0.192s
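A minimal sketch of that call, assuming a Unix-like system with diff on the PATH (the paths are the ones from your question):
import subprocess

# diff exits with status 1 when the files differ, so read stdout from Popen
# rather than using check_output (which would raise on a non-zero exit)
proc = subprocess.Popen(
    ['diff', '/home/michel/Documents/first.csv', '/home/michel/Documents/second.csv'],
    stdout=subprocess.PIPE)
delta, _ = proc.communicate()
print(delta)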
When I tried to use difflib the way you do, I had the same issue, because for big files difflib buffers the whole file in memory and then compares them. As a workaround, you can compare the two files in chunks. Here I do it 100 lines at a time.
import difflib

file1 = open('1.csv', 'r')
file2 = open('2.csv', 'r')

lines_file1 = []
lines_file2 = []

# i: line number; pair: (line from file1, line from file2)
# note that zip() stops at the end of the shorter file
for i, pair in enumerate(zip(file1, file2)):
    lines_file1.append(pair[0])
    lines_file2.append(pair[1])
    # every 100 lines, diff the buffered chunk and reset the buffers
    if i % 100 == 99:
        diff = difflib.ndiff(lines_file1, lines_file2)
        print ''.join(diff)
        lines_file1 = []
        lines_file2 = []

# diff whatever lines are left in the buffers
diff = difflib.ndiff(lines_file1, lines_file2)
print ''.join(diff)

file1.close()
file2.close()
Hope it helps.
I have a huge text file that I need to split wherever a line containing only the value 'EKYC' appears. However, when other values with a similar pattern show up, my script fails.
I am new to Python and it is wearing me out.
import sys;
import os;
MASTER_TEXT_FILE=sys.argv[1];
OUTPUT_FILE=sys.argv[2];
L = file(MASTER_TEXT_FILE, "r").read().strip().split("EKYC")
i = 0
for l in L:
    i = i + 1
    f = file(OUTPUT_FILE+"-%d.ekyc" % i , "w")
    print >>f, "EKYC" + l
The script breaks when there is EKYCSMRT or EKYCVDA or EKYCTIGO, so how can I add a guard to prevent the split from happening at those points?
This is the content of all of the messages:
EKYC
WIK 12
EKYC
WIK 12
EKYCTIGO
EKYC
WIK 13
TTL
EKYCVD
EKYC
WIK 14
TTL D
Thanks for the assistance.
If possible, you should avoid reading large files into memory all at once. Instead, stream chunks of them at a time.
The sensible chunks of text files are usually lines. This can be done with .readline(), but simply iterating over the file yields its lines too.
After reading a line (which includes the newline), you can .write() it directly to the current output file.
import sys

master_filename = sys.argv[1]
output_filebase = sys.argv[2]

output = None
output_number = 0

for line in open(master_filename):
    if line.strip() == 'EKYC':
        if output is not None:
            output.close()
            output = None
    else:
        if output is None:
            output_number += 1
            output_filename = '%s-%d.ekyc' % (output_filebase, output_number)
            output = open(output_filename, 'w')
        output.write(line)

if output is not None:
    output.close()
The output file is closed and reset upon encountering 'EKYC' on its own line.
Here, you'll notice that the output file isn't (re)opened until right before there is a line to write to it: this avoids creating an empty output file in case there are no further lines to write to it. You'll have to re-order this slightly if you want the 'EKYC' line to appear in the output file also.
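For example, one possible re-ordering (a sketch only, using the same names as above) that keeps each 'EKYC' marker line in the file it starts:
import sys

master_filename = sys.argv[1]
output_filebase = sys.argv[2]

output = None
output_number = 0

for line in open(master_filename):
    if line.strip() == 'EKYC':
        # start a new output file right away so the marker line is kept;
        # note that any lines before the first 'EKYC' are skipped in this variant
        if output is not None:
            output.close()
        output_number += 1
        output = open('%s-%d.ekyc' % (output_filebase, output_number), 'w')
    if output is not None:
        output.write(line)

if output is not None:
    output.close()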
Based on your sample input file, you need to split on '\nEKYC\n':
#!/usr/bin/env python
import sys

MASTER_TEXT_FILE = sys.argv[1]
OUTPUT_FILE = sys.argv[2]

with open(MASTER_TEXT_FILE) as f:
    fdata = f.read()

i = 0
for subset in fdata.split('\nEKYC\n'):
    i += 1
    with open(OUTPUT_FILE + "-%d.ekyc" % i, 'w') as output:
        output.write(subset)
Other comments:
Python doesn't need trailing semicolons.
Your original code imported os but never used it.
It's recommended to use with open(<filename>, <mode>) as f: ... since it handles possible errors and closes the file afterward.
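A tiny illustration of that last point, using MASTER_TEXT_FILE from the snippet above:
# the file is closed automatically when the block ends,
# even if an exception is raised inside it
with open(MASTER_TEXT_FILE) as f:
    first_line = f.readline()
# here f.closed is True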
I have a list named master_lst created from a CSV file using the following code
import sys

infile = open(sys.argv[1], "r")
lines = infile.readlines()[1:]
master_lst = ["read"]
for line in lines:
    line = line.strip().split(',')
    fourth_field = line[3]
    master_lst.append(fourth_field)
This master list holds the unique set of sequences. Now I have to loop over 30 collapsed FASTA files to count the number of occurrences of each of these sequences in the master list. The file format of the 30 files is as follows:
>AAAAAAAAAAAAAAA
7451
>AAAAAAAAAAAAAAAA
4133
>AAAAAAAAAAAAAAAAA
2783
For counting the number of occurrences, I looped through each of the 30 files and created a dictionary with sequences as keys and numbers of occurrences as values. Then I iterated over each element of master_lst and matched it against the keys in the dictionary created in the previous step. If there is a match, I appended the value of the key to a new list (ind_lst). If not, I appended 0 to ind_lst. The code for that is as follows:
for file in files:
    ind_lst = []
    if file.endswith('.fa'):
        first = file.split(".")
        first_field = first[0]
        ind_lst.append(first_field)
        fasta = open(file)
        individual_dict = {}
        for line in fasta:
            line = line.strip()
            if line == '':
                continue
            if line.startswith('>'):
                header = line.lstrip('>')
                individual_dict[header] = ''
            else:
                individual_dict[header] += line
        for key in master_lst[1:]:
            a = 0
            if key in individual_dict.keys():
                a = individual_dict[key]
            else:
                a = 0
            ind_lst.append(a)
Then I write master_lst and ind_lst to a CSV file using the code explained here: How to append a new list to an existing CSV file?
The final output should look like this:
Read file1 file2 so on until file 30
AAAAAAAAAAAAAAA 7451 4456
AAAAAAAAAAAAAAAA 4133 3624
AAAAAAAAAAAAAAAAA 2783 7012
This code works perfectly fine when I use a smaller master_lst. But when the size of master_lst increases, the execution time increases too much. The master_lst I am working with right now has 35,718,501 sequences (elements). When I subset just 50 sequences and run the code, the script takes 2 hours to execute. So for 35,718,501 sequences it will take forever to complete.
Now I don't know how to speed up the script. I am not quite sure whether there are improvements that could be made to this script to make it execute in a shorter time. I am running my script on a Linux server which has 16 CPU cores. When I use the command top, I can see that the script uses only one CPU. But I am not an expert in Python and I don't know how to make it run on all available CPU cores using the multiprocessing module. I checked this webpage: Learning Python's Multiprocessing Module.
But I wasn't quite sure what should come under def and if __name__ == '__main__':. I am also not quite sure what arguments I should pass to the function. I was getting an error when I tried the first code from Douglas without passing any arguments, as follows:
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
I have been working this for the last few days and I haven't been successful in generating my desired output. If anyone can suggest an alternative code that could run fast or if anyone could suggest how to run this code on multiple CPUs, that would be awesome. Any help to resolve this issue would be much appreciated.
Here's a multiprocessing version. It uses a slightly different approach than your code does, which does away with the need to create the ind_lst.
The essence of the difference is that it first produces a transpose of the desired data, and then transposes that into the desired result.
In other words, instead of creating this directly:
Read,file1,file2
AAAAAAAAAAAAAAA,7451,4456
AAAAAAAAAAAAAAAA,4133,3624
AAAAAAAAAAAAAAAAA,2783,7012
It first produces:
Read,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAA
file1,7451,4133,2783
file2,4456,3624,7012
...and then transposes that with the built-in zip() function to obtain the desired format.
Besides not needing to create the ind_lst, it also allows one row of data to be created per file rather than one column of it, which is easier and more efficient.
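As a tiny illustration of that transpose step (with made-up values, separate from the script below):
data = [['Read', 'AAAAAAAAAAAAAAA', 'AAAAAAAAAAAAAAAA'],
        ['file1', 7451, 4133],
        ['file2', 4456, 3624]]
for row in zip(*data):  # zip(*...) flips rows and columns
    print(row)
# ('Read', 'file1', 'file2')
# ('AAAAAAAAAAAAAAA', 7451, 4456)
# ('AAAAAAAAAAAAAAAA', 4133, 3624)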
Here's the code:
from __future__ import print_function

import csv
from functools import partial
from glob import glob
import operator
import os
from multiprocessing import Pool


def get_master_list(filename):
    with open(filename, "rb") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # ignore the header row
        sequence_getter = operator.itemgetter(3)  # retrieves fourth column of each row
        return map(sequence_getter, reader)


def process_fa_file(master_list, filename):
    fa_dict = {}
    with open(filename) as fa_file:
        for line in fa_file:
            if line and line[0] != '>':
                fa_dict[sequence] = int(line)
            elif line:
                sequence = line[1:-1]
    get = fa_dict.get  # local var to expedite access
    basename = os.path.basename(os.path.splitext(filename)[0])
    return [basename] + [get(key, 0) for key in master_list]


def process_fa_files(master_list, filenames):
    # "processes" is the number of worker processes to use. If processes is
    # None, the number returned by cpu_count() is used.
    pool = Pool(processes=4)
    # Only one argument can be passed to the target function with Pool.map(),
    # so create a partial to pass the first argument, which doesn't vary.
    results = pool.map(partial(process_fa_file, master_list), filenames)
    header_row = ['Read'] + master_list
    return [header_row] + results


if __name__ == '__main__':
    master_list = get_master_list('master_list.csv')
    fa_files_dir = '.'  # current directory
    filenames = glob(os.path.join(fa_files_dir, '*.fa'))
    data = process_fa_files(master_list, filenames)
    rows = zip(*data)  # transpose
    with open('output.csv', 'wb') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(rows)
    # show the data written to the file
    for row in rows:
        print(','.join(map(str, row)))
import os
import time

filename = 'NTS.csv'
mycsv = open(filename, 'r')
mycsv.seek(0, os.SEEK_END)
while 1:
    time.sleep(1)
    where = mycsv.tell()
    line = mycsv.readline()
    if not line:
        mycsv.seek(where)
    else:
        arr_line = line.split(',')
        var3 = arr_line[3]
        print(var3)
I have this Python code which reads values from a CSV file every time a new line is written to it by an external program. My problem is that the CSV file is periodically rewritten completely, and then Python stops reading the new lines. My guess is that Python is stuck on some line number, while the rewrite can leave the file maybe 50 lines longer or shorter. So, for example, Python is waiting for a new line at line 70 while the new line arrives at line 95. I think the solution is to re-run mycsv.seek(0, os.SEEK_END) when this happens, but I am not sure how to do that.
What you want to do is difficult to accomplish without rewinding the file every time to make sure that you are truly on the last line. If you know approximately how many characters there are on each line, there is a shortcut you can take using mycsv.seek(-end_buf, os.SEEK_END), as outlined in this answer. So your code could work something like this:
import os
import time

avg_len = 50               # use an appropriate number here
end_buf = 3 * avg_len / 2  # seek back far enough to always contain the last full line

filename = 'NTS.csv'
mycsv = open(filename, 'r')
mycsv.seek(-end_buf, os.SEEK_END)
last = mycsv.readlines()[-1]
while 1:
    time.sleep(1)
    mycsv.seek(-end_buf, os.SEEK_END)
    line = mycsv.readlines()[-1]
    if not line == last:
        arr_line = line.split(',')
        var3 = arr_line[3]
        print(var3)
        last = line  # remember the newest line so it is only handled once
Here, in each iteration of the while loop, you seek to a position close to the end of the file, just far enough back that you know for sure the last line will be contained in what remains. Then you read in all the remaining lines (this will probably include a partial second- or third-to-last line) and check whether the last of these is different from what you had before.
There is a simpler way of reading lines in your program. Instead of trying to use seek to get what you need, try using readlines on the file object mycsv.
You can do the following:
mycsv = open('NTS.csv', 'r')
csv_lines = mycsv.readlines()
for line in csv_lines:
    arr_line = line.split(',')
    var3 = arr_line[3]
    print(var3)
How can I make a python script change itself?
To boil it down, I would like to have a Python script (run.py) like this:
a = 0
b = 1
print a + b
# do something here such that the first line of this script reads a = 1
Such that the next time the script is run, it would look like this:
a = 1
b = 1
print a + b
# do something here such that the first line of this script reads a = 2
Is this in any way possible? The script might use external resources; however, everything should work by just running the single run.py file.
EDIT:
It may not have been clear enough, but the script should update itself, not any other file. Sure, once you allow for a simple configuration file next to the script, this task is trivial.
For example (changing the value of a each time it's run):
a = 0
b = 1
print a + b

with open(__file__, 'r') as f:
    lines = f.read().split('\n')
    val = int(lines[0].split(' = ')[-1])
    new_line = 'a = {}'.format(val + 1)
    new_file = '\n'.join([new_line] + lines[1:])

with open(__file__, 'w') as f:
    f.write(new_file)
What you're asking for would require you to manipulate files at the filesystem level; basically, you'd read the current file in, modify it, overwrite it, and reload the current module. I played with this briefly because I was curious, but I ran into file locking and file permission issues. Those are probably solvable, but I suspect that this isn't really what you want here.
First: realize that it's generally a good idea to maintain a separation between code and data. There are exceptions to this, but for most purposes, you'll want the parts of your program that can change at runtime to read their configuration from a file and write changes back to that same file.
Second: idiomatically, many Python projects use YAML for configuration.
Here's a simple script that uses the yaml library to read from a file called 'config.yaml', and increments the value of 'a' each time the program runs:
#!/usr/bin/python
import yaml

config_vals = ""
with open("config.yaml", "r") as cr:
    config_vals = yaml.safe_load(cr)  # safe_load: no arbitrary object construction

a = config_vals['a']
b = config_vals['b']
print a + b

config_vals['a'] = a + 1
with open("config.yaml", "w") as cw:
    yaml.dump(config_vals, cw, default_flow_style=True)
The runtime output looks like this:
$ ./run.py
3
$ ./run.py
4
$ ./run.py
5
The initial YAML configuration file looks like this:
a: 1
b: 2
Make a file a.txt that contains one character on one line:
0
Then in your script, open that file and retrieve the value, then immediately change it:
with open('a.txt') as f:
    a = int(f.read())

with open('a.txt', 'w') as output:
    output.write(str(a + 1))

b = 1
print a + b
On the first run of the program, a will be 0, and it will change the file to contain a 1. On subsequent runs, a will continue to be incremented by 1 each time.
Gerrat's code, but modified:
# some code here
a = 0
b = 1
print(a + b)

applyLine = 1  # which line to apply the change to (line 1 = 0, line 2 = 1)

with open(__file__, 'r') as f:
    lines = f.read().split('\n')                  # make each line a str in a list called 'lines'
    val = int(lines[applyLine].split(' = ')[-1])  # get the int after ' = ' on the applied line
    new_line = 'a = {}'.format(val + 1)           # generate the new line
    lines[applyLine] = new_line                   # update 'lines' with the new line
    write = "\n".join(lines)                      # join everything back into the text to rewrite

with open(__file__, 'w') as f:
    f.write(write)  # update the code
I am currently in some trouble regarding Python and reading files. I have to open a file in a while loop and do some stuff with the values from the file. The results are written into a new file. This new file is then read in the next run of the while loop. But in this second run I get no values out of the file... Here is a code snippet that hopefully clarifies what I mean.
while convergence == 0:
    run += 1
    prevrun = run - 1
    if os.path.isfile("./Output/temp/EmissionMat%d.txt" % prevrun) == True:
        matfile = open("./Output/temp/EmissionMat%d.txt" % prevrun, "r")
        EmissionMat = Aux_Functions.EmissionMat(matfile)
        matfile.close()
    else:
        matfile = open("./Input/EmissionMat.txt", "r")
        EmissionMat = Aux_Functions.EmissionMat(matfile)
        matfile.close()
    # now some valid operations, which produce a matrix
    emissionmat_file = open("./output/temp/EmissionMat%d.txt" % run, "w")
    emissionmat_file.flush()
    emissionmat_file.write(str(matrix))
    emissionmat_file.close()
Solved it!
matfile.seek(0)
This resets the pointer to the beginning of the file and allows me to read the file in the next run correctly.
Why write to a file and then read it back? Moreover, you use flush, so you are potentially doing long I/O. I would do:
with open(originalpath) as f:
    mat = f.read()

while condition:
    run += 1
    write_mat_run(mat, run)
    mat = func(mat)
write_mat_run may be done in another thread. You should check for I/O exceptions.
BTW, this will probably solve your bug, or at least make it clearer.
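If you do want the write in another thread, a minimal sketch could look like this (write_mat_run here is a hypothetical helper and the file name is illustrative):
import threading

def write_mat_run(mat, run):
    # write the matrix for this run; plain str(), just as in the question
    with open("EmissionMat%d.txt" % run, "w") as out:
        out.write(str(mat))

mat = [[0, 1], [1, 0]]  # placeholder matrix
run = 1
t = threading.Thread(target=write_mat_run, args=(mat, run))
t.start()   # the write happens in the background
# ... compute the next matrix here while the previous one is written ...
t.join()    # wait for the write to finish before starting the next round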
I can see nothing wrong with your code. The following concrete example worked on my Linux machine:
import os

run = 0
while run < 10:
    run += 1
    prevrun = run - 1
    if os.path.isfile("output%d.txt" % prevrun):
        matfile = open("output%d.txt" % prevrun, "r")
        data = matfile.readlines()
        matfile.close()
    else:
        matfile = open("input.txt", "r")
        data = matfile.readlines()
        matfile.close()
    data = [s[:-1] + "!\n" for s in data]
    emissionmat_file = open("output%d.txt" % run, "w")
    emissionmat_file.writelines(data)
    emissionmat_file.close()
It adds an exclamation mark to each line in the file input.txt.
I solved it.
Before closing the file I do:
matfile.seek(0)
This solved my problem. This method resets the reader's pointer to the beginning of the file.
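For what it's worth, a tiny illustration of what seek(0) does on a file object that has already been read (assuming input.txt exists):
f = open("input.txt", "r")
first_pass = f.read()
f.seek(0)                 # move the read position back to the start
second_pass = f.read()    # without the seek this would be an empty string
f.close()
print(first_pass == second_pass)  # True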