I have a huge text file (12 GB). The lines are tab-delimited and the first column contains an ID. For each ID I want to do something, so my plan is to start at the first line and go through the first column line by line until the next ID is reached.
import linecache

start_line = b                 # b: current line number (1-based), set before this snippet
num_lines = 377763316
while b < num_lines:
    plasmid1 = linecache.getline("Result.txt", b - 1)
    plasmid1 = plasmid1.strip("\n")
    plasmid1 = plasmid1.split("\t")
    plasmid2 = linecache.getline("Result.txt", b)
    plasmid2 = plasmid2.strip("\n")
    plasmid2 = plasmid2.split("\t")
    if not str(plasmid1[0]) == str(plasmid2[0]):
        end_line = b
        # do something
    b += 1                     # advance to the next line
The code works, but the problem is that linecache seems to reload the text file every time; the code would take several years to run if I don't improve the performance.
I'd appreciate your help if you have a good idea how to solve the issue or know an alternative approach!
Thanks,
Philipp
I think numpy.loadtxt() is the way to go. It would also be nice to pass the usecols argument to specify which columns you actually need from the file. NumPy is a solid library written with high performance in mind.
After calling loadtxt() you will get an ndarray back.
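A minimal sketch of that idea (the file name and the first-column index are assumptions taken from the question):

import numpy as np

# Load only the first column (the IDs) as strings; usecols avoids parsing
# the columns you don't need. Note the whole result is held in memory.
ids = np.loadtxt("Result.txt", delimiter="\t", usecols=(0,), dtype=str)
print(ids.shape)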
You can use itertools:
from itertools import takewhile

class EqualityChecker(object):
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        result = False
        current_id = current_line.split('\t')[0]
        if self.id == current_id:
            result = True
        return result

with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        for line in takewhile(checker, f.xreadlines()):
            do_stuff(line)
In the outer loop, id can actually be obtained from the first line whose id doesn't match the previous value.
You should open the file just once, and iterate over the lines.
with open('Result.txt', 'r') as f:
    aline = f.next()
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid
You get the idea, just use plain Python.
Only one line is read in each iteration. The extra 1 argument to split makes it split only at the first tab, increasing performance. You will not get better performance with any specialized library; only a plain C implementation could beat this approach.
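For example, maxsplit=1 stops splitting after the first tab, so only the ID is separated out:

line = 'ID1\tcol2\tcol3'
print(line.split('\t', 1))   # ['ID1', 'col2\tcol3']
print(line.split('\t'))      # ['ID1', 'col2', 'col3']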
If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.x (see the question io-textiowrapper-object). Try this version instead:
with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid
Related
I'm using the 're' library to replace occurrences of different strings in multiple files. The replacement pattern works fine, but I'm not able to maintain the changes to the files. I'm trying to get the same functionality that comes with the following lines:
with open(KEY_FILE, mode='r', encoding='utf-8-sig') as f:
    replacements = csv.DictReader(f)
    user_data = open(temp_file, 'r').read()
    for col in replacements:
        user_data = user_data.replace(col[ORIGINAL_COLUMN], col[TARGET_COLUMN])
    data_output = open(f"{temp_file}", 'w')
    data_output.write(user_data)
    data_output.close()
The key line here is:
user_data = user_data.replace(col[ORIGINAL_COLUMN], col[TARGET_COLUMN])
It takes care of updating the data in place using the replace method.
I need to do the same but with the 're' library:
with open(KEY_FILE, mode='r', encoding='utf-8-sig') as f:
    replacements = csv.DictReader(f)
    user_data = open(temp_file, 'r').read()
    a = open(f"{test_file}", 'w')
    for col in replacements:
        original_str = col[ORIGINAL_COLUMN]
        target_str = col[TARGET_COLUMN]
        compiled = re.compile(re.escape(original_str), re.IGNORECASE)
        result = compiled.sub(target_str, user_data)
        a.write(result)
I only end up with the last item in the .csv dict changed in the output file. I can't seem to get the changes made in previous iterations of the for loop to persist.
I know that it is pulling from the same file each time... which is why it gets reset each loop, but I can't sort out a workaround.
Thanks
Try something like this?
#!/usr/bin/env python3
import csv
import re
import sys
from io import StringIO

KEY_FILE = '''aaa,bbb
xxx,yyy
'''

TEMP_FILE = '''here is aaa some text xxx
bla bla aaaxxx
'''

ORIGINAL_COLUMN = 'FROM'
TARGET_COLUMN = 'TO'

user_data = StringIO(TEMP_FILE).read()

with StringIO(KEY_FILE) as f:
    reader = csv.DictReader(f, ['FROM', 'TO'])
    for row in reader:
        original_str = row[ORIGINAL_COLUMN]
        target_str = row[TARGET_COLUMN]
        compiled = re.compile(re.escape(original_str), re.IGNORECASE)
        user_data = compiled.sub(target_str, user_data)

sys.stdout.write("modified user_data:\n" + user_data)
Some things to note:
The main problem was result = sub(..., user_data) rather than result = sub(..., result). You want to keep updating the same string, rather than always applying to the original.
Compiling the regex is fairly pointless in this case, since each pattern is used just once (see the sketch after these notes).
I don't have access to your test files, so I used StringIO versions inline and printed to stdout; hopefully that's easy enough to translate back to your real code (:
In future posts, you might consider doing similar, so that your question has 100% runnable code someone else can try out without guessing.
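As a sketch of the second note (same variable names as the script above), the compile step could be dropped and re.sub called directly:

for row in reader:
    # one-call substitution; flags=re.IGNORECASE keeps the case-insensitive behaviour
    user_data = re.sub(re.escape(row[ORIGINAL_COLUMN]),
                       row[TARGET_COLUMN],
                       user_data,
                       flags=re.IGNORECASE)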
I want to read a text file using Python and print out specific lines. The problem is that I want to print a line which starts with the word "nominal" (and I know how to do it) and also the line following it, which is not recognizable by any specific string. Could you point me to some lines of code that are able to do that?
In good faith and under the assumption that this will help you start coding and showing some effort, here you go:
file_to_read = r'myfile.txt'

with open(file_to_read, 'r') as f_in:
    flag = False
    for line in f_in:
        if line.startswith('nominal'):
            print(line)
            flag = True
        elif flag:
            print(line)
            flag = False
It might work out of the box, but please spend some time going through it and you will definitely get the logic behind it. Note that text comparison in Python is case sensitive.
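If the file might also contain "Nominal" or "NOMINAL", a case-insensitive variant of the check could look like this (a small sketch based on the snippet above, not part of the original answer):

with open(file_to_read, 'r') as f_in:
    flag = False
    for line in f_in:
        if line.lower().startswith('nominal'):   # lower-case the line before comparing
            print(line)
            flag = True
        elif flag:
            print(line)
            flag = False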
If the file isn't too large, you can put it all in a list:
def printLines(fname):
    with open(fname) as f:
        lines = f.read().split('\n')
    if len(lines) == 0:
        return None
    if lines[0].startswith('Nominal'):
        print(lines[0])
    for i, line in enumerate(lines[1:]):
        # lines[i] is the line just before `line`
        if lines[i].startswith('Nominal') or line.startswith('Nominal'):
            print(line)
Then e.g. printLines('test.txt') will do what you want.
I have an input file containing a list of strings.
I am iterating through every fourth line, starting on line two.
From each of these lines I make a new string from the first and last 6 characters, and I put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep-sequencing files, and it has been running for a day without making much progress. So I'm looking for any suggestions to make this much faster if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            # Make string from first and last 6 characters of a line
            lineChars = line[0:6] + line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n')  # If string is unique, write to output file
            for skip in range(3):  # Used to step through four lines at a time
                try:
                    check = line  # Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
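A rough sketch of both suggestions combined, using the variable names from the question (the every-fourth-line stepping is omitted here for brevity):

def method():
    seen = set()
    output_lines = []
    with open(input_file, 'r') as f:
        for line in f:
            lineChars = line[0:6] + line[145:151]
            if lineChars not in seen:          # set membership test is O(1)
                seen.add(lineChars)
                output_lines.append(lineChars + '\n')
    with open(output_file, 'w') as target:
        target.writelines(output_lines)        # one bulk write at the end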
You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):
import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            s = line[:6] + line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using a set as Oscar suggested, you can also use islice to skip lines rather than stepping over them with an inner for loop.
As stated in this post, islice processes the iterator in C, so it should be much faster than a plain vanilla Python for loop.
Try replacing
lineChars = line[0:6]+line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.
This probably measures how pythonic you are. I'm playing around trying to learn Python, so I'm not close to being pythonic enough. The infile is a dummy patriline and I want a list of father-son pairs.
infile:
haffi jolli dkkdk lkskkk lkslll sdkjl kljdsfl klsdlj sdklja asldjkl
code:
def main():
    infile = open('C:\Users\Notandi\Desktop\patriline.txt', 'r')
    line = infile.readline()
    tmpstr = line.split('\t')
    for i in tmpstr[::2]:
        print i, '\t', i + 1
    infile.close()

main()
The issue is i + 1; I want to print out two strings in every line. Is this clear?
You are getting confused between the words in the split string and their indices. For example, the first word is "haffi" but the first index is 0.
To iterate over both the indices and their corresponding words, use enumerate:
for i, word in enumerate(tmpstr):
    print word, tmpstr[i+1]
Of course, this looks messy. A better way is to just iterate over pairs of strings. There are many ways to do this; here's one.
def pairs(it):
    it = iter(it)
    for element in it:
        yield element, next(it)

for word1, word2 in pairs(tmpstr):
    print word1, word2
I'd use the with statement here, which, if you're using an older version of Python, you need to import:
from __future__ import with_statement
For the actual code, if you can afford to load the whole file into memory twice (i.e., it's pretty small), I would do this:
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as f:
        strings = f.read().split('\t')
    for father, son in zip(strings, strings[1:]):
        print "%s \t %s" % (father, son)

main()
That way you skip the last item without too much overhead and don't include the childless leaf at the end, which I think is what you were asking for(?)
As a bit of a tangent: if the file is really big, you may not want to load the whole thing into memory, in which case you may need a generator. You probably don't need to do this if you're actually printing everything out, but in case this is some simplified version of the problem, this is how I would approach making a generator to split the file:
class reader_and_split():
    def __init__(self, fname, delim='\t'):
        self.fname = fname
        self.delim = delim

    def __enter__(self):
        self.file = open(self.fname, 'r')
        return self.word_generator()

    def __exit__(self, type, value, traceback):
        self.file.close()

    def word_generator(self):
        current = []
        while True:
            char = self.file.read(1)
            if char == self.delim:
                yield ''.join(current)
                current = []
            elif not char:
                break
            else:
                current.append(char)
        if current:
            yield ''.join(current)  # emit the final word if the file doesn't end with a delimiter
The value of a generator is that you don't load the entire contents of the file into memory before running the split on it, which can be expensive for very, very large files. This implementation only allows a single-character delimiter, for simplicity. That means all you need to do to parse everything out is to use the generator; a quick and dirty way to do this is:
with reader_and_split(fileloc) as f:
    previous = f.next()
    for word in f:
        print "%s \t %s" % (previous, word)
        previous = word
You can be more pythonic in both your file reading and printing. Try this:
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as f:
        strings = f.readline().split('\t')
    for i, word in enumerate(strings):
        print "{} \t {}".format(word, strings[i+1:i+2])

main()
Using strings[i+1:i+2] ensures an IndexError isn't thrown (instead, returning a []) when trying to reach the i+1th index at the end of the list.
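A tiny illustration of that difference (not from the original answer):

words = ['haffi', 'jolli']
print words[2:3]    # prints [] -- slicing past the end is safe
# print words[2]    # would raise an IndexError -- indexing past the end is not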
Here's one clean way to do it. It has the benefit of not crashing when fed an odd number of items, but of course you may prefer an exception for that case.
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as infile:
        line = infile.readline()
    previous = None
    for i in line.split('\t'):
        if previous is None:
            previous = i
        else:
            print previous, '\t', i
            previous = None
I won't make any claims that this is pythonic though.
How do I check for EOF in Python? I found a bug in my code where the last block of text after the separator isn't added to the return list. Or maybe there's a better way of expressing this function?
Here's my code:
import StringIO

def get_text_blocks(filename):
    text_blocks = []
    text_block = StringIO.StringIO()
    with open(filename, 'r') as f:
        for line in f:
            text_block.write(line)
            print line
            if line.startswith('-- -'):
                text_blocks.append(text_block.getvalue())
                text_block.close()
                text_block = StringIO.StringIO()
    return text_blocks
You might find it easier to solve this using itertools.groupby.
def get_text_blocks(filename):
    import itertools
    with open(filename, 'r') as f:
        groups = itertools.groupby(f, lambda line: line.startswith('-- -'))
        return [''.join(lines) for is_separator, lines in groups if not is_separator]
Another alternative is to use a regular expression to match the separators:
def get_text_blocks(filename):
    import re
    separator = re.compile('^-- -.*', re.M)
    with open(filename, 'r') as f:
        return re.split(separator, f.read())
The end-of-file condition holds as soon as the for statement terminates -- that seems the simplest way to minorly fix this code (you can extract text_block.getvalue() at the end if you want to check it's not empty before appending it).
This is the standard problem with emitting buffers.
You don't detect EOF -- that's needless. You write the last buffer.
def get_text_blocks(filename):
    text_blocks = []
    text_block = StringIO.StringIO()
    with open(filename, 'r') as f:
        for line in f:
            text_block.write(line)
            print line
            if line.startswith('-- -'):
                text_blocks.append(text_block.getvalue())
                text_block.close()
                text_block = StringIO.StringIO()
    ### At this moment, you are at EOF
    if len(text_block.getvalue()) > 0:
        text_blocks.append(text_block.getvalue())
    ### Now your final block (if any) is appended.
    return text_blocks
Why do you need StringIO here?
def get_text_blocks(filename):
    text_blocks = [""]
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('-- -'):
                text_blocks.append(line)
            else:
                text_blocks[-1] += line
    return text_blocks
EDIT: Fixed the function; other suggestions might be better, I just wanted to write a function similar to the original one.
EDIT: This assumed the file starts with "-- -"; by adding an empty string to the list you can "fix" the IndexError, or you could use this one:
def get_text_blocks(filename):
    text_blocks = []
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('-- -'):
                text_blocks.append(line)
            else:
                if len(text_blocks) != 0:
                    text_blocks[-1] += line
    return text_blocks
But both versions look a bit ugly to me; the regex version is much cleaner.
This is a fast way to see if you have an empty file:
if f.read(1) == '':
    print "EOF"
f.close()