I have a file of about 100 million lines in which I want to replace text with alternate text stored in a tab-delimited file. The code that I have works, but is taking about an hour to process the first 70K lines. In trying to incrementally advance my Python skills, I am wondering whether there is a faster way to do this. Thanks!
The input file looks something like this:
CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518
CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518
and the file with replacement values looks like this:
WBGene00045518 21ur-5153
Here is my code:
infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime
startTime = datetime.now()

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print linecounter
        print (datetime.now()-startTime)

infile1.close()
infile2.close()
outfile.close()
You should split your lines into "words" and only look up these words in your dictionary:
>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']
This will eliminate the loop over the dictionary you do for every single line.
Here's the complete code:
import re
with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        outfile.write(line)
Edit: An alternative approach is to build a single mega-regex from your dictionary:
with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

regex = re.compile("|".join(map(re.escape, udict)))

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        outfile.write(regex.sub(lambda m: udict[m.group()], line))
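One caveat worth adding (my note, not part of the original answer): re alternation tries the branches from left to right, so if one replacement key is a prefix of another, the shorter key can win. Sorting the keys longest-first avoids that:

regex = re.compile("|".join(sorted(map(re.escape, udict), key=len, reverse=True)))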
I was thinking about your loop over the dictionary keys and a way to optimize it, and meant to make other comments on your code later.
But then I stumbled on this part:
if linecounter in mult10K:
    print linecounter
    print (datetime.now()-startTime)
This innocent-looking snippet actually makes Python sequentially scan and compare the 100 items in your mult10K list for every single line in your file.
Replace this part with:
if linecounter % 10000 == 0:
    print linecounter
    print (datetime.now()-startTime)
(And forget the whole mult10K part) - you should get a significant speed-up.
Also, it seems like you are writing multiple output lines for each input line. Your main loop looks like this:
linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
Replace it for this:
for linecounter, line in enumerate(infile2):
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
    outfile.write(line + '\n')
This properly writes only one output line for each input line (besides eliminating the code duplication and handling the line counting in a more Pythonic way).
This code is full of linear searches. It's no wonder it's running slowly. Without knowing more about the input, I can't give you advice on how to fix these problems, but I can at least point out the problems. I'll note major issues, and a couple of minor ones.
udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)
Don't use update here; just add the item to the dictionary:
udict[linelist[0]] = linelist[1]
This will be faster than creating a dictionary for every entry. (And actually, Sven Marnach's generator-based approach to creating this dictionary is better still.) This is fairly minor though.
mult10K = []
for x in range(100):
    mult10K.append(x * 10000)
This is totally unnecessary. Remove this; I'll show you one way to print at intervals without this.
linecounter = 0
for line in infile2:
    for key, value in udict.items():
This is your first big problem. You're doing a linear search through the dictionary for keys in the line, for each line. If the dictionary is very large, this will require a huge number of operations: 100,000,000 * len(udict).
matches = line.count(key)
This is another problem. You're looking for matches using a linear search. Then you do replace, which does the same linear search! You don't need to check for a match; replace just returns the same string if there isn't one. This won't make a huge difference either, but it will gain you something.
line = line.replace(key, value)
Keep doing these replaces, and then only write the line once all replacements are done:
outfile.write(line + '\n')
And finally,
linecounter += 1
if linecounter in mult10K:
Forgive me, but this is a ridiculous way to do this! You're doing a linear search through mult10K to determine when to print a line. Here again, this adds a total of almost 100,000,000 * 100 operations. You should at least search in a set; but the best approach (if you really must do this) would be to do a modulo operation and test that.
if not linecounter % 10000:
    print linecounter
    print (datetime.now()-startTime)
To make this code efficient, you need to get rid of these linear searches. Sven Marnach's answer suggests one way that might work, but I think it depends on the data in your file, since the replacement keys might not correspond to obvious word boundaries. (The regex approach he added addresses that, though.)
This is not Python specific, but you might unroll your double for loop a bit so that the file writes do not occur on every iteration of the loop. Perhaps write to the file every 1000 or 10,000 lines.
I'm hoping that writing a line of output for each line of input times the number of replacement strings is a bug, and you really only intended to write one output for each input.
You need to find a way to test the lines of input for matches as quickly as possible. Looping through the entire dictionary is probably your bottleneck.
I believe regular expressions are precompiled into state machines that can be highly efficient. I have no idea on how the performance suffers when you generate a huge expression, but it's worth a try.
freakin_huge_re = re.compile('(' + ')|('.join(udict.keys()) + ')')

for line in infile2:
    matches = [''.join(tup) for tup in freakin_huge_re.findall(line)]
    if matches:
        for key in matches:
            line = line.replace(key, udict[key])
The obvious one in Python is the list comprehension - it's a faster (and more readable) way of doing this:
mult10K = []
for x in range(100):
    mult10K.append(x * 10000)
as this:
mult10K = [x*10000 for x in range(100)]
Likewise, where you have:
udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)
We can use a dict comprehension (with a generator expression):
lines = (line.strip().split('\t') for line in infile1)
udict = {line[0]: line[1] for line in lines}
It's also worth noting here that you appear to be working with a tab-delimited file, in which case the csv module might be a much better option than using split().
Also note that using the with statement will increase readability and make sure your files get closed (even on exceptions).
Print statements will also slow things down quite a lot if they are being performed on every loop - they are useful for debugging, but when running on your main chunk of data, it's probably worth removing them.
Another 'more pythonic' thing you can do is use enumerate() instead of adding one to a variable each time. E.g:
linecounter = 0
for line in infile2:
    ...
    linecounter += 1
Can be replaced with:
for linecounter, line in enumerate(infile2):
    ...
Where you are counting occurrences of a key, the better solution is to use in:
if key in line:
As this short-circuits after finding an instance.
Adding all this up, let's see what we have:
import csv
from datetime import datetime
startTime = datetime.now()

with open('f1.txt', 'r') as infile1:
    reader = csv.reader(infile1, delimiter='\t')
    udict = dict(reader)

with open('f2.txt', 'r') as infile2, open('out.txt', 'w') as outfile:
    for line in infile2:
        for key, value in udict.items():
            if key in line:
                line = line.replace(key, value)
        outfile.write(line)
Edit: List comp vs normal loop, as requested in the comments:
python -m timeit "[i*10000 for i in range(10000)]"
1000 loops, best of 3: 909 usec per loop
python -m timeit "a = []" "for i in range(10000):" " a.append(i)"
1000 loops, best of 3: 1.01 msec per loop
Note usec vs msec. It's not massive, but it's something.
I have a relatively large text file (around 7 million lines) and I want to run some specific logic over it, which I'll try to explain below:
A1KEY1
A2KEY1
B1KEY2
C1KEY3
D1KEY3
E1KEY4
I want to count the frequency of appearance of the keys, and then output those with a frequency of 1 into one text file, those with a frequency of 2 into another, and those with a frequency higher than 2 into another.
This is the code I have so far, but it iterates over the dictionary painfully slowly, and it gets slower the further it progresses.
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content

dict_in = dict()
seen = []
fileinlist = filetoliststrip(file_in)
out_file = open(file_ot, 'w')
out_file2 = open(file_ot2, 'w')
out_file3 = open(file_ot3, 'w')

counter = 0
for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)

for j in dict_in.keys():
    print("Processing key: " + str(j))
    #print(dict_in[j])
    if dict_in[j][0] < 2:
        out_file.write(str(dict_in[j][1]))
    elif dict_in[j][0] == 2:
        for line_in in dict_in[j][1:]:
            out_file2.write(str(line_in) + "\n")
    elif dict_in[j][0] > 2:
        for line_in in dict_in[j][1:]:
            out_file3.write(str(line_in) + "\n")

out_file.close()
out_file2.close()
out_file3.close()
I'm running this on a Windows PC (i7, 8GB RAM); it shouldn't be taking hours to run. Is this a problem with the way I read the file into a list? Should I use a different method? Thanks in advance.
You have multiple points that slow down your code. There is no need to load the whole file into memory only to iterate over it again; there is no need to get a list of keys each time you want to do a lookup (if key not in dict_in: ... will suffice and will be blazingly fast); you don't need to keep a line counter, as you can just check the length of the collected lines afterwards... to name but a few.
I'd completely restructure your code as:
import collections

dict_in = collections.defaultdict(list)  # save some time with a dictionary factory

with open(file_in, "r") as f:  # open the file_in for reading
    for line in f:  # read the file line by line
        key = line.strip()[10:69]  # assuming this is how you get your key
        dict_in[key].append(line)  # add the line as an element of the found key

# now that we have the lines in their own key brackets, lets write them based on frequency
with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    selector = {1: f1, 2: f2}  # make our life easier with a quick length-based lookup
    for values in dict_in.values():  # use dict_in.itervalues() on Python 2.x
        selector.get(len(values), f3).writelines(values)  # write the collected lines
And you'll hardly get more efficient than that, at least in Python.
Keep in mind that this will not guarantee the order of lines in the output prior to Python 3.7 (or CPython 3.6). The order within a key itself will be preserved, however. If you need to keep the line order prior to the aforementioned Python versions, you'll have to keep a separate key-order list and iterate over it to pick up the dict_in values in order.
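A minimal sketch of that separate key-order list (my addition, assuming the same file_in and key slice as above):

import collections

dict_in = collections.defaultdict(list)
key_order = []  # first-seen order of the keys

with open(file_in, "r") as f:
    for line in f:
        key = line.strip()[10:69]
        if key not in dict_in:  # a membership test does not create the key
            key_order.append(key)
        dict_in[key].append(line)

# later, iterate in input order:
# for key in key_order:
#     values = dict_in[key]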
The first function:
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content
Here a list of raw lines is produced only to be stripped. That will require roughly twice as much memory as necessary, and just as importantly, several passes over data that doesn't fit in cache. We also don't need to make str of things repeatedly. So we can simplify it a bit:
def filetoliststrip(filename):
    return [line.strip() for line in open(filename, 'r')]
This still produces a list. If we're reading through the data only once, not storing each line, replace [] with () to turn it into a generator expression; in this case, since lines are actually held intact in memory until the end of the program, we'd only save the space for the list (which is still at least 30MB in your case).
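A sketch of that generator variant (the function name here is just illustrative):

def filetoiterstrip(filename):
    # yields stripped lines lazily instead of building a list
    return (line.strip() for line in open(filename, 'r'))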
Then we have the main parsing loop (I adjusted the indentation as I thought it should be):
counter = 0
for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)
There are several suboptimal things here.
First, the counter could be an enumerate (when you don't have an iterable, there's range or itertools.count). Changing this will help with clarity and reduce the risk of mistakes.
for counter, line in enumerate(fileinlist, 1):
Second, it's more efficient to form a string in one operation than add it from bits:
print("Loading line {} : {}".format(counter, line))
Third, there's no need to extract the keys for a dictionary member check. In Python 2 that builds a new list, which means copying all the references held in the keys, and gets slower with every iteration. In Python 3, it still means building a key view object needlessly. Just use keyf not in dict_in if the check is needed.
Fourth, the check really isn't needed. Catching the exception when a lookup fails is pretty much as fast as the if check, and repeating the lookup after the if check is almost certainly slower. For that matter, stop repeating lookups in general:
try:
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
except KeyError:
    dict_in[keyf] = [1, line]
This is such a common pattern, however, that the standard library has two implementations to help with it: Counter and defaultdict. A Counter is the more practical choice when all you want is the count; since we also keep the lines here, we'll use a defaultdict.
from collections import defaultdict

def newentry():
    return [0]

dict_in = defaultdict(newentry)

for counter, line in enumerate(fileinlist, 1):
    keyf = line[10:69]
    print("Loading line {} : {}".format(counter, line))
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
Using defaultdict lets us not worry about whether the entries existed or not.
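For comparison, if only the counts mattered and the lines didn't need to be kept, a Counter alone would do the whole job (a sketch, not part of the fix above):

from collections import Counter

# maps each key slice to the number of lines that carry it
key_counts = Counter(line[10:69] for line in fileinlist)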
We now arrive at the output phase. Again we have needless lookups, so let's reduce them to one iteration:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        out_file.write(lines[0])
    elif count == 2:
        for line_in in lines:
            out_file2.write(line_in + "\n")
    elif count > 2:
        for line_in in lines:
            out_file3.write(line_in + "\n")
That still has a few annoyances. We've repeated the writing code, it builds other strings (tagging on "\n"), and it has a whole chunk of similar code for each case. In fact, the repetition probably caused a bug: there's no newline separator for the single occurrences in out_file. Let's factor out what really differs:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        key_outf = out_file
    elif count == 2:
        key_outf = out_file2
    else:  # elif count > 2:  # Test not needed
        key_outf = out_file3
    key_outf.writelines(line_in + "\n" for line_in in lines)
I've left the newline concatenation because it's more complex to mix them in as separate calls. The string is short-lived and it serves a purpose to have the newline in the same place: it makes it less likely at OS level that a line is broken up by concurrent writes.
You'll have noticed there are Python 2 and 3 differences here. Most likely your code wasn't all that slow if run in Python 3 in the first place. There exists a compatibility module called six to write code that more easily runs in either; it lets you use e.g. six.viewkeys and six.iteritems to avoid this gotcha.
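A tiny sketch of what that looks like for the output loop (my addition):

import six

# iteritems() on Python 2, items() on Python 3
for key, value in six.iteritems(dict_in):
    print("Processing key: " + key)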
You load a very large file into memory all at once. When you don't actually need all the lines at once and just need to process them, use a generator. It is more memory-efficient.
Counter is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. You can use that to count the frequency of the keys. Then simply iterate over the new dict and append the key to the relevant file:
from collections import Counter

keys = ['A1KEY1', 'A2KEY1', 'B1KEY2', 'C1KEY3', 'D1KEY3', 'E1KEY4']
count = Counter(keys)

with open('single.txt', 'w') as f1:
    with open('double.txt', 'w') as f2:
        with open('more_than_double.txt', 'w') as f3:
            for k, v in count.items():
                if v == 1:
                    f1.write(k + '\n')
                elif v == 2:
                    f2.write(k + '\n')
                else:
                    f3.write(k + '\n')
I have a huge text file (12GB). The lines are tab delimited and the first column contains an ID. For each ID I want to do something. Therefore, my plan is to start with the first line and go through the first column line by line until the next ID is reached.
start_line = b
num_lines = 377763316

while b < num_lines:
    plasmid1 = linecache.getline("Result.txt", b-1)
    plasmid1 = plasmid1.strip("\n")
    plasmid1 = plasmid1.split("\t")
    plasmid2 = linecache.getline("Result.txt", b)
    plasmid2 = plasmid2.strip("\n")
    plasmid2 = plasmid2.split("\t")
    if not str(plasmid1[0]) == str(plasmid2[0]):
        end_line = b
        #do something
The code works, but the problem is that linecache seems to reload the txt file every time. The code would take several years to run if I don't improve the performance.
I appreciate your help if you have a good idea how to solve the issue or know an alternative approach!
Thanks,
Philipp
I think numpy.loadtxt() is the way to go. It would also be nice to pass the usecols argument to specify which columns you actually need from the file. The numpy package is a solid library written with high performance in mind.
After calling loadtxt() you will get an ndarray back.
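A minimal sketch of that (my addition, untested on the real file; it assumes the IDs sit in the first tab-separated column of Result.txt and that the column fits in memory):

import numpy as np

# read only the first column, as strings
ids = np.loadtxt("Result.txt", delimiter="\t", usecols=(0,), dtype=str)

# the distinct IDs and the row index where each one first appears
unique_ids, first_rows = np.unique(ids, return_index=True)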
You can use itertools:
from itertools import takewhile

class EqualityChecker(object):
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        result = False
        current_id = current_line.split('\t')[0]
        if self.id == current_id:
            result = True
        return result

with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        for line in takewhile(checker, f.xreadlines()):
            do_stuff(line)
In the outer loop, id can actually be obtained from the first line whose id doesn't match the previous value.
You should open the file just once, and iterate over the lines.
with open('Result.txt', 'r') as f:
    aline = f.next()
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid
You get the idea, just use plain python.
Only one line is read in each iteration. The extra 1 argument in the split will split only at the first tab, increasing performance. You will not get better performance with any specialized library. Only a plain C implementation could beat this approach.
If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.x (see question io-textiowrapper-object). Try this version instead:
with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid
I have a file, dataset.nt, which isn't too large (300Mb). I also have a list, which contains around 500 elements. For each element of the list, I want to count the number of lines in the file which contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).
This is the first thing I tried to achieve that result:
mydict = {}
for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total
It didn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:
mydict = {}
with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
Performance didn't improve; the script still ran indefinitely. So I googled around and tried this:
mydict = {}
file = open("dataset.nt", "rb")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
That one has been running for the last 30 minutes, so I'm assuming it's not any better.
How should I structure this code so that it completes in a reasonable amount of time?
I'd favor a slight modification of your second version:
mydict = dict.fromkeys(mylist, 0)
re_list = [re.compile(r"/Main/"+re.escape(i)) for i in mylist]

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if not '/Main/' in line:
            continue
        # do the regex-part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1
As #matsjoyce already suggested, this avoids re-compiling the regex on each iteration.
If you really need that many different regex patterns, then I don't think there's much you can do.
Maybe it's worth checking whether you can regex-capture whatever comes after "/Main/" and then compare that to your list. That may help reduce the number of "real" regex searches.
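A rough sketch of that idea (mine, untested; the \w+ pattern is only a guess at what the names after /Main/ look like):

import re

main_re = re.compile(r"/Main/(\w+)")
wanted = set(mylist)
mydict = dict.fromkeys(mylist, 0)

with open("dataset.nt", "r") as input_file:
    for line in input_file:
        # set() so a name repeated within one line still counts as one line
        for name in set(main_re.findall(line)):
            if name in wanted:
                mydict[name] += 1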
Looks like a good candidate for some map/reduce like parallelisation... You could split your dataset file in N chunks (where N = how many processors you have), launch N subprocesses each scanning one chunk, then sum the results.
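A very rough sketch of that idea (my addition, untested; it assumes the whole file fits in memory, that mylist holds the ~500 names from the question, and a fork-based start method so the workers can see the module globals):

import re
from multiprocessing import Pool

mylist = []  # the ~500 element names from the question

def scan_chunk(lines):
    # count, for each element of mylist, how many of these lines match it
    counts = dict.fromkeys(mylist, 0)
    patterns = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
    for line in lines:
        if '/Main/' not in line:
            continue
        for i, regex in patterns:
            if regex.search(line):
                counts[i] += 1
    return counts

if __name__ == "__main__":
    with open("dataset.nt", "r") as input_file:
        all_lines = input_file.readlines()
    n_procs = 4  # arbitrary; use the number of cores you have
    chunks = [all_lines[i::n_procs] for i in range(n_procs)]
    pool = Pool(n_procs)
    partial_counts = pool.map(scan_chunk, chunks)
    pool.close()
    pool.join()
    # sum the per-chunk counts
    results = dict.fromkeys(mylist, 0)
    for partial in partial_counts:
        for key, count in partial.items():
            results[key] += count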
This of course doesn't prevent you from first optimizing the scan, ie (based on sebastian's code):
targets = [(i, re.compile(r"/Main/"+re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1
Note that this could be better optimized if you posted a sample from your dataset. If, for example, your dataset can be sorted on "/Main/{i}" (using the system sort program, for example), you wouldn't have to check each line for each value of i. Or if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).
The other solutions are very good. But since there is a regex for each element, and it doesn't matter whether the element appears more than once per line, you could count the lines containing the target expression using re.findall.
Also, beyond a certain number of lines it is better to read the whole file into memory (if you have enough memory and it isn't a design restriction).
import re

mydict = {}
mylist = [...]  # A list with 500 items

# Optimizing calls
findall = re.findall  # Then Python doesn't have to resolve these functions for every call
escape = re.escape

with open("dataset.nt", "rb") as input:
    text = input.read()  # Read the file once and keep it in memory instead of reading it line by line. If the number of lines is big this is faster.
    for elem in mylist:
        mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))  # Count the lines where the target regex is.
I tested this with a file of 800MB in size (I wanted to see how long it takes to load a file this big into memory; it's faster than you would think).
I didn't test the whole code with real data, just the findall part.
I have the following text file:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,456
FRUIT
DRINK
FOOD,BURGER
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
CAR
And I have the following list called 'wanted':
['123', '789']
What I'm trying to do is: if the number after NUM is not in the list called 'wanted', then that line along with the 4 lines below it gets deleted. So the output file will look like:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
My code so far is:
infile = open("inputfile.txt",'r')
data = infile.readlines()

for beginning_line, ube_line in enumerate(data):
    UNIT = data[beginning_line].split(',')[1]
    if UNIT not in wanted:
        del data_list[beginning_line:beginning_line+4]
You shouldn't modify a list while you are looping over it.
What you could try is to just advance the iterator on the file object when needed:
wanted = set(['123', '789'])

with open("inputfile.txt",'r') as infile, open("outfile.txt",'w') as outfile:
    for line in infile:
        if line.startswith('NUM,'):
            UNIT = line.strip().split(',')[1]
            if UNIT not in wanted:
                for _ in xrange(4):
                    infile.next()
                continue
        outfile.write(line)
And use a set. It is faster for constantly checking the membership.
This approach doesn't make you read in the entire file at once to process it in a list form. It goes line by line, reading from the file, advancing, and writing to the new file. If you want, you can replace the outfile with a list that you are appending to.
There are some issues with the code; for instance, data_list isn't even defined, and even if it were, deleting slices from a list you are iterating over shifts the remaining elements so your loop index no longer lines up. You also use both enumerate and direct index access on data, and readlines is not needed.
I'd suggest avoiding keeping all the lines in memory; it's not really needed here. Maybe try something like this (untested):
with open('infile.txt') as fin, open('outfile.txt', 'w') as fout:
    for line in fin:
        if line.startswith('NUM,') and line.strip().split(',')[1] not in wanted:
            for _ in range(4):
                fin.next()
        else:
            fout.write(line)
import re

# find the lines that match NUM,XYZ
nums = re.compile('NUM,(?:' + '|'.join(['456','012']) + ")")

# find the three lines after a nums match
line_matches = breaks = re.compile('.*\n.*\n.*\n')

keeper = ''
for line in nums.finditer(data):
    keeper += breaks.findall( data[line.start():] )[0]
The result on the given string is:
NUM,456
FRUIT
DRINK
FOOD,BURGER
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
Edit: deleting items while iterating is probably not a good idea; see: Remove items from a list while iterating
infile = open("inputfile.txt",'r')
data = infile.readlines()

SKIP_LINES = 4
skip_until = -1
result_data = []

for current_line, line in enumerate(data):
    if current_line <= skip_until:
        continue
    try:
        label, num = line.strip().split(',')
    except ValueError:
        result_data.append(line)
    else:
        if label == 'NUM' and num not in wanted:
            skip_until = current_line + SKIP_LINES
        else:
            result_data.append(line)
... and result_data is what you want.
If you don't mind building a list, and if your "NUM" lines come every fifth line, you may want to try:
keep = []
for (i, v) in enumerate(lines[::5]):
    (num, current) = v.strip().split(",")
    if current in wanted:
        keep.extend(lines[i*5:i*5+5])
Don't try to think of this in terms of building up a list and removing stuff from it while you loop over it. That way leads madness.
It is much easier to write the output file directly. Loop over lines of the input file, each time deciding whether to write it to the output or not.
Also, to avoid difficulties with the fact that not every line has a comma, try just using .partition instead to split up the lines. That will always return 3 items: when there is a comma, you get (before the first comma, the comma, after the comma); otherwise, you get (the whole thing, empty string, empty string). So for the NUM lines you can just take the last item from there, strip the trailing newline, and test it against wanted.
skip_counter = 0
for line in infile:
    if line.startswith('NUM') and line.rstrip().partition(',')[2] not in wanted:
        skip_counter = 5
    if skip_counter:
        skip_counter -= 1
    else:
        outfile.write(line)
What is the most pythonic way to read in a named file, strip lines that are either empty, contain only spaces, or have # as a first character, and then process remaining lines? Assume it all fits easily in memory.
Note: it's not tough to do this -- what I'm asking is for the most pythonic way. I've been writing a lot of Ruby and Java and have lost my feel.
Here's a strawman:
file_lines = [line.strip() for line in open(config_file, 'r').readlines() if len(line.strip()) > 0]
for line in file_lines:
    if line[0] == '#':
        continue
    # Do whatever with line here.
I'm interested in concision, but not at the cost of becoming hard to read.
Generators are perfect for tasks like this. They are readable, maintain perfect separation of concerns, and are efficient in memory use and time.
def RemoveComments(lines):
    for line in lines:
        if not line.strip().startswith('#'):
            yield line

def RemoveBlankLines(lines):
    for line in lines:
        if line.strip():
            yield line
Now applying these to your file:
filehandle = open('myfile', 'r')
for line in RemoveComments(RemoveBlankLines(filehandle)):
    Process(line)
In this case, it's pretty clear that the two generators can be merged into a single one, but I left them separate to demonstrate their composability.
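Merged, it might look like this (just a sketch for reference):

def RemoveCommentsAndBlankLines(lines):
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith('#'):
            yield line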
lines = [r for r in open(thefile) if not r.isspace() and r[0] != '#']
The .isspace() method of strings is by far the best way to test if a string is entirely whitespace -- no need for contortions such as len(r.strip()) == 0 (ech;-).
for line in open("file"):
    sline = line.strip()
    if sline and not sline[0] == "#":
        print line.strip()
output
$ cat file
one
#
#
two
three
$ ./python.py
one
two
three
I would use this:
processed = [process(line.strip())
             for line in open(config_file, 'r')
             if line.strip() and not line.strip().startswith('#')]
The only ugliness I see here is all the repeated stripping. Getting rid of it complicates the function a bit:
processed = [process(line)
             for line in (line.strip() for line in open(config_file, 'r'))
             if line and not line.startswith('#')]
This matches the description, i.e.:
strip lines that are either empty, contain only spaces, or have # as a first character, and then process remaining lines
So lines that start or end in spaces are passed through unfettered.
with open("config_file","r") as fp:
    data = (line for line in fp if line.strip() and not line.startswith("#"))
    for item in data:
        print repr(item)
I like Paul Hankin's thinking, but I'd do it differently:
from itertools import ifilter, ifilterfalse, imap

with open(r'c:\temp\testfile.txt', 'rb') as f:
    s1 = ifilterfalse(str.isspace, f)
    s2 = ifilter(lambda x: not x.startswith('#'), s1)
    s3 = imap(str.rstrip, s2)
    print "\n".join(s3)
I'd probably only do it this way instead of using some of the more obvious approaches suggested here if I were concerned about memory usage. And I might define an iscomment function to eliminate the lambda.
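That helper might look like this (my sketch), replacing the ifilter line above:

def iscomment(line):
    return line.startswith('#')

s2 = ifilterfalse(iscomment, s1)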
The file is small, so performance is not really an issue. I will go for clarity over conciseness:
fp = open('file.txt')
for line in fp:
    line = line.strip()
    if line and not line.startswith('#'):
        pass  # process the line here
fp.close()
If you want, you can wrap this in a function.
Using slightly newer idioms (or, with Python 2.5, from __future__ import with_statement) you could do this, which has the advantage of cleaning up safely yet is quite concise.
with file('file.txt') as fp:
    for line in fp:
        line = line.strip()
        if not line or line[0] == '#':
            continue
        # rest of processing here
Note that stripping the line first means the check for "#" will actually reject lines with that as the first non-blank, not merely "as first character". Easy enough to modify if you're strict about that.
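If you are strict about it, one small variation (my sketch) is to test the raw line before stripping:

with open('file.txt') as fp:
    for line in fp:
        if line.startswith('#') or not line.strip():
            continue
        line = line.strip()
        # rest of processing here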