I have a relatively large text file (around 7 million lines) and I want to run a specific piece of logic over it, which I'll try to explain below:
A1KEY1
A2KEY1
B1KEY2
C1KEY3
D1KEY3
E1KEY4
I want to count the frequency of appearance of the keys, and then output those with a frequency of 1 into one text file, those with a frequency of 2 into another, and those with a frequency higher than 2 into a third.
This is the code I have so far, but it iterates over the dictionary painfully slowly, and it gets slower the further it progresses.
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content
dict_in = dict()
seen = []
fileinlist = filetoliststrip(file_in)
out_file = open(file_ot, 'w')
out_file2 = open(file_ot2, 'w')
out_file3 = open(file_ot3, 'w')
counter = 0
for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)
for j in dict_in.keys():
    print("Processing key: " + str(j))
    #print(dict_in[j])
    if dict_in[j][0] < 2:
        out_file.write(str(dict_in[j][1]))
    elif dict_in[j][0] == 2:
        for line_in in dict_in[j][1:]:
            out_file2.write(str(line_in) + "\n")
    elif dict_in[j][0] > 2:
        for line_in in dict_in[j][1:]:
            out_file3.write(str(line_in) + "\n")
out_file.close()
out_file2.close()
out_file3.close()
I'm running this on a Windows PC (i7, 8GB RAM); it shouldn't take hours to run. Is the problem the way I read the file into a list? Should I use a different method? Thanks in advance.
You have multiple points that slow down your code: there is no need to load the whole file into memory only to iterate over it again; there is no need to get a list of keys each time you want to do a lookup (if key not in dict_in: ... will suffice and will be blazingly fast); and you don't need to keep a line count, since you can check the number of collected lines afterwards anyway... to name but a few.
I'd completely restructure your code as:
import collections

dict_in = collections.defaultdict(list)  # save some time with a dictionary factory
with open(file_in, "r") as f:  # open file_in for reading
    for line in f:  # read the file line by line
        key = line.strip()[10:69]  # assuming this is how you get your key
        dict_in[key].append(line)  # add the line as an element of the found key
# now that we have the lines in their own key brackets, lets write them based on frequency
with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    selector = {1: f1, 2: f2}  # make our life easier with a quick length-based lookup
    for values in dict_in.values():  # use dict_in.itervalues() on Python 2.x
        selector.get(len(values), f3).writelines(values)  # write the collected lines
And you'll hardly get more efficient than that, at least in Python.
Keep in mind that this will not guarantee the order of lines in the output prior to Python 3.7 (or CPython 3.6). The order within a key itself will be preserved, however. If you need to keep the line order prior to the aforementioned Python versions, you'll have to keep a separate key-order list and iterate over it to pick up the dict_in values in order.
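For older versions, a minimal sketch of that key-order approach (reusing the names and the collections import from the snippet above) might look like this:
import collections

key_order = []                                  # keys in first-seen order
dict_in = collections.defaultdict(list)
with open(file_in, "r") as f:
    for line in f:
        key = line.strip()[10:69]
        if key not in dict_in:                  # membership test does not create an entry
            key_order.append(key)
        dict_in[key].append(line)

with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    selector = {1: f1, 2: f2}
    for key in key_order:                       # iterate in original key order
        values = dict_in[key]
        selector.get(len(values), f3).writelines(values)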
The first function:
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content
Here a list of raw lines is produced only to be stripped. That will require roughly twice as much memory as necessary, and just as importantly, several passes over data that doesn't fit in cache. We also don't need to make str of things repeatedly. So we can simplify it a bit:
def filetoliststrip(filename):
    return [line.strip() for line in open(filename, 'r')]
This still produces a list. If we're reading through the data only once, not storing each line, replace [] with () to turn it into a generator expression; in this case, since lines are actually held intact in memory until the end of the program, we'd only save the space for the list (which is still at least 30MB in your case).
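As an aside, a minimal sketch of that generator-expression variant (only useful when the result is consumed in a single pass; the function name is made up):
def filetoiterstrip(filename):
    # hypothetical lazy variant: yields stripped lines one at a time
    # instead of building the whole list in memory
    return (line.strip() for line in open(filename, 'r'))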
Then we have the main parsing loop (I adjusted the indentation as I thought it should be):
counter = 0
for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)
There are several suboptimal things here.
First, the counter could be an enumerate (when you don't have an iterable, there's range or itertools.count). Changing this will help with clarity and reduce the risk of mistakes.
for counter, line in enumerate(fileinlist, 1):
Second, it's more efficient to form a string in one operation than to assemble it from pieces:
print("Loading line {} : {}".format(counter, line))
Third, there's no need to extract the keys for a dictionary member check. In Python 2 that builds a new list, which means copying all the references held in the keys, and gets slower with every iteration. In Python 3, it still means building a key view object needlessly. Just use keyf not in dict_in if the check is needed.
Fourth, the check really isn't needed. Catching the exception when a lookup fails is pretty much as fast as the if check, and repeating the lookup after the if check is almost certainly slower. For that matter, stop repeating lookups in general:
try:
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
except KeyError:
    dict_in[keyf] = [1, line]
This is such a common pattern, however, that the standard library has two implementations of it: Counter and defaultdict. Either could work here, but Counter is only practical when all you want is the count; since we also keep the lines, defaultdict fits better.
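For contrast, a small sketch of the Counter route mentioned above, which is enough only if you want the counts and not the lines themselves (it reuses fileinlist and the same [10:69] slice):
from collections import Counter

key_counts = Counter(line[10:69] for line in fileinlist)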
from collections import defaultdict

def newentry():
    return [0]

dict_in = defaultdict(newentry)
for counter, line in enumerate(fileinlist, 1):
    keyf = line[10:69]
    print("Loading line {} : {}".format(counter, line))
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
Using defaultdict lets us not worry about whether the entries existed or not.
We now arrive at the output phase. Again we have needless lookups, so let's reduce them to one iteration:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        out_file.write(lines[0])
    elif count == 2:
        for line_in in lines:
            out_file2.write(line_in + "\n")
    elif count > 2:
        for line_in in lines:
            out_file3.write(line_in + "\n")
That still has a few annoyances. We've repeated the writing code, it builds other strings (tagging on "\n"), and it has a whole chunk of similar code for each case. In fact, the repetition probably caused a bug: there's no newline separator for the single occurrences in out_file. Let's factor out what really differs:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        key_outf = out_file
    elif count == 2:
        key_outf = out_file2
    else:  # elif count > 2:  # test not needed
        key_outf = out_file3
    key_outf.writelines(line_in + "\n" for line_in in lines)
I've left the newline concatenation because it's more complex to mix them in as separate calls. The string is short-lived and it serves a purpose to have the newline in the same place: it makes it less likely at OS level that a line is broken up by concurrent writes.
You'll have noticed there are Python 2 and 3 differences here. Most likely your code wasn't all that slow if run in Python 3 in the first place. There exists a compatibility module called six to write code that more easily runs in either; it lets you use e.g. six.viewkeys and six.iteritems to avoid this gotcha.
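For illustration, a tiny sketch of the two six helpers mentioned above; they dispatch to the efficient variant on either Python version:
import six

d = {'a': 1, 'b': 2}
for key in six.viewkeys(d):          # dict.viewkeys() on Py2, dict.keys() on Py3
    print(key)
for key, value in six.iteritems(d):  # dict.iteritems() on Py2, dict.items() on Py3
    print(key, value)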
You load a very large file into memory at once. When you don't actually need to keep all the lines and only need to process them, use a generator; it is more memory-efficient.
Counter is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. You can use that to count the frequency of the keys. Then simply iterate over the new dict and append the key to the relevant file:
from collections import Counter

keys = ['A1KEY1', 'A2KEY1', 'B1KEY2', 'C1KEY3', 'D1KEY3', 'E1KEY4']
count = Counter(keys)

with open('single.txt', 'w') as f1:
    with open('double.txt', 'w') as f2:
        with open('more_than_double.txt', 'w') as f3:
            for k, v in count.items():
                if v == 1:
                    f1.write(k + '\n')
                elif v == 2:
                    f2.write(k + '\n')
                else:
                    f3.write(k + '\n')
I have a Python script which I'm trying to use to print duplicate numbers in the Duplicate.txt file:
newList = set()
datafile = open("Duplicate.txt", "r")
for i in datafile:
    if datafile.count(i) >= 2:
        newList.add(i)
datafile.close()
print(list(newList))
I'm getting the following error, could anyone help please?
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
The problem is exactly what it says: file objects don't know how to count anything. They're just iterators, not lists or strings or anything like that.
And part of the reason for that is that it would potentially be very slow to scan the whole file over and over like that.
If you really need to use count, you can put the lines into a list first. Lists are entirely in-memory, so it's not nearly as slow to scan them over and over, and they have a count method that does exactly what you're trying to do with it:
datafile = open("Duplicate.txt", "r")
lines = list(datafile)
for i in lines:
    if lines.count(i) >= 2:
        newList.add(i)
datafile.close()
However, there's a much better solution: Just keep counts as you go along, and then keep the ones that are >= 2. In fact, you can write that in two lines:
import collections

counts = collections.Counter(datafile)
newList = {line for line, count in counts.items() if count >= 2}
But if it isn't clear to you why that works, you may want to do it more explicitly:
counts = collections.Counter()
for i in datafile:
    counts[i] += 1

newList = set()
for line, count in counts.items():
    if count >= 2:
        newList.add(line)
Or, if you don't even understand the basics of Counter:
counts = {}
for i in datafile:
    if i not in counts:
        counts[i] = 1
    else:
        counts[i] += 1
The error in your code comes from applying count to a file handle, not to a list.
Anyway, you don't need to count the elements; you just need to check whether each element has already been seen in the file.
I'd suggest a marker set to note down which elements have already occurred.
seen = set()
result = set()

with open("Duplicate.txt", "r") as datafile:
    for i in datafile:
        # you may turn i into a number here with: i = int(i)
        if i in seen:
            result.add(i)  # data is already in seen: duplicate
        else:
            seen.add(i)  # next time it occurs, we'll detect it

print(list(result))  # convert to list (maybe not needed, a set is OK to print)
Your immediate error occurs because you call datafile.count(i), and datafile is a file object, which doesn't know how to count its contents.
Your question is not about how to solve the larger problem, but since I'm here:
Assuming Duplicate.txt contains numbers, one per line, I would probably read each line's contents into a list and then use a Counter to count the list's contents.
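A minimal sketch of that approach (stripping the newlines is an assumption about the file format):
from collections import Counter

with open("Duplicate.txt", "r") as datafile:
    lines = [line.strip() for line in datafile]   # read every line into a list

counts = Counter(lines)                            # count each distinct line
duplicates = [item for item, count in counts.items() if count >= 2]
print(duplicates)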
You are looking for the list.count() method, but you've mistakenly called it on a file object. Instead, let's read the file, split its contents into a list, and then obtain the count of each item using list.count().
# read the data from the file
with open("Duplicate.txt", "r") as datafile:
    datafile_data = datafile.read()

# split the file contents by whitespace and convert to a list
datafile_data = datafile_data.split()

# build a dictionary mapping words to their counts
word_to_count = {}
unique_data = set(datafile_data)
for data in unique_data:
    word_to_count[data] = datafile_data.count(data)

# populate our list of duplicates
all_duplicates = []
for x in word_to_count:
    if word_to_count[x] >= 2:
        all_duplicates.append(x)
I have a textfile with the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
... with 1 million items
But some of the word_forms contain an apostrophe ('), others do not, so I would like to count them as instances of the same word, that's to say I would like to merge lines like these two:
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (frequencies added):
cupboard cup blabla2 22
I am looking for a solution in Python 2.7 to do that. My first idea was to read the text file and store the words with an apostrophe and the words without in two different dictionaries, then go over the dictionary of words with an apostrophe, test whether each word is already in the dictionary without apostrophes, and if it is, update the frequency; if not, simply add the line with the apostrophe removed. Here is my code:
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""

    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])


def Reader(filename):
    """Keeps the lines of a file in memory for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line


def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe'''
    '''Works in a reasonable time'''
    '''This step can be done writing line by line, avoiding all storage in memory'''
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'", "")
                    items[2] = items[2].replace("\+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0]: Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0]: Lemma(items)})
    return word_dict, word_dict_striped


def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merges them, adding their frequencies when there is a common key'''
    '''Does not run in reasonable time on the whole list'''
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                word_dict.update({word: word_dict_striped[word]})
    print "Number of words: ",
    print(len(word_dict))
    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict
This solution works in a reasonable time up to the storage of the two dictionaries, whether I write to two text files line by line to avoid any in-memory storage or keep them as dict objects in the program. But the merging of the two dictionaries never finishes!
The dictionaries' update method would work, but it overrides one frequency count instead of adding the two. I saw some solutions for merging dictionaries with addition using Counter:
Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
but they seem to work only when the dictionaries are of the form (word, count) whereas I want to carry the other fields in the dictionary as well.
I am open to all your ideas or reframings of the problem, since my goal is to run this program only once to obtain the merged list in a text file. Thank you in advance!
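To make the update-versus-addition point above concrete, here is a tiny illustration using the cupboard frequencies from the example (10 and 12):
from collections import Counter

plain = {'cupboard': 10}
striped = {'cupboard': 12}

merged = dict(plain)
merged.update(striped)                        # update() overrides: {'cupboard': 12}

summed = Counter(plain) + Counter(striped)    # addition: Counter({'cupboard': 22})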
Here's something that does more or less what you want. Just change the file names at the top. It doesn't modify the original file.
input_file_name = "input.txt"
output_file_name = "output.txt"

def custom_comp(s1, s2):
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")
    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    return line.split()[0].translate(None, "'")

def get_num(line):
    return int(line.split()[-1])

print "Reading file and sorting..."
lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)
print "File read and sorted"

combined_lines = []
print "Combining entries..."
i = 0
while i < len(lines) - 1:
    if get_word(lines[i]) == get_word(lines[i+1]):
        total = get_num(lines[i]) + get_num(lines[i+1])
        new_parts = lines[i+1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1
if i == len(lines) - 1:
    combined_lines.append(lines[i].strip())  # keep a trailing line that had no pair
print "Entries combined"

print "Writing to file..."
with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")
print "Finished"
It sorts the words and messes up the spacing a bit. If that's important, let me know and it can be adjusted.
Another thing is that it sorts the whole thing. For only a million lines, that probably won't take overly long, but again, let me know if that's an issue.
highest_score = 0
g = open("grades_single.txt","r")
arrayList = []
for line in highest_score:
    if float(highest_score) > highest_score:
        arrayList.extend(line.split())
g.close()
print(highest_score)
Hello, I wondered if anyone could help me; I'm having problems here. I have to read in a file which contains 3 lines. The first line is of no use and nor is the third. The second contains a list of letters, which I have to pull out (for instance all the As, all the Bs, all the Cs, all the way up to G); there are multiple of each letter. I have to be able to count how many of each there are with this program. I'm very new to this, so please bear with me if the code I've created is wrong. I just wondered if anyone could point me in the right direction on how to pull out these letters on the second line and count them. I then have to do a mathematical function with these letters, but I hope to work that out for myself.
Sample of the data:
GTSDF60000
ADCBCBBCADEBCCBADGAACDCCBEDCBACCFEABBCBBBCCEAABCBB
*
You do not read the contents of the file. To do so, use the .read() or .readlines() method on your opened file. .readlines() reads each line in a file separately, like so:
g = open("grades_single.txt","r")
filecontent = g.readlines()
since it is good practice to directly close your file after opening it and reading its contents, directly follow with:
g.close()
another option would be:
with open("grades_single.txt","r") as g:
content = g.readlines()
the with-statement closes the file for you (so you don't need to use the .close()-method this way.
Since you need the contents of the second line only you can choose that one directly:
content = g.readlines()[1]
.readlines() doesn't strip a line of its newline (which usually is \n), so you still have to do that yourself:
content = g.readlines()[1].strip('\n')
The .count()-method lets you count items in a list or in a string. So you could do:
dct = {}
for item in content:
    dct[item] = content.count(item)
this can be made more efficient by using a dictionary-comprehension:
dct = {item:content.count(item) for item in content}
at last you can get the highest score and print it:
highest_score = max(dct.values())
print(highest_score)
.values() returns the values of a dictionary and max, well, returns the maximum value in a list.
Thus the code that does what you're looking for could be:
with open("grades_single.txt","r") as g:
content = g.readlines()[1].strip('\n')
dct = {item:content.count(item) for item in content}
highest_score = max(dct.values())
print(highest_score)
highest_score = 0
arrayList = []
with open("grades_single.txt") as f:
    arrayList.extend(f.readlines()[1])
print(arrayList)
This will show you the second line of that file. It will extend arrayList then you can do whatever you want with that list.
import re

# opens the file in read mode (and closes it automatically when done)
with open('my_file.txt', 'r') as opened_file:
    # Temporarily stores all lines of the file here.
    all_lines_list = []
    for line in opened_file.readlines():
        all_lines_list.append(line)

# This is the selected pattern.
# It basically means "match a single character from a to g"
# and ignores upper or lower case
pattern = re.compile(r'[a-g]', re.IGNORECASE)

# Which line i want to choose (assuming you only need one line chosen)
line_num_i_need = 2

# (1 is deducted since the first element in python has index 0)
matches = re.findall(pattern, all_lines_list[line_num_i_need-1])

print('\nMatches found:')
print(matches)
print('\nTotal matches:')
print(len(matches))
You might want to check regular expressions in case you need some more complex pattern.
To count the occurrences of each letter I used a dictionary instead of a list. With a dictionary, you can access each letter count later on.
d = {}
g = open("grades_single.txt", "r")
for i,line in enumerate(g):
    if i == 1:
        holder = list(line.strip())
g.close()

for letter in holder:
    d[letter] = holder.count(letter)

for key,value in d.iteritems():
    print("{},{}".format(key,value))
Outputs
A,9
C,15
B,15
E,4
D,5
G,1
F,1
One can treat the first line specially (and in this case ignore it) with next inside try: except StopIteration:. In this case, where you only want the second line, follow with another next instead of a for loop.
with open("grades_single.txt") as f:
try:
next(f) # discard 1st line
line = next(f)
except StopIteration:
raise ValueError('file does not even have two lines')
# now use line
I have a file, dataset.nt, which isn't too large (300Mb). I also have a list, which contains around 500 elements. For each element of the list, I want to count the number of lines in the file which contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).
This is the first thing I tried to achieve that result:
import re

mydict = {}
for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total
It didn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:
mydict = {}
with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
Performance didn't improve; the script still runs indefinitely. So I googled around, and I tried this:
mydict = {}
file = open("dataset.nt", "rb")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
That one has been running for the last 30 minutes, so I'm assuming it's not any better.
How should I structure this code so that it completes in a reasonable amount of time?
I'd favor a slight modification of your second version:
mydict = dict.fromkeys(mylist, 0)
re_list = [re.compile(r"/Main/"+re.escape(i)) for i in mylist]
with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if not '/Main/' in line:
            continue
        # do the regex-part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1
As #matsjoyce already suggested, this avoids re-compiling the regex on each iteration.
If you really need that many different regex patterns, I don't think there's much more you can do.
Maybe it's worth checking if you can regex-capture whatever comes after "/Main/" and then compare it to your list. That may help reduce the number of "real" regex searches.
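A rough sketch of that capture idea, assuming (hypothetically) that the interesting element runs from "/Main/" up to the next whitespace, so an exact set-membership test can replace most of the per-element regex searches:
import re

mylist = ["elem1", "elem2"]            # hypothetical stand-in for the ~500-element list
targets = set(mylist)                  # O(1) membership test
capture = re.compile(r"/Main/(\S+)")   # grab whatever follows "/Main/"

counts = dict.fromkeys(mylist, 0)
with open("dataset.nt") as input:
    for line in input:
        m = capture.search(line)
        if m and m.group(1) in targets:
            counts[m.group(1)] += 1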
Looks like a good candidate for some map/reduce like parallelisation... You could split your dataset file in N chunks (where N = how many processors you have), launch N subprocesses each scanning one chunk, then sum the results.
This of course doesn't prevent you from first optimizing the scan, ie (based on sebastian's code):
targets = [(i, re.compile(r"/Main/"+re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)
with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1
Note that this could be better optimized if you posted a sample from your dataset. If, for example, your dataset can be sorted on "/Main/{i}" (using the system sort program, for example), you wouldn't have to check each line for each value of i. Or if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).
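As a rough illustration of the map/reduce idea above, a multiprocessing sketch (chunking by a fixed number of lines rather than by file offsets, which is an assumption, with a made-up target list) could look like this:
import re
from collections import Counter
from itertools import islice
from multiprocessing import Pool

mylist = ["elem1", "elem2"]   # hypothetical stand-in for the real list
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]

def scan_chunk(lines):
    """Count matches for each target within one chunk of lines."""
    counts = Counter()
    for line in lines:
        if '/Main/' not in line:
            continue
        for i, regex in targets:
            if regex.search(line):
                counts[i] += 1
    return counts

def chunks(f, size=100000):
    """Yield successive lists of `size` lines from an open file."""
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == "__main__":
    results = Counter()
    with open("dataset.nt") as input, Pool() as pool:
        for partial in pool.imap_unordered(scan_chunk, chunks(input)):
            results.update(partial)   # sum the per-chunk counts
    print(dict(results))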
The other solutions are very good. But, since there is a regex for each element and it doesn't matter whether the element appears more than once per line, you could count the lines containing the target expression using re.findall.
Also, after a certain number of lines it is better to read the whole file into memory (if you have enough memory and it isn't a design restriction).
import re

mydict = {}
mylist = [...]  # A list with 500 items

# Optimizing calls
findall = re.findall  # so Python doesn't have to resolve these attributes on every call
escape = re.escape

with open("dataset.nt", "rb") as input:
    text = input.read()  # Read the file once and keep it in memory instead of reading line by line. If the number of lines is big this is faster.
    for elem in mylist:
        mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))  # Count the lines where the target regex matches.
I tested this with an 800MB file (I wanted to see how long it takes to load a file that big into memory; it's faster than you would think). I didn't test the whole code with real data, just the findall part.
I have a file of about 100 million lines in which I want to replace text with alternate text stored in a tab-delimited file. The code that I have works, but it is taking about an hour to process the first 70K lines. In trying to incrementally advance my Python skills, I am wondering whether there is a faster way to do this. Thanks!
The input file looks something like this:
CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518
CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518
and the file with replacement values looks like this:
WBGene00045518 21ur-5153
Here is my code:
infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime
startTime = datetime.now()

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print linecounter
        print (datetime.now()-startTime)

infile1.close()
infile2.close()
outfile.close()
You should split your lines into "words" and only look up these words in your dictionary:
>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']
This will eliminate the loop over the dictionary you do for every single line.
Here's the complete code:
import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        outfile.write(line)
Edit: An alternative approach is to build a single mega-regex from your dictionary:
with open("f1.txt", "r") as infile1:
udict = dict(line.strip().split("\t", 1) for line in infile1)
regex = re.compile("|".join(map(re.escape, udict)))
with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
for line in infile2:
outfile.write(regex.sub(lambda m: udict[m.group()], line))
I was thinking about your loop over the dictionary keys and a way to optimize it, and meant to make other comments on your code later.
But then I stumbled on this part:
if linecounter in mult10K:
    print linecounter
    print (datetime.now()-startTime)
This innocent-looking snippet actually has Python sequentially scan and compare up to 100 items of your mult10K list for each line in your file.
Replace this part with:
if linecounter % 10000 == 0:
    print linecounter
    print (datetime.now()-startTime)
(And forget all the mult10k part) - and you should get a significant speed up.
Also, it seems like you are writing multiple output lines for each input line; your main loop is like this:
linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
Replace it with this:
for linecounter, line in enumerate(infile2):
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print key, value
            line = line.replace(key, value)
    outfile.write(line + '\n')
This properly writes only one output line for each input line (besides eliminating the code duplication and taking care of the line counting in a "pythonic" way).
This code is full of linear searches. It's no wonder it's running slowly. Without knowing more about the input, I can't give you advice on how to fix these problems, but I can at least point out the problems. I'll note major issues, and a couple of minor ones.
udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)
Don't use update here; just add the item to the dictionary:
udict[linelist[0]] = linelist[1]
This will be faster than creating a dictionary for every entry. (And actually, Sven Marnach's generator-based approach to creating this dictionary is better still.) This is fairly minor though.
mult10K = []
for x in range(100):
    mult10K.append(x * 10000)
This is totally unnecessary. Remove this; I'll show you one way to print at intervals without this.
linecounter = 0
for line in infile2:
    for key, value in udict.items():
This is your first big problem. You're doing a linear search through the dictionary for keys in the line, for each line. If the dictionary is very large, this will require a huge number of operations: 100,000,000 * len(udict).
matches = line.count(key)
This is another problem. You're looking for matches using a linear search. Then you do replace, which does the same linear search! You don't need to check for a match; replace just returns the same string if there isn't one. This won't make a huge difference either, but it will gain you something.
line = line.replace(key, value)
Keep doing these replaces, and then only write the line once all replacements are done:
outfile.write(line + '\n')
And finally,
linecounter += 1
if linecounter in mult10K:
Forgive me, but this is a ridiculous way to do this! You're doing a linear search through linecounter to determine when to print a line. Here again, this adds a total of almost 100,000,000 * 100 operations. You should at least search in a set; but the best approach (if you really must do this) would be to do a modulo operation and test that.
if not linecounter % 10000:
    print linecounter
    print (datetime.now()-startTime)
To make this code efficient, you need to get rid of these linear searches. Sven Marnach's answer suggests one way that might work, but I think it depends on the data in your file, since the replacement keys might not correspond to obvious word boundaries. (The regex approach he added addresses that, though.)
This is not Python-specific, but you might unroll your double for loop a bit so that the file writes do not occur on every iteration of the loop. Perhaps write to the file every 1,000 or 10,000 lines.
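A minimal sketch of that buffering idea (file names and the replacement step itself are placeholders carried over from the question):
buffered = []
with open('f2.txt', 'r') as infile2, open('out.txt', 'w') as outfile:
    for line in infile2:
        # ... do the replacements on `line` here ...
        buffered.append(line)
        if len(buffered) >= 10000:        # flush every 10,000 lines
            outfile.writelines(buffered)
            buffered = []
    if buffered:                          # write whatever is left over
        outfile.writelines(buffered)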
I'm hoping that writing a line of output for each line of input times the number of replacement strings is a bug, and you really only intended to write one output for each input.
You need to find a way to test the lines of input for matches as quickly as possible. Looping through the entire dictionary is probably your bottleneck.
I believe regular expressions are precompiled into state machines that can be highly efficient. I have no idea how the performance suffers when you generate a huge expression, but it's worth a try.
freakin_huge_re = re.compile('(' + ')|('.join(udict.keys()) + ')')
for line in infile2:
    matches = [''.join(tup) for tup in freakin_huge_re.findall(line)]
    if matches:
        for key in matches:
            line = line.replace(key, udict[key])
The obvious one in Python is the list comprehension - it's a faster (and more readable) way of doing this:
mult10K = []
for x in range(100):
    mult10K.append(x * 10000)
as this:
mult10K = [x*10000 for x in range(100)]
Likewise, where you have:
udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]}
    udict.update(udict1)
We can use a dict comprehension (with a generator expression):
lines = (line.strip().split('\t') for line in infile1)
udict = {line[0]: line[1] for line in lines}
It's also worth noting here that you appear to be working with a tab delimited file. In which case, the csv module might be a much better option than using split().
Also note that using the with statement will increase readability and make sure your files get closed (even on exceptions).
Print statements will also slow things down quite a lot if they are being performed on every loop - they are useful for debugging, but when running on your main chunk of data, it's probably worth removing them.
Another 'more pythonic' thing you can do is use enumerate() instead of adding one to a variable each time. E.g:
linecounter = 0
for line in infile2:
    ...
    linecounter += 1
Can be replaced with:
for linecounter, line in enumerate(infile2):
    ...
Where you are counting occurrences of a key, the better solution is to use in:
if key in line:
As this short-circuits after finding an instance.
Adding all this up, let's see what we have:
import csv
from datetime import datetime
startTime = datetime.now()

with open('f1.txt', 'r') as infile1:
    reader = csv.reader(infile1, delimiter='\t')
    udict = dict(reader)

with open('f2.txt', 'r') as infile2, open('out.txt', 'w') as outfile:
    for line in infile2:
        for key, value in udict.items():
            if key in line:
                line = line.replace(key, value)
        outfile.write(line)
Edit: List comp vs normal loop, as requested in the comments:
python -m timeit "[i*10000 for i in range(10000)]"
1000 loops, best of 3: 909 usec per loop
python -m timeit "a = []" "for i in range(10000):" " a.append(i)"
1000 loops, best of 3: 1.01 msec per loop
Note usec vs msec. It's not massive, but it's something.