I have a Python script which I'm trying to use to print duplicate numbers in the Duplicate.txt file:
newList = set()
datafile = open("Duplicate.txt", "r")
for i in datafile:
    if datafile.count(i) >= 2:
        newList.add(i)
datafile.close()
print(list(newList))
I'm getting the following error; could anyone help, please?
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
The problem is exactly what it says: file objects don't know how to count anything. They're just iterators, not lists or strings or anything like that.
And part of the reason for that is that it would potentially be very slow to scan the whole file over and over like that.
If you really need to use count, you can put the lines into a list first. Lists are entirely in-memory, so it's not nearly as slow to scan them over and over, and they have a count method that does exactly what you're trying to do with it:
datafile = open("Duplicate.txt", "r")
lines = list(datafile)
for i in lines:
    if lines.count(i) >= 2:
        newList.add(i)
datafile.close()
However, there's a much better solution: just keep counts as you go along, and then keep the ones that are >= 2. In fact, you can write that in a couple of lines:
import collections

counts = collections.Counter(datafile)
newList = {line for line, count in counts.items() if count >= 2}
But if it isn't clear to you why that works, you may want to do it more explicitly:
counts = collections.Counter()
for i in datafile:
    counts[i] += 1
newList = set()
for line, count in counts.items():
    if count >= 2:
        newList.add(line)
Or, if you don't even understand the basics of Counter:
counts = {}
for i in datafile:
    if i not in counts:
        counts[i] = 1
    else:
        counts[i] += 1
# then build newList from the counts >= 2, exactly as in the loop above
The error in your code comes from calling count on a file handle rather than on a list.
Anyway, you don't need to count the elements; you just need to see whether an element has already been seen in the file.
I'd suggest a marker set to note down which elements have already occurred.
seen = set()
result = set()

with open("Duplicate.txt", "r") as datafile:
    for i in datafile:
        # you may turn i to a number here with: i = int(i)
        if i in seen:
            result.add(i)   # data is already in seen: duplicate
        else:
            seen.add(i)     # next time it occurs, we'll detect it

print(list(result))  # convert to list (maybe not needed, set is ok to print)
Your immediate error is because you're asking if datafile.count(i) and datafile is a file, which doesn't know how to count its contents.
Your question is not about how to solve the larger problem, but since I'm here:
Assuming Duplicate.txt contains numbers, one per line, I would probably read each line's contents into a list and then use a Counter to count the list's contents.
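A minimal sketch of that approach might look like this (assuming one number per line, and stripping newlines so identical numbers compare equal):
from collections import Counter

with open("Duplicate.txt", "r") as datafile:
    counts = Counter(line.strip() for line in datafile)

duplicates = [num for num, count in counts.items() if count >= 2]
print(duplicates)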
You are looking to use the list.count() method, but you've mistakenly called it on a file object. Instead, let's read the file, split its contents into a list, and then obtain the count of each item with list.count().
# read the data from the file
with open("Duplicate.txt", "r") as datafile:
    datafile_data = datafile.read()

# split the file contents by whitespace and convert to a list
datafile_data = datafile_data.split()

# build a dictionary mapping words to their counts
word_to_count = {}
unique_data = set(datafile_data)
for data in unique_data:
    word_to_count[data] = datafile_data.count(data)

# populate our list of duplicates (anything that appears at least twice)
all_duplicates = []
for x in word_to_count:
    if word_to_count[x] >= 2:
        all_duplicates.append(x)
I have a relatively large text file (around 7M lines) and I want to run some specific logic over it, which I'll try to explain below:
A1KEY1
A2KEY1
B1KEY2
C1KEY3
D1KEY3
E1KEY4
I want to count the frequency of appearance of the keys, and then output those with a frequency of 1 into one text file, those with a frequency of 2 into another, and those with a frequency higher than 2 into another.
This is the code I have so far, but it iterates over the dictionary painfully slowly, and it gets slower the more it progresses.
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content

dict_in = dict()
seen = []
fileinlist = filetoliststrip(file_in)
out_file = open(file_ot, 'w')
out_file2 = open(file_ot2, 'w')
out_file3 = open(file_ot3, 'w')
counter = 0

for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)

for j in dict_in.keys():
    print("Processing key: " + str(j))
    #print(dict_in[j])
    if dict_in[j][0] < 2:
        out_file.write(str(dict_in[j][1]))
    elif dict_in[j][0] == 2:
        for line_in in dict_in[j][1:]:
            out_file2.write(str(line_in) + "\n")
    elif dict_in[j][0] > 2:
        for line_in in dict_in[j][1:]:
            out_file3.write(str(line_in) + "\n")

out_file.close()
out_file2.close()
out_file3.close()
I'm running this on a Windows PC (i7, 8GB RAM); it shouldn't be taking hours to perform. Is the problem the way I read the file into a list? Should I use a different method? Thanks in advance.
You have multiple things slowing down your code: there is no need to load the whole file into memory only to iterate over it again; there is no need to get a list of keys each time you want to do a lookup (if key not in dict_in: ... will suffice and will be blazingly fast); and you don't need to keep a line counter since you can check the length of each entry afterwards anyway... to name but a few.
I'd completely restructure your code as:
import collections

dict_in = collections.defaultdict(list)  # save some time with a dictionary factory

with open(file_in, "r") as f:  # open file_in for reading
    for line in f:  # read the file line by line
        key = line.strip()[10:69]  # assuming this is how you get your key
        dict_in[key].append(line)  # add the line as an element of the found key

# now that we have the lines in their own key buckets, let's write them based on frequency
with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    selector = {1: f1, 2: f2}  # make our life easier with a quick length-based lookup
    for values in dict_in.values():  # use dict_in.itervalues() on Python 2.x
        selector.get(len(values), f3).writelines(values)  # write the collected lines
And you'll hardly get more efficient than that, at least in Python.
Keep in mind that this will not guarantee the order of lines in the output prior to Python 3.7 (or CPython 3.6). The order within a key itself will be preserved, however. If you need to keep the line order prior to the aforementioned Python versions, you'll have to keep a separate key order list and iterate over it to pick up the dict_in values in order.
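A rough sketch of that key-order idea, reusing the same file_in and key slice as above:
key_order = []                     # keys in first-seen order
dict_in = {}

with open(file_in, "r") as f:
    for line in f:
        key = line.strip()[10:69]  # assuming this is how you get your key
        if key not in dict_in:
            dict_in[key] = []
            key_order.append(key)
        dict_in[key].append(line)

# later, iterate over key_order instead of dict_in.values() so the output
# follows the order in which keys first appeared in the input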
The first function:
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content
Here a list of raw lines is produced only to be stripped. That will require roughly twice as much memory as necessary, and just as importantly, several passes over data that doesn't fit in cache. We also don't need to make str of things repeatedly. So we can simplify it a bit:
def filetoliststrip(filename):
    return [line.strip() for line in open(filename, 'r')]
This still produces a list. If we're reading through the data only once, not storing each line, replace [] with () to turn it into a generator expression; in this case, since lines are actually held intact in memory until the end of the program, we'd only save the space for the list (which is still at least 30MB in your case).
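For reference, a sketch of the generator-expression variant (the name filetoiterstrip is just illustrative, and this only helps if each line is processed once and not stored):
def filetoiterstrip(filename):
    # yields stripped lines lazily instead of building the whole list up front
    return (line.strip() for line in open(filename, 'r'))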
Then we have the main parsing loop (I adjusted the indentation as I thought it should be):
counter = 0
for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)
There are several suboptimal things here.
First, the counter could be an enumerate (when you don't have an iterable, there's range or itertools.count). Changing this will help with clarity and reduce the risk of mistakes.
for counter, line in enumerate(fileinlist, 1):
Second, it's more efficient to form a string in one operation than add it from bits:
print("Loading line {} : {}".format(counter, line))
Third, there's no need to extract the keys for a dictionary member check. In Python 2 that builds a new list, which means copying all the references held in the keys, and gets slower with every iteration. In Python 3, it still means building a key view object needlessly. Just use keyf not in dict_in if the check is needed.
Fourth, the check really isn't needed. Catching the exception when a lookup fails is pretty much as fast as the if check, and repeating the lookup after the if check is almost certainly slower. For that matter, stop repeating lookups in general:
try:
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
except KeyError:
    dict_in[keyf] = [1, line]
This is such a common pattern, however, that we have two standard library implementations of it: Counter and defaultdict. We could use either here, but a Counter is only really practical when the count is all you want; since we also keep the lines, defaultdict is the better fit.
from collections import defaultdict

def newentry():
    return [0]

dict_in = defaultdict(newentry)

for counter, line in enumerate(fileinlist, 1):
    keyf = line[10:69]
    print("Loading line {} : {}".format(counter, line))
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
Using defaultdict lets us not worry about whether an entry already existed or not.
We now arrive at the output phase. Again we have needless lookups, so let's reduce them to one iteration:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        out_file.write(lines[0])
    elif count == 2:
        for line_in in lines:
            out_file2.write(line_in + "\n")
    elif count > 2:
        for line_in in lines:
            out_file3.write(line_in + "\n")
That still has a few annoyances. We've repeated the writing code, it builds other strings (tagging on "\n"), and it has a whole chunk of similar code for each case. In fact, the repetition probably caused a bug: there's no newline separator for the single occurrences in out_file. Let's factor out what really differs:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        key_outf = out_file
    elif count == 2:
        key_outf = out_file2
    else:  # count > 2, no test needed
        key_outf = out_file3
    key_outf.writelines(line_in + "\n" for line_in in lines)
I've left the newline concatenation because it's more complex to mix them in as separate calls. The string is short-lived and it serves a purpose to have the newline in the same place: it makes it less likely at OS level that a line is broken up by concurrent writes.
You'll have noticed there are Python 2 and 3 differences here. Most likely your code wasn't all that slow if run in Python 3 in the first place. There exists a compatibility module called six to write code that more easily runs in either; it lets you use e.g. six.viewkeys and six.iteritems to avoid this gotcha.
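For instance, a minimal sketch of those six helpers (assuming six is installed):
import six

d = {'a': 1, 'b': 2}
for key, value in six.iteritems(d):  # dict.iteritems() on Python 2, dict.items() on Python 3
    print(key, value)
print('a' in six.viewkeys(d))        # a cheap view-based membership test on both versions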
You load a very large file into memory at once. When you don't actually need to keep all the lines and just need to process them, use a generator. It is more memory-efficient.
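For example, a minimal sketch of a line generator (the file name and key slice are illustrative):
def stripped_lines(filename):
    # yields one stripped line at a time instead of building a 7M-element list
    with open(filename, 'r') as f:
        for line in f:
            yield line.strip()

for line in stripped_lines('file_in.txt'):
    key = line[10:69]  # process each line here; only one line is held in memory at a time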
Counter is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. You can use that to count the frequency of the keys. Then simply iterate over the new dict and append the key to the relevant file:
from collections import Counter

keys = ['A1KEY1', 'A2KEY1', 'B1KEY2', 'C1KEY3', 'D1KEY3', 'E1KEY4']
count = Counter(keys)

with open('single.txt', 'w') as f1:
    with open('double.txt', 'w') as f2:
        with open('more_than_double.txt', 'w') as f3:
            for k, v in count.items():
                if v == 1:
                    f1.write(k + '\n')
                elif v == 2:
                    f2.write(k + '\n')
                else:
                    f3.write(k + '\n')
I am trying to copy the contents of a file that has MANY words and move the contents into another file. The original file has 3-letter words that I'd like to sort out. Unfortunately I have been unsuccessful in getting it to happen. I am newer to Python with some Java experience, so I'm trying to keep this pretty basic. Code is as follows:
# Files that we're going to open
filename = 'words.txt'
file_two = 'new_words.txt'

# Variables we're going to use in the program
# Program lists to transfer long words
words = []

# We open the file and store it into our list here
with open(filename, 'r') as file_object:
    for line in file_object:
        words.append(line.rstrip("\n"))

# We transfer the info into the new file
with open(file_two, 'a') as file:
    x = int(0)
    for x in words:
        if len(words[x]) >= 5:
            print(words[x])
            file.write(words[x])
        x += 1
I understand my problem is at the bottom, where I try to write the words into the new file; perhaps a simple explanation might get me there. Many thanks.
The problem is here:
with open(file_two, 'a') as file:
    x = int(0)
    for x in words:
        if len(words[x]) >= 5:
            print(words[x])
            file.write(words[x])
        x += 1
The reason for the error you're getting is that x isn't a number once the loop begins. It is a string.
I think you misunderstand how for loops work in Python. They're more akin to foreach loops in other languages. When you do for x in words, x is given the value of the first element in words, then the second, and so on for each iteration. You, however, are trying to treat it like a traditional for loop, going through the list by index. Of course this doesn't work.
There are two ways to go about fixing your code. You can either take the foreach approach:
with open(file_two, 'w') as file:
    for x in words:  # x is a word
        if len(x) >= 5:
            print(x)
            file.write(x)
Or, use len() to loop through the range of indices of the list. This will yield behavior similar to that of a traditional for loop:
with open(file_two, 'a') as file:
    for x in range(len(words)):  # x is a number
        if len(words[x]) >= 5:
            print(words[x])
            file.write(words[x])
There is also no need to manually increment x, or to give x an initial value, as it is reassigned at the beginning of the for loop.
highest_score = 0
g = open("grades_single.txt", "r")
arrayList = []

for line in highest_score:
    if float(highest_score) > highest_score:
        arrayList.extend(line.split())

g.close()
print(highest_score)
Hello, I wondered if anyone could help me; I'm having problems here. I have to read in a file which contains 3 lines. The first line is no use and nor is the 3rd. The second contains a list of letters, which I have to pull out (for instance all the As, all the Bs, all the Cs, all the way up to G); there are multiple letters of each. I have to be able to count how many of each there are through this program. I'm very new to this, so please bear with me if the code I've created is wrong. I just wondered if anyone could point me in the right direction on how to pull out these letters on the second line and count them. I then have to do a mathematical function with these letters, but I hope to work that out for myself.
Sample of the data:
GTSDF60000
ADCBCBBCADEBCCBADGAACDCCBEDCBACCFEABBCBBBCCEAABCBB
*
You do not read the contents of the file. To do so, use the .read() or .readlines() method on your opened file. .readlines() reads each line in a file separately, like so:
g = open("grades_single.txt","r")
filecontent = g.readlines()
Since it is good practice to close your file directly after opening it and reading its contents, follow with:
g.close()
Another option would be:
with open("grades_single.txt","r") as g:
content = g.readlines()
The with-statement closes the file for you (so you don't need to use the .close() method in that case).
Since you need the contents of the second line only, you can select that one directly:
content = g.readlines()[1]
.readlines() doesn't strip a line of its newline (which usually is \n), so you still have to do that yourself:
content = g.readlines()[1].strip('\n')
The .count()-method lets you count items in a list or in a string. So you could do:
dct = {}
for item in content:
    dct[item] = content.count(item)
This can be written more compactly with a dictionary comprehension:
dct = {item:content.count(item) for item in content}
Finally, you can get the highest score and print it:
highest_score = max(dct.values())
print(highest_score)
.values() returns the values of a dictionary and max, well, returns the maximum value in a list.
Thus the code that does what you're looking for could be:
with open("grades_single.txt","r") as g:
content = g.readlines()[1].strip('\n')
dct = {item:content.count(item) for item in content}
highest_score = max(dct.values())
print(highest_score)
highest_score = 0
arrayList = []

with open("grades_single.txt") as f:
    arrayList.extend(f.readlines()[1])

print(arrayList)
This will show you the second line of that file (as a list of its individual characters). It extends arrayList, and then you can do whatever you want with that list.
import re

# opens the file in read mode (and closes it automatically when done)
with open('my_file.txt', 'r') as opened_file:
    # Temporarily stores all lines of the file here.
    all_lines_list = []
    for line in opened_file.readlines():
        all_lines_list.append(line)

# This is the selected pattern.
# It basically means "match a single character from a to g"
# and ignores upper or lower case
pattern = re.compile(r'[a-g]', re.IGNORECASE)

# Which line I want to choose (assuming you only need one line chosen)
line_num_i_need = 2

# (1 is deducted since the first element in Python has index 0)
matches = re.findall(pattern, all_lines_list[line_num_i_need - 1])

print('\nMatches found:')
print(matches)
print('\nTotal matches:')
print(len(matches))
You might want to check regular expressions in case you need some more complex pattern.
To count the occurrences of each letter I used a dictionary instead of a list. With a dictionary, you can access each letter count later on.
d = {}
g = open("grades_single.txt", "r")

for i, line in enumerate(g):
    if i == 1:
        holder = list(line.strip())
g.close()

for letter in holder:
    d[letter] = holder.count(letter)

for key, value in d.iteritems():  # use d.items() in Python 3
    print("{},{}".format(key, value))
Outputs
A,9
C,15
B,15
E,4
D,5
G,1
F,1
One can treat the first line specially (and in this case ignore it) with next inside try: except StopIteration:. In this case, where you only want the second line, follow with another next instead of a for loop.
with open("grades_single.txt") as f:
try:
next(f) # discard 1st line
line = next(f)
except StopIteration:
raise ValueError('file does not even have two lines')
# now use line
I have a file, dataset.nt, which isn't too large (300Mb). I also have a list, which contains around 500 elements. For each element of the list, I want to count the number of lines in the file which contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).
This is the first thing I tried to achieve that result:
mydict = {}

for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total
It didn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:
mydict = {}

with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
Performance didn't improve; the script still runs indefinitely. So I googled around, and I tried this:
mydict = {}
file = open("dataset.nt", "rb")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total
That one has been running for the last 30 minutes, so I'm assuming it's not any better.
How should I structure this code so that it completes in a reasonable amount of time?
I'd favor a slight modification of your second version:
mydict = dict.fromkeys(mylist, 0)  # every target starts at 0
re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex-part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1
As #matsjoyce already suggested, this avoids re-compiling the regex on each iteration.
If you really need that many different regex patterns, then I don't think there's much you can do.
Maybe it's worth checking if you can regex-capture whatever comes after "/Main/" and then compare this to your list. That may help reducing the amount of "real" regex searches.
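A rough sketch of that capture idea (this assumes the value after "/Main/" runs up to the next whitespace, which may or may not hold for your data):
import re

targets = set(mylist)                 # plain set membership instead of ~500 regex searches
mydict = dict.fromkeys(mylist, 0)
main_re = re.compile(r"/Main/(\S+)")  # one regex captures whatever follows "/Main/"

with open("dataset.nt", "r") as input:
    for line in input:
        for value in main_re.findall(line):
            if value in targets:
                mydict[value] += 1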
Looks like a good candidate for some map/reduce like parallelisation... You could split your dataset file in N chunks (where N = how many processors you have), launch N subprocesses each scanning one chunk, then sum the results.
This of course doesn't prevent you from first optimizing the scan, i.e. (based on sebastian's code):
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue
        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1
Note that this could be better optimized if you posted a sample from your dataset. If for example your dataset can be sorted on "/Main/{i}" (using the system sort program for example), you wouldn't have to check each line for each value of i. Or if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).
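Going back to the map/reduce idea above, a rough sketch of the parallel version (chunking by line index for simplicity; the worker count and list contents are illustrative, not a drop-in implementation):
import re
from collections import Counter
from multiprocessing import Pool

mylist = ["foo", "bar"]  # illustrative; your ~500 items
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]

def scan_chunk(lines):
    # count matches for one chunk of lines, exactly like the sequential scan
    counts = Counter()
    for line in lines:
        if '/Main/' not in line:
            continue
        for i, regex in targets:
            if regex.search(line):
                counts[i] += 1
    return counts

if __name__ == '__main__':
    with open("dataset.nt", "r") as f:
        all_lines = f.readlines()
    n = 4  # number of worker processes
    chunks = [all_lines[i::n] for i in range(n)]
    with Pool(n) as pool:
        results = sum(pool.map(scan_chunk, chunks), Counter())  # merge per-chunk counts
    print(dict(results))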
The other solutions are very good. But, since there is a regex for each element, and it is not important whether the element appears more than once per line, you could count the lines containing the target expression using re.findall.
Also, past a certain number of lines it is better to read the whole file into memory (if you have enough memory and it isn't a design restriction).
import re

mydict = {}
mylist = [...]  # A list with 500 items

# Optimizing calls
findall = re.findall  # so Python doesn't have to resolve these attributes on every call
escape = re.escape

with open("dataset.nt", "rb") as input:
    # Read the file once and keep it in memory instead of reading line by line.
    # If the number of lines is big, this is faster.
    text = input.read()
    for elem in mylist:
        # Count the lines where the target regex matches.
        mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))
I tested this with an 800MB file (I wanted to see how long it takes to load a file this big into memory; it is faster than you would think).
I didn't test the whole code with real data, just the findall part.
I have the following text file:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,456
FRUIT
DRINK
FOOD,BURGER
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
CAR
And I have the following list called 'wanted':
['123', '789']
What I'm trying to do is: if the number after NUM is not in the list called 'wanted', then that line along with the 4 lines below it gets deleted. So the output file will look like:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
My code so far is:
infile = open("inputfile.txt",'r')
data = infile.readlines()
for beginning_line, ube_line in enumerate(data):
UNIT = data[beginning_line].split(',')[1]
if UNIT not in wanted:
del data_list[beginning_line:beginning_line+4]
You shouldn't modify a list while you are looping over it.
What you could try is to just advance the iterator on the file object when needed:
wanted = set(['123', '789'])

with open("inputfile.txt", 'r') as infile, open("outfile.txt", 'w') as outfile:
    for line in infile:
        if line.startswith('NUM,'):
            UNIT = line.strip().split(',')[1]
            if UNIT not in wanted:
                for _ in xrange(4):
                    infile.next()
                continue
        outfile.write(line)
And use a set: it is faster for repeated membership checks.
This approach doesn't make you read in the entire file at once to process it in a list form. It goes line by line, reading from the file, advancing, and writing to the new file. If you want, you can replace the outfile with a list that you are appending to.
There are some issues with the code; for instance, data_list isn't even defined (presumably you meant data), and deleting slices from the list you're enumerating over shifts the remaining lines under you. You also use both enumerate and direct index access on data, and readlines is not needed just to iterate.
I'd suggest avoiding keeping all the lines in memory; it's not really needed here. Maybe try something like this (untested):
with open('infile.txt') as fin, open('outfile.txt', 'w') as fout:
    for line in fin:
        if line.startswith('NUM,') and line.strip().split(',')[1] not in wanted:
            for _ in range(4):
                fin.next()  # next(fin) in Python 3
        else:
            fout.write(line)
import re

# find the lines that match NUM,XYZ
nums = re.compile('NUM,(?:' + '|'.join(['456', '012']) + ")")
# grab a NUM line plus the three lines after it
line_matches = breaks = re.compile('.*\n.*\n.*\n.*\n')

keeper = ''
for line in nums.finditer(data):
    keeper += breaks.findall(data[line.start():])[0]
The result on the given string is:
NUM,456
FRUIT
DRINK
FOOD,BURGER
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
edit: deleting items while iterating is probably not a good idea, see: Remove items from a list while iterating
infile = open("inputfile.txt",'r')
data = infile.readlines()
SKIP_LINES = 4
skip_until = False
result_data = []
for current_line, line in enumerate(data):
if skip_until and skip_until < current_line:
continue
try:
_, num = line.split(',')
except ValueError:
pass
else:
if num not in wanted:
skip_until = current_line + SKIP_LINES
else:
result_data.append(line)
... and result_data is what you want.
If you don't mind building a list, and if your "NUM" lines come every 5th line, you may want to try:
keep = []
for (i, v) in enumerate(lines[::5]):
    (num, current) = v.strip().split(",")
    if current in wanted:
        keep.extend(lines[i*5:i*5+5])
Don't try to think of this in terms of building up a list and removing stuff from it while you loop over it. That way leads madness.
It is much easier to write the output file directly. Loop over lines of the input file, each time deciding whether to write it to the output or not.
Also, to avoid difficulties with the fact that not every line has a comma, try just using .partition instead to split up the lines. That will always return 3 items: when there is a comma, you get (before the first comma, the comma, after the comma); otherwise, you get (the whole thing, empty string, empty string). So you can just use the last item from there, since wanted won't contain empty strings anyway.
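For instance, a quick illustration of partition on the two kinds of lines:
print('NUM,123\n'.partition(','))  # ('NUM', ',', '123\n')
print('FRUIT\n'.partition(','))    # ('FRUIT\n', '', '')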
skip_counter = 0
for line in infile:
    # only the NUM lines decide whether a block is kept or skipped
    if line.startswith('NUM') and line.partition(',')[2].strip() not in wanted:
        skip_counter = 5  # skip this line and the 4 lines below it
    if skip_counter:
        skip_counter -= 1
    else:
        outfile.write(line)