Python: delete element of one txt file from another txt file - python

I hope you are well.
I have two txt files: data.txt and to_remove.txt
data.txt has many lines, and each line has several integers with spaces in between. One line in data.txt looks like this: 1001 1229 19910
to_remove.txt has many lines, and each line has one integer. One line in to_remove.txt looks like this: 1229
I would like to write a new txt file which has data.txt without the integers in to_remove.txt
I know the first element of each line of data.txt does not have any of the elements of to_remove.txt; so I need to check all non-first elements of each line with each integer in to_remove.txt
I wrote code to do this, but it is far too slow. data.txt has more than a million lines, and to_remove.txt has a few hundred thousand lines.
Would be useful if you can suggest a faster way to do this.
Here is my code:
with open('new.txt', 'w') as new:
    with open('data.txt') as data:
        for line in data:
            connections = []
            currentline = line.split(" ")
            for i in xrange(len(currentline)-2):
                n = int(currentline[i+1])
                connections.append(n)
            with open('to_remove.txt') as to_remove:
                for ID in to_remove:
                    ID = int(ID)
                    if ID in connections:
                        connections.remove(ID)
            d = '%d '
            connections.insert(0,int(currentline[0]))
            for j in xrange(len(connections)-1):
                d = d + '%d '
            new.write((d % tuple(connections) + '\n'))

Your code is a bit messy, so I've rewritten it rather than edited it. The main way to improve your speed is to store the numbers to remove in a set(), which allows efficient O(1) membership testing:
with open('data.txt') as data, open('to_remove.txt') as to_remove, open('new.txt', 'w') as new:
    nums_to_remove = {item.strip() for item in to_remove} # create a set of strings to check for removing
    for line in data:
        numbers = line.rstrip().split() # create numbers list (note: these are stored as strings)
        if not any(num in nums_to_remove for num in numbers[1:]): # check for the presence of numbers to remove
            new.write(line) # write to the new file
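Note that the snippet above drops every line that contains a listed number, whereas the question asks for the numbers themselves to be removed from each line. Here is a minimal Python 3 sketch of that per-number variant. It works on in-memory lists so it can be tried without files, the function name strip_listed is hypothetical, and it assumes the numbers are written identically in both files (for example, no leading zeros), since tokens are compared as strings.

```python
def strip_listed(lines, to_remove):
    """Yield each line with the listed integers removed (first field always kept)."""
    # Tokens are compared as strings, so no int() conversion is needed.
    unwanted = {item.strip() for item in to_remove}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kept = [fields[0]] + [f for f in fields[1:] if f not in unwanted]
        yield " ".join(kept) + "\n"


data = ["1001 1229 19910\n", "1002 5 1229 7\n"]
print(list(strip_listed(data, ["1229\n", "7\n"])))
```

Because file objects are iterables of lines, the same function should accept open('data.txt') and open('to_remove.txt') directly.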

I wrote the following to answer my own question, using code from some of the answers and the suggestion in the comment on the question.
def return_nums_remove():
    with open('to_remove.txt') as to_remove:
        # a set of strings gives fast membership tests
        return {item.strip() for item in to_remove}

with open('data.txt') as data, open('new.txt', 'w') as new:
    nums_to_remove = return_nums_remove()
    for line in data:
        # build a new list; calling remove() on a list while looping over it skips elements
        numbers = [n for n in line.split() if n not in nums_to_remove]
        if len(numbers) > 1:
            new.write(' '.join(numbers) + '\n')

Related

Python: Removing dupes from large text file

I need my code to remove duplicate lines from a file, at the moment it is just reproducing the same file as output. Can anyone see how to fix this? The for loop is not running as I would have liked.
#!usr/bin/python
import os
import sys
#Reading Input file
f = open(sys.argv[1]).readlines()
#printing no of lines in the input file
print "Total lines in the input file",len(f)
#temporary dictionary to store the unique records/rows
temp = {}
#counter to count unique items
count = 0
for i in range(0,9057,1):
    if i not in temp: #if row is not there in dictionary i.e it is unique so store it into a dictionary
        temp[f[i]] = 1;
        count += 1
    else: #if exact row is there then print duplicate record and dont store that
        print "Duplicate Records",f[i]
        continue;
#once all the records are read print how many unique records are there
#u can print all unique records by printing temp
print "Unique records",count,len(temp)
#f = open("C://Python27//Vendor Heat Map Test 31072015.csv", 'w')
#print f
#f.close()
nf = open("C://Python34//Unique_Data.csv", "w")
for data in temp.keys():
    nf.write(data)
nf.close()
# Written by Gary O'Neill
# Date 03-08-15
This is a much better way to do what you want:
infile_path = 'infile.csv'
outfile_path = 'outfile.csv'
written_lines = set()
with open(infile_path, 'r') as infile, open(outfile_path, 'w') as outfile:
    for line in infile:
        if line not in written_lines:
            outfile.write(line)
            written_lines.add(line)
        else:
            print "Duplicate record: {}".format(line)
print "{} unique records".format(len(written_lines))
This will read one line at a time, so it works even on large files that don't fit into memory. While it's true that if they're mostly unique lines, written_lines will end up being large anyway, it's better than having two copies of almost every line in memory.
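If the set of full lines is still too large, one option (a different technique from the answer above, not something it proposes) is to remember a fixed-size digest of each line instead of the line itself. Below is a minimal Python 3 sketch; the function name unique_lines is hypothetical, and it accepts the very small chance that two different lines share a digest.

```python
import hashlib


def unique_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Store a 16-byte digest rather than the (possibly long) line.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


print(list(unique_lines(["a,1\n", "b,2\n", "a,1\n"])))
```

This only pays off when lines are considerably longer than 16 bytes; for short lines the plain set of lines is simpler and just as compact.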
You should test whether f[i] is in temp, not whether i is. Change the line:
    if i not in temp:
to
    if f[i] not in temp:

python delete specific line and re-assign the line number

I would like to delete a specific line and reassign the line numbers of the lines that follow.
For example:
0,abc,def
1,ghi,jkl
2,mno,pqr
3,stu,vwx
What I want: if line 1 is the line that needs to be deleted, then
the output should be:
0,abc,def
1,mno,pqr
2,stu,vwx
What I have done so far:
f = open(file,'r')
lines = f.readlines()
f.close()
f = open(file,'w')
for line in lines:
    if line.rsplit(',')[0] != 'line#':
        f.write(line)
f.close()
The lines above can delete a specific line#, but I don't know how to rewrite the line number before the first ','.
Here is a function that will do the job.
def removeLine(n, file):
    f = open(file,"r+")
    d = f.readlines()
    f.seek(0)
    for i in range(len(d)):
        if i > n:
            # count=1 so only the leading line number is replaced, not the same text later in the line
            f.write(d[i].replace(d[i].split(",")[0], str(i - 1), 1))
        elif i != n:
            f.write(d[i])
    f.truncate()
    f.close()
Where the parameters n and file are the line you wish to delete and the filepath respectively.
This is assuming the line numbers are written in the line as implied by your example input.
If the line number is not included at the beginning of each line, as some other answers have assumed, simply remove the first if branch (and change the following elif to an if):
        if i > n:
            f.write(d[i].replace(d[i].split(",")[0], str(i - 1), 1))
I noticed that your account wasn't created in the past few hours, so I figure that there's no harm in giving you the benefit of the doubt. You will really have more fun on StackOverflow if you spend the time to learn its culture.
I wrote a solution that fits your question's criteria on a file that's already written (you mentioned that you're opening a text file), so I assume it's a CSV.
I figured that I'd answer your question differently than the other solutions that implement the CSV reader library and use a temporary file.
import re
numline_csv = re.compile(r"^\d+,")  # raw string; matches only the leading line number and its comma
# substitute your actual file opening here
so_31195910 = """
0,abc,def
1,ghi,jkl
2,mno,pqr
3,stu,vwx
"""
so = so_31195910.splitlines()
# this could be an input or whatever you need
delete_line = 1
line_bank = []
for l in so:
    if l and not l.startswith(str(delete_line)+','):
        print(l)
        l = re.split(numline_csv, l)
        line_bank.append(l[1])
so = []
for i,l in enumerate(line_bank):
    so.append("%s,%s" % (i,l))
And the output:
>>> so
['0,abc,def', '1,mno,pqr', '2,stu,vwx']
In order to get a line number for each line, you should use the enumerate function:
for line_index, line in enumerate(lines):
    # line_index is 0 for the first line, 1 for the 2nd line, etc.
In order to separate the first element of the string from the rest of the string, I suggest passing a value for maxsplit to the split method.
>>> '0,abc,def'.split(',')
['0', 'abc', 'def']
>>> '0,abc,def'.split(',',1)
['0', 'abc,def']
>>>
Once you have those two, it's just a matter of concatenating line_index to split(',',1)[1].
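Putting the two pieces together might look like the sketch below (Python 3). The function name renumber_without is hypothetical, and it assumes every line contains at least one comma, as in the example input.

```python
def renumber_without(lines, delete_index):
    """Drop the line at position delete_index and renumber the first field."""
    kept = [line for i, line in enumerate(lines) if i != delete_index]
    result = []
    for line_index, line in enumerate(kept):
        # maxsplit=1 leaves any later commas in the rest of the line untouched.
        rest = line.split(",", 1)[1]
        result.append("%d,%s" % (line_index, rest))
    return result


rows = ["0,abc,def", "1,ghi,jkl", "2,mno,pqr", "3,stu,vwx"]
print(renumber_without(rows, 1))
# ['0,abc,def', '1,mno,pqr', '2,stu,vwx']
```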

Printing specific lines txt file python

I have a text file I wish to analyze. I'm trying to find every line that contains certain characters (ex: "#") and then print the line located 3 lines before it (ex: if line 5 contains "#", I would like to print line 2)
This is what I got so far:
file = open('new_file.txt', 'r')
a = list()
x = 0
for line in file:
    x = x + 1
    if '#' in line:
        a.append(x)
        continue
x = 0
for index, item in enumerate(a):
    for line in file:
        x = x + 1
        d = a[index]
        if x == d - 3:
            print line
            continue
It won't work (it prints nothing when I feed it a file that has lines containing "#"), any ideas?
First, you are going through the file multiple times without re-opening it for subsequent times. That means all subsequent attempts to iterate the file will terminate immediately without reading anything.
Second, your indexing logic is a little convoluted. Assuming your files are not huge relative to your memory size, it is much easier to simply read the whole file into memory (as a list) and manipulate it there.
myfile = open('new_file.txt', 'r')
a = myfile.readlines()
for index, item in enumerate(a):
    if '#' in item and index - 3 >= 0:
        print a[index - 3].strip()
This has been tested on the following input:
PrintMe
PrintMe As Well
Foo
#Foo
Bar#
hello world will print
null
null
##
Ok, the issue is that you have already iterated completely through the file descriptor file in line 4 when you try again in line 11. So line 11 will make an empty loop. Maybe it would be a better idea to iterate the file only once and remember the last few lines...
file = open('new_file.txt', 'r')
a = ["","",""]
for line in file:
    if "#" in line:
        print(a[0], end="")
    a.append(line)
    a = a[1:]
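The "remember the last few lines" idea can also be sketched with collections.deque, which discards the oldest entry automatically once maxlen is reached. This is a Python 3 sketch; the function name lines_three_before and its parameters are hypothetical. Unlike the snippet above, it emits nothing for a match in the first three lines rather than an empty string.

```python
from collections import deque


def lines_three_before(lines, marker="#", offset=3):
    """Yield the line `offset` lines before each line containing `marker`."""
    recent = deque(maxlen=offset)  # holds at most the previous `offset` lines
    for line in lines:
        if marker in line and len(recent) == offset:
            yield recent[0]  # oldest remembered line
        recent.append(line)


sample = ["one\n", "two\n", "three\n", "#four\n", "five#\n"]
print(list(lines_three_before(sample)))
# ['one\n', 'two\n']
```

Since it only iterates once and keeps three lines at a time, it should also work when passed an open file object.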
For file I/O, matching patterns with a regex while iterating through the lines of the file is usually efficient in both programmer time and runtime, so your problem is quite manageable.
import re
file = open('new_file.txt', 'r')
document = file.read()
lines = document.split("\n")
LinesOfInterest = []
for lineNumber,line in enumerate(lines):
    WhereItsAt = re.search( r'#', line)
    if(lineNumber>2 and WhereItsAt):
        LinesOfInterest.append(lineNumber-3)
print LinesOfInterest
for lineNumber in LinesOfInterest:
    print(lines[lineNumber])
Lines of Interest is now a list of line numbers matching your criteria
I used
line1,0
line2,0
line3,0
#
line1,1
line2,1
line3,1
#
line1,2
line2,2
line3,2
#
line1,3
line2,3
line3,3
#
as input yielding
[0, 4, 8, 12]
line1,0
line1,1
line1,2
line1,3

How to read comma separated values from a text file, then output result to a text file? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 9 years ago.
I need to create a program that will add numbers read from a text file separated by commas. i.e.
in file.txt:
1,2,3
4,5,6
7,8,9
So far I have the simple code:
x = 1
y = 2
z = 3
sum = x + y + z
print(sum)
I'm not sure how I would assign each number in the text file to x, y and z.
What I would like is that it will iterate through each line in the text file, this would be with a simple loop.
However I also do not know how I would then output the results to another text file.
i.e. answers.txt:
6
15
24
You have the right idea going, let's start by opening some files:
with open("text.txt", "r") as filestream:
    with open("answers.txt", "w") as filestreamtwo:
Here, we have opened two filestreams - "text.txt" and "answers.txt".
Since we used with, these filestreams will automatically close after the code that is indented beneath them finishes running.
Now, let's run through the file "text.txt" line by line:
        for line in filestream:
This will run a for loop that ends when the end of the file is reached.
Next, we need to change the input text into something we can work with, such as a list:
            currentline = line.split(",")
Now, currentline contains the numbers from the current line of "text.txt", still stored as strings.
Let's sum up these integers:
            total = str(int(currentline[0]) + int(currentline[1]) + int(currentline[2])) + "\n"
We had to wrap each element of currentline in the int function. Otherwise, instead of adding the integers, we would be concatenating strings!
Afterwards, we append the newline character, "\n", so that each sum sits on its own line in "answers.txt".
            filestreamtwo.write(total)
Now, we are writing to the file "answers.txt"... That's it! You're done!
Here's the code again:
with open("text.txt", "r") as filestream:
    with open("answers.txt", "w") as filestreamtwo:
        for line in filestream:
            currentline = line.split(",")
            total = str(int(currentline[0]) + int(currentline[1]) + int(currentline[2])) + "\n"
            filestreamtwo.write(total)
You can do this in fewer lines, but I hope you find this solution readable and easy to understand:
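# open() instead of the Python 2-only file(); the with block also closes both files
with open('file.txt', 'r') as infile, open('answers.txt', 'w') as out:
    for line in infile:
        s = 0
        for num in line.strip().split(','):
            s += int(num)
        out.write("%d\n" % s)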
For this task you may want to not work with files in your program directly, but work with standard input (input() or raw_input() in Python 2) and standard output (just print()).
Then you specify input and output file names during invocation of your script:
python script.py < file.txt > answer.txt
With this scheme you can have a program like this (Python 2.7):
while True:
    try:
        x, y, z = [int(val) for val in raw_input().split(',')]
        print (x + y + z)
    except EOFError:
        break  # end of input: stop instead of looping forever
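A Python 3 take on the same idea is sketched below. It iterates over a stream of lines instead of calling raw_input(), so there is no EOFError to handle, and it accepts any number of values per line. The function name sum_lines is hypothetical, and io.StringIO stands in for redirected input so the example runs on its own.

```python
import io


def sum_lines(stream):
    """Yield the sum of the comma-separated integers on each non-blank line."""
    for line in stream:
        line = line.strip()
        if line:
            yield sum(int(value) for value in line.split(","))


# Any iterable of lines works; pass sys.stdin to read redirected input.
demo = io.StringIO("1,2,3\n4,5,6\n7,8,9\n")
for total in sum_lines(demo):
    print(total)
```

With sys.stdin in place of demo, the script can be invoked with the same redirection as above (python script.py < file.txt > answer.txt).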
If the file only contains comma-separated values, why not save it in ".csv" format? If you can, then:
You can always use a csv reader to read any CSV file, as described in the docs: http://docs.python.org/2/library/csv.html
Quick example for your scenario:
import csv
with open('test.csv','rb') as csvfile, open('output.txt', 'ab') as output_fil:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        result = 0
        for elem in row:
            result = result + int(elem)
        print result
        # add a newline so each sum lands on its own line
        output_fil.write(str(result) + "\n")
Where test.csv would contain input like:
1,2,3
4,5,6
...
and output.txt shall contain:
6
15
..
INFILE = "input.csv"
OUTFILE = "my.txt"
def row_sum(row, delim=","):
    try:
        return sum(int(i) for i in row.split(delim))
    except ValueError:
        return ""
with open(INFILE) as inf, open(OUTFILE, "w") as outf:
    outf.write("\n".join(str(row_sum(row)) for row in inf))
with open("file2.txt","w") as f:
    print >> f,"\n".join(map(lambda x:str(sum(int(y) for y in x.split(","))),open("file1.txt")))

Deleting certain line of text file in python

I have the following text file:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,456
FRUIT
DRINK
FOOD,BURGER
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
CAR
And I have the following list called 'wanted':
['123', '789']
What I'm trying to do is this: if the number after NUM is not in the list called 'wanted', then that line, along with the 4 lines below it, gets deleted. So the output file will look like:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
My code so far is:
infile = open("inputfile.txt",'r')
data = infile.readlines()
for beginning_line, ube_line in enumerate(data):
    UNIT = data[beginning_line].split(',')[1]
    if UNIT not in wanted:
        del data_list[beginning_line:beginning_line+4]
You shouldn't modify a list while you are looping over it.
What you could try is to just advance the iterator on the file object when needed:
wanted = set(['123', '789'])
with open("inputfile.txt",'r') as infile, open("outfile.txt",'w') as outfile:
    for line in infile:
        if line.startswith('NUM,'):
            UNIT = line.strip().split(',')[1]
            if UNIT not in wanted:
                for _ in xrange(4):
                    infile.next()
                continue
        outfile.write(line)
And use a set. It is faster for constantly checking the membership.
This approach doesn't make you read in the entire file at once to process it in a list form. It goes line by line, reading from the file, advancing, and writing to the new file. If you want, you can replace the outfile with a list that you are appending to.
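For readers on Python 3, where xrange and the .next() method are gone, the same "advance the iterator" idea could be sketched as follows. The function name keep_wanted_blocks and the block_tail parameter are hypothetical, and the function collects into a list (the variation mentioned above) so it can be tried without files.

```python
from itertools import islice


def keep_wanted_blocks(lines, wanted, block_tail=4):
    """Return lines, dropping each unwanted NUM line and the block_tail lines after it."""
    kept = []
    it = iter(lines)
    for line in it:
        if line.startswith("NUM,"):
            unit = line.strip().split(",")[1]
            if unit not in wanted:
                # Consume the next block_tail lines from the same iterator;
                # islice stops quietly if the input ends early.
                for _ in islice(it, block_tail):
                    pass
                continue
        kept.append(line)
    return kept


sample = ["This is my text file\n", "NUM,123\n", "FRUIT\n", "DRINK\n", "FOOD,BACON\n", "CAR\n",
          "NUM,456\n", "FRUIT\n", "DRINK\n", "FOOD,BURGER\n", "CAR\n", "NUM,789\n"]
print("".join(keep_wanted_blocks(sample, {"123", "789"})))
```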
There are some issues with the code; for instance, data_list isn't even defined. Deleting a slice with del is valid on a list, but doing it while you enumerate the same list shifts the remaining items, so some of them are skipped. You also use both enumerate and direct index access on data, and readlines is not needed.
I'd suggest not keeping all the lines in memory; it's not really needed here. Maybe try something like this (untested):
with open('infile.txt') as fin, open('outfile.txt', 'w') as fout:
    for line in fin:
        # strip() drops the trailing newline so the number can match an entry in wanted
        if line.startswith('NUM,') and line.strip().split(',')[1] not in wanted:
            for _ in range(4):
                next(fin)
        else:
            fout.write(line)
import re
# find the lines that match NUM,XYZ
nums = re.compile('NUM,(?:' + '|'.join(['456','012']) + ")")
# find the three lines after a nums match
line_matches = breaks = re.compile('.*\n.*\n.*\n')
keeper = ''
# data is assumed to hold the whole file as a single string
for line in nums.finditer(data):
    keeper += breaks.findall( data[line.start():] )[0]
result on the given string is
NUM,456
FRUIT
DRINK
FOOD,BURGER
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
edit: deleting items while iterating is probably not a good idea, see: Remove items from a list while iterating
infile = open("inputfile.txt",'r')
data = infile.readlines()
SKIP_LINES = 4
skip_until = -1
result_data = []
for current_line, line in enumerate(data):
    if current_line <= skip_until:
        continue  # still inside a block that is being dropped
    try:
        key, num = line.strip().split(',')
    except ValueError:
        pass  # no single comma: an ordinary line, keep it
    else:
        if key == 'NUM' and num not in wanted:
            skip_until = current_line + SKIP_LINES
            continue
    result_data.append(line)
... and result_data is what you want.
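If you don't mind building a list, and if a "NUM" line comes exactly every five lines, starting with the first entry of lines, you may want to try:
keep = []
for (i, v) in enumerate(lines[::5]):
    # strip() removes the trailing newline before comparing against wanted
    (num, current) = v.strip().split(",")
    if current in wanted:
        keep.extend(lines[i*5:i*5+5])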
Don't try to think of this in terms of building up a list and removing stuff from it while you loop over it. That way lies madness.
It is much easier to write the output file directly. Loop over lines of the input file, each time deciding whether to write it to the output or not.
Also, to avoid difficulties with the fact that not every line has a comma, try just using .partition instead to split up the lines. That will always return 3 items: when there is a comma, you get (before the first comma, the comma, after the comma); otherwise, you get (the whole thing, empty string, empty string). Check that the first item is NUM before testing the last item against wanted; otherwise every ordinary line, whose last item is empty or something like BACON, would look unwanted.
skip_counter = 0
for line in infile:
    key, _, value = line.strip().partition(',')
    if key == 'NUM' and value not in wanted:
        skip_counter = 5  # this line plus the 4 lines below it
    if skip_counter:
        skip_counter -= 1
    else:
        outfile.write(line)
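To make the behaviour of str.partition concrete, here is a tiny Python 3 illustration using lines from the question's file. The variable names are only for the example.

```python
# partition always returns exactly three strings, comma or not
with_comma = "NUM,456\n".strip().partition(",")
without_comma = "FRUIT\n".strip().partition(",")

print(with_comma)     # ('NUM', ',', '456')
print(without_comma)  # ('FRUIT', '', '')
```

Because the result always has three items, it can be unpacked without a try/except, unlike the result of split(',').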
