Deleting certain line of text file in python - python

I have the following text file:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,456
FRUIT
DRINK
FOOD,BURGER
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
CAR
And I have the following list called 'wanted':
['123', '789']
What I'm trying to do is this: if the number after NUM is not in the list called 'wanted', then that line, along with the 4 lines below it, gets deleted. So the output file will look like:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
My code so far is:
infile = open("inputfile.txt",'r')
data = infile.readlines()
for beginning_line, ube_line in enumerate(data):
    UNIT = data[beginning_line].split(',')[1]
    if UNIT not in wanted:
        del data_list[beginning_line:beginning_line+4]

You shouldn't modify a list while you are looping over it.
What you could try is to just advance the iterator on the file object when needed:
wanted = set(['123', '789'])
with open("inputfile.txt",'r') as infile, open("outfile.txt",'w') as outfile:
    for line in infile:
        if line.startswith('NUM,'):
            UNIT = line.strip().split(',')[1]
            if UNIT not in wanted:
                # skip the 4 lines that belong to this unwanted NUM line
                for _ in range(4):
                    next(infile)
                continue
        outfile.write(line)
Note that wanted is a set here. Membership tests on a set are faster than on a list, which helps when you check it repeatedly.
This approach doesn't make you read in the entire file at once to process it in a list form. It goes line by line, reading from the file, advancing, and writing to the new file. If you want, you can replace the outfile with a list that you are appending to.
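As a minimal sketch of the same skip-ahead idea, the logic can be wrapped in a generator that accepts any iterator of lines, which makes it easy to try without touching files. The name filter_blocks and the block_tail parameter are hypothetical and not part of the question; io.StringIO only stands in for the input file.

```python
import io

def filter_blocks(lines, wanted, block_tail=4):
    """Yield lines, dropping each unwanted NUM line plus the block_tail lines after it."""
    lines = iter(lines)  # next() below must advance the same iterator the loop uses
    for line in lines:
        if line.startswith('NUM,'):
            unit = line.strip().split(',')[1]
            if unit not in wanted:
                for _ in range(block_tail):
                    next(lines, None)  # default None tolerates a truncated last block
                continue
        yield line

sample = io.StringIO(
    "This is my text file\n"
    "NUM,123\nFRUIT\nDRINK\nFOOD,BACON\nCAR\n"
    "NUM,456\nFRUIT\nDRINK\nFOOD,BURGER\nCAR\n"
)
kept = list(filter_blocks(sample, {'123', '789'}))
```

Passing a default to next() is a deliberate choice here: a file whose last block is shorter than expected then ends quietly instead of raising StopIteration inside the generator.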

There are some issues with the code; for instance, data_list isn't even defined. Deleting a slice from a list with del is allowed, but doing it while enumerate is walking that same list shifts the remaining items, so the loop skips lines. You also use both enumerate and direct index access on data, and readlines is not needed.
I'd suggest not keeping all lines in memory; it's not really needed here. Maybe try something like this (untested):
with open('infile.txt') as fin, open('outfile.txt', 'w') as fout:
    for line in fin:
        # strip() removes the trailing newline so that '456\n' compares as '456'
        if line.startswith('NUM,') and line.strip().split(',')[1] not in wanted:
            for _ in range(4):
                next(fin)
        else:
            fout.write(line)

import re
# find the lines that match NUM,XYZ
nums = re.compile('NUM,(?:' + '|'.join(['456','012']) + ")")
# find the three lines after a nums match
line_matches = breaks = re.compile('.*\n.*\n.*\n')
keeper = ''
for line in nums.finditer(data):
    keeper += breaks.findall( data[line.start():] )[0]
result on the given string is
NUM,456
FRUIT
DRINK
FOOD,BURGER
NUM,012
FRUIT
DRINK
FOOD,MEATBALL

edit: deleting items while iterating is probably not a good idea, see: Remove items from a list while iterating
infile = open("inputfile.txt",'r')
data = infile.readlines()
infile.close()
SKIP_LINES = 4
skip_until = -1  # index of the last line that must be skipped
result_data = []
for current_line, line in enumerate(data):
    if current_line <= skip_until:
        continue
    try:
        key, num = line.strip().split(',')
    except ValueError:
        # no comma on this line, so it cannot be a NUM line
        key = num = None
    if key == 'NUM' and num not in wanted:
        skip_until = current_line + SKIP_LINES
    else:
        result_data.append(line)
... and result_data is what you want.

If you don't mind building a list, and if and only if your "NUM" lines come every 5th line, you may want to try the following (lines is assumed to hold the file's lines without the leading header line):
keep = []
for (i, v) in enumerate(lines[::5]):
    (num, current) = v.strip().split(",")
    if current in wanted:
        keep.extend(lines[i*5:i*5+5])

Don't try to think of this in terms of building up a list and removing stuff from it while you loop over it. That way lies madness.
It is much easier to write the output file directly. Loop over lines of the input file, each time deciding whether to write it to the output or not.
Also, to avoid difficulties with the fact that not every line has a comma, try just using .partition instead to split up the lines. That will always return 3 items: when there is a comma, you get (before the first comma, the comma, after the comma); otherwise, you get (the whole thing, empty string, empty string). Strip the trailing newline first, and only start skipping when the first item is NUM and the last item is not in wanted; otherwise lines such as FRUIT or FOOD,BACON would trigger a skip as well.
skip_counter = 0
for line in infile:
    key, _, value = line.strip().partition(',')
    if key == 'NUM' and value not in wanted:
        skip_counter = 5  # this line plus the 4 lines below it
    if skip_counter:
        skip_counter -= 1
    else:
        outfile.write(line)
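To make the behaviour of .partition concrete, here is a tiny sketch using two lines from the question's file. The variable names are only illustrative.

```python
# A line with a comma splits into (head, separator, tail).
with_comma = "NUM,123\n".strip().partition(',')

# A line without a comma comes back whole, followed by two empty strings.
without_comma = "FRUIT\n".strip().partition(',')

print(with_comma)
print(without_comma)
```

Because the result always has three items, it can be unpacked unconditionally, with no try/except around the split.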

Related

Basic python, how to return lists of lists when reading a text file

I'm trying to store each line of a text file as a separate list within a list, where the characters of each nested list are also individual cells. Right now it only appends the last character of each line, and I'm not sure why, given the nested while loop. Does anyone see the mistake? Thanks
def read_lines(filename):
    ls_1 = []
    x = open(filename, 'r')
    i = 0
    t = 0
    while True: #nested while loop to read lines and seperate lines into individual characters (cells)
        read = x.readline()
        if read == '':
            break
        st = read.strip("''\n''")
        while t < len(st):
            ls_2 = []
            ls_2.append(st[t])
            t += 1
        ls_1.append(ls_2) #append a new list to the original list every time the while loop resets and a new line is read
        #ls_2.clear() # removes contents so the next loop doesn't repeat the first readline (doesnt work for unkown reason)
        t = 0 # resets the index of read so the next new line can be read from start of line
        i += 1
    x.close()
    return ls_1
Whole txt file:
Baby on board, how I've adored
That sign on my car's windowpane.
Bounce in my step,
Loaded with pep,
'Cause I'm driving in the carpool lane.
Call me a square,
Friend, I don't care.
That little yellow sign can't be ignored.
I'm telling you it's mighty nice.
Each trip's a trip to paradise
With my baby on board!
The reason you are only getting the last character is that you create a new list inside your inner loop:
while t < len(st):
    ls_2 = []
    ls_2.append(st[t])
    t += 1
ls_1.append(ls_2)
Instead, you would have to do:
ls_2 = []
while t < len(st):
    ls_2.append(st[t])
    t += 1
ls_1.append(ls_2)
However, don't use while loops to read from files, file objects are iterators, so just use a for-loop. Similarly, don't use a while loop to iterate over a string.
Here is how you would do it, Pythonically:
result = []
with open(filename) as f:
    for line in f:
        result.append(list(line.strip()))
Or with a list comprehension:
with open(filename) as f:
    result = [list(line.strip()) for line in f]
While loops are rarely needed for this kind of task in Python; iteration is built around iterators.
I suggest you use the file's readlines method. That way you can iterate over each line of the opened file and convert each string to a list, which gives you a list of all the characters that make up that string (which seems to be what you want).
Try using the following code:
def read_lines(filename):
    x = open(filename, 'r')
    ls_1 = [list(line.strip()) for line in x.readlines()]
    x.close()
    return ls_1
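For a quick check of the comprehension approach without creating a file, the sketch below feeds it an in-memory stream. io.StringIO stands in for the opened file, and the helper name chars_per_line is hypothetical.

```python
import io

def chars_per_line(stream):
    """Return one list of characters per line, without the trailing newline."""
    return [list(line.rstrip('\n')) for line in stream]

# Two lines borrowed from the sample text in the question.
grid = chars_per_line(io.StringIO("Bounce\nLoaded\n"))
print(grid)
```

Using rstrip('\n') rather than strip() keeps leading and trailing spaces as cells, which may or may not be what you want.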

Duplicate numbers reading a text file in Python

I have a Python script which I'm trying to use to print duplicate numbers in the Duplicate.txt file:
newList = set()
datafile = open ("Duplicate.txt", "r")
for i in datafile:
    if datafile.count(i) >= 2:
        newList.add(i)
datafile.close()
print(list(newList))
I'm getting the following error, could anyone help please?
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
The problem is exactly what it says: file objects don't know how to count anything. They're just iterators, not lists or strings or anything like that.
And part of the reason for that is that it would potentially be very slow to scan the whole file over and over like that.
If you really need to use count, you can put the lines into a list first. Lists are entirely in-memory, so it's not nearly as slow to scan them over and over, and they have a count method that does exactly what you're trying to do with it:
datafile = open ("Duplicate.txt", "r")
lines = list(datafile)
for i in lines:
    if lines.count(i) >= 2:
        newList.add(i)
datafile.close()
However, there's a much better solution: Just keep counts as you go along, and then keep the ones that are >= 2. In fact, apart from the import, you can write that in two lines:
import collections
counts = collections.Counter(datafile)
newList = {line for line, count in counts.items() if count >= 2}
But if it isn't clear to you why that works, you may want to do it more explicitly:
counts = collections.Counter()
for i in datafile:
    counts[i] += 1
newList = set()
for line, count in counts.items():
    if count >= 2:
        newList.add(line)
Or, if you don't even understand the basics of Counter:
counts = {}
for i in datafile:
    if i not in counts:
        counts[i] = 1
    else:
        counts[i] += 1
The error in your code is trying to apply count on a file handle, not on a list.
Anyway, you don't need to count the elements, you just need to see if the element already has been seen in the file.
I'd suggest a marker set to note which elements have already occurred.
seen = set()
result = set()
with open ("Duplicate.txt", "r") as datafile:
    for i in datafile:
        # you may turn i to a number here with: i = int(i)
        if i in seen:
            result.add(i) # data is already in seen: duplicate
        else:
            seen.add(i) # next time it occurs, we'll detect it
print(list(result)) # convert to list (maybe not needed, set is ok to print)
Your immediate error is because you're asking if datafile.count(i) and datafile is a file, which doesn't know how to count its contents.
Your question is not about how to solve the larger problem, but since I'm here:
Assuming Duplicate.txt contains numbers, one per line, I would probably read each line's contents into a list and then use a Counter to count the list's contents.
You are looking to use the list.count() method, but you've mistakenly called it on a file object. Instead, let's read the file, split its contents into a list, and then obtain the count of each item using the list.count() method.
# read the data from the file
with open ("Duplicate.txt", "r") as datafile:
    datafile_data = datafile.read()
# split the file contents by whitespace and convert to list
datafile_data = datafile_data.split()
# build a dictionary mapping words to their counts
word_to_count = {}
unique_data = set(datafile_data)
for data in unique_data:
    word_to_count[data] = datafile_data.count(data)
# populate our list of duplicates (anything seen at least twice)
all_duplicates = []
for x in word_to_count:
    if word_to_count[x] >= 2:
        all_duplicates.append(x)
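One detail worth keeping in mind: iterating over a file yields each line with its trailing newline, so a final line without one would not compare equal to an earlier copy of the same value. A hedged sketch that strips each line before counting, assuming one value per line (the function name find_duplicates is made up for illustration, and io.StringIO stands in for Duplicate.txt):

```python
import collections
import io

def find_duplicates(stream):
    """Return the set of stripped, non-empty lines that occur at least twice."""
    counts = collections.Counter(
        line.strip() for line in stream if line.strip()
    )
    return {value for value, count in counts.items() if count >= 2}

# The last "1" has no trailing newline but is still counted with the others.
dupes = find_duplicates(io.StringIO("1\n2\n3\n2\n1\n1"))
print(sorted(dupes))
```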

Python reading file problems

highest_score = 0
g = open("grades_single.txt","r")
arrayList = []
for line in highest_score:
    if float(highest_score) > highest_score:
        arrayList.extend(line.split())
g.close()
print(highest_score)
Hello, I wondered if anyone could help me; I'm having problems here. I have to read in a file which contains 3 lines. The first line is of no use, and neither is the 3rd. The second contains a list of letters, which I have to pull out (for instance all the As, all the Bs, all the Cs, all the way up to G); there are multiple letters of each. I have to be able to count how many of each there are with this program. I'm very new to this, so please bear with me if the code is wrong. I just wondered if anyone could point me in the right direction on how to pull out these letters on the second line and count them. I then have to do a mathematical function with these letters, but I hope to work that out for myself.
Sample of the data:
GTSDF60000
ADCBCBBCADEBCCBADGAACDCCBEDCBACCFEABBCBBBCCEAABCBB
*
You do not read the contents of the file. To do so, use the .read() or .readlines() method on your opened file. .readlines() reads each line of the file separately, like so:
g = open("grades_single.txt","r")
filecontent = g.readlines()
Since it is good practice to close your file as soon as you have read its contents, follow directly with:
g.close()
Another option would be:
with open("grades_single.txt","r") as g:
    content = g.readlines()
The with statement closes the file for you (so you don't need to call .close() this way).
Since you need the contents of the second line only, you can choose that one directly:
content = g.readlines()[1]
.readlines() doesn't strip a line of its newline (which usually is \n), so you still have to do so:
content = g.readlines()[1].strip('\n')
The .count() method lets you count items in a list or in a string. So you could do:
dct = {}
for item in content:
    dct[item] = content.count(item)
This can be written more concisely with a dictionary comprehension:
dct = {item:content.count(item) for item in content}
Finally, you can get the highest count and print it:
highest_score = max(dct.values())
print(highest_score)
.values() returns the values of a dictionary, and max returns the largest of them.
Thus the code that does what you're looking for could be:
with open("grades_single.txt","r") as g:
    content = g.readlines()[1].strip('\n')
dct = {item:content.count(item) for item in content}
highest_score = max(dct.values())
print(highest_score)
highest_score = 0
arrayList = []
with open("grades_single.txt") as f:
    # a file object cannot be indexed, so read the lines first
    arrayList.extend(f.readlines()[1].strip())
print(arrayList)
This will show you the characters of the second line of that file. It extends arrayList with them, and then you can do whatever you want with that list.
import re
# opens the file in read mode (and closes it automatically when done)
with open('my_file.txt', 'r') as opened_file:
    # Temporarily stores all lines of the file here.
    all_lines_list = []
    for line in opened_file.readlines():
        all_lines_list.append(line)
# This is the selected pattern.
# It basically means "match a single character from a to g"
# and ignores upper or lower case
pattern = re.compile(r'[a-g]', re.IGNORECASE)
# Which line i want to choose (assuming you only need one line chosen)
line_num_i_need = 2
# (1 is deducted since the first element in python has index 0)
matches = re.findall(pattern, all_lines_list[line_num_i_need-1])
print('\nMatches found:')
print(matches)
print('\nTotal matches:')
print(len(matches))
You might want to check regular expressions in case you need some more complex pattern.
To count the occurrences of each letter I used a dictionary instead of a list. With a dictionary, you can access each letter count later on.
d = {}
g = open("grades_single.txt", "r")
for i,line in enumerate(g):
    if i == 1:
        holder = list(line.strip())
g.close()
for letter in holder:
    d[letter] = holder.count(letter)
for key,value in d.items():
    print("{},{}".format(key,value))
Outputs (in order of first appearance on Python 3.7+; older versions may order the keys differently)
A,9
D,5
C,15
B,15
E,4
G,1
F,1
One can treat the first line specially (and in this case ignore it) with next inside try: except StopIteration:. In this case, where you only want the second line, follow with another next instead of a for loop.
with open("grades_single.txt") as f:
    try:
        next(f) # discard 1st line
        line = next(f)
    except StopIteration:
        raise ValueError('file does not even have two lines')
# now use line
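Putting these pieces together, collections.Counter does the per-letter counting in a single pass instead of calling .count() once per character. This is only a sketch, assuming the grades sit on the second line as in the sample data; count_grades is a hypothetical name and io.StringIO stands in for grades_single.txt.

```python
import collections
import io

def count_grades(stream):
    """Count each character on the second line of the stream."""
    next(stream)  # discard the first line
    return collections.Counter(next(stream).strip())

sample = io.StringIO(
    "GTSDF60000\n"
    "ADCBCBBCADEBCCBADGAACDCCBEDCBACCFEABBCBBBCCEAABCBB\n"
    "*\n"
)
grades = count_grades(sample)
for letter in sorted(grades):
    print(letter, grades[letter])
```

The result behaves like a dictionary, so grades['A'] gives the count for A, and grades.most_common(1) returns the most frequent letter with its count.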

How to delete a line from a text file using the line number in python

here is an example text file
the bird flew
the dog barked
the cat meowed
here is my code to find the line number of the phrase I want to delete
phrase = 'the dog barked'
with open(filename) as myFile:
    for num, line in enumerate(myFile, 1):
        if phrase in line:
            print('found at line:', num)
what can I add to this to be able to delete that line number (num)?
I have tried
lines = myFile.readlines()
del line[num]
but this doesn't work. How should I approach this?
You could use the fileinput module to update the file - note this will remove all lines containing the phrase:
import fileinput
for line in fileinput.input(filename, inplace=True):
    if phrase in line:
        continue
    print(line, end='')
A user by the name of gnibbler posted something similar to this on another thread.
Modify the file in place: the offending line is replaced with spaces, so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place if the fix is not longer than the line you are replacing.
If the other program can be changed to output the file offset instead of the line number, you can assign the offset to p directly and do without the for loop.
import os
from mmap import mmap
phrase = 'the dog barked'
filename = r'C:\Path\text.txt'
def removeLine(filename, num):
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for i in range(num-1):
        p = m.find(b'\n', p) + 1  # advance to the start of line num
    q = m.find(b'\n', p)          # end of line num
    m[p:q] = b' ' * (q - p)       # overwrite the line with spaces
    m.close()
    os.close(f)
with open(filename) as myFile:
    for num, line in enumerate(myFile, 1):
        if phrase in line:
            removeLine(filename, num)
            print('Removed at line:', num)
I found another solution that works well and avoids the inelegant counting of lines within the file object:
del_line = 3 # line to be deleted: no. 3 (first line is no. 1)
with open("textfile.txt","r") as textobj:
    lines = list(textobj) # puts all lines in a list
del lines[del_line - 1] # delete the corresponding element
# rewrite the text file from the list's elements:
with open("textfile.txt","w") as textobj:
    for n in lines:
        textobj.write(n)
Detailed explanation for those who want it:
(1) Create a variable containing an integer value of the line number you want to have deleted. Let's say I want to delete line #3:
del_line = 3
(2) Open the text file and put it into a file object. Only reading mode is necessary for now. Then, put its contents into a list (named lines here, so that the built-in list is not shadowed):
with open("textfile.txt","r") as textobj:
    lines = list(textobj)
(3) Now every line should be an indexed element in "lines". You can proceed by deleting the element representing the line you want to have deleted:
del lines[del_line - 1]
At this point, if you got the line number that is supposed to be deleted from user input, make sure to convert it to an integer first, since it will most likely be a string (if you used "input()").
It's del_line - 1 because the list's element index starts at 0. However, I assume you (or the user) start counting at "1" for line no. 1, in which case you need to deduct 1 to catch the correct element in the list.
(4) Open the text file again, this time in "write mode", rewriting the complete file. After that, iterate over the updated list, writing every element of "lines" into the file. You don't need to worry about newlines, because at the moment you put the contents of the original file into a list (step 2), the \n characters were also copied into the list elements:
with open("textfile.txt","w") as textobj:
    for n in lines:
        textobj.write(n)
This has done the job for me when I wanted the user to decide which line to delete in a certain text file.
I think Martijn Pieters's answer does something similar; however, his explanation is too brief for me to be able to tell.
Assuming num is the line number to remove:
import numpy as np
a=np.genfromtxt("yourfile.txt",dtype=None, delimiter="\n")
with open('yourfile.txt','w') as f:
    for el in np.delete(a,(num-1),axis=0):
        f.write(str(el)+'\n')
You start counting at one, but python indices are always zero-based.
Start your line count at zero:
for num, line in enumerate(myFile): # default is to start at 0
or subtract one from num, deleting from lines (not line):
del lines[num - 1]
Note that in order for your .readlines() call to return any lines, you need to either re-open the file first, or seek to the start:
myFile.seek(0)
Try
lines = myFile.readlines()
mylines = [x for x in lines if x.find(phrase) < 0]
Implementing @atomh33ls's numpy approach
So you want to delete any line in the file that contains the phrase string, right? Rather than just deleting the phrase string itself:
import numpy as np
phrase = 'the dog barked'
nums = []
with open("yourfile.txt") as myFile:
    for num1, line in enumerate(myFile, 0):
        # Changing from enumerate(myFile, 1) to enumerate(myFile, 0)
        if phrase in line:
            nums.append(num1)
a=np.genfromtxt("yourfile.txt",dtype=None, delimiter="\n", encoding=None )
with open('yourfile.txt','w') as f:
    for el in np.delete(a,nums,axis=0):
        f.write(str(el)+'\n')
where text file is,
the bird flew
the dog barked
the cat meowed
produces
the bird flew
the cat meowed
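If the file may be too large to load comfortably, another option is to stream it into a temporary file and swap that in afterwards. This is only a sketch: delete_line is a hypothetical helper, it assumes 1-based line numbers as in the question, and the demonstration writes to a throwaway directory rather than to a real file of yours.

```python
import os
import tempfile

def delete_line(path, num):
    """Rewrite the file at path without its num-th line (1-based)."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path) as src, tempfile.NamedTemporaryFile(
            'w', dir=directory, delete=False) as dst:
        for current, line in enumerate(src, 1):
            if current != num:
                dst.write(line)
    os.replace(dst.name, path)  # swap the rewritten copy into place

# Small demonstration on a throwaway file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.txt")
with open(demo_path, 'w') as f:
    f.write("the bird flew\nthe dog barked\nthe cat meowed\n")
delete_line(demo_path, 2)
with open(demo_path) as f:
    remaining = f.read()
print(remaining)
```

Creating the temporary file in the same directory as the target keeps os.replace on one filesystem, so the original is either fully replaced or left untouched.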

Python: Copying lines that meet requirements

So, basically, I need a program that opens a .dat file, checks each line to see if it meets certain prerequisites, and if they do, copy them into a new csv file.
The prerequisites are that it must 1) contain "$W" or "$S" and 2) have the last value at the end of the line of the DAT say one of a long list of acceptable terms. (I can simply make-up a list of terms and hardcode them into a list)
For example, if the CSV was a list of purchase information and the last item was what was purchased, I only want to include fruit. In this case, the last item is an ID Tag, and I only want to accept a handful of ID Tags; there is a list of about 5 acceptable tags. The Tags vary a lot in length, but they are always the last item in the list (and always the 4th item on the list).
Let me give a better example, again with the fruit.
My original .DAT might be:
DGH$G$H $2.53 London_Port Gyro
DGH.$WFFT$Q5632 $33.54 55n39 Barkdust
UYKJ$S.52UE $23.57 22#3 Apple
WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana
Only the line "UYKJ$S.52UE $23.57 22#3 Apple" would be copied, because only it has both 1) $W or $S (in this case a $S) and 2) a fruit as the last item. Once the .csv file is made, I am going to need to go back through it and replace all the spaces with commas, but that's not nearly as problematic for me as figuring out how to scan each line for requirements and only copy the ones that are wanted.
I am making a few programs all very similar to this one, that open .dat files, check each line to see if they meet requirements, and then decides to copy them to the new file or not. But sadly, I have no idea what I am doing. They are all similar enough that once I figure out how to make one, the rest will be easy, though.
EDIT: The .DAT files are a few thousand lines long, if that matters at all.
EDIT2: Some of my current code snippets
Right now, my current version is this:
def main():
    #NewFile_Loc = C:\Users\J18509\Documents
    OldFile_Loc=raw_input("Input File for MCLG:")
    OldFile = open(OldFile_Loc,"r")
    OldText = OldFile.read()
    # for i in range(0, len(OldText)):
    #     if (OldText[i] != " "):
    #         print OldText[i]
    i = split_line(OldText)
    if u'$S' in i:
        # $S is in the line
        print i
main()
But it's very choppy still. I'm just learning python.
Brief update: the server I am working on is down, and might be for the next few hours, but I have my new code, which has syntax errors in it, but here it is anyways. I'll update again once I get it working. Thanks a bunch everyone!
import os
NewFilePath = "A:\test.txt"
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,','Name Header,','Header 3,','Header 4)
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        if (LineParts[0].find($W)) or (LineParts[0].find($S)):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1], ' is accepted')
                #This Line is acceptable!
                NewFile.write(LineParts[1],',',LineParts[0],',',LineParts[2],',',LineParts[3])
    OldFile.close()
    NewFile.close()
main()
There are two parts you need to implement. First, read the file line by line and write out the lines that meet specific criteria. This is done by
with open('file.dat') as f:
    for line in f:
        stripped = line.strip() # remove '\n' from the end of the line
        if test_line(stripped):
            print(stripped) # Write to stdout
The criteria you want to check for are implemented in the function test_line. To check for the occurrence of "$W" or "$S", you can simply use the in-Operator like
if not '$W' in line and not '$S' in line:
    return False
else:
    return True
To check, if the last item in the line is contained in a fixed list, first split the line using split(), then take the last item using the index notation [-1] (negative indices count from the end of a sequence) and then use the in operator again against your fixed list. This looks like
items = line.split() # items is an array of strings
last_item = items[-1] # take the last element of the array
if last_item in ['Apple', 'Banana']:
    return True
else:
    return False
Now, you combine these two parts into the test_line function like
def test_line(line):
    if not '$W' in line and not '$S' in line:
        return False
    items = line.split() # items is an array of strings
    last_item = items[-1] # take the last element of the array
    if last_item in ['Apple', 'Banana']:
        return True
    else:
        return False
Note that the program writes the result to stdout, which you can easily redirect. If you want to write the output to a file, have a look at Correct way to write line to file in Python
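Since the goal is a CSV file rather than stdout, the filter can also be paired with the standard csv module, which takes care of turning the whitespace-separated fields into comma-separated ones. A sketch under the question's assumptions (whitespace-separated fields, tag in the last field); copy_matching, ACCEPTED_TAGS and the in-memory streams are illustrative names, not part of the original code.

```python
import csv
import io

ACCEPTED_TAGS = {'Apple', 'Banana'}  # assumed list of acceptable tags

def copy_matching(src, dst, accepted=ACCEPTED_TAGS):
    """Write rows containing $W or $S whose last field is an accepted tag."""
    writer = csv.writer(dst, lineterminator='\n')
    for line in src:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if ('$W' in line or '$S' in line) and fields[-1] in accepted:
            writer.writerow(fields)

source = io.StringIO(
    "DGH$G$H $2.53 London_Port Gyro\n"
    "DGH.$WFFT$Q5632 $33.54 55n39 Barkdust\n"
    "UYKJ$S.52UE $23.57 22#3 Apple\n"
    "WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana\n"
)
target = io.StringIO()
copy_matching(source, target)
result = target.getvalue()
print(result)
```

With real files, src and dst would be the objects returned by open(); for the output file, the csv documentation recommends opening it with newline=''.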
inlineRequirements = ['$W','$S']
endlineRequirements = ['Apple','Banana']
inputFile = open(input_filename,'r')
outputFile = open(output_filename,'w')
for line in inputFile:
    line = line.strip()
    #trailing and leading whitespace has been removed
    if any(req in line for req in inlineRequirements):
        #passed inline requirement
        lastWord = line.split(' ')[-1]
        if lastWord in endlineRequirements:
            #passed endline requirement
            outputFile.write(line.replace(' ',',') + '\n')
            #replaced spaces with commas and wrote to file, one row per line
inputFile.close()
outputFile.close()
tags = ['apple', 'banana']
match = ['$W', '$S']
OldFile_Loc = input("Input File for MCLG:")
with open(OldFile_Loc, "r") as OldFile:
    for line in OldFile: # Loop through the file
        line = line.strip() # Remove the newline and whitespace
        if line: # If the line isn't empty
            lparts = line.split() # Split the line
            if any(tag.lower() == lparts[-1].lower() for tag in tags) and any(c in line for c in match):
                # $S or $W is in the line AND the last section is in tags (case insensitive)
                print(line)
import re
list_of_fruits = ["Apple","Banana",...]
with open('some.dat') as f:
    for line in f:
        if re.findall(r"\$[SW]",line) and line.split()[-1] in list_of_fruits:
            print("Found:%s" % line)
import os
NewFilePath = r"A:\test.txt"  # raw string, so \t is not read as a tab
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open(NewFilePath, 'w')
    NewFile.write('Header 1,Name Header,Header 3,Header 4\n')
    OldFile_Loc = input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        # "in" is used because .find() returns -1 (which is truthy) when nothing is found
        if ('$W' in LineParts[0]) or ('$S' in LineParts[0]):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1], ' is accepted')
                #This Line is acceptable!
                NewFile.write(','.join([LineParts[1], LineParts[0], LineParts[2], LineParts[3]]) + '\n')
    OldFile.close()
    NewFile.close()
main()
This worked great, and has all the capabilities I needed. The other answers are good, but none of them do 100% of what I needed like this one does.
