Finding missing lines in file - python

I have a 7000+ line .txt file, where each line contains a description and an ordered path to an image. Example:
abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png
abnormal /Users/alex/Documents/X-ray-classification/data/images/2.png
normal /Users/alex/Documents/X-ray-classification/data/images/3.png
normal /Users/alex/Documents/X-ray-classification/data/images/4.png
Some lines are missing. I want to somehow automate the search for missing lines. Intuitively, I wrote:
f = open("data.txt", 'r')
lines = f.readlines()
num = 1
for line in lines:
    if num in line:
        continue
    else:
        print (line)
    num+=1
But of course it didn't work, since lines are strings.
Is there any elegant way to sort this out? Using regex maybe?
Thanks in advance!

The following should hopefully work - it grabs the number out of the filename, checks whether it is more than 1 higher than the previous number, and if so, works out all the 'in-between' numbers and prints them. Printing the number (and reconstructing the filename later) is necessary because line will never contain the names of missing files during iteration.
# Set this to the first number in the series -1
num = lastnum = 0
with open("data.txt", 'r') as f:
    for line in f:
        # Pick the digit out of the filename
        num = int(''.join(x for x in line if x.isdigit()))
        if num - lastnum > 1:
            for i in range(lastnum+1, num):
                print("Missing: {}.png".format(str(i)))
        lastnum = num
The main advantage of working this way is that as long as your files are sorted in the list, it can handle starting at numbers other than 1, and also reports more than one missing number in the sequence.
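One caveat: joining every digit on the line would also pick up digits appearing elsewhere in the path. A minimal sketch of a more defensive variant (same idea, but pulling the number from just before the .png extension with a regex; the data.txt layout is assumed to be as above):

import re

lastnum = 0
with open("data.txt", 'r') as f:
    for line in f:
        m = re.search(r'(\d+)\.png\s*$', line)  # the number immediately before the .png extension
        if not m:
            continue
        num = int(m.group(1))
        for i in range(lastnum + 1, num):
            print("Missing: {}.png".format(i))
        lastnum = num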

You can try this:
lines = ["abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png","normal /Users/alex/Documents/X-ray-classification/data/images/3.png","normal /Users/alex/Documents/X-ray-classification/data/images/4.png"]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
    if str(num) not in lines[i]:
        missing.append(num)
    else:
        i += 1
print(missing)
Or if you want to check for the line ending with XXX.png:
lines = ["abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png","normal /Users/alex/Documents/X-ray-classification/data/images/3.png","normal /Users/alex/Documents/X-ray-classification/data/images/4.png"]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
    if not lines[i].endswith(str(num) + ".png"):
        missing.append(num)
    else:
        i += 1
print(missing)
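A small caveat with both versions: if the missing numbers fall at the end of the range, i can run past the end of lines and raise an IndexError. A bounds check avoids that; a sketch reusing lines and maxvalue from above:

missing = []
i = 0
for num in range(1, maxvalue + 1):
    if i < len(lines) and lines[i].endswith(str(num) + ".png"):
        i += 1  # this number is present; advance to the next stored line
    else:
        missing.append(num)
print(missing)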

IndexError: list index out of range PYTHON (w/o using for loop)

I am trying to find averages from a text file. The text file has columns of numbers and I want to find the average of each column. I get the following error:
IndexError: list index out of range
The code I am using is:
import os
os.chdir(r"path of my file")
file_open = open("name of my file", "r")
file_write = open ("average.txt", "w")
line = file_open.readlines()
list_of_lines = []
length = len(list_of_lines[0])
total = 0
for i in line:
    values = i.split('\t')
    list_of_lines.append(values)
count = 0
for j in list_of_lines:
    count +=1
for k in range(0,count):
    print k
    list_of_lines[k].remove('\n')
for o in range(0,count):
    for p in range(0,length):
        print list_of_lines[p][o]
        number = int(list_of_lines[p][o])
        total + number
    average = total/count
    print average
The error is in line
length = len(list_of_lines[0])
Please let me know if I can provide any more information.
The issue is that you are trying to get the length of an element inside the list, not of the list itself.
Try this:
length = len(list_of_lines)
You wrote length = len(list_of_lines[0])
list_of_lines is defined right above this line, as a list with 0 items in it. As a result, you cannot select the first item (index number 0) because index number 0 does not exist. Therefore, it is out of range.
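For what it's worth, here is a minimal sketch of the column-average task itself (assuming a tab-separated file with the same number of columns on every line; the filenames are the placeholders from the question):

# a sketch only, not the original code: average each tab-separated column
with open("name of my file", "r") as file_open:  # placeholder filename
    rows = [line.split('\t') for line in file_open if line.strip()]

length = len(rows[0])  # number of columns, taken from the first row
with open("average.txt", "w") as file_write:
    for col in range(length):
        values = [float(row[col]) for row in rows]  # float() tolerates a trailing '\n'
        file_write.write("column {}: {}\n".format(col, sum(values) / len(values)))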

Python: read line after string is found

I have a file which contains blocks of lines that I would like to separate. Each block contains a number identifier in the block's header: "Block X" is the header line for the X-th block of lines. Like this:
Block X
#L E C A F X M N
11.2145 15 27 29.444444 7.6025229 1539742 29.419783
11.21451 13 28 24.607143 6.8247935 1596787 24.586264
...
Block Y
#L E C A F X M N
11.2145 15 27 29.444444 7.6025229 1539742 29.419783
11.21451 13 28 24.607143 6.8247935 1596787 24.586264
...
I can use "enumerate" to find the header line of the block as follows:
with open(filename,'r') as indata:
    for num, line in enumerate(indata):
        if 'Block X' in line:
            startblock=num
            print startblock
This will yield the line number of the first line of block #X.
However, my problem is identifying the last line of the block. To do that, I could find the next occurrence of a header line (i.e., the next block) and subtract a few numbers.
My question: how can I find the line number of the next occurrence of a condition (i.e., right after a certain condition was met)?
I tried using enumerate again, this time indicating the starting value, like this:
with open(filename,'r') as indata:
    for num, line in enumerate(indata,startblock):
        if 'Block Y ' in line:
            endscan=num
            break
print endscan
That doesn't work, because it still begins reading the file from line 0, NOT from line number "startblock". Starting the "enumerate" counter at a different number only shifts the resulting value of the counter, so in this case "endscan" ends up offset from 0 by the amount "startblock".
Please help! How can I tell Python to disregard the lines before "startblock"?
If you want the groups using Block as the delimiter for each section, you can use itertools.groupby:
from itertools import groupby
with open('test.txt') as f:
    grp = groupby(f,key=lambda x: x.startswith("Block "))
    for k,v in grp:
        if k:
            print(list(v) + list(next(grp, ("", ""))[1]))
Output:
['Block X\n', '#L E C A F X M N \n', '11.2145 15 27 29.444444 7.6025229 1539742 29.419783\n', '11.21451 13 28 24.607143 6.8247935 1596787 24.586264\n']
['Block Y\n', '#L E C A F X M N \n', '11.2145 15 27 29.444444 7.6025229 1539742 29.419783\n', '11.21451 13 28 24.607143 6.8247935 1596787 24.586264']
If Block can appear elsewhere but you want it only when followed by a space and a single char:
import re
with open('test.txt') as f:
    r = re.compile(r"^Block \w$")
    grp = groupby(f, key=lambda x: r.search(x))
    for k, v in grp:
        if k:
            print(list(v) + list(next(grp, ("", ""))[1]))
You can use the .tell() and .seek() methods of file objects to move around. So for example:
with open(filename, 'r') as infile:
    start = infile.tell()
    end = 0
    for line in infile:
        if line.startswith('Block'):
            end = infile.tell()
            infile.seek(start)
            # print all the bytes in the block
            print infile.read(end - start)
            # now go back to where we were so we iterate correctly
            infile.seek(end)
            # we finished a block, mark the start
            start = end
If the difference between the header lines is uniform throughout the file, just use the distance to increase the indexing variable accordingly.
file1 = open('file_name','r')
lines = file1.readlines()
numlines = len(lines)
i=0
for line in lines:
    if line == 'specific header 1':
        line_num1 = i
    if line == 'specific header 2':
        line_num2 = i
    i+=1
diff = line_num2 - line_num1
Now that we know the difference between the line numbers we use for loops to acquire the data.
import numpy as np

k=0
array = np.zeros([numlines, diff])
for i in range(numlines):
    if k % diff == 0:
        for j in range(diff):
            array[i][j] = lines[i+j]
    k+=1
% is the mod operator, which returns 0 only when k is a multiple of the difference in line numbers between the two header lines in the file, which will only occur when the line corresponds to a header line. Once that line is found, we go on to the inner for loop that fills the array, so that we have a matrix with numlines rows and diff columns. The nonzero rows will contain the data between the header lines.
I have not tried this out, I am just writing off the top of my head. Hopefully it helps!
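As for the original question of skipping the first lines up to "startblock", a minimal sketch using itertools.islice (reusing filename and startblock from the question) could look like this:

from itertools import islice

with open(filename, 'r') as indata:
    # islice skips the first `startblock` lines; enumerate then counts on from that same line number
    for num, line in enumerate(islice(indata, startblock, None), startblock):
        if 'Block Y' in line:
            endscan = num
            break
print(endscan)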

Calculating average of the numbers in a .txt file using Python [duplicate]

I have a text file that contains numbers like this:
a: 0.8475
b: 0.6178
c: 0.6961
d: 0.7565
e: 0.7626
f: 0.7556
g: 0.7605
h: 0.6932
i: 0.7558
j: 0.6526
I want to extract only the floating point numbers from this file and calculate the average. Here is my program so far:
fh = file.open('abc.txt')
for line in fh:
    line_pos = line.find(':')
    line = line[line_pos+1:]
    line = line.rstrip()
    sum = 0
    average = 0
    for ln in line:
        sum = sum + ln
    average = sum / len(line)
    print average
Can anyone tell me what is wrong with this code? Thanks.
You have the sum addition in the wrong place, and you need to keep track of the number of lines, since you can't send a file object to len(). You will also have to cast the strings to floats. I'd recommend simply splitting on whitespace, as well. Finally, use the with construct to automatically close the file:
with open('abc.txt') as fh:
    sum = 0 # initialize here, outside the loop
    count = 0 # and a line counter
    for line in fh:
        count += 1 # increment the counter
        sum += float(line.split()[1]) # add here, not in a nested loop
    average = sum / count
    print average
Convert the line to float to do numeric addition.
Initialize sum once (before the loop begins). Calculate the average once, after the loop.
len(line) will give you the wrong number: it counts the digits, plus the newline character for the last number.
Try to avoid using str.find + slicing. Using str.split is more readable.
with open('abc.txt') as fh:
    sum = 0
    numlines = 0
    for line in fh:
        n = line.split(':')[-1]
        sum += float(n)
        numlines += 1
    average = sum / numlines
    print average
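For completeness, the same average can also be computed with a single list comprehension (assuming the same abc.txt format as above):

with open('abc.txt') as fh:
    values = [float(line.split(':')[1]) for line in fh if ':' in line]
print(sum(values) / len(values))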

Reorganizing data

I have to input a text file that contains comma-separated and line-separated data in the following format:
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11,04:00:00,REGULAR,003169415,001097588,05-21-11,08:00:00,REGULAR,003169431,001097607
Multiple sets of such data are present in the text file.
I need to print all this on new lines with the condition:
the first 3 elements of every set, followed by 5 parameters, on each new line. So the solution for the above set would be:
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585
A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,003169415,001097588
A002,R051,02-00-00,05-21-11,08:00:00,REGULAR,003169431,001097607
My function to achieve it is given below:
def fix_turnstile_data(filenames):
    for name in filenames:
        f_in = open(name, 'r')
        reader_in = csv.reader(f_in, delimiter = ',')
        f_out = open('updated_' + name, 'w')
        writer_out = csv.writer(f_out, delimiter = ',')
        array=[]
        for line in reader_in:
            i = 0
            j = -1
            while i < len(line):
                if i % 8 == 0:
                    i+=2
                    j+=1
                    del array[:]
                    array.append(line[0])
                    array.append(line[1])
                    array.append(line[2])
                elif (i+1) % 8 == 0:
                    array.append(line[i-3*j])
                    writer_out.writerow(array)
                else:
                    array.append(line[i-3*j])
                i+=1
        f_in.close()
        f_out.close()
The output is wrong and there is a space of 3 lines at the end of those lines whose length is 8. I suspect it might be the writer_out.writerow(array) which is to blame.
Can anyone please help me out?
Hmm, the logic you use ends up being fairly confusing. I'd do it more along these lines (this replaces your for loop), and this is more Pythonic:
for line in reader_in:
    header = line[:3]
    for i in xrange(3, len(line), 5):
        writer_out.writerow(header + line[i:i+5])
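To see how that slicing walks the rest of the row in chunks of five, here is a small standalone sketch (using a shortened, made-up row rather than the real file):

line = ['A002', 'R051', '02-00-00',
        '05-21-11', '00:00:00', 'REGULAR', '003169391', '001097585',
        '05-21-11', '04:00:00', 'REGULAR', '003169415', '001097588']
header = line[:3]
for i in range(3, len(line), 5):
    print(header + line[i:i+5])
# prints two 8-element lists, each starting with A002, R051, 02-00-00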

How to add specific lines from a file into List in Python?

I have an input file:
3
PPP
TTT
QPQ
TQT
QTT
PQP
QQQ
TXT
PRP
I want to read this file and group these cases into proper boards.
To read the Count (no. of boards) I have this code:
board = []
count =''
def readcount():
fp = open("input.txt")
for i, line in enumerate(fp):
if i == 0:
count = int(line)
break
fp.close()
But I don't have any idea how to parse these blocks into a List:
TQT
QTT
PQP
I tried using
def readboard():
    fp = open('input.txt')
    for c in (1, count): # To Run loop to total no. of boards available
        for k in (c+1, c+3): #To group the boards into board[]
            board[c].append(fp.readlines)
But that's the wrong way. I know the basics of Lists, but here I am not able to parse the file.
These boards are on lines 2 to 4, 6 to 8 and so on. How do I get them into Lists?
I want to parse these into Count and Boards so that I can process them further.
Please suggest.
I don't know if I understand your desired outcome. I think you want a list of lists.
Assuming that you want boards to be:
[[data,data,data],[data,data,data],[data,data,data]], then you would need to define how to parse your input file... specifically:
line 1 is the count number
data is entered per line
boards are separated by white space.
If that is the case, this should parse your files correctly:
board = []
count = 0
currentBoard = 0
fp = open('input.txt')
for i,line in enumerate(fp.readlines()):
    if i == 0:
        count = int(line)
        board.append([])
    else:
        if len(line[:-1]) == 0:
            currentBoard += 1
            board.append([])
        else: #this has board data
            board[currentBoard].append(line[:-1])
fp.close()
import pprint
pprint.pprint(board)
If my assumptions are wrong, then this can be modified to accommodate.
Personally, I would use a dictionary (or ordered dict) and get the count from len(boards):
from collections import OrderedDict
currentBoard = 0
board = {}
board[currentBoard] = []
fp = open('input.txt')
lines = fp.readlines()
fp.close()
for line in lines[1:]:
    if len(line[:-1]) == 0:
        currentBoard += 1
        board[currentBoard] = []
    else:
        board[currentBoard].append(line[:-1])
count = len(board)
print(count)
import pprint
pprint.pprint(board)
If you just want to take specific line numbers and put them into a list:
line_nums = [3, 4, 5, 1]
fp = open('input.txt')
selected = [line for i, line in enumerate(fp) if i in line_nums]
fp.close()
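Note that enumerate is zero-based, so the values in line_nums count from 0. A minimal sketch matching the question's "lines 2 to 4" (assuming the same input.txt) would be:

fp = open('input.txt')
first_board = [line.rstrip('\n') for i, line in enumerate(fp) if 1 <= i <= 3]  # file lines 2-4
fp.close()
print(first_board)  # ['PPP', 'TTT', 'QPQ'] for the sample input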
