Python - how to navigate through text file multiple lines backwards using seek()?

What I'm trying to do is match a phrase in a text file, then print that line (this works fine). I then need to move the cursor up 4 lines so I can do another regex search on that line, but I can't get the seek() method to move up 4 lines from the line that was matched. All I can seem to do with seek() is seek from the very end of the file or from the beginning. It doesn't seem to let me just do seek(105,1) from the line that is matched.
### This is the example test.txt
This is 1st line
This is 2nd line # Needs to seek() to this line from the 6th line. This needs to be dynamic as it wont always be 4 lines.
This is 3rd line
This is 4th line
This is 5th line
This is 6th line # Matches this line, now need to move it up 4 lines to the "2nd line"
This is 7 line
This is 8 line
This is 9 line
This is 10 line
#
def Findmatch():
    file = open("test.txt", "r")
    print file.tell() # shows 0 which is the beginning of the file
    string = file.readlines()
    for line in string:
        if "This is 6th line" in line:
            print line
            print file.tell() # shows 171 which is the end of the file. I need for it to be on the line that matches my search which should be around 108. seek() only lets me search from end or beginning of file, but not from the line that was matched.
Findmatch()

Since you've read the whole file into memory at once with file.readlines(), tell() correctly points to the end of the file, and you already have all your lines in a list. If you still wanted to use seek(), you'd have to read the file line by line and record the position of each line start, so that you could go back four lines.
For the problem you describe, you can first find the index of the first matching line and then do the second operation starting from the list slice four items before it.
Here is a very rough example of that (return None isn't really needed; it's there for verbosity, to state the intent/expected behavior clearly. Raising an exception might be just as desirable, depending on the overall plan):
def relevant(value, lines):
    found = False
    for (idx, line) in enumerate(lines):
        if value in line:
            found = True
            break # Stop iterating, last idx is a match.
    if found is True:
        idx = idx - 4
        if idx < 0:
            idx = 0 # Just return all lines up to now? Or was that broken input and fail?
        return lines[idx:]
    else:
        return None
with open("test.txt") as in_file:
    lines = in_file.readlines()
    print(''.join(relevant("This is 6th line", lines)))
Please also note: it's a bit confusing to name a list of lines string (one would expect a str there); go with lines or something similar. It's also not advisable (especially since you indicate you're using 2.7) to reuse the names of built-ins, like file, for your variables. Use in_file, for instance.
EDIT: As requested in a comment, here is an example that just prints. I'm adding it alongside the former, as that one seems potentially more useful for further extension. :)
def print_relevant(value, lines):
    found = False
    for (idx, line) in enumerate(lines):
        if value in line:
            found = True
            print(line.rstrip('\n'))
            break # Stop iterating, last idx is a match.
    if found is True:
        idx = idx - 4
        if idx < 0:
            idx = 0 # Just return all lines up to now? Or was that broken input and fail?
        print(lines[idx].rstrip('\n'))
with open("test.txt") as in_file:
    lines = in_file.readlines()
    print_relevant("This is 6th line", lines)
Note, since lines are read in with trailing newlines and print would add one of its own I've rstrip'ed the line before printing. Just be aware of it.
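The line-by-line alternative mentioned at the top of this answer (recording where each line starts so that you can seek() back) could look roughly like the sketch below. It is written for Python 3, opens the file in binary mode so that tell() and seek() work with plain byte offsets, and assumes UTF-8 text. The function name line_above_match and its parameters are made up for illustration.

```python
def line_above_match(path, needle, lines_back=4):
    """Return the line `lines_back` lines above the first line containing `needle`.

    Hypothetical helper: returns None when nothing matches, and falls back to
    the first line when the match is fewer than `lines_back` lines into the file.
    """
    wanted = needle.encode("utf-8")
    offsets = []  # offsets[i] is the byte position where line i starts
    with open(path, "rb") as in_file:
        while True:
            offsets.append(in_file.tell())  # remember where this line begins
            line = in_file.readline()
            if not line:  # end of file, no match
                return None
            if wanted in line:
                target = max(len(offsets) - 1 - lines_back, 0)
                in_file.seek(offsets[target])  # jump back to the recorded start
                return in_file.readline().decode("utf-8").rstrip("\r\n")
```

For a file like the test.txt in the question, line_above_match("test.txt", "This is 6th line") would seek back to the start of the 2nd line and return it without the trailing newline.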

Related

How to get the last character in a file from Python?

I'm trying to set a variable to the last character of a file. I am using Python, and I'm fairly new to it. If it is of any importance, my code appends a random number between 2 and 9 to the end of an HTML file. In a separate function, I want to set a variable to the last character of the HTML file (the random number between 2 and 9), then delete that last character (so as not to affect how the HTML works). Does anyone know how I could do this? I can attach my code below if needed, but I chose not to, as it is 50 lines long and all 50 lines are needed for full context.

Try this.
The file "a.txt" contains the numbers 1, 3, 4, 5.
The code below reads the file and pulls out its last character.
file = open('a.txt','r')
lines = file.read()
print(lines[-1])
=> 5

Using @Jab's answer from the comment above, plus a few assumptions, we can produce a more efficient way to find the last character and replace it.
The assumptions are common and will most likely hold:
You know whether there is a newline character at the very end of the file, or whether the random number is truly the last character in the file (that is, you have accounted for trailing whitespace).
You know the encoding of the file. This is reasonable since almost all HTML is utf-8 (it can be utf-16), and since you are the one editing it, you will know. Most of the time the encoding won't even matter.
So, this is what we can do:
with open("test.txt", "rb+") as f: # binary mode does not take an encoding argument
    f.seek(-2, 2)
    # -1 or -2, may change depending on whitespace characters at end of the file
    var = f.read(1) # read one byte for a number
    f.seek(-1, 1)
    print("last character:", str(var, 'utf-8'))
    f.write(bytes('variable', 'utf-8')) # set whatever info here
    f.write(bytes('\n', 'utf-8')) # you may want a newline character at the end of the file
    f.truncate()
This is efficient because we don't have to iterate through the entire file. We touch only the last character, once to read it and once to write.
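Since the question asks to read the last character and then delete it (rather than overwrite it), a minimal sketch of that variant follows. It assumes the appended digit is a single ASCII byte and truly the last byte of the file (no trailing newline). pop_last_char is a hypothetical helper name, not something from the answer above.

```python
import os

def pop_last_char(path):
    """Remove the final byte of the file and return it as a one-character string.

    Assumption: the last byte is a single ASCII character such as a digit.
    Returns None for an empty file.
    """
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)   # jump to the end to learn the file size
        size = f.tell()
        if size == 0:
            return None
        f.seek(size - 1)         # position on the last byte
        last = f.read(1)
        f.truncate(size - 1)     # cut the file one byte shorter
    return last.decode("ascii")
```

Only the last byte is read, so the cost does not depend on the size of the HTML file.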

You can do something like this:
# Open the file to read and the file to write
with open('file.txt') as f_in, open('new_file.txt', 'w') as f_out:
    # Read all the lines to memory (you can't find the last line lazily)
    lines = f_in.readlines()
    # Iterate over every line
    for i, line in enumerate(lines):
        # If the current index is the last index (i.e. the last line)
        if i == len(lines) - 1:
            # Get the last character
            last_char = line[-1]
            # Write to the output file the line without the last character
            print(line[:-1], file=f_out, end='')
        else:
            # Write to the output file the line as it is
            print(line, file=f_out, end='')
# Print the removed char
print(last_char)
If you don't want to create a new file, you can load the whole file into memory, as we're already doing, and write it back:
# Read all the lines into memory
with open('file.txt') as f:
    lines = f.readlines()
# Replace the lines inside the list using the previous logic
for i, line in enumerate(lines):
    if i == len(lines) - 1:
        last_char = line[-1]
        lines[i] = line[:-1]
    else:
        lines[i] = line
# Write the changed lines to the same file
with open('file.txt', 'w+') as f:
    print(''.join(lines), file=f, end='')
# Print the removed char
print(last_char)

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken csv file so that I can work with it properly.
The file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that some notes have newlines in them, and when exporting the csv the tool does not add quotation marks to mark the note as a single string within the field.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re
with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line #keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
            print(previous.join(line))
The script shows no output and just finishes silently. Any thoughts?

I think I would go a slightly different way:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    all_the_data = f.readline() # keep the header line as it is; it does not end in a date
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = line.rstrip("\n") + " " # swap the newline for a space so the words don't run together
        all_the_data = "".join([all_the_data, line])
print (all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and a ;, take off the newline and append the line to all_the_data. That way you don't have to play with looking back 'up' the file. Again, there are lots of ways to do this. If you would rather use the logic of "starts with 3 letters and a ;" and look back, this works:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search("^[A-Za-z]{3};", line):
            all_the_data = re.sub("\n$", " ", all_the_data) # replace the previous line's newline with a space
        all_the_data = "".join([all_the_data, line])
print ("results:")
print (all_the_data)
Pretty much what was asked for. The logic is: if the current line doesn't start the right way, remove the previous line's newline from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com

The regex in your code matches every line in the file (even the continuation lines start with three letters), so the if condition is never true and nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(' '.join(join_words)) # join with a space so no extra field separators are introduced
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(" ".join(join_words))
I've tried not to use regex here, to keep it a little clearer. But regex is a better option.

A simple way would be to use a generator that acts as a filter on the original file. That filter concatenates a line to the previous one if it does not have a semicolon (;) as its 4th character. The code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3:4] == ';': # a slice, so lines shorter than 4 characters don't raise
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous # don't forget last line!
You could then use:
import csv
with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next function is applied to it, so a generator is appropriate.
But this is only a workaround; the correct fix would be for the previous step to produce a valid CSV file directly.
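To see the generator approach end to end, here is a small self-contained sketch that runs it on the sample data from the question, held in memory instead of in a file. It assumes Python 3; merge_continuations and SAMPLE are illustrative names, and the ; delimiter is taken from the example rows.

```python
import csv
import io

def merge_continuations(lines):
    """Yield logical records, gluing continuation lines onto the previous record.

    A line counts as the start of a record when its 4th character is ';'
    (a three-letter username followed by the separator).
    """
    previous = None
    for line in lines:
        if previous is None:
            previous = line          # first line (the header) starts the first record
        elif line[3:4] == ";":
            yield previous           # a new record begins, emit the finished one
            previous = line
        else:
            previous = previous.rstrip("\n") + " " + line
    if previous is not None:
        yield previous               # emit the final record

SAMPLE = (
    "user;case;hours;note;date;\n"
    "tnn;123;4;solved problem;2017-11-27;\n"
    "tnn;124;2;random comment;2017-11-27;\n"
    "tnn;125;3;I am writing a comment\n"
    "that contains new lines\n"
    "without quotation marks;2017-11-28;\n"
    "HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;\n"
)

rows = list(csv.DictReader(merge_continuations(io.StringIO(SAMPLE)), delimiter=";"))
for row in rows:
    print(row["case"], "->", row["note"])
```

Because every row ends with a trailing ;, DictReader also sees one extra empty column, which can simply be ignored.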

Python: Copying lines that meet requirements

So, basically, I need a program that opens a .dat file, checks each line to see if it meets certain prerequisites, and if they do, copy them into a new csv file.
The prerequisites are that the line must 1) contain "$W" or "$S" and 2) have, as the last value on the line of the DAT, one of a long list of acceptable terms. (I can simply make up the terms and hardcode them into a list.)
For example, if the CSV was a list of purchase information and the last item was what was purchased, I only want to include fruit. In this case, the last item is an ID Tag, and I only want to accept a handful of ID Tags; there is a list of about 5 acceptable tags. The Tags vary a lot in length, but they are always the last item on the line (and always the 4th item).
Let me give a better example, again with the fruit.
My original .DAT might be:
DGH$G$H $2.53 London_Port Gyro
DGH.$WFFT$Q5632 $33.54 55n39 Barkdust
UYKJ$S.52UE $23.57 22#3 Apple
WSIAJSM_33$4.FJ4 $223.4 Ha25%ek Banana
Only the line "UYKJ$S.52UE $23.57 22#3 Apple" would be copied, because only it has both 1) $W or $S (in this case a $S) and 2) a fruit as the last item. Once the .csv file is made, I am going to need to go back through it and replace all the spaces with commas, but that's not nearly as problematic for me as figuring out how to scan each line for requirements and only copy the ones that are wanted.
I am making a few programs all very similar to this one, which open .dat files, check each line to see if it meets requirements, and then decide whether to copy it to the new file. But sadly, I have no idea what I am doing. They are all similar enough that once I figure out how to make one, the rest will be easy, though.
EDIT: The .DAT files are a few thousand lines long, if that matters at all.
EDIT2: Some of my current code snippets.
Right now, my current version is this:
def main():
    #NewFile_Loc = C:\Users\J18509\Documents
    OldFile_Loc=raw_input("Input File for MCLG:")
    OldFile = open(OldFile_Loc,"r")
    OldText = OldFile.read()
    # for i in range(0, len(OldText)):
    #     if (OldText[i] != " "):
    #         print OldText[i]
    i = split_line(OldText)
    if u'$S' in i:
        # $S is in the line
        print i
main()
But it's very choppy still. I'm just learning python.
Brief update: the server I am working on is down, and might be for the next few hours, but I have my new code, which has syntax errors in it, but here it is anyways. I'll update again once I get it working. Thanks a bunch everyone!
import os
NewFilePath = "A:\test.txt"
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,','Name Header,','Header 3,','Header 4)
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        if (LineParts[0].find($W)) or (LineParts[0].find($S)):
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1], ' is accepted')
                #This Line is acceptable!
                NewFile.write(LineParts[1],',',LineParts[0],',',LineParts[2],',',LineParts[3])
    OldFile.close()
    NewFile.close()
main()

There are two parts you need to implement. First, read the file line by line and write out the lines that meet specific criteria. This is done by
with open('file.dat') as f:
    for line in f:
        stripped = line.strip() # remove '\n' from the end of the line
        if test_line(stripped):
            print(stripped) # Write to stdout
The criteria you want to check for are implemented in the function test_line. To check for the occurrence of "$W" or "$S", you can simply use the in operator, like
if '$W' not in line and '$S' not in line:
    return False
else:
    return True
To check whether the last item in the line is contained in a fixed list, first split the line using split(), then take the last item using the index notation [-1] (negative indices count from the end of a sequence), and then use the in operator again against your fixed list. This looks like
items = line.split() # items is a list of strings
last_item = items[-1] # take the last element of the list
if last_item in ['Apple', 'Banana']:
    return True
else:
    return False
Now, you combine these two parts into the test_line function like
def test_line(line):
    if '$W' not in line and '$S' not in line:
        return False
    items = line.split() # items is a list of strings
    last_item = items[-1] # take the last element of the list
    if last_item in ['Apple', 'Banana']:
        return True
    else:
        return False
Note that the program writes the result to stdout, which you can easily redirect. If you want to write the output to a file, have a look at "Correct way to write line to file in Python".
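If you would rather write the accepted lines straight to a .csv file instead of redirecting stdout, the two checks above can be combined with the csv module, roughly as in the sketch below. It assumes Python 3 and whitespace-separated fields; line_is_wanted mirrors test_line, and copy_matching_lines, ACCEPTED_TAGS and MARKERS are names invented for this example.

```python
import csv

ACCEPTED_TAGS = {"Apple", "Banana"}   # assumed list of acceptable last items
MARKERS = ("$W", "$S")                # a line must contain at least one of these

def line_is_wanted(line):
    """Same two checks as test_line above, in condensed form."""
    if not any(marker in line for marker in MARKERS):
        return False
    items = line.split()
    return bool(items) and items[-1] in ACCEPTED_TAGS

def copy_matching_lines(src_path, dst_path):
    """Copy accepted lines to dst_path as comma-separated rows; return how many."""
    copied = 0
    # newline="" lets the csv module control line endings itself
    with open(src_path) as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            if line_is_wanted(line):
                writer.writerow(line.split())  # one CSV field per whitespace-separated item
                copied += 1
    return copied
```

Letting csv.writer build each row also takes care of the "replace spaces with commas" step, so no second pass over the output is needed.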

inlineRequirements = ['$W','$S']
endlineRequirements = ['Apple','Banana']
inputFile = open(input_filename,'r')
outputFile = open(output_filename,'w')
for line in inputFile:
    line = line.strip()
    #trailing and leading whitespace has been removed
    if any(req in line for req in inlineRequirements):
        #passed inline requirement
        lastWord = line.split(' ')[-1]
        if lastWord in endlineRequirements:
            #passed endline requirement
            outputFile.write(line.replace(' ',',') + '\n')
            #replaced spaces with commas and wrote to file (strip() removed the newline, so it is added back)
inputFile.close()
outputFile.close()

tags = ['apple', 'banana']
match = ['$W', '$S']
OldFile_Loc=raw_input("Input File for MCLG:")
OldFile = open(OldFile_Loc,"r")
for line in OldFile.readlines(): # Loop through the file
    line = line.strip() # Remove the newline and whitespace
    if line and not line.isspace(): # If the line isn't empty
        lparts = line.split() # Split the line
        if any(tag.lower() == lparts[-1].lower() for tag in tags) and any(c in line for c in match):
            # $S or $W is in the line AND the last section is in tags(case insensitive)
            print line

import re
list_of_fruits = ["Apple","Banana",...]
with open('some.dat') as f:
    for line in f:
        if re.search(r"\$[SW]", line) and line.split()[-1] in list_of_fruits:
            print "Found:%s" % line

import os
NewFilePath = r"A:\test.txt" # raw string, so "\t" is not read as a tab
Acceptable_Values = ('Apple','Banana')
#Main
def main():
    if os.path.isfile(NewFilePath):
        os.remove(NewFilePath)
    NewFile = open (NewFilePath, 'w')
    NewFile.write('Header 1,Name Header,Header 3,Header 4\n')
    OldFile_Loc=raw_input("Input File for Program:")
    OldFile = open(OldFile_Loc,"r")
    for line in OldFile:
        LineParts = line.split()
        if len(LineParts) < 4:
            continue # skip blank or short lines
        # find() returns -1 (which is truthy) when nothing is found, so test with "in" instead
        if '$W' in LineParts[0] or '$S' in LineParts[0]:
            if LineParts[3] in Acceptable_Values:
                print(LineParts[1] + ' is accepted')
                #This Line is acceptable!
                NewFile.write(','.join([LineParts[1], LineParts[0], LineParts[2], LineParts[3]]) + '\n')
    OldFile.close()
    NewFile.close()
main()
This worked great, and has all the capabilities I needed. The other answers are good, but none of them do 100% of what I needed like this one does.

Efficiently reading a certain line in a file

I came across some different methods for reading files in Python, and I was wondering which is the fastest.
For example, to read the last line of a file, one can do
input_file = open('mytext.txt', 'r')
lastLine = ""
for line in input_file:
    lastLine = line
print lastLine # This is the last line
Or
fileHandle = open('mytext.txt', 'r')
lineList = fileHandle.readlines()
print lineList[-1] #This is the last line
I'm assuming that for this particular case, discussing efficiency may not really be relevant...
Question:
1. Which method is faster for picking a random line?
2. Can we deal with concepts like "SEEK" in Python (and if so, is it faster)?

If you don't need a uniform distribution (i.e. it's okay that not all lines have an equal chance of being picked) and/or if your lines are all about the same length, then the problem of picking a random line can be simplified to:
1. Determine the size of the file in bytes.
2. Seek to a random position.
3. Search backwards for the previous newline character, if any (there is none if there's no preceding line).
4. Pick all text up to the next newline character or the end of the file, whichever comes first.
For the backwards search, you make an educated guess at how far you have to go to find the previous newline. If you can tell that a line is n bytes on average, then you could read the previous n bytes in a single step.
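The steps above can be sketched as follows. This is a simplified illustration for Python 3, not a tuned implementation: it walks backwards one byte at a time instead of reading an estimated n bytes in one step, it assumes UTF-8 text, and random_line and its rng parameter are names chosen for the example. As noted above, longer lines are more likely to be picked.

```python
import os
import random

def random_line(path, rng=random):
    """Return one line from the file, chosen by seeking to a random byte offset."""
    size = os.path.getsize(path)              # size of the file in bytes
    if size == 0:
        return None
    with open(path, "rb") as f:
        pos = rng.randrange(size)             # random position inside the file
        # Walk backwards until the byte just before pos is a newline
        # (or until the beginning of the file is reached).
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1
        f.seek(pos)                           # pos is now the start of a line
        # readline() stops at the next newline or at the end of the file
        return f.readline().decode("utf-8").rstrip("\r\n")
```

Only a small part of the file is read per call, which is where the speed advantage over readlines() comes from on large files.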

I had this problem a few days ago and used this solution. It is similar to @Frerich Raabe's, but with no randomness, just logic :)
def get_last_line(f):
    """ f is a file object in read mode (binary mode on Python 3, which cannot seek from the end of a text file); I just extracted the algorithm from a bigger function """
    f.seek(0, 2) # go to the end of the file to learn its size
    fsize = f.tell()
    tries = 0
    offs = -512
    while tries < 5:
        # Put the cursor n*512 characters before the end.
        # If that exceeds the file size, put the cursor at the beginning (fsize * -1 means move the cursor by -fsize from the end)
        f.seek(max(fsize * -1, offs), 2)
        lines = f.readlines()
        if len(lines) > 1: # If more than 1 line is found, then we have the last complete line
            return lines[-1] # Returns the last complete line
        offs *= 2
        tries += 1
    raise ValueError("No end line found after 5 tries (your file may have only 1 line, or the last line is longer than %s characters)" % (-offs // 2))
The tries counter prevents an endless loop if the file has only one line (or a very, very long last line). The algorithm tries to get the last line from the last 512 characters, then 1024, 2048... and stops if there's still no complete line at the 5th iteration.

Update iteration value in Python for loop

Pretty new to Python and have been writing a script to pick out certain lines of a basic log file.
Basically, the function searches the lines of the file, and when it finds one I want to output to a separate file, it adds that line to a list, then also adds the next five lines following it. The list then gets written to a separate file at the end, in a different function.
What I've been trying to do after that is make the loop continue from the last of those five lines, rather than going over them again. I thought the last line in the code would solve the problem, but unfortunately not.
Are there any recommended variations of a for loop I could use for this purpose?
def readSingleDayLogs(aDir):
    print 'Processing files in ' + str(aDir) + '\n'
    lineNumber = 0
    try:
        open_aDirFile = open(aDir) #open the log file
        for aLine in open_aDirFile: #total the num. lines in file
            lineNumber = lineNumber + 1
        lowerBound = 0
        for lineIDX in range(lowerBound, lineNumber):
            currentLine = linecache.getline(aDir, lineIDX)
            if (bunch of logic conditions):
                issueList.append(currentLine)
                for extraLineIDX in range(1, 6): #loop over the next five lines of the error and append to issue list
                    extraLine = linecache.getline(aDir, lineIDX+ extraLineIDX) #get the x extra line after problem line
                    issueList.append(extraLine)
                issueList.append('\n\n')
                lowerBound = lineIDX

You should use a while loop:
line = lowerBound
while line < lineNumber:
    ...
    if conditions:
        ...
        for lineIDX in range(line, line+6):
            ...
        line = line + 6
    else:
        line = line + 1

A for-loop uses an iterator over the range, so reassigning the loop variable (or lowerBound) inside the body has no effect on the next iteration.
Consider using a while-loop instead. That way, you can update the line index directly.
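A tiny demonstration of the difference, as a sketch: the first loop shows that rebinding the loop variable of a for loop changes nothing, and the second shows a while loop jumping ahead. visit_with_skips and its parameters are hypothetical names used only for this illustration.

```python
# Rebinding the loop variable inside a for loop does not skip anything:
seen_by_for = []
for idx in range(5):
    seen_by_for.append(idx)
    idx += 10  # discarded: the next pass takes the next value from range()

def visit_with_skips(total, is_hit, skip=5):
    """Walk indices 0..total-1 with a while loop, jumping past `skip` lines after a hit."""
    visited = []
    idx = 0
    while idx < total:
        visited.append(idx)
        if is_hit(idx):
            idx += skip + 1   # continue after the lines that were already handled
        else:
            idx += 1
    return visited

print(seen_by_for)
print(visit_with_skips(10, lambda i: i == 2))
```

In the second call, index 2 counts as a hit, so the loop resumes at index 8 and never visits indices 3 to 7.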

I would look at something like:
from itertools import islice
with open('somefile') as fin:
    line_count = 0
    my_lines = []
    for line in fin:
        line_count += 1
        if some_logic(line):
            my_lines.append(line)
            next_5 = list(islice(fin, 5))
            line_count += len(next_5)
            my_lines.extend(next_5)
This way, by using islice on the input, you're able to move the iterator ahead and resume after the 5 lines (perhaps fewer if near the end of the file) are exhausted.
This assumes I'm understanding correctly that you can read forward through the file, identify a line, and only want a fixed number of lines after that point, then resume looping as per normal. (You may not even need the line counting if that's all you're after, as it only appears to be there for the getline and not for any other purpose.)
If you do indeed want to take the next 5 and still consider the following line, you can use itertools.tee to branch at the point of the faulty line, islice the branch, and let the fin iterator resume on the next line.
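One way the itertools.tee variant could be written is sketched below. It replaces the for loop with an explicit next() loop so that the iterator can be swapped for its tee'd copy. collect_with_context, is_issue and extra are hypothetical names, and the function works on any iterable of lines, including an open file.

```python
from itertools import islice, tee

def collect_with_context(lines, is_issue, extra=5):
    """Collect every issue line plus the `extra` lines after it.

    The lines after a hit are read through a tee'd copy, so the main
    iterator still examines them on the following passes.
    """
    collected = []
    it = iter(lines)
    done = object()                      # sentinel marking the end of the input
    while True:
        line = next(it, done)
        if line is done:
            break
        if is_issue(line):
            collected.append(line)
            it, ahead = tee(it)          # branch: `ahead` runs on, `it` stays put
            collected.extend(islice(ahead, extra))
    return collected
```

Note that when two issue lines are closer together than `extra` lines, the lines between them end up in the result more than once, which may or may not be what you want.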
