Extract a part from a file between specific lines - Python

I would like to know how I can extract some data from a specific range in a big data file. Is there a way to read the content beginning and ending with certain "buzzwords"?
I would like to read, line by line, everything between *NODE and **:
*NODE
13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517
13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065
13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658
13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476
**
Before *NODE and after ** there are thousands of lines...
I know it should look something like this:
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # NOW THERE SHOULD FOLLOW SOMETHING LIKE:
            # go to the next line and "a.append" until the "magical" "**" appears
Any idea? I am totally new to Python. Thanks for the help!
I hope you know what I mean.

You pretty much did it - the only thing missing is that once you find the beginning, you search for the sequence's end, and until that happens you append every line you're iterating over to your list, i.e.:
data = None  # a placeholder to store your lines
with open("file.txt", "r") as f:  # do not shadow the built-in `file`
    for line in f:  # iterate over the lines
        if data is None:  # we haven't found `*NODE` yet
            if line[:5] == "*NODE":  # search for `*NODE` at the line beginning
                data = []  # make `data` an empty list to begin collecting
        elif line[:2] == "**":  # data initialized, we look for the sequence's end
            break  # no need to iterate over the file anymore
        else:  # data initialized but not at the end...
            data.append(line)  # append the line to our data
Now data will contain either a list of lines between *NODE and **, or None if the sequence was not found.
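If you then want numbers rather than raw lines, here is a minimal follow-up sketch (assuming every collected line is comma-separated numbers, as in the sample):
if data is not None:
    rows = [[float(value) for value in line.split(",")] for line in data]
    print(rows[0])  # [13021145.0, 2637.6073002472617, 55.011929824413045, 206.0394346892517]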

Try this:
with open('file.txt') as file:
    a = []
    running = False  # avoid a NameError in case the first 'if' below is never reached
    for line in file:
        if line.startswith('*NODE'):
            running = True  # show that we are starting to add values
            continue  # make sure we don't add '*NODE'
        if line.startswith('**'):
            running = False  # show that we're done adding values
            continue  # make sure we don't add '**'
        if running:  # only add the values if 'running' is True
            a.extend([i.strip() for i in line.split(',')])
The output is a list containing the following as strings:
(I used print('\n'.join(a)))
13021145
2637.6073002472617
55.011929824413045
206.0394346892517
13021146
2637.6051226039867
55.21115693303926
206.05686503802065
13021147
2634.226986419154
54.98263035830583
205.9520084547658
13021148
2634.224808775879
55.181857466932044
205.96943880353476
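Since this flattens every value into one list, you can chunk it back into rows of four afterwards if you need the per-node grouping; a small sketch, assuming each *NODE line always holds exactly four fields:
rows = [a[i:i + 4] for i in range(0, len(a), 4)]
# rows[0] == ['13021145', '2637.6073002472617', '55.011929824413045', '206.0394346892517']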

We can iterate over the lines until there are none left or we've reached the end of the block, like:
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # collect block-related lines
            while True:
                try:
                    line = next(file)
                except StopIteration:
                    # there are no lines left
                    break
                if line.startswith('**'):
                    # we've reached the end of the block
                    break
                a.append(line)
            # stop iterating over the file
            break
will give us
print(a)
['13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n',
'13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n',
'13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658\n',
'13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476\n']
Alternatively, we can write helper predicates like
def not_a_block_start(line):
    return not line.startswith('*NODE')

def not_a_block_end(line):
    return not line.startswith('**')
and then use the brilliance of the itertools module like
from itertools import (dropwhile,
                       takewhile)

with open('file.txt') as file:
    block_start = dropwhile(not_a_block_start, file)
    # skip the block start line
    next(block_start)
    a = list(takewhile(not_a_block_end, block_start))
This will give us the same value for a.
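For reuse with other markers, the same dropwhile/takewhile idea can be wrapped in a small function; a sketch (the function name and signature are my own, not from the original answer):
from itertools import dropwhile, takewhile

def extract_block(path, start_marker, end_marker):
    """Return the lines strictly between start_marker and end_marker."""
    with open(path) as file:
        rest = dropwhile(lambda line: not line.startswith(start_marker), file)
        next(rest, None)  # skip the start-marker line itself (no-op if it was never found)
        return list(takewhile(lambda line: not line.startswith(end_marker), rest))

nodes = extract_block('file.txt', '*NODE', '**')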

Related

How can I handle multiple lines at once while reading from a file?

The standard Python approach to working with files, using the open() function to create a 'file object' f, allows you either to load the entire file into memory at once using f.read() or to read lines one by one in a for loop:
with open('filename') as f:
    # 1) Read all lines at once into memory:
    all_data = f.read()
    # 2) Or read lines one-by-one:
    for line in f:
        pass  # work with each line
I'm searching through several large files looking for a pattern that might span multiple lines. The most intuitive way to do this is to read line-by-line looking for the beginning of the pattern, and then to load in the next few lines to see where it ends:
with open('large_file') as f:
    # Read lines one-by-one:
    for line in f:
        if line.startswith("beginning"):
            # Load in the next line, i.e.
            nextline = f.getline(line+1)  # ??? #
            # or something
The line I've marked with # ??? # is my own pseudocode for what I imagine this should look like.
My question is, does this exist in Python? Is there any method for me to access other lines as needed while keeping the cursor at line and without loading the entire file into memory?
Edit: Inferring from the responses here and other reading, the answer is "No."
Like this:
gather = []
for line in f:
    if gather:
        gather.append(line)
        if "ending" in line:
            process(''.join(gather))  # `process` stands in for whatever you do with a complete match
            gather = []
    elif line.startswith("beginning"):
        gather = [line]
Although in many cases it's easier just to load the whole file into a string and search it.
You may want to rstrip the newline before appending the line.
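For the "load the whole file" route, a minimal sketch using the re module (the pattern is illustrative and assumes the block starts at a line beginning with "beginning" and ends on a line containing "ending"):
import re

with open('large_file') as f:
    text = f.read()

# non-greedy match from a line starting with "beginning" to the end of the line containing "ending"
for match in re.finditer(r'^beginning.*?ending.*?$', text, re.DOTALL | re.MULTILINE):
    print(match.group(0))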
Just store the interesting lines into a list while going line-wise through the file:
with open("file.txt","w") as f:
f.write("""
a
b
------
c
d
e
####
g
f""")
interesting_data = []
inside = False
with open("file.txt") as f:
    for line in f:
        line = line.strip()
        # start of interesting stuff
        if line.startswith("---"):
            inside = True
        # end of interesting stuff
        elif line.startswith("###"):
            inside = False
        # adding interesting bits
        elif inside:
            interesting_data.append(line)
print(interesting_data)
to get
['c', 'd', 'e']
I think you're looking for .readline(), which does exactly that. Here is a sketch that proceeds to the line where a pattern starts.
with open('large_file') as f:
    line = f.readline()
    while not line.startswith("beginning"):
        line = f.readline()
        # end of file
        if not line:
            print("EOF")
            break
    # do_something with line; get additional lines by
    # calling .readline() again, etc.
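To answer the "access other lines as needed" part directly: there is no random access by line number, but you can pull the next few lines from the same file iterator without disturbing the loop; a sketch using itertools.islice:
from itertools import islice

with open('large_file') as f:
    for line in f:
        if line.startswith("beginning"):
            next_four = list(islice(f, 4))  # consumes up to 4 following lines (fewer at EOF)
            # the for loop resumes after the consumed lines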

Read text line by line in Python

I would like to make a script that reads a text file line by line and, based on the lines, populates an array if it finds a certain parameter. The idea is this:
Read line
if Condition 1
    # True
    nested if Condition 2
    ...
else Condition 1 is not true
    read next line
I can't get it to work, though. I'm using readline() to read the text line by line, but the main problem is that the command to read the next line never works. Can you help me? Below is an extract of my actual code:
col = 13  # columns
rig = 300  # rows
a = [[None for x in range(col)] for y in range(rig)]
counter = 1
file = open('temp.txt', 'r')
files = file.readline()
for line in files:
    if 'bandEUTRA: 32' in line:
        if 'ca-BandwidthClassDL-EUTRA: a' in line:
            a[counter][5] = 'DLa'
            counter = counter + 1
        else:
            next(files)
    else:
        next(files)
print('\n'.join(map(str, a)))
Fixes for the code you asked about inline, and some other associated cleanup, with comments:
col = 13  # columns
rig = 300  # rows
a = [[None] * col for y in range(rig)]  # Innermost repeated list of immutable values
                                        # can use multiplication; just don't do it for
                                        # outer list(s), see: https://stackoverflow.com/q/240178/364696
counter = 1
with open('temp.txt') as file:  # Use a with statement to get guaranteed file closure; 'r' is the implicit mode and can be omitted
    # Removed: files = file.readline()  # This makes no sense; files would be a single line from the file, but your original code treats it as the lines of the file
    # Replaced: for line in files:  # Since files was a single str, this iterated over the characters of that line
    for line in file:  # File objects are iterators of their own lines, so you can get the lines one by one this way
        if 'bandEUTRA: 32' in line and 'ca-BandwidthClassDL-EUTRA: a' in line:  # Perform both tests in a single if to minimize the arrow pattern
            a[counter][5] = 'DLa'
            counter += 1  # May as well not say "counter" twice and use +=
    # All next() code removed; next() advances an iterator and returns the next value,
    # but files was not an iterator, so it was nonsensical, and the new code uses a for loop that advances it for you, so it was unnecessary.
    # If the goal is to intentionally skip the next line under some conditions, you *could*
    # use next(file, None) to advance the iterator so the for loop will skip it, but
    # it's rare that a line *failing* a test means you don't want to look at the next line,
    # so you probably don't want it.

# This works:
print('\n'.join(map(str, a)))
# But it's even simpler to spell it as:
print(*a, sep="\n")
# which lets print do the work of stringifying and inserting the separator, avoiding
# the need to make a potentially huge string in memory; it *might* still do so (no documented
# guarantees), but if you want to avoid that possibility, you could do:
import sys
sys.stdout.writelines(map('{}\n'.format, a))
# which technically doesn't guarantee it, but definitely actually operates lazily, or:
for x in a:
    print(x)
# which is 100% guaranteed not to make any huge strings
You can do:
with open("filename.txt", "r") as f:
    for line in f:
        clean_line = line.rstrip('\r\n')
        process_line(clean_line)
Edit:
For your application of populating an array, you could do something like this:
with open("filename.txt", "r") as f:
    contains = ["text" in line for line in f]
This will give you a list whose length is the number of lines in filename.txt; its contents will be False for each line that doesn't contain text and True for each line that does.
Edit 2: To reflect @ShadowRanger's comments, I've changed my code to iterate over the file a line at a time instead of reading the whole thing in at once.
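As a quick illustration (my own example, not from the original answer), the boolean list pairs naturally with sum() and enumerate():
with open("filename.txt") as f:
    contains = ["text" in line for line in f]

print(sum(contains))  # number of lines containing "text" (True counts as 1)
matches = [i for i, flag in enumerate(contains, start=1) if flag]  # 1-based line numbers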

Using Python3 to search a file for a string, add the results on the next lines to an array before stopping at the next string

I am using Python 3 to process a results file. The structure of the file is a combination of string identifiers followed by lists of integer values in this format:
ENERGY_BOUNDS
1.964033E+07 1.733253E+07 1.491825E+07 1.384031E+07 1.161834E+07 1.000000E+07 8.187308E+06 6.703200E+06
6.065307E+06 5.488116E+06 4.493290E+06 3.678794E+06 3.011942E+06 2.465970E+06 2.231302E+06 2.018965E+06
EIGENVALUE
1.219034E+00
There are maybe 50 different sets of data with unique identifiers in this file. What I want to do is write a code that will search for a specific identifier (e.g. ENERGY_BOUNDS), then read the values that follow into a list, stopping at the next identifier (in this case EIGENVALUE). I then need to be able to manipulate the list (finding its length, printing its values, etc.).
I am writing this as a function so I can call it multiple times in my code when I want to search for different identifiers. So far what I have is:
def read_data_from_file(file_name, identifier):
    list_of_results = []  # Create list_of_results to put results in for future manipulation
    # Open the file in read-only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if the line contains the string
            if identifier in line:
                # If yes, read the next line
                nextValue = next(line)
                list_of_results.append(nextValue.rstrip())
    return list_of_results
It works fine up until it comes to reading the next line after the identifier, and I am stuck on how to continue reading the results after that line and how to make it stop at the next identifier.
The following is a simple, tested answer.
You are making two mistakes:
1. line is a string, not an iterator, so calling next(line) causes an error.
2. You are reading just one line after the identifier has been found, while you need to keep reading until the next identifier appears.
The following is your code after a little modification; it is also tested on your data:
def read_data_from_file(file_name, identifier):
    with open(file_name, 'r') as read_obj:
        list_of_results = []
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if the line contains the string
            if identifier in line:
                # If yes, read the following lines
                nextValue = next(read_obj, None)  # the None default avoids StopIteration at end of file
                while nextValue is not None and not nextValue.strip().isalpha():  # keep reading until the next identifier appears
                    list_of_results.extend(nextValue.split())
                    nextValue = next(read_obj, None)
        print(list_of_results)
I would suggest adding a variable that indicates whether you have found a line containing an identifier.
Afterwards, simply add the values into the array until the next identifier has been reached.
def read_data_from_file(file_name, identifier):
    list_of_results = []  # Create list_of_results to put results in for future manipulation
    identifier_found = False
    # Open the file in read-only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if the line contains the string
            if identifier in line:
                identifier_found = True
            elif identifier_found:
                if line.strip().isalpha():  # Next identifier reached, exit loop
                    break
                list_of_results += line.split()  # Add values to the result
    return list_of_results
Use booleans, continue, and break!
Try to implement the logic as follows (a sketch implementing it follows below):
Set a boolean (I'll use in_range) to False.
Look through the lines and see if they match the identifier.
If one does, set the boolean to True and continue.
If it does not, continue.
If the boolean is False AND the line begins with a space: continue.
If the boolean is True AND the line begins with a space: add the line to the list.
If the boolean is True AND the line doesn't begin with a space: break.
This ends the searching process once a new identifier has been started.
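A minimal sketch of that logic (assuming, as the steps above do, that value lines begin with a space and identifier lines do not; the function name is illustrative):
def read_block(file_name, identifier):
    results = []
    in_range = False  # True once the identifier line has been seen
    with open(file_name) as f:
        for line in f:
            if identifier in line:
                in_range = True
                continue
            if not in_range:
                continue
            if line.startswith(' '):  # still inside the block of values
                results.extend(line.split())
            else:  # a new identifier line: stop searching
                break
    return results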
The other two answers are already helpful. Here is my method in case you need something else, with comments to explain.
If you don't want to use the end_identifier you can use .isalpha(), which checks whether the string contains only letters.
def read_data_from_file(file_name, start_identifier, end_identifier):
    list_of_results = []
    with open(file_name, 'r') as read_obj:
        start_identifier_reached = False  # variable to check if we have reached the needed identifier
        for line in read_obj:
            if start_identifier in line:
                start_identifier_reached = True  # now we have reached the identifier
                continue  # go back to the start so we don't write the identifier into the list
            if start_identifier_reached:
                if end_identifier in line:  # the end_identifier closes the block
                    return list_of_results
                list_of_results.append(line.rstrip())  # put the values into the list until we reach the end_identifier
    return list_of_results  # also covers a block that runs to the end of the file
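Called against the sample data from the question (the file name here is illustrative), this collects the lines between the two identifiers:
bounds = read_data_from_file('results.txt', 'ENERGY_BOUNDS', 'EIGENVALUE')
print(len(bounds))  # number of collected lines
print(bounds)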

Python - how to get last line in a loop

I have some CSV files that I have to modify which I do through a loop. The code loops through the source file, reads each line, makes some modifications and then saves the output to another CSV file. In order to check my work, I want the first line and the last line saved in another file so I can confirm that nothing was skipped.
What I've done is put all of the lines into a list and then get the last one using the list's length minus one as the index. This works, but I'm wondering if there is a more elegant way to accomplish it.
Code sample:
from itertools import islice

def CVS1():
    fb = open('C:\\HP\\WS\\final-cir.csv', 'wb')
    check = open('C:\\HP\\WS\\check-all.csv', 'wb')
    check_count = 0
    check_list = []
    with open('C:\\HP\\WS\\CVS1-source.csv', 'r') as infile:
        skip_first_line = islice(infile, 3, None)
        for line in skip_first_line:
            check_list.append(line)
            check_count += 1
            if check_count == 1:
                check.write(line)
            # [CSV modifications become a string called "newline"]
            fb.write(newline)
    final_check = check_list[len(check_list) - 1]
    check.write(final_check)
    fb.close()
If you actually need check_list for something, then, as the other answers suggest, using check_list[-1] is equivalent to but better than check_list[len(check_list)-1].
But do you really need the list? If all you want to keep track of is the first and last lines, you don't. If you keep track of the first line specially, and keep track of the current line as you go along, then at the end, the first line and the current line are the ones you want.
In fact, since you appear to be writing the first line into check as soon as you see it, you don't need to keep track of anything but the current line. And the current line, you've already got that, it's line.
So, let's strip all the other stuff out:
from itertools import islice

def CVS1():
    fb = open('C:\\HP\\WS\\final-cir.csv', 'wb')
    check = open('C:\\HP\\WS\\check-all.csv', 'wb')
    first_line = True
    with open('C:\\HP\\WS\\CVS1-source.csv', 'r') as infile:
        skip_first_line = islice(infile, 3, None)
        for line in skip_first_line:
            if first_line:
                check.write(line)
                first_line = False
            # [CSV modifications become a string called "newline"]
            fb.write(newline)
        check.write(line)
    fb.close()
You can enumerate the csv rows of the input file and check the index, like this (note that an islice object has no len(), so the last line is easiest to write after the loop ends, when the loop variable still holds it):
from itertools import islice

def CVS1():
    with open('C:\\HP\\WS\\final-cir.csv', 'wb') as fb, open('C:\\HP\\WS\\check-all.csv', 'wb') as check, open('C:\\HP\\WS\\CVS1-source.csv', 'r') as infile:
        skip_first_line = islice(infile, 3, None)
        line = None
        for idx, line in enumerate(skip_first_line):
            if idx == 0:
                check.write(line)  # first line
            # [CSV modifications become a string called "newline"]
            fb.write(newline)
        if line is not None:
            check.write(line)  # last line
I've replaced the open statements with a with block, to delegate the closing of the file handles to the interpreter.
You can access index -1 directly:
final_check = check_list[-1]
which is nicer than what you have now:
final_check = check_list[len(check_list)-1]
If it's not an empty or one-line file, you can:
my_file = open(root_to_file, 'r')
my_lines = my_file.readlines()
first_line = my_lines[0]
last_line = my_lines[-1]
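For large files where readlines() would pull everything into memory, a constant-memory sketch using collections.deque (the file name is illustrative):
from collections import deque

with open('large.csv') as f:
    first_line = next(f, None)     # None if the file is empty
    tail = deque(f, maxlen=1)      # consumes the rest, keeping only the final line
    last_line = tail[0] if tail else first_line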

Include surrounding lines of text file match in output using Python 2.7.3

I've been working on a program which assists in log analysis. It finds error or fail messages using regex and prints them to a new .txt file. However, it would be much more beneficial if the program included the top and bottom 4 lines around each match. I can't figure out how to do this! Here is a part of the existing program:
def error_finder(filepath):
    source = open(filepath, "r").readlines()
    error_logs = set()
    my_data = []
    for line in source:
        line = line.strip()
        if re.search(exp, line):
            error_logs.add(line)
I'm assuming something needs to be added to the very last line, but I've been working on this for a bit and either am not applying myself fully or just can't figure it out.
Any advice or help on this is appreciated.
Thank you!
Why Python?
grep -C4 '^your_regex$' logfile > outfile.txt
Some comments:
I'm not sure why error_logs is a set instead of a list.
Using readlines() will read the entire file into memory, which will be inefficient for large files. You should be able to just iterate over the file a line at a time.
exp (which you're using for re.search) isn't defined anywhere, but I assume that's elsewhere in your code.
Anyway, here's complete code that should do what you want without reading the whole file in memory. It will also preserve the order of input lines.
import re
from collections import deque

exp = r'\d'  # matches numbers; change it to what you need

def error_finder(filepath, context_lines=4):
    source = open(filepath, 'r')
    error_logs = []
    buffer = deque(maxlen=context_lines)
    lines_after = 0
    for line in source:
        line = line.strip()
        if re.search(exp, line):
            # add previous lines first
            for prev_line in buffer:
                error_logs.append(prev_line)
            # clear the buffer
            buffer.clear()
            # add current line
            error_logs.append(line)
            # schedule lines that follow to be added too
            lines_after = context_lines
        elif lines_after > 0:
            # a line that matched the regex came not so long ago
            lines_after -= 1
            error_logs.append(line)
        else:
            buffer.append(line)
    # maybe do something with error_logs? I'll just return it
    return error_logs
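To match the asker's goal of printing the matches (with context) to a new .txt file, a short usage sketch (the file names are illustrative):
error_logs = error_finder('app.log')
with open('errors_with_context.txt', 'w') as out:
    out.write('\n'.join(error_logs) + '\n')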
I suggest using an index loop instead of a for-each loop; try this:
error_logs = list()
for i in range(len(source)):
    line = source[i].strip()
    if re.search(exp, line):
        error_logs.append((line, i - 4, i + 4))
In this case your error log will contain ('line of error', line index - 4, line index + 4), so you can get these lines later from source.
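Retrieving the surrounding lines afterwards just needs the stored indices clamped to the file's boundaries; a sketch (source is the readlines() list from the question):
for line, start, end in error_logs:
    context = source[max(0, start):end + 1]  # max() stops a negative start from wrapping around
    print(''.join(context))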
