Python equivalent for 'grep -C N'?

So right now I'm looking for something in a file. I am getting a value variable, which is a rather long string, with newlines and so on. Then, I use re.findall(regex, value) to find matches of regex. The regex is rather simple - something like "abc de.*".
Now, I want to capture not only whatever the regex matches, but also the surrounding context (exactly like the -C flag for grep).
So, assuming that I dumped value to a file and ran grep on it, what I'd do is grep -C N 'abc de.*' valueinfile
How can I achieve the same thing in Python? I need the answer to work with Unicode regex/text.

My approach is to split the text block into a list of lines. Next, iterate through each line and see if there is a match. In case of a match, gather the context lines (the lines that appear before and after the current line) and return them. Here is my code:
import re

def grep(pattern, block, context_lines=0):
    lines = block.splitlines()
    for line_number, line in enumerate(lines):
        if re.match(pattern, line):
            lines_with_context = lines[line_number - context_lines:line_number + context_lines + 1]
            yield '\n'.join(lines_with_context)
# Try it out
text_block = """One
Two
Three
abc defg
four
five
six
abc defoobar
seven
eight
abc de"""
pattern = 'abc de.*'
for line in grep(pattern, text_block, context_lines=2):
    print line
    print '---'
Output:
Two
Three
abc defg
four
five
---
five
six
abc defoobar
seven
eight
---
seven
eight
abc de
---

As recommended by Ignacio Vazquez-Abrams, use a deque to store the last n lines. Once that many lines are present, popleft for each new line added. When your regular expression finds a match, return the previous n lines in the stack, then read n more lines and return those as well.
This keeps you from having to iterate over any line twice (DRY) and stores only minimal data in memory. You also mentioned the need for Unicode, so handling file encoding and adding the Unicode flag to regex searches is important. Also, the other answer uses re.match() instead of re.search(), which may have unintended consequences.
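For instance, a quick illustration of that match/search difference, since re.match only matches at the start of a line while re.search scans the whole line:
import re
print(bool(re.match('abc de', 'prefix abc defg')))   # False: match is anchored at the start
print(bool(re.search('abc de', 'prefix abc defg')))  # True: search scans the whole line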
Below is an example. This example only iterates over every line ONCE in the file, which means context lines that also contain hits don't get looked at again. This may or may not be desirable behavior but can easily be tweaked to highlight or otherwise flag lines with additional hits within context for a previous hit.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import re
from collections import deque

def grep(pattern, input_file, context=0, case_sensitivity=True, file_encoding='utf-8'):
    stack = deque()
    hits = []
    lines_remaining = None
    with codecs.open(input_file, mode='rb', encoding=file_encoding) as f:
        for line in f:
            # append next line to stack
            stack.append(line)
            # keep adding context after a hit is found (without popping off previous lines of context)
            if lines_remaining is not None:
                lines_remaining -= 1
                if lines_remaining > 0:
                    continue  # go to next line in file
                # enough trailing context collected; store this hit
                hits.append(stack)
                lines_remaining = None
                stack = deque()
                continue
            # if stack exceeds needed context, pop leftmost line off stack
            # (but include current line with possible search hit if applicable)
            if len(stack) > context + 1:
                stack.popleft()
            # search line for pattern
            if case_sensitivity:
                search_object = re.search(pattern, line, re.UNICODE)
            else:
                search_object = re.search(pattern, line, re.IGNORECASE | re.UNICODE)
            if search_object:
                lines_remaining = context
                if lines_remaining == 0:
                    # no trailing context requested; store the hit immediately
                    hits.append(stack)
                    lines_remaining = None
                    stack = deque()
    # in case there are not enough lines left in the file to provide trailing context
    if lines_remaining and len(stack) > 0:
        hits.append(stack)
    # return list of deques containing hits with context
    return hits  # you'll probably want to format the output, this is just an example


Find a pattern in a stream of bytes read in blocks

I have a stream of gigabytes of data that I read in blocks of 1 MB.
I'd like to find if (and where) one of the patterns PATTERNS = [b"foo", b"bar", ...] is present in the data (case insensitive).
Here is what I'm doing. It works but it is sub-optimal:
oldblock = b''
while True:
    block = source_data.get_bytes(1024*1024)
    if block == b'':
        break
    testblock = (oldblock + block).lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)                  # note: this line can be incomplete if
    oldblock = block                          # it continues in the next block (**)
Why do we need to search in oldblock + block? This is because the pattern foo could be precisely split in two consecutive 1 MB blocks:
[.......fo] [o........]
  block n     block n+1
Drawback: it's not optimal to have to concatenate oldblock + block and to search over almost twice as much data as necessary.
We could use testblock = oldblock[-max_len_of_patterns:] + block, but there is surely a more canonical way to address this problem, as well as the side-remark (**).
How to do a more efficient pattern search in data read by blocks?
Note: the input data is not a file that I can iterate on or memory map, I only receive a stream of 1MB blocks from an external source.
I'd separate the block-getting from the pattern-searching and do it like this (all but the first two lines are from your original):
for block in nice_blocks():
    testblock = block.lower()
    for PATTERN in PATTERNS:
        if PATTERN in testblock:
            for l in testblock.split(b'\n'):  # display only the line where the
                if PATTERN in l:              # pattern is found, not the whole 1MB block!
                    print(l)
Where nice_blocks() is an iterator of "nice" blocks, meaning they don't break lines apart and they don't overlap. And they're ~1 MB large as well.
To support that, I start with a helper just providing an iterator of the raw blocks:
def raw_blocks():
    while block := source_data.get_bytes(1024*1024):
        yield block
(The := assumes you're not years behind; it was added in Python 3.8. For older versions, do it with your while-True-if-break, as sketched below.)
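For reference, a pre-3.8 sketch of the same helper, assuming the same source_data.get_bytes() API from the question:
def raw_blocks():
    # equivalent of the walrus version above, without :=
    while True:
        block = source_data.get_bytes(1024*1024)
        if not block:
            break
        yield block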
And to get nice blocks:
def nice_blocks():
    carry = b''
    for block in raw_blocks():
        i = block.rfind(b'\n')
        if i >= 0:
            yield carry + block[:i]
            carry = block[i+1:]
        else:
            carry += block
    if carry:
        yield carry
The carry carries over remaining bytes from the previous block (or previous blocks, if none of them had newlines, but that's not happening with your "blocks of 1 MB" and your "line_length < 1 KB").
With these two helper functions in place, you can write your code as at the top of my answer.
From the use of testblock.split(b'\n') in your code, as well as the comment about displaying the line where a pattern is found, it is well apparent that your expected input is not a true binary file, but a text file, where each line, separated by b'\n', is of a size reasonable enough to be readable by the end user when displayed on a screen. It is therefore most convenient and efficient to simply iterate through the file by lines instead of in chunks of a fixed size since the iterator of a file-like object already handles buffering and splitting by lines optimally.
However, since it is now clear from your comment that data is not really a file-like object in your real-world scenario, but an API that presumably has just a method that returns a chunk of data per call, we have to wrap that API into a file-like object.
For demonstration purpose, let's simulate the API you're dealing with by creating an API class that returns up to 10 bytes of data at a time with the get_next_chunk method:
class API:
    def __init__(self, data):
        self.data = data
        self.position = 0

    def get_next_chunk(self):
        chunk = self.data[self.position:self.position + 10]
        self.position += 10
        return chunk
We can then create a subclass of io.RawIOBase that wraps the API into a file-like object with a readinto method that is necessary for a file iterator to work:
import io

class APIFileWrapper(io.RawIOBase):
    def __init__(self, api):
        self.api = api
        self.leftover = None

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self.leftover or self.api.get_next_chunk()
        size = len(buffer)
        output = chunk[:size]
        self.leftover = chunk[size:]
        output_size = len(output)
        buffer[:output_size] = output
        return output_size
With a raw file-like object, we can then wrap it in an io.BufferedReader with a buffer size that matches the size of data returned by your API call, and iterate through the file object by lines and use the built-in in operator to test if a line contains one of the patterns in the list:
api = API(b'foo bar\nHola World\npython\nstackoverflow\n')
PATTERNS = [b't', b'ho']

for line in io.BufferedReader(APIFileWrapper(api), 10):  # or 1024 * 1024 in your case
    lowered_line = line.lower()
    for pattern in PATTERNS:
        if pattern in lowered_line:
            print(line)
            break
This outputs:
b'Hola World\n'
b'python\n'
b'stackoverflow\n'
Demo: https://replit.com/#blhsing/CelebratedCadetblueWifi
I didn't do any benchmarks, but this solution has the definite advantage of being straightforward, not looking at anything twice, printing the lines as they actually appear in the stream (not all lower-cased), and printing complete lines even if they cross a block boundary:
import re

regex_patterns = list(re.compile('^.*' + re.escape(pattern) + '.*$', re.I | re.M) for pattern in PATTERNS)

testblock = ""
block = data.read(1024*1024)  # **see remark below**
while len(block) > 0:
    lastLineStart = testblock.rfind('\n') + 1
    testblock = testblock[lastLineStart:] + block.decode('UTF-8')  # **see edit below**
    for pattern in regex_patterns:
        for line in pattern.findall(testblock):
            print(line)
    block = data.read(1024*1024)  # **see remark below**
Remark: Since you are processing text data here (otherwise the notion of "lines" wouldn't make any sense), you shouldn't be using b'...' anywhere. Your text in the stream has some encoding and you should read it in a way that honours that encoding (instead of data.read(1024*1024)) so that the loops are operating on real (Python internal unicode) strings and not some byte data. Not getting that straight is one of the most frustratingly difficult bugs to find in each and every Python script.
Edit: If your data is coming from someplace you don't have control over, then using block.decode('UTF-8') (where 'UTF-8' should be replaced by your data's actual encoding!) would allow the patterns to be Python unicode strings as well, meaning you could drop the b'..' around those too. Naturally, if your data is all strictly 7-bit anyway, those points are moot.
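One possible way to honour the encoding while still reading fixed-size byte blocks is an incremental decoder. This is only a minimal sketch, assuming UTF-8 content and the same data.read() byte source as above; it buffers multi-byte sequences that happen to be split across block boundaries:
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()   # replace 'utf-8' with your actual encoding
text = ''
while True:
    block = data.read(1024*1024)
    if not block:
        text += decoder.decode(b'', final=True)     # flush any buffered partial character
        break
    text += decoder.decode(block)                   # partial characters are held until the next block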
How about:
Only concatenate the end of the previous block and the start of the next block, using the length of the pattern you are currently looking for. Then use a variable (carry) to indicate whether the pattern straddled the boundary, so that when you move to the next block you automatically print its first line, because you already know that line contains part of the pattern.
E.g.
block_0 = "abcd"
block_1 = "efgh"
pattern = "def"
length = 3

if pattern in block_0[-length + 1:] + block_1[:length - 1]:
    ...
This if statement will check "cdef" for the pattern "def". No need to check any more characters than that, because if the pattern isn't in that selection of characters then it isn't split between blocks in the first place. Once you know the pattern spans two blocks, you just need to print the first line of the next block, which is done by checking the value of carry as seen below.
This should stop you needing to go through the block twice like you said.
oldblock = b''
carry = False
while True:
    block = data.read(1024*1024)
    if block == b'':
        break
    block = block.lower()
    lines = block.split(b'\n')
    # check whether any pattern straddles the boundary between the previous block and this one
    for PATTERN in PATTERNS:
        length = len(PATTERN)
        if PATTERN in (oldblock[-length + 1:] + block[:length - 1]):
            carry = True  # found the PATTERN between blocks; the first line of this block is part of the match
    if carry:
        print(lines[0])
        carry = False
    for PATTERN in PATTERNS:
        if PATTERN in block:
            for l in lines:
                if PATTERN in l:
                    print(l)
    oldblock = block
Updated Answer
Given what we now know about the nature of the data, we only need to retain from a previous call to get_bytes the last N bytes, where N is the maximum pattern length minus 1. Since a portion of the previously retrieved block must be concatenated with the newly read block in order to match patterns that are split across block boundaries, it becomes possible to match the same pattern twice. Therefore, once a pattern has been matched we do not try to match it again, and when there are no more patterns to match we can quit.
The pattern strings, if not ASCII, should be encoded with the same encoding used in the stream.
PATTERNS = [b'foo', b'bar', 'coûteux'.encode('utf-8')]
BLOCKSIZE = 1024 * 1024
FILE_PATH = 'test.txt'

# Compute maximum pattern length - 1
pad_length = max(map(lambda pattern: len(pattern), PATTERNS)) - 1

with open(FILE_PATH, 'rb') as f:
    patterns = PATTERNS
    # Initialize with any byte string we are not trying to match.
    data = b'\x00' * pad_length
    offset = 0
    # Any unmatched patterns left?
    while patterns:
        # Emulate a call to get_bytes(BLOCKSIZE) using a binary file:
        block = f.read(BLOCKSIZE)
        if block == b'':
            break
        # You only need to keep the last pad_length bytes from the previous read:
        data = data[-pad_length:] + block.lower()
        # Once a pattern is matched we do not want to try matching it again:
        new_patterns = []
        for pattern in patterns:
            idx = data.find(pattern)
            if idx != -1:
                print('Found: ', pattern, 'at offset', offset + idx - pad_length)
            else:
                new_patterns.append(pattern)
        offset += BLOCKSIZE
        patterns = new_patterns
If a pattern is matched, try using break inside the for loop body to stop executing code that is no longer useful, e.g.:
for ...:
    if match(PATTERN):
        break

How to remove dash/hyphen from each line in .txt file

I wrote a little program to turn pages from book scans into a .txt file. On some lines, words are hyphenated and continued on the next line. I wonder if there is any way to remove the dashes and merge them with the syllables on the line below?
E.g.:
effects on the skin is fully under-
stood one fights
to:
effects on the skin is fully understood
one fights
or:
effects on the skin is fully
understood one fights
Or something like that. As long as it was connected. Python is my third language and so far I can't think of anything, so maybe someone will give me a hint.
Edit:
The point is that if the last symbol on a line is a dash, it should be removed and the word merged with its remainder on the line below
This is a generator which takes the input line-by-line. If it ends with a - it extracts the last word and holds it over for the next line. It then yields any held-over word from the previous line combined with the current line.
To combine the results back into a single block of text, you can join it against the line separator of your choice:
source = """effects on the skin is fully under-
stood one fights
check-out Daft Punk's new sin-
le "Get Lucky" if you hav-
e the chance. Sound of the sum-
mer."""
def reflow(text):
    holdover = ""
    for line in text.splitlines():
        if line.endswith("-"):
            lin, _, e = line.rpartition(" ")
        else:
            lin, e = line, ""
        yield f"{holdover}{lin}"
        holdover = e[:-1]
print("\n".join(reflow(source)))
""" which is:
effects on the skin is fully
understood one fights
check-out Daft Punk's new
single "Get Lucky" if you
have the chance. Sound of the
summer.
"""
To read one file line-by-line and write directly to a new file:
def reflow(infile, outfile):
    with open(infile) as source, open(outfile, "w") as dest:
        holdover = ""
        for line in source.readlines():
            line = line.rstrip("\n")
            if line.endswith("-"):
                lin, _, e = line.rpartition(" ")
            else:
                lin, e = line, ""
            dest.write(f"{holdover}{lin}\n")
            holdover = e[:-1]

if __name__ == "__main__":
    reflow("source.txt", "dest.txt")
Here is one way to do it
with open('test.txt') as file:
    combined_strings = []
    merge_line = False
    for item in file:
        item = item.replace('\n', '')  # remove newline character at end of line
        if item.endswith('-'):  # check that the dash is the last character
            merge_line = True
            combined_strings.append(item[:-1])
        elif merge_line:
            merge_line = False
            combined_strings[-1] = combined_strings[-1] + item
        else:
            combined_strings.append(item)
If you just parse the line as a string then you can utilize the .split() function to move around these kinds of items
words = "effects on the skin is fully under-\nstood one fights"

# splitting among the newlines
wordsSplit = words.split("\n")

# splitting among the word spaces
for i in range(len(wordsSplit)):
    wordsSplit[i] = wordsSplit[i].split(" ")

# checking for the end of line hyphens
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if "-" in wordsSplit[i][g]:
            # setting the new word in the list and removing the hyphen
            wordsSplit[i][g] = wordsSplit[i][g][0:-1] + wordsSplit[i+1][0]
            wordsSplit[i+1][0] = ""

# recreating the string
msg = ""
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g] != "":
            msg += wordsSplit[i][g] + " "
What this does is split by the newlines which are where the hyphens usually occur. Then it splits those into a smaller array by word. Then checks for the hyphens and if it finds one it replaces it with the next phrase in the words list and sets that word to nothing. Finally, it reconstructs the string into a variable called msg where it doesn't add a space if the value in the split array is a nothing string.
What about
import re
a = '''effects on the skin is fully under-
stood one fights'''
re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~','\n')
Explanation
a.replace('\n', '~') joins the input string into one line, with ~ standing in for \n (choose a different placeholder if the ~ character can appear in your text).
The regex -~([a-zA-Z0-9]*) then matches the hyphen, the placeholder, and the word fragment that follows; the () group captures the fragment so that re.sub can re-insert it via the backreference in '\1\n'.
.replace('~', '\n') finally turns all remaining ~ characters back into newlines.

Trouble parsing FASTA files in Python

dict = {}
tag = ""
with open('/storage/emulated/0/Download/sequence.fasta.txt', 'r') as sequence:
    seq = sequence.readlines()
    for line in seq:
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            seq = "".join(seq[1:])
            dict[tag] = seq.replace("\n", "")
print(dict)
Background for those who aren't familiar with FASTA files: this format contains one or multiple DNA, RNA, or protein sequences, each with a one-line descriptive tag that starts with a ">", followed by the sequence on the lines after it (e.g. for DNA it would be a lot of repeating A, T, G, and C). It also comes with many unnecessary line breaks. So far this code works when I only have one sequence per file, but it seems to ignore the if condition if there are multiple. For example, it should add each new tag: sequence pair into the dictionary every time it notices a ">", but instead it only runs once, puts the first description as the key in the dictionary, joins the rest of the file regardless of ">" characters, and uses that as the value. How can I get this loop to notice a new ">" after the first occurrence?
I am purposefully steering away from the biopython module.
UPDATE: the code below now works for multiple-line sequences.
The following code works fine for me:
import re
from collections import defaultdict

sequences = defaultdict(str)

with open('fasta.txt') as f:
    lines = f.readlines()

current_tag = None
for line in lines:
    m = re.match('^>(.+)', line)
    if m:
        current_tag = m.group(1)
    else:
        sequences[current_tag] += line.strip()

for k, v in sequences.items():
    print(f"{k}: {v}")
It uses a number of features you may be unfamiliar with, such as regular expressions (which are probably very useful in bioinformatics) and f-string formatting. If anything confuses you, ask away. One thing I should add is that you don't want to define a variable as dict because that will clobber something Python has defined at startup. I chose sequences, which doesn't do this and is more informative.
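A quick, hypothetical illustration of that clobbering:
dict = {"seq0": "FQTWEE"}   # shadows the built-in dict type
# dict(a=1)                 # would now raise TypeError: 'dict' object is not callable
del dict                    # removes the shadowing name; the built-in is visible again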
For reference, this is the content of the example FASTA file fasta.txt I used in this instance:
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

Python counting occurrences across multiple lines using loops

I want a quick pythonic method to give me a count in a loop. I am actually too embarrassed to post up my solutions which are currently not working.
Given a sample from a text file structured follows:
script7
BLANK INTERRUPTION
script2
launch4.VBS
script3
script8
launch3.VBS
script5
launch1.VBS
script6
I want a count of all times script[y] is followed by a launch[X]. Launch has a range of values from 1 to 5, whilst script has a range of 1 to 15.
Using script3 as an example, I would need a count for each of the following in a given file:
script3
launch1
#count this
script3
launch2
#count this
script3
launch3
#count this
script3
launch4
#count this
script3
launch4
#count this
script3
launch5
#count this
I think the sheer number of loops involved here has surpassed my knowledge of Python. Any assistance would be greatly appreciated.
Why not use a multi-line regex - then the script becomes:
import re

# read all the text of the file, and clean it up
with open('counts.txt', 'rt') as f:
    alltext = '\n'.join(line.strip() for line in f)

# find all occurrences of the script line followed by the launch line
cont = re.findall(r'(?mi)^script(\d+)\nlaunch(\d+)\.VBS\n', alltext)

# accumulate the counts of each launch number for each script number
# into nested dictionaries
scriptcounts = {}
for scriptnum, launchnum in cont:
    # if we haven't seen this script number before, create the dictionary for it
    if scriptnum not in scriptcounts:
        scriptcounts[scriptnum] = {}
    # if we haven't seen this launch number with this script number before,
    # initialize the count to 0
    if launchnum not in scriptcounts[scriptnum]:
        scriptcounts[scriptnum][launchnum] = 0
    # increment the count for this combination of script and launch number
    scriptcounts[scriptnum][launchnum] += 1

# produce the output in order of increasing scriptnum/launchnum
for scriptnum in sorted(scriptcounts.keys()):
    for launchnum in sorted(scriptcounts[scriptnum].keys()):
        print "script%s\nlaunch%s.VBS\n# count %d\n" % (scriptnum, launchnum, scriptcounts[scriptnum][launchnum])
The output (in the format you requested) is, for example:
script2
launch1.VBS
# count 1
script2
launch4.VBS
# count 1
script5
launch1.VBS
# count 1
script8
launch3.VBS
# count 3
re.findall() returns a list of all the matches; each match is a tuple of the () parts of the pattern, except for (?mi), which is a directive telling the regular expression matcher to work across line ends \n and to match case-insensitively. The pattern as it stands, e.g. the fragment 'script(\d+)', pulls only the digits following script/launch into the match; it could just as easily include the word 'script' by being '(script\d+)', and similarly '(launch\d+\.VBS)', and only the printing would need modification to handle this variation.
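For instance, a sketch of that variation, capturing the full tokens so that only the printing step changes:
cont = re.findall(r'(?mi)^(script\d+)\n(launch\d+\.VBS)\n', alltext)
# each match is now a ('scriptN', 'launchM.VBS') tuple rather than bare digits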
HTH
barny
Here is my solution using defaultdict with Counters and regex with lookahead.
import re
from collections import Counter, defaultdict

with open('in.txt', 'r') as f:
    # make sure we have only \n as line end and no leading or trailing whitespace
    # this makes the regex less complex
    alltext = '\n'.join(line.strip() for line in f)

# find keyword script\d+ and capture it, then lazily expand and capture everything,
# with a lookahead so that we stop as soon as, and only if, the next word is 'script' or
# the end of the string
scriptPattern = re.compile(r'(script\d+)(.*?)(?=script|\n?$)', re.DOTALL)

# just find everything that matches launch\d+
launchPattern = re.compile(r'launch\d+')

# create a defaultdict with a counter for every entry
scriptDict = defaultdict(Counter)

# go through all matches
for match in scriptPattern.finditer(alltext):
    script, body = match.groups()
    # update the counter of this script
    scriptDict[script].update(launchPattern.findall(body))

# print the results
for script in sorted(scriptDict):
    counter = scriptDict[script]
    if len(counter):
        print('{} launches:'.format(script))
        for launch in sorted(counter):
            count = counter[launch]
            print('\t{} {} time(s)'.format(launch, count))
    else:
        print('{} launches nothing'.format(script))
Using the sample string on regex101, I get the following result:
script2 launches:
launch4 1 time(s)
script3 launches nothing
script5 launches:
launch1 1 time(s)
script6 launches nothing
script7 launches nothing
script8 launches:
launch3 1 time(s)
Here's an approach which uses nested dictionaries. Please tell me if you would like the output to be in a different format:
#!/usr/bin/env python3
import re
script_dict = {}
with open('infile.txt', 'r') as infile:
    scriptre = re.compile(r"^script\d+$")
    for line in infile:
        line = line.rstrip()
        if scriptre.match(line) is not None:
            script_dict[line] = {}
    infile.seek(0)  # go to beginning
    launchre = re.compile(r"^launch\d+\.[vV][bB][sS]$")
    current = None
    for line in infile:
        line = line.rstrip()
        if line in script_dict:
            current = line
        elif launchre.match(line) is not None and current is not None:
            if line not in script_dict[current]:
                script_dict[current][line] = 1
            else:
                script_dict[current][line] += 1
print(script_dict)
You could use setdefault method
code:
dic = {}
with open("a.txt") as inp:
    check = 0
    key_string = ""
    for line in inp:
        if check:
            if line.strip().startswith("launch") and int(line.strip()[6]) < 6:
                print "yes"
                dic[key_string] = dic.setdefault(key_string, 0) + 1
            check = 0
        if line.strip().startswith("script"):
            key_string = line.strip()
            check = 1
For your script3 example input above, the output would be
output:
{"script3": 6}

How would you find text in a string in python and then look for a number after it?

I have a log file and at the end of each line in the file there is this string:
Line:# where # is the line number.
I am trying to get the # and compare it to the previous line's number. What would be the best way to do that in Python?
I would probably use str.split because it seems easy:
with open('logfile.log') as fin:
    numbers = [int(line.split(':')[-1]) for line in fin]
Now you can use zip to compare one number with the next one:
for num1, num2 in zip(numbers, numbers[1:]):
    compare(num1, num2)  # do comparison here.
Of course, this isn't lazy (you store every line number in the file at once when you really only need 2 at a time), so it might take up a lot of memory if your files are HUGE. It wouldn't be hard to make it lazy though:
def elem_with_next(iterable):
    ii = iter(iterable)
    prev = next(ii)
    for here in ii:
        yield prev, here
        prev = here

with open('logfile.log') as fin:
    numbers = (int(line.split(':')[-1]) for line in fin)
    for num1, num2 in elem_with_next(numbers):
        compare(num1, num2)
I'm assuming that you don't have something convenient to split a string on, meaning a regular expression might make more sense. That is, if the lines in your log file are structured like:
date: 1-15-2013, error: mildly_annoying, line: 121
date: 1-16-2013, error: err_something_bad, line: 123
Then you won't be able to use line.split('#') as mgilson suggested, although if there is always a colon, line.split(':') might work. In any case, a regular expression solution would look like:
import re

numbers = []
for line in log:
    digit_match = re.search(r"(\d+)$", line)
    if digit_match is not None:
        numbers.append(int(digit_match.group(1)))
Here the expression "(\d+)$" is matching some number of digits and then the end of the line. We extract the digits with the group(1) method on the returned match object and then add them to our list of line numbers.
If you're not confident that the "Line: #" will always come at the end of the log, you could replace the regular expression used above with something akin to "Line:\s*(\d+)" which checks for the string "Line:" then some (or no) whitespace, and then any number of digits.
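For example, on a hypothetical log line (format assumed from the samples above):
line = "date: 1-17-2013, error: err_something_bad, Line: 125"
m = re.search(r"Line:\s*(\d+)", line)
if m:
    print(int(m.group(1)))  # prints 125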
