Python Programming for .json Loggly files - python

I want to search particular strings in long .json loggly file, including its line number and also want to print 5 lines above and below the searched line. Can u guzz plzz help me ?
it is always returning "NOT FOUND".
after this now i am only getting some output with the below shown program.
with open('logg.json', 'r') as f:
for ln, line in enumerate(f):
if "error CRASHLOG" in line:
i = ln-25
for i in (ln-25,ln+25):
l = linecache.getline('logg.json', i)
i+=1
print(ln,l)
print(" Next Error")

file.readlines() return a list of lines. Lines does contains newline (\n).
You need to specify newline to match the line:
ln = data.index("error CRASHLOG\n")
If you want to find a line that contians a target string, you need to iterate the lines:
with open('logg.json', 'r') as f:
for ln, line in enumerate(f):
if "error CRASHLOG" in line:
# You now know the line index (`ln`)
# Print above/below 5 lines here.
break
else:
print("Not Found")
BTW, this kind of work is easily done with grep(1):
grep -C 5 'error CRASHLOG' logg.json || echo 'Not Found'
UPDATE
Following is more complete code:
from collections import deque
from itertools import islice, chain
import sys
with open('logg.json', 'r') as f:
last_lines = deque(maxlen=5) # contains the last (up to) 5 lines.
for ln, line in enumerate(f):
if "error CRASHLOG" in line:
sys.stdout.writelines(chain(last_lines, [line], islice(f, 5)))
last_lines.append(line)
else:
print("Not Found")

I'm sure it actually returns "Not found", but I put that down to anxiety-induced shoutiness.
data is a list. The documentation about the list type (http://docs.python.org/2/tutorial/datastructures.html) states that list.index(x) returns "the index in the list of the first item whose value is x. It is an error if there is no such item."
Therefore the only lines that would be reported are those that contain ONLY the string you specify with no other characters. As falsetru points out in her/his answer, if there is no other information on the log lines then you must include for comparison the newline that readlines() ensures is at the end of every line in the list it returns (even on a Windows system as long as you open the file in text mode, which is the default). Without that there is no chance of a match.
If the lines contain other information then a better test might indeed be to use x in string as she/he suggests, but I suspect you might be interested to see how much more processing it takes than a simple equality test. Come to that so would I, but this isn't my problem ...

Related

Python - how to navigate through text file multiple lines backwards using seek()?

What im trying to do is match a phrase in a text file, then print that line(This works fine). I then need to move the cursor up 4 lines so I can do another match in that line, but I cant get the seek() method to move up 4 lines from the line that has been matched so that I can do another regex search. All I can seem to do with seek() is search from the very end of the file, or the beginning. It doesn't seem to let me just do seek(105,1) from the line that is matched.
### This is the example test.txt
This is 1st line
This is 2nd line # Needs to seek() to this line from the 6th line. This needs to be dynamic as it wont always be 4 lines.
This is 3rd line
This is 4th line
This is 5th line
This is 6st line # Matches this line, now need to move it up 4 lines to the "2nd line"
This is 7 line
This is 8 line
This is 9 line
This is 10 line
#
def Findmatch():
file = open("test.txt", "r")
print file.tell() # shows 0 which is the beginning of the file
string = file.readlines()
for line in string:
if "This is 6th line" in line:
print line
print file.tell() # shows 171 which is the end of the file. I need for it to be on the line that matches my search which should be around 108. seek() only lets me search from end or beginning of file, but not from the line that was matched.
Findmatch()
Since you've read all of it into memory at once with file.readlines(). tell() method does indeed correctly point to the end and your already have all your lines in an array. If you still wanted to, you'd have to read the file in line by line and record position within file for each line start so that you could go back four lines.
For your described problem. You can first find index of the line first match and then do the second operation starting from the list slice four items before that.
Here a very rough example of that (return None isn't really needed, it's just for sake of verbosity, clearly stating intent/expected behavior; raising an exception might be just as well a desired depending on what the overall plan is):
def relevant(value, lines):
found = False
for (idx, line) in enumerate(lines):
if value in line:
found = True
break # Stop iterating, last idx is a match.
if found is True:
idx = idx - 4
if idx < 0:
idx = 0 # Just return all lines up to now? Or was that broken input and fail?
return lines[idx:]
else:
return None
with open("test.txt") as in_file:
lines = in_file.readlines()
print(''.join(relevant("This is 6th line", lines)))
Please also note: It's a bit confusing to name list of lines string (one would probably expect a str there), go with lines or something else) and it's also not advisable (esp. since you indicate to be using 2.7) to assign your variable names already used for built-ins, like file. Use in_file for instance.
EDIT: As requested in a comment, just a printing example, adding it in parallel as the former seem potentially more useful for further extension. :) ...
def print_relevant(value, lines):
found = False
for (idx, line) in enumerate(lines):
if value in line:
found = True
print(line.rstrip('\n'))
break # Stop iterating, last idx is a match.
if found is True:
idx = idx - 4
if idx < 0:
idx = 0 # Just return all lines up to now? Or was that broken input and fail?
print(lines[idx].rstrip('\n'))
with open("test.txt") as in_file:
lines = in_file.readlines()
print_relevant("This is 6th line", lines)
Note, since lines are read in with trailing newlines and print would add one of its own I've rstrip'ed the line before printing. Just be aware of it.

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken csv file to make be able to work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them and when exporting the csv the tooling does not contain quotation marks to define it as a string within the field.
see below example:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re
with open('Rapp.txt', 'r') as f:
for line in f:
previous = line #keep current line in variable to join next line
if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
print(previous.join(line))
Script shows no output just finishes silently, any thoughts?
I think I would go a slightly different way:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
for line in f:
if not re.search("\d{4}-\d{1,2}-\d{1,2};\n", line):
line = re.sub("\n", "", line)
all_the_data = "".join([all_the_data, line])
print (all_the_data)
There a several ways to do this each with pros and cons, but I think this keeps it simple.
Loop the file as you have done and if the line doesn't end in a date and ; take off the carriage return and stuff it into all_the_data. That way you don't have to play with looking back 'up' the file. Again, lots of way to do this. If you would rather use the logic of starts with 3 letters and a ; and looking back, this works:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
all_the_data = ""
for line in f:
if not re.search("^[A-Za-z]{3};", line):
all_the_data = re.sub("\n$", "", all_the_data)
all_the_data = "".join([all_the_data, line])
print ("results:")
print (all_the_data)
Pretty much what was asked for. The logic being if the current line doesn't start right, take out the previous line's carriage return from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches to all the lines (string) in the txt (finds a valid match to the pattern). The if condition is never true and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
join_words = []
for line in f:
line = line.strip()
if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
print(';'.join(join_words))
join_words = []
join_words.append(line)
else:
join_words.append(line)
print(";".join(join_words))
I've tried to not use regex here to keep it a little clear if possible. But, regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it has not a semicolon (;) in its 4th column. Code could be:
def preprocess(fd):
previous = next(fd)
for line in fd:
if line[3] == ';':
yield previous
previous = line
else:
previous = previous.strip() + " " + line
yield previous # don't forget last line!
You could then use:
with open(test.txt) as fd:
rd = csv.DictReader(preprocess(fd))
for row in rd:
...
The trick here is that the csv module only requires on object that returns a line each time next function is applied to it, so a generator is appropriate.
But this is only a workaround and the correct way would be that the previous step directly produces a correct CSV file.

Python, Extracting 3 lines before and after a match

I am trying to figure out how to extract 3 lines before and after a matched word.
At the moment, my word is found. I wrote up some text to test my code. And, I figured out how to print three lines after my match.
But, I am having difficulty trying to figure out how to print three lines before the word, "secure".
Here is what I have so far:
from itertools import islice
with open("testdoc.txt", "r") as f:
for line in f:
if "secure" in line:
print("".join(line))
print ("".join(islice(f,3)))
Here is the text I created for testing:
----------------------------
This is a test to see
if i can extract information
using this code
I hope, I try,
maybe secure shell will save thee
Im adding extra lines to see my output
hoping that it comes out correctly
boy im tired, sleep is nice
until then, time will suffice
You need to buffer your lines so you can recall them. The simplest way is to just load all the lines into a list:
with open("testdoc.txt", "r") as f:
lines = f.readlines() # read all lines into a list
for index, line in enumerate(lines): # enumerate the list and loop through it
if "secure" in line: # check if the current line has your substring
print(line.rstrip()) # print the current line (stripped off whitespace)
print("".join(lines[max(0,index-3):index])) # print three lines preceeding it
But if you need maximum storage efficiency you can use a buffer to store the last 3 lines as you loop over the file line by line. A collections.deque is ideal for that.
i came up with this solution, just adding the previous lines in a list, and deleting the first one after 4 elements
from itertools import islice
with open("testdoc.txt", "r") as f:
linesBefore = list()
for line in f:
linesBefore.append(line.rstrip())
if len(linesBefore) > 4: #Adding up to 4 lines
linesBefore.pop(0)
if "secure" in line:
if len(linesBefore) == 4: # if there are at least 3 lines before the match
for i in range(3):
print(linesBefore[i])
else: #if there are less than 3 lines before the match
print(''.join(linesBefore))
print("".join(line.rstrip()))
print ("".join(islice(f,3)))

Python - Count characters between two specific strings

I made a text file containing random sequences of bases (ATCG) and want to find the longest and shortest "reading frame" within those sequences.
I was able to identify the Start- and Stop-Codons (the two "specific strings" mentioned) with "searchfile" and a for-loop and also know the basics of counting (example of code at the end) but I can't find any possibility to set those two as "boundaries" between I can count.
Can anybody perhaps give me a hint or tell me how such a function/operation is called so I can at least find it in a documentary or how it could look like? I found many options how to count various different things but none for counting between "x" and "y".
Example of how I looked up the strings between which I want to count:
searchfile = open('dna.txt', 'r')
for line in searchfile:
if "ATG" in line: print (line)
searchfile.close()
whole code:
import numpy as np
BASES = ('A', 'C', 'T', 'G')
P = (0.25, 0.25, 0.25, 0.25)
def random_dna_sequence(length):
return ''.join(np.random.choice(BASES, p=P) for _ in range(length))
with open('dna.txt', 'w+') as txtout:
for _ in range(10):
dna = random_dna_sequence(50)
txtout.write(dna)
txtout.write("\n")
searchfile = open('dna.txt', 'r')
for line in searchfile:
if "ATG" in line: print (line)
searchfile.close()
searchfile = open('dna.txt', 'r')
for line in searchfile:
if "ATG" in line: print (line)
elif "TAG" in line: print (line)
elif "TAA" in line: print (line)
elif "TGA" in line: print (line)
else: print ("no stop-codon detected")
searchfile.close()
Sidenote: The print instruction is only a temporary placeholder for testing. In the end i would like to set the found strings as mentioned "boundaries" (i can't find a better name for it) at that point.
Some example lines from the dna.txt file:
GAAGACGCAATAGGTTCACGGCGCTCATAGGCTTGCCCTCATAGGGCTTG
TCTGAGGTAGAAGGAGCTACTGCCGTTGCAGGTGACGCCCACAGTCCTGA
GTTATTACTCCCTGACTGTCATCTGTTCGGATACCGTGCAGCGCATCGAG
AGGAGATAACGCGATCCTGAGACAGTTTACCTATATGTTCACTACGCATG
CCGAGCTGATCCGACTACTGAAGGTGAATTCTGAAGCTAATCTGCAGTTC
This is a small example (I use 10 and 50 for testing) but in the end the file shall contain 10000 sequences with 1000 characters each.
What I would do is something like this:
with open("dna.txt", 'r') as searchfile:
all_dna = searchfile.read()
start = all_dna.index("ATG")
rem_dna = all_dna[start + 3:]
end = rem_dna.index("ATG")
needed_dna = all_dna[start:(end + 3)]
print len(needed_dna)
index finds where in a string the substring passed as an argument occurs, and will raise ValueError if the substring is not found. with is a keyword useful as a safety precaution for file I/O that ensures that the file is properly closed even if the code inside that block causes an error. If you don't want to include the starting and ending "ATG" in needed_dna, you can set that to all_dna[(start + 3):end]. The brackets, by the way, mean "take the substring of the specified string beginning at the argument before the colon (inclusive, zero-indexed) and ending at the argument after the colon (non-inclusive, also zero-indexed). This can also be used for lists, and can be used without the colon to get the character at a specific index. Hope this helps!

Include surrounding lines of text file match in output using Python 2.7.3

I've been working on a program which assists in log analysis. It finds error or fail messages using regex and prints them to a new .txt file. However, it would be much more beneficial if the program including the top and bottom 4 lines around what the match is. I can't figure out how to do this! Here is a part of the existing program:
def error_finder(filepath):
source = open(filepath, "r").readlines()
error_logs = set()
my_data = []
for line in source:
line = line.strip()
if re.search(exp, line):
error_logs.add(line)
I'm assuming something needs to be added to the very last line, but I've been working on this for a bit and either am not applying myself fully or just can't figure it out.
Any advice or help on this is appreciated.
Thank you!
Why python?
grep -C4 '^your_regex$' logfile > outfile.txt
Some comments:
I'm not sure why error_logs is a set instead of a list.
Using readlines() will read the entire file in memory, which will be inefficient for large files. You should be able to just iterate over the file a line at a time.
exp (which you're using for re.search) isn't defined anywhere, but I assume that's elsewhere in your code.
Anyway, here's complete code that should do what you want without reading the whole file in memory. It will also preserve the order of input lines.
import re
from collections import deque
exp = '\d'
# matches numbers, change to what you need
def error_finder(filepath, context_lines = 4):
source = open(filepath, 'r')
error_logs = []
buffer = deque(maxlen=context_lines)
lines_after = 0
for line in source:
line = line.strip()
if re.search(exp, line):
# add previous lines first
for prev_line in buffer:
error_logs.append(prev_line)
# clear the buffer
buffer.clear()
# add current line
error_logs.append(line)
# schedule lines that follow to be added too
lines_after = context_lines
elif lines_after > 0:
# a line that matched the regex came not so long ago
lines_after -= 1
error_logs.append(line)
else:
buffer.append(line)
# maybe do something with error_logs? I'll just return it
return error_logs
I suggest to use index loop instead of for each loop, try this:
error_logs = list()
for i in range(len(source)):
line = source[i].strip()
if re.search(exp, line):
error_logs.append((line,i-4,i+4))
in this case your errors log will contain ('line of error', line index - 4, line index + 4), so you can get these lines later form "source"

Categories

Resources