Parse log between datetime range using Python

I'm trying to make a dynamic function: I give two datetime values and it could read the log between those datetime values, for example:
start_point = "2019-04-25 09:30:46.781"
stop_point = "2019-04-25 10:15:49.109"
I'm thinking of an algorithm that checks:
if the dates are equal:
    check whether the first character of the start hour (09 -> 0) is higher or lower than the first character of the stop hour (10 -> 1);
    same check with the second character of the hour ((start) 09 -> 9, (stop) 10 -> 0);
    same check with the first character of the minute;
    same check with the second character of the minute;
if the dates differ:
    some other checks...
I may be reinventing the wheel, but I'm really lost. Here is what I tried:
1.
...
cmd = subprocess.Popen(['egrep "2019-04-19 ([0-1][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9].[0-9]{3}" file.log'], shell=True, stdout=subprocess.PIPE)
cmd_result = cmd.communicate()[0]
for i in str(cmd_result).split("\n"):
    print(i)
...
The problem with this one: when I built the pattern from the example values, it did not work, because it produces invalid character ranges. For example, the second character of the hour gives the range [9-0], and the first character of the minute gives [3-1].
2.
I tried the solutions from "The best way to filter a log by a dates range in python".
Any help is appreciated.
EDIT
the log line structure:
...
2019-04-25 09:30:46.781 text text text ...
2019-04-25 09:30:46.853 text text text ...
...
EDIT 2
So I tried the code:
from datetime import datetime as dt
s1 = "2019-04-25 09:34:11.057"
s2 = "2019-04-25 09:59:43.534"
start = dt.strptime('2019-04-25 09:34:11.057','%Y-%m-%d %H:%M:%S.%f')
stop = dt.strptime('2019-04-25 09:59:43.534', '%Y-%m-%d %H:%M:%S.%f')
start_1 = dt.strptime('09:34:11.057','%H:%M:%S.%f')
stop_1 = dt.strptime('09:59:43.534','%H:%M:%S.%f')
with open('file.out','r') as file:
    for line in file:
        ts = dt.strptime(line.split()[1],'%H:%M:%S.%f')
        if (ts > start_1) and (ts < stop_1):
            print line
and I got the error
ValueError: time data 'Platform' does not match format '%H:%M:%S.%f'
So it seems I found another problem: some lines do not start with a datetime. Is there a way to supply a regex that describes the datetime format?
EDIT 3
I fixed the ValueError raised when a line starts with a non-datetime string, and the IndexError raised when a line has too few fields:
try:
    ts = dt.strptime(line.split()[1],'%H:%M:%S.%f')
    if (ts > start_1) and (ts < stop_1):
        print line
except IndexError as err:
    continue
except ValueError as err:
    continue
Now the output is not limited to the range I provide: it prints log lines from 2019-02-27 09:38:46.229 to 2019-02-28 09:57:11.028. Any thoughts?

Your edit 2 had the right idea. You need to put exception handling in to catch lines which are not formatted correctly and skip them, for example blank lines, or lines that do not have the timestamp. This can be done as follows:
from datetime import datetime
s1 = "2019-04-25 09:24:11.057"
s2 = "2019-04-25 09:59:43.534"
fmt = '%Y-%m-%d %H:%M:%S.%f'
start = datetime.strptime(s1, fmt)
stop = datetime.strptime(s2, fmt)
with open('file.out', 'r') as file:
    for line in file:
        line = line.strip()
        try:
            ts = datetime.strptime(' '.join(line.split(' ', maxsplit=2)[:2]), fmt)
            if start <= ts <= stop:
                print(line)
        except ValueError:
            # the line does not start with a valid timestamp, so skip it
            pass
The whole timestamp (date and time) is used to create ts so that it can be compared correctly with start and stop.
Each line first has the trailing newline removed. It is then split on spaces at most twice. The first two pieces are joined back together and converted into a datetime object. If this fails, the line does not start with a correctly formatted timestamp.
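The same idea can be wrapped in a reusable function. This is only a minimal sketch, assuming every line of interest starts with a 23-character timestamp in the format shown in the question; the names filter_lines, TS_FORMAT and TS_LENGTH and the sample lines are made up for the illustration.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # assumed timestamp layout
TS_LENGTH = 23                      # len("2019-04-25 09:30:46.781")


def filter_lines(lines, start, stop):
    """Yield the lines whose leading timestamp lies in [start, stop]."""
    start_dt = datetime.strptime(start, TS_FORMAT)
    stop_dt = datetime.strptime(stop, TS_FORMAT)
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # no valid timestamp at the start of this line
        if start_dt <= ts <= stop_dt:
            yield line.rstrip("\n")


# Hypothetical sample data; a real caller would pass an open file instead.
sample = [
    "2019-04-25 09:30:46.781 text text text\n",
    "Platform header without a timestamp\n",
    "2019-04-25 09:45:00.000 text text text\n",
    "2019-04-25 10:20:00.000 text text text\n",
]
selected = list(filter_lines(sample, "2019-04-25 09:30:46.781",
                             "2019-04-25 10:15:49.109"))
for line in selected:
    print(line)
```

Because the function accepts any iterable of lines, it can be used as filter_lines(open('file.out'), s1, s2) without reading the whole file into memory.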

Related

Is there a way to print data from a log file between two endpoints in python

I have a log file and am trying to print the data between two dates.
2020-01-31T20:12:38.1234Z, asdasdasdasdasdasd,...\n
2020-01-31T20:12:39.1234Z, abcdef,...\n
2020-01-31T20:12:40.1234Z, ghikjl,...\n
2020-01-31T20:12:41.1234Z, mnopqrstuv,...\n
2020-01-31T20:12:42.1234Z, wxyzdsasad,...\n
This is the sample log file and I want to print the lines between 2020-01-31T20:12:39 up to 2020-01-31T20:12:41.
So far I have managed to find and print the line with the starting date. I pass the starting date as start.
linenum = 0
with open("logfile.log") as myFile:
    for line in myFile:
        linenum += 1
        if line.find(start) != -1:
            print("Line " + str(linenum) + ": " + line.rstrip('\n'))
But how do I keep printing until the end date?

Not the answer in python but in bash.
sed -n '/2020-01-31T20:12:38.1234Z/,/2020-01-31T20:12:41.1234Z/p' file.log
Output:
2020-01-31T20:12:38.1234Z, asdasdasdasdasdasd,...\n
2020-01-31T20:12:39.1234Z, abcdef,...\n
2020-01-31T20:12:40.1234Z, ghikjl,...\n
2020-01-31T20:12:41.1234Z, mnopqrstuv,...\n

Since the time string is already structured nicely in your file, you can just do a simple string comparison between the times you're interested in without converting the string to a datetime object.
Use the csv module to read in the file, using the default comma delimiter, and then the filter() function to filter between two dates.
import csv
reader = csv.reader(open("logfile.log"))
filtered = filter(lambda p: p[0].split('.')[0] >= '2020-01-31T20:12:39' and p[0].split('.')[0] <= '2020-01-31T20:12:41', reader)
for l in filtered:
    print(','.join(l))
Edit:
I used split() to remove the fractional part of the time string before the comparison, since you're interested in times to the nearest second, e.g. 2020-01-31T20:12:39.
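The reason plain string comparison works is that timestamps written as YYYY-MM-DDTHH:MM:SS sort lexicographically in the same order as chronologically. A minimal sketch of that idea without the csv module follows; the function name lines_between is made up, and it assumes start and stop have the same length and that every line begins with a timestamp.

```python
def lines_between(lines, start, stop):
    """Return lines whose timestamp prefix lies in [start, stop]."""
    width = len(start)  # 19 characters for 'YYYY-MM-DDTHH:MM:SS'
    # Comparing the prefix as a string is enough for this fixed-width format.
    return [line for line in lines if start <= line[:width] <= stop]


log = [
    "2020-01-31T20:12:38.1234Z, asdasdasdasdasdasd,...",
    "2020-01-31T20:12:39.1234Z, abcdef,...",
    "2020-01-31T20:12:40.1234Z, ghikjl,...",
    "2020-01-31T20:12:41.1234Z, mnopqrstuv,...",
    "2020-01-31T20:12:42.1234Z, wxyzdsasad,...",
]
picked = lines_between(log, "2020-01-31T20:12:39", "2020-01-31T20:12:41")
for line in picked:
    print(line)
```

This only holds while every timestamp uses the same zero-padded layout and time zone; mixed formats would need real datetime parsing.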

If you want it in Python:
import time
from datetime import datetime as dt
def to_timestamp(date, forma='%Y-%m-%dT%H:%M:%S'):
    return time.mktime(dt.strptime(date, forma).timetuple())
# startdate and enddate are strings such as '2020-01-31T20:12:39'
start = to_timestamp(startdate)
end = to_timestamp(enddate)
logs = {}
with open("logfile.log") as f:
    for line in f:
        date = line.split(', ')[0].split('.')[0]
        logline = line.split(', ')[1].strip('\n')
        # keep the line when its timestamp lies within [start, end]
        if start <= to_timestamp(date) <= end:
            logs[date] = logline

How to extract multiple time from same string in Python?

I'm trying to extract time from single strings where in one string there will be texts other than only time. An example is s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'.
I've tried using datefinder module like this :
from datetime import datetime as dt
import datefinder as dfn
for m in dfn.find_dates(s):
    print(dt.strftime(m, "%H:%M:%S"))
Which gives me this :
17:58:00
In this case the time "06:00" is missed out. Now if I try without datefinder with only datetime module like this :
dt.strftime(s, "%H:%M")
It notifies me that the input must be a datetime object already, not a string with the following error :
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'str'
So I tried to use dateutil module to parse this string s to a datetime object with this :
from dateutil.parser import parse
parse(s)
but now it says that my string is not in a proper format (in most cases it will not be in any fixed format), showing me this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1358, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '12/Jul/2019 12/Aug/2019 MEISHAN BRIDGE 06:00 17:58')
I have thought of getting the time with regex like
import re
p = r"\d{2}\:\d{2}"
times = [i.group() for i in re.finditer(p, s)]
# Gives me ['06:00', '17:58']
But with this approach I need to check again whether the matched chunks are actually times, because even "99:99" matches the regex and would wrongly be reported as a time. Is there any workaround without regex to get all the times from a single string?
Please note that the string might contain or might not contain any date, but it will contain a time always. Even if it contains date, the date format might be anything on earth and also this string might or might not contain other irrelevant texts.

I don't see many options here, so I would go with a heuristic. I would run the following against the whole dataset and extend the config/regexes until it covers all/most of the cases:
import re
import logging
from datetime import datetime as dt
s = 'Dates : 12/Jul/2019 12/08/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58:59'
SUPPORTED_DATE_FMTS = {
    re.compile(r"(\d{2}/\w{3}/\d{4})"): "%d/%b/%Y",
    re.compile(r"(\d{2}/\d{2}/\d{4})"): "%d/%m/%Y",
    re.compile(r"(\d{2}/\w{3}\w+/\d{4})"): "%d/%B/%Y",
    # Capture more here
}
SUPPORTED_TIME_FMTS = {
    # (?!:) rejects HH:MM followed by ':' but still matches at the end of the string
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9])(?!:)"): "%H:%M",
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9]:[0-5][0-9])"): "%H:%M:%S",
    # Capture more here
}
def extract_supported_dt(config, s):
    """
    Loop thru the given config (keys are regexes, values are date/time format)
    and attempt to gather all valid data.
    """
    valid_data = []
    for regex, fmt in config.items():
        # Extract what you think looks like date
        valid_ish_data = regex.findall(s)
        if not valid_ish_data:
            continue
        print("Checking " + str(valid_ish_data))
        # validate it
        for d in valid_ish_data:
            try:
                valid_data.append(dt.strptime(d, fmt))
            except ValueError:
                pass
    return valid_data
# Handle dates
dates = extract_supported_dt(SUPPORTED_DATE_FMTS, s)
# Handle times
times = extract_supported_dt(SUPPORTED_TIME_FMTS, s)
print("Found dates: ")
for date in dates:
    print("\t" + str(date.date()))
print("Found times: ")
for t in times:
    print("\t" + str(t.time()))
Example output:
Checking ['12/Jul/2019']
Checking ['12/08/2019']
Checking ['06:00']
Checking ['17:58:59']
Found dates:
	2019-07-12
	2019-08-12
Found times:
	06:00:00
	17:58:59
This is a trial and error approach but I do not think there is an alternative in your case. Thus my goal here is to make it as easy as possible to extend support with more date/time formats as opposed to try to find a solution that covers 100% of the data day-1. This way, the more data you run against the more complete your config will be.
One thing to note is that you will have to detect strings that appear to have no dates and log them somewhere. Later you will need to manually revise and see if something that was missed could be captured.
Now, assuming that your data are being generated by another system, sooner or later you will be able to match 100% of it. If the data input is from human, then you will probably never manage to get 100%! (people tend to make spelling mistakes and sometimes import random stuff... date=today :) )

If you need only time this regex should work fine
r"[0-2][0-9]\:[0-5][0-9]"
If there could be spaces in time like 23 : 59 use this
r"[0-2][0-9]\s*\:\s*[0-5][0-9]"

Use a regex, but something like this:
(?=[0-1])[0-1][0-9]\:[0-5][0-9]|(?=2)[2][0-3]\:[0-5][0-9]
This matches:
00:00, 00:59, 01:00, 01:59, 02:00, 02:59,
09:00, 10:00, 11:59, 20:00, 21:59, 23:59
It does not match:
99:99, 23:99, 01:99
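Another way to address the "99:99" concern is to keep a loose regex and let datetime.strptime reject impossible values. The sketch below is only an illustration of that combination; the names TIME_PATTERN and extract_times are made up, and it assumes times are always written as two digits, a colon, and two digits.

```python
import re
from datetime import datetime

# Loose candidate pattern: two digits, a colon, two digits, not glued to
# other digits. Validation is left to strptime below.
TIME_PATTERN = re.compile(r"(?<!\d)\d{2}:\d{2}(?!\d)")


def extract_times(text):
    """Return datetime.time objects for every valid HH:MM found in text."""
    times = []
    for candidate in TIME_PATTERN.findall(text):
        try:
            times.append(datetime.strptime(candidate, "%H:%M").time())
        except ValueError:
            pass  # looked like a time but is not one, e.g. "99:99"
    return times


s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'
found = extract_times(s + " 99:99")  # "99:99" appended to show the rejection
print([t.strftime("%H:%M") for t in found])
```

Strings with seconds, such as 17:58:59, would need an extra pattern and format; this sketch does not handle them.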

you could use dictionaries:
my_dict = {}
for i in s.split(', '):
    m = i.strip().split(' : ', 1)
    my_dict[m[0]] = m[1].split()
my_dict
Out:
{'Dates': ['12/Jul/2019', '12/Aug/2019'],
'Loc': ['MEISHAN', 'BRIDGE'],
'Time': ['06:00', '17:58']}

File reading & counting & sorting by hours in Python

I'm new to Python & here is my question
Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.
Link of the file:
http://www.pythonlearn.com/code/mbox-short.txt
This is my code:
name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
counts = dict()
for line in handle:
    if not line.startswith ("From "):continue
    #words = line.split()
    col = line.find(':')
    coll = col - 2
    print coll
    #zero = line.find('0')
    #one = line.find('1')
    #b = line[ zero or one : col ]
    #print b
    #hour = words[5:6]
    #print hour
    #for line in hour:
    #    hr = line.split(':')
    #    x = hr[1]
    for x in coll:
        counts[x] = counts.get(x,0) + 1
for key, value in sorted(counts.items()):
    print key, value
My first try used list splitting (now commented out), and it didn't work because it treated 0 and 1 as the first and second characters of the line, not as numbers.
The second one used line.find(':'), which partially worked, but with the minutes rather than the hours as required.
First question
Why when I write line.find(:), it takes automatically the 2 numbers after?
Second question
Why when I run the program now, it gives an error
TypeError: 'int' object is not iterable on line 26 ??
Third question
Why it considered 0 & 1 as first & second letters of the line not 0 & 1 numbers
Finally
If possible please solve me this problem with a little of explanation please (with the same codes to keep my learning sequence)
Thank you...

First question
Why when I write line.find(:), it takes automatically the 2 numbers after?
str.find() returns the first index of the substring you are looking for. If your string is "From 00:00:00", it returns 7, because the first ':' is at index 7.
Second question
Why when I run the program now, it gives an error TypeError: 'int' object is not iterable on line 26 ??
As said above, str.find() returns an int, and you cannot iterate over an int.
Third question
Why it considered 0 & 1 as first & second letters of the line not 0 & 1 numbers
I don't really understand what you mean here. As I understand it, you try to find the first index at which '0' or '1' occurs and assume that it is the first character of the hour? What about 8-11pm (hours 20-23, which start with 2)?
Finally If possible please solve me this problem with a little of explanation please (with the same codes to keep my learning sequence)
Sure, it will be like this:
count = {}  # maps hour -> number of messages
for line in f:  # f is the opened file
    if not line.startswith("From "): continue
    first_colon_index = line.find(":")
    if first_colon_index == -1: # there is no ':'
        continue
    first_char_hour_index = first_colon_index - 2
    # string slicing
    # [a:b] get string from index a up to, but not including, b
    hour = line[first_char_hour_index:first_char_hour_index+2]
    hour_int = int(hour)
    # if key exist, increase by 1. If not, set to 1
    if hour_int in count:
        count[hour_int] += 1
    else:
        count[hour_int] = 1
# print hour & count, in sorting order
for hour in sorted(count):
    print hour, count[hour]
String slicing can be confusing; you can read more about it in the Python docs.
You also have to make sure that the line contains no other ":" before the time, or this method will fail, because the first ":" will not be the one between the hour and the minute.
To make it more robust, it's better to use a regex. Something like:
import re
count = {}  # maps hour -> number of messages
for line in f:  # f is the opened file
    if not line.startswith("From"): continue
    match = re.search(r'^From.*?([0-9]{2}:[0-9]{2}:[0-9]{2})', line)
    if match:
        time = match.group(1) # hh:mm:ss
        hh = int(time.split(":")[0])
        # if key exist, increase by 1. If not, set to 1
        if hh in count:
            count[hh] += 1
        else:
            count[hh] = 1
# print hour & count, in sorting order
for hour in sorted(count):
    print hour, count[hour]
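For readers on Python 3, the counting step can also be written with collections.Counter. This is a hedged sketch rather than the answer's own code: the names FROM_TIME and count_hours are made up, and apart from the first line (taken from the question) the sample lines use invented example.com addresses.

```python
import re
from collections import Counter

# "From " followed by anything, then the first hh:mm:ss on the line.
FROM_TIME = re.compile(r"^From .*?(\d{2}):\d{2}:\d{2}")


def count_hours(lines):
    """Count messages per hour from mbox-style 'From ' lines."""
    counts = Counter()
    for line in lines:
        match = FROM_TIME.search(line)
        if match:  # lines such as "From: ..." do not match and are skipped
            counts[int(match.group(1))] += 1
    return counts


sample = [
    "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008",
    "From: stephen.marquard@uct.ac.za",
    "From alice@example.com Fri Jan 4 18:10:48 2008",
    "From bob@example.com Fri Jan 4 09:05:31 2008",
]
hours = count_hours(sample)
for hour in sorted(hours):
    print(hour, hours[hour])
```

With a real file, count_hours(open("mbox-short.txt")) would take the place of the sample list.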

That's because str.find() returns an index of the found substring, not the string itself. Consequently, when you subtract 2 from it and then try to loop through it it will complain that you're trying to loop through an integer and raise a TypeError.
You can grab the whole time string as:
time_start = line.find(":")
if time_start == -1:  # not found
    continue
time_string = line[time_start-2:time_start+6]  # slice out the whole time string
You can then split time_string on : to get the hours, minutes and seconds, e.g. hours, minutes, seconds = time_string.split(":", 2). Just keep in mind that those will be strings, not integers. Or, if you just want the hour:
hour = int(line[time_start-2:time_start])
You can take it from there - just increase your dict value and when you're done with parsing the file sort everything out.

formatting date, time and string for filename

I want to create a csv file with a filename of the following format:
"day-month-year hour:minute-malware_scan.csv"
Example: "6-8-2016 21:45-malware_scan.csv"
The first part of the filename is formed by the actual date and time at file creation time, instead "-malware_scan.csv" is a fixed string.
I know that in order to get the date and time I should use the time or datetime module and the strftime() function for formatting.
At first I tried with:
import datetime
t = datetime.datetime.now()
formatted_time = t.strftime("%d-%m-%y %H:%M")
filename = formatted_time + "-malware_scan.csv"
with open(filename, "a") as f:
    ...............
I didn't get the expected result, so I tried another way:
i = datetime.datetime.now()
file_to_open = "{day}-{month}-{year} {hour}:{minute}-malware_scan.csv".format(day = i.day, month = i.month, year = i.year, hour = i.hour, minute = i.minute)
with open(file_to_open, "a") as f:
    .......................
Using the code above I also don't get the expected result.
I get a filename like this: "6-8-2016 21". The day, month, year and hour are displayed, but the minutes and the rest of the string (-malware_scan.csv) aren't displayed.
I'm focusing only on the filename with this question, not on the csv writing itself, whose code is omitted.

The : character is not allowed in filenames on Windows. You could discard the : separator entirely:
>>> from datetime import datetime
>>> t = datetime.now()
>>> formatted_time = t.strftime('%d-%m-%y %H%M')
>>> formatted_time
'06-08-16 2226'
>>> datetime.strptime(formatted_time, '%d-%m-%y %H%M')
datetime.datetime(2016, 8, 6, 22, 26)
Or replace that character with an underscore or hyphen.
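Putting that together for the filename in the question, here is a small sketch that swaps the colon for an underscore. The function name scan_filename and the underscore separator are choices made for this example, not something the question prescribes.

```python
from datetime import datetime


def scan_filename(moment, suffix="-malware_scan.csv"):
    """Build 'day-month-year hour_minute-malware_scan.csv' without a colon."""
    # %Y gives the four-digit year; '_' replaces the ':' that Windows rejects.
    return moment.strftime("%d-%m-%Y %H_%M") + suffix


name = scan_filename(datetime(2016, 8, 6, 21, 45))
print(name)  # 06-08-2016 21_45-malware_scan.csv
```

In real use the argument would be datetime.now(); a fixed datetime is used here only so the result is reproducible.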

Thanks to Moses Koledoye for spotting the problem. I was thinking I made a mistake in the Python code, but actually the problem was the characters of the filename.
According to MSDN the following are reserved characters that cannot be used in a filename on Windows:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)

How do I only parse the timestamp from a string

I am trying to parse only the timestamp from a specific line in a log file using python. This is the line from the file:
Mar 29 06:12:42 10.11.100.22 [info.events] [WARNING] 10.11.100.22: event, 1234
How do I only get the timestamp from this? This is the code I am using at the minute, which finds the line from the file which has the word 'WARNING' in it, and then gets the timestamp.
def is_Warning(self,line):
    if line.find("WARNING") >= 0:
        ts = time.strptime(line, "%b %d %H:%M:%S")
        print "==================== %s" % ts
When I run this I get a 'ValueError: unconverted data remains: 10.11.100.22 [info.events] [WARNING] 10.11.100.22: event, 1234'
Can anyone help?

Use a regex.
import re
import time
...
def is_warning(self, line):
    if line.find("WARNING") >= 0:
        # match only the leading timestamp, e.g. "Mar 29 06:12:42"
        date = re.match(r"[A-Za-z]{3} \d{1,2} \d{2}:\d{2}:\d{2}", line).group()
        ts = time.strptime(date, "%b %d %H:%M:%S")
        print("===================== %s" % ts)
Note that time is a really old module. You should use datetime.datetime.strptime(date, format).time() if you need to get JUST a time.

strptime must match the entire string, not just the beginning. Since you know the length of the timestamp, you can do this:
ts = time.strptime(line[:15].strip(), "%b %d %H:%M:%S")
The [:15] slice returns only the first 15 characters of the string, which are the only characters you need.
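Both answers can be combined into one helper based on datetime. The sketch below is an illustration under stated assumptions: the names SYSLOG_TS and parse_timestamp are made up, month names are assumed to be English abbreviations, and because the log line carries no year the caller has to supply one (otherwise strptime defaults to 1900).

```python
import re
from datetime import datetime

# Leading "Mon D HH:MM:SS" timestamp, e.g. "Mar 29 06:12:42".
SYSLOG_TS = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}")


def parse_timestamp(line, year):
    """Return the leading timestamp as a datetime, or None if absent."""
    match = SYSLOG_TS.match(line)
    if match is None:
        return None
    # The year is not in the log line, so it is prepended explicitly.
    return datetime.strptime(f"{year} {match.group()}", "%Y %b %d %H:%M:%S")


line = ("Mar 29 06:12:42 10.11.100.22 [info.events] [WARNING] "
        "10.11.100.22: event, 1234")
ts = parse_timestamp(line, 2019)  # 2019 is an arbitrary example year
print(ts)
```

Returning None for lines without a timestamp avoids the AttributeError that calling .group() on a failed match would raise.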
