I would like to search through a range of lines in a date-ordered log file between two dates. If I were at the command line, sed would come in handy:
sed -rn '/03.Nov.2012/,/12.Oct.2013/s/search key/search key/p' my.log
The above would only display lines between 3 November 2012 and 12 October 2013 that contain the string "search key".
Is there a lightweight way I can do this in Python?
I could build a single regex for the above, but it would be nightmarish.
The best I can come up with is this:
#!/usr/bin/python
start_date = "03/Nov/2012"
end_date = "12/Oct/2013"
search_key = "search key"  # must be defined; the original snippet left it out
start = False
try:
    with open("my.log", 'r') as log:
        for line in log:
            if start:
                if end_date in line:
                    break
            else:
                if start_date in line:
                    start = True
                else:
                    continue
            if search_key in line:
                print line
except IOError, e:
    print '<p>Log file not found.'
But this strikes me as not 'pythonic'.
One can assume that the search date limits will be found in the log file.
Using itertools and a generator is one way:
from itertools import takewhile, dropwhile

with open('logfile') as fin:
    start = dropwhile(lambda L: '03.Nov.2012' not in L, fin)
    until = takewhile(lambda L: '12.Oct.2013' not in L, start)
    query = (line for line in until if 'search string' in line)
    for line in query:
        pass  # do something
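If this comes up for more than one log, the same dropwhile/takewhile pipeline can be wrapped in a small generator function. A sketch, where the function name and arguments are mine rather than from the snippet above:

from itertools import dropwhile, takewhile

def grep_between(path, start_marker, stop_marker, key):
    # Yield lines containing `key`, starting at the first line that
    # contains `start_marker` and stopping just before the first line
    # that contains `stop_marker`.
    with open(path) as fin:
        started = dropwhile(lambda L: start_marker not in L, fin)
        bounded = takewhile(lambda L: stop_marker not in L, started)
        for line in bounded:
            if key in line:
                yield line

for line in grep_between('my.log', '03/Nov/2012', '12/Oct/2013', 'search key'):
    print line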
I am following this tutorial (https://towardsdatascience.com/build-your-own-whatsapp-chat-analyzer-9590acca9014) to build a WhatsApp analyzer.
This is the entire code from his tutorial -
import re

def startsWithDateTime(s):
    pattern = '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)(\d{2}|\d{4}), ([0-9][0-9]):([0-9][0-9]) -'
    result = re.match(pattern, s)
    if result:
        return True
    return False

def startsWithAuthor(s):
    patterns = [
        '([\w]+):',                       # First Name
        '([\w]+[\s]+[\w]+):',             # First Name + Last Name
        '([\w]+[\s]+[\w]+[\s]+[\w]+):',   # First Name + Middle Name + Last Name
        '([+]\d{2} \d{5} \d{5}):',        # Mobile Number (India)
        '([+]\d{2} \d{3} \d{3} \d{4}):',  # Mobile Number (US)
        '([+]\d{2} \d{4} \d{7})'          # Mobile Number (Europe)
    ]
    pattern = '^' + '|'.join(patterns)
    result = re.match(pattern, s)
    if result:
        return True
    return False

def getDataPoint(line):
    # line = 18/06/17, 22:47 - Loki: Why do you have 2 numbers, Banner?
    splitLine = line.split(' - ')      # splitLine = ['18/06/17, 22:47', 'Loki: Why do you have 2 numbers, Banner?']
    dateTime = splitLine[0]            # dateTime = '18/06/17, 22:47'
    date, time = dateTime.split(', ')  # date = '18/06/17'; time = '22:47'
    message = ' '.join(splitLine[1:])  # message = 'Loki: Why do you have 2 numbers, Banner?'
    if startsWithAuthor(message):      # True
        splitMessage = message.split(': ')    # splitMessage = ['Loki', 'Why do you have 2 numbers, Banner?']
        author = splitMessage[0]              # author = 'Loki'
        message = ' '.join(splitMessage[1:])  # message = 'Why do you have 2 numbers, Banner?'
    else:
        author = None
    return date, time, author, message

parsedData = []  # List to keep track of data so it can be used by a Pandas dataframe
conversationPath = 'chat.txt'
with open(conversationPath, encoding="utf-8") as fp:
    fp.readline()  # Skip the first line of the file (usually contains information about end-to-end encryption)
    messageBuffer = []  # Buffer to capture intermediate output for multi-line messages
    date, time, author = None, None, None  # Intermediate variables to keep track of the current message being processed
    while True:
        line = fp.readline()
        if not line:  # Stop reading further if end of file has been reached
            break
        line = line.strip()  # Guard against erroneous leading and trailing whitespace
        if startsWithDateTime(line):  # A line starting with a date-time pattern indicates the beginning of a new message
            print('true')
            if len(messageBuffer) > 0:  # Check if the message buffer contains characters from previous iterations
                parsedData.append([date, time, author, ' '.join(messageBuffer)])  # Save the tokens from the previous message in parsedData
            messageBuffer.clear()  # Clear the message buffer so that it can be used for the next message
            date, time, author, message = getDataPoint(line)  # Identify and extract tokens from the line
            messageBuffer.append(message)  # Append message to buffer
        else:
            messageBuffer.append(line)  # A line that doesn't start with a date-time pattern is part of a multi-line message; just append it to the buffer
When I plug in my chat file, chat.txt, my list parsedData ends up empty. After going through his code, I noticed what might be responsible for the empty list.
From his tutorial, his chats are in this format (24-hour):
18/06/17, 22:47 - Loki: Why do you have 2 numbers, Banner?
but my chats are in this format (12-hour):
[4/19/20, 8:10:57 PM] Joe: How are you doing.
That is why the startsWithDateTime function is unable to match any date and time.
How do I change the regex in the startsWithDateTime function to match my chat format?
There are a few problems here.
First, your regex is looking for a date in the format DD/MM/YYYY, but the format you give is M/DD/YY.
Second, the square brackets are not present in the tutorial's format, which your regex matches, but they are present in your own example.
Third, the problem isn't the 12-hour vs. 24-hour time format per se, but the fact that your regex requires a strictly two-digit hour. In 24-hour format it is common to include a leading zero for single-digit hours (e.g., 08:10 for 8:10 AM), but in 12-hour format it is not (so your code would fail to find 8:10). Your format also includes seconds, which the original pattern doesn't account for.
You can fix your regular expression by changing the relevant section from
([0-9][0-9]):([0-9][0-9])
to
([0-9]{1,2}):([0-9]{2})
The number in curly braces indicates how many repetitions to look for, so in this case this expression will match either one or two digits, then a colon, then exactly two digits.
Then the final regex would have to be
^\[(\d{1,2})\/(\d{2})\/(\d{2}|\d{4}), ([0-9]{1,2}):([0-9]{2}):([0-9]{2})\ ([AP]M)\]
Example:
import re
a = '[4/19/20, 8:10:57 PM] Joe: How are you doing'
p = r'^\[(\d{1,2})\/(\d{2})\/(\d{2}|\d{4}), ([0-9]{1,2}):([0-9]{2}):([0-9]{2})\ ([AP]M)\]'
re.findall(p, a)
# [('4', '19', '20', '8', '10', '57', 'PM')]
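Folding that back into the tutorial's helper, startsWithDateTime could be rewritten along these lines for the bracketed 12-hour format (a sketch: I have also loosened the day to one or two digits, since your example has no zero padding):

import re

def startsWithDateTime(s):
    # matches e.g. '[4/19/20, 8:10:57 PM] ' at the start of a line
    pattern = r'^\[(\d{1,2})\/(\d{1,2})\/(\d{2}|\d{4}), ([0-9]{1,2}):([0-9]{2}):([0-9]{2}) ([AP]M)\]'
    return re.match(pattern, s) is not None

Note that getDataPoint also splits on ' - ', which the bracketed format doesn't contain, so it will need a similar adjustment (splitting on '] ' instead, for example).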
I'm trying to make a dynamic function: I give it two datetime values and it reads the log entries between those datetime values, for example:
start_point = "2019-04-25 09:30:46.781"
stop_point = "2019-04-25 10:15:49.109"
I'm thinking of an algorithm that checks:
if the dates are equal:
check whether the first hour character of the start (09 -> 0) is higher or lower than the first hour character of the stop (10 -> 1);
the same check with the second hour character ((start) 09 -> 9, (stop) 10 -> 0);
the same check with the first minute character;
the same check with the second minute character;
if the dates differ:
some other checks...
I don't know if I'm reinventing the wheel, but I'm really lost. Here is what I tried:
1.
...
cmd = subprocess.Popen(['egrep "2019-04-19 ([0-1][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9].[0-9]{3}" file.log'], shell=True, stdout=subprocess.PIPE)
cmd_result = cmd.communicate()[0]
for i in str(cmd_result).split("\n"):
    print(i)
...
The problem with this one: when I plugged in the values from the example, it couldn't work, because the generated pattern ends up with invalid character ranges, e.g. the second hour digit gives the range [9-0], the first minute digit gives [3-1], and so on.
2.
I tried the solutions from "The best way to filter a log by a dates range in python".
Any help is appreciated.
EDIT
The log line structure:
...
2019-04-25 09:30:46.781 text text text ...
2019-04-25 09:30:46.853 text text text ...
...
EDIT 2
So I tried the code:
from datetime import datetime as dt

s1 = "2019-04-25 09:34:11.057"
s2 = "2019-04-25 09:59:43.534"
start = dt.strptime('2019-04-25 09:34:11.057', '%Y-%m-%d %H:%M:%S.%f')
stop = dt.strptime('2019-04-25 09:59:43.534', '%Y-%m-%d %H:%M:%S.%f')
start_1 = dt.strptime('09:34:11.057', '%H:%M:%S.%f')
stop_1 = dt.strptime('09:59:43.534', '%H:%M:%S.%f')

with open('file.out', 'r') as file:
    for line in file:
        ts = dt.strptime(line.split()[1], '%H:%M:%S.%f')
        if (ts > start_1) and (ts < stop_1):
            print line
and I got the error
ValueError: time data 'Platform' does not match format '%H:%M:%S.%f'
So it seems I found another problem: sometimes a line starts with something other than a datetime. Is there a way to provide a regex for the datetime format?
EDIT 3
Fixed the issue where a string at the start of the line causes the ValueError, and the index-out-of-range error when other values occur:
try:
    ts = dt.strptime(line.split()[1], '%H:%M:%S.%f')
    if (ts > start_1) and (ts < stop_1):
        print line
except IndexError as err:
    continue
except ValueError as err:
    continue
But now it lists lines outside the range I provide: it reads the log from 2019-02-27 09:38:46.229 to 2019-02-28 09:57:11.028. Any thoughts?
Your edit 2 had the right idea. You need to add exception handling to catch lines that are not formatted correctly and skip them, for example blank lines or lines that do not have the timestamp. This can be done as follows:
from datetime import datetime

s1 = "2019-04-25 09:24:11.057"
s2 = "2019-04-25 09:59:43.534"
fmt = '%Y-%m-%d %H:%M:%S.%f'
start = datetime.strptime(s1, fmt)
stop = datetime.strptime(s2, fmt)

with open('file.out', 'r') as file:
    for line in file:
        line = line.strip()
        try:
            # the timestamp is the first two space-separated fields
            ts = datetime.strptime(' '.join(line.split(' ', maxsplit=2)[:2]), fmt)
            if start <= ts <= stop:
                print(line)
        except ValueError:
            # the line doesn't begin with a well-formed timestamp; skip it
            pass
The whole of the timestamp is used to create ts, so that it can be compared correctly with start and stop.
Each line first has its surrounding whitespace removed. It is then split on spaces at most twice, and the first two fields are joined back together and converted into a datetime object. If that conversion fails, the line is not formatted correctly and is skipped.
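To answer the regex question from EDIT 2: another option is to pre-filter each line with a compiled pattern and only call strptime on the captured text. A sketch, assuming the log line structure shown in the first EDIT:

import re
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S.%f'
# matches e.g. '2019-04-25 09:30:46.781' at the start of a line
ts_pattern = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})')
start = datetime.strptime('2019-04-25 09:24:11.057', fmt)
stop = datetime.strptime('2019-04-25 09:59:43.534', fmt)

with open('file.out', 'r') as file:
    for line in file:
        m = ts_pattern.match(line)
        if m is None:
            continue  # no timestamp at the start of this line; skip it
        ts = datetime.strptime(m.group(1), fmt)
        if start <= ts <= stop:
            print(line.rstrip())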
I have a huge text file of some 500 MB, and I need to print the lines matching an input string, along with the previous 3 lines and the next 3 lines.
My text file looks like:
...
...
...
benz is a nice car
...
...
...
its also said benz is a safe car
...
...
...
If the user inputs 'benz', then it should print the 3 lines before and after the match, for every individual match.
My code:
users = raw_input('enter the word:')
with open('mytext.txt', 'rb') as f:
    for line in f:
        if users in line:
            print line(i-3)
            print line
            print line(i+3)
But this errors out because i is not defined.
Use grep:
$ grep -C 3 benz mytext.txt
I wrote a small function that might be useful for your case:
from collections import deque

def search_cont(filename, search_for, num_before, num_after):
    with open(filename) as f:
        before_lines = deque(maxlen=num_before)
        after_lines = deque(maxlen=num_after + 1)
        # pre-fill the look-ahead buffer with the first lines of the file
        for _ in range(num_after + 1):
            after_lines.append(next(f))
        while len(after_lines) > 0:
            current_line = after_lines.popleft()
            if search_for in current_line:
                print("".join(before_lines))
                print(current_line)
                print("".join(after_lines))
                print("-----------------------")
            before_lines.append(current_line)
            try:
                after_lines.append(next(f))
            except StopIteration:
                pass
For your example, you would call it like this:
search_for = raw_input('enter the word:')
search_cont('mytext.txt', search_for, 3, 3)
This solution has no upper bound on the size of your file (unless you have really long lines), as there are never more than 7 lines in memory.
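The trick here is deque's maxlen argument: once a bounded deque is full, appending to one end silently discards the item at the other end, which is exactly the sliding window we want for before_lines. A quick illustration:

from collections import deque

window = deque(maxlen=3)
for i in range(5):
    window.append(i)
print(window)  # deque([2, 3, 4], maxlen=3) -- the oldest items were dropped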
You could call grep from python:
import subprocess
result = subprocess.check_output(["grep", "-A", "3", "-B", "3", "benz", "mytext.txt"])
I have a bunch of data in a .txt file and I need it in a format that I can use in Fusion Tables or a spreadsheet. I assume that format would be a CSV that I can write into another file and then import into a spreadsheet to work with.
The data is in this format, with multiple entries separated by a blank line:
Start Time
8/18/14, 11:59 AM
Duration
15 min
Start Side
Left
Fed on Both Sides
No
Start Time
8/18/14, 8:59 AM
Duration
13 min
Start Side
Right
Fed on Both Sides
No
(etc.)
but I need it ultimately in this format (or whatever I can use to get it into a spreadsheet):
StartDate, StartTime, Duration, StartSide, FedOnBothSides
8/18/14, 11:59 AM, 15, Left, No
- , -, -, -, -
The problems I have come across are:
-I don't need all the info or every line, but I'm not sure how to separate them automatically. I don't even know if the way I am going about sorting each line is smart.
-I have been getting an error that says "argument 1 must be string or read-only character buffer, not list" when I use .read() or .readlines() sometimes (although it did work at first). Also, both of my arguments are .txt files.
-The dates and times are not in fixed formats with regular lengths (it has 8/4/14, 5:14 AM instead of 08/04/14, 05:14 AM), which I'm not sure how to deal with.
This is what I have tried so far:
from sys import argv
from os.path import exists

def filework():
    script, from_file, to_file = argv

    print "copying from %s to %s" % (from_file, to_file)

    in_file = open(from_file)
    indata = in_file.readlines()  # .read() .readline .readlines .read().splitline .xreadlines

    print "the input file is %d bytes long" % len(indata)
    print "does the output file exist? %r" % exists(to_file)
    print "ready, hit RETURN to continue, CTRL-C to abort."
    raw_input()

    # do stuff section----------------BEGIN
    for i in indata:
        if i == "Start Time":
            pass  # do something
        elif i == '{date format}':
            pass  # do something
        else:
            pass  # do something
    # do stuff section----------------END

    out_file = open(to_file, 'w')
    out_file.write(indata)

    print "alright, all done."

    out_file.close()
    in_file.close()

filework()
So I'm relatively unversed in scripts like this that have multiple complex parts. Any help and suggestions would be greatly appreciated. Sorry if this is a jumble.
Thanks
This code should work, although it's not exactly optimal, but I'm sure you'll figure out how to make it better!
What this code basically does is:
Get all the lines from the input data
Loop through all the lines and try to recognize the different keys (Start Time etc.)
If a key is recognized, get the line beneath it and apply an appropriate function to it
If a blank line is found, add the current entry to a list so that a new entry can be read
Write the data to a file
In case you haven't seen string formatting done this way before:
in "{0:} {1:}".format(arg0, arg1), the {0:} is just a placeholder for a variable (here: arg0), and the 0 defines which argument to use.
Find out more here:
Python .format docs
Python OrderedDict docs
If you are using a version of Python < 2.7, you might have to install a backport of OrderedDict by using pip install ordereddict. If that doesn't work, just change data = OrderedDict() to data = {} and it should still work, but then the column order of the output may vary between runs, although the data will still be correct.
from sys import argv
from os.path import exists

# since we want to have a somewhat standardized format
# and dicts are unordered by default
try:
    from collections import OrderedDict
except ImportError:
    # python 2.6 or earlier, use backport
    from ordereddict import OrderedDict

def get_time_and_date(time):
    date, time = time.split(",")
    time, time_indic = time.split()
    date = pad_time(date)
    time = "{0:} {1:}".format(pad_time(time), time_indic)
    return time, date

"""
Make all the time values look the same, ex turn 5:30 AM into 05:30 AM
"""
def pad_time(time):
    # if its time
    if ":" in time:
        separator = ":"
    # if its a date
    else:
        separator = "/"
    time = time.split(separator)
    for index, num in enumerate(time):
        if len(num) < 2:
            time[index] = "0" + time[index]
    return separator.join(time)

def filework():
    from_file, to_file = argv[1:]
    data = OrderedDict()

    print "copying from %s to %s" % (from_file, to_file)

    # by using open(...) the file closes automatically
    with open(from_file, "r") as inputfile:
        indata = inputfile.readlines()

    entries = []

    print "the input file is %d bytes long" % len(indata)
    print "does the output file exist? %r" % exists(to_file)
    print "ready, hit RETURN to continue, CTRL-C to abort."
    raw_input()

    for line_num in xrange(len(indata)):
        # make the entire string lowercase to be more flexible,
        # and then remove whitespace
        line_lowered = indata[line_num].lower().strip()

        if "start time" == line_lowered:
            time, date = get_time_and_date(indata[line_num+1].strip())
            data["StartTime"] = time
            data["StartDate"] = date
        elif "duration" == line_lowered:
            duration = indata[line_num+1].strip().split()
            # only keep the amount of minutes
            data["Duration"] = duration[0]
        elif "start side" == line_lowered:
            data["StartSide"] = indata[line_num+1].strip()
        elif "fed on both sides" == line_lowered:
            data["FedOnBothSides"] = indata[line_num+1].strip()
        elif line_lowered == "":
            # if a blank line is found, prepare for reading a new entry
            entries.append(data)
            data = OrderedDict()

    entries.append(data)

    # create the outfile if it does not exist
    with open(to_file, "w+") as outfile:
        headers = entries[0].keys()
        outfile.write(", ".join(headers) + "\n")
        for entry in entries:
            outfile.write(", ".join(entry.values()) + "\n")

filework()
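One further suggestion: if any of the values could ever contain a comma, writing rows with ", ".join(...) would break the column alignment. The standard library's csv module handles quoting for you; a minimal sketch of just the writing step, assuming the same entries list as above:

import csv

with open(to_file, "wb") as outfile:  # binary mode for the csv module on Python 2
    writer = csv.writer(outfile)
    writer.writerow(entries[0].keys())  # header row
    for entry in entries:
        writer.writerow(entry.values())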
So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third the number of "A"s, the 4th "G"s, the 5th "C"s, the 6th "T"s, the 7th %GC, and the 8th whether or not the sequence contains "TGA". Then I get all these values and write the table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")

Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq = 0

alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        continue
    Acount += line.count("A")
    Ccount += line.count("C")
    Gcount += line.count("G")
    Tcount += line.count("T")
    genomeSize = Acount + Gcount + Ccount + Tcount
    percentGC = (Gcount + Ccount) * 100.00 / genomeSize
    print "sequence", seq
    print "Length of Sequence", len(line)
    print Acount, Ccount, Gcount, Tcount
    print "Percent of GC", "%.2f" % (percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt","w")
    for line in alllines:
        splitlines = line.split()
        lenstr = str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr + "\t" + lenstr + "\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, the first column only ever shows 19, when it should go 1, 2, 3, 4, 5, etc. to represent all of the sequences. When I tried it with the other variables, I just got the totals for the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values of the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
This puts the info in a dictionary as you go, and allows for DNA sequences that span multiple lines:
from __future__ import division  # ensure things like 1/2 are 0.5 rather than 0
from collections import defaultdict

fh = open("dna.fasta", "r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt", "w")
seq = 0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        data[seq] = defaultdict(int)  # default value is zero if a key is not present, so we can += without initializing
        data[seq]['seq'] = seq
        previous_line_end = ""  # "TGA" might be split across lines
        continue
    data[seq]['Acount'] += line.count("A")
    data[seq]['Ccount'] += line.count("C")
    data[seq]['Gcount'] += line.count("G")
    data[seq]['Tcount'] += line.count("T")
    line_over = previous_line_end + line[:3]  # check the boundary between this line and the previous one
    data[seq]['hasTGA'] = data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = line[-4:].strip()  # save the line end for the next iteration, removing the newline character
for seq in data.keys():
    # totals are computed once per sequence, after all its lines have been counted
    data[seq]['genomeSize'] = data[seq]['Acount'] + data[seq]['Ccount'] + data[seq]['Gcount'] + data[seq]['Tcount']
    data[seq]['percentGC'] = (data[seq]['Gcount'] + data[seq]['Ccount']) * 100.00 / data[seq]['genomeSize']
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Ccount)d, %(Gcount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()
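If the table needs column names as a first row, you could also write a header line just before the for seq in data.keys(): loop (the names here are only a guess at what the assignment expects):

fh2.write("seq, genomeSize, Acount, Ccount, Gcount, Tcount, percentGC, hasTGA\n")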