Python: analyse a logfile with regex

I have to analyse an email-sending logfile (to get the SMTP reply for a message-id), which looks like this:
Nov 12 17:26:57 zeus postfix/smtpd[23992]: E859950021DB1: client=pegasus.os[172.20.19.62]
Nov 12 17:26:57 zeus postfix/cleanup[23995]: E859950021DB1: message-id=a92de331-9242-4d2a-8f0e-9418eb7c0123
Nov 12 17:26:58 zeus postfix/qmgr[22359]: E859950021DB1: from=<system#directoperation.de>, size=114324, nrcpt=1 (queue active)
Nov 12 17:26:58 zeus postfix/smtp[24007]: certificate verification failed for mx.elutopia.it[62.149.128.160]:25: untrusted issuer /C=US/O=RTFM, Inc./OU=Widgets Division/CN=Test CA20010517
Nov 12 17:26:58 zeus postfix/smtp[24007]: E859950021DB1: to=<mike#elutopia.it>, relay=mx.elutopia.it[62.149.128.160]:25, delay=0.89, delays=0.09/0/0.3/0.5, dsn=2.0.0, status=sent (250 2.0.0 d3Sx1m03q0ps1bK013Sxg4 mail accepted for delivery)
Nov 12 17:26:58 zeus postfix/qmgr[22359]: E859950021DB1: removed
Nov 12 17:27:00 zeus postfix/smtpd[23980]: connect from pegasus.os[172.20.19.62]
Nov 12 17:27:00 zeus postfix/smtpd[23980]: setting up TLS connection from pegasus.os[172.20.19.62]
Nov 12 17:27:00 zeus postfix/smtpd[23980]: Anonymous TLS connection established from pegasus.os[172.20.19.62]: TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)
Nov 12 17:27:00 zeus postfix/smtpd[23992]: disconnect from pegasus.os[172.20.19.62]
Nov 12 17:27:00 zeus postfix/smtpd[23980]: 2C04150101DB2: client=pegasus.os[172.20.19.62]
Nov 12 17:27:00 zeus postfix/cleanup[23994]: 2C04150101DB2: message-id=21e2f9d3-154a-3683-85d3-a7c52d429386
Nov 12 17:27:00 zeus postfix/qmgr[22359]: 2C04150101DB2: from=<system#directoperation.de>, size=53237, nrcpt=1 (queue active)
Nov 12 17:27:00 zeus postfix/smtp[24006]: ABE7C50001D62: to=<info#elvictoria.it>, relay=relay3.telnew.it[195.36.1.102]:25, delay=4.9, delays=0.1/0/4/0.76, dsn=2.0.0, status=sent (250 2.0.0 r9EFQt0J009467 Message accepted for delivery)
Nov 12 17:27:00 zeus postfix/qmgr[22359]: ABE7C50001D62: removed
Nov 12 17:27:00 zeus postfix/smtp[23998]: 2C04150101DB2: to=<peter#elgravo.ch>, relay=liberomx2.elgravo.ch[212.52.84.93]:25, delay=0.72, delays=0.07/0/0.3/0.35, dsn=2.0.0, status=sent (250 ok: Message 2040264602 accepted)
Nov 12 17:27:00 zeus postfix/qmgr[22359]: 2C04150101DB2: removed
At the moment, I get a message-id (uuid) from a database (for example a92de331-9242-4d2a-8f0e-9418eb7c0123) and then run my code through the logfile:
log_id = re.search(']: (.+?): message-id=' + message_id, text).group(1)
sent_status = re.search(']: ' + log_id + '.*dsn=(.....)', text).group(1)
With the message-id I find the log_id, and with the log_id I can find the SMTP reply.
This works fine, but a better approach would be for the software to go through the log file, pick up each message-id together with its reply code, and then update the DB. I'm not sure how to do this, though. The script has to run every ~2 minutes against a log file that keeps growing. How can I make sure it remembers where it left off and doesn't process a message-id twice?
Thanks in advance

Use a dictionary to store message IDs, and a separate file to store the byte offset where you last left off in the log file.
msgIDs = {}

# Get where you left off in the logfile during the last run
try:
    with open('logfile_placemarker.txt', 'r') as f:
        lastRead = int(f.read())
except IOError:
    print("Can't find/read place marker file! Starting at 0")
    lastRead = 0

with open('logfile.log', 'r') as f:
    f.seek(lastRead)
    # readline() is used instead of "for line in f" so that f.tell() keeps
    # working (tell() is disabled while a file is being iterated directly)
    while True:
        line = f.readline()
        if not line:
            break
        # ...
        # Pick out msgID and responseCode from the line here
        # ...
        if msgID in msgIDs:
            print("Uh oh, found the same msg ID twice!")
        msgIDs[msgID] = responseCode
        lastRead = f.tell()

# Do whatever you need to do with the msgIDs you found
updateDB(msgIDs)

# Store lastRead (where you left off in the logfile) in a file so it persists for the next run
with open('logfile_placemarker.txt', 'w') as f:
    f.write(str(lastRead))
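The "pick out msgID and responseCode" placeholder could be filled with per-line regexes along the lines of the ones in the question. A minimal sketch, assuming the queue-ID format [0-9A-F]+ and the field layout seen in the sample log above (a starting point, not a tested postfix parser):

import re

queue_to_msgid = {}   # postfix queue ID (e.g. E859950021DB1) -> message-id
msgIDs = {}           # message-id -> DSN code (e.g. 2.0.0)

def handle_line(line):
    # cleanup lines carry the message-id for a queue ID
    m = re.search(r'\]: ([0-9A-F]+): message-id=(\S+)', line)
    if m:
        queue_to_msgid[m.group(1)] = m.group(2)
        return
    # smtp lines carry the delivery status (dsn=...) for the same queue ID
    m = re.search(r'\]: ([0-9A-F]+): .*dsn=(\d\.\d\.\d), status=(\w+)', line)
    if m and m.group(1) in queue_to_msgid:
        msgIDs[queue_to_msgid[m.group(1)]] = m.group(2)

Each line read in the loop above would be passed to handle_line(), so msgIDs ends up mapping message-ids to their DSN codes, ready to be handed to updateDB().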

Python - How to find profiles from a log file

I am new to Python.
I want to find profiles in a log file that meet the following criteria:
- the user logged in, changed their password, and logged off within the same second;
- those actions (log in, change password, log off) happened one after another, with no other entries in between.
The .txt file looks like this:
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|asdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|asdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|asdf| - |user logged off| -
Mon, 22 Aug 2016 13:15:42 +0200|178.57.66.225|iukj| - |user logged in| -
Mon, 22 Aug 2016 13:15:40 +0200|178.57.66.215|klij| - |user logged in| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|klij| - |user changed password| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|klij| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user logged off| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user logged in| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user changed password| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user changed profile| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|zzad| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|zzad| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|zzad| - |user logged off| -
Mon, 22 Aug 2016 13:20:42 +0200|178.57.67.225|yytr| - |user logged in| -
asdf is a typical profile name in the log file.
Here is what I have done so far:
import collections
import time

with open('logfiles.txt') as infile:
    counts = collections.Counter(l.strip() for l in infile)
for line, count in counts.most_common():
    print(line, count)
time.sleep(10)
I know the logic is to compare the hours, minutes, and seconds; if they are duplicates, then I print the profiles.
But I am confused about how to get the time from the file.
Any help is very much appreciated.
EDIT:
The output would be:
asdf
klij
plnb
zzad
I think this is more complicated than you might have imagined. Your sample data is very straightforward, but the description (the requirements) implies that the log might have interspersed lines that you need to account for. So I think it's a case of working through the log file sequentially, recording certain actions (log on, log off) and keeping a note of what was observed on the previous line. This seems to work with your data:
from datetime import datetime as DT, timedelta as TD

FMT = '%a, %d %b %Y %H:%M:%S %z'
td = TD(seconds=1)

prev = None
with open('logfile.txt') as logfile:
    for line in logfile:
        if len(tokens := line.split('|')) > 4:
            dt, _, profile, _, action, *_ = tokens
            if prev is None or prev[1] != profile:
                prev = (dt, profile) if action == 'user logged in' else None
            else:
                if action == 'user logged off':
                    if DT.strptime(dt, FMT) - DT.strptime(prev[0], FMT) <= td:
                        print(profile)
                    prev = None
Output:
asdf
plnb
qweq
zzad
To parse the time, I would use a regex to match a time expression on each line.
Something like this would work.
EDIT: I omitted the lines which don't match the expected format.
import re
time = re.search(r'(\d+):(\d+):(\d+)', line).group()
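For example, run against one of the sample lines from the question, the pattern pulls out the time fields like this:

import re

# One of the sample lines from the question
line = 'Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|asdf| - |user logged in| -'

match = re.search(r'(\d+):(\d+):(\d+)', line)
print(match.group())    # '13:15:39'
print(match.groups())   # ('13', '15', '39')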
As far as the profile name is concerned, I would use a split on the most common lines, as @Matthias suggested, and your code would look something like this:
import collections
import time

with open('logfiles.txt') as infile:
    counts = collections.Counter(l.strip() for l in infile)
for line, count in counts.most_common():
    # The line splits where the '|' symbol is and creates a list.
    # We choose the third element of the list - the profile.
    list_of_segments = line.split('|')
    if len(list_of_segments) == 6:
        print(list_of_segments[2])
time.sleep(10)

How to convert a test log file to JSON in a prescribed way

I have the log file shown below. I am trying to take the first server's details (192.168.1.1) and check when it connected and disconnected, then go to the second server's details (192.168.1.2) and check when it connected and disconnected, and so on: I need to determine the connection and disconnection times of all servers.
str_ = '''Jan 23 2016 11:30:08AM - ssh 22 192.168.1.1 connected
Jan 23 2016 12:04:56AM - ssh 22 192.168.1.2 connected
Jan 23 2016 2:18:32PM - ssh 22 192.168.1.2 disconnected
Jan 23 2016 5:16:09PM - un x Dos attack from 201.10.0.4
Jan 23 2016 10:43:44PM - ssh 22 192.168.1.1 disconnected
Feb 1 2016 1:40:28AM - ssh 22 192.168.1.1 connected
Feb 1 2016 2:21:52AM - un x Dos attack from 201.168.123.1
Mar 29 2016 2:13:07PM - ssh 22 192.168.1.1 disconnected'''
How do I convert my log file into JSON?
My expected output:
{1:{192.168.1.1:[(connected,Jan 23 2016 11:30:08AM),(disconnected,Jan 23 2016 10:43:44PM)]},
2:{192.168.1.2:[(connected,Jan 23 2016 12:04:56AM),(disconnected,Jan 23 2016 2:18:32PM)]},
3:{192.168.1.1:[(connected,Feb 1 2016 1:40:28AM),(disconnected,Mar 29 2016 2:13:07PM )]},
4:{Dos:[201.10.0.4,201.168.123.1]}}
My pseudocode:
import json
import re

i = 1
result = {}
with open('test.log') as f:
    lines = f.readlines()
    for line in lines:
        r = line.split(' ')
        # result[i] = {}
        i += 1
print(result)
with open('data.json', 'w') as fp:
    json.dump(result, fp)
Why do you need a dict keyed by entry numbers, {1: xxx, 2: yyy, 3: zzz}? I'd advise using just a list instead: [xxx, yyy, zzz]. You can still get an entry by index and so on. Technically, JSON can't use numbers as keys anyway.
There is no logic to group connected and disconnected events in your pseudocode.
Some lines in the log don't have connect/disconnect info, so you need some logic for that too.
lines = f.readlines() followed by for line in lines: may eat a lot of memory for large log files; just use for line in f:
So, I think you need something like:
import json

result = []
opened = {}   # IPs that are currently connected -> their entry in result

with open('test.log') as f:
    for line in f:
        date, rest = line.split(' - ', 1)
        rest, last = rest.strip().rsplit(' ', 1)
        ip = rest.rsplit(' ', 1)[1]
        if last == 'connected':
            entry = {ip: [(last, date)]}
            opened[ip] = entry
            result.append(entry)
        elif last == 'disconnected':
            opened[ip][ip].append((last, date))
            del opened[ip]

print(result)
with open('data.json', 'w') as fp:
    json.dump(result, fp)
It works for your sample, but needs more error checking for other logs
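The expected output in the question also collects the "Dos attack" source IPs into a final entry; a minimal sketch of handling those in the same loop (the 'Dos attack' marker test is an assumption based on the two sample lines):

import json

result = []
opened = {}
dos_ips = []

with open('test.log') as f:
    for line in f:
        date, rest = line.split(' - ', 1)
        rest, last = rest.strip().rsplit(' ', 1)
        if 'Dos attack' in rest:
            dos_ips.append(last)   # on these lines the last token is the source IP
            continue
        ip = rest.rsplit(' ', 1)[1]
        if last == 'connected':
            entry = {ip: [(last, date)]}
            opened[ip] = entry
            result.append(entry)
        elif last == 'disconnected':
            opened[ip][ip].append((last, date))
            del opened[ip]

if dos_ips:
    result.append({'Dos': dos_ips})

with open('data.json', 'w') as fp:
    json.dump(result, fp)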

Python write to CSV

I want to export the user and last-login logs to a CSV file, but in the file I only receive the last line from the connection and not the whole SSH response.
import yaml
import os
import functools
import datetime

csv_file = open(filename, 'w+')
csv_file.write("%s,%s,%s,%s\n" % ('name', 'ssh_ec2user', 'ssh_centosuser', 'ssh_nginx_log'))
csv_file.flush()

for instance in running_instances:
    if instance.tags is None or instance.tags == "":
        continue
    for tag in instance.tags:
        if 'Name' in tag['Key']:
            name = tag['Value']
            print(name)
            instance_private_ip = instance.private_ip_address
            print(instance_private_ip)
            ssh_ec2user = os.system("ssh -t -t -i %s -n -o StrictHostKeyChecking=no ec2-user@%s 'sudo touch last.txt;sudo chmod 777 last.txt;sudo last > last.txt; sudo grep -v user last.txt |head -n3'" % (identity_file, instance_private_ip))
            ssh_centosuser = os.system("ssh -t -t -i %s -n -o StrictHostKeyChecking=no centos@%s 'sudo touch last.txt;sudo chmod 777 last.txt;sudo last > last.txt; sudo grep -v centos last.txt |head -n3'" % (identity_file, instance_private_ip))
            ssh_nginx_log = "test nginx"
            print(ssh_ec2user, ssh_centosuser, ssh_nginx_log)
            csv_file.write("\'%s\',\'%s\',\'%s\',\'%s\'\n" % (name, ssh_ec2user, ssh_centosuser, ssh_nginx_log))
            csv_file.flush()
For example, per line I need to receive:
user pts/0 172.21.0.114 Thu Jan 25 12:30 - 13:38 (01:08)
user pts/0 172.21.2.130 Wed Jan 17 15:11 - 15:17 (00:05)
user pts/0 172.21.2.130 Wed Jan 17 09:27 - 09:46 (00:18)
Connection to 1.1.1.1 closed.
65280 0
test nginx
and in the file I only receive:
65280 0
How can I get the whole answer onto the same line:
user pts/0 172.21.0.114 Thu Jan 25 12:30 - 13:38 (01:08)
user pts/0 172.21.2.130 Wed Jan 17 15:11 - 15:17 (00:05)
user pts/0 172.21.2.130 Wed Jan 17 09:27 - 09:46 (00:18)
Connection to 1.1.1.1 closed.
65280 0
tnx
Use the csv library.
import csv

writer = csv.writer(csv_file)
writer.writerow(['name', 'ssh_ec2user', 'ssh_centosuser', 'ssh_nginx_log'])
...
writer.writerow([name, ssh_ec2user, ssh_centosuser, ssh_nginx_log])
The output will not be on the same line, but it will be correctly escaped, so if you open it with Excel, OpenOffice Calc, or similar it will be displayed correctly.
You can also have ',' characters in your strings without messing up the format.
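On the other symptom: os.system() returns only the command's exit status (hence values like 65280 0 in the file), not the text the command prints. A rough Python 3 sketch of capturing the SSH output with subprocess and writing it through csv.writer; the file name, key path, and instance list below are hypothetical stand-ins for the question's variables, and the remote command is simplified:

import csv
import subprocess

def run_ssh(user, host, identity_file, remote_cmd):
    # Return the text printed by the remote command, not just the exit status
    try:
        out = subprocess.check_output(
            ["ssh", "-i", identity_file, "-o", "StrictHostKeyChecking=no",
             "%s@%s" % (user, host), remote_cmd],
            stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        out = e.output   # keep whatever the failing command printed
    return out.decode(errors="replace").strip()

# Hypothetical values standing in for the question's loop variables
filename = "instances.csv"
identity_file = "key.pem"
rows = [("web-1", "172.21.0.10")]          # (name, instance_private_ip)

with open(filename, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["name", "ssh_ec2user", "ssh_centosuser", "ssh_nginx_log"])
    for name, ip in rows:
        ssh_ec2user = run_ssh("ec2-user", ip, identity_file,
                              "sudo last | grep -v user | head -n3")
        ssh_centosuser = run_ssh("centos", ip, identity_file,
                                 "sudo last | grep -v centos | head -n3")
        writer.writerow([name, ssh_ec2user, ssh_centosuser, "test nginx"])

The multi-line SSH output ends up quoted inside a single CSV cell, which is exactly the escaping the csv writer provides.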

Python Timeout doesn't seem to work

I have the following code:
import re
import urllib2

def ip_addresses():
    # Get external ipv4
    try:
        response = urllib2.urlopen('http://icanhazip.com', timeout=2)
        out = response.read()
        public_ipv4 = re.sub('\n', '', out)
    except:
        public_ipv4 = "failed to retrieve public_ipv4"
In normal circumstances, when a response from http://icanhazip.com is received, the output is something like this:
xxx@xxx:/var/log$ date && tail -1 xxx.log
Tue Jul 25 **07:43**:18 UTC 2017 {"public_ipv4": "208.185.193.131"}, "date": "2017-07-25 **07:43**:01.558242"
So the current date and the date of the log generation are the same.
However, when there is an exception, this is what happens:
xxx@xxx:/var/log$ date && tail -1 xxx.log
Tue Jul 25 **07:30**:25 UTC 2017 {"public_ipv4": "failed to retrieve public_ipv4"},"date": "2017-07-25 **07:23**:01.525444"
Why is the "timeout" not working?
Try to get the verbose exception details in this manner, and then investigate what the error is all about and where the time difference comes from.
Use this format:
import sys

try:
    1 / 0
except:
    print sys.exc_info()
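Applied to the function from the question, that suggestion would look roughly like this (Python 2 to match the urllib2 code above; printing the exception shows whether it is a socket.timeout, a urllib2.URLError, or something else):

import re
import sys
import urllib2

def ip_addresses():
    # Get external ipv4, but log the real reason whenever it fails
    try:
        response = urllib2.urlopen('http://icanhazip.com', timeout=2)
        out = response.read()
        public_ipv4 = re.sub('\n', '', out)
    except:
        print sys.exc_info()   # e.g. socket.timeout('timed out') or URLError
        public_ipv4 = "failed to retrieve public_ipv4"
    return public_ipv4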

Segmentation Fault in Python's Gensim

Sometimes, after half an hour of running the following script, I get a Segmentation Fault error:
2016-02-09 21:01:21,256 : INFO : PROGRESS: at sentence #9130000, processed 201000982 words, keeping 85047862 word types
Segmentation fault
I'm using Mint on a VM (VMware Workstation 12.0.1) with the word2vec implementation from gensim-0.12.3-py2.7-linux-x86_64.egg (Python 2.7.6).
# coding: utf-8

# In[1]:
import os, nltk
import io
import gensim, logging
import nltk

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# In[2]:
class MySentences(object):
    def __init__(self, dirname, encoding='iso-8859-1'):
        self.dirname = dirname
        self.encoding = encoding  # added encoding parameter for non-utf8 texts

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            sub_dir = os.path.join(self.dirname, fname)
            for fname in os.listdir(sub_dir):
                text_file = os.path.join(sub_dir, fname)
                for line in io.open(text_file, encoding=self.encoding):
                    yield nltk.word_tokenize(line, language='english')  # you can change the tokenizer

# In[3]:
sentences = MySentences('/home/arie/extracted')

# In[ ]:
model = gensim.models.Word2Vec(sentences)
I was just watching the memory monitor, and it looks like it will crash again at any time:
Every 5,0s: free -m                        Tue Mar 15 19:55:36 2016

             total       used       free     shared    buffers     cached
Mem:          9837       7735       2102         15        141       1232
-/+ buffers/cache:       6360       3476
Swap:         2044          0       2044

Every 5,0s: free -m                        Tue Mar 15 19:59:06 2016

             total       used       free     shared    buffers     cached
Mem:          9837       8563       1274         14          1        108
-/+ buffers/cache:       8453       1384
Swap:         2044         12       2032
