I am new to Python.
I wanted to find profiles in a log file that meet the following criteria:
the user logged in, changed their password, and logged off within the same second;
those actions (log in, change password, log off) happened one after another, with no other entries in between.
The .txt file looks like this:
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|asdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|asdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|asdf| - |user logged off| -
Mon, 22 Aug 2016 13:15:42 +0200|178.57.66.225|iukj| - |user logged in| -
Mon, 22 Aug 2016 13:15:40 +0200|178.57.66.215|klij| - |user logged in| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|klij| - |user changed password| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|klij| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|plnb| - |user logged off| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user logged in| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user changed password| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user changed profile| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|qweq| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|zzad| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|zzad| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|zzad| - |user logged off| -
Mon, 22 Aug 2016 13:20:42 +0200|178.57.67.225|yytr| - |user logged in| -
asdf is a typical profile name from the log file.
Here is what I have done so far
import collections
import time

with open('logfiles.txt') as infile:
    counts = collections.Counter(l.strip() for l in infile)

for line, count in counts.most_common():
    print(line, count)
    time.sleep(10)
I know the logic is to get the same hours, minutes, and seconds; if they are duplicates, then I print the profiles.
But I am confused about how to get the time from the file.
Any help is very much appreciated.
EDIT:
The output would be:
asdf
klij
plnb
zzad
I think this is more complicated than you might have imagined. Your sample data is very straightforward, but the description (requirements) implies that the log might have interspersed lines that you need to account for. So I think it's a case of working through the log file sequentially, recording certain actions (log on, log off) and keeping a note of what was observed on the previous line. This seems to work with your data:
from datetime import datetime as DT, timedelta as TD

FMT = '%a, %d %b %Y %H:%M:%S %z'
td = TD(seconds=1)
prev = None

with open('logfile.txt') as logfile:
    for line in logfile:
        if len(tokens := line.split('|')) > 4:
            dt, _, profile, _, action, *_ = tokens
            if prev is None or prev[1] != profile:
                prev = (dt, profile) if action == 'user logged in' else None
            else:
                if action == 'user logged off':
                    if DT.strptime(dt, FMT) - DT.strptime(prev[0], FMT) <= td:
                        print(profile)
                    prev = None
Output:
asdf
plnb
qweq
zzad
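A hedged variant of the same approach that also requires the 'user changed password' step immediately in between (as the stated criteria suggest) could track the next expected action; note it would then exclude qweq, whose sequence contains an extra 'user changed profile' entry:

from datetime import datetime as DT, timedelta as TD

FMT = '%a, %d %b %Y %H:%M:%S %z'
td = TD(seconds=1)
state = None  # (login time, profile, next expected action)

with open('logfile.txt') as logfile:
    for line in logfile:
        if len(tokens := line.split('|')) > 4:
            dt, _, profile, _, action, *_ = tokens
            if action == 'user logged in':
                state = (dt, profile, 'user changed password')
            elif state and profile == state[1] and action == state[2]:
                if action == 'user changed password':
                    state = (state[0], profile, 'user logged off')
                else:  # 'user logged off' completes the sequence
                    if DT.strptime(dt, FMT) - DT.strptime(state[0], FMT) <= td:
                        print(profile)
                    state = None
            else:
                # any other entry (or another profile) breaks the sequence
                state = None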
To parse the time, I would use a regex to match a time expression on each line.
Something like this would work:
EDIT: I omitted the lines which don't correspond to the expected format.
import re
time = re.search(r'(\d+):(\d+):(\d+)', line).group()
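As a rough sketch, the extracted time can then be compared against the previous line's time to spot consecutive entries that fall within the same second (using the question's sample log file):

import re

prev_time = None
with open('logfiles.txt') as infile:
    for line in infile:
        m = re.search(r'(\d+):(\d+):(\d+)', line)
        if not m:
            continue  # skip lines without a time
        if m.group() == prev_time:
            print('same second as previous line:', line.strip())
        prev_time = m.group()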
As far as the profile name is concerned, I would use a split on the most common lines, as @Matthias suggested, and your code would look something like this:
import collections
import time

with open('logfiles.txt') as infile:
    counts = collections.Counter(l.strip() for l in infile)

for line, count in counts.most_common():
    # The line splits where the '|' symbol is and creates a list.
    # We choose the third element of the list - the profile.
    list_of_segments = line.split('|')
    if len(list_of_segments) == 6:
        print(list_of_segments[2])
    time.sleep(10)
I am trying to extract data from a Twitter JSON file retrieved using tweepy streaming.
Here is my code for streaming:
import sys
from tweepy import Stream

class MyListener(Stream):
    t_count = 0

    def on_data(self, data):
        print(data)
        self.t_count += 1
        # stop after 5000 tweets
        if self.t_count >= 5000:
            sys.exit("exit")
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    # consumer_key, consumer_secret, access_token, access_token_secret are defined elsewhere
    stream = MyListener(consumer_key, consumer_secret, access_token, access_token_secret)
    stream.filter(track=['corona'], languages=["en"])
Here is my code for reading the file:
with open("covid-test-out", "r") as f:
count = 0
for line in f:
data = json.loads(line)
Then I got the error
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here is one line in the JSON file. I noticed that there is a b-prefix in front of each line, but when I check the type of the line it is not a bytes object but still a string object. I am not even sure if this is the reason I cannot get the correct data.
b'{"created_at":"Mon Nov 22 07:37:46 +0000 2021","id":1462686730956333061,"id_str":"1462686730956333061","text":"RT #corybernardi: Scientists 'mystified'. \n\nhttps:\/\/t.co\/rvTYCUEQ74","source":"\u003ca href=\"https:\/\/mobile.twitter.com\" rel=\"nofollow\"\u003eTwitter Web App\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1336870146242056192,"id_str":"1336870146242056192","name":"Terence Byrnes","screen_name":"byrnes_terence","location":null,"url":null,"description":"Retired Aussie. Against mandatory vaccinations, government interference in our lives, and the climate cult. Now on Gab Social as a backup : Terence50","translator_type":"none","protected":false,"verified":false,"followers_count":960,"friends_count":1012,"listed_count":3,"favourites_count":15163,"statuses_count":171876,"created_at":"Thu Dec 10 03:08:01 +0000 2020","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1428994180458508292\/fT2Olt4J_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1428994180458508292\/fT2Olt4J_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1336870146242056192\/1631520259","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null,"withheld_in_countries":[]},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Sun Nov 21 19:42:14 +0000 2021","id":1462506658421112834,"id_str":"1462506658421112834","text":"Scientists 'mystified'. \n\nhttps:\/\/t.co\/rvTYCUEQ74","source":"\u003ca href=\"https:\/\/mobile.twitter.com\" rel=\"nofollow\"\u003eTwitter Web App\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":80965423,"id_str":"80965423","name":"CoryBernardi.com.au","screen_name":"corybernardi","location":"Adelaide ","url":"http:\/\/www.corybernardi.com.au","description":"Get your free Weekly Dose of Common Sense email at https:\/\/t.co\/MAJpp7iZJy.\n\nLaughing at liars and leftists since 2006. 
Tweets deleted weekly to infuriate losers.","translator_type":"none","protected":false,"verified":true,"followers_count":47794,"friends_count":63,"listed_count":461,"favourites_count":112,"statuses_count":55,"created_at":"Thu Oct 08 22:54:55 +0000 2009","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1446336496827387904\/Ay6QRHQt_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1446336496827387904\/Ay6QRHQt_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/80965423\/1633668973","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null,"withheld_in_countries":[]},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":5,"reply_count":30,"retweet_count":40,"favorite_count":136,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/rvTYCUEQ74","expanded_url":"https:\/\/apnews.com\/article\/coronavirus-pandemic-science-health-pandemics-united-nations-fcf28a83c9352a67e50aa2172eb01a2f","display_url":"apnews.com\/article\/corona\u2026","indices":[26,49]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/rvTYCUEQ74","expanded_url":"https:\/\/apnews.com\/article\/coronavirus-pandemic-science-health-pandemics-united-nations-fcf28a83c9352a67e50aa2172eb01a2f","display_url":"apnews.com\/article\/corona\u2026","indices":[44,67]}],"user_mentions":[{"screen_name":"corybernardi","name":"CoryBernardi.com.au","id":80965423,"id_str":"80965423","indices":[3,16]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1637566666722"}'
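A hedged guess based on that b-prefix: the file may contain the textual repr of the raw bytes (for example from printing or writing str(data)), in which case json.loads sees the literal characters b'{... and fails at line 1 column 1. A minimal sketch that converts such lines back before parsing (the use of ast.literal_eval is only an assumption about how the lines were produced):

import ast
import json

with open("covid-test-out", "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # lines stored as a bytes repr, e.g. b'{"created_at": ...}'
        if line.startswith("b'") or line.startswith('b"'):
            line = ast.literal_eval(line).decode("utf-8")
        data = json.loads(line)
        print(data["id_str"], data["user"]["screen_name"])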
I have a problem using pywinrm on Linux to get a PowerShell session.
I have read several posts and questions about this, but none of them solve my problem.
The error is in the Kerberos authentication. This is my krb5.conf:
[libdefaults]
default_realm = DOMAIN.COM.BR
ticket_lifetime = 24000
clock-skew = 300
dns_lookup_kdc = true

# [realms]
# LABCORP.CAIXA.GOV.BR = {
#   kdc = DOMAIN.COM.BR
#   kdc = DOMAIN.COM.BR
#   admin_server = DOMAIN.COM.BR
#   default_domain = DOMAIN.COM.BR
# }

[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log

# [domain_realm]
# .DOMAIN.COM.BR = DOMAIN.COM.BR
# server.com = DOMAIN.COM.BR
My /etc/resolv.conf is:
search DOMAIN.COM.BR
nameserver IP
And my python code is:
import winrm
s = winrm.Session(
    'DOMAIN.COM.BR',
    transport='kerberos',
    auth=('my_active_directory_user', 'my_active_directory_password'),
    server_cert_validation='ignore')
r = s.run_cmd('ipconfig', ['/all'])
And the server return this error:
winrm.exceptions.WinRMTransportError: ('http', 'Bad HTTP response returned from server. Code 500')
The port on the server is open; I can see it with nmap:
5985/tcp open wsman
I can ping and resolve the name of the server:
$ ping DOMAIN.COM.BR
PING DOMAIN.COM.BR (IP) 56(84) bytes of data.
64 bytes from IP: icmp_seq=2 ttl=127 time=0.410 ms
64 bytes from IP: icmp_seq=2 ttl=127 time=0.410 ms
I can use kinit without problem to get the ticket:
$ kinit my_active_directory_user@DOMAIN.COM.BR
And, list the ticket:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: my_active_directory_user@DOMAIN.COM.BR
Valid starting Expires Service principal
05-09-2017 10:23:52  05-09-2017 17:03:50  krbtgt/DOMAIN.COM.BR@DOMAIN.COM.BR
What kind of problem is that?
Another solution is to add this line with allow_weak_crypto to your krb5.conf file:
[libdefaults]
***
allow_weak_crypto = true
***
Problem:
I am parsing a log line from a service which I installed, with a custom date field. I want to match the log line and then see if new logs came into the logfile.
To match the logfile I am using a regex to exactly match the date in the log line. I attached the code part below.
Code:
def matchDate(self, line):
    matchThis = ""
    # Thu Jul 27 00:03:27 2017
    matched = re.match(r'\d\d\d\ \d\d\d \d\d\ \d\d:\d\d:\d\d \d\d\d\d', line)
    print matched
    if matched:
        # matches a date and adds it to matchThis
        matchThis = matched.group()
        print 'Match found {}'.format(matchThis)
    else:
        matchThis = "NONE"
    return matchThis

def log_parse(self):
    currentDict = {}
    with open(self.default_log, 'r') as f:
        for line in f:
            print line
            if line.startswith(self.matchDate(line), 0, 24):
                if currentDict:
                    yield currentDict
                currentDict = {
                    "date": line.split('[')[0][:24],
                    "no": line.split(']')[0][-4:-1],
                    "type": line.split(':')[0][-4:-1],
                    "text": line.split(':')[1][1:]
                }
            else:
                pass
                # currentDict['text'] += line
    yield currentDict
It was not matching anything here, so I fixed it with a new regex like this:
'[A-Za-z]{3} [A-Za-z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}'
Here is the regex in an online regex editor: http://regexr.com/3gl67
Any suggestions on how to solve this problem and exactly match the log line?
Example Logline:
Wed Aug 30 13:05:47 2017 [3163] INFO: Something new, the something you looking for is hidden. Update finished.
Wed Aug 2 13:05:47 2017 [3163] INFO: Something new, the something you looking for is hidden. Update finished.
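For reference, a small sketch of how the corrected pattern could slot into matchDate (the compiled-pattern name is mine, and the day part is loosened to \d{1,2} so the second example line with its single-digit day would also match):

import re

DATE_PATTERN = re.compile(r'[A-Za-z]{3} [A-Za-z]{3} \d{1,2} \d{2}:\d{2}:\d{2} \d{4}')

def matchDate(self, line):
    # return the leading timestamp, or "NONE" if the line does not start with one
    matched = DATE_PATTERN.match(line)
    return matched.group() if matched else "NONE"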
I developed this code which might help you to detect the desired pattern:
import re

# detecting Thu Jul 27 00:03:27 2017
line = 'Wed Aug 30 13:05:47 2017 [3163] INFO: Something new, the something you looking for is hidden. Update finished.'

days = r'(?:Sat|Sun|Mon|Tue|Wed|Thu|Fri) '
# three-letter month abbreviations, as they appear in the log lines
months = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) '
day_number = r'\d{2} '
time = r'\d{1,2}:\d{1,2}:\d{1,2} '
year = r'\d{4} '

date = days + months + day_number
pattern = date + time + year

date_matched = re.findall(date, line)
time_matched = re.findall(time, line)
year_matched = re.findall(year, line)
full_matched = re.findall(pattern, line)
print(date_matched, year_matched, time_matched, full_matched)

if len(full_matched) > 0:
    print('yes')
else:
    print('no')
I used specific patterns for the day name, month, day number, time and year. I am not very familiar with the re.match function, so I used re.findall. My priority was simplicity and clarity of the code, so I guess more efficient code or patterns are available. I really hope this one comes in handy anyway.
Good luck
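Since the timestamp sits at the very start of each log line, the combined pattern should also work with re.match, which anchors at the beginning of the string and is what the question's matchDate relies on. A quick sketch, reusing pattern and line from the snippet above:

import re

# `pattern` and `line` as defined in the snippet above
matched = re.match(pattern, line)
print('yes' if matched else 'no')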
I've been using the script below to download technical videos for later analysis. The script has worked well for me and retrieves the highest resolution version available for the videos that I have needed.
Now I've come across a 4K YouTube video, and my script only saves an mp4 with 1280x720.
I'd like to know if there is a way to adjust my current script to download higher-resolution versions of this video. I understand there are Python packages that might address this, but right now I would like to stick to this step-by-step method if possible.
above: info from Quicktime and OSX
"""
length: 175 seconds
quality: hd720
type: video/mp4; codecs="avc1.64001F, mp4a.40.2"
Last-Modified: Sun, 21 Aug 2016 10:41:48 GMT
Content-Type: video/mp4
Date: Sat, 01 Apr 2017 16:50:16 GMT
Expires: Sat, 01 Apr 2017 16:50:16 GMT
Cache-Control: private, max-age=21294
Accept-Ranges: bytes
Content-Length: 35933033
Connection: close
Alt-Svc: quic=":443"; ma=2592000
X-Content-Type-Options: nosniff
Server: gvs 1.
"""
import urlparse, urllib2

vid = "vzS1Vkpsi5k"
save_title = "YouTube SpaceX - Booster Number 4 - Thaicom 8 06-06-2016"
url_init = "https://www.youtube.com/get_video_info?video_id=" + vid

resp = urllib2.urlopen(url_init, timeout=10)
data = resp.read()
info = urlparse.parse_qs(data)
title = info['title']
print "length: ", info['length_seconds'][0] + " seconds"

stream_map = info['url_encoded_fmt_stream_map'][0]
vid_info = stream_map.split(",")
mp4_filename = save_title + ".mp4"

for video in vid_info:
    item = urlparse.parse_qs(video)
    print 'quality: ', item['quality'][0]
    print 'type: ', item['type'][0]
    url_download = item['url'][0]

    resp = urllib2.urlopen(url_download)
    print resp.headers
    length = int(resp.headers['Content-Length'])
    my_file = open(mp4_filename, "w+")

    done, i = 0, 0
    buff = resp.read(1024)
    while buff:
        my_file.write(buff)
        done += 1024
        percent = done * 100.0 / length
        buff = resp.read(1024)
        if not i % 1000:
            percent = done * 100.0 / length
            print str(percent) + "%"
        i += 1
    break
Ok, so I have not taken the time to get to the bottom of this. However, I did find that when you do:
stream_map = info['url_encoded_fmt_stream_map'][0]
Somehow you only get a selection of a single 720p option, one 'medium' and two 'small'.
However, if you change that line into:
stream_map = info['adaptive_fmts'][0]
you will get all the available versions, including the 2160p one. Thus, the 4K one.
PS: You'd have to comment out the print quality and print type commands, since those labels aren't always available in the new stream data. After commenting them out and adapting your script as explained above, I was able to successfully download the 4K version.
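Put together, the relevant part of the original script would change roughly like this (a sketch against the Python 2 script above; the fields present in adaptive_fmts can vary, so the guard around the print is an assumption):

stream_map = info['adaptive_fmts'][0]   # instead of 'url_encoded_fmt_stream_map'
vid_info = stream_map.split(",")

for video in vid_info:
    item = urlparse.parse_qs(video)
    # 'quality' / 'type' are not always present in adaptive_fmts, so guard the print
    if 'type' in item:
        print 'type: ', item['type'][0]
    url_download = item['url'][0]
    # ... pick the format you want and download from url_download as in the original loop ...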
Indeed, info['adaptive_fmts'][0] returns the information for the whole video, but the URL is not directly usable, nor is the progress bar.