Python: How to Average Ping Times From File

I am looking to write two Python scripts: one to ping an IP and store the ping results to a file, and one to extract and average the ping times from the created .txt file. (Please note that all I really need to log are the ping times.) (My platform is the Pi, if it helps.)
Below is the line of code which I plan to use to store the ping results in a text file. (Obviously, in my program I have put this in an infinite loop with a delay so it doesn't ping too often.)
command = os.system('ping 127.0.0.1 >> pingresults.txt')
I am stuck on how to access this file, and then how to parse it into just the useful data. (Please bear in mind that I am a serious amateur.)
I am wondering whether, when I initially log the data into the file, I can filter it for just the ping time. That would make averaging the times later much easier.
If you have any suggestions of commands of interest, or tricks, or implementations, that would be great!

I'll take this in basic steps, entirely in Python, ignoring Python "tricks":
Open the file:
f = open("pingresults.txt", "r")
time_list = []
for line in f:
    # Ping time is in the penultimate field, e.g. "time=0.045"; skip lines that don't have it.
    field_list = line.split(' ')
    if len(field_list) < 2 or not field_list[-2].startswith('time='):
        continue
    ping_time = float(field_list[-2][5:])  # drop the leading "time=" (5 chars) and convert to a number
    time_list.append(ping_time)  # add this ping time to the end of time_list
f.close()
print sum(time_list) / float(len(time_list))
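If you would rather filter for just the ping time at logging time, as the question wonders, one option is to parse the ping output in Python and append only the number. This is a minimal sketch, assuming Python 2.7 on Linux/Raspberry Pi and the usual "time=0.045 ms" output format; the file name and the 5-second delay are only placeholders:
import re
import subprocess
import time

while True:
    out = subprocess.check_output(['ping', '-c', '1', '127.0.0.1'])
    match = re.search(r'time=([\d.]+)', out)
    if match:
        with open('pingresults.txt', 'a') as f:
            f.write(match.group(1) + '\n')  # log just the time in ms, one value per line
    time.sleep(5)  # delay so it doesn't ping too often
Averaging then reduces to reading one float per line and dividing the sum by the count.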

Related

How can I skip n lines of a binary stdin using Python?

I'm piping binary data to a Python script on a Hadoop cluster using the Hadoop CLI. The binary data have terminators that identify where new documents begin. The records are sorted by a unique identifier which starts at 1000000001 and increments by 1.
I am trying to save the data only for a subset of these IDs which I have in a dictionary.
My current process is to select the data from the CLI using:
hadoop select "Database" "Collection" | cut -d$'\t' -f2 | python script.py
and process it in script.py which looks like this:
import json
import sys

member_mapping = json.load(open('member_mapping.json'))
output = []

for line in sys.stdin:
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
The problem is that there are 6.5M IDs in this binary data and it takes almost 2 hours to scan. I know the min() and max() IDs in my dictionary and you can see in my code that I stop early when I have saved n documents where n is the length of my mapping file.
I want to make this process more efficient by skipping as many reads as possible. If the ID starts at 1000000001 and the first ID I want to save is 1000010001, can I simply skip 10,000 lines?
Due to system issues, I'm not currently able to use spark or any other tools that may improve this process, so I need to stick to solutions that utilize Python and the Hadoop CLI for now.
You could try using enumerate and a threshold, and then skip any input that isn't in the range you care about. This isn't a direct fix, but it should run much faster and throw those first 10,000 lines away pretty quickly.
for lineNum, line in enumerate(sys.stdin):
    if lineNum < 10000:
        continue
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break
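An alternative sketch using itertools.islice, which skips the unwanted lines without an explicit counter (member_mapping and output are assumed to be set up as in the question; 10,000 is the example offset from the question, and in practice it could be computed from the known starting ID):
import sys
import json
from itertools import islice

# Throw away the first 10,000 lines, then process the rest exactly as before.
for line in islice(sys.stdin, 10000, None):
    person = json.loads(line)
    if member_mapping.get(person['personId']):
        output.append({person['personId']: person})
    if len(output) == len(member_mapping):
        break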

Reading a file line-by-line with a timeout for lines that are taking too long?

I have a 1.2TB file that I am running some code against, but constantly running into OutOfMemoryError exceptions. I ran the following two pieces of code against the file to see what was wrong:
import sys

with open(sys.argv[1]) as f:
    count = 1
    for line in f:
        if count > 173646280:
            print line
        else:
            print count
        count += 1
And this code:
#!/usr/bin/env perl
use strict;
use warnings;
my $count = 1;
while (<>) {
    print "$count\n";
    $count++;
}
Both of them zoom until they hit line 173,646,264, and then they just completely stop. Let me just give a quick background on the file.
I created a file called groupBy.json. I then processed that file with some Java code to transform the JSON objects and created a file called groupBy_new.json. I put groupBy_new.json on s3, pulled it down on another server, and was doing some processing on it when I started getting OOM errors. I figured that maybe the file got corrupted when transferring to s3. I ran the above Python/Perl code on groupBy_new.json on both serverA (the server where it was originally created) and serverB (the server onto which I pulled the file off s3), and both halted at the same line. I then ran the above Python/Perl code on groupBy.json, the original file, and it also halted. I tried to recreate groupBy_new.json with the same code that I had used to originally create it, and ran into an OOM error.
So this is a really odd problem that is perplexing me. In short, I'd like to get rid of this line that is causing me problems. What I'm trying to do is read a file with a timeout on the line being read. If it cannot read the input line in 2 seconds or so, move on to the next line.
What you can do is count the number of lines up to the problem line and output that count (make sure you flush the output; see https://perl.plover.com/FAQs/Buffering.html ). Then write another program that copies that many lines to a different file, then reads the input stream character by character (see http://perldoc.perl.org/functions/read.html ) until it hits a "\n", and then copies the rest of the file, either line by line or in chunks.
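A minimal Python sketch of that idea, under the assumption that the problem starts at the line after the last one successfully counted (173,646,264 in the question); the file names are placeholders. It copies the good lines, skips the bad "line" one byte at a time so it never has to fit in memory, then streams the rest in chunks:
bad_line = 173646265  # illustrative: the line after the last one that printed
with open('groupBy_new.json', 'rb') as src, open('groupBy_fixed.json', 'wb') as dst:
    # Copy the lines before the problem line unchanged.
    for _ in xrange(bad_line - 1):
        dst.write(src.readline())
    # Skip the problem line byte by byte until the next newline (or EOF).
    while True:
        c = src.read(1)
        if not c or c == '\n':
            break
    # Copy the remainder of the file in 1 MB chunks.
    while True:
        chunk = src.read(1 << 20)
        if not chunk:
            break
        dst.write(chunk)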

Efficient way to find a string based on a list

I'm new to scripting and have been reading up on Python for about 6 weeks. The code below is meant to read a log file and send an alert if one of the keywords defined in srchstring is found, and it works as expected, including not alerting on strings previously found. However, the file it's processing is actively being written to by an application, and the script is too slow on files around 500 MB; under 200 MB it works fine, i.e. within 20 seconds.
Could someone suggest a more efficient way to search for a string within a file based on a pre-defined list?
import os

srchstring = ["Shutdown", "Disconnecting", "Stopping Event Thread"]

if os.path.isfile(r"\\server\\share\\logfile.txt"):
    with open(r"\\server\\share\\logfile.txt", "r") as F:
        for line in F:
            for st in srchstring:
                if st in line:
                    print line,
                    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
                    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms
                    if os.path.isfile("file_dd/mm/yy hh:mm:ss:ms"):  # check if a file already exists named file_dd/mm/yy hh:mm:ss:ms
                        print "string previously found - ignoring, continuing search"  # marker file exists
                    else:
                        open("file_dd/mm/yy hh:mm:ss:ms", 'a')  # create file_dd/mm/yy hh:mm:ss:ms
                        print "error string found - creating marker file, sending email alert"  # no marker file, create it then send email
else:
    print "file not exist"
Refactoring the search expression to a precompiled regular expression avoids the (explicit) innermost loop.
import os, re

regex = re.compile(r'Shutdown|Disconnecting|Stopping Event Thread')

if os.path.isfile(r"\\server\\share\\logfile.txt"):
    # Indentation fixed as per comment
    with open(r"\\server\\share\\logfile.txt", "r") as F:
        for line in F:
            if regex.search(line):
                # ...
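Since the log is being appended to continuously, another thing worth trying (not from the answer above, just a suggestion) is to remember how far the previous run got and scan only the newly appended bytes on the next run. A rough sketch, with the offset-file name as a placeholder (a real script would also need to handle the log being rotated or truncated between runs):
import os
import re

LOGFILE = r"\\server\\share\\logfile.txt"
OFFSET_FILE = "logfile.offset"  # placeholder name for persisting the last-read position
regex = re.compile(r'Shutdown|Disconnecting|Stopping Event Thread')

# Start from wherever the previous run stopped, if a saved offset exists.
start = 0
if os.path.isfile(OFFSET_FILE):
    with open(OFFSET_FILE) as f:
        start = int(f.read().strip() or 0)

with open(LOGFILE, "r") as F:
    F.seek(start)
    for line in F:
        if regex.search(line):
            print line,  # the marker-file / alert logic from the question goes here
    end = F.tell()

with open(OFFSET_FILE, "w") as f:
    f.write(str(end))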
I assume here that you use Linux. If you don't, install MinGW on Windows and the solution below will become suitable too.
Just leave the hard part to the most efficient tools available. Filter your data before it reaches the Python script. Use the grep command to get the lines containing "Shutdown", "Disconnecting" or "Stopping Event Thread":
grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt
and redirect the lines to your script:
grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt | python log.py
Edit: Windows solution. You can create a .bat file to make it executable.
findstr /c:"Shutdown" /c:"Disconnecting" /c:"Stopping Event Thread" \\server\share\logfile.txt | python log.py
In 'log.py', read from stdin. It's a file-like object, so no difficulties here:
import sys

for line in sys.stdin:
    print line,
    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms
    # and so on
This solution will reduce the amount of work your script has to do. As Python isn't a fast language, it may speed up the task. I suspect it could be rewritten purely in bash and would be even faster (20+ years of optimization of a C program is not something you can compete with easily), but I don't know bash well enough.

Getting number of lines in a text file without readlines

Let's say I have a program that uses a .txt file to store data it needs to operate. Because it's a very large amount of data (just go with it) in the text file, I want to use a generator rather than an iterator to go through the data in it, so that my program uses as little memory as possible. Let's just say (I know this isn't secure) that it's a list of usernames. So my code would look like this (using Python 3.3).
for x in range(LenOfFile):
    id = file.readlines(x)
    if username == id:
        validusername = True
        # ask for a password
if validusername == True and validpassword == True:
    pass
else:
    print("Invalid Username")
Assume that validpassword is set to True or False where I ask for a password. My question is this: since I don't want to take up all of the RAM, I don't want to use readlines() to get the whole thing, and with the code here I only take a very small amount of RAM at any given time. However, I am not sure how I would get the number of lines in the file (assume I cannot find the number of lines and add to it as new users arrive). Is there a way Python can do this without reading the entire file and storing it at once? I already tried len(), which apparently doesn't work on text files but was worth a try. The one way I have thought of to do this is not too great: it involves using readlines one line at a time, in a range so big the text file must be smaller, and then continuing when I get an error. I would prefer not to use this approach, so any suggestions would be appreciated.
You can just iterate over the file handle directly, which will then iterate over it line-by-line:
for line in file:
    if username == line.strip():
        validusername = True
        break
Other than that, you can’t really tell how many lines a file has without looking at it completely. You do know how big a file is, and you could make some assumptions on the character count for example (UTF-8 ruins that though :P); but you don’t know how long each line is without seeing it, so you don’t know where the line breaks are and as such can’t tell how many lines there are in total. You still would have to look at every character one-by-one to see if a new line begins or not.
So instead of that, we just iterate over the file and stop whenever we have read a whole line (that is when the loop body executes), and then we continue looking from that position in the file for the next line break, and so on.
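If the actual line count is still needed for some other reason, here is a minimal sketch that reads the file once without keeping it in memory (the file name is illustrative):
# Count the lines lazily: the file is read once, line by line, and never held in memory.
with open("usernames.txt") as f:
    line_count = sum(1 for _ in f)
print(line_count)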
Yes, the good news is that you can find the number of lines in a text file without readlines, "for line in file", and so on. More specifically, in Python you can use byte functions, random access, parallel operation, and regular expressions instead of slow sequential text-line processing. A parallel line counter for text files such as CSVs is particularly suitable for SSD devices, which have fast random access, when combined with many processor cores. I used a 16-core system with an SSD to store the Higgs Boson dataset as a standard file, which you can download to test on. Here are fragments from working code to get you started. You are welcome to copy and use it freely, but if you do then please cite my work, thank you:
import re
from argparse import ArgumentParser
from multiprocessing import Pool
from itertools import repeat
from os import stat

unitTest = 0
fileName = None
balanceFactor = 2
numProcesses = 1

if __name__ == '__main__':
    argparser = ArgumentParser(description='Parallel text file like CSV file line counter is particularly suitable for SSD which have fast random access')
    argparser.add_argument('--unitTest', default=unitTest, type=int, required=False, help='0:False 1:True.')
    argparser.add_argument('--fileName', default=fileName, required=False, help='')
    argparser.add_argument('--balanceFactor', default=balanceFactor, type=int, required=False, help='integer: 1 or 2 or 3 are typical')
    argparser.add_argument('--numProcesses', default=numProcesses, type=int, required=False, help='integer: 1 or more. Best when matched to number of physical CPU cores.')
    cmd = vars(argparser.parse_args())
    unitTest = cmd['unitTest']
    fileName = cmd['fileName']
    balanceFactor = cmd['balanceFactor']
    numProcesses = cmd['numProcesses']

# Do arithmetic to divide partitions into startbyte, endbyte strips among workers (2 lists of int)
# Best number of strips to use is 2x to 3x number of workers, for workload balancing
# import numpy as np  # long heavy import but I love numpy syntax
def PartitionDataToWorkers(workers, items, balanceFactor=2):
    strips = balanceFactor * workers
    step = int(round(float(items) / strips))
    startPos = list(range(1, items + 1, step))
    if len(startPos) > strips:
        startPos = startPos[:-1]
    endPos = [x + step - 1 for x in startPos]
    endPos[-1] = items
    return startPos, endPos

def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):  # counts the number of searchChar appearing in the byte range
    with open(fileName, 'r') as f:
        f.seek(startByte - 1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
        bytes = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, bytes))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
    return cnt

if 0 == unitTest:
    # Run app, not unit tests.
    fileBytes = stat(fileName).st_size  # Read quickly from the OS how many bytes are in the text file
    startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
    p = Pool(numProcesses)
    partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte is already a list. fileName is made into a same-length list of duplicate values.
    globalSum = sum(partialSum)
    print(globalSum)
else:
    print("Running unit tests")  # Bash commands like: head --bytes 96 beer.csv are how I found the correct values.
    fileName = 'beer.csv'  # byte 98 is a newline
    assert(8 == ReadFileSegment(1, 288, fileName))
    assert(1 == ReadFileSegment(1, 100, fileName))
    assert(0 == ReadFileSegment(1, 97, fileName))
    assert(1 == ReadFileSegment(97, 98, fileName))
    assert(1 == ReadFileSegment(98, 99, fileName))
    assert(0 == ReadFileSegment(99, 99, fileName))
    assert(1 == ReadFileSegment(98, 98, fileName))
    assert(0 == ReadFileSegment(97, 97, fileName))
    print("OK")
The bash wc program is slightly faster, but you wanted pure Python, and so did I. Below are some performance-testing results. That said, if you change some of this code to use cython or something, you might get even more speed.
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.257s
user 0m12.088s
sys 0m20.512s
HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv
real 0m1.820s
user 0m0.364s
sys 0m1.456s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.256s
user 0m10.696s
sys 0m19.952s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000
real 0m17.380s
user 0m11.124s
sys 0m6.272s
Conclusion: The speed is good for a pure python program compared to a C program. However, it’s not good enough to use the pure python program over the C program.
I wondered if compiling the regex just one time and passing it to all workers will improve speed. Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers is dominating.
One more thing. Does parallel CSV file reading even help, I wondered? Is the disk the bottleneck, or is it the CPU? Oh yes, yes it does. Parallel file reading works quite well. Well there you go!
Data science is a typical use case for pure python. I like to use python (jupyter) notebooks, and I like to keep all code in the notebook rather than use bash scripts when possible. Finding the number of examples in a dataset is a common need for doing machine learning where you generally need to partition a dataset into training, dev, and testing examples.
Higgs Boson dataset:
https://archive.ics.uci.edu/ml/datasets/HIGGS
If you want the number of lines in a file so badly, why don't you just use len:
with open("filename") as f:
    num = len(f.readlines())

How to read and truncate the snmptrapd log file without restarting the daemon

I have made a Python script that performs a Nagios check. The functionality of the script is pretty simple: it just parses a log and matches some info which is used to construct the Nagios check output. The log is a snmptrapd log which records the traps from other servers and logs them in /var/log/snmptrapd, after which I just parse them with the script. In order to have the latest traps, I erase the log from Python each time after reading it. In order to preserve the info, I have made a cron job that copies the content of the log into another log at a time interval a bit smaller than the Nagios check interval. The thing that I don't understand is why the log is growing so much (I mean, the messages log, which has I guess 1000 times more info, is smaller). From what I've seen in the log there are a lot of special characters like ^#, and I think this is caused by the way I'm manipulating the file from Python, but seeing that I only have about three weeks of experience with it, I can't seem to figure out the problem.
The script code is the following:
import sys, os, re

validstring = "OK"
filename = "/var/log/snmptrapd.log"

if os.stat(filename)[6] == 0:
    print validstring
    sys.exit()
else:
    f = open(filename, "r")
    sharestring = ""
    line1 = []
    patte0 = re.compile("[0-9]+-[0-9]+-[0-9]+")
    patte2 = re.compile("NG: [a-zA-Z\s=0-9]+.*")
    for line in f:
        line1 = line.split(" ")
        if re.search(patte0, line1[0]):
            sharestring = sharestring + line1[1] + " "
            continue
        result2 = re.search(patte2, line)
        if result2:
            result22 = result2.group()
            result22 = result22.replace("NG:", "")
            sharestring = sharestring + result22 + " "
    f.close()
    f1 = open(filename, "w")
    f1.close()
    print sharestring
    sys.exit(2)
The log looks like:
2012-07-11 04:17:16 Some IP(via UDP: [this is an ip]:port) TRAP, SNMP v1, community somestring
SNMPv2-SMI::enterprises.OID Some info which is not necesarry
SNMPv2-MIB::sysDescrOID = STRING: info which i'm matching
I'm pretty sure that it has something to do with my way of erasing the file, but I can't figure it out. If you have some idea I would be really interested. Thank you.
For information about the size: I have 93 lines (so says Vim), yet the log occupies 161K, which is not OK because the lines are quite short.
OK, it has nothing to do with the way I read and erased the file. It is something in the snmptrapd daemon that is doing this when I'm erasing its log file. I have modified my code and now I send SIGSTOP to snmptrapd right before I open the file, make my modifications to the file, and then send SIGCONT after I'm done, but it seems I experience the same behavior. The new code looks like this (only the parts that differ):
else:
    command = "pidof snmptrapd"
    p = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE)
    pidstring = p.stdout.readline()
    patte1 = re.compile("[0-9]+")
    pidnr = re.search(patte1, pidstring)
    pid = pidnr.group()
    os.kill(int(pid), SIGSTOP)
    time.sleep(0.5)
    f = open(filename, "r+")
    sharestring = ""
and
            sharestring = sharestring + result22 + " "
    f.truncate(0)
    f.close()
    time.sleep(0.5)
    os.kill(int(pid), SIGCONT)
    print sharestring
I'm thinking of stopping the daemon, erasing the file, then recreating it with the proper permissions, and starting the daemon again.
I don't think you can, but here are some things to try
Truncating a File
f1 = open(filename, 'w')
f1.close()
is a hacky, side-effect way of deleting a file's contents and will probably cause undesired side effects, depending on the underlying OS, if other applications have that file open.
Using the File Object method truncate()
truncate([size])
Truncate the file's size. If the optional size argument is present,
the file is truncated to (at most) that size. The size defaults to the
current position. The current file position is not changed. Note that
if a specified size exceeds the file's current size, the result is
platform-dependent: possibilities include that the file may remain
unchanged, increase to the specified size as if zero-filled, or
increase to the specified size with undefined new content.
Availability: Windows, many Unix variants.
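For example, a minimal sketch of reading the log and then truncating it through the same file object, which is roughly what the updated code in the question does:
# Read everything, then empty the file in place via the same file object.
with open("/var/log/snmptrapd.log", "r+") as f:
    data = f.read()
    f.seek(0)      # truncate() defaults to the current position, so rewind first
    f.truncate()   # the file is now empty; data still holds the old contents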
Probably the only deterministic way to do this is:
stop the snmptrapd process at the start of the script, use the proper os module function remove() to delete the file, then recreate the file and restart the snmptrapd daemon at the end of the script.
os.remove(path)
Remove (delete) the file path. If path is a directory, OSError is
raised; see rmdir() below to remove a directory. This is identical to
the unlink() function documented below. On Windows, attempting to
remove a file that is in use causes an exception to be raised; on
Unix, the directory entry is removed but the storage allocated to the
file is not made available until the original file is no longer in
use.
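A rough sketch of that stop/remove/recreate/restart sequence, assuming snmptrapd is managed as a SysV-style service (the service commands, log path, and permissions are assumptions; on a systemd machine the calls would be systemctl stop/start snmptrapd instead):
import os
import subprocess

LOGFILE = "/var/log/snmptrapd.log"

# Stop the daemon so it is not writing while the log file is replaced.
subprocess.call(["service", "snmptrapd", "stop"])   # assumed SysV-style service command

# ... read and process LOGFILE here, as in the original script ...

os.remove(LOGFILE)              # delete the old log
open(LOGFILE, "a").close()      # recreate it empty
os.chmod(LOGFILE, 0644)         # assumed permissions; match whatever snmptrapd expects

subprocess.call(["service", "snmptrapd", "start"])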
Shared resource concern
You still might have problems with two processes fighting to write to a single file without some kind of locking mechanism, and with non-deterministic things happening to the file. I bet you can send a SIGINT or something similar to your daemon process to get it to re-read the file or something; check your documentation.
Manipulating shared resources, especially file resources without exclusive locking is going to be trouble, especially with filesystem caching and application caching of data.
