Python - How to parse smartctl program output?

I am writing a wrapper for smartctl in Python 2.7.3.
I am having a hell of a time trying to wrap my head around how to parse the output from the smartctl program in Linux (Ubuntu x64, to be specific).
I am running smartctl -l selftest /dev/sdx via subprocess and grabbing the output into a variable.
This variable is broken up into a list, then I drop the useless header data and blank lines from the output.
Now I am left with a list of strings, which is great!
The data is sort-of tabular, and I want to parse it into a dict() full of lists (from reading the docs, I think this is the correct way to represent tabular data in Python).
Here's a sample of the data:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     44796         -
# 2  Short offline       Completed without error       00%     44796         -
# 3  Short offline       Completed without error       00%     44796         -
# 4  Short offline       Completed without error       00%     44796         -
# 5  Short offline       Completed without error       00%     44796         -
# 6  Extended offline    Completed without error       00%     44771         -
# 7  Short offline       Completed without error       00%     44771         -
# 8  Short offline       Completed without error       00%     44741         -
# 9  Short offline       Completed without error       00%         1         -
#10  Short offline       Self-test routine in progress 70%     44813         -
I can see some issues with trying to parse this, and am open to solutions, but I may also just be doing this all wrong ;-):
- The Status text "Self-test routine in progress" flows past the first character of the "Remaining" header.
- In the Num column, the numbers after 9 are not separated from the # character by a space.
I might be way off-base here, but this is my first time trying to parse something this eccentric.
Thanks in advance to everyone who even bothers to read this wall of text!
Here's my code so far, if anyone feels it necessary or finds it useful:
# testStatus.py
# This module provides an interface for retrieving
# test status and results for ongoing and completed
# drive tests
import subprocess

# This function takes a list of strings and removes
# strings which do not have pertinent information
def cleanOutput(data):
    cleanedOutput = []
    del data[0:3]  # This deletes records 0-2 (lines 1-3) from the list
    for item in data:
        if item != '':  # This skips blank items in the remaining list
            cleanedOutput.append(item)
    return cleanedOutput

def resultsOutput(data):
    headerLines = []
    resultsLines = []
    resultsTable = {}
    for line in data:
        if "START OF READ" in line or "log structure revision" in line:
            headerLines.append(line)
        else:
            resultsLines.append(line)
    nameLine = resultsLines[0].split()
    print nameLine

def getStatus(sdxPath):
    try:
        output = subprocess.check_output(["smartctl", "-l", "selftest", sdxPath])
    except subprocess.CalledProcessError:
        print ("smartctl command failed...")
        return  # nothing to parse if the command failed
    except Exception as e:
        print (e)
        return
    splitOutput = output.split('\n')
    cleanedOutput = cleanOutput(splitOutput)
    resultsOutput(cleanedOutput)

# For Testing
getStatus("/dev/sdb")

For what it's worth (this is an old question): since version 7.0, smartctl has a --json flag, so you can request JSON output and parse it like any other JSON (see the release notes).
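A minimal sketch of the --json route, written for Python 3 rather than the 2.7 used in the question. The function names are made up for illustration, and the key path used in selftest_rows ("ata_smart_self_test_log" / "standard" / "table") is an assumption about the layout for ATA drives; check it against the JSON your own smartctl version prints. The sample document at the bottom is hand-written, not real smartctl output.

```python
import json
import subprocess


def run_smartctl_json(device):
    """Run smartctl with --json and return the decoded document."""
    # smartctl reports drive status through bits in its exit code, so a
    # non-zero return code does not by itself mean there is no output.
    proc = subprocess.run(
        ["smartctl", "--json", "-l", "selftest", device],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return json.loads(proc.stdout)


def selftest_rows(report):
    """Return the self-test entries, or [] if the assumed keys are absent."""
    # ASSUMPTION: this key path is illustrative; verify it on your system.
    log = report.get("ata_smart_self_test_log", {})
    return log.get("standard", {}).get("table", [])


# Hand-written sample so the parsing step can be tried without a drive.
sample = json.loads(
    '{"ata_smart_self_test_log": {"standard": {"table": '
    '[{"lifetime_hours": 44796}, {"lifetime_hours": 44813}]}}}'
)
rows = selftest_rows(sample)
hours = [row["lifetime_hours"] for row in rows]
```

On a real machine, selftest_rows(run_smartctl_json("/dev/sdb")) would replace the hand-written sample.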

The main parsing problem seems to be the first three columns; the remaining data is more straightforward. Assuming the output uses blanks between fields (instead of tab characters, which would be much easier to parse), I'd go for fixed-length parsing, something like:
num = line[1:3]  # two characters wide, so "#10" is read as "10"
desc = line[5:25]
status = line[25:54]
remain = line[54:58]
lifetime = line[60:68]
lba = line[77:99]
The header line would be handled differently. What structure you put the data into depends on what you want to do with it. A dictionary keyed by "num" might be appropriate if you mainly want random access to the data by that "num" identifier. Otherwise a list might be better. Each entry (per line) could be a tuple, a list, a dictionary, a class instance, or maybe other things. If you want to access fields by name, then a dictionary or class instance per entry might be appropriate.
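The fixed-length idea can be sketched as a small table-driven parser that produces one dictionary per row. The column boundaries in FIELDS are assumptions for illustration, and the sample rows are generated from a format string so that they line up with those boundaries; measure the offsets against the real output on your system before relying on them. The sketch is written for Python 3.

```python
# (name, start, end) slices; ASSUMED offsets, adjust to the real output.
FIELDS = [
    ("num", 1, 3),
    ("description", 5, 24),
    ("status", 25, 54),
    ("remaining", 55, 58),
    ("lifetime", 60, 68),
    ("lba", 77, None),
]


def parse_row(line):
    """Slice one fixed-width row into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}


def parse_table(lines):
    """Parse every row that starts with '#', which skips the header line."""
    return [parse_row(line) for line in lines if line.startswith("#")]


# Sample rows built with a format string that matches FIELDS exactly.
ROW = "#{:>2}  {:<19} {:<29} {:>3}  {:>8}         {}"
sample = [
    "Num  Test_Description    Status",
    ROW.format(9, "Short offline", "Completed without error", "00%", 1, "-"),
    ROW.format(10, "Short offline", "Self-test routine in progress", "70%", 44813, "-"),
]
rows = parse_table(sample)
by_num = {row["num"]: row for row in rows}  # dictionary keyed by "num"
```

Because the slices ignore the header text entirely, the long status string and the missing space in "#10" no longer matter.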

Related

Python loop error in SPSS syntax only if i run the same code twice

I'm quite new to Python programming.
I'm trying to automate some tabulations in SPSS using Python (and I kind of managed it...) with a loop and some Python code, but it works fine only the first time I run the syntax. The second time, it tabulates only once.
I have an SPSS file with different projects merged together (i.e. different countries), so first I try to extract a list of projects using a function.
Once I have my list of projects, I run a loop and change the SPSS syntax for the case selection and tabulation.
This is the code:
begin program.
import spss

# Function that extracts the data from SPSS
def DatiDaSPSS(vars, num):
    if num == 0:
        num = spss.GetCaseCount()
    if vars is None:
        varNums = range(spss.GetVariableCount())
    else:
        allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
        varNums = [allvars.index(i) for i in vars]
    data = spss.Cursor(varNums)
    pydata = data.fetchmany(num)
    data.close()
    return pydata

# store the result of the function into a list:
all_prj = DatiDaSPSS(vars=["Project"], num=0)
# remove duplicates and keep only the country that I need:
prj_list = list(set([i[0] for i in all_prj]))
# loop for the tabulation:
for i in range(len(prj_list)):
    prj_now = str(prj_list[i])
    spss.Submit("""
compute filter_$=Project='%s'.
filter by filter_$.
exe.
TEXT "Country"
/OUTLINE HEADING="%s" TITLE="Country".
CTABLES
/VLABELS VARIABLES=HisInterviewer HisResult DISPLAY=DEFAULT
/TABLE HisInterviewer [C][COUNT F40.0, ROWPCT.COUNT PCT40.1] BY HisResult [C]
/CATEGORIES VARIABLES=HisInterviewer HisResult ORDER=A KEY=VALUE EMPTY=EXCLUDE TOTAL=YES
  POSITION=AFTER
/CRITERIA CILEVEL=95.
""" % (prj_now, prj_now))
end program.
When I run it the second time, it shows only the last value of the list (and only one tabulation). If I restart SPSS, it works fine the first time.
Is it because of the function?
I'm using SPSS 25.
I think I found the reason: the function picks up only the cases that are currently selected. The last loop iteration leaves a filter on the data, so on the second run only that project is visible. I tried adding this SPSS code before the begin program block to remove the filter before running the script, and it seems to be working:
use all.
exe.
begin program.
...
Please let me know if you want me to edit or remove the message.

How Do I Start Pulling Apart This Block of JSON Data?

I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or equivalent, correct term) of data.
import sys
import json
exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0
while byte < 1000:  # while byte < character we want to read up to
                    # keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    # Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if str(char) == "}":
                backbracket = byte
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket)
        print "Block is " + str(backbracket - frontbracket) + " bytes long"
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"
    byte = byte + 1
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I used the requests library to pull the JSON for you. It's a super-simple library when you're dealing with things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert it directly to Python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through to pull out specific details. In my example below, my_list2 has to use a try/except structure because it seems that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's also worth noting that JSON objects become Python dictionaries, so you are not guaranteed the order in which you receive their keys. In this case the outermost structure is a list, so in theory the order of entries could be consistent, but don't rely on it, since we don't know how the list is constructed.
import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])
print my_list1[0:5]

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print my_list2[0:5]
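The same nested-lookup pattern can be tried offline before touching the slow API. The sketch below is Python 3 and uses a tiny hand-written sample whose shape follows the description above; the values and the helper name second_sha are invented for illustration. It also catches KeyError, on the assumption that some entries might lack a key entirely rather than just having a short list.

```python
import json

# Hypothetical sample shaped like the structure described above.
SAMPLE = """
[
  {"author_name": "A", "author_key": "k1",
   "translated_problem_types": [{"items": [{"sha": "aaa"}, {"sha": "bbb"}]}]},
  {"author_name": "B", "author_key": "k2",
   "translated_problem_types": [{"items": [{"sha": "ccc"}]}]},
  {"author_name": "C", "author_key": "k3"}
]
"""


def second_sha(item):
    """Return the 'sha' of the second item, or None when it is missing."""
    try:
        return item["translated_problem_types"][0]["items"][1]["sha"]
    except (KeyError, IndexError):
        # Short list (IndexError) or absent key (KeyError).
        return None


exercises = json.loads(SAMPLE)  # the whole document parses in one call
summary = [(e["author_name"], e["author_key"], second_sha(e)) for e in exercises]
```

For the 21.6MB file from the question, json.load(open("exercises.txt")) would play the role of json.loads(SAMPLE), with no need to hunt for braces byte by byte.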

Infinite For Loop issue in Python

I need to extract an ID from a JSON string that is needed for loading information into a MySQL database. The ID is a 5 or 6 digit number, but the JSON key that contains this number is the URL net_devices resource string that has the number at the end like this example:
{u'router': u'https://www.somecompany.com/api/v2/routers/123456/'}
Since there is not a key with just the ID, I have used the following to return just the ID from the JSON key string:
url = 'https://www.somecompany.com/api/v2/net_devices/?fields=router,service_type'
r = json.loads(s.get((url), headers=headers).text)
status = r["data"]
for item in status:
    type = item['service_type']
    router_url = item['router']
    router_id = router_url.replace("https://www.somecompany.com/api/v2/routers/", "")
    id = router_id.replace("/", "")
    print id
This does indeed return just the ID values I want, and it doesn't matter if the result varies in the number of digits.
The problem: This code creates an infinite loop when I include the two lines above the print statement.
How can I change the syntax to allow the loop to run through all the returned IDs once, but still strip out everything except the numerical ID?
I am new to Python, and just starting to write code again after a very long hiatus since college. Any help would be greatly appreciated!
UPDATE
Thanks everyone for the feedback! With the help from David and Gerrat, I was able to find the issue that was causing the infinite loop and it was not this segment of the code, but another segment that was not properly indented. I am learning how to properly indent loops in Python, and this was one of my silly mistakes! Thanks again for the help!
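For the ID extraction itself, a minimal Python 3 sketch that does not depend on the exact URL prefix is to take the last path segment. The function name is made up, and the second sample entry is invented to show a 5-digit ID; only the first URL comes from the question.

```python
def router_id_from_url(url):
    """Return the last non-empty path segment of a URL as a string."""
    # rstrip drops the trailing slash; rsplit keeps only the final segment.
    return url.rstrip("/").rsplit("/", 1)[-1]


# Sample data shaped like the "data" list in the question.
status = [
    {"router": "https://www.somecompany.com/api/v2/routers/123456/",
     "service_type": "example"},
    {"router": "https://www.somecompany.com/api/v2/routers/54321/",
     "service_type": "example"},
]
ids = [router_id_from_url(item["router"]) for item in status]
```

This also avoids rebinding the built-in names type and id inside the loop.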

Using Python to split a Unicode file object into dictionary Keys and values

Hi, and thanks for reading. I'll admit that this follows on from a previous question I asked earlier, after I partially solved the issue. I am trying to process a block of text (file_object) that comes from an earlier working function. The text happens to be Unicode, but I have managed to convert it to ASCII and split it line by line. I am hoping to then split each line on the '=' symbol so that I can drop the text into a dictionary, for example 'GPS Time': '14:18:43' as key and value, which also means removing the trailing '.000' from the time (though this is a second issue).
Here’s the file_object format…
2015 Jan 01 20:07:16.047 GPS Info #Log packet ID
GPS Time = 14:18:43.000
Longitude = 000.65341
Latitude = +41.25385
Altitude = +111.400
This is my partially working function…
def process_data(file_object):
    file_object = file_object.encode('ascii','ignore')
    split = file_object.split('\n')
    for i in range(len(split)):
        while '=' in split[i]:
            processed_data = (split[i].split('=', 1) for _ in xrange(len(split)))
            return {k.strip(): v.strip() for k, v in processed_data}
This is the initial section of the main script that prompts the above function, and then sets GPS Time as the Dictionary key…
while (mypkt.Next()):  # mypkt.Next is an API function in the log processor app I am using – essentially it grabs the whole GPS Info packet shown above
    data = process_data(mypkt.Text, 1)
    packets[data['GPS Time']] = data
The code above has no problem splitting the first instance, 'GPS Time', but it ignores Longitude, Latitude, etc. To make matters worse, there is sometimes a blank line between each packet item too. I guess I need to store the previous dictionary-related splits before the 'return', but I am having difficulty working out how to do this.
The dict output I am currently getting is…
'14:19:09.000': {'GPS Time': '14:19:09.000'},
But What I am hoping for is…
'14:19:09': {'GPS Time': '14:19:09',
             'Longitude': '000.65341',
             'Latitude': '+41.25385',
             'Altitude': '+111.400'},
Thanks in advance for any help.
MikG
All this use of range(len(whatever)) is nonsense. You almost never need to do that in Python. Just iterate through the thing.
Your problem however is more fundamental: you return from inside the while loop. That means you only ever get one element, because as soon as that first line is processed, you return and the function ends.
Also, you have a while loop which means that processing will end as soon as the program encounters a line without an equals; but you have blank lines between each data line, so again execution would never proceed past the first one.
So all you need is:
def process_data(file_object):
    split_data = file_object.split('\n')
    result = {}
    for line in split_data:
        if '=' in line:
            key, value = line.split('=', 1)
            result[key.strip()] = value.strip()
    return result
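Put together as a runnable Python 3 sketch, using the sample packet from the question with a blank line added between items. The drop_fraction helper is an assumption about how the "second issue" (the trailing '.000') might be handled; it is applied only to 'GPS Time' so that values such as the longitude keep their decimals.

```python
def process_data(text):
    """Collect every 'key = value' line of one packet into a dict."""
    result = {}
    for line in text.splitlines():
        if "=" in line:  # header and blank lines are simply skipped
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def drop_fraction(timestamp):
    """Turn '14:18:43.000' into '14:18:43' (hypothetical helper)."""
    return timestamp.split(".", 1)[0]


SAMPLE = """2015 Jan 01 20:07:16.047 GPS Info #Log packet ID
GPS Time = 14:18:43.000

Longitude = 000.65341
Latitude = +41.25385
Altitude = +111.400
"""

data = process_data(SAMPLE)
data["GPS Time"] = drop_fraction(data["GPS Time"])
packets = {data["GPS Time"]: data}  # keyed by GPS time, as in the question
```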

Checking if A follows B on twitter using Tweepy/Python

I have a list of a few thousand twitter ids and I would like to check who follows who in this network.
I used Tweepy to get the accounts using something like:
ids = {}
for i in list_of_accounts:
    for page in tweepy.Cursor(api.followers_ids, screen_name=i).pages():
        ids[i] = page
        time.sleep(60)
The values in the dictionary ids form the network I would like to analyze. If I try to get the complete list of followers for each id (to compare to the list of users in the network) I run into two problems.
The first is that I may not have permission to see the user's followers - that's okay and I can skip those - but they stop my program. This is the case with the following code:
connections = {}
for x in user_ids:
    l = []
    for page in tweepy.Cursor(api.followers_ids, user_id=x).pages():
        l.append(page)
    connections[x] = l
The second is that I have no way of telling when my program will need to sleep to avoid the rate-limit. If I put a 60 second wait after every page in this query - my program would take too long to run.
I tried to find a simple 'exists_friendship' command that might get around these issues in a simpler way - but I only find things that became obsolete with the change to API 1.1. I am open to using other packages for Python. Thanks.
if api.exists_friendship(userid_a, userid_b):
    print "a follows b"
else:
    print "a doesn't follow b, check separately if b follows a"
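Once follower IDs have been collected for each account, the who-follows-whom check inside the network needs no further API calls. The Python 3 sketch below assumes a dictionary that maps each user ID to a flat list of follower IDs (so pages would need to be concatenated first); the function name and the sample IDs are invented for illustration.

```python
def follow_edges(followers_by_user):
    """Return sorted (follower, followed) pairs restricted to known users."""
    network = set(followers_by_user)
    edges = []
    for followed, followers in followers_by_user.items():
        # Keep only followers that are themselves part of the network.
        for follower in network.intersection(followers):
            edges.append((follower, followed))
    return sorted(edges)


# Hypothetical data: user 1 is followed by 2, 3 and an outsider (99).
followers_by_user = {1: [2, 3, 99], 2: [1], 3: []}
edges = follow_edges(followers_by_user)
a_follows_b = (2, 1) in edges
```

Each pair reads as "first follows second", so checking whether A follows B becomes a membership test on the result.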
