How to process the output from a command invoked by Python

I want to write a Python program that invokes ipcs and uses its output to delete shared memory segments and semaphores.
I have a working solution, but I feel there must be a better way to do this.
Here is my program:
import subprocess

def getid(ip):
    ret = ''
    while output[ip] == ' ':
        ip = ip + 1
    while output[ip].isdigit():
        ret = ret + output[ip]
        ip = ip + 1
    return ret
print 'invoking ipcs'
output = subprocess.check_output(['ipcs'])
print output
for i in range(len(output)):
    if output[i] == 'm':
        r = getid(i + 1)
        print r
        if r:
            op = subprocess.check_output(['ipcrm', '-m', r])
            print op
    elif output[i] == 's':
        r = getid(i + 1)
        print r
        if r:
            op = subprocess.check_output(['ipcrm', '-s', r])
            print op
print 'invoking ipcs'
output = subprocess.check_output(['ipcs'])
print output
In particular, is there a better way to write "getid"? I.e., instead of parsing the output character by character, can I parse it string by string?
This is what the output variable looks like (before parsing):
Message Queues:
T ID KEY MODE OWNER GROUP
Shared Memory:
T ID KEY MODE OWNER GROUP
m 262144 0 --rw-rw-rw- xyz None
m 262145 0 --rw-rw-rw- xyz None
m 262146 0 --rw-rw-rw- xyz None
m 196611 0 --rw-rw-rw- xyz None
m 196612 0 --rw-rw-rw- xyz None
m 262151 0 --rw-rw-rw- xyz None
Semaphores:
T ID KEY MODE OWNER GROUP
s 262144 0 --rw-rw-rw- xyz None
s 262145 0 --rw-rw-rw- xyz None
s 196610 0 --rw-rw-rw- xyz None
Thanks!

You can pipe in the output from ipcs line by line as it produces it. Then I would use .strip().split() to parse each line, and something like a try/except block to make sure the line fits your criteria. Parsing it as a stream of characters makes things more complicated; I wouldn't recommend it.
import subprocess

proc = subprocess.Popen(['ipcs'], stdout=subprocess.PIPE)
for line in iter(proc.stdout.readline, ''):
    line = line.strip().split()
    try:
        r = int(line[1])
    except (IndexError, ValueError):
        # header, blank, or otherwise non-data line
        continue
    if line[0] == "m":
        op = subprocess.check_output(['ipcrm', '-m', str(r)])
        print op
    elif line[0] == "s":
        op = subprocess.check_output(['ipcrm', '-s', str(r)])
        print op
proc.wait()

There is really no need to iterate over the output one character at a time.
For one, you should split the output string into lines and iterate over those, handling one at a time. That is done with the splitlines method of strings (see the docs for details).
You could further split the lines on blanks, using split(), but given the regularity of your output, a regular expression fits the bill nicely. Basically, if the first character is either m or s, the following run of digits is your id, and whether m or s matched decides your next action.
You can name the groups of characters you identified, which makes the regular expression easier to read and the result more comfortable to handle via groupdict.
import re

pattern = re.compile(r'^((?P<mem>m)|(?P<sem>s))\s+(?P<id>\d+)')

for line in output.splitlines():
    m = pattern.match(line)
    if m:
        groups = m.groupdict()
        _id = groups['id']
        if groups['mem']:
            print 'handling a memory line'
            pass  # handle memory case
        else:
            print 'handling a semaphore line'
            pass  # handle semaphore case
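For comparison, the split()-on-blanks route mentioned above works without a regex at all. A minimal sketch, assuming the same ipcs output as in the question and that data rows keep the id in the second whitespace-separated column:
import subprocess

# Plain-split take on the same task: keep rows whose first field is
# 'm' (shared memory) or 's' (semaphore) and remove them by id.
output = subprocess.check_output(['ipcs'])
for line in output.splitlines():
    fields = line.split()
    if fields and fields[0] in ('m', 's'):
        flag = '-m' if fields[0] == 'm' else '-s'
        print subprocess.check_output(['ipcrm', flag, fields[1]])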

You could use the string split method to split each line on the whitespace in it.
So you could use:
for line in output.splitlines():
    fields = line.split()
    if fields and fields[0] == 'm':
        id = fields[1]
        print id

Related

Add linebreaks after N special characters are observed

I have a requirement wherein I have a CSV file whose data is in the wrong format. Based on the number of pipes, I need to add a newline character to make the data ready for consumption.
Can we count the number of pipes and add a newline (\n) character?
Example:
sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0
Expected output:
sadasd|asdasd|l||||0
sds|sdsds|2||||0
sdsd|asdasd|l||||0
Something like this?
_in = "sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0"
_out = ""
pipeCount = 0
for char in _in:
    if pipeCount == 6:
        _out = _out + char + "\n"
        pipeCount = 0
    else:
        _out = _out + char
    if char == "|":
        pipeCount += 1
print(_out)
I am not sure I understood the criterion for adding the newline (see comments on the question), but my output matches your expectation:
sadasd|asdasd|l||||0
sds|sdsds|2||||0
sdsd|asdasd|l||||0
The output is still a string, but you can just as easily make it a list of strings.
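For instance, a one-line follow-up on the _out string built above:
# Split the newline-joined result into a list of record strings
lines = _out.splitlines()
print(lines)  # ['sadasd|asdasd|l||||0', 'sds|sdsds|2||||0', 'sdsd|asdasd|l||||0']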

Better regex implementation than for-looping over the whole file?

I have files looking like this:
# BJD K2SC-Flux EAPFlux Err Flag Spline
2457217.463564 5848.004 5846.670 6.764 0 0.998291
2457217.483996 6195.018 6193.685 6.781 1 0.998291
2457217.504428 6396.612 6395.278 6.790 0 0.998292
2457217.524861 6220.890 6219.556 6.782 0 0.998292
2457217.545293 5891.856 5890.523 6.766 1 0.998292
2457217.565725 5581.000 5579.667 6.749 1 0.998292
2457217.586158 5230.566 5229.232 6.733 1 0.998292
2457217.606590 4901.128 4899.795 6.718 0 0.998293
2457217.627023 4604.127 4602.793 6.700 0 0.998293
I need to find and count the lines with Flag = 1. (5th column.) This is how I have done it:
foundlines = []
c = 0
import re
with open('examplefile') as f:
    for index, line in enumerate(f):
        try:
            found = re.findall(r' 1 ', line)[0]
            foundlines.append(index)
            print(line)
            c += 1
        except:
            pass
print(c)
In Shell, I would just do grep " 1 " examplefile | wc -l, which is much shorter than the Python script above. The Python script works, but I am interested in whether there is a shorter, more compact way to do the task. I prefer the shortness of Shell and would like something similar in Python.
You have CSV data, so you can use the csv module:
import csv

with open('your file', 'r', newline='', encoding='utf8') as fp:
    rows = csv.reader(fp, delimiter=' ')
    # generator comprehension
    errors = (row for row in rows if row[4] == '1')
    for error in errors:
        print(error)
Your shell implementation can be made even shorter; grep has a -c option to get you the count, so there is no need for an anonymous pipe and wc:
grep -c " 1 " examplefile
Your shell code simply counts the lines where the pattern ' 1 ' is found, but your Python code additionally keeps a list of indexes of the lines where the pattern is matched.
To get only the line count, you can use sum with a generator or list comprehension; there is also no need for a regex, since a simple substring check with the in operator will do:
with open('examplefile') as f:
    count = sum(1 for line in f if ' 1 ' in line)
print(count)
If you want to keep the indexes as well, you can stick with your approach, only replacing the regex test with a substring test:
count = 0
indexes = []
with open('examplefile') as f:
    for idx, line in enumerate(f):
        if ' 1 ' in line:
            count += 1
            indexes.append(idx)
Additionally, a bare except is almost always a bad idea (at the very least use except Exception to leave out SystemExit- and KeyboardInterrupt-like exceptions); catch only the exceptions you know might be raised.
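For instance, here is the question's loop with the bare except narrowed to the one exception the [0] index can actually raise:
import re

foundlines = []
c = 0
with open('examplefile') as f:
    for index, line in enumerate(f):
        try:
            found = re.findall(r' 1 ', line)[0]
        except IndexError:
            # findall returned an empty list: no ' 1 ' in this line
            continue
        foundlines.append(index)
        c += 1
print(c)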
Also, when parsing structured data you should use a specific tool, e.g. csv.reader with a space as the separator (line.split(' ') would do in this case as well), and checking against index 4 is safest (see Tomalak's answer). With the ' 1 ' in line test, there would be misleading results if any other column contained ' 1 '.
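A plain-Python version of that index-4 check might look like this (a sketch; the length guard skips blank or short lines, and the '#' header fails the comparison anyway):
count = 0
with open('examplefile') as f:
    for line in f:
        fields = line.split()
        # Only count rows whose fifth column is exactly '1'
        if len(fields) > 4 and fields[4] == '1':
            count += 1
print(count)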
Considering the above, here's the shell way using awk to match against the 5-th field:
awk '$5 == "1" {count+=1}; END{print count}' examplefile
Shortest code
This is a very short version under some specific preconditions:
You just want to count occurrences like your grep invocation
There is guaranteed to be only one " 1 " per line
That " 1 " can only occur in the desired column
Your file fits easily into memory
Note that if these preconditions are not met, this may cause issues with memory or return false positives.
print(open("examplefile").read().count(" 1 "))
Easy and versatile, slightly longer
Of course, if you're interested in actually doing something with these lines later on, I recommend Pandas:
import pandas

df = pandas.read_table('test.txt', delimiter=" ",
                       comment="#",
                       names=['BJD', 'K2SC-Flux', 'EAPFlux', 'Err', 'Flag', 'Spline'])
To get all the rows where Flag is 1:
flaggedrows = df[df.Flag == 1]
returns:
BJD K2SC-Flux EAPFlux Err Flag Spline
1 2.457217e+06 6195.018 6193.685 6.781 1 0.998291
4 2.457218e+06 5891.856 5890.523 6.766 1 0.998292
5 2.457218e+06 5581.000 5579.667 6.749 1 0.998292
6 2.457218e+06 5230.566 5229.232 6.733 1 0.998292
To count them:
print(len(flaggedrows))
returns 4
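As a small aside on the same DataFrame: since Flag only takes the values 0 and 1, summing the column gives the same count:
print(df.Flag.sum())  # 4 for the sample data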

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the ones below, which are stored in a file. The aim is to get any column for any row; the rows need not be on a single line. For example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns; column 4 for entity 2 would be "column\"this is, a test\"4b", and column 2 for entity 3 would be "column2c". Each column begins and ends with a quote, but you must be careful because some columns contain escaped quotes. Thanks in advance!
You could do it like this:
Read the whole file.
Split the input at each newline character that is not preceded by a comma.
Iterate over the split elements and split each again on the comma (plus an optional following newline character) that is preceded and followed by double quotes.
Code:
import re

with open(file) as f:
    fil = f.read()
    m = re.split(r'(?<!,)\n', fil.strip())
    for i in m:
        print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check:
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name, and f.py is the file that contains the Python script.
Your problem is remarkably similar to one I have to deal with three times every month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''

import re

# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')

# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print; if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print matches
        else:
            # use line.strip() if you need to trim whitespaces
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespaces
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete, so print it.
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print matches
            buffer = ""
        # Optional; always good to have a safety backdoor though.
        # If there is a problem with the csv itself, like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print "Error: cannot parse line:\n" + buffer
            buffer = ""

Python perform actions only at certain locations in text file

I have a text file which contains data like this:
AA 331
line1 ...
line2 ...
% information here
AA 332
line1 ...
line2 ...
line3 ...
%information here
AA 1021
line1 ...
line2 ...
% information here
AA 1022
line1 ...
% information here
AA 1023
line1 ...
line2 ...
% information here
I want to perform an action only for the information that comes after the smallest integer in each group, i.e., after the lines "AA 331" and "AA 1021", and not after the lines "AA 332", "AA 1022", and "AA 1023".
P.S. This is just sample data from a large file.
In the code below I parse the text file and collect the integers that come after "AA" in a list, list1; in the second function I group them and take the minimal value of each group into list2. This returns integers like [331, 1021, ...]. So I thought of extracting the lines that come after "AA 331" and performing the action, but I don't know how to proceed.
from itertools import groupby

def getlineindex(textfile):
    with open(textfile) as infile:
        list1 = []
        for line in infile:
            if line.startswith("AA"):
                intid = int(line[3:])
                list1.append(intid)
        return list1

def minimalinteger(list1):
    list2 = []
    for k, v in groupby(list1, key=lambda x: x // 10):
        minimalint = min(v)
        list2.append(minimalint)
    return list2
list2 contains the smallest integers that come after "AA": [331, 1021, ...]
You can use something like:
import re

matcher = re.compile(r"AA (\d+)")
already_was = []
good_block = False

with open(filename) as f:
    for line in f:
        m = matcher.match(line)
        if m:
            v = int(m.group(1)) / 10
        else:
            v = None
        if m and v not in already_was:
            good_block = True
            already_was.append(v)
        elif m and v in already_was:
            good_block = False
        if not m and good_block:
            do_action()
This code works only if the first value in each group is the minimal one.
Okay, here's my solution. At a high level, I go line by line, watching for AA lines to know when I've found the start/end of a data block, and watch what I call the run number to know whether or not we should process the next block. Then, I have a subroutine that handles any given block, basically reading off all relevant lines and processing them if needed. That subroutine is what watches for the next AA line in order to know when it's done.
import re

runIdRegex = re.compile(r'AA (\d+)')

def processFile(fileHandle):
    lastNumber = None  # Last run number, so we know if there's been a gap or if we're in a new block of ten.
    line = fileHandle.next()
    while line is not None:  # None is being used as a special value indicating we've hit the end of the file.
        processData = False
        match = runIdRegex.match(line)
        if match:
            runNumber = int(match.group(1))
            if lastNumber == None:
                # Startup/first iteration
                processData = True
            elif runNumber - lastNumber == 1:
                # Continuation, see if the tens are the same.
                lastNumberTens = lastNumber / 10
                runNumberTens = runNumber / 10
                if lastNumberTens != runNumberTens:
                    processData = True
            else:
                processData = True
            # Always remember where we were.
            lastNumber = runNumber
            # And grab and process data.
            line = dataBlock(fileHandle, process=processData)
        else:
            try:
                line = fileHandle.next()
            except StopIteration:
                line = None

def dataBlock(fileHandle, process=False):
    runData = []
    try:
        line = fileHandle.next()
        match = runIdRegex.match(line)
        while not match:
            runData.append(line)
            line = fileHandle.next()
            match = runIdRegex.match(line)
    except StopIteration:
        # Hit end of file
        line = None
    if process:
        # Data processing call here
        # processData(runData)
        pass
    # Return line so we don't lose it!
    return line
Some notes for you. First, I'm in agreement with Jimilian that you should use a regular expression to match AA lines.
Second, the logic we talked about with regard to when we should process data is in processFile. Specifically these lines:
processData = False
match = runIdRegex.match(line)
if match:
    runNumber = int(match.group(1))
    if lastNumber == None:
        # Startup/first iteration
        processData = True
    elif runNumber - lastNumber == 1:
        # Continuation, see if the tens are the same.
        lastNumberTens = lastNumber / 10
        runNumberTens = runNumber / 10
        if lastNumberTens != runNumberTens:
            processData = True
    else:
        processData = True
I assume we don't want to process data, then identify when we do. Logically speaking, you can do the inverse of this and assume you want to process data, then identify when you don't. Next, we need to store the last run's value in order to know whether or not we need to process this run's data. (and watch out for that first run edge case) We know we want to process data when the sequence is broken (the difference between two runs is greater than 1), which is handled by the else statement. We also know that we want to process data when the sequence increments the digit in the tens place, which is handled by my integer divide by 10.
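To illustrate the tens-place test with a couple of hypothetical run numbers (Python 2 integer division):
>>> 331 / 10, 332 / 10   # same block of ten: no new processing triggered
(33, 33)
>>> 339 / 10, 340 / 10   # tens digit changes: the new block is processed
(33, 34)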
Third, watch out for the return value from dataBlock. If you don't capture it, you will lose the AA line that caused dataBlock to stop iterating, and processFile needs that line to know whether the next data block should be processed.
Last, I've opted to use fileHandle.next() and exception handling to identify when I get to the end of the file. But don't think this is the only way. :)
Let me know in comments if you have any questions.

Pattern searching in text

I have a text file, seq.txt, as follows:
>S1
AACAAGAAGAAAGCCCGCCCGGAAGCAGCTCAATCAGGAGGCTGGGCTGGAATGACAGCG
CAGCGGGGCCTGAAACTATTTATATCCCAAAGCTCCTCTCAGATAAACACAAATGACTGC
GTTCTGCCTGCACTCGGGCTATTGCGAGGACAGAGAGCTGGTGCTCCATTGGCGTGAAGT
CTCCAGGGCCAGAAGGGGCCTTTGTCGCTTCCTCACAAGGCACAAGTTCCCCTTCTGCTT
CCCCGAGAAAGGTTTGGTAGGGGTGGTGGTTTAGTGCCTATAGAACAAGGCATTTCGCTT
CCTAGACGGTGAAATGAAAGGGAAAAAAAGGACACCTAATCTCCTACAAATGGTCTTTAG
TAAAGGAACCGTGTCTAAGCGCTAAGAACTGCGCAAAGTATAAATTATCAGCCGGAACGA
GCAAACAGACGGAGTTTTAAAAGATAAATACGCATTTTTTTCCGCCGTAGCTCCCAGGCC
AGCATTCCTGTGGGAAGCAAGTGGAAACCCTATAGCGCTCTCGCAGTTAGGAAGGAGGGG
TGGGGCTGTCCCTGGATTTCTTCTCGGTCTCTGCAGAGACAATCCAGAGGGAGACAGTGG
ATTCACTGCCCCCAATGCTTCTAAAACGGGGAGACAAAACAAAAAAAAACAAACTTCGGG
TTACCATCGGGGAACAGGACCGACGCCCAGGGCCACCAGCCCAGATCAAACAGCCCGCGT
CTCGGCGCTGCGGCTCAGCCCGACACACTCCCGCGCAAGCGCAGCCGCCCCCCCGCCCCG
GGGGCCCGCTGACTACCCCACACAGCCTCCGCCGCGCCCTCGGCGGGCTCAGGTGGCTGC
GACGCGCTCCGGCCCAGGTGGCGGCCGGCCGCCCAGCCTCCCCGCCTGCTGGCGGGAGAA
ACCATCTCCTCTGGCGGGGGTAGGGGCGGAGCTGGCGTCCGCCCACACCGGAAGAGGAAG
TCTAAGCGCCGGAAGTGGTGGGCATTCTGGGTAACGAGCTATTTACTTCCTGCGGGTGCA
CAGGCTGTGGTCGTCTATCTCCCTGTTGTTC
>S2
ACACGCATTCACTAAACATATTTACTATGTGCCAGGCACTGTTCTCAGTGCTGGGGATAT
AGCAGTGAAGAAACAGAAACCCTTGCACTCACTGAGCTCATATCTTAGGGTGAGAAACAG
TTATTAAGCAAGATCAGGATGGAAAACAGATGGTACGGTAGTGTGAAATGCTAAAGAGAA
AAATAACTACGGAAAAGGGATAGGAAGTGTGTGTATCGCAGTTGACTTATTTGTTCGCGT
TGTTTACCTGCGTTCTGTCTGCATCTCCCACTAAACTGTAAGCTCTACATCTCCCATCTG
TCTTATTTACCAATGCCAACCGGGGCTCAGCGCAGCGCCTGACACACAGCAGGCAGCTGA
CAGACAGGTGTTGAGCAAGGAGCAAAGGCGCATCTTCATTGCTCTGTCCTTGCTTCTAGG
AGGCGAATTGGGAAATCCAGAGGGAAAGGAAAAGCGAGGAAAGTGGCTCGCTTTTGGCGC
TGGGGAAGAGGTGTACAGTGAGCAGTCACGCTCAGAGCTGGCTTGGGGGACACTCTCACG
CTCAGGAGAGGGACAGAGCGACAGAGGCGCTCGCAGCAGCGCGCTGTACAGGTGCAACAG
CTTAGGCATTTCTATCCCTATTTTTACAGCGAGGGACACTGGGCCTCAGAAAGGGAAGTG
CCTTCCCAAGCTCCAACTGCTCATAAGCAGTCAACCTTGTCTAAGTCCAGGTCTGAAGTC
CTGGAGCGATTCTCCACCCACCACGACCACTCACCTACTCGCCTGCGCTTCACCTCACGT
GAGGATTTTCCAGGTTCCTCCCAGTCTCTGGGTAGGCGGGGAGCGCTTAGCAGGTATCAC
CTATAAGAAAATGAGAATGGGTTGGGGGCCGGTGCAAGACAAGAATATCCTGACTGTGAT
TGGTTGAATTGGCTGCCATTCCCAAAACGAGCTTTGGCGCCCGGTCTCATTCGTTCCCAG
CAGGCCCTGCGCGCGGCAACATGGCGGGGTCCAGGTGGAGGTCTTGAGGCTATCAGATCG
GTATGGCATTGGCGTCCGGGCCCGCAAGGCG
.
.
.
.
I have to count occurrences of a pattern in these sequences. To do that, I wrote this Python script:
import re
infile = open("seq.txt", 'r')
out = open("pat.txt", 'w')
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
for line in infile:
    line = line.strip("\n")
    if line.startswith('>'):
        name = line
    else:
        s = re.findall(pattern, line)
        print '%s:%s' % (name, s)
        out.write('%s:\t%s\n' % (name, len(s)))
But it gives the wrong result, because the script counts matches line by line:
S1 : 0
S1 : 0
S1 : 0
S1 : 0
S2 : 0
S2 : 1
S2 : 0
S2 : 1
But I want output as follows:
S1 : 0
S2 : 2
Can anybody help?
Use a hit counter, zero it if line.startswith('>'). Increment by len(s) otherwise.
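A minimal sketch of that suggestion, reusing the file names and pattern from the question; the counter is written out whenever the next header appears, and once more after the loop for the final record:
import re

pattern = re.compile("GAAAT", flags=re.IGNORECASE)
name = None
hits = 0
with open('seq.txt') as infile, open('pat.txt', 'w') as out:
    for line in infile:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:          # flush the previous record
                out.write('%s:\t%s\n' % (name, hits))
            name = line.lstrip('>')
            hits = 0
        else:
            hits += len(pattern.findall(line))
    if name is not None:                  # flush the final record
        out.write('%s:\t%s\n' % (name, hits))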
This code might be helpful for you:
import re

pattern = re.compile("GAAAT", flags=re.IGNORECASE)

with open('seq.txt') as f:
    sections = f.read().split('\n\n')
    for section in sections:
        lines = section.split()
        name = lines[0].lstrip('>')
        data = ''.join(lines[1:])
        print '{0}: {1}'.format(name, len(pattern.findall(data)))
Example output:
S1: 1
S2: 2
Notes:
It's assumed that two newline characters are used to separate every section as in the example.
It's assumed that every section name is preceded by a greater than (>) character as in the example.
If you already have a pattern, use pattern.findall(data) instead of re.findall(pattern, data)
You should gather input until you reach the next sequence header. That would also solve the corner case of your pattern crossing a line boundary (not sure if that can happen with your data, but it looks like it could).
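A sketch of that gather-then-count idea, reusing the pattern from the question; joining each record's lines before matching means occurrences that straddle a line break are counted too:
import re

pattern = re.compile("GAAAT", flags=re.IGNORECASE)

def flush(name, chunks):
    # Join the gathered lines so matches spanning a line break count too
    if name is not None:
        print '%s : %s' % (name, len(pattern.findall(''.join(chunks))))

name, chunks = None, []
with open('seq.txt') as f:
    for line in f:
        line = line.strip()
        if line.startswith('>'):
            flush(name, chunks)        # previous record is complete
            name, chunks = line.lstrip('>'), []
        else:
            chunks.append(line)
flush(name, chunks)                    # don't forget the last record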
Use a counter. Also, note that in your script the print call is inside the for loop, so it runs once per line, as many times as the else branch does, instead of once per sequence. Note that it's also not a good idea to use the variable line both as the iteration variable of the for loop and as another variable; it makes the code more confusing.
counter_dict = {}
for line in infile:
    if line[0] == '>':
        name = line[1:].strip()  # drop the '>' and the trailing newline
        counter_dict[name] = 0
    else:
        counter_dict[name] += len(re.findall(pattern, line))
for (key, val) in counter_dict.items():
    print '%s:%s' % (key, val)
    out.write('%s:\t%s\n' % (key, val))
