Extract values under a particular section occurring multiple times in a file - python

Stack Overflow has always been helpful, and I need help yet again.
In this file.txt, I have a "Page Statistics" section. This section appears 2 times in the file.
I have to extract the values under "DelAck" for SSD and FC. NOTE: these values are for Node 0 and Node 1.
I would like to extract the values and put them in a list for Node 0 and Node 1.
Somewhat like this below:
Result:
SSD_0 = [115200, 115200] // DelAck values for Node 0
SSD_1 = [115200, 115200] // DelAck values for Node 1
FC_0 = [0, 0] // DelAck values for Node 0
FC_1 = [0, 0] // DelAck values for Node 1
Here is the file.txt which has the data to be extracted. The Page Statistics section appears multiple times; I have it here 2 times. I need to extract the values for SSD and FC, like I mentioned earlier, for Node 0 and Node 1 separately, as shown above.
Hope I have explained my situation in detail.
***************************
file.txt
22:26:35 04/29/2021 ----- Current ----- ---------- Total ----------
Node Type Accesses Hits Hit% Accesses Hits Hit% LockBlk
0 Read 75630 39845 53 149728 79438 53 0
0 Write 47418 19709 42 93184 38230 41 0
1 Read 74076 38698 52 145810 75445 52 0
1 Write 42525 16099 38 84975 31751 37 0
Page Statistics
--CfcDirty-- ----CfcMax----- -DelAck--
Node FC NL SSD FC NL SSD FC NL SSD
0 250 0 14843 19200 0 115200 0 0 0
1 284 0 15618 19200 0 115200 0 0 0
Press the enter key to stop...
22:26:33 04/29/2021 ----- Current ----- ---------- Total ----------
Node Type Accesses Hits Hit% Accesses Hits Hit% LockBlk
0 Read 74098 39593 53 74098 39593 53 0
0 Write 45766 18521 40 45766 18521 40 0
1 Read 71734 36747 51 71734 36747 51 0
1 Write 42450 15652 37 42450 15652 37 0
Page Statistics
--CfcDirty-- ----CfcMax----- -DelAck--
Node FC NL SSD FC NL SSD FC NL SSD
0 258 0 13846 19200 0 115200 0 0 0
1 141 0 13356 19200 0 115200 0 0 0
Press the enter key to stop...
***************************
Then I would work on getting the average of the values in the list and print to user.
The challenging part is extracting the data under Page Statistics for the SSD and FC drives.
Any help would be greatly appreciated.
Thanks!

Since the format of this file is quite unique, you will have to parse it by hand. You can do this with a simple program such as this:
## Function to get the values from these blank-spaced tables
def values(line):
    while '  ' in line:
        line = line.replace('  ', ' ')
    line = line[:-1]
    if line[0] == ' ': line = line[1:]
    if line[-1] == ' ': line = line[:-1]
    return line.split(' ')
## Load the file
inf = open("file.txt","r")
lines = inf.readlines()
inf.close()
## Get the relevant tables
tables = []
for i in range(len(lines)):
    if 'Page Statistics' in lines[i]:
        tables.append(lines[i+2:i+5])
## Initiate the results dictionary
result = {"Node 0":{}, "Node 1":{}}
for c in ['CfcDirty','CfcMax','DelAck']:
    for n in result.keys():
        result[n][c] = {}
        for value in ['FC','NL','SSD']:
            result[n][c][value] = []
## Parse the tables and fill up the results
for t in tables:
    vnames = values(t[0])
    node0 = values(t[1])
    node1 = values(t[2])
    i = 0
    for c in ['CfcDirty','CfcMax','DelAck']:
        for _ in range(3):
            i += 1
            result['Node 0'][c][vnames[i]].append(node0[i])
            result['Node 1'][c][vnames[i]].append(node1[i])
The result will be a nested dictionary keyed by node, column group (CfcDirty/CfcMax/DelAck), and drive type (FC/NL/SSD). So, you can easily get any value from these tables:
>> print(result['Node 0']['DelAck']['SSD'])
['0', '0']
>> print(result['Node 1']['CfcMax']['SSD'])
['115200', '115200']
Now, you can compose any number of new variables that contain some values from these tables.
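For example, to build the lists from your question and get the averages you mentioned, a minimal sketch (converting the stored strings to numbers first) could be:
## Build the per-node lists and compute their averages
SSD_0 = [int(v) for v in result['Node 0']['DelAck']['SSD']]
SSD_1 = [int(v) for v in result['Node 1']['DelAck']['SSD']]
FC_0 = [int(v) for v in result['Node 0']['DelAck']['FC']]
FC_1 = [int(v) for v in result['Node 1']['DelAck']['FC']]
for name, vals in [('SSD_0', SSD_0), ('SSD_1', SSD_1), ('FC_0', FC_0), ('FC_1', FC_1)]:
    avg = sum(vals) / len(vals) if vals else 0  # guard against an empty list
    print(name, vals, 'average:', avg)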
(BTW, I don't understand how you get the values for your example: SSD_0 = [115200, 115200] // DelAck values for Node 0. Node 0 always has the value of 0 for SSD in DelAck.)

Related

How to get the index of sorted timestamps?

I have a text file that contains the following:
n 1 id 10 12:17:32 type 6 is transitioning
n 2 id 10 12:16:12 type 5 is active
n 2 id 10 12:18:45 type 6 is transitioning
n 3 id 10 12:16:06 type 6 is transitioning
n 3 id 10 12:17:02 type 6 is transitioning
...
I need to sort these lines in Python by the timestamp. I can read line by line, collect all the timestamps, and sort them using sorted(timestamps), but then I need to rearrange the lines according to the sorted timestamps.
How to get the index of sorted timestamps?
Is there some more elegant solution (I'm sure there is)?
import time
nID = []
mID = []
ts = []
ntype = []
comm = []
with open('changes.txt') as fp:
    while True:
        line = fp.readline()
        if not line:
            break
        lx = line.split(' ')
        nID.append(lx[1])
        mID.append(lx[3])
        ts.append(lx[4])
        ntype.append(lx[6])
        comm.append(lx[7:])
So, now I can use sorted(ts) to sort the timestamp, but I don't get the index of sorted timestamp values.
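For reference, one more direct approach (a sketch that assumes the timestamp is always the fifth whitespace-separated field, as in the sample) is to sort the lines themselves with a key function, or to get the indices of the sorted timestamps:
## Sort the lines directly by the timestamp field (index 4 after splitting on spaces)
with open('changes.txt') as fp:
    lines = fp.readlines()
sorted_lines = sorted(lines, key=lambda line: line.split(' ')[4])
## Or, if only the indices of the sorted timestamps are needed:
order = sorted(range(len(ts)), key=lambda i: ts[i])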

Transform line by line text file into tab delimited format by checking header by python

I have a large text file of experimental data like
spectrum:
index: 1
mz: 4544.5445
intensity: 57875100000
type: 1
something: skip
params - m1
binary: [4] 1 2 3 4
params - int1
binary: [4] 11 22 33 44
spectrum:
index: 2
mz: 546.7777
intensity: 210009
type: 2
params - m2
binary: [4] 2 3 4 5
params - int2
binary: [4] 55 44 33 22
charge: 3
others: no need to put into column
spectrum:
index: 3
I want to print it out as a CSV file, with the information from each spectrum put in the same row under its header. If a spectrum doesn't have the information for that header, just skip it (or put NA in). If it has more than one value, print each on the next line.
Is there some easy ways by python to get the result like this?
You want something like this:
PSEUDOCODE
class Spectrum():
    def add(self, text):
        column, value = text.split(' ')
        if column == 'index:'
            self._csv['index'] = int(value)
        elif column == 'mz:'
            self._csv['mz'] = float(value)
        ... and so on

spectrum = Spectrum()
with text file as in_file
    for line in in_file
        if line == 'spectrum:'
            if in_spectrum
                spectrum.expand_to_csv()
                spectrum = Spectrum()
            in_spectrum = True
            continue
        spectrum.add(line)
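Fleshing that idea out, a minimal runnable sketch might look like the following. It only keeps the fields visible in the sample (index, mz, intensity, type, charge), writes one tab-delimited row per spectrum, and the input/output filenames are assumptions:
import csv

FIELDS = ['index', 'mz', 'intensity', 'type', 'charge']  # headers to keep; everything else is skipped

def parse_spectra(path):
    spectra = []
    current = None
    with open(path) as in_file:
        for raw in in_file:
            line = raw.strip()
            if line == 'spectrum:':          # a new spectrum starts; flush the previous one
                if current is not None:
                    spectra.append(current)
                current = {}
                continue
            if current is None or ':' not in line:
                continue
            key, _, value = line.partition(':')
            if key.strip() in FIELDS:
                current[key.strip()] = value.strip()
    if current is not None:
        spectra.append(current)
    return spectra

spectra = parse_spectra('spectra.txt')       # hypothetical input filename
with open('spectra.tsv', 'w', newline='') as out_file:
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, restval='NA', delimiter='\t')
    writer.writeheader()
    writer.writerows(spectra)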

Sort list of strings with data in the middle of them

So I have a list of strings which correspond to Kibana indices; the strings look like this:
λ curl '10.10.43.210:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open filebeat-2016.10.08 5 1 899 0 913.8kb 913.8kb
yellow open filebeat-2016.10.12 5 1 902 0 763.9kb 763.9kb
yellow open filebeat-2016.10.13 5 1 816 0 588.9kb 588.9kb
yellow open filebeat-2016.10.10 5 1 926 0 684.1kb 684.1kb
yellow open filebeat-2016.10.11 5 1 876 0 615.2kb 615.2kb
yellow open filebeat-2016.10.09 5 1 745 0 610.7kb 610.7kb
The dates are coming back unsorted. I want to sort these by the index name (which contains a date), filebeat-2016.10.xx; ASC or DESC is fine.
As it stands now I isolate the strings like this:
subp = subprocess.Popen(['curl','-XGET' ,'-H', '"Content-Type: application/json"', '10.10.43.210:9200/_cat/indices?v'], stdout=subproce$
curlstdout, curlstderr = subp.communicate()
op = str(curlstdout)
kibanaIndices = op.splitlines()
for index,elem in enumerate(kibanaIndices):
if "kibana" not in kibanaIndices[index]:
print kibanaIndices[index]+"\n"
kibanaIndexList.append(kibanaIndices[index])
But can't sort them in a meaningful way.
Is this what you need?
lines = """yellow open filebeat-2016.10.08 5 1 899 0 913.8kb 913.8kb
yellow open filebeat-2016.10.12 5 1 902 0 763.9kb 763.9kb
yellow open filebeat-2016.10.13 5 1 816 0 588.9kb 588.9kb
yellow open filebeat-2016.10.10 5 1 926 0 684.1kb 684.1kb
yellow open filebeat-2016.10.11 5 1 876 0 615.2kb 615.2kb
yellow open filebeat-2016.10.09 5 1 745 0 610.7kb 610.7kb
""".splitlines()
def extract_date(line):
    return line.split()[2]
lines.sort(key=extract_date)
print("\n".join(lines))
Here extract_date is a function that returns the third column (like filebeat-2016.10.12). We pass this function as the key argument to sort so that this value is used as the sort key. Your date format can be sorted as plain strings. You could use a more sophisticated extract_date function to extract only the date.
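If you ever need the date as an actual date object rather than a string, a small sketch of such a key function (assuming the filebeat-YYYY.MM.DD naming shown above) could be:
from datetime import datetime

def extract_date(line):
    index_name = line.split()[2]              # e.g. "filebeat-2016.10.08"
    date_part = index_name.split('-', 1)[1]   # keep only the date portion
    return datetime.strptime(date_part, '%Y.%m.%d')

lines.sort(key=extract_date)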
I copied your sample data into a text file as UTF-8 since I don't have access to the server you referenced. Using list comprehensions and string methods you can clean the data, then break it down into its component parts. Sorting is accomplished by passing a lambda function as the key argument to the built-in sorted() function:
# read text data into list one line at a time
result_lines = open('kibana_data.txt').readlines()
# remove trailing newline characters
clean_lines = [line.replace("\n", "") for line in result_lines]
# get first line of file
info = clean_lines[0]
# create list of field names
header = [val.replace(" ", "")
          for val in clean_lines[1].split()]
# create list of lists for data rows
data = [[val.replace(" ", "") for val in line.split()]
        for line in clean_lines[2:]]
# sort data rows by date (third row item at index 2)
final = sorted(data, key=lambda row: row[2], reverse=False)
More info on list comprehensions here: https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
More info on sorting here: https://wiki.python.org/moin/HowTo/Sorting

Getting errors when converting a text file to VCF format

I have a python code where I am trying to convert a text file containing variant information in the rows to a variant call format file (vcf) for my downstream analysis.
I am getting everything correct, but when I try to run the code I miss the first two entries, I mean the first two rows. The code is below; the line which is not reading the entire file is highlighted. I would like some expert advice.
I just started coding in Python, so I am not entirely well versed in it.
##fileformat=VCFv4.0
##fileDate=20140901
##source=dbSNP
##dbSNP_BUILD_ID=137
##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
import sys
text=open(sys.argv[1]).readlines()
print text
print "First print"
text=filter(lambda x:x.split('\t')[31].strip()=='KEEP',text[2:])
print text
print "################################################"
text=map(lambda x:x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n',text)
print text
file=open(sys.argv[1].replace('.txt','.vcf'),'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
    file.close()
INPUT:
chrM 152 T C T_S7998 N_S8980 0 DBSNP COVERED 1 1 1 282 36 0 163.60287 0.214008 0.02 11.666081 202 55 7221 1953 0 0 TT 14.748595 49 0 1786 0 KEEP
chr9 311 T C T_S7998 N_S8980 0 NOVEL COVERED 0.993882 0.999919 0.993962 299 0 0 207.697923 1 0.02 1.854431 0 56 0 1810 1 116 CC -44.649001 0 12 0 390 KEEP
chr13 440 C T T_S7998 N_S8980 0 NOVEL COVERED 1 1 1 503 7 0 4.130339 0.006696 0.02 4.124606 445 3 16048 135 0 0 CC 12.942762 40 0 1684 0 KEEP
OUTPUT desired:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 152 . T C . PASS .
chr9 311 . T C . PASS .
chr13 440 . C T . PASS .
OUTPUT obtained:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chr13 440 . C T . PASS .
I would like to have some help regarding how this error can be rectified.
There are a couple of issues with your code:
In the filter function you are passing text[2:]. I think you want to pass text to get all the rows.
In the last loop where you write to the .vcf file, you are closing the file inside the loop. You should first write all the values and then close the file outside the loop.
So your code will look like (I removed all the prints):
import sys
text=open(sys.argv[1]).readlines()
text=filter(lambda x:x.split('\t')[31].strip()=='KEEP',text) # Pass text
text=map(lambda x:x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n',text)
file=open(sys.argv[1].replace('.txt','.vcf'),'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
file.close() # close after writing all the values, in the end

How do I match a specific number against a number set efficiently?

I have a number set which contains 2375013 unique numbers in a txt file. The data structure looks like this:
11009
900221
2
3
4930568
293
102
I want to match a number in a line from another data set against the number set to extract the data I need. So I coded this:
def get_US_users_IDs(filepath, mode):
    IDs = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.strip()
            for id in sp:
                IDs.append(id.lower())
    return IDs

IDs = "|".join(get_US_users_IDs('/nas/USAuserlist.txt', 'r'))
matcher = re.compile(IDs)
if matcher.match(user_id):
    number_of_US_user += 1
    text = tweet.split('\t')[3]
But it takes a lot of time to run. Is there any way to reduce the run time?
What I understood is that you have a huge number of ids in a file and you want to know if a specific user_id is in this file.
You can use a python set.
fd = open(filepath, mode)
IDs = set(int(id) for id in fd)
...
if user_id in IDs:
    number_of_US_user += 1
    ...
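Put together with the question's code, a minimal sketch might look like the one below (the file path and counter are from the question; keeping the IDs as strings avoids converting user_id, and the position of the user id in each tweet line is an assumption):
## Load the ID set once; membership tests are then O(1) on average
with open('/nas/USAuserlist.txt', 'r') as f:
    IDs = set(line.strip() for line in f)

number_of_US_user = 0
for tweet in tweets:                   # 'tweets' stands in for the other data source
    user_id = tweet.split('\t')[0]     # assumed: the user id is the first tab-separated field
    if user_id in IDs:
        number_of_US_user += 1
        text = tweet.split('\t')[3]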
