Need modification to regex to work in other situations - python

Just discovered that the structure of my file could be different and my regex only works sometimes because of this change. My regex is
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\rACTIVITY.+?',wholefile)
It currently matches the following section of the file.
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
ACTIVITY?
PDEV
ENTER OUTPUT DEVICE CODE:
0 FOR NO OUTPUT
1 FOR PROGRESS WINDOW
However, that section of the file is sometimes as below:
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.742 13.2060 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.916 1.8367 11 EL PASO 110
70187 [FTGARLND69.0] 0.936 19.6099 70 PSCOLORADO 710
73216 [WINDRIVR 115] 0.858 3.6100 73 WAPA R.M. 750
(VFSCAN) AT TIME = 20.0000 UP TO 100 BUSES WITH LOW FREQUENCY BELOW 59.600:
X ----- BUS ------ X FREQ X ----- BUS ------ X FREQ
12063 [ROSEBUD 13.8] 59.506
On both occasions I would like to capture just the section below:
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
How can my regex return the section above regardless of which version of the file I am looking at?

This should work; note the group is non-capturing (`?:`), because with a plain capturing group `re.findall` would return only the group's text rather than the whole matched section:
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\r(?:ACTIVITY|\(VFSCAN\)).+?',wholefile)
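A quick check of the combined pattern against two made-up stand-ins for the file (the real file presumably uses `\r` before the terminator, as the original regex assumes):

```python
import re

# Minimal made-up stand-ins for the two file variants.
file_a = ("---------- LOW VOLTAGE SUMMARY BY AREA ----------\r"
          "  12006 [AMISTAD 69.0] 0.971  1.8700  10 NEW MEXICO  121\r"
          "ACTIVITY?\r")
file_b = ("---------- LOW VOLTAGE SUMMARY BY AREA ----------\r"
          "  12006 [AMISTAD 69.0] 0.742 13.2060  10 NEW MEXICO  121\r"
          "(VFSCAN) AT TIME =   20.0000\r")

# (?:...) is non-capturing, so findall returns the full match.
pattern = r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\r(?:ACTIVITY|\(VFSCAN\))'

for wholefile in (file_a, file_b):
    matches = re.findall(pattern, wholefile)
    print(len(matches))  # 1 for both variants
```

If the terminator word itself should not be part of the captured text, a lookahead (`(?=ACTIVITY|\(VFSCAN\))`) would exclude it from the match.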

I would not suggest using a regex, but doing some parsing instead. Let's assume your data is in a string called data:
lines = data.split("\n")
# find start of header
for index, line in enumerate(lines):
    if "LOW VOLTAGE SUMMARY BY AREA" in line:
        start_index = index
        break
# find first data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[start_index:]):
    if line.strip() and line.split()[0].isdigit():
        first_entry_index = start_index + index
        break
# find last data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[first_entry_index:]):
    # we don't do this inside the if because it's possible
    # to end the data with only entries and whitespace
    end_entry_index = first_entry_index + index
    if line.strip() and not line.split()[0].isdigit():
        break
# print all lines between header and last data entry
print("\n".join(lines[start_index:end_entry_index]))
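The same scan can also be written with `itertools` (a sketch with the sample inlined as `data`; note it assumes the column line immediately follows the header and stops at the first non-data line, and whether it beats the explicit loops is a matter of taste):

```python
from itertools import dropwhile, takewhile

# The OP's first sample, abridged and inlined for illustration.
data = """PDEV
---------- LOW VOLTAGE SUMMARY BY AREA ----------
 BUS  NAME       BASKV VOLT   TIME    AREA        ZONE
 12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
 11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
ACTIVITY?
PDEV
"""

section = dropwhile(lambda l: "LOW VOLTAGE SUMMARY BY AREA" not in l,
                    data.split("\n"))
header = next(section)   # the ---------- header line
columns = next(section)  # the BUS NAME ... column line
# keep rows while they start with a number; stop at ACTIVITY?
rows = takewhile(lambda l: l.strip() and l.split()[0].isdigit(), section)
result = "\n".join([header, columns, *rows])
print(result)
```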

Extract values under a particular section occurring multiple times in a file

Stack Overflow is always helpful, and I need help yet again.
In this file.txt, I have a "Page Statistics" section, which appears 2 times in the file.
I have to extract the values under "DelAck" for SSD and FC. NOTE: These values are for Node 0 and Node 1
I would like to extract the values and put it in a list for Node 0 and Node 1.
Somewhat like this below:
Result:
SSD_0 = [115200, 115200] // DelAck values for Node 0
SSD_1 = [115200, 115200] // DelAck values for Node 1
FC_0 = [0, 0] // DelAck values for Node 0
FC_1 = [0, 0] // DelAck values for Node 1
Here is the file.txt with the data to be extracted. The Page Statistics section appears multiple times; I have included it twice here. I need to extract values for SSD and FC, as mentioned earlier, for Node 0 and Node 1 separately, as shown above.
Hope I have explained my situation in detail.
***************************
file.txt
22:26:35 04/29/2021 ----- Current ----- ---------- Total ----------
Node Type Accesses Hits Hit% Accesses Hits Hit% LockBlk
0 Read 75630 39845 53 149728 79438 53 0
0 Write 47418 19709 42 93184 38230 41 0
1 Read 74076 38698 52 145810 75445 52 0
1 Write 42525 16099 38 84975 31751 37 0
Page Statistics
--CfcDirty-- ----CfcMax----- -DelAck--
Node FC NL SSD FC NL SSD FC NL SSD
0 250 0 14843 19200 0 115200 0 0 0
1 284 0 15618 19200 0 115200 0 0 0
Press the enter key to stop...
22:26:33 04/29/2021 ----- Current ----- ---------- Total ----------
Node Type Accesses Hits Hit% Accesses Hits Hit% LockBlk
0 Read 74098 39593 53 74098 39593 53 0
0 Write 45766 18521 40 45766 18521 40 0
1 Read 71734 36747 51 71734 36747 51 0
1 Write 42450 15652 37 42450 15652 37 0
Page Statistics
--CfcDirty-- ----CfcMax----- -DelAck--
Node FC NL SSD FC NL SSD FC NL SSD
0 258 0 13846 19200 0 115200 0 0 0
1 141 0 13356 19200 0 115200 0 0 0
Press the enter key to stop...
***************************
Then I would work on getting the average of the values in each list and print it to the user.
The challenging part is extracting the data under Page Statistics for the SSD and FC drives.
Any help would be much appreciated.
Thanks!
Since the format of this file is quite unique, you will have to parse it by hand. You can do this with a simple program such as this:
## Function to get the values from these blank-spaced tables
def values(line):
    # collapse runs of spaces to single spaces
    while '  ' in line:
        line = line.replace('  ', ' ')
    line = line[:-1]  # drop the trailing newline
    if line[0] == ' ': line = line[1:]
    if line[-1] == ' ': line = line[:-1]
    return line.split(' ')

## Load the file
inf = open("file.txt", "r")
lines = inf.readlines()
inf.close()

## Get the relevant tables
tables = []
for i in range(len(lines)):
    if 'Page Statistics' in lines[i]:
        tables.append(lines[i+2:i+5])

## Initiate the results dictionary
result = {"Node 0": {}, "Node 1": {}}
for c in ['CfcDirty', 'CfcMax', 'DelAck']:
    for n in result.keys():
        result[n][c] = {}
        for value in ['FC', 'NL', 'SSD']:
            result[n][c][value] = []

## Parse the tables and fill up the results
for t in tables:
    vnames = values(t[0])
    node0 = values(t[1])
    node1 = values(t[2])
    i = 0
    for c in ['CfcDirty', 'CfcMax', 'DelAck']:
        for _ in range(3):
            i += 1
            result['Node 0'][c][vnames[i]].append(node0[i])
            result['Node 1'][c][vnames[i]].append(node1[i])
The result will be a dictionary with node, column1, and column2 as keys. So, you can easily get any value from these tables:
>>> print(result['Node 0']['DelAck']['SSD'])
['0', '0']
>>> print(result['Node 1']['CfcMax']['SSD'])
['115200', '115200']
Now, you can compose any number of new variables that contain some values from these tables.
(BTW, I don't understand how you get the values for your example: SSD_0 = [115200, 115200] // DelAck values for Node 0. Node 0 always has the value of 0 for SSD in DelAck.)
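For the averaging step mentioned in the question, note that the parser stores the values as strings, so they need converting first. A minimal sketch with the list hard-coded for illustration (in the real script it would come from `result`):

```python
values = ['115200', '115200']  # e.g. result['Node 0']['CfcMax']['SSD']
average = sum(int(v) for v in values) / len(values)
print(average)  # 115200.0
```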

list index out of range(sometimes fails sometimes works)

I'm reading data from a file. Content: time-id-data.
When I run it on macOS it works well, but on Linux it sometimes works and sometimes fails.
The error is "IndexError: list index out of range".
data like this:
'
1554196690 0.0 178 180 180 178 178 178 180
1554196690 0.1 178 180 178 180 180 178 178
1554196690 0.2 175 171 178 173 173 178 172
1554196690 0.3 171 175 175 17b 179 177 17e
1554196691 0.4 0 d3
1554196691 0.50 28:10:4:92:a:0:0:d6 395
1554196691 0.58 28:a2:23:93:a:0:0:99 385
'
data = []
boardID = 100  # how many lines at most in datafile
for i in range(8):
    data.append([[] for x in range(8)])  # 5 boards, every board has 7 sensors plus 1 boardID
time_stamp = []
time_xlabel = []
time_second = []
for i in range(8):
    time_stamp.append([])   # 5th-line data is the input voltage and pressure
    time_xlabel.append([])  # for x label
    time_second.append([])  # time from timestamp to time (start time is 0)
with open("Potting_20190402-111807.txt", "r") as eboardfile:
    for line in eboardfile:
        values = line.strip().split("\t")
        boardID = int(round(float(values[1]) % 1 * 10))  # define board: 0-3 are the electron boards, board 4 the pressure sensor, board 5 a temperature sensor located inside the house, not on the eboard
        time_stamp[boardID].append(int(values[0]))
        if boardID >= 0 and boardID < 4:
            for i in range(2, 9):
                data[boardID][i-2].append(int(values[i], 16) * 0.0625)
        if boardID == 4:  # pressure
            data[boardID][0].append(int(values[2], 16) * 5./1024. * 14.2/2.2)  # voltage divider: 12k + 2.2k
            data[boardID][1].append((int(values[3], 16) * 5./1024. - 0.5) / 4. * 6.9 * 1000.)  # adc to volt: value * 5V/1024, volt to hpa: (Vout - 0.5V)/4V * 6.9bar * 1000
        elif boardID > 4 and boardID < 7:  # temperature sensor located inside the house, not on the electron boards
            data[boardID][0].append(int(values[4], 10) * 0.0625)  # values[2] is the address, [3] is empty, [4] is the value
eboardfile.close()
Traceback (most recent call last):
boardID=int(round(float(values[1])%1*10)) #define board, 0-3 is the electronBoards, board4-pressure sensor, board5-temperature sensor located inside house not on eboard.
IndexError: list index out of range
This error occurs when `values` has fewer elements than expected, which means the line had no \t in it at all (maybe an empty line, or a line-ending format problem on Linux).
You can check the length of values before using it:
if len(values) < 9:
    continue
or split on any run of whitespace instead of only tabs:
values = line.split()
(Note that `line.split(string.whitespace)` would not work: an argument to split is treated as a single literal separator string.)
I cannot reproduce your condition, so just give it a try.
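The distinction matters because `str.split` with an argument treats it as one literal separator, while no-argument `split()` splits on any run of whitespace. A quick demonstration on a made-up line:

```python
import string

line = "1554196690  0.0\t178 180\r\n"

# The literal string ' \t\n\r\x0b\f' never occurs in the line,
# so splitting on it returns the line in one piece:
print(line.strip().split(string.whitespace))  # ['1554196690  0.0\t178 180']

# No-argument split() splits on any whitespace run and
# discards leading/trailing whitespace:
print(line.split())  # ['1554196690', '0.0', '178', '180']
```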

Compiling UniProt txt file into a dict to retrieve key (ID) and values (MOD_RES)

I am not very familiar with Python and am trying to retrieve data from a text file (test1), in UniProt format, which looks like this:
ID YSH1_YEAST Reviewed; 779 AA.
AC Q06224; D6VYS4;
DT 10-JAN-2006, integrated into UniProtKB/Swiss-Prot
DT 01-NOV-1996, sequence version 1.
.
.
.
FT METAL 184 184 Zinc 1. {ECO:0000250}.
FT METAL 184 184 Zinc 2. {ECO:0000250}.
FT METAL 430 430 Zinc 2. {ECO:0000250}.
FT MOD_RES 517 517 Phosphoserine; by ATM or ATR.
FT {ECO:0000244|PubMed:18407956}.
FT MUTAGEN 37 37 D->N: Loss of endonuclease activity.
.
.
So far I am able to retrieve the MOD_RES and AC entries separately, using these snippets:
import re

test = open('test1', 'r')
regex2 = re.compile(r'^AC\s+\w+')
for line in test:
    ac = regex2.findall(line)
    for a in ac:
        a = ''.join(a)
        print(a[5:12])
Q06224
P16521
testfile = open('test1')
regex = re.compile(r'^FT\s+MOD_RES\s+\w+\s+\w+\s+\w.+')
for line in testfile:
    po = regex.findall(line)
    for p in po:
        p = ''.join(p)
        print(p[23:48])
517 Phosphoserine;
2 N-acetylserine
187 N6,N6,N6-trime
196 N6,N6,N6-trime
The goal is to get each AC and its relevant modified residues (MOD_RES) into a tab-separated format. Also, if more than one MOD_RES appears in the data for a particular AC, duplicate that AC and get a table like this:
AC MOD_RES
Q06224 517 517 Phosphoserine
P04524 75 75 Phosphoserine
Q06224 57 57 Phosphoserine
Have you taken a look at Biopython?
You should be able to parse your UniProt file like this:
from Bio import SwissProt

for record in SwissProt.parse(open('/path/to/your/uniprot_sprot.dat')):
    for feature in record.features:
        print(feature)
From there you should be able to print what you want to a file.
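Without Biopython, a hand-rolled sketch along the lines of the regexes in the question could pair each accession with its MOD_RES lines (the field layout is assumed from the sample above; `sample` is a made-up two-record excerpt):

```python
import re

# Hypothetical two-record excerpt in the flat-file layout shown above.
sample = """\
ID   YSH1_YEAST   Reviewed; 779 AA.
AC   Q06224; D6VYS4;
FT   MOD_RES     517    517       Phosphoserine; by ATM or ATR.
//
ID   OTHER_YEAST  Reviewed; 100 AA.
AC   P04524;
FT   MOD_RES      75     75       Phosphoserine.
//
"""

ac_re = re.compile(r'^AC\s+(\w+)')
mod_re = re.compile(r'^FT\s+MOD_RES\s+(\d+)\s+(\d+)\s+(\S+)')

rows = []
ac = None
for line in sample.splitlines():
    m = ac_re.match(line)
    if m:            # first accession on the record's AC line
        ac = m.group(1)
    m = mod_re.match(line)
    if m and ac:     # one output row per MOD_RES line
        rows.append((ac, ' '.join(m.groups())))

for ac, mod in rows:
    print(ac + '\t' + mod)
```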

Getting errors when converting a text file to VCF format

I have a Python script where I am trying to convert a text file containing variant information in its rows into a variant call format (VCF) file for my downstream analysis.
I am getting almost everything correct, but when I run the code I miss the first two entries, i.e. the first two rows. The code is below; the line that fails to read the entire file is highlighted. I would like some expert advice.
I just started coding in Python, so I am not well versed in it.
##fileformat=VCFv4.0
##fileDate=20140901
##source=dbSNP
##dbSNP_BUILD_ID=137
##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
import sys
text=open(sys.argv[1]).readlines()
print text
print "First print"
text=filter(lambda x:x.split('\t')[31].strip()=='KEEP',text[2:])
print text
print "################################################"
text=map(lambda x:x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n',text)
print text
file=open(sys.argv[1].replace('.txt','.vcf'),'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
    file.close()
INPUT:
chrM 152 T C T_S7998 N_S8980 0 DBSNP COVERED 1 1 1 282 36 0 163.60287 0.214008 0.02 11.666081 202 55 7221 1953 0 0 TT 14.748595 49 0 1786 0 KEEP
chr9 311 T C T_S7998 N_S8980 0 NOVEL COVERED 0.993882 0.999919 0.993962 299 0 0 207.697923 1 0.02 1.854431 0 56 0 1810 1 116 CC -44.649001 0 12 0 390 KEEP
chr13 440 C T T_S7998 N_S8980 0 NOVEL COVERED 1 1 1 503 7 0 4.130339 0.006696 0.02 4.124606 445 3 16048 135 0 0 CC 12.942762 40 0 1684 0 KEEP
OUTPUT desired:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 152 . T C . PASS .
chr9 311 . T C . PASS .
chr13 440 . C T . PASS .
OUTPUT obtained:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chr13 440 . C T . PASS .
I would like to have some help regarding how this error can be rectified.
There are a couple of issues with your code:
In the filter function you are passing text[2:]. I think you want to pass text to get all the rows.
In the last loop where you write to the .vcf file, you are closing the file inside the loop. You should first write all the values and then close the file outside the loop.
So your code will look like (I removed all the prints):
import sys
text=open(sys.argv[1]).readlines()
text=filter(lambda x:x.split('\t')[31].strip()=='KEEP',text) # Pass text
text=map(lambda x:x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n',text)
file=open(sys.argv[1].replace('.txt','.vcf'),'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
file.close() # close after writing all the values, at the end
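As an aside, a `with` block avoids this class of bug entirely, because the file is closed exactly once when the block ends. A minimal sketch (rows hard-coded for illustration; the real script would build them from the input file):

```python
lines = ['chrM\t152\t.\tT\tC\t.\tPASS\t.\n',
         'chr9\t311\t.\tT\tC\t.\tPASS\t.\n']

with open('out.vcf', 'w') as vcf:
    vcf.write('##fileformat=VCFv4.0\n')
    vcf.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
    for row in lines:
        vcf.write(row)
# the file is closed automatically here, after all rows are written
```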

Python - Extracting text by Column Header from Given Row

I created a text file from multiple email messages.
Each of the three tuples below was written to the text file from a different email message and sender.
Cusip NAME Original Current Cashflow Collat Offering
362341D71 GSAA 2005-15 2A2 10,000 8,783 FCF 5/25 65.000
026932AC7 AHM 2007-1 GA1C 9,867 7,250 Spr Snr OA 56.250
Name O/F C/F Cpn FICO CAL WALB 60+ Notes Offer
CSMC 06-9 7A1 25.00 11.97 L+45 728 26 578 35.21 FLT,AS,0.0% 50-00
LXS 07-10H 2A1 68.26 34.01 L+16 744 6 125 33.98 SS,9.57% 39-00`
CUSIP Name BID x Off SIZE C/E 60++ WAL ARM CFLW
86360KAA6 SAMI 06-AR3 11A1 57-00 x 59-00 73+MM 46.9% 67.0% 65 POA SSPT
86361HAQ7 SAMI 06-AR7 A12 19-08 x 21-08 32+MM 15.4% 61.1% 61 POA SRMEZ
By 'Name' I need a way to pull out the price info (the price data sits under the words 'Offering', 'Offer', and 'Off'). This process will be replicated over the whole text file, and the extracted data ('Name' and 'Price') will be written to an Excel file via xlwt. Notice that the format of the price data varies by table.
The formatting makes this a little tricky, since the names can contain spaces, which makes csv difficult to use. One way around this is to use the header line to find the location and width of the columns you are interested in, via a regex. You can try something like this:
import re

for email in emails:
    print(email)
    lines = email.split('\n')
    name = re.search(r'name\s*', lines[0], re.I)
    price = re.search(r'off(er(ing)?)?\s*', lines[0], re.I)
    for line in lines[1:]:
        n = line[name.start():name.end()].strip()
        p = line[price.start():price.end()].strip()
        print((n, p))
    print()
This assumes that emails is a list where each entry is an email. Here is the output:
Cusip NAME Original Current Cashflow Collat Offering
362341D71 GSAA 2005-15 2A2 10,000 8,783 FCF 5/25 65.000
026932AC7 AHM 2007-1 GA1C 9,867 7,250 Spr Snr OA 56.250
('GSAA 2005-15 2A2', '65.000')
('AHM 2007-1 GA1C', '56.250')
Name O/F C/F Cpn FICO CAL WALB 60+ Notes Offer
CSMC 06-9 7A1 25.00 11.97 L+45 728 26 578 35.21 FLT,AS,0.0% 50-00
LXS 07-10H 2A1 68.26 34.01 L+16 744 6 125 33.98 SS,9.57% 39-00`
('CSMC 06-9 7A1', '50-00')
('LXS 07-10H 2A1', '39-00')
CUSIP Name BID x Off SIZE C/E 60++ WAL ARM CFLW
86360KAA6 SAMI 06-AR3 11A1 57-00 x 59-00 73+MM 46.9% 67.0% 65 POA SSPT
86361HAQ7 SAMI 06-AR7 A12 19-08 x 21-08 32+MM 15.4% 61.1% 61 POA SRMEZ
('SAMI 06-AR3 11A1', '59-00')
('SAMI 06-AR7 A12', '21-08')
Just use the csv module, and use consistent formatting for your numbers.
