Text File Reading and Structuring Data - Python

Text File
• I.D.: AN000015544
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398
• I.D.: AN000016955
DESCRIPTION: TEMPERATURE CALIBRATOR
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063
• I.D.: AN000017259
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076
Objective
To read the text file and save the ID number and Serial number of each part into the part_data data structure.
Data Structure
part_data = {'ID': [],
             'Serial Number': []}
Code
with open("textfile", 'r') as part_info:
lineArray = part_info.read().split('\n')
print(lineArray)
if "• I.D.: AN000015544 " in lineArray:
print("I have found the first line")
ID = [s for s in lineArray if "AN" in s]
print(ID[0])
My code isn't finding the I.D. or the serial number value. I know it is wrong; I was trying to use the method I got from the post Text File Reading and Printing Data for parsing the data. Can anyone point me in the right direction for collecting the values?
Update
This solution works with Python 2.7.9 but not 3.4; thanks to domino - https://stackoverflow.com/users/209361/domino:
with open("textfile", 'r') as part_info:
lineArray = part_info.readlines()
for line in lineArray:
if "I.D.:" in line:
ID = line.split(':')[1]
print ID.strip()
However, when I initially asked the question I was using Python 3.4, and the solution did not work properly.
Does anyone understand why it doesn't work in Python 3.4? Thank you!

This should print out all your IDs. I think it should move you in the right direction.
with open("textfile", 'r') as part_info:
lineArray = part_info.readlines()
for line in lineArray:
if "I.D.:" in line:
ID = line.split(':')[1]
print ID.strip()

It won't work in Python 3 because in Python 3 print is a function. The last line should be:
print(ID.strip())
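For completeness, here is a minimal Python 3 sketch that fills the part_data structure from the question with both fields. It assumes the file looks exactly like the sample above; splitting on the label itself keeps the other colons in each line from interfering:

part_data = {'ID': [],
             'Serial Number': []}

with open("textfile", 'r') as part_info:
    for line in part_info:
        if "I.D.:" in line:
            # Split on the label so other colons in the line don't matter.
            part_data['ID'].append(line.split("I.D.:")[1].strip())
        if "SERIAL NUMBER:" in line:
            part_data['Serial Number'].append(line.split("SERIAL NUMBER:")[1].strip())

print(part_data)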

Related

Python nesting loops

I am trying to perform a nested loop to combine data into one line by matching the MAC addresses found in both files.
I am able to run the loop fine without the regex; however, when using the search regex below, it loops through MAC_Lines once, prints the correct result for the first entry, and stops. I'm unsure how to make MAC_Lines advance to the next line and repeat the process for all of its entries.
try:
    for mac in MAC_Lines:
        MAC_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', mac, re.I)
        MAC_address_final = MAC_address.group()
        for arp in ARP_Lines:
            ARP_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', arp, re.I)
            ARP_address_final = ARP_address.group()
            if MAC_address_final == ARP_address_final:
                print mac + arp
                continue
except Exception:
    print 'completed.'
Results:
13,64,00:0c:29:36:9f:02,giga-swx 0/213,172.20.13.70, 00:0c:29:36:9f:02, vlan 64
completed.
I learned that the issue was how I opened the files: iterating over a file object exhausts it, so ARP_Lines was empty after the first pass through the inner loop. I should have used the with open(...) as ... construct for both files so each one is properly closed and reopened for the next loop. Below is the code I was looking for:
import re

with open('MAC_List.txt', 'r') as read0:
    for items0 in read0:
        MAC_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', items0, re.I)
        if MAC_address:
            mac_addy = MAC_address.group().upper()
            with open('ARP_List.txt', 'r') as read1:
                for items1 in read1:
                    ARP_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', items1, re.I)
                    if ARP_address:
                        # Normalize case on both sides so the comparison is reliable.
                        arp_addy = ARP_address.group().upper()
                        if mac_addy == arp_addy:
                            print(items0.strip() + ' ' + items1.strip())
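If reopening ARP_List.txt for every MAC line gets slow, one alternative worth considering (a sketch only, assuming the same file names and address format as above) is to read the ARP file once into a dict keyed by the normalized address, then stream the MAC file against it:

import re

mac_pattern = re.compile(r'([a-fA-F0-9]{2}[:\-]?){6}')

# Read the ARP file once, keyed by the uppercased MAC address.
arp_by_mac = {}
with open('ARP_List.txt', 'r') as arp_file:
    for arp_line in arp_file:
        match = mac_pattern.search(arp_line)
        if match:
            arp_by_mac[match.group().upper()] = arp_line.strip()

# Stream the MAC file and look each address up in constant time.
with open('MAC_List.txt', 'r') as mac_file:
    for mac_line in mac_file:
        match = mac_pattern.search(mac_line)
        if match and match.group().upper() in arp_by_mac:
            print(mac_line.strip() + ' ' + arp_by_mac[match.group().upper()])

Note that a dict keeps only one ARP line per address; if duplicates matter, store a list of lines per key instead.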

How can I find the "smallest eigenvalue" of a plate with Python scripting in Abaqus?

I wrote a Python script to model and analyze a plate for buckling.
I need the minimum eigenvalue to run the other script for the RIKS analysis.
How can I find the smallest eigenvalue with Python scripting in Abaqus?
foo = [3, 1, 4, 5]
print min(foo)
# outputs => 1
The eigenvalues can be read from the .dat file by locating the table header line and then picking the value out of its fixed-width column:
datFullPath = PathDir + FileName + '.dat'
myOutdf = open(datFullPath, 'r')
stline = ' MODE NO EIGENVALUE\n'
lines = myOutdf.readlines()
ss = 0
for i in range(len(lines) - 1):
    if lines[i] == stline:
        print lines[i]
        ss = i
f1 = lines[ss + 3]  # the first data row sits three lines below the header
MinEigen = float(f1[15:24])  # the eigenvalue occupies these fixed-width columns
myOutdf.close()
print MinEigen
# Import Abaqus odb work related modules
from odbAccess import *

# Read the odb file; change the name and path as per your requirement.
# This is a typical procedure to read history outputs, as
# frequency etc. information is not available as field output.
odb = openOdb(path=JobName + '.odb')
a = odb.rootAssembly
step = odb.steps['Step-1']
frames = step.frames
numframes = len(frames)
MinEigen = []
for frame in frames:
    f1 = frame.description
    # The eigenvalue is embedded in the frame description at a fixed offset.
    if len(f1[28:48]) > 1:
        MinEg = float(f1[28:48])
        if MinEg > 0.0:
            MinEigen.append(MinEg)
print MinEigen, min(MinEigen)
fwall = open("EIGENX.txt", 'a')
fwall.write(str(min(MinEigen)) + '\n')
fwall.close()
odb.close()

Transposing multiple lines of data into one table in Python

I have an output file from a script that parses iwlist scan that looks something like:
Cell: 01 -
Address: XX:XX:XX:XX:XX
ESSID: "My Network Name"
Frequency: 2.412 GHz (Channel 2)
Quality: =XX/100
Signal Level: XX/100
Cell: 02 -
Address:
ESSID:
etc., etc., for as many WLANs as show up on the scan.
My question is: how would I go about parsing this list even further, perhaps into a new file, to give the output a tabulated view (using Python)?
for example, the output would be:
Cell Address ESSID Frequency Quality Signal Level
01 - XX:XX:XX:XX:XX:XX "My Network Name" 2.417 GHz (Channel 2) =XX/100 =XX/100
and so on for the rest of the WLANs on the scan, preferably without repeating the headers.
This would work, for example.
iwlist = '''Cell: 01 -
Address: XX:XX:XX:XX:XX
ESSID: "My Network Name"
Frequency: 2.412 GHz (Channel 2)
Quality: =XX/100
Signal Level: XX/100
Cell: 02 -
Address:
ESSID:
'''
options = []
values = []
for line in iwlist.split('\n'):
    if not line.strip():
        continue
    line = line.split(':')
    options.append(line[0])
    values.append(':'.join(line[1:]))
for o in options:
    print('{:^20}'.format(o), end="")
print()
for v in values:
    print('{:^20}'.format(v), end="")
print()
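The snippet above prints every field of every cell into one long header row and one long value row, so the headers repeat for each cell. A sketch of one way to group the fields per cell and print the header row only once (it assumes every record starts with a "Cell" line, as in the sample):

records = []
current = None
for line in iwlist.split('\n'):
    if not line.strip():
        continue
    key, _, value = line.partition(':')
    key = key.strip()
    if key == 'Cell':
        # Every "Cell" line starts a new record.
        current = {}
        records.append(current)
    if current is not None:
        current[key] = value.strip()

headers = ['Cell', 'Address', 'ESSID', 'Frequency', 'Quality', 'Signal Level']
print(''.join('{:^22}'.format(h) for h in headers))
for record in records:
    print(''.join('{:^22}'.format(record.get(h, '')) for h in headers))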

Parsing text with Python 2.7

Text File
• I.D.: AN000015544
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398
• I.D.: AN000016955
DESCRIPTION: TEMPERATURE CALIBRATOR
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063
• I.D.: AN000017259
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076
• I.D.: AN000032766
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253B CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY5048 9036
Objective
Seeking a more efficient algorithm for parsing the manufacturer name and model number, e.g. 'HEWLETT-PACKARDMODEL NUM.: 34401A', 'AGILENT MODEL NUM.: U1253B', etc., from the text file above.
Data Structure
parts_data = {'Model_Number': []}
Code
with open("textfile", 'r') as parts_info:
linearray = parts_info.readlines(
for line in linearray:
model_number = ''
model_name = ''
if "MANUFACTURER:" in line:
model_name = line.split(':')[1]
if "NUM.:" in line:
model_number = line.split(':')[2]
model_number = model_number.split()[0]
model_number = model_name + ' ' + model_number
parts_data['Model_Number'].append(model_number.rstrip())
My code does exactly what I want, but I think there is a faster or cleaner way to complete the action. Let's increase efficiency!
Your code looks fine already, and unless you're parsing gigabytes of data I don't see the point of optimizing it. Still, a few things come to mind.
If you remove the linearray = parts_info.readlines() line, Python understands iterating directly over an open file with a for loop, which would make the whole thing streaming in case your file is huge. As written, readlines() reads the entire file into memory at once rather than going line by line, so a file bigger than your memory will crash the machine.
You can also combine the if statements into one conditional, since you seem to only care about lines that have both fields. In the interest of cleaner code you also don't need model_number = '' and model_name = ''.
Saving the result of line.split(':') instead of splitting repeatedly helps too.
Alternatively, you could try a regex. It's impossible to tell which version will perform better without testing both, which brings me back to what I said at the beginning: optimizing code is tricky and really shouldn't be done unless necessary. If you really, really cared about efficiency you would use a program like awk, which is written in C.
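Here is a hedged sketch of those suggestions combined (streaming read, one conditional, split done once), using the file name from the question; I haven't benchmarked it against the original:

parts_data = {'Model_Number': []}

with open("textfile", 'r') as parts_info:
    for line in parts_info:  # iterating the file object streams line by line
        if "MANUFACTURER:" in line and "NUM.:" in line:
            fields = line.split(':')  # split once and reuse the pieces
            model_name = fields[1]
            model_number = fields[2].split()[0]
            parts_data['Model_Number'].append((model_name + ' ' + model_number).rstrip())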
One straightforward way is using a regex:
import re

with open("textfile", 'r') as parts_info:
    for line in parts_info:
        m = re.search(r'[A-Z ]+ NUM\.: [A-Z\d]+', line)
        if m:
            print m.group(0)
Result:
'PACKARDMODEL NUM.: 34401A',
' FLUKE MODEL NUM.: 724',
' AGILENT MODEL NUM.: U1253A',
' AGILENT MODEL NUM.: U1253B'
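Note that the character class in that pattern has no '-', which is why 'HEWLETT-' is missing from the first result. A hedged variant of the same idea that also admits hyphenated names:

import re

with open("textfile", 'r') as parts_info:
    for line in parts_info:
        # Same pattern, with '-' added to the class so hyphenated names survive.
        m = re.search(r'[A-Z\- ]+ NUM\.: [A-Z\d]+', line)
        if m:
            print m.group(0)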
A few things come to my mind:
You could do the split(':') once and reuse it.
If the number of ':' is always the same, you can throw away the ifs and check the length once.
I end up with something like this:
parts_data = {'Model_Number': []}
with open("textfile.txt", 'r') as parts_info:
    linearray = parts_info.readlines()
    for line in linearray:
        linesp = line.split(':')
        if len(linesp) > 2:
            model_name = linesp[1]
            model_number = linesp[2]
            model_number = model_number.split()[0]
            model_number = model_name + ' ' + model_number
            parts_data['Model_Number'].append(model_number.rstrip())

(BioPython) How do I stop MemoryError: Out of Memory exception?

I have a program where I take a pair of very large multiple-sequence files (>77,000 sequences, each averaging about 1000 bp long) and calculate the alignment score between each paired element, writing that number into an output file (which I will load into an Excel file later).
My code works for small multiple-sequence files, but my large master file throws the following traceback after analyzing the 16th pair.
Traceback (most recent call last):
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 109, in <module>
    cycle(f,k,binLen)
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 85, in cycle
    a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 301, in __call__
    return _align(**keywds)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 322, in _align
    score_only)
MemoryError: Out of memory
I have tried many things to work around this (as many of you may see from the code), all to no avail. I have tried splitting the large master file into smaller batches to be fed into the score-calculating method. I have tried del on objects after I am done using them, and I have tried running under Ubuntu 11.11 on an Oracle virtual machine (I typically work in 64-bit Windows 7). Am I being too ambitious? Is this computationally feasible in BioPython? Below is my code; I have no experience in memory debugging, which is the clear culprit of this problem. Any assistance is greatly appreciated, as I am becoming very frustrated with this problem.
Best,
Harry
##Open reference file
##a.) Upload subjectList
##b.) Upload query list (a and b are pairwise data)
##Cycle through each paired FASTA and get alignment score of each (large file)
from Bio import SeqIO
from Bio import pairwise2
import gc

##BATCH ITERATOR METHOD (not my code)
def batch_iterator(iterator, batch_size):
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

def split(subject, query):
    ##Query iterator and batch subject iterator
    query_iterator = SeqIO.parse(query, "fasta")
    record_iter = SeqIO.parse(subject, "fasta")
    ##Write both large files into many small files
    print "Splitting Subject File..."
    binLen = 2
    for j, batch1 in enumerate(batch_iterator(record_iter, binLen)):
        filename1 = "groupA_%i.fasta" % (j + 1)
        handle1 = open(filename1, "w")
        count1 = SeqIO.write(batch1, handle1, "fasta")
        handle1.close()
    print "Done splitting Subject file"
    print "Splitting Query File..."
    for k, batch2 in enumerate(batch_iterator(query_iterator, binLen)):
        filename2 = "groupB_%i.fasta" % (k + 1)
        handle2 = open(filename2, "w")
        count2 = SeqIO.write(batch2, handle2, "fasta")
        handle2.close()
    print "Done splitting both FASTA files"
    print " "
    return [k, binLen]
##This file will hold the alignment scores in tab-delimited text
f = open("C:\\Users\\Harry\\Documents\\cgigas\\alignScore.txt", 'w')

def cycle(f, k, binLen):
    i = 1
    m = 1
    while i <= k + 1:
        ##Open the next pair of small files
        subjectFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupA_" + str(i) + ".fasta", "rU")
        queryFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupB_" + str(i) + ".fasta", "rU")
        i = i + 1
        j = 0
        ##Make small file iterators
        smallQuery = SeqIO.parse(queryFile, "fasta")
        smallSubject = SeqIO.parse(subjectFile, "fasta")
        ##Cycle through both sets of FASTA files
        while j < binLen:
            j = j + 1
            currentQuery = smallQuery.next()
            currentSubject = smallSubject.next()
            ##Verify every pair is correct
            print " "
            print "Pair: " + str(m)
            print "Subject: " + currentSubject.id
            print "Query: " + currentQuery.id
            gc.collect()
            a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
            gc.collect()
            currentQuery = None
            currentSubject = None
            score = str(a)
            a = None
            print "Score: " + score
            f.write(score + "\n")  ##record the score for this pair
            m = m + 1
        smallQuery.close()
        smallSubject.close()
        subjectFile.close()
        queryFile.close()
        gc.collect()
        print "New file"
##MAIN PROGRAM
##Here is our paired list of FASTA files
##subject = open("C:\\Users\\Harry\\Documents\\cgigas\\subjectFASTA.fasta", "rU")
##query = open("C:\\Users\\Harry\\Documents\\cgigas\\queryFASTA.fasta", "rU")
##[k,binLen] = split(subject,query)
k = 272
binLen = 2
cycle(f, k, binLen)
f.close()
P.S. Be kind: I am aware there are probably some goofy things in the code that I put in there trying to get around this problem.
See also this very similar question on BioStars: http://www.biostars.org/post/show/45893/trying-to-get-around-memoryerror-out-of-memory-exception-in-biopython-program/
There I suggested trying existing tools for this kind of thing, e.g. EMBOSS needleall, http://emboss.open-bio.org/wiki/Appdoc:Needleall (you can parse the EMBOSS alignment output with Biopython).
The pairwise2 module was updated in a recent version of Biopython (1.68) to be faster and less memory-consuming.
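Since the batch-file splitting exists only to work around the memory error, another structure worth trying is to stream both FASTA files in parallel so that only one pair of records is alive at a time. A minimal Python 3 sketch, assuming the file names from the commented-out main program and score_only=True as in the original code:

from Bio import SeqIO
from Bio import pairwise2

subjects = SeqIO.parse("subjectFASTA.fasta", "fasta")
queries = SeqIO.parse("queryFASTA.fasta", "fasta")

with open("alignScore.txt", "w") as out:
    # zip is lazy in Python 3, so only the current pair is held in memory.
    for subject, query in zip(subjects, queries):
        # score_only=True returns just the best score, avoiding the full
        # list of alignments that localxx would otherwise build.
        score = pairwise2.align.localxx(subject.seq, query.seq, score_only=True)
        out.write("%s\t%s\t%s\n" % (subject.id, query.id, score))

This does not shrink the alignment matrix itself, which scales with the product of the two sequence lengths, so if a single pair of ~1000 bp sequences still exhausts memory, the EMBOSS needleall route above is the better option.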
