Parsing text with Python 2.7

Text File
• I.D.: AN000015544
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398
• I.D.: AN000016955
DESCRIPTION: TEMPERATURE CALIBRATOR
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063
• I.D.: AN000017259
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076
• I.D.: AN000032766
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253B CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY5048 9036
Objective
Seeking a more efficient algorithm for parsing the manufacturer name and model number, e.g. 'HEWLETT-PACKARDMODEL NUM.: 34401A', 'AGILENT MODEL NUM.: U1253B', etc., from the text file above.
Data Structure
parts_data = {'Model_Number': []}
Code
with open("textfile", 'r') as parts_info:
linearray = parts_info.readlines(
for line in linearray:
model_number = ''
model_name = ''
if "MANUFACTURER:" in line:
model_name = line.split(':')[1]
if "NUM.:" in line:
model_number = line.split(':')[2]
model_number = model_number.split()[0]
model_number = model_name + ' ' + model_number
parts_data['Model_Number'].append(model_number.rstrip())
My code does exactly what I want, but I think there is a faster or cleaner way to do it. Let's increase efficiency!

Your code already looks fine, and unless you're parsing gigabytes of data I'm not sure optimizing it is worth the effort. Still, I thought of a few things.
If you remove the linearray = parts_info.readlines() line, Python lets you iterate over the open file directly with a for loop, which would make this whole thing streaming in case your file is huge. As written, readlines() reads the entire file into memory at once rather than going line by line, so a file bigger than your memory will exhaust it.
You can also combine the if statements into one conditional, since you seem to only care about lines that have both fields. In the interest of cleaner code, you also don't need the model_number = '' and model_name = '' initializations.
Saving the result of calls like line.split(':') and reusing it can also help.
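Putting those suggestions together, a minimal sketch (streaming iteration, one combined conditional, a single split), assuming the same file layout as above:
parts_data = {'Model_Number': []}

with open("textfile", 'r') as parts_info:
    for line in parts_info:  # iterate the open file directly: streaming, line by line
        if "MANUFACTURER:" in line and "NUM.:" in line:
            fields = line.split(':')  # split once and reuse the pieces
            model_name = fields[1]
            model_number = fields[2].split()[0]
            parts_data['Model_Number'].append((model_name + ' ' + model_number).rstrip())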
Alternatively, you could try a regex. It's impossible to tell which approach will perform better without testing both, which brings me back to what I was saying at the beginning: optimizing code is tricky and really shouldn't be done unless necessary. If you really, really cared about efficiency, you would use a tool written in C, such as awk.

One straightforward way is using a regex:
with open("textfile", 'r') as parts_info:
for line in parts_info:
m=re.search(r'[A-Z ]+ NUM\.: [A-Z\d]+',line)
if m:
print m.group(0)
Result:
'PACKARDMODEL NUM.: 34401A',
' FLUKE MODEL NUM.: 724',
' AGILENT MODEL NUM.: U1253A',
' AGILENT MODEL NUM.: U1253B'
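Note that the character class [A-Z ]+ does not include the hyphen, which is why HEWLETT- is dropped from the first match. A variant that keeps hyphenated manufacturer names and trims the leading space, assuming the same file layout:
import re

with open("textfile", 'r') as parts_info:
    for line in parts_info:
        # '-' added to the class so hyphenated names like HEWLETT-PACKARD survive
        m = re.search(r'[A-Z\- ]+ NUM\.: [A-Z\d]+', line)
        if m:
            print m.group(0).strip()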

A few things come to mind:
You could do the split(':') once and reuse the result.
If the number of ':' separators is always the same, you can throw away the ifs and check the length of the split once.
I end up with something like this:
parts_data = {'Model_Number': []}

with open("textfile.txt", 'r') as parts_info:
    linearray = parts_info.readlines()
    for line in linearray:
        linesp = line.split(':')
        if len(linesp) > 2:
            model_name = linesp[1]
            model_number = linesp[2]
            model_number = model_number.split()[0]
            model_number = model_name + ' ' + model_number
            parts_data['Model_Number'].append(model_number.rstrip())

Related

How to remove newlines but keep blank ones in a text file?

My question is essentially identical to the one found here, but I'd like to perform that operation using Python 3. The text in my file looks something like this:
'''
Chapter One ~~ Introductory
The institution of a leisure class is found in its best development at
the higher stages of the barbarian culture; as, for instance, in feudal...
'''
Per numerous suggestions I've found, I have tried:
with open('veblen_txt_test.txt', 'r') as src:
    with open('write_new.txt', 'w') as dest:
        for line in src:
            if len(line) > 0:
                line = line.replace('\n', ' ')
                dest.write(line)
            else:
                line = line + '\n\n'
                dest.write('%s' % (line))
But this returns:
'''
Chapter One ~~ Introductory The institution of a leisure class is found in its best development at the higher stages of the barbarian culture; as, for instance, in feudal...
'''
The intended output is:
'''
Chapter One ~~ Introductory
The institution of a leisure class is found in its best development at the higher stages of the barbarian culture; as, for instance, in feudal...
'''
I have tried using rstrip():
with open('veblen_txt_test.txt', 'r') as src:
    with open('write_new.txt', 'w') as dest:
        for line in src:
            if len(line) > 0:
                line = line.rstrip('\n')
                dest.write('%s%s' % (line, ' '))
            else:
                line = line + '\n\n'
                dest.write('%s' % (line))
But that yields the same result.
Most of the responses online address removing blank lines, not keeping them; I have no doubt the solution is simple, but I've been trying different variations of the above code for about an hour and a half and just thought to ask the community. Thanks for your assistance!
If we change len(line) > 0 to len(line) > 1, it does the job, because a "blank" line still contains one character: the \n. You'll also have to remove the line line = line + '\n\n', as it adds extra blank lines (there are already two \n characters between Chapter One ~~ Introductory and The institution...).
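Putting both changes together, a minimal corrected version of the loop (same file names as in the question):
with open('veblen_txt_test.txt', 'r') as src:
    with open('write_new.txt', 'w') as dest:
        for line in src:
            if len(line) > 1:                  # a real line of text
                dest.write(line.replace('\n', ' '))
            else:                              # just '\n': keep the blank line as-is
                dest.write(line)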
Output:
Chapter One ~~ Introductory
The institution of a leisure class is found in its best development at the higher stages of the barbarian culture; as, for instance, in feudal...

python nesting loops

I am trying to perform a nested loop to combine data into a single line by matching MAC addresses in both files.
The loop runs fine without the regex; however, when using the search regex below, it only loops through MAC_Lines once, prints the correct result for the first entry, and stops. I'm unsure how to make it advance to the next line of MAC_Lines and repeat the process for all entries.
try:
    for mac in MAC_Lines:
        MAC_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', mac, re.I)
        MAC_address_final = MAC_address.group()
        for arp in ARP_Lines:
            ARP_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', arp, re.I)
            ARP_address_final = ARP_address.group()
            if MAC_address_final == ARP_address_final:
                print mac + arp
                continue
except Exception:
    print 'completed.'
Results:
13,64,00:0c:29:36:9f:02,giga-swx 0/213,172.20.13.70, 00:0c:29:36:9f:02, vlan 64
completed.
I learned that the issue was how I opened the files. I should have used the open ... as pattern when opening both files, so that the inner file is properly closed and reopened for each pass of the outer loop. Below is the code I was looking for.
import re

with open('MAC_List.txt', 'r') as read0:
    for items0 in read0:
        MAC_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', items0, re.I)
        if MAC_address:
            mac_addy = MAC_address.group().upper()
            with open('ARP_List.txt', 'r') as read1:
                for items1 in read1:
                    ARP_address = re.search(r'([a-fA-F0-9]{2}[:|\-]?){6}', items1, re.I)
                    if ARP_address:
                        # normalize case on both sides so the comparison matches
                        arp_addy = ARP_address.group().upper()
                        if mac_addy == arp_addy:
                            print(items0.strip() + ' ' + items1.strip())

Text File Reading and Structuring data

Text File
• I.D.: AN000015544
DESCRIPTION: 6 1/2 DIGIT DIGITAL MULTIMETER
MANUFACTURER: HEWLETT-PACKARDMODEL NUM.: 34401A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY45027398
• I.D.: AN000016955
DESCRIPTION: TEMPERATURE CALIBRATOR
MANUFACTURER: FLUKE MODEL NUM.: 724 CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: 1189063
• I.D.: AN000017259
DESCRIPTION: TRUE RMS MULTIMETER
MANUFACTURER: AGILENT MODEL NUM.: U1253A CALIBRATION - DUE DATE:6/1/2016 SERIAL NUMBER: MY49420076
Objective
To read the text file and save the ID number and Serial number of each part into the part_data data structure.
Data Structure
part_data = {'ID': [],
             'Serial Number': []}
Code
with open("textfile", 'r') as part_info:
lineArray = part_info.read().split('\n')
print(lineArray)
if "• I.D.: AN000015544 " in lineArray:
print("I have found the first line")
ID = [s for s in lineArray if "AN" in s]
print(ID[0])
My code isn't finding the I.D. or the serial number value. I know it is wrong; I was trying to use the method from Text File Reading and Printing Data for parsing the data. Can anyone move me in the right direction for collecting the values?
Update
This solution works with Python 2.7.9 but not 3.4, thanks to domino (https://stackoverflow.com/users/209361/domino):
with open("textfile", 'r') as part_info:
lineArray = part_info.readlines()
for line in lineArray:
if "I.D.:" in line:
ID = line.split(':')[1]
print ID.strip()
However, when I initially asked the question I was using Python 3.4, and the solution did not work properly.
Does anyone understand why it doesn't work in Python 3.4? Thank you!
This should print out all your IDs. I think it should move you in the right direction.
with open("textfile", 'r') as part_info:
lineArray = part_info.readlines()
for line in lineArray:
if "I.D.:" in line:
ID = line.split(':')[1]
print ID.strip()
It won't work in Python 3 because in Python 3 print is a function. The last line should instead be
print(ID.strip())
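For reference, a sketch of the same idea that runs under both Python 2.7 and 3.4 (print used as a function) and also collects the serial numbers from the question's objective:
part_data = {'ID': [], 'Serial Number': []}

with open("textfile", 'r') as part_info:
    for line in part_info:
        if "I.D.:" in line:
            # the field right after 'I.D.:' is the ID
            part_data['ID'].append(line.split(':')[1].strip())
        if "SERIAL NUMBER:" in line:
            # the serial number is the last ':'-separated field on the line
            part_data['Serial Number'].append(line.split(':')[-1].strip())

print(part_data)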

Arcpy - Creating a Buffer, then Dissolving in a single script

So I am attempting to write a script that has a number of user-defined variables. I've gotten to the final step and can't seem to get it to dissolve things properly.
Purpose: the script should let me define a shapefile/layer file and a buffer distance, create the buffer, then dissolve it (this is where it fails) and save the result.
Here is what I have so far.
import arcpy
from arcpy import env
env.workspace = "C:\Users\...\Conroe Cut"
fc = raw_input (' What file is being Buffered' + " ")
distance = raw_input (' Buffer Size' + " ")
finalfile = raw_input (' Name of Final File' + " ")
unique_name = arcpy.CreateUniqueName("Results\\"+finalfile)
arcpy.Buffer_analysis(fc, unique_name, distance)
arcpy.Dissolve_management(unique_name, "SINGLE_PART", "DISSOLVE_LINES")
print "Finished with Analysis"
You can perform the buffer and dissolve in one step using arcpy.Buffer_analysis; make sure to specify the "ALL" dissolve option, which performs the dissolve. This should significantly simplify and clean up your script.
import arcpy
infc = r'C:\path\to\input\shapefile.shp'
outfc = r'C:\path\to\output\shapefile_buffered_dissolved.shp'
bufferDistance = 20
arcpy.Buffer_analysis(infc, outfc, bufferDistance, "", "", "ALL")

(BioPython) How do I stop MemoryError: Out of Memory exception?

I have a program where I take a pair of very large multiple-sequence files (>77,000 sequences, each averaging about 1000 bp long), calculate the alignment score between each paired element, and write that number into an output file (which I will load into an Excel file later).
My code works for small multiple sequence files but my large master file will throw the following traceback after analyzing the 16th pair.
Traceback (most recent call last):
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 109, in <module>
    cycle(f,k,binLen)
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 85, in cycle
    a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 301, in __call__
    return _align(**keywds)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 322, in _align
    score_only)
MemoryError: Out of memory
I have tried many things to work around this (as you may see from the code), all to no avail. I have tried splitting the large master file into smaller batches to be fed into the score-calculating method. I have tried del on objects after I am done using them, and I have tried running under Ubuntu 11.11 on an Oracle virtual machine (I typically work in 64-bit Windows 7). Am I being too ambitious? Is this computationally feasible in Biopython? Below is my code; I have no experience in memory debugging, which is clearly what this problem calls for. Any assistance is greatly appreciated; I am becoming very frustrated with this problem.
Best,
Harry
## Open reference file
## a.) Upload subjectList
## b.) Upload query list (a and b are pairwise data)
## Cycle through each paired FASTA and get alignment score of each (large file)
from Bio import SeqIO
from Bio import pairwise2
import gc

## BATCH ITERATOR METHOD (not my code)
def batch_iterator(iterator, batch_size):
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

def split(subject, query):
    ## Query iterator and batch subject iterator
    query_iterator = SeqIO.parse(query, "fasta")
    record_iter = SeqIO.parse(subject, "fasta")
    ## Writes both large files into many small files
    print "Splitting Subject File..."
    binLen = 2
    for j, batch1 in enumerate(batch_iterator(record_iter, binLen)):
        filename1 = "groupA_%i.fasta" % (j+1)
        handle1 = open(filename1, "w")
        count1 = SeqIO.write(batch1, handle1, "fasta")
        handle1.close()
    print "Done splitting Subject file"
    print "Splitting Query File..."
    for k, batch2 in enumerate(batch_iterator(query_iterator, binLen)):
        filename2 = "groupB_%i.fasta" % (k+1)
        handle2 = open(filename2, "w")
        count2 = SeqIO.write(batch2, handle2, "fasta")
        handle2.close()
    print "Done splitting both FASTA files"
    print " "
    return [k, binLen]

## This file will hold the alignment scores in a tab-delimited text
f = open("C:\\Users\\Harry\\Documents\\cgigas\\alignScore.txt", 'w')

def cycle(f, k, binLen):
    i = 1
    m = 1
    while i <= k+1:
        ## Open the next pair of small files
        subjectFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupA_" + str(i) + ".fasta", "rU")
        queryFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupB_" + str(i) + ".fasta", "rU")
        i = i + 1
        j = 0
        ## Make small file iterators
        smallQuery = SeqIO.parse(queryFile, "fasta")
        smallSubject = SeqIO.parse(subjectFile, "fasta")
        ## Cycle through both sets of FASTA files
        while j < binLen:
            j = j + 1
            currentQuery = smallQuery.next()
            currentSubject = smallSubject.next()
            ## Verify every pair is correct
            print " "
            print "Pair: " + str(m)
            print "Subject: " + currentSubject.id
            print "Query: " + currentQuery.id
            gc.collect()
            a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
            gc.collect()
            currentQuery = None
            currentSubject = None
            score = str(a)
            a = None
            print "Score: " + score
            f.write("1" + "\n")
            m = m + 1
        smallQuery.close()
        smallSubject.close()
        subjectFile.close()
        queryFile.close()
        gc.collect()
        print "New file"

## MAIN PROGRAM
## Here is our paired list of FASTA files
##subject = open("C:\\Users\\Harry\\Documents\\cgigas\\subjectFASTA.fasta", "rU")
##query = open("C:\\Users\\Harry\\Documents\\cgigas\\queryFASTA.fasta", "rU")
##[k,binLen] = split(subject,query)
k = 272
binLen = 2
cycle(f, k, binLen)
P.S. Be kind; I am aware there are probably some goofy things in the code that I put in there while trying to get around this problem.
See also this very similar question on BioStars, http://www.biostars.org/post/show/45893/trying-to-get-around-memoryerror-out-of-memory-exception-in-biopython-program/
There I suggested trying existing tools for this kind of thing, e.g. EMBOSS needleall (http://emboss.open-bio.org/wiki/Appdoc:Needleall); you can parse the EMBOSS alignment output with Biopython.
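If you go that route, here is a minimal sketch of reading the needleall alignments back with Biopython's AlignIO; the file name needleall_output.txt is a hypothetical placeholder, and this assumes the default EMBOSS pair-style output:
from Bio import AlignIO

# "emboss" is AlignIO's parser for needle/water/needleall pair-style output;
# needleall_output.txt is a hypothetical placeholder name
for alignment in AlignIO.parse("needleall_output.txt", "emboss"):
    print "%s vs %s, length %i" % (alignment[0].id, alignment[1].id,
                                   alignment.get_alignment_length())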
The pairwise2 module was updated in a recent version of Biopython (1.68) and is now faster and consumes less memory.
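With that in mind, the splitting step can be skipped entirely by iterating both FASTA files in parallel so that only one pair of records is in memory at a time. A sketch, assuming the paired inputs are the subjectFASTA.fasta and queryFASTA.fasta files mentioned in the question:
from itertools import izip  # Python 2; use the built-in zip in Python 3

from Bio import SeqIO
from Bio import pairwise2

subjects = SeqIO.parse("subjectFASTA.fasta", "fasta")
queries = SeqIO.parse("queryFASTA.fasta", "fasta")

with open("alignScore.txt", "w") as out:
    # izip pulls one record from each iterator per loop, so only the
    # current pair of sequences is held in memory
    for subject, query in izip(subjects, queries):
        score = pairwise2.align.localxx(subject.seq, query.seq,
                                        score_only=True)
        out.write("%s\t%s\t%s\n" % (subject.id, query.id, score))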
