extracting part of text from file in python - python

I have a collection of text files that are of the form:
Sponsor : U of NC Charlotte
U N C C Station
Charlotte, NC 28223 704/597-2000
NSF Program : 1468 MANUFACTURING MACHINES & EQUIP
Fld Applictn: 0308000 Industrial Technology
56 Engineering-Mechanical
Program Ref : 9146,MANU,
Abstract :
9500390 Patterson This award supports a new concept in precision metrology,
the Extreme Ultraviolet Optics Measuring Machine (EUVOMM). The goals for this
system when used to measure optical surfaces are a diameter range of 250 mm
with a lateral accuracy of 3.3 nm rms, and a depth range of 7.5 mm w
There is more text above and below this snippet. For each text file, I want to do the following:
store the NSF Program and Fld Applictn numbers in one list, and store the associated text in another list.
So, in the above example, I want the following for the i-th text file:
y_num[i] = 1468, 0308000, 56
y_txt[i] = MANUFACTURING MACHINES & EQUIP, Industrial Technology, Engineering-Mechanical
Is there a clean way to do this in python? I prefer python since I am using os.walk to parse all the text files stored in subdirectories.

with open("file", "r") as file:
    for line in file:
        if "NSF" in line:
            values = line.split(":")
        elif "Fld" in line:
            values1 = line.split(":")
So values and values1 hold the specific values you are interested in.

You can try something like:
yourtextlist = yourtext.split(':')
numbers = []
for chunk in yourtextlist:
    l = chunk.split()
    try:
        numbers.append(int(l[0]))
    except (ValueError, IndexError):
        # Skip chunks that are empty or do not start with a number
        pass
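Neither answer above handles the continuation line ("56 Engineering-Mechanical") or keeps the text next to each number. Here is a minimal sketch of one way to do both. It assumes that every field starts with a "Label :" header, that continuation lines do not start with a letter-only label followed by a colon, and that the codes should stay strings so the leading zero in 0308000 survives. The names extract_codes, HEADER, ENTRY and WANTED are made up for this example.

```python
import re

# A header line looks like "NSF Program : 1468 MANUFACTURING ..." (assumed layout)
HEADER = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.*)$")
# An entry is a run of digits followed by its description
ENTRY = re.compile(r"^\s*(\d+)\s+(.*\S)\s*$")
WANTED = {"NSF Program", "Fld Applictn"}


def extract_codes(text):
    """Return (numbers, labels) for the NSF Program and Fld Applictn fields."""
    nums, labels = [], []
    current = None  # name of the field we are currently inside
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current, rest = header.group(1), header.group(2)
        else:
            rest = line  # continuation line of the current field
        if current in WANTED:
            entry = ENTRY.match(rest)
            if entry:
                nums.append(entry.group(1))
                labels.append(entry.group(2))
    return nums, labels
```

Inside an os.walk loop you could read each file, call extract_codes on its contents, and append the two returned lists to y_num and y_txt.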

Related

Parsing a genbank file format with biopython's SeqIO

I'm trying to parse a protein genbank file format, Here's an example file (example.protein.gpff)
LOCUS NP_001346895 208 aa linear PRI 20-JAN-2018
DEFINITION intercellular adhesion molecule 2 precursor [Cercocebus atys].
ACCESSION NP_001346895
VERSION NP_001346895.1
DBSOURCE REFSEQ: accession NM_001359966.1
KEYWORDS RefSeq.
SOURCE Cercocebus atys (sooty mangabey)
ORGANISM Cercocebus atys
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Cercopithecidae; Cercopithecinae; Cercocebus.
REFERENCE 1 (residues 1 to 208)
AUTHORS Palesch D, Bosinger SE, Tharp GK, Vanderford TH, Paiardini M,
Chahroudi A, Johnson ZP, Kirchhoff F, Hahn BH, Norgren RB, Patel
NB, Sodora DL, Dawoud RA, Stewart CB, Seepo SM, Harris RA, Liu Y,
Raveendran M, Han Y, English A, Thomas GWC, Hahn MW, Pipes L, Mason
CE, Muzny DM, Gibbs RA, Sauter D, Worley K, Rogers J and Silvestri
G.
TITLE Sooty mangabey genome sequence provides insight into AIDS
resistance in a natural SIV host
JOURNAL Nature 553 (7686), 77-81 (2018)
PUBMED 29300007
COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
KY308194.1.
##Evidence-Data-START##
Transcript exon combination :: KY308194.1 [ECO:0000332]
RNAseq introns :: single sample supports all introns
SAMN02045730, SAMN03085078
[ECO:0000348]
##Evidence-Data-END##
FEATURES Location/Qualifiers
source 1..208
/organism="Cercocebus atys"
/db_xref="taxon:9531"
Protein 1..208
/product="intercellular adhesion molecule 2 precursor"
/calculated_mol_wt=21138
sig_peptide 1..19
/inference="COORDINATES: ab initio prediction:SignalP:4.0"
/calculated_mol_wt=1999
Region 24..109
/region_name="ICAM_N"
/note="Intercellular adhesion molecule (ICAM), N-terminal
domain; pfam03921"
/db_xref="CDD:252248"
Region 112..>167
/region_name="Ig"
/note="Immunoglobulin domain; cl11960"
/db_xref="CDD:325142"
CDS 1..208
/gene="ICAM2"
/coded_by="NM_001359966.1:1..627"
/db_xref="GeneID:105590766"
ORIGIN
1 mssfgfgtlt malfalvccs gsdekafevh mrleklivkp kesfevncst tcnqpevggl
61 etslnkilll eqtqwkhyli snishdtvlw chftcsgkqk smssnvsvyq pprqvfltlq
121 ptwvavgksf tiecrvpave pldsltlsll rgsetlhsqt frkaapalpv lrelgmkfiq
181 lcprrglagt mppsrpwcpa athwsqgc
//
LOCUS NP_001280013 406 aa linear MAM 22-JAN-2018
DEFINITION 26S proteasome regulatory subunit 8 [Dasypus novemcinctus].
ACCESSION NP_001280013 XP_004456848
VERSION NP_001280013.1
DBSOURCE REFSEQ: accession NM_001293084.1
KEYWORDS RefSeq.
SOURCE Dasypus novemcinctus (nine-banded armadillo)
ORGANISM Dasypus novemcinctus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Xenarthra; Cingulata; Dasypodidae; Dasypus.
COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
AAGV03083693.1.
On May 28, 2014 this sequence version replaced XP_004456848.1.
Sequence Note: The RefSeq transcript and protein were derived from
genomic sequence to make the sequence consistent with the reference
genome assembly. The genomic coordinates used for the transcript
record were based on alignments.
##Evidence-Data-START##
RNAseq introns :: mixed/partial sample support SAMN00668203,
SAMN00744121 [ECO:0000350]
##Evidence-Data-END##
FEATURES Location/Qualifiers
source 1..406
/organism="Dasypus novemcinctus"
/db_xref="taxon:9361"
Protein 1..406
/product="26S proteasome regulatory subunit 8"
/calculated_mol_wt=45495
Region 4..404
/region_name="RPT1"
/note="ATP-dependent 26S proteasome regulatory subunit
[Posttranslational modification, protein turnover,
chaperones]; COG1222"
/db_xref="CDD:224143"
CDS 1..406
/gene="PSMC5"
/coded_by="NM_001293084.1:1..1221"
/db_xref="GeneID:101445299"
ORIGIN
1 maldgpeqme leegkagsgl rqyylskiee lqlivndksq nlrrlqaqrn elnakvrllr
61 eelqllqeqg syvgevvram dkkkvlvkvh pegkfvvdvd knidindvtp ncrvalrnds
121 ytlhkilpnk vdplvslmmv ekvpdstyem iggldkqike ikevielpvk hpelfealgi
181 aqpkgvllyg ppgtgktlla ravahhtdct firvsgselv qkfigegarm vrelfvmare
241 hapsiifmde idsigssrle ggsggdsevq rtmlellnql dgfeatknik vimatnridi
301 ldsallrpgr idrkiefppp neearldilk ihsrkmnltr ginlrkiael mpgasgaevk
361 gvcteagmya lrerrvhvtq edfemavakv mqkdseknms ikklwk
//
The format has repeating records (separated by //), where each record is a protein. Each record has several sections, among them a FEATURES section with several fixed fields, such as source, CDS, and Region, with values that refer to information specific to that record.
I'm interested in using biopython's SeqIO to parse this file into a dataframe which lists, for each record ID, the values of gene, db_xref, and coded_by from its CDS field, the organism and db_xref values from its source field, and the db_xref value from its Region field. The Region field may appear several times in the FEATURES section of a record, whereas the CDS and source fields appear only once.
My unsuccessful attempt so far looks like this:
import re
from Bio import SeqIO
filename = "example.protein.gpff"
for record in SeqIO.parse(filename, "genbank"):
    for feature in record.features:
        if feature.type == "CDS":
            symbol = feature.qualifiers.get("gene", ["???"])[0]
            gene_id = feature.qualifiers.get("db_xref", ["???"])[0]
            gene_id = re.sub('GeneID:', '', gene_id)
            transcript_id = feature.qualifiers.get("coded_by", ["???"])[0]
            transcript_id = re.sub(':.*', '', transcript_id)
        if feature.type == "source":
            species_name = feature.qualifiers.get("organism", ["???"])[0]
            species_id = feature.qualifiers.get("db_xref", ["???"])[0]
            species_id = re.sub('taxon:', '', species_id)
        if feature.type == "Region":
            cdd_id = feature.qualifiers.get("db_xref", ["???"])[0]
            cdd_id = re.sub('CDD:', '', cdd_id)
    print("%s,%s,%s,%s,%s,%s,%s" % (record.id, cdd_id, transcript_id, symbol, gene_id, species_name, species_id))
The resulting dataframe I'd like to obtain (for the example.protein.gpff above) is:
record_id CDS_coded_by CDS_db_xref CDS_gene source_organism source_db_xref Region_db_xref
1 NP_001346895 NM_001359966.1:1..627 GeneID:105590766 ICAM2 Cercocebus atys taxon:9531 CDD:252248
2 NP_001346895 NM_001359966.1:1..627 GeneID:105590766 ICAM2 Cercocebus atys taxon:9531 CDD:325142
3 NP_001280013 NM_001293084.1:1..1221 GeneID:101445299 PSMC5 Dasypus novemcinctus taxon:9361 CDD:224143
Check out the Genebank-parser library. It accepts a GenBank filename and a batch size; next_batch yields as many records as batch_size specifies.
It seems the easiest way to deal with this file format is to convert it to JSON (for example, using Bio), and then read it with one of the various JSON parsers (like the rjson package in R, which parses a JSON file into a list of records).
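Staying with SeqIO, one possible sketch is below. In the example file the CDS feature comes after the Region features, so the idea is to collect everything for a record first and only then emit one row per Region. The helper names rows_for_record and rows_from_file are made up for this example, and using record.name for the ID assumes Biopython puts the LOCUS name there (record.id normally carries the versioned accession).

```python
def rows_for_record(record_id, features):
    """Build one row per Region feature.

    `features` is an iterable of (feature_type, qualifiers) pairs, where
    qualifiers maps a qualifier name to a list of strings.
    """
    def first(quals, key):
        # Return the first value of a qualifier, or None if it is missing
        return quals.get(key, [None])[0]

    cds, source, regions = {}, {}, []
    for ftype, quals in features:
        if ftype == "CDS":
            cds = quals
        elif ftype == "source":
            source = quals
        elif ftype == "Region":
            regions.append(first(quals, "db_xref"))

    return [
        {
            "record_id": record_id,
            "CDS_coded_by": first(cds, "coded_by"),
            "CDS_db_xref": first(cds, "db_xref"),
            "CDS_gene": first(cds, "gene"),
            "source_organism": first(source, "organism"),
            "source_db_xref": first(source, "db_xref"),
            "Region_db_xref": region,
        }
        for region in regions
    ]


def rows_from_file(filename):
    """Parse a GenBank protein file into a list of row dicts (needs Biopython)."""
    from Bio import SeqIO  # imported here so the helper above works without it

    rows = []
    for record in SeqIO.parse(filename, "genbank"):
        features = [(f.type, f.qualifiers) for f in record.features]
        rows.extend(rows_for_record(record.name, features))
    return rows
```

The resulting list of dicts can be passed straight to pandas.DataFrame to get the table from the question. Records without any Region feature produce no rows in this sketch.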

Obtain tsv from text with a specific pattern

I'm a biologist and I need to extract information from a text file.
I have a file with plain text like this:
12018411
Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
  TAI|timed artificial insemination|0.999808
Two estrus-synchronization programs were compared and factors influencing their success over a year were evaluated. All cows received a setup injection of PGF2alpha at 39 +/- 3 d postpartum. Fourteen days later they received GnRH, followed in 7 d by a second injection of PGF2alpha. Cows (n = 523) assigned to treatment 1 (modified targeted breeding) were inseminated based on visual signs of estrus at 24, 48, or 72 h after the second PGF2alpha injection. Any cow not observed in estrus was inseminated at 72 h. Cows (n = 440) assigned to treatment 2 received a second GnRH injection 48 h after the second PGF2alpha, and all were inseminated 24 h later. Treatment, season of calving, multiple birth, estrual status at insemination, number of occurrences of estrus before second PGF2alpha, prophylactic use of PGF2alpha, retained fetal membranes, and occurrence of estrus following the setup PGF2alpha influenced success. Conception rate was 31.2% (treatment 1) and 29.1% (treatment 2). A significant interaction occurred between protocol and estrual status at insemination. Cows in estrus at insemination had a 45.8% (treatment 1) or 35.4% (treatment 2) conception rate. The conception rate for cows not expressing estrus at insemination was 19.2% (treatment 1) and 27.7% (treatment 2). Provided good estrous detection exists, modified targeted breeding can be as successful as other timed artificial insemination programs. Nutritional, environmental, and management strategies to reduce postpartum disorders and to minimize the duration of postpartum anestrus are critical if synchronization schemes are used to program first insemination after the voluntary waiting period.
8406022
Deletion of the beta-turn/alpha-helix motif at the exon 2/3 boundary of human c-Myc leads to the loss of its immortalizing function.
The protein product (c-Myc) of the human c-myc proto-oncogene carries a beta-turn/alpha-helix motif at the exon2/exon3 boundary. The amino acid (aa) sequence and secondary structure of this motif are highly conserved among several nuclearly localized oncogene products, c-Myc, N-Myc, c-Fos, SV40 large T and adenovirus (Ad) Ela. Removal of this region from Ad E1a results in the loss of the transforming properties of the virus without destroying its known transregulatory functions. In order to analyse whether deletion of the above-mentioned region from c-Myc has a similar effect on its transformation activity, we constructed a deletion mutant (c-myc delta) lacking the respective aa at the exon2/exon3 boundary. In contrast to the c-myc wild-type gene product, constitutive expression of c-myc delta does not lead to the immortalization of primary mouse embryo fibroblast cells (MEF cells). This result indicates that c-Myc and Ad El a share a common domain which is involved in the transformation process by both oncogenes.
  aa|amino acid|0.99818
  Ad|adenovirus|0.96935
  MEF cells|mouse embryo fibroblast cells|0.994648
The first line is the ID, the second line is the title, the third line is usually the abstract, and the last lines (if there are any) are abbreviations. Each abbreviation line starts with two spaces, followed by the abbreviation, its meaning, and a number, separated by "|". For example:
  GA|general anesthesia|0.99818
Then there is a blank line and the pattern starts again: ID, Title, Abstract, Abbreviations or ID, Title, Abbreviations, Abstract.
I need to take this data and convert it to a TSV file like this:
12018411	TAI	timed artificial insemination
8406022	aa	amino acid
8406022	Ad	adenovirus
...	...	...
First column ID, second column Abbreviation, and third column Meaning of this abbreviation.
I tried to load the data into a DataFrame first and then convert it to TSV, but I don't know how to pull the information out of the text with the structure I need.
I also tried this code:
from collections import namedtuple
import pandas as pd
Item = namedtuple('Item', 'ID')
items = []
with open("identify_abbr-out.txt", "r", encoding='UTF-8') as f:
    lines = f.readlines()
    for line in lines:
        if line == '\n':
            ID = ¿nextline?
        if line.startswith("  "):
            Abbreviation = line
            items.append(Item(ID, Abbreviation))
df = pd.DataFrame.from_records(items, columns=['ID', 'Abbreviation'])
But I don't know how to read the next line, and the code does not work because sometimes there are blank lines between the title and the abstract.
I'm using python 3.8
Thank you very much in advance.
Assuming test.txt has your input data, I used simple file read functions to process the data:
with open('test.txt', 'r') as file1:
    lines = file1.readlines()
outputlines = []
outputline = ""
counter = 0
for l in lines:
    if l.strip() == "":
        # Blank line: the next non-blank line starts a new record
        outputline = ""
        counter = 0
    elif counter == 0:
        # First line of a record is the ID
        outputline = outputline + l.strip() + "|"
        counter = counter + 1
    elif counter == 1:
        # Second line is the title; skip it
        counter = counter + 1
    else:
        # Abbreviation rows start with two spaces and have three "|"-separated fields
        if len(l.split("|")) == 3 and l[0:2] == "  ":
            outputlines.append(outputline + l.strip() + "\n")
        counter = counter + 1
with open('myfile.txt', 'w') as file1:
    file1.writelines(outputlines)
Here the file is read line by line. A counter is kept and reset whenever there is a blank line, and the ID is read from the line that follows. If a row has three "|"-separated fields and starts with two spaces, it is exported together with the ID.
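The code above writes "|"-separated rows and relies on blank lines appearing only between records. If you want real tab-separated output and blank lines can also show up inside a record, a variation could look like the sketch below. It assumes that a line consisting only of digits is always a record ID and that no title or abstract is purely numeric. The names parse_abbreviations and write_tsv are made up for this example.

```python
import csv


def parse_abbreviations(lines):
    """Yield (record_id, abbreviation, meaning) tuples from an iterable of lines."""
    record_id = None
    for raw in lines:
        stripped = raw.strip()
        if not stripped:
            continue  # blank lines carry no information here
        if stripped.isdigit():
            # Assumption: a digits-only line is the ID of a new record
            record_id = stripped
            continue
        parts = stripped.split("|")
        # Abbreviation rows start with two spaces and have exactly three fields
        if record_id is not None and raw.startswith("  ") and len(parts) == 3:
            yield record_id, parts[0], parts[1]


def write_tsv(in_path, out_path):
    """Read the plain-text file and write ID, abbreviation, meaning as TSV."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        csv.writer(dst, delimiter="\t").writerows(parse_abbreviations(src))
```

Calling write_tsv("identify_abbr-out.txt", "abbreviations.tsv") would then produce one row per abbreviation; the output filename is only an example.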

Replace Single Character in a line of a Text file with Python

I have a text file in which every line currently ends with the same character (N), which is used to track the progress the system makes. I want to change the end character to "Y" once a line has been processed, so that if the program ends via an error or other interruption, upon restarting it will search until it finds a line with the end character "N" and begin working from there. Below is my code as well as a sample from the text file.
UPDATED CODE:
from geopy.geocoders import Nominatim

def GeoCode():
    f = open("geocodeLongLat.txt", "a")
    with open("CstoGC.txt",'r') as file:
        print("Geocoding...")
        new_lines = []
        for line in file.readlines():
            check = line.split('~')
            print(check)
            if 'N' in check[-1]:
                geolocator = Nominatim()
                dot_number, entry_name, PHY_STREET,PHY_CITY,PHY_STATE,PHY_ZIP = check[0],check[1],check[2],check[3],check[4],check[5]
                address = PHY_STREET + " " + PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                f.write(dot_number + '\n')
                try:
                    location = geolocator.geocode(address)
                    f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                except AttributeError:
                    try:
                        address = PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                        location = geolocator.geocode(address)
                        f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                    except AttributeError:
                        print("Cannot Geocode")
                check[-1] = check[-1].replace('N','Y')
            new_lines.append('~'.join(check))
    with open('CstoGC.txt','r+') as file: # IMPORTANT to open as 'r+' mode as 'w/w+' will truncate your file!
        for line in new_lines:
            file.writelines(line)
    f.close()
Output:
2967377~DARIN COLE~22112 TWP RD 209~ALVADA~OH~44802~Y
WAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
143608~LARRY A PETERSON & DONNA M PETERSON~W6359 450TH AVE~ELLSWORTH~WI~54011~N
635528~JAMES E WEBB~3926 GREEN ROAD~SPRINGFIELD~TN~37172~N
805496~WAYNE MLADY~22272 135TH ST~CRESCO~IA~52136~N
704996~SAVINA C MUNIZ~814 W LA QUINTA DR~PHARR~TX~78577~N
893169~BINDEWALD MAINTENANCE INC~213 CAMDEN DR~SLIDELL~LA~70459~N
948130~LOGISTICIZE LTD~861 E PERRY ST~PAULDING~OH~45879~N
438760~SMOOTH OPERATORS INC~W8861 CREEK ROAD~DARIEN~WI~53114~N
518872~A B C RELOCATION SERVICES INC~12 BOCKES ROAD~HUDSON~NH~03051~N
576143~E B D ENTERPRISES INC~29 ROY ROCHE DRIVE~WINNIPEG~MB~R3C 2E6~N
968264~BRIAN REDDEMANN~706 WESTGOR STREET~STORDEN~MN~56174-0220~N
721468~QUALITY LOGISTICS INC~645 LEONARD RD~DUNCAN~SC~29334~N
As you can see I am already keeping track of which line I am at just by using x. Should I use something like file.readlines()?
Sample of text document:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
Thank you!
Edit: updated code thanks to @idlehands
There are a few ways to do this.
Option #1
My original thought was to use the tell() and seek() method to go back a few steps but it quickly shows that you cannot do this conveniently when you're not opening the file in bytes and definitely not in a for loop of readlines(). You can see the reference threads here:
Is it possible to modify lines in a file in-place?
How to solve "OSError: telling position disabled by next() call"
The investigation led to this piece of code:
with open('file.txt','rb+') as file:
    line = file.readline() # initiate the loop
    while line: # continue until readline() returns an empty bytes object at end of file
        print(line)
        check = line.split(b'~')[-1]
        if check.startswith(b'N'): # check still includes the line ending, so test only its first byte
            # ... do stuff ... #
            file.seek(-len(check), 1) # move back to the start of the flag field
            file.write(check.replace(b'N', b'Y')) # replace "N" with "Y"
        line = file.readline() # read next line
In the first referenced thread one of the answers mentioned this could lead you to potential problems, and directly modifying the bytes on the buffer while reading it is probably considered a bad idea™. A lot of pros probably will scold me for even suggesting it.
Option #2a
(if file size is not horrendously huge)
with open('file.txt','r') as file:
    new_lines = []
    for line in file.readlines():
        check = line.split('~')
        if 'N' in check[-1]:
            # ... do stuff ... #
            check[-1] = check[-1].replace('N','Y')
        new_lines.append('~'.join(check))
with open('file.txt','r+') as file: # IMPORTANT to open as 'r+' mode as 'w/w+' will truncate your file!
    for line in new_lines:
        file.writelines(line)
This approach loads all the lines into memory first, so you do the modification in memory but leave the buffer alone. Then you reload the file and write the lines that were changed. The caveat is that technically you are rewriting the entire file line by line - not just the string N even though it was the only thing changed.
Option #2b
Technically you could open the file as r+ mode from the onset and then after the iterations have completed do this (still within the with block but outside of the loop):
    # ... new_lines.append('~'.join(check)) #
    file.seek(0)
    for line in new_lines:
        file.writelines(line)
I'm not sure what distinguishes this from Option #1 since you're still reading and modifying the file in the same go. If someone more proficient in IO/buffer/memory management wants to chime in please do.
The disadvantage of Option 2a/b is that you always end up storing and rewriting every line in the file, even if only a few lines still need to be updated from 'N' to 'Y'.
Results (for all solutions):
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~Y
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~Y
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~Y
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~Y
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~Y
And if you were to say, encountered a break at the line starting with 220940, the file would become:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
There are pros and cons to these approaches. Try and see which one fits your use case the best.
I would read the entire input file into a list and .pop() the lines off one at a time. In case of an error, append the popped item back to the list and overwrite the input file. This way it will always be up to date and you won't need any other logic.
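If rewriting the whole file is acceptable, another way to avoid a half-written file is to write the updated lines to a temporary file and swap it in with os.replace once everything is written. This is only a sketch: mark_done and the process callback are made-up names, and it assumes the flag is the last "~"-separated field and is exactly "N" or "Y".

```python
import os
import tempfile


def mark_done(path, process, sep="~"):
    """Run process(fields) on each pending line and flag it "Y" on success.

    Processing stops at the first failure; the remaining lines keep their
    "N" flag so a later run can pick up from there.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as tmp, open(path) as src:
            failed = False
            for line in src:
                fields = line.rstrip("\n").split(sep)
                if not failed and fields[-1] == "N":
                    try:
                        process(fields)
                        fields[-1] = "Y"
                    except Exception:
                        failed = True  # leave this and all later lines untouched
                tmp.write(sep.join(fields) + "\n")
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        os.unlink(tmp_path)  # clean up; the original file is left as it was
        raise
```

The trade-off is the same as in Option 2a/b: every line is rewritten on each run. Note also that an interruption such as Ctrl-C discards the progress of the current run, because the original file is only replaced at the very end.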

Rewriting code with an array

I have a file X_true that consists of sentences like these:
evid emerg interview show done deal
munich hamburg train crash wednesday first gener ice model power two electr power locomot capac 759 passeng
one report earlier week said older two boy upset girlfriend broken polic confirm
jordan previous said
Now instead of storing these sentences in a file, I wish to put them in an array(List of strings) to work with them throughout the code. So the array would look something like this:
['evid emerg interview show done deal',
'munich hamburg train crash wednesday first gener ice model power two electr power locomot capac 759 passeng',
'one report earlier week said older two boy upset girlfriend broken polic confirm',
'jordan previous said']
Earlier when working with the file, this was the code I was using:
import subprocess

def run(command):
    output = subprocess.check_output(command, shell=True)
    return output
row = run('cat '+'/Users/mink/X_true.txt'+" | wc -l").split()[0]
Now that I am working with X_true as an array, how can I write an equivalent statement for the row assignment above?
Use len(X_true_array), where X_true_array is the list holding your file's contents.
Previously you used wc -l to get the line count of your file; here the line count is simply the number of items in the list.
So I understand this correctly, you just want to read in a file and store each line as an element of an array?
X_true = []
with open("X_true.txt") as f:
    for line in f:
        X_true.append(line.strip())
Another option (thanks @roeland):
with open("X_true.txt") as f:
    X_true = list(map(str.strip, f))
with open("X_true.txt") as f:
    X_true = f.readlines()
or with stripping the newline character:
X_true = [line.rstrip('\n') for line in open("X_true.txt")]
Refer to Input and Output.
Try this:
Using readlines
X_true = open("x_true.txt").readlines()
Using read:
X_true = open("x_true.txt").read().split("\n")
Using List comprehension:
X_true = [line.rstrip() for line in open("x_true.txt")]
with open("X_true.txt") as f:
    array_of_lines = f.readlines()
array_of_lines will look like your example above. Note: each string in the array will still have the newline character at the end. Those can be removed with str.strip() if they're a concern.
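The variants above differ slightly in what they do with line endings, which matters if you then use len() as the replacement for wc -l. A small sketch to show the difference (read_lines is a made-up helper name):

```python
def read_lines(path):
    """Return the lines of a text file without their line endings."""
    with open(path) as f:
        return f.read().splitlines()


text = "evid emerg interview\njordan previous said\n"

# split("\n") leaves an empty string after the final newline
print(text.split("\n"))   # ['evid emerg interview', 'jordan previous said', '']

# splitlines() does not, so its length matches the number of lines
print(text.splitlines())  # ['evid emerg interview', 'jordan previous said']
```

With read_lines, len(read_lines("X_true.txt")) gives the same number as wc -l as long as the file ends with a newline. wc -l counts newline characters, so for a file whose last line has no trailing newline it reports one less than len().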

Finding common elements between two files

I have two different files as follows:
file1.txt is tab-delimited
AT5G54940.1 3182
pfam
PF01253 SUI1#Translation initiation factor SUI1
mf
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
bp
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 4996
pfam
PF01575 MaoC_dehydratas#MaoC like domain
mf
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01 560919
and file2.txt that contains different protein names,
GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01
I need a program that finds the protein names from file2 that are present in file1 and also prints all the "GO:" terms that belong to each protein, if any. The difficult part for me is parsing the first file, because the format is strange. I tried something like this, but any other approaches are very much appreciated:
import re
with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)
with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)
Desired output,whichever format is OK as long as I have all corresponding GO's
AT5G54940.1 GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01
Just to give you an idea how you might want to tackle this. A "group" belonging to one protein in your input file is delimited by a change from indented lines to a non-indented one. Search for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name. All other lines might be GO: lines.
You can detect indentation by using if line.startswith(" ") (instead of " " you might look for "\t", depending on your input file format).
def get_protein_chunks(filepath):
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            # A line counts as indented if it starts with a space
            current_indented = line.startswith(" ")
            # Going from an indented line to a non-indented one closes a chunk
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
    # Note: a chunk is only yielded once a non-indented line follows it
look_for_proteins = set(line.strip() for line in open('file2.txt'))
for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)
Here, a chunk is nothing but a list of stripped lines. I extract the protein chunks from the input file with a generator. As you can see, the logic is based only on the transition from indented line to non-indented line.
When using the generator you can do with the data whatever you want to. I simply printed it. However, you might want to put the data into a dictionary and do further analysis.
Output:
$ python test.py
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
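One option would be to build up a dictionary of lists, using the name of the protein as the key:
#!/usr/bin/env python
import pprint
pp = pprint.PrettyPrinter()
proteins = set(line.strip() for line in open('file2.txt'))
d = {}
key = None
with open('file1.txt') as file:
    for line in file:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if not line[0].isspace():
            # Non-indented line: a new protein record starts here
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].split(':')[0] == 'GO':
            # GO line belonging to a protein we are looking for
            d[key].append(line.strip())
pp.pprint(d)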
I've used the pprint module to print the dictionary, as you said you weren't too fussy about the format. The output as it stands is:
{'AT5G54940.1': ['GO:0003743 translation initiation factor activity',
'GO:0008135 translation factor activity, nucleic acid binding',
'GO:0006413 translational initiation',
'GO:0006412 translation',
'GO:0044260 cellular macromolecule metabolic process'],
'GRMZM2G158629_P02': ['GO:0016491 oxidoreductase activity',
'GO:0033989 3alpha,7alpha,']}
edit
Instead of using pprint, you could obtain the output specified in the question using a loop:
with open('out.txt', 'w') as out:
    for k, v in d.items():
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))
out.txt:
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
