Finding common elements between two files

Finding common elements between two files - python

I have two different files as follows:
file1.txt is tab-delimited
AT5G54940.1 3182
pfam
PF01253 SUI1#Translation initiation factor SUI1
mf
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
bp
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 4996
pfam
PF01575 MaoC_dehydratas#MaoC like domain
mf
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01 560919
and file2.txt that contains different protein names,
GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01
I need to run a program, that finds me proteins names that are present in file2 from file1 but also prints me all "GO:" that appertains to that protein, if applicable. The difficult part for me is parsing the 1st file..the format is strange. I tried something like this,but any other ways are very much appreciated,
import re
with open('file2.txt') as mylist:
proteins = set(line.strip() for line in mylist)
with open('file1.txt') as mydict:
with open('a.txt', 'w') as output:
for line in mydict:
new_list = line.strip().split()
protein = new_list[0]
if protein in proteins:
if re.search(r'GO:\d+', line):
output.write(protein+'\t'+line)
Desired output,whichever format is OK as long as I have all corresponding GO's
AT5G54940.1 GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01

Just to give you an idea how you might want to tackle this. A "group" belonging to one protein in your input file is delimited by a change from indented lines to a non-indented one. Search for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name. All other lines might be GO: lines.
You can detect indention by using if line.startswith(" ") (instead of " " you might look for "\t", depending on your input file format).
def get_protein_chunks(filepath):
chunk = []
last_indented = False
with open(filepath) as f:
for line in f:
if not line.startswith(" "):
current_indented = False
else:
current_indented = True
if last_indented and not current_indented:
yield chunk
chunk = []
chunk.append(line.strip())
last_indented = current_indented
look_for_proteins = set(line.strip() for line in open('file2.txt'))
for p in get_protein_chunks("input.txt"):
proteinname = p[0].split()[0]
proteindata = p[1:]
if proteinname not in look_for_proteins:
continue
print "Protein: %s" % proteinname
golines = [l for l in proteindata if l.startswith("GO:")]
for g in golines:
print g
Here, a chunk is nothing but a list of stripped lines. I extract the protein chunks from the input file with a generator. As you can see, the logic is based only on the transition from indented line to non-indented line.
When using the generator you can do with the data whatever you want to. I simply printed it. However, you might want to put the data into a dictionary and do further analysis.
Output:
$ python test.py
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,

One option would be to build up a dictionary of lists, using the name of the protein as the key:
#!/usr/bin/env python
import pprint
pp = pprint.PrettyPrinter()
proteins = set(line.strip() for line in open('file2.txt'))
d = {}
with open('file1.txt') as file:
for line in file:
line = line.strip()
parts = line.split()
if parts[0] in proteins:
key = parts[0]
d[key] = []
elif parts[0].split(':')[0] == 'GO':
d[key].append(line)
pp.pprint(d)
I've used the pprint module to print the dictionary, as you said you weren't too fussy about the format. The output as it stands is:
{'AT5G54940.1': ['GO:0003743 translation initiation factor activity',
'GO:0008135 translation factor activity, nucleic acid binding',
'GO:0006413 translational initiation',
'GO:0006412 translation',
'GO:0044260 cellular macromolecule metabolic process'],
'GRMZM2G158629_P02': ['GO:0016491 oxidoreductase activity',
'GO:0033989 3alpha,7alpha,']}
edit
Instead of using pprint, you could obtain the output specified in the question using a loop:
with open('out.txt', 'w') as out:
for k,v in d.iteritems():
out.write('Protein: {}\n'.format(k))
out.write('{}\n'.format('\n'.join(v)))
out.txt:
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process

Related

Obtain tsv from text with a specific pattern

I'm a biologist and I need to take information on a text file
I have a file with plain text like that:
12018411
Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
TAI|timed artificial insemination|0.999808
Two estrus-synchronization programs were compared and factors influencing their success over a year were evaluated. All cows received a setup injection of PGF2alpha at 39 +/- 3 d postpartum. Fourteen days later they received GnRH, followed in 7 d by a second injection of PGF2alpha. Cows (n = 523) assigned to treatment 1 (modified targeted breeding) were inseminated based on visual signs of estrus at 24, 48, or 72 h after the second PGF2alpha injection. Any cow not observed in estrus was inseminated at 72 h. Cows (n = 440) assigned to treatment 2 received a second GnRH injection 48 h after the second PGF2alpha, and all were inseminated 24 h later. Treatment, season of calving, multiple birth, estrual status at insemination, number of occurrences of estrus before second PGF2alpha, prophylactic use of PGF2alpha, retained fetal membranes, and occurrence of estrus following the setup PGF2alpha influenced success. Conception rate was 31.2% (treatment 1) and 29.1% (treatment 2). A significant interaction occurred between protocol and estrual status at insemination. Cows in estrus at insemination had a 45.8% (treatment 1) or 35.4% (treatment 2) conception rate. The conception rate for cows not expressing estrus at insemination was 19.2% (treatment 1) and 27.7% (treatment 2). Provided good estrous detection exists, modified targeted breeding can be as successful as other timed artificial insemination programs. Nutritional, environmental, and management strategies to reduce postpartum disorders and to minimize the duration of postpartum anestrus are critical if synchronization schemes are used to program first insemination after the voluntary waiting period.
8406022
Deletion of the beta-turn/alpha-helix motif at the exon 2/3 boundary of human c-Myc leads to the loss of its immortalizing function.
The protein product (c-Myc) of the human c-myc proto-oncogene carries a beta-turn/alpha-helix motif at the exon2/exon3 boundary. The amino acid (aa) sequence and secondary structure of this motif are highly conserved among several nuclearly localized oncogene products, c-Myc, N-Myc, c-Fos, SV40 large T and adenovirus (Ad) Ela. Removal of this region from Ad E1a results in the loss of the transforming properties of the virus without destroying its known transregulatory functions. In order to analyse whether deletion of the above-mentioned region from c-Myc has a similar effect on its transformation activity, we constructed a deletion mutant (c-myc delta) lacking the respective aa at the exon2/exon3 boundary. In contrast to the c-myc wild-type gene product, constitutive expression of c-myc delta does not lead to the immortalization of primary mouse embryo fibroblast cells (MEF cells). This result indicates that c-Myc and Ad El a share a common domain which is involved in the transformation process by both oncogenes.
aa|amino acid|0.99818
Ad|adenovirus|0.96935
MEF cells|mouse embryo fibroblast cells|0.994648
The first line is the id, the second line is the title, the third line used to be the abstract (sometimes there are abbreviations) and the lasts lines (if there are) are abbreviations with double space, the abbreviation, the meaning, and a number. You can see :
GA|general anesthesia|0.99818
Then there is a line in blank and start again: ID, Title, Abstract, Abbreviations or ID, Title, Abbreviations, Abstract.
And I need to take this data and convert to a TSV file like that:
12018411 TAI timed artificial insemination
8406022 aa amino acids
8406022 Ad adenovirus
... ... ...
First column ID, second column Abbreviation, and third column Meaning of this abbreviation.
I tried to convert first in a Dataframe and then convert to TSV but I don't know how take the information of the text with the structure I need.
And I tried with this code too:
from collections import namedtuple
import pandas as pd
Item= namedtuple('Item', 'ID')
items = []
with open("identify_abbr-out.txt", "r", encoding='UTF-8') as f:
lines= f.readlines()
for line in lines:
if line== '\n':
ID= ¿nextline?
if line.startswith(" "):
Abbreviation = line
items.append(Item(ID, Abbreviation))
df = pd.DataFrame.from_records(items, columns=['ID', 'Abbreviation'])
But I don't know how to read the next line and the code not found because there are some lines in blank in the middle between the corpus and the title sometimes.
I'm using python 3.8
Thank you very much in advance.

Assuming test.txt has your input data, I used simple file read functions to process the data -
file1 = open('test.txt', 'r')
Lines = file1.readlines()
outputlines = []
outputline=""
counter = 0
for l in Lines:
if l.strip()=="":
outputline = ""
counter = 0
elif counter==0:
outputline = outputline + l.strip() + "|"
counter = counter + 1
elif counter==1:
counter = counter + 1
else:
if len(l.split("|"))==3 and l[0:2]==" " :
outputlines.append(outputline + l.strip() +"\n")
counter = counter + 1
file1 = open('myfile.txt', 'w')
file1.writelines(outputlines)
file1.close()
Here file is read, line by line, a counter is kept and reset when there is a blank line, and ID is read in just next line. If there are 3 field "|" separated row, with two spaces in beginning, row is exported with ID

Parsing numbers in strings from a file

I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
what I want to do, is check that the % of each "res" gets to 100%
def addPoll(pid,party,state,res,filetype):
with open('Polls.txt', 'a+') as file: # open file temporarly for writing and reading
lines = file.readlines() # get all lines from file
file.seek(0)
next(file) # go to next line --
#this is suppose to skip the 1st line with pid/pary/state/res
for line in lines: # loop
line = line.split(',', 3)[3]
y = line.split()
print y
#else:
#file.write(pid + "," + party + "," + state + "," + res+"\n")
#file.close()
return "pass"
print addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f')
So in my code I manage to split the last ',' and enter it into a list, but im not sure how I can get only the numbers out of that text.

You can use regex to find all the numbers:
import re
for line in lines:
numbers = re.findall(r'\d+', line)
numbers = [int(n) for n in numbers]
print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we cast to a list of ints, then take the sum.

It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
reader = csv.DictReader(f)
for row in reader:
# Do something with row['res']
pass
For the # Do something part, you can either parse the field manually (it appears to be structured): split('-') and then rsplit(' ', 1) each - separated part (the last thing should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but regex are also a fine solution too for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
# Handle bad percent (not ending in %)
pass
else:
# Throws ValueError if any of the percents aren't integers
percents = [int(p[:-1]) for p in percents]
if sum(percents) != 100:
# Handle bad total
pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
# Handle bad total here
pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!

extracting part of text from file in python

I have a collection of text files that are of the form:
Sponsor : U of NC Charlotte
U N C C Station
Charlotte, NC 28223 704/597-2000
NSF Program : 1468 MANUFACTURING MACHINES & EQUIP
Fld Applictn: 0308000 Industrial Technology
56 Engineering-Mechanical
Program Ref : 9146,MANU,
Abstract :
9500390 Patterson This award supports a new concept in precision metrology,
the Extreme Ultraviolet Optics Measuring Machine (EUVOMM). The goals for this
system when used to measure optical surfaces are a diameter range of 250 mm
with a lateral accuracy of 3.3 nm rms, and a depth range of 7.5 mm w
there's more text above and below the snippet. I want to be able to do the following, for each text file:
store the NSF program, and Fld Applictn numbers in a list, and store the associated text in another list
so, in the above example I want the following, for the i-th text file:
y_num[i] = 1468, 0308000, 56
y_txt[i] = MANUFACTURING MACHINES & EQUIP, Industrial Technology, Engineering-Mechanical
Is there a clean way to do this in python? I prefer python since I am using os.walk to parse all the text files stored in subdirectories.

file = open( "file","r")
for line in file.readlines():
if "NSF" in line:
values= line.split(":")
elif "Fld" in line:
values1 = line.split(":")
So values and values1 has the specific values which you are intetested

You can try something like
yourtextlist = yourtext.split(':')
numbers = []
for slice in yourtextlist:
l = slice.split()
try:
numbers.append(int(l[0]))
except ValueError:
pass

Python 'for loop' to parse results

I am a beginning python user (trying to learn for bioinformatics) and I am having difficulties in getting my final 'for loop' correct. I have used a web-based bioinformatic program to assess the subcellular localization of certain proteins (protein names and sequences contained within ORFs) and I am trying to parse the results (contained within targetp). The web-based program that I've used truncates the names of the proteins (and does not include sequences), and I would like to parse my results file such that I have the complete name and sequence of each protein in FASTA format (this entails having a '>' + the protein name on one line, and the protein sequence on the subsequent line). I think that everything is going well until the last block of code; I end up with the proper protein names, but they are all appended to the same sequence. I know that there must be something simple that I am doing wrong, but I just can't figure it out. Any ideas?
Thanks!
The ORFs file looks like this (it's FASTA, but the " shouldn't be there, only >):
">HsaNP_000700 branched chain keto acid dehydrogenase E1, alpha polypeptide
MAVAIAAARVWRLNRGLSQAALLLLRQPGARGLARSHPPRQQQQFSSLDDKPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKEKVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDNTDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKERHFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGFNFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDGNDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDEVNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERKPKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
">HsaNP_060914 pyruvate dehydrogenase phosphatase precursor
MPAPTQLFFPLIRNCELSRIYGTACYCHHKHLCCSSSYIPQSRLRYTPHPAYATFCRPKENWWQYTQGRRYASTPQKFYLTPPQVNSILKANEYSFKVPEFDGKNVSSILGFDSNQLPANAPIEDRRSAATCLQTRGMLLGVFDGHAGCACSQAVSERLFYYIAVSLLPHETLLEIENAVESGRALLPILQWHKHPNDYFSKEASKLYFNSLRTYWQELIDLNTGESTDIDVKEALINAFKRLDNDISLEAQVGDPNSFLNYLVLRVAFSGATACVAHVDGVDLHVANTGDSRAMLGVQEEDGSWSAVTLSNDHNAQNERELERLKLEHPKSEAKSVVKQDRLLGLLMPFRAFGDVKFKWSIDLQKRVIESGPDQLNDNEYTKFIPPNYHTPPYLTAEPEVTYHRLRPQDKFLVLATDGLWETMHRQDVVRIVGEYLTGMHHQQPIAVGGYKVTLGQMHGLLTERRTKMSSVFEDQNAATHLIRHAVGNNEFGTVDHERLSKMLSLPEELARMYRDDITIIVVQFNSHVVGAYQNQE
The targetp file looks like this (the M is in position 57, but the formatting here throws this off):
HsaNP_000700 445 0.939 0.020 0.089 M 1
HsaNP_060914 537 0.309 0.073 0.629 _ 4
The leftmost column in targetp is the identifier (part of the header line in each protein sequence above), and I want to return only entries with an 'M' (i.e., not '_') in position 57, along with the protein name from ORFs (header line).
My script is:
#!/usr/bin/python
ORFs = open('Human.MitoCarta.fasta', 'U')
targetp = open('MitoCarta_TargetP_combined.out', 'U')
report = targetp.readlines()
protfile = open('mitocarta_no_mTP.fasta','w')
protid = []
seqdict = {}
for seq in ORFs:
seq = seq.rstrip()
if seq[0] == '':
continue
if seq[0] == '>':
name = seq[1:]
seqdict[name] = ''
continue
seqdict[name] += seq
for entry in report:
if entry.startswith('HsaNP'):
if entry[57] != 'M':
protid.append(entry[0:20])
protid = [x.strip(' ') for x in protid]
nameslist = seqdict.keys()
c = 0
for i in protid:
if i in nameslist[c]:
protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[name]))
c += 1
protfile.close()

Yes, you are writing nameslist[c] and seqdict[name] but you never change 'name'. So you need to change 'name' if you want to get the different sequences. You should write:
protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[nameslist[c]]))
That way you should get it right.

Using Python to count morphemes in a file & list them in columns

This is the script that has very kindly been given to me as a starter:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import with_statement # needed for Python 2.5
from itertools import chain
def chunk(s):
"""Split a string on whitespace or hyphens"""
return chain(*(c.split("-") for c in s.split()))
def process(latin, gloss, trans):
chunks = zip(chunk(latin), chunk(gloss))
# now you have to DO SOMETHING with the chunks!
def main():
with open("examples.txt") as inf:
try:
while True:
latin = inf.next().strip()
gloss = inf.next().strip()
trans = inf.next().strip()
process(latin, gloss, trans)
inf.next() # skip blank line
except StopIteration:
# reached end of file
pass
if __name__=="__main__":
main()
I'm not sure if I'm missing anything, but the output is simply blank, taking me back to $.
I'm trying to do the following:
I have a text with a language other than English, broken up into morphemes (parts of each word) using hyphens, with the English gloss (linguistic translation of each morpheme) and a direct translation below.
eg.
Itali-am fat-o profug-us Lavini-a-que ven-it
Italy-Fem:Sg:Acc fate-Neut:Sg:Abl fleeing-Masc:Sg:Nom Lavinian-Neut:Pl:Acc come:Perf-3-Sg:Indic:Act
'in flight [driven] by fate came to Italy and the Lavinian [shores]'
I'll have several texts such as the above in one file - i.e.
blank line
a line of latin broken up with hyphens
a line of gloss broken up with corresponding hyphens, using colons to join elements
a line of translation
blank line
latin
gloss
translation
ad infinitum.
What I need to do is write a file that gives me the following output:
Itali: 1 Italy
am: 1 Fem:Sg:Acc
fat: 1 fate
o: 1 Neut:Sg:Abl
profug: 1 fleeing
us: 1 Masc:Sg:Nom
Lavini: 1 Lavinian
a: 1 Neug:Pl:Acc
que: 1 come:Perf
ven: 1 3
it: 1 Sg:Indic:Act
where the first column represents the first line of text without hyphens; the second column indicates the number of occurrences (it's only 1 each in this example), and the third column is the English translation of the first column, as written in the text.
If there's a latin morpheme with no corresponding English gloss/translation, the Latin column will be as normal but the English column will print [unknown], like:
a: 1 [unknown]
And if the opposite, i.e. an English morpheme with no corresponding Latin, it should print
[unknown]: 1 kitten
Finally, the prog needs to be able to deal with homophonous morphemes (i.e. two identically spelled latin morphemes with different meanings).
e.g.
a: 16 Neuter:Plural
a: 28 Feminine:Singular
Again, it's homework, and any pointers would be wonderful. Working on putting together some script now to upload here for critique!

Processing the file is a bit tricky because of the multiline structure; rather than the usual line-by-line iteration, I suggest something like this (I presume the file does not actually begin with a blank line, per your example, but that it uses blank lines as separators):
with open("input.txt") as inf:
try:
while True:
latin = inf.next().strip()
gloss = inf.next().strip()
trans = inf.next().strip()
process(latin, gloss, trans)
inf.next() # skip blank line
except StopIteration:
# reached end of file
pass
process must then split latin and gloss into chunks and pair them appropriately:
from itertools import chain
def chunk(s):
"""Split a string on whitespace or hyphens"""
return chain(*(c.split("-") for c in s.split()))
def process(latin, gloss, trans):
chunks = zip(chunk(latin), chunk(gloss))
Calling this like
process(
"Itali-am fat-o profug-us Lavini-a-que ven-it",
"Italy-Fem:Sg:Acc fate-Neut:Sg:Abl fleeing-Masc:Sg:Nom Lavinian-Neut:Pl:Acc come:Perf-3-Sg:Indic:Act",
"in flight [driven] by fate came to Italy and the Lavinian [shores]")
leaves chunks containing
[('Itali', 'Italy'),
('am', 'Fem:Sg:Acc'),
('fat', 'fate'),
('o', 'Neut:Sg:Abl'),
('profug', 'fleeing'),
('us', 'Masc:Sg:Nom'),
('Lavini', 'Lavinian'),
('a', 'Neut:Pl:Acc'),
('que', 'come:Perf'),
('ven', '3'),
('it', 'Sg:Indic:Act')]
The rest is an exercise for the student - keeping a running count of the chunks, then sorting and displaying it appropriately. Hope that helps!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Finding common elements between two files - python

Related

Obtain tsv from text with a specific pattern

Parsing numbers in strings from a file

extracting part of text from file in python

Python 'for loop' to parse results

Using Python to count morphemes in a file & list them in columns

Categories

Resources