Difflib displays wrong output - python

I'm trying to compare two files containing key-value pairs. Here is an example of such pairs:
ORIGINAL:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
FILE 2:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
When I run both files through difflib.Differ().compare(), I get the following output, which is wrong:
'- z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n', '+ z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n'
If you look closely you'll see that the line from both files is extremely similar with some minor differences. For some reason difflib doesn't recognize the similar characters and just gives the whole line as a difference.
Does anyone have a solution for this problem?
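One possible explanation, offered as an assumption rather than a confirmed diagnosis: difflib.SequenceMatcher applies an "autojunk" heuristic to sequences of 200 or more items, which ignores items that occur very often. In a long line of digits almost every character is that frequent, so the similarity ratio can come out very low, and Differ then reports the two lines as a plain removal and addition instead of marking the characters that changed. The sketch below uses made-up lines shaped like the ones in the question (line_a, line_b and similarity are illustrative names) and compares the ratio with the heuristic on and off.

```python
import difflib

# Hypothetical stand-in for the long lines in the question: 40 repeated numbers,
# with the last digit of two of them changed in the second line.
number = "1.0045236668616387e-06"
changed = "1.0045236668616389e-06"
values_a = [number] * 40
values_b = list(values_a)
values_b[1] = changed
values_b[38] = changed
line_a = "z = [" + ", ".join(values_a) + "]"
line_b = "z = [" + ", ".join(values_b) + "]"


def similarity(a, b, autojunk):
    """Return SequenceMatcher's ratio with the autojunk heuristic on or off."""
    return difflib.SequenceMatcher(None, a, b, autojunk=autojunk).ratio()


with_heuristic = similarity(line_a, line_b, autojunk=True)
without_heuristic = similarity(line_a, line_b, autojunk=False)
print("autojunk=True :", with_heuristic)
print("autojunk=False:", without_heuristic)
```

As far as I know, Differ does not take an autojunk argument, so one workaround is to compare the two lines directly with SequenceMatcher(None, a, b, autojunk=False) and read the differences from get_opcodes().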

Related

I'm trying to make a simple script that says two different two phrase lines(Python)

So, I'm just starting to program in Python and I wanted to make a very simple script that will say something like "Gabe- Hello, my name is Gabe" (just an example of a sentence) + "Jerry- Hello Gabe, I'm Jerry" OR "Gabe- Goodbye, Jerry" + "Jerry- Goodbye, Gabe". Here's pretty much what I wrote.
answers1 = [
"James-Hello, my name is James!"
]
answers2 = [
"Jerry-Hello James, my name is Jerry!"
]
answers3 = [
"Gabe-Goodbye, Samuel."
]
answers4 = [
"Samuel-Goodbye, Gabe"
]
Jack1 = (answers1 + answers2)
Jack2 = (answers3 + answers4)
Jacks = ([Jack1,Jack2])
import random
for x in range(2):
    a = random.randint(0,2)
    print (random.sample([Jacks, a]))
I'm quite sure it's a very simple fix, but as I have just started Python (Like, literally 2-3 days ago) I don't quite know what the problem would be. Here's my error message
Traceback (most recent call last):
  File "C:/Users/Owner/Documents/Test Python 3.py", line 19, in <module>
    print (random.sample([Jacks, a]))
TypeError: sample() missing 1 required positional argument: 'k'
If anyone could help me with this, I would very much appreciate it! Other than that, I shall be searching on ways that may be relevant to fixing this.
The problem is that sample requires a parameter k that indicates how many random samples you want to take. However, in this case it looks like you do not need sample, since you already have the random integer. Note that the integer should be in the range [0,1] (randint includes both endpoints), because the list Jacks has only two elements.
a = random.randint(0,1)
print (Jacks[a])
Or get similar behavior with sample, which returns a list containing the one chosen element:
print (random.sample(Jacks,1))
Hope this helps!
random.sample([Jacks, a])
The call to sample should look like this:
random.sample(Jacks, a)
However, I am concerned that you may not yet be clear on how lists work. Can you explain why you are using lists of strings and then adding them together? You are losing me here.
If you want to pick a pair of strings, use the method described by Florian (requesting the item by its index).
The k parameter tells random.sample how many items you want, and it cannot be larger than the list you sample from. Since [Jacks, a] has only two elements, you could write:
print (random.sample([Jacks, a], 2))
which asks for 2 samples from that list. The output will be a list containing Jacks and a in random order.
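Putting the answers above together, here is a minimal sketch of one way to print a randomly chosen two-line exchange. It assumes the goal is to pick one whole conversation at a time; the names conversations and pick_conversation are made up for illustration, and random.choice is swapped in for the randint/sample combination from the question.

```python
import random

# Each inner list is one two-line exchange (sentences taken from the question).
conversations = [
    ["James-Hello, my name is James!", "Jerry-Hello James, my name is Jerry!"],
    ["Gabe-Goodbye, Samuel.", "Samuel-Goodbye, Gabe"],
]


def pick_conversation():
    """Return one exchange chosen at random."""
    return random.choice(conversations)


for _ in range(2):
    for sentence in pick_conversation():
        print(sentence)

# random.sample needs the population and k; with k=1 it returns a one-element list.
one_sample = random.sample(conversations, 1)
```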

Intramolecular protein residue contact map using biopython, KeyError: 'CA'

I am trying to identify amino acid residues in contact in the 3D protein structure. I am new to BioPython but found this helpful website http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/protein_contact_map/
Following their lead (which I will reproduce here for completeness; note, however, that I am using a different protein):
import Bio.PDB
import numpy as np
pdb_code = "1QHW"
pdb_filename = "1qhw.pdb"
def calc_residue_dist(residue_one, residue_two) :
    """Returns the C-alpha distance between two residues"""
    diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
    return np.sqrt(np.sum(diff_vector * diff_vector))
def calc_dist_matrix(chain_one, chain_two) :
    """Returns a matrix of C-alpha distances between two chains"""
    answer = np.zeros((len(chain_one), len(chain_two)), float)  # np.float was removed in NumPy 1.24
    for row, residue_one in enumerate(chain_one) :
        for col, residue_two in enumerate(chain_two) :
            answer[row, col] = calc_residue_dist(residue_one, residue_two)
    return answer
structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
dist_matrix = calc_dist_matrix(model["A"], model["A"])
But when I run the above code, I get the following error message:
Traceback (most recent call last):
  File "<ipython-input-26-7239fb7ebe14>", line 4, in <module>
    dist_matrix = calc_dist_matrix(model["A"], model["A"])
  File "<ipython-input-3-730a11883f27>", line 15, in calc_dist_matrix
    answer[row, col] = calc_residue_dist(residue_one, residue_two)
  File "<ipython-input-3-730a11883f27>", line 6, in calc_residue_dist
    diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
  File "/Users/anaconda/lib/python3.6/site-packages/Bio/PDB/Entity.py", line 39, in __getitem__
    return self.child_dict[id]
KeyError: 'CA'
Any suggestions on how to fix this issue?
You have heteroatoms (water, ions, etc.; anything that isn't an amino acid or nucleic acid) in your structure. Remove them with:
for residue in list(chain):  # iterate over a copy, because detach_child() modifies the chain
    if residue.id[0] != ' ':
        chain.detach_child(residue.id)
This removes them from the chain in the parsed structure. You may want to modify it if you want to keep the heteroatoms for further analysis.
I believe the problem is that some of the elements in model["A"] are not amino acids and therefore do not contain "CA".
To get around this, I wrote a new function which returns only the amino acid residues:
from Bio.PDB import *
chain = model["A"]
def aa_residues(chain):
    aa_only = []
    for i in chain:
        if i.get_resname() in standard_aa_names:
            aa_only.append(i)
    return aa_only
AA_1 = aa_residues(model["A"])
dist_matrix = calc_dist_matrix(AA_1, AA_1)
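A related approach, sketched here under assumptions, is to skip any residue that has no "CA" atom instead of filtering by residue name. The sketch relies only on residues supporting `"CA" in residue` and `residue["CA"].coord`, which I believe Bio.PDB residues do; to keep it runnable without a PDB file, the toy chain below is built from plain dictionaries, and all function names are illustrative.

```python
import math
from types import SimpleNamespace


def residues_with_ca(chain):
    """Keep only residues that have a C-alpha atom (drops waters, ions, ...)."""
    return [residue for residue in chain if "CA" in residue]


def ca_distance(residue_one, residue_two):
    """Euclidean distance between the C-alpha atoms of two residues."""
    p = residue_one["CA"].coord
    q = residue_two["CA"].coord
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ca_distance_matrix(chain_one, chain_two):
    """Nested-list distance matrix over the residues that have a C-alpha atom."""
    rows = residues_with_ca(chain_one)
    cols = residues_with_ca(chain_two)
    return [[ca_distance(r, c) for c in cols] for r in rows]


def toy_residue(**atoms):
    """Hypothetical stand-in for a Bio.PDB residue: atom name -> object with .coord."""
    return {name: SimpleNamespace(coord=xyz) for name, xyz in atoms.items()}


toy_chain = [
    toy_residue(CA=(0.0, 0.0, 0.0)),
    toy_residue(O=(9.0, 9.0, 9.0)),   # water-like entry without a CA atom
    toy_residue(CA=(3.0, 4.0, 0.0)),
]
matrix = ca_distance_matrix(toy_chain, toy_chain)
print(matrix)
```

With a real structure the same functions would be called as ca_distance_matrix(model["A"], model["A"]); note that the row and column indices then refer to the filtered residue list, not to the original residue numbering.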
So I've been testing (bear in mind I know very little about Bio) and it looks like whatever is in your 1qhw.pdb file is very different from the one in that example.
pdb_code = '1qhw'
structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
Next, to see what is in it, I did:
print(list(model))
Which gave me:
[<Chain id=A>]
Exploring this, it appears the parsed structure behaves like a dict of dicts. So, using this id,
test = model['A']
gives me the next level down. This is the level being passed to your function that is causing the error. Printing it with:
print(list(test))
gave me a huge list of the data inside, including lots of residues and related info. But crucially, no CA. Try using this to see what's inside and modify the line:
diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
to reflect what you are after, replacing CA where appropriate.
I hope this helps; it's a little tricky to get much more specific.
Another solution to obtain the contact map for a protein chain is to use the PdbParser shipped with ConKit.
ConKit is a library specifically designed to work with predicted contacts but has the functionality to extract contacts from a PDB file:
>>> from conkit.io.PdbIO import PdbParser
>>> p = PdbParser()
>>> with open("1qhw.pdb", "r") as pdb_fhandle:
...     pdb = p.read(pdb_fhandle, f_id="1QHW", atom_type="CA")
>>> print(pdb)
ContactFile(id="1QHW_0" nmaps=1
This reads your PDB file into the pdb variable, which stores an internal ContactFile hierarchy. In this example, two residues are considered to be in contact if the participating CA atoms are within 8Å of each other.
To access the information, you can then iterate through the ContactFile and access each ContactMap, which in your case corresponds to intra-molecular contacts for chain A.
>>> for cmap in pdb:
...     print(cmap)
ContactMap(id="A", ncontacts=1601)
If you had more than one chain, there would be a ContactMap for each chain, and additional ones for inter-molecular contacts between chains.
The ContactMap for chain A contains 1601 contact pairs. You can access the Contact instances in each ContactMap by either iterating or indexing. Both work fine.
>>> print(cmap[0])
Contact(id="(26, 27)" res1="S" res1_chain="A" res1_seq=26 res2="T" res2_chain="A" res2_seq=27 raw_score=0.961895)
Each level in the hierarchy has various functions with which you could manipulate contact maps. Examples can be found here.

reporting the best alignment with pairwise2

I have a fastq file of reads, say "reads.fastq". I want to align the sequences to a string saved as a fasta file, ref.faa. I am using the following code for this:
import Bio.SeqIO
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
reads_array = []
for x in Bio.SeqIO.parse("reads.fastq","fastq"):
    reads_array.append(x)
for x in Bio.SeqIO.parse("ref.faa","fasta"):
    refseq = x
result = open("alignments_G10_final","w")
aligned_reads = []
for x in reads_array:
    alignments =pairwise2.align.globalms(str(refseq.seq).upper(),str(x.seq),2,-1,-5,-0.05)
    for a in alignments:
        result.write(format_alignment(*a))
    aligned_reads.append(x)
But I want to report only the best alignment for each read. How can I choose this alignment from the scores in a[2]? I want to choose the alignment(s) with the highest value of a[2].
You could sort the alignments by a[2] in descending order and take the first one:
import operator
for x in reads_array:
    alignments = pairwise2.align.globalms(
        str(refseq.seq).upper(), str(x.seq), 2, -1, -5, -0.05)
    # reverse=True puts the highest score first
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    result.write(format_alignment(*sorted_alignments[0]))
    aligned_reads.append(x)
I know this is an old question, but for anyone still looking for the correct answer, add the one_alignment_only=True argument to your align method:
alignments =pairwise2.align.globalms(str(refseq.seq).upper(),
                                     str(x.seq),
                                     2,-1,-5,-0.05,
                                     one_alignment_only=True)
I had to do some digging around in the documentation to find it, but this gives back a single alignment with the optimal score.
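If the goal is to keep every alignment that ties for the highest score, rather than exactly one, a small helper can filter on the third tuple element. This is a sketch that assumes each alignment is a tuple with the score at index 2, as in the question; best_alignments is a made-up name and the toy tuples below are invented purely to exercise it.

```python
import operator

score_of = operator.itemgetter(2)  # the score is the third element of each tuple


def best_alignments(alignments):
    """Return all alignments whose score equals the highest score."""
    alignments = list(alignments)
    if not alignments:
        return []
    top_score = max(map(score_of, alignments))
    return [a for a in alignments if score_of(a) == top_score]


# Invented (seqA, seqB, score, begin, end) tuples, for illustration only.
toy_alignments = [
    ("ACGT", "AC-T", 5.0, 0, 4),
    ("ACGT", "A-CT", 5.0, 0, 4),
    ("ACGT", "-ACT", 3.0, 0, 4),
]
top = best_alignments(toy_alignments)
print(top)
```

In the loop from the question, this would replace the inner "for a in alignments" with "for a in best_alignments(alignments)".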

Extraction and processing the data from txt file

I am a beginner in Python (and in programming). I have a large file containing repeating groups of 3 lines with numbers followed by 1 empty line, and so on.
If I print the file, it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers as one vector, do some math operations on them, write the result to a new file, and then move on to the next three lines, that is, the next vector. Here is my code (it doesn't work):
import math
inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get "vector" from each 3 lines in my file and put it into blabla.txt output file. Thanks a lot!
My 'code comment' answer:
Take care to close all parentheses so that they match the opened ones! (This is very likely to raise a SyntaxError ;-) )
fin is created as an empty list and only ever has the list b appended to it, never the numbers themselves. Calling float() on fin[n], or indexing past its end, is therefore very likely to break with a TypeError or an IndexError;
k2 and l3 are created but never used;
k1 and l1 are used but never created, which is very likely to break with a NameError;
b is assigned from a, so it is a list. But you do fin.append(b): what do you expect from appending (not extending) a list in this case?
Hope this helps!
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing of vectors, you might want to look into the Math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.
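For orientation, here is a minimal sketch of the read, transform, write loop described in the question. It assumes the file holds one number per line, with blank lines separating groups of three, and it keeps the formula from the question (each component divided by the vector length plus 1). The function names are made up, and the file names in convert are the ones from the question.

```python
import math


def scale_vector(h, k, l):
    """Divide each component by (vector length + 1), as in the question."""
    denom = math.sqrt(h * h + k * k + l * l) + 1
    return [h / denom, k / denom, l / denom]


def read_vectors(lines):
    """Yield [h, k, l] lists from lines holding one number each; blank lines are skipped."""
    current = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank separator between groups
        current.append(float(text))
        if len(current) == 3:
            yield current
            current = []


def convert(in_path="data.txt", out_path="blabla.txt"):
    """Read vectors from in_path and write the scaled vectors to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for h, k, l in read_vectors(src):
            dst.write("\n".join(str(v) for v in scale_vector(h, k, l)))
            dst.write("\n\n")  # blank line between vectors, mirroring the input


sample_lines = ["1.0\n", "2.0\n", "2.0\n", "\n", "3.0\n", "0.0\n", "4.0\n"]
vectors = list(read_vectors(sample_lines))
print(vectors)
print(scale_vector(1.0, 2.0, 2.0))
```

Using "with" closes both files automatically, and grouping by count rather than by detecting the blank line means a missing final blank line does not drop the last vector.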

python similar string removal from multiple files

I have crawled txt files from different websites, and now I need to glue them into one file. Many lines from the various websites are similar to each other. I want to remove the repetitions.
Here is what I have tried:
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:
    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()
    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print destline
            print sourceline
            similar = True
    if not similar:
        destfile.write(sourceline)
    destfile.close()
I will run it for every source and write line by line to the same file. The result is that, even if I run it for the same file multiple times, the lines are always appended to the destination file.
EDIT:
I have tried the code of the answer. It's still very slow.
Even if I minimize the IO, I still need O(n^2) comparisons, which matters once there are 1000+ lines. I have an average of 10,000 lines per file.
Any other ways to remove the duplicates?
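One possible explanation for the lines always being appended, stated as an assumption about the environment: in Python 3 a file opened with 'a+' starts positioned at its end, so readlines() returns an empty list and nothing is ever found to be similar (in Python 2 the starting position depends on the platform). The sketch below demonstrates this with a temporary file; read_existing_lines is a made-up helper name and the file contents are invented.

```python
import os
import tempfile


def read_existing_lines(path):
    """Return the lines already in path, opening it in append-and-read mode."""
    with open(path, "a+") as handle:
        handle.seek(0)  # rewind; otherwise reading may start at the end of the file
        return handle.readlines()


# Small demonstration with a temporary file (hypothetical contents).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "bindresult.txt")
with open(demo_path, "w") as handle:
    handle.write("this is data\nand more data\n")

with open(demo_path, "a+") as handle:
    lines_without_seek = handle.readlines()  # no rewind before reading

lines_with_seek = read_existing_lines(demo_path)
print(lines_without_seek)
print(lines_with_seek)
```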
Here is a short version that does minimal IO and cleans up after itself.
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
# 'a+' keeps the existing contents ('w+' would truncate the file on open).
with open('%s.txt' % destname, 'a+') as destfile:
    # Rewind first: append mode may start positioned at the end of the file.
    destfile.seek(0)
    # we read in the file so that on subsequent runs of this script, we
    # won't duplicate the lines.
    known_lines = set(destfile.readlines())
    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            similar = False
            for known in known_lines:
                ratio = difflib.SequenceMatcher(None, line, known).ratio()
                if ratio > 0.8:
                    print(ratio)
                    print(line)
                    print(known)
                    similar = True
                    break
            if not similar:
                destfile.write(line)
                known_lines.add(line)
Instead of reading the known lines each time from the file, we save them to a set, which we use for comparison against. The set is essentially a mirror of the contents of 'destfile'.
A note on complexity
By its very nature, this problem has O(n^2) complexity. Because you're looking for similarity with known strings, rather than identical strings, you have to look at every previously seen string. If you were looking to remove exact duplicates, rather than fuzzy matches, you could use a simple lookup in a set, with average complexity O(1), making your entire solution have O(n) complexity.
There might be a way to reduce the fundamental complexity by using lossy compression on the strings so that two similar strings compress to the same result. This is, however, both out of scope for a Stack Overflow answer and beyond my expertise. It is an active research area, so you might have some luck digging through the literature.
You could also reduce the time taken by ratio() by using the cheaper alternatives quick_ratio() and real_quick_ratio(). Both return an upper bound on ratio(), so they are less accurate but can rule out pairs that cannot reach the threshold.
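As a sketch of that idea: because real_quick_ratio() and quick_ratio() are upper bounds on ratio(), they can be checked first, cheapest to most expensive, so that the full ratio() only runs for pairs that might pass. The helper names is_similar and dedupe are illustrative, and the 0.8 threshold is the one used in the question.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """True if ratio() exceeds threshold, trying the cheap upper bounds first."""
    matcher = SequenceMatcher(None, a, b)
    return (matcher.real_quick_ratio() > threshold
            and matcher.quick_ratio() > threshold
            and matcher.ratio() > threshold)


def dedupe(lines, threshold=0.8):
    """Keep each line only if it is not similar to a line that was already kept."""
    kept = []
    for line in lines:
        if not any(is_similar(line, known, threshold) for known in kept):
            kept.append(line)
    return kept


sample = ["this is data\n", "this is data!\n", "radically different thing\n"]
result = dedupe(sample)
print(result)
```

This does not change the O(n^2) number of comparisons; it only makes most of them cheaper.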
Your code works fine for me. It prints destline and sourceline to stdout when lines are similar (in the example I used, exactly the same), but it only wrote unique lines to the file once. You might need to set your ratio threshold lower for your specific "similarity" needs.
Basically what you need to do is check every line in the source file to see if it has a potential match against every line of the destination file.
##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data
##bindresult.txt
##--------------
##a website line
##this is data
##and more data
from difflib import SequenceMatcher
sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()
destfile = open('bindresult.txt', 'a+')
destfile.seek(0)  # append mode may start at the end of the file; rewind before reading
destlines = destfile.readlines()
has_matches = {k: False for k in sourcelines}
# For each source line, stop at the first destination line that is similar enough.
for s_line in sourcelines:
    for d_line in destlines:
        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break
for k in has_matches:
    if not has_matches[k]:
        destfile.write(k)
destfile.close()
This will add the line "radically different thing" to the destination file.
