I'm trying to build a dictionary attack script (for educational purposes) using Python, and it only ever solves the last SHA-256 hash in my file.
The logic is as follows:
Read a file containing words
Store the hashed value of the word along with the word as a key-value pair in a dictionary
Scan the lines of a text file containing SHA-256 hashes (1 hashed value per line)
Iterate over the items in the dictionary and print the key if a value matches the hash
It works perfectly for the very last item in my file, but says a match was not found for all my others.
My hash file looks like:
9F86D081884C7D659A2FEAA0C55AD015A3BF4F1B2B0B822CD15D6C15B0F00A08
5E884898DA28047151D0E56F8DC6292773603D0D6AABBDD62A11EF721D1542D8
Containing the hashed values for "test" and "password".
My word file contains over 70,000 words and I've made sure both words are in the file, and when I debug, they both have values in the dictionary if I call the expression.
Here's where I iterate over the hashes in my file:
with open(hashFile) as f:
    for c in f:
        findMatch(str(c).lower(),wordMap)
And the function I wrote to compare a hashed value to every value in the dictionary:
def findMatch(hv,m):
    #k is the key, m is the dictionary
    for k in m:
        if(m[k].lower() == hv):
            print("Match was found: " + k )
            return
    print("Match was not found, searched through " + str(len(wordMap)) + " words")
Any help is appreciated, thanks!
In findMatch(str(c).lower(),wordMap), there is no need to call str() (because c is already a string), but there is a need to strip off the trailing newline character: findMatch(c.strip().lower(),wordMap). Otherwise, the newline is part of the string you compare against each stored hash, so the comparison never succeeds. Apparently the last line of your file does not have a trailing newline, which is why it is correctly recognized.
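As a rough sketch of the same idea, the lookup can also be turned around so that the dictionary is keyed by the hash rather than by the word; each line from the hash file then needs only one lookup after it has been stripped and lowercased. The function names build_hash_map and find_matches are made up for this illustration and are not from the question.

```python
import hashlib


def build_hash_map(words):
    """Map each word's lowercase SHA-256 hex digest to the word itself."""
    return {hashlib.sha256(w.encode("utf-8")).hexdigest(): w for w in words}


def find_matches(hash_lines, hash_map):
    """Return (digest, word or None) for every non-empty line.

    strip() removes the trailing newline that iterating over a file
    leaves on each line; lower() matches hexdigest()'s lowercase output.
    """
    results = []
    for line in hash_lines:
        digest = line.strip().lower()
        if not digest:
            continue  # skip blank lines
        results.append((digest, hash_map.get(digest)))
    return results


# Lines as they would come out of the hash file (note the newlines).
lines = [
    "9F86D081884C7D659A2FEAA0C55AD015A3BF4F1B2B0B822CD15D6C15B0F00A08\n",
    "5E884898DA28047151D0E56F8DC6292773603D0D6AABBDD62A11EF721D1542D8",
]
for digest, word in find_matches(lines, build_hash_map(["test", "password"])):
    print(digest[:8], word)
```

Keying the dictionary by hash avoids scanning all 70,000 entries for every line, but the original word-to-hash layout works just as well once the newline is stripped.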
My assignment is to:
Read the protein sequence from FILE A and calculate the molecular weight of
this protein using the dictionary created above.
So far, I have the code below:
import pprint
my_dict= {'A':'089Da', 'R':'174Da','N':'132Da','D':'133Da','B':'133Da','C':'121Da','Q':'146Da','E':'147Da',
'Z':'147Da','G':'075Da','H':'155Da','I':'131Da','L':'131Da','K':'146Da','M':'149Da',
'F':'165Da','P':'115Da','S':'105Da','T':'119Da','W':'204Da','Y':'181Da','V':'117Da'}
new=sorted(my_dict.items(), key=lambda x:x[1])
print("AA", " ", "MW")
for key,value in new:
    print(key, " ", value)
with open('lysozyme.fasta','r') as content:
    fasta = content.read()
for my_dict in fasta:
The top part of the code creates my dictionary. The task is to open the file, read e.g. 'MWAAAA' from it, and then sum up the values associated with those keys using the dictionary I created. I'm not sure how to proceed after the for loop. Do I use an append function? I would appreciate any advice, thanks!
After reading your file, you can go through it character by character:
for char in fasta:
    print(char)
output:
M
W
A
A
A
A
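Then use each char as a key to retrieve its value from your dict and add it to a running total (set summ = 0 before the loop). Because the values are strings such as '089Da', drop the 'Da' suffix and convert to a number first:
summ += int(my_dict[char][:-2])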
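Put together, the whole calculation might look like the sketch below. It assumes the weights are stored as plain integers (the same numbers as in the question, without the 'Da' suffix) and that header lines starting with '>' and line breaks in the FASTA file should be skipped; read_fasta_sequence and molecular_weight are names invented for this example.

```python
# Residue weights from the question's dictionary, stored as integers.
WEIGHTS = {
    'A': 89, 'R': 174, 'N': 132, 'D': 133, 'B': 133, 'C': 121,
    'Q': 146, 'E': 147, 'Z': 147, 'G': 75, 'H': 155, 'I': 131,
    'L': 131, 'K': 146, 'M': 149, 'F': 165, 'P': 115, 'S': 105,
    'T': 119, 'W': 204, 'Y': 181, 'V': 117,
}


def read_fasta_sequence(path):
    """Join all sequence lines of a single-record FASTA file.

    Lines starting with '>' are headers and are skipped; strip()
    removes the newline at the end of each sequence line.
    """
    with open(path) as handle:
        return ''.join(line.strip() for line in handle
                       if not line.startswith('>'))


def molecular_weight(sequence, weights=WEIGHTS):
    """Sum the table value of every residue in the sequence."""
    return sum(weights[residue] for residue in sequence)


print(molecular_weight('MWAAAA'))  # 149 + 204 + 4 * 89 = 709
```

For the assignment, something like molecular_weight(read_fasta_sequence('lysozyme.fasta')) would then give the total, provided every character in the file appears in the table.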
This is a small part of my program, but basically so far I have looked through two txt files and compared them to a main txt file with a key of words. For each of the first two txt files (txt file 1 & txt file 2), I found the frequencies of words from the main txt file and put the words and their frequencies of txt file 1 & txt file 2 into two separate dictionaries, wordfreq and wordfreq2.
Now I would like to compare the frequencies of the words from these two dictionaries. If a key in wordfreq has a greater value than the same key in wordfreq2, I would like to add that word to anotherdict1, and vice versa.
anotherdict1 = {}
anotherdict2 = {}
for key in wordfreq.keys():
    if key in wordfreq2.keys() > key in wordfreq.keys():
        anotherdict2.update(wordfreq2)
for key in wordfreq2.keys():
    if key in wordfreq.keys() > key in wordfreq2.keys():
        anotherdict1.update(wordfreq)
print (wordfreq)
print (wordfreq2)
What you're doing here is updating anotherdict2 with all of wordfreq2 (and likewise anotherdict1 with wordfreq), so every key/value pair in wordfreq2 ends up in anotherdict2. What you should be doing is adding just that particular key/value pair. In addition, your if check never compares the frequencies: key in wordfreq2.keys() is a membership test that evaluates to True or False, not the value itself. You should be using wordfreq2[key]. Here's how I would do it:
for key, wordfreq_value in wordfreq.items():
    wordfreq2_value = wordfreq2[key]
    if wordfreq2_value > wordfreq_value:
        anotherdict2[key] = wordfreq2_value
    else:
        anotherdict1[key] = wordfreq_value
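A slightly more defensive variant is sketched below. It assumes that a word missing from one dictionary should count as frequency 0 and that ties should go into neither result; both choices are assumptions rather than something stated in the question, and split_by_higher_frequency is a hypothetical name.

```python
def split_by_higher_frequency(freq_a, freq_b):
    """Return two dicts: words more frequent in freq_a, and in freq_b."""
    higher_in_a, higher_in_b = {}, {}
    # Walk over every word that appears in either dictionary.
    for word in set(freq_a) | set(freq_b):
        a = freq_a.get(word, 0)  # missing words count as 0
        b = freq_b.get(word, 0)
        if a > b:
            higher_in_a[word] = a
        elif b > a:
            higher_in_b[word] = b
        # equal counts are left out of both results
    return higher_in_a, higher_in_b


wordfreq = {'cat': 3, 'dog': 1, 'owl': 2}
wordfreq2 = {'cat': 1, 'dog': 4, 'owl': 2, 'fox': 1}
anotherdict1, anotherdict2 = split_by_higher_frequency(wordfreq, wordfreq2)
print(anotherdict1)
print(anotherdict2)
```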
I'll get straight to the point:
I need to find a way to encrypt and decrypt a string of text using a Vigenère cipher in Python 3. I am trying to do this without downloading extra packages, but importing ones that already ship with Python is fine. A specific feature I want my program to have is that users can enter the key they want to use inside the program itself. So far, I have managed to change letters into their values in the alphabet and back, but how do I map this onto the whole string while changing the key letter? Code so far:
with open("appbin/vignere.json", "rt") as vd:
    vigneredict = json.load(vd)
with open("appbin/encrypt.txt", "rt") as intx:
    inputtext = intx.read()
vignereword = input("Input the keyword for encrypting your text: ")
with open("appbin/vigkey.txt", "w") as kw:
    kw.write(vignereword)
textlist = list(inputtext)
This code loads in the text from a file called encrypt.txt and stores it, as well as making it into a list. How do I do the actual encrypting part?
First make and store your key as a list (keylist), then use a for loop like this:
index = 0
for letter in textlist:
    #blah blah
    index += 1
    if index >= len(keylist):
        index = 0
In place of blah blah, put your method of converting the key and text letters to numbers (the index variable is for when you need to get the letter out of keylist, but I left that bit of the code for you to write). Add the two numbers together, subtract 26 if the result is bigger than 25, then convert back to a letter and store it in a new variable.
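For reference, one way to fill in the loop is sketched here. It assumes a plain A-Z alphabet numbered 0 to 25 (instead of the JSON lookup table from the question), leaves any non-letter unchanged without advancing the key, and uses the modulo operator in place of the explicit "subtract 26" step. The function name vigenere and its decrypt flag are invented for this example.

```python
import string

ALPHABET = string.ascii_uppercase  # 'A'..'Z', positions 0..25


def vigenere(text, key, decrypt=False):
    """Shift each letter of text by the matching key letter."""
    key_shifts = [ALPHABET.index(k) for k in key.upper() if k in ALPHABET]
    if not key_shifts:
        raise ValueError("key must contain at least one letter A-Z")
    result = []
    key_pos = 0  # counts letters processed so far, used to cycle the key
    for ch in text:
        upper = ch.upper()
        if upper not in ALPHABET:
            result.append(ch)  # spaces, digits, punctuation pass through
            continue
        shift = key_shifts[key_pos % len(key_shifts)]
        if decrypt:
            shift = -shift
        # % 26 wraps around the alphabet in both directions
        new = ALPHABET[(ALPHABET.index(upper) + shift) % 26]
        result.append(new if ch.isupper() else new.lower())
        key_pos += 1
    return ''.join(result)


secret = vigenere("ATTACKATDAWN", "LEMON")
print(secret)
print(vigenere(secret, "LEMON", decrypt=True))
```

Decryption is the same walk with the shift negated, which is why a single function with a flag is enough here.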
I'm trying to tackle a problem on Rosalind where, given a FASTA file of at most 10 sequences at 1kb, I need to give the consensus sequence and profile (how many of each base do all the sequences have in common at each nucleotide). In the context of formatting my response, what I have as my code works for small sequences (verified).
However, I have issues in formatting my response when it comes to large sequences.
What I expect to return, regardless of length, is:
"consensus sequence"
"A: one line string of numbers without commas"
"C: one line string """" "
"G: one line string """" "
"T: one line string """" "
All aligned with each other and on their own respective lines, or at least some formatting that allows me to carry this formatting as a unit onward to maintain the integrity of aligning.
But when I run my code for a large sequence, each string below the consensus sequence comes out broken up by newlines, presumably because the string itself is too long. I've been struggling to think of ways to circumvent the issue, but my searches have been fruitless. I'm thinking about some iterative writing algorithm that writes the entirety of the above expectation, but in chunks. Any help would be greatly appreciated. I have attached the entirety of my code below for the sake of completeness, with block comments as needed; the main section of interest is the file-writing part at the end.
def cons(file):
    #returns consensus sequence and profile of a FASTA file
    import os
    path = os.path.abspath(os.path.expanduser(file))
    with open(path,"r") as D:
        F=D.readlines()
    #initialize list of sequences, list of all strings, and a temporary storage
    #list, respectively
    SEQS=[]
    mystrings=[]
    temp_seq=[]
    #get a list of strings from the file, stripping the newline character
    for x in F:
        mystrings.append(x.strip("\n"))
    #if the string in question is a nucleotide sequence (without ">")
    #i'll store that string into a temporary variable until I run into a string
    #with a ">", in which case I'll join all the strings in my temporary
    #sequence list and append to my list of sequences SEQS
    for i in range(1,len(mystrings)):
        if ">" not in mystrings[i]:
            temp_seq.append(mystrings[i])
        else:
            SEQS.append(("").join(temp_seq))
            temp_seq=[]
    SEQS.append(("").join(temp_seq))
    #set up list of nucleotide counts for A,C,G and T, in that order
    ACGT= [[0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))]]
    #assumed to be equal length sequences. Counting amount of shared nucleotides
    #in each column
    for i in range(0,len(SEQS[0])-1):
        for j in range(0, len(SEQS)):
            if SEQS[j][i]=="A":
                ACGT[0][i]+=1
            elif SEQS[j][i]=="C":
                ACGT[1][i]+=1
            elif SEQS[j][i]=="G":
                ACGT[2][i]+=1
            elif SEQS[j][i]=="T":
                ACGT[3][i]+=1
    ancstr=""
    TR_ACGT=list(zip(*ACGT))
    acgt=["A: ","C: ","G: ","T: "]
    for i in range(0,len(TR_ACGT)-1):
        comp=TR_ACGT[i]
        if comp.index(max(comp))==0:
            ancstr+=("A")
        elif comp.index(max(comp))==1:
            ancstr+=("C")
        elif comp.index(max(comp))==2:
            ancstr+=("G")
        elif comp.index(max(comp))==3:
            ancstr+=("T")
    '''
    writing to file... trying to get it to write as
    consensus sequence
    A: blah(1line)
    C: blah(1line)
    G: blah(1line)
    T: blah(line)
    which works for small sequences. but for larger sequences
    python keeps adding newlines if the string in question is very long...
    '''
    myfile="myconsensus.txt"
    writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
    with open(myfile,'w') as D:
        D.writelines(ancstr)
        D.writelines("\n")
        for i in range(0,len(writing_strings)):
            D.writelines(writing_strings[i])
            D.writelines("\n")
cons("rosalind_cons.txt")
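The formatting problem comes from this line:
writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
The inner "for i in range(0,len(ACGT))" clause emits every count once per row, so you accidentally replicate your data and each line comes out four times longer than it should be. Try replacing it with:
writing_strings=[acgt[i] + ' '.join(str(n) for n in ACGT[i]) for i in range(0,len(ACGT))]
and then write each entry to your output file as before:
D.write(writing_strings[i])
Separately, the two loops that run over range(0, len(...)-1) stop one column short, so the last position is never counted or added to the consensus; dropping the -1 fixes that.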
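For comparison, a more compact version of the whole task is sketched below. It assumes all sequences have the same length and that only the bases A, C, G and T should be counted; ties in the consensus go to the first base in "ACGT" order. The helper names (parse_fasta, profile_and_consensus, format_report) are made up for this example, and the sample data is the small dataset from the Rosalind problem statement.

```python
def parse_fasta(lines):
    """Collect the sequences of a FASTA file, joining wrapped lines."""
    seqs, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if current:
                seqs.append(''.join(current))
                current = []
        elif line:
            current.append(line)
    if current:
        seqs.append(''.join(current))
    return seqs


def profile_and_consensus(seqs):
    """Count each base per column and pick the most common one."""
    bases = "ACGT"
    length = len(seqs[0])
    profile = {b: [0] * length for b in bases}
    for seq in seqs:
        for pos, base in enumerate(seq):
            if base in profile:  # ignore anything that is not A/C/G/T
                profile[base][pos] += 1
    consensus = ''.join(
        max(bases, key=lambda b: profile[b][pos]) for pos in range(length)
    )
    return consensus, profile


def format_report(consensus, profile):
    """One line for the consensus, then one unbroken line per base."""
    rows = [consensus]
    rows += ["{}: {}".format(b, ' '.join(map(str, profile[b])))
             for b in "ACGT"]
    return '\n'.join(rows) + '\n'


sample = """>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
""".splitlines()

consensus, profile = profile_and_consensus(parse_fasta(sample))
print(format_report(consensus, profile), end='')
```

Writing the string returned by format_report to a file with a single write() call produces exactly five lines, however long the sequences are; any wrapping seen afterwards comes from the viewer, not from Python.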
I'm now trying to create a tool that translates DNA sequences and then compares them to each other in order to delete the repeats.
I used this script to read my fastq file:
def sequence_cleaner(fastq_file, min_length=0, por_n=100):
    # Create our hash table to add the sequences
    sequences={}
    # Using the Biopython fastq parse we can read our fastq input
    for seq_record in SeqIO.parse(fastq_file, "fastq"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence is according to the user parameters
        if (len(sequence) >= min_length and
            (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n):
            # If the sequence passed in the test "is it clean?" and it isn't in the
            # hash table, the sequence and its id are going to be in the hash
            if sequence not in sequences:
                sequences[sequence] = seq_record.id
            # If it is already in the hash table, we're just gonna concatenate the ID
            # of the current sequence to another one that is already in the hash table
            else:
                sequences[sequence] += "_" + seq_record.id
        print sequence
        trans=translate( sequence )
    # Write the clean sequences
    # Create a file in the same directory where you ran this script
    output_file = open("clear_" + fastq_file, "w+")
    # Just read the hash table and write on the file as a fasta format
    for sequence in sequences:
        output_file.write("#" + sequences[sequence] +"\n" + sequence + "\n" + trans +"\n")
    output_file.close()
    print("\n YOUR SEQUENCES ARE CLEAN!!!\nPlease check clear_" + fastq_file + " on the same repository than " + rep + "\n")
and I used this one to translate it into amino acid sequences:
def translate( sequ ):
    """Return the translated protein from 'sequence' assuming +1 reading frame"""
    gencode = {
        'ATA':'Ile', 'ATC':'Ile', 'ATT':'Ile', 'ATG':'Met',
        'ACA':'Thr', 'ACC':'Thr', 'ACG':'Thr', 'ACT':'Thr',
        'AAC':'Asn', 'AAT':'Asn', 'AAA':'Lys', 'AAG':'Lys',
        'AGC':'Ser', 'AGT':'Ser', 'AGA':'Arg', 'AGG':'Arg',
        'CTA':'Leu', 'CTC':'Leu', 'CTG':'Leu', 'CTT':'Leu',
        'CCA':'Pro', 'CCC':'Pro', 'CCG':'Pro', 'CCT':'Pro',
        'CAC':'His', 'CAT':'His', 'CAA':'Gln', 'CAG':'Gln',
        'CGA':'Arg', 'CGC':'Arg', 'CGG':'Arg', 'CGT':'Arg',
        'GTA':'Val', 'GTC':'Val', 'GTG':'Val', 'GTT':'Val',
        'GCA':'Ala', 'GCC':'Ala', 'GCG':'Ala', 'GCT':'Ala',
        'GAC':'Asp', 'GAT':'Asp', 'GAA':'Glu', 'GAG':'Glu',
        'GGA':'Gly', 'GGC':'Gly', 'GGG':'Gly', 'GGT':'Gly',
        'TCA':'Ser', 'TCC':'Ser', 'TCG':'Ser', 'TCT':'Ser',
        'TTC':'Phe', 'TTT':'Phe', 'TTA':'Leu', 'TTG':'Leu',
        'TAC':'Tyr', 'TAT':'Tyr', 'TAA':'STOP', 'TAG':'STOP',
        'TGC':'Cys', 'TGT':'Cys', 'TGA':'STOP', 'TGG':'Trp'}
    return ''.join(gencode.get(sequ[3*i:3*i+3],'X') for i in range(len(sequ)//3))
The result is not what I expected:
#SRR797221.3
TCAGCCGCGCAGTAGTTAGCACAAGTAGTACGATACAAGAACACTATTTGTAAGTCTAAGGCATTGGCCGCTCGTCTGAGACTGCCAAGGCACACAGGGAGTAGNGNN
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
#SRR797221.4
TCAGCCGCGCAGGTAGTTCCGTTATCATCAGTACCAGCAACTCCAACTCCATCCAACAATGCCGCTCGTCTGAGACTGCCAAGGCACACAGGAGTAGAG
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
#SRR797221.2
TCAGCCGCGCAGGTTCTTGGTAACGGAACGCGCGTTAGACTTAAGACCAGTGAATGGAGCCACCATTGGCCGCTCGTCTGAGACTGCCCAAAGGGCACACAGGGGNGTAGNGN
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
#SRR797221.1
TCAGCCGCGCAGGTAGATTAAGGATCAACGGTTCCTTGGCTCGCAAGTCAATTGGCCGCTCGTCTGAGACTGCCAAGGCACACAGGGAGTAGNG
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
Firstly, you can see that the sequence IDs are not sorted from 1 to 4 as in the original file, and also the translation for the 4th ID is repeated for the three other sequences!
To answer your two questions:
"the sequences id are not sorted from 1 to 4 like on the original file"
You are using a dictionary, which is unordered in older Python versions (insertion order is only guaranteed for plain dicts from Python 3.7 onward).
"Regular Python dictionaries iterate over key/value pairs in arbitrary order."
https://docs.python.org/3.1/whatsnew/3.1.html
You could sort your dictionary by value (see the question "Sort a Python dictionary by value" for a suggestion) or use an ordered dictionary (collections.OrderedDict), which is described at the link above.
"it repeats the same 4th id translation for the three other sequences"
You compute trans=translate( sequence ) for each sequence, but you never store trans in a dictionary or list that is specific to that sequence. When the output loop runs, trans still holds only the last translation computed, so that one value is written for every entry. Try using a separate dictionary which stores the translated sequence together with the sequence ID.
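Both points can be sketched without Biopython. The example assumes the records have already been read into (id, sequence) pairs in file order, for instance from SeqIO.parse, and that any function taking a DNA string can be passed in as translate. The names deduplicate_and_translate, format_records and toy_translate are invented for this illustration; toy_translate knows only two codons and merely stands in for the full table from the question.

```python
from collections import OrderedDict


def deduplicate_and_translate(records, translate):
    """Group identical sequences, keeping first-seen order.

    Each entry stores its own list of IDs and its own translation,
    so nothing is shared between sequences.
    """
    cleaned = OrderedDict()
    for seq_id, sequence in records:
        sequence = sequence.upper()
        if sequence not in cleaned:
            cleaned[sequence] = {"ids": [seq_id],
                                 "protein": translate(sequence)}
        else:
            cleaned[sequence]["ids"].append(seq_id)
    return cleaned


def format_records(cleaned):
    """Render entries as '#ids', sequence, translation (three lines each)."""
    lines = []
    for sequence, entry in cleaned.items():
        lines.append("#" + "_".join(entry["ids"]))
        lines.append(sequence)
        lines.append(entry["protein"])
    return "\n".join(lines) + "\n"


def toy_translate(seq):
    """Stand-in translator for the demo: two codons, everything else 'X'."""
    table = {'ATG': 'Met', 'TAA': 'STOP'}
    return ''.join(table.get(seq[3 * i:3 * i + 3], 'X')
                   for i in range(len(seq) // 3))


records = [("r1", "ATGTAA"), ("r2", "atgtaa"), ("r3", "TAAATG")]
cleaned = deduplicate_and_translate(records, toy_translate)
print(format_records(cleaned), end='')
```

OrderedDict keeps the entries in the order they were first seen on any Python version, so the IDs come out in file order without a separate sorting step.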