Determine ROT encoding

Determine ROT encoding - python

I want to determine which type of ROT encoding is used and based off that, do the correct decode.
Also, I have found the following code which will indeed decode rot13 "sbbone" to "foobart" correctly:
import codecs
codecs.decode('sbbone', 'rot_13')
The thing is I'd like to run this python file against an existing file which has rot13 encoding. (for example rot13.py encoded.txt).
Thank you!

To answer the second part of your first question, decode something in ROT-x, you can use the following code:
def encode(s, ROT_number=13):
"""Encodes a string (s) using ROT (ROT_number) encoding."""
ROT_number %= 26 # To avoid IndexErrors
alpha = "abcdefghijklmnopqrstuvwxyz" * 2
alpha += alpha.upper()
def get_i():
for i in range(26):
yield i # indexes of the lowercase letters
for i in range(53, 78):
yield i # indexes of the uppercase letters
ROT = {alpha[i]: alpha[i + ROT_number] for i in get_i()}
return "".join(ROT.get(i, i) for i in s)
def decode(s, ROT_number=13):
"""Decodes a string (s) using ROT (ROT_number) encoding."""
return encrypt(s, abs(ROT_number % 26 - 26))
To answer the first part of your first question, find the rot encoding of an arbitrarily encoded string, you probably want to brute-force. Uses all rot-encodings, and check which one makes the most sense. A quick(-ish) way to do this is to get a space-delimited (e.g. cat\ndog\nmouse\nsheep\nsay\nsaid\nquick\n... where \n is a newline) file containing most common words in the English language, and then check which encoding has the most words in it.
with open("words.txt") as f:
words = frozenset(f.read().lower().split("\n"))
# frozenset for speed
def get_most_likely_encoding(s, delimiter=" "):
alpha = "abcdefghijklmnopqrstuvwxyz" + delimiter
for punctuation in "\n\t,:; .()":
s.replace(punctuation, delimiter)
s = "".join(c for c in s if c.lower() in alpha)
word_count = [sum(w.lower() in words for w in encode(
s, enc).split(delimiter)) for enc in range(26)]
return word_count.index(max(word_count))
A file on Unix machines that you could use is /usr/dict/words, which can also be found here

Well, you can read the file line by line and decode it.
The output should go to an output file:
import codecs
import sys
def main(filename):
output_file = open('output_file.txt', 'w')
with open(filename) as f:
for line in f:
output_file.write(codecs.decode(line, 'rot_13'))
output_file.close()
if __name__ == "__main__":
_filename = sys.argv[1]
main(_filename)

Related

Python: how to avoid the space in dna calculated?

I am using python 2.7.
I want to find the DNA length. I have no idea where is the mistake.....The length of DNA supposed to be 283, but it comes up with 345.
The sequence in a single line is nothing wrong but just the length have some problem.....
I think the spaces are calculated too. May I know how to get the length of the DNA without including the spaces?
Thank you.
import re
singleSeq = ""
fh = open("seq.embl.txt")
lines = fh.readlines()
for line in lines:
lines = line.strip()
m = re.match(r"\s+(.[^\d]+)\s+\d+", line)
if m:
print(m.group(0))
seqline = m.group(1)
print(seqline)
singleSeq += seqline
print("\nSequence in a single line: ")
# print(line.strip(singleSeq))
print(singleSeq)
print("\nSequence length: ", len(singleSeq))
Output
Sequence in a single line:
cccatgtccc agcggcgtat tgctttgcat cgcgaacgca ctttcaatgt cccagcggcg tattgcttct attttataag taccagctaa attttttttt tttttttata agtaccagct aaaatttttt tttttttttt ttataagtac cagctaaaat tttttttttt tttttttata agtaccagct aaaatttttt ttttttttta taagttccag cggcgtattg ctttctgaaa tttaaaaaaa aaaaaaaatt tttttttaat aatatattat ata
Sequence length: 345

This should do the trick
# Python3 code to remove whitespace
def remove(string):
return string.replace(" ", "")
# Driver Program
string = ' t e s t '
print(remove(string))

it seems you are reinventing the wheel her. i strongly suggest you try BioPython for this
from Bio import SeqIO
record = SeqIO.read("seq.embl.txt", "embl")
print("\nSequence length: ", len(record))

Python Caesar Cipher project, incorrect output

I can't seem to get this program I'm supposed to do for a project to output the correct output, even though I have tried getting it to work multiple times. The project is:
Your program needs to decode an encrypted text file called "encrypted. txt". The person who wrote it used a cipher specified in "key. txt". This key file would look similar to the following:
A    B
B    C
C    D
D    E
E    F
F    G
G    H
H    I
I    J
J    K
K    L
L    M
M    N
N    O
O    P
P    Q
Q    R
R    S
S    T
T    U
U    V
V    W
W    X
X    Y
Y    Z
Z    A
The left column represents the plaintext letter, and the right column represents the corresponding ciphertext.
Your program should decode the "encrypted.txt" file using "key.txt" and write the plaintext to "decrypted.txt".
Your program should handle both upper and lower case letters in the encrypted without having two key files (or duplicating keys).  You may have the decrypted text in all caps.
You should be able to handle characters in the encrypted  text that are not in your key file.  In that case, just have the decryption repeat the character.  This will allow you to have spaces in your encrypted text that remain spaces when decrypted.
While you may write a program to create the key file - do NOT include that in the submission.  You may manually create the encrypted and key text files.  Use either the "new file" option in Python Shell (don't forget to save as txt) or an editor such as notepad.  Do not use word.
Here is my code:
keyFile = open("key.txt", "r")
keylist1= []
keylist2 = []
for line in keyFile:
keylist1.append(line.split()[0])
keylist2.append(line.split()[1])
keyFile.close()
encryptedfile = open("encrypted.txt", "r")
lines = encryptedfile.readlines()
currentline = ""
decrypt = ""
for line in lines:
currentline = line
letter = list(currentline)
for i in range(len(letter)):
currentletter = letter[i]
if not letter[i].isalpha():
decrypt += letter[i]
else:
for o in range(len(keylist1)):
if currentletter == keylist1[o]:
decrypt += keylist2[o]
print(decrypt)
The only output I get is:
, ?
which is incorrect.

You forgot to handle lowercase letters. Use upper() to convert everything to a common case.
It would also be better to use a dictionary instead of a pair of lists.
mapping = {}
with open("key.txt", "r") as keyFile:
for line in keyFile:
l1, l2 = line.split()
mapping[upper(l1)] = upper(l2)
decrypt = ""
with open("encrypted.txt", "r") as encryptedFile:
for line in encryptedFile:
for char in line:
char = upper(char)
if char in mapping:
decrypt += mapping[char]
else:
decrypt += char
print(decrypt)

Writing to UTF-16-LE text file with BOM

I've read a few postings regarding Python writing to text files but I could not find a solution to my problem. Here it is in a nutshell.
The requirement: to write values delimited by thorn characters (u00FE; and surronding the text values) and the pilcrow character (u00B6; after each column) to a UTF-16LE text file with BOM (FF FE).
The issue: The written-to text file has whitespace between each column that I did not script for. Also, it's not showing up right in UltraEdit. Only the first value ("mom") shows. I welcome any insight or advice.
The script (simplified to ease troubleshooting; the actual script uses a third-party API to obtain the list of values):
import os
import codecs
import shutil
import sys
import codecs
first = u''
textdel = u'\u00FE'.encode('utf_16_le') #thorn
fielddel = u'\u00B6'.encode('utf_16_le') #pilcrow
list1 = ['mom', 'dad', 'son']
num = len(list1) #pretend this is from the metadata profile
f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')
for item in list1:
mytext2 = u''
i = 0
i = i + 1
mytext2 = mytext2 + item + textdel
if i < (num - 1):
mytext2 = mytext2 + fielddel
f.write(mytext2 + u'\n')
f.close()

You're double-encoding your strings. You've already opened your file as UTF-16-LE, so leave your textdel and fielddel strings unencoded. They will get encoded at write time along with every line written to the file.
Or put another way, textdel = u'\u00FE' sets textdel to the "thorn" character, while textdel = u'\u00FE'.encode('utf-16-le') sets textdel to a particular serialized form of that character, a sequence of bytes according to that codec; it is no longer a sequence of characters:
textdel = u'\u00FE'
len(textdel) # -> 1
type(textdel) # -> unicode
len(textdel.encode('utf-16-le')) # -> 2
type(textdel.encode('utf-16-le')) # -> str

Python splitting with string as delimiter

I have a file that looks something like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC
And I need to split it into the "non-N" sequences, so two separate files like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCT
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGC
TATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACA
TGTGTTGC
What I currently have is this:
UMfile = open ("C:\Users\Manuel\Desktop\sequence.txt","r")
contignumber = 1
contigfile = open ("contig "+str(contignumber), "w")
DNA = UMfile.read()
DNAstring = str(DNA)
for s in DNAstring:
DNAstring.split("NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN",1)
contigfile.write(DNAstring)
contigfile.close()
contignumber = contignumber+1
contigfile = open ("contig "+str(contignumber), "w")
The thing is that I realize there is a linebreak between the "Ns" and that is why it is not splitting my file, but the "file" I'm showing is just a part of a much much bigger one. So sometimes the "Ns" will look like this "NNNNNN\n" and sometimes like "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n", yet there is always a count of 1000 Ns between my sequences that I need to split.
So my question is: How do I tell python to split and wite into different files every 1000xNs knowing that there will be different number of Ns in each line?
Thank you all very much, I really have no informatics background and my python skills are at best basic.

Just split your string on 'N' and then remove all the strings that are empty, or just contain a newline. Like this:
#!/usr/bin/env python
DNAstring = '''AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC'''
sequences = [u for u in DNAstring.split('N') if u and u != '\n']
for i, seq in enumerate(sequences):
print i
print seq.replace('\n', '') + '\n'
output
0
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACTTCTCAATGGGCAGTACATATCATCTCT
1
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACATGTGTTGC
The code snippet above also removes newlines inside the sequences using .replace('\n', '').
Here are a few programs that you may find useful.
Firstly, a line buffer class. You initialise it with a file name and a line width. You can then feed it random length strings and it will automatically save them to the text file, line by line, with all lines (except possibly the last line) having the given length. You can use this class in other programs to make your output look neat.
Save this file as linebuffer.py to somewhere in your Python path; the simplest way is to save it wherever you save your Python programs and make that the current directory when you run the programs.
linebuffer.py
#! /usr/bin/env python
''' Text output buffer
Write fixed width lines to a text file
Written by PM 2Ring 2015.03.23
'''
class LineBuffer(object):
''' Text output buffer
Write fixed width lines to file fname
'''
def __init__(self, fname, width):
self.fh = open(fname, 'wt')
self.width = width
self.buff = []
self.bufflen = 0
def write(self, data):
''' Write a string to the buffer '''
self.buff.append(data)
self.bufflen += len(data)
if self.bufflen >= self.width:
self._save()
def _save(self):
''' Write the buffer to the file '''
buff = ''.join(self.buff)
#Split buff into lines
lines = []
while len(buff) >= self.width:
lines.append(buff[:self.width])
buff = buff[self.width:]
#Add an empty line so we get a trailing newline
lines.append('')
self.fh.write('\n'.join(lines))
self.buff = [buff]
self.bufflen = len(buff)
def close(self):
''' Flush the buffer & close the file '''
if self.bufflen > 0:
self.fh.write(''.join(self.buff) + '\n')
self.fh.close()
def testLB():
alpha = 'abcdefghijklmnopqrstuvwxyz'
fname = 'linebuffer_test.txt'
lb = LineBuffer(fname, 27)
for _ in xrange(30):
lb.write(alpha)
lb.write(' bye.')
lb.close()
if __name__ == '__main__':
testLB()
Here is a program that makes random DNA sequences of the form you described in your question. It uses linebuffer.py to handle the output. I wrote this so I could test my DNA sequence splitter properly.
Random_DNA0.py
#! /usr/bin/env python
''' Make random DNA sequences
Sequences consist of random subsequences of the letters 'ACGT'
as well as short sequences of 'N', of random length up to 200.
Exactly 1000 'N's separate sequence blocks.
All sequences may contain newlines chars
Takes approx 3 seconds per megabyte generated and saved
on a 2GHz CPU single core machine.
Written by PM 2Ring 2015.03.23
'''
import sys
import random
from linebuffer import LineBuffer
#Set seed to None to seed randomizer from system time
random.seed(37)
#Output line width
linewidth = 120
#Subsequence base length ranges
minsub, maxsub = 15, 300
#Subsequences per sequence ranges
minseq, maxseq = 5, 50
#random 'N' sequence ranges
minn, maxn = 5, 200
#Probability that a random 'N' sequence occurs after a subsequence
randn = 0.2
#Sequence separator
nsepblock = 'N' * 1000
def main():
#Get number of sequences from the command line
numsequences = int(sys.argv[1]) if len(sys.argv) > 1 else 2
outname = 'DNA_sequence.txt'
lb = LineBuffer(outname, linewidth)
for i in xrange(numsequences):
#Write the 1000*'N' separator between sequences
if i > 0:
lb.write(nsepblock)
for j in xrange(random.randint(minseq, maxseq)):
#Possibly make a short run of 'N's in the sequence
if j > 0 and random.random() < randn:
lb.write(''.join('N' * random.randint(minn, maxn)))
#Create a single subsequence
r = xrange(random.randint(minsub, maxsub))
lb.write(''.join([random.choice('ACGT') for _ in r]))
lb.close()
if __name__ == '__main__':
main()
Finally, we have a program that splits your random DNA sequences. Once again, it uses linebuffer.py to handle the output.
DNA_Splitter0.py
#! /usr/bin/env python
''' Split DNA sequences and save to separate files
Sequences consist of random subsequences of the letters 'ACGT'
as well as short sequences of 'N', of random length up to 200.
Exactly 1000 'N's separate sequence blocks.
All sequences may contain newlines chars
Written by PM 2Ring 2015.03.23
'''
import sys
from linebuffer import LineBuffer
#Output line width
linewidth = 120
#Sequence separator
nsepblock = 'N' * 1000
def main():
iname = 'DNA_sequence.txt'
outbase = 'contig'
with open(iname, 'rt') as f:
data = f.read()
#Remove all newlines
data = data.replace('\n', '')
sequences = data.split(nsepblock)
#Save each sequence to a series of files
for i, seq in enumerate(sequences, 1):
outname = '%s%05d' % (outbase, i)
print outname
#Write sequence data, with line breaks
lb = LineBuffer(outname, linewidth)
lb.write(seq)
lb.close()
if __name__ == '__main__':
main()

assuming you can read the whole file at once
s=DNAstring.replace("\n","") # first remove the nasty linebreaks
l=[x for x in s.split("N") if x] # split and drop empty lines
for x in l: # print in chunks
while x:
print x[:10]
x=x[10:]
print # extra linebreak between chunks

You could simply replace every N and \n with a space, and then split.
result = DNAstring.replace("\n", " ").replace("N", " ").split()
This will give you back a list of strings, and the 'ACGT' sequences will also be split with every new line.
if this is not you goal an you want to conserve the \n in the 'ACGT' and not split along it, you can do the following:
result = DNAstring.replace("N\n", " ").replace("N", " ").split()
this will only remove the \n if it is in the middle of an N sequence.
To split your string exactly after 1000 Ns:
# 1/ Get rid of line breaks in the N sequence
result = DNAstring.replace("N\n", "N")
# 2/ split every 1000 Ns
result = result.split(1000*"N")

Encrypting the lines in a file

I'm trying to write a program that opens a text file, and shifts each of the characters in the file 5 characters to the right. It should only do this for alphanumeric characters, and leave nonalphanumerics as they are. (ex: C becomes H) I'm supposed to be using the ASCII table to do this, and I'm having an issue when the characters wrap around. ex: w should become b, but my program gives me a character that's in the ASCII table. Another issue I'm having is that all the characters are printing on separate lines and I'd like them all to print on the same line.
I can't use lists or dictionaries.
This is what I have, I'm not sure how to do the final if statement
def main():
fileName= input('Please enter the file name: ')
encryptFile(fileName)
def encryptFile(fileName):
f= open(fileName, 'r')
line=1
while line:
line=f.readline()
for char in line:
if char.isalnum():
a=ord(char)
b= a + 5
#if number wraps around, how to correct it
if
print(chr(c))
else:
print(chr(b))
else:
print(char)

Using str.translate:
In [24]: import string
In [25]: string.uppercase
Out[25]: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [26]: string.uppercase[5:]+string.uppercase[:5]
Out[26]: 'FGHIJKLMNOPQRSTUVWXYZABCDE'
In [27]: table = string.maketrans(string.uppercase, string.uppercase[5:]+string.uppercase[:5])
In [28]: 'CAR'.translate(table)
Out[28]: 'HFW'
In [29]: 'HELLO'.translate(table)
Out[29]: 'MJQQT'

First, it matters if it is lower or upper case. I am going to assume here that all the characters are lower case (if they aren't, it would be easy enough to make them)
if b>122:
b=122-b #z=122
c=b+96 #a=97
w=119 in ASCII and z=122 (decimal in ASCII) so 119+5=124 and 124-122=2 which is our new b, then we add that to a-1 (this takes care of if we get a 1 back, 2+96=98 and 98 is b.
For the printing on the same line, instead of printing when you have them, I would write them to a list, then create a string from that list.
e.g instead of
print(chr(c))
else:
print(chr(b))
I would do
someList.append(chr(c))
else:
somList.append(chr(b))
then join each element of the list together into one string.

You could create a dictionary to handle it:
import string
s = string.lowercase + string.uppercase + string.digits + string.lowercase[:5]
encryptionKey = {s[i]:s[i+5] for i in range(len(s)-5)}
The final addend to s (+ string.lowercase[:5]) adds the first 5 letters into the key. Then, we use a simple dictionary comprehension to create a key for the encryption.
Put into your code (I also changed it so you iterate through the lines rather than using f.readline():
import string
def main():
fileName= input('Please enter the file name: ')
encryptFile(fileName)
def encryptFile(fileName):
s = string.lowercase + string.uppercase + string.digits + string.lowercase[:5]
encryptionKey = {s[i]:s[i+5] for i in range(len(s)-5)}
f= open(fileName, 'r')
line=1
for line in f:
for char in line:
if char.isalnum():
print(encryptionKey[char])
else:
print(char)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Determine ROT encoding - python

Related

Python: how to avoid the space in dna calculated?

Python Caesar Cipher project, incorrect output

Writing to UTF-16-LE text file with BOM

Python splitting with string as delimiter

Encrypting the lines in a file

Categories

Resources