I have a program that reads in an input text file (DNA.txt) containing a DNA sequence, and then translates the DNA sequence (saved as a string) into amino acid SLC codes using this function:
def translate(uppercase_dna_sequence, codon_list):
    slc = ""
    for i in range(0, len(uppercase_dna_sequence), 3):
        codon = uppercase_dna_sequence[i:i+3]
        if codon in codon_list:
            slc = slc + codon_list[codon]
        else:
            slc = slc + "X"
    return slc
I then have a function that creates two output text files called:
normalDNA.txt and mutatedDNA.txt
Each of these files has one long DNA sequence.
I now want to write a function that reads both of these files as input and uses the "translate" function above to translate the DNA sequences they contain, just as I did with the original DNA.txt file mentioned at the top. (So I assume I am trying to inherit the other function's properties into this one.) I have this code:
def txtTranslate(translate):
    with open('normalDNA.txt') as inputfile:
        normalDNA_input = inputfile.read()
        print normalDNA_input
    with open('mutatedDNA.txt') as inputfile:
        mutatedDNA_input = inputfile.read()
        print mutatedDNA_input
    return txtTranslate
The program runs when I call it with:
print txtTranslate(translate)
But it prints:
<function txtTranslate at 0x103bf39b0>
I want the second function (txtTranslate) to read in the external text files, and then have the first function translate the inputs and "print" out the result to the user...
I have my full code available on request, but I think I'm missing something small, hopefully! Or should I put everything into classes with OOP?
I'm new to linking two functions, so please excuse the lack of knowledge in the second function...
This doesn't have anything to do with inheritance. If you want txtTranslate to execute translate, you have to actually call it. Try:
def txtTranslate():
    with open('normalDNA.txt') as inputfile:
        normalDNA_input = inputfile.read()
    print normalDNA_input
    with open('mutatedDNA.txt') as inputfile:
        mutatedDNA_input = inputfile.read()
    print mutatedDNA_input
    #todo: get codon_list from somewhere
    print translate(normalDNA_input, codon_list)
    print translate(mutatedDNA_input, codon_list)

txtTranslate()
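For reference, the codon_list that the #todo refers to is just a dict mapping three-letter codons to single-letter codes. A minimal sketch (Python 3 syntax; the translate function from the question is repeated so the example is self-contained, and only four codons are shown where a real table maps all 64):

```python
# Minimal illustrative codon table -- a real standard table has 64 entries.
codon_list = {
    "ATG": "M",  # methionine (start codon)
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "TAA": "*",  # stop codon
}

def translate(uppercase_dna_sequence, codon_list):
    slc = ""
    for i in range(0, len(uppercase_dna_sequence), 3):
        codon = uppercase_dna_sequence[i:i+3]
        if codon in codon_list:
            slc = slc + codon_list[codon]
        else:
            slc = slc + "X"  # unknown codon
    return slc

print(translate("ATGTTTGGCTAA", codon_list))  # -> MFG*
```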
I want to write a Python UDF for Pig that reads lines from a file like
#'prefix.csv'
spol.
LLC
Oy
OOD
and matches the names; if it finds any matches, it replaces them with whitespace. Here is my Python code:
def list_files2(name, f):
    fin = open(f, 'r')
    for line in fin:
        final = name
        extra = 'nothing'
        if (name != name.replace(line.strip(), ' ')):
            extra = line.strip()
            final = name.replace(line.strip(), ' ').strip()
            return final, extra, 'insdie if'
    return final, extra, 'inside for'
Running this code in Python,
>>> print list_files2('LLC nakisa', 'prefix.csv')
>>> print list_files2('AG company', 'prefix.csv')
returns
('nakisa', 'LLC', 'insdie if')
('AG company', 'nothing', 'inside for')
which is exactly what I need. But when I register this code as a UDF in apache pig for this sample list:
nakisa company LLC
three Oy
AG Lans
Test OOD
pig returns wrong answer on the third line:
((nakisa company,LLC,insdie if))
((three,Oy,insdie if))
((A G L a n s,,insdie if))
((Test,OOD,insdie if))
The question is why the UDF enters the if branch for the third entry, which does not have any match in the prefix.csv file.
I don't know Pig, but the way you are checking for a match is strange and might be the cause of your problem.
If you want to check whether a string is a substring of another, python provides
the find method on strings:
if name.find(line.strip()) != -1:
    # find will return the first index of the substring or -1 if it was not found
    # ... do some stuff
Additionally, your code might leave the file handle open. A much better approach to file operations is the with statement. This assures that in any case (short of an interpreter crash) the file handle will get closed.
with open(filename, "r") as file_:
    # Everything within this block can use the opened file.
Last but not least, Python provides a module called csv with a reader and a writer that handle the parsing of the CSV file format.
Thus, you could try the following code and check if it returns the correct thing:
import csv

def list_files2(name, filename):
    with open(filename, 'rb') as file_:
        final = name
        extra = "nothing"
        for row in csv.reader(file_):
            prefix = row[0]  # each row is a list of fields; the prefix is the first
            if name.find(prefix) != -1:
                extra = prefix
                final = name.replace(prefix, " ").strip()
                return final, extra, "inside if"
        return final, extra, "inside for"
Because your file is named prefix.csv, I assume you want to do prefix substitution. In that case, you could use startswith instead of find for the check and replace the line final = name.replace(prefix, " ") with final = name[len(prefix):].strip(). This assures that only a leading prefix will be removed.
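A minimal sketch of that prefix-only variant (Python 3 syntax; strip_prefix and the inlined prefix list are illustrative stand-ins for the UDF and for reading prefix.csv):

```python
def strip_prefix(name, prefixes):
    """Remove a known company prefix only when it starts the name."""
    for prefix in prefixes:
        if name.startswith(prefix):
            # Drop the prefix itself; keep the rest of the name.
            return name[len(prefix):].strip(), prefix
    return name, "nothing"

prefixes = ["spol.", "LLC", "Oy", "OOD"]
print(strip_prefix("LLC nakisa", prefixes))  # -> ('nakisa', 'LLC')
print(strip_prefix("AG company", prefixes))  # -> ('AG company', 'nothing')
```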
I hope this helps.
I'm trying to implement Vigenere's Cipher. I want to be able to obfuscate every single character in a file, not just alphabetic characters.
I think I'm missing something with the different types of encoding. I have made some test cases and some characters are getting replaced badly in the final result.
This is one test case:
,.-´`1234678abcde^*{}"¿?!"·$%&/\º
end
And this is the result I'm getting:
).-4`1234678abcde^*{}"??!"7$%&/:
end
As you can see, ',' is being replaced badly with ')' as well as some other characters.
My guess is that the others (for example, '¿' being replaced with '?') come from the original character not being in the range [0, 127], so it's normal that those are changed. But I don't understand why ',' is failing.
My intent is to obfuscate CSV files, so the ',' problem is the one I'm mainly concerned about.
In the code below, I'm using modulus 128, but I'm not sure if that's correct. To execute it, put a file named "OriginalFile.txt" in the same folder with the content to cipher and run the script. Two files will be generated, Ciphered.txt and Deciphered.txt.
"""
Attempt to implement Vigenere cipher in Python.
"""
import os
key = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
fileOriginal = "OriginalFile.txt"
fileCiphered = "Ciphered.txt"
fileDeciphered = "Deciphered.txt"
# CIPHER PHASE
if os.path.isfile(fileCiphered):
    os.remove(fileCiphered)

keyToUse = 0
with open(fileOriginal, "r") as original:
    with open(fileCiphered, "a") as ciphered:
        while True:
            c = original.read(1)  # read char
            if not c:
                break
            k = key[keyToUse]
            protected = chr((ord(c) + ord(k)) % 128)
            ciphered.write(protected)
            keyToUse = (keyToUse + 1) % len(key)

print("Cipher successful")
# DECIPHER PHASE
if os.path.isfile(fileDeciphered):
    os.remove(fileDeciphered)

keyToUse = 0
with open(fileCiphered, "r") as ciphered:
    with open(fileDeciphered, "a") as deciphered:
        while True:
            c = ciphered.read(1)  # read char
            if not c:
                break
            k = key[keyToUse]
            unprotected = chr((128 + ord(c) - ord(k)) % 128)  # +128 so that we don't get into negative numbers
            deciphered.write(unprotected)
            keyToUse = (keyToUse + 1) % len(key)

print("Decipher successful")
Assumption: you're trying to produce a new, valid CSV with the contents of cells enciphered via Vigenere, not to encipher the whole file.
In that case, you should check out the csv module, which will handle properly reading and writing CSV files for you (including cells that contain commas in the value, which might happen after you encipher a cell's contents, as you see). Very briefly, you can do something like:
import csv

with open("...", "r") as fpin, open("...", "w") as fpout:
    reader = csv.reader(fpin)
    writer = csv.writer(fpout)
    for row in reader:
        # row will be a list of strings, one per column in the row
        ciphered = [encipher(cell) for cell in row]
        writer.writerow(ciphered)
When using the csv module you should be aware of the notion of "dialects" -- ways that different programs (usually spreadsheet-like things, think Excel) handle CSV data. csv.reader() usually does a fine job of inferring the dialect you have in the input file, but you might need to tell csv.writer() what dialect you want for the output file. You can get the list of built-in dialects with csv.list_dialects() or you can make your own by creating a custom Dialect object.
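As for why ',' in particular got mangled: the arithmetic with the all-'a' key maps it to a carriage return, which text-mode file I/O then rewrites. A sketch of the round trip (assuming Python 3's default universal-newline handling on read):

```python
# With key character 'a' (ord 97), ',' (ord 44) enciphers to a control char:
enciphered = chr((ord(',') + ord('a')) % 128)
print(ord(enciphered))  # -> 13, i.e. carriage return '\r'

# Reading the file back in text mode translates '\r' to '\n' (ord 10)
# under universal newlines, so deciphering yields a different character:
deciphered = chr((128 + ord('\n') - ord('a')) % 128)
print(deciphered)  # -> ')', the exact corruption reported in the question
```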
I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract ID (which is the file number that I extracted from my document), sentence ID (automatically generated), and each sentence of the abstract on its own row.
I would want a table that looks like this
abstractID    SentenceID    Sentence
a9001755      0000001       Myxococcus xanthus development is regulated by (1st sentence)
a9001755      0000002       The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (one per row) to a table and assign a sentenceID as shown above?
This is my code:
import glob;
import re;
import json

org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);

for name in files:
    fileA = open(name, 'r');
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer

    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments

    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
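For illustration, here is one hedged way to search the header as a single string with re. The header text below is hypothetical, assuming the 14-character label layout that the line[14:] slicing suggests; the real NSF award files may differ:

```python
import re

# Hypothetical header text -- layout assumed from the line[14:] slicing.
header = """File        : a9001755
NSF Org     : DEB
Abstract    :
Myxococcus xanthus development is regulated by cell signals.
"""

# One pass over the whole string instead of a line-by-line loop.
file_match = re.search(r'^File\s*:\s*(\S+)', header, re.MULTILINE)
org_match = re.search(r'^NSF Org\s*:\s*(\S+)', header, re.MULTILINE)
print(file_match.group(1))  # -> a9001755
print(org_match.group(1))   # -> DEB
```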
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which is best avoided. In the smaller version, you can see how I substituted string literals in places where they are only used once, and called print statements immediately instead of storing the results for later. The result is usually more concise and more easily understood.
We have a homework assignment that I'm having a serious problem with.
The key is to turn each line into a tuple and put these tuples into a list,
like list = [tuple(line1), tuple(line2), tuple(line3), ...].
Each line contains several strings separated by commas, like "aei","1433","lincoln",...
Here is the question:
A book can be represented as a tuple of the author's lastName, the author's firstName, the title, date, and ISBN.
Write a function, readBook(), that, given a comma-separated string containing this information, returns a tuple representing the book.
Write a function, readBooks(), that, given the name of a text file containing one comma-separated line per book, uses readBook() to return a list of tuples, each of which describes one book.
Write a function, buildIndex(), that, given a list of books as returned by readBooks(), builds a map from key word to book title. A key word is any word in a book's title, except "a", "an", or "the".
Here is my code:
RC = ("Chann", "Robbbin", "Pride and Prejudice", "2013", "19960418")
RB = ("Benjamin", "Franklin", "The Death of a Robin Thickle", "1725", "4637284")

def readBook(lastName, firstName, booktitle, date, isbn):
    booktuple = (lastName, firstName, booktitle, date, isbn)
    return booktuple

# print readBook("Chen", "Robert", "Pride and Prejudice", "2013", "19960418")

def readBooks(file1):
    inputFile = open(file1, "r")
    lines = inputFile.readlines()
    book = (lines)
    inputFile.close()
    return book

print readBooks("book.txt")
BooklistR = [RC, RB]

def buildIndex(file2):
    inputFile = open("book.txt", "r")
    Blist = inputFile.readlines()
    dictbooks = {}
    for bookinfo in Blist:
        title = bookinfo[2].split()
        for infos in title:
            if infos.upper() == "A":
                title.remove(infos)
            elif infos.upper() == "THE":
                title.remove(infos)
            elif infos.upper() == "AN":
                title.remove(infos)
            else:
                pass
        dictbooks[tuple(title)] = bookinfo[2]
    return dictbooks

print buildIndex("book.txt")
# Queries #
def lookupKeyword(keywords):
    dictbooks = buildIndex(BooklistR)
    keys = dictbooks.viewkeys()
    values = dictbooks.viewvalues()
    for keybook in list(keys):
        for keyw in keywords:
            for keyk in keybook:
                if keyw == keyk:
                    printoo = dictbooks[keybook]
                else:
                    pass
    return printoo

print lookupKeyword("Robin")
What's wrong with something like this?:
with open(someFile) as inputFile:
    myListofTuples = [tuple(line.split(',')) for line in inputFile.readlines()]
[Explanation added based on Robert's comment]
The first line opens the file in a with statement. Python with statements are a fairly new feature and rather advanced. They set up a context in which code executes with certain guarantees about how clean-up and finalization code will be executed as the Python engine exits that context (whether by completing the work or by encountering an unhandled exception).
You can read about the ugly details at Python Docs: Context Managers, but the gist of it all is that we're opening someFile with a guarantee that it'll be closed properly after execution leaves that context (the suite of statements after the with statement). That'll be done even if we encounter some error or if our code inside that suite raises some exception that we fail to catch.
In this case we use the as clause to give us a local name by which we can refer to the opened file object. (The filename is just a string, passed as an argument to the open() built-in function; the object returned by that function needs a name by which we can refer to it.) This is similar to how a for i in whatever statement binds each item in whatever to the name i on each iteration through the loop.
The suite of our with statement (that's the set of indented statements which is run within the context of the context manager) consists of a single statement ... a list comprehension which is bound to the name myListofTuples.
A list comprehension is another fairly advanced programming concept. There are a number of very high level languages which implement them in various ways. In the case of Python they date back to much earlier versions than the with statement --- I think they were introduced in the 2.2 or so timeframe.
Consequently, list comprehensions are fairly common in Python code while with statements are only slowly being adopted.
A list literal in Python looks like: [something, another_thing, etc, ...] a list comprehension is similar but replaces the list of item literals with an expression, a line of code, which evaluates into a list. For example: [x*x for x in range(100) if x % 2] is a list comprehension which evaluates into a list of integers which are the squares of odd integers between 1 and 99. (Notice the absence of commas in the list comprehension. An expression takes the place of the comma delimited sequence which would have been used in a list literal).
In my example I'm using for line in inputFile.readlines() as the core of the expression, splitting each of those lines on the comma (line.split(',')), and then converting the resulting list into a tuple.
This is just a very concise way of saying:
myListofTuples = list()
for line in inputfile.readlines():
    myListofTuples.append(tuple(line.split(',')))
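For instance, the comprehension can be run against an in-memory stand-in for the file (a sketch; io.StringIO and the sample line are stand-ins for the real input, and Python 3 syntax is used):

```python
import io

# Stand-in for the opened file object.
inputFile = io.StringIO("Chann,Robbbin,Pride and Prejudice,2013,19960418\n")
myListofTuples = [tuple(line.split(',')) for line in inputFile.readlines()]
print(myListofTuples[0][2])  # -> Pride and Prejudice
# Note the trailing '\n' stays on the last field unless you strip() each line.
```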
One possible program:
import fileinput

def readBook(str):
    l = str.strip().split(',')
    t = tuple(l[0:5])
    return t
#b = readBook("First,Last,Title,2013,ISBN")
#print b
def readBooks(file):
    l = []
    for line in fileinput.input(file):
        t = readBook(line)
        # print t
        l.append(t)
    return l
books = readBooks("data")
# for t in books:
#     for f in t:
#         print f
def buildIndex(books):
    i = {}
    for b in books:
        for w in b[2].split():
            if w.lower() not in ('a', 'an', 'the'):
                if w not in i:
                    i[w] = []
                i[w].append(b[2])
    return i

index = buildIndex(books)
for w in sorted(index):
    print "Word: ", w
    for t in index[w]:
        print "Title: ", t
Sample data file (called "data" in the code):
Austen,Jane,Pride and Prejudice,1811,123456789012X
Austen,Jane,Sense and Sensibility,1813,21234567892
Rice-Burroughs,Edgar,Tarzan and the Apes,1911,302912341234X
Sample output:
Word: Apes
Title: Tarzan and the Apes
Word: Prejudice
Title: Pride and Prejudice
Word: Pride
Title: Pride and Prejudice
Word: Sense
Title: Sense and Sensibility
Word: Sensibility
Title: Sense and Sensibility
Word: Tarzan
Title: Tarzan and the Apes
Word: and
Title: Pride and Prejudice
Title: Sense and Sensibility
Title: Tarzan and the Apes
Note that the data format can't support book titles such as "The Lion, The Witch, and the Wardrobe" because of the embedded commas. If the file was in CSV format with quotes around the strings, then it could manage that.
I'm not sure that's perfectly minimally Pythonic code (not at all sure), but it does seem to match the requirements.
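For illustration, the csv module copes with exactly that quoted, comma-containing case (a sketch with the data inlined rather than read from a file; Python 3 syntax):

```python
import csv
import io

# Quoted title with embedded commas -- a plain split(',') would break it.
data = io.StringIO(
    'Lewis,C.S.,"The Lion, The Witch, and the Wardrobe",1950,123456789X\n'
)
books = [tuple(row) for row in csv.reader(data)]
print(books[0][2])  # -> The Lion, The Witch, and the Wardrobe
```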
I'm working through some Python problems on pythonchallenge.com to teach myself Python, and I've hit a roadblock: the string I'm supposed to be using is too large for Python to handle. I receive this error:
my-macbook:python owner1$ python singleoccurrence.py
Traceback (most recent call last):
File "singleoccurrence.py", line 32, in <module>
myString = myString.join(line)
OverflowError: join() result is too long for a Python string
What alternatives do I have for this issue? My code looks like such...
# open file testdata.txt
# for each character, check if already exists in array of checked characters
# if so, skip.
# if not, character.count
# if count > 1, repeat recursively with first character stripped off of page.
# if count = 1, add to valid character array.
# when string = 0, print valid character array.

valid = []
checked = []
myString = ""

def recursiveCount(bigString):
    if len(bigString) == 0:
        print "YAY!"
        return valid
    myChar = bigString[0]
    if myChar in checked:
        return recursiveCount(bigString[1:])
    if bigString.count(myChar) > 1:
        checked.append(myChar)
        return recursiveCount(bigString[1:])
    checked.append(myChar)
    valid.append(myChar)
    return recursiveCount(bigString[1:])

fileIN = open("testdata.txt", "r")
line = fileIN.readline()
while line:
    line = line.strip()
    myString = myString.join(line)
    line = fileIN.readline()

myString = recursiveCount(myString)
print "\n"
print myString
string.join doesn't do what you think. join is used to combine a list of words into a single string with the given separator. I.e.:
>>> ",".join(('foo', 'bar', 'baz'))
'foo,bar,baz'
The code snippet you posted will attempt to insert myString between every character in the variable line. You can see how that will get big quickly :-). Are you trying to read the entire file into a single string, myString? If so, the way you want to concatenate the strings is like this:
myString = myString + line
While I'm here... since you're learning Python here are some other suggestions.
There are easier ways to read an entire file into a variable. For instance:
fileIN = open("testdata.txt", "r")
myString = fileIN.read()
(This won't have the exact behaviour of your existing strip() code, but may in fact do what you want.)
Also, I would never recommend that practical Python code use recursion to iterate over a string. Your code will make a function call (and a stack entry) for every character in the string. Also, I'm not sure Python will be very smart about all the uses of bigString[1:]: it may well create a second string in memory that's a copy of the original without the first character. The simplest way to process every character in a string is:
for mychar in bigString:
    ... do your stuff ...
Finally, you are using the list named "checked" to see if you've ever seen a particular character before. But the membership test on lists ("if myChar in checked") is slow. In Python you're better off using a dictionary:
checked = {}
...
if myChar not in checked:
    checked[myChar] = True
    ...
This exercise you're doing is a great way to learn several Python idioms.
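As a final illustration of those idioms together, the whole single-occurrence task can be done without recursion (Python 3 syntax; the short sample string is a stand-in for the real file contents):

```python
from collections import Counter

big_string = "aabbcdd e"  # stand-in for the contents of testdata.txt

# Count every character once, then keep the ones that occur exactly once,
# preserving their original order.
counts = Counter(big_string)
valid = [char for char in big_string if counts[char] == 1]
print(valid)  # -> ['c', ' ', 'e']
```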