The pythonic way of printing a value - python

This probably measures how pythonic you are. I'm playing around trying to learn Python, so I'm not close to being pythonic enough. The infile is a dummy patriline, and I want a list of father-son pairs.
infile:
haffi jolli dkkdk lkskkk lkslll sdkjl kljdsfl klsdlj sdklja asldjkl
code:
def main():
    infile = open('C:\Users\Notandi\Desktop\patriline.txt', 'r')
    line = infile.readline()
    tmpstr = line.split('\t')
    for i in tmpstr[::2]:
        print i, '\t', i + 1
    infile.close()

main()
The issue is i + 1; I want to print out two strings in every line. Is this clear?

You are getting confused between the words in the split string and their indices. For example, the first word is "haffi" but the first index is 0.
To iterate over both the indices and their corresponding words, use enumerate:
for i, word in enumerate(tmpstr):
    print word, tmpstr[i+1]
Of course, this looks messy, and it still raises an IndexError when it reaches the last word. A better way is to just iterate over pairs of strings. There are many ways to do this; here's one.
def pairs(it):
    it = iter(it)
    for element in it:
        yield element, next(it)

for word1, word2 in pairs(tmpstr):
    print word1, word2
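For completeness, another common way to pair up consecutive items is the grouper idiom, which zips one iterator with itself. A minimal sketch, not part of the original answer; the word list here is a stand-in for the split line from the question:

from itertools import izip

tmpstr = ['haffi', 'jolli', 'dkkdk', 'lkskkk']  # stand-in for line.split('\t')
it = iter(tmpstr)
for word1, word2 in izip(it, it):  # both arguments share one iterator, so items pair up
    print word1, '\t', word2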

I'd use the with statement here; on older versions of Python (2.5) you need to import it:
from __future__ import with_statement
For the actual code: if you can afford to load the whole file into memory twice (i.e., it's pretty small), I would do this:
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as f:
        strings = f.read().split('\t')
    for father, son in zip(strings, strings[1:]):
        print "%s \t %s" % (father, son)

main()
That way you skip pairing the childless leaf at the end without much overhead, which I think is what you were asking for(?)
As a bit of a tangent: if the file is really big, you may not want to load the whole thing into memory, in which case you may need a generator. You probably don't need one if you're actually printing everything out, but in case this is a simplified version of the problem, here is how I would approach making a generator to split the file:
class reader_and_split(object):
    def __init__(self, fname, delim='\t'):
        self.fname = fname
        self.delim = delim

    def __enter__(self):
        self.file = open(self.fname, 'r')
        return self.word_generator()

    def __exit__(self, type, value, traceback):
        self.file.close()

    def word_generator(self):
        current = []
        while True:
            char = self.file.read(1)
            if char == self.delim:
                yield ''.join(current)
                current = []
            elif not char:
                # end of file: yield any leftover characters so the last
                # word isn't dropped when there is no trailing delimiter
                if current:
                    yield ''.join(current)
                break
            else:
                current.append(char)
The value of a generator is that you don't load the entire contents of the file into memory before running the split on it, which can be expensive for very, very large files. This implementation only allows a single-character delimiter, for simplicity. To parse out everything, all you need to do is use the generator; a quick and dirty way is:
with reader_and_split(fileloc) as f:
    previous = f.next()
    for word in f:
        print "%s \t %s" % (previous, word)
        previous = word
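As an aside, reading one character at a time is slow in CPython. Here is a hedged sketch of a chunked drop-in replacement for the word_generator method above, under the same single-character-delimiter assumption (not part of the original answer):

    def word_generator(self, chunk_size=4096):
        # Same contract as the version above, but amortize I/O over
        # larger reads instead of one character at a time.
        pending = ''
        while True:
            chunk = self.file.read(chunk_size)
            if not chunk:
                break
            pending += chunk
            parts = pending.split(self.delim)
            pending = parts.pop()  # the last piece may be cut off mid-word
            for word in parts:
                yield word
        if pending:
            yield pending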

You can be more pythonic in both your file reading and printing. Try this:
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as f:
        strings = f.readline().split('\t')
    for i, word in enumerate(strings):
        print "{} \t {}".format(word, strings[i+1:i+2])

main()
Using strings[i+1:i+2] ensures an IndexError isn't thrown (instead it returns an empty list []) when trying to reach the i+1th index at the end of the list. Note that the slice prints as a list, e.g. ['jolli'], rather than a bare string.
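A quick illustration of why the slice is safe where direct indexing is not (a hedged aside, not from the original answer):

words = ['haffi', 'jolli']
print words[1:2]   # ['jolli']
print words[2:3]   # [] -- slicing past the end is safe
# print words[2]   # would raise IndexError: list index out of range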

Here's one clean way to do it. It has the benefit of not crashing when fed an odd number of items, but of course you may prefer an exception for that case.
def main():
    with open('C:\Users\Notandi\Desktop\patriline.txt', 'r') as infile:
        line = infile.readline()
    previous = None
    for i in line.split('\t'):
        if previous is None:
            previous = i
        else:
            print previous, '\t', i
            previous = None
I won't make any claims that this is pythonic though.

Related

Python - How to read a specific line in a text file?

I have a huge text file (12 GB). The lines are tab-delimited and the first column contains an ID. For each ID I want to do something. Therefore, my plan is to start with the first line and go through the first column line by line until the next ID is reached.
import linecache

start_line = b
num_lines = 377763316
while b < num_lines:
    plasmid1 = linecache.getline("Result.txt", b-1)
    plasmid1 = plasmid1.strip("\n")
    plasmid1 = plasmid1.split("\t")
    plasmid2 = linecache.getline("Result.txt", b)
    plasmid2 = plasmid2.strip("\n")
    plasmid2 = plasmid2.split("\t")
    if not str(plasmid1[0]) == str(plasmid2[0]):
        end_line = b
        #do something
The code works, but the problem is that linecache seems to reload the text file every time. The code would run for several years if I don't improve the performance.
I appreciate your help if you have a good idea how to solve the issue or know an alternative approach!
Thanks,
Philipp
I think numpy.loadtxt() is the way to go. Also, it would be nice to pass the usecols argument to specify which columns you actually need from the file. The numpy package is a solid library written with high performance in mind.
After calling loadtxt() you will get an ndarray back.
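A hedged sketch of what that call might look like for this file; the filename and column choice are taken from the question, not tested against the real data:

import numpy as np

# Load only the first (ID) column of the tab-delimited file as strings.
ids = np.loadtxt("Result.txt", delimiter="\t", usecols=(0,), dtype=str)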
You can use itertools:
from itertools import takewhile

class EqualityChecker(object):
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        result = False
        current_id = current_line.split('\t')[0]
        if self.id == current_id:
            result = True
        return result

with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        for line in takewhile(checker, f.xreadlines()):
            do_stuff(line)
In the outer loop, id can actually be obtained from the first line whose id doesn't match the previous value.
You should open the file just once, and iterate over the lines.
with open('Result.txt', 'r') as f:
    aline = f.next()
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid
You get the idea, just use plain python.
Only one line is read in each iteration. The extra 1 argument to split makes it split only at the first tab, increasing performance. You will not get better performance from any specialized library; only a plain C implementation could beat this approach.
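A quick demonstration of that maxsplit argument (a hedged aside, not from the original answer):

line = "ID1\tfield2\tfield3\n"
print line.split('\t', 1)[0]   # 'ID1' -- stops splitting after the first tab
print line.split('\t', 1)      # ['ID1', 'field2\tfield3\n']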
If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.x (see question io-textiowrapper-object). Try this version instead:
with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid
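As a side note, itertools.groupby expresses this grouping pattern directly. A hedged sketch under the same tab-delimited assumption, not from the original answers:

from itertools import groupby

with open('Result.txt', 'r') as f:
    # consecutive lines sharing the same first column form one group
    for current_id, block in groupby(f, key=lambda line: line.split('\t', 1)[0]):
        for line in block:
            pass  # do something with every line of this ID block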

How do I change print to return in a for loop in python?

The following code is doing most of what I want...
All I need is for that print to actually be a return so that I can dump the data into another txt file I'm writing (f2).
(Also, the spacing obtained with print letters is not what I want but I figure I'll deal with it later.)
Every time I replace print with return it just stops reading after the first line of the initial text file (f1).
def DNA2Prot(f1, f2="translated_fasta.txt"):
    fin = open(f1, 'r')
    for letters in fin:
        if letters[0] != ">":
            seqs = letters
            codons = []
            protein = ''
            for i in range(0, len(seqs), 3):
                try:
                    codon = seqs[i:i+3]
                    codons = codon_table[codon]
                    protein = protein + codons
                except KeyError:
                    protein += ""
            print protein
        else:
            print letters
    fin.close()
Use yield instead and treat your function as a generator. This way the caller can do whatever they please with all the proteins the DNA2Prot function generates, and the file keeps being read until it is exhausted.
def DNA2Prot(f1, f2='translated_fasta.txt'):
    # prefer using `with` to `open` and `close`
    with open(f1, 'r') as fin:
        for letters in fin:
            if letters[0] != '>':
                seqs = letters
                codons = []
                protein = ''
                for i in range(0, len(seqs), 3):
                    # no need for a try/except, because we can use `get`:
                    # it returns None by default if the specified `codon`
                    # does not appear in `codon_table`
                    codon = seqs[i:i + 3]
                    codons = codon_table.get(codon)
                    if codons:
                        protein += codons
                yield protein
            else:
                yield letters
Now you have to treat the DNA2Prot function as an iterator:
with open('/path/to/outfile', 'w') as fout:
    for protein in DNA2Prot(f1):
        # do something with each protein, e.g. dump it to the output file
        fout.write(protein)
First things first: when you use the return statement, you are telling your code to break out (i.e., leave) at the point where the return statement is located. This means your code will start reading from fin, move on to the second for loop, and as soon as it is done with it (it has read all the letters of the first line) it will reach your return statement and break out of the DNA2Prot function.
Now, there are two things you can do when it comes to files. The first is to use the print function to redirect your output to a file (not recommended); the second is to properly open the files and write to them.
With regards to the first solution (and assuming you are using Python 2.7) you can simply do:
from __future__ import print_function
and when you want to use your print statement just write print(protein, file=out_file), where out_file is a file object you opened for writing (not fin, which is open for reading).
However, if I were you I would go for a more elegant and clean solution that doesn't rely on unnecessary imports:
def DNA2Prot(f1, f2="translated_fasta.txt"):
    # Using the with-open statement you don't need to close the file objects
    with open(f1, 'r+') as fin, open(f2, 'w+') as fin2:
        for letters in fin:
            if letters[0] != ">":
                seqs = letters
                codons = []
                protein = ''
                for i in range(0, len(seqs), 3):
                    try:
                        codon = seqs[i:i+3]
                        codons = codon_table[codon]
                        protein = protein + codons
                    except KeyError:
                        protein += ""
                fin2.write(protein)  # write your data to the second file (fin2, not f2)
            else:
                fin2.write(letters)

Improving the speed of a python script

I have an input file containing a list of strings.
I am iterating through every fourth line starting on line two.
From each of these lines I make a new string from the first and last 6 characters and put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep sequencing files, and it has been running for a day without making much progress. So I'm looking for any suggestions to make this much faster, if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            # Make string from first and last 6 characters of a line
            lineChars = line[0:6] + line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n')  # if string is unique, write to output file
            for skip in range(3):  # used to step through four lines at a time
                try:
                    check = line  # check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
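A hedged sketch combining both suggestions on the question's loop; the filenames are assumptions, and the every-fourth-line offset may need adjusting for your data:

with open('input.txt', 'r') as f, open('output.txt', 'w') as target:
    seen = set()
    chunks = []
    for i, line in enumerate(f):
        if i % 4 != 1:                   # every fourth line, starting on line two
            continue
        lineChars = line[0:6] + line[145:151]
        if lineChars not in seen:        # O(1) on a set vs. O(n) on a list
            seen.add(lineChars)
            chunks.append(lineChars + '\n')
    target.write(''.join(chunks))        # a single write at the end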
You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):
import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            # strip the trailing newline so it doesn't end up in the key
            s = line[:6] + line.rstrip('\n')[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using set as Oscar suggested, you can also use islice to skip lines rather than using a for loop.
As stated in this post, islice preprocesses the iterator in C, so it should be much faster than a plain vanilla Python for loop.
Try replacing
lineChars = line[0:6]+line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.
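Rather than guessing, you can measure this for your own data with timeit (a hedged aside, not from the original answer):

import timeit

setup = "line = 'x' * 151"
print timeit.timeit("line[0:6] + line[145:151]", setup=setup)
print timeit.timeit("''.join([line[0:6], line[145:151]])", setup=setup)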

how to perform XOR of all words in a file

I want to convert all words in a standard dictionary (for example, /usr/share/dict/words on a Unix machine) to integers, find the XOR of every two words in the dictionary (of course, after converting them to integers), and probably store the results in a new file.
Since I am new to Python, and because of the large file sizes, the program keeps hanging.
import os

dictionary = open("/usr/share/dict/words", "r")
'''a = os.path.getsize("/usr/share/dict/words")
c = fo.read(a)'''
words = dictionary.readlines()
foo = open("word_integer.txt", "a")
for word in words:
    foo.write(word)
    foo.write("\t")
    int_word = int(word.encode('hex'), 16)
    '''print int_word'''
    foo.write(str(int_word))
    foo.write("\n")
foo.close()
First we need a method to convert your string to an int. I'll make one up (since what you're doing isn't working for me at all; maybe you meant to encode as Unicode?):
def word_to_int(word):
    return sum(ord(i) for i in word.strip())
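For example, under this made-up scheme:

print word_to_int("abc\n")   # 97 + 98 + 99 = 294; the newline is stripped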
Next, we need to process the files. The following works in Python 2.7 onward (in 2.6, just nest two separate with blocks, or use contextlib.nested):
with open("/usr/share/dict/words","rU") as dictionary:
with open("word_integer.txt", "a") as foo:
while dictionary:
try:
w1, w2 = next(dictionary), next(dictionary)
foo.write(str(word_to_int(w1) ^ word_to_int(w2)))
except StopIteration:
print("We've run out of words!")
break
This code seems to work for me. You're likely running into efficiency issues because you are calling readlines() on the entire file which pulls it all into memory at once.
This solution re-reads the file once for each line, pairing every word with every word in the dictionary and computing the XOR as it goes.
f = open('/usr/share/dict/words', 'r')
pairwise_xors = {}

def str_to_int(w):
    return int(w.encode('hex'), 16)

for line1 in f:
    line1 = line1.strip()
    # re-open the dictionary so line1 is paired with every word in it
    g = open('/usr/share/dict/words', 'r')
    for line2 in g:
        line2 = line2.strip()
        if line1 and line2:
            pairwise_xors[(line1, line2)] = str_to_int(line1) ^ str_to_int(line2)
    g.close()
f.close()
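As an aside, itertools.combinations expresses the every-two-words idea directly. A hedged sketch, not from the original answers; note that a full dictionary yields billions of pairs, so printing or streaming them out is usually more realistic than storing them all:

from itertools import combinations

def str_to_int(w):
    return int(w.encode('hex'), 16)

with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f if w.strip()]

for w1, w2 in combinations(words, 2):   # every unordered pair, exactly once
    print "%s\t%s\t%d" % (w1, w2, str_to_int(w1) ^ str_to_int(w2))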

Deleting certain line of text file in python

I have the following text file:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,456
FRUIT
DRINK
FOOD,BURGER
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
CAR
And I have the following list called 'wanted':
['123', '789']
What I'm trying to do is: if the number after NUM is not in the list called 'wanted', then that line, along with the 4 lines below it, gets deleted. So the output file will look like:
This is my text file
NUM,123
FRUIT
DRINK
FOOD,BACON
CAR
NUM,789
FRUIT
DRINK
FOOD,SAUSAGE
CAR
My code so far is:
infile = open("inputfile.txt",'r')
data = infile.readlines()
for beginning_line, ube_line in enumerate(data):
UNIT = data[beginning_line].split(',')[1]
if UNIT not in wanted:
del data_list[beginning_line:beginning_line+4]
You shouldn't modify a list while you are looping over it.
What you could try is to just advance the iterator on the file object when needed:
wanted = set(['123', '789'])

with open("inputfile.txt", 'r') as infile, open("outfile.txt", 'w') as outfile:
    for line in infile:
        if line.startswith('NUM,'):
            UNIT = line.strip().split(',')[1]
            if UNIT not in wanted:
                for _ in xrange(4):
                    infile.next()
                continue
        outfile.write(line)
And use a set. It is faster for constantly checking the membership.
This approach doesn't make you read in the entire file at once to process it in a list form. It goes line by line, reading from the file, advancing, and writing to the new file. If you want, you can replace the outfile with a list that you are appending to.
There are some issues with the code: data_list isn't even defined (presumably you meant data); deleting slices from a list while you iterate over it will skip lines; you use both enumerate and direct index access on data; and readlines is not needed.
I'd suggest avoiding keeping all the lines in memory; it's not really needed here. Maybe try something like this (untested):
with open('infile.txt') as fin, open('outfile.txt', 'w') as fout:
    for line in fin:
        # strip the newline first, or the membership test always fails
        if line.startswith('NUM,') and line.rstrip().split(',')[1] not in wanted:
            for _ in range(4):
                fin.next()
        else:
            fout.write(line)
import re

# find the lines that match NUM,XYZ
nums = re.compile('NUM,(?:' + '|'.join(['456', '012']) + ')')
# grab the matching line plus the three lines after it
line_matches = breaks = re.compile('.*\n.*\n.*\n.*\n')

keeper = ''
for line in nums.finditer(data):
    keeper += breaks.findall(data[line.start():])[0]
result on the given string is
NUM,456
FRUIT
DRINK
FOOD,BURGER
NUM,012
FRUIT
DRINK
FOOD,MEATBALL
edit: deleting items while iterating is probably not a good idea, see: Remove items from a list while iterating
infile = open("inputfile.txt",'r')
data = infile.readlines()
SKIP_LINES = 4
skip_until = False
result_data = []
for current_line, line in enumerate(data):
if skip_until and skip_until < current_line:
continue
try:
_, num = line.split(',')
except ValueError:
pass
else:
if num not in wanted:
skip_until = current_line + SKIP_LINES
else:
result_data.append(line)
... and result_data is what you want.
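To persist it, a writelines call is enough (the output filename here is an assumption):

with open('outputfile.txt', 'w') as outfile:
    outfile.writelines(result_data)   # result_data already contains newlines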
If you don't mind building a list, and if your "NUM" lines really do come every fifth line (i.e. lines[::5] are the NUM lines), you may want to try:
keep = []
for (i, v) in enumerate(lines[::5]):
    (num, current) = v.strip().split(",")   # strip the newline before comparing
    if current in wanted:
        keep.extend(lines[i*5:i*5+5])
Don't try to think of this in terms of building up a list and removing stuff from it while you loop over it. That way leads madness.
It is much easier to write the output file directly. Loop over lines of the input file, each time deciding whether to write it to the output or not.
Also, to avoid difficulties with the fact that not every line has a comma, try using .partition instead to split up the lines. That will always return 3 items: when there is a comma, you get (before the first comma, the comma, after the first comma); otherwise, you get (the whole thing, empty string, empty string). One caveat: since the test below is not in wanted, you still have to restrict the check to NUM lines and strip the trailing newline; otherwise comma-less lines (whose third item is the empty string) and FOOD lines would wrongly trigger a skip.
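A quick look at what .partition returns (a hedged aside, not from the original answer):

print 'NUM,123'.partition(',')   # ('NUM', ',', '123')
print 'FRUIT'.partition(',')     # ('FRUIT', '', '')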
skip_counter = 0
for line in infile:
    # only NUM lines decide whether a block is kept; strip the newline first
    if line.startswith('NUM,') and line.rstrip().partition(',')[2] not in wanted:
        skip_counter = 5
    if skip_counter:
        skip_counter -= 1
    else:
        outfile.write(line)
