Split file when a string matches exactly - Python

I have a huge text file that I need to split wherever a line contains exactly the value 'EKYC'. However, my script fails when other values with a similar prefix show up.
I am new to Python and it is wearing me out.
import sys;
import os;
MASTER_TEXT_FILE=sys.argv[1];
OUTPUT_FILE=sys.argv[2];
L = file(MASTER_TEXT_FILE, "r").read().strip().split("EKYC")
i = 0
for l in L:
    i = i + 1
    f = file(OUTPUT_FILE+"-%d.ekyc" % i , "w")
    print >>f, "EKYC" + l
The script breaks when there is EKYCSMRT or EKYCVDA or EKYCTIGO. How can I add a guard to prevent the split from occurring at those points?
This is the content of all of the messages:
EKYC
WIK 12
EKYC
WIK 12
EKYCTIGO
EKYC
WIK 13
TTL
EKYCVD
EKYC
WIK 14
TTL D
Thanks for the assistance.

If possible, you should avoid reading large files into memory all at once. Instead, stream chunks of them at a time.
The sensible chunks of text files are usually lines. This can be done with .readline(), but simply iterating over the file yields its lines too.
After reading a line (which includes the newline), you can .write() it directly to the current output file.
import sys

master_filename = sys.argv[1]
output_filebase = sys.argv[2]

output = None
output_number = 0
for line in open(master_filename):
    if line.strip() == 'EKYC':
        if output is not None:
            output.close()
            output = None
    else:
        if output is None:
            output_number += 1
            output_filename = '%s-%d.ekyc' % (output_filebase, output_number)
            output = open(output_filename, 'w')
        output.write(line)
if output is not None:
    output.close()
The output file is closed and reset upon encountering 'EKYC' on its own line.
Here, you'll notice that the output file isn't (re)opened until right before there is a line to write to it: this avoids creating an empty output file in case there are no further lines to write to it. You'll have to re-order this slightly if you want the 'EKYC' line to appear in the output file also.
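For example, a minimal re-ordering (a sketch under the same assumptions as above) that starts a new output file at each 'EKYC' line and writes the marker line into it:

import sys

master_filename = sys.argv[1]
output_filebase = sys.argv[2]

output = None
output_number = 0
for line in open(master_filename):
    if line.strip() == 'EKYC':
        # a new section starts here: close the previous file and open the next
        if output is not None:
            output.close()
        output_number += 1
        output = open('%s-%d.ekyc' % (output_filebase, output_number), 'w')
    if output is not None:
        output.write(line)  # the 'EKYC' marker line itself is written too
if output is not None:
    output.close()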

Based on your sample input file, you need to split on '\nEKYC\n':
#!/usr/bin/env python
import sys

MASTER_TEXT_FILE = sys.argv[1]
OUTPUT_FILE = sys.argv[2]

with open(MASTER_TEXT_FILE) as f:
    fdata = f.read()

i = 0
for subset in fdata.split('\nEKYC\n'):
    i += 1
    with open(OUTPUT_FILE + "-%d.ekyc" % i, 'w') as output:
        output.write(subset)
Other comments:
Python doesn't use ;.
Your original code wasn't using os.
It's recommended to use with open(<filename>, <mode>) as f: ... since it handles possible errors and closes the file afterward.


Find matches and add a column

I want to achieve this specific task: I have 2 files, the first one with emails and credentials:
xavier.desprez#william.com:Xavier
xavier.locqueneux#william.com:vocojydu
xaviere.chevry#pepe.com:voluzigy
Xavier.Therin#william.com:Pussycat5
xiomara.rivera#william.com:xrhj1971
xiomara.rivera#william-honduras.william.com:xrhj1971
and the second one, with emails and location:
xavier.desprez#william.com:BOSNIA
xaviere.chevry#pepe.com:ROMANIA
I want that whenever the email from the first file is found in the second file, the row is replaced by EMAIL:CREDENTIAL:LOCATION, and when it is not found, it ends up as EMAIL:CREDENTIAL:BLANK.
So the final file must look like this:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.locqueneux#william.com:vocojydu:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
I have made several attempts in Python, but it is not even worth writing them here because I am not really close to the solution.
Regards!
EDIT:
This is what I tried:
import os
import sys
with open("test.txt", "r") as a_file:
for line_a in a_file:
stripped_email_a = line_a.strip().split(':')[0]
with open("location.txt", "r") as b_file:
for line_b in b_file:
stripped_email_b = line_b.strip().split(':')[0]
location = line_b.strip().split(':')[1]
if stripped_email_a == stripped_email_b:
a = line_a + ":" + location
print(a.replace("\n",""))
else:
b = line_a + ":BLANK"
print (b.replace("\n",""))
This is the result I get:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.desprez#william.com:Xavier:BLANK
xaviere.chevry#pepe.com:voluzigy:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
xavier.locqueneux#william.com:vocojydu:BLANK
xavier.locqueneux#william.com:vocojydu:BLANK
Xavier.Therin#william.com:Pussycat5:BLANK
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
I am very close but I get duplicates ;)
Regards
The duplication issue comes from the fact that you are reading the two files in a nested way: each time a line from test.txt is read, you open location.txt and process it in full. Then you read the second line from test.txt, re-open location.txt, and process it all again.
Instead, get all the necessary data from location.txt into a dictionary first, and then use it while reading test.txt:
email_loc_dict = {}
with open("location.txt", "r") as b_file:
    for line_b in b_file:
        splits = line_b.strip().split(':')
        email_loc_dict[splits[0]] = splits[1]

with open("test.txt", "r") as a_file:
    for line_a in a_file:
        line_a = line_a.strip()
        stripped_email_a = line_a.split(':')[0]
        if stripped_email_a in email_loc_dict:
            a = line_a + ":" + email_loc_dict[stripped_email_a]
            print(a)
        else:
            b = line_a + ":BLANK"
            print(b)
Output:
xavier.desprez#william.com:Xavier:BOSNIA
xavier.locqueneux#william.com:vocojydu:BLANK
xaviere.chevry#pepe.com:voluzigy:ROMANIA
Xavier.Therin#william.com:Pussycat5:BLANK
xiomara.rivera#william.com:xrhj1971:BLANK
xiomara.rivera#william-honduras.william.com:xrhj1971:BLANK
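A slightly more compact variant of the lookup (a sketch, assuming the same two-column file layout) uses dict.get with "BLANK" as the default:

email_loc_dict = {}
with open("location.txt", "r") as b_file:
    for line_b in b_file:
        email, location = line_b.strip().split(':')
        email_loc_dict[email] = location

with open("test.txt", "r") as a_file:
    for line_a in a_file:
        line_a = line_a.strip()
        email = line_a.split(':')[0]
        # get() falls back to "BLANK" when the email has no known location
        print(line_a + ":" + email_loc_dict.get(email, "BLANK"))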

Removing extra space from text file

I am currently keeping high scores in a text file called "score.txt". The program works fine, updating the file with the new high scores as normal, except that every time the program updates the file there is always one blank line before the first high score, which creates an error the next time I try to save the scores. The code:
scores_list = []
score = 10

def take_score():
    # Save old scores into list
    f = open("score.txt", "r")
    lines = f.readlines()
    for line in lines:
        scores_list.append(line)
        print scores_list
    f.close()

take_score()

def save_score():
    # Clear file
    f = open("score.txt", "w")
    print >> f, ""
    f.close()
    # Rewrite scores into text files
    w = open("score.txt", "a")
    for i in range(0, len(scores_list)):
        new_string = scores_list[i].replace("\n", "")
        scores_list[i] = int(new_string)
        if score > scores_list[i]:
            scores_list[i] = score
    for p in range(0, len(scores_list)):
        print >> w, str(scores_list[p])
        print repr(str(scores_list[p]))

save_score()
The problem mentioned happens in the save_score() function. I have tried this related question: Removing spaces and empty lines from a file using Python, but it requires that I open the file in "r" mode. Is there a way to accomplish the same thing when the file is opened in "a" (append) mode?
You are specifically printing an empty line as soon as you create the file.
print >> f, ""
You then append to it, keeping the empty line.
If you just want to clear the contents every time you run this, get rid of this:
# Clear file
f = open("score.txt", "w")
print >> f, ""
f.close()
And modify the opening to this:
w = open("score.txt", "w")
The 'w' mode already truncates the file. There's no need to truncate, write an empty line, close, and then append lines; just truncate and write what you want to write.
That said, you should use the with construct and file methods for working with files:
with open("score.txt", "w") as output: # here's the with construct
for i in xrange(len(scores_list)):
# int() can handle leading/trailing whitespace
scores_list[i] = int(scores_list[i])
if score > scores_list[i]:
scores_list[i] = score
for p in xrange(len(scores_list)):
output.write(str(scores_list[p]) + '\n') # writing to the file
print repr(str(scores_list[p]))
You will then not need to explicitly close() the file handle, as with takes care of that automatically and more reliably. Also note that you can simply send a single argument to range and it will iterate from 0, inclusive, until that argument, exclusive, so I've removed the redundant starting argument, 0. I've also changed range to the more efficient xrange, as range would only be reasonably useful here if you wanted compatibility with Python 3, and you're using Python 2-style print statements anyway, so there isn't much point.
print appends a newline to what you print. In the line
print >> f, ""
You're writing a newline to the file. This newline still exists when you reopen in append mode.
As @Zizouz212 mentions, you don't need to do all this. Just open in write mode, which will truncate the file, then write what you need.
You're opening a file, clearing it, and then opening the same file again unnecessarily. When you open the file, you print a newline, even if you don't think so. Here is the offending line:
print >> f, ""
In Python 2, it really does this.
print "" + "\n"
This is because Python adds a newline at the end of the string to each print statement. To stop this, you could add a comma to the end of the statement:
print "",
Or just write directly:
f.write("my data")
However, if you're trying to save a Python data type, and it does not have to be human-readable, you may have luck using pickle. It's really simple to use:
import pickle

def save_score():
    with open('scores.txt', 'w') as f:
        pickle.dump(score_data, f)
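Reading the scores back is just as simple; a sketch, assuming the same hypothetical score_data object was saved as above:

import pickle

def load_score():
    with open('scores.txt') as f:
        return pickle.load(f)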
This is not really an answer to the question; it is my version of your code (not tested). Don't be afraid to rewrite everything ;)
# --- functions ---

def take_score():
    '''read values and convert to int'''
    scores = []
    with open("score.txt", "r") as f:
        for line in f:
            value = int(line.strip())
            scores.append(value)
    return scores

def save_score(scores):
    '''save values'''
    with open("score.txt", "w") as f:
        for value in scores:
            f.write(str(value))
            f.write("\n")

def check_scores(scores, min_value):
    results = []
    for value in scores:
        if value < min_value:
            value = min_value
        results.append(value)
    return results

# --- main ---

score = 10

scores_list = take_score()
scores_list = check_scores(scores_list, score)
save_score(scores_list)

Adding each item in list to end of specific lines in FASTA file

I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
It's hard to explain, but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')), I want it to replace part of that header with an element from a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down a bit. Also, an important note: you don't close your file handles. This could result in errors, specifically when writing to a file, and either way it's bad practice. Code:
#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])  # tab as separator
    return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)  # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]  # store the first whitespace-separated field in a one-element list
                line_split.append(annos.pop(0))  # append the annotation of interest to the current id line
                output.write(' '.join(line_split) + '\n')  # join and write to file with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
This is not perfect, but it cleans things up a bit. I might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
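One order-independent alternative, sketched here on the assumption that the annotation file also carries the sequence ID in its first tab-separated column (adjust the indexes to your real format), is to build a dict keyed by ID:

# builds {sequence_id: annotation} instead of an ordered list
def get_annos_by_id(infile):
    annos = {}
    with open(infile, 'r') as fh:
        for line in fh:
            fields = line.rstrip('\n').split('\t')
            annos[fields[0]] = fields[5]  # assumed: column 0 holds the sequence ID
    return annos

In add_annos you would then look up annos[line.split()[0].lstrip('>')] instead of calling pop(0), so the two files no longer need to be in the same order.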
There is a great Python library, Biopython, for parsing FASTA and other DNA file formats. It is very helpful in bioinformatics, and it also lets you manipulate the data however you need.
Here is a simple example extracted from the library website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
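To rewrite the headers themselves with Biopython, a sketch along the lines of the original task (mylist and the file names here are assumptions) could set each record's description and write the records back out:

from Bio import SeqIO

mylist = ['things1', '', 'things3']  # assumed: one annotation per record, in order

records = list(SeqIO.parse("input.fasta", "fasta"))
for record, anno in zip(records, mylist):
    # keep the '>seqN' id, replace everything after it
    record.description = (record.id + ' ' + anno).strip()
SeqIO.write(records, "output.fasta", "fasta")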
EDIT:
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time / a lot of memory.
#!/usr/bin/python
# Script takes unedited FASTA file, removes seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt

import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)  # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)

Copy a part of a text line from a text document into a new document

I have a bunch of URLs in a text file, but I'm only interested in one part of each URL, and I want to save that part into another document. I've managed to read one line at a time and then write it to a file using this:
from sys import argv
script, sol , save = argv
data = open(sol)
indata = data.read()
result = indata[51:85]
result2 = "http://mars.jpl.nasa.gov/msl-raw-images/msss/00003/mcam/" + result + ".jpg"
output = open(save, 'w')
output.write(result2)
data.close()
output.close()
But I'm unable to port that into a for loop:
from sys import argv

script, sol = argv
data = open(sol)
indata = data.read()
for line in indata:
    indata[51:85],
data.close()
I tried printing it to the screen to see where I'm getting it wrong, but I only get empty lines. I'm stuck and I hope you can give me a hand.
from sys import argv

script, sol, save = argv
data = open(sol)
indata = data.read()

def get_line():
    for line in indata.splitlines():
        print indata[51:85]
        result = indata[51:85]
        result2 = "http://mars.jpl.nasa.gov/msl-raw-images/msss/00003/mcam/" + result + ".jpg"
        output = open(save, 'w')
        output.write(result2)
        output.close()

get_line()
data.close()
I've managed to do this, but I can only save the first line in the new document. The rest are printed on the screen but not saved in the new document.
EDIT
Your control flow is off: you need to open the file before the loop.
The result = ... line is probably a bit confusing, so I'll explain it. First it uses .replace to change text in the line. Then it uses the slice [:-4] to drop the last 4 characters. Finally it appends the string '-br.jpg' to the whole thing.
from sys import argv

script, sol, save = argv

def get_line():
    data = open(sol)
    output = open(save, 'w')
    for line in data:  # for each line in the input file
        result = line.replace('msl/multimedia/raw/?rawid=', 'msl-raw-images/msss/00003/mcam/')[:-4] + '-br.jpg\n'
        output.write(result)
    output.close()
    data.close()

get_line()
get_line()
You can iterate over the lines of the file itself:
from sys import argv

script, sol = argv
data = open(sol)
for line in data:
    print line[51:85]
data.close()
This seems closer to what you want.
When you call .read(), you grab the contents of the entire file as a single string; indexing then selects characters from that entire string, not from a specific line. In the code above, you index into each line one at a time.
Also, since this is a URL and you're only interested in one section of it, the .split method could make your indexing easier. It returns a list of strings made by splitting your original string at a given character. For example:
>>> line = 'stackoverflow.com/posts/11908027/'
>>> line.split('/')
['stackoverflow.com', 'posts', '11908027', '']
>>> line.split('/')[2]
'11908027'
>>> line.split('/')[1]
'posts'
Try:
for line in indata.splitlines():
    print line[51:85]
I would look into split() and splitlines(), which are useful when breaking up standard text such as URLs. You can learn more about each here:
http://docs.python.org/library/stdtypes.html
That page also has some info on partition(), which may be of use to you as well. It takes a string and a separator, giving you a few options on how to store the data.
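For comparison, partition() splits at the first occurrence of the separator and keeps it, returning a 3-tuple:
>>> line = 'stackoverflow.com/posts/11908027/'
>>> line.partition('/')
('stackoverflow.com', '/', 'posts/11908027/')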

Read File lines and characters

I have an input file which looks like this:
some data...
some data...
some data...
...
some data...
<binary size="2358" width="32" height="24">
data of size 2358 bytes
</binary>
some data...
some data...
The value 2358 in the binary size attribute can change for different files.
Now I want to extract those 2358 bytes of data (the size is a variable) and write them to another file.
I wrote the following code for this, but it gives me an error; I am not able to extract the binary data and write it to another file.
c = responseFile.read(1)
ValueError: Mixing iteration and read methods would lose data
The code is:
import re

outputFile = open('output', 'w')
inputFile = open('input.txt', 'r')

fileSize = 0
width = 0
height = 0

for line in inputFile:
    if "<binary size" in line:
        x = re.findall('\w+', line)
        fileSize = int(x[2])
        width = int(x[4])
        height = int(x[6])
        break

print x

# Here the file will point to the start location of 2358 bytes.
for i in range(0, fileSize, 1):
    c = inputFile.read(1)
    outputFile.write(c)

outputFile.close()
inputFile.close()
Final answer to my question:
#!/usr/local/bin/python
import os

inputFile = open('input', 'r')
outputFile = open('output', 'w')

flag = False
for line in inputFile:
    if line.startswith("<binary size"):
        print 'Start of Data'
        flag = True
    elif line.startswith("</binary>"):
        flag = False
        print 'End of Data'
    elif flag:
        outputFile.write(line)  # keeps each newline; the last one is trimmed below

inputFile.close()
outputFile.close()

# I have to delete the last extra newline character from the output.
size = os.path.getsize('output')
outputFile = open('output', 'ab')
outputFile.truncate(size - 1)
outputFile.close()
How about a different approach? In pseudo-code:
for each line in input file:
    if line starts with binary tag: set output flag to True
    if line starts with binary-termination tag: set output flag to False
    if output flag is True: copy line to the output file
And in real code:
outputFile = open('./output', 'w')
inputFile = open('./input.txt', 'r')

flag = False
for line in inputFile:
    if line.startswith("<binary size"):
        flag = True
    elif line.startswith("</binary>"):
        flag = False
    elif flag:
        outputFile.write(line[:-1])  # remove newline

outputFile.close()
inputFile.close()
Try changing your first loop to something like this:
while True:
    line = inputFile.readline()
    if not line:  # stop at end of file
        break
    # continue the loop body as it was
This gets rid of iteration and only leaves read methods, so the problem should disappear.
Consider this method:
import re
line = '<binary size="2358" width="32" height="24">'
m = re.search('size="(\d*)"', line)
print m.group(1) # 2358
It varies from your code, so it's not a drop-in replacement, but it shows different regular-expression functionality.
It uses Python's regex group-capturing features and is much more robust than your string-splitting method.
For example, consider what would happen if the attributes were re-ordered:
<binary width="32" size="2358" height="24">
instead of
<binary size="2358" width="32" height="24">
Would your code still work? Mine would. :-)
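A quick check of that claim with the reordered attributes:
>>> import re
>>> re.search('size="(\d*)"', '<binary width="32" size="2358" height="24">').group(1)
'2358'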
Edit: To answer your question:
If you want to read n bytes of data from the beginning of a file, you could do something like
bytes = ifile.read(n)
Note that you may get fewer than n bytes if the input file is not long enough.
If you don't want to start from the "0th" byte, but some other byte, use seek() first, as in:
ifile.seek(9)
bytes = ifile.read(5)
That would give you bytes 9:14, i.e. the 10th through 14th bytes.
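Putting the pieces together for the original problem, here is a sketch (using the question's file names) that avoids mixing iteration with read() by scanning with readline() and then reading exactly fileSize bytes:

import re

inputFile = open('input.txt', 'r')
outputFile = open('output', 'w')

while True:
    line = inputFile.readline()  # readline() keeps read() usable afterwards
    if not line:
        break
    m = re.search('size="(\d+)"', line)
    if m:
        fileSize = int(m.group(1))
        outputFile.write(inputFile.read(fileSize))  # copy exactly fileSize bytes
        break

outputFile.close()
inputFile.close()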
