I have a file that looks something like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC
And I need to split it into the "non-N" sequences, so two separate files like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCT
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGC
TATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACA
TGTGTTGC
What I currently have is this:
UMfile = open ("C:\Users\Manuel\Desktop\sequence.txt","r")
contignumber = 1
contigfile = open ("contig "+str(contignumber), "w")
DNA = UMfile.read()
DNAstring = str(DNA)
for s in DNAstring:
    DNAstring.split("NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN",1)
    contigfile.write(DNAstring)
    contigfile.close()
    contignumber = contignumber+1
    contigfile = open ("contig "+str(contignumber), "w")
The thing is, I realize there are line breaks between the Ns, and that is why it is not splitting my file. The "file" I'm showing is just a small part of a much, much bigger one, so sometimes the Ns will look like "NNNNNN\n" and sometimes like "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n", yet there is always a run of exactly 1000 Ns between the sequences I need to split.
So my question is: how do I tell Python to split, and write into different files, at every run of 1000 Ns, knowing that there will be a different number of Ns on each line?
Thank you all very much; I really have no informatics background and my Python skills are at best basic.
Just split your string on 'N' and then remove all the strings that are empty, or just contain a newline. Like this:
#!/usr/bin/env python
DNAstring = '''AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC'''
sequences = [u for u in DNAstring.split('N') if u and u != '\n']
for i, seq in enumerate(sequences):
    print i
    print seq.replace('\n', '') + '\n'
output
0
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACTTCTCAATGGGCAGTACATATCATCTCT
1
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACATGTGTTGC
The code snippet above also removes newlines inside the sequences using .replace('\n', '').
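The snippet above prints the sequences; if you want each one in its own file, as the question asks, a minimal sketch is to strip the newlines first and then split on the run of exactly 1000 Ns (this assumes the whole file fits in memory; 'sequence.txt' and the contig file names are placeholders):
import re

with open('sequence.txt') as f:
    data = f.read().replace('\n', '')  # drop line breaks before splitting

# once the newlines are gone, the separator is exactly 1000 Ns
for i, seq in enumerate(re.split('N{1000}', data), 1):
    with open('contig%d.txt' % i, 'w') as out:
        out.write(seq + '\n')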
Here are a few programs that you may find useful.
Firstly, a line buffer class. You initialise it with a file name and a line width. You can then feed it random length strings and it will automatically save them to the text file, line by line, with all lines (except possibly the last line) having the given length. You can use this class in other programs to make your output look neat.
Save this file as linebuffer.py to somewhere in your Python path; the simplest way is to save it wherever you save your Python programs and make that the current directory when you run the programs.
linebuffer.py
#! /usr/bin/env python

''' Text output buffer
    Write fixed width lines to a text file
    Written by PM 2Ring 2015.03.23
'''

class LineBuffer(object):
    ''' Text output buffer
        Write fixed width lines to file fname
    '''
    def __init__(self, fname, width):
        self.fh = open(fname, 'wt')
        self.width = width
        self.buff = []
        self.bufflen = 0

    def write(self, data):
        ''' Write a string to the buffer '''
        self.buff.append(data)
        self.bufflen += len(data)
        if self.bufflen >= self.width:
            self._save()

    def _save(self):
        ''' Write the buffer to the file '''
        buff = ''.join(self.buff)

        #Split buff into lines
        lines = []
        while len(buff) >= self.width:
            lines.append(buff[:self.width])
            buff = buff[self.width:]

        #Add an empty line so we get a trailing newline
        lines.append('')
        self.fh.write('\n'.join(lines))

        self.buff = [buff]
        self.bufflen = len(buff)

    def close(self):
        ''' Flush the buffer & close the file '''
        if self.bufflen > 0:
            self.fh.write(''.join(self.buff) + '\n')
        self.fh.close()

def testLB():
    alpha = 'abcdefghijklmnopqrstuvwxyz'
    fname = 'linebuffer_test.txt'
    lb = LineBuffer(fname, 27)
    for _ in xrange(30):
        lb.write(alpha)
    lb.write(' bye.')
    lb.close()

if __name__ == '__main__':
    testLB()
Here is a program that makes random DNA sequences of the form you described in your question. It uses linebuffer.py to handle the output. I wrote this so I could test my DNA sequence splitter properly.
Random_DNA0.py
#! /usr/bin/env python

''' Make random DNA sequences
    Sequences consist of random subsequences of the letters 'ACGT'
    as well as short sequences of 'N', of random length up to 200.
    Exactly 1000 'N's separate sequence blocks.
    All sequences may contain newline chars.
    Takes approx 3 seconds per megabyte generated and saved
    on a 2GHz CPU single core machine.
    Written by PM 2Ring 2015.03.23
'''

import sys
import random
from linebuffer import LineBuffer

#Set seed to None to seed randomizer from system time
random.seed(37)

#Output line width
linewidth = 120

#Subsequence base length ranges
minsub, maxsub = 15, 300

#Subsequences per sequence ranges
minseq, maxseq = 5, 50

#random 'N' sequence ranges
minn, maxn = 5, 200

#Probability that a random 'N' sequence occurs after a subsequence
randn = 0.2

#Sequence separator
nsepblock = 'N' * 1000

def main():
    #Get number of sequences from the command line
    numsequences = int(sys.argv[1]) if len(sys.argv) > 1 else 2

    outname = 'DNA_sequence.txt'
    lb = LineBuffer(outname, linewidth)

    for i in xrange(numsequences):
        #Write the 1000*'N' separator between sequences
        if i > 0:
            lb.write(nsepblock)

        for j in xrange(random.randint(minseq, maxseq)):
            #Possibly make a short run of 'N's in the sequence
            if j > 0 and random.random() < randn:
                lb.write(''.join('N' * random.randint(minn, maxn)))

            #Create a single subsequence
            r = xrange(random.randint(minsub, maxsub))
            lb.write(''.join([random.choice('ACGT') for _ in r]))

    lb.close()

if __name__ == '__main__':
    main()
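A usage sketch: the optional command line argument is the number of sequences to generate, defaulting to 2.
$ python Random_DNA0.py 10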
Finally, we have a program that splits your random DNA sequences. Once again, it uses linebuffer.py to handle the output.
DNA_Splitter0.py
#! /usr/bin/env python

''' Split DNA sequences and save to separate files
    Sequences consist of random subsequences of the letters 'ACGT'
    as well as short sequences of 'N', of random length up to 200.
    Exactly 1000 'N's separate sequence blocks.
    All sequences may contain newline chars.
    Written by PM 2Ring 2015.03.23
'''

import sys
from linebuffer import LineBuffer

#Output line width
linewidth = 120

#Sequence separator
nsepblock = 'N' * 1000

def main():
    iname = 'DNA_sequence.txt'
    outbase = 'contig'

    with open(iname, 'rt') as f:
        data = f.read()

    #Remove all newlines
    data = data.replace('\n', '')

    sequences = data.split(nsepblock)

    #Save each sequence to a series of files
    for i, seq in enumerate(sequences, 1):
        outname = '%s%05d' % (outbase, i)
        print outname

        #Write sequence data, with line breaks
        lb = LineBuffer(outname, linewidth)
        lb.write(seq)
        lb.close()

if __name__ == '__main__':
    main()
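For example, after generating three test sequences, a run might look like this (the splitter prints each output file name as it goes):
$ python Random_DNA0.py 3
$ python DNA_Splitter0.py
contig00001
contig00002
contig00003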
assuming you can read the whole file at once
s = DNAstring.replace("\n", "")     # first remove the nasty linebreaks
l = [x for x in s.split("N") if x]  # split and drop empty strings
for x in l:                         # print in chunks
    while x:
        print x[:10]
        x = x[10:]
    print                           # extra linebreak between chunks
You could simply replace every N and \n with a space, and then split.
result = DNAstring.replace("\n", " ").replace("N", " ").split()
This will give you back a list of strings, but note that the 'ACGT' sequences will also be split at every newline.
If that is not your goal and you want to conserve the \n inside the 'ACGT' parts rather than splitting on them, split on the space character explicitly (a bare split() would still split on the remaining newlines):
result = DNAstring.replace("N\n", " ").replace("N", " ").split(" ")
This only removes a \n when it sits in the middle of an N run; you may want to filter out the empty strings that consecutive spaces produce.
To split your string exactly after 1000 Ns:
# 1/ Get rid of line breaks in the N sequence
result = DNAstring.replace("N\n", "N")
# 2/ split every 1000 Ns
result = result.split(1000*"N")
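A hypothetical follow-up, if you then want each piece written to its own file (filtering out the empty strings the split can leave at either end):
for i, seq in enumerate((p for p in result if p), 1):
    with open('contig%d.txt' % i, 'w') as out:
        out.write(seq)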
Related
I want to create a text file of size N kilobytes filled with repetitions of "Hello World", where N is specified via a config file in a different directory from the repository, using Python. I am able to display "hello world" N times, where N is a numerical input from the config file, but I don't know how to control the size. Here is the code I have written so far:
import ConfigParser
import webbrowser
configParser = ConfigParser.RawConfigParser()
configParser.read("/home/suryaveer/check.conf")
num = configParser.get('userinput-config', 'num')
num2 = int(num)
message = "hello world"
f = open('test.txt', 'w')
f.write(message*num2)
f.close()
A single character occupies 1 byte, as long as it is an ASCII character encoded as UTF-8.
That means that the size of "Hello World" in bytes is len("Hello World") = 11 bytes.
To get roughly N kilobytes, you can run something like this:
# N is an int
size_bytes = N * 1024
message = "hello world"
# using a context manager, so no need to remember to close the file
with open('test.txt', 'w') as f:
    repeat_amount = size_bytes // len(message)
    f.write(message * repeat_amount)
First, get the size of your message, bearing in mind that strings in Python are objects: sys.getsizeof(message) reports the size of the whole object, not just the raw characters. Subtract the empty-string overhead, then count how many times you need to repeat the pure message to get N kB, as follows:
import sys
N = 1024 # size of the output file in Kb
message = "hello world"
string_object_size = sys.getsizeof("")
single_message_size = sys.getsizeof(message) - string_object_size
reps = N * 1024 // single_message_size
f = open('test.txt', 'w')
f.write(message*reps)
f.close()
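As a quick sanity check of the overhead subtraction (the exact numbers vary between Python builds, so treat them as illustrative):
import sys
print(sys.getsizeof(""))             # overhead of an empty string object
print(sys.getsizeof("hello world"))  # overhead plus the 11 characters of payload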
First, you have to be clear on the difference between the number of characters written and the number of bytes. In many encodings one character occupies more than 1 byte. In your example, if the phrase is in English ('hello world') and the default encoding is UTF-8, the numbers will be the same; but if you use another language with a different character set, they may differ.
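For instance, with Python 3 strings:
s = "héllo"
print(len(s))                  # 5 characters
print(len(s.encode('utf-8')))  # 6 bytes: 'é' takes two bytes in UTF-8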
...
with open('test.txt', 'wb') as f:  # binary, because we need to count bytes
    max_size = num2 * 1024  # I assume num2 is in kB
    msg_bytes = message.encode('utf-8')
    bytes_written = 0
    while bytes_written < max_size:  # whole phrases only; the total may overshoot slightly
        f.write(msg_bytes)
        bytes_written += len(msg_bytes)
I've read a few postings regarding Python writing to text files, but I could not find a solution to my problem. Here it is in a nutshell.
The requirement: to write values delimited by thorn characters (u00FE, surrounding each text value) and the pilcrow character (u00B6, after each column) to a UTF-16LE text file with BOM (FF FE).
The issue: the written-to text file has whitespace between each column that I did not script for. Also, it's not showing up right in UltraEdit; only the first value ("mom") shows. I welcome any insight or advice.
The script (simplified to ease troubleshooting; the actual script uses a third-party API to obtain the list of values):
import os
import codecs
import shutil
import sys

first = u''
textdel = u'\u00FE'.encode('utf_16_le') #thorn
fielddel = u'\u00B6'.encode('utf_16_le') #pilcrow
list1 = ['mom', 'dad', 'son']
num = len(list1) #pretend this is from the metadata profile
f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')
for item in list1:
    mytext2 = u''
    i = 0
    i = i + 1
    mytext2 = mytext2 + item + textdel
    if i < (num - 1):
        mytext2 = mytext2 + fielddel
    f.write(mytext2 + u'\n')
f.close()
You're double-encoding your strings. You've already opened your file as UTF-16-LE, so leave your textdel and fielddel strings unencoded. They will get encoded at write time along with every line written to the file.
Or put another way, textdel = u'\u00FE' sets textdel to the "thorn" character, while textdel = u'\u00FE'.encode('utf-16-le') sets textdel to a particular serialized form of that character, a sequence of bytes according to that codec; it is no longer a sequence of characters:
textdel = u'\u00FE'
len(textdel) # -> 1
type(textdel) # -> unicode
len(textdel.encode('utf-16-le')) # -> 2
type(textdel.encode('utf-16-le')) # -> str
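Putting that together, a minimal corrected sketch of the script might look like this (one reading of the delimiter requirement; the exact joining logic is an assumption):
import codecs

textdel = u'\u00FE'   # thorn, left as a unicode string
fielddel = u'\u00B6'  # pilcrow, left as a unicode string
list1 = ['mom', 'dad', 'son']

f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')  # BOM
f.write(fielddel.join(textdel + item + textdel for item in list1) + u'\n')
f.close()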
I want to determine which type of ROT encoding is used and, based on that, do the correct decode.
Also, I have found the following code, which will indeed decode ROT13 "sbbone" to "foobar" correctly:
import codecs
codecs.decode('sbbone', 'rot_13')
The thing is, I'd like to run this Python file against an existing file that has ROT13 encoding (for example, rot13.py encoded.txt).
Thank you!
To answer the second part of your first question, decoding something in ROT-x, you can use the following code:
def encode(s, ROT_number=13):
    """Encodes a string (s) using ROT (ROT_number) encoding."""
    ROT_number %= 26  # To avoid IndexErrors
    alpha = "abcdefghijklmnopqrstuvwxyz" * 2
    alpha += alpha.upper()
    def get_i():
        for i in range(26):
            yield i  # indexes of the lowercase letters
        for i in range(52, 78):
            yield i  # indexes of the uppercase letters
    ROT = {alpha[i]: alpha[i + ROT_number] for i in get_i()}
    return "".join(ROT.get(i, i) for i in s)

def decode(s, ROT_number=13):
    """Decodes a string (s) using ROT (ROT_number) encoding."""
    return encode(s, abs(ROT_number % 26 - 26))
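A quick round-trip check:
print(encode("foobar"))          # sbbone
print(decode(encode("Hello!")))  # Hello!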
To answer the first part of your first question, finding the ROT encoding of an arbitrarily encoded string, you probably want to brute-force: apply every ROT encoding and check which one makes the most sense. A quick(-ish) way to do this is to get a newline-delimited file of the most common words in the English language (e.g. cat\ndog\nmouse\nsheep\nsay\nsaid\nquick\n... where \n is a newline), and then check which encoding produces the most real words.
with open("words.txt") as f:
words = frozenset(f.read().lower().split("\n"))
# frozenset for speed
def get_most_likely_encoding(s, delimiter=" "):
alpha = "abcdefghijklmnopqrstuvwxyz" + delimiter
for punctuation in "\n\t,:; .()":
s.replace(punctuation, delimiter)
s = "".join(c for c in s if c.lower() in alpha)
word_count = [sum(w.lower() in words for w in encode(
s, enc).split(delimiter)) for enc in range(26)]
return word_count.index(max(word_count))
A word list you could use on Unix machines is /usr/share/dict/words (historically /usr/dict/words).
Well, you can read the file line by line and decode it.
The output should go to an output file:
import codecs
import sys

def main(filename):
    output_file = open('output_file.txt', 'w')
    with open(filename) as f:
        for line in f:
            output_file.write(codecs.decode(line, 'rot_13'))
    output_file.close()

if __name__ == "__main__":
    _filename = sys.argv[1]
    main(_filename)
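Assuming the script is saved as rot13.py, running it against a file then looks like the usage the question describes:
$ python rot13.py encoded.txt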
I have a folder with about 50 .txt files containing data in the following format.
=== Predictions on test data ===
inst# actual predicted error distribution (OFTd1_OF_Latency)
1 1:S 2:R + 0.125,*0.875 (73.84)
I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (the decimals less than 1.0).
I would like it to look like the following when finished, but preferably as a .csv file.
ID True Pred S R
1 S R 0.125 0.875
2 R R 0.105 0.895
3 S S 0.945 0.055
. . . . .
. . . . .
. . . . .
n S S 0.900 0.100
I'm a beginner and a bit fuzzy on how to get all of that parsed and then concatenated and appended. Here's what I was thinking, but feel free to suggest another direction if that would be easier.
for i in range(1, n):
    s = str(i)
    readin = open('mydata/output/output'+s+'out', 'r')
    #The files are all named the same but with different numbers associated
    output = open("mydata/summary.csv", "a")
    storage = []
    for line in readin:
        #data extraction/concatenation here
        if line.startswith('1'):
            id = i
            true = # split at the ':' and take the letter after it
            pred = # split at the second ':' and take the letter after it
            #some have error '+'s and some don't, so I'm not exactly sure how to get the distributions
            ds = # split at the ',' and take the string of 5 digits before it
            if pred == 'R':
                dr = # skip the character after the comma but take the five characters after it
            else:
                # take the five characters after the comma
            lineholder = id+' , '+true+' , '+pred+' , '+ds+' , '+dr
        else: continue
    output.write(lineholder)
I think using the indexes would be another option, but it might complicate things if the spacing is off in any of the files and I haven't checked this for sure.
Thank you for your help!
Well, first of all, if you want to use CSV you should use the csv module that comes with Python. More about this module here: https://docs.python.org/2.7/library/csv.html I won't demonstrate how to use it, because it's pretty simple.
As for reading the input data, here's my suggestion for how to break down every line of the data itself. I assume that the lines of data in the input file have their values separated by spaces, and that no value contains a space:
def process_line(id_, line):
    pieces = line.split()  # Now we have an array of values
    true = pieces[1].split(':')[1]  # split at the ':' and take the letter after it
    pred = pieces[2].split(':')[1]  # split at the second ':' and take the letter after it
    if len(pieces) == 6:  # There was an error, so the '+' is there
        p4 = pieces[4]
    else:  # There was no '+', only spaces
        p4 = pieces[3]
    ds = p4.split(',')[0]  # split at the ',' and take the digits before it
    if pred == 'R':
        dr = p4.split(',')[1][1:]  # skip the '*' after the comma and take the digits after it
    else:
        dr = p4.split(',')[1]
    return id_ + ' , ' + true + ' , ' + pred + ' , ' + ds + ' , ' + dr
What I mainly used here was the split function of strings: https://docs.python.org/2/library/stdtypes.html#str.split and, in one place, the simple slice syntax str[1:] to skip the first character of a string (strings are sequences, after all, so we can slice them).
Keep in mind that my function won't handle any errors or lines formatted differently from the one you posted as an example. If the values in every line were separated by some other delimiter, say tabs, you would replace pieces = line.split() with pieces = line.split('\t').
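A hypothetical driver for process_line(), following the file naming and loop from the question (n, the number of files, is assumed to be defined elsewhere):
with open('mydata/summary.csv', 'a') as output:
    for i in range(1, n):
        with open('mydata/output/output' + str(i) + 'out') as readin:
            for line in readin:
                if line.lstrip().startswith('1 '):  # the first data row
                    output.write(process_line(str(i), line) + '\n')
                    break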
I think you can separate the floats and then combine them with the strings using the re module, as follows:
import re

with open('sample.txt', 'r') as f:
    lines = f.readlines()

# the decimal numbers on each line
strings = [re.findall(r'\d+\.\d+', line) for line in lines]
print(strings)

# the 'index:letter' pairs on each line
nums = [re.findall(r'\w+:\w+', line) for line in lines]
print(nums)

s = nums + strings
print(s)  # [['1:S', '2:R'], ['0.125', '0.875', '73.84']] for a one-line file
The comprehensions work line by line, so this handles files with multiple lines as well.
contents of sample.txt:
1 1:S 2:R + 0.125,*0.875 (73.84)
2 1:S 2:R + 0.15,*0.85 (69.4)
When you run the program, the result will be:
[['1:S', '2:R'], ['1:S', '2:R'], ['0.125', '0.875', '73.84'], ['0.15', '0.85', '69.4']]
You can then simply concatenate or pair up the per-line lists as needed.
This uses regular expressions and the CSV module.
import re
import csv

matcher = re.compile(r'[ \t]*1.*:(.).*:(.).* ([^ ]*),[^0-9]?(.*) ')
filenametemplate = 'mydata/output/output%iout'
output = csv.writer(open('mydata/summary.csv', 'w'))
for i in range(1, n):  # n is the number of files, as in the question
    for line in open(filenametemplate % i):
        m = matcher.match(line)
        if m:
            output.writerow([i] + list(m.groups()))
I'm wondering how I can count, for example, all "s" characters in a text file that I'm importing, and print their number? I tried a few times to do it on my own but I'm still doing something wrong. If someone could give me some tips I would really appreciate it :)
Open the file, the "r" means it is opened as readonly mode.
filetoread = open("./filename.txt", "r")
With this loop, you iterate over all the lines in the file and count the number of times the character chartosearch appears. Finally, the value is printed.
total = 0
chartosearch = 's'
for line in filetoread:
    total += line.count(chartosearch)
print("Number of " + chartosearch + ": " + str(total))
I am assuming you want to read a file, find the number of s's, and then store the result at the end of the same file.
f = open('blah.txt', 'r+')
data_to_read = f.read().strip()
total_s = sum(map(lambda x: x == 's', data_to_read))
f.write(str(total_s))  # read() left the position at the end, so this appends
f.close()
I did it functionally just to give you another perspective.
You open the file with open("foo.txt"); the mode defaults to "r", for reading. To remove whitespace and \n's, we do .read().split(). Then, using a for loop, we loop over each individual character and check whether it is an 'S' or an 's'; each time we find one, we add one to the scount variable (scount is meant to be short for s-count).
filetoread = open("foo.txt").read().split()
scount = 0
for k in ''.join(filetoread):
if k.lower() == 's':
scount+=1
print ("There are %d 's' characters" %(scount))
Here's a version with a reasonable time performance (~500MB/s on my machine) for ascii letters:
#!/usr/bin/env python3
import sys
from functools import partial

byte = sys.argv[1].encode('ascii')  # the byte to count, e.g. b's'
print(sum(chunk.count(byte)
          for chunk in iter(partial(sys.stdin.buffer.read, 1 << 14), b'')))
Example:
$ echo baobab | ./count-byte b
3
It could be easily changed to support arbitrary Unicode codepoints:
#!/usr/bin/env python3
import sys
from functools import partial

char = sys.argv[1]
print(sum(chunk.count(char)
          for chunk in iter(partial(sys.stdin.read, 1 << 14), '')))
Example:
$ echo ⛄⛇⛄⛇⛄ | ./count-char ⛄
3
To use it with a file, you could use a redirect:
$ ./count-char < input_file