Read string up to a certain size in Python - python

I have a string stored in a variable. Is there a way to read the string up to a certain size, the way file objects have f.read(size), which reads up to a given size?

Check out this post for finding object sizes in python.
If you want to read the string from the start until a certain size MAX is reached, and then return the new (possibly shorter) string, you might try something like this:
import sys
MAX = 176  # bytes
totalSize = 0
newString = ""
s = "MyStringLength"
for c in s:
    totalSize = totalSize + sys.getsizeof(c)
    if totalSize <= MAX:
        newString = newString + str(c)
    elif totalSize > MAX:
        # adding this character would exceed MAX, so newString is the
        # longest prefix that is smaller than or the same size as MAX
        print newString
        break
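This prints 'MyString', which is less than (or equal to) 176 bytes. The exact cut-off depends on what sys.getsizeof reports for a one-character string on your Python build.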
Hope this helps.
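If the limit is meant as a number of encoded bytes rather than the in-memory object sizes that sys.getsizeof adds up, a different approach may fit better. The following is only a minimal sketch, assuming UTF-8 and Python 3; the helper name truncate_to_bytes is made up for illustration and does not come from the answer above.

```python
def truncate_to_bytes(text, max_bytes, encoding="utf-8"):
    """Return the longest prefix of text whose encoding fits in max_bytes."""
    encoded = text.encode(encoding)[:max_bytes]
    # Cutting at a byte offset can split a multi-byte character;
    # errors="ignore" drops the incomplete trailing bytes.
    return encoded.decode(encoding, errors="ignore")


print(truncate_to_bytes("MyStringLength", 8))
print(truncate_to_bytes("héllo", 2))
```

The second call shows why the decode step matters: "é" takes two bytes in UTF-8, so a 2-byte budget only has room for the "h".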

message = 'a long string which contains a lot of valuable information.'
bite = 10
while message:
    # bite off a chunk of the string
    chunk = message[:bite]
    # set message to be the remaining portion
    message = message[bite:]
    do_something_with(chunk)
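Since the question mentions f.read(size), it may be worth noting that the standard library's io.StringIO wraps a string in a file-like object with the same read(size) method. A small sketch of the chunking loop written that way, assuming Python 3 and collecting the chunks in a list instead of calling do_something_with:

```python
import io

message = "a long string which contains a lot of valuable information."
bite = 10

stream = io.StringIO(message)
chunks = []
while True:
    chunk = stream.read(bite)   # returns at most `bite` characters
    if not chunk:               # an empty string means the data is exhausted
        break
    chunks.append(chunk)

print(chunks)
```

Here the size is counted in characters, not bytes, just as with slicing.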

Related

How to create a text file of size N kilobytes with repetitions of "Hello World"

I want to create a text file of size N kilobytes filled with repetitions of "Hello World", where N is specified via a config file in a different directory from the repository, using Python. I am able to write hello world N times, where N is a numerical input from a config file, but I don't know how to control the size. Here is the code I have written so far:
import ConfigParser
import webbrowser
configParser = ConfigParser.RawConfigParser()
configParser.read("/home/suryaveer/check.conf")
num = configParser.get('userinput-config', 'num')
num2 = int(num)
message = "hello world"
f = open('test.txt', 'w')
f.write(message*num2)
f.close()
A string of length 1 occupies 1 byte when encoded as UTF-8, as long as the character is ASCII.
That means that the size of "Hello World" in bytes is len("Hello World") = 11 bytes.
To get ~N kilobytes, you can run something like this:
# N is int
size_bytes = N * 1024
message = "hello world"
# using context manager, so no need to remember to close the file.
with open('test.txt', 'w') as f:
    repeat_amount = size_bytes // len(message)
    f.write(message * repeat_amount)
First get the size of your message, bearing in mind that strings in Python are objects, so sys.getsizeof(message) returns the size of the whole object, not just the character data. Then count how many times you need to repeat the bare message to reach N KB, as follows:
import sys
N = 1024 # size of the output file in Kb
message = "hello world"
string_object_size = sys.getsizeof("")
single_message_size = sys.getsizeof(message) - string_object_size
reps = int((N)*1024/single_message_size)
f = open('test.txt', 'w')
f.write(message*reps)
f.close()
First, you have to be clear on the difference between the number of characters written and the number of bytes. In many encodings one character occupies more than 1 byte. In your example, if the phrase is in English ('Hello world') and the default encoding is utf-8, the numbers will be the same, but if you use another language with a different character set, they may differ.
...
with open('test.txt', 'wb') as f:  # binary because we need to count bytes
    max_size = num2 * 1024  # I assume num2 is in KB
    msg_bytes = message.encode('utf-8')
    bytes_written = 0
    while bytes_written < max_size:  # if you don't need to cut the last phrase
        f.write(msg_bytes)
        bytes_written += len(msg_bytes)
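The loop above always writes whole phrases, so the file can end up a few bytes larger than the target. If the file has to be exactly N kilobytes, one option is to cut the final repetition. This is a sketch only, assuming Python 3 and 1 KB = 1024 bytes; the function name write_repeated and the temporary output path are illustrative choices, not part of the answers above.

```python
import os
import tempfile


def write_repeated(path, message, size_kb, encoding="utf-8"):
    """Fill the file at path with repetitions of message, exactly size_kb * 1024 bytes."""
    target = size_kb * 1024
    msg_bytes = message.encode(encoding)
    full, remainder = divmod(target, len(msg_bytes))
    with open(path, "wb") as f:
        f.write(msg_bytes * full)
        # The final, partial repetition is cut at a byte boundary.
        f.write(msg_bytes[:remainder])


path = os.path.join(tempfile.mkdtemp(), "test.txt")
write_repeated(path, "hello world", 2)
print(os.path.getsize(path))
```

With a non-ASCII message the final cut could land in the middle of a multi-byte character, so this variant suits plain ASCII phrases best.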

generate string with length equal to length of time in file, with 1 label per second , python

I have a file like this:
https://gist.github.com/manbharae/70735d5a7b2bbbb5fdd99af477e224be
What I want to do is generate 1 label for 1 second.
Since the file above is 160 seconds long, there should be 160 labels. In other words, I want to generate a string of length 160.
However I'm ending up having an str of len 166 instead of 160.
My code :
import os

filename = './test_file.txt'
ann = []
with open(filename, 'r') as f:
    for line in f:
        _, end, label = line.strip().split('\t')
        ann.append((int(float(end)), 'MIT' if label == 'MILAN' else 'not-MIT'))
str = ''
prev_value = 0
for s in ann:
    value = s[0]
    letter = 'M' if s[1] == 'MIT' else 'x'
    str += letter * (value - prev_value)
    print str
    prev_value = value
name_of_file, file_ext = os.path.splitext(os.path.basename(filename))
print "\n\nfile_name processed:", name_of_file
print str
print "length of string", len(str),"\n\n"
My final output:
xxxxxxxMxMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxMMMMMMMMMMMMMMMMMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
166.
Which is wrong. The string should be 160 characters, one character per second, because the file is 160 seconds long.
There is some small bug somewhere, unable to find it.
Please advise what's going wrong here?
Thanks.
One thing I tried was adding an if condition to break out of the loop once a length of 160 is reached, like this:
if ann[len(ann)-1][0] == len(str):
    break
However, it didn't help.
AFAIK, something is going wrong in the last iteration, because until then everything is fine.
I looked at : https://stackoverflow.com/a/14927311/4932791
https://stackoverflow.com/a/1424016/4932791
The reason it doesn't add up is that there are two places where a negative number of letters would have to be added, because the value is lower than the previous one:
(69, 'not-MIT')
(68, 'not-MIT')
(76, 'not-MIT')
(71, 'not-MIT')
For future reference: it's better not to name a variable 'str', since str is already a built-in type in Python.
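To see why this produces extra characters: multiplying a string by a negative number gives an empty string, but prev_value is still moved back, so the next row adds more letters than it should. If the rows quoted above are read as the drops 69 to 68 and 76 to 71, that is 1 + 5 = 6 surplus characters, which matches 166 instead of 160. One possible way to handle it, sketched below with made-up sample rows, is to skip rows whose end time does not advance; whether out-of-order rows should be skipped, sorted, or corrected in the input is an assumption that depends on the data.

```python
def labels_per_second(ann):
    """Build one letter per second from (end_second, label) pairs."""
    out = []
    prev_value = 0
    for value, label in ann:
        letter = 'M' if label == 'MIT' else 'x'
        # A negative repeat count silently yields '', so skip out-of-order
        # rows and keep prev_value at the furthest second already covered.
        if value > prev_value:
            out.append(letter * (value - prev_value))
            prev_value = value
    return ''.join(out)


# Hypothetical rows: the second end time (4) is lower than the first (5).
sample = [(5, 'not-MIT'), (4, 'not-MIT'), (10, 'MIT')]
result = labels_per_second(sample)
print(result, len(result))
```

With the original loop the same rows would give 11 characters, because the step from 4 to 10 adds six letters after five were already written.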

The String is Not Read Fully

I wrote a program to generate a string of digits, consisting of 0, 1, 2, and 3, with length s, and write the output to a decode.txt file. Below is the code:
import numpy as np
n_one =int(input('Insert the amount of 1: '))
n_two =int(input('Insert the amount of 2: '))
n_three = int(input('Insert the amount of 3: '))
l = n_one+n_two+n_three
n_zero = l+1
s = (2*(n_zero))-1
data = [0]*n_zero + [1]*n_one + [2]*n_two + [3]*n_three
print ("Data string length is %d" % len(data))
while data[0] == 0 and data[s-1]!=0:
    np.random.shuffle(data)
datastring = ''.join(map(str, data))
datastring = str(int(datastring))
files = open('decode.txt', 'w')
files.write(datastring)
files.close()
print("Data string is : %s " % datastring)
The problem occurs when I try to read the file from another program: that program doesn't read the last value of the string.
For example, if the generated string is 30112030000, the other program will only read 3011203000, meaning the last 0 is not read.
But if I type 30112030000 directly into the .txt file, all values are read. I can't figure out what is wrong in my code.
Thank you
Some programs might not like the fact that the file doesn't end with a newline. Try adding files.write('\n') before you close it.
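As a rough illustration of that suggestion (a sketch assuming Python 3, with a temporary directory standing in for the real location of decode.txt), the writer terminates the line and a reader strips the terminator again:

```python
import os
import tempfile

datastring = "30112030000"
path = os.path.join(tempfile.mkdtemp(), "decode.txt")

with open(path, "w") as f:
    f.write(datastring + "\n")   # terminate the line explicitly

with open(path) as f:
    line = f.readline().rstrip("\n")   # strip the terminator when reading

print(line == datastring)
```

Whether this fixes the original problem depends on how the other program reads the file, which the question does not show.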

Python splitting with string as delimiter

I have a file that looks something like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC
And I need to split it into the "non-N" sequences, so two separate files like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCT
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGC
TATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACA
TGTGTTGC
What I currently have is this:
UMfile = open ("C:\Users\Manuel\Desktop\sequence.txt","r")
contignumber = 1
contigfile = open ("contig "+str(contignumber), "w")
DNA = UMfile.read()
DNAstring = str(DNA)
for s in DNAstring:
DNAstring.split("NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN",1)
contigfile.write(DNAstring)
contigfile.close()
contignumber = contignumber+1
contigfile = open ("contig "+str(contignumber), "w")
The thing is that I realize there is a linebreak between the "Ns" and that is why it is not splitting my file, but the "file" I'm showing is just a part of a much much bigger one. So sometimes the "Ns" will look like this "NNNNNN\n" and sometimes like "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n", yet there is always a count of 1000 Ns between my sequences that I need to split.
So my question is: how do I tell Python to split and write into different files at every run of 1000 Ns, knowing that there will be a different number of Ns on each line?
Thank you all very much, I really have no informatics background and my python skills are at best basic.
Just split your string on 'N' and then remove all the strings that are empty, or just contain a newline. Like this:
#!/usr/bin/env python
DNAstring = '''AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC'''
sequences = [u for u in DNAstring.split('N') if u and u != '\n']
for i, seq in enumerate(sequences):
    print i
    print seq.replace('\n', '') + '\n'
output
0
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACTTCTCAATGGGCAGTACATATCATCTCT
1
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACATGTGTTGC
The code snippet above also removes newlines inside the sequences using .replace('\n', '').
Here are a few programs that you may find useful.
Firstly, a line buffer class. You initialise it with a file name and a line width. You can then feed it random length strings and it will automatically save them to the text file, line by line, with all lines (except possibly the last line) having the given length. You can use this class in other programs to make your output look neat.
Save this file as linebuffer.py to somewhere in your Python path; the simplest way is to save it wherever you save your Python programs and make that the current directory when you run the programs.
linebuffer.py
#! /usr/bin/env python
''' Text output buffer
Write fixed width lines to a text file
Written by PM 2Ring 2015.03.23
'''
class LineBuffer(object):
    ''' Text output buffer
        Write fixed width lines to file fname
    '''
    def __init__(self, fname, width):
        self.fh = open(fname, 'wt')
        self.width = width
        self.buff = []
        self.bufflen = 0

    def write(self, data):
        ''' Write a string to the buffer '''
        self.buff.append(data)
        self.bufflen += len(data)
        if self.bufflen >= self.width:
            self._save()

    def _save(self):
        ''' Write the buffer to the file '''
        buff = ''.join(self.buff)
        #Split buff into lines
        lines = []
        while len(buff) >= self.width:
            lines.append(buff[:self.width])
            buff = buff[self.width:]
        #Add an empty line so we get a trailing newline
        lines.append('')
        self.fh.write('\n'.join(lines))
        self.buff = [buff]
        self.bufflen = len(buff)

    def close(self):
        ''' Flush the buffer & close the file '''
        if self.bufflen > 0:
            self.fh.write(''.join(self.buff) + '\n')
        self.fh.close()

def testLB():
    alpha = 'abcdefghijklmnopqrstuvwxyz'
    fname = 'linebuffer_test.txt'
    lb = LineBuffer(fname, 27)
    for _ in xrange(30):
        lb.write(alpha)
    lb.write(' bye.')
    lb.close()

if __name__ == '__main__':
    testLB()
Here is a program that makes random DNA sequences of the form you described in your question. It uses linebuffer.py to handle the output. I wrote this so I could test my DNA sequence splitter properly.
Random_DNA0.py
#! /usr/bin/env python
''' Make random DNA sequences
Sequences consist of random subsequences of the letters 'ACGT'
as well as short sequences of 'N', of random length up to 200.
Exactly 1000 'N's separate sequence blocks.
All sequences may contain newlines chars
Takes approx 3 seconds per megabyte generated and saved
on a 2GHz CPU single core machine.
Written by PM 2Ring 2015.03.23
'''
import sys
import random
from linebuffer import LineBuffer
#Set seed to None to seed randomizer from system time
random.seed(37)
#Output line width
linewidth = 120
#Subsequence base length ranges
minsub, maxsub = 15, 300
#Subsequences per sequence ranges
minseq, maxseq = 5, 50
#random 'N' sequence ranges
minn, maxn = 5, 200
#Probability that a random 'N' sequence occurs after a subsequence
randn = 0.2
#Sequence separator
nsepblock = 'N' * 1000
def main():
    #Get number of sequences from the command line
    numsequences = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    outname = 'DNA_sequence.txt'
    lb = LineBuffer(outname, linewidth)
    for i in xrange(numsequences):
        #Write the 1000*'N' separator between sequences
        if i > 0:
            lb.write(nsepblock)
        for j in xrange(random.randint(minseq, maxseq)):
            #Possibly make a short run of 'N's in the sequence
            if j > 0 and random.random() < randn:
                lb.write(''.join('N' * random.randint(minn, maxn)))
            #Create a single subsequence
            r = xrange(random.randint(minsub, maxsub))
            lb.write(''.join([random.choice('ACGT') for _ in r]))
    lb.close()

if __name__ == '__main__':
    main()
Finally, we have a program that splits your random DNA sequences. Once again, it uses linebuffer.py to handle the output.
DNA_Splitter0.py
#! /usr/bin/env python
''' Split DNA sequences and save to separate files
Sequences consist of random subsequences of the letters 'ACGT'
as well as short sequences of 'N', of random length up to 200.
Exactly 1000 'N's separate sequence blocks.
All sequences may contain newlines chars
Written by PM 2Ring 2015.03.23
'''
import sys
from linebuffer import LineBuffer
#Output line width
linewidth = 120
#Sequence separator
nsepblock = 'N' * 1000
def main():
    iname = 'DNA_sequence.txt'
    outbase = 'contig'
    with open(iname, 'rt') as f:
        data = f.read()
    #Remove all newlines
    data = data.replace('\n', '')
    sequences = data.split(nsepblock)
    #Save each sequence to a series of files
    for i, seq in enumerate(sequences, 1):
        outname = '%s%05d' % (outbase, i)
        print outname
        #Write sequence data, with line breaks
        lb = LineBuffer(outname, linewidth)
        lb.write(seq)
        lb.close()

if __name__ == '__main__':
    main()
Assuming you can read the whole file at once:
s=DNAstring.replace("\n","") # first remove the nasty linebreaks
l=[x for x in s.split("N") if x] # split and drop empty strings
for x in l: # print in chunks
    while x:
        print x[:10]
        x=x[10:]
    print # extra linebreak between chunks
You could simply replace every N and \n with a space, and then split.
result = DNAstring.replace("\n", " ").replace("N", " ").split()
This will give you back a list of strings, and the 'ACGT' sequences will also be split at every newline.
If this is not your goal and you want to keep the \n inside the 'ACGT' sequences and not split on it, you can do the following:
result = DNAstring.replace("N\n", " ").replace("N", " ").split()
This will only remove a \n if it directly follows an N.
To split your string exactly after 1000 Ns:
# 1/ Get rid of line breaks in the N sequence
result = DNAstring.replace("N\n", "N")
# 2/ split every 1000 Ns
result = result.split(1000*"N")
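Putting the pieces together for the original question (split on blocks of 1000 Ns and write each part to its own file), a sketch could look like the following. It assumes Python 3 and that the whole input fits in memory; the function name split_contigs, the contigNNNNN.txt naming, and the temporary output directory are illustrative assumptions. Unlike the snippet above it removes all newlines first, so the sequences are written without their original line breaks.

```python
import os
import tempfile


def split_contigs(dna_text, out_dir, sep_len=1000, base="contig"):
    """Split dna_text on each block of sep_len consecutive 'N's; write one file per piece."""
    data = dna_text.replace("\n", "")
    paths = []
    for i, seq in enumerate(data.split("N" * sep_len), 1):
        if not seq:
            continue  # skip empty pieces, e.g. a separator at the very start
        path = os.path.join(out_dir, "%s%05d.txt" % (base, i))
        with open(path, "w") as f:
            f.write(seq + "\n")
        paths.append(path)
    return paths


# Small made-up input: two sequences separated by 1000 'N's broken over lines.
sample = "ACGT\nAC" + "\n".join(["N" * 50] * 20) + "GGTT\nA"
out_dir = tempfile.mkdtemp()
for p in split_contigs(sample, out_dir):
    with open(p) as f:
        print(os.path.basename(p), f.read().strip())
```

Shorter runs of N inside a sequence are left untouched, because only a full block of sep_len Ns acts as a separator.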

Howto Remove Garbage Data from String

I'm in a situation where I have to use Python to read and write to an EEPROM on an embedded device. The first page (256 bytes) is used for non-volatile data storage. My problem is that the variables can vary in length, and I need to read a fixed amount.
For example, a string is stored at address 30 and can be anywhere from 6 to 10 bytes in length. I need to read the maximum possible length, because I don't know where it ends. As a result, I get excess garbage at the end of the string.
data_str = ee_read(bytecount)
dbgmsg("Reading from EEPROM: addr = " + str(addr_low) + " value = " + str(data_str))
> Reading from EEPROM: addr = 30 value = h11c13����
I am fairly new to Python. Is there a way to automatically chop off that data in the string after it's been read in?
Do you mean something like:
>>> s = 'Reading from EEPROM: addr = 30 value = h11c13����'
>>> s
'Reading from EEPROM: addr = 30 value = h11c13\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd'
>>> filter(lambda x: ord(x)<128,s)
'Reading from EEPROM: addr = 30 value = h11c13'
On python3, you'll need to join the result back into a string:
''.join(filter(lambda x: ord(x)<128,s))
A version which works for python2 and python3 would be:
''.join(x for x in s if ord(x) < 128)
Finally, it is conceivable that the excess garbage could contain printing characters. In that case you might want to take only characters until you read a non-printing character; itertools.takewhile could be helpful...
import string  # the string module and string.printable exist on both python2 and python3
from itertools import takewhile
printable = set(string.printable)
''.join(takewhile(lambda x: x in printable, s))
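On Python 3 a device read like this would more likely come back as bytes rather than str. The question does not say what ee_read returns, so the following is only a sketch under that assumption, with a made-up raw value padded with 0xFF bytes:

```python
from itertools import takewhile
import string

# Hypothetical raw read: a 6-byte value followed by filler bytes.
raw = b"h11c13\xff\xff\xff\xff"

# Byte values of all printable ASCII characters.
printable = set(string.printable.encode("ascii"))

# Iterating over bytes yields ints in Python 3, so compare against byte values
# and stop at the first byte that is not printable.
clean = bytes(takewhile(lambda b: b in printable, raw)).decode("ascii")
print(clean)
```

If the device stores a terminator or a length byte, using that would be more reliable than guessing from printability.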
