Bug in StringIO module python using numpy - python

Very simple code:
import StringIO
import numpy as np
c = StringIO.StringIO()
c.write("1 0")
a = np.loadtxt(c)
print a
I get an empty array + warning that c is an empty file.
I fixed this by adding:
d=StringIO.StringIO(c.getvalue())
a = np.loadtxt(d)
I think such a thing shouldn't happen, what is happening here?

It's because the 'position' of the file object is at the end of the file after the write. So when numpy reads it, it reads from the end of the file to the end, which is nothing.
Seek to the beginning of the file and then it works:
>>> from StringIO import StringIO
>>> s = StringIO()
>>> s.write("1 2")
>>> s.read()
''
>>> s.seek(0)
>>> s.read()
'1 2'

StringIO is a file-like object. As such it has behaviors consistent with a file. There is a notion of a file pointer - the current position within the file. When you write data to a StringIO object the file pointer is adjusted to the end of the data. When you try to read it, the file pointer is already at the end of the buffer, so no data is returned.
To read it back you can do one of two things:
Use StringIO.getvalue(), as you already discovered. This returns the data from the beginning of the buffer, leaving the file pointer unchanged.
Use StringIO.seek(0) to reposition the file pointer to the start of the buffer, then call StringIO.read() to read the data.
Demo
>>> from StringIO import StringIO
>>> s = StringIO()
>>> s.write('hi there')
>>> s.read()
''
>>> s.tell() # shows the current position of the file pointer
8
>>> s.getvalue()
'hi there'
>>> s.tell()
8
>>> s.read()
''
>>> s.seek(0)
>>> s.tell()
0
>>> s.read()
'hi there'
>>> s.tell()
8
>>> s.read()
''
There is one exception to this. If you provide a value when you create the StringIO, the buffer will be initialised with that value, but the file pointer will be positioned at the start of the buffer:
>>> s = StringIO('hi there')
>>> s.tell()
0
>>> s.read()
'hi there'
>>> s.read()
''
>>> s.tell()
8
And that is why it works when you use
d=StringIO.StringIO(c.getvalue())
because you are initialising the StringIO object at creation time, and the file pointer is positioned at the beginning of the buffer.
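For readers on Python 3, where the StringIO module was replaced by io.StringIO, the three cases above can be sketched roughly as follows. Note that parse_floats is a hypothetical stand-in for np.loadtxt, used here only to keep the example free of third-party imports; like any file reader, it starts at the current position of the file pointer.

```python
import io


def parse_floats(fileobj):
    """Hypothetical stand-in for np.loadtxt: reads from the current position."""
    return [float(token) for token in fileobj.read().split()]


buf = io.StringIO()
buf.write("1 0")

# The file pointer sits after the written text, so nothing is read.
at_end = parse_floats(buf)

# Rewind to the start of the buffer, then read again.
buf.seek(0)
after_seek = parse_floats(buf)

# A StringIO created with an initial value starts at position 0.
fresh = io.StringIO(buf.getvalue())
from_copy = parse_floats(fresh)

print(at_end, after_seek, from_copy)
```

Either the seek(0) call or the fresh object works; seek(0) avoids copying the buffer.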

Related

Python using cStringIO with foreach loop

I want to iterate over the lines of a cStringIO object, but it does not seem to work with a for loop. More precisely, it behaves as if the collection were empty. What am I doing wrong?
example:
Python 2.7.12 (default, Aug 29 2016, 16:51:45)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cStringIO
>>> s = cStringIO.StringIO()
>>> import os
>>> s.write("Hello" + os.linesep + "World" + os.linesep)
>>> s.getvalue()
'Hello\nWorld\n'
>>> for line in s:
...     print line
...
>>>
Thank you.
cStringIO.StringIO returns a cStringIO.InputType object (an input stream) if you pass it a string, and a cStringIO.OutputType object (an output stream) otherwise.
In [13]: sio = cStringIO.StringIO()
In [14]: sio??
Type: StringO
String form: <cStringIO.StringO object at 0x7f63d418f538>
Docstring: Simple type for output to strings.
In [15]: isinstance(sio, cStringIO.OutputType)
Out[15]: True
In [16]: sio = cStringIO.StringIO("dsaknml")
In [17]: sio??
Type: StringI
String form: <cStringIO.StringI object at 0x7f63d4218580>
Docstring: Simple type for treating strings as input file streams
In [18]: isinstance(sio, cStringIO.InputType)
Out[18]: True
So an object created from a string is read-only, while an object created empty is meant to be written to. If you write to an output object and then read from it straight away, the read silently returns nothing.
A simple way to read from a cStringIO.OutputType object is to convert it to its value with the getvalue() method:
cStringIO.OutputType.getvalue(c_string_io_object)
Try using the string split method:
for line in s.getvalue().split('\n'): print line
...
Hello
World
Or as suggested, if you are always splitting on a new line:
for line in s.getvalue().splitlines(): print line
You can read the contents from an open file handle after writing, but you first have to use the seek(0) method to move the pointer back to the start. This will work for either cStringIO or a real file:
import cStringIO
s = cStringIO.StringIO()
s.write("Hello\nWorld\n") # Python automatically converts '\n' as needed
s.getvalue()
# 'Hello\nWorld\n'
s.seek(0) # move pointer to start of file
for line in s:
    print line.strip()
# Hello
# World
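On Python 3 there is no cStringIO; io.StringIO supports both writing and reading on the same object. A minimal sketch of the same rewind-then-iterate pattern follows, where read_back_lines and its rewind flag are illustrative names and not part of any library:

```python
import io


def read_back_lines(text, rewind=True):
    """Write text to an in-memory buffer, then iterate over its lines.

    With rewind=False the iteration starts at the end of the buffer,
    which reproduces the empty loop from the question.
    """
    buf = io.StringIO()
    buf.write(text)
    if rewind:
        buf.seek(0)  # move the position back to the start
    return [line.rstrip("\n") for line in buf]


print(read_back_lines("Hello\nWorld\n"))
print(read_back_lines("Hello\nWorld\n", rewind=False))
```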

Ignore newline character in binary file with Python?

I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data in fixed 4-byte chunks (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n, which is then used as the first byte of the next int I read. I don't want to simply use .splitlines() because some of the data could contain a \n and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just mis-interpreted values. This can happen when writing 10 to the file for example.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
...     d += f.read(diff).replace(b'\x0d\x0a', b'') # \r\n is the byte pair b'\x0d\x0a'
...     diff = 4 - len(d)
...
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")
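To illustrate the first answer's point that a \n byte in a binary file can simply be data, here is a small sketch using an in-memory stream. It assumes the file holds consecutive little-endian 4-byte integers, which matches the look of the sample line but is not confirmed by the question; read_int32s is a hypothetical helper.

```python
import io
import struct


def read_int32s(stream):
    """Read consecutive little-endian 4-byte ints until the stream ends."""
    values = []
    while True:
        chunk = stream.read(4)
        if len(chunk) < 4:  # end of stream (or a trailing partial record)
            break
        values.append(struct.unpack("<i", chunk)[0])
    return values


# The integer 10 is stored as b'\n\x00\x00\x00', so the raw bytes
# contain a "newline" that is really part of a number.
raw = struct.pack("<3i", 9, 10, 11)
print(b"\n" in raw)
print(read_int32s(io.BytesIO(raw)))
```

Reading with read(4) throughout, and never with readline(), keeps the records aligned regardless of which byte values appear.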

Get the results of dis.dis() in a string

I am trying to compare the bytecode of two things with difflib, but dis.dis() always prints it to the console. Any way to get the output in a string?
If you're using Python 3.4 or later, you can get that string by using the method Bytecode.dis():
>>> import dis
>>> s = dis.Bytecode(lambda x: x + 1).dis()
>>> print(s)
  1           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 BINARY_ADD
              7 RETURN_VALUE
You also might want to take a look at dis.get_instructions(), which returns an iterator of named tuples, each corresponding to a bytecode instruction.
Uses StringIO to redirect stdout to a string-like object (python 2.7 solution)
import sys
import StringIO
import dis
def a():
    print "Hello World"
stdout = sys.stdout # Hold onto the stdout handle
f = StringIO.StringIO()
sys.stdout = f # Assign new stdout
dis.dis(a) # Run dis.dis()
sys.stdout = stdout # Reattach stdout
print f.getvalue() # print contents
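On Python 3.4 and later, dis.dis() also accepts a file argument, so the output can be captured without touching sys.stdout. The sketch below uses that to feed two disassemblies to difflib, as the question intended; disassemble is an illustrative helper name, and the exact opcodes printed depend on the Python version.

```python
import difflib
import dis
import io


def disassemble(func):
    """Return the dis.dis() listing for func as a string."""
    buf = io.StringIO()
    dis.dis(func, file=buf)  # write the listing into the buffer
    return buf.getvalue()


add_one = disassemble(lambda x: x + 1)
sub_one = disassemble(lambda x: x - 1)

# Compare the two listings line by line.
diff = list(difflib.unified_diff(add_one.splitlines(),
                                 sub_one.splitlines(),
                                 lineterm=""))
print("\n".join(diff))
```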

Python ConfigParser elements into CSV arguments

I have a script that parses a csv file and produces an XML file. One of the arguments I have to give the parser is the delimiter, which in my case is not a comma but a tab.
This information is stored in a configuration file which I extract and then pass to the csv parser.
ident = parser.get('CSV', 'delimiter') #delimiter taken from config file
csv.register_dialect('custom',
                     delimiter=ident,  # passed to csv parser
                     doublequote=False,
                     escapechar=None,
                     quotechar='"',
                     quoting=csv.QUOTE_MINIMAL,
                     skipinitialspace=False)
However, I get a TypeError saying that the "delimiter" must be an 1-character string. I checked the type of ident and it is a string, but the \t does not seem to be recognised as a tab. When I put ident = '\t' or delimiter = '\t' it works. How do I get the value correctly from the config file?
Maybe a bit too late, but I have a small workaround: setting the parameter as the hex code value and then decoding it
from ConfigParser import ConfigParser
cp = ConfigParser()
cp.add_section('a')
cp.set('a', 'b', '09')  # hex code for tab (note that there is no \x prefix)
with open('foo.ini', 'w') as f:  # close the file before reading it back
    cp.write(f)
cp_in = ConfigParser()
cp_in.read('foo.ini')
print(repr(bytearray.fromhex(cp_in.get('a', 'b')).decode())) #where the magic happens
This doesn't appear to be possible using ConfigParser.
While the docs don't explicitly mention this case, they do say that leading whitespace will be stripped from values.
When I tried to round-trip the value, I just got back an empty string:
from ConfigParser import ConfigParser
cp = ConfigParser()
cp.add_section('a')
cp.set('a', 'b', '\t')
with open('foo.ini', 'w') as f:  # close the file before reading it back
    cp.write(f)
cp_in = ConfigParser()
cp_in.read('foo.ini')
print(repr(cp_in.get('a', 'b'))) # prints ''
I'm adding what I think is the obvious answer that everyone apparently missed. Judging from the comments, the config file looks something like this:
[CSV]
delimiter=\t
quoting=QUOTE_ALL
The value for 'delimiter' is two characters, a backslash and a 't'. Here's how to read it and convert the value into a tab.
>>> import configparser, codecs, csv
>>> parser = configparser.ConfigParser()
>>> parser.read('foo.cfg')
['foo.cfg']
>>> ident = parser.get('CSV', 'delimiter')
>>> csv.register_dialect('custom', delimiter=ident)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: "delimiter" must be a 1-character string
>>> ident, len(ident)
('\\t', 2)
>>> decoded = codecs.decode(ident, encoding='unicode_escape')
>>> csv.register_dialect('custom', delimiter=decoded)
>>> decoded, len(decoded)
('\t', 1)
And here's a bonus:
>>> quoting = parser.get('CSV', 'quoting')
>>> csv.register_dialect('custom', quoting=quoting)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: "quoting" must be an integer
>>> quoting
'QUOTE_ALL'
>>> try:
...     quoting = parser.getint('CSV', 'quoting')
... except ValueError:
...     quoting = getattr(csv, parser.get('CSV', 'quoting'))
...
>>> csv.register_dialect('custom', quoting=quoting)
>>> quoting
1
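Putting the last answer together as a script rather than a REPL session, a minimal Python 3 sketch might look like this. The config text is inlined with read_string so the example is self-contained; CONFIG_TEXT and the sample tab-separated data are assumptions for illustration, not taken from the asker's files.

```python
import codecs
import configparser
import csv
import io

# Assumed config contents: the delimiter value is a backslash followed by 't'.
CONFIG_TEXT = "[CSV]\ndelimiter=\\t\nquoting=QUOTE_ALL\n"

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

raw_delimiter = parser.get("CSV", "delimiter")             # two characters
delimiter = codecs.decode(raw_delimiter, "unicode_escape")  # a real tab

# Map the constant's name to its integer value in the csv module.
quoting = getattr(csv, parser.get("CSV", "quoting"))

# Use the decoded delimiter to parse some tab-separated sample data.
sample = io.StringIO("a\tb\n1\t2\n")
rows = list(csv.reader(sample, delimiter=delimiter))
print(repr(raw_delimiter), repr(delimiter), quoting, rows)
```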

Python read() seems to return less data than it reads

Can anyone tell me why the length of data is much less than the position of the end of the file? I would have expected these to be equal.
>>> target = open('target.jpg')
>>> print target.tell()
0
>>> data = target.read()
>>> print target.tell()
40962
>>> print len(data)
52
Open the file in binary mode:
target = open('target.jpg','rb')
I would not trust tell() on a file not opened as binary.
Later: actually, on reviewing the comments, I should have said I would not trust a read on a binary file opened as text.
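A quick way to check the binary-mode advice is to write some arbitrary bytes to a temporary file and compare the length of the data read with the position reported by tell(). This is a sketch in Python 3 with made-up payload bytes, not the asker's JPEG:

```python
import os
import tempfile

# 1024 bytes covering every byte value, including 0x0A, 0x0D and 0x1A.
payload = bytes(range(256)) * 4

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode: bytes are returned exactly as stored on disk.
    with open(path, "rb") as target:
        data = target.read()
        end_pos = target.tell()
finally:
    os.remove(path)

print(len(data), end_pos)
```

In binary mode the two numbers agree, which is what the question expected.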
