I am very new to Python and I am trying to read in a file that partially contains binary data. There is a header with some information about the data, and the binary data follow after the header. If one opens the file in a text editor it looks like this:
>>> Begin of header <<<
value1: 5
value2: 7
...
value65: 9
>>> End of header <<<
���ÄI›C¿���†¨¨v#���ÄW]c¿��� U⁄z#���#¬P\¿����∂:q#���#Ò˚U¿���†÷Us#���`ªw4¿��� :‘m#���#À›9#���ÄAs#���¿‹ ¿����ır#���¿#&%#���†„bq#����*˙-#��� [q#����ÚN8#����
Òo#���#√·T#���†‰zm#����9\#����ÃÜq#����€dZ#���`Ëäs#���†∏8I#���¿¬Ot#���†�6
An additional problem is that I did not create the file myself, so I do not know whether the values are doubles or floats.
So how can I interpret this data?
So first, thanks to all for the help. Basically the problem is the header: I can read in the data quite well once I remove the header from the file. This can be done with
x = numpy.fromfile(f, dtype=numpy.complex128, count=-1)
quite easily. The problem is that I cannot find any option for fromfile that skips lines (one can skip bytes, but the header size may differ from file to file).
In this great thread I found how to convert a binary string to a numpy array:
convert binary string to numpy array
With this I could overcome the problem by reading the data file line by line and then merging every line after the end-of-header line into one string.
This string was then turned into a nice array, exactly as I wanted.
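For completeness, a minimal sketch of that approach, assuming the header always ends with the >>> End of header <<< marker shown above and that the payload really is complex128 (the file name is a placeholder):

import numpy

with open("datafile.bin", "rb") as f:
    # scan past the header line by line
    line = f.readline()
    while line and line.strip() != b">>> End of header <<<":
        line = f.readline()
    # everything after the marker is raw binary data
    x = numpy.frombuffer(f.read(), dtype=numpy.complex128)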
I am trying to extract embeddings from a hidden layer of an LSTM. I have a huge dataset with multiple sentences, which will therefore generate multiple numpy vectors. I want to store all those vectors efficiently in a single file. This is what I have so far:
with open(src_vectors_save_file, "wb") as s_writer, open(tgt_vectors_save_file, "wb") as t_writer:
    for batch in data_iter:
        encoder_hidden_layer, decoder_hidden_layer = self.extract_lstm_hidden_states_for_batch(
            batch, data.src_vocabs, attn_debug
        )
        encoder_hidden_layer = encoder_hidden_layer.detach().numpy()
        decoder_hidden_layer = decoder_hidden_layer.detach().numpy()
        enc_hidden_bytes = pickle.dumps(encoder_hidden_layer)
        dec_hidden_bytes = pickle.dumps(decoder_hidden_layer)
        s_writer.write(enc_hidden_bytes)
        s_writer.write("\n")
        t_writer.write(dec_hidden_bytes)
        t_writer.write("\n")
Essentially I am using pickle to get the bytes from each np.array and writing them to a binary file. I tried to naively separate each byte-encoded array with an ASCII newline, which obviously throws an error (a str cannot be written to a file opened in binary mode). I was planning to use the .readlines() function, or to read each byte-encoded array per line using a for loop, in the next program. However, that won't be possible now.
I am out of ideas; can someone suggest an alternative? How can I efficiently store all the arrays in a compressed fashion in one file, and how can I read them back from that file?
There is a problem with using \n as a separator: the pickle dump (enc_hidden_bytes) can itself contain \n, because the data is not ASCII-encoded.
There are two solutions. You can escape any \n appearing in the data and then use \n as a terminator, but this adds complexity even while reading.
The other solution is to write the size of the data into the file before the actual data. This acts like a sort of header and is a very common practice when sending data over a connection.
You can write the following two functions:
import struct

def write_bytes(handle, data):
    total_bytes = len(data)
    handle.write(struct.pack(">Q", total_bytes))
    handle.write(data)

def read_bytes(handle):
    size_bytes = handle.read(8)
    if len(size_bytes) == 0:
        return None
    total_bytes = struct.unpack(">Q", size_bytes)[0]
    return handle.read(total_bytes)
Now you can replace
s_writer.write(enc_hidden_bytes)
s_writer.write("\n")
with
write_bytes(s_writer, enc_hidden_bytes)
and the same for the other variables.
While reading back from the file in a loop you can use the read_bytes function in a similar way.
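For example, the read loop could look like this (a sketch; src_vectors_save_file is the same path used for writing above):

import pickle

with open(src_vectors_save_file, "rb") as s_reader:
    while True:
        payload = read_bytes(s_reader)
        if payload is None:  # read_bytes signals end of file with None
            break
        encoder_hidden_layer = pickle.loads(payload)
        # ... use the array here ...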
I have a very large csv file (>3GB, > 75million rows).
Problem is, it should not have been created as csv, but tab delimited.
The file has two columns, a string and an integer. However, the string can have commas (for example: "Yes, it is very nice"), so the file does not have a consistent number of columns and I cannot read it with pandas read_csv. What I want it to look like is:
STRING CODE
This is nice 1
That is also nice 2
Yes it is very nice 3
I love everything 4
I am trying to convert it to a tab-delimited file by changing the last comma of each line into a tab. Since the file is huge, I cannot read it into memory. This is what I tried.
I read the file in chunks:
for ch in pandas.read_table("path", chunksize=256):
I define a function, myfunc, as follows:
def myfunc(s):
    li = s.rsplit(",", 1)
    ret = "\t".join(li)
    return ret
Now, for each chunk I do something like:
data["STRING,CODE"] = data["STRING,CODE"].map(lambda x: x.myfunc(x))
data.to_csv("tmp.csv", sep="\t")
and I get something like:
STRING CODE
0 "This is nice 1
1 "That is also nice
2 "Yes it is very nice 3"
3 "I love everything 4"
Which is nothing like what I want. The entries are not separated the way I want, I get extra indices, and extra quotation marks. Besides, even after I am able to fix this for one chunk, I need to go back and append to the csv file to recreate the whole file.
Sorry this is messy, but I am lost. Any help?
File:
STRING,CODE
This is nice,1
That is also nice,2
Yes,it is very nice,3
I love everything,4
You shouldn't need pandas here. Just iterate through the lines of the file and write the fixed lines to a new file.
with open('new.csv', 'w') as newcsv:
    with open('file.csv') as csvf:
        for line in csvf:
            head, _, tail = line.strip().rpartition(',')
            newcsv.write('{}\t{}\n'.format(head, tail))
This should get the job done.
You don't even have to use python:
sed -i 's/\(.*\),/\1\t/' $INPUT
does an in-place replacement of the last , in each line with a \t.
If you want to preserve the input:
sed 's/\(.*\),/\1\t/' $INPUT > $OUTPUT
I suspect this would be faster than running it through python, but that's just a guess.
I'm working on a project that requires me to read Fortran binary files. It is my understanding that Fortran automatically puts a 4-byte header and footer into each file. As such, I want to remove the first and last 4 bytes from the file before I read it. Would this do the trick?
a = open("foo",rb)
b = a.seek(4,0)
x = np.fromfile(b.seek(4,2),dtype='float64')
It might be easier to read the entire file and then chop 4 bytes off each end:
a = open("foo","rb")
data = a.read()
a.close()
x = np.fromstring(data[4:-4], dtype='float64')
For a similar question, see How to read part of binary file with numpy?
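For what it's worth, those 4 bytes are (with most compilers) a record-length marker written before and after each record, so you can verify them instead of trimming blindly. A sketch, assuming a single unformatted sequential record with 4-byte little-endian markers (this layout is compiler-dependent):

import struct
import numpy as np

with open("foo", "rb") as f:
    (reclen,) = struct.unpack("<i", f.read(4))   # leading record-length marker
    payload = f.read(reclen)                     # the actual record data
    (trailer,) = struct.unpack("<i", f.read(4))  # trailing marker, should match
assert reclen == trailer, "markers disagree; check endianness or record layout"
x = np.frombuffer(payload, dtype="float64")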
I am trying to read an image file which is in the *.his format. Honestly, I do not know much about this format; after spending some time on Google I figured out that it is a binary format which can be read in ImageJ as a raw-format import. On further inquiry, I found the following details of the *.his file:
Image type = 16-bit unsigned
Matrix dimensions in pixels = w1024 x h1024
Skip header info = 100 Bytes (The number of bytes in the file before the first byte of image data).
Little-Endian Byte Order
With this information in hand, I started out.
I just wanted to print the values one by one, to see the output:
f = open("file.his", 'rb')
f.seek(100)
try:
byte = f.read(2)
while byte != "":
byte = f.read(2)
print unpack('<H', byte)
finally:
f.close()
It prints some numbers out and then the error message:
.....
(64846,)
(64846,)
(64830,)
Traceback (most recent call last):
print unpack('
Please, can someone suggest how to read this kind of file? I still think unpack is the right function; however, if someone has had a similar experience, any response is greatly appreciated.
Rky.
I've done a very similar task with *.inr image files; maybe the logic could help you. Here is what you could apply:
1-Reading the file
First you need to read the file.
file = open(hisfile, 'rb')
inp = file.readlines()
2-Get header
In my case I used a for loop until the number of characters was 256; in your case you need to count the bytes, so you could print line by line to find out when you need to stop, or try to use this to count the bytes:
import sys
sys.getsizeof(line) #returns the size of the object
3-Data
Once you know that the following lines are the raw data, you need to put them into one variable with a for loop:
raw_data = b''  # initialise before the loop
for line in inp:
    raw_data += line
4-Convert the data
To convert the raw bytes to a numpy array you could do:
import numpy as np
# np.fromstring is deprecated for binary input; frombuffer does the same job
data = np.frombuffer(raw_data, dtype='uint16')
And then apply the shape to the data:
data = data.reshape((1024, 1024)).transpose()  # check whether the transpose is needed; in my case it was fundamental
Maybe if you have an example of the file I could try to read it and help you more. Of course you could do the whole process in one for loop using if's.
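Given the format details quoted in the question (100-byte header, 1024x1024 pixels, 16-bit unsigned, little-endian), a more direct sketch would be:

import numpy as np

with open("file.his", "rb") as f:
    f.seek(100)                    # skip the 100-byte header
    raw = f.read(1024 * 1024 * 2)  # 2 bytes per pixel
# as noted above, a .transpose() may or may not be needed afterwards
data = np.frombuffer(raw, dtype="<u2").reshape((1024, 1024))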
I'm having a really difficult time writing integers out to a file. Here's my situation. I have a file, let's call it 'idlist.txt'. It has multiple columns and is fairly long (10,000 rows), but I only care about the first column of data.
I'm loading it into python using:
import numpy as np
FH = np.loadtxt('idlist.txt',delimiter=',',comments='#')
# Testing initial data type
print FH[0,0],type(FH[0,0])
>>> 85000370342.0 <type 'numpy.float64'>
# Converting to integers
F = [int(FH[i,0]) for i in range(len(FH))]
print F[0],type(F[0])
>>> 85000370342 <type 'long'>
As you can see, the data must be made into integers. What I now would like to do is to write the entries of this list out as the first column of another file (really the only column in the entire file), we can rename it 'idonly.txt'. Here is how I'm trying to do it:
with open('idonly.txt','a') as f:
    for i in range(len(F)):
        f.write('%d\n' % (F[i]))
This is clearly not producing the desired output: when I open the file 'idonly.txt', each entry is actually a float (i.e. 85000370342.0). What exactly is going on here, and why is writing integers to a file such a complicated task? I found the string-formatting idea here: How to write integers to a file, but it didn't fix my issue.
Okay, well it appears that this is completely my fault. When I'm opening the file I'm using the mode 'a', which means append. It turns out that the first time I wrote this out to a file I did it incorrectly, and ever since I've been appending the correct answer onto that and simply not looking down as far as I should since it's a really long file.
For reference here are all of the modes you can use when handling files in python: http://www.tutorialspoint.com/python/python_files_io.htm. Choose carefully.
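As an aside, the whole round trip can also be done with numpy alone (a sketch, assuming the IDs always sit in the first column):

import numpy as np

# usecols keeps only column 0; fmt='%d' writes plain integers, one per line
ids = np.loadtxt('idlist.txt', delimiter=',', comments='#', usecols=(0,))
np.savetxt('idonly.txt', ids.astype(np.int64), fmt='%d')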
Try to use:
f.write('{}\n'.format(F[i]))
(note the added \n so each ID ends up on its own line)