Reading a line with scientific numbers (like 0.4E-03) - python

I would like to process the following line (output of a Fortran program) from a file, with Python:
74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540
and obtain an array such as:
[74,0.4131493371345440e-3,-0.4592776407685850E-03,-0.1725046324754540]
My previous attempts do not work. In particular, if I do the following:
with open(filename, "r") as myfile:
    line = np.array(re.findall(r"[-+]?\d*\.*\d+", myfile.readline())).astype(float)
I get the following error:
ValueError: could not convert string to float: 'E-03'

Steps:
Get list of strings (str.split(' '))
Get rid of "\n" (del arr[-1])
Turn list of strings into numbers (Converting a string (with scientific notation) to an int in Python)
Code:
import decimal # you may also leave this out and use `float` instead of `decimal.Decimal()`
arr = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"
arr = arr.split(' ')
del arr[-1]
arr = [decimal.Decimal(x) for x in arr]
# do your np stuff
Result:
>>> print(arr)
[Decimal('74'), Decimal('0.0004131493371345440'), Decimal('-0.0004592776407685850'), Decimal('-0.1725046324754540')]
PS:
I don't know if you wrote the program that produces this file in the first place, but if you did, you could think about outputting an array of float() / decimal.Decimal()-friendly values from it instead.

@ant.kr Here is a possible solution:
import numpy as np

# Initial data
a = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"

# Given the structure of the initial data, we can proceed as follows:
# - split the initial string at each whitespace; this produces a list whose
#   last element is "\n"
# - convert each remaining element into a floating point number and store
#   them in a numpy array
line = np.array([float(i) for i in a.split(" ")[:-1]])
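As a side note, the original regex-based attempt fails because the pattern does not account for the exponent part of the numbers. A sketch of a pattern that also matches scientific notation (plain str.split() is simpler for whitespace-separated data, but this preserves the regex approach):

import re
import numpy as np

line_str = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540"
# the optional (?:[eE][-+]?\d+)? group matches exponents such as E-03
pattern = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
line = np.array(re.findall(pattern, line_str)).astype(float)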


Convert Matlab to Python

I'm converting Matlab code to Python, and I'm quite unsure about the following line of code:
BD_teste = [BD_teste; grupos.(['g',int2str(l)]).('elementos')(ind_element,:),l];
the whole code is this:
BD_teste = [];
por_treino = 0;
for l = 1:k
    quant_elementos_t = int64((length(grupos.(['g',int2str(l)]).('elementos')) * por_treino)/100);
    for element_c = 1 : quant_elementos_t
        ind_element = randi([1 length(grupos.(['g',int2str(l)]).('elementos'))]);
        BD_teste = [BD_teste; grupos.(['g',int2str(l)]).('elementos')(ind_element,:),l];
        grupos.(['g',int2str(l)]).('elementos')(ind_element,:) = [];
    end
end
The expression below accesses a struct; as I am converting to Python, I represented it with a list and, inside it, a dictionary with its own list 'elementos':
grupos.(['g',int2str(l)]).('elementos')
So my question is just about the line I quoted above: I would like to understand what is happening there and how I would write it in Python.
Thank you very much in advance.
BD_teste = [BD_teste; grupos.(['g',int2str(l)]).('elementos')(ind_element,:),l];
Is one very weird line. Let's break it down into pieces:
int2str(l) returns the number l as a char array (will span from '1' until k).
['g',int2str(l)] concatenates these into the char array 'g1', then 'g2' and so on, as l increases.
grupos.(['g',int2str(l)]) will return the value of the field named g1, g2 and so on that belongs to the struct grupos.
grupos.(['g',int2str(l)]).('elementos') Now assumes that grupos.(['g',int2str(l)]) is itself a struct, and returns the value of its field named 'elementos'.
grupos.(['g',int2str(l)]).('elementos')(ind_element,:) Assuming that grupos.(['g',int2str(l)]).('elementos') is a matrix, this returns a row vector containing the ind_element-th row of said matrix.
grupos.(['g',int2str(l)]).('elementos')(ind_element,:),l appends the number l to the vector obtained before.
[BD_teste; grupos.(['g',int2str(l)]).('elementos')(ind_element,:),l] appends that row vector to the bottom of the matrix BD_teste, creating a new matrix.
Finally:
BD_teste = [BD_teste; grupos.(['g',int2str(l)]).('elementos')(ind_element,:),l]; assigns the resulting matrix to the variable BD_teste, overwriting its previous value. Effectively this just appends the new row, but because of the overwriting step it is not very efficient.
It would be preferable to append in place with:
BD_teste(end+1,:) = [grupos.(['g',int2str(l)]).('elementos')(ind_element,:),l];
Now, how you will rewrite this in Python is a whole different story, and will depend on how you want to define the variable grupos mostly.
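For what it's worth, here is a minimal sketch of one possible Python translation, assuming grupos is represented as a dict mapping 'g1', 'g2', ... to dicts whose 'elementos' entry is a 2-D NumPy array, and that k and por_treino are defined as in the Matlab code:

import numpy as np

BD_teste = []
for l in range(1, k + 1):
    elementos = grupos['g' + str(l)]['elementos']
    quant_elementos_t = int(len(elementos) * por_treino / 100)
    for _ in range(quant_elementos_t):
        # pick a random row and append it (plus the group label l) to BD_teste
        ind_element = np.random.randint(len(elementos))
        BD_teste.append(np.append(elementos[ind_element, :], l))
        # remove the chosen row, mirroring "(ind_element,:) = []" in Matlab
        elementos = np.delete(elementos, ind_element, axis=0)
        grupos['g' + str(l)]['elementos'] = elementos
BD_teste = np.array(BD_teste)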

Convert a string of ndarray to ndarray

I have a string of ndarray. I want to convert it back to ndarray.
I tried newval = np.fromstring(val, dtype=float). But it gives ValueError: string size must be a multiple of element size
Also I tried newval = ast.literal_eval(val). This gives
File "<unknown>", line 1
[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
^
SyntaxError: invalid syntax
String of ndarray
'[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
-9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
-4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
-1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
-5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]'
How can I convert this back to ndarray?
To expand upon my comment:
If you're trying to parse a human-readable string representation of a NumPy array you've acquired from somewhere, you're already doing something you shouldn't.
Instead use numpy.save() and numpy.load() to persist NumPy arrays in an efficient binary format.
Maybe use .savetxt() if you need human readability at the expense of precision and processing speed... but never consider str(arr) to be something you can ever parse again.
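For example, a minimal save/load round trip looks like this:

import numpy as np

arr = np.array([-0.145181984, 0.151671678, 0.159053639])
np.save('arr.npy', arr)    # efficient, lossless binary format
arr2 = np.load('arr.npy')  # back as an ndarray, bit-for-bit identical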
However, to answer your question, if you're absolutely desperate and don't have a way to get the array into a better format...
>>> data = '''
... [-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
... -9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
... -4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
... -1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
... -5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
... 2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]
... '''.strip()
>>> list_of_floats = [float(x) for x in data.strip('[]').split(None)]
>>> list_of_floats
[-0.145181984, 0.151671678, 0.159053639, -0.102861412, -0.0970948339, -0.175551832, -0.072443448, 0.119182713, -0.0454084426, -0.0923779532, 0.0887222588, 0.0105331177, -0.131792471, 0.0350326337, -0.065857783, 1.02670217, -0.0529987812, 0.0209167395, -0.119845152, 0.0230511073, 0.0289404951, 0.0417387672, -0.208203331, 0.0234342851]
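If you need the result as an ndarray rather than a list, the same split can feed NumPy directly:

>>> import numpy as np
>>> np.array(data.strip('[]').split(), dtype=float).shape
(24,)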
EDIT: For the case OP mentioned in the comments,
I am storing these arrays in LevelDB as key value pairs. The arrays are fasttext vectors. In levelDB vector (value) for each ngram (key) are stored. Is what you mentioned above applicable here?
Yes – you'd use BytesIO from the io module to emulate an in-memory "file" NumPy can write into, then put that buffer into LevelDB, and reverse the process (read from LevelDB into an empty BytesIO and pass it to NumPy) to read:
import io
import numpy as np

# serialize: write the array into an in-memory buffer, store its bytes
bio = io.BytesIO()
np.save(bio, my_array)
ldb.put('my-key', bio.getvalue())
# ...
# deserialize: wrap the stored bytes in a fresh buffer and load
bio = io.BytesIO(ldb.get('my-key'))
my_array = np.load(bio)

R readBin vs. Python struct

I am attempting to read a binary file using Python. Someone else has read in the data with R using the following code:
x <- readBin(webpage, numeric(), n=6e8, size = 4, endian = "little")
myPoints <- data.frame("tmax" = x[1:(length(x)/4)],
                       "nmax" = x[(length(x)/4 + 1):(2*(length(x)/4))],
                       "tmin" = x[(2*length(x)/4 + 1):(3*(length(x)/4))],
                       "nmin" = x[(3*length(x)/4 + 1):(length(x))])
With Python, I am trying the following code:
import struct

with open('file', 'rb') as f:
    val = f.read(16)
    while val != '':
        print(struct.unpack('4f', val))
        val = f.read(16)
I am getting slightly different results. For example, the first row in R returns 4 columns as -999.9, 0, -999.0, 0, while Python returns -999.0 for all four columns (output screenshots not reproduced here).
I know that the R code slices the vector by fractions of its length with the [] indexing, but I do not know exactly how to do this in Python, nor do I quite understand why they do it. Basically, I want to recreate in Python what R is doing.
I can provide more of either code base if needed. I did not want to overwhelm with code that was not necessary.
Deducing from the R code, the binary file first contains a certain number of tmax's, then the same number of nmax's, then tmin's and nmin's. What the code does is read the entire file, which is then chopped up into the 4 parts (tmax's, nmax's, etc.) using slicing.
To do the same in python:
import struct
# Read entire file into memory first. This is done so we can count
# number of bytes before parsing the bytes. It is not a very memory
# efficient way, but it's the easiest. The R-code as posted wastes even
# more memory: it always takes 6e8 * 4 bytes (~ 2.2Gb) of memory no
# matter how small the file may be.
#
data = open('data.bin','rb').read()
# Calculate number of points in the file. This is
# file-size / 16, because there are 4 numeric()'s per
# point, and they are 4 bytes each.
#
num = int(len(data) / 16)
# Now we know how much there are, we take all tmax numbers first, then
# all nmax's, tmin's and lastly all nmin's.
# First generate a format string because it depends on the number points
# there are in the file. It will look like: "fffff"
#
format_string = 'f' * num
# Then, for cleaner code, calculate chunk size of the bytes we need to
# slice off each time.
#
n = num * 4 # 4-byte floats
# Note that python has different interpretation of slicing indices
# than R, so no "+1" is needed here as it is in the R code.
#
tmax = struct.unpack(format_string, data[:n])
nmax = struct.unpack(format_string, data[n:2*n])
tmin = struct.unpack(format_string, data[2*n:3*n])
nmin = struct.unpack(format_string, data[3*n:])
print("tmax", tmax)
print("nmax", nmax)
print("tmin", tmin)
print("nmin", nmin)
If the goal is to have this data structured as a list of points(?) like (tmax,nmax,tmin,nmin), then append this to the code:
print()
print("Points:")
# Combine ("zip") all 4 lists into a list of (tmax,nmax,tmin,nmin) points.
# Python has a function to do this at once: zip()
#
i = 0
for point in zip(tmax, nmax, tmin, nmin):
    print(i, ":", point)
    i += 1
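A slightly more idiomatic variant uses enumerate() instead of the manual counter:

for i, point in enumerate(zip(tmax, nmax, tmin, nmin)):
    print(i, ":", point)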
Here's a less memory-hungry way to do the same. It is possibly a bit faster too (but that is difficult for me to check).
My computer did not have sufficient memory to run the first program with those huge files. This one does, but I still needed to create a list of only the tmax's first (the first 1/4 of the file), write it out, and then delete the list in order to have enough memory for the nmax's, tmin's and nmin's.
But this one too says the nmin's inside the 2018 file are all -999.0. If that doesn't make sense, could you check what the R code makes of it? I suspect that it is just what's in the file. The other possibility is, of course, that I got it all wrong (which I doubt). However, I tried the 2017 file too, and that one does not have such a problem: all of tmax, nmax, tmin, nmin have around 37% -999.0's.
Anyway, here's the second code:
import os
import struct
# load_data()
# data_store : object to append() data items (floats) to
# num : number of floats to read and store
# datafile : opened binary file object to read float data from
#
def load_data(data_store, num, datafile):
    for i in range(num):
        data = datafile.read(4)              # process one float (= 4 bytes) at a time
        item = struct.unpack("<f", data)[0]  # '<' means little endian
        data_store.append(item)

# save_list() saves a list of floats as strings to a file
#
def save_list(filename, datalist):
    output = open(filename, "wt")
    for item in datalist:
        output.write(str(item) + '\n')
    output.close()
#### MAIN ####
datafile = open('data.bin','rb')
# Get file size so we can calculate number of points without reading
# the (large) file entirely into memory.
#
file_info = os.stat(datafile.fileno())
# Calculate number of points, i.e. number of each tmax's, nmax's,
# tmin's, nmin's. A point is 4 floats of 4 bytes each, hence number
# of points = file-size / (4*4)
#
num = int(file_info.st_size / 16)
tmax_list = list()
load_data(tmax_list, num, datafile)
save_list("tmax.txt", tmax_list)
del tmax_list # huge list, save memory
nmax_list = list()
load_data(nmax_list, num, datafile)
save_list("nmax.txt", nmax_list)
del nmax_list # huge list, save memory
tmin_list = list()
load_data(tmin_list, num, datafile)
save_list("tmin.txt", tmin_list)
del tmin_list # huge list, save memory
nmin_list = list()
load_data(nmin_list, num, datafile)
save_list("nmin.txt", nmin_list)
del nmin_list # huge list, save memory
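As an aside, if NumPy is an option, the whole job collapses into a few lines; a sketch, assuming (as above) that the file size is an exact multiple of 16 bytes:

import numpy as np

x = np.fromfile('data.bin', dtype='<f4')  # all little-endian 4-byte floats
tmax, nmax, tmin, nmin = np.split(x, 4)   # four equal consecutive blocks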

Reading binary data in python

Firstly, before this question gets marked as a duplicate: I'm aware others have asked similar questions, but there doesn't seem to be a clear explanation. I'm trying to read a binary file into a 2D array (documented well here: http://nsidc.org/data/docs/daac/nsidc0051_gsfc_seaice.gd.html).
The header is a 300 byte array.
So far, I have;
import struct
with open("nt_197912_n07_v1.1_n.bin",mode='rb') as file:
filecontent = file.read()
x = struct.unpack("iiii",filecontent[:300])
This throws an error about the string argument length.
Reading the Data (Short Answer)
After you have determined the size of the grid (n_rows x n_cols = 448 x 304) from your header (see below), you can simply read the data using numpy.frombuffer.
import numpy as np
#...
#Get data from Numpy buffer
dt = np.dtype(('>u1', (n_rows, n_cols)))
x = np.frombuffer(filecontent[300:], dt) #we know the data starts from idx 300 onwards
#Remove unnecessary dimension that numpy gave us
x = x[0,:,:]
The '>u1' specifies the format of the data: unsigned integers of size 1 byte, in big-endian format.
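Equivalently, one can read a flat array and reshape it; a sketch of the same operation, relying on the filecontent, n_rows and n_cols defined above:

x = np.frombuffer(filecontent[300:], dtype='>u1').reshape(n_rows, n_cols)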
Plotting this with matplotlib.pyplot
import matplotlib.pyplot as plt
#...
plt.imshow(x, extent=[0,3,-3,3], aspect="auto")
plt.show()
The extent= option simply specifies the axis values, you can change these to lat/lon for example (parsed from your header)
Explanation of Error from .unpack()
From the docs for struct.unpack(fmt, string):
The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt))
You can determine the size specified in the format string (fmt) by looking at the Format Characters section.
Your fmt in struct.unpack("iiii", filecontent[:300]) specifies 4 int types (you can also use 4i instead of iiii for simplicity), each of which has size 4, requiring a string of length 16.
Your string (filecontent[:300]) is of length 300, whilst your fmt is asking for a string of length 16, hence the error.
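You can verify the required length with struct.calcsize():

>>> import struct
>>> struct.calcsize("iiii")  # four 4-byte ints
16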
Example Usage of .unpack()
As an example, reading your supplied document I extracted the first 21*6 bytes, which has format:
a 21-element array of 6-byte character strings that contain information such as polar stereographic grid characteristics
With:
x = struct.unpack("6s"*21, filecontent[:126])
This returns a tuple of 21 elements. Note the whitespace padding in some elements to meet the 6-byte requirement.
>> print x
# ('00255\x00', ' 304\x00', ' 448\x00', '1.799\x00', '39.43\x00', '45.00\x00', '558.4\x00', '154.0\x00', '234.0\x00', ' SMMR\x00', '07 cn\x00', ' 336\x00', ' 0000\x00', ' 0034\x00', ' 364\x00', ' 0000\x00', ' 0046\x00', ' 1979\x00', ' 336\x00', ' 000\x00', '00250\x00')
Notes:
The first argument fmt, "6s"*21, is a string with 6s repeated 21 times. Each format character 6s represents one string of 6 bytes (see below); this matches the required format specified in your document.
The number 126 in filecontent[:126] is calculated as 6*21 = 126.
Note that for the s (string) specifier, the preceding number does not mean to repeat the format character 6 times (as it would normally for other format characters). Instead, it specifies the size of the string: s represents a 1-byte string, whilst 6s represents a 6-byte string.
More Extensive Solution for Header Reading (Long)
Because the binary structure must be manually specified, this may be tedious to do in source code. You can consider using a configuration file (like a .ini file) instead.
This function will read the header and store it in a dictionary, where the structure is given from a .ini file
# use configparser for Python 3.x
import ConfigParser
import struct

def read_header(data, config_file):
    """
    Read binary data specified by an INI file which specifies the structure
    """
    with open(config_file) as fd:
        # init the config class
        conf = ConfigParser.ConfigParser()
        conf.readfp(fd)

    # preallocate dictionary to store data
    header = {}

    # iterate over the key-value pairs under the 'structure' section
    for key in conf.options('structure'):
        # determine the string properties
        start_idx, end_idx = [int(x) for x in conf.get('structure', key).split(',')]
        start_idx -= 1  # remember python is zero indexed!
        strLength = end_idx - start_idx

        # get the data
        header[key] = struct.unpack("%is" % strLength, data[start_idx:end_idx])

        # format the data
        header[key] = [x.strip() for x in header[key]]
        header[key] = [x.replace('\x00', '') for x in header[key]]

    # unmap from list-type (use .items() for Python 3.x)
    header = {k: v[0] for k, v in header.iteritems()}
    return header
An example .ini file is shown below. The key is the name to use when storing the data, and the value is a comma-separated pair, the first number being the starting index and the second the ending index. These values were taken from Table 1 in your document.
[structure]
missing_data: 1, 6
n_cols: 7, 12
n_rows: 13, 18
latitude_enclosed: 25, 30
This function can be used as follows:
header = read_header(filecontent, 'headerStructure.ini')
n_cols = int(header['n_cols'])

Error building bitstring in python 3.5 : the datatype is being set to U32 without my control

I'm using a function to build an array of strings (which happens to be 0s and 1s only), which are rather large. The function works when I am building smaller strings, but somehow the data type seems to be restricting the size of the string to 32 characters long (U32), without my having asked for it. Am I missing something simple?
As I build the strings, I am first casting them as lists so as to more easily manipulate individual characters before joining them into a string again. Am I somehow limiting my ability to use 'larger' data types by this method? The value of np.max(CM1) in this case is something like ~300 (one recent run yielded 253), but the strings only come out 32 characters long...
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
    for position, cell in np.ndenumerate(biopsy_list):
        if cell == 0: continue
        temp_parent = 2
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if cell == 1:
            derived_genomes_inBx[position] = ''.join(bitstring)
            continue
        else:
            while temp_parent > 1:
                temp_parent = family_dict[cell]
                bitstring[cell-1] = '1'
                if temp_parent == 1: break
                cell = family_dict[cell]
            derived_genomes_inBx[position] = ''.join(bitstring)
    return derived_genomes_inBx
The specific error message I get is:
Traceback (most recent call last):
File "biopsyCA.py", line 77, in <module>
if genome[site] == '1':
IndexError: string index out of range
family_dict is a dictionary which maps children to their parents; the algorithm above works through it to reconstruct the 'genome' of individuals from the branching family tree. It basically sets positions in the bitstring to '1' if your parent had it, then your grandparent, etc., until it gets to the first bit, which is always '1'; then it is done.
The 32 character limitation comes from the conversion of the float64 array to a string array in this line:
derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
The resulting array contains fixed-width strings (dtype U32 on Python 3, S32 on Python 2), which limits each element to 32 characters.
To change this limit, use 'U300' or larger instead of str.
You may also use list(map(str, np.zeros(len(biopsy_list)))) to get a more flexible string list and convert it back to a numpy array with numpy.array() after you have populated it.
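To illustrate the dtype difference (Python 3, recent NumPy):

>>> import numpy as np
>>> np.zeros(3).astype(str).dtype
dtype('<U32')
>>> np.zeros(3).astype('U300').dtype
dtype('<U300')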
Thanks to help from a number of folks here and local, I finally got this working and the working function is:
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = list(map(str, np.zeros(len(biopsy_list))))
    for biopsy in range(0, len(biopsy_list)):
        if biopsy_list[biopsy] == 0:
            bitstring = (np.max(CM1))*'0'
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if biopsy_list[biopsy] == 1:
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        else:
            temp_parent = family_dict[biopsy_list[biopsy]]
            bitstring[biopsy_list[biopsy]-1] = '1'
            while temp_parent > 1:
                temp_parent = family_dict[temp_parent]  # walk up to the next ancestor
                bitstring[temp_parent-1] = '1'
                if temp_parent == 1: break
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
    return derived_genomes_inBx
The original problem was, as Teppo Tammisto pointed out, an issue with the 'str' conversion producing the fixed-width 'S32'/'U32' format. Once I changed to using the list(map(str, ...)) approach, a few more issues arose with the original code, which I've now fixed. When I finish this thesis chapter I'll publish the whole family of functions used to virtually 'biopsy' a cellular automaton model (well, just an array really) and reconstruct 'genomes' from family tree data and the current automaton state vector.
Thanks all!
