If each array has the shape (1000, 2, 100), it is easy to use
con = np.concatenate((array_A, array_B))
to concatenate them, thus con has the shape (2000, 2, 100).
I want to dynamically append or concatenate "con" inside a function. The steps are as follows:
First, read data from the first file and process it to generate an array.
Second, read data from the second file and append the generated array to the first array
....
def arrayappend():
    for i in range(n):
        # read data from file_0 to file_n-1
        data = read(file_i)
        # data processing to generate an array with shape (1000, 2, 100)
        con = function(data)
        # append con
Assuming all your files produce the same shape objects and you want to join them on the 1st dimension, there are several options:
alist = []
for f in files:
    data = foo(f)
    alist.append(data)  # collect the processed array, not the file name
arr = np.concatenate(alist, axis=0)
concatenate takes a list of arrays. There are variations if you want to add a new axis (np.array(alist), np.stack, etc.).
Appending to a list is fast, since it just means adding a pointer to the data object. concatenate creates a new array from the components; it's compiled but still relatively slow.
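As a small self-contained illustration of the list-then-concatenate pattern, the sketch below uses a hypothetical make_block() in place of the file reading and processing steps (it is not part of the original code). It also shows how np.stack differs by adding a new leading axis.

```python
import numpy as np

def make_block(i):
    # Hypothetical stand-in for "read a file and process it";
    # returns an array of shape (1000, 2, 100).
    return np.full((1000, 2, 100), i, dtype=float)

blocks = [make_block(i) for i in range(3)]   # cheap: the list only stores references

joined = np.concatenate(blocks, axis=0)      # joins along the existing first axis
stacked = np.stack(blocks, axis=0)           # adds a new leading axis instead

print(joined.shape)   # (3000, 2, 100)
print(stacked.shape)  # (3, 1000, 2, 100)
```

Here the data is copied only once, in the final concatenate call.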
If you must/want to make a new array at each stage you could write:
arr = foo(files[0])
for f in files[1:]:
    data = foo(f)
    arr = np.concatenate((arr, data), axis=0)
This is probably slower, though if the file loading step is slow enough you might not notice the difference.
With care you might be able to start with arr = np.zeros((0,2,100)) and read all files in the loop. You have to make sure the initial 'empty' array has a compatible shape. New users often have problems with this.
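A minimal sketch of the empty-seed variant, assuming every processed array has shape (1000, 2, 100); the results list is a hypothetical stand-in for the per-file output.

```python
import numpy as np

# Hypothetical per-file results, each of shape (1000, 2, 100).
results = [np.ones((1000, 2, 100)) * k for k in range(3)]

# The seed has zero rows but the same trailing dimensions as the data.
arr = np.zeros((0, 2, 100))
for data in results:
    arr = np.concatenate((arr, data), axis=0)

print(arr.shape)  # (3000, 2, 100)

# A 1-D empty seed such as np.zeros(0) would fail here, because
# concatenate requires all inputs to have the same number of dimensions.
```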
If you absolutely want to do it during iteration then:
def arrayappend():
    con = None
    for i, d in enumerate(files_list):
        data = function(d)
        con = data if i == 0 else np.vstack([con, data])
    return con
This should stack it vertically.
Not pretty, but does it achieve what you want? It is far from optimized.
def arrayappend():
    for i in range(n):
        data = read(file_i)
        try:
            con  # raises NameError until con has been assigned
            con = np.concatenate((con, function(data)))
        except NameError:
            con = function(data)
    return con
The first loop iteration takes the except branch; subsequent ones won't.
I create an extendable EArray of shape Nx4. Some columns require the float64 datatype; the others can be managed with int32. Is it possible to vary the datatypes among the columns? Right now I just use one (float64, below) for all, but the resulting files take a huge amount of disk space (>10 GB).
For example, how can I ensure column 1-2 elements are int32 and 3-4 elements are float64?
import tables
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float32Atom(), shape=(0, 4))
Here is a simplistic version of how I am appending using Earray:
Matrix = np.ones(shape=(10**6, 4))
if counter <= 10**6:  # keep appending to Matrix until 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right]  # chunk2 is input np.ndarray
    s += length
# save to disk when rows = 10**6
if counter > 10**6:
    a.append(Matrix[:s])
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))
What are the cons of the following method?
import tables as tb
import numpy as np
filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))
# array containing ints..in reality it will be 10**6x2
arr1 = np.array([[1, 1],
[2, 2],
[3, 3]], dtype=np.int32)
# array containing floats..in reality it will be 10**6x2
arr2 = np.array([[1.1,1.2],
[1.1,1.2],
[1.1,1.2]], dtype=np.float64)
for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)
f.close()
print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)
No and Yes. All PyTables array types (Array, CArray, EArray, VLArray) are for homogeneous datatypes (similar to a NumPy ndarray). If you want to mix datatypes, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.
The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a row. You don't define the shape or number of rows. That is implied as you add data to the table. If you already have your data in a NumPy recarray, you can reference it with the description= entry, and the Table will use the dtype for the table and populate with the data. More info here: PyTables Tables Class
Your code would look something like this:
import tables as tb
import numpy as np
# Use explicit widths: plain int/float map to platform defaults (often 64-bit)
table_dt = np.dtype(
    {'names': ['int1', 'int2', 'float1', 'float2'],
     'formats': [np.int32, np.int32, np.float64, np.float64]})
# Create some random data:
i1 = np.random.randint(0,1000, (10**6,) )
i2 = np.random.randint(0,1000, (10**6,) )
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)
with tb.File('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)
    # Method 1 to create empty recarray 'Matrix', then add data:
    Matrix = np.recarray((10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)
    # Method 2 to create recarray 'Matrix' with data in 1 step:
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table
    a.append(Matrix)
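To get a rough feel for the space saved by mixing datatypes, the sketch below compares the per-row size of a structured dtype (two int32 and two float64 fields, as asked in the question) with four float64 values. It uses NumPy only and ignores HDF5 overhead and compression, so treat the numbers as an approximation of on-disk size.

```python
import numpy as np

mixed = np.dtype({'names': ['int1', 'int2', 'float1', 'float2'],
                  'formats': [np.int32, np.int32, np.float64, np.float64]})

mixed_row_bytes = mixed.itemsize                     # 4 + 4 + 8 + 8
float_row_bytes = 4 * np.dtype(np.float64).itemsize  # 4 * 8

print(mixed_row_bytes)   # 24
print(float_row_bytes)   # 32
```

With this layout each row is a quarter smaller than the all-float64 version.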
You mentioned creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some additional thoughts based on comments in another thread.
The .create_table() method has an optional parameter: expectedrows=. This parameter is used 'to optimize the HDF5 B-Tree and amount of memory used'. Default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; It's only 10000 in my installation.) I highly suggest you set this to a larger value if you are creating 10**6 (or more) rows.
Also, you should consider file compression. There's a trade-off: compression reduces the file size, but will reduce I/O performance (increases access time).
There are a few options:
Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1).
Use the HDF Group utility h5repack - run against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice-versa).
Use the PyTables utility ptrepack - works similarly to h5repack and is delivered with PyTables.
I tend to keep files I work with often uncompressed for best I/O performance. Then, when done, I convert them to compressed format for long-term archiving.
I am trying to rewrite the following code,
processed_feats[0, 0::feats+2] = current_feats[0, 0::feats]
processed_feats[0, 1::feats+2] = current_feats[0, 1::feats]
processed_feats[0, 2::feats+2] = current_feats[0, 2::feats]
processed_feats[0, 3::feats+2] = current_feats[0, 3::feats]
processed_feats[0, 4::feats+2] = current_feats[0, 4::feats]
processed_feats[0, 5::feats+2] = current_feats[0, 5::feats]
processed_feats[0, 6::feats+2] = 0
processed_feats[0, 7::feats+2] = 0
Where
feats = 6
current_feats is a (1,132) numpy array
and the size of processed_feats should be (1,176) and
have the following format [feat1_1,feat2_1...feat6_1,0,0,feat1_2,feat2_2...]
I am trying to make this into a one-liner or just fewer lines of code (if the new solution is less efficient than the existing code then I will go back to the old way). So far I have tried using numpy insert
processed_feats = np.insert(current_feats,range(6,len(current_feats[0]),feats+2),0)
but that does not account for adding the values at the end of the array and I have to use two insert commands since I need to add two 0s at every feats+2 index.
Reshape the two arrays to 22x8 and 22x6, and the operation simply becomes writing the second array into the first 6 columns of the first array and writing zeros into the other columns:
reshaped = processed_feats.reshape((22, 8))
reshaped[:, :6] = current_feats.reshape((22, 6))
reshaped[:, 6:] = 0
reshaped is a view of processed_feats, so writing data into reshaped writes through to processed_feats.
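A runnable sketch of the reshape approach, using hypothetical input values 1..132 so the resulting layout is easy to inspect. The np.pad line is an alternative one-liner that builds a new array rather than writing into an existing one.

```python
import numpy as np

feats = 6
n_groups = 22
# Hypothetical input: the values 1..132, shape (1, 132).
current_feats = np.arange(1, feats * n_groups + 1, dtype=float).reshape(1, -1)

processed_feats = np.empty((1, (feats + 2) * n_groups))
reshaped = processed_feats.reshape(n_groups, feats + 2)   # view, no copy
reshaped[:, :feats] = current_feats.reshape(n_groups, feats)
reshaped[:, feats:] = 0

print(np.shares_memory(reshaped, processed_feats))  # True
print(processed_feats[0, :10])  # [1. 2. 3. 4. 5. 6. 0. 0. 7. 8.]

# Alternative: pad two zero columns onto each group, then flatten again.
padded = np.pad(current_feats.reshape(n_groups, feats),
                ((0, 0), (0, 2))).reshape(1, -1)
```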
I am currently attempting to write code that randomly selects food items from a table (which has a macronutrient breakdown).
What I would like to know is: how do I tell Python "print the index of the food you randomly selected
as a list"?
Assume our input looks like:
import numpy as np
macro_nutrients = [
'carbohydrates',
'fats',
'dietary_fiber',
'minerals',
'proteins',
'vitamins',
'water'
]
You have several options:
If your macro-nutrients are stored in a list-like structure, you can do:
el = np.random.choice(macro_nutrients)
idx = macro_nutrients.index(el)
print(el, "; Is the index correct?:", el == macro_nutrients[idx])
# or you can just write:
idx = np.random.randint(0, len(macro_nutrients))  # upper bound is exclusive
print(macro_nutrients[idx])
For [].index() you can check this SO answer for caveats.
If you have a table-like structure (e.g. numpy 2d array):
# we will simulate it by permuting the above list several times and adding the
# permutation as a row in the new 2d array:
mat = np.array([np.random.permutation(macro_nutrients.copy()),
np.random.permutation(macro_nutrients.copy()),
np.random.permutation(macro_nutrients.copy()),
np.random.permutation(macro_nutrients.copy())])
# flatten() will convert your table back to 1d array
np.random.choice(mat.flatten())
# otherwise, you can use something like:
row = np.random.randint(0, mat.shape[0])  # upper bound is exclusive
col = np.random.randint(0, mat.shape[1])
print(mat[row, col])
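If the index is what you need, it can be simpler to draw the index first and look the item up afterwards. The sketch below does this with NumPy's Generator API; the shortened foods list and the two-row table are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng()
foods = ['carbohydrates', 'fats', 'proteins', 'water']  # shortened example list

# 1-D case: draw the index first, then look the item up.
idx = int(rng.integers(len(foods)))      # upper bound is exclusive
print(idx, foods[idx])

# 2-D case: draw a flat index and convert it to (row, col).
table = np.array([foods, foods[::-1]])
flat = rng.integers(table.size)
row, col = np.unravel_index(flat, table.shape)
print((int(row), int(col)), table[row, col])
```

This avoids list.index(), which returns only the first match when the list contains duplicates.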
What is the most efficient way to create a dask.array from a dask Series of lists?
The series consists of 5 million lists of 300 elements each.
It is currently divided into 500 partitions.
Currently I am trying:
pt = [delayed(np.array)(y)
for y in
[delayed(list)(x)
for x in series.to_delayed()]]
da = delayed(dask.array.concatenate)(pt, axis=1)
da = dask.array.from_delayed(da, (vec.size.compute(), 300), dtype=float)
The idea is to convert each partition into a numpy array and stitch
those together into a dask.array.
This code is taking forever to run though.
A numpy array can be built from this data quite quickly when done sequentially, as long as there is enough RAM.
I think that you are on the right track using dask.delayed. However calling list on the series is probably not ideal. I would create a function that converts one of your series into a numpy array and then go through delayed with that.
def convert_series_to_array(pandas_series):  # make this as fast as you can
    ...
    return numpy_array
L = dask_series.to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=...) for x in L]
x = da.concatenate(arrays, axis=0)
Also, regarding this line:
da = delayed(dask.array.concatenate)(pt, axis=1)
You should never call delayed on a dask function. They are already lazy.
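One possible shape for the conversion function, sketched with a plain Python list standing in for a single partition so that it runs without dask or pandas. The helper name convert_partition_to_array is hypothetical, and whether this is the fastest route for real data is an assumption that would need measuring.

```python
import numpy as np

def convert_partition_to_array(partition):
    # Hypothetical helper: 'partition' is any sequence of equal-length lists.
    # For a pandas Series, partition.tolist() would supply the same input.
    return np.asarray(list(partition), dtype=float)

partition = [[0.0] * 300, [1.0] * 300, [2.0] * 300]   # three rows of 300 values
block = convert_partition_to_array(partition)
print(block.shape)  # (3, 300)
```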
Looking at this with some dummy data. Building on @MRocklin's answer (and molding it more to my specific use case), let's say that your vectors are actually lists of ints instead of floats and each list is stored as a string. We take the series, transform it, and store it in a zarr array file.
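import ast
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from dask import delayed
# create dummy data
vectors = [ np.random.randint(low=0,high=100,size=300).tolist() for _ in range(1000) ]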
df = pd.DataFrame()
df['vector'] = vectors
df['vector'] = df['vector'].map(lambda x:f"{x}")
df['foo'] = 'bar'
ddf = dd.from_pandas( df, npartitions=100 )
# transform series data to numpy array
def convert_series_to_array( series ): # make this as fast as you can
    series_ = [ast.literal_eval( i ) for i in series]
    return np.stack(series_, axis=0)
L = ddf['vector'].to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=np.int64) for x in L]
x = da.concatenate(arrays, axis=0)
# store result into a zarr array
x.compute_chunk_sizes().to_zarr( 'toy_dataset.zarr', '/home/user/Documents/', overwrite=True )
I have a nested list containing lists of different sizes and types.
def read(f,tree,objects):
    Event=[]
    for o in objects:
        #find different features of one class
        temp=[i.GetName() for i in tree.GetListOfBranches() if i.GetName().startswith(o)]
        tempList=[] #contains one class of objects
        for t in temp:
            #print t
            tempList.append(t)
            comp=np.asarray(getattr(tree,t))
            tempList.append(comp)
        Event.append(tempList)
    return Event
def main():
    path="path/to/file"
    objects= ['TauJet', 'Jet', 'Electron', 'Muon', 'Photon', 'Tracks', 'ETmis', 'CaloTower']
    f=ROOT.TFile(path)
    tree=f.Get("RecoTree")
    tree.GetEntry(100)
    event=read(f,tree,objects)
For example, the result of event[0] is
['TauJet', array(1), 'TauJet_E', array([ 31.24074173]), 'TauJet_Px', array([-28.27997971]), 'TauJet_Py', array([-13.18042469]), 'TauJet_Pz', array([-1.08304048]), 'TauJet_Eta', array([-0.03470514]), 'TauJet_Phi', array([-2.70545626]), 'TauJet_PT', array([ 31.20065498]), 'TauJet_Charge', array([ 1.]), 'TauJet_NTracks', array([3]), 'TauJet_EHoverEE', array([ 1745.89221191]), 'TauJet_size', array(1)]
How can I convert it into a numpy array?
NOTE 1: np.asarray(event, "object") is slow. I am looking for a better way. Also, np.fromiter() is not applicable since I don't have a fixed type.
NOTE 2: I don't know the length of my Events.
NOTE 3: I can also get rid of the names if that makes things easier.
You could try something like this, though I'm not sure how fast it's going to be. This creates a numpy record array for the first row.
data = event[0]
keys = data[0::2]
vals = data[1::2]
#there are some zero-rank arrays in there, so need to check for those,
#but I think just converting them with float() should work
#(np.float was removed in NumPy 1.24).
temp = [float(v) for v in vals]
#you could also just create a np array from the line above with np.array(temp)
#"formats" must be a list with one entry per field
dtype={"names":keys, "formats":["f4"]*len(vals)}
myArr = np.rec.fromarrays(temp, dtype=dtype)
#test it out
In [53]: myArr["TauJet_Pz"]
Out[53]: array(-1.0830404758453369, dtype=float32)
#alternatively, you could try something like this, which just creates a 2d numpy array
#(this assumes every row has the same fields as event[0])
vals = np.array([[float(v) for v in row[1::2]] for row in event])
#now create a nice record array from that using the dtypes above;
#fromarrays expects one array per field, so pass the columns
myRecordArray = np.rec.fromarrays(vals.T, dtype=dtype)
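A self-contained sketch of the same idea using a plain structured array. The row below is a made-up, shortened version of event[0], and treating every field as float32 is an assumption carried over from the answer above.

```python
import numpy as np

# Hypothetical flat event row: alternating names and NumPy values.
row = ['TauJet_E', np.array([31.24]),
       'TauJet_Pz', np.array([-1.08]),
       'TauJet_size', np.array(1)]

names = row[0::2]
# .item() works for both zero-rank arrays and one-element arrays.
values = tuple(np.asarray(v).item() for v in row[1::2])

dtype = np.dtype([(name, 'f4') for name in names])
record = np.array([values], dtype=dtype)   # one record per event

print(record.dtype.names)
print(record['TauJet_Pz'])
```

Fields holding more than one value per event (for example, several jets) would need a different layout; this sketch only covers the single-value case shown in the question.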