Insert evenly spaced values into a numpy array - Python

I am trying to rewrite the following code,
processed_feats[0, 0::feats+2] = current_feats[0, 0::feats]
processed_feats[0, 1::feats+2] = current_feats[0, 1::feats]
processed_feats[0, 2::feats+2] = current_feats[0, 2::feats]
processed_feats[0, 3::feats+2] = current_feats[0, 3::feats]
processed_feats[0, 4::feats+2] = current_feats[0, 4::feats]
processed_feats[0, 5::feats+2] = current_feats[0, 5::feats]
processed_feats[0, 6::feats+2] = 0
processed_feats[0, 7::feats+2] = 0
where
feats = 6,
current_feats is a (1, 132) numpy array,
and processed_feats should have size (1, 176) and the format [feat1_1, feat2_1, ..., feat6_1, 0, 0, feat1_2, feat2_2, ...].
I am trying to turn this into a one-liner, or at least fewer lines of code (if the new solution is less efficient than the existing code then I will go back to the old way). So far I have tried using numpy insert:
processed_feats = np.insert(current_feats,range(6,len(current_feats[0]),feats+2),0)
but that does not account for adding the values at the end of the array, and I have to use two insert commands since I need to add two zeros at every (feats+2)-th index.

Reshape the two arrays to 22x8 and 22x6, and the operation simply becomes writing the second array into the first 6 columns of the first array and writing zeros into the other columns:
reshaped = processed_feats.reshape((22, 8))
reshaped[:, :6] = current_feats.reshape((22, 6))
reshaped[:, 6:] = 0
reshaped is a view of processed_feats, so writing data into reshaped writes through to processed_feats.
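For reference, here is a minimal end-to-end sketch of the reshape approach with dummy data; the shapes (1, 132) and (1, 176) and feats = 6 come from the question, while the dummy values are only illustrative:
import numpy as np

feats = 6
current_feats = np.arange(132, dtype=float).reshape(1, 132)  # dummy (1, 132) input
processed_feats = np.empty((1, 176))                         # (1, 176) output buffer

# View the output as 22 groups of (feats + 2) and the input as 22 groups of feats,
# then copy the features and zero the two padding columns of each group.
reshaped = processed_feats.reshape(22, feats + 2)
reshaped[:, :feats] = current_feats.reshape(22, feats)
reshaped[:, feats:] = 0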

Related

Python/PyTables: Is it possible to have different data types for different columns of an array?

I create an expandable EArray with Nx4 columns. Some columns require the float64 datatype, the others can be managed with int32. Is it possible to vary the data types among the columns? Right now I just use one (float64, below) for all, but it takes huge disk space for (>10 GB) files.
For example, how can I ensure that the elements of columns 1-2 are int32 and those of columns 3-4 are float64?
import tables
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float32Atom(), shape=(0, 4))
Here is a simplistic version of how I am appending using Earray:
Matrix = np.ones(shape=(10**6, 4))

if counter <= 10**6:  # keep appending to Matrix until 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right]  # chunk2 is input np.ndarray
    s += length

# save to disk when rows = 10**6
if counter > 10**6:
    a.append(Matrix[:s])
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))
What are the cons for the following method?
import tables as tb
import numpy as np
filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))
# array containing ints..in reality it will be 10**6x2
arr1 = np.array([[1, 1],
                 [2, 2],
                 [3, 3]], dtype=np.int32)
# array containing floats..in reality it will be 10**6x2
arr2 = np.array([[1.1, 1.2],
                 [1.1, 1.2],
                 [1.1, 1.2]], dtype=np.float64)
for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)
f.close()
print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)
No and Yes. All PyTables array types (Array, CArray, EArray, VLArray) are for homogeneous datatypes (similar to a NumPy ndarray). If you want to mix datatypes, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.
The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a row. You don't define the shape or number of rows. That is implied as you add data to the table. If you already have your data in a NumPy recarray, you can reference it with the description= entry, and the Table will use the dtype for the table and populate with the data. More info here: PyTables Tables Class
Your code would look something like this:
import tables as tb
import numpy as np
table_dt = np.dtype(
    {'names': ['int1', 'int2', 'float1', 'float2'],
     'formats': [int, int, float, float]})

# Create some random data:
i1 = np.random.randint(0, 1000, (10**6,))
i2 = np.random.randint(0, 1000, (10**6,))
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)

with tb.File('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)

    # Method 1 to create empty recarray 'Matrix', then add data:
    Matrix = np.recarray((10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)

    # Method 2 to create recarray 'Matrix' with data in 1 step:
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table
    a.append(Matrix)
You mentioned creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some additional thoughts based on comments in another thread.
The .create_table() method has an optional parameter: expectedrows=. This parameter is used 'to optimize the HDF5 B-Tree and amount of memory used'. The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I highly suggest you set this to a larger value if you are creating 10**6 (or more) rows.
Also, you should consider file compression. There's a trade-off: compression reduces the file size, but will reduce I/O performance (increases access time).
There are a few options:
Enable compression when you create the file (add the filters= parameter to the creation call); start with tb.Filters(complevel=1).
Use the HDF Group utility h5repack - run it against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa).
Use the PyTables utility ptrepack - it works similarly to h5repack and is delivered with PyTables.
For files I work with often, I tend to keep them uncompressed for the best I/O performance. When done, I convert them to a compressed format for long-term archiving.
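As an illustration, a hedged sketch of what the creation step might look like with both tweaks applied; the expectedrows value of 10**7 and complevel=1 are placeholder choices, not figures from the original answer, and table_dt is the dtype defined above:
import tables as tb

# enable light compression for the whole file and size the B-Tree for ~10**7 rows
filters = tb.Filters(complevel=1, complib='zlib')
with tb.open_file('table.h5', 'w', filters=filters) as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt,
                         expectedrows=10**7)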

Random Macro-nutrient selection (Python)

I am currently attempting to write code that randomly selects food items from a table (which has a macro-nutrient breakdown).
What I would like to know is: how do I tell Python "print the index of the food you randomly selected as a list"?
Assume our input looks like:
import numpy as np
macro_nutrients = [
    'carbohydrates',
    'fats',
    'dietary_fiber',
    'minerals',
    'proteins',
    'vitamins',
    'water'
]
You have several options:
If your macro-nutrients are stored in a list-like structure, you can do:
el = np.random.choice(macro_nutrients)
idx = macro_nutrients.index(el)
print(el, "; Is the index correct?:", el == macro_nutrients[idx])
# or you can just write:
idx = np.random.randint(0, len(macro_nutrients))  # the high end is exclusive, so no -1
print(macro_nutrients[idx])
For [].index() you can check this SO answer for caveats.
If you have a table-like structure (e.g. numpy 2d array):
# we will simulate it by permuting the above list several times and adding the
# permutation as a row in the new 2d array:
mat = np.array([np.random.permutation(macro_nutrients.copy()),
                np.random.permutation(macro_nutrients.copy()),
                np.random.permutation(macro_nutrients.copy()),
                np.random.permutation(macro_nutrients.copy())])
# flatten() will convert your table back to 1d array
np.random.choice(mat.flatten())
# otherwise, you can use something like:
row = np.random.randint(0, mat.shape[0])  # the high end is exclusive
col = np.random.randint(0, mat.shape[1])
print(mat[row, col])
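If you also need the position of the randomly selected entry in the 2D table (the question asks for the index), one hedged alternative, assuming the mat array from above, is to draw a flat index and map it back to 2D coordinates with np.unravel_index:
flat_idx = np.random.randint(0, mat.size)         # the high end is exclusive
row, col = np.unravel_index(flat_idx, mat.shape)  # convert back to (row, col)
print(mat[row, col], "is at index", (row, col))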

Trying to loop through multiple arrays and getting error: ValueError: cannot reshape array of size 2 into shape (44,1)

New to for loops and I cannot seem to get this one to work. I have multiple arrays that I want to run through my code. It works for individual arrays, but when I try to run it through a list of arrays it tries to join the arrays together.
I have tried Pandas looping and multiple attempts at looping in numpy.
# Min regret matrix
for i in [a], [b], [c], [d], [e]:
    # sum columns and rows:
    suma0 = np.sum(a, axis=0)
    suma1 = np.sum(a, axis=1)
    # find the minimum values for rows and columns:
    col_min = np.min(a)
    col_min0 = data.min(0)
    row_min = np.min(a[:44])
    row_min0 = data.min(1)
    # difference or least regret between scenarios and policies:
    p = np.array(a)
    q = np.min(p, axis=0)
    r = np.min(p, axis=1)
    cidx = np.argmin(p, axis=0)
    ridx = np.argmin(p, axis=1)
    cdif = p - q
    rdif = p - r[:, None]
    # find the sum of the rows and columns for the difference arrays:
    sumc = np.sum(cdif, axis=0)
    sumr = np.sum(rdif, axis=1)
    sumr1 = np.reshape(sumr, (44, 1))
    # append the scenario array with the column sums:
    sumcol = np.zeros((45, 10))
    sumcol = np.append([cdif], [sumc])
    sumcol.shape = (45, 10)
    # rank columns:
    order0 = sumc.argsort()
    rank0 = order0.argsort()
    rankcol = np.zeros((46, 10))
    rankcol = np.append([sumcol], [rank0])
    rankcol.shape = (46, 10)
    # append the policy array with row sums
    sumrow = np.zeros((44, 11))
    sumrow = np.hstack((rdif, sumr1))
    # rank rows
    order1 = sumr.argsort()
    rank1 = order1.argsort()
    rank1r = np.reshape(rank1, (44, 1))
    rankrow = np.zeros((44, 12))
    rankrow = np.hstack((sumrow, rank1r))
    print(sumrow)
    print(rankrow)
    # Add row and column headers for least regret for df0:
    RCP = np.zeros((47, 11))
    RCP = pd.DataFrame(rankcol, columns=column_names1, index=row_names1)
    print(RCP)
    # Add row and column headers for least regret for df1:
    RCP1 = np.zeros((45, 13))
    RCP1 = pd.DataFrame(rankrow, columns=column_names2, index=row_names2)
    print(RCP1)
    # Export loops to CSV in output folder:
    filepath = os.path.join(output_path, 'out_' + str(index) + '.csv')
    RCP.to_csv(filepath)
    filepath = os.path.join(output_path, 'out1_' + str(index) + '.csv')
    RCP1.to_csv(filepath)
As per your question, please include the input, the expected output, and the error; here is a base-case example of the same error:
x = np.random.randn(2)
x.shape = (2,)
and if we attempt:
x.reshape(44,1)
The error we get is:
ValueError: cannot reshape array of size 2 into shape (44,1)
The reason for this error is simple: we are trying to reshape an array of size 2 into a shape that requires 44 elements, and reshape can only change the shape when the total number of elements stays the same. So check the dimensions of your input against the expected output.
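As a hedged sketch of the looping pattern the question seems to be after (the names a through e and the 44-element assumption come from the question; the size check is an addition): iterate over the arrays themselves and verify each one has the expected number of elements before reshaping.
for idx, arr in enumerate([a, b, c, d, e]):
    arr = np.asarray(arr)
    if arr.size != 44:
        raise ValueError(f"array {idx} has size {arr.size}, expected 44")
    col = arr.reshape(44, 1)  # safe: exactly 44 elements fit shape (44, 1)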

efficiently create dask.array from a dask.Series of lists

What is the most efficient way to create a dask.array from a dask.Series of list?
The series consists of 5 million lists of 300 elements each.
It is currently divided into 500 partitions.
Currently I am trying:
pt = [delayed(np.array)(y)
      for y in
      [delayed(list)(x)
       for x in series.to_delayed()]]
da = delayed(dask.array.concatenate)(pt, axis=1)
da = dask.array.from_delayed(da, (vec.size.compute(), 300), dtype=float)
The idea is to convert each partition into a numpy array and stitch
those together into a dask.array.
This code is taking forever to run though.
A numpy array can be built from this data quite quickly when done sequentially, as long as there is enough RAM.
I think that you are on the right track using dask.delayed. However calling list on the series is probably not ideal. I would create a function that converts one of your series into a numpy array and then go through delayed with that.
def convert_series_to_array(pandas_series):  # make this as fast as you can
    ...
    return numpy_array

L = dask_series.to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=...) for x in L]
x = da.concatenate(arrays, axis=0)
Also, regarding this line:
da = delayed(dask.array.concatenate)(pt, axis=1)
You should never call delayed on a dask function. They are already lazy.
Looking at this with some dummy data, and building on @MRocklin's answer (and molding it more after my specific use case), let's say that your vectors are actually lists of ints instead of floats, and that each list is stored as a string. We take the series, transform it, and store it in a zarr array file.
import ast

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from dask import delayed

# create dummy data
vectors = [np.random.randint(low=0, high=100, size=300).tolist() for _ in range(1000)]
df = pd.DataFrame()
df['vector'] = vectors
df['vector'] = df['vector'].map(lambda x: f"{x}")
df['foo'] = 'bar'
ddf = dd.from_pandas(df, npartitions=100)

# transform series data to numpy array
def convert_series_to_array(series):  # make this as fast as you can
    series_ = [ast.literal_eval(i) for i in series]
    return np.stack(series_, axis=0)

L = ddf['vector'].to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=np.int64) for x in L]
x = da.concatenate(arrays, axis=0)

# store result into a zarr array
x.compute_chunk_sizes().to_zarr('toy_dataset.zarr', '/home/user/Documents/', overwrite=True)
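As a quick follow-up (not part of the original answer), the stored array can presumably be loaded back lazily with dask.array.from_zarr, using the same store path and component as above:
y = da.from_zarr('toy_dataset.zarr', component='/home/user/Documents/')
print(y.shape, y.dtype)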

dynamically append N-dimensional array

If each array has the shape (1000, 2, 100), it is easy to use
con = np.concatenate((array_A, array_B))
to concatenate them, thus con has the shape (2000, 2, 100).
I want to dynamically append or concatenate "con" inside a function. The steps are described as follows:
First, read data from the first file and process it to generate an array.
Second, read data from the second file and append the generated array to the first one.
....
def arrayappend():
    for i in range(n):
        # read data from file_0 to file_n-1
        data = read(file_i)
        # data processing to generate an array with shape (1000, 2, 100)
        con = function(data)
        # append con
Assuming all your files produce the same shape objects and you want to join them on the 1st dimension, there are several options:
alist = []
for f in files:
    data = foo(f)
    alist.append(data)  # append the processed array, not the filename
arr = np.concatenate(alist, axis=0)
concatenate takes a list. There are variations if you want to add a new axis (np.array(alist), np.stack etc).
Appending to a list is fast, since it just means adding a pointer to the data object. concatenate creates a new array from the components; it's compiled but still relatively slower.
If you must/want to make a new array at each stage you could write:
arr = function(files[0])
for f in files[1:]:
    data = foo(f)
    arr = np.concatenate((arr, data), axis=0)
This probably is slower, though if the file loading step is slow enough you might not notice the difference.
With care you might be able to start with arr = np.zeros((0, 2, 100)) and read all files in the loop, as sketched below. You have to make sure the initial 'empty' array has a compatible shape. New users often have problems with this.
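A minimal sketch of that empty-start pattern, assuming each foo(f) returns a (1000, 2, 100) array as in the question (foo and files are placeholders carried over from above):
arr = np.zeros((0, 2, 100))  # 'empty' array with a compatible trailing shape
for f in files:
    data = foo(f)            # each call yields a (1000, 2, 100) array
    arr = np.concatenate((arr, data), axis=0)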
If you absolutely want to do it during iteration then:
def arrayappend():
    con = None
    for i, d in enumerate(files_list):
        data = function(d)
        con = data if i == 0 else np.vstack([con, data])
This should stack it vertically.
Not very pretty, but does it achieve what you want? It is completely unoptimized.
def arrayappend():
    for i in range(n):
        data = read(file_i)
        try:
            con
            con = np.concatenate((con, function(data)))
        except NameError:
            con = function(data)
    return con
The first iteration will take the except branch; subsequent ones won't.
