Downsample numpy array - python

I have two numpy arrays. One has features and the other the corresponding labels.
There are 3 classes, but the dataset is imbalanced, so I would like to balance it by downsampling one class. That class has about 10k elements, and I would like to bring it down to around 2k like the other classes. I tried to do it with a for loop that builds a new array, but I am sure there is a cleaner method.
In the end there should again be two numpy arrays: the labels array with the excess elements of that class removed, and the features array with the same elements removed, so the two stay aligned.
Any idea? Thanks!
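For reference, a vectorized sketch of that downsampling (the class label 0, the target size of 2000, and the array shapes are all assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: features is (n_samples, n_features), labels is (n_samples,)
features = rng.normal(size=(14000, 5))
labels = np.repeat([0, 1, 2], [10000, 2000, 2000])

majority_class = 0          # the over-represented class (assumed)
target_size = 2000          # how many of its samples to keep (assumed)

# Randomly pick which samples of the majority class to keep, without replacement.
majority_idx = np.flatnonzero(labels == majority_class)
keep_majority = rng.choice(majority_idx, size=target_size, replace=False)

# Keep every sample of the other classes plus the sampled subset, in original order.
keep = np.sort(np.concatenate([np.flatnonzero(labels != majority_class), keep_majority]))

features_balanced = features[keep]
labels_balanced = labels[keep]

Because both arrays are indexed with the same keep indices, the features and labels stay in sync.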


Slicing Ragged Array

I have a bunch of sliced images (32 x 32 in shape) stored each with a corresponding string id (file name and such). The slices and their ids are grouped in a bunch of arrays part of a final large array. This array is ragged and non-standard, but I'd like to be able to efficiently access the slices inside.
Let's say that I have 500 slices. The shape ought to be (500, 2) on the surface, but the pairing of each slice with its id is not a standard array with a well-defined shape.
What I would like to be able to do is extract just the sliced images from the final array. Normally, I could collect everything by slicing the big array, something like big_array[:][:][0], but the ragged nesting has made the array appear 1D with shape (500,).
The only way around this is to use a clunky for loop, but I'm pretty sure everything I've been doing up until this point has been a terrible way of storing the data.
I need to keep the ids associated with each slice because I'm training a model with this, and if something goes wrong, I'd like to be able to reference the origins of the slice which has undergone some processing.
The only other way around this is to store the ids and slices separately, but that is also a lot of hassle since I have to save them in separate files.
What's the correct way to store this thing?
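One way to sidestep the ragged object array (a sketch, under the assumption that every slice really is 32 x 32) is to keep two parallel arrays of equal length, one for the pixel data and one for the string ids, and save them together in a single .npz file so the pairing is never lost:

import numpy as np

# Hypothetical data: 500 slices of shape (32, 32) and their string ids.
n_slices = 500
slices = np.random.rand(n_slices, 32, 32).astype(np.float32)     # shape (500, 32, 32)
ids = np.array([f"image_{i:04d}.png" for i in range(n_slices)])  # unicode string array

# One file keeps the pairing intact; index i in slices matches index i in ids.
np.savez_compressed("slices.npz", slices=slices, ids=ids)

loaded = np.load("slices.npz")
all_images = loaded["slices"]        # regular (500, 32, 32) array, no ragged nesting
origin_of_third = loaded["ids"][2]   # trace a slice back to its source file

With the images in a regular 3-D array, no per-element loop is needed to pull them out, and the ids still travel in the same file.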

Fastest Way to Create Large Numpy Arrays?

I'm working on creating a bunch of data (10M rows or more) and am trying to understand more deeply the best way to work with numpy arrays as quickly as possible.
I'll have 1M rows of each class of data, which I read in from different sources (async). When I'm done reading, I want to combine them into a single numpy array. I'll know the final array size (10M rows) precisely.
I'm thinking I have the following paths:
Create a global numpy array of the entire size and copy in
Create a global numpy array and a numpy array for each source and concat together at the end
Create a null global numpy array and add each row to the global array (I think this is the slowest)
However, I'm not sure how to do #1 - numpy.copyto seems to always start with index 0.
Is there another model I should be going with here?
If I use "views", I'm not clear on how to copy them into the final array. I'm familiar with views for databases, of course, but not for numpy.
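For what it's worth, path #1 doesn't need numpy.copyto at all: plain slice assignment writes each source's block at any offset. A rough sketch (the row counts, column count, and the read_source helper are all assumptions):

import numpy as np

n_sources = 10
rows_per_source = 1_000_000      # 1M rows per source, 10M total (assumed)
n_cols = 8                       # assumed number of columns per row

def read_source(i):
    # Stand-in for one async source; in practice this is whatever each read returns.
    return np.full((rows_per_source, n_cols), i, dtype=np.float64)

# Path #1: allocate the final array once, then write each block into its slice.
out = np.empty((n_sources * rows_per_source, n_cols), dtype=np.float64)
offset = 0
for i in range(n_sources):
    chunk = read_source(i)
    out[offset:offset + len(chunk)] = chunk   # slice assignment copies into place
    offset += len(chunk)

The slice out[offset:offset + n] is itself a view into the preallocated array, so this avoids the repeated reallocations that concatenating or appending per row would cause.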

How to append multiple data arrays into one variable of an xarray dataset?

I am new to this field and need a little help.
I just want to know: what is the best way to append multiple data arrays into one variable of an xarray dataset?
Each data array has a different time and different values, but the same x, y coordinates as the dataset.
I tried ds[variable_name] = da, but it works only for the first data array.
I want to create a function that takes data arrays, puts them into one variable of the dataset, and updates the time dimension of the dataset.
Thanks for your help.
The best way to do that is to first convert the data arrays to datasets separately and then merge the datasets together (using xr.merge).
Hope it helps others.
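A small sketch of that approach (the variable name my_variable and the coordinates are made up for illustration):

import numpy as np
import pandas as pd
import xarray as xr

x = np.arange(4)
y = np.arange(3)

def make_da(timestamp):
    # One 2-D field at a single time step, sharing the x/y coordinates of the dataset.
    data = np.random.rand(1, len(y), len(x))
    return xr.DataArray(
        data,
        dims=("time", "y", "x"),
        coords={"time": [pd.Timestamp(timestamp)], "y": y, "x": x},
        name="my_variable",
    )

da1 = make_da("2024-01-01")
da2 = make_da("2024-01-02")

# Convert each DataArray to a Dataset, then merge; the time dimension of the
# result covers both input times.
ds = xr.merge([da1.to_dataset(), da2.to_dataset()])
print(ds["my_variable"].sizes)   # e.g. {'time': 2, 'y': 3, 'x': 4}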

Accessing ndarray columns using a custom label

I've just started learning numpy to analyse experimental data, and I am trying to give a custom name to an ndarray column so that I can do operations without slicing.
The data I receive is generally a two-column .txt file (I'll call the columns X and Y for the sake of clarity) with a large number of rows corresponding to the measured data. I do operations on those data and generate new columns (call them F(X,Y), G(X,Y,F), ...). I know I can do column-wise operations by slicing, i.e. Data[:,2] = Data[:,1] + Data[:,0], but with a large number of added columns this becomes tedious. Hence I'm looking for a way to label the columns so that I can refer to a column by its label, and can also label the new columns I generate. Essentially, I'm looking for something that would let me directly write F = X + Y (as a substitute for the example above).
Currently, I'm assigning the entire column to a new variable, doing the operations, and then hstack-ing the result onto the data, but I'm unsure about the memory usage here. For example:
X = Data[:, 0]
Y = Data[:, 1]
F = X + Y
# n is the number of rows, i.e. Data.shape[0]; F must be reshaped to a column before hstack
Data = numpy.hstack((Data, F.reshape(n, 1)))
I've seen the use of structured arrays and record arrays, but the data I'm dealing with is homogeneous and new columns are being added continuously. Also, I hear pandas is well suited for what I'm describing, but since I'm working with purely numerical data, I don't see the need to learn a new module unless it's really necessary. Thanks in advance.
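One lightweight option (just a sketch; the file name and column names are placeholders) is to keep each column as a 1-D array in a plain dict, so new columns can be created and referenced by label without slicing:

import numpy as np

# Load the two measured columns from the .txt file (whitespace-delimited assumed).
raw = np.loadtxt("data.txt")              # shape (n_rows, 2)
cols = {"X": raw[:, 0], "Y": raw[:, 1]}

# Derived columns can be added and referred to by name, essentially F = X + Y.
cols["F"] = cols["X"] + cols["Y"]
cols["G"] = cols["F"] * cols["X"] - cols["Y"]

# If a single 2-D array is needed later (e.g. for saving), stack once at the end
# instead of hstack-ing after every new column.
Data = np.column_stack(list(cols.values()))

This avoids reallocating Data with every new column; structured arrays or pandas remain the more featureful options if the labels need to travel with the data on disk.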

Pandas dataframe having an additional "layer"

Suppose you have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.nan, columns=['A', 'B', 'C'], index=[0, 1, 2])
Suppose I want an additional "layer" on top of this pandas dataframe, such that column A, row 0 would have its own value in that layer, column B, row 0 a different value, column C, row 0 something else, column A, row 1 and so on. So essentially a second dataframe stacked on top of this existing one.
Is it possible to add such layers? How does one access them? Is this efficient, i.e. should I just use a separate dataframe altogether? And how would one save these multiple layers to CSV: by accessing the individual layers, or is there a function that would break them down into different worksheets in the same workbook?
pandas.DataFrame cannot have 3 dimensions:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
However, there is a way to fake 3-dimensions with MultiIndex / Advanced Indexing:
Hierarchical indexing (MultiIndex)
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
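For example, a minimal sketch of faking a second "layer" with a row MultiIndex (the layer names here are made up):

import numpy as np
import pandas as pd

layers = ["base", "extra"]
rows = [0, 1, 2]
index = pd.MultiIndex.from_product([layers, rows], names=["layer", "row"])

df = pd.DataFrame(np.nan, index=index, columns=["A", "B", "C"])

base = df.loc["base"]             # one layer as an ordinary 2-D frame
df.loc[("extra", 0), "A"] = 1.5   # set a single cell in the second layer

df.to_csv("layered.csv")          # the layer label is kept as an extra index column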
If you really need that extra dimension, go with pandas.Panel:
Panel is a somewhat less-used, but still important container for 3-dimensional data.
but don't miss this important disclaimer from the docs:
Note: Unfortunately Panel, being less commonly used than Series and DataFrame, has been slightly neglected feature-wise. A number of methods and options available in DataFrame are not available in Panel.
There is also pandas.Panel4D (experimental), in the unlikely event that you need it.
