Applying a simple function to CSV data and saving multiple CSV files - Python

I am trying to replicate my data by multiplying every value by a random factor within a range and to save the results as CSV.
I have created a function Replicate_Data which takes an input numpy array and multiplies it by a random value drawn from a range. What is the best way to create 100 such files and save them as P3D1, P4D1, and so on?
import random
import numpy as np

def Replicate_Data(data: np.ndarray) -> np.ndarray:
    Rep_factor = random.uniform(-3, 7)
    data1 = data * Rep_factor
    return data1

P2D1 = Replicate_Data(P1D1)
np.savetxt("P2D1.csv", P2D1, delimiter=",")  # note: np.savetxt has no dtype argument

Here is an example you can use as reference.
I generate toy data named toy, then I draw n random values with np.random.uniform and call them randos, then I multiply these two objects to form out using numpy broadcasting. You could also do this multiplication in a loop (the same one you save in, in fact); as written below, the broadcast version can be very memory intensive depending on the size of your input array. A more complete answer probably depends on the shape of your input data.
import numpy as np

toy = np.random.random(size=(2, 2))         # a toy input array
n = 100                                     # number of random values
randos = np.random.uniform(-3, 7, size=n)   # generate 100 uniform randoms

# now multiply all elements in toy by the randoms in randos
out = toy[None, ...] * randos[..., None, None]  # this depends on the shape
# this will work only if toy has two dimensions, otherwise it requires modification
# it will take a lot of memory... 100*toy.nbytes worth

# now save in the loop
for i, o in enumerate(out):
    name = 'P{}D1'.format(str(i + 1))
    np.savetxt(name, o, delimiter=",")

# a second way without the broadcasting (slow, better on memory)
# more like 2*toy.nbytes
# for i, r in enumerate(randos):
#     name = 'P{}D1'.format(str(i + 1))
#     np.savetxt(name, r*toy, delimiter=",")

Related

Array interpolation, but padding with only source values, without inserting new values - Python

I have a python array (old_array) of size N (N=1920) and I want to interpolate it to a new array (new_array) of size M (M=2823).
Usually, I interpolate an old array like this:
import numpy as np
from scipy.interpolate import make_interp_spline

N = 1920
M = 2823
Spline_object = make_interp_spline(x_axis, old_array)  # len(x_axis) == len(old_array)
X = np.linspace(x_axis.min(), x_axis.max(), M)
new_array = Spline_object(X)
But this way I get a new array whose values lie between the source values. I need to fill the new array only with source values, not with newly inserted ones.
For example, if N=1000 and M=2000, then new_array[0:2] = old_array[0], new_array[2:4] = old_array[1], etc.
That case is simple and easy to do.
But I'm trying to find a way to do it in the nonlinear case, where the ratio is not an integer (1920/2823 = 0.6801275239107333).
Thanks
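No answer for this one is included above, but a minimal sketch of the nearest-source-index idea the question describes (my own illustration, not from the thread) maps each position of the new array back to an existing index, so only source values appear in the output:

import numpy as np

old_array = np.arange(1920, dtype=float)   # stand-in for the real data
N, M = len(old_array), 2823

# map each new position onto a source index; floor keeps only existing values
idx = np.floor(np.linspace(0, N, M, endpoint=False)).astype(int)
new_array = old_array[idx]                 # shape (M,), values drawn only from old_array

A similar effect is available through scipy.interpolate.interp1d with kind='previous' or kind='nearest' if an interpolator object is preferred.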

Combine the arrays in the datasets of an HDF5 file and finally get a 2D array

I have an HDF5 file that contains 500 datasets (named A000, A001, A002, A003 ... A499), and each dataset contains an array of size (200, 5400). I want to combine the arrays in these datasets and finally get a single 2D array. I can achieve the result by doing certain manipulations in a for loop, but that process takes a long time. What I did looks like this:
# f is the open HDF5 file and my_list holds the dataset names (A000 ... A499)
power_list = []
for datasets in my_list:
    dataset = f[(datasets)][:]
    i_data = dataset['real']
    q_data = dataset['imag']
    # power = 10*log10(10*(I^2 + Q^2) + 1)
    power = np.log10(((np.add(np.square(np.abs(i_data)), np.square(np.abs(q_data)))) * 10) + 1) * 10
    power = np.rot90(power)
    power_list.append(power)
    print("Dataset: ", datasets)
power = np.concatenate(power_list, 1)
So, is there any way of doing this in a shorter time, for example without the for loop?
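A hedged sketch of one alternative (my own illustration, assuming h5py and the dataset names from the question): most of the time is probably spent reading the 500 datasets from disk, so removing the Python loop by itself will not change much; what can help is reading each compound dataset exactly once and keeping the arithmetic fully vectorized per dataset.

import numpy as np
import h5py

def power_of(dataset):
    a = dataset[:]                           # read the (200, 5400) compound dataset once
    p = 10 * np.log10(10 * (a['real']**2 + a['imag']**2) + 1)
    return np.rot90(p)

with h5py.File('data.h5', 'r') as f:         # hypothetical filename
    names = ['A{:03d}'.format(i) for i in range(500)]   # A000 ... A499
    power = np.concatenate([power_of(f[name]) for name in names], axis=1)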

How to add a new row to an array on each for-loop iteration in order to create a matrix (m, n) in Python?

I am working on creating a matrix of features from a database of signals.
I want to calculate some features so that I end up with a matrix: one row per signal and 4 columns, one per assessed feature.
I have searched, but I can't understand how to properly insert or append a new row with the features of each signal on every iteration of the for loop in which I assess the features.
This is the code I have so far:
The .mat file is attached to this link HERE
import numpy as np
import scipy.io as sio
from scipy import stats

mat = sio.loadmat('signal_1.mat')
size = mat['signal_1']
a, b = size.shape

calc = []
for i in range(a):
    signal = mat['signal_1'][i][0]

    def function(signal):
        x = signal
        mu = np.mean(x)
        mini = np.min(x)
        maxi = np.max(x)
        ran = maxi - mini
        values = np.column_stack((mu, mini, maxi, ran))
        return values

    calc.append(function(signal))
This creates a list whose elements are separate (1, 4) arrays rather than a single 2D array. That is inconvenient because I need an array with shape (n, 4), where n = a is the number of signals; that (n, 4) array is the desired result.
To sum up:
-How can I turn the calc list into a float64 array of shape (n, 4)?
-How can I replace the line calc.append(function(signal)) so that the row of assessed features is added to the array on each loop iteration?
-Or, what is the most efficient way to properly add each row?
PS: if I try the conversion calc = np.array(calc), it doesn't work and gives me a very odd float64 array with shape (9, 1, 4).
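An aside on that (9, 1, 4) shape, as an illustration that is not part of the original thread: np.column_stack returns a (1, 4) array per signal, so converting the list adds an extra axis; collapsing it is a one-liner.

import numpy as np

calc = [np.column_stack((1.0, 2.0, 3.0, 4.0)) for _ in range(9)]   # stand-in for the real list
features = np.vstack(calc)             # stacks the (1, 4) rows into shape (9, 4)
# equivalently: np.array(calc).squeeze(axis=1)
print(features.shape)                  # (9, 4)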
Just create an empty array features_mat and fill it with your features by iterating over all your signals:
import numpy as np
import scipy.io as sio

mat = sio.loadmat('signal_1.mat')

# number of signals in the .mat file
n = mat['signal_1'].shape[0]
# get the signals
signals = mat['signal_1'][:, 0]

def get_features(signal):
    mu = np.mean(signal)
    mini = np.min(signal)
    maxi = np.max(signal)
    ran = maxi - mini
    return np.array([mu, mini, maxi, ran])

# pre-allocate memory without initializing it
features_mat = np.empty((n, 4))
for i, signal in enumerate(signals):
    features_mat[i, :] = get_features(signal)
>>> features_mat
array([[ 4.07850385e+00, -2.10251071e-01,  7.06541344e+00,  7.27566451e+00],
       [ 8.31759999e-02, -2.61125020e-03,  1.50838105e-01,  1.53449355e-01],
       [-5.55470935e+00, -5.81185396e+00, -5.17208787e+00,  6.39766089e-01],
       [-1.36478103e+01, -1.46263278e+02,  1.46379425e+02,  2.92642704e+02],
       [ 3.22094459e+00,  1.00760787e+00,  5.55007608e+00,  4.54246820e+00],
       [ 4.36753757e+01,  3.57114093e+01,  4.93010863e+01,  1.35896770e+01],
       [ 1.71242787e+00, -2.25392323e-01,  3.59933423e+00,  3.82472655e+00],
       [-1.73530851e+00, -2.00324815e+00, -1.35313746e+00,  6.50110688e-01],
       [-5.83099184e+00, -6.98125270e+00, -4.75522063e+00,  2.22603207e+00]])
The output has the desired shape and seems to contain the features you're looking for. Tell me if this works.
Hope this helps.

Exporting large array variables (type = object) to CSV files

I have used Gekko from APM in Python to solve an optimization problem. The two main decision variables (DVs) are large arrays. The problem has converged successfully; however, I need the results of these tables in an Excel worksheet for further work.
An example variable name is 's'. Since the arrays created within Gekko are GKVariable/Object variable types I cannot simply use:
pd.DataFrame(s).to_csv(r'C:\Users\...\s.csv')
because the result gives every cell of the array the label of each variable defined in the model (i.e. v1, v2, etc.)
Printing s in the kernel shows the numbers from the optimization results, but in a format that doesn't guarantee each line is a new row of the matrix, because of the many columns.
Is there another solution to copy just the resulting values of the DV s so it becomes a normal np.array instead of an object-type variable? Open to any ideas.
You can use s[i].value[0] for steady-state problems (IMODE=1 or IMODE=3) or s[i].value[:] to access the array of values for all other IMODE options. Here is a simple example that writes the results to a file with pandas and numpy.
import numpy as np
from gekko import GEKKO
import pandas as pd
m = GEKKO(remote=False)
# Random 3x3
A = np.random.rand(3,3)
# Random 3x1
b = np.random.rand(3,1)
# Ax = b
y = m.axb(A,b)
m.solve()
yn = [y[i].value[0] for i in range(3)]
print(yn)
pd.DataFrame(yn).to_csv(r'y1.csv')
np.savetxt('y2.csv',yn,delimiter=',',comments='')
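The question itself concerns a large decision-variable array s rather than the 3-element y above. As a hedged sketch (the name s and its 2-D layout are my assumptions, not part of the answer), the same .value access can be mapped over the whole array to get a plain numpy array before exporting:

import numpy as np
import pandas as pd

# assuming s is a 2-D list/array of GEKKO variables solved with IMODE=1 or IMODE=3
s_values = np.array([[var.value[0] for var in row] for row in s])

# s_values is now an ordinary float array, so the usual export works
pd.DataFrame(s_values).to_csv(r's.csv', index=False)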

Parallel programming approach to solve pandas problems

I have a dataframe of the following format.
df

A  B  Target
5  4  3
1  3  4
I am finding the correlation of each column (except Target) with the Target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
But the issue is that the size of my actual dataframe is (216, 72391), which takes at least 30 minutes to process on my system. Is there any way to parallelize it, for example using a GPU? I need to compute values like these multiple times, so I can't wait for the normal 30-minute processing time each time.
Here, I have tried to implement your operation using numba:
import numpy as np
import pandas as pd
from numba import jit, int64, float64

#
# ------------ You can ignore the code starting from here ---------
#
# Create a random DF with cols_size = 72391 and row_size = 300
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
target_array = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
# ---------- Ignore code till here. This is just to generate dummy data -------

# Assume df is your original DataFrame
# (the dummy DataFrame above has no 'target' column; with it, keep the
#  target_array already created and skip the next two lines)
target_array = df['target'].values
# You can choose to restore this column later,
# but for now we will remove it, since we will
# call df.values and find the correlation of each
# column with target
df.drop(['target'], inplace=True, axis=1)

# This function takes a numpy 2D array and a target array as input.
# The numpy 2D array holds the data of all the columns,
# and we find the correlation of each column with the target array.
# The 2D array is passed transposed, i.e. its shape is (72391, 300),
# while the target array's shape is (300,).
def do_stuff(df_values, target_arr):
    # Just create a random array to store the result
    # df_values.shape[0] = 72391, equal to the no. of columns in df
    result = np.random.random(df_values.shape[0])
    # Iterate over each column
    for i in range(0, df_values.shape[0]):
        # Find the correlation of this column with the target column
        result[i] = np.corrcoef(df_values[i], target_arr.reshape(300,))[0][1]
    return result

# Decorate the function do_stuff
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)

# This contains all the correlations
# (note: np.transpose(df.values) has the required (72391, 300) shape)
result_array = do_stuff_numba(np.transpose(df.values), target_array)
Link to colab notebook.
You should take a look at dask. It should be able to do what you want and a lot more.
It parallelizes most of the DataFrame functions.
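As an extra sketch that is not part of either answer above: the correlation of every column with a single target reduces to a dot product on centered data, so the (216, 72391) case can be done in one vectorized NumPy pass, without df.corr() building the full 72391 x 72391 correlation matrix (which is likely where the 30 minutes go). The function name and layout here are my own.

import numpy as np
import pandas as pd

def corr_with_target(df, target_col='Target'):
    # Pearson correlation of every column with target_col, vectorized
    y = df[target_col].to_numpy(dtype=float)
    X = df.drop(columns=[target_col]).to_numpy(dtype=float)

    Xc = X - X.mean(axis=0)            # center each column
    yc = y - y.mean()
    num = Xc.T @ yc                    # per-column covariance numerators, shape (n_cols,)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return pd.Series(num / den, index=df.columns.drop(target_col))

# usage: corrs = corr_with_target(df)   # roughly df.corr().iloc[:-1, -1], without the full matrix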
