Parallel programming approach to solve pandas problems


I have a dataframe of the following format.
df
A B Target
5 4 3
1 3 4
I am finding the correlation of each column (except Target) with the Target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
The issue is that my actual dataframe has shape (216, 72391), which takes at least 30 minutes to process on my system. Is there any way to parallelize this using a GPU? I need to compute values of this kind many times, so I can't wait 30 minutes each time.

Here, I have tried to implement your operation using numba:
import numpy as np
import pandas as pd
from numba import jit, prange

# ---------- You can ignore the code starting from here ----------
# Create a random DataFrame with 72391 feature columns and 300 rows
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
# Add the target as a regular column so the code below matches a real df
df['target'] = np.random.randint(100, size=300)
# ---------- Ignore code till here. This is just to generate dummy data ----------

# Assume df is your original DataFrame
target_array = df['target'].values.astype(np.float64)
# You can choose to restore this column later,
# but for now we remove it, since we will call df.values
# and find the correlation of each column with the target
df.drop(['target'], inplace=True, axis=1)

# This function takes a 2D numpy array and a target array as input.
# The 2D array is the transposed data, so each of its rows is one column of df:
# its shape is (72391, 300), while the target array's shape is (300,)
def do_stuff(df_values, target_arr):
    # Allocate an array to store the result;
    # df_values.shape[0] = 72391, equal to the number of columns in df
    result = np.empty(df_values.shape[0])
    # Iterate over each column; prange lets numba split the loop across CPU threads
    for i in prange(df_values.shape[0]):
        # Find the correlation of one column with the target column
        result[i] = np.corrcoef(df_values[i], target_arr)[0][1]
    return result

# Compile the function do_stuff with numba
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)
# Pass the transposed values as a contiguous float array;
# result_array then contains all the correlations
result_array = do_stuff_numba(np.ascontiguousarray(df.T.values, dtype=np.float64), target_array)

You should take a look at Dask. It should be able to do what you want and a lot more.
It provides parallel versions of many pandas DataFrame operations.
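Independently of numba or Dask, it may be worth checking whether the time goes into df.corr() itself: it computes the correlation of every pair of columns, while only the pairs involving Target are needed. The following is a minimal sketch of computing just those correlations with plain numpy; the helper name corr_with_target and the random demo frame are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

def corr_with_target(df, target_col):
    """Pearson correlation of every other column with target_col."""
    y = df[target_col].to_numpy(dtype=np.float64)
    X = df.drop(columns=[target_col]).to_numpy(dtype=np.float64)
    # Centre the data so the dot products below are covariances (up to 1/n)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Numerator: sum of products; denominator: product of the two norms
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return pd.Series(num / den, index=df.columns.drop(target_col))

# Small random example (hypothetical data, far smaller than the real frame)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(216, 50)))
demo["Target"] = rng.normal(size=216)
fast = corr_with_target(demo, "Target")
```

pandas also offers df.drop(columns="Target").corrwith(df["Target"]) for the same purpose. Note that a constant column has zero variance, so the sketch above returns NaN for it.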

Related

Is it possible to select a dataset by time range when the range is different for every pixel in Python's xarray module

I am trying to select only the part of the data that lies within a specific time range, where the range is different for every pixel.
For indexing, I have two np.datetime64[ns] xr.DataArrays with shape (lat: 152, lon: 131), named time_range_min and time_range_max.
One holds the start dates and the other holds the end dates.
I tried this to select the data:
dataset = data.sel(time=slice(time_range_min, time_range_max))
but it yields
cannot use non-scalar arrays in a slice for xarray indexing:
<xarray.DataArray 'NDVI' (lat: 152, lon: 131)>
Does the fact that I cannot use non-scalar arrays mean that this is not possible in general, or can I transform my arrays?

If "time" is a list of date strings ordered from past to present (e.g. ["10-20-2021", "10-21-2021", ...]):
import numpy as np
listOfMinMaxTimeRanges = [time_range_min, time_range_max]
specifiedRangeOfTimeIndexedList = []
for indexingListOfMinMaxTimeRanges in range(np.shape(listOfMinMaxTimeRanges)[1]):
    # Positions in "time" of this entry's start and end dates
    startIndex = time.index(listOfMinMaxTimeRanges[0][indexingListOfMinMaxTimeRanges])
    endIndex = time.index(listOfMinMaxTimeRanges[1][indexingListOfMinMaxTimeRanges])
    # Collect every index between the two positions (inclusive)
    specifiedRangeOfTimeIndexed = [specifiedRangeOfTime for specifiedRangeOfTime in range(len(time)) if startIndex <= specifiedRangeOfTime <= endIndex]
    for indexes in range(len(specifiedRangeOfTimeIndexed)):
        specifiedRangeOfTimeIndexedList.append(specifiedRangeOfTimeIndexed[indexes])
Depending on how your dataset is structured:
dataset = data.sel(time = specifiedRangeOfTimeIndexedList)
or
dataset = data.sel(time = time[specifiedRangeOfTimeIndexedList])
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[:, time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList], :, :]
or
dataset = dataset[specifiedRangeOfTimeIndexedList]
...

I found a way to handle every grid cell separately by stacking in xarray.
Inside the loop, time_range_min and time_range_max each denote a single date:
stack = dataset.value.stack(gridcell=['lat', 'lon'])
for unique_value, grouped_array in stack.groupby('gridcell'):
    grouped_array.sel(time=slice(time_range_min, time_range_max))
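An alternative to looping over grid cells is to build a boolean mask by broadcasting the time axis against the per-pixel start and end dates. The sketch below shows the idea with plain numpy on a tiny made-up grid; all array names and dates are hypothetical stand-ins for the question's data, and the result keeps the full time axis with NaN outside each pixel's window rather than returning a shorter array.

```python
import numpy as np

# Hypothetical small grid: 5 daily time steps, 2 x 3 pixels
times = np.arange("2021-10-20", "2021-10-25", dtype="datetime64[D]").astype("datetime64[ns]")
values = np.arange(5 * 2 * 3, dtype=float).reshape(5, 2, 3)

# Per-pixel start and end dates, shape (lat, lon)
t_min = np.full((2, 3), np.datetime64("2021-10-21", "ns"))
t_max = np.full((2, 3), np.datetime64("2021-10-23", "ns"))
t_max[0, 0] = np.datetime64("2021-10-21", "ns")  # one pixel with a shorter window

# Broadcast (time, 1, 1) against (lat, lon) to get a (time, lat, lon) mask
mask = (times[:, None, None] >= t_min) & (times[:, None, None] <= t_max)

# Keep values inside each pixel's window, NaN elsewhere
selected = np.where(mask, values, np.nan)
```

With xarray the same pattern should carry over, because DataArrays broadcast by dimension name: something like data.where((data.time >= time_range_min) & (data.time <= time_range_max)), assuming the time coordinate is called time.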

How to run a function for a multitude of arrays

I need to analyse the peak number and width of a signal (in my case a calcium signal from epidermis cells) that I have stored in an Excel sheet. Each column has all the values for one cell (600 values).
To analyse the peaks, which I will be doing with the scipy.signal.find_peaks() and scipy.signal.peak_widths() functions, I put each individual column into a 1D numpy array containing all the 601 values from that column.
I did this by saving the individual columns (named A, B, C, D, etc. in the Excel sheet) into their own dataframes (df_A, df_B) and then putting them into arrays:
import numpy as np
import pandas as pd
df = pd.read_excel('test.xlsx')
df_A = df.loc[:,'A']
df_B = df.loc[:,'B']
arrA = np.array(df_A)
arrB = np.array(df_B)
To calculate the peak number and width I used the following lines:
from scipy.signal import find_peaks, peak_widths
peaks_A, _ = find_peaks(arrA, height=7000, prominence=1)
results_peakwidth_A = peak_widths(arrA, peaks_A, rel_height=0.5)
Since I have not just one but more than 100 cells/signals to analyse, is there a simple way to do this for all the cells/arrays? This exceeds my capabilities, so I would gladly welcome any help.

The proposal is as follows. First, select the required columns (however many there are). Then create a function that takes in a column (there is no need to turn it into an array; if scipy disagrees, add column = column.values at the top of the process function).
Afterwards use apply, which loops through each column in the dataframe and passes it to the function you defined.
import pandas as pd
from scipy.signal import find_peaks, peak_widths
df = pd.read_excel('test.xlsx')
df = ... # select all columns from A-Z into a single dataframe with the columns required.
# The shape here would be
# A B C
# 1 4 4.1
# 2 3 4.0
# ...
# define the function you want to apply to each column
def process(column):
    peaks, _ = find_peaks(column, height=7000, prominence=1)
    return peak_widths(column, peaks, rel_height=0.5)
new_columns = df.apply(process)
As I'm unsure what the actual output should look like, you might want to keep both the peaks and the widths. In that case you could alter the process function slightly:
def process(column):
    peaks, _ = find_peaks(column, height=7000, prominence=1)
    width = peak_widths(column, peaks, rel_height=0.5)
    return pd.Series({"width": width, "peaks": peaks})
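If a compact per-cell summary is enough, a dictionary comprehension over the columns is another option. The sketch below is one possible shape for this, not the answer's method: the helper summarize_peaks, the summary columns n_peaks and mean_width, and the synthetic two-column frame standing in for the Excel sheet are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks, peak_widths

def summarize_peaks(signal, height=7000, prominence=1):
    """Return the peak count and mean half-height width for one 1D signal."""
    peaks, _ = find_peaks(signal, height=height, prominence=prominence)
    # peak_widths returns (widths, width_heights, left_ips, right_ips)
    widths = peak_widths(signal, peaks, rel_height=0.5)[0]
    mean_width = widths.mean() if len(peaks) else np.nan
    return {"n_peaks": len(peaks), "mean_width": mean_width}

# Hypothetical stand-in for the Excel sheet: two synthetic cells
t = np.arange(600)
df = pd.DataFrame({
    "A": 8000 * np.exp(-((t - 150) ** 2) / 50.0) + 8000 * np.exp(-((t - 400) ** 2) / 50.0),
    "B": 8000 * np.exp(-((t - 300) ** 2) / 200.0),
})

# One row per cell, one column per summary value
summary = pd.DataFrame({name: summarize_peaks(df[name].to_numpy()) for name in df.columns}).T
```

Here summary has one row per cell, so it can be written back to Excel or filtered like any other dataframe.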

How to add new rows in each for loop to an array in order to create a matrix (m,n) in python?

I am working on creating a matrix of features from a database of signals.
I want to calculate some features in order to end up with a matrix. Each row corresponding to each signal, and 4 columns corresponding to each assessed feature.
I have searched and I can't understand how to properly insert or add a new row with the features for each signal, for every for loop while I assess the features.
This is the code I'm following on:
The .mat file is attached to this link HERE
import numpy as np
import scipy.io as sio
from scipy import stats
mat=sio.loadmat('signal_1.mat')
size=mat['signal_1']
a,b=size.shape
calc=[]
for i in range(a):
    signal=mat['signal_1'][i][0]
    def function(signal):
        x = signal
        mu=np.mean(x)
        mini=np.min(x)
        maxi=np.max(x)
        ran=maxi-mini
        values = np.column_stack((mu,mini,maxi,ran))
        return values
    calc.append(function(signal))
This creates a list as follows:
That is inconvenient because I need an array with shape (n,4), where n = a (the number of signals).
This is the desired result:
To sum up:
- How can I create the calc list as a float64 array with shape (n,4)?
- How can I replace the line calc.append(function(signal)) so that each loop iteration adds a row with the assessed features to the array?
- Or what is the most efficient way to add each row?
P.S.: if I try the conversion calc=np.array(calc), it doesn't work and gives me a very weird float64 array with shape (9,1,4).

Just create an empty array features_mat and fill it with your features by iterating over all your signals:
import numpy as np
import scipy.io as sio
mat = sio.loadmat('signal_1.mat')
# number of signals in .mat file
n = mat['signal_1'].shape[0]
# get the signals
signals = mat['signal_1'][:,0]
def get_features(signal):
    mu = np.mean(signal)
    mini = np.min(signal)
    maxi = np.max(signal)
    ran = maxi-mini
    return np.array([mu,mini,maxi,ran])
# pre-allocate memory without initializing it
features_mat = np.empty((n,4))
for i, signal in enumerate(signals):
    features_mat[i,:] = get_features(signal)
>>> np.array([[ 4.07850385e+00, -2.10251071e-01, 7.06541344e+00, 7.27566451e+00],
[ 8.31759999e-02, -2.61125020e-03, 1.50838105e-01, 1.53449355e-01],
[-5.55470935e+00, -5.81185396e+00, -5.17208787e+00, 6.39766089e-01],
[-1.36478103e+01, -1.46263278e+02, 1.46379425e+02, 2.92642704e+02],
[ 3.22094459e+00, 1.00760787e+00, 5.55007608e+00, 4.54246820e+00],
[ 4.36753757e+01, 3.57114093e+01, 4.93010863e+01, 1.35896770e+01],
[ 1.71242787e+00, -2.25392323e-01, 3.59933423e+00, 3.82472655e+00],
[-1.73530851e+00, -2.00324815e+00, -1.35313746e+00, 6.50110688e-01],
[-5.83099184e+00, -6.98125270e+00, -4.75522063e+00, 2.22603207e+00]])
Output has desired shape and seems to contain the features you're looking for. Tell me if this works.
Hope this helps.
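As for why the question's conversion gives shape (9,1,4): np.column_stack on four scalars returns a 2D array of shape (1,4), so a list of nine such results becomes (9,1,4) under np.array. The sketch below reproduces this on made-up signals (the random data and the helper name features are assumptions) and shows that np.vstack yields the (n,4) layout without changing the original function.

```python
import numpy as np

def features(signal):
    # Same four features as above, stacked the way the question does it
    mu, mini, maxi = np.mean(signal), np.min(signal), np.max(signal)
    return np.column_stack((mu, mini, maxi, maxi - mini))  # shape (1, 4)

# Hypothetical signals of different lengths, standing in for the .mat data
rng = np.random.default_rng(0)
signals = [rng.normal(size=n) for n in (50, 80, 120)]

calc = [features(s) for s in signals]
as_3d = np.array(calc)          # shape (3, 1, 4): the "weird" array
features_mat = np.vstack(calc)  # shape (3, 4): one row per signal
```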

Applying a simple function to CSV and save multiple csv files

I am trying to replicate the data by multiplying every value by a random factor drawn from a range of values, and to save the results as CSV.
I have created a function "Replicate_Data" which takes the input numpy array and multiplies it by a random value within a range. What is the best way to create 100 files and save them as P3D1, P4D1 and so on?
def Replicate_Data(data: np.ndarray) -> np.ndarray:
    Rep_factor= random.uniform(-3,7)
    data1 = data * Rep_factor
    return data1
P2D1 = Replicate_Data(P1D1)
np.savetxt("P2D1.csv", P2D1, delimiter="," , dtype = complex)

Here is an example you can use as a reference.
I generate toy data named toy, then draw n random values with np.random.uniform and call them randos, then multiply the two using numpy broadcasting to form out. You could also do this multiplication in a loop (in fact, the same loop you save in): as written, it can be very memory intensive depending on the size of your input array. A more complete answer probably depends on the shape of your input data.
import numpy as np
toy = np.random.random(size=(2,2)) # a toy input array
n = 100 # number of random values
randos = np.random.uniform(-3,7,size=n) # generate 100 uniform randoms
# now multiply all elements in toy by the randoms in randos
out = toy[None,...]*randos[...,None,None] # this depends on the shape.
# this will work only if toy has two dimensions. Otherwise requires modification
# it will take a lot of memory... 100*toy.nbytes worth
# now save in the loop..
for i,o in enumerate(out):
    name = 'P{}D1.csv'.format(i+1)
    np.savetxt(name,o,delimiter=",")
# a second way without the broadcasting (slow, better on memory)
# more like 2*toy.nbytes
#for i,r in enumerate(randos):
#    name = 'P{}D1.csv'.format(i+1)
#    np.savetxt(name,r*toy,delimiter=",")
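The loop-based variant can also be wrapped in a small function that handles the file naming. This is only a sketch under assumptions: the function name replicate_to_csv, its parameters, and the choice to start numbering at 2 (so that P1D1 remains the original) are not from the question, and the demo writes to a temporary directory.

```python
import tempfile
from pathlib import Path

import numpy as np

def replicate_to_csv(data, out_dir, n_files=100, first_index=2, seed=None):
    """Write n_files scaled copies of data as P<k>D1.csv, k starting at first_index."""
    rng = np.random.default_rng(seed)
    out_dir = Path(out_dir)
    paths = []
    for k in range(first_index, first_index + n_files):
        factor = rng.uniform(-3, 7)  # same range as Replicate_Data
        path = out_dir / f"P{k}D1.csv"
        # Only one scaled copy is held in memory at a time
        np.savetxt(path, data * factor, delimiter=",")
        paths.append(path)
    return paths

# Usage with a hypothetical 2 x 2 input, written to a temporary directory
P1D1 = np.array([[1.0, 2.0], [3.0, 4.0]])
tmp_dir = tempfile.mkdtemp()
written = replicate_to_csv(P1D1, tmp_dir, n_files=5, seed=0)
```

Passing a seed makes the random factors reproducible, which can help when the generated files need to be recreated later.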

PySpark: Convert RDD to column in dataframe

I have a Spark dataframe with which I am calculating the Euclidean distance between each row and a given set of coordinates. I am recreating a structurally similar dataframe 'df_vector' here to explain better.
from pyspark.ml.feature import VectorAssembler
arr = [[1,2,3], [4,5,6]]
df_example = spark.createDataFrame(arr, ['A','B','C'])
assembler = VectorAssembler(inputCols=[x for x in df_example.columns],outputCol='features')
df_vector = assembler.transform(df_example).select('features')
>>> df_vector.show()
+-------------+
| features|
+-------------+
|[1.0,2.0,3.0]|
|[4.0,5.0,6.0]|
+-------------+
>>> df_vector.dtypes
[('features', 'vector')]
As you can see the features column is a vector. In practice, I get this vector column as the output of a StandardScaler. Anyway, since I need to calculate Euclidean distance, I do the following
rdd = df_vector.select('features').rdd.map(lambda r: np.linalg.norm(r-b))
where
b = np.asarray([0.5,1.0,1.5])
I have all the calculations I need but I need this rdd as a column in df_vector. How do I go about it?

Instead of creating a new RDD, you could use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
norm_udf = udf(lambda r: float(np.linalg.norm(r.toArray() - b)), FloatType())
df_vector = df_vector.withColumn("norm", norm_udf(df_vector.features))
Make sure numpy is installed on the worker nodes.

One way to tackle performance issues might be to use mapPartitions. The idea is to convert the features of a whole partition to an array and then calculate the norm on that array (thus implicitly using numpy vectorisation), then do some housekeeping to get the form you want. For large datasets this might improve performance.
Here is the function which calculates the norm at partition level:
from pyspark.sql import Row
def getnorm(vectors):
    rows = list(vectors)
    # an empty partition has nothing to stack
    if not rows:
        return []
    # convert vectors into numpy array
    vec_array = np.vstack([v['features'].toArray() for v in rows])
    # calculate the norm
    norm = np.linalg.norm(vec_array - b, axis=1)
    # tidy up to get norm as a column (features becomes a plain list of floats)
    output = [Row(features=x, norm=y) for x, y in zip(vec_array.tolist(), norm.tolist())]
    return output
Applying this using mapPartitions gives an RDD of Rows which can then be converted to a DataFrame:
df_vector.rdd.mapPartitions(getnorm).toDF()
