Converting a sparse dataframe to a dense dataframe in Python efficiently

I have a problem at hand: I have a dataframe that looks like the one below.
Input Dataframe:
VEHICLE_HASH LS_ID UPPER_BOUND LS_RATIO
00061E31E25B36 PROMISELS103 2500.0 0.000684
00061E31E25B36 PROMISELS103a 3000.0 0.002001
00061E31E25B36 PROMISELS104 3500.0 0.004128
0006254DB52066 PROMISELS104 4000.0 0.003216
0006254DB52066 PROMISELS103 4500.0 0.001114
0006254DB52066 PROMISELS105 5000.0 0.020767
This is a sample dataframe; the actual dataframe has size (53526122 x 4). Now I want to convert this dataframe to a one-hot-encoded matrix, with features drawn from the string formed by combining the LS_ID and UPPER_BOUND columns. I was able to do the one-hot encoding, convert the matrix to a sparse matrix, and then multiply the sparse matrix by the LS_RATIO to get the resulting input sparse matrix for my xgboost classifier.
Now I want to convert the dataframe into the dense format below, with one unique HASH per row and multiple feature columns, so I can do PCA with this data. But I get an out-of-memory error. Can this be done efficiently?
Expected Output:
HASH PROMISELS103a_3000.0 PROMISELS103_2500.0 PROMISELS103_4500.0 PROMISELS104_3500.0 PROMISELS104_4000.0 PROMISELS105_5000.0
00061E31E25B36 0.002001 0.000684 0 0 0.004128 0
0006254DB52066 0 0 0.001114 0.003216 0 0.020767

You can try to concatenate the LS_ID and UPPER_BOUND columns with the separator '_', construct a cross-tabulation (assuming every combination of the constructed column and the 'VEHICLE_HASH' column is unique), and fill NaN values with zeros:
import pandas as pd
import numpy as np
df = pd.DataFrame()  # here should be your initial dataframe
df['ID_AND_BOUND'] = df['LS_ID'] + '_' + df['UPPER_BOUND'].astype(str)
df_processed = pd.crosstab(index=df['VEHICLE_HASH'],
                           columns=df['ID_AND_BOUND'],
                           values=df['LS_RATIO'],
                           aggfunc=np.mean)
df_processed = df_processed.reset_index().fillna(0)
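If the dense crosstab itself triggers the out-of-memory error, a possible alternative is to skip the pandas pivot and build a scipy sparse matrix directly from categorical codes. This is a minimal sketch using the column names from the question; TruncatedSVD is suggested because, unlike sklearn's PCA, it accepts sparse input:
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Encode hashes (rows) and LS_ID_UPPER_BOUND strings (columns) as integer codes
rows = pd.Categorical(df['VEHICLE_HASH'])
cols = pd.Categorical(df['LS_ID'] + '_' + df['UPPER_BOUND'].astype(str))

# One entry per row of the long dataframe, valued by LS_RATIO
mat = sparse.coo_matrix(
    (df['LS_RATIO'].to_numpy(), (rows.codes, cols.codes)),
    shape=(len(rows.categories), len(cols.categories)),
).tocsr()

# PCA-like dimensionality reduction that works on sparse matrices;
# n_components=50 is an arbitrary choice
components = TruncatedSVD(n_components=50).fit_transform(mat)
The row order of mat corresponds to rows.categories, so the hash for row i is rows.categories[i].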

Related

Add random noise and random NA in a pandas dataframe

I have a pandas dataframe and I want to add random NAs and random noise to the data:
exp_TSPAN6 exp_TNMD exp_DPM1 exp_SCYL3 exp_C1orf112
0 7.951917 3.524705 12.043700 7.605068 8.214067
1 8.079243 9.545859 5.6445321 8.509788 6.853905
2 11.335783 12.45859 12.254986 6.617365 8.196391
Example Output
exp_TSPAN6 exp_TNMD exp_DPM1 exp_SCYL3 exp_C1orf112
0 8.951917 4.524705 11.043700 7.605068 8.214067
1 8.079243 NA NA 8.509788 6.853905
2 11.335783 NA 12.254986 6.617365 9.196391
I have tried the following code to add NAs, but I could not add random noise:
import numpy as np

for col in data.columns:
    data.loc[data.sample(frac=0.1).index, col] = np.nan
Why don't you try what is suggested here: Adding gaussian noise to a dataset of floating points and save it (python)
Load the data into a pandas dataframe: clean_signal = pd.read_csv("data_file_name")
Use numpy to generate Gaussian noise with the same dimensions as the dataset.
Add the Gaussian noise to the clean signal: signal = clean_signal + noise
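Put together with the NA step from the question, that recipe might look like the sketch below; the noise scale of 1.0 and the seed are arbitrary choices, and data is assumed to be the dataframe from the question:
import numpy as np

rng = np.random.default_rng(42)

# Gaussian noise with the same shape as the dataframe
noise = rng.normal(loc=0.0, scale=1.0, size=data.shape)
data = data + noise

# Then set a random 10% of each column to NaN
for col in data.columns:
    data.loc[data.sample(frac=0.1).index, col] = np.nan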

How to scale all columns except last column?

I'm using python 3.7.6.
I'm working on classification problem.
I want to scale my data frame (df) features columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but this seems inefficient, because I need to recreate the dataframe (add the target column back to the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result from scale will contain the scaled columns plus one column that is not scaled)
Using column indices for scaling or other pre-processing operations is not a very good idea, because the code breaks every time you create a new feature. Use column names instead, e.g.
using scikit-learn:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2d numpy.array; we cast it back to a pd.DataFrame
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features)
old_shape = df.shape
# drop the unnormalized features from the dataframe
df.drop(features, axis=1, inplace=True)
# join back the normalized features
df = pd.concat([df, standardized_features], axis=1)
assert old_shape == df.shape, "something went wrong!"
Or you can use a function like this if you prefer not to split and join the data back:
import numpy as np

def normalize(x):
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)

for col in features:
    df[col] = normalize(df[col])
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
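If the FeatureScaling module is not essential, the same in-place slicing also works with a scikit-learn scaler; a sketch, where MinMaxScaler stands in for the standardize=False behaviour:
from sklearn.preprocessing import MinMaxScaler

# Scale every column except the last, writing the result back in place
df.iloc[:, :-1] = MinMaxScaler().fit_transform(df.iloc[:, :-1])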

Parallel programming approach to solve pandas problems

I have a dataframe of the following format.
df
A B Target
5 4 3
1 3 4
I am finding the correlation of each column (except Target) with the Target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
But the issue is that my actual dataframe has size (216, 72391), which takes at least 30 minutes to process on my system. Is there any way to parallelize this, e.g. using a GPU? I need to compute values of this kind multiple times, so I can't wait for the normal 30-minute processing time each run.
Here I have tried to implement your operation using numba:
import numpy as np
import pandas as pd
from numba import jit

# ------------ You can ignore the code starting from here ---------
# Create a random df with cols_size = 72391 and row_size = 300
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
df['target'] = np.random.randint(100, size=300)
# ---------- Ignore code till here. This is just to generate dummy data -------

# Assume df is your original DataFrame
target_array = df['target'].values
# You can choose to restore this column later,
# but for now we remove it, since we will call
# df.values and find the correlation of each
# column with the target
df.drop(['target'], inplace=True, axis=1)

# This function takes a numpy 2D array and a target array as input.
# The 2D array holds the data of all the columns, and we find the
# correlation of each column with the target array. The 2D array is
# passed transposed, i.e. its shape is (72391, 300), while the
# target array's shape is (300,).
def do_stuff(df_values, target_arr):
    # Preallocate the result array;
    # df_values.shape[0] = 72391, equal to the no. of columns in df
    result = np.empty(df_values.shape[0])
    # Iterate over each column
    for i in range(0, df_values.shape[0]):
        # Find the correlation of this column with the target column
        result[i] = np.corrcoef(df_values[i], target_arr)[0, 1]
    return result

# Decorate the function do_stuff
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)
# This contains all the correlations
result_array = do_stuff_numba(df.values.T, target_array)
Link to colab notebook.
You should take a look at dask. It should be able to do what you want and a lot more.
It parallelizes most of the DataFrame functions.
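As a rough illustration (not a tuned solution), the per-column correlation could be expressed in dask like this; the toy frame, npartitions, and column names are all made up, and for tens of thousands of columns the task graph itself can get heavy:
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Toy stand-in for the real (216, 72391) frame
df = pd.DataFrame(np.random.rand(216, 5), columns=list('ABCDE'))
df['Target'] = np.random.rand(216)

# Partition the rows across workers
ddf = dd.from_pandas(df, npartitions=4)

# One lazy correlation task per feature column, computed in one pass
tasks = {col: ddf[col].corr(ddf['Target']) for col in ddf.columns if col != 'Target'}
(corrs,) = dask.compute(tasks)
print(pd.Series(corrs))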

PySpark: Convert RDD to column in dataframe

I have a Spark dataframe with which I am calculating the Euclidean distance between each row and a given set of coordinates. I am recreating a structurally similar dataframe df_vector here to explain better.
from pyspark.ml.feature import VectorAssembler
arr = [[1,2,3], [4,5,6]]
df_example = spark.createDataFrame(arr, ['A','B','C'])
assembler = VectorAssembler(inputCols=[x for x in df_example.columns], outputCol='features')
df_vector = assembler.transform(df_example).select('features')
>>> df_vector.show()
+-------------+
| features|
+-------------+
|[1.0,2.0,3.0]|
|[4.0,5.0,6.0]|
+-------------+
>>> df_vector.dtypes
[('features', 'vector')]
As you can see, the features column is a vector. In practice, I get this vector column as the output of a StandardScaler. Anyway, since I need to calculate the Euclidean distance, I do the following:
rdd = df_vector.select('features').rdd.map(lambda r: np.linalg.norm(r-b))
where
b = np.asarray([0.5,1.0,1.5])
I have all the calculations I need but I need this rdd as a column in df_vector. How do I go about it?
Instead of creating a new rdd, you could use a UDF:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

norm_udf = udf(lambda r: float(np.linalg.norm(r - b)), FloatType())
df_vector = df_vector.withColumn("norm", norm_udf(df_vector.features))
Make sure numpy is available on the worker nodes.
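If you would rather not depend on numpy on the executors at all, the built-in vector helpers can compute the same distance; a sketch, where b_vec mirrors the b from the question:
from math import sqrt
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

b_vec = Vectors.dense([0.5, 1.0, 1.5])

# Euclidean distance via the built-in squared_distance helper
dist_udf = udf(lambda v: float(sqrt(Vectors.squared_distance(v, b_vec))), FloatType())
df_vector = df_vector.withColumn('norm', dist_udf('features'))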
One way to tackle performance issues might be to use mapPartitions. The idea is to convert the features to an array at the partition level, calculate the norm on the whole array (thus implicitly using numpy vectorisation), and then do some housekeeping to get the form you want. For large datasets this might improve performance:
Here is the function which calculates the norm at partition level:
import numpy as np
from pyspark.sql import Row

def getnorm(vectors):
    # convert the vectors in this partition into one numpy array
    vec_array = np.vstack([v['features'] for v in vectors])
    # calculate all the norms in a single vectorised call
    norm = np.linalg.norm(vec_array - b, axis=1)
    # tidy up to get the norm as a column
    output = [Row(features=x, norm=y) for x, y in zip(vec_array.tolist(), norm.tolist())]
    return output
Applying this using mapPartitions gives an RDD of Rows which can then be converted to a DataFrame:
df_vector.rdd.mapPartitions(getnorm).toDF()

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is a pandas dataframe. This is sklearn.cluster.KMeans:
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
for i in range(len(prediction)):
    cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
...
where A, B, C are indices.
Is this the correct way of using k-means?
Assuming all the values in the dataframe are numeric:
import pandas
import sklearn.cluster

# Convert the DataFrame to a matrix
mat = dataset.values
# Cluster using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get the cluster assignment labels
labels = km.labels_
# Format the results as a DataFrame
results = pandas.DataFrame([dataset.index, labels]).T
Alternatively, you could try KMeans++ for Pandas.
To know if your dataframe dataset has suitable content you can explicitly convert to a numpy array:
dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)
If the array has a homogeneous numerical dtype (typically numpy.float64) then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, with sklearn.preprocessing.StandardScaler for instance.
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object, which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
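For example, a hypothetical frame with one categorical column can be made fully numeric with pd.get_dummies before clustering (the column names here are made up):
import pandas as pd

df = pd.DataFrame({'weight': [1.2, 3.4, 2.2],
                   'color': ['red', 'green', 'red']})  # 'color' is object-typed

# One-hot encode the categorical column so every value is numeric
numeric = pd.get_dummies(df, columns=['color'])
print(numeric.values.dtype)  # a single numeric dtype, suitable for KMeans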
