I have a pandas dataframe and I want to add random NA values and random noise to the data:
exp_TSPAN6 exp_TNMD exp_DPM1 exp_SCYL3 exp_C1orf112
0 7.951917 3.524705 12.043700 7.605068 8.214067
1 8.079243 9.545859 5.6445321 8.509788 6.853905
2 11.335783 12.45859 12.254986 6.617365 8.196391
Example Output
exp_TSPAN6 exp_TNMD exp_DPM1 exp_SCYL3 exp_C1orf112
0 8.951917 4.524705 11.043700 7.605068 8.214067
1 8.079243 NA NA 8.509788 6.853905
2 11.335783 NA 12.254986 6.617365 9.196391
I have tried the following code to add NA values, but I could not figure out how to add random noise:
for col in data.columns:
    data.loc[data.sample(frac=0.1).index, col] = np.nan  # pd.np is deprecated; use numpy's nan directly
Why don't you try what is suggested here: Adding gaussian noise to a dataset of floating points and save it (python)
Load the data into a pandas dataframe: clean_signal = pd.read_csv("data_file_name")
Use numpy to generate Gaussian noise with the same dimensions as the dataset.
Add the Gaussian noise to the clean signal with signal = clean_signal + noise
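A minimal sketch combining both pieces, assuming a 10% NA fraction and unit-variance noise (both numbers and the small stand-in dataframe are arbitrary choices for illustration):
import numpy as np
import pandas as pd

# Stand-in for the loaded dataframe (replace with pd.read_csv("data_file_name"))
clean_signal = pd.DataFrame({'exp_TSPAN6': [7.951917, 8.079243, 11.335783],
                             'exp_TNMD': [3.524705, 9.545859, 12.45859],
                             'exp_DPM1': [12.043700, 5.6445321, 12.254986]})

# Gaussian noise with the same dimensions as the dataset (sigma=1 is an arbitrary choice)
noise = np.random.normal(loc=0.0, scale=1.0, size=clean_signal.shape)
signal = clean_signal + noise

# Then blank out a random 10% of the entries in each column
for col in signal.columns:
    signal.loc[signal.sample(frac=0.1).index, col] = np.nan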
I am creating a heatmap for the correlations between items.
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar='True', annot = True)
I choose vlag as it has red for high values, blue for low values, and white for the middle.
Seaborn automatically sets red for the highest value and blue for the lowest value in the dataframe.
However, as I am tracking Pearson's correlation, the value range is between -1 and 1 - as so I would like to set 1 to be represented by red, -1 with blue, leaving 0 to be represented by white.
What the result looks like:
What it should look like*:
*(Of course this was generated by "cheating" - setting -1 as value(s) to force the range to be from -1 to 1; I want to set this range without warping my data)
Use vmin=-1 and vmax=1:
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
data = np.random.uniform(low=-0.5, high=0.5, size=(5,5))
hm = sn.heatmap(data = data, cmap= 'vlag', annot = True, vmin=-1, vmax=1)
plt.show()
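Applied to the correlation heatmap from the question, that is simply the original call with the two extra arguments:
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar=True, annot=True, vmin=-1, vmax=1)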
Here is an unorthodox solution. You can "standardize" your data to the range [-1, 1]. Even though the theoretical range of the Pearson coefficient is [-1, 1], strong negative correlations are not present in your dataset.
So you can create another dataframe that contains the data rescaled so that its max is 1 and its min is -1, and then plot this dataframe to get the desired effect. The advantage of this procedure is that it generalizes to pretty much any dataframe (not verified though).
Here is the code:
import pandas as pd
import numpy as np
# Setting the target scale of the data
scale_minimum = -1
scale_maximum = 1
scale_range = scale_maximum - scale_minimum
# Applying the scaling
df_minimum, df_maximum = df.min(), df.max()   # per-column min and max of the current dataframe
df_range = df_maximum - df_minimum            # the range of the data
df = (df - df_minimum) / df_range             # scaling between 0 and 1
df_scaled = df*scale_range + scale_minimum    # scaling between -1 and 1
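You can then plot the rescaled frame with the same heatmap call as before, e.g. (a small usage sketch, with seaborn imported as sns):
import seaborn as sns
sns.heatmap(df_scaled, fmt=".2g", cmap='vlag', annot=True)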
Hope this solves your problem.
I have a pandas dataframe that is in the following format:
This contains the % change in stock prices each day for 3 companies MSFT, F and BAC.
I would like to use a OneClassSVM calculator to detect whether the data is an outlier or not. I have tried the following code, which I believe detects the rows which contain outliers.
#Import libraries
import numpy as np
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
#Create SVM classifier
svm = OneClassSVM(kernel='rbf',
                  gamma=0.001, nu=0.03)
#Use svm to fit and predict
svm.fit(delta)
pred = svm.predict(delta)
#If a row is an outlier the prediction
#would be -1
outliers = np.where(pred == -1)
#Print rows with outliers
print(outliers)
This gives the following output:
I would like to then add a new column to my dataframe that includes whether the data is an outlier or not. I have tried the following code but I get an error due to the lists being different lengths as shown below.
condition = (delta.index.isin(outliers))
assigned_value = "outlier"
df['isoutlier'] = np.select(condition,
assigned_value)
Would you be able to let me know how I could add this column, given that the list of rows containing outliers is much shorter, please?
It's not very clear what delta and df are in your code; I am assuming they are the same dataframe.
You can use the result from svm.predict directly. Here we leave the value blank ('') if the row is not an outlier:
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM

df = pd.DataFrame(np.random.uniform(0, 1, (100, 3)), columns=['A', 'B', 'C'])
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)
svm.fit(df)
pred = svm.predict(df)
df['isoutlier'] = np.where(pred == -1, 'outlier', '')
A B C isoutlier
0 0.869475 0.752420 0.388898
1 0.177420 0.694438 0.129073
2 0.011222 0.245425 0.417329
3 0.791647 0.265672 0.401144
4 0.538580 0.252193 0.142094
.. ... ... ... ...
95 0.742192 0.079426 0.676820 outlier
96 0.619767 0.702513 0.734390
97 0.872848 0.251184 0.887500 outlier
98 0.950669 0.444553 0.088101
99 0.209207 0.882629 0.184912
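If you only need to inspect the flagged rows afterwards, a boolean mask over the same pred array works as a small optional follow-up:
print(df[df['isoutlier'] == 'outlier'])   # or equivalently: df[pred == -1]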
I am using hierarchical clustering from seaborn.clustermap to cluster my data. This works fine to nicely visualize the clusters in a heatmap. However, now I would like to extract all row values that are assigned to the different clusters.
This is what my data looks like:
import pandas as pd
# load DataFrame
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
df
log_HU1 log_HU2
EEF1A1 13.439499 13.746856
HSPA8 13.169191 12.983910
FTH1 13.861164 13.511200
PABPC1 12.142340 11.885885
TFRC 11.261368 10.433607
RPL26 13.837205 13.934710
NPM1 12.381585 11.956855
RPS4X 13.359880 12.588574
EEF2 11.076926 11.379336
RPS11 13.212654 13.915813
RPS2 12.910164 13.009184
RPL11 13.498649 13.453234
CA1 9.060244 13.152061
RPS3 11.243343 11.431791
YBX1 12.135316 12.100374
ACTB 11.592359 12.108637
RPL4 12.168588 12.184330
HSP90AA1 10.776370 10.550427
HSP90AB1 11.200892 11.457365
NCL 11.366145 11.060236
Then I perform the clustering using seaborn as follows:
fig = sns.clustermap(df)
Which produces the following clustermap:
For this example I may be able to manually interpret the values belonging to each cluster (e.g. that TFRC and HSP90AA1 cluster). However I am planning to do these clustering analysis on much bigger data sets.
So my question is: does anyone know how to get the row values belonging to each cluster?
Thanks,
Using the scipy.cluster.hierarchy module with fcluster allows cluster retrieval:
import pandas as pd
import seaborn as sns
import scipy.cluster.hierarchy as sch
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
# retrieve clusters using fcluster
d = sch.distance.pdist(df)
L = sch.linkage(d, method='complete')
# 0.2 can be modified to retrieve more stringent or relaxed clusters
clusters = sch.fcluster(L, 0.2 * d.max(), 'distance')
# cluster indices correspond to indices of the original df
for i, cluster in enumerate(clusters):
    print(df.index[i], cluster)
Out:
EEF1A1 2
HSPA8 1
FTH1 2
PABPC1 3
TFRC 5
RPL26 2
NPM1 3
RPS4X 1
EEF2 4
RPS11 2
RPS2 1
RPL11 2
CA1 6
RPS3 4
YBX1 3
ACTB 3
RPL4 3
HSP90AA1 5
HSP90AB1 4
NCL 4
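If you prefer the row labels collected per cluster rather than printed one by one, a short follow-up sketch using the clusters array from above:
# group the original row labels by their assigned cluster id
cluster_members = pd.DataFrame({'gene': df.index, 'cluster': clusters})
for cluster_id, members in cluster_members.groupby('cluster')['gene']:
    print(cluster_id, list(members))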
I have a problem at hand. I have a dataframe which looks like the one below:
Input Dataframe:
VEHICLE_HASH LS_ID UPPER_BOUND LS_RATIO
00061E31E25B36 PROMISELS103 2500.0 0.000684
00061E31E25B36 PROMISELS103a 3000.0 0.002001
00061E31E25B36 PROMISELS104 3500.0 0.004128
0006254DB52066 PROMISELS104 4000.0 0.003216
0006254DB52066 PROMISELS103 4500.0 0.001114
0006254DB52066 PROMISELS105 5000.0 0.020767
This is a sample dataframe; the actual dataframe is of size (53526122 x 4). Now I wanted to convert this dataframe to a one-hot-encoded matrix with features drawn from the string formed by combining the LS_ID and UPPER_BOUND columns. I was able to do the one-hot encoding, convert the matrix to a sparse matrix, and then multiply the sparse matrix by the LS_RATIO to get the resulting input sparse matrix for my xgboost classifier.
Now I want to convert the dataframe into this dense format with a unique HASH per row and multiple feature columns so I can do PCA on this data, but I get an out-of-memory error. Can this be done efficiently?
Expected Output:
HASH PROMISELS103a_3000.0 PROMISELS103_2500.0 PROMISELS103_4500.0 PROMISELS104_3500.0 PROMISELS104_4000.0 PROMISELS105_5000.0
00061E31E25B36 0.002001 0.000684 0 0 0.004128 0
0006254DB52066 0 0 0.001114 0.003216 0 0.020767
You can try to concatenate the LS_ID and UPPER_BOUND columns with the separator '_', construct a cross-tabulation (assuming each combination of the constructed column and 'VEHICLE_HASH' is unique), and fill NaN values with zeros:
import pandas as pd
import numpy as np
df = pd.DataFrame()  # here should be your initial dataframe
df['ID_AND_BOUND'] = df['LS_ID'] + '_' + df['UPPER_BOUND'].astype(str)
df_processed = pd.crosstab(index=df['VEHICLE_HASH'],
                           columns=df['ID_AND_BOUND'],
                           values=df['LS_RATIO'],
                           aggfunc=np.mean)
df_processed = df_processed.reset_index().fillna(0)
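As a quick sanity check, the same recipe applied to the small sample from the question (values copied from the example above) produces one row per hash and one column per LS_ID_UPPER_BOUND combination:
import pandas as pd
import numpy as np

sample = pd.DataFrame({
    'VEHICLE_HASH': ['00061E31E25B36', '00061E31E25B36', '00061E31E25B36',
                     '0006254DB52066', '0006254DB52066', '0006254DB52066'],
    'LS_ID': ['PROMISELS103', 'PROMISELS103a', 'PROMISELS104',
              'PROMISELS104', 'PROMISELS103', 'PROMISELS105'],
    'UPPER_BOUND': [2500.0, 3000.0, 3500.0, 4000.0, 4500.0, 5000.0],
    'LS_RATIO': [0.000684, 0.002001, 0.004128, 0.003216, 0.001114, 0.020767]})
sample['ID_AND_BOUND'] = sample['LS_ID'] + '_' + sample['UPPER_BOUND'].astype(str)
wide = pd.crosstab(index=sample['VEHICLE_HASH'], columns=sample['ID_AND_BOUND'],
                   values=sample['LS_RATIO'], aggfunc=np.mean).reset_index().fillna(0)
print(wide)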
dataset is a pandas dataframe. This is sklearn.cluster.KMeans:
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
for i in range(len(prediction)):
cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
...
where A,B,C are indices
Is this the correct way of using k-means?
Assuming all the values in the dataframe are numeric,
import pandas
import sklearn.cluster
# Convert DataFrame to matrix
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pandas.DataFrame([dataset.index, labels]).T
Alternatively, you could try KMeans++ for Pandas.
To know if your dataframe dataset has suitable content you can explicitly convert to a numpy array:
dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)
If the array has a homogeneous numerical dtype (typically numpy.float64) then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, with sklearn.preprocessing.StandardScaler for instance.
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
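A minimal sketch of that preprocessing, using a hypothetical mixed-type frame (the column names and values are made up for illustration):
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mixed-type frame: one categorical and two numeric columns
df = pd.DataFrame({'city': ['NY', 'SF', 'NY'],
                   'age': [25, 32, 47],
                   'income': [48000.0, 72000.0, 61000.0]})

X = pd.get_dummies(df, columns=['city'])      # dummy variables for the categorical feature
X_scaled = StandardScaler().fit_transform(X)  # homogeneous float array, ready for KMeans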