How to sample data from the proximity of existing data?


I have data for XOR as below:
x  y  z  x ^ y ^ z
0  0  1  1
0  1  0  1
1  0  0  1
1  1  1  1
I kept only the rows where the XOR of all three values equals 1.
I want to generate synthetic data around the already available data, drawn uniformly at random within some range. The above table can be thought of as seed data. An example of the expected table is as follows:
x     y     z     x ^ y ^ z
0.1   0.3   0.8   0.9
0.25  0.87  0.03  0.99
0.79  0.09  0.28  0.82
0.97  0.76  0.91  0.89
The above table is sampled with a range of 0 to 0.3 for the value 0 and a range of 0.7 to 1 for the value 1.
I want to achieve this using PyTorch.

For a problem like this, you can synthesise the data completely without a reference, because it has a simple solution. For zero (0-0.3) you can use torch.rand to generate uniformly random data in 0-1 and scale it. For one (0.7-1) you can do the same and just offset it:
import torch

N = 5
p = 0.5  # change this to bias your outputs
x_is_1 = torch.rand(N) > p  # decide if x is going to be 1 or 0
y_is_1 = torch.rand(N) > p  # decide if y is going to be 1 or 0
# pick z so that x ^ y ^ z == 1 in every row, as in the seed table
z_is_1 = ~torch.logical_xor(x_is_1, y_is_1)
x = torch.rand(N) * 0.3
x = torch.where(x_is_1, x + 0.7, x)
y = torch.rand(N) * 0.3
y = torch.where(y_is_1, y + 0.7, y)
z = torch.rand(N) * 0.3
z = torch.where(z_is_1, z + 0.7, z)
triple_xor = 1 - torch.rand(N) * 0.3  # x ^ y ^ z is always 1, so sample from (0.7, 1]
print(torch.stack([x, y, z, triple_xor]).T)
# columns: x y z x^y^z
This prints an N x 4 tensor with one sampled row per line; the values differ from run to run.
Or, to treat your data as the basis (for more complex data), there is a preprocessing technique known as Gaussian noise injection, which seems to be what you're after. Alternatively, you can define a function that redraws every value uniformly within its range and call it a bunch of times:
def add_noise(x, y, z, triple_xor, width=0.3):
    def proc(dat, width):
        # values above 0.5 count as 1 and are redrawn from [1 - width, 1); the rest are redrawn from [0, width)
        return torch.where(dat > 0.5, torch.rand(dat.shape) * width + 1 - width, torch.rand(dat.shape) * width)
    return proc(x, width), proc(y, width), proc(z, width), proc(triple_xor, width)

gen_new_data = torch.cat([torch.stack(add_noise(x, y, z, triple_xor)).T for _ in range(5)])
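If the seed table itself should drive the sampling, one possible sketch is to pick seed rows at random and perturb every entry towards the inside of its range. The function name sample_near_seed and its parameters are hypothetical, and the code assumes that every seed value is exactly 0 or 1:

```python
import torch

def sample_near_seed(seed, n_samples, width=0.3):
    """Draw n_samples rows, each a uniformly perturbed copy of a randomly chosen seed row."""
    idx = torch.randint(0, seed.shape[0], (n_samples,))  # which seed row each sample copies
    rows = seed[idx]
    noise = torch.rand(rows.shape) * width  # uniform in [0, width)
    # seed value 0 moves up into [0, width); seed value 1 moves down into (1 - width, 1]
    return torch.where(rows > 0.5, 1.0 - noise, noise)

# seed table from the question: columns x, y, z, x ^ y ^ z
seed = torch.tensor([
    [0., 0., 1., 1.],
    [0., 1., 0., 1.],
    [1., 0., 0., 1.],
    [1., 1., 1., 1.],
])
samples = sample_near_seed(seed, n_samples=8)
print(samples)
```

Because each sample starts from a real seed row, rounding a sample back to 0/1 recovers a row of the seed table, so the XOR relationship is preserved by construction.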

Related

calculate cosine similarity for all columns in a group by in a dataframe

I have a dataframe df, where the APerc columns range from 0 to 60:
ID FID APerc0 ... APerc60
0 X 0.2 ... 0.5
1 Z 0.1 ... 0.3
2 Y 0.4 ... 0.9
3 X 0.2 ... 0.3
4 Z 0.9 ... 0.1
5 Z 0.1 ... 0.2
6 Y 0.8 ... 0.3
7 W 0.5 ... 0.4
8 X 0.6 ... 0.3
I want to calculate the cosine similarity of the values for all APerc columns between each row. So the result for the above should be:
ID CosSim
1 0,2,4 0.997
2 1,8,7 0.514
1 3,5,6 0.925
I know how to generate cosine similarity for the whole df:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
But I want to find the similarity between each ID and group them together (or create a separate df). How can I do this quickly for a big dataset?

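One possible solution is to select the particular rows you want to use for the cosine similarity computation and do the following.
Here, combinations is the list of row-index pairs that you want to consider for the computation.
import torch
from torch import nn

aperc_cols = [c for c in df.columns if c.startswith("APerc")]  # APerc0 ... APerc60
cos = nn.CosineSimilarity(dim=0)
for idx1, idx2 in combinations:
    # nn.CosineSimilarity expects tensors, so convert each row first
    row1 = torch.tensor(df.loc[idx1, aperc_cols].to_numpy(dtype=float))
    row2 = torch.tensor(df.loc[idx2, aperc_cols].to_numpy(dtype=float))
    sim = cos(row1, row2)
    print(sim)
You can then use the result however you want.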

Create a function for the calculation, then pass it to df.apply, i.e. df.apply(cosine_similarity_function), passing the function itself rather than calling it. Some report that apply can be far faster than processing the dataframe row by row in an explicit loop.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
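A grouped variant could look like the sketch below. It assumes that rows are grouped by FID and that each group is summarised by the mean cosine similarity over all of its row pairs; the question does not spell out either point, and the helper name mean_pairwise_cosine and the toy values are made up for illustration:

```python
import numpy as np
import pandas as pd

def mean_pairwise_cosine(group: pd.DataFrame) -> float:
    """Mean cosine similarity over all distinct row pairs of one group."""
    X = group.to_numpy(dtype=float)
    if len(X) < 2:
        return float("nan")  # a single row has no pair to compare
    # normalise rows to unit length (assumes no all-zero rows), then dot products are cosines
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(X), k=1)  # each pair once, diagonal excluded
    return float(sims[upper].mean())

# toy data, not the data from the question
df = pd.DataFrame({
    "FID": ["X", "Z", "X", "Z", "Y", "Y"],
    "APerc0": [1.0, 0.0, 2.0, 0.0, 1.0, 0.0],
    "APerc1": [0.0, 1.0, 0.0, 3.0, 0.0, 1.0],
})
aperc_cols = [c for c in df.columns if c.startswith("APerc")]
result = df.groupby("FID")[aperc_cols].apply(mean_pairwise_cosine)
print(result)
```

The per-group work is a single matrix product, so there is no Python loop over row pairs.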

correlation matrix filtering based on high variables correlation with selection of least correlated with target variable at scale using vectors

I have this resulting correlation matrix:
id  row  col  corr  target_corr
0   a    b    0.95  0.2
1   a    c    0.7   0.2
2   a    d    0.2   0.2
3   b    a    0.95  0.7
4   b    c    0.35  0.7
5   b    d    0.65  0.7
6   c    a    0.7   0.6
7   c    b    0.35  0.6
8   c    d    0.02  0.6
9   d    a    0.2   0.3
10  d    b    0.65  0.3
11  d    c    0.02  0.3
After filtering the highly correlated variables based on the "corr" column, I am trying to add a new column that marks the variable in "row" as "keep" when it is the one least correlated with the target ("target_corr" column), or as "drop" when it is the one most correlated with the target. In other words, from the correlated variables matching cut > 0.5, select the one least correlated with the target according to "target_corr".
Expected result:
id  row  col  corr  target_corr  drop/keep
0   a    b    0.95  0.2          keep
1   a    c    0.7   0.2          keep
2   b    a    0.95  0.7          drop
3   b    d    0.65  0.7          drop
4   c    a    0.7   0.6          drop
5   d    b    0.65  0.3          keep
This approach works on very large dataframes, so the resulting corr matrix is, for example, larger than 100k x 100k. It is generated using pyspark:
import time

import numpy as np
import pandas as pd

def corrwith_matrix_no_save(df, data_cols=None, select_targets=None, method='pearson'):
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    start_time = time.time()
    vector_col = "corr_features"
    # default to all columns when nothing is specified
    if data_cols is None:
        data_cols = df.columns
    if select_targets is None:
        select_targets = list(df.columns)
    assembler = VectorAssembler(inputCols=data_cols, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    matrix = Correlation.corr(df_vector, vector_col, method)
    # the single result cell holds the correlation matrix
    result = matrix.collect()[0][0].values
    final_df = pd.DataFrame(result.reshape(-1, len(data_cols)), columns=data_cols, index=data_cols)
    final_df = final_df.abs()
    corr_df = final_df[select_targets].copy()
    #corr_df.columns = [str(col) + '_corr' for col in corr_df.columns]
    corr_df['column_names'] = corr_df.index
    print('Execution time for correlation_matrix function:', time.time() - start_time)
    return corr_df
I then created the dataframe from the upper triangle with numpy.triu and stack, and added the target column by merging the 2 resulting dataframes (I can provide more code if required, but it would lengthen the question a lot, so I will only do so if clarification is needed).
def corrX_to_ls(corr_mtx):
    # Split off the target correlations, then take the upper triangle of the rest
    df_target = corr_mtx['target']
    # remove the target from both axes so only feature-feature pairs remain
    corr_df = corr_mtx.drop(index='target', columns='target', errors='ignore')
    up = corr_df.where(np.triu(np.ones(corr_df.shape), k=1).astype(bool))
    print('This is triu: \n', up)
    df = up.stack().reset_index()
    df.columns = ['row', 'col', 'corr']
    df_lsDF = df.query("row != col")
    df_target_corr = df_target.reset_index()
    df_target_corr.columns = ['target_col', 'target_corr']
    sample_df = df_lsDF.merge(df_target_corr, how='left', left_on='row', right_on='target_col')
    sample_df = sample_df.drop(columns='target_col')
    return sample_df
Now, after filtering the resulting dataframe on df.corr > cut (where cut > 0.50), I got stuck at marking which variable to keep and which to drop (I only want to mark them for now, and then select the variables into a list), so help with solving it will be greatly appreciated and will also benefit the community when working on distributed systems.
Note: I am looking for an example/solution that scales, so that I can distribute the operations across executors. What I am after is processing lists, or groups/subsets of the dataframe, in parallel while avoiding loops, so numpy.vectorize, threading and/or multiprocessing approaches are what I am looking for.
Additional thoughts off the top of my head: I am thinking of grouping by the "row" column so that the processing of each group can be distributed across executors, or of using lists so that each list generates a job for a thread from a ThreadPool (I have used this approach for column vectors, where it can become inefficient for very large matrices/dataframes, but I think it will work for rows).

Given final_df as the sample input, you can try:
import numpy as np

# filter
output = final_df.query('corr>target_corr').copy()
# assign drop/keep
output['drop_keep'] = np.where(output['corr'] > 2 * output['target_corr'],
                               'keep', 'drop')
Output:
id row col corr target_corr drop_keep
0 0 a b 0.95 0.2 keep
1 1 a c 0.70 0.2 keep
3 3 b a 0.95 0.7 drop
6 6 c a 0.70 0.6 drop
10 10 d b 0.65 0.3 keep
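Another reading of the expected table in the question is: keep only the pairs with corr above the cut, then mark "row" as keep when its target_corr is lower than that of "col". The following is a minimal pandas sketch under that assumption; the names pairs, cut, target_by_var and high are hypothetical, and ties are marked as drop:

```python
import numpy as np
import pandas as pd

# sample data from the question
pairs = pd.DataFrame({
    "row": list("aaabbbcccddd"),
    "col": list("bcdacdabdabc"),
    "corr": [0.95, 0.7, 0.2, 0.95, 0.35, 0.65, 0.7, 0.35, 0.02, 0.2, 0.65, 0.02],
    "target_corr": [0.2] * 3 + [0.7] * 3 + [0.6] * 3 + [0.3] * 3,
})
cut = 0.5

# target correlation of every variable, looked up by variable name
target_by_var = pairs.drop_duplicates("row").set_index("row")["target_corr"]

# keep only the highly correlated pairs
high = pairs[pairs["corr"] > cut].reset_index(drop=True)

# compare the target correlation of "row" with that of "col", without a Python loop
col_target = high["col"].map(target_by_var)
high["drop_keep"] = np.where(high["target_corr"] < col_target, "keep", "drop")
print(high)
```

The same lookup-and-compare step can be expressed as a join followed by a conditional column, which is the form that would carry over to a distributed dataframe.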

How can I bin a Pandas Series setting the bin size to a preset value of max/min for each bin

I have a pd.Series of floats and I would like to bin it into n bins where the bin size for each bin is set so that max/min is a preset value (e.g. 1.20)?
The requirement means that the size of the bins is not constant. For example:
data = pd.Series(np.arange(1, 11.0))
print(data)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
dtype: float64
I would like the bin sizes to be:
1.00 <= bin 1 < 1.20
1.20 <= bin 2 < 1.20 x 1.20 = 1.44
1.44 <= bin 3 < 1.44 x 1.20 = 1.73
...
etc
Thanks

Here's one with pd.cut, where the bins can be computed taking the np.cumprod of an array filled with 1.2:
import numpy as np
import pandas as pd

data = pd.Series(list(range(11)))
n = 20  # set accordingly
bins = np.r_[0, np.cumprod(np.full(n, 1.2))]
# array([ 0. , 1.2 , 1.44 , 1.728 ...
pd.cut(data, bins)
0 NaN
1 (0.0, 1.2]
2 (1.728, 2.074]
3 (2.986, 3.583]
4 (3.583, 4.3]
5 (4.3, 5.16]
6 (5.16, 6.192]
7 (6.192, 7.43]
8 (7.43, 8.916]
9 (8.916, 10.699]
10 (8.916, 10.699]
dtype: category
Where bins in this case goes up to:
np.r_[0,np.cumprod(np.full(20, 1.2))]
array([ 0. , 1.2 , 1.44 , 1.728 , 2.0736 ,
2.48832 , 2.985984 , 3.5831808 , 4.29981696, 5.15978035,
6.19173642, 7.43008371, 8.91610045, 10.69932054, 12.83918465,
15.40702157, 18.48842589, 22.18611107, 26.62333328, 31.94799994,
38.33759992])
So you'll have to set that according to the range of values of the actual data

I believe this is the best way to do it, because it takes the max and min values of your array into account. You therefore don't need to worry about which values you are using, only about the multiplier or step_size for your bins (of course, you'd need to add a column name or some additional information if you are working with a DataFrame):
import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 11.0))
bins = []
i = min(data)
while i < max(data):
    bins.append(i)
    i = i * 1.2
bins.append(i)  # final edge, the first one at or above max(data)
bins = list(set(bins))
bins.sort()
df = pd.cut(data, bins, include_lowest=True)
print(df)
Output:
0 (0.999, 1.2]
1 (1.728, 2.074]
2 (2.986, 3.583]
3 (3.583, 4.3]
4 (4.3, 5.16]
5 (5.16, 6.192]
6 (6.192, 7.43]
7 (7.43, 8.916]
8 (8.916, 10.699]
9 (8.916, 10.699]
Bins output:
Categories (13, interval[float64]): [(0.999, 1.2] < (1.2, 1.44] < (1.44, 1.728] < (1.728, 2.074] < ... <
(5.16, 6.192] < (6.192, 7.43] < (7.43, 8.916] <
(8.916, 10.699]]

Thanks everyone for all the suggestions. None does quite what I was after (probably because my original question wasn't clear enough), but they really helped me figure out what to do, so I have decided to post my own answer (I hope this is what I am supposed to do, as I am relatively new at being an active member of stackoverflow...).
I liked @yatu's vectorised suggestion best because it will scale better with large data sets, but I am after a way to not only calculate the bins automatically but also figure out the minimum number of bins needed to cover the data set.
This is my proposed algorithm:
The bin size is defined so that bin_max_i / bin_min_i is constant:
    bin_max_i / bin_min_i = bin_ratio
Figure out the number of bins for the required bin size (bin_ratio):
    data_ratio = data_max / data_min
    n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )
Set the lower boundary for the smallest bin so that the smallest data point fits in it:
    bin_min_0 = data_min
Create n non-overlapping bins meeting the conditions:
    bin_min_(i+1) = bin_max_i
    bin_max_(i+1) = bin_min_(i+1) * bin_ratio
Stop creating further bins once the whole dataset can be split between the bins already created. In other words, stop once:
    bin_max_last > data_max
Here is a code snippet:
import math
import numpy as np
import pandas as pd

bin_ratio = 1.20
data = pd.Series(np.arange(2, 12))
data_ratio = max(data) / min(data)
n_bins = math.ceil(math.log(data_ratio) / math.log(bin_ratio))
n_bins = n_bins + 1  # bin ranges are defined as [min, max)
bin_min_0 = min(data)  # the smallest data point is the lower limit of the 1st bin
bins = np.full(n_bins, bin_ratio)  # initialise the ratios for the bin limits
bins[0] = bin_min_0  # initialise the lower limit for the 1st bin
bins = np.cumprod(bins)  # generate bins
print(bins)
[ 2. 2.4 2.88 3.456 4.1472 4.97664
5.971968 7.1663616 8.59963392 10.3195607 12.38347284]
I am now set to build a histogram of the data:
data.hist(bins=bins)
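The same idea can be wrapped in a small helper and combined with pd.cut. This is only a sketch: the function name geometric_bins is hypothetical, the data is assumed to be strictly positive, and floor(...) + 1 is used instead of ceil(...) so that the last edge lies strictly above data_max even when data_max falls exactly on an edge:

```python
import math

import numpy as np
import pandas as pd

def geometric_bins(data: pd.Series, bin_ratio: float) -> np.ndarray:
    """Bin edges starting at data.min(), each edge bin_ratio times the previous one."""
    data_min, data_max = data.min(), data.max()
    # smallest number of bins whose last edge lies strictly above data_max
    n_bins = math.floor(math.log(data_max / data_min) / math.log(bin_ratio)) + 1
    edges = data_min * bin_ratio ** np.arange(n_bins + 1)
    if edges[-1] <= data_max:  # guard against floating-point rounding in the logs
        edges = np.append(edges, edges[-1] * bin_ratio)
    return edges

data = pd.Series(np.arange(1, 11.0))
edges = geometric_bins(data, 1.20)
binned = pd.cut(data, edges, right=False)  # half-open intervals [min, max)
print(binned.value_counts(sort=False))
```

With right=False the intervals are closed on the left, which matches the "1.00 <= bin 1 < 1.20" convention from the question.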

replacing a range of values with one value

I have a list that I'm adding to a pandas data frame; it contains a range of decimal values.
I want to divide it into 3 ranges, where each range represents one value:
sents = []
for sent in sentis:
    if sent > 0:
        if sent < 0.40:
            sents.append('negative')
        if (sent >= 0.40 and sent <= 0.60):
            sents.append('neutral')
        if sent > 0.60:
            sents.append('positive')
My question is whether there is a more efficient way to do this in pandas, as I'm trying to implement this on a bigger list.
Thanks in advance.

You can use pd.cut to produce the results that are of type categorical and have the appropriate labels.
In order to fix the inclusion of .4 and .6 in the neutral category, I subtract the smallest float epsilon from .4 and add it to .6:
import numpy as np
import pandas as pd

sentis = np.linspace(0, 1, 11)
eps = np.finfo(float).eps
pd.DataFrame(dict(
    Value=sentis,
    Sentiment=pd.cut(
        sentis, [-np.inf, .4 - eps, .6 + eps, np.inf],
        labels=['negative', 'neutral', 'positive']
    ),
))
Sentiment Value
0 negative 0.0
1 negative 0.1
2 negative 0.2
3 negative 0.3
4 neutral 0.4
5 neutral 0.5
6 neutral 0.6
7 positive 0.7
8 positive 0.8
9 positive 0.9
10 positive 1.0

List comprehension:
['negative' if x < 0.4 else 'positive' if x > 0.6 else 'neutral' for x in sentis]
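For a large list, a vectorised form of the same thresholds may be worth trying. The sketch below uses np.select, which checks the conditions in order and takes the first match; the sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

sentis = pd.Series([0.1, 0.4, 0.5, 0.6, 0.9])  # illustrative values only

# conditions are evaluated in order, so the second one only applies to values >= 0.40
conditions = [sentis < 0.40, sentis <= 0.60]
labels = np.select(conditions, ["negative", "neutral"], default="positive")

df = pd.DataFrame({"Value": sentis, "Sentiment": labels})
print(df)
```

Like the list comprehension, this labels every value, including values at or below 0 that the original loop skips.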

Is there a way to test correlation between Data X and Binary output Y?

I'm trying to find a Python method/library for testing the correlation between the independent variable X and the binary output Y.
So for example, let's say I have the following data and output:
X Y
0.65 1
0.11 0
0.13 0
0.35 1
0.21 0
...
Let's say the output Y is 1 if (X > 0.3) and 0 otherwise. If I don't know this correlation (the threshold value 0.3), is there a statistical method/test to find out the degree of correlation between X and Y?
So for example, some method that returns
x = [0.65, 0.11, 0.13, 0.31, 0.21]
y = [1, 0, 0, 1, 0]
print(some_test(x, y))
==> returns "degree of correlation = 1.0"
Thanks

You are looking for a point biserial correlation, which is used when one of your variables is dichotomous.
from scipy import stats
stats.pointbiserialr(x,y)
If you simply want to know whether X is different depending on the value of Y, you should instead use a t-test.
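As a rough illustration of both suggestions, the sketch below runs them on the five rows from the question's table. The point-biserial coefficient is a Pearson correlation computed with one binary variable, so it measures linear association and will generally not be exactly 1.0 even when Y is a perfect threshold function of X:

```python
import numpy as np
from scipy import stats

x = [0.65, 0.11, 0.13, 0.35, 0.21]
y = [1, 0, 0, 1, 0]

# point-biserial correlation: binary variable first, continuous variable second
r, p_value = stats.pointbiserialr(y, x)

# the same number via an ordinary Pearson correlation
pearson_r = np.corrcoef(x, y)[0, 1]

# t-test: do the X values differ between the Y == 1 and Y == 0 groups?
x_when_1 = [xi for xi, yi in zip(x, y) if yi == 1]
x_when_0 = [xi for xi, yi in zip(x, y) if yi == 0]
t_stat, t_p_value = stats.ttest_ind(x_when_1, x_when_0)

print(r, p_value, t_stat, t_p_value)
```

With only five points the p-values say very little; the example is meant to show the calls, not to draw a conclusion.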
