Show data only from one cluster - python

I have a pandas dataframe.
import pandas as pd

input_data = {'col1': [1, 2, 3], 'col2': [3, 4, 5]}
d = pd.DataFrame(data=input_data)
anotherdata = magic(d)  # magic is some transformation of d (not shown)
df = pd.DataFrame(data=anotherdata)
I use DBSCAN to cluster df.
As a result I have cluster_labels. Labels can range from -1 (outliers) to 2 in this case.
I want to be able to show the data from one particular cluster separately while keeping access to the initial dataframe d by index.
For example, suppose the element with index 1 in input_data is assigned to cluster 0, and no other elements belong to cluster 0.
How can I find this element in input_data by index?

You probably want to use
d[cluster_labels == 0]
unless your magic function changed the indexes.
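As a minimal sketch, assuming cluster_labels is the label array returned by DBSCAN and that magic preserved d's row order:
mask = cluster_labels == 0       # boolean mask for the cluster of interest
cluster_0 = d[mask]              # rows of d assigned to cluster 0
print(cluster_0.index)           # their original indices, e.g. index 1
print(d.loc[cluster_0.index])    # the same rows fetched from d by index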

Related

Can I create column where each row is a running list in a Pandas data frame using groupby?

Imagine I have a Pandas DataFrame:
import pandas as pd

# create df
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [5, 4, 6, 3, 2, 3]})
Let's assume it is ordered by 'id' and an imaginary, not shown, date column (ascending).
I want to create another column where each row is a list of 'val' at that date.
The ending DataFrame will look like this:
df = pd.DataFrame({'id': [1,1,1,2,2,2],
'val': [5,4,6,3,2,3],
'val_list': [[5],[5,4],[5,4,6],[3],[3,2],[3,2,3]]})
I don't want to use a loop because the actual df I am working with has about 4 million records. I am imagining I would use a lambda function in conjunction with groupby (something like this):
df['val_list'] = df.groupby('id')['val'].apply(lambda x: x.runlist())
This raises an AttributeError because the runlist() method does not exist, but I imagine the solution looks something like this.
Does anyone know how to solve this problem?
Let us try
df['new'] = df.val.map(lambda x: [x]).groupby(df.id).apply(lambda x: x.cumsum())
Out[138]:
0 [5]
1 [5, 4]
2 [5, 4, 6]
3 [3]
4 [3, 2]
5 [3, 2, 3]
Name: val, dtype: object
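This works because cumsum on an object-dtype Series accumulates with +, and + on Python lists is concatenation, so the running "sum" is the running list. A minimal illustration of just that mechanism:
import pandas as pd

s = pd.Series([[5], [4], [6]])  # object dtype: each element is a list
print(s.cumsum())
# 0          [5]
# 1       [5, 4]
# 2    [5, 4, 6]
# dtype: object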

Comparing two data frames columns and assigning Zero and One

I have a dataframe and a list that contains some of my dataframe's column names, as follows:
my_frame:
col1, col2, col3, ..., coln
2, 3, 4, ..., 2
5, 8, 5, ..., 1
6, 1, 8, ..., 9
my_list:
['col1','col3','coln']
Now, I want to create an array the size of my original dataframe's total number of columns, consisting only of zeros and ones: 1 if the column name appears in "my_list", otherwise 0. My desired output should look like this:
my_array = [1, 0, 1, 0, 0, ..., 1]
This should help you:
import pandas as pd

dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = []
for column in df.columns:
    if column in my_list:
        my_array.append(1)
    else:
        my_array.append(0)
print(my_array)
Output:
[1, 0, 1]
If you want my_array as a NumPy array instead of a list, use this:
import pandas as pd
import numpy as np
dictt = {'a': [1, 2, 3],
         'b': [4, 5, 6],
         'c': [7, 8, 9]}
df = pd.DataFrame(dictt)
my_list = ['a', 'h', 'g', 'c']
my_array = np.empty(0, dtype=int)
for column in df.columns:
    if column in my_list:
        my_array = np.append(my_array, 1)
    else:
        my_array = np.append(my_array, 0)
print(my_array)
print(my_array)
Output:
[1 0 1]
I have used test data in my code for easier understanding. You can replace the test data with your actual data (i.e. replace my test dataframe with your actual dataframe). Hope this helps!
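For what it's worth, the loop can be replaced by a one-liner, since a pandas column Index has an isin method that returns a boolean NumPy array (a sketch using the same test data):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
my_list = ['a', 'h', 'g', 'c']
my_array = df.columns.isin(my_list).astype(int)  # boolean array cast to 0/1
print(my_array)  # [1 0 1]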

Pandas Correlation One Column to Many Columns Group by range of the column

Assuming I have a data frame similar to the one below (the actual data frame has a million observations), how would I get the correlation between the signal column and a list of return columns, grouped by the Signal_Up column?
I tried the pandas corrwith function, but it does not group the correlations by the Signal_Up column:
df[['Net_return_at_t_plus1', 'Net_return_at_t_plus5',
    'Net_return_at_t_plus10']].corrwith(df['Signal_Up'])
I am looking for the correlation between the signal column and the other net return columns, grouped by the values of the Signal_Up column. (The sample data and desired result were provided as images in the original post.)
Using the simple dataframe below:
df = pd.DataFrame({'v1': [1, 3, 2, 1, 6, 7],
                   'v2': [2, 2, 4, 2, 4, 4],
                   'v3': [3, 3, 2, 9, 2, 5],
                   'v4': [4, 5, 1, 4, 2, 5]})
(1st interpretation) One way to get the correlations of one variable with the other columns is:
correlations = df.corr().unstack().sort_values(ascending=False) # Build correlation matrix
correlations = pd.DataFrame(correlations).reset_index() # Convert to dataframe
correlations.columns = ['col1', 'col2', 'correlation'] # Label it
correlations.query("col1 == 'v2' & col2 != 'v2'") # Filter by variable
# output of this code will give correlation of column v2 with all the other columns
(2nd interpretation) One way to get the correlations of column v1 with columns v3 and v4 after grouping by column v2 is this one-liner:
df.groupby('v2')[['v1', 'v3', 'v4']].corr().unstack()['v1']
In your case, v2 is 'Signal_Up', v1 is 'signal', and v3, v4 stand in for the 'Net_return_at_t_plusX' columns.
I am able to get the correlations for each category of the Signal_Up column by using the groupby function. However, I am not able to apply the corr function to more than two columns at once, so I had to use concat to combine the results:
a = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus1']].corr().unstack().iloc[:,1]
b = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus5']].corr().unstack().iloc[:,1]
c = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus10']].corr().unstack().iloc[:,1]
dfCorr = pd.concat([a, b, c], axis=1)
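The three separate groupby calls can also be collapsed into one by applying corrwith per group; a sketch assuming the column names from the question:
ret_cols = ['Net_return_at_t_plus1', 'Net_return_at_t_plus5',
            'Net_return_at_t_plus10']
# for each Signal_Up group, correlate 'signal' with every return column at once
dfCorr = df.groupby('Signal_Up').apply(lambda g: g[ret_cols].corrwith(g['signal']))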

python 1:1 stratified sampling per each group

How can 1:1 stratified sampling be performed in Python?
Assume the pandas DataFrame df is heavily imbalanced. It contains a binary group column and multiple columns of categorical subgroups.
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1],
                   'value': [1, 2, 3, 1, 2]})
display(df)
display(df[df.group == 1])
display(df[df.group == 0])
df.group.value_counts()
For each member of group == 1 I need to find a single match from group == 0.
A StratifiedShuffleSplit from scikit-learn will only return a random portion of the data, not a 1:1 match.
If I understood correctly, you could use np.random.permutation:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})
# create new column with an identifier for a combination of categories
columns = ['sub_category_1', 'sub_category_2']
labels = df.loc[:, columns].apply(lambda x: ''.join(map(str, x.values)), axis=1)
values, keys = pd.factorize(labels)
df['label'] = labels.map(dict(zip(keys, values)))
# build distribution of sub-categories combinations
distribution = df[df.group == 1].label.value_counts().to_dict()
# select from group 0 only those rows that are in the same sub-categories combinations
mask = (df.group == 0) & (df.label.isin(distribution))
# do random sampling
selected = np.ravel([np.random.permutation(group.index)[:distribution[name]]
                     for name, group in df.loc[mask].groupby(['label'])])
# display result
result = df.drop('label', axis=1).iloc[selected]
print(result)
Output
   group  id  sub_category_1  sub_category_2  value
4      0   5               1               1      2
2      0   3               2               2      3
Note that this solution assumes the size of each possible sub-category combination in group 1 is less than the size of the corresponding sub-group in group 0. A more robust version involves using np.random.choice with replacement:
selected = np.ravel([np.random.choice(group.index, distribution[name], replace=True)
                     for name, group in df.loc[mask].groupby(['label'])])
The version with choice does not have the same assumption as the one with permutation, although it requires at least one element for each sub-category combination.
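To obtain the final balanced 1:1 set, the matched group-0 rows can then be stacked with the group-1 rows; a short sketch building on the variables defined above:
# combine all group-1 rows with their matched group-0 counterparts
balanced = pd.concat([df[df.group == 1].drop('label', axis=1), result])
print(balanced)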

Get row index from DataFrame row

Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra column that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a 4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
Looking at this from a different side:
for r in df.itertuples():
    getattr(r, 'Index')  # the row's index label on each iteration
Where df is the data frame. Maybe you want to use a conditional to get the index only when a condition is met.
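If you need the ordinal positions of all rows matching a condition at once, NumPy offers a direct route; a small sketch of that alternative, assuming the example df from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
positions = np.flatnonzero(df.a > 3)  # ordinal positions where the condition holds
print(positions)  # [2 4 5]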
