Python lineage naming with clustered dataframe

I have a dataframe
sample1 0 0 0 0 0 1 1 1 1 1 1 1 1 L1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0 L1-1
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0 L1-1-1
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0 L1-1-1-1
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0 L1-1-2
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0 L1-1-2-1
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0 L1-1-3
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1 L1-2
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0 L1-2-1
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1 L1-2-2
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0 L2
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0 L2-1
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0 L2-1-1
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0 L2-1-1-1
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0 L2-2
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0 L2-3
As you can see, the rows are clustered.
I want to assign a "lineage-based" label to each sample.
For example, sample1 will be lin1 because it appears first, and sample2 will be lin1-1.
Sample3 will be lin1-1-1, sample4 will be lin1-1-1-1.
Next, sample5 will be lin1-1-2, sample6 will be lin1-1-2-1...
Sample11 will be a new start for the lineage, lin2.
My original idea for the naming was:
"sample1 is lin1; if the next sample is included in the previous sample, lin1 + "-1";
if not, lin(1+1)"
sample1 -> lin1
sample2 -> lin1-1 (sample2 is included in sample1)
sample3 -> lin1-1-1 (sample3 is included in sample2)
sample4 -> lin1-1-1-1 (sample4 is included in sample3)
sample5 -> lin1-1-2 (sample5 is not included in sample4)
...and so on, following this logic.
I couldn't turn this logic into a Python script.

This can be done in several steps.
Step 1. Data preprocessing
Sort the rows in descending order (so each superset comes before its subsets) and remove duplicates; otherwise it may not work. Assume this is already done.
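If your real data is not already in that order, a minimal sketch of this preprocessing, assuming the 0/1 values sit in a pandas DataFrame df (hypothetical name). Sorting by descending row sum is enough here because a proper subset always has strictly fewer 1s than its superset:
import pandas as pd
# Hypothetical sketch: put supersets first (larger popcount), then drop exact duplicates
df = df.loc[df.sum(axis=1).sort_values(ascending=False).index]
df = df.drop_duplicates()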
import numpy as np
data = '''sample1 0 0 0 0 0 1 1 1 1 1 1 1 1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0'''
data = [x.split() for x in data.split('\n')]  # tokenize each line
data = [x[1:] for x in data]                  # drop the sample-name column
data = np.array(data, dtype=int)
data
array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])
Step 2. Encode the sample to position. Each element is a frozenset.
nrow, ncol = data.shape
def to_position(sample):
    ncol = len(sample)
    return frozenset(i for i in range(ncol) if sample[i] == 1)
position = [to_position(data[i]) for i in range(nrow)]
# print(position)
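For example, the first row has its 1s in columns 5 through 12, so:
position[0]
frozenset({5, 6, 7, 8, 9, 10, 11, 12})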
Step 3. Assign each sample position to a cluster, where the cluster is represented as a tuple for now.
def assign_cluster(sample, clusters, parent):
    if parent not in clusters:
        clusters[parent] = sample
    elif sample < clusters[parent]:
        # sample is a proper subset of this cluster: descend into its first child
        parent = parent + (0,)
        assign_cluster(sample, clusters, parent)
    else:
        # not a subset: try the next sibling
        parent = parent[:-1] + (parent[-1] + 1,)
        assign_cluster(sample, clusters, parent)
clusters = {}
root = (0,)
clusters[root] = position[0]
for i in range(1, nrow):
    sample = position[i]
    assign_cluster(sample, clusters, parent=root)
# print(clusters)
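Each tuple key now encodes a path in the lineage tree. For instance, after the loop:
clusters[(0,)]     # sample1's position set -> will print as L1
clusters[(0, 0)]   # sample2's position set -> will print as L1-1
clusters[(1,)]     # sample11's position set -> will print as L2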
Step 4. Convert cluster to string and show result.
def cluster_to_string(c):
    c = [str(_ + 1) for _ in c]
    return 'L' + '-'.join(c)

position_dict = {v: k for k, v in clusters.items()}
for sample in data:
    sample = to_position(sample)
    c = position_dict[sample]
    print(cluster_to_string(c))
L1
L1-1
L1-1-1
L1-1-1-1
L1-1-2
L1-1-2-1
L1-1-3
L1-2
L1-2-1
L1-2-2
L2
L2-1
L2-1-1
L2-1-1-1
L2-2
L2-3
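If you would rather have the labels in a pandas column than printed, a small sketch reusing the helpers above (df_out is a hypothetical name):
import pandas as pd
labels = [cluster_to_string(position_dict[to_position(row)]) for row in data]
df_out = pd.DataFrame(data)
df_out['lineage'] = labels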

Related

how to label multiple columns effectively using pandas

I have data columns that look like below:
a b c d e
1 0 0 0 0
0 2 0 0 0
3 0 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 1
For this dataframe I want to create a column called label
a b c d e label
1 0 0 0 0 cola
0 2 0 0 0 colb
3 0 0 0 0 cola
0 0 0 1 0 cold
0 0 1 0 0 colc
0 0 0 0 1 cole
The label is the name of the column with the non-zero entry.
My prior code was df['label'] = df['a'].apply(lambda x: 1 if x!=0), but it doesn't work (the conditional expression is missing its else branch, and it only looks at column a). Is there any way to return the expected result?
Try idxmax on axis 1
import pandas as pd
df = pd.DataFrame({'a': [1, 0, 3, 0, 0, 0],
                   'b': [0, 2, 0, 0, 0, 0],
                   'c': [0, 0, 0, 0, 1, 0],
                   'd': [0, 0, 0, 1, 0, 0],
                   'e': [0, 0, 0, 0, 0, 1]})
df['label'] = 'col' + df.idxmax(axis=1)
Output
a b c d e label
0 1 0 0 0 0 cola
1 0 2 0 0 0 colb
2 3 0 0 0 0 cola
3 0 0 0 1 0 cold
4 0 0 1 0 0 colc
5 0 0 0 0 1 cole
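Note that idxmax returns the label of the first maximum in each row, which is the non-zero column here only because each row has a single positive entry. If values could be negative, a sketch that compares against zero instead (restricted to the original value columns):
df['label'] = 'col' + df[['a', 'b', 'c', 'd', 'e']].ne(0).idxmax(axis=1)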

Transform list of ndarrays into dataframe

I have a list of ndarrays that I want to transform into a pd.DataFrame. The list looks like this:
from numpy import array
l = [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
            0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]),
     array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]),
     array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])]
The length of the ndarrays is a multiple of 12 (12 months); in this case it's 36. I want the final output to look like this:
Year Jan Feb March April May
1    0   0   0     0     0
2    0   1   1     0     0
3    0   0   1     0     0
1    0   0   0     0     0
2    0   0   0     0     0
3    0   0   0     0     0
1    0   0   1     0     0
2    1   1   0     0     1
3    0   0   0     0     0
reshaping
Assuming l is the input, you can use:
import numpy as np
import pandas as pd
from calendar import month_abbr
df = pd.DataFrame(np.vstack(l).reshape(-1, 12),
                  columns=month_abbr[1:])
df.insert(0, 'year', np.tile(range(1, len(l[0])//12 + 1), len(l)))
print(df)
or:
df = pd.DataFrame(np.hstack([np.tile(range(1, len(l[0])//12 + 1), len(l))[:, None],
                             np.vstack(l).reshape(-1, 12)]),
                  columns=['year'] + month_abbr[1:])
output:
year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1 0 0 0 0 0 0 0 0 0 0 0 1
1 2 0 1 1 0 0 1 0 0 0 0 0 0
2 3 0 0 1 0 0 0 0 0 0 1 0 0
3 1 0 0 0 0 0 0 0 0 0 0 0 0
4 2 0 0 0 0 0 0 0 0 0 0 0 0
5 3 0 0 0 0 0 0 0 0 0 1 1 0
6 1 0 0 1 0 0 0 0 0 0 0 0 0
7 2 1 1 0 0 1 0 0 0 1 1 0 0
8 3 0 0 0 0 0 0 0 0 0 0 0 1
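To see why the reshape gives one row per year: np.vstack(l) stacks the three arrays into shape (3, 36), and reshape(-1, 12) splits each 36-month row into three consecutive 12-month rows:
stacked = np.vstack(l)             # shape (3, 36): one row per array
by_year = stacked.reshape(-1, 12)  # shape (9, 12): one row per (array, year) pair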
previous answer: aggregation
Assuming l is the input list and each array represents successive months forming 3 years, you can vstack, reshape, and aggregate (here using max) before converting to a DataFrame:
from calendar import month_abbr
df = pd.DataFrame(np.vstack(l).reshape(len(l), -1, 12).max(axis=0),
                  columns=month_abbr[1:])
output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 0 0 1 0 0 0 0 0 0 0 0 1
1 1 1 1 0 1 1 0 0 1 1 0 0
2 0 0 1 0 0 0 0 0 0 1 1 1
As it is ambiguous how you want to aggregate, you can also use a different axis:
pd.DataFrame(np.vstack(l).reshape(len(l), -1, 12).max(axis=1),
             columns=month_abbr[1:])
output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 0 1 1 0 0 1 0 0 0 1 0 1
1 0 0 0 0 0 0 0 0 0 1 1 0
2 1 1 1 0 1 0 0 0 1 1 0 1

Numpy way to integer-mask an array

I have a multi-class segmentation mask
eg.
[1 1 1 2 2 2 2 3 3 3 3 3 3 2 2 2 2 4 4 4 4 4 4]
and I am going to need binary segmentation masks for each value,
i.e.
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]
Any elegant numpy way to do this?
Ideally an example where I can set 0 and 1 to other values, if I have to.
Just use == like this:
import numpy as np
a = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4])
mask1 = (a==1)*5
mask2 = (a==2)*5
mask3 = (a==3)*5
mask4 = (a==4)*5
for mask in [mask1, mask2, mask3, mask4]:
    print(mask)
This gives
[5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 5 5 5 5 0 0 0 0 0 0 5 5 5 5 0 0 0 0 0 0]
[0 0 0 0 0 0 0 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5]
You can manipulate the masks further in the same manner, e.g.:
mask1[mask1==0] = 3
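Alternatively, np.where builds a relabeled mask in one step, e.g. to put 42 where a equals 2 and 7 everywhere else:
mask2 = np.where(a == 2, 42, 7)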
Native Python approach:
You can use a comprehension to get the equality values for each unique value from set(<sequence>), then convert the booleans to int to get 0/1 values.
>>> ls =[1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4]
>>> {v:[int(v==i) for i in ls] for v in set(ls)}
{1: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
2: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
3: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
4: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]}
Numpy approach:
Get the unique values with np.unique, expand their axis and transpose them into a column, expand the list's axis and repeat it n times (where n is the number of unique values), then do the equality comparison and convert to integer type:
import numpy as np
ls = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4]
uniques = np.expand_dims(np.unique(ls), 0).T
result = (np.repeat(np.expand_dims(ls, 0), uniques.shape[0], 0)==uniques).astype(int)
OUTPUT:
print(result)
[[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]]
You can build the mask using np.arange and .repeat() and then use broadcasting and the == operator to generate the arrays:
a = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4])
mask = np.arange(a.min(), a.max() + 1).repeat(a.size).reshape(-1, a.size)
a_masked = (a == mask).astype(int)
print(a_masked.shape) # (4, 23)
print(a_masked)
# [[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]]
Setting 0 and 1 to other values can be done via normal indexing:
a_masked[a_masked == 0] = 7
a_masked[a_masked == 1] = 42
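As a side note, the repeat is not strictly needed: giving the value range a trailing axis lets broadcasting do the same comparison, a sketch:
values = np.arange(a.min(), a.max() + 1)[:, None]  # shape (4, 1)
a_masked = (a == values).astype(int)               # broadcasts against a to (4, 23)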

is it possible to use fnmatch.filter on a pandas dataframe instead of regex? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I have a dataframe, shown below as an example. I want only tests matching a certain pattern to be part of my updated dataframe. Is there a way to do it with fnmatch instead of regex?
import pandas as pd
data = {'part1': [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
        'part2': [0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
        'part3': [0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1],
        'part4': [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
        'part5': [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        'part6': [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
        'part7': [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1],
        'part8': [1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1],
        'part9': [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        'part10': [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
        'part11': [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        'part12': [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data, index=['test_gt1', 'test_gt2', 'test_gf3', 'test_gf4',
                               'test_gt5', 'test_gg6', 'test_gf7', 'test_gt8',
                               'test_gg9', 'test_gf10', 'test_gg11', 'test12'])
I want to create a new dataframe that only contains the test_gg, test_gf, or test_gt rows, using fnmatch.filter. All the examples I see deal with lists, so how can I apply it to a dataframe?
Import fnmatch.filter and filter on the index:
from fnmatch import filter
In [7]: df.loc[filter(df.index, '*g*')]
Out[7]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
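You can also let fnmatch build the regex for pandas' own filter; a sketch using fnmatch.translate to convert the glob pattern:
import fnmatch
df.filter(regex=fnmatch.translate('test_g*'), axis='index')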
You can also just use pandas' filter function with regex, and filter on the index:
In [8]: df.filter(regex=r".+g.+", axis='index')
Out[8]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
You can also just use like:
df.filter(like="g", axis='index')
Out[12]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0

Is it possible to turn a 3D array to coordinate system?

Is it possible to take a 3D array and turn it into a coordinate system? My array consists of 0s and 1s. If the value is 1, I want to take the xyz coordinate. In the end I want to output all coordinates to a csv file.
import nibabel as nib
coord = []
img = nib.load('test.nii').get_fdata().astype(int)
test.nii array:
[[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 1 ... 1 1 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 1 1 1]
  [0 1 0 ... 0 0 0]]

 [[1 0 0 ... 0 0 0]
  [0 0 1 ... 0 0 0]
  [0 1 0 ... 0 0 0]
  ...
  [0 1 0 ... 0 0 0]
  [0 1 0 ... 0 0 0]
  [0 0 0 ... 1 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 1 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 1 0 ... 0 1 1]]

 ...

 [[0 0 0 ... 1 0 0]
  [0 0 1 ... 0 0 0]
  [0 0 1 ... 0 0 0]
  ...
  [0 0 0 ... 1 0 0]
  [0 0 0 ... 1 0 0]
  [0 0 0 ... 1 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 1]
  ...
  [0 1 0 ... 0 0 0]
  [1 0 0 ... 0 0 0]
  [1 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 1 0]
  [0 1 0 ... 0 0 0]]]
That might not necessarily be the best solution, but let's keep it simple (it would be great if the framework did that for us, but... well):
data = [[[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 1, 1, 1],
         [0, 1, 0, 0, 0, 0]],
        [[1, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 0],
         [0, 1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0],
         [0, 0, 0, 1, 0, 0]]]
for x in range(len(data)):
    for y in range(len(data[x])):
        for z in range(len(data[x][y])):
            if data[x][y][z] == 1:
                print(f"{x} {y} {z}")
yields:
0 2 2
0 2 3
0 2 4
0 4 3
0 4 4
0 4 5
0 5 1
1 0 0
1 1 2
1 2 1
1 3 1
1 4 1
1 5 3
Using np.where() you can get the row, column, and depth indices of the elements that satisfy your condition.
Try this:
row_idx, col_idx, depth_idx = np.where(img==1)
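To then write the coordinates to a CSV file, as asked, one option is np.argwhere, which returns the (x, y, z) triples directly (a sketch; the output filename is arbitrary):
import numpy as np
coords = np.argwhere(img == 1)  # shape (n, 3): one (x, y, z) row per voxel
np.savetxt('coords.csv', coords, fmt='%d', delimiter=',', header='x,y,z', comments='')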
