how to label multiple columns effectively using pandas - python

I have a dataframe that looks like the one below
a b c d e
1 0 0 0 0
0 2 0 0 0
3 0 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 1
For this dataframe I want to create a column called label
a b c d e label
1 0 0 0 0 cola
0 2 0 0 0 colb
3 0 0 0 0 cola
0 0 0 1 0 cold
0 0 1 0 0 colc
0 0 0 0 1 cole
The label is the name of the column holding the nonzero value, prefixed with col.
My prior code was df['label'] = df['a'].apply(lambda x: 1 if x!=0), but it doesn't work (a conditional expression needs an else clause, and this only looks at column a). Is there any way to return the expected result?

Try idxmax on axis=1:
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 3, 0, 0, 0],
                   'b': [0, 2, 0, 0, 0, 0],
                   'c': [0, 0, 0, 0, 1, 0],
                   'd': [0, 0, 0, 1, 0, 0],
                   'e': [0, 0, 0, 0, 0, 1]})
df['label'] = 'col' + df.idxmax(axis=1)
Output
a b c d e label
0 1 0 0 0 0 cola
1 0 2 0 0 0 colb
2 3 0 0 0 0 cola
3 0 0 0 1 0 cold
4 0 0 1 0 0 colc
5 0 0 0 0 1 cole
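idxmax works here because every nonzero entry is also the row maximum. If the data could ever contain negative values, a variant (a sketch, not from the thread) that flags the first nonzero column instead would be:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 3, 0, 0, 0],
                   'b': [0, 2, 0, 0, 0, 0],
                   'c': [0, 0, 0, 0, 1, 0],
                   'd': [0, 0, 0, 1, 0, 0],
                   'e': [0, 0, 0, 0, 0, 1]})

# ne(0) marks nonzero cells as True; idxmax returns the first True column per row
df['label'] = 'col' + df.ne(0).idxmax(axis=1)
print(df['label'].tolist())  # ['cola', 'colb', 'cola', 'cold', 'colc', 'cole']
```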

Related

Join DataFrames on Condition Pandas

I have the following two dataframes with binary values that I want to merge.
df1
Action Adventure Animation Biography
0 0 1 0 0
1 0 0 0 0
2 1 0 0 0
3 0 0 0 0
4 1 0 0 0
df2
Action Adventure Biography Comedy
0 0 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 1
4 1 0 0 0
I want to join these two dataframes so that the result has the union of their columns, with a 1 wherever either dataframe has a 1, and 0 otherwise.
Result
Action Adventure Animation Biography Comedy
0 0 1 0 0 0
1 0 0 0 1 0
2 1 0 0 0 0
3 0 0 0 0 1
4 1 0 0 0 0
I am stuck on this, so I do not have a proposed solution.
Let us add the two dataframes, then clip at an upper value of 1:
df1.add(df2, fill_value=0).clip(upper=1).astype(int)
Action Adventure Animation Biography Comedy
0 0 1 0 0 0
1 0 0 0 1 0
2 1 0 0 0 0
3 0 0 0 0 1
4 1 0 0 0 0
Thinking of it as a set problem may give you a solution. Have a look at the code:
print((df1 | df2).fillna(0).astype(int) | df2)
COMPLETE CODE:
import pandas as pd

df1 = pd.DataFrame({
    'Action':    [0, 0, 1, 0, 1],
    'Adventure': [1, 0, 0, 0, 0],
    'Animation': [0, 0, 0, 0, 0],
    'Biography': [0, 0, 0, 0, 0]
})
df2 = pd.DataFrame({
    'Action':    [0, 0, 1, 0, 1],
    'Adventure': [1, 0, 0, 0, 0],
    'Animation': [0, 0, 0, 0, 0],
    'Biography': [0, 1, 0, 0, 0],
    'Comedy':    [0, 0, 0, 1, 0]
})
print((df1 | df2).fillna(0).astype(int) | df2)
OUTPUT:
Action Adventure Animation Biography Comedy
0 0 1 0 0 0
1 0 0 0 1 0
2 1 0 0 0 0
3 0 0 0 0 1
4 1 0 0 0 0
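If you prefer an explicit alignment step, a hedged alternative (my own sketch, not from the thread) reindexes both frames to the union of their columns and ORs them elementwise:

```python
import pandas as pd

# the question's frames
df1 = pd.DataFrame({'Action':    [0, 0, 1, 0, 1],
                    'Adventure': [1, 0, 0, 0, 0],
                    'Animation': [0, 0, 0, 0, 0],
                    'Biography': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame({'Action':    [0, 0, 0, 0, 1],
                    'Adventure': [0, 0, 0, 0, 0],
                    'Biography': [0, 1, 0, 0, 0],
                    'Comedy':    [0, 0, 0, 1, 0]})

# align both frames on the union of columns (missing columns become 0),
# then take an elementwise OR
cols = df1.columns.union(df2.columns)
result = (df1.reindex(columns=cols, fill_value=0).astype(bool)
          | df2.reindex(columns=cols, fill_value=0).astype(bool)).astype(int)
print(result)
```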

Python lineage naming with clustered dataframe

I have a dataframe
sample1 0 0 0 0 0 1 1 1 1 1 1 1 1 L1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0 L1-1
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0 L1-1-1
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0 L1-1-1-1
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0 L1-1-2
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0 L1-1-2-1
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0 L1-1-3
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1 L1-2
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0 L1-2-1
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1 L1-2-2
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0 L2
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0 L2-1
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0 L2-1-1
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0 L2-1-1-1
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0 L2-2
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0 L2-3
As you can see, each row is clustered.
I want to assign a lineage-based label to each sample.
For example, sample1 will be lin1 because it is the first to appear, and sample2 will be lin1-1.
Sample3 will be lin1-1-1, and sample4 will be lin1-1-1-1.
Next, sample5 will be lin1-1-2, sample6 will be lin1-1-2-1...
Sample11 starts a new lineage, lin2.
My original idea for the naming was:
"sample1 is lin1; if the next sample is included in the previous sample, append a level (lin1 + "-1");
if not, increment the last number (lin(1+1))."
sample1 -> lin1
sample2 -> lin1-1 (sample2 is included in sample1)
sample3 -> lin1-1-1 (sample3 is included in sample2)
sample4 -> lin1-1-1-1 (sample4 is included in sample3)
sample5 -> lin1-1-2 (sample5 is not included in sample4)
.... logic like this.
I couldn't turn this logic into a Python script.
This can be done in several steps.
Step 1. Data preprocessing
Sort the data in descending order and remove duplicates, otherwise it may not work. Assume this has been done.
import numpy as np
data = '''sample1 0 0 0 0 0 1 1 1 1 1 1 1 1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0'''
data = [x.split() for x in data.split('\n')]
data = [x[1:] for x in data]
data = np.array(data, dtype=int)
data
array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])
Step 2. Encode the sample to position. Each element is a frozenset.
nrow, ncol = data.shape

def to_position(sample):
    ncol = len(sample)
    return frozenset(i for i in range(ncol) if sample[i] == 1)

position = [to_position(data[i]) for i in range(nrow)]
# print(position)
Step 3. Assign each sample position to a cluster, where the cluster is represented as a tuple for now.
def assign_cluster(sample, clusters, parent):
    if parent not in clusters:
        clusters[parent] = sample
    elif sample < clusters[parent]:
        # sample is a proper subset of the parent: descend to the first child
        parent = parent + (0,)
        assign_cluster(sample, clusters, parent)
    else:
        # not a subset: try the next sibling
        parent = parent[:-1] + (parent[-1] + 1,)
        assign_cluster(sample, clusters, parent)

clusters = {}
root = (0,)
clusters[root] = position[0]
for i in range(1, nrow):
    sample = position[i]
    assign_cluster(sample, clusters, parent=root)
# print(clusters)
Step 4. Convert cluster to string and show result.
def cluster_to_string(c):
    c = [str(_ + 1) for _ in c]
    return 'L' + '-'.join(c)

position_dict = {v: k for k, v in clusters.items()}
for sample in data:
    sample = to_position(sample)
    c = position_dict[sample]
    print(cluster_to_string(c))
L1
L1-1
L1-1-1
L1-1-1-1
L1-1-2
L1-1-2-1
L1-1-3
L1-2
L1-2-1
L1-2-2
L2
L2-1
L2-1-1
L2-1-1-1
L2-2
L2-3

Numpy way to integer-mask an array

I have a multi-class segmentation mask, e.g.
[1 1 1 2 2 2 2 3 3 3 3 3 3 2 2 2 2 4 4 4 4 4 4]
And going to need to get binary segmentation masks for each value
i.e.
[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]
Any elegant numpy way to do this?
Ideally with an example where I can set the 0 and 1 to other values, if I have to.
Just use ==, like this:
import numpy as np

a = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4])
mask1 = (a == 1) * 5
mask2 = (a == 2) * 5
mask3 = (a == 3) * 5
mask4 = (a == 4) * 5
for mask in [mask1, mask2, mask3, mask4]:
    print(mask)
This gives
[5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 5 5 5 5 0 0 0 0 0 0 5 5 5 5 0 0 0 0 0 0]
[0 0 0 0 0 0 0 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5]
You can manipulate the masks further in the same manner, e.g.
mask1[mask1==0] = 3
Native python approach:
You can use a dict comprehension over the unique values from set(<sequence>), converting each boolean equality to int to get 0/1 values:
>>> ls =[1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4]
>>> {v:[int(v==i) for i in ls] for v in set(ls)}
{1: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
2: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
3: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
4: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]}
Numpy approach:
Get the unique values with np.unique, expand the axis and transpose the array, then expand the axis of the list as well and repeat it n times (where n is the number of unique values); finally do the equality comparison and convert to integer type:
import numpy as np
ls = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4]
uniques = np.expand_dims(np.unique(ls), 0).T
result = (np.repeat(np.expand_dims(ls, 0), uniques.shape[0], 0)==uniques).astype(int)
OUTPUT:
print(result)
[[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]]
You can build the mask using np.arange and .repeat() and then use broadcasting and the == operator to generate the arrays:
a = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4])
mask = np.arange(a.min(), a.max()+1).repeat(a.size).reshape(-1, a.size)
a_masked = (a == mask).astype(int)
print(a_masked.shape)  # (4, 23)
print(a_masked)
# [[1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]]
Setting 0 and 1 to other values can be done via normal indexing:
a_masked[a_masked == 0] = 7
a_masked[a_masked == 1] = 42
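A more compact broadcasting variant of the same idea (a sketch) compares the array against np.unique with a new axis, so no repeat/reshape is needed:

```python
import numpy as np

a = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4])

# np.unique(a)[:, None] has shape (4, 1); broadcasting it against a (shape (23,))
# yields one 0/1 row per distinct value
masks = (a == np.unique(a)[:, None]).astype(int)
print(masks)
```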

is it possible to use fnmatch.filter on a pandas dataframe instead of regex? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I have a dataframe, as below, for example. I want my updated dataframe to keep only the tests whose names match a certain pattern. I was wondering if there is a way to do it with fnmatch instead of regex?
import pandas as pd

data = {'part1':  [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
        'part2':  [0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
        'part3':  [0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1],
        'part4':  [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
        'part5':  [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        'part6':  [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
        'part7':  [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1],
        'part8':  [1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1],
        'part9':  [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        'part10': [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
        'part11': [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        'part12': [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data, index=['test_gt1', 'test_gt2', 'test_gf3', 'test_gf4',
                               'test_gt5', 'test_gg6', 'test_gf7', 'test_gt8',
                               'test_gg9', 'test_gf10', 'test_gg11', 'test12'])
I want to create a new dataframe that only contains the test_gg, test_gf, or test_gt rows using fnmatch.filter. All the examples I have seen deal with lists, so how can I apply it to a dataframe?
Import fnmatch.filter (note that this shadows the built-in filter) and filter on the index:
from fnmatch import filter
In [7]: df.loc[filter(df.index, '*g*')]
Out[7]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
You can also just use pandas' filter function with regex, and filter on the index:
In [8]: df.filter(regex=r".+g.+", axis='index')
Out[8]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
You can also just use like:
df.filter(like="g", axis='index')
Out[12]:
part1 part2 part3 part4 part5 part6 part7 part8 part9 part10 part11 part12
test_gt1 0 0 0 0 1 1 1 1 1 1 0 0
test_gt2 1 1 1 0 0 1 1 0 0 1 1 1
test_gf3 0 0 0 0 1 1 1 1 1 1 0 0
test_gf4 0 1 1 1 0 1 1 1 0 1 0 1
test_gt5 0 1 0 1 0 1 0 1 0 1 0 1
test_gg6 0 0 0 0 1 1 1 1 1 1 0 0
test_gf7 1 1 1 0 0 1 1 0 0 1 0 1
test_gt8 0 1 1 1 0 1 1 1 0 1 0 0
test_gg9 1 0 1 0 1 0 1 0 1 0 1 0
test_gf10 0 1 0 1 0 1 0 1 0 1 0 1
test_gg11 0 0 0 0 0 0 0 0 0 0 0 0
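If you'd rather not shadow the built-in filter, fnmatch.translate converts the glob into a regex that pandas' own string matching accepts. A small sketch, using a stand-in frame with the same index style as the question:

```python
import pandas as pd
from fnmatch import translate

# small stand-in frame; only the index matters for the filtering
df = pd.DataFrame({'part1': [0, 1, 0, 1]},
                  index=['test_gt1', 'test_gf2', 'test_gg3', 'test12'])

# translate() turns the glob into an anchored regex, e.g. '(?s:test_g.*)\\Z'
pattern = translate('test_g*')
out = df[df.index.str.match(pattern)]
print(out.index.tolist())  # ['test_gt1', 'test_gf2', 'test_gg3']
```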

numpy for no repeating for two columns

Basically, I mark rising edges of A and B in columns AA and BB, but I do not want consecutive flags from the same column: once AA has a 1, the next 1 should come from BB rather than AA again, and vice versa.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype='int32'),
    'B': np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1], dtype='int32')
})
df2['AA'] = np.where(df2['A'] > df2['A'].shift(1), 1, 0)
df2['BB'] = np.where(df2['B'] > df2['B'].shift(1), 1, 0)
I am getting repeated 1s in BB. How can I avoid the repeats, so that after AA flags a 1 the next 1 comes from BB rather than the same column repeating?
df2
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 1
The result should be as follows:
if AA has a 1 in a previous row, the next 1 should come from BB rather than AA repeating; likewise, BB should not repeat its own 1.
Out[24]:
A B AA BB
0 1 0 0 0
1 1 0 0 0
2 0 0 0 0
3 1 0 1 0
4 0 1 0 1
5 1 0 1 0
6 1 0 0 0
7 0 1 0 1
8 0 1 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 0 1 0 0
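The thread shows no posted answer. One possible sketch (my own, not from the thread) post-processes AA and BB so that the kept flags alternate between the two columns:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype='int32'),
    'B': np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1], dtype='int32'),
})
df2['AA'] = np.where(df2['A'] > df2['A'].shift(1), 1, 0)
df2['BB'] = np.where(df2['B'] > df2['B'].shift(1), 1, 0)

# single pass down the rows: keep a flag only when it alternates with the
# last flag kept (if both fire in one row, AA wins -- an assumption)
last = None
for i in df2.index:
    if df2.at[i, 'AA'] == 1:
        if last == 'AA':
            df2.at[i, 'AA'] = 0   # suppress a repeated AA flag
        else:
            last = 'AA'
    elif df2.at[i, 'BB'] == 1:
        if last == 'BB':
            df2.at[i, 'BB'] = 0   # suppress a repeated BB flag
        else:
            last = 'BB'

print(df2)
```

On this data, the only change is row 12, where BB's 1 is dropped because the previous kept flag (row 7) was also BB, matching the expected output above.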
