How to mark first entry per group satisfying some criterion? - python

Let's say I have a dataframe where one column has values occurring multiple times, forming groups (column A in the snippet). Now I'd like to create a new column with, e.g., a 1 for the first 'x' (column C) entry per group, and 0 in the other ones.
I managed the first part, but I did not find a good way to include the condition on the 'x' values. Is there a good way of doing that?
import pandas as pd
df = pd.DataFrame(
    {
        "A": ["0", "0", "1", "2", "2", "2"],  # data to group by
        "B": ["a", "b", "c", "d", "e", "f"],  # some other irrelevant data to be preserved
        "C": ["y", "x", "y", "x", "y", "x"],  # only consider the 'x'
    }
)
target = pd.DataFrame(
    {
        "A": ["0", "0", "1", "2", "2", "2"],
        "B": ["a", "b", "c", "d", "e", "f"],
        "C": ["y", "x", "y", "x", "y", "x"],
        "D": [0, 1, 0, 1, 0, 0],  # 1 for the first entry per group of 'A' that has 'C' == 'x'
    }
)
# following partial solution doesn't account for filtering by 'x' in 'C'
df['D'] = df.groupby('A')['C'].transform(lambda x: [1 if i == 0 else 0 for i in range(len(x))])

In your case, slice the 'x' rows, then drop_duplicates on 'A', and assign the result back:
df['D'] = df.loc[df.C == 'x'].drop_duplicates('A').assign(D=1)['D']
df['D'] = df['D'].fillna(0)
df
Out[149]:
A B C D
0 0 a y 0.0
1 0 b x 1.0
2 1 c y 0.0
3 2 d x 1.0
4 2 e y 0.0
5 2 f x 0.0
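A fully vectorized sketch along the same lines (not from the original answer) that avoids the float/NaN intermediate: flag rows where 'C' == 'x' and where that row is the first such row within its 'A' group.
is_x = df['C'].eq('x')
# cumulative count of 'x' rows within each 'A' group; the first one reaches 1
df['D'] = (is_x & is_x.astype(int).groupby(df['A']).cumsum().eq(1)).astype(int)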

Remove a substring from a pandas dataframe column

I have a large (45K rows) dataset and I need to remove specific values from specific columns in a handful of cases. The dataset is large enough that I'd like to avoid using apply if at all possible.
Here's a sample dataset:
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"], "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})
My current solution is to use:
import numpy as np

(
    df.merge(
        drops.pivot(index="ID", columns="Column", values="Override").reset_index()[["ID", "S"]],
        how="left",
        on=["ID", "S"],
        indicator="_dropS",
    ).assign(S=lambda d_: d_.S.mask(d_._dropS == "both", np.nan))
)
But this only successfully removes one of the entries. My general Python knowledge is telling me to split the column S by the delimiter "/", remove the matching entry, and join the list back together again (there may be more than two entries in the S column), but I can't seem to make that work within the DataFrame without using apply.
Edited to add the goal state: column S should have the entries 'n', 'o', ''. The final entry could be NaN as well.
Is there a reasonable way to do this without a separate function call?
IIUC, here is one solution that gives the expected output; I have no idea about the performance. I would be interested in your feedback on that.
# from your sample data
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"], "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})
# one row per ID holding the value that should be removed from column 'S'
pivoted_rules = drops.pivot(index="ID", columns="Column", values="Override").rename(columns={'S': 'compare_S'})
# align the rules with the data on 'ID'; rows without a rule get a harmless fill value
res = pd.concat([df.set_index('ID'), pivoted_rules], axis=1).fillna('fill_value')
# split 'S' on '/', drop the tokens that match the rule, and join the remainder back with '/'
res['S'] = ['/'.join([x for x in a if x != b]) for a, b in zip(res['S'].str.split('/'), res['compare_S'])]
res = res.drop('compare_S', axis=1).reset_index()
print(res)
ID T S
0 30 C n
1 40 D o
2 50 E
Didn't use apply :)
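A different sketch (not from the original answers) that also avoids apply and does not hard-code the removed token: map each ID to its Override value, explode 'S' into tokens, drop the flagged tokens, and re-join per row. It assumes, as in the sample data, that all rules target column 'S'.
to_remove = drops.set_index("ID")["Override"]                # ID -> token to remove
tokens = df.assign(S=df["S"].str.split("/")).explode("S")    # one row per token, original row index kept
tokens = tokens[tokens["S"] != tokens["ID"].map(to_remove)]  # drop tokens flagged for removal
df["S"] = tokens.groupby(level=0)["S"].agg("/".join)         # re-join the surviving tokens per row
df["S"] = df["S"].fillna("")                                 # rows where every token was removed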
To remove specific values from specific columns, you can use .str.replace:
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
df.loc[:, 'S'] = df['S'].str.replace(r'[/p]', '', regex=True)
The result:
ID T S
0 30 C n
1 40 D o
2 50 E

Removing labels in a dataset based on values from a column in a different dataset

I have the df shown below, whose data is used for an exercise on virus spreading.
df1
Node Target Node_Label Target_Label
A B 1 0
B A 0 1
C A 1 1
C D 1 1
I need to remove the labels of Node/Target based on the column Selected in the df below:
Node Label Selected
A 1 True
B 0 False
C 1 True
D 1 False
E 0 False
F 1 False
G -1 True
The expected output therefore would be
Node  Target  Node_Label  Target_Label
A     B                   0
B     A       0
C     A
C     D                   1
How can I remove the labels in df1 based on the Selected values in df2?
Would a filter condition, where I check the Selected condition in df2 and apply it to df1, be a good approach in this case?
You could try this:
import pandas as pd

df1 = pd.DataFrame(
    {
        "Node": {0: "A", 1: "B", 2: "C", 3: "C"},
        "Target": {0: "B", 1: "A", 2: "A", 3: "D"},
        "Node_Label": {0: 1, 1: 0, 2: 1, 3: 1},
        "Target_Label": {0: 0, 1: 1, 2: 1, 3: 1},
    }
)
df2 = pd.DataFrame(
    {
        "Node": {0: "A", 1: "B", 2: "C", 3: "D", 4: "E", 5: "F", 6: "G"},
        "Label": {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: -1},
        "Selected": {0: "True", 1: "False", 2: "True", 3: "False", 4: "False", 5: "False", 6: "True"},
    }
)
mask = df2.loc[df2["Selected"] == "True", "Node"].to_list()  # ['A', 'C', 'G']
df1.loc[df1["Node"].isin(mask), "Node_Label"] = ""
df1.loc[df1["Target"].isin(mask), "Target_Label"] = ""
print(df1)
# Output
  Node Target Node_Label Target_Label
0    A      B                       0
1    B      A          0
2    C      A
3    C      D                       1
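If Selected is stored as real booleans rather than the strings "True"/"False" (a hedged assumption about the original data), the same idea can be written with boolean indexing, masking with missing values instead of empty strings:
selected_nodes = df2.loc[df2["Selected"], "Node"]  # assumes 'Selected' is a boolean column
df1["Node_Label"] = df1["Node_Label"].mask(df1["Node"].isin(selected_nodes))
df1["Target_Label"] = df1["Target_Label"].mask(df1["Target"].isin(selected_nodes))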

Is it possible to merge two pandas dataframes based on indices and column names?

I have two dataframes:
left = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
    },
    index=[0, 1, 2, 3],
)
right = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
Is it possible to merge them based on the index and 'Col' of the left and the column names of the right?
I need to get the following result:
result = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
        "Val": ["D0", "C1", "B2", "A3"],
    },
)
Try with:
import numpy as np
left['new'] = right.values[np.arange(len(left)), right.columns.get_indexer(left.Col)]
left
Out[129]:
Col new
0 D D0
1 C C1
2 B B2
3 A A3
Notice, we used to have DataFrame.lookup for this, but it has been deprecated; the above is one NumPy-based alternative to lookup.
The reason I do not use the index here is that NumPy has no notion of an index, so we need positions to pick up the correct values. Most of the time the index is the same as the position, but they can differ.
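Along the same positional lines, a hedged equivalent using numpy.take_along_axis (a sketch, assuming the same left/right as above):
import numpy as np
idx = right.columns.get_indexer(left['Col'])  # column position for each row of 'left'
left['new'] = np.take_along_axis(right.to_numpy(), idx[:, None], axis=1)[:, 0]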
Another solution:
left["new"] = right.apply(lambda x: x[left.loc[x.name, "Col"]], axis=1)
print(left)
Prints:
Col new
0 D D0
1 C C1
2 B B2
3 A A3
Alternative approach (convert columns to index with melt and then merge):
left['id'] = left.index
m = right.melt(ignore_index=False, var_name="Col", value_name="Val")
m['id'] = m.index
result = pd.merge(left, m, on=["id", "Col"])[["Col", "Val"]]
It is faster than the apply-based solution but slower than the accepted answer.
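Yet another sketch (not from the original answers): stack right into a Series keyed by (row index, column name) pairs and look those pairs up directly:
stacked = right.stack()  # MultiIndex (row index, column name) -> value
left['new'] = stacked.reindex(pd.MultiIndex.from_arrays([left.index, left['Col']])).to_numpy()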

Is it possible to merge two pandas dataframes based on column name?

I have two dataframes:
left = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
    },
)
right = pd.DataFrame(
    {
        "A": ["A0"],
        "B": ["B0"],
        "C": ["C0"],
        "D": ["D0"],
    },
)
Is it possible to merge them based on 'Col' of the left and the column names of the right?
I need to get the following result:
result = pd.DataFrame(
    {
        "Col": ["D", "C", "B", "A"],
        "Val": ["D0", "C0", "B0", "A0"],
    },
)
You can do it with a pretty straightforward .map:
In [319]: left['Val'] = left['Col'].map(right.T[0])
In [320]: left
Out[320]:
Col Val
0 D D0
1 C C0
2 B B0
3 A A0
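An equivalent sketch without the transpose: right.iloc[0] is already a Series indexed by the column names, so it can be mapped directly.
left['Val'] = left['Col'].map(right.iloc[0])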
Try join with the transposition (T or transpose()):
import pandas as pd
left = pd.DataFrame({
    "Col": ["D", "C", "B", "A"],
})
right = pd.DataFrame({
    "A": ["A0"],
    "B": ["B0"],
    "C": ["C0"],
    "D": ["D0"],
})
new_df = left.join(right.T, on='Col').rename(columns={0: 'Val'})
print(new_df)
new_df:
Col Val
0 D D0
1 C C0
2 B B0
3 A A0
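If an actual merge is preferred, a sketch (assuming the same left/right as above) can first reshape right with melt so that its column names become a 'Col' column, then do a left merge:
melted = right.melt(var_name='Col', value_name='Val')  # column names of 'right' become rows
result = left.merge(melted, on='Col', how='left')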

Building new pandas DataFrame using dict of row selections (cannot reindex from a duplicate axis)

I have a pandas DataFrame and several lists of row indices. My goal is to use these row indices to create columns in a new DataFrame, based on the corresponding values in a given column of the original DataFrame, and to make boxplots from this. My lists of row indices are associated with names, so I represent them as a dictionary of lists of row indices.
The following small example works as expected:
import pandas as pd
df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6],
        "col2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
    index=["a", "b", "c", "d", "e", "f"])
lists_of_indices = {
    "A": ["a", "c", "d"],
    "B": ["b", "c", "f"],
    "D": ["a", "d"]}
new_df = pd.DataFrame(
    {list_name: df.loc[id_list]["col1"] for (list_name, id_list) in lists_of_indices.items()})
new_df.plot.box()
However, with my real data, I end up with a ValueError: cannot reindex from a duplicate axis.
What could be the problem, and how do I fix it?
As the error message suggests, some of the lists of indices may contain duplicates. Simply deduplicating each list (for example, via a set) can solve the issue.
Here is an example that reproduces the error:
import pandas as pd
df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6],
        "col2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
    index=["a", "b", "c", "d", "e", "f"])
lists_of_indices = {
    "A": ["a", "c", "d"],
    "B": ["b", "c", "f", "c"],  # Note the extra "c"
    "D": ["a", "d"]}
new_df = pd.DataFrame(
    {list_name: df.loc[id_list]["col1"] for (list_name, id_list) in lists_of_indices.items()})
new_df.plot.box()
And here is how to fix it:
new_df = pd.DataFrame(
    {list_name: df.loc[list(set(id_list))]["col1"] for (list_name, id_list) in lists_of_indices.items()})
It might however be worthwhile to check why some of these lists of indices contain duplicates in the first place.
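If the original order of the selected rows matters (a plain set does not preserve it), a hedged variant uses pd.unique, which drops duplicates while keeping the first-seen order:
new_df = pd.DataFrame(
    {list_name: df.loc[pd.unique(id_list)]["col1"] for (list_name, id_list) in lists_of_indices.items()})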
