As part of a classification problem, I am working on a DataFrame containing multiple label columns.
My dataframe has this form:
df = pd.DataFrame([['a', 1, 1],
['b', 1, 0],
['c', 0, 0]] , columns=['col1', 'label1', 'label2'])
>>> col1 label1 label2
0 a 1 1
1 b 1 0
2 c 0 0
As I do not want more than one true label per row, I want to duplicate only the offending rows and split the labels between the copies, as follows:
>>> col1 label1 label2
0 a 1 0 # Modified original row
1 a 0 1 # Duplicated & modified row
2 b 1 0
3 c 0 0
Only the row with value "a" is duplicated / regularized.
At the moment I do that in a for loop, replicating the rows in a second DataFrame, appending it and dropping all the "invalid" rows.
Is there a cleaner / more efficient way to do that?
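For context, a rough sketch of the kind of loop I mean (illustrative only, not my exact code):
# illustrative sketch only: replicate multi-label rows, then drop the originals
extra_rows = []
for idx, row in df.iterrows():
    if row['label1'] + row['label2'] > 1:        # more than one true label
        for col in ['label1', 'label2']:
            new_row = row.copy()
            new_row[['label1', 'label2']] = 0    # clear all labels...
            new_row[col] = 1                     # ...then set exactly one
            extra_rows.append(new_row)
        df = df.drop(idx)                        # drop the "invalid" original row
df = pd.concat([df, pd.DataFrame(extra_rows)], ignore_index=True)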
>>> cols = [x for x in df.columns if x != 'col1']
>>> res = pd.concat([df[['col1', x]] for x in cols])
>>> res.fillna(0, inplace=True)
>>> res = res.drop_duplicates()
>>> res.sort_values(by='col1', inplace=True)
>>> res.reset_index(drop=True, inplace=True)
>>> res
col1 label1 label2
0 a 1 0
1 a 0 1
2 b 1 0
3 b 0 0
4 c 0 0
You can also use df.iterrows(), as follows:
for index, row in df.iterrows():
    if row['label1'] + row['label2'] == 2:
        df = pd.concat((df, pd.DataFrame({'col1': [row['col1']], 'label1': [0], 'label2': [1]})), ignore_index=True)
        df = pd.concat((df, pd.DataFrame({'col1': [row['col1']], 'label1': [1], 'label2': [0]})), ignore_index=True)
        df.drop(index, inplace=True)
Result :
col1 label1 label2
1 b 1 0
2 c 0 0
3 a 0 1
4 a 1 0
Then you can sort by the values in col1.
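For example (just a one-liner sketch):
df = df.sort_values(by='col1').reset_index(drop=True)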
Here is a somewhat intuitive way of thinking about the problem. First, filter for just the rows where both labels are equal to 1. Make two new dataframes from them, zeroing out one label column in each.
Then concatenate the rows that did not have both labels equal to one with the two new dataframes.
mask_ones = (df['label1'] == 1) & (df['label2'] == 1)
df_ones = df[mask_ones]
df_not_ones = df[~mask_ones]
df_final = pd.concat([df_not_ones,
                      df_ones.replace({'label2':{1:0}}),
                      df_ones.replace({'label1':{1:0}})]).sort_values('col1')
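If I'm reading the pieces right, the result should come out like this (reset_index is only added here to tidy the index):
print(df_final.reset_index(drop=True))
col1 label1 label2
0 a 1 0
1 a 0 1
2 b 1 0
3 c 0 0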
Split the data into two dataframes: the unique rows and the rows to duplicate.
For the rows to duplicate, take the col1 + label1 columns, concat them with col1 + label2, and fill NaN with 0.
Then concat the unique and duplicated dataframes back into one:
df = pd.DataFrame([['a', 1, 1],
['b', 1, 0],
['c', 0, 0]], columns=['col1', 'label1', 'label2'])
mask = (df['label1'] == 1) & (df['label2'] == 1)
df_dup, df_uq = df[mask], df[~mask]
df_dup = pd.concat([df_dup[['col1', x]] for x in df_dup.columns if x != 'col1']).fillna(0)
df = pd.concat([df_dup, df_uq], ignore_index=True)
print(df)
col1 label1 label2
0 a 1.0 0.0
1 a 0.0 1.0
2 b 1.0 0.0
3 c 0.0 0.0
Something like this:
df = pd.DataFrame([['a', 1, 1],
['b', 1, 0],
['c', 0, 0]] , columns=['col1', 'label1', 'label2'])
df2 = pd.DataFrame()
df2["col1"] = df["col1"]
df2["label2"] = df["label2"]
df.drop(labels="label2", axis=1, inplace=True)
result = pd.concat([df, df2], ignore_index=True)  # df.append was removed in pandas 2.0
result.fillna(value=0, inplace=True)
result = result.sort_values(by="col1")
Result:
col1 label1 label2
0 a 1.000000 0.000000
3 a 0.000000 1.000000
1 b 1.000000 0.000000
4 b 0.000000 0.000000
2 c 0.000000 0.000000
5 c 0.000000 0.000000
Finally, you could drop the duplicates:
result = result.drop_duplicates()
I have created a data frame as follows:
np.random.seed(0)
lvl0 = ['A','B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0,2, size=df.shape)
for v in lvl0:
    df.loc[(v, 'mode'), :] = np.nan
df.sort_index(inplace=True)
Which gives
c1 c2
A mode NaN NaN
x 0 1
y 1 0
z 1 1
B mode NaN NaN
x 1 1
y 1 1
z 1 0
I then calculate the modes for group A and B for both columns as follows
for c in cols:
    modes = df.groupby(level=0)[c].agg(pd.Series.mode)
    df.loc[(lvl0, 'mode'), c] = modes.values
which results in
c1 c2
A mode 1 1
x 0 1
y 1 0
z 1 1
B mode 1 1
x 1 1
y 1 1
z 1 0
I now want to remove all rows where the values are equal to the mode of the corresponding first level group (A-B), which in this case should result in
c1 c2
A mode 1 1
x 0 1
y 1 0
B mode 1 1
z 1 0
I could achieve the desired result by looping over lvl0, but I was wondering if there was a more elegant solution, perhaps using groupby.
Additionally, I wonder if there is a way to add the mode rows when the modes are calculated, as opposed to adding empty (NaN) rows beforehand. If I don't add the empty rows beforehand, the line df.loc[(lvl0, 'mode'), c] = modes.values gives me a KeyError on 'mode'.
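For reference, the kind of loop over lvl0 I had in mind is roughly this (a sketch only, assuming the mode rows have already been filled in as above):
# per top-level group, keep the mode row plus every row that differs from it somewhere
keep = []
for v in lvl0:
    sub = df.loc[v]
    mode_row = sub.loc['mode']
    mask = (sub.index == 'mode') | sub.ne(mode_row).any(axis=1)
    keep.append(sub[mask])
result = pd.concat(keep, keys=lvl0)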
Use:
np.random.seed(0)
lvl0 = ['A','B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0,2, size=df.shape)
df.sort_index(inplace=True)
# get the first mode per group
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])
# compare the rows (second index level dropped) with the group modes,
# keep the differing rows, and prepend the mode rows
df = (pd.concat([modes.assign(tmp='mode').set_index('tmp', append=True),
                 df[df.droplevel(1).ne(modes).any(axis=1).to_numpy()]])
        .sort_index())
print (df)
c1 c2
tmp
A mode 1 1
x 0 1
y 1 0
B mode 1 1
z 1 0
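A possible alternative (just a sketch, not tested) is to build the same row filter with a groupby transform on the original df, i.e. before the reassignment above; transform broadcasts each group's mode back to every row, and the mode rows would still need to be prepended as above:
# broadcast each group's per-column mode to every row of that group
group_modes = df.groupby(level=0).transform(lambda s: s.mode().iat[0])
# keep only the rows that differ from their group's mode in at least one column
filtered = df[df.ne(group_modes).any(axis=1)]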
The task is to separate the numbers from each element of the respective row of a pandas DataFrame.
The column name is col1 and the column contains elements in the form of a list:
['a_1', 'b_2', 'c_3', 'd_4']
IIUC, you can do this in at least these two ways:
df = pd.DataFrame()
df['col1'] = ['a_1', 'b_2', 'c_3', 'd_4']
df
Input dataframe:
col1
0 a_1
1 b_2
2 c_3
3 d_4
Option 1, using .str.split:
df['col1'].str.split('_', expand=True)
Output:
0 1
0 a 1
1 b 2
2 c 3
3 d 4
Option 2, using .str.extract with regex:
df['col1'].str.extract(r'_(\d+)')
Output:
0
0 1
1 2
2 3
3 4
df = pd.DataFrame()
df['col1'] = [['a_1', 'b_2', 'c_3', 'd_4']]
Input:
col1
0 [a_1, b_2, c_3, d_4]
Doing:
df['col1'] = df['col1'].apply(lambda x: [i.split('_')[1] for i in x])
Output:
col1
0 [1, 2, 3, 4]
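A possible alternative for the list column (just a sketch; it assumes the index values are unique so the exploded pieces can be regrouped):
# explode the lists, split off the number, then collect back into a list per row
df['col1'] = df['col1'].explode().str.split('_').str[1].groupby(level=0).agg(list)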
Let's suppose I have a following dataframe:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'val': [0, 0, 0, 0, 0]})
I want to modify the column val with values from other dataframes like these:
df1 = pd.DataFrame({'id': [2, 3], 'val': [1, 1]})
df2 = pd.DataFrame({'id': [1, 5], 'val': [2, 2]})
I need a function merge_values_into_df that works in such a way that the following calls produce the result below:
df = merge_values_into_df(df1, on='id', field='val')
df = merge_values_into_df(df2, on='id', field='val')
print(df)
id val
0 1 2
1 2 1
2 3 1
3 4 0
4 5 2
I need an efficient (by CPU and memory) solution because I want to apply the approach to huge dataframes.
Use DataFrame.update after converting id to the index in all DataFrames:
df = df.set_index('id')
df1 = df1.set_index('id')
df2 = df2.set_index('id')
df.update(df1)
df.update(df2)
df = df.reset_index()
print (df)
id val
0 1 2.0
1 2 1.0
2 3 1.0
3 4 0.0
4 5 2.0
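If you want this wrapped into a reusable function along the lines of the question, a minimal sketch could look like the following (note it passes the target frame explicitly rather than matching the exact signature from the question):
def merge_values_into_df(target, source, on, field):
    # align both frames on the key column and let update overwrite the matching values
    target = target.set_index(on)
    target.update(source.set_index(on)[[field]])
    return target.reset_index()

df = merge_values_into_df(df, df1, on='id', field='val')
df = merge_values_into_df(df, df2, on='id', field='val')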
You can concat all dataframes and drop duplicated id values, keeping the last occurrence of each id.
out = pd.concat([df, df1, df2]).drop_duplicates('id', keep='last')
print(out.sort_values('id', ignore_index=True))
# Output
id val
0 1 2
1 2 1
2 3 1
3 4 0
4 5 2
I am calculating unique values per row. However, I want to exclude the value 0 and then count the unique values.
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0],}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 0
1 2 4 4
2 3 0 0
Expected output
col1 col2 col3 uniques
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
df.nunique(axis=1) counts them, but it includes all values.
To do this you can simply replace zeroes with NaN values.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0]}
df = pd.DataFrame(data=d)
df['uniques'] = df.replace(0, np.nan).nunique(axis=1)
Try this:
def func(x):
    s = set(x)
    s.discard(0)
    return len(s)

df['uniq'] = df.apply(lambda x: func(x), axis=1)
A slightly more concise version without using replace:
df['unique'] = df[df!=0].nunique(axis=1)
df
Output:
col1 col2 col3 unique
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
Let's say we have
In [0]: df = pd.DataFrame(data={'col1': [1, 2, 3], 'col2': [3, 4, 5]})
In [1]: df
Out[2]:
col1 col2
0 1 3
1 2 4
2 3 5
What I need is to divide df[1:] by df[:-1] and get a dataframe as a result, like this:
Out[3]:
col1 col2
0 2.0 1.3333333333333333
1 1.5 1.25
But of course I'm getting
Out[3]:
col1 col2
0 NaN NaN
1 1.0 1.0
2 NaN NaN
I've tried using iloc for slicing, but got the same result. I'm aware of df.values, but I need a dataframe as a result. Thank you so much.
You can divide the numpy arrays created by .values and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df[1:].values / df[:-1].values, columns=df.columns)
print (df1)
col1 col2
0 2.0 1.333333
1 1.5 1.250000
Or set the same index on both DataFrames:
df1 = df[1:].reset_index(drop=True).div(df[:-1].reset_index(drop=True))
a = df[1:]
b = df[:-1]
b.index = a.index
df1 = a / b
df2 = df[1:]
df1 = df2.div(df[:-1].set_index(df2.index))
print (df1)
col1 col2
1 2.0 1.333333
2 1.5 1.250000
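Another option (a sketch, not part of the answer above) is to avoid slicing altogether and use shift, since dividing each row by the previous one is the same operation:
# divide each row by the previous row; the first row has no predecessor, so drop it
df1 = df.div(df.shift()).dropna().reset_index(drop=True)
print (df1)
col1 col2
0 2.0 1.333333
1 1.5 1.250000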