pandas create one column equal to another if condition is satisfied - python

I have two columns as below:
id, colA, colB
0, a, 13
1, a, 52
2, b, 16
3, a, 34
4, b, 946
etc...
I am trying to create a third column, colC, that is colB if colA == a, otherwise 0.
This is what I was thinking, but it does not work:
data[data['colA']=='a']['colC'] = data[data['colA']=='a']['colB']
I was also thinking about using np.where(), but I don't think that would work here.
Any thoughts?

Use loc with a mask to assign:
In [300]:
df.loc[df['colA'] == 'a', 'colC'] = df['colB']
df['colC'] = df['colC'].fillna(0)
df
Out[300]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
EDIT
or use np.where:
In [296]:
df['colC'] = np.where(df['colA'] == 'a', df['colB'], 0)
df
Out[296]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0

df['colC'] = df[df['colA'] == 'a']['colB']
should result in exactly what you want; rows where colA != 'a' become NaN.
Then replace the NaNs with zeroes using df.fillna(0, inplace=True)
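For reference, a minimal runnable sketch of both working approaches (column names taken from the question; note that np.where has to read from the source column colB):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['a', 'a', 'b', 'a', 'b'],
                   'colB': [13, 52, 16, 34, 946]})

# Approach 1: boolean mask with .loc, starting from a column of zeros
df['colC'] = 0
df.loc[df['colA'] == 'a', 'colC'] = df['colB']

# Approach 2: vectorised np.where, reading from colB
df['colC2'] = np.where(df['colA'] == 'a', df['colB'], 0)
```

Both columns come out as [13, 52, 0, 34, 0].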

Related

Get the column names for 2nd largest value for each row in a Pandas dataframe

Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(row):
    return row.nlargest(2).min()
print(df.apply(second_largest, axis=1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for positions of second largest values:
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should work, but it is slow:
def second_largest(row):
    return row.nlargest(2).idxmin()
print(df.apply(second_largest, axis=1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
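As a self-contained sanity check of the argsort approach on the sample frame: argsort sorts each row ascending, so column -2 of the result holds the position of the second-largest value in that row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 3, 1, 2],
                   'b': [20, 10, 40, 50, 30],
                   'c': [25, 20, 5, 15, 10]})

# Positions of the second-largest value per row, mapped to column names
second = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
```

This yields ['b', 'b', 'c', 'c', 'c'], matching the expected output above.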

How to multiply combinations of two sets of pandas dataframe columns

I would like to multiply the combinations of two sets of columns
Let say there is a dataframe below:
import pandas as pd
df = {'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9], 'D':[0,1,2]}
df = pd.DataFrame(df)
Now, I want to multiply AC, AD, BC, BD
This is like multiplying the combination of [A,B] and [C,D]
I tried to use itertools but failed to figure it out.
So, the desired output will be like:
output = {'AC':[7,16,27], 'AD':[0,2,6], 'BC':[28,40,54], 'BD':[0,5,12]}
output = pd.DataFrame(output)
IIUC, you can try
import itertools
cols1 = ['A', 'B']
cols2 = ['C', 'D']
for col1, col2 in itertools.product(cols1, cols2):
    df[col1+col2] = df[col1] * df[col2]
print(df)
A B C D AC AD BC BD
0 1 4 7 0 7 0 28 0
1 2 5 8 1 16 2 40 5
2 3 6 9 2 27 6 54 12
Or create a new dataframe:
out = pd.concat([df[col1].mul(df[col2]).to_frame(col1+col2)
for col1, col2 in itertools.product(cols1, cols2)], axis=1)
print(out)
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
You can directly multiply multiple columns if you convert them to NumPy arrays first with .to_numpy()
>>> df[["A","B"]].to_numpy() * df[["C","D"]].to_numpy()
array([[ 7, 0],
[16, 5],
[27, 12]])
You can also unzip a collection of wanted pairs and use them to get a new view of your DataFrame (indexing the same column multiple times is fine), then multiply the resulting NumPy arrays together:
>>> import math # standard library for prod()
>>> pairs = ["AC", "AD", "BC", "BD"] # wanted pairs
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs) # new dataframe
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
This extends to any number of pairs (triples, octuples of columns..) as long as they're the same length (beware: zip() will silently drop extra columns beyond the shortest group)
>>> pairs = ["ABD", "BCD"]
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs)
ABD BCD
0 0 0
1 10 40
2 36 108
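For the two-column case specifically, the itertools loop above can also be collapsed into a single dict comprehension (a stylistic alternative, not from the answers above):

```python
import itertools
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [0, 1, 2]})
cols1, cols2 = ['A', 'B'], ['C', 'D']

# Build every pairwise product in one pass, keyed by the joined names
out = pd.DataFrame({c1 + c2: df[c1] * df[c2]
                    for c1, c2 in itertools.product(cols1, cols2)})
```

This produces the same AC/AD/BC/BD frame as the concat version.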

set values to columns but using indexes

I have a dataframe like this:
a b c d e
42 1 0 1 0
42 0 0 0 1
42 0 1 0 0
42 1 1 0 0
I want to do something that can make all 1 in column bcde equal to column a, so it will basically be this:
a b c d e
42 42 0 42 0
42 0 0 0 42
42 0 42 0 0
42 42 42 0 0
so it should be something like df.loc[df['b']==1,'b'] = df['a'] but for all bcde. the whole dataframe is hundreds of columns so i can not use .loc to set values, and iloc can not set value like loc.
You can simply use pandas.DataFrame.where() and give it the df["a"] Series as replacement. This way, it will work for both numbers and strings.
# df is your DataFrame
# New DataFrame...
new_df = df.where(df != 1, df["a"], axis="index")
# ...or in place:
df.where(df != 1, df["a"], axis=0, inplace=True)
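A minimal runnable sketch on the sample data from the question: wherever the condition df != 1 is False (the cell equals 1), the row's value from column a is substituted.

```python
import pandas as pd

df = pd.DataFrame({'a': [42, 42, 42, 42],
                   'b': [1, 0, 0, 1],
                   'c': [0, 0, 1, 1],
                   'd': [1, 0, 0, 0],
                   'e': [0, 1, 0, 0]})

# Replace every 1 with the same row's value from column 'a';
# axis='index' aligns the replacement Series row-wise
new_df = df.where(df != 1, df['a'], axis='index')
```

Column a is untouched (42 is never 1), and every 1 in b through e becomes 42.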

dstack with no multiple layers

I have the following dataset with a numeric outcome and several columns that represent tags for the numeric outcome
outcome tag1 tag2 tag3
340 a b a
123 a a b
23 d c b
54 c a c
I would like to unstack the dataset by creating rows from the column values (a, b, c..) and the relative outcome value, something like:
tag outcome
a 340
a 123
a 54
b 340
b 123
b 23
c 54
c 23
d 23
How?
Thanks!
Use unstack:
In [321]: (df.set_index('outcome').unstack()
.reset_index(level=0, drop=True)
.sort_values()
.reset_index(name='tag')
.drop_duplicates())
Out[321]:
outcome tag
0 340 a
1 123 a
3 54 a
5 340 b
6 123 b
7 23 b
8 54 c
9 23 c
11 23 d
Use:
df1 = (df.melt('outcome', value_name='tag')
.sort_values('tag')
.drop('variable', axis=1)
.dropna(subset=['tag'])
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by melt
Change order by sort_values
Remove column by drop
Remove possible missing values by dropna
Last remove duplicates by drop_duplicates
Or:
df1 = (df.set_index('outcome')
.stack()
.sort_values()
.reset_index(level=1, drop=True)
.reset_index(name='tag')
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by set_index with stack
Series is sorted by sort_values
Double reset_index - first remove level 1 and then create a column from the index
Last remove duplicates by drop_duplicates
print (df1)
tag outcome
0 a 340
1 a 123
7 a 54
4 b 340
9 b 123
10 b 23
3 c 54
6 c 23
2 d 23
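Putting the melt variant together as a self-contained sketch (sample data reconstructed from the question; there are no missing values here, so dropna is omitted):

```python
import pandas as pd

df = pd.DataFrame({'outcome': [340, 123, 23, 54],
                   'tag1': ['a', 'a', 'd', 'c'],
                   'tag2': ['b', 'a', 'c', 'a'],
                   'tag3': ['a', 'b', 'b', 'c']})

# melt turns the tag columns into rows, then duplicate
# (tag, outcome) pairs are dropped
df1 = (df.melt('outcome', value_name='tag')
         .sort_values('tag')
         .drop('variable', axis=1)
         .drop_duplicates()[['tag', 'outcome']])
```

The result holds one row per unique (tag, outcome) pair, nine rows in total.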

How to combine single and multiindex Pandas DataFrames

I am trying to concatenate multiple Pandas DataFrames, some of which use multi-indexing and others use single indices. As an example, let's consider the following single indexed dataframe:
> import pandas as pd
> df1 = pd.DataFrame({'single': [10,11,12]})
> df1
single
0 10
1 11
2 12
Along with a multiindex dataframe:
> level_dict = {}
> level_dict[('level 1','a','h')] = [1,2,3]
> level_dict[('level 1','b','j')] = [5,6,7]
> level_dict[('level 2','c','k')] = [10, 11, 12]
> level_dict[('level 2','d','l')] = [20, 21, 22]
> df2 = pd.DataFrame(level_dict)
> df2
level 1 level 2
a b c d
h j k l
0 1 5 10 20
1 2 6 11 21
2 3 7 12 22
Now I wish to concatenate the two dataframes. When I try to use concat it flattens the multiindex as follows:
> df3 = pd.concat([df2,df1], axis=1)
> df3
(level 1, a, h) (level 1, b, j) (level 2, c, k) (level 2, d, l) single
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If instead I append a single column to the multiindex dataframe df2 as follows:
> df2['single'] = [10,11,12]
> df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
How can I instead generate this dataframe from df1 and df2 with concat, merge, or join?
I don't think you can avoid converting the single index into a MultiIndex. This is probably the easiest way; you could also convert after joining.
In [48]: df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
In [49]: pd.concat([df2, df1], axis=1)
Out[49]:
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you're just appending one column you could access df1 essentially as a series:
df2[df1.columns[0]] = df1.iloc[:, 0]
df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you could have just made a series in the first place it would be a little easier to read. This command would do the same thing:
ser1 = df1.iloc[:, 0] # make df1's column into a series
df2[ser1.name] = ser1
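For reference, a self-contained version of the MultiIndex-padding approach (data from the question): the flat column of df1 is padded to three levels with empty strings so concat can align it with df2.

```python
import pandas as pd

df1 = pd.DataFrame({'single': [10, 11, 12]})
df2 = pd.DataFrame({('level 1', 'a', 'h'): [1, 2, 3],
                    ('level 1', 'b', 'j'): [5, 6, 7],
                    ('level 2', 'c', 'k'): [10, 11, 12],
                    ('level 2', 'd', 'l'): [20, 21, 22]})

# Pad df1's flat column names to three levels, then concatenate
df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
df3 = pd.concat([df2, df1], axis=1)
```

The result keeps df2's three-level columns and adds ('single', '', '') alongside them.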
