pandas create one column equal to another if condition is satisfied - python

I have two columns as below:
id, colA, colB
0, a, 13
1, a, 52
2, b, 16
3, a, 34
4, b, 946
etc...
I am trying to create a third column, colC, that is colB if colA == a, otherwise 0.
This is what I was thinking, but it does not work:
data[data['colA']=='a']['colC'] = data[data['colA']=='a']['colB']
I was also thinking about using np.where(), but I don't think that would work here.
Any thoughts?

Use loc with a mask to assign:
In [300]:
df.loc[df['colA'] == 'a', 'colC'] = df['colB']
df['colC'] = df['colC'].fillna(0)
df
Out[300]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0
EDIT
or use np.where:
In [296]:
df['colC'] = np.where(df['colA'] == 'a', df['colB'], 0)
df
Out[296]:
id colA colB colC
0 0 a 13 13
1 1 a 52 52
2 2 b 16 0
3 3 a 34 34
4 4 b 946 0

df['colC'] = df[df['colA'] == 'a']['colB']
should result in exactly what you want; rows where colA != 'a' become NaN.
Then replace the NaNs with zeroes using df.fillna(0, inplace=True)
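For reference, a minimal runnable sketch of both working approaches (column names taken from the question; note that np.where has to read from the source column colB):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['a', 'a', 'b', 'a', 'b'],
                   'colB': [13, 52, 16, 34, 946]})

# Approach 1: boolean mask with .loc, starting from a column of zeros
df['colC'] = 0
df.loc[df['colA'] == 'a', 'colC'] = df['colB']

# Approach 2: vectorised np.where, reading from colB
df['colC2'] = np.where(df['colA'] == 'a', df['colB'], 0)
```

Both columns come out as [13, 52, 0, 34, 0].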

Related

Get the column names for 2nd largest value for each row in a Pandas dataframe

Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(row):
    return row.nlargest(2).min()
print(df.apply(second_largest, axis=1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for positions of second largest values:
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should work, but it is slow:
def second_largest(row):
    return row.nlargest(2).idxmin()
print(df.apply(second_largest, axis=1))
If efficiency is important, numpy.argpartition is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
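As a self-contained sanity check of the argsort approach on the sample frame: argsort sorts each row ascending, so column -2 of the result holds the position of the second-largest value in that row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 3, 1, 2],
                   'b': [20, 10, 40, 50, 30],
                   'c': [25, 20, 5, 15, 10]})

# Positions of the second-largest value per row, mapped to column names
second = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
```

This yields ['b', 'b', 'c', 'c', 'c'], matching the expected output above.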

How to multiply combinations of two sets of pandas dataframe columns

I would like to multiply the combinations of two sets of columns
Let say there is a dataframe below:
import pandas as pd
df = {'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9], 'D':[0,1,2]}
df = pd.DataFrame(df)
Now, I want to multiply AC, AD, BC, BD
This is like multiplying the combination of [A,B] and [C,D]
I tried to use itertools but failed to figure it out.
So, the desired output will be like:
output = {'AC':[7,16,27], 'AD':[0,2,6], 'BC':[28,40,54], 'BD':[0,5,12]}
output = pd.DataFrame(output)
IIUC, you can try
import itertools
cols1 = ['A', 'B']
cols2 = ['C', 'D']
for col1, col2 in itertools.product(cols1, cols2):
    df[col1+col2] = df[col1] * df[col2]
print(df)
A B C D AC AD BC BD
0 1 4 7 0 7 0 28 0
1 2 5 8 1 16 2 40 5
2 3 6 9 2 27 6 54 12
Or create a new dataframe:
out = pd.concat([df[col1].mul(df[col2]).to_frame(col1+col2)
for col1, col2 in itertools.product(cols1, cols2)], axis=1)
print(out)
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
You can directly multiply multiple columns if you convert them to NumPy arrays first with .to_numpy()
>>> df[["A","B"]].to_numpy() * df[["C","D"]].to_numpy()
array([[ 7, 0],
[16, 5],
[27, 12]])
You can also unzip a collection of wanted pairs and use them to get a new view of your DataFrame (indexing the same column multiple times is fine), then multiply the resulting NumPy arrays together:
>>> import math # standard library for prod()
>>> pairs = ["AC", "AD", "BC", "BD"] # wanted pairs
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs) # new dataframe
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
This extends to any number of pairs (triples, octuples of columns..) as long as they're the same length (beware: zip() will silently drop extra columns beyond the shortest group)
>>> pairs = ["ABD", "BCD"]
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs)
ABD BCD
0 0 0
1 10 40
2 36 108
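For the two-column case specifically, the itertools loop above can also be collapsed into a single dict comprehension (a stylistic alternative, not from the answers above):

```python
import itertools
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [0, 1, 2]})
cols1, cols2 = ['A', 'B'], ['C', 'D']

# Build every pairwise product in one pass, keyed by the joined names
out = pd.DataFrame({c1 + c2: df[c1] * df[c2]
                    for c1, c2 in itertools.product(cols1, cols2)})
```

This produces the same AC/AD/BC/BD frame as the concat version.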

set values to columns but using indexes

I have a dataframe like this:
a b c d e
42 1 0 1 0
42 0 0 0 1
42 0 1 0 0
42 1 1 0 0
I want to do something that can make all 1 in column bcde equal to column a, so it will basically be this:
a b c d e
42 42 0 42 0
42 0 0 0 42
42 0 42 0 0
42 42 42 0 0
so it should be something like df.loc[df['b']==1,'b'] = df['a'] but for all bcde. the whole dataframe is hundreds of columns so i can not use .loc to set values, and iloc can not set value like loc.
You can simply use pandas.DataFrame.where() and give it the df["a"] Series as replacement. This way, it will work for both numbers and strings.
# df is your DataFrame
# New DataFrame...
new_df = df.where(df != 1, df["a"], axis="index")
# ...or in place:
df.where(df != 1, df["a"], axis=0, inplace=True)
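A minimal runnable sketch on the sample data from the question: wherever the condition df != 1 is False (the cell equals 1), the row's value from column a is substituted.

```python
import pandas as pd

df = pd.DataFrame({'a': [42, 42, 42, 42],
                   'b': [1, 0, 0, 1],
                   'c': [0, 0, 1, 1],
                   'd': [1, 0, 0, 0],
                   'e': [0, 1, 0, 0]})

# Replace every 1 with the same row's value from column 'a';
# axis='index' aligns the replacement Series row-wise
new_df = df.where(df != 1, df['a'], axis='index')
```

Column a is untouched (42 is never 1), and every 1 in b through e becomes 42.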

dstack with no multiple layers

I have the following dataset with a numeric outcome and several columns that represent tags for the numeric outcome
outcome tag1 tag2 tag3
340 a b a
123 a a b
23 d c b
54 c a c
I would like to unstack the dataset by creating rows from the column values (a, b, c..) and the relative outcome value, something like:
tag outcome
a 340
a 123
a 54
b 340
b 123
b 23
c 54
c 23
d 23
How?
Thanks!
Use unstack:
In [321]: (df.set_index('outcome').unstack()
.reset_index(level=0, drop=True)
.sort_values()
.reset_index(name='tag')
.drop_duplicates())
Out[321]:
outcome tag
0 340 a
1 123 a
3 54 a
5 340 b
6 123 b
7 23 b
8 54 c
9 23 c
11 23 d
Use:
df1 = (df.melt('outcome', value_name='tag')
.sort_values('tag')
.drop('variable', axis=1)
.dropna(subset=['tag'])
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by melt
Change order by sort_values
Remove column by drop
Remove possible missing values by dropna
Last remove duplicates by drop_duplicates
Or:
df1 = (df.set_index('outcome')
.stack()
.sort_values()
.reset_index(level=1, drop=True)
.reset_index(name='tag')
.drop_duplicates()[['tag','outcome']])
Explanation:
Reshape by set_index with stack
Series is sorted by sort_values
Double reset_index - first remove level 1 and then create a column from the index
Last remove duplicates by drop_duplicates
print (df1)
tag outcome
0 a 340
1 a 123
7 a 54
4 b 340
9 b 123
10 b 23
3 c 54
6 c 23
2 d 23
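Putting the melt variant together as a self-contained sketch (sample data reconstructed from the question; there are no missing values here, so dropna is omitted):

```python
import pandas as pd

df = pd.DataFrame({'outcome': [340, 123, 23, 54],
                   'tag1': ['a', 'a', 'd', 'c'],
                   'tag2': ['b', 'a', 'c', 'a'],
                   'tag3': ['a', 'b', 'b', 'c']})

# melt turns the tag columns into rows, then duplicate
# (tag, outcome) pairs are dropped
df1 = (df.melt('outcome', value_name='tag')
         .sort_values('tag')
         .drop('variable', axis=1)
         .drop_duplicates()[['tag', 'outcome']])
```

The result holds one row per unique (tag, outcome) pair, nine rows in total.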

How to combine single and multiindex Pandas DataFrames

I am trying to concatenate multiple Pandas DataFrames, some of which use multi-indexing and others use single indices. As an example, let's consider the following single indexed dataframe:
> import pandas as pd
> df1 = pd.DataFrame({'single': [10,11,12]})
> df1
single
0 10
1 11
2 12
Along with a multiindex dataframe:
> level_dict = {}
> level_dict[('level 1','a','h')] = [1,2,3]
> level_dict[('level 1','b','j')] = [5,6,7]
> level_dict[('level 2','c','k')] = [10, 11, 12]
> level_dict[('level 2','d','l')] = [20, 21, 22]
> df2 = pd.DataFrame(level_dict)
> df2
level 1 level 2
a b c d
h j k l
0 1 5 10 20
1 2 6 11 21
2 3 7 12 22
Now I wish to concatenate the two dataframes. When I try to use concat it flattens the multiindex as follows:
> df3 = pd.concat([df2,df1], axis=1)
> df3
(level 1, a, h) (level 1, b, j) (level 2, c, k) (level 2, d, l) single
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If instead I append a single column to the multiindex dataframe df2 as follows:
> df2['single'] = [10,11,12]
> df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
How can I instead generate this dataframe from df1 and df2 with concat, merge, or join?
I don't think you can avoid converting the single index into a MultiIndex. This is probably the easiest way; you could also convert after joining.
In [48]: df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
In [49]: pd.concat([df2, df1], axis=1)
Out[49]:
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you're just appending one column you could access df1 essentially as a series:
df2[df1.columns[0]] = df1.iloc[:, 0]
df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you could have just made a series in the first place it would be a little easier to read. This command would do the same thing:
ser1 = df1.iloc[:, 0] # make df1's column into a series
df2[ser1.name] = ser1
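For reference, a self-contained version of the MultiIndex-padding approach (data from the question): the flat column of df1 is padded to three levels with empty strings so concat can align it with df2.

```python
import pandas as pd

df1 = pd.DataFrame({'single': [10, 11, 12]})
df2 = pd.DataFrame({('level 1', 'a', 'h'): [1, 2, 3],
                    ('level 1', 'b', 'j'): [5, 6, 7],
                    ('level 2', 'c', 'k'): [10, 11, 12],
                    ('level 2', 'd', 'l'): [20, 21, 22]})

# Pad df1's flat column names to three levels, then concatenate
df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
df3 = pd.concat([df2, df1], axis=1)
```

The result keeps df2's three-level columns and adds ('single', '', '') alongside them.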
