How to make values in one column dependent on another? - python

I have one column called "A" with only values 0 or 1. I have another column called "B". If column A value=0, I want the column B value to equal "dog". If column A value=1, I want the column B value to equal "cat".
Sample DataFrame column:
print(df)
A
0 0
1 1
Is there anyway to fill the B column as such without a for loop?
Desired:
print(df)
A B
0 0 Cat
1 1 Dog
Thanks

Can Simply can try Below using map...
Sample Data
print(df)
A
0 0
1 1
2 0
3 1
4 1
5 1
6 0
7 0
8 1
Result:
df['B'] = df['A'].map({0:'Cat', 1:'Dog'})
print(df)
A B
0 0 Cat
1 1 Dog
2 0 Cat
3 1 Dog
4 1 Dog
5 1 Dog
6 0 Cat
7 0 Cat
8 1 Dog

Next time, please post your research and minimal reproducible code. See comments
import pandas as pd
d = {'A': [0, 1, 0]}
df = pd.DataFrame(data=d)
m = {0: ("Dog"), 1: ("Cat")}
df['B'] = df['A'].map(lambda x: m[x])

Related

Iteration over columns problems on condition

I have a Dataset with several columns and a row named "Total" that stores values between 1 and 4.
I want to iterate over each column, and based on the number stored in the row "Total", add a new row with "yes" or "No".
I also have a list "columns" for iteration.
All data are float64
I´m new at python and i don't now if i'm doing the righ way because i´m getting all "yes".
for c in columns:
if dados_por_periodo.at['Total',c] < 4:
dados_por_periodo.loc['VA'] = "yes"
else:
dados_por_periodo.loc['VA'] = "no"
My dataset:
Thanks.
You can try this, I hope it works for you:
import pandas as pd
import numpy as np
#creation of a dummy df
columns='A B C D E F G H I'.split()
data=[np.random.choice(2, len(columns)).tolist() for col in range(3)]
data.append([1,8,1,1,2,4,1,4,1]) #not real sum of values, just dummy values for testing
index=['Otono','Inverno', 'Primavera','Totals']
df=pd.DataFrame(data, columns=columns, index=index)
df.index.name='periodo' #just adding index name
print(df)
####Adition of the new 'yes/no' row
df = pd.concat([ df, pd.DataFrame([np.where(df.iloc[len(df.index)-1,:].lt(4),'Yes','No')], columns=df.columns, index=['VA'])])
df.index.name='periodo' #just adding index name
print(df)
Output:
df
A B C D E F G H I
periodo
Otono 1 0 0 1 1 0 1 1 0
Inverno 0 1 1 1 0 1 1 1 1
Primavera 1 1 0 0 1 1 1 1 0
Totals 1 8 1 1 2 4 1 4 1
df(with added row)
A B C D E F G H I
periodo
Otono 1 0 0 1 1 0 1 1 0
Inverno 0 1 1 1 0 1 1 1 1
Primavera 1 1 0 0 1 1 1 1 0
Totals 1 8 1 1 2 4 1 4 1
VA Yes No Yes Yes Yes No Yes No Yes
Also try to put some data sample the next times instead of images of the dataset, so someone can help you in a better way :)

Groupby selected rows by a condition on a column value and then transform another column

This seems to be easy but couldn't find a working solution for it:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
'B': [1,1,2,2,3],
'C': [1,1,2,3,4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on values of column A, then groupby based on values of column B, and finally transform values of column C into sum. something along the line of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split dataframe and do it. I am looking more for a solution that does it without the need of split+concat/merge. Thank you.
Is it just
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
df[~s])
)
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
.assign(D=(~df['A'].isin([2])).cumsum())
.groupby(['D','A','B'])['C'].sum()
.reset_index('D',drop=True)
.reset_index()
)
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4

Checking multiple columns condition in pandas

I want to create a new column in my dataframe that places the name of the column in the row if only that column has a value of 8 in the respective row, otherwise the new column's value for the row would be "NONE". For the dataframe df, the new column df["New_Column"] = ["NONE","NONE","A","NONE"]
df = pd.DataFrame({"A": [1, 2,8,3], "B": [0, 2,4,8], "C": [0, 0,7,8]})
Cool problem.
Find the 8-fields in each row: df==8
Count them: (df==8).sum(axis=1)
Find the rows where the count is 1: (df==8).sum(axis=1)==1
Select just those rows from the original dataframe: df[(df==8).sum(axis=1)==1]==8
Find the 8-fields again: df[(df==8).sum(axis=1)==1]==8)
Find the columns that hold the True values with idxmax (because True>False): (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
Fill in the gaps with "NONE"
To summarize:
df["New_Column"] = (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
df["New_Column"] = df["New_Column"].fillna("NONE")
# A B C New_Column
#0 1 0 0 NONE
#1 2 2 0 NONE
#2 8 4 7 A
#3 3 8 8 NONE
# I added another line as a proof of concept
#4 0 8 0 B
You can accomplish this using idxmax and a mask:
out = (df==8).idxmax(1)
m = ~(df==8).any(1) | ((df==8).sum(1) > 1)
df.assign(col=out.mask(m))
A B C col
0 1 0 0 NaN
1 2 2 0 NaN
2 8 4 7 A
3 3 8 8 NaN
Or do:
df2=df[(df==8)]
df['New_Column']=(df2[(df2!=df2.dropna(thresh=2).values[0]).all(1)].dropna(how='all')).idxmax(1)
df['New_Column'] = df['New_Column'].fillna('NONE')
print(df)
dropna + dropna again + idxmax + fillna. that's all you need for this.
Output:
A B C New_Column
0 1 0 0 NONE
1 2 2 0 NONE
2 8 4 7 A
3 3 8 8 NONE

Set value of first item in slice in python pandas

So I would like make a slice of a dataframe and then set the value of the first item in that slice without copying the dataframe. For example:
df = pandas.DataFrame(numpy.random.rand(3,1))
df[df[0]>0][0] = 0
The slice here is irrelevant and just for the example and will return the whole data frame again. Point being, by doing it like it is in the example you get a setting with copy warning (understandably). I have also tried slicing first and then using ILOC/IX/LOC and using ILOC twice, i.e. something like:
df.iloc[df[0]>0,:][0] = 0
df[df[0]>0,:].iloc[0] = 0
And neither of these work. Again- I don't want to make a copy of the dataframe even if it id just the sliced version.
EDIT:
It seems there are two ways, using a mask or IdxMax. The IdxMax method seems to work if your index is unique, and the mask method if not. In my case, the index is not unique which I forgot to mention in the initial post.
I think you can use idxmax for get index of first True value and then set by loc:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print (df)
0
0 1
1 3
2 0
3 0
4 3
print ((df[0] == 0).idxmax())
2
df.loc[(df[0] == 0).idxmax(), 0] = 100
print (df)
0
0 1
1 3
2 100
3 0
4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
print (df)
0
0 1
1 200
2 0
3 0
4 3
EDIT:
Solution with not unique index:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df = df.reset_index()
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.set_index('index')
df.index.name = None
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT1:
Solution with MultiIndex:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df.index = [np.arange(len(df.index)), df.index]
print (df)
0
0 1 1
1 2 3
2 2 0
3 3 0
4 4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.reset_index(level=0, drop=True)
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT2:
Solution with double cumsum:
np.random.seed(1)
df = pd.DataFrame([4,0,4,7,4], index=[1,2,2,3,4])
print (df)
0
1 4
2 0
2 4
3 7
4 4
mask = (df[0] == 0).cumsum().cumsum()
print (mask)
1 0
2 1
2 2
3 3
4 4
Name: 0, dtype: int32
df.loc[mask == 1, 0] = 200
print (df)
0
1 4
2 200
2 4
3 7
4 4
Consider the dataframe df
df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))
print(df)
A
0 1
1 2
2 3
3 4
4 5
Create some arbitrary slice slc
slc = df[df.A > 2]
print(slc)
A
2 3
3 4
4 5
Access the first row of slc within df by using index[0] and loc
df.loc[slc.index[0]] = 0
print(df)
A
0 1
1 2
2 0
3 4
4 5
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(6,1),index=[1,2,2,3,3,3])
df[1] = 0
df.columns=['a','b']
df['b'][df['a']>=0.5]=1
df=df.sort(['b','a'],ascending=[0,1])
df.loc[df[df['b']==0].index.tolist()[0],'a']=0
In this method extra copy of the dataframe is not created but an extra column is introduced which can be dropped after processing. To choose any index instead o the first one you can change the last line as follows
df.loc[df[df['b']==0].index.tolist()[n],'a']=0
to change any nth item in a slice
df
a
1 0.111089
2 0.255633
2 0.332682
3 0.434527
3 0.730548
3 0.844724
df after slicing and labelling them
a b
1 0.111089 0
2 0.255633 0
2 0.332682 0
3 0.434527 0
3 0.730548 1
3 0.844724 1
After changing value of first item in slice (labelled as 0) to 0
a b
3 0.730548 1
3 0.844724 1
1 0.000000 0
2 0.255633 0
2 0.332682 0
3 0.434527 0
So using some of the answers I managed to find a one liner way to do this:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print df
0
0 1
1 3
2 0
3 0
4 3
df.loc[(df[0] == 0).cumsum()==1,0] = 1
0
0 1
1 3
2 1
3 0
4 3
Essentially this is using the mask inline with a cumsum.

How to use trailing rows on a column for calculations on that same column | Pandas Python

I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks if (data['a'] + data['b']) is greater then the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
According to your statement: 'new column that asks if (data['a'] + data['b']) is greater then the previous value of that same column' I can suggest you to solve it by this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','3']})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
But it doesn't looking for 'previous value of that same column'.
If you would try to write df['c'].shift(1) in np.where(), it gonna to raise KeyError: 'c'.

Categories

Resources