I've searched for an answer for the past 30 min, but the only solutions are either for a single column or in R. I have a dataset in which I want to change the ('Y/N') values to 1 and 0 respectively. I feel like copying and pasting the code below 17 times is very inefficient.
df.loc[df.infants == 'n', 'infants'] = 0
df.loc[df.infants == 'y', 'infants'] = 1
df.loc[df.infants == '?', 'infants'] = 1
My solution is the following. This doesn't cause an error, but the values in the dataframe don't change. I'm assuming I need to do something like df = df_new, but how do I do this?
for coln in df:
    for value in coln:
        if value == 'y':
            value = '1'
        elif value == 'n':
            value = '0'
        else:
            value = '1'
EDIT: There are 17 columns in this dataset, but there is another dataset I'm hoping to tackle which contains 56 columns.
republican n y n.1 y.1 y.2 y.3 n.2 n.3 n.4 y.4 ? y.5 y.6 y.7 n.5 y.8
0 republican n y n y y y n n n n n y y y n ?
1 democrat ? y y ? y y n n n n y n y y n n
2 democrat n y y n ? y n n n n y n y n n y
3 democrat y y y n y y n n n n y ? y y y y
4 democrat n y y n y y n n n n n n y y y y
This should work:
# df.columns is an attribute, not a method, so it takes no parentheses
for col in df.columns:
    df.loc[df[col] == 'n', col] = 0
    df.loc[df[col] == 'y', col] = 1
    df.loc[df[col] == '?', col] = 1
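As for why the nested loop in the question leaves the frame untouched: iterating over a DataFrame yields column labels rather than cells, and rebinding a loop variable never writes anything back. The sketch below illustrates this on a small made-up frame (the column names `party`, `infants` and `water` and the three rows are assumptions for illustration, not the real dataset), then writes the mapped values back explicitly.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the voting data.
df = pd.DataFrame({
    "party": ["republican", "democrat", "democrat"],
    "infants": ["n", "?", "y"],
    "water": ["y", "y", "n"],
})

# Iterating a DataFrame yields its column labels, not its cell values.
labels = [col for col in df]

# Rebinding the loop variable only changes a local name, never the frame.
for value in df["infants"]:
    value = 1
unchanged = df["infants"].tolist()  # still the original strings

# Assigning the result back to the frame is what makes the change stick.
mapping = {"n": 0, "y": 1, "?": 1}
vote_cols = df.columns.drop("party")  # leave the party label alone
df[vote_cols] = df[vote_cols].apply(lambda s: s.map(mapping))
```

Selecting the vote columns by dropping `party` is one option; any list of column names would work the same way.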
I think the simplest way is to use replace with a dict:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.choice(['n','y','?'], size=(5,5)),
                  columns=list('ABCDE'))
print (df)
A B C D E
0 n n n ? ?
1 n ? y ? ?
2 ? ? y n n
3 n n ? n y
4 y ? ? n n
d = {'n':0,'y':1,'?':1}
df = df.replace(d)
print (df)
A B C D E
0 0 0 0 1 1
1 0 1 1 1 1
2 1 1 1 0 0
3 0 0 1 0 1
4 1 1 1 0 0
This should do:
df.infants = df.infants.map({'y': 1, 'n': 0, '?': 1})
Maybe you can try apply,
import pandas as pd
# create dataframe
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new.head()
# create def for category to number 0/1
def tran_cat_to_num(df):
    if df['sex'] == 'male':
        return 1
    elif df['sex'] == 'female':
        return 0
# create sex_new
df_new['sex_new']=df_new.apply(tran_cat_to_num,axis=1)
df_new
raw
number sex
0 1 male
1 2 female
2 3 female
3 4 female
4 5 male
after using apply
number sex sex_new
0 1 male 1
1 2 female 0
2 3 female 0
3 4 female 0
4 5 male 1
You can change the values using the map function.
Ex.:
# any value missing from the dict (such as '?') becomes NaN, so map it too
x = {'y': 1, 'n': 0, '?': 1}
for col in df.columns:
    df[col] = df[col].map(x)
This way you map each column of your dataframe.
All the solutions above are correct, but what you could also do is:
df["infants"] = df["infants"].replace("y", 1).replace("n", 0).replace("?", 1)
which, now that I read more carefully, is very similar to using replace with a dict!
Use dataframe.replace():
df.replace({'infants':{'y':1,'?':1,'n':0}},inplace=True)
Related
I have the following df:
df = data.frame("T" = c(1,2,3,5,1,3,2,5), "A" = c("0","0","1","1","0","1","0","1"), "B" = c("0","0","0","1","0","0","0","1"))
df
T A B
1 1 0 0
2 2 0 0
3 3 1 0
4 5 1 1
5 1 0 0
6 3 1 0
7 2 0 0
8 5 1 1
Columns A and B were computed as follows:
df['A'] = [int(x) for x in total_df["T"] >= 3]
df['B'] = [int(x) for x in total_df["T"] >= 5]
I have a data split:
train_size = 0.6
training = df.head(int(train_size * df.shape[0]))
test = df.tail(int((1 - train_size) * df.shape[0]))
Here is the question:
How can I pass row values from "T" to a list called 'tr_return' from 'training' where both columns "A" & "B" are == 1?
I tried this:
tr_returns = training[training['A' and 'B'] == 1]['T']
or
tr_returns = training[training['A'] == 1 and training['B'] == 1]['T']
But neither one works :( Any help will be appreciated!
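Both attempts fail for the same reason: Python's `and` cannot combine two Series element by element, whereas the `&` operator can, provided each comparison is wrapped in parentheses. A minimal sketch, assuming A and B are stored as integers as in the list comprehensions above, and using `df` in place of `total_df` (an assumption, since `total_df` is not shown):

```python
import pandas as pd

# Rebuild the example frame; A and B are integer flags derived from T.
df = pd.DataFrame({"T": [1, 2, 3, 5, 1, 3, 2, 5]})
df["A"] = (df["T"] >= 3).astype(int)
df["B"] = (df["T"] >= 5).astype(int)

train_size = 0.6
training = df.head(int(train_size * df.shape[0]))  # first 4 rows

# Element-wise AND of two boolean Series; the parentheses are required
# because & binds more tightly than ==.
mask = (training["A"] == 1) & (training["B"] == 1)
tr_returns = training.loc[mask, "T"].tolist()
```

If A and B actually hold the strings "0" and "1", the comparisons would need to be against "1" instead.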
Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to update the zeroes in the z column of the nf dataframe with the values from the z column of the mf dataframe, but only in the rows whose key in column x matches.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000
update aligns rows on the index, so you need to make the indexes of both dataframes match. Here is how you can do it:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()
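As an alternative that leaves the index untouched, the key column can be mapped through a lookup Series built from mf. This is only a sketch of a different technique than update, and it assumes the x values in mf are unique; the name `lookup` is introduced here for illustration.

```python
import pandas as pd

nf = pd.DataFrame({'x': ['a', 'b', 'c', 'd', 'e'],
                   'y': ['1', '2', '3', '4', '5'],
                   'z': ['0', '0', '0', '0', '0']})
mf = pd.DataFrame({'x': ['b', 'd', 'e'], 'z': ['10', '100', '1000']})

# Series indexed by key, used as a lookup table (keys must be unique).
lookup = mf.set_index('x')['z']

# Keys missing from the lookup map to NaN; fall back to the old z there.
nf['z'] = nf['x'].map(lookup).fillna(nf['z'])
```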
I have the following data frame:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
There are 4 columns: A, B, C and D. The intended outcome is to replace the contents of B with the contents of D wherever a condition on C is met; for this example, the condition is C == 1.
The intended output is
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df.drop('D', axis = 1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is another one:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
test_df['B'] = np.where(test_df['C']==1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: I just found out that a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
I don't know if inverse is the right word here, but I noticed recently that mask and where are "inverses" of each other. If you pass a ~ to the condition of a .where statement, then you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
Suppose I have a dataframe, df, consisting of a class of two objects, S, a set of co-ordinates associated with them, X and Y, and a value, V, that was measured there.
For example, the dataframe looks like this:
S X Y V
0 3 3 1
0 4 3 2
1 6 0 1
1 3 3 8
I would like to know the commands that allow me to group the X and Y coordinates associated with the class, S in a new binning. In this new picture, the new value of V should be the sum of the values in the bin for each class, S.
For example, suppose this co-ordinate system was initially binned between 0 and 10 in X and Y respectively. I would like to bin it between 0 and 2. This means:
Values from 0 < X <= 5, 0 < Y <= 5 in the old binning constitute the value 0;
Values from 5 < X <= 10, 5 < Y <= 10 in the old binning constitute the value 1;
Edit:
For further example, considering Dataframe df:
Row 1 has X = 3 and Y = 3. Since 0 < X <= 5 and 0 < Y <= 5, this falls into bin (0,0)
Row 2 has X = 4 and Y = 3. Since 0 < X <= 5 and 0 < Y <= 5, this also falls into bin (0,0).
Since Row 1 and 2 are observed in the same bin and are of the same class S, they are added along column V. This gives a combined row, X=0, Y=0, V = 1+2 =3
Row 3 has X = 6 and Y = 0. Since 5 < X <= 10 and 0 <= Y <= 5, this falls into bin (1,0)
Row 4 has X = 3 and Y = 3. Since 0 < X <= 5 and 0 < Y <= 5, this falls into bin (0,0). However, since the element is of class S=1, it is not added to anything, since we only add between shared classes.
The output should then be:
S X Y V
0 0 0 3
0 1 0 1
1 0 0 8
What commands must I use to achieve this?
This should do the trick:
data.loc[data['X'] <= 5, 'X'] = 0
data.loc[data['X'] > 5, 'X'] = 1
data.loc[data['Y'] <= 5, 'Y'] = 0
data.loc[data['Y'] > 5, 'Y'] = 1
data = data.groupby(['S', 'X', 'Y']).sum().reset_index()
For your example the output is:
S X Y V
0 0 0 0 3
1 1 0 0 8
2 1 1 0 1
I found this answer to be helpful.
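For more than two bins per axis, writing one .loc assignment per bin gets repetitive. A possible alternative, sketched here under the assumption that the bin edges are 0, 5 and 10 and that a value of exactly 0 belongs to the first bin, is to let pd.cut compute the bin index and then group as before. The names `edges`, `binned` and `result` are introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({"S": [0, 0, 1, 1],
                   "X": [3, 4, 6, 3],
                   "Y": [3, 3, 0, 3],
                   "V": [1, 2, 1, 8]})

edges = [0, 5, 10]  # assumed bin edges: [0, 5] -> 0, (5, 10] -> 1

binned = df.copy()
for axis in ["X", "Y"]:
    # labels=False returns the integer bin index instead of an interval;
    # include_lowest=True puts a value of exactly 0 into the first bin.
    binned[axis] = pd.cut(df[axis], bins=edges, labels=False,
                          include_lowest=True)

# Sum V within each (class, X bin, Y bin) combination.
result = binned.groupby(["S", "X", "Y"], as_index=False)["V"].sum()
```

Changing `edges` to a longer list would give a finer binning without touching the rest of the code.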
I have a dataframe, one column (col1) of which contains values either Y or N. I would like to assign values (random, not repetitive numbers) to the next column (col2) based on the values in col1 - if value in col1 equals to N, then value in col2 would be some number, if value in col1 equals to Y, then value in col2 would repeat the previous. I tried to create a for loop and iterate over rows using df.iterrows(), however the numbers in col2 were equal for all Ns.
Example of the dataframe I want to get:
df = pd.DataFrame({'col1': ['N', 'Y', 'Y', 'N', 'N', 'Y'], 'col2': [1, 1, 1, 2, 3, 3]})
where a new number is assigned in the other column for each new N, while for each Y the number from the previous row is repeated.
Assuming a DataFrame df:
df = pd.DataFrame(['N', 'Y', 'Y', 'N', 'N', 'Y'], columns=['YN'])
YN
0 N
1 Y
2 Y
3 N
4 N
5 Y
Using itertuples (no repetition):
np.random.seed(42)
# One distinct number per 'N' row, in shuffled order
arr = np.arange(1, len(df[df.YN == 'N']) + 1)
np.random.shuffle(arr)
cnt = 0
for idx, val in enumerate(df.itertuples()):
    if df.YN[idx] == 'N':
        df.loc[idx, 'new'] = arr[cnt]
        cnt += 1
    else:
        df.loc[idx, 'new'] = np.nan
# Each 'Y' row inherits the number from the row above it
df.new = df.new.ffill().astype(int)
df
YN new
0 N 1
1 Y 1
2 Y 1
3 N 2
4 N 3
5 Y 3
Using apply (repetition may arise with small number range):
np.random.seed(42)
df['new'] = df.YN.apply(lambda x: np.random.randint(10) if x == 'N' else np.nan).ffill().astype(int)
YN new
0 N 6
1 Y 6
2 Y 6
3 N 3
4 N 7
5 Y 7
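A loop-free variant is also possible, since a running count of the 'N' rows already gives every 'N' a fresh number that the following 'Y' rows share. The sketch below assumes the first row is 'N' (otherwise the leading rows would get group 0), and the column names `new` and `new_random` and the seed are chosen only for illustration. The second column relabels the groups through a random permutation, so the numbers are random but never repeat across groups.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"YN": ["N", "Y", "Y", "N", "N", "Y"]})

# Running count of 'N' rows: each 'N' starts a new group,
# and each 'Y' stays in the group of the row above it.
group = (df["YN"] == "N").cumsum()
df["new"] = group

# Optional: swap the sequential group numbers for shuffled distinct ones.
rng = np.random.default_rng(0)  # seed chosen arbitrarily
labels = rng.permutation(int(group.max())) + 1
df["new_random"] = labels[group.to_numpy() - 1]
```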