I have a dataframe like this:
A B C
1 1 1
2 2 2
3 3 3
4 1 1
I want to 'merge' the three columns into a new column D. The rule is: if there is at least one '1' in the row, D is '1'; otherwise it is '0'. How can I achieve this?
Use DataFrame.eq to compare the values, DataFrame.any to check for at least one True per row, and finally cast the boolean mask to integers:
df['D'] = df.eq(1).any(axis=1).astype(int)
print (df)
A B C D
0 1 1 1 1
1 2 2 2 0
2 3 3 3 0
3 4 1 1 1
Detail:
print (df.eq(1))
A B C
0 True True True
1 False False False
2 False False False
3 False True True
print (df.eq(1).any(axis=1))
0 True
1 False
2 False
3 True
dtype: bool
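The steps above can be reproduced end to end with the sample data from the question. This is only a minimal sketch: the way df is constructed and the cols list are assumptions made for illustration.

```python
import pandas as pd

# Sample data from the question (construction assumed for illustration).
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 1], "C": [1, 2, 3, 1]})

# Limit the comparison to the source columns, so that re-running the
# assignment does not also test an already existing D column.
cols = ["A", "B", "C"]
df["D"] = df[cols].eq(1).any(axis=1).astype(int)
print(df)
```

Selecting the columns explicitly is a design choice, not a requirement: with only A, B and C in the frame, df.eq(1) gives the same result.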
I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating conditions,
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
After evaluating the conditions, I want to aggregate the results per id:
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can combine pandas.get_dummies with an aggregation per group (here I am assuming min, since your example shows that 1/1/0 -> 0 and 1/1/1 -> 1; you can use other logic, e.g. last, if you want the last row per group after sorting by date):
out = (pd
       .get_dummies(df[['color', 'size']])
       .groupby(df['id'])
       .min()
      )
print(out)
Output:
color_blue color_yellow size_l
id
A1 0 0 1
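For the id/A/B/C frame shown in the question, one possible reading is that a flag should be True for an id only when it is True on every row of that id, which is consistent with the expected output given there. A sketch under that assumption, where groupby(...).all() plays the role of the min above (the flags frame is an illustrative helper):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "A": [1, 5, 6, 1, 4],
    "B": [1, 7, 9, 5, 6],
    "C": [1, 2, 3, 4, 2],
})

# Evaluate the three row-wise conditions as boolean columns.
flags = pd.DataFrame({
    "a_greater_than_b": df["A"] > df["B"],
    "b_greater_than_c": df["B"] > df["C"],
    "c_greater_than_a": df["C"] > df["A"],
})

# Assumed rule: a flag holds for an id only if it holds on all of its rows.
out = flags.groupby(df["id"]).all()
print(out)
```

If the intended rule is "True when any row matches", swapping all() for any() would be the corresponding change.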
Given the following dataframe:
col_1 col_2
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 1
False 2
True 2
False 2
False 2
True 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
False 2
How can I create a new index that helps identify when a True value is present in col_1? That is, whenever a True value appears in the first column, I would like to fill the new column backward with a number, starting from one. For example, this is the expected output for the above dataframe:
col_1 col_2 new_id
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 1 1
False 2 1
True 2 1 --------- ^ (fill with 1 and increase the counter)
False 2 2
False 2 2
True 2 2 --------- ^ (fill with 2 and increase the counter)
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
False 2 3
True 2 4 --------- ^ (fill with 3 and increase the counter)
The problem is that I do not know how to create the id, although I know that pandas provides a bfill method that may help here. So far I have tried to iterate with a simple for loop:
count = 0
for index, row in df.iterrows():
    if row['col_1'] == False:
        print(count+1)
    else:
        print(row['col_2'] + 1)
However, I do not know how to increase the counter to the next number. I also tried to create a function and then apply it to the dataframe:
def create_id(col_1, col_2):
    counter = 0
    if col_1 == True and col_2.bool() == True:
        return counter + 1
    else:
        pass
Nevertheless, I lose control of filling the column backward.
Just use cumsum:
df['new_id']=(df.col_1.cumsum().shift().fillna(0)+1).astype(int)
df
Out[210]:
col_1 col_2 new_id
0 False 1 1
1 False 1 1
2 False 1 1
3 False 1 1
4 False 1 1
5 False 1 1
6 False 1 1
7 False 1 1
8 False 1 1
9 False 1 1
10 False 1 1
11 False 1 1
12 False 1 1
13 False 1 1
14 False 2 1
15 True 2 1
16 False 2 2
17 False 2 2
18 True 2 2
19 False 2 3
20 False 2 3
21 False 2 3
22 False 2 3
23 False 2 3
24 False 2 3
25 False 2 3
26 False 2 3
27 False 2 3
28 False 2 3
29 False 2 3
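To see why the cumsum chain works, it can be split into steps on a shorter column. A sketch only: the eight-row col_1 below is made-up sample data, and seen and before are illustrative names.

```python
import pandas as pd

# Made-up sample: True marks the last row of each group.
df = pd.DataFrame(
    {"col_1": [False, False, True, False, True, False, False, True]}
)

# Number of True values up to and including each row.
seen = df["col_1"].cumsum()

# Number of True values strictly before each row; the first row has none.
before = seen.shift(fill_value=0)

# Ids start at 1, and a True row still belongs to the group it closes.
df["new_id"] = before + 1
print(df)
```

Passing fill_value=0 to shift keeps the column integer-typed, so the fillna(0) and astype(int) steps from the one-liner are not needed in this variant.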
If you aim to append the new_id column to your dataframe:
new_id = []
counter = 1
for index, row in df.iterrows():
    new_id += [counter]
    if row['col_1'] == True:
        counter += 1
df['new_id'] = new_id
I want to compare particular columns within each row: if they all hold the same value, extract that value into a new column; otherwise use 0.
If the example dataframe is as follows:
A B C D E F
13348 judte 1 1 1 1
54871 kfzef 1 1 0 1
89983 hdter 4 4 4 4
7543 bgfd 3 4 4 4
The result should be as follows:
A B C D E F Result
13348 judte 1 1 1 1 1
54871 kfzef 1 1 0 1 0
89983 hdter 4 4 4 4 4
7543 bgfd 3 4 4 4 0
I would be glad to hear some suggestions.
Use:
import numpy as np
cols = ['C','D','E','F']
df['Result'] = np.where(df[cols].eq(df[cols[0]], axis=0).all(axis=1), df[cols[0]], 0)
print (df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0
Detail:
First, compare all columns selected by the list of column names with the first column of cols, df[cols[0]], using eq:
print (df[cols].eq(df[cols[0]], axis=0))
C D E F
0 True True True True
1 True True False True
2 True True True True
3 True False False False
Then check whether all values per row are True with all:
print (df[cols].eq(df[cols[0]], axis=0).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
Finally, use numpy.where to assign the first column's values where the mask is True and 0 where it is False.
I think you need apply with nunique:
df['Result'] = df[['C','D','E','F']].apply(lambda x: x.iloc[0] if x.nunique()==1 else 0, axis=1)
Or using np.where:
df['Result'] = np.where(df[['C','D','E','F']].nunique(axis=1)==1, df['C'], 0)
print(df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0
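Both approaches can be checked side by side on the sample data. A sketch, assuming the frame is built as below; same and alt are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction assumed for illustration).
df = pd.DataFrame({
    "A": [13348, 54871, 89983, 7543],
    "B": ["judte", "kfzef", "hdter", "bgfd"],
    "C": [1, 1, 4, 3],
    "D": [1, 1, 4, 4],
    "E": [1, 0, 4, 4],
    "F": [1, 1, 4, 4],
})
cols = ["C", "D", "E", "F"]

# eq/all: a row qualifies when every column equals the first one.
same = df[cols].eq(df[cols[0]], axis=0).all(axis=1)
df["Result"] = np.where(same, df[cols[0]], 0)

# nunique: a row qualifies when it holds exactly one distinct value.
alt = np.where(df[cols].nunique(axis=1) == 1, df[cols[0]], 0)
print(df)
```

The two can disagree when the columns contain missing values: nunique skips NaN by default, while eq treats NaN as unequal to everything, including another NaN.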
I have the following dataframe:
df1 = pd.DataFrame({1:[1,2,3,4], 2:[1,2,4,5], 3:[8,1,5,6]})
df1
Out[7]:
1 2 3
0 1 1 8
1 2 2 1
2 3 4 5
3 4 5 6
and I would like to create a new column that shows the distance of the last column containing a particular value (2 in this case) from a reference column (3 in this example), or NaN if no such value is found in the row. The output would be something like:
df1
Out[11]:
1 2 3 dist
0 1 1 8 NaN
1 2 2 1 1
2 3 4 5 NaN
3 4 5 6 NaN
What would be an effective way of accomplishing this task?
I think you need to subtract the column name of the last 2 from 3, the name of the (last) reference column:
df1.columns = df1.columns.astype(int)
print((df1.columns.max() - df1.eq(2).iloc[:,::-1].idxmax(axis=1)).mask(lambda x: x == 0))
0 NaN
1 1.0
2 NaN
3 NaN
dtype: float64
Details:
Compare with 2:
print (df1.eq(2))
1 2 3
0 False False False
1 True True False
2 False False False
3 False False False
Reverse the order of the columns:
print (df1.eq(2).iloc[:,::-1])
3 2 1
0 False False False
1 False True True
2 False False False
3 False False False
Get the column name of the first True (because the columns are reversed, it is the last match):
print (df1.eq(2).iloc[:,::-1].idxmax(axis=1))
0 3
1 2
2 3
3 3
dtype: int64
Subtract from the max column name; note that this returns 0 both when the value is in the reference column and when no value matches:
print (df1.columns.max() - df1.eq(2).iloc[:,::-1].idxmax(1))
0 0
1 1
2 0
3 0
dtype: int64
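The ambiguity of the 0 can be avoided by masking on whether the row contains the value at all, instead of on the distance being 0. A sketch, assuming integer column names as in the example; the fifth row (7, 8, 2) is added here only to illustrate a match in the reference column, and ref, hits, last and dist are illustrative names.

```python
import pandas as pd

# Data from the question plus one extra row (7, 8, 2) added for illustration.
df1 = pd.DataFrame({1: [1, 2, 3, 4, 7], 2: [1, 2, 4, 5, 8], 3: [8, 1, 5, 6, 2]})

ref = 3                 # name of the reference column
hits = df1.eq(2)        # True where the value 2 occurs

# Name of the last matching column per row; rows without any match
# fall back to the first column of the reversed frame (3).
last = hits.iloc[:, ::-1].idxmax(axis=1)

# Keep the distance only for rows that really contain the value.
dist = (ref - last).where(hits.any(axis=1))
print(dist)
```

With this mask, a 2 sitting in the reference column itself yields a distance of 0 rather than NaN.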
So I have two dataframes consisting of 6 columns each containing numbers. I need to compare 1 column from each dataframe to make sure they match and fix any values in that column that don't match. Columns are already sorted and they match in terms of length. So far I can find the differences in the columns:
df1.loc[(df1['col1'] != df2['col2'])]
This gives me the index numbers where df1 doesn't match df2. I then go to the same index in df2 to find the value in col2 that causes the mismatch, and use it to change the value to the correct one found in df2:
df1.loc[index_number, 'col1'] = new_value
Is there a way I can automatically fix the mismatches without having to manually look up what the correct value should be in df2?
If df2 is the authoritative source, you don't need to check where df1 is equal:
df1.loc[:, 'column_name'] = df2['column_name']
But if we must check:
c = 'column_name'
df1.loc[df1[c] != df2[c], c] = df2[c]
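If it is also useful to know which rows were corrected, the mask can be kept in a variable before assigning. A sketch with made-up values, assuming both frames share the same index; col1 and col2 follow the question, while mismatch and fixed_rows are illustrative names.

```python
import pandas as pd

# Made-up sample frames with the same default index.
df1 = pd.DataFrame({"col1": [10, 20, 30, 40]})
df2 = pd.DataFrame({"col2": [10, 25, 30, 45]})

# Rows where the two columns disagree.
mismatch = df1["col1"] != df2["col2"]

# Remember which index labels are about to be overwritten.
fixed_rows = df1.index[mismatch].tolist()

# Overwrite only the mismatching rows with the values from df2.
df1.loc[mismatch, "col1"] = df2.loc[mismatch, "col2"]
print(fixed_rows)
print(df1)
```

The element-wise comparison expects identically labelled Series, so frames with different indexes would need to be aligned first.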
I think you need to compare with eq, and then, if you need to fill in values where they don't match, use combine_first:
df1 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,6,5],
'E':[5,3,6],
'F':[1,4,3]})
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 6 3 4
2 3 6 9 5 6 3
df2 = pd.DataFrame({'A':[1,2,1],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 1 6 9 5 6 3
If you need to compare one column with the whole DataFrame:
print (df1.eq(df2.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 False False False False False False
print (df1.eq(df1.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 True False False False False True
And if you need the same column D:
df1.D = df1.loc[df1.D.eq(df2.D), 'D'].combine_first(df2.D)
print (df1)
A B C D E F
0 1 4 7 1.0 5 1
1 2 5 8 3.0 3 4
2 3 6 9 5.0 6 3
But then it is easier to just assign column D from df2 to D of df1:
df1.D = df2.D
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
If the indexes are different, you can use values to convert the column to a numpy array:
df1.D = df2.D.values
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3