I have two dataframes with numeric values. I want to compare each column (they share the same column names), and if all of them are equal, execute a condition (add 10 points to a score). I have done it "manually" and it works, but I don't like it that way. Let me show you:
score = 0

DATAFRAME 1

Column A  Column B  Column C  Column D
       1         2         0         1

DATAFRAME 2

Column A  Column B  Column C  Column D
       1         2         0         1
So, if the values in every column are equal, the score is incremented: score = score + 10.
score = 0
if (df1['Column A'][0] == df2['Column A'][0]) & (df1['Column B'][0] == df2['Column B'][0]) & \
   (df1['Column C'][0] == df2['Column C'][0]) & (df1['Column D'][0] == df2['Column D'][0]):
    score = score + 10
I want to do this in a cleaner way, with a for loop or something like that. How could it be done? Thanks a lot.
Use the pandas equals function. It returns True only when the two frames have the same shape and the same elements (NaNs in matching positions count as equal, and corresponding columns must share a dtype). Here is the simple implementation:

if df1.equals(df2):
    score += 10
else:
    print("Columns are not equal")
Compare the values element by element, then check which rows contain only True, sum to find out how many rows satisfy the condition, and finally multiply by 10:
>>> df1.eq(df2).all(axis=1).sum() * 10
10
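A self-contained version of that one-liner, as a sketch using the single-row frames from the question:

>>> import pandas as pd
>>> df1 = pd.DataFrame({'Column A': [1], 'Column B': [2], 'Column C': [0], 'Column D': [1]})
>>> df2 = pd.DataFrame({'Column A': [1], 'Column B': [2], 'Column C': [0], 'Column D': [1]})
>>> df1.eq(df2).all(axis=1).sum() * 10  # eq: element-wise compare; all(axis=1): row fully matches; sum: count of matching rows
10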
I have a dataset with unique IDs and several columns holding a Boolean-encoded value for each ID. I need to convert these columns into a single categorical variable that summarizes the True values. If an ID has at least 3 True values among these columns, we assign it the category "Win".
ID  BoolCol_1  BoolCol_2  BoolCol_3  BoolCol_4  Other Col 1  Other Col 2
 1          1          2          2          1            x            Y
 2          2          1          1          1            A            b
Here 1 -> True and 2 -> False.
IDs are unique.
I am not able to work out how to solve this puzzle.
Use the following approach:

import numpy as np
import pandas as pd

bool_cols = ['BoolCol_1', 'BoolCol_2', 'BoolCol_3', 'BoolCol_4']
# Count how many 1s (i.e. Trues) appear in each row of the Boolean columns
cnts = df[bool_cols].stack().groupby(level=0).value_counts().unstack()[1]
# At least three Trues -> 'W' (win), otherwise 'L' (lose)
df['cat_col'] = pd.Series(np.where(cnts >= 3, 'W', 'L'), dtype='category')

Now cat_col is a categorical column with the fixed values W (win) and L (lose):
In [229]: df
Out[229]:
ID BoolCol_1 BoolCol_2 BoolCol_3 BoolCol_4 Other Col 1 Other Col 2 cat_col
0 1 1 2 2 1 x Y L
1 2 2 1 1 1 A b W
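An equivalent and arguably simpler way to get the same counts, as a sketch assuming the 1/2 encoding above and the default integer index: compare against 1 directly and sum per row.

# Each cell equal to 1 is True; summing per row counts the Trues
cnts = df[bool_cols].eq(1).sum(axis=1)
df['cat_col'] = pd.Series(np.where(cnts >= 3, 'W', 'L'), dtype='category')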
Welcome to SO, rgl!
So in our case, the numeric value encoding "True" is 1 and the numeric value encoding "False" is 2. The trick behind doing operations on values encoded this way is plain arithmetic; here, addition is all we need.
The first step is to add up all of the Boolean values contained in each row and append these values under a new column:
# Sum of Booleans in the row
df['sum_of_wins_and_losses'] = df.BoolCol_1 + df.BoolCol_2 + df.BoolCol_3 + df.BoolCol_4
The next step is to write a simple function that uses if and else statements based on the logic you are looking for. You noted that there must be at least three True values for an ID to be considered a "Win". This is where you need to be a bit careful.

Here, the minimum possible sum is 4, when all four values are "True" (1+1+1+1), whereas the maximum is 8, when all four values are "False" (2+2+2+2). To be considered a "Win", an ID needs a sum of 5 or less, since 5 corresponds to exactly three wins and one loss (1+1+1+2=5).
# Write function that contains the logic
def assign_win_or_loss(row):
    if row <= 5:
        result = 'win'
    else:
        result = 'loss'
    return result
Now that we have defined the function, it is time to apply it to the dataframe and create a new column containing our categorical variables:
# Apply function and create a new column based on values in other column
df['win_or_loss'] = df['sum_of_wins_and_losses'].apply(assign_win_or_loss)
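Putting the steps together on the question's sample data, as a sketch (the two "Other" columns are omitted for brevity):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'BoolCol_1': [1, 2], 'BoolCol_2': [2, 1],
                   'BoolCol_3': [2, 1], 'BoolCol_4': [1, 1]})

# Sum of the encoded Booleans per row, then the threshold rule from above
df['sum_of_wins_and_losses'] = df.BoolCol_1 + df.BoolCol_2 + df.BoolCol_3 + df.BoolCol_4
df['win_or_loss'] = df['sum_of_wins_and_losses'].apply(assign_win_or_loss)
print(df)  # ID 1 sums to 6 -> 'loss'; ID 2 sums to 5 -> 'win'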
I am trying to filter a data frame according to whether a column contains even or odd values. For example, I would like this table:
Player  Score
A       15
B       14
C       12

To look like this when filtering for even Score values:

Player  Score
B       14
C       12
My working solution is to create a new column holding each value modulo 2 (the % operator), and then filter on whether that column is 1 or 0:

df['even_odd'] = df.Score % 2
Is there a better way?
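For what it's worth, a minimal sketch of the same modulo idea as a one-step boolean mask, using the sample data from above (no helper column needed):

import pandas as pd

df = pd.DataFrame({'Player': ['A', 'B', 'C'], 'Score': [15, 14, 12]})
# Keep rows with an even Score; use == 1 to keep the odd ones instead
print(df[df.Score % 2 == 0])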
I have a dataframe in which I would like to compute the average of one column grouped by another. I have the following dataframe:

Column 'A' has repeating values but column 'B' does not. I would like to compute the average of the values in column 'B' for each repeated number in column 'A'. For example, the first value in column 'A' is 1 and the value in 'B' is 3; the next value in column 'A' is also 1 and the value in 'B' is 9; the next is 4; and so on. Then continue with 2 and 3, etc.

I was thinking that if I could move those values into separate columns and then compute the average across columns it would be easier, but I can't find a way to copy the values there. Maybe there is an easier way?

This is what I would like:
You can use groupby and mean():

df.groupby('A').B.mean()

As @fuglede mentioned,

df.groupby('A').mean()

would work as well, since B is the only column left for aggregation.
Either way you get
A
1 6.25
2 6.50
3 4.75
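If you would rather get the result back as a DataFrame with 'A' as a regular column instead of the index, a small sketch using the standard as_index parameter:

# as_index=False keeps 'A' as an ordinary column in the result
df.groupby('A', as_index=False)['B'].mean()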
I have a DataFrame with several columns holding data of different types. I would like to apply a function to the data in one of the columns and get back the same DataFrame, but with new values in the column the function was applied to.
Example:
If the DataFrame is
Letters Numbers Bools
0 a 1 True
1 b 2 False
After applying a function I would like to get:
Letters Numbers Bools
0 a 11 True
1 b 12 False
You can change the values directly for all rows in a column:

df['Numbers'] = df['Numbers'] + 10
print(df)
This will add 10 to each value in the Numbers column. The result will be:
Letters Numbers Bools
0 a 11 True
1 b 12 False
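If the change is more complex than a vectorized expression, here is a sketch of the apply-based variant the question alludes to (the lambda is a stand-in for your own function):

# apply() calls the given function once per element of the column
df['Numbers'] = df['Numbers'].apply(lambda x: x + 10)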
I have a two-column dataframe df in which each row is distinct, but one element in one column can map to more than one element in the other column. I want to filter OUT those elements, so that in the final dataframe each element in one column maps to a unique element in the other column.

What I am doing is to group by one column, count the duplicates, remove the rows with counts greater than 1, and then do the same for the other column. I am wondering if there is a better, simpler way. Thanks.
edit1: I just realized my solution is INCORRECT: removing the multi-mapping elements in column A changes the counts in column B. Consider the following example:
A  B
1  4
1  3
2  4
1 maps to 3 and 4, so the first two rows should be removed; and 4 maps to 1 and 2, so the last row should go too. The final table should be empty. However, my solution will keep the last row. Can anyone provide a fast and simple solution? Thanks.
Well, you could do something like the following:
>>> df
A B
0 1 4
1 1 3
2 2 4
3 3 5
You only want to keep a row if no other row has its value of 'A' and no other row has its value of 'B'. Only row three meets those conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone, on=['A', 'B'], how='inner')
A B
0 3 5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
A B
2 2 4
3 3 5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
A B
1 1 3
3 3 5
The above grabs the rows that may be allowed based on looking at column 'B' alone. Merging the two then leaves you with the rows that meet both conditions:
>>> Aone.merge(Bone, on=['A', 'B'], how='inner')
Note, you could also do a similar thing using groupby/transform, but transform tends to be slow, so I didn't show it as an alternative.
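For completeness, a sketch of that transform variant anyway; it expresses the same logic as boolean masks over group sizes, keeping rows whose 'A' and 'B' values each occur exactly once:

>>> mask = (df.groupby('A')['A'].transform('size') == 1) & (df.groupby('B')['B'].transform('size') == 1)
>>> df[mask]
   A  B
3  3  5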