Pandas checking if values in multiple column exists in other columns - python

I am trying to check whether the values in each row of DataFrame "Actual" also appear in the same row of DataFrame "estimate". Column position is not important; the value just needs to exist somewhere on the same row of the other DataFrame. The DataFrames can be concatenated/merged if need be.
I present my code below:
Actual=pd.DataFrame([[4,7,2,8,1],[1,5,7,9,8]], columns=['Actual1','Actual2','Actual3','Actual4','Actual5'])
estimate=pd.DataFrame([[1,2,7,9,3],[0,8,2,5,9]], columns=['estimate1','estimate2','estimate3','estimate4','estimate5'])
Actual
   Actual1  Actual2  Actual3  Actual4  Actual5
0        4        7        2        8        1
1        1        5        7        9        8
estimate
   estimate1  estimate2  estimate3  estimate4  estimate5
0          1          2          7          9          3
1          0          8          2          5          9
My attempt using pandas:
for loop1 in range(1, 6):
    for loop2 in range(1, 6):
        Actual['want' + str(loop1)] = np.where(Actual['Actual' + str(loop1)] == estimate['estimate' + str(loop2)], 1, 0)
and finally, the output that I would like:
want=pd.DataFrame([[0,1,1,0,1],[0,1,0,1,1]], columns=['want1','want2','want3','want4','want5'])
want
   want1  want2  want3  want4  want5
0      0      1      1      0      1
1      0      1      0      1      1
So, as mentioned earlier: since value 4 from the first row of DataFrame "Actual" does not appear anywhere in the first row of DataFrame "estimate", column 'want1' has been assigned 0. Likewise, looking at column Actual5 in the first row of "Actual", where the value is 1: since this value does exist in the first row of "estimate" (column location does not matter), column 'want5' has been assigned 1.
Thanks.

Assuming that the indices in your Actual and estimate DataFrames are the same, one approach would be to just apply a check along the columns with isin.
Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Here we use the name attribute as the glue between the two DataFrames.
Demo
>>> Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
   Actual1  Actual2  Actual3  Actual4  Actual5
0        0        1        1        0        1
1        0        1        0        1        1
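
If you also want the columns relabelled to match the desired want1..want5 frame, a minimal sketch (assuming the same Actual and estimate frames as above):
want = Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype(int)
want.columns = ['want' + str(i) for i in range(1, 6)]  # relabel Actual1..5 -> want1..5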

Related

How to find rows in a data frame with this criterion (with the use of Pandas)?

Using pandas, I want to find rows in which exactly one column contains a specific value while the other columns of the same row are empty. In fact, I don't need the rows containing that specific value when the other columns contain any value. Does pandas offer a solution for this?
To clarify, take the table below as an example. I want to get row 2 and row 4, which have a column containing the number 3 while the other two columns contain zero.
Example df:
Data    column1  column2  column3
data 1        1        2        3
data 2        3        0        0
data 3        0        4        3
data 4        0        3        0
You can try the following code:
import numpy as np  # needed for np.nan below

df = df.loc[:, ['column1', 'column2', 'column3']]
three = df.isin([3]).sum(axis=1) == 1
zero = df.isin([0, None, np.nan]).sum(axis=1) == 2
mask = three & zero
df.loc[mask, :]
Explanation:
This extracts the relevant columns:
df = df.loc[:, ['column1', 'column2', 'column3']]
Next you have to build Boolean masks for the rows:
three = df.isin([3]).sum(axis=1) == 1                # the row contains exactly one 3
zero = df.isin([0, None, np.nan]).sum(axis=1) == 2   # the other two entries are zero/empty
mask = three & zero
Where the mask produces:
0    False
1     True
2    False
3     True
dtype: bool
df.loc[mask, :] produces the final result:
   column1  column2  column3
1        3        0        0
3        0        3        0
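
For reference, a self-contained sketch of the same approach on the example frame above (the Data column is kept here so the full matching rows print):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': ['data 1', 'data 2', 'data 3', 'data 4'],
                   'column1': [1, 3, 0, 0],
                   'column2': [2, 0, 4, 3],
                   'column3': [3, 0, 0, 0]})

cols = df[['column1', 'column2', 'column3']]
three = cols.isin([3]).sum(axis=1) == 1               # exactly one 3 in the row
zero = cols.isin([0, None, np.nan]).sum(axis=1) == 2  # the other two are zero/empty
print(df.loc[three & zero])                           # rows 'data 2' and 'data 4'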

Assigning labels each cycle of a for loop

Similarly to these questions, How to add an empty column to a dataframe? and Adding a new column to a df each cycle of a for loop, I would like to add new labels within a column on each cycle of a for loop.
I have an initial dataset of 10 rows. In a for loop, at every iteration, I add more rows. I would like to assign the new rows a label 0, to distinguish them from the rows already in the original dataset (labelled 1).
For example:
df = pd.DataFrame({'a': [1,2,3], 'b': [5,6,7]}) # Sample DataFrame
>>> df
   a  b
0  1  5
1  2  6
2  3  7
Before starting the for loop, I am creating a new column, initializing its values to 1:
   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1
After the first run, the loop adds new rows to the df. How can I assign to those rows the Label=0?
Expected output:
   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1
3  4  8      0
4  5  9      0
...
I tried as follows:
df['Label'] = 1
labels = df['Label']
# difference holds the values in b not included in a. Since 5, 6 and 7 are not
# in a, the first run is for x in (5, 6, 7); I would need to skip that first
# run, otherwise I assign 0 to my first three rows, which I initialised to 1.
for x in difference:
    # omitted steps
    labels = 0
    df = pd.DataFrame({"a": a_list, "b": b_list, "Labels": labels})
As mentioned, difference includes all the values in b not included in a.
Instead of the expected output, I am getting the following:
   a  b  Label
0  1  5      0
1  2  6      0
2  3  7      0
3  4  8      0
4  5  9      0
...
The problem is that labels = 0 is currently also assigned to my original rows, because the loop also runs for those rows, so the 1 values initially assigned get replaced.
One approach could be to record the length of the initial dataframe (whose rows get Label=1) and assign 0 to every row beyond it: define threshold = len(df) at the beginning and, before creating the df with the new values, assign 1 to rows below the threshold and 0 otherwise. But I do not know how to work with row positions to try this approach. I think .loc could solve the problem, but I do not know how to write the condition (maybe rows below the initial length, defined before the for loop).
I was thinking of something like this:
for those rows within the initial threshold (i.e., the len of my df), assign 1;
otherwise 0.
This should probably be set after defining df in my code, in order to create a column that takes the position (row index) of each value into account.
I tried df.iloc[0:int(len(df)), "Label"] = 1, but it gives an error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices (.iloc is purely positional, so the column label "Label" is not a valid indexer there).
Keep a copy of the original index. After adding the new rows to the dataframe, use boolean indexing to set the Label column of the new rows to 0.
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [5,6,7]}) # Sample DataFrame
df['Label'] = 1
origin_index = df.index.tolist()
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent here
df = pd.concat([df, df], ignore_index=True)
df.loc[~df.index.isin(origin_index), 'Label'] = 0
print(df)
   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1
3  1  5      0
4  2  6      0
5  3  7      0
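
Alternatively, following the asker's own threshold idea, a minimal positional sketch (threshold is assumed to be the length recorded before the loop started):
import numpy as np

threshold = 3  # len(df) taken before any rows were appended
df['Label'] = np.where(np.arange(len(df)) < threshold, 1, 0)

Rows at positions below the original length get 1, and every row appended later gets 0, regardless of the index values.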

Python Pandas: Rolling backward function

I have a dataframe with two columns (audit_value and rolling_sum_3), where rolling_sum_3 contains the rolling sum of the last 3 audit values, plus the desired result column Fixed_audit. The dataframe is shown below:
df1
   audit_value  rolling_sum_3  Fixed_audit
0            4            NaN            3
1            5            NaN            3
2            3             12            3
3            1              9            1
4            2              6            2
5            1              4            1
6            4              7            3
Now I want to apply a condition to the rolling_sum_3 column: where the value is greater than 5, look back at the last 3 audit values and find those greater than 3. Any of those last 3 audit values that is greater than 3 should be replaced with 3 in a new column (called fixed_audit); the other rows keep their old audit_value in the new column. I couldn't find any built-in pandas function that performs this kind of backward-looking update. Could anyone suggest an easy and efficient way of applying such roll-back logic to a column?
df1['fixed_audit'] = df1['audit_value']
for i in range(3, len(df1)):
    if df1.iloc[i].rolling_sum_3 > 5:
        df1.loc[i-1, 'fixed_audit'] = 3 if df1.loc[i-1, 'audit_value'] > 3 else df1.loc[i-1, 'audit_value']
        df1.loc[i-2, 'fixed_audit'] = 3 if df1.loc[i-2, 'audit_value'] > 3 else df1.loc[i-2, 'audit_value']
        df1.loc[i-3, 'fixed_audit'] = 3 if df1.loc[i-3, 'audit_value'] > 3 else df1.loc[i-3, 'audit_value']
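
A vectorized sketch of the same idea, assuming (as in the example data) that rolling_sum_3 at row i covers rows i-2..i: flag every row that falls inside a window whose sum exceeds 5, then cap the flagged audit values at 3.
import pandas as pd

df1 = pd.DataFrame({'audit_value': [4, 5, 3, 1, 2, 1, 4]})
df1['rolling_sum_3'] = df1['audit_value'].rolling(3).sum()

hot = df1['rolling_sum_3'] > 5  # windows whose sum exceeds 5 (NaN compares False)
# row i sits inside a hot window if the window ending at i, i+1 or i+2 is hot
in_hot = hot | hot.shift(-1, fill_value=False) | hot.shift(-2, fill_value=False)
df1['fixed_audit'] = df1['audit_value'].where(~(in_hot & (df1['audit_value'] > 3)), 3)
print(df1)  # fixed_audit: 3, 3, 3, 1, 2, 1, 3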

Removing duplicates based on two columns while deleting inconsistent data

I have a pandas dataframe like this:
   a  b  c
0  1  1  1
1  1  1  0
2  2  4  1
3  3  5  0
4  3  5  0
where the first two columns ('a' and 'b') are IDs while the last one ('c') is a validation flag (0 = neg, 1 = pos). I do know how to remove duplicates based on the values of the first two columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated rows validated both as positive and negative. So, for example, the first two rows are duplicated but inconsistent, hence I should remove both records entirely, while the last two rows are duplicated and consistent, so I'd keep one of them. The expected result should be:
   a  b  c
0  2  4  1
1  3  5  0
The real dataframe can have more than two duplicates per group, and, as you can see, the index has also been reset. Thanks.
First filter the rows using GroupBy.transform with SeriesGroupBy.nunique to keep, via boolean indexing, only the groups with a single unique value, then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print(df)
   a  b  c
0  2  4  1
1  3  5  0
Detail:
print(df.groupby(['a','b'])['c'].transform('nunique'))
0    2
1    2
2    1
3    1
4    1
Name: c, dtype: int64
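
An equivalent alternative, as a sketch starting from the original df: deduplicate the full rows first, then drop every (a, b) pair that still occurs more than once, since a pair surviving the first pass as a duplicate must carry conflicting c values.
df = (df.drop_duplicates(['a', 'b', 'c'])        # one row per (a, b, c) combination
        .drop_duplicates(['a', 'b'], keep=False) # pairs still duplicated had mixed c
        .reset_index(drop=True))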

Find duplicates for one column with the last row group by one column in Pandas Python

I have 4 columns in my dataframe: user, abcisse, ordonnee, temps.
For each user, I want to find the rows that duplicate a later row of the same user, where a duplicate means two rows with the same abcisse and ordonnee, and keep only the last occurrence.
I was thinking of using the df.duplicated function, but I don't know how to combine it with groupby.
entry = pd.DataFrame([[1,0,0,1],[1,3,-2,2],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,1],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
output = pd.DataFrame([[1,0,0,1],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
Use drop_duplicates with keep='last'; since user is part of the subset, duplicates are only matched within each user, so no explicit groupby is needed:
print(entry.drop_duplicates(['user', 'abcisse', 'ordonnee'], keep='last'))
   user  abcisse  ordonnee  temps
0     1        0         0      1
2     1        2         1      3
3     1        3         1      4
4     1        3        -2      5
6     2        1         3      2
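
Note that the result keeps the original index (0, 2, 3, 4, 6). To match the asker's output frame exactly, reset it afterwards:
result = entry.drop_duplicates(['user', 'abcisse', 'ordonnee'],
                               keep='last').reset_index(drop=True)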
