Add a column in pandas dataframe using conditions on 3 existing columns - python

I have an existing pandas DataFrame that I want to manipulate according to the following pattern:
The table has several sets of codes in column 'code'. Each 'code' has certain labels listed in column 'label', and each label is tagged with either 0 or 1 in column 'tag'.
I need to add a 'new_column' holding 0 or 1 for each set of 'code', depending on the following condition:
Fill 1 in 'new_column' only when every 'label' of a particular 'code' has the value 1 in the 'tag' column.
Note that I need to fill 1 for all the rows belonging to that particular 'code'.
As shown in the desired table, only code=30 has all of its labels tagged 1 in the 'tag' column, so I set 'new_column' to 1 for that code; the rest of the codes get 0.
Existing Table:
code label tag
0 10 AAA 0
1 10 BBB 1
2 10 CCC 0
3 10 DDD 0
4 10 EEE 0
5 20 AAA 1
6 20 CCC 0
7 20 DDD 1
8 30 BBB 1
9 30 CCC 1
10 30 EEE 1
Desired Table
code label tag new_column
0 10 AAA 0 0
1 10 BBB 1 0
2 10 CCC 0 0
3 10 DDD 0 0
4 10 EEE 0 0
5 20 AAA 1 0
6 20 CCC 0 0
7 20 DDD 1 0
8 30 BBB 1 1
9 30 CCC 1 1
10 30 EEE 1 1
I have not tried any solution yet as it seems beyond my present level of expertise.

I think the right answer to this question is the one given by user3483203 in the comments:
df['new_column'] = df.groupby('code')['tag'].transform(all).astype(int)
The transform method applies whatever function is passed to it to each group, while keeping the axis length the same as the original dataframe.
The simple example in the documentation clearly explains the usage.
Coming to this particular question, the following happens when you run this snippet:
You first perform the grouping with respect to the 'code'. You end up with a DataFrameGroupBy object.
Next, from this you choose the tag column, ending up with a SeriesGroupBy object.
To this grouping, you apply the all function via transform, ultimately typecasting the boolean values to type int.
Basically, you can understand it like this (the values are binary to keep them close to your data):
>>> int(all([1, 1, 1, 1]))
1
>>> int(all([1, 0, 1, 1]))
0
Finally, you assign the column you just created to new_column on the original dataframe.
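Putting it together on the question's data (a minimal, self-contained sketch; the DataFrame is rebuilt here just for illustration):
import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame({
    'code':  [10, 10, 10, 10, 10, 20, 20, 20, 30, 30, 30],
    'label': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE',
              'AAA', 'CCC', 'DDD', 'BBB', 'CCC', 'EEE'],
    'tag':   [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1],
})

# 1 only for codes whose every tag is 1, broadcast to all rows of that code
df['new_column'] = df.groupby('code')['tag'].transform(all).astype(int)
print(df)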

The initial answer by user3483203 works; there is also a variation, though his way is more concise.
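The variation itself is not shown here; as a purely illustrative sketch (my own assumption, not the original poster's code), a longer but equivalent route builds the per-code flag first and merges it back:
# Illustrative sketch only, not the original poster's variation.
# min() of a 0/1 column is 1 only when every tag in the group is 1.
flags = df.groupby('code')['tag'].min().rename('new_column').reset_index()
df = df.merge(flags, on='code', how='left')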

Related

How to aggregate all values in a pandas dataframe column into 2 values

I have a pandas dataframe that contains some columns. Each column has some different values (see the image).
In col1, the value 1 is more frequent than the others, so I need to transform this column so it holds either 1 or 'more than 1'.
How can I do that?
My goal here is to turn this column into a categorical column, but I have no idea how to do that.
The expected output is something like the next image:
Try the clip function on the column:
df["col1"].clip(upper=2)
0 1
1 2
2 2
3 2
4 1
5 2
6 2
7 1
8 1
9 1
10 1
11 2
12 1
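Since the stated goal was a categorical column, the clipped values can then be assigned back and cast; the category labels below are my own illustration, not part of the original answer:
# Map the clipped values to readable categories (the labels here are illustrative)
df["col1"] = (df["col1"].clip(upper=2)
                        .map({1: "1", 2: "more than 1"})
                        .astype("category"))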

Compare all column values to another one with Pandas

I am having trouble with Pandas.
I am trying to compare each value in a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater, I want to turn it into a 1, or a 0 if it is lower.
This should return a dataframe filled only with 1s and 0s so I can then summarize by column.
I have tried the apply method but it doesn't work.
It returns a pandas Series (attached below).
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:
import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue  # skip the reference column
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
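A fully vectorized alternative (a sketch of my own, not part of the original answer) compares every column against 'CAC 40' in one step, keeping the same strict > comparison used above:
# Compare all non-reference columns to 'CAC 40' row by row, then cast booleans to 0/1
result = df.drop(columns="CAC 40").gt(df["CAC 40"], axis=0).astype(int)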

Removing duplicates based on two columns while deleting inconsistent data

I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first 2 columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated data validated as both positive and negative. For example, the first 2 rows are duplicated but inconsistent, so I should remove the whole record, while the last 2 rows are both duplicated and consistent, so I keep one of them. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and as you can see, the index has also been reset. Thanks.
First filter the rows with GroupBy.transform and SeriesGroupBy.nunique to keep, via boolean indexing, only the groups whose 'c' values are all identical, and then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
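A self-contained sketch of the same steps on the question's data:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                   'b': [1, 1, 4, 5, 5],
                   'c': [1, 0, 1, 0, 0]})

# Keep only groups whose 'c' values are all identical, then drop the duplicates
mask = df.groupby(['a', 'b'])['c'].transform('nunique').eq(1)
out = df[mask].drop_duplicates(['a', 'b']).reset_index(drop=True)
print(out)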

Pandas: checking if values in multiple columns exist in other columns

I am trying to check whether the values in each row of DataFrame "Actual" also appear in the same row of DataFrame "estimate". Column position is not important; the value just needs to exist somewhere in the same row of the other dataframe. The DataFrames can be concatenated/merged if need be.
My code is presented below:
Actual=pd.DataFrame([[4,7,2,8,1],[1,5,7,9,8]], columns=['Actual1','Actual2','Actual3','Actual4','Actual5'])
estimate=pd.DataFrame([[1,2,7,9,3],[0,8,2,5,9]], columns=['estimate1','estimate2','estimate3','estimate4','estimate5'])
Actual
Actual1 Actual2 Actual3 Actual4 Actual5
0 4 7 2 8 1
1 1 5 7 9 8
estimate
estimate1 estimate2 estimate3 estimate4 estimate5
0 1 2 7 9 3
1 0 8 2 5 9
My attempt using Pandas:
for loop1 in range(1,6,1):
    for loop2 in range(1,6,1):
        Actual['want'+str(loop1)]=np.where(Actual['Actual'+ str(loop1)] == estimate['estimate' + str(loop2)],1,0)
And finally, the output that I would like:
want=pd.DataFrame([[0,1,1,0,1],[0,1,0,1,1]], columns=['want1','want2','want3','want4','want5'])
want
want1 want2 want3 want4 want5
0 0 1 1 0 1
1 0 1 0 1 1
So, as I was mentioning earlier: since the value 4 in the first row of Dataframe "Actual" does not exist anywhere in the first row of dataframe "estimate", column 'want1' has been assigned the value 0. On the other hand, for column 5 of the first row of "Actual" where value=1, since this value does exist in the first row of "estimate" (column location does not matter), column 'want5' has been assigned the value 1.
Thanks.
Assuming that the indices in your Actual and estimate DataFrames are the same, one approach would be to just apply a row-wise check with isin.
Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Here we use the name attribute as the glue between the two DataFrames.
Demo
>>> Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype('int')
Actual1 Actual2 Actual3 Actual4 Actual5
0 0 1 1 0 1
1 0 1 0 1 1
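To get the exact want1..want5 layout from the question, the same result can simply be relabelled (a minimal sketch, assuming the row indices of Actual and estimate line up):
# Row-wise membership test, then rename the columns to want1..want5
want = Actual.apply(lambda x: x.isin(estimate.loc[x.name]), axis=1).astype(int)
want.columns = ['want' + str(i) for i in range(1, 6)]
print(want)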

How to group by a multi-index (including the initial numeric index and other columns) in a Python dataframe?

I am working with groupby on a pandas DataFrame. I want to group the data so that no matter how many times I query and write the data to MySQL, it won't mess up my raw data.
df1=pd.DataFrame(df) #this is a DataFrame with multiple different lines of 'Open' for one 'Symbol'
df2=pd.read_sql('select * from 6openposition',con=conn)
df2=df2.append(df1)
df2=df2.groupby(['Symbol']).agg({'Open':'first'})
df2.to_sql(name='6openposition', con=conn, if_exists='replace', index= False, flavor = 'mysql')
#Example Raw Data:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
#After I query the data for multiple times(I appended):
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
0 AA 30
1 AAA 40
2 AAA 50
3 AAA 50
4 AAA 60
#How my code ended up with:
Symbol Open
0 A 10
1 AA 20
2 AAA 40
#What I want:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
6 AAA 60
My raw data can have multiple values in column 'Open' for the same 'Symbol'.
In trying to eliminate the effect of writing to MySQL multiple times, I end up altering the raw data as well.
My idea for solving this is to group by the initial index and 'Symbol' at the same time, because after the append the initial indices could serve as another 'group by' column. The initial indices are [0, 1, 2, ...]. If both the 'Symbol' and the initial index are the same, I can take the first value of 'Open' in that group. To group by the initial indices I could:
df2=df2.groupby(level=0).agg({'Open':'first'})
#this code will combine the lines with same indices and take the first value of 'Open' column
But I have no idea how to combine 'level=0' with the 'Symbol' column. Could you show me how to group by both the initial index and another column? Or tell me a way to eliminate the effect of the repeated inserts without messing up my raw data.
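For reference, grouping by the index level together with a column is possible; a minimal sketch of that idea (my own illustration, separate from the answer that follows):
# Group by the original integer index and 'Symbol' at the same time,
# keeping the first 'Open' per (index, Symbol) pair.
out = (df2.groupby([df2.index, 'Symbol'], sort=False)['Open']
          .first()
          .reset_index(level='Symbol'))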
Starting with df, including your index which seems to indicate whether data are repeated:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
Use
df.reset_index().drop_duplicates().drop('index', axis=1)
(keeps first occurrence by default) to get:
Symbol Open
0 A 10
1 AA 20
2 AA 30
3 AAA 40
4 AAA 50
5 AAA 50
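A self-contained sketch of that step (the second block of rows mimics a repeated query being appended):
import pandas as pd

raw = pd.DataFrame({'Symbol': ['A', 'AA', 'AA', 'AAA', 'AAA', 'AAA'],
                    'Open':   [10, 20, 30, 40, 50, 50]})
repeat = raw.iloc[2:]              # overlapping rows from a later query
df = pd.concat([raw, repeat])      # indices 2..5 now appear twice

# Duplicate (index, Symbol, Open) rows collapse to a single copy
out = df.reset_index().drop_duplicates().drop('index', axis=1)
print(out)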
