Pandas variable creation using multiple If-else - python
Need help with Pandas multiple IF-ELSE statements. I have a test dataset (titanic) as follows:
ID Survived Pclass Name Sex Age
1 0 3 Braund male 22
2 1 1 Cumings, Mrs. female 38
3 1 3 Heikkinen, Miss. Laina female 26
4 1 1 Futrelle, Mrs. female 35
5 0 3 Allen, Mr. male 35
6 0 3 Moran, Mr. male
7 0 1 McCarthy, Mr. male 54
8 0 3 Palsson, Master male 2
where ID is the passenger id. I want to create a new flag variable in this data frame according to the following rule:
if Sex=="female" or (Pclass==1 and Age <18) then 1 else 0.
Now to do this I tried a few approaches. This is how I approached first:
import pandas as pd
df = pd.read_csv("data.csv")
for passenger_index, passenger in df.iterrows():
    if passenger['Sex'] == "female" or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
        df['Prediction'] = 1
    else:
        df['Prediction'] = 0
The problem with the above code is that it creates a Prediction variable in df, but with all values set to 0.
However if I use the same code but instead output it to a dictionary it gives the right answer as shown below:
prediction = {}
df = pd.read_csv("data.csv")
for passenger_index, passenger in df.iterrows():
    if passenger['Sex'] == "female" or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
        prediction[passenger['ID']] = 1
    else:
        prediction[passenger['ID']] = 0
This gives a dict prediction with keys as ID and values as 1 or 0 based on the above logic.
So why does the df version give the wrong result? I even tried defining a function first and then calling it. It gave the same answer as the first attempt.
So, how can we do this in pandas?
Secondly, I guess the same can be done with nested if-else expressions. I know np.where, but it does not let me add an 'and' condition. Here is what I was trying:
df['Prediction']=np.where(df['Sex']=="female",1,np.where((df['Pclass']==1 and df['Age']<18),1,0))
The above gave an error for the 'and' keyword inside where.
So can someone help? Solutions with multiple approaches, using np.where (simple if-else like), using some function (applymap etc.), or modifications to what I wrote earlier, would be really appreciated.
Also, how do we do the same using the applymap or apply/map methods of df?
Instead of looping through the rows using df.iterrows (which is relatively slow), you can assign the desired values to the Prediction column in one assignment:
In [27]: df['Prediction'] = ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18))).astype('int')
In [29]: df['Prediction']
Out[29]:
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
Name: Prediction, dtype: int32
For your first approach, remember that df['Prediction'] represents an entire column of df, so df['Prediction']=1 assigns the value 1 to each row in that column. Since df['Prediction']=0 was the last assignment, the entire column ended up being filled with zeros.
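To make the point above concrete, here is a minimal sketch of how the original loop could be repaired by writing to one cell at a time with df.at instead of assigning to the whole column. The small frame is a hand-typed copy of the sample rows from the question (names omitted), and the loop is shown only for comparison; it is still slower than a single vectorized assignment.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the sample rows from the question (Name column omitted).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Age": [22, 38, 26, 35, 35, np.nan, 54, 2],
})

# Create the column once, then overwrite individual cells.
df["Prediction"] = 0
for idx, passenger in df.iterrows():
    if passenger["Sex"] == "female" or (passenger["Pclass"] == 1 and passenger["Age"] < 18):
        # df.at targets a single (row, column) cell, not the whole column.
        df.at[idx, "Prediction"] = 1

loop_result = df["Prediction"].tolist()
print(loop_result)
```

A missing Age compares as False against 18, so passenger 6 falls through to 0 here.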
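For your second approach, note that you need to use & rather than and to perform an elementwise logical-and operation on two NumPy arrays or Pandas NDFrames. The and keyword asks for a single truth value, which a whole Series does not have, so it raises a ValueError. Thus, you could use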
In [32]: np.where(df['Sex']=='female', 1, np.where((df['Pclass']==1)&(df['Age']<18), 1, 0))
Out[32]: array([0, 1, 1, 1, 0, 0, 0, 0])
though I think it is much simpler to just use | for logical-or and & for logical-and:
In [34]: ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18)))
Out[34]:
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 False
dtype: bool
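The question also asked about apply. A row-wise version might look like the sketch below, where flag is a hypothetical helper name and the frame is again a hand-typed copy of the sample rows. It is checked against the boolean-mask expression so the two can be compared side by side; on large frames the mask version would normally be the faster one.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the sample rows from the question (Name column omitted).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Age": [22, 38, 26, 35, 35, np.nan, 54, 2],
})

def flag(row):
    """Hypothetical helper: the rule from the question applied to one row."""
    return int(row["Sex"] == "female" or (row["Pclass"] == 1 and row["Age"] < 18))

# Row-wise: plain Python `or` / `and` are fine here because each value is a scalar.
by_apply = df.apply(flag, axis=1)

# Column-wise: `|` and `&` combine whole boolean Series elementwise.
vectorized = ((df["Sex"] == "female") | ((df["Pclass"] == 1) & (df["Age"] < 18))).astype(int)

print(by_apply.tolist())
print(vectorized.tolist())
```

Note that none of the eight sample rows has Pclass 1 with Age under 18, so on this data only the Sex condition ever fires.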
Related
Group by a category
I have done KMeans clustering and now I need to analyse each individual cluster. For example, look at cluster 1, see which clients are in it, and draw conclusions.
dfRFM['idcluster'] = num_cluster
dfRFM.head()
   idcliente  Recencia  Frecuencia  Monetario  idcluster
1          3       251          44     -90.11          0
2          8      1011          44   87786.44          2
6         88       537          36    8589.57          0
7         98       505           2    -179.00          0
9        156        11          15   35259.50          0
How do I group so that I only see results from, let's say, idcluster 0, sorted by, let's say, "Monetario"? Thanks!
To filter a dataframe, the most common way is to use
df[df[colname] == val]
Then you can use df.sort_values(). In your case, that would look like this:
dfRFM_id0 = dfRFM[dfRFM['idcluster']==0].sort_values('Monetario')
The way this filtering works is that dfRFM['idcluster']==0 returns a series of True/False values, one per row, depending on whether the condition holds. So we effectively have dfRFM[(True,False,True,True...)], and the dataframe returns only the rows where the value is True. That is, it selects the data where the condition is true.
I think you actually just need to filter your DF!
df_new = dfRFM[dfRFM.idcluster == 0]
and then sort by Monetario
df_new = df_new.sort_values(by = 'Monetario')
Group by is really best for when you want to look at the cluster as a whole, for example, if you wanted to see the average values for Recencia, Frecuencia, and Monetario for all of Group 0.
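Both ideas from the answers (filter then sort for one cluster, groupby for per-cluster summaries) can be tried on the five rows shown in the question. This is a minimal sketch; the frame is typed in by hand from the dfRFM.head() output, and the names cluster0 and means are only illustrative.

```python
import pandas as pd

# The five rows from the question's dfRFM.head() output, typed in by hand.
dfRFM = pd.DataFrame(
    {
        "idcliente": [3, 8, 88, 98, 156],
        "Recencia": [251, 1011, 537, 505, 11],
        "Frecuencia": [44, 44, 36, 2, 15],
        "Monetario": [-90.11, 87786.44, 8589.57, -179.00, 35259.50],
        "idcluster": [0, 2, 0, 0, 0],
    },
    index=[1, 2, 6, 7, 9],
)

# Inspect one cluster: boolean mask first, then sort the surviving rows.
cluster0 = dfRFM[dfRFM["idcluster"] == 0].sort_values("Monetario")
print(cluster0)

# Summarise every cluster at once: this is where groupby fits.
means = dfRFM.groupby("idcluster")[["Recencia", "Frecuencia", "Monetario"]].mean()
print(means)
```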
Pandas reorder rows of dataframe
I stumbled upon a very peculiar problem in Pandas. I have this dataframe:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file. Please note that id always runs from 0 up to 7 and then repeats, and that the time column is in sequential steps (which implies that the previous row should be smaller than or equal to the current one). I would like to reorder the rows of the dataframe as shown below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result. Please note that I need to reorder the dataframe rows based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the dataframe needs to be reordered so that the time column runs from smallest to largest, and the same holds for the rest of the columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP. I tried to sort by several columns without success. I also tried to use groupby, but that failed too. Could you help me solve the problem? Any suggestions are welcome. P.S. I have pasted the dataframes so that they can be read with the clipboard function and are easy to reproduce. I am attaching a picture as well.
What did you try to sort by several columns?
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
   Unnamed: 0                 time  id        X        Y  theta      Vx      Vy  ANGLE_FR  DANGER_RAD  RISK_RAD  TTC_DAN_LOW  TTC_DAN_UP  TTC_STOP  SIM
0           0  1600349033921610000   0  23.2644  -7.1409      0  0.0210 -1.1414        20         0.5       0.9           -1           7         2    3
8           8  1600349033954560000   0  23.2068  -7.5171      0 -0.1728 -1.1285        20         0.5       0.9           -1           7         2    3
1           1  1600349033921620000   1  18.5371 -14.2249      0 -0.0114  1.4436        20         0.5       0.9           -1           7         2    3
9           9  1600349033954570000   1  18.5554 -13.7441      0  0.0548  1.4426        20         0.5       0.9           -1           7         2    3
2           2  1600349033921650000   2  19.8086  -6.7785      0  0.0373 -1.0558        20         0.5       0.9           -1           7         2    3
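How about ordering by the settings columns first and then by id and time? Calling reset_index directly on a groupby object does not work, so this uses sort_values on the same columns instead:
sort_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM', 'id', 'time']
df = df.sort_values(sort_cols).reset_index(drop=True)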
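Multi-column sorting is easier to reason about on a tiny frame. The sketch below uses made-up values and only four of the question's columns; the key order (TTC_STOP and SIM first, then id, then time) is an assumption read off the desired output, not something the question states outright.

```python
import pandas as pd

# Made-up miniature of the question's data: two simulation settings, two ids.
df = pd.DataFrame({
    "time": [10, 10, 20, 20, 10, 20],
    "id": [0, 1, 0, 1, 0, 0],
    "TTC_STOP": [2.0, 2.0, 2.0, 2.0, 1.5, 1.5],
    "SIM": [3, 3, 3, 3, 2, 2],
})

# Earlier keys take priority; later keys only break ties among equal earlier keys.
ordered = df.sort_values(["TTC_STOP", "SIM", "id", "time"]).reset_index(drop=True)
print(ordered)
```

Putting id and time first, as in the call shown in the first answer, would instead interleave rows from different settings, which may be why the earlier sorting attempts looked wrong.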
Merging two dataframes based on index
I've been on this all night, and just can't figure it out, even though I know it should be simple. So, my sincerest apologies for the following incantation from a sleep-deprived fellow:
So, I have four fields, Employee ID, Name, Station and Shift (ID is non-null integer, the rest are strings or null). I have about 10 dataframes, all indexed by ID, and each containing only two columns: either (Name and Station) or (Name and Shift).
Now of course, I want to combine all of this into one dataframe, which has a unique row for each ID. But I'm really frustrated by it at this point (especially because I can't find a way to directly check how many unique indices my final dataframe ends with).
After messing around with some very ugly ways of using .merge(), I finally found .concat(). But it keeps making multiple rows per ID; when I check in Excel, the indices are like Table1/1234, Table2/1234 etc. One row has the shift, the other one has the station, which is precisely what I'm trying to avoid.
How do I compile all my data into one dataframe, having exactly one row per ID? Possibly without using 9 different merge statements, as I have to scale up later.
If I understand your question correctly, this is what you want. For example, with these 3 dataframes:
In [1]: df1
Out[1]:
          0         1         2
0  3.588843  3.566220  6.518865
1  7.585399  4.269357  4.781765
2  9.242681  7.228869  5.680521
3  3.600121  3.931781  4.616634
4  9.830029  9.177663  9.842953
5  2.738782  3.767870  0.925619
6  0.084544  6.677092  1.983105
7  5.229042  4.729659  8.638492
8  8.575547  6.453765  6.055660
9  4.386650  5.547295  8.475186
In [2]: df2
Out[2]:
           0          1
0  95.013170  90.382886
2   1.317641  29.600709
4  89.908139  21.391058
6  31.233153   3.902560
8  17.186079  94.768480
In [3]: df
Out[3]:
          0         1         2
0  0.777689  0.357484  0.753773
1  0.271929  0.571058  0.229887
2  0.417618  0.310950  0.450400
3  0.682350  0.364849  0.933218
4  0.738438  0.086243  0.397642
5  0.237481  0.051303  0.083431
6  0.543061  0.644624  0.288698
7  0.118142  0.536156  0.098139
8  0.892830  0.080694  0.084702
9  0.073194  0.462129  0.015707
You can do
pd.concat([df,df1,df2], axis=1)
This produces
In [6]: pd.concat([df,df1,df2], axis=1)
Out[6]:
          0         1         2         0         1         2          0          1
0  0.777689  0.357484  0.753773  3.588843  3.566220  6.518865  95.013170  90.382886
1  0.271929  0.571058  0.229887  7.585399  4.269357  4.781765        NaN        NaN
2  0.417618  0.310950  0.450400  9.242681  7.228869  5.680521   1.317641  29.600709
3  0.682350  0.364849  0.933218  3.600121  3.931781  4.616634        NaN        NaN
4  0.738438  0.086243  0.397642  9.830029  9.177663  9.842953  89.908139  21.391058
5  0.237481  0.051303  0.083431  2.738782  3.767870  0.925619        NaN        NaN
6  0.543061  0.644624  0.288698  0.084544  6.677092  1.983105  31.233153   3.902560
7  0.118142  0.536156  0.098139  5.229042  4.729659  8.638492        NaN        NaN
8  0.892830  0.080694  0.084702  8.575547  6.453765  6.055660  17.186079  94.768480
9  0.073194  0.462129  0.015707  4.386650  5.547295  8.475186        NaN        NaN
For more details you might want to see the pd.concat documentation. Just a tip: putting simple illustrative data in your question always helps in getting an answer.
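Because every frame in the question carries a Name column, concatenating along axis=1 would repeat that column once per frame. One possible alternative, sketched here with a different technique than the answer's pd.concat, is to fold the frames together with DataFrame.combine_first, which takes the union of index and columns and fills gaps from the next frame. All names and values below (stations, shifts, more_shifts, the employees) are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Made-up frames shaped like the ones described in the question, indexed by ID.
stations = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Station": ["North", "South"]},
    index=pd.Index([1234, 5678], name="ID"),
)
shifts = pd.DataFrame(
    {"Name": ["Ann", "Cid"], "Shift": ["Day", "Night"]},
    index=pd.Index([1234, 9012], name="ID"),
)
more_shifts = pd.DataFrame(
    {"Name": ["Bob"], "Shift": ["Day"]},
    index=pd.Index([5678], name="ID"),
)

frames = [stations, shifts, more_shifts]

# Fold left to right: values already present win, missing ones are filled in.
combined = reduce(lambda left, right: left.combine_first(right), frames)
print(combined)

# Direct check for "exactly one row per ID".
print(combined.index.is_unique, combined.index.nunique())
```

The list can grow to any number of frames without writing one merge statement per frame. Where two frames disagree on a value for the same ID, the earlier frame in the list wins, so this sketch assumes the inputs are consistent.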
Pandas Vectorization with Function on Parts of Column
So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
   0  1  2
0  1  2  3
1  5  7  8
2  2  5  4
I then have a function that adds 5 to a number called add5. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization not apply as this concept is going to be expanded to a dataset with hundreds of thousands of entries and speed will be important. I can do it without the greater than 3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
   0  1  2   3
0  1  2  3   3
1  5  7  8  13
2  2  5  4   9
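As a self-contained check of the numpy.where approach, the sketch below keeps the question's add5 as a function. Its body here is an assumption (the question never shows it); the only requirement is that it works on a whole Series at once. Series.mask is included as a pandas-only alternative that gives the same values.

```python
import numpy as np
import pandas as pd

def add5(values):
    """Assumed implementation of the question's add5; works on scalars or whole Series."""
    return values + 5

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])

# np.where picks from the second argument where the condition holds, else the third.
df1[3] = np.where(df1[2] > 3, add5(df1[2]), df1[2])

# Series.mask replaces values where the condition is True.
alt = df1[2].mask(df1[2] > 3, add5(df1[2]))

print(df1)
print(alt.tolist())
```

In both forms add5 is evaluated on the entire column and the condition only decides which results are kept, so the function should be safe to call on values that do not meet the condition.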
How do you set a specific column with a specific value to a new value in a Pandas DF?
I imported a CSV file that has two columns ID and Bee_type. The bee_type has two types in it - bumblebee and honey bee. I'm trying to convert them to numbers instead of names; i.e. instead of bumblebee it says 1. However, my code is setting everything to 1. How can I keep the ID column its original value and only change the bee_type column?
# load the labels using pandas
labels = pd.read_csv("bees/train_labels.csv")
# Set bumble_bee to one
for index in range(len(labels)):
    labels[labels['bee_type'] == 'bumble_bee'] = 1
I believe you need map by dictionary if only 2 possible values exist:
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
Another solution is to use numpy.where to set values by condition:
labels['bee_type'] = np.where(labels['bee_type'] == 'bumble_bee', 1, 2)
Your code runs, and it can be made faster by removing the loop and using loc. Note, however, that without a column label the assignment sets every column of the matching rows to 1, ID included:
labels.loc[labels['bee_type'] == 'bumble_bee'] = 1
print (labels)
   ID   bee_type
0   1          1
1   1  honey_bee
2   1          1
3   3  honey_bee
4   1          1
Sample:
labels = pd.DataFrame({
    'bee_type': ['bumble_bee','honey_bee','bumble_bee','honey_bee','bumble_bee'],
    'ID': list(range(5))
})
print (labels)
   ID    bee_type
0   0  bumble_bee
1   1   honey_bee
2   2  bumble_bee
3   3   honey_bee
4   4  bumble_bee
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
print (labels)
   ID  bee_type
0   0         1
1   1         2
2   2         1
3   3         2
4   4         1
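The ID column gets overwritten whenever a boolean mask is used without naming a column, because the assignment then applies to every column of the matching rows. Passing a column label as the second argument to loc limits the write to that column. A minimal sketch, using the sample frame from the answer and a hypothetical new column bee_code so that the original strings stay available:

```python
import pandas as pd

labels = pd.DataFrame({
    "ID": list(range(5)),
    "bee_type": ["bumble_bee", "honey_bee", "bumble_bee", "honey_bee", "bumble_bee"],
})

is_bumble = labels["bee_type"] == "bumble_bee"

# bee_code is a hypothetical new column: default every row to 2 (honey bee) ...
labels["bee_code"] = 2
# ... then overwrite only that column, and only in the rows selected by the mask.
labels.loc[is_bumble, "bee_code"] = 1

print(labels)
```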
As far as I can understand, you want to convert names to numbers. If that's the scenario, please try LabelEncoder. Detailed documentation can be found in the scikit-learn docs for sklearn.preprocessing.LabelEncoder.