How to compare two columns of the same dataframe? - python

I have a dataframe like this:
match_id inn1 bat bowl runs1 inn2 runs2 is_score_chased
1 1 KKR RCB 222 2 82 1
2 1 CSK KXIP 240 2 207 1
8 1 CSK MI 208 2 202 1
9 1 DC RR 214 2 217 1
33 1 KKR DC 204 2 181 1
Now I want to change the values in the is_score_chased column by comparing the values in runs1 and runs2. If runs1 > runs2, the corresponding value in the row should be 'yes'; otherwise it should be 'no'.
I tried the following code:
for i in (high_scores1):
    if(high_scores1['runs1']>=high_scores1['runs2']):
        high_scores1['is_score_chased']='yes'
    else:
        high_scores1['is_score_chased']='no'
But it didn't work. How do I change the values in the column?

You can do this more easily with np.where (assuming numpy is imported as np):
high_scores1['is_score_chased'] = np.where(high_scores1['runs1']>=high_scores1['runs2'],
                                           'yes', 'no')
Typically, if you find yourself trying to iterate explicitly as you were to set a column, there is an abstraction like apply or where which will be both faster and more concise.
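As a self-contained illustration of the np.where approach, here is a minimal sketch that rebuilds a few columns of the sample data from the question (the DataFrame construction is an assumption made only so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
high_scores1 = pd.DataFrame({
    "match_id": [1, 2, 8, 9, 33],
    "runs1": [222, 240, 208, 214, 204],
    "runs2": [82, 207, 202, 217, 181],
})

# The comparison produces a boolean Series; np.where picks 'yes'
# where it is True and 'no' everywhere else.
high_scores1["is_score_chased"] = np.where(
    high_scores1["runs1"] >= high_scores1["runs2"], "yes", "no"
)
print(high_scores1)
```

Only match 9 (214 vs 217) should end up as 'no'.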

This is a good case for using apply.
Here is an example of using apply on two columns, adapted to your question:
def f(x):
    return 'yes' if x['runs1'] > x['runs2'] else 'no'

df['is_score_chased'] = df.apply(f, axis=1)
However, I would suggest filling the column with booleans, which makes it simpler:
def f(x):
    return x['runs1'] > x['runs2']
You can also use a lambda to do it in one line:
df['is_score_chased'] = df.apply(lambda x: x['runs1'] > x['runs2'], axis=1)
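For a boolean column, the row-wise apply is not strictly needed: comparing the two columns directly should give the same result. The sketch below checks this on the sample values from the question (the small DataFrame is rebuilt here as an assumption so the snippet is runnable):

```python
import pandas as pd

df = pd.DataFrame({
    "runs1": [222, 240, 208, 214, 204],
    "runs2": [82, 207, 202, 217, 181],
})

# Row-wise version: the lambda is called once per row.
by_apply = df.apply(lambda row: row["runs1"] > row["runs2"], axis=1)

# Column-wise version: one element-wise comparison over whole columns.
vectorized = df["runs1"] > df["runs2"]

# Both approaches flag the same rows.
print((by_apply == vectorized).all())
df["is_score_chased"] = vectorized
```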

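You need to index by row inside the loop. Iterating over the dataframe itself yields column names, so loop over its index instead:
# The column currently holds integers; cast it so it can store strings.
high_scores1['is_score_chased'] = high_scores1['is_score_chased'].astype(object)
for i in high_scores1.index:
    if high_scores1.loc[i, 'runs1'] >= high_scores1.loc[i, 'runs2']:
        high_scores1.loc[i, 'is_score_chased'] = 'yes'
    else:
        high_scores1.loc[i, 'is_score_chased'] = 'no'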

Related

Group by a category

I have done KMeans clusters and now I need to analyse each individual cluster. For example look at cluster 1 and see what clients are on it and make conclusions.
dfRFM['idcluster'] = num_cluster
dfRFM.head()
idcliente Recencia Frecuencia Monetario idcluster
1 3 251 44 -90.11 0
2 8 1011 44 87786.44 2
6 88 537 36 8589.57 0
7 98 505 2 -179.00 0
9 156 11 15 35259.50 0
How do I group so that I only see results from, let's say, idcluster 0 and sort by, let's say, "Monetario"? Thanks!
To filter a dataframe, the most common way is to use df[df[colname] == val]. Then you can use df.sort_values().
In your case, that would look like this:
dfRFM_id0 = dfRFM[dfRFM['idcluster']==0].sort_values('Monetario')
The way this filtering works is that dfRFM['idcluster']==0 returns a series of True/False based on if it is, well, true or false. So then we have a sort of dfRFM[(True,False,True,True...)], and so the dataframe returns only the rows where we have a True. That is, filtering/selecting the data where the condition is true.
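The mask described above can be inspected directly. This is a minimal sketch using the five rows shown in the question; it uses a default 0..4 index rather than the original row labels, which is an assumption made for brevity:

```python
import pandas as pd

dfRFM = pd.DataFrame({
    "idcliente": [3, 8, 88, 98, 156],
    "Recencia": [251, 1011, 537, 505, 11],
    "Frecuencia": [44, 44, 36, 2, 15],
    "Monetario": [-90.11, 87786.44, 8589.57, -179.00, 35259.50],
    "idcluster": [0, 2, 0, 0, 0],
})

# Boolean Series: True for rows that belong to cluster 0.
mask = dfRFM["idcluster"] == 0
print(mask.tolist())

# Indexing with the mask keeps only the True rows; then sort them.
dfRFM_id0 = dfRFM[mask].sort_values("Monetario")
print(dfRFM_id0)
```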
I think you actually just need to filter your DF!
df_new = dfRFM[dfRFM.idcluster == 0]
and then sort by Monetario
df_new = df_new.sort_values(by = 'Monetario')
Group by is really best for when you're wanting to look at the cluster as a whole - for example, if you wanted to see the average values for Recencia, Frecuencia, and Monetario for all of Group 0.
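For the whole-cluster view mentioned above, a groupby sketch might look like the following. It reuses the five sample rows from the question, so the averages are only illustrative:

```python
import pandas as pd

dfRFM = pd.DataFrame({
    "idcliente": [3, 8, 88, 98, 156],
    "Recencia": [251, 1011, 537, 505, 11],
    "Frecuencia": [44, 44, 36, 2, 15],
    "Monetario": [-90.11, 87786.44, 8589.57, -179.00, 35259.50],
    "idcluster": [0, 2, 0, 0, 0],
})

# One row per cluster, holding the mean of each RFM column.
summary = dfRFM.groupby("idcluster")[["Recencia", "Frecuencia", "Monetario"]].mean()
print(summary)

# Averages for cluster 0 only.
print(summary.loc[0])
```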

Check whether values occur in the rows of a dataframe column and return True or False

input data:
obj number
1 433
2 342
3 111
4 345
output data:
true
tried:
df[df['number'].isin([111,433])]
df.number.isin([111,433])
df.number.any() == 111 or 433
but none of them is giving me the result I'm looking for
I'm trying to parse a file, and any time a given number is in the dataframe I would like to run a special algorithm to reformat it. For example, if 111 is in the number column, I would like to add a column layout-name where the value 'layout1' should appear.
You are close. To test for a scalar, compare and then use Series.any to check for at least one True:
print ((df.number == 111).any())
True
To test multiple values with OR, use Series.isin with any:
df.number.isin([111, 222]).any()
And if you need to test consecutive values, e.g. 111 followed by 222 in the next row:
print (df)
obj number
0 1 433
1 2 342
2 3 111
3 4 222
print (((df['number'] == 111) & (df['number'].shift(-1) == 222)).any())
True
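To connect this back to the goal in the question, the result of any() can drive an ordinary if statement. The sketch below assumes that the whole frame should get the label 'layout1' when 111 is present; the question does not say whether only the matching row should be labelled, so treat that as an assumption:

```python
import pandas as pd

df = pd.DataFrame({"obj": [1, 2, 3, 4], "number": [433, 342, 111, 345]})

# .any() collapses the boolean Series to a single True/False,
# which is safe to use in a plain if statement.
if df["number"].eq(111).any():
    df["layout-name"] = "layout1"

print(df)
```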
You are making it too complicated. You can check whether any of the values is 111 with:
(df['number'] == 111).any()
or shorter:
df['number'].eq(111).any()
If you want to check that two (or more) values all occur in a series, you can use:
>>> import numpy as np
>>> np.any(df['number'].to_numpy()[:, None] == np.array([[111, 222]]), axis=0).all()
False
If the number of items to check against is relatively small, this should do the trick.
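An alternative way to ask "do all of these values occur in the column?" is to turn the test around and apply isin to the wanted values. This is a different technique from the broadcasting one above, shown only as a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"obj": [1, 2, 3, 4], "number": [433, 342, 111, 345]})

# For each wanted value, check membership in the column,
# then require every check to be True.
all_present = pd.Series([111, 222]).isin(df["number"]).all()
both_present = pd.Series([111, 433]).isin(df["number"]).all()

print(all_present)   # 222 is missing from the column
print(both_present)  # 111 and 433 are both in the column
```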

Python: Populate new df column based on if statement condition

I'm trying something new. I want to populate a new df column based on some conditions affecting another column with values.
I have a data frame with two columns (ID,Retailer). I want to populate the Retailer column based on the ids in the ID column. I know how to do this in SQL, using a CASE statement, but how can I go about it in python?
I've had a look at this example, but it isn't exactly what I'm looking for:
Python : populate a new column with an if/else statement
import pandas as pd
data = {'ID':['112','5898','32','9985','23','577','17','200','156']}
df = pd.DataFrame(data)
df['Retailer']=''
if df['ID'] in (112,32):
    df['Retailer']='Webmania'
elif df['ID'] in (5898):
    df['Retailer']='DataHub'
elif df['ID'] in (9985):
    df['Retailer']='TorrentJunkie'
elif df['ID'] in (23):
    df['Retailer']='Apptronix'
else: df['Retailer']='Other'
print(df)
The output I'm expecting to see would be something along these lines:
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Use numpy.select, and to test multiple values use Series.isin. Also, because the IDs in the sample data are strings, change the numbers to strings, e.g. 112 to '112':
import numpy as np
m1 = df['ID'].isin(['112','32'])
m2 = df['ID'] == '5898'
m3 = df['ID'] == '9985'
m4 = df['ID'] == '23'
vals = ['Webmania', 'DataHub', 'TorrentJunkie', 'Apptronix']
masks = [m1, m2, m3, m4]
df['Retailer'] = np.select(masks, vals, default='Other')
print(df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
If there are many categories, it is also possible to use your approach with a custom function:
def get_data(x):
    if x in ('112','32'):
        return 'Webmania'
    elif x == '5898':
        return 'DataHub'
    elif x == '9985':
        return 'TorrentJunkie'
    elif x == '23':
        return 'Apptronix'
    else: return 'Other'
df['Retailer'] = df['ID'].apply(get_data)
print (df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Or use map with a dictionary. IDs with no match become NaN, hence the added fillna:
d = {'112': 'Webmania','32':'Webmania',
     '5898':'DataHub',
     '9985':'TorrentJunkie',
     '23':'Apptronix'}
df['Retailer'] = df['ID'].map(d).fillna('Other')
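To make the NaN step of the dictionary approach visible, here is a self-contained sketch using the IDs from the question. The intermediate variable name mapped is introduced only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ID": ['112', '5898', '32', '9985', '23', '577', '17', '200', '156']})

d = {'112': 'Webmania', '32': 'Webmania',
     '5898': 'DataHub',
     '9985': 'TorrentJunkie',
     '23': 'Apptronix'}

# IDs that are not keys of the dictionary come back as NaN.
mapped = df["ID"].map(d)
print(mapped.isna().sum())

# Replace the unmatched entries with the default label.
df["Retailer"] = mapped.fillna("Other")
print(df)
```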

How to add a new column to a table formed from conditional statements?

I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop by using ge, which means greater than or equal to, and casting the boolean result to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
This wouldn't work. Firstly, you're using Data, not data. Secondly, even with that fixed you'd be comparing a scalar against a whole column, and using the resulting Series in an if raises a ValueError because its truth value is ambiguous. Thirdly, you're assigning to the entire column, so each iteration overwrites it.
You need to access the index label, which your loop didn't. You can use items to do this (called iteritems in older pandas versions; it was removed in pandas 2.0):
In [125]:
for idx, x in df["X"].items():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method here.
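The scalar-versus-column problem described in this answer can be reproduced in isolation. The following sketch assumes the three sample rows from the question and catches the exception only to show its message:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3], "X": [10, 20, 21], "Y": [3, 23, 34]})

message = None
try:
    # 10 >= df["Y"] is a boolean Series with three values, so Python
    # cannot reduce it to a single True/False for the if statement.
    if 10 >= df["Y"]:
        pass
except ValueError as exc:
    message = str(exc)

print(message)
```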
Firstly, one problem in your code is that you capitalized your dataframe name as 'Data' instead of 'data'.
For efficient code, EdChum has a great answer above. Another vectorized method, with code that is easy to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)

Pandas variable creation using multiple If-else

Need help with Pandas multiple IF-ELSE statements. I have a test dataset (titanic) as follows:
ID Survived Pclass Name Sex Age
1 0 3 Braund male 22
2 1 1 Cumings, Mrs. female 38
3 1 3 Heikkinen, Miss. Laina female 26
4 1 1 Futrelle, Mrs. female 35
5 0 3 Allen, Mr. male 35
6 0 3 Moran, Mr. male
7 0 1 McCarthy, Mr. male 54
8 0 3 Palsson, Master male 2
where Id is the passenger id. I want to create a new flag variable in this data frame which has the following rule:
if Sex=="female" or (Pclass==1 and Age <18) then 1 else 0.
Now to do this I tried a few approaches. This was my first approach:
df=pd.read_csv('data.csv')
for passenger_index,passenger in df.iterrows():
    if passenger['Sex']=="female" or (passenger['Pclass']==1 and passenger['Age']<18):
        df['Prediction']=1
    else:
        df['Prediction']=0
The problem with above code is that it creates a Prediction variable in df but with all values as 0.
However if I use the same code but instead output it to a dictionary it gives the right answer as shown below:
prediction={}
df=pd.read_csv('data.csv')
for passenger_index,passenger in df.iterrows():
    if passenger['Sex']=="female" or (passenger['Pclass']==1 and passenger['Age']<18):
        prediction[passenger['ID']]=1
    else:
        prediction[passenger['ID']]=0
This gives a dict prediction with keys as ID and values as 1 or 0 based on the above logic.
So why does the df version work wrongly? I even tried first defining a function and then calling it, which gave the same answer as the first approach.
So, how can we do this in pandas?
Secondly, I guess the same can be done with multiple if-else statements. I know np.where, but it does not allow an 'and' condition. Here is what I was trying:
df['Prediction']=np.where(df['Sex']=="female",1,np.where((df['Pclass']==1 and df['Age']<18),1,0))
The above gave an error for the 'and' keyword in where.
So can someone help? Solutions with multiple approaches, using np.where (simple if-else like), some function (applymap etc.), or modifications to what I wrote earlier, would be really appreciated.
Also, how do we do the same using the applymap or apply/map methods of df?
Instead of looping through the rows using df.iterrows (which is relatively slow), you can assign the desired values to the Prediction column in one assignment:
In [27]: df['Prediction'] = ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18))).astype('int')
In [29]: df['Prediction']
Out[29]:
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
Name: Prediction, dtype: int32
For your first approach, remember that df['Prediction'] represents an entire column of df, so df['Prediction']=1 assigns the value 1 to each row in that column. Since df['Prediction']=0 was the last assignment, the entire column ended up being filled with zeros.
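If you do want to keep a row-by-row loop, one way to avoid overwriting the whole column is to collect one flag per row and assign the list once at the end, much like the dictionary version in the question. This is only a sketch; the DataFrame below is rebuilt from the sample rows (without the Name column) as an assumption so the snippet runs without data.csv:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Age": [22, 38, 26, 35, 35, np.nan, 54, 2],
})

flags = []
for _, passenger in df.iterrows():
    is_flagged = passenger["Sex"] == "female" or (
        passenger["Pclass"] == 1 and passenger["Age"] < 18
    )
    # One value per row; a missing Age compares as False.
    flags.append(1 if is_flagged else 0)

# Assign the whole column once, after the loop.
df["Prediction"] = flags
print(df["Prediction"].tolist())
```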
For your second approach, note that you need to use & not and to perform an elementwise logical-and operation on two NumPy arrays or Pandas NDFrames. Thus, you could use
In [32]: np.where(df['Sex']=='female', 1, np.where((df['Pclass']==1)&(df['Age']<18), 1, 0))
Out[32]: array([0, 1, 1, 1, 0, 0, 0, 0])
though I think it is much simpler to just use | for logical-or and & for logical-and:
In [34]: ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18)))
Out[34]:
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 False
dtype: bool
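The question also asks for an apply-based version. A minimal sketch is shown below; the function name predict and the rebuilt sample DataFrame are assumptions for illustration, and the vectorized expression above remains the faster option:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Age": [22, 38, 26, 35, 35, np.nan, 54, 2],
})

def predict(row):
    # Inside apply(axis=1) each row holds scalars, so the plain
    # Python `or` / `and` keywords are fine here.
    return int(row["Sex"] == "female" or (row["Pclass"] == 1 and row["Age"] < 18))

df["Prediction"] = df.apply(predict, axis=1)

# Cross-check against the vectorized expression from the answer.
vectorized = ((df["Sex"] == "female") | ((df["Pclass"] == 1) & (df["Age"] < 18))).astype(int)
print((df["Prediction"] == vectorized).all())
```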
