Iterations and lambda functions - Python

I would like to apply a lambda function to several columns, but I am not sure how to loop through the columns. Basically I have Column1 through Column50 and I want the same operation applied to each, but I can't figure out how to iterate through them where x.column appears below. Is there a way to do this?
for column in df:
    df[column] = df.apply(lambda x: x.datacolumn * x.datacolumn2 if x.column >= x.datacolumn3, axis=1)

Are you looking for something like map()? map() applies a function to every item in a list (or other iterable); in Python 3 it returns a lazy iterator over the results, which you can wrap in list() if you want an actual list.
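For reference, a minimal sketch of map() (plain Python, no pandas needed):
# map() applies the function to each item; wrap the lazy iterator in list()
nums = [1, 2, 3, 4]
doubled = list(map(lambda x: x * 2, nums))
print(doubled)  # [2, 4, 6, 8]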
At a certain point, however, declaring a normal function and/or using a for loop might be easier.

First, you are missing the else branch (what should happen when the if condition is False?). And for accessing the elements of the pandas Series that the lambda receives as input, you can use positional indexing.
For example, setting the result to 0 when the condition does not hold:
for column in df:
    df[column] = df.apply(lambda x: x.iloc[0] * x.iloc[1] if x.iloc[0] >= x.iloc[2] else 0, axis=1)
(x.iloc[0] is the row's first value; plain integer indexing like x[0] on a labeled Series is deprecated in recent pandas.)
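As a quick check, the same row-wise lambda on a small made-up frame (the column names here are hypothetical):
import pandas as pd

df = pd.DataFrame({'a': [5, 1], 'b': [2, 3], 'c': [4, 9]})
# multiply each row's first two values when the first >= the third, else 0
result = df.apply(lambda x: x.iloc[0] * x.iloc[1] if x.iloc[0] >= x.iloc[2] else 0, axis=1)
print(result.tolist())  # [10, 0] -- 5 >= 4 gives 5*2; 1 < 9 gives 0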

It might be easiest to extract each column as a list, perform the operation, then write the result back into the dataframe.
for column in df:
    temp = [x for x in df.loc[:, column]]  # pull a list out using loc
    if temp[0] > temp[2]:
        temp[0] = temp[0] * temp[1]
    df.loc[:, column] = temp  # overwrite the original df column
The above leaves the data unchanged if the condition is not met.
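A minimal sketch of that pattern on a single made-up column; note that, as in the loop body above, it only ever modifies the first element:
import pandas as pd

df = pd.DataFrame({'col': [5, 2, 4]})
temp = [x for x in df.loc[:, 'col']]  # [5, 2, 4]
if temp[0] > temp[2]:                 # 5 > 4, so rescale the first element
    temp[0] = temp[0] * temp[1]       # 5 * 2 = 10
df.loc[:, 'col'] = temp
print(df['col'].tolist())             # [10, 2, 4]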

Related

Forming a condition based on a Dataframe groupby object, but getting more columns than expected

So I have the following line of code:
df[['Steps','CampaignSource','UserId']].groupby(['Steps','CampaignSource']).apply(lambda x : x.nunique() if x.name[0] != '9.2-Finalizado' else x.count())
As you can see, I apply a condition based on a group key, specifically the first one. But I get a strange end result, which gives me two more columns than I would like.
Any clues as to why? I would like only UserId to be returned. If necessary, I can provide a sample df.
You can slice the GroupBy object:
(df.groupby(['Steps','CampaignSource'])['UserId']
.apply(lambda x : x.nunique() if x.name[0] != '9.2-Finalizado' else x.count())
)
or for a DataFrame:
(df.groupby(['Steps','CampaignSource'])[['UserId']]
.apply(lambda x : x.nunique() if x.name[0] != '9.2-Finalizado' else x.count())
)
If you are asking why you see the two additional columns: it is because you are applying the lambda function over all three columns (Steps, CampaignSource, UserId) and performing a nunique() operation. This returns a 1 for both the Steps and CampaignSource columns, because each group has exactly one unique value in them.
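To make the difference concrete, a minimal sketch with made-up data (in older pandas versions, apply on a grouped DataFrame includes the grouping columns, which is exactly what produces the extra columns):
import pandas as pd

df = pd.DataFrame({
    'Steps': ['1-Inicio', '1-Inicio', '9.2-Finalizado'],
    'CampaignSource': ['a', 'a', 'b'],
    'UserId': [10, 11, 10],
})

# Applying over all three columns also returns Steps and CampaignSource:
wide = (df[['Steps','CampaignSource','UserId']]
        .groupby(['Steps','CampaignSource'])
        .apply(lambda x: x.nunique() if x.name[0] != '9.2-Finalizado' else x.count()))
print(list(wide.columns))  # ['Steps', 'CampaignSource', 'UserId']

# Slicing the GroupBy first keeps only UserId:
narrow = (df.groupby(['Steps','CampaignSource'])['UserId']
          .apply(lambda x: x.nunique() if x.name[0] != '9.2-Finalizado' else x.count()))
print(narrow)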

Apply function on multiple columns and create new column based on condition

I am trying to apply a function on multiple columns in a pandas dataframe, where I compare the values of two columns to create a third, new column based on this comparison. The code runs; however, the output is not correct. For example, this code:
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1","col2","col3","col4","col5"]
lst2 = ["col1_1","col2_2","col3_3","col4_4","col5_5"]
i = 0
for item in lst2:
    df[str(item)+"_2"] = df.apply(lambda x: conditions(x, column1=x[item], column2=x[lst1[i]]), axis=1)
    i = i + 1
The output should show that the first row contains incorrect instances, but everything is marked as correct. The correct result would be that col4_4_2 and col5_5_2 are marked as incorrect.
Is it possible to apply a function in this way on multiple columns and pass the column names as arguments in pandas? If so, how should it be done?
You didn't provide a df, so I used this:
df = pd.DataFrame([[0,0,0,1,0,0,0,0,0,1,0,0,0,0,0]],
                  columns=['col1','col2','col3','col4','col5',
                           'col1_1','col2_2','col3_3','col4_4','col5_5',
                           'col1_1_2','col2_2_2','col3_3_2','col4_4_2','col5_5_2'])
Your conditions function expects a dataframe row and then references to two of its columns, but you are supplying it a row and then two values. One way to solve your problem is to change your comparison function to something like this (note you don't actually need the row x itself anymore):
def conditions(x, column1, column2):
    # column1 and column2 now receive the two cell values themselves
    if column1 != column2:
        return "incorrect"
    else:
        return "correct"
Alternatively, you could change the line with the lambda in it to something like this:
df[str(item)+"_2"] = df.apply(lambda x: conditions(x, lst2[i], lst1[i]) , axis=1)
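For example, keeping the question's original name-based conditions() and swapping in only the corrected lambda produces the expected labels on the sample frame above:
i = 0
for item in lst2:
    df[str(item) + "_2"] = df.apply(lambda x: conditions(x, lst2[i], lst1[i]), axis=1)
    i = i + 1

print(df.loc[0, ['col1_1_2', 'col4_4_2', 'col5_5_2']].tolist())
# ['correct', 'incorrect', 'incorrect']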
I first had to add the columns and fill them with zeros, then apply the function.
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1","col2","col3","col4","col5"]
lst2 = ["col1_1","col2_2","col3_3","col4_4","col5_5"]

for item in lst2:
    df[str(item)+"_2"] = 0

i = 0
for item in df.columns[-5:]:
    df[item] = df.apply(lambda x: conditions(x, column1=lst1[i], column2=lst2[i]), axis=1)
    i = i + 1

Groupby and apply a specific function to certain columns and get first or last values of the df Pandas

Based on the previous post: Groupby and apply a specific function to certain columns and another function to the rest of the df Pandas
I want to group a dataframe with a large number of columns, applying a function (sum, mean, etc.) to only two columns and taking the first value of the remaining columns. How can I do that? In the quoted post the following code worked, but when I replace "else x.mean()" with "else x.first()", it doesn't work anymore.
df = df.groupby('id').agg(lambda x : x.count() if x.name in ['var1','var2'] else x.mean())
Any ideas?
Try using x.iloc[0] for the first value and x.iloc[-1] for the last value. (Series.first() is for time-based selection and expects an offset argument, which is why it fails here.)
df = df.groupby('id').agg(lambda x : x.count() if x.name in ['var1','var2'] else x.iloc[0])
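A minimal sketch with made-up data (the column names var1/var2/var3 are hypothetical) showing how agg calls the lambda once per column, with x.name set to that column's name:
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2],
    'var1': [10, 20, 30],
    'var2': [1, 2, 3],
    'var3': ['a', 'b', 'c'],
})

out = df.groupby('id').agg(lambda x: x.count() if x.name in ['var1','var2'] else x.iloc[0])
print(out)
#     var1  var2 var3
# id
# 1      2     2    a
# 2      1     1    c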

Is there an equivalent Python function similar to complete.cases in R

I am removing a number of records in a pandas data frame which contain diverse combinations of NaN within the 4-column frame. I have created a function called complete_cases to provide the indexes of rows which meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal enough or whether there is a better way to do it.
Absolutely. All you need to do is
df.dropna(axis=0, how='any', inplace=True)
This removes every row that has at least one missing value and updates the data frame in place.
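One nuance worth noting: how='any' reproduces R's complete.cases semantics (keep only rows with no NA at all), while the complete_cases function in the question looks for rows where every column is NaN, which corresponds to how='all'. A small sketch of both:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [2, 3, np.nan]})

complete = df.dropna(how='any')  # keeps row 0 only (no NaN anywhere)
partial = df.dropna(how='all')   # drops only row 2 (all columns NaN)

# Indexes of the all-NaN rows, matching the question's complete_cases():
print(df.index[df.isna().all(axis=1)].tolist())  # [2]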
I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[df.isna().any(axis='columns')]
This selects the rows that have at least one NaN, i.e. the incomplete cases. Negate the mask with ~ (df.loc[~df.isna().any(axis='columns')]) to keep only the complete cases, as complete.cases does in R.
A possible solution:
1. Count the number of columns with NA, creating a new column to save the count.
2. Based on this new column, filter the rows of the data frame as you wish.
3. Remove the (now) unnecessary column.
It is possible to do this with a lambda function. For example, if you want to remove rows that have 10 NA values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]
del df['count']
Note the bracket access df['count'] in the filter: attribute access (df.count) would refer to the DataFrame's count method, not the new column.
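As a sketch, the same filter can be written without the temporary column:
df = df[df.isna().sum(axis=1) != 10]  # keep rows that do not have exactly 10 NAs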

List comprehension pandas assignment

How do I use a list comprehension, or any other technique, to refactor the code I have? I'm working on a DataFrame, modifying values in the first example and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'],errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'],errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines
I think you simply need a loop, especially if you want to avoid apply and have many columns:
cols = ['start_d','end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is necessary because the result is a list of Series:
comp = [pd.to_datetime(df[c],errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df = pd.concat(comp, axis=1)
But a solution with apply is still possible here:
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))
(Note the .dt.strftime has to go inside the lambda: apply returns a DataFrame, which has no .dt accessor.)
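A quick check of the loop version on a made-up frame (the 'bad' entry is coerced to NaT and then to NaN by strftime):
import pandas as pd

df = pd.DataFrame({'start_d': ['2020-01-05', 'bad'],
                   'end_d':   ['2020-02-10', '2020-03-15']})
cols = ['start_d','end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
print(df)
#        start_d        end_d
# 0  2020-Jan-05  2020-Feb-10
# 1          NaN  2020-Mar-15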
