So I have the following line of code.
df[['Steps','CampaignSource','UserId']].groupby(['Steps','CampaignSource']).apply(lambda x : x.nunique() if x.name[0] != '9.2-Finalizado' else x.count())
As you can see, I apply a condition based on one of the group keys, specifically the first one. But I get this weird end result, which gives me two more columns than I would like.
Any clues as to why? I would like only UserId to be returned. If necessary, I can provide a sample df.
You can slice the GroupBy object:
(df.groupby(['Steps','CampaignSource'])['UserId']
   .apply(lambda x: x.nunique() if x.name[0] != '9.2-Finalizado' else x.count())
)
or for a DataFrame:
(df.groupby(['Steps','CampaignSource'])[['UserId']]
   .apply(lambda x: x.nunique() if x.name[0] != '9.2-Finalizado' else x.count())
)
If you are asking why you see the two additional columns: you are applying the lambda function over all three selected columns (Steps, CampaignSource, UserId), and nunique() returns 1 for both Steps and CampaignSource because each group has exactly one unique value in its own key columns.
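To see this concretely, here is a minimal sketch with made-up data (the values are assumptions, not from the question; newer pandas versions may also warn about grouping columns inside apply):

import pandas as pd

df = pd.DataFrame({
    'Steps': ['1-Inicio', '1-Inicio', '9.2-Finalizado'],
    'CampaignSource': ['google', 'google', 'email'],
    'UserId': [10, 11, 10],
})

# Lambda applied to all three selected columns: Steps and CampaignSource
# come back as 1, alongside the UserId figure you actually want.
wide = (df[['Steps', 'CampaignSource', 'UserId']]
        .groupby(['Steps', 'CampaignSource'])
        .apply(lambda x: x.nunique() if x.name[0] != '9.2-Finalizado' else x.count()))

# Selecting UserId before apply keeps only that column in the result.
narrow = (df.groupby(['Steps', 'CampaignSource'])['UserId']
          .apply(lambda x: x.nunique() if x.name[0] != '9.2-Finalizado' else x.count()))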
I am trying to apply a function to multiple columns in a pandas dataframe, comparing the values of two columns to create a third new column based on the comparison. The code runs; however, the output is not correct. For example, this code:
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"
lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]
i = 0
for item in lst2:
    df[str(item) + "_2"] = df.apply(lambda x: conditions(x, column1=x[item], column2=x[lst1[i]]), axis=1)
    i = i + 1
The first row should contain incorrect instances, but everything is marked as correct. The correct output would have col4_4_2 and col5_5_2 marked as incorrect.
Is it not possible to apply a function this way on multiple columns and pass the column names as arguments in pandas? If so, how should it be done?
You didn't provide a df, so I used this:
df = pd.DataFrame([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col5',
                           'col1_1', 'col2_2', 'col3_3', 'col4_4', 'col5_5',
                           'col1_1_2', 'col2_2_2', 'col3_3_2', 'col4_4_2', 'col5_5_2'])
Your conditions function is expecting a row and references to two of its columns, but you are supplying it the row and then two plain values. One way to solve your problem is to change your comparison function to something like this (note it no longer needs to index into x at all):
def conditions(x, column1, column2):
    print(column1, column2)  # debug output: the two values being compared
    if column1 != column2:
        return "incorrect"
    else:
        return "correct"
Alternatively, you could change the line with the lambda in it to something like this:
df[str(item)+"_2"] = df.apply(lambda x: conditions(x, lst2[i], lst1[i]) , axis=1)
I first had to add the columns and fill them with zeros, then apply the function.
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"
lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

# First create the new columns, filled with zeros.
for item in lst2:
    df[str(item) + "_2"] = 0

# Then apply the comparison, passing column names rather than values.
i = 0
for item in df.columns[-5:]:
    df[item] = df.apply(lambda x: conditions(x, column1=lst1[i], column2=lst2[i]), axis=1)
    i = i + 1
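For completeness, a minimal end-to-end sketch of the name-passing fix, with made-up one-row data (only the ten input columns are needed):

import pandas as pd

df = pd.DataFrame([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col5',
                           'col1_1', 'col2_2', 'col3_3', 'col4_4', 'col5_5'])

def conditions(x, column1, column2):
    # Compare the row's values under the two column *names*.
    return "incorrect" if x[column1] != x[column2] else "correct"

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

for a, b in zip(lst1, lst2):
    df[b + "_2"] = df.apply(lambda x: conditions(x, a, b), axis=1)

# col4 (1) vs col4_4 (0) and col5 (0) vs col5_5 (1) differ, so
# col4_4_2 and col5_5_2 come out "incorrect" and the rest "correct".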
Based on the previous post: Groupby and apply a specific function to certain columns and another function to the rest of the df Pandas
I want to group a dataframe with a large number of columns, applying a function (sum, mean, etc.) to only two columns and taking the first value of the remaining columns. How can I do that? In the quoted post the following code worked, but when I replace "else x.mean()" with "else x.first()", it doesn't work anymore.
df = df.groupby('id').agg(lambda x : x.count() if x.name in ['var1','var2'] else x.mean())
Any ideas?
Try using x.iloc[0] for first value and x.iloc[-1] for last value:
df = df.groupby('id').agg(lambda x : x.count() if x.name in ['var1','var2'] else x.iloc[0])
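For background: x.first() fails because Series.first expects a date-offset argument (it selects the initial periods of a time-indexed series), so it is not a "first element" accessor, while x.iloc[0] is purely positional. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2],
    'var1': [10, 20, 30],
    'var2': [1, 2, 3],
    'name': ['a', 'b', 'c'],
})

out = df.groupby('id').agg(
    lambda x: x.count() if x.name in ['var1', 'var2'] else x.iloc[0]
)
# var1/var2 hold per-group counts; name holds the first value per group.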
I am removing a number of records from a pandas data frame that contains diverse combinations of NaN across its four columns. I have created a function called complete_cases to provide the indexes of rows that meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in dataframe.index
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal or whether there is a better way to do it.
Absolutely. All you need to do is
df.dropna(axis=0, how='any', inplace=True)
This removes every row that has at least one missing value and updates the data frame in place.
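Note that how='any' is stricter than the condition in the question: to drop only the rows in which every column is NaN (the complete_cases condition), use how='all'. A sketch, assuming that is the intent:

# Drop only the rows in which every column is NaN.
df.dropna(axis=0, how='all', inplace=True)

# Or just collect their indexes, like complete_cases does:
all_nan_idx = df.index[df.isna().all(axis=1)]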
I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[df.isna().any(axis='columns')]
This selects the rows that are not complete cases (those with at least one NaN); negating the mask keeps only the complete ones, as complete.cases does in R.
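A sketch of the negated mask, for the R-style complete.cases behavior:

complete = df.loc[~df.isna().any(axis='columns')]  # only fully observed rows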
A possible solution:
1. Count the number of columns with NA, saving the result in a new column.
2. Based on this new column, filter the rows of the data frame as you wish.
3. Remove the (now) unnecessary column.
It is possible to do it with a lambda function. For example, if you want to remove rows that have 10 "NA" values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]  # df.count is the method, so index the column explicitly
del df['count']
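The same filter can also be written without apply; a vectorized sketch:

# Keep only the rows that do not have exactly 10 NaN values.
df = df[df.isna().sum(axis=1) != 10]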
How do I use a list comprehension, or any other technique, to refactor the code I have? I'm working on a DataFrame, modifying values in the first example and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'],errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'],errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines.
I think you need a simple loop, especially if you want to avoid apply and have many columns:
cols = ['start_d', 'end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is necessary, because the result is a list of Series:
comp = [pd.to_datetime(df[c],errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df = pd.concat(comp, axis=1)
But a solution with apply is still possible; note that .dt has to be applied per Series, inside the lambda (a DataFrame has no .dt accessor):
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))
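Example 2 fits the same loop pattern; a sketch (assign works as a one-liner too):

for c in ['col1', 'col2']:
    df[c] = 'NA'

# or, without a loop:
df = df.assign(col1='NA', col2='NA')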