I am trying to apply a function to multiple columns in a pandas dataframe, comparing the values of two columns to create a third, new column based on this comparison. The code runs; however, the output is not correct. For example, this code:
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]
i = 0
for item in lst2:
    df[str(item)+"_2"] = df.apply(lambda x: conditions(x, column1=x[item], column2=x[lst1[i]]), axis=1)
    i = i+1
The output should show that the first row contains incorrect instances, but everything is marked as correct. The correct result would be that col4_4_2 and col5_5_2 are marked as incorrect.
Is it not possible to apply a function in this way to multiple columns, passing the column names as arguments, in pandas? If so, how should it be done?
You didn't provide a df, so I used this:
df = pd.DataFrame([[0,0,0,1,0,0,0,0,0,1,0,0,0,0,0]],columns = ['col1', 'col2', 'col3', 'col4', 'col5','col1_1','col2_2','col3_3','col4_4','col5_5','col1_1_2','col2_2_2','col3_3_2','col4_4_2','col5_5_2',])
Your conditions function is expecting a row and the names of two of its columns, but you are supplying it a row and then two values. One way to solve your problem is to change your comparison function to something like this (note that you don't actually need the row itself anymore):
def conditions(x, column1, column2):
    print(column1, column2)  # debug: show the two values being compared
    if column1 != column2:
        return "incorrect"
    else:
        return "correct"
Alternatively, you could change the line with the lambda in it to something like this:
df[str(item)+"_2"] = df.apply(lambda x: conditions(x, lst2[i], lst1[i]) , axis=1)
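A compact way to avoid the manual index counter entirely is to zip the two column lists together. A sketch, using the same sample df and column lists as above (binding the pair names as lambda defaults so each iteration captures its own columns):

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col5',
                           'col1_1', 'col2_2', 'col3_3', 'col4_4', 'col5_5'])

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

# zip pairs each original column with its counterpart, so no counter is needed
for c1, c2 in zip(lst1, lst2):
    df[c2 + "_2"] = df.apply(
        lambda x, a=c1, b=c2: "correct" if x[a] == x[b] else "incorrect",
        axis=1)
```

With this sample row, col4/col4_4 and col5/col5_5 differ, so col4_4_2 and col5_5_2 come out as incorrect, as expected.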
I first had to add the columns and fill them with zeros, then apply the function.
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]
for item in lst2:
    df[str(item)+"_2"] = 0

i = 0
for item in df.columns[-5:]:
    df[item] = df.apply(lambda x: conditions(x, column1=lst1[i], column2=lst2[i]), axis=1)
    i = i+1
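As a side note, this per-pair comparison doesn't need apply at all: np.where can label each row based on an elementwise column comparison. A sketch, assuming the same sample data and column lists:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col5',
                           'col1_1', 'col2_2', 'col3_3', 'col4_4', 'col5_5'])

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

for c1, c2 in zip(lst1, lst2):
    # compare the two whole columns at once instead of row by row
    df[c2 + "_2"] = np.where(df[c1] == df[c2], "correct", "incorrect")
```

The vectorized comparison is also considerably faster than apply on large frames.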
Related
I have two dataframes df1 and df2. I would like to highlight in yellow all the rows in df1 that are also present in df2.
So far I have only found solutions in which I insert another column and use a variable there to identify which rows I have to colour.
My question is whether it is possible to compare these two df directly in the function presented below.
So these are the two df's:
df1 = pd.DataFrame([['AA',3,'hgend',1], ['BB','frdf',7,2], ['C1',4,'asef',4], ['C2',4,'asef',4], ['C3',4,'asef',4]], columns=list("ABCD"))
df2 = pd.DataFrame([['C1',4,'asef',4], ['C2',4,'asef',4], ['C3',4,'asef',4]], columns=list("XYZQ"))
This is my code to colour the rows:
def highlight_rows(row):
    value = row.loc['A']
    if value == 'C1':
        color = 'yellow'
    else:
        color = ''
    return ['background-color: {}'.format(color) for r in row]

df1.style.apply(highlight_rows, axis=1)
As I said, if I do the comparison beforehand, insert another column and put a variable there, I can then search for this variable and highlight the row.
My question is whether I can also do this directly in the function. To do this, I would have to be able to compare both df's in the function. Is this possible at all? It would be enough to be able to compare a single row, e.g. with .isin
Comparing to df2 inside the function would be inefficient.
You could define a temporary column to identify matches using a merge (the indicator column in_1 becomes left_only or both depending on whether or not the row is present in df2). It is then ignored by the styler:
def highlight_rows(row):
    highlight = 'yellow' if row['in_1'] == "both" else ""
    return ['background-color: {}'.format(highlight) for r in row]

(pd.merge(df1, df2.set_axis(df1.columns.tolist(), axis=1),
          how="left", indicator="in_1")
   .style
   .hide_columns(['in_1'])  # note: in pandas >= 1.4, hide_columns is deprecated in favour of .hide(['in_1'], axis='columns')
   .apply(highlight_rows, axis=1))
Alternatively, to actually do the comparison inside the function, define a set of tuples of df2 rows beforehand:
set_df2 = set(df2.apply(tuple, axis=1))

def highlight_rows(row):
    color = 'yellow' if tuple(row) in set_df2 else ""
    return [f'background-color: {color}'] * len(row)

df1.style.apply(highlight_rows, axis=1)
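To see which rows the set-based lookup actually matches, the same idea can be checked without the styler. A sketch using the df1/df2 from the question (the column names differ, so only the values are compared):

```python
import pandas as pd

df1 = pd.DataFrame([['AA', 3, 'hgend', 1], ['BB', 'frdf', 7, 2],
                    ['C1', 4, 'asef', 4], ['C2', 4, 'asef', 4],
                    ['C3', 4, 'asef', 4]], columns=list("ABCD"))
df2 = pd.DataFrame([['C1', 4, 'asef', 4], ['C2', 4, 'asef', 4],
                    ['C3', 4, 'asef', 4]], columns=list("XYZQ"))

# each row becomes a hashable tuple; membership is then a cheap set lookup
set_df2 = set(df2.apply(tuple, axis=1))
mask = df1.apply(tuple, axis=1).isin(set_df2)
```

Only the C1, C2 and C3 rows of df1 appear in df2, so only those get True in the mask.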
I would like to apply a lambda function to several columns but I am not sure how to loop through the columns. Basically I have Column1 - Column50 and I want the exact same thing to happen on each but can't figure out how to iterate through them where x.column is below. Is there a way to do this?
for column in df:
    df[column] = df.apply(lambda x: x.datacolumn * x.datacolumn2 if x.column >= x.datacolumn3, axis=1)
Are you looking for something like map()? map() applies a function to every item in a list (or other iterable) and returns a list containing the results.
Here's an eloquent explanation of how it works (way better than what I could write).
At a certain point, however, declaring a normal function and/or using a for loop might be easier.
First, you are missing the else branch (what should happen when the if condition is False?), and for accessing the elements of the pandas Series (the input to the lambda function), you can use indexes.
For example, setting to 0, if the condition does not stand:
for column in df:
    df[column] = df.apply(lambda x: x[0] * x[1] if x[0] >= x[2] else 0, axis=1)
It might be easiest to extract each column as a list, perform the operation, then write the result back into the dataframe.
for column in df:
    temp = [x for x in df.loc[:, column]]  # pull the column out as a list using loc
    if temp[0] > temp[2]:
        temp[0] = temp[0] * temp[1]
    df.loc[:, column] = temp  # overwrite the original df column
The above leaves data unchanged if condition is not met.
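If the condition and arithmetic involve whole columns, the loop can also be replaced by a vectorized np.where. A sketch using the made-up column names from the question (datacolumn, datacolumn2, datacolumn3 are placeholders, since the real names were not given):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'datacolumn': [5, 1],
                   'datacolumn2': [2, 3],
                   'datacolumn3': [3, 4]})

# where the condition holds, multiply; otherwise keep the original value
df['datacolumn'] = np.where(df['datacolumn'] >= df['datacolumn3'],
                            df['datacolumn'] * df['datacolumn2'],
                            df['datacolumn'])
```

Here the first row satisfies the condition (5 >= 3) and becomes 5 * 2 = 10, while the second row is left unchanged.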
I am removing a number of records from a pandas data frame which contains diverse combinations of NaN in its 4 columns. I have created a function called complete_cases to provide the indexes of rows which meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal or whether there is a better way to do it.
Absolutely. All you need to do is
df.dropna(axis=0, how='all', inplace=True)
This will remove all rows in which every value is missing, and updates the data frame "inplace". (With how='any' it would instead drop every row containing at least one missing value.)
I'd recommend using loc, isna, and all with the 'columns' axis, like this:
df.loc[df.isna().all(axis='columns')]
This selects exactly the rows your complete_cases function finds (all values missing); negate the mask with ~ to keep everything else.
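For reference, a small sketch showing the vectorized equivalents of complete_cases on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, np.nan, 4],
                   'b': [np.nan, np.nan, 2, 5]})

# indexes of rows where every column is NaN (what complete_cases computes)
idx = df.index[df.isna().all(axis='columns')].tolist()

# dropping those rows in one step
cleaned = df.dropna(how='all')
```

Only row 0 is entirely NaN here, so idx is [0] and cleaned keeps rows 1 to 3.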
A possible solution:
Count the number of columns with "NA" creating a column to save it
Based on this new column, filter the rows of the data frame as you wish
Remove the (now) unnecessary column
It is possible to do it with a lambda function. For example, if you want to remove rows that have 10 "NA" values:
df['na_count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['na_count'] != 0]
del df['na_count']
(The helper column is named na_count rather than count, because df.count is an existing DataFrame method, so df.count != 0 would not refer to the column.)
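The same filter can also be written without a helper column, counting NaNs per row directly. A sketch with a toy 10-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.full((3, 10), np.nan))
df.iloc[1, 0] = 1.0  # row 1 now has only 9 NaNs

# keep only rows whose NaN count is not 10
df = df[df.isna().sum(axis=1) != 10]
```

Rows 0 and 2 are entirely NaN and get dropped; row 1 survives.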
df.apply is a method that can apply a function to all the columns in a dataframe, or to selected columns. My aim, however, is to compute the hash of a string: the concatenation of all the values in a row across all columns. My current code returns NaN.
The current code is:
df["row_hash"] = df["row_hash"].apply(self.hash_string)
The function self.hash_string is:
def hash_string(self, value):
    return sha1(str(value).encode('utf-8')).hexdigest()
Yes, it would be easier to merge all columns of the Pandas dataframe, but the existing answers couldn't help me either.
The file that I am reading is(the first 10 rows):
16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026
The col names are: col_test_1, col_test_2, .... , col_test_11
You can create a new column, which is concatenation of all others:
df['new'] = df.astype(str).values.sum(axis=1)
And then apply your hash function on it
df["row_hash"] = df["new"].apply(self.hash_string)
or this one-liner should work (note that .values.sum(axis=1) returns a NumPy array, which has no apply method, so sum over the DataFrame instead to keep a Series):
df["row_hash"] = df.astype(str).sum(axis=1).apply(hash_string)
However, not sure if you need a separate function here, so:
df["row_hash"] = df.astype(str).sum(axis=1).apply(lambda x: sha1(str(x).encode('utf-8')).hexdigest())
You can use apply twice, first on the row elements then on the result:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)
Sidenote: I don't understand why you are defining hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. In case you have problems, you can just pass it as a function:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
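Putting the pieces together as a self-contained sketch (using a plain function rather than a method, and a two-column toy frame in place of the original eleven columns):

```python
from hashlib import sha1

import pandas as pd

df = pd.DataFrame([[16012, 16013], [16013, 16014]],
                  columns=['col_test_1', 'col_test_2'])

# concatenate each row's values into one string, then hash that string
df['row_hash'] = (df.astype(str)
                    .apply(lambda x: ''.join(x), axis=1)
                    .apply(lambda v: sha1(v.encode('utf-8')).hexdigest()))
```

Each hash is the SHA-1 of the row's concatenated values, e.g. the first row hashes the string '1601216013'.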
I'm practicing with using apply with Pandas dataframes.
So I have cooked up a simple dataframe with dates and values:
dates = pd.date_range('2013',periods=10)
values = list(np.arange(1,11,1))
DF = pd.DataFrame({'date': dates, 'value': values})
I have a second dataframe, which is made up of 3 rows of the original dataframe:
DFa = DF.iloc[[1,2,4]]
So, I'd like to use the 2nd dataframe, DFa, and get the date from each row (using apply), and then find and sum up the values in the original dataframe whose dates came earlier:
def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans = DF[DF['date'] < cutoff_date]

DFa.apply(foo, axis=1)
Things run fine. My question is: since I've created 3 ans values, how do I access them? Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.
Your function needs to return a value. E.g.,
def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans

DFa.apply(lambda x: foo(x, DF), axis=1)
Also, note that if your original function returned ans (a filtered DataFrame), apply would produce a DataFrame for each row in DFa, so you would end up with a DataFrame of DataFrames.
There's a bit of a mixup in the way you're using apply. With axis=1, foo will be applied to each row (see the docs), and yet your code implies (by the parameter name) that its first parameter is a DataFrame.
Additionally, you state that you want to sum up the original DataFrame's values for those less than the date. So foo needs to do this, and return the values.
So the code needs to look something like this:
def foo(row, DF=DF):
    cutoff_date = row['date']
    return DF[DF['date'] < cutoff_date].value.sum()
Once you make the changes, as foo returns a scalar, apply will return a Series:
>>> DFa.apply(foo, axis=1)
1     1
2     3
4    10
dtype: int64
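For completeness, the whole example runs end to end like this (reproducing the question's setup):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013', periods=10)
values = list(np.arange(1, 11, 1))
DF = pd.DataFrame({'date': dates, 'value': values})
DFa = DF.iloc[[1, 2, 4]]

def foo(row, DF=DF):
    # sum the values of all rows dated strictly before this row's date
    return DF[DF['date'] < row['date']].value.sum()

result = DFa.apply(foo, axis=1)
```

For row index 1 only value 1 precedes it; for index 2 the sum is 1 + 2 = 3; for index 4 it is 1 + 2 + 3 + 4 = 10, matching the Series shown above.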