I have a dataframe like this, and I am trying to check the correlation between its columns after applying an isbigger function that tests each value.
import pandas as pd
df = pd.read_csv('2020.csv')
def isbigger(x):
    if x > 7:
        return True
    return False
It works fine when I pass in a single column, like this:
df['Ladder score'].apply(isbigger)
However, when I apply it to the whole dataframe and then compute the correlation, it does not work:
df.drop(axis=1,columns = ['Country name','Regional indicator']).apply(isbigger).corr(method = 'spearman')
I even dropped the string columns, but it still does not work. How can I apply the function to an entire dataframe?
You don't need apply, just use the usual boolean operators here:
df.drop(['Country name', 'Regional indicator'], axis=1).gt(7).corr()
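To illustrate why the original call fails and what the comparison does instead, here is a minimal sketch on a made-up frame (the values below are hypothetical, not taken from 2020.csv). DataFrame.apply hands each whole column to the function, so `if x > 7` is evaluated on a Series and raises a ValueError, while gt(7) compares element by element.

```python
import pandas as pd

# Hypothetical stand-in for the numeric columns of the asker's data.
df = pd.DataFrame({
    "Ladder score": [7.8, 7.6, 5.1, 4.2],
    "Healthy life expectancy": [71.9, 72.4, 6.5, 8.0],
})

def isbigger(x):
    if x > 7:
        return True
    return False

# DataFrame.apply passes an entire column (a Series) to isbigger,
# so the `if` cannot reduce it to a single truth value.
try:
    df.apply(isbigger)
    apply_failed = False
except ValueError:
    apply_failed = True

# Element-wise comparison gives a boolean frame of the same shape.
flags = df.gt(7)

# Correlation between the boolean columns (True/False treated as 1/0).
corr = flags.corr(method="spearman")
```

Note that a column whose flags are all True or all False has zero variance, so its correlations would come out as NaN.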
The apply() function is used to invoke a Python function on each value of a Series.
Syntax:
Series.apply(self, func, convert_dtype=True, args=(), **kwds)
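As a small illustration of that signature, the sketch below calls Series.apply with a function that takes one extra positional argument, supplied through args. The function name is_bigger_than and the data are made up for the example.

```python
import pandas as pd

scores = pd.Series([7.8, 5.1, 7.2])

def is_bigger_than(value, limit):
    # Called once per element of the Series; `limit` comes from args.
    return value > limit

result = scores.apply(is_bigger_than, args=(7,))
```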
I currently have this lambda function:
df["domain_count"] = df.apply(lambda row: df['domain'].value_counts()[row['domain']], axis=1)
But I want to convert it to a regular function such as def get_domain_count(). How do I do this? I'm not sure what parameters it should take, since I want to apply it to an entire column of a dataframe. The domain column contains duplicates, and I want to know how many times each domain appears in my dataframe.
Example starting df:
| domain |
|---|
| target.com |
| macys.com |
| target.com |
| walmart.com |
| walmart.com |
| target.com |
Example resulting df:
| domain | count |
|---|---|
| target.com | 3 |
| macys.com | 1 |
| target.com | 3 |
| walmart.com | 2 |
| walmart.com | 2 |
| target.com | 3 |
Please help! Thanks in advance!
You can pass the column name as a string, and the dataframe object to mutate:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame.apply(lambda row: df[col_name]...)
But better yet, you don't need to apply!
df["domain"].map(df["domain"].value_counts())
will first get the counts per unique value, and map each value in the column with that. So the function could become:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame[col_name].map(frame[col_name].value_counts())
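For reference, here is a runnable sketch of that approach on the example data from the question. Only the local variable counts is added for readability.

```python
import pandas as pd

df = pd.DataFrame({"domain": ["target.com", "macys.com", "target.com",
                              "walmart.com", "walmart.com", "target.com"]})

def countify(frame, col_name):
    # Count each unique value once, then look every row's value up in those counts.
    counts = frame[col_name].value_counts()
    frame[f"{col_name}_count"] = frame[col_name].map(counts)

countify(df, "domain")
```

Because value_counts() runs once rather than once per row, this should scale much better than the row-wise apply.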
A lambda is just an anonymous function, and it's usually easy to turn it into a regular function by reusing the lambda's own parameter list (in this case, row) and returning its expression. The challenge with this one is the df name, which will resolve differently in a module-level function than it does in your lambda. So, add it as a parameter to the function:
def get_domain_count(df, row):
    return df['domain'].value_counts()[row['domain']]
This can be a problem if you still want to use this function in an .apply operation. .apply wouldn't know to add that df parameter at the front. To solve that, you could create a partial.
import functools
def do_stuff(some_df):
    # partial fixes the first argument (the dataframe); apply supplies each row
    return some_df.apply(functools.partial(get_domain_count, some_df), axis=1)
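A self-contained sketch of the partial approach, assuming a tiny made-up frame. The helper name add_domain_count is hypothetical and only used for illustration.

```python
import functools
import pandas as pd

def get_domain_count(df, row):
    return df["domain"].value_counts()[row["domain"]]

def add_domain_count(some_df):
    # Bind the dataframe as the first argument; apply then passes only the row.
    counter = functools.partial(get_domain_count, some_df)
    some_df["domain_count"] = some_df.apply(counter, axis=1)
    return some_df

frame = pd.DataFrame({"domain": ["target.com", "macys.com", "target.com"]})
add_domain_count(frame)
```

This recomputes value_counts() for every row, so it is mainly useful for showing how the lambda translates into a named function.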
I am wondering how to convert a for-loop into a .apply() method in Pandas.
I'm trying to iterate over one column of a dataframe (df1) and return matches from subsets of a second dataframe (df2). I have a function to do the matching (Matcher), and also a function to select the right subset from df2 (Filter). I want to know if it's possible to use Pandas' .apply() method to call both functions.
I have worked out how to do this as a for-loop (see below), and it seems that I can do it with a list comprehension by creating a complete function first (see here) but I'm having trouble doing it via the Pandas .apply() method and a lambda expression.
# Here is my Filter, which selects titles from df2 for one year
# either side of a target year
def Filter(year):
    years = [year-1, year, year+1]
    return df2[df2.year.isin(years)].title
# Here is my matching routine, it uses the process method from
# fuzzywuzzy
def Matcher(title, choices):
    title_match, percent_match, match3 = process.extractOne(
        title, choices, scorer=fuzz.token_sort_ratio)
    return title_match
# Here is my solution using a for-loop
for index, row in df1.iterrows():
    targets = Filter(row.year)
    df1.loc[index, 'return_match'] = Matcher(row.title, targets)
# Here's my attempt with a lambda function
df1['new_col'] = df1.title.apply(lambda x: Matcher(x, Filter(df1.year)))
When I use the lambda function, what appears to happen is that the Filter function is only called on the very first iteration of the .apply() method, so every title is matched to that first filtered set. Is there a way to fix that?
Welcome to SO JP.
I see a problem in this line:
# Here's my attempt with a lambda function
df1['new_col'] = df1.title.apply(lambda x: Matcher(x, Filter(df1.year)))
You call Filter on the entire year column of your DataFrame, whereas in your for-loop solution you call it on just the year of the current row. So I recommend using apply on rows, like this:
df1['new_col'] = df1.apply(lambda row: Matcher(row.title, Filter(row.year)), axis=1)
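Here is a small runnable sketch of the row-wise version. To keep it free of third-party dependencies, Matcher is replaced by a stand-in built on the standard library's difflib rather than fuzzywuzzy, and the two frames are made-up data, so treat both as assumptions.

```python
import difflib
import pandas as pd

# Hypothetical data: df1 holds titles to match, df2 holds the candidates.
df1 = pd.DataFrame({"title": ["The Matrx", "Gladiatr"], "year": [1999, 2001]})
df2 = pd.DataFrame({
    "title": ["The Matrix", "Gladiator", "The Matrix Reloaded"],
    "year": [1999, 2000, 2003],
})

def Filter(year):
    # Titles from df2 within one year either side of the target year.
    years = [year - 1, year, year + 1]
    return df2[df2.year.isin(years)].title

def Matcher(title, choices):
    # Stand-in for fuzzywuzzy: pick the candidate with the highest difflib ratio.
    return max(choices, key=lambda c: difflib.SequenceMatcher(None, title, c).ratio())

# axis=1 passes one row at a time, so Filter sees that row's year only.
df1["new_col"] = df1.apply(lambda row: Matcher(row.title, Filter(row.year)), axis=1)
```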
I hope this helps.
df.apply is a method that can apply a certain function to all the columns in a dataframe, or the required columns. However, my aim is to compute the hash of a string: this string is the concatenation of all the values in a row corresponding to all the columns. My current code is returning NaN.
The current code is:
df["row_hash"] = df["row_hash"].apply(self.hash_string)
The function self.hash_string is:
def hash_string(self, value):
    return (sha1(str(value).encode('utf-8')).hexdigest())
Yes, it would be easier to merge all the columns of the pandas dataframe, but the current answer couldn't help me either.
The file that I am reading is (the first 10 rows):
16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026
The col names are: col_test_1, col_test_2, .... , col_test_11
You can create a new column, which is concatenation of all others:
df['new'] = df.astype(str).values.sum(axis=1)
And then apply your hash function on it
df["row_hash"] = df["new"].apply(self.hash_string)
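or this one-liner should work (.values.sum returns a NumPy array, which has no apply method, so it is wrapped in a Series first):
df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(hash_string)
However, I'm not sure you need a separate function here, so:
df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(lambda x: sha1(str(x).encode('utf-8')).hexdigest())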
You can use apply twice, first on the row elements then on the result:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)
Side note: I don't understand why you are defining hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. If that causes problems, you can just pass a lambda instead:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
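A self-contained sketch of the two-step approach, using a shortened version of the sample rows (three columns instead of eleven) and hash_string written as a plain function. Both choices are assumptions made for brevity.

```python
import hashlib
import pandas as pd

df = pd.DataFrame(
    [[16012, 16013, 16014], [16013, 16014, 16015]],
    columns=["col_test_1", "col_test_2", "col_test_3"],
)

def hash_string(value):
    return hashlib.sha1(str(value).encode("utf-8")).hexdigest()

# Step 1: join each row's values into a single string.
row_text = df.astype(str).apply(lambda row: "".join(row), axis=1)

# Step 2: hash that string, one row at a time.
df["row_hash"] = row_text.apply(hash_string)
```

Joining without a separator means different rows can produce the same string (for example the values '1', '23' and '12', '3'). If that matters, joining with a delimiter such as ',' is one option.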
I'm trying to apply to a dataframe a function that has more than one argument, of which two need to be assigned to the dataframe's rows, and one is a variable (a simple number).
A variation from a similar thread works for the rows: (all functions are oversimplified compared to my original ones)
import pandas as pd
# named data rather than dict, to avoid shadowing the built-in
data = {'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]}
df = pd.DataFrame(data)
print(df)
def DummyFunction(row):
    return row['a']*row['b']
# this works:
df['Dummy1'] = df.apply(DummyFunction, axis=1)
But how can I apply the following variation, where my function takes in an additional argument (a fixed variable)? I seem to find no way to pass it inside the apply method:
def DummyFunction2(row, threshold):
    return row['a']*row['b']*threshold
# where threshold will be assigned to a number?
# I don't seem to find a viable option to fill the row argument below:
# df['Dummy2'] = df.apply(DummyFunction2(row, 1000), axis=1)
Thanks for your help!
You can pass the additional variable directly as a named argument to pd.DataFrame.apply:
def DummyFunction2(row, threshold):
    return row['a']*row['b']*threshold
df['Dummy2'] = df.apply(DummyFunction2, threshold=2, axis=1)
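For completeness, a short sketch showing both ways of passing the extra value: as a keyword argument, as in the answer, and positionally through the args tuple of DataFrame.apply. The column name Dummy3 is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [-2, 5, 4, -6], "b": [4, 4, 5, -8]})

def DummyFunction2(row, threshold):
    return row["a"] * row["b"] * threshold

# Keyword form: unknown keyword arguments are forwarded to the function.
df["Dummy2"] = df.apply(DummyFunction2, threshold=2, axis=1)

# Positional form: extra arguments go in the args tuple, after the row.
df["Dummy3"] = df.apply(DummyFunction2, axis=1, args=(1000,))
```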
I'm having a problem with the apply() method of the pandas DataFrame. My issue is that apply() can return either a Series or a DataFrame, depending on the return type of the input function; however, when the frame is empty, apply() (almost) always returns a DataFrame. So I can't write code that expects a Series. Here's an example:
import pandas as pd
def area_from_row(row):
    return row['width'] * row['height']
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1)
# This works as expected.
non_empty_frame = pd.DataFrame(data=[[2, 3]], columns=['width', 'height'])
add_area_column(non_empty_frame)
# This fails!
empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])
add_area_column(empty_frame)
Is there a standard way of dealing with this? I can do the following, but it's silly:
def area_from_row(row):
    # The way we respond to an empty row tells pandas whether we're a
    # reduction or not.
    if not len(row):
        return None
    return row['width'] * row['height']
(I'm using pandas 0.11.0, but I checked this on 0.12.0-1100-g0c30665 as well.)
You can set the result_type parameter in apply to 'reduce'.
From the documentation,
By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
And then,
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
In your code, update here:
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1, result_type='reduce')
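A minimal sketch of the empty-frame case, assuming a pandas version recent enough to support result_type (the 0.11.0 release mentioned in the question predates that parameter). The variable result is added only to inspect the return type.

```python
import pandas as pd

def area_from_row(row):
    return row["width"] * row["height"]

def add_area_column(frame):
    frame["area"] = frame.apply(area_from_row, axis=1, result_type="reduce")

empty_frame = pd.DataFrame(data=None, columns=["width", "height"])

# With result_type='reduce', apply should hand back a Series even with no rows.
result = empty_frame.apply(area_from_row, axis=1, result_type="reduce")

# Assigning that (empty) Series adds the column without raising.
add_area_column(empty_frame)
```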