How to compute hash of all the columns in Pandas Dataframe? - python

df.apply is a method that can apply a certain function to all the columns in a dataframe, or the required columns. However, my aim is to compute the hash of a string: this string is the concatenation of all the values in a row corresponding to all the columns. My current code is returning NaN.
The current code is:
df["row_hash"] = df["row_hash"].apply(self.hash_string)
The function self.hash_string is:
def hash_string(self, value):
    return (sha1(str(value).encode('utf-8')).hexdigest())
Yes, it would be easier to merge all columns of Pandas dataframe but current answer couldn't help me either.
The file that I am reading is (the first 10 rows):
16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026
The col names are: col_test_1, col_test_2, .... , col_test_11

You can create a new column, which is concatenation of all others:
df['new'] = df.astype(str).values.sum(axis=1)
And then apply your hash function to it:
df["row_hash"] = df["new"].apply(self.hash_string)
or this one-liner should work (the NumPy array returned by .values has no .apply method, so it is wrapped in a Series first):
df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(hash_string)
However, I am not sure you need a separate function here, so:
df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(lambda x: sha1(str(x).encode('utf-8')).hexdigest())

You can use apply twice, first on the row elements then on the result:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)
Sidenote: I don't understand why you are defining hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. If that causes problems, you can pass the hashing logic inline as a lambda:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
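For reference, here is a minimal self-contained sketch of the row-hashing idea. It assumes hash_string is a plain function rather than a method, uses a three-column sample instead of the eleven columns in the question, and joins the values with a "|" separator. The separator is an addition that neither answer uses; it only keeps rows such as (1, 23) and (12, 3) from producing the same string.

```python
from hashlib import sha1

import pandas as pd


def hash_string(value):
    # SHA-1 hex digest of the UTF-8 encoded string form of the value
    return sha1(str(value).encode("utf-8")).hexdigest()


# Small sample shaped like the file in the question (fewer columns for brevity)
df = pd.DataFrame(
    [[16012, 16013, 16014], [16013, 16014, 16015]],
    columns=["col_test_1", "col_test_2", "col_test_3"],
)

# Build one string per row; the "|" separator is an assumption, not part of the answers
row_text = df.astype(str).agg("|".join, axis=1)

# Hash each concatenated row string
df["row_hash"] = row_text.apply(hash_string)
```

Computing row_text before assigning row_hash matters: the original attempt applied the function to a row_hash column that had not been filled yet.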

Related

Convert a lambda function to a regular function (Python)

I have this current lambda function: df["domain_count"] = df.apply(lambda row : df['domain'].value_counts()[row['domain']], axis = 1)
But I want to convert it to a regular function such as def get_domain_count(). How do I do this? I'm not sure what parameters it would take, since I want to apply it to an entire column of a dataframe. The domain column will contain duplicates, and I want to know how many times each domain appears in my dataframe.
ex start df:
| domain |
|---|
| target.com |
| macys.com |
| target.com |
| walmart.com |
| walmart.com |
| target.com |
ex end df:
| domain | count |
|---|---|
| target.com | 3 |
| macys.com | 1 |
| target.com | 3 |
| walmart.com | 2 |
| walmart.com | 2 |
| target.com | 3 |
Please help! Thanks in advance!
You can pass the column name as a string, and the dataframe object to mutate:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame.apply(lambda row: df[col_name]...)
But better yet, you don't need to apply!
df["domain"].map(df["domain"].value_counts())
will first get the counts per unique value, and map each value in the column with that. So the function could become:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame[col_name].map(frame[col_name].value_counts())
A lambda is just an anonymous function, and it's usually easy to turn it into a regular function by using the lambda's own parameter list (in this case, row) and returning its expression. The challenge with this one is the df name, which will resolve differently in a module-level function than in your lambda. So, add it as a parameter to the function:
def get_domain_count(df, row):
    return df['domain'].value_counts()[row['domain']]
This can be a problem if you still want to use this function in an .apply operation: .apply wouldn't know to add that df parameter at the front. To solve that, you could create a partial.
import functools
def do_stuff(some_df):
    return some_df.apply(functools.partial(get_domain_count, some_df), axis=1)
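As a quick check of the map plus value_counts approach, here is a small sketch that runs on the example data from the question. The function name add_count_column is made up for this illustration.

```python
import pandas as pd


def add_count_column(frame, col_name):
    # Count each unique value once, then look every row up in that table
    counts = frame[col_name].value_counts()
    frame[f"{col_name}_count"] = frame[col_name].map(counts)
    return frame


df = pd.DataFrame(
    {
        "domain": [
            "target.com",
            "macys.com",
            "target.com",
            "walmart.com",
            "walmart.com",
            "target.com",
        ]
    }
)

add_count_column(df, "domain")
print(df)
```

Unlike the row-wise lambda, this calls value_counts only once instead of once per row.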

how to generate column in pandas data frame using other columns and string formatting

I am trying to generate a third column in pandas dataframe using two other columns in dataframe. The requirement is very particular to the scenario for which I need to generate the third column data.
The requirement is stated as:
Let the dataframe be df, the first column 'first_name', and the second column 'last_name'.
I need to generate third column in such a manner so that it uses string formatting to generate a particular string and pass it to a function and whatever the function returns should be used as value to third column.
Problem 1
base_string = "my name is {first} {last}"
df['summary'] = base_string.format(first=df['first_name'], last=df['last_name'])
Problem 2
df['summary'] = some_func(base_string.format(first=df['first_name'], last=df['last_name']))
My ultimate goal is to solve problem 2 but for that problem 1 is pre-requisite and as of now I'm unable to solve that. I have tried converting my dataframe values to string but it is not working the way I expected.
You can do apply:
df.apply(lambda r: base_string.format(first=r['first_name'], last=r['last_name']),
         axis=1)
Or list comprehension:
df['summary'] = [base_string.format(first=x, last=y)
                 for x, y in zip(df['first_name'], df['last_name'])]
And then, for general function some_func:
df['summary'] = [some_func(base_string.format(first=x, last=y))
                 for x, y in zip(df['first_name'], df['last_name'])]
You could use pandas.DataFrame.apply with axis=1 so your code will look like this:
def mapping_function(row):
    #make your calculation
    return value
df['summary'] = df.apply(mapping_function, axis=1)
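To make this concrete, here is a small runnable sketch covering both problems from the question. The sample names and the body of some_func (it just upper-cases the text) are placeholders invented for the example; the real some_func is whatever function you need to call.

```python
import pandas as pd

base_string = "my name is {first} {last}"


def some_func(text):
    # Hypothetical stand-in for the real function from the question
    return text.upper()


df = pd.DataFrame(
    {"first_name": ["Ada", "Alan"], "last_name": ["Lovelace", "Turing"]}
)

# Problem 1: format one string per row
df["plain"] = [
    base_string.format(first=first, last=last)
    for first, last in zip(df["first_name"], df["last_name"])
]

# Problem 2: pass each formatted string through some_func
df["summary"] = [
    some_func(base_string.format(first=first, last=last))
    for first, last in zip(df["first_name"], df["last_name"])
]
```

The original attempt failed because str.format received whole Series objects, so it formatted the Series representation once instead of once per row.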

How can you call a second function while using a lambda function in an .apply() call in Python?

I am wondering how to convert a for-loop into a .apply() method in Pandas.
I'm trying to iterate over one column of a dataframe (df1) and return matches from subsets of a second dataframe (df2). I have a function to do the matching (Matching), and also a function to select the right subset from df2 (Filter). I want to know if it's possible to use Pandas' .apply() method to call both functions.
I have worked out how to do this as a for-loop (see below), and it seems that I can do it with a list comprehension by creating a complete function first (see here) but I'm having trouble doing it via the Pandas .apply() method and a lambda expression.
## Here is my Filter, which selects titles from df2 for one year
## either side of a target year
def Filter(year):
    years = [year-1, year, year+1]
    return df2[df2.year.isin(years)].title
# Here is my matching routine, it uses the process method from
# fuzzywuzzy
def Matcher(title, choices):
    title_match, percent_match, match3 = process.extractOne(title,
                            choices, scorer=fuzz.token_sort_ratio)
    return title_match
# Here is my solution using a for-loop
for index, row in df1.iterrows():
    targets = Filter(row.year)
    df1.loc[index,'return_match'] = Matcher(row.title, targets)
# Here's my attempt with a lambda function
df1['new_col'] = df1.title.apply(lambda x: Matcher(x,
                            Filter(df1.year)))
When I use the lambda function, what appears to happen is that the Filter function is only called on the very first iteration of the .apply() method, so every title is matched to that first filtered set. Is there a way to fix that?
Welcome to SO JP.
I see a problem in this line:
# Here's my attempt with a lambda function
df1['new_col'] = df1.title.apply(lambda x: Matcher(x, Filter(df1.year)))
Here you call Filter on the entire year column of df1, whereas in your for-loop solution you call it on the year of the current row only. So I recommend using apply on rows, like this:
df1['new_col'] = df1.apply(lambda row: Matcher(row.title, Filter(row.year)), axis=1)
I hope this helps.
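Here is a small self-contained sketch of the row-wise version. To avoid the fuzzywuzzy dependency it swaps in difflib.get_close_matches from the standard library as the matcher, so the scores will not be the same as with fuzz.token_sort_ratio. The sample data and the names filter_titles and match_title are made up for illustration.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({"title": ["Alien", "Alien"], "year": [1979, 1986]})
df2 = pd.DataFrame(
    {
        "title": ["Alien", "Mad Max", "Aliens", "Top Gun"],
        "year": [1979, 1979, 1986, 1986],
    }
)


def filter_titles(year):
    # Titles from df2 within one year either side of the target year
    years = [year - 1, year, year + 1]
    return df2.loc[df2["year"].isin(years), "title"].tolist()


def match_title(title, choices):
    # Stand-in matcher: best difflib match, or None if there are no candidates
    best = difflib.get_close_matches(title, choices, n=1, cutoff=0.0)
    return best[0] if best else None


# axis=1 passes each row, so both the title and the year come from the same row
df1["new_col"] = df1.apply(
    lambda row: match_title(row["title"], filter_titles(row["year"])), axis=1
)
```

The two rows share a title but differ in year, so they end up matched against different candidate sets, which is the behaviour the for-loop had.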

Dynamically add columns to dataframe via apply

The following code applies a function f to a dataframe column data_df["c"] and concats the results to the original dataframe, i.e. concatenating 1024 columns to the dataframe data_df.
data_df = apply_and_concat(data_df, "c", lambda x: f(x, y), [y + "-dim" + str(i) for i in range(0,1024)])
def apply_and_concat(df, field, func, column_names):
    return pd.concat((
        df,
        df[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
The problem is that I want to execute this dynamically, meaning that I don't know how many columns it returns. f returns a list. Is there a better or easier way to add these columns without specifying the number of columns beforehand?
Your use of pd.concat((df, df[field].apply(...)), axis=1) already solves the main task well. It seems like your main question really boils down to "how do I name an unknown number of columns", where you're happy to use a name based on sequential integers. For that, use itertools.count():
import itertools
f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))
Then use f_modified instead of f. That way, you get a dictionary instead of a list, with an arbitrary number of dynamically generated names as keys. When converting this dictionary to a Series, you'll end up with the keys being used as the index, so you don't need to provide an explicit list as the index, and hence don't need to know the number of columns in advance.
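A minimal end-to-end sketch of this idea follows. The function f here is a made-up placeholder that returns a three-element list, and name_outputs and the "y" prefix are illustrative names; the point is only that the column count is never stated anywhere.

```python
import itertools

import pandas as pd


def f(cell):
    # Hypothetical function returning a list whose length is not known up front
    return [cell, cell * 2, cell * 3]


def name_outputs(cell, prefix):
    # zip stops when f's list runs out, so the endless counter is safe here
    names = (f"{prefix}-dim{i}" for i in itertools.count())
    return dict(zip(names, f(cell)))


data_df = pd.DataFrame({"c": [1, 2]})

# Each dict becomes a Series whose index supplies the new column names
expanded = data_df["c"].apply(lambda cell: pd.Series(name_outputs(cell, "y")))
result = pd.concat((data_df, expanded), axis=1)
print(result)
```

If f can return lists of different lengths for different cells, the shorter rows would get NaN in the missing columns.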

Pandas: apply a function with columns and a variable as argument

I'm trying to apply to a dataframe a function that has more than one argument, of which two need to be assigned to the dataframe's rows, and one is a variable (a simple number).
A variation from a similar thread works for the rows: (all functions are oversimplified compared to my original ones)
import pandas as pd
data = {'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]}  # renamed from dict to avoid shadowing the built-in
df = pd.DataFrame(data)
print(df)
def DummyFunction(row):
    return row['a']*row['b']
#this works:
df['Dummy1'] = df.apply(DummyFunction, axis=1)
But how can I apply the following variation, where my function takes in an additional argument (a fixed variable)? I seem to find no way to pass it inside the apply method:
def DummyFunction2(row, threshold):
    return row['a']*row['b']*threshold
# where threshold will be assigned to a number?
# I don't seem to find a viable option to fill the row argument below:
# df['Dummy2']=df.apply(DummyFunction2(row,1000), axis=1)
Thanks for your help!
You can pass the additional variable directly as a named argument to pd.DataFrame.apply:
def DummyFunction2(row, threshold):
    return row['a']*row['b']*threshold
df['Dummy2'] = df.apply(DummyFunction2, threshold=2, axis=1)
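For completeness, here is a runnable sketch on the data from the question, showing the keyword form next to the positional args form that DataFrame.apply also accepts. The function and column names are chosen for the example.

```python
import pandas as pd


def dummy_function2(row, threshold):
    return row["a"] * row["b"] * threshold


df = pd.DataFrame({"a": [-2, 5, 4, -6], "b": [4, 4, 5, -8]})

# Extra keyword arguments to apply are forwarded to the function
df["by_keyword"] = df.apply(dummy_function2, axis=1, threshold=2)

# Alternatively, pass extra positional arguments as a tuple via args
df["by_args"] = df.apply(dummy_function2, axis=1, args=(2,))
```

The commented-out attempt in the question failed because DummyFunction2(row, 1000) calls the function immediately, before apply has any row to pass in.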
