Indexing pandas series with parent dataframe index - python

I want to run a function over each row of a pandas DataFrame and store its result in a derived column score. The function below takes a lambda only as an example; the real function needs to index the row by the parent DataFrame's column labels (e.g. row['col1']) and also access the column names themselves. However, apply passes a Series object to the function, which has no columns attribute:
For example:
def calculate(row):
    cols = row.columns  # fails: apply passes a Series, which has no .columns
    loc = row['loc']
    h = row['h']
    isst = row['Ist']
    Hol = row['Hol']
    return loc + h + len(cols)

a['score'] = a.apply(lambda row: calculate(row), axis=1)
gives:
AttributeError: ("'Series' object has no attribute 'columns'", u'occurred at index 0')
So how can I access the row by its column labels, like a named tuple, inside the lambda function?
A quick hack would be:
a['score'] = a.apply(lambda row: calculate(makedict(row, row.index)), axis=1)
where makedict builds a dictionary for each row so the function can access it by column label. But is there a pandas way?

I finally found the to_dict method, which solves this:
def calculate(row):
    row = row.to_dict()
    loc = row['loc']
    h = row['h']
    isst = row['Ist']
    Hol = row['Hol']
    return loc + h + len(row.keys())

a['score'] = a.apply(calculate, axis=1)

Why not:
a['score'] = a.apply(lambda row: row['loc'] + row['h'] + len(row.index), axis=1)
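Worth noting: the Series that apply passes already carries the parent frame's column labels as its index, so label access works without any conversion; row.index plays the role of columns. A minimal sketch (the frame a and its columns are assumptions based on the question):

import pandas as pd

a = pd.DataFrame({'loc': [1, 2], 'h': [10, 20], 'Ist': [0, 0], 'Hol': [0, 0]})

def calculate(row):
    # row is a Series; row.index holds the parent frame's column labels
    cols = row.index
    return row['loc'] + row['h'] + len(cols)

a['score'] = a.apply(calculate, axis=1)
print(a)  # score column is 15, 26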

Related

convert lambda function to regular function PYTHON df["domain_count"] = df.apply(lambda row : df['domain'].value_counts()[row['domain']], axis = 1)

I have this lambda function: df["domain_count"] = df.apply(lambda row : df['domain'].value_counts()[row['domain']], axis = 1)
But I want to convert it to a regular function like def get_domain_count(). How do I do this? I'm not sure what parameters it should take, since I want to apply it to an entire column of a dataframe. The domain column contains duplicates, and I want to know how many times each domain appears in my dataframe.
ex start df:
|domain|
|---|
|target.com|
|macys.com|
|target.com|
|walmart.com|
|walmart.com|
|target.com|
ex end df:
|domain|count|
|---|---|
|target.com|3|
|macys.com|1|
|target.com|3|
|walmart.com|2|
|walmart.com|2|
|target.com|3|
Please help! Thanks in advance!
You can pass the column name as a string, and the dataframe object to mutate:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame.apply(lambda row: df[col_name]...)
But better yet, you don't need to apply!
df["domain"].map(df["domain"].value_counts())
will first get the counts per unique value, and map each value in the column with that. So the function could become:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame[col_name].map(frame[col_name].value_counts())
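For instance, with the question's sample data, a quick check (a hypothetical run; countify is the helper defined just above):

import pandas as pd

df = pd.DataFrame({'domain': ['target.com', 'macys.com', 'target.com',
                              'walmart.com', 'walmart.com', 'target.com']})

def countify(frame, col_name):
    # map each value to its frequency within the column
    frame[f"{col_name}_count"] = frame[col_name].map(frame[col_name].value_counts())

countify(df, 'domain')
print(df['domain_count'].tolist())  # [3, 1, 3, 2, 2, 3]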
A lambda is just an anonymous function, and it's usually easy to turn it into a named function by reusing the lambda's own parameter list (in this case, row) and returning its expression. The challenge with this one is the df name, which resolves differently in a module-level function than it does inside your lambda. So, add it as a parameter to the function:
def get_domain_count(df, row):
    return df['domain'].value_counts()[row['domain']]
This can be a problem if you still want to use this function in an .apply operation: .apply wouldn't know to pass that df argument at the front. To solve that, you can create a partial.
import functools

def do_stuff(some_df):
    # bind some_df as the first argument; axis=1 passes rows to the function
    some_df.apply(functools.partial(get_domain_count, some_df), axis=1)
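Putting it together, a sketch against the question's sample data (get_domain_count as defined above):

import functools
import pandas as pd

df = pd.DataFrame({'domain': ['target.com', 'macys.com', 'target.com',
                              'walmart.com', 'walmart.com', 'target.com']})

def get_domain_count(df, row):
    # how often this row's domain occurs in the whole column
    return df['domain'].value_counts()[row['domain']]

df['count'] = df.apply(functools.partial(get_domain_count, df), axis=1)
print(df['count'].tolist())  # [3, 1, 3, 2, 2, 3]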

How to compute hash of all the columns in Pandas Dataframe?

df.apply is a method that can apply a function to all the columns in a dataframe, or to selected columns. My aim, however, is to compute the hash of a string: the concatenation of all the values in a row across all columns. My current code returns NaN.
The current code is:
df["row_hash"] = df["row_hash"].apply(self.hash_string)
The function self.hash_string is:
def hash_string(self, value):
    return sha1(str(value).encode('utf-8')).hexdigest()
Yes, it would be easier to merge all columns of the Pandas dataframe, but the existing answer on that couldn't help me either.
The file that I am reading is (the first rows):
16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026
The col names are: col_test_1, col_test_2, .... , col_test_11
You can create a new column which is a concatenation of all the others:
df['new'] = df.astype(str).values.sum(axis=1)
And then apply your hash function on it
df["row_hash"] = df["new"].apply(self.hash_string)
or this one-liner should work (dropping .values, since a numpy array has no .apply):
df["row_hash"] = df.astype(str).sum(axis=1).apply(hash_string)
However, if you don't need a separate function here:
df["row_hash"] = df.astype(str).sum(axis=1).apply(lambda x: sha1(str(x).encode('utf-8')).hexdigest())
You can use apply twice, first on the row elements then on the result:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)
Side note: I don't understand why you define hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. If that causes problems, you can just inline it as a lambda:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
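For reference, a self-contained sketch of the whole pipeline. The column names and sample values are taken from the question; joining with a separator is my own addition, so that e.g. ('ab', 'c') and ('a', 'bc') don't hash to the same value:

from hashlib import sha1
import pandas as pd

df = pd.DataFrame([[16012, 16013, 16014], [16013, 16014, 16015]],
                  columns=['col_test_1', 'col_test_2', 'col_test_3'])

def hash_string(value):
    return sha1(str(value).encode('utf-8')).hexdigest()

# join the stringified row values with a separator, then hash each row
df['row_hash'] = df.astype(str).agg('|'.join, axis=1).apply(hash_string)
print(df['row_hash'])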

In pandas expanding/rolling functions, how to use the index of a dataframe or series?

Let's say I have a pandas.Series with a datetime index:
srs = pd.Series(index=pd.date_range('2013-01-01', '2013-01-10')).fillna(1)
I can use the expanding function to calculate, say, the expanding sum of the series.
srs.expanding(5).sum()
However, I cannot access other properties of the series (say, its index) inside the expanding function. For example, running:
srs.expanding(5).apply(lambda x: x.index[-1])
I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'index'
Why are the windows passed as numpy arrays rather than as pandas.Series? Is there another way to use expanding/rolling functions that gives access to the indices as well?
You can get access to the index through expanding().agg, but only if the index has the same type as the values.
For example, this works:
s1 = pd.Series(index=range(10)).fillna(1)
s1.expanding(5).agg(lambda x: x.index[-1])
But this doesn't:
srs = pd.Series(index=pd.date_range('2013-01-01', '2013-01-10')).fillna(1)
srs.expanding(5).agg(lambda x: x.index[-1])
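One workaround, offered as a sketch of my own rather than part of the answer above: run the expanding computation over integer positions (which are numeric and therefore valid results), then map those positions back to the datetime index:

import pandas as pd

srs = pd.Series(1.0, index=pd.date_range('2013-01-01', '2013-01-10'))

# expand over positions instead of the datetime-indexed values
positions = pd.Series(range(len(srs)))
last_pos = positions.expanding(5).agg(lambda x: x.iloc[-1])

# translate the resulting positions back into timestamps
last_ts = last_pos.dropna().astype(int).map(lambda i: srs.index[i])
print(last_ts)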

ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

I'm using Pandas 0.20.3 with Python 3.x. I want to add a column to one pandas data frame from another pandas data frame. Both data frames contain 51 rows. So I used the following code:
class_df['phone'] = group['phone'].values
I got the following error message:
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
class_df.dtypes gives me:
Group_ID object
YEAR object
Terget object
phone object
age object
and type(group['phone']) returns pandas.core.series.Series
Can you suggest what changes I need to make to remove this error?
The first 5 rows of group['phone'] are given below:
0 [735015372, 72151508105, 7217511580, 721150431...
1 []
2 [735152771, 7351515043, 7115380870, 7115427...
3 [7111332015, 73140214, 737443075, 7110815115...
4 [718218718, 718221342, 73551401, 71811507...
Name: phone, dtype: object
In most cases, this error comes up when you return an empty dataframe. The best approach that worked for me was to check whether the dataframe is empty before using apply():

if len(df) != 0:
    df['indicator'] = df.apply(assign_indicator, axis=1)
You have a column of ragged lists. Your only option is to assign a list of lists, not an array of lists (which is what .values gives):
class_df['phone'] = group['phone'].tolist()
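A minimal sketch of that fix, with stand-in data since the question's frames aren't shown in full:

import pandas as pd

group = pd.DataFrame({'phone': [[735015372, 72151508105], [], [735152771]]})
class_df = pd.DataFrame({'Group_ID': ['a', 'b', 'c']})

# a plain list of lists assigns cleanly; an object ndarray of lists may not
class_df['phone'] = group['phone'].tolist()
print(class_df)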
The error from the question's headline,
"ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
can also occur if, for whatever reason, the table has no rows.
Instead of using an if-statement, you can set the result_type argument of apply() to "reduce":
df['new_column'] = df.apply(func, axis=1, result_type='reduce')
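A small sketch of why this helps (func and the empty frame here are illustrative assumptions): on a frame with no rows, apply would otherwise return an empty DataFrame, which cannot be assigned to a column.

import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])  # no rows

def func(row):
    return row['a'] + row['b']

# result_type='reduce' forces a Series result even when the frame is empty
df['new_column'] = df.apply(func, axis=1, result_type='reduce')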
The data assigned to a column in the DataFrame must be a one-dimensional array. For example, consider a num_arr to be added to a DataFrame:

num_arr.shape
(1, 126)

For this num_arr to be added as a DataFrame column, it should be reshaped:

num_arr = num_arr.reshape(-1,)
num_arr.shape
(126,)

Now the array can be set as a DataFrame column:

df = pd.DataFrame()
df['numbers'] = num_arr

I'm having trouble understanding how this line of code filters a pandas dataframe's rows

I understand that the first line assigns the csv file's contents to the dataframe, but I don't understand what exactly the lambda function is doing to drop the rows that have the value 'None' in the 'Fat' column.
data = pd.read_csv('data.csv', delimiter=';')
filtered_data = data[lambda row: row.Fat != 'None']
It is using the selection by callable feature of dataframes. You can pass a callable (such as a function) as the indexer to select a subset.
The lambda is just shorthand for creating a function, i.e. you could also write:
def is_fat(row):
    return row.Fat != 'None'
and use that function for indexing:
filtered_data = data[is_fat]
As you can see, the lambda basically returns False for rows that have 'None' in the column Fat, and True otherwise. Note that despite the parameter name row, the callable is actually passed the whole dataframe and returns a boolean Series, which is what selects the rows.
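A quick sketch to see the equivalence (made-up data, since data.csv isn't provided):

import pandas as pd

data = pd.DataFrame({'Fat': ['10g', 'None', '5g'], 'Item': ['x', 'y', 'z']})

# the callable receives the whole frame and must return a boolean mask
filtered_data = data[lambda df: df.Fat != 'None']

# which is equivalent to indexing with the mask directly
same = data[data.Fat != 'None']
print(filtered_data.equals(same))  # True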
