function to sort rows - python

I have the following code which filters a row based on the part number.
df= sets[sets['num_parts'] == 11695]
df
Now I want to write a function that I can call with set_num passed as a string and that returns the matching rows.
Here is what I have tried. It gives me an error.
def select_set_row(df):
    return display(df.iloc['set_num'])

iloc is integer-location based indexing, so it cannot take a column label. You should use df['set_num'].iloc[0]

This is not exactly a function, but you can create a similar one.
Sample data:
df = pd.DataFrame({'id': [1,2,3,4,5,6], 'num_parts': [132, 234, 345, 578, 462, 222]})
print(df)
List of num_parts:
num_parts = df['num_parts'].tolist()
Outputting the dataframe for each value in num_parts:
for parts in num_parts:
    print(df[df['num_parts'] == parts])

I was able to select the rows with this function:
def select_set_numparts(num_parts):
    return sets[sets['num_parts'] == num_parts]
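The question asked for a lookup by set_num passed as a string rather than by num_parts. A minimal sketch of that variant follows. The sample rows are made up for illustration, because the real sets table is not shown, and passing the dataframe as an argument is a design choice of this sketch, not something from the thread.

```python
import pandas as pd

# Hypothetical sample data; the real `sets` table is not shown in the question.
sets = pd.DataFrame({
    "set_num": ["A-1", "B-2", "C-3"],
    "name": ["first set", "second set", "third set"],
    "num_parts": [43, 12, 11695],
})

def select_set_row(sets, set_num):
    """Return every row whose set_num equals the given string."""
    # Boolean mask on the column label; iloc would need integer positions instead.
    return sets[sets["set_num"] == set_num]

result = select_set_row(sets, "C-3")
print(result)
```

An unknown set_num simply yields an empty dataframe, so the caller can check result.empty.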

Related

pandas column calculated using function including dict lookup, 'Series' objects are mutable, thus they cannot be hashed

I am aware there are tons of questions similar to mine, but I could not find the solution to my question in the last 30 minutes of looking through dozens of threads.
I have a dataframe with hundreds of columns and rows, and I use most columns within a function to return a value that is supposed to be added to an additional column.
The problem can be broken down to the following.
lookup = {"foo": 1, "bar": 0}
def lookuptable(input_string, input_factor):
    return lookup[input_string] * input_factor
mydata = pd.DataFrame([["foo", 4], ["bar",3]], columns = ["string","faktor"])
mydata["looked up value"] = lookuptable(mydata["string"], mydata["faktor"])
But this returns:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is there a way to avoid this problem without restructuring the function itself?
Thanks in advance!
Try this:
lookup = {"foo": 1, "bar": 0}
def lookuptable(data):
    return lookup[data["string"]] * data["faktor"]
mydata = pd.DataFrame([["foo", 4], ["bar",3]], columns = ["string","faktor"])
mydata["looked up value"] = mydata.apply(lookuptable, axis=1)
print(mydata)
  string  faktor  looked up value
0    foo       4                4
1    bar       3                0
Besides using .apply(), you can use a list comprehension with .iterrows():
mydata["looked up value"] = [lookuptable(row[1]["string"], row[1]["faktor"]) for row in mydata.iterrows()]
Your function accepts two parameters, a string and an integer, but you pass it two pandas Series instead. That is why the dict lookup fails: a Series cannot be used as a dict key.
You can, however, iterate through the dataframe and pass the parameters to the function row by row using .apply():
mydata["looked up value"] = mydata\
    .apply(lambda row: lookuptable(row["string"], row["faktor"]), axis=1)
You can also do this without a function:
import pandas as pd
lookup = {"foo": 1, "bar": 0}
mydata = pd.DataFrame([["foo", 4], ["bar",3]], columns = ["string","factor"])
mydata["looked up value"] = mydata['string'].map(lookup) * mydata['factor']
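One thing to keep in mind with the map approach is that strings missing from the dict become NaN instead of raising a KeyError. Here is a small sketch of that behaviour. The extra "baz" row and the choice to treat unknown strings as factor 0 are assumptions for illustration, not part of the original question.

```python
import pandas as pd

lookup = {"foo": 1, "bar": 0}
# "baz" is a made-up row that has no entry in the lookup dict.
mydata = pd.DataFrame([["foo", 4], ["bar", 3], ["baz", 5]], columns=["string", "faktor"])

# Series.map looks up each element in the dict; missing keys become NaN.
factors = mydata["string"].map(lookup)
mydata["looked up value"] = factors * mydata["faktor"]

# Hypothetical policy: treat unknown strings as factor 0 instead of NaN.
mydata["looked up value (filled)"] = factors.fillna(0) * mydata["faktor"]
print(mydata)
```

Whether NaN or a default is the right outcome depends on the data, so the fillna step is optional.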

Pandas merging dataframes resulting in x and y suffixes

I am creating my own dataset for a university project. I've used the merge function often and it has always worked perfectly. This time I get x and y suffixes, which I cannot explain. I know how pandas merges: the rows in the two data frames that match on the specified columns are extracted and joined together, and if there is more than one match, all possible matches contribute one row each. But I really don't see why that gives me suffixes here. I assume it has to do with a warning I got earlier:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
I tried to merge the dataframes on the column 'CustomerID' where they obviously match. I really don't get the error.
Here is my code:
I first want to remove duplicate rows where the relevant columns are CustomerID and WebsiteID
Then I want to apply a function which returns true or false as a string randomly. Up to this point the resulting dataframe looks fine. The only warning I get is the one I described earlier.
And lastly I want to merge them and it results in a dataframe way larger than the original one. I really don't understand that.
import numpy as np
import pandas as pd
from numpy.random import choice
df = pd.DataFrame()
df['AdID'] = np.random.randint(1,1000001, size=100000)
df['CustomerID'] = np.random.randint(1,1001, size=len(df))
df["Datetime"] = choice(pd.date_range('2015-01-01', '2020-12-31'), len(df))
def check_weekday(date):
    res = len(pd.bdate_range(date, date))
    if res == 0:
        result = "Weekend"
    else:
        result = "Working Day"
    return result
df["Weekend"] = df["Datetime"].apply(check_weekday)
def apply_age():
    age = choice([16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36],
                 p=[.00009, .00159, .02908, .06829, .09102, .10043, .10609, .10072, .09223, .08018, .06836, .05552,
                    .04549,.03864, .03009, .02439, .01939, .01586, .01280, .01069, .00905])
    return age
def apply_income_class():
    income_class = choice([np.random.randint(50,501),np.random.randint(502,1001), np.random.randint(1002,1501),np.random.randint(1502,2001)],
                          p=[.442, .387, .148, .023])
    return income_class
def apply_gender():
    gender = choice(['male', 'female'], p=[.537, .463])
    return gender
unique_customers = df[['CustomerID']].drop_duplicates(keep="first")
unique_customers['Age'] = [apply_age() for x in unique_customers.index]
unique_customers['Gender'] = [apply_gender() for x in unique_customers.index]
unique_customers['Monthly Income'] = [apply_income_class() for x in unique_customers.index]
unique_customers['Spending Score'] = [np.random.randint(1,101) for x in unique_customers.index]
df = df.merge(unique_customers, on=['CustomerID'], how='left')
df['WebsiteID'] = np.random.randint(1,31, len(df))
df['OfferID'] = np.random.randint(1,2001, len(df))
df['BrandID'] = np.random.randint(1,10, len(df))
unique_offers = df[['OfferID']].drop_duplicates(keep="first")
print(len(unique_offers))
unique_offers['CategoryID'] = [np.random.randint(1,501) for x in unique_offers.index]
unique_offers['NPS'] = [np.random.randint(1, 101) for x in unique_offers.index]
df = df.merge(unique_offers, on=['OfferID'], how='left')
def apply_website_user():
    purchase = np.random.choice(['True', 'False'])
    return purchase
unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
print(unique_website_user.head())
df = df.merge(unique_website_user[['CustomerID','PurchaseOnWebsite']], on='CustomerID', how='left')
#df['PurchaseOnWebsite']= df.groupby(['CustomerID', 'WebsiteID']).apply(apply_website_user)
print(df.head())
# Create the CSV file
#df.to_csv(r'/Users/alina/Desktop/trainingsdaten.csv', sep=',', index=False)
It's better to paste the data, rather than provide images, so this is just guidance as I can't test it. You have a couple of issues and I don't think they are related.
Copy or slice warning: you might be able to get rid of this in two ways. One is to rewrite the line:
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
in the format the warning suggests. The other, simpler way that might work is to call .copy() on the line before it. You are dropping duplicates and then modifying the result, and pandas is just warning that you are modifying a slice or view of the original. Try this:
unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
If you just want to merge back that one column and reduce the number of columns, try this:
df = df.merge(unique_website_user[['CustomerID','PurchaseOnWebsite']], on='CustomerID', how='left')
Another alternative would be to use groupby() and apply your True/False function in an apply method. Something like:
df.groupby(['CustomerID']).apply(yourfunctionhere)
This avoids creating and merging dataframes. If you post all the code and the actual dataframe, we can be more specific.
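On the suffixes and the growing row count: pandas adds _x and _y when both frames share a non-key column name, and rows multiply when the key is not unique in the right-hand frame. Since the flag here is generated per (CustomerID, WebsiteID) pair, merging on both keys should avoid both effects. Below is a small sketch with a made-up five-row stand-in for df. The frame contents and the seeded generator are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small hypothetical stand-in for the question's df.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "WebsiteID": [10, 10, 20, 10, 10],
    "AdID": [101, 102, 103, 104, 105],
})

# One row per (CustomerID, WebsiteID) pair, carrying only the keys and the new column.
pairs = df[["CustomerID", "WebsiteID"]].drop_duplicates().copy()
pairs["PurchaseOnWebsite"] = rng.choice(["True", "False"], size=len(pairs))

# Merging on both keys keeps the row count unchanged; validate raises if a pair repeats.
merged = df.merge(pairs, on=["CustomerID", "WebsiteID"], how="left", validate="many_to_one")

# Merging on CustomerID alone repeats rows for customers seen on several websites,
# and the shared non-key column comes back as WebsiteID_x / WebsiteID_y.
blown_up = df.merge(pairs, on="CustomerID", how="left")

print(len(df), len(merged), len(blown_up))
print(list(blown_up.columns))
```

The validate argument is optional, but it turns a silent row explosion into an immediate error.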
UPDATE:
Saw your comment that you found your own answer. Also, the following is much faster than your call to the check_weekday function:
df["Weekend"] = df['Datetime'].apply(lambda x: 'Weekend' if (x.weekday() == 5 or x.weekday() == 6) else 'Working Day')
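If speed matters, the per-row lambda can probably be dropped altogether in favour of the .dt accessor. A minimal sketch, assuming the column holds datetime64 values as in the question; the four sample dates are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical sample dates: a Friday, a Saturday, a Sunday and a Monday.
dates = pd.Series(pd.to_datetime(["2020-12-25", "2020-12-26", "2020-12-27", "2020-12-28"]))

# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday.
is_weekend = dates.dt.dayofweek >= 5
labels = np.where(is_weekend, "Weekend", "Working Day")
print(labels)
```

This evaluates the whole column at once instead of calling a Python function per row.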

if statement and call function for dataframe

I know how to apply an IF condition in a pandas DataFrame (link).
However, my question is how to do the following:
if (df[df['col1'] == 0]):
    sys.path.append("/desktop/folder/")
    import self_module as sm
    df = sm.call_function(df)
What I really want to do is call call_function() when the value in col1 equals 0.
def call_function(ds):
    ds['new_age'] = (ds['age']* 0.012345678901).round(12)
    return ds
I provide a simple example above for call_function().
Since your function interacts with multiple columns and returns a whole data frame, run the conditional logic inside the method:
import numpy as np
def call_function(ds):
    ds['new_age'] = np.nan
    # only rows where col1 equals 0 receive a value; the rest stay NaN
    ds.loc[ds['col1'] == 0, 'new_age'] = ds['age'].mul(0.012345678901).round(12)
    return ds
df = call_function(df)
If you are unable to modify the function, run the method on splits of the data frame and concat them back together. Any new columns will be filled with NaN in the other split.
def call_function(ds):
    ds['new_age'] = (ds['age']* 0.012345678901).round(12)
    return ds
df = pd.concat([call_function(df[df['col1'] == 0].copy()),
                df[df['col1'] != 0].copy()])
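For context, the original if (df[...]) fails because a DataFrame has no single truth value, so pandas raises a ValueError. A row-wise condition needs a boolean mask, while a true scalar test needs .any() or .all(). Here is a self-contained sketch of both. The three-row frame is made up, and using Series.where is a choice of this sketch, not the answer's method.

```python
import pandas as pd

# Hypothetical sample data with the question's column names.
df = pd.DataFrame({"col1": [0, 1, 0], "age": [30, 40, 50]})

def call_function(ds):
    ds = ds.copy()                      # leave the caller's frame untouched
    mask = ds["col1"] == 0              # boolean Series, one entry per row
    scaled = (ds["age"] * 0.012345678901).round(12)
    ds["new_age"] = scaled.where(mask)  # keep values where mask is True, NaN elsewhere
    return ds

# A scalar test is only needed to decide whether to call the function at all.
if (df["col1"] == 0).any():
    out = call_function(df)
    print(out)
```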

issue in writing function to filter rows data frame

I am writing a function that will serve as filter for rows that I wanted to use.
The sample data frame is as follows:
df = pd.DataFrame()
df['Xstart'] = [1,2.5,3,4,5]
df['Xend'] = [6,8,9,10,12]
df['Ystart'] = [0,1,2,3,4]
df['Yend'] = [6,8,9,10,12]
df['GW'] = [1,1,2,3,4]
def filter(data,Game_week):
    pass_data = data[(data['GW'] == Game_week)]
When I call the function filter as follows, I get an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I use manual filter, it works.
pass_data = df [(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter the rows with multiple GW (1,2,3) etc.
For that I can do it manually as follows:
pass_data = df [(df['GW'] == [1])|(df['GW'] == [2])|(df['GW'] == [3])]
If I want to pass a list such as [1,2,3] as the function input,
how can I write the function so that I can input a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar. Also, filter is an existing built-in function in Python, so it is better to change the function name:
def filter_vals(data,Game_week):
    return data[data['GW'].isin(Game_week)]
df1 = filter_vals(df,range(1,4))
Your function has no return statement, so it returns None rather than the desired dataframe. Add a return (note that the parentheses inside data[...] are not needed either):
def filter(data,Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
For the first part, use return to return data from the function. For the second, use:
def filter(data,Game_week):
    return data[data['GW'].isin(Game_week)]
Now apply the filter function:
df1 = filter(df,[1,2])
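Note that isin expects a list-like, so calling the isin versions above with a plain 1 would raise a TypeError. A sketch that accepts a single game week, a list, or a range follows. The name filter_game_weeks and the int check are assumptions of this sketch, not from the answers.

```python
import pandas as pd

df = pd.DataFrame({"Xstart": [1, 2.5, 3, 4, 5], "GW": [1, 1, 2, 3, 4]})

def filter_game_weeks(data, game_weeks):
    """Return the rows whose GW is one of the requested game weeks."""
    # Wrap a single integer in a list so isin always receives a list-like.
    if isinstance(game_weeks, int):
        game_weeks = [game_weeks]
    return data[data["GW"].isin(list(game_weeks))]

single = filter_game_weeks(df, 1)
several = filter_game_weeks(df, range(1, 4))
print(single)
print(several)
```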

Applying similar functions across multiple columns in python/pandas

Problem: Given the dataframe below, I'm trying to come up with the code that will apply a function to three distinct columns without having to write three separate function calls.
The code for the data:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'days': [365, 365, 213, 318, 71],
'spend_30day': [22, 241.5, 0, 27321.05, 345],
'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54]}
df = pd.DataFrame(data)
cols = df.columns.tolist()
cols = ['name', 'days', 'spend_30day', 'spend_90day', 'spend_365day']
df = df[cols]
df
The function below will essentially annualize spend; if someone has fewer than, say, 365 days in the "days" column, the following function will tell me what the spend would have been if they had 365 days:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
Then I apply the function to the particular column:
df.spend_365day = df.apply(annualize_spend_365, axis=1).round(2)
df
This works exactly as I want it to for that one column. However, I don't want to have to rewrite this for each of the three different "spend" columns (30, 90, 365). I want to be able to write code that will generalize and apply this function to multiple columns in one pass.
I thought I could create lists of the columns and their respective days, use the "zip" function, and nest the function in a for loop, but my attempt below ultimately fails:
spend_cols = [df.spend_30day, df.spend_90day, df.spend_365day]
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    def annualize_spend(row):
        if (row.days/float(day)) < 1:
            return (row.col)/((row.days)/float(day))
        else:
            return row.col
    col = df.apply(annualize_spend, axis = 1)
The error:
AttributeError: ("'Series' object has no attribute 'col'")
I'm not sure why the loop approach is failing. Regardless, I'm hoping for guidance on how to generalize function application in pandas. Thanks in advance!
Look at your two function definitions:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
and
#col in [df.spend_30day, df.spend_90day, df.spend_365day]
def annualize_spend(row):
    if (row.days/float(day)) < 1:
        return (row.col)/((row.days)/float(day))
    else:
        return row.col
See the difference? In the first case you access the fields with explicit field names, and it works. In the second case you try to access row.col, which fails, and on top of that col holds the actual columns of df rather than their names. Instead try
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
before your loop. Also, in attribute syntax such as df.days the field name is literally "days", whereas with df.col you do not want the field named "col" but the field named by the value of the string col. So you should use row[col] in the latter case. And anyway, I'm not sure how wise it is to reuse col as an output variable inside your loop over col.
I'm unfamiliar with pandas.DataFrame.apply, but it's probably possible to use a single function definition, which takes the number of days and the field of interest as input variables:
def annualize_spend(col,day,row):
    if row['days']/float(day) < 1:
        return row[col]/(row['days']/float(day))
    else:
        return row[col]
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    # assign back to the dataframe column rather than to the loop variable
    df[col] = df.apply(lambda row,col=col,day=day: annualize_spend(col,day,row), axis = 1)
The lambda ensures that only one input parameter of your function is left free when it gets applied.
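The same scaling can likely be done without apply at all, by working on whole columns. The following is a sketch under the assumption that days is never zero. The three-row frame and the windows dict are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, kept small so the results are easy to check by hand.
df = pd.DataFrame({
    "days": [365, 15, 60],
    "spend_30day": [30.0, 10.0, 5.0],
    "spend_90day": [90.0, 10.0, 40.0],
})

# Map each spend column to the length of its window in days.
windows = {"spend_30day": 30, "spend_90day": 90}

for col, window in windows.items():
    # Fraction of the window actually observed, capped at 1 so full windows stay unchanged.
    fraction = (df["days"] / window).clip(upper=1)
    df[col] = (df[col] / fraction).round(2)

print(df)
```

Capping the fraction at 1 plays the role of the if/else in the row-wise function: rows with a full window are divided by 1 and so keep their original value.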
