Mapping a function to a dataframe - python

I am trying to apply a function to a dataframe in pandas, taking two columns as positional arguments. Below is the code I tried.
Code:
import pandas as pd

df_a = pd.read_csv('5_a.csv')

def y_pred(x):
    if x < .5:
        return 0
    else:
        return 1

df_a['y_pred'] = df_a['proba'].map(y_pred)
def confusion_matrix(act, pred):
    if act == 1 and act == pred:
        return 'TP'
    elif act == 0 and act == pred:
        return 'TN'
    elif act == 0 and pred == 1:
        return 'FP'
    elif act == 1 and pred == 0:
        return 'FN'

df_a['con_mat_label'] = df_a[['y','y_pred']].apply(confusion_matrix)
But apply is not passing y_pred as the second column to the pred parameter of the function.
I am getting this error:
TypeError: ("confusion_matrix() missing 1 required positional argument: 'pred'", 'occurred at index y')

The function you pass to apply receives a pandas Series as its argument, and with the axis argument you specify whether that Series is a row or a column.
So you need to modify your confusion_matrix function as follows (I am assuming that act corresponds to the column named y here):
def confusion_matrix(row):
    if row.y == 1 and row.y == row.y_pred:
        return 'TP'
    elif row.y == 0 and row.y == row.y_pred:
        return 'TN'
    elif row.y == 0 and row.y_pred == 1:
        return 'FP'
    elif row.y == 1 and row.y_pred == 0:
        return 'FN'
And you need to modify your apply call to
df_a['con_mat_label']=df_a[['y','y_pred']].apply(confusion_matrix, axis=1)
Now let me give you some tips on how you could improve your code.
Say you have a data frame like this:
>>> df
   X  Y
0  1  4
1  2  5
2  3  6
3  4  7
To add a Y_pred column
>>> df['Y_pred'] = (df.X < 3).astype(int)
>>> df
   X  Y  Y_pred
0  1  4       1
1  2  5       1
2  3  6       0
3  4  7       0
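Applying the same vectorised idea to your actual frame (this is just a sketch, assuming the columns are named proba, y and y_pred as in your code), numpy's np.select can replace both the map and apply calls:

import numpy as np

df_a['y_pred'] = (df_a['proba'] >= .5).astype(int)

conditions = [
    (df_a['y'] == 1) & (df_a['y_pred'] == 1),
    (df_a['y'] == 0) & (df_a['y_pred'] == 0),
    (df_a['y'] == 0) & (df_a['y_pred'] == 1),
    (df_a['y'] == 1) & (df_a['y_pred'] == 0),
]
df_a['con_mat_label'] = np.select(conditions, ['TP', 'TN', 'FP', 'FN'])

This evaluates each condition on the whole column at once instead of calling a Python function once per row.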

By default, apply takes each column one by one, runs it through the function, and returns a transformed column; there is more detail in the pandas documentation.
Your setup would be better served by a list comprehension. Here is how you can get the intended behavior:
df_a['con_mat_label'] = [confusion_matrix(act, pred) for (act, pred) in df_a[['y','y_pred']].to_numpy()]
Hope it helps!

Related

Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner getting familiar with pandas.
It throws an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd
drinks = pd.read_csv('drinks.csv')
def calculate(drinks):
    return drinks['beer_servings'] + drinks['spirit_servings'] + drinks['wine_servings']
print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate,axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, it receives each row of the DataFrame as its argument. Here the result of apply is a DataFrame with multiple columns, but you are trying to assign it to a single column, which is not possible. You can try updating your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']
drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
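As a side note (not part of the original answer): since total_servings is just a row sum, you can skip apply entirely; something along these lines should give the same result:

# vectorised row sum over the serving columns, no Python-level function calls
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].sum(axis=1)

# the sales columns can likewise be plain column arithmetic
drinks['beer_sales'] = drinks['beer_servings'] * 2
drinks['spirit_sales'] = drinks['spirit_servings'] * 4
drinks['wine_sales'] = drinks['wine_servings'] * 6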
I suppose the reason is a wrong argument name inside the calculate function: the parameter is named drink, but drinks (the whole DataFrame) is used to calculate the sum of the columns.
The point is that drink is a Series object representing a row, and the sum of its elements is a scalar, whereas drinks is a DataFrame and the sum of its columns is a Series object.
The sample code below shows that this method works.
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3]
})

def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]

df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
   A  B  C  total
0  1  2  3      6
1  1  2  3      6
2  1  2  3      6
3  1  2  3      6
4  1  2  3      6
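For contrast, here is a minimal sketch of the failure mode described above (the function ignoring its row argument and reaching for the outer DataFrame instead), which reproduces the "Cannot set a DataFrame with multiple columns to the single column" error:

def calculate(to_calc_df):
    # bug: the row argument is ignored and the outer DataFrame df is used,
    # so every call returns a Series (one value per column) instead of a scalar
    return df["A"] + df["B"] + df["C"]

# apply now builds a DataFrame (one column per element of the returned Series),
# and assigning that to the single column "total" raises the error above
df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)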

Pandas groupby result split into two columns?

I have a pandas dataframe and I want to summarize/reorganize it to produce a figure. I think what I'm looking for involves groupby.
Here's what my dataframe df looks like:
Channel Flag
1 pass
2 pass
3 pass
1 pass
2 pass
3 pass
1 fail
2 fail
3 fail
And this is what I want my dataframe to look like:
Channel pass fail
1 2 1
2 2 1
3 2 1
Running the following code gives something "close", but not in the format I would like:
In [12]: df.groupby(['Channel', 'Flag']).size()
Out[12]:
Channel  Flag
1        fail    1
         pass    2
2        fail    1
         pass    2
3        fail    1
         pass    2
Maybe this output is actually fine to make my plot. It's just that I already have the code to plot the data with the previous format. I'm adding the code in case it would be relevant:
df_all = pd.DataFrame()
df_all['All'] = df['Pass'] + df['Fail']
df_pass = df[['Pass']] # The double square brackets keep the column name
df_fail = df[['Fail']]
maxval = max(df_pass.index) # maximum channel value
layout = FastqPlots.make_layout(maxval=maxval)
value_cts = pd.Series(df_pass['Pass'])
for entry in value_cts.keys():
layout.template[np.where(layout.structure == entry)] = value_cts[entry]
sns.heatmap(data=pd.DataFrame(layout.template, index=layout.yticks, columns=layout.xticks),
xticklabels="auto", yticklabels="auto",
square=True,
cbar_kws={"orientation": "horizontal"},
cmap='Blues',
linewidths=0.20)
ax.set_title("Pass reads output per channel")
plt.tight_layout() # Get rid of extra margins around the plot
fig.savefig(out + "/channel_output_all.png")
Any help/advice would be much appreciated.
Thanks!
df.groupby(['Channel', 'Flag'], as_index=False).size().pivot(index='Channel', columns='Flag', values='size')
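Two equivalent ways to get the same wide table (not from the original answer, just alternatives, assuming the Channel and Flag column names from the question):

# unstack the Flag level of the grouped counts into columns
df.groupby(['Channel', 'Flag']).size().unstack(fill_value=0)

# or build the frequency table directly
pd.crosstab(df['Channel'], df['Flag'])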

How to access Pandas series value in a custom function

I'm working on a project to monitor my 5k time for my running/jogging activities based on their GPS data. I'm currently exploring my data in a Jupyter notebook & now realize that I will need to exclude some activities.
Each activity is a row in a dataframe. While I do want to exclude some rows, I don't want to drop them from my dataframe as I will also use the df for other calculations.
I've added a column to the df along with a custom function for checking the invalidity reasons of a row. It's possible that a run could be excluded for multiple reasons.
In []:
# add invalidity reasons column & update logic
df['invalidity_reasons'] = ''
def maintain_invalidity_reasons(reason):
    """logic for maintaining ['invalidity reasons']"""
    reasons = []
    if invalidity_reasons == '':
        return list(reason)
    else:
        reasons = invalidity_reasons
        reasons.append(reason)
        return reasons
I filter down to specific rows in my df and pass them to my function; the filter below returns five rows from the df. Here is an example of using the function in my Jupyter notebook.
In []:
columns = ['distance','duration','notes']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns].apply(maintain_invalidity_reasons('short_run'),axis=1)
Out []:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-107-0bd06407ef08> in <module>
2
3 filt = (df['duration'] < pd.Timedelta('5 minutes'))
----> 4 df.loc[filt,columns].apply(maintain_invalidity_reasons(reason='short_run'),axis=1)
<ipython-input-106-60264b9c7b13> in maintain_invalidity_reasons(reason)
5 """logic for maintaining ['invalidity reasons']"""
6 reasons = []
----> 7 if invalidity_reasons == '':
8 return list(reason)
9 else:
NameError: name 'invalidity_reasons' is not defined
Here is an example of the output of my filter if I remove the .apply() call to my function
In []:
columns = ['distance','duration', 'notes','invalidity_reasons']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns]
Out []:
It seems that my issue lies in not knowing how to specify that I want to reference the scalar value in the 'invalidity_reasons' index/column (not sure of the proper term) of the specific row.
I've tried adjusting the if statement with the variants below. I've also tried applying the function with and without the axis argument. I'm stuck, please help!
if 'invalidity_reasons' == '':
if s['invalidity_reasons'] == '':
This is pretty much a stab in the dark, but I hope it helps. In the following I'm using this simple frame as an example (to have something to work with):
df = pd.DataFrame({'Col': range(5)})
Now if you define
def maintain_invalidity_reasons(current_reasons, new_reason):
    if current_reasons == '':
        return [new_reason]
    if type(current_reasons) == list:
        return current_reasons + [new_reason]
    return [current_reasons] + [new_reason]
add another column invalidity_reasons to df
df['invalidity_reasons'] = ''
populate one cell (for the sake of exemplifying)
df.loc[0, 'invalidity_reasons'] = 'a reason'
   Col invalidity_reasons
0    0           a reason
1    1
2    2
3    3
4    4
build a filter
filt = (df.Col < 3)
and then do
df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
                                      .apply(maintain_invalidity_reasons,
                                             args=('another reason',)))
you will get
   Col          invalidity_reasons
0    0  [a reason, another reason]
1    1            [another reason]
2    2            [another reason]
3    3
4    4
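Applied back to the columns in your question (assuming duration and invalidity_reasons exist as shown there), the same pattern would look roughly like this:

filt = df['duration'] < pd.Timedelta('5 minutes')

# pass the current cell value as the first argument; the new reason goes in via args
df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
                                      .apply(maintain_invalidity_reasons,
                                             args=('short_run',)))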
Does that somehow resemble what you are looking for?

Reading values from dataframe.iloc is too slow, and a problem with dataframe.values

I use Python and I have data with 35,000 rows. I need to change values in a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, succes_7 .... succes_120, so I build the column name in the inner loop; the values depend on the other column.
Example:
SK_1  SK_2  SK_5  ...  SK_120  Succes_1  Succes_2  ...  Succes_120
   1     0     1            0         1         0                0
   1     1     0            1         2         1                1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1+i
Is there a faster way to do this? I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns an error; maybe it doesn't accept a string index.
You can collect the relevant columns and then increment with a boolean mask. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key function (plain string-based sorting would put '100' before '20').
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)

# add 1 to each Succes_x cell whose matching SK_x cell equals 1
df[succ_cols] += (df[sk_cols].to_numpy() == 1).astype(int)
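If you want to reproduce the original loop exactly (writing 1 + i, i.e. the row number, rather than incrementing), here is a sketch assuming a default RangeIndex so that 1 + i is simply the row position plus one:

import numpy as np

mask = df[sk_cols].to_numpy() == 1             # True where SK_x == 1
row_number = np.arange(1, len(df) + 1)         # the 1 + i from your loop

succ = df[succ_cols].to_numpy().copy()
succ[mask] = np.broadcast_to(row_number[:, None], mask.shape)[mask]
df[succ_cols] = succ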

pandas apply function with arguments no lambda

I am trying to apply a function to the rows of a dataframe with the apply args argument. I see multiple similar questions, but following the solutions does not seem to work. I have created a sample example.
Here I divide my dataframe by the sum of its columns
import numpy as np
import pandas as pd

pij = pd.DataFrame(np.random.randn(500, 2))
pij.divide(pij.sum(1), axis=0).head()
          0         1
0  1.077353 -0.690463
1  0.608302  0.583209
2 -0.724272 -1.665318
3 -0.735404 -0.606744
4 -0.033409 -0.162695
I know how to use lambdas to return the same result:
def lambda_divide(row):
    return row / row.sum(0)

pij.apply(lambda row: lambda_divide(row), axis=1).head()
          0         1
0  1.077353 -0.690463
1  0.608302  0.583209
2 -0.724272 -1.665318
3 -0.735404 -0.606744
4 -0.033409 -0.162695
However, when I try to use the apply arguments, it does not work
pij.apply(np.divide,args=(pij.sum(1)))
The full error suggests this is due to pandas special casing ufuncs:
   4045
   4046         if isinstance(f, np.ufunc):
-> 4047             results = f(self.values)
   4048             return self._constructor(data=results, index=self.index,
   4049                                      columns=self.columns, copy=False)

ValueError: invalid number of arguments
This looks like a bug!
In this specific case you can use div:
In [11]: df.div(df.sum(1), axis=0)
Out[11]:
          0         1
0  2.784649 -1.784649
1  0.510530  0.489470
2  0.303095  0.696905
3  0.547931  0.452069
4  0.170364  0.829636
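To answer the title question directly (passing extra arguments to apply without a lambda): args must be a tuple, so the trailing comma matters, and the row's index label can be used to look up its denominator. A sketch, with divide_row being a name made up for illustration:

def divide_row(row, denominators):
    # row.name is the row's index label; use it to pick the matching denominator
    return row / denominators[row.name]

# note the trailing comma: args=(pij.sum(1),) is a one-element tuple
pij.apply(divide_row, axis=1, args=(pij.sum(1),)).head()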
