I have a pandas dataframe and I want to summarize/reorganize it to produce a figure. I think what I'm looking for involves groupby.
Here's what my dataframe df looks like:
Channel  Flag
1        pass
2        pass
3        pass
1        pass
2        pass
3        pass
1        fail
2        fail
3        fail
And this is what I want my dataframe to look like:
Channel  pass  fail
1        2     1
2        2     1
3        2     1
Running the following code gives something "close", but not in the format I would like:
In [12]: df.groupby(['Channel', 'Flag']).size()
Out[12]:
Channel  Flag
1        fail    1
         pass    2
2        fail    1
         pass    2
3        fail    1
         pass    2
Maybe this output is actually fine to make my plot. It's just that I already have the code to plot the data with the previous format. I'm adding the code in case it would be relevant:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df_all = pd.DataFrame()
df_all['All'] = df['Pass'] + df['Fail']
df_pass = df[['Pass']]  # The double square brackets keep the column name
df_fail = df[['Fail']]

maxval = max(df_pass.index)  # maximum channel value
layout = FastqPlots.make_layout(maxval=maxval)
value_cts = pd.Series(df_pass['Pass'])
for entry in value_cts.keys():
    layout.template[np.where(layout.structure == entry)] = value_cts[entry]

ax = sns.heatmap(data=pd.DataFrame(layout.template, index=layout.yticks,
                                   columns=layout.xticks),
                 xticklabels="auto", yticklabels="auto",
                 square=True,
                 cbar_kws={"orientation": "horizontal"},
                 cmap='Blues',
                 linewidths=0.20)
ax.set_title("Pass reads output per channel")
plt.tight_layout()  # Get rid of extra margins around the plot
ax.get_figure().savefig(out + "/channel_output_all.png")
Any help/advice would be much appreciated.
Thanks!
df.groupby(['Channel', 'Flag'], as_index=False).size().pivot(index='Channel', columns='Flag', values='size')
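The same table can also be produced with `unstack` (or `pd.crosstab`), which skips the pivot step entirely; a small sketch rebuilding the example data from the question:

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'Channel': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'Flag': ['pass'] * 6 + ['fail'] * 3,
})

# Count each (Channel, Flag) pair, then move Flag into the columns
counts = df.groupby(['Channel', 'Flag']).size().unstack(fill_value=0)
print(counts)
# pd.crosstab(df['Channel'], df['Flag']) gives the same table
```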
I am really hoping you can help me here... I need to assign each label (df_label) to the exact file within a dataframe (df_data) and save all labels that appear in each file in a separate txt file (that's the easy bit).
df_data:
file_name file_start file_end
0 20190201_000004.wav 0.000 1196.000
1 20190201_002003.wav 1196.000 2392.992
2 20190201_004004.wav 2392.992 3588.992
3 20190201_010003.wav 3588.992 4785.984
4 20190201_012003.wav 4785.984 5982.976
df_label:
Begin Time (s)
0 27467.100000
1 43830.400000
2 43830.800000
3 46378.200000
I have tried to switch to np.array and use for loop and np.where but without any success...
If each time value in df_label falls within exactly one entry in df_data, you can use the following:
def get_file_name(begin_time):
    file_names = df_data[
        (df_data["file_start"] <= begin_time)
        & (df_data["file_end"] >= begin_time)
    ]["file_name"].values
    return file_names[0] if file_names.size > 0 else None

df_label["file_name"] = df_label["Begin Time (s)"].apply(get_file_name)
This will add another column, file_name, to df_label.
If the labels in df_label match the order of the files in df_data, you can simply:
add the labels as a new column of df_data (df_data["label"] = df_label["Begin Time (s)"]),
or
use the DataFrame.merge() function (df_data = df_data.merge(df_label, left_index=True, right_index=True)).
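Both options above rely only on row order; a minimal sketch with toy frames (the frame contents here are illustrative, not your real data):

```python
import pandas as pd

# Toy stand-ins for df_data / df_label
df_data = pd.DataFrame({
    "file_name": ["a.wav", "b.wav", "c.wav"],
    "file_start": [0.0, 10.0, 20.0],
    "file_end": [10.0, 20.0, 30.0],
})
df_label = pd.DataFrame({"Begin Time (s)": [1.5, 12.0, 25.0]})

# Option 1: index-aligned column assignment
df_data["label"] = df_label["Begin Time (s)"]

# Option 2: index-aligned merge (joins row 0 to row 0, row 1 to row 1, ...)
merged = df_data.merge(df_label, left_index=True, right_index=True)
print(merged)
```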
You can find more about merging/joining, with examples, here:
https://thispointer.com/pandas-how-to-merge-dataframes-by-index-using-dataframe-merge-part-3/
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
I'm working on a project to monitor my 5k time for my running/jogging activities based on their GPS data. I'm currently exploring my data in a Jupyter notebook & now realize that I will need to exclude some activities.
Each activity is a row in a dataframe. While I do want to exclude some rows, I don't want to drop them from my dataframe as I will also use the df for other calculations.
I've added a column to the df along with a custom function for checking the invalidity reasons of a row. It's possible that a run could be excluded for multiple reasons.
In []:
# add invalidity reasons column & update logic
df['invalidity_reasons'] = ''

def maintain_invalidity_reasons(reason):
    """logic for maintaining ['invalidity reasons']"""
    reasons = []
    if invalidity_reasons == '':
        return list(reason)
    else:
        reasons = invalidity_reasons
        reasons.append(reason)
        return reasons
I filter down to specific rows in my df and pass them to my function; the filter below matches five rows of the df. Here is an example of calling the function in my Jupyter notebook.
In []:
columns = ['distance','duration','notes']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns].apply(maintain_invalidity_reasons('short_run'),axis=1)
Out []:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-107-0bd06407ef08> in <module>
2
3 filt = (df['duration'] < pd.Timedelta('5 minutes'))
----> 4 df.loc[filt,columns].apply(maintain_invalidity_reasons(reason='short_run'),axis=1)
<ipython-input-106-60264b9c7b13> in maintain_invalidity_reasons(reason)
5 """logic for maintaining ['invalidity reasons']"""
6 reasons = []
----> 7 if invalidity_reasons == '':
8 return list(reason)
9 else:
NameError: name 'invalidity_reasons' is not defined
Here is an example of the output of my filter if I remove the .apply() call to my function
In []:
columns = ['distance','duration', 'notes','invalidity_reasons']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns]
Out []:
It seems that my issue lies in not knowing how to specify that I want to reference the scalar value in the 'invalidity_reasons' index/column (not sure of the proper term) of the specific row.
I've tried adjusting the if statement with the variants below. I've also tried applying the function with and without the axis argument. I'm stuck; please help!
if 'invalidity_reasons' == '':
if s['invalidity_reasons'] == '':
This is pretty much a stab in the dark, but I hope it helps. In the following I'm using this simple frame as an example (to have something to work with):
df = pd.DataFrame({'Col': range(5)})
Now if you define
def maintain_invalidity_reasons(current_reasons, new_reason):
    if current_reasons == '':
        return [new_reason]
    if type(current_reasons) == list:
        return current_reasons + [new_reason]
    return [current_reasons] + [new_reason]
add another column invalidity_reasons to df
df['invalidity_reasons'] = ''
populate one cell (for the sake of exemplifying)
df.loc[0, 'invalidity_reasons'] = 'a reason'
   Col invalidity_reasons
0    0           a reason
1    1
2    2
3    3
4    4
build a filter
filt = (df.Col < 3)
and then do
df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
                                      .apply(maintain_invalidity_reasons,
                                             args=('another reason',)))
you will get
   Col          invalidity_reasons
0    0  [a reason, another reason]
1    1            [another reason]
2    2            [another reason]
3    3
4    4
Does that somehow resemble what you are looking for?
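For reference, here are the steps above combined into one runnable snippet (same toy frame and column names as in the answer):

```python
import pandas as pd

df = pd.DataFrame({'Col': range(5)})
df['invalidity_reasons'] = ''
df.loc[0, 'invalidity_reasons'] = 'a reason'

def maintain_invalidity_reasons(current_reasons, new_reason):
    # '' -> start a fresh list; list -> extend it; bare string -> wrap, then extend
    if current_reasons == '':
        return [new_reason]
    if isinstance(current_reasons, list):
        return current_reasons + [new_reason]
    return [current_reasons] + [new_reason]

filt = df.Col < 3
df.loc[filt, 'invalidity_reasons'] = (
    df.loc[filt, 'invalidity_reasons']
      .apply(maintain_invalidity_reasons, args=('another reason',))
)
print(df)
```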
I was trying to apply a function to a dataframe in pandas. I am trying to take two columns as positional arguments and map a function to it. Below is the code I tried.
Code:
df_a = pd.read_csv('5_a.csv')

def y_pred(x):
    if x < .5:
        return 0
    else:
        return 1

df_a['y_pred'] = df_a['proba'].map(y_pred)

def confusion_matrix(act, pred):
    if act == 1 and act == pred:
        return 'TP'
    elif act == 0 and act == pred:
        return 'TN'
    elif act == 0 and pred == 1:
        return 'FN'
    elif act == 1 and pred == 0:
        return 'FP'

df_a['con_mat_label'] = df_a[['y', 'y_pred']].apply(confusion_matrix)
But the function is not treating y_pred as the second column and mapping it to the pred parameter of the defined function.
I am getting this error:
TypeError: ("confusion_matrix() missing 1 required positional argument: 'pred'", 'occurred at index y')
What you get as the argument of the function you pass to apply is a pandas Series; with the axis argument you specify whether that Series is a row or a column.
So you need to modify your confusion_matrix function to the following (I am assuming that act corresponds to the column name y here):
def confusion_matrix(row):
    if row.y == 1 and row.y == row.y_pred:
        return 'TP'
    elif row.y == 0 and row.y == row.y_pred:
        return 'TN'
    elif row.y == 0 and row.y_pred == 1:
        return 'FN'
    elif row.y == 1 and row.y_pred == 0:
        return 'FP'
And you need to modify your apply call to
df_a['con_mat_label']=df_a[['y','y_pred']].apply(confusion_matrix, axis=1)
Now let me give you some tips on how you could improve your code.
Say you have a data frame like this:
>>> df
   X  Y
0  1  4
1  2  5
2  3  6
3  4  7
To add a Y_pred column
>>> df['Y_pred'] = (df.X < 3).astype(int)
>>> df
   X  Y  Y_pred
0  1  4       1
1  2  5       1
2  3  6       0
3  4  7       0
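As a side note, this kind of row-wise labeling can also be vectorized with np.select, which avoids apply entirely; a sketch using the same conditions and labels as in the question (the toy frame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [1, 0, 0, 1], 'y_pred': [1, 0, 1, 0]})

# One boolean condition per label, checked in order (labels as in the question)
conditions = [
    (df.y == 1) & (df.y_pred == 1),
    (df.y == 0) & (df.y_pred == 0),
    (df.y == 0) & (df.y_pred == 1),
    (df.y == 1) & (df.y_pred == 0),
]
df['con_mat_label'] = np.select(conditions, ['TP', 'TN', 'FN', 'FP'])
```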
Oh, by the way, I would like to refer you to this interesting blog post.
The apply function takes each column one by one, runs it through the function, and returns a transformed column; there is more documentation on it in the pandas documentation.
Your setup would be better suited to a list comprehension. Here is how you can get the intended behavior:
df_a['con_mat_label'] = [confusion_matrix(act, pred) for (act, pred) in df_a[['y', 'y_pred']].to_numpy()]
Hope it helps!
I use Python and I have data of 35,000 rows; I need to change values in a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, succes_7 ... succes_120, so I get the name of each column in the outer loop; the new values depend on the other columns.
Example:
SK_1  SK_2  SK_5  ...  SK_120  Succes_1  Succes_2  ...  Succes_120
1     0     1          0       1         0              0
1     1     0          1       2         1              1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1 + i
I'm asking if there is a faster way to do this. I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1 + i
but it returns an error; maybe .values doesn't accept a string index.
You can collect the relevant column names first and then do the update in a single vectorized step instead of cell by cell. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key function, since plain string sorting puts '20' before '100'.
import numpy as np

def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)

# Where SK_x is 1 in row i, set Succes_x to 1 + i, as in your loop
mask = df[sk_cols].to_numpy() == 1
rows = np.broadcast_to(np.arange(1, len(df) + 1)[:, None], mask.shape)
vals = df[succ_cols].to_numpy()
vals[mask] = rows[mask]
df[succ_cols] = vals
I have this pandas dataframe:
df =
GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0
I need to create a piechart (using Python or R). The size of each pie should correspond to the proportional count (i.e. the percent) of rows with particular GROUP. Moreover, each pie should be divided into 2 sub-parts corresponding to the percent of rows with MARK==1 and MARK==0 within given GROUP.
I was googling for this type of pie chart and found this one, but that example seems overcomplicated for my case. Another good example is done in JavaScript, which doesn't work for me because of the language.
Can somebody tell me what this type of pie chart is called and where I can find some example code in Python or R?
Here is a solution in R that uses base R only. Not sure how you want to arrange your pies, but I used par(mfrow=...).
df <- read.table(text=" GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0", header=TRUE)
plot_pie <- function(x, multiplier=1, label){
  pie(table(x), radius=multiplier * length(x), main=label)
}

par(mfrow=c(1,3), mar=c(0,0,2,0))
invisible(lapply(split(df, df$GROUP), function(x){
  plot_pie(x$MARK, label=unique(x$GROUP),
           multiplier=0.2)
}))
This is the result:
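Since the question also asks about Python, roughly the same idea can be sketched with matplotlib (the data is rebuilt from the question; scaling the radius by the group's share of rows is one of several reasonable choices, and the output filename is illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'GROUP': ['ABC'] * 3 + ['DEF'] * 4 + ['XXX'],
    'MARK':  [1, 0, 1, 1, 1, 1, 1, 0],
})

counts = df.groupby(['GROUP', 'MARK']).size().unstack(fill_value=0)
sizes = counts.sum(axis=1)  # rows per GROUP -> relative pie size

fig, axes = plt.subplots(1, len(counts), figsize=(9, 3))
for ax, (group, row) in zip(axes, counts.iterrows()):
    slices = row[row > 0]  # drop empty MARK categories for this group
    ax.pie(slices, labels=['MARK=%s' % m for m in slices.index],
           radius=float(sizes[group] / sizes.max()))
    ax.set_title(group)
fig.savefig('group_pies.png')
```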