I have the table below and I want to get, for each type, the percentage of rows with Seconds >= 10. What is an efficient, modular way to do that? I would normally filter for each type and then divide, but I want to know if there is a better way to calculate, for each value in the Type column, the percentage of rows that are >= 10 seconds.
Thanks
Type  Seconds
A     23
V     10
V     10
A     7
B     1
V     10
B     72
A     11
V     19
V     3
expected output:
type %
A .67
V .80
B .50
A slightly more efficient option is to create a boolean mask with Seconds.ge(10) and take groupby.mean() on the mask; the mean of a boolean mask is exactly the fraction of True values per group:
df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
# Type %
# 0 A 0.666667
# 1 B 0.500000
# 2 V 0.800000
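For reference, here is a minimal runnable version with the sample data from the question (column names assumed to match the table above):

import pandas as pd

df = pd.DataFrame({
    'Type': ['A', 'V', 'V', 'A', 'B', 'V', 'B', 'A', 'V', 'V'],
    'Seconds': [23, 10, 10, 7, 1, 10, 72, 11, 19, 3],
})

# the mean of a boolean mask is the fraction of True values per group
print(df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%'))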
Given these functions (note that groupby_apply returns percentages on a 0-100 scale, while the other two return fractions):

mask_groupby_mean = lambda df: df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
groupby_apply = lambda df: df.groupby('Type').Seconds.apply(lambda x: (x.ge(10).sum() / len(x)) * 100).reset_index(name='%')
set_index_mean = lambda df: df.set_index('Type').ge(10).groupby(level=0).mean().rename(columns={'Seconds': '%'}).reset_index()
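The timing run itself isn't shown; a minimal sketch of how one might compare them with timeit (assuming the sample df from above) is:

import timeit

big = pd.concat([df] * 10_000, ignore_index=True)  # scale the 10-row sample up

for name, fn in [('mask_groupby_mean', mask_groupby_mean),
                 ('groupby_apply', groupby_apply),
                 ('set_index_mean', set_index_mean)]:
    t = timeit.timeit(lambda: fn(big), number=10)
    print(f'{name}: {t / 10:.4f} s per call')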
You can use .groupby:
x = (
df.groupby("Type")["Seconds"]
.apply(lambda x: (x.ge(10).sum() / len(x)) * 100)
.reset_index(name="%")
)
print(x)
Prints:
Type %
0 A 66.666667
1 B 50.000000
2 V 80.000000
Another option: set_index + ge, then take the mean over the index level (mean(level=0) was removed in pandas 2.0, so use groupby(level=0).mean()):

new_df = (
    df.set_index('Type')['Seconds'].ge(10)
      .groupby(level=0).mean()
      .round(2)
      .reset_index(name='%')
)
new_df:

  Type     %
0    A  0.67
1    B  0.50
2    V  0.80
I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums up the value counts of the last n=3 rows, deletes those rows, and appends the sum as one row to the original value counts. It would also be nice to have everything as percentages. Any ideas how to implement this? Cheers!
For a DataFrame with percentages, use Series.iloc for positional indexing, create a DataFrame with Series.to_frame, then add the new row and a new column filled with percentages:

s = df['category'].value_counts()
n = 3

out = s.iloc[:-n].to_frame('count')
out.loc[f'Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print(out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
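If you need this more than once, the same steps wrap naturally into a small helper (a sketch; the summarize_counts name and signature are my own):

def summarize_counts(series, n=3):
    """Collapse the n rarest values into a single 'Other' row, with percentages."""
    s = series.value_counts()
    out = s.iloc[:-n].to_frame('count')
    out.loc[f'Other ({n})'] = s.iloc[-n:].sum()
    out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
    return out

print(summarize_counts(df['category'], n=3))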
I would use tail(-3) to get all rows except the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
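Note that pct here stays numeric (27.0 rather than '27%'); if you want percent strings as in the question, format inside the assign instead, e.g.:

out = counts.to_frame(name='count').assign(
    pct=lambda d: d['count'].div(d['count'].sum()).apply('{:.0%}'.format))
print(out)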
This snippet
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
will give you what you need, also in a nice format ;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%
I have a huge dataset with a lot of duplicates, and I want to remove every value that occurs fewer than 5 times according to value_counts().
If you want to remove values from the counts Series, use boolean indexing:

y = pd.Series(['a'] * 5 + ['b'] * 2 + ['c'] * 3 + ['d'] * 7)

s = y.value_counts()
out = s[s > 4]
print(out)
d 7
a 5
dtype: int64
If you want to remove those values from the original Series, use Series.isin:

y1 = y[y.isin(out.index)]
print(y1)
0 a
1 a
2 a
3 a
4 a
10 d
11 d
12 d
13 d
14 d
15 d
16 d
dtype: object
Thank you Mr. jezrael, your answer is so helpful. I will add a small tip: after gathering the frequent values, this is how you can replace the rare ones with a placeholder instead of dropping them:

s = y.value_counts()
x = s[s > 4]                           # values that occur at least 5 times
y = y.where(y.isin(x.index), 'Other')  # replace everything else with 'Other'
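A quick check of the result, continuing the example above:

print(y.value_counts())
# d        7
# a        5
# Other    5
# dtype: int64
# (the order of the two counts of 5 may vary)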
Hi, I have the following function to decide the winner:

def winner(T1, T2, S1, S2, PS1, PS2):
    if S1 > S2:
        return T1
    elif S2 > S1:
        return T2
    else:
        # print('Winner will be decided via penalty shoot out')
        Ninit = 5
        Ts1 = np.sum(np.random.random(size=Ninit)) * PS1
        Ts2 = np.sum(np.random.random(size=Ninit)) * PS2
        if Ts1 > Ts2:
            return T1
        elif Ts2 > Ts1:
            return T2
        else:
            return 'Draw'
And I have the following data frame:
df = pd.DataFrame()
df['Team1'] = ['A', 'B', 'C', 'D', 'E', 'F']
df['Score1'] = [1, 2, 3, 1, 2, 4]
df['Team2'] = ['U', 'V', 'W', 'X', 'Y', 'Z']
df['Score2'] = [2, 2, 2, 2, 3, 3]
df['Match'] = df['Team1'] + ' Vs ' + df['Team2']
df['Match_no'] = [1, 2, 3, 4, 5, 6]
df['P1'] = [0.8, 0.7, 0.6, 0.9, 0.75, 0.77]
df['P2'] = [0.75, 0.75, 0.65, 0.78, 0.79, 0.85]
I want to create a new column to which the winner of each match is assigned. To decide the winner, I use the function winner. I tested the function with scalar inputs and it works. But when I passed in the DataFrame columns, as follows:

df['Winner'] = winner(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2)

it showed me the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Can anyone advise why there is an error?
Thanks
Zep.
Your function isn't set up to take pandas.Series as inputs: a comparison like S1 > S2 then yields a whole boolean Series, which has no single truth value, hence the ValueError. Call the function row by row instead:
df['Winner'] = [
winner(*t) for t in zip(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2)]
df
Team1 Score1 Team2 Score2 Match Match_no P1 P2 Winner
0 A 1 U 2 A Vs U 1 0.80 0.75 U
1 B 2 V 2 B Vs V 2 0.70 0.75 V
2 C 3 W 2 C Vs W 3 0.60 0.65 C
3 D 1 X 2 D Vs X 4 0.90 0.78 X
4 E 2 Y 3 E Vs Y 5 0.75 0.79 Y
5 F 4 Z 3 F Vs Z 6 0.77 0.85 F
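An equivalent row-wise call, if you prefer DataFrame.apply over the list comprehension:

df['Winner'] = df.apply(
    lambda r: winner(r.Team1, r.Team2, r.Score1, r.Score2, r.P1, r.P2), axis=1)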
Another way to go about it
def winner(T1, T2, S1, S2, PS1, PS2):
    ninit = 5
    Ts1 = np.random.rand(ninit).sum() * PS1
    Ts2 = np.random.rand(ninit).sum() * PS2
    a = np.select(
        [S1 > S2, S2 > S1, Ts1 > Ts2, Ts2 > Ts1],
        [T1, T2, T1, T2], 'DRAW')
    return a
df.assign(Winner=winner(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2))
Team1 Score1 Team2 Score2 Match Match_no P1 P2 Winner
0 A 1 U 2 A Vs U 1 0.80 0.75 U
1 B 2 V 2 B Vs V 2 0.70 0.75 B
2 C 3 W 2 C Vs W 3 0.60 0.65 C
3 D 1 X 2 D Vs X 4 0.90 0.78 X
4 E 2 Y 3 E Vs Y 5 0.75 0.79 Y
5 F 4 Z 3 F Vs Z 6 0.77 0.85 F
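Note that the penalty shoot-out uses random draws, so tied matches (like B vs V here) can change winner between runs; seeding NumPy first makes the column reproducible:

np.random.seed(0)  # any fixed seed works; 0 is arbitrary
df = df.assign(Winner=winner(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2))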
I've been trying to implement the following plyr chain in python:
# Data
data_L1
X Y r2 contact_id acknowledge_issues
a c 100 xyzx 0
b d 100 fsdjkfl 0
a c 80 ejrkl 20
b d 60 fdsdl 40
b d 80 gsdkf 20
# Transformation
test <- ddply(data_L1,
.(X,Y),
summarize,
avg_r2 = mean(r2),
tickets = length(unique(contact_id)),
er_ai =length(acknowledge_issues[which(acknowledge_issues>0)])/length(acknowledge_issues)
)
# Output
test
X Y avg_r2 tickets er_ai
a c 90 2 0.5
b d 80 3 0.6667
However, I only got this far in Python:

test = data_L1.groupby(['X','Y']).agg({'r2': 'mean', 'contact_id': 'count'})

I can't figure out how to create the variable er_ai in Python. Do you have suggestions for solutions in pandas or other libraries?
Use nunique instead of count, and for er_ai take the mean of the boolean condition, i.e. the fraction of values greater than 0:

cols = {'r2': 'avg_r2', 'contact_id': 'tickets', 'acknowledge_issues': 'er_ai'}
test = (data_L1.groupby(['X','Y'], as_index=False)
               .agg({'r2': 'mean',
                     'contact_id': 'nunique',
                     'acknowledge_issues': lambda x: (x > 0).mean()})
               .rename(columns=cols))
print(test)
X Y tickets er_ai avg_r2
0 a c 2 0.500000 90
1 b d 3 0.666667 80
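In pandas 0.25+ the same aggregation reads more cleanly with named aggregation, which avoids the separate rename (a sketch of the equivalent call):

test = (data_L1.groupby(['X', 'Y'], as_index=False)
               .agg(avg_r2=('r2', 'mean'),
                    tickets=('contact_id', 'nunique'),
                    er_ai=('acknowledge_issues', lambda x: (x > 0).mean())))
print(test)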
I want to know if there is a faster way to do the following loop, maybe using apply or a rolling apply. Basically, I need to access the previous row's (already updated) value to determine the current cell's value.
df.iloc[0] = (np.abs(df.iloc[0]) >= So) * np.sign(df.iloc[0])  # So is a threshold defined elsewhere
for i in range(1, len(df)):
    for col in list(df.columns.values):
        if (df[col].iloc[i] > 1.25) & (df[col].iloc[i-1] == 0):
            df[col].iloc[i] = 1
        elif (df[col].iloc[i] < -1.25) & (df[col].iloc[i-1] == 0):
            df[col].iloc[i] = -1
        elif ((df[col].iloc[i] <= -0.75) & (df[col].iloc[i-1] < 0)) | ((df[col].iloc[i] >= 0.5) & (df[col].iloc[i-1] > 0)):
            df[col].iloc[i] = df[col].iloc[i-1]
        else:
            df[col].iloc[i] = 0
As you can see, I am updating the dataframe as I go: I need to access the most recently updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
You can use the .shift() function to access previous or next values.
Previous value of column col:
df['col'].shift()
Next value of column col:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using the standard pandas syntax but I am not sure if it performs any better than your loop... My purposes just required this for consistency (not speed).
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})
new_col = 'c'

def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)

df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
#    a   b   c
# 0  0   0   0
# 1  1  10  11
# 2  2  20  33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
@maxU has it right with shift, I think you can even compare dataframes directly, something like this:
df_prev = df.shift()  # previous row; shift(-1) would be the next row
df_out = pd.DataFrame(index=df.index, columns=df.columns)
df_out[(df > 1.25) & (df_prev == 0)] = 1
df_out[(df < -1.25) & (df_prev == 0)] = -1
df_out[(df < -.75) & (df_prev < 0)] = df_prev
df_out[(df > .5) & (df_prev > 0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DF itself. You're better off going column by column: dump the column to a list, do the updating there, then assign it back. Something like this:
df.iloc[0] = (np.abs(df.iloc[0]) >= 1.25) * np.sign(df.iloc[0])
for col in df.columns.tolist():
    currData = df[col].tolist()
    for currRow in range(1, len(currData)):
        if currData[currRow] > 1.25 and currData[currRow-1] == 0:
            currData[currRow] = 1
        elif currData[currRow] < -1.25 and currData[currRow-1] == 0:
            currData[currRow] = -1
        elif currData[currRow] <= -0.75 and currData[currRow-1] < 0:
            currData[currRow] = currData[currRow-1]
        elif currData[currRow] >= 0.5 and currData[currRow-1] > 0:
            currData[currRow] = currData[currRow-1]
        else:
            currData[currRow] = 0
    df[col] = currData
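If the pure-Python loop is still too slow, this per-column state machine is a good candidate for JIT compilation. A sketch using numba (assumptions: numba is installed, and the thresholds are the question's hard-coded values):

import numba
import numpy as np

@numba.njit
def state_machine(col):
    # col is a 1-D float array whose first element is already initialised
    out = col.copy()
    for i in range(1, len(out)):
        if out[i] > 1.25 and out[i-1] == 0:
            out[i] = 1
        elif out[i] < -1.25 and out[i-1] == 0:
            out[i] = -1
        elif (out[i] <= -0.75 and out[i-1] < 0) or (out[i] >= 0.5 and out[i-1] > 0):
            out[i] = out[i-1]
        else:
            out[i] = 0
    return out

for col in df.columns:
    df[col] = state_machine(df[col].to_numpy(dtype=np.float64))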