I want to group my DataFrame and then compute the mean number of dummy occurrences per group.
import pandas as pd

df3 = pd.DataFrame({'Number': ['001','001','001','002','002','002','002'],
                    'name': ['peter','chris','meg','albert','cathrine','leo','leo'],
                    'dummy': [0,1,1,0,0,1,1]})
I could calculate the mean number of unique occurrences (based on names) per group using this code:
test = df3.groupby('Number')
test_1 = []
for name, group in test:
    x = len(group.name.unique())
    test_1.append(x)
pd.Series(test_1).mean()
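For reference, the same mean can be computed in one line with nunique (an equivalent sketch, not the original code):

df3.groupby('Number')['name'].nunique().mean()   # 3.0 for the sample data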
Now I want to calculate how often the dummy equals 1 on average per group, counting each unique name only once.
So for this example the calculation would be (2 + 1) / 2 = 1.5,
where (unique dummy count from group 1 (2) + unique dummy count from group 2 (1)) is divided by the number of groups (2), giving 1.5 unique dummy counts on average per group.
Note that if there is no dummy in a group, the number of groups in the denominator should still increase by 1.
Please comment if I didn't express the task clearly!
OK, I just found an answer to my question, even though it is a bit of a workaround:
df3 = pd.DataFrame({'Number': ['001','001','001','002','002','002','002'],
                    'name': ['peter','chris','meg','albert','cathrine','leo','leo'],
                    'dummy': [0,1,0,0,0,1,1]})
df4 = df3.loc[df3.dummy.isin([1])]   # new dataframe with only the rows where dummy == 1
test = df4.groupby('Number')         # group it by the Number column
test_1 = []
for name, group in test:
    x = len(group.name.unique())     # take only the unique names in each group
    test_1.append(x)
pd.Series(test_1).sum() / len(test)  # divide the value count by the number of groups
s = df3.groupby('Number').agg({"name":["nunique"], "dummy": ["sum"]})
sum(s["name"]["nunique"]/s["dummy"]["sum"])
If I understand correctly what you meant.
And a more elegant implementation:
def my_func(x):
    n = x['name'].nunique()
    s = x['dummy'].sum()
    return n / s

df3.groupby('Number').apply(my_func).mean()
Edit
I finally think I understood after seeing the solution suggested by the question asker:
df4 = df3[df3.dummy == 1]
df4.groupby('Number').apply(lambda x: x["name"].nunique()).sum()/df4.Number.nunique()
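Note that the asker also wants groups that contain no dummy at all to count in the denominator. Dividing by the number of groups in the full DataFrame instead of the filtered one would handle that; a minimal sketch, using the first sample data where the expected result is 1.5:

unique_hits = df3[df3.dummy == 1].groupby('Number')['name'].nunique().sum()
unique_hits / df3['Number'].nunique()   # (2 + 1) / 2 = 1.5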
Related
I have a dataframe with 0s and 1s, and I would like to count the groups of 1s (ignoring the 0s) with a Pandas solution (not itertools, not plain Python iteration).
Other SO posts suggest methods based on shift()/diff()/cumsum(), which seem not to work when the leading sequence in the dataframe starts with 0.
df = pandas.Series([0,1,1,1,0,0,1,0,1,1,0,1,1]) # should give 4
df = pandas.Series([1,1,0,0,1,0,1,1,0,1,1]) # should also give 4
df = pandas.Series([1,1,1,1,1,0,1]) # should give 2
Any idea?
If you only have 0/1, you can use:
s = pd.Series([0,1,1,1,0,0,1,0,1,1,0,1,1])
count = s.diff().fillna(s).eq(1).sum()
output: 4 (4 and 2 for the other two)
The fillna ensures that a Series starting with 1 is still counted.
Faster alternative:
Use diff, count the 1s, and correct the result with the first item:
count = s.diff().eq(1).sum()+(s.iloc[0]==1)
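As a quick sanity check (a sketch, not part of the original answer), this correction gives the expected counts on all three sample series:

for data in ([0,1,1,1,0,0,1,0,1,1,0,1,1], [1,1,0,0,1,0,1,1,0,1,1], [1,1,1,1,1,0,1]):
    s = pd.Series(data)
    print(s.diff().eq(1).sum() + (s.iloc[0] == 1))   # 4, 4, 2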
Comparison of the different pandas approaches (benchmark not reproduced here).
Let us identify the different groups of 1s using cumsum, then use nunique to count the number of unique groups:
m = df.eq(0)
m.cumsum()[~m].nunique()
Result
case 1: 4
case 2: 4
case 3: 2
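To see why this works, a short walk-through on the first sample series (a sketch): the cumulative sum of the zero-mask stays constant inside each run of 1s, so after dropping the zeros every run keeps its own label.

df = pd.Series([0,1,1,1,0,0,1,0,1,1,0,1,1])
m = df.eq(0)
labels = m.cumsum()[~m]   # 1,1,1, 3, 4,4, 5,5 -> four distinct labels
print(labels.nunique())   # 4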
To do so, I have a list of lists (which are my clusters), for example:
asset_clusts=[[0,1],[3,5],[2,4, 12],...]
and the original DataFrame (in my code I call it 'x') holds the return time series of S&P 500 companies.
I want to take columns [0, 1] of the original DataFrame, compute their mean by row and store it in a new DataFrame, then compute the mean of columns [3, 5] and add it to the new DataFrame, and so on ...
mu = pd.DataFrame()
for j in range(get_number_of_elements(asset_clusts)):
    mu = x.iloc[:, asset_clusts[j]].mean(axis=1)
but it gives me only one column, and I checked: this single column is the mean of the last cluster's columns.
In case of ambiguity, the get_number_of_elements function is:
def get_number_of_elements(clist):
    count = 0
    for element in clist:
        count += 1
    return count
I solved it, and in case it is helpful for others, here is the final function:
def clustered_series(x, org_asset_clust):
    """
    x: return data
    org_asset_clust: list of clusters
    ----> mean of each cluster's returns by row
    """
    def get_number_of_elements(org_asset_clust):
        count = 0
        for element in org_asset_clust:
            count += 1
        return count

    mu = []
    for j in range(get_number_of_elements(org_asset_clust)):
        mu.append(x.iloc[:, org_asset_clust[j]].mean(axis=1))
    cluster_mean = pd.concat(mu, axis=1)
    return cluster_mean
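A minimal usage sketch with made-up return data (the column positions and values are assumptions for illustration; note the inner helper is equivalent to len, so iterating the cluster list directly would work as well):

import pandas as pd

x = pd.DataFrame([[0.01, 0.03, 0.05, 0.02, 0.00, 0.01],
                  [0.02, 0.00, 0.01, 0.04, 0.02, 0.03]])
asset_clusts = [[0, 1], [3, 5], [2, 4]]
print(clustered_series(x, asset_clusts))   # one column of row-wise means per cluster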
I have a Pandas DataFrame like:
COURSE BIB# COURSE 1 COURSE 2 STRAIGHT-GLIDING MEAN PRESTASJON
1 2 20.220 22.535 19.91 21.3775 1.073707
0 1 21.235 23.345 20.69 22.2900 1.077332
This is from a pilot and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I have no idea what I am looking for. I have looked up in the documentation for the random module in Python but that is not exactly what I am looking for. I have seen some questions/posts pointing to a scikit-learn stratification function but I don't know if that is a good choice. Alternatively, is there a way to create a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish.
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
df['group'] = np.where(df['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned in the 'group' column if 'PRESTASJON' exceeds the threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to assign them alternately to two groups:
# sort your dataframe by the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A' and assign every other row to 'B'
df1['group'] = 'A'
df1.iloc[1::2,-1] = 'B'
Are you splitting the dataframe alternately? If so, you can do:
import numpy as np

df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) %2
group1 = df1.loc[mask==0]
group2 = df1.loc[mask==1]
Starting with a CSV file with the columns ['race_number', 'number_of_horses_bet_on','odds']
I would like to add/calculate an extra column called 'desired_output'.
The 'desired_output' column is computed as follows:
for 'race_number' 1, 'number_of_horses_bet_on' = 2, so in the 'desired_output' column only the first 2 'odds' values are kept; the remaining values for 'race_number' 1 are 0. Then we move on to 'race_number' 2 and the cycle repeats.
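To make the rule concrete, a small made-up example (the values are hypothetical, for illustration only):

race_number  number_of_horses_bet_on  odds  desired_output
1            2                        2.5   2.5
1            2                        3.0   3.0
1            2                        8.0   0.0
2            1                        1.8   1.8
2            1                        4.2   0.0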
Code I have tried includes:
import pandas as pd
df=pd.read_csv('test.csv')
desired_output=[]
count=0
for i in df.number_of_horses_bet_on:
    for j in df.odds:
        if count < i:
            desired_output.append(j)
            count += 1
        else:
            desired_output.append(0)
print(desired_output)
and also
df['desired_output']=df.odds.apply(lambda x: x if count<number_of_horses_bet_on else 0)
Neither of these gives the desired 'desired_output' column.
I realise the 'count' in the lambda above is misplaced, but hopefully you can see what I am after.
Thanks.
I would do it a bit differently; here is the approach:
get a list of all race_number values
for each race_number, extract the number_of_horses_bet_on
create a list of 1s and 0s, with number_of_horses_bet_on 1s and the rest zeros
multiply this list with the odds column
import pandas as pd
df=pd.read_csv('test.csv')
mask = []
races = df['race_number'].unique().tolist() # unique list of all races
for race in races:
    # filter the dataframe by the race number
    df_race = df[df['race_number'] == race]
    # assuming the number of horses is the same for every row of a race, we take it from the first row
    number_of_horses = df_race['number_of_horses_bet_on'].iloc[0]
    # this mask will contain a list of 1s and 0s, for example for race 1 it'll be [1,1,0,0,0]
    mask = mask + [1] * number_of_horses + [0] * (len(df_race) - number_of_horses)
df['mask'] = mask
df['desired_output'] = df['mask'] * df['odds']
del df['mask']
print(df)
This assumes that for each race, number_of_horses_bet_on is less than or equal to the number of rows for that race; otherwise you might need min/max to clip it and get proper results.
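A possible vectorized alternative (a sketch, separate from the answer above; it assumes the rows of each race are already in the intended betting order):

# rank rows within each race, then keep the odds only for the first number_of_horses_bet_on rows
within_race_rank = df.groupby('race_number').cumcount()
df['desired_output'] = df['odds'].where(within_race_rank < df['number_of_horses_bet_on'], 0)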
TL;DR - I want to mimic the behaviour of functions such as DataFrameGroupBy.std()
I have a DataFrame which I group.
I want to take one row to represent each group, and then add extra statistics regarding these groups to the resulting DataFrame (such as the mean and std of these groups)
Here's an example of what I mean:
import numpy
import pandas

df = pandas.DataFrame({"Amount": [numpy.nan,0,numpy.nan,0,0,100,200,50,0,numpy.nan,numpy.nan,100,200,100,0],
                       "Id": [0,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
                       "Date": pandas.to_datetime(["2011-11-02","NA","2011-11-03","2011-11-04",
                                                   "2011-11-05","NA","2011-11-04","2011-11-04",
                                                   "2011-11-06","2011-11-06","2011-11-06","2011-11-06",
                                                   "2011-11-08","2011-11-08","2011-11-08"], errors='coerce')})
g = df.groupby("Id")
f = g.first()
f["std"] = g.Amount.std()
Now, this works - but let's say I want a special std which ignores 0 and counts each unique value only once:
def get_unique_std(group):
    vals = group.unique()
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0
If I use
f["std"] = g.Amount.transform(get_unique_std)
I only get zeros... (Also for any other function such as max etc.)
But if I do it like this:
std = g.Amount.transform(get_unique_std)
I get the correct result, only it is not grouped anymore... I guess I could calculate all of these as columns of the original DataFrame (in this case df) before taking the representative row of each group:
df["std"] = g.Amount.transform(get_unique_std)
# regroup again the modified df
g = df.groupby("Id")
f = g.first()
But that would just be a waste of memory, since many rows belonging to the same group would get the same value, and I'd also have to group df twice - once to calculate these statistics and a second time to get the representative row...
So, as stated in the beginning, I wonder how I can mimic the behaviour of DataFrameGroupBy.std().
I think you may be looking for DataFrameGroupBy.agg()
You can pass your custom function like this and get a grouped result:
g.Amount.agg(get_unique_std)
You can also use named aggregation to get each result as its own column:
g.Amount.agg(my_std=get_unique_std, numpy_std=numpy.std)
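To mirror the pattern from the question (one representative row per group plus the custom statistic), a minimal sketch, assuming the same df and g as above:

f = g.first()
f["std"] = g.Amount.agg(get_unique_std)   # indexed by Id, so it aligns with f (unlike transform)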