I'm trying to go from df to df2.
I'm grouping by review_meta_id and age_bin, then calculating ctr as sum(click_count) / sum(impression_count).
In [69]: df
Out[69]:
   review_meta_id  age_month  impression_count  click_count age_bin
0               3          4                10            3       1
1               3         10                 5            2       2
2               3         20                 5            3       3
3               3          8                 9            2       2
4               4          9                 9            5       2
In [70]: df2
Out[70]:
   review_meta_id       ctr age_bin
0               3  0.300000       1
1               3  0.285714       2
2               3  0.600000       3
3               4  0.555556       2
import pandas as pd
bins = [0, 5, 15, 30]
labels = [1,2,3]
l = [dict(review_meta_id=3, age_month=4, impression_count=10, click_count=3),
     dict(review_meta_id=3, age_month=10, impression_count=5, click_count=2),
     dict(review_meta_id=3, age_month=20, impression_count=5, click_count=3),
     dict(review_meta_id=3, age_month=8, impression_count=9, click_count=2),
     dict(review_meta_id=4, age_month=9, impression_count=9, click_count=5)]
df = pd.DataFrame(l)
df['age_bin'] = pd.cut(df['age_month'], bins=bins, labels=labels)
grouped = df.groupby(['review_meta_id', 'age_bin'])
Is there an elegant way of doing the following?
data = []
for name, group in grouped:
    ctr = group['click_count'].sum() / group['impression_count'].sum()
    review_meta_id, age_bin = name
    data.append(dict(review_meta_id=review_meta_id, ctr=ctr, age_bin=age_bin))
df2 = pd.DataFrame(data)
You can first aggregate both columns by sum, then divide them using DataFrame.pop (which returns a column and removes it from the frame in one step), and last convert the MultiIndex back to columns with reset_index and remove rows with missing values with DataFrame.dropna:
df2 = df.groupby(['review_meta_id', 'age_bin'])[['click_count','impression_count']].sum()
df2['ctr'] = df2.pop('click_count') / df2.pop('impression_count')
df2 = df2.reset_index().dropna()
print (df2)
   review_meta_id age_bin       ctr
0               3       1  0.300000
1               3       2  0.285714
2               3       3  0.600000
4               4       2  0.555556
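A variant of the same idea that avoids the final dropna: because age_bin comes from pd.cut it is categorical, and by default groupby produces every category combination, including empty ones. Passing observed=True keeps only combinations that actually occur. A minimal sketch using named aggregation (the helper column names clicks and imps are just illustrative):
df2 = (df.groupby(['review_meta_id', 'age_bin'], observed=True)
         .agg(clicks=('click_count', 'sum'),
              imps=('impression_count', 'sum'))
         .assign(ctr=lambda x: x['clicks'] / x['imps'])
         .drop(columns=['clicks', 'imps'])
         .reset_index())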
You can use an apply function after grouping the DataFrame by 'review_meta_id' and 'age_bin' in order to calculate 'ctr'. The result is a pandas Series; to convert it to a DataFrame we use reset_index() and provide name='ctr', the name of the column corresponding to the Series values.
def divide_two_cols(df_sub):
    return df_sub['click_count'].sum() / float(df_sub['impression_count'].sum())

df2 = df.groupby(['review_meta_id', 'age_bin']).apply(divide_two_cols).reset_index(name='ctr')
df2
Related
Given a df
   a  b  ngroup
0  1  3       0
1  1  4       0
2  1  1       0
3  3  7       2
4  4  4       2
5  1  1       4
6  2  2       4
7  1  1       4
8  6  6       5
I would like to compute the summation of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
   a   b  nrow_same_group  ngroup
   3   8                3       0
   7  11                2       2
   4   4                3       4
   6   6                1       5
The following code should do the work
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
                           [3,4,1,7,4,1,2,1,6,1],
                           [0,0,0,2,2,4,4,4,5])),
                  columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether there is built-in pandas functionality that achieves something similar in a single line.
You can do it using only groupby + agg.
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
                           [3,4,1,7,4,1,2,1,6,1],
                           [0,0,0,2,2,4,4,4,5])),
                  columns=['a','b','ngroup'])
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=('a','sum'), b=('b', 'sum'),
           nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
   ngroup  a   b  nrow_same_group
0       0  3   8                3
1       2  7  11                2
2       4  4   4                3
3       5  6   6                1
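The same named aggregation can also be written with pd.NamedAgg objects, which is equivalent to the tuple form above (a small sketch on the same df):
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=pd.NamedAgg(column='a', aggfunc='sum'),
           b=pd.NamedAgg(column='b', aggfunc='sum'),
           nrow_same_group=pd.NamedAgg(column='a', aggfunc='size'))
)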
First aggregate a and b with sum, then calculate the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
        a   b  nrow_same_group
ngroup
0       3   8                3
2       7  11                2
4       4   4                3
5       6   6                1
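If you prefer ngroup as a regular column rather than the index, add reset_index() at the end (a small variant of the same idea; out is just an illustrative name):
out = g.sum().assign(nrow_same_group=g.size()).reset_index()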
I have imported the following Excel file but would like to sort it based on Frequency descending, but then with 'Other','No data' and 'All' (the total) at the bottom in that order. Is this possible?
table1 = pd.read_excel("table1.xlsx")
table1
Use:
df = pd.DataFrame({
'generalenq':list('abcdef'),
'percentage':[1,3,5,7,1,0],
'frequency':[5,3,6,9,2,4],
})
df.loc[0, 'generalenq'] = 'All'
df.loc[2, 'generalenq'] = 'No data'
df.loc[3, 'generalenq'] = 'Other'
print (df)
  generalenq  percentage  frequency
0        All           1          5
1          b           3          3
2    No data           5          6
3      Other           7          9
4          e           1          2
5          f           0          4
First create a dictionary for ordering by integers. Then create a mask by membership with Series.isin, and sort the non-matched rows, selected with ~ (inverted mask) via boolean indexing:
d = {'Other':0,'No data':1,'All':2}
mask = df['generalenq'].isin(list(d.keys()))
df1 = df[~mask].sort_values('frequency', ascending=False)
print (df1)
  generalenq  percentage  frequency
5          f           0          4
1          b           3          3
4          e           1          2
Then filter the matched rows with the mask and create a helper column for sorting by the mapped dict:
df2 = df[mask].assign(new = lambda x: x['generalenq'].map(d)).sort_values('new').drop('new', axis=1)
print (df2)
  generalenq  percentage  frequency
3      Other           7          9
2    No data           5          6
0        All           1          5
And last, join them together with concat:
df = pd.concat([df1, df2], ignore_index=True)
print (df)
  generalenq  percentage  frequency
0          f           0          4
1          b           3          3
2          e           1          2
3      Other           7          9
4    No data           5          6
5        All           1          5
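If you prefer a single sort instead of splitting and concatenating, one option is to build two helper sort keys: a flag that pushes the special labels to the bottom, and a rank that orders them among themselves, while everything else is sorted by frequency descending. A sketch (the helper column names _special and _order are temporary, illustrative names):
d = {'Other': 0, 'No data': 1, 'All': 2}
out = (df.assign(_special=df['generalenq'].isin(list(d)).astype(int),
                 _order=df['generalenq'].map(d).fillna(-1))
         .sort_values(['_special', '_order', 'frequency'],
                      ascending=[True, True, False])
         .drop(columns=['_special', '_order'])
         .reset_index(drop=True))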
I have a table
I want to sum the values of the columns belonging to the same class h.*. So my final table will look like this:
Is it possible to aggregate by string column name?
Thank you for any suggestions!
Use a lambda function to select the first 3 characters of each column name, with parameter axis=1, or index the column names in a similar way, and aggregate with sum:
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
Or:
df1 = df.set_index('object')
df2 = df1.groupby(df1.columns.str[:3], axis=1).sum().reset_index()
Sample:
import numpy as np
import pandas as pd

np.random.seed(123)
cols = ['object', 'h.1.1','h.1.2','h.1.3','h.1.4','h.1.5',
        'h.2.1','h.2.2','h.2.3','h.2.4','h.3.1','h.3.2','h.3.3']
df = pd.DataFrame(np.random.randint(10, size=(4, 13)), columns=cols)
print (df)
   object  h.1.1  h.1.2  h.1.3  h.1.4  h.1.5  h.2.1  h.2.2  h.2.3  h.2.4  \
0       2      2      6      1      3      9      6      1      0      1
1       9      3      4      0      0      4      1      7      3      2
2       4      8      0      7      9      3      4      6      1      5
3       8      3      5      0      2      6      2      4      4      6

   h.3.1  h.3.2  h.3.3
0       9      0      0
1       4      7      2
2       6      2      1
3       3      0      6
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
print (df2)
   object  h.1  h.2  h.3
0       2   21    8    9
1       9   11   13   13
2       4   27   16    9
3       8   16   16    9
The solution above works great, but is vulnerable in case the h.X goes beyond single digits. I'd recommend the following:
Sample Data:
cols = ['h.%d.%d' %(i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols, index=['p_%d'%p for p in range(4)])
Proposed Solution:
new_df = df.groupby(df.columns.str.split('.').str[1], axis=1).sum()
new_df.columns = 'h.' + new_df.columns # the columns are originally numbered 1, 2, 3; this brings them back to h.1, h.2, h.3
Alternative Solution:
Going through multiindices might be more convoluted, but may be useful while manipulating this data elsewhere.
df.columns = df.columns.str.split('.', expand=True) # Transform into a multiindex
new_df = df.sum(axis = 1, level=[0,1])
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1) # Rename columns
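Note that in recent pandas versions groupby(..., axis=1) and the level argument of sum are deprecated or removed. If that affects your version, one workaround (a sketch, assuming the columns were already split into a MultiIndex as above) is to transpose, group on the index levels, and transpose back:
new_df = df.T.groupby(level=[0, 1]).sum().T
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1)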
I would like to add elements to specific groups in a Pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
I would like to have the same number of elements per group (where grouping is by the key column). Group 2 has the most elements: three. Group 1, however, has only two elements, so a zero should be added as follows:
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
Note that the index does not matter.
You can create a new level of a MultiIndex with cumcount and then add the missing values either by unstack/stack or by reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
.unstack(fill_value=0)
.stack()
.reset_index(level=1, drop=True)
.reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df)
   key  value
0    1      1
1    1      3
2    1      0
3    2      2
4    2      4
5    2      5
If the order of values is important:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names = df1.index.names)
#get appended values
miss = mux.difference(df1.index).get_level_values(0)
#create helper df and add 0 to all columns of original df
df2 = pd.DataFrame({'key':miss}).reindex(columns=df.columns, fill_value=0)
#append to original df
df = pd.concat([df, df2], ignore_index=True)
print (df)
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
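Another way to get the same padding, as a compact sketch starting from the original df (it assumes the fill value is 0 and appends the padding rows after the original rows, like the output above):
sizes = df.groupby('key').size()
missing = sizes.max() - sizes  # how many rows each key is short of the largest group
pad = pd.DataFrame({'key': missing.index.repeat(missing.to_numpy()), 'value': 0})
df = pd.concat([df, pad], ignore_index=True)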
So I have a file 500 columns by 600 rows and want to take the average of all columns for rows 200-400:
df = pd.read_csv('file.csv', sep= '\s+')
sliced_df=df.iloc[200:400]
Then I create a new column with the average of each row across all columns, and extract only that newly created column:
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean']
But how can I prevent the indexes from resetting when I extract the new column?
I think it is not necessary to create a new column in sliced_df; only rename the Series, and if you need the output as a DataFrame, add to_frame. The indexes are not reset, see the sample below:
import numpy as np
import pandas as pd

#random dataframe
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
   A  B  C  D  E
0  8  8  3  7  7
1  0  4  2  5  2
2  2  2  1  0  8
3  4  0  9  6  2
4  4  1  5  3  4
#in real data use df.iloc[200:400]
sliced_df=df.iloc[2:4]
print (sliced_df)
   A  B  C  D  E
2  2  2  1  0  8
3  4  0  9  6  2
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
2 2.6
3 4.2
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
2 2.6
3 4.2
Python counts from 0, so you may need to shift the slice by one (e.g. 199:399 instead of 200:400); see the difference:
sliced_df=df.iloc[1:3]
print (sliced_df)
   A  B  C  D  E
1  0  4  2  5  2
2  2  2  1  0  8
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
1 2.6
2 2.6
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
1 2.6
2 2.6
Use the copy() function as follows:
df = pd.read_csv('file.csv', sep=r'\s+')
# copy() creates an independent DataFrame, so assigning a new column below
# does not raise SettingWithCopyWarning on a slice of the original
sliced_df = df.iloc[200:400].copy()
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean'].copy()