I have a dataframe df that I want to look like df2.
yesterday = datetime.datetime.today - datetime.timedelta(days=1)
cols = pd.MultiIndex.from_product(iterables=[['A','B'],['1','2']], names=['letter','number'])
idx = pd.DatetimeIndex(start=yesterday, periods=2, freq='H')
df = pd.DataFrame(columns=cols, index=idx, data=[[1,2,3,4],[5,6,7,8]])
letter A B
number 1 2 1 2
2019-09-26 07:45:49.659873 1 2 3 4
2019-09-26 08:45:49.659873 5 6 7 8
df2 = pd.DataFrame(columns=['letter', '1', '2'], index=idx.append(idx), data=[['A',1,2],['A',5,6],['B',3,4],['B',7,8]])
letter 1 2
2019-09-26 07:45:49.659873 A 1 2
2019-09-26 08:45:49.659873 A 5 6
2019-09-26 07:45:49.659873 B 3 4
2019-09-26 08:45:49.659873 B 7 8
So it's sort of like a melt, or reset_index but I haven't found the magic combination to do this on a much larger dataframe in practice. I want to change level 0 of the columns into an unpivoted column, but leave level 1 unchanged.
Related
Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the summation of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of element for each of the group.
Based on these two condition, the expected output as below
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code should do the work
import pandas as pd
df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
[3,4,1,7,4,1,2,1,6,1],
[0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether there exist built-in pandas that achieve something similar, in single line.
You can do it using only groupby + agg.
import pandas as pd
df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
[3,4,1,7,4,1,2,1,6,1],
[0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])
res = (
df.groupby('ngroup', as_index=False)
.agg(a=('a','sum'), b=('b', 'sum'),
nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
First aggregate a, b with sum then calculate size of each group and assign this to nrow_same_group column
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1
I'm trying to go from df to df2
I'm grouping by review_meta_id, age_bin then calculate a ctr from sum(click_count)/ sum(impression_count)
In [69]: df
Out[69]:
review_meta_id age_month impression_count click_count age_bin
0 3 4 10 3 1
1 3 10 5 2 2
2 3 20 5 3 3
3 3 8 9 2 2
4 4 9 9 5 2
In [70]: df2
Out[70]:
review_meta_id ctr age_bin
0 3 0.300000 1
1 3 0.285714 2
2 3 0.600000 3
3 4 0.555556 2
import pandas as pd
bins = [0, 5, 15, 30]
labels = [1,2,3]
l = [dict(review_meta_id=3, age_month=4, impression_count=10, click_count=3), dict(review_meta_id=3, age_month=10, impression_count=5, click_count=2), dict(review_meta_id=3, age_month=20, impression_count=5, cli\
ck_count=3), dict(review_meta_id=3, age_month=8, impression_count=9, click_count=2), dict(review_meta_id=4, age_month=9, impression_count=9, click_count=5)]
df = pd.DataFrame(l)
df['age_bin'] = pd.cut(df['age_month'], bins=bins, labels=labels)
grouped = df.groupby(['review_meta_id', 'age_bin'])
Is there an elegant way of doing the following?
data = []
for name, group in grouped:
ctr = group['click_count'].sum() / group['impression_count'].sum()
review_meta_id, age_bin = name
data.append(dict(review_meta_id=review_meta_id, ctr=ctr, age_bin=age_bin))
df2 = pd.DataFrame(data)
You can first aggregate goth columns by sum, then divide columns with DataFrame.pop for use and remove columns and last convert MultiIndex to columns with remove rows with missing values by DataFrame.dropna:
df2 = df.groupby(['review_meta_id', 'age_bin'])[['click_count','impression_count']].sum()
df2['ctr'] = df2.pop('click_count') / df2.pop('impression_count')
df2 = df2.reset_index().dropna()
print (df2)
review_meta_id age_bin ctr
0 3 1 0.300000
1 3 2 0.285714
2 3 3 0.600000
4 4 2 0.555556
you can use apply function after you grouping the dataframe by 'review_meta_id', 'age_bin' in order to calculate 'ctr', the result will be a pandas series in order to convert it to a dataframe we use reset_index() and provide name='ctr', The name of the column corresponding to the Series values.
def divide_two_cols(df_sub):
return df_sub['click_count'].sum() / float(df_sub['impression_count'].sum())
df2 = df.groupby(['review_meta_id', 'age_bin']).apply(divide_two_cols).reset_index(name='ctr')
new_df
Is there a way to set column names for arguments as column index position, rather than column names?
Every example that I see is written with column names on value_vars. I need to use the column index.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select columns names by indexing:
df = pd.DataFrame({
'asset1':list('acacac'),
'asset2':[4]*6,
'A':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]
})
df2 = pd.melt(df,
id_vars=df.columns[[0,1]],
value_vars=df.columns[[2,3]],
var_name= 'c_name',
value_name='Value')
print (df2)
asset1 asset2 c_name Value
0 a 4 A 7
1 c 4 A 8
2 a 4 A 9
3 c 4 A 4
4 a 4 A 2
5 c 4 A 3
6 a 4 D 1
7 c 4 D 3
8 a 4 D 5
9 c 4 D 7
10 a 4 D 1
11 c 4 D 0
This sounds a bit weird, but I think that's exactly what I needed now:
I got several pandas dataframes that contains columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column, with only one row, and the value is equal to the average of column 'a', in this case, is 3.0. So the new dataframe will looks like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that give me a whole column of 3.0. Any help will be appreciated.
Assign a series, this is cleaner.
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling NaNs is straightforward, you can do
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
Can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[(0,1,2), (3,4,5), (6,7,8)],
columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df['average'][0] = df['a'].mean()
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
I have a pandas data frame mydf that has two columns,and both columns are datetime datatypes: mydate and mytime. I want to add three more columns: hour, weekday, and weeknum.
def getH(t): #gives the hour
return t.hour
def getW(d): #gives the week number
return d.isocalendar()[1]
def getD(d): #gives the weekday
return d.weekday() # 0 for Monday, 6 for Sunday
mydf["hour"] = mydf.apply(lambda row:getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row:getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row:getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient as it loops through the data frame at least three times. I would just like to know if there's a faster and/or more optimal way to do this. For example, using zip or merge? If, for example, I just create one function that returns three elements, how should I implement this? To illustrate, the function would be:
def getHWd(d,t):
return t.hour, d.isocalendar()[1], d.weekday()
Here's on approach to do it using one apply
Say, df is like
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll take the lambda function out to separate line for readability and define it like
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
x['mydate'].isocalendar()[1],
x['mydate'].weekday()])
And, apply and store the result to df[['hour', 'weekday', 'weeknum']]
In [66]: df[['hour', 'weekday', 'weeknum']] = df.apply(lambdafunc, axis=1)
And, the output is like
In [67]: df
Out[67]:
mydate mytime hour weekday weeknum
0 2011-01-01 2011-11-14 0 52 5
1 2011-01-02 2011-11-15 0 52 6
2 2011-01-03 2011-11-16 0 1 0
3 2011-01-04 2011-11-17 0 1 1
4 2011-01-05 2011-11-18 0 1 2
5 2011-01-06 2011-11-19 0 1 3
6 2011-01-07 2011-11-20 0 1 4
7 2011-01-08 2011-11-21 0 1 5
8 2011-01-09 2011-11-22 0 1 6
9 2011-01-10 2011-11-23 0 2 0
10 2011-01-11 2011-11-24 0 2 1
11 2011-01-12 2011-11-25 0 2 2
To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
x['mydate'].isocalendar()[1],
x['mydate'].weekday()])
newcols = df.apply(lambdafunc, axis=1)
newcols.columns = ['hour', 'weekday', 'weeknum']
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
You can do this in a somewhat cleaner method by having the function you apply return a pd.Series with named elements:
def process(row):
return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))
my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
def getWd(d):
d.isocalendar()[1], d.weekday()
def getH(t):
return t.hour
mydf["hour"] = zip(*df["mytime"].map(getH))
mydf["weekday"], mydf["weeknum"] = zip(*df["mydate"].map(getWd))