Aggregating results based on three variables

I have a dataframe as shown below
import pandas as pd
data = {
'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
'date': ['2021-03-15', '2021-03-15', '2021-03-17', '2021-03-17', '2021-03-12', '2021-03-12', '2021-12-14', '2021-04-07', '2021-07-09', '2021-04-25', '2021-04-25'],
'n': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2],
'type': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A'],
't': [1.41, 1.05, 2.01, 0.79, 1.37, 2.19, 1.28, 1.9, 0.97, 1.48, 1.96],
'leq': [73.95284344, 75.08732477, 42.52073186, 14.16069694, 59.36296547, 48.7827182, 44.48691532, 63.63032644, 95.20787662, 61.38061937, 12.50041565]
}
df = pd.DataFrame(data)
I would like to aggregate the values by three variables (id, date and type) using the formula below.
In other words, each group is defined by the combination of all three variables.
Thanks in advance!

Seems like a direct application of groupby and your mathematical formula:
import numpy as np

# np.log is the natural logarithm; the output below was produced with it
(df.groupby(['id', 'date', 'type'])
   .apply(lambda s: 10 * np.log(1 / s['t'].sum() * np.sum(s['t'] * 10 ** (s['leq'] / 10)))))
id date type
1 2021-03-15 A 171.482002
2021-03-17 B 94.598488
2 2021-03-12 B 128.447851
2021-12-14 B 102.434908
3 2021-04-07 B 146.514241
2021-04-25 A 28.783271
B 141.334099
2021-07-09 A 219.224237
dtype: float64
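
For reference, here is the same aggregation as a self-contained sketch. The helper name `weighted_log_level` and its `base10` switch are illustrative and do not come from the question. The formula itself is not reproduced in this thread, so the log base is an assumption: `np.log` (natural log) reproduces the numbers above, whereas a decibel-style level would normally call for `np.log10`.

```python
import numpy as np
import pandas as pd

data = {
    'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'date': ['2021-03-15', '2021-03-15', '2021-03-17', '2021-03-17',
             '2021-03-12', '2021-03-12', '2021-12-14', '2021-04-07',
             '2021-07-09', '2021-04-25', '2021-04-25'],
    'n': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2],
    'type': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'A'],
    't': [1.41, 1.05, 2.01, 0.79, 1.37, 2.19, 1.28, 1.9, 0.97, 1.48, 1.96],
    'leq': [73.95284344, 75.08732477, 42.52073186, 14.16069694,
            59.36296547, 48.7827182, 44.48691532, 63.63032644,
            95.20787662, 61.38061937, 12.50041565],
}
df = pd.DataFrame(data)


def weighted_log_level(group, base10=False):
    """Hypothetical helper: 10 * log(sum(t * 10**(leq/10)) / sum(t)) for one group."""
    log = np.log10 if base10 else np.log  # the base is an assumption, see above
    weighted = (group['t'] * 10 ** (group['leq'] / 10)).sum()
    return 10 * log(weighted / group['t'].sum())


keys = ['id', 'date', 'type']
# Select only the columns the formula needs before applying the helper.
natural = df.groupby(keys)[['t', 'leq']].apply(weighted_log_level)
decibel = df.groupby(keys)[['t', 'leq']].apply(weighted_log_level, base10=True)
print(natural)
print(decibel)
```

The two results differ only by a constant factor of ln(10), so whichever base the formula intends, the grouping logic stays the same.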

Related

Comparing 2 dataframes by ID

I am very new to Python. I want to compare two dataframes. They both have the same columns, first column is the key variable (ID). My goal is to print the differences.
For example:
import pandas as pd
import numpy as np
dframe1 = {'ID': [1, 2, 3, 4, 5], 'Apple': ['C', 'B', 'C', 'A', 'E'], 'Pear': [2, 3, 5, 6, 7]}
dframe2 = {'ID': [4, 2, 1, 3], 'Apple': ['A', 'C', 'C', 'C'], 'Pear': [6, 'NA', 'NA', 5]}
df1 = pd.DataFrame(dframe1)
df2 = pd.DataFrame(dframe2)
import datacompy
compare=datacompy.Compare(
df1,
df2,
df1_name='Reference',
df2_name='Test',
on_index=True
)
print(compare.report())
This produces a comparison report but I want my output to be like the following. Columns of my desired output:
out1 = {'var.x': ['Apple', 'Pear', 'Pear'], 'var.Y': ['Apple', 'Pear', 'Pear'], 'ID': [2, 1, 2],'values.x': ['B', '2', '3'], 'values.Y': ['C','NA','NA'],'row.x': [2, 1, 4], 'row.y': [2, 3, 1]}
outp = pd.DataFrame(out1)
print(outp)
Thanks a lot for your support.
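
One possible approach with plain pandas, offered as a sketch rather than a definitive answer: reshape both frames to long form, merge them on ID and column name, and keep the rows whose values differ. The helper `to_long` is hypothetical, and the sketch assumes that row numbers are 1-based positions in each frame, that values are compared as strings (so `2` and `'NA'` can be compared), and that only IDs present in both frames are of interest.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'Apple': ['C', 'B', 'C', 'A', 'E'],
                    'Pear': [2, 3, 5, 6, 7]})
df2 = pd.DataFrame({'ID': [4, 2, 1, 3],
                    'Apple': ['A', 'C', 'C', 'C'],
                    'Pear': [6, 'NA', 'NA', 5]})


def to_long(frame):
    """Hypothetical helper: one row per (ID, column), keeping the 1-based source row."""
    numbered = frame.assign(row=range(1, len(frame) + 1))
    long = numbered.melt(id_vars=['ID', 'row'], var_name='var', value_name='value')
    # Compare as strings so mixed int/str columns do not raise or mismatch on type.
    long['value'] = long['value'].astype(str)
    return long


# Inner merge: only (ID, column) pairs present in both frames are compared.
merged = to_long(df1).merge(to_long(df2), on=['ID', 'var'], suffixes=('.x', '.y'))
diff = merged.loc[merged['value.x'] != merged['value.y'],
                  ['var', 'ID', 'value.x', 'value.y', 'row.x', 'row.y']]
print(diff.sort_values(['var', 'ID']).reset_index(drop=True))
```

IDs that exist in only one frame (ID 5 here) are dropped by the inner merge; passing `how='outer'` to `merge` would keep them for separate reporting.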

Subsetting a dataframe

I have a DataFrame with three columns: Date, ID and Page.
The Page values are listed in order of visit time. I want the customers who visit page A after page B on the same date.
For example, ID 2 visits page A after page B on 02-Nov.

Try:
A_after_B = lambda x: x.eq('B').idxmax() < x.eq('A').idxmax()
m1 = df['Page'].isin(['A', 'B'])
m2 = df.groupby(['ID', 'Date'])['Page'].transform(A_after_B)
out = df.loc[m1 & m2]
print(out)
# Output:
ID Date Page
5 2 02-Nov B
7 2 02-Nov A
Setup:
data = {'ID': [1, 1, 1, 2, 2, 2, 2, 2],
'Date': ['01-Nov', '01-Nov', '01-Nov', '01-Nov',
'01-Nov', '02-Nov', '02-Nov', '02-Nov'],
'Page': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'A']}
df = pd.DataFrame(data)
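
One caveat with the `idxmax` comparison: on an all-False mask `idxmax` returns the first label, so a group with no B at all can be flagged by mistake. Below is a sketch of a stricter check. The function `a_after_b` and the extra customer 3 are made up for illustration, and the sketch assumes "A after B" means that some A visit comes later than the first B visit.

```python
import pandas as pd

# The setup from the answer, plus a hypothetical customer 3 who never visits B.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2, 2, 2, 3, 3],
    'Date': ['01-Nov', '01-Nov', '01-Nov', '01-Nov', '01-Nov',
             '02-Nov', '02-Nov', '02-Nov', '01-Nov', '01-Nov'],
    'Page': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'A', 'C', 'A'],
})


def a_after_b(pages):
    """True if the group has both pages and some A comes after the first B."""
    is_a = pages.eq('A').to_numpy()
    is_b = pages.eq('B').to_numpy()
    if not (is_a.any() and is_b.any()):
        return False  # a missing page can never satisfy "A after B"
    first_b = is_b.argmax()                        # position of the first B
    last_a = len(is_a) - 1 - is_a[::-1].argmax()   # position of the last A
    return bool(first_b < last_a)


m1 = df['Page'].isin(['A', 'B'])
# The scalar result is broadcast to every row of its (ID, Date) group.
m2 = df.groupby(['ID', 'Date'])['Page'].transform(a_after_b).astype(bool)
out = df.loc[m1 & m2]
print(out)
```

With this data only the two rows for ID 2 on 02-Nov are kept; customer 3, who visits C and then A but never B, is excluded.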

Shifting values for 1...N previous months as separate columns

I have the following data:
import pandas as pd
import numpy as np
data = pd.DataFrame({
'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-12-01', '2019-01-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-09-01'],
'value': [10, 3, 15, 16, -20, 2, 1, 3, 3, 0]
})
and at the end I would like to have:
expected = pd.DataFrame({
'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01', '2018-12-01', '2019-01-01', '2018-06-01', '2018-07-01', '2018-08-01', '2018-09-01'],
'value': [10, 3, 15, 16, -20, 2, 1, 3, 3, 0],
'prev_month_value': [np.nan, 10, 3, 15, 16, -20, np.nan, 1, 3, 3],
'prev_prev_month_value': [np.nan, np.nan, 10, 3, 15, 16, np.nan, np.nan, 1, 3]
})
How to do that in pandas?

You can call GroupBy.shift inside a dict comprehension and concat the results after:
N = 2
g = data.groupby('proj')
u = pd.DataFrame({
('prev_'*i) + 'month_value': g['value'].shift(i) for i in range(1, N + 1)})
pd.concat([data, u], axis=1)
proj date value prev_month_value prev_prev_month_value
0 A 2018-08-01 10 NaN NaN
1 A 2018-09-01 3 10.0 NaN
2 A 2018-10-01 15 3.0 10.0
3 A 2018-11-01 16 15.0 3.0
4 A 2018-12-01 -20 16.0 15.0
5 A 2019-01-01 2 -20.0 16.0
6 B 2018-06-01 1 NaN NaN
7 B 2018-07-01 3 1.0 NaN
8 B 2018-08-01 3 3.0 1.0
9 B 2018-09-01 0 3.0 3.0
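
Note that `shift(i)` moves values by rows, not by calendar months, so it matches the expected output only when every project has one row per consecutive month. If months can be missing, a date-based lookup may be safer. The following is a sketch of that alternative, not part of the answer above: it assumes the dates parse with `pd.to_datetime` and that each (proj, date) pair appears at most once.

```python
import pandas as pd

data = pd.DataFrame({
    'proj': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'date': ['2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01',
             '2018-12-01', '2019-01-01', '2018-06-01', '2018-07-01',
             '2018-08-01', '2018-09-01'],
    'value': [10, 3, 15, 16, -20, 2, 1, 3, 3, 0],
})
data['date'] = pd.to_datetime(data['date'])

N = 2
out = data.copy()
for i in range(1, N + 1):
    lagged = data[['proj', 'date', 'value']].copy()
    # A value observed in month m is the "i months ago" value for month m + i.
    lagged['date'] = lagged['date'] + pd.DateOffset(months=i)
    lagged = lagged.rename(columns={'value': 'prev_' * i + 'month_value'})
    # The left merge keeps every original row; months without a match get NaN.
    out = out.merge(lagged, on=['proj', 'date'], how='left')

print(out)
```

On this gap-free data the result has the same values as the `shift` version; the two only diverge when a month is skipped within a project.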

What is the 'name' in pandas.DataFrame.columns?

When I execute a pivot on a pandas dataframe,
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df.pivot(index='foo', columns='bar', values='baz')
>>> bar A B C
foo
one 1 2 3
two 4 5 6
Which has these columns,
df.pivot(index='foo', columns='bar', values='baz').columns
>>> Index(['A', 'B', 'C'], dtype='object', name='bar')
My question is, what does the name='bar' part mean?

From the docs
name : object
Name to be stored in the index
In your example, it's the name of the pandas.Index that holds the column labels. pivot sets it to 'bar' because the values of the bar column became the new column labels.
The name attribute becomes useful in some cases. For instance, if you have a MultiIndex, you can refer to a level of the index by its name:
>>> df
idx1 1 2 3 # <- column header 1
idx2 a b c # <- column header 2
vals 5 4 6
>>> df.columns
MultiIndex(levels=[[1, 2, 3], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]],
names=['idx1', 'idx2'])
>>> df.columns.get_level_values('idx1')
Int64Index([1, 2, 3], dtype='int64', name='idx1')
>>> df.columns.get_level_values('idx2')
Index(['a', 'b', 'c'], dtype='object', name='idx2')
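
To make this concrete for the pivot in the question, here is a short sketch showing how to read, change, and drop that name. The new label `'letter'` and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
wide = df.pivot(index='foo', columns='bar', values='baz')

print(wide.columns.name)  # bar
print(wide.index.name)    # foo

# Give the column axis a different name without touching the labels.
relabelled = wide.rename_axis('letter', axis='columns')

# Drop the name and turn the index back into a regular column.
flat = wide.rename_axis(None, axis='columns').reset_index()
print(list(flat.columns))  # ['foo', 'A', 'B', 'C']
```

Dropping the name is mostly cosmetic: it removes the extra `bar` header that otherwise shows up when the pivoted frame is printed.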

How can I change the original DataFrame from a group?

Let's suppose I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
'arbitrarydata': [False] * 10})
I want to assign a value to the arbitrarydata column according to the values in both of the other columns. A naive approach would be as follows:
import numpy as np
for _, grp in df.groupby(['label', 'numbers']):
    grp.arbitrarydata = np.random.rand()
Naturally, this doesn't propagate changes back to df. Is there a way to modify a group such that changes are reflected in the original DataFrame ?

Try using transform, e.g.:
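import numpy as np
# Group by a list of keys (a tuple is treated as a single key in current pandas)
# and transform one column so the result is a Series aligned with df.
df['arbitrarydata'] = df.groupby(['label', 'numbers'])['arbitrarydata'].transform(lambda x: np.random.rand())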
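
If you'd rather avoid calling a Python function once per group, another option is to draw one value per group up front and index into that array with each row's group number. This is a sketch of that alternative rather than the transform approach itself; the seed and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
                   'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'arbitrarydata': [False] * 10})

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
grouped = df.groupby(['label', 'numbers'])

# One random draw per group, then pick each row's draw by its group number.
draws = rng.random(grouped.ngroups)
df['arbitrarydata'] = draws[grouped.ngroup().to_numpy()]
print(df)
```

Every row in the same (label, numbers) group ends up with the same value, and the assignment writes directly into the original `df`.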
