Pandas: sum and cumsum in the same dataframe - python

I use the code below to create a sum and a cumsum, but they end up in two separate dataframes. I want both in one.
import numpy as np
import pandas as pd
asp = np.array([0, 0, 1])
asq = np.array([10, 10, 20])
columns = ['asp']
df = pd.DataFrame(asp, index=None, columns=columns)
df['asq'] = asq
df = df.groupby(by=['asp']).sum()
dfcum = df.cumsum()
How do I get both the sum and the cumsum in the same dataframe? It is not clear to me how to do this. Below is what I want:
     asqsum  cumsum
asp
0        20      20
1        20      40

Maybe you want this?
In [20]: df['asq_cum']=df['asq'].cumsum()
In [21]: df
Out[21]:
     asq  asq_cum
asp
0     20       20
1     20       40
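For reference, here is a minimal self-contained sketch of the same idea, starting from the question's raw data. The column names asqsum and cumsum are taken from the desired output in the question; building the frame from a dict is just a convenience, not the asker's original construction.

```python
import pandas as pd

# Same data as in the question, built directly from a dict.
df = pd.DataFrame({"asp": [0, 0, 1], "asq": [10, 10, 20]})

# Sum asq within each asp group, then add a running total of those sums.
out = df.groupby("asp")["asq"].sum().to_frame("asqsum")
out["cumsum"] = out["asqsum"].cumsum()

print(out)
```

Both columns live in one dataframe indexed by asp, so no merge of two separate results is needed.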

Related

Random sampling with replacement, increasing groupsize, sum and append in dataframe

I have a dataframe which I'd like to sample repeatedly, with replacement. Every time I sample the df, I would like to increase the size of the sample (n) by one, up to N.
For example:
id  value_1  value_2
a         5       10
b        10       30
c         6        8
d         9       12
Would result in something like
ids    sum_of_value_1  sum_of_value_2
b      10              30
a,c    11 (5+6)        18 (10+8)
b,a,d  24 (10+5+9)     52 (30+10+12)
I can do this with a for loop, but I can't figure out how to add the summation and the append into the query:
for n in range(200):
    print(df_groups.sample(n))
You can use pandas.DataFrame.aggregate to sum all columns, then use pandas.concat to append each resulting single-row dataframe to a dataframe that acts as an accumulator of samples.
Maybe something like this:
acc = df_groups.sample(1).aggregate('sum').to_frame().T
for n in range(2, df_groups.shape[0]):
    # aggregate returns a Series, so turn it into a one-row dataframe first
    row = df_groups.sample(n).aggregate('sum').to_frame().T
    acc = pd.concat([acc, row])
You can use sample and concat, then groupby.agg:
out = (pd.concat({n: df.sample(n)
                  for n in range(1, len(df))})
         .groupby(level=0)
         .agg({'id': ','.join,
               'value_1': 'sum',
               'value_2': 'sum'
               })
       )
print(out)
Output:
      id  value_1  value_2
1      a        5       10
2    b,c       16       38
3  a,c,d       20       30
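The question asks for sampling with replacement, which neither snippet above turns on. Here is a hedged sketch of the same concat + groupby approach with replace=True. The upper bound N and the fixed random_state values are assumptions added only for illustration and reproducibility.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "value_1": [5, 10, 6, 9],
    "value_2": [10, 30, 8, 12],
})

N = 3  # hypothetical largest sample size

# One sample per size n = 1..N, drawn with replacement.
# random_state is fixed here only so the run is repeatable.
samples = {n: df.sample(n, replace=True, random_state=n) for n in range(1, N + 1)}

out = (
    pd.concat(samples)        # dict keys become the outer index level
      .groupby(level=0)       # group the rows that belong to one sample
      .agg({"id": ",".join, "value_1": "sum", "value_2": "sum"})
)
print(out)
```

With replace=True, N may exceed len(df), and the same id can appear more than once in a row of the result.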

Create multi-index and transpose data with pandas with columns as additional index

I've tried multiple ways to read in this Excel file and reshape it with pandas. I've tried different functions such as merge(), pivot(), melt(), and reset_index(), and I still can't figure it out. Can anyone point me in the right direction?
This is the current table:
[image: current table]
This is the desired output:
[image: desired output]
Sorry for the formatting. I'm new to Stack Overflow, but I have done research and can't seem to figure out the answer.
I deleted a lot of the code I tried because it wasn't working, but here are a few examples of what I tried to do.
import pandas as pd
df = pd.read_excel(file)
df.iloc[0:,0].fillna(method= 'ffill', inplace = True)
new_cols = df.columns[2:]
df = df.rename(columns = {"Unnamed: 1":"to col"})
# end_file_cols was a list with the columns in the 'Desired' image
df = df.reindex(columns = end_file_cols)
df['Demo'] = df.index.tolist()
df.pivot(index = 'Media', columns = new_cols.tolist())
This is what happens when printing df:
import pandas as pd
df = pd.read_excel(file)
df.iloc[0:,0].fillna(method= 'ffill', inplace = True)
new_cols = df.columns[2:]
df = df.rename(columns = {"Unnamed: 1":"to col"})
print(df)
    Media      to col  Age Group 1  Age Group 2  Age Group 3  Age Group 4
0  Plan 1  Total Cost           65            4           90           88
1  Plan 1    Net Loss           88           77           85           85
2  Plan 1       Views           60           97           76           82
3  Plan 2  Total Cost           96           92            5            0
4  Plan 2    Net Loss           89           77           51           59
5  Plan 2      Budget           42           67           49           96
6  Plan 3  Total Cost           22           78          100           10
7  Plan 3    Net Prof           59           33           72           87
You can use stack and unstack to move MultiIndex levels between columns and rows. Simply do:
df = pd.read_excel('data.xlsx', index_col=[0,1])
new_df = df.unstack().stack(level=0)
To rename the index levels, do:
new_df.index.rename(('Media','Demo'), inplace=True)
Combinations that have no data will be np.nan, which you can optionally replace with any value you want using new_df.fillna(<value>).
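As a rough illustration of the unstack/stack round trip, here is a sketch that builds a small frame in memory instead of reading the Excel file. The values are a subset of the printed table from the question; the in-memory construction is an assumption standing in for pd.read_excel(..., index_col=[0, 1]).

```python
import pandas as pd

# A cut-down version of the printed table, indexed by (Media, to col).
df = pd.DataFrame(
    {
        "Media": ["Plan 1", "Plan 1", "Plan 2", "Plan 2"],
        "to col": ["Total Cost", "Net Loss", "Total Cost", "Budget"],
        "Age Group 1": [65, 88, 96, 42],
        "Age Group 2": [4, 77, 92, 67],
    }
).set_index(["Media", "to col"])

# unstack moves "to col" into the columns; stack(level=0) then moves the
# age-group labels down into the row index.
new_df = df.unstack().stack(level=0)
new_df.index = new_df.index.set_names(["Media", "Demo"])
new_df.columns.name = None

print(new_df)
```

Each row is now one (Media, Demo) pair and each metric is its own column; metrics that a plan does not have, such as Budget for Plan 1, show up as NaN.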

Pandas: sum of every N columns

I have a dataframe
ID   2016-01  2016-02  ...  2017-01  2017-02  ...  2017-10  2017-11  2017-12
111       12       34             0       12              3        0        0
222        0       32             5        5              0        0        0
I need to sum every 12 columns and get
ID   2016  2017
111    46    15
222    32    10
I tried to use
(df.groupby((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
But it returns all columns.
But when I try to use
df.groupby['ID']((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
It returns
TypeError: 'method' object is not subscriptable
How can I fix that?
First, call set_index on every column that is not a date:
df = df.set_index('ID')
1. groupby on the first part of each column name after splitting on '-':
df = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
2. The same split done with a lambda function:
df = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
3. Convert the columns to datetimes and groupby on the year:
df.columns = pd.to_datetime(df.columns)
df = df.groupby(df.columns.year, axis=1).sum()
4. resample by year:
df.columns = pd.to_datetime(df.columns)
df = df.resample('A', axis=1).sum()
df.columns = df.columns.year
print (df)
     2016  2017
ID
111    46    15
222    32    10
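Newer pandas releases deprecate axis=1 in groupby, so here is a sketch of the first approach written with a transpose instead. The sample frame contains only the columns visible in the question (the elided months are left out), so treat the data as illustrative.

```python
import pandas as pd

# Only the columns shown in the question; the "..." months are omitted.
df = pd.DataFrame(
    {
        "ID": [111, 222],
        "2016-01": [12, 0],
        "2016-02": [34, 32],
        "2017-01": [0, 5],
        "2017-02": [12, 5],
        "2017-10": [3, 0],
    }
).set_index("ID")

# Year part of every column label, e.g. "2016-01" -> "2016".
years = df.columns.str.split("-").str[0]

# Transpose so the columns become rows, group those rows by year,
# sum, and transpose back.
yearly = df.T.groupby(years).sum().T

print(yearly)
```

The result has one column per year, labelled with the year strings "2016" and "2017".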
The above code has a slight syntax error and throws the following error:
ValueError: No axis named 1 for object type
Basically, the groupby condition needs to be wrapped in []. For convenience, here is the corrected code:
new_df = df.groupby([[i//n for i in range(0,m)]], axis = 1).sum()
where n is the number of columns you want to group together and m is the total number of columns being grouped. You have to rename the columns after that.
If you don't mind losing the labels, you can try this:
new_df = df.groupby([i//n for i in range(0,m)], axis = 1).sum()
Here n and m have the same meaning as above, and the columns again have to be renamed afterwards.
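A purely positional "every n columns" grouping can also be sketched without axis=1 by transposing first. The frame, its column labels, and n = 3 below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 6 frame; the labels a..f are placeholders.
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=list("abcdef"))

n = 3  # number of adjacent columns to add together

# Integer group id per column position: [0, 0, 0, 1, 1, 1].
groups = np.arange(df.shape[1]) // n

# Group the transposed rows by position block, sum, and transpose back.
summed = df.T.groupby(groups).sum().T

print(summed)
```

The resulting columns are labelled 0, 1, ... by block number, so they still need to be renamed if meaningful labels are wanted.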

Flatten / Remove hierarchical column headers

I have the following dataframe which is the result of doing a groupby + aggregate sum:
df.groupby(['id', 'category']).agg([pd.Series.sum])
              supply  stock
                 sum    sum
id category
4  abc           161 -0.094
6  sde           -76  0.150
23 hgv            64 -0.054
1  wcd           -14  0.073
76 jhf            -8 -0.057
Because of the groupby and agg, the column headings are now tuples. How do I change the column headings back into single values, i.e. supply and stock? I just need to get rid of sum in the headings.
If you call sum directly, the aggregation function name won't be added to the columns as an extra MultiIndex level:
df.groupby(['id', 'category']).sum()
To remove an extra level that is already there, you can drop it:
df.columns = df.columns.droplevel(1)
For example:
In [11]: df
Out[11]:
     supply     stock
        sum       sum
0  0.501176  0.482497
1  0.442689  0.008664
2  0.885112  0.512066
3  0.724619  0.418720
In [12]: df.columns.droplevel(1)
Out[12]: Index(['supply', 'stock'], dtype='object')
In [13]: df.columns = df.columns.droplevel(1)
In [14]: df
Out[14]:
     supply     stock
0  0.501176  0.482497
1  0.442689  0.008664
2  0.885112  0.512066
3  0.724619  0.418720
You can explicitly set the columns attribute to whatever you'd like it to be. For example:
>>> df = pd.DataFrame(np.random.random((4, 2)),
...                   columns=pd.MultiIndex.from_arrays([['supply', 'stock'],
...                                                      ['sum', 'sum']]))
>>> df
     supply     stock
        sum       sum
0  0.170950  0.314759
1  0.632121  0.147884
2  0.955682  0.127857
3  0.776764  0.318614
>>> df.columns = df.columns.get_level_values(0)
>>> df
     supply     stock
0  0.170950  0.314759
1  0.632121  0.147884
2  0.955682  0.127857
3  0.776764  0.318614
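To tie this back to the groupby in the question, here is a small sketch that starts from raw rows. The numbers are invented so that they aggregate to a couple of the values shown above, and the "joined" variant is an extra option not mentioned in the answers: it keeps the function name as a suffix instead of dropping it.

```python
import pandas as pd

# Invented raw rows; the groups sum to 161 and -76 for supply.
df = pd.DataFrame({
    "id": [4, 4, 6],
    "category": ["abc", "abc", "sde"],
    "supply": [100, 61, -76],
    "stock": [0.1, 0.2, 0.15],
})

# Passing a list to agg produces two-level column headers.
agg = df.groupby(["id", "category"]).agg(["sum"])

# Option 1: drop the inner level, leaving just supply and stock.
flat = agg.copy()
flat.columns = flat.columns.droplevel(1)

# Option 2: join both levels into a single string per column.
joined = agg.copy()
joined.columns = ["_".join(col) for col in agg.columns]

print(flat)
print(joined)
```

Option 2 is mostly useful when several aggregation functions are applied to the same column and dropping the level would create duplicate names.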

Calculate weights for grouped data in pandas

I would like to calculate portfolio weights with a pandas dataframe. Here is some dummy data for an example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'name' : ['ann','bob']*3}).sort_values('name').reset_index(drop=True)
df2 = pd.DataFrame({'stock' : list('ABC')*2})
df3 = pd.DataFrame({'val': np.random.randint(10,100,6)})
df = pd.concat([df1, df2, df3], axis=1)
Each person owns 3 stocks with a value val. We can calculate portfolio weights like this:
df.groupby('name').apply(lambda x: x.val/(x.val).sum())
which gives this:
If I want to add a column wgt to df I need to merge this result back to df on name and index. This seems rather clunky.
Is there a way to do this in one step? Or what is the way to do this that best utilizes pandas features?
Use transform; it returns a Series whose index is aligned with your original df:
In [114]:
df['wgt'] = df.groupby('name')['val'].transform(lambda x: x/x.sum())
df
Out[114]:
  name stock  val       wgt
0  ann     A   18  0.131387
1  ann     B   43  0.313869
2  ann     C   76  0.554745
3  bob     A   16  0.142857
4  bob     B   44  0.392857
5  bob     C   52  0.464286
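A slightly different way to write the same thing, shown here as a sketch, is to let transform compute only the per-group total and do the division as an ordinary column operation. The val numbers are copied from the output above rather than generated randomly, so the run is repeatable.

```python
import pandas as pd

# Values copied from the printed output above.
df = pd.DataFrame({
    "name": ["ann"] * 3 + ["bob"] * 3,
    "stock": list("ABC") * 2,
    "val": [18, 43, 76, 16, 44, 52],
})

# transform("sum") broadcasts each group's total back onto its rows,
# so the division lines up with the original index.
df["wgt"] = df["val"] / df.groupby("name")["val"].transform("sum")

print(df)
```

The weights within each name add up to 1, which is a quick sanity check on the result.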
