I am trying to find the sum of each column within the groups created by groupby. So in this example I want to find the sums for each of bar, baz, foo, and qux. The sums would be added as new columns at the end. I can get the results I need, but I cannot join them back to the dataframe.
import numpy as np
import pandas as pd
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
np.random.seed(7)
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
results=df.groupby(level=[0]).sum(axis=1)
col_names=results.columns.values
hold=[]
for i in col_names:
    hold.append('sum_' + str(i))
results.columns=hold
df=pd.concat([df,results],axis=1)
The desired result is below. Thanks for looking.
0 1 2 3 sum_0 sum_1 sum_2 sum_3
bar one 1.69 (0.47) 0.03 0.41 0.90 (0.46) 0.03 (1.35)
bar two (0.79) 0.00 (0.00) (1.75) 0.90 (0.46) 0.03 (1.35)
baz one 1.02 0.60 (0.63) (0.17) 1.52 0.34 (0.87) (1.62)
baz two 0.51 (0.26) (0.24) (1.45) 1.52 0.34 (0.87) (1.62)
foo one 0.55 0.12 0.27 (1.53) 2.21 0.28 (0.11) 0.50
foo two 1.65 0.15 (0.39) 2.03 2.21 0.28 (0.11) 0.50
qux one (0.05) (1.45) (0.41) (2.29) 1.00 (1.87) (1.15) (1.22)
qux two 1.05 (0.42) (0.74) 1.07 1.00 (1.87) (1.15) (1.22)
Use transform instead; it lets you rid your code of that loop. (Your concat fails to align because results is indexed by the group labels only, while df has a two-level index.)
df = pd.concat([df, df.groupby(level=0).transform('sum').add_prefix('sum_')], axis=1)
df
0 1 2 3 sum_0 sum_1 sum_2 sum_3
bar one 1.69 -0.47 0.03 0.41 0.90 -0.46 0.03 -1.35
two -0.79 0.00 -0.00 -1.75 0.90 -0.46 0.03 -1.35
baz one 1.02 0.60 -0.63 -0.17 1.52 0.34 -0.87 -1.62
two 0.51 -0.26 -0.24 -1.45 1.52 0.34 -0.87 -1.62
foo one 0.55 0.12 0.27 -1.53 2.21 0.28 -0.11 0.50
two 1.65 0.15 -0.39 2.03 2.21 0.28 -0.11 0.50
qux one -0.05 -1.45 -0.41 -2.29 1.00 -1.87 -1.15 -1.22
two 1.05 -0.42 -0.74 1.07 1.00 -1.87 -1.15 -1.22
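transform works here because it returns a result with the same index as the original frame: each group's sum is broadcast back to every row in the group, so the new columns line up row-for-row. As a small sketch of an equivalent way to attach them, starting again from the original df (before the concat above), join works the same way:
sums = df.groupby(level=0).transform('sum').add_prefix('sum_')
df = df.join(sums)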
Suppose we have a df with a sum() value as in the DataFrame below (thanks so much for jezrael's answer here). Right now the sum value is on the first line and the avg value is on the second line, which is ugly. How can I put the sum value and the avg value on the same line, each under its own column, with the index name Total? Also, place it on the first line, as below:
# Total 27.56 25.04 -1.31
The code in pandas is as below:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df[['value_a','value_b']].sum().to_frame().T
df2 = df[['difference']].mean().to_frame().T
df = pd.concat([df1,df2, df], ignore_index=True)
df
value_a value_b name up_or_down difference
project_name
27.56 25.04
-1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Thanks so much for any advice.
Use DataFrame.agg with a dictionary of aggregation functions:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df.agg({'value_a':'sum', 'value_b':'sum', 'difference':'mean'}).to_frame('Total').T
df = pd.concat([df1,df])
print(df.head())
value_a value_b difference name up_or_down
Total 27.56 25.04 -0.035405 NaN NaN
2021-project11 0.43 0.48 0.050000 2021-project11 up
2021-project1 0.62 0.56 -0.060000 2021-project1 down
2021-project2 0.51 0.47 -0.040000 2021-project2 down
2021-porject3 0.37 0.34 -0.030000 2021-porject3 down
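If you want the original column order back (concat puts df1's columns first), one small follow-up, assuming the column names above, is to reindex the columns afterwards:
df = df[['value_a', 'value_b', 'name', 'up_or_down', 'difference']]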
In this problem, a .txt file is read using pandas. The number of genes needs to be calculated, and a histogram needs to be made for a specific sample and its level of interaction with each gene.
I have tried using .transpose() as well as value_counts() to access the appropriate information; however, because the data of interest is laid out in a row, and because of the way the table is set up, I cannot figure out how to get the appropriate histogram.
Use Pandas to read the file. Write a program to answer the following questions:
How many samples are in the data set?
How many genes are in the data set?
Which sample has the lowest average expression of genes?
Plot a histogram showing the distribution of the IL6 expression
across all samples.
Data:
protein M-12 M-24 M-36 M-48 M+ANDV-12 M+ANDV-24 M+ANDV-36 M+ANDV-48 M+SNV-12 M+SNV-24 M+SNV-36 M+SNV-48
ARG1 -11.67 -9.92 -4.37 -11.92 -3.62 -9.38 -11.54 -4.88 -3.59 -2.96 -4.95 -4.31
CASP3 0.05 -0.05 -0.18 0.02 0.04 0.14 -0.35 -0.41 0.24 0.23 -0.40 -0.36
CASP7 -1.40 -0.05 -0.78 -1.33 -0.43 0.63 -1.39 -0.95 0.81 1.45 0.09 0.11
CCL22 -0.96 1.47 0.37 -1.48 1.34 2.72 -11.12 -1.05 -0.63 1.42 0.30 0.12
CCL5 -5.59 -3.84 -4.64 -5.84 -5.19 -5.24 -5.45 -5.45 -2.86 -4.53 -4.80 -6.46
CCR7 -11.26 -9.50 -2.96 -11.50 -2.35 -2.31 -11.12 -3.66 -3.18 -1.31 -2.48 -2.84
CD14 2.85 4.14 3.87 4.33 1.16 3.28 3.68 3.74 1.20 2.80 3.23 2.79
CD200R1 -11.67 -9.92 -5.37 -11.92 -4.61 -9.38 -11.54 -11.54 -3.59 -2.96 -4.54 -4.89
CD274 -5.59 -9.92 -4.64 -5.84 -1.78 -3.30 -5.45 -5.45 -4.17 -10.61 -4.80 -4.48
CD80 -6.57 -9.50 -4.96 -6.82 -6.17 -4.28 -6.43 -6.43 -3.18 -5.51 -5.12 -4.16
CD86 0.14 0.94 0.87 1.12 -0.23 0.58 1.09 0.66 -0.15 0.42 0.74 0.49
CXCL10 -6.57 -2.85 -4.96 -6.82 -4.20 -2.31 -4.47 -4.47 -2.38 -2.74 -5.12 -4.67
CXCL11 -5.28 -9.50 -5.63 -11.50 -10.85 -8.97 -11.12 -11.12 -9.83 -10.20 -5.79 -6.14
IDO1 -5.02 -9.92 -4.37 -5.26 -4.61 -2.72 -4.88 -4.88 -2.60 -3.96 -4.54 -5.88
IFNA1 -11.67 -9.92 -5.37 -5.26 -11.27 -9.38 -11.54 -4.88 -3.59 -10.61 -6.52 -5.88
IFNB1 -11.67 -9.92 -6.35 -11.92 -11.27 -9.38 -11.54 -11.54 -10.25 -10.61 -12.19 -12.54
IFNG -2.09 -1.21 -1.66 -2.24 -2.75 -2.50 -2.83 -3.22 -2.48 -1.60 -2.13 -2.48
IFR3 -0.39 0.05 -0.21 0.15 -0.27 0.07 -0.01 -0.11 -0.28 0.28 0.04 -0.09
IL10 -1.53 -0.21 -0.51 0.45 -3.40 -1.00 -0.51 -0.04 -2.38 -1.55 -0.25 -0.72
IL12A -11.67 -9.92 -4.79 -11.92 -3.30 -3.71 -11.54 -11.54 -10.25 -3.38 -4.22 -4.09
IL15 -1.91 -2.53 -3.50 -3.85 -2.75 -9.38 -4.15 -4.15 -2.19 -2.09 -2.81 -3.16
IL1A -4.28 -2.53 -2.26 -3.39 -2.12 -0.51 -11.54 -2.67 -1.73 -1.75 -2.13 -1.84
IL1B -1.61 -2.53 -0.31 -0.16 0.77 -3.30 -1.95 -0.21 -1.73 -2.55 -0.65 -0.64
IL1RN 3.14 -0.40 -1.54 -3.53 3.95 0.76 0.15 -3.15 3.34 0.95 -1.23 -1.02
IL6 -4.60 -0.21 -1.82 -3.53 -1.25 0.76 -11.12 -2.47 -0.94 -0.60 -1.61 -1.74
IL8 5.43 5.04 4.57 4.22 5.67 5.06 4.30 4.53 4.84 4.53 4.25 3.79
IRF7 0.14 0.97 -0.13 -0.72 0.83 1.85 -0.19 -0.19 1.01 0.62 0.07 -0.03
ITGAM -1.68 0.91 0.28 -0.12 0.67 1.73 -0.30 -0.07 1.21 1.28 0.71 1.21
NFKB1 0.80 0.31 0.29 0.43 1.21 -0.74 0.39 0.02 0.15 -0.02 0.01 -0.09
NOS2 -11.26 -3.52 -4.50 -5.52 -4.87 -2.98 -5.14 -5.14 -3.85 -4.22 -5.79 -6.14
PPARG 0.68 0.23 0.02 -1.16 0.56 1.38 0.80 -0.95 1.17 1.04 1.09 0.94
TGFB1 3.99 3.21 2.41 2.62 4.05 3.48 2.87 2.15 3.68 2.97 2.46 2.31
TLR3 -3.61 -1.85 -1.72 -11.92 -2.40 -1.32 -11.54 -11.54 -0.57 0.09 -1.32 -1.60
TLR7 -3.80 -2.05 -1.64 -0.35 -6.17 -4.28 -2.47 -1.75 -3.18 -3.54 -1.86 -2.84
TNF 1.09 0.53 0.71 1.17 1.91 0.58 1.04 1.41 1.20 1.18 1.13 0.66
VEGFA -2.36 -2.85 -3.64 -3.53 -3.40 -4.28 -4.47 -4.47 -5.15 -5.51 -4.32 -4.67
df=pd.read_csv('../Data/virus_miniset0.txt', sep='\t')
len(df['Sample'])
df
Set the index in order to properly transpose:
In tabular data, the top row should give the name of each column.
In this data, the first header was named sample, with all the M-prefixed names being the samples.
sample was renamed to protein to properly identify the column.
Current Data:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df.set_index('protein', inplace=True)
Transpose:
df_sample = df.T
df_sample.reset_index(inplace=True)
df_sample.rename(columns={'index': 'sample'}, inplace=True)
df_sample.set_index('sample', inplace=True)
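A quick sanity check on the transposed frame (12 samples as rows, 36 proteins as columns):
df_sample.shape
>>> (12, 36)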
How many samples:
len(df_sample.index)
>>> 12
How many proteins / genes:
len(df_sample.columns)
>>> 36
Lowest average expression:
Find the mean and then find the min.
df_sample.mean().min() works, but doesn't include the protein name, just the value.
protein_avg = df_sample.mean()
protein_avg[protein_avg == df_sample.mean().min()]
>>> protein
IFNB1 -10.765
dtype: float64
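A more direct alternative, if you only need the name: Series.idxmin returns the index label of the minimum.
df_sample.mean().idxmin()
>>> 'IFNB1'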
The following boxplot of all genes confirms IFNB1 as the protein with the lowest average expression across samples, and shows IL8 as the protein with the highest average expression.
Boxplot:
Use seaborn to make your plots look nicer:
plt.figure(figsize=(12, 8))
g = sns.boxplot(data=df_sample)
for item in g.get_xticklabels():
    item.set_rotation(90)
plt.show()
Alternate Boxplot:
plt.figure(figsize=(8, 8))
sns.boxplot('IL6', data=df_sample, orient='v')
plt.show()
IL6 Histogram:
sns.distplot(df_sample.IL6)
plt.show()
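Note that distplot has since been deprecated in seaborn; in recent versions, something like this draws the equivalent plot:
sns.histplot(df_sample['IL6'], kde=True)
plt.show()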
Bonus Plot - Heatmap:
I thought you might like this
plt.figure(figsize=(20, 8))
sns.heatmap(df_sample, annot=True, annot_kws={"size": 7}, cmap='PiYG')
plt.show()
M-12 and M+SNV-48 are rendered at only half height in the plot. This will be resolved in the forthcoming matplotlib v3.1.2.
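In the meantime, a commonly used workaround is to reset the y-limits of the heatmap axes so the edge rows render at full height:
g = sns.heatmap(df_sample, annot=True, annot_kws={"size": 7}, cmap='PiYG')
g.set_ylim(len(df_sample), 0)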
I have a dataframe df1 where each column represents a time series of returns. I want to create a new dataframe df2 with columns that correspond to each of the columns in df1, where each column in df2 is defined to be the average of the top 5 most correlated columns in df1.
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
print(df1.head())
A B C D E F G H I J
0 -2.13 -1.27 -1.97 -2.26 -0.35 -0.03 0.32 0.35 0.72 0.77
1 -0.61 0.35 -0.35 -0.42 -0.91 -0.14 0.75 -1.50 0.61 0.40
2 -0.96 1.49 -0.35 -1.47 1.06 1.06 0.59 0.30 -0.77 0.83
3 1.49 0.26 -0.90 0.38 -0.52 0.05 0.95 -1.03 0.95 0.73
4 1.24 0.16 -1.34 0.16 1.26 0.78 1.34 -1.64 -0.20 0.13
I expect the head of the resulting dataframe rounded to 2 places to look like:
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
For each column in the correlation matrix, take the six largest values and ignore the first one (a column is 100% correlated with itself). Use a dictionary comprehension to do this for each column.
Use another dictionary comprehension to locate these columns in df1 and take their mean. Create a dataframe from the result, and reorder the columns to match those of df1 by appending [df1.columns].
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
>>> df2.head()
A B C D E F G H I J
0 -0.782 -0.698 -0.526 -0.452 -0.994 -0.102 -0.472 -0.856 -0.310 -0.638
1 -0.486 -0.106 -0.454 -0.032 -0.042 0.100 -0.258 0.108 -0.064 -0.102
2 0.026 0.132 0.544 0.330 -0.130 0.272 0.224 0.320 0.414 0.274
3 -0.224 0.128 0.186 0.582 0.626 0.242 0.344 0.506 0.318 0.224
4 -0.044 0.310 0.230 0.518 0.428 0.238 0.068 0.306 0.734 0.432
%%timeit
corr = df1.corr()
most_correlated_cols = {
col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
100 loops, best of 3: 10 ms per loop
%%timeit
# argsort and avg_of are defined under Solution below
corr = df1.corr()
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
100 loops, best of 3: 16 ms per loop
Setup
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
Solution
corr = df1.corr()
# I don't want a security's correlation with itself to be included.
# Because `corr` is symmetrical, I can assume that a series' name will be in its index.
def remove_self(x):
return x.loc[x.index != x.name]
# This uses `remove_self`, then sorts by correlation
# and returns the index.
def argsort(x):
return pd.Series(remove_self(x).sort_values(ascending=False).index)
# This reaches into `df` and gets all columns identified in x
# then takes the mean.
def avg_of(x, df):
return df.loc[:, x].mean(axis=1)
# Putting it all together.
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
print(df2.round(2).head())
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
1. Background
The .xls files I have contain parameters for multiple pollutants, measured in several ways at different sites.
I created a simplified dataframe below as an illustration:
Some declarations:
Column Site contains the monitoring sites' properties. In this case, sites S1 and S2 are the only two locations here.
Column Time contains the monitoring period for the different sites.
Species A & B represent two chemical pollutants that were detected.
Conc is one key parameter for each species (A & B), representing the concentration. Notice that the concentration of species A is measured twice, in parallel.
P and Q are two different analysis experiments. Since species A has two samples, it has P1, P2, P3 & Q1, Q2 as the analysis results respectively. Species B has only been analyzed by P, so P1, P2, P3 are its only parameters.
After reading some posts on manipulating pivot_table in pandas, I wanted to give it a try.
2. My target
I mocked up my target file structure manually in Excel, shown like this:
3. My work
df = pd.ExcelFile("./test_file.xls")
df = df.parse("Sheet1")
pd.pivot_table(df,index = ["Site","Time","Species"])
This is the result:
Update
What I'm trying to figure out is how to create the two columns P & Q with sub-columns below them.
I have re-uploaded my test file here; anyone interested can download it.
The P and Q tests are for each sample of species A respectively.
The Conc test are for them both.
Any advice would be appreciated!
IIUC
You want the same dataframe, but with a better column index.
To create the first level:
level0 = df.columns.str.extract(r'([^\d]*)', expand=False)
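For illustration, assuming the flat columns are ['Conc', 'P1', 'P2', 'P3', 'Q1', 'Q2'] as in the output below, the regex captures everything before the first digit of each name:
print(level0)
>>> Index(['Conc', 'P', 'P', 'P', 'Q', 'Q'], dtype='object')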
then assign a multiindex to the columns attribute.
df.columns = pd.MultiIndex.from_arrays([level0, df.columns])
Looks like:
print(df)
Conc P Q
Conc P1 P2 P3 Q1 Q2
Site Time Species
S1 20141222 A 0.79 0.02 0.62 1.05 0.01 1.73
20141228 A 0.13 0.01 0.79 0.44 0.01 1.72
20150103 B 0.48 0.03 1.39 0.84 NaN NaN
20150104 A 0.36 0.02 1.13 0.31 0.01 0.94
20150109 A 0.14 0.01 0.64 0.35 0.00 1.00
20150114 B 0.47 0.08 1.16 1.40 NaN NaN
20150115 A 0.62 0.02 0.90 0.95 0.01 2.63
20150116 A 0.71 0.03 1.72 1.71 0.01 2.53
20150121 B 0.61 0.03 0.67 0.87 NaN NaN
S2 20141222 A 0.23 0.01 0.66 0.44 0.01 1.49
20141228 A 0.42 0.06 0.99 1.56 0.00 2.18
20150103 B 0.09 0.01 0.56 0.12 NaN NaN
20150104 A 0.18 0.01 0.56 0.36 0.00 0.67
20150109 A 0.50 0.03 0.74 0.71 0.00 1.11
20150114 B 0.64 0.06 1.76 0.92 NaN NaN
20150115 A 0.58 0.05 0.77 0.95 0.01 1.54
20150116 A 0.93 0.04 1.33 0.69 0.00 0.82
20150121 B 0.33 0.09 1.33 0.76 NaN NaN
I have the following df in pandas:
df:
DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
...
I want to group by stock, sort by date, and then, for specified columns (in this case DATA1 and DATA3), get the sum of the last four entries (TTM data).
The output would look like this:
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
01/01/12 ABC 0.40 0.88 0.22 NaN NaN
04/01/12 ABC 0.50 0.49 0.13 NaN NaN
07/01/12 ABC 0.85 0.36 0.83 NaN NaN
10/01/12 ABC 0.28 0.12 0.39 2.03 1.56
01/01/13 ABC 0.86 0.87 0.58 2.49 1.92
04/01/13 ABC 0.95 0.39 0.87 2.94 2.66
07/01/13 ABC 0.60 0.25 0.56 2.69 2.39
10/01/13 ABC 0.15 0.28 0.69 2.55 2.70
01/01/11 XYZ 0.94 0.40 0.50 NaN NaN
04/01/11 XYZ 0.65 0.19 0.81 NaN NaN
07/01/11 XYZ 0.89 0.59 0.69 NaN NaN
10/01/11 XYZ 0.12 0.09 0.18 2.59 2.18
01/01/12 XYZ 0.25 0.94 0.55 1.90 2.23
04/01/12 XYZ 0.07 0.22 0.67 1.33 2.09
07/01/12 XYZ 0.46 0.08 0.54 0.89 1.94
10/01/12 XYZ 0.04 0.03 0.94 0.82 2.70
...
My approach so far has been to sort by date, then group, then iterate through each group; if there are 3 events older than the current event, I sum them. Also, I want to check whether the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.
Added: As a clarification for the 1 year part, let's say the last four dates go 1/1/1993, 4/1/12, 7/1/12, 10/1/12 (a data error). I wouldn't want to sum those four; I would want that entry to say NaN.
For this I think you can use transform and rolling_sum. Starting from your dataframe, I might do something like:
>>> df["DATE"] = pd.to_datetime(df["DATE"]) # switch to datetime to ease sorting
>>> df = df.sort(["STOCK", "DATE"])
>>> rsum_columns = "DATA1", "DATA3"
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: pd.rolling_sum(x, 4))
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
0 2012-01-01 00:00:00 ABC 0.40 0.88 0.22 NaN NaN
1 2012-04-01 00:00:00 ABC 0.50 0.49 0.13 NaN NaN
2 2012-07-01 00:00:00 ABC 0.85 0.36 0.83 NaN NaN
3 2012-10-01 00:00:00 ABC 0.28 0.12 0.39 2.03 1.57
4 2013-01-01 00:00:00 ABC 0.86 0.87 0.58 2.49 1.93
5 2013-04-01 00:00:00 ABC 0.95 0.39 0.87 2.94 2.67
6 2013-07-01 00:00:00 ABC 0.60 0.25 0.56 2.69 2.40
7 2013-10-01 00:00:00 ABC 0.15 0.28 0.69 2.56 2.70
8 2011-01-01 00:00:00 XYZ 0.94 0.40 0.50 NaN NaN
9 2011-04-01 00:00:00 XYZ 0.65 0.19 0.81 NaN NaN
10 2011-07-01 00:00:00 XYZ 0.89 0.59 0.69 NaN NaN
11 2011-10-01 00:00:00 XYZ 0.12 0.09 0.18 2.60 2.18
12 2012-01-01 00:00:00 XYZ 0.25 0.94 0.55 1.91 2.23
13 2012-04-01 00:00:00 XYZ 0.07 0.22 0.67 1.33 2.09
14 2012-07-01 00:00:00 XYZ 0.46 0.08 0.54 0.90 1.94
15 2012-10-01 00:00:00 XYZ 0.04 0.03 0.94 0.82 2.70
[16 rows x 7 columns]
I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.
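A side note for readers on current pandas: df.sort and pd.rolling_sum have since been removed from the API. A roughly equivalent version today would be:
df["DATE"] = pd.to_datetime(df["DATE"])
df = df.sort_values(["STOCK", "DATE"])
# rolling(4).sum() replaces pd.rolling_sum(x, 4), applied per group and per column
new_columns = df.groupby("STOCK")[["DATA1", "DATA3"]].transform(lambda x: x.rolling(4).sum())
df[new_columns.columns + "_TTM"] = new_columns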