I have a dataframe that looks like the following:
Type Month Value
A 1 0.29
A 2 0.90
A 3 0.44
A 4 0.43
B 1 0.29
B 2 0.50
B 3 0.14
B 4 0.07
I want to change the dataframe to the following format:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
Is this possible?
Use set_index + unstack:
df.set_index(['Month', 'Type']).Value.unstack()
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
To match your exact output, drop the index name with rename_axis:
df.set_index(['Month', 'Type']).Value.unstack().rename_axis(None)
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
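For completeness, a minimal runnable sketch that rebuilds the sample frame and applies the same reshape (the frame construction is assumed from the data shown above):

import pandas as pd

df = pd.DataFrame({
    'Type':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Month': [1, 2, 3, 4, 1, 2, 3, 4],
    'Value': [0.29, 0.90, 0.44, 0.43, 0.29, 0.50, 0.14, 0.07],
})

# Move Month and Type into the index, then pivot Type out into the columns
wide = df.set_index(['Month', 'Type'])['Value'].unstack()
print(wide)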
Pivot solution:
In [70]: df.pivot(index='Month', columns='Type', values='Value')
Out[70]:
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
In [71]: df.pivot(index='Month', columns='Type', values='Value').rename_axis(None)
Out[71]:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
You have a long-format table that you want to transform into a wide format.
This is natively handled in pandas:
df.pivot(index='Month', columns='Type', values='Value')
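One caveat: pivot requires each (Month, Type) pair to be unique and raises a ValueError if the long table contains duplicates. In that case pivot_table can be used instead, since it aggregates the duplicate entries (mean by default); a short sketch under that assumption:

# pivot_table tolerates repeated (Month, Type) pairs by aggregating them;
# aggfunc='mean' is the default and is shown explicitly here.
df.pivot_table(index='Month', columns='Type', values='Value', aggfunc='mean')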
I have the following data frame:
ID  value  freq
A   a      0.1
A   b      0.12
A   c      0.19
B   a      0.15
B   b      0.2
B   c      0.09
C   a      0.39
C   b      0.15
C   c      0.01
and I would like to get the following:
ID  freq_a  freq_b  freq_c
A   0.1     0.12    0.19
B   0.15    0.2     0.09
C   0.39    0.15    0.01
Any ideas how to easily do this?
Using pivot:
df.pivot(index='ID', columns='value', values='freq').add_prefix('freq_').reset_index()
Output:
value ID freq_a freq_b freq_c
0 A 0.10 0.12 0.19
1 B 0.15 0.20 0.09
2 C 0.39 0.15 0.01
Use pivot_table:
out = df.pivot_table('freq', 'ID', 'value').add_prefix('freq_') \
.rename_axis(columns=None).reset_index()
print(out)
# Output
ID freq_a freq_b freq_c
0 A 0.10 0.12 0.19
1 B 0.15 0.20 0.09
2 C 0.39 0.15 0.01
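For reference, a self-contained sketch that rebuilds the frame from the question and chains the pivot steps shown above (the frame construction is an assumption based on the table in the question):

import pandas as pd

df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': ['a', 'b', 'c'] * 3,
    'freq':  [0.1, 0.12, 0.19, 0.15, 0.2, 0.09, 0.39, 0.15, 0.01],
})

out = (df.pivot(index='ID', columns='value', values='freq')
         .add_prefix('freq_')
         .rename_axis(columns=None)
         .reset_index())
print(out)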
I have an initial dataset, data, grouped by id:
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
3 0.12 1.00
3 0.34 0.71
3 0.64 0.43
3 0.89 0.14
4 0.32 1.00
4 0.33 0.66
4 0.45 0.33
4 0.76 0.00
I am trying to predict the maximum y based on variable x while considering the groups. First, I split into training and test sets based on the groups (train_test_split):
data_train
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
and
data_test
id x y
3 0.12 1.00
3 0.34 0.66
3 0.64 0.33
3 0.89 0.00
4 0.33 1.00
4 0.32 0.66
4 0.45 0.33
4 0.76 0.00
After training the model and applying the model on data_test, I get:
y_hat
0.65
0.33
0.13
0.00
0.33
0.34
0.21
0.08
I am trying to transform y_hat so that the maximum in each of the initial groups becomes 1.00 and every other value becomes 0.00:
y_hat_transform
1.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
How would I do that? Note that the groups can be of varying sizes.
Edit: To simplify the problem, I have id_test and y_hat, where
id_test
3
3
3
3
4
4
4
4
and I am trying to get y_hat_transform.
   id     y
0   3  0.65
1   3  0.33
2   3  0.13
3   3  0.00
4   4  0.33
5   4  0.34
6   4  0.21
7   4  0.08
# Flag the maximum row of each group with 1.0 and every other row with 0.0.
# groupby('id')['y'].transform('max') returns a Series the same length as the
# frame, holding each group's maximum repeated on every row of that group.
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
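A self-contained sketch of the same idea under the simplified setup from the edit (the frame construction is an assumption based on the data shown; note that a tie for the maximum would mark more than one row per group with 1.0):

import pandas as pd

id_y = pd.DataFrame({
    'id': [3, 3, 3, 3, 4, 4, 4, 4],
    'y':  [0.65, 0.33, 0.13, 0.00, 0.33, 0.34, 0.21, 0.08],
})

# Compare every value with its group's maximum; the comparison is True only
# on the max row of each group, and casting to float yields 1.0 / 0.0.
group_max = id_y.groupby('id')['y'].transform('max')
id_y['y_transform'] = (id_y['y'] == group_max).astype(float)
print(id_y)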
Let's say in the dataframe df there is:
a b c d
ana 31% 26% 29%
bob 52% 45% 9%
cal 11% 6% 23%
dan 29% 12% 8%
where all data types under a, b, c and d are objects. I want to convert b, c and d to their decimal forms with:
df.columns = df.columns.str.rstrip('%').astype('float') / 100.0
but I don't know how to exclude column a.
Use update with to_numeric:
df.update(df.apply(lambda x : pd.to_numeric(x.str.rstrip('%'),errors='coerce'))/100)
df
Out[128]:
a b c d
0 ana 0.31 0.26 0.29
1 bob 0.52 0.45 0.09
2 cal 0.11 0.06 0.23
3 dan 0.29 0.12 0.08
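A self-contained version of the same approach (the frame construction is an assumption based on the table in the question):

import pandas as pd

df = pd.DataFrame({
    'a': ['ana', 'bob', 'cal', 'dan'],
    'b': ['31%', '52%', '11%', '29%'],
    'c': ['26%', '45%', '6%', '12%'],
    'd': ['29%', '9%', '23%', '8%'],
})

# to_numeric(errors='coerce') turns the non-numeric column 'a' into NaN,
# and DataFrame.update skips NaN, so only b, c and d are overwritten.
df.update(df.apply(lambda x: pd.to_numeric(x.str.rstrip('%'), errors='coerce')) / 100)
print(df)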
Use Index.drop to get all columns except a, then DataFrame.replace to strip the %, convert to floats and divide by 100:
cols = df.columns.drop('a')
df[cols] = df[cols].replace('%', '', regex=True).astype('float') / 100.0
print (df)
a b c d
0 ana 0.31 0.26 0.29
1 bob 0.52 0.45 0.09
2 cal 0.11 0.06 0.23
3 dan 0.29 0.12 0.08
Or you can convert the first column to the index with DataFrame.set_index, so that all columns except a are processed:
df = df.set_index('a').replace('%', '', regex=True).astype('float') / 100.0
print (df)
b c d
a
ana 0.31 0.26 0.29
bob 0.52 0.45 0.09
cal 0.11 0.06 0.23
dan 0.29 0.12 0.08
I have a dataframe df1 where each column represents a time series of returns. I want to create a new dataframe df2 with one column corresponding to each column in df1, where each column of df2 is defined as the average of the 5 columns in df1 most correlated with it.
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
print(df1.head())
A B C D E F G H I J
0 -2.13 -1.27 -1.97 -2.26 -0.35 -0.03 0.32 0.35 0.72 0.77
1 -0.61 0.35 -0.35 -0.42 -0.91 -0.14 0.75 -1.50 0.61 0.40
2 -0.96 1.49 -0.35 -1.47 1.06 1.06 0.59 0.30 -0.77 0.83
3 1.49 0.26 -0.90 0.38 -0.52 0.05 0.95 -1.03 0.95 0.73
4 1.24 0.16 -1.34 0.16 1.26 0.78 1.34 -1.64 -0.20 0.13
I expect the head of the resulting dataframe rounded to 2 places to look like:
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
For each column in the correlation matrix, take the six largest values and ignore the first one (a column is 100% correlated with itself). Use a dictionary comprehension to do this for each column.
Use another dictionary comprehension to locate these columns in df1 and take their mean. Create a dataframe from the result, and reorder the columns to match those of df1 by appending [df1.columns].
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
>>> df2.head()
A B C D E F G H I J
0 -0.782 -0.698 -0.526 -0.452 -0.994 -0.102 -0.472 -0.856 -0.310 -0.638
1 -0.486 -0.106 -0.454 -0.032 -0.042 0.100 -0.258 0.108 -0.064 -0.102
2 0.026 0.132 0.544 0.330 -0.130 0.272 0.224 0.320 0.414 0.274
3 -0.224 0.128 0.186 0.582 0.626 0.242 0.344 0.506 0.318 0.224
4 -0.044 0.310 0.230 0.518 0.428 0.238 0.068 0.306 0.734 0.432
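To sanity-check which columns feed each averaged series, the intermediate dictionary built above can be inspected directly; each entry holds the five column labels most correlated with the key (the key itself having been dropped by the [1:] slice):

# Show the five most-correlated column labels backing each output column
for col, idx in most_correlated_cols.items():
    print(col, list(idx))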
%%timeit
corr = df1.corr()
most_correlated_cols = {
col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
100 loops, best of 3: 10 ms per loop
%%timeit
corr = df1.corr()
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
100 loops, best of 3: 16 ms per loop
Setup
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
Solution
corr = df1.corr()
# I don't want a security's correlation with itself to be included.
# Because `corr` is symmetrical, I can assume that a series' name will be in its index.
def remove_self(x):
return x.loc[x.index != x.name]
# This uses `remove_self`, then sorts by correlation
# and returns the index.
def argsort(x):
return pd.Series(remove_self(x).sort_values(ascending=False).index)
# This reaches into `df` and gets all columns identified in x
# then takes the mean.
def avg_of(x, df):
return df.loc[:, x].mean(axis=1)
# Putting it all together.
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
print(df2.round(2).head())
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
I have the following df in pandas:
df:
DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
...
I want to group by stock, sort by date, and then, for the specified columns (in this case DATA1 and DATA3), sum the last four values (TTM data).
The output would look like this:
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
01/01/12 ABC 0.40 0.88 0.22 NaN NaN
04/01/12 ABC 0.50 0.49 0.13 NaN NaN
07/01/12 ABC 0.85 0.36 0.83 NaN NaN
10/01/12 ABC 0.28 0.12 0.39 2.03 1.56
01/01/13 ABC 0.86 0.87 0.58 2.49 1.92
04/01/13 ABC 0.95 0.39 0.87 2.94 2.66
07/01/13 ABC 0.60 0.25 0.56 2.69 2.39
10/01/13 ABC 0.15 0.28 0.69 2.55 2.70
01/01/11 XYZ 0.94 0.40 0.50 NaN NaN
04/01/11 XYZ 0.65 0.19 0.81 NaN NaN
07/01/11 XYZ 0.89 0.59 0.69 NaN NaN
10/01/11 XYZ 0.12 0.09 0.18 2.59 2.18
01/01/12 XYZ 0.25 0.94 0.55 1.90 2.23
04/01/12 XYZ 0.07 0.22 0.67 1.33 2.09
07/01/12 XYZ 0.46 0.08 0.54 0.89 1.94
10/01/12 XYZ 0.04 0.03 0.94 0.82 2.70
...
My approach so far has been to sort by date, then group, then iterate through each group and, if there are 3 events older than the current event, sum them. Also, I want to check to see if the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.
Added: As a clarification for the 1 year part, let's say you take the last four dates and it goes 1/1/1993, 4/1/12, 7/1/12, 10/1/12 -- a data error. I wouldn't want to sum those four. I would want that one to say NaN.
For this I think you can use transform and rolling_sum. Starting from your dataframe, I might do something like:
>>> df["DATE"] = pd.to_datetime(df["DATE"]) # switch to datetime to ease sorting
>>> df = df.sort(["STOCK", "DATE"])
>>> rsum_columns = "DATA1", "DATA3"
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: pd.rolling_sum(x, 4))
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
0 2012-01-01 00:00:00 ABC 0.40 0.88 0.22 NaN NaN
1 2012-04-01 00:00:00 ABC 0.50 0.49 0.13 NaN NaN
2 2012-07-01 00:00:00 ABC 0.85 0.36 0.83 NaN NaN
3 2012-10-01 00:00:00 ABC 0.28 0.12 0.39 2.03 1.57
4 2013-01-01 00:00:00 ABC 0.86 0.87 0.58 2.49 1.93
5 2013-04-01 00:00:00 ABC 0.95 0.39 0.87 2.94 2.67
6 2013-07-01 00:00:00 ABC 0.60 0.25 0.56 2.69 2.40
7 2013-10-01 00:00:00 ABC 0.15 0.28 0.69 2.56 2.70
8 2011-01-01 00:00:00 XYZ 0.94 0.40 0.50 NaN NaN
9 2011-04-01 00:00:00 XYZ 0.65 0.19 0.81 NaN NaN
10 2011-07-01 00:00:00 XYZ 0.89 0.59 0.69 NaN NaN
11 2011-10-01 00:00:00 XYZ 0.12 0.09 0.18 2.60 2.18
12 2012-01-01 00:00:00 XYZ 0.25 0.94 0.55 1.91 2.23
13 2012-04-01 00:00:00 XYZ 0.07 0.22 0.67 1.33 2.09
14 2012-07-01 00:00:00 XYZ 0.46 0.08 0.54 0.90 1.94
15 2012-10-01 00:00:00 XYZ 0.04 0.03 0.94 0.82 2.70
[16 rows x 7 columns]
I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.
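On current pandas versions, DataFrame.sort and pd.rolling_sum have been removed in favour of sort_values and the .rolling() accessor. A minimal sketch of the same approach with the modern API, starting from the question's df (the date format string is an assumption based on the sample data; like the answer above, it leaves the within-one-year check aside):

import pandas as pd

# Parse dates, sort, then take a 4-period rolling sum of the chosen
# columns within each stock group.
df["DATE"] = pd.to_datetime(df["DATE"], format="%m/%d/%y")
df = df.sort_values(["STOCK", "DATE"])

ttm_cols = ["DATA1", "DATA3"]
rolled = df.groupby("STOCK")[ttm_cols].transform(lambda x: x.rolling(4).sum())
for c in ttm_cols:
    df[c + "_TTM"] = rolled[c]
print(df)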